Thank you for the responses!
I understand the whole thing a bit better now.
By tweaking some parameters, I actually get a speed close to what you reported (multiplied by 8, since I am using 4 lanes and Gen2).
There is still one problem, though:
I can get this speed when doing either a DMA read or a DMA write, but I don't understand how to do both at the same time.
I am using the chaining DMA example, and thus need to fill the descriptor table.
I fill up the descriptors (endpoint address, root complex address, length of the data), then write the number of descriptors into the write header to launch a write, or the read header to launch a read.
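In simplified form, the sequence looks roughly like the sketch below. The struct layout, the field widths, MAX_DESC, and all function and register names are placeholders made up for illustration. They are not the actual descriptor format or register map of the chaining DMA example, which should be taken from its documentation.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical descriptor layout (illustration only). */
typedef struct {
    uint32_t length;    /* length of the data */
    uint32_t ep_addr;   /* endpoint address */
    uint64_t rc_addr;   /* root complex address */
} dma_desc;

#define MAX_DESC 8      /* assumed table size, not from the real design */

/* One descriptor table, shared by both directions in this sketch. */
typedef struct {
    dma_desc desc[MAX_DESC];
} desc_table;

/* Hypothetical stand-ins for the write header and read header. */
typedef struct {
    volatile uint32_t write_header_count;
    volatile uint32_t read_header_count;
} dma_headers;

typedef enum { DMA_WRITE, DMA_READ } dma_dir;

/* Fill descriptor i; returns 0 on success, -1 if i is out of range. */
int fill_descriptor(desc_table *t, size_t i,
                    uint32_t ep_addr, uint64_t rc_addr, uint32_t length)
{
    if (i >= MAX_DESC)
        return -1;
    t->desc[i].ep_addr = ep_addr;
    t->desc[i].rc_addr = rc_addr;
    t->desc[i].length  = length;
    return 0;
}

/* Launch a transfer by writing the descriptor count into the header
   for the chosen direction; returns -1 for an invalid count. */
int launch(dma_headers *h, dma_dir dir, uint32_t num_desc)
{
    if (num_desc == 0 || num_desc > MAX_DESC)
        return -1;
    if (dir == DMA_WRITE)
        h->write_header_count = num_desc;
    else
        h->read_header_count = num_desc;
    return 0;
}
```

The sketch deliberately uses a single table for both directions, since that is the situation I am asking about: once launch() has been called for one direction, it is unclear to me whether the same descriptors can safely be rewritten for the other.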
From what I understand, the DMA write and DMA read modules share the same descriptor table, so how can I trigger them at the same time? How do they know which descriptor is for which module? Or is a local copy of the descriptor table made when one of them is launched, so that I can overwrite the previous descriptors even before the operation is over?