Put the DMA descriptors into on-FPGA memory.
Make sure the DMA controller is doing Avalon burst transfers (probably 128 bytes, preferably 64 bits wide) into the PCIe TXS port.
You probably don't want burst transfers into the SDRAM - just pipelined ones.
Beware of large FIFOs in the DMA controller - you don't need them.
The PCIe TXS block seems to complete write transfers quickly. I'm seeing reads take 128 clocks (of the 62.5MHz app clock) plus a few clocks that depend on the transfer size.
The same is true of host-initiated transfers: writes are 'posted' and happen more or less back to back, but there is a 128 clock delay between reads.
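
To put that read latency in perspective, here is a rough back-of-envelope sketch in Python. It assumes the 'few clocks for the transfer size' are one app clock per 64-bit beat and that only one read is outstanding at a time; both are my assumptions for illustration, not measured behaviour, and the function name is made up.

```python
# Rough estimate of DMA read throughput with one read outstanding at a time.
APP_CLOCK_HZ = 62.5e6        # app clock quoted above
READ_LATENCY_CLOCKS = 128    # fixed cost observed per read
BUS_BYTES = 8                # assumption: 64-bit wide data path


def read_throughput(burst_bytes=128):
    """Bytes per second when each read waits for the previous one to finish."""
    beats = burst_bytes // BUS_BYTES          # assumption: one clock per beat
    clocks = READ_LATENCY_CLOCKS + beats      # latency plus data beats
    return burst_bytes * APP_CLOCK_HZ / clocks


print(f"128-byte bursts: {read_throughput(128) / 1e6:.1f} MB/s")  # 55.6 MB/s
print(f"single 8-byte reads: {read_throughput(8) / 1e6:.1f} MB/s")  # 3.9 MB/s
```

Under those assumptions the fixed 128 clocks dominate, which is why long bursts matter so much for reads.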
The only way to speed up DMA reads from host memory (once you are generating Avalon bursts and thus long PCIe TLPs) is to generate concurrent read requests from multiple Avalon masters.
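
A very idealised sketch of why concurrent masters help, again in Python. It assumes the 128 clock latencies of different masters overlap completely, that data beats share one 64-bit path at 62.5MHz, and that nothing else (PCIe link rate, read tags, host memory) limits things first. All of that is assumption; the real ceiling may well be lower, and the function name is hypothetical.

```python
def concurrent_read_throughput(masters, burst_bytes=128, latency_clocks=128,
                               bus_bytes=8, clock_hz=62.5e6):
    """Idealised bytes per second with `masters` reads in flight at once."""
    beats = burst_bytes // bus_bytes
    # Each master on its own is limited by latency plus its data beats.
    per_master = burst_bytes * clock_hz / (latency_clocks + beats)
    # All masters share one data path, so that caps the total.
    bus_limit = bus_bytes * clock_hz
    return min(masters * per_master, bus_limit)


for n in (1, 2, 4, 16):
    print(n, "masters:", round(concurrent_read_throughput(n) / 1e6, 1), "MB/s")
```

In this model throughput scales roughly linearly with the number of masters until the shared data path saturates.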