Have you verified the lengths of the PCIe TLPs?
If the DMA engine isn't generating long enough TLPs then you'll get low throughput, but I suspect that even 60MB/s requires reasonably long TLPs.
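As a rough illustration of why TLP length matters, here is a small sketch of link efficiency against payload size. The numbers are assumptions, not measurements: a Gen1 x1 link at about 250MB/s after 8b/10b encoding, and about 20 bytes of per-TLP overhead (framing, sequence number, 3DW header, LCRC, no ECRC). It ignores latency and ACK/flow-control DLLP traffic, so real figures will be lower.

```python
# Rough model: every TLP carries a fixed overhead, so short payloads
# waste a large fraction of the link. All constants are assumptions.
TLP_OVERHEAD_BYTES = 20   # assumed: framing + sequence + 3DW header + LCRC
LINK_MB_PER_S = 250.0     # assumed: Gen1 x1 after 8b/10b encoding


def write_throughput_mb_s(payload_bytes,
                          overhead_bytes=TLP_OVERHEAD_BYTES,
                          link_mb_s=LINK_MB_PER_S):
    """Upper bound on payload throughput for back-to-back TLPs."""
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    return link_mb_s * payload_bytes / (payload_bytes + overhead_bytes)


if __name__ == "__main__":
    for payload in (4, 16, 64, 128, 256):
        print(f"{payload:4d} byte payload: "
              f"{write_throughput_mb_s(payload):6.1f} MB/s at best")
```

Swap in the link rate and header size of your own setup; the point is only that the bound rises steeply with payload length.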
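The other difference between read and write might be due to the extra pipelining that can be done for writes. Writes are posted, so the initiator can keep sending them without waiting for a reply, whereas each read has to wait for its completion. This will be significant if the initiator only has 1 read TLP outstanding at any time.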
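To put numbers on the outstanding-read point, here is a minimal sketch of a latency-bound read model. The round-trip time and link rate are made-up example values, not anything measured on your hardware, and the function name is just for illustration.

```python
# Latency-bound read model: with N reads in flight, at most
# N * payload bytes arrive per round trip, capped by the link rate.
def read_throughput_mb_s(payload_bytes, outstanding, round_trip_us,
                         link_mb_s=250.0):
    """Estimate read throughput in MB/s (1 byte/us == 1 MB/s).

    payload_bytes: bytes returned per read request (assumed fixed)
    outstanding:   number of read requests in flight at once
    round_trip_us: assumed request-to-completion latency in microseconds
    link_mb_s:     assumed usable link rate (Gen1 x1 here)
    """
    if payload_bytes <= 0 or outstanding <= 0 or round_trip_us <= 0:
        raise ValueError("all arguments must be positive")
    latency_bound = outstanding * payload_bytes / round_trip_us
    return min(latency_bound, link_mb_s)


if __name__ == "__main__":
    # Example values only: 64 byte reads, 1us round trip.
    for n in (1, 2, 4, 8):
        print(f"{n} outstanding: {read_throughput_mb_s(64, n, 1.0):.0f} MB/s")
```

With a single read outstanding the result is set entirely by payload size and round-trip latency, which is why longer read requests or more reads in flight are the usual fixes.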
I've measured throughput from a small PPC (root port and initiator) to the FPGA (slave). I got similar values for read and write, but only 20ns/byte (about 50MB/s) to internal memory (SDRAM is a lot slower). That is timed from userspace, though, so it includes the copyout.
For our purposes that was enough, once I'd managed to get the PPC's PCIe DMA working.