Your setup and terminology are confusing me a bit. Reads and writes are normally described from the perspective of the host (root port), but I can't tell whether you're asking about that or about data flow from the RP to the EP (a read from the EP's perspective, although your pseudo-code appears to send data the other way). I understand that the reads are initiated (requested) by the EP. Can you clarify your setup some more?
On the FPGA, what kind of design are you using? Is it one of the Altera-provided designs, or is it custom? Your FPGA setup will have a significant impact on TLP size. For example, if you're using Qsys with the Altera PCIe core and your DMA is not sending bursts (or using a bursting interface) to the PCIe core, your TLPs are going to be sub-optimal. Is your role to work on the FPGA design or just the software?
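To give a rough feel for why TLP size matters so much, here is a minimal back-of-the-envelope sketch. It assumes about 20 bytes of per-TLP overhead on a Gen1/Gen2 link (framing, sequence number, a 3DW header and LCRC); the exact figure depends on header size, ECRC and link generation, and the function name and payload sizes are just illustrative, so treat the numbers as ballpark only.

```python
# Rough estimate of how much of the link carries payload for a given TLP size.
# ASSUMPTION: ~20 bytes of overhead per TLP (Gen1/Gen2 framing + sequence
# number + 3DW header + LCRC). A 4DW header or ECRC would add 4 bytes each.
def tlp_efficiency(payload_bytes, overhead_bytes=20):
    """Fraction of the bytes on the wire that are actual payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


# Hypothetical payload sizes: single-dword writes up to a 256-byte burst.
for payload in (4, 8, 64, 128, 256):
    print(f"{payload:4d} B payload -> {tlp_efficiency(payload):.1%} efficient")
```

The point is only the trend: non-bursting DMA that produces one or two dwords per TLP spends most of the link on overhead, while bursts that fill the negotiated max payload get close to the link rate.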
An FPGA sim should show TLP size pretty quickly for data flow in the EP -> RP direction, but not the other way (since that depends on what the host is doing). If your problem is in the RP -> EP direction, then DSL is correct, and putting SignalTap on your memory interface is probably a good way to figure out the TLP size (if you only get a small number of bytes grouped in a burst, you're probably seeing very small TLPs). If your problem is in the other direction, this can help, but I'd recommend a sim instead. I normally use DrivExpress for this sort of sim because it's pretty easy to set up the DMA transactions and host memory, and you can then see everything that's going on in the FPGA and on the PCIe bus. You could adapt their DMA examples to match exactly what your driver is doing. I think it would be free in your case, and better than trying to modify the Altera BFM.
What PCIe core are you using in the EP? Two outstanding requests isn't much. Altera HIP cores usually allow up to 32 tags, so the limit of two could be your problem.
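As a rough illustration of why the number of outstanding requests matters: read throughput is bounded by how much data can be in flight per round trip. The sketch below assumes 256-byte read requests, a 1 µs request-to-completion round trip, and a 2 GB/s link (roughly Gen2 x4 before packet overhead). All three numbers and the function name are hypothetical placeholders, so plug in your own measurements.

```python
# Upper bound on read throughput when only a few read requests can be in flight.
# ASSUMPTIONS (hypothetical, replace with measured values):
#   - each read request asks for 256 bytes
#   - round trip from request to last completion is about 1 microsecond
#   - the link itself tops out around 2e9 bytes/s
def max_read_throughput(outstanding, request_bytes, round_trip_s, link_bytes_per_s):
    """Bytes per second, limited by in-flight data or by the link, whichever is lower."""
    latency_bound = outstanding * request_bytes / round_trip_s
    return min(latency_bound, link_bytes_per_s)


for tags in (2, 8, 32):
    bw = max_read_throughput(tags, request_bytes=256, round_trip_s=1e-6,
                             link_bytes_per_s=2e9)
    print(f"{tags:2d} outstanding reads -> at most {bw / 1e6:.0f} MB/s")
```

With these placeholder numbers, two outstanding reads are latency-bound well below the link rate, while a larger tag count lets the link itself become the limit.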
NIC (GigE) throughput is a pretty different beast. Are you also using the e1000e driver? Those are usually pretty well-tuned hardware/driver pairs.