Only by getting the host to request more bytes in each PCIe request.
In my case the host is a small PowerPC running Linux, and I was able to write a device driver for the DMA engine embedded in the PPC's PCIe block.
I'm not sure what you can do with an Atom and Windows 7.
You won't be able to affect the latency, but you can improve the throughput: reading into a wider register will probably still be a single PCIe request. You might be able to use one of the SSE integer SIMD registers (XMM0 to XMM7 on 32-bit x86, each 128 bits wide) to read 16 bytes in one transfer.
(Do save/restore the register though.)
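
A rough sketch of the wide-register read in C, using the SSE2 intrinsics from emmintrin.h. The function name read_block_128 is made up for illustration, and src is assumed to be a 16-byte-aligned pointer standing in for wherever your driver has mapped the device memory. Whether each 128-bit load really goes out as one PCIe read depends on how the region is mapped and on the chipset, so treat this as something to measure, not a guarantee.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128i, _mm_load_si128, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy len bytes from src to dst, one 128-bit load
 * per 16 bytes. Both pointers must be 16-byte aligned and len must be a
 * multiple of 16. src stands in for mapped device memory (an assumption). */
static void read_block_128(void *dst, const volatile void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    /* The cast drops volatile so the pointer matches the intrinsic's type. */
    const __m128i *s = (const __m128i *)src;
    size_t i;

    assert(((uintptr_t)dst % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);
    assert(len % 16 == 0);

    for (i = 0; i < len / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);  /* aligned 16-byte load */
        _mm_store_si128(&d[i], v);          /* aligned 16-byte store */
    }
}
```

If every read costs about the same round trip, 16-byte reads should move roughly four times the data of 4-byte reads in the same time. In a kernel driver the save/restore of the SIMD state has to wrap any call like this.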