Forum Discussion
Altera_Forum
Honored Contributor
13 years ago

One more note.
--- Quote Start --- That is a non sequitur. A modern CPU will execute other instructions following a PCIe memory read, provided they aren't dependent on the value being read. Memory reads can be re-ordered, so other locations can be read. Any PCIe transfers must be sequenced, but that doesn't affect other operations. --- Quote End ---

The CPU doesn't know which memory locations depend on the PCI memory read data, i.e. which ones the device updated per DMA write just before the driver issued the MMIO read. For such dependent data, the old memory content must not be used. To keep this race from biting you, you need a solid grasp of memory barrier enforcement when writing drivers. Or you run on a CPU architecture that simply does not do the relevant reordering (http://en.wikipedia.org/wiki/memory_ordering). For example, according to the table in that link, AMD64 only does »Stores reordered after Loads«, giving very strict and consistent timing relative to the assembly instructions; it trades some efficiency for more stability of drivers written by developers unaware of memory barriers.

Nevertheless, I wouldn't suggest relying on the CPU architecture as an excuse to write bad drivers. Always place proper explicit memory barriers in the code. Linux provides macros that typically resolve to nothing (or at most a volatile memory declaration preventing compile-time reordering) when the code is compiled for one of the stricter architectures. Always assume your driver might one day be compiled for the Alpha. Remember: »and then there's the alpha« (http://lxr.linux.no/#linux+v3.3.4/documentation/memory-barriers.txt).

But as soon as a memory barrier (http://en.wikipedia.org/wiki/memory_barrier) is in place – either explicitly given as a compiler/assembler directive or implicitly by the CPU architecture – any PCI access will actually stall the CPU.
Even if the CPU can process five or ten more instructions that merely handle independent register content, this amounts to background noise next to multi-gigahertz clock rates and a PCI read latency of 0.5 to 2 µs. Let me quote a PCI-SIG document (http://www.pcisig.com/developers/main/training_materials/get_document?doc_id=00941b570381863f8cc97850d46c0597e919a34b) on that, see page 10. It is from 2005 and mostly addresses PCI and PCI-X, but the reasons apply even more to modern, heavily switched PCIe system architectures.

--- Quote Start ---
• Why are MMIO Loads so bad
– Processor stalls or has to do a context switch waiting for the MMIO Load Reply Data
– MMIO Load Reply Data takes a long time due to PCI ordering rules
– MMIO Load Request have to push MMIO store data
– MMIO Load Reply Data have to push DMA store data
--- Quote End ---

So this means: with more MMIO loads, the CPU can do less. But also: with more DMA store data and MMIO writes in flight, MMIO loads take longer to finish, as the PCI ordering rules force some operations to be pushed ahead of the request or even the response. BTW, pages 25 and 26 show the reasons for having these ordering rules. And they affect reads and writes by the device, i.e. DMA accesses, too; but a DMA controller can usually handle reads and writes in parallel, so those don't suffer as much from this pushing behavior. Still, it's always better to set no-snoop and relaxed ordering for any request suited for that.

Finally, PCI Express 3.0 adds some options to loosen these ordering requirements a little, allowing more efficient data exchange with less transaction coupling between independent DMA channels. They call it "loose transaction ordering" in their FAQ (http://www.pcisig.com/news_room/faqs/pcie3.0_faq/).