Forum Discussion
Altera_Forum
Honored Contributor
14 years ago
Shared memory is easy. Just responding to posted and non-posted requests from the host CPU is not that much of a burden for the FPGA designer. But shared memory is typically slow, especially for host read requests, and depending on the BAR settings it may be hard to design properly and consistently.
Remember: CPU transactions to PCIe devices, especially non-posted requests, are (very!) costly in terms of CPU performance. Depending on your motherboard and system I/O load, a PCIe read transaction typically takes 0.5 to 2 µs. During that time the CPU is completely on hold (read: 100% CPU load for each request), and in multi-core systems this typically affects all cores (!) due to memory access transaction ordering constraints. If the BAR is marked as prefetchable, these numbers improve, at the cost of additional effort to get the consistency right.

Shared-memory approaches are good if the advantages of a completely generic, one-size-fits-all approach outweigh its disadvantages, most notably (read) performance. If you want to gain transfer speed and CPU performance, especially when data has to travel from the PCIe device to the CPU, you will find no easy way around DMA. The CPU and chipset are far faster at accessing data in main memory than at fetching it word by word from the PCIe device.

The best approach is to think in transactions: if the CPU has to push out one or more messages, it should write them to main memory, update the related descriptor list (think: a FIFO holding message pointers) and notify the PCIe device of this change. The device will then read the descriptor and issue further DMA read requests to fetch the actual data from main memory, finally updating the status and signaling completion to the CPU with an interrupt. In the meantime, the CPU was not stalled in any way and could spend its time servicing other tasks, possibly generating new messages for the PCIe device.
The same approach works in the other direction: if the PCIe device has to notify the CPU of the arrival of new data or a state change, it takes the next descriptor-list entry, which points to a free main-memory location, writes the message there via DMA, then updates the descriptor (typically with the reception status and the message length), and finally raises an interrupt to the CPU. The CPU can then fetch the message quickly and efficiently from main memory.

– Matthias