Forum Discussion
Altera_Forum
Honored Contributor
13 years ago

--- Quote Start --- I would like to keep this as simple as possible, and avoid using DMA. --- Quote End ---

Ron,

Sorry about the bad news, but that is the wrong approach. It’s like saying: »I want to keep all my 8 CPU cores busy, and I set the target of doing it with a single-tasking application.« As long as the CPU initiates each (byte, word) transfer, you are stuck at the performance you stated. DMA is a beast, but get used to it sooner or later; it will pay off. And it’s not *that* hard to do.

You can approach DMA at multiple levels of efficiency. The first and most important step is to let the CPU do something else while the hardware does the transfers. Without descriptor-based pipelining, this approach may not reach the absolute highest bus performance (in MB/s), but it already reduces CPU load significantly and frees the CPU to do other things while single blocks are transferred.

In this solution, the driver initiates a single DMA block transfer (in either direction, so device→memory or memory→device) and waits for an interrupt, issued by the device, when the transfer is done (or an error occurs). The larger the memory block, the less often the data flow stalls on finalizing one transfer and setting up the next, and the fewer context switches and interrupt calls the CPU has to take.

Say you have a 32 MiB window of kernel memory, cut in two halves, filled alternately with data by the device and then consumed by the application. Filling 16 MiB of data takes the device at best about 70 to 100 ms at gen1 x1, so you will see at most ~15 interrupts per second indicating the buffer-full condition, which is already low. This doesn’t work as efficiently when you want to transfer only, say, 4 KiB per turnaround, and with higher lane widths and PCIe speeds the device may already be done in 1 or 2 ms; you can then raise the buffer size to work around that kind of speed issue. So doing such a simple one-shot DMA is quite easy.
DMA writes are easier to do than DMA reads, as you don’t have to bother with tags, completion credits, or any kind of completion ordering. First, check your max_payload_size setting and adhere to it for your transfers. Mind all the other PCIe write request rules too, like 4 KiB boundaries and correct byte enables for short requests, and do the right byte shifting if the buffer is not aligned (or ensure in your driver that it is). Then do your transfers and issue an interrupt when done.

Read requests are more tricky and bring another performance-vs.-implementation-complexity tradeoff: you either interleave reads, or you don’t.

Going without interleaved reads keeps the code rather easy: issue a read request of at most max_read_request_size bytes (properly aligned to 4 KiB boundaries, of course, minding the max. completion credits in your receive hard IP), wait for all completions to arrive (and handle a timeout properly!), and once the request is complete, issue the request for the next part right away. In this mode you don’t have to track tags, as you only ever use a single one. Once the whole block is done, just issue the next interrupt indicating »DMA ready«.

For interleaved reads, you have to keep track of the different outstanding read requests, because completions to different read requests are allowed to overtake each other (note: parts of a single read request always come in order, but completions to different requests might take the fast lane of some northbridge-internal data highway). Tracking completion credits, tag IDs, and timeouts also gets more complicated than in a strictly ordered implementation.

Only the next step would be a descriptor-based DMA solution. It is targeted at the next level of performance: the CPU assigns (ordered) jobs in one or more system-RAM-based tables, typically separate for read and write DMA, and the DMA controller autonomously does the transfers while the CPU assigns new jobs or works on already finished ones.
Interrupts are then only exchanged when a queue can no longer be serviced from one side, i.e. on a full or empty condition in a table. But not only the DMA controller changes; so does the driver architecture. Look up terms like »bottom half« to get an impression of how the operating systems support this.

--- Quote Start --- I suspect that the discrepancy between read and write performance has to do with the controller not pipelining read requests, and that it is waiting for a round-trip read completion before moving on to the next read request. I'm hoping there is some way to pipeline multiple read requests to bring up block read efficiency. --- Quote End ---

No, there is no way, for good reason. The CPU has to adhere to the (strict) PCI ordering rules, so when a machine instruction indicates a move from the device memory BAR, the CPU has to wait for the result, no matter what. The CPU cannot know that you don’t actually need the data right away and that there would be no side effects from doing other work while the returned data is still in flight. This can only be worked around by a DMA engine, device-local or system-global.

– Matthias