Oliver,
I agree on A, had no issues with B, and never tried C for the reasons you mentioned – I went directly to a self-written design based on the Avalon-ST interface. The sad thing is that in this case I have to disagree with your statement in D: you have to know far more about PCIe transactions than is documented in the Altera PCIe UG, unless you are willing to debug all your issues out of the design at high development cost – problems surfacing late in the design, large-scale architectural changes, other team members starving during system integration without stable hardware and/or drivers.
If one wants to follow my trail, I would advise taking at least two steps: first, write a simple PIO-based design with simple hardware on Avalon-ST and an inefficient driver that already mimics an application interface close to the final one. Then replace more and more of the hardware and driver with an improved design, based finally on DMA.
I have learned quite a lot about DMA-capable hardware and proper driver interfacing by reading and understanding some well-written, high-performance Linux drivers together with the public data sheets of the devices. In addition, when designing for Linux or reviewing Linux device drivers, ldd3 (http://lwn.net/kernel/ldd3/) is a valuable resource for understanding how a driver interacts with the operating system. Furthermore, one learns which interface optimizations the kernel designers have applied to let high-performance DMA-capable hardware take as much work as possible off the CPU, so the design can benefit from implementing one or the other optimization right in the DMA engine instead of forcing the driver or the user code to do that work – especially avoiding data copying (zero copy, http://en.wikipedia.org/wiki/zero-copy).
Things are slightly different with Windows, as it supports different application interfaces and there is less driver code to take as an example. But the basic operation modes are similar to those on Linux, and there is rarely cause to optimize hardware just for the needs of a specific OS.
Most important at this stage of development is to know the order of operations and which PCIe transactions and semantic packet transactions can overlap, i.e. the accesses to data, descriptor queues, and hardware registers by device and driver. Avoid race conditions by design, and avoid interrupt oversampling. In my case it was highly beneficial that I implemented both the PCIe application in hardware and the Linux driver, so the development loop was very short and any bad architectural decision on one side was quickly – and without emotions – revised on the other.
Another important topic is the no-snoop and relaxed-ordering transaction bits. Using them properly not only gains performance but also indicates that the hardware designer has a good understanding of the transaction semantics.
For example, when sending data from the device to main memory via DMA, I have three different transactions to handle: first, I write the data with no-snoop (the memory is assigned to the hardware at cache-line boundaries at this point) and relaxed ordering; then I update the descriptor with just relaxed ordering set; and finally I update the queue head pointer in main memory without either of these attributes, pushing out all outstanding requests to main memory. This ensures that the driver can only see the updated head pointer once all frame and descriptor data has been updated, while before that point those updates can be re-ordered with other transactions for improved performance. Furthermore, inside the hardware design, updates to the descriptor queue can be combined for multiple descriptor entries, and updates to the head pointer can be deferred until some descriptors are filled or no further data is pending.
In my design, the high-level driver states are handled with a kind of virtual token: while the driver owns the token, the hardware is not allowed to issue an interrupt. During this time the driver handles as much work as possible in its bottom half, perhaps scheduled for multiple execution runs by the kernel. When the driver is done processing the queue – it is empty for RX or full for TX – it stops the bottom-half processing, indicates to the hardware which task it completed last, and the hardware takes the token. As soon as the hardware becomes aware of new tasks queued by the driver – even tasks that were already written to the descriptor queue at the time it received the token – it interrupts the driver, handing the token back. Remember: the token is precious, and there shall be only one token, so take care not to lose or duplicate it.
Another step, which depends on the target performance in terms of packets per second as well as CPU load, is to minimize PIO read operations from the CPU to the device. I currently have just one read request inside the ISR, which checks the cause of the interrupt by looking at a status register, and under higher load the bottom half of the driver goes into polling mode, so the ISR is called less frequently, down to zero times at full load. The bottom half does not issue any CPU read requests to device memory or registers; it only handles data, descriptors, and pointers in main memory, except for the head/tail pointer updates written to device registers.
– Matthias