PCIe DMA transfer optimization
Hello,
We have an FPGA device (Intel Stratix V) connected to a Broadwell-D CPU (Xeon D-1559) over a PCIe Gen3 x8 interface. The FPGA has 8 input channels of approx. 250 MB/s each (~2 GB/s aggregate input data), and we are trying to "push" this data into CPU RAM using a DMA engine inside the FPGA. The overall theoretical bandwidth of the PCIe link (about 8 GB/s) is therefore roughly 4 times the data rate.
For some reason, the actual bandwidth we are able to achieve is much lower than expected: about 500 MB/s. Only 2 of the 8 channels can run at full bandwidth.
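To sanity-check the ~8 GB/s figure, here is a rough back-of-the-envelope sketch in Python. It only accounts for the Gen3 128b/130b line encoding and a per-packet overhead; the 24-byte TLP overhead and the list of maximum payload sizes are assumptions for illustration (we have not stated our negotiated payload size), and DLLP/flow-control traffic is ignored.

```python
# Rough PCIe Gen3 x8 throughput estimate (a sketch, not a measurement).
GT_PER_S = 8e9            # Gen3 raw rate per lane (one bit per transfer)
LANES = 8
ENCODING = 128 / 130      # Gen3 128b/130b line encoding
TLP_OVERHEAD_BYTES = 24   # ASSUMPTION: header + framing + LCRC per packet


def link_bytes_per_s():
    """Raw link bandwidth in bytes/s after line encoding, one direction."""
    return GT_PER_S * LANES * ENCODING / 8


def payload_bytes_per_s(max_payload):
    """Usable payload bandwidth for a given max payload size in bytes."""
    efficiency = max_payload / (max_payload + TLP_OVERHEAD_BYTES)
    return link_bytes_per_s() * efficiency


if __name__ == "__main__":
    required = 8 * 250e6  # 8 channels at 250 MB/s each
    for mps in (128, 256, 512):  # ASSUMPTION: typical payload sizes
        usable = payload_bytes_per_s(mps)
        print(f"MPS={mps} B: ~{usable / 1e9:.2f} GB/s usable, "
              f"{usable / required:.1f}x the required rate")
```

Under these assumptions even the smallest payload size leaves more than 3x headroom over the required ~2 GB/s, so the link encoding and packet overhead alone would not seem to explain a 500 MB/s ceiling.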
Our implementation consists of 16 DMA channels (2 per input channel). Each physical channel has two separate DMAs, one for the header and one for the data itself; this is an application requirement.
Each DMA requests a 24 MB memory window from the application and pushes data into it continuously, raising an interrupt for every 4 MB of data sent. This allows the CPU to fetch and clear that part of the data buffer before it gets overwritten in the next cycle. This means that for each channel the CPU gets around 60 interrupts per second, and should be getting ~480 interrupts per second when all 8 channels are operating.
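The timing implied by this buffer scheme can be worked out with a short Python sketch. It assumes decimal megabytes, counts only the data DMAs (whether the header DMAs raise their own interrupts would change the total), and the function names are just illustrative.

```python
# Timing budget for one 24 MB ring window filled at 250 MB/s,
# with an interrupt every 4 MB. ASSUMPTION: decimal MB (1e6 bytes).
MB = 1e6
CHANNELS = 8
RATE = 250 * MB      # bytes/s per input channel
WINDOW = 24 * MB     # DMA memory window per channel
CHUNK = 4 * MB       # data written between two interrupts


def interrupts_per_s(rate=RATE, chunk=CHUNK):
    """Interrupt rate of a single data DMA."""
    return rate / chunk


def wrap_time_s(rate=RATE, window=WINDOW):
    """Time for the FPGA to write the whole window once."""
    return window / rate


def service_deadline_s(rate=RATE, window=WINDOW, chunk=CHUNK):
    """Time between a chunk's interrupt and the FPGA starting to
    overwrite that same chunk on the next pass through the window."""
    return (window - chunk) / rate


if __name__ == "__main__":
    per_channel = interrupts_per_s()
    print(f"interrupts/s per channel: {per_channel:.1f}")
    print(f"interrupts/s, all channels: {per_channel * CHANNELS:.0f}")
    print(f"window wrap time: {wrap_time_s() * 1e3:.0f} ms")
    print(f"deadline per chunk: {service_deadline_s() * 1e3:.0f} ms")
```

With these numbers each channel interrupts roughly every 16 ms, in the same range as the ~60 per second quoted above, and the CPU has about 80 ms to drain a 4 MB chunk before the FPGA comes back around to it.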
So, our basic questions are:
- Looking at this implementation, is there something in the method that could explain the low performance?
- Is there profiling software that can help us better understand what is happening in the pipeline, so that we may be able to solve it?
Thanks!