PCIe DMA transfer optimization

Question

Hello,

We have an FPGA device (Intel Stratix V) connected to a Broadwell-D CPU (1559) over an x8 PCIe Gen3 interface. The FPGA has 8 input channels of approx. 250 MB/s each (~2 GB/s aggregate input data), which we are trying to push to CPU RAM using a DMA engine inside the FPGA. The overall theoretical bandwidth of the PCIe link (8 GB/s) is therefore about 4 times the data rate.
For some reason, the actual bandwidth we are able to achieve is much lower than expected, about 500 MB/s: only 2 of the 8 channels can run at full bandwidth.
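As a rough cross-check of the link budget described above, the following sketch estimates usable PCIe Gen3 x8 bandwidth. The 8 GT/s lane rate and 128b/130b encoding are standard for Gen3; the per-packet overhead of 24 bytes and the payload sizes tried are assumptions for illustration, and DLLP and flow-control traffic is ignored.

```python
# Rough PCIe Gen3 link budget. Overhead figures are illustrative assumptions.

GEN3_TRANSFERS_PER_S = 8e9      # 8 GT/s per lane (PCIe Gen3)
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def pcie_gen3_link_bytes_per_s(lanes):
    """Raw link bandwidth in bytes/s after line encoding, before packet overhead."""
    return GEN3_TRANSFERS_PER_S * lanes * ENCODING_EFFICIENCY / 8


def tlp_efficiency(payload_bytes, overhead_bytes=24):
    """Fraction of link bytes that carry payload.

    overhead_bytes is an ASSUMED per-packet cost (header, framing, CRC);
    the real value depends on address width and optional fields.
    """
    return payload_bytes / (payload_bytes + overhead_bytes)


required = 8 * 250e6  # 8 channels at approx. 250 MB/s each
link = pcie_gen3_link_bytes_per_s(8)

for payload in (64, 128, 256):  # candidate max payload sizes, in bytes
    usable = link * tlp_efficiency(payload)
    print(f"payload {payload:3d} B: usable ~{usable / 1e9:.2f} GB/s, "
          f"required {required / 1e9:.2f} GB/s, "
          f"utilisation {required / usable:.0%}")
```

Under these assumptions the required 2 GB/s fits within the link even with small payloads, which is consistent with the expectation that the link itself should not limit throughput to 500 MB/s.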
Our implementation consists of 16 DMA channels (2 per physical channel). Each physical channel has 2 separate DMAs, one for the header and one for the data itself; this is an application requirement.
Each DMA requests a 24 MB memory window from the application and pushes data continuously, raising an interrupt for every 4 MB of data sent. This allows the CPU to fetch and clear the data buffer before it is overwritten in the next cycle. This means that for each channel the CPU gets around 60 interrupts per second, and should get ~480 interrupts per second when all 8 channels are operating.
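The interrupt rate and the time the CPU has to drain each 4 MB segment follow directly from the figures above. This is a small sketch of that arithmetic, assuming decimal megabytes, one interrupt-generating data DMA per channel, and a window that is written as a simple ring; the function name is made up for illustration.

```python
MB = 1_000_000  # decimal megabytes (assumption)


def ring_timing(rate_bytes_per_s, window_bytes, segment_bytes):
    """Return (segments, interrupts_per_s, drain_deadline_s) for one DMA.

    drain_deadline_s is the time from the interrupt for a segment until the
    DMA wraps around and starts overwriting that same segment.
    """
    segments = window_bytes // segment_bytes
    interrupts_per_s = rate_bytes_per_s / segment_bytes
    segment_period_s = 1.0 / interrupts_per_s
    # After segment k completes, the DMA fills the other (segments - 1)
    # segments before it returns to segment k.
    drain_deadline_s = (segments - 1) * segment_period_s
    return segments, interrupts_per_s, drain_deadline_s


segments, irq_rate, deadline = ring_timing(250 * MB, 24 * MB, 4 * MB)
print(f"segments per window     : {segments}")
print(f"interrupts/s per channel: {irq_rate:.1f}")
print(f"interrupts/s, 8 channels: {8 * irq_rate:.0f}")
print(f"time to drain a segment : {deadline * 1e3:.0f} ms")
```

With exactly 250 MB/s per channel this gives slightly more than the rounded 60 and 480 interrupts per second quoted above, and a drain deadline on the order of tens of milliseconds per segment.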
So, our basic questions are:

Looking at this implementation method, is there something in it that could explain the low performance?
Is there profiling software that can help us better understand the pipeline, so that we may be able to solve this?

Thanks!

harris · Answer

Hi ZVere,

1. I guess the multi-channel DMA was designed by yourselves. I know nothing about your design, so I don't have any suggestions for it.
2. I suggest you try a polling scheme instead of an interrupt scheme. I suspect that so many interrupts might not be handled by the system in a timely manner, which might be the reason for the low performance.
3. You can also use Signal Tap to capture some signals and check their timing relationships to help analyze the issue.
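To illustrate what a polling scheme could look like on the host side, here is a minimal sketch. It assumes, hypothetically, that the DMA engine exposes a monotonically increasing count of completed segments (for example in a register or a write-back location in host memory); the class and method names are made up and this is not the design discussed in this thread.

```python
class RingConsumer:
    """Host-side bookkeeping for a DMA window split into fixed-size segments."""

    def __init__(self, num_segments):
        self.num_segments = num_segments
        self.consumed = 0  # total segments processed so far

    def poll(self, produced):
        """Return the ring indices of segments ready for processing.

        produced is the device's running count of completed segments
        (hypothetical counter). Raises if the device has wrapped around
        onto a segment that has not been processed yet.
        """
        pending = produced - self.consumed
        if pending >= self.num_segments:
            raise RuntimeError("overrun: device overwrote unprocessed data")
        ready = [(self.consumed + i) % self.num_segments
                 for i in range(pending)]
        self.consumed = produced
        return ready


# Simulated counter values, as a driver loop might observe them over time.
consumer = RingConsumer(num_segments=6)  # 24 MB window / 4 MB segments
for produced in (0, 2, 7):
    print(produced, consumer.poll(produced))
```

A real driver would read the counter in a loop or from a timer and process each returned segment; the trade-off against interrupts is CPU time spent polling versus per-interrupt handling cost.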

Thanks.
Harris

zvere · Answer

Hi Harris,

I must use interrupts.
We are not sure the DMA design in the FPGA is implemented in the best way.
I wonder how a standard (Mellanox) 40 Gb Ethernet card works compared to our design.

Thank you,
Zvika

harris · Answer

Hi Zvika,

I'm sorry, I know nothing about the Mellanox Ethernet card, and I guess the DMA design was implemented by yourselves, so I can't provide any other suggestions. Thanks.

BR/Harris
