Occasional Contributor

2 years ago

Solved

How to effectively implement PCIe transfer?

Hi,
I am using the Stratix 10 PCIe IP core for the first time. While reading the "L-tile and H-tile Avalon Memory mapped Intel FPGA IP for PCI Express User Guide 21.1" (8.1. Read DMA Example), I found that the following steps are needed to transfer data from the FPGA to software:

1. Software allocates memory for the Write Descriptor Status table and Write Descriptor Controller table in host memory.
2. Program the Write Descriptor Controller table.
3. Program the Write Descriptor Controller register "Write Status and Descriptor Base" with the starting address of the descriptor table.
4. Program the Write Descriptor Controller register "Write Descriptor FIFO Base" with the starting address of the on-chip write descriptor table FIFO.
5. Program the Write Descriptor Controller register WR_DMA_LAST_PTR with the value.
6. The host waits for the MSI interrupt. The Write Descriptor Controller sends an MSI to the host after completing the last descriptor.
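
The register writes in the steps above could look roughly like the following sketch. It is only an illustration: the register offsets and the names `REG_*`, `reg_write32` and `start_write_dma` are placeholders that are not taken from the user guide, and `regs` stands in for the memory-mapped BAR window of the descriptor controller. The last pointer is written last because, as far as I understand the guide, that write is what starts the descriptor fetch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder byte offsets, NOT taken from the user guide. Look up the real
 * ones in the Write Descriptor Controller register map. */
enum {
    REG_WR_STATUS_DESC_BASE_LO = 0x00,
    REG_WR_STATUS_DESC_BASE_HI = 0x04,
    REG_WR_DESC_FIFO_BASE_LO   = 0x08,
    REG_WR_DESC_FIFO_BASE_HI   = 0x0C,
    REG_WR_DMA_LAST_PTR        = 0x10,
    REG_WINDOW_BYTES           = 0x14
};

/* 32-bit register write; `regs` stands in for the mapped BAR window. */
static void reg_write32(volatile uint32_t *regs, size_t byte_off, uint32_t value)
{
    assert(byte_off % 4 == 0 && byte_off < (size_t)REG_WINDOW_BYTES);
    regs[byte_off / 4] = value;
}

/* Program the table base, the on-chip FIFO base, and finally the ID of the
 * last descriptor to process. */
static void start_write_dma(volatile uint32_t *regs, uint64_t table_bus_addr,
                            uint64_t fifo_base, uint32_t last_desc_id)
{
    reg_write32(regs, REG_WR_STATUS_DESC_BASE_LO, (uint32_t)table_bus_addr);
    reg_write32(regs, REG_WR_STATUS_DESC_BASE_HI, (uint32_t)(table_bus_addr >> 32));
    reg_write32(regs, REG_WR_DESC_FIFO_BASE_LO, (uint32_t)fifo_base);
    reg_write32(regs, REG_WR_DESC_FIFO_BASE_HI, (uint32_t)(fifo_base >> 32));
    reg_write32(regs, REG_WR_DMA_LAST_PTR, last_desc_id);
}
```

In a real driver, `regs` would come from mapping the BAR and `table_bus_addr` from the DMA-coherent allocation of the descriptor table; both are outside the scope of this sketch.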

Based on my understanding, the software first allocates host memory for the Write Descriptor Status table and the Write Descriptor Controller table, then programs the descriptor table to define the transfer. The PCIe IP core in the FPGA fetches the descriptors from host memory into its on-chip FIFO and performs the transfer they describe. Finally, an MSI interrupt is sent to the software to indicate that the transfer is complete, and the lowest bit of the descriptor status is set to 1.
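
To make this concrete, here is a minimal sketch of how the host-side table could be modelled in C. The 32-byte descriptor size, the 128-entry limit and the done flag in the lowest status bit come from this thread; the field names, the field order inside a descriptor, and the placement of the status words before the descriptors are my assumptions and should be checked against the user guide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DESCRIPTORS 128u

/* One descriptor: 8 DWORDs = 32 bytes. Field names and order are assumed. */
struct dma_descriptor {
    uint32_t src_addr_lo;
    uint32_t src_addr_hi;
    uint32_t dst_addr_lo;
    uint32_t dst_addr_hi;
    uint32_t control;      /* transfer size and descriptor ID; bit layout not shown here */
    uint32_t reserved[3];
};

/* Assumed layout: one status word per descriptor, then the descriptors. */
struct dma_table {
    uint32_t status[MAX_DESCRIPTORS];
    struct dma_descriptor desc[MAX_DESCRIPTORS];
};

_Static_assert(sizeof(struct dma_descriptor) == 32, "descriptor must be 32 bytes");
_Static_assert(offsetof(struct dma_table, desc) == 512, "descriptors follow 128 status words");

/* Returns true once the lowest status bit for descriptor `id` has been set. */
static bool descriptor_done(const volatile struct dma_table *table, unsigned id)
{
    assert(id < MAX_DESCRIPTORS);
    return (table->status[id] & 1u) != 0;
}
```

After the MSI arrives, the handler could call `descriptor_done()` on the last descriptor ID it requested to confirm completion before touching the received data.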

If I need to continuously transfer image or video data from DDR to the CPU through the FPGA, do I only need one or a few descriptors to complete the transfer? After the transfer defined by the descriptors completes, the software rewrites the descriptors and repeats the operations above. If so, I think the transfer will be intermittent, because the FPGA has to wait for the software to program the descriptors before starting the next transfer. Is there a better way to implement the data transfer?

VenT_Altera
2 years ago
Hi Allen,
Thank you for reaching out.
From the steps given, I believe that you are looking into 8.2. Write DMA Example instead of 8.1. The two questions you've raised are:
1. Do I only need to use one or a few descriptors to complete the transfer?
2. Is there a better way to implement data transfer?
The answers below are based on the user guide 21.1 that you mentioned.
1. Whether one or a few descriptors are enough depends on the size of the transfer. In PCIe system memory, the read and write descriptors are stored in separate descriptor tables, and each table can store up to 128 descriptors. Each descriptor is 8 DWORDs (32 bytes). Based on the descriptor format, the maximum transfer size per descriptor is (1 MB - 4 bytes). Hence, if a transfer is larger than (1 MB - 4 bytes), you'll need more than one descriptor to complete it. If the transfer fits within that limit, one descriptor is enough. However, to avoid a possible overflow condition, allocate the memory needed for the number of descriptors supported by WR_TABLE_SIZE.
2. No. This is the way it works for DMA data transfer, as the descriptor ID will loop back to 0 after reaching WR_TABLE_SIZE.
If you want to process more descriptors than WR_TABLE_SIZE allows, you must follow two steps:
1. Process the descriptors up to WR_TABLE_SIZE by writing the same value as in WR_TABLE_SIZE to WR_DMA_LAST_PTR.
2. Next, write the number of remaining descriptors to WR_DMA_LAST_PTR.
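
As a rough illustration of the per-descriptor size limit, the number of descriptors needed for a buffer can be estimated as below. This is a sketch, not code from the user guide: `MAX_DESC_BYTES` and `descriptors_needed` are hypothetical names, 1 MB is taken as 2^20 bytes, and the buffer length is assumed to be a multiple of 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum payload of a single descriptor: 1 MB - 4 bytes. */
#define MAX_DESC_BYTES ((1u << 20) - 4u)

/* Number of descriptors needed to move total_bytes (ceiling division). */
static uint32_t descriptors_needed(uint64_t total_bytes)
{
    assert(total_bytes > 0 && total_bytes % 4 == 0);
    return (uint32_t)((total_bytes + MAX_DESC_BYTES - 1) / MAX_DESC_BYTES);
}
```

For example, a 1920 x 1080 frame at 4 bytes per pixel is 8,294,400 bytes, so `descriptors_needed(8294400)` returns 8, which fits comfortably in one 128-entry table.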
User Guide: https://www.intel.com/content/www/us/en/docs/programmable/683667/21-1/introduction.html
I hope these address your questions well.
Thanks.

Best Regards,
VenTing_Intel

4 Replies

VenT_Altera
Frequent Contributor
2 years ago
Hi Allen,

May I know if there are any updates from your end regarding this forum case?
I look forward to your response which will allow us to proceed to the next step.

Thanks.

Best Regards,
VenTing_Intel
VenT_Altera
Frequent Contributor
2 years ago
Hi Allen,

I’m glad that your question has been addressed. I will now transition this thread to community support. If you have a new question, please log in to https://supporttickets.intel.com/, view the details of the desired request, and post a feed or response within the next 15 days to allow me to continue to support you. After 15 days, this thread will transition to community support, and community users will be able to help you with your follow-up questions.

Thanks.

Best Regards,
VenTing_Intel

p/s: If any answers from the community or Intel support are helpful, please feel free to mark them as solutions, give them kudos, and rate 5/5 for the survey.
- allen18
  Occasional Contributor
  2 years ago
  Hi, VenTing:
  Thank you very much for your previous answer.
  
  I would like to continue with another question regarding interrupts. In a given scenario, let's assume that an FPGA needs to continuously upload data to a CPU. However, the data bandwidth is much smaller than the PCIe bandwidth. So, I am using DDR as a cache to accumulate a sufficient amount of data. Once enough data is collected in the DDR cache, the CPU configures the descriptors to initiate a DMA write transfer, allowing the data to be uploaded to the CPU.
  
  Since the CPU doesn't know when it can start a transfer, the FPGA needs to send an interrupt to notify it. If I intend to use MSI-X and need a total of three interrupt vectors, including those for functions other than the DDR transfer, how should I configure the IP core, and how do I trigger the interrupts? I am aware that Xilinx's XDMA exports a 3-bit signal called "pcie_user_irq"; when the user asserts this signal, the IP core sends an MSI-X TLP. However, Intel's IP does not seem to have a similar mechanism.
  
  During operation, the FPGA will continuously send interrupts, and the CPU will keep responding to them and configuring descriptors to perform the transfers. Is this the only way to accomplish this? Are there better methods for this application scenario?
