Forum Discussion

JET60200 (Contributor)
4 years ago

PC-to-FPGA AVMM DMA data transfers randomly fail to complete in time using the A10 PCIe AVMM IP

Hello experts,

We are using an A10 HIP AVMM IP design as a PCIe device, plugged into an x86 Xeon server.

In the Linux driver, we program the FPGA AVMM DMA engine to transfer bulk data between FPGA SRAM and x86 DDR4. Each bulk transfer has the same length (256 KB of IQ data per DMA transfer, in a 71 us period), and these transfers run continuously back-to-back.

The whole system works as expected for 2-3 hours and keeps running correctly. We measured each DMA transfer at around 31 us, so normally every transfer completes well within its 71 us period.

But randomly, and only rarely, a few DMA transfers consume over 3977 us, far longer than the normal case, and the system then runs into problems and crashes. Since it is a very rare exception and involves the AVMM DMA engine inside the A10 FPGA, we have no idea how to move forward with debugging it.

Does anyone have an idea how to debug this, or are there any debug status registers we can check in the FPGA HIP AVMM module? Thanks very much for any help and advice.

6 Replies

  • Continuing to dig into where the problem might be:

    I ran lspci to check AER on the PCIe HIP. Before the final DMA-stuck issue occurs, lspci does not report any errors, but after the problem occurs, lspci shows a few AER errors from the FPGA HIP core, such as the following:

    "

    [root@localhost ~]# lspci -s 0000:17:00.0 -vv
    17:00.0 Non-VGA unclassified device: Altera Corporation Device 1001
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 32 bytes
    Interrupt: pin A routed to IRQ 355
    NUMA node: 0
    Region 0: Memory at 38007ff00000 (64-bit, prefetchable) [size=512]
    Region 2: Memory at c5800000 (32-bit, non-prefetchable) [size=4M]
    Capabilities: [50] MSI: Enable- Count=1/4 Maskable- 64bit+
    Address: 0000000000000000 Data: 0000
    Capabilities: [78] Power Management version 3
    Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
    Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [80] Express (v2) Endpoint, MSI 00
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
    ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
    DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
    RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
    MaxPayload 256 bytes, MaxReadReq 512 bytes
    DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
    LnkCap: Port #1, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s <4us, L1 <1us
    ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
    LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
    ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
    LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
    DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
    LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
    Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
    Compliance De-emphasis: -6dB
    LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
    EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [100 v1] Virtual Channel
    Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
    Arb: Fixed- WRR32- WRR64- WRR128-
    Ctrl: ArbSelect=Fixed
    Status: InProgress-
    VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
    Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
    Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
    Status: NegoPending- InProgress-
    Capabilities: [200 v1] Vendor Specific Information: ID=1172 Rev=0 Len=044 <?>
    Capabilities: [300 v1] #19
    Capabilities: [800 v1] Advanced Error Reporting
    UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

    CESta: RxErr+ BadTLP+ BadDLLP- Rollover- Timeout+ NonFatalErr-
    CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
    AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Kernel driver in use: nr_device_driver

    "

    This means some errors occur during the A10 AVMM DMA operation. But here is what puzzles me: every DMA operation uses just 8 descriptor entries, and those 8 descriptor entries always have the same content. Why does it run correctly for 3-4 hours and then suddenly get stuck in the PCIe HIP core? That's weird!

    Does anyone have any idea? Thanks in advance.

    • JET60200 (Contributor)

      Hi @KhaiChein_Y_Intel ,

      1) Regarding of " Signal Tap " capture, what signal(s) do you request to capture ?

      2) Also regarding of " RxErr+ ", " BadTLP+ Timeout+ ", I believe they are located in PCIe Physical Layer & Link Layer , correct ? Since they're not relatedd to PCIE Application Data, does that mean it may be a Hardware related issue ?

      Thanks for the feedback.

    • JET60200 (Contributor)

      Hello @KhaiChein_Y_Intel ,

      Which signals should we capture to investigate this "stuck" issue? Is there any guidance describing this? Thanks in advance.

  • KhaiChein_Y_Intel (Regular Contributor)

    Hi,


    Could you share the STP for the signals below and the .ip file? Please use "transitional" for the storage qualifier setting.


    Txs

    dma_rd_master

    dma_wr_master

    wr_dts_slave

    rd_dts_slave

    wr_dcm_master

    rd_dcm_master

    Rxm_BAR*

    tx_out0[<n>-1:0]

    rx_in0[<n>-1:0]

    hip_reconfig_clk

    hip_reconfig_rst_n

    hip_reconfig_address[9:0]

    hip_reconfig_read

    hip_reconfig_readdata[15:0]

    hip_reconfig_write

    hip_reconfig_writedata[15:0]

    hip_reconfig_byte_en[1:0]

    ser_shift_load

    interface_sel

    npor

    nreset_status

    pin_perst

    refclk

    RdDmaWrite_o

    RdDmaAddress_o[63:0]

    RdDmaWriteData[<w>-1:0]

    RdDmaBurstCount_o[<n> -1:0]

    RdDmaByteEnable_o[ <w>-1:0]

    RdDmaWaitRequest_i

    WrDmaRead_o

    WrDmaAddress_o[63:0]

    WrDmaReadData_i[<w >-1:0]

    WrDmaBurstCount_o[<n>-1:0]

    WrDmaWaitRequest_i

    WrDmaReadDataValid_i

    cfg_par_err

    derr_cor_ext_rcv

    derr_cor_ext_rpl

    derr_rpl

    dlup

    dlup_exit

    ev128ns

    ev1us

    hotrst_exit

    ins_status[3:0]

    ko_cpl_spc_data[11:0]

    ko_cpl_spc_header[7:0]

    l2_exit

    lane_act[3:0]

    ltssmstate[4:0]

    rx_par_err

    tx_par_err[1:0]

    currentspeed[1:0]

    Cra*


    Thanks

    Best regards,

    KhaiY


  • KhaiChein_Y_Intel (Regular Contributor)

    Hi,


    We have not received any response from you to the previous question/reply/answer that I provided, so this thread will be transitioned to community support. If you have a new question, feel free to open a new thread to get support from Intel experts. Otherwise, community users will continue to help you on this thread. Thank you.


    Best regards,

    KhaiY