Forum Discussion

JET60200 (Contributor)
4 years ago

PC-to-FPGA AVMM DMA data transfers randomly fail to complete in time using the A10 PCIe AVMM IP

Hello experts,

We are using an A10 HIP AVMM IP design as a PCIe device, plugged into an x86 Xeon server.

In the Linux driver, we program the FPGA AVMM DMA engine to transfer bulk data between FPGA SRAM and x86 DDR4. Each bulk transfer has the same length (256 KB of IQ data per DMA transfer, in a 71 us period), and these transfers run continuously back-to-back.

The whole system works as expected for 2-3 hours and keeps running correctly. We measured each DMA transfer at around 31 us, so normally every transfer completes well within its 71 us period.

But randomly, and only rarely, a few DMA transfers consume over 3977 us, far longer than the normal case, and the system then runs into problems and crashes. Since it is a very rare exception and involves the AVMM DMA engine inside the A10 FPGA, we have no idea how to move forward with debugging it.

Does anyone have an idea how to debug this, or are there any debug status registers we can check in the FPGA HIP AVMM module? Thanks very much for any help and advice.

6 Replies

  • Continuing to dig into where the problem might be:

    I ran lspci to check AER on the PCIe HIP. Before the final DMA-stuck issue occurs, lspci does not report any errors, but after the problem occurs, lspci shows a few AER errors from the FPGA HIP core, such as the following:

    "

    [root@localhost ~]# lspci -s 0000:17:00.0 -vv
    17:00.0 Non-VGA unclassified device: Altera Corporation Device 1001
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 32 bytes
    Interrupt: pin A routed to IRQ 355
    NUMA node: 0
    Region 0: Memory at 38007ff00000 (64-bit, prefetchable) [size=512]
    Region 2: Memory at c5800000 (32-bit, non-prefetchable) [size=4M]
    Capabilities: [50] MSI: Enable- Count=1/4 Maskable- 64bit+
    Address: 0000000000000000 Data: 0000
    Capabilities: [78] Power Management version 3
    Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
    Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [80] Express (v2) Endpoint, MSI 00
    DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
    ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
    DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
    RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
    MaxPayload 256 bytes, MaxReadReq 512 bytes
    DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
    LnkCap: Port #1, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s <4us, L1 <1us
    ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
    LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
    ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
    LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
    DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
    LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
    Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
    Compliance De-emphasis: -6dB
    LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
    EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [100 v1] Virtual Channel
    Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
    Arb: Fixed- WRR32- WRR64- WRR128-
    Ctrl: ArbSelect=Fixed
    Status: InProgress-
    VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
    Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
    Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
    Status: NegoPending- InProgress-
    Capabilities: [200 v1] Vendor Specific Information: ID=1172 Rev=0 Len=044 <?>
    Capabilities: [300 v1] #19
    Capabilities: [800 v1] Advanced Error Reporting
    UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

    CESta: RxErr+ BadTLP+ BadDLLP- Rollover- Timeout+ NonFatalErr-
    CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
    AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Kernel driver in use: nr_device_driver

    "

    This means some errors occur during the A10 AVMM DMA operation. But here is what puzzles me: every DMA operation uses just 8 descriptor entries, and those 8 descriptor entries always have the same content. Why does it run correctly for 3-4 hours and then suddenly get stuck in the PCIe HIP core? That's weird!

    Does anyone have any idea? Thanks in advance.

    • JET60200 (Contributor)

      Hi @KhaiChein_Y_Intel ,

      1) Regarding of " Signal Tap " capture, what signal(s) do you request to capture ?

      2) Also regarding of " RxErr+ ", " BadTLP+ Timeout+ ", I believe they are located in PCIe Physical Layer & Link Layer , correct ? Since they're not relatedd to PCIE Application Data, does that mean it may be a Hardware related issue ?

      Thanks for the feedback.

    • JET60200 (Contributor)

      Hello @KhaiChein_Y_Intel ,

      Which signals should we capture to investigate this "stuck" issue? Is there any guidance describing this? Thanks in advance.

  • KhaiChein_Y_Intel (Regular Contributor)

    Hi,


    Could you share the STP for the signals below and the .ip file? Please use "transitional" for the storage qualifier setting.


    Txs

    dma_rd_master

    dma_wr_master

    wr_dts_slave

    rd_dts_slave

    wr_dcm_master

    rd_dcm_master

    Rxm_BAR*

    tx_out0[<n>-1:0]

    rx_in0[<n>-1:0]

    hip_reconfig_clk

    hip_reconfig_rst_n

    hip_reconfig_address[9:0]

    hip_reconfig_read

    hip_reconfig_readdata[15:0]

    hip_reconfig_write

    hip_reconfig_writedata[15:0]

    hip_reconfig_byte_en[1:0]

    ser_shift_load

    interface_sel

    npor

    nreset_status

    pin_perst

    refclk

    RdDmaWrite_o

    RdDmaAddress_o[63:0]

    RdDmaWriteData[<w>-1:0]

    RdDmaBurstCount_o[<n> -1:0]

    RdDmaByteEnable_o[ <w>-1:0]

    RdDmaWaitRequest_i

    WrDmaRead_o

    WrDmaAddress_o[63:0]

    WrDmaReadData_i[<w >-1:0]

    WrDmaBurstCount_o[<n>-1:0]

    WrDmaWaitRequest_i

    WrDmaReadDataValid_i

    cfg_par_err

    derr_cor_ext_rcv

    derr_cor_ext_rpl

    derr_rpl

    dlup

    dlup_exit

    ev128ns

    ev1us

    hotrst_exit

    ins_status[3:0]

    ko_cpl_spc_data[11:0]

    ko_cpl_spc_header[7:0]

    l2_exit

    lane_act[3:0]

    ltssmstate[4:0]

    rx_par_err

    tx_par_err[1:0]

    currentspeed[1:0]

    Cra*


    Thanks

    Best regards,

    KhaiY


  • KhaiChein_Y_Intel (Regular Contributor)

    Hi,


    We have not received any response from you to the previous question/reply/answer that I provided, so this thread will be transitioned to community support. If you have a new question, feel free to open a new thread to get support from Intel experts. Otherwise, community users will continue to help you on this thread. Thank you.


    Best regards,

    KhaiY