Forum Discussion

Jayden
New Contributor
19 days ago

R-Tile Avalon Streaming PIPE Direct x16: COM (K28.5) symbol lock succeeds, but the following ordered-set symbols are corrupted on some lanes

Hello, I am implementing a custom soft PCIe/CXL link layer and LTSSM using R-Tile Avalon Streaming IP in PIPE Direct mode, configured as x16.

At the moment, link training does not reliably move forward because some lanes achieve valid COM/K-code alignment, but the ordered-set symbols that follow are corrupted.

Environment

  • Device / board: AGIB027R29A
  • IP: R-Tile Avalon Streaming FPGA IP for PCI Express
  • Mode: PIPE Direct
  • Link width: x16
  • Current focus: Gen1 training / Polling / Configuration
  • Custom implementation:
    • custom LTSSM
    • custom symbol lock using COM (K28.5)
    • custom TS1/TS2 decode logic

Symptom

In Polling.Active and Polling.Configuration, I can see that some lanes capture and decode TS1/TS2 correctly, but other lanes do not.

For example, in the attached SignalTap screenshot:

  • Lane 9 appears to decode the TS2 sequence correctly.
  • Lane 8 shows COM (K28.5) and PAD (K23.7) correctly, but the symbols after that are unstable / corrupted.

From the screenshot:

  • Lane 9 example:
    • K28.5, K23.7, K23.7, D24.0, D30.0, D00.0, repeated
  • Lane 8 example:
    • K28.5, K23.7 are visible,
    • but the following TS2 fields fluctuate and do not remain valid/stable.
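
To make the comparison between a good lane and a bad lane more mechanical, here is a minimal Python sketch that checks 16 decoded symbols against the standard Gen1 TS1/TS2 layout (COM, Link number, Lane number, N_FTS, data rate identifier, training control, then ten identifier symbols: D10.2 for TS1, D5.2 for TS2). The function names are illustrative. The ten D5.2 symbols in the example are an assumption taken from the PCIe ordered-set format, not read from the screenshot, and EQ or modified training sets are not handled.

```python
# Decoded 8-bit values, tagged as control (K) or data (D) characters.
COM = ("K", 0xBC)     # K28.5
PAD = ("K", 0xF7)     # K23.7
TS1_ID = ("D", 0x4A)  # D10.2
TS2_ID = ("D", 0x45)  # D5.2


def name(sym):
    """Format a (kind, byte) symbol as Kx.y or Dx.y."""
    kind, byte = sym
    return f"{kind}{byte & 0x1F}.{byte >> 5}"


def classify_ts(symbols, expect_pad=True):
    """Return 'TS1', 'TS2', or a description of the first mismatch.

    symbols: 16 (kind, byte) tuples starting at COM.
    expect_pad: require PAD in the Link/Lane number fields (Polling states).
    """
    if len(symbols) != 16:
        return "need exactly 16 symbols"
    if symbols[0] != COM:
        return f"symbol 0 is {name(symbols[0])}, expected COM"
    if expect_pad:
        for i in (1, 2):
            if symbols[i] != PAD:
                return f"symbol {i} is {name(symbols[i])}, expected PAD"
    ident = symbols[6]
    if ident not in (TS1_ID, TS2_ID):
        return f"symbol 6 is {name(ident)}, expected D10.2 or D5.2"
    for i in range(7, 16):
        if symbols[i] != ident:
            return f"symbol {i} is {name(symbols[i])}, expected {name(ident)}"
    return "TS1" if ident == TS1_ID else "TS2"


# Lane 9 style sequence: N_FTS = D24.0, rate ID = D30.0, training control = D0.0.
lane9 = [COM, PAD, PAD, ("D", 0x18), ("D", 0x1E), ("D", 0x00)] + [TS2_ID] * 10
print(classify_ts(lane9))
```

Running the same check on a failing lane reports the index of the first symbol that deviates, which helps show whether the corruption always starts at the same position after COM/PAD.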

So it looks like:

  • COM-based symbol lock is working at least partially
  • but after COM/PAD, the ordered-set contents on some (seemingly random) lanes are corrupted before my soft IP can decode them correctly

To verify whether this was caused by my own logic, I captured the affected lanes directly in SignalTap using the first raw 10-bit RX data from the PIPE Direct IP (`ln*_pipe_direct_pipe_rxdata_o`), before any symbol lock/decoding stage in my soft IP. I searched for the COM symbol directly in this raw 10-bit stream and confirmed that the corruption is already present at the PIPE Direct IP output. So this does not appear to be caused by my combinational decode logic; the raw RX data delivered by the IP is already corrupted on those lanes.
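
For anyone who wants to repeat this check offline on an exported capture, here is a minimal Python sketch that searches the raw 10-bit words for COM at every bit offset. It assumes that bit 0 of each word is the first bit on the wire, so K28.5 appears as 0x17C (negative running disparity) or 0x283 (positive running disparity). That bit ordering is an assumption to confirm against the R-Tile user guide, and the function names are illustrative.

```python
# K28.5 (COM) 10-bit codes written in transmission order "abcdei fghj".
K28_5_RD_NEG = "0011111010"  # sent when running disparity is negative
K28_5_RD_POS = "1100000101"  # sent when running disparity is positive


def words_to_bits(words, width=10, lsb_first=True):
    """Flatten captured words into one bit string in the assumed wire order."""
    out = []
    for w in words:
        bits = format(w & ((1 << width) - 1), f"0{width}b")  # MSB-first text
        out.append(bits[::-1] if lsb_first else bits)
    return "".join(out)


def find_com_offsets(words, width=10, lsb_first=True):
    """Return the sorted bit offsets at which a K28.5 code starts."""
    stream = words_to_bits(words, width, lsb_first)
    hits = []
    for pattern in (K28_5_RD_NEG, K28_5_RD_POS):
        start = stream.find(pattern)
        while start != -1:
            hits.append(start)
            start = stream.find(pattern, start + 1)
    return sorted(hits)


def alignment_phases(offsets, width=10):
    """Set of bit phases (offset modulo symbol width) at which COM was seen."""
    return {off % width for off in offsets}


# Example: two COMs separated by a filler word, already symbol-aligned.
capture = [0x17C, 0x155, 0x283, 0x155]
print(find_com_offsets(capture), alignment_phases(find_com_offsets(capture)))
```

If a lane is stably aligned, all COM hits should share one phase. More than one phase within a single capture would point toward bit slips rather than bit errors in the payload symbols.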

What I already checked

I already checked the following items carefully:

  1. Gen1 rxdata interpretation
    • I only decode valid 10-bit portions for Gen1
    • I do not interpret the don't-care bits in rxdata[31:10] and rxdata[63:42]
  2. rxdatavalid qualification
    • TS decode / symbol shift only happens when rxdatavalid0/1 are valid
  3. Sampling clock
    • SignalTap capture is done in the corresponding lane RX clock domain
    • not with a shared TX/fabric clock
  4. Reset sequence
    • pld_pcs_rst_n_i release is gated after per-lane tx_transfer_en_o
    • I also reviewed cdrlock2data, reset_status_n, phystatus, powerdown sequencing
  5. Deskew-related status
    • active channels are detected
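
Items 1 and 2 above can be summarized as a small Python model of how I take symbols from one rxdata beat at Gen1. This is a sketch under my own assumptions: the valid 10-bit fields are rxdata[9:0] and rxdata[41:32], each gated by its own rxdatavalid bit, and the lower field is earlier in time. The function name is illustrative.

```python
def extract_gen1_symbols(rxdata, valid0, valid1):
    """Return the raw 10-bit symbols carried by one 64-bit rxdata beat.

    Bits [31:10] and [63:42] are treated as don't-care and never examined.
    """
    symbols = []
    if valid0:
        symbols.append(rxdata & 0x3FF)          # rxdata[9:0]
    if valid1:
        symbols.append((rxdata >> 32) & 0x3FF)  # rxdata[41:32]
    return symbols


# Don't-care bits set to all ones must not change the result.
beat = (0x3FFFFF << 42) | (0x283 << 32) | (0x3FFFFF << 10) | 0x17C
print([hex(s) for s in extract_gen1_symbols(beat, True, True)])
```

If the real interface orders the two fields differently, only the order of the two `append` calls changes; the masking stays the same.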

Current question

At this point, I suspect one of the following:

  • lane-specific analog/RX quality issue inside or before PIPE Direct output
  • lane-specific reset/power-up timing issue
  • internal alignment / deskew behavior that I am misunderstanding
  • some required PIPE Direct control/sideband setting that I am missing

What I would like to ask

  1. In PIPE Direct x16 Gen1, if one lane shows valid K28.5 / K23.7 but the following TS2 symbols are corrupted, what should I check first on the R-Tile side?
  2. Are there any lane-specific PMA / RX / PIPE Direct controls that should be reviewed for this symptom?
  3. Is there any recommended way to determine whether this is:
    • a true lane analog/RX problem,
    • a deskew/alignment issue,
    • or a reset/bring-up sequence issue?
  4. Are there any known recommendations for validating lane integrity directly at the PIPE Direct output during Polling.Configuration?

4 Replies

  • Wincent_Altera
    Regular Contributor

    Hi Jayden,

    Lane 9 appears to decode the TS2 sequence correctly. Lane 8 shows COM (K28.5) and PAD (K23.7) correctly, but the symbols after that are unstable / corrupted.
    >> I see you are using x16. Are all 16 lanes occupied?
    >> Is it only lane 8 and lane 9 that have the problem, with the other lanes behaving as you expect?
    >> Have you tested at Gen2, or is this issue seen only at Gen1?
     

    Which PIPE Direct flow are you trying to perform: the reset sequence or a speed change?
    If I understand your description correctly, you are using the reset sequence.
    If so, please check the signal sequence example at
    https://docs.altera.com/r/docs/683501/25.1.1/r-tile-avalon-streaming-ip-for-pci-express-user-guide/pipe-direct-reset-sequence
    Please ensure that the sequence is followed strictly.

    Based on my experience and understanding, once the system enters the Polling.Configuration substate, the transmitter stops sending TS1s and starts sending TS2s, still with PAD set for the Link and Lane numbers. The purpose of switching from TS1s to TS2s is to advertise to the link partner that this device is ready to proceed to the next state in the state machine. It is a handshake mechanism to ensure that both devices on the link proceed through the LTSSM together. Neither device can proceed to the next state until both devices are ready. The way they advertise that they are ready is by sending TS2 ordered sets. So once a device is both sending AND receiving TS2s, it knows it can proceed to the next state because both it and its link partner are ready.
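
As a rough illustration of the handshake described above, here is a simplified Python model of the exit condition from Polling.Configuration. It assumes the thresholds as I understand them from the PCIe Base Specification (eight consecutive TS2s received with PAD Link/Lane numbers, and 16 TS2s transmitted after the first TS2 is received). The class and method names are hypothetical, and timeouts and multi-lane rules are left out.

```python
class PollingConfigExit:
    """Simplified single-lane model of the Polling.Configuration exit check."""

    RX_TS2_NEEDED = 8    # consecutive TS2s (PAD link/lane) to receive
    TX_TS2_NEEDED = 16   # TS2s to transmit after the first TS2 is received

    def __init__(self):
        self.rx_run = 0           # current run of consecutive good TS2s
        self.rx_done = False      # latched once the run reaches the threshold
        self.seen_rx_ts2 = False  # set by the first good TS2 received
        self.tx_after_rx = 0      # TS2s transmitted since that first TS2

    def on_rx(self, is_ts2_pad):
        """Call once per received ordered set; False means anything else."""
        if is_ts2_pad:
            self.seen_rx_ts2 = True
            self.rx_run += 1
            if self.rx_run >= self.RX_TS2_NEEDED:
                self.rx_done = True
        else:
            self.rx_run = 0  # a corrupted set breaks the consecutive run

    def on_tx_ts2(self):
        """Call once per transmitted TS2."""
        if self.seen_rx_ts2:
            self.tx_after_rx += 1

    def may_exit(self):
        return self.rx_done and self.tx_after_rx >= self.TX_TS2_NEEDED


# A lane that corrupts every eighth ordered set never completes the run.
lane = PollingConfigExit()
for i in range(64):
    lane.on_rx(i % 8 != 7)
    lane.on_tx_ts2()
print(lane.may_exit())
```

In this model, periodic corruption of the symbols after COM/PAD is enough to keep a lane from ever satisfying the receive side of the handshake, even though the transmit count is reached quickly.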

    However, I am not sure what is happening in your system, so rechecking the reset sequence would be a good place for us to start.

    Regards,

    Wincent_Altera

    • Jayden
      New Contributor

      Hello Wincent,

      Thank you for the reply.

      Yes, I am using the x16 configuration, and all 16 lanes are occupied. After the reset release sequence, the link proceeds to Polling.Active and Polling.Configuration on all 16 lanes. However, the problem is that the lanes that receive consecutive TS1/TS2 correctly are not stable. The set of “good” lanes changes every time I reboot the server with the FPGA card installed. In other words, the behavior looks very random.

      So to answer your questions:

      • Yes, I am using all 16 lanes.
      • It is not only lane 8 and lane 9. The lanes that work and the lanes that fail change after each reboot.
      • At the moment, I am testing only in Gen1. This issue is currently being observed during Gen1 link training. After this is solved, my first goal is to bring the link to L0 and then speed up to Gen4.

      For the PIPE Direct configuration, I am using:

      • PIPE Direct 16-channel
      • 1x16: Octet 0 with 8 lanes, Octet 1 with 8 lanes
      • Gen4 configuration, currently down-training and debugging in Gen1

      Regarding your question about reset sequence or speed change:


      At the moment, I am focusing on the reset sequence path.

      I already built an RTL simulation environment and connected my soft IP in the same way as in hardware. In RTL simulation, the reset sequence itself appears to complete normally without any issue.

       

      However, in SignalTap, I sometimes see unrealistic behavior, for example pin_perst_n_o continuing to toggle, possibly due to setup timing violations around the TX clock domain. Because of that, it is difficult to trust SignalTap captures for validating the reset sequence directly.

       

      Do you have any recommendation for how to debug or validate the reset sequence in this situation, when SignalTap itself may be showing unreliable behavior due to timing issues?

      I also have one additional question.

      In the waveform shown in the following link:
      PIPE Direct Reset Sequence

      this is a Gen1 reset sequence, but even in Gen1, both ln0_pipe_direct_txdatavalid0_i and ln0_pipe_direct_txdatavalid1_i appear to go high.

      However, in Figure 48, the PIPE Direct TX Data Path for Gen1 seems to show a format where only txdatavalid0 toggles.

      Because of this, I am not fully sure which behavior should be considered correct during the reset sequence for Gen1.

      Could you please clarify which one should be followed during Gen1 reset sequence?
      Should I treat the reset-sequence waveform as the expected behavior, or should I follow the interpretation shown in Figure 48 for Gen1 TX datapath formatting?

      Thank you again for your help.

      Best regards,
      Jayden

      • Wincent_Altera
        Regular Contributor

        Hi Jayden,

        Referring to Figure 49:

        pin_perst should be asserted high at the beginning of the reset sequence.
        Can you please share your timing report? I just want to see whether the violation is real or not (a screenshot will do).

        Regarding your question on whether txdatavalid should toggle or stay continuously high, let me double-check this on my side and get back to you shortly.

        Regards,
        Wincent_Altera