Forum Discussion

ChunyeHong
New Contributor
5 years ago

SerialLite II Issue

We have 4 copies of the SerialLite II IP core in our Stratix IV GX device.

It worked fine in stream mode. After we changed it to packet mode, a sporadic problem started to show up. When it occurs, the word order within the packet is scrambled, and sometimes the downstream device receives two EOPs for every SOP.

It does not happen all the time, and it happens only for some particular compilations. For example, after one compilation the problem could jump from instance one to instance three, or it could disappear altogether. Using SignalTap, we have verified that the packets going into SerialLite are fine when the problem happens. My guess is that this is a classic case of poor design practice: a timing failure or a clock domain crossing problem. Since the IP is encrypted, there is no way for me to look into it and debug. Looking in SignalTap, it seems some odd 8x83 FIFOs are used, and I suspect that is where the problem is. Again, I can't look further without the source code. One more piece of information: the first run is always clean after a reset is issued to SerialLite.

How do you suggest I go about troubleshooting this?

Thank you

10 Replies

  • CheepinC_altera
    Regular Contributor

    Hi,


    As I understand it, you are observing an intermittent issue with the SLII IP in which the RX receives two EOPs for every SOP after you switched from stream mode to packet mode. This seems to depend on the Quartus build: some builds exhibit the issue and some do not. The failing instance also seems to vary randomly from build to build.


    Based on this observation, this is trending towards a potential timing problem. To narrow it down further, would you mind running a ModelSim simulation of your design to rule out a functional issue before debugging timing?


    Also, I recommend creating a simple test design, i.e. one or two instances with a single lane, to facilitate debugging.


    Do you observe any timing violation or anomaly in the failing compilation?


    Please let me know if there is any concern. Thank you.



    Best regards,

    Chee Pin



    • ChunyeHong
      New Contributor

      First, in my experience, simulation is a good tool for catching logic issues but a very poor way to debug timing issues, especially if the problem is between two different clocks. However, I have run simulation and did not see any problem. Simulating a high-speed serial link for long enough for the problem to show up is not very realistic. Sometimes simulation takes longer than a compile to show meaningful results. It is actually easier to put in SignalTap to debug the problem.

      Your suggestion to simplify the design does not work either. As I said, every compilation of the same design gives different results, and in most cases the problem goes away, so a simplified design won't reproduce it. The problem also only shows up in our system, which comprises multiple boards, FPGAs and software, when running certain special cases. A simplified design won't trigger the issue.

      In the compilations that reproduced the problem, there are no timing violations and no unconstrained clocks, and everything looks normal.

      The biggest obstacle to debugging is that the source code is encrypted. There is no way for me to debug it. I can put some signals on SignalTap, but I don't know the logic around those signals. And again, what makes it harder is that most times, after I change the SignalTap setup, the problem goes away.

      Thanks,

  • CheepinC_altera
    Regular Contributor

    Hi,


    Thanks for your update. You are right, the main purpose of running the simulation here is to help isolate a potential functional issue and narrow down the problem. Glad to hear that there is no issue in simulation.


    Thanks for sharing that the problem goes away after you change the SignalTap logic. This is trending toward a potential timing issue. I would just like to clarify the following with you:


    1. When you mention two EOPs for every SOP, I believe you are observing this in SignalTap. Is my understanding correct?


    2. When the issue occurs in SignalTap for a specific lane, do you observe a failure on that SL lane as well? This is to isolate whether it is a SignalTap-only issue or an SL lane issue, for example a timing issue that causes SignalTap sampling errors.


    Please let me know if there is any concern. Thank you.



    Best regards,

    Chee Pin


    • ChunyeHong
      New Contributor

      Chee Pin,

      Thanks for the reply. I will answer your questions one by one.

      1. Yes, but in the downstream device. Let's call the device that sources the traffic, and that we suspect is causing the problem, FPGA1. The downstream device that receives all the packets over SerialLite is FPGA2. We put SignalTap in FPGA2 at the Rx Atlantic interface coming out of SerialLite and saw that the words are out of order within the packets and that SOP and EOP do not match each other. We also put SignalTap in FPGA1 at the Tx Atlantic interface input right before SerialLite and did not see any problem there, even while problems were observed in SignalTap in FPGA2. We believe the problem is in the FPGA1 Tx for two reasons. First, the problem goes away for the first run after we reset SerialLite in FPGA1, and FPGA1 only. Second, the problem comes and goes only when we recompile FPGA1, not FPGA2. We recompiled FPGA2 many times and nothing changed.

      2. This is a bonded 4-lane link, and the overall link status signal "stat_rr_link" never goes down even when the problem shows up. We never put SignalTap at the lane level, since it would not be of much use to us. Again, SerialLite is encrypted and is a black box to us. We can pull signals out in SignalTap, but we don't know what we are looking at without the source code. I would assume byte alignment, word alignment and lane bonding all work, since the words are out of order on 64-bit word boundaries, not at the level of individual bits or bytes. Plus, the link never goes down. In SignalTap on both FPGAs, we use the clock at the Atlantic interface to sample the data, so the signals should be captured on the right clock. There is no setup violation in the timing report and no unconstrained clock.
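
For anyone who wants to screen exported captures for the SOP/EOP mismatch described in point 1, here is a minimal Python sketch. It is not part of the original debugging and makes several assumptions: the capture has already been parsed into one record per clock, and the field names `valid`, `sop`, `eop` and `data` are hypothetical placeholders, not the actual Atlantic signal names or a SignalTap export format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Sample:
    """One captured clock cycle (field names are illustrative only)."""
    valid: bool
    sop: bool
    eop: bool
    data: int


def check_framing(samples: Iterable[Sample]) -> List[Tuple[int, str]]:
    """Return (sample index, message) for every SOP/EOP framing violation."""
    errors: List[Tuple[int, str]] = []
    in_packet = False
    for i, s in enumerate(samples):
        if not s.valid:
            continue  # ignore cycles that carry no data
        if s.sop:
            if in_packet:
                errors.append((i, "SOP while a packet is already open"))
            in_packet = True
        elif not in_packet:
            # A word (possibly flagged EOP) arrived with no packet open.
            errors.append((i, "EOP without SOP" if s.eop else "data outside packet"))
        if s.eop:
            in_packet = False
    return errors


if __name__ == "__main__":
    # One SOP followed by two EOPs, similar to the symptom described above.
    trace = [
        Sample(True, True, False, 0),
        Sample(True, False, False, 1),
        Sample(True, False, True, 2),
        Sample(True, False, False, 3),
        Sample(True, False, True, 4),
    ]
    for index, message in check_framing(trace):
        print(index, message)
```

Running the same check on a Tx-side capture from FPGA1 and an Rx-side capture from FPGA2 would give a repeatable pass/fail criterion per build instead of inspecting waveforms by eye.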

      Thanks

      • ChunyeHong
        New Contributor

        Chee Pin,

        If you would like, I can send you some SignalTap waveforms, but not on this forum, because of IP and security concerns.

        Thanks,

  • CheepinC_altera
    Regular Contributor

    Hi,


    Thanks for your update. I understand the effort required to create a test design to help narrow down the potential root causes. With the whole system hooked up, it is rather difficult to debug and isolate the problem. This is why I am suggesting a loopback at FPGA1, where you suspect the issue originates. To avoid affecting other components in the system, would it be possible for you to create a simple duplex SL II design in FPGA1 and loop it back to see whether the issue appears? If we can replicate it this way, debugging becomes much easier, since we will have narrowed the problem down to FPGA1 and a single SLII IP core. I understand this might take some effort.


    On the other hand, to avoid any further delay, I suggest we engage our timing team to look into this and advise whether there is any anomaly from a timing perspective. Since I am unable to duplicate the case here, would you mind opening a new Forum case with a title specific to timing, e.g. "Timing debugging required on Quartus build dependent packet issue"? In it, you can briefly describe your observation that the behavior varies from build to build and mention that timing analysis debugging is required. Please then let me know the case so that I can help route it to our timing team.
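
As an illustration of what such a loopback test could check, the sketch below models the idea in Python: every 64-bit word carries its packet id and its position within the packet, so any reordering on the receive side is detectable from the data alone. The function names and the 32/32-bit field split are assumptions made for this example; the real generator and checker would be written in HDL around the SL II core.

```python
from typing import List


def make_packet(packet_id: int, length: int) -> List[int]:
    """Build 64-bit test words: upper 32 bits = packet id, lower 32 bits = word index."""
    return [((packet_id & 0xFFFFFFFF) << 32) | i for i in range(length)]


def find_misordered(words: List[int], packet_id: int) -> List[int]:
    """Return positions whose word does not carry the expected id and index."""
    bad = []
    for pos, word in enumerate(words):
        if (word >> 32) != packet_id or (word & 0xFFFFFFFF) != pos:
            bad.append(pos)
    return bad


if __name__ == "__main__":
    sent = make_packet(7, 4)
    # Simulate two words swapped on a 64-bit boundary.
    received = [sent[0], sent[2], sent[1], sent[3]]
    print(find_misordered(sent, 7))      # clean packet
    print(find_misordered(received, 7))  # positions of the swapped words
```

Self-describing payloads like this mean the checker needs no reference copy of the transmitted data, which keeps the added logic small and less likely to perturb the build.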


    Please let me know if there is any concern. Thank you.



    Best regards,

    Chee Pin


  • CheepinC_altera
    Regular Contributor

    Hi,


    I would just like to follow up: have you had a chance to open a new case to request the timing team's assistance? Please feel free to let me know the case so that I can help expedite the routing. Thank you.


  • CheepinC_altera
    Regular Contributor

    We shall continue to support you in the new case. This thread will be transitioned to community support. If you have a new question, feel free to open a new thread to get the support from Intel experts. Otherwise, the community users will continue to help you on this thread. Thank you.