Forum Discussion

Altera_Forum
Honored Contributor
21 years ago

NIOS SDRAM performance

I have measured the speed of memcpy-style copies on NIOS2. I use optimized code consisting of four consecutive read accesses followed by four consecutive write accesses. Code snippet:

while (i--) {
    /* four consecutive uncached reads */
    d0 = __builtin_ldwio(pfrom);
    d1 = __builtin_ldwio(pfrom+1);
    d2 = __builtin_ldwio(pfrom+2);
    d3 = __builtin_ldwio(pfrom+3);
    pfrom += 4;

    /* four consecutive uncached writes */
    __builtin_stwio(pto,   d0);
    __builtin_stwio(pto+1, d1);
    __builtin_stwio(pto+2, d2);
    __builtin_stwio(pto+3, d3);
    pto += 4;
}

Compiling this with -O3 yields quite optimal code, with four reads into different registers and four writes:

    movhi    r7, %hiadj(1048576)   #    pfrom
    addi    r7, r7, %lo(1048576)   #    pfrom
    movhi    r6, %hiadj(1052672)   #    pto
    addi    r6, r6, %lo(1052672)   #    pto
    movi    r8, 15   #    i
.L25:
    ldwio    r3, 0(r7)   #    d0, * pfrom
    ldwio    r4, 4(r7)   #    d1
    ldwio    r5, 8(r7)   #    d2
    ldwio    r9, 12(r7)   #    d3
    addi    r7, r7, 16   #    pfrom,  pfrom
    stwio    r3, 0(r6)   #    d0, * pto
    stwio    r4, 4(r6)   #    d1
    stwio    r5, 8(r6)   #    d2
    stwio    r9, 12(r6)   #    d3
    addi    r8, r8, -1   #    i,  i
    cmpnei    r3, r8, -1   #    i
    addi    r6, r6, 16   #    pto,  pto
    bne    r3, zero, .L25
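For anyone wanting to reproduce the measurement, a self-contained version of the loop might look like the sketch below. The function name `copy_words_uncached`, the `LDW_IO`/`STW_IO` macros, and the host fallback are assumptions added for illustration and are not from the original post. The `__nios2__` check is assumed to be the macro predefined by the Nios II GCC port.

```c
#include <stddef.h>
#include <stdint.h>

/* On Nios II GCC, use the cache-bypassing builtins from the original post.
   Elsewhere (assumption: a host build for testing), use volatile accesses. */
#if defined(__nios2__)
#define LDW_IO(p)    ((uint32_t)__builtin_ldwio((const volatile void *)(p)))
#define STW_IO(p, v) __builtin_stwio((volatile void *)(p), (int)(v))
#else
#define LDW_IO(p)    (*(const volatile uint32_t *)(p))
#define STW_IO(p, v) (*(volatile uint32_t *)(p) = (v))
#endif

/* Copy `blocks` groups of four 32-bit words: four loads, then four stores.
   Hypothetical helper name; mirrors the loop structure shown above. */
static void copy_words_uncached(uint32_t *pto, const uint32_t *pfrom,
                                size_t blocks)
{
    while (blocks--) {
        uint32_t d0 = LDW_IO(pfrom);
        uint32_t d1 = LDW_IO(pfrom + 1);
        uint32_t d2 = LDW_IO(pfrom + 2);
        uint32_t d3 = LDW_IO(pfrom + 3);
        pfrom += 4;

        STW_IO(pto,     d0);
        STW_IO(pto + 1, d1);
        STW_IO(pto + 2, d2);
        STW_IO(pto + 3, d3);
        pto += 4;
    }
}
```

Calling `copy_words_uncached(dst, src, 16)` would copy 64 words, which matches the 16 loop iterations implied by the assembly listing.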

The transfer rates seemed too slow, so I did further investigations. It turns out that NIOS2 has a very poor SDRAM read performance because it does not perform consecutive SDRAM read accesses (but it does for write accesses).

Here's a link to an oscilloscope image of a READ access: oscilloscope: sdram read (http://dziegel.free.fr/nios2/sdram_read.jpg)

However, write access seems to be fine: oscilloscope: sdram write (http://dziegel.free.fr/nios2/sdram_write.jpg)

Note: As you can see in the oscilloscope images, the accesses do not cross an SDRAM row boundary (no RAS cycle between reads).

Tests were performed on a NIOS 1C20 Development Kit, Project: NIOS2 full_featured.

So my questions are:

- What are the reasons for the slow read performance? IMHO, the read requests could be executed at the same speed as the write requests.

- Will this behaviour be changed / fixed?

Thank you,

Dirk

19 Replies

  • Altera_Forum

    Hi Dirk,

    I was trying to think why I didn't suspect the SDRAM/SDRAM controller and I remembered why.

    When I went to a NiosI class the instructor said the new SDRAM controller provided one-clock per word performance. I asked him why cache was important, then. (One clock is one clock, right?) He said that was a good question and took my name and number to get back to me. (Still no call.)

    I also found the shiny brochure for one of my 3 devkit boards and it says this:

    "Enhanced SDRAM Controller"

    "The NIOS SDRAM controller has been enhanced to support pipelined data transactions; it provides single-cycle access to low cost single data rate (SDR) SDRAM devices"

    I'm not as hardware centric as I would like to be, but these two statements tell me not to expect 12 cycle access. Perhaps it's a matter of semantics and the fault is my misunderstanding of the exact meaning of the terms?

    Still the bottom line is how do we get decent SDRAM performance? New version of SDRAM controller? Secret .ptf settings? 3rd party controller?

    Any ideas?

    Thanks,

    Ken
  • Altera_Forum

    Hello Ken,

    From my point of view there is nothing we can do. I think the SDRAM controller is fine; it has a small pipeline to store requests. The main problem is the NIOS data master: as Jesse said, it's not latency aware. Look at the Avalon spec: this means NIOS cannot "enqueue" multiple read requests into the SDRAM controller pipeline, it has to wait until a request has been processed. I still don't understand why this takes 12 cycles; maybe there is more overhead involved in the NIOS pipeline (flush??? hopefully not).
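The difference between a blocking master and a latency-aware one can be put into numbers with a very idealized model. The sketch below assumes a fixed read latency and a controller that can accept one new request per cycle. Both function names are hypothetical, and real hardware will deviate from this.

```c
#include <assert.h>
#include <stdint.h>

/* Blocking master: every read waits out the full latency
   before the next read is issued. */
static uint32_t cycles_blocking(uint32_t reads, uint32_t latency)
{
    return reads * latency;
}

/* Latency-aware (pipelined) master, idealized: one new read is issued
   per cycle, so only the first read pays the full latency. */
static uint32_t cycles_pipelined(uint32_t reads, uint32_t latency)
{
    assert(latency >= 1);
    return reads == 0 ? 0 : latency + (reads - 1);
}
```

With the 12 cycles per read reported in this thread, four reads cost 48 cycles in the blocking model and 15 in the pipelined one.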

    But I can imagine that adding latency awareness to NIOS is a very intrusive change to the processor design. This means the NIOS needs to be able to analyse dependencies between instructions ("this ldwio instruction does not depend on the previous ldwio, so it can safely be executed"). This also implies the NIOS pipeline stage "memory" must be able to hold multiple queued requests and execute them as the SDRAM delivers data (the memory pipeline stage can be active while the rest is stalled). I can imagine that this is expensive in both logic elements (config option?) and design "intrusion", since Altera would have to partly redesign the pipeline.

    This is my guess about why Altera is so "quiet" about this issue. But, these are just the things I can imagine about the reasons for the performance lack, it may be something else as well. I hope to get the confirmation from Altera about this some day.

    It's reasonable to optimize the CPU to be small and work well when things are cached. But IMHO applications with custom components that share RAM with the CPU are not a corner case for this FPGA system, so this should be a config option. You can't get that capability as elegantly anywhere else for this price and effort - the only processor I found that is able to share SDRAM out of the box is the IBM PowerPC with its external bus master feature. But even the smallest PowerPC (133MHz) was too powerful and expensive for my application. And IBM targets >500MHz, not <100MHz in the future.

    I have only one "bad hack"™ idea that could do something about it: use knowledge about the data cache when copying. Using normal cached read instructions, but invalidating the cache line(s) before reading, may speed things up. The cache is AFAIK latency aware, so it can quickly retrieve the data from SDRAM, and NIOS can get it at full speed from the cache. However, I won't try that in the near future, I am too busy developing my application.

    Dirk
  • Altera_Forum

    Hi guys,

    Sorry, I don't mean to ignore the conversation here or keep quiet about it -- we are rapidly approaching our next Nios/Quartus/SOPC Builder release and have the associated time crunch to deal with. I will try to post something more useful early next week.

    There are several recently-introduced but not-yet-documented Avalon features I want to discuss. This won't solve the immediate problem that Dirk presents (successive loads from SDRAM where the cache misses every time), but will be of assistance in complex (multi-master) systems where getting the best memory bandwidth is key. Additionally, our aforementioned next release has several more features (and documentation!) that will speed things up further (sorry, latency awareness on the CPU data master isn't one of them... but as I say, we will be giving this a serious look).
  • Altera_Forum

    Hi Guys,

    I think that I know what your problem is here. Do you have your code stored in the same SDRAM as the data? If so, the SDRAM controller opens the bank and reads the data, then opens another bank and reads the next bit of code. You can fix this in several ways:

    1. Put your code somewhere else.

    2. If you have an instruction cache and your code is in a loop, this should be OK the nth time through the loop (where n != 1).

    3. The SDRAM can have multiple banks open; if the data and code are in different banks you should still get fast performance. Unfortunately the SDRAM controller from Altera does not support this and will always close the bank rather than leaving it open when the new address is in another bank. The SDRAM controller needs to be quite a bit more complex to take care of this. We wrote one, but not for Avalon. You could write your own; it took us about 2 months to do this. I can't distribute ours as it is the property of my old company.

    I could be wrong about this being the cause of your 12 cycles, but opening a bank and doing a read takes about 5 cycles, and the next read should take 1 cycle. Changing banks (if the bank is open) takes, I think, 2 cycles, i.e. a 3-cycle saving per read: 12 cycles reduced to 6 (3 for the data read and 3 for the next instruction read).
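To see what these cycle counts would mean for throughput, they can be converted into a read bandwidth. The thread does not state the system clock, so the 50 MHz used in the note below is purely an assumption, and `read_bandwidth_mb_s` is a hypothetical helper name.

```c
#include <assert.h>

/* Read bandwidth in MB/s (10^6 bytes per second) for 32-bit words,
   given the clock in MHz and the average cycles spent per word read. */
static double read_bandwidth_mb_s(double clock_mhz, unsigned cycles_per_word)
{
    assert(cycles_per_word > 0);
    return clock_mhz * 4.0 / cycles_per_word;  /* 4 bytes per word */
}
```

Under the assumed 50 MHz clock, 12 cycles per read works out to roughly 16.7 MB/s and 6 cycles to roughly 33.3 MB/s, while a true single-cycle access would give 200 MB/s.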

    Good Luck.
  • Altera_Forum

    If I read Jesse right, the problem is in Nios' Data Master. It is not "Latency Aware", meaning it does not monitor 'readdatavalid'.

    Soooo, with only static timing at its disposal, the Data Master must use the same absolute worst case timing for each and every read.

    Privately I've been shown that read performance is still a dismal 5-6 clocks from initial access to 'readdatavalid' even when accessing the same row back to back.

    I'd really like someone to explain the "single cycle access" statements that are given in Nios classes and docs. Better yet, make the statements come true :)

    Ken
  • Altera_Forum

    Hello to Ken and the other guys discussing this topic.

    Here is the link to the image that Ken mentioned: http://www.entner-electronics.com/images/n...explanation.jpg (http://www.entner-electronics.com/images/nios2sdram_with_explanation.jpg)

    As you can see, there is also a delay when writing to the SDRAM: Altera's SDRAM-controller has 2 write-buffers. Therefore the first 2 writes operate at full speed, that is 2 cycles per write. Then wait-states are inserted until one of the two write-buffers becomes free for the third write, etc. If you had 8 back-to-back writes instead of 4, you could also see this on the SDRAM-signals.
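The write-buffer behaviour described above can be illustrated with a small cycle-count model. This is a sketch under stated assumptions, not a description of Altera's actual controller: buffered writes are assumed to drain to SDRAM one at a time, each taking a fixed `drain` number of cycles (a hypothetical parameter), and the master needs `accept` cycles per write whenever a buffer is free. The function name and `MAX_DEPTH` are also made up for this example.

```c
#include <assert.h>

#define MAX_DEPTH 8u  /* arbitrary upper bound for this sketch */

/* Returns the cycle at which the master has handed over the last of
   `writes` back-to-back writes to a controller with `depth` buffers. */
static unsigned write_burst_cycles(unsigned writes, unsigned depth,
                                   unsigned accept, unsigned drain)
{
    unsigned finish[MAX_DEPTH] = {0}; /* drain-complete time per buffer slot */
    unsigned now = 0;                 /* when the master can start a write */
    unsigned last_finish = 0;         /* when the previous write has drained */

    assert(depth >= 1 && depth <= MAX_DEPTH);

    for (unsigned k = 0; k < writes; k++) {
        unsigned slot = k % depth;
        /* Slot is being reused: stall until its previous write has drained. */
        if (k >= depth && finish[slot] > now)
            now = finish[slot];
        now += accept;                /* controller accepts this write */
        /* Writes drain strictly in order, one after another. */
        unsigned drain_start = now > last_finish ? now : last_finish;
        last_finish = drain_start + drain;
        finish[slot] = last_finish;
    }
    return now;
}
```

With 2 buffers, 2 cycles per accepted write, and an assumed drain time of 6 cycles, the first two writes are accepted by cycle 4, but the third has to wait for the first buffer to empty, so four writes take 16 cycles instead of 8.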

    When reading, things become worse: Here the latency from the SDRAM and from the SDRAM-controller take full effect. Also the nios2-core itself requires several cycles, therefore even with internal SRAM you have about 4 cycles per read (I looked at it, but do not remember the exact number, maybe it was 3, more likely 5 or 6...).

    SDRAM-controllers are a topic I could discuss for hours, so I will try to keep it short (many things were already mentioned before):

    - The Altera controller always keeps only one bank open (you can see in the diagram that the writes are in bank 1 and the reads in bank 0, yet they get precharged anyway). This is very conservative; at least it could have activated bank 1 before precharging bank 0. On the other hand: what do these 3 cycles help when it needs about 50 for reading 4 words...

    - The controller has 2 write buffers, which will help a lot in many applications.

    - The 11 or 12 cycles per read are with the program running from ANOTHER memory or cache.

    - I have not checked it, but I suppose that reading the program memory is much more efficient (and more important in most cases).

    - Making the data-master latency-aware would be tough: it would need to guess what data the program will read next and preload it into a buffer / small cache.

    Do not forget that Altera cannot have only performance in mind, but also LC-count. A design that is very fast but requires e.g. 6.000 LCs would not help much either. We are talking about a Nios II with about 2.000 LCs (core + sdram), not about an Athlon 64 with I don't know how many million gates. Somewhere there will be performance bottlenecks.

    What can we do?

    - Increase the clock-rate

    - Use DMA

    - Solve the specific problem with own logic (nice, we have a FPGA...)

    I will most likely design an SDRAM- and a DDR-II-controller with an interface for very fast video-transfers (or other streaming things) and a Nios-II-avalon-interface within the next few months. But I do not think that I will address this specific issue, as I will use the "fast-video-interface" for tasks that require maximum performance. If there is interest, I could also offer it as an IP for Nios II (but not for free ;-).

    Regards

    Thomas

    www.entner-electronics.com (http://www.entner-electronics.com)
  • Altera_Forum

    I am sorry, no, I have not forgotten about this. I have one other piece of "homework" that I have to write up at the moment (for my real job :) ) and then I can get to work on a screed here.

    Topics to be covered are: why the data master is not latency aware, how to improve SDRAM performance in the case of multiple masters accessing it simultaneously, using the DMA controller in an efficient manner to (hopefully) alleviate part of the original poster's pain, and as much of a sneak preview as I can give without getting in trouble about new features that are coming out in the next Quartus/SOPC Builder/Nios release, which we are finishing up now and which will be available in the coming weeks.