I have measured the speed of memcpy on NIOS2. I use optimized code consisting of four consecutive read accesses and four consecutive write accesses. Code snippet:

    while (i--)
    {
        d0 = __builtin_ldwio(pfrom);
        d1 = __builtin_ldwio(pfrom+1);
        d2 = __builtin_ldwio(pfrom+2);
        d3 = __builtin_ldwio(pfrom+3);
        pfrom+=4;
        __builtin_stwio(pto, d0);
        __builtin_stwio(pto+1, d1);
        __builtin_stwio(pto+2, d2);
        __builtin_stwio(pto+3, d3);
        pto+=4;
    }

Compiling this with -O3 yields quite optimal code, with four reads to different registers and four writes:

        movhi   r7, %hiadj(1048576)     # pfrom
        addi    r7, r7, %lo(1048576)    # pfrom
        movhi   r6, %hiadj(1052672)     # pto
        addi    r6, r6, %lo(1052672)    # pto
        movi    r8, 15                  # i
    .L25:
        ldwio   r3, 0(r7)               # d0, * pfrom
        ldwio   r4, 4(r7)               # d1
        ldwio   r5, 8(r7)               # d2
        ldwio   r9, 12(r7)              # d3
        addi    r7, r7, 16              # pfrom, pfrom
        stwio   r3, 0(r6)               # d0, * pto
        stwio   r4, 4(r6)               # d1
        stwio   r5, 8(r6)               # d2
        stwio   r9, 12(r6)              # d3
        addi    r8, r8, -1              # i, i
        cmpnei  r3, r8, -1              # i
        addi    r6, r6, 16              # pto, pto
        bne     r3, zero, .L25

The transfer rates seemed too slow, so I investigated further. It turns out that NIOS2 has very poor SDRAM read performance because it does not perform consecutive SDRAM read accesses (but it does for write accesses). Here's a link to an oscilloscope image of a READ access: oscilloscope: sdram read (http://dziegel.free.fr/nios2/sdram_read.jpg)

However, write access seems to be fine: oscilloscope: sdram write (http://dziegel.free.fr/nios2/sdram_write.jpg)

Note: As you can see in the oscilloscope images, the accesses do not cross an SDRAM row (no RAS cycle between reads). Tests were performed on a NIOS 1C20 Development Kit, project: NIOS2 full_featured.

So my questions are:
- What are the reasons for the slow read performance? IMHO, the read requests could be executed at the same speed as the write requests.
- Will this behaviour be changed / fixed?

Thank you,
Dirk

Dirk,

Thanks for the good info. I was planning on writing the exact code you did for moving/copying SDRAM and I just assumed it would have good performance due to back-to-back reads. Looking at the datasheet, a read should take the number of clocks you have set for CAS latency in most situations (certainly for back-to-back reads on the same row). I count about 12 clocks in your trace!

Since SDRAM is so popular for NIOS/SOPC systems this would be an excellent place to focus efforts to increase performance. It would be nice if someone from Altera could at least comment on this topic. If Altera has no plans to improve this I would be willing to fund the improvements if anyone knows how to modify/replace the SDRAM controller (assuming it's legal to modify the SDRAM controller). Anyone know of a better SDRAM controller that is SOPC Builder ready?

Ken
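For a sense of scale, the cycle counts above can be turned into a rough read bandwidth. This is only a back-of-the-envelope sketch: read_bandwidth is a hypothetical helper (not part of any Altera library), and the system clock is a parameter because the thread does not state it.

```c
#include <stdint.h>

/* Bytes per second moved by back-to-back 32-bit reads, given the system
   clock and the number of clock cycles each read occupies.
   Hypothetical helper for illustration only. */
uint64_t read_bandwidth(uint64_t clock_hz, uint64_t cycles_per_read)
{
    const uint64_t bytes_per_read = 4u; /* one 32-bit word per ldwio */
    return clock_hz * bytes_per_read / cycles_per_read;
}
```

Assuming a 50 MHz clock (an assumption, not a figure from the thread), read_bandwidth(50000000, 12) comes to roughly 16.7 MB/s, whereas a hypothetical 3-cycle read would give about 66.7 MB/s.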

Ken and Altera,

I guess it's not the SDRAM controller's fault; this comes from NIOS :-( . A copy using DMA does not show this behaviour: I see consecutive reads (sorry, no oscilloscope image at hand). Maybe the NIOS data master implementation is suboptimal... I'd really be interested in the technical reason for these delays ;-)

Dirk

Dirk,

I've definitely found the DMA to be the ultimate data mover as well. While this is OK for some things, you can't, for instance, DMA directly into registers when you need to manipulate data. I guess a workaround would be to DMA small work packets into a small on-chip RAM, sort of like a manual cache, unless LDW on on-chip RAM has the same problem as it does when reading SDRAM. Have you looked into LD and ST to and from on-chip memory?

Ken
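The "manual cache" idea could look roughly like the sketch below. Everything here is hypothetical: onchip_buf stands in for a small on-chip RAM, PACKET_WORDS is an arbitrary packet size, and dma_copy is a placeholder that uses memcpy only so the sketch runs on a host; on the target it would program the DMA controller and wait for completion.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PACKET_WORDS 16u /* hypothetical work-packet size */

/* Stand-in for a small on-chip RAM; on real hardware the linker would
   place this buffer in on-chip memory. */
static uint32_t onchip_buf[PACKET_WORDS];

/* Placeholder for a DMA transfer (ASSUMPTION: memcpy used for the host). */
static void dma_copy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Sum an array that lives in slow-to-read SDRAM by staging it through
   the on-chip buffer one packet at a time. */
uint32_t sum_staged(const uint32_t *sdram, size_t words)
{
    uint32_t sum = 0;
    while (words > 0) {
        size_t n = words < PACKET_WORDS ? words : PACKET_WORDS;
        dma_copy(onchip_buf, sdram, n * sizeof(uint32_t));
        for (size_t i = 0; i < n; i++)
            sum += onchip_buf[i]; /* CPU only touches on-chip memory */
        sdram += n;
        words -= n;
    }
    return sum;
}
```

Whether this pays off depends on the DMA setup cost relative to the packet size, and on how fast the CPU can actually read the on-chip RAM.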

Ken,

I didn't look at LD and ST to on-chip memory. I need memcpy performance by the processor in my app: I often copy small amounts of data (~20 bytes), so the setup of a DMA is no longer negligible compared to the copy duration. And the SDRAM is accessed by custom components, so I really need the data in RAM (cache bypass!). What also worries me is that this behaviour may affect performance in general, since all consecutive data reads (if you work on arrays or so) are affected. I'd really like to see the NIOS data master fixed :-D ...

Dirk

Hi Dirk,

What you're seeing is the result of the Nios data master not being 'latency aware' (the instruction master is, and this allows relatively speedy instruction fetch even with a cache miss). Both master ports on the DMA controller are, and that is why Ken sees the performance he does.

In a nutshell, Nios II was really designed to be as simple (small/fast) as possible and deliver best performance when things are cached. However, you raise a valid point with respect to more complex systems that have custom logic or other processors sharing memory, as such things cannot be cached. I'll have a chat with our CPU expert to see what the penalty for adding latency awareness to the data master would be.

In the meantime I have to second the opinions above for either using DMA (which sounds like something you don't want to do), or dedicating one or more small on-chip RAMs to your high-speed buffers. The on-chip memories can also be dual-ported, further enhancing performance.

PS: Latency aware means that an Avalon master accepts the 'readdatavalid' signal, rather than merely the 'waitrequest' signal as all masters must do.
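The difference between the two kinds of master can be illustrated with an idealized cycle count. This is a toy model, not a description of the actual Nios II or Avalon implementation: it assumes every read has the same fixed latency, that a latency-aware master can issue one new address per cycle, and it ignores refresh, row changes and CPU pipeline overhead. Both function names are hypothetical.

```c
/* Master that only honours 'waitrequest': it cannot issue the next read
   until the data of the previous one has arrived. */
unsigned blocking_read_cycles(unsigned n_reads, unsigned latency)
{
    return n_reads * latency;
}

/* Latency-aware master ('readdatavalid'): addresses go out back to back,
   so only the first read pays the full latency. */
unsigned pipelined_read_cycles(unsigned n_reads, unsigned latency)
{
    if (n_reads == 0)
        return 0;
    return latency + (n_reads - 1);
}
```

Plugging in the roughly 12 cycles per read counted from the oscilloscope trace, four reads cost 48 cycles in the blocking model against 15 in the pipelined one. The real pipelined figure would depend on the controller's actual latency, which the thread does not pin down.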

NIOS SDRAM performance | Altera Community

19 Replies

Altera_Forum
Honored Contributor
21 years ago
Hi Dirk,

I was trying to think why I didn't suspect the SDRAM/SDRAM controller and I remembered why.

When I went to a Nios I class the instructor said the new SDRAM controller provided one-clock-per-word performance. I asked him why cache was important, then (one clock is one clock, right?). He said that was a good question and took my name and number to get back to me (still no call).

I also found the shiny brochure for one of my 3 devkit boards and it says this:

"Enhanced SDRAM Controller"

"The NIOS SDRAM controller has been enhanced to support pipelined data transactions; it provides single-cycle access to low cost single data rate (SDR) SDRAM devices"

I'm not as hardware centric as I would like to be, but these two statements tell me not to expect 12 cycle access. Perhaps it's a matter of semantics and the fault is my misunderstanding of the exact meaning of the terms?

Still the bottom line is how do we get decent SDRAM performance? New version of SDRAM controller? Secret .ptf settings? 3rd party controller?

Any ideas?

Thanks,
Ken
Altera_Forum
Honored Contributor
21 years ago
Hello Ken,

from my point of view there is nothing we can do. I think the SDRAM controller is fine; it has a small pipeline to store requests. The main problem is the NIOS data master: as Jesse said, it's not latency aware. Look at the Avalon spec: this means NIOS cannot "enqueue" multiple read requests into the SDRAM controller pipeline; it has to wait until a request has been processed. I still don't understand why this takes 12 cycles; maybe there is more overhead involved in the NIOS pipeline (flush??? hopefully not).
But I can imagine that adding latency awareness to NIOS is a very intrusive change to the processor design. It means the NIOS needs to be able to analyse dependencies between instructions ("this ldwio instruction does not depend on the previous ldwio, so it can safely be executed"). It also implies the NIOS pipeline stage "memory" must be able to hold multiple queued requests and complete them as the SDRAM delivers data (the memory pipeline stage can be active while the rest is stalled). I can imagine that this is expensive both in logic elements (config option?) and in design "intrusion", since Altera would have to partly redesign the pipeline.
This is my guess about why Altera is so "quiet" about this issue. But these are just the things I can imagine as reasons for the lack of performance; it may be something else as well. I hope to get confirmation from Altera about this some day.

It's reasonable to optimize the CPU to be small and work well when things are cached. But IMHO applications with custom components that share RAM with the CPU are not a corner case for this FPGA system, so this should be a config option. You can't get that capability as elegantly anywhere else for this price and effort: the only processor I found that is able to share SDRAM out of the box is the IBM PowerPC with its external bus master feature. But even the smallest PowerPC (133MHz) was too powerful and expensive for my application. And IBM targets >500MHz, not <100MHz, in the future.

I have only one "bad hack" idea that could do something about it - use knowledge about the data cache for copying by using normal cached read instructions, but invalidating the cacheline(s) before reading may speed things up. The cache is AFAIK latency aware, so it can quickly retrieve the data from SDRAM, and NIOS can get it in full speed from cache. However, I won't try that in the near future, I am too busy developing my application.

Dirk
Altera_Forum
Honored Contributor
21 years ago
Hi guys,

Sorry I don't mean to ignore the conversation here or keep quiet about it -- we are rapidly approaching our next Nios/Quartus/SOPC Builder release and have the associated time crunch to deal with. I will try to post something more useful early next week.

There are several recently-introduced but not-yet-documented Avalon features I want to discuss. This won't solve the immediate problem that Dirk presents (successive loads from SDRAM where the cache misses every time), but will be of assistance in complex (multi-master) systems where getting the best memory bandwidth is key. Additionally our aforementioned next release has several more features (and documentation!) that will speed things up further (sorry, latency awareness on the CPU data master isn't one of them... but as I say we will be giving this a serious look).
Altera_Forum
Honored Contributor
21 years ago
Hi Guys,

I think that I know what your problem is here. Do you have your code stored in the same SDRAM as the data? If so, the SDRAM controller opens the bank and reads the data, then opens another bank and reads the next bit of code. You can fix this in several ways:

1. Put your code somewhere else.

2. If you have an instruction cache and your code is in a loop this should be ok the nth time through the loop (where n != 1)

3. The SDRAM can have multiple banks open; if the data and code are in different banks you should still get fast performance. Unfortunately the SDRAM controller from Altera does not support this and will always close the bank rather than leaving it open when the new address is in another bank. The SDRAM controller needs to be quite a bit more complex to take care of this. We wrote one, but not for Avalon. You could write your own; it took us about 2 months to do this. I can't distribute ours as it is the property of my old company.

I could be wrong about this being the cause of your 12 cycles, but opening a bank and doing a read takes about 5 cycles, and the next read should take 1 cycle. Changing banks (if the bank is open) is, I think, 2 cycles, i.e. a saving of 3 cycles per read: 12 cycles reduced to 6 (3 for the data read and 3 for the next instruction read).

Good Luck.
Altera_Forum
Honored Contributor
21 years ago
If I read Jesse right, the problem is in Nios' Data Master. It is not "Latency Aware" meaning it does not monitor 'readdatavalid'.

Soooo, with only static timing at its disposal, the Data Master must use the same absolute worst case timing for each and every read.

Privately I've been shown that read performance is still a dismal 5-6 clocks from initial access to 'readdatavalid' even when accessing the same row back to back.

I'd really like someone to explain the "single cycle access" statements that are given in Nios classes and docs. Better yet, make the statements come true :-)

Ken
Altera_Forum
Honored Contributor
21 years ago
Hello to Ken and the other guys discussing this topic.

Here is the link to the image that Ken mentioned: http://www.entner-electronics.com/images/nios2sdram_with_explanation.jpg

As you can see, there is also a delay when writing to the SDRAM: Altera's SDRAM controller has 2 write buffers. Therefore the first 2 writes operate at full speed, that is, 2 cycles per write. Then wait states are inserted until one of the two write buffers becomes free for the third write, etc. If you had 8 back-to-back writes instead of 4, you could also see this on the SDRAM signals.
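The write-buffer behaviour described above can be mimicked with a small toy model. It is a sketch under stated assumptions, not the controller's real timing: the master wants to issue a write every issue_cycles, the controller retires buffered writes strictly in order at one per drain_cycles, and a write is only accepted once a buffer slot is free. The function name, MAX_WRITES and the drain time used in the note below are all made up for illustration.

```c
#define MAX_WRITES 64u /* arbitrary limit of this toy model */

/* Total wait-state cycles the master sees when issuing n_writes
   back-to-back writes into a write buffer with 'depth' entries. */
unsigned write_stall_cycles(unsigned n_writes, unsigned depth,
                            unsigned issue_cycles, unsigned drain_cycles)
{
    unsigned accept[MAX_WRITES]; /* cycle at which write i is accepted */
    unsigned done[MAX_WRITES];   /* cycle at which write i leaves the buffer */
    unsigned stalls = 0;

    if (depth == 0 || n_writes > MAX_WRITES)
        return 0; /* outside what this toy model handles */

    for (unsigned i = 0; i < n_writes; i++) {
        /* Earliest cycle the master would like to issue write i. */
        unsigned wanted = (i == 0) ? 0 : accept[i - 1] + issue_cycles;
        unsigned granted = wanted;

        /* All slots are busy until write (i - depth) has drained. */
        if (i >= depth && done[i - depth] > granted)
            granted = done[i - depth];

        accept[i] = granted;
        stalls += granted - wanted;

        /* The controller drains writes one after another, in order. */
        unsigned start = (i > 0 && done[i - 1] > granted) ? done[i - 1] : granted;
        done[i] = start + drain_cycles;
    }
    return stalls;
}
```

With depth 2, 2 cycles per issued write and an assumed drain time of 6 cycles, the first two writes see no wait states, and four writes accumulate 6 stall cycles in total; if the controller could drain as fast as the master issues, the model reports no stalls at all.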

When reading, things become worse: Here the latency from the SDRAM and from the SDRAM-controller take full effect. Also the nios2-core itself requires several cycles, therefore even with internal SRAM you have about 4 cycles per read (I looked at it, but do not remember the exact number, maybe it was 3, more likely 5 or 6...).

SDRAM controllers are a topic I could discuss for hours, so I will try to keep it short (many things were already mentioned before):
- The Altera controller always keeps only one bank open (you can see in the diagram that the writes are in bank 1 and the reads in bank 0, yet the banks get precharged anyway). This is very conservative; at least it could have activated bank 1 before precharging bank 0. On the other hand: what do these 3 cycles help when it needs about 50 for reading 4 words?
- The controller has 2 write buffers, which will help a lot in many applications.
- The 11 or 12 cycles per read are with the program running from ANOTHER memory or cache.
- I have not checked it, but I suppose that reading the program memory is much more efficient (and more important in most cases).
- Making the data master latency-aware would be tough: it would need to guess what data will be read next by the program and preload it into a buffer / small cache.

Do not forget that Altera cannot only have performance in mind, but also LC count. A design that is very fast but requires e.g. 6,000 LCs would not help much either. We are talking about a Nios II with about 2,000 LCs (core + SDRAM controller), not about an Athlon 64 with I don't know how many million gates. Somewhere there will be performance bottlenecks.

What can we do?
- Increase the clock-rate
- Use DMA
- Solve the specific problem with own logic (nice, we have an FPGA...)

I will most likely design an SDRAM and a DDR-II controller with an interface for very fast video transfers (or other streaming things) and a Nios II Avalon interface within the next few months. But I do not think that I will address this specific issue, as I will use the "fast video interface" for tasks that require maximum performance. If there is interest, I could also offer it as an IP for Nios II (but not for free ;-).

Regards

Thomas
www.entner-electronics.com (http://www.entner-electronics.com)
Altera_Forum
Honored Contributor
21 years ago
Hello Jesse,

is there any progress regarding this topic?

Dirk
Altera_Forum
Honored Contributor
21 years ago
I am sorry, no, I have not forgotten about this. I have one other piece of "homework" that I have to write up at the moment (for my real job :-) ) and then I can get to work on a screed here.

Topics to be covered are: why the data master is not latency aware, how to improve SDRAM performance in the case of multiple masters accessing it simultaneously, using the DMA controller in an efficient manner to (hopefully) alleviate part of the original poster's pain, and as much of a sneak preview as I can give without getting in trouble about new features that are coming out in the next quartus/sopc/nios release which we are finishing up now and will be available in the coming weeks.
Altera_Forum
Honored Contributor
21 years ago
This thread seems to continue here: ddr vs. sdr ram... (http://www.niosforum.com/forum/index.php?act=st&f=2&t=797&st=0)
