Anyone know if DDR RAM will fare better with the Nios II data master than SDRAM? After Dirk figured out it takes a whopping 12 clocks @ 50 MHz (~240 ns) per SDRAM read when not using the DMA, we realized we had to respin our board. Jesse explained that the problem is that the Nios' data master is not latency aware and so must use worst case timing. I'm wondering if DDR would fare better. I'm not familiar with it at all. Anybody know? I'm looking for a bulk memory that is also high performance with the Nios II. Thanks, Ken

Hi, would it be a solution to use another type of SDRAM controller that runs at a higher clock speed than the Nios? I think SDRAM can handle more than 100 MHz as input clock. Then you can probably reduce 12 clocks at 50 MHz to 12 clocks at 100 MHz. Can you try to run your Nios at a higher clock frequency, possibly using another speed grade for the FPGA? I think 80 to 100 MHz must be possible, depending on the number of peripherals you connect to the Avalon bus. Stefaan

You're never going to overcome 12 clocks with MHz. You'd need close to 500-1200 MHz to get the performance you should be getting at 100 MHz. That's not going to happen. The only thing we can do right now is use memory that has fixed timing like SRAM or on-chip SRAM. My question was whether DDR is any better/closer in this respect than SDR. Ken
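As a rough check of the numbers in this exchange, here is a small Python sketch of the cycle-to-time arithmetic. The helper names are made up for illustration, and the 1200 MHz result assumes the target is a one-clock read at 100 MHz, which is only one way to read the upper end of the range quoted above.

```python
def read_time_ns(clocks: int, freq_mhz: float) -> float:
    """Time for one read in nanoseconds, given a cycle count and a clock in MHz."""
    return clocks * 1000.0 / freq_mhz


def freq_needed_mhz(clocks: int, target_ns: float) -> float:
    """Clock in MHz at which `clocks` cycles fit into `target_ns` nanoseconds."""
    return clocks * 1000.0 / target_ns


# 12 clocks per read at 50 MHz, as measured on the scope.
print(read_time_ns(12, 50))    # 240.0

# Doubling the clock halves the time but keeps the 12-clock cost.
print(read_time_ns(12, 100))   # 120.0

# Assumption: the goal is to match a single-clock read at 100 MHz (10 ns).
print(freq_needed_mhz(12, read_time_ns(1, 100)))   # 1200.0
```

Raising the clock only scales the 240 ns figure. It does not remove any of the 12 cycles, which is the point being made above.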

Sorry, I didn't know the gap in performance was so big.

The SDRAM controller included in the Nios II kit only keeps one bank open at a time. The DDR controller from Altera keeps multiple banks open. This should help improve performance but is a function of your access pattern. The main problem with the Nios II/f data cache and DRAM performance (SDRAM or DDR) is that it only has a 4-byte line, so it doesn't perform burst transfers to/from the DRAM.

Hi James, Can you elaborate on Nios II/f SDRAM access? Jesse indicated the largest part of the problem was that the Nios II/f data master was not "Latency Aware". So anything not DMA'd or read out of the cache incurs a large timing hit, as demonstrated in the other thread on this topic, even if the reads are back to back in the same bank. A workaround or a glimpse of the roadmap would sure be welcome. Thanks, Ken

DDR vs. SDR RAM... | Altera Community

27 Replies

Altera_Forum
Honored Contributor
21 years ago
Ken, is the issue the hit rate of the data cache or the time to process a miss?

Since the data cache is a writeback cache with 4-byte lines, every time you have a cache miss,
it can result in a 4-byte write to Avalon (if the victim line is dirty) and then a 4-byte read from Avalon to fetch the new line.
Because the CPU doesn't have a non-blocking cache, the CPU pipeline stalls while these Avalon transfers are performed.

Would a larger cache line size help your problem? If so, the Avalon reads and writes would be bursts, which would tend to
lower the average number of cycles on a miss, but only if you need the other data in the line. The CPU would still be stalled
while these bursts are happening. More advanced CPUs have features like non-blocking caches, scoreboarded loads, and
even out-of-order execution to try to keep the CPU busy while stalled for memory accesses. Alas, Nios II has none of
these features since they are probably too aggressive to implement in an FPGA and achieve acceptable Fmax.
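To make the line-size argument concrete, here is a toy Python model of a direct-mapped, write-back cache that counts misses and 4-byte Avalon word transfers. It is only a sketch: the function name, the 64-line size and the write-allocate policy are assumptions for illustration, not a description of the actual Nios II/f cache.

```python
def simulate(accesses, line_bytes, num_lines=64):
    """Count (misses, words_read, words_written) for a toy direct-mapped,
    write-back, write-allocate cache.

    `accesses` is a list of (byte_address, is_write) tuples.
    Transfers are counted in 4-byte Avalon words.
    """
    tags = [None] * num_lines      # which memory line each cache line holds
    dirty = [False] * num_lines    # whether that line was modified
    words_per_line = line_bytes // 4
    misses = words_read = words_written = 0

    for addr, is_write in accesses:
        line_no = addr // line_bytes
        index = line_no % num_lines
        if tags[index] != line_no:
            misses += 1
            if tags[index] is not None and dirty[index]:
                words_written += words_per_line   # write back the dirty victim
            words_read += words_per_line          # fetch the new line
            tags[index] = line_no
            dirty[index] = False
        if is_write:
            dirty[index] = True

    return misses, words_read, words_written


# 256 sequential word reads: high spatial locality.
sequential = [(addr, False) for addr in range(0, 1024, 4)]

print(simulate(sequential, line_bytes=4))    # (256, 256, 0)
print(simulate(sequential, line_bytes=32))   # (32, 256, 0)
```

In this model the same number of words crosses the bus either way, but the 32-byte line takes one miss (one burst) per eight reads instead of one miss per read. For accesses with no spatial locality the longer line would mostly fetch words that are never used.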

I've designed chips in the past with color space conversion blocks for image processing.
The table accesses were always reads of 4 bytes but were not related to each other (low temporal and spatial locality).
We ended up storing this table in an off-chip SSRAM instead of the SDRAM because it was very wasteful of the SDRAM bandwidth.
To get good performance with SDRAM, you need to make large bursts (e.g. 16 or 32 bytes) and should also have high
temporal and spatial locality of references.
Altera_Forum
Honored Contributor
21 years ago
Hello,

please read nios2 sdram performance (http://www.niosforum.com/forum/index.php?act=st&f=2&t=629) for a deeper explanation of the issue. There are even oscilloscope images available.

Dirk
Altera_Forum
Honored Contributor
21 years ago
Hi James,

It's the time to process a miss or the time to process a read in the absence of data cache.

I feel like I completely understand your explanation, but I still don't see what it has to do with the worst case time to access sdram.

Forget cache, how long does it take for an address to show up on the Avalon bus, and then how long does it take for the SDRAM to place the requested data back on the bus?

Am I misreading the SDRAM datasheet? They just show address and chip select activating and data being placed on the bus in just a few clocks. The number of clocks is equal to the CAS setting for accesses to the same row (i.e. Figure 8 "Random Reads" in the MT48LC4M32B2 datasheet).

So stalled/cached/queued or not, once the Avalon bus puts out the address and selects the SDRAM, it responds in CAS clocks with the result.

So that's 3 clocks with CAS3 setting. Now where are the other 9 clocks coming from? Dirk's scope shots show 12 clocks between back to back reads on the same row. The cpu stalling until the result is returned is fine, because the code needs that result to continue. I'm just not seeing the justification for a 12 clock stall to read a non-cached word from sdram.

Thanks,
Ken
Altera_Forum
Honored Contributor
21 years ago
Hi Ken,

when I look back at the diagram I already posted in the first thread on this topic, I see that the SDRAM controller needs 3 clocks to assert CAS after it is chip-selected (internally), +2 clocks CAS latency, +1 clock in the SDRAM controller (I suppose the input registers on the SDRAM data bus). The remaining 5 cycles appear to get lost somewhere inside the Nios (the CAS-to-CAS time was 11 cycles in the case I observed).
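The accounting in the paragraph above can be written out as a short Python sketch. The dictionary keys are just labels chosen here; the numbers are the ones quoted from the diagram.

```python
# Observed CAS-to-CAS time for back-to-back reads, in clocks.
CAS_TO_CAS_OBSERVED = 11

# Clocks that can be attributed to the SDRAM controller and the SDRAM itself.
controller_and_sdram = {
    "chip select to CAS asserted": 3,
    "CAS latency": 2,
    "controller input register": 1,
}

accounted = sum(controller_and_sdram.values())
unaccounted = CAS_TO_CAS_OBSERVED - accounted

print(accounted)     # 6
print(unaccounted)   # 5
```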

Using DDR-SDRAM would not help at all, because it mainly increases the burst-rate (2bits per pin and clock instead of 1), but has basically the same latency-behavior.

You wrote that you need a 15-bit look-up table. If there are really random values to look up, there will be no suitable solution with Nios + SDRAM, I think. I would recommend that you perform the look-up task with dedicated "hardware" in the FPGA. There you can do some pipelining and achieve 1 value per clock with the SSRAM you mentioned some time ago. With a dedicated SDRAM controller you could also get better speed, but it is very difficult to get high worst-case performance from SDRAM for really random accesses. (The original problem from dziegel involved really predictable sequential accesses, where you can achieve almost 1 access per clock with a non-Nios solution.)

Maybe we could help you more if we get more details from your application.

Regards

Thomas

www.entner-electronics.com (http://www.entner-electronics.com)
Altera_Forum
Honored Contributor
21 years ago
Hi Thomas,

In your analysis you have the Nios+SDRAM controller consuming 11-3 = 8 clocks. That's 8 clocks of overhead. Is this just to be expected as normal?

How much overhead would be added on say a Coldfire or an ARM or some other softcore? Do all/most embedded processors add over 300% overhead to memory reads? I don't know for sure, but it doesn't sound right.

I'd like to establish this as either an oversight, a work in progress, or the way it is and then have it documented. The current literature promises either "single cycle" or ">1 clocks" to access sdram. (11 != 1)

Thanks,
Ken
Altera_Forum
Honored Contributor
21 years ago
Ken,

I just sat down with James (who posts here sometimes) and we went over the numbers. The >= 1 clock in the documentation refers to all loads. A load that is a cache hit takes 1 clock. Everything else pays a penalty. Here is a rough break-down of the overall latency:

- ld instruction occurs
- cache miss - tick
- prepare avalon read - tick
- avalon read signals asserted - tick
- wait for avalon. The fastest memory would have data back on this clock. A random SDRAM access takes 5, as evidenced by previous discussion - 5 ticks
- register incoming data - tick
- align (this is because it's possible that the user wanted an 8 or 16 bit load) - tick
- instructions immediately following that need the load data? another 2 ticks (this is seldom the case)
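Adding up the breakdown above, as a minimal Python sketch. The stage labels and the `load_latency` helper are made up for illustration; the tick counts are the ones listed.

```python
# Ticks for a load that misses the data cache, in the order listed above.
LOAD_MISS_TICKS = [
    ("cache miss", 1),
    ("prepare Avalon read", 1),
    ("Avalon read signals asserted", 1),
    ("wait for Avalon (random SDRAM access)", 5),
    ("register incoming data", 1),
    ("align for 8/16-bit loads", 1),
]

# Extra stall when the next instruction needs the loaded value.
DEPENDENT_USE_TICKS = 2


def load_latency(dependent_use: bool = False) -> int:
    """Total ticks for one uncached SDRAM load under the breakdown above."""
    total = sum(ticks for _, ticks in LOAD_MISS_TICKS)
    if dependent_use:
        total += DEPENDENT_USE_TICKS
    return total


print(load_latency())                     # 10
print(load_latency(dependent_use=True))   # 12
```

The 12-tick case lines up with the 12 clocks per read quoted at the start of the thread, although the thread does not say whether the measured code had a dependent instruction right after each load.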

As you can see, it pays to have something cached! A couple of the above clocks that you pay are a result of Nios II being optimized for f-max: it makes sense to run it as fast as possible. One note: if your main performance bottleneck is loading this data (which cannot be cached), and you're changing your board to run from faster memory, it may make sense to try the /s core. The reason is that you'll save 1 or 2 cycles per load, as the "cache miss" and "prepare Avalon read" penalties aren't there.

Also, I realize you're working with small data buffers, but if they start to get larger (10, 20+ bytes perhaps) it would start making sense to do a quick DMA. By quick I mean set up an initial DMA and then do a few register writes to the peripheral directly to kick off a transfer. The basic things needed: start addr, stop addr, mode, transfer count. I think a couple of these retain their values, so it may be possible to start a DMA with 2-3 IO writes (this is part of that promised write-up; all I'll do, if you want to pre-empt me, is look at the DMA datasheet on that one). The DMA controller will get one word of data per clock out of SDRAM after the initial penalty. I'd like to get into this more now but I have to catch a flight this afternoon. Happy holidays.
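A back-of-the-envelope Python sketch of when the quick DMA starts to pay off. The 10 ticks per CPU load come from the breakdown above; `setup_ticks=30` (the cost of the few IO writes) and `initial_penalty=10` are invented placeholder values, not figures from the DMA datasheet. The model also ignores the cost of reading the data back out of wherever the DMA put it.

```python
def cpu_ticks(words: int, ticks_per_load: int = 10) -> int:
    """Ticks for the CPU to read `words` uncached words one load at a time."""
    return words * ticks_per_load


def dma_ticks(words: int, setup_ticks: int = 30, initial_penalty: int = 10) -> int:
    """Ticks for a DMA transfer: setup, initial penalty, then one word per clock.

    setup_ticks and initial_penalty are assumed values for illustration only.
    """
    return setup_ticks + initial_penalty + words


def break_even_words() -> int:
    """Smallest transfer size, in words, where the DMA model beats CPU loads."""
    words = 1
    while dma_ticks(words) >= cpu_ticks(words):
        words += 1
    return words


for words in (1, 4, 8, 32):
    print(words, cpu_ticks(words), dma_ticks(words))
# 1 10 41
# 4 40 44
# 8 80 48
# 32 320 72

print(break_even_words())   # 5
```

With these placeholder numbers the crossover is at 5 words (20 bytes). The real crossover depends entirely on the actual setup cost.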
Altera_Forum
Honored Contributor
21 years ago
Hi Ken,

--- Quote Start ---
originally posted by kenland@Dec 22 2004, 01:47 PM
hi thomas,

in your analysis you have the nios+sdram controller consuming 11-3 = 8 clocks. that's 8 clocks of overhead. is this just to be expected as normal?

how much overhead would be added on say a coldfire or an arm or some other softcore? do all/most embedded processors add over 300% overhead to memory reads? i don't know for sure, but it doesn't sound right.

i'd like to establish this as either an oversight, a work in progress, or the way it is and then have it documented. the current literature promises either "single cycle" or ">1 clocks" to access sdram. (11 != 1)

thanks,
ken
--- Quote End ---

I do not think that a "real" processor would be that slow, but there would for sure be some clocks of delay too, when you have really RANDOM accesses. The Nios needs some cycles more, as it is very pipelined to achieve a high fmax (as Jesse / James pointed out).

This problem (memory accesses that miss the cache add a large delay) is basically the reason why Intel added Hyperthreading to the Pentium 4: while the Pentium 4 is waiting for the data, it "simply" switches to another task, so it can do something useful while waiting. (Of course the Pentium 4 is a much more sophisticated architecture with out-of-order execution and such, and a cache miss carries an even larger penalty there, because the core operates at a much higher frequency (e.g. 3.6 GHz) than the memory (e.g. 400 MHz), so you easily get delays in the range of about 50 to 100 CPU clock cycles).

Merry Christmas

Thomas

www.entner-electronics.com (http://www.entner-electronics.com)
Altera_Forum
Honored Contributor
21 years ago
Hi,

Originally I chose Nios (the first one) because the X brand had a processor where it took some 7 to 8 cycles to read some data from external memory (even fast SRAM).

Nice to hear that the second version of NIOS is going in the same direction ;-)

A good start in 2005 for everyone!

Stefaan
Altera_Forum
Honored Contributor
21 years ago
Jesse,

Thank you for this incredibly valuable info. Please, please add this info to the documentation. I've already respun my board with a Stratix over this plus the bit shifting problem (1 clock per bit without hardware multiplier!). I'd hate to see this happen to someone else!
(at least say 1 clock if cached 7+ if not)

Based on this new info I'm not sure if the Stratix will help enough. I don't know exactly what the read overhead on our Coldfire is, but I suspect it is much less than 7 clocks minimum for non-cached reads. Data caching is really of little use for many embedded applications that are always streaming or otherwise processing only new information. (music, video, scanning, almost anything...) In fact what is typically interesting about revisiting the same old data?

I wonder if there is a way to dma into the data cache to get work packets to near 2 clocks? (overhead + one clock for dma + one clock for actual read) Actually, I'm surprised the existing cache controller doesn't assume read ahead and do this already.

I hope Altera will see how crucial fast memory access is. The good news is that anything that can be done will improve performance by 10%+ for each clock eliminated!

Thanks,
Ken
Altera_Forum
Honored Contributor
21 years ago
Hi Ken,

in most C programs, most data transfers are from/to the stack, where a data cache helps a lot.

If you have a stream of data to process, maybe you can control the streaming yourself and always read the next word from the same address (not cached and also not SDRAM, of course). You could also implement custom instructions to access the stream, which would be even faster. Of course it is a pity that the data cache/SDRAM controller does not read a whole line of data (then you would pay the cache-miss penalty only on the first of the words in a cache line). You could also do a DMA transfer to an internal SRAM block and use this as a "cache" for further processing.

You also mentioned a 16-bit LUT, I think: if it is a steady transfer function, it may be possible to reduce its size (to let's say 256 points) and interpolate in between (in FPGA "hardware"). Then you can put the LUT into an internal SRAM block. If you implement this with e.g. custom instructions, you could achieve about 2 clocks per look-up. If you implement this in a clever way, the resource usage (LCs and RAM blocks) would not be too much, I think.
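A software model of the reduced-LUT idea, as a minimal Python sketch. It only shows the arithmetic that the FPGA logic would perform: the top 8 bits of a 16-bit input select a table segment and the low 8 bits interpolate linearly within it. The function names and the square-law example curve are assumptions for illustration, and the table holds 257 entries so the last segment has an end point.

```python
def build_table(func, points=256):
    """Sample `func` at points + 1 evenly spaced 16-bit inputs.

    The extra entry is the end point of the last segment; its input is
    clamped to 65535 so it stays within the 16-bit range.
    """
    step = 65536 // points
    return [func(min(i * step, 65535)) for i in range(points + 1)]


def lookup(table, x):
    """Approximate func(x) for a 16-bit x by linear interpolation."""
    index = x >> 8        # top 8 bits select the segment
    frac = x & 0xFF       # low 8 bits are the position within the segment
    lo = table[index]
    hi = table[index + 1]
    return lo + (((hi - lo) * frac) >> 8)


def square_curve(x):
    """Example transfer function (an assumption): x squared, scaled to 16 bits."""
    return (x * x) >> 16


table = build_table(square_curve)

print(lookup(table, 0x8000))    # 16384
print(lookup(table, 0x8080))    # 16512
print(square_curve(0x8080))     # 16512
```

A smooth curve like this one interpolates well from 256 segments. A table with abrupt steps or truly random contents would not, which is why this only applies if the transfer function is steady.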

Regards,

Thomas
