Ken,
I just sat down with James (who posts here sometimes) and we went over the numbers. The ">= 1 clock" figure in the documentation refers to all loads: a load that hits the cache takes 1 clock, and everything else pays a penalty. Here is a rough break-down of the overall latency:
- ld instruction occurs
- cache miss - tick
- prepare avalon read - tick
- avalon read signals asserted - tick
- wait for Avalon. The fastest memory would have data back on this clock. A random SDRAM access takes 5 clocks, as seen in the earlier discussion - 5 ticks
- register incoming data - tick
- align (this is because it's possible that the user wanted an 8 or 16 bit load) - tick
- instructions immediately following that need the load data? another 2 ticks (this is seldom the case)
As you can see, it pays to have something cached! A couple of the clocks above are a result of Nios II being optimized for f-max -- it makes sense to run it as fast as possible. One note: if your main performance bottleneck is loading this data (which cannot be cached), and you can't change your board to run from faster memory, it may make sense to try the /s core. The reason is that you'll save 1 or 2 cycles per load, since the "cache miss" and "prepare the Avalon read" penalties aren't there.
Also, I realize you're working with small data buffers, but if they start to get larger (10, 20+ bytes perhaps) it would start making sense to do a quick DMA. By quick I mean set up the DMA once, then do a few register writes directly to the peripheral to kick off each transfer. The basic things needed: start addr, stop addr, mode, transfer count. I think a couple of these retain their values, so it may be possible to start a DMA with 2-3 I/O writes (this is part of that promised write-up; if you want to pre-empt me, the DMA datasheet covers it). The DMA controller will get one word of data per clock out of SDRAM after the initial penalty. I'd like to get into this more now but I have to catch a flight this afternoon. Happy holidays.
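To make the "set up once, kick off with a few writes" idea concrete, here's a sketch. The register names, offsets, and control bits are hypothetical -- the real ones come from the DMA datasheet -- and the register file is modeled as a plain struct so the example is self-contained; on hardware these would be volatile writes to the peripheral's base address.

```c
#include <stdint.h>

/* Hypothetical register file for a simple DMA controller.  Real layout
 * and bit positions must be taken from the DMA datasheet. */
typedef struct {
    uint32_t read_addr;   /* source (start) address       */
    uint32_t write_addr;  /* destination address          */
    uint32_t length;      /* transfer count in bytes      */
    uint32_t control;     /* mode bits + GO               */
} dma_regs;

#define DMA_CTRL_WORD_XFER (1u << 2)  /* hypothetical: 32-bit transfers */
#define DMA_CTRL_GO        (1u << 3)  /* hypothetical: start bit        */

/* One-time setup: mode bits that survive across transfers. */
static void dma_init(dma_regs *dma)
{
    dma->control = DMA_CTRL_WORD_XFER;
}

/* Per-transfer kick-off.  If the mode register retains its value, each
 * transfer is just a few I/O writes, as described above. */
static void dma_start(dma_regs *dma, uint32_t src, uint32_t dst, uint32_t len)
{
    dma->read_addr  = src;
    dma->write_addr = dst;
    dma->length     = len;
    dma->control   |= DMA_CTRL_GO;
}
```

The design point is that only the addresses and length change per transfer; the mode setup amortizes across every buffer you move.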