Ken, is the issue the hit rate of the data cache or the time to process a miss?
Since the data cache is a writeback cache with 4-byte lines, every cache miss
can result in a 4-byte write to Avalon (if the victim line is dirty) followed by a 4-byte read from Avalon to fetch the new line.
Because the CPU doesn't have a non-blocking cache, the CPU pipeline stalls while these Avalon transfers are performed.
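To make the hit-rate versus miss-time question concrete, here is a rough back-of-the-envelope model of a blocking writeback cache. The function names and every cycle count in it are made up for illustration; real numbers depend on your Avalon fabric and memory.

```python
def miss_penalty_cycles(dirty_fraction, write_cycles, read_cycles):
    """Average stall cycles per miss: optional victim writeback, then the line fetch."""
    return dirty_fraction * write_cycles + read_cycles


def avg_access_cycles(hit_rate, hit_cycles, miss_penalty):
    """Average cycles per data access: hit time plus the expected miss stall."""
    return hit_cycles + (1.0 - hit_rate) * miss_penalty


# Hypothetical numbers, chosen only to show the shape of the trade-off.
penalty = miss_penalty_cycles(dirty_fraction=0.5, write_cycles=10, read_cycles=10)
for hit_rate in (0.99, 0.90, 0.50):
    print(hit_rate, avg_access_cycles(hit_rate, hit_cycles=1, miss_penalty=penalty))
```

If the hit rate is already high, shaving cycles off the miss path buys little; if it is low, the miss penalty dominates everything else.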
Would a larger cache line size help your problem? If so, the Avalon reads and writes would be bursts, which would tend to
lower the average number of cycles per miss, but only if you need the other data in the line. The CPU would still be stalled
while these bursts are happening. More advanced CPUs have features like non-blocking caches, scoreboarded loads, and
even out-of-order execution to keep the CPU busy while it waits for memory accesses. Alas, Nios II has none of
these features, since they are probably too aggressive to implement in an FPGA and still achieve acceptable Fmax.
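The "only if you need the other data in the line" caveat can be sketched with a simple burst cost model. This assumes a burst costs a fixed setup latency plus one beat per word; `setup_cycles`, `cycles_per_beat`, and the values used are hypothetical, not Nios II or Avalon figures.

```python
def cycles_per_useful_word(line_words, words_used, setup_cycles, cycles_per_beat):
    """Stall cycles spent per word the program actually consumes from a fetched line."""
    if not 1 <= words_used <= line_words:
        raise ValueError("words_used must be between 1 and line_words")
    burst_cycles = setup_cycles + line_words * cycles_per_beat
    return burst_cycles / words_used


# Hypothetical timing: 8 cycles of setup, 1 cycle per 4-byte word transferred.
print(cycles_per_useful_word(1, 1, setup_cycles=8, cycles_per_beat=1))  # 4-byte line
print(cycles_per_useful_word(8, 8, setup_cycles=8, cycles_per_beat=1))  # 32-byte line, all used
print(cycles_per_useful_word(8, 1, setup_cycles=8, cycles_per_beat=1))  # 32-byte line, one word used
```

With these assumed numbers, the long line wins clearly when every word is consumed and loses to the 4-byte line when only one word is.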
I've designed chips in the past with color space conversion blocks for image processing.
The table accesses were always reads of 4 bytes but were not related to each other (low temporal and spatial locality).
We ended up storing this table in an off-chip SSRAM instead of the SDRAM because such scattered 4-byte reads were very wasteful of the SDRAM bandwidth.
To get good performance with SDRAM, you need large bursts (e.g. 16 or 32 bytes) and also high
temporal and spatial locality of reference.
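As a rough illustration of why locality matters so much here, the toy simulation below compares a sequential scan against scattered table lookups. It assumes a read-only, direct-mapped cache of 2 KiB and a 1 MiB lookup table; both sizes and all names are invented for the sketch.

```python
import random


def hit_rate(addresses, num_lines, line_bytes):
    """Hit rate of a direct-mapped, read-only cache over a sequence of byte addresses."""
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        line_addr = addr // line_bytes
        index = line_addr % num_lines
        if tags[index] == line_addr:
            hits += 1
        else:
            tags[index] = line_addr  # miss: fetch the line, evicting the old one
    return hits / len(addresses)


rng = random.Random(0)          # fixed seed so runs are repeatable
table_bytes = 1 << 20           # hypothetical 1 MiB lookup table
cache_bytes = 2048              # hypothetical 2 KiB data cache
n = 10000
sequential = [4 * i for i in range(n)]
scattered = [4 * rng.randrange(table_bytes // 4) for _ in range(n)]

for line_bytes in (4, 32):
    num_lines = cache_bytes // line_bytes
    print(line_bytes,
          hit_rate(sequential, num_lines, line_bytes),
          hit_rate(scattered, num_lines, line_bytes))
```

In this model a longer line helps the sequential scan a great deal, while the scattered lookups miss almost every time regardless of line size, so each miss drags in data that is never used.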