The Nios II/f is optimized to run at 145 MHz in fast Stratix devices and even higher in Stratix II.
It achieves this performance by giving preference to cache hits over cache misses (a common CPU design technique).
This also takes advantage of the relatively fast speed of RAMs on FPGAs.
However, as you have discovered, when a load/store instruction misses in the D-cache or bypasses it,
the instruction takes a significant number of cycles to execute. This is required to maintain the 145 MHz design goal because
of the relatively slow speed of logic, muxing, and wires in an FPGA.
Unfortunately, if you don't need 145 MHz, you still have all the extra cycles of latency.
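
To put rough numbers on that tradeoff, here is a minimal sketch in C that converts a latency in cycles into wall-clock time. The function name latency_ns and the 10-cycle figure used below are assumptions for illustration only, not measured Nios II/f values.

```c
#include <assert.h>

/* Convert a latency in CPU cycles to nanoseconds at a given clock.
   cycles:  latency in clock cycles (hypothetical value, not a Nios II/f spec)
   clk_mhz: CPU clock frequency in MHz, must be positive */
double latency_ns(unsigned cycles, double clk_mhz)
{
    assert(clk_mhz > 0.0);
    /* One cycle lasts 1000 / clk_mhz nanoseconds. */
    return (double)cycles * 1000.0 / clk_mhz;
}
```

With an assumed 10-cycle miss, latency_ns(10, 145.0) is about 69 ns, while latency_ns(10, 50.0) is 200 ns. The cycle count stays the same when you slow the clock, so the miss gets more expensive in real time.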
I'd love to do a version of Nios II that is optimized for latency instead of Fmax.
It is just a matter of development priorities.
One thing that might help some customers is the new multiple clock
domain support in the Quartus 4.2 version of SOPC Builder.
You can now build a system in which the CPU runs at a high frequency
while other components run at a lower frequency.
Of course, for good performance, you'll need to have your memory controller run at
the same frequency as the CPU (because crossing clock domains adds several cycles of latency).
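
As a rough illustration of why the memory controller's clock domain matters, the sketch below models the cost of a load as hit time plus miss rate times miss penalty, with the crossing cycles added to the penalty. The struct, the function names, and every cycle count are hypothetical; they are not SOPC Builder or Nios II/f figures.

```c
#include <assert.h>

/* Hypothetical model: all cycle counts are illustrative assumptions,
   not measured Nios II/f or SOPC Builder figures. */
typedef struct {
    unsigned base_miss_cycles; /* miss cost with memory in the CPU's clock domain */
    unsigned crossing_cycles;  /* extra cycles added by a clock-domain crossing */
} mem_path;

/* Miss penalty in CPU cycles for a given memory path. */
unsigned miss_penalty(mem_path p, int crosses_domain)
{
    return p.base_miss_cycles + (crosses_domain ? p.crossing_cycles : 0u);
}

/* Average cost of a load in thousandths of a cycle, kept in integers:
   hit_cycles + miss_rate * penalty, with the miss rate given per mille. */
unsigned avg_load_millicycles(unsigned hit_cycles, unsigned miss_permille,
                              unsigned penalty)
{
    assert(miss_permille <= 1000u);
    return hit_cycles * 1000u + miss_permille * penalty;
}
```

For example, assuming a 1-cycle hit, a 5% miss rate, a 10-cycle base miss, and 4 crossing cycles, the average load costs 1.5 cycles with the memory controller in the CPU's domain and 1.7 cycles when it sits across a domain boundary. Accesses that bypass the D-cache pay the full penalty every time, so the gap is much larger for them.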
We have ideas for ways to reduce the latency of accesses to on-chip memories
that I'd like to see in a future product release. We'll let you know if it happens.