Hi Dirk,
What you're seeing is the result of the Nios II data master not being 'latency aware'. The instruction master is, which allows relatively speedy instruction fetch even on a cache miss. Both master ports on the DMA controller are latency aware as well, and that is why Ken sees the performance he does. In a nutshell, Nios II was really designed to be as simple (small/fast) as possible and to deliver its best performance when things are cached.
However, you raise a valid point with respect to more complex systems that have custom logic or other processors sharing memory, since data shared that way cannot be cached. I'll have a chat with our CPU expert to see what the penalty for adding latency awareness to the data master would be.
In the meantime I have to second the opinions above: either use DMA (which sounds like something you don't want to do), or dedicate one or more small on-chip RAMs to your high-speed buffers. The on-chip memories can also be dual-ported, further enhancing performance.
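For the on-chip RAM route, here is a minimal sketch of pinning a buffer to a dedicated memory from C using GCC's section attribute. The section name ".onchip_ram", the buffer size and the helper functions are all made up for illustration; the real section name depends on what your on-chip memory is called in your system and linker script, so please check that before copying anything.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer size, chosen only for this example. */
#define FAST_BUF_WORDS 256u

/* ASSUMPTION: ".onchip_ram" is a placeholder section name. Replace it with
 * the section your linker script maps onto the on-chip memory.
 * 'volatile' only stops the compiler from optimizing accesses away when
 * other hardware touches the buffer; it does not change cache behaviour. */
static volatile uint32_t fast_buf[FAST_BUF_WORDS]
    __attribute__((section(".onchip_ram")));

/* Fill the buffer with seed, seed + 1, seed + 2, ... */
void fast_buf_fill(uint32_t seed)
{
    for (size_t i = 0; i < FAST_BUF_WORDS; i++) {
        fast_buf[i] = seed + (uint32_t)i;
    }
}

/* Sum every word in the buffer (wraps modulo 2^32). */
uint32_t fast_buf_checksum(void)
{
    uint32_t sum = 0u;
    for (size_t i = 0; i < FAST_BUF_WORDS; i++) {
        sum += fast_buf[i];
    }
    return sum;
}
```

Everything that needs fast access then goes through fast_buf, and the linker takes care of where it physically lives.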
PS: Latency aware means that an Avalon master accepts the 'readdatavalid' signal in addition to the 'waitrequest' signal, which all masters must support.
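To put rough numbers on why this matters, here is a back-of-the-envelope model in C. It is a simplification and not a cycle-accurate description of the Avalon fabric: it assumes a fixed read latency, no other stalls, that a master without 'readdatavalid' is held off for the full latency on every read, and that a latency-aware master can issue one read per cycle. The function names are my own.

```c
#include <stdint.h>

/* Simplified model only.
 * n_reads: number of back-to-back reads.
 * latency: cycles between a read being accepted and its data returning
 *          (assumed fixed for this sketch). */

/* Master without 'readdatavalid': each read occupies one issue cycle plus
 * the full latency before the next read can start. */
uint32_t cycles_without_readdatavalid(uint32_t n_reads, uint32_t latency)
{
    return n_reads * (1u + latency);
}

/* Latency-aware master: reads are issued on consecutive cycles and the
 * latency is paid once, while the last read's data is in flight. */
uint32_t cycles_with_readdatavalid(uint32_t n_reads, uint32_t latency)
{
    if (n_reads == 0u) {
        return 0u;
    }
    return n_reads + latency;
}
```

Under these assumptions, 8 reads from a memory with 4 cycles of latency cost 40 cycles without 'readdatavalid' and 12 cycles with it, which is the kind of gap you would expect between the CPU data master and the DMA controller on uncached reads.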