Another option: if you can pinpoint a place where the data cache is hurting you like this, you can use the "bit 31" trick to copy to or from noncacheable memory. This may help your II/f benchmarks converge with your II/s benchmarks.
The "bit 31" trick is covered on page 7-7 of the Nios II Software Developer's Handbook. In short, only bits 30-0 of an address are actually driven onto the address bus. Bit 31 controls whether the access goes through the data cache; if it is set, the data cache is bypassed. So all you have to do is pass (address | 0x80000000), cast back to a pointer, to memcpy for each buffer you want uncached. Try your benchmarks with the cache bypassed for reads only, for writes only, and for both, and see which works best.
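Here is a rough sketch of what that could look like in C. The names bypass_dcache and memcpy_bypass are made up for illustration (they are not HAL functions), and I'm assuming a core where bit 31 really does select the bypass, as described above. The two flags let you try the reads-only, writes-only, and both variants in your benchmark.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit 31 selects the data-cache bypass; bits 30-0 are the real address. */
#define DCACHE_BYPASS_BIT ((uintptr_t)0x80000000u)

/* Hypothetical helper: return an alias of p with bit 31 set,
 * so that accesses through it bypass the data cache. */
static inline void *bypass_dcache(const void *p)
{
    return (void *)((uintptr_t)p | DCACHE_BYPASS_BIT);
}

/* Hypothetical helper: memcpy with the cache bypassed for the
 * destination (writes), the source (reads), or both.
 * Assumption: any data for these buffers that is still sitting in the
 * data cache has already been flushed (e.g. with the HAL's
 * alt_dcache_flush), otherwise the uncached access can see stale memory. */
static inline void memcpy_bypass(void *dst, const void *src, size_t n,
                                 int bypass_writes, int bypass_reads)
{
    assert(dst != NULL && src != NULL);
    if (bypass_writes)
        dst = bypass_dcache(dst);
    if (bypass_reads)
        src = bypass_dcache(src);
    memcpy(dst, src, n);
}
```

For example, memcpy_bypass(dst, src, len, 1, 0) would bypass the cache for writes only. Only call this on the Nios II target; on a desktop machine the aliased address is not valid memory.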