Hello Ken,
from my point of view there is nothing we can do. I think the SDRAM controller is fine; it has a small pipeline to store requests. The main problem is the NIOS data master: as Jesse said, it is not latency aware. According to the Avalon spec, this means NIOS cannot "enqueue" multiple read requests into the SDRAM controller's pipeline; it has to wait until each request has been processed. I still don't understand why this takes 12 cycles, maybe there is more overhead involved in the NIOS pipeline (a flush??? hopefully not).
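Just to make the cost concrete, here is roughly what such a copy loop looks like in C. This is only a sketch: SDRAM_BASE stands for whatever your system.h calls the SDRAM region, and I am assuming the Nios II HAL's io.h macros are available (IORD_32DIRECT compiles down to ldwio).

    #include <io.h>   /* Nios II HAL I/O macros; IORD_32DIRECT uses ldwio */

    /* Copy 'words' 32-bit words out of shared SDRAM with uncached reads.
     * Because the data master is not latency aware, every single load
     * stalls until the SDRAM controller answers, so the loop pays the
     * full round trip (the ~12 cycles mentioned above) per word.
     * SDRAM_BASE is a placeholder for your system.h base address. */
    static void copy_uncached(unsigned int sdram_off, unsigned int *dst, int words)
    {
        int i;
        for (i = 0; i < words; i++) {
            /* one stalled ldwio per word - requests cannot overlap */
            dst[i] = IORD_32DIRECT(SDRAM_BASE, sdram_off + 4 * i);
        }
    }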
But I can imagine that adding latency awareness to NIOS is a very intrusive change to the processor design. It means NIOS would need to analyse dependencies between instructions ("this ldwio instruction does not depend on the previous ldwio, so it can safely be executed"). It also implies that the NIOS "memory" pipeline stage must be able to hold multiple queued requests and retire them as the SDRAM delivers data (the memory stage would have to stay active while the rest of the pipeline is stalled). I can imagine that this is expensive both in logic elements (a config option?) and in design "intrusion", since Altera would have to partly redesign the pipeline.
This is my guess as to why Altera is so "quiet" about this issue. But these are just the reasons I can imagine for the poor performance; it may be something else entirely. I hope to get confirmation from Altera about this some day.
It's reasonable to optimize the CPU to be small and to work well when things are cached. But IMHO applications with custom components that share RAM with the CPU are not a corner case for this FPGA system, so this should be a config option. You can't get that capability as elegantly anywhere else for this price and effort - the only processor I found that is able to share SDRAM out of the box is the IBM PowerPC with its external bus master feature. But even the smallest PowerPC (133MHz) was too powerful and too expensive for my application. And IBM is targeting >500MHz in the future, not <100MHz.
I have only one "bad hack"™ idea that might help: exploit the data cache for copying. Use normal cached read instructions, but invalidate the affected cache line(s) before reading; that may speed things up. The cache is AFAIK latency aware, so it can retrieve the data from SDRAM quickly, and NIOS can then read it at full speed from the cache (a rough sketch is below). However, I won't try that in the near future, I am too busy developing my application.
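If anyone wants to try it, something along these lines is what I have in mind. Again just a sketch: I am assuming the HAL's alt_dcache_flush() from sys/alt_cache.h is the right call to flush/invalidate the lines in question - check your HAL version.

    #include <string.h>
    #include <sys/alt_cache.h>   /* HAL cache control; I assume alt_dcache_flush() here */

    /* The "bad hack": throw the lines covering the source region out of the
     * data cache, then do a normal cached copy. The line fill logic is
     * latency aware, so it fetches the fresh data from SDRAM quickly, and
     * NIOS reads it from the cache at full speed. 'src' points into the
     * shared SDRAM region. */
    static void copy_via_cache(void *dst, void *src, unsigned int len)
    {
        /* write back + invalidate the stale lines so the following cached
         * reads miss and refill from SDRAM */
        alt_dcache_flush(src, len);

        /* plain cached copy - each miss pulls in a whole cache line */
        memcpy(dst, src, len);
    }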
Dirk