Hi Ken,
Yes, the controller is pipelined, and so is the SDRAM chip. You can achieve one word per clock on reads only if you queue up the read requests in advance to keep the pipeline full (DMA-like). The CPU data master cannot request data in advance because it usually cannot predict where the next read will be, so each read has to go all the way through the pipeline before the CPU can finish the current instruction, advance to the next one, and issue the next read. 7 clocks are spent in the controller/chip, but I am still not sure where the remaining 5 clocks come from, and the source code for NiosII is not available.
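To put rough numbers on the difference, here is a small back-of-the-envelope sketch in Python. It is only arithmetic, not a model of the real controller: it assumes a fixed read latency of 12 clocks (the 7 + 5 figures above), that a queued master can issue one request per clock, and that the CPU issues the next read only after the previous one returns. The function names are made up for illustration.

```python
def pipelined_clocks(num_reads: int, latency: int) -> int:
    """Clocks for num_reads when one request is issued every clock (DMA-like).

    The first word arrives after `latency` clocks; each later word follows
    one clock behind the previous one.
    """
    if num_reads == 0:
        return 0
    return latency + (num_reads - 1)


def serialized_clocks(num_reads: int, latency: int) -> int:
    """Clocks for num_reads when each read waits for the previous one (CPU-like)."""
    return num_reads * latency


LATENCY = 12  # assumed total: 7 clocks in controller/chip + 5 unexplained

for n in (1, 16, 256):
    print(n, pipelined_clocks(n, LATENCY), serialized_clocks(n, LATENCY))
```

For a single read the two cases are identical, which is why the CPU sees the full latency every time; the pipeline only pays off once many requests are in flight.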
The NiosII data cache doesn't help in this case, but if YOU know where your reads will be, you can try writing a custom cache controller to optimize the read pipeline (things could get complicated, though, and if you need the product finished quickly, SRAM may be a better option).
Good luck,
clancy