Is the delay between the read cycles, or are the read cycles stretched?
You should also be able to find the source for the copy loop, which might be illuminating.
If you run the code from tightly-coupled instruction memory, you can use SignalTap to trace the instruction fetches. That can be very informative, since it also shows the CPU stalls. The pipeline delays do make it slightly 'interesting' to follow.
From what I remember, a little bit more logic in the SPI block would make it a lot faster.
I also did some quick sums and thought that a 100MHz Nios could directly bit-bang the EPCS almost as fast as it can go.
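
For what it's worth, here is a rough sketch of the kind of bit-bang loop I have in mind. It assumes SPI mode 0, MSB first, with the flash changing its output on the falling clock edge. The pin helpers (set_dclk, set_asdi, get_data) are made-up names, and here they drive a tiny software model of the flash so the loop can be run on a PC; on real hardware they would be replaced by writes and reads of whatever PIO registers you wire to the EPCS pins. Chip select handling is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated pins: hypothetical stand-ins for PIO register accesses. */
static int dclk, asdi;
static uint8_t sim_tx = 0xA5; /* byte the fake flash shifts out, MSB first */
static uint8_t sim_rx;        /* bits the fake flash captured from ASDI    */

/* DATA pin: the fake flash always presents the top bit of sim_tx. */
static int get_data(void) { return (sim_tx >> 7) & 1; }

static void set_asdi(int v) { asdi = v & 1; }

static void set_dclk(int v)
{
    v &= 1;
    if (!dclk && v)  /* rising edge: flash samples ASDI */
        sim_rx = (uint8_t)((sim_rx << 1) | asdi);
    if (dclk && !v)  /* falling edge: flash presents its next bit */
        sim_tx = (uint8_t)(sim_tx << 1);
    dclk = v;
}

/* Shift one byte out on ASDI while shifting one byte in from DATA. */
uint8_t spi_xfer_byte(uint8_t out)
{
    uint8_t in = 0;
    for (int i = 0; i < 8; i++) {
        set_asdi((out >> 7) & 1);               /* set up output while clock is low */
        out = (uint8_t)(out << 1);
        in = (uint8_t)((in << 1) | get_data()); /* input is stable while clock is low */
        set_dclk(1);                            /* flash latches ASDI */
        set_dclk(0);                            /* flash moves to its next bit */
    }
    return in;
}

/* Sanity check against the simulated flash. */
void spi_selftest(void)
{
    sim_tx = 0xA5; sim_rx = 0; dclk = 0;
    uint8_t got = spi_xfer_byte(0x3C);
    assert(got == 0xA5);    /* received what the model shifted out */
    assert(sim_rx == 0x3C); /* model received what we shifted out  */
}
```

On real hardware each bit comes down to a couple of register writes and one read, so the loop would probably want unrolling and running from tightly-coupled memory; the actual rate would need measuring rather than taking my sums on trust.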