First, each Nios II instruction takes several clock cycles (6 if I'm not mistaken), and IIRC the Nios II/e doesn't have any fancy optimizations like branch prediction or heavy pipelining.
Then, when you say you counted the cycles using the debugger, were you stepping through C code or assembly? One step in C code can correspond to several assembly instructions.
If you don't have an instruction or data cache, you also need to account for several extra clock cycles per memory access due to DRAM latency.
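To put rough numbers on the points above, here is a minimal back-of-the-envelope sketch in C. The function name `estimate_cycles` and all of its parameters are my own for illustration, not part of any Nios II tool or API, and the model is deliberately simple: a fixed cost per instruction plus a fixed wait per memory access. Plug in values you measure on your own system.

```c
#include <assert.h>
#include <stdint.h>

/* Rough cycle estimate for a code fragment on a cacheless soft core.
 * Every parameter is an assumption to be replaced with measured values:
 *   instructions            - assembly instructions executed
 *   cycles_per_instruction  - base cost of one instruction
 *   memory_accesses         - accesses that go out to external memory
 *   wait_cycles_per_access  - extra stall cycles per such access
 */
uint64_t estimate_cycles(uint64_t instructions,
                         uint32_t cycles_per_instruction,
                         uint64_t memory_accesses,
                         uint32_t wait_cycles_per_access)
{
    /* An instruction always costs at least one cycle. */
    assert(cycles_per_instruction > 0);

    return instructions * cycles_per_instruction
         + memory_accesses * wait_cycles_per_access;
}
```

For example, if one C statement compiles to 4 instructions, 2 of which access memory, then with 6 cycles per instruction and a made-up 10 wait cycles per access, `estimate_cycles(4, 6, 2, 10)` gives 44 cycles for what looks like a single step in the C debugger. Without an instruction cache the instruction fetches themselves also hit memory, so you may want to count them in `memory_accesses` as well.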
As for your last question, it all depends on how the JTAG controller was implemented. Since the JTAG interface runs at a clock frequency that is not the same as the system frequency on the bus, some clock crossing logic needs to be implemented. If you can assume that one clock is always faster than the other (sometimes by a factor of at least 2), that clock crossing logic is a lot easier to implement than if you have to handle every possible case. IIRC the Nios II documentation states a minimum clock frequency that must be used on the JTAG debug module.