The code I run on the Nios is carefully compiled so that it contains no actual function calls (everything is inlined), which gives the compiler more registers to work with.
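As a rough illustration of the "everything inlined" approach, here is a minimal sketch assuming GCC and its always_inline attribute. The function names are made up for the example, and this is only one way to force inlining, not necessarily how the build described here does it.

```c
#include <stdint.h>

/* Hypothetical helper: always_inline asks GCC to expand the body at every
 * call site, so no call instruction is emitted for it. */
static inline __attribute__((always_inline)) uint32_t scale_sample(uint32_t x)
{
    return (x * 3u) >> 1;
}

/* Hypothetical caller: once the helper is inlined, this is a leaf function. */
uint32_t process_pair(uint32_t a, uint32_t b)
{
    return scale_sample(a) + scale_sample(b);
}
```

Checking the generated assembly for call instructions is the simplest way to confirm that nothing was left out of line.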
All global data is accessed using 16-bit offsets from the global pointer, which also significantly reduces pressure on registers.
You do get better code for global arrays if you use %gp as a register variable pointing to the array (and for global structs if you have built gcc with my patches).
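The reason a single base register is enough is that load and store instructions carry a signed 16-bit offset. The sketch below only models that range check in portable C; the struct, its fields, and the function names are hypothetical, and the real placement of data relative to %gp is done by the compiler and linker, not by code like this.

```c
#include <stdint.h>

/* Hypothetical set of globals gathered into one block. */
struct globals {
    uint32_t tick_count;
    uint16_t status;
    uint8_t  rx_buf[256];
    uint32_t tx_len;
};

/* A signed 16-bit immediate can express offsets from -32768 to 32767. */
static int fits_imm16(long offset)
{
    return offset >= -32768 && offset <= 32767;
}

/* Returns nonzero if every byte of the block can be reached from the base
 * register, where base_offset is the (hypothetical) distance from the base
 * register to the start of the block. */
int block_reachable(long base_offset)
{
    return fits_imm16(base_offset) &&
           fits_imm16(base_offset + (long)sizeof(struct globals) - 1);
}
```

Anything that passes this check needs no separate address register, which is where the reduced register pressure comes from.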
I've also disabled the dynamic branch prediction logic to get guaranteed branch timings.
With code and data in tightly coupled memory, the measured timings then match the calculated ones.
I found only one undocumented pipeline stall: there is a 1-cycle stall for a read following a write to the same tightly coupled data memory.
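When calculating timings by hand, that stall can be added as a simple correction. The following is a minimal sketch, assuming the stall applies only when the read immediately follows the write and that the whole trace refers to one tightly coupled data memory. The enum and function names are invented for the example.

```c
#include <stddef.h>

/* Kinds of instruction in a hypothetical trace. */
enum op { OP_OTHER, OP_TCM_READ, OP_TCM_WRITE };

/* Counts the extra cycles from the stall described above: one cycle for each
 * read that immediately follows a write to the same tightly coupled memory. */
size_t count_write_read_stalls(const enum op *trace, size_t n)
{
    size_t stalls = 0;
    for (size_t i = 1; i < n; i++) {
        if (trace[i] == OP_TCM_READ && trace[i - 1] == OP_TCM_WRITE)
            stalls++;
    }
    return stalls;
}
```

Under this model, a write followed by an unrelated instruction and then a read costs nothing extra, so placing independent work between the two should hide the stall.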