I presume you are running code from SDRAM through the instruction cache.
This will give different timings depending on whether the code is cache-resident or not. The last three sets of values are fairly similar.
To get guaranteed execution times you really need to execute code from dedicated instruction memory (avoiding I-cache issues). You also need to worry about the branch predictor if you want cycle-accurate counts.
However, any Avalon-MM cycles may also vary by a few clocks if the target (slave) is busy.
For the above code, all the time is spent in the timestamp routines and in printf(). The IOWR accesses are probably about 3 clocks each (depending on the timings of your slave).
If you want to count clocks, add a 32-bit up-counter to your own slave device, clocked by sys_clk, and read that directly. None of the 'standard' timer stuff is really appropriate for high-resolution timestamps.
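As a rough sketch of the software side of that counter (not tested on hardware): the helper names cycles_now and cycles_elapsed are made up for illustration, and the register address is not shown because it depends on where your slave sits in your system's address map.

```c
#include <stdint.h>

/*
 * Sketch only. Assumes a custom Avalon-MM slave that exposes a free-running
 * 32-bit up-counter (clocked by sys_clk) as a single read-only register.
 * The register address is system-specific, so it is passed in as a pointer.
 */

/* Read the current counter value. 'volatile' forces a real load on each call. */
static inline uint32_t cycles_now(const volatile uint32_t *counter_reg)
{
    return *counter_reg;
}

/* Clocks elapsed between two readings. Unsigned subtraction is modulo 2^32,
 * so the result is still correct if the counter wrapped once in between. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

Take a reading before and after the code under test and pass both to cycles_elapsed(). Two back-to-back readings with nothing in between give the cost of the read itself, which you can subtract. If your Nios II has a data cache, do the read with IORD() from io.h rather than a plain pointer dereference so that it bypasses the cache.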