You haven't said which CPU you are using (/f, /s or /e).
Are you executing the code out of tightly coupled memory, or via the instruction cache?
If you are using the instruction cache and are looking at the first two writes, then instruction cache fills may be taking the extra time.
In that case, the times between the later writes would be faster.
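One way to check this is to timestamp a run of writes and compare the first gaps against the later ones. Below is a minimal sketch in C of just the arithmetic, assuming you already have some free-running cycle counter captured after each write. The helper names (write_gaps, steady_state_gap) and the stamps array are hypothetical, not part of any vendor API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: stamps[i] is the value of a free-running 32-bit
 * cycle counter captured right after write i. Fills gaps[i] with the
 * number of cycles between write i and write i+1 and returns how many
 * gaps were written (n - 1, or 0 if there are fewer than two stamps).
 * Unsigned subtraction gives the right answer across one counter wrap.
 */
static size_t write_gaps(const uint32_t *stamps, size_t n, uint32_t *gaps)
{
    assert(stamps != NULL && gaps != NULL);
    if (n < 2) {
        return 0;
    }
    for (size_t i = 0; i + 1 < n; i++) {
        gaps[i] = stamps[i + 1] - stamps[i];
    }
    return n - 1;
}

/*
 * Hypothetical helper: ignores the first `skip` gaps (assumed to include
 * any cache-fill penalty) and returns the smallest remaining gap as an
 * estimate of the steady-state cost. Returns 0 if nothing is left.
 */
static uint32_t steady_state_gap(const uint32_t *gaps, size_t count, size_t skip)
{
    assert(gaps != NULL);
    if (skip >= count) {
        return 0;
    }
    uint32_t best = gaps[skip];
    for (size_t i = skip + 1; i < count; i++) {
        if (gaps[i] < best) {
            best = gaps[i];
        }
    }
    return best;
}
```

If gaps[0] is much larger than steady_state_gap(gaps, count, 1), that points at a one-off cost such as a cache fill rather than the bus itself. If every gap is about the same, the delay is more likely in the path to the peripheral.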
The /f CPU will execute one instruction every clock (from tightly coupled instruction memory or the instruction cache). 11 clocks is somewhere near the figure for a clock-crossing bridge (10 clocks??).
I've measured 3 clocks for Avalon-MM transfers to a local PIO (with 1 clock of delay configured for reads and writes); on-chip memory might do writes with 0 clocks of delay.