For performance measurements I got our HW guys to put a 32-bit up-counter, clocked by sys_clk, onto an Avalon bus, and I read that. (Reading the counter with a custom instruction would save some clocks!)
I still needed to add asm("":::"memory") lines around the counter reads to stop gcc caching memory values in registers (it acts as a memory barrier to the compiler only, and generates no instructions). I then checked the generated code (e.g. with objdump) to make sure the code I meant to time really was between the two counter reads.
Looking at the generated asm will also let you improve performance significantly, by writing C that gcc can compile to better code!
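
A rough sketch of the counter read plus barriers described above, not the exact code I used. The names (cycle_counter, read_cycles, time_region, work_under_test) are made up for illustration, and the counter here is a plain variable standing in for the hardware register so the code runs on a host; on the real system you would point cycle_counter at wherever the counter sits on the Avalon bus.

```c
#include <stdint.h>

/* Host stand-in for the hardware counter so the sketch runs anywhere.
 * On the target, point cycle_counter at the counter's real Avalon
 * address instead (system-specific, not shown here). */
static volatile uint32_t fake_counter;
static volatile uint32_t *const cycle_counter = &fake_counter;

/* Compiler-only barrier: gcc must not keep memory values cached in
 * registers across it or move memory accesses past it.
 * It emits no instructions. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

/* The volatile access forces a real bus read on every call. */
uint32_t read_cycles(void)
{
    return *cycle_counter;
}

/* Unsigned subtraction stays correct across a single counter wrap. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Time one call of fn, with barriers on both sides of each read. */
uint32_t time_region(void (*fn)(void))
{
    uint32_t start, end;

    COMPILER_BARRIER();
    start = read_cycles();
    COMPILER_BARRIER();

    fn();                       /* code being measured */

    COMPILER_BARRIER();
    end = read_cycles();
    COMPILER_BARRIER();

    return elapsed_cycles(start, end);
}

/* Hypothetical workload. On the host it also advances the fake
 * counter, since nothing else is clocking it. */
void work_under_test(void)
{
    fake_counter += 100;
}
```

The barrier only constrains memory accesses, so gcc can still move pure register arithmetic across the counter reads, and if the timed code is inlined it may be rearranged further. That is why the objdump check is still needed.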