I wrote a C program that analysed the object listing (generated by gcc) and calculated the actual execution time for the code.
I then tweaked the C source to remove mispredicted branches and stalls following memory reads (etc). This was made easier because I had arranged for the code to have no non-inlined function calls.
To squeeze the last clock cycle out you need to:
1) mark conditionals with __builtin_expect() to select the 'fall through' path
2) put dummy asm instructions in otherwise empty parts of conditionals so that gcc will generate a forwards jump (to the asm contents) and then jump backwards
3) use asm volatile("#gcc_membar, line " STR(__LINE__) "\n" ::: "memory") at various places to control which memory values gcc has cached in registers (can force reads early and force writes to avoid local variables)
4) build a better gcc (see the wiki, gcc4 seems worse!) so that structures can be put into the 'small memory' area.
5) get Altera to tell you how to disable the dynamic branch predictor.
6) don't use volatile for 8-bit or 16-bit items (gcc masks/sign extends them after doing the correct memory read).
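For anyone wanting to try items 1 to 3, here is a rough sketch of how they might look in source. The helper names (unlikely, gcc_membar, asm_comment, clamp_sample, event_count) are made up for illustration and are not from my actual code. It also assumes the target assembler treats '#' as a comment character. Whether gcc really lays out the branches the way you want depends on the gcc version and options, so check the object listing.

```c
/* Two-level stringify so that __LINE__ is expanded to its number first. */
#define STR1(x) #x
#define STR(x)  STR1(x)

/* Item 1: tell gcc which way a conditional normally goes.
 * (Hypothetical wrapper name; __builtin_expect() itself is a gcc builtin.) */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Item 3: compiler-only barrier. It emits nothing but an assembler comment,
 * but the "memory" clobber stops gcc keeping memory values cached in
 * registers across it and forces pending stores to be written out. */
#define gcc_membar() \
    __asm__ volatile("#gcc_membar, line " STR(__LINE__) "\n" ::: "memory")

/* Item 2: dummy asm (just a comment) so an otherwise empty arm of a
 * conditional has some contents for gcc to place. */
#define asm_comment(txt) __asm__ volatile("# " txt "\n")

unsigned int event_count;   /* hypothetical memory-resident counter */

/* Hypothetical example function: clamp v to limit and count the call. */
unsigned int clamp_sample(unsigned int v, unsigned int limit)
{
    if (unlikely(v > limit)) {
        v = limit;                        /* rare path */
    } else {
        asm_comment("common path");       /* otherwise empty arm */
    }

    event_count++;
    gcc_membar();   /* event_count must be written to memory by here */

    return v;
}
```

The line number in the barrier comment makes it easy to find each barrier in the object listing when checking what gcc actually generated.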
I did manage to get my code to execute at the calculated rate.
In particular I needed to minimise the worst-case code path, not the common one!