You will need to understand the object code generated by the compiler.
Compiling with (IIRC) -S -fverbose-asm will give an annotated assembly output.
You might find either that you've miscounted the number of instructions, or that there are some pipeline stalls because values read from memory can't be used for the next two clocks (i.e. there is a stall if either of the next two instructions uses the read value).
A quick look at your code makes me think that you need to copy some values to locals. I suspect a lot of them are being reread from memory multiple times in the loop.
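
As a rough sketch of what I mean by copying to locals (I'm not working from your actual code, so the struct, field names and functions here are made up for illustration): in the first version the compiler may have to assume that a store through out[] could change f->gain or f->offset, so it can end up reloading both on every pass. The second version reads each one once before the loop.

```c
#include <stddef.h>

/* Hypothetical state struct; these names are illustrative only. */
struct filter {
    int gain;
    int offset;
};

/* Before: out[] and the struct fields are all int, so the compiler may
   assume a store to out[i] could alias f->gain or f->offset and reload
   both from memory on each iteration. */
void scale_reload(const struct filter *f, const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * f->gain + f->offset;
}

/* After: each value is read once into a local, which the compiler is
   free to keep in a register for the whole loop. */
void scale_locals(const struct filter *f, const int *in, int *out, size_t n)
{
    const int gain = f->gain;
    const int offset = f->offset;

    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * gain + offset;
}
```

Both functions compute the same result; whether the second is actually faster depends on your compiler and optimisation level, so compare the assembly output for the two loops rather than taking my word for it.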