At times like these, what I typically do is simulate the system, watch the instructions being executed (via the program counter), and compare them to the objdump file. This should give a lot of insight into whether cache misses or other performance penalties are occurring.
Also, I would highly recommend starting and stopping the performance counter *infrequently*. The performance counter is accurate, but every start and stop carries a small amount of overhead. If you constantly start and stop it, that overhead adds up, and compared to a single register increment instruction it will be very significant. Typically people start the counter before entering a loop and stop it after the loop completes. If you know how many iterations the loop ran, you can divide the total by the iteration count to get a ballpark estimate of how long each iteration took.
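Here is a rough sketch of that pattern in C: start once before the loop, stop once after it, then divide by the iteration count. I'm using the standard `clock()` as a stand-in for the hardware performance counter so the snippet builds anywhere. The names `section_timer`, `timer_start`, `timer_stop`, `ticks_per_iteration` and `measure_loop` are made up for illustration, so swap in whatever start/stop calls your counter's driver actually provides.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the hardware performance counter.
 * clock() is used only so the sketch is portable; it is an assumption,
 * not the real counter API. */
typedef struct {
    clock_t start;  /* tick value captured at the last start */
    clock_t total;  /* ticks accumulated over all start/stop pairs */
} section_timer;

static void timer_start(section_timer *t) { t->start = clock(); }
static void timer_stop(section_timer *t)  { t->total += clock() - t->start; }

/* Average ticks per iteration from one measurement over the whole loop. */
static double ticks_per_iteration(clock_t total_ticks, unsigned long iterations)
{
    assert(iterations > 0);
    return (double)total_ticks / (double)iterations;
}

/* volatile so the compiler cannot optimize the increment away */
static volatile uint32_t reg;

double measure_loop(unsigned long iterations)
{
    section_timer t = {0, 0};

    timer_start(&t);                 /* one start, outside the loop */
    for (unsigned long i = 0; i < iterations; i++) {
        reg++;                       /* the work being measured */
    }
    timer_stop(&t);                  /* one stop, after the loop */

    return ticks_per_iteration(t.total, iterations);
}
```

Calling something like `measure_loop(1000000UL)` pays the start/stop overhead once, spread over a million iterations, instead of once per increment. The result is still only a ballpark figure, since it also includes the loop's own compare and branch.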
Edit: just realized I said the same thing DSL said, so +1 :)