I'm not sure I'd use the profiler; its statistical nature (usually based on timer ticks) means it is of limited use.
You can probably guess which parts of the code take the time. Implement a TSC (cycle counter) instruction and use it to count the number of cycles taken to get through each code section, then use the elapsed counts to generate a histogram of how long each section takes.
I have these defines:
#define STAMP_SET(stamp) ((stamp) = SYS_CLK_COUNT())
#define STAMP_COUNT(array, stamp, factor) \
    do { \
        unsigned int new_stamp, bucket, count; \
        new_stamp = SYS_CLK_COUNT(); \
        bucket = ((new_stamp - (stamp)) & ((nelem(array) - 1) << (factor))) >> (factor); \
        count = (array)[bucket] + 1; \
        (stamp) = new_stamp; \
        (array)[bucket] = count; \
    } while (0)
which I use to generate 64-entry histograms of the execution time of code blocks.
I've a comment that says the above costs about 10 clocks - I can't remember if that includes the Avalon read to get the cycle count.
Since my code runs from tightly coupled instruction/data memory and the few Avalon xfers are usually uncontended, the clock counts I get match those I calculate from the object code. SDRAM accesses perturb things somewhat (but I don't have many of those).
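As a rough sketch of how the defines get used (not my real code): SYS_CLK_COUNT() is stubbed here with a fake counter so it builds on a host, and nelem(), fake_work() and hist_parse are names made up for the example. On the target SYS_CLK_COUNT() would read the real cycle counter.

```c
#include <stdint.h>

/* Assumption: stand-in for the hardware cycle counter so this builds on a host. */
static uint32_t fake_cycles;
#define SYS_CLK_COUNT() (fake_cycles)

/* Assumption: the usual element-count helper; array size must be a power of 2. */
#define nelem(a) (sizeof(a) / sizeof((a)[0]))

#define STAMP_SET(stamp) ((stamp) = SYS_CLK_COUNT())
#define STAMP_COUNT(array, stamp, factor) \
    do { \
        unsigned int new_stamp, bucket, count; \
        new_stamp = SYS_CLK_COUNT(); \
        bucket = ((new_stamp - (stamp)) & ((nelem(array) - 1) << (factor))) >> (factor); \
        count = (array)[bucket] + 1; \
        (stamp) = new_stamp; \
        (array)[bucket] = count; \
    } while (0)

/* One 64-entry histogram per code block being measured. */
unsigned int hist_parse[64];

/* Hypothetical workload: just advances the fake counter. */
static void fake_work(uint32_t cycles)
{
    fake_cycles += cycles;
}

void demo_histogram(void)
{
    unsigned int stamp;

    STAMP_SET(stamp);

    /* factor 2 makes each bucket 4 cycles wide, so 64 buckets span 0..255. */
    fake_work(5);
    STAMP_COUNT(hist_parse, stamp, 2);   /* 5 cycles  -> bucket 1  */
    fake_work(6);
    STAMP_COUNT(hist_parse, stamp, 2);   /* 6 cycles  -> bucket 1  */
    fake_work(40);
    STAMP_COUNT(hist_parse, stamp, 2);   /* 40 cycles -> bucket 10 */

    /* The mask wraps rather than clamps: 300 cycles lands in bucket 11. */
    fake_work(300);
    STAMP_COUNT(hist_parse, stamp, 2);
}
```

Since STAMP_COUNT() also updates the stamp, you can chain several of them (each with its own array) through consecutive code sections. Pick the factor so the times you expect fit inside the 64 buckets, otherwise long times alias onto short ones.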
To get the code to run fast(er):
1) Ensure functions that can be inlined are inlined. Try to get everything inlined (if code space permits).
Mostly this reduces register pressure.
2) Try to use global register variable(s) to access static data (ie put it all in a single 'struct'). This generates slightly better code (even after my patches to gcc) than using 'small data'. With care you can use %gp as the global register.
Without this the compiler will need a register to reference each global - and I've seen it have two registers pointing to the same global.
3) Ensure your C doesn't force the compiler to keep re-reading variables from memory (eg because a write via a 'char *' might overwrite the same location).
4) Avoid having too many live values in a function; gcc will create virtual registers and then spill them to stack. Sometimes it is necessary to force gcc to write the register values out to memory (asm volatile ("" ::: "memory") is your friend here).
5) Avoid read delay stalls. gcc hasn't been told about these properly. Sometimes you need to force a read early.
6) Find out how to disable the dynamic branch predictor, and set all the branches with correct static prediction.
7) I could carry on ....
I got considerable speedup from the above - and that was after I thought I'd made a good job of the code!
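To illustrate (3), here is a minimal sketch of the re-read problem. The struct and function names are invented for the example. Because a 'char *' is allowed to alias any object, the compiler generally has to assume the store through dst might change c->len, and so reloads it on every pass round the loop. Copying it to a local first removes that.

```c
#include <stddef.h>

/* Hypothetical context structure, only for this example. */
struct ctx {
    size_t len;
};

/* c->len is typically reloaded every iteration: as far as the compiler
 * can tell, the write through 'dst' (a char *) might overwrite it. */
void fill_slow(struct ctx *c, char *dst, char v)
{
    size_t i;

    for (i = 0; i < c->len; i++)
        dst[i] = v;
}

/* Reading the length into a local once means the loop bound can
 * stay in a register for the whole loop. */
void fill_fast(struct ctx *c, char *dst, char v)
{
    size_t i, len = c->len;

    for (i = 0; i < len; i++)
        dst[i] = v;
}
```

Both give the same result whenever dst doesn't actually overlap *c. Check the generated object code to see whether the reload is really there in your case.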