Forum Discussion
Altera_Forum
Honored Contributor
15 years ago

--- Quote Start ---
The delay() loop should end up being something like:
delay:
ori r2,r0,<loop count> # for a 16bit constant
loop:
addi r2,r2,-1 # an immediate operand needs addi rather than add
bne r0,r2,loop
ret

With the /f cpu, the add instruction takes 1 clock, the conditional branch (as coded) takes 2 clocks when going round the loop, and 4 when the loop exits. It is possible to get the branch to be 1 clock in the loop exit path, by jumping forwards to an unconditional branch and disabling the dynamic branch predictor. In this case the loop would be 4 + 2 clocks.
--- Quote End ---

Thanks. This is a nice assembler loop, I agree. But how do we know that the given C compiler would generate this optimized loop?

You also did not include the cycles needed to fetch the instructions and move them through the pipe before the add can execute in 1 cycle. And the objective was to compare two different CPU architectures by running C code generated by two different compilers.

The ori must be fetched and completed before the add can be done. That is something like 2-3 cycles of memory access, plus 5 cycles through the pipe. Assuming the add fetch was started a cycle after the ori fetch, it is probably ready to execute, OK.

Now the add result may be written to the register and then compared, the result of the compare is then used to determine the next instruction to fetch, and only after that memory access does the next instruction start through the pipe.
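
To make the quoted numbers concrete, here is a rough sketch in C of the arithmetic they imply. It uses only the figures given in the quote (1 clock for the add, 2 for the branch while looping, 4 for the branch on exit) and deliberately ignores instruction fetch and pipeline fill, which is exactly the part in dispute. The ori and ret are left out because no clock counts were given for them. The function name and constants are hypothetical, chosen only for illustration.

```c
/* Clock costs as stated in the quoted post for the /f cpu.
 * These are assumptions taken from that post, not measured values. */
enum {
    ADD_CLOCKS          = 1, /* add/decrement of the counter          */
    BRANCH_LOOP_CLOCKS  = 2, /* bne when it goes round the loop again */
    BRANCH_EXIT_CLOCKS  = 4  /* bne when the loop exits               */
};

/* Hypothetical helper: estimated clocks spent in the add/bne loop
 * for n iterations. Fetch and pipeline-fill cycles are NOT modelled. */
unsigned long delay_loop_clocks(unsigned long n)
{
    if (n == 0) {
        /* The assembly loop would wrap the counter; not modelled here. */
        return 0;
    }
    /* The first n-1 iterations loop back, the last one falls through. */
    return (n - 1) * (ADD_CLOCKS + BRANCH_LOOP_CLOCKS)
         + (ADD_CLOCKS + BRANCH_EXIT_CLOCKS);
}
```

For a loop count of 1000 this comes to 999 * 3 + 5 = 3002 clocks under those assumptions. Any extra memory access or pipeline cycles would have to be added on top, and the C compiler would still have to emit this exact loop for the estimate to apply.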