Altera_Forum
15 years ago

--- Quote Start ---
Thanks. This is a nice assembler loop, I agree. How do we know that the given C compiler would generate this optimized loop? Also, you did not include the cycles needed to fetch the instructions and move them through the pipe so that the add to register can be done in 1 cycle. Also, the objective was to compare two different CPU architectures by running C code generated by two different compilers.
--- Quote End ---

gcc will generate that code, probably from the given source, but the variable might need to be changed to 32 bits (i.e. not a short).

--- Quote Start ---
Also, the ori must be fetched and completed before the add can be done. That is something like 2-3 cycles of memory access, plus 5 cycles through the pipe. Assuming the add fetch was started a cycle after the ori fetch, it is probably ready to execute, OK. Now the add result may be written to the register, then compared, then the result used to determine the next instruction to fetch; then, after the memory access, that instruction will start through the pipe.
--- Quote End ---

For the /f core, executing from tightly coupled memory (or the instruction cache), the instruction fetches can be ignored. The pipeline loss for the call to the delay function would be attributed to the call instruction (and is 2 clocks). The 'ori', 'add' and 'beq' execute in adjacent clocks. The backwards conditional branch will be (statically) predicted as taken, so it takes 2 clocks.

Actually, such a small function is probably best marked 'inline' (or written as a #define) in order to make more registers available to the calling code. I've actually removed all subroutine calls from my code in order to give the compiler the best chance of not running out of registers. The only accesses to %sp are the pointless saving of the caller-saved registers on entry to a function that doesn't ever return!
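
To make the two suggestions concrete (a counter that is 32 bits wide rather than a short, and a delay function marked 'inline'), here is a rough sketch. The name delay_loop, its signature and the returned iteration count are my own assumptions for illustration; this is not the source being discussed. The empty asm statement is a GCC/Clang extension used here only to stop the optimizer from deleting the loop. Which instructions the compiler actually emits depends on its version and flags, so check the disassembly rather than trusting the sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical delay helper: the name, signature and return value are
 * assumptions for illustration, not the code from this thread.
 *
 * The counter is a full 32-bit value, so on a 32-bit target the compiler
 * has no need to mask or sign-extend it the way it might for a short.
 * 'static inline' lets the compiler fold the loop into the caller
 * instead of emitting a call.
 */
static inline uint32_t delay_loop(uint32_t count)
{
    uint32_t iterations = 0;

    while (count != 0) {
        /* Empty asm (GCC/Clang extension): tells the compiler 'count' may
           have changed, so the otherwise empty loop cannot be removed. */
        __asm__ volatile ("" : "+r" (count));
        count--;
        iterations++;
    }

    /* Postcondition; compiled out when NDEBUG is defined. */
    assert(count == 0);

    /* Returned only so the sketch can be checked on a host machine. */
    return iterations;
}
```

A caller would just write delay_loop(1000); and ignore the result. On real hardware you would probably drop the iteration count entirely so that the loop only needs the one register.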