Forum Discussion

Altera_Forum
Honored Contributor
14 years ago

Tightly-coupled memory performance

I wanted to compare the performance between cache and tightly-coupled memory. So I did the following experiment:

tic
call the function()
toc

tic
call the same function()
toc

I noticed that the second run was slightly faster than the first run.

All code + data are placed in the on-chip tightly-coupled memory, even the stack.

Can anyone comment on this behavior?

9 Replies

  • Altera_Forum (Honored Contributor)

    I would look at the assembled code (objdump file) to see what the compiler is doing. I'm guessing the register-preserving operations for the first call are not being duplicated for the second call, and as a result the second call is faster. This would have nothing to do with tightly-coupled memory; it's just a code optimization.

  • Altera_Forum (Honored Contributor)

    The calling function is very simple; it is the called function that contains the reading loop.

    The called function should be the same for each call, right?
  • Altera_Forum (Honored Contributor)

    The called function's assembly code:

    05009254 <corner_turn_main>:
     5009254: 2015883a  mov r10,r4
     5009258: 0013883a  mov r9,zero
     500925c: 00001506  br 50092b4 <corner_turn_main+0x60>
     5009260: 40800017  ldw r2,0(r8)
     5009264: 3885883a  add r2,r7,r2
     5009268: 11000017  ldw r4,0(r2)
     500926c: 01400784  movi r5,30
     5009270: 3145383a  mul r2,r6,r5
     5009274: 1245883a  add r2,r2,r9
     5009278: 1085883a  add r2,r2,r2
     500927c: 1085883a  add r2,r2,r2
     5009280: 00c14474  movhi r3,1297
     5009284: 18e7c604  addi r3,r3,-24808
     5009288: 10c5883a  add r2,r2,r3
     500928c: 11000015  stw r4,0(r2)
     5009290: 00c00044  movi r3,1
     5009294: 30cd883a  add r6,r6,r3
     5009298: 00800104  movi r2,4
     500929c: 388f883a  add r7,r7,r2
     50092a0: 317fef1e  bne r6,r5,5009260 <corner_turn_main+0xc>
     50092a4: 48d3883a  add r9,r9,r3
     50092a8: 5095883a  add r10,r10,r2
     50092ac: 00800a04  movi r2,40
     50092b0: 48800426  beq r9,r2,50092c4 <corner_turn_main+0x70>
     50092b4: 5011883a  mov r8,r10
     50092b8: 000d883a  mov r6,zero
     50092bc: 000f883a  mov r7,zero
     50092c0: 003fe706  br 5009260 <corner_turn_main+0xc>
     50092c4: f800283a  ret

    The calling function assembly code:

     5008cc8: 04010034  movhi r16,1024
     5008ccc: 84062804  addi r16,r16,6304
     5008cd0: 84800037  ldwio r18,0(r16)
     5008cd4: a009883a  mov r4,r20
     5008cd8: 50092540  call 5009254 <corner_turn_main>
     5008cdc: 84400037  ldwio r17,0(r16)
     5008ce0: 8ca3c83a  sub r17,r17,r18
     5008ce4: 04c000b4  movhi r19,2
     5008ce8: 9cc7af04  addi r19,r19,7868
     5008cec: 9809883a  mov r4,r19
     5008cf0: 880b883a  mov r5,r17
     5008cf4: 000feb00  call feb0 <printf>
     5008cf8: 84800037  ldwio r18,0(r16)
     5008cfc: a009883a  mov r4,r20
     5008d00: 50092540  call 5009254 <corner_turn_main>
     5008d04: 84000037  ldwio r16,0(r16)
     5008d08: 84a1c83a  sub r16,r16,r18
     5008d0c: 9809883a  mov r4,r19
     5008d10: 800b883a  mov r5,r16
     5008d14: 000feb00  call feb0 <printf>
  • Altera_Forum (Honored Contributor)

    Actually, I just meant: look at the calling function's assembly code to see whether some of the instructions before the 'call' are omitted for the second one.

    Assuming the instructions and data of the called function are all located in the tightly-coupled memory, and the code executes the same instructions on each call, then it should have the same execution time. But that doesn't mean the stack operations leading into the call will be the same both times. So look at the instructions before each 'call' to make sure you are not simply seeing additional work being done for the first call that is omitted for the second one.
  • Altera_Forum (Honored Contributor)

    Thank you BadOmen,

    I understand what you are saying. However, I still find this difference strange. The compiler did not do anything differently for the second call, and the difference grows as the number of loop iterations increases!

    I need somebody to help me understand this issue.
  • Altera_Forum (Honored Contributor)

    I'd guess at the dynamic branch predictor changing the latency of some branches.

    There is a hidden menu (in SOPC Builder, at least) with some extra Nios CPU options. One of them removes the dynamic branch-prediction logic and uses static prediction instead: 'assume backwards taken' and 'forwards not taken'.

    It's actually a shame there isn't an 'assume all not taken' option.

    I needed to minimise the worst-case path, so I had to persuade gcc to generate forward branches to backwards jumps in quite a few places (by adding an asm volatile () that only contains comments).
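    The asm-volatile trick mentioned above can be sketched as follows. This is a sketch of one plausible shape, assuming GCC; the exact code layout produced is version- and target-dependent. The asm body contains only an assembler comment, so it emits no instructions, but GCC must treat it as opaque and so tends to keep that arm out of the straight-line path, leaving the common path as a forward not-taken branch, which matches the static 'forwards not taken' rule.

    ```c
    /* error_count and process() are hypothetical names for illustration. */
    static int error_count;

    int process(int value)
    {
        if (value < 0) {
            /* Comment-only asm: no instructions emitted, but opaque to
             * GCC, nudging it to place this rare arm out of line. */
            __asm__ volatile ("# keep this error path out of line");
            error_count++;
            return -1;
        }
        return value * 2;   /* common fast path, falls through */
    }
    ```

    Checking the objdump output afterwards is the only way to confirm the branches actually landed in the intended direction.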
  • Altera_Forum (Honored Contributor)

    Thank you dsl, that makes sense. However, is there any trick to force the Nios II/f to use the static branch predictor?

  • Altera_Forum (Honored Contributor)

    SOPC Builder has a hidden menu that allows some additional configuration of the Nios II CPU.

    If you ask your FAE, they might tell you how to find it.

    I'm not sure its presence is supposed to be public knowledge!