Altera_Forum
Honored Contributor
9 years ago
Nios II/e timing anomalies
I'm experiencing some rather strange timing anomalies when running my Nios II/e based system, and I was hoping that someone could help me shed some light on the problem. In the code example below, I'm creating an array of structs of type foo_t, each containing two member variables a and b of type int. Note that the third member variable c is commented out for now. I then iterate over the array, setting the member variable a of each struct to zero. Each of the three loop iterations is timed using the Altera performance counter and reported at the end of the program. I compile the program with no code optimisation using Nios II 14.1 Software Build Tools for Eclipse/GCC. The modules used for the HW platform on which I'm running the program can be seen here (http://imgur.com/1xczgti). I have two interval timers present, but as you can see from the code below, they are never initialized.
#include <stdio.h>
#include "system.h"
#include "altera_avalon_performance_counter.h"
#define ITERATIONS 3
typedef struct foo
{
    int a;
    int b;
    //int c;
} foo_t;
/* Array of structs; the size is assumed to be ITERATIONS (one element per loop pass). */
foo_t foo_arr[ITERATIONS];

int main()
{
    int i;

    PERF_RESET(PERFORMANCE_COUNTER_0_BASE);
    PERF_START_MEASURING(PERFORMANCE_COUNTER_0_BASE);

    for (i = 0; i < ITERATIONS; i++)
    {
        /* Section 1+i times loop iteration i. */
        PERF_BEGIN(PERFORMANCE_COUNTER_0_BASE, 1+i);
        foo_arr[i].a = 0;
        PERF_END(PERFORMANCE_COUNTER_0_BASE, 1+i);
    }

    PERF_STOP_MEASURING(PERFORMANCE_COUNTER_0_BASE);

    perf_print_formatted_report((void *)PERFORMANCE_COUNTER_0_BASE, alt_get_cpu_freq(), 3,
                                "Iteration 0", "Iteration 1", "Iteration 2");
    return 0;
}
OK, so we know that the Nios II/e has neither cache memories nor branch prediction, so we would expect all three iterations of the loop to require the same number of cycles. This is confirmed when we look at the timing report. Let's refer to this as Case A.

Iteration 0: 124 clock cycles
Iteration 1: 124 clock cycles
Iteration 2: 124 clock cycles

Now comes the part that I'm struggling to understand: if we add the third member variable c to the foo_t struct, but leave the rest of the code as it is, the loop iterations no longer execute in the same number of clock cycles. Let's refer to this as Case B.

Iteration 0: 155 clock cycles
Iteration 1: 199 clock cycles
Iteration 2: 236 clock cycles

Here is the disassembly of the line foo_arr[i].a = 0 in the two cases:

Case A:
000402c8: movhi r3,5
000402cc: addi r3,r3,12872
000402d0: ldw r2,-4(fp)
000402d4: slli r2,r2,3
000402d8: add r2,r3,r2
000402dc: stw zero,0(r2)

Case B:
000402cc: movhi r16,5
000402d0: addi r16,r16,12888
000402d4: ldw r2,-8(fp)
000402d8: mov r4,r2
000402dc: movi r5,12
000402e0: call 0x4038c <__mulsi3>
000402e4: add r2,r16,r2
000402e8: stw zero,0(r2)

In Case A the array index is scaled by the 8-byte struct size with a shift (slli by 3). In Case B the struct is 12 bytes, and a call to the multiplication function __mulsi3 is made instead. OK, we know that the Nios II/e does not have hardware support for multiplication, fair enough. __mulsi3 is implemented in lib2-mul.c:
SItype
__mulsi3 (SItype a, SItype b)
{
    SItype res = 0;
    USItype cnt = a;

    while (cnt)
    {
        if (cnt & 1)
            res += b;
        b <<= 1;
        cnt >>= 1;
    }
    return res;
}
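One thing that can be checked directly from this code is how many times the while loop runs for the operands seen in the Case B disassembly, where r4 holds the loop index i (argument a) and r5 holds 12 (argument b). The following is a minimal host-side sketch in plain C, not the libgcc source: the names mulsi3_counted and report_mulsi3_passes are made up for illustration, and unsigned arithmetic is assumed to be an acceptable stand-in for SItype with these small values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Copy of the shift-and-add loop from __mulsi3, instrumented to count
   how many times the loop body runs. Hypothetical helper, host-side only. */
uint32_t mulsi3_counted(uint32_t a, uint32_t b, unsigned *passes)
{
    uint32_t res = 0;
    uint32_t cnt = a;

    *passes = 0;
    while (cnt)
    {
        if (cnt & 1)
            res += b;      /* add the shifted multiplicand for each set bit */
        b <<= 1;
        cnt >>= 1;
        (*passes)++;       /* one pass per bit up to the highest set bit of a */
    }
    return res;
}

/* Hypothetical helper: print the pass count for each loop index i,
   using the element size 12 taken from "movi r5,12" in Case B. */
void report_mulsi3_passes(void)
{
    uint32_t i;

    for (i = 0; i < 3; i++)
    {
        unsigned passes;
        uint32_t offset = mulsi3_counted(i, 12, &passes);

        assert(offset == i * 12);  /* sanity check against the host multiply */
        printf("i = %u: offset = %u, loop passes = %u\n",
               (unsigned)i, (unsigned)offset, passes);
    }
}
```

Calling report_mulsi3_passes() from a small main on a PC shows the loop body running 0, 1 and 2 times for i = 0, 1 and 2. The work done inside __mulsi3 therefore depends on the value of a, which would be consistent with the cycle count growing from one iteration to the next in Case B, although this sketch does not by itself account for the exact numbers.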
So, in Case B we perform a multiplication. I understand that this adds to the total execution time of each iteration compared to Case A, but I would still expect each iteration to require the same number of clock cycles. At least Case B is deterministic in the sense that it keeps reporting the same numbers on every run.

Could anyone please try to explain what is happening here? If you require more information, just let me know!

/J