I didn't use any cycle-count comparisons; instead I connected a standard digital output to a logic analyzer. I set the output at the start of the function and reset it at the end. The function runs inside a large loop, so the set/reset overhead is negligible. The only code difference between the two targets is how the digital line is set/reset. Here is the floating-point code that was run on each processor:
In the main funct:
    pio_data |= set_mask[2];                              /* raise the timing line */
    IOWR_ALTERA_AVALON_PIO_DATA(USER_PIO_BASE, pio_data);
    fsum = TestFloatMult();
    pio_data &= reset_mask[2];                            /* drop the timing line */
    IOWR_ALTERA_AVALON_PIO_DATA(USER_PIO_BASE, pio_data);
Float-multiply benchmark function (a simple multiply/accumulate):
float TestFloatMult(void)
{
    register float *ptr1, *ptr2, sum;
    int i;

    sum = 0;
    ptr1 = &gFloatArray1[0];            /* gFloatArray1 is a randomly generated float array */
    ptr2 = &gFloatArray2[0];            /* gFloatArray2 is a randomly generated float array */
    for (i = 0; i < ARRAY_SIZE; i++)    /* ARRAY_SIZE = 1000 */
    {
        sum += *ptr1++ * *ptr2++;
    }
    return sum;
}
What I found is that the ARM executed it in 940 us and the NIOS in 7.0 ms. I went to great lengths (inspecting in mixed source/assembly mode) to make sure the ARM compiler did not optimize the loop away. Everything looked normal.
Rick