First, some context: this is for research. We are trying to model contention, which is why our code may seem simple and trivial.
The CPUs use the SDRAM for both instruction and data memory. They have small 4 KB caches, but our code is extremely small. We are basically using an architecture like the one in the three-processor tutorial by Altera. The strange thing is that we don't see a considerable degradation in performance from the fast processors (performance data is below the code). In the code, CPU1 reads macarray indices 1-3, CPU2 reads 4-7, and CPU3 reads 8-10. We run the code on all processors at the same time so that they fight for the lock. We expect the lock contention to increase, and it does. However, everything else (the execution time per loop iteration) seems like it shouldn't vary much.
Our code is:
#define FAST 2666667
#define STAN 1000000
#define ECON 200000
#define LOOP FAST
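#include "io.h"                                  /* IORD / IOWR */
#include "system.h"                              /* *_BASE addresses */
#include "altera_avalon_mutex.h"
#include "altera_avalon_performance_counter.h"

/* File-scope declarations were omitted from the post; these types are assumed. */
int *macarray;
alt_mutex_dev *mutex;

int main(void) {
    int i, j;
    int id = 1;
    int temp;

    macarray = (int*)MESSAGE_BUFFER_RAM_BASE;
    for (i = 1; i < 4; i++)
        IOWR(&macarray[i], 0, 1);                /* initialise this CPU's slots to 1 */

    mutex = altera_avalon_mutex_open("/dev/message_buffer_mutex");

    PERF_RESET(PERFORMANCE_CPU1_BASE);
    PERF_START_MEASURING(PERFORMANCE_CPU1_BASE);
    PERF_BEGIN(PERFORMANCE_CPU1_BASE, 1);        /* section 1: whole run */
    for (j = 0; j < LOOP; j++) {
        for (i = 1; i < 4; i++) {
            PERF_BEGIN(PERFORMANCE_CPU1_BASE, 2);    /* section 2: one inner iteration */
            PERF_BEGIN(PERFORMANCE_CPU1_BASE, 3);    /* section 3: lock acquisition */
            altera_avalon_mutex_lock(mutex, 1);
            PERF_END(PERFORMANCE_CPU1_BASE, 3);
            temp = IORD(&macarray[i], 0);
            IOWR(&macarray[i], 0, temp*i + temp);
            altera_avalon_mutex_unlock(mutex);
            PERF_END(PERFORMANCE_CPU1_BASE, 2);
        }
    }
    PERF_END(PERFORMANCE_CPU1_BASE, 1);
    PERF_STOP_MEASURING(PERFORMANCE_CPU1_BASE);

    return 0;
}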
In terms of performance:

Configuration     Total        Lock contention   Loop iteration
2 CPU fast        3707864005   2727460037        122.5504807
3 CPU fast        5219197128   4230256283        123.6175902
2 CPU standard    4004902599   2745356939        419.8485533
3 CPU standard    8519047729   6664269607        618.259374