loop-unrolling and memory access performance

Honored Contributor

7 years ago

Actually, my inner loop already produces large enough burst reads. In the profiler, I see that the cache hit rate is almost 1 and the memory access efficiency is 100%. On top of that, I want to have multiple parallel accesses to improve the bandwidth.

1- First, let me ask: if there are multiple accesses to the same global memory variable in different parts of the code, like

line 100: x = glmem;

line 200: y = glmem[j];

does that lead to port replication, or are the accesses done serially through a single port to glmem? According to your statement, the ports are replicated, right?

2- If so, does it make sense to unroll the loop manually, paying attention to the port replication points, i.e. unroll only the parts I need and leave the rest rolled? For example, instead of:

#pragma unroll 4
for (unsigned i = 0; i < 4000000; i++)
{
    // i, j, acc, s, e are local; the rest are global.
    acc = 0.0;
    s = start_index[i];
    e = end_index[i];
    for (unsigned j = s; j < e; ++j)
        acc += value[j]; // target memory access
    value_next[i] = acc;
}

I do this:

for (unsigned i = 0; i < 4000000; i += 4) {
    // kept rolled
    for (unsigned j = 0; j < 4; j++) {
        acc[j] = 0.0;
        s[j] = start_index[i + j];
        e[j] = end_index[i + j];
    }

    // unrolled by hand; I want to improve the performance of reading value[j]
    for (unsigned j = s[0]; j < e[0]; ++j) // a large burst access
        acc[0] += value[j];
    for (unsigned j = s[1]; j < e[1]; ++j) // a large burst access
        acc[1] += value[j];
    for (unsigned j = s[2]; j < e[2]; ++j) // a large burst access
        acc[2] += value[j];
    for (unsigned j = s[3]; j < e[3]; ++j) // a large burst access
        acc[3] += value[j];

    // kept rolled
    for (unsigned j = 0; j < 4; j++) {
        value_next[i + j] = acc[j];
    }
}
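To check that the hand-unrolled version still computes the same thing as the rolled one, here is a minimal host-side sketch in plain C. It is not kernel code and it says nothing about how the compiler builds the memory ports; it only compares the results. The function names (segment_sums_rolled, segment_sums_unrolled4, unrolled_matches_rolled) and the small test data are made up for illustration, and I am assuming that start_index, end_index and value_next are arrays indexed by i and that the trip count is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled reference: one segment sum per outer iteration. */
static void segment_sums_rolled(const float *value,
                                const unsigned *start_index,
                                const unsigned *end_index,
                                float *value_next, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (unsigned j = start_index[i]; j < end_index[i]; ++j)
            acc += value[j];
        value_next[i] = acc;
    }
}

/* Manual unroll by 4: four separate read sites on value[]. */
static void segment_sums_unrolled4(const float *value,
                                   const unsigned *start_index,
                                   const unsigned *end_index,
                                   float *value_next, size_t n)
{
    assert(n % 4 == 0); /* assumption: no remainder iterations */
    for (size_t i = 0; i < n; i += 4) {
        float acc[4];
        unsigned s[4], e[4];

        /* kept rolled: load the bounds of the four segments */
        for (unsigned k = 0; k < 4; ++k) {
            acc[k] = 0.0f;
            s[k] = start_index[i + k];
            e[k] = end_index[i + k];
        }

        /* unrolled by hand: one inner loop per segment */
        for (unsigned j = s[0]; j < e[0]; ++j) acc[0] += value[j];
        for (unsigned j = s[1]; j < e[1]; ++j) acc[1] += value[j];
        for (unsigned j = s[2]; j < e[2]; ++j) acc[2] += value[j];
        for (unsigned j = s[3]; j < e[3]; ++j) acc[3] += value[j];

        /* kept rolled: write the four results back */
        for (unsigned k = 0; k < 4; ++k)
            value_next[i + k] = acc[k];
    }
}

/* Hypothetical self-check on tiny made-up data; returns 1 on a match. */
static int unrolled_matches_rolled(void)
{
    enum { N = 8, M = 16 };
    float value[M];
    unsigned start_index[N] = {0, 1, 4, 4, 7, 9, 12, 15};
    unsigned end_index[N]   = {1, 4, 4, 7, 9, 12, 15, 16};
    float ref[N], out[N];

    for (int j = 0; j < M; ++j)
        value[j] = (float)(j + 1);

    segment_sums_rolled(value, start_index, end_index, ref, N);
    segment_sums_unrolled4(value, start_index, end_index, out, N);

    /* Each segment is summed in the same order, so exact equality holds. */
    for (int i = 0; i < N; ++i)
        if (ref[i] != out[i])
            return 0;
    return 1;
}
```

Calling unrolled_matches_rolled() from a small main and checking that it returns 1 is enough to confirm the two loop structures agree on this data, including the empty segment where the start and end indices are equal.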

Thanks a lot for your help :)
