Altera_Forum
7 years ago
Actually, my inner loop already has a large enough burst read. In the profiler, I see that the cache hit rate is almost 1 and the memory access efficiency is 100%. At another level, I want to have multiple parallel accesses to improve the bandwidth.
1- First, let me ask: if there are multiple accesses to the same global memory variable in different parts of the code, e.g. line 100: X = glmem[i]; and line 200: y = glmem[j];
does that lead to port replication, or are the accesses done serially through a single port to the glmem variable? According to your statement, the ports are replicated, right?
2- If so, is it reasonable to unroll the loop manually, paying attention to where ports get replicated? That is, unroll only the parts I need and leave the rest rolled? For example, instead of:
#pragma unroll 4
for (unsigned i = 0; i < 4000000; i++)
{
    // i, j, acc, s, e are local; the rest are global.
    acc = 0.0;
    s = start_index[i]; e = end_index[i];
    for (unsigned j = s; j < e; ++j)
        acc += value[j]; // target memory access
    value_next[i] = acc;
}

I do this:

for (unsigned i = 0; i < 4000000; i = i + 4)
{
    // kept rolled
    for (unsigned j = 0; j < 4; j++) {
        acc[j] = 0.0;
        s[j] = start_index[i+j];
        e[j] = end_index[i+j];
    }
    // unrolled; I want to improve the performance of reading value[j].
    for (unsigned j = s[0]; j < e[0]; ++j) // a large burst access
        acc[0] += value[j];
    for (unsigned j = s[1]; j < e[1]; ++j) // a large burst access
        acc[1] += value[j];
    for (unsigned j = s[2]; j < e[2]; ++j) // a large burst access
        acc[2] += value[j];
    for (unsigned j = s[3]; j < e[3]; ++j) // a large burst access
        acc[3] += value[j];
    // kept rolled
    for (unsigned j = 0; j < 4; j++) {
        value_next[i+j] = acc[j];
    }
}

Thanks a lot for your help :)
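For reference, here is a minimal host-side sketch in plain C (not kernel code) that checks whether the manually unrolled loop computes the same results as the original rolled loop. It says nothing about how the OpenCL compiler builds or replicates memory ports; it only checks functional equivalence. The function names, the float/unsigned types, the explicit length parameter n, and the small test data are assumptions made for illustration. The unrolled version also assumes that n is a multiple of 4, which holds for 4000000.

```c
#include <assert.h>

/* Rolled form: one range sum per outer iteration. */
static void sum_rolled(const float *value, const unsigned *start_index,
                       const unsigned *end_index, float *value_next,
                       unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        float acc = 0.0f;
        for (unsigned j = start_index[i]; j < end_index[i]; ++j)
            acc += value[j];
        value_next[i] = acc;
    }
}

/* Manually unrolled by 4, mirroring the structure in the post.
   Assumption: n is a multiple of 4; there is no remainder loop. */
static void sum_unrolled4(const float *value, const unsigned *start_index,
                          const unsigned *end_index, float *value_next,
                          unsigned n)
{
    assert(n % 4 == 0);
    for (unsigned i = 0; i < n; i += 4) {
        float acc[4];
        unsigned s[4], e[4];

        /* kept rolled: load the four ranges */
        for (unsigned j = 0; j < 4; j++) {
            acc[j] = 0.0f;
            s[j] = start_index[i + j];
            e[j] = end_index[i + j];
        }

        /* written out four times: four separate access sites to value[] */
        for (unsigned j = s[0]; j < e[0]; ++j) acc[0] += value[j];
        for (unsigned j = s[1]; j < e[1]; ++j) acc[1] += value[j];
        for (unsigned j = s[2]; j < e[2]; ++j) acc[2] += value[j];
        for (unsigned j = s[3]; j < e[3]; ++j) acc[3] += value[j];

        /* kept rolled: store the four results */
        for (unsigned j = 0; j < 4; j++)
            value_next[i + j] = acc[j];
    }
}

/* Returns 1 if both versions agree on small made-up data, else 0. */
static int check_equivalence(void)
{
    enum { N = 8, M = 16 };
    float value[M], a[N], b[N];
    unsigned start_index[N], end_index[N];

    for (unsigned k = 0; k < M; k++)
        value[k] = (float)(k + 1);
    for (unsigned i = 0; i < N; i++) {
        start_index[i] = i;          /* hypothetical ranges [i, 2i+1) */
        end_index[i] = 2 * i + 1;
    }

    sum_rolled(value, start_index, end_index, a, N);
    sum_unrolled4(value, start_index, end_index, b, N);

    for (unsigned i = 0; i < N; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Each range is summed in the same order in both versions, so the results can be compared exactly. If the trip count were not a multiple of 4, the unrolled version would need an additional remainder loop.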