--- Quote Start ---
- About the floating-point accumulator: the compiler doesn't complain as it does for those examples in the "best practices guide", so I think it is implemented as single-cycle, as specifically explained in programming guide section 1.6.12 (single-cycle FP accumulator; I use an Arria 10 device).
--- Quote End ---
Yes, if you are targeting Arria 10, you can take advantage of single-cycle floating-point accumulation, and your code is indeed constructed in a way that takes advantage of this feature. I believe I compiled your code against Stratix V, which is why I got an II of 16 when unrolling the inner loop. If you get an II of one after unrolling the inner loop, that is good and you can go ahead with the unrolling without needing the shift register optimization.
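For reference, this is a minimal sketch of the pattern in question (kernel and variable names are hypothetical, not from your code): the `acc += ...` feedback on a single variable is what the compiler maps to Arria 10's single-cycle floating-point accumulator, so the loop can keep an II of one even when unrolled. The shim at the top just lets the same OpenCL-C sketch compile as plain C for illustration.

```c
/* Shim so this OpenCL-C sketch also compiles as plain C for illustration */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Hypothetical single work-item kernel showing the accumulation pattern */
__kernel void dot_acc(__global const float *a,
                      __global const float *b,
                      __global float *result,
                      const int n)
{
    float acc = 0.0f;          /* single accumulator feeding back into itself */
    #pragma unroll 8           /* unrolling widens the datapath; on Arria 10
                                  the FP accumulator keeps the loop at II = 1 */
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];    /* acc = acc + x: the accumulator pattern */
    *result = acc;
}
```

On devices without the hardware accumulator (e.g. Stratix V), this same unrolled pattern is what produces the high II, which is why the shift register optimization is needed there.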
--- Quote Start ---
1- In a task, I want to achieve maximum memory bandwidth from a single port for a specific global array.
I guess that in the ideal case (burst, cache, ......, but without coalescing), if I get one 32-bit value of that variable per cycle, I am done, right?
(Unfortunately, the indices of consecutive accesses are random (dynamic indexing), so I think I cannot have coalesced accesses. To avoid confusion, I will open a new thread about the code.)
--- Quote End ---
Actually, no, you need a much larger access. I am not sure which Arria 10 board you are using, but the one I use has two banks of DDR4 memory running at 2133 MHz. Each bank is connected to the FPGA through a 64-bit bus (72-bit with ECC), and the memory controller on the FPGA runs at 266 MHz. Hence, if you want to saturate the memory bandwidth using one access, taking into account the operating frequency difference between the memory controller and the memory itself, you need an access size of:
(memory frequency / controller frequency) * number of banks * width of memory port = (2133 / 266) * 2 * 64 = 1024 bits = 32 floats --> unroll factor of 32 on the memory access loop
Of course, this assumes you only have one access in your kernel. If you have more accesses, you should divide the bandwidth between the different accesses and adjust the unroll factors accordingly. You can also consider "sacrificing" your less important accesses by giving them a lower unroll factor. Doing one 32-bit access per cycle will only give you 1/32 of the memory bandwidth. Having multiple 32-bit accesses per cycle will give you some limited performance improvement, but you will never reach the peak, or anywhere close to it, due to contention on the memory bus.
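The arithmetic above can be packed into a small helper (a sketch; the function name and parameters are mine, and the numbers are the board parameters quoted above, not universal constants):

```c
/* Back-of-the-envelope sizing of the single global memory access needed
 * to saturate bandwidth. Returns the access size in bits. */
int required_access_bits(int mem_mhz,    /* memory data rate, e.g. 2133 */
                         int ctrl_mhz,   /* FPGA memory controller clock */
                         int banks,      /* number of independent banks  */
                         int port_bits)  /* data bus width per bank      */
{
    return (mem_mhz / ctrl_mhz) * banks * port_bits;
}
```

With the numbers above, `required_access_bits(2133, 266, 2, 64)` gives 1024 bits, i.e. 32 floats, hence the unroll factor of 32 on the memory access loop.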
--- Quote Start ---
2- In the same task (no ND-range), at another level, I want to have parallel access to memory through more than one port (for that specific variable), to saturate the whole global memory bandwidth. I effectively want to implement multiple compute units, but inside one task. Does it make sense to unroll the outer loop for this purpose? Do you have any other suggestions? Let's assume coalesced access will not work, due to the random memory accesses.
--- Quote End ---
Yes, unrolling the outer loop will create a multiple-compute-unit-like design; however, you will likely not get much of a performance improvement. The memory controller on the FPGA is extremely inefficient for random memory accesses, and due to the lack of a proper cache hierarchy, there is very little you can do about them. The same issue more or less exists on GPUs; however, GPUs have a far more advanced memory controller, far higher memory bandwidth, and a cache hierarchy.
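As a sketch of what I mean (hypothetical kernel and names; the shim again lets the OpenCL-C compile as plain C): unrolling the outer loop replicates the entire inner pipeline, so several copies issue load requests to global memory concurrently, much like multiple compute units inside one task.

```c
/* Shim so this OpenCL-C sketch also compiles as plain C for illustration */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Hypothetical gather kernel: idx holds random (dynamic) indices, so the
 * loads cannot be coalesced. Unrolling the OUTER loop replicates the whole
 * inner pipeline, creating several concurrent load units. */
__kernel void gather_rows(__global const float *data,
                          __global const int *idx,
                          __global float *out,
                          const int rows, const int cols)
{
    #pragma unroll 4               /* 4 replicated pipelines / memory ports */
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++)
            acc += data[idx[r * cols + c]];  /* random access per element */
        out[r] = acc;
    }
}
```

Even with the replicated ports, the random `idx` pattern means each request touches a different DRAM row, so the replicas mostly end up contending on the memory bus rather than adding bandwidth.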
If you are interested in understanding the memory performance, you can try porting the following repository for FPGAs:
https://github.com/uob-hpc/babelstream
Then you can play around with different SIMD factors for NDRange kernels and unroll factors for single work-item kernels to see how the memory performance varies. I ported an older version of this repository for the same purpose a while back, which you can use (though it has no documentation):
https://github.com/zohourih/gpu-stream