Altera_Forum
Honored Contributor
7 years ago
To improve memory performance in task kernels, you must unroll your memory access loop on the dimension in which the accesses are consecutive. Take the following example:
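// Index expressions are an assumed row-major reconstruction: element (i, j) is at i * M + j.
__global float *input;
__local float data[N * M];
for (i = 0; i < N; i++)
{
    #pragma unroll 4
    for (j = 0; j < M; j++)
    {
        data[i * M + j] = input[i * M + j];  // consecutive in j, the unrolled dimension
    }
}

In this case, you will get one coalesced port to off-chip memory with a width of 128 bits (four 32-bit floats per access). However, if the accesses are as follows: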
// Index expressions are an assumed reconstruction: consecutive j values are N elements apart.
__global float *input;
__local float data[N * M];
for (i = 0; i < N; i++)
{
    #pragma unroll 4
    for (j = 0; j < M; j++)
    {
        data[j * N + i] = input[j * N + i];  // strided in j, the unrolled dimension
    }
}

You will get four non-coalesced 32-bit ports to external memory. In this case, your external memory performance will hardly improve, since you now have four accesses competing with each other for the memory bus. To maximize memory performance, minimize the number of accesses and maximize the size of each access.
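To make the difference between the two loops concrete, here is a minimal plain-C sketch that models which elements the four unrolled iterations touch. It is only an illustration of the "consecutive on the unrolled dimension" criterion, not a description of how the OpenCL compiler actually analyzes accesses. The function names are hypothetical, and the row-major layout (element (i, j) at `i * m + j`) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by both index models. */
typedef size_t (*index_fn)(size_t i, size_t j, size_t n, size_t m);

/* Assumed layout of the first loop: j is the contiguous dimension. */
size_t index_contiguous(size_t i, size_t j, size_t n, size_t m)
{
    (void)n; /* unused in this layout */
    return i * m + j;
}

/* Assumed layout of the second loop: consecutive j values are n elements apart. */
size_t index_strided(size_t i, size_t j, size_t n, size_t m)
{
    (void)m; /* unused in this layout */
    return j * n + i;
}

/* Returns 1 if the `unroll` accesses produced by unrolling the j loop,
 * starting at (i, j0), touch consecutive elements; returns 0 otherwise. */
int unrolled_accesses_are_consecutive(index_fn idx, size_t i, size_t j0,
                                      size_t unroll, size_t n, size_t m)
{
    assert(unroll > 0 && j0 + unroll <= m);
    size_t base = idx(i, j0, n, m);
    for (size_t k = 1; k < unroll; k++) {
        if (idx(i, j0 + k, n, m) != base + k)
            return 0;
    }
    return 1;
}

/* Width in bits of each memory port under this simple model: one wide
 * access if the elements are consecutive, otherwise one element per port. */
size_t port_width_bits(int consecutive, size_t unroll, size_t elem_bits)
{
    return consecutive ? unroll * elem_bits : elem_bits;
}
```

Under these assumptions, calling `unrolled_accesses_are_consecutive(index_contiguous, 2, 4, 4, 8, 16)` reports consecutive accesses, which corresponds to the single 128-bit port (`port_width_bits(1, 4, 32)`), while the same call with `index_strided` does not, which corresponds to four separate 32-bit ports.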