loop-unrolling and memory access performance

Honored Contributor

7 years ago

To improve memory performance in task kernels, you must unroll your memory access loop on the dimension in which the accesses are consecutive. Take the following example:

__global float input[N][M];
__local float data[N][M];
for (i = 0; i < N; i++)
{
  #pragma unroll 4
  for (j = 0; j < M; j++)
  {
    data[i][j] = input[i][j];
  }
}

In this case, the unrolled loop runs over the dimension in which the accesses are consecutive, so the four 32-bit accesses are coalesced into a single 128-bit-wide port to off-chip memory. However, if the accesses are as follows:


__global float input[M][N];
__local float data[M][N];
for (i = 0; i < N; i++)
{
  #pragma unroll 4
  for (j = 0; j < M; j++)
  {
    data[j][i] = input[j][i];
  }
}

You will get four non-coalesced 32-bit ports to external memory, because the unrolled accesses are no longer consecutive. In this case, your external memory performance will hardly improve, since you now have four accesses competing with each other for the memory bus.
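To see why the unrolled dimension matters, here is a rough C sketch that models the byte offsets touched by four unrolled iterations. It assumes row-major `float` arrays, and that the first example indexes `[i][j]` while the second indexes `[j][i]`. The helper names are hypothetical, and this is a host-side model of the addressing only, not kernel code or a description of what the compiler does internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define UNROLL 4  /* mirrors "#pragma unroll 4" in the post */

/* Byte offset of element [row][col] in a row-major float array
   that has `cols` elements per row. */
size_t byte_offset(size_t row, size_t col, size_t cols)
{
    assert(col < cols);
    return (row * cols + col) * sizeof(float);
}

/* True when the UNROLL offsets form one gap-free ascending run,
   i.e. they could be served by a single access UNROLL floats wide. */
bool is_contiguous(const size_t offs[UNROLL])
{
    for (int k = 1; k < UNROLL; k++) {
        if (offs[k] != offs[k - 1] + sizeof(float))
            return false;
    }
    return true;
}

/* Assumed case 1: index [i][j]; the unrolled loop j walks along a row. */
bool unroll_along_row(size_t i, size_t j0, size_t cols)
{
    size_t offs[UNROLL];
    for (int k = 0; k < UNROLL; k++)
        offs[k] = byte_offset(i, j0 + k, cols);
    return is_contiguous(offs);
}

/* Assumed case 2: index [j][i]; the unrolled loop j jumps one row per step. */
bool unroll_across_rows(size_t i, size_t j0, size_t cols)
{
    size_t offs[UNROLL];
    for (int k = 0; k < UNROLL; k++)
        offs[k] = byte_offset(j0 + k, i, cols);
    return is_contiguous(offs);
}
```

With 16 floats per row, `unroll_along_row(0, 0, 16)` reports a contiguous run (four adjacent floats, so one wide access is possible), while `unroll_across_rows(0, 0, 16)` does not, because each unrolled iteration lands a full row away from the previous one.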

To maximize memory performance, you should minimize the number of accesses while maximizing the width of each access.
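As a loose illustration of "fewer, wider accesses", the C sketch below copies the same data once with 32-bit reads and once with 128-bit chunks, counting the reads in each case. The `wide4` type and both function names are made up for this example; on the FPGA the compiler decides the actual port widths, so treat this only as a model of the access count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 128-bit container: four packed 32-bit floats. */
typedef struct { float lane[4]; } wide4;

/* Copies n floats one at a time; returns the number of reads issued. */
size_t copy_narrow(float *dst, const float *src, size_t n)
{
    size_t reads = 0;
    for (size_t k = 0; k < n; k++) {
        dst[k] = src[k];            /* one 4-byte read per element */
        reads++;
    }
    return reads;
}

/* Copies n floats four at a time; n must be a multiple of 4.
   Returns the number of reads issued. */
size_t copy_wide(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
    size_t reads = 0;
    for (size_t k = 0; k < n; k += 4) {
        wide4 chunk;
        memcpy(&chunk, src + k, sizeof chunk);  /* one 16-byte read  */
        memcpy(dst + k, &chunk, sizeof chunk);  /* one 16-byte write */
        reads++;
    }
    return reads;
}
```

For 16 floats, `copy_narrow` issues 16 reads and `copy_wide` issues 4, while both produce the same destination contents.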
