Achieving parallel execution of a loop on an FPGA
I am having difficulty persuading the compiler to execute an outer loop in parallel in an FPGA kernel. I have constructed a simple example to illustrate the issue.
Here is a simple array initialisation loop with a loop-carried dependency:
const int ITEM_LENGTH = 1000;
uint16_t a[ITEM_LENGTH];
for (int i = 0; i < ITEM_LENGTH; i++)
    if ( i == 0 )
        a[i] = 0;
    else
        a[i] = a[i-1] + i;
The compiler correctly schedules this with an II of 6 (at 240MHz on Arria 10).
To improve throughput, 10 of these are processed together by adding an outer loop:
const int ITEM_LENGTH = 1000;
const int GROUP_SIZE = 10;
uint16_t a[GROUP_SIZE][ITEM_LENGTH];
for (int j = 0; j < GROUP_SIZE; j++)
    for (int i = 0; i < ITEM_LENGTH; i++)
        if ( i == 0 )
            a[j][i] = j;
        else
            a[j][i] = a[j][i-1] + i;
The compiler spots that the two loops can be interchanged, which gives an II of 1 on the inner loop and so improves throughput by a factor of 6.
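As a rough illustration of that reordering, here is a host-side sketch in plain C++ (not FPGA kernel code, and not the compiler's actual output). The function names `fill_original` and `fill_interchanged` are made up for this example. With `i` as the outer loop, consecutive inner iterations write to different rows `j`, so no inner iteration has to wait for the one just before it.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int ITEM_LENGTH = 1000;
constexpr int GROUP_SIZE = 10;
using Group = std::array<std::array<uint16_t, ITEM_LENGTH>, GROUP_SIZE>;

// Loop order as written in the post: j outer, i inner.
// Each inner iteration reads the value written by the previous one.
Group fill_original() {
    Group a{};
    for (int j = 0; j < GROUP_SIZE; j++)
        for (int i = 0; i < ITEM_LENGTH; i++)
            a[j][i] = static_cast<uint16_t>((i == 0) ? j : a[j][i - 1] + i);
    return a;
}

// Swapped order: i outer, j inner.
// The value read, a[j][i-1], was written GROUP_SIZE inner iterations earlier.
Group fill_interchanged() {
    Group a{};
    for (int i = 0; i < ITEM_LENGTH; i++)
        for (int j = 0; j < GROUP_SIZE; j++)
            a[j][i] = static_cast<uint16_t>((i == 0) ? j : a[j][i - 1] + i);
    return a;
}
```

Both functions fill the array with the same values, which is why the reordering is legal.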
Now I want to process 2 of these groups in parallel, so a further outer loop is introduced. There are no dependencies between the two iterations of this loop, so I expect them to be executed in parallel by duplicating the logic.
const int ITEM_LENGTH = 1000;
const int GROUP_SIZE = 10;
uint16_t a[2][GROUP_SIZE][ITEM_LENGTH];
for (int block = 0; block < 2; block++)
    for (int j = 0; j < GROUP_SIZE; j++)
        for (int i = 0; i < ITEM_LENGTH; i++)
            if ( i == 0 )
                a[block][j][i] = j;
            else
                a[block][j][i] = a[block][j][i-1] + i;
But this causes the inner loop to revert to an II of 6, and the outer loop is reported as 'Serial exe: Memory dependency', citing all 4 combinations of the two assignment statements as the problem.
In an attempt to declare explicitly that there are no dependencies, I interchanged the two innermost loops manually (flattening them into a single loop) and added an ivdep:
const int ITEM_LENGTH = 1000;
const int GROUP_SIZE = 10;
uint16_t a[2][GROUP_SIZE][ITEM_LENGTH];
for (int block = 0; block < 2; block++)
    [[intel::ivdep]]
    for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++)
    {
        int i = k / GROUP_SIZE;
        int j = k - i * GROUP_SIZE;
        if ( i == 0 )
            a[block][j][i] = j;
        else
            a[block][j][i] = a[block][j][i-1] + i;
    }
The inner loop is now scheduled with an II of 1 (trusting the ivdep) but the outer loop still does not execute in parallel, for exactly the same reason as before.
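To make the index mapping in the flattened loop concrete, here is a host-side sketch in plain C++ that drops the block dimension for brevity and checks the flattened form against the nested form. The function names `fill_nested` and `fill_flattened` are made up for this example, and the ivdep attribute is omitted because it has no meaning in ordinary host code. The value read at iteration `k` was written at iteration `k - GROUP_SIZE`, so a dependency does exist in the flattened loop, but at a distance of `GROUP_SIZE` iterations rather than 1.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int ITEM_LENGTH = 1000;
constexpr int GROUP_SIZE = 10;
using Group = std::array<std::array<uint16_t, ITEM_LENGTH>, GROUP_SIZE>;

// Reference: nested loops with j outer and i inner.
Group fill_nested() {
    Group a{};
    for (int j = 0; j < GROUP_SIZE; j++)
        for (int i = 0; i < ITEM_LENGTH; i++)
            a[j][i] = static_cast<uint16_t>((i == 0) ? j : a[j][i - 1] + i);
    return a;
}

// Flattened: one loop over k, with j varying fastest.
Group fill_flattened() {
    Group a{};
    for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++) {
        int i = k / GROUP_SIZE;      // position within the item
        int j = k - i * GROUP_SIZE;  // which item in the group (k % GROUP_SIZE)
        // a[j][i-1] belongs to iteration k - GROUP_SIZE of this same loop.
        a[j][i] = static_cast<uint16_t>((i == 0) ? j : a[j][i - 1] + i);
    }
    return a;
}
```

The two functions fill the array identically, so the flattening itself does not change the result.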
So, given that there really isn't any actual dependency between iterations of the outer loop, how do I persuade the compiler of this so that I achieve parallel execution of the loop iterations?