Achieving parallel execution of loop on FPGA

Mickleman
Occasional Contributor
5 years ago
Many thanks for replying HRZ.

Your insights are very helpful. However, this is a very abstracted example of the problem. In the actual design the main inner loop is complex with a critical dependency of over 80 clocks which cannot easily be reduced. The data itself is too large to be implemented as registers.

The long dependency is resolved (as in the example) by operating on many data vectors concurrently and inverting the resulting pair of nested loops. The ivdep is needed because, although there is a data dependency on the array as a whole, the loop inversion ensures that the load and store of any one element of the array are separated by at least the natural II (in the example, GROUP_SIZE = 10 is greater than the II of 6).
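To illustrate the separation argument, here is a minimal host-side sketch in plain C++ (not FPGA code). It assumes the flattened index mapping k -> (i, j) and the a[j][i-1] -> a[j][i] recurrence used in the code posted later in this thread; ITEM_LENGTH is shortened and the helper name min_dependency_distance is made up for the illustration. It measures how many iterations separate the store of an element from the later load of that same element.

```cpp
#include <cassert>
#include <vector>

constexpr int ITEM_LENGTH = 100;  // shortened for this sketch
constexpr int GROUP_SIZE = 10;

// Hypothetical helper: returns the smallest number of iterations between
// the store to an element and the later load of that same element in the
// inverted (flattened) loop. Returns -1 if no load depends on a store.
int min_dependency_distance() {
    // last_write[j][i] records the iteration k that stored element (j, i).
    std::vector<std::vector<int>> last_write(
        GROUP_SIZE, std::vector<int>(ITEM_LENGTH, -1));
    int min_dist = -1;
    for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++) {
        int i = k / GROUP_SIZE;
        int j = k - i * GROUP_SIZE;
        if (i > 0) {
            // This iteration loads element (j, i-1), stored earlier.
            assert(last_write[j][i - 1] >= 0);
            int dist = k - last_write[j][i - 1];
            if (min_dist < 0 || dist < min_dist) min_dist = dist;
        }
        last_write[j][i] = k;  // this iteration stores element (j, i)
    }
    return min_dist;
}
```

With this mapping every load follows the matching store by exactly GROUP_SIZE iterations, which is the property the ivdep relies on when GROUP_SIZE is at least the II.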

I am now trying to run several of these inverted loops in parallel on distinct data sets. In the example this is represented by the third array dimension, indexed by block. The first iteration of the outer loop accesses only array elements whose first index is 0 (a[0][j][i]); the second iteration accesses only elements whose first index is 1 (a[1][j][i]).
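As a rough host-side illustration of that structure (plain C++, not FPGA code), the sketch below gives each block its own [GROUP_SIZE][ITEM_LENGTH] slice and runs the inverted loop on it. NUM_BLOCKS, the shortened ITEM_LENGTH, the helper names and the simple recurrence are assumptions made for the illustration; the real inner loop is far more complex.

```cpp
#include <cstdint>
#include <vector>

constexpr int NUM_BLOCKS = 4;     // hypothetical number of data sets
constexpr int GROUP_SIZE = 10;
constexpr int ITEM_LENGTH = 100;  // shortened for this sketch

// One block: GROUP_SIZE vectors of ITEM_LENGTH elements each.
using Block = std::vector<std::vector<uint16_t>>;

// Inverted loop over a single block. It only touches the block it is given.
void process_block(Block& blk) {
    for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++) {
        int i = k / GROUP_SIZE;
        int j = k - i * GROUP_SIZE;
        if (i == 0)
            blk[j][i] = static_cast<uint16_t>(j);
        else
            blk[j][i] = static_cast<uint16_t>(blk[j][i - 1] + i);
    }
}

// Outer loop over blocks: iteration `block` reads and writes a[block] only,
// so no data flows between iterations of this loop.
std::vector<Block> process_all() {
    std::vector<Block> a(
        NUM_BLOCKS, Block(GROUP_SIZE, std::vector<uint16_t>(ITEM_LENGTH)));
    for (int block = 0; block < NUM_BLOCKS; block++)
        process_block(a[block]);
    return a;
}
```

Because each outer iteration works on a separate slice, the iterations are independent in principle; whether the compiler can prove that is a separate question.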

Unfortunately, following MGRAV's suggestion to use the unroll pragma does not result in parallel execution, because the compiler insists there is a memory dependency that prevents it. I will confirm whether this is actually the case with a real run as soon as the FPGA runtime nodes are available again, and will post the results here.

Any further insights would be most welcome.
- Mickleman
  Occasional Contributor
  5 years ago
  Hi again MGRAV and HRZ
  
  I have further developed the example to avoid the false memory dependency and manually unrolled the loop. This allows the compiler to automatically fuse the two loops, achieving the desired concurrency. However, even though both loops carry the ivdep, the resulting fused loop still has an II of 6 (as if the ivdeps had been ignored). Here is the code:
  
  const int ITEM_LENGTH = 10000;
  const int GROUP_SIZE = 10;
  uint16_t a[GROUP_SIZE][ITEM_LENGTH];
  uint16_t b[GROUP_SIZE][ITEM_LENGTH];
  
  [[intel::ivdep]]
  for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++)
  {
      int i = k / GROUP_SIZE;
      int j = k - i * GROUP_SIZE;
  
      if (i == 0)
          a[j][i] = j;
      else
          a[j][i] = a[j][i-1] + i;
  }
  
  [[intel::ivdep]]
  for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++)
  {
      int i = k / GROUP_SIZE;
      int j = k - i * GROUP_SIZE;
  
      if (i == 0)
          b[j][i] = j;
      else
          b[j][i] = b[j][i-1] + i;
  }
  
  I'm at a loss. Why can't the fused loop respect the ivdep?
  - MGRAV
    New Contributor
    5 years ago
    Hi @Mickleman,
    
    I am not sure, but I assume the cause is the way you derive i and j from the division and the modulo.
    
    Imagine you rewrite it as follows (this does basically the same thing, without branching):
    
    uint16_t* c = (uint16_t*)a;
    bool test = (k < GROUP_SIZE);
    int i = k / GROUP_SIZE;
    c[k] = (c[k-1] + i) * (!test) + (test) * (k - i * GROUP_SIZE);
    
    Written this way, you can see that the compiler does not see the opportunity for parallelization over GROUP_SIZE.
    
    I think that to get it automatically you should permute your array, a[j][i] ==> a[i][j], so that the dependency inside the loop looks like c[k-GROUP_SIZE].
    
    I don't know if I am being clear about what I mean.
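A minimal host-side sketch of the permuted layout (plain C++, not FPGA code) may make the suggestion more concrete. It assumes the array is stored as a[i][j] and viewed as a flat array c, so that element (i, j) sits at c[i * GROUP_SIZE + j] = c[k]. ITEM_LENGTH is shortened and the function names are made up for the illustration. The check compares the result with the original a[j][i] layout; it says nothing about the II the compiler would actually report.

```cpp
#include <cstdint>
#include <vector>

constexpr int ITEM_LENGTH = 100;  // shortened for this sketch
constexpr int GROUP_SIZE = 10;

// Original layout a[j][i], flattened as a[j * ITEM_LENGTH + i].
std::vector<uint16_t> original_layout() {
    std::vector<uint16_t> a(GROUP_SIZE * ITEM_LENGTH);
    for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++) {
        int i = k / GROUP_SIZE;
        int j = k - i * GROUP_SIZE;
        if (i == 0)
            a[j * ITEM_LENGTH + i] = static_cast<uint16_t>(j);
        else
            a[j * ITEM_LENGTH + i] =
                static_cast<uint16_t>(a[j * ITEM_LENGTH + i - 1] + i);
    }
    return a;
}

// Permuted layout a[i][j], flattened as c[i * GROUP_SIZE + j] == c[k].
// The loop-carried dependency is now visibly at distance GROUP_SIZE.
std::vector<uint16_t> permuted_layout() {
    std::vector<uint16_t> c(GROUP_SIZE * ITEM_LENGTH);
    for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++) {
        int i = k / GROUP_SIZE;
        if (k < GROUP_SIZE)
            c[k] = static_cast<uint16_t>(k);  // i == 0, so j == k
        else
            c[k] = static_cast<uint16_t>(c[k - GROUP_SIZE] + i);
    }
    return c;
}

// True if both layouts hold the same value for every element (i, j).
bool layouts_agree() {
    std::vector<uint16_t> a = original_layout();
    std::vector<uint16_t> c = permuted_layout();
    for (int i = 0; i < ITEM_LENGTH; i++)
        for (int j = 0; j < GROUP_SIZE; j++)
            if (a[j * ITEM_LENGTH + i] != c[i * GROUP_SIZE + j]) return false;
    return true;
}
```

In the permuted form both the store address (k) and the load address (k - GROUP_SIZE) are simple linear functions of the loop counter, which is the kind of pattern the suggestion hopes the compiler can analyse on its own.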
