Forum Discussion
The solution mentioned by @MGRAV is the right way to run loop iterations in "parallel". However, the difference you are experiencing here is more likely caused by the resource used to implement i[] than by the loop-carried dependency. If i[] is implemented using registers (which can be quite inefficient if ITEM_LENGTH is too large), then you should achieve an II of 1 in all of the cases you mentioned above. You can force implementation using registers with the respective pragma mentioned in Intel's documentation and see what happens. If, however, i[] is implemented using Block RAMs, then you will end up with a load/store dependency, since single-cycle read/write from/to a Block RAM is not possible; that is where the II of 6 comes from (the latency of one Block RAM read plus one write). There is nothing you can do to improve the II in this case: the load/store dependency is real, and any attempt to force the compiler to ignore it using ivdep will result in incorrect output.
P.S. You should probably move "if ( i == 0 ) a[i] = j; " outside of the "i" loop and start "i" from 1. This will allow you to completely eliminate the branch from the code and save some area.
Many thanks for replying, HRZ.
Your insights are very helpful. However, this is a very abstracted example of the problem. In the actual design, the main inner loop is complex, with a critical dependency of over 80 clock cycles that cannot easily be reduced. The data itself is too large to be implemented in registers.
The long dependency is resolved (as in the example) by operating on many data vectors concurrently and inverting the resulting pair of nested loops. The ivdep is needed because, although there is a data dependency on the array as a whole, the loop inversion ensures that the load and store of any one element of the array are separated by at least the natural II (GROUP_SIZE = 10 is greater than the II of 6 in the example).
I am now trying to run several of these inverted loops in parallel on distinct data sets. In the example this is represented by the third array dimension, indexed by block: the first iteration of the loop only accesses array elements with first index 0 (a[0][j][i]), the second iteration only accesses elements with first index 1 (a[1][j][i]), and so on.
I have followed MGRAV's suggestion of the unroll pragma, but unfortunately it does not result in parallel execution, because the compiler insists there is a memory dependency preventing it. I will confirm that this is actually the case with a real run as soon as the FPGA runtime nodes are available again, and will post the results here.
Any further insights would be most welcome.