Forum Discussion
Hi,
Did you try "#pragma unroll 2"?
I am working on something similar, and I would say that something like this should almost work:
    const int ITEM_LENGTH = 1000;
    const int GROUP_SIZE = 10;
    uint16_t a[2][GROUP_SIZE][ITEM_LENGTH];

    #pragma unroll 2
    [[intel::ivdep]]
    for (int block = 0; block < 2; block++)
      [[intel::ivdep]]
      for (int j = 0; j < GROUP_SIZE; j++)
        for (int i = 0; i < ITEM_LENGTH; i++)
          if (i == 0)
            a[block][j][i] = j;
          else
            a[block][j][i] = a[block][j][i-1] + i;
Good luck.
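Since the snippet above is only expected to "almost work", it may help to have a plain host-side reference to compare the kernel output against. The following is a minimal sketch in standard C++ with no FPGA pragmas or attributes; the names NUM_BLOCKS, reference_fill, closed_form and matches_closed_form are hypothetical and not part of the original code. It relies on the fact that the recurrence a[..][j][0] = j, a[..][j][i] = a[..][j][i-1] + i has the closed form j + i*(i+1)/2, taken modulo 2^16 because the elements are uint16_t.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int ITEM_LENGTH = 1000;
constexpr int GROUP_SIZE = 10;
constexpr int NUM_BLOCKS = 2;  // hypothetical name for the literal 2 in the post

// Closed form of the recurrence; the cast wraps the value modulo 2^16.
uint16_t closed_form(int j, int i) {
    return static_cast<uint16_t>(j + i * (i + 1) / 2);
}

// Sequential reference: same recurrence as the kernel, flattened into one vector.
std::vector<uint16_t> reference_fill() {
    std::vector<uint16_t> a(static_cast<std::size_t>(NUM_BLOCKS) * GROUP_SIZE * ITEM_LENGTH);
    for (int block = 0; block < NUM_BLOCKS; block++) {
        for (int j = 0; j < GROUP_SIZE; j++) {
            uint16_t* row = &a[(static_cast<std::size_t>(block) * GROUP_SIZE + j) * ITEM_LENGTH];
            for (int i = 0; i < ITEM_LENGTH; i++) {
                row[i] = (i == 0) ? static_cast<uint16_t>(j)
                                  : static_cast<uint16_t>(row[i - 1] + i);
            }
        }
    }
    return a;
}

// Returns true if every element of a (same flattened layout) matches the closed form.
bool matches_closed_form(const std::vector<uint16_t>& a) {
    for (int block = 0; block < NUM_BLOCKS; block++)
        for (int j = 0; j < GROUP_SIZE; j++)
            for (int i = 0; i < ITEM_LENGTH; i++) {
                std::size_t idx = (static_cast<std::size_t>(block) * GROUP_SIZE + j) * ITEM_LENGTH + i;
                if (a[idx] != closed_form(j, i)) return false;
            }
    return true;
}
```

Copying the device result into a vector with the same layout and passing it to matches_closed_form should flag any output that was corrupted by an ignored dependency. Note that the two blocks never read each other's data, which is what makes unrolling the outer loop safe.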
The solution mentioned by @MGRAV is the right way to run loop iterations in "parallel". However, the difference you are seeing is likely caused by the resource used to implement a[], rather than by the loop-carried dependency itself. If a[] is implemented using registers (which can be quite inefficient if ITEM_LENGTH is too large), you should achieve an II of 1 in all of the cases you mentioned above. You can force a register implementation using the respective pragma mentioned in Intel's documentation and see what happens. If, however, a[] is implemented using Block RAMs, you will end up with a load/store dependency, since it is not possible to read from and write to a Block RAM in a single cycle; that is where the II=6 comes from (latency of one Block RAM read + write). There is nothing you can do to improve the II in this case, since the load/store dependency is real and any attempt to force the compiler to ignore it using ivdep will result in incorrect output.
P.S. You should probably move "if ( i == 0 ) a[i] = j;" outside of the "i" loop and start "i" from 1. This will allow you to completely eliminate the branch from the code and save some area.
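The P.S. above can be sketched as follows. This is plain C++ for a single row, not the actual kernel; fill_with_branch and fill_peeled are hypothetical names used only to put the two loop shapes side by side.

```cpp
#include <cstdint>

constexpr int ITEM_LENGTH = 1000;

// Original shape: the i == 0 test is evaluated on every iteration.
void fill_with_branch(uint16_t* row, int j) {
    for (int i = 0; i < ITEM_LENGTH; i++) {
        if (i == 0)
            row[i] = static_cast<uint16_t>(j);
        else
            row[i] = static_cast<uint16_t>(row[i - 1] + i);
    }
}

// Peeled shape: element 0 is written before the loop, which then starts at 1
// and contains no branch.
void fill_peeled(uint16_t* row, int j) {
    row[0] = static_cast<uint16_t>(j);
    for (int i = 1; i < ITEM_LENGTH; i++)
        row[i] = static_cast<uint16_t>(row[i - 1] + i);
}
```

Both functions produce the same row contents; the peeled version only removes the per-iteration comparison, so the read of row[i - 1] followed by the write of row[i] is still there.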