Altera_Forum
Honored Contributor
8 years ago

#pragma ivdep not allowing for parallel stores to local memory
Hello,
I am having difficulty parallelizing local memory store operations and would appreciate some help. The loads from local memory appear to be parallelized; however, the report.html System Viewer tab shows a dependency chain on the stores: store(lmem[idx[0]] = result[0]) -> store(lmem[idx[1]] = result[1]) -> etc. I've experimented with setting the numbanks attribute, and the tool correctly banks the memory, but this has no effect on the store dependencies. This is a single work-item kernel.

This code snippet is part of a larger loop. During a given loop iteration the code never stores to the same address twice (no address collisions), but depending on the loop iteration number it may or may not have bank conflicts. The only way to have zero bank conflicts across all iterations would be to use registers, but the lmem size is too large for that. Given the above, I understand that during some loop iterations the stores will end up being sequential (when they all target the same bank); however, I want to take advantage of parallel stores during the loop iterations that have no bank conflicts.

Things I've tried on the store section:
- Removing the #pragma unroll. The compiler then unrolled the stores automatically.
- #pragma unroll 1. This bottlenecks my algorithm to the point where I won't see any benefit from vectorization.
- Flipping the order of #pragma ivdep and #pragma unroll. No effect.

I would've expected the #pragma ivdep to resolve this. Can someone please provide help? Thank you. See the code snippet below:

__private float2 operand[8];
__private float2 result[8];
__private uint idx[8];
__local float2 __attribute__((bankwidth(8))) lmem1[8192];
__local float2 __attribute__((bankwidth(8))) lmem2[8192];

for (uint aa = 0; aa < 4; ++aa) {
    #pragma ivdep
    for (uint bb = 0; bb < 8192; bb += 8) {

        ... Code that computes the idx array ...

        if ((aa & 0x1) == 0) // ping pong buffer
        {
            #pragma unroll
            #pragma ivdep
            for (uint ii = 0; ii < 8; ++ii) {
                operand[ii] = lmem1[idx[ii]];
            } // ii
        }
        else
        {
            #pragma unroll
            #pragma ivdep
            for (uint ii = 0; ii < 8; ++ii) {
                operand[ii] = lmem2[idx[ii]];
            } // ii
        }

        ... Code that computes the result array ...

        if ((aa & 0x1) == 0) // ping pong buffer
        {
            #pragma unroll
            #pragma ivdep
            for (uint ii = 0; ii < 8; ++ii) {
                lmem2[idx[ii]] = result[ii];
            } // ii
        }
        else
        {
            #pragma unroll
            #pragma ivdep
            for (uint ii = 0; ii < 8; ++ii) {
                lmem1[idx[ii]] = result[ii];
            } // ii
        }
    } // bb
} // aa

UPDATE: Perhaps two things are occurring: (1) the BRAMs inferred for local memory have a limited number of write ports, so if all entries of the idx array map to the same BRAM, 8 cycles are needed regardless; and (2) the compiler is trying to preserve store ordering in case any of the entries of idx are equal (since the data would otherwise get overwritten). I guess the way to solve my problem would be to somehow create a FIFO per BRAM that can assert back pressure to the kernel? Surely I can't be the first person to encounter this...
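One more thing I intend to try is the array-specific form of #pragma ivdep from the Programming Guide. As far as I understand it, this only removes the dependence the compiler infers from possible idx collisions; it cannot add physical write ports, so stores that land in the same BRAM would still serialize. Untested sketch:

#pragma ivdep array(lmem2)
#pragma unroll
for (uint ii = 0; ii < 8; ++ii) {
    lmem2[idx[ii]] = result[ii]; // claim: no two ii iterations hit the same address
} // ii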
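Sketching the banking idea further: from the banking examples in the Best Practices Guide, my understanding is that banked local memory only gives truly parallel stores when each unrolled store's bank select is a compile-time constant. If my index computation could be factored so that store ii always lands in bank ii (i.e. (idx[ii] & 7) == ii) on the conflict-free iterations, then something like the following untested sketch, using the documented numbanks/bankwidth attributes on a 2D array with word = idx >> 3 and bank = idx & 7, should allow all 8 stores in one cycle:

// Same 8192-element footprint as before, split as 8 banks of 1024 words.
__local float2 __attribute__((numbanks(8), bankwidth(8))) lmem1[1024][8];

#pragma unroll
for (uint ii = 0; ii < 8; ++ii) {
    // The bank select is the literal constant ii, so each store gets its
    // own bank and write port; the word address (idx[ii] >> 3) stays dynamic.
    lmem1[idx[ii] >> 3][ii] = result[ii];
} // ii

The loads would need the same word/bank decomposition, and on iterations where the (idx[ii] & 7) == ii guarantee does not hold I would have to fall back to a crossbar in front of the banks or accept serialized stores.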