Forum Discussion
Fully unrolling a loop with 500 iterations does not make sense here, since there is simply not enough off-chip memory bandwidth to feed that many memory accesses. A quick look at the report shows that you are creating 32 read ports because of the unrolled loop, and if you profile the kernel you will probably see a huge amount of stalling on the off-chip accesses. Since your kernel has two off-chip memory reads and one write per cycle, an unroll factor between 8 and 16 should be enough to fully utilize the memory bandwidth. There is also no need to copy anything to local memory in this case, since the accesses to the "rands" input are consecutive and can be coalesced at compile time. Your bottleneck is off-chip memory bandwidth, so copying data from global to local memory will not help; it only lowers performance by breaking the compile-time access coalescing, increasing global memory contention, and wasting numerous cycles on the barrier.
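To put rough numbers on the bandwidth argument, here is a back-of-the-envelope sketch in plain C. All figures are assumptions for illustration and are not taken from your board or report: 4-byte floats, a 250 MHz kernel clock, 25.6 GB/s of peak off-chip bandwidth, and one loop iteration issued per cycle for each unrolled copy. The function names are made up for this example.

```c
#include <stddef.h>

/* Bytes moved to or from off-chip memory by ONE loop iteration.
   Example assumption: 2 reads + 1 write of 4-byte floats = 12 bytes. */
double bytes_per_iteration(int reads, int writes, size_t elem_size)
{
    return (double)(reads + writes) * (double)elem_size;
}

/* Bandwidth the kernel would request at a given unroll factor, assuming
   each unrolled copy issues one iteration per clock cycle (no stalls). */
double demand_bytes_per_s(int unroll, double kernel_freq_hz,
                          double bytes_per_iter)
{
    return (double)unroll * bytes_per_iter * kernel_freq_hz;
}

/* Largest unroll factor whose requested bandwidth still fits within the
   peak off-chip bandwidth. Anything above this only adds stalls. */
int max_useful_unroll(double peak_bw_bytes_per_s, double kernel_freq_hz,
                      double bytes_per_iter)
{
    double bytes_per_cycle = peak_bw_bytes_per_s / kernel_freq_hz;
    int unroll = (int)(bytes_per_cycle / bytes_per_iter);
    return unroll < 1 ? 1 : unroll;
}
```

With those assumed numbers, max_useful_unroll(25.6e9, 250e6, bytes_per_iteration(2, 1, 4)) gives 8, while demand_bytes_per_s(500, 250e6, 12.0) for a full unroll comes out at 1.5e12 bytes per second, far beyond what the memory can deliver. Plug in your own board's peak bandwidth and the kernel frequency from the report to get a real estimate.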