Extreme Performance Drop while using local memory

Honored Contributor

8 years ago

Fully unrolling a loop with 500 iterations does not make sense, since there is simply not enough off-chip memory bandwidth to support that many memory accesses. A quick look at the report shows that you are creating 32 read ports because of the unrolled loop, and if you profile the kernel, you will probably see a huge amount of stalling on off-chip accesses. Since your kernel has two off-chip memory reads and one write per cycle, an unroll factor between 8 and 16 should fully utilize the memory bandwidth.

There is also no need to copy anything to local memory in this case, since the accesses to the "rands" input are consecutive and can be coalesced at compile time. Your bottleneck here is off-chip memory bandwidth, so copying data from global to local memory gains you nothing. It only lowers performance: it breaks the compile-time access coalescing, increases global memory contention, and wastes numerous cycles on the barrier.
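To illustrate the suggestion, here is a minimal sketch of what a partially unrolled loop that reads straight from global memory could look like. Your actual kernel is not shown in this thread, so everything except the "rands" name is an assumption on my part: the kernel name, the `in` and `out` buffers, the `n` argument, the addition in the loop body, and the choice of 8 as the unroll factor are all hypothetical. The macros at the top are only there so the same file also compiles as plain C for a quick functional check on the host.

```c
/* Hypothetical kernel shape: two global reads and one global write per
 * iteration, with consecutive addresses that the compiler can coalesce.
 * Only the "rands" name comes from the thread; the rest is assumed. */
#ifndef __OPENCL_VERSION__
/* Not an OpenCL compiler: make the qualifiers vanish so this builds as C99. */
#define __kernel
#define __global
#endif

__kernel void partial_unroll_sketch(__global const float *restrict rands,
                                    __global const float *restrict in,
                                    __global float *restrict out,
                                    const int n)
{
    /* Partial unroll (assumed factor of 8) instead of a full unroll of all
     * iterations. No local buffer and no barrier: the loop reads rands[i]
     * and in[i] directly from global memory. */
    #pragma unroll 8
    for (int i = 0; i < n; i++) {
        out[i] = in[i] + rands[i];
    }
}
```

Whether 8 or 16 is the better factor depends on your board's memory bandwidth and the kernel's operating frequency, so it is worth compiling both and comparing the stall figures in the profiler.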
