Parallel accesses to banked local memory

Honored Contributor

8 years ago

You are unrolling the loop that reads from the local memory buffer, but the unrolling is applied to the first dimension of the local buffer rather than the second one. Because of this, the accesses to the local buffer cannot be coalesced, and the compiler instantiates 64 separate reads from that buffer, which results in a very high replication factor. If you swap the dimensions of this buffer, your problem will be solved:

msgMem[r][k]=message[(k*L)+r]; --> msgMem[k][r]=message[(k*L)+r];

Lrji_row_sum[r]=msgMem[r][k]; --> Lrji_row_sum[r]=msgMem[k][r];

This way, you can also unroll the write loop to get large coalesced accesses to both global and local memory, which lets you make better use of the global memory bandwidth without any extra local memory replication.
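To see why the swap helps, here is a small plain C sketch (not the original kernel) that only models the address arithmetic of the two layouts. The sizes K and R are made-up values for illustration, with R chosen to match the 64 reads mentioned above, and the helper names are hypothetical. It assumes the usual row-major layout of C arrays, where the last index is the contiguous one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes: R is the dimension indexed by the unrolled loop
 * variable r, K is the other dimension. Neither value is from the
 * original kernel. */
enum { K = 4, R = 64 };

/* Flat element offset of msgMem[r][k] when declared as msgMem[R][K]
 * (original layout, r is the first dimension). */
static size_t offset_original(size_t r, size_t k) {
    assert(r < (size_t)R && k < (size_t)K);
    return r * (size_t)K + k;
}

/* Flat element offset of msgMem[k][r] when declared as msgMem[K][R]
 * (swapped layout, r is the last, contiguous dimension). */
static size_t offset_swapped(size_t k, size_t r) {
    assert(r < (size_t)R && k < (size_t)K);
    return k * (size_t)R + r;
}

/* Distance in elements between the accesses made by two consecutive
 * iterations (r and r + 1) of the unrolled loop, for a fixed k. */
static size_t stride_original(size_t k) {
    return offset_original(1, k) - offset_original(0, k);
}

static size_t stride_swapped(size_t k) {
    return offset_swapped(k, 1) - offset_swapped(k, 0);
}
```

With the original layout, consecutive values of r land K elements apart, so the unrolled accesses are scattered. With the swapped layout they land one element apart, which is the consecutive pattern the compiler needs in order to merge them into one wide access. The global access message[(k*L)+r] is already consecutive in r, so both sides of the copy line up after the swap.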

P.S. Barriers are not needed in task kernels, since there is no threading/scheduling in this kernel type.
