Parallel accesses to banked local memory

Question

I've been attempting to bank local memory so that I can perform parallel accesses, but I can't get it to work. The compiler always either replicates the memory or generates a banked memory whose accesses stall. I don't understand why the code snippet below doesn't generate L parallel BRAMs that can be accessed concurrently.

#define L 64

__kernel __attribute__((task))
void test(__global char * restrict message,
          __global char * restrict decodedData)
{
    local char __attribute__((numbanks(L), bankwidth(1))) msgMem[L][256];
    int __attribute__((register)) Lrji_row_sum[L];

    // store data across L memory banks
    for (uint k = 0; k < 256; k++)
    {
        for (uint r = 0; r < L; r++)
        {
            msgMem[r][k] = message[(k * L) + r];
        }
    }

    mem_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

    // accumulator for each memory bank
    for (uint i = 0; i < 256; i++)
    {
        #pragma unroll
        for (uint r = 0; r < L; r++)
        {
            Lrji_row_sum[r] = msgMem[r][i];
        }

        for (uint r = 0; r < L; r++)
        {
            decodedData[r] = Lrji_row_sum[r];
        }
    }
}

I'm using Quartus 17.0.1

Appreciate the help,

Jason

altera_forum · Answer

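You are unrolling the loop that reads from the local memory buffer, but the unrolled index (r) runs over the first dimension of the buffer rather than the second one. With numbanks(L) and bankwidth(1), the banks are interleaved over the lowest-order address bits, i.e. over the last dimension, so for a fixed i all 64 elements msgMem[0][i] to msgMem[63][i] sit in the same bank (the row stride of 256 bytes is a multiple of 64). Because of this, the accesses to the local buffer cannot be coalesced, and the compiler instantiates 64 separate reads from that buffer, which results in a very high replication factor. If you swap the dimensions of this buffer, your problem will be solved: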
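The conflict can be checked with a small host-side model, written in plain C rather than OpenCL C so it runs on a PC. It is a simplified sketch: it assumes that numbanks(L) with bankwidth(1) interleaves consecutive bytes across the banks, so a byte's bank is its linear offset modulo L. The helper names bank_of and reads_on_busiest_bank are hypothetical.

```c
#include <assert.h>

#define L 64      /* numbanks(L) in the kernel */
#define COLS 256  /* second dimension of the original msgMem[L][256] */

/* Hypothetical model of numbanks(L), bankwidth(1): consecutive bytes are
   interleaved across banks, so bank = linear byte offset mod L. */
unsigned bank_of(unsigned offset) {
    assert(offset < L * COLS);  /* offset must lie inside msgMem */
    return offset % L;
}

/* For one iteration i of the outer loop, the unrolled inner loop issues
   L reads msgMem[r][i], r = 0..L-1. Return how many of them hit the
   busiest bank (1 would mean fully parallel, L means fully serialized). */
unsigned reads_on_busiest_bank(unsigned i) {
    unsigned hits[L] = {0};
    unsigned worst = 0;
    for (unsigned r = 0; r < L; r++) {
        unsigned b = bank_of(r * COLS + i);  /* row-major offset of msgMem[r][i] */
        if (++hits[b] > worst)
            worst = hits[b];
    }
    return worst;
}
```

For every i, reads_on_busiest_bank(i) returns 64: in this model all 64 unrolled reads target bank i % 64, which the compiler can only serve by replicating the memory or by arbitrating between the reads (stalls).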

msgMem[L][256]; --> msgMem[256][L]; (in the declaration)

msgMem[r][k]=message[(k*L)+r]; --> msgMem[k][r]=message[(k*L)+r];

Lrji_row_sum[r]=msgMem[r][i]; --> Lrji_row_sum[r]=msgMem[i][r];

This way, you can also unroll the write loop to get large coalesced accesses to both global and local memory, which lets you make better use of the global memory bandwidth without any extra local memory replication.
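The effect of the swap can be checked with a host-side model in plain C. It is a sketch under a simplifying assumption: numbanks(L) with bankwidth(1) is taken to mean bank = linear byte offset mod L, and the helper names bank_of and accesses_on_busiest_bank are hypothetical. With the buffer declared as msgMem[256][L], both unrolled loops touch a single row per iteration (msgMem[i][r] for the reads, msgMem[k][r] for the writes), and the matching global reads message[(k*L)+r] for r = 0..L-1 cover one contiguous L-byte range starting at k*L.

```c
#include <assert.h>

#define L 64      /* numbanks(L) in the kernel */
#define ROWS 256  /* first dimension of the swapped buffer msgMem[256][L] */

/* Hypothetical model of numbanks(L), bankwidth(1): consecutive bytes are
   interleaved across banks, so bank = linear byte offset mod L. */
unsigned bank_of(unsigned offset) {
    assert(offset < ROWS * L);  /* offset must lie inside msgMem */
    return offset % L;
}

/* After the swap, each unrolled loop accesses msgMem[row][r], r = 0..L-1.
   Return how many of those L accesses hit the busiest bank
   (1 means every access goes to its own bank). */
unsigned accesses_on_busiest_bank(unsigned row) {
    unsigned hits[L] = {0};
    unsigned worst = 0;
    for (unsigned r = 0; r < L; r++) {
        unsigned b = bank_of(row * L + r);  /* row-major offset of msgMem[row][r] */
        if (++hits[b] > worst)
            worst = hits[b];
    }
    return worst;
}
```

In this model accesses_on_busiest_bank(row) returns 1 for every row, because the offset row * L + r maps to bank r: each of the 64 unrolled accesses gets its own bank, so neither replication nor arbitration is needed.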

P.S. Barriers and fences such as the mem_fence call are not needed in task kernels, since there is no threading/scheduling in this kernel type.

altera_forum · Answer

OK, I got it. Thanks, HRZ. I appreciate the help. Jason
