How to improve memory bandwidth

Question

Hi, I want to do empty kernels. First block read data from memory, send this data to channel. Second block receive data from channel and write this data to memory.  I execute this code on Arria10 DeVKit But information of bandwidth in profile is just ~3000 Mb/sec, but max bandwidth is 12800 Mb/sec  How can I improve this result ?   
# pragma OPENCL_EXTENSION cl_altera_channels : enable
# define THREAD_SIZE 1# define VEC_SIZE 128
typedef struct{
      float data;
} lane_data;
channel lane_data data_ch;
__kernel
void memRead(
                     int data_dim1,
                     int data_dim2,
                     int data_dim3,
                     __global float *restrict bottom
)
{
lane_data buff;
int size = data_dim1*data_dim2*data_dim3;
for(int ll=0; ll&lt;size/THREAD_SIZE/VEC_SIZE; ll++){
   # pragma unroll
    for(int i=0; i&lt;THREAD_SIZE; i++){
       # pragma unroll
        for(int j=0; j&lt;VEC_SIZE; j++){
            buff.data = bottom;
        }
        write_channel_altera(data_ch, buff);
    }
}
}
__kernel
void memWrite(
                     int data_dim1,
                     int data_dim2,
                     int data_dim3,
                     __global float *restrict result
)
{
lane_data buff;
int size = data_dim1*data_dim2*data_dim3;
for(int ll=0; ll&lt;size/THREAD_SIZE/VEC_SIZE; ll++){
   # pragma unroll
    for(int i=0; i&lt;THREAD_SIZE; i++){
        buff = read_channel_altera(data_ch);
       # pragma unroll
        for(int j=0; j&lt;VEC_SIZE; j++){
             result = buff.data;
        }
    }
}
}
  thank you

altera_forum · Answer

If you take a look at the report, the reason is pretty obvious. The compiler is being stupid and splitting your read access into 8x 512-bit simple accesses and also 3x 32-bit prefetching accesses (no idea what the hell this is), instead of inferring a single 4096-bit coalesced access like the write one. Because of this, you have 12 ports going to memory instead of 2. It goes without saying that this configurations results in a huge amount of contention on the memory bus and significantly reduces your memory performance.

If you add the volatile tag to your input (__global volatile float *restrict bottom), you will also get one single 4096-bit access for the read which will likely allow you to achieve close to peak performance.

Needless to say, since the devkit only has one memory bank, you should be able to achieve full bandwidth with a total access size of 512 bits (read + write), so a vector size of 8 or 16 should be enough in your case.

altera_forum · Answer

Thank you, HRZ

But if I use volatile (__global volatile float *restrict bottom). I compiler always can't to generate aocx file on the final stage of compilation.

I get next error

Error: Specified licence does not contain information required to run the Quartus Prime software.

Error: Quartus Prime Compiler Database Interface was unsuccessful

If I delete this tag compilation is become successful.

altera_forum · Answer

That is very strange, I have never seen such thing before. I recommend opening a service request with Altera and asking about your license issue; I am afraid they are the only ones who can help you with license issues.

Forum Discussion

How to improve memory bandwidth

3 Replies

Recent Discussions

Xcelium Simulation using XRUN with Verilog Configurations defining Library Search Order

setup and hold time with set_input_delay

Quartus 26.1 type inheritance bug.

Typedef inheritance in nested modules

Possible Quartus 24.1 bug?