What happens to global memory bandwidth when multiple OpenCL kernels read and write to DRAM simultaneously?

Hello, I have a basic question about how the OpenCL compiler handles global memory access across different OpenCL kernels. For example: __kernel input1( __global int *r1 ) { } __kernel input2(__global c...

HRZ
7 years ago
First, note that you are calculating the throughput of your kernel in GiB/s but comparing it with the theoretical peak throughput of the board in GB/s. Expressed in GiB/s, the peak throughput of the board is around 17.9 GiB/s. The board does have two memory banks; however, only the DDR4 bank is supported in the OpenCL BSP. Unless you are willing to modify the BSP yourself to add support for the DDR3 bank, you will not be able to use it with OpenCL.
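To make the unit mismatch concrete, here is a minimal C sketch of the conversion. The function names are illustrative and not part of any SDK. By this arithmetic, the 17.9 GiB/s figure above corresponds to roughly 19.2 GB/s, so comparing a GiB/s measurement against a GB/s peak understates the achieved fraction by about 7%.

```c
#include <assert.h>

/* Bytes in one decimal gigabyte (GB) and one binary gibibyte (GiB). */
#define BYTES_PER_GB  1000000000.0
#define BYTES_PER_GIB 1073741824.0 /* 2^30 */

/* Convert a throughput quoted in GB/s (decimal) to GiB/s (binary). */
double gb_to_gib_per_s(double gb_per_s)
{
    assert(gb_per_s >= 0.0);
    return gb_per_s * BYTES_PER_GB / BYTES_PER_GIB;
}

/* Convert a throughput quoted in GiB/s (binary) to GB/s (decimal). */
double gib_to_gb_per_s(double gib_per_s)
{
    assert(gib_per_s >= 0.0);
    return gib_per_s * BYTES_PER_GIB / BYTES_PER_GB;
}
```

Whichever unit you pick, convert both the measured value and the theoretical peak to it before computing the ratio.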
I have a set of recommendations that might help you get closer to the peak throughput:
1- Your kernel run time is too short to allow accurate timing measurement. Chances are, a big portion of the time you are measuring is the kernel launch overhead. I recommend increasing your input size so that kernel run time is at least a few hundred milliseconds.
2- Make sure you are timing only the kernel execution, and that the calls that set the kernel arguments or transfer data between the host and the device are outside the timing region.
3- Try reducing your vector size to 32 or 64 to avoid extra contention on the memory bus.
4- Try merging your two kernels into one or increasing your channel depth to avoid possible pipeline stalls caused by the channels.
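Recommendations 1 and 2 can be sketched as a small piece of host-side arithmetic in C. This is only a sketch under stated assumptions: the helper names are hypothetical, and the start/end timestamps are assumed to be nanosecond values for the kernel event alone, for example as returned by clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on a queue created with profiling enabled. "Bytes moved" is assumed to be the total the kernel reads plus writes.

```c
#include <assert.h>
#include <stdint.h>

#define BYTES_PER_GIB 1073741824.0 /* 2^30 */

/* Achieved throughput in GiB/s for a kernel that moved `bytes_moved`
 * bytes between two nanosecond timestamps. The timestamps are assumed
 * to cover only the kernel execution, not argument setup or host/device
 * transfers. */
double throughput_gib_per_s(uint64_t bytes_moved,
                            uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns > start_ns);
    double seconds = (double)(end_ns - start_ns) / 1e9;
    return ((double)bytes_moved / BYTES_PER_GIB) / seconds;
}

/* Rough number of bytes a kernel must move so that, at the expected
 * throughput, it runs for at least `min_ms` milliseconds. */
uint64_t min_bytes_for_runtime(double expected_gib_per_s, double min_ms)
{
    assert(expected_gib_per_s > 0.0 && min_ms > 0.0);
    double bytes = expected_gib_per_s * BYTES_PER_GIB * (min_ms / 1000.0);
    return (uint64_t)bytes + 1; /* round up so the target is not undershot */
}
```

For instance, min_bytes_for_runtime(17.9, 300.0) gives a little under 5.8 billion bytes, which is the order of traffic needed for a run of a few hundred milliseconds near the peak rate; the 300 ms value is just an example of "a few hundred milliseconds".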
