What happens to global memory bandwidth when multiple OpenCL kernels read and write to DRAM simultaneously?
- 7 years ago
First, note that you are calculating your kernel's throughput in GiB/s but comparing it against the board's theoretical peak throughput in GB/s. Expressed in GiB/s, the board's peak is around 17.9 GiB/s. The board does have two memory banks; however, only the DDR4 bank is supported in the OpenCL BSP. Unless you are willing to modify the BSP yourself to add support for the DDR3 bank, you will not be able to use it with OpenCL.
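To make the unit mismatch concrete, here is a minimal sketch of the conversion in plain C. The function names are hypothetical, and the 19.2 GB/s figure used in the usage note is my assumption (it is the value that matches the roughly 17.9 GiB/s quoted above), so check it against your board's datasheet.

```c
#include <assert.h>

/* Convert decimal GB/s (10^9 bytes/s) to binary GiB/s (2^30 bytes/s). */
double gbps_to_gibps(double gb_per_s) {
    assert(gb_per_s >= 0.0);
    return gb_per_s * 1e9 / (double)(1ULL << 30);
}

/* Fraction of peak achieved, once both figures are in the same unit.
 * measured_gibps: kernel throughput in GiB/s
 * peak_gbps:      datasheet peak in GB/s (assumed input, not from the post) */
double peak_fraction(double measured_gibps, double peak_gbps) {
    assert(peak_gbps > 0.0);
    return measured_gibps / gbps_to_gibps(peak_gbps);
}
```

For example, `gbps_to_gibps(19.2)` gives about 17.88, so a kernel measured at 14 GiB/s would be at roughly 78% of peak rather than the 73% you would get by dividing 14 by 19.2 directly.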
I have a set of recommendations that might help you get closer to the peak throughput:
1- Your kernel run time is too short to allow accurate timing measurement. Chances are, a big portion of the time you are measuring is the kernel launch overhead. I recommend increasing your input size so that kernel run time is at least a few hundred milliseconds.
2- Make sure you are timing only the kernel execution, and that the calls used to set the kernel arguments or transfer data between the host and the device are outside of the timing region.
3- Try reducing your vector size to 32 or 64 to avoid extra contention on the memory bus.
4- Try merging your two kernels into one or increasing your channel depth to avoid possible pipeline stalls caused by the channels.
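For points 1 and 2, the arithmetic can be sketched as below. This is only an illustration under my own assumptions: the function names are hypothetical, and the start/end timestamps are assumed to be nanosecond values that cover the kernel execution only, such as those reported by `clGetEventProfilingInfo` with `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` on a queue created with `CL_QUEUE_PROFILING_ENABLE`. Your host code may obtain them differently.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in GiB/s from the total bytes read plus written by the kernel
 * and the kernel-only start/end timestamps in nanoseconds. */
double throughput_gibps(uint64_t bytes_moved, uint64_t start_ns, uint64_t end_ns) {
    assert(end_ns > start_ns);
    double seconds = (double)(end_ns - start_ns) / 1e9;
    return (double)bytes_moved / (double)(1ULL << 30) / seconds;
}

/* Rough lower bound on the bytes the kernel must move so that it runs for
 * at least target_ms at an expected throughput (both values are guesses
 * supplied by the caller). */
uint64_t min_bytes_for_runtime(double expected_gibps, double target_ms) {
    assert(expected_gibps > 0.0 && target_ms > 0.0);
    return (uint64_t)(expected_gibps * (double)(1ULL << 30) * (target_ms / 1000.0));
}
```

As a rough guide, `min_bytes_for_runtime(17.9, 300.0)` comes out to a bit over 5 GiB of total traffic, which gives an idea of how large the input has to be before launch overhead stops dominating the measurement.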