Intel opencl Dynamic profiler report

Frequent Contributor

7 years ago

The report makes it much easier to understand what is happening. Basically, your memory performance is poor because you are putting too much pressure on the memory bus. Specifically, unrolling the "y" loop over the output write is unnecessary. The addresses are not consecutive over "y" and hence, you are getting too many [wide] memory ports competing with each other over the memory bus. Furthermore, since you are using the reference board which only has one bank of DDR4 memory, your memory bandwidth is limited to only ~17 GB/s. One single 512-bit access per iteration is enough to saturate the memory bandwidth in this case. Your kernel, however, has 14 such accesses. I would recommend using a vector size of 8 instead of 16, and avoiding to unroll the "y" loop over output writes. This will result in two 256-bit accesses for input and output. I assume the mask read is done only once, so it shouldn't cause much contention on the bus.

Forum Discussion

Recent Discussions

Agilex 7 FPGA Starter Kit with oneAPI Toolkit flow not detected over PCIe

MCTP over PCIe VDM routing to PMCI in OFS N6000 FIM configuration and datapath clarification

HLS Compiler 24.1 error - aocl-clang.exe - dll entry point not found

Error faced while executing on Agilex FPGA board....

AI Suite System Throughput Issue