Forum Discussion
The report makes it much easier to see what is happening. Your memory performance is poor because you are putting too much pressure on the memory bus. Specifically, unrolling the "y" loop over the output write is unnecessary: the addresses are not consecutive over "y", so the unrolled writes cannot be coalesced into a single access and you end up with too many [wide] memory ports competing with each other for the memory bus. Furthermore, since you are using the reference board, which has only one bank of DDR4 memory, your memory bandwidth is limited to ~17 GB/s. A single 512-bit access per iteration is enough to saturate that bandwidth. Your kernel, however, has 14 such accesses. I would recommend using a vector size of 8 instead of 16 and not unrolling the "y" loop over the output writes. This will leave you with two 256-bit accesses, one for input and one for output. I assume the mask read is done only once, so it shouldn't cause much contention on the bus.
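To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes every memory port issues one access per clock cycle and that the kernel runs at 250 MHz. That clock is a made-up figure for illustration, so substitute the fmax from your own report. The function names are hypothetical as well; only the ~17 GB/s peak and the access widths and counts come from the discussion above.

```python
# Rough estimate of the bandwidth a kernel requests from external memory.
# ASSUMPTIONS (not from the report): the kernel clock is 250 MHz and every
# memory port issues one access per clock cycle.

DDR_PEAK_GBPS = 17.0       # ~17 GB/s for the single DDR4 bank (from the post)
ASSUMED_KERNEL_MHZ = 250   # hypothetical; replace with your kernel's fmax


def demand_gbps(access_bits, num_accesses, kernel_mhz=ASSUMED_KERNEL_MHZ):
    """Bandwidth requested, in GB/s, if all accesses issue every cycle."""
    bytes_per_cycle = (access_bits // 8) * num_accesses
    return bytes_per_cycle * kernel_mhz * 1e6 / 1e9


def oversubscription(access_bits, num_accesses, kernel_mhz=ASSUMED_KERNEL_MHZ):
    """Ratio of requested bandwidth to the board's peak bandwidth."""
    return demand_gbps(access_bits, num_accesses, kernel_mhz) / DDR_PEAK_GBPS


if __name__ == "__main__":
    cases = [
        ("one 512-bit access", 512, 1),
        ("current kernel, 14 x 512-bit", 512, 14),
        ("suggested, 2 x 256-bit", 256, 2),
    ]
    for label, bits, count in cases:
        print(f"{label}: {demand_gbps(bits, count):.1f} GB/s requested, "
              f"{oversubscription(bits, count):.2f}x of peak")
```

Under these assumptions a single 512-bit port already asks for roughly the full peak bandwidth, the 14-port version asks for more than ten times what the bus can deliver, and two 256-bit ports bring the demand back to about the same level as one 512-bit port.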