Memory access pattern optimization in OpenCL
Hi,
I have a fundamental question about optimizing memory access patterns in OpenCL code. Best-practice guides say that memory accesses should be contiguous for best performance. On a GPU, we make sure that work-items in the same work-group or warp access contiguous indexes. That means data access should be sequential "spatially", since the parallelism only exists in the spatial dimension.
On an FPGA, there seem to be two opportunities. We can have a spatially contiguous access pattern in NDRange kernels, and we may also have temporally contiguous access in both NDRange and single work-item kernels. For example, if we have a loop and unroll it, it may be good to make sure the access index is based on the iteration counter, with no gaps between consecutive iterations (unit stride). Now my question is: how is this handled when the kernel is compiled? Which dimension has the greater opportunity for being parallelized?
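To make the two cases concrete, here is a minimal sketch in plain C (not real OpenCL C) of what I mean. It only emulates the index patterns on the host. The names `ndrange_body`, `single_task`, `patterns_match`, and the unroll factor of 4 are made up for illustration, and `gid` stands in for what `get_global_id(0)` would return in an actual kernel.

```c
#include <stddef.h>

#define UNROLL 4   /* hypothetical unroll factor, chosen only for illustration */
#define MAX_N  64  /* size of the small test buffers used below */

/* Spatial case: body of an NDRange-style kernel. Work-item `gid` touches
   element `gid`, so neighbouring work-items touch neighbouring addresses. */
static void ndrange_body(const float *in, float *out, size_t gid) {
    out[gid] = 2.0f * in[gid];
}

/* Temporal case: a single work-item style loop, unrolled by UNROLL.
   The index depends only on the iteration counter, so the UNROLL accesses
   inside one unrolled step are consecutive (unit stride). */
static void single_task(const float *in, float *out, size_t n) {
    size_t i = 0;
    for (; i + UNROLL <= n; i += UNROLL) {
        for (size_t u = 0; u < UNROLL; ++u) {
            out[i + u] = 2.0f * in[i + u];
        }
    }
    for (; i < n; ++i) {  /* remainder iterations when n is not a multiple of UNROLL */
        out[i] = 2.0f * in[i];
    }
}

/* Runs both variants over the same input and returns 1 if the results agree. */
static int patterns_match(size_t n) {
    float in[MAX_N], a[MAX_N], b[MAX_N];
    if (n > MAX_N) n = MAX_N;
    for (size_t i = 0; i < n; ++i) {
        in[i] = (float)i;
        a[i] = 0.0f;
        b[i] = 0.0f;
    }
    /* Emulate launching n work-items, one per element. */
    for (size_t gid = 0; gid < n; ++gid) {
        ndrange_body(in, a, gid);
    }
    single_task(in, b, n);
    for (size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}
```

Both versions touch exactly the same addresses in the same order. The only difference is whether the contiguity comes from neighbouring work-items (spatial) or from neighbouring loop iterations (temporal), which is the distinction I am asking about.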