Forum Discussion
On FPGAs, there is no fixed warp and no thread-level parallelism either. Unless you use SIMD or loop unrolling, it doesn't make much difference whether your memory accesses are contiguous or not: only one access per access port is performed per cycle, so the memory bandwidth will be underutilized either way.
By default, the compiler creates one access port to external memory for every access that exists in your kernel. The size of this port is equal to the size of the datatype used for the access, rounded up to the nearest power of two. Now, when you use SIMD in NDRange kernels or loop unrolling in Single Work-item kernels, apart from widening the pipeline, the compiler will also coalesce all the accesses that are consecutive into one larger access port. Non-consecutive accesses will instead be split into as many ports as the SIMD/unroll factor, each the size of the datatype. This is done at compile-time. If you check the "System viewer" section of the area report, you can see that the ports to memory get wider when SIMD or unrolling is used over consecutive accesses.
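To make the port rule concrete, here is a rough toy model of it in Python. This is only a sketch of the behaviour described above, not the compiler's actual algorithm; the function names are made up for illustration, and it assumes the SIMD/unroll factor is a power of two.

```python
def port_width_bytes(type_size_bytes):
    """Round a datatype size up to the nearest power of two (width of one port)."""
    width = 1
    while width < type_size_bytes:
        width *= 2
    return width


def memory_ports(type_size_bytes, factor=1, consecutive=True):
    """Return the modelled port widths (in bytes) for ONE access in the kernel source.

    factor      -- SIMD or unroll factor (1 means neither is used);
                   assumed to be a power of two in this toy model
    consecutive -- True if the replicated accesses touch adjacent addresses
    """
    base = port_width_bytes(type_size_bytes)
    if factor == 1:
        return [base]            # default: one port of the datatype's size
    if consecutive:
        return [base * factor]   # coalesced into a single wide port
    return [base] * factor       # one narrow port per replicated access


if __name__ == "__main__":
    # 4-byte float, factor 16, consecutive: one 64-byte port
    print(memory_ports(4, factor=16, consecutive=True))
    # same access but strided: sixteen 4-byte ports competing for the bus
    print(memory_ports(4, factor=16, consecutive=False))
```

With a 4-byte float and a factor of 16, the consecutive case gives a single 64-byte port, while the non-consecutive case gives sixteen 4-byte ports. That difference is what you should see reflected in the port widths in the area report.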
Needless to say, best memory performance is achieved with a few very wide coalesced accesses rather than a lot of narrow non-coalesced ones since the latter will create a large amount of contention on the memory bus and significantly reduce memory access efficiency.
P.S. You should probably avoid using both SIMD and unrolling in an NDRange kernel over external memory accesses, because it is not usually the case that the accesses are consecutive in both the SIMD direction and the unrolling direction. SIMD is applied to the first dimension of NDRange kernels; hence, in 2D and 3D NDRange kernels you should make sure your accesses are consecutive over the first dimension so that they can be coalesced.
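As a quick sanity check for the first-dimension point, the following Python sketch evaluates an index expression for adjacent work-items along dimension 0 and checks whether the resulting element indices are consecutive. It is only an illustration: the real compiler decides this statically at compile-time, and WIDTH, HEIGHT and the two index functions are hypothetical examples.

```python
WIDTH = 1024   # hypothetical size of the 2D buffer along dimension 0
HEIGHT = 512   # hypothetical size along dimension 1


def row_major(gid0, gid1):
    """Dimension 0 is the fastest-varying index: neighbours are adjacent."""
    return gid1 * WIDTH + gid0


def col_major(gid0, gid1):
    """Dimension 0 strides by HEIGHT elements: neighbours are far apart."""
    return gid0 * HEIGHT + gid1


def is_coalesceable(index_fn, simd, gid1=3):
    """True if `simd` adjacent work-items along dimension 0 hit consecutive elements."""
    indices = [index_fn(gid0, gid1) for gid0 in range(simd)]
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))


if __name__ == "__main__":
    print(is_coalesceable(row_major, simd=8))  # consecutive over dimension 0
    print(is_coalesceable(col_major, simd=8))  # strided over dimension 0
```

If your indexing looks like the second case, swapping which NDRange dimension maps to which array dimension is usually enough to make the SIMD accesses consecutive.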