problem in streaming access to global memory in OpenCL
- 7 years ago
If you cannot use SIMD, utilizing the memory bandwidth efficiently will be difficult, since your only remaining tool is replicating compute units. In the end you will always get low external memory bandwidth efficiency, because you will have multiple narrow accesses competing with each other for the memory bus.
The information about the private cache does not seem to have been carried over to the new HTML report yet; I used the report from an older version of the compiler to see which accesses are being cached. Based on what I can see, in both of your kernels only the ovid_of_edge[j] and pg_val[ovid] reads are cached, and the rest are not. If you can perform caching manually, as you have done in your second kernel, without breaking data consistency, then that is indeed a very good way to improve performance. The compiler is also correctly coalescing the unrolled reads in your second kernel into wide 512-bit accesses, though this configuration could over-utilize the memory bandwidth on boards with only one or two memory banks. An unroll size (cache line size) of 8 might be more appropriate in that case, but you should compile and test both configurations to see which one is faster.
Another thing I can think of is to merge all the buffers that are only read at index i into a struct, so that instead of issuing multiple narrow reads, you read all of them at once with one large read from the struct: effectively an array of structs. This could also improve memory efficiency.
Finally, since you are separating compute from memory accesses by performing manual caching, you should make sure enough work-groups can run concurrently in each compute unit to keep the pipeline busy.
P.S. I think you need to add a local memory barrier at the end of your unrolled memory loop.
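Something along these lines (a kernel fragment, not complete code; the loop and buffer names are illustrative): the barrier after the unrolled fill loop guarantees that every work-item sees the fully written local cache before the compute phase reads from it.

```c
// OpenCL kernel fragment (sketch; names are illustrative).
#pragma unroll
for (int j = 0; j < CACHE_SIZE; ++j)
    local_cache[lid * CACHE_SIZE + j] = global_buf[base + j];

// Synchronize the work-group so no work-item starts computing
// before the shared local cache is fully populated.
barrier(CLK_LOCAL_MEM_FENCE);

// ...compute phase reads local_cache here...
```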