problem in streaming access to global memory in OpenCL
Hello,
I have a rather simple OpenCL code (shown below), compiled in the ND-range configuration. In some parts of the code I have random memory accesses and, as I expected, the profiler shows that the efficiency of these accesses is low (very low hit rate, one cache line per data access).
However, there are also some fully consecutive memory accesses which I expected to have very high efficiency, since the cache line should be fully utilized. Yet my measurements showed unexpected results: after investigating with the profiler, these accesses show almost the same minimal efficiency as the random ones.
Can anybody suggest anything for this?
I appreciate it if @HRZ can take a look.
float toAdd = 0;
unsigned ei;
unsigned si;
float div;
unsigned ovid; // other vertex id
unsigned start_of_chunk = glb_id * chunk_size;
unsigned end_of_chunk = start_of_chunk + chunk_size;
for (unsigned i = start_of_chunk; i < end_of_chunk; i++)
{
    toAdd = 0.0f;
    ei = end_edge[i];   //? why not streaming?
    si = start_edge[i]; //? why not streaming?
    div = div_array[i]; //? why not streaming?
    for (unsigned j = si; j < ei; j++) // edge loop
    {
        ovid = ovid_of_edge[j]; //? why not streaming?
        toAdd += val[ovid];
    }
    val_next[i] = toAdd * div;
}
If you cannot use SIMD, utilizing the memory bandwidth efficiently will be difficult, since your only remaining tool would be using multiple compute units. In the end, you will always get low external memory bandwidth efficiency, since you will have multiple narrow accesses competing with each other for the memory bus.
The information about the private cache does not seem to have been carried over to the new HTML report yet, so I used the report from an older version of the compiler to see which accesses are being cached by the compiler. Based on what I can see, in both of your kernels, only the ovid_of_edge[j] and pg_val[ovid] reads are cached and the rest are not. If you can manually perform caching as you have done in your second kernel without breaking data consistency, then that is indeed a very good method to improve performance. The compiler is also correctly coalescing the unrolled reads in your second kernel into wide 512-bit accesses; however, this configuration could result in memory bandwidth overutilization on boards with only one or two memory banks. Maybe an unroll size (cache line size) of 8 would be more appropriate in this case, but you should probably compile and test both cases to see which one is faster.
Another thing I can think of is to merge all the buffers that are only read through the i index into a struct, so that instead of multiple narrow reads, you can read all of them at once using one wide read from the struct; essentially an array of structs. This could also improve memory efficiency.
Finally, since you are separating compute from memory accesses by performing manual caching, you should make sure to have enough work-groups that can run concurrently in each compute unit to keep the pipeline busy.
P.S. I think you need to add a local memory barrier at the end of your unrolled memory loop.