If multiple kernels running in parallel in different queues update the same global buffer, the output will always be undefined because, as I said, the OpenCL standard guarantees global memory consistency only after kernel execution has finished. I once tried something like what you want: two kernels running in parallel, with one sending messages to the other over a channel to synchronize them. That didn't work, because channel operations and memory operations have different latencies, and there is no guarantee that the memory operation in the first kernel has finished by the time the message reaches the second kernel. Intel also provides a global memory barrier that is supposed to help in such cases, but it made no difference in my case. You can try using channels together with the global memory barrier to see if it works for you, but if it doesn't, that is completely normal, since the OpenCL standard does not require such functionality to be supported. Needless to say, there are always alternative designs that do not require sharing global memory buffers.
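
For reference, the channel-plus-barrier pattern described above looks roughly like the sketch below. The kernel names, the channel name `sync_ch`, and the buffer layout are made up for illustration. The channel calls and `mem_fence` flags are the ones I know from the Intel FPGA SDK for OpenCL, but check them against your SDK version. This is not my original code, and as noted it is not guaranteed to behave correctly.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

// Hypothetical channel, used only to signal "data is ready".
channel int sync_ch;

// Single work-item kernel that fills the shared buffer, then signals.
__kernel void producer(__global int *restrict buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = i;                        // write to the shared global buffer

    // Request ordering between the global writes and the channel write.
    mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_CHANNEL_MEM_FENCE);
    write_channel_intel(sync_ch, 1);       // notify the consumer
}

// Single work-item kernel that waits for the signal, then reads the buffer.
__kernel void consumer(__global const int *restrict buf,
                       __global int *restrict out, int n)
{
    (void)read_channel_intel(sync_ch);     // blocks until the producer signals
    mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_CHANNEL_MEM_FENCE);

    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += buf[i];                     // may still observe stale data
    out[0] = sum;
}
```

If `out[0]` does not match the expected sum when both kernels run concurrently, that is consistent with what I saw. The safer design is to send the data itself through the channel, or to let the first kernel finish before the second one starts.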