--- Quote Start ---
I am not sure about CUDA, but with OpenCL on GPUs you can still have multiple queues and try to run multiple kernels in parallel, and they can actually run in parallel on the hardware as long as there are shader blocks left unused by the first kernel. And such races between work-items of the same kernel running in different work-groups are always possible.
Regarding NDRange kernels, without SIMD, all work-items from all work-groups will be pipelined on the actual hardware and no two threads will ever be issued in the same clock (hence you don't need to recompile the kernel if you change local or global size). However, if you use SIMD, as many threads as your SIMD width can potentially be issued in the same clock. With num_compute_units, you can have multiple work-groups issued concurrently in different compute units.
--- Quote End ---
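For reference, the attributes mentioned in the quote are kernel attributes in the Intel FPGA OpenCL SDK. A minimal sketch (the kernel name, sizes, and replication factors below are illustrative, not from the original post):

```c
// Illustrative Intel FPGA OpenCL kernel showing the attributes from the
// quote above. num_simd_work_items requires reqd_work_group_size, and
// the SIMD width must evenly divide the work-group size.
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(8)))   // up to 8 work-items issued per clock
__attribute__((num_compute_units(2)))     // replicate the whole pipeline twice
__kernel void scale(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * 2.0f;
}
```

With num_compute_units(2), work-groups are distributed across the two compute units by the hardware scheduler, at the cost of roughly doubling the area used by the kernel.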
I tried num_compute_units on an image processing kernel; I guess because I have to do indexing inside the kernel (so each compute unit works on a different image region), it actually became slower. Instead, I made two copies of the kernel under different names, executed them on different queues, and passed a different region of the image to each; that took half the original time, as I expected.
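The two-copies strategy described above can be sketched on the host side like this (a hypothetical sketch: the kernel names, queue handles, and argument layout are assumptions, and context/program/buffer setup plus error checking are omitted):

```c
/* Hypothetical host-side sketch of the two-queue strategy: two copies of
 * the same kernel source, built under different names, each handed a
 * different half of the image and enqueued on its own in-order queue so
 * they can overlap on the device. */
cl_kernel k_top    = clCreateKernel(program, "process_top",    NULL);
cl_kernel k_bottom = clCreateKernel(program, "process_bottom", NULL);

int start_top = 0;
int start_bot = image_height / 2;   /* each copy gets its own region */
clSetKernelArg(k_top,    0, sizeof(cl_mem), &image_buf);
clSetKernelArg(k_top,    1, sizeof(int),    &start_top);
clSetKernelArg(k_bottom, 0, sizeof(cl_mem), &image_buf);
clSetKernelArg(k_bottom, 1, sizeof(int),    &start_bot);

/* Separate queues on the same device let the two kernels run concurrently. */
clEnqueueTask(queue0, k_top,    0, NULL, NULL);
clEnqueueTask(queue1, k_bottom, 0, NULL, NULL);
clFinish(queue0);
clFinish(queue1);
```

clEnqueueTask launches a single work-item kernel, which is the usual style in the Intel FPGA SDK; an NDRange launch via clEnqueueNDRangeKernel works the same way with this two-queue pattern.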
So now I'm trying to figure out whether there's an easier way to do this. It doesn't seem wise to apply this multiple-copies strategy manually when I need a large number of copies, and the num_compute_units attribute doesn't help much because of the index-computation overhead (unless I make a copy for each processed pixel, which would take far too many resources). SIMD, meanwhile, can only be applied to kernels whose computation can be vectorized.
-----------------------
I found that simply calling get_global_id(0) causes latency on the order of 10 ms. I guess there's no way to avoid this except launching the kernels manually.
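One way to sidestep get_global_id() entirely is to pass the region bounds in as kernel arguments and loop over them in a single-work-item kernel, so one source serves every region and each launch just gets different bounds. A sketch (the kernel name, argument layout, and the per-pixel operation are all made up for illustration):

```c
// Illustrative single-work-item kernel: the region arrives as arguments
// instead of being derived from get_global_id(), so the same source can
// be launched once per region (e.g. via clEnqueueTask on separate
// queues) without any per-work-item index computation.
__kernel void process_region(__global const uchar *in,
                             __global uchar *out,
                             int row_start, int row_end, int width)
{
    for (int y = row_start; y < row_end; y++)
        for (int x = 0; x < width; x++)
            out[y * width + x] = 255 - in[y * width + x];  // placeholder op
}
```

On the Intel FPGA SDK this single-work-item style also lets the compiler pipeline the loops directly, which avoids the NDRange dispatch overhead you are measuring.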