Forum Discussion
Altera_Forum
Honored Contributor
8 years ago

I am not sure about CUDA, but with OpenCL on GPUs you can still have multiple queues and try to run multiple kernels in parallel, and they can actually run in parallel on the hardware as long as the first kernel leaves some shader blocks unused. You can also always have such races between work-items of the same kernel that run in different work-groups.
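The multi-queue idea above can be sketched on the host side roughly like this. This is a minimal sketch assuming an already-created context, device, and built kernels (kernelA/kernelB are hypothetical names); error checking is omitted for brevity:

```c
/* Sketch: launching two kernels on separate in-order queues so the
 * runtime MAY overlap them on the hardware. Whether they actually
 * run in parallel depends on free compute resources on the device. */
#include <CL/cl.h>

void launch_concurrently(cl_context ctx, cl_device_id dev,
                         cl_kernel kernelA, cl_kernel kernelB,
                         size_t global_size)
{
    /* Two independent command queues: kernels enqueued on different
     * queues have no implicit ordering between them. */
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_command_queue q2 = clCreateCommandQueue(ctx, dev, 0, NULL);

    clEnqueueNDRangeKernel(q1, kernelA, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
    clEnqueueNDRangeKernel(q2, kernelB, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);

    /* Both kernels are now in flight; wait for both to drain. */
    clFinish(q1);
    clFinish(q2);

    clReleaseCommandQueue(q1);
    clReleaseCommandQueue(q2);
}
```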
Regarding NDRange kernels: without SIMD, all work-items from all work-groups are pipelined on the actual hardware and no two threads are ever issued in the same clock cycle (hence you don't need to recompile the kernel if you change the local or global size). However, if you use SIMD, up to as many threads as your SIMD width can be issued in the same clock cycle. With num_compute_units, multiple work-groups can be issued concurrently in different compute units.
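As a concrete illustration of those two attributes, here is a sketch using the Intel/Altera FPGA SDK for OpenCL attribute syntax; the kernel body itself is a made-up example:

```c
/* Sketch: up to 8 work-items enter the pipeline per clock (SIMD),
 * and 2 replicated compute units each process their own work-groups.
 * Note that num_simd_work_items requires a fixed work-group size,
 * which is why reqd_work_group_size is also specified. */
__attribute__((num_simd_work_items(8)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_compute_units(2)))
__kernel void scale(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = 2.0f * in[gid];
}
```

Changing the SIMD width or the required work-group size means the pipeline structure changes, so unlike the plain pipelined case, such a kernel does need to be recompiled.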