Forum Discussion
In the absence of SIMD, two threads in a work-group are never issued at the same time; instead, all the threads are pipelined one after another (i.e. with partial overlap) in an order determined at runtime by the scheduler. So yes, it is similar to a for loop with a random iteration order. With SIMD, however, the pipeline is "widened", allowing multiple threads (as many as the SIMD width) to be issued at the same time, on top of the thread pipelining.

With multiple compute units, the whole pipeline, the scheduler, and everything else is duplicated, allowing multiple work-groups to run in parallel. Because of this, using multiple compute units has a higher area footprint, and it also creates more ports to memory, which has an adverse effect on memory performance. In contrast, SIMD has a smaller footprint and can potentially allow consecutive memory accesses to be coalesced into one bigger access, which improves memory performance. The difference between SIMD and num_compute_units is also described in the "Intel FPGA SDK for OpenCL Best Practices Guide", Section 6.3.1.
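For reference, the two replication styles above map to the num_simd_work_items and num_compute_units kernel attributes in the Intel FPGA SDK for OpenCL. A minimal sketch (the kernel names and the work-group/SIMD sizes here are illustrative assumptions, not from the post); note that num_simd_work_items must be accompanied by reqd_work_group_size, with a work-group size divisible by the SIMD width:

```c
// SIMD: widen one pipeline so 8 work-items issue per clock cycle.
// Consecutive accesses from the 8 lanes can be coalesced into one
// wider memory access.
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(8)))
__kernel void vec_add_simd(__global const float *a,
                           __global const float *b,
                           __global float *c) {
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}

// Compute units: replicate the entire pipeline (scheduler, memory
// ports and all) twice, so two work-groups run in parallel, at the
// cost of roughly double the area and extra memory ports.
__attribute__((num_compute_units(2)))
__kernel void vec_add_cu(__global const float *a,
                         __global const float *b,
                         __global float *c) {
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```

The two attributes can also be combined on a single kernel when the area budget allows it.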