NDRange, work-item level parallelism vs work-group level parallelism

Honored Contributor

8 years ago

--- Quote Start ---

Maybe I simplified things a little bit too much. It is not just about divisibility. As you said, there is no guarantee that work-groups running in different CUs would finish at the same time, hence some CUs will always remain unused. However, with more work-groups, the chance of a CU being unused will get smaller, resulting in closer-to-linear speed-up with number of CUs. Furthermore, at least based on what Altera's report claims, there is also work-group pipelining in place and hence, there could be multiple work-groups in-flight in the same CU at the same time and having more work-groups will further help to keep the CU busy.

--- Quote End ---
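To make the quoted argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not how the Altera runtime schedules anything: it assumes a simple greedy dispatcher that hands each work-group to whichever CU frees up first, ignores work-group pipelining inside a CU, and uses made-up work-group durations. The names `makespan` and `speedup` are only illustrative.

```python
import heapq


def makespan(durations, num_cus):
    """Total time when each work-group is given to the CU that frees up first.

    durations: hypothetical run time of each work-group, in dispatch order.
    num_cus:   number of compute units (assumed identical).
    """
    free_at = [0] * num_cus          # time at which each CU becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)       # earliest-free CU takes the group
        heapq.heappush(free_at, start + d)
    return max(free_at)


def speedup(durations, num_cus):
    """Speed-up over running every work-group back to back on a single CU."""
    return sum(durations) / makespan(durations, num_cus)


if __name__ == "__main__":
    uneven = [4, 1, 1, 1]                    # made-up, deliberately unbalanced
    print("4 work-groups, 2 CUs:  ", speedup(uneven, 2))
    print("100 work-groups, 2 CUs:", speedup(uneven * 25, 2))
```

With only four uneven work-groups one CU sits idle at the end, so the speed-up on two CUs stays below 2. Repeating the same pattern over 100 work-groups brings it much closer to 2, which is the effect described in the quote.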

I understand.

In Altera's OpenCL, loops in a single work-item kernel (task) are pipelined. But for an NDRange kernel, when a work-group with many work-items runs on one PE (one PE meaning no SIMD), how is the parallelism implemented? The same question applies one level up, when multiple work-groups run on a CU. Is it pipelining or multi-threading? I guess it is pipelining, but I am not sure. Do you have any detailed documentation about this? I could not find any details in the "Programming Guide" or "Best Practices Guide" manuals.
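To show what I mean by the two options, here is a small Python sketch of the cycle counts under each. It is only a model of my guess, not something taken from Altera's documentation: it assumes the PE is a fixed-depth datapath, that in the pipelined case a new work-item can enter every `initiation_interval` cycles, and that in the non-pipelined case each work-item must finish before the next one starts. The depth of 100 stages and the 256 work-items are made-up numbers.

```python
def serialized_cycles(num_items, depth):
    """Cycles if each work-item must leave the datapath before the next enters."""
    return num_items * depth


def pipelined_cycles(num_items, depth, initiation_interval=1):
    """Cycles if a new work-item enters the datapath every initiation_interval cycles.

    The first work-item needs `depth` cycles to drain through the pipeline;
    every later one finishes `initiation_interval` cycles after its predecessor.
    """
    if num_items == 0:
        return 0
    return depth + (num_items - 1) * initiation_interval


if __name__ == "__main__":
    items, depth = 256, 100                  # hypothetical work-group size and pipeline depth
    print("serialized:", serialized_cycles(items, depth))
    print("pipelined: ", pipelined_cycles(items, depth))
```

If work-items are pipelined, the cost of a work-group is dominated by the number of work-items rather than by the pipeline depth, and the same formula would apply to back-to-back work-groups on one CU if they can share the pipeline without draining it in between.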
