NDRange, work-item level parallelism vs work-group level parallelism

Honored Contributor

8 years ago

--- Quote Start ---

- Try to use a large enough work-group size to benefit from multi-threading many work-items over that single PE. I can guess why: the PE is probably pipelined over work-items (is that right?), so the pipeline is used efficiently if there are many work-items.

--- Quote End ---

This is true.
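To put a rough number on the pipelining point, here is a minimal sketch of an idealized pipeline model. It assumes the PE accepts one work-item per clock cycle and looks at a single batch of work-items in isolation; `pipeline_depth` is a hypothetical figure, not something reported by the tool, and real kernels can stall on memory or barriers.

```python
def pipeline_utilization(num_work_items, pipeline_depth):
    """Fraction of cycles in which an idealized pipeline retires a work-item.

    Assumes one work-item enters per cycle, so a batch of N work-items
    needs (pipeline_depth + N - 1) cycles from first input to last output.
    """
    cycles = pipeline_depth + num_work_items - 1
    return num_work_items / cycles


# Hypothetical pipeline depth of 100 stages, for illustration only.
for n in (1, 16, 256, 4096):
    print(f"{n:5d} work-items -> utilization {pipeline_utilization(n, 100):.3f}")
```

Under these assumptions, utilization approaches 1 only once the number of work-items is much larger than the pipeline depth, which is the intuition behind the recommendation.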

--- Quote Start ---

- Try to use a large number of work-groups to benefit from multiple CUs. I really do not understand this. Aren't CUs completely independent? Why does the tool recommend this when I have multiple CUs? How can there be parallelism at the work-group level?

--- Quote End ---

This is more or less the same concept as above. Let's say you have four CUs and a total of six work-groups, and a CU takes X seconds to process one work-group. Four work-groups are scheduled onto the four CUs simultaneously. When they finish, the remaining two work-groups are scheduled onto two CUs, leaving the other two CUs unused. The whole run therefore finishes after 2X seconds. Basic arithmetic shows that with only three CUs the run time would still be 2X (two rounds of three work-groups), so the extra CU gives you no benefit: there are not enough work-groups to keep all CUs busy all the time. However, with a large enough number of work-groups, four CUs will be ~33% faster than three. Note that this is the theoretical case; in practice, performance scaling with multiple CUs also depends on external memory bandwidth and operating frequency.
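The arithmetic can be checked with a small sketch. It models only the theoretical case described here: every work-group takes the same time X, work-groups are dispatched in rounds of one per CU, and memory bandwidth and clock frequency are ignored. The function name and the 1200 work-group figure are illustrative choices, not anything from the tool.

```python
import math


def run_time(num_work_groups, num_cus, time_per_group=1.0):
    """Idealized run time, in units of X, when work-groups run in rounds."""
    rounds = math.ceil(num_work_groups / num_cus)
    return rounds * time_per_group


for groups in (6, 1200):
    t3 = run_time(groups, 3)
    t4 = run_time(groups, 4)
    print(f"{groups} work-groups: 3 CUs -> {t3:g}X, "
          f"4 CUs -> {t4:g}X, speedup {t3 / t4:.2f}x")
```

With six work-groups both configurations take 2X, while with 1200 work-groups the four-CU case takes 300X against 400X, a speedup of about 1.33x.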
