Forum Discussion
Altera_Forum
Honored Contributor
8 years ago

--- Quote Start ---
This is true, and it is more or less the same concept as above. Say you have a total of six work-groups and the time to process one work-group on a CU is X seconds. Four work-groups are scheduled onto the four available CUs simultaneously; when they finish, the remaining two work-groups are scheduled onto two CUs, leaving the other two CUs idle. The whole process finishes after 2X seconds. Basic math tells you that even with only three CUs the runtime would still be 2X, so in this case the extra CU gives you no benefit: there are not enough work-groups to keep all CUs busy all the time. With a large enough number of work-groups, however, four CUs will be ~33% faster than three. Note that this is the theoretical case; in practice, performance scaling with multiple CUs also depends on external memory bandwidth and operating frequency.
--- Quote End ---

So do you mean that when scheduling work-groups onto CUs, divisibility matters? If so, why is it not recommended to use a number of work-groups divisible by the number of CUs? Assuming all work-groups have the same runtime (which may not hold if the kernel contains group-id-dependent control flow), running exactly 4 work-groups (one per CU) is enough to leave no CU idle; there is no need to run many work-groups on each CU.

And a related question: is there any pipelining at the work-group level, i.e. can the next work-group enter a CU while the previous one is still executing?

Thanks for your help.
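The arithmetic in the quote can be sketched as a toy model, assuming ideal conditions: every work-group takes exactly X seconds, work-groups are dispatched in waves of up to one per CU, and memory bandwidth is not a bottleneck. The function name and parameters below are illustrative, not part of any Altera/OpenCL API.

```python
import math

def total_runtime(num_workgroups, num_cus, x_seconds):
    """Idealized runtime: work-groups run in 'waves' of up to num_cus
    at a time, each wave taking x_seconds; leftover CUs in the last
    wave sit idle."""
    waves = math.ceil(num_workgroups / num_cus)
    return waves * x_seconds

# The scenario from the quote, with X = 1 second per work-group:
print(total_runtime(6, 4, 1))    # 2 (two waves on four CUs)
print(total_runtime(6, 3, 1))    # 2 (also two waves on three CUs)

# With many work-groups the extra CU pays off (~33% faster):
print(total_runtime(120, 4, 1))  # 30
print(total_runtime(120, 3, 1))  # 40
```

Under this model the fourth CU only helps when the work-group count is large (or divisible by the CU count), which is exactly the divisibility effect being asked about.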