Forum Discussion
Altera_Forum
Honored Contributor
8 years ago

--- Quote Start ---
Maybe I simplified things a little bit too much. It is not just about divisibility. As you said, there is no guarantee that work-groups running in different CUs would finish at the same time, hence some CUs will always remain unused. However, with more work-groups, the chance of a CU being unused will get smaller, resulting in closer-to-linear speed-up with the number of CUs. Furthermore, at least based on what Altera's report claims, there is also work-group pipelining in place, so there could be multiple work-groups in flight in the same CU at the same time, and having more work-groups will further help to keep the CU busy.
--- Quote End ---

I understand. In Altera's OpenCL, the loops of a single work-item kernel (task) are pipelined. But for an NDRange kernel, when a work-group with many work-items runs on one PE (one PE meaning no SIMD), and likewise in the analogous higher-level case where multiple work-groups run on one CU, how is the parallelism implemented: by pipelining or by multi-threading? I guess it is pipelining, but I am not sure.

Do you have any detailed documentation about this? I could not find any details in the "Programming Guide" or "Best Practices Guide" manuals.
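
To make the quoted point about work-group count and CU utilization concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: each work-group is dispatched to whichever CU becomes free first, a CU runs one work-group at a time (so the work-group pipelining mentioned in the quote is ignored), and the names `makespan` and `speedup` are hypothetical helpers. It is not a description of how Altera's runtime actually schedules work-groups.

```python
import heapq
import random


def makespan(durations, num_cus):
    """Total run time under greedy dispatch (assumed policy, not Altera's):
    each work-group goes to the CU that becomes free first."""
    free_at = [0.0] * num_cus  # time at which each CU becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)      # earliest-free CU
        heapq.heappush(free_at, start + d)  # that CU is busy until start + d
    return max(free_at)


def speedup(durations, num_cus):
    """Speed-up over running all work-groups back to back on a single CU."""
    return sum(durations) / makespan(durations, num_cus)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    num_cus = 4
    for num_groups in (5, 8, 64, 1024):
        # Work-groups with unequal run times, as discussed in the quote.
        durations = [rng.uniform(0.5, 1.5) for _ in range(num_groups)]
        print(num_groups, round(speedup(durations, num_cus), 2))
```

With equal-length work-groups on 4 CUs, this model gives a speed-up of 2.5 for 5 work-groups (three CUs sit idle during the second round) and exactly 4 for 8 work-groups. As the number of work-groups grows, the idle tail becomes a smaller fraction of the total run time and the speed-up approaches the number of CUs.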