Example showing multiple compute units providing speedup

Honored Contributor

8 years ago

num_compute_units is one of the most basic attributes provided by Altera's compiler, and examples of its usage are available in Altera's OpenCL documents. When num_compute_units is used with an NDRange kernel, the compiler internally replicates the kernel pipeline the number of times specified by the user, so that multiple work-groups can run in parallel. Altera recommends having at least three times as many work-groups as compute units in order to fully utilize the circuit. All of the replicas are identical: they access the same memory buffers and perform exactly the same operations. They simply allow the run-time scheduler to schedule more threads in parallel, drawn from different work-groups. All of this happens automatically, without user intervention.
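To make the rule of thumb above concrete, here is a minimal host-side sketch in C. The kernel in the comment is only there for context; its name (`scale`), its work-group size of 64, and the helper names `work_group_count` and `enough_work_groups` are my own illustrative choices, not anything from Altera's documents. It also reads "three times more" as "at least 3x as many", which is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel side (OpenCL C, compiled offline), shown for context only.
 * The kernel name and sizes are hypothetical:
 *
 *   __attribute__((reqd_work_group_size(64, 1, 1)))
 *   __attribute__((num_compute_units(2)))
 *   __kernel void scale(__global const float *restrict in,
 *                       __global float *restrict out)
 *   {
 *       size_t i = get_global_id(0);
 *       out[i] = 2.0f * in[i];
 *   }
 */

/* Number of work-groups in a 1-D NDRange.
 * Assumes global_size is an exact multiple of local_size. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Rule of thumb from the post: at least 3x as many work-groups
 * as compute units. Returns 1 if the launch satisfies it, else 0. */
int enough_work_groups(size_t global_size, size_t local_size,
                       size_t num_compute_units)
{
    return work_group_count(global_size, local_size)
           >= 3 * num_compute_units;
}
```

For example, a global size of 1024 with a local size of 64 gives 16 work-groups, which is comfortably above the 6 suggested for two compute units, whereas a global size of 256 gives only 4 and would likely leave the replicas underused.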

If you want multiple kernels that perform different operations or access different memory buffers, you have to define them as separate kernels.

P.S. Regarding the original post, I have used num_compute_units numerous times, and as long as the on-board memory bandwidth is not saturated, it certainly does improve performance. The key is to have many work-groups; a kernel that runs as a single work-group (i.e. no local_id in the kernel) will not benefit from num_compute_units at all, which is probably why the original poster could not achieve any performance improvement.
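A rough way to see why a single work-group gains nothing is the toy model below. It is a deliberate simplification and not how the hardware scheduler actually works: it assumes every work-group takes the same time, that each compute unit handles one work-group at a time, and that memory bandwidth is never the bottleneck. The function name `ideal_speedup` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (assumption, not the real scheduler): work-groups are
 * processed in "rounds", with each compute unit taking one work-group
 * per round. Speedup is relative to a single compute unit. */
double ideal_speedup(size_t num_work_groups, size_t num_compute_units)
{
    assert(num_work_groups > 0 && num_compute_units > 0);

    /* Rounds needed with N compute units: ceil(work_groups / N). */
    size_t rounds = (num_work_groups + num_compute_units - 1)
                    / num_compute_units;

    /* With one compute unit, the number of rounds equals the
     * number of work-groups. */
    return (double)num_work_groups / (double)rounds;
}
```

Under these assumptions, one work-group on four compute units gives a speedup of exactly 1 (three replicas sit idle), while 16 work-groups on four compute units gives the full factor of 4.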
