Forum Discussion
num_compute_units is one of the most basic attributes provided by Altera's compiler, and examples of its usage are available in Altera's OpenCL documents. When num_compute_units is used with an NDRange kernel, the compiler internally replicates the kernel pipeline the number of times specified by the user, so that multiple work-groups can run in parallel. Altera recommends having at least three times as many work-groups as compute units in order to fully utilize the circuit. All of the replicas are exactly the same: they access the same memory buffers and perform the exact same operations. They just allow the run-time scheduler to schedule more threads in parallel from different work-groups. All of this happens automatically, without user intervention.
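For reference, the attribute is written in front of the kernel declaration as `__attribute__((num_compute_units(N)))`. On the host side, a quick way to check a launch against the three-times guideline is to count the work-groups. This is only a rough sketch in plain C: the function names are made up for illustration and are not part of any Altera API, and it assumes a 1-D NDRange.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-groups in a 1-D NDRange launch.
 * Assumes global_size is a multiple of local_size, as OpenCL 1.x requires. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    assert(global_size % local_size == 0);
    return global_size / local_size;
}

/* Hypothetical helper: returns 1 if the launch has at least three times as
 * many work-groups as compute units (the guideline quoted above), else 0. */
int enough_work_groups(size_t global_size, size_t local_size,
                       size_t compute_units)
{
    return num_work_groups(global_size, local_size) >= 3 * compute_units;
}
```

For example, a global size of 1024 with a local size of 64 gives 16 work-groups, which meets the guideline for 4 compute units (needs 12) but not for 8 (needs 24).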
If you want multiple kernels that perform different operations or access different memory buffers, you have to define them as separate kernels.
P.S. Regarding the original post, I have used num_compute_units numerous times, and as long as the on-board memory bandwidth is not saturated, it certainly does improve performance. The key is to have many work-groups; a single work-group kernel (i.e. no local_id in the kernel) will not benefit from num_compute_units at all, which is probably why the original poster could not achieve any performance improvement.
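To put the single work-group case in numbers: a work-group is scheduled onto one compute unit, so the number of replicas that can be busy at the same time is capped by the number of work-groups. A tiny sketch of that bound in plain C (the function name is hypothetical, and this says nothing about memory bandwidth or actual run time):

```c
#include <stddef.h>

/* Hypothetical helper: upper bound on how many compute units can be busy
 * at once. Each work-group runs on a single compute unit, so the bound is
 * the smaller of the two counts. */
size_t usable_compute_units(size_t work_groups, size_t compute_units)
{
    return work_groups < compute_units ? work_groups : compute_units;
}
```

With one work-group and num_compute_units set to 4, this returns 1, so the other three replicas only cost area.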