Forum Discussion
The number of compute units in the report for NDRange kernels is always reported as 1. That is a bug I reported to Intel a long time ago and they confirmed it. I don't think they have fixed it yet, though.
With respect to CL_DEVICE_MAX_COMPUTE_UNITS, that attribute reflects the physical compute units on the target device, which will always be 1 in the case of an FPGA. The compute units created using num_compute_units are logical compute units.
Finally, your code does not use work-groups (there is no get_local_id()/get_group_id() in it), and hence it will not benefit from compute unit replication. This feature allows multiple work-groups to run in parallel, but your code only uses one work-group.
Thanks for the help.
I tried using get_local_id()/get_group_id() in a new design (I have attached an image of its report), but it still performs the same.
One strange thing I have noticed is that CL_DEVICE_MAX_WORK_ITEM_SIZES returns (0,17,52) and CL_DEVICE_MAX_WORK_GROUP_SIZE returns 2147483647. These numbers seem a bit strange to me.
For context, I run the kernel with clEnqueueNDRangeKernel(queue_, kernel_, 1, NULL, gSize_, wgSize_, 0, NULL, NULL); where wgSize_[3] = {WORK_ITEM_SIZE, 1, 1} and gSize_[3] = {BUFFER_SIZE, 1, 1}. I assume I do not need to enqueue a command for each work-group, right?
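To make the launch configuration above concrete, here is a minimal sketch in plain C (no OpenCL headers needed) of how many work-groups such a call produces. The helper name count_work_groups is hypothetical, and the sizes in the usage note are placeholders, since the actual values of BUFFER_SIZE and WORK_ITEM_SIZE are not given in this thread. It assumes the OpenCL 1.x rule that each global size must be a multiple of the matching local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-groups implied by an NDRange launch.
 * Returns 0 when a local size is zero or does not evenly divide the
 * corresponding global size (OpenCL 1.x would reject such a launch). */
size_t count_work_groups(unsigned work_dim,
                         const size_t *global,
                         const size_t *local)
{
    /* work_dim is the third argument of clEnqueueNDRangeKernel (1 to 3). */
    assert(work_dim >= 1 && work_dim <= 3);

    size_t groups = 1;
    for (unsigned d = 0; d < work_dim; ++d) {
        if (local[d] == 0 || global[d] % local[d] != 0)
            return 0;
        groups *= global[d] / local[d];
    }
    return groups;
}
```

For example, with placeholder sizes gSize_ = {1024, 1, 1} and wgSize_ = {64, 1, 1} and work_dim = 1, the helper returns 16, so one enqueue already covers 16 work-groups.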
- HRZ, 7 years ago
Frequent Contributor
No, you don't need a separate enqueue command for each work-group; everything is handled automatically. How many work-groups are you using? The guides recommend at least three times as many work-groups as compute units to see a reasonable performance benefit. Furthermore, if your application is memory-unfriendly (e.g., random memory accesses) or one compute unit already saturates the memory bandwidth, you are not going to see any performance benefit from using multiple compute units.
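The rule of thumb above can be checked on the host before launching. The following is a minimal sketch in plain C; the function name has_enough_work_groups is hypothetical, and the factor of 3 is taken from the recommendation quoted in this reply, not from any API. Passing the check says nothing about memory bandwidth, which can still cap the speed-up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the launch has at least three times as
 * many work-groups as logical compute units (the value passed to
 * num_compute_units), and 0 otherwise. */
int has_enough_work_groups(size_t num_work_groups, size_t num_compute_units)
{
    /* A kernel always has at least one compute unit. */
    assert(num_compute_units > 0);

    return num_work_groups >= 3 * num_compute_units;
}
```

For instance, 16 work-groups with 4 compute units passes the check (16 >= 12), while 8 work-groups with 4 compute units does not.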