Forum Discussion
Altera_Forum
Honored Contributor
8 years ago
With a SIMD factor of 4, you get one copy of the kernel (not 4), but up to 4 work-items (threads) can be issued in the same clock cycle. To have 4 kernel copies with 256 work-items running in each copy, replicate the kernel pipeline with __attribute__((num_compute_units(4))). SIMD is generally faster and more area-efficient than kernel pipeline replication, so replication should be avoided unless you cannot use SIMD, or you have already reached the maximum allowed SIMD value and still have area left on the FPGA. Section 1.7.3 of the "Intel® FPGA SDK for OpenCL Best Practices Guide" discusses this subject.
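To make the two options concrete, here is a minimal sketch of the same NDRange kernel written both ways. The kernel names (scale_simd, scale_cu) and the doubling operation are made up for illustration, and the work-group size of 256 is assumed from the numbers above. The #ifndef block at the top is a host-only shim I added so the snippet also builds as ordinary C for a quick functional check; it is not part of the SDK, and emu_gid is a hypothetical stand-in for the work-item index.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shim (an assumption of this sketch, not part of the SDK): lets the
   file build as plain C. An OpenCL compiler defines __OPENCL_VERSION__ and
   skips this block entirely. */
#include <stddef.h>
#define __kernel
#define __global
#define __attribute__(x)
static size_t emu_gid; /* hypothetical stand-in for the work-item index */
static size_t get_global_id(unsigned int dim) { (void)dim; return emu_gid; }
#endif

/* Option 1: one kernel pipeline, up to 4 work-items issued per clock.
   num_simd_work_items needs a fixed work-group size that it divides evenly. */
__attribute__((reqd_work_group_size(256, 1, 1)))
__attribute__((num_simd_work_items(4)))
__kernel void scale_simd(__global const float *restrict in,
                         __global float *restrict out)
{
    size_t gid = get_global_id(0);
    out[gid] = 2.0f * in[gid];
}

/* Option 2: 4 copies of the whole pipeline (compute units);
   work-groups are distributed across the copies. */
__attribute__((reqd_work_group_size(256, 1, 1)))
__attribute__((num_compute_units(4)))
__kernel void scale_cu(__global const float *restrict in,
                       __global float *restrict out)
{
    size_t gid = get_global_id(0);
    out[gid] = 2.0f * in[gid];
}
```

Both kernels compute the same result; only the hardware the compiler generates differs. The two attributes can also be combined on one kernel if area allows.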
For single work-item, if you don't use functions like get_local_id(), get_global_id() and get_group_id(), i.e. your kernel is work-group- and work-item-invariant, the compiler will automatically compile it as a single work-item kernel; anything else will be an NDRange kernel. You can also save a small amount of area by adding __attribute__((max_global_work_dim(0))) to single work-item kernels, which removes the scheduler.
In NDRange kernels, work-items (threads) are scheduled onto the kernel pipeline by the runtime scheduler in a way that keeps the pipeline as busy as possible; this can also involve out-of-order execution. For that reason, no dependencies may exist between work-items, and data can only be shared between them through local memory and barriers. In a single work-item kernel there is no scheduler involved; instead, loop iterations are issued into the kernel pipeline at an initiation interval (II) that is determined by the dependencies between loop iterations.
I strongly recommend fully reading both the "Intel® FPGA SDK for OpenCL Programming Guide" and the "Intel® FPGA SDK for OpenCL Best Practices Guide"; most common questions are answered in those documents.
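As a rough illustration of the single work-item case described above, the sketch below shows a kernel that calls none of the work-item ID functions and expresses all of its work as a loop. The kernel name (sum_single) and its arguments are made up for the example. As before, the #ifndef block is a host-only shim I added so the snippet builds as ordinary C; it is not part of the SDK.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shim (an assumption of this sketch, not part of the SDK): lets the
   file build as plain C. An OpenCL compiler defines __OPENCL_VERSION__ and
   skips this block entirely. */
#define __kernel
#define __global
#define __attribute__(x)
#endif

/* No get_global_id()/get_local_id()/get_group_id() calls, so the kernel is
   work-item-invariant. max_global_work_dim(0) additionally tells the compiler
   that no work-item scheduler is needed. */
__attribute__((max_global_work_dim(0)))
__kernel void sum_single(__global const float *restrict in,
                         __global float *restrict out,
                         int n)
{
    float acc = 0.0f;
    /* Each iteration depends on the previous value of acc. The compiler picks
       the loop's initiation interval based on dependencies like this one. */
    for (int i = 0; i < n; ++i) {
        acc += in[i];
    }
    out[0] = acc;
}
```

An NDRange version of the same reduction would instead read in[get_global_id(0)] per work-item and would need local memory and barriers to combine the partial results, which is the trade-off between the two kernel styles.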