Forum Discussion
Altera_Forum
Honored Contributor
9 years ago

The "wiring" certainly takes place in clCreateProgramWithBinary(), which takes a few hundred milliseconds on Stratix V and a couple of hundred milliseconds on Arria 10. In your case that call would sit outside of your loop and only occur once. The run-time global and local sizes have no effect on the circuit or the wiring on the FPGA; they only affect "scheduling", which is software-based and happens at run time.
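As a sketch of what that one-time setup looks like on the host side (error handling trimmed; the .aocx path would come from your own build, and load_binary_program is just an illustrative helper name):

```c
// Hedged sketch: load a precompiled FPGA binary and create the program
// once, before any loop. Path and helper name are placeholders.
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

cl_program load_binary_program(cl_context ctx, cl_device_id dev, const char *path) {
    FILE *f = fopen(path, "rb");
    fseek(f, 0, SEEK_END);
    size_t size = (size_t)ftell(f);
    rewind(f);
    unsigned char *bin = malloc(size);
    fread(bin, 1, size, f);
    fclose(f);

    cl_int err, bin_status;
    // The FPGA configuration ("wiring") happens here, once, taking on the
    // order of a few hundred milliseconds -- keep it out of hot loops.
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size,
                                                (const unsigned char **)&bin,
                                                &bin_status, &err);
    free(bin);
    return prog;
}
```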
The important distinction here is that threads from the same work-group do NOT run in parallel on the FPGA unless you use SIMD (which requires __attribute__((reqd_work_group_size(X, Y, Z)))). The threads are instead pipelined, depending on the scheduler's behavior and possible local or global memory access contention. This is the major difference between how things work on an Altera FPGA with the OpenCL SDK and on a standard GPU. Because of this, the exact same circuit can be used regardless of your global or local size. The main effect of supplying __attribute__((reqd_work_group_size(X, Y, Z))) is that it lets the compiler optimize area usage and memory accesses for that specific work-group size, rather than assume a worst-case scenario that might never occur at run time and would leave resources underutilized.

In your code, you can safely remove the clFinish call: even though clEnqueueNDRangeKernel is non-blocking, clEnqueueMapBuffer on the same in-order queue will always start after clEnqueueNDRangeKernel completes, and OpenCL guarantees global memory consistency at the end of kernel execution. Just make sure to use a blocking clEnqueueMapBuffer if you are going to use the data on the host right away.
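For reference, the kernel-side attributes look roughly like this (the sizes shown are placeholders; num_simd_work_items is the Altera-specific attribute that enables SIMD replication of the datapath, and it requires reqd_work_group_size to be set):

```c
// Placeholder work-group size (64x1x1): the compiler can now size local
// memory and the scheduling logic for exactly this configuration.
__attribute__((reqd_work_group_size(64, 1, 1)))
// Altera-specific: replicate the datapath to run 4 work-items in SIMD.
__attribute__((num_simd_work_items(4)))
__kernel void my_kernel(__global float *restrict out,
                        __global const float *restrict in) {
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * 2.0f;   // trivial placeholder body
}
```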
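The enqueue sequence without the clFinish then looks like this (queue, kernel, buffer, and sizes are placeholders; the two key points are the in-order queue and the CL_TRUE blocking flag on the map):

```c
// Hedged sketch: on an in-order command queue, the map is guaranteed to
// start only after the kernel finishes, so no clFinish is needed.
size_t global = 1024, local = 64;   // placeholder NDRange sizes
cl_int err;

err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &global, &local, 0, NULL, NULL);

// Blocking map (CL_TRUE): returns only once the data is visible on the
// host, so it is safe to read through "ptr" immediately afterwards.
float *ptr = (float *)clEnqueueMapBuffer(queue, out_buf, CL_TRUE,
                                         CL_MAP_READ, 0,
                                         global * sizeof(float),
                                         0, NULL, NULL, &err);
```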