Forum Discussion
Altera_Forum
Honored Contributor
8 years ago

--- Quote Start ---
I have now executed one of my kernels on an actual FPGA. However, I do not get any speed up or slow down when I change my kernel from NDRange to single work-item (which I was not expecting, at least for this simple kernel). The kernel I am executing on the FPGA contains the following:
--- Quote End ---

This isn't necessarily surprising. If the kernel is simple and straightforward, NDRange and single work-item will perform very similarly.

--- Quote Start ---
Is this the correct way of changing to a single work-item kernel (I did not find a proper method anywhere)? If not, how should I do it? What is your suggestion to improve the execution time of this kernel?
--- Quote End ---

Wrapping the NDRange kernel in for loops over the work-group dimensions is certainly the correct way to convert NDRange to single work-item. However, an NDRange kernel usually has multiple barriers that are used to ensure local memory consistency. These barriers are not needed in single work-item, and it is very likely that you can combine the regions above and below a barrier into one loop in single work-item. I personally prefer to start from a baseline sequential implementation when creating single work-item kernels, rather than converting an existing NDRange kernel to single work-item and manually merging all the loops.

Assuming that the innermost loop is fully pipelined in this case (an initiation interval (II) of one in the compilation report), the most obvious optimization would be to partially unroll the innermost loop using #pragma unroll *factor*.

--- Quote Start ---
How about for a more complex kernel in my application like this. Shall I change to single work item (like above) and follow the optimization report or follow the guide on "how to improve NDRange kernels"?
--- Quote End ---

I personally start from single work-item, see how far I can get and how well I can achieve full pipelining for the loops in the kernel, and if my attempts were not successful, I switch to NDRange. The compilation report for single work-item helps considerably, while the report for NDRange is pretty much useless. The area report is much more useful for optimizing NDRange kernels, but you are not going to get the necessary info from the report generated by Quartus 14.0.

For your specific code, you probably need to use the shift register-based optimization for floating-point reduction for the "sum += factor" operation. Check Altera's documents for how to implement this optimization. Assuming that this optimization allows you to get an II of one for both of the for loops, you should then start unrolling the loop on j to achieve the best performance. There will be some parameter tuning involved here, and you need to time the kernel after a full compilation to determine which unroll factor is best.

In contrast, if the compiler reports that some loops cannot be pipelined due to variable exit conditions, then you should probably stick to NDRange and use SIMD (num_simd_work_items) or num_compute_units to achieve higher performance. This would be after you apply the basic optimizations such as using restrict and reqd_work_group_size.

I strongly recommend fully reading and understanding Altera's OpenCL documents before experimenting with the compiler.
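
For reference, the shift register-based reduction mentioned above looks roughly like the sketch below. This is not your kernel (I do not have it) and not copied from Altera's documents; it is a minimal illustration written as plain C so it can be compiled and checked on a host. The names shift_register_sum and SHIFT_DEPTH are hypothetical, and the depth of 8 is only a placeholder: in a real kernel it should be at least the latency of the floating-point adder. In the OpenCL version, the function would be a __kernel with __global pointers, and the shift loop and the final summation loop would each be preceded by #pragma unroll so that the compiler can infer a shift register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical depth of the shift register. Placeholder value: in a real
 * kernel it should cover the latency of the floating-point adder. */
#define SHIFT_DEPTH 8

/* Sums n floats while keeping SHIFT_DEPTH independent partial sums, so that
 * each addition only depends on a value produced SHIFT_DEPTH iterations ago
 * instead of on the previous iteration. */
float shift_register_sum(const float *restrict data, size_t n)
{
    float shift_reg[SHIFT_DEPTH + 1];
    float sum = 0.0f;
    size_t i;
    int k;

    assert(data != NULL || n == 0);

    /* Clear all partial sums. */
    for (k = 0; k <= SHIFT_DEPTH; k++) {
        shift_reg[k] = 0.0f;
    }

    for (i = 0; i < n; i++) {
        /* Accumulate into the tail using the oldest partial sum (the head). */
        shift_reg[SHIFT_DEPTH] = shift_reg[0] + data[i];

        /* Shift every element one position towards the head.
         * In the OpenCL kernel this loop would be fully unrolled. */
        for (k = 0; k < SHIFT_DEPTH; k++) {
            shift_reg[k] = shift_reg[k + 1];
        }
    }

    /* Final reduction over the partial sums (also unrolled in the kernel).
     * The tail element is skipped because the last shift already copied it. */
    for (k = 0; k < SHIFT_DEPTH; k++) {
        sum += shift_reg[k];
    }
    return sum;
}
```

Note that the additions happen in a different order than in a plain "sum += x" loop, so for general floating-point data the result can differ slightly in the last bits.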