The same question also applies to an NDRange kernel that contains a loop with a large loop bound. In such a case, how is the kernel synthesized into hardware? Thanks.
--- Quote Start ---
Loops are pipelined in both task and NDRange kernels. The difference is that in a task kernel, the loop datapath holds multiple iterations of the same loop in flight, whereas in an NDRange kernel it holds multiple work-items in flight. So, if there are no stalls, the loop datapath should be fully utilized by the work-items. For example, if the loop body takes 100 cycles in total, ideally 100 work-items should be in flight inside the loop.
Thanks for the explanation. What if an NDRange kernel contains a loop with an unknown loop bound? In this case, how is the loop pipelined to contain multiple iterations in flight? Is there a default loop unroll factor that is used to generate the loop pipeline?
--- Quote End ---
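To make the occupancy argument in the quoted explanation concrete (a 100-cycle loop body ideally holding 100 work-items), here is a minimal back-of-the-envelope model in plain C. It is only a sketch under stated assumptions: one work-item can enter the loop per cycle, nothing stalls, and each pipeline stage holds at most one work-item. The function names are hypothetical and are not part of any OpenCL SDK or compiler report.

```c
#include <assert.h>

/* Hypothetical model, not a vendor API.
 * Assumptions: one work-item may enter the loop per cycle, there are no
 * stalls, and each pipeline stage holds at most one work-item.
 * Returns how many work-items can occupy the loop datapath at once. */
unsigned work_items_in_flight(unsigned loop_body_latency_cycles,
                              unsigned available_work_items)
{
    assert(loop_body_latency_cycles > 0);
    /* Occupancy is capped by the pipeline depth and by the supply of work-items. */
    return available_work_items < loop_body_latency_cycles
               ? available_work_items
               : loop_body_latency_cycles;
}

/* Steady-state share of pipeline stages doing useful work, in percent
 * (integer division, so the result is rounded down). */
unsigned datapath_utilization_percent(unsigned loop_body_latency_cycles,
                                      unsigned available_work_items)
{
    unsigned busy = work_items_in_flight(loop_body_latency_cycles,
                                         available_work_items);
    return 100u * busy / loop_body_latency_cycles;
}
```

Under these assumptions, `datapath_utilization_percent(100, 256)` gives 100, while `datapath_utilization_percent(100, 25)` gives 25: with fewer work-items than pipeline stages, most of the loop datapath sits idle. Note that the loop bound does not appear in this model at all; it only considers how many work-items are available to fill the pipeline.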