Loops are pipelined in both task and NDRange kernels. The difference is that in a task kernel, the loop datapath holds multiple iterations of the loop in flight at once, whereas in an NDRange kernel it holds multiple work-items in flight. So, if there are no stalls, the loop datapath should be fully utilized by the work-items. For example, if the loop body takes 100 cycles in total, then ideally (assuming a new work-item can enter every cycle) 100 work-items should be in flight inside the loop.
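To make the 100-cycle example concrete, here is a toy model in Python. It is only a back-of-the-envelope sketch, not something the compiler generates: `peak_in_flight`, `latency`, `num_items` and `ii` are names I made up, and it assumes no stalls and a fixed initiation interval (one new item accepted every `ii` cycles). An "item" stands for a loop iteration in a task kernel or a work-item in an NDRange kernel.

```python
def peak_in_flight(latency, num_items, ii=1):
    """Peak number of items resident in an idealized, stall-free pipeline.

    Assumption: item k enters at cycle k * ii and leaves `latency` cycles
    later, so it occupies the datapath during [k * ii, k * ii + latency).
    """
    last_cycle = (num_items - 1) * ii + latency
    peak = 0
    for cycle in range(last_cycle):
        # Count the items whose residency interval covers this cycle.
        in_flight = sum(
            1 for k in range(num_items)
            if k * ii <= cycle < k * ii + latency
        )
        peak = max(peak, in_flight)
    return peak


print(peak_in_flight(100, 300))        # 100: datapath is full
print(peak_in_flight(100, 40))         # 40: too few items to fill it
print(peak_in_flight(100, 300, ii=2))  # 50: one item every other cycle
```

So a 100-cycle loop body is only fully occupied when at least 100 items are available and one can enter every cycle. With fewer items, or with a larger initiation interval, part of the datapath sits idle.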
Thanks for the explanation. What if an NDRange kernel contains a loop with an unknown loop bound? In that case, how is the loop pipelined to keep multiple iterations in flight? Is a default loop unroll factor used to generate the loop pipeline?