Thanks! So basically the "depth" of the pipeline does not depend on the number of loop iterations or the number of work-items in a work-group, but purely on how many cycles the loop body takes to execute?
What if the loop body contains a slow operation, such as a floating-point divide, alongside much faster operations such as add/sub? Is the pipeline going to run at the speed of the slowest operation? Does AOC try to balance the pipeline stages through "fp-relaxed", or should the balancing be done manually?
In addition, and sorry to sidetrack this thread, I am also wondering whether AOC can generate merged (fused) multiply-add operations like GPUs do, or is floating-point addition always implemented only with LUTs and registers?