--- Quote Start ---
Thanks! So basically the "depth" of the pipeline does not depend on how many loop iterations there are or on the number of work-items in a workgroup, but instead depends purely on how many cycles the loop body actually takes to execute?
What if the loop body contains slow operations such as a floating-point divide alongside much faster operations such as add/sub? Is the pipeline going to run at the speed of the slowest operation? Does AOC try to balance the pipeline stages through "fp-relaxed", or should the balancing be done manually?
In addition, sorry to sidetrack this thread, I am also wondering whether AOC can generate merged multiply-add functions like GPUs do, or are floating-point additions always implemented only with LUTs and registers?
--- Quote End ---
Yes, the depth of the pipeline does not depend on the number of iterations (unless you unroll the loop); it depends mostly on the latency of the instructions in the loop body. However, the compiler sometimes adjusts the depth according to the number of work-items to further optimize the pipeline.
Pipeline balancing is done automatically by the compiler, so "slow" operations can run in parallel with "fast" operations. -fp-relaxed is just an additional flag that tells the compiler it may re-order floating-point operations for further balancing.
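To make the re-ordering point concrete, here is a small Python sketch (not AOC output, just a software model) that compares a sequential chain of additions with a balanced tree of the same additions. The function names `chain_sum` and `tree_sum` are made up for illustration. It shows two things: the tree has a much shorter chain of dependent adds, and the result can differ because floating-point addition is not associative, which is why the compiler needs explicit permission to re-order.

```python
def chain_sum(values):
    """Add left to right: ((v0 + v1) + v2) + ...

    Returns (result, number of dependent additions in sequence).
    """
    total = values[0]
    depth = 0
    for v in values[1:]:
        total = total + v
        depth += 1
    return total, depth


def tree_sum(values):
    """Add pairwise as a balanced tree: (v0 + v1) + (v2 + v3) ...

    Returns (result, number of dependent addition levels).
    """
    if len(values) == 1:
        return values[0], 0
    mid = len(values) // 2
    left, left_depth = tree_sum(values[:mid])
    right, right_depth = tree_sum(values[mid:])
    return left + right, max(left_depth, right_depth) + 1


if __name__ == "__main__":
    # Eight inputs: the chain needs 7 dependent adds, the tree only 3 levels.
    data = [float(i) for i in range(8)]
    print(chain_sum(data), tree_sum(data))

    # Same inputs, different grouping, different rounding.
    tricky = [1e16, 1.0, -1e16, 1.0]
    print(chain_sum(tricky)[0], tree_sum(tricky)[0])
```

The second case is the reason re-ordering is opt-in: both groupings are mathematically equal, but they round differently in floating point.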
With OpenCL, stalls are the main concern because throughput is achieved by pipelining work-items. If the pipeline of your kernel is N cycles deep and there are no stalls, one work-item enters the pipeline and one exits it every cycle. The latency of the operations is not a big concern unless they stall the pipeline.
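A rough way to see why depth matters so little is the idealized cycle count below. This is a simplified model, not something the compiler reports: it assumes one work-item is issued per cycle and treats every stall as one lost cycle. The names `total_cycles` and `throughput` are hypothetical.

```python
def total_cycles(depth, work_items, stall_cycles=0):
    """Cycles for all work-items to leave an idealized pipeline.

    The first work-item needs `depth` cycles to come out; after that one
    work-item exits per cycle, and each stall cycle delays everything by one.
    """
    if work_items == 0:
        return 0
    return depth + (work_items - 1) + stall_cycles


def throughput(depth, work_items, stall_cycles=0):
    """Average work-items completed per cycle."""
    return work_items / total_cycles(depth, work_items, stall_cycles)


if __name__ == "__main__":
    # A deep pipeline with no stalls is still close to 1 work-item per cycle.
    print(throughput(depth=500, work_items=1_000_000))
    # A shallow pipeline that stalls half the time is far worse.
    print(throughput(depth=50, work_items=1_000_000, stall_cycles=1_000_000))
```

In this model, with enough work-items the depth term becomes negligible, while stall cycles reduce throughput directly.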
FPGA "instructions" are different from GPU instructions. A multiply and an add can be done in a single cycle on the FPGA, but this affects the frequency of your circuit. The compiler tries to optimize for maximum frequency, so it may or may not choose to break up the multiply and the add, depending on these optimizations. You do not need to worry about the efficiency of your multiplies and adds.
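The frequency trade-off can be sketched with a toy timing model. The delays used here (3 ns for the multiply, 2 ns for the add) are invented purely for illustration and are not measured from any device; `fmax_mhz` and `run_time_ns` are hypothetical helper names. The clock period is set by the slowest stage, so putting both operations in one stage lowers the achievable frequency, while splitting them costs one extra cycle of latency.

```python
def fmax_mhz(stage_delays_ns):
    """Maximum clock frequency: limited by the slowest pipeline stage."""
    return 1000.0 / max(stage_delays_ns)


def run_time_ns(stage_delays_ns, work_items):
    """Time to push work_items through a stall-free pipeline."""
    cycles = len(stage_delays_ns) + work_items - 1
    return cycles * max(stage_delays_ns)


if __name__ == "__main__":
    fused = [3.0 + 2.0]   # multiply and add in one cycle (assumed delays)
    split = [3.0, 2.0]    # multiply in one stage, add in the next

    print(fmax_mhz(fused), fmax_mhz(split))
    print(run_time_ns(fused, 1000), run_time_ns(split, 1000))
```

Under these assumed delays the split version finishes a large batch sooner despite being one stage deeper, which is the kind of decision the compiler makes for you.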