Altera_Forum
Honored Contributor
7 years ago
1) I am not sure exactly how Intel's compiler implements loop unrolling; my guess is that the pipeline is "widened", or as you say, replicated spatially, so that multiple iterations can be computed in parallel.
2) You are correct. I did take that into account, but I never actually divided 16384 by 128; I assumed the result would be a few thousand iterations, which should be enough to hide the pipeline latency. Now that I have done the calculation and seen that the unrolled loop runs only 128 iterations, I think it is safe to say that this is not enough to fully hide the pipeline latency, which could be why you see better performance with a higher loop trip count.
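To put rough numbers on point 2, here is a minimal Python sketch of the arithmetic. It assumes an idealized pipeline that issues one iteration per cycle once it is full, so total cycles are roughly latency + trip count - 1. The function names and the 200-cycle latency are hypothetical, chosen only for illustration; the real pipeline depth would have to come from the compiler report.

```python
def unrolled_trip_count(total_iterations: int, unroll_factor: int) -> int:
    """Iterations left after unrolling; assumes the factor divides evenly."""
    if total_iterations % unroll_factor != 0:
        raise ValueError("unroll factor must divide the iteration count")
    return total_iterations // unroll_factor


def pipeline_efficiency(trip_count: int, latency_cycles: int) -> float:
    """Share of cycles that issue a new iteration in an idealized pipeline.

    Model (an assumption, not the compiler's actual behaviour):
    total cycles = latency_cycles + trip_count - 1, i.e. the pipeline
    fills once and then accepts one iteration per cycle.
    """
    total_cycles = latency_cycles + trip_count - 1
    return trip_count / total_cycles


# Hypothetical pipeline depth in cycles, not a measured value.
ASSUMED_LATENCY = 200

trips = unrolled_trip_count(16384, 128)
print("trip count after unrolling:", trips)

# Compare the original trip count with larger ones to see the latency amortize.
for n in (trips, trips * 8, trips * 64):
    print(n, round(pipeline_efficiency(n, ASSUMED_LATENCY), 3))
```

Under this model the fill latency is paid once per loop, so its relative cost shrinks as the trip count grows, which would be consistent with the improvement seen at higher trip counts.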