Forum Discussion
If the compiler report says the loop is pipelined, then index calculation is also pipelined.
70-80% is optimal for realistic cases based on my own testing. 85-90% is for perfect cases (one 512-bit access per memory bank with interleaving disabled).
If you do not see much difference in operating frequency without profiling, then performance difference will also be minimal. The profiler does not change loop II as far as I have seen and hence, the only factor causing performance difference between enabling profiling or not should be operating frequency.
Previously you had 14x 512-bit accesses which was excessive. One bank of DDR should saturate with a total of 512 bits read/written per iteration but the memory controller/interface is far from optimal. I have experienced that oversubscribing the memory interface (by 2x and not 14x) can sometimes improve the performance a little bit, but certainly not as much as it would increase resource usage.