Altera_Forum
Honored Contributor
8 years ago
Comparing with the ARM core is probably not very conclusive, since the ARM core is extremely slow.
The most obvious way to increase performance on the FPGA would be to unroll the loop on "c". However, since you are performing a floating-point reduction, you should either fully unroll that loop, or first optimize it to achieve an initiation interval (II) of one by inferring a shift register, as outlined in "Intel® FPGA SDK for OpenCL Best Practices Guide, 1.6.1.5 Removing Loop-Carried Dependency by Inferring Shift Registers", and then unroll it to achieve best performance. You should consider reading Intel's programming and best practices guides in full, since all the basic optimization techniques are covered there.
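To make the shift-register idea concrete, here is a rough sketch in plain C of what the rewritten reduction could look like. It is not your kernel: the arrays `a` and `b`, the function names, and the depth `II_CYCLES 8` are all assumptions for illustration, and the right depth depends on the latency of the floating-point adder on your device. In the actual OpenCL kernel, the two short fixed-length loops would carry `#pragma unroll` so the compiler can infer the shift register.

```c
#include <stddef.h>

/* Assumed depth of the shift register. This is a placeholder value;
   the appropriate depth depends on the adder latency of the target FPGA. */
#define II_CYCLES 8

/* Naive reduction: every iteration needs the sum from the previous one,
   so a new iteration cannot start until the previous add has finished. */
float reduce_naive(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t c = 0; c < n; c++)
        sum += a[c] * b[c];
    return sum;
}

/* Shift-register reduction: each add reads a partial sum written
   II_CYCLES iterations earlier, which relaxes the loop-carried dependency. */
float reduce_shift_register(const float *a, const float *b, size_t n)
{
    float shift_reg[II_CYCLES + 1];

    /* Clear all partial sums (unrolled in the kernel). */
    for (size_t i = 0; i < II_CYCLES + 1; i++)
        shift_reg[i] = 0.0f;

    for (size_t c = 0; c < n; c++) {
        /* Accumulate into the tail, reading from the head. */
        shift_reg[II_CYCLES] = shift_reg[0] + a[c] * b[c];

        /* Shift every element one position toward the head
           (in the kernel: #pragma unroll on this loop). */
        for (size_t j = 0; j < II_CYCLES; j++)
            shift_reg[j] = shift_reg[j + 1];
    }

    /* Final short reduction over the II_CYCLES partial sums
       (in the kernel: #pragma unroll on this loop). */
    float sum = 0.0f;
    for (size_t i = 0; i < II_CYCLES; i++)
        sum += shift_reg[i];
    return sum;
}
```

Note that the shift-register version adds the terms in a different order than the naive loop, so for general floating-point data the two results can differ in the last bits. Once the loop on "c" reaches an II of one this way, unrolling it (with the shift register depth scaled accordingly) is what gives the real speedup.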