A mismatch between FPGA output and emulation output usually has one of two causes:
- A bug in the compiler that results in the generation of an incorrect hardware circuit (less likely)
- A race condition in global memory accesses or incorrect usage of the ivdep pragma (more likely)
It is possible that I missed some important detail in your kernel and my suggestion of adding ivdep to avoid the dependencies was incorrect. You can try removing the ivdep pragmas to see whether you get correct output (at the cost of lower performance).
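To illustrate why a wrongly placed ivdep gives correct results in emulation but wrong results on the board, here is a toy model in plain C. It is only a sketch under my own assumptions, not how the compiler actually schedules memory operations: the function names, the histogram-style loop and the `latency` parameter are all hypothetical and are not taken from your kernel. The first function has sequential semantics, which is what the emulator executes; the second lets each load miss the stores of the previous `latency` iterations, which is roughly what can happen once the compiler has been told to ignore a dependency that really exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sequential semantics: every iteration sees all earlier stores.
 * This is the behaviour you get in emulation. */
void histogram_sequential(const int *idx, size_t n, int *bins) {
    for (size_t i = 0; i < n; i++) {
        bins[idx[i]] += 1;
    }
}

/* Toy model (assumption): the store of iteration j only becomes visible to
 * loads of iteration j + latency + 1 and later, as if a real loop-carried
 * dependency had been ignored. With latency == 0 this is identical to the
 * sequential version. */
void histogram_stale_reads(const int *idx, size_t n, int *bins, size_t latency) {
    if (n == 0) {
        return;
    }
    int *pending = malloc(n * sizeof *pending);
    assert(pending != NULL);

    size_t committed = 0; /* number of stores already written back */
    for (size_t i = 0; i < n; i++) {
        /* Write back every store that is old enough to be visible now. */
        while (committed + latency < i) {
            bins[idx[committed]] = pending[committed];
            committed++;
        }
        /* Load (possibly stale) value, add one, queue the store. */
        pending[i] = bins[idx[i]] + 1;
    }
    /* Drain the stores still in flight after the last iteration. */
    while (committed < n) {
        bins[idx[committed]] = pending[committed];
        committed++;
    }
    free(pending);
}
```

If all indices are distinct, both functions produce the same bins, and ignoring the dependency is safe. If the same index is hit by nearby iterations, the second version loses updates, which is the kind of mismatch you would see on the FPGA but never in emulation.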
I wouldn't rely too much on the numbers reported by the profiler; in my experience, these numbers are not very accurate. The peak external memory bandwidth of your board is 25.6 GB/s (23.8 GiB/s); however, you should not expect to get close to that number except in near-ideal situations. You can find the math behind the calculation of the external memory bandwidth and my recommendations on how to improve external memory performance in this thread (see the second-to-last reply; usernames were lost in the migration from Altera's forum):
https://forums.intel.com/s/question/0D50P00003yyTK3SAM/global-memory-access-512-bit-width-constrain
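As a rough sketch of the arithmetic behind these numbers, the C helpers below convert GB/s to GiB/s and compute a theoretical peak from memory parameters. The function names are my own, and the example configuration in the note after the code (two banks at 1600 MT/s with a 64-bit bus each) is an assumption that happens to yield 25.6 GB/s; please check your board's datasheet for the real values.

```c
#include <assert.h>

/* Convert decimal gigabytes per second (10^9 B/s) to binary GiB/s (2^30 B/s). */
double gb_per_s_to_gib_per_s(double gb_per_s) {
    assert(gb_per_s >= 0.0);
    return gb_per_s * 1e9 / (1024.0 * 1024.0 * 1024.0);
}

/* Theoretical peak bandwidth in GB/s.
 * transfer_rate_mts: mega-transfers per second per bank (e.g. 1600)
 * bus_width_bits:    data bus width of one bank in bits (e.g. 64)
 * banks:             number of independent memory banks */
double peak_gb_per_s(double transfer_rate_mts, int bus_width_bits, int banks) {
    assert(bus_width_bits > 0 && banks > 0);
    double bytes_per_transfer = bus_width_bits / 8.0;
    return transfer_rate_mts * 1e6 * bytes_per_transfer * banks / 1e9;
}

/* Bandwidth a kernel can request, in GB/s, if it moves `bits_per_cycle`
 * bits to or from memory on every clock cycle at `fmax_mhz`. */
double kernel_demand_gb_per_s(int bits_per_cycle, double fmax_mhz) {
    assert(bits_per_cycle > 0 && fmax_mhz > 0.0);
    return (bits_per_cycle / 8.0) * fmax_mhz * 1e6 / 1e9;
}
```

For example, `peak_gb_per_s(1600.0, 64, 2)` gives 25.6 and `gb_per_s_to_gib_per_s(25.6)` gives about 23.84. Under the same assumptions, a kernel with a single 512-bit access per cycle at 240 MHz requests `kernel_demand_gb_per_s(512, 240.0)`, i.e. 15.36 GB/s, so the kernel's operating frequency and access width bound the achievable bandwidth before memory controller efficiency is even considered.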
Regarding operating frequency, it largely depends on loop-carried dependencies and area usage. OpenCL users have very little control over the kernel operating frequency, and it is difficult to give recommendations on how to improve it. You can try raising the target operating frequency from the default of 240 MHz using the -fmax switch, which forces the compiler to insert more registers into the pipeline; this can potentially improve operating frequency. However, it might result in a higher II for the loops that are the fmax bottleneck. In that case, you should focus on optimizing those loops to resolve whatever dependency is causing the bottleneck.