Thank you again for your reply.
I was finally able to fit the kernel on the hardware. Resource utilization is currently around 50%. Your suggestions were really helpful.
Unfortunately, the results of the OpenCL kernel in hardware differ from those of the emulator: the kernel returns a buffer in which some elements are correct and others are not. Do you know what could cause this behavior?
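To narrow this down, a host-side comparison along the following lines could show whether the wrong elements follow a pattern (for example a contiguous range or a fixed stride). This is only a sketch: the function name `find_mismatches` and the tolerances are hypothetical, and the two lists are made-up values standing in for the emulator and hardware output buffers.

```python
import math

def find_mismatches(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Return the indices where the hardware output differs from the emulator output."""
    if len(expected) != len(actual):
        raise ValueError("buffers must have the same length")
    return [
        i
        for i, (e, a) in enumerate(zip(expected, actual))
        if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
    ]

# Illustrative data only, not taken from the real kernel.
emulator_out = [1.0, 2.0, 3.0, 4.0]
hardware_out = [1.0, 2.0, 0.0, 4.0]
print(find_mismatches(emulator_out, hardware_out))
```

Looking at the returned indices may help tell an addressing problem from a timing or synchronization problem.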
Looking at the dynamic profiler, I see that the throughput from global memory to the FPGA is 574 MB/s and from the FPGA to global memory is 147 MB/s, whereas the "aocl diagnose" command reports around 2500 MB/s. The kernel clock frequency is also lower than I expected: I got 155 MHz, but I have reached more than 200 MHz with other kernels.
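To put these numbers in proportion, here is a small calculation that treats the "aocl diagnose" figure as a rough reference. Whether the two measurements are directly comparable is an assumption on my side, and the helper name `fraction_of_reference` is only illustrative.

```python
def fraction_of_reference(measured, reference):
    """Return measured / reference as a plain ratio."""
    return measured / reference

DIAGNOSE_MB_S = 2500.0  # approximate value reported by "aocl diagnose"

read_ratio = fraction_of_reference(574.0, DIAGNOSE_MB_S)    # global memory -> FPGA
write_ratio = fraction_of_reference(147.0, DIAGNOSE_MB_S)   # FPGA -> global memory
clock_ratio = fraction_of_reference(155.0, 200.0)           # achieved vs. 200 MHz

print(f"read:  {read_ratio:.1%} of the diagnose figure")
print(f"write: {write_ratio:.1%} of the diagnose figure")
print(f"clock: {clock_ratio:.1%} of 200 MHz")
```

So reads reach roughly a quarter of the diagnose figure and writes well under a tenth.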
Could you please let me know what I can do to improve performance, both in execution time and in global memory read/write throughput?
Thank you.