Dear all, I would like to use the OpenCL SDK on a Terasic DE5-Net to deploy my algorithm (I am using Quartus 18.1 and OpenCL SDK 18.1). I have successfully run the examples provided by Intel FPGA for OpenCL...
Thank you for your explanation and happy new year!
I am trying to understand exactly where the bottleneck is, so I enabled the ACL_PROFILE_TIMER variable to see the memory transfers.
It seems that access to global memory does not reach 100% occupancy, only 4.2%. Moreover, in the kernel execution panel there are gaps which, if I understood the OpenCL best practices guide correctly, represent the global memory access time. I have also tried 128 iterations, each transferring 64 bits from/to global memory, but I did not see any improvement. Please find attached the screenshots from the profiler.
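For reference, here is a rough sketch of the arithmetic behind that 128 x 64-bit test. The 512-bit interface width is only an assumption for illustration (I have not confirmed it for my board support package), and `transfer_summary` is just a hypothetical helper name. It shows how small the total payload is and how little of one assumed memory word a single 64-bit access would fill.

```python
# Test described above: 128 iterations, 64 bits transferred per iteration.
ITERATIONS = 128
BITS_PER_ITERATION = 64

# ASSUMPTION for illustration only: width of one global memory interface word.
ASSUMED_INTERFACE_BITS = 512


def transfer_summary(iterations, bits_per_iteration, interface_bits):
    """Return (total bytes, fraction of a word used per access, coalesced words)."""
    total_bits = iterations * bits_per_iteration
    total_bytes = total_bits // 8
    # Fraction of one interface word filled by a single 64-bit access.
    word_fraction = bits_per_iteration / interface_bits
    # Words needed if all accesses were packed back to back (ceiling division).
    coalesced_words = -(-total_bits // interface_bits)
    return total_bytes, word_fraction, coalesced_words


total_bytes, word_fraction, coalesced_words = transfer_summary(
    ITERATIONS, BITS_PER_ITERATION, ASSUMED_INTERFACE_BITS
)
print(f"total payload: {total_bytes} bytes")
print(f"fraction of an assumed word per access: {word_fraction:.3f}")
print(f"assumed words if fully packed: {coalesced_words}")
```

This says nothing about where the 4.2% comes from; it only puts numbers on the size of the test under the stated assumption.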