Thank you for the information and the calculation about global memory access. With it I can estimate the communication latency, taking into account the PCIe interface between host and device.
I will try to force the maximum frequency with the -fmax option and see whether performance improves. I asked how to improve performance because my kernel currently takes 600 us (according to the profiler), and I would like to get down to a few hundred nanoseconds of processing time, if possible.
Regarding the different output: I followed your suggestion to use the ivdep pragma, but the area usage was still too high. So I removed the dependencies instead and got down to 50% area usage just by optimizing the loops, without using the ivdep pragma. The kernel accesses global memory only to read the input and to write the output (please see the code below).
Could there still be a race condition in the global memory accesses? If this is a compiler bug, could you please tell me how I can work around it?
Reading:
#pragma unroll
for (ushort i = 0; i < 512; i++)
    data[i] = x[i];
Writing:
#pragma unroll
for (ushort i = 0; i < 512; i++)
    y[i] = data[i];
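To make the question more concrete, here is a plain C model of how I understand the two loops to behave. It is only a sketch and not my actual kernel: the float element type, the name kernel_model and the helper buffers_overlap are assumptions I made up for illustration. In this sequential model every read of x finishes before the first write to y, so the only interaction I can think of would be x and y pointing at overlapping memory, which the helper checks on the host side.

```c
#include <stddef.h>
#include <stdint.h>

#define N 512  /* element count taken from the loops above */

/* Hypothetical helper (not part of any SDK): returns 1 if the two
 * N-element buffers share at least one byte, 0 otherwise. */
static int buffers_overlap(const float *x, const float *y)
{
    uintptr_t xs = (uintptr_t)x;
    uintptr_t ys = (uintptr_t)y;
    uintptr_t bytes = (uintptr_t)(N * sizeof(float));
    return xs < ys + bytes && ys < xs + bytes;
}

/* Sequential C model of the kernel body. The element type (float)
 * is an assumption; the real kernel may use a different type. */
static void kernel_model(const float *x, float *y)
{
    float data[N];  /* stands in for the on-chip copy */

    /* Read phase: global memory -> local array. */
    for (size_t i = 0; i < N; i++)
        data[i] = x[i];

    /* (processing on data[] would happen here) */

    /* Write phase: local array -> global memory. */
    for (size_t i = 0; i < N; i++)
        y[i] = data[i];
}
```

On the host I would call buffers_overlap(x, y) before launching, to rule out aliasing between the input and output buffers as the cause of the different output.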
Thank you.