I finally found the error. My kernel computes the FFT in three stages. The first and third stages each consist of an unrolled nested loop, while the second consists of a pipelined nested loop. Converting the second stage's pipelined nested loop to an unrolled one resolved the problem. I would like to ask a question to confirm my understanding: is it possible that the third stage, with its unrolled loop, starts running before the data from the second stage are ready? That would explain the issue I had. If not, what could have caused this error?
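To make the question concrete, here is a minimal plain-C sketch of the kind of stage-to-stage dependency I mean. It is not my actual kernel: the 8-point size, the radix-2 structure, and the names `cplx`, `W8`, `cmul`, `fft_stage`, and `fft8` are assumptions, chosen only because an 8-point radix-2 FFT happens to have exactly three butterfly stages. Each stage reads the buffer written by the previous one, so the third stage is only correct if every write from the second stage has completed first.

```c
#define N 8 /* hypothetical size: log2(8) = 3 butterfly stages */

typedef struct { double re, im; } cplx;

/* Twiddle factors W8^k = exp(-2*pi*i*k/8) for k = 0..3. */
const cplx W8[4] = {
    {  1.0,                  0.0 },
    {  0.70710678118654752, -0.70710678118654752 },
    {  0.0,                 -1.0 },
    { -0.70710678118654752, -0.70710678118654752 },
};

/* Complex multiplication. */
cplx cmul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* One radix-2 butterfly stage (stage = 0, 1, 2): reads in[], writes out[]. */
void fft_stage(const cplx *in, cplx *out, int stage)
{
    int half = 1 << stage; /* distance between butterfly inputs: 1, 2, 4 */
    int span = 2 * half;   /* size of each butterfly group: 2, 4, 8 */
    int step = N / span;   /* stride into the W8 table: 4, 2, 1 */

    for (int base = 0; base < N; base += span) {
        for (int k = 0; k < half; k++) {
            cplx a = in[base + k];
            cplx b = cmul(W8[k * step], in[base + k + half]);
            out[base + k].re = a.re + b.re;
            out[base + k].im = a.im + b.im;
            out[base + k + half].re = a.re - b.re;
            out[base + k + half].im = a.im - b.im;
        }
    }
}

/* 8-point decimation-in-time FFT built from three dependent stages. */
void fft8(const cplx *x, cplx *y)
{
    cplx buf0[N], buf1[N], buf2[N];

    /* Load the input in 3-bit bit-reversed order. */
    for (int i = 0; i < N; i++) {
        int r = ((i & 1) << 2) | (i & 2) | ((i & 4) >> 2);
        buf0[r] = x[i];
    }

    fft_stage(buf0, buf1, 0); /* first stage */
    fft_stage(buf1, buf2, 1); /* second stage: must finish writing buf2 ... */
    fft_stage(buf2, y, 2);    /* third stage: ... before this reads buf2 */
}
```

In C the three calls run strictly one after the other, so the dependency is always respected. Whether the same ordering is guaranteed between my unrolled and pipelined loops on the device is exactly what I am unsure about.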
I also noticed that the execution time is around 700 us. What are good practices for reducing it? My goal is to reach 500-600 ns, or 1 us at most. Do you think that is feasible?
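To put the target in numbers, here is a small C calculation of the speedup it implies and of how many clock cycles fit in the target time. The 300 MHz clock used in the example is a hypothetical value picked only for illustration, not a measurement from my design, and the function names `required_speedup` and `cycle_budget` are made up for this sketch.

```c
/* Speedup factor needed to go from current_ns down to target_ns. */
double required_speedup(double current_ns, double target_ns)
{
    return current_ns / target_ns;
}

/* Number of clock cycles available in target_ns at a clock of clock_mhz.
 * The clock frequency is an assumption supplied by the caller. */
double cycle_budget(double target_ns, double clock_mhz)
{
    return target_ns * clock_mhz / 1000.0;
}
```

With 700 us expressed as 700000 ns, `required_speedup(700000.0, 1000.0)` is 700 and `required_speedup(700000.0, 500.0)` is 1400. At the assumed 300 MHz, `cycle_budget(1000.0, 300.0)` is 300, so the whole FFT would have to complete in a few hundred clock cycles.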
Thank you very much.