Altera_Forum
Honored Contributor
8 years ago
NDRange kernels are pipelined, but the pipelining is at the thread (work-item) level, not at the loop-iteration level. The first step to speed up an NDRange kernel is to use the SIMD attribute (num_simd_work_items, which also requires a fixed work-group size set with reqd_work_group_size). You should start from there. Furthermore, there will be some limited amount of work-group pipelining; i.e. while threads from one work-group are in flight in a compute unit, threads from the next work-group can also enter the same compute unit.
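To give a rough feel for what thread-level pipelining plus SIMD buys you, here is a back-of-the-envelope model in plain C. It is only a sketch under idealized assumptions: one group of SIMD-width work-items enters the pipeline every cycle and nothing ever stalls. The function name and the example numbers are made up for illustration and do not come from the compiler or its reports.

```c
/* Idealized cycle count for an NDRange kernel (illustrative model only).
 *
 * Assumptions, none of which are guaranteed on real hardware:
 *  - one group of `simd_width` work-items enters the pipeline per cycle;
 *  - the pipeline is `pipeline_depth` stages long;
 *  - there are no memory or local-buffer stalls.
 */
static unsigned long ideal_cycles(unsigned long work_items,
                                  unsigned long pipeline_depth,
                                  unsigned long simd_width)
{
    if (work_items == 0 || simd_width == 0)
        return 0;

    /* Number of issue slots needed: ceil(work_items / simd_width). */
    unsigned long issue_slots = (work_items + simd_width - 1) / simd_width;

    /* The first group needs `pipeline_depth` cycles to come out; every
     * further group finishes one cycle after the previous one. */
    return pipeline_depth + issue_slots - 1;
}
```

With 1024 work-items and an assumed depth of 100 stages, this model gives 1123 cycles without SIMD and 355 cycles with a SIMD width of 4. Real kernels will be slower than this whenever memory accesses stall, which is why the points below matter.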
There could be many reasons why your code is slow. I would say your local work-group size is already pretty big as it is. You should pay careful attention to the way your local memory buffers are implemented: if you have too many non-consecutive accesses to a local memory buffer, the compiler will run out of Block RAMs to implement the buffer and will instead share Block RAM ports between different accesses, which can result in stalling and low performance. You should also pay attention to global memory accesses and make sure that they are consecutive, so that they can be coalesced. If you want to know the stalling percentages, use the profiler.
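As an illustration of why consecutive global memory accesses matter, here is a small plain C sketch that counts how many memory bursts a set of adjacent work-items would touch for a given access stride. The 64-byte burst width and the function name are assumptions chosen for the example; the actual width depends on your board's memory interface.

```c
#include <stddef.h>

#define BURST_BYTES 64u  /* assumed burst width, for illustration only */

/* Count the distinct BURST_BYTES-wide bursts touched when `work_items`
 * adjacent work-items each read one float at element index gid * stride.
 * Addresses never decrease as gid grows, so a new burst is detected
 * whenever the burst index changes from the previous work-item. */
static size_t bursts_touched(size_t work_items, size_t stride)
{
    size_t count = 0;
    size_t last_burst = (size_t)-1;  /* sentinel: no burst seen yet */

    for (size_t gid = 0; gid < work_items; ++gid) {
        size_t byte_addr = gid * stride * sizeof(float);
        size_t burst = byte_addr / BURST_BYTES;
        if (burst != last_burst) {
            ++count;
            last_burst = burst;
        }
    }
    return count;
}
```

Assuming 4-byte floats, 16 work-items reading with stride 1 (index = gid) fit in a single burst, while the same 16 work-items reading with stride 16 touch 16 separate bursts. The first pattern is the one that can be coalesced into one wide access.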