Forum Discussion
I still don't have this code working, but at least I'm now stuck at a "higher level". I now have only one autorun kernel, with num_compute_units > 1. The NDRange kernels on either end, which feed it data and gather results, are singletons: the feeder distributes the appropriate slices of the incoming work items (now num_compute_units times as big) to the various channels, and the gatherer collects them at the other end. The code now works for a small number of work items, but not for many. Adding memory and channel fences hasn't helped, and both simulation and hardware builds fail without useful error messages. If I exhaust my debugging avenues or find a solution, I'll post here again.
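For readers trying to picture the layout described above, here is a minimal host-side sketch in plain C of the scatter/compute/gather indexing. It is not the poster's code and not OpenCL: the sizes, the `model_channel` type, the doubling operation, and every function name are hypothetical, and it assumes each compute unit receives one contiguous slice of the input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes for illustration; not taken from the original post. */
#define NUM_COMPUTE_UNITS 4
#define ITEMS_PER_UNIT 8
#define TOTAL_ITEMS (NUM_COMPUTE_UNITS * ITEMS_PER_UNIT)

/* A plain array stands in for one channel. Real channels are bounded,
 * blocking FIFOs; this model only checks the indexing. */
typedef struct {
    int data[ITEMS_PER_UNIT];
    size_t head; /* next read position */
    size_t tail; /* next write position */
} model_channel;

static void channel_write(model_channel *ch, int value) {
    assert(ch->tail < ITEMS_PER_UNIT); /* model capacity is one full slice */
    ch->data[ch->tail++] = value;
}

static int channel_read(model_channel *ch) {
    assert(ch->head < ch->tail); /* never read an empty channel */
    return ch->data[ch->head++];
}

/* Feeder: unit u gets the slice starting at u * ITEMS_PER_UNIT.
 * Channels are visited in turn so every unit receives item i
 * before any unit receives item i + 1. */
static void feed(const int *in, model_channel *to_units) {
    for (size_t i = 0; i < ITEMS_PER_UNIT; ++i)
        for (size_t u = 0; u < NUM_COMPUTE_UNITS; ++u)
            channel_write(&to_units[u], in[u * ITEMS_PER_UNIT + i]);
}

/* Stand-in for one autorun compute unit: a placeholder operation
 * (doubling) applied to everything on its input channel. */
static void compute_unit(model_channel *in_ch, model_channel *out_ch) {
    while (in_ch->head < in_ch->tail)
        channel_write(out_ch, 2 * channel_read(in_ch));
}

/* Gatherer: reads the channels in the same order the feeder wrote them
 * and puts each result back at its original position. */
static void gather(model_channel *from_units, int *out) {
    for (size_t i = 0; i < ITEMS_PER_UNIT; ++i)
        for (size_t u = 0; u < NUM_COMPUTE_UNITS; ++u)
            out[u * ITEMS_PER_UNIT + i] = channel_read(&from_units[u]);
}

/* Runs feeder, all compute units, and gatherer one after another. */
void run_pipeline(const int *in, int *out) {
    model_channel to_units[NUM_COMPUTE_UNITS] = {0};
    model_channel from_units[NUM_COMPUTE_UNITS] = {0};

    feed(in, to_units);
    for (size_t u = 0; u < NUM_COMPUTE_UNITS; ++u)
        compute_unit(&to_units[u], &from_units[u]);
    gather(from_units, out);
}
```

Calling `run_pipeline(in, out)` with two arrays of `TOTAL_ITEMS` ints fills `out[k]` with `2 * in[k]`. The model runs the stages sequentially with channels deep enough to hold a whole slice, so it cannot reproduce stalls. On hardware, channels have finite depth and blocking reads and writes, so the order in which the feeder and gatherer visit the channels may matter in ways this sketch does not capture.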
If you want an example of working, high-performance code with autorun kernels, you can take a look at this repository: