Forum Discussion
I still don't have this code working, but at least I'm now stuck at a "higher level". I now have only one autorun kernel with num_compute_units > 1, and the NDRange kernels on either end that feed it data and gather results are singletons that distribute the appropriate slices of the incoming work-items (now num_compute_units times as big) to the various channels, or gather them at the other end. This code works for a small number of work-items, but not for a large number. Memory and channel fences haven't helped, and both simulation and hardware builds fail without useful messages. If I exhaust my debug avenues, or if I find a solution, I'll post here again.
If you want an example of working high-performance code with autorun kernels, you can take a look at this repository:
- JSchr20, 5 years ago
Occasional Contributor
Thank you very much! I will check it out.
- JSchr20, 5 years ago
Occasional Contributor
Thank you, that did the trick! I noticed that the read and write queues were separate in your code. I had everything in one queue; I thought enqueuing was non-blocking, and I had set up my triggering events so that everything should have been able to launch and run. That must not have been the case, though; some kernel enqueue was presumably waiting on another in a way I didn't expect. I switched to two queues to separate the kernels on either side of my autorun kernel, and now the program no longer hangs once my FIFOs fill up. I still don't entirely understand why that had to happen, but all's well that ends well, I guess. Thanks again!
- HRZ, 5 years ago
Frequent Contributor
Indeed, the enqueue operations are non-blocking from the host's point of view, but each queue executes only one operation on the device at a time, which means the queued operations or kernels actually run sequentially on the device. To execute multiple kernels in parallel on one device, you need a separate queue for each such kernel.
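To illustrate why one queue hangs here, below is a small Python analogy (not OpenCL host code): each single-worker executor stands in for one in-order command queue, and a bounded `queue.Queue` stands in for the on-chip channel/FIFO. All names (`producer`, `consumer`, `read_q`, `write_q`) are illustrative, not from the original post.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Each max_workers=1 executor models one in-order OpenCL command queue:
# submitting work is non-blocking for the host, but the queue runs only
# one task on the "device" at a time.
chan = queue.Queue(maxsize=2)   # models a bounded on-chip channel/FIFO

def producer(n):
    for i in range(n):
        chan.put(i)             # blocks once the FIFO is full

def consumer(n):
    return sum(chan.get() for _ in range(n))

N = 10                          # more items than the FIFO can buffer

# Two separate "queues": the kernels run concurrently, the consumer
# drains the FIFO while the producer fills it, and both complete.
write_q = ThreadPoolExecutor(max_workers=1)
read_q = ThreadPoolExecutor(max_workers=1)
write_q.submit(producer, N)
total = read_q.submit(consumer, N).result()
print(total)

# With a single "queue", consumer(N) would not start until producer(N)
# returned; producer blocks as soon as the 2-slot FIFO fills, so the
# pair would deadlock -- the same hang described above.
```

With both tasks submitted to one single-worker executor instead, the producer blocks on the full FIFO while the consumer sits behind it in the queue, reproducing the hang once the FIFOs fill up.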