Forum Discussion

Occasional Contributor

6 years ago

OpenCL: What's best practice for replicating a task kernel in-FPGA?

I've written a fairly simple kernel that multiplies multi-limb operands. I've written two versions: an ndrange version and a task version. Both work fine, but the performance of the task version is s...

JSchr20

Occasional Contributor

6 years ago

I still don't have this code working, but at least I'm now stuck at a "higher level". I now have only one autorun kernel with num_compute_units > 1. The ndrange kernels on either end, which feed it data and gather results, are singletons: the feeder distributes the appropriate slices of the incoming work items (now num_compute_units times as big) to the various channels, and the collector gathers them at the other end. This code works for a few work items, but not for many. Memory and channel fences haven't helped. Simulation and hardware builds fail without useful messages. If I exhaust my debug avenues or find a solution, I'll post here again.

HRZ

Frequent Contributor

6 years ago

If you want an example of a working high-performance code with autorun kernels, you can take a look at this repository:

https://github.com/zohourih/Diffusion_FPGA

JSchr20
Occasional Contributor
6 years ago
Thank you very much! I will check it out.
JSchr20
Occasional Contributor
6 years ago
Thank you, that did the trick! I noticed that your read and write queues were different in your code. I had everything in one queue; I thought enqueuing was non-blocking, and I had my triggering events set up such that everything should have been able to launch and run. That must not have been the case, though, and some kernel enqueue was perhaps waiting on another in a way I didn't expect. I switched to two queues to separate the kernels on either side of my autorun kernel, and now I no longer hang once my FIFOs fill up. I still don't really get why that had to happen, but... all's well that ends well, I guess. Thanks again!
HRZ
Frequent Contributor
6 years ago
Indeed, the enqueue operations are non-blocking (from the point of view of the host), but each queue can only execute one operation on the device at a time, which means the queued operations or kernels actually execute on the device sequentially. To execute multiple kernels in parallel on one device, you need a separate queue for each such operation.
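To illustrate why a single queue hangs once the FIFOs fill up, here is a toy model in plain C. It is not OpenCL host code and is not taken from the repository linked above. The names `run_pipeline`, `pipeline_state` and `FIFO_DEPTH` are made up for illustration, and the autorun stage plus its channels are collapsed into one bounded FIFO of assumed depth 4. With one queue, the collector kernel only starts after the feeder kernel has finished, so the feeder stalls forever as soon as it has more items than the FIFO can hold. With two queues, both kernels make progress.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, no OpenCL calls. */
enum { FIFO_DEPTH = 4 };  /* assumed total depth of the channels */

typedef struct {
    int produced;  /* items the feeder kernel pushed into the FIFO */
    int consumed;  /* items the collector kernel popped from the FIFO */
} pipeline_state;

/*
 * Simulates a feeder kernel and a collector kernel joined by a bounded FIFO.
 *   concurrent == false: one queue, the collector starts only after the
 *                        feeder has pushed all of its items.
 *   concurrent == true:  two queues, both kernels run side by side.
 * Returns true if every item reaches the collector, false if nothing can
 * make progress any more (the hang). The final counters are stored in *out
 * when out is not NULL.
 */
bool run_pipeline(int total_items, bool concurrent, pipeline_state *out)
{
    int produced = 0;
    int consumed = 0;
    bool ok;

    for (;;) {
        bool progressed = false;

        /* Feeder: push one item if it has work left and the FIFO has room. */
        if (produced < total_items && produced - consumed < FIFO_DEPTH) {
            produced++;
            progressed = true;
        }

        /* Collector: runs alongside the feeder only when it has its own
         * queue; otherwise it waits until the feeder has finished. */
        bool collector_running = concurrent || produced == total_items;
        if (collector_running && consumed < produced) {
            consumed++;
            progressed = true;
        }

        if (consumed == total_items) { ok = true; break; }
        if (!progressed) { ok = false; break; }  /* FIFO full, nobody drains it */
    }

    if (out != NULL) {
        out->produced = produced;
        out->consumed = consumed;
    }
    return ok;
}
```

In this model, `run_pipeline(4, false, NULL)` completes because all four items fit in the FIFO, while `run_pipeline(5, false, NULL)` stalls with four items pushed and none collected, which matches the "works for few work items, but not for many" symptom. In real host code the equivalent fix is the one described above: create two command queues (for example with `clCreateCommandQueue`) and enqueue the feeder kernel on one and the collector kernel on the other.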
