OpenCL: What's the best practice for replicating a task kernel within a single FPGA?
I've written a fairly simple kernel that multiplies multi-limb operands, in two versions: an NDRange version and a task (single work-item) version. Both work fine, but the task version performs significantly better.
The design is quite small, and I calculated that it only uses maybe 10% of the PCIe bandwidth. I'd like, therefore, to put ~8 copies of this kernel in the FPGA instead of just one, so that I can have 8x the parallelism.
With the NDRange kernel, this appears dead easy: I just put the attribute num_compute_units(8) at the top of my kernel, and my clEnqueueNDRangeKernel call works like a charm, divvying the work up among the compute units the same way it does among different devices.
With the task kernel, however, only one of my compute units ever gets driven. Every attempt to fix this by adjusting the global or local work size, or the maximum or required work-group size, only produces weird and undesirable behavior. All the documentation suggests that querying work-item IDs or using attribute-driven SIMD vectorization will turn my kernel into an NDRange kernel, and thus a non-pipelined one.
I want my kernel pipelined, as the performance is much better. I just want, in one FPGA, eight copies that get fed 1/8th of the data each.
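To make the split concrete, here is a minimal sketch in plain C of the partitioning I have in mind. The names `slice_t` and `slice_for_copy` are hypothetical and made up for illustration; they are not part of OpenCL or any SDK. It only computes which range of the input each copy would receive, and says nothing about how the copies actually get launched, which is the part I'm asking about.

```c
#include <stddef.h>

/* Hypothetical helper type: the range of input elements handed to one kernel copy. */
typedef struct {
    size_t offset; /* index of the first element in this copy's slice */
    size_t count;  /* number of elements in this copy's slice */
} slice_t;

/* Split `total` elements across `copies` kernel copies as evenly as possible.
 * When `total` is not a multiple of `copies`, the first (total % copies)
 * copies each take one extra element. Assumes copies > 0 and
 * copy_index < copies. */
slice_t slice_for_copy(size_t total, size_t copies, size_t copy_index)
{
    size_t base  = total / copies; /* elements every copy gets */
    size_t extra = total % copies; /* leftover elements to spread out */
    slice_t s;

    s.count  = base + (copy_index < extra ? 1 : 0);
    s.offset = copy_index * base + (copy_index < extra ? copy_index : extra);
    return s;
}
```

With 8 copies, copy i would be given `slice_for_copy(n, 8, i)`, so the slices are contiguous, non-overlapping, and together cover all n elements.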
What's the right way to do this?
(I'm using versions 19.4 and 20.1.)