OpenCL: What's best practice for replicating a task kernel in-FPGA?
I've written a fairly simple kernel that multiplies multi-limb operands, in two versions: an NDRange version and a task version. Both work fine, but the performance of the task version is s...