Forum Discussion
Altera_Forum
Honored Contributor
12 years ago
The reason for the drop when using a task-based kernel is that it is a single work-item execution unit at that point (NDRange size = 1). In other words, there would be only a single __local memory, and as a task you might as well use __private memory in that case, since there are no other work-items in flight that need visibility into that memory.
In the case of a bigger work-group, you have fewer work-groups in flight in the compute unit, because each work-group contains more than one work-item. Remember that __local memory is shared by all the work-items in a work-group, so using a work-group size of 1 requires lots of work-groups to keep the compute unit filled, and as a result lots of __local memories as well. At the moment there isn't a way to trim back the number of work-groups that get scheduled into the compute unit, which would help reduce your memory footprint. I don't really consider it a bug, because the compute unit is a deep pipeline; if it didn't keep many tiny work-groups in flight like you are seeing, the performance would drop. That said, I do see the need for the ability to trade off throughput in exchange for a smaller hardware footprint, and my colleagues have had discussions about this, so we are already taking it into consideration.

Without seeing the kernel I'm not sure whether this would be a good recommendation, but using a task instead would be one way to control the hardware blowup. A task, or "single work-item execution" kernel, typically contains loops that iterate over the problem set that a normal NDRange kernel would cover using multiple work-items. So unlike NDRange kernels, where you normally flatten out the loops of an algorithm and replace them with parallel work-items, tasks typically resemble algorithms that still contain loops you would normally execute on a host processor. Tasks on Altera FPGAs are efficient if there are no loop-carried dependencies in your algorithm; if these dependencies exist, the compiler will output a message telling you the compute unit will be underutilized. If you take a look at the OpenCL optimization guide up on altera.com, you'll see examples of what I'm talking about.
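To make the contrast concrete, here's a hedged sketch (not the poster's actual kernel; the array names, element count, and coefficient are invented for illustration) of the same computation written both ways, an NDRange kernel that spreads the loop across work-items and a task kernel that keeps the loop and uses __private temporaries:

```
// NDRange version: one work-item per element.
// Launched with a global size of n, e.g. via clEnqueueNDRangeKernel.
__kernel void scale_ndrange(__global const float *in,
                            __global float *out,
                            const float coeff)
{
    int i = get_global_id(0);
    out[i] = in[i] * coeff;
}

// Task (single work-item) version: one work-item iterates over the
// whole problem set in a loop; per-iteration temporaries live in
// __private memory, so no __local storage is needed at all.
__kernel void scale_task(__global const float *in,
                         __global float *out,
                         const float coeff,
                         const int n)
{
    for (int i = 0; i < n; i++) {
        float x = in[i];      // __private temporary
        out[i] = x * coeff;
    }
}
```

The task version would be enqueued with clEnqueueTask (or an NDRange of 1); because there is only one work-item, there is only one set of state to keep in hardware, which is why it can shrink the footprint.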
If no dependencies exist, then the compiler should be able to generate a compute unit that can launch a loop iteration every clock cycle, assuming there is enough memory bandwidth available.