How to increase the number of work items in flight for an NDRange kernel?
Hi,
Since an NDRange kernel is implemented as a work-item-based pipeline on the FPGA, if I understand it correctly, the maximum number of work items in flight is determined by the complexity of the kernel (that is, its number of pipeline stages), right?
Take the following kernel for example (from the beginning of Chapter 4 of the Intel "Best Practices" guide):
__kernel void add(__global int *a,
                  __global int *b,
                  __global int *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

The compiler generates a 3-stage pipeline for it:
1) Two Load units (load a and b simultaneously)
2) One Add unit
3) One Store unit
So for this 3-stage pipeline, at most 3 work items can be in flight, no matter how many work items are specified in the host code. If we want more work items in flight, we have to add more computation or operations that will be translated into extra stages. Do I understand this correctly?
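To make the mental model behind my question concrete, here is a minimal C sketch (not anything the compiler produces). It assumes an idealized pipeline in which one work item enters per cycle and every stage takes exactly one cycle; the function names are hypothetical, and real load/store units may well behave differently. Under that assumption, the number of work items in flight is capped by the stage count:

```c
#include <assert.h>

/* Idealized model (an assumption, not the compiler's actual schedule):
 * one work item enters the pipeline per cycle and each stage takes
 * exactly one cycle, so work item i occupies the pipeline during
 * cycles i .. i + stages - 1. */
int in_flight_at_cycle(int stages, int work_items, int cycle)
{
    assert(stages > 0 && work_items >= 0 && cycle >= 0);

    int first = cycle - stages + 1;   /* oldest item still in the pipeline */
    if (first < 0)
        first = 0;

    int last = cycle;                 /* newest item that has entered */
    if (last > work_items - 1)
        last = work_items - 1;

    return (last >= first) ? (last - first + 1) : 0;
}

/* Peak occupancy over the whole run of the NDRange. */
int max_in_flight(int stages, int work_items)
{
    int max = 0;
    for (int c = 0; c < work_items + stages - 1; ++c) {
        int n = in_flight_at_cycle(stages, work_items, c);
        if (n > max)
            max = n;
    }
    return max;
}
```

In this model, max_in_flight(3, 100) gives 3, while a hypothetical 200-stage pipeline fed with 1000 work items would give 200.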
Since a deeper pipeline provides more parallelism, if my understanding above is correct, a simple kernel with few operations cannot benefit much from the NDRange implementation, no matter how many work items are specified, right?
Thanks!