--- Quote Start ---
NDRange, as soon as I add the get_global_id it will become much slower...
It did get the correct global_id when I actually use it.
__attribute__((num_compute_units(2)))
__attribute__((reqd_work_group_size(2, 1, 1)))
__kernel void pointWiseMul(__global float2* restrict d_afCorr, __global float2* restrict d_afPadScn, __global float2* restrict d_afPadTpl, int dataN, float fScale)
{
    int begin = get_global_id(0); //mark out this line and the speed change dramatically
    for (int iIndx = 0; iIndx < dataN; iIndx++)
    {
        float2 cDat = d_afPadScn[iIndx];
        float2 cKer = d_afPadTpl[iIndx];
        //take the conjugate of the kernel
        cKer.y = -cKer.y;
        float2 cMul = { cDat.x * cKer.x - cDat.y * cKer.y, cDat.y * cKer.x + cDat.x * cKer.y };
        cMul.x = fScale * cMul.x;
        cMul.y = fScale * cMul.y;
        d_afCorr[iIndx] = cMul;
    }
}
--- Quote End ---
In this specific code, whether you comment out the get_global_id(0) call or not, the output circuit will be EXACTLY the same, since the "begin" variable is never used and the compiler automatically optimizes it out anyway. I tested and confirmed this in my own environment. If you are seeing a performance difference between the two cases, the reason is probably in the host code or your timing function, not the kernel code. Run-to-run variations of up to 10 ms are normal on GPUs and even FPGAs, so you should not draw any conclusions from run times shorter than 10 ms. In fact, the kernel launch overhead by itself is a few milliseconds.
However, if you remove the loop and use the global ID in place of the loop iterator, then everything changes. That converts the kernel from a single work-item kernel to an NDRange kernel, and the two are implemented in completely different ways. The compilation report gives a pipeline latency of 192 cycles for the single work-item version of your code, versus 32 cycles for the NDRange version. In the former case the pipeline is deeper, so the latency of external memory accesses can be hidden as long as there are enough loop iterations in flight. In the latter case the pipeline is much shorter, and the runtime scheduler determines the best thread scheduling to hide the latency of memory accesses.
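For reference, a minimal sketch of that NDRange variant (the bounds guard and the removal of the loop are my assumptions about the rewrite; the global work size on the host side would be set to at least dataN, and the num_compute_units/reqd_work_group_size attributes would need to be re-tuned for NDRange execution):

```
__kernel void pointWiseMul(__global float2* restrict d_afCorr,
                           __global float2* restrict d_afPadScn,
                           __global float2* restrict d_afPadTpl,
                           int dataN, float fScale)
{
    int iIndx = get_global_id(0);      // one work-item per element
    if (iIndx < dataN)                 // guard in case the range is padded
    {
        float2 cDat = d_afPadScn[iIndx];
        float2 cKer = d_afPadTpl[iIndx];
        cKer.y = -cKer.y;              // take the conjugate of the kernel
        float2 cMul = { cDat.x * cKer.x - cDat.y * cKer.y,
                        cDat.y * cKer.x + cDat.x * cKer.y };
        d_afCorr[iIndx] = (float2)(fScale * cMul.x, fScale * cMul.y);
    }
}
```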
If you want to perform correct timing comparisons, use test cases that run for at least a few seconds.