--- Quote Start ---
NDRange, as soon as I add the get_global_id it will become much slower...
It did get the correct global_id when I actually use it.
__attribute__((num_compute_units(2)))
__attribute__((reqd_work_group_size(2, 1, 1)))
__kernel void pointWiseMul(__global float2* restrict d_afCorr, __global float2* restrict d_afPadScn, __global float2* restrict d_afPadTpl, int dataN, float fScale)
{
    int begin = get_global_id(0); //mark out this line and the speed change dramatically
    for (int iIndx = 0; iIndx < dataN; iIndx++)
    {
        float2 cDat = d_afPadScn[iIndx];
        float2 cKer = d_afPadTpl[iIndx];
        //take the conjugate of the kernel
        cKer.y = -cKer.y;
        float2 cMul = { cDat.x * cKer.x - cDat.y * cKer.y, cDat.y * cKer.x + cDat.x * cKer.y };
        cMul.x = fScale * cMul.x;
        cMul.y = fScale * cMul.y;
        d_afCorr[iIndx] = cMul;
    }
}
--- Quote End ---
In this specific code, whether you comment out the get_global_id(0) call or not, the output circuit will be EXACTLY the same, since the "begin" variable is never used and the compiler automatically optimizes it out anyway. I tested and confirmed this in my own environment. If you are seeing a performance difference between the two cases, the reason is probably in the host code or your timing function, not the kernel code. Run-to-run variations of up to 10 ms are normal on GPUs and even FPGAs, so you should not draw any conclusions from run times shorter than 10 ms. In fact, the kernel launch overhead by itself is a few milliseconds.
However, if you remove the loop and use the global ID in place of the loop iterator, then everything changes. That converts the kernel from a single work-item kernel to an NDRange kernel, and the two are implemented in completely different ways. The compilation report gives a pipeline latency of 192 cycles for the single work-item version of your code, versus 32 cycles for the NDRange version. In the former case the pipeline is deeper, so the latency of external memory accesses can be hidden as long as there are enough loop iterations in flight. In the latter case the pipeline is much shorter, and the runtime scheduler determines the best thread scheduling to hide the latency of memory accesses.
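For reference, a minimal sketch of that NDRange variant (the bounds guard and the removal of the loop are my assumptions about the rewrite; the global work size on the host side would be set to at least dataN, and the num_compute_units/reqd_work_group_size attributes would need to be re-tuned for NDRange execution):

```
__kernel void pointWiseMul(__global float2* restrict d_afCorr,
                           __global float2* restrict d_afPadScn,
                           __global float2* restrict d_afPadTpl,
                           int dataN, float fScale)
{
    int iIndx = get_global_id(0);      // one work-item per element
    if (iIndx < dataN)                 // guard in case the range is padded
    {
        float2 cDat = d_afPadScn[iIndx];
        float2 cKer = d_afPadTpl[iIndx];
        cKer.y = -cKer.y;              // take the conjugate of the kernel
        float2 cMul = { cDat.x * cKer.x - cDat.y * cKer.y,
                        cDat.y * cKer.x + cDat.x * cKer.y };
        d_afCorr[iIndx] = (float2)(fScale * cMul.x, fScale * cMul.y);
    }
}
```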
If you want to perform correct timing comparisons, use test cases that run for at least a few seconds.