Using Profiling to measure the Kernel Execution Time
Dear All,
I am using clGetEventProfilingInfo in OpenCL to measure the kernel execution time of my IFFT implementation. I measured both the QUEUED-to-END time and the START-to-END time using the code snippet below.
cl_ulong time_start = 0, time_end = 0, time_queued = 0;

status = clEnqueueTask(queue, fft_kernel, 0, NULL, &timing_event);
checkError(status, "Failed to launch kernel");
status = clFinish(queue);
checkError(status, "Failed to finish");

// Query the profiling timestamps (reported in nanoseconds)
status = clGetEventProfilingInfo(timing_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &time_start, NULL);
status = clGetEventProfilingInfo(timing_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &time_end, NULL);
status = clGetEventProfilingInfo(timing_event, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &time_queued, NULL);

total_time += time_end - time_start;    // START to END
queued_time += time_end - time_queued;  // QUEUED to END
For total_time (START to END) I get around 20 µs, while queued_time (QUEUED to END) is around 90 µs. This means the QUEUED-to-START interval incurs more latency than the kernel execution itself. Can someone explain to me what causes this delay?
I implemented this with Quartus 19.1 on a DE10_pro FPGA board.
Thank you in advance,
Justine
"CL_PROFILING_COMMAND_QUEUED" returns the time at which the command was added to the queue. There is no guarantee that the command is executed as soon as it is added to the queue since there might be some other commands already in the queue (e.g. non-blocking data transfers or clSetKernelArg() commands that were queued before the kernel launch). At the same time, even if there are no other commands in the queue, there is some queuing/launch overhead associated with all OpenCL commands, including transfer of metadata via PCI-E. General consensus is that if your kernel execution time is less than at least some hundreds of us, the command queue/launch overhead could very well become the performance bottleneck and there is probably no point in accelerating such application on a PCI-E-attached accelerator.