Unable to coalesce kernel execution to memory reads in Arria 10 FPGA using opencl 17.1 bsp

Question

Hi,
I have a benchmarking program based on the intel example for read and kernel coalesced execution: 
https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807516407.html#ran1551293458343

I know this example was added to the manual after version 17.1 but its core is just basic opencl event and kernel queue usage.
As seen in  @buffers_timeline.PNG the first execution does get coalesced (kernel execution and read buffer at the same time), but all the subsequent reads are not coalesced. The 3rd execution of the kernel is waiting for the second buffer somehow.

Any ideas?

I only need a way of getting data out. Is this a problem from version 17.1 ?  As you can see the data generation kernel "data_in" gets stalled (I supose because the "daq" kernel is slowing it down). 
@BW_occ_stall.PNG

So I coded the following example host program:
@hostcode.PNG
...

cl_event kernelEvent[100], readEvent[100];const size_t global_work_size[1] = {1};
clEnqueueNDRangeKernel(kernelQ2[0], kernel[1], 1,NULL,global_work_size,NULL, 0, NULL, NULL);const double start_time = getCurrentTimestamp();
//start daq loop. DAQ_ITERATIONS buffers in host to be filled by the deviceprintf("
 Starting DAQ %d iterations... 
", num_iterations);
clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1,NULL,global_work_size,NULL, 0, NULL, &amp;kernelEvent[0]);clEnqueueReadBuffer(readQ[0], d_output_buff[0], CL_FALSE, 0, d_buffer_num_elem*sizeof(cl_short8), h_output_buf[0], 1, &amp;kernelEvent[0], &amp;readEvent[0]);clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1,NULL,global_work_size,NULL, 0, NULL, &amp;kernelEvent[1]);clEnqueueReadBuffer(readQ[0], d_output_buff[1], CL_FALSE, 0, d_buffer_num_elem*sizeof(cl_short8), h_output_buf[1], 1, &amp;kernelEvent[1], &amp;readEvent[1]);// clFlush(readQ[0]);clFlush(kernelQ[0]);clFlush(readQ[0]);for (int i=2; i&lt;num_iterations; i++) {printf("
Iteration %d, buffer %d: 
", i, i%2);status = clSetKernelArg(kernel[0], i%2, sizeof(cl_mem), &amp;d_output_buff[i]);checkError(status, "Failed to set argument %d", i%2);clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1,NULL,global_work_size,NULL, 1, &amp;readEvent[i-2], &amp;kernelEvent[i]);// clFlush(kernelQ[0]);clEnqueueReadBuffer(readQ[0], d_output_buff[i], CL_FALSE, 0, d_buffer_num_elem*sizeof(cl_short8), h_output_buf[i], 1, &amp;kernelEvent[i], &amp;readEvent[i]);// clFlush(readQ[0]);}
clFlush(kernelQ[0]);clFlush(readQ[0]);
// Wait for all kernels to finish.clWaitForEvents(1,&amp;readEvent[num_iterations-1]);
const double end_time = getCurrentTimestamp();const double total_time = end_time - start_time;

...

hrz · Answer

Doesn't your "checkError" function happen to have a clFlush, or clFinish, or clWaitForEvents, etc. that might be serializing some of the enqueues? Other than that, your host code looks fine and the kernel and read operations should get completely overlapped.

P.S. I remember there was this limitation in Intel's runtime and BSP that it wasn't possible to do simultaneous reads and writes through PCI-E, despite PCI-E being a full-duplex medium, which was fixed in some 18.x version of the compiler, and that limitation could cause unnecessary serialization of simultaneous PIC-E reads and writes. However, I don't think that applies to your case since you are trying to overlap kernel execution with PCI-E read and there is no PCI-E write involved, and that fix requires a compatible BSP anyway, which I don't think you have or else you wouldn't be using v17.1. of the compiler.

boonbengt_altera · Answer

Hi @MAstr,

Thank you for posting in Intel community forum and hope all is well.
By any chances did you managed to recommendation mention by HRZ, please do let us know if there is any further clarification we can help with.

Best WishesBB

mastr · Answer

Thanks for the suggestion. I am not able to use the hardware at the moment, will get back to this topic once I can test the program again. Thank you! @BoonBengT_Altera

boonbengt_altera · Answer

Hi @MAstr,

Good to know your doubts has been clarified, with no further clarification on this thread, it will be transitioned to community support for further help on doubts in this thread.
Thanks for your questions and as always pleasure having you here.

Best WishesBB

Forum Discussion

Unable to coalesce kernel execution to memory reads in Arria 10 FPGA using opencl 17.1 bsp

4 Replies

Recent Discussions

Agilex 7 FPGA Starter Kit with oneAPI Toolkit flow not detected over PCIe

MCTP over PCIe VDM routing to PMCI in OFS N6000 FIM configuration and datapath clarification

HLS Compiler 24.1 error - aocl-clang.exe - dll entry point not found

Error faced while executing on Agilex FPGA board....

AI Suite System Throughput Issue