Forum Discussion

MAstr's avatar
MAstr
Icon for New Contributor rankNew Contributor
4 years ago

Unable to coalesce kernel execution to memory reads in Arria 10 FPGA using opencl 17.1 bsp

Hi,

I have a benchmarking program based on the intel example for read and kernel coalesced execution:

https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807516407.html#ran1551293458343

I know this example was added to the manual after version 17.1 but its core is just basic opencl event and kernel queue usage.

As seen in @buffers_timeline.PNG the first execution does get coalesced (kernel execution and read buffer at the same time), but all the subsequent reads are not coalesced. The 3rd execution of the kernel is waiting for the second buffer somehow.

Any ideas?

I only need a way of getting data out. Is this a problem from version 17.1 ? As you can see the data generation kernel "data_in" gets stalled (I supose because the "daq" kernel is slowing it down).

@BW_occ_stall.PNG

So I coded the following example host program:

@hostcode.PNG

...

cl_event kernelEvent[100], readEvent[100];
const size_t global_work_size[1] = {1};

clEnqueueNDRangeKernel(kernelQ2[0], kernel[1], 1,NULL,global_work_size,NULL, 0, NULL, NULL);
const double start_time = getCurrentTimestamp();

//start daq loop. DAQ_ITERATIONS buffers in host to be filled by the device
printf("\n Starting DAQ %d iterations... \n", num_iterations);

clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1,NULL,global_work_size,NULL, 0, NULL, &kernelEvent[0]);
clEnqueueReadBuffer(readQ[0], d_output_buff[0], CL_FALSE, 0, d_buffer_num_elem*sizeof(cl_short8), h_output_buf[0], 1, &kernelEvent[0], &readEvent[0]);
clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1,NULL,global_work_size,NULL, 0, NULL, &kernelEvent[1]);
clEnqueueReadBuffer(readQ[0], d_output_buff[1], CL_FALSE, 0, d_buffer_num_elem*sizeof(cl_short8), h_output_buf[1], 1, &kernelEvent[1], &readEvent[1]);
// clFlush(readQ[0]);
clFlush(kernelQ[0]);
clFlush(readQ[0]);
for (int i=2; i<num_iterations; i++) {
printf("\nIteration %d, buffer %d: \n", i, i%2);
status = clSetKernelArg(kernel[0], i%2, sizeof(cl_mem), &d_output_buff[i]);
checkError(status, "Failed to set argument %d", i%2);
clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1,NULL,global_work_size,NULL, 1, &readEvent[i-2], &kernelEvent[i]);
// clFlush(kernelQ[0]);
clEnqueueReadBuffer(readQ[0], d_output_buff[i], CL_FALSE, 0, d_buffer_num_elem*sizeof(cl_short8), h_output_buf[i], 1, &kernelEvent[i], &readEvent[i]);
// clFlush(readQ[0]);
}

clFlush(kernelQ[0]);
clFlush(readQ[0]);

// Wait for all kernels to finish.
clWaitForEvents(1,&readEvent[num_iterations-1]);

const double end_time = getCurrentTimestamp();
const double total_time = end_time - start_time;

...

4 Replies

  • HRZ's avatar
    HRZ
    Icon for Frequent Contributor rankFrequent Contributor

    Doesn't your "checkError" function happen to have a clFlush, or clFinish, or clWaitForEvents, etc. that might be serializing some of the enqueues? Other than that, your host code looks fine and the kernel and read operations should get completely overlapped.

    P.S. I remember there was this limitation in Intel's runtime and BSP that it wasn't possible to do simultaneous reads and writes through PCI-E, despite PCI-E being a full-duplex medium, which was fixed in some 18.x version of the compiler, and that limitation could cause unnecessary serialization of simultaneous PIC-E reads and writes. However, I don't think that applies to your case since you are trying to overlap kernel execution with PCI-E read and there is no PCI-E write involved, and that fix requires a compatible BSP anyway, which I don't think you have or else you wouldn't be using v17.1. of the compiler.

  • Hi @MAstr,

    Thank you for posting in Intel community forum and hope all is well.
    By any chances did you managed to recommendation mention by HRZ, please do let us know if there is any further clarification we can help with.

    Best Wishes
    BB

  • MAstr's avatar
    MAstr
    Icon for New Contributor rankNew Contributor

    Thanks for the suggestion. I am not able to use the hardware at the moment, will get back to this topic once I can test the program again. Thank you! @BoonBengT_Altera

  • Hi @MAstr,

    Good to know your doubts has been clarified, with no further clarification on this thread, it will be transitioned to community support for further help on doubts in this thread.
    Thanks for your questions and as always pleasure having you here.

    Best Wishes
    BB