Re: Unable to coalesce kernel execution to memory reads in Arria 10 FPGA using OpenCL 17.1 BSP

Thanks for the suggestion. I am not able to use the hardware at the moment; I will get back to this topic once I can test the program again. Thank you! @BoonBengT_Altera

Unable to coalesce kernel execution to memory reads in Arria 10 FPGA using OpenCL 17.1 BSP

Hi, I have a benchmarking program based on the Intel example for coalesced kernel execution and reads: https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807516407.html#ran1551293458343 I know this example was added to the manual after version 17.1, but its core is just basic OpenCL event and kernel queue usage. As seen in @buffers_timeline.PNG, the first execution does get coalesced (kernel execution and buffer read at the same time), but all subsequent reads are not coalesced. The third execution of the kernel is somehow waiting for the second buffer. Any ideas? I only need a way of getting data out. Is this a problem with version 17.1? As you can see, the data-generation kernel "data_in" gets stalled (I suppose because the "daq" kernel is slowing it down). @BW_occ_stall.PNG

So I coded the following example host program (@hostcode.PNG):

```
...
cl_event kernelEvent[100], readEvent[100];
const size_t global_work_size[1] = {1};

clEnqueueNDRangeKernel(kernelQ2[0], kernel[1], 1, NULL, global_work_size, NULL, 0, NULL, NULL);

const double start_time = getCurrentTimestamp();

// Start DAQ loop: DAQ_ITERATIONS buffers in host to be filled by the device.
printf("\n Starting DAQ %d iterations... \n", num_iterations);

clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1, NULL, global_work_size, NULL, 0, NULL, &kernelEvent[0]);
clEnqueueReadBuffer(readQ[0], d_output_buff[0], CL_FALSE, 0, d_buffer_num_elem * sizeof(cl_short8), h_output_buf[0], 1, &kernelEvent[0], &readEvent[0]);

clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1, NULL, global_work_size, NULL, 0, NULL, &kernelEvent[1]);
clEnqueueReadBuffer(readQ[0], d_output_buff[1], CL_FALSE, 0, d_buffer_num_elem * sizeof(cl_short8), h_output_buf[1], 1, &kernelEvent[1], &readEvent[1]);
// clFlush(readQ[0]);
clFlush(kernelQ[0]);
clFlush(readQ[0]);

for (int i = 2; i < num_iterations; i++) {
    printf("\nIteration %d, buffer %d: \n", i, i % 2);
    status = clSetKernelArg(kernel[0], i % 2, sizeof(cl_mem), &d_output_buff[i]);
    checkError(status, "Failed to set argument %d", i % 2);
    clEnqueueNDRangeKernel(kernelQ[0], kernel[0], 1, NULL, global_work_size, NULL, 1, &readEvent[i-2], &kernelEvent[i]);
    // clFlush(kernelQ[0]);
    clEnqueueReadBuffer(readQ[0], d_output_buff[i], CL_FALSE, 0, d_buffer_num_elem * sizeof(cl_short8), h_output_buf[i], 1, &kernelEvent[i], &readEvent[i]);
    // clFlush(readQ[0]);
}
clFlush(kernelQ[0]);
clFlush(readQ[0]);

// Wait for all kernels to finish.
clWaitForEvents(1, &readEvent[num_iterations-1]);
const double end_time = getCurrentTimestamp();
const double total_time = end_time - start_time;
...
```

Re: Synchronization issues on Implementation of Buffer Management for OpenCL Kernels

Yes. The stalling issue was already solved, and a new issue arose that I explained at https://forums.intel.com/s/question/0D50P00004IQXgeSAH/separate-queue-synchronization-and-buffer-data-corruption-on-feedforward-design-model-with-buffer-management Thanks for the support.

Separate queue synchronization and buffer data corruption on Feed-Forward Design Model with Buffer Management

I am having issues implementing the "Feed-Forward Design Model with Buffer Management". Bear in mind that this is not the first implementation but the last of many attempts.
I describe my gathered knowledge so far and appreciate any help:

1. I am using OpenCL version 17.1 on an Arria 10 platform.
2. The problem to solve is organizing data coming from a pipe into buffers (large, global-memory buffers) that are then used by other kernels or by the host.
3. The kernel writing to the pipe must never stall (or its buffer must be large enough to hold the data).

I have implemented the following ping-pong-buffer-like solution:

Kernel 1, "StreamingToPipe": streams the data to the pipe with a known pattern to be checked later.
Kernel 2, "Producer": reads the pipe from kernel 1, writes to a buffer, and sends tokens to the consumers when data is available.
Kernels 3 and 4, "ConsumerA" and "ConsumerB": when data is available, they copy the fragment of the buffer indicated by "Producer" to a host-allocated buffer.
HOST: 4 independent queues, each one executing 1 kernel. The 2 consumer queues use callbacks to gather the data and check the patterns. Consumers are enqueued first.

Both examples shown below use the same kernels but change the host code:

EXAMPLE A: uses enqueueMapBuffer calls to manage data transfers to the host.
EXAMPLE B: uses enqueueReadBuffer calls to manage data transfers to the host.

PROBLEMS AND QUESTIONS:

I have followed the guidelines and advice from the Best Practices Guide and use mem_fences at the consumers' end, which is supposed to guarantee memory consistency.

Example A achieves better throughput, but the maximum number of enqueued kernels is low (it seems that even when unmapping buffers, data is somehow still stored in the RTE, and an error is raised when resources are depleted).

In Example B, the queue for each consumer enqueues the NDRange execution and the enqueueReadBuffer alternately. However, consumers A and B end up synchronized when they should not be (higher stall rate and lower overall throughput).
The number of kernels I can enqueue with this method does not seem to saturate (good memory handling).

In BOTH examples, the data in the first 2 buffers (one for each consumer) is inconsistent (the data does not match the patterns from element 8192 onwards). The rest of the buffers are correctly checked on the HOST.

The models I tried that worked even worse are:
- Single-consumer feed-forward (more buffer inconsistencies).
- Event-synchronized queues (having no events and synchronizing by blocking channels gave better management).
- Creating a host-side buffer pool to send different buffers each time to the consumers (idea taken from the "Double Buffered Host Application Utilizing Kernel Invocation Queue" example introduced in 19.1).

Any comment on what is going on with the RTE is appreciated. The code is pretty much the same as the Intel Programming Guide example for managed buffers, but modified to use 2 consumers. Thanks.

Re: Synchronization issues on Implementation of Buffer Management for OpenCL Kernels

Self-answer; maybe it helps someone. I have observed that my consumer kernels need more time for their first execution than for successive executions. This causes synchronization issues. I wonder whether this is because they use host-allocated buffers. Any insight is appreciated.

Re: Intel opencl Dynamic profiler report

I cannot follow the whole issue, as I am quite new here. I have had more success (in regard to writing bandwidth) by declaring dependencies with #pragma ivdep safelen(numberOfIterations), vectorizing, and letting the compiler organize the pipeline, than by manually parallelizing the kernel. Hope it can help.

Re: Host-to-Intel arria streaming over PCIe

Could you provide a link to an example of this double buffering mechanism?

Synchronization issues on Implementation of Buffer Management for OpenCL Kernels

Hello: I have been trying to implement the "Implementation of Buffer Management for OpenCL Kernels" example from the Programming Guide.
With a simple approach, my goal is to copy data from a pipe to global memory and have the consumer copy part of this buffer so it can be transferred to the host. This is a kind of ping-pong buffer. The first two iterations work well; if I try to run it continuously, it stalls. Has anyone successfully implemented this example?