Forum Discussion
Is this with emulation or actual FPGA execution? On the actual FPGA, printf data is cached in Block RAMs and only dumped after kernel execution finishes (the emulator dumps it right away), so if the kernel runs into a deadlock, the printf output will never be displayed; printf is useless for debugging deadlocks on the FPGA. Other than that, I don't see any reason why your code above would deadlock unless there is some problem in your host code; e.g. one kernel is waiting for an event from the other, or there is a clFinish on the producer kernel's queue before the consumer kernel is launched. Your description matches a synchronization problem in the host code rather than in the kernels.
- hiratz, 6 years ago
Occasional Contributor
Thanks, HRZ.
I just tested this with the emulator (with the option "-emulator-channel-depth-model=strict"). My host code is very simple and no kernels are waiting for any events. The key code is shown below (note that I put Producer and Consumer into two different command queues). It is really weird.
#define MAX_CMD_QUEUES 4
#define N_KERNEL 2

void run_kernel(cl_command_queue cmd_queue[MAX_CMD_QUEUES], cl_kernel (&kernel)[N_KERNEL], size_t n_thread)
{
    size_t global_work_size[1] = {(size_t)n_thread};
    size_t local_work_size[1]  = {(size_t)n_thread};
    cl_event event_write[1];
    cl_int status;

    cl_command_queue &rcmd_queue0 = cmd_queue[0],
                     &rcmd_queue1 = cmd_queue[1];

    // Write data into buf_in from h_in
    status = clEnqueueWriteBuffer(rcmd_queue0, buf_in, CL_TRUE, 0, bufsize, h_in, 0, NULL, event_write);
    error_check(status, "Write buf_in failed!\n");

    // Launch the kernel Producer
    status = clEnqueueNDRangeKernel(rcmd_queue0, kernel[0], 1, NULL, global_work_size, local_work_size, 1, event_write, NULL);
    error_check(status, "Run kernel Producer error!\n");

    // Launch the kernel Consumer
    status = clEnqueueNDRangeKernel(rcmd_queue1, kernel[1], 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
    error_check(status, "Run kernel Consumer error!\n");

    // Read results back to h_out from buf_out
    read_back_results(rcmd_queue1, bufsize, buf_out, (char*)h_out, NULL);

    clReleaseEvent(event_write[0]);
}
- HRZ, 6 years ago
Frequent Contributor
Have you tried commenting out the error_check calls? Depending on its implementation, that function could be serializing the kernel launches. If you provide the full code for your example so that I can compile it on my own machine, I might be able to find the source of the problem.
- hiratz, 6 years ago
Occasional Contributor
Thanks! Sure, I attached the whole directory, which contains the host code, the kernel code, and some necessary scripts.
I tried commenting out the error_check calls, but the problem still exists ...
My code's directory has a Readme file that shows how to compile/run the code.
Please note: since I am working on the Intel HARP machine, I have to use some header files provided by Intel. I put them into the directory "common-fpga". You may not need them if you have your own configuration environment. I created the files in the directory "common" myself and wrote most of their code.
Currently, run.sh uses 16 work-items, which triggers this problem. If you change 16 to 8, the problem disappears.
- HRZ, 6 years ago
Frequent Contributor
I tested with both Quartus v16.1.2 and 18.1 (the latter with strict channel depth emulation). I do not get any deadlock with 8, 16, or 32 work-items (though 32 gives garbled output for the second half of the data, since your buffer size is hardcoded to 16 entries). Maybe the problem is specific to the HARP system or to the version of Quartus you are using. I have two recommendations for you:
1- Try compiling your host and kernel code against the a10_ref BSP, which ships with all newer versions of Quartus, and see if you observe the same behavior.
2- Try allocating buf_in using CL_MEM_READ_WRITE instead of CL_MEM_ALLOC_HOST_PTR to see if it makes any difference.