MultiKernel design with oneAPI

JRome28
New Contributor
5 years ago
Hi Kenny, thank you very much for your help and your quick answer. That link helps me a lot, because I had an older version of the oneAPI FPGA optimization guide and I didn't know there were updates! Thanks for sharing.
I have been reading the page you pointed me to and now I have a better idea, but I still do not know how to solve my issue. If you don't mind, I am going to extend my question a little with more pseudocode to help explain myself.
In OpenCL, my device file kernel.cl contains the following kernels. Once compiled, all of them are placed in the same bitstream file, kernel.aocx (the pseudocode also shows the OpenCL calls on the host side):
    //Inside kernel.cl
    void Kernel1(Arguments #1){
        //Some code
    }
    void Kernel2(Arguments #2){
        //Some code
    }
    void Kernel3(Arguments #3){
        //Some code
    }

    //In host side main.cpp
    for (some iterations){
        cl_write_buffers(Arguments #1);
        setKernelArguments(Arguments #1);
        clEnqueueTask(Kernel1);
        cl_readbuffers(Arguments #1);

        cl_write_buffers(Arguments #2);
        setKernelArguments(Arguments #2);
        clEnqueueTask(Kernel2);
        cl_readbuffers(Arguments #2);

        cl_write_buffers(Arguments #3);
        setKernelArguments(Arguments #3);
        clEnqueueTask(Kernel3);
        cl_readbuffers(Arguments #3);
    }
In this code, I have read the .aocx and, in each iteration, I call the different kernels placed in the FPGA. This way I can call different kernels without changing the bitstream (and without paying that delay in my execution time) and make the best use of the FPGA's resources.
I want to code the same thing in oneAPI. I have the host part (main.cpp) and one single kernel inside kernel.cpp. I call the kernel as a function from the host side as follows:
    //Inside kernel.cpp
    void run_kernel(arguments){
        queue->submit([&](handler &h){
            h.single_task<>([=]{
                //Kernel code
            });
        });
    }

    //In main.cpp
    for (some iterations){
        h.memcpy(data_device, data_host);
        run_kernel(arguments);
        h.memcpy(data_host, data_device);
    }
The problem I see here is that I do not control when the bitstream (.aocx) is placed in the FPGA, as I did in OpenCL. So I cannot be sure that, if I call a second or third kernel, the bitstream is placed only once with the different kernels distributed across the FPGA. I want to replicate my kernel as many times as will fit on the FPGA in order to parallelize and speed up my execution. I want to avoid the mistake of having the FPGA's bitstream changed whenever I call a different kernel because each kernel ended up in a different bitstream.
I suppose that what I have to do is the same as in OpenCL: replicate my kernels inside kernel.cpp and just call them from the host side:
    //Inside kernel.cpp
    void run_kernel1(arguments){
        queue->submit([&](handler &h){
            h.single_task<>([=]{
                //Kernel code
            });
        });
    }
    void run_kernel2(arguments){
        queue->submit([&](handler &h){
            h.single_task<>([=]{
                //Kernel code
            });
        });
    }
    void run_kernel3(arguments){
        queue->submit([&](handler &h){
            h.single_task<>([=]{
                //Kernel code
            });
        });
    }

    //In main.cpp
    for (some iterations){
        h.memcpy(data_device, data_host);
        run_kernel1(arguments);
        h.memcpy(data_host, data_device);

        h.memcpy(data_device, data_host);
        run_kernel2(arguments);
        h.memcpy(data_host, data_device);

        h.memcpy(data_device, data_host);
        run_kernel3(arguments);
        h.memcpy(data_host, data_device);
    }
My question is whether doing this guarantees that the three kernels are placed in the FPGA at the same time and that, if I call all three from the host, they run in parallel without the FPGA's bitstream being changed for each kernel.
Thank you very much for your feedback!
- KennyT_altera
  Super Contributor
  5 years ago
  There is some confusion here that needs to be cleared up.
  
  The memcpy function is part of the handler class, which you can find in handler.hpp source code that ships with the tool. This function isn’t documented because we recommend using buffer + accessor types in the oneAPI Programming Guide to move data between host and device. So, the first piece of advice to the user is to stop using memcpy.
  
  If the source code containing multiple kernels was built into a single binary, then there shouldn’t be any worries about each kernel being placed in a different bitstream. The fat binary that’s used to run the workload contains only one bitstream with all kernels that it was compiled with.
  
  In order to create a replicated kernel in oneAPI, it’s best to create a templated function and call it with different parameters from the main code. An example of this is attached. That’s preferable to duplicating the same kernel code. You can also pass in different buffers to each templated function call to make sure each kernel operates on different data.
  - asenjo
    New Contributor
    5 years ago
    Thanks a lot, Kenny. But regarding "An example of this is attached", I cannot see where your example is. Thanks once again.
