Hi Kenny, thank you very much for your help and your quick answer. That link helps me a lot: I had an older version of the oneAPI FPGA optimization guide and didn't know there were updates. Thanks for sharing.
I have been reading the page you pointed me to and I now have a better idea, but I still don't know how to solve my issue. If you don't mind, I am going to extend my question a little with more pseudocode to explain myself better.
In OpenCL, my device file kernel.cl contains the following kernels. Once compiled, all of them are placed in the same bitstream file, kernel.aocx (the pseudocode also shows the OpenCL calls on the host side):
// Inside kernel.cl
__kernel void Kernel1(/* Arguments #1 */) {
    // Some code
}
__kernel void Kernel2(/* Arguments #2 */) {
    // Some code
}
__kernel void Kernel3(/* Arguments #3 */) {
    // Some code
}
// On the host side, main.cpp (pseudocode; argument lists are elided)
for (/* some iterations */) {
    clEnqueueWriteBuffer(/* Arguments #1 */);
    clSetKernelArg(/* Arguments #1 */);
    clEnqueueTask(/* Kernel1 */);
    clEnqueueReadBuffer(/* Arguments #1 */);
    clEnqueueWriteBuffer(/* Arguments #2 */);
    clSetKernelArg(/* Arguments #2 */);
    clEnqueueTask(/* Kernel2 */);
    clEnqueueReadBuffer(/* Arguments #2 */);
    clEnqueueWriteBuffer(/* Arguments #3 */);
    clSetKernelArg(/* Arguments #3 */);
    clEnqueueTask(/* Kernel3 */);
    clEnqueueReadBuffer(/* Arguments #3 */);
}
In this code, I load the .aocx once and then call the different kernels on the FPGA in each iteration. This way I can call different kernels without changing the bitstream (so I don't pay the reprogramming delay in my execution time) and make the best use of the FPGA's resources.
I want to code the same thing in oneAPI. I have the host part (main.cpp) and one single kernel inside kernel.cpp. I call the kernel as a function from the host side as follows:
// Inside kernel.cpp
void run_kernel(/* arguments */) {
    queue->submit([&](sycl::handler &h) {
        h.single_task<class Kernel>([=]() {
            // Kernel code
        });
    }).wait();
}
// In main.cpp
for (/* some iterations */) {
    queue->memcpy(data_device, data_host, num_bytes).wait();
    run_kernel(/* arguments */);
    queue->memcpy(data_host, data_device, num_bytes).wait();
}
The problem I see here is that I do not control when the bitstream (the equivalent of the .aocx) is loaded onto the FPGA, as I did in OpenCL. So if I call a second or third kernel, I have no guarantee that the bitstream is loaded only once, with the different kernels laid out together on the FPGA fabric. I want to replicate my kernel as many times as will fit on the FPGA in order to parallelize and speed up my execution. What I want to avoid is the FPGA being reprogrammed every time I call a different kernel because each kernel ended up in a different bitstream.
I suppose that what I have to do, as I did with OpenCL, is replicate my kernels inside kernel.cpp and just call them from the host side:
// Inside kernel.cpp
void run_kernel1(/* arguments */) {
    queue->submit([&](sycl::handler &h) {
        h.single_task<class Kernel1>([=]() {
            // Kernel code
        });
    }).wait();
}
void run_kernel2(/* arguments */) {
    queue->submit([&](sycl::handler &h) {
        h.single_task<class Kernel2>([=]() {
            // Kernel code
        });
    }).wait();
}
void run_kernel3(/* arguments */) {
    queue->submit([&](sycl::handler &h) {
        h.single_task<class Kernel3>([=]() {
            // Kernel code
        });
    }).wait();
}
// In main.cpp
for (/* some iterations */) {
    queue->memcpy(data_device, data_host, num_bytes).wait();
    run_kernel1(/* arguments */);
    queue->memcpy(data_host, data_device, num_bytes).wait();
    queue->memcpy(data_device, data_host, num_bytes).wait();
    run_kernel2(/* arguments */);
    queue->memcpy(data_host, data_device, num_bytes).wait();
    queue->memcpy(data_device, data_host, num_bytes).wait();
    run_kernel3(/* arguments */);
    queue->memcpy(data_host, data_device, num_bytes).wait();
}
My question is whether doing this guarantees that the three kernels are placed on the FPGA at the same time and that, if I call all three from the host, they run in parallel without the FPGA's bitstream being changed for each kernel.
Thank you very much for your feedback!