Hi, I am trying to port multiple kernels onto an FPGA. The online material on Altera OpenCL compilation suggests keeping all the kernels in a single .cl file. Can anyone please tell me the steps to compile such a file? I have GPU-based OpenCL code that is smart enough to pick up a kernel's binary file if it is already present. The binary for an Altera FPGA can be generated by compiling the kernel offline. On the GPU, there is a separate binary for each kernel. With all the kernels in a single file, all we generate is a single .aocx binary file. My question is: how should I link the appropriate kernel functions to the program object that the host code creates through the OpenCL implementation? I hope the question is clear. Thanks.

Executing Multiple kernels on Altera FPGA | Altera Community

12 Replies

Altera_Forum
Honored Contributor
12 years ago
The recommendation to place all the kernels in a single file is for performance reasons. You can compile each kernel separately, but keep in mind that there is an overhead associated with swapping out kernels at run time. This limitation exists because the underlying hardware itself changes (on GPUs, only the microcode is swapped out).

To place all your kernels in a single file, just copy and paste all the sources into one .cl file and compile it. When you invoke aoc.exe, it compiles the whole .cl file, not an individual kernel, so putting all your code into a single .cl file causes the compiler to create one hardware image containing all your kernels.
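
The copy-and-paste step can also be scripted. The following is a minimal sketch, assuming each kernel currently lives in its own .cl file. The helper name `combine_kernel_sources` and the file names are hypothetical, and the script only builds the combined source. Compiling it with aoc is a separate step.

```python
import tempfile
from pathlib import Path


def combine_kernel_sources(source_paths, combined_path):
    """Concatenate several .cl files into one .cl file and return its path."""
    parts = []
    for path in source_paths:
        text = Path(path).read_text()
        # Label each section so the combined file stays readable.
        parts.append(f"// ---- from {Path(path).name} ----\n{text.rstrip()}\n")
    combined = Path(combined_path)
    combined.write_text("\n".join(parts))
    return combined


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        # Hypothetical kernel sources, written here only for the demo.
        (tmp / "a.cl").write_text(
            "__kernel void A(__global int *x) { x[get_global_id(0)] += 1; }\n")
        (tmp / "b.cl").write_text(
            "__kernel void B(__global int *x) { x[get_global_id(0)] *= 2; }\n")
        out = combine_kernel_sources([tmp / "a.cl", tmp / "b.cl"],
                                     tmp / "all_kernels.cl")
        print(out.read_text())
```

One thing to watch when merging files this way: helper functions or macros that share a name across the original files will collide once they sit in the same .cl file, so they may need renaming.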
Altera_Forum
Honored Contributor
12 years ago
OK, suppose I use a single .cl file that contains all the kernels. Can I invoke each kernel by its name? In other words, can I create a kernel object for each of the kernels?
Altera_Forum
Honored Contributor
12 years ago
Yep, that's exactly how it works. Let's say you have kernels "A" and "B" in the same .cl file, compiled into a single .aocx file. When you enqueue kernel A and then kernel B, the hardware remains programmed when the second kernel (B) is invoked, because both live in the same .aocx file. The main limitation you may face is that putting more kernels into the same .cl file uses more hardware resources, which may fill up the chip or require you to undo some optimizations you previously applied to each kernel independently (to free up room so that they all fit). The other challenge arises when multiple kernels operate concurrently in the hardware: if they all access global memory at the same time, you might run out of bandwidth. In cases like that you can sometimes combine kernels to reduce the bandwidth demand (for example, if kernel A writes to global memory and kernel B then reads that data back in, just combine the two kernels and move the data through private/local memory instead).
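
For host code that caches one binary per kernel, the change mostly affects the lookup: every kernel name now resolves to the same .aocx file, which is loaded once, and each kernel object is then created from that one program by name. In standard OpenCL the program is created with clCreateProgramWithBinary and each kernel object with clCreateKernel. The sketch below models only the lookup, in plain Python with no OpenCL calls. The `BinaryCache` class and all file names are hypothetical.

```python
class BinaryCache:
    """Maps kernel names to the binary file that contains them (hypothetical helper)."""

    def __init__(self, kernel_to_binary):
        self.kernel_to_binary = dict(kernel_to_binary)
        self.loaded = {}  # binary path -> stand-in for a loaded program object
        self.load_count = 0

    def program_for(self, kernel_name):
        """Return the (simulated) program holding kernel_name, loading it at most once."""
        path = self.kernel_to_binary[kernel_name]
        if path not in self.loaded:
            # A real host would read the file and create the program from the binary here.
            self.loaded[path] = f"program<{path}>"
            self.load_count += 1
        return self.loaded[path]


# GPU-style layout: one binary per kernel.
per_kernel = BinaryCache({"A": "A.bin", "B": "B.bin"})
# FPGA-style layout: every kernel name points at the same .aocx file.
single_image = BinaryCache({"A": "all_kernels.aocx", "B": "all_kernels.aocx"})

for name in ("A", "B", "A"):
    per_kernel.program_for(name)
    single_image.program_for(name)

print(per_kernel.load_count, single_image.load_count)  # prints: 2 1
```

With the single-image layout the program is loaded once, and asking for kernel "A" or "B" returns the same program.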
Altera_Forum
Honored Contributor
12 years ago
Hi, BadOmen,
I have a question. How long does the FPGA (say, an Altera V 5SGA7) take to reconfigure for a new kernel? How does that compare with a GPU context switch? Thanks.

Altera_Forum
Honored Contributor
12 years ago
If it's a different kernel within the same .aocx file, then the switch should be fairly quick, since the hardware that implements the kernel is already present. In that case it's not really a context switch: each kernel gets its own compute unit (or multiple units if you use the num_compute_units attribute), so there is no generic compute unit executing multiple kernels one at a time. You can also run multiple kernels concurrently within the same hardware (.aocx file).

If you mean kernel switches between multiple .aocx files, then the configuration time and buffer movement time can be significant unless you amortize them over the compute time (so if the kernel run times are short, the overhead will be significant). Because of this, choosing which kernels to combine into a single .aocx file can be important: depending on how you group them, you can reduce the overhead. When the OpenCL runtime has to swap out .aocx files, all global buffers that are live in the FPGA have to be pulled back to the host before reconfiguration and restored afterwards.
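
To see how grouping changes the overhead, it can help to count how often the loaded image changes for a given launch order. This is a rough sketch under stated assumptions: the kernel names, .aocx names, and launch order are made up, and the model counts only image swaps, not buffer movement or whether a grouping actually fits on the chip.

```python
def count_reconfigurations(launch_order, kernel_to_image):
    """Count how often the loaded image changes while running launch_order.

    The initial programming of the device is not counted, only later swaps.
    """
    swaps = 0
    current = None
    for kernel in launch_order:
        image = kernel_to_image[kernel]
        if current is not None and image != current:
            swaps += 1
        current = image
    return swaps


# Hypothetical application that launches three kernels in a loop.
launch_order = ["A", "B", "C"] * 4

separate = {"A": "a.aocx", "B": "b.aocx", "C": "c.aocx"}
a_with_b = {"A": "ab.aocx", "B": "ab.aocx", "C": "c.aocx"}
all_in_one = {"A": "abc.aocx", "B": "abc.aocx", "C": "abc.aocx"}

for label, grouping in [("separate", separate), ("A+B", a_with_b), ("all", all_in_one)]:
    print(label, count_reconfigurations(launch_order, grouping))
# prints:
# separate 11
# A+B 7
# all 0
```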
Altera_Forum
Honored Contributor
12 years ago
Thanks for your reply. I mean kernel switches between multiple .aocx files. Updating the context is much more difficult for an FPGA than for a CPU/GPU, so I want to know exactly how long the FPGA takes to swap out an .aocx file. Have you ever measured it? How can I measure it?
Altera_Forum
Honored Contributor
12 years ago
While I was asking around for the best method to measure this, I received the information you are looking for. With a Linux host, a Stratix V - A7 device takes approximately 750 ms to be reconfigured by the runtime. Note that this number does not include the time needed to move any buffers that are active in the FPGA, so whenever possible it's recommended to free any buffers allocated in the FPGA before the kernel hardware switchover occurs. Active buffers must be copied up to the host before the hardware is swapped out and restored after the hardware has been replaced. There is an overhead associated with this, but I can't give you a number for it because it depends heavily on your software implementation.

If this time is significant compared with the kernel execution time, then you should look at amortizing the cost. Let's say you have a billion data points that move between kernels "A" and "B", and you handle them a million points (work-items) at a time. Instead of calling kernel A followed by kernel B for each million points, you would call kernel A many times to finish all billion points, followed by kernel B to do the same. That way there is only one hardware swap instead of around two thousand (A --> B --> A --> B --> etc.). In situations like these I also try to combine the kernels if possible: not only do you eliminate the hardware swapping, but you often end up with a more efficient hardware implementation because the same compute unit encapsulates both kernels.
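
The arithmetic behind "around two thousand" can be checked with a short sketch. It assumes kernels A and B sit in separate .aocx files and uses the approximate 750 ms figure as the only cost. Buffer movement is left out because no number is available for it, and the function name is made up for illustration.

```python
RECONFIG_SECONDS = 0.75  # approximate Stratix V A7 figure from the post


def swap_counts(total_points, points_per_launch):
    """Return (interleaved_swaps, batched_swaps) for two kernels in separate images."""
    launches = -(-total_points // points_per_launch)  # ceiling division
    # A B A B ...: the image changes before every launch except the first.
    interleaved = 2 * launches - 1
    # All launches of A, one swap, then all launches of B.
    batched = 1
    return interleaved, batched


interleaved, batched = swap_counts(10**9, 10**6)
print(interleaved, interleaved * RECONFIG_SECONDS)  # prints: 1999 1499.25
print(batched, batched * RECONFIG_SECONDS)          # prints: 1 0.75
```

Under these assumptions the interleaved schedule spends roughly 25 minutes on reconfiguration alone, against under a second for the batched one. The batched schedule does require all of A's output to be kept until B runs.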
Altera_Forum
Honored Contributor
12 years ago
Thanks for the helpful information. I have checked the Altera handbook: there is a Core Image (containing the logic that is programmed by configuration RAM (CRAM)) and a Periphery Image (containing the general-purpose I/Os (GPIOs), I/O registers, the GCLK, QCLK, and RCLK clock networks, and logic that is implemented in hard IP, such as the Hard IP for PCI Express IP Core). So I think the .aocx file might reconfigure only the Core Image, in which case the data in device memory would be safe during the hardware switchover.
Altera_Forum
Honored Contributor
12 years ago
The SDRAM controllers are reconfigured along with the rest of the device, so the contents of the memory devices may (and probably will) become corrupted. That's why the buffers are pulled up to the host and pushed back down after the configuration cycle.
Altera_Forum
Honored Contributor
12 years ago
If the SDRAM controllers were not reconfigured, the whole system would work better. What about partial reconfiguration?
