Forum Discussion
This is an interesting question. I don't have much of a GPU computing background, but these applications don't sound well suited to HPC. If you have hundreds of kernels and only some of them may actually be used, it sounds like you essentially have a library of kernels: depending on the specific application you are running, you pick certain kernels and connect them up, much like choosing specific functions from a library (correct me if I'm wrong; if you have an example problem or application, it would help me see what you are targeting or what exactly you are trying to do). In that case the approach is generic and modular, focused more on portability than on performance. To get the most performance out of an HPC system, I would instead fine-tune and create kernels that best accelerate your specific application.
On the other hand, to answer your question directly: if you have hundreds of OpenCL kernels that may be used at run time, and only some of them will actually be needed, I would put the required kernels in one OpenCL file and leave out the rest (although even the required kernels may still not fit on the device). Another approach is to group kernels that are commonly used together so that each group fits on the device, then compile a set of binaries that together cover all of the kernels. At run time, you can then reprogram the FPGA so that the binary containing the kernel of interest is loaded. Note, however, that the overhead of reprogramming an FPGA is on the order of seconds, so if you are constantly swapping binaries it becomes very inefficient. Ideally you would target the FPGA at accelerating one specific application rather than maintaining a library of kernels to load on demand.
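To make the grouping idea concrete, here is a minimal Python sketch of a host-side dispatcher that tracks which precompiled binary is currently loaded and only "reprograms" when a requested kernel lives in a different group. All names here (`KERNEL_GROUPS`, `FpgaDispatcher`, the kernel and binary names) are hypothetical; a real OpenCL host program would load each precompiled image with `clCreateProgramWithBinary` and rebuild its kernels, which this simulation only stands in for.

```python
# Hypothetical sketch of the "group kernels into binaries" idea above.
# Each binary covers a set of kernels that are commonly used together,
# so consecutive calls within one group cost no reprogramming.
KERNEL_GROUPS = {
    "fft_binary":    {"fft_1d", "fft_2d", "bit_reverse"},
    "linalg_binary": {"gemm", "gemv", "dot"},
}

class FpgaDispatcher:
    def __init__(self, groups):
        self.groups = groups
        self.loaded = None    # which binary is currently on the device
        self.reprograms = 0   # each reprogram costs seconds on a real FPGA

    def run(self, kernel):
        # Find the binary that contains the requested kernel.
        for name, kernels in self.groups.items():
            if kernel in kernels:
                if self.loaded != name:
                    # Real code: clCreateProgramWithBinary(...) here,
                    # which triggers the seconds-long reconfiguration.
                    self.loaded = name
                    self.reprograms += 1
                return name
        raise KeyError(f"no binary contains kernel {kernel!r}")

d = FpgaDispatcher(KERNEL_GROUPS)
for k in ["fft_1d", "fft_2d", "gemm", "dot", "fft_1d"]:
    d.run(k)
print(d.reprograms)  # 3 reprograms for 5 kernel calls
```

The payoff of grouping shows up in the counter: five kernel invocations cost only three reconfigurations because consecutive calls that hit the same binary reuse the already-loaded image. If every kernel lived in its own binary, the same sequence would cost five.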