Forum Discussion
Which kernels reside on the FPGA is determined solely by which kernels you put in the ".cl" file in the first place. If you put all of your kernels in one ".cl" file, compile it, and program the FPGA with the resulting binary, then all of those kernels will physically reside on the FPGA at the same time, whether or not they are used during execution (I don't mean they are running; I mean their circuits are programmed onto the FPGA). The order of execution, the queues, or basically anything else you put in the host code does not affect which kernels physically reside on the FPGA; only the binary file that is loaded does. Remember, the host and the kernels are compiled separately, and the kernel compiler has no idea what is happening in the host code.
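To make this concrete, here is a minimal sketch of a single ".cl" file containing two kernels. The file name and the kernel names (vec_add, vec_scale) are made up for illustration and are not from any real design. Once this file is compiled and the resulting binary is loaded, both circuits should be on the FPGA, even if the host only ever enqueues one of them.

```c
// kernels.cl (hypothetical file name): both kernels below are compiled
// into the same binary, so both circuits end up on the FPGA together.

// Hypothetical kernel 1: element-wise vector addition.
__kernel void vec_add(__global const float *restrict a,
                      __global const float *restrict b,
                      __global float *restrict c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}

// Hypothetical kernel 2: scales a vector in place.
// Even if the host never enqueues this kernel, its circuit is still
// part of the binary that gets programmed onto the FPGA.
__kernel void vec_scale(__global float *restrict x, const float factor)
{
    size_t i = get_global_id(0);
    x[i] = x[i] * factor;
}
```

If you want only one of these kernels on the FPGA, the way to get that is to put it in its own ".cl" file and load that binary instead; nothing on the host side will remove the other circuit.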
Anyway, there will always be only one instance (physical circuit) of each kernel on the FPGA, unless you manually replicate the code in the same ".cl" file and use a different name for the second kernel. Even with num_compute_units you still have one "instance" of the kernel, but with multiple copies of the pipeline, which are handled automatically by the run-time scheduler (no user control).
If you have two different kernels in the same queue, I am pretty sure they will run serially (even without a clFinish in between), although I haven't tried this myself, and even though each kernel has its own separate circuit on the FPGA. This is because the OpenCL run-time has to guarantee global memory consistency between kernels in the same queue (but not between kernels in different queues). If you want to run two kernels in parallel, you have to launch them from two different queues and either use OpenCL events to synchronize them or call clFinish on both queues. I have done this and it works; it is also the standard procedure when you connect two kernels to each other via channels.
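A rough host-side sketch of the two-queue approach, using only standard OpenCL 1.x calls. The function name run_two_kernels_in_parallel and its parameters are my own invention for illustration. It assumes the context, device, and both kernels already exist and that all kernel arguments have been set. The explicit clFlush calls are an extra precaution I added so that both kernels are submitted before the host blocks on either queue.

```c
#include <stddef.h>
#include <CL/cl.h>

/* Hypothetical helper: launches two kernels from two separate in-order
 * queues so that they are allowed to run at the same time.
 * Assumes ctx, dev, kernel_a and kernel_b were created elsewhere and
 * that all kernel arguments are already set.
 * Returns CL_SUCCESS, or the first error code encountered. */
cl_int run_two_kernels_in_parallel(cl_context ctx, cl_device_id dev,
                                   cl_kernel kernel_a, cl_kernel kernel_b,
                                   size_t global_size)
{
    cl_int err;

    /* One queue per kernel: commands in different queues are not
     * ordered with respect to each other. */
    cl_command_queue q_a = clCreateCommandQueue(ctx, dev, 0, &err);
    if (err != CL_SUCCESS) return err;

    cl_command_queue q_b = clCreateCommandQueue(ctx, dev, 0, &err);
    if (err != CL_SUCCESS) {
        clReleaseCommandQueue(q_a);
        return err;
    }

    err = clEnqueueNDRangeKernel(q_a, kernel_a, 1, NULL, &global_size,
                                 NULL, 0, NULL, NULL);
    if (err == CL_SUCCESS)
        err = clEnqueueNDRangeKernel(q_b, kernel_b, 1, NULL, &global_size,
                                     NULL, 0, NULL, NULL);

    /* Submit both queues before waiting on either one; otherwise the
     * second kernel might not be issued until the first has finished. */
    clFlush(q_a);
    clFlush(q_b);

    /* Wait for both kernels to complete. */
    cl_int err_a = clFinish(q_a);
    cl_int err_b = clFinish(q_b);
    if (err == CL_SUCCESS)
        err = (err_a != CL_SUCCESS) ? err_a : err_b;

    clReleaseCommandQueue(q_a);
    clReleaseCommandQueue(q_b);
    return err;
}
```

For single work-item kernels you would pass a global size of 1 (or use clEnqueueTask instead). If the two kernels talk to each other through channels, submitting both before blocking matters even more, since each kernel may stall until the other one is running.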