Forum Discussion
Altera_Forum
8 years ago
Any type of kernel can run in parallel with another, as long as each one is invoked in a separate command queue and no event is used to forcibly serialize them. The key point is that the kernels must be in different queues, and the host must not block after each individual launch, whether by calling clFinish() or by waiting on an event. (clFlush() is fine here: it only submits the queued commands and returns without waiting for them to complete.) After all the kernels have been enqueued, you can, and probably should, wait on the event associated with each kernel invocation, or call clFinish() on every queue you have, to make sure all kernels have finished execution before reading the data back from the device.
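As a rough illustration of the launch pattern described above, here is a minimal host-side sketch in C. It assumes the OpenCL 1.x host API, single work-item kernels whose arguments are already set, and an existing context and device; the helper name run_kernels_concurrently and the MAX_KERNELS cap are made up for this example and are not part of any SDK. For NDRange kernels, clEnqueueNDRangeKernel() would replace clEnqueueTask().

```c
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

#define MAX_KERNELS 16 /* arbitrary cap chosen for this sketch */

/* Hypothetical helper: launch n kernels, each in its own in-order queue,
 * and block only after every kernel has been enqueued. */
cl_int run_kernels_concurrently(cl_context ctx, cl_device_id dev,
                                const cl_kernel *kernels, cl_uint n)
{
    cl_command_queue queues[MAX_KERNELS];
    cl_uint created = 0;
    cl_int err = CL_SUCCESS;

    if (n > MAX_KERNELS)
        return CL_INVALID_VALUE;

    /* Phase 1: enqueue everything. No clFinish() and no event wait here,
     * otherwise the kernels would run one after another. */
    for (cl_uint i = 0; i < n; ++i) {
        queues[i] = clCreateCommandQueue(ctx, dev, 0, &err);
        if (err != CL_SUCCESS)
            break;
        ++created;

        err = clEnqueueTask(queues[i], kernels[i], 0, NULL, NULL);
        if (err != CL_SUCCESS)
            break;

        /* Submit the command to the device; this returns immediately. */
        err = clFlush(queues[i]);
        if (err != CL_SUCCESS)
            break;
    }

    /* Phase 2: now wait on every queue, then release it. */
    for (cl_uint i = 0; i < created; ++i) {
        cl_int e = clFinish(queues[i]);
        if (err == CL_SUCCESS)
            err = e;
        clReleaseCommandQueue(queues[i]);
    }
    return err; /* CL_SUCCESS means all kernels finished; safe to read back */
}
```

The two-phase structure is the whole point: the first loop never blocks, so the runtime is free to overlap the kernels, and the second loop provides the single synchronization point before any read-back.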
A more efficient way to accomplish this is to use replicated autorun kernels; more details about this are available in "Intel FPGA SDK for OpenCL Programming Guide, Section 11.4". Finally, I need to emphasize that external memory bandwidth is shared between the kernels running in parallel, so you should not expect linear speed-up from using multiple parallel kernels. In fact, if a single copy of your kernel is already memory-bound on its own, you will not see any speed-up at all by replicating it.
P.S. I have done this multiple times, and it certainly works.
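To put the bandwidth caveat in numbers, here is a small back-of-the-envelope sketch in C. It is my own simplified model, not something from the Programming Guide: it assumes every copy demands the same external memory bandwidth and that the board's total bandwidth is split perfectly between copies, so it only gives an upper bound. The function name ideal_speedup and its parameters are hypothetical.

```c
#include <assert.h>

/* Idealized upper bound on the speed-up from running `copies` identical
 * kernels in parallel, when one copy alone needs `per_kernel_gbps` of
 * external memory bandwidth and the board offers `board_gbps` in total.
 * Simplifying assumption: bandwidth is the only shared resource. */
double ideal_speedup(int copies, double per_kernel_gbps, double board_gbps)
{
    assert(copies >= 1);
    assert(per_kernel_gbps > 0.0);
    assert(board_gbps >= per_kernel_gbps);

    /* How many copies the memory system can feed at full rate. */
    double bandwidth_limit = board_gbps / per_kernel_gbps;

    /* Speed-up is capped by whichever is smaller: the number of copies
     * or the bandwidth limit. */
    return (double)copies < bandwidth_limit ? (double)copies : bandwidth_limit;
}
```

With made-up numbers: on a 32 GB/s board, a kernel that needs 8 GB/s tops out at 4x whether you run four copies or eight, and a kernel that already uses the full 32 GB/s by itself stays at 1x no matter how many copies you add.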