HRZ
Frequent Contributor
6 years ago
The best way to overlap the execution of two different blocks of code in single work-item kernels is to put them into two different kernels, create two queues on the host, and enqueue the kernels concurrently. That said, the compiler is expected to implement two independent blocks of code within the same kernel in parallel anyway. May I ask why you care about the "latency" of the operations? As long as you have a fully pipelined loop with an initiation interval of 1 and your input size (loop trip count) is large enough, the latency of the loop will have a negligible effect on performance/run time.
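
To put rough numbers on the latency point, here is a minimal sketch in plain C. It assumes the usual simplified pipeline model, where the first iteration takes the full loop latency and each later iteration completes one initiation interval after the previous one. The function names and the example figures are hypothetical and only for illustration; they do not come from any compiler report.

```c
#include <assert.h>

/* Simplified pipeline model (an assumption, not a vendor formula):
 * the first iteration takes `latency` cycles, and every following
 * iteration finishes `ii` cycles after the one before it.
 *   total = latency + ii * (trip_count - 1)
 */
static unsigned long long pipeline_cycles(unsigned long long latency,
                                          unsigned long long ii,
                                          unsigned long long trip_count)
{
    assert(ii >= 1); /* an initiation interval below 1 is not meaningful */
    if (trip_count == 0) {
        return 0; /* nothing to execute */
    }
    return latency + ii * (trip_count - 1);
}

/* Fraction of the total cycle count contributed by the loop latency. */
static double latency_share(unsigned long long latency,
                            unsigned long long ii,
                            unsigned long long trip_count)
{
    unsigned long long total = pipeline_cycles(latency, ii, trip_count);
    return total == 0 ? 0.0 : (double)latency / (double)total;
}
```

Under this model, with an initiation interval of 1 and a trip count of 1,000,000, raising the loop latency from 50 to 500 cycles changes the total from 1,000,049 to 1,000,499 cycles, a difference of well under 0.1%. The latency only starts to matter when the trip count is small relative to it.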