The code will work, but it's not efficient. Essentially, you don't need to send the end flag at all. The code in the first post will do what you want efficiently if you want a producer that produces data and a consumer that accumulates it. The kernel should only be concerned with the computation it accelerates. How much data is to be processed should be decided on the host side and will be known at run time. You already know how many global work items there are for the producer, right? In other words, you know that the producer has N threads. If so, then N should also be the number of threads for the consumer. Otherwise, if the consumer has more or fewer threads than the producer, there will be too much or too little data for it, and the kernel won't terminate.
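
To make the "same N on both sides" point concrete, here is a rough sketch in plain C rather than actual OpenCL. The loops stand in for work items, and the names `producer_item`, `consumer_item`, and `run_pipeline` are made up for illustration; they are not from the code in the first post. It only shows that when the host picks N and uses it for both stages, no end flag is needed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the producer kernel body: work item `gid`
   writes exactly one value into the shared buffer. */
static void producer_item(size_t gid, int *buf) {
    buf[gid] = (int)gid * 2;
}

/* Hypothetical stand-in for the consumer kernel body: work item `gid`
   reads exactly one value and adds it to the accumulator. */
static void consumer_item(size_t gid, const int *buf, long *sum) {
    *sum += buf[gid];
}

/* Host side: n is known at run time and is used as the work-item count
   for BOTH stages, so the consumer never waits for an end flag.
   Returns the accumulated sum, or -1 if allocation fails. */
long run_pipeline(size_t n) {
    if (n == 0) return 0;

    int *buf = malloc(n * sizeof *buf);
    if (buf == NULL) return -1;

    long sum = 0;
    /* Each loop iteration plays the role of one work item. */
    for (size_t gid = 0; gid < n; ++gid) producer_item(gid, buf);
    for (size_t gid = 0; gid < n; ++gid) consumer_item(gid, buf, &sum);

    free(buf);
    return sum;
}
```

With this sketch, `run_pipeline(8)` produces the values 0, 2, ..., 14 and accumulates them to 56. In a real OpenCL program the same idea would presumably mean passing the same global work size to both kernel launches, but that part depends on how your host code is set up.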