Forum Discussion
NDRange kernels typically hide these stalls by keeping many work-items in flight to fill the bubbles in the pipeline. The first line of your pseudocode suggests that you are indexing into memory in a non-sequential (or unpredictable) order, which I think is the source of the problem you are running into. So even though the kernel scheduler will attempt to keep the pipeline full, the access pattern will most likely prevent read data from arriving fast enough to keep the pipeline busy. OpenCL aside, if a master reads from an SDRAM device in random order, you will typically see idle periods between blocks of returning read data. When SDRAM is accessed sequentially, the read data typically returns in long continuous blocks (i.e. no stalls).
Instead of trying to lengthen the pipeline (which I doubt will help, and which is not easy to do without knowing how the compiler works), maybe you can describe the size of the data being accessed by the kernel and whether the index follows any predictable pattern, and we can try to suggest a way to improve the memory accesses and fix the issue at the root of the problem. In cases like these I typically either change my algorithm to access memory in a different order, or preload a block of global memory sequentially and then access the local copy randomly (local memory can be accessed in any order without any performance degradation).
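To make the second option concrete, here is a rough sketch of the "preload sequentially, then index randomly" pattern, written as plain C so it can be compiled and checked on a host. Everything in it is an assumption for illustration: the function name gather_tiled, the TILE_SIZE value, the float data type, and above all the premise that each random index stays within the tile currently loaded. If your indices can range over the whole buffer, this will not apply as written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tile size; in a real kernel it must fit in on-chip local memory. */
#define TILE_SIZE 256

/*
 * Sketch of "preload a block sequentially, then access the copy randomly".
 * Assumption: idx[base + i] is an offset inside the current tile,
 * i.e. 0 <= idx[base + i] < len for the tile starting at base.
 */
void gather_tiled(const float *global_data, size_t n,
                  const unsigned *idx, float *out)
{
    float tile[TILE_SIZE]; /* stands in for a __local array in an OpenCL kernel */

    for (size_t base = 0; base < n; base += TILE_SIZE) {
        size_t len = (n - base < TILE_SIZE) ? (n - base) : TILE_SIZE;

        /* Step 1: read the block from "global" memory in sequential order. */
        memcpy(tile, global_data + base, len * sizeof(float));

        /* In an OpenCL kernel, a barrier(CLK_LOCAL_MEM_FENCE) would go here
         * so every work-item sees the fully loaded tile. */

        /* Step 2: the unpredictable reads now hit only the local copy. */
        for (size_t i = 0; i < len; ++i) {
            assert(idx[base + i] < len); /* index must stay inside the tile */
            out[base + i] = tile[idx[base + i]];
        }
    }
}
```

In an actual kernel the outer loop would map onto work-groups and the inner loop onto work-items, with the tile declared __local. The point of the sketch is only that global memory is touched in order, and the random indexing is confined to the local buffer.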