--- Quote Start ---
An increment operation consists of three operations: read current value, increment, and write.
If you have two work-items in the pipeline, W0 and W1, and let's say W0 is ahead in the pipeline, it is possible that when W1 does its read, W0 has not yet performed its write. Both W0 and W1 would then have read the same "current" value from memory, and they will write the same value back, so one of the two increments is lost. This is a classic scenario for a race condition and requires atomic_inc.
If we could assume that the read+increment+write sequence takes N cycles, and that two work-items are always more than N cycles apart in the pipeline, then you would not need atomic_inc, but we cannot make this assumption.
Or, if instead of doing an increment you were just writing to the local/global variable, we could say that the final update would be done by the last work-item (although the OpenCL spec does not allow this kind of speculation).
--- Quote End ---
That makes sense.
Thanks!