Forum Discussion
Thanks for the reply. When I compile for the emulator it works correctly, which is why I don't understand why it fails on the FPGA.
By 'using atomic global memory accesses', do you mean making the memory accesses within the locked region atomic? If so, you're correct that I could do this with the current code; however, I want to be able to do more in the locked region and have all of those actions be exclusive under the lock.
Hope that makes things more clear.
It seems to me that the atomic cmpxchg could be served from a cache, so that under contention the work-item waiting on the lock does not read the fresh value when the work-item holding the lock releases it. Then again, this shouldn't happen, since the lock table is marked volatile. Do you know anything about the LSUs the compiler generates for atomic functions? I can't find anything about this in the documentation.
I read your code, and your implementation of the lock looks correct to me. However, I would point out that the emulator does not emulate work-item concurrency accurately, so it cannot be reliably used as a reference for debugging kernels that deadlock on the FPGA. For such cases it is best to run the code on a CPU or GPU (as long as no FPGA-specific constructs like channels are used).
Regarding "atomic global memory accesses": I was under the impression that the OpenCL specification provides atomic global memory loads and stores, but it seems I was wrong. However, the existing functions can be used to implement such operations indirectly (e.g. using atom_xchg as an atomic store and discarding its return value).
My guess is that your kernel is not actually deadlocking; it is simply so slow that it does not finish in a reasonable amount of time, making you think it is deadlocking. The latency of a global memory read or write is 100-200 cycles. Each of the atomic functions you are using performs one such load and one such store, and this is repeated for the entire time a work-item spins in the while loop. Chances are that with a very small number of work-items and a very small input size, the kernel will finish execution. My recommendation is to put your lock table in local memory and to convert your global memory fences to local memory fences; that should significantly reduce the latency of the atomic operations. In fact, I don't think the global memory fences are needed here at all: each work-item uses only the key it itself reads from global memory, and it is guaranteed that the key is used after the global memory load has finished.
Regarding the LSU: using volatile will certainly disable the private cache, so you should not need to worry about caching. You can see the LSU type in the "System Viewer" tab of the HTML report.
P.S. Don't waste your time trying to debug this by putting printf in the kernel and running it on the FPGA; printf output from FPGA execution is only printed after the kernel finishes, so if the kernel deadlocks or is too slow, you will never see it.