Thanks HRZ,
What I actually have is a producer-consumer relationship between kernels running simultaneously. As you pointed out, I expected the store operation in the producer to have some latency, which is why I wanted to understand the best way to implement synchronization. As you said, channels alone didn't work for me either, but I have not yet tried combining channels with a memory fence.
I also found an interesting discussion on buffer management using volatile memory in the following post:
https://forums.intel.com/s/question/0D50P00003yyQkbSAE/question-regarding-buffer-management-for-aocl-kernels
I tried this and it produces much better results, but there are still some errors, and I need to debug further to be sure.
Also, can anyone explain the purpose of the write-ack LSU? It has higher latency than the burst-coalesced LSU. Does it guarantee that memory updates have been committed, at the cost of extra cycles?