I still have another question, this time about the use of constant memory.
According to the compiler's estimate, my kernel should reach a throughput of 144M work-items/sec. However, the real performance is much worse, about 1/20 of that, I think. I wonder whether this is because I use constant memory to store a read-only argument that holds a large amount of data (about 200 MB). According to the optimization guide, the constant cache is 16 KB, and if the constant data is bigger than that, I will suffer large latency. (Am I understanding this right?) If so, is my only option to use the global const qualifier instead?
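
To make the question concrete, the two declarations being compared might look roughly like the sketch below. The kernel names (lookup_constant, lookup_global), the argument names and the trivial kernel body are placeholders, not my real code. The shim at the top is also an assumption: it exists only so the snippet compiles as plain C on a host, and an OpenCL compiler ignores it.

```c
/* Host-only shim (assumption, not part of OpenCL): lets this sketch build
   as plain C. An OpenCL compiler defines __OPENCL_VERSION__ and skips it. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
#define __constant
static size_t fake_gid = 0;  /* hypothetical stand-in for the work-item id */
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
#endif

/* Variant A: the read-only table lives in the constant address space,
   so reads go through the constant cache. */
__kernel void lookup_constant(__constant const float *table,
                              __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = table[gid] * 2.0f;  /* placeholder computation */
}

/* Variant B: the same table passed as a read-only pointer to global memory. */
__kernel void lookup_global(__global const float *table,
                            __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = table[gid] * 2.0f;  /* same placeholder computation */
}
```

Both variants compute the same result; the only difference is the address space qualifier on the table argument, which is what the question is about.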