My apologies, I thought I could simplify my code. Anyway, here's my actual code:
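#pragma OPENCL EXTENSION cl_intel_channels : enable
#define Max 2046 // maximum number of values that can be stored in data_buffer

channel float data_ch; // read by another kernel

__kernel void kernel1(__global const float *restrict data, uint col, uint size, uint row) {
    __local float data_buffer[Max];
    for (uint i = 0; i < size; ++i) {
        // preload data into data_buffer so it can be reused;
        // a new dataset is loaded every time i increments
        for (uint g = 0; g < col; ++g) {
            // assumed layout: dataset i occupies col consecutive floats (col <= Max)
            data_buffer[g] = data[i * col + g];
        }
        for (uint h = 0; h < row; ++h) {
            for (uint f = 0; f < col; ++f) {
                float value = data_buffer[f]; // reuse the data stored in data_buffer
                write_channel_intel(data_ch, value); // send to another kernel to compute
            }
        }
    }
}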
Here, I want to keep the data on chip, because on-chip memory has lower access latency than global memory and the kernel reads the same data repeatedly. The preload loop should run once each time the outer (size) loop increments, and it should be independent of the row/col iterations.
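
To show the reuse I'm after, here is a rough plain-C model of the same loop nest (not the kernel itself). It assumes dataset i is stored as col consecutive floats starting at data[i * col], which is a guess about the layout, and the names model_preload, preload_stats and MAX_BUF are made up for illustration. It only counts how often global memory would be read compared with how many values would go to the channel.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUF 2046 /* same capacity as data_buffer in the kernel */

/* Hypothetical counters, for illustration only. */
typedef struct {
    size_t global_reads;   /* reads from the (slow) global array */
    size_t channel_writes; /* values that would be sent to the channel */
    float  checksum;       /* sum of all values sent, as a sanity check */
} preload_stats;

/* Plain-C model of the loop nest: preload once per i, then reuse row times.
   Assumes dataset i is stored at data[i*col] .. data[i*col + col - 1]. */
static preload_stats model_preload(const float *data, size_t col,
                                   size_t size, size_t row)
{
    float buffer[MAX_BUF]; /* stands in for the on-chip buffer */
    preload_stats s = {0, 0, 0.0f};

    assert(col <= MAX_BUF); /* one dataset must fit in the buffer */

    for (size_t i = 0; i < size; ++i) {
        /* preload: exactly one global read per element of dataset i */
        for (size_t g = 0; g < col; ++g) {
            buffer[g] = data[i * col + g];
            ++s.global_reads;
        }
        /* reuse: every value comes from the buffer, no global reads */
        for (size_t h = 0; h < row; ++h) {
            for (size_t f = 0; f < col; ++f) {
                s.checksum += buffer[f]; /* stands in for the channel write */
                ++s.channel_writes;
            }
        }
    }
    return s;
}
```

For example, with col = 3, size = 2 and row = 4 the model does 6 global reads for 24 channel writes, so each preloaded value is reused row times. That ratio is what I'm hoping to get from the on-chip buffer.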