Some of your __local buffers seem unnecessary to me. For the first kernel, I think you can remove the dependency by reordering the i and j loops, converting the weight_ocr buffer to a single scoped variable, and moving the load from external memory between the j and i loops, as follows:
for( a = 0 ; a < depth ; ++a){
    for( j = 0 ; j < col ; ++j){
        // load the weight from external memory once per j iteration
        lane_data weight_ocr = weights;
        for( i = 0 ; i < row ; ++i){
            #pragma unroll
            for( k = 0; k < LANE_NUM ; ++k){
                data_ch_vec.lane = input; //lanenum*col can pass as param port because they are constant // here use 8 dsp //lc = lane_num*col
                //printf("Lane:%d %f %f %f \n",k,data_ch_vec.lane.data,data_ch_vec.lane.data,data_ch_vec.lane.data);
            }
            //load weights
            weight_buffer = weight_ocr; //0,1,2,3,4,5,6 repeat until new filter 7,8,9,10,11,12,13
            write_channel_altera(weight_ch,weight_buffer);
            write_channel_altera(data_ch,data_ch_vec);
        }
    }
}
This removes the memory dependency. However, reordering the loops can change the order in which values are written to the channels, so it might break your function; make sure that it still works correctly before using it.
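To see why the reordering needs checking, here is a minimal host-side sketch in plain C (not OpenCL, and not your kernel). It only records the order in which (i, j) pairs are visited with each loop nesting. The function names and the linear index i * col + j are assumptions made up for this illustration, not taken from your code.

```c
#include <stddef.h>

/* Visit order with i as the outer loop and j as the inner loop.
   The linear index i * col + j is a hypothetical layout used only
   to label each (i, j) pair. Returns the number of entries written. */
size_t emit_i_outer(int row, int col, int *out) {
    size_t n = 0;
    for (int i = 0; i < row; ++i)
        for (int j = 0; j < col; ++j)
            out[n++] = i * col + j;
    return n;
}

/* Visit order with j as the outer loop and i as the inner loop,
   matching the nesting suggested in the answer above. */
size_t emit_j_outer(int row, int col, int *out) {
    size_t n = 0;
    for (int j = 0; j < col; ++j)
        for (int i = 0; i < row; ++i)
            out[n++] = i * col + j;
    return n;
}

/* Returns 1 if the two sequences match element by element, 0 otherwise. */
int same_order(const int *a, const int *b, size_t n) {
    for (size_t k = 0; k < n; ++k)
        if (a[k] != b[k])
            return 0;
    return 1;
}
```

Both nestings visit the same set of pairs, but in a different sequence whenever row and col are both greater than 1. If the kernel reading from the channels expects the old sequence, it would need to be adjusted to match.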
For the second kernel, a similar thing can be done. The conv_out buffer does not need to be a __local buffer; you can just replace it with a single scoped variable as follows:
#pragma unroll
for(unsigned char ll=0; ll<LANE_NUM; ll++){
    float conv_out = 0;
    #pragma unroll
    for(unsigned i=0; i<PIPE_DEPTH; i++){
        conv_out += accum_piped;
    }
    conv_ch_in.data = conv_out;
}
This removes the dependency in your second kernel.
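If you want to convince yourself that the scoped variable gives the same result as the buffer, here is a small host-side sketch in plain C (not OpenCL). The values of LANE_NUM and PIPE_DEPTH, the accum[ll][i] indexing, and the function names are all assumptions for illustration; adapt them to your actual kernel.

```c
#define LANE_NUM 4    /* hypothetical value, for illustration only */
#define PIPE_DEPTH 6  /* hypothetical value, for illustration only */

/* Buffer version: one accumulator slot per lane is kept in an array,
   similar in spirit to a __local conv_out buffer. */
void reduce_with_buffer(float accum[LANE_NUM][PIPE_DEPTH], float out[LANE_NUM]) {
    float conv_out[LANE_NUM];
    for (int ll = 0; ll < LANE_NUM; ++ll)
        conv_out[ll] = 0.0f;
    for (int ll = 0; ll < LANE_NUM; ++ll)
        for (int i = 0; i < PIPE_DEPTH; ++i)
            conv_out[ll] += accum[ll][i];
    for (int ll = 0; ll < LANE_NUM; ++ll)
        out[ll] = conv_out[ll];
}

/* Scoped-variable version: the accumulator lives only inside one lane
   iteration, so no array is shared between iterations. */
void reduce_with_scalar(float accum[LANE_NUM][PIPE_DEPTH], float out[LANE_NUM]) {
    for (int ll = 0; ll < LANE_NUM; ++ll) {
        float conv_out = 0.0f;
        for (int i = 0; i < PIPE_DEPTH; ++i)
            conv_out += accum[ll][i];
        out[ll] = conv_out;
    }
}
```

Both functions add the partial sums of each lane in the same order, so the outputs should match exactly. The scalar version simply has no array that carries state from one lane to the next.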