You've set up double buffering correctly on the inner loop, which should achieve II=1. But the compiler is right to say the outer loop cannot be pipelined. Overlapping two or more outer loop iterations in the pipeline would mean reading from and writing to the same half of your double buffer at the same time.
Is it bad for performance that the outer loop isn't pipelined? Maybe. The answer depends on the latency (let's say L) and iteration count (let's say N) of your inner loop. This is easy to see if you picture the occupancy of your loop pipeline as a function of time. If N >> L, your pipeline will be saturated most of the time. But if L >> N, your pipeline occupancy will never exceed N iterations in flight, so most of the pipeline stages sit idle. In the former situation, outer loop pipelining would give you only an incremental performance increase. In the latter situation, outer loop pipelining would be essential to get satisfactory performance.
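To put rough numbers on that, here is a minimal back-of-the-envelope model in C. It assumes the inner loop runs at II=1 and takes about N + L cycles to drain, and that an ideally pipelined outer loop pays the latency L only once. The function names and the parameter M (outer loop iteration count) are mine, not something from your code or the compiler reports, and real hardware will deviate from this.

```c
#include <assert.h>

/* Simplified model: a loop with n iterations, II = 1 and latency l
 * takes roughly n + l cycles to drain completely. */

/* Outer loop NOT pipelined: each of the m outer iterations waits for the
 * inner pipeline to drain before the next one starts. */
unsigned long cycles_outer_serial(unsigned long m, unsigned long n, unsigned long l)
{
    assert(m > 0 && n > 0);
    return m * (n + l);
}

/* Outer loop pipelined (ideal case): inner iterations from successive
 * outer iterations issue back to back, so the latency is paid once. */
unsigned long cycles_outer_pipelined(unsigned long m, unsigned long n, unsigned long l)
{
    assert(m > 0 && n > 0);
    return m * n + l;
}
```

With M=100, N=1000, L=10 the model gives 101000 cycles versus 100010, a gain of about 1%. With M=100, N=10, L=1000 it gives 101000 cycles versus 2000, roughly a 50x difference.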
Is there a better way to write it? Yes. I'm assuming you can refactor your code as below, though I can't be 100% sure from your snippet that this works in your situation. The key is that the OpenCL compiler can handle double- or multi-buffering for you automatically if the buffer is declared inside the outer loop. Here's a code sketch:
    for( ... ){ // The outer loop will be pipelined
        lane_data win_buffer[WIN_BUF_SIZE]; // This array will be automatically multi-buffered to support concurrent outer loop iterations
                                            // This shows up in the reports as "private copies"
        for(unsigned int win_itm_xyz = 0; win_itm_xyz < item_loop_bound; win_itm_xyz++) {
            ....
            if(win_itm_z<weight_dim3/VEC_SIZE){
                .....
                win_buffer[win_itm_z*win_size_y*win_size_x + win_itm_y*win_size_x + win_itm_x] = data_vec;
                .....
            }
        }
        for(unsigned int win_itm_xyz = 0; win_itm_xyz < item_loop_bound; win_itm_xyz++) {
            ....
            if(gp_num_x*CONV_GP_SIZE_X+gp_item_idx_x<conv_x){
                ......
                data_vec = win_buffer[output_idx_dim3*win_size_y*win_size_x + output_idx_dim2*win_size_x + (output_idx_dim1+gp_item_idx_x*stride)];
                ......
            }
            ......
        }
    }
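If you want to check the restructuring functionally before running it through the FPGA flow, the same fill-then-consume shape can be written as a plain C stand-in. This is only a sketch: sum_windows, in, out and the buffer size of 8 are hypothetical, and a CPU compiler will not pipeline anything. It just shows the scoping that matters, namely that the buffer lives inside the outer loop so nothing is carried from one outer iteration to the next.

```c
#include <stddef.h>

#define WIN_BUF_SIZE 8  /* hypothetical size, for illustration only */

/* For each window: fill a buffer scoped to the outer iteration, then
 * consume it. No buffer contents survive into the next outer iteration. */
void sum_windows(const int *in, int *out, size_t n_windows)
{
    for (size_t w = 0; w < n_windows; w++) {
        int win_buffer[WIN_BUF_SIZE];   /* private to this outer iteration */

        /* Phase 1: fill the buffer (stands in for the first inner loop) */
        for (size_t i = 0; i < WIN_BUF_SIZE; i++)
            win_buffer[i] = in[w * WIN_BUF_SIZE + i];

        /* Phase 2: consume the buffer (stands in for the second inner loop) */
        int acc = 0;
        for (size_t i = 0; i < WIN_BUF_SIZE; i++)
            acc += win_buffer[i];
        out[w] = acc;
    }
}
```

Because each outer iteration only touches its own win_buffer, the outer iterations are independent with respect to the buffer, which is the property the compiler needs in order to create private copies and overlap them.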