Sorry,
in the second code I posted, I forgot to copy the last line of the consumer kernel, which saves the result to memory (otherwise the compiler optimizes the computation away):
//...
*res=acc_o;
}
Apart from this, in my case too the compiler complains about a conditional write, but then, according to the report, the loops are unrolled, though with a lower Fmax.
I would like to avoid generating more data than needed, for the sake of code portability (e.g. the generator may be implemented by some other code). Besides, even if I try it, the problem remains (the bottleneck is still Fmax) and a new compiler warning appears: "Cannot unroll loop for.body3 in producer because channel endpoints would undergo different amounts of unrolling"