Does the local memory usage increase with banking and coalescing?
Does the number of M20K blocks increase when the compiler automatically banks the local memory or increases the bank width? For example: int sample[4][32]; int a[4][32]; int b[4][32]; #pragma unroll for(j = 0; j < 4; j++...