Reducing initiation interval, relaxing loop-carried dependency
- 5 years ago
I did some testing on your kernel, the problem is from your overly large unroll factor and all the memory ports it creates. If you set II_CYCLES to 5 (but not below that) and reduce UF to 16, you will get an II of 1. Moreover, at least with v19.4 (probably 19.0+ but I didn't test), you will get an II of 1 even with II_CYCLES set to 1 on Arria 10 because the compiler optimizes out the shift register and implements single-cycle accumulation instead. This bring me to a better solution to your problem if you are targeting Arria 10:
#define UF 16 kernel void shift_reg(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) { for (int i = 0; i < N; ++i) { for (int yy = 0; yy < M; ++yy) { for (int xx = 0; xx < M; ++xx) { int yy_curr = yy * M; int i_curr = i * O; float final_sum = 0.0f; int exit = O / UF; for (int j = 0; j < exit; j++) { float acc_i = 0.0; #pragma unroll for (int k = 0; k < UF; k++) { int rc = j * UF + k; acc_i += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)]; } final_sum += acc_i; } compute[((yy_curr) + xx)] = final_sum; } } } }This will give an II of 1 regardless of UF; though I think single-cycle accumulation is not available on Stratix 10 since the default target frequency on Stratix 10 is 480 MHz which is too high for single-cycle accumulation. Either way, you can still use the shift register method on Stratix 10, just don't use an unroll factor larger than 16. My guess is that if you compile your kernel with varying values of UF from 1 to 32, performance will probably go up until 4 or 8, but it will start going down after that.
P.S. You should also consider the operating frequency when doing performance comparison between different kernels.