Reducing initiation interval, relaxing loop-carried dependency
Hi, I am trying to implement and optimize a multiply-add accumulate as a single work-item kernel. The access pattern is not sequential for one of the reads, so unrolling creates a separate LSU (each with its own private cache) per unrolled copy, one for every unit of the unroll factor. This improves performance but obviously blows up area, which isn't ideal. Here is the simplified code that still compiles and shows the issues I'm running into:
kernel void base(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                #pragma unroll 32
                for (int rc = 0; rc < O; ++rc) {
                    compute[((yy_curr) + xx)] += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)];
                }
            }
        }
    }
}
kernel void single_var(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float tmp = 0.0f;
                #pragma unroll 32
                for (int rc = 0; rc < O; ++rc) {
                    tmp += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)];
                }
                compute[((yy_curr) + xx)] = tmp;
            }
        }
    }
}
kernel void separate_sums(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float p_sums[32];
                #pragma unroll
                for (int jj = 0; jj < 32; ++jj) {
                    p_sums[jj] = 0.0f;
                }
                #pragma unroll 32
                for (int rc = 0; rc < O; ++rc) {
                    p_sums[rc % 32] += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)];
                }
                float tmp = 0.0f;
                #pragma unroll
                for (int jj = 0; jj < 32; ++jj) {
                    tmp += p_sums[jj];
                }
                compute[((yy_curr) + xx)] = tmp;
            }
        }
    }
}
#define II_CYCLES 5
#define UF 32
kernel void shift_reg(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float shift_reg[II_CYCLES + 1];
                #pragma unroll
                for (unsigned j = 0; j < II_CYCLES + 1; j++)
                    shift_reg[j] = 0.0f;
                for (int rc = 0; rc < O; rc += UF) {
                    float acc_i = 0.0f;
                    #pragma unroll
                    for (int rcrc = rc; rcrc < rc + UF; rcrc++) {
                        acc_i += in[((((rcrc * M) + yy) * M) + xx)] * w[((i_curr) + rcrc)];
                    }
                    shift_reg[II_CYCLES] = shift_reg[0] + acc_i;
                    #pragma unroll
                    for (unsigned j = 0; j < II_CYCLES; ++j)
                        shift_reg[j] = shift_reg[j+1];
                }
                float sum = 0.0f;
                #pragma unroll
                for (int jj = 0; jj < II_CYCLES; ++jj) {
                    sum += shift_reg[jj];
                }
                compute[((yy_curr) + xx)] = sum;
            }
        }
    }
}
kernel void shift_reg_no_unroll(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float shift_reg[II_CYCLES + 1];
                #pragma unroll
                for (unsigned j = 0; j < II_CYCLES + 1; j++)
                    shift_reg[j] = 0.0f;
                for (int rc = 0; rc < O; ++rc) {
                    shift_reg[II_CYCLES] = shift_reg[0] + (in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)]);
                    #pragma unroll
                    for (unsigned j = 0; j < II_CYCLES; ++j)
                        shift_reg[j] = shift_reg[j+1];
                }
                float sum = 0.0f;
                #pragma unroll
                for (int jj = 0; jj < II_CYCLES; ++jj) {
                    sum += shift_reg[jj];
                }
                compute[((yy_curr) + xx)] = sum;
            }
        }
    }
}
base and single_var compile with an II of 161 on their unrolled inner loops: 5 (the latency in cycles of the hardened floating-point add) * 32 (the unroll factor) + 1, due to the unrolling. In hopes of bringing it down, I declared a separate accumulator for each unrolled lane in separate_sums, which has a significant impact on performance and brings the II down to 5.
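To convince myself the partial-sum trick computes the same result, I also modeled it host-side in plain C (a sketch of the idea, not the kernel itself; with small integer-valued floats both variants are exact, so they agree bit-for-bit):

```c
#define K 32  /* number of independent accumulation chains, matching the unroll factor */

/* One accumulator: every add waits on the previous one (the loop-carried dependency). */
float sum_single(const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* K partial sums: K adds can be in flight at once, as in separate_sums. */
float sum_partial(const float *x, int n) {
    float p[K] = {0.0f};
    for (int i = 0; i < n; ++i)
        p[i % K] += x[i];       /* each lane carries its own dependency */
    float acc = 0.0f;
    for (int j = 0; j < K; ++j)
        acc += p[j];            /* short final reduction */
    return acc;
}
```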
I want to bring this down further, and the manual suggests inferring shift registers to bring the II to 1. shift_reg is my attempt at preserving the unrolling while adding the shift register, by creating another nested loop. Unfortunately, that loop is still scheduled with an II of 5, with "Undetermined reason" shown in the Loop Analysis section of the reports. If I drop the unrolling but keep the loop structure as is, the shift register does bring the II down to 1 (shift_reg_no_unroll), but this ends up performing worse than the unrolled version.
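The shift-register pattern itself is easy to sanity-check host-side. This plain-C model (a sketch, not the kernel) maintains II_CYCLES interleaved accumulators, so each new value lands II_CYCLES slots away from the value it depends on, which is what gives the hardware the slack to reach II = 1:

```c
#define II_CYCLES 5  /* latency of the floating-point add being hidden */

float sum_shift_reg(const float *x, int n) {
    float shift_reg[II_CYCLES + 1];
    for (int j = 0; j < II_CYCLES + 1; ++j)
        shift_reg[j] = 0.0f;
    for (int i = 0; i < n; ++i) {
        /* new value depends only on a result produced II_CYCLES iterations ago */
        shift_reg[II_CYCLES] = shift_reg[0] + x[i];
        for (int j = 0; j < II_CYCLES; ++j)
            shift_reg[j] = shift_reg[j + 1];
    }
    float sum = 0.0f;
    for (int j = 0; j < II_CYCLES; ++j)
        sum += shift_reg[j];    /* collapse the II_CYCLES partial sums */
    return sum;
}
```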
Any suggestions to improve the performance here? My guess is that the compiler does not like the non-unit loop increment, and the kernel also suffers from the non-sequential reads. Or perhaps this is the wrong approach altogether?
I did some testing on your kernel; the problem comes from your overly large unroll factor and all the memory ports it creates. If you set II_CYCLES to 5 (but not below that) and reduce UF to 16, you will get an II of 1. Moreover, at least with v19.4 (probably 19.0+, but I didn't test), you will get an II of 1 even with II_CYCLES set to 1 on Arria 10, because the compiler optimizes out the shift register and implements single-cycle accumulation instead. This brings me to a better solution to your problem if you are targeting Arria 10:
#define UF 16
kernel void shift_reg(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float final_sum = 0.0f;
                int exit = O / UF;
                for (int j = 0; j < exit; j++) {
                    float acc_i = 0.0f;
                    #pragma unroll
                    for (int k = 0; k < UF; k++) {
                        int rc = j * UF + k;
                        acc_i += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)];
                    }
                    final_sum += acc_i;
                }
                compute[((yy_curr) + xx)] = final_sum;
            }
        }
    }
}
This will give an II of 1 regardless of UF, though I think single-cycle accumulation is not available on Stratix 10, since the default target frequency there is 480 MHz, which is too high for single-cycle accumulation. Either way, you can still use the shift-register method on Stratix 10; just don't use an unroll factor larger than 16. My guess is that if you compile your kernel with varying values of UF from 1 to 32, performance will probably go up until 4 or 8, but start going down after that.
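For reference, here is a host-side C model of that blocked accumulation (a sketch, not the kernel; note it assumes O is a multiple of UF, since `O / UF` silently drops any remainder elements otherwise):

```c
#define UF 16  /* block width; the inner loop is fully unrolled in hardware */

float sum_blocked(const float *x, int n) {
    float final_sum = 0.0f;
    int exit = n / UF;              /* number of full UF-wide blocks; remainder dropped */
    for (int j = 0; j < exit; ++j) {
        float acc = 0.0f;           /* reduced as an adder tree when unrolled */
        for (int k = 0; k < UF; ++k)
            acc += x[j * UF + k];
        final_sum += acc;           /* only one loop-carried add per UF inputs */
    }
    return final_sum;
}
```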
P.S. You should also consider the operating frequency when doing performance comparison between different kernels.