Forum Discussion

SChun19
New Contributor
5 years ago
Solved

Reducing initiation interval, relaxing loop-carried dependency

Hi, I am trying to implement and optimize a multiply-add accumulate as a single work-item kernel. The access pattern is not sequential for one of the reads, so unrolling creates a separate LSU for each unrolled lane and a private cache for each unit. This improves performance, but obviously blows up area, which isn't ideal. I'll post simplified code that still compiles and shows the issues I'm hitting:

kernel void base(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
  for (int i = 0; i < N; ++i) {
    for (int yy = 0; yy < M; ++yy) {
      for (int xx = 0; xx < M; ++xx) {
        int yy_curr = yy * M;
        int i_curr = i * O;
 
        #pragma unroll 32
        for (int rc = 0; rc < O; ++rc) {
          compute[((yy_curr) + xx)] += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)];
        }
      }
    }
  }
}
 
kernel void single_var(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
  for (int i = 0; i < N; ++i) {
    for (int yy = 0; yy < M; ++yy) {
      for (int xx = 0; xx < M; ++xx) {
        int yy_curr = yy * M;
        int i_curr = i * O;
        float tmp = 0.0;
 
        #pragma unroll 32
        for (int rc = 0; rc < O; ++rc) {
          tmp += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)];
        }
        compute[((yy_curr) + xx)] = tmp;
      }
    }
  }
}
 
kernel void separate_sums(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
  for (int i = 0; i < N; ++i) {
    for (int yy = 0; yy < M; ++yy) {
      for (int xx = 0; xx < M; ++xx) {
        int yy_curr = yy * M;
        int i_curr = i * O;
 
        float p_sums[32];
        #pragma unroll
        for (int jj = 0; jj < 32; ++jj) {
            p_sums[jj] = 0.0;
        }
 
        #pragma unroll 32
        for (int rc = 0; rc < O; ++rc) {
          p_sums[rc % 32] += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)];
        }
 
        float tmp = 0.0;
        #pragma unroll
        for (int jj = 0; jj < 32; ++jj) {
           tmp += p_sums[jj];
        }
 
        compute[((yy_curr) + xx)] = tmp;
      }
    }
  }
}
 
#define II_CYCLES 5
#define UF 32
kernel void shift_reg(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
  for (int i = 0; i < N; ++i) {
    for (int yy = 0; yy < M; ++yy) {
      for (int xx = 0; xx < M; ++xx) {
        int yy_curr = yy * M;
        int i_curr = i * O;
        
        float shift_reg[II_CYCLES + 1];
        #pragma unroll
        for (unsigned j = 0; j < II_CYCLES + 1; j++)
            shift_reg[j] = 0.0f;
 
        for (int rc = 0; rc < O; rc += UF) {
            float acc_i = 0.0;
 
            #pragma unroll 
            for (int rcrc = rc; rcrc < rc + UF; rcrc++) {
                acc_i += in[((((rcrc * M) + yy) * M) + xx)] * w[((i_curr) + rcrc)];
            }
 
            shift_reg[II_CYCLES] = shift_reg[0] + acc_i;
            #pragma unroll
            for (unsigned j = 0; j < II_CYCLES; ++j) 
                shift_reg[j] = shift_reg[j+1];
        }
 
        float sum = 0.0;
        #pragma unroll
        for (int jj = 0; jj < II_CYCLES; ++jj) {
           sum += shift_reg[jj];
        }
 
        compute[((yy_curr) + xx)] = sum;
      }
    }
  }
}
 
kernel void shift_reg_no_unroll(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
  for (int i = 0; i < N; ++i) {
    for (int yy = 0; yy < M; ++yy) {
      for (int xx = 0; xx < M; ++xx) {
        int yy_curr = yy * M;
        int i_curr = i * O;
        
        float shift_reg[II_CYCLES + 1];
        #pragma unroll
        for (unsigned j = 0; j < II_CYCLES + 1; j++)
            shift_reg[j] = 0.0f;
 
        for (int rc = 0; rc < O; ++rc) {
            shift_reg[II_CYCLES] = shift_reg[0] + (in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)]);
            
            #pragma unroll
            for (unsigned j = 0; j < II_CYCLES; ++j) 
                shift_reg[j] = shift_reg[j+1];
        }
 
        float sum = 0.0;
        #pragma unroll
        for (int jj = 0; jj < II_CYCLES; ++jj) {
           sum += shift_reg[jj];
        }
 
        compute[((yy_curr) + xx)] = sum;
      }
    }
  }
}
 
 

base and single_var compile with an II of 161 for their unrolled accumulation loops: 5 (cycles for the hardened floating-point add) * 32 (unroll factor) + 1, due to the unrolling. In hopes of bringing it down, I declared a separate register for each unrolled lane (separate_sums), which has a significant impact on performance and brings the II down to 5.
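As a sanity check, the separate-sums trick can be modeled on the host in plain C to confirm it computes the same reduction as a sequential loop (the function name and lane count below are my own stand-ins, not kernel code):

```c
#include <assert.h>
#include <math.h>

#define LANES 32  /* mirrors the unroll factor */

/* Host-side model of separate_sums: each of the 32 lanes carries its own
 * dependency chain, so consecutive iterations never read-modify-write the
 * same accumulator; the lanes are collapsed once at the end. */
float dot_partial_sums(const float *a, const float *b, int n) {
    float p[LANES] = {0.0f};
    for (int i = 0; i < n; ++i)
        p[i % LANES] += a[i] * b[i];
    float sum = 0.0f;
    for (int j = 0; j < LANES; ++j)
        sum += p[j];
    return sum;
}
```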

I want to bring this down further, and the manual suggests inferring shift registers to bring the II to 1. shift_reg is my attempt to preserve the unrolling while adding the shift register by creating another nested loop. Unfortunately, that loop is still scheduled with an II=5, with an "Undetermined reason" shown in Loop Analysis in the reports. If I get rid of the loop unrolling and keep the loop structure as is, adding the shift registers brings the II down to 1 (shift_reg_no_unroll), but this ends up performing worse than the unrolled versions.
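For reference, here is a plain-C host model of the shift-register reduction I am trying to get the compiler to infer (my own simplified stand-in, not the kernel itself), which confirms the rotation scheme still produces the correct sum:

```c
#include <assert.h>
#include <math.h>

#define II_CYCLES 5  /* depth >= latency of the float add */

/* Host-side model of the inferred shift register: each new product is
 * added to the head (sr[0]) and written to the tail while every element
 * shifts forward one slot, so any given accumulator is only reused every
 * II_CYCLES iterations, relaxing the loop-carried dependency. */
float dot_shift_reg(const float *a, const float *b, int n) {
    float sr[II_CYCLES + 1] = {0.0f};
    for (int i = 0; i < n; ++i) {
        sr[II_CYCLES] = sr[0] + a[i] * b[i];
        for (int j = 0; j < II_CYCLES; ++j)
            sr[j] = sr[j + 1];
    }
    float sum = 0.0f;  /* collapse the surviving partials */
    for (int j = 0; j < II_CYCLES; ++j)
        sum += sr[j];
    return sum;
}
```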

Any suggestions to improve the performance here? My guess is that the compiler does not like the non-unit loop increment, and the kernel also suffers from poor performance due to the non-sequential reads. Or is this the wrong approach altogether?

  • I did some testing on your kernel; the problem comes from your overly large unroll factor and all the memory ports it creates. If you set II_CYCLES to 5 (but not below that) and reduce UF to 16, you will get an II of 1. Moreover, at least with v19.4 (probably 19.0+, but I didn't test), you will get an II of 1 even with II_CYCLES set to 1 on Arria 10, because the compiler optimizes out the shift register and implements single-cycle accumulation instead. This brings me to a better solution to your problem if you are targeting Arria 10:

    #define UF 16
    kernel void shift_reg(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
      for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
          for (int xx = 0; xx < M; ++xx) {
            int yy_curr = yy * M;
            int i_curr = i * O;
            float final_sum = 0.0f;
            
            int exit = O / UF;
            for (int j = 0; j < exit; j++) {
                float acc_i = 0.0;
     
                #pragma unroll 
                for (int k = 0; k < UF; k++) {
                    int rc = j * UF + k;
                    acc_i += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)];
                }
                
                final_sum += acc_i;
            }
            
            compute[((yy_curr) + xx)] = final_sum;
          }
        }
      }
    }

    This will give an II of 1 regardless of UF; though I think single-cycle accumulation is not available on Stratix 10, since the default target frequency on Stratix 10 is 480 MHz, which is too high for single-cycle accumulation. Either way, you can still use the shift register method on Stratix 10; just don't use an unroll factor larger than 16. My guess is that if you compile your kernel with varying values of UF from 1 to 32, performance will go up until 4 or 8, but it will start going down after that.

    P.S. You should also take the operating frequency into account when comparing performance between different kernels.
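One caveat with the kernel above: `int exit = O / UF;` silently drops the tail elements whenever O is not a multiple of UF. A plain-C sketch of the same blocked accumulation with ceiling division and a bounds guard for the last block (function names are mine; this models only the loop structure, not the hardware):

```c
#include <assert.h>
#include <math.h>

#define UF 16

/* Blocked reduction: a fully unrolled inner block of UF products feeds a
 * single accumulator once per outer iteration. Ceiling division plus the
 * bounds guard keeps the final, partial block from reading past the end
 * when n % UF != 0. */
float dot_blocked(const float *a, const float *b, int n) {
    float sum = 0.0f;
    int blocks = (n + UF - 1) / UF;  /* ceiling division */
    for (int j = 0; j < blocks; ++j) {
        float acc = 0.0f;
        for (int k = 0; k < UF; ++k) {
            int i = j * UF + k;
            if (i < n)               /* guard the tail block */
                acc += a[i] * b[i];
        }
        sum += acc;
    }
    return sum;
}
```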

15 Replies

  • HRZ
    Frequent Contributor

    In general, you should avoid partially unrolling a loop unless the loop has a compile-time-known trip count that is divisible by the unroll factor. In the particular case of reduction loops, partial unrolling will also break the shift register optimization: partially unrolling the loop effectively increases the latency of each loop iteration, which means you need a larger shift register to resolve the dependency on the reduction variable, and the shift register has to keep growing as you increase the unroll factor. This is certainly not a scalable method. To solve this issue, check Section 3.2.2.1 in the following document:

    https://arxiv.org/ftp/arxiv/papers/1810/1810.09773.pdf

    Specifically, the code segment in Figure 3-5 shows how to implement partial unrolling manually by splitting your reduction loop in two: an innermost loop with a trip count equal to the unroll factor, fully unrolled (and no longer needing a shift register), and an outer, non-unrolled loop with its trip count adjusted accordingly, to which the shift register optimization is applied with a fixed size that is independent of the unroll factor.

    One more thing you should consider is re-ordering the content of your "in" buffer on the host side so that accesses to it in the kernel become consecutive; otherwise your external memory performance will suffer greatly.
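A sketch of that host-side re-ordering, assuming the layout used in the kernels above, where the buffer is indexed as `in[(rc * M + yy) * M + xx]` and `rc` is the unrolled index (function and buffer names are mine). Moving `rc` to the fastest-varying position makes the unrolled reads unit-stride:

```c
#include <assert.h>

/* One-time host-side transpose:
 *   in[(rc * M + yy) * M + xx]  ->  out[(yy * M + xx) * O + rc]
 * The kernel would then index the re-ordered buffer with rc innermost,
 * so the unrolled rc reads land on consecutive addresses. */
void reorder_in(const float *in, float *out, int M, int O) {
    for (int yy = 0; yy < M; ++yy)
        for (int xx = 0; xx < M; ++xx)
            for (int rc = 0; rc < O; ++rc)
                out[(yy * M + xx) * O + rc] = in[(rc * M + yy) * M + xx];
}
```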

  • SChun19
    New Contributor

    Hi HRZ, thanks for the response. I've been able to get the II down to 1 with the optimization in Figure 3-5 by setting the shift register size to 32. The first problem I had was that setting the shift register size to the actual MAC latency still gave an II~6 for an "Undetermined reason". Despite the II of 1, the shift register optimization leads to worse results in practice, about 2 to 4x worse runtime on my board. Is this something you have encountered?

    I will consider reordering the buffer for consecutive access.

    • HRZ
      Frequent Contributor

      Can you post your new kernel? You should not need to set the shift register size to a value higher than the MAC latency using the method I mentioned.

      Regarding run time, it is worse than which case? Due to the non-coalescable accesses to the "in" buffer in your code, you are going to see a huge amount of contention from all those memory ports competing for the memory bus, and your large unroll factor makes things even worse. It is not very surprising to get worse performance in such cases compared to the non-unrolled case or smaller unroll factors: the kernel is memory-bound either way, and a lower unroll factor results in less memory contention and better performance. Moreover, with the method I mentioned, you do some extra unused computation in the last iteration; if your iteration count is not divisible by the unroll factor and, at the same time, is not very large compared to it, that wasted work in the last iteration can have a substantial effect on run time.

      • SChun19
        New Contributor
        #define II_CYCLES 24
        #define UF 32
        kernel void shift_reg(global float* restrict compute, global float* restrict in, global float* restrict w, int N, int M, int O) {
          for (int i = 0; i < N; ++i) {
            for (int yy = 0; yy < M; ++yy) {
              for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float shift_reg[II_CYCLES] = {0.0f}, final_sum = 0.0f;
                
                int exit = O / UF;
                for (int j = 0; j < exit; j++) {
                    float acc_i = 0.0;
         
                    #pragma unroll 
                    for (int k = 0; k < UF; k++) {
                        int rc = j * UF + k;
                        acc_i += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)];
                    }
         
                    shift_reg[II_CYCLES - 1] = shift_reg[0] + acc_i;
         
                    #pragma unroll
                    for (unsigned k = 0; k < II_CYCLES - 1; ++k) 
                        shift_reg[k] = shift_reg[k+1];
                }
         
                #pragma unroll
                for (int jj = 0; jj < II_CYCLES; ++jj) {
                   final_sum += shift_reg[jj];
                }
         
                compute[((yy_curr) + xx)] = final_sum;
              }
            }
          }
        }

        This is the new kernel. It is worse than all of the above, although it still gets much better performance than the non-unrolled case, since the 32 separate private caches greatly reduce memory contention through a much higher hit rate than a single LSU. In my case, the iteration count is always divisible by the unroll factor, as both are known prior to runtime.

        The MAC latency is 4 cycles. If I set II_CYCLES to 4, I get the following from Loop Analysis (report screenshot not included):

        So I tried setting it to 7+4=11; it is still scheduled with II~2. Only with larger shift registers does the compiler successfully schedule it with an II of 1.

    • SChun19
      New Contributor

      In the case where II=1 due to the single-cycle accumulator, would loop speculation still benefit performance? Especially if the exit condition is already simple enough.

  • AnilErinch_A_Intel
    Frequent Contributor

    Hi,

    In your latest optimized design, enabling loop speculation will not have an impact. But knowing about it would help in situations like the one you encountered initially.

    Thanks and Regards

    Anil