Reducing initiation interval, relaxing loop-carried dependency

Hi, I am trying to implement and optimize multiply-add accumulate as a single work-item. The access pattern is not sequential for one of the reads, so unrolling creates a separate LSU corresponding t...

HRZ
6 years ago
I did some testing on your kernel; the problem comes from your overly large unroll factor and all the memory ports it creates. If you set II_CYCLES to 5 (but not below that) and reduce UF to 16, you will get an II of 1. Moreover, at least with v19.4 (probably 19.0+, but I didn't test), you will get an II of 1 even with II_CYCLES set to 1 on Arria 10, because the compiler optimizes out the shift register and implements single-cycle accumulation instead. This brings me to a better solution to your problem if you are targeting Arria 10:
#define UF 16

kernel void shift_reg(global float* restrict compute,
                      global float* restrict in,
                      global float* restrict w,
                      int N, int M, int O)
{
    for (int i = 0; i < N; ++i)
    {
        for (int yy = 0; yy < M; ++yy)
        {
            for (int xx = 0; xx < M; ++xx)
            {
                int yy_curr = yy * M;
                int i_curr = i * O;
                float final_sum = 0.0f;
                int exit = O / UF;

                for (int j = 0; j < exit; j++)
                {
                    float acc_i = 0.0;

                    #pragma unroll
                    for (int k = 0; k < UF; k++)
                    {
                        int rc = j * UF + k;
                        acc_i += in[((((rc * M) + yy) * M) + xx)] * w[((i_curr) + rc)];
                    }
                    final_sum += acc_i;
                }
                compute[((yy_curr) + xx)] = final_sum;
            }
        }
    }
}
This will give an II of 1 regardless of UF. However, I think single-cycle accumulation is not available on Stratix 10, since the default target frequency on Stratix 10 is 480 MHz, which is too high for single-cycle accumulation. Either way, you can still use the shift register method on Stratix 10; just don't use an unroll factor larger than 16. My guess is that if you compile your kernel with UF values from 1 to 32, performance will go up until 4 or 8 and start going down after that.
P.S. You should also take the operating frequency into account when comparing performance between different kernels.
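
For reference, here is a minimal sketch of the shift register method mentioned above. It is written as plain C so it can be run on a host, and it is not the original poster's kernel: the function name shift_reg_sum and the flat input array are assumptions made for illustration, and only II_CYCLES comes from the discussion. The idea is that each new partial sum depends on a value written II_CYCLES iterations earlier rather than on the previous iteration, which is what relaxes the loop-carried dependency.

```c
#include <stddef.h>

/* Shift register depth; 5 is the value suggested in the post above. */
#define II_CYCLES 5

/* Hypothetical helper: sums n floats using a shift register of partial sums.
 * In an OpenCL kernel the two short inner loops would be fully unrolled
 * (#pragma unroll) so the array can be implemented as registers. */
float shift_reg_sum(const float *x, size_t n)
{
    float shift_reg[II_CYCLES + 1];

    /* Start with all partial sums at zero. */
    for (int i = 0; i < II_CYCLES + 1; i++)
        shift_reg[i] = 0.0f;

    for (size_t j = 0; j < n; j++) {
        /* Accumulate into the oldest partial sum, write to the tail. */
        shift_reg[II_CYCLES] = shift_reg[0] + x[j];

        /* Shift every element one position toward the head. */
        for (int i = 0; i < II_CYCLES; i++)
            shift_reg[i] = shift_reg[i + 1];
    }

    /* Final reduction over the II_CYCLES partial sums. */
    float sum = 0.0f;
    for (int i = 0; i < II_CYCLES; i++)
        sum += shift_reg[i];
    return sum;
}
```

Note that the result can differ slightly from a straight left-to-right sum, because the additions are reordered across II_CYCLES independent partial sums.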

HRZ

Frequent Contributor

6 years ago

Can you post your new kernel? You should not need to set the shift register size to a value higher than the MAC latency using the method I mentioned.

Regarding run time, it is worse than which case? Because the accesses to the "in" buffer in your code are non-coalescable, you are going to see a huge amount of contention from all those memory ports competing for the memory bus, and your large unroll factor makes things even worse. It is not very surprising to get worse performance in such cases than with no unrolling or a smaller unroll factor: the kernel would be memory-bound anyway, and a lower unroll factor results in less memory contention and better performance. Moreover, with the method I mentioned, you will be doing some extra unused computation in the last iteration. Hence, if your iteration count is not divisible by the unroll factor and is also not very large compared to the unroll factor, that unused computation can have a substantial effect on run time.
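
To make the cost of the unused computation concrete, here is a small host-side C sketch. It assumes the trip count is rounded up to a multiple of UF and that out-of-range lanes are masked to zero, which is one common way to handle a remainder and not necessarily what your kernel does. The names padded_trips, wasted_lanes and blocked_dot are hypothetical.

```c
#include <stddef.h>

#define UF 16

/* Number of outer iterations when n is rounded up to a multiple of UF. */
size_t padded_trips(size_t n)
{
    return (n + UF - 1) / UF;
}

/* Multiply-add lanes that run but contribute nothing to the result. */
size_t wasted_lanes(size_t n)
{
    return padded_trips(n) * UF - n;
}

/* Blocked dot product: UF lanes per outer iteration, extra lanes masked. */
float blocked_dot(const float *a, const float *b, size_t n)
{
    float final_sum = 0.0f;
    size_t trips = padded_trips(n);

    for (size_t j = 0; j < trips; j++) {
        float acc = 0.0f;
        /* This inner loop is the part that would be unrolled on the FPGA. */
        for (size_t k = 0; k < UF; k++) {
            size_t rc = j * UF + k;
            /* Lanes past the end still occupy hardware but add zero. */
            float prod = (rc < n) ? a[rc] * b[rc] : 0.0f;
            acc += prod;
        }
        final_sum += acc;
    }
    return final_sum;
}
```

With n = 20 and UF = 16, this runs 2 outer iterations (32 lanes) of which 12 are wasted, so more than a third of the work is unused. With n = 2000 the waste is only the 0 to 15 lanes of the final iteration, which is negligible.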

SChun19

New Contributor

6 years ago

I see, that makes sense. In this case all the accesses are always aligned to the vector size, so this shouldn't be an issue. I did run a few experiments in the past with non-aligned accesses and saw the same results, an almost 5-10x reduction in performance.

At this point I guess I can try maintaining my own private caches manually. Perhaps I can revisit loop reordering & tiling optimizations again...
