Stratix 10 MX Development Kit OpenCL BSP

Hello, is there an OpenCL BSP for the Stratix 10 MX development kit engineering sample? If so, where might I be able to find it?

Re: Bidirectional OpenCL Channel Stalls (both at read and write insts)

I intentionally removed all external memory accesses and replaced them with a pseudorandom generator on lines 38 and 60, rather than pointer arithmetic, so that I can rule that out. There are no global memory accesses here. (That pseudorandom generation technique is from the matrix multiplication example in Quartus 19.1.)

Bidirectional OpenCL Channel Stalls (both at read and write insts)

Hi, I am experiencing an issue where there are stalls at both the read and the write instructions for the channel in a feed-forward situation. I am not sure how to fix this, or even how to interpret it, since the implication is contradictory. I have attached the profiler report and will paste the code below. Thanks.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define Tc 4
#define Tm 16
#define Tn 8

struct input_features {
    float input_buf[Tc];
    bool rc_zero;
    bool flush;
};

struct filter_weights {
    float weight_buf[Tn];
};

struct outputs {
    float output_buf[Tm][Tc];
};

channel struct input_features loadAChannel __attribute__((depth(64)));
channel struct filter_weights loadBChannel __attribute__((depth(64)));
channel struct outputs storeCChannel __attribute__((depth(64)));

kernel void loadA(global float* restrict compute, global float* volatile restrict input0,
                  global float* restrict input1, global float* restrict T_clip,
                  global float* restrict input2, global float* restrict input3,
                  int ax1_bound, int yy_bound, int xx_bound, int rc_bound) {
    for (int ax1 = 0; ax1 < ax1_bound; ax1 += Tm) {
        for (int yy = 0; yy < yy_bound; yy++) {
            for (int xx = 0; xx < yy_bound; xx += Tc) {
                for (int rc = 0; rc < rc_bound; rc += Tn) {
                    struct input_features i_local;
                    i_local.rc_zero = (rc == 0);
                    i_local.flush = (rc + Tn) >= rc_bound;
                    for (int tii = 0; tii < Tn; tii++) {
                        #pragma unroll
                        for (int tcc = 0; tcc < Tc; tcc++) {
                            uint tmp = (0x3F800000 + tcc) + ((rc * 1 + tii) & 0xFFFF);
                            i_local.input_buf[tcc] = *(float *) &tmp;
                        }
                        write_channel_intel(loadAChannel, i_local);
                    }
                }
            }
        }
    }
}

kernel void loadB(global float* restrict compute, global float* restrict input0,
                  global float* volatile restrict input1, global float* restrict T_clip,
                  global float* restrict input2, global float* restrict input3,
                  int ax1_bound, int yy_bound, int xx_bound, int rc_bound) {
    for (int ax1 = 0; ax1 < ax1_bound; ax1 += Tm) {
        for (int yy = 0; yy < yy_bound; yy++) {
            for (int xx = 0; xx < yy_bound; xx += Tc) {
                for (int rc = 0; rc < rc_bound; rc += Tn) {
                    struct filter_weights w_local;
                    for (int too = 0; too < Tm; too++) { // ax1
                        #pragma unroll
                        for (int tii = 0; tii < Tn; tii++) { // rc
                            uint tmp = (0x3F800000 + too) + ((rc * 1 + tii) & 0xFFFF);
                            w_local.weight_buf[tii] = *(float *) &tmp;
                        }
                        write_channel_intel(loadBChannel, w_local);
                    }
                }
            }
        }
    }
}

__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
kernel void monolithic() {
    float __attribute__((memory)) output_buf[Tm][Tc];
    float __attribute__((memory)) weight_buf[Tm][Tn];
    float __attribute__((memory)) input_buf[Tn][Tc];
    while (1) {
        struct outputs out;
        bool resetsum, flush;
        for (int tii = 0; tii < Tn; tii++) { // rc - input feature maps (C)
            struct input_features valA = read_channel_intel(loadAChannel);
            resetsum = valA.rc_zero;
            flush = valA.flush;
            #pragma unroll
            for (int tcc = 0; tcc < Tc; tcc++) { // xx - output columns (Q)
                input_buf[tii][tcc] = valA.input_buf[tcc];
            }
        }
        for (int too = 0; too < Tm; too++) { // ax1 - output features (K)
            struct filter_weights valB = read_channel_intel(loadBChannel);
            #pragma unroll
            for (int tii = 0; tii < Tn; tii++) { // rc - input feature maps (C)
                weight_buf[too][tii] = valB.weight_buf[tii];
            }
        }
        /* compute here */
        if (flush) {
            #pragma unroll
            for (int too = 0; too < Tm; too++) { // ax1 - output features (K)
                #pragma unroll
                for (int tcc = 0; tcc < Tc; tcc++) { // xx - output columns (Q)
                    out.output_buf[too][tcc] = output_buf[too][tcc];
                }
            }
            write_channel_intel(storeCChannel, out);
        }
    }
}

kernel void storeC(global float* restrict compute, global float* restrict input0,
                   global float* restrict input1, global float* restrict T_clip,
                   global float* restrict input2, global float* restrict input3,
                   int ax1_bound, int yy_bound, int xx_bound, int rc_bound) {
    for (int ax1 = 0; ax1 < ax1_bound; ax1 += Tm) {
        for (int yy = 0; yy < yy_bound; yy++) {
            for (int xx = 0; xx < yy_bound; xx += Tc) {
                struct outputs out_local = read_channel_intel(storeCChannel);
                #pragma unroll
                for (int too = 0; too < Tm; too++) {
                    #pragma unroll
                    for (int tcc = 0; tcc < Tc; tcc++) {
                        out_local.output_buf[too][tcc] += 1;
                    }
                }
            }
        }
    }
}
```

Re: Reducing initiation interval, relaxing loop-carried dependency

In the case where II=1 due to the single-cycle accumulator, would loop speculation still benefit performance? Especially if the exit condition is already simple enough.

Re: Reducing initiation interval, relaxing loop-carried dependency

I see, that makes sense. In this case all the accesses are always aligned by the vector size, so this shouldn't be an issue. I did run a few experiments in the past with non-aligned accesses and saw the same thing: almost a 5-10x reduction in performance. At this point I guess I can try maintaining my own private caches manually. Perhaps I can revisit loop reordering and tiling optimizations again...

Re: Reducing initiation interval, relaxing loop-carried dependency

The compiler is creating a private cache in both cases, but I have just noticed that in the sequential case the compiler builds a non-aligned burst-coalesced cached LSU (whereas for the non-sequential case, it is aligned). Perhaps this accounts for the performance drop.

Re: Reducing initiation interval, relaxing loop-carried dependency

I changed the access pattern so that `in` is now accessed sequentially.
With a UF of 64 (fmax = 220 MHz) this becomes coalesced into a 2048-bit read. The result is a 5x increase in runtime compared to the non-sequential access pattern above. Minding your comment about memory saturation, I decreased the UF to 16 (fmax = 309 MHz) and resynthesized. Area is drastically reduced in both cases, but performance does not improve much. I would have imagined that sequential reads would greatly improve performance here, but they do not. I've included some profiled test cases below (timed with OpenCL events).

N, M, O: runtime

UF 64, II=1:
```
  64, 112,   32: 14.711209 ms
 128,  56,   64: 14.717709 ms
 128,  56,  128: 29.326209 ms
 256,  28,  128: 10.700833 ms
 256,  28,  256: 29.360500 ms
 512,  14,  256:  2.439333 ms
 512,  14,  512: 21.220958 ms
1024,   7,  512:  2.243875 ms
1024,   7, 1024:  3.990750 ms
```

UF 16, II=1:
```
  64, 112,   32: 10.554042 ms
 128,  56,   64: 10.457500 ms
 128,  56,  128: 20.892041 ms
 256,  28,  128: 10.520417 ms
 256,  28,  256: 20.920375 ms
 512,  14,  256: 10.513583 ms
 512,  14,  512: 21.001666 ms
1024,   7,  512:  8.579375 ms
1024,   7, 1024: 21.323458 ms
```

In comparison, the non-sequential read (as per the kernel code in a previous post) with a UF of 64 gets this:
```
  64, 112, 112,   32: 7.376835 ms
 128,  56,  56,   64: 3.769259 ms
 128,  56,  56,  128: 5.600221 ms
 256,  28,  28,  128: 2.882847 ms
 256,  28,  28,  256: 4.712566 ms
 512,  14,  14,  256: 2.420952 ms
 512,  14,  14,  512: 4.206385 ms
1024,   7,   7,  512: 2.233914 ms
1024,   7,   7, 1024: 4.076918 ms
```

Regarding the HBMs, I don't think much has changed: only one HBM channel can be assigned to a global memory argument. Creating 32 different kernels, as in the bandwidth test example, seems to be another way of doing it... but I cannot think of other approaches, or whether it would be worthwhile...

Re: Reducing initiation interval, relaxing loop-carried dependency

I see. From what I read I thought that as well, but in practice I was seeing a linear increase in performance from 4 to 64, which is why I went with this.
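To make the coalescing point above concrete, the two index patterns can be compared on the host. This is just an illustrative plain-C sketch: `M = 56` is taken from one of the test cases above, and `idx_strided`/`unit_stride` are made-up helper names. The non-sequential pattern strides by `M*M` floats between consecutive `rc` values, so the UF loads cannot fuse, whereas a sequential layout yields unit-stride addresses that can coalesce into one wide load.

```c
#include <assert.h>

/* Index of in[] as used in the non-sequential kernel: in[((rc*M)+yy)*M + xx].
 * Consecutive rc values land M*M elements apart in memory. */
long idx_strided(int rc, int yy, int xx, int M) {
    return ((long)rc * M + yy) * M + xx;
}

/* Returns 1 if n element addresses are unit-stride, i.e. a candidate
 * for coalescing into a single wide (e.g. 2048-bit for UF=64) load. */
int unit_stride(const long *addr, int n) {
    for (int i = 1; i < n; ++i)
        if (addr[i] - addr[i - 1] != 1)
            return 0;
    return 1;
}
```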
One thing that I should mention is that I am using an experimental Stratix 10 MX board with HBM2 on Quartus 19.1, though I have seen the same behavior with the Stratix 10 SX (PAC with DDR4) on vLab. I will give that a try.

Aside from this (and making the `in` reads coalesceable), what are the more common methods of increasing parallelism? I can think of vectorization or invoking more compute units, but I find it hard to imagine how I could maximize parallel use of the DSPs before hitting a BRAM/logic wall due to LSU bloat. I will also keep the fmax in mind for future comparisons. Thank you so much for your help so far.

Re: Reducing initiation interval, relaxing loop-carried dependency

```c
#define II_CYCLES 24
#define UF 32

kernel void shift_reg(global float* restrict compute, global float* restrict in,
                      global float* restrict w, int N, int M, int O) {
    for (int i = 0; i < N; ++i) {
        for (int yy = 0; yy < M; ++yy) {
            for (int xx = 0; xx < M; ++xx) {
                int yy_curr = yy * M;
                int i_curr = i * O;
                // one extra slot so the last partial sum is not counted twice
                float shift_reg[II_CYCLES + 1] = {0.0f};
                float final_sum = 0.0f;
                int exit = O / UF;
                for (int j = 0; j < exit; j++) {
                    float acc_i = 0.0f;
                    #pragma unroll
                    for (int k = 0; k < UF; k++) {
                        int rc = j * UF + k;
                        acc_i += in[(((rc * M) + yy) * M) + xx] * w[i_curr + rc];
                    }
                    shift_reg[II_CYCLES] = shift_reg[0] + acc_i;
                    #pragma unroll
                    for (unsigned k = 0; k < II_CYCLES; ++k)
                        shift_reg[k] = shift_reg[k + 1];
                }
                #pragma unroll
                for (int jj = 0; jj < II_CYCLES; ++jj) {
                    final_sum += shift_reg[jj];
                }
                compute[yy_curr + xx] = final_sum;
            }
        }
    }
}
```

This is the new kernel. It is worse than all of the above. It still gets much better performance than the non-unrolled case, since generating 32 separate cache lines greatly reduces memory contention thanks to a much higher hit rate compared to a single LSU. In my case the iteration count is always divisible by the unroll factor, as both are known prior to runtime. The MAC latency is 4 cycles.
If I set II_CYCLES to 4, then I get this from Loops Analysis (see attachment). So I tried setting it to 7+4=11; it is still scheduled with II≈2. Only with larger shift registers does the compiler successfully schedule it at II=1.

Re: Reducing initiation interval, relaxing loop-carried dependency

Hi HRZ, thanks for the response. I've been able to get the II down to 1 with the optimization in Figure 3-5 and by setting the shift register size to 32. The first problem I was having was that setting the shift register size to the actual MAC latency still gave an II of ~6 for an "Undetermined reason". Despite this, the shift register optimization leads to worse results in practice, about 2-4x worse runtime on my board. Is this something that you encountered? I will consider reordering the buffer for consecutive access.
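For reference, the shift-register transformation itself can be sanity-checked on the host, independently of how the compiler schedules it. Below is a minimal plain-C sketch (illustrative only: `II_CYCLES`/`UF` values and the function names `dot_naive`/`dot_shift` are mine, and `n` is assumed divisible by `UF`) of the canonical pattern, which keeps one extra slot so the final partial sum is not counted twice:

```c
#include <math.h>

#define II_CYCLES 24
#define UF 32

/* Naive accumulation: the loop-carried dependency on `sum` forces the
 * FPGA scheduler to II > 1 whenever the FP add latency exceeds one cycle. */
float dot_naive(const float *in, const float *w, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += in[i] * w[i];
    return sum;
}

/* Shift-register accumulation: partial sums rotate through II_CYCLES + 1
 * slots, so consecutive outer iterations never read the value written in
 * the immediately preceding iteration, relaxing the dependency to II = 1. */
float dot_shift(const float *in, const float *w, int n) {
    float shift_reg[II_CYCLES + 1] = {0.0f};
    float final_sum = 0.0f;
    for (int j = 0; j < n / UF; ++j) {
        float acc = 0.0f;
        for (int k = 0; k < UF; ++k)            /* unrolled by UF on the FPGA */
            acc += in[j * UF + k] * w[j * UF + k];
        shift_reg[II_CYCLES] = shift_reg[0] + acc;
        for (int k = 0; k < II_CYCLES; ++k)     /* shift everything down one */
            shift_reg[k] = shift_reg[k + 1];
    }
    for (int k = 0; k < II_CYCLES; ++k)         /* top duplicate slot excluded */
        final_sum += shift_reg[k];
    return final_sum;
}
```

Both functions compute the same dot product (up to floating-point reassociation); only the dependency structure differs.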