Why does this design not fit and how to reason about congestion

Question

Hi all,  I'm trying to optimize a simple program which computes a correlation matrix. To promote reuse/parallelism the design proceeds in blocks. However, I do not get the design to route for larger block sizes (&gt; 12)   I do not fully understand why it does not fit. Here is the code:  # define NR_RECEIVERS        576# define NR_BASELINE             166176 # define NR_SAMPLES_PER_CHANNEL    1024# define NR_CHANNELS        64# define NR_POLARIZATIONS        2# define    COMPLEX            2# define BLOCK_SIZE_X                 8# define BLOCK_SIZE              (BLOCK_SIZE_X * BLOCK_SIZE_X)# define NR_BLOCK_X             (NR_RECEIVERS / BLOCK_SIZE_X)             # define NR_BLOCKS              (NR_BLOCK_X * (NR_BLOCK_X + 1)) / 2# define OUTSIZE                NR_BLOCKS * BLOCK_SIZE
typedef signed char int8_t;
typedef float2 fcomplex;
typedef fcomplex InputType/**/;
typedef fcomplex OutputType/**/; 
fcomplex  __attribute__((__overloadable__,__always_inline__,const)) mulConj(fcomplex a, fcomplex b){
  return (float2)(a.x * b.x + a.y * b.y, a.y * b.x - a.x * b.y);
}
__kernel __attribute__((reqd_work_group_size(1,1,1))) __attribute__((max_global_work_dim(0)))
void Correlator(__global OutputType *restrict output, const __global volatile InputType *restrict input)
{
    int blockx = 0;
    int blocky = 0;
    for( int baselineBlock = 0 ; baselineBlock &lt; NR_BLOCKS; baselineBlock++) {
    fcomplex sums;
        # pragma unroll
        for(int i = 0 ; i &lt; BLOCK_SIZE_X ; i++){
            # pragma unroll
            for(int j = 0 ; j &lt; BLOCK_SIZE_X ; j++){
                sums = (fcomplex)0;
            }
        }
        for (int time = 0 ; time &lt; NR_SAMPLES_PER_CHANNEL; time++){
            fcomplex memx;
            fcomplex memy;
           # pragma unroll
            for (int i = 0; i &lt; BLOCK_SIZE_X ; i++){
                    memy = (*input);
            }
           # pragma unroll
            for (int i = 0; i &lt; BLOCK_SIZE_X ; i++){
                    memx = (*input);
            }
           # pragma unroll
            for(int i = 0; i &lt; BLOCK_SIZE_X ; i++){
               # pragma unroll
                for(int j = 0; j &lt; BLOCK_SIZE_X ; j++){
                    sums += mulConj( memx , memy);
                }
            }
        }
        
       # pragma unroll
        for(int i = 0 ; i &lt; BLOCK_SIZE_X ; i++){
         # pragma unroll
            for(int j = 0 ; j &lt; BLOCK_SIZE_X ; j++){
                    (*output) = sums;
            }
        }
        blocky+=BLOCK_SIZE_X;
        if(blocky &gt; blockx){
            blockx+=BLOCK_SIZE_X;
            blocky = 0;
        }
  }
}
 For block size 16, the compiler estimate (--report) suggests that it should fit, with estimates like this: 
+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   40%                     ;
; ALUTs                                  ;   11%                     ;
; Dedicated logic registers              ;   28%                     ;
; Memory blocks                          ;   14%                     ;
; DSP blocks                             ;   67%                     ;
+----------------------------------------+---------------------------;
  At large block sizes, the quartus_sh_compile says that there is high congestion in the design. I assumed that this was due to the high amount of reuse in the design, every value in memx or memy is read 16 times, for a blocksize of 16.   To combat this effect, I redesigned the kernel in a systolic array-like design, where data is shifted through a grid in both the x and y direction and each value is only read once (instead of 16 times): \# define NR_RECEIVERS        576# define NR_BASELINE             166176 # define NR_SAMPLES_PER_CHANNEL    1024# define NR_CHANNELS        64# define NR_POLARIZATIONS        2# define    COMPLEX            2# define BLOCK_SIZE_X             16# define BLOCK_SIZE_Y            (BLOCK_SIZE_X / 2)# define BLOCK_SIZE              (BLOCK_SIZE_X * BLOCK_SIZE_X)# define NR_BLOCK_X             (NR_RECEIVERS / BLOCK_SIZE_X)             # define NR_BLOCKS              (NR_BLOCK_X * (NR_BLOCK_X + 1)) / 2# define OUTSIZE                ((NR_RECEIVERS * (NR_RECEIVERS + 1)) / 2)
typedef signed char int8_t;
typedef float2 fcomplex;
typedef fcomplex InputType/**/;
// Write non-contigous: 
// typedef fcomplex OutputType/**/; 
// Write contigous:
typedef fcomplex OutputType/**/; 
fcomplex  __attribute__((__overloadable__,__always_inline__,const)) mulConj(fcomplex a, fcomplex b){
  return (float2)(a.x * b.x + a.y * b.y, a.y * b.x - a.x * b.y);
}
__kernel __attribute__((reqd_work_group_size(1,1,1))) __attribute__((max_global_work_dim(0)))
void Correlator(__global OutputType *restrict output, const __global volatile InputType *restrict input)
{
    int blockx = 0;
    int blocky = 0;
   # pragma unroll 1
    for( int baselineBlock = 0 ; baselineBlock &lt; NR_BLOCKS; baselineBlock++) {
        fcomplex sums;
        # pragma unroll
        for(int i = 0 ; i &lt; BLOCK_SIZE_X ; i++){
            # pragma unroll
            for(int j = 0 ; j &lt; BLOCK_SIZE_X ; j++){
                sums = (fcomplex)0;
            }
        }
        fcomplex bufx;
       # pragma unroll
        for(int i = 0 ; i &lt; 2 * BLOCK_SIZE - 1 ; i++){
            # pragma unroll
            for(int j = 0 ; j &lt; BLOCK_SIZE  ; j++){
                bufx = (fcomplex)0;
            }
        }
        fcomplex bufy;
       # pragma unroll
        for(int i = 0 ; i &lt; BLOCK_SIZE ; i++){
            # pragma unroll
            for(int j = 0 ; j &lt; 2 * BLOCK_SIZE - 1  ; j++){
                bufy = (fcomplex)0;
            }
        }
       # pragma unroll 1
        for (int time = 0 ; time &lt; NR_SAMPLES_PER_CHANNEL; time++){
           # pragma unroll
            for(int j = 0 ; j &lt; BLOCK_SIZE ; j++){         
               # pragma unroll
                for(int i = 2 * BLOCK_SIZE - 2 ; i &gt;=1 ; i--){                
                    bufx = bufx;
                }
            }
           # pragma unroll
            for(int i = 0 ; i &lt; BLOCK_SIZE ; i++){         
               # pragma unroll
                for(int j = 2 * BLOCK_SIZE - 2 ; j &gt;=1 ; j--){                
                    bufy = bufy;
                }
            }
           # pragma unroll
            for(int i = 0 ; i &lt; BLOCK_SIZE ; i++){         
                bufx =  (*input);
            }
           # pragma unroll
            for(int i = 0 ; i &lt; BLOCK_SIZE ; i++){         
                bufy =  (*input);
            }
           # pragma unroll
            for(int i = 0; i &lt; BLOCK_SIZE_X ; i++){
               # pragma unroll
                for(int j = 0; j &lt; BLOCK_SIZE_Y ; j++){
                    float2 a = bufx;
                    float2 b = bufy;
                    sums +=  (float2)(a.x * b.x + a.y * b.y, a.y * b.x - a.x * b.y);
                }
            }
        }
       # pragma unroll
        for(int i = 0 ; i &lt; BLOCK_SIZE_X ; i++){
         # pragma unroll
            for(int j = 0 ; j &lt; BLOCK_SIZE_X ; j++){
                    (*output) = sums;
            }
        }
        blocky+=BLOCK_SIZE_X;
        if(blocky &gt; blockx){
            blockx+=BLOCK_SIZE_X;
            blocky = 0;
        }
  }
}
   This gives estimates like before.  However, this also does not fit for larger block sizes (&gt;12), just like the earlier design. The log states:  Cannot place the following 18 DSP cells -- a legal placement which satisfies all the DSP requirements could not be found.....   I do not understand why. Why do these designs have high congestion, how can I reason about this and how can I make a larger block size fit? Is it simply unrealistic to use more than 50% of the DSP nodes?  Any ideas appreciated!

altera_forum · Answer

I don't know, but you might want to take a look at routing resources, it could be that the most of the interconnect is used and there is none left to connect the DSPs. This is just a guess, so it could be something completely different.

altera_forum · Answer

Thanks for the suggestion. What do you mean with look at routing resources? Where can I find this information? I'm a bit of a beginner here. Thanks in advance!

altera_forum · Answer

Altera's estimation can be very wrong at times, especially on Arria 10.

If your design is failing during placement, then it means the design is too big to fit on the FPGA, even though in my experience, AOC's DSP usage estimation is generally correct.

If, however, your design is failing during routing due to congestion, with a an area utilization that is not relatively high, then there isn't much that can be done about it except redesigning the kernel. On Arria 10, since OpenCL uses Partial Reconfiguration through PCI-E by default, routing fails very regularly due to routing congestion in the region that connects the static block to the dynamic block (I have encountered this far too many times to count). I have never gone over 60% DSP utilization on Arria 10, though (logic and RAM are usually over-utilized long before DSP in my experience), so that might also be a contributing factor.

On Stratix V, however, I have encountered routing failure due to congestion only a few times, and that was because my design was bad (too many non-coalesced reads and writes from the same local buffer).

Can you provide the quartus_sh_compile.log and also mention your target device and the version of Quartus/AOC you are using?

P.S. Your first kernel looks relatively straightforward and data parallel; you might have better luck if you implement it as NDRange. In particular, spatial blocking using local buffers that are implemented as standard multi-ported memory buffers on the FPGA (in contrast to, for example, shift registers) seem to work better in NDRange kernels in my [limited] experience.

altera_forum · Answer

Thanks for the input. I've attached the quartus_sh_compile.sh for the second kernel below. I'm afraid I've deleted it for the first kernel, but I will rerun this tonight, and send it later.

I'm using an Arria 10 with Quartus 16.0 (no 16.1 yet for Nalla board).

What exactly is the definition of congestion here? How do I prevent it?

I'll also try a NDRange kernel, the problem with that is that to devide the computation some extra computations involving squareroots are needed.

Thanks in advance for the help!

altera_forum · Answer

This report shows that failure is occurring during fitting due to lack of DSPs (or maybe DSPs that match placement requirements). Does the compiler also generate the file named top.fit.summary in this case? That report should make it clear whether you are really running out of DSPs or not (even though I regularly see negative or overflown numbers in that file).  With block size 16 for your first code, I get this with Quartus 16.1.2 and Arria 10:  +--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   65%                     ;
; ALUTs                                  ;   31%                     ;
; Dedicated logic registers              ;   35%                     ;
; Memory blocks                          ;   40%                     ;
; DSP blocks                             ;   90%                     ;
+----------------------------------------+---------------------------;  I would expect this to be very hard to fit on the FPGA, especially since the logic utilization is probably underestimated. Your second code takes a very long time to do the first stage of compilation with a block size of 16, so I gave up on it, but the estimation seems to report lower number of DSPs for a block size of 8, compared to the first code.  Both codes seem to compile very nicely with all loops being fully-pipelined with an II of one, and based on the system overview from Quartus v16.1.2, the system has 4 fully-coalesced memory ports and a low-latency peipeline (100-200 clocks). I guess this case is simply a problem of high area utilization. You could reduce your area overhead by converting some of your less important fully-unrolled loops to partially-unrolled ones.

Forum Discussion

Why does this design not fit and how to reason about congestion

7 Replies

Recent Discussions

Quartus Prim Pro: "Fatal Error: Segment Violation, Access Violation"

timing violation fix

log signal to signal tap

Docker image for Quartus Pro 26.1 missing ?

Licensing ‘Know-How’ Guide