Altera_Forum
Honored Contributor
15 years ago

Fmax is too low, please help me improve it!

Hi,

In my small SystemVerilog project (about 1000 lines) I have one module that is critical to performance (attached below).

This module takes 4x16-bit words (In) per clock (Clk), pipelines them over N stages (Data), calculates scalar products over all possible combinations, and sums them using something similar to an FIR filter. I read the data out of this module on a second clock (ClkSW). I urgently need Clk to run at 400MHz; ClkSW can run at 100MHz.

I am experimenting on the Stratix III of a DE3 board (EP3SL150F1152C2), whose DSP blocks can reach 440MHz on a*b+c*d operations. I designed everything to be maximally pipelined, so each parallel line performs only one operation, and I can see that Quartus uses DSP blocks for my multiplier pairs and implements them optimally as a*b+c*d operators.

Of course, I switched on all available compiler optimizations for speed and switched off any reuse of earlier synthesis results, to keep the experiment clean.

My module has a parameter N that sets the total amount of identical parallel work to be performed.

Here is a table of the results I am getting:


 N   SHR   Fmax 0C/85C (MHz)   Logic   DSP   Total synthesis time (min)
18    14   337/311             22%     75%   29
18     2   331/308             19%     75%   28
12     2   366/340             13%     50%   17
 8     2   370/339              8%     33%   11
 6     2   413/384              6%     25%    8
 4     2   448/410              4%     17%    5
 4    14   382/355              5%     17%    6
 2    14   432/401              2%      8%    3

When my DSP usage is low (N=2 or 4), I can get close to peak performance (Fmax for the multipliers should be about 440MHz), but I cannot reach it when I use many multipliers, even though the module behaves exactly the same.

I have fought with this module for almost a month, trying to add intermediate registers, but it does not help: for N=16 or 18 I cannot reach even 400MHz, which would be enough for me. On large projects I cannot make many attempts, since each recompile costs me half an hour.

Please suggest what I should try in order to reach Fmax=400MHz for Clk with N=18. I urgently need it; otherwise I will have to demux my data and use at least an SL340, with its impressive $8000 price :(

Sincerely,

Ilghiz

Here is my module; you can compile it in your Quartus for Stratix III and see my problem:


module test(Clk, In, ClkSW, SW, Scal);
parameter N=18; // can be 2, 4, 6, ..., but I need 16 or 18
parameter SHR=14; // can be 2, 3, 4, 5, ..., but I need 12-20
input Clk, ClkSW;
input signed  In;
input  SW;
reg signed  Scal;
output      Scal;
// Memory ////////////////////////////
reg signed  D, Data;
reg signed  Mul;
reg signed  Sum, Sum2;
reg signed  ScalX;
reg signed  ScalY;
reg InDataCounter;
// Reading data from channels - the key block that I cannot clock at 400MHz for N=16 or 18
 always @(posedge Clk)
 begin
   for(int i=0; i<2; i++)
     for(int j=0; j<4; j++)
       D<=Data;
   for(int j=0; j<4; j++)
     Data<=In;
   for(int i=0; i<N-1; i++)
     for(int j=0; j<4; j++)
       Data<=Data;
   InDataCounter<=~InDataCounter;
   for(int i=0; i<N; i+=2)
     for(int j=0; j<4; j++)
       for(int k=0; k<4; k++)
       begin
         Mul  <=D*Data;
         Mul<=D*Data;
       end
   for(int i=0; i<N; i+=2)
     for(int j=0; j<16; j++)
     begin
       Sum<=Mul+Mul;
       Sum2<=Sum; // intermediate register that helps a lot
       ScalX<=ScalX-(ScalX>>>SHR);
       ScalX<=Sum2+ScalX;
       ScalY<=ScalX>>>(16+SHR);
       ScalY<=ScalX>>>(16+SHR);
     end
 end
 // Output, it is clocked with 100MHz and I hope that is not relevant to my performance problem
 
 always @(posedge ClkSW)
 begin
   for(int i=1; i<N; i++)
   begin
     case(SW)
       4'b0000: begin Scal<=ScalY; Scal<=ScalY; end
       4'b0001: begin Scal<=ScalY; Scal<=ScalY; end
       4'b0010: begin Scal<=ScalY; Scal<=ScalY; end
       4'b0011: begin Scal<=ScalY; Scal<=ScalY; end
//
       4'b0100: begin Scal<=ScalY; Scal<=ScalY; end
       4'b0101: begin Scal<=ScalY; Scal<=ScalY; end
       4'b0110: begin Scal<=ScalY; Scal<=ScalY; end
       4'b0111: begin Scal<=ScalY; Scal<=ScalY; end
//
       4'b1000: begin Scal<=ScalY; Scal<=ScalY; end
       4'b1001: begin Scal<=ScalY; Scal<=ScalY; end
       4'b1010: begin Scal<=ScalY; Scal<=ScalY; end
       4'b1011: begin Scal<=ScalY; Scal<=ScalY; end
//
       4'b1100: begin Scal<=ScalY; Scal<=ScalY; end
       4'b1101: begin Scal<=ScalY; Scal<=ScalY; end
       4'b1110: begin Scal<=ScalY; Scal<=ScalY; end
       4'b1111: begin Scal<=ScalY; Scal<=ScalY; end
     endcase
   end
   case(SW)
      0: Scal<=ScalY;
      1: Scal<=ScalY;
      2: Scal<=ScalY;
      3: Scal<=ScalY;
      4: Scal<=ScalY;
      5: Scal<=ScalY;
      6: Scal<=ScalY;
      7: Scal<=ScalY;
      8: Scal<=ScalY;
      9: Scal<=ScalY;
     10: Scal<=ScalY;
     11: Scal<=ScalY;
     12: Scal<=ScalY;
     13: Scal<=ScalY;
     14: Scal<=ScalY;
     15: Scal<=ScalY;
   endcase
 end
endmodule

6 Replies

  • Altera_Forum
    Honored Contributor

    Your test.v shows the beauty and strength of real high-level languages. Unfortunately, this can also be a weakness when you target 'resource-fixed' architectures like FPGAs.

    If you inspect the inferred altmult_add files and open the lowest-level .tdf file, you will see that almost all ports are unregistered. The top line is very long, but if you break it up (by inserting carriage returns) you can see all the assumptions made.

    I recompiled (10.0 SP1 Web) for N=2 and SHR=2, and failed timing by only 28 ps.

    If you select an inferred altmult_add in the navigation window and locate it in the Resource Property Editor, you can see that 'dataa[]' is unregistered but 'datab[]' is registered. In the TimeQuest failing-path reports I can see that the unregistered 'dataa[]' accounts for 718 ps of interconnect delay.
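
    A minimal sketch of how both multiplier inputs could be registered explicitly in the source, so that 'dataa[]' and 'datab[]' each have a register directly in front of the multiplier. The module name mult_add_regs, the port names and the 16-bit operand width are assumptions for illustration and are not taken from test.v; whether Quartus actually packs these registers into the DSP block still has to be checked in the Resource Property Editor:

    ```systemverilog
    // Hypothetical example: a*b + c*d with every operand registered
    // before the multipliers and the sum registered afterwards.
    module mult_add_regs (
        input  logic               clk,
        input  logic signed [15:0] a, b, c, d,
        output logic signed [32:0] sum
    );
        logic signed [15:0] a_r, b_r, c_r, d_r;

        always_ff @(posedge clk) begin
            // Stage 1: register all four operands
            a_r <= a;  b_r <= b;
            c_r <= c;  d_r <= d;
            // Stage 2: multiply-add on registered operands only
            sum <= a_r * b_r + c_r * d_r;
        end
    endmodule
    ```

    The result is available two clocks after the operands are applied.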
  • Altera_Forum
    Honored Contributor

    Dear Josyb,

    thank you for your kind answer.

    Would you, or somebody else, please explain why unregistered memory causes such a delay, how to solve it, and why I get unregistered memory here?

    I ask because, after you mentioned the multipliers, I decided to add intermediate pipeline registers (D1 and Data1 in the code attached to this post) and got some improvement in Fmax:

    
     N   SHR   Fmax 0C/85C (MHz)   Logic   DSP    Total synthesis time (min)
    24    14   363/336             33%     100%   56
    12    14   376/346             16%      50%   19
     6    14   405/383              8%      25%    9
    24     2   383/357             29%     100%   43
    12     2   407/380             15%      50%   23
     6     2   443/413              7%      25%   12
    

    However, I cannot work out for myself when I should apply such tricks, or what other tricks are available for improving Fmax!

    PS: in my design I am free to add more pipeline stages, but where? Please help me with a procedure for finding the right places. I can see some information in the "Property Editor", but I cannot interpret it well enough to make the correct decision. Please help me!!!

    Thank you in advance!

    Sincerely,

    Ilghiz

    
    module test(Clk, In, ClkSW, SW, Scal);
    parameter N=6; // can be 2, 4, 6, ..., but I need 18 and dreaming about 24
    parameter SHR=14; // can be 2, 3, 4, 5, ..., but I need 12-20
    input Clk, ClkSW;
    input signed  In;
    input  SW;
    reg signed  Scal;
    output      Scal;
    // Memory ////////////////////////////
    reg signed  D, Data;
    reg signed  D1, Data1; // new pipeline registers
    reg signed  Mul;
    reg signed  Sum, Sum2;
    reg signed  ScalX;
    reg signed  ScalY;
    reg InDataCounter;
    reg signed  ScalY1, ScalY2; // new intermediate
    reg signed  ScalY3, ScalY4; // registers for simple output
    // Reading data from channels - the key block that I cannot clock at 400MHz for N=16 or 24
     always @(posedge Clk)
     begin
       for(int i=0; i<2; i++)
         for(int j=0; j<4; j++)
           D<=Data;
       for(int j=0; j<4; j++)
         Data<=In;
       for(int i=0; i<N-1; i++)
         for(int j=0; j<4; j++)
           Data<=Data;
       InDataCounter<=~InDataCounter;
    //
       for(int i=0; i<N; i++)
         for(int j=0; j<4; j++)
           Data1<=Data; // new pipeline registers that helps for small N
       for(int i=0; i<2; i++)
         for(int j=0; j<4; j++)
           D1<=D; // new pipeline registers that helps for small N
    //
       for(int i=0; i<N; i+=2)
         for(int j=0; j<4; j++)
           for(int k=0; k<4; k++)
           begin
             Mul  <=D1*Data1;
             Mul<=D1*Data1;
           end
       for(int i=0; i<N; i+=2)
         for(int j=0; j<16; j++)
         begin
           Sum<=Mul+Mul;
           Sum2<=Sum;
           ScalX<=ScalX-(ScalX>>>SHR);
           ScalX<=Sum2+ScalX;
           ScalY<=ScalX>>>(16+SHR);
           ScalY<=ScalX>>>(16+SHR);
         end
     end
     // Output - different from previous one, just to save space...
     
     always @(posedge ClkSW)
     begin
       for(int i=0; i<N; i++)
         for(int j=0; j<16; j+=2)
           ScalY1<=ScalY+j];
       for(int i=0; i<N; i++)
         for(int j=0; j<8; j+=2)
           ScalY2<=ScalY1+j];
       for(int i=0; i<N; i++)
         for(int j=0; j<4; j+=2)
           ScalY3<=ScalY2+j];
       for(int i=0; i<N; i++)
         ScalY4<=ScalY3];
       Scal<=ScalY4];
     end
    endmodule
    
  • Altera_Forum
    Honored Contributor

    Ilghiz,

    Increasing Fmax further probably means messing up your nice code ...

    To make full use of the registers inside the altmult_add blocks that synthesis currently infers from your source code, you actually have to use the MegaWizard to define that block to your wishes, in this case for speed, by enabling all pipeline registers inside the DSP block itself. That's the easy part; the hard work is then instantiating this block in your code. Unfortunately I know very little Verilog, let alone SystemVerilog, so I can't help you much here.

    Later on you can replace the additions with calls to lpm_add_sub (with pipelining), or define your own pipelined adder block. (I also noticed failing paths in the N=18, SHR=14 compilation due to long adder chains.)
  • Altera_Forum
    Honored Contributor

    Josyb,

    thank you for your kind suggestion. I made a small improvement with the MegaWizard and reached 400MHz at 0C for N=24 and SHR=14; however, the results are still very unstable:

    
     N   SHR   Fmax 0C/85C (MHz)   Logic   DSP    Total synthesis time (min)
    24    14   401/369             53%     100%   115
    12    14   362/346             26%      50%    55
     6    14   451/418             13%      25%    20
    

    So the behavior is very strange: sometimes it is fast, sometimes not, and the synthesis time is impressive, almost 2 hours on a modern quad-core i7. Because of this instability I will probably switch to the SL340 and demux my global clock; otherwise I will have to keep fighting the unstable results of this fitter.

    On the plus side, with the MegaWizard I was able to write nice code that again fits in <100 lines :), which I am publishing below.

    PS and off-topic, to the Altera Quartus developers: if you are interested in improving the Quartus fitter using GPUs or massively parallel platforms, or in applying better mathematics in the fitter, do not hesitate to ask for our help.

    Sincerely,

    Ilghiz

    --

    Elegant Mathematics Ltd.

    
    module TestOne(Clk, A1, A2, B1, B2, SW, Res);
    parameter SHR=6; // can be 2, 3, 4, 5, ..., but I need 12-20
    input Clk, SW;
    input signed  A1, A2, B1, B2;
    output reg signed  Res;
    reg signed  P1, P2, Q1, Q2;
    reg signed  Mul1, Mul2;
    reg signed  Sum, Sum2;
    reg signed  ScalX1, ScalX2, ScalX3, ScalX4;
    // you need to create an altmult_add module with the MegaWizard and name it "my_mmadd"
    my_mmadd my_mmadd_module(Clk, P1, Q1, P2, Q2, Sum);
     always @(posedge Clk)
     begin
       P1<=A1;      P2<=A2;
       Q1<=B1;      Q2<=B2;
    // Mul1<=P1*Q1; Mul2<=P2*Q2;
    // Sum<=Mul1+Mul2;
       Sum2<=Sum;
       ScalX2<=ScalX1+(ScalX1>>>SHR);
       ScalX4<=ScalX3+Sum2;
       ScalX1<=ScalX4;
       ScalX3<=ScalX2;
       Res<=(SW)?ScalX1:ScalX3;
     end
    endmodule
    module test(Clk, In, ClkSW, SW, Scal);
    parameter N=24; // can be 2, 4, 6, ..., but I need 18
    input Clk, ClkSW;
    input signed  In;
    input  SW;
    output reg signed  Scal;
    // Memory
    reg signed  D, Data;
    reg InDataCounter, SW0;
    wire signed  ScalY;
    reg signed  ScalY1, ScalY2;
    reg signed  ScalY3, ScalY4;
    // Generating modules
     generate
     genvar i, j, k;
     for(i=0; i<N; i+=2)
     begin : aaa
       for(j=0; j<4; j++)
       begin : bbb
         for(k=0; k<4; k++)
         begin : ccc
           TestOne TestOne_Module(Clk, D, D, Data, Data, SW0, ScalY);
         end
       end
     end
     endgenerate
    // Reading Data
     
     always @(posedge Clk)
     begin
       for(int i=0; i<2; i++)
         for(int j=0; j<4; j++)
           D<=Data;
       for(int j=0; j<4; j++)
         Data<=In;
       for(int i=0; i<N-1; i++)
         for(int j=0; j<4; j++)
           Data<=Data;
       InDataCounter<=~InDataCounter;
       SW0<=SW^InDataCounter;
     end
     // Output
     
     always @(posedge ClkSW)
     begin
       for(int i=0; i<N/2; i++)
         for(int j=0; j<16; j+=2)
           ScalY1<=ScalY+j];
       for(int i=0; i<N/2; i++)
         for(int j=0; j<8; j+=2)
           ScalY2<=ScalY1+j];
       for(int i=0; i<N/2; i++)
         for(int j=0; j<4; j+=2)
           ScalY3<=ScalY2+j];
       for(int i=0; i<N/2; i++)
         ScalY4<=ScalY3];
       Scal<=ScalY4];
     end
    endmodule
    
  • Altera_Forum
    Honored Contributor

    Ilghiz,

    I think there is room for one further improvement: the lines
      ScalX2<=ScalX1+(ScalX1>>>SHR);
      ScalX4<=ScalX3+Sum2;
    result in two reasonably large adders, which probably cause the 'unstable' Fmax result. It would be a good idea to write a module that calculates the sum over two clocks in a split manner: on the first clock you add the lower halves of the operands while pipelining the upper halves, and on the second clock edge you add the upper halves together with the carry-out of the first operation.

    After that you can probably save a few pipeline stages in your module, as you will end up with a few back-to-back registers with no logic between them.
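
    The split addition described above could be sketched roughly as follows. The module name split_adder, its ports and the default width W=56 are placeholders chosen for illustration; this is a plain wrap-around adder (the final carry-out is dropped), not a tuned replacement for lpm_add_sub:

    ```systemverilog
    // Hypothetical two-cycle adder: the low halves are added on the first
    // clock, the high halves plus the saved carry on the second.
    module split_adder #(
        parameter int W = 56            // total width, assumed even
    ) (
        input  logic         clk,
        input  logic [W-1:0] a, b,
        output logic [W-1:0] sum        // valid two clocks after a and b
    );
        localparam int H = W / 2;

        logic [H:0]   lo_sum;           // low-half sum, bit H is the carry-out
        logic [H-1:0] a_hi, b_hi;       // upper halves delayed by one clock

        always_ff @(posedge clk) begin
            // Clock 1: add the lower halves, pipeline the upper halves
            lo_sum <= {1'b0, a[H-1:0]} + {1'b0, b[H-1:0]};
            a_hi   <= a[W-1:H];
            b_hi   <= b[W-1:H];
            // Clock 2: add the upper halves with the carry from clock 1
            sum    <= {a_hi + b_hi + lo_sum[H], lo_sum[H-1:0]};
        end
    endmodule
    ```

    Each carry chain is now only W/2 bits long, at the cost of one extra clock of latency. That extra latency matters if the adder sits inside a feedback loop such as an accumulator.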
  • Altera_Forum
    Honored Contributor

    Josyb,

    thank you for your kind suggestions. You were right: these wide operations next to the DSP multipliers cause the large instability. I tried to pipeline them as you described, but that gave only a small improvement. However, when I demuxed my result (Sum) and computed ScalX<=ScalX+Sum2-(ScalX>>>SHR) at half frequency, everything was OK: I managed to reach 406MHz with N=24 (384 multipliers) and a very wide (56-bit) ScalX.

    Thank you for your kind advice!

    Sincerely,

    Ilgis
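
    For readers wondering what a half-frequency accumulator might look like, here is one possible reading, not necessarily the code used in this thread. Incoming samples are split into an even and an odd stream, and each stream has its own accumulator that is enabled only every second clock, so the wide add/subtract has two clock periods to settle. The module name half_rate_acc, the ports, the 33-bit sample width and the synchronous reset are all assumptions, and the fitter only benefits if these paths are also declared as two-cycle paths in the timing constraints (for example with set_multicycle_path in the .sdc file), which is not shown:

    ```systemverilog
    // Hypothetical half-rate leaky accumulator with demultiplexed input.
    module half_rate_acc #(
        parameter int W   = 56,   // accumulator width (assumed)
        parameter int SHR = 14    // leak shift, as in the thread
    ) (
        input  logic                clk, rst,
        input  logic signed [32:0]  sample,      // one new value per clock
        output logic signed [W-1:0] acc_even, acc_odd
    );
        logic               phase;               // 0 on even clocks, 1 on odd
        logic signed [32:0] s_even, s_odd;       // samples held for two clocks

        always_ff @(posedge clk) begin
            if (rst) begin
                phase    <= 1'b0;
                s_even   <= '0;  s_odd   <= '0;
                acc_even <= '0;  acc_odd <= '0;
            end else begin
                phase <= ~phase;
                if (!phase) begin
                    // Even clock: consume the previous even sample, capture a new one
                    acc_even <= acc_even + s_even - (acc_even >>> SHR);
                    s_even   <= sample;
                end else begin
                    // Odd clock: same for the odd stream
                    acc_odd  <= acc_odd + s_odd - (acc_odd >>> SHR);
                    s_odd    <= sample;
                end
            end
        end
    endmodule
    ```

    Every register feeding an accumulator changes only on that accumulator's own phase, which is what gives the adder two clocks instead of one.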