Forum Discussion

Altera_Forum
Honored Contributor
15 years ago

Please advise me how to improve the performance of my Verilog module

Hi,

While experimenting with the Terasic DE3 board, I ran into a problem that I cannot solve myself, so I kindly ask this forum for advice.

I have a small Verilog module that reads N words from N infinite vectors v_1,...,v_N, and I need to compute all possible s_{i,j,k} = v_i^T D P^k v_j, where P is the permutation matrix that shifts a vector down by one entry and D is the diagonal matrix with (d_1, d_2, ...) on its diagonal, with d_1 = 1, d_2 = (1-2^{-m}), d_3 = (1-2^{-m})^2, and so on. Hence I am implementing something similar to a stable IIR filter.

In my case I am trying to pipeline the input data arriving from each vector (InDataA, InDataB; for simplicity I take an example with N=2), compute all products, and store them in the outputs (ScalAA, ScalAB, ScalBA, ScalBB).
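The quantity being computed can be written as a recurrence, which is what makes the IIR comparison concrete. The time-domain notation below is an assumption layered on the post: x_i[t] denotes the sample of channel i at clock t (so that the newest sample gets the weight d_1 = 1), and m is taken to be the UpdateSpeed parameter of the module.

```latex
s_{i,j,k}[t] \;=\; \sum_{n \ge 0} \left(1 - 2^{-m}\right)^{n} x_i[t-n]\, x_j[t-n-k]
\quad\Longrightarrow\quad
s_{i,j,k}[t] \;=\; x_i[t]\, x_j[t-k] \;+\; \left(1 - 2^{-m}\right) s_{i,j,k}[t-1]
```

In hardware terms, each (i, j, k) triple needs one multiplication and one "leaky" accumulation s - (s >> m) per clock. The posted module registers the product and the decayed term separately, so its exact cycle-by-cycle timing differs from this idealized form.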

When I put this module into the standard Terasic DE3 environment, I run into two issues that I cannot resolve:

1. All my data are reg signed [13:0], so one multiplication fits into an 18x18-bit multiplier. I am doing massively parallel multiplications and hope to use the so-called "Four Multiplier Adder Mode" described in the Stratix III Device Handbook, Volume 1, but I cannot understand how to implement it. I urgently need it; otherwise I will run out of resources on my DE3 board.

2. The timing of this module is not very good: I achieve only 260-310 MHz, whereas in "Four Multiplier Adder Mode" I should achieve 600 MHz. I also need this because I expect input data rates of 400, 500, and probably 600 MHz in my design.

And now here is my module. Please advise me how to:

1. switch four multiplier adder mode on,

2. improve the performance.

Thank you!

Ilghiz


module DATA_Aq(InDataClkA, InDataA, InDataB, OutData);
parameter NBUF=16; // the maximum possible shift in the design, I should be able to run it with:
// 1) A,B,...H=8 channels, and NBUF=6, or
// 2) A,B,C,D=4 channels, and NBUF=16, so both designs
// need 384 or 256 18x18 multipliers in Four Multiplier Adder Mode
parameter UpdateSpeed=12;
input InDataClkA;
input  InDataA;
input  InDataB;
reg  OutData; // this is an artificial output that prevents Quartus from optimizing away the main part of the computation
output  OutData;
// Memory Declaration
reg signed  DataA;
reg signed  DataB;
reg signed   ScalAA, Scal1AA, Scal2AA;
reg signed   ScalAB, Scal1AB, Scal2AB;
reg signed   ScalBA, Scal1BA, Scal2BA;
reg signed   ScalBB, Scal1BB, Scal2BB;
// reg signed  Scal3AA, Scal4AA, Scal5AA;
reg signed  Tmp;
reg  InDataCounter;
// Initialization
initial
begin
  integer i;
  InDataCounter=0;
  for(i=0; i<NBUF; i=i+1)
  begin
    ScalAA=0; Scal1AA=0; Scal2AA=0;
    ScalAB=0; Scal1AB=0; Scal2AB=0;
    ScalBA=0; Scal1BA=0; Scal2BA=0;
    ScalBB=0; Scal1BB=0; Scal2BB=0;
    DataA=0;
    DataB=0;
  end
end
// Reading Data from Channels and Computation
 always @(posedge InDataClkA)
 begin
   integer i;
   InDataCounter<=InDataCounter+1;
   for(i=0; i<NBUF-1; i=i+1)
   begin
     DataA<=DataA;
     DataB<=DataB;
   end
   DataA<=InDataA;
   DataB<=InDataB;
   for(i=0; i<NBUF; i=i+1)
   begin
     Scal1AA<=InDataA*DataA;
     Scal1AB<=InDataA*DataB;
     Scal1BA<=InDataB*DataA;
     Scal1BB<=InDataB*DataB;
     
     Scal2AA<=ScalAA-(ScalAA>>UpdateSpeed);
     Scal2AB<=ScalAB-(ScalAB>>UpdateSpeed);
     Scal2BA<=ScalBA-(ScalBA>>UpdateSpeed);
     Scal2BB<=ScalBB-(ScalBB>>UpdateSpeed);
     
     ScalAA<=Scal1AA+Scal2AA;
     ScalAB<=Scal1AB+Scal2AB;
     ScalBA<=Scal1BA+Scal2BA;
     ScalBB<=Scal1BB+Scal2BB;
   end
 end
 
// This is an artificial always block to simulate that I am using the Scal?? data
 always @(InDataCounter)
 begin
   case(InDataCounter)
     0: Tmp=ScalAA;
     1: Tmp=ScalAB;
     2: Tmp=ScalBA;
     3: Tmp=ScalBB;
   endcase
   OutData=Tmp+Tmp+Tmp+Tmp+Tmp;
end
endmodule

6 Replies

  • Altera_Forum

    Hi,

    Quartus will infer the four multiplier adder mode from your Verilog if it follows a suitable template.

    Check the Quartus manual, section 6-9, for details and examples.

    http://www.altera.com/literature/hb/qts/qts_qii51007.pdf

    That said, I don't see how it will help you with your resource problem: a Stratix III DSP block can implement 4 18x18 multipliers, whether it's 4 independent multipliers or 4 multiplier-adders.

    As for fMax, you need to take a look at your critical paths and see where the largest delay is. In such a case, adding some extra register stages might help.

    PS: your "artificial block" looks like something that will be synthesized to latches.
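To illustrate the latch remark: a combinational always block whose case statement does not cover every selector value (and whose sensitivity list omits the data signals) has to hold the old value of Tmp, which synthesis tools typically implement as latches. A minimal sketch of a clocked alternative follows. The module name, port names, and widths are hypothetical, because the bit widths in the posted code were lost.

```verilog
// Hypothetical sketch: registered output mux with a default branch,
// so every selector value assigns the output and no latch is inferred.
module out_mux_sketch #(
  parameter W = 32                    // assumed accumulator width
) (
  input  wire                clk,
  input  wire        [1:0]   sel,     // stands in for InDataCounter
  input  wire signed [W-1:0] s0, s1, s2, s3,
  output reg  signed [W-1:0] out
);
  always @(posedge clk) begin
    case (sel)
      2'd0:    out <= s0;
      2'd1:    out <= s1;
      2'd2:    out <= s2;
      default: out <= s3;             // covers 2'd3 and any X/Z in simulation
    endcase
  end
endmodule
```

Because the assignment is clocked, the mux output is a flip-flop rather than a latch, and it also adds a pipeline stage on the output side.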
  • Altera_Forum

    Hi,

    thank you for your kind response. Let me comment on your answer and hopefully pin down the main problem that made me ask on this forum.

    --- Quote Start ---

    Quartus will infer the four multiplier adder mode from your Verilog, if it follows a suitable template.

    Check the Quartus manual, section 6-9, for details and examples.

    http://www.altera.com/literature/hb/qts/qts_qii51007.pdf

    --- Quote End ---

    Yes, that is one reason why I am asking on this forum.

    --- Quote Start ---

    That said, I don't see how it will help you with your resource problem: a Stratix III DSP block can implement 4 18x18 multipliers, whether it's 4 independent multipliers or 4 multiplier-adders.

    --- Quote End ---

    No! Actually, in the Altera document

    http://www.altera.com/literature/hb/stx3/stx3_siii51005.pdf

    on page 5-2, Table 5-1 says that for the SL150 I can get 384 18x18 multipliers if I use four multiplier adder mode, whereas with normal (independent) multipliers I get only 192 18x18 multipliers. I need more performance!!!

    On the other hand, in

    http://www.altera.com/literature/hb/stx3/stx3_siii5v2.pdf

    on page 1-17, in the fifth line of Table 1-21, I should achieve 600 MHz in 18x18 mode and 440 MHz in double mode (I have the C2 speed grade). I need more speed (FMax) for my project!!!

    Hence I am trying to find out how to organize my computation so as to achieve this performance.

    --- Quote Start ---

    PS: your "artificial block" looks like something that will be synthesized to latches.

    --- Quote End ---

    Please do not worry about it! In reality this "artificial block" contains a completely different algorithm, but it is about 2000 lines long, and those lines would distract you from my main question.

    Sincerely,

    Ilghiz
  • Altera_Forum

    I see, I misinterpreted the doc. :)

    Anyway, taking a second look at your code, if I am reading this right, it can't be mapped to the 4-M-A mode.

    The 4-M-Adder performs the operation "output = (a0*b0) + (a1*b1) + (a2*b2) + (a3*b3)" with 3 levels of registers (for fMAX)

    ra0 <= a0; ... rb3 <= b3;

    rma01 <= (ra0 * rb0) + (ra1 * rb1)

    rma23 <= (ra2 * rb2) + (ra3 * rb3)

    rma <= rma01 + rma23;

    output <= round_saturate(rma);

    Or you can use it in 4-M-Accumulator mode

    ra0 <= a0; ... rb3 <= b3;

    rma01 <= (ra0 * rb0) + (ra1 * rb1)

    rma23 <= (ra2 * rb2) + (ra3 * rb3)

    rma <= rma01 + rma23 + rma;

    output <= round_saturate(rma);

    You need to, somehow, convert your algorithm into one of these patterns.

    But looking at it, I can't figure out a way.
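The adder pattern described above could be written in Verilog roughly as follows. This is only a sketch: the module and port names are made up for illustration, the 14-bit operand width is taken from the original post, and the rounding/saturation stage is omitted. Whether Quartus actually packs this into one DSP block in four multiplier adder mode depends on the tool version and synthesis settings, so it is worth checking the Fitter resource reports after compilation.

```verilog
// Hypothetical sketch of result = a0*b0 + a1*b1 + a2*b2 + a3*b3
// with three register levels: inputs, pair sums, final sum.
module mult_add4_sketch #(
  parameter W = 14                       // operand width from the post
) (
  input  wire                  clk,
  input  wire signed [W-1:0]   a0, a1, a2, a3,
  input  wire signed [W-1:0]   b0, b1, b2, b3,
  output reg  signed [2*W+1:0] result    // two guard bits for the two additions
);
  reg signed [W-1:0] ra0, ra1, ra2, ra3;
  reg signed [W-1:0] rb0, rb1, rb2, rb3;
  reg signed [2*W:0] rma01, rma23;

  always @(posedge clk) begin
    // Level 1: input registers
    ra0 <= a0; ra1 <= a1; ra2 <= a2; ra3 <= a3;
    rb0 <= b0; rb1 <= b1; rb2 <= b2; rb3 <= b3;
    // Level 2: two products summed per register
    rma01 <= (ra0 * rb0) + (ra1 * rb1);
    rma23 <= (ra2 * rb2) + (ra3 * rb3);
    // Level 3: final adder
    result <= rma01 + rma23;
  end
endmodule
```

All operands are declared signed, so the products and sums are evaluated as signed arithmetic at the width of the assignment target.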
  • Altera_Forum

    Dear Rbuhalho,

    Yes, you are right, thank you! It seems that my first question, about the usage of multipliers, is solved. I was able to convert the algorithm so that it computes A*B+C*D, and I immediately see that the usage of multipliers is halved!

    However, my performance is still far from the possible peak: I am currently achieving only 330 MHz instead of 440 MHz (is it possible to reach 600 MHz here on my hardware?).

    It seems that I need to tune my settings in Quartus or change something more in the algorithm.

    Here are the modified code and my Quartus settings:

    
    module GenScal(A1, A2, B1, B2, C1, C2, D1, D2,
                   P1, P2, Q1, Q2, R1, R2, S1, S2,
                   AP, AQ, AR, AS,
                   BP, BQ, BR, BS,
                   CP, CQ, CR, CS,
                   DP, DQ, DR, DS, Clk);
    parameter UpdateSpeed=12;
    input Clk;
    input  A1, A2, B1, B2, C1, C2, D1, D2;
    input  P1, P2, Q1, Q2, R1, R2, S1, S2;
    output  AP, AQ, AR, AS;
    output  BP, BQ, BR, BS;
    output  CP, CQ, CR, CS;
    output  DP, DQ, DR, DS;
    // Memory
    reg  ScalAP, ScalAQ, ScalAR, ScalAS;
    reg  ScalBP, ScalBQ, ScalBR, ScalBS;
    reg  ScalCP, ScalCQ, ScalCR, ScalCS;
    reg  ScalDP, ScalDQ, ScalDR, ScalDS;
    reg  AddAP, AddAQ, AddAR, AddAS;
    reg  AddBP, AddBQ, AddBR, AddBS;
    reg  AddCP, AddCQ, AddCR, AddCS;
    reg  AddDP, AddDQ, AddDR, AddDS;
    reg  MulAP1, MulAQ1, MulAR1, MulAS1;
    reg  MulBP1, MulBQ1, MulBR1, MulBS1;
    reg  MulCP1, MulCQ1, MulCR1, MulCS1;
    reg  MulDP1, MulDQ1, MulDR1, MulDS1;
    reg  MulAP2, MulAQ2, MulAR2, MulAS2;
    reg  MulBP2, MulBQ2, MulBR2, MulBS2;
    reg  MulCP2, MulCQ2, MulCR2, MulCS2;
    reg  MulDP2, MulDQ2, MulDR2, MulDS2;
    reg  SumAP, SumAQ, SumAR, SumAS;
    reg  SumBP, SumBQ, SumBR, SumBS;
    reg  SumCP, SumCQ, SumCR, SumCS;
    reg  SumDP, SumDQ, SumDR, SumDS;
    assign AP=ScalAP;
    assign AQ=ScalAQ;
    assign AR=ScalAR;
    assign AS=ScalAS;
                    
    assign BP=ScalBP;
    assign BQ=ScalBQ;
    assign BR=ScalBR;
    assign BS=ScalBS;
                    
    assign CP=ScalCP;
    assign CQ=ScalCQ;
    assign CR=ScalCR;
    assign CS=ScalCS;
                    
    assign DP=ScalDP;
    assign DQ=ScalDQ;
    assign DR=ScalDR;
    assign DS=ScalDS;
    // Initialization
     initial
     begin
    //
       MulAP1=0; MulAQ1=0; MulAR1=0; MulAS1=0;
       MulBP1=0; MulBQ1=0; MulBR1=0; MulBS1=0;
       MulCP1=0; MulCQ1=0; MulCR1=0; MulCS1=0;
       MulDP1=0; MulDQ1=0; MulDR1=0; MulDS1=0;
    //
       MulAP2=0; MulAQ2=0; MulAR2=0; MulAS2=0;
       MulBP2=0; MulBQ2=0; MulBR2=0; MulBS2=0;
       MulCP2=0; MulCQ2=0; MulCR2=0; MulCS2=0;
       MulDP2=0; MulDQ2=0; MulDR2=0; MulDS2=0;
    //
       SumAP=0; SumAQ=0; SumAR=0; SumAS=0;
       SumBP=0; SumBQ=0; SumBR=0; SumBS=0;
       SumCP=0; SumCQ=0; SumCR=0; SumCS=0;
       SumDP=0; SumDQ=0; SumDR=0; SumDS=0;
    //
       AddAP=0; AddAQ=0; AddAR=0; AddAS=0;
       AddBP=0; AddBQ=0; AddBR=0; AddBS=0;
       AddCP=0; AddCQ=0; AddCR=0; AddCS=0;
       AddDP=0; AddDQ=0; AddDR=0; AddDS=0;
    //
       ScalAP=0; ScalAQ=0; ScalAR=0; ScalAS=0;
       ScalBP=0; ScalBQ=0; ScalBR=0; ScalBS=0;
       ScalCP=0; ScalCQ=0; ScalCR=0; ScalCS=0;
       ScalDP=0; ScalDQ=0; ScalDR=0; ScalDS=0;
     end
    // Main Computations
     always @(posedge Clk)
     begin
    // 1*1
       MulAP1<=A1*P1; MulAQ1<=A1*Q1; MulAR1<=A1*R1; MulAS1<=A1*S1;
       MulBP1<=B1*P1; MulBQ1<=B1*Q1; MulBR1<=B1*R1; MulBS1<=B1*S1;
       MulCP1<=C1*P1; MulCQ1<=C1*Q1; MulCR1<=C1*R1; MulCS1<=C1*S1;
       MulDP1<=D1*P1; MulDQ1<=D1*Q1; MulDR1<=D1*R1; MulDS1<=D1*S1;
    // 2*2
       MulAP2<=A2*P2; MulAQ2<=A2*Q2; MulAR2<=A2*R2; MulAS2<=A2*S2;
       MulBP2<=B2*P2; MulBQ2<=B2*Q2; MulBR2<=B2*R2; MulBS2<=B2*S2;
       MulCP2<=C2*P2; MulCQ2<=C2*Q2; MulCR2<=C2*R2; MulCS2<=C2*S2;
       MulDP2<=D2*P2; MulDQ2<=D2*Q2; MulDR2<=D2*R2; MulDS2<=D2*S2;
    // Sum
       SumAP<=MulAP1+MulAP2; SumAQ<=MulAQ1+MulAQ2; SumAR<=MulAR1+MulAR2; SumAS<=MulAS1+MulAS2;
       SumBP<=MulBP1+MulBP2; SumBQ<=MulBQ1+MulBQ2; SumBR<=MulBR1+MulBR2; SumBS<=MulBS1+MulBS2;
       SumCP<=MulCP1+MulCP2; SumCQ<=MulCQ1+MulCQ2; SumCR<=MulCR1+MulCR2; SumCS<=MulCS1+MulCS2;
       SumDP<=MulDP1+MulDP2; SumDQ<=MulDQ1+MulDQ2; SumDR<=MulDR1+MulDR2; SumDS<=MulDS1+MulDS2;
    // Scal: splitting A+B-C into a two-stage pipeline does not improve the performance...
       ScalAP<=ScalAP+SumAP-AP; ScalAQ<=ScalAQ+SumAQ-AQ; ScalAR<=ScalAR+SumAR-AR; ScalAS<=ScalAS+SumAS-AS;
       ScalBP<=ScalBP+SumBP-BP; ScalBQ<=ScalBQ+SumBQ-BQ; ScalBR<=ScalBR+SumBR-BR; ScalBS<=ScalBS+SumBS-BS;
       ScalCP<=ScalCP+SumCP-CP; ScalCQ<=ScalCQ+SumCQ-CQ; ScalCR<=ScalCR+SumCR-CR; ScalCS<=ScalCS+SumCS-CS;
       ScalDP<=ScalDP+SumDP-DP; ScalDQ<=ScalDQ+SumDQ-DQ; ScalDR<=ScalDR+SumDR-DR; ScalDS<=ScalDS+SumDS-DS;
     end
    endmodule
    

    
    Option	Setting	Default Value
    Device	EP3SL150F1152C2	
    Top-level entity name	my_t2_DE3	my_t2_DE3
    Family name	Stratix III	Stratix II
    Optimization Technique	Speed	Balanced
    Use Generated Physical Constraints File	Off	
    Use smart compilation	Off	Off
    Enable parallel Assembler and TimeQuest Timing Analyzer during compilation	On	On
    Enable compact report table	Off	Off
    Restructure Multiplexers	Auto	Auto
    Create Debugging Nodes for IP Cores	Off	Off
    Preserve fewer node names	On	On
    Disable OpenCore Plus hardware evaluation	Off	Off
    Verilog Version	Verilog_2001	Verilog_2001
    VHDL Version	VHDL_1993	VHDL_1993
    State Machine Processing	Auto	Auto
    Safe State Machine	Off	Off
    Extract Verilog State Machines	On	On
    Extract VHDL State Machines	On	On
    Ignore Verilog initial constructs	Off	Off
    Iteration limit for constant Verilog loops	5000	5000
    Iteration limit for non-constant Verilog loops	250	250
    Add Pass-Through Logic to Inferred RAMs	On	On
    Parallel Synthesis	Off	Off
    DSP Block Balancing	Auto	Auto
    NOT Gate Push-Back	On	On
    Power-Up Don't Care	On	On
    Remove Redundant Logic Cells	Off	Off
    Remove Duplicate Registers	On	On
    Ignore CARRY Buffers	Off	Off
    Ignore CASCADE Buffers	Off	Off
    Ignore GLOBAL Buffers	Off	Off
    Ignore ROW GLOBAL Buffers	Off	Off
    Ignore LCELL Buffers	Off	Off
    Ignore SOFT Buffers	On	On
    Limit AHDL Integers to 32 Bits	Off	Off
    Carry Chain Length	70	70
    Auto Carry Chains	On	On
    Auto Open-Drain Pins	On	On
    Perform WYSIWYG Primitive Resynthesis	Off	Off
    Auto ROM Replacement	On	On
    Auto RAM Replacement	On	On
    Auto DSP Block Replacement	On	On
    Auto Shift Register Replacement	Auto	Auto
    Auto Clock Enable Replacement	On	On
    Strict RAM Replacement	Off	Off
    Allow Synchronous Control Signals	On	On
    Force Use of Synchronous Clear Signals	Off	Off
    Auto RAM Block Balancing	On	On
    Auto RAM to Logic Cell Conversion	Off	Off
    Auto Resource Sharing	Off	Off
    Allow Any RAM Size For Recognition	Off	Off
    Allow Any ROM Size For Recognition	Off	Off
    Allow Any Shift Register Size For Recognition	Off	Off
    Use LogicLock Constraints during Resource Balancing	On	On
    Ignore translate_off and synthesis_off directives	Off	Off
    Timing-Driven Synthesis	Off	Off
    Show Parameter Settings Tables in Synthesis Report	On	On
    Ignore Maximum Fan-Out Assignments	Off	Off
    Synchronization Register Chain Length	2	2
    PowerPlay Power Optimization	Normal compilation	Normal compilation
    HDL message level	Level2	Level2
    Suppress Register Optimization Related Messages	Off	Off
    Number of Removed Registers Reported in Synthesis Report	5000	5000
    Number of Inverted Registers Reported in Synthesis Report	100	100
    Clock MUX Protection	On	On
    Auto Gated Clock Conversion	Off	Off
    Block Design Naming	Auto	Auto
    SDC constraint protection	Off	Off
    Synthesis Effort	Auto	Auto
    Shift Register Replacement - Allow Asynchronous Clear Signal	On	On
    Analysis & Synthesis Message Level	Medium	Medium
    Disable Register Merging Across Hierarchies	Auto	Auto
    Resource Aware Inference For Block RAM	On	On
    

    Please suggest what I can still improve in the settings and/or in the code to get better performance!

    Sincerely,

    Ilghiz
  • Altera_Forum

    Hi,

    Dumb suggestion one:

    Try changing the Optimization Technique from Balanced to Speed.

    Dumb suggestion two:

    Add two more register levels: register the inputs and the outputs.

    How are you obtaining that 330MHz? Are you synthesizing your entire design or just that module?

    Which is the critical path?
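One possible reading of "register the inputs and the outputs" is sketched below for a single lane of the design. Everything here is illustrative: the module and port names are hypothetical, the 14-bit width and the UpdateSpeed parameter come from the earlier posts, and the accumulator width ACCW is an assumed worst-case size, not a value from the posted code.

```verilog
// Hypothetical single-lane sketch: extra registers on the inputs and on the
// output, around a two-multiplier sum and a leaky accumulator.
module mac_lane_registered_sketch #(
  parameter W           = 14,
  parameter UpdateSpeed = 12,
  parameter ACCW        = 2*W + 1 + UpdateSpeed  // assumed worst-case width
) (
  input  wire                   clk,
  input  wire signed [W-1:0]    a1, p1, a2, p2,
  output reg  signed [ACCW-1:0] ap
);
  reg signed [W-1:0]    ra1, rp1, ra2, rp2;  // input registers
  reg signed [2*W:0]    sum;                 // a1*p1 + a2*p2
  reg signed [ACCW-1:0] scal;                // accumulator state

  initial begin
    scal = 0;
    ap   = 0;
  end

  always @(posedge clk) begin
    // Extra level 1: capture the inputs next to the multipliers
    ra1 <= a1; rp1 <= p1; ra2 <= a2; rp2 <= p2;
    // Multiply-add stage
    sum <= (ra1 * rp1) + (ra2 * rp2);
    // Leaky accumulation: scal*(1 - 2^-UpdateSpeed) + sum
    scal <= scal + sum - (scal >>> UpdateSpeed);
    // Extra level 2: output register decouples downstream routing
    ap <= scal;
  end
endmodule
```

The added registers do not shorten the feedback path through scal itself. They only keep the routing delay from the pins, or from a distant block, out of the multiplier and accumulator stages.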
  • Altera_Forum

    Hi,

    --- Quote Start ---

    Try changing the Optimization Technique from Balanced to Speed.

    --- Quote End ---

    I have several (not all) optimizations set to speed.

    If I synthesize one instance of this module in the complete project, I get FMax=330MHz. With 10 instances, FMax is only 310MHz :(

    --- Quote Start ---

    Add two more register levels: register the inputs and the outputs.

    --- Quote End ---

    Please help me with a short example of this; I did not get the idea!

    --- Quote Start ---

    How are you obtaining that 330MHz? Are you synthesizing your entire design or just that module?

    Which is the critical path?

    --- Quote End ---

    Actually, I measure FMax for the entire design, but it is not too complicated: right now the inputs come from HSTC LVDS data, and the output is pipelined through the "artificial part" to GPIO.

    I am a newbie in FPGA design; I just moved into this field after 20 years of experience in massively parallel numerical math. I tried to use "set_false_path", but it seems that I did not set it properly, and I cannot understand how to find my critical path.

    PS: I can publish the entire project. It is just 600 lines: 100 lines are already here, 300 lines come straight from Terasic, and the rest just binds different inputs and outputs to each other.

    Sincerely,

    Ilghiz
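For locating the critical path, a common starting point is to constrain the clock to the target rate and ask TimeQuest for the worst setup paths. The fragment below is a sketch under assumptions: the port name clk_in and the clock name data_clk are placeholders for whatever the top level really uses, and the 2.5 ns period corresponds to the 400 MHz data rate mentioned in the first post.

```tcl
# --- SDC constraints (placeholder names; adapt to the real top level) ---
# 2.5 ns period = 400 MHz target
create_clock -name data_clk -period 2.5 [get_ports {clk_in}]
derive_pll_clocks
derive_clock_uncertainty

# --- TimeQuest Tcl console, after the timing netlist and SDC are loaded ---
# List the ten worst setup paths with full routing and cell detail
report_timing -setup -npaths 10 -detail full_path -panel_name {Worst setup paths}
```

The source and destination registers of the first reported path show which stage limits FMax, for example a multiplier output, the accumulator feedback, or the output mux.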