Please, advise me how to improve the performance of my verilog module

Question

Hi,

While working with a Terasic DE3 board, I have run into a problem that I cannot solve myself, and I kindly ask this forum for advice.

I have a small Verilog module that reads N words of N infinite vectors v_1,...,v_N, and I need to compute all possible s_{i,j,k}=v_i^T D P^k v_j, where P is the permutation matrix that shifts a vector down by one entry, and D is the diagonal matrix with (d_1, d_2, ...) on the diagonal such that d_1=1, d_2=(1-2^{-m}), d_3=(1-2^{-m})^2, ...; hence I am implementing something similar to stable IIR filters.

In my case I am trying to pipeline the input data arriving from each vector (InDataA, InDataB; for simplicity I take an example with N=2), compute all products, and store them in the result outputs (ScalAA, ScalAB, ScalBA, ScalBB).

When I put this module into the standard Terasic DE3 environment, I get two issues that I cannot resolve:

1. All my data are reg signed [13:0], so one multiplication fits into an 18x18-bit multiplier. I am doing massively parallel multiplications and hope to use the so-called "Four Multiplier Adder Mode" as described in the Stratix III Device Handbook, Volume 1, but I cannot understand how to implement it. I urgently need it, otherwise I will run out of resources on my DE3 board.

2. The timing of this module is not very good: I achieve only 260-310 MHz, whereas in "Four Multiplier Adder Mode" I should achieve 600 MHz. I also need this because in my design I expect input data rates of 400, 500 and probably 600 MHz.

My module is given below. Please advise me how to:

1. switch Four Multiplier Adder Mode on, and
2. improve the performance.

Thank you!

Ilghiz

module DATA_Aq(InDataClkA, InDataA, InDataB, OutData);
parameter NBUF=16; // the maximum possible shift in the design, I should be able to run it with:
// 1) A,B,...H=8 channels, and NBUF=6, or
// 2) A,B,C,D=4 channels, and NBUF=16, so both designs
// need 384 or 256 18x18 multipliers in Four Multiplier Adder Mode
parameter UpdateSpeed=12;
input InDataClkA;
input  InDataA;
input  InDataB;
reg  OutData; // this is some artificial output that prevents Quartus to optimize out the main part of computations
output  OutData;
// Memory Declaration
reg signed  DataA;
reg signed  DataB;
reg signed   ScalAA, Scal1AA, Scal2AA;
reg signed   ScalAB, Scal1AB, Scal2AB;
reg signed   ScalBA, Scal1BA, Scal2BA;
reg signed   ScalBB, Scal1BB, Scal2BB;
// reg signed  Scal3AA, Scal4AA, Scal5AA;
reg signed  Tmp;
reg  InDataCounter;
// Initialization
initial
begin
  integer i;
  InDataCounter=0;
  for(i=0; i<NBUF; i=i+1)
  begin
    ScalAA=0; Scal1AA=0; Scal2AA=0;
    ScalAB=0; Scal1AB=0; Scal2AB=0;
    ScalBA=0; Scal1BA=0; Scal2BA=0;
    ScalBB=0; Scal1BB=0; Scal2BB=0;
    DataA=0;
    DataB=0;
  end
end
// Reading Data from Channels and Computation
 always @(posedge InDataClkA)
 begin
   integer i;
   InDataCounter<=InDataCounter+1;
   for(i=0; i<NBUF-1; i=i+1)
   begin
     DataA<=DataA;
     DataB<=DataB;
   end
   DataA<=InDataA;
   DataB<=InDataB;
   for(i=0; i<NBUF; i=i+1)
   begin
     Scal1AA<=InDataA*DataA;
     Scal1AB<=InDataA*DataB;
     Scal1BA<=InDataB*DataA;
     Scal1BB<=InDataB*DataB;

     Scal2AA<=ScalAA-(ScalAA>>UpdateSpeed);
     Scal2AB<=ScalAB-(ScalAB>>UpdateSpeed);
     Scal2BA<=ScalBA-(ScalBA>>UpdateSpeed);
     Scal2BB<=ScalBB-(ScalBB>>UpdateSpeed);

     ScalAA<=Scal1AA+Scal2AA;
     ScalAB<=Scal1AB+Scal2AB;
     ScalBA<=Scal1BA+Scal2BA;
     ScalBB<=Scal1BB+Scal2BB;
   end
 end
 
// This is artificial always block to simulate that I am using Scal?? data
 always @(InDataCounter)
 begin
   case(InDataCounter)
     0: Tmp=ScalAA];
     1: Tmp=ScalAB];
     2: Tmp=ScalBA];
     3: Tmp=ScalBB];
   endcase
   OutData=Tmp+Tmp+Tmp+Tmp+Tmp;
end
endmodule
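
Written per clock cycle, the computation described above appears to reduce to a leaky accumulator. A sketch of that reading, assuming m is the UpdateSpeed parameter, a[t] and b[t] denote the samples of two channels at cycle t (notation introduced here, not in the original post), and pipeline latency and the truncation in the right shift are ignored:

```latex
s_{k}[t] \;=\; \sum_{\tau \ge 0} \left(1-2^{-m}\right)^{\tau} a[t-\tau]\, b[t-\tau-k]
\quad\Longrightarrow\quad
s_{k}[t] \;=\; a[t]\, b[t-k] \;+\; \left(1-2^{-m}\right) s_{k}[t-1]
```

The weight for \tau = 0 is 1 and for \tau = 1 is (1-2^{-m}), matching d_1 and d_2 above; the factor (1-2^{-m}) corresponds to the Scal-(Scal>>UpdateSpeed) term in the module.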

altera_forum · Answer

Hi,

Quartus will infer the four multiplier adder mode from your Verilog if it follows a suitable template.

Check the Quartus manual, section 6-9, for details and examples.

http://www.altera.com/literature/hb/qts/qts_qii51007.pdf

That said, I don't see how it will help you with your resource problem: a Stratix III DSP block can implement 4 18x18 multipliers, whether it's 4 independent multipliers or 4 multiplier-adders.

As for fMax, you need to take a look at your critical paths and see where the largest delay is. In such a case, adding some extra register stages might help.

PS: your "artificial block" looks like something that will be synthesized to latches.
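
A minimal sketch of a latch-free version of such an output block, assuming a 2-bit select and 32-bit results (the module name OutMux, the parameter W, and the port Sel are hypothetical, not taken from the original design): the combinational case assigns Tmp on every path, and the result is registered.

```verilog
// Hypothetical stand-in for the "artificial" output block.
// W and the port names are assumptions, not taken from the original design.
module OutMux #(parameter W = 32) (
  input  wire                Clk,
  input  wire [1:0]          Sel,
  input  wire signed [W-1:0] ScalAA, ScalAB, ScalBA, ScalBB,
  output reg  signed [W-1:0] OutData
);
  reg signed [W-1:0] Tmp;

  // Combinational select: the default branch guarantees Tmp is always
  // assigned, so no latch is inferred.
  always @(*) begin
    case (Sel)
      2'd0:    Tmp = ScalAA;
      2'd1:    Tmp = ScalAB;
      2'd2:    Tmp = ScalBA;
      default: Tmp = ScalBB;
    endcase
  end

  // Registered output: the mux result is captured on the clock edge.
  always @(posedge Clk)
    OutData <= Tmp;
endmodule
```

The two differences from the posted block are the complete sensitivity list (always @(*)) and the default branch.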

altera_forum · Answer

Hi,

Thank you for your kind response. Let me comment on your answer and try to pin down the main problem that led me to ask on this forum.

--- Quote Start ---

Quartus will infer the four multiplier adder mode from your Verilog, if it follows a suitable template.

Check the Quartus manual, section 6-9, for details and examples.

http://www.altera.com/literature/hb/qts/qts_qii51007.pdf

--- Quote End ---

Yes, this is one reason why I am asking on this forum.

--- Quote Start ---

That said, I don't see how it will help you with your resource problem: a Stratix III DSP block can implement 4 18x18 multipliers, whether it's 4 independent multipliers or 4 multiplier-adders.

--- Quote End ---

No! Actually, in the Altera document

http://www.altera.com/literature/hb/stx3/stx3_siii51005.pdf

on page 5-2, Table 5-1 says that for the SL150, four multiplier adder mode gives 384 18x18 multipliers, whereas normal (independent) multipliers give only 192 18x18 multipliers. I need more performance!!!

On the other hand, in

http://www.altera.com/literature/hb/stx3/stx3_siii5v2.pdf

on page 1-17, the fifth row of Table 1-21 says I should achieve 600 MHz in 18x18 mode and 440 MHz in double mode (I have the C2 speed grade). I need more speed (fMAX) for my project!!!

Hence I am trying to find out how to organize my computation so as to achieve this performance.

--- Quote Start ---

PS: your "artificial block" looks like something that will be synthesized to latches.

--- Quote End ---

Please do not worry about it! In reality, this "artificial block" contains a completely different algorithm, but it is about 2000 lines long and would distract from my main question.

Sincerely,

Ilghiz

altera_forum · Answer

I see, I misinterpreted the doc. :)

Anyway, taking a second look at your code, if I am reading this right, it can't be mapped to the 4-M-A mode.

The 4-M-Adder performs the operation "output = (a0*b0) + (a1*b1) + (a2*b2) + (a3*b3)" with 3 levels of registers (for fMAX)

ra0 <= a0; ... rb3 <= b3;

rma01 <= (ra0 * rb0) + (ra1 * rb1);

rma23 <= (ra2 * rb2) + (ra3 * rb3);

rma <= rma01 + rma23;

output <= round_saturate(rma);

Or you can use it in 4-M-Accumulator mode

ra0 <= a0; ... rb3 <= b3;

rma01 <= (ra0 * rb0) + (ra1 * rb1);

rma23 <= (ra2 * rb2) + (ra3 * rb3);

rma <= rma01 + rma23 + rma;

output <= round_saturate(rma);

You need to, somehow, convert your algorithm into one of these patterns.

But looking at it, I can't figure out a way.
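
For reference, the adder pattern could be written as synthesizable Verilog roughly as follows. This is only a sketch: the module name SumOfFour, the port names, and the 18-bit width are illustrative assumptions, rounding and saturation are left out, and whether Quartus actually packs it into one DSP block in four multiplier adder mode depends on the tool version and settings, so check the fitter report.

```verilog
// Sketch of output = a0*b0 + a1*b1 + a2*b2 + a3*b3 with three register levels.
// Names and the default width W = 18 are assumptions for illustration.
module SumOfFour #(parameter W = 18) (
  input  wire                  clk,
  input  wire signed [W-1:0]   a0, b0, a1, b1, a2, b2, a3, b3,
  output reg  signed [2*W+1:0] result
);
  // Level 1: input registers
  reg signed [W-1:0] ra0, rb0, ra1, rb1, ra2, rb2, ra3, rb3;
  // Level 2: pairwise sums of products (one extra bit for the addition)
  reg signed [2*W:0] rma01, rma23;

  always @(posedge clk) begin
    ra0 <= a0; rb0 <= b0;
    ra1 <= a1; rb1 <= b1;
    ra2 <= a2; rb2 <= b2;
    ra3 <= a3; rb3 <= b3;

    rma01 <= ra0*rb0 + ra1*rb1;
    rma23 <= ra2*rb2 + ra3*rb3;

    // Level 3: final sum (round/saturate stage omitted in this sketch)
    result <= rma01 + rma23;
  end
endmodule
```

All operands are declared signed, so the products and sums are sign-extended to the width of the register they are assigned to.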

altera_forum · Answer

Dear Rbuhalho,

Yes, you are right, thank you! It seems that my first question, regarding the usage of multipliers, is solved. I was able to convert the algorithm so that it computes A*B+C*D, and I immediately saw the multiplier usage drop by a factor of two!

However, my performance is still far from the possible peak: right now I achieve only 330 MHz instead of 440 MHz (is it possible to reach 600 MHz here on my hardware?). It seems that I need to tune my settings in Quartus or change something more in the algorithm.

Here are the modified code and my Quartus settings:

module GenScal(A1, A2, B1, B2, C1, C2, D1, D2,
               P1, P2, Q1, Q2, R1, R2, S1, S2,
               AP, AQ, AR, AS,
               BP, BQ, BR, BS,
               CP, CQ, CR, CS,
               DP, DQ, DR, DS, Clk);
parameter UpdateSpeed=12;
input Clk;
input  A1, A2, B1, B2, C1, C2, D1, D2;
input  P1, P2, Q1, Q2, R1, R2, S1, S2;
output  AP, AQ, AR, AS;
output  BP, BQ, BR, BS;
output  CP, CQ, CR, CS;
output  DP, DQ, DR, DS;
// Memory
reg  ScalAP, ScalAQ, ScalAR, ScalAS;
reg  ScalBP, ScalBQ, ScalBR, ScalBS;
reg  ScalCP, ScalCQ, ScalCR, ScalCS;
reg  ScalDP, ScalDQ, ScalDR, ScalDS;
reg  AddAP, AddAQ, AddAR, AddAS;
reg  AddBP, AddBQ, AddBR, AddBS;
reg  AddCP, AddCQ, AddCR, AddCS;
reg  AddDP, AddDQ, AddDR, AddDS;
reg  MulAP1, MulAQ1, MulAR1, MulAS1;
reg  MulBP1, MulBQ1, MulBR1, MulBS1;
reg  MulCP1, MulCQ1, MulCR1, MulCS1;
reg  MulDP1, MulDQ1, MulDR1, MulDS1;
reg  MulAP2, MulAQ2, MulAR2, MulAS2;
reg  MulBP2, MulBQ2, MulBR2, MulBS2;
reg  MulCP2, MulCQ2, MulCR2, MulCS2;
reg  MulDP2, MulDQ2, MulDR2, MulDS2;
reg  SumAP, SumAQ, SumAR, SumAS;
reg  SumBP, SumBQ, SumBR, SumBS;
reg  SumCP, SumCQ, SumCR, SumCS;
reg  SumDP, SumDQ, SumDR, SumDS;
assign AP=ScalAP;
assign AQ=ScalAQ;
assign AR=ScalAR;
assign AS=ScalAS;
                
assign BP=ScalBP;
assign BQ=ScalBQ;
assign BR=ScalBR;
assign BS=ScalBS;
                
assign CP=ScalCP;
assign CQ=ScalCQ;
assign CR=ScalCR;
assign CS=ScalCS;
                
assign DP=ScalDP;
assign DQ=ScalDQ;
assign DR=ScalDR;
assign DS=ScalDS;
// Initialization
 initial
 begin
//
   MulAP1=0; MulAQ1=0; MulAR1=0; MulAS1=0;
   MulBP1=0; MulBQ1=0; MulBR1=0; MulBS1=0;
   MulCP1=0; MulCQ1=0; MulCR1=0; MulCS1=0;
   MulDP1=0; MulDQ1=0; MulDR1=0; MulDS1=0;
//
   MulAP2=0; MulAQ2=0; MulAR2=0; MulAS2=0;
   MulBP2=0; MulBQ2=0; MulBR2=0; MulBS2=0;
   MulCP2=0; MulCQ2=0; MulCR2=0; MulCS2=0;
   MulDP2=0; MulDQ2=0; MulDR2=0; MulDS2=0;
//
   SumAP=0; SumAQ=0; SumAR=0; SumAS=0;
   SumBP=0; SumBQ=0; SumBR=0; SumBS=0;
   SumCP=0; SumCQ=0; SumCR=0; SumCS=0;
   SumDP=0; SumDQ=0; SumDR=0; SumDS=0;
//
   AddAP=0; AddAQ=0; AddAR=0; AddAS=0;
   AddBP=0; AddBQ=0; AddBR=0; AddBS=0;
   AddCP=0; AddCQ=0; AddCR=0; AddCS=0;
   AddDP=0; AddDQ=0; AddDR=0; AddDS=0;
//
   ScalAP=0; ScalAQ=0; ScalAR=0; ScalAS=0;
   ScalBP=0; ScalBQ=0; ScalBR=0; ScalBS=0;
   ScalCP=0; ScalCQ=0; ScalCR=0; ScalCS=0;
   ScalDP=0; ScalDQ=0; ScalDR=0; ScalDS=0;
 end
// Main Computations
 always @(posedge Clk)
 begin
// 1*1
   MulAP1<=A1*P1; MulAQ1<=A1*Q1; MulAR1<=A1*R1; MulAS1<=A1*S1;
   MulBP1<=B1*P1; MulBQ1<=B1*Q1; MulBR1<=B1*R1; MulBS1<=B1*S1;
   MulCP1<=C1*P1; MulCQ1<=C1*Q1; MulCR1<=C1*R1; MulCS1<=C1*S1;
   MulDP1<=D1*P1; MulDQ1<=D1*Q1; MulDR1<=D1*R1; MulDS1<=D1*S1;
// 2*2
   MulAP2<=A2*P2; MulAQ2<=A2*Q2; MulAR2<=A2*R2; MulAS2<=A2*S2;
   MulBP2<=B2*P2; MulBQ2<=B2*Q2; MulBR2<=B2*R2; MulBS2<=B2*S2;
   MulCP2<=C2*P2; MulCQ2<=C2*Q2; MulCR2<=C2*R2; MulCS2<=C2*S2;
   MulDP2<=D2*P2; MulDQ2<=D2*Q2; MulDR2<=D2*R2; MulDS2<=D2*S2;
// Sum
   SumAP<=MulAP1+MulAP2; SumAQ<=MulAQ1+MulAQ2; SumAR<=MulAR1+MulAR2; SumAS<=MulAS1+MulAS2;
   SumBP<=MulBP1+MulBP2; SumBQ<=MulBQ1+MulBQ2; SumBR<=MulBR1+MulBR2; SumBS<=MulBS1+MulBS2;
   SumCP<=MulCP1+MulCP2; SumCQ<=MulCQ1+MulCQ2; SumCR<=MulCR1+MulCR2; SumCS<=MulCS1+MulCS2;
   SumDP<=MulDP1+MulDP2; SumDQ<=MulDQ1+MulDQ2; SumDR<=MulDR1+MulDR2; SumDS<=MulDS1+MulDS2;
// Scal: if I change A+B-C into two stage pipeline it does not improve the performance...
   ScalAP<=ScalAP+SumAP-AP; ScalAQ<=ScalAQ+SumAQ-AQ; ScalAR<=ScalAR+SumAR-AR; ScalAS<=ScalAS+SumAS-AS;
   ScalBP<=ScalBP+SumBP-BP; ScalBQ<=ScalBQ+SumBQ-BQ; ScalBR<=ScalBR+SumBR-BR; ScalBS<=ScalBS+SumBS-BS;
   ScalCP<=ScalCP+SumCP-CP; ScalCQ<=ScalCQ+SumCQ-CQ; ScalCR<=ScalCR+SumCR-CR; ScalCS<=ScalCS+SumCS-CS;
   ScalDP<=ScalDP+SumDP-DP; ScalDQ<=ScalDQ+SumDQ-DQ; ScalDR<=ScalDR+SumDR-DR; ScalDS<=ScalDS+SumDS-DS;
 end
endmodule
    
Option	Setting	Default Value
Device	EP3SL150F1152C2	
Top-level entity name	my_t2_DE3	my_t2_DE3
Family name	Stratix III	Stratix II
Optimization Technique	Speed	Balanced
Use Generated Physical Constraints File	Off	
Use smart compilation	Off	Off
Enable parallel Assembler and TimeQuest Timing Analyzer during compilation	On	On
Enable compact report table	Off	Off
Restructure Multiplexers	Auto	Auto
Create Debugging Nodes for IP Cores	Off	Off
Preserve fewer node names	On	On
Disable OpenCore Plus hardware evaluation	Off	Off
Verilog Version	Verilog_2001	Verilog_2001
VHDL Version	VHDL_1993	VHDL_1993
State Machine Processing	Auto	Auto
Safe State Machine	Off	Off
Extract Verilog State Machines	On	On
Extract VHDL State Machines	On	On
Ignore Verilog initial constructs	Off	Off
Iteration limit for constant Verilog loops	5000	5000
Iteration limit for non-constant Verilog loops	250	250
Add Pass-Through Logic to Inferred RAMs	On	On
Parallel Synthesis	Off	Off
DSP Block Balancing	Auto	Auto
NOT Gate Push-Back	On	On
Power-Up Don't Care	On	On
Remove Redundant Logic Cells	Off	Off
Remove Duplicate Registers	On	On
Ignore CARRY Buffers	Off	Off
Ignore CASCADE Buffers	Off	Off
Ignore GLOBAL Buffers	Off	Off
Ignore ROW GLOBAL Buffers	Off	Off
Ignore LCELL Buffers	Off	Off
Ignore SOFT Buffers	On	On
Limit AHDL Integers to 32 Bits	Off	Off
Carry Chain Length	70	70
Auto Carry Chains	On	On
Auto Open-Drain Pins	On	On
Perform WYSIWYG Primitive Resynthesis	Off	Off
Auto ROM Replacement	On	On
Auto RAM Replacement	On	On
Auto DSP Block Replacement	On	On
Auto Shift Register Replacement	Auto	Auto
Auto Clock Enable Replacement	On	On
Strict RAM Replacement	Off	Off
Allow Synchronous Control Signals	On	On
Force Use of Synchronous Clear Signals	Off	Off
Auto RAM Block Balancing	On	On
Auto RAM to Logic Cell Conversion	Off	Off
Auto Resource Sharing	Off	Off
Allow Any RAM Size For Recognition	Off	Off
Allow Any ROM Size For Recognition	Off	Off
Allow Any Shift Register Size For Recognition	Off	Off
Use LogicLock Constraints during Resource Balancing	On	On
Ignore translate_off and synthesis_off directives	Off	Off
Timing-Driven Synthesis	Off	Off
Show Parameter Settings Tables in Synthesis Report	On	On
Ignore Maximum Fan-Out Assignments	Off	Off
Synchronization Register Chain Length	2	2
PowerPlay Power Optimization	Normal compilation	Normal compilation
HDL message level	Level2	Level2
Suppress Register Optimization Related Messages	Off	Off
Number of Removed Registers Reported in Synthesis Report	5000	5000
Number of Inverted Registers Reported in Synthesis Report	100	100
Clock MUX Protection	On	On
Auto Gated Clock Conversion	Off	Off
Block Design Naming	Auto	Auto
SDC constraint protection	Off	Off
Synthesis Effort	Auto	Auto
Shift Register Replacement - Allow Asynchronous Clear Signal	On	On
Analysis & Synthesis Message Level	Medium	Medium
Disable Register Merging Across Hierarchies	Auto	Auto
Resource Aware Inference For Block RAM	On	On

Please suggest what I can still improve in the settings and/or the code to get better performance!

Sincerely,

Ilghiz

altera_forum · Answer

Hi,

Dumb suggestion one:

Try changing the Optimization Technique from Balanced to Speed.

Dumb suggestion two:

Add two more register levels: register the inputs and the outputs.

How are you obtaining that 330MHz? Are you synthesizing your entire design or just that module?

Which is the critical path?
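
As a rough sketch of suggestion two, here is one A*P lane of GenScal with added input and output registers. It assumes 14-bit signed inputs as in the first post; the module name MacLaneRegIO, the parameter W, and the r-prefixed register names are hypothetical, and the accumulation step of the original module is left out.

```verilog
// Hypothetical single lane of GenScal with extra input and output registers.
// W = 14 is assumed from the "reg signed [13:0]" data mentioned in the question.
module MacLaneRegIO #(parameter W = 14) (
  input  wire                Clk,
  input  wire signed [W-1:0] A1, P1, A2, P2,
  output reg  signed [2*W:0] AP
);
  reg signed [W-1:0] rA1, rP1, rA2, rP2; // added: input registers
  reg signed [2*W:0] SumAP;              // multiply-add stage

  always @(posedge Clk) begin
    rA1 <= A1; rP1 <= P1;
    rA2 <= A2; rP2 <= P2;
    SumAP <= rA1*rP1 + rA2*rP2; // operands come straight from registers
    AP    <= SumAP;             // added: output register
  end
endmodule
```

Synthesizing a lane like this on its own, with a clock constraint on Clk, also gives a quick way to see whether the limit comes from the DSP path or from the surrounding logic.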
