Altera_Forum (Honored Contributor), 15 years ago

Dear Rbuhalho,
Yes, you are right, thank you! My first question, about multiplier usage, seems to be solved. I was able to restructure the algorithm so that it computes A*B+C*D, and the number of multipliers used immediately dropped by half. However, performance is still far from the possible peak: I am currently achieving only 330 MHz instead of 440 MHz (is 600 MHz even possible on my hardware?). It seems I need to tune my Quartus settings or change something more in the algorithm. Here are the modified code and my Quartus settings:
module GenScal(A1, A2, B1, B2, C1, C2, D1, D2,
               P1, P2, Q1, Q2, R1, R2, S1, S2,
               AP, AQ, AR, AS,
               BP, BQ, BR, BS,
               CP, CQ, CR, CS,
               DP, DQ, DR, DS, Clk);
    parameter UpdateSpeed=12;
    input Clk;
    input A1, A2, B1, B2, C1, C2, D1, D2;
    input P1, P2, Q1, Q2, R1, R2, S1, S2;
    output AP, AQ, AR, AS;
    output BP, BQ, BR, BS;
    output CP, CQ, CR, CS;
    output DP, DQ, DR, DS;
    // Memory
    reg ScalAP, ScalAQ, ScalAR, ScalAS;
    reg ScalBP, ScalBQ, ScalBR, ScalBS;
    reg ScalCP, ScalCQ, ScalCR, ScalCS;
    reg ScalDP, ScalDQ, ScalDR, ScalDS;
    reg AddAP, AddAQ, AddAR, AddAS;
    reg AddBP, AddBQ, AddBR, AddBS;
    reg AddCP, AddCQ, AddCR, AddCS;
    reg AddDP, AddDQ, AddDR, AddDS;
    reg MulAP1, MulAQ1, MulAR1, MulAS1;
    reg MulBP1, MulBQ1, MulBR1, MulBS1;
    reg MulCP1, MulCQ1, MulCR1, MulCS1;
    reg MulDP1, MulDQ1, MulDR1, MulDS1;
    reg MulAP2, MulAQ2, MulAR2, MulAS2;
    reg MulBP2, MulBQ2, MulBR2, MulBS2;
    reg MulCP2, MulCQ2, MulCR2, MulCS2;
    reg MulDP2, MulDQ2, MulDR2, MulDS2;
    reg SumAP, SumAQ, SumAR, SumAS;
    reg SumBP, SumBQ, SumBR, SumBS;
    reg SumCP, SumCQ, SumCR, SumCS;
    reg SumDP, SumDQ, SumDR, SumDS;
    assign AP=ScalAP;
    assign AQ=ScalAQ;
    assign AR=ScalAR;
    assign AS=ScalAS;
    assign BP=ScalBP;
    assign BQ=ScalBQ;
    assign BR=ScalBR;
    assign BS=ScalBS;
    assign CP=ScalCP;
    assign CQ=ScalCQ;
    assign CR=ScalCR;
    assign CS=ScalCS;
    assign DP=ScalDP;
    assign DQ=ScalDQ;
    assign DR=ScalDR;
    assign DS=ScalDS;
    // Initialization
    initial
    begin
        //
        MulAP1=0; MulAQ1=0; MulAR1=0; MulAS1=0;
        MulBP1=0; MulBQ1=0; MulBR1=0; MulBS1=0;
        MulCP1=0; MulCQ1=0; MulCR1=0; MulCS1=0;
        MulDP1=0; MulDQ1=0; MulDR1=0; MulDS1=0;
        //
        MulAP2=0; MulAQ2=0; MulAR2=0; MulAS2=0;
        MulBP2=0; MulBQ2=0; MulBR2=0; MulBS2=0;
        MulCP2=0; MulCQ2=0; MulCR2=0; MulCS2=0;
        MulDP2=0; MulDQ2=0; MulDR2=0; MulDS2=0;
        //
        SumAP=0; SumAQ=0; SumAR=0; SumAS=0;
        SumBP=0; SumBQ=0; SumBR=0; SumBS=0;
        SumCP=0; SumCQ=0; SumCR=0; SumCS=0;
        SumDP=0; SumDQ=0; SumDR=0; SumDS=0;
        //
        AddAP=0; AddAQ=0; AddAR=0; AddAS=0;
        AddBP=0; AddBQ=0; AddBR=0; AddBS=0;
        AddCP=0; AddCQ=0; AddCR=0; AddCS=0;
        AddDP=0; AddDQ=0; AddDR=0; AddDS=0;
        //
        ScalAP=0; ScalAQ=0; ScalAR=0; ScalAS=0;
        ScalBP=0; ScalBQ=0; ScalBR=0; ScalBS=0;
        ScalCP=0; ScalCQ=0; ScalCR=0; ScalCS=0;
        ScalDP=0; ScalDQ=0; ScalDR=0; ScalDS=0;
    end
    // Main Computations
    always @(posedge Clk)
    begin
        // 1*1
        MulAP1<=A1*P1; MulAQ1<=A1*Q1; MulAR1<=A1*R1; MulAS1<=A1*S1;
        MulBP1<=B1*P1; MulBQ1<=B1*Q1; MulBR1<=B1*R1; MulBS1<=B1*S1;
        MulCP1<=C1*P1; MulCQ1<=C1*Q1; MulCR1<=C1*R1; MulCS1<=C1*S1;
        MulDP1<=D1*P1; MulDQ1<=D1*Q1; MulDR1<=D1*R1; MulDS1<=D1*S1;
        // 2*2
        MulAP2<=A2*P2; MulAQ2<=A2*Q2; MulAR2<=A2*R2; MulAS2<=A2*S2;
        MulBP2<=B2*P2; MulBQ2<=B2*Q2; MulBR2<=B2*R2; MulBS2<=B2*S2;
        MulCP2<=C2*P2; MulCQ2<=C2*Q2; MulCR2<=C2*R2; MulCS2<=C2*S2;
        MulDP2<=D2*P2; MulDQ2<=D2*Q2; MulDR2<=D2*R2; MulDS2<=D2*S2;
        // Sum
        SumAP<=MulAP1+MulAP2; SumAQ<=MulAQ1+MulAQ2; SumAR<=MulAR1+MulAR2; SumAS<=MulAS1+MulAS2;
        SumBP<=MulBP1+MulBP2; SumBQ<=MulBQ1+MulBQ2; SumBR<=MulBR1+MulBR2; SumBS<=MulBS1+MulBS2;
        SumCP<=MulCP1+MulCP2; SumCQ<=MulCQ1+MulCQ2; SumCR<=MulCR1+MulCR2; SumCS<=MulCS1+MulCS2;
        SumDP<=MulDP1+MulDP2; SumDQ<=MulDQ1+MulDQ2; SumDR<=MulDR1+MulDR2; SumDS<=MulDS1+MulDS2;
        // Scal: splitting A+B-C into a two-stage pipeline did not improve performance.
        ScalAP<=ScalAP+SumAP-AP; ScalAQ<=ScalAQ+SumAQ-AQ; ScalAR<=ScalAR+SumAR-AR; ScalAS<=ScalAS+SumAS-AS;
        ScalBP<=ScalBP+SumBP-BP; ScalBQ<=ScalBQ+SumBQ-BQ; ScalBR<=ScalBR+SumBR-BR; ScalBS<=ScalBS+SumBS-BS;
        ScalCP<=ScalCP+SumCP-CP; ScalCQ<=ScalCQ+SumCQ-CQ; ScalCR<=ScalCR+SumCR-CR; ScalCS<=ScalCS+SumCS-CS;
        ScalDP<=ScalDP+SumDP-DP; ScalDQ<=ScalDQ+SumDQ-DQ; ScalDR<=ScalDR+SumDR-DR; ScalDS<=ScalDS+SumDS-DS;
    end
endmodule
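As a hedged illustration of the A*B+C*D restructuring described above, the sketch below isolates a single output as a registered sum of two products. The module name `mult_add2`, the `WIDTH` parameter and its value of 18 are assumptions for illustration only (the post does not show the actual bus widths), and the operands are treated as unsigned, matching `GenScal`, which declares no `signed` ports. Unlike `GenScal`, it also registers the inputs; registering the inputs, each product and the sum gives the fitter the option of placing those registers inside a DSP block, but whether Quartus actually does so depends on the device and settings, so the fitter's DSP usage report is the place to confirm it.

```verilog
// Hypothetical sketch: y = a*b + c*d as a three-stage pipeline.
// WIDTH = 18 is an assumed operand width; operands are unsigned.
module mult_add2 #(
    parameter WIDTH = 18
) (
    input                       clk,
    input      [WIDTH-1:0]      a, b, c, d,
    output reg [2*WIDTH:0]      y      // one extra bit for the carry of the final add
);
    // Stage 1 registers: the operands (candidates for DSP input registers).
    reg [WIDTH-1:0]   a_r, b_r, c_r, d_r;
    // Stage 2 registers: the full-width products (candidates for DSP pipeline registers).
    reg [2*WIDTH-1:0] ab_r, cd_r;

    always @(posedge clk) begin
        a_r  <= a;
        b_r  <= b;
        c_r  <= c;
        d_r  <= d;
        ab_r <= a_r * b_r;      // evaluated at 2*WIDTH bits, so no product bits are lost
        cd_r <= c_r * d_r;
        // Stage 3 register: the sum, evaluated at 2*WIDTH+1 bits (the width of y).
        y    <= ab_r + cd_r;
    end
endmodule
```

With this structure, `y` reflects the operands that were present three rising clock edges earlier; in `GenScal` the same idea would apply to each of the sixteen `Sum` outputs.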
Analysis & Synthesis settings:
Option | Setting | Default Value
--- | --- | ---
Device | EP3SL150F1152C2 |
Top-level entity name | my_t2_DE3 | my_t2_DE3
Family name | Stratix III | Stratix II
Optimization Technique | Speed | Balanced
Use Generated Physical Constraints File | Off |
Use smart compilation | Off | Off
Enable parallel Assembler and TimeQuest Timing Analyzer during compilation | On | On
Enable compact report table | Off | Off
Restructure Multiplexers | Auto | Auto
Create Debugging Nodes for IP Cores | Off | Off
Preserve fewer node names | On | On
Disable OpenCore Plus hardware evaluation | Off | Off
Verilog Version | Verilog_2001 | Verilog_2001
VHDL Version | VHDL_1993 | VHDL_1993
State Machine Processing | Auto | Auto
Safe State Machine | Off | Off
Extract Verilog State Machines | On | On
Extract VHDL State Machines | On | On
Ignore Verilog initial constructs | Off | Off
Iteration limit for constant Verilog loops | 5000 | 5000
Iteration limit for non-constant Verilog loops | 250 | 250
Add Pass-Through Logic to Inferred RAMs | On | On
Parallel Synthesis | Off | Off
DSP Block Balancing | Auto | Auto
NOT Gate Push-Back | On | On
Power-Up Don't Care | On | On
Remove Redundant Logic Cells | Off | Off
Remove Duplicate Registers | On | On
Ignore CARRY Buffers | Off | Off
Ignore CASCADE Buffers | Off | Off
Ignore GLOBAL Buffers | Off | Off
Ignore ROW GLOBAL Buffers | Off | Off
Ignore LCELL Buffers | Off | Off
Ignore SOFT Buffers | On | On
Limit AHDL Integers to 32 Bits | Off | Off
Carry Chain Length | 70 | 70
Auto Carry Chains | On | On
Auto Open-Drain Pins | On | On
Perform WYSIWYG Primitive Resynthesis | Off | Off
Auto ROM Replacement | On | On
Auto RAM Replacement | On | On
Auto DSP Block Replacement | On | On
Auto Shift Register Replacement | Auto | Auto
Auto Clock Enable Replacement | On | On
Strict RAM Replacement | Off | Off
Allow Synchronous Control Signals | On | On
Force Use of Synchronous Clear Signals | Off | Off
Auto RAM Block Balancing | On | On
Auto RAM to Logic Cell Conversion | Off | Off
Auto Resource Sharing | Off | Off
Allow Any RAM Size For Recognition | Off | Off
Allow Any ROM Size For Recognition | Off | Off
Allow Any Shift Register Size For Recognition | Off | Off
Use LogicLock Constraints during Resource Balancing | On | On
Ignore translate_off and synthesis_off directives | Off | Off
Timing-Driven Synthesis | Off | Off
Show Parameter Settings Tables in Synthesis Report | On | On
Ignore Maximum Fan-Out Assignments | Off | Off
Synchronization Register Chain Length | 2 | 2
PowerPlay Power Optimization | Normal compilation | Normal compilation
HDL message level | Level2 | Level2
Suppress Register Optimization Related Messages | Off | Off
Number of Removed Registers Reported in Synthesis Report | 5000 | 5000
Number of Inverted Registers Reported in Synthesis Report | 100 | 100
Clock MUX Protection | On | On
Auto Gated Clock Conversion | Off | Off
Block Design Naming | Auto | Auto
SDC constraint protection | Off | Off
Synthesis Effort | Auto | Auto
Shift Register Replacement - Allow Asynchronous Clear Signal | On | On
Analysis & Synthesis Message Level | Medium | Medium
Disable Register Merging Across Hierarchies | Auto | Auto
Resource Aware Inference For Block RAM | On | On
Please suggest what else I can improve in the settings and/or the code to get better performance!

Sincerely,
Ilghiz