Altera_Forum
Honored Contributor
15 years ago
Fmax is small, please help me to improve it!
Hi,

In my small SystemVerilog project (about 1000 lines) I have one module that is critical to performance (attached below). This module takes 4 x 16-bit words (In) per clock (Clk), pipelines them over N stages (Data), calculates scalar products over all possible combinations, and sums them using something similar to an FIR filter. I read the results out of the module on a second clock (ClkSW). I urgently need Clk to run at 400 MHz; ClkSW can run at 100 MHz.

I am experimenting on the Stratix III from the DE3 board (EP3SL150F1152C2), whose DSP blocks can reach 440 MHz on a*b+c*d operations. I designed everything to be maximally pipelined, so each parallel line performs only one operation, and I can see that Quartus uses DSP blocks for my multiplier pairs and implements them optimally as a*b+c*d operators. Of course I switched on all available compiler optimizations for speed and switched off any reuse of previous synthesis results to keep the experiment clean.

The parameter N of my module sets the total amount of identical parallel work to be performed. Here is a table of the results I am getting:
N    SHR   Fmax 0C/85C    Logic   DSP    Total synthesis time
18   14    337/311 MHz    22%     75%    29 minutes
18    2    331/308 MHz    19%     75%    28 minutes
12    2    366/340 MHz    13%     50%    17 minutes
 8    2    370/339 MHz     8%     33%    11 minutes
 6    2    413/384 MHz     6%     25%     8 minutes
 4    2    448/410 MHz     4%     17%     5 minutes
 4   14    382/355 MHz     5%     17%     6 minutes
 2   14    432/401 MHz     2%      8%     3 minutes
When DSP usage is small (N=2 or 4) I get close to peak performance (Fmax for the multipliers should be about 440 MHz), but I cannot reach it when I use many multipliers, even though the module behaves exactly the same. I have fought with this module for almost a month, trying to add intermediate registers, but it does not help: I cannot reach even 400 MHz (which would be enough for me) for N=16 or 18. For large projects I cannot make many attempts, because each recompile costs me half an hour.

Please suggest what I should try in order to reach Fmax = 400 MHz for Clk with N=18. I urgently need it; otherwise I will have to demux my data and use at least an SL340, with its impressive $8000 price :(

Sincerely,
Ilghiz

Here is my module; you can try it in your Quartus with Stratix III and see my problem:
module test(Clk, In, ClkSW, SW, Scal);
parameter N=18; // can be 2, 4, 6, ..., but I need 16 or 18
parameter SHR=14; // can be 2, 3, 4, 5, ..., but I need 12-20
input Clk, ClkSW;
input signed In;
input SW;
reg signed Scal;
output Scal;
// Memory ////////////////////////////
reg signed D, Data;
reg signed Mul;
reg signed Sum, Sum2;
reg signed ScalX;
reg signed ScalY;
reg InDataCounter;
// Reading data from channels - the key place that I cannot get to run at 400MHz for N=16 or 18
always @(posedge Clk)
begin
for(int i=0; i<2; i++)
for(int j=0; j<4; j++)
D<=Data;
for(int j=0; j<4; j++)
Data<=In;
for(int i=0; i<N-1; i++)
for(int j=0; j<4; j++)
Data<=Data;
InDataCounter<=~InDataCounter;
for(int i=0; i<N; i+=2)
for(int j=0; j<4; j++)
for(int k=0; k<4; k++)
begin
Mul <=D*Data;
Mul<=D*Data;
end
for(int i=0; i<N; i+=2)
for(int j=0; j<16; j++)
begin
Sum<=Mul+Mul;
Sum2<=Sum; // intermediate register that helps a lot
ScalX<=ScalX-(ScalX>>>SHR);
ScalX<=Sum2+ScalX;
ScalY<=ScalX>>>(16+SHR);
ScalY<=ScalX>>>(16+SHR);
end
end
// Output, clocked at 100MHz; I hope it is not relevant to my performance problem
always @(posedge ClkSW)
begin
for(int i=1; i<N; i++)
begin
case(SW)
4'b0000: begin Scal<=ScalY; Scal<=ScalY; end
4'b0001: begin Scal<=ScalY; Scal<=ScalY; end
4'b0010: begin Scal<=ScalY; Scal<=ScalY; end
4'b0011: begin Scal<=ScalY; Scal<=ScalY; end
//
4'b0100: begin Scal<=ScalY; Scal<=ScalY; end
4'b0101: begin Scal<=ScalY; Scal<=ScalY; end
4'b0110: begin Scal<=ScalY; Scal<=ScalY; end
4'b0111: begin Scal<=ScalY; Scal<=ScalY; end
//
4'b1000: begin Scal<=ScalY; Scal<=ScalY; end
4'b1001: begin Scal<=ScalY; Scal<=ScalY; end
4'b1010: begin Scal<=ScalY; Scal<=ScalY; end
4'b1011: begin Scal<=ScalY; Scal<=ScalY; end
//
4'b1100: begin Scal<=ScalY; Scal<=ScalY; end
4'b1101: begin Scal<=ScalY; Scal<=ScalY; end
4'b1110: begin Scal<=ScalY; Scal<=ScalY; end
4'b1111: begin Scal<=ScalY; Scal<=ScalY; end
endcase
end
case(SW)
0: Scal<=ScalY;
1: Scal<=ScalY;
2: Scal<=ScalY;
3: Scal<=ScalY;
4: Scal<=ScalY;
5: Scal<=ScalY;
6: Scal<=ScalY;
7: Scal<=ScalY;
8: Scal<=ScalY;
9: Scal<=ScalY;
10: Scal<=ScalY;
11: Scal<=ScalY;
12: Scal<=ScalY;
13: Scal<=ScalY;
14: Scal<=ScalY;
15: Scal<=ScalY;
endcase
end
endmodule
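
To make the data path described in the post easier to follow, here is a minimal single-lane sketch of the same sequence of steps: register the inputs, form a*b+c*d, add one intermediate register, accumulate with a leak of 1/2^SHR, then shift the result down. It is not the module above. The module name mac_lane, the port names, and all bit widths are assumptions chosen for illustration, and the sketch has not been tuned or verified for timing on any device.

```systemverilog
// Hypothetical single-lane sketch; names and widths are assumptions,
// not taken from the module in the post.
module mac_lane #(
    parameter int SHR = 14                 // leak shift of the accumulator
) (
    input  logic               clk,
    input  logic signed [15:0] a, b, c, d, // one pair of products per lane
    output logic signed [16:0] y           // 17 bits so the shifted result cannot overflow
);
    logic signed [15:0]     a_r, b_r, c_r, d_r; // input registers
    logic signed [31:0]     p0, p1;             // registered products
    logic signed [32:0]     sum, sum_r;         // a*b + c*d, then one extra pipeline stage
    logic signed [32+SHR:0] acc = '0;           // leaky accumulator, steady-state gain 2^SHR

    always @(posedge clk) begin
        // Stage 1: register the operands
        a_r <= a;  b_r <= b;  c_r <= c;  d_r <= d;
        // Stage 2: two 16x16 signed multiplies
        p0  <= a_r * b_r;
        p1  <= c_r * d_r;
        // Stage 3: sum of the product pair
        sum <= p0 + p1;
        // Stage 4: intermediate register between the adder and the accumulator
        sum_r <= sum;
        // Stage 5: acc = acc - acc/2^SHR + sum_r (single-cycle feedback loop)
        acc <= acc - (acc >>> SHR) + sum_r;
        // Stage 6: scale the accumulator back down
        y   <= acc >>> (16 + SHR);
    end
endmodule
```

In this sketch the accumulator update in stage 5 is the only step with feedback, so unlike the feed-forward stages it cannot be split by simply inserting more registers. Its width grows with SHR (33+SHR bits here), which is one assumption-dependent reason the SHR parameter could influence the achievable clock rate.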