Altera_Forum
Honored Contributor
15 years ago
Please advise me how to improve the performance of my Verilog module
Hi,
While experimenting with a Terasic DE3 board, I ran into a problem that I cannot solve myself, and I kindly ask this forum for advice.

I have a small Verilog module that reads N words from N infinite vectors v_1,...,v_N, and I need to compute all possible s_{i,j,k}=v_i^T D P^k v_j, where P is the permutation matrix that shifts a vector down by one entry and D is the diagonal matrix with (d_1,d_2,...) on the diagonal, so that d_1=1, d_2=(1-2^{-m}), d_3=(1-2^{-m})^2, and so on. In effect I am implementing something similar to stable IIR filters.

I pipeline the input data arriving from each vector (InDataA and InDataB; for simplicity I take an example with N=2), compute all products, and store the results in ScalAA, ScalAB, ScalBA, and ScalBB.

When I put this module into the standard Terasic DE3 environment, I get two issues that I cannot resolve:

1. All my data are reg signed [13:0], so one multiplication fits into an 18x18-bit multiplier. I am doing massively parallel multiplications and hope to use the so-called "Four Multiplier Adder Mode" described in the Stratix III Device Handbook, Volume 1, but I cannot work out how to implement it. I need it urgently; otherwise I will run out of resources on my DE3 board.
2. The timing of this module is not good enough: I achieve only 260-310 MHz, whereas in "Four Multiplier Adder Mode" I should achieve 600 MHz. I need this as well, because I expect input data rates of 400, 500, and probably 600 MHz in my design.

Here is my module. Please advise me how to:

1. switch Four Multiplier Adder Mode on, and
2. improve the performance.

Thank you!
Ilghiz
module DATA_Aq(InDataClkA, InDataA, InDataB, OutData);
parameter NBUF=16; // maximum possible shift in the design; I should be able to run it with:
// 1) A,B,...,H = 8 channels and NBUF=6, or
// 2) A,B,C,D = 4 channels and NBUF=16,
// so the two designs need 384 and 256 18x18 multipliers, respectively, in Four Multiplier Adder Mode
parameter UpdateSpeed=12;
input InDataClkA;
input InDataA;
input InDataB;
output OutData; // artificial output that prevents Quartus from optimizing away the main part of the computation
reg OutData;
// Memory Declaration
reg signed DataA;
reg signed DataB;
reg signed ScalAA, Scal1AA, Scal2AA;
reg signed ScalAB, Scal1AB, Scal2AB;
reg signed ScalBA, Scal1BA, Scal2BA;
reg signed ScalBB, Scal1BB, Scal2BB;
// reg signed Scal3AA, Scal4AA, Scal5AA;
reg signed Tmp;
reg InDataCounter;
// Initialization
initial
begin
integer i;
InDataCounter=0;
for(i=0; i<NBUF; i=i+1)
begin
ScalAA=0; Scal1AA=0; Scal2AA=0;
ScalAB=0; Scal1AB=0; Scal2AB=0;
ScalBA=0; Scal1BA=0; Scal2BA=0;
ScalBB=0; Scal1BB=0; Scal2BB=0;
DataA=0;
DataB=0;
end
end
// Reading Data from Channels and Computation
always @(posedge InDataClkA)
begin
integer i;
InDataCounter<=InDataCounter+1;
for(i=0; i<NBUF-1; i=i+1)
begin
DataA<=DataA;
DataB<=DataB;
end
DataA<=InDataA;
DataB<=InDataB;
for(i=0; i<NBUF; i=i+1)
begin
Scal1AA<=InDataA*DataA;
Scal1AB<=InDataA*DataB;
Scal1BA<=InDataB*DataA;
Scal1BB<=InDataB*DataB;
Scal2AA<=ScalAA-(ScalAA>>>UpdateSpeed); // >>> is the arithmetic shift; >> would zero-fill negative values
Scal2AB<=ScalAB-(ScalAB>>>UpdateSpeed);
Scal2BA<=ScalBA-(ScalBA>>>UpdateSpeed);
Scal2BB<=ScalBB-(ScalBB>>>UpdateSpeed);
ScalAA<=Scal1AA+Scal2AA;
ScalAB<=Scal1AB+Scal2AB;
ScalBA<=Scal1BA+Scal2BA;
ScalBB<=Scal1BB+Scal2BB;
end
end
// Artificial always block that simulates the Scal?? data being used
always @(InDataCounter)
begin
case(InDataCounter)
0: Tmp=ScalAA];
1: Tmp=ScalAB];
2: Tmp=ScalBA];
3: Tmp=ScalBB];
endcase
OutData=Tmp+Tmp+Tmp+Tmp+Tmp;
end
endmodule
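
For illustration, one lane of the update described in the post (s <- s - s*2^{-m} + a*b, with m = UpdateSpeed) could be written with explicit widths and with registers on both sides of the multiplier, roughly as in the sketch below. The module name leaky_mac_lane, the port names, and the 40-bit accumulator width are assumptions made for this example and are not taken from the original design; only the 14-bit sample width and the shift by 12 come from the post. Registering the multiplier inputs and its product typically gives the fitter the option of packing those registers into the DSP block, but this sketch is a plain multiply-accumulate and does not by itself select Four Multiplier Adder Mode.

```verilog
// Hypothetical single-lane sketch: names and ACC_W are assumptions, not from the post.
module leaky_mac_lane #(
    parameter DATA_W       = 14, // sample width, from the post ("reg signed [13:0]")
    parameter ACC_W        = 40, // assumed: 2^26 max product times a gain of 2^12, plus sign
    parameter UPDATE_SPEED = 12  // m in d = 1 - 2^-m (UpdateSpeed in the post)
) (
    input  wire                     clk,
    input  wire signed [DATA_W-1:0] a,   // newest sample of one vector
    input  wire signed [DATA_W-1:0] b,   // delayed sample of the other vector
    output reg  signed [ACC_W-1:0]  acc  // leaky running sum of a*b
);
    reg signed [DATA_W-1:0]   a_r, b_r;  // registered multiplier inputs
    reg signed [2*DATA_W-1:0] prod_r;    // registered product

    initial begin
        a_r    = 0;
        b_r    = 0;
        prod_r = 0;
        acc    = 0;
    end

    always @(posedge clk) begin
        a_r    <= a;
        b_r    <= b;
        prod_r <= a_r * b_r;
        // One update per clock: decay the old sum by 2^-m, then add the product
        // computed in the previous cycle. >>> keeps the sign of acc.
        acc    <= acc - (acc >>> UPDATE_SPEED) + prod_r;
    end
endmodule
```

Instantiating one such lane per (i, j, k) combination, with b taken from the k-th tap of a shift register, would reproduce the structure of the loops in the module above, under the stated width assumptions.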