Forum Discussion

Altera_Forum
Honored Contributor
17 years ago

Parallel adder timing issues

I am a beginner in FPGA design and implementation, and I have two questions about adder implementations. The simulations mentioned below are gate-level simulations using the SDO file generated by Quartus.

I am working with a Stratix II FPGA and Quartus. My design needs to run at a 266 MHz clock, so I am looking for a fast adder with one clock of latency (3.75 ns). I tried the parallel_add megafunction from the MegaWizard Plug-In Manager, implemented it, and simulated it in ModelSim. The output appears after 3 clock cycles in ModelSim, and the worst-case critical path is reported as 6.604 ns.

I need a 2-input, 10-bit-wide adder that works at 266 MHz on this device. Can anyone suggest an implementation approach, the timing constraints that need to be set, and so on? Your help is appreciated.

My second question: the critical timing path after implementation is around 6.604 ns (Tco). Does that mean this design will only work at about 150 MHz? When I run the simulation in ModelSim, the output of the 2-input adder is available only after 3 clocks, with the testbench running at 266 MHz. So the latency is 3 clocks, whereas the critical path shows 6.604 ns. According to the FPGA timing summary the data should be available at 7.5 ns in simulation, but I only see data on the output after 9 or 10 ns. Is this a correlation problem between the EDA tools? Any suggestions?

Is there any other way to calculate the critical path delay, and from it the maximum frequency, so that it matches the simulation results?

Regards,

Sam

8 Replies

  • Altera_Forum
    Honored Contributor

    For something like an adder, I would just start coding it in HDL and see what happens. If there's something you can't do in HDL, or that doesn't get synthesized the way you want, then megafunctions tend to be better. But an adder should be no problem.

    Are you using the Classic Timing Analyzer or TimeQuest? Since you put 6.604ns TCO, I'm guessing you're using the Classic Timing Analyzer and looking at the delay to an output pin. When benchmarking small functions like this, make sure you're looking "inside the registers". Users often wrap these functions with registers just to make sure. Your I/O timing will be another matter that should be looked at independently, where I'm guessing in your scenario you have a fast adder, and are then looking at the timing on this last register to get out an output pin, which is not what you want to analyze for this function.
  • Altera_Forum
    Honored Contributor

    Hi Sam:

    I don't know your entire system, but by your description, I would look in the following areas:

    1) Doing the add at 266 MHz should be fine as long as both inputs and outputs are registered. Although you can build this adder with the megafunction wizard, it is a very basic function that I would suggest you write in straight Verilog or VHDL. For Verilog this would look like the following:

    module adder10bit (
        input                clk_i,
        input                reset_i,
        input  signed [9:0]  A_i,
        input  signed [9:0]  B_i,
        output signed [10:0] Sum_o
    );

        reg  signed [9:0]  A_r;    // registered operand A
        reg  signed [9:0]  B_r;    // registered operand B
        reg  signed [10:0] Sum_r;  // registered result
        wire signed [10:0] Sum_c;  // combinational sum

        assign Sum_o = Sum_r;
        assign Sum_c = A_r + B_r;  // Actual adder

        // Synchronous reset. Nonblocking assignments (<=) are used so that
        // all registers update together on the clock edge.
        always @(posedge clk_i)
        begin
            if (reset_i)
            begin
                A_r   <= 10'd0;
                B_r   <= 10'd0;
                Sum_r <= 11'd0;
            end
            else
            begin
                A_r   <= A_i;
                B_r   <= B_i;
                Sum_r <= Sum_c;
            end
        end

    endmodule

    Once you synthesize this block with the correct timing constraints, it should run at 266 MHz with no problem on Stratix II.

    It will have a latency of 2 clocks, but can produce a new result every clock cycle.
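
    To confirm the 2-clock latency before looking at any gate-level timing, a small self-checking RTL testbench is usually enough. The following is only a sketch: the testbench name, the stimulus values, and the 3.75 ns period are my own choices for illustration, and it assumes it is compiled together with the adder10bit module above.

    ```verilog
    `timescale 1ns/1ps
    // RTL latency check for the two-stage adder10bit (illustrative sketch).
    // Testbench name, stimulus, and clock period are assumptions.
    module adder10bit_tb;
        reg                clk   = 1'b0;
        reg                reset = 1'b1;
        reg  signed [9:0]  a     = 10'sd0;
        reg  signed [9:0]  b     = 10'sd0;
        wire signed [10:0] sum;

        adder10bit dut (
            .clk_i   (clk),
            .reset_i (reset),
            .A_i     (a),
            .B_i     (b),
            .Sum_o   (sum)
        );

        always #1.875 clk = ~clk;      // 3.75 ns period

        initial begin
            repeat (2) @(posedge clk); // hold reset for two rising edges
            @(negedge clk);            // change inputs away from the active edge
            reset = 1'b0;
            a     = 10'sd100;
            b     = -10'sd58;
            @(posedge clk);            // edge 1: operands captured in A_r/B_r
            @(posedge clk);            // edge 2: sum captured in Sum_r
            #0.1;                      // let the register output settle
            if (sum === 11'sd42)
                $display("PASS: sum = %0d two rising edges after the inputs", sum);
            else
                $display("FAIL: sum = %0d, expected 42", sum);
            $finish;
        end
    endmodule
    ```

    If this passes in RTL simulation but the gate-level run shows the result later, the extra time is coming from delays outside the register-to-register path rather than from extra pipeline stages.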

    My guess is that your primary delay is in the input or output paths, i.e. the IO buffers have a lot of delay, so by the time you reach the adder you've already used up most of your clock cycle. If 1 cycle latency is necessary, you can try replacing the above module with this one:

    module adder10bit (
        input                clk_i,
        input                reset_i,
        input  signed [9:0]  A_i,
        input  signed [9:0]  B_i,
        output signed [10:0] Sum_o
    );

        reg  signed [10:0] Sum_r;  // registered result
        wire signed [10:0] Sum_c;  // combinational sum

        assign Sum_o = Sum_r;
        assign Sum_c = A_i + B_i;  // Actual adder, fed directly from the inputs

        // Synchronous reset; nonblocking assignment for the clocked register.
        always @(posedge clk_i)
        begin
            if (reset_i)
            begin
                Sum_r <= 11'd0;
            end
            else
            begin
                Sum_r <= Sum_c;
            end
        end

    endmodule

    This will have the 1 cycle latency you want, but now the cycle time is limited by the input data path. If you have a lot of combinational logic there, you could be stuck.

    Always make sure your clocks are defined in your SDC file. If the clock is not defined properly, you could be failing because synthesis is simply not optimizing the path for that high a speed.

    Hope this helps.

    Pete
  • Altera_Forum
    Honored Contributor

    If you're doing a schematic, then megafunctions are probably the way to go, but I think parallel_add is designed more for multiple adders in parallel than for just a single adder. If you're doing HDL, then with the file open, Edit -> Insert -> Template has some decent examples.

  • Altera_Forum
    Honored Contributor

    Thanks for your answers. I also tried the RTL adder instead of the megafunction. Even then, at 266 MHz, the output in simulation appears after 3 clock cycles.

    Can anyone explain how to find a module's operating frequency in the Altera FPGA reports? My assumption is that the critical path gives a rough estimate of the clock frequency. Can anyone explain how to find the critical path for a design?

    I am using the Classic Timing Analyzer and specifying only the clock frequency, around 266 MHz. Are there any constraints that will help meet timing and improve optimization? Currently Tco is showing around 6.5 ns, which corresponds to about 153 MHz.

    --Sam
  • Altera_Forum
    Honored Contributor

    There are three different things:

    1) Tco -> This is the clock to out. Again, I'm assuming you're going to put more logic around the adder, and so this path should be ignored since your adder output won't go directly out. (Minimally, you will want to add another set of registers so they can be put into the IO cell and get better timing. You would also have to use a PLL). But the Tco isn't equatable to an Fmax as it's only part of the path. If it takes 6.5ns from a clock entering the device to data going out, that data will have to be clocked in by some other device. So you'll have board delay and setup time of the other device, making the path even slower. You'll also have clock skew across the board, which can hurt or help. But until you factor all of these things in, there is no way to equate Tco to an Fmax.

    2) Internal paths. These are register to register, and since Classic TAN knows the clock feeding both, it can give you a full calculation and will report an Fmax. (There are cases where Fmax doesn't make sense, like when going between clock domains, so it's recommended not to always think in terms of Fmax, but for a single clock domain it's generally all right.)

    3) Finally, there is latency, which is the number of registers to get through the device. If you put down IP with multiple registers along the data path, then it would take three clock cycles. If you're doing RTL, you can look at the code and know exactly how many clock cycles it takes to get through. Of course, if you're looking at a timing simulation, that 6.5ns Tco tacked onto the end may span multiple clock cycles, even if the data "got out" a few cycles before.
  • Altera_Forum
    Honored Contributor

    Thanks. I am slowly beginning to understand why timing is not being met.

    My question is: why does the output in simulation appear after 3 clock cycles when Tco is around 1.85 clock cycles? Tsu and Th are around one clock cycle or less. I used anakha's RTL code above for synthesis.

    My second request: are there any other timing constraints that need to be set, apart from the clock (266 MHz), to meet timing? Sorry, I am an ASIC guy working with FPGAs for the first time. I know timing constraint setup very well, but not in FPGA terms. It would be a great help if anyone could help solve this issue. The same adder runs at around 1 GHz with my 90 nm ASIC libraries, yet I am unable to synthesize it for 150 MHz in the FPGA. There must be some mistake in my FPGA constraints.

    --Sam
  • Altera_Forum
    Honored Contributor

    If you're an ASIC guy and have any PrimeTime/Design Compiler experience, I would use TimeQuest instead. Enable it by going to Assignments -> Settings -> Timing Analysis Settings and selecting TimeQuest. It is similar to PrimeTime and uses SDC (Synopsys Design Constraints) as the input constraint file. There will be a learning curve with it, so I strongly recommend going through the documentation. In fact, since you're new to Classic TAN, I would recommend ignoring it completely and using TimeQuest, since that is the timing analyzer that will support newer families. (It's really head and shoulders above Classic TAN for what it can do, but it takes a few constraints to get it going.)

    As for the 3 clock cycles, I haven't looked at it and maybe anakha will chime in. You might want to examine internal registers to see exactly where the data is on each clock cycle.

    On that note, most users don't do timing simulations. They do RTL sims (which show latency and functionality), and then do static timing analysis. If RTL shows 2 clock cycles, and you meet static timing analysis, then your timing sim would just show the same thing, 2 clock cycles. (IO timing is part of static timing analysis, and again needs to be done with a full round-trip analysis, which is what TimeQuest is designed to do.)
  • Altera_Forum
    Honored Contributor

    Hi Sam:

    I agree with Rsync that TimeQuest is the way to go. It uses an SDC file that is very similar to the ASIC flow, and gives you much more detail on the failing paths.

    If the only thing you are synthesizing is a single adder directly connected to the IO's of the FPGA, I can easily see where you could be failing 266 MHz.

    The IO cells are significantly slower than the core cells. Also, you may lose lots of time if the input and output pins are located right next to each other in the design. (I.e. you may lose all your time because the signals have to be buffered and routed all across the die.)

    If you have your clock defined in Classic TAN and you manually open TimeQuest, it should auto-generate a very basic SDC file and run. Then you can get a better idea of the timing issues.

    If you can simulate the RTL -vs- Gate and send a snapshot of what you're seeing for latency, maybe that could shine some light as well.

    One other problem might be that the SDF file isn't getting imported properly into the gate-level simulation. Then all the cells will default to a very pessimistic delay, causing you all kinds of problems. (I thought it used to be 100 ns, but if it's 10 ns, that would cause your 3 cycle delay.)

    Since you're doing gate-level simulation, you could also see where the delay is coming from by looking at the output of each input cell, and at the input and output of each register.

    That way you could measure the delays and compare them against TimeQuest.
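
    As a rough sketch of that kind of measurement at the top level, the testbench below prints when the output bus changes relative to the clock. It assumes the gate-level netlist exposes a top-level module named adder10bit with the same ports as the RTL; the probe module name and the stimulus values are made up for illustration. The same idea can be applied to internal register pins once you know their names in the netlist.

    ```verilog
    `timescale 1ns/1ps
    // Gate-level delay probe (illustrative sketch, hypothetical names).
    // Prints every change on Sum_o with the absolute time, the number of
    // rising clock edges seen so far, and the time since the latest edge.
    module adder10bit_delay_probe;
        reg                clk   = 1'b0;
        reg                reset = 1'b1;
        reg  signed [9:0]  a     = 10'sd0;
        reg  signed [9:0]  b     = 10'sd0;
        wire signed [10:0] sum;

        integer  edges     = 0;    // rising edges seen so far
        realtime last_edge = 0.0;  // time of the latest rising edge

        adder10bit dut (
            .clk_i   (clk),
            .reset_i (reset),
            .A_i     (a),
            .B_i     (b),
            .Sum_o   (sum)
        );

        always #1.875 clk = ~clk;  // 3.75 ns period

        always @(posedge clk) begin
            edges     = edges + 1;
            last_edge = $realtime;
        end

        always @(sum)
            $display("%t: Sum_o = %0d (edge %0d, %.3f ns after it)",
                     $realtime, sum, edges, $realtime - last_edge);

        initial begin
            $timeformat(-9, 3, " ns", 12);
            repeat (4) @(posedge clk);
            @(negedge clk);        // change inputs away from the active edge
            reset = 1'b0;
            a     = 10'sd100;
            b     = -10'sd58;
            repeat (8) @(posedge clk);
            $finish;
        end
    endmodule
    ```

    Keep in mind that if the clock-to-pin delay is longer than one period, the change is reported against a later edge than the one that launched it, so add whole periods when comparing with the Tco in the timing report. If the printed delays look nothing like the report at all, that would point back to the SDF import.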

    Pete