Forum Discussion

Altera_Forum
12 years ago

How to force Quartus to synthesize constant multiplies

I've got a decimator block written in Verilog. It's the standard structure: flops, then constant multiplication, then an accumulate tree.

However, Quartus is using the DSP blocks for the multiplies and then failing timing specs. This seems like an early stage problem as it starts consuming DSP resources right at the early portions of Analysis & Synthesis.

Is there a way to make Quartus synthesize those constant multiplies short of rewriting the block with shifts and adds by hand? I'm on Quartus 12.1sp1 on Windows 7 64-bit.

Thanks.
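For readers wondering what "shifts and adds by hand" amounts to, here is a minimal Python sketch of the arithmetic only, not of anything Quartus does internally. It recodes a constant into canonical signed-digit form so the multiply becomes a short list of shifted adds and subtracts. The function names `csd_terms` and `shift_add_multiply` are made up for this illustration.

```python
def csd_terms(constant):
    """Return (sign, shift) pairs such that sum(sign * 2**shift) == constant.

    Uses canonical signed-digit recoding: no two adjacent bit positions
    are both nonzero, which keeps the number of add/subtract terms small.
    """
    terms = []
    shift = 0
    n = constant
    while n != 0:
        if n & 1:
            # n % 4 == 1 -> digit +1, n % 4 == 3 -> digit -1.
            # Either way the remainder becomes divisible by 4.
            digit = 2 - (n & 3)
            terms.append((digit, shift))
            n -= digit
        n >>= 1
        shift += 1
    return terms


def shift_add_multiply(x, constant):
    """Multiply x by a constant using only shifts, adds and subtracts."""
    return sum(sign * (x << shift) for sign, shift in csd_terms(constant))


if __name__ == "__main__":
    # 7 = 8 - 1, so x * 7 needs one shift and one subtract.
    print(csd_terms(7))
    print(shift_add_multiply(13, 7))
```

Each `(sign, shift)` pair corresponds to one wired shift feeding an adder or subtractor, so the length of the list gives a rough idea of how many adders a hand-written version of a given coefficient would need.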

20 Replies

  • Altera_Forum

    As a point of reference, Stratix V now allows each DSP block to have 8 fixed coefficients that can be muxed from the coeff_sel lines, and if I'm reading correctly, the coefficients can be up to 27 bits. This might actually be a way to avoid the RAM-based multipliers that I've seen used for higher-speed designs when the multiplier value is fixed.

  • Altera_Forum

    --- Quote Start ---

    However, if one input is constant, we don't need that, only shifts (plus adders). So the two cases are distinct from a design perspective. The original post is not obsessed with shifts but wants a simplified design.

    Personally I prefer DSP blocks. I might target one or so coefficients (say) as a power of 2 to save a few multipliers, just in case.

    --- Quote End ---

    Timing is the issue, primarily. I'm shoving things around at 150MHz to 200MHz on an Arria V. Not impossibly fast, but one definitely has to be alert to what is actually happening in synthesis.

    The multipliers want to finish the carry-propagate add before giving the result. Unfortunately, I have an add accumulation tree right after the multiplier, so the carry-propagate is effectively useless *and* soaks up a big chunk of time. I'd rather dump the final carry-save state and let the accumulate tree absorb it.
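To make the carry-save point concrete, here is a small Python model of the idea, limited to non-negative integers. A 3:2 compressor reduces three operands to a sum word and a carry word with no carry propagation, and the single carry-propagate add happens once at the end of the tree. The names `carry_save_add` and `accumulate_carry_save` are hypothetical; this is a sketch of the arithmetic, not of the DSP block's actual internals.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: return (sum_word, carry_word) with a + b + c == sum_word + carry_word.

    Every bit position is computed independently, so there is no carry chain.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def accumulate_carry_save(operands):
    """Accumulate operands in carry-save form, then resolve carries once."""
    sum_word, carry_word = 0, 0
    for x in operands:
        sum_word, carry_word = carry_save_add(sum_word, carry_word, x)
    # The only carry-propagate addition in the whole accumulation.
    return sum_word + carry_word


if __name__ == "__main__":
    print(carry_save_add(1, 1, 1))
    print(accumulate_carry_save([3, 5, 7, 11]))
```

The complaint in the post is that the multiplier resolves its carries before handing over the product, so the accumulate tree that follows pays for a carry-propagate add it could otherwise have absorbed.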
  • Altera_Forum

    --- Quote Start ---

    Timing is the issue, primarily. I'm shoving things around at 150MHz to 200MHz on an Arria V. Not impossibly fast, but one definitely has to be alert to what is actually happening in synthesis.

    The multipliers want to finish the carry-propagate add before giving the result. Unfortunately, I have an add accumulation tree right after the multiplier, so the carry-propagate is effectively useless *and* soaks up a big chunk of time. I'd rather dump the final carry-save state and let the accumulate tree absorb it.

    --- Quote End ---

    Well, I regularly get timing problems on multipliers (Stratix IV @ 368MHz), and then I remember what to do: put a pipeline register after the multiplier result (in addition to the block's own registers). This makes a big difference. I was afraid at times that this register might be packed back into the block, but apparently that never happened. It seems that, otherwise, the routing from these multiplier blocks to the fabric is too slow. If you run into latency problems, you might drop one of the block's internal pipeline registers, if applicable.

    On the other hand, I must confirm that with FPGAs we regularly feed constants into multipliers, e.g. coefficients, and we don't aim to design multipliers as simple shift/add logic. DSP blocks are usually fast.
  • Altera_Forum

    I haven't yet heard any results from forcing a logic implementation of the constant multipliers for the present problem.

  • Altera_Forum

    --- Quote Start ---

    I haven't yet heard any results from forcing a logic implementation of the constant multipliers for the present problem.

    --- Quote End ---

    Sorry for the lag, but I've been fighting a couple different issues.

    Enforcing the logic implementation removes the multiplier usage but loses significant speed. I would have to hand code a compression tree to win back enough speed. I may do that, at some point. If so, I will add to this post.

    However, unless I hit a speed wall, I probably won't do that. I'm finding that I am more than a bit underwhelmed at the speed performance of the Arria V's. I did not expect 250+MHz in Verilog to be this problematic in a 28nm technology chip.
  • Altera_Forum

    I remember someone saying that it takes time to get values into and out of the DSP blocks - so to do a multiply in logic probably requires that you add another pipeline stage (or two) somewhere.

  • Altera_Forum

    --- Quote Start ---

    I remember someone saying that it takes time to get values into and out of the DSP blocks - so to do a multiply in logic probably requires that you add another pipeline stage (or two) somewhere.

    --- Quote End ---

    Yes, I've seen the fitter have no problems with the DSP blocks themselves, but then it decides to put the register before or after the DSP halfway across the chip to move it closer to the neighbouring bit of logic. So adding redundant pipeline stages before and after the DSP gives the fitter a bit of extra leeway on timing.

    You can get the same problem with RAM Blocks too.
  • Altera_Forum

    --- Quote Start ---

    Yes, I've seen the fitter have no problems with the DSP blocks themselves, but then it decides to put the register before or after the DSP halfway across the chip to move it closer to the neighbouring bit of logic. So adding redundant pipeline stages before and after the DSP gives the fitter a bit of extra leeway on timing.

    You can get the same problem with RAM Blocks too.

    --- Quote End ---

    I took this advice to heart and double-pipelined the entry and exit point (2 flops in a row on both input and output).

    The system still can't hit 320MHz. It's going global clock->DSP block->flop from global clock and it can't seem to meet 320MHz for setup on the multipliers (at least at slow corners--fast corners claim to pass). And we're not talking a small miss here. On a 3.125ns clock cycle it misses by almost a full nanosecond.

    I did check the datasheet for the chip (and checked that my device settings are correct), and it claims 370MHz is supposed to be the minimum on these. Is there some file where I can look at exactly what this thing is doing in the corners? That seems to be an *enormous* variation from fast to slow.

    18x18 multiplier in 28nm and the system can't hold 320MHz while completely pipelined and with a constant on one input? Double pipelining on input and output and over half my delay is in interconnect?

    Something feels broken ...
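As a quick sanity check on the numbers in this post, the sketch below converts a target clock and a worst-case setup slack into an estimated achievable Fmax. The post only says the miss is "almost a full nanosecond", so the -0.9 ns slack used here is an assumed, illustrative value, and the helper names `period_ns` and `fmax_mhz` are made up.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def fmax_mhz(target_mhz, slack_ns):
    """Estimate achievable Fmax from the target clock and worst setup slack.

    slack_ns is negative when the path fails timing, so subtracting it
    lengthens the period the path actually needs.
    """
    return 1000.0 / (period_ns(target_mhz) - slack_ns)


if __name__ == "__main__":
    print(period_ns(320))        # 3.125 ns, matching the figure in the post
    print(fmax_mhz(320, -0.9))   # assumed slack; lands somewhere near 248 MHz
```

Under that assumed slack, the path would close at roughly 248 MHz, which gives a feel for how far the slow-corner result sits below the 370MHz datasheet figure quoted above.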
  • Altera_Forum

    How are these multipliers implemented? I've also had experience showing that inferred mult-add trees won't clock as fast as ones built with megafunctions. And to make it clock faster, LUTs had to be used instead of DSP blocks.

  • Altera_Forum

    --- Quote Start ---

    How are these multipliers implemented? I've also had experience showing that inferred mult-add trees won't clock as fast as ones built with megafunctions. And to make it clock faster, LUTs had to be used instead of DSP blocks.

    --- Quote End ---

    No tree. Just a pure multiply. Flop->flop->multiply->flop->flop.

    TimeQuest critical path shows data arrival path as global clock->DSP block->flop (CLKCTRL_G2->DSP_X70_Y69_N0->FF_X71_Y69_N14) with data required path being global clock->flop (CLKCTRL_G2->FF_X71_Y69_N14).

    Don't really see how I can get any cleaner than that ...

    I have a ticket open with Altera. I also note that the release notes for the latest Quartus mention that timing models have been changed for the Arria V series.

    I'll report back if something changes.