I've got a decimator block written in Verilog. It's the standard structure: flops, then constant multiplication, then an accumulate tree. However, Quartus is using the DSP blocks for the multiplies and then failing timing specs. This seems like an early stage problem as it starts consuming DSP resources right at the early portions of Analysis & Synthesis. Is there a way to make Quartus synthesize those constant multiplies short of rewriting the block with shifts and adds by hand? I'm on Quartus 12.1sp1 on Windows 7 64-bit. Thanks.

I don't quite get what the problem is. DSP blocks are the fastest way to do multiplies on the chip, so why would you not want to use them? Sometimes you can have problems routing into or out of a DSP block, but the solution for that is to add more pipeline registers around the multiplier, so that the fitter can shorten the distance between stages and put a register right next to the multiplier.

PS. I've only had problems with the above when setting the clock speed to >350MHz on a Stratix IV and V.

--- Quote Start ---
I don't quite get what the problem is. DSP blocks are the fastest way to do multiplies on the chip, so why would you not want to use them?
--- Quote End ---

That's true if both inputs are variables. However, when one side is a constant, using the multiplier is probably the slowest option. For example, 30*X would be DRAMATICALLY faster done as 32*X - 2*X. A multiplier can't even get close.

True, but you'll have to work that out for yourself. AFAIK, the synthesizer cannot work out the 2^n decomposition of constants itself. The approach would also get complicated as the constant values get larger and larger. Your example is very simple and just requires a single adder. As the constants get to values requiring 10s or 100s of adders, a DSP block might be easier.

--- Quote Start ---
True, but you'll have to work that out for yourself. AFAIK, the synthesizer cannot work out the 2^n decomposition of constants itself. The approach would also get complicated as the constant values get larger and larger. Your example is very simple and just requires a single adder. As the constants get to values requiring 10s or 100s of adders, a DSP block might be easier.
--- Quote End ---

You can get almost every constant out to 16 bits with right around 5 operations if you allow subtractions. I'm really surprised that the tools can't work out these constants. Even the low-end ASIC synthesis tools have been able to do this kind of thing for quite a few years. Sigh. I guess I have to create some code to solve the knapsack problem and generate Verilog. Again. Thanks for the advice.
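For anyone wanting a starting point for such a generator, here is a minimal Python sketch. It is not a knapsack or common-subexpression solver; it uses canonical signed digit (CSD) recoding, a simpler substitute technique that rewrites a constant as a sum and difference of powers of two with no two adjacent nonzero digits. The function names `csd_terms` and `verilog_expr` and the operand name `x` are made up for illustration, and the emitted expression assumes the caller has already sized `x` wide enough for the shifts.

```python
def csd_terms(k):
    """Return [(shift, sign), ...] such that k == sum(sign * 2**shift).

    Signs are +1 or -1, and no two returned shifts are adjacent (CSD form).
    This sketch handles non-negative constants only.
    """
    if k < 0:
        raise ValueError("sketch handles non-negative constants only")
    terms = []
    shift = 0
    while k != 0:
        if k & 1:
            # Low two bits 01 -> digit +1; low two bits 11 -> digit -1,
            # which turns a run of ones into a single subtraction.
            digit = 2 - (k & 3)
            terms.append((shift, digit))
            k -= digit
        k >>= 1
        shift += 1
    return terms


def verilog_expr(k, operand="x"):
    """Build a shift/add/subtract expression string for k * operand."""
    terms = sorted(csd_terms(k), reverse=True)  # largest shift first
    if not terms:
        return "0"
    pieces = []
    for index, (shift, sign) in enumerate(terms):
        term = operand if shift == 0 else f"({operand} << {shift})"
        if index == 0:
            pieces.append(term)  # leading CSD digit of a positive k is +1
        else:
            pieces.append(("+ " if sign > 0 else "- ") + term)
    return " ".join(pieces)


print(verilog_expr(30))  # (x << 5) - (x << 1)
```

The number of adders or subtractors needed is `len(csd_terms(k)) - 1`, so the same function can be used to survey how many operations a given set of coefficients would cost before committing to a hand-built structure.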

How to force Quartus to synthesize constant multiplies

20 Replies

Altera_Forum
Honored Contributor
12 years ago
As a point of reference, Stratix V now allows each DSP block to have 8 fixed coefficients that can be muxed from the coeff_sel lines, and if I'm reading correctly the coefficients can be up to 27 bits. This might actually be a way to avoid the RAM-based multipliers that I've seen used for higher speed designs when the multiplier value is fixed.
Altera_Forum
Honored Contributor
12 years ago
--- Quote Start ---

However, if one input is constant we don't need that, apart from a shift only (or a shift plus an adder). So the two cases are distinct from a design perspective. The original post is not obsessed with shifts but wants a simplified design.

Personally I prefer DSP blocks. I might target one or so coeffs (say) as a power of 2 to save a few mults, just in case.
--- Quote End ---

Timing is the issue, primarily. I'm shoving things around at 150MHz to 200MHz on an Arria V. Not impossibly fast, but one definitely has to be alert to what is actually happening in synthesis.

The multipliers want to finish the carry-propagate add before giving the result. Unfortunately, I have an add accumulation tree right after the multiplier, so the carry-propagate is effectively useless *and* soaks up a big chunk of time. I'd rather dump the final carry-save state and let the accumulate tree absorb it.
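To make the carry-save point concrete, here is a small Python model (not hardware, just the arithmetic). It assumes non-negative integers, and the names `carry_save_add` and `accumulate_carry_save` are invented for this sketch. Each step is a 3:2 compressor that keeps the running total as a separate sum word and carry word, so no carry ever ripples until the single final add.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: reduce three non-negative ints to (sum_word, carry_word).

    Every bit position is computed independently, so no carry propagates.
    The invariant is a + b + c == sum_word + carry_word.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def accumulate_carry_save(values):
    """Fold a list of values into one (sum_word, carry_word) pair."""
    sum_word, carry_word = 0, 0
    for value in values:
        sum_word, carry_word = carry_save_add(sum_word, carry_word, value)
    return sum_word, carry_word


# Four constant-multiply results feeding the accumulate stage.
products = [30 * x for x in (3, 7, 11, 200)]
sum_word, carry_word = accumulate_carry_save(products)

# The only carry-propagate addition in the whole chain.
total = sum_word + carry_word
print(total == sum(products))  # True
```

A DSP block that resolves its own carry-propagate add before handing over the product spends that delay once per multiplier, whereas in the model above it is paid once at the very end.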
Altera_Forum
Honored Contributor
12 years ago
--- Quote Start ---
Timing is the issue, primarily. I'm shoving things around at 150MHz to 200MHz on an Arria V. Not impossibly fast, but one definitely has to be alert to what is actually happening in synthesis.

The multipliers want to finish the carry-propagate add before giving the result. Unfortunately, I have an add accumulation tree right after the multiplier, so the carry-propagate is effectively useless *and* soaks up a big chunk of time. I'd rather dump the final carry-save state and let the accumulate tree absorb it.
--- Quote End ---

Well, I regularly get timing problems on mults (Stratix IV @ 368MHz), and then I remember what to do: put a pipeline register after the mult result (in addition to the block's own registers). This makes a big difference. I was afraid at times that this pipeline register might be repacked into the block, but apparently that never happened. It seems that otherwise the routing from these mult blocks to the fabric is too poor. If you get latency problems, you might drop one of the block's internal pipeline registers, if applicable.

On the other hand, I must confirm that with FPGAs we regularly feed constants into mults, e.g. coefficients, and we don't aim to design mults as simple shift/add. DSP blocks are usually fast.
Altera_Forum
Honored Contributor
12 years ago
I haven't yet heard any results from enforcing logic implementation of the constant multipliers for the present problem.
Altera_Forum
Honored Contributor
12 years ago
--- Quote Start ---
I haven't yet heard any results from enforcing logic implementation of the constant multipliers for the present problem.
--- Quote End ---

Sorry for the lag, but I've been fighting a couple different issues.

Enforcing the logic implementation removes the multiplier usage but loses significant speed. I would have to hand code a compression tree to win back enough speed. I may do that, at some point. If so, I will add to this post.

However, unless I hit a speed wall, I probably won't do that. I'm finding that I am more than a bit underwhelmed at the speed performance of the Arria V's. I did not expect 250+MHz in Verilog to be this problematic in a 28nm technology chip.
Altera_Forum
Honored Contributor
12 years ago
I remember someone saying that it takes time to get values into and out of the DSP blocks - so to do a multiply in logic probably requires that you add another pipeline stage (or two) somewhere.
Altera_Forum
Honored Contributor
12 years ago
--- Quote Start ---
I remember someone saying that it takes time to get values into and out of the DSP blocks - so to do a multiply in logic probably requires that you add another pipeline stage (or two) somewhere.
--- Quote End ---

Yes, I've seen the fitter have no problems with the DSP blocks themselves, but then decide to put the register before or after the DSP halfway across the chip, to move it closer to the neighbouring bit of logic. So adding redundant pipeline stages before and after the DSP gives the fitter a bit of extra leeway on the timing.

You can get the same problem with RAM Blocks too.
Altera_Forum
Honored Contributor
12 years ago
--- Quote Start ---
Yes, I've seen the fitter have no problems with the DSP blocks themselves, but then decide to put the register before or after the DSP halfway across the chip, to move it closer to the neighbouring bit of logic. So adding redundant pipeline stages before and after the DSP gives the fitter a bit of extra leeway on the timing.

You can get the same problem with RAM Blocks too.
--- Quote End ---

I took this advice to heart and double-pipelined the entry and exit point (2 flops in a row on both input and output).

The system still can't hit 320MHz. The path goes global clock->DSP block->flop, clocked from the same global clock, and it can't meet 320MHz for setup on the multipliers (at least at the slow corners; the fast corners claim to pass). And we're not talking a small miss here. On a 3.125ns clock cycle it misses by almost a full nanosecond.

I did check the datasheet for the chip (and checked that my device settings are correct), and it claims 370MHz is supposed to be the minimum on these. Is there some file where I can look at exactly what this thing is doing in the corners? That seems to be an *enormous* variation from fast to slow.
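To put numbers on that miss, here is a quick back-of-the-envelope check in Python. The helper names `period_ns` and `fmax_mhz` are just for this sketch, and the slack of -1.0ns is an assumed round figure standing in for "almost a full nanosecond", so the result is a rough lower bound rather than a reported TimeQuest value.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def fmax_mhz(target_mhz, setup_slack_ns):
    """Fastest clock the path could meet, given its setup slack at the target.

    Negative slack means the path fails at the target frequency.
    """
    return 1000.0 / (period_ns(target_mhz) - setup_slack_ns)


target = 320.0
assumed_slack = -1.0  # assumption: "almost a full nanosecond" rounded to 1.0 ns

print(period_ns(target))                          # 3.125
print(round(fmax_mhz(target, assumed_slack), 1))  # 242.4
```

In other words, a miss of that size on a 3.125ns period corresponds to a path that only closes somewhere a little above 240MHz.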

18x18 multiplier in 28nm and the system can't hold 320MHz while completely pipelined and with a constant on one input? Double pipelining on input and output and over half my delay is in interconnect?

Something feels broken ...
Altera_Forum
Honored Contributor
12 years ago
How are these multipliers implemented? I've also had experience showing that inferred mult-add trees won't clock as fast as megafunctions. And to make it clock faster, LUTs had to be used in place of DSP blocks.
Altera_Forum
Honored Contributor
12 years ago
--- Quote Start ---
How are these multipliers implemented? I've also had experience showing that inferred mult-add trees won't clock as fast as megafunctions. And to make it clock faster, LUTs had to be used in place of DSP blocks.
--- Quote End ---

No tree. Just a pure multiply. Flop->flop->multiply->flop->flop.

TimeQuest critical path shows data arrival path as global clock->DSP block->flop (CLKCTRL_G2->DSP_X70_Y69_N0->FF_X71_Y69_N14) with data required path being global clock->flop (CLKCTRL_G2->FF_X71_Y69_N14).

Don't really see how I can get any cleaner than that ...

I have a ticket open with Altera. I also note that the release notes for the latest Quartus mention that timing models have been changed for the Arria V series.

I'll report back if something changes.
