Multiplying a register by four

Question

Hi all,

I have just started using NIOS II, and I have a small question:

What is the best way to multiply a register by four?

So far I have found the following alternatives (suppose r16 contains the value to be multiplied):

1) Generated by gcc when accessing an array of 32-bit integers. Does the compiler also use this instruction on the small and economy versions of Nios II?

muli r16, r16, 4

2)

addi r16, r16, r16

3)

slli r16, r16, 2

Best Regards,

Paolo

altera_forum · Answer

Ehm... of course it was:

2)

add r16, r16, r16

add r16, r16, r16

Paolo

altera_forum · Answer

On the economy version (Nios II/e) you don't get multiplier hardware. On the other two versions you have the multiplier, which also performs shift operations (if you shift left by 2, i.e. multiply by 4, I wouldn't doubt that you end up just multiplying anyway). Your solutions 1 and 3 probably take exactly the same amount of time, whereas number 2 would take longer, I assume, since I don't see them being able to do the additions in parallel (maybe they can). Either way you're probably talking about 1 clock cycle versus 2 or 3 cycles. Hopefully you don't need better performance than that.

altera_forum · Answer

OK, I think I will use option 3, which is able to shift in 2 cycles even on the economy version of the Nios II...

Thanks!!!

Paolo

altera_forum · Answer

Be careful when shifting signed numbers (you don't want to modify the sign bit).
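To make the three options and the signed-number caution concrete, here is a minimal C sketch. The function names are hypothetical and chosen only for illustration. It assumes 32-bit two's-complement integers, and the signed variant relies on the implementation-defined conversion from unsigned back to signed that gcc defines as wrap-around. The C standard leaves left-shifting a negative signed value undefined, so the signed helper routes the shift through an unsigned type.

```c
#include <assert.h>
#include <stdint.h>

/* Option 1: plain multiply (the compiler is free to emit a multiply or a shift). */
static uint32_t times4_mul(uint32_t x) { return x * 4u; }

/* Option 2: two dependent additions: x + x = 2x, then 2x + 2x = 4x. */
static uint32_t times4_add(uint32_t x)
{
    x += x;
    x += x;
    return x;
}

/* Option 3: logical shift left by two bit positions. */
static uint32_t times4_shift(uint32_t x) { return x << 2; }

/* Signed input: shift the unsigned bit pattern, then convert back.
 * Assumption: the conversion wraps modulo 2^32, as gcc documents. */
static int32_t times4_signed(int32_t x)
{
    return (int32_t)((uint32_t)x << 2);
}

/* Sanity check that all three unsigned variants agree for one input. */
static void check_times4(uint32_t x)
{
    assert(times4_mul(x) == times4_add(x));
    assert(times4_mul(x) == times4_shift(x));
}
```

On unsigned values all three variants wrap modulo 2^32 in the same way, so they agree even when the result overflows.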

altera_forum · Answer

The best option depends on how soon the result of the multiply is used by other instructions, which FPGA family you are using, and which Nios II core you are using.

In general, option 2 is the best since it has a throughput of 0.5 results per cycle and a latency of 2 cycles in all combinations.

BTW, a throughput of 0.5 results per cycle means you get a multiply result every 1/0.5 = 2 cycles, and a latency of 2 cycles means the result isn't ready for 2 cycles.

Let me explain more. On Stratix I and Stratix II devices, the Nios II/s and Nios II/f use the hardware multipliers to perform multiplies. The throughput is one multiply per cycle but with a 3-cycle latency. If you try to use the result of a multiply within one or two cycles, the dependent instruction is stalled, which results in a throughput of 0.33 multiplies per cycle and a latency of 3 cycles.

For example, this code:

muli r16, r16, 4

xor r4, r5, r16

will take 4 cycles to execute because the xor is stalled for 2 cycles since it uses the muli result.

However, this code:

muli r16, r16, 4

muli r17, r17, 4

muli r18, r18, 4

xor r4, r5, r16

will also take 4 cycles to execute because the non-dependent muli instructions to r17 and r18 (or any other non-dependent instructions) don't stall, and the xor that uses r16 is far enough away from the muli to r16 not to stall.

So this code achieves multiplies with a throughput of one per cycle, and the latency of 3 cycles is hidden by the non-dependent instructions.
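The cycle counts in the two examples can be reproduced with a toy model, sketched below in C. It is not a description of the real Nios II pipeline: it assumes strict in-order issue of at most one instruction per cycle, a 3-cycle latency for muli as stated above, and a 1-cycle latency for xor and add, which is an assumption. The names `Instr` and `count_cycles` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS 32

typedef struct {
    int dest;    /* destination register number */
    int src_a;   /* first source register number */
    int src_b;   /* second source register, or -1 for an immediate operand */
    int latency; /* cycles from issue until a dependent instruction may issue */
} Instr;

/* Returns the number of cycles needed to issue all n instructions in order,
 * stalling an instruction until every source register it reads is ready. */
static int count_cycles(const Instr *prog, size_t n)
{
    int ready[NUM_REGS] = {0}; /* cycle at which each register value is usable */
    int next_issue = 0;

    assert(prog != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int issue = next_issue;
        if (prog[i].src_a >= 0 && ready[prog[i].src_a] > issue)
            issue = ready[prog[i].src_a]; /* stall on first source */
        if (prog[i].src_b >= 0 && ready[prog[i].src_b] > issue)
            issue = ready[prog[i].src_b]; /* stall on second source */
        ready[prog[i].dest] = issue + prog[i].latency;
        next_issue = issue + 1;
    }
    return next_issue;
}

/* muli r16, r16, 4 followed directly by xor r4, r5, r16 */
static int cycles_dependent(void)
{
    const Instr prog[] = { {16, 16, -1, 3}, {4, 5, 16, 1} };
    return count_cycles(prog, 2);
}

/* Three independent muli instructions, then xor r4, r5, r16 */
static int cycles_interleaved(void)
{
    const Instr prog[] = {
        {16, 16, -1, 3}, {17, 17, -1, 3}, {18, 18, -1, 3}, {4, 5, 16, 1}
    };
    return count_cycles(prog, 4);
}
```

Under these assumptions both `cycles_dependent()` and `cycles_interleaved()` return 4, matching the counts given in the post: the first sequence spends two of its cycles stalled, while the second fills them with useful work.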

Option 3 (using a shift) has the same performance as the multiply on Nios II/f and Nios II/s on Stratix I and Stratix II because we actually use the hardware multiplier to perform shifts and rotates.
