The best option depends on how soon the result of the multiply is used by other instructions,
which FPGA family you are using, and which Nios II you are using.
In general, option 2 is the best since it has a throughput of 0.5 multiplies per cycle and a latency of 2 cycles
in all combinations.
BTW, a throughput of 0.5 multiplies per cycle means you get a multiply result every 1/0.5 = 2 cycles, and
a latency of 2 cycles means the result isn't ready to be used until 2 cycles after the multiply.
Let me explain more. On Stratix I and Stratix II devices, the Nios II/s and Nios II/f
use the hardware multipliers to perform multiplies. The throughput is one multiply per cycle, but
with a 3-cycle latency. If you try to use the result of a multiply within one or two cycles, the dependent
instruction is stalled, which results in a throughput of 0.33 multiplies per cycle and a latency of 3 cycles.
For example, this code:
    muli r16, r16, 4
    xor  r4, r5, r16
will take 4 cycles to execute because the xor is stalled for 2 cycles since it uses the muli result.
However, this code:
    muli r16, r16, 4
    muli r17, r17, 4
    muli r18, r18, 4
    xor  r4, r5, r16
will also take 4 cycles to execute, because the non-dependent mulis to r17 and r18 (or any other non-dependent
instructions) don't stall, and the xor that uses r16 is far enough away from the muli to r16 that it doesn't stall.
So this code achieves a multiply throughput of 1 per cycle, and the latency of 3 cycles is hidden by
the non-dependent instructions.
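
If it helps to check the cycle counts for the two snippets above, here is a rough Python sketch of the stall behaviour. It is only a toy model, not a description of the real pipeline: it assumes one instruction issues per cycle in order, a multiply result becomes readable 3 cycles after the multiply issues, every other result is readable the next cycle, and an instruction waits until all its source registers are readable. The function name count_cycles and the tuple format are made up for this illustration.

```python
def count_cycles(program, mul_latency=3):
    """Count cycles for an in-order, single-issue toy pipeline.

    Each instruction is a tuple: (opcode, dest_reg, src_reg, ...).
    Immediates are left out because they never cause a stall.
    """
    ready = {}  # register -> earliest cycle in which its value can be read
    cycle = 0   # cycle in which the next instruction would issue if not stalled
    for op, dest, *srcs in program:
        # Stall until every source register has been produced.
        issue = max([cycle] + [ready.get(r, 0) for r in srcs])
        # Assumption: only multiplies have the long latency.
        latency = mul_latency if op.startswith("mul") else 1
        ready[dest] = issue + latency
        cycle = issue + 1
    return cycle


# First snippet: the xor needs r16 right away and stalls for 2 cycles.
dependent = [
    ("muli", "r16", "r16"),
    ("xor", "r4", "r5", "r16"),
]

# Second snippet: two non-dependent mulis fill the stall slots.
independent = [
    ("muli", "r16", "r16"),
    ("muli", "r17", "r17"),
    ("muli", "r18", "r18"),
    ("xor", "r4", "r5", "r16"),
]

print(count_cycles(dependent))    # 4
print(count_cycles(independent))  # 4
```

Both sequences come out at 4 cycles under these assumptions, but the second one gets three multiplies done in that time instead of one.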
Option 3 (using a shift) has the same performance as the multiply on Nios II/f and Nios II/s on Stratix I and Stratix II
because we actually use the hardware multiplier to perform shifts and rotates.