Forum Discussion

Altera_Forum
Honored Contributor
21 years ago

Multiplying a register by four

Hi all,

I have just started using NIOS II, and I have a small question:

What is the best way to multiply a register by four?

So far I have found the following alternatives (suppose r16 contains the value to be multiplied):

1) Generated by gcc when accessing an array of 32-bit integers. Does the compiler also use this instruction on the small and economy versions of NIOS II?

muli r16, r16, 4

2)

add r16, r16, r16

add r16, r16, r16

3)

slli r16, r16, 2
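
The three options compute the same value as long as the register is treated as a plain 32-bit quantity. A minimal Python sketch of that equivalence, assuming r16 is modelled as an unsigned 32-bit integer; the function names are hypothetical and only mirror options 1 to 3, and the sketch says nothing about timing:

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit register by masking every result


def mul4_multiply(x):
    """Option 1: multiply by the constant 4."""
    return (x * 4) & MASK32


def mul4_add(x):
    """Option 2: double the value twice with two additions."""
    x = (x + x) & MASK32
    return (x + x) & MASK32


def mul4_shift(x):
    """Option 3: shift left by two bit positions."""
    return (x << 2) & MASK32


def all_options_agree(x):
    """Return True if the three options give the same 32-bit result."""
    return mul4_multiply(x) == mul4_add(x) == mul4_shift(x)
```

For a small array index such as 5, all three return 20, which is the byte offset of element 5 in an array of 32-bit integers.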

Best Regards,

Paolo

9 Replies

  • Altera_Forum

    On the economy version you don't get multiplier hardware.

    On the other two versions the multiplier also performs the shift operations, so if you shift left by 2 (i.e. multiply by 4) I wouldn't be surprised if you end up just multiplying anyway.

    Your solutions 1 and 3 probably take exactly the same amount of time, whereas number 2 would take longer, I assume, since I don't see them being able to do the adds in parallel (maybe they can).

    Either way you're probably talking about somewhere between 1 and 3 clock cycles. Hopefully you don't need better performance than that.
  • Altera_Forum

    OK, I think I will use option 3, since even the economy version of the NIOS II is able to shift in 2 cycles...

    Thanks!!!

    Paolo
  • Altera_Forum

    Be careful when shifting signed numbers (you don't want to modify the sign bit).
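
To make the warning concrete: a left shift by 2 of a signed 32-bit value equals a multiplication by 4 only while the result still fits in 32 bits. A minimal Python sketch, assuming two's-complement 32-bit registers; the helper names are hypothetical:

```python
def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement signed value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def shift_left2_signed(x):
    """Model a 32-bit left shift by 2 and read the result back as signed."""
    return to_signed32(x << 2)


def shift_is_safe(x):
    """True if x * 4 fits in a signed 32-bit register, so the sign survives."""
    return -(1 << 29) <= x < (1 << 29)
```

Under this model, `shift_left2_signed(-3)` gives -12 as expected, while `shift_left2_signed(1 << 29)` wraps around to a negative number because the shifted bit lands in the sign position. Small positive array indexes are well inside the safe range.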

  • Altera_Forum

    The best option depends on how soon the result of the multiply is used by other instructions, which FPGA family you are using, and which Nios II core you are using.

    In general, option 2 is the best since it has a throughput of 0.5 multiplies per cycle and a latency of 2 cycles in all combinations.

    BTW, a throughput of 0.5 multiplies per cycle means you get a multiply result every 1/0.5 = 2 cycles, and a latency of 2 cycles means the result isn't ready for 2 cycles.

    Let me explain more. On Stratix I and Stratix II devices, the Nios II/s and Nios II/f use the hardware multipliers to perform multiplies. The throughput is one multiply per cycle, but with a 3-cycle latency. If you try to use the result of a multiply within one or two cycles, the dependent instruction is stalled, which results in a throughput of 0.33 multiplies per cycle and a latency of 3 cycles.

    For example, this code:

    muli r16, r16, 4

    xor r4, r5, r16

    will take 4 cycles to execute because the xor is stalled for 2 cycles since it uses the muli result.

    However, this code:

    muli r16, r16, 4

    muli r17, r17, 4

    muli r18, r18, 4

    xor r4, r5, r16

    will also take 4 cycles to execute because the non-dependent muli to r17 and r18 (or any other non-dependent instructions) don't stall, and the xor that uses r16 is far enough away from the muli to r16 not to stall.

    So this code achieves multiplies with a throughput of 1 per cycle, and the latency of 3 cycles is hidden by the non-dependent instructions.

    Option 3 (using a shift) has the same performance as the multiply on Nios II/f and Nios II/s on Stratix I and Stratix II because we actually use the hardware multiplier to perform shifts and rotates.
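
The cycle counts quoted in this reply can be reproduced with a very small issue model. This is only a sketch under stated assumptions, not a description of the real pipeline: one instruction issues per cycle in program order, an instruction waits until all of its source registers are ready, muli results are ready 3 cycles after issue, and add and xor results are ready after 1 cycle. The `LATENCY` table and `count_cycles` are hypothetical names.

```python
# Assumed result latencies in cycles, taken from the figures in the reply above.
LATENCY = {"muli": 3, "add": 1, "xor": 1}


def count_cycles(program):
    """Count cycles for a list of (opcode, dest, sources) tuples.

    Instructions issue in order, at most one per cycle, and an instruction
    stalls until every source register it reads is ready.
    """
    ready = {}  # register name -> earliest cycle a consumer may issue
    cycle = 0   # earliest cycle at which the next instruction may issue
    for opcode, dest, sources in program:
        issue = max([cycle] + [ready.get(reg, 0) for reg in sources])
        ready[dest] = issue + LATENCY[opcode]
        cycle = issue + 1
    return cycle


# muli followed directly by a dependent xor: the xor stalls for 2 cycles.
dependent = [
    ("muli", "r16", ["r16"]),
    ("xor", "r4", ["r5", "r16"]),
]

# Two independent multiplies fill the slots that were stall cycles above.
interleaved = [
    ("muli", "r16", ["r16"]),
    ("muli", "r17", ["r17"]),
    ("muli", "r18", ["r18"]),
    ("xor", "r4", ["r5", "r16"]),
]

# Option 2: two dependent adds.
two_adds = [
    ("add", "r16", ["r16", "r16"]),
    ("add", "r16", ["r16", "r16"]),
]
```

With these assumptions, `count_cycles(dependent)` and `count_cycles(interleaved)` both return 4, matching the two examples, and `count_cycles(two_adds)` returns 2, matching the 2-cycle latency given for option 2.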
  • Altera_Forum

    Hi,

    Many thanks to everyone for the good information you gave me...

    Regarding the warning about signed integers: currently I'm using these instructions to index some small vectors of integers in assembler, so I expect the indexes to be small positive numbers :-)

    Thanks again,

    Paolo
  • Altera_Forum

    Another question related to this topic...

    Since pipeline stalls affect the performance of the code I'm writing, is it possible to find out, for a given NIOS II hardware configuration, whether and where a sequence of assembler instructions stalls due to register dependencies?

    Thanks again for everything,

    Paolo
  • Altera_Forum

    If you can run in ModelSim, use the w command in ModelSim to display waves.

    Then you can see the exact timing of your instructions.
  • Altera_Forum

    Somewhere in the big NIOS II doc they give you the timing for the assembly instructions, but like James said, there can be exceptions in many cases.

    Sounds like you need/want every clock cycle you can get, so ModelSim should be a lot of help to you (I've never used it, but it looks like it could give you a lot of info).