Forum Discussion

Altera_Forum's avatar
Altera_Forum
Icon for Honored Contributor rankHonored Contributor
21 years ago

HW Multiplier

Who was the xyz! who took out the hardware multiplier in NIOS2 ????

That person did not think. I know the plan it to use the HW Mult units in CycloneII BUT consider the following.... NIOS is fluid ! With help of its flexibility I have fixed faults and upgraded my clients systems with ease. So I am a small timer. In the last 3 months I have only sold 92 boards. 50 of those are in vehicles 500 -900km away from me. In the last few weeks we noticed a few small timing problems that will be eradicated with NIOS II. BUT I need the HW Mult. command otherwize the speed gain is wasted. I cannot quickly rip out 50 units and replace them with fresh Cyclone IIs. Plus the design is now stable(according to my client) and approved. I am awaiting a long term order for 2500 units. For many reasons I will stick with Cyclone I. So no HW multiplier in NIOS II means that I cannot use it. I have systems out there running on Apex20K100 chips with NIOS I. One such system, with about 2 years of continuous upgrading, has run out of speed. This board clocks at about 20MHz. NIOS II will be like Viagra to this system, but no HW multiplier ! I know I am not the only one. So here is what I propose.... If this is already in action I appologize. Somewhere in the header file there is a directive for a HW multiplier. It is currently set to 0. If I set it to 1 I believe my compiler will compile code with 'mul', 'imul',and 'mulxss/mulxuu'. I know the difference is that the new multiply command is 32x32 bit compared to NIOS I's 16x16. But I will rather live with a delay of 8 to 16 cycles, for a multicycle multiplier, than a software calculation. I remember that the old HW mul command used up about 400 LEs. That I can still live with that on all my designs. so, is it possible that altera can publish the multiplier header/ instantiation so that those with time and brains can create a hdl solution for this problem? Victor

10 Replies

  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    > Who was the idiot who took out the hardware multiplier in NIOS2 [for Cyclone] ????

    That would be me! (Actually there were three of us who decided that, but hey, I'm here.) And you are absolutely right--about the need for a multiplier in Nios II for Cyclone.

    The good news is that it's in the next release of the dev. kit (enabled by some Quartus II 4.1 synthesis enhancements). And further good news is that the release is "near the end of the year," as Alan says in another thread.

    But since you need something now, elan describes his solution in the thread "hw multiplier for Cyclone".

    And more related info is in the thread "NIOS II DMIPS".

    An added bonus is that the 32-bit multiplier in the Nios II/s and Nios II/f cores in Cyclone should be faster than 32-bit multiplication in the original Nios core, because in Nios II it's a full hardware solution, while in the original Nios it's a software routine that uses the 16x16 multiply instruction three times.

    Kerry

    Altera
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    HI all people who wants a fast multiply now,

    Even without adding a hardware multiplier, it is possible to gain some performance on the multiply. I disassembled the multiply routine (__mulsi3):

    __mulsi3:
            addi    sp,sp,-8                   <- useless, can be removed
            stw     fp,4(sp)                   <- useless, can be removed
            mov     r3,zero 
            mov     fp,sp                       <- useless, can be removed
            beq     r4,zero,mul_30
    mul_14:
            andi    r2,r4,1
            cmpeq   r2,r2,zero
            srli    r4,r4,1
            bne     r2,zero,mul_28
            add     r3,r3,r5
    mul_28:
            slli    r5,r5,1
            bne     r4,zero,mul_14
    mul_30:
            mov     r2,r3
            ldw     fp,4(sp)                <- useless, can be removed
            addi    sp,sp,8                <- useless, can be removed
            ret

    This routine can be optimized a lot. The most important optimization that can be done is removing the stack frame stuff. This removes 5 instructions without any problem. This together with two minor optimizations results in the following function:

    __mulsi3:
            mov     r2,zero
            beq     r4,zero,mul_30
    mul_14:
            andi    r3,r4,1
            srli      r4,r4,1
            beq     r3,zero,mul_28
            add     r2,r2,r5
    mul_28:
            slli      r5,r5,1
            bne     r4,zero,mul_14
    mul_30:
            ret

    Just put this function in an assember source file and it should overrule the default implementation from the library.

    For us this resulted in a performance gain of about 10%.

    When this is still not good enough, make a custom instruction and implement the the function like this:

    __mulsi3:
             custom  0,r2,r4,r5
             ret

    I used an opencore multiplier and changed it a bit to do a 32x32 multiply in 4 clock cycles (see below). Currently the upper 32-bits of the result are thrown away, so if you also want fast multiplies with &#39;long long&#39; results, some work still has to be done to make it possible to get the upper 32-bits and correctly implement the __muldi3 function.

    The multiplier takes 708 LEs in my Cyclone EP1C3 design. I think this is not the best implemenattion of a multiplier, but it works fast and it did only take a few hours to make.

    Have fun with it,

    Tim Brugman.

    -------------------------------------------------------------------------------
    --  File: mult_unit.vhd                                                      --
    --                                                                           --
    --  Copyright © Deversys, 2003                                             --
    --                                                                           --
    -- function : entity and architecture for multiplication algorithms testing  -- 
    --                                                                           --
    --  Author: Vladimir V. Erokhin, PhD,                                        --
    --         e-mails: vladvas@deversys.com; vladvas@verilog.ru;                --
    --                                                                           --
    ---------------  Revision History      ----------------------------------------
    --                                                                           --
    --     Date  Engineer               Description                        --
    --                                                                           --
    --      041108   T. Brugman               Customized to make a fast multiply --
    --                                        for Altera NiosII                  --
    -------------------------------------------------------------------------------
    library IEEE;
    use IEEE.std_logic_1164.all;
    use IEEE.std_logic_arith.all;
    use IEEE.std_logic_unsigned.all;
    entity MULT_UNIT is
        port (
         clk: in STD_LOGIC;
         clk_en: in STD_LOGIC;
         reset: in STD_LOGIC;
         start: in STD_LOGIC;
      dataa: in STD_LOGIC_VECTOR (31  downto 0);
      datab: in STD_LOGIC_VECTOR (31 downto 0);
      done: out std_logic;
      result: out STD_LOGIC_VECTOR (31 downto 0)
      );
    end MULT_UNIT;
    architecture RTL of MULT_UNIT is
    function MULT_HIER(MULTIPLICAND: STD_LOGIC_VECTOR; MULTIPLIER: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR is
    variable MUL_RESULT_1, MUL_RESULT_2, MUL_RESULT_3, MUL_RESULT_4: STD_LOGIC_VECTOR(MULTIPLICAND&#39;length - 1 downto 0); 
    variable RESULT: STD_LOGIC_VECTOR(MULTIPLICAND&#39;length * 2 - 1 downto 0); 
    variable HIGH_HALF_OF_MCD, LOW_HALF_OF_MCD, HIGH_HALF_OF_MER, LOW_HALF_OF_MER: STD_LOGIC_VECTOR((MULTIPLICAND&#39;length +1)/2 - 1 downto 0); 
    variable MUL_RESULT_1_4: STD_LOGIC_VECTOR((MULTIPLICAND&#39;length+1)/2*3 - 1 downto 0); 
    variable TEMP1, TEMP2: STD_LOGIC_VECTOR(1 downto 0); 
    variable TEMP3: STD_LOGIC_VECTOR(2 downto 0); 
    begin
      if MULTIPLICAND&#39;length = 2 then
          TEMP1 := (MULTIPLIER(1) and MULTIPLICAND(0)) & (MULTIPLIER(0) and MULTIPLICAND(0));
          TEMP2 := (MULTIPLIER(1) and MULTIPLICAND(1)) & (MULTIPLIER(0) and MULTIPLICAND(1));
          if TEMP1(1) = &#39;1&#39; then
             TEMP3 := ((MULTIPLICAND(1) and MULTIPLIER(1) and MULTIPLIER(0)) & (TEMP2 + 1));
          else
             TEMP3 := (&#39;0&#39; & TEMP2);
          end if;
          return TEMP3 & TEMP1(0);
       else
          HIGH_HALF_OF_MCD := MULTIPLICAND(MULTIPLICAND&#39;high downto (MULTIPLICAND&#39;high + 1)/2);
          LOW_HALF_OF_MCD := MULTIPLICAND((MULTIPLICAND&#39;high+1)/2 - 1 downto 0);
          HIGH_HALF_OF_MER := MULTIPLIER(MULTIPLIER&#39;high downto (MULTIPLIER&#39;high + 1)/2);
          LOW_HALF_OF_MER := MULTIPLIER((MULTIPLIER&#39;high+1)/2 - 1 downto 0);
          
          MUL_RESULT_1 := MULT_HIER(LOW_HALF_OF_MCD, LOW_HALF_OF_MER);
          MUL_RESULT_2 := MULT_HIER(LOW_HALF_OF_MCD, HIGH_HALF_OF_MER);
          MUL_RESULT_3 := MULT_HIER(HIGH_HALF_OF_MCD, LOW_HALF_OF_MER);
          MUL_RESULT_4 := MULT_HIER(HIGH_HALF_OF_MCD, HIGH_HALF_OF_MER);
          MUL_RESULT_1_4 := MUL_RESULT_4 & MUL_RESULT_1(MULTIPLICAND&#39;high downto (MULTIPLICAND&#39;high+1)/2);
          RESULT := (MUL_RESULT_1_4 + ((&#39;0&#39; & MUL_RESULT_2) + MUL_RESULT_3)) & MUL_RESULT_1((MULTIPLICAND&#39;high+1)/2-1 downto 0);
          return(RESULT);
       end if;    
    end MULT_HIER;
    TYPE TState IS (IDLE, MUL1, MUL2, MUL3);
    SIGNAL state : TState;
    SIGNAL result_1, result_2, result_3, mul16_result : STD_LOGIC_VECTOR(31 downto 0); 
    SIGNAL mul_result: STD_LOGIC_VECTOR(63 downto 0); 
    SIGNAL reg_a, reg_b: STD_LOGIC_VECTOR(15 downto 0);
    BEGIN
    process(reset,clk,mul16_result,state,dataa,datab,mul_result)
      VARIABLE result_1_4: STD_LOGIC_VECTOR(47 downto 0);
      VARIABLE arg_a, arg_b : STD_LOGIC_VECTOR(15 downto 0);
    begin
      case state is
        when IDLE =>
          arg_a := dataa(15 downto 0);
          arg_b := datab(15 downto 0);
        when MUL1 =>
          arg_a := dataa(15 downto 0);
          arg_b := datab(31 downto 16);
        when MUL2 =>
          arg_a := dataa(31 downto 16);
          arg_b := datab(15 downto 0);
        when MUL3 =>
          arg_a := dataa(31 downto 16);
          arg_b := datab(31 downto 16);
      end case;
      mul16_result <= MULT_HIER(arg_a, arg_b);
      if reset=&#39;1&#39; then
        state <= IDLE;
        done <= &#39;0&#39;;
      elsif clk=&#39;1&#39; and clk&#39;event then
        if clk_en = &#39;1&#39; then
          case state is
            when IDLE =>
              done <= &#39;0&#39;;
              result_1 <= mul16_result;
              if start = &#39;1&#39; then
                state <= MUL1;
              end if;
            when MUL1 =>
              result_2 <= mul16_result;
              state <= MUL2;
            when MUL2 =>
              result_3 <= mul16_result;
              state <= MUL3;
            when MUL3 =>
              result_1_4 := mul16_result & result_1(31 downto 16);
              mul_result <= (result_1_4 + ((&#39;0&#39; & result_2) + result_3)) & result_1(15 downto 0);
              done <= &#39;1&#39;;
              state <= IDLE;
          end case;
        end if;
      end if;
      result <= mul_result(31 downto 0);
    end process;
    END RTL;
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    Tim

    Excellent !!

    http://forum.niosforum.com/work2/style_emoticons/<#EMO_DIR#>/biggrin.gif

    Only one thing to remember...

    I have entered the &#39;overriding&#39; code in one of my SDK (I still using legacy code) libraries and compiled via &#39;make all&#39; in the lib directory.

    The thing to remember is to make the function global otherwise the old __mulsi3 will stick like mud on a bumper!

    .global __mulsi3
    .type __mulsi3, @function
    __mulsi3:
                  custom 0,r2,r4,r5
    ret

    I haven&#39;t done major code benchmarks but I did see serious improvements with the custom instruction.

    Be careful of the compiler! It is smart. Don&#39;t just multiply something and then check the speed. Make sure that your code can&#39;t be optimized out into add instructions. Look at your .objdump to see if the code really uses <__mulsi3>.

    "Now go forth and multiply..."

    Victor
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    Tim:

    What fmax on which speed grade can be achieved with this synchronous multiplier?

    We are getting around 70MHz on a Cyclone C12-8 with our asynchronous multicycle command.

    Chris
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    Chris,

    I got 58MHz on a Cyclone C3-8 instead of 70MHz without the multiplier. As I allready said, it is not the best implementation, just a quick hack based on an opencore. If you have a better solution, please share it with the community...

    Greetz,

    Tim Brugman.
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    --- Quote Start ---

    originally posted by timbr@Nov 8 2004, 09:27 AM

    hi all people who wants a fast multiply now,

    even without adding a hardware multiplier, it is possible to gain some performance on the multiply. i disassembled the multiply routine (__mulsi3):

    __mulsi3:
     â  â  â  â addi â  â sp,sp,-8 â  â  â  â  â  â  â  â  â  <- useless, can be removed
     â  â  â  â stw â  â  fp,4(sp) â  â  â  â  â  â  â  â  â  <- useless, can be removed
     â  â  â  â mov â  â  r3,zero 
     â  â  â  â mov â  â  fp,sp â  â  â  â  â  â  â  â  â  â  â  <- useless, can be removed
     â  â  â  â beq â  â  r4,zero,mul_30
    mul_14:
     â  â  â  â andi â  â r2,r4,1
     â  â  â  â cmpeq â  r2,r2,zero
     â  â  â  â srli â  â r4,r4,1
     â  â  â  â bne â  â  r2,zero,mul_28
     â  â  â  â add â  â  r3,r3,r5
    mul_28:
     â  â  â  â slli â  â r5,r5,1
     â  â  â  â bne â  â  r4,zero,mul_14
    mul_30:
     â  â  â  â mov â  â  r2,r3
     â  â  â  â ldw â  â  fp,4(sp) â  â  â  â  â  â  â  â <- useless, can be removed
     â  â  â  â addi â  â sp,sp,8 â  â  â  â  â  â  â  â <- useless, can be removed
     â  â  â  â ret

    this routine can be optimized a lot. the most important optimization that can be done is removing the stack frame stuff. this removes 5 instructions without any problem. this together with two minor optimizations results in the following function:

    __mulsi3:
     â  â  â  â mov â  â  r2,zero
     â  â  â  â beq â  â  r4,zero,mul_30
    mul_14:
     â  â  â  â andi â  â r3,r4,1
     â  â  â  â srli â  â  â r4,r4,1
     â  â  â  â beq â  â  r3,zero,mul_28
     â  â  â  â add â  â  r2,r2,r5
    mul_28:
     â  â  â  â slli â  â  â r5,r5,1
     â  â  â  â bne â  â  r4,zero,mul_14
    mul_30:
     â  â  â  â ret

    --- Quote End ---

    I think you get the same result if you should be able to compile the libraries with the -f-omit-frame-pointer compiler option.

    Has anyone done this already, is it hard to do for someone that has practically no experience in porting or making the gnu toolchain?

    Stefaan
  • Altera_Forum's avatar
    Altera_Forum
    Icon for Honored Contributor rankHonored Contributor

    Another try :

    When implemeting the simple sequence as above as a custom instruction, it saves already a lot of time, for only 280 LEs in a cyclone.

    I have measured the time for this

    int r;
    volatile int* pr = &r; //prevent optimalisations
    for (int a = -1000; a < 1000; a++)
       for (int b = 1000; b >-1000; b--)  //count backwards : prevent optimalisations
          *pr = a*b;

    Without cutom instruction : this takes about 22.4 seconds at 50Mc with the standard NiosII

    With the custominstruction in the mulsi function : 4.3 seconds (without frame pointer stuff)

    with the custominstruction inlined (without the additinal function call and ret instruction) : 2.4 seconds.

    This is a very big advantage for only 280 LE&#39;s I think. (22.4 seconds down to 4.3 seconds for 4.000.000 muls and overhead for the loops)

    But I think the compiler still thinks that the cost for a multiply is very high, so it wants to optimise to shifts and adds where possible. This can reduce the benefit of this code a lot.

    If anyone is interested, I&#39;ll post the verilog code for the custom instruction.