HW Multiplier

Question

Who was the xyz! who took out the hardware multiplier in NIOS2 ????

That person did not think. I know the plan it to use the HW Mult units in CycloneII BUT consider the following.... NIOS is fluid ! With help of its flexibility I have fixed faults and upgraded my clients systems with ease. So I am a small timer. In the last 3 months I have only sold 92 boards. 50 of those are in vehicles 500 -900km away from me. In the last few weeks we noticed a few small timing problems that will be eradicated with NIOS II. BUT I need the HW Mult. command otherwize the speed gain is wasted. I cannot quickly rip out 50 units and replace them with fresh Cyclone IIs. Plus the design is now stable(according to my client) and approved. I am awaiting a long term order for 2500 units. For many reasons I will stick with Cyclone I. So no HW multiplier in NIOS II means that I cannot use it. I have systems out there running on Apex20K100 chips with NIOS I. One such system, with about 2 years of continuous upgrading, has run out of speed. This board clocks at about 20MHz. NIOS II will be like Viagra to this system, but no HW multiplier ! I know I am not the only one. So here is what I propose.... If this is already in action I appologize. Somewhere in the header file there is a directive for a HW multiplier. It is currently set to 0. If I set it to 1 I believe my compiler will compile code with 'mul', 'imul',and 'mulxss/mulxuu'. I know the difference is that the new multiply command is 32x32 bit compared to NIOS I's 16x16. But I will rather live with a delay of 8 to 16 cycles, for a multicycle multiplier, than a software calculation. I remember that the old HW mul command used up about 400 LEs. That I can still live with that on all my designs. so, is it possible that altera can publish the multiplier header/ instantiation so that those with time and brains can create a hdl solution for this problem? Victor

altera_forum · Answer

&gt; Who was the idiot who took out the hardware multiplier in NIOS2 [for Cyclone] ????  That would be me!  (Actually there were three of us who decided that, but hey, I&amp;#39;m here.)  And you are absolutely right--about the need for a multiplier in Nios II for Cyclone.  The good news is that it&amp;#39;s in the next release of the dev. kit (enabled by some Quartus II 4.1 synthesis enhancements).  And further good news is that the release is "near the end of the year," as Alan says in another thread.  But since you need something now, elan describes his solution in the thread "hw multiplier for Cyclone".  And more related info is in the thread "NIOS II DMIPS".  An added bonus is that the 32-bit multiplier in the Nios II/s and Nios II/f cores in Cyclone should be faster than 32-bit multiplication in the original Nios core, because in Nios II it&amp;#39;s a full hardware solution, while in the original Nios it&amp;#39;s a software routine that uses the 16x16 multiply instruction three times.  Kerry Altera

altera_forum · Answer

Kerry   http://forum.niosforum.com/work2/style_emoticons/&lt;#EMO_DIR#&gt;/wink.gif   Sorry for insulting you.   Will have a look at those threads.    Thanks  Victor

altera_forum · Answer

HI all people who wants a fast multiply now,  Even without adding a hardware multiplier, it is possible to gain some performance on the multiply. I disassembled the multiply routine (__mulsi3):  __mulsi3:
        addi    sp,sp,-8                   &lt;- useless, can be removed
        stw     fp,4(sp)                   &lt;- useless, can be removed
        mov     r3,zero 
        mov     fp,sp                       &lt;- useless, can be removed
        beq     r4,zero,mul_30
mul_14:
        andi    r2,r4,1
        cmpeq   r2,r2,zero
        srli    r4,r4,1
        bne     r2,zero,mul_28
        add     r3,r3,r5
mul_28:
        slli    r5,r5,1
        bne     r4,zero,mul_14
mul_30:
        mov     r2,r3
        ldw     fp,4(sp)                &lt;- useless, can be removed
        addi    sp,sp,8                &lt;- useless, can be removed
        ret  This routine can be optimized a lot. The most important optimization that can be done is removing the stack frame stuff. This removes 5 instructions without any problem. This together with two minor optimizations results in the following function:  __mulsi3:
        mov     r2,zero
        beq     r4,zero,mul_30
mul_14:
        andi    r3,r4,1
        srli      r4,r4,1
        beq     r3,zero,mul_28
        add     r2,r2,r5
mul_28:
        slli      r5,r5,1
        bne     r4,zero,mul_14
mul_30:
        ret  Just put this function in an assember source file and it should overrule the default implementation from the library. For us this resulted in a performance gain of about 10%.  When this is still not good enough, make a custom instruction and implement the the function like this:  __mulsi3:
         custom  0,r2,r4,r5
         ret  I used an opencore multiplier and changed it a bit to do a 32x32 multiply in 4 clock cycles (see below).  Currently the upper 32-bits of the result are thrown away, so if you also want fast multiplies with &amp;#39;long long&amp;#39; results, some work still has to be done to make it possible to get the upper 32-bits and correctly implement the __muldi3 function. The multiplier takes 708 LEs in my Cyclone EP1C3 design. I think this is not the best implemenattion of a multiplier, but it works fast and it did only take a few hours to make.  Have fun with it,  Tim Brugman.  -------------------------------------------------------------------------------
--  File: mult_unit.vhd                                                      --
--                                                                           --
--  Copyright © Deversys, 2003                                             --
--                                                                           --
-- function : entity and architecture for multiplication algorithms testing  -- 
--                                                                           --
--  Author: Vladimir V. Erokhin, PhD,                                        --
--         e-mails: vladvas@deversys.com; vladvas@verilog.ru;                --
--                                                                           --
---------------  Revision History      ----------------------------------------
--                                                                           --
--     Date  Engineer               Description                        --
--                                                                           --
--      041108   T. Brugman               Customized to make a fast multiply --
--                                        for Altera NiosII                  --
-------------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
entity MULT_UNIT is
    port (
     clk: in STD_LOGIC;
     clk_en: in STD_LOGIC;
     reset: in STD_LOGIC;
     start: in STD_LOGIC;
  dataa: in STD_LOGIC_VECTOR (31  downto 0);
  datab: in STD_LOGIC_VECTOR (31 downto 0);
  done: out std_logic;
  result: out STD_LOGIC_VECTOR (31 downto 0)
  );
end MULT_UNIT;
architecture RTL of MULT_UNIT is
function MULT_HIER(MULTIPLICAND: STD_LOGIC_VECTOR; MULTIPLIER: STD_LOGIC_VECTOR) return STD_LOGIC_VECTOR is
variable MUL_RESULT_1, MUL_RESULT_2, MUL_RESULT_3, MUL_RESULT_4: STD_LOGIC_VECTOR(MULTIPLICAND&amp;#39;length - 1 downto 0); 
variable RESULT: STD_LOGIC_VECTOR(MULTIPLICAND&amp;#39;length * 2 - 1 downto 0); 
variable HIGH_HALF_OF_MCD, LOW_HALF_OF_MCD, HIGH_HALF_OF_MER, LOW_HALF_OF_MER: STD_LOGIC_VECTOR((MULTIPLICAND&amp;#39;length +1)/2 - 1 downto 0); 
variable MUL_RESULT_1_4: STD_LOGIC_VECTOR((MULTIPLICAND&amp;#39;length+1)/2*3 - 1 downto 0); 
variable TEMP1, TEMP2: STD_LOGIC_VECTOR(1 downto 0); 
variable TEMP3: STD_LOGIC_VECTOR(2 downto 0); 
begin
  if MULTIPLICAND&amp;#39;length = 2 then
      TEMP1 := (MULTIPLIER(1) and MULTIPLICAND(0)) &amp; (MULTIPLIER(0) and MULTIPLICAND(0));
      TEMP2 := (MULTIPLIER(1) and MULTIPLICAND(1)) &amp; (MULTIPLIER(0) and MULTIPLICAND(1));
      if TEMP1(1) = &amp;#39;1&amp;#39; then
         TEMP3 := ((MULTIPLICAND(1) and MULTIPLIER(1) and MULTIPLIER(0)) &amp; (TEMP2 + 1));
      else
         TEMP3 := (&amp;#39;0&amp;#39; &amp; TEMP2);
      end if;
      return TEMP3 &amp; TEMP1(0);
   else
      HIGH_HALF_OF_MCD := MULTIPLICAND(MULTIPLICAND&amp;#39;high downto (MULTIPLICAND&amp;#39;high + 1)/2);
      LOW_HALF_OF_MCD := MULTIPLICAND((MULTIPLICAND&amp;#39;high+1)/2 - 1 downto 0);
      HIGH_HALF_OF_MER := MULTIPLIER(MULTIPLIER&amp;#39;high downto (MULTIPLIER&amp;#39;high + 1)/2);
      LOW_HALF_OF_MER := MULTIPLIER((MULTIPLIER&amp;#39;high+1)/2 - 1 downto 0);
      
      MUL_RESULT_1 := MULT_HIER(LOW_HALF_OF_MCD, LOW_HALF_OF_MER);
      MUL_RESULT_2 := MULT_HIER(LOW_HALF_OF_MCD, HIGH_HALF_OF_MER);
      MUL_RESULT_3 := MULT_HIER(HIGH_HALF_OF_MCD, LOW_HALF_OF_MER);
      MUL_RESULT_4 := MULT_HIER(HIGH_HALF_OF_MCD, HIGH_HALF_OF_MER);
      MUL_RESULT_1_4 := MUL_RESULT_4 &amp; MUL_RESULT_1(MULTIPLICAND&amp;#39;high downto (MULTIPLICAND&amp;#39;high+1)/2);
      RESULT := (MUL_RESULT_1_4 + ((&amp;#39;0&amp;#39; &amp; MUL_RESULT_2) + MUL_RESULT_3)) &amp; MUL_RESULT_1((MULTIPLICAND&amp;#39;high+1)/2-1 downto 0);
      return(RESULT);
   end if;    
end MULT_HIER;
TYPE TState IS (IDLE, MUL1, MUL2, MUL3);
SIGNAL state : TState;
SIGNAL result_1, result_2, result_3, mul16_result : STD_LOGIC_VECTOR(31 downto 0); 
SIGNAL mul_result: STD_LOGIC_VECTOR(63 downto 0); 
SIGNAL reg_a, reg_b: STD_LOGIC_VECTOR(15 downto 0);
BEGIN
process(reset,clk,mul16_result,state,dataa,datab,mul_result)
  VARIABLE result_1_4: STD_LOGIC_VECTOR(47 downto 0);
  VARIABLE arg_a, arg_b : STD_LOGIC_VECTOR(15 downto 0);
begin
  case state is
    when IDLE =&gt;
      arg_a := dataa(15 downto 0);
      arg_b := datab(15 downto 0);
    when MUL1 =&gt;
      arg_a := dataa(15 downto 0);
      arg_b := datab(31 downto 16);
    when MUL2 =&gt;
      arg_a := dataa(31 downto 16);
      arg_b := datab(15 downto 0);
    when MUL3 =&gt;
      arg_a := dataa(31 downto 16);
      arg_b := datab(31 downto 16);
  end case;
  mul16_result &lt;= MULT_HIER(arg_a, arg_b);
  if reset=&amp;#39;1&amp;#39; then
    state &lt;= IDLE;
    done &lt;= &amp;#39;0&amp;#39;;
  elsif clk=&amp;#39;1&amp;#39; and clk&amp;#39;event then
    if clk_en = &amp;#39;1&amp;#39; then
      case state is
        when IDLE =&gt;
          done &lt;= &amp;#39;0&amp;#39;;
          result_1 &lt;= mul16_result;
          if start = &amp;#39;1&amp;#39; then
            state &lt;= MUL1;
          end if;
        when MUL1 =&gt;
          result_2 &lt;= mul16_result;
          state &lt;= MUL2;
        when MUL2 =&gt;
          result_3 &lt;= mul16_result;
          state &lt;= MUL3;
        when MUL3 =&gt;
          result_1_4 := mul16_result &amp; result_1(31 downto 16);
          mul_result &lt;= (result_1_4 + ((&amp;#39;0&amp;#39; &amp; result_2) + result_3)) &amp; result_1(15 downto 0);
          done &lt;= &amp;#39;1&amp;#39;;
          state &lt;= IDLE;
      end case;
    end if;
  end if;
  result &lt;= mul_result(31 downto 0);
end process;
END RTL;

altera_forum · Answer

Tim   Excellent !!    http://forum.niosforum.com/work2/style_emoticons/&lt;#EMO_DIR#&gt;/biggrin.gif     Only one thing to remember...  I have entered the &amp;#39;overriding&amp;#39; code in one of my SDK (I still using legacy code) libraries  and compiled via &amp;#39;make all&amp;#39; in the lib directory. The thing to remember is to make the function global otherwise the old __mulsi3 will stick like mud on a bumper!  .global __mulsi3
.type __mulsi3, @function
__mulsi3:
              custom 0,r2,r4,r5
ret   I haven&amp;#39;t done major code benchmarks but I did see serious improvements with the custom instruction.  Be careful of the compiler! It is smart. Don&amp;#39;t just multiply something and then check the speed. Make sure that your code can&amp;#39;t be optimized out into add instructions. Look at your .objdump to see if the code really uses &lt;__mulsi3&gt;.   "Now go forth and multiply..."   Victor

altera_forum · Answer

Tim:  What fmax on which speed grade can be achieved with this synchronous multiplier? We are getting around 70MHz on a Cyclone C12-8 with our asynchronous multicycle command.  Chris

Forum Discussion

HW Multiplier

10 Replies

Recent Discussions

NiosV and juart-terminal

Nios V license

NIOS does not start after SW download (timing issue?)

DK-DEV-AGI027-RA: JTAG chain broken after Nios V Hello, FPGA recovery fails

Ashling RISC Free IDE fails to download ELF file