Altera_Forum
15 years ago

Hardware VS Software

I am doing some performance analysis comparing hardware and software. The program generates a set of random 32-bit numbers. These numbers are sent to a user peripheral (hardware) for addition. Then the same numbers are added again using ordinary C code (software). This is the result:

--Performance Counter Report--
Total Time: 0.0771603 seconds (3858017 clock-cycles)
+---------------+-----+-----------+---------------+-----------+
| Section       |  %  | Time (sec)| Time (clocks) |Occurrences|
+---------------+-----+-----------+---------------+-----------+
|Hardware       | 50.7|    0.03911|        1955390|          1|
+---------------+-----+-----------+---------------+-----------+
|Software       | 49.3|    0.03805|        1902555|          1|
+---------------+-----+-----------+---------------+-----------+

I expected the hardware to be faster, but the software turned out to be faster. May I know why?
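As a quick sanity check on the report above, the figures can be turned into a clock rate and a relative difference. The sketch below is only arithmetic on the posted numbers; the helper names `clock_hz` and `percent_slower` are made up for illustration, and it assumes the performance counter runs off the same clock as the processor. With the posted values it gives roughly 50 MHz and a hardware section that is only about 2.8% slower than the software section.

```c
/* Hypothetical helpers: plain arithmetic on the posted report figures. */

/* Clock rate implied by a cycle count and the matching elapsed time. */
static double clock_hz(unsigned long cycles, double seconds)
{
    return (double)cycles / seconds;
}

/* How much slower the hardware section is than the software section, in percent. */
static double percent_slower(unsigned long hw_cycles, unsigned long sw_cycles)
{
    return 100.0 * ((double)hw_cycles - (double)sw_cycles) / (double)sw_cycles;
}
```

For example, `clock_hz(3858017UL, 0.0771603)` uses the total line of the report, and `percent_slower(1955390UL, 1902555UL)` uses the two section rows. The two sections are so close that both are probably dominated by the same loop and memory overhead rather than by the add itself.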

20 Replies

  • Altera_Forum

    An example multiplier entity that compiles without problems in Quartus:

    
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    entity my_mult is
      port (
        clk     : in  std_logic;
        a,b     : in  unsigned( 7 downto 0);
        c       : out unsigned(15 downto 0)   -- 8 x 8 bits gives a 16-bit product
      );
    end entity;
      
    architecture rtl of my_mult is
    begin
      
      process(clk)
      begin
        if rising_edge(clk) then
          c <= a*b;
        end if;      
      end process;
      
    end rtl;
    
  • Altera_Forum

    --- Quote Start ---

    OK, thanks. Last time, when I was doing my lab, multiplication didn't work, so I changed to addition. So in my case, random 32-bit numbers should work fine for multiplication, right?

    --- Quote End ---

    Remember that multiplying two 32-bit numbers gives a 64-bit result.
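The same width rule shows up on the software side. As a minimal sketch in C (the function names `mul32_full` and `mul32_low` are hypothetical, chosen only for this illustration), one operand has to be widened before the multiply, otherwise the upper 32 bits of the product are lost:

```c
#include <stdint.h>

/* Full product: widen one operand first so the multiply is done in 64 bits. */
static uint64_t mul32_full(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Truncated product: keeps only the low 32 bits of the same result. */
static uint32_t mul32_low(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}
```

With both operands set to 0xFFFFFFFF, the full product is 0xFFFFFFFE00000001, while the truncated version keeps only 0x00000001. A hardware multiplier needs a 64-bit output port (or two 32-bit result registers) for the same reason.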
  • Altera_Forum

    --- Quote Start ---

    Remember that multiplying two 32-bit numbers gives a 64-bit result.

    --- Quote End ---

    Hahaha! Thanks a lot for the information and the code!
  • Altera_Forum

    Since you are not using DMA, and the add takes one cycle in the processor and one cycle in the adder module, the time must be spent in the interface. When you said that you "send" the numbers to the peripheral, that implies that the interface uses MMIO and is therefore driven by the processor. That means the processor must fetch both operands, write them to the adder, and then read the sum back from the adder. When the add is done by the processor, it fetches the two operands and does the add internally, without writing to and reading from the interface. Also, using random numbers only clouds the issue: adding 32-bit numbers takes the same time no matter what the values are.

    Whether you add or multiply, the same scenario applies. For any hardware accelerator to be effective, there must be a computational function made up of several operations that can be overlapped in hardware but not in the processor code. C-to-hardware schemes have been around for years and usually fail because computations that can be accelerated this way are so rare.

    The attached text is a cycle log of a simulation of the C processor that I mentioned in a previous post.

    It shows a few iterations of a for loop that adds two numbers, where each iteration takes 9 clock cycles. To add many numbers there would also be access time to external memory, so the 9 cycles is not really accurate. How many cycles does your processor take to execute the same code?
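The MMIO round trip described in this reply can be sketched as a small host-side model in C. Everything here is an assumption for illustration: the thread does not show the real register map, so `adder_model_t` and its accessor functions are hypothetical, and the "peripheral" is just a struct that counts how many bus transfers the processor has to perform.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of the adder peripheral; not the poster's actual design. */
typedef struct {
    uint32_t operand_a;
    uint32_t operand_b;
    unsigned long bus_accesses;   /* simulated peripheral bus transfers */
} adder_model_t;

static void adder_write_a(adder_model_t *p, uint32_t v) { p->operand_a = v; p->bus_accesses++; }
static void adder_write_b(adder_model_t *p, uint32_t v) { p->operand_b = v; p->bus_accesses++; }

/* Reading the sum register is a third bus transfer per addition. */
static uint32_t adder_read_sum(adder_model_t *p)
{
    p->bus_accesses++;
    return p->operand_a + p->operand_b;
}

/* "Hardware" path: the processor still loads both operands from memory,
   then pays two peripheral writes and one peripheral read per element. */
static void add_arrays_hw(adder_model_t *p, const uint32_t *a, const uint32_t *b,
                          uint32_t *sum, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        adder_write_a(p, a[i]);
        adder_write_b(p, b[i]);
        sum[i] = adder_read_sum(p);
    }
}

/* Software path: same loads and store, but the add happens inside the processor. */
static void add_arrays_sw(const uint32_t *a, const uint32_t *b, uint32_t *sum, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        sum[i] = a[i] + b[i];
    }
}
```

Both functions produce the same sums, but the hardware path adds three peripheral transfers per element on top of everything the software path already does. That extra traffic, not the adder itself, is where the time goes.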
  • Altera_Forum

    Hello SimKnutt, my code for the adder is in the attachment. signed_add is the adder processor, alu_interface is the interface, and avalon is the processor with the interface. hwsw.c is the C source code. I am not sure how many cycles my processor takes. How do I check? Sorry, I am new to this. Also, how would the design look if it used DMA? I am not sure. Thanks.

  • Altera_Forum

    Hi! I downloaded your adder zip but could not open it; I got an "invalid or corrupted" message.

    I think what you want is to use SOPC Builder with an Avalon master/slave to access memory.

    Then put the numbers in memory (where you probably already have them). Then have the slave request two numbers (by address), do the add, and write the sum to memory. Of course, I don't know whether you add and write the sum after each add, or do a summation and then write that result.
  • Altera_Forum

    Please check the attachment of this reply. I hope you can explain more about my design, and about DMA as well. Thanks.

  • Altera_Forum

    The attached .bdf is a diagram of a basic DMA to help you get started. I am assuming that you are using NIOS as the processor to generate the numbers and build the arrays. What you have now is that the NIOS sends (writes) the numbers to the adder. That involves a loop that reads the numbers from memory and writes them one at a time to the adder, which is slower than reading the numbers and simply doing the add internally.

    DMA, on the other hand, can stream the data and use FIFO buffers to transfer blocks of data and overlap the add with the data transfer. It goes like this:

    1) Send the array addresses and size to the peripheral and tell it to start the transfer to both FIFOs.

    2) When both FIFOs are not empty, read the next number from each into the adder and do the add.

    3) I am pretty sure that the transfer into the FIFOs can be broken up into blocks (segments), so the adding can start as soon as a burst has been received by each FIFO.

    The net result is that most of the add time is overlapped with the data transfer, and once each burst transfer starts there is a new word every clock cycle. The key is not to pay the overhead of transferring one word at a time. Now you should see a difference, because the data transfer is much faster and overlapped with the add, not because the add itself is faster.
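The overlap described in the steps above can be illustrated with a small cycle-by-cycle model in C. This is only a sketch under simplifying assumptions that are not from the thread: each FIFO receives one word per simulated cycle once the burst is running, setup overhead and memory latency are ignored, and the names `fifo_t`, `FIFO_DEPTH` and `stream_add` are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define FIFO_DEPTH 16   /* assumed depth, chosen only for this model */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    size_t head, tail, count;
} fifo_t;

static int fifo_push(fifo_t *f, uint32_t v)
{
    if (f->count == FIFO_DEPTH) return 0;      /* full */
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

static int fifo_pop(fifo_t *f, uint32_t *v)
{
    if (f->count == 0) return 0;               /* empty */
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}

/* Simulates the streaming adder and returns the number of cycles used. */
static unsigned long stream_add(const uint32_t *a, const uint32_t *b,
                                uint32_t *sum, size_t n)
{
    fifo_t fa = {0}, fb = {0};
    size_t sent = 0, done = 0;
    unsigned long cycles = 0;

    while (done < n) {
        /* Adder side: consume a pair that arrived in an earlier cycle. */
        if (fa.count > 0 && fb.count > 0) {
            uint32_t x, y;
            fifo_pop(&fa, &x);
            fifo_pop(&fb, &y);
            sum[done++] = x + y;
        }
        /* DMA side: in the same cycle, deliver the next word to each FIFO. */
        if (sent < n && fifo_push(&fa, a[sent])) {
            fifo_push(&fb, b[sent]);
            sent++;
        }
        cycles++;
    }
    return cycles;
}
```

In this model, adding n pairs takes n + 1 cycles: one cycle to get the first words into the FIFOs, then one result per cycle while the transfer continues. A real design would add the setup cost of step 1 and whatever the memory and bus actually allow per cycle, but the shape of the result is the same: the cost per word approaches the transfer rate instead of a full processor-driven write/write/read sequence.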
  • Altera_Forum

    Sorry, I can't open it because it says the file has an unsupported version. I am using Quartus 9.0.