I am doing some performance analysis of hardware versus software. The program generates a set of random 32-bit numbers. These numbers are sent to a user peripheral (hardware) for addition. Then the same numbers are added again using ordinary C code (software). This is the result:

--Performance Counter Report--
Total Time: 0.0771603 seconds (3858017 clock-cycles)
+---------------+-----+-----------+---------------+-----------+
| Section       |  %  | Time (sec)| Time (clocks) |Occurrences|
+---------------+-----+-----------+---------------+-----------+
|Hardware       | 50.7|    0.03911|        1955390|          1|
+---------------+-----+-----------+---------------+-----------+
|Software       | 49.3|    0.03805|        1902555|          1|
+---------------+-----+-----------+---------------+-----------+

I expected the hardware to be faster, but the software turned out to be faster. May I know why?

If you are shipping really short amounts of work to the accelerator then yes software will be faster due to the communication overhead. The only way this could be efficient is if you perform the same operation across a large block of data in memory and you use DMAs to stuff the data into the accelerator.

I would really like to see the C code that adds the numbers so I can compare the software cycle count that you have with the cycle count that I get running the same code on my processor that directly executes the C code without compiling to a native instruction set. Will you please attach the code to this thread? The whole object of the design is to minimize the number of cycles so it fits right in with what you are doing.

--- Quote Start ---
If you are shipping really short amounts of work to the accelerator then yes software will be faster due to the communication overhead. The only way this could be efficient is if you perform the same operation across a large block of data in memory and you use DMAs to stuff the data into the accelerator.
--- Quote End ---

Yes, my project is to test the performance of the system with and without DMA. As a baseline, I am first comparing hardware and software without involving DMA. The next step is to include DMA. By the way, is there a tutorial on how to transmit and receive data with DMA in C?

--- Quote Start ---
I would really like to see the C code that adds the numbers so I can compare the software cycle count that you have with the cycle count that I get running the same code on my processor that directly executes the C code without compiling to a native instruction set. Will you please attach the code to this thread? The whole object of the design is to minimize the number of cycles so it fits right in with what you are doing.
--- Quote End ---

The code generates two sets of random numbers and then adds them up. The "sub processor" that does the adding needs three submodules: the adding module, the interface, and the top-level system. The adding module is where the generated numbers are added, the interface controls the input and output, and the top-level system is the whole system, which includes the interface and the adding module.

Adding is hardly an issue in software when you're within 32 bits. But when you look at wider types (more than 64 bits, say) or fixed point, you start to see a difference. Multiply that by large data sets (e.g. video or megapixel images) and you would really notice the difference in throughput.

Hardware VS Software | Altera Community

20 Replies

Altera_Forum

Honored Contributor

15 years ago

An example multiply entity; it compiles in Quartus without problems:


library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity my_mult is
  port (
    
    clk     : in  std_logic;
    
    a,b     : in  unsigned( 7 downto 0);
    c       : out unsigned(15 downto 0)
  );
end entity;
  
architecture rtl of my_mult is
begin
  
  process(clk)
  begin
    if rising_edge(clk) then
      c <= a*b;
    end if;      
  end process;
  
end rtl;

Altera_Forum
Honored Contributor
15 years ago
--- Quote Start ---
ok thanks... because last time when i was doing my lab, multiplication doesnt work, so i change to addition... so in my case, random 32-bit number should work fine for multiplication, right?
--- Quote End ---

Remember that multiplying two 32-bit numbers gives a 64-bit result.
Altera_Forum
Honored Contributor
15 years ago
--- Quote Start ---
Remember that multiplying two 32-bit numbers gives a 64-bit result.
--- Quote End ---

Hahaha! Thanks a lot for the information and the code!
Altera_Forum
Honored Contributor
15 years ago
Since you are not using DMA, and the add takes one cycle both in the processor and in the adder module, the time must be spent in the interface. When you said that you "send" the numbers to the peripheral, that implies that the interface uses MMIO and is therefore driven by the processor. That means the processor must fetch both operands, write them to the adder, and then read the sum back from the adder. When the add is done by the processor, it fetches the two operands and does the add internally, without writing to and reading from the interface. Also, using random numbers only clouds the issue: adding 32-bit numbers takes the same time no matter what the values are.
Whether you add or multiply, the same scenario applies. For any hardware accelerator to be effective, there must be a computational function made up of several operations that can be overlapped in hardware but not in the processor code. C-to-hardware schemes have been around for years and usually fail because computations that can be accelerated are so rare.
The attached text is a cycle log from a simulation of the C processor that I mentioned in a previous post.
It shows a few iterations of a for loop that adds two numbers, where each iteration takes 9 clock cycles. Adding many numbers would involve access time to external memory, so the 9 cycles is not really accurate. How many cycles does your processor take to execute the same code?
cycle_log.txt (2 KB)
Altera_Forum
Honored Contributor
15 years ago
Hello SimKnutt, my code for the adder is in the attachment. signed_add is the adder processor, alu_interface is the interface, and avalon is the processor with the interface. hwsw.c is the C source code. I am not sure how many cycles my processor takes. How do I check? Sorry, I am new to this. Also, what would the design look like with DMA? I am not sure. Thanks.
adder.zip (2 KB)
Altera_Forum
Honored Contributor
15 years ago
Hi! I downloaded your adder zip but could not open it; I got an "invalid or corrupted" message.
I think using SOPC Builder with an Avalon master/slave to access memory is what you want.
Then put the numbers in memory (which you probably already have). Then have the slave request two numbers (by address), do the add, and write the sum to memory. Of course, I don't know whether you write the sum after each add or do a summation and then write that result.
Altera_Forum
Honored Contributor
15 years ago
Please check the attachment to this reply. I hope you can explain more about my design and about DMA as well. Thanks.
adder.zip (2 KB)
Altera_Forum
Honored Contributor
15 years ago
The attached .bdf is a diagram of a basic DMA to help you get started. I am assuming that you are using NIOS for the processor to generate the numbers and build the arrays. What you have now is the NIOS sending (writing) the numbers to the adder. That involves a loop that reads the numbers from memory and writes them one at a time to the adder, which is slower than reading the numbers and simply doing the add internally.
DMA, on the other hand, can stream the data and use FIFO buffers to transfer blocks of data and overlap the add with the data transfer. It goes like this:
1) Send the array addresses and size to the peripheral and tell it to start the transfer to both FIFOs.
2) When both FIFOs are not empty, read the next numbers into the adder and do the add.
3) I am pretty sure that the transfer into the FIFO can be broken up into blocks (segments) so the adding can start when a burst has been received by each FIFO.
The net result is that most of the add time is overlapped with data transfer, and once each burst transfer starts there is a new word each clock cycle. The key is to not pay the overhead of transferring one word at a time. Now you should see a difference, because the data transfer is much faster and overlapped, not because the add is faster.
BasicDMA.zip (1 KB)
Altera_Forum
Honored Contributor
15 years ago
Sorry, I can't open it because it says it has an unsupported version. I am using Quartus 9.0.
Altera_Forum
Honored Contributor
15 years ago
Here is a scanned zip that you should be able to open with Paint. It is a .tif file, so I hope there is no problem.
scan0006.zip (54 KB)
