The attached .bdf is a diagram of a basic DMA to help you get started. I am assuming that you are using NIOS as the processor to generate the numbers and build the arrays. What you have now is the NIOS writing the numbers to the adder. That involves a loop that reads each number from memory and writes it to the adder one at a time, which is slower than reading the numbers and simply doing the add internally.
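To make the two CPU-driven options concrete, here is a rough C sketch. The register layout (adder_regs_t) and the function names are made up for illustration and are not taken from the .bdf. On a real Nios II system you would normally go through the HAL's IOWR/IORD macros rather than a plain pointer, so that the accesses bypass the data cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout of the adder peripheral (an assumption,
   not taken from the .bdf). */
typedef struct {
    volatile uint32_t operand_a;  /* write: first operand            */
    volatile uint32_t operand_b;  /* write: second operand           */
    volatile uint32_t sum;        /* read: result produced by adder  */
} adder_regs_t;

/* What you have now: the CPU reads each number from memory and writes
   it to the adder, so every element costs several separate bus accesses. */
void add_arrays_pio(adder_regs_t *adder, const uint32_t *a,
                    const uint32_t *b, uint32_t *out, size_t n)
{
    assert(adder != NULL);
    for (size_t i = 0; i < n; i++) {
        adder->operand_a = a[i];  /* memory read + peripheral write */
        adder->operand_b = b[i];  /* memory read + peripheral write */
        out[i] = adder->sum;      /* peripheral read + memory write */
    }
}

/* The "do the add internally" alternative: same memory reads, but no
   peripheral accesses at all. */
void add_arrays_cpu(const uint32_t *a, const uint32_t *b,
                    uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

The first loop does everything the second one does plus three peripheral accesses per element, which is why it cannot win.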
DMA, on the other hand, can stream the data, using FIFO buffers to transfer blocks of data and overlap the add with the data transfer. It goes like this:
1) Send the array addresses and size to the peripheral and tell it to start the transfer to both FIFOs.
2) When both FIFOs are not empty, read the next number from each into the adder and do the add.
3) I am pretty sure that the transfer into the FIFOs can be broken up into blocks (segments), so the adding can start as soon as a burst has been received by each FIFO.
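The three steps can be tried out in software before building anything. The following is only a behavioral model in C, not the hardware in the .bdf: the FIFO depth, the names (fifo_t, dma_add_model) and the simplification that one word lands in each FIFO per clock are all assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 16u  /* assumed depth, chosen only for the model */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    unsigned head;   /* next slot to read    */
    unsigned tail;   /* next slot to write   */
    unsigned count;  /* words currently held */
} fifo_t;

static int fifo_empty(const fifo_t *f) { return f->count == 0u; }
static int fifo_full(const fifo_t *f)  { return f->count == FIFO_DEPTH; }

static void fifo_push(fifo_t *f, uint32_t v)
{
    assert(!fifo_full(f));
    f->data[f->tail] = v;
    f->tail = (f->tail + 1u) % FIFO_DEPTH;
    f->count++;
}

static uint32_t fifo_pop(fifo_t *f)
{
    assert(!fifo_empty(f));
    uint32_t v = f->data[f->head];
    f->head = (f->head + 1u) % FIFO_DEPTH;
    f->count--;
    return v;
}

/* Step 1: the model is handed both array addresses and the size.
   Returns the number of simulated clock cycles until the last sum. */
unsigned long dma_add_model(const uint32_t *a, const uint32_t *b,
                            uint32_t *out, size_t n, size_t burst)
{
    assert(burst >= 1u && burst <= FIFO_DEPTH);
    fifo_t fa = {0}, fb = {0};
    size_t sent_a = 0, sent_b = 0, done = 0;
    size_t first = (n < burst) ? n : burst;  /* size of the first burst */
    int started = 0;
    unsigned long cycles = 0;

    while (done < n) {
        /* Step 3: the adder may start once the first burst is in both FIFOs. */
        if (!started && sent_a >= first && sent_b >= first) {
            started = 1;
        }
        /* Step 2: when neither FIFO is empty, take one word from each and add. */
        if (started && !fifo_empty(&fa) && !fifo_empty(&fb)) {
            uint32_t x = fifo_pop(&fa);
            uint32_t y = fifo_pop(&fb);
            out[done++] = x + y;
        }
        /* Transfer side: one new word per FIFO per clock while there is room. */
        if (sent_a < n && !fifo_full(&fa)) fifo_push(&fa, a[sent_a++]);
        if (sent_b < n && !fifo_full(&fb)) fifo_push(&fb, b[sent_b++]);
        cycles++;
    }
    return cycles;
}
```

In this model the adds for everything after the first burst happen in the same cycles as the remaining transfers, so the total is roughly one burst of latency plus one cycle per word.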
The net result is that most of the add time is overlapped with the data transfer. Once each burst transfer starts there is a new word every clock cycle, but the key is to not pay the overhead of transferring one word at a time. Now you should see a difference, and it comes from the data transfer being much faster and overlapped with the add, not from the add itself being faster.
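A back-of-the-envelope cost model shows where the saving comes from. This is a sketch with assumed parameters: "setup" stands for whatever fixed overhead each transfer pays (address phase, arbitration, loop instructions), and the function names are invented for the example. Real numbers depend on your bus and memory.

```c
#include <assert.h>
#include <stdint.h>

/* One word at a time: every word pays the fixed overhead plus one data cycle. */
uint64_t single_word_cycles(uint64_t n_words, uint64_t setup)
{
    return n_words * (setup + 1u);
}

/* Burst transfers: the overhead is paid once per burst, then one word per clock. */
uint64_t burst_cycles(uint64_t n_words, uint64_t setup, uint64_t burst_len)
{
    assert(burst_len > 0u);
    uint64_t bursts = (n_words + burst_len - 1u) / burst_len;  /* round up */
    return bursts * setup + n_words;
}
```

With made-up values of 1024 words, a 4-cycle overhead and 16-word bursts, this gives 5120 cycles for single-word transfers against 1280 for bursts. The adder is identical in both cases; only the transfer overhead changed.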