Since you are not using DMA, and the add takes one cycle whether it is done in the processor or in the adder module, the time must be spent in the interface. When you said that you "send" the numbers to the peripheral, that implies the interface uses memory-mapped I/O (MMIO) and is therefore driven by the processor. That means the processor must fetch both operands, write them to the adder, and then read the sum back from the adder. When the add is done by the processor, it fetches the two operands and does the add internally, without writing to and reading from the interface. Also, using random numbers only clouds the issue: adding 32-bit numbers takes the same time no matter what the values are.
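To make the difference concrete, here is a rough sketch in C that only counts the bus accesses the processor has to drive. Everything in it (the `adder_model` struct, the register names, the helper functions) is made up for illustration and is not your actual peripheral; a real MMIO adder would be reached through volatile pointers at addresses I don't know.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of an MMIO adder peripheral: two operand
   registers plus a counter of processor-driven bus accesses.
   This is an assumption for illustration, not the real design. */
typedef struct {
    uint32_t a;
    uint32_t b;
    unsigned bus_accesses;
} adder_model;

/* Each register write is one bus access driven by the processor. */
static void adder_write_a(adder_model *p, uint32_t v)
{
    assert(p != NULL);
    p->a = v;
    p->bus_accesses++;
}

static void adder_write_b(adder_model *p, uint32_t v)
{
    assert(p != NULL);
    p->b = v;
    p->bus_accesses++;
}

/* Reading the result is a third bus access. */
static uint32_t adder_read_sum(adder_model *p)
{
    assert(p != NULL);
    p->bus_accesses++;
    return p->a + p->b;
}

/* Peripheral path: write operand, write operand, read sum. */
uint32_t add_via_peripheral(adder_model *p, uint32_t x, uint32_t y)
{
    adder_write_a(p, x);
    adder_write_b(p, y);
    return adder_read_sum(p);
}

/* Processor path: the add happens internally, with no interface traffic. */
uint32_t add_in_cpu(uint32_t x, uint32_t y)
{
    return x + y;
}
```

In this model one add through the peripheral costs three interface accesses on top of fetching the operands, while the in-processor add costs none. How many cycles each access takes depends on your bus, which I can't tell from here.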
Whether you add or multiply, the same scenario applies. For any hardware accelerator to be effective, there must be a computational function made up of several operations that can be overlapped in hardware but not in processor code. C-to-hardware schemes have been around for years and usually fail because computations that can be accelerated are so rare.
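A back-of-the-envelope way to check whether offloading is worth it is to compare the processor's cycle count against the interface overhead plus the hardware time. The sketch below is only that kind of estimate; the struct, its field names, and any numbers you plug in are assumptions for illustration, not measurements from any real system.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inputs to a rough break-even estimate. */
typedef struct {
    uint32_t n_ops;             /* operations in the function              */
    uint32_t sw_cycles_per_op;  /* processor cycles per operation          */
    uint32_t hw_cycles_total;   /* hardware cycles with operations overlapped */
    uint32_t bus_accesses;      /* processor-driven reads and writes       */
    uint32_t cycles_per_access; /* cycles per interface access             */
} accel_estimate;

/* The processor executes the operations one after another. */
uint64_t software_cycles(const accel_estimate *e)
{
    return (uint64_t)e->n_ops * e->sw_cycles_per_op;
}

/* The accelerator pays for the interface traffic plus its own compute time. */
uint64_t accelerator_cycles(const accel_estimate *e)
{
    return (uint64_t)e->bus_accesses * e->cycles_per_access
         + e->hw_cycles_total;
}

/* Offloading only helps when the accelerator total is strictly smaller. */
bool accelerator_pays_off(const accel_estimate *e)
{
    return accelerator_cycles(e) < software_cycles(e);
}
```

With a single one-cycle add and three one-cycle interface accesses, the accelerator side comes to 4 cycles against 1 in the processor, so it loses. It only starts to win when many operations overlap in hardware for the same amount of interface traffic.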
The attached text is a cycle log from a simulation of the C processor that I mentioned in a previous post.
It shows a few iterations of a for loop that adds two numbers, where each iteration takes 9 clock cycles. Adding many numbers would also involve access time to external memory, so the 9-cycle figure is not really accurate. How many cycles does your processor take to execute the same code?