What you are running into is Nios II processing time, not Avalon. I recommend taking a look at your system implementation to see whether it merits hardware acceleration (co-processing, DMA, etc.). As suntick said, if your code is highly sequential you may not see much gain from hardware acceleration. That is not always the case, though: a sequential algorithm can sometimes be pipelined in hardware so that back-to-back data is processed without waiting for the previous item to finish (it's algorithm dependent).
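To make the pipelining point concrete, here is a rough cycle-count sketch in C. It is only a back-of-the-envelope model, not a measurement of any real Nios II system: the function names are made up for illustration, and it assumes an ideal pipeline where every stage takes one cycle, a new item can enter every cycle, and items do not depend on each other.

```c
#include <stdint.h>

/* Hypothetical model: each item needs `stages` cycles, and the next item
 * cannot start until the previous one has completely finished
 * (roughly what a software loop on the CPU does). */
uint64_t sequential_cycles(uint64_t n_items, uint64_t stages)
{
    return n_items * stages;
}

/* Hypothetical model: same work in an ideal hardware pipeline.
 * The first result appears after `stages` cycles (the fill latency),
 * then one more result completes every cycle after that.
 * Assumes no stalls and no dependency between consecutive items. */
uint64_t pipelined_cycles(uint64_t n_items, uint64_t stages)
{
    if (n_items == 0) {
        return 0; /* nothing to process, so the pipeline never fills */
    }
    return stages + (n_items - 1);
}
```

With 1000 items and 8 stages this model gives 8000 cycles sequentially versus 1007 pipelined. Note that the latency of a single item is still 8 cycles in both cases, so pipelining only pays off when you have a stream of back-to-back data and each item does not need the result of the one before it.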
I think if you describe your data flow and the nature of your algorithm in a bit more detail, you may get more precise answers on how to improve your processing latency.