Also somewhere else in this forum someone was asking about implementing the FFT algorithm in software on Nios II or in dedicated hardware, if you search for it you'll probably find it. I barely remember DSP from school but to me FFT means 2 multiplies and 2 additional/subtractions per input pair using the basic radix 2 algorithm. So per input there is a lot of software calculations going on which is why most people use DSPs and/or dedicated hardware for this sort of thing.
If I was implementing FFT under high bandwidth requirements I would use an FFT with Avalon streaming inputs and outputs and use DMAs that can transfer between memory-mapped and streaming to shovel data in and out of the dedicated FFT hardware. This might be overkill for what you need to implement so consider this the extreme end of the spectrum between all C code and highly parallelized hardware :)