--- Quote Start ---
You will need to understand the object code generated by the compiler.
Compiling with (IIRC) -S --verbose-asm will give an annotated assembly output.
You might either find that you've miscounted the number of instructions, or that there are some pipeline stalls because values read from memory can't be used for the next two clocks (ie there is a stall if either of the next two instructions use the read value).
A quick look at your code makes me think that you need to copy some values to locals. I suspect a lot of them are being reread from memory multiple times in the loop.
--- Quote End ---
I agree with DSL that the generated code needs to be analyzed to have a fair chance of correlating the measured timing with the instruction count. I have been involved with benchmark tuning, and that method yields results: analyze what the code generator produced. There I had access to a "scroll pipe", a trace of the pipeline execution of each instruction, but we don't have that here.
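To make the "copy some values to locals" advice concrete, here is a minimal sketch. The struct, field names and functions are made up for illustration and are not from your code. With GCC the annotated listing comes from -S -fverbose-asm (that is the GCC spelling of the option), so comparing the two loops in that output should show whether the loads really are repeated on every pass.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block; the names are illustrative only. */
typedef struct {
    int32_t gain;
    int32_t offset;
} filter_cfg_t;

/* Version A: cfg->gain and cfg->offset may be reloaded on every
 * iteration, because the compiler cannot prove that a store to
 * out[i] does not modify *cfg (both are int32_t, so they may alias). */
void scale_reload(int32_t *out, const int32_t *in, size_t n,
                  const filter_cfg_t *cfg)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * cfg->gain + cfg->offset;
    }
}

/* Version B: read the values once into locals so the compiler is
 * free to keep them in registers for the whole loop. */
void scale_locals(int32_t *out, const int32_t *in, size_t n,
                  const filter_cfg_t *cfg)
{
    const int32_t gain = cfg->gain;
    const int32_t offset = cfg->offset;

    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * gain + offset;
    }
}
```

Both versions compute the same result as long as out[] does not actually overlap *cfg. Whether version A really reloads depends on the compiler and optimization level, which is exactly why the listing is worth reading.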
Did you get past the contention for access to the internal memory? I have not tried dual porting, but another approach may be to compute your coefficients with the NIOS and have two independent coefficient buffers for the DMA to work on in ping-pong fashion. This would require three internal memories: the main one for the NIOS, plus a dedicated ping memory and a dedicated pong memory, to fully decouple the DMA from the NIOS.
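A rough sketch of the ping-pong idea, written so it runs on a host. Everything here is an assumption for illustration: dma_start() and dma_wait_done() are stand-in stubs rather than a real DMA driver API, compute_coefficients() is a placeholder, and on hardware the two buffers would be placed in the two dedicated on-chip memories instead of being plain arrays.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_COEFFS 8

/* On hardware: one buffer per dedicated on-chip memory (assumption). */
static int16_t ping_buf[NUM_COEFFS];
static int16_t pong_buf[NUM_COEFFS];

/* Hypothetical DMA stubs: the real calls depend on your DMA core. */
static const int16_t *dma_active_buf = NULL;

static void dma_start(const int16_t *src, size_t n)
{
    (void)n;
    dma_active_buf = src;   /* just record which buffer the DMA reads */
}

static void dma_wait_done(void)
{
    /* On real hardware: poll the DMA status or wait for its interrupt. */
}

/* Placeholder coefficient generator. */
static void compute_coefficients(int16_t *dst, size_t n, int16_t seed)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (int16_t)(seed + (int16_t)i);
    }
}

/* Runs `blocks` transfers and returns the buffer last handed to the DMA. */
const int16_t *run_ping_pong(int blocks)
{
    int16_t *fill = ping_buf;    /* buffer the CPU writes next */
    int16_t *drain = pong_buf;   /* buffer the DMA reads next  */

    compute_coefficients(fill, NUM_COEFFS, 0);   /* prime the first buffer */

    for (int b = 0; b < blocks; b++) {
        /* Swap roles: the buffer just filled becomes the DMA source. */
        int16_t *tmp = fill;
        fill = drain;
        drain = tmp;

        dma_start(drain, NUM_COEFFS);
        /* The CPU fills the other buffer while the DMA is busy. */
        compute_coefficients(fill, NUM_COEFFS, (int16_t)(100 * (b + 1)));
        dma_wait_done();
    }
    return dma_active_buf;
}
```

The point of the swap is that the CPU and the DMA never touch the same memory during a block, so neither master should stall on the other.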
To view contention without side effects, SignalTap can be used; otherwise, bring the AVALON DMA and NIOS data master read and write signals out and probe them with a scope or logic analyzer.
Best Regards, Bob.