Thank you, rppolicy, for posting those data. It seems that we are both getting pretty much the same results. If I have done the calculations correctly, Nios II/f spends about 630 cycles in each iteration of your loop, which contains a float multiply-accumulate and two pointer increments. So my measured 1100 cycles per double-precision multiply, allowing for the double vs. float overhead, is more or less in agreement with your results.
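For anyone following along, here is a rough sketch of the kind of loop I have in mind when I talk about "cycles per iteration". This is my own reconstruction from the description (one float multiply-accumulate and two pointer increments), not rppolicy's actual code, and the names mac_loop and cycles_per_iteration are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical benchmark kernel: one float multiply-accumulate and
   two pointer increments per iteration. Without FP hardware, the
   multiply and the add each become a call into the soft-float library. */
float mac_loop(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    while (n--) {
        acc += *a++ * *b++;
    }
    return acc;
}

/* Average cycles per iteration from a measured total cycle count.
   How total_cycles is obtained (timer, performance counter) is left
   to the platform and is not shown here. */
unsigned long cycles_per_iteration(unsigned long total_cycles,
                                   unsigned long iterations)
{
    assert(iterations != 0);
    return total_cycles / iterations;
}
```

With a total cycle count divided by the loop count like this, the Nios II/f figure comes out at roughly 630 cycles per iteration.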
On the other hand, according to your results, ARM needs about 85 cycles per iteration of your loop, which is at least a 7x performance advantage over Nios II/f, as you said in your first post. Whether this performance gap is only a result of the ARM software floating-point library exploiting the H/W barrel shifter is definitely a question that needs an answer. Probably the most appropriate person to answer it is someone from Altera who knows more than we do about the implementation of Altera's software floating-point library.
If exploiting a H/W barrel shifter gives such a boost in FP operations, and since Nios II/f already has a H/W barrel shifter, I think it is a pity not to use it. But then again, it could be that the H/W barrel shifter is not what makes the difference. Perhaps it is the instruction set differences (not the best candidate, in my opinion), or the software implementation of the Altera FP library, or even a "bug". In any case, I feel that this issue definitely needs and deserves further investigation.
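To make the barrel shifter argument a bit more concrete, here is a minimal sketch of the mantissa normalization step that a software FP routine typically needs after an add or multiply. It is only an illustration of why multi-bit shifts matter. It is not taken from Altera's or ARM's library, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Normalize one bit per step, as code might do when multi-bit shifts
   are expensive. Returns the number of single-bit shifts performed. */
unsigned normalize_stepwise(uint32_t *mant)
{
    unsigned steps = 0;
    assert(*mant != 0);
    while ((*mant & 0x80000000u) == 0) {
        *mant <<= 1;
        steps++;
    }
    return steps;
}

/* Normalize with at most five multi-bit shifts (binary search for the
   leading one). Each shift is cheap when a barrel shifter is present.
   Returns the total shift amount. */
unsigned normalize_barrel(uint32_t *mant)
{
    uint32_t m = *mant;
    unsigned shift = 0;
    assert(m != 0);
    if ((m & 0xFFFF0000u) == 0) { m <<= 16; shift += 16; }
    if ((m & 0xFF000000u) == 0) { m <<= 8;  shift += 8;  }
    if ((m & 0xF0000000u) == 0) { m <<= 4;  shift += 4;  }
    if ((m & 0xC0000000u) == 0) { m <<= 2;  shift += 2;  }
    if ((m & 0x80000000u) == 0) { m <<= 1;  shift += 1;  }
    *mant = m;
    return shift;
}
```

Both functions give the same result, but in the worst case the first one goes around its loop 31 times while the second never does more than five shifts. If the FP library, or the code the compiler generates for it, ends up closer to the first form, that alone could account for a good part of the gap.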
Jese: I fully agree with you that it sounds awfully like a software difference. As for the alternative solution you suggested and your analysis of it, both are now clear to me.
BadOmen: I think it was me. I had started another thread before this one about FP performance, and you had posted there, giving me the information about the available H/W FP IPs. Thank you once more.