Another try:
Implementing the simple sequence above as a custom instruction already saves a lot of time, for only 280 LEs in a Cyclone.
I measured the time for this loop:
int r;
volatile int* pr = &r; // prevent optimisations
for (int a = -1000; a < 1000; a++)
    for (int b = 1000; b > -1000; b--) // count backwards: prevent optimisations
        *pr = a*b;
Without the custom instruction: about 22.4 seconds at 50 MHz with the standard Nios II.
With the custom instruction in the mulsi function: 4.3 seconds (without frame pointer stuff).
With the custom instruction inlined (without the additional function call and ret instruction): 2.4 seconds.
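For anyone who wants to try the inlined variant, here is a rough sketch of how it could look in C. It assumes the Nios II GCC builtin __builtin_custom_inii is used to issue the custom instruction, and that the multiplier sits at opcode index 0. That index, and the names MUL_CI_N, ci_mul and run_benchmark, are made up for illustration and are not taken from my actual setup. On other targets it falls back to a plain multiply, so the sketch can be compiled and checked on a PC.

```c
#include <stdint.h>

/* Hypothetical opcode index: use whatever index the multiplier was given
   when the custom instruction was added to the CPU. */
#define MUL_CI_N 0

/* Inlined multiply. On Nios II GCC this is meant to compile to a single
   custom instruction with no call/ret; on any other target it falls back
   to an ordinary 32-bit multiply. */
static inline int32_t ci_mul(int32_t a, int32_t b)
{
#if defined(__nios2__) || defined(__NIOS2__)
    return __builtin_custom_inii(MUL_CI_N, a, b);
#else
    /* Multiply as unsigned so overflow wraps like 32-bit hardware. */
    return (int32_t)((uint32_t)a * (uint32_t)b);
#endif
}

/* The benchmark loop from the post, calling the inlined wrapper.
   Returns the last product written so the result can be checked. */
int32_t run_benchmark(void)
{
    int32_t r = 0;
    volatile int32_t *pr = &r;          /* keep the store from being removed */
    for (int a = -1000; a < 1000; a++)
        for (int b = 1000; b > -1000; b--)
            *pr = ci_mul(a, b);
    return r;
}
```

Because ci_mul is static inline, the compiler can drop the extra call and ret around the instruction, which is where the difference between the 4.3 and 2.4 second results comes from.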
I think this is a very big advantage for only 280 LEs (22.4 seconds down to 4.3 seconds for 4,000,000 multiplies, including the overhead of the loops).
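To put these times in perspective, they can be converted to clock cycles per loop iteration. The small helper below (cycles_per_iteration is just an illustrative name) does the arithmetic from the measured time, the 50 MHz clock and the 4,000,000 iterations. Note that the result includes the loop overhead and the volatile store, not only the multiply.

```c
/* Clock cycles spent per loop iteration, given the total run time,
   the clock frequency in Hz and the number of iterations. */
double cycles_per_iteration(double seconds, double clock_hz, double iterations)
{
    return seconds * clock_hz / iterations;
}
```

With the numbers above this gives about 280 cycles per iteration without the custom instruction, about 54 with the custom instruction in the mulsi function, and 30 with it inlined.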
But I think the compiler still assumes that a multiply is very expensive, so it tries to turn multiplications into shifts and adds where possible. This can reduce the benefit of the custom instruction a lot.
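As an illustration of the kind of rewrite I mean: when one operand is a known constant, a multiply can be replaced by shifts and adds. The function below (times10_shift_add is a made-up name) shows by hand what a compiler might generate for x*10. It is only a sketch of the technique, not actual compiler output.

```c
#include <stdint.h>

/* x * 10 written as (x << 3) + (x << 1), i.e. 8x + 2x.
   The shifts are done on the unsigned value so that negative
   inputs are well defined in C. */
static inline int32_t times10_shift_add(int32_t x)
{
    uint32_t u = (uint32_t)x;
    return (int32_t)((u << 3) + (u << 1));
}
```

In the benchmark loop both operands are variables, so this kind of rewrite is not possible there and the custom instruction is always used. In real code with many constant multiplies the gain would probably be smaller than measured.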
If anyone is interested, I'll post the Verilog code for the custom instruction.