Use fixed point :-)
I've no idea how fast the FP instructions are, but:
Have you actually verified that the custom instructions are being executed?
Remember they only do 'float', not 'double' - and you need to make sure everything is 'float' (variables, constants and library calls), otherwise you'll get a lot of float<->double conversion happening.
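To illustrate the float/double point, here is a minimal sketch in C. The function name `mix` is made up for the example; the thing to look at is the `f` suffix on the constant.

```c
/* Hypothetical helper: everything stays in single precision. */
float mix(float a, float b, float t)
{
    /* 1.0f is a float literal. Writing plain 1.0 would make it a double,
       so the subtraction and multiply would be done in double and the
       result converted back to float afterwards. */
    return a * (1.0f - t) + b * t;
}
```

The same applies to library calls (sqrtf() rather than sqrt(), and so on). If you are building with gcc, -Wdouble-promotion should warn about places where a float gets silently promoted to double.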
Does that FPGA have the DSP multiplier blocks in it? (And do the FP custom instructions use them if it does?)
(Even for the integer multiply, Altera ought to give the option of throwing logic into the multiply instruction to support faster multiplies and/or 64-bit results.)
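For the fixed point suggestion at the top, a rough sketch of what I mean, assuming a 16.16 format is enough range and precision for your data (that is an assumption, as are the names `q16_16`, `Q16_ONE` and `q16_mul`). Note the multiply wants a 64-bit intermediate result, which is why the speed of the integer multiply matters.

```c
#include <stdint.h>

/* Hypothetical type: 16 integer bits, 16 fraction bits. */
typedef int32_t q16_16;

#define Q16_ONE ((q16_16)1 << 16)   /* the value 1.0 in 16.16 */

/* Add and subtract are plain integer operations; only multiply needs care. */
static inline q16_16 q16_mul(q16_16 a, q16_16 b)
{
    /* Widen to 64 bits so the intermediate product cannot overflow,
       then shift back down to 16 fraction bits. Right-shifting a negative
       value is implementation-defined in C; gcc does an arithmetic shift. */
    return (q16_16)(((int64_t)a * (int64_t)b) >> 16);
}
```

For example, q16_mul(3 * Q16_ONE / 2, 2 * Q16_ONE) is 1.5 * 2.0 and gives 3 * Q16_ONE.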