The easy way is to create custom instructions and use them the way you have shown below. There are things you can do with the compiler to associate * and / with the new multiply and divide hardware (I don't know how, because I don't mind the extra typing of mul(a, b) and div(a, b)).
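For what it's worth, here is a minimal sketch of the wrapper approach, not a definitive implementation. The names fp_mul and fp_div, the opcode indices FP_MUL_N and FP_DIV_N, and the USE_FP_CUSTOM_INSTR switch are all made up for illustration (I avoided plain div because it clashes with div() in stdlib.h); the real indices come from however the custom instructions were added to your system. It assumes the GCC Nios II builtin __builtin_custom_fnff (float result, two float operands) and falls back to ordinary C arithmetic on any other target, so it can be tried on a PC first.

```c
/* Hypothetical opcode indices for the custom instructions.
 * Assumption: replace these with the values from your own system. */
#define FP_MUL_N 0
#define FP_DIV_N 1

#if defined(__nios2__) && defined(USE_FP_CUSTOM_INSTR)
/* On Nios II, route the call to the custom instruction hardware
 * through the GCC builtin (float result, two float operands). */
static inline float fp_mul(float a, float b)
{
    return __builtin_custom_fnff(FP_MUL_N, a, b);
}

static inline float fp_div(float a, float b)
{
    return __builtin_custom_fnff(FP_DIV_N, a, b);
}
#else
/* Fallback: plain C operators, which the compiler turns into
 * software library calls or native FPU instructions. */
static inline float fp_mul(float a, float b)
{
    return a * b;
}

static inline float fp_div(float a, float b)
{
    return a / b;
}
#endif
```

Calling fp_mul(x, y) then replaces x * y wherever you want the hardware used. As for tying * and / to the hardware directly, GCC's Nios II port documents -mcustom-fmuls=N and -mcustom-fdivs=N switches that appear to do this, but check that against your toolchain version.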
But when you say that the support library for the ARM is more efficient, it is still like comparing apples to oranges. The number of clock cycles you quoted isn't all that bad (I think, anyway). I have seen processors take much longer than that with floating point math and no hardware support. How much floating point support you add in hardware really depends on the application. If you just need to take two numbers and spit out an answer, then simple math is all that is needed. If you need to do control based on floating point values, then it becomes a question of whether or not to add hardware for that as well (logic size versus speed trade-off).