Without modifying anything, you won't be able to boost the performance.
The quickest way to do this is to OR the upper half with the sign bit (that shouldn't take too many more cycles).
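To make the idea concrete, here is a rough sketch in C. It assumes the operation in question is a signed (arithmetic) right shift of a 32-bit value; the helper name `asr32` is made up for illustration and is not part of any Altera library. It does a plain logical shift and then ORs the sign bit into the vacated upper bits. Whether this actually beats the built-in shift depends on your core configuration, so measure before relying on it.

```c
#include <stdint.h>

/* Hypothetical helper: arithmetic right shift built from a logical
 * shift plus an OR that fills the vacated upper bits with the sign. */
static int32_t asr32(int32_t value, unsigned n)
{
    uint32_t u = (uint32_t)value;   /* work on the raw bit pattern */
    uint32_t shifted;

    n &= 31u;                       /* keep the shift count in range */
    shifted = u >> n;               /* logical shift: zeros come in on top */

    if ((u & 0x80000000u) && n != 0) {
        /* Negative input: set the top n bits to 1. */
        shifted |= ~(0xFFFFFFFFu >> n);
    }

    /* Converting back to int32_t is implementation-defined for values
     * above INT32_MAX, but behaves as two's complement on common compilers. */
    return (int32_t)shifted;
}
```

For example, `asr32(-8, 1)` gives `-4` and `asr32(16, 2)` gives `4`, the same results an arithmetic shift would produce.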
When you just use the shifting functionality, you are using the DSP blocks (the hardware multiplier) to do it. So without modifying anything you will be stuck with these, just as you can't make a P4 go any faster without changing your code.
The Cyclone is a low-cost version of the Stratix, so you get what you pay for. If it could perform like a Stratix, there would be no need for a Stratix, and the Cyclone would end up costing the same.
So, long story short: if you want performance, you're going to have to do some work to get it from that FPGA.