On Stratix, the hardware multiplier is used to implement shifting. Cyclone only has "soft" DSP blocks, so it uses memory to perform something that Stratix can do in very few cycles (probably one).
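To show why a multiplier can stand in for a shifter, here is a rough sketch in C. It is only an illustration of the arithmetic, not the actual NIOS II implementation; the function names shl_via_mul and shr_via_mul are made up for this example. A left shift by n is a multiply by 2^n, and a logical right shift by n is the upper word of a multiply by 2^(32-n).

```c
#include <stdint.h>

/* Left shift by n (0..31) written as a multiply by 2^n.
 * Hypothetical helper, for illustration only. */
uint32_t shl_via_mul(uint32_t x, unsigned n)
{
    uint32_t pow2 = (uint32_t)1 << n;  /* constant with only bit n set */
    return x * pow2;                   /* low 32 bits of the product equal x << n */
}

/* Logical right shift by n (0..31) written as the high word of a
 * 32x32 -> 64 bit multiply by 2^(32 - n).
 * Hypothetical helper, for illustration only. */
uint32_t shr_via_mul(uint32_t x, unsigned n)
{
    uint64_t pow2 = (uint64_t)1 << (32u - n);  /* 2^(32-n), fits in 64 bits */
    uint64_t prod = (uint64_t)x * pow2;        /* full 64-bit product */
    return (uint32_t)(prod >> 32);             /* upper word equals x >> n */
}
```

With a hardware multiplier, either form is a single multiply, which is presumably why the shift is so cheap on Stratix. Without one, the multiply itself has to be built some other way, and the shortcut is lost.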
So in short, I think you're SOL for doing this fast without building your own hardware to do it. It's possible to prevent the NIOS from using the hardware altogether by turning off the parameter in the ptf file for your generated core. I don't know if that'll help you, though (I've never compiled NIOS II for Cyclone).
Sorry for the bad news, but... well, you know my name and all.