The custom instruction idea is interesting.
You should also consider building a dedicated hardware unit (interfaced as a custom instruction, an Avalon peripheral, or a PIO) that takes the workload off the Nios core and returns only precomputed results to it. That way you will be less reliant on IO speed.
I don't know your application, of course, but often some rethinking of the architecture can move more functionality into hardware, and the speed increase can be dramatic.
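To make the point about IO traffic concrete, here is a rough host-side sketch in C. It is not Nios code: the device struct, the register layout, and all function names are made up for illustration, and the "hardware unit" is just a plain function standing in for FPGA logic. It only counts simulated bus reads to show that the CPU does one read instead of one per sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_SAMPLES 8u

/* Hypothetical peripheral: raw samples plus one result register.
   On a real system these would be memory-mapped registers. */
typedef struct {
    uint32_t samples[N_SAMPLES]; /* raw data held by the peripheral */
    uint32_t sum;                /* result register written by the hardware unit */
} mock_device_t;

static unsigned bus_reads; /* number of simulated bus transactions */

void bus_read_reset(void) { bus_reads = 0; }
unsigned bus_read_count(void) { return bus_reads; }

/* Every CPU access to the peripheral goes through here and is counted. */
static uint32_t bus_read(const volatile uint32_t *reg)
{
    ++bus_reads;
    return *reg;
}

/* Software approach: the CPU pulls every sample over the bus and adds them. */
uint32_t sum_in_software(const mock_device_t *dev)
{
    assert(dev != NULL);
    uint32_t acc = 0;
    for (size_t i = 0; i < N_SAMPLES; ++i)
        acc += bus_read(&dev->samples[i]);
    return acc;
}

/* Stand-in for the dedicated hardware unit. In an FPGA this would be logic
   running next to the data, so it does not go through bus_read(). */
void mock_hw_accumulate(mock_device_t *dev)
{
    assert(dev != NULL);
    uint32_t acc = 0;
    for (size_t i = 0; i < N_SAMPLES; ++i)
        acc += dev->samples[i];
    dev->sum = acc;
}

/* Hardware approach: the CPU reads only the precomputed result. */
uint32_t sum_in_hardware(const mock_device_t *dev)
{
    assert(dev != NULL);
    return bus_read(&dev->sum);
}
```

With eight samples, sum_in_software() costs eight counted reads, while sum_in_hardware() costs one after mock_hw_accumulate() has run. How much that helps in a real design depends on the application and on how the unit is attached.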
Given the details, I am sure many people on the Nios forum could give suggestions in that direction as well.
regards
henning