If you can do the locked read-write Avalon bus cycle at all, you should be able to generate one from a custom instruction - except that it would have to use a separate Avalon master and so would bypass the data cache.
Actually you'd have no cache coherency either - very grim!
You'd have to use an external data cache.
I've long thought that the Nios CPU isn't much more than a great heap of muxes.
My guess is that RA and RB are always read, and that the pipeline stalls (re-executes) if a write to RA is pending, and likewise for RB if the low two instruction bits differ (no idea why the instructions aren't organised so this is a single bit!). This gives the instruction three 32-bit words to play with; write-back to RB or RC is dependent on the decoded instruction.
That makes me think that the readra/readrb bits of the custom instruction are ignored - but I've not done any experiments.
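
For anyone wanting to run that experiment, here is a rough sketch of how the custom instruction word could be picked apart to see which bits are actually set. The field positions (A in 31:27, B in 26:22, C in 21:17, readra bit 16, readrb bit 15, writerc bit 14, N in 13:6, opcode 0x32 in 5:0) are my recollection of the Nios II instruction set reference, so check them against the manual before relying on this. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode value for "custom", as I recall it from the reference (assumption). */
#define NIOS2_OP_CUSTOM 0x32u

/* Hypothetical container for the fields of a custom instruction word. */
typedef struct {
    unsigned a, b, c;                  /* 5-bit register indices */
    unsigned readra, readrb, writerc;  /* 1-bit flags */
    unsigned n;                        /* 8-bit custom instruction selector */
    unsigned op;                       /* 6-bit major opcode */
} custom_fields;

/* Pack the fields into a 32-bit instruction word (assumed layout). */
static uint32_t encode_custom(custom_fields f)
{
    assert(f.a < 32 && f.b < 32 && f.c < 32);
    assert(f.readra < 2 && f.readrb < 2 && f.writerc < 2);
    assert(f.n < 256);
    return ((uint32_t)f.a << 27) | ((uint32_t)f.b << 22) |
           ((uint32_t)f.c << 17) | ((uint32_t)f.readra << 16) |
           ((uint32_t)f.readrb << 15) | ((uint32_t)f.writerc << 14) |
           ((uint32_t)f.n << 6) | NIOS2_OP_CUSTOM;
}

/* Unpack a 32-bit instruction word into its fields (assumed layout). */
static custom_fields decode_custom(uint32_t insn)
{
    custom_fields f;
    f.a       = (insn >> 27) & 0x1Fu;
    f.b       = (insn >> 22) & 0x1Fu;
    f.c       = (insn >> 17) & 0x1Fu;
    f.readra  = (insn >> 16) & 1u;
    f.readrb  = (insn >> 15) & 1u;
    f.writerc = (insn >> 14) & 1u;
    f.n       = (insn >> 6) & 0xFFu;
    f.op      = insn & 0x3Fu;
    return f;
}
```

The experiment would then be to hand-assemble the same custom instruction with readra/readrb cleared and see whether the hardware still sees the register values on dataa/datab. If it does, the bits are being ignored by the register read stage, as guessed above.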