If your FPGA blocks are running at 50MHz then you'll have a clock crossing bridge between them and the CPU, which will slow things down a lot.
I'm not a hardware expert, but it might be possible to run the Avalon slave at 100MHz and use a synchronous 50MHz clock (e.g. from a divide by 2) for the rest of the logic in order to avoid the clock crossing bridge
(or, maybe, synchronise below the Avalon slave interface).
One thing I haven't determined is whether the 'readra' and 'readrb' bits of a custom instruction actually have any effect on whether the pipeline stalls waiting for writes to the 'rA' and 'rB' registers.
My suspicion is that (except for call and jmpi) the CPU always stalls on 'rA' and stalls on 'rB' if the low 2 bits of the instruction differ. Checking the bits of the custom instruction would take far too much logic.
The effect is that every instruction has the 32-bit instruction word, and the two register values from rA and rB, available as inputs!
I did a custom instruction for byte/bitswap that used the 'B' field to select the required operation.
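
For anyone wanting to try the same trick, here is a rough software model in C of what such an instruction might compute. It is only a sketch: the selector values (0, 1, 2) and the names swap_custom, field_b and so on are made up for illustration and are not my actual encoding, and it assumes the 'B' field sits in bits 26..22 of the instruction word, which is worth checking against the instruction set reference.

```c
#include <stdint.h>

/* Hypothetical selector values carried in the 5-bit 'B' field. */
enum swap_op { SWAP_BYTES = 0, SWAP_BITS = 1, SWAP_BITS_IN_BYTES = 2 };

/* Extract the 'B' field, ASSUMING it occupies bits 26..22 of the word. */
static unsigned field_b(uint32_t insn)
{
    return (insn >> 22) & 0x1Fu;
}

/* Reverse the order of the four bytes in a 32-bit word. */
static uint32_t byte_swap(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Reverse the bits inside each byte, leaving the byte order alone. */
static uint32_t bits_in_bytes(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    return x;
}

/* Reverse all 32 bits: reverse within each byte, then reverse the bytes. */
static uint32_t bit_swap(uint32_t x)
{
    return byte_swap(bits_in_bytes(x));
}

/* Model of the custom instruction: the instruction word picks the
 * operation via its 'B' field, and the rA value is the data input. */
uint32_t swap_custom(uint32_t insn, uint32_t ra)
{
    switch (field_b(insn)) {
    case SWAP_BYTES:         return byte_swap(ra);
    case SWAP_BITS:          return bit_swap(ra);
    case SWAP_BITS_IN_BYTES: return bits_in_bytes(ra);
    default:                 return ra; /* unknown selector: pass through */
    }
}
```

Because the selector comes from the instruction word rather than from a register, the 'B' field costs nothing at run time; in hardware each case is just wiring plus a mux on the result.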