For a custom instruction you'd have to use multiple opcodes to write the 128-bit values and then multiple opcodes to retrieve the result.
It could all be done with a single clocked (not combinatorial) custom instruction opcode.
If 'readrc' is zero, use the 5-bit C value to select where to save the 32-bit rA and rB values.
If 'readrc' is one, use the A field to determine which result to return.
(Actually you can look at the writera bit and the 32-bit A value as well, and all the B ones.)
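
As a rough illustration of that decode, here is a small software model in C. It is only a sketch: the names ci_state, ci_exec and ci_result are made up for this example, and the 128-bit XOR is a stand-in for whatever operation the real logic would perform. It assumes operand X is written to slots 0-1 and operand Y to slots 2-3, with each write storing rA and rB as a pair.

```c
#include <stdint.h>

/* Hypothetical model: 32 slots selected by the 5-bit C field,
 * each holding the rA/rB pair from one write. */
typedef struct {
    uint32_t a[32];
    uint32_t b[32];
} ci_state;

/* Stand-in for the real 128-bit operation (an assumption): XOR of the
 * operand in slots 0-1 with the operand in slots 2-3.
 * 'word' (0..3) selects which 32-bit piece of the result to return. */
static uint32_t ci_result(const ci_state *s, unsigned word)
{
    unsigned slot = (word >> 1) & 1;
    if (word & 1)
        return s->b[slot] ^ s->b[slot + 2];
    return s->a[slot] ^ s->a[slot + 2];
}

/* One execution of the single opcode. */
uint32_t ci_exec(ci_state *s, unsigned readrc, unsigned a_field,
                 unsigned c_field, uint32_t ra, uint32_t rb)
{
    if (readrc == 0) {
        /* Write: the C field selects where rA and rB are saved. */
        s->a[c_field & 31] = ra;
        s->b[c_field & 31] = rb;
        return 0;
    }
    /* Read: the A field selects which result word comes back. */
    return ci_result(s, a_field & 3);
}
```

With this layout a 128-bit by 128-bit operation takes four writes (two per operand) and four reads, all through the same opcode.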
I never looked to see whether the writera/writerb bits have any effect on the CPU logic. I strongly suspect that the pipeline stall (for an earlier 'late result') always happens. All other instructions stall on the A field; a B field stall is needed if the low two bits of the opcode differ. It seems highly unlikely that the custom opcode bits get fed into that logic; it is even possible that it applies to jmpi and call instructions.