I'd read the gcc documentation about how to describe instructions, then look at the existing ports and find one that is easy to copy!
Building the knowledge into gcc itself (rather than trying to directly use custom instructions) is probably easier.
I'd have thought that the 'clocks per operation' would tend to dominate over the absolute number of instructions though - especially if values have to be normalised (e.g. for add/subtract).
Might be worth using a combinatorial instruction to read from your FP register file - you'll still need 2 instructions to get a 64-bit value, but at least you won't have to worry about 'late result' delays.
It is also worth remembering that if the 'writerc' bit is 0, then the 5 bits of C can be used for any purpose - you could use 3 bits to select a register and 2 bits as an opcode extension (st, add, sub, mul?).
Similarly if 'readrb' is zero, the B field could be used to determine how to convert the 32-bit Ra to FP.
(I don't know if the cpu does a decode phase stall when readrb (or readra) is zero and the selected register value isn't available.)
In any case, this will reduce the number of custom instruction slots you need.
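As a rough sketch of the 3+2 bit split of the C field suggested above (the bit positions within the field, the enum values and the function names are all my own assumptions, not anything the hardware or gcc defines):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode extensions carried in 2 bits of the C field. */
enum fp_op { FP_ST = 0, FP_ADD = 1, FP_SUB = 2, FP_MUL = 3 };

/* Pack a 3-bit FP register number and a 2-bit opcode extension into
 * the 5-bit C field. Assumed layout: op in bits 4:3, register in 2:0. */
static inline uint32_t pack_c_field(unsigned fp_reg, enum fp_op op)
{
    assert(fp_reg < 8u);        /* only 3 bits available for the register */
    assert((unsigned)op < 4u);  /* only 2 bits available for the opcode */
    return ((uint32_t)op << 3) | fp_reg;
}

/* Recover the FP register number from a packed C field. */
static inline unsigned c_field_reg(uint32_t c)
{
    return c & 0x7u;
}

/* Recover the opcode extension from a packed C field. */
static inline enum fp_op c_field_op(uint32_t c)
{
    return (enum fp_op)((c >> 3) & 0x3u);
}
```

With this layout a single custom instruction slot covers four operations on eight FP registers, which is where the saving in slots comes from.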
I'd build gcc, add some FP register and instruction definitions, and look at the code!