The three instructions generated for each double-precision operator are as follows:
1) Send the first operand to the custom instruction (usually a single cycle)
2) Send the second operand to the custom instruction and execute (usually many clock cycles), then read back the first half of the result
3) Read back the second half of the result (usually a single cycle)
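
To make the sequence concrete, here is a rough host-side C sketch of the three steps for a double-precision add. It is only a model: the `ci_*` functions, the `fpu_*` variables, and `fpu_add` are hypothetical names I made up, not a real vendor API, and I am assuming each custom instruction carries two 32-bit inputs and returns one 32-bit result. Which half comes back first (low here) is also an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of state held inside the FPU (not a real API). */
static uint64_t fpu_operand_a; /* first operand, latched by instruction 1 */
static uint64_t fpu_result;    /* full 64-bit result, read out in two halves */

static uint64_t join_halves(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t double_to_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

static double bits_to_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Instruction 1: send both halves of the first operand; nothing useful returned. */
static uint32_t ci_send_first(uint32_t hi, uint32_t lo) {
    fpu_operand_a = join_halves(hi, lo);
    return 0;
}

/* Instruction 2: send the second operand, execute, return one half of the result. */
static uint32_t ci_send_second_and_add(uint32_t hi, uint32_t lo) {
    double a = bits_to_double(fpu_operand_a);
    double b = bits_to_double(join_halves(hi, lo));
    fpu_result = double_to_bits(a + b);
    return (uint32_t)fpu_result; /* low half (assumed order) */
}

/* Instruction 3: read back the other half of the result. */
static uint32_t ci_read_high(void) {
    return (uint32_t)(fpu_result >> 32);
}

/* One double-precision operator = three instructions over a 32-bit data path. */
double fpu_add(double x, double y) {
    uint64_t xb = double_to_bits(x);
    uint64_t yb = double_to_bits(y);

    ci_send_first((uint32_t)(xb >> 32), (uint32_t)xb);
    uint32_t lo = ci_send_second_and_add((uint32_t)(yb >> 32), (uint32_t)yb);
    uint32_t hi = ci_read_high();

    return bits_to_double(join_halves(hi, lo));
}
```

On real hardware each `ci_*` call would be a single custom instruction; the point of the model is just that 128 bits go in and 64 bits come out, which does not fit in fewer than three 32-bit transfers of this shape.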
All three are necessary because the processor uses a 32-bit data path and double precision is 64-bit. The only way I can see a register file reducing the amount of communication (and the number of instructions per operator) between the CPU and FPU is if you are performing a bunch of operations that share a common operand. But, as DSL said, the instruction count is fairly negligible compared to the total time the operator needs to complete the calculation (#2 above). At most you will shave off a single clock cycle per operator, if any at all.
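
A back-of-the-envelope model of that trade-off, as a sketch only. It assumes steps 1 and 3 each take exactly one cycle and that a register file lets you skip step 1 when the first operand is already resident in the FPU. The function names are hypothetical, and the execute latency is a parameter you would have to fill in for your own FPU.

```c
/* Cycles for one operator issued as three instructions:
 * send first operand (1) + send second and execute (exec_cycles) + read back (1). */
static unsigned cycles_three_instr(unsigned exec_cycles) {
    return 1u + exec_cycles + 1u;
}

/* Same operator when the first operand is already in an FPU register file,
 * so the single-cycle send in step 1 is skipped (assumed best case). */
static unsigned cycles_with_regfile(unsigned exec_cycles) {
    return exec_cycles + 1u;
}

/* Cycles saved per operator by the register file under this model. */
static unsigned cycles_saved(unsigned exec_cycles) {
    return cycles_three_instr(exec_cycles) - cycles_with_regfile(exec_cycles);
}
```

Under these assumptions the saving is one cycle no matter how long the execute step is, so the longer the operator takes, the smaller the relative gain.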