The most obvious optimisation is for the FP result to be made available as the source for the next operation - i.e. add a single FP accumulator.
That would remove the first 'frdy' and the second 'fwrx'.
This may even be true of the current FPGA - but gcc hasn't been told about it.
It might be that the ability to obtain half the result on the operation opcode makes it difficult to describe to gcc.
Not sure why gcc's register tracking generated the 4 'mov' instructions either!
To get anything like the ppc code, you'd also have to change the way FP arguments are passed, so that they arrive in FP registers.
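
To put a rough number on the accumulator idea, here is a toy model in Python. It is only a sketch under assumptions: 'fop' is a made-up placeholder for any FP operation opcode, the six-instruction sequence is an assumed shape for two chained operations (not real gcc output, and it ignores the 'mov' instructions), and fold_chained is a hypothetical helper. It drops every 'frdy' that is immediately followed by an 'fwrx', on the assumption that the 'fwrx' only writes back the value that was just read.

```python
# Toy model only: 'fop' is a placeholder for any FP operation opcode, and the
# sequence below is an assumed shape for r = (a op b) op c, not real gcc output.

def fold_chained(seq):
    """Drop each 'frdy' that is immediately followed by 'fwrx'.

    Assumes the 'fwrx' only writes back the value that 'frdy' just read,
    which is the round trip a single FP accumulator would make unnecessary.
    """
    out = []
    i = 0
    while i < len(seq):
        if seq[i] == "frdy" and i + 1 < len(seq) and seq[i + 1] == "fwrx":
            i += 2  # skip the read-back / write-again pair
            continue
        out.append(seq[i])
        i += 1
    return out


chained = ["fwrx", "fop", "frdy", "fwrx", "fop", "frdy"]
folded = fold_chained(chained)
print(len(chained), "->", len(folded), folded)
```

Under those assumptions the two chained operations go from six FP instructions to four, which matches losing the first 'frdy' and the second 'fwrx'. Whether the hardware can actually feed the result straight back in is a separate question.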