The processor manual for that chip advertises 3 cycles for a floating-point multiply-accumulate. It looks like your inner kernel needs two of them per iteration, so at 250 MHz that is considerable processing power.
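To put a rough number on that, here is a back-of-the-envelope sketch in C. It assumes the MACs issue back to back with no overlap and ignores loads, stores, and loop overhead, so treat the result as an optimistic ceiling rather than a prediction. The function names are made up for illustration, and the "two MACs per iteration" figure is only my reading of your kernel.

```c
#include <assert.h>

/* Optimistic ceiling on inner-kernel iterations per second.
 * Assumption: an iteration costs only its MACs, executed back to back,
 * with no pipelining overlap and no load/store or loop overhead. */
double kernel_iters_per_sec(double clock_hz,
                            unsigned cycles_per_mac,
                            unsigned macs_per_iter)
{
    assert(cycles_per_mac > 0 && macs_per_iter > 0);
    return clock_hz / ((double)cycles_per_mac * (double)macs_per_iter);
}

/* Ratio of measured cycles per iteration to the MAC-only estimate.
 * A value well above 1.0 hints that the cost lies outside the MACs
 * themselves (memory access, compiler settings, and so on). */
double slowdown_factor(double measured_cycles_per_iter,
                       unsigned cycles_per_mac,
                       unsigned macs_per_iter)
{
    assert(cycles_per_mac > 0 && macs_per_iter > 0);
    return measured_cycles_per_iter /
           ((double)cycles_per_mac * (double)macs_per_iter);
}
```

With the figures above, kernel_iters_per_sec(250e6, 3, 2) comes to roughly 41.7 million iterations per second. If you measure your loop at, say, 12 cycles per iteration, slowdown_factor(12.0, 3, 2) gives 2.0, meaning twice the MAC-only estimate.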
If you are actually realizing performance significantly worse than that, then your question becomes an STM32 tools/optimization discussion, which is not really appropriate for this forum, although many people here may be knowledgeable about it.
If you are certain you have already maxed out the chip and definitely require a co-processor in an FPGA, then you will need to say some more about your proposed system architecture and requirements.