I'd have thought you'd be able to run at 100MHz; maybe something is wrong with the clock sources when you try to do that.
Getting the clocking right is tricky (I'm no expert). ISTR some magic -3.3ns value, and also using the memory clock as the master clock...
If increasing the cache sizes makes a significant difference, then you must be thrashing the caches - probably worth determining whether it is the data or instruction cache.
Floating point will be slow. If you've got the FPGA real estate, the FP custom instructions will help float (but not double) operations.
One option is to convert your floating point to fixed point, then use integer operations. For that to work well you'll really want the mulx instructions (which seem to be available only with the DSP multipliers), and maybe a custom instruction to extract the required 32 bits from the 64-bit product.
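A rough sketch of the fixed-point multiply in plain C, in case it helps. The Q16.16 format (16 integer bits, 16 fraction bits) and the q16_* names are my own assumptions for illustration, not anything from your design. Whether the compiler actually uses mulx for the 64-bit product depends on how the core and toolchain are configured, so check the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Q16.16 fixed point: the format and all q16_* names are illustrative
 * assumptions, pick whatever split suits your data range. */
typedef int32_t q16_t;
#define Q16_FRAC_BITS 16
#define Q16_ONE ((q16_t)1 << Q16_FRAC_BITS)

/* Convert a small integer to Q16.16; the integer part only has 16 bits. */
static q16_t q16_from_int(int32_t n)
{
    assert(n >= -32768 && n <= 32767);
    return (q16_t)(n * Q16_ONE);
}

/* Multiply two Q16.16 values.
 * The full 32x32 product is a 64-bit Q32.32 value; the result we want is
 * bits 47..16 of it, which straddles the high and low 32-bit halves. That
 * is the "extract the required 32 bits" step.
 * Notes: overflow of the integer part is not handled, and right-shifting a
 * negative value is implementation-defined in C (arithmetic shift on the
 * usual compilers). */
static q16_t q16_mul(q16_t a, q16_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (q16_t)(product >> Q16_FRAC_BITS);
}
```

For example, q16_mul(q16_from_int(3), q16_from_int(4)) gives q16_from_int(12). The shift truncates rather than rounds; adding half an LSB (1 << 15) to the product before shifting would round instead, at the cost of one more add.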
Thinks - would a 32x32 full adder array execute in a single clock to perform a multiply (throwing gates at it!)?