Yes, we definitely need a lot of delay inserted. I haven't tried a run without Optimize Hold Timing for some time, but I guess I would end up with a TNS somewhere in the microsecond range.
I can't do much about that right now, as the ASIC design is pretty much frozen at the moment. The one thing I'm not sure about is whether I could push the clock gating conversion to a better result with different timing constraints and a smart clock gating cell replacement. It doesn't convert anything right now.
Concerning elapsed times:
Info: Fitter placement operations ending: elapsed time is 00:21:32
Info: Fitter routing operations ending: elapsed time is 02:06:53
So I'm at the far other end then: routing takes roughly 600% of the placement time :-/
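For what it's worth, the figure is just the ratio of the two elapsed times quoted above. A quick Python sketch to pull it out of the log, assuming the fitter messages keep exactly this wording (the helper name `elapsed_seconds` is my own, not anything from Quartus):

```python
import re

# Log lines copied verbatim from the fitter messages quoted above.
LOG = """\
Info: Fitter placement operations ending: elapsed time is 00:21:32
Info: Fitter routing operations ending: elapsed time is 02:06:53
"""

def elapsed_seconds(log, stage):
    """Return the elapsed time in seconds for a fitter stage (hypothetical helper)."""
    pattern = rf"Fitter {stage} operations ending: elapsed time is (\d+):(\d+):(\d+)"
    match = re.search(pattern, log)
    if match is None:
        raise ValueError(f"no elapsed time found for stage {stage!r}")
    hours, minutes, seconds = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

placement = elapsed_seconds(LOG, "placement")  # 00:21:32
routing = elapsed_seconds(LOG, "routing")      # 02:06:53
ratio_percent = 100 * routing / placement
print(f"routing/placement = {ratio_percent:.0f}%")  # prints "routing/placement = 589%"
```

So the exact number is a bit under 600%, close enough for comparing against other designs.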
Concerning utilisation:
Info: Average interconnect usage is 15% of the available device resources
Info: Peak interconnect usage is 55% of the available device resources
But I have seen much higher values here; this one is from a fairly nice fit. Logic utilization is 33%.
I have Optimize Fast Corner Timing checked in 8.1, but I recently unchecked Optimize Multi-Corner Timing in 9.0, hoping to reduce the compile time somewhat.