Right off the bat, I notice that you have over 0.5 ns of clock skew, which is on the high side (I usually see 100 ps or lower), and there is over 2 ns of delay on the launch clock path. Can you use a regional (quadrant) clock instead of one of the global clock resources? Regional clock resources have less skew and a smaller insertion delay.
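To put a number on why the skew matters, here is a rough sketch of the single-cycle setup slack arithmetic. The function name and all delay values are made up for illustration (they are not from your report), and it assumes a simple register-to-register path with no clock uncertainty or multicycle exceptions.

```python
# All times are integer picoseconds. Values are illustrative assumptions,
# not numbers taken from the poster's timing report.

def setup_slack(period, launch_clk, latch_clk, t_co, t_data, t_su):
    """Single-cycle setup slack for a register-to-register path."""
    arrival = launch_clk + t_co + t_data   # launch edge at time 0
    required = period + latch_clk - t_su   # latch edge one period later
    return required - arrival

# Latch clock arrives 500 ps earlier than the launch clock (skew = -500 ps).
high_skew = setup_slack(5000, 2100, 1600, 200, 4000, 100)

# Same path with the skew reduced to -100 ps.
low_skew = setup_slack(5000, 2100, 2000, 200, 4000, 100)

print(high_skew, low_skew)
```

In this sketch every picosecond of unfavorable skew you remove comes straight back as setup slack, which is why a lower-skew clock resource is worth trying first.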
What do you have as your optimization mode in the Compiler Settings? Make sure it is set to one of the performance modes.
The path is indeed using a lot of extra routing. Are you meeting hold timing in the fast timing model(s)? One thing you could try is turning off the Optimize Hold Timing option in the Advanced Fitter Settings. The Fitter may be trying to meet hold timing by adding extra routing delay to the path at the expense of setup timing, which would explain the extra routing.
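A quick sketch of the hold side of that trade-off, again with a hypothetical function and invented fast-corner numbers. It assumes a same-edge hold check and ignores clock uncertainty.

```python
# All times are integer picoseconds. Values are illustrative assumptions
# for a fast timing model, not numbers from the poster's design.

def hold_slack(launch_clk, latch_clk, t_co, t_data, t_h):
    """Same-edge hold slack for a register-to-register path."""
    arrival = launch_clk + t_co + t_data   # earliest data change at the latch register
    required = latch_clk + t_h             # data must stay stable until this time
    return arrival - required

# Latch clock arrives 500 ps later than the launch clock, short data path.
before = hold_slack(1000, 1500, 100, 300, 50)

# The same path after 200 ps of extra routing delay is added.
after = hold_slack(1000, 1500, 100, 500, 50)

print(before, after)
```

The added delay fixes the hold violation in this example, but the same detour (larger still in the slow model) is subtracted from the setup slack, so a path that was tight on setup can start failing.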
Can you post the code for the combinational logic between these registers?