IMHO, your constraints are fine. Setting a clock group constraint as shown in the cookbook is all you need to do, and it's all you can do.
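For anyone reading along, a clock group constraint of that kind looks roughly like the sketch below. The clock and port names (clk_a, clk_b) and the 10 ns periods are made up for illustration, so substitute whatever your design actually uses.

```tcl
# Hypothetical clock names and periods, for illustration only.
create_clock -name clk_a -period 10.000 [get_ports clk_a]
create_clock -name clk_b -period 10.000 [get_ports clk_b]

# Only one of the two clocks drives the downstream logic at any time,
# so tell the timing analyzer not to analyze paths between them.
set_clock_groups -logically_exclusive -group {clk_a} -group {clk_b}
```

This only stops the analyzer from timing paths between the two clocks. It does nothing about the delay through the mux itself.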
IMHO, your problem lies with your design: you're using generic logic to implement the clock mux. In an FPGA, using regular logic for anything clock-related (clock muxes, clock gating, ripple clocks, etc.) is generally a bad idea.
The generic interconnect and logic have rather large propagation delays, which is why you're seeing a large skew between the input and output clocks.
You need to remove your clock mux and use an ALTCLKCTRL primitive to mux your clocks.