Up until now, I hadn't looked at the set_output_delay constraints at all. It sounds like you're writing to the device source-synchronously. Anyway, yes, registering your outputs should give a faster, more consistent Tco. You should have a setup relationship of 5ns and a hold relationship of -5ns. Looking at your constraints, I think th_fx3 should be subtracted (i.e. enter with a negative sign) when calculating the min value:
set out_min_delay [expr {$tbd_data_min - $th_fx3 - $tbd_clk_max}]
(By making the external delay negative, the FPGA delay must be larger to counteract it. For example, if the hold relationship is -5ns and the external delay is -2ns, then the FPGA delay can't be shorter than -3ns or it will fail timing. If you have the hold value as positive, then with a -5ns hold relationship and a +2ns external delay, the FPGA delay only has to stay above -7ns, which is a much looser requirement than the one you actually need.)
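To make the arithmetic concrete, here is a small Tcl sketch of the hold check described above. It assumes the simplified relation "FPGA delay + external min delay >= hold relationship" and uses the illustrative numbers from this post, not values from your design; the variable names are made up for the example.

```tcl
# Simplified hold check at the output (illustrative only):
#   fpga_delay + external_min_delay >= hold_relationship
# so the smallest FPGA delay that still passes is:
#   hold_relationship - external_min_delay
set hold_relationship -5.0   ;# ns, from the 5ns / -5ns setup/hold relationship above

# Try the external min delay with both signs (-2ns and +2ns).
foreach external_min_delay {-2.0 2.0} {
    set min_fpga_delay [expr {$hold_relationship - $external_min_delay}]
    puts "external min delay $external_min_delay ns: FPGA delay must be >= $min_fpga_delay ns"
}
```

With the -2ns external delay the bound comes out at -3ns, and with +2ns it drops to -7ns, which is why getting the sign of th_fx3 wrong makes the hold check too easy to pass.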
As for the roundtrip, it stinks that it's not meeting timing. Can you try removing the -invert and see if that works?
I would avoid making multiple clock phases until you've tried everything else. That said, you probably have only a few paths going between these different domains, and Quartus is pretty good at meeting these requirements internally. You might have to add a multicycle though. For example, say you have PLL_int_clk0 as a regular 10ns clock, and PLL_int_clk1 as a 10ns clock with a 1ns shift. Any transfer from clk0 to clk1 will have a default 1ns setup relationship, which might be impossible to meet (a straight reg-to-reg transfer might make it), in which case you might have to add a multicycle setup of 2 between those clocks. It should work, it's just another thing to deal with.
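A rough sketch of that multicycle, assuming the PLL outputs really end up being called PLL_int_clk0 and PLL_int_clk1 in your design (those are just the example names from this post; check the real names with report_clocks):

```tcl
# Assumption: PLL_int_clk0 / PLL_int_clk1 are placeholder clock names;
# substitute the actual PLL output clock names from your design.
# Move the setup check for clk0 -> clk1 transfers from the 1ns edge
# to the following clk1 edge (11ns).
set_multicycle_path -setup -from [get_clocks {PLL_int_clk0}] -to [get_clocks {PLL_int_clk1}] 2
```

The hold check moves along with the setup check, so it's worth looking at both the setup and hold relationships for these transfers in the timing report after adding it.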