Okay, I see what you are doing now. Your launch clock path has 4+ ns of routing delay from the input buffer to the output buffer, plus a 5+ ns delay through the output buffer itself. The result is a 6+ ns clock skew between the launch and latch clocks, and that skew is what is hurting you. The large delay through the output buffer could have several causes, including the I/O standard you have chosen, any capacitive loading you specified on the I/O, or programmable output delays. Regardless, the right way to solve your problem is to run the clock through a PLL. That gives you the flexibility to control the clock skew.
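
To make the arithmetic concrete, here is a rough sketch in Python of how the skew adds up and how large a PLL phase shift would cancel it. The numbers are placeholders, not values from your report: the 3 ns latch clock delay and the 10 ns clock period are assumptions I made up for illustration, and the function names are hypothetical. The sign of the shift also depends on which clock the PLL ends up driving, so treat this as a starting point and check the result in your timing analyzer.

```python
def clock_skew_ns(launch_delay_ns: float, latch_delay_ns: float) -> float:
    """Skew seen by the timing path: launch clock delay minus latch clock delay."""
    return launch_delay_ns - latch_delay_ns


def pll_phase_shift_deg(skew_ns: float, period_ns: float) -> float:
    """Phase shift (0 to 360 degrees) that moves a clock edge by -skew_ns.

    The shift is taken modulo one period, because moving a periodic clock
    by a whole period leaves its edges where they were.
    """
    if period_ns <= 0:
        raise ValueError("period_ns must be positive")
    return (-skew_ns / period_ns * 360.0) % 360.0


if __name__ == "__main__":
    routing_ns = 4.0        # input buffer to output buffer (the "4+ ns" above)
    output_buffer_ns = 5.0  # through the output buffer (the "5+ ns" above)
    latch_delay_ns = 3.0    # ASSUMPTION: hypothetical latch clock delay
    period_ns = 10.0        # ASSUMPTION: hypothetical 100 MHz clock

    skew = clock_skew_ns(routing_ns + output_buffer_ns, latch_delay_ns)
    shift = pll_phase_shift_deg(skew, period_ns)
    print(f"skew = {skew:.2f} ns, compensating phase shift = {shift:.1f} deg")
```

With your real delays from the timing report and your real clock period plugged in, the second function gives a first guess for the phase shift to enter in the PLL settings.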