First, in my experience, simulation is a good tool for catching logic issues, but a very poor way to debug timing issues, especially if the problem is between two different clock domains. That said, I have run simulation and did not see any problem. As you know, simulating a high-speed serial link for long enough for the problem to show up is not very realistic. Sometimes simulation takes longer than a compile before it shows meaningful results. It is actually easier to put in SignalTap to debug the problem.
Your suggestion to simplify the design does not work either. As I said, every compilation of the same design gives different results, and in most cases the problem goes away, so a simplified design won't reproduce it. The problem also only shows up in our full system, which comprises multiple boards, FPGAs, and software, and only when running certain special cases. A simplified design won't trigger the issue.
In the compilations that did reproduce the problem, there are no timing violations and no unconstrained clocks, and everything looks normal.
The biggest obstacle to debugging is that the source code is encrypted. There is no way for me to debug it. I can put some signals on SignalTap, but I don't know the logic around those signals. And again, what makes it harder is that most of the time, after I change the SignalTap setup, the problem goes away.
Thanks,