What you said seems to work. Here is what I did, though I don't know if it is right:
- Created virtual_cpu_clk, a 200 MHz virtual clock with zero phase shift relative to the FPGA's 200 MHz clock used for sampling.
- set_output_delay -clock virtual_cpu_clk -max 12.6 [get_ports {cpu_data[*]}]
- set_output_delay -clock virtual_cpu_clk -min -7.2 [get_ports {cpu_data[*]}]
- Set a multicycle path of 4.
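Put together, the steps above might look roughly like the SDC/XDC fragment below. It is a sketch under assumptions: it assumes the sampling clock has the default 5 ns waveform {0 2.5}, so a virtual clock with the same waveform has zero phase shift, and the hold multicycle of 3 is an addition not stated above. In SDC a setup multicycle of N is usually paired with a hold multicycle of N-1 so the hold check moves back to the launch edge instead of following the setup edge; whether that suits this interface depends on how long the data is actually held.

```tcl
# 200 MHz virtual clock (5 ns period). No source object is given, so this
# creates a virtual clock; the waveform matches a default 5 ns clock,
# i.e. zero phase shift relative to the sampling clock (assumption).
create_clock -name virtual_cpu_clk -period 5.000 -waveform {0.000 2.500}

# External timing of the CPU interface, referenced to the virtual clock.
# Braces keep Tcl from treating [*] as command substitution.
set_output_delay -clock virtual_cpu_clk -max 12.6 [get_ports {cpu_data[*]}]
set_output_delay -clock virtual_cpu_clk -min -7.2 [get_ports {cpu_data[*]}]

# Data is launched 4 cycles before it is needed: move the setup check
# out to the 4th capture edge (20 ns after launch).
set_multicycle_path 4 -setup -to [get_ports {cpu_data[*]}]

# Assumed companion: pull the hold check back by N-1 = 3 cycles so it sits
# at the launch edge rather than one cycle before the new setup edge.
set_multicycle_path 3 -hold -to [get_ports {cpu_data[*]}]
```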
This seems to relax the propagation-delay requirement within the FPGA, and things now work. I do output the data 4 cycles before it is needed. I did not add any board delays, since they are not significant here. I still wonder whether I should pad the numbers by 5 ns, so that 12.6 becomes 17.6, to cover the worst case.
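A rough setup budget for the padding question, assuming T = 5 ns, the multicycle of 4 counted from the launch edge, and no clock uncertainty or board delay. Here t_co is a symbol introduced for illustration: the FPGA's total delay from the clock edge to the cpu_data pins (clock insertion plus output path).

```latex
t_{co} \le N\,T - t_{od,\max} = 4 \times 5\,\text{ns} - 12.6\,\text{ns} = 7.4\,\text{ns}
\qquad\text{(as constrained)}

t_{co} \le 4 \times 5\,\text{ns} - 17.6\,\text{ns} = 2.4\,\text{ns}
\qquad\text{(padded by 5 ns)}
```

In other words, padding the max output delay by 5 ns takes the same 5 ns out of the FPGA-side budget; whether 2.4 ns is achievable depends on the device's clock-to-out.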
I also searched for anything that explains how this should be constrained, to no avail.
Thanks for the help so far; it is helping.