The PCIe slave access times are dominated by the PCIe 'cycle' time; the per-byte times are fast enough for what we are doing. We are feeding the Avalon clock into the hard PCIe block - not sure what type of clock-crossing bridge that adds (that feature is removed in Quartus 12). 'Interestingly', the PCIe -> Avalon bridge shows 32-bit data until you try to add a pipeline bridge, when it suddenly becomes 64-bit!
The memory block we are seeing problems with is used to share data between the two processors. It could be TCM to both - but then we wouldn't be able to dump it out for debug.
The locking is fine: the only (byte) location that can be written by both sides is covered by a lock (Dekker's algorithm). The lock is also used for the one place where two data items are updated together.
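For anyone not familiar with it, here is a minimal sketch of Dekker's algorithm for two parties, written with C11 atomics. This is not our actual code: the names (dekker_lock_t, dekker_acquire, dekker_release) are made up for illustration, and on real hardware the lock words would have to sit in the shared memory block and be accessed uncached, with whatever ordering guarantees the bus gives you. The default sequentially consistent atomics stand in for that here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock layout (illustration only). Each wants[] flag is
 * written by one side only; turn is written by whichever side releases. */
typedef struct {
    atomic_bool wants[2];  /* wants[i]: side i wants the critical section */
    atomic_int  turn;      /* which side has priority when both want in */
} dekker_lock_t;

static void dekker_init(dekker_lock_t *l)
{
    atomic_init(&l->wants[0], false);
    atomic_init(&l->wants[1], false);
    atomic_init(&l->turn, 0);
}

/* me is 0 or 1, identifying the calling side. */
static void dekker_acquire(dekker_lock_t *l, int me)
{
    int other = 1 - me;

    atomic_store(&l->wants[me], true);
    while (atomic_load(&l->wants[other])) {
        if (atomic_load(&l->turn) != me) {
            /* Not our turn: withdraw the request and wait for priority. */
            atomic_store(&l->wants[me], false);
            while (atomic_load(&l->turn) != me) {
                /* spin */
            }
            atomic_store(&l->wants[me], true);
        }
    }
}

static void dekker_release(dekker_lock_t *l, int me)
{
    atomic_store(&l->turn, 1 - me);      /* hand priority to the other side */
    atomic_store(&l->wants[me], false);  /* leave the critical section */
}
```

Each side wraps the shared update in dekker_acquire(&lock, me) / dekker_release(&lock, me) with its own fixed value of me. The algorithm only works if stores and loads are seen in program order by the other side, which is exactly the kind of thing a bridge or cache can break.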
I don't do these FPGA builds (I just write the software, though I have done some builds for one of the dev cards), so I can't confirm we meet all the timing corner cases. However, we did fix a few timing issues (there are no nasty red errors now), and that just seems to have changed which boards fail.