After fixing a few clock-domain-crossing issues and removing all the 'red' timing paths, the rebuilt FPGA image worked on the boards that had been failing.
However, we've recently had some boards fail the production tests (which now include a test that is likely to show up this issue). Programmed with the older FPGA image, these boards work.
The whole thing is very strange. I've added a memory test in the idle loop of both CPUs (using a 32-bit value whose low 16 bits are the bit-reversed 1's complement of the high 16 bits, which increment). While this test gives occasional errors, the ring index fails 3-4 times as often, even though a read/write collision is much more likely for the memory test.
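For clarity, a minimal sketch of how that test word could be built and checked; the function names and the write-then-read-back structure are my own illustration, not the actual firmware:

```c
#include <stdint.h>

/* Reverse the bit order of a 16-bit value (bit 0 <-> bit 15, etc.). */
static uint16_t bit_reverse16(uint16_t v)
{
    uint16_t r = 0;
    for (int i = 0; i < 16; i++) {
        r = (uint16_t)((r << 1) | (v & 1));
        v >>= 1;
    }
    return r;
}

/* High 16 bits: an incrementing counter.
   Low 16 bits: the bit-reversed 1's complement of the high bits. */
static uint32_t make_test_word(uint16_t counter)
{
    return ((uint32_t)counter << 16) | bit_reverse16((uint16_t)~counter);
}

/* One idle-loop iteration: write the pattern to a shared-memory word
   and read it back; a mismatch indicates the corruption seen here. */
static int check_word(volatile uint32_t *p, uint16_t counter)
{
    uint32_t expect = make_test_word(counter);
    *p = expect;
    return *p == expect;
}
```

The self-checking pattern means either half of the word can be validated against the other without storing expected values elsewhere.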
The error has to be associated with the memory write; otherwise the following read would return the 'old' data - and that just doesn't happen.
I've tried many things to increase the error rate, but nothing seems to make a significant difference. The only other Avalon slave (on that memory block) is the PCIe interface, but it won't be accessing that memory at all. Loading the PCIe slave makes no difference at all.
The whole Nios block (and quite a lot of other stuff) is running off the same 100MHz clock; there will be a bus-width adapter and clock-crossing bridge between the PCIe block and any Avalon slaves.
We do have to put all the Avalon bus signals from the PCIe block through a lump of VHDL in order to get the correct address lines (otherwise the BAR becomes massive, not just 32MB), but that doesn't contain any logic - it just renames signals.
About the only thing I haven't tried is accesses through the other 3 BARs. They go into different logic that is in the same SOPC build, but has no shared Avalon master/slave parts.
IIRC Fmax is a lot higher than 100MHz.