We have a near-identical scenario here: an old, well-proven design (shipping for approximately 6 years) that has recently started failing. Our trouble only occurs at low temperatures, on about 1 in 40-50 units. It is on an interface between an Altera Arria II GX device and a X!Lynx device (forgive the swearing). Our manufacturer took to swapping the latter, cheaper part in a bid to fix the problem. More often than not this works.
However, the problem became serious enough for it to be referred back to engineering. We found that the interface between the two devices had been poorly constrained. It had clearly been considered, since the interface in question was constrained. However, the values used had been poorly thought through. This meant that edge-case devices, whose timing wasn't tight enough at low temperatures, started causing the problems, even though the FPGA tools had deemed the fit acceptable against the constraints used.
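To illustrate how a design can pass the tools and still fail on the bench, here's a rough sketch in Python. All the numbers are made up for illustration (they are not from our design), and it assumes a simple system-synchronous link with a common clock and no skew. The function names are hypothetical too. The point is only that the tools check slack against whatever external timing you told them, not against what the real part needs when cold.

```python
def setup_slack_ns(period, tco_max, board_max, tsu):
    """Setup slack for a common-clock link with zero clock skew (assumed)."""
    return period - (tco_max + board_max + tsu)


def hold_slack_ns(tco_min, board_min, th):
    """Hold slack for the same link: earliest data arrival minus hold time."""
    return (tco_min + board_min) - th


# Hypothetical values, in nanoseconds.
period = 10.0

# What the constraints told the tools about the receiving device.
slack_as_constrained = setup_slack_ns(period, tco_max=4.0, board_max=1.0, tsu=3.0)

# What an edge-case (but in-spec) receiver might really need at low temperature.
slack_cold_edge_part = setup_slack_ns(period, tco_max=4.0, board_max=1.0, tsu=5.5)

print(f"slack as constrained: {slack_as_constrained:+.1f} ns")
print(f"slack, cold edge part: {slack_cold_edge_part:+.1f} ns")
```

With these made-up figures the tools would report positive slack and sign off the fit, while the cold edge-case part ends up with negative slack. The same check applies on the hold side using hold_slack_ns.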
With 6 or so years' more experience since then, the RTL blocks at each end of the link have moved on greatly - they're still used in newer designs. So, we've ended up releasing new FPGA images, incorporating the newer RTL, for both designs. The constraints have also been revisited.
So, it could be a long-standing issue you've had waiting to happen: you may recently have received a batch of edge-case devices that are within spec but too near the edge for your design. We did send one of our 'faulty' devices back to the manufacturer. They retested it - it passed.
I hope you get somewhere with it.
Cheers,
Alex