Forum Discussion
Actually, another Knowledge Article that was sent to me ("Local ready signal issues with Altera external memory controller IP") seems to fit our situation somewhat more closely, although again we have few details to go on. To answer the earlier questions as well as I can: yes, as far as I can tell we are using burst mode, and yes, the DDR4 devices appear to have passed calibration (local_cal_success remains high and local_cal_fail remains low). The article says that one common potential cause of the "local_ready" signal (in our Quartus 16.0 case, "amm_ready" / "waitrequest_n") going low and staying low forever ("effectively locking up the controller and preventing any further accesses...at this point nothing can start the controller again other than a reset") is, for example, beginning a write burst but not supplying all of its beats before issuing another command.
Unfortunately, the article's example seems to assume that the user is bringing up and debugging a system that has never worked, presumably because it was designed incorrectly from the start. It even notes that "determining if a write burst is incomplete at one particular point in a design can be difficult", apparently even when the faulty behavior is always present, as it presumably is in the example. The article suggests using either simulation or Signal Tap to monitor various signals, in particular "enough_data_to_write" and "proper_beats_in_fifo", and then using the results to identify and fix the precise location of the error in the design. Neither of those signals seems to appear anywhere in our project, perhaps because the article was based on a much earlier version of Quartus than the one we are using.
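For anyone following along, here is a rough sketch of the condition the article describes (a write burst that does not deliver all its beats before another command is issued), written as an offline check over a cycle-by-cycle trace. This is only an illustration under my own assumptions: the `Sample` record, its field names, and the function name are hypothetical, not anything from the Altera IP or the article, and it assumes a command or beat counts as accepted only in a cycle where ready is high.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One clock cycle of master-side signals (hypothetical field names)."""
    write: bool = False
    read: bool = False
    ready: bool = True      # amm_ready / waitrequest_n
    burstcount: int = 1     # only sampled on the first beat of a burst


def find_incomplete_write_bursts(trace: List[Sample]) -> List[str]:
    """Report reads accepted mid-write-burst and bursts left unfinished."""
    violations = []
    beats_left = 0  # write beats still owed for the burst in progress
    for cycle, s in enumerate(trace):
        # Assumption: nothing is transferred unless ready is high.
        if not (s.ready and (s.write or s.read)):
            continue
        if s.read and beats_left > 0:
            violations.append(
                f"cycle {cycle}: read accepted with "
                f"{beats_left} write beat(s) outstanding")
        elif s.write:
            if beats_left == 0:
                beats_left = s.burstcount - 1   # first beat of a new burst
            else:
                beats_left -= 1                 # later beat of current burst
    if beats_left > 0:
        violations.append(
            f"end of trace: write burst still owes {beats_left} beat(s)")
    return violations
```

The same bookkeeping (a per-port "beats still owed" counter) could in principle be added to the design itself as a sticky error flag, which might be more practical than a trace when the failure cannot be reproduced on the bench.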
In our situation, the design has been working quite well in the field (and in all our internal testing) for some time and across a wide range of applications, but one particular customer is now reporting very occasional lockups, apparently of this kind. Our application is a video display monitor that accepts several independent video sources concurrently. Each video source has its own write port and read port, and all of them access the block of DDR4 through an arbiter circuit, which I believe may originally have come from an Altera reference design(?). The lockup may occur at random times after hours, days, or weeks of normal operation. It appears to be associated with occasional disruptions in the video signal(s) from a particular (and VERY expensive) IR camera, to which we have no direct access, and which is connected to one or two of our video inputs.
Try as we might, we have not been able to replicate these failures at our facility, so neither simulation nor Signal Tap seems viable here, since basically all we would be able to see is normal operation. Instead, we have provided the customer with customized FPGA content, firmware, and software that sample as many potentially useful signals as we could find and log them to a file, which the customer sends us in the event of a failure. So far we have received just two such sets of event logs, obtained over the last couple of months or so. In both, we see amm_ready going low and staying low at around the time the screen goes black (corresponding to the DDR4 system "locking up" as described), and both events occurred roughly when the video from the IR camera had apparently been disrupted in some fashion. However, nothing else in those logs indicates how or why, for example, a burst operation might have been disturbed, or what else might be causing the problem.
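To make the log analysis a bit more concrete, here is a minimal sketch of how the point where amm_ready goes low and never recovers could be located in a sampled log, so that the commands just before it can be examined. It assumes the log can be reduced to a list of per-sample ready values, and that short low periods are normal so only a run longer than some threshold counts as stuck; the function name, the list format, and the threshold are all my own assumptions, not our actual log format.

```python
from typing import List, Optional


def find_stuck_ready(ready: List[bool], max_low_samples: int) -> Optional[int]:
    """Return the index where ready first went low in a run that lasts
    longer than max_low_samples, or None if no such run exists."""
    low_start = None  # index where the current low run began
    for i, r in enumerate(ready):
        if r:
            low_start = None            # ready recovered, forget this run
            continue
        if low_start is None:
            low_start = i               # a new low run begins here
        if i - low_start + 1 > max_low_samples:
            return low_start            # run exceeded the threshold
    return None
```

For example, `find_stuck_ready(log, 1000)` would return the sample index at which the final lockup began; the samples just before that index are the ones worth comparing against the burst bookkeeping on each write port.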
As you can see, this seems to be quite a difficult debug situation; any further ideas (or even just guiding questions) you might have would be greatly appreciated.