Forum Discussion

Arun_Prabakatr
New Contributor
2 months ago

Validating ECC Functionality on Custom Agilex 5 SOM in Linux Kernel

We are now looking to validate ECC (Error Correction Code) functionality on our custom Agilex 5 System-on-Module (SOM) running Linux. Our objective is to ensure that ECC is correctly enabled and functioning across all relevant memory regions, and that error detection and correction mechanisms are properly integrated at the kernel level. Could you please provide guidance on the necessary kernel configurations, device tree modifications, and available tools or procedures to test and monitor ECC behavior on this platform? Any documentation or reference designs specific to Agilex 5 ECC support would be highly valuable.

7 Replies

  • KianHinT_altera
    Frequent Contributor

    Hi Arun, 

    May I know whether you still have any questions related to this issue? Otherwise, we would like to close it and transition to community support.
     

    Thanks

    Regards

    Kian

  • Hi Arun, 

    You can refer to the steps below to enable ECC support in Linux for the Agilex 5 memory controller.

    Agilex 5 uses the IO96B memory controller. Please ensure the following kernel configuration options are enabled:

    CONFIG_EDAC_DEBUG=y
    CONFIG_EDAC_ALTERA=y
    CONFIG_EDAC_ALTERA_IO96B=y
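
    A quick way to confirm these three options on a given kernel is to grep its config. The sketch below assumes a plain-text config file is available, for example extracted from /proc/config.gz (which only exists when CONFIG_IKCONFIG_PROC is enabled) or taken from the build tree's .config. The function name check_edac_config is illustrative and not part of any Altera tooling.

```shell
#!/bin/sh
# Report whether each EDAC option named in this post is built in (=y)
# in a plain-text kernel config file.
# $1: path to the config file (hypothetical helper, not an Altera tool)
check_edac_config() {
    config_file="$1"
    missing=0
    for opt in CONFIG_EDAC_DEBUG CONFIG_EDAC_ALTERA CONFIG_EDAC_ALTERA_IO96B; do
        # Anchor both ends so CONFIG_EDAC_ALTERA does not match
        # CONFIG_EDAC_ALTERA_IO96B.
        if grep -q "^${opt}=y\$" "$config_file"; then
            echo "ok      ${opt}"
        else
            echo "MISSING ${opt}"
            missing=1
        fi
    done
    return "$missing"
}

# Example on a target (assumes /proc/config.gz exists):
#   zcat /proc/config.gz > /tmp/kconfig && check_edac_config /tmp/kconfig
```

    The function returns non-zero if any option is absent, so it can gate a later test step.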

    ECC Error Injection:
    From the Linux prompt, use the following command to inject an ECC error:
    echo C > /sys/kernel/debug/edac/io96b0-ecc/altr_trigger
    Note:
    C → Inject Correctable Error
    U → Inject Uncorrectable Error
    This command injects a single-bit error syndrome into the memory controller, which triggers an interrupt to the CPU. The Linux driver then reports the error, and you should see logs like the following:

    [  531.047821] EDAC Altera: io96b0-ecc: SBE: word0:0x00409C00, word1:0x00014F00
    [  531.054873] EDAC DEVICE2: CE: Altera ECC Manager instance: io96b0-ecc0 block: io96b0-ecc0 count: 1 'io96b0-ecc'

    Field descriptions:
    word1 – Lower 32 bits of the ECC error address
    word0 – ECC error information
    Please refer to Table 253 in the documentation for details on the ECC Error Buffer Structure: 
    https://www.intel.com/content/www/us/en/docs/programmable/817467/25-1-1/ecc-error-handling.html
    This user guide provides detailed information about the Agilex 5 EMIF IP mailbox interface. 
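
    To pull the two words out of an SBE line like the sample above, a small filter is enough. This sketch assumes the log format matches that sample exactly; parse_io96b_sbe is a hypothetical helper name. It deliberately does not decode the bit fields of word0, which are defined in the ECC Error Buffer Structure table referenced above.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print the two words from each
# io96b SBE report. Per the post: word0 is the ECC error information,
# word1 is the lower 32 bits of the ECC error address.
# Lines that do not match the assumed format are silently skipped.
parse_io96b_sbe() {
    sed -n 's/.*io96b[0-9]*-ecc: SBE: word0:\(0x[0-9A-Fa-f]*\), word1:\(0x[0-9A-Fa-f]*\).*/info=\1 addr_lo=\2/p'
}

# Example on a target:
#   dmesg | parse_io96b_sbe
```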

    "We are now looking to validate ECC (Error Correction Code) functionality on our custom Agilex 5 System-on-Module (SOM) running Linux. Our objective is to ensure that ECC is correctly enabled and functioning across all relevant memory regions, and that error detection and correction mechanisms are properly integrated at the kernel level. "

    I am not entirely sure about your specific test plan, but validating ECC functionality across all relevant memory regions from within a running Linux kernel is not an appropriate approach.

    The Linux kernel EDAC (Error Detection and Correction) framework provides a mechanism to validate the error injection and error reporting flow for the IO96B memory controller through the mailbox interface. However, it is not a comprehensive debugging tool for validating different memory regions.

    The reason is that when the Linux kernel is running and actively using DDR memory, it is not visible to the user which regions are currently in use and which are free. Attempting to modify memory content that is in use by the kernel could result in a kernel crash.

    The appropriate use of the Linux kernel EDAC driver is to confirm that the ECC error reporting path works: inject an error through the driver and verify that the kernel reports the corresponding error.
    This confirms that if any ECC error occurs on the memory controller, the CPU will receive an interrupt and the Linux driver will report the error appropriately.
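
    The inject-then-verify flow described here can be scripted. This is a minimal sketch, assuming the debugfs trigger path and the "EDAC DEVICE<n>: CE:" log format shown earlier in this post; inject_ecc_error and count_ce_reports are hypothetical helper names. Only the correctable case is checked, since how the system reacts to an uncorrectable injection is not covered in this thread.

```shell
#!/bin/sh
# Write an injection request to a debugfs trigger file.
# $1: trigger file, $2: C (correctable) or U (uncorrectable)
inject_ecc_error() {
    trigger="$1"
    kind="$2"
    case "$kind" in
        C|U) ;;
        *) echo "usage: inject_ecc_error <trigger> C|U" >&2; return 2 ;;
    esac
    [ -w "$trigger" ] || { echo "cannot write $trigger" >&2; return 1; }
    echo "$kind" > "$trigger"
}

# Count correctable-error reports in a kernel log read from stdin.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_ce_reports() {
    grep -c 'EDAC DEVICE[0-9]*: CE:' || true
}

# Example on a target (run as root, debugfs mounted):
#   before=$(dmesg | count_ce_reports)
#   inject_ecc_error /sys/kernel/debug/edac/io96b0-ecc/altr_trigger C
#   sleep 1
#   after=$(dmesg | count_ce_reports)
#   if [ "$after" -gt "$before" ]; then echo "CE reported"; else echo "no CE report seen"; fi
```

    Comparing the count before and after the injection avoids a false pass from an older report that is still in the log buffer.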

    Hope this helps.

    • Arun_Prabakatr
      New Contributor

      Hi Nirav,
      Thanks for your support. We’re using kernel version 6.12.11 with QPD 25.1, and we were able to find the driver CONFIG_EDAC_ALTERA_IO96B. Could you please let us know the related driver we should use to test and validate?

      • KianHinT_altera
        Frequent Contributor

        Hi Arun,

        As there are no further enquiries related to this issue, we will step back and allow the community to assist with any future follow-up questions.

        Thank you for engaging with us!

        Best regards,
        Altera Technical Support

  • Hi Arun,

    You can refer to the Agilex 5 HPS Technical Reference Manual: https://www.intel.com/content/www/us/en/docs/programmable/814346/25-3/error-checking-and-correction-controller-93197.html

    The EDAC driver documentation also includes a link to the driver source code: https://altera-fpga.github.io/rel-25.3/linux-embedded/drivers/edac/edac/

    Check that your kernel includes the memory controller driver and that any ECC config options are enabled. Then ensure your device tree describes ECC-enabled memory regions and controllers. For information on device tree modifications: https://github.com/altera-fpga/agilex5e-ed-gsrd/tree/main/a5ed065es-premium-devkit-oobe/baseline/software/yocto_linux/meta-custom/recipes-bsp/device-tree
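
    Once the kernel options and device tree are in place, one way to check that an EDAC driver actually registered is to look for its error counters in sysfs. This is a sketch, assuming the standard EDAC sysfs root /sys/devices/system/edac and the usual ce_count and ue_count file names. The exact subdirectory layout for the IO96B driver is not given in this thread, so the hypothetical helper edac_counters simply walks the tree.

```shell
#!/bin/sh
# Print every ce_count / ue_count file under an EDAC sysfs root,
# one "path value" pair per line, sorted by path.
# $1: root directory (defaults to the standard EDAC sysfs location)
edac_counters() {
    root="${1:-/sys/devices/system/edac}"
    find "$root" \( -name ce_count -o -name ue_count \) -type f 2>/dev/null | sort |
    while read -r f; do
        printf '%s %s\n' "$f" "$(cat "$f")"
    done
}

# Example on a target:
#   edac_counters
# An empty result suggests that no EDAC device registered counters.
```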

    • Arun_Prabakatr
      New Contributor

      Hi AnnaK_Altera

      Thanks for the references. We've enabled the EDAC and memory controller drivers in our kernel, and our device tree includes ECC-enabled memory regions.

      Since this is our first time testing ECC on Agilex 5 boards, could you please guide us on how to check if it's working properly? Any steps, tools, or test methods you recommend would be really helpful.