Issues with updating Arria10 PAC for AFU
Hello,
Platform info: Arria 10 GX PAC
Host System: Ubuntu 18.04 ( 4.15.0 kernel), Xeon Gold 6226R CPU dual-socket server
We have two Arria10 PAC cards on which we are trying to run the AFU Getting Started examples (UG 20166), but we need to update the cards to the latest 1.2.1 firmware. On one of the cards, the fpgaotsu update finished correctly, but super-rsu fails with the following error:
sudo super-rsu --log-level trace /usr/share/opae/a10-gx-pac/super-rsu/base/rsu-09c4.json
[2020-09-02 21:29:41,652] [DEBUG ] [MainThread ] - found fpga objects: ['/sys/class/fpga/intel-fpga-dev.0']
[2020-09-02 21:29:41,653] [DEBUG ] [MainThread ] - found device at 0000:89:00.0 -tree is
[pci_address(0000:85:00.0), pci_id(0x8086, 0x2030)]
[pci_address(0000:86:00.0), pci_id(0x10b5, 0x8747)]
[pci_address(0000:87:08.0), pci_id(0x10b5, 0x8747)]
[pci_address(0000:87:10.0), pci_id(0x10b5, 0x8747)]
[pci_address(0000:89:00.0), pci_id(0x8086, 0x09c4)]
[2020-09-02 21:29:41,654] [DEBUG ] [MainThread ] - could not find: "/sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/ifpga_sec_mgr/ifpga_sec*"
[2020-09-02 21:29:41,654] [WARNING ] [MainThread ] - [0000:89:00.0] does not support secure update
[2020-09-02 21:29:41,654] [ERROR ] [MainThread ] - missing one or more items required by rsu config
[2020-09-02 21:29:41,654] [INFO ] [MainThread ] - super-rsu exiting with code '78'

############ FME info ################
fpgainfo fme
Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** FME ******//
Object Id : 0xEE00000
PCIe s:b:d:f : 0000:89:00:0
Device Id : 0x09C4
Socket Id : 0x00
Ports Num : 01
Bitstream Id : 0x123000200000185
Bitstream Version : 1.2.3
Pr Interface Id : 69528db6-eb31-577a-8c36-68f9faa081f6
Boot Page : user
It seems possible that we are missing some driver here, but I'm not sure what the next thing to check might be. Does anyone have any suggestions?
I was finally able to update this using super-rsu after completely shutting off power to the server (cold reboot):
[]$ super-rsu --log-level trace /usr/share/opae/a10-gx-pac/super-rsu/base/rsu-09c4.json
[2020-12-28 16:33:37,086] [DEBUG ] [MainThread ] - found fpga objects: ['/sys/class/fpga/intel-fpga-dev.0']
[2020-12-28 16:33:37,088] [DEBUG ] [MainThread ] - found device at 0000:3d:00.0 -tree is
[pci_address(0000:3a:00.0), pci_id(0x8086, 0x2030)]
[pci_address(0000:3b:00.0), pci_id(0x10b5, 0x8747)]
[pci_address(0000:3c:08.0), pci_id(0x10b5, 0x8747)]
[pci_address(0000:3d:00.0), pci_id(0x8086, 0x09c4)]
[pci_address(0000:3c:10.0), pci_id(0x10b5, 0x8747)]
[pci_address(0000:3e:00.0), pci_id(0x198a, 0x385c)]
[2020-12-28 16:33:37,096] [WARNING ] [MainThread ] - Update starting. Please do not interrupt.
[2020-12-28 16:33:37,097] [DEBUG ] [MainThread ] - [3d:00.0] version (0x0124000200000367) up to date for sr
[2020-12-28 16:33:37,098] [DEBUG ] [MainThread ] - bmc_fw is being force flashed
[2020-12-28 16:33:37,098] [DEBUG ] [MainThread ] - bmc_fw versions not equal (system:0x0000000000026889 != manifest:0x0000000000026895)
[2020-12-28 16:33:37,098] [DEBUG ] [MainThread ] - bmc_fw versions not equal (system:0x0000000000026889 != manifest:0x0000000000026895)
[2020-12-28 16:33:37,099] [DEBUG ] [MainThread ] - [3d:00.0] update timeout set to: 1200.0
[2020-12-28 16:33:37,099] [DEBUG ] [3d:00.0 ] - update of board at [pci_address(0000:3d:00.0), pci_id(0x8086, 0x09c4)] started
[2020-12-28 16:33:37,099] [DEBUG ] [MainThread ] - max timeout set to: 0:20:00
[2020-12-28 16:33:37,100] [DEBUG ] [3d:00.0 ] - starting task: fpgasupdate /usr/share/opae/a10-gx-pac/super-rsu/base/a10sa4_bootloader-26895-fw_Release.bin 0000:3d:00.0
[2020-12-28 16:33:37,222] [WARNING ] Update starting. Please do not interrupt.
[2020-12-28 16:33:37,223] [INFO ] updating from file /usr/share/opae/a10-gx-pac/super-rsu/base/a10sa4_bootloader-26895-fw_Release.bin with size 38016
[2020-12-28 16:33:37,331] [INFO ] writing to staging area
[2020-12-28 16:34:36,173] [DEBUG ] [MainThread ] - waiting (0:19:00.927721) for threads: 3d:00.0
[2020-12-28 16:34:36,674] [DEBUG ] [MainThread ] - waiting (0:19:00.426487) for threads: 3d:00.0
(100%) [____________________] [38016/38016 bytes][Time:0:01:34.404933]
[2020-12-28 16:35:11,747] [INFO ] applying update to 0000:3d:00.0
(100%) [____________________][Time:0:00:08.010363]
[2020-12-28 16:35:19,757] [INFO ] update of 0000:3d:00.0 complete
[2020-12-28 16:35:19,758] [INFO ] Secure update OK
[2020-12-28 16:35:19,758] [INFO ] Total time: 0:01:42.536032
[2020-12-28 16:35:19,809] [DEBUG ] [3d:00.0 ] - task completed in 0:01:42.707920
[2020-12-28 16:35:19,809] [DEBUG ] [3d:00.0 ] - starting task: fpgasupdate /usr/share/opae/a10-gx-pac/super-rsu/base/a10sa4-26895-fw_Release.bin 0000:3d:00.0
[2020-12-28 16:35:19,932] [WARNING ] Update starting. Please do not interrupt.
[2020-12-28 16:35:19,934] [INFO ] updating from file /usr/share/opae/a10-gx-pac/super-rsu/base/a10sa4-26895-fw_Release.bin with size 244864
[2020-12-28 16:35:20,039] [INFO ] writing to staging area
(100%) [____________________] [244864/244864 bytes][Time:0:00:01.575939]
[2020-12-28 16:35:21,626] [INFO ] applying update to 0000:3d:00.0
[2020-12-28 16:35:36,247] [DEBUG ] [MainThread ] - waiting (0:18:00.853465) for threads: 3d:00.0
[2020-12-28 16:35:36,748] [DEBUG ] [MainThread ] - waiting (0:18:00.352268) for threads: 3d:00.0
(100%) [____________________][Time:0:00:43.055355]
[2020-12-28 16:36:04,681] [INFO ] update of 0000:3d:00.0 complete
[2020-12-28 16:36:04,682] [INFO ] Secure update OK
[2020-12-28 16:36:04,682] [INFO ] Total time: 0:00:44.749368
[2020-12-28 16:36:04,702] [DEBUG ] [3d:00.0 ] - task completed in 0:00:44.892688
[2020-12-28 16:36:05,283] [INFO ] [MainThread ] - 1 board updated. A power-cycle is required.
[2020-12-28 16:36:05,284] [INFO ] [MainThread ] - super_rsu.pyc completed in: 0:02:28.187105
[2020-12-28 16:36:05,284] [INFO ] [MainThread ] - super-rsu exiting with code '0'

#Check the fme with fpgainfo to make sure it is updated
[]$ fpgainfo fme
Board Management Controller, microcontroller FW version 26895
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** FME ******//
Object Id : 0xEB00000
PCIe s:b:d:f : 0000:3D:00:0
Device Id : 0x09C4
Socket Id : 0x00
Ports Num : 01
Bitstream Id : 0x124000200000367
Bitstream Version : 1.2.4
Pr Interface Id : 38d782e3-b612-5343-b934-2433e348ac4c
Boot Page : user

I'm not entirely sure why the fpgaotsu command would not complete at first (and then finally did), but my best guess is that either using a slightly different kernel minor version or restarting the server with a cold reboot (powering the server off) helps reinitialize the FPGA state and the devices under /sys/. Warm reboots (a normal power cycle) may cause problems with FPGA device initialization, which is why I recommend cold reboots (turning power off completely for 20-30 seconds).
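To confirm that the update took effect after the power cycle, the firmware version that fpgainfo reports can be compared with the one in the super-rsu manifest (26895 in the log above). This is only a rough sketch: `bmc_fw_version` and `check_bmc_fw` are hypothetical helper names, not part of OPAE, and they assume the "FW version NNNNN" wording shown in the fpgainfo output above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OPAE). They assume `fpgainfo fme`
# prints "... FW version NNNNN" as in the output quoted in this thread.

# Read fpgainfo output on stdin and print the number after "FW version".
bmc_fw_version() {
    sed -n 's/.*FW version \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# check_bmc_fw EXPECTED: succeed only if stdin reports version EXPECTED.
check_bmc_fw() {
    expected=$1
    actual=$(bmc_fw_version)
    if [ "$actual" = "$expected" ]; then
        echo "BMC firmware $actual matches expected $expected"
    else
        echo "BMC firmware '$actual' does not match expected $expected" >&2
        return 1
    fi
}
```

It could be used as `fpgainfo fme | check_bmc_fw 26895`; a non-zero exit status would suggest the card is still on the old firmware.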
For the original issue with super-rsu, my suggested solution is the following:
1) Follow the instructions to run fpgaotsu on pg. 40 of the AFU Quick Start Guide. If it fails, power the server off completely for ~30 seconds (cold reboot), power it back on, initialize the AFU devstack (`. /opt/inteldevstack/init_env.sh`), and rerun the command until it succeeds.
2) Once fpgaotsu completes, perform a cold reboot again. This step should probably replace Step 2 on pg. 41, which originally says "2. Power cycle the server." and could mean either a warm or a cold reboot.
3) Check that the ifpga_sec_mgr module is loaded correctly and that the ifpga_sec_mgr device exists. If it does not exist, try a cold reboot, and check again each time after initializing the AFU devstack.
`ls /sys/class/fpga/intel-fpga-dev.0/intel-fpga-fme.0/ifpga_sec_mgr/
ifpga_sec0`
4) If this device exists, then the super-rsu command should complete successfully (or at least fail elsewhere).
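The device check in step 3 can be scripted so that super-rsu only runs once the node is present. This is a minimal sketch: `find_sec_mgr` is a hypothetical helper name, and the sysfs layout is assumed from the paths quoted in this thread, so it may differ with other driver versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPAE). The sysfs layout is assumed
# from the paths quoted in this thread.

# find_sec_mgr [ROOT]: print the first ifpga_sec* node found under ROOT
# (default /sys/class/fpga); return non-zero if there is none.
find_sec_mgr() {
    root=${1:-/sys/class/fpga}
    for node in "$root"/intel-fpga-dev.*/intel-fpga-fme.*/ifpga_sec_mgr/ifpga_sec*; do
        # An unmatched glob stays literal, so confirm the path exists.
        if [ -e "$node" ]; then
            echo "$node"
            return 0
        fi
    done
    echo "no ifpga_sec_mgr device under $root; try a cold reboot" >&2
    return 1
}
```

For example, `find_sec_mgr && sudo super-rsu /usr/share/opae/a10-gx-pac/super-rsu/base/rsu-09c4.json` would skip the update attempt while the device is still missing.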