FlexRAN 23.07 L1 crash during a long run, preceded by ACC100 PCIe errors
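Hello,
We have been running Intel FlexRAN 23.07 in our setup for a long time, and we are seeing a crash on the L1 side. While investigating, we found corrected PCIe errors reported for the ACC100 (device 0000:c3:00.0); about two seconds after them, L1 crashed with a segfault in the fh_main_poll thread. The relevant syslog excerpt is below.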
kernel: [1402131.454101] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
kernel: [1402131.454104] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
kernel: [1402131.454105] {4}[Hardware Error]: event severity: corrected
kernel: [1402131.454106] {4}[Hardware Error]: Error 0, type: corrected
l1.sh[299592]: #033[1;32mPDSCH / PUSCH Stats
kernel: [1402131.454107] {4}[Hardware Error]: section_type: PCIe error
kernel: [1402131.454107] {4}[Hardware Error]: port_type: 0, PCIe end point
kernel: [1402131.454109] {4}[Hardware Error]: version: 3.0
kernel: [1402131.454109] {4}[Hardware Error]: command: 0x0546, status: 0x0018
kernel: [1402131.454110] {4}[Hardware Error]: device_id: 0000:c3:00.0
kernel: [1402131.454111] {4}[Hardware Error]: slot: 0
kernel: [1402131.454112] {4}[Hardware Error]: secondary_bus: 0x00
kernel: [1402131.454112] {4}[Hardware Error]: vendor_id: 0x8086, device_id: 0x0d5c
kernel: [1402131.454113] {4}[Hardware Error]: class_code: 120001
kernel: [1402131.454145] igb_uio 0000:c3:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
kernel: [1402131.454147] igb_uio 0000:c3:00.0: [ 0] RxErr (First)
kernel: [1402131.454149] igb_uio 0000:c3:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
l1.sh[299592]: #033[0m----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
l1.sh[299592]: | MAC | RU | MAC-to-PHY Tput | PHY-to-MAC Tput | UL FEC CB Iterations | UL Packet Errors | |
l1.sh[299592]: Cell (MU / DL,UL MHz, CId) | Inst | Port | kbps Num CB | kbps UL BLER Num CB | Min Avg Max | PUSCH PUCCH PRACH SRS | SRS SNR |
l1.sh[299592]: -----------------------------|------|-------|------------------------|------------------------------------------------|-----------------------|-------------------------|------------|
l1.sh[299592]: 0 (MU 1 / 100,100, 21) | 0 | 0, 0 | 202,327 122,553 | 124,056 / 140,877 11.94% 85,102 | 1 2.49 12 | 40 39 0 0 | 0 Db |
l1.sh[299592]: 1 (MU 1 / 100,100, 31) | 0 | 1, 0 | 61 411 | 0 / 0 0.00% 0 | 0 0.00 0 | 0 0 0 0 | 0 Db |
l1.sh[299592]: 2 (MU 1 / 100,100, 41) | 0 | 2, 0 | 61 411 | 0 / 0 0.00% 0 | 0 0.00 0 | 0 0 0 0 | 0 Db |
l1.sh[299592]: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
l1.sh[299592]: #033[1;32mProcessing Latency Stats
l1.sh[299592]: #033[0m -----------------|--------------------------------|------------------------------
l1.sh[299592]: | usecs | % of TTI
l1.sh[299592]: Latency (-OTA) | Min Avg Max | Min Avg Max
l1.sh[299592]: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
l1.sh[299592]: #033[1;32mCore Utilization Stats [5 BBU core(s)]:
l1.sh[299592]: #033[0m Core Id : 16 17 18 19 20 Avg
l1.sh[299592]: Numa Node : 0 0 0 0 0
l1.sh[299592]: Core Type : ALL ALL ALL ALL ALL
l1.sh[299592]: Util % : 21.68 21.74 21.82 21.80 21.69 21.75
l1.sh[299592]: Intr % : 1.65 1.64 1.63 1.66 1.67 1.65
l1.sh[299592]: Spare % : 0.36 0.36 0.36 0.36 0.36 0.36
l1.sh[299592]: Sleep % : 76.29 76.24 76.17 76.16 76.27 76.23
l1.sh[299592]: Numa % : 0.00 0.00 0.00 0.00 0.00 0.00
l1.sh[299592]: TTI Cnt : 10001 10001 10001 10001 10001
l1.sh[299592]: TTI Min : 2 2 2 2 2
l1.sh[299592]: TTI Avg : 21 21 21 21 21
l1.sh[299592]: TTI Max : 91 99 99 98 99
l1.sh[299592]: Xran Cores: 21 22 Master Core Util: 65 %
l1.sh[299592]: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sep 29 17:47:39 ptp_all.sh[299519]: phc2sys[1402132.400]: CLOCK_REALTIME phc offset -8 s2 freq -278 delay 515
Sep 29 17:47:39 phc2sys: [1402132.400] CLOCK_REALTIME phc offset -8 s2 freq -278 delay 515
Sep 29 17:47:40 ptp_all.sh[299519]: phc2sys[1402133.400]: CLOCK_REALTIME phc offset -14 s2 freq -286 delay 513
Sep 29 17:47:40 phc2sys: [1402133.400] CLOCK_REALTIME phc offset -14 s2 freq -286 delay 513
Sep 29 17:47:40 kernel: [1402133.535477] fh_main_poll-21[299600]: segfault at 13540 ip 00000000014ed060 sp 00007fa1e3c70888 error 6 in l1app[645000+416d000]
Sep 29 17:47:40 kernel: [1402133.535488] Code: 4a c7 84 33 e0 00 00 00 00 00 00 00 ff 4b 14 44 89 f8 48 83 c4 38 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 0f 1f 84 00 00 00 00 00 <89> b7 40 35 01 00 c3 66 0f 1f 84 00 00 00 00 00 53 41 89 f1 48 8d
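For anyone triaging the same trace, here is a minimal sketch of how the segfault line above can be decoded. It assumes the standard x86-64 page-fault error-code bits (bit 0 = page present, bit 1 = write, bit 2 = user mode) and that the kernel prints the error code in hex; `decode_segfault` is a hypothetical helper name, not part of FlexRAN or any Intel tool.

```python
import re

# Matches the kernel's x86 "segfault at ..." message. All numeric fields are hex.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def decode_segfault(line):
    """Hypothetical helper: split a segfault log line into its fields."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    err = int(m["err"], 16)
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "thread": m["comm"],
        "fault_addr": int(m["addr"], 16),
        "ip": ip,
        "object": m["obj"],
        # Distance of the faulting instruction from the start of the mapping
        # named in the log line (not necessarily the ELF file offset).
        "offset_in_mapping": ip - base,
        "page_present": bool(err & 1),  # assumed x86 bit 0
        "write": bool(err & 2),         # assumed x86 bit 1
        "user_mode": bool(err & 4),     # assumed x86 bit 2
    }

line = ("fh_main_poll-21[299600]: segfault at 13540 ip 00000000014ed060 "
        "sp 00007fa1e3c70888 error 6 in l1app[645000+416d000]")
info = decode_segfault(line)
for key, value in info.items():
    print(key, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

Under those assumptions, error 6 reads as a user-mode write to an unmapped page. The faulting bytes in the Code: line, `<89> b7 40 35 01 00`, appear to be a 32-bit store to [rdi + 0x13540], and the fault address is exactly 0x13540, which would be consistent with a null base pointer being dereferenced in fh_main_poll. Whether that is related to the corrected ACC100 RxErr two seconds earlier is not something the log alone shows.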