Forum Discussion
Hi @mkont1
Could you elaborate on what "hangs" means here? Can you recover with Ctrl+C, or do you need to reboot?
As a sanity check:
Does a cold reset (power cycle) of the server resolve the issue?
After every reboot and in every new terminal:
Make sure that you have initialized the card.
Make sure that you have set the hugepages. Allocate twenty 2 MB hugepages per card.
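A quick way to confirm that the hugepage reservation took effect is sketched below. The function name `check_hugepages` is made up for illustration, and the default path is the standard Linux sysfs entry for 2 MB pages; the default of 20 assumes a single card, so scale it if you have more.

```shell
#!/bin/sh
# check_hugepages [REQUIRED] [PATH]
# Hypothetical helper: compares the number of reserved 2 MB hugepages with
# the number required (default 20, i.e. one card).
check_hugepages() {
    required="${1:-20}"
    path="${2:-/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages}"

    # The sysfs entry is missing if the kernel has no 2 MB hugepage support.
    if [ ! -r "$path" ]; then
        echo "unknown: cannot read $path"
        return 2
    fi

    current=$(cat "$path")
    if [ "$current" -ge "$required" ]; then
        echo "ok: $current of $required hugepages reserved"
        return 0
    fi
    echo "low: $current of $required hugepages reserved"
    return 1
}
```

Calling `check_hugepages` with no arguments checks for 20 pages at the default path; `check_hugepages 40` would be the equivalent check for two cards.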
Did the PAC pass fpgabist? You may refer to the link below (keyword "Running FPGA Diagnostics"):
https://www.intel.com/content/www/us/en/programmable/documentation/iyu1522005567196.html
Did the PAC pass aocl diagnose acl0? You may refer to: https://www.intel.com/content/www/us/en/programmable/documentation/fvf1521490619217.html#zru1523293789016
Could you run the command below? I want to check that you have the correct OPAE version.
rpm -qa | grep opae
Does this fail with 2019R1_RC_FP11_ResNet_SqueezeNet_VGG only, or does it fail with other AOCX files as well?
Could you try switching to another aocx with a lower FP?
In summary, please run the tests suggested above first:
reboot --> initialize --> set hugepages --> fpgabist --> aocl diagnose acl0 --> try another aocx --> try a lower FP.
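Since one of these steps may hang, it can help to run each one under a time limit and record which step stops responding. The following is only a sketch: `run_step` and `LOG_DIR` are names invented here, the time limits are arbitrary, and it relies on the coreutils `timeout` command, which exits with status 124 when the limit is reached.

```shell
#!/bin/sh
# run_step LABEL SECONDS CMD [ARGS...]
# Runs CMD under `timeout`, saves its output to a log file, and prints one
# status line: PASS, FAIL, HANG (time limit reached), or SKIP.
run_step() {
    label="$1"
    limit="$2"
    shift 2
    log="${LOG_DIR:-/tmp}/${label}.log"

    # Skip instead of failing when the tool is not installed or not on PATH.
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "SKIP  $label (command not found: $1)"
        return 0
    fi

    status=0
    timeout "$limit" "$@" >"$log" 2>&1 || status=$?

    case "$status" in
        0)   echo "PASS  $label" ;;
        124) echo "HANG  $label (no exit within ${limit}s, see $log)" ;;
        *)   echo "FAIL  $label (exit $status, see $log)" ;;
    esac
    return "$status"
}

# Example sequence (time limits are illustrative; run as root, or note that
# with a "sudo" prefix only sudo itself is checked by the SKIP test):
# run_step fpgabist 600 fpgabist "$OPAE_PLATFORM_ROOT/hw/samples/nlb_mode_3/bin/nlb_mode_3.gbs"
# run_step diagnose 300 aocl diagnose acl0
```

The per-step log files are the results to attach if the failure persists.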
If the failure persists:
Please provide the OS and kernel versions and all the results you see from the tests above.
cat /etc/*elease
uname -r
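If it is easier, the version checks requested in this post can be gathered into one report. This is a minimal sketch; `collect_info` and the section headings are made up for illustration, and it only wraps the three commands already mentioned here.

```shell
#!/bin/sh
# collect_info
# Prints the kernel version, the release files, and the installed OPAE
# packages under simple headings so the output can be pasted in one piece.
collect_info() {
    echo "== kernel =="
    uname -r

    echo "== release =="
    cat /etc/*elease 2>/dev/null || echo "(no release files found)"

    echo "== opae packages =="
    if command -v rpm >/dev/null 2>&1; then
        rpm -qa | grep opae || echo "(no opae packages found)"
    else
        echo "(rpm not available)"
    fi
}
```

For example, `collect_info > pac_report.txt` writes everything to a single file to attach to a reply.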
Thanks
- mkont1, 6 years ago
New Contributor
Can recover with Ctrl+C.
Issue persists after power cycle.
Hugepages set with:
sudo sh -c "echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"
Output of fpgabist:
# sudo fpgabist $OPAE_PLATFORM_ROOT/hw/samples/nlb_mode_3/bin/nlb_mode_3.gbs
==========================================================
Beginning FPGA Built-In Self-Test
==========================================================
Device: bus = 6, device = , func =

Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: External reset Power-on-reset
//****** FME ******//
Object Id : 0xF000000
PCIe s:b:d:f : 0000:06:00:0
Device Id : 0x09C4
Socket Id : 0x00
Ports Num : 01
Bitstream Id : 0x123000200000185
Bitstream Version : 0x30201
Pr Interface Id : 69528db6-eb31-577a-8c36-68f9faa081f6

Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** PORT ******//
Object Id : 0xEF00000
PCIe s:b:d:f : 0000:06:00:0
Device Id : 0x09C4
Socket Id : 0x00
Ports Num : 01
Bitstream Id : 0x123000200000185
Bitstream Version : 0x30201
Pr Interface Id : 69528db6-eb31-577a-8c36-68f9faa081f6
Accelerator Id : 18b79ffa-2ee5-4aa0-96ef-4230dafacb5f

Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** TEMP ******//
Object Id : 0xF000000
PCIe s:b:d:f : 0000:06:00:0
Device Id : 0x09C4
Socket Id : 0x00
Ports Num : 01
Bitstream Id : 0x123000200000185
Bitstream Version : 0x30201
Pr Interface Id : 69528db6-eb31-577a-8c36-68f9faa081f6
(11) FPGA Core TEMP : 58.00 °C
(12) Board TEMP : 47.00 °C
(14) QSFP TEMP : No reading (reading state unavailable)
(15) Core Supply Temp : 65.28 °C

Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** POWER ******//
Object Id : 0xF000000
PCIe s:b:d:f : 0000:06:00:0
Device Id : 0x09C4
Socket Id : 0x00
Ports Num : 01
Bitstream Id : 0x123000200000185
Bitstream Version : 0x30201
Pr Interface Id : 69528db6-eb31-577a-8c36-68f9faa081f6
( 0) Total Input Power : 28.50 Watts
( 1) PCIe 12V Current : 2.47 Amps
( 2) PCIe 12V Voltage : 11.20 Volts
( 3) 1.2V Voltage : 1.22 Volts
( 4) 1.2V Current : 2.66 Amps
( 5) 1.8V Voltage : 1.83 Volts
( 6) 1.8V Current : 2.73 Amps
( 7) 3.3V Mgmt Voltage : 3.34 Volts
( 8) 3.3V Current : 0.54 Amps
( 9) FPGA Core Voltage : 0.91 Volts
(10) FPGA Core Current : 13.11 Amps
(13) QSFP P3V3 : No reading (reading state unavailable)
(16) Core Supply Temp Input : 0.50 Volts
(17) VCCR Voltage : 1.04 Volts
(18) VCCT Voltage : 1.04 Volts
(19) VCCR Current : 1.12 Amps
(20) VCCT Current : 0.12 Amps
(21) VPP Voltage : 2.53 Volts
(22) VTT Voltage : 0.59 Volts

Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** PORT ERRORS ******//
Object Id : 0xEF00000
PCIe s:b:d:f : 0000:06:00:0
Device Id : 0x09C4
Socket Id : 0x00
Ports Num : 01
Bitstream Id : 0x123000200000185
Bitstream Version : 0x30201
Pr Interface Id : 69528db6-eb31-577a-8c36-68f9faa081f6
Accelerator Id : 18b79ffa-2ee5-4aa0-96ef-4230dafacb5f
First Error : 0x0
First Malformed Req : 0xFFFFFFFFFFFFFFFF
Errors : 0x0

Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** FME ERRORS ******//
Object Id : 0xF000000
PCIe s:b:d:f : 0000:06:00:0
Device Id : 0x09C4
Socket Id : 0x00
Ports Num : 01
Bitstream Id : 0x123000200000185
Bitstream Version : 0x7FFF00030201
Pr Interface Id : 69528db6-eb31-577a-8c36-68f9faa081f6
First Error : 0x0
Next Error : 0x0
Errors : 0x0
PCIe1 Errors : 0x0
Nonfatal Errors : 0x0
Inject Error : 0x0
Catfatal Errors : 0x0
PCIe0 Errors : 0x0

Running mode: nlb_3
Attempting Partial Reconfiguration:
Reading bitstream
Looking for slot
Found slot
Programming bitstream
Writing bitstream
Done

Running fpgadiag read test...
Cachelines Read_Count Write_Count Cache_Rd_Hit Cache_Wr_Hit Cache_Rd_Miss Cache_Wr_Miss Eviction 'Clocks(@200 MHz)' Rd_Bandwidth Wr_Bandwidth
1024 544035292 0 0 0 0 0 0 1000011426 6.964 GB/s 0.000 GB/s
VH0_Rd_Count VH0_Wr_Count VH1_Rd_Count VH1_Wr_Count VL0_Rd_Count VL0_Wr_Count
0 0 0 0 0 0

Running fpgadiag write test...
Cachelines Read_Count Write_Count Cache_Rd_Hit Cache_Wr_Hit Cache_Rd_Miss Cache_Wr_Miss Eviction 'Clocks(@200 MHz)' Rd_Bandwidth Wr_Bandwidth
1024 0 762732 0 0 0 0 0 1000018957 0.000 GB/s 0.010 GB/s
VH0_Rd_Count VH0_Wr_Count VH1_Rd_Count VH1_Wr_Count VL0_Rd_Count VL0_Wr_Count
0 0 0 0 0 0

Running fpgadiag trput test...
Cachelines Read_Count Write_Count Cache_Rd_Hit Cache_Wr_Hit Cache_Rd_Miss Cache_Wr_Miss Eviction 'Clocks(@200 MHz)' Rd_Bandwidth Wr_Bandwidth
1024 488225340 489909832 0 0 0 0 0 1000023141 6.249 GB/s 6.271 GB/s
VH0_Rd_Count VH0_Wr_Count VH1_Rd_Count VH1_Wr_Count VL0_Rd_Count VL0_Wr_Count
0 0 0 0 0 0

Finished Executing NLB (FPGA DIAG)Tests
Built-in Self-Test Completed.
aocl diagnose:
# aocl diagnose
--------------------------------------------------------------------
Device Name: acl0
BSP Install Location: /root/intelrtestack/a10_gx_pac_ias_1_2_pv/opencl/opencl_bsp
Vendor: Intel Corp
Physical Dev Name Status Information
pac_a10_ef00000 Passed PAC Arria 10 Platform (pac_a10_ef00000)
PCIe 06:00.0
FPGA temperature = 61 degrees C.
DIAGNOSTIC_PASSED
--------------------------------------------------------------------
Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
aocl diagnose acl0 gets stuck (recover with Ctrl+C)
# aocl diagnose acl0
Using platform: Intel(R) FPGA SDK for OpenCL(TM)
Using Device with name: pac_a10 : PAC Arria 10 Platform (pac_a10_ef00000)
Using Device from vendor: Intel Corp
clGetDeviceInfo CL_DEVICE_GLOBAL_MEM_SIZE = 8589934592
clGetDeviceInfo CL_DEVICE_MAX_MEM_ALLOC_SIZE = 8589934592
Allocated 8589934592 bytes
Actual maximum buffer size = 8589934592 bytes
Writing 8192 MB to global memory ...
Allocated 1073741824 Bytes host buffer for large transfers
Write speed: 6917.17 MB/s [6912.93 -> 6919.78]
Reading and verifying 8192 MB from global memory ...
Read speed: 6648.18 MB/s [6541.27 -> 6688.25]
Successfully wrote and readback 8192 MB buffer
Poll(interrupt) timeout
rpm -qa | grep opae:
# rpm -qa | grep opae
opae-libs-1.1.2-1.x86_64
opae-tools-1.1.2-1.x86_64
opae-intel-fpga-driver-1.1.2-1.x86_64
opae-tools-extra-1.1.2-1.x86_64
opae-devel-1.1.2-1.x86_64
opae-ase-1.1.2-1.x86_64
OS and kernel versions:
# cat /etc/*elease
Board : PCIECARD
Release : Distro OS
Version : 2.0.2
Build-Date : 24 January 2019
Kernel-Arch : x86_64
Linux-Distribution : CentOS.7.5.1804
CentOS Linux release 7.5.1804 (Core)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Board : PCIECARD
Release : PCIe Manager
Version : 2.0.2
Build-Date : 11 December 2018
Kernel-Arch : x86_64
Kernel-Version : 3.10.0-862.11.6.1.el7
Linux-Distribution : CentOS.7.5.1804
CentOS Linux release 7.5.1804 (Core)
CentOS Linux release 7.5.1804 (Core)
# uname -r
3.10.0-862.11.6.1.el7.x86_64
Issue persists with 2019R1_RC_FP16_ResNet_SqueezeNet_VGG.aocx. I don't have an aocx with lower FP than 11.