Forum Discussion

mkont1
6 years ago

benchmark and classification_sample apps hang on starting inference when running with -d HETERO:FPGA,CPU.

PAC installed in Artesyn MC1600 chassis with Intel(R) Xeon(R) CPU D-1567 @ 2.10GHz running CentOS 7.5.

fpgainfo fme:

Board Management Controller, microcontroller FW version 26889
Last Power Down Cause: POK_CORE
Last Reset Cause: None
//****** FME ******//
Object Id                     : 0xEF00000
PCIe s:b:d:f                  : 0000:06:00:0
Device Id                     : 0x09C4
Socket Id                     : 0x00
Ports Num                     : 01
Bitstream Id                  : 0x123000200000185
Bitstream Version             : 0x30201
Pr Interface Id               : 69528db6-eb31-577a-8c36-68f9faa081f6

Prior to running inference, this bitstream was programmed:

aocl program acl0 /opt/intel/openvino/bitstreams/a10_dcp_bitstreams/2019R1_RC_FP11_ResNet_SqueezeNet_VGG.aocx

classification_sample and benchmark_app run without issue with the target device set to CPU. Both applications hang when run on the FPGA (with -d HETERO:FPGA,CPU). Inference on the FPGA usually completes successfully with a single iteration (-ni 1) but consistently hangs with a higher number of iterations.
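
One way to pin down the iteration count at which the hang starts is to sweep -ni under a time limit. The following is only a minimal sketch and not part of the original report: run_with_limit and sweep_iterations are hypothetical helper names, the list of counts is arbitrary, and it assumes the coreutils timeout command, which exits with status 124 when the limit is reached.

```shell
# Run a command under a time limit and print OK, HANG, or FAIL(<status>).
# Assumption: coreutils `timeout` is available; it returns 124 on timeout.
run_with_limit() {
    local limit="$1"; shift
    timeout "$limit" "$@" >/dev/null 2>&1
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HANG"
    elif [ "$rc" -eq 0 ]; then
        echo "OK"
    else
        echo "FAIL($rc)"
    fi
}

# Append "-ni <count>" to the given command for several counts and
# print one result line per count. The counts are an arbitrary choice.
sweep_iterations() {
    local limit="$1"; shift
    local ni
    for ni in 1 2 5 10; do
        echo "ni=$ni $(run_with_limit "$limit" "$@" -ni "$ni")"
    done
}

# Example (60 s limit is an assumption, paths are the ones from this post):
# sweep_iterations 60 ./classification_sample -d HETERO:FPGA,CPU \
#     -i /opt/intel/openvino/deployment_tools/demo/car.png \
#     -m /root/openvino_models/ir/FP32/classification/squeezenet/1.1/caffe/squeezenet1.1.xml
```

A line such as "ni=5 HANG" would mean that run was still going when the limit expired, which may be a hang or just a slow run, so the limit should be well above the time a CPU-only run takes.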

# ./classification_sample -d HETERO:FPGA,CPU -ni 10 -i /opt/intel/openvino/deployment_tools/demo/car.png -m /root/openvino_models/ir/FP32/classification/squeezenet/1.1/caffe/squeezenet1.1.xml
[ INFO ] InferenceEngine:
        API version ............ 1.6
        Build .................. custom_releases/2019/R1.1_28dfbfdd28954c4dfd2f94403dd8dfc1f411038b
[ INFO ] Parsing input parameters
[ INFO ] Files were added: 1
[ INFO ]     /opt/intel/openvino/deployment_tools/demo/car.png
[ INFO ] Loading plugin
 
        API version ............ 1.6
        Build .................. heteroPlugin
        Description ....... heteroPlugin
[ INFO ] Loading network files:
        /root/openvino_models/ir/FP32/classification/squeezenet/1.1/caffe/squeezenet1.1.xml
        /root/openvino_models/ir/FP32/classification/squeezenet/1.1/caffe/squeezenet1.1.bin
[ INFO ] Preparing input blobs
[ WARNING ] Image is resized from (787, 259) to (227, 227)
[ INFO ] Batch size is 1
[ INFO ] Preparing output blobs
[ INFO ] Loading model to the plugin
[ INFO ] Starting inference (10 iterations)

# ./benchmark_app -d HETERO:FPGA,CPU -i /opt/intel/openvino/deployment_tools/demo/car.png -m /root/openvino_models/ir/FP32/classification/squeezenet/1.1/caffe/squeezenet1.1.xml
[ INFO ] InferenceEngine:
        API version ............ 1.6
        Build .................. custom_releases/2019/R1.1_28dfbfdd28954c4dfd2f94403dd8dfc1f411038b
 
[Step 1/8] Parsing and validation of input args
[ INFO ] Parsing input parameters
[ INFO ] Files were added: 1
[ INFO ]     /opt/intel/openvino/deployment_tools/demo/car.png
Progress: [....................] 100.00% done
 
[Step 2/8] Loading plugin
[ INFO ]
        API version ............ 1.6
        Build .................. heteroPlugin
        Description ....... heteroPlugin
Progress: [....................] 100.00% done
 
[Step 3/8] Read IR network
[ INFO ] Loading network files
[ INFO ] Network batch size: 1, precision: FP32
Progress: [....................] 100.00% done
 
[Step 4/8] Configure input & output of the model
[ INFO ] Preparing output blobs
Progress: [....................] 100.00% done
 
[Step 5/8] Loading model to the plugin
Progress: [....................] 100.00% done
 
[Step 6/8] Create infer requests and fill input blobs with images
[ INFO ] Infer Request 0 created
[ INFO ] Network Input dimensions (NCHW): 1 3 227 227
[ INFO ] Prepare image /opt/intel/openvino/deployment_tools/demo/car.png
[ WARNING ] Image is resized from (787, 259) to (227, 227)
[ INFO ] Infer Request 1 created
[ INFO ] Network Input dimensions (NCHW): 1 3 227 227
[ INFO ] Prepare image /opt/intel/openvino/deployment_tools/demo/car.png
[ WARNING ] Image is resized from (787, 259) to (227, 227)
Progress: [....................] 100.00% done
 
[Step 7/8]
Start inference asynchronously (120000.00 ms duration, 2 inference requests in parallel)
Progress: [                    ] 0.00% done

7 Replies

  • JonWay_altera

    Hi @mkont1

    Could you elaborate on what "hangs" means here? Can you recover with Ctrl+C, or do you need to reboot?

    As a sanity check, does a cold reset (power cycle) of the server resolve the issue?

    After every reboot and in every new terminal:

    Make sure that you have initialized the card.

    Make sure that you have set the hugepages. Allocate twenty 2 MB hugepages per card.

    Did the PAC pass fpgabist? See the link below (search for "Running FPGA Diagnostics"):

    https://www.intel.com/content/www/us/en/programmable/documentation/iyu1522005567196.html

    Did the PAC pass aocl diagnose acl0? See: https://www.intel.com/content/www/us/en/programmable/documentation/fvf1521490619217.html#zru1523293789016

    Could you run the command below? I want to check that you have the correct OPAE version.

    rpm -qa | grep opae

    Does this fail only with 2019R1_RC_FP11_ResNet_SqueezeNet_VGG, or does it fail with other AOCX files as well?

    Could you try a different aocx with a lower FP precision?

    In summary, please run the tests suggested above first, in this order:

    reboot --> initialize --> set hugepages --> fpgabist --> aocl diagnose acl0 --> try another aocx --> try a lower-FP aocx.
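
    The sequence above could be scripted so that it stops at the first failing step. This is a rough sketch under assumptions, not an official procedure: run_steps is a hypothetical helper, the "label::command" argument format is made up for this example, and the commands in the commented usage are the ones quoted elsewhere in this thread. The reboot and card initialization steps are left out because their exact commands are not given here.

```shell
# Run named steps in order and stop at the first one that fails.
# Each argument has the form "label::command line" (format invented here).
run_steps() {
    local step label cmd
    for step in "$@"; do
        label="${step%%::*}"   # text before the first "::"
        cmd="${step#*::}"      # text after the first "::"
        echo "== $label"
        if ! bash -c "$cmd"; then
            echo "FAILED at: $label"
            return 1
        fi
    done
    echo "all steps passed"
}

# Example usage (OPAE_PLATFORM_ROOT must be exported for the child shell):
# run_steps \
#   "hugepages::sudo sh -c 'echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages'" \
#   "fpgabist::sudo fpgabist \$OPAE_PLATFORM_ROOT/hw/samples/nlb_mode_3/bin/nlb_mode_3.gbs" \
#   "diagnose::aocl diagnose acl0"
```

    Stopping at the first failure keeps the report short, since it identifies the earliest step that misbehaves.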

    If the failure persists:

    Please provide the OS and kernel versions and all the results from the tests above.

    cat /etc/*elease

    uname -r

    Thanks

    • mkont1

      Can recover with Ctrl+C.

      Issue persists after power cycle.

      Hugepages set with:

      sudo sh -c "echo 20 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages"
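
      Because this setting has to be redone after a reboot, it may be worth confirming it before each run. A minimal sketch, assuming the standard HugePages_Total, HugePages_Free and Hugepagesize fields of /proc/meminfo; check_hugepages is a hypothetical helper name and the default of 20 pages comes from the advice above.

```shell
# Check that at least `want` hugepages of 2048 kB are configured.
# Reads /proc/meminfo by default; another file can be passed for testing.
check_hugepages() {
    local want="${1:-20}" meminfo="${2:-/proc/meminfo}"
    awk -v want="$want" '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free = $2 }
        /^Hugepagesize:/    { size = $2 }
        END {
            printf "total=%d free=%d size_kB=%d\n", total, free, size
            # Exit status 0 only if enough 2 MB pages are configured.
            exit !((total + 0) >= (want + 0) && (size + 0) == 2048)
        }' "$meminfo"
}

# Example: check_hugepages 20 || echo "hugepages not set"
```

      The function only reports what the kernel has configured. It does not say whether the OpenCL runtime actually uses those pages.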

      Output of fpgabist:

      # sudo fpgabist $OPAE_PLATFORM_ROOT/hw/samples/nlb_mode_3/bin/nlb_mode_3.gbs
      ==========================================================
       
      Beginning FPGA Built-In Self-Test
       
      ==========================================================
      Device: bus = 6, device = , func =
      Board Management Controller, microcontroller FW version 26889
      Last Power Down Cause: POK_CORE
      Last Reset Cause: External reset
      Power-on-reset
      //****** FME ******//
      Object Id                     : 0xF000000
      PCIe s:b:d:f                  : 0000:06:00:0
      Device Id                     : 0x09C4
      Socket Id                     : 0x00
      Ports Num                     : 01
      Bitstream Id                  : 0x123000200000185
      Bitstream Version             : 0x30201
      Pr Interface Id               : 69528db6-eb31-577a-8c36-68f9faa081f6
      Board Management Controller, microcontroller FW version 26889
      Last Power Down Cause: POK_CORE
      Last Reset Cause: None
      //****** PORT ******//
      Object Id                     : 0xEF00000
      PCIe s:b:d:f                  : 0000:06:00:0
      Device Id                     : 0x09C4
      Socket Id                     : 0x00
      Ports Num                     : 01
      Bitstream Id                  : 0x123000200000185
      Bitstream Version             : 0x30201
      Pr Interface Id               : 69528db6-eb31-577a-8c36-68f9faa081f6
      Accelerator Id                : 18b79ffa-2ee5-4aa0-96ef-4230dafacb5f
      Board Management Controller, microcontroller FW version 26889
      Last Power Down Cause: POK_CORE
      Last Reset Cause: None
      //****** TEMP ******//
      Object Id                     : 0xF000000
      PCIe s:b:d:f                  : 0000:06:00:0
      Device Id                     : 0x09C4
      Socket Id                     : 0x00
      Ports Num                     : 01
      Bitstream Id                  : 0x123000200000185
      Bitstream Version             : 0x30201
      Pr Interface Id               : 69528db6-eb31-577a-8c36-68f9faa081f6
      (11) FPGA Core TEMP           : 58.00 °C
      (12) Board TEMP               : 47.00 °C
      (14) QSFP TEMP                : No reading (reading state unavailable)
      (15) Core Supply Temp         : 65.28 °C
      Board Management Controller, microcontroller FW version 26889
      Last Power Down Cause: POK_CORE
      Last Reset Cause: None
      //****** POWER ******//
      Object Id                     : 0xF000000
      PCIe s:b:d:f                  : 0000:06:00:0
      Device Id                     : 0x09C4
      Socket Id                     : 0x00
      Ports Num                     : 01
      Bitstream Id                  : 0x123000200000185
      Bitstream Version             : 0x30201
      Pr Interface Id               : 69528db6-eb31-577a-8c36-68f9faa081f6
      ( 0) Total Input Power        : 28.50 Watts
      ( 1) PCIe 12V Current         : 2.47 Amps
      ( 2) PCIe 12V Voltage         : 11.20 Volts
      ( 3) 1.2V Voltage             : 1.22 Volts
      ( 4) 1.2V Current             : 2.66 Amps
      ( 5) 1.8V Voltage             : 1.83 Volts
      ( 6) 1.8V Current             : 2.73 Amps
      ( 7) 3.3V Mgmt Voltage        : 3.34 Volts
      ( 8) 3.3V Current             : 0.54 Amps
      ( 9) FPGA Core Voltage        : 0.91 Volts
      (10) FPGA Core Current        : 13.11 Amps
      (13) QSFP P3V3                : No reading (reading state unavailable)
      (16) Core Supply Temp Input   : 0.50 Volts
      (17) VCCR Voltage             : 1.04 Volts
      (18) VCCT Voltage             : 1.04 Volts
      (19) VCCR Current             : 1.12 Amps
      (20) VCCT Current             : 0.12 Amps
      (21) VPP Voltage              : 2.53 Volts
      (22) VTT Voltage              : 0.59 Volts
      Board Management Controller, microcontroller FW version 26889
      Last Power Down Cause: POK_CORE
      Last Reset Cause: None
      //****** PORT ERRORS ******//
      Object Id                     : 0xEF00000
      PCIe s:b:d:f                  : 0000:06:00:0
      Device Id                     : 0x09C4
      Socket Id                     : 0x00
      Ports Num                     : 01
      Bitstream Id                  : 0x123000200000185
      Bitstream Version             : 0x30201
      Pr Interface Id               : 69528db6-eb31-577a-8c36-68f9faa081f6
      Accelerator Id                : 18b79ffa-2ee5-4aa0-96ef-4230dafacb5f
      First Error                   : 0x0
      First Malformed Req           : 0xFFFFFFFFFFFFFFFF
      Errors                        : 0x0
      Board Management Controller, microcontroller FW version 26889
      Last Power Down Cause: POK_CORE
      Last Reset Cause: None
      //****** FME ERRORS ******//
      Object Id                     : 0xF000000
      PCIe s:b:d:f                  : 0000:06:00:0
      Device Id                     : 0x09C4
      Socket Id                     : 0x00
      Ports Num                     : 01
      Bitstream Id                  : 0x123000200000185
      Bitstream Version             : 0x7FFF00030201
      Pr Interface Id               : 69528db6-eb31-577a-8c36-68f9faa081f6
      First Error                   : 0x0
      Next Error                    : 0x0
      Errors                        : 0x0
      PCIe1 Errors                  : 0x0
      Nonfatal Errors               : 0x0
      Inject Error                  : 0x0
      Catfatal Errors               : 0x0
      PCIe0 Errors                  : 0x0
      Running mode: nlb_3
      Attempting Partial Reconfiguration:
      Reading bitstream
      Looking for slot
      Found slot
      Programming bitstream
      Writing bitstream
      Done
      Running fpgadiag read test...
       
       
      Cachelines Read_Count Write_Count Cache_Rd_Hit Cache_Wr_Hit Cache_Rd_Miss Cache_Wr_Miss   Eviction 'Clocks(@200 MHz)'   Rd_Bandwidth   Wr_Bandwidth
            1024  544035292           0            0            0             0             0          0       1000011426     6.964 GB/s     0.000 GB/s
       
      VH0_Rd_Count VH0_Wr_Count VH1_Rd_Count VH1_Wr_Count VL0_Rd_Count VL0_Wr_Count
                 0            0            0            0            0            0
       
      Running fpgadiag write test...
       
       
      Cachelines Read_Count Write_Count Cache_Rd_Hit Cache_Wr_Hit Cache_Rd_Miss Cache_Wr_Miss   Eviction 'Clocks(@200 MHz)'   Rd_Bandwidth   Wr_Bandwidth
            1024          0      762732            0            0             0             0          0       1000018957     0.000 GB/s     0.010 GB/s
       
      VH0_Rd_Count VH0_Wr_Count VH1_Rd_Count VH1_Wr_Count VL0_Rd_Count VL0_Wr_Count
                 0            0            0            0            0            0
       
      Running fpgadiag trput test...
       
       
      Cachelines Read_Count Write_Count Cache_Rd_Hit Cache_Wr_Hit Cache_Rd_Miss Cache_Wr_Miss   Eviction 'Clocks(@200 MHz)'   Rd_Bandwidth   Wr_Bandwidth
            1024  488225340   489909832            0            0             0             0          0       1000023141     6.249 GB/s     6.271 GB/s
       
      VH0_Rd_Count VH0_Wr_Count VH1_Rd_Count VH1_Wr_Count VL0_Rd_Count VL0_Wr_Count
                 0            0            0            0            0            0
       
      Finished Executing NLB (FPGA DIAG)Tests
       
       
      Built-in Self-Test Completed.

      aocl diagnose:

      # aocl diagnose
      --------------------------------------------------------------------
      Device Name:
      acl0
       
      BSP Install Location:
      /root/intelrtestack/a10_gx_pac_ias_1_2_pv/opencl/opencl_bsp
       
      Vendor: Intel Corp
       
      Physical Dev Name   Status            Information
       
      pac_a10_ef00000     Passed            PAC Arria 10 Platform (pac_a10_ef00000)
                                            PCIe 06:00.0
                                            FPGA temperature = 61 degrees C.
       
      DIAGNOSTIC_PASSED
      --------------------------------------------------------------------
       
      Call "aocl diagnose <device-names>" to run diagnose for specified devices
      Call "aocl diagnose all" to run diagnose for all devices

      aocl diagnose acl0 gets stuck (recover with Ctrl+C)

      # aocl diagnose acl0
      Using platform: Intel(R) FPGA SDK for OpenCL(TM)
      Using Device with name: pac_a10 : PAC Arria 10 Platform (pac_a10_ef00000)
      Using Device from vendor: Intel Corp
      clGetDeviceInfo CL_DEVICE_GLOBAL_MEM_SIZE = 8589934592
      clGetDeviceInfo CL_DEVICE_MAX_MEM_ALLOC_SIZE = 8589934592
      Allocated 8589934592 bytes
      Actual maximum buffer size = 8589934592 bytes
      Writing 8192 MB to global memory ...
      Allocated 1073741824 Bytes host buffer for large transfers
      Write speed: 6917.17 MB/s [6912.93 -> 6919.78]
      Reading and verifying 8192 MB from global memory ...
      Read speed: 6648.18 MB/s [6541.27 -> 6688.25]
      Successfully wrote and readback 8192 MB buffer
       
      Poll(interrupt) timeout
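
      In the transcript above the write and readback test completes and the "Poll(interrupt) timeout" line appears afterwards. To compare several runs, a captured log can be reduced to those two facts. This is a sketch only: summarize_diagnose is a hypothetical helper, and the two strings it searches for are copied from the output above, so other tool versions may word them differently.

```shell
# Summarize a captured `aocl diagnose acl0` log in one line.
# The matched strings are taken verbatim from the transcript in this post.
summarize_diagnose() {
    local log="$1" transfer=not_seen interrupt=no_timeout_seen
    if grep -qF 'Successfully wrote and readback' "$log"; then
        transfer=ok
    fi
    if grep -qF 'Poll(interrupt) timeout' "$log"; then
        interrupt=timeout
    fi
    echo "transfer=$transfer interrupt=$interrupt"
}

# Example (the 300 s limit is an arbitrary assumption):
# timeout 300 aocl diagnose acl0 > diagnose.log 2>&1
# summarize_diagnose diagnose.log
```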

      rpm -qa | grep opae:

      # rpm -qa | grep opae
      opae-libs-1.1.2-1.x86_64
      opae-tools-1.1.2-1.x86_64
      opae-intel-fpga-driver-1.1.2-1.x86_64
      opae-tools-extra-1.1.2-1.x86_64
      opae-devel-1.1.2-1.x86_64
      opae-ase-1.1.2-1.x86_64
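
      All six packages listed above report the same version, 1.1.2-1. A small sketch for checking that automatically, assuming package names of the form name-version-release.arch as printed above; opae_versions and opae_consistent are hypothetical helper names, and the check says nothing about which version is the correct one for this card.

```shell
# Read package names (as printed by `rpm -qa | grep opae`) on stdin and
# print the distinct version-release strings found.
opae_versions() {
    sed -E 's/^.*-([0-9][^-]*-[^-]+)\.[^.]+$/\1/' | sort -u
}

# Succeed only if every package on stdin has the same version-release.
opae_consistent() {
    local n
    n=$(opae_versions | grep -c .)
    [ "$n" -eq 1 ]
}

# Example: rpm -qa | grep opae | opae_consistent || echo "mixed OPAE versions"
```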

      OS and kernel versions:

      # cat /etc/*elease
       
      Board               : PCIECARD
      Release             : Distro OS
      Version             : 2.0.2
      Build-Date          : 24 January 2019
      Kernel-Arch         : x86_64
      Linux-Distribution  : CentOS.7.5.1804
      CentOS Linux release 7.5.1804 (Core)
      NAME="CentOS Linux"
      VERSION="7 (Core)"
      ID="centos"
      ID_LIKE="rhel fedora"
      VERSION_ID="7"
      PRETTY_NAME="CentOS Linux 7 (Core)"
      ANSI_COLOR="0;31"
      CPE_NAME="cpe:/o:centos:centos:7"
      HOME_URL="https://www.centos.org/"
      BUG_REPORT_URL="https://bugs.centos.org/"
       
      CENTOS_MANTISBT_PROJECT="CentOS-7"
      CENTOS_MANTISBT_PROJECT_VERSION="7"
      REDHAT_SUPPORT_PRODUCT="centos"
      REDHAT_SUPPORT_PRODUCT_VERSION="7"
       
      Board               :   PCIECARD
      Release             :   PCIe Manager
      Version             :   2.0.2
      Build-Date          :   11 December 2018
      Kernel-Arch         :   x86_64
      Kernel-Version      :   3.10.0-862.11.6.1.el7
      Linux-Distribution  :   CentOS.7.5.1804
      CentOS Linux release 7.5.1804 (Core)
      CentOS Linux release 7.5.1804 (Core)
       
      # uname -r
      3.10.0-862.11.6.1.el7.x86_64

      Issue persists with 2019R1_RC_FP16_ResNet_SqueezeNet_VGG.aocx. I don't have an aocx with lower FP than 11.

  • JonWay_altera

    Hi @mkont1

    Would you perform a quick test:

    The demo cannot run with the default batch size on the FPGA. Please set the batch size to a value greater than 1 (e.g. -b 10).

    • mkont1

      Tried this with the benchmark_app. It didn't help.

    • mkont1

      Hi @JwChin

      It seems better with "-b 10 -niter 100" but it still gets stuck. Most of the time the run gets stuck below 10% done. On one run it reached 78% done and then got stuck.

      [Step 7/8] Start inference asynchronously (100 async inference executions, 2 inference requests in parallel)

      Progress: [. ] 7.92% done
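
      Since the stall point varies from run to run, it may help to record how far each run got. A minimal sketch: last_progress is a hypothetical helper that extracts the last "NN.NN% done" value from captured benchmark_app output. It assumes the progress bar is redrawn with carriage returns, which is a guess about how the tool prints it.

```shell
# Print the last progress percentage found in benchmark_app output on stdin.
# Assumption: progress updates may be separated by carriage returns.
last_progress() {
    tr '\r' '\n' | grep -oE '[0-9]+\.[0-9]+% done' | tail -n 1 | cut -d'%' -f1
}

# Example (log file name is arbitrary):
# last_progress < benchmark_run1.log
```

      Collecting this value over a number of runs would show whether the stall point clusters near the start or is spread across the run.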