Hi Kian,
We ran some experiments that may be helpful. Please see our observations below:
In the design example provided with the FPGA AI Suite, the HPS DDR4 memory is interfaced using the following I/O ports:
// HPS Memory
input wire emif_ref_clk,
input wire hps_memory_oct_rzqin,
output wire [0:0] hps_memory_mem_ck,
output wire [0:0] hps_memory_mem_ck_n,
output wire [16:0] hps_memory_mem_a,
output wire [0:0] hps_memory_mem_act_n,
output wire [1:0] hps_memory_mem_ba,
output wire [0:0] hps_memory_mem_bg,
output wire [0:0] hps_memory_mem_cke,
output wire [0:0] hps_memory_mem_cs_n,
output wire [0:0] hps_memory_mem_odt,
output wire [0:0] hps_memory_mem_reset_n,
output wire [0:0] hps_memory_mem_par,
input wire [0:0] hps_memory_mem_alert_n,
inout wire [3:0] hps_memory_mem_dqs,
inout wire [3:0] hps_memory_mem_dqs_n,
inout wire [31:0] hps_memory_mem_dq,
inout wire [3:0] hps_memory_mem_dbi_n,
The signal definitions above do not match the 1 GB DDR4 memory (256 Mb x 40, single rank) fitted on the board.
We updated the memory interface to match the provided 1 GB DDR4 memory (256 Mb x 40, single rank), and we can now boot the Arria 10 SoC board by following the Arria 10 SoC GSRD golden example.
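For reference, the data-group ports we widened look roughly like this. This is only a sketch of the change, assuming the HPS EMIF IP is regenerated for the 40-bit (32 data + 8 ECC, five byte lanes) configuration; the exact port widths come from the regenerated IP, and the clock, command, and address ports are unchanged from the listing above:

```verilog
// HPS Memory data group -- widened for 1 GB DDR4, 256 Mb x 40, single rank
// (five byte lanes: 32 data bits + 8 ECC bits, one DQS pair and one DBI per lane)
inout wire [4:0]  hps_memory_mem_dqs,
inout wire [4:0]  hps_memory_mem_dqs_n,
inout wire [39:0] hps_memory_mem_dq,
inout wire [4:0]  hps_memory_mem_dbi_n,
```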
However, we are still unable to run the FPGA AI Suite SoC design example (we tried both S2M and M2M; the observations are the same as below). We get a DLA timeout when running inference:
root@arria10-a2524a6b645b:~/app# ./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=5 -plugins_xml_file ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/TF_ground_truth.txt -perf_est -nireq=4 -bgr
[Step 1/12] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[ INFO ] Found 1 compiled graph
[ INFO ] Using custom plugins xml file - ./plugins.xml
[ INFO ] Network is compiled
[ INFO ] Printing summary of arguments being used by dla_benchmark
[ INFO ] API (-api) ........................... async
[ INFO ] Device (-d) .......................... HETERO:FPGA,CPU
[ INFO ] Batch size (-b) ...................... 1
[ INFO ] Compiled model (-cm) ................. /home/root/resnet-50-tf/RN50_Performance_b1.bin
[ INFO ] Num iterations (-niter) .............. 5
[ INFO ] Input images directory (-i) .......... /home/root/resnet-50-tf/sample_images
[ INFO ] Num CPU threads (-nthreads) .......... Not specified
[ INFO ] Architecture file (-arch_file) ....... /home/root/resnet-50-tf/A10_Performance.arch
[ INFO ] Num inference requests (-nireq) ...... 4
[ INFO ] Plugins file (-plugins_xml_file) ..... ./plugins.xml
[ INFO ] Groundtruth file (-groundtruth_loc) .. /home/root/resnet-50-tf/sample_images/TF_ground_truth.txt
[ INFO ] Reverse input image channels (-bgr) .. True
[ INFO ] Reading /home/root/resnet-50-tf/sample_images for graph index 0
[ WARNING ] -nstreams default value is determined automatically for a device.
Although the automatic selection usually provides a reasonable performance,
but it still may be non-optimal for some cases, for more information look at README.
[Step 2/12] Loading Inference Engine
[ INFO ] OpenVINO: Build ................................. 2022.3.0-9052-9752fafe8eb-HEAD
[ INFO ]
[Step 3/12] Setting device configuration
[Step 4/12] Reading the Intermediate Representation network
[ INFO ] Skipping the step for compiled network
[Step 5/12] Resizing network to match image sizes and given batch
[ INFO ] Skipping the step for compiled network
[Step 6/12] Configuring input of the model
[ INFO ] Skipping the step for compiled network
[Step 7/12] Loading the model to the device
[ INFO ] Importing model from /home/root/resnet-50-tf/RN50_Performance_b1.bin to HETERO:FPGA,CPU as Graph_0
Runtime arch check is enabled. Check started...
Runtime arch check passed.
Runtime build version check is enabled. Check started...
Runtime build version check passed.
[ INFO ] Import network took 3493.0785 ms
[Step 8/12] Setting optimal runtime parameters
[ WARNING ] Number of iterations was aligned by request number from 5 to 8 using number of requests 4
[Step 9/12] Creating infer requests and filling input blobs with images
[ INFO ] Filling input blobs for network ( Graph_0 )
[ INFO ] Network input 'map/TensorArrayStack/TensorArrayGatherV3' precision U8, dimensions (NCHW): 1 3 224 224
[ WARNING ] Some image input files will be ignored: only 8 are required from 10
[Step 10/12] Measuring performance (Start inference asyncronously, 4 inference requests using 1 streams for CPU, limits: 8 iterations with each graph)
WaitForDla polling timeout with threadId_0
If inference on one batch is expected to take more than 30 seconds, then increase WAIT_FOR_DLA_TIMEOUT in dlia_plugin.cpp and recompile the runtime.
../src/inference/src/ie_common.cpp:75 FATAL ERROR: inference on FPGA did not complete, jobs finished 0, jobs waited 0
[ ERROR ] Infer failed
We also noticed that the FPGA DDR4 tests in the BTS (Board Test System) are failing; a screenshot is attached.
Please let me know if you need any additional information.