AI Suite System Throughput Issue
When using AI Suite, we are seeing a significant gap between IP throughput and achieved system throughput on Agilex 5.

I am using the following setup:

Hardware: Agilex™ 5 FPGA and SoC E-Series Modular Development Kit (ES silicon)
Software: Quartus Prime Pro + AI Suite 25.3.1
SD Image: agx5_soc_s2m coredla-image-agilex5_mk_a5e065bb32aes1.wic
Architecture and Bitstream: AGX5_Performance

MobileNetV2 (Open Model Zoo 2024.6.0), compiled for the AGX5_Performance architecture, gives the following results with dla_benchmark:

IP throughput per instance: ~151 FPS
Estimated throughput (200 MHz): ~178 FPS
System throughput:
nireq=1 → 41 FPS
nireq=4 → 54 FPS

Why is there such a big delta between IP throughput and system throughput, and how can we improve the system throughput?

For more details, please see the appended log showing the commands I ran for the benchmark.

Any pointers or help would be highly appreciated. Thanks

=====================================================================
1. Using mobilenet v2 from model zoo
=====================================================================

Commands used to download and compile model:

git clone https://github.com/openvinotoolkit/open_model_zoo.git
cd open_model_zoo
git checkout 2024.6.0
omz_downloader --list
omz_downloader --name mobilenet-v2-pytorch --output_dir $COREDLA_WORK/demo/models/
omz_converter --name mobilenet-v2-pytorch --download_dir ../demo/models/ --output_dir ../demo/models/
cd $COREDLA_WORK/demo/models/public/mobilenet-v2-pytorch/FP32
dla_compiler --march $COREDLA_ROOT/example_architectures/AGX5_Performance.arch --network-file ./mobilenet-v2-pytorch.xml --foutput-format=open_vino_hetero --o $COREDLA_WORK/demo/mobilenet-v2-pytorch_dla.bin --batch-size=1 --fanalyze-performance --fassumed-fmax-core 200

Executing performance estimate
----------------------------------------------------------------
main_graph_0 reported throughput: 178.617 fps
TOTAL DDR SPACE REQUIRED = 16.9756 MB
DDR INPUT & OUTPUT BUFFER SIZE = 0.781738 MB
DDR CONFIG BUFFER SIZE = 0.0986328 MB
DDR FILTER BUFFER SIZE = 15.3296 MB
DDR INTERMEDIATE BUFFER SIZE = 0.765625 MB
NOTE: THIS ESTIMATE ASSUMES 1x I/O BUFFER. THE COREDLA RUNTIME DEFAULTS TO 5
TOTAL DDR TRANSFERS REQUIRED = 18.7003 MB
DDR FILTER READS REQUIRED = 16.2124 MB
DDR FEATURE READS REQUIRED = 1.62164 MB
DDR FEATURE WRITES REQUIRED = 0.767578 MB
NUMBER OF DDR FEATURE READS = 9
MINIMUM AVERAGE DDR BANDWIDTH REQUIRED = 3340.19 MB/s
ASSUMED DDR BANDWIDTH PER IP INSTANCE = 6400 MB/s
----------------------------------------------------------------
Performance Estimator Throughput Breakdown
Arch: kvec64xcvec32_i12x1_fp12agx_sb32768_xbark32_actk32_poolk4
Number of DLA instances = 1
Number of DDR Banks per DLA instance = 1
CoreDLA Target Fmax = 200 MHz
PE Target Fmax = 200 MHz
Batch Size = 1
PE-only Conv Throughput No DDR = 186 fps
PE-only Conv Throughput = 185 fps
Overall Throughput Inf PE Buf Depth (zero MPBW) = 185 fps
Overall Throughput Zero PE Buf Depth (zero MPBW) = 183 fps
Overall Throughput Inf PE Buf Depth = 184 fps
Overall Throughput Zero PE Buf Depth = 182 fps
----------------------------------------------------------------
FINAL THROUGHPUT = 178.617 fps
FINAL THROUGHPUT PER FMAX (CoreDLA) = 0.893086 fps/MHz
FINAL THROUGHPUT PER FMAX (PE) = 0.893086 fps/MHz

Running the model on dev kit:

./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=1 -bgr -nthreads=1

[Step 11/12] Dumping statistics report
count: 8 iterations
system duration: 191.3784 ms
IP duration: 52.7551 ms
latency: 23.4076 ms
system throughput: 41.8020 FPS
number of hardware instances: 1
number of network instances: 1
IP throughput per instance: 151.6441 FPS
IP throughput per fmax per instance: 0.7582 FPS/MHz
IP clock frequency measurement: 200.0000 MHz
estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)
estimated IP throughput per fmax per instance: 0.8931 FPS/MHz

./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=4 -bgr -nthreads=4

[Step 11/12] Dumping statistics report
count: 8 iterations
system duration: 147.8426 ms
IP duration: 52.7619 ms
latency: 69.8254 ms
system throughput: 54.1116 FPS
number of hardware instances: 1
number of network instances: 1
IP throughput per instance: 151.6246 FPS
IP throughput per fmax per instance: 0.7581 FPS/MHz
IP clock frequency measurement: 200.0000 MHz
estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)
estimated IP throughput per fmax per instance: 0.8931 FPS/MHz
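
As a sanity check on the statistics above, the reported throughputs appear to be the iteration count divided by the corresponding duration. The Python sketch below recomputes them from the two reports and derives how much wall-clock time per frame is spent outside the IP. It assumes that dla_benchmark computes throughput this way (not confirmed from the tool's source); the class and method names are hypothetical and only the input numbers come from the logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    """Figures copied from one dla_benchmark statistics report above."""
    nireq: int
    iterations: int
    system_duration_ms: float
    ip_duration_ms: float

    def system_fps(self) -> float:
        # Assumption: system throughput = iterations / system duration.
        return self.iterations / (self.system_duration_ms / 1000.0)

    def ip_fps(self) -> float:
        # Assumption: IP throughput = iterations / IP duration.
        return self.iterations / (self.ip_duration_ms / 1000.0)

    def ip_busy_fraction(self) -> float:
        # Share of the measured wall-clock time during which the IP was busy.
        return self.ip_duration_ms / self.system_duration_ms

    def non_ip_ms_per_frame(self) -> float:
        # Average wall-clock time per frame not accounted for by the IP.
        return (self.system_duration_ms - self.ip_duration_ms) / self.iterations


# Values taken verbatim from the nireq=1 and nireq=4 reports.
RUNS = [
    BenchmarkRun(nireq=1, iterations=8,
                 system_duration_ms=191.3784, ip_duration_ms=52.7551),
    BenchmarkRun(nireq=4, iterations=8,
                 system_duration_ms=147.8426, ip_duration_ms=52.7619),
]

for run in RUNS:
    print(
        f"nireq={run.nireq}: "
        f"system {run.system_fps():.2f} FPS, "
        f"IP {run.ip_fps():.2f} FPS, "
        f"IP busy {run.ip_busy_fraction():.1%}, "
        f"non-IP time {run.non_ip_ms_per_frame():.2f} ms/frame"
    )
```

Under that assumption, the recomputed values match the reported system and IP throughputs, and the IP is busy for only roughly 28% (nireq=1) and 36% (nireq=4) of the measured system duration. The remaining time per frame is what the question is about.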