AI Suite System Throughput Issue
When using AI Suite on Agilex 5, I am seeing a significant gap between the IP throughput and the achieved system throughput. My setup is as follows:
Hardware: Agilex™ 5 FPGA and SoC E-Series Modular Development Kit (ES silicon)
Software: Quartus Prime Pro + AI Suite 25.3.1
SD Image: agx5_soc_s2m coredla-image-agilex5_mk_a5e065bb32aes1.wic
Architecture and Bitstream: AGX5_Performance
Running MobileNetV2 (Open Model Zoo 2024.6.0), compiled for the AGX5_Performance architecture, through dla_benchmark gives the following results:
IP throughput per instance: ~151 FPS
Estimated throughput (200 MHz): ~178 FPS
System throughput:
nireq=1 → 41 FPS
nireq=4 → 54 FPS
Why is there such a large gap between the IP throughput and the system throughput, and how can we improve the system throughput? For more details, please see the appended log below, which shows the commands I ran for the benchmark.
Any pointers or help would be highly appreciated. Thanks
=====================================================================
1. Using mobilenet v2 from model zoo
=====================================================================
Commands used to download and compile model:
git clone https://github.com/openvinotoolkit/open_model_zoo.git
cd open_model_zoo
git checkout 2024.6.0
omz_downloader --list
omz_downloader --name mobilenet-v2-pytorch --output_dir $COREDLA_WORK/demo/models/
omz_converter --name mobilenet-v2-pytorch --download_dir ../demo/models/ --output_dir ../demo/models/
cd $COREDLA_WORK/demo/models/public/mobilenet-v2-pytorch/FP32
dla_compiler --march $COREDLA_ROOT/example_architectures/AGX5_Performance.arch --network-file ./mobilenet-v2-pytorch.xml --foutput-format=open_vino_hetero --o $COREDLA_WORK/demo/mobilenet-v2-pytorch_dla.bin --batch-size=1 --fanalyze-performance --fassumed-fmax-core 200
Executing performance estimate
----------------------------------------------------------------
main_graph_0 reported throughput: 178.617 fps
TOTAL DDR SPACE REQUIRED = 16.9756 MB
DDR INPUT & OUTPUT BUFFER SIZE = 0.781738 MB
DDR CONFIG BUFFER SIZE = 0.0986328 MB
DDR FILTER BUFFER SIZE = 15.3296 MB
DDR INTERMEDIATE BUFFER SIZE = 0.765625 MB
NOTE: THIS ESTIMATE ASSUMES 1x I/O BUFFER. THE COREDLA RUNTIME DEFAULTS TO 5
TOTAL DDR TRANSFERS REQUIRED = 18.7003 MB
DDR FILTER READS REQUIRED = 16.2124 MB
DDR FEATURE READS REQUIRED = 1.62164 MB
DDR FEATURE WRITES REQUIRED = 0.767578 MB
NUMBER OF DDR FEATURE READS = 9
MINIMUM AVERAGE DDR BANDWIDTH REQUIRED = 3340.19 MB/s
ASSUMED DDR BANDWIDTH PER IP INSTANCE = 6400 MB/s
----------------------------------------------------------------
Performance Estimator Throughput Breakdown
Arch: kvec64xcvec32_i12x1_fp12agx_sb32768_xbark32_actk32_poolk4
Number of DLA instances = 1
Number of DDR Banks per DLA instance = 1
CoreDLA Target Fmax = 200 MHz
PE Target Fmax = 200 MHz
Batch Size = 1
PE-only Conv Throughput No DDR = 186 fps
PE-only Conv Throughput = 185 fps
Overall Throughput Inf PE Buf Depth (zero MPBW) = 185 fps
Overall Throughput Zero PE Buf Depth (zero MPBW) = 183 fps
Overall Throughput Inf PE Buf Depth = 184 fps
Overall Throughput Zero PE Buf Depth = 182 fps
----------------------------------------------------------------
FINAL THROUGHPUT = 178.617 fps
FINAL THROUGHPUT PER FMAX (CoreDLA) = 0.893086 fps/MHz
FINAL THROUGHPUT PER FMAX (PE) = 0.893086 fps/MHz
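As a sanity check on the estimate above, the reported minimum average DDR bandwidth appears to be the per-inference DDR traffic multiplied by the estimated throughput. The short Python sketch below only redoes that arithmetic with values copied from the log. The variable names are my own, and the formula is my reading of the numbers, not something taken from the dla_compiler documentation.

```python
# Values copied from the dla_compiler performance estimate above.
total_ddr_transfers_mb = 18.7003   # TOTAL DDR TRANSFERS REQUIRED (per inference)
estimated_fps = 178.617            # FINAL THROUGHPUT
assumed_bandwidth_mb_s = 6400.0    # ASSUMED DDR BANDWIDTH PER IP INSTANCE

# Assumption: required bandwidth = traffic per inference * inferences per second.
required_bandwidth_mb_s = total_ddr_transfers_mb * estimated_fps

# Fraction of the assumed per-instance bandwidth that the estimate would consume.
bandwidth_utilization = required_bandwidth_mb_s / assumed_bandwidth_mb_s

print(f"required DDR bandwidth: {required_bandwidth_mb_s:.2f} MB/s")
print(f"share of assumed bandwidth: {bandwidth_utilization:.1%}")
```

The product matches the reported 3340.19 MB/s, which is roughly half of the 6400 MB/s the estimator assumes per IP instance.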
Running the model on the dev kit:
./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=1 -bgr -nthreads=1
[Step 11/12] Dumping statistics report
count: 8 iterations
system duration: 191.3784 ms
IP duration: 52.7551 ms
latency: 23.4076 ms
system throughput: 41.8020 FPS
number of hardware instances: 1
number of network instances: 1
IP throughput per instance: 151.6441 FPS
IP throughput per fmax per instance: 0.7582 FPS/MHz
IP clock frequency measurement: 200.0000 MHz
estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)
estimated IP throughput per fmax per instance: 0.8931 FPS/MHz
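To relate the measured IP numbers in this report to the compiler estimate, the following Python sketch recomputes them from the raw counters. It assumes that the IP throughput is simply the iteration count divided by the IP duration, which is consistent with the figures printed above. The variable names are hypothetical.

```python
# Values copied from the nireq=1 statistics report above.
count = 8                      # iterations
ip_duration_ms = 52.7551       # IP duration
clock_mhz = 200.0              # IP clock frequency measurement
estimated_ip_fps = 178.6172    # estimated IP throughput per instance

# Assumption: IP throughput = iterations / IP duration.
ip_fps = count / (ip_duration_ms / 1000.0)
ip_fps_per_mhz = ip_fps / clock_mhz

# How much of the compiler's estimate the IP actually reaches on hardware.
measured_vs_estimated = ip_fps / estimated_ip_fps

print(f"IP throughput: {ip_fps:.2f} FPS ({ip_fps_per_mhz:.4f} FPS/MHz)")
print(f"measured / estimated: {measured_vs_estimated:.1%}")
```

By this arithmetic the IP itself reaches about 85% of the estimate, so most of the gap I am asking about lies between the IP throughput and the system throughput, not between the estimate and the IP.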
./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=4 -bgr -nthreads=4
[Step 11/12] Dumping statistics report
count: 8 iterations
system duration: 147.8426 ms
IP duration: 52.7619 ms
latency: 69.8254 ms
system throughput: 54.1116 FPS
number of hardware instances: 1
number of network instances: 1
IP throughput per instance: 151.6246 FPS
IP throughput per fmax per instance: 0.7581 FPS/MHz
IP clock frequency measurement: 200.0000 MHz
estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)
estimated IP throughput per fmax per instance: 0.8931 FPS/MHz
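To make the gap in the two runs concrete, here is a small Python sketch that derives per-frame times from the reported durations. It assumes that system throughput is the iteration count divided by the system duration, which agrees with the printed values. The names are my own, and the log does not show what the time outside the IP is spent on (for example host pre-processing, data transfer, or runtime overhead).

```python
# (system duration ms, IP duration ms) copied from the two reports above.
runs = {
    "nireq=1": (191.3784, 52.7551),
    "nireq=4": (147.8426, 52.7619),
}
count = 8  # iterations in both runs (-niter=8)

results = {}
for name, (system_ms, ip_ms) in runs.items():
    system_fps = count / (system_ms / 1000.0)   # assumed definition
    system_ms_per_frame = system_ms / count
    ip_ms_per_frame = ip_ms / count
    ip_share = ip_ms / system_ms                # fraction of wall time the IP is busy
    results[name] = (system_fps, system_ms_per_frame, ip_ms_per_frame, ip_share)
    print(f"{name}: {system_fps:.2f} FPS, "
          f"{system_ms_per_frame:.2f} ms/frame system, "
          f"{ip_ms_per_frame:.2f} ms/frame IP, "
          f"IP busy {ip_share:.1%} of the run")
```

With these numbers the IP is busy for only about 28% of the run at nireq=1 and about 36% at nireq=4. Note that both runs use only 8 iterations, so any one-time startup cost included in the system duration would weigh heavily on these figures.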