AI Suite System Throughput Issue

When running AI Suite on Agilex 5, we see a significant gap between the IP throughput and the achieved system throughput. My setup is as follows:

Hardware: Agilex™ 5 FPGA and SoC E-Series Modular Development Kit (ES silicon)

Software: Quartus Prime Pro + AI Suite 25.3.1

SD Image: agx5_soc_s2m coredla-image-agilex5_mk_a5e065bb32aes1.wic

Architecture and Bitstream: AGX5_Performance

MobileNetV2 (Open Model Zoo 2024.6.0), compiled for the AGX5_Performance architecture, gives the following results in dla_benchmark:

IP throughput per instance: ~151 FPS

Estimated throughput (200 MHz): ~178 FPS

System throughput:

nireq=1 → 41 FPS

nireq=4 → 54 FPS

Why is there such a large gap between IP throughput and system throughput, and how can we improve the system throughput? For more details, please see the appended log below, which shows the commands I ran for the benchmark.

Any pointers or help would be highly appreciated. Thanks.

=====================================================================

1. Using MobileNetV2 from Open Model Zoo

=====================================================================

Commands used to download and compile model:

git clone https://github.com/openvinotoolkit/open_model_zoo.git

cd open_model_zoo

git checkout 2024.6.0

omz_downloader --list

omz_downloader --name mobilenet-v2-pytorch --output_dir $COREDLA_WORK/demo/models/

omz_converter --name mobilenet-v2-pytorch --download_dir ../demo/models/ --output_dir ../demo/models/

cd $COREDLA_WORK/demo/models/public/mobilenet-v2-pytorch/FP32

dla_compiler --march $COREDLA_ROOT/example_architectures/AGX5_Performance.arch --network-file ./mobilenet-v2-pytorch.xml --foutput-format=open_vino_hetero --o $COREDLA_WORK/demo/mobilenet-v2-pytorch_dla.bin --batch-size=1 --fanalyze-performance --fassumed-fmax-core 200

Executing performance estimate

----------------------------------------------------------------

main_graph_0 reported throughput: 178.617 fps

TOTAL DDR SPACE REQUIRED = 16.9756 MB

DDR INPUT & OUTPUT BUFFER SIZE = 0.781738 MB

DDR CONFIG BUFFER SIZE = 0.0986328 MB

DDR FILTER BUFFER SIZE = 15.3296 MB

DDR INTERMEDIATE BUFFER SIZE = 0.765625 MB

NOTE: THIS ESTIMATE ASSUMES 1x I/O BUFFER. THE COREDLA RUNTIME DEFAULTS TO 5

TOTAL DDR TRANSFERS REQUIRED = 18.7003 MB

DDR FILTER READS REQUIRED = 16.2124 MB

DDR FEATURE READS REQUIRED = 1.62164 MB

DDR FEATURE WRITES REQUIRED = 0.767578 MB

NUMBER OF DDR FEATURE READS = 9

MINIMUM AVERAGE DDR BANDWIDTH REQUIRED = 3340.19 MB/s

ASSUMED DDR BANDWIDTH PER IP INSTANCE = 6400 MB/s

----------------------------------------------------------------

Performance Estimator Throughput Breakdown

Arch: kvec64xcvec32_i12x1_fp12agx_sb32768_xbark32_actk32_poolk4

Number of DLA instances = 1

Number of DDR Banks per DLA instance = 1

CoreDLA Target Fmax = 200 MHz

PE Target Fmax = 200 MHz

Batch Size = 1

PE-only Conv Throughput No DDR = 186 fps

PE-only Conv Throughput = 185 fps

Overall Throughput Inf PE Buf Depth (zero MPBW) = 185 fps

Overall Throughput Zero PE Buf Depth (zero MPBW) = 183 fps

Overall Throughput Inf PE Buf Depth = 184 fps

Overall Throughput Zero PE Buf Depth = 182 fps

----------------------------------------------------------------

FINAL THROUGHPUT = 178.617 fps

FINAL THROUGHPUT PER FMAX (CoreDLA) = 0.893086 fps/MHz

FINAL THROUGHPUT PER FMAX (PE) = 0.893086 fps/MHz
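As a sanity check on the estimate above, the reported DDR bandwidth figure appears to be the per-inference DDR traffic multiplied by the estimated frame rate. The short Python sketch below redoes that arithmetic with the numbers copied from the log. The variable names are mine, and the interpretation of "MINIMUM AVERAGE DDR BANDWIDTH REQUIRED" as transfers times throughput is an assumption that happens to match the reported value.

```python
# Values copied from the dla_compiler performance estimate above.
total_ddr_transfers_mb = 18.7003   # TOTAL DDR TRANSFERS REQUIRED (per inference)
estimated_fps = 178.617            # FINAL THROUGHPUT
assumed_ddr_bw_mb_s = 6400.0       # ASSUMED DDR BANDWIDTH PER IP INSTANCE
assumed_fmax_mhz = 200.0           # --fassumed-fmax-core 200

# Assumption: required bandwidth = DDR traffic per frame * frames per second.
required_bw_mb_s = total_ddr_transfers_mb * estimated_fps

# Fraction of the assumed DDR bandwidth that the estimate would consume.
bw_fraction = required_bw_mb_s / assumed_ddr_bw_mb_s

# Throughput normalized by clock frequency, as in the last lines of the log.
fps_per_mhz = estimated_fps / assumed_fmax_mhz

print(f"required DDR bandwidth: {required_bw_mb_s:.2f} MB/s")
print(f"share of assumed DDR bandwidth: {bw_fraction:.1%}")
print(f"throughput per MHz: {fps_per_mhz:.6f} fps/MHz")
```

Under these assumptions the estimate uses roughly half of the assumed DDR bandwidth, so the compiler's estimate does not look DDR-bound for this model.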

Running the model on the dev kit (nireq=1):

./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=1 -bgr -nthreads=1

[Step 11/12] Dumping statistics report

count: 8 iterations

system duration: 191.3784 ms

IP duration: 52.7551 ms

latency: 23.4076 ms

system throughput: 41.8020 FPS

number of hardware instances: 1

number of network instances: 1

IP throughput per instance: 151.6441 FPS

IP throughput per fmax per instance: 0.7582 FPS/MHz

IP clock frequency measurement: 200.0000 MHz

estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)

estimated IP throughput per fmax per instance: 0.8931 FPS/MHz

Result with nireq=4:

[Step 11/12] Dumping statistics report

count: 8 iterations

system duration: 147.8426 ms

IP duration: 52.7619 ms

latency: 69.8254 ms

system throughput: 54.1116 FPS

number of hardware instances: 1

number of network instances: 1

IP throughput per instance: 151.6246 FPS

IP throughput per fmax per instance: 0.7581 FPS/MHz

IP clock frequency measurement: 200.0000 MHz

estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)

estimated IP throughput per fmax per instance: 0.8931 FPS/MHz
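To make the gap concrete, the Python sketch below recomputes both throughput figures from the durations in the two reports and estimates how much wall-clock time per frame falls outside the IP. The helper name `summarize` is hypothetical. Treating "system duration minus IP duration" as non-IP time is my own rough attribution, not something dla_benchmark reports, and with the async API some host work may overlap IP execution.

```python
def summarize(iterations, system_ms, ip_ms):
    """Recompute throughput figures from the durations in a dla_benchmark report."""
    return {
        # Frames per second over the whole run (host + transfers + IP).
        "system_fps": iterations / (system_ms / 1000.0),
        # Frames per second counting only the time reported as IP duration.
        "ip_fps": iterations / (ip_ms / 1000.0),
        # Rough attribution: wall-clock time per frame not accounted for by the IP.
        "non_ip_ms_per_frame": (system_ms - ip_ms) / iterations,
        # Share of the run during which the IP was busy.
        "ip_share": ip_ms / system_ms,
    }

# Durations copied from the two statistics reports above (8 iterations each).
runs = {
    "nireq=1": summarize(8, 191.3784, 52.7551),
    "nireq=4": summarize(8, 147.8426, 52.7619),
}

for label, r in runs.items():
    print(
        f"{label}: system {r['system_fps']:.2f} FPS, IP {r['ip_fps']:.2f} FPS, "
        f"non-IP {r['non_ip_ms_per_frame']:.2f} ms/frame, IP busy {r['ip_share']:.0%}"
    )
```

Both runs reproduce the reported system and IP throughput, which suggests that the IP duration is the total over all 8 iterations. By this rough measure the IP is busy for well under half of each run, and with only 8 iterations any fixed startup cost is spread over very few frames.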
