
OmerNaeem
13 hours ago

AI Suite System Throughput Issue

With AI Suite on Agilex 5, we are seeing a significant gap between the IP throughput and the achieved system throughput. My setup is as follows:

 

Hardware: Agilex™ 5 FPGA and SoC E-Series Modular Development Kit (ES silicon)

Software: Quartus Prime Pro + AI Suite 25.3.1

SD Image: agx5_soc_s2m coredla-image-agilex5_mk_a5e065bb32aes1.wic

Architecture and Bitstream: AGX5_Performance

 

MobileNetV2 (Open Model Zoo 2024.6.0), compiled for the AGX5_Performance architecture, gives the following results in dla_benchmark:

 

IP throughput per instance: ~151 FPS

Estimated throughput (200 MHz): ~178 FPS

 

System throughput:

nireq=1 → ~41.8 FPS

nireq=4 → ~54.1 FPS

 

Why is there such a large delta between IP throughput and system throughput, and how can we improve the system throughput? For more details, please see the log appended below, which shows the commands I ran for the benchmark.

Any pointers or help would be highly appreciated. Thanks.

 

 

 

 

=====================================================================

1. Using mobilenet v2 from model zoo

=====================================================================

Commands used to download and compile model:

 

git clone https://github.com/openvinotoolkit/open_model_zoo.git

cd open_model_zoo

git checkout 2024.6.0

omz_downloader --list

omz_downloader --name mobilenet-v2-pytorch --output_dir $COREDLA_WORK/demo/models/

omz_converter --name mobilenet-v2-pytorch --download_dir ../demo/models/ --output_dir ../demo/models/

 

cd $COREDLA_WORK/demo/models/public/mobilenet-v2-pytorch/FP32

dla_compiler --march $COREDLA_ROOT/example_architectures/AGX5_Performance.arch --network-file ./mobilenet-v2-pytorch.xml --foutput-format=open_vino_hetero --o $COREDLA_WORK/demo/mobilenet-v2-pytorch_dla.bin --batch-size=1 --fanalyze-performance --fassumed-fmax-core 200

 

Executing performance estimate

----------------------------------------------------------------

main_graph_0 reported throughput: 178.617 fps

TOTAL DDR SPACE REQUIRED = 16.9756 MB

      DDR INPUT & OUTPUT BUFFER SIZE = 0.781738 MB

      DDR CONFIG BUFFER SIZE = 0.0986328 MB

      DDR FILTER BUFFER SIZE = 15.3296 MB

      DDR INTERMEDIATE BUFFER SIZE = 0.765625 MB

NOTE: THIS ESTIMATE ASSUMES 1x I/O BUFFER. THE COREDLA RUNTIME DEFAULTS TO 5

TOTAL DDR TRANSFERS REQUIRED = 18.7003 MB

      DDR FILTER READS REQUIRED   = 16.2124 MB

      DDR FEATURE READS REQUIRED  = 1.62164 MB

      DDR FEATURE WRITES REQUIRED = 0.767578 MB

NUMBER OF DDR FEATURE READS = 9

MINIMUM AVERAGE DDR BANDWIDTH REQUIRED = 3340.19 MB/s

ASSUMED DDR BANDWIDTH PER IP INSTANCE = 6400 MB/s

----------------------------------------------------------------

Performance Estimator Throughput Breakdown

Arch: kvec64xcvec32_i12x1_fp12agx_sb32768_xbark32_actk32_poolk4

Number of DLA instances                          = 1

Number of DDR Banks per DLA instance             = 1

CoreDLA Target Fmax                              = 200 MHz

PE Target Fmax                                   = 200 MHz

Batch Size                                       = 1

PE-only Conv Throughput No DDR                   = 186 fps

PE-only Conv Throughput                          = 185 fps

Overall Throughput Inf PE Buf Depth (zero MPBW)  = 185 fps

Overall Throughput Zero PE Buf Depth (zero MPBW) = 183 fps

Overall Throughput Inf PE Buf Depth              = 184 fps

Overall Throughput Zero PE Buf Depth             = 182 fps

----------------------------------------------------------------

FINAL THROUGHPUT = 178.617 fps

FINAL THROUGHPUT PER FMAX (CoreDLA) = 0.893086 fps/MHz

FINAL THROUGHPUT PER FMAX (PE)      = 0.893086 fps/MHz
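As a sanity check on the estimate above, the reported bandwidth and fps/MHz figures appear to follow directly from the other values in the log. The short Python sketch below recomputes them. The variable names are mine, and the relationship "required bandwidth = DDR transfers per inference × throughput" is my assumption about how the estimator derives the number; it happens to match the log to within rounding.

```python
# Values copied from the dla_compiler performance estimate above.
total_ddr_transfers_mb = 18.7003   # TOTAL DDR TRANSFERS REQUIRED (per inference)
final_throughput_fps = 178.617     # FINAL THROUGHPUT
assumed_fmax_mhz = 200.0           # --fassumed-fmax-core 200
assumed_ddr_bw_mb_s = 6400.0       # ASSUMED DDR BANDWIDTH PER IP INSTANCE

# Assumption: average bandwidth = data moved per inference * inferences per second.
required_bw_mb_s = total_ddr_transfers_mb * final_throughput_fps

# Throughput normalised by clock frequency, as printed in the log.
fps_per_mhz = final_throughput_fps / assumed_fmax_mhz

# Fraction of the assumed DDR bandwidth that the estimate would consume.
bw_utilisation = required_bw_mb_s / assumed_ddr_bw_mb_s

print(f"required DDR bandwidth: {required_bw_mb_s:.2f} MB/s")
print(f"throughput per fmax:    {fps_per_mhz:.6f} fps/MHz")
print(f"DDR bandwidth used:     {bw_utilisation:.1%}")
```

If this reading is right, the estimate needs roughly half of the assumed 6400 MB/s, so the estimator itself does not consider this network DDR-bound under its assumptions.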

 

Running the model on dev kit:

 

./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=1 -bgr -nthreads=1

 

[Step 11/12] Dumping statistics report

count:             8 iterations

system duration:   191.3784 ms

IP duration:       52.7551 ms

latency:           23.4076 ms

system throughput: 41.8020 FPS

number of hardware instances: 1

number of network instances: 1

IP throughput per instance: 151.6441 FPS

IP throughput per fmax per instance: 0.7582 FPS/MHz

IP clock frequency measurement: 200.0000 MHz

estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)

estimated IP throughput per fmax per instance: 0.8931 FPS/MHz
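For reference, both throughput figures in this report can be reproduced from the iteration count and the two durations. The Python sketch below does that for the nireq=1 run. The helper name `fps` is hypothetical, and treating IP duration as a subset of system duration is my assumption about how dla_benchmark accounts for time.

```python
def fps(iterations: int, duration_ms: float) -> float:
    """Frames per second for a given iteration count and duration in ms."""
    return iterations / (duration_ms / 1000.0)

# Values copied from the nireq=1 statistics report above.
count = 8
system_duration_ms = 191.3784
ip_duration_ms = 52.7551

system_fps = fps(count, system_duration_ms)   # matches "system throughput"
ip_fps = fps(count, ip_duration_ms)           # matches "IP throughput per instance"

# Assumption: IP time is contained within system time, so this ratio is the
# share of wall-clock time during which the IP was actually busy.
ip_share = ip_duration_ms / system_duration_ms

print(f"system throughput: {system_fps:.4f} FPS")
print(f"IP throughput:     {ip_fps:.4f} FPS")
print(f"IP busy share:     {ip_share:.1%}")
```

Under that assumption the IP is busy for a bit more than a quarter of the measured run, which is the gap I am asking about.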

 

./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=4 -bgr -nthreads=4

 

[Step 11/12] Dumping statistics report

count:             8 iterations

system duration:   147.8426 ms

IP duration:       52.7619 ms

latency:           69.8254 ms

system throughput: 54.1116 FPS

number of hardware instances: 1

number of network instances: 1

IP throughput per instance: 151.6246 FPS

IP throughput per fmax per instance: 0.7581 FPS/MHz

IP clock frequency measurement: 200.0000 MHz

estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)

estimated IP throughput per fmax per instance: 0.8931 FPS/MHz
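To compare the two runs side by side, the Python sketch below computes the time per frame spent outside the IP for nireq=1 and nireq=4. The "non-IP time" is a plain subtraction of IP duration from system duration, which is my assumption about how the two numbers relate and not something the tool reports; the dictionary layout and names are only for illustration.

```python
# Values copied from the two statistics reports above (8 iterations each).
runs = {
    1: {"system_ms": 191.3784, "ip_ms": 52.7551},   # nireq=1
    4: {"system_ms": 147.8426, "ip_ms": 52.7619},   # nireq=4
}
count = 8

summary = {}
for nireq, r in runs.items():
    system_fps = count / (r["system_ms"] / 1000.0)
    # Assumption: everything that is not IP time is host/runtime overhead.
    non_ip_ms_per_frame = (r["system_ms"] - r["ip_ms"]) / count
    summary[nireq] = {"system_fps": system_fps,
                      "non_ip_ms_per_frame": non_ip_ms_per_frame}
    print(f"nireq={nireq}: {system_fps:.2f} FPS, "
          f"{non_ip_ms_per_frame:.2f} ms/frame outside the IP")

# Relative gain in system throughput when going from nireq=1 to nireq=4.
speedup = summary[4]["system_fps"] / summary[1]["system_fps"]
print(f"speedup from nireq=1 to nireq=4: {speedup:.2f}x")
```

By this rough accounting, raising nireq hides some of the non-IP time per frame but most of it remains, while the IP duration is essentially unchanged between the two runs. Note that both runs use only 8 iterations, so these per-frame figures are averages over a very short measurement.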

 
