Forum Discussion

OmerNaeem (New Contributor)
5 days ago

AI Suite System Throughput Issue

When using AI Suite, we are seeing a significant gap between IP throughput and achieved system throughput on Agilex 5. I am using the following:

 

Hardware: Agilex™ 5 FPGA and SoC E-Series Modular Development Kit (ES silicon)

Software: Quartus Prime Pro + AI Suite 25.3.1

SD Image: agx5_soc_s2m coredla-image-agilex5_mk_a5e065bb32aes1.wic

Architecture and Bitstream: AGX5_Performance

 

Using MobileNetV2 (Open Model Zoo 2024.6.0) compiled with the AGX5_Performance architecture gives the following results through dla_benchmark:

 

IP throughput per instance: ~151 FPS

Estimated throughput (200 MHz): ~178 FPS

 

System throughput:

nireq=1 → 41 FPS

nireq=4 → 54 FPS

 

Why is there such a big delta between IP throughput and system throughput, and how can we improve the system throughput? For more details, please see the appended log showing the commands that I ran for the benchmark.

Any pointers or help would be highly appreciated. Thanks

=====================================================================

1. Using MobileNet v2 from the Open Model Zoo

=====================================================================

Commands used to download and compile model:

 

git clone https://github.com/openvinotoolkit/open_model_zoo.git

cd open_model_zoo

git checkout 2024.6.0

omz_downloader --list

omz_downloader --name mobilenet-v2-pytorch --output_dir $COREDLA_WORK/demo/models/

omz_converter --name mobilenet-v2-pytorch --download_dir ../demo/models/ --output_dir ../demo/models/

 

cd $COREDLA_WORK/demo/models/public/mobilenet-v2-pytorch/FP32

dla_compiler --march $COREDLA_ROOT/example_architectures/AGX5_Performance.arch --network-file ./mobilenet-v2-pytorch.xml --foutput-format=open_vino_hetero --o $COREDLA_WORK/demo/mobilenet-v2-pytorch_dla.bin --batch-size=1 --fanalyze-performance --fassumed-fmax-core 200

 

Executing performance estimate

----------------------------------------------------------------

main_graph_0 reported throughput: 178.617 fps

TOTAL DDR SPACE REQUIRED = 16.9756 MB

      DDR INPUT & OUTPUT BUFFER SIZE = 0.781738 MB

      DDR CONFIG BUFFER SIZE = 0.0986328 MB

      DDR FILTER BUFFER SIZE = 15.3296 MB

      DDR INTERMEDIATE BUFFER SIZE = 0.765625 MB

NOTE: THIS ESTIMATE ASSUMES 1x I/O BUFFER. THE COREDLA RUNTIME DEFAULTS TO 5

TOTAL DDR TRANSFERS REQUIRED = 18.7003 MB

      DDR FILTER READS REQUIRED   = 16.2124 MB

      DDR FEATURE READS REQUIRED  = 1.62164 MB

      DDR FEATURE WRITES REQUIRED = 0.767578 MB

NUMBER OF DDR FEATURE READS = 9

MINIMUM AVERAGE DDR BANDWIDTH REQUIRED = 3340.19 MB/s

ASSUMED DDR BANDWIDTH PER IP INSTANCE = 6400 MB/s

----------------------------------------------------------------

Performance Estimator Throughput Breakdown

Arch: kvec64xcvec32_i12x1_fp12agx_sb32768_xbark32_actk32_poolk4

Number of DLA instances                          = 1

Number of DDR Banks per DLA instance             = 1

CoreDLA Target Fmax                              = 200 MHz

PE Target Fmax                                   = 200 MHz

Batch Size                                       = 1

PE-only Conv Throughput No DDR                   = 186 fps

PE-only Conv Throughput                          = 185 fps

Overall Throughput Inf PE Buf Depth (zero MPBW)  = 185 fps

Overall Throughput Zero PE Buf Depth (zero MPBW) = 183 fps

Overall Throughput Inf PE Buf Depth              = 184 fps

Overall Throughput Zero PE Buf Depth             = 182 fps

----------------------------------------------------------------

FINAL THROUGHPUT = 178.617 fps

FINAL THROUGHPUT PER FMAX (CoreDLA) = 0.893086 fps/MHz

FINAL THROUGHPUT PER FMAX (PE)      = 0.893086 fps/MHz
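As a cross-check, the estimator's minimum DDR bandwidth figure above is simply total DDR transfers per inference multiplied by the estimated throughput:

```shell
# Sanity check (assumes min bandwidth = transfers-per-inference x fps):
# 18.7003 MB/inference * 178.617 fps should reproduce the 3340.19 MB/s line above.
awk 'BEGIN { printf "min avg DDR bandwidth: %.2f MB/s\n", 18.7003 * 178.617 }'
```

This stays well under the assumed 6400 MB/s per IP instance, so the estimator does not expect DDR bandwidth to be the limiter at 200 MHz.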

 

Running the model on dev kit:

 

./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=1 -bgr -nthreads=1

 

[Step 11/12] Dumping statistics report

count:             8 iterations

system duration:   191.3784 ms

IP duration:       52.7551 ms

latency:           23.4076 ms

system throughput: 41.8020 FPS

number of hardware instances: 1

number of network instances: 1

IP throughput per instance: 151.6441 FPS

IP throughput per fmax per instance: 0.7582 FPS/MHz

IP clock frequency measurement: 200.0000 MHz

estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)

estimated IP throughput per fmax per instance: 0.8931 FPS/MHz

 

./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=4 -bgr -nthreads=4

 

[Step 11/12] Dumping statistics report

count:             8 iterations

system duration:   147.8426 ms

IP duration:       52.7619 ms

latency:           69.8254 ms

system throughput: 54.1116 FPS

number of hardware instances: 1

number of network instances: 1

IP throughput per instance: 151.6246 FPS

IP throughput per fmax per instance: 0.7581 FPS/MHz

IP clock frequency measurement: 200.0000 MHz

estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)

estimated IP throughput per fmax per instance: 0.8931 FPS/MHz
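The gap in the two reports above can be quantified as per-frame host-side overhead, (system duration - IP duration) / iterations. A quick sketch using the logged numbers (assuming the two durations are comparable wall-clock figures):

```shell
# Host-side overhead per frame implied by the dla_benchmark reports above.
awk 'BEGIN {
  printf "nireq=1: %.2f ms/frame host overhead\n", (191.3784 - 52.7551)/8
  printf "nireq=4: %.2f ms/frame host overhead\n", (147.8426 - 52.7619)/8
}'
```

Going from nireq=1 to nireq=4 hides some of this overhead through pipelining, but roughly 12 ms of non-IP time per frame remains, which caps system throughput well below the ~151 FPS the IP itself sustains.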

 

4 Replies

  • OmerNaeem (New Contributor)

    Hi John,

    Thanks for the reply. I totally missed that the Agilex 7 benchmarks were on the PCIe-based design. I also noticed that the M2M example command logs shared in the AI Suite handbook show a similar delta: for ResNet-50 on the Agilex 7 FPGA I-Series Transceiver-SoC Development Kit, they show a system throughput of 27 fps versus an IP throughput of 123 fps. So you are right that dla_benchmark is not a suitable way to measure total system throughput.

    I will inspect and try the 4K AI camera example you shared and get back to you with my plan for benchmarking the performance of the AI Suite.

    Thanks

     

  • JohnT_Altera (Regular Contributor)

    Hi,

     

    1. Are the handbook AGX7 IP+Host numbers measured using the S2M streaming architecture or M2M?
    The Agilex 7 performance is based on PCIe + FPGA AI Suite IP benchmarking. That benchmarking differs from the HPS method because the host CPU offers higher performance than the HPS processor. The S2M implementation relies on the Nios V to offload tasks from the HPS.

     

    2. Is there a recommended method to measure true end-to-end throughput on AGX5?
    Do you need the full system throughput? If yes, what type of implementation are you looking at? The current implementation might not be suitable to showcase the throughput of the full system.

     

    3. Are there any known bottlenecks in the Agilex 5 SoC Example Design S2M bitstream and SD card image?
    The current S2M implementation emulates a data source by copying the data to a buffer before it is streamed into the FPGA AI Suite IP. To obtain the real throughput of the S2M implementation, the streaming data needs to be passed directly to the FPGA AI Suite IP.

     

     

    You may refer to https://altera-fpga.github.io/rel-25.3.1/embedded-designs/agilex-5/e-series/modular/camera/camera_4k_ai/camera_4k_ai/, which implements direct data input to the FPGA without using the Arm processor to send the data to the streaming buffer.

     

    Thanks

  • OmerNaeem (New Contributor)

    Hi John, thanks for the suggestion. I tried the streaming demonstration example. While it shows functional end-to-end operation, it does not provide explicit performance benchmarking similar to dla_benchmark, so it is difficult to quantify full system throughput with this example.

    My main concern is the large delta between IP throughput and achieved system throughput on the AGX5E Modular Dev Kit. Page 28 of the FPGA AI Suite Handbook shows the following results for MobileNetV2:

    AGX7_FP16_Performance: IP throughput 381 fps, IP+Host throughput 371 fps

    AGX7_Performance: IP throughput 327 fps, IP+Host throughput 269 fps

    The delta between IP throughput and system throughput is quite small compared to what I am getting.
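    On a common scale, the per-frame serialized overhead (1000/system FPS minus 1000/IP FPS, ignoring pipelining) makes the difference stark. A quick sketch comparing the handbook AGX7_Performance numbers with my nireq=1 run:

```shell
# Per-frame overhead in ms: 1000/system_fps - 1000/ip_fps (rough, ignores pipelining).
awk 'BEGIN {
  printf "AGX7_Performance (handbook): %.2f ms/frame\n", 1000/269 - 1000/327
  printf "AGX5 nireq=1 (my run):       %.2f ms/frame\n", 1000/41.8020 - 1000/151.6441
}'
```

    Sub-millisecond overhead on the AGX7 PCIe design versus roughly 17 ms per frame on my AGX5 setup.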

     

    I understand that in applications where the input originates from an Avalon-ST source (for example, an HDMI IP), the S2M architecture could potentially achieve better throughput by avoiding HPS memory transfers. However, in my setup the input images reside in HPS DDR memory. Even in streaming mode, the data would still originate from HPS memory and ultimately be written to EMIF before DLA processing.

    So in M2M mode, as used by dla_benchmark, the mSGDMA copies the input from HPS memory to EMIF, but in S2M mode the path would be HPS -> mSGDMA MM-to-Avalon-ST -> Layout Transform -> mSGDMA Avalon-ST-to-MM -> EMIF.

    Given this, I would expect streaming mode to incur equal or higher overhead in my scenario.

    It brings me to these questions:

    1. Are the handbook AGX7 IP+Host numbers measured using the S2M streaming architecture or M2M?
    2. Is there a recommended method to measure true end-to-end throughput on AGX5?
    3. Are there any known bottlenecks in the Agilex 5 SoC Example Design S2M bitstream and SD card image?

    Any guidance on how to profile where the system bottleneck lies (DDR bandwidth, DMA latency, HPS overhead) would be greatly appreciated.

    Thanks.

  • JohnT_Altera (Regular Contributor)

    Hi,

     

    The dla_benchmark tool is meant to measure the throughput of the IP. It is not suitable for measuring full system throughput because it includes the data transfer from the Arm processor to the DMA buffer before the data is sent to the AI Suite IP.

    You will need to run the applications below, which form the full system application example:

    streaming_inference_app: loads and runs a network and captures the results.

    image_streaming_app: loads bitmap files from a folder on the SD card and continuously sends the images to the EMIF, simulating a running video source.

    Thanks.