Hi John, thanks for the suggestion. I tried the streaming demonstration example. While it shows functional end-to-end operation, it does not provide explicit performance benchmarking comparable to dla_benchmark, so it is difficult to quantify full system throughput with it.
My main concern is the large delta between IP throughput and achieved system throughput on the AGX5E Modular Dev Kit. Page 28 of the FPGA AI Suite Handbook shows the following results for MobileNet v2:
AGX7_FP16_Performance: IP throughput 381 fps, IP+Host throughput 371 fps
AGX7_Performance: IP throughput 327 fps, IP+Host throughput 269 fps
In both cases the delta between IP throughput and system throughput is quite small compared to what I am getting.
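To put numbers on that gap, here is a minimal sketch that computes the relative drop from IP throughput to IP+Host throughput for the two handbook rows quoted above. The function name `host_overhead_pct` is my own and is not part of the FPGA AI Suite.

```python
def host_overhead_pct(ip_fps: float, system_fps: float) -> float:
    """Percentage of IP throughput lost once host overhead is included."""
    return (ip_fps - system_fps) / ip_fps * 100.0


# (IP throughput, IP+Host throughput) in fps, as quoted from the handbook.
handbook = {
    "AGX7_FP16_Performance": (381.0, 371.0),
    "AGX7_Performance": (327.0, 269.0),
}

overheads = {
    name: host_overhead_pct(ip_fps, system_fps)
    for name, (ip_fps, system_fps) in handbook.items()
}

for name, pct in overheads.items():
    print(f"{name}: {pct:.1f}% below IP throughput")
# AGX7_FP16_Performance: 2.6% below IP throughput
# AGX7_Performance: 17.7% below IP throughput
```

The same function can be applied to the IP and system fps that dla_benchmark reports on the AGX5E board to compare against these reference values.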
I understand that in applications where the input originates from an Avalon-ST source (for example HDMI IP), the S2M architecture could potentially achieve better throughput by avoiding HPS memory transfers. However, in my setup the input images reside in HPS DDR memory. Even in streaming mode, the data would still originate from HPS memory and ultimately be written to EMIF before DLA processing.
So in M2M mode, as in dla_benchmark, the mSGDMA copies the input from HPS memory to EMIF, whereas in S2M mode the path would be HPS -> mSGDMA (MM to Avalon-ST) -> Layout Transform -> mSGDMA (Avalon-ST to MM) -> EMIF.
Given this, I would expect streaming mode to incur equal or higher overhead in my scenario.
This brings me to the following questions:
- Are the handbook AGX7 IP+Host numbers measured using the S2M streaming architecture or M2M?
- Is there a recommended method to measure true end-to-end throughput on AGX5?
- Are there any known bottlenecks in the Agilex 5 SoC Example Design S2M bitstream and SD card image?
Any guidance on how to profile where the system bottleneck lies (DDR bandwidth, DMA latency, HPS overhead) would be greatly appreciated.
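For context, the kind of per-stage breakdown I have in mind looks roughly like the sketch below. It is only an illustration: the stage callables (`input_copy`, `inference`, `output_copy`) are hypothetical placeholders simulated with sleeps, not FPGA AI Suite runtime calls, and the helper `profile_stages` is my own. The stages run strictly one after another, so the resulting fps is a non-pipelined figure and is mainly useful for seeing which stage dominates.

```python
import time
from collections import defaultdict


def profile_stages(stages, num_frames, warmup=2):
    """Time each (name, callable) stage per frame and return (fps, totals).

    Warm-up iterations are executed but excluded from the totals.
    """
    totals = defaultdict(float)
    for i in range(warmup + num_frames):
        for name, fn in stages:
            start = time.perf_counter()
            fn()
            elapsed = time.perf_counter() - start
            if i >= warmup:
                totals[name] += elapsed
    total_time = sum(totals.values())
    fps = num_frames / total_time if total_time > 0 else float("inf")
    return fps, dict(totals)


# Hypothetical placeholder stages; replace with the real calls being measured.
stages = [
    ("input_copy", lambda: time.sleep(0.005)),   # stand-in for the HPS -> EMIF transfer
    ("inference", lambda: None),                 # stand-in for the DLA run
    ("output_copy", lambda: None),               # stand-in for reading results back
]

fps, totals = profile_stages(stages, num_frames=20)
bottleneck = max(totals, key=totals.get)

print(f"sequential throughput: {fps:.1f} fps")
for name, seconds in totals.items():
    print(f"  {name}: {seconds * 1e3:.2f} ms total")
print(f"dominant stage: {bottleneck}")
```

If there is an equivalent breakdown already available from the runtime or from hardware counters, a pointer to it would be ideal.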
Thanks.