AI Suite System Throughput Issue
When using AI Suite, we are seeing a significant gap between IP throughput and achieved system throughput on Agilex 5. I am using the following:

Hardware: Agilex™ 5 FPGA and SoC E-Series Modular Development Kit (ES silicon)
Software: Quartus Prime Pro + AI Suite 25.3.1
SD Image: agx5_soc_s2m coredla-image-agilex5_mk_a5e065bb32aes1.wic
Architecture and Bitstream: AGX5_Performance

Using MobileNetV2 (Open Model Zoo 2024.6.0) compiled with the AGX5_Performance architecture gives the following results through dla_benchmark:

IP throughput per instance: ~151 FPS
Estimated throughput (200 MHz): ~178 FPS
System throughput: nireq=1 → 41 FPS, nireq=4 → 54 FPS

Why is there such a big delta between IP performance and system throughput, and how can we improve the system throughput? For more details, please see the appended log showing the commands I ran for the benchmark. Any pointers or help would be highly appreciated. Thanks!

=====================================================================
1. Using mobilenet v2 from model zoo
=====================================================================

Commands used to download and compile the model:

  git clone https://github.com/openvinotoolkit/open_model_zoo.git
  cd open_model_zoo
  git checkout 2024.6.0
  omz_downloader --list
  omz_downloader --name mobilenet-v2-pytorch --output_dir $COREDLA_WORK/demo/models/
  omz_converter --name mobilenet-v2-pytorch --download_dir ../demo/models/ --output_dir ../demo/models/
  cd $COREDLA_WORK/demo/models/public/mobilenet-v2-pytorch/FP32
  dla_compiler --march $COREDLA_ROOT/example_architectures/AGX5_Performance.arch --network-file ./mobilenet-v2-pytorch.xml --foutput-format=open_vino_hetero --o $COREDLA_WORK/demo/mobilenet-v2-pytorch_dla.bin --batch-size=1 --fanalyze-performance --fassumed-fmax-core 200

Executing performance estimate
----------------------------------------------------------------
main_graph_0 reported throughput: 178.617 fps
TOTAL DDR SPACE REQUIRED = 16.9756 MB
DDR INPUT & OUTPUT BUFFER SIZE = 0.781738 MB
DDR CONFIG BUFFER SIZE = 0.0986328 MB
DDR FILTER BUFFER SIZE = 15.3296 MB
DDR INTERMEDIATE BUFFER SIZE = 0.765625 MB
NOTE: THIS ESTIMATE ASSUMES 1x I/O BUFFER. THE COREDLA RUNTIME DEFAULTS TO 5
TOTAL DDR TRANSFERS REQUIRED = 18.7003 MB
DDR FILTER READS REQUIRED = 16.2124 MB
DDR FEATURE READS REQUIRED = 1.62164 MB
DDR FEATURE WRITES REQUIRED = 0.767578 MB
NUMBER OF DDR FEATURE READS = 9
MINIMUM AVERAGE DDR BANDWIDTH REQUIRED = 3340.19 MB/s
ASSUMED DDR BANDWIDTH PER IP INSTANCE = 6400 MB/s
----------------------------------------------------------------
Performance Estimator Throughput Breakdown
Arch: kvec64xcvec32_i12x1_fp12agx_sb32768_xbark32_actk32_poolk4
Number of DLA instances = 1
Number of DDR Banks per DLA instance = 1
CoreDLA Target Fmax = 200 MHz
PE Target Fmax = 200 MHz
Batch Size = 1
PE-only Conv Throughput No DDR = 186 fps
PE-only Conv Throughput = 185 fps
Overall Throughput Inf PE Buf Depth (zero MPBW) = 185 fps
Overall Throughput Zero PE Buf Depth (zero MPBW) = 183 fps
Overall Throughput Inf PE Buf Depth = 184 fps
Overall Throughput Zero PE Buf Depth = 182 fps
----------------------------------------------------------------
FINAL THROUGHPUT = 178.617 fps
FINAL THROUGHPUT PER FMAX (CoreDLA) = 0.893086 fps/MHz
FINAL THROUGHPUT PER FMAX (PE) = 0.893086 fps/MHz

Running the model on the dev kit:

  ./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=1 -bgr -nthreads=1

[Step 11/12] Dumping statistics report
count: 8 iterations
system duration: 191.3784 ms
IP duration: 52.7551 ms
latency: 23.4076 ms
system throughput: 41.8020 FPS
number of hardware instances: 1
number of network instances: 1
IP throughput per instance: 151.6441 FPS
IP throughput per fmax per instance: 0.7582 FPS/MHz
IP clock frequency measurement: 200.0000 MHz
estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)
estimated IP throughput per fmax per instance: 0.8931 FPS/MHz

  ./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=4 -bgr -nthreads=4

[Step 11/12] Dumping statistics report
count: 8 iterations
system duration: 147.8426 ms
IP duration: 52.7619 ms
latency: 69.8254 ms
system throughput: 54.1116 FPS
number of hardware instances: 1
number of network instances: 1
IP throughput per instance: 151.6246 FPS
IP throughput per fmax per instance: 0.7581 FPS/MHz
IP clock frequency measurement: 200.0000 MHz
estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)
estimated IP throughput per fmax per instance: 0.8931 FPS/MHz
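Addendum: cross-checking the reported numbers with a few lines of arithmetic (all values taken straight from the logs above) shows where the delta comes from; nearly all of the extra time per frame is spent outside the IP:

```python
# Back-of-the-envelope cross-check of the nireq=1 dla_benchmark statistics
# and the dla_compiler estimate. Every input value comes from the logs above.

iterations = 8
system_duration_s = 191.3784e-3   # total wall-clock time for all requests
ip_duration_s     = 52.7551e-3    # time the DLA IP itself was busy

# Throughputs as dla_benchmark reports them
ip_fps     = iterations / ip_duration_s       # ~151.6 FPS
system_fps = iterations / system_duration_s   # ~41.8 FPS

# The gap is time spent outside the IP: ARM-side pre/post-processing,
# data movement, and driver overhead, fully serialized because nireq=1.
overhead_per_frame_ms = (system_duration_s - ip_duration_s) / iterations * 1e3

# Estimator side: DDR traffic per inference times estimated FPS reproduces
# the "minimum average DDR bandwidth" line of the performance estimate.
ddr_per_inference_mb = 18.7003
estimated_fps = 178.617
min_ddr_bw_mb_s = ddr_per_inference_mb * estimated_fps  # ~3340 MB/s

print(f"IP throughput:      {ip_fps:.1f} FPS")
print(f"System throughput:  {system_fps:.1f} FPS")
print(f"Non-IP overhead:    {overhead_per_frame_ms:.1f} ms/frame")
print(f"Implied min DDR BW: {min_ddr_bw_mb_s:.1f} MB/s")
```

At nireq=1, roughly 17 ms of host-side work per frame sits on top of about 6.6 ms of IP time, so the measurement is dominated by serialized ARM-side processing. Increasing -nireq and -nthreads so that host work overlaps inference, and running far more than 8 iterations for a stable measurement, is the usual first step toward closing the gap.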
See the Next Wave of EW & Radar Technology
We’re gearing up for AOC 2025! From December 9–11, we’ll be at the Gaylord National Resort & Convention Center in National Harbor, Maryland for AOC 2025, one of North America’s premier events dedicated to electronic warfare and radar. Visit us at booth #505 to discover the latest innovations in our Agilex™ 9 Direct RF and Agilex™ 5 product families.

What to Expect at Altera’s Booth #505:

1. Wideband and Agility Demo using Agilex 9
   Overview: Discover the power of frequency hopping with Altera’s Direct RF FPGA, enhancing system resilience and adaptability.
   Key Features: Demonstrates swift frequency changes and wideband monitoring.

2. Wideband Channelizer Demo using Agilex 9
   Overview: The Wideband Channelizer features a polyphase filter and 65-phase FFT blocks with variable channel support.
   Key Features: Demonstrates a sampling rate of 64 GSPS with 32 GHz instantaneous bandwidth.

3. Direction of Arrival Demo using Agilex 5
   Overview: Explore Direction of Arrival estimation and signal detection using an AI-based approach with deployed neural networks.
   Key Features: Demonstrates a neural network implementation using the DSP Builder Advanced Blockset (DSPBA), showcasing end-to-end operation with real-time inference.

4. Altera COTS Partner Showcase
   Come see Agilex-based COTS boards from partners including Annapolis Microsystems, CAES, Hitek, iWave Global, Mercury Systems, & Spectrum Controls.

We are hosting customer meetings at the event; contact your local Altera salesperson to schedule a slot.
Creating PCB based on 10M08 Evaluation Board but with other MAX10 FPGA

Hello everyone, for a school project I want to design a PCB around the MAX 10 FPGA. To make my life easier, I am using this Intel evaluation board (https://www.intel.com/content/www/us/en/products/details/fpga/development-kits/max/10m08-evaluation-kit.html) as a starting point. The FPGA used in that design is the 10M08SAE144C8G. However, it has only 8,000 LEs, which will not be enough, so I'm planning to use the 10M16SAE144C8G as a (hopefully) drop-in replacement. I think this will work, but is there a reason it shouldn't? All it really is is a PCB that uses the same layout, but with more of the available GPIOs broken out and preferably the mentioned 10M16SAE144C8G (or even the 10M25SAE144C8G). I checked the pinout of both FPGAs and they are the same, and as far as I know programming should also be identical. Thanks for reading!
Writing and Reading MAX10 UFM

Hi, I have the NEEK dev kit and I built a project to write and read the MAX 10 UFM. In Signal Tap I can see that the project writes and reads the data successfully. I programmed the board with the POF so that my firmware is inside the device, but when I turn the board off and on again and read the data from the UFM, the data is all zeros, meaning the data didn't get saved, even though the UFM is non-volatile memory. I am using the On-Chip Flash IP and I expect the data to be retained in the UFM. When I program the POF and perform the write and then the read while the board is powered, everything is fine: the data gets written and read back. The problem starts when I power the board down. Do I need to do something special to commit the data to the UFM so that I can read it back after power-up?
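For context, the write path the On-Chip Flash IP expects involves more than presenting data: sector write protection must be cleared via the CSR, the sector erased, and the status register polled before re-enabling protection. Below is a minimal runnable sketch of that sequence in Python, with a plain object standing in for the Avalon-MM CSR and data interfaces. The register offsets and bit positions here are assumptions for illustration only and must be checked against the On-Chip Flash IP user guide for your generated variant.

```python
# Hedged sketch of a UFM write/commit sequence for the On-Chip Flash IP.
# The offsets and bit fields below are ASSUMPTIONS for illustration;
# verify them against the On-Chip Flash IP user guide for your variant.
# A mock object emulates the memory-mapped interfaces so this runs anywhere;
# real code would use memory-mapped I/O from the Nios/host instead.

STATUS_REG  = 0x00      # assumed: bits [1:0] == 0 when the IP is idle
CONTROL_REG = 0x04      # assumed: per-sector write-protect bits
WP_SECTOR1  = 1 << 23   # assumed write-protect bit for the target UFM sector

class MockFlash:
    """Stands in for the IP's CSR + data slaves."""
    def __init__(self):
        self.csr = {STATUS_REG: 0x0, CONTROL_REG: 0xFFFFFFFF}
        self.ufm = {}

    def write_csr(self, off, val):
        self.csr[off] = val

    def read_csr(self, off):
        return self.csr[off]

    def write_data(self, addr, word):
        # The mock only accepts the write when protection is cleared,
        # mimicking a silently dropped write on protected flash.
        if (self.csr[CONTROL_REG] & WP_SECTOR1) == 0:
            self.ufm[addr] = word

def commit_word(flash, addr, word):
    # 1) clear write protection for the target sector
    flash.write_csr(CONTROL_REG, flash.read_csr(CONTROL_REG) & ~WP_SECTOR1)
    # 2) (real flow: issue a sector/page erase here and poll until idle)
    # 3) write the word through the data interface
    flash.write_data(addr, word)
    # 4) poll the status register until the IP reports idle
    while flash.read_csr(STATUS_REG) & 0x3:
        pass
    # 5) re-enable write protection
    flash.write_csr(CONTROL_REG, flash.read_csr(CONTROL_REG) | WP_SECTOR1)

flash = MockFlash()
commit_word(flash, 0x0, 0xDEADBEEF)
print(hex(flash.ufm[0x0]))
```

If any of these steps is skipped on real hardware, the write may never reach the flash array even though downstream logic appears to see the data, which would match the "all zeros after power cycle" symptom.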
Requesting detailed information about Stratix 10 NX's Tensor Blocks

We have been using the Stratix 10 NX (1SN21BHU2F53E2VG) to develop an AI accelerator for a while, but have failed to find detailed information about the special AI Tensor Blocks, including the fmax under different speed grades, a public user guide for this IP, and (ideally) an example design. Where can I find these support docs for the AI Tensor Block?
Intel FPGA AI Suite Inference Engine

Is there any official documentation on the DLA runtime or inference engine for managing the DLA from the ARM side? I need to develop a custom application for running inference, but so far I've only found the dla_benchmark (main.cpp) and streaming_inference_app.cpp example files. There should be some documentation covering the SDK. The only related documentation I have found is the Intel FPGA AI Suite PCIe-based design example: https://www.intel.com/content/www/us/en/docs/programmable/768977/2024-3/fpga-runtime-plugin.html

From what I understand, the general inference workflow involves the following steps:
1) Identify the hardware architecture
2) Deploy the model
3) Prepare the input data
4) Send inference requests to the DLA
5) Retrieve the output data
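The request/retrieve steps above are easiest to picture as an asynchronous pipeline. The toy sketch below is plain Python with sleeps standing in for real work; none of the function names come from the CoreDLA SDK. It only illustrates why keeping several inference requests in flight helps: host-side prepare/retrieve work for one frame can overlap inference for another.

```python
# Schematic illustration (NOT the CoreDLA SDK; all names are placeholders)
# of steps 3-5 above, run either serially (one request in flight) or with
# several concurrent requests, as dla_benchmark's -nireq option does.
from concurrent.futures import ThreadPoolExecutor
import time

def prepare_input(frame):     # step 3: host-side pre-processing (mocked)
    time.sleep(0.002)
    return frame

def dla_infer(data):          # step 4: inference request to the DLA (mocked;
    time.sleep(0.005)         # a single real IP instance serializes these,
    return data * 2           # but the host work still overlaps)

def retrieve_output(result):  # step 5: host-side post-processing (mocked)
    time.sleep(0.002)
    return result

def run(frames, nireq):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nireq) as pool:
        results = list(pool.map(
            lambda f: retrieve_output(dla_infer(prepare_input(f))), frames))
    return results, time.perf_counter() - start

frames = list(range(16))
serial_results, t1 = run(frames, nireq=1)
overlap_results, t4 = run(frames, nireq=4)
print(f"nireq=1: {t1 * 1e3:.0f} ms, nireq=4: {t4 * 1e3:.0f} ms")
```

With one request in flight the per-frame times add up end to end; with four, the wall-clock time drops even though each individual stage is unchanged.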
Intel FPGA AI Suite Inference Engine

Hello, I'm using Intel FPGA AI Suite 2023.2 on an Ubuntu 20.04 host computer and trying to infer a custom CNN on an Intel Arria 10 SoC FPGA. I have followed the Intel FPGA AI Suite SoC Design Example Guide and I'm able to compile the Intel FPGA AI Suite IP and run the M2M and S2M examples. I have also compiled the graph for my custom NN and I'm trying to run it with the Intel FPGA AI Suite IP, but it is not clear to me how to do it. I'm trying to use the dla_benchmark app provided, but, for example, the input data of my NN must be float (it was trained and the graph compiled that way), whereas the input data of the IP must be int8, if I'm not wrong. Another problem is the ground truth file: I have a ground truth file for each input file, because each ground truth is a 225-element array. Is there any additional information or guide for running custom models with the Intel FPGA AI Suite? Thank you in advance.
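On the float-vs-int8 point: when a graph is compiled from an FP32 model, the quantization is normally baked into the compiled graph, but if raw tensors must be fed by hand, the conversion is a linear quantization step. The sketch below is a generic illustration of symmetric float-to-int8 quantization, not the AI Suite's exact scheme; the correct scale factor would have to come from the compiled graph's own parameters.

```python
# Generic symmetric float -> int8 quantization, for illustration only.
# The AI Suite's actual quantization parameters come from the compiled
# graph; the scale used here is an arbitrary example value.
import numpy as np

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    """Symmetric linear quantization: q = clip(round(x / scale), -128, 127)."""
    q = np.round(x / scale)
    return np.clip(q, -128, 127).astype(np.int8)

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
scale = 1.0 / 127          # example scale mapping [-1, 1] onto [-127, 127]
q = quantize_int8(x, scale)
print(q)
```

The inverse (dequantization) is simply q * scale, which is what makes the integer tensor a faithful stand-in for the float input within rounding error.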
AN 754: MIPI D-PHY Solution with Cyclone V - questions on VCCIO/VCCPD/VREF connection

I am referring to AN 754 (MIPI D-PHY Solution with Passive Resistor Networks in Intel® Low-Cost FPGAs) to achieve MIPI receive in Cyclone IV. In Table 1 of the document, for FPGA I/O buffer mode in RX, we can see that:
- For high-speed signaling mode, we can use a differential I/O standard (LVDS25)
- For low-power signaling mode, we can use single-ended mode with the HSTL12 or LVCMOS12 I/O standard

I would like your confirmation that, for FPGA I/O buffer RX mode only and for low-power signaling only, we can use HSTL12 single-ended mode with the following connections on the same I/O bank:
1) VCCIO = 2.5 V
2) VCCPD = 2.5 V
3) VREF = 0.6 V
4) No VTT connection is needed at all

I would be grateful if someone could confirm the connections detailed in items 1/2/3/4. Thanks in advance.
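For what it's worth, the reference levels behind items 3 and 4 can be sanity-checked numerically: for HSTL-class standards, VREF is nominally half the standard's supply voltage (1.2 V for HSTL12, independent of the 2.5 V bank supply), and VTT, which nominally equals VREF, is only required when parallel termination is used at the input. This arithmetic is a sanity check only, not a substitute for confirming the connections against AN 754.

```python
# Nominal HSTL-12 input levels (sanity check for items 3 and 4 above).
# For HSTL-class standards, VREF is nominally half the standard's supply,
# and VTT (used only with parallel termination) nominally equals VREF.
hstl12_supply = 1.2            # HSTL-12 nominal supply, volts
vref = hstl12_supply / 2       # nominal input reference -> 0.6 V
vtt = vref                     # termination rail, if parallel termination used
print(f"VREF = {vref:.2f} V, VTT = {vtt:.2f} V")
```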