Acceleration

Open FPGA Stack (OFS), FPGA AI Suite, and High-Level Design

Recent Discussions

AI Suite System Throughput Issue
When using AI Suite, we are seeing a significant gap between IP throughput and achieved system throughput on Agilex 5.

I am using the following:

Hardware: Agilex™ 5 FPGA and SoC E-Series Modular Development Kit (ES silicon)
Software: Quartus Prime Pro + AI Suite 25.3.1
SD Image: agx5_soc_s2m coredla-image-agilex5_mk_a5e065bb32aes1.wic
Architecture and Bitstream: AGX5_Performance

MobileNetV2 (Open Model Zoo 2024.6.0), compiled for the AGX5_Performance architecture, gives the following results through dla_benchmark:

IP throughput per instance: ~151 FPS
Estimated throughput (200 MHz): ~178 FPS
System throughput:
nireq=1 → 41 FPS
nireq=4 → 54 FPS

Why is there such a big delta between IP performance and system throughput, and how can we improve the system throughput?

For more details, please see the appended log showing the commands that I ran for the benchmark. Any pointers or help would be highly appreciated.

Thanks

=====================================================================
1. Using mobilenet v2 from model zoo
=====================================================================

Commands used to download and compile model:

git clone https://github.com/openvinotoolkit/open_model_zoo.git
cd open_model_zoo
git checkout 2024.6.0
omz_downloader --list
omz_downloader --name mobilenet-v2-pytorch --output_dir $COREDLA_WORK/demo/models/
omz_converter --name mobilenet-v2-pytorch --download_dir ../demo/models/ --output_dir ../demo/models/
cd $COREDLA_WORK/demo/models/public/mobilenet-v2-pytorch/FP32
dla_compiler --march $COREDLA_ROOT/example_architectures/AGX5_Performance.arch --network-file ./mobilenet-v2-pytorch.xml --foutput-format=open_vino_hetero --o $COREDLA_WORK/demo/mobilenet-v2-pytorch_dla.bin --batch-size=1 --fanalyze-performance --fassumed-fmax-core 200

Executing performance estimate
----------------------------------------------------------------
main_graph_0 reported throughput: 178.617 fps
TOTAL DDR SPACE REQUIRED = 16.9756 MB
DDR INPUT & OUTPUT BUFFER SIZE = 0.781738 MB
DDR CONFIG BUFFER SIZE = 0.0986328 MB
DDR FILTER BUFFER SIZE = 15.3296 MB
DDR INTERMEDIATE BUFFER SIZE = 0.765625 MB
NOTE: THIS ESTIMATE ASSUMES 1x I/O BUFFER. THE COREDLA RUNTIME DEFAULTS TO 5
TOTAL DDR TRANSFERS REQUIRED = 18.7003 MB
DDR FILTER READS REQUIRED = 16.2124 MB
DDR FEATURE READS REQUIRED = 1.62164 MB
DDR FEATURE WRITES REQUIRED = 0.767578 MB
NUMBER OF DDR FEATURE READS = 9
MINIMUM AVERAGE DDR BANDWIDTH REQUIRED = 3340.19 MB/s
ASSUMED DDR BANDWIDTH PER IP INSTANCE = 6400 MB/s
----------------------------------------------------------------
Performance Estimator Throughput Breakdown
Arch: kvec64xcvec32_i12x1_fp12agx_sb32768_xbark32_actk32_poolk4
Number of DLA instances = 1
Number of DDR Banks per DLA instance = 1
CoreDLA Target Fmax = 200 MHz
PE Target Fmax = 200 MHz
Batch Size = 1
PE-only Conv Throughput No DDR = 186 fps
PE-only Conv Throughput = 185 fps
Overall Throughput Inf PE Buf Depth (zero MPBW) = 185 fps
Overall Throughput Zero PE Buf Depth (zero MPBW) = 183 fps
Overall Throughput Inf PE Buf Depth = 184 fps
Overall Throughput Zero PE Buf Depth = 182 fps
----------------------------------------------------------------
FINAL THROUGHPUT = 178.617 fps
FINAL THROUGHPUT PER FMAX (CoreDLA) = 0.893086 fps/MHz
FINAL THROUGHPUT PER FMAX (PE) = 0.893086 fps/MHz

Running the model on dev kit:

./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=1 -bgr -nthreads=1

[Step 11/12] Dumping statistics report
count: 8 iterations
system duration: 191.3784 ms
IP duration: 52.7551 ms
latency: 23.4076 ms
system throughput: 41.8020 FPS
number of hardware instances: 1
number of network instances: 1
IP throughput per instance: 151.6441 FPS
IP throughput per fmax per instance: 0.7582 FPS/MHz
IP clock frequency measurement: 200.0000 MHz
estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)
estimated IP throughput per fmax per instance: 0.8931 FPS/MHz

./dla_benchmark -b=1 -cm $compiled_model -d=HETERO:FPGA,CPU -i $imgdir -niter=8 -plugins ./plugins.xml -arch_file $archfile -api=async -groundtruth_loc $imgdir/ground_truth.txt -perf_est -nireq=4 -bgr -nthreads=4

[Step 11/12] Dumping statistics report
count: 8 iterations
system duration: 147.8426 ms
IP duration: 52.7619 ms
latency: 69.8254 ms
system throughput: 54.1116 FPS
number of hardware instances: 1
number of network instances: 1
IP throughput per instance: 151.6246 FPS
IP throughput per fmax per instance: 0.7581 FPS/MHz
IP clock frequency measurement: 200.0000 MHz
estimated IP throughput per instance: 178.6172 FPS (200 MHz assumed)
estimated IP throughput per fmax per instance: 0.8931 FPS/MHz
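
As a cross-check of the figures in the two statistics reports, the throughput values can be re-derived from the iteration count and the reported durations. The sketch below is only arithmetic on the logged numbers; the helper name `throughput_fps` and the `runs` dictionary are made up for illustration, and it assumes the "system duration" and "IP duration" both cover the same 8 iterations. It shows how much of the wall-clock time falls outside the IP (about 72% at nireq=1 and about 64% at nireq=4), but it does not say where that time goes.

```python
# Re-derive the dla_benchmark throughput figures from the logged durations.
# All numbers are copied from the two statistics reports in the post.

ITERATIONS = 8  # "count: 8 iterations"


def throughput_fps(iterations: int, duration_ms: float) -> float:
    """Frames per second for a given iteration count and duration in ms."""
    return iterations / (duration_ms / 1000.0)


# Keyed by nireq; durations in milliseconds as reported.
runs = {
    1: {"system_ms": 191.3784, "ip_ms": 52.7551},
    4: {"system_ms": 147.8426, "ip_ms": 52.7619},
}

for nireq, r in runs.items():
    system_fps = throughput_fps(ITERATIONS, r["system_ms"])
    ip_fps = throughput_fps(ITERATIONS, r["ip_ms"])
    # Share of wall-clock time not accounted for by the IP duration.
    outside_ip = 1.0 - r["ip_ms"] / r["system_ms"]
    print(
        f"nireq={nireq}: system {system_fps:.2f} FPS, "
        f"IP {ip_fps:.2f} FPS, time outside IP {outside_ip:.1%}"
    )
```

The computed values line up with the reported "system throughput" and "IP throughput per instance" lines, so the gap in the post is the ratio of IP duration to system duration rather than a reporting discrepancy.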
OmerNaeem
4 hours ago
Acceleration
ai suite
24 Views
0 likes
4 Comments
Agilex 7 I-Series "aocl diagnose acl0" error following OFS
Hello, I've been working through the Open FPGA Stack (OFS) guides to set up my Agilex 7 I-Series development kit for use with oneAPI. I worked around earlier SystemVerilog issues by switching the generated FPGA Interface Manager (FIM) from a 1x16 PCIe configuration to a 2x8 configuration (although 1x16 would be preferred). I am now on the final step of wrapping the FIM into a BSP and validating it for use with oneAPI by running the "aocl diagnose acl0" command. I should note that running just "aocl diagnose" works fine. When I add "acl0", however, all attempts to communicate between the host and FPGA via DMA fail (although we do see a single VTP L2 4KB hit). The exact output from the diagnose command is in the attached text file. I have tried both a minimal FIM generated with the command provided in the OFS guides and the pre-built FIMs from the GitHub page. Why might this error be occurring, and how can I fix it? Any help is greatly appreciated, thank you! James
jjb169
11 days ago
636 Views
1 like
36 Comments
HLS Compiler 24.1 error - aocl-clang.exe - dll entry point not found
Good day, I recently installed HLS Compiler 24.1 together with Quartus Prime Lite 23.1std in order to run my first HLS tests. I installed Visual Studio 2017 Community and the latest version of the Microsoft Visual C++ Redistributables (2015-2022). My OS is Windows 10 Home 10.0.19045.

To test an HLS example, I open the Visual Studio console by launching the init_hls.bat file in C:\intelFPGA_lite\23.1std\hls. It seems to detect every dependency. Launching the build.bat of an example in test-x86-64 mode returns the following error, translated below the image: "The procedure entry point [...] could not be located in the dynamic link library [...]." This also happens for mathlib.dll, generationlib.dll and mip_common.dll.

I tried deleting and reinstalling HLS Compiler, and deleting and reinstalling the Microsoft Visual C++ Redistributables. I also tried launching build.bat in test-FPGA mode, changing the board from Arria10 to CycloneV, which returns the same error.

Linked below are two threads that ask about similar errors, but the original posters did not share whether or how the problem was finally resolved.

aocl-clang.exe - Entry Point Not Found - Intel Community
HLS i++ compile failure for Quartus Prime 21.1.1 Lite - Intel Community

Any suggestions on how to proceed? Thank you in advance, Noah
Solved
NoahHuguenin
26 days ago
High-level Design Tools
3.7K Views
0 likes
13 Comments
How Do I get the License for HLS?
License for HLS
bbT
1 month ago
Acceleration
High-level Design Tools
61 Views
0 likes
10 Comments
Deprecation Notice for FPGA Support Package for oneAPI DPC++/C++. What is the alternative?
Hi there, We recently began porting our HLS C++ projects to oneAPI because the HLS Compiler was discontinued. Today I noticed the deprecation notice for "FPGA Support Package for Intel® oneAPI DPC++/C++ Compiler". See https://www.intel.com/content/www/us/en/developer/tools/oneapi/fpga.html . Hm. The Intel/Altera software page lists 4 HLS tools, of which two are the deprecated ones mentioned above, and the other two are not suitable for continuing our projects (no C++). So the questions are:
* Will there be SYCL-for-FPGA support in the future?
* Is "HLS"-like C++ support planned for the future (or support for other non-MATLAB languages)?
* What is the recommended high-level approach for FPGA projects with an image processing background?
Solved
Achim_the_key
1 month ago
High-level Design Tools
2.6K Views
0 likes
7 Comments
Does the FPGA PAC N3000 support OpenCL and oneAPI?
I received an Intel FPGA PAC N3000 card and decided to take the opportunity to learn how to develop with SYCL and oneAPI. However, I ran into problems. I fully installed the FPGA PAC N3000 Acceleration Stack v1.3.1 and also updated the board's BMC from D.1.0.12 to D.2.0.19. Then I started configuring oneAPI 2022 using intel-basekit and fpga-addon, but proper configuration requires a BSP, and I couldn't find one anywhere. I also saw that the website page for Quartus Prime Pro 19.2, which is installed with IAS 1.3.1, has a tab with oneAPI and BSPs for boards. I looked at other versions, but I only found mentions of Arria 10-GX and Arria 10-SX. I'm not sure if this will help, but the log below is from CentOS 7.6.1810.

[root@node-fpga ~]# fpgainfo fme
Board Management Controller, MAX10 NIOS FW version D.2.0.19
Board Management Controller, MAX10 Build version D.2.0.6
//****** FME ******//
Object Id : 0xF300000
PCIe s:b:d.f : 0000:84:00.0
Device Id : 0x0b30
Numa Node : 1
Ports Num : 01
Bitstream Id : 0x23000110010309
Bitstream Version : 0.2.3
Pr Interface Id : f3c99413-5081-4aad-bced-07eb84a6d0bb
Boot Page : user

[root@node-fpga ~]# fpgainfo bmc
Board Management Controller, MAX10 NIOS FW version D.2.0.19
Board Management Controller, MAX10 Build version D.2.0.6
//****** BMC SENSORS ******//
Object Id : 0xF300000
PCIe s:b:d.f : 0000:84:00.0
Device Id : 0x0b30
Numa Node : 1
Ports Num : 01
Bitstream Id : 0x23000110010309
Bitstream Version : 0.2.3
Pr Interface Id : f3c99413-5081-4aad-bced-07eb84a6d0bb
( 1) Board Power : 59.65 Watts
( 2) 12V Backplane Current : 2.91 Amps
( 3) 12V Backplane Voltage : 11.92 Volts
( 4) 1.2V Voltage : 1.20 Volts
( 6) 1.8V Voltage : 1.80 Volts
( 8) 3.3V Voltage : 3.27 Volts
(10) FPGA Core Voltage : 0.90 Volts
(11) FPGA Core Current : 14.47 Amps
(12) FPGA Core Temperature : 62.50 Celsius
(13) Board Temperature : 42.00 Celsius
(14) QSFP A Voltage : N/A
(15) QSFP A Temperature : N/A
(24) 12V AUX Current : 2.08 Amps
(25) 12V AUX Voltage : 11.97 Volts
(37) QSFP B Voltage : N/A
(38) QSFP B Temperature : N/A
(44) Retimer A Core Temperature : 63.00 Celsius
(45) Retimer A Serdes Temperature : 64.00 Celsius
(46) Retimer B Core Temperature : 0.00 Celsius
(47) Retimer B Serdes Temperature : 0.00 Celsius

[root@node-fpga ~]# aoc -list-boards
Board list:
pac_a10 (default)
Board Package: /opt/intel/oneapi/compiler/2022.2.1/linux/lib/oclfpga/board/intel_a10gx_pac
pac_s10
Board Package: /opt/intel/oneapi/compiler/2022.2.1/linux/lib/oclfpga/board/intel_s10sx_pac
pac_s10_usm
Board Package: /opt/intel/oneapi/compiler/2022.2.1/linux/lib/oclfpga/board/intel_s10sx_pac
Memories: device, host

[root@node-fpga ~]# aocl list-devices
--------------------------------------------------------------------
Device Name: acl0
BSP Install Location: /opt/intel/oneapi/compiler/2022.2.1/linux/lib/oclfpga/board/intel_a10gx_pac
Vendor: Intel Corp
Physical Dev Name Status Information
pac_f200000 Uninitialized OpenCL BSP not loaded. Must load BSP using command: 'aocl program <device_name> <aocx_file>' before running OpenCL programs using this device
DIAGNOSTIC_PASSED
--------------------------------------------------------------------

[root@node-fpga ~]# aocl initialize acl0 pac_a10
aocl initialize: Running initialize from /opt/intel/oneapi/compiler/2022.2.1/linux/lib/oclfpga/board/intel_a10gx_pac/linux64/libexec
bitstream.c:391:validate_bitstream_metadata() **ERROR** : Interface ID check failed
Error writing bitstream to FPGA: invalid parameter
Error programming device
aocl initialize: Program failed.
[root@node-fpga ~]#
ron_doge
1 month ago
High-level Design Tools
32 Views
0 likes
0 Comments
Agilex 7 M series Open FPGA Stack support
Hello, We are using an Agilex 7 M-Series FPGA and are considering evaluating Open FPGA Stack (OFS) at some point. Currently, the latest OFS release (2025.1) contains only pre-release-level support for the Agilex 7 M reference shell design and lacks support for many key features (https://github.com/OFS/ofs-agx7-pcie-attach/releases/tag/ofs-2025.1-1). Can we expect official release-level support in the near future? Best regards, Otto
otto-Q
3 months ago
Ofs
17 Views
0 likes
0 Comments
oneAPI Support for Agilex 5 and 7 Development Kits
Hello, I've recently acquired an Agilex 5 065B Modular development kit (https://www.intel.com/content/www/us/en/products/details/fpga/development-kits/agilex/a5e065b-modular.html) and an Agilex 7 I-Series development kit (https://www.intel.com/content/www/us/en/products/details/fpga/development-kits/agilex/agi027.html). I've been setting up the development environments for these boards and would like to use the oneAPI HLS toolkit with them, if possible. So far, I've found a tutorial for creating an accelerator support package for my Agilex 7 development kit (https://ofs.github.io/ofs-2024.3-1/hw/common/user_guides/oneapi_asp/ug_oneapi_asp/). I'm planning to go through these steps shortly to set up my board for use with oneAPI, but I have yet to find any documentation for my Agilex 5 board. Is it possible to work with this Agilex 5 development kit using oneAPI, or is only the Quartus design flow supported at this time? Additionally, if the HLS flow is an option, are there board support packages available (that I've just missed), or soon to be available, for these cards? Thank you!
jjb169
5 months ago
High-level Design Tools
1.7K Views
0 likes
5 Comments
Agilex 5 Precision DSP block simulations
Hi, I'm using the Precision DSP blocks in my Agilex 5 design. I have a floating-point add (FP_Add_native_DSP) and a floating-point MAC (FP_MAC_native_DSP), but when I run simulations with these in place I'm seeing odd behavior:
1/ The adder is not doing an addition; the output merely follows one of the input pins.
2/ The MAC gives an output, but it does not match the output I'm seeing from a similar MAC targeted at the Arria 10 FPGA. The Arria 10 design is proven on silicon, so I would have thought its simulation model is correct.
The above is making me nervous, and I'm seeking clarification on two points:
1/ Are there indeed bugs in the simulation models? If so, is a patch available?
2/ Do the floating-point DSP functions work correctly on actual Agilex 5 silicon?
I look forward to hearing from you. Simon
Solved
S_J_S
6 months ago
High-level Design Tools
4.9K Views
0 likes
10 Comments
Creating PCB based on 10M08 Evaluation Board but with other MAX10 FPGA
Hello everyone, for a school project, I want to design a PCB around the MAX 10 FPGA. To make my life easier, I am using this Intel Evaluation Board (https://www.intel.com/content/www/us/en/products/details/fpga/development-kits/max/10m08-evaluation-kit.html) as a starting point. The FPGA used in that design is the 10M08SAE144C8G. However, it has only 8000 LEs, which will not be enough, so I'm planning to use the 10M16SAE144C8G as a (hopefully) drop-in replacement. I think this will work; why shouldn't it? All it really is, is a PCB that uses the same layout, but with more of the available GPIOs broken out and preferably the mentioned 10M16SAE144C8G (or even the 10M25SAE144C8G). I checked the pinouts of both FPGAs and they are the same. Programming should also be the same, as far as I know. Thanks for reading!
Iukas
6 months ago
Acceleration
1.3K Views
0 likes
1 Comment