oneAPI on Cyclone 10 GX
Hi all, the official Intel FPGA requirements page says the Cyclone 10 GX FPGA is supported by oneAPI, so I downloaded the latest version on my Ubuntu 20.04 machine (Quartus Prime is also installed). I compiled the simple-add sample; the compile targeting the FPGA works, but when I run simple-add-buffers.fpga I get:

tetto@ubuntuoffice:~/simple-add/build$ ./simple-add-buffers.fpga
An exception is caught while computing on device.
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)
Aborted (core dumped)
Strange behavior of Quartus Fitter and how to get more information

Hi, I'm designing an accelerator for DTW computation using oneAPI and a Stratix 10 on the BittWare 520N-MX Gen3x16 board. I have a kernel (actually several different kernels connected with pipes) that I replicate as many times as possible to get the maximum throughput; the different kernel instances work on different input data. In one version I fitted 12 kernels in the FPGA. I then simplified that kernel's external memory interfaces and its "function overhead" (using oneAPI pragmas), and the compiler's estimated resource utilization shows a reduction of more than 30% per kernel. However, the Fitter still failed to place more than 12 kernels on the FPGA. What sounds even stranger to me is that if I try to compile 16 kernels I get the error:

"Error (170012): Fitter requires 72611 LABs to implement the design, but the device contains only 66099 LABs."

But if I compile 14 kernels (same clock target):

"Error (170012): Fitter requires 73646 LABs to implement the design, but the device contains only 66439 LABs"

How could 14 identical kernels need more LABs than 16? I have tried other kernel counts and clock frequencies and the results are very unpredictable. Any idea why the estimated resource utilization is so far off? How can I get more information on the Fitter process to try to figure out what is happening? Thanks.
FPGA report fails for matrix transpose

Hi everyone, the FPGA report fails for a simple kernel for the matrix transpose operation. The output asks me to PLEASE submit a bug report to https://software.intel.com/en-us/support/priority-support and include the crash backtrace, but the web page does not work well for me. Thus, I attach the error messages and source code here.

The source code: attached.

The command used to generate the optimization report:

icpx -fsycl -fintelfpga -DFPGA_HARDWARE -std=c++2b -Wall -Wextra -Wpedantic -Werror -O3 mattrans.cpp -Xshardware -fsycl-link=early -Xsv -Xsparallel=16 -Xsffp-reassociate -Xsffp-contract=fast -o mattrans.a

The error messages:

Dependency files for SYCL source and SYCL-source library: /dev/shm/icpx-32a212/mattrans-1ee627.d
aoc: Environment checks completed successfully.
aoc: Selected target board package /opt/software/FPGA/IntelFPGA/opencl_sdk/20.4.0/hld/board/bittware_pcie/s10_hpc_default
aoc: Selected target board p520_hpc_sg280l
aoc: Processing SPIR-V....
aoc: SPIR-V processing completed
aoc: Linking Object files....
Device information not found: 1SG280LU3F50E1VGS1
aoc: Optimizing and doing static analysis of code...
PLEASE submit a bug report to https://software.intel.com/en-us/support/priority-support and include the crash backtrace.
Stack dump:
0. Program arguments: /opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt -march=fpga -O3 -ipatemplate /opt/software/FPGA/IntelFPGA/opencl_sdk/20.4.0/hld/board/bittware_pcie/s10_hpc_default/hardware/p520_hpc_sg280l/board_spec.xml -board /opt/software/FPGA/IntelFPGA/opencl_sdk/20.4.0/hld/board/bittware_pcie/s10_hpc_default/hardware/p520_hpc_sg280l/board_spec.xml -vpfp-relaxed -sycl -dbg-info-enabled --soft-elementary-math=false -pass-remarks-output=pass-remarks.yaml mattrans.1.bc -o mattrans.kwgid.bc
1. Running pass 'Function Pass Manager' on module 'mattrans.1.bc'.
2. Running pass 'Mark the decision for loop pipelining' on function '@ZTS15mattrans_kernel'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(+0x29c600f)[0x55f1186f700f]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(+0x29c2f1d)[0x55f1186f3f1d]
/lib64/libpthread.so.0(+0x12ce0)[0x1517e8435ce0]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN3acl6MemDep25BasicMemoryDependenceInfo24distinctTermsInLoopNestsEPN4llvm4LoopERNS_16VariableGEPIndexES4_S6_+0x436)[0x55f1193b00f6]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN3acl6MemDep25BasicMemoryDependenceInfo38distinctTermsInDifferentLoopIterationsEPN4llvm11InstructionElRKSt6vectorINS_16VariableGEPIndexESaIS6_EEmS4_lSA_mPNS2_4LoopE+0x1952)[0x55f1193b25f2]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN3acl6MemDep25BasicMemoryDependenceInfo38distinctTermsInDifferentLoopIterationsEPN4llvm11InstructionERKNS_24AddressDecompositionInfoES4_S7_+0xb6)[0x55f1193b3486]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN3acl6MemDep25BasicMemoryDependenceInfo21analyzeDecompositionsEPN4llvm11InstructionES4_bb+0xe4c)[0x55f1193b6a8c]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN3acl6MemDep31LoopCarriedMemoryDependenceInfo13getDependenceEPN4llvm11InstructionES4_b+0xc0c)[0x55f1193bc03c]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN3acl6MemDep31LoopCarriedMemoryDependenceInfo13getDependentsEPN4llvm11InstructionERNS2_11SmallPtrSetIS4_Lj2EEEb+0x2de)[0x55f1193be96e]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN3acl33LoopCarriedDepsBeforePipelineInfo12process_loopEPKN4llvm4LoopE+0x54d)[0x55f119527e0d]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN3acl16LoopPipelineInfo19check_serial_regionEPKN4llvm4LoopE+0x64)[0x55f11951da74]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN3acl16LoopPipelineInfo19runPipelineAnalysisEPNS_13ArrayPrivInfoEPNS_33LoopCarriedDepsBeforePipelineInfoEPNS_14LocalMemConfigEPNS_20RestrictInterleavingE+0xc3)[0x55f119521953]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN20MarkPipelineDecision3runEv+0x4c)[0x55f119a7100c]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN30MarkPipelineDecisionLegacyPass13runOnFunctionERN4llvm8FunctionE+0x3cd)[0x55f119a718cd]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE+0x3eb)[0x55f117ca1f8b]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE+0x39)[0x55f117ca2199]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE+0x2fb)[0x55f117ca2aeb]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(main+0x20d4)[0x55f116bb3d44]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x1517e4337cf3]
/opt/software/FPGA/IntelFPGA/oneapi/23.1.0/compiler/2023.1.0/linux/lib/oclfpga/llvm/aocl-bin/aocl-opt(_start+0x29)[0x55f116cde119]
Error: Optimizer FAILED. Refer to /dev/shm/mattrans-96129e-fe9b4d/mattrans.log for details.

Many thanks
Xin
HAL Kernel Version Mismatch Error During FPGA Emulation with vector-add Sample

Hello Intel FPGA team, I'm currently working on the vector-add example from the official oneAPI-samples repository, specifically from DirectProgramming/DPC++/DenseLinearAlgebra/vector-add. I'm encountering the following runtime error when I attempt to run the emulation build.

Error output:

./vector-add-buffers.fpga_emu
HAL Kern: Version mismatch! Expected 0xa0c00001 but read 0x4130
Hardware version ID differs from version expected by software. Either:
a) Ensure your compiled design was generated by the same ACL build currently in use, OR
b) The host can not communicate with the compiled kernel.
vector-add-buffers.fpga_emu: /nfs/sc/disks/swip_hld_1/ops/SC/hld/nightly/2022.1/96.2/l64/work/acl/acl/source/57c9d2bcb46afcf445b5da2406c0e6d85be93ef3/src/acl_kernel_if.cpp:733: int acl_kernel_if_init(acl_kernel_if*, acl_bsp_io, acl_system_def_t*): Assertion `0' failed.
make: *** [Makefile.fpga:35: run_emu] Error 1

Environment details:
- Board: DE10-Agilex
- BSP path: /opt/intel/oneapi/intelfpgadpcpp/2021.4.0/board/de10_agilex
- oneAPI version: multiple versions installed; active: 2022.0.2
- dpcpp path: /opt/intel/oneapi/compiler/2022.0.2/linux/bin/dpcpp
- OS: Ubuntu (detected as Rocky Linux during install attempts)

What I have tried:
- Verified that AOCL_BOARD_PACKAGE_ROOT is correctly set.
- Recompiled the design using make clean && make fpga_emu.
- Ran aoc -list-board-packages to confirm the installed board package.
- Ensured Quartus, the BSP, and the compiler are aligned.

Despite that, I still encounter the HAL version mismatch.

Request: could someone guide me on how to:
- Resolve this version mismatch issue?
- Confirm the correct environment and runtime versions are in sync?
- Completely clean up older/duplicate oneAPI installations if that's the root cause?
Agilex 5 Precision DSP block simulations

Hi, I'm using the Precision DSP blocks in my Agilex 5 design; I have a floating-point add (FP_Add_native_DSP) and a floating-point MAC (FP_MAC_native_DSP), but when I try to run simulations with these in place I'm seeing odd behavior:

1/ The adder is not performing an addition; the output merely follows one of the input pins.
2/ The MAC gives an output, but it does not match the output I see from a similar MAC targeted at an Arria 10 FPGA. The Arria 10 design is proven on silicon, so I would have thought its simulation model is correct.

The above is making me nervous and I'm seeking clarification that:

1/ There are indeed bugs in the simulation models - and if so, is a patch available?
2/ The floating-point DSP functions work correctly on the actual Agilex 5 silicon.

I look forward to hearing from you.
Simon
Why does aoc set ii to 6 when I use high clock frequencies?

I have a simple toy kernel that I want to run at 1000 MHz; it doesn't do much:

    __attribute__((uses_global_work_offset(0)))
    __attribute__((max_global_work_dim(0)))
    __kernel void netsim(__global const volatile float * restrict gl_vm) {
        float vm[50000];

        #pragma ii 1
        #pragma ivdep
        #pragma speculated_iterations 64
        for (int i = 0; i < 50000; i++) {
            vm[i] = gl_vm[i];
        }
    }

According to the report (see screenshot), II=6 and latency=927. Why can't the compiler lower the latency and set II to 1 here?
Stable argument doesn't work in simulation

Hi everyone, I have an issue when I run my oneAPI kernel with its arguments passed as "stable" annotated_arg. When I use those "stable" arguments as variables in a "for" loop in simulation, it is very slow and doesn't work well, whereas when I use a classic "int" declared inside the kernel (not a kernel argument), the "for" loop has no such issue and the simulation runs fine and fast. Do you have an idea of what the issue could be? Thank you!
DorianL
System console giving up

Hi, I am using the Intel HLS compiler and generated a design to add 4 numbers. I created a .cpp file, emulated it, and verified the output. I then created the IP using the HLS commands and integrated it in Platform Designer along with a JTAG-to-Avalon master bridge IP. While testing on the hardware through System Console, I have established the JTAG path, but while trying to write values to the registers it shows the error below:

master_write_32: This transaction did not complete in 60 seconds. System Console is giving up.
    while executing
"master_write_32 $master_service_path 0x34 4"
    (file "load_vals.tcl" line 6)
    invoked from within
"source load_vals.tcl"

I am attaching the .qar file, a screenshot of the System Console window, the csr.h file containing the register addresses, and the C++ code for which I generated the IP using the HLS compiler.
Developing a high speed stopwatch

I've been trying to make a high-speed counter. Simulation says it should work, but the simulator I have working doesn't really account for gate propagation delays. When I compile this and download it to an Arduino Vidor 4000, which has a Cyclone 10 FPGA on it, I'm getting noise out even right from initialization.

I've tried a few different approaches. I was thinking the Verilog compilation might have a method to know when a counter had completed counting, but I don't really see anything like that in the netlist. This last iteration uses a small counter (9 bits) of which 8 bits are used and the 9th bit is a carry into a 17-bit counter, of which 16 bits are used and the top bit carries into a 40-bit counter, for a total of 64 bits of counting. The slower counters (above the 8-bit counter) should tick at a normal rate, but their output is just noise.

I was able to get it to sort of work by assigning the rCOUNTER value (a 40-bit counter at the time) directly to the output, bypassing the latching registers - but that will not work for my needs. I do need as close to the correct count of ticks latched as I can get. Right now I can send a command to generate a latch, but eventually that will come from external hardware. This is that version - it uses just a not-gate to drive the clock. I have had this working somewhat better at various points, but at some point something goes wrong and I just get noise out:

https://github.com/d3x0r/STFRPhysics/blob/master/hardware/fpga/new-counter2.v

This is a different version that just uses one 64-bit counter:

https://github.com/d3x0r/STFRPhysics/blob/master/hardware/fpga/clock-module.v

I sort of figured the ripple count wouldn't matter a lot however fast it is clocked, since higher-value bits in the counter would just be ahead of a prior input, and I would be able to latch that counter into one of two registers to provide a stable output to send out. The whole idea is that there's a high-speed counter and two signals that each latch the counter into a register (one register per signal) and hold the value until the next rising-edge latch. I did add a reset signal, so really it will latch a new value with a new latch signal after the lock on the register is released by a reset signal.

Is this possible with any FPGA? I have a requirement to count sub-nanosecond ticks, preferably 200 ps or less. I tried looking for a more specialized high-tick-rate real-time clock, but didn't really find anything, and I'd like to have something that is already on a board with USB communication to it.

The following is some of the output - tl2d and t2 are one 64-bit register, which is getting latched. The other 64-bit register, tl3d and t3 (tl3d is the low part), has never been latched and really should be 0 from the rLatch2 variable in the program:

tl2d:DFF1F3FF t2:FFFFFFF4 tl3d:1FF1F3FF t3:FFFFFFF6
tl2d:77F20909 t2:57539CD1 tl3d:1FF1F3FF t3:FFFFFFF6
tl2d:6D82597B t2:57EFFC73 tl3d:1FF1F3FF t3:FFFFFFF6

This is another run; I delayed trying to do a latch for a few seconds, so this is what the board gives with nothing sent to it other than the FPGA code:

pins:0 tl2d:DFFF53FF t2:FFFFFFF6 tl3d:1FFF53FF t3:FFFFFFF6
pins:1 tl2d:6D39846D t2:1BF5FC1 tl3d:1FFF53FF t3:FFFFFFF6

(This is the first latched value; the top 40 bits - t2 and the top byte of tl2d - should be 0.) I don't understand at all why everything ends up so bad.

By the way - do #N delays in Verilog programs matter when compiled for hardware, or are they only simulation hints?
Advice on CNN inference on Agilex 7 using oneAPI

Greetings everyone, I'm tasked with porting a CNN trained with PyTorch to an Agilex 7 FPGA using HLS, and I think oneAPI is the right tool for the job. Since this is not a completely novel task, I wonder if there are any existing implementations, libraries, or similar material I can reuse? I would prefer not to have to implement everything - loading weights, max-pooling layers, fixed-point numerics - from scratch. I'd be happy with any pointers to materials or tutorials you might have. Thanks in advance.