The compilation time is too long for Intel FPGA OpenCL
I am trying to compile an HLS project with Intel SDK for OpenCL 20.3 on a DE10-Pro. This project used to take 5 to 6 hours to compile with Intel SDK for OpenCL 19.4 on an Arria 10, but the current compile has now been running for more than 17 hours. The resource consumption in Linux is shown below; it does not look like much of the machine is being used. The most recently generated compilation file is also shown below, and it suggests that routing finished successfully. However, since top.fit.route.rpt was generated, 5 hours have passed without any file being updated. I want to know whether it is usual for compilation to take this long, and how I can reduce the compilation time in this flow.

I want to know how to control hyper-optimized handshaking setting
I have just started using Quartus Pro 20.3 to compile for the DE10-Pro. I used prefetch_load in my code and got the following error:

    Compiler Error: Prefetching LSU is not available when hyper-optimized handshaking is enabled

After I deleted prefetch_load and went back to a normal global memory access, the error disappeared. But I found that in my compile report, hyper-optimized handshaking in the Kernel Summary is off. Why can't I apply this feature after removing the prefetch_load LSU, and how does it affect my design?

Intel OpenCL compiler (aoc) does not coalesce global memory reads anymore
The two screenshots say it all. The old screenshot was generated with aoc 21.2.0; note how it coalesces the 16 float reads into one 512-bit DDR read. The new screenshot was generated with aoc 2024.2.1; it does not coalesce the 16 float reads and instead creates 16 individual read ports. As far as I can tell, that is quite bad for performance and wastes a lot of hardware resources. Is there a way to make aoc 2024.2.1 coalesce exactly as the old compiler did?

Choosing FPGA board for ML implementation using oneAPI
Hello, I wish to implement transformer modules on an Intel FPGA using oneAPI. HBM is preferred, and compatibility with the oneAPI workflow is important. Some options I checked were:

- Stratix 10 NX
- Stratix 10 MX

I did not find much about using the AI tensor blocks with oneAPI, so I wanted to check whether there are any restrictions from that perspective. Other suggestions for FPGA boards would be great too. Thanks

Valid substitute to EN5396QI-T
Good morning everyone, I am trying to find valid alternatives to the EN5396QI-T, EN5366QI, and EN5337QI-T, which are now obsolete. The substitute must match the output current, the switching frequency, and the low input voltage, and it must have an integrated inductor too. Do you know of any valid alternatives? At the moment, the best I have found are the MPS DC-DC converters, but unfortunately they do not reach a 5 MHz switching frequency. Thank you all for your attention, Best regards, Andrea B.

Why does aoc set ii to 6 when I use high clock frequencies?
I have a simple toy kernel that I want to run at 1000 MHz; it doesn't do much:

    __attribute__((uses_global_work_offset(0)))
    __attribute__((max_global_work_dim(0)))
    __kernel void netsim(__global const volatile float * restrict gl_vm)
    {
        float vm[50000];
        #pragma ii 1
        #pragma ivdep
        #pragma speculated_iterations 64
        for (int i = 0; i < 50000; i++) {
            vm[i] = gl_vm[i];
        }
    }

According to the report (see screenshot), II=6 and latency=927. Why can't the compiler lower the latency and set II to 1 here?

What causes OpenCL to insert arbitration for local memory accesses?
I know FPGA OpenCL is deprecated in favor of oneAPI, but I hope you can help me anyway. I've created an MWE of a kernel for which the compiler inserts arbitration:

    __attribute__((uses_global_work_offset(0)))
    __attribute__((max_global_work_dim(0)))
    __kernel void kmain(uint n_tics, __global const volatile uint * restrict dsts)
    {
        float frontier[100];
        #pragma disable_loop_pipelining
        for (uint i = 0; i < 100; i++) {
            frontier[i] = 0;
        }
        uint nqueue[100];
        uint nqueue_n = 20;
        for (uint t = 0; t < n_tics; t++) {
            for (uint i = 0; i < 100; i++) {
                float tmp = frontier[i];
                frontier[i] = 0;
            }
            for (uint j = 0; j < nqueue_n; j++) {
                uint src = nqueue[j];
                frontier[dsts[src]] += 50;
            }
        }
    }

So first I reset all elements of frontier. Then the simulation loop starts: I read one element from frontier and clear it, and then I add 50 to the values at the indexes given by another variable. I know the kernel reads from uninitialized memory, but that is beside the point (I think). In the report, aoc complains about a "Potentially inefficient configuration", and I can see that it has inserted arbitration circuits (see screenshot). So the question is: why? And how can I fix this memory access pattern to be arbitration-free?

Logic elements utilization in FPGA
Hello All, I have an old design on a Cyclone III with Quartus 10, and the used logic elements (in ALMs) are 20k. Now I have migrated exactly the same design to Quartus 21 and also changed the FPGA to a Cyclone V, and the used logic elements (in ALMs) are now 4k. So the only changes are the FPGA, from Cyclone III to Cyclone V, and the Quartus version, from 10 to 21. Why were the used logic elements (in ALMs) reduced from 20k to 4k? What could have gone wrong? PS: no optimization was done in either Quartus project.
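Returning to the earlier question about aoc 2024.2.1 no longer coalescing 16 float reads: one workaround sometimes tried, sketched here under the assumption that the access pattern allows it (the kernel and identifiers below are illustrative, not from the original post), is to request the wide access explicitly with OpenCL's float16 vector type, so the compiler sees a single 512-bit transfer instead of 16 scalar loads:

```c
// Sketch only: copy16, src, dst, and n are illustrative names.
// A float16 access is 16 floats = 512 bits, matching the DDR width
// that the old compiler (aoc 21.2.0) coalesced to automatically.
__attribute__((uses_global_work_offset(0)))
__attribute__((max_global_work_dim(0)))
__kernel void copy16(__global const float16 * restrict src,
                     __global float16 * restrict dst,
                     uint n)
{
    for (uint i = 0; i < n; i++)
        dst[i] = src[i];   // one 512-bit read and one 512-bit write per iteration
}
```

Whether this restores the exact behavior of the old compiler is something to verify in the new report's memory viewer.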
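On the II=6 question above: if the compiler will not honor `#pragma ii 1` at a 1000 MHz target, one mitigation sometimes tried (a sketch under that assumption; netsim_wide is an illustrative name) is to widen each iteration. Total loop cycles are roughly II times the trip count plus latency, so cutting the trip count by 8x reduces runtime even when II stays above 1:

```c
// Sketch only: gl_vm is reinterpreted as float8 so each iteration moves
// 8 floats with one wide load; vstore8 then scatters the vector into
// the private array (vstoren accepts a private pointer in OpenCL C).
__attribute__((uses_global_work_offset(0)))
__attribute__((max_global_work_dim(0)))
__kernel void netsim_wide(__global const volatile float8 * restrict gl_vm)
{
    float vm[50000];
    #pragma ivdep
    for (int i = 0; i < 50000 / 8; i++) {
        float8 v = gl_vm[i];   // one 256-bit load instead of 8 scalar loads
        vstore8(v, i, vm);     // store 8 floats at vm + i*8
    }
}
```

This does not answer why the compiler chooses II=6 in the first place; it only shrinks the cost of that choice, so the report should be re-checked after the change.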
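On the arbitration question: the Intel FPGA SDK for OpenCL exposes kernel memory attributes such as numbanks, bankwidth, singlepump, and doublepump on on-chip arrays, and double-pumping a RAM (running it at 2x the kernel clock) doubles its effective ports. A hedged sketch of annotating frontier this way follows; whether it removes the arbitration for this exact access pattern is an assumption to verify in the report.

```c
// Sketch only, adapted from the MWE in the arbitration question above.
// The memory/doublepump attributes are from the Intel FPGA SDK for
// OpenCL attribute set; doublepump gives the RAM more effective ports,
// which can avoid arbitrating between the clear loop and the
// read-modify-write loop. Treat this as a starting point, not a
// verified arbitration-free configuration.
__attribute__((uses_global_work_offset(0)))
__attribute__((max_global_work_dim(0)))
__kernel void kmain(uint n_tics, __global const volatile uint * restrict dsts)
{
    float frontier[100] __attribute__((memory, doublepump));

    #pragma disable_loop_pipelining
    for (uint i = 0; i < 100; i++)
        frontier[i] = 0;

    uint nqueue[100];
    uint nqueue_n = 20;
    for (uint t = 0; t < n_tics; t++) {
        for (uint i = 0; i < 100; i++) {
            float tmp = frontier[i];   // read one element...
            frontier[i] = 0;           // ...and clear it
        }
        for (uint j = 0; j < nqueue_n; j++) {
            uint src = nqueue[j];
            frontier[dsts[src]] += 50; // indexed read-modify-write
        }
    }
}
```

An alternative direction is to reduce the number of distinct access sites to frontier, for example by folding the clear into the loop that reads, since fewer load/store sites per memory can also make arbitration unnecessary.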