Re: Intel OpenCL compiler (aoc) does not coalesce global memory reads anymore

@Björne2 wrote: The main problem is that icpx embeds the FPGA image into the host code so you can't have one binary that switches between multiple images via command line parameters.

You can extract the aocx from the produced binary: https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/developer-guide/2024-1/extracting-the-fpga-hardware-configuration-aocx.html

@Björne2 wrote: Nor one binary with kernels for multiple different devices.

This is the purpose of oneAPI (one API to target multiple devices), so I'm not sure what you are referring to here.

@Björne2 wrote: icpx also takes 10 seconds for simple examples which compile instantly in OpenCL.

The legacy OpenCL compiler uses the same internal compiler as the SYCL compiler, so it should not be much slower. I timed two example designs:

- In SYCL: "time icpx -fsycl -fsycl-link=early [...]"
  real 0m9.221s, user 0m7.064s, sys 0m0.814s
- In OpenCL: "time aoc -rtl [...]"
  real 0m7.636s, user 0m3.490s, sys 0m0.724s

There is a difference, but not as big as "10 seconds versus instant".

@Björne2 wrote: It wouldn't be so bad if you could use a regular C++ compiler for the host code and just use icpx for the device code, but I haven't found any (easy) way of accomplishing that.

SYCL is a superset of C++, so host code written in SYCL is basically C++. You can have a look at the SYCL code samples: https://github.com/oneapi-src/oneAPI-samples/tree/development/DirectProgramming/C%2B%2BSYCL_FPGA In particular, the GettingStarted/fpga_compile sample walks through a step-by-step C++-to-SYCL example.
The SYCL example is 90% C++: https://github.com/oneapi-src/oneAPI-samples/blob/development/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/GettingStarted/fpga_compile/part2_dpcpp_functor_usm/src/vector_add.cpp

Re: Intel OpenCL compiler (aoc) does not coalesce global memory reads anymore

There is no guarantee that anything coming out of aoc will be functional, as this tool is now deprecated. What are your complaints about SYCL compared to OpenCL?

Re: Intel OpenCL compiler (aoc) does not coalesce global memory reads anymore

The OpenCL SDK for Intel FPGAs is no longer distributed as of 22.4. Therefore, 22.4 is the last version of the compiler to officially support OpenCL as an input language. How did you get a 2024.2.1 version of aoc? I'm guessing that you got that binary from a oneAPI SYCL compiler for FPGA install. The SYCL compiler uses aoc internally, but aoc is not expected to work as a standalone tool.

Re: simple add fails on stratix10 (USM)

Looks like you are using the non-USM BSP. Maybe this can be of help: community.intel.com/t5/Intel-DevCloud/Invalid-Binary-for-FPGA-Stratix-10-Nodes/m-p/1300748#M2604

Re: oneAPI on Cyclone10gx

Hey @StefanoC,

When doing "make report", you are generating RTL for the SYCL code that is in the add-oneapi folder. This RTL top module can indeed be found in "add_report_di.sv". This is the IP that you need to integrate into an existing RTL pipeline.

The "add_quartus_sln" folder already contains RTL and is there to mimic your own RTL pipeline. The "add.sv" file is already there before you do "make report": it is not a generated file; it is the existing RTL pipeline. You can peek into this file and see that it turns on an LED on the FPGA based on another signal, which is not something you can express in SYCL.

This tutorial shows how to connect the RTL generated from SYCL (the add_report_di.sv IP) with the existing "add.sv" RTL pipeline. So "add" from add.sv is the top-level module, and it depends on the SYCL-generated RTL.
The steps in the README tell you to:

1/ Generate the SYCL IP.
2/ Create a Quartus project with the existing RTL files. This also sets the top-level module to "add", which is contained in the add.sv that was just copied from add-quartus-sln.
3/ Import the SYCL-generated IP.
4/ Connect the two in the following steps, etc.

Re: oneAPI on Cyclone10gx

Hey @StefanoC,

This sample demonstrates how to integrate a generated IP into an existing RTL pipeline. In this case, the existing RTL files are located in https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/Tools/platform_designer/add-quartus-sln

The "add" top-level entity is found in the add.sv file. "Step 2." tells you to copy this "add.sv" file into your Quartus project folder:

cp add-quartus-sln/add.sv add-quartus

Then, add it to your Quartus project (step 2.v). Following these steps (with all the other steps), Quartus should be able to understand that the "add" top-level module is in this file.

Re: oneAPI on Cyclone10gx

Hey Stefano,

When you compile using an FPGA family (such as "Cyclone10GX"), the compiler enters an HLS flow: it only generates an IP that you need to manually integrate into your own RTL pipeline. The fpga binary that you obtained is not executable: it is only produced so you can inspect the performance of the IP after Quartus has compiled it (fmax, resource usage, etc.).

Here is a code sample demonstrating how one can integrate an HLS IP into an RTL pipeline in order to run such an IP on an FPGA: https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/Tools/platform_designer

You were expecting the compiler to produce a binary that could be directly executed on the FPGA. To do that, the compiler needs to understand the interface between the IP and your FPGA. This is what we call the "BSP".
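To make the two flows concrete, here is a hedged sketch of the two compile invocations, assuming the standard oneAPI FPGA flags (-fintelfpga, -Xshardware, -Xstarget); "vector_add.cpp" and the BSP path are placeholders, and the exact spelling for your board comes from your BSP vendor.

```shell
# HLS / IP authoring flow: target a device *family*. This produces an IP
# to integrate into your own RTL pipeline; the resulting .fpga file is for
# inspecting fmax and resource usage, not for running on a board.
icpx -fsycl -fintelfpga -Xshardware -Xstarget=Cyclone10GX vector_add.cpp -o vector_add.fpga

# FPGA acceleration flow: target a vendor-supplied BSP. Only this flow
# produces a binary you can execute on the board.
icpx -fsycl -fintelfpga -Xshardware -Xstarget=<path to your BSP> vector_add.cpp -o vector_add.fpga
```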
Some FPGA board vendors do provide BSPs with their FPGA boards, which would have allowed you to compile your program using "icpx ... -Xstarget=<path to your BSP> ..." rather than "-Xstarget=Cyclone10GX". In that case, and in that case only, the FPGA binary produced could have been run on the FPGA natively.

You can have a look at this documentation page to better understand the difference between the "FPGA acceleration flow" and the "HLS flow": https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/developer-guide/2024-0/intel-oneapi-fpga-development-flow.html

Re: max_concurrency: scheduled at run-time or compile-time?

Hey @xwuupb,

Changing the concurrency of the loop changes the hardware that gets produced, so this is a compile-time change. Reducing the concurrency of a loop reduces the resource usage for that loop (because it needs to handle fewer concurrent iterations).

Now, it is called "max_concurrency" and not "concurrency" for a reason: your loop may not have a new iteration ready to execute on every cycle. This may be due to different reasons, such as a blocking pipe read or an upstream component that stalls your loop. So there is also a run-time aspect to how loop iterations are scheduled.

The reason you can't find the exact answer you are looking for is that it would not make sense to artificially limit the concurrency of a loop at runtime: if you have hardware that can interleave as many iterations as possible, there is no reason to limit it at run time. The entire point of this attribute is to tell the compiler that it may limit the maximum concurrency because you know that the consumer of the result computed in this loop does not need the result that fast. Therefore, you can make hardware savings by limiting the maximum concurrency (at compile time).
The page linked above by @hareesh says "The max_concurrency attribute enables you to control the on-chip memory resources required to pipeline your loop.", which implicitly tells you this is done at compile time.

Re: parallel_for very slow in dpc++

Hey amaltaha,

1 - Yes, FPGAs are used for acceleration. If you are not getting an acceleration, it can be that your application is not suitable for FPGA acceleration (e.g. a very quick FPGA computation compared to the overhead of offloading your computation to the FPGA), or that your code needs to be reworked to better suit the programming model for FPGAs.

2 - Yes, the latency is expressed in clock cycles. This is explained in section 2.1.1.1 of the "FPGA Optimization Guide for Intel® oneAPI Toolkits: Developer Guide". Setting the command-line parameter "-Xsclock=500MHz" does not guarantee that the hardware will run at that speed: it is a clock target, not an achieved frequency. To see the achieved clock frequency, look in the generated report. Section 2.0 of the guide covers the analysis of the generated report, and section 4.1.1 explains the "-Xsclock" parameter.

3 - As I mentioned earlier, I'd suggest you retry your experiment using the "pragma unroll" compiler directive rather than parallel_for. This is demonstrated with hands-on code examples on the "Explore SYCL* Through Intel® FPGA Code Samples" webpage, and is also described in section 4.6.8 of the optimization guide.

Yohann

Re: parallel_for very slow in dpc++

1 - I'm not sure what other information than frequency you are looking for. To specify a clock target for your compile, add -Xsclock=<clock target> to your compile command. You can find all this documentation in the "FPGA Optimization Guide for Intel® oneAPI Toolkits: Developer Guide"; the clock setting option is described in section 4.1.1.

2 - I can't tell from your description what is limiting your implementation.
However, if you are comparing the iterative vs. parallel versions, both on FPGA, then you should in theory get better throughput with the parallel version. I don't know how long your computation lasts, but it should run for more than a few seconds to get the benefits of offloading to an FPGA.

I don't know what you mean by "The segmentation fault is caused by sorting". I'm also not sure I understand what you mean by "host and kernel codes separated even with fpga run": your kernel code is in the q.submit section, and the host code is everything around it. Your host code issues a call to the FPGA; you then wait for the FPGA to return the results and continue your host computation.

3 - Yes, parallel_for means that all the iterations are executed at the same time; however, in the general case they won't execute in one cycle. I encourage you again to have a look at the "Explore SYCL* Through Intel® FPGA Code Samples" webpage, which shows a lot of examples to familiarize yourself with these concepts and with good coding practices when developing for FPGAs. There is even a tutorial for loop unrolling on FPGA, which demonstrates the recommended method: use a for loop with a "pragma unroll" compiler directive (so no parallel_for).

Cheers