User Profile

Björne2

New Contributor

19 Posts, 1 Like received

Contributions

Re: Intel OpenCL compiler (aoc) does not coalesce global memory reads anymore
I know that, but the replacement SYCL tools are that bad. The main problem is that icpx embeds the FPGA image into the host code, so you can't have one binary that switches between multiple images via command-line parameters, nor one binary with kernels for multiple different devices. icpx also takes 10 seconds to compile simple examples that compile instantly with OpenCL. It wouldn't be so bad if you could use a regular C++ compiler for the host code and just use icpx for the device code, but I haven't found any (easy) way of accomplishing that.
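To illustrate the OpenCL workflow this post compares against, here is a minimal C++ sketch of a host program choosing an FPGA image at run time. The helper names `pick_image` and `load_image` and any file names are hypothetical. Only the file handling is shown; the step where the bytes are passed to `clCreateProgramWithBinary` is left as a comment so the sketch builds without an OpenCL SDK.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: take the image path from the first command-line
// argument, or fall back to a default when none is given.
std::string pick_image(int argc, const char *const *argv,
                       const std::string &fallback) {
    return argc > 1 ? std::string(argv[1]) : fallback;
}

// Hypothetical helper: read the whole image file into memory. An OpenCL host
// program would then hand these bytes to clCreateProgramWithBinary().
std::vector<unsigned char> load_image(const std::string &path) {
    assert(!path.empty());
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}
```

Because the image is just a file on disk, the same host binary can be pointed at a different image on each run.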
1 year ago Place Acceleration
1.6K Views
0 likes
1 Comment
Re: Intel OpenCL compiler (aoc) does not coalesce global memory reads anymore
Hi BoonBengT and yuguen. I know that oneAPI only supports SYCL. But SYCL is horrible for FPGA work, so I'm sticking with OpenCL. Besides, SYCL is just a thin layer on top of OpenCL anyway. I solved my issue by removing the "volatile" keyword. Apparently, in recent versions volatile prevents memory coalescing.
1 year ago Place Acceleration
1.6K Views
0 likes
3 Comments
Intel OpenCL compiler (aoc) does not coalesce global memory reads anymore
The two screenshots say it all. The old screenshot is generated with aoc 21.2.0. Note how it coalesces the 16 float reads into one 512-bit DDR read. The new screenshot is generated with aoc 2024.2.1. It does not coalesce the 16 float reads and instead creates 16 individual read ports. Afaict, that is quite bad for performance and it wastes a lot of hardware resources. Is there a way to make aoc 2024.2.1 coalesce, exactly like the old compiler did?
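The arithmetic behind the coalesced read can be checked with a small C++ sketch. It assumes a 32-bit float, as in OpenCL C, and the function names are made up for illustration; it models only the bit widths, not what the compiler does.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Bits covered by n consecutive float loads if merged into a single access.
constexpr std::size_t coalesced_bits(std::size_t n_floats) {
    return n_floats * sizeof(float) * CHAR_BIT;
}

// Number of port-wide reads needed to fetch n consecutive floats (rounded up).
inline std::size_t reads_needed(std::size_t n_floats, std::size_t port_bits) {
    assert(port_bits > 0);
    return (coalesced_bits(n_floats) + port_bits - 1) / port_bits;
}

static_assert(sizeof(float) * CHAR_BIT == 32, "sketch assumes a 32-bit float");
static_assert(coalesced_bits(16) == 512, "16 floats fill one 512-bit read");
```

So 16 consecutive floats fit exactly into one 512-bit access, which is why a single coalesced read is enough in the old report.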
1 year ago Place Acceleration
Board Debug
High-level Design Tools
2.1K Views
0 likes
8 Comments
Re: Advice on CNN inference on Agilex 7 using oneAPI
Hello again. I might have solved my problem. Apparently the -Xstarget option should name a specific board and not an FPGA family. 🙂 🙂 So with -Xstarget=B2E2_8GBx4 the generated report looks much better. When synthesis is done in a few hours I'll check if I can run the generated bitstream on the FPGA.
1 year ago Place Acceleration
3.4K Views
0 likes
1 Comment
Re: Advice on CNN inference on Agilex 7 using oneAPI
I think so. /vol/opt/intelFPGA_pro/21.2/hld/board/de10_agilex/hardware/B2E2_8GBx4/board_spec.xml contains:
  <global_mem name="DDR" max_bandwidth="85312" interleaved_bytes="1024" config_addr="0x018">
    <interface name="board" port="kernel_mem0" type="slave" width="512" maxburst="16" address="0x00000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
    <interface name="board" port="kernel_mem1" type="slave" width="512" maxburst="16" address="0x200000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
    <interface name="board" port="kernel_mem2" type="slave" width="512" maxburst="16" address="0x400000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
    <interface name="board" port="kernel_mem3" type="slave" width="512" maxburst="16" address="0x600000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
  </global_mem>
Moreover, if I compile an equivalent OpenCL kernel (aoc -bsp-flow=flat -rtl vector_add.cl), global memory bandwidth is estimated correctly (see screenshot).
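As a sanity check on the address map above, the following C++ sketch recomputes the bank sizes and base addresses from the `size` and `address` attributes. The constant and function names are illustrative only; the numbers are taken from the four `interface` entries.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kBankSize = 0x200000000ULL;  // "size" of each interface
constexpr int kNumBanks = 4;                          // kernel_mem0 .. kernel_mem3

// Convert a byte count to whole GiB.
constexpr std::uint64_t to_gib(std::uint64_t bytes) { return bytes >> 30; }

// Base address of a bank, assuming the banks are laid out back to back.
inline std::uint64_t bank_base(int bank) {
    assert(bank >= 0 && bank < kNumBanks);
    return static_cast<std::uint64_t>(bank) * kBankSize;
}

static_assert(to_gib(kBankSize) == 8, "each bank is 8 GiB");
static_assert(to_gib(kNumBanks * kBankSize) == 32, "four banks give 32 GiB");
```

Each bank is 8 GiB and the bases come out as 0x0, 0x200000000, 0x400000000 and 0x600000000, which agrees with the 8GBx4 in the board name.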
1 year ago Place Acceleration
3.4K Views
0 likes
3 Comments
Re: Advice on CNN inference on Agilex 7 using oneAPI
Thanks for your advice. With the FPGA support package I can now compile SYCL code for FPGA targets. It seems aoc (from the FPGA support package) uses the correct BSP:
  $ aoc -list-boards
  Board list:
    B1E1_8GBx4
      Board Package: /vol/opt/intelFPGA_pro/21.2/hld/board/de10_agilex
    B2E2_8GBx4 (default)
      Board Package: /vol/opt/intelFPGA_pro/21.2/hld/board/de10_agilex
  ...
I create a pre-synthesis report from the FPGA vector_add example like this:
  $ icpx -v -fsycl -fintelfpga -Xshardware -Xstarget=Agilex7 -fsycl-link=early vector_add.cpp -o reportz
When I open the report it says "Report has invalid data. Ok to proceed?" If I do so, I get a report that is not quite right. In particular, the global memory bandwidth estimates are wrong (see screenshot). Is there something else I need to do? Like add something to the icpx command to get it to use the right BSP?
1 year ago Place Acceleration
3.4K Views
0 likes
5 Comments
Re: Advice on CNN inference on Agilex 7 using oneAPI
Thanks for the advice. Does the BSP have to be specific to SYCL or is a BSP for OpenCL sufficient? Anyway, I installed oneAPI as described here: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=linux&linux-install-type=offline
Then I source the ~/intel/oneapi/setvars.sh script to run the SYCL tools and I create the Vector Add project. Compiling it for cpu-gpu works fine, but not when I compile it for FPGA targets:
  icpx -fsycl -fintelfpga -Xshardware -Xstarget=Agilex7 -v -Wsycl-strict vector-add-buffers.cpp
I get the following error message:
  llvm-foreach --out-ext=aocx --in-file-list=/tmp/icpx-003d458ab4/vector-add-buffers-bc1783.txt --in-replace=/tmp/icpx-003d458ab4/vector-add-buffers-bc1783.txt --out-file-list=/tmp/icpx-003d458ab4/vector-add-buffers-1dc7d3.aocx --out-replace=/tmp/icpx-003d458ab4/vector-add-buffers-1dc7d3.aocx --out-increment=a.prj -- /vol/opt/intelFPGA_pro/21.2/hld/bin/aoc -o /tmp/icpx-003d458ab4/vector-add-buffers-1dc7d3.aocx /tmp/icpx-003d458ab4/vector-add-buffers-bc1783.txt -sycl -dep-files=/tmp/icpx-003d458ab4/vector-add-buffers-f000eb.d -output-report-folder=a.prj -g -hardware -target=Agilex7
  AOCL_TMP_DIR directory was specified at /home/bjourne/.cache/aocl. Ensure Linux and Windows compiles do not share the same directory as files may be incompatible.
  InvalidModule: Invalid SPIR-V module: unsupported SPIR-V version number 'unknown (66560)'. Range of supported/known SPIR-V versions is 1.0 (65536) - 1.3 (66304)
  Error: SPIRV to LLVM IR FAILED
I'm using the Intel(R) FPGA SDK for OpenCL(TM), 64-Bit Offline Compiler version 21.2.0. Perhaps there is some mismatch between what the oneAPI tools expect and what the offline compiler is capable of?
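The numbers in the error message can be decoded by hand. The SPIR-V module header stores the version as one 32-bit word whose bytes are, from high to low, 0, major, minor, 0. The C++ sketch below (type and function names are my own) applies that layout to the values from the message; the supported range is copied from the message itself, not taken from any aoc documentation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

struct SpirvVersion {
    unsigned major_num;
    unsigned minor_num;
};

// Decode a SPIR-V version word laid out as 0 | major | minor | 0.
inline SpirvVersion decode_spirv_version(std::uint32_t word) {
    assert((word & 0xFF0000FFu) == 0);  // top and bottom bytes must be zero
    return SpirvVersion{(word >> 16) & 0xFFu, (word >> 8) & 0xFFu};
}

inline std::string format_version(const SpirvVersion &v) {
    return std::to_string(v.major_num) + "." + std::to_string(v.minor_num);
}

// Range quoted in the error message: 1.0 (65536) to 1.3 (66304).
inline bool in_reported_range(std::uint32_t word) {
    return word >= 65536u && word <= 66304u;
}
```

With this, 66560 (0x00010400) decodes to version 1.4, one minor version above the 1.3 upper bound the 21.2.0 offline compiler reports, which fits the suspected mismatch between the tools.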
1 year ago Place Acceleration
3.5K Views
0 likes
9 Comments
Re: Advice on CNN inference on Agilex 7 using oneAPI
> Do you have a board in mind? Keep in mind that if you wish to use
> oneAPI FPGA Acceleration, you should choose an acceleration card
> with a supported BSP. We have a list of vendor cards on our
> homepage:
Yeah, the board is a DE10 Agilex 7 from Terasic. The exact model is AGF 7 014 B2E2_8GBx4.
> Instead of building a full oneAPI BSP, you can also use the oneAPI
> DPC++/C++ compiler to create IP that you can integrate using a
> platform designer system.
Well, I have a server license for Quartus Prime 21.2. Previously I have used the aoc (Intel(R) FPGA SDK for OpenCL(TM) Kernel Compiler) command to build FPGA bitstreams from OpenCL code, so I think I already have a suitable BSP installed. What I'm missing is how to "connect" icpx (Intel(R) oneAPI DPC++/C++ Compiler) to the FPGA. It was easy with OpenCL. I just compiled the kernel with aoc and then loaded it onto the FPGA with OpenCL host code. It appears it is not that easy with SYCL.
1 year ago Place Acceleration
3.6K Views
0 likes
11 Comments
Advice on CNN inference on Agilex 7 using oneAPI
Greetings everyone, I'm tasked with porting a CNN trained with PyTorch to an Agilex 7 FPGA using HLS. I think the right tool for the job is oneAPI. Since this is not a completely novel task, I wonder if there are any existing implementations, libraries, or similar material I can reuse? I would prefer not to have to implement everything from scratch, from loading weights, to max pooling layers, to fixed-point numerics. I'd be happy with any pointers to materials or tutorials you might have. Thanks in advance.
1 year ago Place Acceleration
High-level Design Tools
3.8K Views
1 like
15 Comments
Re: Why does aoc set ii to 6 when I use high clock frequencies?
I'm not sure what you mean by reference design. The kernel I'm compiling is the one shown in the source code. The compile command I'm using is:
  aoc -bsp-flow=flat -seed=251 -parallel=16 -ffp-contract=fast -ffp-reassociate -clock=1000MHz -O3 test.cl
where 251 is just some random number I picked.
1 year ago Place Acceleration
4.3K Views
0 likes
0 Comments