Forum Discussion

Occasional Contributor

3 years ago

parallel_for very slow in dpc++

Hello, I really need help with this. I am trying to accelerate an algorithm using DPC++. What happens is that the normal calculation runs 1.5 times faster than the parallel kernel execution. The fo...

amaltaha

Occasional Contributor

3 years ago

1- I know that my target is the A10. I am asking about the commands that show information about a specific device or about devices in general, and the commands that let me specify the frequency and so on. I couldn't find them.

2- The code finally worked with these dpcpp commands for emulation and for the hardware run:

dpcpp -fintelfpga -DFPGA_EMULATOR knn_trial2.cpp -o knn.fpga_emu

and

dpcpp -fintelfpga -Xshardware fpga_compile.cpp -o fpga_compile.fpga

The results now give an average of 0.0005 s, which is fine, but it is still slower than the iterative code. Might this be because of the overhead you mentioned? It is still much faster than the Python code, which runs in 0.012 s.

The segmentation fault is caused by the sorting step: it works fine in emulation, and the segmentation fault appears only after the FPGA run. Aren't the host and kernel codes separated even in an FPGA run?

3- I have a question regarding parallel_for, single_task, and work groups:

Doesn't parallel_for mean that all the elements (from 0 to num_size) do the same job at the same time, i.e. run in parallel? I have searched for the difference between parallel_for, single_task, and work groups, and didn't find a satisfying explanation of any of them. Does parallel_for cause all elements to be processed in one operation, in almost one clock cycle?

Thank you!

yuguen

Occasional Contributor

3 years ago

1 - I'm not sure what information other than frequency you are looking for. To specify a clock target for your compile, you can add -Xsclock=<clock target> to your compile command.

You can find all this documentation in the "FPGA Optimization Guide for Intel® oneAPI Toolkits : Developer Guide".

The clock setting option is described in 4.1.1.

2 - I can't tell from your description what is limiting your implementation. However, if you are comparing the iterative vs parallel versions, both on FPGA, then you should in theory get better throughput with the parallel version. I don't know how long your computation lasts, but it should run more than a few seconds to get the benefits of an offload to an FPGA.

I don't know what you mean by "The segmentation fault is caused by sorting".

I'm not sure I understand what you mean by "host and kernel codes separated even with fpga run". Your kernel code is in the q.submit section, and the host code is everything around it. Your host code issues a call to the FPGA, then has to wait for the FPGA to return the results before it continues the host computation.
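Since the comparison in point 2 rests on averaged run times, here is a minimal sketch of a timing helper in plain C++. It is not taken from your code: the name `average_seconds` and the single untimed warm-up call are my assumptions, and you would pass in whatever you want to measure.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: average wall-clock seconds per call of `work`
// over `runs` timed repetitions, after one untimed warm-up call.
template <typename F>
double average_seconds(F&& work, std::size_t runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;

    work();  // warm-up call, excluded from the measurement

    const auto start = clock::now();
    for (std::size_t i = 0; i < runs; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(runs);
}
```

For the FPGA version, the callable would contain the q.submit call followed by a wait on the queue, so that the timed region covers the full round trip and not only the submission. Timing the iterative version with the same helper keeps the two numbers comparable.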

3 - Yes, parallel_for declares that all the iterations are independent and can execute at the same time; however, in the general case they won't all complete in one cycle.
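To make the "not in one cycle" point concrete, here is a simplified textbook pipeline model. It is not output from the compiler or a measurement of your design. It assumes the iterations are pushed through one pipelined datapath; `latency` and `ii` (initiation interval, the number of cycles between the start of consecutive iterations) are hypothetical parameters chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Cycles to push n independent iterations through one pipelined datapath.
// latency: cycles for a single iteration to travel through the pipeline.
// ii: cycles between the start of consecutive iterations (assumed >= 1).
std::uint64_t pipelined_cycles(std::uint64_t n, std::uint64_t latency,
                               std::uint64_t ii) {
    assert(ii >= 1);
    if (n == 0) return 0;
    return latency + (n - 1) * ii;
}

// Cycles if every iteration waits for the previous one to finish completely.
std::uint64_t serial_cycles(std::uint64_t n, std::uint64_t latency) {
    return n * latency;
}
```

With the made-up values n = 1000, latency = 20 and ii = 1, the model gives 1019 cycles for the pipelined case against 20000 for the fully serial one: the iterations overlap, but the total is still far more than one cycle.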

I encourage you again to have a look at the "Explore SYCL* Through Intel® FPGA Code Samples" webpage, which shows many examples that will familiarize you with these concepts and teach you good coding practices for FPGA development.

There is even a tutorial for loop unrolling on FPGA, which demonstrates the recommended method: use a for loop with a "#pragma unroll" compiler directive (so no parallel_for).
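As a rough idea of the loop shape that tutorial refers to, here is a sketch in plain C++. It is not the tutorial's code: the squared-distance computation, the name `squared_distance` and the constant `kDims` are hypothetical, picked only because your file name suggests a k-nearest-neighbours workload.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical feature count; a compile-time constant so the loop
// can be fully unrolled.
constexpr std::size_t kDims = 8;

// Squared Euclidean distance between two fixed-size points.
float squared_distance(const std::array<float, kDims>& a,
                       const std::array<float, kDims>& b) {
    float sum = 0.0f;
    // Request full unrolling of the loop below. Compilers that do not
    // recognise this pragma ignore it, possibly with a warning.
#pragma unroll
    for (std::size_t i = 0; i < kDims; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```

In a DPC++ design the loop would sit inside the kernel body in the q.submit section. Because the trip count is known at compile time, the compiler can replicate the loop body instead of iterating over it.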

Cheers
