parallel_for very slow in dpc++

yuguen

Occasional Contributor

3 years ago

Hello amaltaha,

"how can I know the details and specifications of the hardware FPGA I am using? Like the frequency. "

If you did not specify an FPGA target, I believe the compiler targets an Arria 10 (A10) device by default.

If you did not specify a frequency target, the compiler targets the default frequency for this FPGA.

When compiling for FPGA, it is recommended to generate the optimization report, where you can find all of this information as well as many other useful details.

I strongly encourage you to follow the "Explore SYCL* Through Intel® FPGA Code Samples" webpage that will guide you through FPGA development by trying out small code examples.

This will give you a better understanding of how to get performance for your application using this compiler.

"The iterative code no more than 40 seconds while the parallel one takes more than 7 minutes."

There can be multiple reasons causing this.

A first reason is that the compiler is optimized for FPGAs, not for CPU execution. So a performance gap between two CPU executions will not reflect how the code performs on the FPGA.

A second reason is that, in the code snippet you provided above, your parallel loop accumulates into one single location, which leads to a race condition on a CPU (again, it is a completely different story on FPGAs).

"Does Double precision in the code make it heavy for the FPGA to calculate distances?"

Yes, double precision operations are more expensive in terms of resources and have longer latencies than their single precision counterparts.

"in emulation resulted in 0.03 seconds and started to decrease, but on FPGA it was 3.5 seconds"

When using an FPGA, you are offloading a computation.

This offloading comes with an overhead: copying data from the host DDR to the FPGA DDR and copying the data from the FPGA DDR to the host DDR.

So to make the most use out of the FPGA, you need an application that is compute intensive enough to cover these latencies.

In your case, your emulation run takes 0.03 seconds, so:

1/ The overhead of using the FPGA is going to be larger than that.

2/ Were you expecting an execution faster than 0.03 seconds?

I'm guessing that the application you are trying to accelerate runs longer than 0.03 seconds, so you can try to accelerate your real application and see from there.

"and then segmentation fault as in the second picture. "

The code executed in the picture seems different from the code provided before (the print messages are not the same).

So it is hard to help you with only a screenshot of a segfault.

You should try using a debugger (such as gdb) to find where this segfault is coming from so you'll be able to fix it.
