Forum Discussion

Occasional Contributor

3 years ago

parallel_for very slow in dpc++

Hello, I really need help with this. I am trying to accelerate an algorithm using DPC++. What happens is that the normal calculation runs 1.5 times faster than the parallel kernel execution. The fo...

amaltaha

Occasional Contributor

3 years ago

Hello @BoonBengT_Altera

I still have doubts regarding the hardware run on the FPGA. I don't understand why a small parallel operation can take milliseconds when the FPGA runs at more than 500 MHz. Even after I specify the clock rate flag:

-Xsclock=500MHz

nothing changes: it still takes milliseconds, whereas the host code runs much faster. Isn't an FPGA meant for acceleration? The FPGA code above is very basic. Also, is there a way to know how many clock cycles the kernel took? Is that equivalent to the latency in the report? The latency in the report is 343, but it has no units. What does 343 mean? Is it the number of clock cycles, for example?

2- The segmentation fault problem was solved. Nothing is wrong with the hardware run except the amount of time it takes compared to normal C++ code.

3- In my experiment, single_task with a loop inside took almost the same time as parallel_for without a loop inside, which means iterative code takes as long as parallel code on the hardware. Is this normal?

Thank you!

yuguen
Occasional Contributor
3 years ago
Hey amaltaha,

1- Yes, FPGAs are used for acceleration. If you are not getting an acceleration, it may be that your application is not suitable for FPGA acceleration (e.g. a very quick FPGA computation compared to the overhead of offloading your computation to the FPGA), or that your code needs to be reworked to better suit the programming model for FPGAs.

Yes, the latency is expressed as clock cycles. This is explained in section 2.1.1.1 of the "FPGA Optimization Guide for Intel® oneAPI Toolkits : Developer Guide".

Setting the command line parameter "-Xsclock=500MHz" does not guarantee that the hardware will run at that speed. This is a clock target, not an achieved clock. To see the achieved clock, look in the generated report. Section 2.0 of the guide quoted above covers the analysis of the generated report. Section 4.1.1 explains the "-Xsclock" parameter.

3 - As I mentioned earlier, I'd suggest you retry your experiment using the "pragma unroll" compiler directive rather than parallel_for. This is demonstrated in the "Explore SYCL* Through Intel® FPGA Code Samples" webpage with hands-on code examples, and is also described in the optimization guide in section 4.6.8.
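As a minimal sketch of the loop shape I mean (not your actual kernel, which I have not seen in full): the function below is plain C++ so that it compiles anywhere, and it stands in for the body you would place inside a single_task lambda. The name vector_add and the size kSize are assumptions for illustration.

```cpp
#include <cstddef>

// Hypothetical, small trip count known at compile time so the loop can be fully unrolled.
constexpr std::size_t kSize = 8;

// Stand-in for the body of a single_task kernel: element-wise addition.
void vector_add(const int* a, const int* b, int* out) {
    // On an FPGA target, the unroll directive asks the compiler to replicate
    // the loop body in hardware. A host compiler that does not recognise the
    // pragma ignores it, and the loop simply runs sequentially.
#pragma unroll
    for (std::size_t i = 0; i < kSize; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Whether full unrolling pays off depends on the trip count and on memory access patterns, so check the loop analysis in the generated report after compiling.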

Yohann
- amaltaha
  Occasional Contributor
  3 years ago
  Thank you for your support Yohann!
  Finally, one last thing: it is not possible to view the report on Intel DevCloud, as installing a browser such as Firefox is not allowed. Viewing it through Jupyter Notebook on the cloud gives an empty HTML file. How can I view the report?
  
  Thank you!
