DevCloud: OneAPI vector-add example parallel execution is much slower than the scalar one

Frequent Contributor

6 years ago

Hope things are going fine there.

For the FPGA to process the loop efficiently in parallel, apply loop unrolling. Unrolling makes the compiler create separate execution units to handle the additions. Without it, the parallel and scalar versions will not differ much in performance, and the parallel version also pays for setup and related overhead.

So to extract performance out of the FPGA, please unroll the loop by placing

#pragma unroll UNROLL_FACTOR

immediately before the for loop.

You can try different values of UNROLL_FACTOR, in powers of two such as 2, 4, 8 and 16, and then compare the performance with the scalar version.

Please find an example below:

cgh.single_task<class covariance>(
    [=]()
    {
        /* Accessor related code HERE */

        // Ask the compiler to replicate the loop body UNROLL_FACTOR times
        #pragma unroll UNROLL_FACTOR
        for (int j = 0; j < num_items; j++)
        {
            accessorC[j] = accessorA[j] + accessorB[j];
        }
    });
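To make it clearer what the unroll pragma in the kernel above asks for, here is a rough host-side sketch in plain C++ (no SYCL, so it runs on any CPU). It writes out by hand roughly what unrolling by a given factor amounts to: a main loop that handles Factor elements per trip, plus a remainder loop. The names vector_add_scalar and vector_add_unrolled are hypothetical helpers made up for this illustration, and the actual hardware the FPGA compiler generates may differ; on the FPGA the replicated additions can become parallel adders, while on a CPU this only shows the transformation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: one addition per loop iteration.
inline void vector_add_scalar(const std::vector<int>& a,
                              const std::vector<int>& b,
                              std::vector<int>& c) {
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t j = 0; j < a.size(); ++j) {
        c[j] = a[j] + b[j];
    }
}

// Hypothetical helper: hand-written equivalent of unrolling by Factor.
// This is an illustration only, not what the FPGA compiler literally emits.
template <std::size_t Factor>
void vector_add_unrolled(const std::vector<int>& a,
                         const std::vector<int>& b,
                         std::vector<int>& c) {
    static_assert(Factor > 0, "unroll factor must be positive");
    assert(a.size() == b.size() && a.size() == c.size());

    const std::size_t n = a.size();
    const std::size_t main_end = n - (n % Factor);  // largest multiple of Factor <= n

    // Main loop: each trip processes Factor independent elements.
    for (std::size_t j = 0; j < main_end; j += Factor) {
        // Inner loop has a compile-time trip count, so its Factor additions
        // are independent copies of the original loop body.
        for (std::size_t k = 0; k < Factor; ++k) {
            c[j + k] = a[j + k] + b[j + k];
        }
    }

    // Remainder loop: leftover elements when n is not a multiple of Factor.
    for (std::size_t j = main_end; j < n; ++j) {
        c[j] = a[j] + b[j];
    }
}
```

Calling vector_add_unrolled<4>(a, b, c) should give the same result as vector_add_scalar(a, b, c) for any input size. Checking that first is a cheap sanity test before comparing timings for different unroll factors.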

Thanks and Regards

Anil
