Hi
Hope things are going fine there.
To process the loop efficiently in parallel on the FPGA, you can apply loop unrolling. Unrolling makes the compiler replicate the loop body in hardware, so that separate execution units handle the summation. Without it, the parallel and scalar versions will perform about the same, and the parallel version also pays for setup and related overhead.
So to get performance out of the FPGA, please unroll the loop by placing
#pragma unroll UNROLL_FACTOR
immediately before the for loop.
You can try different values of UNROLL_FACTOR, using powers of two such as 2, 4, 8, 16, etc.,
and then compare the performance with the scalar version.
Please find an example below:
cgh.single_task<class covariance>(
    [=]()
    {
        /* Accessor related code HERE */
        // UNROLL_FACTOR must be a compile-time integer constant.
        #pragma unroll UNROLL_FACTOR
        for (int j = 0; j < num_items; j++)
        {
            accessorC[j] = accessorA[j] + accessorB[j];
        }
    });
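To make it concrete what the unroll pragma asks the compiler to do, here is a rough plain C++ sketch that unrolls the same element-wise addition by hand and checks that every factor gives the same result as the scalar loop. It runs on an ordinary CPU and needs no FPGA toolchain, so it only shows correctness, not performance. The names add_unrolled, matches_scalar and all_factors_agree are made up for this illustration, and on the FPGA the compiler does this replication for you when it sees the pragma.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds a and b element-wise, handling Factor elements
// per outer iteration. This mimics by hand what an unroll pragma requests.
template <std::size_t Factor>
void add_unrolled(const std::vector<int>& a, const std::vector<int>& b,
                  std::vector<int>& c) {
    static_assert(Factor >= 1, "unroll factor must be at least 1");
    const std::size_t n = a.size();
    const std::size_t main_end = n - (n % Factor);

    // Main part: the inner loop has a compile-time trip count, so its body
    // can be replicated Factor times (on an FPGA, as parallel adders).
    for (std::size_t j = 0; j < main_end; j += Factor) {
        for (std::size_t k = 0; k < Factor; ++k) {
            c[j + k] = a[j + k] + b[j + k];
        }
    }
    // Remainder: leftover elements when n is not a multiple of Factor.
    for (std::size_t j = main_end; j < n; ++j) {
        c[j] = a[j] + b[j];
    }
}

// Runs the Factor-unrolled version and reports whether it matches expected.
template <std::size_t Factor>
bool matches_scalar(const std::vector<int>& a, const std::vector<int>& b,
                    const std::vector<int>& expected) {
    std::vector<int> c(a.size(), 0);
    add_unrolled<Factor>(a, b, c);
    return c == expected;
}

// Checks factors 2, 4, 8 and 16 against the scalar (Factor = 1) result.
bool all_factors_agree(std::size_t n) {
    std::vector<int> a(n), b(n), expected(n, 0);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<int>(i);
        b[i] = static_cast<int>(2 * i);
    }
    add_unrolled<1>(a, b, expected);  // scalar reference
    return matches_scalar<2>(a, b, expected) &&
           matches_scalar<4>(a, b, expected) &&
           matches_scalar<8>(a, b, expected) &&
           matches_scalar<16>(a, b, expected);
}
```

Calling all_factors_agree(37), for example, exercises the remainder loop because 37 is not a multiple of any of the tested factors. On real hardware you would instead rebuild the kernel for each UNROLL_FACTOR and time it against the scalar version.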
Thanks and Regards
Anil