Forum Discussion

Occasional Contributor

3 years ago

How to obtain the values of get_local_id(0), get_group(0) and get_local_range(0) in single_task

Hi support team, I am modifying the code to run on FPGA hardware on the oneAPI DevCloud. The original sample code targets GPU/CPU and uses a parallel_for lambda function for the kernel, but for FPGA optim...

Wei-Chih

Occasional Contributor

3 years ago

Hi support team / Daouda

Following your suggestion to rewrite the kernel as a single_task, I now get the results I expect. However, if I make the loop bound too large (for (size_t id = 0; id < global; id++), where global is 65536, 65536/4, 65536/8, ...), the FPGA hardware (Arria 10) runs for a very long time and even fails for some large values. Do you have any suggestions for optimizing it? Currently I have only added #pragma unroll (with factor 1) to the loops. Could you suggest other optimization methods I could use in my case? I need to reduce the kernel execution time on FPGA hardware.

Below is my modified code (I use id < 4 for now):

h.single_task<class bude_kernel>([=]() [[intel::kernel_args_restrict]] {

  #pragma unroll 1
  for (size_t id = 0; id < 4; id++) {

    const size_t lid = id;
    const size_t gid = id;
    const size_t lrange = 1;

    float etot[NUM_TD_PER_THREAD];
    cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
    cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];

    size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
    ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;

    #pragma unroll 1
    for (int i = lid; i < ntypes; i += lrange) local_forcefield[i] = forcefield[i];

    #pragma unroll 1
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
      size_t index = ix + i * lrange;
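
For comparison with the index variables in the snippet above, here is a minimal host-side sketch in plain C++ (no SYCL required) of how a flattened loop counter maps to the values that get_group(0), get_local_id(0) and get_local_range(0) would return in an nd_range launch. The struct `WorkItemIndex` and the functions `decompose` and `first_pose` are hypothetical names introduced only for this illustration; the starting-index formula is the one used for `ix` above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type (not part of SYCL): the indices a work-item
// would observe in an nd_range launch.
struct WorkItemIndex {
    std::size_t group;   // value get_group(0) would return
    std::size_t local;   // value get_local_id(0) would return
    std::size_t lrange;  // value get_local_range(0) would return
};

// Split a flattened global id into (group, local id) for a given
// work-group size. This is the standard relation
//   global_id = group * local_range + local_id.
WorkItemIndex decompose(std::size_t id, std::size_t lrange) {
    assert(lrange > 0);
    return WorkItemIndex{id / lrange, id % lrange, lrange};
}

// First pose index handled by a work-item, using the same formula as
// `ix` in the kernel snippet (before clamping against nposes).
std::size_t first_pose(const WorkItemIndex& w, std::size_t num_td_per_thread) {
    return w.group * w.lrange * num_td_per_thread + w.local;
}
```

Inside a single_task loop one would call `decompose(id, lrange)` per iteration. Note that with a local range of 1 this relation gives a local id of 0 and a group equal to `id` for every iteration, which differs from `lid = id` in the snippet above and may be worth double-checking.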

DDIAKITE
Occasional Contributor
3 years ago
Hi Wei-Chih,

I suggest using the oneAPI FPGA optimization guide to reduce the execution time of your kernel. You can find the guide here:

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top.html

Use the report generated by the DPC++ compiler and check that the initiation interval (II) of each loop is not high (the ideal value is 1). The report also indicates the main bottlenecks of your kernel so that you can optimize them.

Then you can increase the unrolling factor to express more parallelism, instead of using a factor of 1, which leaves the loop rolled. Note that the loop trip count should be divisible by the unrolling factor for good performance; otherwise the II may be greater than 1. For example, you can fully unroll your "range" and "NUM_TD_PER_THREAD" loops if their trip counts are not too large, and partially unroll the "global" loop, which has a high exit value. I don't know what the execution time of your kernel was, but it shouldn't be too high even for global = 65536.

When you say "... even fail in some large condition case.", do you mean that you get incorrect results, or that the execution does not complete?
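
To make the divisibility point concrete, here is a rough sketch in plain C++ of what partial unrolling by a factor of 4 amounts to, written out by hand. It only illustrates the transformation and is not the hardware the FPGA compiler actually generates; in a DPC++ kernel you would simply put `#pragma unroll 4` above the loop. The names `kUnroll`, `sum_unrolled` and `remainder_iterations` are hypothetical and chosen for this example.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kUnroll = 4;  // hypothetical unroll factor

// Number of leftover iterations that do not fit into whole unrolled
// groups. Zero means the trip count is divisible by the factor.
std::size_t remainder_iterations(std::size_t trip_count, std::size_t factor) {
    return trip_count % factor;
}

// Sum of a vector with the loop unrolled by kUnroll by hand.
float sum_unrolled(const std::vector<float>& v) {
    const std::size_t n = v.size();
    const std::size_t main_trips = n - remainder_iterations(n, kUnroll);

    // One independent accumulator per unrolled copy of the loop body;
    // on an FPGA these copies could become parallel hardware.
    float acc[kUnroll] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < main_trips; i += kUnroll) {
        for (std::size_t u = 0; u < kUnroll; ++u) {
            acc[u] += v[i + u];
        }
    }
    float total = acc[0] + acc[1] + acc[2] + acc[3];

    // Remainder loop: only needed when n is not a multiple of kUnroll.
    // This extra handling is the cost of a non-divisible trip count.
    for (std::size_t i = main_trips; i < n; ++i) {
        total += v[i];
    }
    return total;
}
```

With a trip count of 65536 and a factor of 4, `remainder_iterations` returns 0, so the remainder loop never runs; with a trip count such as 10 it returns 2 and the extra loop is required.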

Best regards,
Daouda
