Forum Discussion
Hi support team / Daouda
Following your suggestion to modify the single_task function for the kernel, I can get the result I expect now. However, if I set the for loop condition too large (size_t id = 0; id < global(I will use 65536, or 65536/4, 65536/8...); id++), FPGA hardware(arria10) will run very long time and even fail in some large condition case. Do you have any suggestion about how to optimize it? Currently, I only add unroll(with factor 1) on the loop. May you suggest me other optimization methods can use in my case? I need to decrease the kernel execution time when using FPGA hardwares.
below is my modified code(i use id<4 now)
h.single_task<class bude_kernel>([=]() [[intel::kernel_args_restrict]] {
#pragma unroll 1
for (size_t id = 0; id < 4; id++) {
const size_t lid = id;
const size_t gid = id;
const size_t lrange = 1;
float etot[NUM_TD_PER_THREAD];
cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];
size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;
#pragma unroll 1
for (int i = lid; i < ntypes; i += lrange) local_forcefield[i] = forcefield[i];
#pragma unroll 1
for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
size_t index = ix + i * lrange;
.
.
.
.
.
- DDIAKITE3 years ago
Occasional Contributor
Hi Wei-Chih,
You may need to use the oneAPI optimization guide to optimize the execution time of your kernel. You can find the guide here:
Use the report file from the dpc++ compiler and ensure that your loops' initiation interval is not high (the ideal value is 1). Also, the report indicates the main bottleneck of your kernel so that you can optimize them.
Then You may increase your unrolling factor to express more parallelism instead of using just one as unrolling factor. Note that the loop trip count should be divisible by the unrolling factor for good performance, otherwise the II may be greater than 1. For example, you can fully unroll your "range" and "NUM_TD_PER_THREAD" loops if the trip count is not too large and partially unroll the "global" loop for high exit values. I don't know what the execution time of your kernel was, but it shouldn't be too high even for global = 65536.
When you say "... even fail in some large condition case.", do you mean that you get bad results or the execution does not complete ?
Best regards,
Daouda