Contributions

Re: How to obtain the values of get_local_id(0), get_group(0) and get_local_range(0) in single_task
Thanks, I will try to modify the code. Besides unrolling, what else could I do to optimize this code? Any suggestions?

Re: How to obtain the values of get_local_id(0), get_group(0) and get_local_range(0) in single_task
Thanks for your help. I am waiting for your support reply.

Re: how to change h.parallel_for(range(M, P), [=](auto index) to single_task function
Hi Aikeu,
Following the linked tutorial to modify it, I get a wrong result. Could you show me how to set up the for loop inside single_task in this case (h.parallel_for(range(M, P), [=](auto index) { )?

how to change h.parallel_for(range(M, P), [=](auto index) to single_task function
Hi support team,
I modified the code from the oneAPI sample matrix_mul, and I want to run it on FPGA hardware. I am not sure how to convert these three parallel_for kernels to single_task. Could you give me some suggestions? Below is the code:

#if FPGA_EMULATOR
  // DPC++ extension: FPGA emulator selector on systems without FPGA card.
  ext::intel::fpga_emulator_selector d_selector;
#elif FPGA
  // DPC++ extension: FPGA selector on systems with FPGA card.
  ext::intel::fpga_selector d_selector;
#else
  // The default device selector will select the most performant device.
  default_selector d_selector;
#endif

  try {
    queue q(d_selector, dpc_common::exception_handler);

    cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";

    // Create 2D buffers for matrices; buffer c is bound with host memory c_back.
    buffer<float, 2> a_buf(range(M, N));
    buffer<float, 2> b_buf(range(N, P));
    buffer c_buf(reinterpret_cast<float*>(c_back), range(M, P));

    cout << "Problem size: c(" << M << "," << P << ") = a(" << M << "," << N
         << ") * b(" << N << "," << P << ")\n";

    // Using three command groups to illustrate execution order. The use of the
    // first two command groups for initializing matrices is not the most
    // efficient way. It just demonstrates the implicit multiple command group
    // execution ordering.

    // Submit command group to queue to initialize matrix a.
    // Start the clock.
    // dpc_common::TimeInterval kernel_runtime;
    dpc_common::TimeInterval kernel_e_a_runtime;
    auto e_a = q.submit([&](auto& h) {
      // Get write-only access to the buffer on a device.
      accessor a(a_buf, h, write_only);
      // Execute kernel.
      h.parallel_for(range(M, N), [=](auto index) {
        // Each element of matrix a is 1.
        a[index] = 1.0f;
      });
    });
    double elapsed_e_a_time = kernel_e_a_runtime.Elapsed();

    dpc_common::TimeInterval kernel_e_b_runtime;
    // Submit command group to queue to initialize matrix b.
    auto e_b = q.submit([&](auto& h) {
      // Get write-only access to the buffer on a device.
      accessor b(b_buf, h, write_only);
      // Execute kernel.
      h.parallel_for(range(N, P), [=](auto index) {
        // Each column of b is the sequence 1,2,...,N.
        b[index] = index[0] + 1.0f;
      });
    });
    double elapsed_e_b_time = kernel_e_b_runtime.Elapsed();

    dpc_common::TimeInterval kernel_e_c_runtime;
    // Submit command group to queue to multiply matrices: c = a * b.
    auto e_c = q.submit([&](auto& h) {
      // Read from a and b, write to c.
      accessor a(a_buf, h, read_only);
      accessor b(b_buf, h, read_only);
      accessor c(c_buf, h, write_only);

      int width_a = a_buf.get_range()[1];

      // Execute kernel.
      h.parallel_for(range(M, P), [=](auto index) {
      // h.single_task<c_calc>([=]() [[intel::kernel_args_restrict]] {
      //   for (int i = 0; i < M; i++) {
      // #pragma unroll 1
      //     for (int j = 0; j < P; j++) {
        // Get global position in Y direction.
        int row = index[0];  // int row = j;
        // Get global position in X direction.
        int col = index[1];  // int col = i;

        float sum = 0.0f;
        // Compute the result of one element of c.
      // #pragma unroll 1
        for (int i = 0; i < width_a; i++) {
          sum += a[row][i] * b[i][col];
        }
        c[index] = sum;  // c[i][j] = sum;
      //     }
      //   }
      });
    });

[Solved] Re: How to obtain the values of get_local_id(0), get_group(0) and get_local_range(0) in single_task
Hi Daouda,
Below is my current single_task function. Could you tell me whether there is any issue with how I added the unroll optimization? OK, I will study the optimization guide. Thanks a lot.

h.single_task<class bude_kernel>([=]() [[intel::kernel_args_restrict]] {
#pragma unroll 2
  [[intelfpga::initiation_interval(1)]]
  for (size_t id = 0; id < DEFAULT_WGSIZE; id++) {
    const size_t lid = id;
    const size_t gid = id;
    const size_t lrange = 1;

    float etot[NUM_TD_PER_THREAD];
    cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
    cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];

    size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
    ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;

#pragma unroll 2
    [[intelfpga::initiation_interval(1)]]
    for (int i = lid; i < ntypes; i += lrange)
      local_forcefield[i] = forcefield[i];

#pragma unroll 2
    [[intelfpga::initiation_interval(1)]]
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
      size_t index = ix + i * lrange;
      const float sx = cl::sycl::sin(transforms_0[index]);
      const float cx = cl::sycl::cos(transforms_0[index]);
      const float sy = cl::sycl::sin(transforms_1[index]);
      const float cy = cl::sycl::cos(transforms_1[index]);
      const float sz = cl::sycl::sin(transforms_2[index]);
      const float cz = cl::sycl::cos(transforms_2[index]);

      transform[i][0].x() = cy * cz;
      transform[i][0].y() = sx * sy * cz - cx * sz;
      transform[i][0].z() = cx * sy * cz + sx * sz;
      transform[i][0].w() = transforms_3[index];
      transform[i][1].x() = cy * sz;
      transform[i][1].y() = sx * sy * sz + cx * cz;
      transform[i][1].z() = cx * sy * sz - sx * cz;
      transform[i][1].w() = transforms_4[index];
      transform[i][2].x() = -sy;
      transform[i][2].y() = sx * cy;
      transform[i][2].z() = cx * cy;
      transform[i][2].w() = transforms_5[index];

      etot[i] = ZERO;
    }

    // item.barrier(access::fence_space::local_space);

    // Loop over ligand atoms.
    size_t il = 0;
    do {
      // Load ligand atom data.
      const Atom l_atom = ligand_molecule[il];
      const FFParams l_params = local_forcefield[l_atom.type];
      const bool lhphb_ltz = l_params.hphb < ZERO;
      const bool lhphb_gtz = l_params.hphb > ZERO;

      const cl::sycl::float4 linitpos(l_atom.x, l_atom.y, l_atom.z, ONE);
#pragma unroll 2
      [[intelfpga::initiation_interval(1)]]
      for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
        lpos[i].x() = transform[i][0].w() +
                      linitpos.x() * transform[i][0].x() +
                      linitpos.y() * transform[i][0].y() +
                      linitpos.z() * transform[i][0].z();
        lpos[i].y() = transform[i][1].w() +
                      linitpos.x() * transform[i][1].x() +
                      linitpos.y() * transform[i][1].y() +
                      linitpos.z() * transform[i][1].z();
        lpos[i].z() = transform[i][2].w() +
                      linitpos.x() * transform[i][2].x() +
                      linitpos.y() * transform[i][2].y() +
                      linitpos.z() * transform[i][2].z();
      }

      size_t ip = 0;
      do {
        const Atom p_atom = protein_molecule[ip];
        const FFParams p_params = local_forcefield[p_atom.type];

        const float radij = p_params.radius + l_params.radius;
        const float r_radij = 1.f / radij;

        const float elcdst =
            (p_params.hbtype == HBTYPE_F && l_params.hbtype == HBTYPE_F) ? FOUR : TWO;
        const float elcdst1 =
            (p_params.hbtype == HBTYPE_F && l_params.hbtype == HBTYPE_F) ? QUARTER : HALF;
        const bool type_E =
            (p_params.hbtype == HBTYPE_E || l_params.hbtype == HBTYPE_E);

        const bool phphb_ltz = p_params.hphb < ZERO;
        const bool phphb_gtz = p_params.hphb > ZERO;
        const bool phphb_nz = p_params.hphb != ZERO;
        const float p_hphb = p_params.hphb * (phphb_ltz && lhphb_gtz ? -ONE : ONE);
        const float l_hphb = l_params.hphb * (phphb_gtz && lhphb_ltz ? -ONE : ONE);
        const float distdslv =
            (phphb_ltz ? (lhphb_ltz ? NPNPDIST : NPPDIST)
                       : (lhphb_ltz ? NPPDIST : -FloatMax));
        const float r_distdslv = 1.f / distdslv;

        const float chrg_init = l_params.elsc * p_params.elsc;
        const float dslv_init = p_hphb + l_hphb;

#pragma unroll 2
        [[intelfpga::initiation_interval(1)]]
        for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
          const float x = lpos[i].x() - p_atom.x;
          const float y = lpos[i].y() - p_atom.y;
          const float z = lpos[i].z() - p_atom.z;
          const float distij = cl::sycl::sqrt(x * x + y * y + z * z);

          const float distbb = distij - radij;
          const bool zone1 = (distbb < ZERO);

          etot[i] += (ONE - (distij * r_radij)) * (zone1 ? 2 * HARDNESS : ZERO);

          float chrg_e = chrg_init *
                         ((zone1 ? 1 : (ONE - distbb * elcdst1)) *
                          (distbb < elcdst ? 1 : ZERO));
          const float neg_chrg_e = -cl::sycl::fabs(chrg_e);
          chrg_e = type_E ? neg_chrg_e : chrg_e;
          etot[i] += chrg_e * CNSTNT;

          const float coeff = (ONE - (distbb * r_distdslv));
          float dslv_e = dslv_init * ((distbb < distdslv && phphb_nz) ? 1 : ZERO);
          dslv_e *= (zone1 ? 1 : coeff);
          etot[i] += dslv_e;
        }
      } while (++ip < natpro);  // loop over protein atoms
    } while (++il < natlig);    // loop over ligand atoms

    const size_t td_base = gid * lrange * NUM_TD_PER_THREAD + lid;
    if (td_base < nposes) {
#pragma unroll 2
      [[intelfpga::initiation_interval(1)]]
      for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
        etotals[td_base + i * lrange] = etot[i] * HALF;
      }
    }
  }
});

Re: run ONEAPI devcloud FPGA Hardware issues
Hi Aik,
Thanks. I will use the Arria 10 to test my own project for now; if you fix the Stratix 10 issue, please inform me, because I intend to compare different FPGA devices on my project.

Re: How to obtain the values of get_local_id(0), get_group(0) and get_local_range(0) in single_task
Hi Daouda,
I tried to increase the unroll factor (4, 8, 16, ...), but it causes a RAM utilization problem, as shown in the picture below. Do you have any recommendation? Another question: if I don't set any unroll factor (just #pragma unroll), will the compiler choose the factor automatically?
aoc: Warning: RAM Utilization is at 128%

Re: run ONEAPI devcloud FPGA Hardware issues
Hi Aik,
I don't think it is a problem with the code itself, because I can use the Arria 10 normally. Below is the error message.

Re: run ONEAPI devcloud FPGA Hardware issues
Hi Aik,
I have tried s001-n142, s001-n143, and s001-n144, but the Stratix 10 nodes still show the error.

Re: How to obtain the values of get_local_id(0), get_group(0) and get_local_range(0) in single_task
Hi support team / Daouda,
Following your suggestion to modify the single_task function for the kernel, I now get the result I expect. However, if I make the for-loop bound too large (size_t id = 0; id < global (I will use 65536, or 65536/4, 65536/8, ...); id++), the FPGA hardware (Arria 10) runs for a very long time and even fails in some large cases. Do you have any suggestion on how to optimize it? Currently I only add unroll (with factor 1) on the loop. Could you suggest other optimization methods I can use in my case? I need to decrease the kernel execution time when using FPGA hardware. Below is my modified code (I use id < 4 for now):

h.single_task<class bude_kernel>([=]() [[intel::kernel_args_restrict]] {
#pragma unroll 1
  for (size_t id = 0; id < 4; id++) {
    const size_t lid = id;
    const size_t gid = id;
    const size_t lrange = 1;

    float etot[NUM_TD_PER_THREAD];
    cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
    cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];

    size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
    ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;

#pragma unroll 1
    for (int i = lid; i < ntypes; i += lrange)
      local_forcefield[i] = forcefield[i];

#pragma unroll 1
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
      size_t index = ix + i * lrange;
      . . . . .