II is an approximation due to the following stallable instructions
Hi I am analyzing the report from oneAPI FPGA report generation.
I am currently facing
Compiler failed to schedule this loop with smaller II due to memory dependency
So I came back to a simple vector add example provided from github oneAPI C++_SYCL_FPGA, but I am still seeing the same errors
Another message that concerns me is
II is an approximation due to the following stallable instructions:
Load Operation (handler.hpp: 1531 > vector_add.cpp: 19)
Load Operation (handler.hpp: 1531 > vector_add.cpp: 20)
Store Operation (handler.hpp: 1531 > vector_add.cpp: 22)
In my application, I also need to load data from global memory, compute and store back to global memory.
Can you suggest me a way to resolve this issue?
the source code of vector_add.cpp:
#include <iostream> // oneAPI headers #include <sycl/ext/intel/fpga_extensions.hpp> #include <sycl/sycl.hpp> // Forward declare the kernel name in the global scope. This is an FPGA best // practice that reduces name mangling in the optimization reports. class VectorAddID; struct VectorAdd { int *const vec_a_in; int *const vec_b_in; int *const vec_c_out; int len; void operator()() const { for (int idx = 0; idx < len; idx++) { int a_val = vec_a_in[idx]; int b_val = vec_b_in[idx]; int sum = a_val + b_val; vec_c_out[idx] = sum; } } }; constexpr int kVectSize = 256; int main() { bool passed = true; try { // Use compile-time macros to select either: // - the FPGA emulator device (CPU emulation of the FPGA) // - the FPGA device (a real FPGA) // - the simulator device #if FPGA_SIMULATOR auto selector = sycl::ext::intel::fpga_simulator_selector_v; #elif FPGA_HARDWARE auto selector = sycl::ext::intel::fpga_selector_v; #else // #if FPGA_EMULATOR auto selector = sycl::ext::intel::fpga_emulator_selector_v; #endif // create the device queue sycl::queue q(selector); auto device = q.get_device(); std::cout << "Running on device: " << device.get_info<sycl::info::device::name>().c_str() << std::endl; if (!device.has(sycl::aspect::usm_host_allocations)) { std::terminate(); } // declare arrays and fill them // allocate in shared memory so the kernel can see them int *vec_a = sycl::malloc_shared<int>(kVectSize, q); int *vec_b = sycl::malloc_shared<int>(kVectSize, q); int *vec_c = sycl::malloc_shared<int>(kVectSize, q); for (int i = 0; i < kVectSize; i++) { vec_a[i] = i; vec_b[i] = (kVectSize - i); } std::cout << "add two vectors of size " << kVectSize << std::endl; q.single_task<VectorAddID>(VectorAdd{vec_a, vec_b, vec_c, kVectSize}) .wait(); // verify that vec_c is correct for (int i = 0; i < kVectSize; i++) { int expected = vec_a[i] + vec_b[i]; if (vec_c[i] != expected) { std::cout << "idx=" << i << ": result " << vec_c[i] << ", expected (" << expected << ") A=" << vec_a[i] << " + B=" << vec_b[i] << std::endl; passed = false; } } std::cout << (passed ? "PASSED" : "FAILED") << std::endl; sycl::free(vec_a, q); sycl::free(vec_b, q); sycl::free(vec_c, q); } catch (sycl::exception const &e) { // Catches exceptions in the host code. std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n"; // Most likely the runtime couldn't find FPGA hardware! if (e.code().value() == CL_DEVICE_NOT_FOUND) { std::cerr << "If you are targeting an FPGA, please ensure that your " "system has a correctly configured FPGA board.\n"; std::cerr << "Run sys_check in the oneAPI root directory to verify.\n"; std::cerr << "If you are targeting the FPGA emulator, compile with " "-DFPGA_EMULATOR.\n"; } std::terminate(); } return passed ? EXIT_SUCCESS : EXIT_FAILURE; }
the full message from loop analysis details:
- Hyper-Optimized loop structure: disabled.
- Memory dependency
- Compiler failed to schedule this loop with smaller II due to memory dependency:
- From: Load Operation (handler.hpp: 1531>vector_add.cpp: 19)
- To: Store Operation (handler.hpp: 1531>vector_add.cpp: 22)
- Compiler failed to schedule this loop with smaller II due to memory dependency:
- From: Load Operation (handler.hpp: 1531>vector_add.cpp: 20)
- To: Store Operation (handler.hpp: 1531>vector_add.cpp: 22)
- Most critical loop feedback path during scheduling:
- 70.00 clock cycles Load Operation (handler.hpp: 1531>vector_add.cpp: 19)
- 10.00 clock cycles Store Operation (handler.hpp: 1531>vector_add.cpp: 22)
- 1.16 clock cycle 32-bit Integer Add Operation (handler.hpp: 1531>vector_add.cpp: 21)
- II is an approximation due to the following stallable instructions:
- Load Operation (handler.hpp: 1531>vector_add.cpp: 19)
- Load Operation (handler.hpp: 1531>vector_add.cpp: 20)
- Store Operation (handler.hpp: 1531>vector_add.cpp: 22)
- Maximum concurrent iterations: Capacity of loop
- Use theLoop Analysisviewer to estimate capacity
- SeeFPGA Handbook : Loopsfor more information