Cannot compile for Stratix 10 on DevCloud, plus questions on attributes / local-memory performance
Hi,
==== First part) Stratix10 ====
We tried several approaches without luck, so we attach the exact steps from an interactive session on an fpga_compile machine. We tried your example directly (obtained via oneapi-cli). We hope you can assist us.
u148129@s001-n057:~/fpga_compile$ mkdir build
u148129@s001-n057:~/fpga_compile$ cd build/
u148129@s001-n057:~/fpga_compile/build$ cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10
-- The CXX compiler identification is Clang 14.0.0
-- Check for working CXX compiler: /glob/development-tools/versions/oneapi/2022.1.2/oneapi/compiler/2022.0.2/linux/bin/dpcpp
-- Check for working CXX compiler: /glob/development-tools/versions/oneapi/2022.1.2/oneapi/compiler/2022.0.2/linux/bin/dpcpp -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring the design to run on FPGA board intel_s10sx_pac:pac_s10
-- Configuring done
-- Generating done
-- Build files have been written to: /home/u148129/fpga_compile/build
u148129@s001-n057:~/fpga_compile/build$ make report
Scanning dependencies of target fpga_compile_report.a
[ 50%] Building CXX object src/CMakeFiles/fpga_compile_report.a.dir/fpga_compile.cpp.o
[100%] Linking CXX executable ../fpga_compile_report.a
Can't opendir /glob/development-tools/versions/oneapi/2022.1.2/oneapi/intelfpgadpcpp/2022.1.0/board/de10_agilex: Permission denied at /glob/development-tools/versions/oneapi/2022.1.2/oneapi/compiler/2022.0.2/linux/lib/oclfpga/share/lib/perl/acl/Common.pm line 717.
llvm-foreach: dpcpp: error: fpga compiler command failed with exit code 13 (use -v to see invocation)
make[3]: *** [src/CMakeFiles/fpga_compile_report.a.dir/build.make:84: fpga_compile_report.a] Error 13
make[2]: *** [CMakeFiles/Makefile2:181: src/CMakeFiles/fpga_compile_report.a.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:134: src/CMakeFiles/report.dir/rule] Error 2
make: *** [Makefile:131: report] Error 2
For context, here are the node request and the devices listed:
# requesting the machine
u148129@login-2:~/logs$ qsub -I -l nodes=1:fpga_compile:ppn=2 -d .
qsub: waiting for job 1874662.v-qsvr-1.aidevcloud to start
qsub: job 1874662.v-qsvr-1.aidevcloud ready

########################################################################
# Date: Thu 31 Mar 2022 04:33:10 AM PDT
# Job ID: 1874662.v-qsvr-1.aidevcloud
# User: u148129
# Resources: neednodes=1:fpga_compile:ppn=2,nodes=1:fpga_compile:ppn=2,walltime=06:00:00
########################################################################
u148129@s001-n057:~/fpga_compile/build$ aocl list-devices
/glob/development-tools/versions/oneapi/2022.1.2/oneapi/intelfpgadpcpp/2022.1.0/board/intel_a10gx_pac/linux64/libexec/diagnose: error while loading shared libraries: libopae-c.so.1: cannot open shared object file: No such file or directory
--------------------------------------------------------------------
Warning: No devices attached for package:
  /glob/development-tools/versions/oneapi/2022.1.2/oneapi/intelfpgadpcpp/2022.1.0/board/intel_a10gx_pac
--------------------------------------------------------------------
- We cannot compile for Stratix10, our target, on an fpga_compile node. If we try a "stratix10" node it is even worse (that is an fpga_runtime machine, so more errors there are somewhat expected).
- There are errors both when compiling and when querying with aocl. It also emits messages about "intel_a10gx_pac", a board we did not select. If you look into those directories, there are also files for the Stratix10 board (s10sx), but somehow it does not work.
What are the problems here? How can we solve them? We are just running the basic example.
==== Second part) Performance ====
If we compile a simple matrix multiplication for the default board, Arria10 (although that is not the board we are interested in), it compiles and emits the report. However, the performance is poor compared with the report from our pure OpenCL-FPGA version. We would like to know what should be changed to increase, for example, the memory bandwidth: in many places the report shows only 32-bit accesses, not 2048-bit ones as in OpenCL FPGA. Our current code follows.
#include <CL/sycl.hpp>
#include <cmath>
#include <iostream>
#include <limits>
//#include <CL/sycl/INTEL/fpga_extensions.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities/<version>/include/dpc_common.hpp
#include "dpc_common.hpp"

using namespace std;
using namespace sycl;

class a_init;
class b_init;
class c_calc;

/**
 * Each element of the product matrix c[i][j] is computed from a unique row and
 * column of the factor matrices, a[i][k] and b[k][j]
 */

// Matrix size constants.
//#define m_size 512 // Must be a multiple of 8.
#define N 4096
#define BL 4
#define SIMD 2

/**
 * Perform matrix multiplication on host to verify results from device.
 */
int VerifyResult(float (*c_back)[N]);

double GetExecutionTime(const event &e) {
  double start_k = e.get_profiling_info<info::event_profiling::command_start>();
  double end_k = e.get_profiling_info<info::event_profiling::command_end>();
  double kernel_time = (end_k - start_k) * 1e-9; // ns to s
  return kernel_time;
}

int main() {
  // Host memory buffer that device will write data back before destruction.
  float(*c_back)[N] = new float[N][N];

  // Initialize c_back
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      c_back[i][j] = 0.0f;

  // Initialize the device queue with the default selector. The device queue is
  // used to enqueue kernels. It encapsulates all states needed for execution.
  try {
    //sycl::INTEL::fpga_selector _device;
    ext::intel::fpga_selector _device;
    queue q(_device, dpc_common::exception_handler,
            cl::sycl::property::queue::enable_profiling());

    cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";

    // Create 2D buffers for matrices, buffer c is bound with host memory c_back
    // These are all Global Memory
    //buffer<float, 2> a(range(N, N));
    buffer<float, 1> aa(range(N * N));
    //buffer<float, 2> b(range(N, N));
    buffer<float, 1> bb(range(N * N));
    buffer c(reinterpret_cast<float *>(c_back), range(N, N));

    cout << "Problem size: c(" << N << "," << N << ") = a(" << N << "," << N
         << ") * b(" << N << "," << N << ")\n";

    // Submit command group to queue to multiply matrices: c = a * b
    auto e_c = q.submit([&](handler &h) {
      // Read from a and b, write to c
      auto A = aa.get_access<access::mode::read>(h);
      auto B = bb.get_access<access::mode::read>(h);
      auto C = c.get_access<access::mode::write>(h);

      //local_accessor<float, 2> A_local(range<2>{BL, BL}, h);
      //local_accessor<float, 2> B_local(range<2>{BL, BL}, h);
      /*accessor<float, 2, access::mode::read_write, access::target::local>
          [[intel::numbanks(198), intel::bankwidth(2048)]]
          A_local(range<2>{BL, BL}, h);
      accessor<float, 2, access::mode::read_write, access::target::local>
          [[intel::numbanks(12), intel::bankwidth(2048)]]
          B_local(range<2>{BL, BL}, h);*/
      accessor<float, 2, access::mode::read_write, access::target::local>
          A_local(range<2>{BL, BL}, h);
      accessor<float, 2, access::mode::read_write, access::target::local>
          B_local(range<2>{BL, BL}, h);

      range<2> num_groups(N, N);
      range<2> num_items(BL, BL);

      h.parallel_for<c_calc>(
          nd_range<2>(num_groups, num_items),
          [=](nd_item<2> item) [[intel::kernel_args_restrict,
                                 intel::max_work_group_size(1, BL, BL),
                                 sycl::reqd_work_group_size(1, BL, BL),
                                 intel::num_simd_work_items(SIMD)]] {
            /*[[intel::doublepump, intel::fpga_memory("MLAB"),
               intel::numbanks(1), intel::max_replicates(3)]]
            float A_local[BL][BL];*/
            /*auto ptr = group_local_memory_for_overwrite<int[64]>(item.get_group());
            auto ptrA = group_local_memory_for_overwrite<float[BL][BL]>(item.get_group());
            auto &A_local = *ptrA;*/
            /*[[intel::doublepump, intel::fpga_memory("MLAB"),
               intel::numbanks(16), intel::max_replicates(3)]]
            float B_local[BL][BL];*/
            /*auto ptrB = group_local_memory_for_overwrite<float[BL][BL]>(item.get_group());
            auto &B_local = *ptrB;*/

            int block_x = item.get_group().get_id(0);
            int block_y = item.get_group().get_id(1);
            int local_x = item.get_local_id(0);
            int local_y = item.get_local_id(1);

            int a_start = N * BL * block_x;
            int a_end = a_start + N - 1;
            int b_start = BL * block_y;

            float sum = 0.0f;
            for (int a = a_start, b = b_start; a <= a_end;
                 a += BL, b += (BL * N)) {
              A_local[local_x][local_y] = A[a + N * local_x + local_y];
              B_local[local_y][local_x] = B[b + N * local_x + local_y];

              item.barrier(access::fence_space::local_space);

              //#pragma unroll (BL/SIMD)
              #pragma unroll
              for (int k = 0; k < BL; ++k) {
                float aaa = A_local[local_x][k];
                float bbb = B_local[local_y][k];
                sum += aaa * bbb;
                //sum += A_local[local_x][k] * B_local[local_y][k];
              }

              item.barrier(access::fence_space::local_space);
            }

            C[item.get_global_id(0)][item.get_global_id(1)] = sum;
          });
    });

    //double a_time = GetExecutionTime(e_a);
    //double b_time = GetExecutionTime(e_b);
    double c_time = GetExecutionTime(e_c);
    double input_size_kb = (2 * N) * sizeof(float) / (1024);

    //std::cout << "Kernel throughput initializing a: "
    //          << (input_size_kb/a_time) << " KB/s \n";
    // this one is larger and not accurate. Some device initialization is
    // included here.
    //std::cout << "Kernel throughput initializing b: "
    //          << (input_size_kb/b_time) << " KB/s \n";
    std::cout << "Kernel throughput calculating c: "
              << (input_size_kb / c_time) << " KB/s \n";
  } catch (sycl::exception const &e) {
    cout << "An exception is caught while multiplying matrices.\n";
    terminate();
  }

  int result;
  cout << "Result of matrix multiplication using DPC++: ";
  result = VerifyResult(c_back);
  delete[] c_back;
  return result;
}

bool ValueSame(float a, float b) {
  return fabs(a - b) < numeric_limits<float>::epsilon();
}

int VerifyResult(float (*c_back)[N]) {
  // Check that the results are correct by comparing with host computing.
  int i, j, k;

  // 2D arrays on host side.
  float(*a_host)[N] = new float[N][N];
  float(*b_host)[N] = new float[N][N];
  float(*c_host)[N] = new float[N][N];

  // Each element of matrix a is 1.
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      a_host[i][j] = 1.0f;

  // Each column of b_host is the sequence 1,2,...,N
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      b_host[i][j] = i + 1.0f;

  // c_host is initialized to zero.
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      c_host[i][j] = 0.0f;

  for (i = 0; i < N; i++) {
    for (k = 0; k < N; k++) {
      // Each element of the product is just the sum 1+2+...+n
      for (j = 0; j < N; j++) {
        c_host[i][j] += a_host[i][k] * b_host[k][j];
      }
    }
  }

  bool mismatch_found = false;

  // Compare host side results with the result buffer from device side: print
  // mismatched data 50 times only.
  int print_count = 0;

  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      if (!ValueSame(c_back[i][j], c_host[i][j])) {
        cout << "Fail - The result is incorrect for element: [" << i << ", "
             << j << "], expected: " << c_host[i][j]
             << ", but found: " << c_back[i][j] << "\n";
        mismatch_found = true;
        print_count++;
        if (print_count == 50) break;
      }
    }
    if (print_count == 50) break;
  }

  delete[] a_host;
  delete[] b_host;
  delete[] c_host;

  if (!mismatch_found) {
    cout << "Success - The results are correct!\n";
    return 0;
  } else {
    cout << "Fail - The results mismatch!\n";
    return -1;
  }
}
Question 1) Is local memory used properly for the FPGA with these accessors (access::target::local)?
Question 2) Defining the attributes is really important for performance (as we saw in OpenCL), and we want to be sure we are using them correctly. If we use intel::reqd_work_group_size, the compiler emits an unknown-attribute warning, while sycl::reqd_work_group_size says nothing. However, we do not know whether it actually took effect.
block_matrix_mul_dpcpp.cpp:111:9: warning: unknown attribute 'reqd_work_group_size' ignored [-Wunknown-attributes]
        intel::reqd_work_group_size(1,BL,BL),
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
Can't opendir /glob/development-tools/versions/oneapi/2022.1.2/oneapi/intelfpgadpcpp/2022.1.0/board/de10_agilex: Permission denied at /glob/development-tools/versions/oneapi/2022.1.2/oneapi/compiler/2022.0.2/linux/lib/oclfpga/share/lib/perl/acl/Common.pm line 717.
llvm-foreach: dpcpp: error: fpga compiler command failed with exit code 13 (use -v to see invocation)
Thanks for your time.