DPC++ optimizations for FPGA and using report.html
Environment: Ubuntu 18.04.4, OneAPI
I guess my broader question is how does one control/guide the optimization and how can you observe/verify that the compiler "gets it right". (I think this would be a great addition to the DPC++ FPGA optimization document
Most specifically given the following "toy" kernel code:
q.submit([&](sycl::handler& cgh) { auto xaccessor = xbuffer.get_access<sycl::access::mode::read_write, sycl::access::target::global_buffer>(cgh); cgh.parallel_for<class k0a>( sycl::range<1> {Nproc}, [=] (sycl::item<1> item) { unsigned long gidx = item.get_linear_id(); LCG48 lcg48(xaccessor[gidx]); for (int i=0; i<Niter; i++) lcg48.step(); xaccessor[gidx] = lcg48.get(); } ); });
How does one get DPC++ to unroll the lcg48.step() loop and pipeline the work items? I've tried various "unroll", "max_concurrency", "max_interleaving", etc... but I must be missing something because I don't recognize that any optimization is happening (at least according to the compiler output and/or report.html). Based on the documentation DPC++ should also be able to do static coalesce the memory access but I don't see that either.
Just for completeness here's a full program to experiment with....
#include <CL/sycl.hpp> #include <cstdio> #include <unistd.h> #include <algorithm> namespace sycl = cl::sycl; const int Nproc=20; const int Niter=5; class LCG48 { public : LCG48() { state = 0x330E; } LCG48(unsigned int x) { state = (long(x)<<16) + 0x330E; } void seed(unsigned int x) { state = (long(x)<<16) + 0x330E; } long int step() { state = ((0x5DEECE66D * (state) + 0xB) % (1L<<48)); return (0x7FFFFFFF & (state >> 17)); } long int get() { return (0x7FFFFFFF & (state >> 17)); } private: unsigned long state; }; int main(int argc, char *argv[]) { unsigned long t1, t2; int xdata[Nproc]; for (int i=0; i<Nproc; i++) xdata[i] = i; LCG48 lcg48; for (int i=0; i<Nproc; i++) { lcg48.seed(i); for (int j=0; j<Niter; j++) lcg48.step(); if (i<8) printf("%08X ", (unsigned int)lcg48.get()); } printf("\n"); /** ** Choose a device **/ //sycl::device dev = sycl::default_selector().select_device(); //sycl::device dev = sycl::host_selector().select_device(); //sycl::device dev = sycl::gpu_selector().select_device(); //sycl::device dev = sycl::cpu_selector().select_device(); sycl::device dev = sycl::accelerator_selector().select_device(); //intel::fpga_emulator_selector dev //intel::fpga_selector dev; std::cout << "Device: " << "name: " << dev.get_info<sycl::info::device::name>() << std::endl << "vendor: " << dev.get_info<sycl::info::device::vendor>() << std::endl; sycl::queue q(dev); /** ** Parallel For **/ for (int i=0; i<Nproc; i++) xdata[i] = i+1; { sycl::buffer<int, 1> xbuffer((int *)xdata, sycl::range<1> {Nproc}); q.submit([&](sycl::handler& cgh) { auto xaccessor = xbuffer.get_access<sycl::access::mode::read_write, sycl::access::target::global_buffer>(cgh); cgh.parallel_for<class k0a>( sycl::range<1> {Nproc}, [=] (sycl::item<1> item) { unsigned long gidx = item.get_linear_id(); LCG48 lcg48(xaccessor[gidx]); for (int i=0; i<Niter; i++) lcg48.step(); xaccessor[gidx] = lcg48.get(); } ); }); } for (int i=0; i<std::min(Nproc,8); i++) printf("%08X ", xdata[i]); printf("\n"); /** ** Single task **/ for (int i=0; i<Nproc; i++) xdata[i] = i+0; { sycl::buffer<int, 1> xbuffer((int *)xdata, sycl::range<1> {Nproc}); q.submit([&](sycl::handler& cgh) { auto xaccessor = xbuffer.get_access<sycl::access::mode::read_write, sycl::access::target::global_buffer>(cgh); cgh.single_task<class k1a>( [=] () { for (int gidx=0; gidx<Nproc; gidx++) { LCG48 lcg48(xaccessor[gidx]); for (int i=0; i<Niter; i++) lcg48.step(); xaccessor[gidx] = lcg48.get(); } } ); }); } for (int i=0; i<std::min(Nproc,8); i++) printf("%08X ", xdata[i]); printf("\n"); }
... which I compile with:
dpcpp -O3 -g -mavx2 -fintelfpga -fsycl-link -Xshardware -Xsboard=intel_a10gx_pac:pac_a10 fpga1b.cpp -lOpenCL -lsycl