DPC++ optimizations for FPGA and using report.html

Question

Environment: Ubuntu 18.04.4, OneAPI
I guess my broader question is how does one control/guide the optimization and how can you observe/verify that the compiler "gets it right". (I think this would be a great addition to the DPC++ FPGA optimization document )
Most specifically given the following "toy" kernel code:
    q.submit([&amp;](sycl::handler&amp; cgh) {
      auto xaccessor = xbuffer.get_access&lt;sycl::access::mode::read_write, sycl::access::target::global_buffer&gt;(cgh);
      cgh.parallel_for&lt;class k0a&gt;(
        sycl::range&lt;1&gt; {Nproc},
        [=] (sycl::item&lt;1&gt; item) {
          unsigned long gidx = item.get_linear_id();
          LCG48 lcg48(xaccessor[gidx]);
          for (int i=0; i&lt;Niter; i++) lcg48.step();
          xaccessor[gidx] = lcg48.get();
        }
      );
    });

How does one get DPC++ to unroll the lcg48.step() loop and pipeline the work items?  I've tried various  "unroll", "max_concurrency", "max_interleaving", etc... but I must be missing something because I don't recognize that any optimization is happening (at least according to the compiler output and/or report.html). Based on the  documentation DPC++ should also be able to do static coalesce the memory access but I don't see that either.
Just for completeness here's a full program to experiment with....
#include &lt;CL/sycl.hpp&gt;
#include &lt;cstdio&gt;
#include &lt;unistd.h&gt;
#include &lt;algorithm&gt;

namespace sycl = cl::sycl;

const int Nproc=20;
const int Niter=5;

class LCG48 {
  public :
    LCG48() { state = 0x330E; }
    LCG48(unsigned int x) { state = (long(x)&lt;&lt;16) + 0x330E; }
    void seed(unsigned int x) { state = (long(x)&lt;&lt;16) + 0x330E; }
    long int step() {
      state = ((0x5DEECE66D * (state) + 0xB) % (1L&lt;&lt;48));
      return (0x7FFFFFFF &amp; (state &gt;&gt; 17));
    }
    long int get() {
      return (0x7FFFFFFF &amp; (state &gt;&gt; 17));
    }
  private:
    unsigned long state;
};

int 
main(int argc, char *argv[])
{
  unsigned long t1, t2;
  int xdata[Nproc];

for (int i=0; i&lt;Nproc; i++) xdata[i] = i;
  LCG48 lcg48;
  for (int i=0; i&lt;Nproc; i++) {
    lcg48.seed(i);
    for (int j=0; j&lt;Niter; j++) lcg48.step();
    if (i&lt;8) printf("%08X ", (unsigned int)lcg48.get());
  }
  printf("
");

/** 
   ** Choose a device 
   **/
  //sycl::device dev = sycl::default_selector().select_device();
  //sycl::device dev = sycl::host_selector().select_device();
  //sycl::device dev = sycl::gpu_selector().select_device();
  //sycl::device dev = sycl::cpu_selector().select_device();
  sycl::device dev = sycl::accelerator_selector().select_device();
  //intel::fpga_emulator_selector dev
  //intel::fpga_selector dev;

std::cout &lt;&lt; "Device: " 
        &lt;&lt; "name: " &lt;&lt; dev.get_info&lt;sycl::info::device::name&gt;() &lt;&lt; std::endl
        &lt;&lt; "vendor: " &lt;&lt; dev.get_info&lt;sycl::info::device::vendor&gt;() &lt;&lt; std::endl;
  sycl::queue q(dev);

/**
   ** Parallel For
   **/
  for (int i=0; i&lt;Nproc; i++) xdata[i] = i+1;
  {
    sycl::buffer&lt;int, 1&gt; xbuffer((int *)xdata, sycl::range&lt;1&gt; {Nproc});
    q.submit([&amp;](sycl::handler&amp; cgh) {
      auto xaccessor = xbuffer.get_access&lt;sycl::access::mode::read_write, sycl::access::target::global_buffer&gt;(cgh);
      cgh.parallel_for&lt;class k0a&gt;(
        sycl::range&lt;1&gt; {Nproc},
        [=] (sycl::item&lt;1&gt; item) {
          unsigned long gidx = item.get_linear_id();
          LCG48 lcg48(xaccessor[gidx]);
          for (int i=0; i&lt;Niter; i++) lcg48.step();
          xaccessor[gidx] = lcg48.get();
        }
      );
    });
  }
  for (int i=0; i&lt;std::min(Nproc,8); i++) printf("%08X ", xdata[i]); printf("
");

/**
   **  Single task
   **/
  for (int i=0; i&lt;Nproc; i++) xdata[i] = i+0;
  {
    sycl::buffer&lt;int, 1&gt; xbuffer((int *)xdata, sycl::range&lt;1&gt; {Nproc});
    q.submit([&amp;](sycl::handler&amp; cgh) {
      auto xaccessor = xbuffer.get_access&lt;sycl::access::mode::read_write, sycl::access::target::global_buffer&gt;(cgh);
      cgh.single_task&lt;class k1a&gt;(
        [=] () {
          for (int gidx=0; gidx&lt;Nproc; gidx++) {
            LCG48 lcg48(xaccessor[gidx]);
            for (int i=0; i&lt;Niter; i++) lcg48.step();
            xaccessor[gidx] = lcg48.get();
          }
        }
      );
    });
  }
  for (int i=0; i&lt;std::min(Nproc,8); i++) printf("%08X ", xdata[i]); printf("
");
}

... which I compile with:
dpcpp -O3 -g -mavx2 -fintelfpga -fsycl-link -Xshardware -Xsboard=intel_a10gx_pac:pac_a10 fpga1b.cpp -lOpenCL -lsycl

gouthamk_intel · Answer

Hi,Thanks for reaching out to us!Since your issue is related to FPGA, we are moving this query to the FPGA forum for a faster response.RegardsGoutham

cfr · Answer

Just to elaborate on what I realize is an otherwise broad question...
If I were optimizing my example I'd look to unroll the lcg48.step() loop and then parallelize the work items enough to use all the memory bandwidth.  Then each work item "column" could be pipelined.  For example if there's 256 bit wide memory, I'd have &lt;=8 pipelines (256/32) and each pipeline would process ceil(20/8) work items.  (Ok, my choice of example parameters isn't all that great, but I kept things small so I could see things.  You get the idea ;^)).  So back to the original question.  I can use "#pragma unroll" to unroll the inner loop (and observe that i the report.html).  What I don't see how to control how many parallel pipelines are created. (see note).  I don't think it's happening automatically; or a least I don't see anything in the report.html to suggest that it is.

Hope that narrows things a little bit.
note: unrolling both loops in the k1a and you do see what looks like parallelism in the graph view of report.html.

cfr · Answer

Just to follow up in case someone else might be trying similar experiments...
I took the OneAPI/FPGA Tutorial at FPL2020 (highly recommend it).  One of the things stressed in the tutorial was that the compilers are better at handling "single_task" and not so much good at "parallel_for".  It was really stressed to use "single_task".  This didn't come across to me in the documentation.

anilerinch_a_intel · Answer

Hi , Thanks for sharing your learning with the community.Thanks and RegardsAnil

Forum Discussion

DPC++ optimizations for FPGA and using report.html

4 Replies

Recent Discussions

Agilex 7 FPGA Starter Kit with oneAPI Toolkit flow not detected over PCIe

MCTP over PCIe VDM routing to PMCI in OFS N6000 FIM configuration and datapath clarification

HLS Compiler 24.1 error - aocl-clang.exe - dll entry point not found

Error faced while executing on Agilex FPGA board....

AI Suite System Throughput Issue