Forum Discussion

New Contributor

1 year ago

Solved

HLS i++ style IP generation with oneAPI for RTL IP integration in Quartus

Hello, I am new to the tool chain and was wondering if there was a method to generate RTL IP to integrate in Quartus Prime Pro in similar way to HLS i++ compiler but with oneAPI, as the former is be...

High-level Design Tools

whitepau_altera
1 year ago
> To take a concrete example, I adapted the add_oneapi example to compute a single precision floating-point mul add (that would use only 1 DSP in Quartus using the dedicated IP). So to the best of my knowledge, the invocation and data interface should be streaming pipes, but it creates some extra logic shown in the report (to handle pipes or avalon mm supposedly as with directive -Xsdsp-mode=prefer-dsp it should use internal registers of the DSP and perform all computation in the DSP slice).
Can you please share the code you wrote? I am not sure I understand what you are trying to describe.
In general though, in i++, you could describe a simple adder like this:
component int add(int a, int b) { return a + b; }
This would give you an IP that had two inputs synchronized to a start/busy handshake, and a single output synchronized to a done/stall handshake.

If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes.

whitepau_altera

Contributor

1 year ago

> To take a concrete example, I adapted the add_oneapi example to compute a single precision floating-point mul add (that would use only 1 DSP in Quartus using the dedicated IP). So to the best of my knowledge, the invocation and data interface should be streaming pipes, but it creates some extra logic shown in the report (to handle pipes or avalon mm supposedly as with directive -Xsdsp-mode=prefer-dsp it should use internal registers of the DSP and perform all computation in the DSP slice).

Can you please share the code you wrote? I am not sure I understand what you are trying to describe.

In general though, in i++, you could describe a simple adder like this:

component
int add(int a, int b) {
    return a + b;
}

This would give you an IP that had two inputs synchronized to a start/busy handshake, and a single output synchronized to a done/stall handshake.

If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes.

romabo

New Contributor

1 year ago

Sorry for the late reply I didn't receive email notifications.

The code I used corresponds to the basic example shown in the video you linked that I adapted to single precision floating point multiplication addition.

#include <iostream>

// oneAPI headers
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>

#include "exception_handler.hpp"

using PipeOutProps = decltype(sycl::ext::oneapi::experimental::properties(
  sycl::ext::intel::experimental::protocol<
  sycl::ext::intel::experimental::protocol_name::avalon_mm_uses_ready>));

class PipeOutResID;
using PipeOutRes = sycl::ext::intel::experimental::pipe<PipeOutResID, float, 0, PipeOutProps>;

class MulAddFp32ID; // Kernel name
struct MulAddFp32 {
  float a;
  float b;
  float c;

  void operator()() const {
    float res = a * b + c;
    PipeOutRes::write(res);
  }

  auto get(sycl::ext::oneapi::experimental::properties_tag) {
    return sycl::ext::oneapi::experimental::properties{
        sycl::ext::intel::experimental::streaming_interface_remove_downstream_stall};
  }
};

int main() {
  bool passed = false;

  try {

    // Use compile-time macros to select either:
    //  - the FPGA emulator device (CPU emulation of the FPGA)
    //  - the FPGA device (a real FPGA)
    //  - the simulator device
#if FPGA_SIMULATOR
    auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
    auto selector = sycl::ext::intel::fpga_selector_v;
#else  // #if FPGA_EMULATOR
    auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#endif

    sycl::queue q(selector, fpga_tools::exception_handler,
                  sycl::property::queue::enable_profiling{});

    auto device = q.get_device();

    // make sure the device supports USM host allocations
    if (!device.has(sycl::aspect::usm_host_allocations)) {
      std::cerr << "This design must either target a board that supports USM "
                   "Host/Shared allocations, or IP Component Authoring. "
                << std::endl;
      std::terminate();
    }

    std::cout << "Running on device: "
              << device.get_info<sycl::info::device::name>().c_str()
              << std::endl;

    // Kernel inputs
    float a = 10.1;
    float b = 20.1;
    float c = 30.1;

    q.single_task<MulAddFp32ID>(MulAddFp32{a,b,c}).wait();
    float res = PipeOutRes::read(q);

    // verify that VC is correct
    passed = true;

    float expected = a*b + c;
    if (res != expected) {
      std::cout << "result " << res << ", expected ("
                << expected 
                << ") A=" << a << " + B=" << b << " + C=" << c << std::endl;
      passed = false;
    }

    std::cout << (passed ? "PASSED" : "FAILED") << std::endl;
  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code.
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }

  return passed ? EXIT_SUCCESS : EXIT_FAILURE;
}

Which gives the following estimated resource utilization summary

Name	ALMs	ALUTs	FFs	MLABs	RAMs	DSPs
Pipe resources	1	2	33	0	0	0
MulAddFp32ID	40	80	146	0	1	1

* -Xstarget=Agilex5 -Xsclock=200MHz -Xsdsp-mode=prefer-dsp

The RAM usage is apparently used for cluster logic and extra logic is used for computation as shown below:

When running fpga compilation Quartus fitter I get the following resource utilization summary.

Name	ALMs	ALUTs	FFs	MLABs	RAMs	DSPs
Quartus Fitter: Device Image	129	180	565	0	0	0
MulAddFp32ID	119	165	510	0	0	0

* -Xstarget=Agilex5 -Xsclock=200MHz -Xsdsp-mode=prefer-dsp

Let alone the fact that DSP is not inferred by the fitter, I don't really understand the schedule view and what the cluster exactly corresponds to along with the numerous write cycles.

I also tried with I/O pipes which appeared to be more suited for my application but I also get a similar operation.

// Pipe In
struct io_pipe_read_a_id    { static constexpr unsigned id = 0; };
struct io_pipe_read_b_id    { static constexpr unsigned id = 0; };
struct io_pipe_read_c_id    { static constexpr unsigned id = 0; };
using PipeReadA  = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_a_id, float, 0>;
using PipeReadB  = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_b_id, float, 0>;
using PipeReadC  = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_c_id, float, 0>;

// Pipe Out
struct io_pipe_write_res_id { static constexpr unsigned id = 1; };
using PipeWriteRes = sycl::ext::intel::kernel_writeable_io_pipe<io_pipe_write_res_id, float, 0>;

class MulAddFp32ID; // Kernel name
struct MulAddFp32 {
  auto get(sycl::ext::oneapi::experimental::properties_tag) {
    return sycl::ext::oneapi::experimental::properties{
        sycl::ext::intel::experimental::streaming_interface<>};
  }

  float a = PipeReadA::read();
  float b = PipeReadB::read();
  float c = PipeReadC::read();

  void operator()() const { 
    float res = a * b + c;
    PipeWriteRes::write(res);
  }
};

@whitepau_altera wrote:

If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes.

Thank you for confirming. I was mostly interested in the details of what streaming interface/data I should use to get minimal "kernel" operation so as the generate IP would basically operate as an RTL module registers input/outputs with start/done signals.

Thank you for your time.

whitepau_altera
Contributor
1 year ago
> I don't really understand the schedule view and what the cluster exactly corresponds to along with the numerous write cycles.
You can learn about clusters in the Scheduling section of the FPGA Concepts chapter of our documentation. You can force stall-enabled clusters using the use_stall_enable_clusters kernel attribute.
As far as the 24-cycle pipe write, this is a common issue with any of the hyperflex-enabled FPGAs (Stratix™ 10, Agilex™ 7 and Agilex™ 5). In order to optimize for the high fMAX that these chips are capable of, the compiler tends to insert lots of pipeline registers, which can have negative impacts on latency (in particular small designs like yours). The compiler supports different optimization strategies. You can bias the compiler to prefer low latency over high fMAX using the -Xsoptimize=latency compiler flag.

Here is the schedule view after biasing for latency:
As far as the DSP block not being used; that is concerning. I've filed a bug report on this.
- romabo
  New Contributor
  1 year ago
  Thank you very much for the resources and explanations, that answered all my questions on that matter!

Forum Discussion

HLS i++ style IP generation with oneAPI for RTL IP integration in Quartus

Recent Discussions

Agilex 7 I-Series "aocl diagnose acl0" error following OFS

AI Suite System Throughput Issue

HLS Compiler 24.1 error - aocl-clang.exe - dll entry point not found

How Do I get the License for HLS?

Deprecation Notice for FPGA Support Package for oneAPI DPC++/C++. What is the alternative?