HLS i++ style IP generation with oneAPI for RTL IP integration in Quartus
Hello, I am new to the tool chain and was wondering if there was a method to generate RTL IP to integrate in Quartus Prime Pro in similar way to HLS i++ compiler but with oneAPI, as the former is be...
> To take a concrete example, I adapted the add_oneapi example to compute a single precision floating-point mul add (that would use only 1 DSP in Quartus using the dedicated IP). So to the best of my knowledge, the invocation and data interface should be streaming pipes, but it creates some extra logic shown in the report (to handle pipes or avalon mm supposedly as with directive -Xsdsp-mode=prefer-dsp it should use internal registers of the DSP and perform all computation in the DSP slice).
Can you please share the code you wrote? I am not sure I understand what you are trying to describe.
In general though, in i++, you could describe a simple adder like this:
component
int add(int a, int b) {
return a + b;
}
This would give you an IP that had two inputs synchronized to a start/busy handshake, and a single output synchronized to a done/stall handshake.
If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes.
> To take a concrete example, I adapted the add_oneapi example to compute a single precision floating-point mul add (that would use only 1 DSP in Quartus using the dedicated IP). So to the best of my knowledge, the invocation and data interface should be streaming pipes, but it creates some extra logic shown in the report (to handle pipes or avalon mm supposedly as with directive -Xsdsp-mode=prefer-dsp it should use internal registers of the DSP and perform all computation in the DSP slice).
Can you please share the code you wrote? I am not sure I understand what you are trying to describe.
In general though, in i++, you could describe a simple adder like this:
component
int add(int a, int b) {
return a + b;
}
This would give you an IP that had two inputs synchronized to a start/busy handshake, and a single output synchronized to a done/stall handshake.
If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes.
Sorry for the late reply I didn't receive email notifications.
The code I used corresponds to the basic example shown in the video you linked that I adapted to single precision floating point multiplication addition.
#include <iostream>
// oneAPI headers
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>
#include "exception_handler.hpp"
using PipeOutProps = decltype(sycl::ext::oneapi::experimental::properties(
sycl::ext::intel::experimental::protocol<
sycl::ext::intel::experimental::protocol_name::avalon_mm_uses_ready>));
class PipeOutResID;
using PipeOutRes = sycl::ext::intel::experimental::pipe<PipeOutResID, float, 0, PipeOutProps>;
class MulAddFp32ID; // Kernel name
struct MulAddFp32 {
float a;
float b;
float c;
void operator()() const {
float res = a * b + c;
PipeOutRes::write(res);
}
auto get(sycl::ext::oneapi::experimental::properties_tag) {
return sycl::ext::oneapi::experimental::properties{
sycl::ext::intel::experimental::streaming_interface_remove_downstream_stall};
}
};
int main() {
bool passed = false;
try {
// Use compile-time macros to select either:
// - the FPGA emulator device (CPU emulation of the FPGA)
// - the FPGA device (a real FPGA)
// - the simulator device
#if FPGA_SIMULATOR
auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
auto selector = sycl::ext::intel::fpga_selector_v;
#else // #if FPGA_EMULATOR
auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#endif
sycl::queue q(selector, fpga_tools::exception_handler,
sycl::property::queue::enable_profiling{});
auto device = q.get_device();
// make sure the device supports USM host allocations
if (!device.has(sycl::aspect::usm_host_allocations)) {
std::cerr << "This design must either target a board that supports USM "
"Host/Shared allocations, or IP Component Authoring. "
<< std::endl;
std::terminate();
}
std::cout << "Running on device: "
<< device.get_info<sycl::info::device::name>().c_str()
<< std::endl;
// Kernel inputs
float a = 10.1;
float b = 20.1;
float c = 30.1;
q.single_task<MulAddFp32ID>(MulAddFp32{a,b,c}).wait();
float res = PipeOutRes::read(q);
// verify that VC is correct
passed = true;
float expected = a*b + c;
if (res != expected) {
std::cout << "result " << res << ", expected ("
<< expected
<< ") A=" << a << " + B=" << b << " + C=" << c << std::endl;
passed = false;
}
std::cout << (passed ? "PASSED" : "FAILED") << std::endl;
} catch (sycl::exception const &e) {
// Catches exceptions in the host code.
std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";
// Most likely the runtime couldn't find FPGA hardware!
if (e.code().value() == CL_DEVICE_NOT_FOUND) {
std::cerr << "If you are targeting an FPGA, please ensure that your "
"system has a correctly configured FPGA board.\n";
std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
std::cerr << "If you are targeting the FPGA emulator, compile with "
"-DFPGA_EMULATOR.\n";
}
std::terminate();
}
return passed ? EXIT_SUCCESS : EXIT_FAILURE;
}
Which gives the following estimated resource utilization summary
Let alone the fact that DSP is not inferred by the fitter, I don't really understand the schedule view and what the cluster exactly corresponds to along with the numerous write cycles.
I also tried with I/O pipes which appeared to be more suited for my application but I also get a similar operation.
// Pipe In
struct io_pipe_read_a_id { static constexpr unsigned id = 0; };
struct io_pipe_read_b_id { static constexpr unsigned id = 0; };
struct io_pipe_read_c_id { static constexpr unsigned id = 0; };
using PipeReadA = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_a_id, float, 0>;
using PipeReadB = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_b_id, float, 0>;
using PipeReadC = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_c_id, float, 0>;
// Pipe Out
struct io_pipe_write_res_id { static constexpr unsigned id = 1; };
using PipeWriteRes = sycl::ext::intel::kernel_writeable_io_pipe<io_pipe_write_res_id, float, 0>;
class MulAddFp32ID; // Kernel name
struct MulAddFp32 {
auto get(sycl::ext::oneapi::experimental::properties_tag) {
return sycl::ext::oneapi::experimental::properties{
sycl::ext::intel::experimental::streaming_interface<>};
}
float a = PipeReadA::read();
float b = PipeReadB::read();
float c = PipeReadC::read();
void operator()() const {
float res = a * b + c;
PipeWriteRes::write(res);
}
};
If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes.
Thank you for confirming. I was mostly interested in the details of what streaming interface/data I should use to get minimal "kernel" operation so as the generate IP would basically operate as an RTL module registers input/outputs with start/done signals.
> I don't really understand the schedule view and what the cluster exactly corresponds to along with the numerous write cycles.
You can learn about clusters in the Scheduling section of the FPGA Concepts chapter of our documentation. You can force stall-enabled clusters using the use_stall_enable_clusters kernel attribute.
As far as the 24-cycle pipe write, this is a common issue with any of the hyperflex-enabled FPGAs (Stratix™ 10, Agilex™ 7 and Agilex™ 5). In order to optimize for the high fMAX that these chips are capable of, the compiler tends to insert lots of pipeline registers, which can have negative impacts on latency (in particular small designs like yours). The compiler supports different optimization strategies. You can bias the compiler to prefer low latency over high fMAX using the -Xsoptimize=latency compiler flag.
Here is the schedule view after biasing for latency:
As far as the DSP block not being used; that is concerning. I've filed a bug report on this.