User Profile

romabo

New Contributor

Joined 1 year ago

4 Posts

View All Badges

User Widgets

Contributions

Re: HLS i++ style IP generation with oneAPI for RTL IP integration in Quartus
Thank you very much for the resources and explanations, that answered all my questions on that matter!
1 year ago Place Acceleration
1.9KViews
0likes
0Comments
Re: HLS i++ style IP generation with oneAPI for RTL IP integration in Quartus
Sorry for the late reply I didn't receive email notifications. The code I used corresponds to the basic example shown in the video you linked that I adapted to single precision floating point multiplication addition. #include <iostream> // oneAPI headers #include <sycl/ext/intel/fpga_extensions.hpp> #include <sycl/sycl.hpp> #include "exception_handler.hpp" using PipeOutProps = decltype(sycl::ext::oneapi::experimental::properties( sycl::ext::intel::experimental::protocol< sycl::ext::intel::experimental::protocol_name::avalon_mm_uses_ready>)); class PipeOutResID; using PipeOutRes = sycl::ext::intel::experimental::pipe<PipeOutResID, float, 0, PipeOutProps>; class MulAddFp32ID; // Kernel name struct MulAddFp32 { float a; float b; float c; void operator()() const { float res = a * b + c; PipeOutRes::write(res); } auto get(sycl::ext::oneapi::experimental::properties_tag) { return sycl::ext::oneapi::experimental::properties{ sycl::ext::intel::experimental::streaming_interface_remove_downstream_stall}; } }; int main() { bool passed = false; try { // Use compile-time macros to select either: // - the FPGA emulator device (CPU emulation of the FPGA) // - the FPGA device (a real FPGA) // - the simulator device #if FPGA_SIMULATOR auto selector = sycl::ext::intel::fpga_simulator_selector_v; #elif FPGA_HARDWARE auto selector = sycl::ext::intel::fpga_selector_v; #else // #if FPGA_EMULATOR auto selector = sycl::ext::intel::fpga_emulator_selector_v; #endif sycl::queue q(selector, fpga_tools::exception_handler, sycl::property::queue::enable_profiling{}); auto device = q.get_device(); // make sure the device supports USM host allocations if (!device.has(sycl::aspect::usm_host_allocations)) { std::cerr << "This design must either target a board that supports USM " "Host/Shared allocations, or IP Component Authoring. " << std::endl; std::terminate(); } std::cout << "Running on device: " << device.get_info<sycl::info::device::name>().c_str() << std::endl; // Kernel inputs float a = 10.1; float b = 20.1; float c = 30.1; q.single_task<MulAddFp32ID>(MulAddFp32{a,b,c}).wait(); float res = PipeOutRes::read(q); // verify that VC is correct passed = true; float expected = a*b + c; if (res != expected) { std::cout << "result " << res << ", expected (" << expected << ") A=" << a << " + B=" << b << " + C=" << c << std::endl; passed = false; } std::cout << (passed ? "PASSED" : "FAILED") << std::endl; } catch (sycl::exception const &e) { // Catches exceptions in the host code. std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n"; // Most likely the runtime couldn't find FPGA hardware! if (e.code().value() == CL_DEVICE_NOT_FOUND) { std::cerr << "If you are targeting an FPGA, please ensure that your " "system has a correctly configured FPGA board.\n"; std::cerr << "Run sys_check in the oneAPI root directory to verify.\n"; std::cerr << "If you are targeting the FPGA emulator, compile with " "-DFPGA_EMULATOR.\n"; } std::terminate(); } return passed ? EXIT_SUCCESS : EXIT_FAILURE; } Which gives the following estimated resource utilization summary Name ALMs ALUTs FFs MLABs RAMs DSPs Pipe resources 1 2 33 0 0 0 MulAddFp32ID 40 80 146 0 1 1 * -Xstarget=Agilex5 -Xsclock=200MHz -Xsdsp-mode=prefer-dsp The RAM usage is apparently used for cluster logic and extra logic is used for computation as shown below: When running fpga compilation Quartus fitter I get the following resource utilization summary. Name ALMs ALUTs FFs MLABs RAMs DSPs Quartus Fitter: Device Image 129 180 565 0 0 0 MulAddFp32ID 119 165 510 0 0 0 * -Xstarget=Agilex5 -Xsclock=200MHz -Xsdsp-mode=prefer-dsp Let alone the fact that DSP is not inferred by the fitter, I don't really understand the schedule view and what the cluster exactly corresponds to along with the numerous write cycles. I also tried with I/O pipes which appeared to be more suited for my application but I also get a similar operation. // Pipe In struct io_pipe_read_a_id { static constexpr unsigned id = 0; }; struct io_pipe_read_b_id { static constexpr unsigned id = 0; }; struct io_pipe_read_c_id { static constexpr unsigned id = 0; }; using PipeReadA = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_a_id, float, 0>; using PipeReadB = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_b_id, float, 0>; using PipeReadC = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_c_id, float, 0>; // Pipe Out struct io_pipe_write_res_id { static constexpr unsigned id = 1; }; using PipeWriteRes = sycl::ext::intel::kernel_writeable_io_pipe<io_pipe_write_res_id, float, 0>; class MulAddFp32ID; // Kernel name struct MulAddFp32 { auto get(sycl::ext::oneapi::experimental::properties_tag) { return sycl::ext::oneapi::experimental::properties{ sycl::ext::intel::experimental::streaming_interface<>}; } float a = PipeReadA::read(); float b = PipeReadB::read(); float c = PipeReadC::read(); void operator()() const { float res = a * b + c; PipeWriteRes::write(res); } }; @whitepau_altera wrote: If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes. Thank you for confirming. I was mostly interested in the details of what streaming interface/data I should use to get minimal "kernel" operation so as the generate IP would basically operate as an RTL module registers input/outputs with start/done signals. Thank you for your time.
1 year ago Place Acceleration
2KViews
0likes
2Comments
Re: HLS i++ style IP generation with oneAPI for RTL IP integration in Quartus
Hello, Thank you for the clarification and links. I took some time to play a bit with examples and look at the different interfaces but I am still uncertain about what interface would be best for my applications. I basically want to use the HLS tools to generate optimized computation modules operating and communicating only with RTL. The main idea is to save development time by having HLS automatically enabling registers in DSP and synchronize the intermediate computations. I find this method more practical than generating the DSP IPs in Quartus, that would require to code some HDL to synchronize intermediate computations and estimation of registers to enable for the target fmax. With HLS I could simply describe the operations to be computed and obtain the latency of the module and fmax quickly. To take a concrete example, I adapted the add_oneapi example to compute a single precision floating-point mul add (that would use only 1 DSP in Quartus using the dedicated IP). So to the best of my knowledge, the invocation and data interface should be streaming pipes, but it creates some extra logic shown in the report (to handle pipes or avalon mm supposedly as with directive -Xsdsp-mode=prefer-dsp it should use internal registers of the DSP and perform all computation in the DSP slice). So I was wondering what was the most suited streaming data interface for RTL communication with minimal logic inference (basically just register input/output with start/done control) instead of the pipe implementing avalon_mm protocol of the example. Thank you for your time.
1 year ago Place Acceleration
2.3KViews
0likes
0Comments
HLS i++ style IP generation with oneAPI for RTL IP integration in Quartus
Hello, I am new to the tool chain and was wondering if there was a method to generate RTL IP to integrate in Quartus Prime Pro in similar way to HLS i++ compiler but with oneAPI, as the former is being deprecated. I just need standalone soft IP cores to integrate in RTL design in Quartus so as I don't need the kernel formalism and would like to simply generate .ip from multiple functions or files. I don't find much documentation on that specific RTL-oriented approach so as I would appreciate some help. Basically, from the fpga_template sample I would want to get a simple IP VectorAdd with a,b inputs instead of the wrapped kernel generated by icpx that integrates the stream interface. If i just simply use HLS syntax and compile with icpx -fsycl or dpcpp I get a warning because no kernel is specified and generated .prj shows empty description (only basic signals in instanciation are shown like reset/clock). Thank you for your time.
Solved
1 year ago Place Acceleration
High-level Design Tools
2.4KViews
0likes
9Comments