Forum Discussion

RN1
New Contributor
3 years ago

Error executing basic test code for FPGA (oneAPI/OpenCL)

Greetings,

We are running some experiments and have found that oneAPI for FPGA performs well below our previous pure OpenCL implementation (same system, same FPGA, etc.). We tested several codes, and even for a simple matrix multiplication the OpenCL code and runtime perform much better than the pure oneAPI version. So we decided to try OpenCL interoperability with oneAPI (as in this guide: https://www.intel.com/content/www/us/en/developer/articles/technical/interoperability-dpcpp-sycl-opencl.html) to see whether DPC++ can avoid the overhead or misapplied optimization steps of oneAPI by using something closer to the OpenCL implementation and runtime optimizations. We don't know what oneAPI adds during the compilation and execution steps, but at the moment using the FPGA is not worthwhile.

Interoperability works with the FPGA emulator, but not with the real FPGA. Here are our steps:

Error

user@node:~/interoperability$ ./ingest_fpga
Device: s10gx : Stratix 10 Reference Platform (acls10_ref0)
terminate called after throwing an instance of 'cl::sycl::feature_not_supported'
  what():  Online compilation is not supported by this device -3 (CL_COMPILER_NOT_AVAILABLE)
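
For reference, the `-3` in the message above is the raw OpenCL status code. A minimal sketch of a decoder for a few common codes follows; the function name `cl_error_name` is hypothetical and not part of OpenCL or SYCL, and the numeric values are the ones defined in the OpenCL `cl.h` header.

```cpp
#include <string>

// Hypothetical helper (not part of OpenCL or SYCL): maps a raw OpenCL
// status code to its symbolic name. Values follow the OpenCL cl.h header.
inline std::string cl_error_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -2:  return "CL_DEVICE_NOT_AVAILABLE";
        case -3:  return "CL_COMPILER_NOT_AVAILABLE";
        case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        default:  return "unknown OpenCL error " + std::to_string(code);
    }
}
```

Calling `cl_error_name(-3)` gives `CL_COMPILER_NOT_AVAILABLE`, the status OpenCL returns when a program created from source is built on a device that has no compiler available at run time. That matches the "Online compilation is not supported by this device" text in the exception.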

Code

#include <CL/sycl.hpp>
#include <iostream>
#include <array>

#include <CL/sycl/INTEL/fpga_extensions.hpp>
#include "dpc_common.hpp"

using namespace cl::sycl;

int main()
{
        const size_t szKernelData = 32;
        std::array<float, szKernelData> kernelData;
        kernelData.fill(-99.f);
        range<1> r(szKernelData);
        sycl::INTEL::fpga_selector _device;
        queue q{_device};
        //queue q{fpga_selector()};
        std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";
        program p(q.get_context());

        p.build_with_source(
          R"CLC(
            kernel void sinf_test(global float* data) {
              data[get_global_id(0)] = sin(get_global_id(0)*2*M_PI_F/get_global_size(0)) ;
            }
          )CLC", "-cl-std=CL1.2");
        {
                buffer<float, 1> b(kernelData.data(), r);

                q.submit([&](handler& cgh) {
                        auto b_accessor = b.get_access<access::mode::read_write>(cgh);
                        cgh.set_args(b_accessor);
                        cgh.parallel_for(r, p.get_kernel("sinf_test"));
                });
        }
        for(auto& elem : kernelData)
                std::cout << std::defaultfloat << elem << " " << std::hexfloat << elem << std::endl;
        return 0;
}
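
When the kernel does run (for instance on the emulator), the values printed by the final loop can be checked against a host-side reference. This is a minimal sketch, assuming the kernel evaluates sin(i*2*pi/N) in single precision as written in the source string above. The helper names `sinf_reference` and `max_abs_diff` are hypothetical, and the acceptable tolerance is left open because device and host `sin` implementations may differ in the last bits.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for the sinf_test kernel.
// Mirrors the kernel expression id*2*M_PI_F/size in single precision.
inline std::vector<float> sinf_reference(std::size_t n) {
    const float pi_f = 3.14159265358979323846f;
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::sin(static_cast<float>(2 * i) * pi_f /
                          static_cast<float>(n));
    }
    return out;
}

// Largest absolute difference between device results and the reference.
// Only the elements present in both containers are compared.
template <typename Container>
float max_abs_diff(const Container& device, const std::vector<float>& ref) {
    const std::size_t n = std::min<std::size_t>(device.size(), ref.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        worst = std::max(worst, std::fabs(device[i] - ref[i]));
    }
    return worst;
}
```

With the program above, `max_abs_diff(kernelData, sinf_reference(szKernelData))` should be close to zero after a successful run. A value near 100 means the buffer still holds the `-99.f` fill value, so the kernel never wrote its results.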

Compilation

dpcpp -fintelfpga -Xshardware -fsycl-unnamed-lambda ingest.cpp -std=c++17 -o ingest_fpga
# Output
WARNING: No kernels specified.

Any ideas?

4 Replies

  • Hi @RN1,


    Thank you for posting in the Intel community forum and for your interest in oneAPI. I hope all is well.

    While we look into the interoperability error, would it be possible to share the test codes you mentioned, for both pure OpenCL and oneAPI? That would allow us to investigate further.


    I have dropped a note in your mailbox, in case you would prefer a more secure way to share the code.

    Hope to hear from you soon.


    Best Wishes

    BB


  • Hi @RN1,


    Good day, just following up on the previous request for clarification.

    By any chance, did you manage to look into it?


    Best Wishes

    BB


    • RN1
      New Contributor

      Good day.

      I answered in the other thread.

      We ran more tests with no improvement: the performance is still poor and the results are the same as those attached in the other post.

      We compared "System viewers->Graph viewer (beta)" in OpenCL with "Views->System viewer" in oneAPI. OpenCL coalesces the local accesses, while oneAPI does not; it uses many small units instead. We suspect something is wrong with the DSPs as well, but that is hard to confirm because the generated graph is huge. Maybe it is not reducing/trimming the tree. We did nothing different from the OpenCL version, so we need support on how to optimize these local accesses to achieve performance at least similar to OpenCL (the optimization guides do not explain this).

      Back to the topic of this thread: do you know why OpenCL-oneAPI interoperability does not work with the FPGA?