Forum Discussion

FCive
Occasional Contributor
7 years ago

OpenCL error on code compilation

Dear all,

I would like to use the OpenCL SDK to deploy my algorithm on a Terasic DE5-Net (I am using Quartus 18.1 and OpenCL SDK 18.1). I successfully ran the examples provided by Intel FPGA for OpenCL, and they work. I am now trying to compile my own code, an integer iFFT based on the Cooley-Tukey algorithm. The OpenCL emulator gives correct results when I run the code on the x86 machine. However, when I try to generate the .aocx for the DE5-Net, the compiler exits with the following warnings and one error (after 5-6 hours of compilation):

..several warnings about auto unrolled loop..

Compiler Warning: removing out-of-bounds accesses to xtmp1.i.i.i

Compiler Warning: removing out-of-bounds accesses to xtmp2.i.i.i

Compiler Warning: removing out-of-bounds accesses to xtmp3.i.i.i

intelFPGA/18.1/hld/board/terasic/de5net/tests/fft1Dint/device/fft1dint.cl:1424: Compiler Warning: Aggressive compiler optimization: removing unnecessary storage to local memory

Error: exception "std::bad_alloc" in PostBuildSteps [ifft_512_function ifft_512_function ]

Error: Verilog generator FAILED.

Could you please give me more information about this error? How can I work out which part of my code triggers it (buffer allocation or something else)?

Thank you for your support.

Best Regards,

Federico

24 Replies

  • FCive
    Occasional Contributor

    I would like to understand in more depth whether the problem is caused by an issue in my code or is really a compiler bug. Do you think I should open a technical support ticket with Intel to report it?

    Regarding run-time performance: if I understood correctly, I can calculate the FPGA external memory bandwidth as:

    kernel_frequency x number_of_banks x bus_width

    In my case I have 2 banks of DDR3 @ 933 MHz, so the memory operating frequency is 933 x 2 = 1866 MHz. According to your answer in the thread you pointed me to, the memory controller on the FPGA runs at 1866 / 8 = 233 MHz. Thus, the maximum frequency achievable for the kernel is 233 MHz. I have to read/write 512 x 16 = 8192 bits from/to global memory. Assuming a kernel operating frequency of 233 MHz, the maximum FPGA external memory bandwidth is 233 MHz x 2 x 64 = 29.8 Gbps, and the best-case time to transfer the data from/to the FPGA is 8192 bits / 29.8 Gbps = 275 nanoseconds.
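The arithmetic in the paragraph above can be reproduced with a short script. This is a minimal sketch, not an authoritative bandwidth model: the constant and function names are made up for illustration, the 64 is taken as bits per bank exactly as in the post, and the second function is an assumed cross-check that computes the peak rate directly from the DDR data rate (1866 MT/s on a 64-bit bus per bank).

```python
# Figures quoted in the post: 2 banks of DDR3 clocked at 933 MHz, 64-bit bus.
DDR_CLOCK_MHZ = 933
NUM_BANKS = 2
BUS_WIDTH_BITS = 64
PAYLOAD_BITS = 512 * 16  # 8192 bits to read or write


def post_estimate():
    """Reproduce the post's formula: controller_freq x banks x bus_width."""
    data_rate_mtps = DDR_CLOCK_MHZ * 2      # double data rate: 1866 MT/s
    controller_mhz = data_rate_mtps // 8    # 233 MHz (truncated, as in the post)
    bandwidth_mbps = controller_mhz * NUM_BANKS * BUS_WIDTH_BITS
    # bits / (Mbit/s) gives microseconds; multiply by 1000 for nanoseconds.
    transfer_ns = PAYLOAD_BITS / bandwidth_mbps * 1000
    return controller_mhz, bandwidth_mbps, transfer_ns


def data_rate_estimate():
    """Assumed cross-check: peak rate = transfers/s x bus width x banks."""
    data_rate_mtps = DDR_CLOCK_MHZ * 2
    bandwidth_mbps = data_rate_mtps * BUS_WIDTH_BITS * NUM_BANKS
    transfer_ns = PAYLOAD_BITS / bandwidth_mbps * 1000
    return bandwidth_mbps, transfer_ns


if __name__ == "__main__":
    mhz, bw, ns = post_estimate()
    print(f"post formula: {mhz} MHz, {bw / 1000:.1f} Gbit/s, {ns:.0f} ns")
    peak_bw, peak_ns = data_rate_estimate()
    print(f"data-rate formula: {peak_bw / 1000:.1f} Gbit/s, {peak_ns:.1f} ns")
```

The first function gives 233 MHz, about 29.8 Gbit/s and about 275 ns, matching the figures in the post. The cross-check comes out roughly eight times higher, which would be consistent with each controller clock moving 8 x 64 = 512 bits per bank rather than 64; treat that reading as an assumption to confirm against the board documentation.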

    Are these calculations correct, or am I making a mistake somewhere?

    Thank you very much.

  • FCive
    Occasional Contributor

    Ok. I can try to explain the issue to one of the Intel-affiliated moderators in the forums. Thank you for your suggestion.

    Regarding the processing time, I tried reading/writing fewer bits per clock cycle. At the moment I read/write 1024 bits 8 times instead of 8192 bits in a single access. This way the kernel operating frequency is higher, reaching 242 MHz according to the profiler. Unfortunately, the kernel execution time remains the same.

    Regarding the PCIe bottleneck, do you mean that it is not worth doing the processing in hardware for only 8192 bits because of the PCIe transfer overhead?

    Moreover, I would like to be sure that I have correctly understood the flow from host to FPGA and vice versa. As an example, consider a kernel read operation. The flow consists of:

    • PCIe writes the data to the DDR
    • DDR has to be read by the kernel.

    Thus, the bottlenecks are the DDR access and the PCIe write. Is that correct?
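The two-step flow listed above can be put into a rough timing model. This is only a sketch under stated assumptions: each step is modelled as a fixed latency plus payload divided by bandwidth, and every number below (bandwidths, latencies, names) is a hypothetical placeholder, not a measurement of the DE5-Net or its PCIe link.

```python
PAYLOAD_BYTES = 8192 // 8  # 8192 bits from the post = 1024 bytes


def step_time_us(payload_bytes, bandwidth_gbytes_per_s, fixed_latency_us):
    """Time for one step: fixed latency plus payload / bandwidth.

    1 GB/s equals 1000 bytes per microsecond.
    """
    return fixed_latency_us + payload_bytes / (bandwidth_gbytes_per_s * 1000)


# Hypothetical placeholder figures; replace them with measured values.
PCIE = {"bandwidth_gbytes_per_s": 3.0, "fixed_latency_us": 10.0}
DDR = {"bandwidth_gbytes_per_s": 25.0, "fixed_latency_us": 0.2}

if __name__ == "__main__":
    pcie_us = step_time_us(PAYLOAD_BYTES, **PCIE)
    ddr_us = step_time_us(PAYLOAD_BYTES, **DDR)
    print(f"host -> DDR over PCIe: {pcie_us:.2f} us")
    print(f"DDR -> kernel:         {ddr_us:.2f} us")
    share = PCIE["fixed_latency_us"] / pcie_us
    print(f"latency share of the PCIe step: {share:.0%}")
```

With placeholder values like these, the fixed latency dominates a 1 KiB transfer, so the bandwidth of either link would matter little for such a small payload. Whether that holds on real hardware has to be measured.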

    Thank you for your support.

  • FCive
    Occasional Contributor

    Thank you for your explanation and happy new year!

    I am trying to understand exactly where the bottleneck is, so I enabled the ACL_PROFILE_TIMER variable to see the memory transfers.

    It seems that the global memory access reaches only 4.2% occupancy, far from 100%. Moreover, in the Kernel Execution panel there are empty gaps which, if I understood the OpenCL Best Practices Guide correctly, represent global memory access time. I also tried transferring the data from/to global memory in 128 iterations of 64 bits each, but I did not see any improvement. Please find attached the screenshots from the profiler.

    How can I improve the occupancy?

    Thank you!

  • HRZ
    Frequent Contributor

    The Best Practices Guide says:

    "The Kernel Execution tab also displays information on memory transfers between the host and your devices"

    I don't think "memory transfers" here refers to "global memory", since I highly doubt the profiler implements separate counters for global memory traffic. Furthermore, memory and compute operations generally overlap in a kernel, and it would not be easy to separate them with run-time counters. The guide does not address this topic very clearly, so I am not sure how the information you have obtained from the profiler should be interpreted.

    Regarding the low occupancy, it could simply be caused by your code performing memory operations less frequently than compute operations. "Best Practices Guide, Section 4.3.4.2. Low Occupancy Percentage" contains the official guidelines for improving occupancy.
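To make the point about infrequent memory operations more concrete, here is a toy calculation. It assumes a simplified reading of occupancy as the fraction of profiled cycles in which the memory instruction is issued; the function name and the cycle counts are hypothetical and were picked only to land near the 4.2% reported above, not taken from the profiler.

```python
def occupancy_percent(memory_issue_cycles, total_cycles):
    """Share of profiled cycles in which the memory instruction issues."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return 100.0 * memory_issue_cycles / total_cycles


if __name__ == "__main__":
    # Hypothetical: 8 wide accesses issued during a 190-cycle run.
    print(f"{occupancy_percent(8, 190):.1f}%")
    # Hypothetical: the same 8 accesses if the run took only 16 cycles.
    print(f"{occupancy_percent(8, 16):.1f}%")
```

Under this simplified model, occupancy rises only when the memory instruction issues in a larger share of the cycles, which is why splitting the same amount of data into more, narrower accesses would not necessarily change the execution time.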