Dear all,
I would like to use the OpenCL SDK for a Terasic DE5-Net to deploy my algorithm (I am using Quartus 18.1 and OpenCL SDK 18.1). I successfully ran the examples provided by Intel FPGA for OpenCL, and they work. I am now trying to compile my own code, which is an integer iFFT based on the Cooley-Tukey algorithm. The OpenCL emulator gives me correct results when I run the code on the x86 machine. However, when I try to generate the .aocx for the DE5-Net, the compiler returns the following warnings and one error (after 5/6 hours of compilation):
..several warnings about auto unrolled loop..
Compiler Warning: removing out-of-bounds accesses to xtmp1.i.i.i
Compiler Warning: removing out-of-bounds accesses to xtmp2.i.i.i
Compiler Warning: removing out-of-bounds accesses to xtmp3.i.i.i
intelFPGA/18.1/hld/board/terasic/de5net/tests/fft1Dint/device/fft1dint.cl:1424: Compiler Warning: Aggressive compiler optimization: removing unnecessary storage to local memory
Error: exception "std::bad_alloc" in PostBuildSteps [ifft_512_function ifft_512_function ]
Error: Verilog generator FAILED.
Could you please give me more information about this error? How can I tell which problem in my code is linked to this output (buffer allocation or something else)?
Thank you for your support.
Best Regards,
Federico

3 hours is indeed excessive. The first stage of compilation should only take a few minutes for typical kernels. Based on the line numbers in the log, you seem to have a relatively large kernel. Furthermore, the compiler is auto-unrolling a lot of loops, which might not be what you want (especially from a resource usage point of view), and it is also removing out-of-bounds accesses to some of your buffers, which shows that you have logical issues in your code. I think your kernel is probably too large and complex for the compiler to handle; the compiler is probably running into a memory leak somewhere, filling your memory, and finally crashing when it runs out.
My recommendation is to first modify your code to remove all the warnings in the log, and then try to simplify your kernel. As it is, even if your kernel passes the first stage of compilation, it will probably be too big to fit on the device.

It seems your code was initially written to run on a standard CPU, so not every construct used in it is suitable for FPGA acceleration. There are many opportunities to improve your code:
Starting from the top function, I can see that you are processing 1024 points of data while only writing 512 points back. This results in a significant waste of computing cycles and FPGA area. You should modify the code to compute only what you are going to write back to external memory and later read on the host.
You are unrolling the read and write loops in the top function, which is the correct thing to do to achieve compile-time access coalescing. However, the unroll factor is far too large (512). Supporting such large accesses results in a significant waste of FPGA resources, especially Block RAMs. The external memory bandwidth of the FPGA will be saturated with one 512-bit read and one 512-bit write per loop iteration (in the case of two DDR memory banks and an II of one). This effectively translates to an unroll factor of 16 for the "int" datatype. You should consider restructuring your code so that you read, process and write back 16 points per loop iteration. If the FPGA then turns out to be overutilized, you can reduce the number of parallel points to fit the design.
There is excessive use of function calls in your code. Every function call will be implemented individually as a circuit on the FPGA, resulting in excessive use of FPGA resources. This is similar to the case of a fully-unrolled loop. Furthermore, such calls prevent the compiler from correctly reporting the area usage per kernel line in the HTML report (as is evident in your report, where "No Source Line" occupies half the area), which in turn makes performance debugging very difficult. You should avoid function calls as much as possible and try to use loops over the functions instead, partially or fully unrolling the loops based on the available area.
The way "ibfly4_16" is currently written is very inefficient on FPGAs (loop inside of a branch). Since the loops on both sides of the branch over "type" are the same, you should instead use one loop and move the branch inside the loop. Furthermore, using the "out = (condition) ? in_1 : in_2;" construct rather than if/else could lead to area savings in some cases.
The main problem in your code seems to stem from the cpack_16_64 function, which cannot achieve an II of one due to a dependency on "x", resulting in the depth of all the buffers in the loop being increased by the II. Since the function is instantiated multiple times, this leads to huge area waste. I think the dependency exists because you are reading from the x[i+1] point and then overwriting it. If you can split "x" into two buffers and write to x1[i] and x2[i] instead, you might be able to avoid this problem. Of course, this will require significant code rewriting, which will likely propagate all the way to the top function.
There are probably other things that can be done to improve the code, but I cannot find and list them all since the code is relatively large. You can try converting each function to a separate kernel manually, compiling them one by one, optimizing each separately based on the information you get from the report, and then putting them back into the original kernel.
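The "loop inside of a branch" point above can be sketched in plain C as follows. The real "ibfly4_16" is not shown in the thread, so the function names and the op_a/op_b bodies are hypothetical placeholders; only the control-flow shape is meant to carry over to the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two "type" variants of the butterfly. */
static int op_a(int v) { return v + 1; }
static int op_b(int v) { return v - 1; }

/* Shape criticized in the reply: one loop on each side of the branch,
 * so two separate loop circuits would be generated. */
void butterfly_loop_in_branch(int type, const int *in, int *out, size_t n) {
    assert(in != NULL && out != NULL);
    if (type == 0) {
        for (size_t i = 0; i < n; i++) out[i] = op_a(in[i]);
    } else {
        for (size_t i = 0; i < n; i++) out[i] = op_b(in[i]);
    }
}

/* Suggested shape: a single loop, with the branch moved inside and
 * written as a conditional expression. */
void butterfly_branch_in_loop(int type, const int *in, int *out, size_t n) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = (type == 0) ? op_a(in[i]) : op_b(in[i]);
    }
}
```

Both functions produce the same output for the same inputs; the second form simply exposes one loop to the compiler instead of two.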

Those dependencies look like false write-after-write dependencies. The compiler seems to be assuming that the store addresses might overlap and cause undefined behavior in the pipelined loop but since the loop bound is fixed and the addresses do not seem to overlap, it is probably safe to add #pragma ivdep to the loop to avoid the false dependency.
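As a rough illustration of what a false store dependency looks like, here is a plain C sketch. The loop in question is not shown in the thread, so split_pairs and its access pattern are hypothetical; the pragma appears only as a comment because it is an OpenCL kernel directive rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical loop: iteration i stores to dst[2*i] and dst[2*i+1].
 * Stores from different iterations never touch the same address, so
 * there is no real loop-carried dependency on dst. */
void split_pairs(const uint32_t *src, uint32_t *dst, size_t n) {
    assert(src != dst); /* sketch assumes separate input and output buffers */
    /* In the OpenCL kernel, "#pragma ivdep" would go directly above
     * this loop to tell the compiler the stores are independent. */
    for (size_t i = 0; i < n; i++) {
        dst[2 * i]     = src[i] & 0xFFFFu;         /* low 16 bits  */
        dst[2 * i + 1] = (src[i] >> 16) & 0xFFFFu; /* high 16 bits */
    }
}
```

The pragma is only safe when the addresses really are disjoint across iterations; if they are not, the hardware result can differ from the emulator.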

Different output on FPGA compared to emulation can have two causes:
A bug in the compiler that results in the generation of an incorrect hardware circuit (less likely)
A race condition in global memory accesses or incorrect usage of the ivdep pragma (more likely)
It is possible that I missed some important detail in your kernel and that my suggestion of adding ivdep to avoid the dependencies was incorrect. You can try removing the pragmas to see if you get correct output (at the cost of lower performance).
I wouldn't rely too much on the numbers reported by the profiler; in my experience, these numbers are not very accurate. The peak external memory bandwidth of your board is 25.6 GB/s (23.8 GiB/s); however, you should not expect to get close to that number except in extremely ideal situations. You can find the math behind the calculation of the external memory bandwidth, and my recommendations on how to improve external memory performance, in this thread (check the reply before the last; usernames were lost after the migration from Altera's forum):
https://forums.intel.com/s/question/0D50P00003yyTK3SAM/global-memory-access-512-bit-width-constrain
Regarding operating frequency, it largely depends on loop-carried dependencies and area usage. OpenCL users have very little control over the kernel operating frequency, and it is difficult to give recommendations on how it can be improved. You can try changing the default target operating frequency from 240 to some higher number using the -fmax switch, which forces the compiler to insert more registers into the pipeline; this can potentially improve operating frequency. However, it might result in a higher II for the loops that are the fmax bottleneck. In that case, you should focus on optimizing those loops to resolve whatever dependency is causing the bottleneck.
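The 25.6 GB/s (23.8 GiB/s) peak figure quoted above can be reproduced with a small calculation. The inputs used here (two banks, a 64-bit bus per bank, 1600 million transfers per second) are assumptions chosen because they give that figure; the reply itself does not state them, and the linked thread remains the reference for the actual derivation. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Peak bandwidth = transfers per second x bytes per transfer x banks. */
uint64_t peak_bytes_per_second(uint64_t transfers_per_second,
                               unsigned bus_width_bits,
                               unsigned banks) {
    assert(bus_width_bits % 8 == 0); /* whole bytes per transfer */
    return transfers_per_second * (bus_width_bits / 8u) * banks;
}

/* Decimal gigabytes (10^9 bytes). */
double bytes_to_gb(uint64_t bytes) {
    return (double)bytes / 1e9;
}

/* Binary gibibytes (2^30 bytes). */
double bytes_to_gib(uint64_t bytes) {
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}
```

Calling peak_bytes_per_second(1600000000ULL, 64, 2) with the assumed inputs gives 25,600,000,000 bytes per second, and the two helpers convert that to the GB/s and GiB/s forms used in the reply.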

Do you still get incorrect output after removing the ivdep pragmas? Also, as I mentioned before, there is really no point in fully unrolling your memory reads and writes, since the memory bandwidth will be saturated with an unroll factor of 16 and you will just be wasting FPGA area with such large unroll factors.
It is unlikely that your problem is caused by a bug in the compiler; however, if it is, there is nothing any of us can do about it other than reporting it to Intel and hoping that they fix it in a later version. It might also be possible to avoid bugs in certain cases by changing the design strategy.
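A minimal plain C sketch of the "16 points per loop iteration" structure follows. The factor 16 comes from 512 bits divided by a 32-bit int, as described earlier in the thread; the function name and the doubling step are hypothetical placeholders for the real processing, and the unroll pragma is shown only as a comment because it belongs in the OpenCL kernel.

```c
#include <assert.h>
#include <stddef.h>

#define POINTS_PER_ITER 16 /* 512-bit access / 32-bit int */

/* Reads, processes and writes back 16 points per outer iteration. */
void process_in_chunks(const int *in, int *out, size_t n) {
    assert(n % POINTS_PER_ITER == 0); /* sketch handles whole chunks only */
    for (size_t base = 0; base < n; base += POINTS_PER_ITER) {
        int buf[POINTS_PER_ITER];

        /* In the kernel, each inner loop would carry "#pragma unroll"
         * so the 16 accesses can coalesce into one wide access. */
        for (int j = 0; j < POINTS_PER_ITER; j++) buf[j] = in[base + j];

        /* Placeholder computation standing in for the iFFT stages. */
        for (int j = 0; j < POINTS_PER_ITER; j++) buf[j] = buf[j] * 2;

        for (int j = 0; j < POINTS_PER_ITER; j++) out[base + j] = buf[j];
    }
}
```

If the design does not fit, POINTS_PER_ITER is the single knob to lower, which matches the advice to reduce the number of parallel points.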

OpenCL error on code compilation | Altera Community

24 Replies

FCive
Occasional Contributor
7 years ago
I would like to understand in more depth whether the problem is caused by an issue in my code or really is a compiler bug. Do you think I should open a technical support ticket with Intel to report it?
Regarding run-time performance, if I understood correctly, I can calculate the FPGA external memory bandwidth as:
kernel_frequency x number_of_banks x bus_width
In my case I have 2 banks of DDR3 @ 933 MHz, so the memory operating frequency is 933 x 2 = 1866 MHz. According to your answer in the thread you pointed me to, the memory controller on the FPGA runs at 1866 / 8 = 233 MHz. Thus, the maximum frequency achievable for the kernel is 233 MHz. I have to read/write 512 x 16 = 8192 bits from/to global memory. Assuming that the kernel operating frequency is 233 MHz, the maximum FPGA external memory bandwidth is 233 MHz x 2 x 64 = 29.8 Gbps, and the upper bound is 8192 bits / 29.8 Gbps = 275 nanoseconds to transfer data from/to the FPGA.
Are these calculations correct? Am I making any mistakes?
Thank you very much.
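The arithmetic in the post above can be replayed with the short C sketch below. It only reproduces the numbers as written (233 MHz, 2 banks, 64-bit bus, 8192 bits); it does not confirm that this model of the memory system is the right one. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* kernel_frequency x number_of_banks x bus_width, in bits per second. */
uint64_t post_bandwidth_bits_per_s(uint64_t freq_hz,
                                   unsigned banks,
                                   unsigned bus_width_bits) {
    return freq_hz * banks * bus_width_bits;
}

/* Time in nanoseconds to move payload_bits at the given rate. */
double transfer_time_ns(uint64_t payload_bits, uint64_t bits_per_s) {
    assert(bits_per_s > 0);
    return (double)payload_bits * 1e9 / (double)bits_per_s;
}
```

With the values from the post, the first function returns 29,824,000,000 bits per second (the 29.8 Gbps in the post) and the second gives roughly 275 ns. Note that the result is in bits per second, so it corresponds to about 3.7 GB/s.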
FCive
Occasional Contributor
7 years ago
Ok. I can try to explain the issue to one of the Intel-affiliated moderators in the forums. Thank you for your suggestion.
Regarding the processing time, I tried reading/writing fewer bits per clock cycle. At the moment I read/write 1024 bits 8 times instead of 8192 bits in a single read/write. This way, the kernel operating frequency is higher and reaches 242 MHz, according to the profiler. Unfortunately, the kernel execution time remains the same.
Regarding the PCIe bottleneck, do you mean that it is not worth doing the processing in hardware for only 8192 bits because of the PCIe transfer bottleneck?
Moreover, I would like to be sure that I correctly understood the flow from host to FPGA and vice versa. As an example, consider the kernel read operation. The flow consists of:
PCIe writes the data to the DDR.
The kernel reads the data from the DDR.
Thus, the bottlenecks are the DDR access and the PCIe write. Is that correct?
Thank you for your support.
FCive
Occasional Contributor
7 years ago
Thank you for your explanation and happy new year!
I am trying to understand exactly where the bottleneck is, so I enabled the ACL_PROFILE_TIMER variable to see the memory transfers.
It seems that the access to global memory does not reach 100% occupancy but only 4.2%. Moreover, in the kernel execution panel there are empty spaces that represent the global memory access time, if I understood the OpenCL Best Practices Guide correctly. I have also tried 128 iterations of 64-bit transfers from/to global memory, but I did not see any improvement. Please find attached the screenshots from the profiler.
How can I improve the occupancy?
Thank you!
HRZ
Frequent Contributor
7 years ago
The Best Practices Guide says:
"The Kernel Execution tab also displays information on memory transfers between the host and your devices"
I don't think "memory transfers" here refers to "global memory", since I highly doubt the profiler implements separate counters for global memory traffic. Furthermore, memory and compute operations generally overlap in a kernel, and it would not be easy to separate them with run-time counters. The guide does not address this topic very clearly, so I am not sure how the information you have obtained from the profiler can be interpreted.
Regarding the low occupancy, it could simply be caused by your code performing memory operations less frequently than compute operations. "Best Practices Guide, Section 4.3.4.2. Low Occupancy Percentage" contains the official guidelines for improving occupancy.
