Forum Discussion

hiratz
Occasional Contributor
6 years ago
Solved

The weird aggressive aocl optimization "removing unnecessary storage to local memory"

Hello, I used local memory variables in my kernels but got many compilation warnings like this when compiling them with aocl for an Intel Arria 10 FPGA. When the kernels are compiled into the task type ...
  • HRZ
    6 years ago

    With respect to functional verification, I construct my host code so that both run-time and offline compilation are supported: the latter for FPGAs and the former for other devices, for which I use AMD's OpenCL SDK. As long as the run-time OpenCL driver is installed, the same host code can then execute the same kernel on any type of CPU, GPU, or FPGA. You can take a look at the host code/makefiles of the optimized benchmarks in the following repository as examples of achieving this:

    https://github.com/fpga-opencl-benchmarks/rodinia_fpga
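    One way to structure such a dual-path host code is to decide, per kernel file, whether to compile at run time or load a pre-built binary. The sketch below is illustrative (the function names are mine, not from the linked repository), and the actual `clCreateProgramWithSource`/`clCreateProgramWithBinary` calls are only indicated in comments:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch of the dual-path idea: choose run-time
   compilation (.cl source, for CPUs/GPUs) or offline compilation
   (.aocx FPGA binary) based on the kernel file's extension. */

typedef enum { PROGRAM_FROM_SOURCE, PROGRAM_FROM_BINARY } program_kind;

/* Decide which OpenCL program-creation path to use:
   PROGRAM_FROM_BINARY -> clCreateProgramWithBinary (offline, FPGA)
   PROGRAM_FROM_SOURCE -> clCreateProgramWithSource (run-time, CPU/GPU) */
program_kind classify_kernel_file(const char *path) {
    const char *dot = strrchr(path, '.');
    if (dot && strcmp(dot, ".aocx") == 0)
        return PROGRAM_FROM_BINARY;
    return PROGRAM_FROM_SOURCE;
}

/* Read the whole kernel file into memory; both paths need the bytes
   (source text or binary blob) before creating the cl_program. */
char *read_kernel_file(const char *path, size_t *len) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    *len = (size_t)ftell(f);
    rewind(f);
    char *buf = malloc(*len + 1);
    if (buf && fread(buf, 1, *len, f) != *len) { free(buf); buf = NULL; }
    if (buf) buf[*len] = '\0';
    fclose(f);
    return buf;
}
```

    With this layout, the rest of the host code (context creation, buffer setup, kernel launch) is shared by all devices; only the program-creation branch differs.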

    I emulated all of those kernels on CPUs/GPUs using the same host and kernel code. What I can tell you is that if an NDRange kernel with sufficiently large local and global sizes runs correctly on a GPU, it should also run correctly on an FPGA (unless there is a bug in the FPGA compiler). A CPU should also work fine even if the whole kernel runs on one core, since there will still be multiple threads (work-items) on that core that can be issued out of order, which is usually enough to expose concurrency issues; still, a GPU would likely be more trustworthy in this case.

    With respect to, let's say, HDL vs. OpenCL: many old-school HDL programmers tend to think that OpenCL, and HLS tools in general, are insufficient and that better results can be achieved using HDL. This is indeed true in some cases, such as latency-sensitive or low-power applications where clock-by-clock control over the code is required, or applications that are limited by logic resources. However, I would not say it is the case for high-throughput applications where the limitation is memory/PCIe bandwidth or DSP count, since these limitations are independent of the programming language. For the particular case of unpipelinable nested loops, HDL vs. OpenCL makes no difference: if you have a regular outer loop with an irregular inner loop, the outer loop cannot be pipelined, no matter how you "describe" the code. There are two ways to approach such loops on FPGAs:

    1- Use NDRange and let the run-time work-item scheduler do its best to maximize pipeline efficiency and minimize the average loop initiation interval (II).

    2- Collapse the nested loop, as long as it is not too irregular, and get an II of one at the cost of a noticeable Fmax hit. By "collapse" I mean manual collapsing, not the compiler's "coalesce" pragma. Take a look at Section 3.2.4.3 in this document:

    https://arxiv.org/abs/1810.09773

    Even though the provided example involves collapsing a regular nested loop, this optimization also sometimes applies to irregular nested loops. In such cases, the condition inside the collapsed loop that increments the variable of the original outer loop will consist of more than one statement, which lengthens the critical path and reduces the Fmax. The possibility also exists to implement parts of your application in HDL and use it as an HDL library in an OpenCL kernel, but you will run into complications if your HDL library does not have a fixed latency, and I highly doubt you would achieve much better results in the end.
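    To make the manual-collapse idea concrete, here is a minimal C sketch of my own (not taken from the linked document) using a triangular nested loop, where the inner trip count depends on the outer index:

```c
#define N 8

/* Original nested form: the inner bound depends on the outer index i,
   so a pipelining compiler cannot pipeline the outer loop. */
long sum_nested(int a[N][N]) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= i; j++)   /* irregular inner bound */
            sum += a[i][j];
    return sum;
}

/* Manually collapsed form: a single loop over the total iteration
   count, with the indices advanced by conditions inside the body. */
long sum_collapsed(int a[N][N]) {
    long sum = 0;
    int i = 0, j = 0;
    /* Total iterations of the triangular loop: N*(N+1)/2. */
    for (int it = 0; it < N * (N + 1) / 2; it++) {
        sum += a[i][j];
        /* The exit condition of the original inner loop becomes an
           extra condition here; these added statements are what
           lengthen the critical path and lower the Fmax on FPGAs. */
        j++;
        if (j > i) { j = 0; i++; }
    }
    return sum;
}
```

    The collapsed form has a single loop that can achieve an II of one, at the price of the extra index-update logic in every iteration.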

    Finally, with respect to NDRange vs. Single Work-item, I recommend reading Section 3.1 (and particularly 3.1.4) of the document I posted above.