Hi HRZ,
@HRZ Thank you so much for taking the time to look at our code and for writing such detailed feedback!!
Actually, I have multiple similar versions of this code. In the one shown here, I used local memory for two key struct variables that are frequently accessed by many stream-related functions: __local zfp_stream zfp[MAX_SEG]; __local bitstream stream[MAX_SEG]; (in the kernels "decomp" and "compress", respectively). You may notice the global pointer arguments "__global zfp_stream * restrict zfp2, __global bitstream * restrict stream2", which are not used here; they belong to another implementation in which zfp and stream are placed in global memory. Some members of stream, such as "buffer", "bits" and "i" (the current read/write position), are accessed in many of the called functions. Removing some of the assignments to them (the ones causing the warnings) makes the emulation results incorrect. Although the emulator cannot model the concurrency, it can tell us whether a function is logically correct (please correct me if I'm wrong).
For other buffers, like xy_buffer and xy_bs1 in the kernel "decomp", they may be too big to fit in local memory (e.g., for a 2048 x 2048 double matrix, xy_buffer occupies 2048 x 2048 x 8 bytes = 32 MB).
About the barrier: as you can see from the code, our framework is constructed as a 3-stage pipeline, decompression -> processing -> compression. The processing stage could be any kind of computation (e.g., processing an image, transposing a matrix, etc.; I did not show its code here). Synchronization is needed between two consecutive stages. In one of the earlier versions, which used only a single kernel, I did use a barrier to synchronize between the stages. Later I found that this was inefficient, so I broke the single big kernel into four kernels (three of them are shown in the code here). The synchronization between them is controlled by OpenCL events on the host side, so this becomes a barrier-free design.
If you look at the main compression loop (in codec_2d_public.h) (input: the data to be compressed, xy_buffer; output: the bitstream buffer, begin):
for(int b = start_b; b < start_b + nblock; b++)
{
zfp_encode_block_double_2(begin, stream, zfp, xy_buffer + b * BLOCK_ITEMS);
}
you can see what I want to do: split an xy plane (like an image) into multiple regions, where one region contains nblock 4x4 blocks, so that each work item compresses exactly one region. The loop above should be executed by all work items in parallel, but they access different regions of one big chunk of global memory. zfp and stream contain control data, such as the current bitstream read/write position. So there is actually no data sharing among the work items, and no conflict or overlap between them. (One potential synchronization point across work items lies between compression and merge_streams, but that can also be handled on the host side.)
The decompression loop is similar to the compression one (input: the bitstream buffer xy_bs1; output: xy_buffer).
Unfortunately, the NDRange version of my code is not stable. For small matrix sizes (like 64 x 64) it works well, but for larger ones (like 256 x 256 or 512 x 512), only using one work item gives correct results; using more than one gives wrong results most of the time. I have not yet been able to find the root cause of this phenomenon.
For the task version (using __attribute__((task))), there is also a weird but interesting bug: if I put zfp or stream in local memory (as shown in the code) or in private memory (defined as zfp_stream zfp; bitstream stream;), the results are incorrect; but if I put them in global memory by defining them as global pointers, the results are always correct. I still don't know what exactly happens behind this (logically I cannot see anything wrong). I once suspected that the alignment of zfp or stream was the problem, but even after changing the alignment size in their definitions (codec_2d.h) (e.g., to 256), the problem persists.
"You seem to be under the impression that you can convert an NDRange kernel to Single Work-item just by adding "__attribute__((task))" to the kernel header. This is indeed not the case ..."
You are totally correct! I did not realize this until last night, when I tried the latest 19.1. With 17.1.1, I could simply use __attribute__((task)) even though calls to "get_global_id()" or "get_global_size()" still existed in the code. The reports generated by the initial compilation show that the code is indeed compiled as a single work-item kernel and most loops are pipelined where possible (but the real underlying implementation may not follow the correct logic even though 17.1.1 compiled it successfully; I have no idea whether the bug I mentioned above is related to this). However, with 19.1, __attribute__((task)) is no longer supported and is not recognized by the compiler. I have to use "__attribute__((max_global_work_dim(0)))" instead. In this case, if I still leave "get_global_id()" or "get_global_size()" in my code, I get an obviously incorrect report:
Logic utilization (423226%), ALUTs (502536%), Dedicated logic registers (82%), Memory blocks(65%), DSP blocks(2%)
After I removed all "get_global_id()" and "get_global_size()" calls and replaced all "gid" with 0, the report looked normal.
Please note: __attribute__((reqd_work_group_size(1, 1, 1))) cannot make 19.1 identify the code as a single work-item kernel (it is still treated as an NDRange kernel).
"Remember that just because the code works fine in the emulator it does not mean it is actually correct."
I've been stuck on this kind of problem for more than a month. For all my implementation versions, the emulations are always correct, but the hardware implementations are not necessarily so. The zfp compression software does not yet provide an FPGA implementation (their GPU version was published only recently). Is it possible that they already tried an FPGA version but found it inefficient? I think I need to contact the authors.
Regarding the resource-utilization numbers across different Quartus versions: yes, they are from the first stage of compilation (which takes 1-2 minutes). For the numbers from 18.1.1, I tried the compilation several times, and 18.1.1 always gives similar numbers.
With 17.1.1, I always get warnings like "Compiler Warning: Auto-unrolled loop at file_path: 40 (line number)" if I do not use "#pragma unroll N". Those are exactly the auto-unrolled loops you mentioned. But with 19.1, they are gone. So you are right: this feature has probably been removed (or disabled) in 19.1.
Finally, would you consider a possible cooperation with us, if you have the interest and time? Currently I am the only programmer in this project, and I don't have much experience. If you would like to join, we would consider you a contributor to our project and add your name to the paper we plan to submit in the future :)
Thank you again!