Hi HRZ,
@HRZ Thank you so much for taking the time to look at our code and for writing such detailed feedback!!
Actually, I have multiple similar versions of this code. In the one shown here, I used local memory for two key struct variables that are frequently accessed by many stream-related functions: __local zfp_stream zfp[MAX_SEG]; __local bitstream stream[MAX_SEG]; (in the kernels "decomp" and "compress", respectively). You may notice the global pointer arguments "__global zfp_stream * restrict zfp2, __global bitstream * restrict stream2", which are not used here; they belong to another implementation in which zfp and stream are placed in global memory. Some members of stream, such as "buffer", "bits" and "i" (the current read/write position), are accessed in many of the called functions. Removing some of the assignments to them (the ones causing the warnings) makes the emulation results incorrect. Although the emulator cannot model the concurrency, it can tell us whether a function is logically correct (please correct me if I'm wrong).
For other buffers, like xy_buffer and xy_bs1 in the kernel "decomp", they may be too big to fit in local memory (e.g., for a 2048 x 2048 double matrix, xy_buffer occupies 2048 x 2048 x 8 bytes = 32 MB).
About the barrier: as you can see from the code, our framework is constructed as a 3-stage pipeline, decompression -> processing -> compression. The processing stage could be any kind of computation (e.g., processing an image, transposing a matrix, etc.; I did not show its code here). Synchronization is needed between two consecutive stages. In one of the earlier versions, which used only a single kernel, I did use a barrier to synchronize between the stages. Later I found that this was inefficient, so I broke the single big kernel into four kernels (three of them are shown in the code here). The synchronization between them is controlled by OpenCL events on the host side, so this becomes a barrier-free design.
If you look at the main compression loop (in codec_2d_public.h) (input: the data to be compressed, xy_buffer; output: the bitstream buffer, begin):
for(int b = start_b; b < start_b + nblock; b++)
{
zfp_encode_block_double_2(begin, stream, zfp, xy_buffer + b * BLOCK_ITEMS);
}
you can see what I want to do: split an xy plane (like an image) into multiple regions, where one region contains nblock 4x4 blocks, so that each work item compresses exactly one region. The loop above should be executed by all work items in parallel, but they access different regions of one big chunk of global memory. zfp and stream contain control data, such as the current bitstream read/write position. So there is actually no data sharing among the work items, and no conflict or overlap between them. (One potential synchronization point across work items lies between compression and merge_streams, but that can also be handled on the host side.)
The decompression loop is similar to the compression one (input: the bitstream buffer xy_bs1; output: xy_buffer).
Unfortunately, the NDRange version of my code is not stable. For small matrix sizes (like 64 x 64) it works well, but for larger ones (like 256 x 256 or 512 x 512), only using one work item gives correct results; using more than one gives wrong results most of the time. I have not yet been able to find the root cause of this phenomenon.
For the task version (using __attribute__((task))), there is also a weird but interesting bug: if I put zfp or stream in local memory (as shown in the code) or in private memory (defined as zfp_stream zfp; bitstream stream;), the results are incorrect; but if I put them in global memory by defining them as global pointers, the results are always correct. I still don't know what exactly happens behind this (logically I cannot see anything wrong). I once suspected that the alignment of zfp or stream was the problem, but even after changing the alignment size in their definitions (codec_2d.h) (e.g., to 256), the problem persists.
"You seem to be under the impression that you can convert an NDRange kernel to Single Work-item just by adding "__attribute__((task))" to the kernel header. This is indeed not the case ..."
You are totally correct! I did not realize this until last night, when I tried the latest 19.1. With 17.1.1, I could simply use __attribute__((task)) even though calls to "get_global_id()" or "get_global_size()" still existed in the code. The reports generated by the initial compilation show that the code is indeed compiled as a single work-item kernel and most loops are pipelined where possible (but the real underlying implementation may not follow the correct logic even though 17.1.1 compiled it successfully; I have no idea whether the bug I mentioned above is related to this). However, with 19.1, __attribute__((task)) is no longer supported and is not recognized by the compiler. I have to use "__attribute__((max_global_work_dim(0)))" instead. In this case, if I still leave "get_global_id()" or "get_global_size()" in my code, I get an obviously incorrect report:
Logic utilization (423226%), ALUTs (502536%), Dedicated logic registers (82%), Memory blocks(65%), DSP blocks(2%)
After I removed all "get_global_id()" and "get_global_size()" calls and replaced all "gid" with 0, the report looked normal.
Please note: __attribute__((reqd_work_group_size(1, 1, 1))) cannot make 19.1 identify the code as a single work-item kernel (it is still treated as an NDRange kernel).
"Remember that just because the code works fine in the emulator it does not mean it is actually correct."
I've been stuck on this kind of problem for more than a month. For all my implementation versions, the emulations are always correct, but the hardware implementations are not necessarily so. The zfp compression software does not yet provide an FPGA implementation (their GPU version was published only recently). Is it possible that they already tried an FPGA version but found it inefficient? I think I need to contact the authors.
Regarding the resource-utilization numbers across different Quartus versions: yes, they are from the first stage of compilation (which takes 1-2 minutes). For the numbers from 18.1.1, I tried the compilation several times, and 18.1.1 always gives similar numbers.
With 17.1.1, I always get warnings like "Compiler Warning: Auto-unrolled loop at file_path: 40 (line number)" if I do not use "#pragma unroll N". Those are exactly the auto-unrolled loops you mentioned. But with 19.1, they are gone. So you are right: this feature has probably been removed (or disabled) in 19.1.
Finally, would you consider a possible cooperation with us, if you have the interest and time? Currently I am the only programmer in this project, and I don't have much experience. If you would like to join, we would consider you a contributor to our project and add your name to the paper we plan to submit in the future :)
Thank you again!