How to add the number of work items in flight for the NDRange kernel?

HRZ
Frequent Contributor
7 years ago
Latency of most operations on the FPGA is higher than one cycle to allow reasonable operating frequency. For the particular case of external memory accesses, the latency is in the order of a few hundred cycles. Generally the compiler generates a deep-enough pipeline to be able to absorb the majority of the external memory stalls and at the same time accommodate all the necessary operations in the pipeline targeting a specific operating frequency (240 MHz by default). If you check the "System viewer" tab of the HTML report, you can find the latency of each block in your code and calculate the total pipeline depth by adding up all the latency values.
hiratz
Occasional Contributor
7 years ago
I see. Nice explanations! I just looked at the "System viewer" tab of the HTML report and it indeed shows the latency of each block in my code. Good info!
Thanks again!
hiratz
Occasional Contributor
7 years ago
One more question, the purpose of unrolling a loop is to add the depth of the pipeline (for single work item), not to let the unrolled iterations become a SIMD circuit (real parallel execution), right? If so, for NDRange version, since the loop cannot be pipelined as it is in the single work item, putting a "#pragma unroll" before a loop actually cannot bring some benefit (but add some extra area), right? (Note: when saying "the loop cannot be pipelined as ..." above, I mean their iterations cannot be pipelined. Instead, the loop is viewed as a whole and constructs the pipeline with other code. As a result, the loop becomes a stage as a whole. In this case, there is no difference between unrolling or not unrolling the loop. This is just my understanding.)
BTW, I'm curious why the compiler still can unroll a loop whose loop bound is a run-time value. For example, "while(i < n) {i++; do sth.}" (assume n is not changed in the loop body). If n is pretty large, there will be no enough area for the compiler to unroll the loop. (Please correct me if I understand this incorrectly.)
Thanks!
HRZ
Frequent Contributor
7 years ago
Actually loop unrolling has a similar effect to that of SIMD: it allows multiple loop iterations (rather than work-items) to be executed in parallel. For Single work-item kernels, unrolling is the main method of achieving parallelism (the other is to use multiple kernels in different queues or automatic kernel replication using the autorun attribute). For NDRange kernels you have SIMD, unrolling and compute unit replication. The difference between SIMD and unrolling in this case will be that SIMD enables multiple work-items to be executed in parallel, while unrolling allows each work-item execute multiple loop iterations in parallel. Indeed loop unrolling can have a big effect on the performance of NDRange kernels if each work-item has to go through a loop with a large trip count. This benchmark is probably a good example (compare v0 and v2):
https://github.com/fpga-opencl-benchmarks/rodinia_fpga/tree/master/opencl/lud/ocl
Though, unrolling loops in NDRange kernels can have a negative effect on performance if it results in non-coalesced memory ports.
With respect to unrolling loops with unknown loop bounds, the compiler cannot perform full unrolling on such loops but partial unrolling is possible in which case the compiler will create a branch inside of the loop to avoid going out of bounds. Partial unrolling of loops with unknown bounds is likely not a good idea since the branch will result in inefficient area utilization; a better method would be to perform manual loop unrolling as described in Section 3.2.2.1/Figure 3-5 in this document:
https://arxiv.org/ftp/arxiv/papers/1810/1810.09773.pdf
hiratz
Occasional Contributor
7 years ago
Thanks for the the benchmark link. v2 uses a lot of loop unrolling and should be faster than v0 (I also looked at your SC16 paper but not yet finished).
Sounds like loop unrolling is used for parallel execution for both single-workitem and NDRange. So it cannot increase the number of stages (pipeline depth) in a loop for single work-item, right? For example, assume a kernel only contains one loop with a const bound N, after it is completely unrolled, all iterations will disappear and no pipeline will exist any more (the real parallel circuit is generated).
One relevant topic is manual unrolling. Please see the following three different code snippets. Do you believe the compiler will generate the same circuit for them? (Assume we use single work-item). I doubt code 3 which is a sequence of regular statements. Without the keyword "#pragma unroll" or "for", they may not be compiled into a parallel circuit.
Code 1: #pragma unroll for(i = 0; i < 10; i++) a[i] = b[i] + c[i]; Code 2: #pragma unroll for(i = 0; i < 10; i+=2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1]; } Code 3: a[0] = b[0] + c[0]; a[1] = b[1] + c[1]; ... a[9] = b[9] + c[9];
By saying "automatic kernel replication using the autorun attribute", what do you mean by "replication"? As far as I know, the "autorun" makes the kernel keep being launched automatically (and repeatedly) like a infinite loop. Unlike compute unit copy, the "autorun" kernel should be only one copy. Do I miss something?
I've been reading your thesis since you showed me it in another post last time. I notice that you posted the third version just several days ago. Is there any big change between it and the second version?
In Sec. 3.2.2.1, you mention 3-4 b) is worse than 3-5.
Quote:
"Even though the resulting optimized loop can be partially unrolled by using the unroll pragma to further improve the performance, doing so will break the shift register optimization and requires that the size of the shift register is increased further to accommodate for the unrolling. With large unroll factors, this method can result in large area overhead to implement the shift register."
By "partially unrolled by using the unroll pragma", do you mean the outer loop (because your "shifting" loop and "final reduction" loop have fixed loo p bound, so they do not need a partial unrolling)?
Since the "shifting" loop in both 3-4 b) and 3-5 have fixed bound ("FADD_LATENCY"), why do you say "the size of the shift register" need to be increased? Again, for "With large unroll factors", I'm confused with this because you have a fixed "FADD_LATENCY" bound. Sorry I may not totally understand the above quoted descriptions.
Two typos I found:
Page 19: "on the memory buss", "buss" should be "bus"
Page 20 : "read form the head", "form" should be "from"
Thanks!
HRZ
Frequent Contributor
7 years ago
Actually, most of the unrolling in the v2 version of LUD is not used in practice due to increase in ports to local buffers resulting in port sharing. The main performance difference is caused by the unrolling in the internal kernel. Details of the performance difference are mentioned in paragraph 2 of Section 4.3.1.6 in my thesis. By the way, the content of the SC paper is out of date by now; the thesis has the most up-to-date results.
Loop unrolling does increase the pipeline depth, but not relative to the unroll factor. My assumption is that the reason for the increase in the depth is the increase in circuit complexity that requires more registers inserted into the pipeline to meet the target frequency, rather than increase as a direct result of the unrolling. Note that it is incorrectly stated in the SC paper that pipeline depth increases relative to the unroll factor. This has been corrected in the thesis (Section 3.1.2). If you fully unroll a loop, you will still have a shallow pipeline with a depth that accommodates one instance of the loop body, but a width that accommodates all the unrolled loop iterations.
Regarding your code snippet, you can put that in the compiler and check ( ;) ); all three examples generate the exact same circuit with the exact same latency and area utilization. The compiler is actually smart enough to parallelize code segments that do not depend on each other.
Regarding replication of autorun kernels, check “Section 12.4.1. Customization of Replicated Kernels Using the get_compute_id() Function” of the Programming Guide. This feature is very useful for creating rings or systolic array of processing elements.
Regarding the recent thesis update, I just fixed a few typos here and there, no new content. I am already done with the thesis. ;)
Regarding the example in Section 3.2.2.1, indeed the quote refers to the outer loop; every other loop is already fully unrolled. The problem with required shift register size is not general. That example deals with the specific case of unrolling a reduction loop which requires shift register inference to achieve an II of one in the first place (Paragraph one of Section 3.2.2.1 in thesis and Section 5.1.5 of the Best Practices Guide). If the outer loop is partially unrolled using #pragma unroll, then the latency of the reduction operation will increase and hence, a bigger shift register will be required. With manual unrolling, this problem will be avoided. This problem will not exist in case of a standard loop that does not involve reduction; however, it is always best to avoid using partial unrolling using #pragma unroll unless the loop bound is known and is a multiple of the unroll factor.
And thank you for pointing out the typos, now I need to submit a v4. :D
P.S. This forum really needs a proper means of quoting...
hiratz
Occasional Contributor
7 years ago
Sorry for the late reply. In the past few days, I was not working on my research.
It's so nice that you point out that the SC paper is out of date now and some statements there are not correctly stated. I'll focus on your thesis :)
Your comments about "partial loop unrolling" helps me a lot. For my zfp project, I use various methods to replace a loop bound that is a run-time variable with a constant value. I show how I did this below:
#pragma unroll for(int i = 0; i < n; i++) do sth.
This is the original loop with "#pragma unroll" which causes the partial loop unrolling. What I did is:
1 If the n is actually a constant value N, I just change n to N; If I am sure that n is not changed during the run time by profiling, I did the same thing;
2 If n is not a constant during the run time and is a relatively large number, I change it like your 3-5:

int len = n/N, r = n%N; for(int i = 0; i < len ; i++) { #pragma unroll for(int j = 0; j < N; j++) do sth. } for(int i = 0; i < r; i++) do the remaining stuff
3 If I know the n has a upper bound by profiling, I can replace n with a constant "M". Here M > n, but extra iterations will not generate wrong results.
With these rewriting, I got a 2x speedup.
Now, two new questions come to my mind because I want to dig more concurrency of multiple work-items:
1) Since work items execute the kernel in a pipeline style, can I assume that only one work item is executing a given statement at any time point during the kernel's running? For example, for some statement "var = val + 1" (assume its initial value is 1), if only one work item executes it at any time point, we will get a result of N+1 (say we have N work items). This implicitly implements a mutex lock.
2) Currently once a work item finishes its execution of a kernel, it quits the pipeline. I was wondering if we can implement a thread-pool model like the case in the CPU-based programming model, where we pre-creates N threads and once new data arrives, they will be assigned to a idle thread or we actively pick one idle thread to handle the data.
The background is: my data to be processed is large but the global memory budget is limited. So I have to use a loop on the host code to transfer partial data to the FPGA global memory and launch the kernel to handle them many times. For each launching, assume we launch N work items, we can only process partial data. If N is pretty large, many work items that quit the pipeline early are actually not used fully. If we can let them stand by once they finish processing previous data, we can keep sending data to the kernel. As a result, we can get a longer streaming processing.
Do you think it is possible to do so with currently OpenCL FPGA mechanism?
Thank you!
HRZ
Frequent Contributor
7 years ago
I would say your approach to unrolling the loops is correct.
1) Indeed the loop iterations are run sequentially and there is an implicit guarantee for data consistency. This guarantee comes from the compiler's loop dependency analysis. However, in many cases, the compiler will detect a loop-carried dependency. In such cases, it will either increase the loop II to resolve the dependency, or if it cannot resolve it, it will serialize the loop. Of course this only applies to Single Work-item kernels. For the "var = val + 1" example, each iteration has a dependency on the execution of the previous one due to reuse of the same variable in both sides of the statement. If the operation can be performed in one clock cycle (e.g. in case of integer values), then the compiler can resolve the dependency without increasing the II. However, if the operation takes more than one clock cycle (e.g. in case of floating-point values), then the II will increase by the latency of the operation. This is essentially the problem of "reduction" that I mentioned in the previous replies and can be resolved by inferring a shift register.
2) Probably the best approach to solve the problem in your case would be to use the host channel/pipe feature, if your board supports it. In that case, you can stream data directly from the host to the kernel and process it all using the same set of work-items and the channel will enable implicit synchronization in this case. In general, however, as long as each chunk of data you pass to the FPGA is large enough, the overhead of work-items finishing early will be quite small. You can also try implementing global memory double-buffering to overlap computing of each chunk with the PCI-E transfer of the next one. In such streaming application, your performance bottleneck will likely be the PCI-E transfer rather than the actual computation. I am not sure if creating a pool of threads is at all possible using the OpenCL standard.
hiratz
Occasional Contributor
7 years ago
For 1), I did tests and it turned out that I cannot get an effect of mutex lock for a shared variable among work items. The test kernel code I used are as follows:
#define MAX_SEG (2048) #define attr_max_wg __attribute__((max_work_group_size(MAX_SEG))) attr_max_wg __kernel void test1(buf_addr_space double * restrict buf_in) { size_t gid = get_global_id(0); size_t gsize = get_global_size(0); local double share_var; if(gid == 0) share_var = 1; else share_var++; if(gid == gsize - 1) buf_in[0] = share_var; } attr_max_wg __kernel void test2(buf_addr_space double * restrict buf_in) { size_t gid = get_global_id(0); buf_in[0]++; }
test1 and test2 are two different versions to implement a mutex lock by pipelined work items. test1 uses a local shared variable "share_var" and test2 uses "buf_in[0]" directly. The kernel is compiled into a NDRange type. Assume I launch it with 8 work items and the initial value of buf_in[0] is assigned to 1 on the host code, the emulation will give me the results for test1 and test2 respectively: 8 and 9, which are I want. However, the hardware .aocx gave me the results: 1 and 2, respectively. So it looks like that no strict execution order is guaranteed and all work items seem to finish at the same time (where is the pipeline order?). So the hardware execution cannot implement an exclusive access to a shared variable by the pipeline mechanism. I may be wrong or missed something.
2) Unfortunately, host pipe is not supported on Intel Harp platform yet. Maybe double-buffering is the only choice.
Thank you!
HRZ
Frequent Contributor
7 years ago
1) The consistency I mentioned above is only for loop iterations in single work-item kernels. Work-items in NDRange kernel can, and will, be executed out of order. The only means of achieving data consistency in NDRange kernels is to use local memory barriers. However, you will not be able to get share_var=*num_work-items* from the first code snippet even with barriers, unless you use a switch case and add as many barriers as the thread number in each case to make sure every previous thread has already updated the shared variable; this would effectively sequentialize the computation. It might be easier to achieve your purpose using other parallel programming techniques like waiting on a shared flag. The second code snippet is functionally incorrect from the point of view of the OpenCL standard since the standard does not guarantee global memory consistency except at the end of kernel execution; hence, you should not use global memory for implementing a shared variable. You can, however, use atomic memory operations and you will correct results in that case but it will be extremely slow.
2) I believe the Harp system should support shared memory between the CPU and the FPGA, eliminating the need for double-buffering unless your data cannot fit on the host memory.

Forum Discussion

How to add the number of work items in flight for the NDRange kernel?

Recent Discussions

AI Suite - Spatial IP outputs wrong value

AI Suite - Is it possible to simulate the AI IP?

AI Suite - Streaming from HPS to DLA IP

Agilex 7 I-Series "aocl diagnose acl0" error following OFS

AI Suite - Custom model in the FPGA building process