Hi,

Since an NDRange kernel is implemented as a work-item-based pipeline on the FPGA, if I understand correctly, the maximum number of work-items in flight should be determined by the complexity (or number of stages) of the kernel, right?

Take the following kernel for example (from the beginning of Chapter 4 of Intel's "Best Practices" guide):

    __kernel void add (__global int * a, __global int * b, __global int * c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }

The compiler generates a 3-stage pipeline for it:

1) Two Load units (load a and b simultaneously)
2) One Add unit
3) One Store unit

So for this 3-stage pipeline, at most 3 work-items can be in flight, no matter how many work-items are specified in the host code. If we want more work-items in flight, we have to add more computation or operations that will be translated into extra stages. Do I understand this correctly?

Since a deeper pipeline provides more parallelism, if my understanding above is correct, a simple kernel with few operations cannot benefit much from the NDRange implementation (no matter how many work-items are specified), right?

Thanks!

HRZ
Frequent Contributor
6 years ago
The number of work-items that can be in flight simultaneously depends on the pipeline depth. Even though you see only three units in the report, the total length of the pipeline should be on the order of 50-200 stages, which allows the same number of work-items to be pipelined at the same time. Note that if you want work-item parallelism, you should use SIMD. By default, work-items are only pipelined in NDRange kernels.
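To make the difference between a 3-stage and a 50-200-stage pipeline concrete, here is a minimal sketch of an idealized pipeline model. It assumes no stalls, one issue slot per cycle, and that SIMD simply widens each issue slot without changing the depth. The function names, the depth of 100 (picked from inside the 50-200 range mentioned above), and the work-item count are illustrative assumptions, not values from any compiler report.

```python
import math


def max_in_flight(num_work_items: int, depth: int, simd: int = 1) -> int:
    """Upper bound on work-items inside the pipeline at once (idealized model)."""
    return min(num_work_items, depth * simd)


def drain_cycles(num_work_items: int, depth: int, simd: int = 1) -> int:
    """Cycles to push all work-items through a stall-free pipeline.

    `simd` work-items enter per cycle; the last group needs `depth` cycles to exit.
    """
    if num_work_items <= 0:
        return 0
    issue_cycles = math.ceil(num_work_items / simd)
    return depth + issue_cycles - 1


N = 1024      # hypothetical global work size
DEPTH = 100   # hypothetical pipeline depth, within the 50-200 range

print(max_in_flight(N, 3))             # 3   (the "three units" reading)
print(max_in_flight(N, DEPTH))         # 100 (the deep-pipeline reading)
print(drain_cycles(N, DEPTH))          # 1123
print(drain_cycles(N, DEPTH, simd=4))  # 355
```

In this model the pipeline depth only adds a fixed fill latency, while the number of work-items and the SIMD width determine how long the kernel runs, which is why SIMD is the suggested route to actual work-item parallelism.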

hiratz
Occasional Contributor
6 years ago
Thank you, HRZ.

Actually, I did not compile this example code. I just read the description in Intel's "Best Practices Guide" of how hardware pipeline stages are generated for a given kernel. The guide provides many similar, simple examples to help people understand how pipeline parallelism is obtained.

I'm still curious how the single statement "c[gid] = a[gid]+b[gid];" can end up with a pipeline depth on the order of 50-200 stages. The guide does not seem to mention such implicit stages. Could you provide more details?

HRZ
Frequent Contributor
6 years ago
The latency of most operations on the FPGA is higher than one cycle, to allow a reasonable operating frequency. For external memory accesses in particular, the latency is on the order of a few hundred cycles. Generally, the compiler generates a pipeline deep enough to absorb the majority of the external memory stalls and, at the same time, to accommodate all the necessary operations while targeting a specific operating frequency (240 MHz by default). If you check the "System viewer" tab of the HTML report, you can find the latency of each block in your code and calculate the total pipeline depth by adding up all the latency values.
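The "add up the block latencies" step can be sketched as below. The block names and latency values are made up for illustration and are not taken from a real report; only the 240 MHz default target comes from the reply above.

```python
# Hypothetical per-block latencies (in cycles), as one might read them
# off the "System viewer" tab. These numbers are invented for illustration.
block_latencies = {
    "block0_entry": 2,
    "block1_loads_add_store": 150,
    "block2_exit": 1,
}

TARGET_FMAX_MHZ = 240.0  # default compiler target mentioned above


def total_pipeline_depth(latencies: dict) -> int:
    """Total depth in cycles: the sum of all block latencies."""
    return sum(latencies.values())


def depth_in_microseconds(depth_cycles: int, fmax_mhz: float) -> float:
    """Time for one work-item to traverse the pipeline at the given clock."""
    return depth_cycles / fmax_mhz


depth = total_pipeline_depth(block_latencies)
print(depth)                                          # 153
print(depth_in_microseconds(depth, TARGET_FMAX_MHZ))  # about 0.64 microseconds
```

The achieved clock frequency of a real design may differ from the target, so the time figure is only a rough estimate.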

I see. Nice explanation! I just looked at the "System viewer" tab of the HTML report, and it does show the latency of each block in my code. Good info!

Thanks again!

One more question: the purpose of unrolling a loop is to increase the depth of the pipeline (for a single work-item kernel), not to turn the unrolled iterations into a SIMD circuit (truly parallel execution), right? If so, for the NDRange version, since the loop cannot be pipelined as it is in a single work-item kernel, putting "#pragma unroll" before a loop brings no benefit (and only adds area), right?

(Note: by "the loop cannot be pipelined" above, I mean that its iterations cannot be pipelined. Instead, the loop is treated as a whole and forms the pipeline together with the other code, so the loop as a whole becomes one stage. In that case, there is no difference between unrolling and not unrolling the loop. This is just my understanding.)

BTW, I'm curious why the compiler can still unroll a loop whose bound is a run-time value, for example "while (i < n) { i++; do sth. }" (assuming n is not changed in the loop body). If n is large, there will not be enough area for the compiler to unroll the loop. (Please correct me if I have misunderstood this.)

Thanks!

How to add the number of work items in flight for the NDRange kernel?

