How to deal with out-of-order loop iterations in a single work-item kernel?

Question

Hi,

Today I tried writing a single work-item kernel that contains a nested loop. The Loop Report says my outer loop is not pipelined due to:

    loop iteration ordering: iterations may get out of order with respect to the inner loop,
    as the number of iterations of the inner loop may be different for different iterations of this loop.

I understand the problem: different iterations of the outer loop really do need different numbers of inner-loop iterations. The "Out-of-Order Loop Iterations" section of the Best Practices Guide has an example that is very similar to my code:

__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output, int N ) {
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        // inner trip count depends on i, so outer iterations can get out of order
        for (unsigned j = 0; j < i; j++)
            sum += input[i+j];
    }
    output[0] = sum;
}

But the guide does not give a solution. How can I pipeline the outer loop, or otherwise deal with this problem? Would it work if I used multiple kernels?

altera_forum · Answer

Sorry, I was just thinking about using multiple kernels. Would that solve this problem? Thanks in advance.

altera_forum · Answer

You can pipeline the loop like this:

__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output, int N ) {
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        // fixed inner trip count; the guard skips the extra iterations
        for (unsigned j = 0; j < N; j++)
            if (j < i) sum += input[i+j];
    }
    output[0] = sum;
}

However, the inner loop now runs N times for every outer iteration, so the loop nest executes N*N inner iterations instead of N(N-1)/2, roughly twice as many. Depending on N, this code could actually be slower than the original because of the redundant computation. For loops that cannot be pipelined like this, it is actually preferred to use NDRange kernels.
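
To make the trade-off concrete, here is a minimal host-side sketch in plain C (not OpenCL) that models both loop nests and counts how often the inner body is entered. The function names `order_triangular` and `order_padded`, the `len` and `inner_iters` parameters, and the `input[i + j]` access pattern are assumptions made for illustration; they are not part of the kernels in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the original loop nest: the inner trip count depends on i. */
unsigned order_triangular(const unsigned *input, size_t len, unsigned N,
                          unsigned long *inner_iters) {
    unsigned sum = 0;
    *inner_iters = 0;
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < i; j++) {
            assert((size_t)i + j < len);  /* assumed access: input[i + j] */
            sum += input[i + j];
            (*inner_iters)++;
        }
    }
    return sum;
}

/* Model of the rewrite: fixed inner trip count N with a guarded body. */
unsigned order_padded(const unsigned *input, size_t len, unsigned N,
                      unsigned long *inner_iters) {
    unsigned sum = 0;
    *inner_iters = 0;
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < N; j++) {
            if (j < i) {                  /* only the original iterations add */
                assert((size_t)i + j < len);
                sum += input[i + j];
            }
            (*inner_iters)++;             /* counted even when the guard fails */
        }
    }
    return sum;
}
```

For N = 8 both functions return the same sum, but the inner body is entered 28 times in the first version and 64 times in the second, which matches N(N-1)/2 versus N*N. Iteration counts alone do not determine FPGA run time, so this only illustrates the extra work, not the actual speed of either kernel.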

altera_forum · Answer

Thanks very much. My code is more complex, so it is hard to give the inner loop the same number of iterations every time. Yes, it does seem preferable to use NDRange kernels.
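
As a rough sketch of how the NDRange idea could map onto the example from this thread: each work-item takes one outer iteration and writes a partial sum, and the partial sums are added up afterwards. The code below is a plain C model that runs on the host, not an OpenCL kernel. The names `order_work_item` and `order_ndrange_model`, the `partial` buffer, and the `input[gid + j]` access are hypothetical; in a real kernel `gid` would come from `get_global_id(0)`.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one work-item: handles the outer iteration numbered gid. */
void order_work_item(const unsigned *input, unsigned *partial, size_t gid) {
    unsigned sum = 0;
    for (size_t j = 0; j < gid; j++)
        sum += input[gid + j];   /* assumed access pattern */
    partial[gid] = sum;          /* one slot per work-item, so no write conflicts */
}

/* Host-side stand-in for launching N work-items, followed by a reduction. */
unsigned order_ndrange_model(const unsigned *input, unsigned *partial, size_t N) {
    assert(input != NULL && partial != NULL);
    for (size_t gid = 0; gid < N; gid++)   /* work-items are independent */
        order_work_item(input, partial, gid);

    unsigned total = 0;
    for (size_t i = 0; i < N; i++)         /* add up the partial sums */
        total += partial[i];
    return total;
}
```

Because unsigned addition wraps around and does not depend on order, the total matches the single work-item version regardless of the order in which work-items finish. Note that work-item `gid` performs `gid` inner iterations, so the work is unevenly distributed, and the final reduction would still need to happen on the host or in a second kernel.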
