Forum Discussion

Altera_Forum (Honored Contributor)
7 years ago

How do I deal with out-of-order loop iterations in a single work-item kernel?

Hi,

Today I tried using a single work-item kernel. My kernel has a nested loop, and the Loop Report shows that the outer loop is not pipelined because of:

"loop iteration ordering: iterations may get out of order with respect to the inner loop, as the number of iterations of the inner loop may be different for different iterations of this loop."

I understand the problem: different iterations of the outer loop really do need different numbers of inner-loop iterations. The "out-of-order loop iterations" section of the Best Practices Guide has an example that is very similar to my code:


__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output, int N ) {
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        // The inner trip count is i, so it differs for every outer iteration.
        for (unsigned j = 0; j < i; j++)
            sum += input[i + j];
    }
    output[0] = sum;
}

However, the guide does not suggest a solution. How can I get this loop pipelined, or otherwise work around the problem? Would splitting the work across multiple kernels help?

3 Replies

  • Altera_Forum

    Sorry, I was just thinking about using multiple kernels. Could that solve this problem?

    Thanks in advance.
  • Altera_Forum

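    You can get the outer loop pipelined by giving the inner loop a fixed trip count and guarding the body with the original condition:

    __kernel void order( __global unsigned* restrict input,
                         __global unsigned* restrict output, int N ) {
        unsigned sum = 0;
        for (unsigned i = 0; i < N; i++) {
            // Fixed trip count N; the guard j < i skips the padding iterations.
            for (unsigned j = 0; j < N; j++)
                if (j < i) sum += input[i + j];
        }
        output[0] = sum;
    }

    However, the inner loop now always runs N times, so the kernel performs N*N inner iterations instead of N*(N-1)/2. Depending on N, this version can therefore be slower than the original because of the redundant iterations. For loops like this that cannot be pipelined efficiently, an NDRange kernel is usually the preferred choice.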
  • Altera_Forum

    Thanks very much.

    My code is more complex, so it is hard to give every outer iteration the same number of inner iterations. I agree that an NDRange kernel is the better choice here.