Forum Discussion

Altera_Forum
Honored Contributor
7 years ago

Complex loop exit condition

Hello,

I have a very simple kernel with only a few lines, but the outer loop is not pipelined due to "Loop exit condition unresolvable at iteration initiation". The inner loop is pipelined well with II=1. Can anybody suggest an idea?

Thanks

```c
__attribute__((task))
__kernel void compute_BFS0(
    __global const unsigned* restrict ovid_of_edge,
    __global const unsigned* restrict start_edge,
    __global const unsigned* restrict end_edge,
    __global unsigned* restrict node_data
)
{
    unsigned ei;
    unsigned si;
    unsigned ovid;

    for (unsigned i = 0; i < 1000; i++) // iterates over graph nodes
    {
        si = start_edge[i]; // sequential read
        ei = end_edge[i];   // sequential read
        for (unsigned j = si; j < ei; j++) // iterates over node's outgoing edges
        {
            ovid = ovid_of_edge[j]; // child node. sequential read.
            node_data[ovid] = 1000; // random-access write
        }
    }
} // kernel
```

5 Replies

  • Altera_Forum
    Honored Contributor

    Your inner loop has a variable exit condition; hence, the outer loop is not pipelineable. If ei has a maximum limit, you can replace ei in the header of the inner loop with that maximum limit, and add a branch inside the inner loop to discard cases where j >= ei. That will allow pipelining of both loops, at the cost of redundant computation, which could be significant depending on how far ei is from that maximum. If there is no maximum, or it is too large, then an NDRange kernel would be more appropriate, since the thread scheduler can minimize pipeline bubbles at run-time.
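    A minimal host-side C sketch of that transformation, with the kernel loops emulated on a toy CSR-style graph. MAX_DEGREE is an assumed per-node upper bound on ei - si; all array contents here are illustrative, not from the original post:

    ```c
    #include <assert.h>
    #include <stdio.h>

    #define NUM_NODES  5
    #define MAX_DEGREE 3 /* assumed upper bound on any node's out-degree */

    int main(void) {
        /* toy graph: node i's outgoing edges are start_edge[i] .. end_edge[i]-1 */
        unsigned start_edge[NUM_NODES]  = {0, 2, 3, 3, 5};
        unsigned end_edge[NUM_NODES]    = {2, 3, 3, 5, 5};
        unsigned ovid_of_edge[5]        = {1, 2, 0, 3, 1};
        unsigned node_data[NUM_NODES]   = {0};

        for (unsigned i = 0; i < NUM_NODES; i++) {
            unsigned si = start_edge[i];
            unsigned ei = end_edge[i];
            /* Fixed trip count: the exit condition is now resolvable at
               iteration initiation. Out-of-range iterations are discarded
               by the branch, at the cost of redundant loop iterations. */
            for (unsigned j = 0; j < MAX_DEGREE; j++) {
                if (si + j < ei)
                    node_data[ovid_of_edge[si + j]] = 1000;
            }
        }

        for (unsigned i = 0; i < NUM_NODES; i++)
            printf("node_data[%u] = %u\n", i, node_data[i]);
        return 0;
    }
    ```

    Nodes reached by at least one edge end up with 1000; untouched nodes (node 4 in this toy graph) keep their initial 0, exactly as the variable-bound version would produce.
    
    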

  • Altera_Forum
    Honored Contributor

    If the problem is the variable loop bound of the inner loop, I wonder why the code below doesn't have this problem too, and both loops are pipelined (according to the report).

    In the first, problematic code (mentioned earlier), a constant value is written to the children of every parent node in a graph; in the code below, the values of the children are read and their sum is written to the parent node.

    In the earlier code there is a random-access write, whereas in the second one we have a random-access read. Could this be the source of the pipelining problem?

    ```c
    __attribute__((task))
    __kernel void compute_pagerank(
        __global const unsigned* restrict ovid_of_edge,
        __global const unsigned* restrict start_edge,
        __global const unsigned* restrict end_edge,
        __global unsigned* restrict node_data,
        __global unsigned* restrict node_data2
    )
    {
        unsigned acc = 0;
        unsigned ei;
        unsigned si;
        unsigned ovid;

        for (unsigned i = 0; i < 1000; i++) // iterates over graph nodes
        {
            acc = 0;
            si = start_edge[i]; // sequential read
            ei = end_edge[i];   // sequential read
            for (unsigned j = si; j < ei; j++) // iterates over node's outgoing edges
            {
                ovid = ovid_of_edge[j]; // child node. sequential read.
                acc += node_data[ovid]; // random-access read
            }
            node_data2[i] = acc; // sequential write
        }
    } // kernel
    ```

    Thanks
  • Altera_Forum
    Honored Contributor

    I am genuinely surprised to see that the second kernel is actually pipelineable. Other than the difference between read and write that you mentioned, I don't see any other difference between the two kernels that could cause this. However, I would say the strange case here is the second kernel, not the first one. I checked with newer versions of the compiler: 16.1.2 and 17.0 fully pipeline the second kernel, but 17.1 and 18.0 say:

    --- Quote Start ---

    II >=1

    II is an approximation due to variable inner loop trip count.

    --- Quote End ---

    I guess the compiler is doing some corner case optimization that allows pipelining in the second kernel. Other than that, I have no idea.

    P.S. You can put your code segments in a CODE tag to preserve indentation.
  • Altera_Forum
    Honored Contributor

    I tested both of your suggestions:

    1- Inner loop with a fixed limit: it is now pipelineable, but the increase in run-time due to wasted clock cycles in the inner loop is large, because my upper limit is huge.

    2- I tested the NDRange version: that really worked well! I don't know exactly what the difference is between the implementation of pipelining inside a task and pipelining among work-items in an NDRange, but this is a good example for me that they are not similar.

    Thank you very much for following this discussion; it was really instructive. :)
  • Altera_Forum
    Honored Contributor

    --- Quote Start ---

    2- I tested the NDRange version: that really worked well! I don't know exactly what the difference is between the implementation of pipelining inside a task and pipelining among work-items in an NDRange, but this is a good example for me that they are not similar.

    --- Quote End ---

    The difference is that the run-time scheduler in the NDRange version avoids all the redundant computation. NDRange is preferred for unpipelineable loops for this reason (while single work-item is preferred in pretty much every other case).
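    The iteration-count difference can be made concrete with a small C emulation: each "work-item" stands in for one invocation of the NDRange kernel body and runs only its own node's edges, while the fixed-bound single work-item version always pays MAX_DEGREE iterations per node. The toy graph and MAX_DEGREE value below are assumptions for illustration:

    ```c
    #include <assert.h>
    #include <stdio.h>

    #define NUM_NODES  5
    #define MAX_DEGREE 4 /* assumed worst-case out-degree for the bounded variant */

    /* toy CSR-style graph: node i's edges are start_edge[i] .. end_edge[i]-1 */
    static const unsigned start_edge[NUM_NODES] = {0, 2, 3, 3, 5};
    static const unsigned end_edge[NUM_NODES]   = {2, 3, 3, 5, 5};

    /* Emulate one NDRange work-item: it executes only its own node's
       ei - si inner iterations; nothing has to be predicated away. */
    static unsigned work_item_iters(unsigned gid) {
        return end_edge[gid] - start_edge[gid];
    }

    int main(void) {
        unsigned ndrange_iters = 0, bounded_iters = 0;
        for (unsigned i = 0; i < NUM_NODES; i++) {
            ndrange_iters += work_item_iters(i); /* useful iterations only */
            bounded_iters += MAX_DEGREE;         /* fixed-bound single work-item */
        }
        printf("NDRange inner iterations:     %u\n", ndrange_iters); /* 5  */
        printf("Fixed-bound inner iterations: %u\n", bounded_iters); /* 20 */
        return 0;
    }
    ```

    On this toy graph the NDRange decomposition does 5 useful iterations versus 20 for the fixed-bound loop; with a "huge" upper limit, as in the poster's case, the gap grows accordingly, and the hardware work-item scheduler overlaps the short per-node loops to keep the pipeline busy.
    
    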