Altera_Forum
Honored Contributor
10 years ago
Relaxing Data Dependencies on Memory Access
I'm trying to stream in a block of contiguous memory, but only process the data until an end marker is reached. Put simply: iterate through an array until a certain value is found, after which all further elements are to be ignored. A simplistic solution with an OpenCL single work-item kernel would be as follows:
#pragma OPENCL EXTENSION cl_altera_channels : enable
channel uint2 chan;

__kernel void in_streamer(__global const uint2* in, uint n) {
    for(uint i = 0; i != n; ++i) {
        uint2 value = in[i];
        write_channel_altera(chan, value);
        if(value.s0 == END_MARKER) {
            break;
        }
    }
}
__kernel void consumer() {
    while(true) {
        uint2 value = read_channel_altera(chan);
        // do work here
        if(value.s0 == END_MARKER) {
            break;
        }
    }
}
The consumer kernel is entirely unproblematic; its data dependency on the previous iteration only involves an equality comparison. The in_streamer, while working as intended, causes terribly poor performance because there is a data dependency on a memory load operation. The AOCL compiler produces the following warning in the optimization report: "Successive iterations launched every 164 cycles due to: Data dependency on variable, Largest Critical Path Contributor: 98%: Load Operation". This in itself is of course nothing special. I've dealt with such data dependencies before by using a shift register to relax the dependency, as the Altera Best Practices Guide suggests. The idea is to allow the compiler to pipeline an expensive operation. To make this possible, I can't use the data in the very next iteration, but only after a large number of iterations. This has usually worked for me on these kinds of problems, but it doesn't seem to work with memory accesses. The following version of in_streamer breaks the loop after the end marker is found, but not immediately, in order to relax the dependency. The elements that are read after the end marker was found are discarded and not written to the channel:
__kernel void in_streamer(__global const uint2* in, uint n) {
    #define MEM_DELAY 164
    bool endmarker_reached[MEM_DELAY];
    #pragma unroll
    for(int s = 0; s < MEM_DELAY; ++s) {
        endmarker_reached[s] = false;
    }
    for(uint i = 0; i < n; ++i) {
        uint2 value = in[i];
        if(!endmarker_reached[0]) {
            write_channel_altera(chan, value);
        }
        bool end_it = false;
        if(value.s0 == END_MARKER) {
            end_it = true;
        }
        #pragma unroll
        for(int s = (MEM_DELAY-1); s > 0; --s) {
            endmarker_reached[s] = endmarker_reached[s-1];
        }
        endmarker_reached[0] = end_it;
        if(endmarker_reached[MEM_DELAY-1]) {
            break;
        }
    }
}
Here I run into a problem. While the dependency is relaxed, I still get reduced performance, just not as badly as before. The optimization report now says "Successive iterations launched every 2 cycles...". It then gives the following detail over a hundred times: "Data dependency on variable, Largest Critical Path Contributor: 45%: Load Operation". While this is much better than before, it's still a massive waste of processing time, and it doesn't matter how high I set the constant MEM_DELAY, the issue remains. Another working solution would be the following:
__kernel void in_streamer(__global const uint2* in, uint n) {
    bool end = false;
    for(uint i = 0; i != n; ++i) {
        uint2 value = in[i];
        if(!end) {
            write_channel_altera(chan, value);
        }
        if(value.s0 == END_MARKER) {
            end = true;
        }
    }
}
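Modeled in plain C (again my hypothetical host-side sketch, with the channel replaced by a buffer and an added load_count parameter to make the wasted traffic visible), the downside of this variant is easy to see: every one of the n elements is loaded, and only the writes stop:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define END_MARKER 0x70000000u

/* Reads all n elements unconditionally; values after the marker are
 * discarded rather than forwarded. load_count reports how many loads
 * actually happened, illustrating the wasted memory traffic. */
size_t stream_discard_tail(const unsigned *in, size_t n,
                           unsigned *out, size_t *load_count) {
    bool end = false;
    size_t written = 0;
    *load_count = 0;
    for (size_t i = 0; i != n; ++i) {
        unsigned value = in[i];       /* load happens on every iteration */
        ++*load_count;
        if (!end)
            out[written++] = value;   /* forward until the marker is seen */
        if (value == END_MARKER)
            end = true;
    }
    return written;
}
```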
This works and both kernels are pipelined perfectly. The problem is that the input array is always read to the very end; the values are merely discarded after the end marker. Has anyone encountered a similar issue? I'd be very interested in where the delay comes from.