Altera_Forum
Honored Contributor
8 years ago

Data dependency caused by conditional global memory read
Hi,
When I compile my code, the loop analysis in the generated report shows one loop with an initiation interval (II) of 4, which means it is not pipelined well.
for (uint loop_cnt = 0, w = 0, cp = 0; loop_cnt < COLS_PER_PE * TILE_SIZE; loop_cnt++) {
    // load data to local buffer
    uint x = (n * BLOCK_SIZE + cp * TILE_SIZE + w) % conv_dim1 + i;
    uint y = (n * BLOCK_SIZE + cp * TILE_SIZE + w) / conv_dim1 + j;
    if ((cp * COLS_PER_PE + w > BLOCK_SIZE - 1 - col_pad_size && n == (conv_dim1 * conv_dim2 + col_pad_size) / BLOCK_SIZE - 1) ||
        (x < pad_size || x > data_dim1 - pad_size - 1 || y < pad_size || y > data_dim2 - pad_size - 1)) {
        #pragma unroll
        for (uint v = 0; v < CVEC; v++) {
            data_double_buf = 0.0f;
        }
    }
    else {
        data_double_buf = input;
    }
    // load weight to local buffer
    if (cp * TILE_SIZE + w < BLOCK_SIZE - row_pad_size) {
        // For the first 2 convolutional layers
        if (conv_dim3 < BLOCK_SIZE) {
            weight_double_buf = weight;
        }
        // For the last convolutional layer
        else {
            weight_double_buf = weight;
        }
    }
    else {
        if (conv_dim3 < BLOCK_SIZE) {
            #pragma unroll
            for (uint v = 0; v < CVEC; v++) {
                weight_double_buf.vector = 0.0f;
            }
        }
    }
    // manual loop coalescing
    if (w == TILE_SIZE - 1) {
        cp += 1;
    }
    if (w == TILE_SIZE - 1) {
        w = 0;
    }
    else {
        w += 1;
    }
}
And here is the report for this loop:

Loop                  Pipelined  II  Bottleneck  Detail
Block7 (conv.cl:107)  Yes        4   II          Memory dependency

Block7:
  II bottleneck due to memory dependency between:
    Store Operation (conv.cl:122)
    Store Operation (conv.cl:122)
  Largest critical path contributor(s):
    36%: Store Operation (conv.cl:122)
    36%: Store Operation (conv.cl:122)
I don't see any data dependency here. If the compiler is inferring one incorrectly, does anyone know how to avoid this? (If I make "weight_double_buf" and "data_double_buf" plain floats, or remove the conditions, the II becomes 1.) Any advice would be greatly appreciated!

Lancer