Altera_Forum
Honored Contributor
8 years ago

Forcing loop iterations to execute sequentially
Hi all,
I have some code with a nested for loop in a single work-item kernel, like so:

    __local float lmem[M][2]; // ping-pong buffer
    for (uint outer = 0; outer < N; ++outer) {
        __private uint wr_bank_sel = outer & 0x1;
        __private uint rd_bank_sel = !(outer & 0x1);
        for (uint inner = 0; inner < M; ++inner) {
            lmem[inner][wr_bank_sel] = lmem[inner][rd_bank_sel] * 10.0f / (inner + outer); // placeholder math op, but real dependencies
        }
    }

The code above worked in the emulator but failed in hardware. After much digging via printf statements, I discovered that the outer loop and inner loop were both executing out of order simultaneously, i.e.:

    [outer = 0, inner = 0] -> [outer = 1, inner = 0] -> [outer = 2, inner = 0] -> [outer = 0, inner = 1] -> [outer = 3, inner = 0] -> ...

when it should be:

    [outer = 0, inner = 0] -> [outer = 0, inner = 1] -> [outer = 0, inner = 2] -> ... -> [outer = 0, inner = M-1] -> [outer = 1, inner = 0] -> ...

I'm fine with the inner loop executing out of order, but the outer loop executing out of order at the same time obviously doesn't work with a ping-pong buffer strategy.
I've tried mem_fences, which don't appear to have any effect (I've tried CLK_GLOBAL_MEM_FENCE, CLK_LOCAL_MEM_FENCE, CLK_CHANNEL_MEM_FENCE, and combinations of those). What does seem to work is adding an unnecessary channel in my outer loop. The modification looks like the following:

    __local float lmem[M][2]; // ping-pong buffer
    for (uint outer = 0; outer < N; ++outer) {
        write_channel_intel(fake_channel, outer);                 // NEW
        mem_fence(CLK_CHANNEL_MEM_FENCE);                         // NEW
        const uint fake_outer = read_channel_intel(fake_channel); // NEW
        __private uint wr_bank_sel = fake_outer & 0x1;
        __private uint rd_bank_sel = !(fake_outer & 0x1);
        for (uint inner = 0; inner < M; ++inner) {
            lmem[inner][wr_bank_sel] = lmem[inner][rd_bank_sel] * 10.0f / (inner + fake_outer); // NEW (replaced outer with fake_outer)
        }
    }

In report.html I now see it report a serial execution dependency: "Iteration executed serially across BlockN. Only a single loop iteration will execute inside this region due to memory dependency". I think this is exactly what I want -- my outer loop executing serially. I've built and run this modified code and it seems to work. I've also found you can play with atomics to get the same message in report.html (though I have not yet built and tried that in hardware). Is there a better way?
I'm having a hard time believing Intel/Altera would not have considered this use case. I also imagine I'm incurring some performance penalty with these workarounds. Thanks in advance for your help. If nobody replies, I hope I've at least provided some workaround strategies for anyone who stumbles upon this problem in the future. This is with version 17.0 of the compiler.