Minimum II of 2 but HTML report has no further information

Question

I have a single work-item kernel that uses local memory as a ping-pong buffer, in a form similar to the following:

local float __attribute__((bankwidth(4),
                           numreadports(2),
                           numwriteports(2),
                           doublepump,
                           bank_bits(2,1,0))) mem[1024][4][2];

for (uint outer_outer = 0; outer_outer < 8; ++outer_outer)
{
    // some integer add, sub, and shifts that are used to help compute x_idx, y_idx later
    float x_pipe[4];
    float y_pipe[4];
    uint x_idx_pipe[4];
    uint y_idx_pipe[4];

    for (uint outer = 0; outer < 8; ++outer)
    {
        uint x_idx, y_idx;
        // compute x_idx and y_idx using integer adds, subs, and shifts

        #pragma unroll
        for (uint inner = 0; inner < 4; ++inner)
        {
            float x_fetched = mem[x_idx][inner][0];
            float y_fetched = mem[y_idx][inner][0];

            mem[x_idx_pipe[0]][inner][1] = x_pipe[0];
            mem[y_idx_pipe[0]][inner][1] = y_pipe[0];

            // shift register statements + computations on x and y
            x_pipe[3] = x_fetched;
            y_pipe[3] = y_fetched;
            x_idx_pipe[3] = x_idx;
            y_idx_pipe[3] = y_idx;
        }
    }
}

The compiler seems to parallelize the inner loop correctly, but the II of the 'outer' loop is 2. Unfortunately, the Loop Analysis section of the HTML report gives no additional information about the limiting factor. Does anyone here know what it means when the HTML report doesn't say what is limiting the II? Does it mean the control logic is the cause, and hence there's nothing I can do?

I've tried forcing it with #pragma ii 1, but the compilation fails. In the system view, the two store operations appear sequential (the second depends on the first), but I'm unsure whether this is just a display artifact (i.e., the system view doesn't show that doublepump allows the stores to proceed in parallel).

altera_forum · Answer

In nested loops, the II of the outer loop will be two. To get an II of one on the outer loop, the exit conditions of both the inner loop and the outer loop would have to be evaluated in a single cycle, which creates a very long critical path and significantly reduces the operating frequency. This does not necessarily result in lower performance; however, you can manually merge your loops into one to achieve an II of one. I believe the report shows a note about this at the bottom when you click the line with the II information, but I don't remember exactly.
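The manual loop merge suggested above can be sketched in plain C as follows. This is only an illustration under assumptions: the function names, the `visited` array, and the 8x8 trip counts (borrowed from the question's loop bounds) are hypothetical, and which loops are worth merging in the real kernel depends on what the report shows. The inner loop's exit test is replaced by a manually updated counter inside a single loop.

```c
#include <assert.h>

#define N_OUTER_OUTER 8u
#define N_OUTER       8u

/* Nested form: records the (outer_outer, outer) pair visited at each step. */
static void nested(unsigned visited[][2])
{
    unsigned n = 0;
    for (unsigned oo = 0; oo < N_OUTER_OUTER; ++oo) {
        for (unsigned o = 0; o < N_OUTER; ++o) {
            visited[n][0] = oo;
            visited[n][1] = o;
            ++n;
        }
    }
}

/* Coalesced form: one loop with a single exit condition. The two
 * original indices are tracked by hand instead of by nested loops. */
static void coalesced(unsigned visited[][2])
{
    unsigned oo = 0, o = 0;
    for (unsigned i = 0; i < N_OUTER_OUTER * N_OUTER; ++i) {
        /* Work that used to run once per outer_outer iteration
         * would be guarded by (o == 0) here. */
        visited[i][0] = oo;
        visited[i][1] = o;

        /* Manual counter update replaces the inner loop's exit test. */
        if (++o == N_OUTER) {
            o = 0;
            ++oo;
        }
    }
    /* After all iterations the counters must have wrapped exactly. */
    assert(oo == N_OUTER_OUTER && o == 0);
}

/* Returns 1 if both forms visit the same index pairs in the same order. */
static int same_order(void)
{
    unsigned a[N_OUTER_OUTER * N_OUTER][2];
    unsigned b[N_OUTER_OUTER * N_OUTER][2];
    nested(a);
    coalesced(b);
    for (unsigned i = 0; i < N_OUTER_OUTER * N_OUTER; ++i) {
        if (a[i][0] != b[i][0] || a[i][1] != b[i][1])
            return 0;
    }
    return 1;
}
```

Calling `same_order()` checks that the merged loop walks the index pairs in the same sequence as the nested one, so the loop body can be moved over unchanged.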

altera_forum · Answer

HRZ,

Thanks for the insight. I had not considered nested loops, and experimenting with that produced something interesting. I removed the inner loop and did the calculations on just one bank to test things out. I also unrolled a for loop I had used to implement the shift registers (idx_pipes, etc.) so that there are no nested loops inside the 'outer' loop. The tool still shows an II of 2 on the 'outer' loop, but now it provides more information: it reports a store dependency on those two lines. I wouldn't expect that behavior because it's a doublepumped memory (2 write ports, 2 read ports).

Do you have any advice on that?

EDIT: This is with version 17.0.

altera_forum · Answer

Please post your new code. You seem to be using indirect addressing on the local buffer; this is very likely not a good idea. Double-pumping memory should not affect load/store dependencies.

altera_forum · Answer

I could see double pumping not affecting load/store dependencies, but from an II perspective I think it should matter. Here's what I'm thinking; please let me know if you disagree. Suppose I do two writes per loop iteration at clock rate 'clk', and my memory is doublepumped so that it operates at 'clk2x'. Then the first write is performed on the first cycle of clk2x and the second write on the second cycle of clk2x. The writes are performed in order, and within one cycle of 'clk'.

Also, do you have any insight into why indirect addressing is bad in OpenCL? Is it just Altera preventing anyone from accidentally causing write collisions?

local float __attribute__((bankwidth(4),
                           numreadports(2),
                           numwriteports(2),
                           doublepump,
                           bank_bits(2,1,0))) mem[1024][4][2];

for (uint outer_outer = 0; outer_outer < 8; ++outer_outer)
{
    // some integer add, sub, and shifts that are used to help compute x_idx, y_idx later
    float x_pipe[4];
    float y_pipe[4];
    uint x_idx_pipe[4];
    uint y_idx_pipe[4];

    for (uint outer = 0; outer < 8; ++outer)
    {
        uint x_idx, y_idx;
        // compute x_idx and y_idx using integer adds, subs, and shifts

        float x_fetched = mem[x_idx][0][0];
        float y_fetched = mem[y_idx][0][0];

        mem[x_idx_pipe[0]][0][1] = x_pipe[0];
        mem[y_idx_pipe[0]][0][1] = y_pipe[0];

        // manually coded shift register statements (i.e. no for loop) + computations on x and y
        x_pipe[3] = x_fetched;
        y_pipe[3] = y_fetched;
        x_idx_pipe[3] = x_idx;
        y_idx_pipe[3] = y_idx;
    }
}

altera_forum · Answer

HRZ, you are correct that indirect addressing is causing it. If I index with constants, the II drops to 1. I'm not sure I understand why indirect addressing is such a problem, though...
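One possible explanation, which the thread itself does not confirm, is aliasing: when both store indices are only known at run time, a compiler cannot prove that the two stores hit different addresses, so it has to preserve their order. The plain C sketch below illustrates only that general point. The function names, the 8-element array, and the stored values are hypothetical and do not model the Altera compiler or the banked local memory.

```c
#include <assert.h>

#define MEM_SIZE 8u

/* Two stores in program order, as in the kernel's 'outer' loop body.
 * Returns the value left at mem[x_idx]. */
static float stores_in_order(unsigned x_idx, unsigned y_idx)
{
    float mem[MEM_SIZE] = {0};
    assert(x_idx < MEM_SIZE && y_idx < MEM_SIZE);
    mem[x_idx] = 1.0f;  /* first store  */
    mem[y_idx] = 2.0f;  /* second store */
    return mem[x_idx];
}

/* The same two stores issued in the opposite order. */
static float stores_swapped(unsigned x_idx, unsigned y_idx)
{
    float mem[MEM_SIZE] = {0};
    assert(x_idx < MEM_SIZE && y_idx < MEM_SIZE);
    mem[y_idx] = 2.0f;
    mem[x_idx] = 1.0f;
    return mem[x_idx];
}
```

With distinct indices, for example `(3, 5)`, both functions leave the same value at `mem[x_idx]`, so the stores could be reordered or overlapped freely. With equal indices, for example `(3, 3)`, the results differ, so the order matters. Constant indices let the compiler rule out the second case at compile time, which is consistent with the II dropping to 1 when constants are used.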
