Altera_Forum
Honored Contributor
11 years ago
Thank you so much for the help! I am just wondering: what if the number of iterations of the for loop cannot be determined at compile time (either the operation is conditional, or the size N changes between kernel invocations)? Am I forced to access local memory sequentially, or is there another optimization that can be done?
--- Quote Start ---
Ok, I should point out that for this to work, unrolling the column loop is key; this creates consecutive accesses that the compiler can merge, e.g.

    for(row = 0; row < N; row++) {
        #pragma unroll
        for(col = 0; col < 4; col++) {
            A[row][col] = row + col;
        }
    }

This essentially creates:

    for(row = 0; row < N; row++) {
        A[row][0] = row;
        A[row][1] = row + 1;
        A[row][2] = row + 2;
        A[row][3] = row + 3;
    }

which gets translated to something like this:

    for(row = 0; row < N; row++) {
        A[row][0] = (int4)( row, row+1, row+2, row+3 ); // very efficient wide access
    }

I think this is mentioned in the best practices guide document.
--- Quote End ---
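To make the quoted transformation concrete, here is a minimal sketch in plain host-side C (not OpenCL kernel code) that checks the three forms write the same data. The function names, `COLS`, and `MAX_ROWS` are hypothetical, and the `memcpy` of four ints only stands in for the `int4` wide store; whether a given compiler actually merges the accesses is not something this sketch can show. Note that only the column trip count is fixed at compile time, while the row count is passed in at run time, as N is in the quoted loop.

```c
#include <assert.h>
#include <string.h>

#define COLS 4        /* fixed trip count, so the column loop can be fully unrolled */
#define MAX_ROWS 64   /* hypothetical upper bound used only by the comparison helper */

/* Rolled form: the column loop as written in the quote. */
static void fill_rolled(int a[][COLS], int n_rows) {
    for (int row = 0; row < n_rows; row++) {
        for (int col = 0; col < COLS; col++) {
            a[row][col] = row + col;
        }
    }
}

/* Hand-unrolled form: four stores to consecutive addresses per row. */
static void fill_unrolled(int a[][COLS], int n_rows) {
    for (int row = 0; row < n_rows; row++) {
        a[row][0] = row;
        a[row][1] = row + 1;
        a[row][2] = row + 2;
        a[row][3] = row + 3;
    }
}

/* Wide form: one 4-int copy per row, standing in for an int4 store. */
static void fill_wide(int a[][COLS], int n_rows) {
    for (int row = 0; row < n_rows; row++) {
        const int wide[COLS] = { row, row + 1, row + 2, row + 3 };
        memcpy(a[row], wide, sizeof wide);
    }
}

/* Returns 1 when all three forms produce identical arrays for n_rows rows. */
static int forms_agree(int n_rows) {
    static int a[MAX_ROWS][COLS], b[MAX_ROWS][COLS], c[MAX_ROWS][COLS];
    assert(n_rows >= 0 && n_rows <= MAX_ROWS);
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    memset(c, 0, sizeof c);
    fill_rolled(a, n_rows);
    fill_unrolled(b, n_rows);
    fill_wide(c, n_rows);
    return memcmp(a, b, sizeof a) == 0 && memcmp(a, c, sizeof a) == 0;
}
```

Calling `forms_agree(n)` with any run-time `n` from 0 to `MAX_ROWS` compares the rolled, unrolled, and wide versions; the row count never needs to be known at compile time for the three to match.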