Forum Discussion
This happens with 19.2 as well, or at least it returns the same warning message and disables the Hyperflex optimization.
Regarding your suggestion of using vector variables (or custom data types): I can see that this will work, but it makes it harder to handle the case in which the matrix dimensions are not a multiple of the vector width.
I stand corrected; it is actually 19.3+ that supports the Hyperflex optimization for non-aligned LSUs. However, both 19.3 and 19.4 generate a new warning on your code, as follows:
warning: test.cl:21:21: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
warning: test.cl:34:13: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering

It seems they don't like the unrolled channel operation, even though it doesn't create multiple call sites. A quick transformation as follows can solve this issue:
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define BLOCK_ROWS 8
#define BLOCK_COLS 8

typedef struct
{
    float data[BLOCK_COLS];
} CHAN_WIDTH;

channel CHAN_WIDTH channel_A __attribute__((depth(32)));

__kernel void readA(__global volatile const float * restrict A, const unsigned int N, const unsigned int M)
{
    const uint BlocksN = 1 + (int)((N - 1) / BLOCK_ROWS);
    const uint BlocksM = 1 + (int)((M - 1) / BLOCK_COLS);
    for (int bi = 0; bi < BlocksN; bi++) {
        for (int bj = 0; bj < BlocksM; bj++) {
            for (int i = 0; i < BLOCK_ROWS; i++) {
                // reading offset
                const int offset = (bi * BLOCK_ROWS + i) * M + bj * BLOCK_COLS;
                // dummy offset that will leave hyper opt ON
                // const int offset = 0;
                CHAN_WIDTH temp;
                #pragma unroll
                for (int j = 0; j < BLOCK_COLS; j++) {
                    temp.data[j] = A[offset + j];
                }
                write_channel_intel(channel_A, temp);
            }
        }
    }
}

__kernel void dummy(__global volatile float * restrict A, const unsigned int N)
{
    for (int i = 0; i < N; i++) {
        CHAN_WIDTH temp;
        temp = read_channel_intel(channel_A);
        #pragma unroll
        for (int j = 0; j < BLOCK_COLS; j++) {
            A[i] = temp.data[j];
        }
    }
}

i.e., instead of using multiple narrow channels, use a single wide channel so that the channel operation can be moved outside of the unrolled loop.
>but this will render more difficult to handle the case in which the sizes of the matrix are not a multiple of the used vector data type.
Indeed, you will have to manually pad the input row by row on the host if the row size is not a multiple of the vector size, but implementing this is not too difficult, and there is no other way to get fully aligned accesses anyway. You absolutely want to avoid unaligned accesses on Intel FPGAs, since they will kill your memory performance (reference: https://arxiv.org/abs/1910.06726).