Forum Discussion
This happens with 19.2 as well, or at least it returns the same warning message and disables the Hyperflex optimization.
Regarding your suggestion of using vector variables (or custom data types): I can see that this will work, but it makes it harder to handle the case in which the matrix dimensions are not a multiple of the vector width.
I stand corrected; it is actually 19.3+ that supports the Hyperflex optimization for non-aligned LSUs. However, both 19.3 and 19.4 generate a new warning on your code, as follows:
warning: test.cl:21:21: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
warning: test.cl:34:13: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering

It seems they don't like the unrolled channel operation, even though it doesn't create multiple call sites. A quick transformation as follows can solve this issue:
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define BLOCK_ROWS 8
#define BLOCK_COLS 8

typedef struct
{
    float data[BLOCK_COLS];
} CHAN_WIDTH;

channel CHAN_WIDTH channel_A __attribute__((depth(32)));

__kernel void readA(__global volatile const float * restrict A, const unsigned int N, const unsigned int M)
{
    const uint BlocksN = 1 + (int)((N - 1) / BLOCK_ROWS);
    const uint BlocksM = 1 + (int)((M - 1) / BLOCK_COLS);
    for (int bi = 0; bi < BlocksN; bi++) {
        for (int bj = 0; bj < BlocksM; bj++) {
            for (int i = 0; i < BLOCK_ROWS; i++) {
                // reading offset
                const int offset = (bi * BLOCK_ROWS + i) * M + bj * BLOCK_COLS;
                // dummy offset that will leave hyper opt ON
                // const int offset = 0;
                CHAN_WIDTH temp;
                #pragma unroll
                for (int j = 0; j < BLOCK_COLS; j++) {
                    temp.data[j] = A[offset + j];
                }
                write_channel_intel(channel_A, temp);
            }
        }
    }
}

__kernel void dummy(__global volatile float * restrict A, const unsigned int N)
{
    for (int i = 0; i < N; i++) {
        CHAN_WIDTH temp;
        temp = read_channel_intel(channel_A);
        #pragma unroll
        for (int j = 0; j < BLOCK_COLS; j++) {
            A[i] = temp.data[j];
        }
    }
}

i.e., instead of using multiple narrow channels, use a single wide channel so that the channel operation can be moved outside of the unrolled loop.
>but this will render more difficult to handle the case in which the sizes of the matrix are not a multiple of the used vector data type.
Indeed, you will have to manually pad the input row by row on the host if the row size is not a multiple of the vector size, but implementing this is not too difficult, and there is no other way to get fully aligned accesses anyway. You absolutely want to avoid unaligned accesses on Intel FPGAs, since they will kill your memory performance (reference: https://arxiv.org/abs/1910.06726).