ND-Range kernels vs SingleThread?

Occasional Contributor

7 years ago

Dear HRZ,

I have gone through the tips you gave me on advanced optimization techniques for HLS code, and I have applied some of them to the code I showed you before. This time I am writing the kernel as a single work-item (single-threaded) kernel. As a first step, I applied blocking to the code (based on what I have learned online). After that, since there was still a dependency, I applied another technique: interleaving the computation of different inputs (the inputs are basically iterations of the outermost loop). Here is my code:

#ifdef INT_PRECISION
#define DTYPE int
#elif SINGLE_PRECISION
#define DTYPE float
#elif DOUBLE_PRECISION
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#define DTYPE double
#endif
 
__kernel void S1119 (__global DTYPE* restrict AA,
                     __global const DTYPE* restrict BB,
                     const int lllX,
                     const int lllY)
{
	
	int exit = lllY / BLOCK_SIZE;
 
	for (int i = 0; i < exit; i+=4) {
 
		int i_real[4];
 
		i_real[0] = i*BLOCK_SIZE;
		i_real[1] = (i+1)*BLOCK_SIZE;
		i_real[2] = (i+2)*BLOCK_SIZE;
		i_real[3] = (i+3)*BLOCK_SIZE;
 
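		// On-chip buffers. They are declared outside the j loop because
		// AA_SR must keep its contents from one j iteration to the next.
		DTYPE BB_SR[BLOCK_SIZE][4];
		DTYPE AA_SR[BLOCK_SIZE][4];

		// start processing
		for (int j = 1; j < lllX; j++) {

			// Load row 0 of AA once per group of four blocks
			if (j == 1) {
				#pragma unroll
				for (int k = 0; k < BLOCK_SIZE; k++) {
					for (int ii = 0; ii < 4; ii++)
						AA_SR[k][ii] = AA[i_real[ii]+k];
				}
			}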
 
			#pragma ivdep
			for (int ii = 0; ii < 4; ii++){
	
				#pragma ivdep
				#pragma unroll
				for (int k = 0; k < BLOCK_SIZE; k++) {
					BB_SR[k][ii] = BB[j*lllY+k+i_real[ii]];
				}
		
				#pragma ivdep
				#pragma unroll UNROLL_FACTOR
				for (int k = 0; k < BLOCK_SIZE; k++) {
					AA_SR[k][ii] = AA_SR[k][ii] * BB_SR[k][ii];
				}
 
				#pragma unroll
				for (int k = 0; k < BLOCK_SIZE; k++) {
					AA[j*lllY+k+i_real[ii]] = AA_SR[k][ii];
				}
			}
		}
	
	}
}
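
For reference, this is the computation the kernel is meant to perform, written as a plain C loop nest that I could run on the host to check the output. It is only a sketch: the function name `s1119_reference` and the choice of `float` for the data type are assumptions for illustration, with `rows` standing for `lllX` and `cols` for `lllY`.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference (hypothetical helper, not part of the kernel):
 * aa[j][x] = aa[j-1][x] * bb[j][x] for every row j >= 1.
 * rows corresponds to lllX and cols to lllY in the kernel. */
void s1119_reference(float *aa, const float *bb, int rows, int cols)
{
    assert(rows >= 1 && cols >= 1);

    for (int j = 1; j < rows; j++) {
        for (int x = 0; x < cols; x++) {
            size_t cur  = (size_t)j * cols + x;        /* element (j, x)   */
            size_t prev = (size_t)(j - 1) * cols + x;  /* element (j-1, x) */
            aa[cur] = aa[prev] * bb[cur];
        }
    }
}
```

The dependency runs only along `j` (each row needs the previous row), while the columns are independent of each other, which is why I block and interleave over the columns.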

Now my question is this: the loop that does the computation cannot be fully unrolled; I can only partially unroll it (by the factor defined as UNROLL_FACTOR). Since I am interleaving the computation of several inputs, I don't understand why that happens. I would just like to know how the compiler handles this situation.
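
To make the question concrete, this is how I picture partial unrolling of the multiply loop, written by hand in plain C. It is a sketch of my understanding, not what the compiler actually generates; the values 16 and 4 and both function names are assumptions for illustration.

```c
#include <assert.h>

#define BLOCK_SIZE 16    /* assumed value, for illustration only */
#define UNROLL_FACTOR 4  /* assumed value; the body below is written for 4 */

/* Rolled form: one multiply per loop iteration. */
void multiply_rolled(float a[BLOCK_SIZE], const float b[BLOCK_SIZE])
{
    for (int k = 0; k < BLOCK_SIZE; k++)
        a[k] = a[k] * b[k];
}

/* Hand-unrolled by 4: the loop advances 4 elements per iteration and the
 * body is replicated 4 times, so BLOCK_SIZE / 4 iterations remain. */
void multiply_partially_unrolled(float a[BLOCK_SIZE], const float b[BLOCK_SIZE])
{
    assert(UNROLL_FACTOR == 4 && BLOCK_SIZE % UNROLL_FACTOR == 0);

    for (int k = 0; k < BLOCK_SIZE; k += UNROLL_FACTOR) {
        a[k + 0] = a[k + 0] * b[k + 0];
        a[k + 1] = a[k + 1] * b[k + 1];
        a[k + 2] = a[k + 2] * b[k + 2];
        a[k + 3] = a[k + 3] * b[k + 3];
    }
}
```

Both functions produce the same result; the second one just exposes four independent multiplies per iteration, which is what I expect the partially unrolled loop to look like in hardware.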

In addition, apart from rearranging the memory accesses, what other obvious optimizations can be applied to unroll further and achieve II=1?

Thanks
