Dear HRZ,
I have gone through the tips you gave me on advanced optimization techniques for HLS code, and I have applied some of them to the code I sent you earlier. This time I am writing the kernel as a single-threaded (single work-item) kernel. As a first step, I blocked the code (based on what I have read online). After that, because there was still a dependency, I applied another technique that interleaves the computation of several inputs (the inputs are essentially iterations of the outermost loop). Here is the code I developed:
#ifdef INT_PRECISION
#define DTYPE int
#elif SINGLE_PRECISION
#define DTYPE float
#elif DOUBLE_PRECISION
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#define DTYPE double
#endif
__kernel void S1119 (__global DTYPE* restrict AA,
                     __global const DTYPE* restrict BB,
                     const int lllX,
                     const int lllY)
{
    int exit = lllY / BLOCK_SIZE;
    for (int i = 0; i < exit; i+=4) {
        int i_real[4];
        i_real[0] = i*BLOCK_SIZE;
        i_real[1] = (i+1)*BLOCK_SIZE;
        i_real[2] = (i+2)*BLOCK_SIZE;
        i_real[3] = (i+3)*BLOCK_SIZE;
        // start processing
        for (int j = 1; j < lllX; j++) {
            DTYPE BB_SR[BLOCK_SIZE][4];
            DTYPE AA_SR[BLOCK_SIZE][4];
            if (j == 1) {
                #pragma unroll
                for (int k = 0; k < BLOCK_SIZE; k++) {
                    for (int ii = 0; ii < 4; ii++)
                        AA_SR[k][ii] = AA[i_real[ii]+k];
                }
            }
            #pragma ivdep
            for (int ii = 0; ii < 4; ii++){
                #pragma ivdep
                #pragma unroll
                for (int k = 0; k < BLOCK_SIZE; k++) {
                    BB_SR[k][ii] = BB[j*lllY+k+i_real[ii]];
                }
                #pragma ivdep
                #pragma unroll UNROLL_FACTOR
                for (int k = 0; k < BLOCK_SIZE; k++) {
                    AA_SR[k][ii] = AA_SR[k][ii] * BB_SR[k][ii];
                }
                #pragma unroll
                for (int k = 0; k < BLOCK_SIZE; k++) {
                    AA[j*lllY+k+i_real[ii]] = AA_SR[k][ii];
                }
            }
        }
    }
}
Now my question: the block where I do the computation cannot be fully unrolled; I can only unroll it partially (by the factor defined as UNROLL_FACTOR). Since I am interleaving the computation of several inputs, I don't understand why this happens. I would like to know how the compiler handles this situation.
In addition, apart from rearranging the memory accesses, what other obvious optimizations can be applied to allow more unrolling and achieve II=1?
Thanks