Forum Discussion

amrmesh's avatar
amrmesh
Icon for New Contributor rankNew Contributor
6 years ago

report shows no DSP usage for OpenCL kernel

Hello, I'm new to OpenCL for fpgas, I wrote this code for matrix and vector multiplication, after compiling with aoc 19.3, the report shows that no DSP is used for calculation of the kernel, anyone know what I'm doing wrong?

__kernel
void matvec(global float* restrict matrix_a,
	global float* restrict vectors_b,
	global float* restrict result,
	uint n,
	uint vec_count)
{
 
	float localmatrix[BLOCK_SIZE][BLOCK_SIZE];
	float localvector[BLOCK_SIZE];
	float localresult[BLOCK_SIZE];
	float presult[II_CYCLES+1];
 
	#pragma unroll 16
	for(uint e=0 ; e<n ; e++)
		result[e] = 0;
 
	//iteration over matrix blocks rows
	uint bi = 0;
	for(uint bi=0 ; bi<n ; bi+=BLOCK_SIZE)
	{
 
		//initializing the localresult
		#pragma unroll
		for(uint e=0 ; e<n ; e++)
			localresult[e] = 0;
 
		//iteration over matrix blocks colomns
		uint bj = 0;
		for(uint bj=0 ; bj<n ; bj+=BLOCK_SIZE)
		{
			//loading block of matrix to local
			#pragma unroll 16
			for(uint ei=0 ; ei<BLOCK_SIZE ; ei++)
				for(uint ej=0 ; ej<BLOCK_SIZE ; ej++)
					localmatrix[ei][ej] = matrix_a[(bi+ei)*n+(bj+ej)];
 
			//itteration over vectors
			for(uint k=0 ; k<vec_count ; k++)
			{
				//loading one block of one vector to local
				#pragma unroll 16
				for(uint e=0 ; e<BLOCK_SIZE ; e++)
					localvector[e] = vectors_b[k*n+(bj+e)];
		
 
				//***localresult[j] += localmatrix[i][j] * localvector[j]***
				//iteration over matrix colomns
				for(uint i=0 ; i<BLOCK_SIZE ; i++)
				{
					//initializing presult
					#pragma unroll
					for(uint e=0 ; e<II_CYCLES+1 ; e++)
						presult[e] = 0; 					
 
					//iteration over matrix rows
					for(uint j=0 ; j<BLOCK_SIZE ; j++)
					{
						//localresult[i] += localmatrix[i][j] * localvector[j];
						presult[II_CYCLES] = localmatrix[i][j] * localvector[j] + presult[0];
 
						#pragma unroll
						for(uint e = 0 ; e<II_CYCLES+1 ; e++)
							presult[e] = presult[e+1];
					}
					#pragma unroll
					for(uint e=0 ; e<II_CYCLES+1 ; e++)
						localresult[i] += presult[e];
				}
			}
		}
 
		//Writing the block of result back to main memory
		#pragma unroll 16
		for(uint e=0 ; e<BLOCK_SIZE ; e++)
			result [bi+e] = localresult[e];
 
	}
}

10 Replies

  • MEIYAN_L_Intel's avatar
    MEIYAN_L_Intel
    Icon for Frequent Contributor rankFrequent Contributor

    Hi,

    I would need some time to check internally about the information of DSP block implementation.

    Thanks

  • HRZ's avatar
    HRZ
    Icon for Frequent Contributor rankFrequent Contributor

    Have you verified the functional correctness of your code using emulation? It seems the v19.3 compiler is optimizing out most of your code. However, v16.1.2 which I still use for my main development behaves differently and does not optimize out your code but it generates very helpful warnings that could help finding the problem in your code:

    test.cl:28: Compiler Warning: Full unrolling of the loop is requested but the loop bounds cannot be determined. The loop is not unrolled in kernel matvec
    test.cl:67: Compiler Warning: removing out-of-bounds accesses to presult

    Even thought v19.3 also generates the first warning, it does not generate the second one which could in fact be the source of your problem. It is possible that v19.3 is assigning a value of zero to the out-of-bound index and since you are shifting the buffer, it is assuming the whole buffer is being zeroed out and hence, it is optimizing out the computation in your kernel. Maybe @MeiYanL_Intel​ can elaborate why the newer versions of the compiler are excluding such critical warnings, forcing programmers to run in circles trying to discover issues in their code.

    • amrmesh's avatar
      amrmesh
      Icon for New Contributor rankNew Contributor

      Hi,

      Thank you very much for your time testing my code,

      for functional correctness, the host code has a method to verify it, but apparently, that code also has some problems since it passed that test.

      I got your point, and thanks, I will change the inner loop to see if the it will fix the problem,

      a fast test (removing the shift register) proves your point is correct, by removing shift register, the report shows some DSP usage,

      The strange thing is that even in the report there is no warning about this out-of-bound access !

    • amrmesh's avatar
      amrmesh
      Icon for New Contributor rankNew Contributor

      Hello again,

      Sorry, I don't understand why at line 67 the compiler complains about out-of-bounds, do you know which part causes out-of-bounds access?

      at first I thought I'm mixing shift register size with block size, but I don't see any problem with that, do you know which part causing it?

      • HRZ's avatar
        HRZ
        Icon for Frequent Contributor rankFrequent Contributor

        Sorry, I had to manually define BLOCK_SIZE and II_CYCLES to test your code and forgot to adjust the line numbers accordingly. You should deduct 3 from the line numbers I posted above to match your code. The second warning is on line 64 on your code where "presult[e+1]" would be out of bounds for e=II_CYCLES.