report shows no DSP usage for OpenCL kernel

amrmesh
New Contributor
6 years ago
Hi,
Thank you very much for taking the time to test my code.
For functional correctness, the host code has a method to verify the result, but apparently that check also has a problem, since the kernel passed it.
I see your point, thanks. I will change the inner loop to see if that fixes the problem.
A quick test confirms your point: after removing the shift register, the report shows some DSP usage.
The strange thing is that the report contains no warning about this out-of-bounds access!
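A host-side check that is harder to fool can be sketched in plain C as follows. This is not the original host code (which is not shown in this thread); the names matvec_ref and count_mismatches are hypothetical, a single vector is assumed, and the row-major layout is taken from the way the kernel in this thread indexes matrix_a.

```c
#include <stddef.h>

/* Reference y = A * x for a row-major n x n matrix, accumulated in double
 * so the reference is less sensitive to summation order than the kernel. */
static void matvec_ref(const float *a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += (double)a[i * n + j] * (double)x[j];
        y[i] = (float)acc;
    }
}

/* Absolute value without pulling in the math library. */
static float abs_f(float v)
{
    return v < 0.0f ? -v : v;
}

/* Returns the number of elements whose error exceeds tol, relative to
 * max(1, |ref|). The !(err <= limit) form also counts NaN as a mismatch,
 * so an output buffer pre-filled with NaN exposes entries never written. */
static size_t count_mismatches(const float *got, const float *ref,
                               size_t n, float tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float mag = abs_f(ref[i]);
        float limit = tol * (mag > 1.0f ? mag : 1.0f);
        float err = abs_f(got[i] - ref[i]);
        if (!(err <= limit))
            bad++;
    }
    return bad;
}
```

Filling the matrix and vector with varied, non-zero values before the run makes it less likely that a kernel producing zeros or stale data passes the comparison by accident.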

New Contributor

6 years ago

Hello again,

Sorry, I don't understand why the compiler complains about an out-of-bounds access at line 67. Do you know which part causes it?

At first I thought I was mixing up the shift register size with the block size, but I don't see any problem there.

HRZ
Frequent Contributor
6 years ago
Sorry, I had to manually define BLOCK_SIZE and II_CYCLES to test your code and forgot to adjust the line numbers accordingly. You should subtract 3 from the line numbers I posted above to match your code. The second warning is on line 64 of your code, where "presult[e+1]" would be out of bounds for e=II_CYCLES.
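The off-by-one described above can be illustrated with a small plain C sketch. It assumes the shift register holds II_CYCLES + 1 elements, as in the kernel posted later in this thread; SR_SIZE, shift_left and shift_is_in_bounds are hypothetical names used only for illustration.

```c
#include <stddef.h>

#define II_CYCLES 16            /* value used in the kernel in this thread */
#define SR_SIZE (II_CYCLES + 1) /* hypothetical name for the array length */

/* Shift every element one slot towards index 0.
 * The body reads sr[e + 1], so the last valid e is SR_SIZE - 2,
 * which is II_CYCLES - 1. With e == II_CYCLES the read would be
 * sr[II_CYCLES + 1], one element past the end of the array. */
static void shift_left(float sr[SR_SIZE])
{
    for (size_t e = 0; e < II_CYCLES; e++)
        sr[e] = sr[e + 1];
}

/* For a loop "for (e = 0; e < loop_bound; e++) sr[e] = sr[e + 1];"
 * the highest index read is loop_bound, so it must be below array_len.
 * A loop that never runs reads nothing and is trivially in bounds. */
static int shift_is_in_bounds(size_t array_len, size_t loop_bound)
{
    return loop_bound == 0 || loop_bound < array_len;
}
```

With these definitions, a loop bound of II_CYCLES stays inside the array, while a bound of II_CYCLES + 1 reads one element too far.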

amrmesh

New Contributor

6 years ago

Thanks for your reply,

After fixing the out-of-bounds access, report generation takes a long time:

"aoc: Optimizing and doing static analysis of code..."

At this stage the compiler seems to be having a hard time optimizing the code. Do you know why this is happening? Is this long report generation time normal, or am I doing something wrong again?

#define II_CYCLES 16
#define BLOCK_SIZE 64

__kernel
void matvec(global float* restrict matrix_a,
	global float* restrict vectors_b,
	global float* restrict result,
	uint n,
	uint vec_count)
{

	float localmatrix[BLOCK_SIZE][BLOCK_SIZE];
	float localvector[BLOCK_SIZE];
	float localresult[BLOCK_SIZE];
	float presult[II_CYCLES+1];

	#pragma unroll 16
	for(uint e=0 ; e<n ; e++)
		result[e] = 0;

	//iteration over the rows of matrix blocks
	for(uint bi=0 ; bi<n ; bi+=BLOCK_SIZE)
	{

		//initializing the localresult
		#pragma unroll
		for(uint e=0 ; e<BLOCK_SIZE ; e++)
			localresult[e] = 0;

		//iteration over the columns of matrix blocks
		for(uint bj=0 ; bj<n ; bj+=BLOCK_SIZE)
		{
			//loading block of matrix to local
			#pragma unroll 16
			for(uint ei=0 ; ei<BLOCK_SIZE ; ei++)
				for(uint ej=0 ; ej<BLOCK_SIZE ; ej++)
					localmatrix[ei][ej] = matrix_a[(bi+ei)*n+(bj+ej)];

			//iteration over vectors
			for(uint k=0 ; k<vec_count ; k++)
			{
				//loading one block of one vector to local
				#pragma unroll 16
				for(uint e=0 ; e<BLOCK_SIZE ; e++)
					localvector[e] = vectors_b[k*n+(bj+e)];

				//***localresult[i] += localmatrix[i][j] * localvector[j]***
				//iteration over the rows of the block
				for(uint i=0 ; i<BLOCK_SIZE ; i++)
				{
					//initializing presult
					#pragma unroll
					for(uint e=0 ; e<II_CYCLES+1 ; e++)
						presult[e] = 0;

					//iteration over the columns of the block
					for(uint j=0 ; j<BLOCK_SIZE ; j++)
					{
						//localresult[i] += localmatrix[i][j] * localvector[j];
						presult[II_CYCLES] = presult[0] + localmatrix[i][j] * localvector[j];

						//shifting the register by one position
						#pragma unroll
						for(uint e = 0 ; e<II_CYCLES ; e++)
							presult[e] = presult[e+1];
					}
					//summing the partial results
					#pragma unroll
					for(uint e = 0 ; e<II_CYCLES ; e++)
						localresult[i] += presult[e];
				}
			}
		}

		//writing the block of result back to main memory
		#pragma unroll 16
		for(uint e=0 ; e<BLOCK_SIZE ; e++)
			result[bi+e] = localresult[e];

	}
}
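The shift-register accumulation used in the kernel above can be emulated on the host in plain C to check that it produces the same sum as a straightforward loop. This is only a sketch of the arithmetic, not of how the compiler maps it to hardware; dot_shift_register is a hypothetical name, and II_CYCLES reuses the value from the kernel.

```c
#include <stddef.h>

#define II_CYCLES 16 /* same value as in the kernel above */

/* Dot product accumulated through a shift register of II_CYCLES + 1 slots.
 * Each product is added to the value at the head, which was written
 * II_CYCLES iterations earlier, so consecutive iterations never depend
 * on each other's result. */
static float dot_shift_register(const float *a, const float *b, size_t n)
{
    float sr[II_CYCLES + 1] = {0.0f};

    for (size_t j = 0; j < n; j++) {
        /* New partial sum enters at the tail. */
        sr[II_CYCLES] = sr[0] + a[j] * b[j];

        /* Shift by one; reads stop at sr[II_CYCLES], the last element. */
        for (size_t e = 0; e < II_CYCLES; e++)
            sr[e] = sr[e + 1];
    }

    /* After the last shift, sr[0] .. sr[II_CYCLES - 1] hold the partial
     * sums; sr[II_CYCLES] is a stale copy of sr[II_CYCLES - 1] and is
     * deliberately left out of the final reduction. */
    float total = 0.0f;
    for (size_t e = 0; e < II_CYCLES; e++)
        total += sr[e];
    return total;
}
```

Because the partial sums are combined in a different order than in a simple running sum, results for general floating-point data can differ in the last bits, so a host comparison would normally use a tolerance rather than exact equality.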

HRZ
Frequent Contributor
6 years ago
I just compiled your new code with the v19.3 compiler in my environment, targeting Arria 10; the OpenCL compilation finished in less than a minute and the report was generated. How long does it take on your side? I have had cases where the OpenCL compilation took more than half an hour, but only for some very specific kernels. The compilation time also depends on your processor speed.
P.S. What FPGA are you targeting and what compiler version are you using?
