Improvement of self-written OpenCL-Funktion (GaussianBlur)

Question

Hello, I have implemented a Gaussian Filter on the FPGA (Cyclone V SoC) using OpenCL which works ok (2.5 times faster as on the ARM) but i´m not quite sure if it´s optimal for the FPGA.

host-code:

...

status = clsetkernelarg(kernel,0,sizeof(cl_mem),&buffer_img); // Matrix which holds the Kernelparameters

status = clsetkernelarg(kernel,1,sizeof(cl_mem),&buffer_mask); // Matrix which holds an graysclae image,

status = clsetkernelarg(kernel,2,sizeof(cl_mem),&buffer_outputimg); // Matrix for output

status = clsetkernelarg(kernel,3,sizeof(int),&img.cols);

status = clsetkernelarg(kernel,4,sizeof(int),&maskwidth);

size_t globalWorkSize[2];

globalWorkSize[0] = output.cols;

globalWorkSize[1] = output.rows;

status = clEnqueueNDRangeKernel(cmdQueue,kernel,2,NULL, globalWorkSize, NULL,0, NULL,NULL);

...

kernel-code:

__kernel void convolve(__global uchar * input, __global float * mask, __global uchar * output,

const int inputWidth,const int maskWidth)

{

const int x = get_global_id(0);

const int y = get_global_id(1);

float sum = 0;

for (int r = 0; r < maskWidth; r++)

{

//Inkrementieren rowindex with picturewidth

const int idxrow = (y + r) * inputWidth + x;

for (int c = 0; c < maskWidth; c++)

{

//convolve

sum += mask[(r * maskWidth) + c] * input[idxrow + c];

}

output[y * get_global_size(0) + x] = sum;

}

Can someone tell me if and how it´s possible to improve the peroformance of the Gaussian Kernel on the FPGA?

Thanks :)

altera_forum · Answer

Comparing with the ARM core is probably not very conclusive since the ARM core is extremely slow.

The most obvious way to increase performance on the FPGA would be to unroll the loop on "c". Though since you are performing a floating-point reduction, you should either fully unroll that loop, or first optimize that loop to achieve an iteration interval of one by inferring a shift register as outlined in "Intel® FPGA SDK for OpenCL Best Practices Guide, 1.6.1.5 Removing Loop-Carried Dependency by Inferring Shift Registers" and then unroll it to achieve best performance.

You should consider fully reading Intel's programming and best practices guides since all the basic optimization techniques are covered there.

altera_forum · Answer

unrolling the loop on c improved the kernel. After reading the Best Practice Guide i tried to improve my GaussianBlur function like suggested in 1.6.1.5.  With help from the example in the guide and the exampleimplementation of an Sobel-Filter on: https://www.altera.com/support/support-resources/design-examples/design-software/opencl/sobel-filter.html   But my output of the kernel isn´t correct.  Kernel-Code: #define maskWidth 7
# define COLS 640
__kernel void gaussneu(global uchar * restrict frame_in, global uchar * restrict frame_out,
           const int iterations/*, const int COLS*/)
{
	// Filter coefficients
	float mask =
	{
		{ 0.0049, 0.0092, 0.0134, 0.0152, 0.0134, 0.0092, 0.0049},
		{ 0.0092, 0.0172, 0.0250, 0.0283, 0.0250, 0.0172, 0.0092},
		{ 0.0134, 0.0250, 0.0364, 0.0412, 0.0364, 0.0250, 0.0134},
		{ 0.0152, 0.0283, 0.0412, 0.0467, 0.0412, 0.0283, 0.0152},
		{ 0.0134, 0.0250, 0.0364, 0.0412, 0.0364, 0.0250, 0.0134},
		{ 0.0092, 0.0172, 0.0250, 0.0283, 0.0250, 0.0172, 0.0092},
		{ 0.0049, 0.0092, 0.0134, 0.0152, 0.0134, 0.0092, 0.0049},
	};	
	// Pixel buffer of 6 rows and 7 extra pixels
	int rows;
	// The initial iterations are used to initialize the pixel buffer.
	int count = -(6 * COLS + 7);
	while (count != iterations) 
	{
		// Each cycle, shift a new pixel into the buffer.
		// Unrolling this loop allows the compile to infer a shift register.
	# pragma unroll
		for (int i = COLS * 6 + 6; i &gt; 0; --i) 
		{
		    rows = rows;
		}
		rows = count &gt;= 0 ? frame_in : 0; // if count &gt;=0 -&gt; true rows = frame_in
		float sum = 0;
		// With these loops unrolled, one convolution can be computed every cycle.
	# pragma unroll
		for (int i = 0; i &lt; maskWidth; ++i) 
		{
		# pragma unroll
			for (int j = 0; j &lt; maskWidth; ++j) 
			{
				uchar pixel = rows;
				sum += mask*pixel;
			}
		}
		if (count &gt;= 0) {
		frame_out = sum;
		}
		count++;
    	}
}  Can someone tell me whats the problem?  Also I think there is an error on the sobel.cl file from the altera example in this part:  
         int temp = abs(x_dir) + abs(y_dir);
        unsigned int clamped;
        if (temp &gt; threshold) {
            clamped = 0xffffff;
        } else {
            clamped = 0;
        }
        if (count &gt;= 0) {
            frame_out = clamped;
        }  the computet value temp is never set, the output values only can take the values 0 or 0xffffffff

altera_forum · Answer

Edit @ question 1: my kernel code works, i did a mistake with handling the boarderproblem in the Host-Code :x. But i still think that the sobel.cl file has an error

altera_forum · Answer

--- Quote Start ---

Edit @ question 1: my kernel code works, i did a mistake with handling the boarderproblem in the Host-Code :x. But i still think that the sobel.cl file has an error

--- Quote End ---

They are just setting the output to either 0 or 0xffffff (max value). Since this code works based on threshold, anything above the threshold is seen as an edge and anything below is black. You can change the threshold value to obtain different images and choose the best threshold.

Forum Discussion

Improvement of self-written OpenCL-Funktion (GaussianBlur)

4 Replies

Recent Discussions

starting to learn FPGAs

qsys-generate outputs Info as Error

Timing analysis - long combinational path

Quartus Prime Lite 25.1 License Error - "Unable to checkout a license" (SALT_LICENSE_SERVER)

Regarding the issue of UFM not starting