Honored Contributor

8 years ago

Why is FPGA OpenCL matrix multiplication better than the CPU?

Hello, Altera Forum geniuses!

I have a question!

Today I wrote a matrix multiplication kernel (.cl).

There are three versions of the code below.

Version 1: no SIMD

__kernel void simpleMultiply(
    __global float* A,
    __global float* B,
    __global float* C,
    int wA, int wB)
{
    int row = get_global_id(0);
    int col = get_global_id(1);
    float sum = 0.0f;
    for (int i = 0; i < wA; i++)
    {
        sum += A[row*wA + i] * B[i*wB + col];
    }
    C[row*wB + col] = sum;
}

Version 2: SIMD

__attribute__((num_simd_work_items(2)))
__attribute__((reqd_work_group_size(2,1,1)))
__kernel void simpleMultiply(
    __global float* A,
    __global float* B,
    __global float* C,
    int wA, int wB)
{
    int row = get_global_id(0);
    int col = get_global_id(1);
    float sum = 0.0f;
    for (int i = 0; i < wA; i++)
    {
        sum += A[row*wA + i] * B[i*wB + col];
    }
    C[row*wB + col] = sum;
}

Version 3: SIMD plus two compute units

__attribute__((num_compute_units(2)))
__attribute__((num_simd_work_items(2)))
__attribute__((reqd_work_group_size(2,1,1)))
__kernel void simpleMultiply(
    __global float* A,
    __global float* B,
    __global float* C,
    int wA, int wB)
{
    int row = get_global_id(0);
    int col = get_global_id(1);
    float sum = 0.0f;
    for (int i = 0; i < wA; i++)
    {
        sum += A[row*wA + i] * B[i*wB + col];
    }
    C[row*wB + col] = sum;
}

I ran the matrix multiplication with each of these kernels.

And I found something strange.

All three versions take the same time.

Mat_A = (576 x 26)

Mat_B = (26 x 6)

Mat_C = Mat_A x Mat_B

This operation takes 1440 to 1550 us.
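To put that timing in perspective, here is a small plain-C sketch that counts the work in this problem. It is not part of my kernel: the helper names (work_items, total_macs, mmacs_per_second) are made up for illustration, and the 1500 us figure is just a value from the middle of my measured range.

```c
#include <assert.h>
#include <stddef.h>

/* Dimensions from the post: A is 576 x 26, B is 26 x 6. */
enum { HA = 576, WA = 26, WB = 6 };

/* One work-item per element of C, as in the kernels above. */
static size_t work_items(size_t hA, size_t wB)
{
    return hA * wB;
}

/* Each work-item runs the inner loop wA times (one multiply-add per pass). */
static size_t total_macs(size_t hA, size_t wA, size_t wB)
{
    return hA * wB * wA;
}

/* Effective rate in million multiply-adds per second.
   MACs per microsecond is numerically the same as million MACs per second. */
static double mmacs_per_second(size_t macs, double micros)
{
    assert(micros > 0.0);
    return (double)macs / micros;
}
```

For these sizes that is 3456 work-items and 89856 multiply-adds in total, or roughly 60 million multiply-adds per second at 1500 us. With so little arithmetic, fixed costs such as kernel launch or data transfer could plausibly account for much of the time, but I have not measured that.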

Why does the execution time not change?

Also,

someone told me: "If you use only global memory, the memory access overhead is large.

So if you want better performance, use local memory."

But I use only global memory (because I did not understand how to use local memory in a 2D multiplication).
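As far as I understand the local memory idea, it is a tiling scheme: copy a small block of A and a small block of B into fast memory once, then reuse them for several multiply-adds. Below is a minimal plain-C sketch of that tiling only, not an OpenCL kernel. The name tiled_multiply and the tile size TILE = 2 are assumptions for illustration, and the arrays tileA and tileB stand in for what would be __local buffers in a real kernel.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 2  /* hypothetical tile size; must divide hA, wA and wB */

/* Blocked C = A x B, row-major, same indexing as the kernels above.
   A is hA x wA, B is wA x wB, C is hA x wB. */
static void tiled_multiply(const float *A, const float *B, float *C,
                           int hA, int wA, int wB)
{
    assert(hA % TILE == 0 && wA % TILE == 0 && wB % TILE == 0);

    for (int r0 = 0; r0 < hA; r0 += TILE) {
        for (int c0 = 0; c0 < wB; c0 += TILE) {
            float acc[TILE][TILE] = {{0.0f}};  /* partial sums for one tile of C */

            for (int k0 = 0; k0 < wA; k0 += TILE) {
                /* Stand-ins for __local buffers: each element is loaded once. */
                float tileA[TILE][TILE], tileB[TILE][TILE];
                for (int i = 0; i < TILE; i++) {
                    for (int j = 0; j < TILE; j++) {
                        tileA[i][j] = A[(r0 + i) * wA + (k0 + j)];
                        tileB[i][j] = B[(k0 + i) * wB + (c0 + j)];
                    }
                }
                /* Every loaded value is reused TILE times from the fast copy. */
                for (int i = 0; i < TILE; i++)
                    for (int j = 0; j < TILE; j++)
                        for (int k = 0; k < TILE; k++)
                            acc[i][j] += tileA[i][k] * tileB[k][j];
            }

            for (int i = 0; i < TILE; i++)
                for (int j = 0; j < TILE; j++)
                    C[(r0 + i) * wB + (c0 + j)] = acc[i][j];
        }
    }
}
```

My sizes (576, 26 and 6) are all divisible by 2, so this tile size would fit them. In an OpenCL version, I assume each work-group would handle one tile of C and would need a barrier between the load step and the compute step.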

Nevertheless, my SoC board performs better than using only the CPU, which is the DE1-SoC's ARM.

Now I have some questions.

First!

Why do the three versions take the same time?

Is the SIMD code actually working?

And

are the two compute units actually working?

If they are not working, how can I make them work?

Second!

How can I get better performance?

Third!

Could you recommend a document that is better than the OpenCL Best Practices Guide?

I look forward to your answers.

Thank you so much!

Have a nice day!
