Altera_Forum
Honored Contributor
8 years ago
Unexpected performance results of OpenCL kernel execution
Hi,
I have a simple dummy kernel that I use for benchmarking. Below is my OpenCL kernel:
__attribute__((num_compute_units(1)))
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void WGS256MAPI16LL1048576(const __global float *GIn, __global float *GOut, const int M, const int N, const int P) {
    const int XGL = get_global_id(0);
    const int XGRid = get_group_id(0);
    const int XGRnum = get_num_groups(0);
    const int XLSize = get_local_size(0);
    const int XLid = get_local_id(0);
    // Just a private variable
    float temp = 1.0;
    // Start of a new level of for loop
    long baseIndex1 = XGRid*XLSize*16+XLid;
    temp += temp * M;
    temp += temp * M;
    temp += temp * M;
    temp += temp * M;
    temp += temp * M;
    temp += temp * M;
    temp += temp * M;
    temp += temp * M;
    ...
    temp += temp * M;
    // Index assumed to be baseIndex1; as posted, the line read `GOut = temp;`
    GOut[baseIndex1] = temp;
}
As the code shows, my kernel does nothing special, and I have also removed any DRAM access. The kernel performs 1024 "fma" operations and saves the result to memory, just to make sure the computation is not optimized out by the compiler. I deploy this kernel as an NDRange on a Nallatech 385A, and it was compiled with the Altera 16.0 compiler. Now I would like to calculate the performance of my kernel in floating-point operations per second (GFLOPS). Here is the section of my host code that launches the kernel and measures its execution time:
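// Launch the kernel as an NDRange and attach an event for timing
Event evKernel (algorithm->getKernelName ());
err = clEnqueueNDRangeKernel (queue, kernel, algorithm->getWorkDim(),
                              NULL,
                              globalWorkSize,
                              localMemSize,   // passed as the local work size argument
                              0, NULL, &evKernel.CLEvent());
CL_CHECK_ERROR (err);
// Block until the kernel has finished, then collect the event's timing info
err = clWaitForEvents (1, &evKernel.CLEvent());
CL_CHECK_ERROR (err);
evKernel.FillTimingInfo ();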
As shown, it logs the beginning and the end of the execution. I also make sure to wait for the kernel to finish before logging the time. With the configuration above, the FPGA delivers around 2451 GFLOPS, which far exceeds the theoretical peak performance of 1.5 TFLOPS. I have checked the generated Verilog file and confirmed that all FMA operations have been generated. Now my question is: am I doing anything wrong to get this nonsensical performance number?
Thanks,
Saman