Altera_Forum
Honored Contributor
8 years ago
Unexplained performance difference for same kernels
I have developed two nearly identical kernels; the only difference is that the trip count of the for loop in one is higher than in the other. Here are my two kernels, Kernel #1 and Kernel #2.
First Kernel:
__attribute__((num_compute_units(5)))
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void TestS16VfloatI1048576D32Form1MUnrol0U16(__global float *data, __global float *rands, int index, int rand_max){
    float16 temp;
    int gid = get_global_id(0);
    temp = (float16)(data[gid]);     /* broadcast this work-item's element into all 16 lanes */
    #pragma unroll 16
    for (int i = 0; i < 32; i++){
        temp = rands[index] * temp;  /* scalar * float16 multiply, 16 FLOPs per iteration */
    }
    data[gid] = temp.s0;             /* write back lane 0 */
}
Second Kernel:
__attribute__((num_compute_units(5)))
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void TestS16VfloatI1048576D256Form1MUnrol0U16(__global float *data, __global float *rands, int index, int rand_max){
    float16 temp;
    int gid = get_global_id(0);
    temp = (float16)(data[gid]);     /* same as Kernel #1 */
    #pragma unroll 16
    for (int i = 0; i < 256; i++){   /* only the trip count differs: 256 vs. 32 */
        temp = rands[index] * temp;
    }
    data[gid] = temp.s0;
}
As is clear from both kernel implementations, both should consume the same amount of hardware resources, and both unroll the loop to the same degree. They are compiled as NDRange kernels and I deploy around 1 million work-items. Counting the floating-point operations performed, I see the first kernel achieves 1.57 TFLOPS while the second achieves 4.37 TFLOPS. I'm trying to come up with an explanation of how increasing the number of operations inside the kernel can increase the measured performance while keeping the run time the same.