Altera_Forum
Honored Contributor
8 years ago
Unexplained performance difference for same kernels
I have developed two nearly identical kernels; the only difference is that the trip count of the for loop in one is higher than in the other. Here are my two kernels, Kernel #1 and Kernel #2.
First Kernel:
__attribute__((num_compute_units(5)))
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void TestS16VfloatI1048576D32Form1MUnrol0U16(__global float *data, __global float *rands, int index, int rand_max){
    float16 temp;
    int gid = get_global_id(0);
    temp = (float16)(data[gid]);     /* broadcast this work-item's element into all 16 lanes */
    #pragma unroll 16
    for (int i = 0; i < 32; i++){
        temp = rands[index] * temp;  /* scalar * float16 multiply, 16 FLOPs per iteration */
    }
    data[gid] = temp.s0;             /* write back lane 0 */
}
Second Kernel:
__attribute__((num_compute_units(5)))
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void TestS16VfloatI1048576D256Form1MUnrol0U16(__global float *data, __global float *rands, int index, int rand_max){
    float16 temp;
    int gid = get_global_id(0);
    temp = (float16)(data[gid]);     /* same as Kernel #1 */
    #pragma unroll 16
    for (int i = 0; i < 256; i++){   /* only the trip count differs: 256 vs. 32 */
        temp = rands[index] * temp;
    }
    data[gid] = temp.s0;
}
As is clear from both kernel implementations, both should consume the same amount of hardware resources, and both unroll the loop to the same degree. They are compiled as NDRange kernels and I deploy around 1 million work-items. Counting the floating-point operations performed, I see the first kernel achieves 1.57 TFLOPS while the second achieves 4.37 TFLOPS. I'm trying to come up with an explanation of how increasing the number of operations inside the kernel can increase the measured performance while keeping the run time the same.