Altera_Forum
Honored Contributor
8 years ago

Different kernels of same algorithm give different throughputs
Hi,
I'm trying to test the performance of my 385A card using different OpenCL kernels, all of which implement the same functionality. Here are two of my kernels:

__attribute__((num_compute_units(1)))
__attribute__((num_simd_work_items(1)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void WGSXMAPIXLLXOPS1024(const __global float * restrict GIn, __global float * restrict GOut, const float M, const float N, const float P) {
    const int XGL = get_global_id(0);
    const int XGRid = get_group_id(0);
    const int XGRnum = get_num_groups(0);
    const int XLSize = get_local_size(0);
    const int XLid = get_local_id(0);
    // Just private variables
    float NF = (float) N;
    float PF = (float) P;
    float tempOutTotal = 0;
    // Start of a new level of for loop
    for (int lcdd = 0; lcdd < 2; lcdd++) {
        float temp1 = 1.0;
        float temp2 = 1.0;
        float temp3 = 1.0;
        float temp4 = 1.0;
        float MF = (float) lcdd + XGL;
        float tempOut;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        ... // elided: 1024 FMAs per iteration in total
        temp1 += temp1 * MF;
        tempOut = temp1 + temp2 + temp3 + temp4;
        tempOutTotal += tempOut;
    }
    GOut[XGL] = tempOutTotal;
}
and here is the second one:

__attribute__((num_compute_units(1)))
__attribute__((num_simd_work_items(1)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void WGSXMAPIXLLXOPS1024(const __global float * restrict GIn, __global float * restrict GOut, const float M, const float N, const float P) {
    const int XGL = get_global_id(0);
    const int XGRid = get_group_id(0);
    const int XGRnum = get_num_groups(0);
    const int XLSize = get_local_size(0);
    const int XLid = get_local_id(0);
    // Just private variables
    float NF = (float) N;
    float PF = (float) P;
    float tempOutTotal = 0;
    // Start of a new level of for loop
    for (int lcdd = 0; lcdd < 256; lcdd++) {
        float temp1 = 1.0;
        float temp2 = 1.0;
        float temp3 = 1.0;
        float temp4 = 1.0;
        float MF = (float) lcdd + XGL;
        float tempOut;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        ... // elided: 1024 FMAs per iteration in total
        temp1 += temp1 * MF;
        tempOut = temp1 + temp2 + temp3 + temp4;
        tempOutTotal += tempOut;
    }
    GOut[XGL] = tempOutTotal;
}
The number of iterations for the for loop is set to 2, 4, 8, 16, 32, 64, 128, or 256. The loop body contains 1024 FMAs, so it cannot be fully unrolled: doing so would exceed the total number of DSPs available on the board. As the iteration count increases, the number of work items decreases correspondingly: 524288, 262144, 131072, 65536, 32768, 16384, 8192, 4096. This means all eight deployments perform (almost) the same total number of operations. Also, there is only one write operation at the end of the kernel, and the compilation report and profiling show that this write is not a bottleneck.

My expectation was to get the same performance from all of these deployments, but that is not the case: the measured throughputs are 629.372, 621.288, 648.249, 593.756, 576.578, 506.244, 472.351, and 519.522 GFLOPS respectively. As you can see, the performance drops as the number of loop iterations increases. Looking at the generated Verilog and analysing the behaviour of the system in the presence of the loop (to the best of my knowledge), I couldn't find the root cause of this difference in performance.

Does anyone have an idea what is going on in my deployments? Thanks