Altera_Forum
Honored Contributor
8 years ago

Different kernels of same algorithm give different throughputs
Hi,
I'm trying to test the performance of my 385A card using different OpenCL kernels, all of which implement the same functionality. Here are two of my kernels:

__attribute__((num_compute_units(1)))
__attribute__((num_simd_work_items(1)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void WGSXMAPIXLLXOPS1024(const __global float * restrict GIn, __global float * restrict GOut, const float M, const float N, const float P) {
    const int XGL = get_global_id(0);
    const int XGRid = get_group_id(0);
    const int XGRnum = get_num_groups(0);
    const int XLSize = get_local_size(0);
    const int XLid = get_local_id(0);
    // Just private variables
    float NF = (float) N;
    float PF = (float) P;
    float tempOutTotal = 0;
    // Start of a new level of for loop
    for (int lcdd = 0; lcdd < 2; lcdd++) {
        float temp1 = 1.0;
        float temp2 = 1.0;
        float temp3 = 1.0;
        float temp4 = 1.0;
        float MF = (float) lcdd + XGL;
        float tempOut;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        ... // elided: 1024 FMAs per iteration in total
        temp1 += temp1 * MF;
        tempOut = temp1 + temp2 + temp3 + temp4;
        tempOutTotal += tempOut;
    }
    GOut[XGL] = tempOutTotal;
}
and here is the second one:

__attribute__((num_compute_units(1)))
__attribute__((num_simd_work_items(1)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void WGSXMAPIXLLXOPS1024(const __global float * restrict GIn, __global float * restrict GOut, const float M, const float N, const float P) {
    const int XGL = get_global_id(0);
    const int XGRid = get_group_id(0);
    const int XGRnum = get_num_groups(0);
    const int XLSize = get_local_size(0);
    const int XLid = get_local_id(0);
    // Just private variables
    float NF = (float) N;
    float PF = (float) P;
    float tempOutTotal = 0;
    // Start of a new level of for loop
    for (int lcdd = 0; lcdd < 256; lcdd++) {
        float temp1 = 1.0;
        float temp2 = 1.0;
        float temp3 = 1.0;
        float temp4 = 1.0;
        float MF = (float) lcdd + XGL;
        float tempOut;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        temp1 += temp1 * MF;
        ... // elided: 1024 FMAs per iteration in total
        temp1 += temp1 * MF;
        tempOut = temp1 + temp2 + temp3 + temp4;
        tempOutTotal += tempOut;
    }
    GOut[XGL] = tempOutTotal;
}
The number of iterations for the for loop is set to 2, 4, 8, 16, 32, 64, 128, or 256. The loop body contains 1024 FMAs, so it cannot be fully unrolled: doing so would exceed the total number of DSPs available on the board. As the iteration count increases, the number of work items decreases correspondingly: 524288, 262144, 131072, 65536, 32768, 16384, 8192, 4096. This means all eight deployments perform (almost) the same total number of operations. Also, there is only one write operation at the end of the kernel, and the compilation report and profiling show that this write is not a bottleneck.

My expectation was to get the same performance from all of these deployments, but that is not the case: the measured throughputs are 629.372, 621.288, 648.249, 593.756, 576.578, 506.244, 472.351, and 519.522 GFLOPS respectively. As you can see, the performance drops as the number of loop iterations increases. Looking at the generated Verilog and analysing the behaviour of the system in the presence of the loop (to the best of my knowledge), I couldn't find the root cause of this difference in performance.

Does anyone have an idea what is going on in my deployments? Thanks