Forum Discussion

Altera_Forum
Honored Contributor
8 years ago

Report Loop Analysis NDrange threads

Hello everyone

I'm struggling to understand what the Details tab in the Loop Analysis section of the report means when it says nd-range: thread capacity = 303.

Example two-dimensional kernel:

/* Kernel name and array indexing are assumed; the original post elided them. */
__kernel void example(__global short const * restrict input_a,
                      __global short * restrict input_b)
{
    short aux;
    int row = get_global_id(1);
    int col = get_global_id(0);
    int width = get_global_size(0);
    for (int j = get_global_id(1); j < get_global_size(0); j++) {
        aux = input_a[j * width + col];
        input_b[j * width + col] += aux;
    }
}

Can I only run a maximum of 303 work-items? And what causes that "constraint" in the code? :confused:

7 Replies

  • Altera_Forum
    Honored Contributor

    --- Quote Start ---

    Can I only run a maximum of 303 work-items? And what causes that "constraint" in the code? :confused:

    --- Quote End ---

    No, that is not what this means. That number is just the latency of the pipeline which determines the maximum number of threads that can be simultaneously "in-flight" in the pipeline. This does not limit the total number of work-items you can run with that kernel.
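    The point above can be sketched with the classic pipeline formula. This is an illustration, not compiler output: the 303-cycle latency and an initiation interval (II) of 1 are taken as given, and `cycles_for` is a hypothetical helper name.

    ```c
    #include <stdio.h>

    /* Sketch, assuming latency = 303 and II = 1: at most ~303 work-items
     * are "in flight" in the pipeline at once, but the total number of
     * work-items is unbounded -- total cycle count only grows by one
     * cycle (the II) per additional work-item. */
    static long cycles_for(long n_items, long latency, long ii) {
        return latency + (n_items - 1) * ii;   /* fill time + drain rate */
    }

    int main(void) {
        long latency = 303, ii = 1;
        printf("%ld\n", cycles_for(303, latency, ii));     /* 605 */
        printf("%ld\n", cycles_for(1000000, latency, ii)); /* 1000302 */
        return 0;
    }
    ```

    A million work-items run fine; they just stream through the 303-stage-deep pipeline one per cycle.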
  • Altera_Forum
    Honored Contributor

    --- Quote Start ---

    No, that is not what this means. That number is just the latency of the pipeline which determines the maximum number of threads that can be simultaneously "in-flight" in the pipeline. This does not limit the total number of work-items you can run with that kernel.

    --- Quote End ---

    Thanks HRZ!

    What causes this number to be 303? Does it depend on the for-loop condition, in the above example j=row+1 < height?
  • Altera_Forum
    Honored Contributor

    The latency of the pipeline depends on the operations that are carried out inside of the loop, and the loop exit condition. Each iteration of your loop performs one external memory read, one external memory write, multiple integer operations for addressing and calculation, and one final integer comparison for the loop exit condition. Based on the compiler's decision, these operations require a minimum of 303 clocks to be carried out, without stalling the pipeline.

  • Altera_Forum
    Honored Contributor

    --- Quote Start ---

    The latency of the pipeline depends on the operations that are carried out inside of the loop, and the loop exit condition. Each iteration of your loop performs one external memory read, one external memory write, multiple integer operations for addressing and calculation, and one final integer comparison for the loop exit condition. Based on the compiler's decision, these operations require a minimum of 303 clocks to be carried out, without stalling the pipeline.

    --- Quote End ---

    OK, now I get it :)

    Last doubt: does the loop trip count (in other words, the number of iterations) being unknown at compile time affect the performance or the generated circuit in some way? Because with this code, I'm assuming that certain work-items will have more iterations than others, depending on their ID in the second dimension (row=get_global_id(1)).
  • Altera_Forum
    Honored Contributor

    Thread-id-dependent branching will have a negative performance impact, because run time will be dominated by threads that perform the most amount of work. Furthermore, having such branches in your code will prevent you from being able to use the SIMD attribute to increase the performance of your kernel. However, the runtime scheduler will try to minimize the number of pipeline stalls/bubbles and maximize performance in any case.
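    The load imbalance described above can be sketched numerically. Assuming the kernel's loop runs from j = row to width-1 (the bounds shown in the question), a work-item in row r performs width - r iterations, and overall run time is dominated by the work-item doing the most work; `iterations_for_row` is a hypothetical helper name.

    ```c
    #include <stdio.h>

    /* Sketch: per-work-item iteration count for a loop running from
     * j = row up to width-1, as in the kernel posted above. */
    static int iterations_for_row(int row, int width) {
        return width > row ? width - row : 0;
    }

    int main(void) {
        int width = 8, max = 0, total = 0;
        for (int row = 0; row < width; row++) {
            int it = iterations_for_row(row, width);
            total += it;
            if (it > max) max = it;
        }
        /* Run time tracks the maximum (8), not the average (4.5). */
        printf("max=%d avg=%.1f\n", max, (double)total / width);
        return 0;
    }
    ```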

  • Altera_Forum
    Honored Contributor

    --- Quote Start ---

    Thread-id-dependent branching will have a negative performance impact, because run time will be dominated by threads that perform the most amount of work. Furthermore, having such branches in your code will prevent you from being able to use the SIMD attribute to increase the performance of your kernel. However, the runtime scheduler will try to minimize the number of pipeline stalls/bubbles and maximize performance in any case.

    --- Quote End ---

    Thanks again for the help, HRZ! You are a big help here in the OpenCL section :)