Different kernels of same algorithm give different throughputs

Honored Contributor

8 years ago

Regarding operating frequency, when you are trying to find a pattern in your measured results, you should first normalize your numbers for a fixed operating frequency to eliminate the effect of the variable frequency. For example, after normalizing your numbers for a fixed operating frequency of 300 MHz, you will see a trend that roughly looks like this:


_ _ _ _
         \ _ _ _

Furthermore, you should take memory bandwidth into account. If you are saturating the memory bandwidth, performance will not improve with a higher operating frequency; hence, such memory-bound results should not simply be scaled linearly with frequency when you normalize the performance.
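
As a rough illustration of the normalization described above, here is a minimal Python sketch. The function names, the 0.9 saturation threshold, and the choice to leave memory-bound results unscaled are my own assumptions for illustration, not anything taken from the tools or the report; adjust them to your board and measurements.

```python
def is_memory_bound(bytes_moved, runtime_s, peak_bw_gbps, threshold=0.9):
    """Heuristic (assumption): call a run memory-bound when the achieved
    bandwidth reaches `threshold` times the board's peak bandwidth."""
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    achieved_gbps = bytes_moved / runtime_s / 1e9
    return achieved_gbps >= threshold * peak_bw_gbps


def normalize_throughput(measured, f_kernel_mhz, f_ref_mhz=300.0, memory_bound=False):
    """Scale a measured throughput to a fixed reference frequency.

    Compute-bound results are scaled linearly with frequency.
    Memory-bound results are returned unchanged (assumption), because a
    higher kernel frequency would not improve them.
    """
    if f_kernel_mhz <= 0:
        raise ValueError("f_kernel_mhz must be positive")
    if memory_bound:
        return measured
    return measured * f_ref_mhz / f_kernel_mhz


if __name__ == "__main__":
    # Hypothetical numbers: 100 units/s measured at 250 MHz, 10 GB moved in 1 s
    # on a board with 25.6 GB/s of peak memory bandwidth.
    bound = is_memory_bound(10e9, 1.0, 25.6)
    print(normalize_throughput(100.0, 250.0, memory_bound=bound))
```

Comparing kernels on the normalized numbers, with the memory-bound ones flagged separately, should make any remaining trend easier to attribute to the kernel structure rather than to the operating frequency.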

Regarding scheduling, single work-item kernels have no runtime scheduler: loop iterations are initiated with a fixed initiation interval (II) that is determined at compile time from loop-carried and load/store dependencies. In NDRange kernels, however, there is no fixed II; at each clock, the runtime scheduler decides, based on the state of the pipeline, whether to schedule another thread into the pipeline. You can think of this as threads being scheduled into the pipeline with a variable II. The maximum number of threads that can be in flight in the pipeline at any clock is equal to the depth of the pipeline (which you can get from the report); however, how many are actually in flight at a given clock is determined at runtime. The details of the scheduler's implementation are unknown to people outside Altera/Intel (including me). Based on your measurement results, the latency numbers from the report, and some intuition and math, I think you might be able to extract the average II of the loop in your design.
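
One way to attempt that estimate is sketched below in Python. It assumes a simple first-order model that is my own simplification, not something documented for the scheduler: total cycles ≈ pipeline depth + average II × (number of threads − 1), with the measured runtime dominated by a single pipelined loop. The function and parameter names are hypothetical.

```python
def estimate_average_ii(runtime_s, f_kernel_mhz, pipeline_depth, num_items):
    """Estimate the average initiation interval in clock cycles.

    Assumed model: total_cycles = pipeline_depth + II_avg * (num_items - 1),
    where num_items is the number of work-items (or loop iterations) issued
    into the pipeline and pipeline_depth is the latency from the report.
    """
    if num_items < 2:
        raise ValueError("need at least two items to estimate an interval")
    total_cycles = runtime_s * f_kernel_mhz * 1e6  # seconds * cycles per second
    return (total_cycles - pipeline_depth) / (num_items - 1)


if __name__ == "__main__":
    # Hypothetical measurement: 10 ms at 250 MHz, pipeline depth of 500 cycles,
    # 1,000,001 work-items.
    print(estimate_average_ii(0.01, 250.0, 500, 1_000_001))
```

If the estimate comes out well above 1, the pipeline is stalling or the scheduler is issuing threads less often than once per clock; if memory bandwidth is saturated, the number mostly reflects memory stalls rather than the scheduler itself.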

If you want more predictable results, I recommend using single work-item kernels. There are many unknown variables involved in the operation of the runtime scheduler in NDRange kernels.
