HRZ
Frequent Contributor
7 years ago
This information is not explained properly (or at all) in Intel's documents. The following is MY understanding of what it means, but it is NOT necessarily correct:
- The "latency" of each block shows the depth of the pipeline generated for that block. This is NOT the time it takes to execute the block, because that time also depends on the II (initiation interval), the loop trip count (which is not necessarily known at compile time) and possible run-time stalls. For loads/stores, the latency refers to the depth of the pipeline the compiler generates to absorb stalls from these accesses; it is basically the minimum latency of these operations. If an access takes longer at run time, the pipeline will stall. Pipeline depth/latency is not directly controllable by the user, but it can be reduced by simplifying the loop body/operation.
- I think "starting cycle" for an operation refers to the minimum latency from the start of kernel execution until that specific operation is reached. This corresponds to the case where all loop trip counts are one and no stalls occur on memory/channel accesses.
- Latency of local memory accesses is obviously lower than that of global memory accesses.
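
To make the first point concrete, here is a rough sketch of how latency, II and trip count could combine into a cycle estimate for a single pipelined loop. This uses the usual textbook pipeline model (fill the pipeline once, then start one iteration every II cycles), which is my assumption and not something taken from Intel's documents. The function name, its parameters and the `stall_cycles` term are all hypothetical; real stalls are only known at run time.

```python
def estimate_loop_cycles(latency, ii, trip_count, stall_cycles=0):
    """Rough cycle estimate for one pipelined loop (assumed textbook model).

    latency      -- pipeline depth reported for the block, in cycles
    ii           -- initiation interval: cycles between consecutive iterations
    trip_count   -- number of loop iterations (often only known at run time)
    stall_cycles -- extra cycles lost to memory/channel stalls (a guess at best)
    """
    if trip_count <= 0:
        return 0
    # The first iteration needs the full pipeline depth; every later
    # iteration finishes ii cycles after the previous one.
    return latency + ii * (trip_count - 1) + stall_cycles


if __name__ == "__main__":
    # Trip count of one and no stalls: the estimate collapses to the latency.
    print(estimate_loop_cycles(latency=100, ii=1, trip_count=1))
    # Long loop: the II and trip count dominate, not the pipeline depth.
    print(estimate_loop_cycles(latency=100, ii=1, trip_count=1000))
```

Under this model, a block with latency 100 and II 1 takes about 100 cycles for a single iteration but about 1099 cycles for 1000 iterations, which is why the reported latency alone says little about execution time.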