Forum Discussion
The throughput measurement would take a while to explain, but it's supposed to represent the best-case work-item retirement rate. If -O3 didn't fill up the chip with more SIMD lanes or compute units, then I suspect the throughput estimate did not increase, so the compiler did not bother adding the extra hardware. If more hardware is generated with no performance gain, all you get is a longer compile time.

In the case of vector add, the kernel only adds two numbers together, so throwing more SIMD vector lanes at the problem shouldn't help: it is already limited by global memory bandwidth. (If adding lanes didn't help, that suggests the compiler had already effectively vectorized the kernel by coalescing its memory accesses.)

Adding more compute units would also be memory limited, and it can actually hurt performance. Each copy of the kernel has its own load/store units, so more compute units means more load/store units fighting over the same memory bandwidth. With one compute unit you have one load unit and one store unit that access memory sequentially. With multiple compute units, each one does the same thing, but they all access different locations in memory at the same time, which is a less than ideal memory access pattern.
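To make the memory-bound argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not how the compiler computes its throughput estimate; it is just a simple "minimum of compute rate and memory rate" model. The function name, the clock frequency, and the bandwidth figure are all made-up values for illustration, and the 12 bytes per work-item assumes a float vector add (two 4-byte loads plus one 4-byte store).

```python
def estimated_throughput(simd_lanes, compute_units, fmax_hz,
                         mem_bandwidth_bytes_per_s, bytes_per_work_item=12):
    """Best-case work-items retired per second under a simple roofline-style model.

    Hypothetical model for illustration only, not the compiler's actual estimate.
    """
    # Best case for the datapath: one work-item per lane per clock cycle.
    compute_rate = simd_lanes * compute_units * fmax_hz
    # Global memory can only feed this many work-items per second.
    memory_rate = mem_bandwidth_bytes_per_s / bytes_per_work_item
    # The slower of the two limits the kernel.
    return min(compute_rate, memory_rate)


# Assumed numbers: 250 MHz kernel clock, 6 GB/s of usable global memory bandwidth.
FMAX_HZ = 250e6
BANDWIDTH = 6e9

for lanes in (1, 2, 4, 8, 16):
    rate = estimated_throughput(lanes, 1, FMAX_HZ, BANDWIDTH)
    print(f"{lanes:2d} SIMD lanes -> {rate / 1e6:.0f} M work-items/s")
```

With these assumed numbers the estimate stops increasing once the datapath can consume everything memory can deliver (at 2 lanes here), so any lanes beyond that point are wasted area. The model also treats extra compute units the same as extra lanes; it does not capture the contention between their load/store units, which is why in practice multiple compute units can end up slower rather than merely no faster.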