Forum Discussion
I don't see anything out of the ordinary in the report from your sample code. The compiler creates a 512-bit coalesced load from global memory and two stores, one of which is 512 bits wide and the other 64 bits; since the size of global memory ports must be a power of two, the compiler decides it is best to split your 9 consecutive stores into one big store and one small store, rather than use a single 1024-bit store, which would waste a lot of memory bandwidth. This decision seems correct to me. Furthermore, the compiler combines the stores from the if and the else branches, since the write addresses are the same and only the data differs; hence, the compiler can simply instantiate a multiplexer to send the correct data to memory, instead of creating extra memory ports.
Regarding latency, I am not seeing any specific difference. You are not comparing the latency of the "white" store unit, which belongs to your local buffer, with that of the "blue" store units for the global buffer, are you? Finally, you should note that the actual latency of accesses to/from global memory is over 100 cycles; the latency the compiler reports for these accesses only depends on the number of extra registers the compiler inserts on the path to the memory port to absorb stalls, and does not reflect the real latency of the accesses. If an access finishes in fewer clock cycles than there are registers on its path, the pipeline will not be stalled (though some bubbles might be inserted). However, if the access takes longer, then the pipeline will stall. At the end of the day, having more registers on the path of global memory accesses is beneficial since it allows absorbing more stalls, but comes at the cost of higher area usage.