To maximize memory performance you should:
- Minimize the number of global memory accesses/ports. Each global memory port takes up a fair amount of area, and the more ports you have, the more contention there will be on the memory bus, resulting in stalls that can propagate all the way down the pipeline. Where applicable, packing variables into a struct can help, since it lets you fetch several different variables through a single memory port/access.
- Unroll loops that iterate over memory accesses so that the compiler coalesces the contiguous accesses into one larger access, which lets you make better use of the memory bandwidth. Using vector types can achieve the same result.
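As a rough illustration of both points, here is a minimal sketch of a single work-item kernel. The kernel name `scale_all`, the `Params` struct, and the unroll width `N` are all made up for this example; they are not from the guide. The macros at the top are only there so the same code also compiles as plain C for a quick functional check on the host. Whether the compiler actually merges the accesses should be confirmed in the compilation report.

```c
/* Compiles as OpenCL C, or as plain C99 for a host-side functional check. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

#define N 8 /* hypothetical unroll width: 8 consecutive floats per iteration */

/* Hypothetical struct: values that are always used together are packed
 * so that they can be fetched with one access instead of three. */
typedef struct {
    float scale;
    float offset;
    float weight;
    float pad; /* assumption: pad the struct to 16 bytes */
} Params;

__kernel void scale_all(__global const Params *restrict p,
                        __global const float *restrict in,
                        __global float *restrict out,
                        int groups)
{
    for (int g = 0; g < groups; g++) {
        /* One struct read replaces three separate scalar reads. */
        Params q = p[g];

        /* Fully unrolling this loop exposes N contiguous reads and
         * N contiguous writes that the compiler can coalesce. */
#ifdef __OPENCL_VERSION__
        #pragma unroll
#endif
        for (int i = 0; i < N; i++) {
            int idx = g * N + i;
            out[idx] = (in[idx] * q.scale + q.offset) * q.weight;
        }
    }
}
```

In OpenCL C, loading and storing a `float8` per iteration would be the vector-type equivalent of the unrolled inner loop.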
The "Best Practices Guide" also includes some guidelines regarding this matter in Section 1.8.1.
Regarding disabling memory interleaving: doing so will almost never improve performance. The only case where I have seen it help is a very simple kernel with only one read and one write and a very short pipeline. With multiple accesses, interleaving nearly always performs better. However, if you want to favor some accesses over others through manual banking, you can put your more important buffers in different banks and distribute the less important ones evenly between the banks. Alternatively, if you have a few important buffers but many less important ones, you can put the first set in one bank and the second set in the other, so that accesses to the more important buffers get a bigger share of the bandwidth.
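The first manual-banking strategy above can be sketched as a small planning helper. This is only a model of the assignment decision, not host API code: the `BufferPlan` struct and `assign_banks` function are hypothetical, the board is assumed to have two banks, and buffer size is used as a crude stand-in for how much traffic a buffer generates. The actual placement would still have to be done on the host when each buffer is created.

```c
#include <stddef.h>

#define NUM_BANKS 2 /* assumption: a board with two external memory banks */

/* Hypothetical description of one global buffer. */
typedef struct {
    size_t bytes;     /* buffer size, used here as a rough proxy for traffic */
    int    important; /* 1 = performance-critical, 0 = less important */
    int    bank;      /* output: index of the chosen bank */
} BufferPlan;

/* Spread the important buffers across the banks round-robin, then place
 * each remaining buffer in the bank that currently holds the fewest
 * less-important bytes. */
void assign_banks(BufferPlan *bufs, size_t count)
{
    size_t load[NUM_BANKS] = {0};
    size_t next = 0;

    /* Pass 1: important buffers go to different banks where possible. */
    for (size_t i = 0; i < count; i++) {
        if (bufs[i].important) {
            bufs[i].bank = (int)(next % NUM_BANKS);
            next++;
        }
    }

    /* Pass 2: balance the less important buffers between the banks. */
    for (size_t i = 0; i < count; i++) {
        if (!bufs[i].important) {
            size_t best = 0;
            for (size_t b = 1; b < NUM_BANKS; b++) {
                if (load[b] < load[best]) {
                    best = b;
                }
            }
            bufs[i].bank = (int)best;
            load[best] += bufs[i].bytes;
        }
    }
}
```

For the second strategy (a few important buffers versus many less important ones), the same struct could be reused with a simpler rule: important buffers to bank 0, everything else to bank 1.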