Optimising local memory access
I am trying to work out precisely what I need to do to achieve the best possible performance when accessing local memory in a DPC++ kernel.
I am working from the "Accessing Work-Group Local Memory" section in chapter 15 of Reinders et al.'s Data Parallel C++. However, this leaves a number of unknowns.
1) It talks about "elements" in local memory without saying what size those elements are. I would guess that they are either 4-byte or 8-byte, but how can I determine which for any given processor?
2) It talks about banks in local memory. How can I find out how many of these there are?
3) Clearly, if two work-items access different elements in the same bank, then those accesses will have to be serialised. However, what happens if two work-items access the same element in local memory (e.g. one reads the top half and the other reads the bottom half)? Do these have to be serialised? Is this the same for read and write operations?
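To make the three questions concrete, here is a minimal host-side sketch of the bank model as I understand it from the book. It is plain C++, not a SYCL kernel, and everything in it is an assumption: the function name `serialised_passes` is hypothetical, consecutive elements are assumed to map to consecutive banks, and the element size, bank count and same-element behaviour are left as parameters precisely because they are the unknowns asked about above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical model, not a description of any real device:
// local memory is split into `num_banks` banks, and consecutive
// elements of `element_bytes` bytes map to consecutive banks.
// Returns how many serialised passes the busiest bank would need
// to service one access per work-item.
std::size_t serialised_passes(const std::vector<std::uint32_t>& byte_addresses,
                              std::uint32_t element_bytes,      // unknown from question 1
                              std::uint32_t num_banks,          // unknown from question 2
                              bool same_element_is_free) {      // unknown from question 3
    assert(element_bytes > 0 && num_banks > 0);

    // Per bank: the distinct elements touched and the raw access count.
    std::map<std::uint32_t, std::set<std::uint32_t>> distinct_elements;
    std::map<std::uint32_t, std::size_t> total_accesses;

    for (std::uint32_t addr : byte_addresses) {
        const std::uint32_t element = addr / element_bytes;
        const std::uint32_t bank = element % num_banks;
        distinct_elements[bank].insert(element);
        ++total_accesses[bank];
    }

    std::size_t worst = 0;
    for (const auto& entry : distinct_elements) {
        // If accesses to the same element can be merged, only distinct
        // elements cost a pass; otherwise every access costs a pass.
        const std::size_t cost = same_element_is_free
                                     ? entry.second.size()
                                     : total_accesses.at(entry.first);
        worst = std::max(worst, cost);
    }
    return worst;
}
```

For example, with placeholder values of 4-byte elements and 16 banks, byte addresses 0 and 64 land in the same bank under this model and cost two passes, whereas byte addresses 0 and 2 fall in the same element and cost one or two passes depending on the answer to question 3. The values 4 and 16 are illustrative only and are not claimed to match the HD 610.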
I am programming (in the first instance) for a Kaby Lake HD 610 (Device Id: 5906).