Load/store cannot be vectorised - local memory
Hello,
I'm having some trouble with local memory and SIMD in a matrix transpose kernel I'm adapting from GPU. The code:
#define TILE_DIM 4
__attribute__((reqd_work_group_size(TILE_DIM, TILE_DIM, 1)))
__attribute__((num_simd_work_items(TILE_DIM)))
__kernel void MatTranspose(__global float* restrict dest,
__global float* restrict src)
{
__local float tile[TILE_DIM][TILE_DIM];
int tx = get_local_id(0);
int ty = get_local_id(1);
int bx = get_group_id(0);
int by = get_group_id(1);
int x = bx * TILE_DIM + tx;
int y = by * TILE_DIM + ty;
int width = get_num_groups(0) * TILE_DIM;
for(int j = 0; j < TILE_DIM; j += TILE_DIM) {
tile[ty + j][tx] = src[(y + j) * width + x];
}
barrier(CLK_LOCAL_MEM_FENCE);
x = by * TILE_DIM + tx;
y = bx * TILE_DIM + ty;
for(int j = 0; j < TILE_DIM; j += TILE_DIM) {
dest[(y + j) * width + x] = tile[tx][ty + j];
}
}The compiler warns:
Compiler Warning: Vectorized kernel contains loads/stores that cannot be vectorized. This might reduce performance.
I don't see why these memory operations cannot be vectorised. I know the problem is caused by local memory because replacing the accesses to tile by a constant both in lines 17 and 26 solves the issue.
Moreover I was expecting tile to be split in 4 memory banks so that 4 load/store operations occur at the same time, but only one is generated. Forcing this to occur with the attribute numbanks results in pipelined never stall memory, when it should be burst coalesced (i guess this comes from the warning) as global memory accesses are contiguous in every cycle.
Could someone explain me what is going on, please?
P.D: Although this kernel can be implemented avoiding SIMD and achieve a good performance this is part of some benchmarking work I'm undertaking, so I want it to remain vectorised.
That compiler warning in particular is a very misleading warning and it does not always point to an actual problem in your code. Looking at the report, both the load from and the store to global memory are coalesced into 128-bit accesses which points to correct vectorization. The local buffer "tile" is also replicated by 28 times to provide fully-parallel non-stallable accesses. 4 times of it is because your code has 4 non-coalescable reads on line 28, and one coalescable write on line 19 (each Block RAM has two ports, writes are connected to all replicas while reads are connected to one, resulting in a replication factor of 4 for 4 reads and one write). The buffer is also replicated by 7 extra times to support 7 work-groups running concurrently in the same compute unit; this latter replication factor is a compiler decision that cannot be overridden by the user. All in all there is nothing wrong with your code and I would say you can safely ignore the warning.