Weird arbitration logic
Hello,
I need to pass a structure with a pointer member to my kernel. Since this cannot be done directly, I'm trying two different approaches. The first one splits the structure into its non-pointer members and the pointer, and passes them as separate kernel arguments (struct_without_pointer in the example below). The second one uses an auxiliary kernel that receives the non-pointer members and the pointer, assembles the structure and stores it in global memory, so that the actual kernel can access the structure through a pointer to global memory (struct_with_pointer in the example below).

Both approaches work. However, the first one performs much better than the second one (in fact, the second one leads to a memory efficiency close to 1%, according to the Dynamic Profiler). It seems that the second approach generates highly complex arbitration logic, which results in high-latency LSUs. I'd like to know whether this can be avoided in some way, since I prefer the second approach over the first. If not, an explanation of why the compiler behaves this way would also be welcome.
A minimal example of what I'm describing (for the sake of simplicity, I omitted the kernel that assembles the struct):
typedef struct {
    int x;
    int y;
    __global int *restrict data;
} swp;

typedef struct {
    int x;
    int y;
} swop;

__kernel void struct_with_pointer(__global swp *restrict p_in, __global swp *restrict p_out) {
    swp in = *p_in;
    swp out = *p_out;
    for (int i = 0; i < 10; i++)
        out.data[i] = in.data[i] + 3;
}

__kernel void struct_without_pointer(swop in_coords, __global int *restrict data_in,
                                     swop out_coords, __global int *restrict data_out) {
    swp in = {
        .x = in_coords.x,
        .y = in_coords.y,
        .data = data_in
    };
    swp out = {
        .x = out_coords.x,
        .y = out_coords.y,
        .data = data_out
    };
    for (int i = 0; i < 10; i++)
        out.data[i] = in.data[i] + 3;
}

I'm also providing a report where you can check what I'm saying (compare struct_with_pointer.B2 and struct_without_pointer.B2 in the graph viewer to see the problem). Due to time constraints, I cannot provide any profiling data beyond what I have already stated.
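
For context, here is a minimal sketch of what the omitted assembling kernel could look like. It is not the actual code: the name assemble_swp and its parameter list are hypothetical, and it assumes the kernel is launched once per structure as a single work-item task. The preprocessor guard is only there so that the sketch also compiles as plain C for a quick host-side check.

```c
/* When built with an OpenCL compiler, __OPENCL_VERSION__ is defined and
   these fallbacks are skipped. They only let the sketch compile as plain C. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Same layout as the swp struct in the example above. */
typedef struct {
    int x;
    int y;
    __global int *restrict data;
} swp;

/* Hypothetical auxiliary kernel (name and signature are assumptions):
   receives the non-pointer members and the pointer as separate arguments,
   assembles the struct and stores it in global memory. */
__kernel void assemble_swp(int x, int y,
                           __global int *restrict data,
                           __global swp *restrict out) {
    swp s = {
        .x = x,
        .y = y,
        .data = data
    };
    *out = s;
}
```

Under these assumptions, the host would run assemble_swp once for the input and once for the output structure, and then pass the two resulting buffers to struct_with_pointer as p_in and p_out.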