Well, it is not possible to give general guidelines that work for every case. Correct pipelining depends on loop-carried and load/store dependencies (on either local or global memory buffers). You should pay attention to the loop analysis part of the report and then try to restructure your code to resolve the dependency. However, dependencies are not always resolvable.

One very useful approach to handling both types of dependency is loop blocking. With blocking, the bound of the inner loop becomes a compile-time constant, so the compiler can perform extra optimizations to handle loop-carried dependencies. In your current code example, the dependency is on the previous row (assuming "i" loops over rows), so the dependency distance is relatively large. You can avoid the dependency by blocking the loop on "i" with a large block size. Since the trip count of the blocked loop is then known, the compiler can analyze the dependency distance and insert enough stages into the pipeline to avoid the dependency and allow correct pipelining. Of course, you will still be left with an unpipelineable loop over the blocks, but that loop has a very small trip count, so its negative effect is minimized.

Fixing your code example requires some work. As an example of how to resolve such dependencies, you can take a look at the transformation I performed from v1 to v5 of this benchmark:
https://github.com/zohourih/rodinia_fpga/tree/master/opencl/pathfinder
Though that code is quite a bit more complex than your example.
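To illustrate the blocking transformation itself, here is a minimal sketch in plain C (not the benchmark code; BLOCK_SIZE, rows, and cols are made-up names, and each row depends on the row directly above it). The point is that after blocking, the blocked loop has a compile-time-constant trip count:

```c
#define BLOCK_SIZE 64 /* compile-time constant, so the trip count of the
                         blocked loop is known to the compiler (value is
                         just an example) */

/* Each row depends on the previous row: buf[i][j] += buf[i-1][j],
   i.e. a loop-carried dependency with distance "cols". */
void add_rows_blocked(float *buf, int rows, int cols)
{
    /* Outer loop over blocks: very small trip count, so even if it
       cannot be pipelined, the cost is minimal. */
    for (int ib = 1; ib < rows; ib += BLOCK_SIZE)
    {
        /* Blocked loop with a constant trip count: the compiler can
           now reason about the dependency distance across iterations. */
        for (int ii = 0; ii < BLOCK_SIZE; ii++)
        {
            int i = ib + ii;
            if (i < rows) /* guard the final, partial block */
            {
                for (int j = 0; j < cols; j++)
                {
                    buf[i * cols + j] += buf[(i - 1) * cols + j];
                }
            }
        }
    }
}
```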
Regarding coalescing, you mostly just need to make sure the accesses are consecutive along the SIMD/unroll direction. However, as your example shows, that is not always enough; I have encountered multiple cases where I simply could not get the compiler to coalesce accesses that were obviously coalesceable. In your case, though, a quick transformation seems to at least allow correct coalescing:
#define UNROLL_FACTOR 16
__kernel void SWI(__global DTYPE* restrict AA, __global const DTYPE* restrict BB, const int lllX, const int lllY)
{
    for (int i = 1; i < lllX; i++)
    {
        int exit = (lllY % UNROLL_FACTOR == 0) ? (lllY / UNROLL_FACTOR) : (lllY / UNROLL_FACTOR) + 1;
        #pragma ivdep
        for (int j = 0; j < exit; j++)
        {
            DTYPE a[UNROLL_FACTOR];

            // read from the previous row; the guard prevents the final,
            // partial group from reading past the end of the buffer
            #pragma unroll
            for (int k = 0; k < UNROLL_FACTOR; k++)
            {
                int j_real = j * UNROLL_FACTOR + k;
                if (j_real < lllY)
                {
                    a[k] = AA[(i - 1) * lllY + j_real];
                }
            }

            // write to the current row
            #pragma unroll
            for (int k = 0; k < UNROLL_FACTOR; k++)
            {
                int j_real = j * UNROLL_FACTOR + k;
                if (j_real < lllY)
                {
                    AA[i * lllY + j_real] = a[k] + BB[i * lllY + j_real];
                }
            }
        }
    }
}
I basically detached the two accesses to the AA buffer. However, this required performing the loop unrolling manually rather than relying on the compiler's unroll pragma. In general, I would advise against performing partial unrolling in Single Work-item kernels using the pragma; manual partial unrolling, as done above, pretty much always achieves better results. Note that if you merge the two fully-unrolled loops, you will again get non-coalesced accesses. The mod and division operations used to compute "exit" are relatively cheap in hardware as long as the unroll factor is a power of two. Furthermore, "exit" can be calculated on the host and passed to the kernel as an argument to save some area on the FPGA.
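On that last point, the host-side calculation of "exit" is just a ceiling division; a sketch (the function name is made up, and you would pass the result to the kernel as an extra argument via clSetKernelArg):

```c
#define UNROLL_FACTOR 16 /* must match the value used in the kernel */

/* Number of UNROLL_FACTOR-wide groups needed to cover lllY columns.
   Equivalent to the in-kernel expression
   (lllY % UNROLL_FACTOR == 0) ? lllY / UNROLL_FACTOR
                               : lllY / UNROLL_FACTOR + 1,
   but computed once on the host instead of per outer-loop iteration. */
int exit_count(int lllY)
{
    return (lllY + UNROLL_FACTOR - 1) / UNROLL_FACTOR;
}
```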
In the end, I have to say that for such examples, it is certainly much easier to get good performance using an NDRange kernel; however, with correct optimization, the Single Work-item equivalent will usually result in better performance. Unfortunately, the learning curve for optimizing Single Work-item kernels is relatively steep and requires a lot of experience and knowledge of how the compiler works.