In this example, there are multiple reasons why the Single Work-item kernel is going to be slower than the NDRange one:
1- The outer loop in the Single Work-item kernel is not pipelined at all. Yes, the report shows an II of one for it, but its iterations still execute serially with respect to the inner loop: each outer iteration must wait for the entire inner loop to finish. This means your Single Work-item kernel is going to be extremely slow unless the trip count of the outer loop is very small and the trip count of the inner loop is very large. The NDRange kernel achieves much better performance here because its work-items are dynamically scheduled at run time.
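As a rough sketch of the loop structure being described (the kernel name, buffer names, and trip counts are my assumptions, not taken from your code):

```c
// Single Work-item kernel: II = 1 applies inside the inner loop, but the
// outer loop's iterations still issue serially, one full inner-loop run
// at a time.
__kernel void swi_kernel(__global const float *restrict in,
                         __global float *restrict out,
                         int outer_n, int inner_n)
{
    for (int i = 0; i < outer_n; i++) {      // iterations execute serially
        float acc = 0.0f;
        #pragma unroll 16
        for (int j = 0; j < inner_n; j++) {  // pipelined with II = 1
            acc += in[i * inner_n + j];
        }
        out[i] = acc;
    }
}
```

With this structure, total latency is roughly outer_n times the full latency of one inner-loop execution, which is why a small outer and large inner trip count is the only case where it performs acceptably.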
2- Even though there is potential for memory access coalescing in the Single Work-item kernel, as you mentioned, the report shows that the compiler is not actually coalescing the accesses. I have seen multiple patterns where the compiler simply refuses to coalesce accesses even though they are clearly consecutive, and this is one of those cases. The problem here is that the loop contains two accesses to the same external buffer which, if coalesced, could potentially overlap with each other. In such cases, the compiler instead creates multiple narrow 32-bit ports (for float) to memory, resulting in heavy contention on the memory bus and very poor memory throughput. You might be able to get the accesses to coalesce correctly if you move the memory accesses out of the compute loop.
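One way to move the accesses out of the compute loop is to stage the data through on-chip buffers, so that each external buffer is touched by exactly one unrolled loop with purely consecutive accesses. A sketch of that pattern, with an assumed compile-time trip count and a placeholder computation:

```c
#define N 1024  // assumed compile-time trip count

// Read each external buffer in its own unrolled copy loop, compute on
// on-chip buffers, then write back. Each loop now has a single stream of
// consecutive accesses to one buffer, which the compiler is far more
// likely to coalesce into one wide port.
__kernel void swi_staged(__global const float *restrict a,
                         __global const float *restrict b,
                         __global float *restrict out)
{
    float a_buf[N], b_buf[N], o_buf[N];

    #pragma unroll 16
    for (int i = 0; i < N; i++) a_buf[i] = a[i];   // coalescable reads

    #pragma unroll 16
    for (int i = 0; i < N; i++) b_buf[i] = b[i];   // coalescable reads

    #pragma unroll 16
    for (int i = 0; i < N; i++) o_buf[i] = a_buf[i] * b_buf[i];  // on-chip only

    #pragma unroll 16
    for (int i = 0; i < N; i++) out[i] = o_buf[i]; // coalescable writes
}
```

The trade-off is the on-chip memory spent on the staging buffers, but that is usually cheap compared to the throughput lost to multiple contending 32-bit ports.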
3- Your NDRange kernel uses both SIMD and unrolling, while the Single Work-item kernel uses only unrolling with the same factor as the NDRange kernel; the NDRange kernel therefore has a 16-times-higher degree of parallelism, which gives it an edge over the Single Work-item equivalent. Note, however, that the memory accesses in the NDRange kernel are only consecutive in the SIMD direction, and the unrolling will still result in multiple non-coalesced accesses.
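To make the access-pattern point concrete, here is a sketch of an NDRange kernel combining both forms of parallelism (the work-group size, SIMD width, unroll factor, and indexing are my assumptions):

```c
// SIMD vectorization replicates the datapath across work-items; the
// unroll pragma additionally replicates the loop body within each one.
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(16)))
__kernel void ndr_kernel(__global const float *restrict in,
                         __global float *restrict out,
                         int inner_n)
{
    int gid   = get_global_id(0);
    int width = get_global_size(0);
    float acc = 0.0f;
    #pragma unroll 4
    for (int j = 0; j < inner_n; j++) {
        // Consecutive across gid (the SIMD lanes), so the 16 SIMD
        // accesses coalesce into one wide access; strided across j, so
        // the 4 unrolled copies remain separate, non-coalesced accesses.
        acc += in[j * width + gid];
    }
    out[gid] = acc;
}
```

Note that num_simd_work_items requires a reqd_work_group_size that is divisible by the SIMD width, as in the attributes above.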