Single Work-item is indeed the "preferred" model, but most certainly not "always" preferred. For example, NDRange is the better choice for kernels with non-pipelineable loops, random memory accesses, or memory access and compute parts that sit in separate loops. I should also say that with only basic optimizations, it is probably easier to get good performance out of NDRange than out of Single Work-item. On the other hand, the NDRange model will never let you maximize the potential of the FPGA: it cannot infer shift registers, it cannot resolve dependencies other than by relying on barriers, its operating frequency is limited by Block RAM double-pumping, it gives the user no control over the number of simultaneous work-groups, etc.
Have you checked the report to make sure your memory accesses are actually coalesced at compile time? In the "System Viewer" tab, you can clearly see that the ports to memory get wider when coalescing happens correctly. Also note that you MUST use SIMD in NDRange or loop unrolling in Single Work-item to enable compile-time coalescing; without these, there is no actual parallelism in the design and no memory accesses will be coalesced (there is no run-time coalescing).
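To make the SIMD/unrolling point concrete, here is a minimal sketch of the same vector addition written both ways. The kernel names, the factor of 16, the work-group size of 64, and the small host-side shim (there only so the file also compiles as plain C for a quick functional check) are assumptions chosen for illustration, not something taken from your code. Whether the accesses really end up coalesced still has to be confirmed in the report.

```c
/* Sketch only: names and factors below are illustrative assumptions. */
#ifndef __OPENCL_VERSION__
/* Plain-C shim (assumption): lets the kernels be compiled and called on the
 * host for a functional check. An OpenCL compiler skips this part. */
#include <stddef.h>
#define __kernel
#define __global
static size_t host_gid;  /* stands in for the work-item's global ID */
static size_t get_global_id(unsigned dim) { (void)dim; return host_gid; }
#endif

/* NDRange: SIMD of 16 asks the compiler to process 16 work-items in lockstep,
 * so 16 consecutive floats can be merged into one wide access.
 * The required work-group size (64 here) must be divisible by the SIMD factor. */
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(16)))
__kernel void vadd_ndrange(__global const float *restrict a,
                           __global const float *restrict b,
                           __global float *restrict c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}

/* Single Work-item: unrolling the loop by 16 exposes 16 consecutive accesses
 * per iteration, which the compiler can likewise merge at compile time. */
__kernel void vadd_swi(__global const float *restrict a,
                       __global const float *restrict b,
                       __global float *restrict c,
                       int n)
{
    #pragma unroll 16
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

If you drop the SIMD attribute or the unroll pragma, each kernel still computes the same result, but the ports to memory in the "System Viewer" should stay at the width of a single float.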
The dynamic thread scheduling of NDRange kernels is preferred over the static scheduling of Single Work-item kernels if the design is not pipelineable, since the former can potentially achieve a lower average initiation interval (II). Other than that, if it is possible to achieve an II of one in a Single Work-item kernel, I don't see why the NDRange equivalent would be faster at all, let alone "much faster". If you post some of your code examples (both NDRange and Single Work-item), I might be able to tell you why the NDRange version is faster and how you can possibly fix the Single Work-item equivalent.
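As one illustration of what "fixing" a Single Work-item kernel can look like, below is a minimal sketch of a floating-point sum where the loop-carried dependency on the accumulator would normally prevent an II of one. Spreading the accumulation over a shift register is a common way to resolve that dependency, and it is exactly the kind of structure NDRange cannot infer. The kernel name and the `II_CYCLES` value of 8 are hypothetical; the right depth depends on the adder latency the compiler reports for your board. The `#ifndef` shim is again only there so the sketch compiles as plain C.

```c
/* Sketch only: kernel name and II_CYCLES are illustrative assumptions. */
#ifndef __OPENCL_VERSION__
/* Plain-C shim (assumption) for a host-side functional check. */
#define __kernel
#define __global
#endif

/* Assumed number of cycles the float adder needs; hypothetical value. */
#define II_CYCLES 8

__kernel void sum_swi(__global const float *restrict in,
                      __global float *restrict out,
                      int n)
{
    /* One extra slot holds the newest partial sum before the shift. */
    float shift_reg[II_CYCLES + 1];

    #pragma unroll
    for (int i = 0; i < II_CYCLES + 1; i++)
        shift_reg[i] = 0.0f;

    for (int i = 0; i < n; i++) {
        /* The value read here was written II_CYCLES iterations ago, so the
         * adder result is no longer needed in the very next iteration. */
        shift_reg[II_CYCLES] = shift_reg[0] + in[i];

        /* Fully unrolled shift with constant indices, so the array can be
         * implemented as a shift register. */
        #pragma unroll
        for (int j = 0; j < II_CYCLES; j++)
            shift_reg[j] = shift_reg[j + 1];
    }

    /* Final reduction of the partial sums held in the register. */
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < II_CYCLES; i++)
        sum += shift_reg[i];

    *out = sum;
}
```

The result is the same as a plain accumulation loop up to floating-point reordering; the report should show whether the main loop actually reaches an II of one with the chosen depth.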
Since this topic requires a lot of discussion and I have already written a whole thesis chapter on it, I will just attach the relevant chapters of my thesis instead of putting everything here. Chapter 3 includes a performance model and an in-depth discussion of the differences between the two programming models, and of when and why one should be preferred over the other. Chapter 4 includes multiple benchmarks developed and optimized in both NDRange and Single Work-item and compared with respect to performance, along with a discussion of why the performance differences exist.