I am not getting any messages from the compiler about an Fmax bottleneck in your second code; however, in my environment the compiler now fails to unroll the loops over the channel operations due to "conditional channel execution". It looks like my suggestion made things worse instead of better.
Is there any reason why you are trying to avoid passing the out-of-bound data via channels? Since W is small, the overhead of passing the few extra values will be extremely small. I suggest that you send the extra data through the channels anyway to avoid the "conditional channel execution", and simply skip processing it on the consumer side. You already have the same bounds condition in the second loop in the consumer kernel, and that loop is correctly unrolled and pipelined; this should be enough to generate correct output even if you do send the out-of-bound data through the channels.
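To make the idea concrete, here is a rough plain-C model of the pattern I have in mind. This is not your kernel code and not OpenCL: the per-lane channels are modeled as a small array, and every name in it (`W` set to 4, `lane`, `produce`, `consume`, `run`, the doubling used as the "processing") is a placeholder I made up for illustration. The only point is that the channel transfers happen on every iteration with no condition around them, and the bounds check guards only the processing.

```c
#define W 4  /* hypothetical channel width; use your own value of W */

/* Stand-in for W channels: one slot per lane (assumption, not real channel code). */
static float lane[W];

/* Producer side: every lane is written on every call.
 * Out-of-bound lanes carry a dummy value instead of skipping the write,
 * so the "channel write" itself is never conditional. */
static void produce(const float *in, int base, int n) {
    for (int i = 0; i < W; i++) {
        int idx = base + i;
        lane[i] = (idx < n) ? in[idx] : 0.0f;
    }
}

/* Consumer side: every lane is read unconditionally in the first loop;
 * the bounds condition appears only in the second (processing) loop. */
static void consume(float *out, int base, int n) {
    float tmp[W];
    for (int i = 0; i < W; i++) {
        tmp[i] = lane[i];              /* unconditional "channel read" */
    }
    for (int i = 0; i < W; i++) {
        int idx = base + i;
        if (idx < n) {
            out[idx] = 2.0f * tmp[i];  /* placeholder processing */
        }
    }
}

/* Drives both sides over n elements and returns how many lane transfers
 * were made, including the out-of-bound padding. */
int run(const float *in, float *out, int n) {
    int transfers = 0;
    for (int base = 0; base < n; base += W) {
        produce(in, base, n);
        consume(out, base, n);
        transfers += W;
    }
    return transfers;
}
```

With n = 6 and W = 4 this model makes 8 transfers instead of 6, so the extra traffic is at most W - 1 values in total, and no element of `out` at index n or beyond is written. In the real kernels the two inner loops would be the ones you want the compiler to unroll.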