I'm not too familiar with what you're trying to do, but if you're just trying to find the total sum of two vectors, splitting the work across two small kernels the way you've structured it is probably not the most efficient approach. One kernel is generally more efficient than two small ones. You can structure that one kernel the same way as the example, except instead of local..+= a..., you do local += a[i] + b[i]. With everything in a single loop, you can then do other tweaking such as loop unrolling, SIMD, etc.
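To make the single-loop idea concrete, here is a rough sketch in plain C rather than actual kernel code. The function names `sum_two_passes` and `sum_fused` are made up for illustration, and I'm assuming `float` data. In a real kernel the pointers would live in global memory and the loop would sit inside the kernel body, but the accumulation pattern is the same.

```c
#include <stddef.h>

/* Baseline that mirrors the two-kernel split: one pass per vector.
 * (Hypothetical helper, only here for comparison.) */
float sum_two_passes(const float *a, const float *b, size_t n)
{
    float sum_a = 0.0f;
    float sum_b = 0.0f;

    for (size_t i = 0; i < n; ++i)
        sum_a += a[i];
    for (size_t i = 0; i < n; ++i)
        sum_b += b[i];

    return sum_a + sum_b;
}

/* Fused version: a single pass that reads both vectors and keeps
 * one running accumulator, i.e. local += a[i] + b[i]. */
float sum_fused(const float *a, const float *b, size_t n)
{
    float local = 0.0f;

    for (size_t i = 0; i < n; ++i)
        local += a[i] + b[i];

    return local;
}
```

Calling `sum_fused(a, b, n)` gives the combined total in one pass. Keep in mind that floating-point addition is not associative, so for arbitrary data the two versions can differ slightly in the last bits.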
Furthermore, if efficiency is what you care about for this application, you have to take the reads and writes to global memory into account. Since the summation is all you're doing, it might be faster on the CPU than on the FPGA. If you use the FPGA, the total time includes not only the calculation itself but also writing the data from the CPU into the FPGA's global memory, reading that data from global memory into the compute units, and writing the result back to the CPU.
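A back-of-the-envelope way to reason about this is to add up those costs and compare the total against a CPU-only run. The sketch below is only a hypothetical model: the struct `OffloadCost` and the function names are invented for illustration, the timings are values you would have to measure yourself, and it assumes the stages run one after another with no overlap.

```c
/* Hypothetical per-stage timings, in seconds, for one FPGA offload.
 * All values are expected to come from your own measurements. */
typedef struct {
    double host_to_device_s;  /* CPU writes inputs to FPGA global memory */
    double global_read_s;     /* compute units read inputs from global memory */
    double compute_s;         /* the summation itself */
    double device_to_host_s;  /* result is written back to the CPU */
} OffloadCost;

/* Time to move `bytes` over a link with the given sustained bandwidth. */
double transfer_seconds(double bytes, double bytes_per_second)
{
    return bytes / bytes_per_second;
}

/* Simplifying assumption: stages are strictly sequential, so the
 * total is just the sum of the parts. */
double offload_total_seconds(OffloadCost c)
{
    return c.host_to_device_s + c.global_read_s
         + c.compute_s + c.device_to_host_s;
}

/* Returns 1 if the modelled FPGA time beats the measured CPU time. */
int offload_is_worthwhile(OffloadCost c, double cpu_seconds)
{
    return offload_total_seconds(c) < cpu_seconds;
}
```

For a plain sum, the compute term tends to be small next to the transfer terms, which is why the CPU can come out ahead. If your runtime overlaps transfers with computation, this sequential model will overestimate the FPGA time.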