In one sense, that seems about right. The kernel takes one thread at a time: it takes the 1,024,000 work items and passes them through the kernel one by one in a pipelined manner. 40 ms doesn't seem to be around the expected time with those kernels. There are some things that can be changed that might improve efficiency, but in my opinion the main limiting factor right now is the access to global memory rather than the computation in the kernels.
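As a rough way to check this, here is a back-of-envelope sketch that compares a compute-bound estimate (one work item entering the pipeline per clock cycle) with a memory-bound estimate. The clock frequency, pipeline depth, bytes per work item, and bandwidth are all placeholder assumptions of mine, not values from your design; substitute the numbers from your own compilation report and board.

```python
# Back-of-envelope timing estimate for a pipelined kernel.
# All numeric inputs below are ASSUMED placeholders, not measured values.

def pipelined_time_s(work_items, clock_hz, pipeline_depth=0):
    """Time if one work item enters the pipeline every clock cycle.

    pipeline_depth is the (assumed) number of cycles needed to fill
    the pipeline before the first result comes out.
    """
    return (work_items + pipeline_depth) / clock_hz


def memory_time_s(work_items, bytes_per_item, bandwidth_bytes_per_s):
    """Time if the kernel is limited purely by global memory traffic."""
    return work_items * bytes_per_item / bandwidth_bytes_per_s


if __name__ == "__main__":
    n = 1_024_000      # work items, from the discussion above
    f_clk = 200e6      # ASSUMED kernel clock: 200 MHz
    depth = 100        # ASSUMED pipeline depth in cycles
    bytes_item = 8     # ASSUMED global memory traffic per work item
    bandwidth = 6.4e9  # ASSUMED effective bandwidth in bytes/s

    compute = pipelined_time_s(n, f_clk, depth)
    memory = memory_time_s(n, bytes_item, bandwidth)
    print(f"compute-bound estimate: {compute * 1e3:.4f} ms")
    print(f"memory-bound estimate:  {memory * 1e3:.4f} ms")
    # The larger of the two is the rough lower bound on kernel time.
    print(f"lower bound:            {max(compute, memory) * 1e3:.4f} ms")
```

If the measured time is far above both estimates, that would point to stalls (for example inefficient global memory access patterns) rather than to the arithmetic in the kernel.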
Which Altera reduction example are you referring to?
In terms of private memory, my take is that there is no real per-work-item limit; the limit on private memory is the limit of the device itself. That means you can use as much private memory as you want, as long as the total does not exceed the on-chip memory of your FPGA.