1) You will need a 1024-stage delay pipe for the input stream. Building this pipe from fabric registers will deplete FPGA resources (1024 x 16 bits = 16,384 registers). So that is a problem to start with, but the pipe can be implemented in RAM blocks or in ALUT-based RAM instead. A RAM-based shifter is also possible, but it has restrictions on tap distance; however, you can set it to the minimum tap distance, e.g. 3 stages, and run your stream at 3 times the input rate.
2) The coeffs will also need RAM; they can be updated there and read back (dual-port RAM). I don't see any reason now to split the outputs into 8 sections; just add up the whole sum of products in one accumulator.
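To make points 1) and 2) concrete, here is a rough behavioural model in Python (not HDL). It assumes the sample RAM is addressed as a circular buffer, so nothing is physically shifted, and that a single multiply-accumulate runs once per fast clock. The names RamFir, update_coeff and step are made up for illustration only.

```python
class RamFir:
    """Behavioural model: sample RAM used as a circular buffer, one accumulator."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)       # stands in for the coefficient RAM
        self.n = len(self.coeffs)
        self.samples = [0] * self.n      # stands in for the sample (delay) RAM
        self.wr = 0                      # write pointer replaces physical shifting

    def update_coeff(self, index, value):
        """Write one coefficient, as the second RAM port would."""
        self.coeffs[index] = value

    def step(self, x):
        """Store one input sample and return one output sample."""
        self.samples[self.wr] = x        # newest sample overwrites the oldest
        acc = 0
        addr = self.wr
        for k in range(self.n):          # one multiply-accumulate per fast clock
            acc += self.coeffs[k] * self.samples[addr]
            addr = (addr - 1) % self.n   # step back to the next older sample
        self.wr = (self.wr + 1) % self.n
        return acc


fir = RamFir([1, 2, 3])
impulse_response = [fir.step(x) for x in [1, 0, 0, 0]]   # [1, 2, 3, 0]
```

In hardware the inner loop becomes an address counter plus one multiplier and one accumulator, which only works if the fast clock gives you enough cycles per input sample.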
Whether you can add up all of the products or only part of them in one accumulator depends on your processing speed vs. the input sample rate.
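As a quick back-of-envelope check of that trade-off, the sketch below estimates how many parallel multiply-accumulate sections you would need. The clock and sample rates used are hypothetical placeholders, not figures from your design, and mac_units_needed is just an illustrative name.

```python
def mac_units_needed(taps, sample_rate_hz, clock_hz):
    """Number of parallel MAC sections needed to finish all taps per input sample."""
    cycles_per_sample = clock_hz // sample_rate_hz   # fast clocks available per sample
    if cycles_per_sample < 1:
        raise ValueError("clock must be at least as fast as the sample rate")
    # Integer ceiling of taps / cycles_per_sample
    return -(-taps // cycles_per_sample)


# Hypothetical rates, for illustration only:
print(mac_units_needed(1024, 100_000, 200_000_000))    # 2000 cycles/sample -> 1
print(mac_units_needed(1024, 1_000_000, 128_000_000))  # 128 cycles/sample -> 8
```

If the result is 1, a single accumulator covers the whole sum of products; anything larger means splitting the taps into that many sections and adding the partial sums.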
Regarding my note on updating coeffs to avoid a glitchy output: actually I wouldn't worry about it, since the coeffs need to be designed with a gentle gradient anyway, or else a glitch will occur.