This page does a pretty good job of explaining tap numbering, including whether tap 1 or tap 20 counts as a tap:
http://www.walterg.uklinux.net/prbs.htm
I recommend the "Galois-Field Arithmetic" implementation since it yields faster hardware (fewer XOR terms in front of the shift register).
I think the issue with your implementation is the way you are feeding the taps back into the shift register. I just finished coding an LFSR to generate pseudorandom values, and here is the structure I'm using (assuming you know Verilog):
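// LFSR input data for all the pipeline stages
// Note: the bit indices below are reconstructed from the description that follows
// (assumption: pipeline[0] is the output stage and the register shifts toward bit 0).
genvar d;
generate
  for (d = 0; d < (DATA_WIDTH-1); d = d + 1) // highest-order pipeline input is generated outside this loop
  begin: pipeline_taps
    assign pipeline_input[d] = pipeline[d+1] ^ (pipeline[0] & polynomial[d]);
  end
endgenerate
// highest-order stage takes the output stage directly
assign pipeline_input[DATA_WIDTH-1] = pipeline[0];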
I take the pipeline_input bus and use it as the input to the pipeline register. The polynomial bus holds the polynomial's terms without the +1. Each bit of that bus is ANDed with the output stage of the shift register, and the resulting bus is then bitwise XORed with the values of the shift register.
I think you could do this in VHDL by replicating the pipeline output stage 20 times and then performing the bus-wide AND and XOR operations.
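If it helps to check the HDL in simulation, here is a rough software model of the two forms described above: the per-bit structure and the bus-wide replicate/AND/XOR variant. It is only a sketch under assumptions that are mine, not confirmed anywhere above: bit 0 is the output stage, the register shifts toward bit 0, and the function names are made up for illustration. Python is used purely as a reference model.

```python
def step_per_bit(state, polynomial, width):
    """One clock of the LFSR, built stage by stage (assumed indexing)."""
    out_bit = state & 1  # output stage of the shift register (assumed to be bit 0)
    nxt = 0
    for d in range(width - 1):
        tap = (polynomial >> d) & 1
        # next stage d = stage d+1, XORed with the output bit where a tap is set
        bit = ((state >> (d + 1)) & 1) ^ (out_bit & tap)
        nxt |= bit << d
    nxt |= out_bit << (width - 1)  # highest-order stage takes the output bit directly
    return nxt


def step_bus_wide(state, polynomial, width):
    """Same update with bus-wide operations: replicate, AND, XOR."""
    out_bit = state & 1
    replicated = (1 << width) - 1 if out_bit else 0  # output bit copied `width` times
    mask = polynomial | (1 << (width - 1))  # add the feedback into the top stage
    return (state >> 1) ^ (replicated & mask)


def period(step, seed, polynomial, width):
    """Count clocks until the register returns to the (nonzero) seed."""
    state, count = step(seed, polynomial, width), 1
    while state != seed:
        state = step(state, polynomial, width)
        count += 1
    return count


if __name__ == "__main__":
    # 4-bit example: x^4 + x^3 + 1, with only the x^3 term on the polynomial bus
    print(period(step_per_bit, 1, 0b0100, 4))   # 15
    print(period(step_bus_wide, 1, 0b0100, 4))  # 15
```

For a 20-bit register, the commonly tabulated polynomial x^20 + x^17 + 1 would correspond to `polynomial = 1 << 16` under this indexing. Comparing the HDL's register values clock by clock against a model like this is a quick way to catch a feedback path wired to the wrong stage.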