I think HDL is a very simple solution for that.
This is a 1024-bit shift register loop in Verilog HDL:
reg [1023:0] myShiftBigReg;
always @ ( posedge clk )
myShiftBigReg <= { myShiftBigReg[1022:0], myShiftBigReg[1023] };
If you need some embedded multipliers, then you can add something like this:
reg [35:0] MyMulReg001;
always @ ( posedge clk )
MyMulReg001 <= myShiftBigReg[17:0] * myShiftBigReg[17:0];
This would add one 18x18 multiplier.
reg [71:0] MyMulReg002;
always @ ( posedge clk )
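MyMulReg002 <= MyMulReg001 * MyMulReg001;
This 36x36 multiply would add more 18x18 multipliers (typically four, though the exact count depends on the device and the synthesis tool).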
Of course, you could also feed some bits of the shift register into an embedded memory block as the address, so the address changes on every clock.
You can then stream the output of these memories into the multipliers.
This is very easily done in HDL, I guess.
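
A rough sketch of that memory idea, in case it helps. The module name, port names, memory size (512 x 18), the seed value, and the choice of which shift register bits are used are all my assumptions, not something from a specific design. Whether the array actually lands in embedded memory, and whether the initial blocks are honored, depends on the device and the synthesis tool.

```verilog
// Sketch only: names, sizes, and the seed value are assumptions.
module shift_mem_mul (
    input  wire        clk,
    output reg  [35:0] product
);
    // 1024-bit rotating shift register, as in the snippet above.
    // Without an initial value or reset it stays X in simulation,
    // so an arbitrary seed is loaded here.
    reg [1023:0] myShiftBigReg = {32{32'hDEADBEEF}};

    always @(posedge clk)
        myShiftBigReg <= { myShiftBigReg[1022:0], myShiftBigReg[1023] };

    // 512 x 18-bit memory with a registered (synchronous) read,
    // the pattern synthesis tools typically map to block RAM.
    reg [17:0] mem [0:511];
    reg [17:0] memOut;

    integer i;
    initial
        for (i = 0; i < 512; i = i + 1)
            mem[i] = i[17:0];   // placeholder contents

    // The low 9 bits of the shift register are the read address,
    // so the address changes on every clock.
    always @(posedge clk)
        memOut <= mem[myShiftBigReg[8:0]];

    // Stream the memory output into one 18x18 multiplier.
    always @(posedge clk)
        product <= memOut * myShiftBigReg[26:9];
endmodule
```

The seed matters for more than simulation: if every bit of the register were the same, the rotate would never change anything and the tool could optimize the whole thing away.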