By making a few changes, such as using std_logic_vectors with fewer bits where possible and preferring ranges that are multiples of 2, while still using two processes, I managed to increase the Fmax dramatically.
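    if w_en = '1' then
        -- Note: step is not referenced in the body, so the outer loop
        -- simply repeats the same assignments.
        for step in 1 downto 0 loop
            -- Stage 1: the low 7 bits of prev_index pick the slice offset.
            for i in DATA_WIDTH/2-1 downto 0 loop
                -- Comparing an integer with a std_logic_vector relies on a
                -- package that defines it (e.g. numeric_std_unsigned).
                if i = prev_index(6 downto 0) then
                    temp_reg1 <= joint(i + DATA_WIDTH downto i + 1) after 1 ns;
                    temp_reg2 <= joint(i + DATA_WIDTH*3/2 downto i + DATA_WIDTH/2 + 1) after 1 ns;
                end if;
            end loop;
        end loop;
        -- Stage 2: the top bit of prev_index picks the lower or upper candidate.
        if prev_index(7) = '0' then
            output_reg <= temp_reg1 after 1 ns;
        else
            output_reg <= temp_reg2 after 1 ns;
        end if;
    end if;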
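
For reference, here is a minimal self-contained sketch of the same two-stage selection. Everything not in the fragment above is an assumption made only so that it elaborates: the entity name shift_select, the ports, DATA_WIDTH = 256 (so that prev_index is 8 bits wide and joint is 512 bits wide), and the single clocked process. It also shows one possible variation: the hypothetical register sel_d delays prev_index(7) by one enabled cycle, so that the final mux uses the select bit belonging to the same prev_index that produced temp_reg1 and temp_reg2. The unused step loop and the "after 1 ns" delays are left out here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper: names and widths are assumptions, not the original design.
entity shift_select is
  port (
    clk        : in  std_logic;
    w_en       : in  std_logic;
    prev_index : in  std_logic_vector(7 downto 0);
    joint      : in  std_logic_vector(511 downto 0);
    output_reg : out std_logic_vector(255 downto 0)
  );
end entity shift_select;

architecture rtl of shift_select is
  constant DATA_WIDTH : natural := 256;  -- assumed value
  signal temp_reg1 : std_logic_vector(DATA_WIDTH - 1 downto 0);
  signal temp_reg2 : std_logic_vector(DATA_WIDTH - 1 downto 0);
  signal sel_d     : std_logic;  -- prev_index(7), delayed to line up with temp_reg1/2
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if w_en = '1' then
        -- Stage 1: the low 7 bits select the slice offset for both candidates.
        for i in DATA_WIDTH/2 - 1 downto 0 loop
          if i = to_integer(unsigned(prev_index(6 downto 0))) then
            temp_reg1 <= joint(i + DATA_WIDTH downto i + 1);
            temp_reg2 <= joint(i + DATA_WIDTH*3/2 downto i + DATA_WIDTH/2 + 1);
          end if;
        end loop;
        -- Register the top bit alongside the candidates it belongs to.
        sel_d <= prev_index(7);

        -- Stage 2: choose between the candidates captured on the previous
        -- enabled clock edge, using the select bit captured with them.
        if sel_d = '0' then
          output_reg <= temp_reg1;
        else
          output_reg <= temp_reg2;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

In this sketch output_reg lags prev_index by two enabled clock edges. Whether the delayed select bit is needed depends on whether prev_index can change between consecutive enabled cycles, which the fragment alone does not show.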
I hope that this pipelining is appropriate. Could anyone suggest a correction or improvement?