Does the specification actually say "You MUST use two processes", or is it just that you can't get it to work as a single-process state machine? I would be very surprised if it were the former.
The problem, I assume, comes from the very long combinatorial path through index, which forms part of the write-enable input on the output_reg register. index is muxed based on a comparator, then muxed again and ANDed with w_en.
What you haven't said is what the target FMax is and what the actual FMax is. You may just have to accept that your design isn't good enough as it stands, or redesign it with more pipelining.
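
As a rough illustration of what "more pipelining" could look like here, this is a minimal sketch that registers index (and delays w_en and the data to match) before it reaches the write enable of output_reg. I haven't seen your full code, so the entity name, the count/limit comparator, the idx_a/idx_b mux inputs, data_in and all the widths are assumptions made up for the example; only index, w_en and output_reg come from your description.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: port names and widths are assumed, not taken from the original design.
entity index_pipeline is
  port (
    clk        : in  std_logic;
    w_en       : in  std_logic;
    count      : in  unsigned(7 downto 0);   -- assumed comparator input
    limit      : in  unsigned(7 downto 0);   -- assumed comparator input
    idx_a      : in  unsigned(2 downto 0);   -- assumed mux input
    idx_b      : in  unsigned(2 downto 0);   -- assumed mux input
    data_in    : in  std_logic;
    output_reg : out std_logic_vector(7 downto 0)
  );
end entity index_pipeline;

architecture rtl of index_pipeline is
  signal index_r : unsigned(2 downto 0) := (others => '0');
  signal w_en_r  : std_logic := '0';
  signal data_r  : std_logic := '0';
  signal out_r   : std_logic_vector(7 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: comparator and mux end in a register, which cuts the long path.
      if count < limit then
        index_r <= idx_a;
      else
        index_r <= idx_b;
      end if;
      -- Delay w_en and the data by one cycle so they stay aligned with index_r.
      w_en_r <= w_en;
      data_r <= data_in;

      -- Stage 2: the write enable now depends only on registered signals.
      if w_en_r = '1' then
        out_r(to_integer(index_r)) <= data_r;
      end if;
    end if;
  end process;

  output_reg <= out_r;
end architecture rtl;
```

The cost is one extra clock cycle of latency on every write, so anything that reads output_reg has to tolerate that. Whether this actually closes timing depends on the target FMax and on where the tool reports the critical path.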