The problem is that you are trying to run the whole algorithm in a single clock cycle. If you know that you won't get valid data on every clock cycle, you can split your code into two or more blocks and execute only one of the blocks on each clock cycle.
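To make the idea of splitting the work concrete, here is a rough cycle-level model in Python (not HDL, just a sketch to show the scheduling). I'm assuming, purely for illustration, that the algorithm inserts a value into a sorted bank of registers. The class `TwoPhaseInserter`, its `tick()` method and the compare/shift split are all made up for this example, so adapt them to what your code really does.

```python
class TwoPhaseInserter:
    """Hypothetical model: each call to tick() stands for one clock edge."""

    def __init__(self, depth):
        self.regs = [None] * depth  # sorted register bank, None = empty slot
        self.phase = 0              # 0 = wait/compare, 1 = shift/write
        self.held = None            # input latched during phase 0
        self.pos = 0                # slot index found during phase 0

    def tick(self, valid=False, data=None):
        if self.phase == 0:
            if valid:
                # Block 1: latch the input and find its slot (compares only).
                self.held = data
                self.pos = sum(1 for r in self.regs
                               if r is not None and r <= data)
                self.phase = 1
        else:
            # Block 2: shift the larger values down, then write the new one.
            # If the bank is full and the value is the largest, it is dropped.
            if self.pos < len(self.regs):
                self.regs[self.pos + 1:] = self.regs[self.pos:-1]
                self.regs[self.pos] = self.held
            self.phase = 0


# Data is valid on every other cycle, so each value gets two ticks.
unit = TwoPhaseInserter(depth=4)
for value in [5, 2, 9]:
    unit.tick(valid=True, data=value)  # cycle 1: compare
    unit.tick()                        # cycle 2: shift and write
print(unit.regs)  # [2, 5, 9, None]
```

Each block only has to fit in one clock period, at the cost of accepting a new value only every second cycle.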
If new data can arrive on every cycle, then you will need pipelining, which means a major rewrite of the code. One way to do this would be to "propagate" each incoming value through your registers, pushing it one level on each clock cycle until it reaches its correct place, and then propagate the existing data, pushing those values one level down on each clock cycle. I'm not sure what I'm saying is very clear. If you don't see what I mean, tell me and I'll draw a schematic of what I have in mind.
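In the meantime, here is a rough Python model of the propagation idea (again a software sketch, not HDL). It assumes the registers are meant to hold values in ascending order, which is my guess at what your algorithm does. The names `PipelinedSorter`, `kept` and `inflight` are invented for the example. Each stage does a single compare per clock: it keeps the smaller of its stored value and the arriving one, and passes the other to the next stage.

```python
class PipelinedSorter:
    """Hypothetical model: each call to tick() stands for one clock edge."""

    def __init__(self, depth):
        self.kept = [None] * depth      # value stored in each stage
        self.inflight = [None] * depth  # value each stage passes downward

    def tick(self, data=None):
        # Every stage sees only the state from before the clock edge.
        inputs = [data] + self.inflight[:-1]
        next_kept, next_inflight = [], []
        for held, x in zip(self.kept, inputs):
            if x is None:        # nothing arriving at this stage
                next_kept.append(held)
                next_inflight.append(None)
            elif held is None:   # empty stage: the value settles here
                next_kept.append(x)
                next_inflight.append(None)
            elif x < held:       # newcomer takes the slot, old value moves on
                next_kept.append(x)
                next_inflight.append(held)
            else:                # newcomer keeps travelling down
                next_kept.append(held)
                next_inflight.append(x)
        # Whatever the last stage passes on falls off the end of the chain.
        self.kept, self.inflight = next_kept, next_inflight


sorter = PipelinedSorter(depth=4)
for value in [5, 2, 9, 1]:
    sorter.tick(value)   # one new value on every clock cycle
for _ in range(4):
    sorter.tick()        # idle cycles so values still moving can settle
print(sorter.kept)  # [1, 2, 5, 9]
```

The point is that no stage ever does more than one compare and one register update per cycle, so the chain accepts a new value on every clock. The price is that the contents are only fully ordered a few cycles after the last input.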