It has been said that the inner loop is executed 163*163*653 times (about 17e6); I didn't check this in detail. You can translate the code to C and run it sequentially on a microprocessor, and this number then lets you estimate the execution time. It takes a considerable amount of time, but it is obviously feasible. The present FPGA code, in contrast, forces 17e6 instances of the innermost-loop logic to be generated, which isn't feasible with any existing FPGA.
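As a rough sketch of that estimate, the following C helpers compute the iteration count from the loop bounds quoted above and turn it into a run time. The function names, the cycles-per-iteration figure and the clock frequency are assumptions for illustration only; they are not taken from the original code.

```c
#include <stdint.h>

/* Loop bounds as quoted in the thread: 163 * 163 * 653 inner-loop passes. */
#define N_OUTER  163u
#define N_MIDDLE 163u
#define N_INNER  653u

/* Total number of times the innermost loop body runs. */
uint64_t inner_loop_iterations(void)
{
    return (uint64_t)N_OUTER * N_MIDDLE * N_INNER;
}

/* Estimated sequential run time in seconds.
 * cycles_per_iteration and clock_hz are assumed values that you have to
 * measure or look up for your own processor and loop body. */
double estimated_seconds(uint64_t iterations,
                         double cycles_per_iteration,
                         double clock_hz)
{
    return (double)iterations * cycles_per_iteration / clock_hz;
}
```

With an assumed 100 cycles per pass on a 100 MHz processor, `estimated_seconds(inner_loop_iterations(), 100.0, 100e6)` comes out at roughly 17.3 seconds.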
So you may want to find a way to execute a block of the code in parallel and repeat this block sequentially. This is often done with problems like the present one, and it allows complex numerical problems to be solved faster. But the problem has to be analyzed thoroughly to arrive at a meaningful solution.
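The block-parallel idea can be modelled in C as below. This is only a software sketch: `LANES`, `lane_op` and the multiply-accumulate body are hypothetical stand-ins, since the real innermost-loop operation isn't shown here. In hardware, the inner loop over lanes would correspond to replicated logic working in the same clock cycle, and the outer loop to repeating that block cycle after cycle.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8u  /* hypothetical number of parallel hardware units */

/* Sequential steps needed when `lanes` iterations are handled per step
 * (ceiling division). */
uint64_t sequential_steps(uint64_t total_iterations, uint64_t lanes)
{
    return (total_iterations + lanes - 1u) / lanes;
}

/* Placeholder for the innermost-loop body; the real operation is unknown. */
static int64_t lane_op(int32_t a, int32_t b)
{
    return (int64_t)a * b;
}

/* Process the data in blocks of LANES elements. */
int64_t blocked_accumulate(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t base = 0; base < n; base += LANES) {    /* sequential in time */
        for (size_t lane = 0; lane < LANES; ++lane) {   /* parallel in hardware */
            size_t i = base + lane;
            if (i < n) {                                /* last block may be partial */
                acc += lane_op(a[i], b[i]);
            }
        }
    }
    return acc;
}
```

With 8 lanes, the roughly 17e6 iterations shrink to `sequential_steps(17349557, 8)` sequential steps, at the cost of 8 copies of the loop-body logic. Whether the iterations are actually independent enough to run side by side is exactly what the analysis has to show.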
To discuss the problem from an HDL coding point of view: you can't use a clock in a function. You have to rewrite the code completely to introduce sequential processing.
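To give an idea of what sequential processing means structurally, here is a C model of the three nested loops turned into a small state machine with registered loop counters. Each call to the tick function stands for one clock edge, which is roughly the role a clocked process would play in the HDL rewrite. All names (`loop_fsm`, `loop_fsm_tick`, and so on) are made up for this sketch, and it is a behavioural model, not synthesizable code.

```c
#include <stdbool.h>
#include <stdint.h>

/* State that would live in registers in the hardware version. */
typedef struct {
    uint32_t i, j, k;      /* current loop indices */
    uint32_t ni, nj, nk;   /* loop bounds */
    bool done;             /* set once all index combinations were visited */
} loop_fsm;

void loop_fsm_reset(loop_fsm *s, uint32_t ni, uint32_t nj, uint32_t nk)
{
    s->i = 0; s->j = 0; s->k = 0;
    s->ni = ni; s->nj = nj; s->nk = nk;
    s->done = (ni == 0 || nj == 0 || nk == 0);
}

/* One call models one clock edge: handle (i, j, k), then advance. */
void loop_fsm_tick(loop_fsm *s)
{
    if (s->done) {
        return;
    }
    /* The innermost-loop body would be evaluated here for (i, j, k). */
    if (s->k + 1 < s->nk) {
        s->k++;
    } else {
        s->k = 0;
        if (s->j + 1 < s->nj) {
            s->j++;
        } else {
            s->j = 0;
            if (s->i + 1 < s->ni) {
                s->i++;
            } else {
                s->done = true;
            }
        }
    }
}

/* Run until done and return the number of ticks used. */
uint64_t loop_fsm_run(loop_fsm *s)
{
    uint64_t ticks = 0;
    while (!s->done) {
        loop_fsm_tick(s);
        ticks++;
    }
    return ticks;
}
```

In this model one instance of the loop-body logic is reused on every tick, so the hardware cost stays constant while the number of ticks equals the product of the three loop bounds.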