As far as I know, each loop nest in a kernel is implemented as a separate pipeline. If the compiler does not detect any dependency between two such pipelines, it might reorder or parallelize them. However, this should not happen in your case, since there is a data dependency between the loop nests: the compiler must guarantee that each loop nest is completely processed and its pipeline is flushed before the next one starts. I am not sure what is causing the problem; it could well be a compiler bug.
Regarding run time, you should probably first calculate the total amount of data your code transfers between the FPGA and its external memory, and divide it by the FPGA's external memory bandwidth. This gives you a lower bound on run time, which is the same thing as an upper bound on the performance you can achieve. If this minimum run time is higher (worse) than your goal, then your goal is unachievable. If your goal is higher (worse) than the bound, then the further it is from the bound, the more likely you are to achieve it. Of course, there is never any guarantee that you will be able to reach this bound in practice.
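The estimate above can be written down as a quick back-of-the-envelope calculation. This is only a sketch: the function names `min_runtime_seconds` and `goal_is_feasible` are made up for illustration, and the data volume and bandwidth figures are placeholders, so substitute the numbers for your own kernel and board.

```python
def min_runtime_seconds(bytes_transferred: float, bandwidth_bytes_per_s: float) -> float:
    """Best-case run time if external memory bandwidth were the only limit."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return bytes_transferred / bandwidth_bytes_per_s


def goal_is_feasible(goal_seconds: float,
                     bytes_transferred: float,
                     bandwidth_bytes_per_s: float) -> bool:
    """A goal below the bandwidth-limited run time cannot be met."""
    return goal_seconds >= min_runtime_seconds(bytes_transferred, bandwidth_bytes_per_s)


if __name__ == "__main__":
    # Placeholder numbers (assumptions, not taken from any real board or kernel):
    total_bytes = 2 * 1024**3   # e.g. 1 GiB read plus 1 GiB written
    peak_bandwidth = 25.6e9     # hypothetical peak external memory bandwidth, bytes/s
    goal = 0.5                  # hypothetical run-time target, seconds

    bound = min_runtime_seconds(total_bytes, peak_bandwidth)
    print(f"bandwidth-limited run time: {bound:.4f} s")
    print(f"goal of {goal} s feasible in principle: "
          f"{goal_is_feasible(goal, total_bytes, peak_bandwidth)}")
```

Remember to count every read and write to external memory, including any intermediate buffers passed between loop nests. The result only tells you whether the goal is possible in principle, not whether the kernel will actually reach it.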