It seems your code was originally written to run on a standard CPU, so not every construct in it is suitable for FPGA acceleration. There are many opportunities to improve it:
- Starting from the top function, I can see that you are processing 1024 data points while only writing 512 points back. This wastes a significant number of compute cycles and a significant amount of FPGA area. You should modify the code to compute only what you are going to write back to external memory and later read on the host.
- You are unrolling the read and write loops in the top function, which is the correct thing to do to achieve compile-time access coalescing. However, the unroll factor is far too large (512). Supporting such large accesses wastes a significant amount of FPGA resources, especially Block RAMs. The external memory bandwidth of the FPGA will be saturated with one 512-bit read and one 512-bit write per loop iteration (in the case of two DDR memory banks and an II of one). For the 32-bit "int" datatype this translates to an unroll factor of 16 (512 / 32). What you should consider doing is restructuring your code so that you are reading, processing and writing back 16 points per loop iteration. If the FPGA then turns out to be overutilized, you can reduce the number of parallel points to fit the design.
- There is excessive use of function calls in your code. Every function call will be implemented individually as a circuit on the FPGA, resulting in excessive use of FPGA resources. This is similar to the case of a fully-unrolled loop. Furthermore, such calls prevent the compiler from correctly reporting the area usage per kernel line in the HTML report (as is evident in your report, where "No Source Line" is occupying half the area), which in turn makes performance debugging very difficult. You should avoid function calls as much as possible; try to loop over the function bodies instead, and partially or fully unroll those loops based on the available area.
- The way "ibfly4_16" is currently written is very inefficient on FPGAs (loop inside of a branch). Since the loops on both sides of the branch over "type" are the same, you should instead use one loop and move the branch inside of the loop. Furthermore, using the "out = (condition) ? in_1 : in_2;" construct rather than if/else could lead to area savings in some cases.
- The main problem in your code seems to stem from the cpack_16_64 function, which cannot achieve an II of one due to a dependency on "x", resulting in the depth of all the buffers in the loop being increased by the II. Since the function is instantiated multiple times, this leads to huge area waste. I think the dependency exists because you are reading from the x[i+1] point and then overwriting it. If you can split "x" into two buffers and write to x1[i] and x2[i] instead, you might be able to avoid this problem. Of course, this will require significant code rewriting, which will likely propagate all the way to the top function.
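To make the last point more concrete, here is a minimal sketch in plain C of what I mean by splitting "x" into two buffers. I do not know exactly what cpack_16_64 computes, so the add/subtract body, the buffer size N and the names pack_inplace, pack_split and split_matches_inplace are all hypothetical stand-ins; only the access pattern (read x[i] and x[i+1], then overwrite them in the same buffer, versus accessing x1[k] and x2[k] separately) is meant to mirror your code. Whether this actually lets the compiler reach an II of one in your kernel is something you will have to confirm in the report.

```c
#include <assert.h>

#define N 16 /* hypothetical buffer size, not taken from the original kernel */

/* Stand-in for the current pattern: each iteration reads x[i] and x[i+1]
 * and then overwrites both locations in the same buffer. */
static void pack_inplace(int x[N]) {
    for (int i = 0; i < N; i += 2) {
        int a = x[i];
        int b = x[i + 1];
        x[i]     = a + b; /* placeholder arithmetic, not the real cpack body */
        x[i + 1] = a - b;
    }
}

/* Split version: even-indexed points live in x1, odd-indexed points in x2.
 * Each buffer is only touched at index k in iteration k. */
static void pack_split(int x1[N / 2], int x2[N / 2]) {
    for (int k = 0; k < N / 2; k++) {
        int a = x1[k];
        int b = x2[k];
        x1[k] = a + b;
        x2[k] = a - b;
    }
}

/* Checks that both versions produce the same values; returns 1 on success. */
int split_matches_inplace(void) {
    int x[N], x1[N / 2], x2[N / 2];
    for (int i = 0; i < N; i++) {
        x[i] = 3 * i + 1;
    }
    for (int k = 0; k < N / 2; k++) {
        x1[k] = x[2 * k];     /* even points */
        x2[k] = x[2 * k + 1]; /* odd points */
    }
    pack_inplace(x);
    pack_split(x1, x2);
    for (int k = 0; k < N / 2; k++) {
        assert(x[2 * k] == x1[k]);
        assert(x[2 * k + 1] == x2[k]);
    }
    return 1;
}
```

The cost of this change is that every caller has to produce and consume the two half-size buffers, which is why I expect the rewrite to propagate up to the top function.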
There are probably other things that can be done to improve the code but I cannot find and list them all since the code is relatively large. You can try converting each function to a separate kernel manually and then compile them one by one and optimize each separately based on the information you get from the report and then put them back in the original kernel.
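As a rough illustration of the "ibfly4_16" point, the following plain C sketch shows the same computation written first with the loop duplicated inside the branch and then with a single loop and the branch moved inside using the "?:" construct. The butterfly body, the length LEN and the function names are hypothetical, since I am only reproducing the structure and not your actual arithmetic.

```c
#include <assert.h>

#define LEN 16 /* hypothetical number of points */

/* Current structure: the same loop appears on both sides of the branch. */
static void bfly_branch_outside(const int in[LEN], int out[LEN], int type) {
    if (type == 0) {
        for (int i = 0; i < LEN; i++) {
            out[i] = in[i] + in[i ^ 1];
        }
    } else {
        for (int i = 0; i < LEN; i++) {
            out[i] = in[i] - in[i ^ 1];
        }
    }
}

/* Suggested structure: one loop, with the selection done per point.
 * In an OpenCL kernel this single loop is also the natural place to
 * apply "#pragma unroll" with a factor chosen to fit the available area. */
static void bfly_branch_inside(const int in[LEN], int out[LEN], int type) {
    for (int i = 0; i < LEN; i++) {
        int a = in[i];
        int b = in[i ^ 1]; /* neighbour within the same pair of points */
        out[i] = (type == 0) ? (a + b) : (a - b);
    }
}

/* Checks that both forms agree for both values of type; returns 1 on success. */
int branch_forms_match(void) {
    int in[LEN], out_a[LEN], out_b[LEN];
    for (int i = 0; i < LEN; i++) {
        in[i] = 5 * i - 7;
    }
    for (int type = 0; type < 2; type++) {
        bfly_branch_outside(in, out_a, type);
        bfly_branch_inside(in, out_b, type);
        for (int i = 0; i < LEN; i++) {
            assert(out_a[i] == out_b[i]);
        }
    }
    return 1;
}
```

With the single-loop form there is only one loop to pipeline or unroll, and both results are computed next to a selection rather than sitting behind two separate copies of the loop.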