--- Quote Start ---
Considering the very low memory bandwidth on current FPGA boards and the very high overhead of contention for off-chip memory accesses, you should actually avoid parallel memory accesses. Instead, unroll your memory accesses so that they are coalesced into wider ones, which minimizes the number of ports to external memory. These ports, as you have noticed, consume a lot of area on the FPGA.
In your code example, you are unrolling the i loop, but the memory accesses are not contiguous along the i dimension, so you get 64 memory ports. Apart from the very high area usage, this will lower your effective memory bandwidth to near zero because of constant contention between all those ports. However, if you partially unroll the j loop by a factor of 64, the accesses are contiguous, so you will get a few large coalesced ports with very low area overhead and come very close to the theoretical memory bandwidth.
--- Quote End ---
I totally agree with your suggestion. I tried it and it works.
Thanks.