Altera_Forum
Honored Contributor
7 years ago
There is no point in unrolling the outer loop. You already have many global memory accesses, and each access requires its own port to memory. Unrolling the outer loop results in 4 times more ports, and you will have 16 memory accesses competing with each other for the memory bus, resulting in extremely poor memory performance. Assuming that "acc" and "value" are floating-point, you should first apply the shift register optimization for floating-point accumulation as outlined in Intel's documents, and then unroll the inner loop to achieve higher performance.
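For reference, here is a rough sketch of the shift register accumulation pattern, written as plain C so it can be compiled and checked on a host. It is not the original kernel: the function name `accumulate_shift_register` and the depth `SHIFT_DEPTH` are made up for illustration, and the right depth depends on the latency of the floating-point adder on your device. In the actual OpenCL kernel, the two short loops over the shift register would carry `#pragma unroll`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed depth, for illustration only. On an FPGA this should be at
 * least the latency (in cycles) of the floating-point adder. */
#define SHIFT_DEPTH 8

/* Sums n floats using SHIFT_DEPTH rotating partial sums instead of a
 * single accumulator, so consecutive additions do not depend on each
 * other's result. */
float accumulate_shift_register(const float *data, size_t n)
{
    assert(data != NULL || n == 0);

    float shift_reg[SHIFT_DEPTH + 1];

    /* Clear the register. In the kernel: #pragma unroll */
    for (int j = 0; j < SHIFT_DEPTH + 1; j++) {
        shift_reg[j] = 0.0f;
    }

    for (size_t i = 0; i < n; i++) {
        /* Add to the oldest partial sum and write the result to the tail. */
        shift_reg[SHIFT_DEPTH] = shift_reg[0] + data[i];

        /* Shift everything down by one. In the kernel: #pragma unroll */
        for (int j = 0; j < SHIFT_DEPTH; j++) {
            shift_reg[j] = shift_reg[j + 1];
        }
    }

    /* Final reduction of the partial sums. In the kernel: #pragma unroll */
    float acc = 0.0f;
    for (int j = 0; j < SHIFT_DEPTH; j++) {
        acc += shift_reg[j];
    }
    return acc;
}
```

Calling `accumulate_shift_register(values, count)` gives the same total as a plain accumulation loop, up to floating-point rounding, because the additions happen in a different order. Once the accumulation no longer stalls the loop, unrolling the inner loop becomes worthwhile.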
Most FPGA boards only have two memory banks, which means that in each clock cycle you can perform at most two "parallel" memory accesses. You should actually avoid "parallel" memory accesses as much as you can and instead try to have a few large, coalesced accesses.
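To make "few but large coalesced accesses" a bit more concrete, here is a minimal sketch in plain C of the access pattern I mean. The names `sum_blocked` and `BLOCK_WIDTH` are hypothetical and the width of 16 floats is only an example. The idea is to read one block of consecutive elements per outer iteration; in an OpenCL kernel, fully unrolling the copy loop (with `#pragma unroll`) gives the compiler consecutive addresses that it can typically merge into a single wide read instead of many narrow ones.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative block width: 16 floats = 64 bytes per read. */
#define BLOCK_WIDTH 16

/* Sums n floats by reading them in blocks of consecutive elements. */
float sum_blocked(const float *data, size_t n)
{
    assert(data != NULL || n == 0);

    float total = 0.0f;
    size_t i = 0;

    for (; i + BLOCK_WIDTH <= n; i += BLOCK_WIDTH) {
        float block[BLOCK_WIDTH];

        /* One block of consecutive addresses. In the kernel this loop
         * would be fully unrolled so the reads can be coalesced. */
        for (int j = 0; j < BLOCK_WIDTH; j++) {
            block[j] = data[i + j];
        }

        /* Reduce the block from on-chip storage, not from global memory. */
        float partial = 0.0f;
        for (int j = 0; j < BLOCK_WIDTH; j++) {
            partial += block[j];
        }
        total += partial;
    }

    /* Leftover elements when n is not a multiple of BLOCK_WIDTH. */
    for (; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Note that `total += partial` is still a floating-point accumulation across iterations, so in a real kernel it would be combined with the shift register optimization rather than used on its own.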