Question about M20K block packing
I am using Intel Quartus Prime 21.1, targeting the Stratix 10 MX 2100 device.
I have several read/write Avalon Memory-Mapped interfaces from a Load/Store unit that are connected to True Dual-Port RAMs. Because I am using double-buffering, each interface is connected to two such RAMs through a simple demultiplexer interconnect that sits between the Load/Store unit and the RAMs. The RAMs are implemented with the "On-Chip Memory (RAM or ROM) IP" from Platform Designer; each RAM has 32-bit wide ports and a size of 16384 bytes. For now, my design contains only 28 such RAMs.
Since M20K blocks with 32-bit wide ports are configured in the 512x32 operating mode, a total of 8 M20K blocks is needed to implement each RAM (16384 bytes = 4096 32-bit words, and 4096 / 512 = 8). This results in 80% utilization of the available block memory bits, as 100% utilization would require the 512x40 operating mode.
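To make the numbers concrete, here is a quick back-of-the-envelope check of the counts above (a sketch assuming a 20480-bit M20K block, i.e. 512x40 in its widest mode, and a 512-word depth in 512x32 mode):

```python
M20K_BITS = 20480      # total bits per M20K block (512 x 40)
M20K_DEPTH_32 = 512    # depth when configured in 512x32 mode

ram_bytes = 16384
port_width = 32        # bits
num_rams = 28

words = ram_bytes * 8 // port_width            # 4096 words per RAM
m20k_per_ram = -(-words // M20K_DEPTH_32)      # ceiling division -> 8 blocks
used_bits = words * port_width                 # 131072 bits actually stored
utilization = used_bits / (m20k_per_ram * M20K_BITS)

print(m20k_per_ram)               # 8 blocks per RAM
print(round(utilization, 2))      # 0.8 -> the 80% figure
print(m20k_per_ram * num_rams)    # 224 blocks for all 28 RAMs
```

So the naive per-RAM mapping costs 224 M20Ks in total, which is the budget the fitter's packing optimization was able to shrink before the pipeline stage was added.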
Nonetheless, the compiler is able to optimize the M20K packing: it allocates 8 M20Ks for some of the RAMs but fewer for the others, boosting overall block memory bit utilization to roughly 99%.
However, if I add a pipeline stage for the interface signals between the Load/Store unit and the RAMs (more specifically, between the Load/Store unit and the demultiplexer interconnect), the compiler uses 8 M20Ks for all RAMs, dropping block memory bit utilization back down to 80%.
My assumption is that the fitter does this in order to improve timing.
I tried forcing the synthesis setting for the maximum number of M20K blocks to the value achieved before I added the pipeline stage, but the setting gets ignored by the Fitter.
Do you know of a way to control this packing and guide the compiler to always maximize M20K block memory bit utilization? Saving this significant number of M20Ks would greatly help me fit my final design.