Latency of accesses to multi-ported on-chip buffers is not one cycle; hence, the compiler has to further replicate the buffer that is accessed in the loop so that loop iterations in-flight in the pipeline can access different copies of the same buffer in parallel, resulting in correct full-pipelining and an initiation interval of one. If the "#pragma max_concurrency" I mentioned above does not reduce this replication factor (e.g.# pragma max_concurrency 2), then using# pragma II might (e.g.# pragma II 3). Note that all of these come at the cost of lower performance, probably MUCH lower performance.