Forum Discussion
Altera_Forum
Honored Contributor
7 years ago

Thanks for your reply.
Actually, my inner loop performs a large, consecutive memory access and, as reported by the optimizer, it is pipelined well with II=1. I put #pragma unroll 4 on the outer loop (not the inner one, as you did), hoping to get 4 parallel accesses through 4 memory ports, because the outer loop iterations are independent of each other (no read-after-write dependency).

Area increased by a factor of 4, in both logic and BRAMs (I think the BRAMs are used as a cache for global memory), so I guess 4 memory ports were replicated. But performance does not change. Do you have any guess why? My guess is that the memory accesses are somehow performed serially, not in parallel.

    #pragma unroll 4
    for (unsigned i = 0; i < 4000000; i++) {
        acc = 0.0;
        si = start_index;
        ei = end_index;
        for (unsigned j = si; j < ei; ++j)  // pipelined with II=1
            acc += value[j];                // target memory access
        value_next[i] = acc;
    }
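To make the intent of the outer-loop unroll concrete, here is a minimal host-side C sketch of what a 4-way unroll means logically: four outer iterations advance together, each with its own accumulator and its own read stream from `value`. It is plain C, not kernel code, and it does not claim to show what the OpenCL compiler actually generates or how it schedules the loads. The function names `row_sums_ref` and `row_sums_unroll4` are hypothetical, and indexing `start_index[i]` / `end_index[i]` per outer iteration is an assumption (the loop in the post assigns them as scalars).

```c
#include <assert.h>

/* Reference: one outer iteration at a time, as in the original loop. */
void row_sums_ref(const float *value, const unsigned *start_index,
                  const unsigned *end_index, float *value_next, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        float acc = 0.0f;
        for (unsigned j = start_index[i]; j < end_index[i]; ++j)
            acc += value[j];
        value_next[i] = acc;
    }
}

/* Logical view of a 4-way outer unroll: four lanes run in lockstep,
 * so each step issues up to four reads of value[] at unrelated addresses.
 * Assumption: n is a multiple of 4. */
void row_sums_unroll4(const float *value, const unsigned *start_index,
                      const unsigned *end_index, float *value_next, unsigned n)
{
    assert(n % 4 == 0);
    for (unsigned i = 0; i < n; i += 4) {
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        unsigned len[4];
        unsigned max_len = 0;

        /* Trip counts differ per lane; the longest lane bounds the loop. */
        for (unsigned k = 0; k < 4; k++) {
            unsigned s = start_index[i + k], e = end_index[i + k];
            len[k] = (e > s) ? (e - s) : 0;
            if (len[k] > max_len)
                max_len = len[k];
        }

        for (unsigned step = 0; step < max_len; step++) {
            for (unsigned k = 0; k < 4; k++) {
                if (step < len[k])  /* lane k still has elements to read */
                    acc[k] += value[start_index[i + k] + step];
            }
        }

        for (unsigned k = 0; k < 4; k++)
            value_next[i + k] = acc[k];
    }
}
```

Each lane is consecutive on its own, but the four lanes start at different offsets in `value`, so together they form four separate read streams rather than one wide consecutive access. If those streams end up sharing the same global memory bandwidth, replicating the loop body would increase area without necessarily increasing throughput, which would be consistent with the behaviour described in the post. That is a guess, not something this sketch can confirm.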