Forum Discussion
The new image you have posted looks completely different from what I got from compiling your original kernel. In the new case, the reason for the slow-down is not the difference in latency, but rather the fact that now, instead of one read and one write port going to external memory, you have one read and 9 writes, all of which will be competing with each other to obtain access to the memory bus. This will result in a very high amount of contention and very frequent stalls in memory accesses which will get propagated all the way down to the pipeline. If this is one of those cases that the compiler is failing to coalesce the accesses, even though they are consecutive, then, yes, the compiler is making a mistake here (I reported one such case to Altera long ago). If not, you should modify your kernel to minimize the number of write ports.
Unless your input is so small that the pipeline is not filled before execution finishes, the "latency" of the pipeline will not have a noticeable effect on run time.