Forum Discussion
There is really nothing wrong with the high latency of the burst-coalesced mode. In fact, it is very effective at hiding the high latency of external memory accesses and at preventing stalls from propagating all the way through the pipeline. What you need to do is use large inputs. If you have 10,000 inputs going through a pipeline with a latency of 159 clocks, the memory-latency-hiding effect of the deeper pipeline will far outweigh its higher warm-up time. Remember that even though the latency of the prefetching access is 2 cycles, the latency of an external memory access is always over 100 cycles, which means the prefetching access will be stalled most of the time, waiting for data to be read from/written to external memory.
There is no way to control the type of memory port the compiler infers; it is chosen automatically based on what the compiler thinks is best for the kernel. For NDRange kernels you can easily achieve replication using num_compute_units, as I mentioned before; you need a lot of work-groups running in parallel to use it efficiently, though. For single work-item kernels you can decouple your memory accesses from the compute, put them in separate kernels, and mark your compute kernel as "autorun", which allows you to replicate it easily using num_compute_units (the same attribute as for NDRange kernels, but used in a completely different manner) and to customize the replicas using a static ID supplied by the compiler. You still need to create separate parallel queues for your memory read/write kernels in this case, but you do not need any queues for the compute kernel(s). Check Altera's documentation for more info on autorun kernels and how to create and replicate them.
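A minimal sketch of that decoupled structure, assuming the Intel FPGA SDK for OpenCL channels extension; the kernel names, channel names, replica count, and the placeholder compute are all illustrative, and this only builds with aoc, not a host compiler:

```c
// Sketch only: requires the Intel FPGA SDK for OpenCL (aoc).
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define NUM_UNITS 4  // illustrative replica count

channel float in_ch[NUM_UNITS];
channel float out_ch[NUM_UNITS];

// Host-launched reader kernel: streams data from global memory
// into the channels (needs its own queue on the host).
__kernel void mem_read(__global const float* restrict src, int n) {
    for (int i = 0; i < n; i++)
        write_channel_intel(in_ch[i % NUM_UNITS], src[i]);
}

// Autorun compute kernel: never launched from the host, takes no
// arguments, and is replicated NUM_UNITS times; get_compute_id(0)
// supplies the static per-replica ID mentioned above.
__attribute__((autorun))
__attribute__((max_global_work_dim(0)))
__attribute__((num_compute_units(NUM_UNITS)))
__kernel void compute(void) {
    const int id = get_compute_id(0);
    while (1) {
        float v = read_channel_intel(in_ch[id]);
        write_channel_intel(out_ch[id], v * 2.0f);  // placeholder compute
    }
}

// Host-launched writer kernel: drains the channels back to global
// memory (also needs its own queue on the host).
__kernel void mem_write(__global float* restrict dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = read_channel_intel(out_ch[i % NUM_UNITS]);
}
```

Note that only mem_read and mem_write are enqueued from the host, each on its own queue; the autorun replicas start running as soon as the FPGA is programmed.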