Altera_Forum
8 years ago

In the specific case of vector-add, whether the kernel is NDRange or single work-item, the compiler will create one adder and three ports to global memory (two for reads and one for the write), plus some buffers between global memory and the kernel to absorb possible stalls, and some registers to allow pipelining. In total, 2N values will be read from global memory and N values will be written, at a rate of three values per clock (two read, one written). With only one addition completing per clock, performance will obviously be poor; hence, SIMD (for NDRange kernels) and unrolling (for single work-item kernels) can be used to increase the number of adders that are synthesized and to widen the ports to memory, so that more data is loaded and added per clock.
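
To make the effect of widening concrete, here is a rough plain-C model of the single work-item case. It is only a sketch, not real kernel code: the names `UNROLL`, `vector_add_unrolled` and `ideal_cycles` are made up for illustration, and the cycle estimate assumes one widened iteration issues per clock, ignoring pipeline fill latency and memory stalls. In an actual kernel you would normally not unroll by hand; you would request the same thing with `#pragma unroll` on the loop (single work-item) or the `num_simd_work_items` attribute (NDRange).

```c
#include <stddef.h>

/* Hypothetical unroll factor: how many adders work side by side. */
#define UNROLL 4

/*
 * Plain-C model of a single work-item vector-add with the loop unrolled
 * by UNROLL. Each outer iteration stands for one clock cycle in which
 * UNROLL pairs of inputs are read, added, and written back.
 */
void vector_add_unrolled(const float *restrict a,
                         const float *restrict b,
                         float *restrict c,
                         size_t n)
{
    size_t i = 0;

    /* Main body: UNROLL additions per iteration. */
    for (; i + UNROLL <= n; i += UNROLL) {
        for (size_t j = 0; j < UNROLL; j++) {
            c[i + j] = a[i + j] + b[i + j];
        }
    }

    /* Remainder: handles n that is not a multiple of UNROLL. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/*
 * Idealized cycle count for n additions with `width` adders,
 * assuming one widened iteration per clock (assumption: no stalls,
 * pipeline latency ignored). This is ceil(n / width).
 */
size_t ideal_cycles(size_t n, size_t width)
{
    return (n + width - 1) / width;
}
```

Under these assumptions, N = 1024 takes about 1024 clocks with a single adder and about 256 with a width of 4, while the number of values moved per clock grows from 3 to 12. In practice the gain stops once the widened ports saturate the available global memory bandwidth.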