Altera_Forum
8 years ago
1) On the FPGA, by default, the compiler builds a single deep pipeline so that a new work-item can enter and a finished one can leave every clock cycle. Optionally, the pipeline can be vectorized to bring in several work-items at once, or the entire pipeline can be replicated to process different work-groups simultaneously.
2) It depends on the implementation. Without vectorization, one work-item enters and one leaves the pipeline every clock cycle. With vectorization, all 16 can be processed in parallel. The tradeoff is always performance versus FPGA resource use.
3) You would not want to do this. It is better to use a barrier to synchronize all the work-items and then write all the work-item data to global memory in one shot.
4) Yes. The compiler will coalesce memory accesses where it can. If it can't, the optimization report will indicate which implementation was selected and why coalescing could not be performed.
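To put rough numbers on the performance side of the tradeoff in answers 1 and 2, here is a minimal sketch in plain C of an idealized cycle-count model. It is not taken from the compiler or its reports. The function name `pipeline_cycles` and the parameters `depth`, `lanes`, and `units` are hypothetical, and the model assumes a pipeline that never stalls and work that divides evenly across replicated pipelines.

```c
#include <assert.h>

/* Idealized cycle count for pushing `items` work-items through a pipeline.
 *   depth : number of pipeline stages (hypothetical, design-dependent)
 *   lanes : work-items accepted per clock per pipeline (1 = not vectorized)
 *   units : number of replicated pipelines (1 = not replicated)
 * Assumes no memory stalls and perfectly balanced work across units. */
static unsigned long pipeline_cycles(unsigned long items, unsigned long depth,
                                     unsigned long lanes, unsigned long units)
{
    assert(lanes > 0 && units > 0);
    if (items == 0) {
        return 0;
    }
    /* Work-items that can enter the hardware on each clock cycle. */
    unsigned long per_cycle = lanes * units;
    /* Clock cycles needed just to issue every work-item (rounded up). */
    unsigned long issue_cycles = (items + per_cycle - 1) / per_cycle;
    /* The last group issued still needs `depth` cycles to drain. */
    return depth + issue_cycles - 1;
}
```

With an assumed depth of 100 stages, `pipeline_cycles(1024, 100, 1, 1)` gives 1123 cycles, while 16 lanes (`pipeline_cycles(1024, 100, 16, 1)`) gives 163. For only 16 work-items the same comparison is 115 versus 100 cycles, since the pipeline fill time dominates. The model says nothing about the resource cost, which grows with both `lanes` and `units`.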