Altera_Forum
Honored Contributor
8 years ago
1) With vectorization, all the control logic is shared, which minimizes the extra resources required, and more work-items enter the pipeline in parallel. The kernel is still pipelined, so if you vectorize to, say, 16, then 16 work-items go in each clock cycle and 16 come out each clock cycle. Without vectorization, work-items enter the pipeline one at a time, and you get a constant throughput of 1 work-item per clock cycle, assuming no stalls in the pipeline.
2) No, the barrier works just as you say. All work-items in a work-group pause at the barrier, which is usually placed before a memory operation. The hardware is implemented to handle this. Remember, you're creating FPGA hardware that is completely customized to the way you write the code, so if you use a barrier and then perform a memory operation, the compiler selects LSUs (load-store units) that handle it as efficiently as possible, for example by coalescing accesses. That is why, in most cases, you also want to specify a required or maximum work-group size (the reqd_work_group_size or max_work_group_size kernel attribute): it tells the compiler how to organize the LSUs most efficiently.
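
To put rough numbers on point 1, here is a minimal C sketch of an idealized, stall-free pipeline. It is only a back-of-the-envelope model, not anything the compiler reports: the function name pipeline_cycles and the depth parameter are made up for illustration, and the real pipeline latency depends on the hardware generated for your kernel.

```c
#include <assert.h>

/*
 * Idealized cycle count for a pipeline that never stalls.
 *
 *   n     - total number of work-items issued
 *   width - work-items accepted per clock cycle (1 = not vectorized)
 *   depth - pipeline latency in cycles (hypothetical; chosen by the
 *           compiler in practice)
 *
 * The first group of work-items needs `depth` cycles to come out the
 * other end; after that, one more group completes every cycle.
 */
static unsigned long pipeline_cycles(unsigned long n,
                                     unsigned long width,
                                     unsigned long depth)
{
    assert(width > 0);            /* guard the division below */

    if (n == 0) {
        return 0;                 /* nothing issued, nothing to wait for */
    }

    /* Number of cycles in which work-items are issued: ceil(n / width). */
    unsigned long issue_cycles = (n + width - 1) / width;

    return depth + issue_cycles - 1;
}
```

For 4096 work-items and an assumed depth of 100 cycles, pipeline_cycles(4096, 1, 100) gives 4195 cycles, while pipeline_cycles(4096, 16, 100) gives 355. Once the pipeline is full, the cycle count is dominated by the number of work-items divided by the vector width, which is the "16 in, 16 out each clock cycle" behavior described above.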