NDRange Kernels Global Memory Write Pattern

Honored Contributor

8 years ago

1) With vectorization, all the control logic is shared, which minimizes the extra resources required, so more work-items can go into the pipeline in parallel. The datapath is still pipelined: if you vectorize to, say, 16, then 16 work-items enter the pipeline each clock cycle and 16 work-items come out each clock cycle. If the pipeline is not vectorized, work-items enter one at a time, and you get a constant throughput of 1 work-item per clock cycle, assuming no stalls in the pipeline.
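The throughput claim above can be put into a rough cycle count. This is only a sketch of an idealized stall-free pipeline, not a model of what the compiler actually generates: the function name `pipeline_cycles` and the `depth` parameter (pipeline latency in cycles) are assumptions made for illustration, and real kernels are usually limited by memory bandwidth well before they reach these numbers.

```c
#include <stddef.h>

/*
 * Idealized cycle count for a stall-free pipeline.
 *   n_items    : total number of work-items to process
 *   simd_width : work-items entering the pipeline per clock cycle
 *                (1 = not vectorized, 16 = vectorized to 16)
 *   depth      : hypothetical pipeline latency in clock cycles
 *
 * One batch of simd_width work-items is issued per cycle. The last
 * batch is issued on cycle (batches - 1) and leaves depth cycles later,
 * so the total is depth + batches - 1.
 */
unsigned long pipeline_cycles(unsigned long n_items,
                              unsigned long simd_width,
                              unsigned long depth)
{
    if (n_items == 0 || simd_width == 0) {
        return 0; /* nothing to do, or an invalid width */
    }
    /* Round up: a partially filled final batch still takes one issue slot. */
    unsigned long batches = (n_items + simd_width - 1) / simd_width;
    return depth + batches - 1;
}
```

With an assumed depth of 100 cycles, 1024 work-items take 1123 cycles unvectorized and 163 cycles when vectorized to 16. The pipeline latency is paid once in both cases; only the number of issue cycles shrinks.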

2) No, the barrier works just as you say. All work-items in the work-group pause at the barrier, which is usually placed before a memory operation. The hardware is implemented to handle this. Remember, you're creating FPGA hardware that is completely customized to the way you write the code, so if you use a barrier and then perform a memory operation, the compiler selects load/store units (LSUs) that handle the access in the most efficient way possible (coalescing). For that reason, in most cases you also want to specify a required or maximum work-group size (for example with the reqd_work_group_size or max_work_group_size kernel attributes), since this tells the compiler how to organize the LSUs to be most efficient.
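To make the barrier behaviour concrete, here is a small host-side emulation in plain C of one work-group. It is a sketch under assumptions, not code from this thread: the kernel shape (stage a value in local memory, barrier, then write to global memory), the name `run_work_group`, and the work-group size of 8 are all hypothetical. The barrier is emulated by splitting the work-item loop in two, which is a common way to reason about it: no work-item starts phase 2 until every work-item has finished phase 1.

```c
#include <stddef.h>

/* Hypothetical fixed work-group size, standing in for a required
 * work-group size declared on the kernel. */
#define WG_SIZE 8

/*
 * Emulates one work-group of a kernel of roughly this form:
 *
 *     local_buf[lid] = in[gid];
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[gid] = local_buf[lid] + local_buf[(lid + 1) % WG_SIZE];
 */
void run_work_group(const int *in, int *out, size_t group_id)
{
    int local_buf[WG_SIZE];              /* stands in for __local memory */
    size_t base = group_id * WG_SIZE;    /* first global id of this group */

    /* Phase 1: every work-item stages its own element. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid) {
        local_buf[lid] = in[base + lid];
    }

    /* Barrier: the loop boundary guarantees phase 1 is complete for
     * all work-items before any of them reads a neighbour's value. */

    /* Phase 2: work-item lid writes global address base + lid, so the
     * group's writes cover consecutive addresses. This is the access
     * pattern that can be coalesced into wide memory transactions. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid) {
        out[base + lid] = local_buf[lid] + local_buf[(lid + 1) % WG_SIZE];
    }
}
```

Because the work-group size is a compile-time constant here, the extent of each group's write (WG_SIZE consecutive elements) is known up front. That is the same kind of information a required or maximum work-group size gives the FPGA compiler when it sizes the LSUs.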
