Forum Discussion
Altera_Forum
Honored Contributor
8 years ago

Work-group parallelism will only be guaranteed if you use num_compute_units to replicate your kernel pipeline. With a single compute unit there can still be a limited degree of work-group pipelining, depending on how many barriers the kernel contains, but there will be no guaranteed parallelism.
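As a sketch of the replication approach (kernel name and body are made up for illustration), the Intel/Altera OpenCL attribute is applied directly to the kernel:

```
// Hypothetical kernel: replicate the full pipeline 4 times so that
// up to 4 work-groups can execute in parallel, one per compute unit.
__attribute__((num_compute_units(4)))
__kernel void vec_scale(__global const float *restrict in,
                        __global float *restrict out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * 2.0f;  // work-groups are distributed across the copies
}
```

Note that each compute unit is a full copy of the pipeline, so area usage grows roughly linearly with the replication factor.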
If you are fine with a local size of one, i.e. you do not use any local-memory-based optimizations, you might as well use the single work-item kernel type and simply wrap your computation in a for loop from 0 to global_work_size - 1. At least in that case you will get guaranteed pipelining, with an initiation interval that depends on the loop-carried dependencies and is reported by the compiler.

I do not have a good enough understanding of the inner workings of the scheduler to tell you why your run time is increasing linearly with global_work_size. Assuming that all work-items from all work-groups are pipelined one after another with a small initiation interval, run time will increase, but not linearly. I would expect a linear increase in run time only if execution were fully sequential.
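The conversion described above can be sketched as follows (again with made-up names; the loop bound `n` stands in for what was global_work_size):

```
// Hypothetical single work-item version of the same computation:
// no get_global_id(), just a loop the compiler can pipeline.
// The compiler reports the achieved initiation interval (II) for the loop.
__kernel void vec_scale_swi(__global const float *restrict in,
                            __global float *restrict out,
                            unsigned int n)  // former global_work_size
{
    for (unsigned int i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}
```

A kernel like this is launched as a task (e.g. clEnqueueTask, or an NDRange of size 1), and with an II of 1 the run time is roughly n cycles plus the pipeline depth, rather than n times the full latency.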