Forum Discussion
Altera_Forum
Honored Contributor
8 years ago

Work-group parallelism will only be guaranteed if you use num_compute_units to replicate your kernel pipeline. With a single compute unit there can still be a limited degree of work-group pipelining, depending on how many barriers the kernel contains, but there will be no guaranteed parallelism.
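As a sketch of the replication approach (kernel name and body are made up for illustration), the Intel/Altera OpenCL attribute is applied directly to the kernel:

```
// Hypothetical kernel: replicate the full pipeline 4 times so that
// up to 4 work-groups can execute in parallel, one per compute unit.
__attribute__((num_compute_units(4)))
__kernel void vec_scale(__global const float *restrict in,
                        __global float *restrict out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * 2.0f;  // work-groups are distributed across the copies
}
```

Note that each compute unit is a full copy of the pipeline, so area usage grows roughly linearly with the replication factor.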
If you are fine with a local size of one, i.e. you do not use any local-memory-based optimizations, you might as well use the single work-item kernel type and simply wrap your computation in a for loop from 0 to global_work_size - 1. At least in that case you will get guaranteed pipelining, with an initiation interval that depends on the loop-carried dependencies and is reported by the compiler.

I do not have a good enough understanding of the inner workings of the scheduler to tell you why your run time is increasing linearly with global_work_size. Assuming that all work-items from all work-groups are pipelined one after another with a small initiation interval, run time will increase, but not linearly. I would expect a linear increase in run time only if execution were fully sequential.
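The conversion described above can be sketched as follows (again with made-up names; the loop bound `n` stands in for what was global_work_size):

```
// Hypothetical single work-item version of the same computation:
// no get_global_id(), just a loop the compiler can pipeline.
// The compiler reports the achieved initiation interval (II) for the loop.
__kernel void vec_scale_swi(__global const float *restrict in,
                            __global float *restrict out,
                            unsigned int n)  // former global_work_size
{
    for (unsigned int i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}
```

A kernel like this is launched as a task (e.g. clEnqueueTask, or an NDRange of size 1), and with an II of 1 the run time is roughly n cycles plus the pipeline depth, rather than n times the full latency.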