Forum Discussion
Altera_Forum
Honored Contributor
10 years ago

So the work-group size specifies how many work-items each work-group handles. It's essentially a way to partition the work-items you need to process, not necessarily a change to the kernel hardware itself. By partitioning your work-items into work-groups, the work-items within a group can communicate with one another through local memory that is shared between them.
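As a rough sketch of what that sharing looks like in OpenCL C (the kernel name and tile size are my own illustration; `reqd_work_group_size`, `__local`, and `barrier` are standard OpenCL):

```
// Hypothetical sketch: each work-group stages its slice of the input in
// local memory so the work-items in that group can read each other's data.
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void shift_add(__global const float *in, __global float *out)
{
    __local float tile[64];          // shared by the 64 work-items in this group
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = in[gid];             // each work-item loads one element
    barrier(CLK_LOCAL_MEM_FENCE);    // wait until the whole tile is loaded

    // Now any work-item can read its neighbour's element from local memory
    // instead of going back out to global memory.
    out[gid] = tile[lid] + tile[(lid + 1) % 64];
}
```

Without the work-group partition there would be no `__local` region scoped to a set of work-items, and every access would have to go through global memory.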
The ways to speed it up that I am aware of are the ones you mentioned: increasing the number of compute units or specifying the number of SIMD work-items. The thing you have to realize is that all the data comes from global memory and is accessed one element at a time. Depending on the application, a kernel can be either compute-bound or memory-bound. Since your kernel is a simple vector add, it easily becomes memory-bound: the computation is trivial, so the hardware can produce results faster than it can fetch the operands.

EDIT: One thing I also want to add is that you can try experimenting with loop unrolling. Loop unrolling (as long as the iterations are data-independent) essentially creates multiple instances of the computation inside the for loop. However, realize that this can hurt memory-access efficiency, since each of these computation instances requires its own load and store operations to global memory.
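For reference, in the Altera OpenCL compiler these knobs are expressed as kernel attributes and a pragma. A vector-add sketch combining them (the attribute names and `#pragma unroll` come from the Altera SDK; the kernel bodies are just illustrations):

```
// Illustrative NDRange vector add: replicate the pipeline (compute units)
// and widen it (SIMD lanes).
__kernel
__attribute__((num_compute_units(2)))           // two copies of the kernel pipeline
__attribute__((num_simd_work_items(4)))         // process 4 work-items per cycle
__attribute__((reqd_work_group_size(64, 1, 1))) // SIMD requires a fixed group size
void vec_add(__global const float *restrict a,
             __global const float *restrict b,
             __global float *restrict c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

// Loop-unrolled variant: the compiler builds four adders, but note that
// it also needs four concurrent loads/stores to global memory per cycle,
// which is exactly where a trivial kernel like this becomes memory-bound.
__kernel void vec_add_unrolled(__global const float *restrict a,
                               __global const float *restrict b,
                               __global float *restrict c,
                               int n)
{
    #pragma unroll 4
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

In both cases the speedup only materializes if the memory interface can actually feed the extra hardware, which for a pure vector add is usually the limiting factor.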