Forum Discussion
Altera_Forum
Honored Contributor
10 years ago

So the work-group size specifies how many work-items each work-group handles. It's essentially a way to partition the work-items you need to process, not necessarily a change to the kernel hardware itself. By partitioning your work-items into work-groups, the work-items within a group can communicate with one another through local memory that is shared between them.
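As a rough sketch of what that sharing looks like in OpenCL C (the kernel name and tile size are my own illustration; `reqd_work_group_size`, `__local`, and `barrier` are standard OpenCL):

```
// Hypothetical sketch: each work-group stages its slice of the input in
// local memory so the work-items in that group can read each other's data.
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void shift_add(__global const float *in, __global float *out)
{
    __local float tile[64];          // shared by the 64 work-items in this group
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = in[gid];             // each work-item loads one element
    barrier(CLK_LOCAL_MEM_FENCE);    // wait until the whole tile is loaded

    // Now any work-item can read its neighbour's element from local memory
    // instead of going back out to global memory.
    out[gid] = tile[lid] + tile[(lid + 1) % 64];
}
```

Without the work-group partition there would be no `__local` region scoped to a set of work-items, and every access would have to go through global memory.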
The ways to speed it up that I am aware of are the ones you mentioned: increasing the number of compute units or specifying the number of SIMD work-items. The thing you have to realize is that all the data comes from global memory and is accessed one element at a time. Depending on the application, a kernel can be either compute-bound or memory-bound. Since your kernel is a simple vector add, it easily becomes memory-bound: the computation is trivial, so the hardware can produce results faster than it can fetch the operands.

EDIT: One thing I also want to add is that you can try experimenting with loop unrolling. Loop unrolling (as long as the iterations are data-independent) essentially creates multiple instances of the computation inside the for loop. However, realize that this can hurt memory-access efficiency, since each of these computation instances requires its own load and store operations to global memory.
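For reference, in the Altera OpenCL compiler these knobs are expressed as kernel attributes and a pragma. A vector-add sketch combining them (the attribute names and `#pragma unroll` come from the Altera SDK; the kernel bodies are just illustrations):

```
// Illustrative NDRange vector add: replicate the pipeline (compute units)
// and widen it (SIMD lanes).
__kernel
__attribute__((num_compute_units(2)))           // two copies of the kernel pipeline
__attribute__((num_simd_work_items(4)))         // process 4 work-items per cycle
__attribute__((reqd_work_group_size(64, 1, 1))) // SIMD requires a fixed group size
void vec_add(__global const float *restrict a,
             __global const float *restrict b,
             __global float *restrict c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

// Loop-unrolled variant: the compiler builds four adders, but note that
// it also needs four concurrent loads/stores to global memory per cycle,
// which is exactly where a trivial kernel like this becomes memory-bound.
__kernel void vec_add_unrolled(__global const float *restrict a,
                               __global const float *restrict b,
                               __global float *restrict c,
                               int n)
{
    #pragma unroll 4
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

In both cases the speedup only materializes if the memory interface can actually feed the extra hardware, which for a pure vector add is usually the limiting factor.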