Forum Discussion

Altera_Forum
Honored Contributor
12 years ago

How does workgroup size impact kernel performance?

Hello all,

I have implemented a design with one compute unit pipeline on my Nallatech PCIe_385nA7 board. I want to experiment with how the workgroup size impacts performance. My first idea was that increasing the workgroup size using "__attribute__((reqd_work_group_size(WKG_HOR_SIZE, WKG_VER_SIZE, 1)))" would increase performance, because work-items are mapped onto the device with the granularity of a workgroup. For example, if I had 50 workgroups, each of them would be mapped sequentially onto the compute unit.
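For context, a minimal kernel sketch using this attribute (the kernel name and arguments here are hypothetical, not my actual kernel; this is OpenCL C device code compiled by the offline compiler, not a standalone program):

```c
// Hypothetical 2D copy kernel showing the required work-group size attribute.
// WKG_HOR_SIZE and WKG_VER_SIZE would be defined at compile time (e.g. via -D).
__attribute__((reqd_work_group_size(WKG_HOR_SIZE, WKG_VER_SIZE, 1)))
__kernel void copy2d(__global const float *in, __global float *out, int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    out[y * width + x] = in[y * width + x];
}
```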

That is to say:

- for a workgroup size of 1x1 in an NDRange of 20x20, 400 workgroups will be mapped sequentially. Low performance, because the pipeline is almost certainly not filled.

- for a workgroup size of 10x10 in an NDRange of 20x20, 4 workgroups will be mapped sequentially. Good performance, because the pipeline is fuller than with 1x1.
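The workgroup counts in the two cases above follow from dividing the NDRange extents by the workgroup extents; a minimal sketch of that arithmetic:

```python
import math

def num_workgroups(global_size, local_size):
    """Workgroups launched for an NDRange: product over dimensions of
    ceil(global extent / local extent)."""
    return math.prod(math.ceil(g / l) for g, l in zip(global_size, local_size))

print(num_workgroups((20, 20), (1, 1)))    # 400 workgroups
print(num_workgroups((20, 20), (10, 10)))  # 4 workgroups
```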

I validated my reasoning on an AMD GPU. But on the ALTERA FPGA, performance does not increase when I increase the workgroup size:

Wkg 1x1 --> 33 ms

Wkg 64x4 --> 33 ms

Wkg 256x4 --> 38 ms

Can someone tell me what is wrong with my reasoning, please? Or is it an ALTERA OpenCL issue?

Thanks !

2 Replies

  • Altera_Forum
    Honored Contributor

    From the Altera SDK for OpenCL Optimization Guide:

    "The compiler implements each compute unit as a pipeline. Generally, each kernel compute unit can run multiple simultaneous work-groups (depending on the latency of the pipeline and the number of work-items present in a work-group). For example, a pipeline that is 1024 clock cycles deep can accommodate four entire work-groups of 256 work-items each. At a given point in execution, four or five work-groups are present in the pipeline, with earlier work-items further along in their processing than later ones."

    GPUs perform better with work-group sizes that are multiples of their SIMD width, usually 64, 48, or 32 depending on the vendor and model. If you use fewer than that, you waste clock cycles by underutilizing the GPU's native instruction width. However, an FPGA compute unit is a deep pipeline, so as long as you have enough work to keep the pipeline full, you won't waste clock cycles. Also, if you use barriers in your kernel, smaller work-group sizes may yield better results, because the latency between the first and last work-item within the same work-group is lower. There may also be other factors at play with your memory load/store operations, so it may take some experimenting to find the optimal balance.
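    The occupancy arithmetic from the quoted Optimization Guide passage can be sketched as follows (the 1024-cycle depth and 256-item workgroup are the example values from the quote, not measured figures):

```python
def workgroups_in_flight(pipeline_depth, work_group_size):
    """Full workgroups simultaneously resident in a compute-unit pipeline:
    roughly the pipeline depth in cycles (one work-item entering per cycle)
    divided by the number of work-items per workgroup."""
    return pipeline_depth // work_group_size

# Example from the guide: a 1024-cycle-deep pipeline holds four entire
# 256-item workgroups at once.
print(workgroups_in_flight(1024, 256))  # 4
# With 1x1 workgroups, 1024 single-item workgroups fill the same pipeline,
# which is why tiny workgroups need not starve an FPGA pipeline.
print(workgroups_in_flight(1024, 1))    # 1024
```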
  • Altera_Forum
    Honored Contributor

    On a GPU the workgroup size often dictates the occupancy of the compute units. With FPGAs, the workgroup size typically just dictates how many resources are needed to build things like the barrier logic. Remember that on the FPGA the compute unit is a flexible piece of hardware tailored to your kernel, so it is not a fixed resource on which you are trying to achieve 100% occupancy. For algorithms that map well to an FPGA, the high occupancy rate "just happens" because the compiler tailors the compute unit to your kernel.