Forum Discussion

Altera_Forum · Honored Contributor · 12 years ago

From the Altera SDK for OpenCL Optimization Guide:
"The compiler implements each compute unit as a pipeline. Generally, each kernel compute unit can run multiple simultaneous work-groups (depending on the latency of the pipeline and the number of work-items present in a work-group). For example, a pipeline that is 1024 clock cycles deep can accommodate four entire work-groups of 256 work-items each. At a given point in execution, four or five work-groups are present in the pipeline, with earlier work-items further along in their processing than later ones."

GPUs perform best with work-group sizes that are multiples of their native SIMD width, typically 64, 48, or 32 depending on vendor and model. If you use a smaller size, you waste clock cycles by underutilizing the GPU's native instruction width. An FPGA, by contrast, is a very efficient pipelined architecture: as long as you have enough work to keep the pipeline full, you won't be wasting clock cycles.

Also, if you use barriers in your kernel, smaller work-group sizes may yield better results, since the latency between the first and last work-item within the same work-group is lower. Other factors, such as your memory fetch/store patterns, may also come into play, so it may take some experimenting to find the optimal balance.