Forum Discussion

Altera_Forum
Honored Contributor
12 years ago

NDRange size, offsets and workgroup size

Hi,

In situations where you don't want to compute on all the elements of a 2D array/OpenCL buffer, which is better on the FPGA: launching exactly the number of work-items required to process them (assuming one work-item per array element) and using offsets, or specifying an NDRange the size of the entire array (or some other multiple) and using a simple if statement within the kernel to control which array elements are actually processed?

E.g. if I had an array of X by X elements but I don't want to process the elements in the outer halo.
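For concreteness, a minimal sketch of the if-guard variant in OpenCL C (the kernel name, buffer name, and the doubling operation are illustrative, not from any real code):

```c
// Launch an N x N NDRange; the guard skips the one-element outer halo.
__kernel void update_interior(__global float *data, const int N)
{
    int x = get_global_id(0);
    int y = get_global_id(1);

    // Gating 'if': halo (and any padding) work-items return immediately.
    if (x == 0 || y == 0 || x >= N - 1 || y >= N - 1)
        return;

    data[y * N + x] *= 2.0f;   // placeholder computation
}
```

The offset alternative would launch an (N-2) by (N-2) NDRange and pass a global work offset of (1, 1) via the global_work_offset argument of clEnqueueNDRangeKernel, so get_global_id() already starts at 1 and no guard is needed.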

Similarly, is it generally more performant to launch just the required number of work-items in the NDRange, or to round the NDRange up to a particular value? On other architectures I have found this to be beneficial. I note that the value of PREFERRED_WORK_GROUP_MULTIPLE seems to be 0 on Altera; is this significant?

Regarding specifying the workgroup size at compilation time (assuming the problem size doesn't change), would I be better off specifying a workgroup size equal to the number of work-items that will eventually be launched (assuming this size fits in the hardware)? Or some other value?

Many thanks

3 Replies

  • Altera_Forum

    If you have a gating 'if' statement that prevents some work-items from performing calculations, they will quickly early out. Each work-item that needs to early out does so on a per-work-item basis, so they do consume computation cycles to perform the condition check, but for a long-running kernel this overhead will be in the noise.

    On the Altera platform the NDRange launch size isn't important (beyond the usual caveat that launching a tiny NDRange for kernels that execute very quickly isn't ideal, which holds for any vendor). I suspect that on the other vendors you have used, NDRange sizing mattered due to the way their scheduling works (warp/wavefront). On the Altera platform there is no concept of a warp or wavefront that you need to deal with.

    There are benefits to putting bounds on the work-group size on the Altera platform, though. You can configure a maximum work-group size, or a fixed work-group size (ideal if your algorithm lends itself to it), using attributes. The default maximum work-group size is 256, so if you want a different maximum/required size those attributes can change it.
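    A minimal sketch of the two attributes described above, as they appear in OpenCL C kernel code (kernel names and bodies are illustrative):

    ```c
    // Cap the work-group size the generated hardware must support.
    __attribute__((max_work_group_size(64)))
    __kernel void capped(__global float *a)
    {
        a[get_global_id(0)] += 1.0f;
    }

    // Or fix the work-group size exactly, if every launch uses the
    // same local size (here 64 x 1 x 1).
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void fixed_size(__global float *a)
    {
        a[get_global_id(0)] += 1.0f;
    }
    ```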

    I'm not sure about the last question, are you asking whether you should pick a maximum work-group size when compiling the kernel that will cover all the different work-group sizes that the host application will throw at the hardware? If so then generally speaking I would say yes, especially if you need to be able to handle work-groups larger than 256 work-items.
  • Altera_Forum

    Thanks for the quick response BadOmen.

    Re: the last question. I guess I'm just trying to understand what the trade-offs/benefits are of specifying different-sized workgroups. I can understand why you would vary this on a GPU architecture to fit the optimal block on each sub-processor/core, etc. However, it's unclear to me how this maps to the FPGA. Grateful if you could elaborate.
  • Altera_Forum

    Since the underlying hardware is flexible, tuning the work-group size in the kernel file has more to do with efficiency (hardware footprint and compiler optimizations). If the kernel compiler knows the work-group size in advance, it can make sure only the hardware you need is created.

    So if you know the maximum work-group size ahead of time, I recommend specifying the max_work_group_size attribute, because if you don't, the compiler will generate hardware to handle a work-group size of 256, which might be overkill in terms of hardware. The reqd_work_group_size attribute has the same benefits as max_work_group_size, except it sets the hardware footprint in stone, which sometimes gives additional footprint reduction; it also gives the compiler more information to perform more aggressive optimizations.

    It's fairly difficult to give recommendations on work-group size because the effect the work-group size has on the underlying hardware is very algorithm dependent. One thing to keep in mind is that the hardware is flexible, so you are not limited to coding your kernel to match the architecture; experimentation is probably necessary.

    My recommendation is to make things like the work-group size configurable in the kernel using macros and try different sizes by compiling the kernel with the -c option, so that just the accelerator gets generated instead of the final programming file (which is the time-consuming part of the compilation flow). That way you can sweep different sizes to see which one works best. You'll want to generate the reports as well, by passing the --report and --estimate-throughput flags, so you can track whether the metrics improved. Often when I'm tuning a kernel I make the work-group size configurable through a macro, then modify the macro by passing its value to the compiler using the flag "-D <MACRO_NAME>=<MACRO_VALUE>".
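    A sketch of that macro-based workflow (the kernel name, default value, and operation are illustrative):

    ```c
    // kernel.cl -- work-group size configurable from the command line.
    #ifndef WG_SIZE
    #define WG_SIZE 64          // default when -D WG_SIZE=... is not given
    #endif

    __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
    __kernel void scale(__global float *data)
    {
        data[get_global_id(0)] *= 2.0f;
    }
    ```

    Each point of the sweep is then compiled along the lines of "aoc -c --report --estimate-throughput -D WG_SIZE=128 kernel.cl" (the exact invocation depends on your SDK version) and the resulting reports compared.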