Hello everybody!
I am trying to get a better understanding of how work-items (in the OpenCL sense) are scheduled / processed in parallel when running on an FPGA; since I am familiar with GPUs, I tend to compare the two architectures.
I would like to know what defines the "width" of my pipeline, that is, the number of entries operated on in parallel, at a given point in time, in a given stage of a pipeline/workflow scheduled on a single compute unit.
From the answers above, I understand that the num_simd_work_items kernel attribute seems to answer this question.
Setting this attribute to, say, 16 should lead to work-groups being processed in chunks of 16 work-items, each chunk following the previous one through the stages of the pipeline generated from the code.
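For reference, this is the kind of kernel I have in mind (the vec_add name and the sizes are just placeholders I picked for illustration; as far as I understand, num_simd_work_items also requires reqd_work_group_size to be set):

```c
// Hypothetical kernel: ask the compiler to vectorize the datapath 16-wide,
// so 16 work-items enter each pipeline stage together.
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
```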
Now, what if I want to set this number to 512? 2048?
Is it just a matter of available logic / space on the board ?
Is there a maximal value "M" for num_simd_work_items such that exactly M work-items are processed per cycle / stage, perfectly in sync?
If we go beyond this hypothetical "maximal value", are work-items processed in "batches", as on GPUs? (In NVIDIA terminology, they would be "warps".)
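To make my question concrete, this is the scheduling model I have in mind (an assumption about the behavior, not something I found documented): a work-group of W items, vectorized M-wide, would go down the pipeline in ceil(W / M) successive batches.

```python
import math

def num_batches(work_group_size: int, simd_width: int) -> int:
    """Batches per work-group under my assumed model:
    each batch of `simd_width` work-items enters a pipeline
    stage together, and batches follow one another."""
    return math.ceil(work_group_size / simd_width)

# e.g. a 256-item work-group with num_simd_work_items = 16
# would flow through the pipeline as 16 successive batches
print(num_batches(256, 16))  # → 16
```

Is this roughly what happens, or does the compiler do something entirely different past the supported SIMD width?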
Thanks for your clarifications!