Forum Discussion

Altera_Forum
Honored Contributor
11 years ago

processing element and work items

Hi there,

I have a general question here.

If I compile a kernel with BLOCK_SIZE=32, SIMD_WORK_ITEMS=2, and 2-D work-groups, the kernel has the attributes:

__attribute__((reqd_work_group_size(BLOCK_SIZE,BLOCK_SIZE,1)))

__attribute__((num_simd_work_items(SIMD_WORK_ITEMS)))

How many processing elements are generated then? Is that 32*32*2?

If I have multiple work items organized in multiple groups, how are these work items mapped to the hardware PEs?

Got a bit confused here.

Thanks,

-Rae

7 Replies

  • Altera_Forum

    The number of work-items that will be launched is 16*32.

    That is because the compiler vectorizes the work-items along the local_id(0) dimension; each work-item does the work of two work-items, hence half the number of work-items on this dimension is launched.
  • Altera_Forum

    Thanks for the answer. But I meant to ask about the hardware (processing elements) that is generated on the FPGA.

    From the aoc --report output I can see that if I keep the same BLOCK_SIZE but change SIMD_WORK_ITEMS from 2 to 1, the hardware resource usage drops as well. So I am guessing a larger num_simd_work_items means more hardware resource usage.

    Thanks.

    --- Quote Start ---

    The number of work-items that will be launched is 16*32.

    That is because the compiler vectorizes the work-items along the local_id(0) dimension; each work-item does the work of two work-items, hence half the number of work-items on this dimension is launched.

    --- Quote End ---

  • Altera_Forum

    Yes, with num_simd_work_items(2) the compiler generates the same number of processing elements (i.e., 1); however, each processing element is made wider to do more work per cycle.

    The "num_compute_units" attribute directly replicates the entire compute unit.
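    For illustration, the three attributes discussed in this thread combine on a kernel like this (a hypothetical OpenCL C sketch; the kernel name, arguments, and body are made up, and the BLOCK_SIZE/SIMD values are those from the original post):

```c
#define BLOCK_SIZE 32
#define SIMD_WORK_ITEMS 2

// One compute unit, 32x32 work-groups, vectorized 2 ways along dim 0,
// so the compiled pipeline handles two work-items per issue.
__attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
__attribute__((num_simd_work_items(SIMD_WORK_ITEMS)))
__attribute__((num_compute_units(1)))
__kernel void example_add(__global const float *a,
                          __global const float *b,
                          __global float *c) {
    size_t gid = get_global_id(1) * get_global_size(0) + get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
```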
  • Altera_Forum

    --- Quote Start ---

    Yes, with num_simd_work_items(2) the compiler generates the same number of processing elements (i.e., 1); however, each processing element is made wider to do more work per cycle.

    The "num_compute_units" attribute directly replicates the entire compute unit.

    --- Quote End ---

    I understand now. I have another question then: if I don't specify the "num_compute_units" attribute, does that mean the compiler will generate one compute unit for my kernel design?

    And for the following example, with BLOCK_SIZE=32, SIMD_WORK_ITEMS=2:

__attribute__((reqd_work_group_size(64,64,1)))

__attribute__((num_simd_work_items(SIMD_WORK_ITEMS)))

__attribute__((num_compute_units(1)))

    does it mean that the runtime will schedule a 64*64 problem size onto one compute unit on the device, where that compute unit has 32*16 processing elements?

    Thanks!

    -Rae
  • Altera_Forum

    FPGA hardware is different from a GPU, where there is a hierarchy of streaming processors, kernels, warps, etc.

    The compiler generates one or more compute units controlled by num_compute_units; the width of the compute unit is controlled by the num_simd_work_items. Work-items specified in the host are executed on this compute unit in a pipelined fashion. Ideally, every cycle a new work-item will be issued to the compute unit. If there are no stalls, every cycle a work-item will exit the compute unit.
  • Altera_Forum

    The throughput gain does not come from "parallel" execution of work-items on different "processing units". Yes, work-items do execute in parallel across compute units (if num_compute_units is specified) and within the same compute unit (if num_simd_work_items is specified). However, the throughput gain comes mainly from the pipelined execution of work-items. Say you have 32 work-items: it takes 32 cycles (ideally) to issue all of them to one compute unit, and if the kernel computation takes 1000 cycles, then after 1000 cycles one work-item will complete every cycle. Essentially, these 32 work-items start one cycle after each other, and the pipeline has more than enough depth to accommodate all of them in the compute unit at once.

  • Altera_Forum

    Hello everybody !

    I am trying to get a better understanding of how work-items (in the OpenCL sense) are scheduled / processed in parallel when running on the FPGA; since I am familiar with GPUs, I tend to compare the two architectures.

    I would like to know what defines the "width" of my pipeline, that is, the number of entries we operate on in parallel, at a given point in time, in a given stage of a pipeline/workflow scheduled on a single compute unit.

    From the answers above, I understand that the num_simd_work_items parameter seems to be enough to answer this question.

    Setting this parameter to (let's say) 16 should lead to work-groups being processed in chunks of 16 work-items, following each other through the stages of the pipeline generated from the code.

    Now, what if I want to set this number to 512 ? 2048 ?

    Is it just a matter of available logic / space on the board ?

    Is there a maximal value "M" for num_simd_work_items such that exactly M work-items are processed per cycle / stage, perfectly in sync?

    If we go beyond this hypothetical "maximal value", are workitems processed by "batches", like on GPU? (in NVIDIA terminology, they'd be "warps").

    Thanks for your clarifications !