Forum Discussion

Altera_Forum
Honored Contributor
11 years ago

processing element and work items

Hi there,

I have a general question here.

If I compile a kernel with BLOCK_SIZE=32, SIMD_WORK_ITEMS=2, and 2-D work-groups, the kernel has the attributes:

__attribute__((reqd_work_group_size(BLOCK_SIZE,BLOCK_SIZE,1)))

__attribute__((num_simd_work_items(SIMD_WORK_ITEMS)))

How many processing elements are generated then? Is that 32*32*2?

If I have multiple work items organized in multiple groups, how are these work items mapped to the hardware PEs?

Got a bit confused here.

Thanks,

-Rae

7 Replies

  • Altera_Forum

    The number of work-items that will be launched is 16*32.

    That is because the compiler vectorizes the work-items along the local_id(0) dimension; each work-item does the work of two work-items, hence half the number of work-items on this dimension is launched.
  • Altera_Forum

    Thanks for the answer. But I meant to ask about the hardware (processing elements) that is generated on the FPGA.

    From the aoc --report output I can see that if I keep the same BLOCK_SIZE but change SIMD_WORK_ITEMS from 2 to 1, the hardware resource usage drops as well. So I am guessing a larger num_simd_work_items means more hardware resource usage.

    Thanks.

    --- Quote Start ---

    The number of work-items that will be launched is 16*32.

    That is because the compiler vectorizes the work-items along the local_id(0) dimension; each work-item does the work of two work-items, hence half the number of work-items on this dimension is launched.

    --- Quote End ---

  • Altera_Forum

    Yes, with num_simd_work_items(2) the compiler generates the same number of processing elements (i.e., 1); however, each processing element is made wider to do more work per cycle.

    The "num_compute_units" attribute directly replicates the entire compute unit.
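    For illustration, the three attributes discussed in this thread combine on a kernel like this (a hypothetical OpenCL C sketch; the kernel name, arguments, and body are made up, and the BLOCK_SIZE/SIMD values are those from the original post):

```c
#define BLOCK_SIZE 32
#define SIMD_WORK_ITEMS 2

// One compute unit, 32x32 work-groups, vectorized 2 ways along dim 0,
// so the compiled pipeline handles two work-items per issue.
__attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
__attribute__((num_simd_work_items(SIMD_WORK_ITEMS)))
__attribute__((num_compute_units(1)))
__kernel void example_add(__global const float *a,
                          __global const float *b,
                          __global float *c) {
    size_t gid = get_global_id(1) * get_global_size(0) + get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
```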
  • Altera_Forum

    --- Quote Start ---

    Yes, with num_simd_work_items(2) the compiler generates the same number of processing elements (i.e., 1); however, each processing element is made wider to do more work per cycle.

    The "num_compute_units" attribute directly replicates the entire compute unit.

    --- Quote End ---

    I understand now. I have another question then: if I don't specify the "num_compute_units" attribute, does that mean the compiler will generate one compute unit for my kernel design?

    And for the following example, with BLOCK_SIZE=32, SIMD_WORK_ITEMS=2:

__attribute__((reqd_work_group_size(64,64,1)))

__attribute__((num_simd_work_items(SIMD_WORK_ITEMS)))

__attribute__((num_compute_units(1)))

    does it mean that the runtime will schedule a 64*64 problem size onto one compute unit on the device, where that compute unit has 32*16 processing elements?

    Thanks!

    -Rae
  • Altera_Forum

    FPGA hardware is different from a GPU, where there is a hierarchy of streaming processors, kernels, warps, etc.

    The compiler generates one or more compute units controlled by num_compute_units; the width of the compute unit is controlled by the num_simd_work_items. Work-items specified in the host are executed on this compute unit in a pipelined fashion. Ideally, every cycle a new work-item will be issued to the compute unit. If there are no stalls, every cycle a work-item will exit the compute unit.
  • Altera_Forum

    The throughput gain does not come from "parallel" execution of work-items on different "processing units". Yes, work-items do execute in parallel across compute units (if num_compute_units is specified) and within the same compute unit (if num_simd_work_items is specified). However, the throughput gain comes mainly from the pipelined execution of work-items. Say you have 32 work-items: it takes 32 cycles (ideally) to issue all of them to one compute unit, and if the kernel computation takes 1000 cycles, then after 1000 cycles one work-item will complete every cycle. Essentially, these 32 work-items start one cycle after each other, and the pipeline has more than enough depth to accommodate all of them in the compute unit at once.

  • Altera_Forum

    Hello everybody !

    I am trying to get a better understanding of how work-items (in the OpenCL sense) are scheduled / processed in parallel when running on the FPGA; since I am familiar with GPUs, I tend to compare the two architectures.

    I would like to know what defines the "width" of my pipeline, that is, the number of entries we operate on in parallel, at a given point in time, in a given stage of a pipeline/workflow scheduled on a single compute unit.

    From the answers above, I understand that the num_simd_work_items parameter seems to be enough to answer this question.

    Setting this parameter to (let's say) 16 should lead to work-groups being processed in chunks of 16 work-items, following each other through the stages of the pipeline generated from the code.

    Now, what if I want to set this number to 512 ? 2048 ?

    Is it just a matter of available logic / space on the board ?

    Is there a maximal value "M" for num_simd_work_items such that exactly M work-items are processed per cycle / stage, perfectly in sync?

    If we go beyond this hypothetical "maximal value", are workitems processed by "batches", like on GPU? (in NVIDIA terminology, they'd be "warps").

    Thanks for your clarifications !