Forum Discussion

Altera_Forum
8 years ago

Can single work-item kernels run in parallel on the same device?

Can single work-item kernels run in parallel on the same device (i.e. on the same board)?

I've been trying to get a very simple example of task parallelism working, but have not been able to get more than one kernel to run at the same time on the same board.

The kernel computes part of a summation - let's say it sums the numbers from "start" to "end".

The .cl file contains multiple identical kernels that do this - let's say there are 12 of them.

Single work-item kernels have been used to ensure that the summation loop can be pipelined.

The host code creates multiple kernels and multiple contexts in an effort to run more than one in parallel.
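
For readers following along, the work described above can be sketched in plain C. This is not the poster's kernel (the thread does not show it): the function names, the split into near-equal chunks, and the use of double are assumptions made for illustration. The only figures taken from the thread are the range 0 to 100000 with step 1 and the sum 5000050000 reported in the replies.

```c
#include <assert.h>

/* Sum `count` consecutive terms of start + i*step, beginning at index `first`.
   This stands in for the loop a single work-item kernel would pipeline. */
static double sum_terms(double start, double step, long first, long count)
{
    double total = 0.0;
    for (long i = 0; i < count; ++i)
        total += start + (double)(first + i) * step;
    return total;
}

/* Split the range [start, end] into `tasks` chunks, sum each chunk
   separately (one chunk per hypothetical kernel), then add the partial sums. */
static double sum_split(double start, double end, double step, int tasks)
{
    assert(step > 0.0 && tasks > 0 && end >= start);
    long terms = (long)((end - start) / step) + 1;   /* number of terms */
    double total = 0.0;
    for (int t = 0; t < tasks; ++t) {
        long first = terms * t / tasks;              /* first index of chunk t */
        long next  = terms * (t + 1) / tasks;        /* first index of chunk t+1 */
        total += sum_terms(start, step, first, next - first);
    }
    return total;
}
```

Because each chunk touches only its own terms, the partial sums are independent of one another, which is what makes this kind of workload a natural candidate for running several kernels at once.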

After trying many, many things, I have yet to get them to run in parallel. Initially I used just the timing profile to see how long they take to run. Each kernel takes about the same time (e.g. 25 ms). If 12 kernels are started, the total time is 300 ms.

There are four identical boards in the system. If 12 kernels are used, three on each of the four boards, then each kernel still takes 25 ms, but the boards run in parallel with each other, so the total time is only 75 ms.
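
The arithmetic behind those two figures can be written down as a small model. This is only an idealised sketch: the function name and the "overlap" flag are made up for illustration, and it assumes every kernel takes exactly the same time and that boards never wait on each other.

```c
#include <assert.h>

/* Estimated wall time in ms, assuming every kernel takes kernel_ms,
   boards run concurrently with each other, and kernels sharing a board
   either run back to back (overlap_on_board == 0) or overlap perfectly
   (overlap_on_board != 0). An idealised model, not a measurement. */
static double wall_time_ms(int kernels, int boards, double kernel_ms,
                           int overlap_on_board)
{
    assert(kernels > 0 && boards > 0 && kernel_ms >= 0.0);
    int per_board = (kernels + boards - 1) / boards;   /* ceiling division */
    return overlap_on_board ? kernel_ms : per_board * kernel_ms;
}
```

With the numbers from the post, wall_time_ms(12, 1, 25.0, 0) gives 300 and wall_time_ms(12, 4, 25.0, 0) gives 75, which matches what was measured. If the kernels on one board really overlapped, wall_time_ms(12, 1, 25.0, 1) would give 25, so the measured 300 ms is consistent with fully serialized execution on the board.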

What else is needed to get the kernels to run in parallel on the same board? I've been able to turn on profiling and can see that each one is started, one after the other.

Everything seems to work (i.e. the correct answer is produced), but the kernels don't run at the same time on a single board.

Do I need to use NDRange kernels?

Any suggestions would be greatly appreciated! (This shouldn't be so hard?!?)

14 Replies

  • Altera_Forum

    --- Quote Start ---

    What about:

    I - Add a clFlush after each clEnqueueTask()

    --- Quote End ---

    This changed the way the kernels ran (each one ran longer), but the overall time was the same.

    I.e. it seems like each kernel was started but couldn't complete until the previous one completed.

    e.g. Without the clFlush()

    $ bin/host 100000 4

    Reprogramming device [0] with handle 1

    Task:0 complete (4.189 ms)

    Task:1 complete (8.172 ms)

    Task:2 complete (12.137 ms)

    Task:3 complete (16.093 ms)

    Time: 16.099 ms (4.025 ms / kernel)

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    e.g. With the clFlush()

    $ bin/host 100000 4

    Reprogramming device [0] with handle 1

    Task:0 complete (12.253 ms)

    Task:1 complete (12.283 ms)

    Task:2 complete (12.286 ms)

    Task:3 complete (16.191 ms)

    Time: 16.197 ms (4.049 ms / kernel)

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    --- Quote Start ---

    II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if kernels overlap in time.

    --- Quote End ---

    https://alteraforum.com/forum/attachment.php?attachmentid=14752&stc=1
  • Altera_Forum

    --- Quote Start ---

    but you could try using CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR

    --- Quote End ---

    This did not help.

    --- Quote Start ---

    adding the 'restrict' flag to your global variables in the kernel

    --- Quote End ---

    This did not help either.

    However, this got me tinkering. I tried changing CL_MEM_READ_WRITE to CL_MEM_WRITE_ONLY.

    This did work!

    $ bin/host 100000 4

    Reprogramming device [0] with handle 1

    Task:2 complete (4.529 ms)

    Task:3 complete (4.556 ms)

    Task:0 complete (4.559 ms)

    Task:1 complete (4.561 ms)

    Time: 4.563 ms (1.141 ms / kernel)

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

    https://alteraforum.com/forum/attachment.php?attachmentid=14753&stc=1

    Thanks SO MUCH to nicolacdnll and fand for giving some new suggestions that FINALLY led to a solution!
  • Altera_Forum

    --- Quote Start ---

    However, this got me tinkering though. I did try changing the CL_MEM_READ_WRITE to CL_MEM_WRITE_ONLY.

    --- Quote End ---

    That sounds like the host compiler/runtime was assuming a false dependency between the answer[] buffers, either because the buffers are defined as an array or because you are using host pointers. I always use CL_MEM_READ_WRITE for buffers accessed by parallel kernels and have never had such a problem. However, I do not use host pointers.
  • Altera_Forum

    Thanks HRZ for answering my original post and getting me on a path to a solution!
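
Suggestion II above (print the start and end timestamps of the kernels' events and see whether they overlap) comes down to an interval check, which can be sketched in plain C. In an OpenCL host program the timestamps would typically be read with clGetEventProfilingInfo using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on a queue created with CL_QUEUE_PROFILING_ENABLE; the struct and function names below are hypothetical, and the code does not call OpenCL itself.

```c
#include <assert.h>

/* One kernel execution: start and end timestamps in nanoseconds
   (hypothetical struct; OpenCL reports these values as cl_ulong). */
typedef struct {
    unsigned long long start_ns;
    unsigned long long end_ns;
} kernel_span;

/* Two spans overlap when each one starts before the other ends. */
static int spans_overlap(kernel_span a, kernel_span b)
{
    return a.start_ns < b.end_ns && b.start_ns < a.end_ns;
}

/* Count the pairs of kernels whose execution windows overlap.
   Zero means the kernels ran strictly one after the other. */
static int count_overlapping_pairs(const kernel_span *spans, int n)
{
    int pairs = 0;
    assert(spans != 0 && n >= 0);
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (spans_overlap(spans[i], spans[j]))
                ++pairs;
    return pairs;
}
```

For four kernels, a result of 0 corresponds to the serialized behaviour seen with CL_MEM_READ_WRITE, while a result of 6 (every pair overlapping) is what fully concurrent execution would look like.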