Can single work-item kernels run in parallel on the same device?

Can single work-item kernels run in parallel on the same device (i.e. on the same board)?

I've been trying to get a very simple example of task parallelism working, but have not been able to get more than one kernel to run at the same time on the same board.

Each kernel computes part of a summation; let's say it sums the numbers from "start" to "end".
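As a rough illustration of the computation described above, here is a plain C model of one such partial sum. The actual kernel in the .cl file is not shown in the post, so the names `sum_range` and `sum_closed_form` and the `step` parameter are assumptions made for this sketch only.

```c
/* Hypothetical host-side model of one kernel: adds start, start + step, ...
 * up to and including end. A single loop of this shape is the kind of code
 * a single work-item kernel would contain. */
double sum_range(double start, double end, double step)
{
    double sum = 0.0;
    for (double x = start; x <= end; x += step) {
        sum += x;
    }
    return sum;
}

/* Closed form for step == 1 with integer bounds, usable as a cross-check. */
double sum_closed_form(double start, double end)
{
    return (start + end) * (end - start + 1.0) / 2.0;
}
```

Splitting the range across kernels then amounts to giving each kernel its own "start" and "end" and adding the partial results on the host, e.g. `sum_range(0, 49999, 1) + sum_range(50000, 100000, 1)`.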

In a .cl file there are multiple identical kernels that do this; let's say there are 12 of them.

Single work-item kernels have been used to ensure that the equation can be pipelined.

The host code creates multiple kernels and multiple contexts in an effort to run more than one in parallel.

After trying many, many things, I have yet to get them to run in parallel. Initially I just timed the kernels to see how long they take to run. Each kernel takes about the same time (e.g. 25 ms). If 12 kernels are started, the total time is 300 ms.

There are four identical boards in the system. If 12 kernels are used, three on each of the four boards, then each kernel takes 25 ms, but the boards run in parallel with each other, so the total time is only 75 ms.
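The arithmetic behind these two figures can be written as a small timing model. This is only a sketch of the behaviour described in the post (kernels on one board run back to back, separate boards run concurrently); the function name `total_time_ms` is made up for the example.

```c
/* Timing model, an assumption based on the observed numbers rather than a
 * measurement: kernels assigned to the same board run one after another,
 * and different boards run at the same time. */
double total_time_ms(int kernels, int boards, double ms_per_kernel)
{
    int per_board = (kernels + boards - 1) / boards; /* ceiling division */
    return per_board * ms_per_kernel;
}
```

With 25 ms per kernel, `total_time_ms(12, 1, 25.0)` gives 300 ms and `total_time_ms(12, 4, 25.0)` gives 75 ms. If the kernels on a single board really overlapped, the total would instead approach the 25 ms of one kernel.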

What else is needed to get the kernels to run in parallel on the same board? I've been able to turn on profiling and can see that each one is started, one after the other.

Everything seems to work (i.e. the correct answer is produced), but the kernels don't run at the same time on a single board.

Do I need to use NDRange kernels?

Any suggestions would be greatly appreciated! (This shouldn't be so hard?!?)

14 Replies

Altera_Forum
Honored Contributor
8 years ago
--- Quote Start ---
What about:
I - Add a clFlush after each clEnqueueTask()
--- Quote End ---

This changed the way the kernels ran (each one ran longer), but the overall time was the same.
i.e. It seems like each kernel was started but couldn't complete until the previous one had completed.
e.g. Without the clFlush()
$ bin/host 100000 4
Reprogramming device [0] with handle 1
Task:0 complete (4.189 ms)
Task:1 complete (8.172 ms)
Task:2 complete (12.137 ms)
Task:3 complete (16.093 ms)

Time: 16.099 ms (4.025 ms / kernel)
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

e.g. With the clFlush()
$ bin/host 100000 4
Reprogramming device [0] with handle 1
Task:0 complete (12.253 ms)
Task:1 complete (12.283 ms)
Task:2 complete (12.286 ms)
Task:3 complete (16.191 ms)

Time: 16.197 ms (4.049 ms / kernel)
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

--- Quote Start ---

II - Profile the FPGA design (or print all start and end timestamps of the kernels' events) to see if kernels overlap in time.
--- Quote End ---

https://alteraforum.com/forum/attachment.php?attachmentid=14752&stc=1
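For readers who want to do the overlap check suggested in the quote without a graphical profiler, a minimal sketch follows. It assumes the start and end timestamps have already been collected, for example with `clGetEventProfilingInfo` using `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` on a queue created with `CL_QUEUE_PROFILING_ENABLE`. The `interval_t` type and `count_overlapping_pairs` function are hypothetical names, not part of OpenCL or of the poster's host code.

```c
#include <stddef.h>

/* One kernel execution interval in nanoseconds (hypothetical helper type).
 * The values are assumed to come from the kernels' event timestamps. */
typedef struct {
    unsigned long long start_ns;
    unsigned long long end_ns;
} interval_t;

/* Returns how many pairs of intervals overlap in time.
 * A result of 0 means the kernels ran strictly one after another. */
size_t count_overlapping_pairs(const interval_t *iv, size_t n)
{
    size_t pairs = 0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            /* Two intervals overlap if each starts before the other ends. */
            if (iv[i].start_ns < iv[j].end_ns && iv[j].start_ns < iv[i].end_ns) {
                ++pairs;
            }
        }
    }
    return pairs;
}
```

For four kernels, a result of 6 would mean every pair overlapped, while 0 would match the serialized behaviour seen so far.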
Altera_Forum
Honored Contributor
8 years ago
--- Quote Start ---
but you could try using CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR
--- Quote End ---

This did not help.

--- Quote Start ---
adding the 'restrict' flag to your global variables in the kernel
--- Quote End ---

This did not help either.

However, this got me tinkering. I tried changing the CL_MEM_READ_WRITE to CL_MEM_WRITE_ONLY.

This did work!

$ bin/host 100000 4
Reprogramming device [0] with handle 1
Task:2 complete (4.529 ms)
Task:3 complete (4.556 ms)
Task:0 complete (4.559 ms)
Task:1 complete (4.561 ms)

Time: 4.563 ms (1.141 ms / kernel)
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000
Sum 0-100000.000000 (step 1.000000) = 5000050000.000000

https://alteraforum.com/forum/attachment.php?attachmentid=14753&stc=1
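To put the improvement in numbers, the "Time:" lines from the logs can be compared directly. The helper names below are invented for this sketch; the inputs 16.099 ms and 4.563 ms are the totals printed in the runs above.

```c
/* Average time per kernel, as printed in the "Time:" line of the logs. */
double ms_per_kernel(double total_ms, int kernels)
{
    return total_ms / kernels;
}

/* Speedup of a new run relative to a baseline run. */
double speedup(double baseline_ms, double new_ms)
{
    return baseline_ms / new_ms;
}
```

`speedup(16.099, 4.563)` comes out at roughly 3.5 for 4 kernels, which is close to the ideal factor of 4 and consistent with the kernels now overlapping on the one board.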

Thanks SO MUCH to nicolacdnll and fand for giving some new suggestions that FINALLY led to a solution!
Altera_Forum
Honored Contributor
8 years ago
--- Quote Start ---
However, this got me tinkering. I tried changing the CL_MEM_READ_WRITE to CL_MEM_WRITE_ONLY.
--- Quote End ---

That sounds like the host compiler/runtime was assuming a false dependency between the answer[] buffers, either because the buffers are defined as an array, or because you are using host pointers. I always use CL_MEM_READ_WRITE for buffers that are accessed by parallel kernels, and have never had such a problem. However, I do not use host pointers.
Altera_Forum
Honored Contributor
8 years ago
Thanks HRZ for answering my original post and getting me on a path to a solution!
