Pinned Memory and Host-Device Communication

Question

Hello,

I'm experimenting with OpenCL 1.2 using a Bittware S5PHQ board.

I've a task-based kernel that operates on a full input dataset (composed of a sequence of discrete sub datasets, if you will), provided via a 'global' memory buffer. The kernel also receives a 'global' results buffer to populate. The nature of the data is such that sequential processing is necessary and the results of processing each discrete data subset are of interest to the host. For reasons of efficiency, I can't start/stop a kernel instance for each discrete input sub dataset; hence a single kernel that runs through the full dataset.

I've 2 related scenarios that I'd appreciate some clarifications on -

1) I want the host to be able to see the results as and when they're generated rather than wait for the kernel to finish processing the full dataset. I looked at using MEM_ALLOC_HOST_PTR+enqueueMapBuffer for this and allocated the output buffer with MEM_ALLOC_HOST_PTR. The kernel populates this buffer with each sub dataset's processing result. I expected the host to be able to map this pinned memory periodically to get the results.

However, after starting the kernel, when the host makes a call to enqueueMapBuffer(cl_buf_output, CL_TRUE, ...), the call waits indefinitely, i.e. the mapping doesn't complete and/or the host doesn't get a chance to read the mapped memory.

However, the same code works on the CPU with Intel OpenCL support (i.e. the host detects the 'CPU' device and runs the kernel on the CPU). The kernel and the host share a 'last result index' and every time the kernel has processed a sub dataset and generated a new result, the host can see the new result using this index from the output buffer.

So, why am I not observing this behavior on the FPGA? I thought the host can, at any point, map pinned memory (allocated with MEM_ALLOC_HOST_PTR) and expect the latest data to be transferred from the device? Short of waiting for the kernel to finish, how can the host 'see' the results or any changes to pinned memory using the FPGA?

2) The 2nd scenario is related to the above. I want the kernel to be able to see a change from the host. At any point in time, I'd like to be able to shut down the kernel by setting a value in pinned memory, which the kernel checks every so often (like a shutdown flag). I expected the kernel to be able to see changes made by the host to that pinned memory.

However, my observation is similar to the above, i.e. once the kernel is running, when the host tries to map the pinned memory to set it to the new value, the enqueueMapBuffer() call hangs indefinitely. Again, as in the 1st scenario, this works successfully on the CPU with Intel OpenCL.

Why does this work with the CPU and not on the FPGA? Is what I'm looking for the same as shared memory (an OpenCL 2.0 feature)?

Apologies for the long post.

Thanks much in advance for any help.

altera_forum · Answer

Prior to OpenCL 2.0, the relaxed-consistency memory model only guarantees that global and local memory are consistent among work-items in the same work-group at barriers, and that global memory is consistent among work-items in different work-groups upon kernel completion. Furthermore, consistency between the host and the device is only guaranteed upon completion of read/write buffer commands, and only for buffers that are not being written to by the host or by kernels at the time.

Since a CPU device is also the host, and therefore has unified host memory, you may by good fortune observe the behavior you desire with what you've done. However, this is highly vendor-dependent (especially for discrete devices) and thus not portable under the OpenCL 1.0 to 1.2 specifications. What you're looking for is Shared Virtual Memory (SVM) in the OpenCL 2.0 specification.
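
To make the pattern concrete, here is a minimal host-only sketch in standard C++ of the 'last result index' plus shutdown-flag handshake described in the question. It is an analogy, not OpenCL API code: a std::thread stands in for the long-running kernel, and std::atomic operations on ordinary host memory stand in for the coherent, atomically synchronized sharing that unified host memory happens to give on a CPU device. As far as I understand it, fine-grained SVM with atomics is the OpenCL 2.0 mechanism intended to provide comparable guarantees across host and device. All names (SharedRegion, producer, consume_results) are hypothetical, and nothing here will make the pattern work on an OpenCL 1.2 FPGA target.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical shared region. It stands in for memory that both sides can
// access coherently; it is NOT an OpenCL buffer.
struct SharedRegion {
    std::vector<int> results;                        // one slot per sub dataset
    std::atomic<std::size_t> last_result_index{0};   // count of results published so far
    std::atomic<bool> shutdown{false};               // set by the host to stop early
    explicit SharedRegion(std::size_t n) : results(n, 0) {}
};

// Stand-in for the single long-running kernel: processes sub datasets in order.
void producer(SharedRegion& shared) {
    for (std::size_t i = 0; i < shared.results.size(); ++i) {
        // Check the host's shutdown request before starting the next sub dataset.
        if (shared.shutdown.load(std::memory_order_acquire)) return;
        shared.results[i] = static_cast<int>(i * i);  // placeholder "result"
        // Release store: the result written above becomes visible to any
        // reader that observes the new index with an acquire load.
        shared.last_result_index.store(i + 1, std::memory_order_release);
    }
}

// Host side: polls the index, collects results as they appear, and requests
// shutdown once stop_after results have been seen.
std::vector<int> consume_results(std::size_t total, std::size_t stop_after) {
    assert(stop_after <= total);
    SharedRegion shared(total);
    std::thread worker(producer, std::ref(shared));

    std::vector<int> seen;
    while (seen.size() < stop_after) {
        std::size_t published =
            shared.last_result_index.load(std::memory_order_acquire);
        // Only slots below the published index are safe to read.
        while (seen.size() < published && seen.size() < stop_after) {
            seen.push_back(shared.results[seen.size()]);
        }
        if (seen.size() < stop_after) std::this_thread::yield();
    }

    shared.shutdown.store(true, std::memory_order_release);
    worker.join();
    return seen;
}
```

Calling consume_results(8, 5), for example, collects the first five placeholder results while the producer is still running and then asks it to stop. The point of the sketch is that both directions of communication rely on explicit release/acquire synchronization on memory that both sides can see coherently; the OpenCL 1.2 memory model gives no such guarantee between the host and a running kernel.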
