Forum Discussion

GSing13 · New Contributor
7 years ago

Waiting for IOs to complete

Hello everyone,

I am implementing OpenCL-based kernels on an Intel Stratix 10 FPGA for a high-performance application.

I would like to know the best way to guarantee that the current data write from a kernel to global memory has completed before the next iteration of the kernel executes.

I first thought of waiting in the kernel for a fixed number of cycles, but I don't see any defined way of achieving this in OpenCL.

I hope someone will be able to guide me and suggest a way to achieve this.

Regards,

Gaurav

7 Replies

  • HRZ · Frequent Contributor

    If I understand your question correctly, this is not at all required. Kernel execution only finishes after all data is written to device memory; this is required by the OpenCL standard. Needless to say, kernel enqueue functions are non-blocking; hence, you need to use clFinish() or clWaitForEvents() to determine when the kernel execution has actually completed. If you enqueue two or more kernels back to back in the same queue, it is again guaranteed that each kernel starts only after the previous one finishes. Please note that the OpenCL standard does not guarantee global memory consistency during kernel execution.
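
    A minimal host-side sketch of the pattern described above (the queue, kernel, and size names are placeholders, not from this thread). This is an illustrative fragment, not a complete program:

    ```c
    /* Sketch: two dependent kernels enqueued in the same in-order queue.
     * queue, kernelA, kernelB, and gsize are hypothetical names. */
    cl_event done;
    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    /* In an in-order queue, kernelB starts only after kernelA has finished,
     * at which point all of kernelA's global-memory writes are committed. */
    clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &gsize, NULL, 0, NULL, &done);
    /* The enqueue calls return immediately; block the host explicitly: */
    clWaitForEvents(1, &done);   /* or: clFinish(queue); */
    ```

    The key point is that no in-kernel waiting is needed for this case: ordering between whole kernel executions comes for free from an in-order queue, and the host only needs clFinish()/clWaitForEvents() to know when the results are ready.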

    • GSing13 · New Contributor

      Thanks for the response.

      I do, however, require synchronization of device memory during kernel execution, between kernels that execute simultaneously.

      Can you suggest a way to achieve this?

      • HRZ · Frequent Contributor

        If you have multiple kernels running in parallel in different queues and updating the same global buffer, the output will always be undefined because, as I said, the OpenCL standard ensures global memory consistency only after kernel execution has finished.

        I once tried something like what you want by using channels between two kernels running in parallel, sending messages from one to the other to synchronize them. That didn't work: channel operations and memory operations have different latencies, so there is no guarantee that by the time the message reaches the second kernel, the memory operation in the first kernel has finished. Intel also provides a global memory barrier that should supposedly help in such cases, but it didn't seem to make any difference in my case.

        You can try using channels in conjunction with the global memory barrier to see if it works for you, but note that if it doesn't, that is completely normal, since such functionality is not expected to be supported by the OpenCL standard. Needless to say, there will always be alternative designs that do not require sharing global memory buffers.
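
        For reference, a sketch of the channel-plus-fence attempt described above, in OpenCL C with Intel's channels extension (kernel, channel, and buffer names are hypothetical). As explained, even with the fence this ordering is NOT guaranteed, which is exactly why the scheme failed:

        ```c
        #pragma OPENCL EXTENSION cl_intel_channels : enable
        channel int sync_ch;

        __kernel void producer(__global volatile int *restrict buf) {
            buf[0] = 42;                       /* global-memory write */
            mem_fence(CLK_GLOBAL_MEM_FENCE);   /* attempt to order write before token */
            write_channel_intel(sync_ch, 1);   /* signal the consumer kernel */
        }

        __kernel void consumer(__global volatile int *restrict buf,
                               __global int *restrict out) {
            int token = read_channel_intel(sync_ch);
            /* The token can arrive before buf[0] is actually committed to
             * global memory, because channel and memory operations have
             * different latencies; hence the undefined result. */
            out[0] = buf[0] + token;
        }
        ```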

  • HRZ · Frequent Contributor

    Volatile is certainly required: it disables the private cache for global memory accesses and forces all accesses to actually go to global memory, making sure all updates by each kernel are propagated to the others.

    I am not sure what you mean by "write-ack LSU". I think Intel's OpenCL compiler also supports atomic memory operations, which might solve your problem; however, performance will be very poor because atomics essentially serialize memory accesses and stall the pipeline until each memory operation has finished (maybe this is what you call a write-ack LSU?).
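
    A small OpenCL C sketch combining the two points above, volatile and atomics (the kernel and buffer names are hypothetical). This is device code and only a fragment, not a complete host program:

    ```c
    __kernel void updater(__global volatile int *restrict counter) {
        /* volatile forces each access to bypass the private cache and go
         * to global memory, so other kernels can observe the update. */
        /* atomic_add stalls the pipeline until the read-modify-write
         * completes, serializing accesses and hurting throughput. */
        atomic_add(&counter[0], 1);
    }
    ```

    The built-in atomic_add takes a pointer to volatile __global int, so the two mechanisms compose naturally; the cost is that every access becomes a full round trip to memory.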

      • HRZ · Frequent Contributor

        I see; that was added in newer versions of the compiler and didn't exist in the older ones. However, it seems to be something the compiler decides on based on the characteristics of the memory accesses, rather than something the programmer can explicitly control. Furthermore, the compiler will never analyze global memory access dependencies between two separate kernels, and hence such an LSU will never be created by the compiler for your case. Based on the example in the guide, this LSU is created for cases where a write-after-write dependency exists in the code; needless to say, such a dependency is a false dependency, and any sane compiler will optimize out the first write and keep only the second one. I fail to see why Intel even needed to add support for this LSU type...

        What you are looking for is most likely the atomic memory read/write I mentioned earlier.