Forum Discussion

Altera_Forum
Honored Contributor
8 years ago

Help with porting a CUDA code to OpenCL

Hi everyone.

I am porting a CUDA kernel to an OpenCL kernel to execute it on an FPGA (Stratix V).

The code looks very similar, so it shouldn't be a problem, except that this FPGA board only supports OpenCL 1.0.

So I have some questions regarding the code I have to port.

1) With CUDA you can call different functions asynchronously and allocate and store data on the GPU independently of the kernel you are executing (as far as I understood). Is this possible with OpenCL?

2) This algorithm is partly sequential and partly parallel. How can I achieve the same behavior with OpenCL without losing performance? Can I use more kernels? And if yes, how does that work?

3) Any suggestions about how I should approach this?

Thank you very much to anyone who can help me.

19 Replies

  • Altera_Forum

    --- Quote Start ---

    As I mentioned earlier:

    So, yes, do not do this unless you are using events to synchronize kernel calls from the host.

    Regarding the "note: candidate function has been explicitly made unavailable" message, Altera's Programming Guide says:

    The SDK does not support 64-bit atomic functions described in Section 9.7 of the OpenCL Specification version 1.0.

    That is probably the source of your problem.

    --- Quote End ---

    Thank you very much. I expected that it wasn't supported. By the way, I am not running multiple queues, and I always wait for one kernel execution to finish before launching another, so my question was about a single kernel launched with a 1D NDRange but with multiple work-groups. In this case, if the multiple work-groups access and write to the same global array, would that cause a problem? Thanks
  • Altera_Forum

    If you are using num_compute_units to replicate the kernel, and your accesses are random in a way that two different work-items from different work-groups might try to read/write the same memory location, then yes, this is certainly possible. Without num_compute_units, this shouldn't happen unless there is some race condition in the kernel itself (i.e. incorrect code, which would also give incorrect results on CPU/GPU).
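
For background, num_compute_units is a kernel attribute from the Altera (now Intel FPGA) OpenCL SDK that replicates the entire kernel pipeline on the FPGA. The fragment below is a purely illustrative device-code sketch (OpenCL C, compiled offline with aoc, not host code); the kernel name and body are made up for the example:

```c
// Illustrative OpenCL C device code (Altera SDK attribute), not host C.
// num_compute_units(2) asks the offline compiler to instantiate two
// copies of this kernel pipeline; the runtime then distributes
// work-groups across the copies, so work-items from different
// work-groups can access global memory concurrently -- which is when
// the read/write conflicts described above become possible.
__attribute__((num_compute_units(2)))
__kernel void scale(__global float *restrict data, const float factor)
{
    size_t i = get_global_id(0);
    data[i] *= factor;
}
```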

  • Altera_Forum

    --- Quote Start ---

    If you are using num_compute_units to replicate the kernel, and your accesses are random in a way that two different work-items from different work-groups might try to read/write the same memory location, then yes, this is certainly possible. Without num_compute_units, this shouldn't happen unless there is some race condition in the kernel itself (i.e. incorrect code, which would also give incorrect results on CPU/GPU).

    --- Quote End ---

    I am not using num_compute_units and I don't know what it is, so I guess that is not the problem. So running that kernel sequentially or in parallel should not be a problem, I guess. I'll write here if I find anything else. Thanks for the help.
  • Altera_Forum

    Ok, so the program runs perfectly on the GPU. I compiled the kernels used in the program and flashed them onto the FPGA. I rebooted, checked with "aocl diagnose" that the FPGA was communicating, and used the "aocl program" command to check. Then I ran the program on the FPGA and it gave me an access violation reading memory... It is not the first time I have run something on the FPGA. What could it be?!

  • Altera_Forum

    No idea. You don't really need to flash the FPGA manually, though; the OpenCL runtime will automatically do this during execution.

  • Altera_Forum

    Ok, so I have managed to run the same code on both CPU and GPU without problems or crashes. But if I want to run it on the GPU, all clReleaseMemObject calls have to be changed to clRetainMemObject; otherwise it sometimes outputs wrong values and takes much longer to finish. On the CPU it is the opposite: I need to switch all clRetainMemObject calls back to clReleaseMemObject to make it work; otherwise the program throws CL_OUT_OF_RESOURCES when creating some temporary buffers during execution.

    So what is going on? What is the difference between the two? Thanks for the support.
  • Altera_Forum

    I have never used clRetainMemObject. Are you using these calls in-between the execution of your kernels? These are generally used at the end of execution for cleanup purposes, and in such case, will in no way affect data integrity. Are you using blocking read/write buffer calls? Make sure you are not releasing a device buffer right after a non-blocking read from it (which can obviously cause data corruption).

  • Altera_Forum

    --- Quote Start ---

    I have never used clRetainMemObject. Are you using these calls in-between the execution of your kernels? These are generally used at the end of execution for cleanup purposes, and in such case, will in no way affect data integrity. Are you using blocking read/write buffer calls? Make sure you are not releasing a device buffer right after a non-blocking read from it (which can obviously cause data corruption).

    --- Quote End ---

    I am always using blocking read/write calls. I only release buffers when I finish with one set of data, before starting again with another set. I just don't understand why one works only on the GPU and the other only on the CPU.

    The CPU gives me resource problems when I use the "wrong one"... Can anyone help me figure this out? Thanks.
  • Altera_Forum

    Ok, so thanks everyone first. Now I am having a very weird problem. Basically, I have converted the kernels into one sequential kernel, modified the host code, cross-checked everything, and so on. It does work. I clear the buffers before each iteration, but for some reason, after a certain number of iterations, the kernel fails and throws this error: CL_INVALID_COMMAND_QUEUE. As far as I know, this means the kernel has failed, but it doesn't make any sense.

    So in order to work around this problem, I re-initialize all the OpenCL objects (command queue, context, and so on) every few iterations, and now it goes through all the iterations.

    I am running the code on my NVIDIA GPU. What could be causing this problem? I do release the buffers and re-initialize them... Also, if I run it on the CPU, it fails randomly.