Forum Discussion

Altera_Forum
Honored Contributor
8 years ago

Help with porting a CUDA code to OpenCL

Hi everyone.

I am porting a CUDA kernel to an OpenCL kernel to execute it on an FPGA (Stratix V).

The code looks very similar, so it shouldn't be a problem, except that this FPGA board only supports OpenCL 1.0.

So I have some questions regarding the code I have to port.

1) With CUDA you can call different functions asynchronously and allocate and store data on the GPU independently of the kernel you are executing (as far as I understood). Is this possible with OpenCL?

2) This algorithm is partly sequential and partly parallel. How can I achieve the same behavior with OpenCL without losing performance? Can I use more kernels? And if yes, how does that work?

3) Any suggestions about how I should approach this?

Thank you very much to anyone who can help me.

19 Replies

  • Altera_Forum

    --- Quote Start ---

    As I mentioned earlier:

    So, yes, do not do this unless you are using events to synchronize kernel calls from the host.

    Regarding the "note: candidate function has been explicitly made unavailable" message, Altera's Programming Guide says:

    The SDK does not support 64-bit atomic functions described in Section 9.7 of the OpenCL Specification version 1.0.

    That is probably the source of your problem.

    --- Quote End ---

    Thank you very much. I expected that it wasn't supported. By the way, I am not running multiple queues, and I always wait for one kernel execution to finish before launching another, so my question was about a single kernel launched with a 1D NDRange but with multiple work-groups. In this case, if the multiple work-groups access and write to the same global array, would that cause a problem? Thanks
  • Altera_Forum

    If you are using num_compute_units to replicate the kernel, and your accesses are random in a way that two different work-items from different work-groups might try to read/write the same memory location, then yes, this is certainly possible. Without num_compute_units, this shouldn't happen unless there is some race condition in the kernel itself (i.e. incorrect code, which would also give incorrect results on CPU/GPU).
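
For background, num_compute_units is a kernel attribute from the Altera (now Intel FPGA) OpenCL SDK that replicates the entire kernel pipeline on the FPGA. The fragment below is a purely illustrative device-code sketch (OpenCL C, compiled offline with aoc, not host code); the kernel name and body are made up for the example:

```c
// Illustrative OpenCL C device code (Altera SDK attribute), not host C.
// num_compute_units(2) asks the offline compiler to instantiate two
// copies of this kernel pipeline; the runtime then distributes
// work-groups across the copies, so work-items from different
// work-groups can access global memory concurrently -- which is when
// the read/write conflicts described above become possible.
__attribute__((num_compute_units(2)))
__kernel void scale(__global float *restrict data, const float factor)
{
    size_t i = get_global_id(0);
    data[i] *= factor;
}
```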

  • Altera_Forum

    --- Quote Start ---

    If you are using num_compute_units to replicate the kernel, and your accesses are random in a way that two different work-items from different work-groups might try to read/write the same memory location, then yes, this is certainly possible. Without num_compute_units, this shouldn't happen unless there is some race condition in the kernel itself (i.e. incorrect code, which would also give incorrect results on CPU/GPU).

    --- Quote End ---

    I am not using num_compute_units and I don't know what it is, so I guess that is not the problem. So running that kernel sequentially or in parallel should not be a problem, I guess. I'll write here if I find anything else. Thanks for the help.
  • Altera_Forum

    Ok, so the program runs perfectly on the GPU. I compiled the kernels used in the program and flashed them onto the FPGA. I rebooted, checked with "aocl diagnose" that the FPGA was communicating, and used the "aocl program" command to check. Then I ran the program on the FPGA and it gave me an access violation reading memory... It is not the first time I have run something on the FPGA. What could it be?!

  • Altera_Forum

    No idea. You don't really need to flash the FPGA manually, though; the OpenCL runtime will automatically do this during execution.

  • Altera_Forum

    Ok, so I have managed to run the same code on both CPU and GPU without problems or crashes. But if I want to run it on the GPU, all clReleaseMemObject calls have to be changed to clRetainMemObject; otherwise it sometimes outputs wrong values and takes much longer to finish. On the CPU it is the opposite: I need to switch all clRetainMemObject calls back to clReleaseMemObject to make it work; otherwise the program throws CL_OUT_OF_RESOURCES when creating some temporary buffers during execution.

    So what is going on? What is the difference between the two? Thanks for the support.
  • Altera_Forum

    I have never used clRetainMemObject. Are you using these calls in-between the execution of your kernels? These are generally used at the end of execution for cleanup purposes, and in such case, will in no way affect data integrity. Are you using blocking read/write buffer calls? Make sure you are not releasing a device buffer right after a non-blocking read from it (which can obviously cause data corruption).

  • Altera_Forum

    --- Quote Start ---

    I have never used clRetainMemObject. Are you using these calls in-between the execution of your kernels? These are generally used at the end of execution for cleanup purposes, and in such case, will in no way affect data integrity. Are you using blocking read/write buffer calls? Make sure you are not releasing a device buffer right after a non-blocking read from it (which can obviously cause data corruption).

    --- Quote End ---

    I am always using blocking read/write calls. I only release buffers when I finish with one set of data, before starting again with another set. I just don't understand why one works only on the GPU and the other only on the CPU.

    The CPU gives me resource problems when I use the "wrong one"... Can anyone help me figure this out? Thanks.
  • Altera_Forum

    Ok, so thanks everyone first. Now I am having a very weird problem. Basically, I have converted the kernels into one sequential kernel, modified the host code, cross-checked everything, and so on. It does work. I clear the buffers before each iteration, but for some reason, after a certain number of iterations, the kernel fails and throws this error: CL_INVALID_COMMAND_QUEUE. As far as I know, this means the kernel has failed, but it doesn't make any sense.

    So in order to work around this problem, I re-initialize all the OpenCL objects (command queue, context, and so on) every few iterations, and now it goes through all the iterations.

    I am running the code on my NVIDIA GPU. What could be causing this problem? I do release the buffers and re-initialize them... Also, if I run it on the CPU, it fails randomly.