Forum Discussion
23 Replies
- Altera_Forum
Honored Contributor
There is no way the "data_in" kernel would deadlock without printing anything. If you don't get any output from that kernel at runtime, it is possible that there is something wrong with your host code. I have no idea how Altera implements printf, though; there might be implementation-specific details that block the printf calls.
Needless to say, you definitely have an ordering issue in reading and writing CONV1_BIAS and CONV1_WEIGHTS; you are first writing CONV1_WEIGHTS and then CONV1_BIAS, while reading them in the opposite direction. This is very likely one major (or the only) source of your problem. - Altera_Forum
Honored Contributor
--- Quote Start --- There is no way the "data_in" kernel would deadlock without printing anything. If you don't get any output from that kernel at runtime, it is possible that there is something wrong with your host code. I have no idea how Altera implements printf, though; there might be implementation-specific details that block the printf calls. Needless to say, you definitely have an ordering issue in reading and writing CONV1_BIAS and CONV1_WEIGHTS; you are first writing CONV1_WEIGHTS and then CONV1_BIAS, while reading them in the opposite direction. This is very likely one major (or the only) source of your problem. --- Quote End ---
Thanks for your reply!
First, about printf, the "Intel FPGA SDK for OpenCL Programming Guide" says: "During kernel execution, printf data is stored in a global printf buffer that the Intel FPGA SDK for OpenCL Offline Compiler allocates automatically. The size of this buffer is 64 kB; the total size of data arguments to a printf call should not exceed this size. When kernel execution completes, the contents of the printf buffer are printed to standard output. Buffer overflows are handled seamlessly; printf instructions can be executed an unlimited number of times. However, if the printf buffer overflows, kernel pipeline execution stalls until the host reads the buffer and prints the buffer contents. Because printf functions store their data into a global memory buffer, the performance of your kernel will drop if it includes such functions".
So I think my printf output is stored in that buffer? My code gives the correct result in emulator mode, so I think the host code is right. The host code has four queues, one for each kernel.
Second, I have changed the order of reading and writing CONV1_BIAS and CONV1_WEIGHTS, and also increased the channel depth to 20. The compile is running; I expect to get the correct result.
Thank you very much!
- Altera_Forum
Honored Contributor
--- Quote Start --- So I think my printf output is stored in that buffer? My code gives the correct result in emulator mode, so I think the host code is right. The host code has four queues, one for each kernel. --- Quote End ---
That is correct. I was under the impression that printf calls are "streamed" over PCI-E. If the output is instead stored in a buffer and printed only after kernel execution has finished, then printf is not only useless for debugging deadlocks; using it can make deadlocks more likely, because a full printf buffer stalls the kernel pipeline until the host reads it.
--- Quote Start --- Second, I have changed the order of reading and writing CONV1_BIAS and CONV1_WEIGHTS, and also increased the channel depth to 20. The compile is running; I expect to get the correct result. --- Quote End ---
If your kernel still does not work, I recommend removing the printfs (since they might cause deadlocks of their own) and also using "mem_fence(CLK_CHANNEL_MEM_FENCE)" to enforce channel ordering in kernels that have multiple channel reads/writes.
- Altera_Forum
Honored Contributor
--- Quote Start --- If your kernel still does not work, I recommend removing the printfs (since they might cause deadlocks of their own) and also using "mem_fence(CLK_CHANNEL_MEM_FENCE)" to enforce channel ordering in kernels that have multiple channel reads/writes. --- Quote End ---
I have already added "mem_fence(CLK_CHANNEL_MEM_FENCE)" to my code; I forgot to mention that earlier. But consider this: in the previous version of my code, all of the printf output of the "weights_bias_in" kernel was printed, so that kernel ran to completion. That means the other three kernels had all started executing and had read or written their channels many times before they blocked. "conv" in particular had read "CONV1_WEIGHTS" "num_feature_maps" times. So is the problem really the order of the channel reads and writes? Or is it some conflict in the FIFO (channel)? Have you ever run into this problem? Do channels themselves have such an issue? I am not sure, so I also increased the channel depth.
- Altera_Forum
Honored Contributor
--- Quote Start --- If your kernel still does not work, I recommend removing the printfs (since they might cause deadlocks of their own) and also using "mem_fence(CLK_CHANNEL_MEM_FENCE)" to enforce channel ordering in kernels that have multiple channel reads/writes. --- Quote End ---
The compile has finished, but nothing is printed and the kernel blocks! I am now removing most of the printf calls to try again. Could the host code be wrong? What kind of error would cause this result?
- Altera_Forum
Honored Contributor
--- Quote Start --- But consider this: in the previous version of my code, all of the printf output of the "weights_bias_in" kernel was printed, so that kernel ran to completion. That means the other three kernels had all started executing and had read or written their channels many times before they blocked. "conv" in particular had read "CONV1_WEIGHTS" "num_feature_maps" times. So is the problem really the order of the channel reads and writes? Or is it some conflict in the FIFO (channel)? Have you ever run into this problem? Do channels themselves have such an issue? I am not sure, so I also increased the channel depth. --- Quote End ---
Your observation is correct, but there are a lot of unknown factors involved in the process. I recommend removing all the printf calls except the ones at the beginning of your kernels. If your host code is structured like this, it is correct (this is pseudocode with shortened function names):
ClEnqueue(queue1, kernel1);
ClEnqueue(queue2, kernel2);
ClEnqueue(queue3, kernel3);
ClEnqueue(queue4, kernel4);
ClFinish(queue4);
- Altera_Forum
Honored Contributor
--- Quote Start --- Your observation is correct, but there are a lot of unknown factors involved in the process. I recommend removing all the printf calls except the ones at the beginning of your kernels. If your host code is structured like this, it is correct (this is pseudocode with shortened function names): ClEnqueue(queue1, kernel1); ClEnqueue(queue2, kernel2); ClEnqueue(queue3, kernel3); ClEnqueue(queue4, kernel4); ClFinish(queue4); --- Quote End ---
Hi, my host code is structured like this. I consulted an Intel FAE, who said the problem may be caused by an imbalance between channel writes and reads. But I don't think an imbalance would cause a channel to block, because I use buffered channels. What do you think? Thank you!
- Altera_Forum
Honored Contributor
If you have channel ordering issues, imbalance CAN result in deadlock even with buffered channels. Did you try removing all your printf calls and using mem_fence for all kernels that have more than one channel read/write?
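To make the point above concrete, here is a minimal host-side C sketch. It is not OpenCL and not the poster's code; it only models two buffered channels by their fill level. The names `fifo_t`, `run_to_completion`, and `scenario_completes`, and the figure of 25 weights per bias, are assumptions made up for illustration. The writer pushes all weights and then one bias, as in the original kernel, while the reader either follows the same order or asks for the bias first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a buffered channel: only the fill level matters. */
typedef struct { int depth; int count; } fifo_t;

typedef enum { CH_WEIGHTS = 0, CH_BIAS = 1 } chan_id;

/* A blocking write/read is modelled as "try, and report if it would block". */
static bool try_write(fifo_t *f) { if (f->count == f->depth) return false; f->count++; return true; }
static bool try_read(fifo_t *f)  { if (f->count == 0) return false; f->count--; return true; }

/* Runs a writer and a reader, each with a fixed sequence of channel
 * accesses. Returns true if both finish, false if both are stuck. */
bool run_to_completion(const chan_id *writes, size_t n_writes,
                       const chan_id *reads, size_t n_reads, int depth)
{
    assert(depth > 0);
    fifo_t ch[2] = { { depth, 0 }, { depth, 0 } };
    size_t w = 0, r = 0;
    while (w < n_writes || r < n_reads) {
        bool progressed = false;
        if (w < n_writes && try_write(&ch[writes[w]])) { w++; progressed = true; }
        if (r < n_reads && try_read(&ch[reads[r]]))    { r++; progressed = true; }
        if (!progressed) return false; /* neither side can move: deadlock */
    }
    return true;
}

#define MAX_OPS 64

/* Writer: n_weights weights, then one bias.
 * Reader: same order if reader_matches_writer, otherwise bias first. */
bool scenario_completes(int n_weights, int depth, bool reader_matches_writer)
{
    assert(n_weights > 0 && n_weights + 1 <= MAX_OPS);
    chan_id writes[MAX_OPS], reads[MAX_OPS];
    size_t n = 0, m = 0;
    for (int i = 0; i < n_weights; i++) writes[n++] = CH_WEIGHTS;
    writes[n++] = CH_BIAS;
    if (!reader_matches_writer) reads[m++] = CH_BIAS;
    for (int i = 0; i < n_weights; i++) reads[m++] = CH_WEIGHTS;
    if (reader_matches_writer) reads[m++] = CH_BIAS;
    return run_to_completion(writes, n, reads, m, depth);
}
```

In this model, `scenario_completes(25, 20, false)` reports a deadlock: the weights channel fills after 20 writes while the reader is still waiting for a bias that has not been written yet. Raising the depth to at least the number of weights, or making the read order match the write order, lets both sides finish. So a buffer only hides an ordering mismatch while the imbalance stays below the channel depth.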
- Altera_Forum
Honored Contributor
--- Quote Start --- If you have channel ordering issues, imbalance CAN result in deadlock even with buffered channels. Did you try removing all your printf calls and using mem_fence for all kernels that have more than one channel read/write? --- Quote End ---
Yes, I have tried that, but it still doesn't work. I have channel ordering issues, so I think the problem is caused by imbalanced channel access. If that is the problem, how should I change my code?
- Altera_Forum
Honored Contributor
--- Quote Start --- Yes, I have tried that, but it still doesn't work. I have channel ordering issues, so I think the problem is caused by imbalanced channel access. If that is the problem, how should I change my code? --- Quote End ---
The easiest solution is to put all of your code in one kernel, remove all the channels, and use local memory buffers instead.
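As a rough sketch of what that restructuring could look like (this is not the poster's code): one single work-item kernel stages each feature map's weights in a local buffer and reads the bias directly, so nothing is passed through CONV1_WEIGHTS or CONV1_BIAS. The kernel name `conv_single`, the 1-D convolution, the filter length `K`, and all argument names are assumptions for illustration. The guard macros at the top are only there so that the same file also compiles as plain C for a quick functional check on the host; under an OpenCL compiler they are assumed to be skipped because `__OPENCL_VERSION__` is defined.

```c
#ifndef __OPENCL_VERSION__
/* Plain C build: make the OpenCL qualifiers vanish. */
#define __kernel
#define __global
#define __local
#endif

#define K 3 /* hypothetical filter length */

/* Single work-item kernel: no channels, everything happens in one loop nest.
 * weights holds num_feature_maps * K values, bias holds num_feature_maps
 * values, input holds out_len + K - 1 values, and output holds
 * num_feature_maps * out_len values. */
__kernel void conv_single(__global const float *restrict input,
                          __global const float *restrict weights,
                          __global const float *restrict bias,
                          __global float *restrict output,
                          int num_feature_maps, int out_len)
{
    __local float w_buf[K]; /* takes the place of the weights channel */

    for (int f = 0; f < num_feature_maps; f++) {
        /* Stage this feature map's weights on chip. */
        for (int k = 0; k < K; k++)
            w_buf[k] = weights[f * K + k];
        const float b = bias[f]; /* takes the place of the bias channel */

        for (int i = 0; i < out_len; i++) {
            float acc = b;
            for (int k = 0; k < K; k++)
                acc += w_buf[k] * input[i + k];
            output[f * out_len + i] = acc;
        }
    }
}
```

Because the weights and bias are read by the same loop that consumes them, there is no producer/consumer ordering to get wrong. The cost is that loading and computing no longer overlap across separate kernels, so this trades some potential throughput for a design that cannot deadlock on channels.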