Channel Deadlock Debugging

Question

I've developed a design in which there are 4 NDRange kernels all running concurrently on the FPGA and communicating with each other via the Altera OpencL Channels extension. There are many channels in the program , however it is inherently cyclical as the data is passed in a circle around the 4 kernels. I know this design isn't recommended by Altera (as it can't optimize the channel depths), however I'm trying to look at all possible architectures to get the best throughput. But I'm having a problem while running it.

I developed and tested the kernel using the emulator successfully. Everything was functional and produced promising correct results during emulation. However, after completing the full compilation and running the kernels, they seem to enter a deadlock state. I'm assuming this only happens after full compilation due to the kernels operating concurrently.

Does anyone have any suggestions for debugging kernels that won't come to completion? The solution I have right now uses global memory and an independent queue that continuously transfers over the data from global memory, but it seems inefficient.

altera_forum · Answer

You can use printf calls in your kernels to find out where they are currently at. Yes, printf actually does work on the FPGA, it just takes up a good chunk of LEs FFs and RAMs. I've used it for debugging before. Unfortunately the emulator is useless for these kind of problems.

altera_forum · Answer

Thanks mscharrer! I thought printf was only supported in emulation. That should make debugging a little easier.

altera_forum · Answer

I was surprised that it works as well, and only noticed it by accidentally leaving it in after debugging in the emulator.

altera_forum · Answer

--- Quote Start ---

I was surprised that it works as well, and only noticed it by accidentally leaving it in after debugging in the emulator.

--- Quote End ---

For anyone following this thread, I was able to find the solution to the deadlock. By using printf statements into a csv format, I found that the deadlock was occuring due to the channel reads and writes being executed in an unexpected order.

In my case I was using an NDRange kernel iterating over the channels with a loop as follows:

--- Quote Start ---

__kernel __attribute__((reqd_work_group_size(32,1,1))) ExampleKernel() {

int id = get_global_id(0);

char4 data[NUM_ELEMENTS];

// The data is processed here.

...

for(int i=0; i<NUM_ELEMENTS; i++) {

switch(id) {

case 0: write_channel_altera(outputChannel[0],data);break;

case 1: write_channel_altera(outputchannel[1],data);break;

case 2: write_channel_altera(outputChannel[2],data);break;

case 3: write_channel_altera(outputchannel[3],data);break;

case 4: write_channel_altera(outputChannel[4],data);break;

case 5: write_channel_altera(outputchannel[5],data);break;

case 6: write_channel_altera(outputChannel[6],data);break;

case 7: write_channel_altera(outputchannel[7],data);break;

case 8: write_channel_altera(outputChannel[8],data);break;

case 9: write_channel_altera(outputchannel[9],data);break;

case 10: write_channel_altera(outputChannel[10],data);break;

case 11: write_channel_altera(outputchannel[11],data);break;

case 12: write_channel_altera(outputChannel[12],data);break;

case 13: write_channel_altera(outputchannel[13],data);break;

case 14: write_channel_altera(outputChannel[14],data);break;

case 15: write_channel_altera(outputchannel[15],data);break;

case 16: write_channel_altera(outputChannel[16],data);break;

case 17: write_channel_altera(outputchannel[17],data);break;

case 18: write_channel_altera(outputChannel[18],data);break;

case 19: write_channel_altera(outputchannel[19],data);break;

case 20: write_channel_altera(outputChannel[20],data);break;

case 21: write_channel_altera(outputchannel[21],data);break;

case 22: write_channel_altera(outputChannel[22],data);break;

case 23: write_channel_altera(outputchannel[23],data);break;

case 24: write_channel_altera(outputChannel[24],data);break;

case 25: write_channel_altera(outputchannel[25],data);break;

case 26: write_channel_altera(outputChannel[26],data);break;

case 27: write_channel_altera(outputchannel[27],data);break;

case 28: write_channel_altera(outputChannel[28],data);break;

case 29: write_channel_altera(outputchannel[29],data);break;

case 30: write_channel_altera(outputChannel[30],data);break;

case 31: write_channel_altera(outputchannel[31],data);break;

}

--- Quote End ---

The above is an example of a (32,1,1) NDRange kernel. The data arrived such that the current loop iteration was completely finished before it continued. Here is how the data arrived:

id

loop iteration data

0

data[0]

1

data[0]

2

data[0]

...

31

data[0]

0

data[1]

1

data[1]

2

data[1]

...

0

data[NUM_ELEMENTS-1]

1

data[NUM_ELEMENTS-1]

2

data[NUM_ELEMENTS-1]

...

31

data[NUM_ELEMENTS-1]

altera_forum · Answer

Hello, now I met the same problem with you! Have you solved your problem?How?Thanks!!

Forum Discussion

Channel Deadlock Debugging

9 Replies

Recent Discussions

Is Quartus Prime Pro 22.4 Compatible with Stratix 10 NX Series Device?

Timing analysis - long combinational path

QuartusPro 25.3 Crashed after using the Signal Tap Logic Analyzer

Duplicate_hierarchy_depth / duplicate_register

Automatically added negative node for TDS output doesn't work with Agilex 5