Forcing loop iterations to execute sequentially

Honored Contributor

8 years ago

Unless your kernel is being falsely compiled as NDRange, loops are never executed out of order in single work-item kernels. NDRange kernels have a scheduler that can change the order of threads at runtime, but single work-item kernels do not have n scheduler and loop iterations are guaranteed to be executed in-order. What is happening here is that your printfs are coming out out-of-order, not the iterations themselves. I am still not sure how printfs are implemented on the hardware, but there is likely some on-chip buffering in place and hence, I don't think correct ordering of printfs are guaranteed on the hardware. You should probably not use printf for debugging ordering on the hardware.

Anyway, I wrote a test kernel based on your code snippet, and as I expected, the compiler already correctly detects the dependency and forces the outer loop to be executed sequentially. This is the test kernel I wrote:

#define M 100
# define N 1000
//__attribute__((max_global_work_dim(0)))
__kernel void ast(__global float* restrict in, __global float* restrict out)
{
	__local float lmem;
	
	for (int i = 0; i < M; i++)
	{
		lmem = in;
	}
	for (uint outer = 0; outer < N; ++outer)
	{
		uint wr_bank_sel = outer & 0x1;
		uint rd_bank_sel = !(outer & 0x1);
		for (uint inner = 0; inner < M; ++inner)
		{
			lmem = lmem * 10.0f/(inner + outer); // placeholder math op but real dependencies
		}
	}
	
	for (int i = 0; i < M; i++)
	{
		out = lmem;
	}
}

And this the compiler's dependency report v16.1.2:

The kernel is compiled for single work-item execution.
The kernel has a required work-group size of (1, 1, 1).
Loop Report:
 + Loop "Block1" (file test.cl line 9)
   Pipelined well. Successive iterations are launched every cycle.
   
   
 + Loop "Block2" (file test.cl line 14)
 | Pipelined with successive iterations launched every cycle.
 |  
 | Iterations executed serially across the region listed below.
 | Only a single loop iteration will execute inside the listed region.
 | This will cause performance degradation unless the region is pipelined well
 | (can process an iteration every cycle).
 |      
 |     Loop "Block3" (file test.cl line 18)
 |     due to:
 |     Memory dependency on Load Operation from: (file test.cl line 20)
 |       Store Operation (file test.cl line 20)
 | 
 | 
 |-+ Loop "Block3" (file test.cl line 18)
     Pipelined well. Successive iterations are launched every cycle.
     
     
 + Loop "Block5" (file test.cl line 24)
   Pipelined well. Successive iterations are launched every cycle.

The dependency is also properly detected in v17.0.2 and sequential execution if forced in the outer loop. I am not sure why this is not happening in your case. You are not using# pragma ivdep on the outer loop, are you?

Forum Discussion

Forcing loop iterations to execute sequentially

Recent Discussions

Highlight similar instances of a selected word fails when scrolling

Automatically added negative node for TDS output doesn't work with Agilex 5

Design Space Explorer - *** Fatal Error: Access Violation at 0X000000001E19EB30

Tensor block usage

Error (169008): Can't turn on open-drain option for differential I/O pin HPS_DDR3_DQS_N[1]