You seem to be confusing replication of NDRange kernels with replication of Single Work-item kernels. The former is controlled entirely by the compiler, and you CANNOT use it to design a systolic array because you cannot customize the individual kernel copies. What the guide refers to is replication of Single Work-item kernels, where get_compute_id() returns a compile-time constant ID for each kernel copy; you can use that ID to customize each copy and its channel connections. In your case, get_global_id() returns a value that is only known at run time, and a channel index has to be resolvable at compile time, so it cannot be used as a channel index.
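
As a rough illustration of the Single Work-item approach, here is a minimal sketch of a chain of replicated kernels wired together by compute ID. The names (N, chain, stage), the "+ 1" stage logic, and the use of the autorun attribute are my own assumptions, not taken from your code. Older SDK versions spell the channel calls read_channel_altera / write_channel_altera, so check the programming guide for your version.

```c
// Sketch only: N, the channel name, and the "+ 1" logic are placeholders.
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define N 4

// N + 1 links: chain[i] feeds copy i, chain[i + 1] carries its output.
channel int chain[N + 1];

__attribute__((max_global_work_dim(0)))  // Single Work-item kernel
__attribute__((autorun))                 // assumption: copies run without a host enqueue
__attribute__((num_compute_units(N)))    // ask for N kernel copies
__kernel void stage(void) {
  // Compile-time constant, different in each copy (0 .. N-1).
  const int id = get_compute_id(0);

  while (1) {
    int v = read_channel_intel(chain[id]);      // input link fixed per copy
    write_channel_intel(chain[id + 1], v + 1);  // output link fixed per copy
  }
}
```

In this sketch, chain[0] and chain[N] still have to be fed and drained by separate kernels. Because id is resolved at compile time, every copy ends up with fixed channel connections, which is exactly what a run-time value such as get_global_id(0) cannot give you.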