Hello,I'm currently struggling with enforcing the order of write/read to channels.I know, from the Intel programming guide, that two independent channels operation can be re-ordered by the compiler to generate efficient hardware.However, this seems to occur even if there is a clear dependence. I've created a minimal working example for this:#pragma OPENCL EXTENSION cl_intel_channels : enable //message typedef struct{ bool request; int data; }message_t; //represents the status of the computation typedef struct{ bool start; message_t m; }computation_t; channel message_t channels[2] __attribute__((depth(2))); // Auxiliary function for receiving data void receive(computation_t *status, int *data){ if(status->start){ //at the beginning send the request for data write_channel_intel(channels[0],status->m); status_>start=false; } //receive the data and store it status->m=read_channel_intel(channels[1]); *data=status->m.data; } __kernel void comp(const int N, const int start, __global int *mem){ int data; computation_t status; status.start=true; status.m.data=N; for(int i=0;i<N;i++) { //receive data, increment and store it to memory receive(&status,&data); data++; mem[i]=data; } } //generates a stream of data upon request __kernel void generator(){ //receive the request message_t m=read_channel_intel(channels[0]); for(int i=0;i<m.data;i++) { message_t send; send.data=i; send.request=false; write_channel_intel(channels[1],send); } }The "comp" kernel is characterized from a pipelined loop in which it receives data coming from the "generator" kernel using the "receive" function. At the first iteration, a request is sent to the generator in order to let it generate the right amount of data.If I try to compile this, the channel operations of the "receive" function are re-oderdered, as can be seen from the report: This occurs even if there is a clear dependency between the two.Clearly, if in hardware it is first executed the read, this will lead to deadlock.This happens with Quartus 18.1 and 19.1 (Stratix 10 as target board).In you opinion, is it a compiler bug or I have to handle this in a different way?Thanks

I highly doubt the order of channel operations in the System Viewer section of the HTML report follows the actual order implemented by the compiler; the drawing probably prioritizes minimizing space used for the figure, rather than accuracy with respect to order of operations. I would go ahead and just compile and run the kernel to see if it actually gives incorrect results. You can also try adding a barrier in-between the two channel operations as a test to see if the order of the operations will change in the report.

Hi,I've already tried to generate the bitstream. The compiled version hangs (of course in emulation it works).If I introduce a fence between the two channel operations, they result to be properly ordered, even if the loop II is now 7 instead of 1 (see figure). The compiled version works.Still, there exists a dependency between the two channel operations so they should not be arbitrarily re-ordered.

Then this looks like a compiler bug, indeed those channel operations should not be reordered. Can you mention the version of aoc you are using?Maybe @MUsman or one of the other Intel-affiliated moderators can take a look at this issue and report to the engineering team for a possible fix.

I'm using version 18.1.1 (build 263) and I'm compiling with the "-fpc -fp-relaxed: flags.I got the same report and the same hardware behaviour using version 19.1

Hi,This is expected behavior. Within a kernel, multiple channel calls (to different channels) are considered independent. This can be a problem if channels form a cycle between kernels.When I compile your code, the compiler generates a warning as follows: (check your log files)Compiler Warning: Kernels comp and generator may form a cycle due to connectivity of channels channels[0] and channels[1]. Use mem_fence if you require source code-based ordering of channel operations. Channel depths cannot be optimized.Ideally, you want to avoid forming a cycle between kernels with channels.https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807965224.html#mwh1391806067772

Ordering of channel operations | Altera Community

22 Replies

HRZ
Frequent Contributor
6 years ago
I highly doubt the order of channel operations in the System Viewer section of the HTML report follows the actual order implemented by the compiler; the drawing probably prioritizes minimizing space used for the figure, rather than accuracy with respect to order of operations. I would go ahead and just compile and run the kernel to see if it actually gives incorrect results. You can also try adding a barrier in-between the two channel operations as a test to see if the order of the operations will change in the report.
tde_m
Occasional Contributor
6 years ago
Hi,
I've already tried to generate the bitstream. The compiled version hangs (of course in emulation it works).
If I introduce a fence between the two channel operations, they result to be properly ordered, even if the loop II is now 7 instead of 1 (see figure). The compiled version works.
Still, there exists a dependency between the two channel operations so they should not be arbitrarily re-ordered.
- HRZ
  Frequent Contributor
  6 years ago
  Then this looks like a compiler bug, indeed those channel operations should not be reordered. Can you mention the version of aoc you are using?
  Maybe @MUsman or one of the other Intel-affiliated moderators can take a look at this issue and report to the engineering team for a possible fix.
tde_m
Occasional Contributor
6 years ago
I'm using version 18.1.1 (build 263) and I'm compiling with the "-fpc -fp-relaxed: flags.
I got the same report and the same hardware behaviour using version 19.1
Douglas_P_Intel
New Contributor
6 years ago
Hi,
This is expected behavior. Within a kernel, multiple channel calls (to different channels) are considered independent. This can be a problem if channels form a cycle between kernels.
When I compile your code, the compiler generates a warning as follows: (check your log files)
Compiler Warning: Kernels comp and generator may form a cycle due to connectivity of channels channels[0] and channels[1]. Use mem_fence if you require source code-based ordering of channel operations. Channel depths cannot be optimized.
Ideally, you want to avoid forming a cycle between kernels with channels.
https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807965224.html#mwh1391806067772
- HRZ
  Frequent Contributor
  6 years ago
  @douglas.prinn The "receive" kernel has a clear Write-after-Read hazard/dependency on "status->m". The two operations involving the variable, regardless of what those operations are, should not be reordered (by any sane compiler). The existence of a cycle a channels in this case, or channel operations for that matter, is irrelevant. Are you implying that aoc ignores data dependencies when channel operations are involved?
tde_m
Occasional Contributor
6 years ago
Hi Douglas, thanks for your reply.
I've seen the warning, and I recognize that (from the compiler point of view) a potential cycle is recognized. Yet, this is want I desired and, in any case, the re-ordering of the compiler is not resolving this issue (which, in this case is not a problem).
So:
should I assume that the compiler will treat all the channel operations as independent even if there is a clear data dependency (write-after-read in this case)?
why the presence of the barrier increases the II up to 7?
Thanks for your support
Douglas_P_Intel
New Contributor
6 years ago
@tde m @HRZ - Yes, you should assume all channel operations are independent even if you think there is a dependency. I know it looks like a read / write dependency on status, but the order that the channel operations are done is not dependent on the read / write order of status. The compiler will order the reads and writes to "status" correctly, but it considers the channel operations totally independent. It would be the same if you were reading or writing to 2 totally different regions in memory. You don't see the reads and writes to status because it's a register. If you force it to a RAM as shown below, it becomes clear. (See diagram)
computation_t __attribute__((memory) status;
cycle
4 - start read from CH1
5 - read (LD) from status (gets old data)
5 - write (ST) to status (result from CH1 read)
12 - start write to CH0 using old data from status
Douglas_P_Intel
New Contributor
6 years ago
@tde m I don't see an increase in II when I add the mem fence in 18.1 or 19.1.
mem_fence(CLK_CHANNEL_MEM_FENCE);
//receive the data and store it
status->m=read_channel_intel(channels[1]);
tde_m
Occasional Contributor
6 years ago
@douglas.prinn
I see the II increase, it jumps to 7 by inserting a fence exactly like you described (it jumps to 20 if the status variable is stored in memory). This is why the diagram was red in the picture attached above.
Also, if I compile with attibute memory, I obtain a different diagram wrt to yours (I don't have 4 stores, just one).
Still, I don't understand the argument about why channels are considered to be independent:
write channel take in input a variable V;
read channel writes into a variable Z
If Z==V, the compiler must enforce that the two operations are not swapped.
If this is not the case, please explain it clearly into your documentation.
- Douglas_P_Intel
  New Contributor
  6 years ago
  Here's one way to think of it
  temp = Z; (cycle 5 above)
  if(temp) write channel take in input temp;
  read channel writes into Z
  As long as temp=Z happens before read channel writes into Z, it doesn't matter what order the write channel and read channel happen.
- Douglas_P_Intel
  New Contributor
  6 years ago
  If I copy and paste the code above and add the mem fence I still see an II of ~1.
  - Douglas_P_Intel
    New Contributor
    6 years ago
HRZ
Frequent Contributor
6 years ago
Regarding the channel reordering, I think I now understand that the compiler always detaches channel operations from other read/write operations and uses extra registers (register renaming?) to handle dependencies such as the one discussed here which makes sense. Hence, it this case, if a cycle of channels did not exist, the channel operations in the "receive" kernel would still have been reordered, but no data corruption would have happened because the dependency is handled using extra registers. However, due to the cycle of channels and the channel reordering, a deadlock happens at run-time unless channel ordering is enforced using mem_fence.
Still, since I also thought all this time that channel reordering will not happen when data dependencies are involved, I would say the relationship between channel ordering and data dependencies could be very confusing for people who do not come across this thread and it is probably best if it is explained somewhere in the documentation.
- Douglas_P_Intel
  New Contributor
  6 years ago
  I agree that the documentation needs to be clarified regarding channel ordering. I will request clarification in the documents.