Forum Discussion

tde_m's avatar
tde_m
Icon for Occasional Contributor rankOccasional Contributor
6 years ago

Ordering of channel operations

Hello,

I'm currently struggling with enforcing the order of write/read to channels.

I know, from the Intel programming guide, that two independent channels operation can be re-ordered by the compiler to generate efficient hardware.

However, this seems to occur even if there is a clear dependence. I've created a minimal working example for this:

#pragma OPENCL EXTENSION cl_intel_channels : enable
 
//message
typedef struct{
	bool request;
	int data;
}message_t;
 
//represents the status of the computation
typedef struct{
	bool start;
	message_t m;
}computation_t;
 
channel message_t channels[2] __attribute__((depth(2)));
 
// Auxiliary function for receiving data
void receive(computation_t *status, int *data){
    if(status->start){
        //at the beginning send the request for data
        write_channel_intel(channels[0],status->m);
        status_>start=false;
    }
    //receive the data and store it
    status->m=read_channel_intel(channels[1]);
    *data=status->m.data;
}
 
__kernel void comp(const int N, const int start, __global int *mem){
    int data;
    computation_t status;
    status.start=true;
    status.m.data=N;
    for(int i=0;i<N;i++)
    {
        //receive data, increment and store it to memory
        receive(&status,&data);
        data++;
        mem[i]=data;
    }
 
}
 
 
//generates a stream of data upon request
__kernel void generator(){
	//receive the request
	message_t m=read_channel_intel(channels[0]);
	for(int i=0;i<m.data;i++)
	{
            message_t send;
            send.data=i;
            send.request=false;
            write_channel_intel(channels[1],send);
	}
}

The "comp" kernel is characterized from a pipelined loop in which it receives data coming from the "generator" kernel using the "receive" function. At the first iteration, a request is sent to the generator in order to let it generate the right amount of data.

If I try to compile this, the channel operations of the "receive" function are re-oderdered, as can be seen from the report:

This occurs even if there is a clear dependency between the two.

Clearly, if in hardware it is first executed the read, this will lead to deadlock.

This happens with Quartus 18.1 and 19.1 (Stratix 10 as target board).

In you opinion, is it a compiler bug or I have to handle this in a different way?

Thanks

22 Replies

  • HRZ's avatar
    HRZ
    Icon for Frequent Contributor rankFrequent Contributor

    I highly doubt the order of channel operations in the System Viewer section of the HTML report follows the actual order implemented by the compiler; the drawing probably prioritizes minimizing space used for the figure, rather than accuracy with respect to order of operations. I would go ahead and just compile and run the kernel to see if it actually gives incorrect results. You can also try adding a barrier in-between the two channel operations as a test to see if the order of the operations will change in the report.

  • tde_m's avatar
    tde_m
    Icon for Occasional Contributor rankOccasional Contributor

    Hi,

    I've already tried to generate the bitstream. The compiled version hangs (of course in emulation it works).

    If I introduce a fence between the two channel operations, they result to be properly ordered, even if the loop II is now 7 instead of 1 (see figure). The compiled version works.

    Still, there exists a dependency between the two channel operations so they should not be arbitrarily re-ordered.

    • HRZ's avatar
      HRZ
      Icon for Frequent Contributor rankFrequent Contributor

      Then this looks like a compiler bug, indeed those channel operations should not be reordered. Can you mention the version of aoc you are using?

      Maybe @MUsman​ or one of the other Intel-affiliated moderators can take a look at this issue and report to the engineering team for a possible fix.

  • tde_m's avatar
    tde_m
    Icon for Occasional Contributor rankOccasional Contributor

    I'm using version 18.1.1 (build 263) and I'm compiling with the "-fpc -fp-relaxed: flags.

    I got the same report and the same hardware behaviour using version 19.1

  • Hi,

    This is expected behavior. Within a kernel, multiple channel calls (to different channels) are considered independent. This can be a problem if channels form a cycle between kernels.

    When I compile your code, the compiler generates a warning as follows: (check your log files)

    • Compiler Warning: Kernels comp and generator may form a cycle due to connectivity of channels channels[0] and channels[1]. Use mem_fence if you require source code-based ordering of channel operations. Channel depths cannot be optimized.

    Ideally, you want to avoid forming a cycle between kernels with channels.

    https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807965224.html#mwh1391806067772

    • HRZ's avatar
      HRZ
      Icon for Frequent Contributor rankFrequent Contributor

      @douglas.prinn​ The "receive" kernel has a clear Write-after-Read hazard/dependency on "status->m". The two operations involving the variable, regardless of what those operations are, should not be reordered (by any sane compiler). The existence of a cycle a channels in this case, or channel operations for that matter, is irrelevant. Are you implying that aoc ignores data dependencies when channel operations are involved?

  • tde_m's avatar
    tde_m
    Icon for Occasional Contributor rankOccasional Contributor

    Hi Douglas, thanks for your reply.

    I've seen the warning, and I recognize that (from the compiler point of view) a potential cycle is recognized. Yet, this is want I desired and, in any case, the re-ordering of the compiler is not resolving this issue (which, in this case is not a problem).

    So:

    • should I assume that the compiler will treat all the channel operations as independent even if there is a clear data dependency (write-after-read in this case)?
    • why the presence of the barrier increases the II up to 7?

    Thanks for your support

  • @tde m@HRZ​ - Yes, you should assume all channel operations are independent even if you think there is a dependency. I know it looks like a read / write dependency on status, but the order that the channel operations are done is not dependent on the read / write order of status. The compiler will order the reads and writes to "status" correctly, but it considers the channel operations totally independent. It would be the same if you were reading or writing to 2 totally different regions in memory. You don't see the reads and writes to status because it's a register. If you force it to a RAM as shown below, it becomes clear. (See diagram)

    computation_t __attribute__((memory) status;

    cycle

    4 - start read from CH1

    5 - read (LD) from status (gets old data)

    5 - write (ST) to status (result from CH1 read)

    12 - start write to CH0 using old data from status

  • @tde m​ I don't see an increase in II when I add the mem fence in 18.1 or 19.1.

    mem_fence(CLK_CHANNEL_MEM_FENCE);

    //receive the data and store it

    status->m=read_channel_intel(channels[1]);

  • tde_m's avatar
    tde_m
    Icon for Occasional Contributor rankOccasional Contributor

    @douglas.prinn

    I see the II increase, it jumps to 7 by inserting a fence exactly like you described (it jumps to 20 if the status variable is stored in memory). This is why the diagram was red in the picture attached above.

    Also, if I compile with attibute memory, I obtain a different diagram wrt to yours (I don't have 4 stores, just one).

    Still, I don't understand the argument about why channels are considered to be independent:

    • write channel take in input a variable V;
    • read channel writes into a variable Z

    If Z==V, the compiler must enforce that the two operations are not swapped.

    If this is not the case, please explain it clearly into your documentation.

    • Douglas_P_Intel's avatar
      Douglas_P_Intel
      Icon for New Contributor rankNew Contributor

      Here's one way to think of it

      temp = Z; (cycle 5 above)

      if(temp) write channel take in input temp;

      read channel writes into Z

      As long as temp=Z happens before read channel writes into Z, it doesn't matter what order the write channel and read channel happen.

    • Douglas_P_Intel's avatar
      Douglas_P_Intel
      Icon for New Contributor rankNew Contributor

      If I copy and paste the code above and add the mem fence I still see an II of ~1.

  • HRZ's avatar
    HRZ
    Icon for Frequent Contributor rankFrequent Contributor

    Regarding the channel reordering, I think I now understand that the compiler always detaches channel operations from other read/write operations and uses extra registers (register renaming?) to handle dependencies such as the one discussed here which makes sense. Hence, it this case, if a cycle of channels did not exist, the channel operations in the "receive" kernel would still have been reordered, but no data corruption would have happened because the dependency is handled using extra registers. However, due to the cycle of channels and the channel reordering, a deadlock happens at run-time unless channel ordering is enforced using mem_fence.

    Still, since I also thought all this time that channel reordering will not happen when data dependencies are involved, I would say the relationship between channel ordering and data dependencies could be very confusing for people who do not come across this thread and it is probably best if it is explained somewhere in the documentation.

    • Douglas_P_Intel's avatar
      Douglas_P_Intel
      Icon for New Contributor rankNew Contributor

      I agree that the documentation needs to be clarified regarding channel ordering. I will request clarification in the documents.