
YChoi1
New Contributor
6 years ago
Solved

Serialization problem with code blocks without dependency

Hi. I was implementing a really simple dense matrix-vector multiplication, with 32 blocks each computing a partial sum of A*x and one block computing the final sum of the partial sums. Obviously, there is a dependen...
  • HRZ
    6 years ago

    As your design is described right now, the 32 loop nests will be pipelined rather than parallelized, and since the outer loop is not pipelined, the loop nests will also execute one after another. If you want those 32 loop nests to run in parallel, you should describe them as a single loop nest wrapped inside another loop that has a trip count of 32 and is fully unrolled. In that case, you will get 32 parallel blocks.

    However, in your case it is impossible to structure the code like that. Due to the genius way HBM works on Intel FPGAs (and apparently also on Xilinx), where there is no interleaving, you are forced to allocate 32 buffers with different names, and buffers with different names cannot be addressed from a loop (of course, this problem somehow does not exist on GPUs, which have been using HBM for 3-4 years now). One possible solution I can think of is to structure your code as I mentioned and use a large switch block inside the unrolled loop to map each iteration of the unrolled loop to one of the differently-named buffers, like this:

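    #pragma unroll
    for (int U = 0; U < 32; U++)   // fully unrolled: one copy of the body per value of U
    {
       for (int i = 0; i < matrix_size/1; i++)
       {
           . . .
           union uf8 local_A;
           // U is a compile-time constant in each unrolled copy,
           // so ideally only one case remains per copy.
           switch (U)
           {
             case 0:
               local_A.f8 = A0[i*matrix_size/8/32+j];
               break;
             case 1:
               local_A.f8 = A1[i*matrix_size/8/32+j];
               break;
               // remaining cases follow the same pattern
               .
               .
               .
           }
           . . .
       }
    }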

    Hopefully the compiler will be smart enough to just create one memory port for each block in this case (and optimize out the rest), rather than 32 ports for each with a multiplexer at the end...
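
    To make the switch mapping above easier to check for correctness, here is a minimal host-side sketch in plain C, not OpenCL and not the original kernel. It assumes 4 buffers instead of 32, a column-wise split of A across the buffers, and scalar floats instead of the f8 vector type. The names N, BANKS, COLS, load_bank and matvec_banked are made up for illustration. It only models the arithmetic and says nothing about how many memory ports the FPGA compiler will generate.

```c
#define N 8               /* matrix is N x N (illustrative size, an assumption) */
#define BANKS 4           /* stands in for the 32 differently-named HBM buffers */
#define COLS (N / BANKS)  /* columns of A held by each buffer (assumed layout) */

/* Map a loop index to one of the differently-named buffers,
 * the same way the switch inside the unrolled loop does. */
static float load_bank(int u, int idx,
                       const float *A0, const float *A1,
                       const float *A2, const float *A3)
{
    switch (u) {
    case 0:  return A0[idx];
    case 1:  return A1[idx];
    case 2:  return A2[idx];
    default: return A3[idx];
    }
}

/* y = A * x, with A split column-wise over four separately named buffers. */
static void matvec_banked(const float *A0, const float *A1,
                          const float *A2, const float *A3,
                          const float *x, float *y)
{
    float partial[BANKS][N] = {{0}};

    /* In the kernel, this outer loop is the one marked "#pragma unroll". */
    for (int u = 0; u < BANKS; u++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < COLS; j++)
                partial[u][i] += load_bank(u, i * COLS + j, A0, A1, A2, A3)
                                 * x[u * COLS + j];

    /* Final reduction over the per-buffer partial sums. */
    for (int i = 0; i < N; i++) {
        float sum = 0.0f;
        for (int u = 0; u < BANKS; u++)
            sum += partial[u][i];
        y[i] = sum;
    }
}
```

    Calling matvec_banked with four arrays of N*COLS floats each, plus x and y of length N, gives the same result as an ordinary dense matrix-vector product, so it can serve as a reference when validating the kernel output.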

    If this doesn't work, another option is a multi-kernel design in which each of the 32 blocks has its own kernel, one kernel handles the memory reads, and another kernel performs the final reduction and the memory writes. You can probably use the autorun kernel type to implement the 32 compute kernels with minimal code size. Of course, a multi-kernel design with blocking channels will incur a huge area overhead if you also want to use the Hyper-Optimized Handshaking optimization (another great feature of Stratix 10).

    P.S. Which Stratix 10 MX board is this that already supports OpenCL? Is it Bittware's board?