Forum Discussion

New Contributor

6 years ago

Solved

Serialization problem with code blocks without dependency

Hi. I was implementing a really simple dense matrix-vector multiplication, with 32 blocks each computing a partial sum of A*x, and one block computing the final sum of the partial sums. Obviously, there is a dependen...

Attachment: hbm_boardtest.cpp (15 KB)

HRZ

Frequent Contributor

6 years ago

As written, your 32 loop nests will be pipelined rather than parallelized, and since the outer loop is not pipelined, the loop nests will also execute one after another. To run them in parallel, describe them as a single loop nest wrapped inside another loop with a trip count of 32 that is fully unrolled; that gives you 32 parallel blocks. In your case, however, you cannot write the code that way directly: due to the genius way HBM works on Intel FPGAs (and apparently also Xilinx), where there is no interleaving, you are forced to allocate 32 buffers with different names, and buffers with different names cannot be addressed from a loop (but of course this problem somehow does not exist on GPUs, which have been using HBM for 3-4 years). One possible solution I can think of is to construct your code as I described and use a large switch-case block inside the unrolled loop to map each iteration to one of the differently named buffers, like this:

#pragma unroll
for (int U = 0; U < 32; U++)   // fully unrolled: one parallel block per HBM buffer
{
   for (int i = 0; i < matrix_size/1; i++)
   {
       . . .
       union uf8 local_A;
       switch (U)              // U is a compile-time constant in each unrolled copy
       {
         case 0:
           local_A.f8 = A0[i*matrix_size/8/32+j];
           break;
         case 1:
           local_A.f8 = A1[i*matrix_size/8/32+j];
           break;
           .
           .
           .
       }
       . . .
   }
}

Hopefully the compiler will be smart enough to create just one memory port for each block in this case (and optimize out the rest), rather than 32 ports per block with a multiplexer at the end...
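
To make the pattern concrete, here is a minimal host-side sketch in plain C (not the actual kernel) that can be compiled and checked on a CPU. Everything in it is an assumption for illustration: it uses 4 banks instead of 32, a hypothetical function name matvec_banked, and it assumes each separately named buffer holds a contiguous slice of the columns of A in row-major order, which may not match the layout in hbm_boardtest.cpp. It only shows the structure: one outer loop over banks with a switch that selects the named buffer, followed by the final reduction of the partial sums.

```c
#include <assert.h>
#include <stddef.h>

#define N 8                  /* matrix dimension (small, hypothetical) */
#define NUM_BANKS 4          /* stands in for the 32 HBM buffers */
#define COLS (N / NUM_BANKS) /* columns of A assumed stored per bank */

/* y = A * x, with A split column-wise across separately named buffers. */
void matvec_banked(const float *A0, const float *A1,
                   const float *A2, const float *A3,
                   const float *x, float *y)
{
    float partial[NUM_BANKS][N];

    assert(x != NULL && y != NULL);

    /* In the OpenCL kernel this outer loop would carry "#pragma unroll",
       so each value of U becomes its own parallel block. */
    for (int U = 0; U < NUM_BANKS; U++) {
        for (int i = 0; i < N; i++) {
            float acc = 0.0f;
            for (int j = 0; j < COLS; j++) {
                const int idx = i * COLS + j;
                float a;
                /* Map the loop index to one of the named buffers. */
                switch (U) {
                case 0:  a = A0[idx]; break;
                case 1:  a = A1[idx]; break;
                case 2:  a = A2[idx]; break;
                default: a = A3[idx]; break;
                }
                acc += a * x[U * COLS + j];
            }
            partial[U][i] = acc;
        }
    }

    /* Final reduction: the one step that depends on all the blocks. */
    for (int i = 0; i < N; i++) {
        float sum = 0.0f;
        for (int U = 0; U < NUM_BANKS; U++)
            sum += partial[U][i];
        y[i] = sum;
    }
}
```

On a CPU the switch is just a branch. The hope expressed above is that, once the outer loop is fully unrolled on the FPGA, U is a constant in each copy and only the matching case survives.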

If the switch-based approach doesn't work, another option is a multi-kernel design in which each of the 32 blocks has its own kernel, one kernel handles the memory reads, and one kernel performs the final reduction and memory writes. You can probably leverage the autorun kernel type to implement the 32 compute kernels with minimal code size. Though of course a multi-kernel design with blocking channels will incur a huge area overhead if you also want to use the Hyper-Optimized Handshaking optimization (another great feature of Stratix 10).

P.S. What Stratix 10 MX board is this that already supports OpenCL? Is it Bittware's board?

YChoi1
New Contributor
6 years ago
Using switch + the unroll pragma for the different HBM ports has worked perfectly. I am planning to recommend this coding style to my group. Thanks so much for your advice :)
Reply to your P.S.: it is the S10MX ES ("early silicon") version, received from Intel (not sure about Bittware).
