How does the AOC compiler map fixed-point MAC operations to DSP IP blocks?
Hello,
I am using an Arria 10 GX 1150 FPGA board, which contains 1518 DSP blocks.
I am trying to do 16-bit MAC operations using the "short" data type, as shown in the program below:
typedef short DTYPE;

__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void multiply_input(
    // Parameter ports
    __global volatile DTYPE *restrict a_in,
    __global volatile DTYPE *restrict b_in,
    __global volatile DTYPE *restrict c_out
)
{
    int partial_sum[8];
    for (uint i = 0; i < 8; i++) {
        partial_sum[i] = 0;  // accumulator must be zero-initialized
        #pragma unroll
        for (int j = 0; j < 512; j++) {
            partial_sum[i] += a_in[j] * b_in[j];
        }
        c_out[i] = 0xFFFF & (partial_sum[i] >> 0x01);
    }
}
With j = 512, the loop requires 256 DSP blocks to perform the MAC operations (two 16-bit multiplies per DSP block), and the AOC compiler maps it perfectly.
With j = 1024, the AOC compiler should map 512 DSP blocks to perform the 1024 16-bit MAC operations. But the compiler fails to do that, and logic utilization increases dramatically! Why does this happen?
The compiler fails to infer DSP blocks whenever j > 759 (the device has 1518 DSPs in total, and 759 * 2 = 1518). Really strange!
j = 759 report (as intended):
!===========================================================================
! The report below may be inaccurate. A more comprehensive
! resource usage report can be found at conv_pipe/reports/report.html
!===========================================================================
+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   31%                     ;
; ALUTs                                  ;   16%                     ;
; Dedicated logic registers              ;   16%                     ;
; Memory blocks                          ;   25%                     ;
; DSP blocks                             ;   25%                     ;
+----------------------------------------+---------------------------+
aoc: First stage compilation completed successfully.
Compiling for FPGA. This process may take a long time, please be patient.
j = 1024 report:
!===========================================================================
! The report below may be inaccurate. A more comprehensive
! resource usage report can be found at conv_pipe/reports/report.html
!===========================================================================
+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   82%                     ;
; ALUTs                                  ;   56%                     ;
; Dedicated logic registers              ;   32%                     ;
; Memory blocks                          ;   30%                     ;
; DSP blocks                             ;   25%                     ;
+----------------------------------------+---------------------------+
DSP block usage does not go beyond 380 for some reason.
This also happens when I use char (8-bit). Am I missing some kind of mask to convince the compiler? Any suggestions for this case?
A similar problem was reported in a previous post, but it looks like the problem was not solved between v16.1 and v18.1:
https://forums.intel.com/s/question/0D50P00003yyTf2SAE/how-to-share-dsp-correctly-
With v16.1.2, whether DTYPE is short or int, as long as partial_sum is int, I get 1024 DSPs for j = 512.
Based on what you are saying, it seems the OpenCL compiler (v17+) does not instantiate the IPs correctly: it infers 32-bit MULs instead of 16-bit ones (while the estimate in the report is based on 16-bit MULs), and for j > 759, since it thinks the DSPs are fully utilized, it instantiates the logic-based IP instead of the DSP-based one. For j <= 759, the mapper is smart enough to pack the operations properly and reduce DSP usage; but for j > 759, since the compiler has already instantiated the logic-based IP, the mapper does not convert it back to the DSP-based one, and you get high logic usage but low DSP utilization.
I tested all the way to v18.1. With DTYPE = short and partial_sum defined as int, DSP usage gets capped at 379 (379.5, actually!) for j > 759, and the rest of the operations are mapped to logic. However, with both defined as short, v17.0+ seems to correctly map two MULs per DSP, but DSP usage gets capped at 50% this time. It seems there is still some code in the compiler (probably the linker) that has not been updated for the new way DSP mapping works: it still uses the old (pre-17.0) accounting of two DSPs per MUL for DTYPE = short with partial_sum = int, and one DSP per MUL when both are short.
It is certainly possible to use all the DSPs on the Arria 10 with 16-bit arithmetic. For example, see the paper from Altera that claims to do so in OpenCL, and another that does so using an HDL library wrapped in OpenCL (both linked in the other thread). The behavior you are observing here seems to be a bug in the OpenCL compiler.