How does the AOC compiler map fixed-point MAC operations to DSP IP blocks?
- 7 years ago
With 16.1.2, whether DTYPE is short or int, as long as partial_sum is int, I get 1024 DSPs for j=512.
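For readers who have not seen the kernel, the two type choices being compared can be modeled in plain C roughly as below. This is only a sketch, not the kernel from this thread: the function name `mac_int_acc` and the array arguments are made up for illustration. The point is that with DTYPE=short both multiplier inputs are 16 bits wide, so each product fits in 32 bits even though partial_sum is an int.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's element type (DTYPE=short). */
typedef int16_t DTYPE;

/* Model of the MAC loop with a 32-bit accumulator (partial_sum is int).
 * Each product has two 16-bit inputs, so its magnitude is at most 2^30
 * and it fits in int32_t. The accumulator can still overflow if n is
 * large and the inputs are large; this sketch does not guard against it. */
int32_t mac_int_acc(const DTYPE *a, const DTYPE *b, int n)
{
    int32_t partial_sum = 0;
    for (int i = 0; i < n; i++) {
        partial_sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return partial_sum;
}
```

In the real kernel the loop would be fully unrolled, so j iterations would become j parallel multipliers; that is where the DSP counts discussed here come from.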
Based on what you are saying, it seems the OpenCL compiler (v17+) does not instantiate the IPs correctly: it infers 32-bit MULs instead of 16-bit ones, while the estimate in the report is based on 16-bit MULs. For j > 759, since it thinks the DSPs are fully utilized, it instantiates the logic-based IP instead of the DSP-based one. For j <= 759, the mapper seems smart enough to pack the operations properly and reduce DSP usage. For j > 759, however, the compiler has already instantiated the logic-based IP, and the mapper does not convert it back to the DSP-based one, so you get high logic usage but low DSP utilization.
I tested all the way up to v18.1. With DTYPE=short and partial_sum defined as int, DSP usage gets capped at 379 (379.5, actually!) for j > 759, and the rest of the operations are mapped to logic. With both defined as short, however, 17.0+ seems to correctly map two MULs per DSP, but this time DSP usage gets capped at 50%. It seems some code in the compiler (probably the linker) has not been updated for the new DSP mapping and still uses the old (pre-17.0) costs: two DSPs per MUL for DTYPE=short with partial_sum=int, and one DSP per MUL for DTYPE=short with partial_sum=short.
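The numbers above are consistent with that guess, as the small C sketch below shows. It assumes the device has 1518 DSP blocks (the count for the largest Arria 10 GX part; the thread only says Arria 10, so treat this as my assumption), and the function names are made up for illustration. The idea is that the compiler stops placing MULs on DSPs once its stale per-MUL budget says the DSPs are full, while the mapper actually packs two MULs per DSP.

```c
#include <assert.h>

/* Assumption: 1518 DSP blocks on the target device. */
#define TOTAL_DSPS 1518.0

/* Number of MULs after which the compiler believes the DSPs are full,
 * given the per-MUL DSP cost it budgets with. */
double mul_cap(double budgeted_dsps_per_mul)
{
    assert(budgeted_dsps_per_mul > 0.0);
    return TOTAL_DSPS / budgeted_dsps_per_mul;
}

/* DSPs actually consumed once the mapper has packed the MULs. */
double dsps_used(double muls, double actual_dsps_per_mul)
{
    return muls * actual_dsps_per_mul;
}
```

With a stale budget of two DSPs per MUL (DTYPE=short, partial_sum=int), `mul_cap(2.0)` gives 759, and `dsps_used(759, 0.5)` gives 379.5, matching the cap I see. With a stale budget of one DSP per MUL (both short), `mul_cap(1.0)` gives 1518 MULs, which pack into 759 DSPs, i.e. exactly 50% of the device. The 16.1.2 result also fits: `dsps_used(512, 2.0)` is 1024.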
It is certainly possible to use all the DSPs on Arria 10 with 16-bit arithmetic. For example, you can look at the paper from Altera which claims to do so in OpenCL and another one which does so using an HDL library wrapped in OpenCL (both linked in the other thread). The behavior you are observing here seems to be a bug in the OpenCL compiler.