>With regard to DLA, I'm still confused by your statement that even unrolling can be considered a systolic array. I fully understand systolic arrays, but cannot really see the connection between unrolling and a systolic array being implied. In other words, I don't know what options the compiler has to infer different designs when facing an unrolled loop. Are there any resources that explain how the compiler infers different designs for unrolled loops in different situations?
I personally do not consider loop unrolling to be "generating an array of PEs," but some people are very keen on calling it that; this is just a matter of notation and is unrelated to what the compiler actually does. In fact, I would assume the compiler uses the same algorithm to unroll loops every time. I can point you to multiple papers where the authors draw an array of PEs to describe their design, while all they have actually done in the code is add #pragma unroll to a loop... I suspect this might also be the case with the DLA paper.

Probably the best example of this is "Section 9.1.2. Use a Single Kernel to Describe Systolic Arrays" of the Best Practices Guide, which now recommends avoiding autorun kernels on Stratix 10 and instead using a single-kernel design to describe systolic arrays, while the code example just unrolls two loops... That will simply result in a single long and wide pipeline, not what I would call a systolic array: a set of disjoint PEs connected with FIFOs and implicitly synchronized using the stall signals of the FIFOs.

Please note that much of what I say here is my own personal opinion based on my own experience, and I could always be wrong.
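To make the contrast concrete, here is a sketch of the two styles in Intel FPGA OpenCL (not runnable outside the FPGA toolchain; kernel names, the PE count of 8, and the placeholder PE work are all made up for illustration):

```c
// (a) "Single-kernel" style: just an unrolled loop. The compiler replicates
// the loop body into one long, wide pipeline; papers often draw this as an
// "array of PEs," but there are no disjoint, FIFO-connected units here.
__kernel void dot_unrolled(__global const float *restrict a,
                           __global const float *restrict b,
                           __global float *restrict out, int n) {
  float acc = 0.0f;
  for (int i = 0; i < n; i += 8) {
    #pragma unroll
    for (int j = 0; j < 8; j++)   // 8 replicated multiply-adds, one pipeline
      acc += a[i + j] * b[i + j];
  }
  *out = acc;
}

// (b) What I would call a systolic array: disjoint autorun PE kernels
// connected by channels (FIFOs), implicitly synchronized by the channels'
// stall signals.
#pragma OPENCL EXTENSION cl_intel_channels : enable
channel float c_in[8], c_out[8];

__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__attribute__((num_compute_units(8)))
__kernel void pe(void) {
  int id = get_compute_id(0);
  while (1) {
    float x = read_channel_intel(c_in[id]);
    write_channel_intel(c_out[id], x * 2.0f);  // placeholder PE work
  }
}
```

Both versions instantiate replicated hardware, which is why the "array of PEs" terminology gets applied to (a) as well; but only (b) has independently stalling units with FIFOs between them.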
Regarding resource utilization, MLAB estimation has only been added to the HTML report, not to the area report summary printed to stdout. The rest of the resources map as follows:
ALUT from HTML report = ALUT from summary
FF from HTML report = Dedicated logic registers from summary
RAMs from HTML report = Memory blocks from summary
DSP from HTML report = DSP from summary
Logic utilization from summary (supposed to estimate ALM usage) = ALUT + Dedicated logic registers/FF
You should remember that the report just uses a resource model Intel has come up with, which can be highly inaccurate at times (both over- and under-estimating). Specifically, the logic utilization estimate is very inaccurate and usually over-estimated, since it does not account for the fact that many of the required ALUTs and FFs will be mapped to the same ALM; then again, estimating this mapping with any reasonable degree of accuracy would be very difficult. What I do is ignore the logic utilization entirely and just place and route the design to see what happens; if logic is indeed over-utilized, then I go back and modify my kernel.