Hi,

I'm designing an accelerator for DTW computation using oneAPI on a Stratix 10 (BittWare 520N-MX Gen3x16 board). I have a kernel (actually several different kernels connected with pipes) that I replicate as many times as possible to maximize throughput. Each replica works on different input data.

In one version, I fitted 12 kernels in the FPGA. For that kernel, I then simplified the external memory interfaces and removed the "function overhead" (using oneAPI pragmas). The compiler's estimated resource utilization shows a reduction of more than 30% per kernel. However, the Fitter still fails to place more than 12 kernels on the FPGA.

What seems even stranger to me is that if I try to compile 16 kernels I get the error:
"Error (170012): Fitter requires 72611 LABs to implement the design, but the device contains only 66099 LABs."
But if I try to compile 14 kernels (same clock target) I get:
"Error (170012): Fitter requires 73646 LABs to implement the design, but the device contains only 66439 LABs"

How could 14 identical kernels need more LABs than 16? I have tried other numbers of kernels and clock frequencies, and the results are very unpredictable.

Any idea why the resource utilization estimate is so wrong? How can I get more information on the fitter process to figure out what is happening?

Thanks.


Strange behavior of Quartus Fitter and how to get more information

25 Replies

BoonBengT_Altera
Moderator
2 years ago
Hi @JavierHormigo,

Thank you for posting in the Intel community forum; I hope all is well.
A quick clarification on the situation: do the errors mentioned occur during hardware compilation?
Does the emulation compilation succeed?

Note: unfortunately, the hardware and BSP come from the vendor, BittWare, so we have limited understanding of the build and architecture involved. We might not be the right people to provide an exact solution, but we will try our best to help.

Best Wishes
BB
- JavierHormigo
  New Contributor
  2 years ago
  Hi BoonBeng,
  Yes, it is during hardware compilation. Emulation is successful, and hardware compilation even succeeds for a smaller number of kernels.
  Thanks,
  Javier
BoonBengT_Altera
Moderator
2 years ago
Hi @JavierHormigo,

Apologies for the delayed response. Noted on the emulation; that rules out code issues.
Regarding your finding that different kernel counts give different resource figures: as I understand it, that might be due to the partitioning of the design.
The limit seems to be 12 kernels for the device mentioned, as anything more than that causes resource errors. If more kernels need to be fitted, a bigger device would be required.
Hope that clarifies.

Best Wishes
BB
- JavierHormigo
  New Contributor
  2 years ago
  Hi @BoonBengT_Altera,
  Sorry, but I don't understand your answer. How can I get more information on the fitter process to determine what is happening?
  Thank you,
  Javier
FvM
Super Contributor
2 years ago
Hi,
How much of the device resources does the "12 kernel" design utilize, e.g. expressed as a percentage in the summary? Does the resource map indicate that 14 or 16 kernels should fit?
- JavierHormigo
  New Contributor
  2 years ago
  The initial report with 12 kernels is:
  Resource  Device   Static partition  Quartus Fitter: Total Used (Entire System)  Quartus Fitter: Kernel System  Estimated: Kernel system
  ALM       702720   168990            645,368                                     415228.0                       -
  - ALUT    -        -                 -                                           -                              553419
  - REG     2810880  675960            1,392,304                                   1022779                        944410
  - MLAB    -        -                 -                                           3233                           4068
  RAM       6847     1590              3,867                                       2599                           2147
  DSP       3960     786               144                                         144                            144
  After optimizing the kernels by using the simplest LSU (FIFO) and adding [[intel::max_global_work_dim(0)]] to the functions to eliminate the function overhead, the report for 12 kernels is the one below:
  Resource  Device   Static partition  Quartus Fitter: Total Used (Entire System)  Estimated: Kernel system
  ALM       702720   168990            TBD                                         -
  - ALUT    -        -                 -                                           420411
  - REG     2810880  675960            TBD                                         636982
  - MLAB    -        -                 -                                           3288
  RAM       6847     1590              TBD                                         1403
  DSP       3960     786               TBD                                         144
  The estimated utilization is much lower, but the implementation finished with an error similar to this one:
  "Error (170012): Fitter requires 72611 LABs to implement the design, but the device contains only 66099 LABs."
  This error is not consistent: sometimes it says more LABs are needed for smaller designs, and the number of LABs the device contains also changes.
  I'm not sure I know where to get the resource map.
  Thanks.
  - JavierHormigo
    New Contributor
    2 years ago
    I'm not sure the tables will come through well enough, so I repeated them as figures. Thanks.
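To make the inconsistency in the two quoted Error (170012) messages concrete, here is a minimal sketch that extracts the required and available LAB counts and compares them. The messages are copied from this thread. The figure of 10 ALMs per LAB is an assumption about the device architecture, used only to convert the 702720 ALMs in the table into a total LAB count, and the `unavailable` field is just that difference, not something Quartus reports.

```python
import re

# Messages copied verbatim from this thread, keyed by kernel count.
MESSAGES = {
    16: "Error (170012): Fitter requires 72611 LABs to implement the design, "
        "but the device contains only 66099 LABs.",
    14: "Error (170012): Fitter requires 73646 LABs to implement the design, "
        "but the device contains only 66439 LABs",
}

PATTERN = re.compile(r"requires (\d+) LABs.*?contains only (\d+) LABs")

# Assumption: 10 ALMs per LAB, so 702720 ALMs would be 70272 LABs in total.
ALMS_PER_LAB = 10
DEVICE_LABS = 702720 // ALMS_PER_LAB


def parse_lab_error(message):
    """Return (required, available) LAB counts from an Error (170012) line."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("not an Error (170012) LAB message")
    return int(match.group(1)), int(match.group(2))


def summarize(messages):
    """Build one row of derived numbers per kernel count."""
    rows = {}
    for kernels, text in messages.items():
        required, available = parse_lab_error(text)
        rows[kernels] = {
            "required": required,
            "available": available,
            "shortfall": required - available,
            # LABs on the device that the message does not count as available.
            "unavailable": DEVICE_LABS - available,
        }
    return rows


if __name__ == "__main__":
    for kernels, row in sorted(summarize(MESSAGES).items()):
        print(kernels, row)
```

Under these assumptions, both "available" figures are below the full-device LAB count, and they differ from each other by 340 LABs. That is at least consistent with the fitter reporting on a region that changes between compiles rather than on the whole device, but the sketch cannot confirm it.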
BoonBengT_Altera
Moderator
2 years ago
Hi @JavierHormigo,

As I understand it, there are a few reports available. The Quartus/HLS estimated resource utilization summary shows the total area utilization of the entire design and of each component individually. What you are showing is, I believe, the summary of that report; zooming in may show which components are taking up the resources.

There is also an Area Analysis of System report (Area Analysis > Area Analysis of System), which can be used to identify parts of the design with a large area overhead.
Hope that clarifies.

Best Wishes
BB
- JavierHormigo
  New Contributor
  2 years ago
  Hi @BoonBengT_Altera
  All these detailed reports are "estimated resource utilization" reports. In all of them, the new architecture uses much less area than the original one. However, when implemented, the fitter fails, so those reports contain no useful information about this problem. I need a report on the fitting process to see what is going on, or any insight into why the estimate says the area is reduced while the fitter says it grows.
  Thanks
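As a rough check on how far the compiler estimate sits from the fitter result, the sketch below uses the 12-kernel baseline numbers posted earlier in this thread. It assumes the flattened table columns are, in order, device total, static partition, fitter total used, fitter kernel system, and estimated kernel system. That column order is an interpretation of the posted table, not something confirmed in the thread.

```python
# Values copied from the 12-kernel baseline table in this thread.
# Assumption: columns are device, static partition, fitter total used,
# fitter kernel system, estimated kernel system (in that order).
BASELINE = {
    "REG": {"device": 2810880, "fitter_total": 1392304,
            "fitter_kernel": 1022779, "estimated_kernel": 944410},
    "RAM": {"device": 6847, "fitter_total": 3867,
            "fitter_kernel": 2599, "estimated_kernel": 2147},
}
ALM_DEVICE = 702720
ALM_FITTER_TOTAL = 645368


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def estimate_gap(row):
    """Percent by which the fitter kernel figure exceeds the estimate."""
    return percent(row["fitter_kernel"] - row["estimated_kernel"],
                   row["estimated_kernel"])


def device_utilization(row):
    """Fitter total used, as a percentage of the device total."""
    return percent(row["fitter_total"], row["device"])


if __name__ == "__main__":
    print("ALM used (% of device):", percent(ALM_FITTER_TOTAL, ALM_DEVICE))
    for name, row in BASELINE.items():
        print(name, "fitter over estimate (%):", estimate_gap(row),
              "| used (% of device):", device_utilization(row))
```

If the column reading is right, the working 12-kernel design already uses roughly 92% of the device ALMs, and the fitter's kernel-system figures are above the estimates for both REG and RAM, so a lower estimate does not by itself guarantee headroom at fit time.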
BoonBengT_Altera
Moderator
2 years ago
Hi @JavierHormigo,

Apologies for the hold-up. You might already be familiar with the Quartus fitter report that is available.

You can navigate to it via the System Resource Utilization Summary (Summary > System Resource Utilization Summary).

That report shows both what Quartus actually uses and the oneAPI compiler's estimate.

In addition, you can drill down to the individual kernels in the next table on the same tab.

Hopefully that will give some insight into which kernels are taking more resources; from there, some optimization of the design would be needed to use less area.

Best Wishes

BB
- JavierHormigo
  New Contributor
  2 years ago
  Hi @BoonBengT_Altera
  But the report you mention only has values if the fitting is successful. It is empty in my case, where the implementation stops with errors.
  Thanks,
  Javier
BoonBengT_Altera
Moderator
2 years ago
Hi @JavierHormigo,

From the screenshot you provided, yes, it is empty. That makes sense: I suspect the reason is the failure of the hardware compilation, which is the step that generates the Quartus fitter data.

I would suggest using the 12-kernel design as a baseline (as it is the only one that compiled successfully), looking at both the oneAPI estimate and the Quartus figures in the report, and checking whether any optimization is left that would reduce the resources needed.
Hope that clarifies.

Best Wishes
BB
- JavierHormigo
  New Contributor
  2 years ago
  Hi @BoonBengT_Altera,
  I did what you suggested, and that leads us back to my first question: using the 12-kernel design as a baseline, the estimation report says the new version uses around 40% fewer resources. However, the fitter fails to implement the hardware because there are not enough LABs. How is this possible? How could I get more information about the fitter process?
  Thanks,
  Javier
BoonBengT_Altera
Moderator
2 years ago
Hi @JavierHormigo,

Greetings, just checking in to see if there are any further doubts regarding this matter.
I hope your doubts have been clarified.

Best Wishes
BB
BoonBengT_Altera
Moderator
2 years ago
Hi @JavierHormigo,

Noted. To clarify, the 40% fewer resources mentioned refers to the 12-kernel design, right?

Could you elaborate on how that was observed or calculated?

I checked internally; unfortunately, we do not have a specific report on the fitter process.

The closest thing is in the .prj folder created during hardware compilation: it should contain logs from the fitter stage of the Quartus compile, in files ending with .rpt.

I would suggest having a quick look at those.

If further details of the fitter process are required, could you provide details of such a report in Quartus? (A screenshot would be great.)

That would enable us to check further.

Looking forward to hearing from you.

Best Wishes

BB
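For locating those logs, a minimal sketch along these lines may help. It only assumes what is stated above, namely that the reports live somewhere under the .prj folder and end in .rpt. The folder path is passed on the command line because the actual project name is not given in this thread.

```python
import sys
from pathlib import Path


def find_reports(prj_dir):
    """Return all *.rpt files below prj_dir, largest first."""
    root = Path(prj_dir)
    reports = [p for p in root.rglob("*.rpt") if p.is_file()]
    return sorted(reports, key=lambda p: p.stat().st_size, reverse=True)


if __name__ == "__main__":
    # Usage: python find_reports.py <path to the .prj folder>
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for report in find_reports(target):
        print(f"{report.stat().st_size:>12}  {report}")
```

Sorting by size is only a convenience, on the guess that the largest report carries the most detail.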
- JavierHormigo
  New Contributor
  2 years ago
  Hi @BoonBengT_Altera ,
  Sorry, I exaggerated the reduction achieved by the new version of the 12 kernels. It wasn't 40%. The detailed reduction is:
  ALUT 25%
  REG 33%
  MLAB 20%
  RAM 35%
  which are still very significant. You can extract those figures from the tables I sent on October 10th.
  Attached you can find the .rpt files from the fitter for the initial, successful implementation and for the second one, which fails. I hope those files help to understand what is happening.
  Thanks,
  Javier
  rpt12good.zip (12.5 MB)
  rpt12fail.zip (10.9 MB)
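The percentages above can be reproduced from the "Estimated: Kernel system" values in the two 12-kernel tables posted earlier. This is only a sketch of that arithmetic, and it assumes the last value on each table row is the estimate.

```python
# Estimated kernel-system values for 12 kernels, copied from this thread.
# Assumption: the last value on each table row is the compiler estimate.
BEFORE = {"ALUT": 553419, "REG": 944410, "MLAB": 4068, "RAM": 2147}
AFTER = {"ALUT": 420411, "REG": 636982, "MLAB": 3288, "RAM": 1403}


def reduction_percent(before, after):
    """Relative reduction from before to after, in percent (one decimal)."""
    return round(100.0 * (before - after) / before, 1)


def reductions(before, after):
    """Map each resource name to its reduction in percent."""
    return {name: reduction_percent(before[name], after[name])
            for name in before}


if __name__ == "__main__":
    for name, value in reductions(BEFORE, AFTER).items():
        print(f"{name}: {value}% fewer")
```

The computed values are within about one percentage point of the 25%, 33%, 20% and 35% quoted above. Note that these are all estimates; none of them comes from a fitter run of the optimized design.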
BoonBengT_Altera
Moderator
2 years ago
Hi @JavierHormigo,

Noted, with thanks for the explanation and the reports.
Regarding your request for more information on the fitter process, I would suggest looking into the failed compilation logs you provided. The 'top.fit.rpt' file contains the detailed log of how the fitter process went.

Toward the bottom of the results, the failure is reported along with the region in which it occurred.
I would therefore suggest going through the logs in those files.
Hope that clarifies.

Best Wishes
BB
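Since the fitter report runs to many megabytes, a small filter can pull out just the failure messages. The sketch below assumes messages in 'top.fit.rpt' start a line in the same "Error (170012): ..." form quoted in this thread, and that critical warnings follow the same "Critical Warning (id): ..." pattern. The real report layout may differ, in which case the regular expression needs adjusting.

```python
import re
from pathlib import Path

# Assumption: messages start a line as "Error (<id>): <text>" or
# "Critical Warning (<id>): <text>", optionally indented.
MESSAGE = re.compile(r"^\s*(Error|Critical Warning)\s*\((\d+)\):\s*(.*)$")


def collect_messages(report_path, ids=None):
    """Return (line_no, severity, id, text) for matching report lines.

    ids: optional set of message ID strings to keep, e.g. {"170012"}.
    """
    found = []
    text = Path(report_path).read_text(errors="replace")
    for line_no, line in enumerate(text.splitlines(), start=1):
        match = MESSAGE.match(line)
        if match and (ids is None or match.group(2) in ids):
            found.append((line_no, match.group(1), match.group(2),
                          match.group(3)))
    return found


if __name__ == "__main__":
    import sys
    # Usage: python collect_messages.py <path to top.fit.rpt>
    if len(sys.argv) > 1:
        for line_no, severity, msg_id, text in collect_messages(sys.argv[1]):
            print(f"{line_no}: {severity} ({msg_id}): {text}")
```

The line numbers make it easier to jump to the surrounding context in the report, which is where the region mentioned above should appear.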
- JavierHormigo
  New Contributor
  2 years ago
  Hi @BoonBengT_Altera
  I have been going through the files you suggested, but I still don't understand why a design that occupies significantly fewer resources fails to fit while the original is implemented with no errors. The report only says that there are not enough LABs in the region, but why?
  Is it because the estimate is so wrong that, instead of shrinking by 30% as it says, the resources actually grow? Is it because, although fewer resources are used, they are concentrated in the same region? If so, why? I have not changed anything to force the tool to put everything in the same region, so why doesn't it use a different region? How can I handle this situation?
  Too many questions raised.
  Thanks,
  Javier
BoonBengT_Altera
Moderator
2 years ago
Hi @JavierHormigo,
Greetings, just checking in to see if there are any further doubts regarding this matter.
I hope your doubts have been clarified.

Best Wishes
BB