Hello, I see a huge delay inserted by Quartus 19.3 Pro on the Arria 10 FPGA. This was not seen on the Stratix III FPGA using Quartus 13.1. I will attach a picture showing this. This leads to huge hold time violations. The delay sits on something Quartus names ~la_lab/laboutb. I saw another topic where another person had a similar issue with the Arria 10. How can this be solved?


Huge delay inserted by Quartus 19.3 Pro on the Arria 10 FPGA

28 Replies

KhaiChein_Y_Intel
Regular Contributor
6 years ago
Hi,
Can you provide the design.qar for investigation?
Thanks.
- AEsqu
  Contributor
  6 years ago
  No, I won't share it.
  The other person who has the same issue:
  https://forums.intel.com/s/question/0D50P00004OZtoI/timing-closure-on-arria-10?language=en
AEsqu
Contributor
6 years ago
For example, the following path (gated clock block output to FF):
rfd_ic_i|u_top|u_core|u_rfd_clockshop|i_mcu_flexcomm1_clockgate|i01_cnhlspd|Q -> rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array_1__A_flexcomm|genblk1_A_flexcomm|A_flexcomm_ctrl_gen_A_flexcomm_ctrl|A_flexcomm_fifo|A_flexcomm_fifo_ptrs_rx|rptr_gray_0_
takes 0.68 ns in Quartus 13.1 on the Stratix III
and takes 9.1 ns (!) in Quartus 19.3 on the Arria 10.
Why is Quartus adding so much delay in that path for the Arria 10?
Attaching 4 pictures showing this.
The report command was:
report_timing -from_clock { flexcomm1_hclk } -to_clock { flexcomm1_hclk } -from {rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array_1__A_flexcomm|genblk1_A_flexcomm|A_flexcomm_ctrl_gen_A_flexcomm_ctrl|A_flexcomm_fifo|A_flexcomm_fifo_ptrs_rx|rptr_0_} -to {rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array_1__A_flexcomm|genblk1_A_flexcomm|A_flexcomm_ctrl_gen_A_flexcomm_ctrl|A_flexcomm_fifo|A_flexcomm_fifo_ptrs_rx|rptr_gray_0_} -hold -npaths 100 -detail full_path -panel_name {Report Timing}
AEsqu
Contributor
6 years ago
posted a file.
arria_10_routing_issue.zip (750 KB)
sstrell
Super Contributor
6 years ago
There are a few things going on here. Extra delay is good for hold (in this case, removal) analysis. Remember that for hold/removal analysis, you want the signal to remain active longer to meet the timing requirement after the latch edge. So the issue here is the delay of the clock to the destination register (the data required path), not the control signal itself (data arrival path). The clock skew of 11 ns shown at the top of the screenshot is a quick giveaway to the problem.
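As a rough sketch of that reasoning, hold slack can be written as data arrival minus data required. The symbols below are illustrative assumptions, not values read from the attached report, and clock uncertainty is ignored:

```latex
\begin{aligned}
\text{hold slack}
  &= \underbrace{\left(t_{\text{launch clk}} + t_{co} + t_{\text{data}}\right)}_{\text{data arrival}}
   - \underbrace{\left(t_{\text{latch clk}} + t_{h}\right)}_{\text{data required}} \\
  &= t_{co} + t_{\text{data}} - t_{h}
   - \underbrace{\left(t_{\text{latch clk}} - t_{\text{launch clk}}\right)}_{\text{clock skew}}
\end{aligned}
```

A larger data delay raises the slack, while a large positive skew toward the destination register lowers it, which is why the skew figure points at the clock path rather than the data path.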
It looks like the clock is being routed through device logic instead of a global clock routing channel because you have a gated clock. If you must gate the clock, it's usually best to put the gating logic on the clock enable signal of the destination register instead of in the clock path. That would probably fix this issue. You could also try forcing the clock onto a global routing channel using the Global Signal assignment in the Assignment Editor, but the gating logic would still require the clock to come off of the global routing channel, potentially adding more delay.
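A minimal sketch of the two styles, to make the clock enable suggestion concrete. All module and signal names here are hypothetical and are not taken from the design in this thread:

```verilog
// Sketch only: module and signal names are hypothetical.

// Variant A: ASIC-style gated clock. The AND gate sits in the clock path,
// so on an FPGA the gated clock may leave the dedicated clock network.
module ff_gated_clk (
    input  wire clk,
    input  wire en,
    input  wire rst_n,
    input  wire d,
    output reg  q
);
    reg en_lat;

    // Transparent-low latch holds the enable stable while clk is high
    always @(clk or en)
        if (!clk) en_lat <= en;

    wire gclk = clk & en_lat;

    always @(posedge gclk or negedge rst_n)
        if (!rst_n) q <= 1'b0;
        else        q <= d;
endmodule

// Variant B: same intent using the register's clock enable.
// The register is clocked directly by clk, which can stay on a global network.
module ff_clk_en (
    input  wire clk,
    input  wire en,
    input  wire rst_n,
    input  wire d,
    output reg  q
);
    always @(posedge clk or negedge rst_n)
        if (!rst_n)  q <= 1'b0;
        else if (en) q <= d;
endmodule
```

Variant B is intended to behave like Variant A only when en is synchronous to clk and stable around the rising edge; that assumption would need to be checked for each gate in a real design.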
There's no way of knowing why this routed OK on the older device vs. the Arria 10. Did the design change at all? Were there other assignments involved?
#iwork4intel
AEsqu
Contributor
6 years ago
Hi sstrell,
I tested with global clock usage and that solves the problem for that clock.
But then on the next clock gate that follows that clock there is again about 3 ns of extra delay.
For some reason Quartus 13.1 and/or the Stratix III handled clock gating much better than Quartus 19.3 and/or the Arria 10.
Our design defines about 200 clocks and has thousands of clock gates (for low power).
Attaching a picture of the long routing for the next clock gate after the global clock point.
AEsqu
Contributor
6 years ago
#idonotwork4intel
AEsqu
Contributor
6 years ago
I have been looking further into this.
Apparently Quartus 19.3, for the Arria 10 FPGA,
has an issue with clear/preset/clk constructs, which produce a combinational loop (this is not the case with Quartus 13.1 and the Stratix III).
Example below:
  if (!cd)      q <= `unitdelay 1'b0;
  else if (!sd) q <= `unitdelay 1'b1;
  else          q <= `unitdelay d;
end
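For reference, a self-contained version of this kind of construct might look like the sketch below. The module, port, and clock names are assumptions for illustration, and the unitdelay simulation delay is left out:

```verilog
// Sketch only: module, port, and clock names are hypothetical;
// the unitdelay simulation delay from the snippet above is omitted.
module ff_async_clr_set (
    input  wire clk,
    input  wire cd,   // active-low asynchronous clear (highest priority)
    input  wire sd,   // active-low asynchronous preset
    input  wire d,
    output reg  q
);
    always @(posedge clk or negedge cd or negedge sd) begin
        if (!cd)      q <= 1'b0;
        else if (!sd) q <= 1'b1;
        else          q <= d;
    end
endmodule
```

A register described this way needs both an asynchronous clear and an asynchronous preset, so a device whose registers provide only one of them has to build the other from surrounding logic.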

Combinational loop reported by the TimeQuest analyzer:
Found combinational loop of 3 nodes
Node "rfd_ic_i|u_top|u_core|u_flash_subsys|A_ip_pflash640k_atfc|u_controller|u_fmc_if|read_fail_sync_reg|q~1~la_mlab/laboutt[6]"
Node "rfd_ic_i|u_top|u_core|u_flash_subsys|A_ip_pflash640k_atfc|u_controller|u_fmc_if|read_fail_sync_reg|q~1|dataf"
Node "rfd_ic_i|u_top|u_core|u_flash_subsys|A_ip_pflash640k_atfc|u_controller|u_fmc_if|read_fail_sync_reg|q~1|combout"
Note the presence of la_mlab/laboutt[6] again.
How can this be solved while keeping the same RTL code?
Second (VHDL) example:
process(scl_clk_n, rstn, start_stage1, scantestmode)
begin
  if (rstn = '0') then
    start_stage2 <= '0' after delay_f;
  elsif (start_stage1 = '1' and scantestmode = '0') then
    start_stage2 <= '1' after delay_f;
  elsif (scl_clk_n'event and scl_clk_n = '1') then
    start_stage2 <= '0' after delay_f;
  end if;
end process;
Found combinational loop of 3 nodes
Node "rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array[1].A_flexcomm|A_flexcomm|A_bi2c_gen.A_bi2c_core|slave_detect_inst|start_stage2~1~la_mlab/laboutt[0]"
Node "rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array[1].A_flexcomm|A_flexcomm|A_bi2c_gen.A_bi2c_core|slave_detect_inst|start_stage2~1|dataf"
Node "rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array[1].A_flexcomm|A_flexcomm|A_bi2c_gen.A_bi2c_core|slave_detect_inst|start_stage2~1|combout"
AEsqu
Contributor
6 years ago
I have seen in the documentation that the Stratix III does not directly support an asynchronous preset, only an asynchronous clear:
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/qts/qts_qii51006.pdf
"
Register Control Signals: Avoid using an asynchronous load signal if the design target device architecture does not include registers with dedicated circuitry for asynchronous loads. Also, avoid using both asynchronous clear and preset if the architecture provides only one of these control signals. Stratix III devices, for example, directly support an asynchronous clear function, but not a preset or load function. When the target device does not directly support the signals, the synthesis or placement and routing software must use combinational logic to implement the same functionality. In addition, if you use signals in a priority other than the inherent priority in the device architecture, combinational logic may be required to implement the necessary control signals. Combinational logic is less efficient and can cause glitches and other problems; it is best to avoid these implementations.
"
So I have been looking further into it:
Synplify implements the clear/preset flip-flop as a latch plus a FF, which prevents timing analysis from being done on it and avoids the combinational loop in the Quartus timing check.
As a result, those huge nonsensical delays are absent.
Quartus synthesis implements it as a normal FF with combinational logic, which leads to nonsensical routing and timing analysis.
Would it be possible to tell Quartus to implement a latch to solve this issue?
We won't change the RTL code; we use this code for the chip and never write FPGA-specific code.
See the attachment with pictures showing this.
clear_preset_ff_implementation_difference_synplify_quartus.zip (170 KB)
AEsqu
Contributor
6 years ago
Neither the Stratix III nor the Arria 10 handbook shows an aset in the ALM:
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stx3/stratix3_handbook.pdf
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/arria-10/a10_handbook.pdf
but the Quartus RTL viewer shows it (so it must be a combination of the FF and the surrounding logic in the ALM).
This is not the case for the Arria 10 (simple FF).
I'm attaching an RTL view in Quartus 13.1 with the Stratix III VQM from Synplify Pro P-2019.09-SP1 (async inputs are indicated).
The flops from the Synplify Pro VQM are present in the RTL viewer of Quartus 19.3 for the Arria 10, but without async inputs.
AEsqu
Contributor
6 years ago
And the view for the Arria 10, with a VQM from Synplify Pro.