Forum Discussion
28 Replies
- KhaiChein_Y_Intel
Regular Contributor
Hi,
Can you provide the design.qar for investigation?
Thanks.
- AEsqu
Contributor
No, I won't share it.
Another person has the same issue:
https://forums.intel.com/s/question/0D50P00004OZtoI/timing-closure-on-arria-10?language=en
- AEsqu
Contributor
For example, the following path (gated clock block output to FF):
rfd_ic_i|u_top|u_core|u_rfd_clockshop|i_mcu_flexcomm1_clockgate|i01_cnhlspd|Q -> rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array_1__A_flexcomm|genblk1_A_flexcomm|A_flexcomm_ctrl_gen_A_flexcomm_ctrl|A_flexcomm_fifo|A_flexcomm_fifo_ptrs_rx|rptr_gray_0_
takes 0.68 ns in Quartus 13.1 on Stratix III
and takes 9.1 ns (!) in Quartus 19.3 on Arria 10.
Why is Quartus adding so much delay to that path on the Arria 10?
Attaching 4 pictures showing this.
The report command was:
report_timing -from_clock { flexcomm1_hclk } -to_clock { flexcomm1_hclk } -from {rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array_1__A_flexcomm|genblk1_A_flexcomm|A_flexcomm_ctrl_gen_A_flexcomm_ctrl|A_flexcomm_fifo|A_flexcomm_fifo_ptrs_rx|rptr_0_} -to {rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array_1__A_flexcomm|genblk1_A_flexcomm|A_flexcomm_ctrl_gen_A_flexcomm_ctrl|A_flexcomm_fifo|A_flexcomm_fifo_ptrs_rx|rptr_gray_0_} -hold -npaths 100 -detail full_path -panel_name {Report Timing}
- sstrell
Super Contributor
There are a few things going on here. Extra delay is good for hold (in this case, removal) analysis. Remember that for hold/removal analysis, you want the signal to remain active longer to meet the timing requirement after the latch edge. So the issue here is the delay of the clock to the destination register (the data required path), not the control signal itself (data arrival path). The clock skew of 11 ns shown at the top of the screenshot is a quick giveaway to the problem.
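As a rough sketch of why a large skew hurts here, the hold (or removal) check can be written as follows. The symbol names are assumptions chosen for illustration, launch and latch edges are taken to coincide, and clock uncertainty is ignored:

```latex
\[
\begin{aligned}
\text{slack}_{\text{hold}}
  &= t_{\text{arrival}} - t_{\text{required}} \\
  &= \left(t_{\text{clk,src}} + t_{co} + t_{\text{data}}\right)
   - \left(t_{\text{clk,dst}} + t_{h}\right) \\
  &= t_{co} + t_{\text{data}} - t_{h} - t_{\text{skew}},
  \qquad t_{\text{skew}} = t_{\text{clk,dst}} - t_{\text{clk,src}}
\end{aligned}
\]
```

With a skew of roughly 11 ns, the data delay has to be of the same order for the slack to stay non-negative, which would be consistent with the fitter adding about 9 ns of routing delay on the data path.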
It looks like the clock is being routed through device logic instead of a global clock routing channel because you have a gated clock. If you must gate the clock, it's usually best to move the gating condition onto the clock enable input of the destination register instead of placing logic in the clock path. That would probably fix this issue. You could also try forcing the clock onto a global routing channel using the Global Signal assignment in the Assignment Editor, but the gating logic would still require the clock to come off the global routing channel, potentially adding delay.
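A minimal sketch of the clock-enable style, for illustration only. The module, port, and parameter names are hypothetical and are not taken from the design discussed in this thread:

```verilog
// Hypothetical example: the gating condition drives the register's
// clock enable, so the clock itself stays ungated and can remain on
// a global clock network.
module ce_register #(
    parameter WIDTH = 8
) (
    input  wire             clk,     // free-running (ungated) clock
    input  wire             rst_n,   // asynchronous active-low clear
    input  wire             enable,  // condition that would otherwise gate the clock
    input  wire [WIDTH-1:0] d,
    output reg  [WIDTH-1:0] q
);
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            q <= {WIDTH{1'b0}};
        else if (enable)
            q <= d;                  // q holds its value while enable is low
    end
endmodule
```

Functionally the register holds its value while `enable` is low, as it would with a stopped clock, but no logic is inserted in the clock path.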
There's no way of knowing why this routed OK on the older device vs. the Arria 10. Did the design change at all? Were there other assignments involved?
#iwork4intel
- AEsqu
Contributor
Hi sstrell,
I tested with global clock usage, and that solves the problem for that clock.
But on the next clock gate that follows that clock, there is again about 3 ns of extra delay.
For some reason, Quartus 13.1 and/or Stratix III handled the clock gating much better than Quartus 19.3 and/or Arria 10.
Our design defines about 200 clocks and has thousands of clock gates (for low power).
Attaching a picture of the long routing for the next clock gate after the global clock point.
- AEsqu
Contributor
I have been looking further into this.
Apparently Quartus 19.3, for the Arria 10 FPGA,
has an issue with clear/preset/clk constructions, which produce a combinational loop (this is not the case with Quartus 13.1 and Stratix III).
Example below:
  if (!cd) q <= `unitdelay 1'b0;
  else if (!sd) q <= `unitdelay 1'b1;
  else q <= `unitdelay d;
end
Combinational loop reported by the TimeQuest analyzer:
Found combinational loop of 3 nodes
Node "rfd_ic_i|u_top|u_core|u_flash_subsys|A_ip_pflash640k_atfc|u_controller|u_fmc_if|read_fail_sync_reg|q~1~la_mlab/laboutt[6]"
Node "rfd_ic_i|u_top|u_core|u_flash_subsys|A_ip_pflash640k_atfc|u_controller|u_fmc_if|read_fail_sync_reg|q~1|dataf"
Node "rfd_ic_i|u_top|u_core|u_flash_subsys|A_ip_pflash640k_atfc|u_controller|u_fmc_if|read_fail_sync_reg|q~1|combout"
Note the presence of la_mlab/laboutt[6] again.
How can I solve this issue while keeping the same RTL code?
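For readers following along, the fragment above presumably sits inside an always block similar to the sketch below. The module name, the clock name `clk`, and the sensitivity list are assumptions, since the original post does not show them, and the `unitdelay` macro is left out:

```verilog
// Hypothetical reconstruction of a flip-flop with both an asynchronous
// clear (cd) and an asynchronous preset (sd), both active low, with
// clear taking priority over preset.
module dff_clr_set (
    input  wire clk,  // assumed clock name, not shown in the original post
    input  wire cd,   // asynchronous clear, active low
    input  wire sd,   // asynchronous preset, active low
    input  wire d,
    output reg  q
);
    always @(posedge clk or negedge cd or negedge sd) begin
        if (!cd)
            q <= 1'b0;
        else if (!sd)
            q <= 1'b1;
        else
            q <= d;
    end
endmodule
```

On a device whose registers have a dedicated asynchronous clear but no preset, the second asynchronous control has to be emulated with extra logic around the register, which is where a loop like the one reported above can come from.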
Second (VHDL) example:
process(scl_clk_n, rstn, start_stage1, scantestmode)
begin
  if (rstn = '0') then
    start_stage2 <= '0' after delay_f;
  elsif (start_stage1 = '1' and scantestmode = '0') then
    start_stage2 <= '1' after delay_f;
  elsif (scl_clk_n'event and scl_clk_n = '1') then
    start_stage2 <= '0' after delay_f;
  end if;
end process;
Found combinational loop of 3 nodes
Node "rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array[1].A_flexcomm|A_flexcomm|A_bi2c_gen.A_bi2c_core|slave_detect_inst|start_stage2~1~la_mlab/laboutt[0]"
Node "rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array[1].A_flexcomm|A_flexcomm|A_bi2c_gen.A_bi2c_core|slave_detect_inst|start_stage2~1|dataf"
Node "rfd_ic_i|u_top|u_core|u_atlas|A_flexcomm_array[1].A_flexcomm|A_flexcomm|A_bi2c_gen.A_bi2c_core|slave_detect_inst|start_stage2~1|combout"
- AEsqu
Contributor
I have seen in the documentation that Stratix III does not directly support a combined clear/preset implementation:
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/qts/qts_qii51006.pdf
"
Register Control Signals
Avoid using an asynchronous load signal if the design target device architecture does not include registers with dedicated circuitry for asynchronous loads. Also, avoid using both asynchronous clear and preset if the architecture provides only one of these control signals. Stratix III devices, for example, directly support an asynchronous clear function, but not a preset or load function. When the target device does not directly support the signals, the synthesis or placement and routing software must use combinational logic to implement the same functionality. In addition, if you use signals in a priority other than the inherent priority in the device architecture, combinational logic may be required to implement the necessary control signals. Combinational logic is less efficient and can cause glitches and other problems; it is best to avoid these implementations.
"
So I have been looking further into it:
Synplify implements the clear/preset flip-flop as a latch plus a FF, which prevents timing analysis of that path and avoids the combinational loop in the Quartus timing check.
As a result, those huge nonsensical delays are absent.
Quartus synthesis implements it as a normal FF with combinational logic, which leads to nonsensical routing and timing analysis.
Would it be possible to tell Quartus to implement a latch to solve this issue?
We won't change the RTL code; we use the same code for the chip and never write FPGA-specific code.
See the attached pictures showing this.
- AEsqu
Contributor
Neither the Stratix III nor the Arria 10 handbook shows an aset input in the ALM:
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stx3/stratix3_handbook.pdf
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/arria-10/a10_handbook.pdf
but the Quartus RTL viewer shows it (so it must be a combination of the FF and surrounding logic in the ALM).
This is not the case for the Arria 10 (simple FF).
I'm attaching an RTL view from Quartus 13.1 with the Stratix III vqm from Synplify Pro P-2019.09-SP1 (async inputs are indicated).
With the vqm from Synplify Pro, the flops appear in the RTL viewer of Quartus 19.3 for the Arria 10 without async inputs.
- AEsqu
Contributor