OK, I have new information.
The module which carries the PCIe core is instantiated with a Verilog generate structure, like this:
parameter enable_blob = 1; // 1 = yes, 0 = no
generate
begin: qsys_design
    if (enable_blob == 1'b1)
    begin: synth_qys_inst
        blob blob_inst (
            . . .
        );
    end
end
endgenerate
The PCIe core is instantiated within a wrapper which is instantiated in 'blob'. This version of the design will not make timing.
So I built a simplified version of the design, which has the PCIe wrapper instantiated in the top-level module; the only other things in that module are a PLL to generate system clocking, a reset control module, and a dummy 'terminator' to prevent the compiler from optimizing out the PCIe core. The simple design makes timing, and I am able to partition the critical portions of the DMA controller, which are:
. . .|pcie_wrapper_inst|u0|pcie_a10_hip_0|pcie_a10_hip_0|g_avmm_256_dma.avmm_256_dma.altpcieav_256_app
. . .|pcie_wrapper_inst|u0|pcie_a10_hip_0|pcie_a10_hip_0|g_dmacontrol.dmacontrol.dma_control_0
and set the preservation level for those partitions to 'Final'.
I was then able to hack in various other parts of the design (DDR4 controller, 10GbE MAC/PHY, bus bridges and various logic functions) and connect them in a more or less meaningful way, while maintaining the simplified hierarchy, which does not include the Verilog generate structure (and this seems to be important). This design makes timing, with ~280 ps of positive slack.
So I started rebuilding my original design. I started with just the PLL, the reset control module, and the PCIe wrapper/core, but this time instantiated with the generate structure and the extra layer of hierarchy above the PCIe core. This does not make timing. But if I remove the generate structure, and otherwise leave the hierarchy unchanged, it does make timing.
So to sum it up:
top>generate:blob>pcie_wrapper>pcie_core - does not make timing
top>blob>pcie_wrapper>pcie_core - does make timing
top>pcie_wrapper>pcie_core - does make timing
The difference between the good and bad results is about 1.2 ns of slack in the 250 MHz clock domain, and the timing failures are confined to the DMA portion of the PCIe core.
I think I can replace the generate structure (which is there to allow simulation of the top-level design without the PCIe or DDR cores, which slow down simulation and are not always needed) with simple conditional compilation directives (`ifdef / `endif). So I think I have a solution, but I am still curious why the generate structure seems to be causing so much trouble.
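A minimal sketch of what that replacement might look like. The macro name EXCLUDE_BLOB, the port list, and the empty 'blob' stub are all hypothetical placeholders, not taken from the real design:

```verilog
// Hypothetical stand-in for the real 'blob' module; the ports are illustrative only.
module blob (
    input wire clk,
    input wire rst_n
);
endmodule

module top (
    input wire clk,
    input wire rst_n
);

// EXCLUDE_BLOB is an assumed macro name. When it is not defined (the
// default, e.g. for synthesis), 'blob' is instantiated directly in 'top',
// so no generate block scope appears in the instance hierarchy.
`ifndef EXCLUDE_BLOB
    blob blob_inst (
        .clk   (clk),
        .rst_n (rst_n)
    );
`endif

endmodule
```

Using `ifndef rather than `ifdef keeps the synthesis build unchanged when nothing is defined, which mirrors the old default of enable_blob = 1. A simulation run that does not need the PCIe or DDR cores would then define EXCLUDE_BLOB through whatever mechanism the simulator provides (many accept a +define+ style option).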