Forum Discussion

Altera_Forum
Honored Contributor
15 years ago

Anyone with experience compiling a Stratix IV device at over 95% utilization?

I'm having trouble reaching timing closure on a device that is over 95% full (530 GX). The design is also fast and consumes all the memory (I figured I should get my money's worth). Of course, the remaining 5% of logic is larger than some of the largest devices of 5 years ago. Has anyone else had a large device that is this full?

We are currently focused on Incremental Compile, which we have been fighting with for 4 years (since 6.0). It looks like it may actually be working in 10.0 SP1. Prior to this it was unusable and my team was stuck doing flat compiles.

20 Replies

  • Altera_Forum

    --- Quote Start ---

    Hi,

    Sounds really difficult. Do you still have partitions in your design? Does 95% utilization mean logic cells, memory, ...? Can you post your design summary here?

    Kind regards

    GPK

    --- Quote End ---

    100% of memory and 95% of logic.

    I have another question about how the tool allocates memory. The fitter seems to use 100% of the memory and then starts chewing up internal logic. Does anyone know of any documentation that describes this?
  • Altera_Forum

    The fact that you've got a Stratix IV 530 at nearly 100% utilization means you're no novice, so forgive me if I state some things here that are obvious to you.

    1 - The single change that can have the most impact on timing and resource utilization is modifying the source RTL itself. This is usually the first place to look when timing fails. Often the RTL results in long chains of combinational logic that the designer was unaware of. Of course, you may not be able to do this if much of the logic lies within purchased third-party cores for which you don't have the source code.

    2 - Timing constraints - Are you sure you've constrained the timing so that Quartus isn't doing unnecessary work trying to meet timing on paths that really aren't important (clock domain crossings, for instance)? The fitter may spend precious routing and logic resources on paths that don't deserve them, making it difficult for the entire design to fit.

    3 - Have you tried LogicLock regions yet? Physically constraining where the fitter may place certain modules lets you supply human insight that can help the fitter dramatically. You know the design. You know what connects to what. Locking adjacent logic into a region can be a big help.

    4 - Have you tried physical placement constraints? Similar to LogicLock, you may need to manually restrict the placement of certain elements to specific LABs in the device. Often this can be scripted.

    5 - You probably have the entire design set to be optimized for speed. Are there any modules in the design that don't need to be optimized for speed? Are you aware that you can set the optimization technique separately for each module in the hierarchy? So you might optimize the entire design for speed and target specific modules to optimize for size, or vice versa.

    6 - Have you tried tweaking the fitter/placement effort multiplier? Try a value of 4 and see what happens.

    My personal opinion is that seed sweeps and DSE sweeps are the worst methods for tackling these issues. They consume an enormous amount of time and at the end of the day you're just trying to get lucky. And if you do get it to work a single change to the design may throw the whole thing off again. Find the problem, fix it, then move on to the next problem.

    Jake
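
    As a rough illustration of point 2, a TimeQuest SDC fragment along the following lines tells the tools which clock crossings need no timing effort. The clock names, port names and periods are hypothetical placeholders, and declaring two domains asynchronous is only safe if every crossing between them goes through a synchronizer or an asynchronous FIFO.

    ```tcl
    # Illustrative SDC fragment. clk_a, clk_b, the port names and the
    # periods are hypothetical; substitute the clocks of your own design.
    create_clock -name clk_a -period 6.400 [get_ports clk_a_in]
    create_clock -name clk_b -period 10.000 [get_ports clk_b_in]

    # Create the PLL output clocks and apply clock uncertainty.
    derive_pll_clocks
    derive_clock_uncertainty

    # Treat the two domains as asynchronous so that paths between them
    # are not analyzed or optimized. Assumes all crossings are
    # synchronized in the RTL.
    set_clock_groups -asynchronous -group {clk_a} -group {clk_b}
    ```

    With over 200 clocks, most of them generated by memory IP, it is worth reviewing the clock transfer report after adding groups like this to confirm that only the intended crossings were cut.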
  • Altera_Forum

    --- Quote Start ---

    The fact that you've got a Stratix IV 530 at nearly 100% utilization means you're no novice, so forgive me if I state some things here that are obvious to you.

    1 - The single change that can have the most impact on timing and resource utilization is modifying the source RTL itself. This is usually the first place to look when timing fails. Often the RTL results in long chains of combinational logic that the designer was unaware of. Of course, you may not be able to do this if much of the logic lies within purchased third-party cores for which you don't have the source code.

    We had most of the design blocks running fine in 90 nm (Stratix II GX). The problem is IC delay.

    2 - Timing constraints - Are you sure you've constrained the timing so that Quartus isn't doing unnecessary work trying to meet timing on paths that really aren't important (clock domain crossings, for instance)? The fitter may spend precious routing and logic resources on paths that don't deserve them, making it difficult for the entire design to fit.

    3 - Have you tried LogicLock regions yet? Physically constraining where the fitter may place certain modules lets you supply human insight that can help the fitter dramatically. You know the design. You know what connects to what. Locking adjacent logic into a region can be a big help.

    4 - Have you tried physical placement constraints? Similar to LogicLock, you may need to manually restrict the placement of certain elements to specific LABs in the device. Often this can be scripted.

    5 - You probably have the entire design set to be optimized for speed. Are there any modules in the design that don't need to be optimized for speed? Are you aware that you can set the optimization technique separately for each module in the hierarchy? So you might optimize the entire design for speed and target specific modules to optimize for size, or vice versa.

    We are now doing this. Like I mentioned earlier, we are in Stratix IV, and 9.x was terrible at preserving placement via incremental compile (it could not remember previous fitter results). We are on 10.0 SP1 and things seem to be working as we expect.

    6 - Have you tried tweaking the fitter/placement effort multiplier? Try a value of 4 and see what happens.

    My personal opinion is that seed sweeps and DSE sweeps are the worst methods for tackling these issues. They consume an enormous amount of time and at the end of the day you're just trying to get lucky. And if you do get it to work a single change to the design may throw the whole thing off again. Find the problem, fix it, then move on to the next problem.

    Jake

    --- Quote End ---

    1/ We had most of the design blocks running fine in 90 nm (Stratix II GX). The problem is mostly IC delay.

    2/ Agreed. We are quadruple checking. Multi-corner analysis is eating up lots of interconnect. This is required for the types of memory. There are over 200 clocks in the design. These mostly come from Altera memory IP (we have 4 clock domains).

    3/ I'm a long-time user of Altera tools. I can never get a straight answer out of Altera as to whether LogicLock pays any dividends. I guess every design is different, so it's hard to say for sure. The 530 seems to have crossed the line where there is no denying that floorplanning and design partitioning are the only answer.

    4/ Working on it.

    5/ We now have about 10 partitions and are optimizing each one for area and then trying to meet timing (gradually cranking up the speed setting if needed). This seems to be working. Once the entire design is locked and closed wrt timing we will run out of some type of resource.

    6/ Fitter effort definitely helps with design partitions. For the flat compiles it is not much help.

    The path that shows the most promise is incremental compile. As you can imagine, it is a slow process. I could write a book on this experience....

    Thanks for taking the time for the detailed reply. I will give an update on the final solution (I think 10.0 will bring closure).

    Best regards
  • Altera_Forum

    Just some brainstorming (I mostly use Cyclone, so some of this may not really apply to Stratix):

    - Try to reduce the amount of logic used (maybe at 90% it would fit like a charm), either by changing the RTL code and/or by changing compiler options (e.g. maximum register packing, optimize for area instead of optimize for speed).

    - You mentioned 200 clocks. Are the clocks with failing paths using global clocks or non-global routing resources? If they use non-global routing, try to reduce the number of clock domains so that only global resources are needed (this would also free routing resources for normal routing).

    To get an impression: what frequencies do you want to achieve, and what is your "typical" slack at the moment?

    Thomas
  • Altera_Forum

    --- Quote Start ---

    1/ We had most of the design blocks running fine in 90 nm (Stratix II GX). The problem is mostly IC delay.

    --- Quote End ---

    I had the same problem in a Stratix II GX project (EP2SGX60F1152C3). Our GX logic uses a lot of M4Ks, about 2/3 of them, and apparently that makes the signals cross the device from left to right and back.

    We reduced the logic by using a 192-bit data size instead of 216 bits; this reduced the utilization to 78% (down from 100+), but then we had to increase the speed to 175 MHz. The failing paths had IC delays of 75+%. I cured the problem by adding pipeline registers between a few blocks. This allows the router to cross the device in two steps and gives it some headroom for other signals.
  • Altera_Forum

    --- Quote Start ---

    Just some brainstorming (I mostly use Cyclone, so some of this may not really apply to Stratix):

    - Try to reduce the amount of logic used (maybe at 90% it would fit like a charm), either by changing the RTL code and/or by changing compiler options (e.g. maximum register packing, optimize for area instead of optimize for speed).

    - You mentioned 200 clocks. Are the clocks with failing paths using global clocks or non-global routing resources? If they use non-global routing, try to reduce the number of clock domains so that only global resources are needed (this would also free routing resources for normal routing).

    To get an impression: what frequencies do you want to achieve, and what is your "typical" slack at the moment?

    Thomas

    --- Quote End ---

    Thanks Thomas. The failing paths are on the heavily loaded global clock nets (155 and 311). I think one problem is that the device is running out of memory resources and the fitter then starts chewing up LABs. It's not clear to me what the algorithm is for selecting memory to move to LABs. After reviewing some of the results, I think it may have made bad choices.

    For now we are going down the incremental compile path and inserting pipeline stages where needed. This seems like it will work if we don't run out of space.

    BTW, do you know of anyone who has tried the "Team Based Flow"? We are also looking at that. So far the auto-generated makefiles seem all screwed up, but we can fix that. I can build a partition that takes up 10 percent of the design in 20 minutes. I can run 10 jobs in parallel for my 10 partitions. Then, in theory, there should just be a short compile to route and connect things. It should be done in 2 hours instead of 24; at least that is how it is advertised....
  • Altera_Forum

    --- Quote Start ---

    I had the same problem in a Stratix II GX project (EP2SGX60F1152C3). Our GX logic uses a lot of M4Ks, about 2/3 of them, and apparently that makes the signals cross the device from left to right and back.

    We reduced the logic by using a 192-bit data size instead of 216 bits; this reduced the utilization to 78% (down from 100+), but then we had to increase the speed to 175 MHz. The failing paths had IC delays of 75+%. I cured the problem by adding pipeline registers between a few blocks. This allows the router to cross the device in two steps and gives it some headroom for other signals.

    --- Quote End ---

    We are looking into the memory. We are definitely running out of it and the tool starts moving memory cells into logic.
  • Altera_Forum

    --- Quote Start ---

    Thanks Thomas. The failing paths are on the heavily loaded global clock nets (155 and 311). I think one problem is that the device is running out of memory resources and the fitter then starts chewing up LABs. It's not clear to me what the algorithm is for selecting memory to move to LABs. After reviewing some of the results, I think it may have made bad choices.

    For now we are going down the incremental compile path and inserting pipeline stages where needed. This seems like it will work if we don't run out of space.

    BTW, do you know of anyone who has tried the "Team Based Flow"? We are also looking at that. So far the auto-generated makefiles seem all screwed up, but we can fix that. I can build a partition that takes up 10 percent of the design in 20 minutes. I can run 10 jobs in parallel for my 10 partitions. Then, in theory, there should just be a short compile to route and connect things. It should be done in 2 hours instead of 24; at least that is how it is advertised....

    --- Quote End ---

    Hi,

    Some thoughts about the team-based flow ...

    In my view it would not be a good decision to use a "team based flow" as long as your device utilization is so high. Using partitions means that the design cannot be optimized as well as before, because the interfaces of the partitions are preserved; no optimization takes place across partitions. The next point is that it is recommended to register the inputs and outputs at the partition boundaries in order to prevent timing issues between partitions. This eats up some resources again. For the partitions you have to define LogicLock regions, which means you have to assign resources to the partitions. You cannot fill a LogicLock region to 100%, so you will waste some resources (memory especially could be a problem).

    To get the full advantage of the flow you have to use the preservation level "Placement & Routing". I assume you will run all your partitions separately and import them into your main project. By doing this you could run into some problems, especially with the clock control blocks. I'm not sure that Quartus is able to detect that two partitions use the same clock and that the clock control blocks (they are part of the placement and routing) could be merged.

    kind regards

    GPK
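
    For reference, a LogicLock region for one partition might look roughly like the following in the project's .qsf file. This is only a sketch: the region name, origin, size and instance path are hypothetical, and the assignment names are the ones Quartus II typically writes when a region is created in the GUI, so it is worth creating one region there first and comparing the result against your version.

    ```tcl
    # Hypothetical region for one partition. Name, origin, width, height
    # and the instance path are placeholders, not values from this thread.
    set_global_assignment -name LL_ENABLED ON -section_id part_a_region
    set_global_assignment -name LL_AUTO_SIZE OFF -section_id part_a_region
    set_global_assignment -name LL_STATE LOCKED -section_id part_a_region
    set_global_assignment -name LL_ORIGIN X10_Y10 -section_id part_a_region
    set_global_assignment -name LL_WIDTH 40 -section_id part_a_region
    set_global_assignment -name LL_HEIGHT 30 -section_id part_a_region

    # Place the partition's top-level instance inside the region.
    set_instance_assignment -name LL_MEMBER_OF part_a_region -to "part_a:u_part_a" -section_id part_a_region
    ```

    A fixed-size, locked region like this is where the waste described above comes from: any logic or memory inside the rectangle that the partition does not use is harder for the rest of the design to reach.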
  • Altera_Forum

    --- Quote Start ---

    We are looking into the memory. We are definitely running out of it and the tool starts moving memory cells into logic.

    --- Quote End ---

    If it is dumping memory into LABs, are you sure you are using the M9Ks efficiently in the first place? Building the equivalent of an M9K out of MLABs is going to eat up a lot of logic. Are you using many M9Ks at only 25% usage? Add attributes/synthesis directives to force small memories into MLABs. Sometimes rearranging a memory map can improve efficiency hugely, especially when using mixed widths in true dual-port mode. Simple dual-port supports up to 32/36 bits in mixed-width mode, whereas true dual-port supports only 16/18 bits.

    Another thing to check - I've just found a bug in Q9.1 and Q10 where it was auto-generating altshift_taps for me and eating up loads of memory unnecessarily (like 30+ M9Ks when it should have placed a couple of registers!). If you're sure you don't need them, turn them off in the project with:

    set_global_assignment -name AUTO_SHIFT_REGISTER_RECOGNITION OFF

    or, in the GUI, by going to Assignments > Settings > Analysis & Synthesis Settings > More Settings and setting Auto Shift Register Replacement to “OFF” (the default is “AUTO”).
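
    To get a feel for the "M9Ks at only 25% usage" question, a small Tcl helper like the one below can estimate how many M9K blocks a given memory needs and how full they end up. It is a back-of-the-envelope sketch, not the fitter's actual algorithm: the proc name is made up, it assumes 9216 bits per M9K with the depth x width shapes listed in the code, and it models the true dual-port limit simply as a maximum width of 18.

    ```tcl
    # Rough estimate of M9K usage for one memory (hypothetical helper).
    # Assumes a 9216-bit block usable in the depth/width shapes below.
    proc m9k_estimate {depth width {max_width 36}} {
        set best 0
        foreach {d w} {8192 1 4096 2 2048 4 1024 8 1024 9 512 16 512 18 256 32 256 36} {
            # Skip shapes wider than the port mode allows.
            if {$w > $max_width} { continue }
            # Blocks needed to cover the depth and the width (ceiling division).
            set rows [expr {($depth + $d - 1) / $d}]
            set cols [expr {($width + $w - 1) / $w}]
            set blocks [expr {$rows * $cols}]
            if {$best == 0 || $blocks < $best} { set best $blocks }
        }
        # Percentage of the allocated bits that actually hold data.
        set fill [expr {100.0 * $depth * $width / ($best * 9216)}]
        return [list $best [format "%.1f" $fill]]
    }

    puts [m9k_estimate 64 32]      ;# simple dual-port, widths up to 36
    puts [m9k_estimate 64 32 18]   ;# true dual-port, widths limited to 18
    ```

    Under these assumptions a 64 x 32 memory occupies one M9K at about 22% fill in simple dual-port mode and two M9Ks at about 11% fill in true dual-port mode, which is the kind of memory that is a candidate for MLABs or for merging with a neighbor.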
  • Altera_Forum

    And the magic answer is:

    Run a bunch of random seeds (takes days on a server farm, as a single run takes 24 hours).

    Change fitter and router effort to 4.0 and 3.0 respectively.

    Change synthesis from Speed to Balanced (dropped size by a few percent).

    Remove all design partitions (design partitions were taking us farther and farther away from the final solution). Once they were removed, the size of the design dropped from 95 to 90 percent full.
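
    For anyone wanting to try the same settings, the effort and seed changes can be sketched as .qsf assignments like the ones below. This assumes that "fitter effort" here means the placement effort multiplier mentioned earlier in the thread; the seed value is an arbitrary placeholder. The synthesis change is left as a comment because the exact assignment name for the optimization technique differs between Quartus versions and device families.

    ```tcl
    # Fitter and router effort as described above (assumed mapping to
    # the placement and router effort multipliers).
    set_global_assignment -name PLACEMENT_EFFORT_MULTIPLIER 4.0
    set_global_assignment -name ROUTER_EFFORT_MULTIPLIER 3.0

    # One seed per run; 7 is a placeholder. A seed sweep repeats the fit
    # with a different value each time.
    set_global_assignment -name SEED 7

    # Synthesis optimization technique: set to Balanced under
    # Assignments > Settings > Analysis & Synthesis Settings.
    ```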