Multi-Million Gate Design From RTL to GDS Using Synopsys Flow Within Less Than 10 Weeks

Yaron Lavi
Intel Corporation
[email protected]

SNUG Israel 2004

ABSTRACT

Synopsys offers a set of tools that enables a smooth flow from RTL to GDS (tape-out, TO) within a relatively short time and with only two major layout iterations. Although the RTL2GDS schedule depends heavily on design complexity, layout utilization, computing resources, head count and many other factors, we found a flow that does the job with a high confidence level and in approximately constant time for multi-million gate count projects. This paper presents a proven flow that we used in several projects, in which we took advantage of:
§ Design Compiler for synthesis
§ Physical Compiler + DFT Compiler for placement and scan insertion
§ Astro for clock tree, HFN and routing
§ PrimeTime for static timing analysis
The result is working silicon in all design target corners, which is being manufactured in high volume. It is fair to mention that tools from other vendors also support the design effort and validation, but the tools above are the backbone of the back-end flow.

Table of Contents
1.0 Introduction
2.0 Design flow
2.1 RTL Verification
2.2 Professional Synthesis
2.3 Floor Planning (Update)
2.4 Physical Synthesis (G2PG)
2.5 Layout and Timing closure
2.5.1 Astro Physical stage
2.5.2 Full Timing model for Static Timing Analysis
2.5.3 Astro ECO mode
2.6 Final tuning
3.0 Conclusions and Recommendations
4.0 Acknowledgements

Table of Tables
Table 1 - Typical Synthesis results
Table 2 - Physical congestion typical results
Table 3 - Astro clock tree skews

Table of Figures
Figure 1 - General design flow
Figure 2 - Physical SCAN Chains
Figure 3 - Layout Typical Congestion
Figure 4 - Bonus and FIB cell scattering

1.0 Introduction

The increasing market demand for new communication products, with tough competition from powerful as well as new competitors, makes the slogan "Time To Market" a key element of success. Product lifetimes become shorter, and a fast ramp-up to High Volume Manufacturing is needed. As a result, design cycles are shortening and there is a need for a steady flow to meet the schedule constraints. This flow should support a high level of confidence, with only two major cycles of timing closure loops. We have to reduce the dependence of the schedule on gate count and design complexity, so that it becomes approximately constant for multi-million gate count projects.

In this paper, I will present our flow (and its cost) to meet the requirements of such a task. It is based on the Synopsys set of tools, with a flow that was proven in a couple of projects. The flow is based on 0.18um and 0.13um processes with designs of about 2M gates. A design of this size is beyond the tools' limitations (to handle as one chunk), so it is divided into several clusters that run in parallel. The design also includes hard macros such as memories and others. Most of the examples presented here are shown for one cluster of the design, in order to prevent IP disclosure.

An important point to emphasize is that any request or constraint has a cost to be paid. There is always room for additional margin to guarantee fast execution. The confidence level of quality and schedule that we have developed costs:
a. More computing resources
b. Many more licenses
c. Extra die area

Those targets are based on two major assumptions, which must be kept:
a. High quality of RTL code.
b. Using an advanced process for the communication application may have the advantage of adding extra cells with no impact on timing.

The paper covers the following design steps:
a. RTL to Gates
b. Gates to Placed Gates
c. Astro Physical Stage
   • Clock Tree & HFN
   • Routing
d. Timing closure
e. ECO flow
f. Full chip verification
2.0 Design flow

The design flow includes several steps and milestones, which can be viewed in general as described in Figure 1.

Figure 1 - General design flow (from product definition and RTL coding, through verification, floor-plan and synthesis, to layout, LVS/DRC and TO)

This paper focuses on the physical design flow (colored in Figure 1) and clarifies the way to get to Tape-Out within 10 weeks with one full synthesis loop and two major timing closure loops. The physical design can be summarized in the following points:
• RTL verification toward back-end readiness – pre-stage (before the schedule clock starts)
• Synthesis optimization (including clock gate insertion) – 2 weeks
• Physical synthesis (including scan insertion and DRC fixing) – 1 week
• Routing (including HFN and CT insertion) – 1 week
• Static Timing Analysis (full annotation) X2 – 2 weeks
• Logic + Timing ECOs X2 – 2 weeks
• Layout ECOs – 2 weeks

It is highly recommended to pass through such a process two times over an incomplete, non-ready design to make sure the flow and tools are working as expected. This is the time to raise issues of floor-plan (cluster partitioning), MAX delay paths, random logic that is too complex for the formal verification tools, etc. Of course, the physical design team should know the design and the methodology very well before starting this flow, because there is no place for re-work.

2.1 RTL Verification

One of the critical key milestones is the First Sign-Off, in which the entire RTL database is delivered to the Physical Design flow. At this point it is always worth validating that the database is ready. What is "ready" in the eyes of the physical designer? Our definition of a ready database is very tight and includes the following major points:
1. Synthesizable code
2. Synchronous design
3. No latches
4. No MAX delays [1]
5. Right use of pre-created special cells
6. Asynchronous paths defined and verified
7. Design for Testability verified (Scan, Memory BIST etc.)
8. All kinds of constraints files are ready
9. All design exceptions are approved

All those check points can be verified with several tools in the market. Some of them are Synopsys tools of the flow (like Design Compiler, Prime-Time), while others are specific for different aspects of Design Rule Checking (like SpyGlass, Logic Equivalence Checker, ATPG etc.).

The MAX delay margin is also an important parameter that should be taken into account at early stages of the design. Communication product frequencies are low compared with the advanced process being used. For example, typical frequencies are in the range of 40-160 MHz, with some exceptions of designs at 250 MHz. This is far away from the CPUs, which use a similar process but with GHz frequencies. Therefore, we define the clock uncertainty as 25-30% of the clock cycle at this stage (basic synthesis of RTL). This is a very high margin, causing all MAX delay violations to be solved by logic concepts (like pipelines) at early stages of the design. As a result, the physical designer hardly ever meets the time-consuming problem of MAX delay later in the flow, at the pre-stages of Physical-Compiler/Astro. Our concept will be explained in the next chapter.

An additional important verification, which is done at this point, is matching the floor-plan area against the gate count. All physical design clusters should have no more than 70% utilization. This utilization is considered low enough to absorb all the design "buffer" in the flow and to guarantee no need for floor plan changes. This is the last time to make any change in the floor plan, due to the major schedule impact.

Summary:
Keep verification simple and use conservative design rules
Start with low utilization to remove floor plan risk (pay with die)
Prevent MAX delay by high clock uncertainty to protect streaming of the flow (pay with area)

[1] Defining MAX delay is highly related to the clock definition. The margin defined in the clock uncertainty may cause unreal MAX delay violations.

2.2 Professional Synthesis

In this part we actually start the physical design flow, and the schedule clock starts to count. The constraints files are ready from the early/basic stage and no surprises should occur at this stage. We use all our computing resources and all available licenses at this stage to get the best gate-level results out of the RTL. I would expect any physical designer to know all kinds of synthesis methods and to use them over his block prior to the final run. This includes various types of synthesis like:
• Top down
• Bottom up
• Bottom up using characterized method
• Using advanced flow with DC_Ultra and DW_foundation when needed

Each design has its right approach to synthesis, and you can't know it from just looking over the code. The design exploration and elaboration is an effort that must be taken into account, but it seems the final drop of RTL always has its own secrets (especially when dealing with arithmetic data path blocks). This stage has some characteristics that can be similar to a "trial and error" method. But the efforts yield results: we can see netlists with up to 10% gate-count reduction. The maximum benefit is gained by a professional synthesis using additional licenses like Ultra and DW foundation. It is design dependent, but our experience showed a 5-10% reduction in gate count. Power Compiler with the standard clock gate cell reduces gate count by an additional ~7%.

The table below shows typical block synthesis results, including scan FFs and clock gates (Power Compiler). Moreover, when using a 0.13um process (as in the table), the extra synthesis MAX delay margin has only a minor effect on gate count and area. Those extra cells, which are the cost of the extra margin, reduce the delay in the flow to fix the MAX delay violations.

Clock uncertainty margin [%]   10       10       25       25       25*
Power Compiler                 ON       OFF      ON       OFF      OFF
Number of instances            42,412   46,840   44,631   48,502   50,212
NAND gate count [Kgate]        171.3    186.4    175.6    190.0    211.5
* - Basic synthesis

Table 1 - Typical Synthesis results

Summary:
Keep control over gate-count with all netlist changes (DFT and Power)

2.3 Floor Planning (Update)

In general, the floor plan stage is very early in the design flow. As a reminder, the main considerations for chip floor planning are:
• Die size
• Connection between macros/blocks to the pads
• Blocks interconnect
• Grid supply
• Amount of pins
• Complexity of connection in the gate area

I mention it at this point since it is the last point to make any modification with minor impact on schedule. The update is based on the final synthesis results. Blocks, macros and pads are verified once again to make sure that they can easily be placed on the die size. Also, the pins from each cluster should be placed in the logical order. The Jupiter tool, which can find the logical connection between the units, will place them accordingly, which will lead to group/region definitions. The results of the above are inserted back to Physical-Compiler for the placed synthesis.

2.4 Physical Synthesis (G2PG)

In this stage the gates are placed, and SCAN chains are built according to the placement of the FFs (Figure 2). Clusters are now at the level of 850K gates (equivalent NAND gates) and the run time is about 30 hours including all the reports. The utilization is now raised to 75% and may in extreme cases be raised up to 85%, which is our upper limit. Beyond this limit (utilization, gate count and instance count) the results become poor (MAX delay, congestion) and run time is much higher. Also, the Physical-Compiler tool can crash.

The design MAX delay margins are reduced to 20% of the clock cycle. The design passes through several iterations of optimization. The minimum is one PhyOpt and two additional incremental runs. An additional iteration of fixing MAX Transition and MAX Capacitance is done in order to remove those violations from the Layout stage. Usually, resizing solves these violations, but we also add extra buffers to split long or high fan-out nets.

Min delay violations (actually preventions) are handled too (there is no Clock Tree yet), based on statistical results. The approach that we use for MIN delay prevention is to add an extra buffer in any path that has the potential to become a MIN delay path. For example, "Back to Back" FFs are in this category. With this prevention, the number of iterations is reduced dramatically. All of that yields an "ECO" of about 7,000 different buffers inserted over a block with 25K FFs (180K instances).

Figure 2 - Physical SCAN Chains

The physical synthesis stage is also a good place to add logic ECOs. The ECO mode is very easy to use at the netlist level. In several cases, the results and run time are better that way.

As the congestion report below shows, the results of a typical block look quite good.

                                                   X             Y
Congestion threshold                               0.7000        0.7000
Violations (usage > threshold) / Number of edges   9323/95392    17046/95392
Average violation                                  0.0800        0.1797
Maximum violation (one value per direction)        0.6802 and 0.5357

Histogram for congestion on X
< 0.80:       **************************************** (92278)
0.80 - 1.00:  ** (3058)
1.00 - 1.20:  * (45)
1.20 - 1.40:  * (11)

Histogram for congestion on Y
< 0.80:       **************************************** (80945)
0.80 - 1.00:  ******* (13724)
1.00 - 1.20:  * (717)
1.20 - 1.40:  * (6)

Table 2 - Physical congestion typical results

The database is now defined as a verilog netlist, a PDEF file and SDC files. Two SDC files are defined: one for the Clock Tree and a relaxed one for the routing flow. The SDC file for the Clock Tree is the same one Physical-Compiler worked with, with a 20% clock uncertainty definition. For the routing, we use a relaxed uncertainty of 15%.
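The margin staging used up to this point (25-30% clock uncertainty at basic synthesis, 20% for physical synthesis and the Clock Tree SDC, 15% for the routing SDC) can be illustrated with a small calculation. This is only a sketch under simplifying assumptions: the stage labels and function names are made up for the illustration, the lower bound of the 25-30% range is used, and uncertainty is modeled simply as a fraction subtracted from the clock period.

```python
# Clock-uncertainty margin per flow stage, as a fraction of the clock cycle.
# Stage labels are illustrative; the percentages are the ones quoted in the text.
STAGE_UNCERTAINTY = {
    "basic_synthesis": 0.25,     # text quotes 25-30%; lower bound assumed here
    "physical_synthesis": 0.20,  # also used for the Clock Tree SDC
    "routing": 0.15,             # relaxed SDC for the routing flow
}


def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def usable_period_ns(freq_mhz: float, uncertainty: float) -> float:
    """Part of the clock period left for logic after the uncertainty margin."""
    return clock_period_ns(freq_mhz) * (1.0 - uncertainty)


def margin_table(freq_mhz: float) -> dict:
    """Usable period (ns, rounded to 4 decimals) for every stage."""
    return {
        stage: round(usable_period_ns(freq_mhz, u), 4)
        for stage, u in STAGE_UNCERTAINTY.items()
    }


if __name__ == "__main__":
    # 160 MHz is the upper end of the typical frequency range quoted above.
    for stage, usable in margin_table(160.0).items():
        print(f"{stage:20s} usable period = {usable} ns")
```

The usable period grows from stage to stage, which is the point of the staging: a path that met timing under the tighter early margin keeps slack in hand when the later stages relax the constraint.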
When logic ECOs are added to a design that is ready, Physical-Compiler adds and places the new gates with no major effect on timing.

2.5 Layout and Timing closure

2.5.1 Astro Physical stage

This is the first time we meet the layout tool (Astro), in which clock trees, reset trees and High FanOut nets are built. The placement, netlist and timing constraints are loaded to the tool and all cells are fixed. Our methodology uses the placement of Physical-Compiler as much as possible. This can be done since Physical-Compiler projects the layout timing very close to the results of PrimeTime. We also protect it by reducing the design margins as the design progresses. Only minor placement changes are allowed. At any stage, the layout tool doesn't make any DRC fixes (resize, add buffers) automatically. The reason is that the timing engine of Astro is still different from Prime-Time (the sign-off timing tool). Also, changing the design through the ECO flow makes it more controllable and accurate.

The Clock Tree stage starts with loading the SDC. This lets Astro understand the design constraints and the clock definitions. In order to verify that all the SDC constraints were read correctly, we dump them out for review. The first stage is done with tighter timing constraints. The constraints file now includes a clock uncertainty of 15-20% of the clock cycle, and the hold time margin is 100 – 350 pS. Now we have to check that the design can achieve the timing requirements without the impact of the nets, so we apply the "without interconnect" option and check the timing report. The design should meet timing at that stage.

The Clock Tree stage builds tree structures for all the FFs, to allow the clock signal to reach all those flops with the right drive. The Clock Tree creates levels of buffers, which correlate to the same clock nets. Of course, the Clock Tree could be built through gates which are not flops, and it is also possible to use a clock generated from another clock, but those values are taken into consideration. Astro gives us the option to interfere in the building process, to define the cells to be used and to optimize that tree.

An important stage is the clock tree optimization; for example, see Table 3. At the cost of insertion delay, the clock tree skew is reduced to the level where there are almost no MIN delay violations. This also depends on the level of prevention that we used before. We recognized that the clock tree optimization has to be changed manually only for the paths of special gated clocks with controlled FFs. This manual optimization is easier and faster than solving the MIN delay violations it creates [2].

Table 3 - Astro clock tree skews (short path, long path and skew for CLOCK1 and CLOCK2, with no optimization and after optimization)

[2] This is done manually after verification in Prime-Time at the first timing loop.

Now we are ready to run the HFO nets. Such a net is a wide connection net, e.g. many gates connected to the same net, which has no clock attribute. It is not recommended to treat the HFN nets (such as reset) as clocks, and that is why they need their own treatment. The result of using the HFO net command is a net built of buffers, which creates a structure like a clock tree.

Routing

Astro is used as a routing tool only. Like the previous stage, we start the routing stage by loading the SDC for the routing constraint definitions and dumping it out in order to verify that all the SDC constraints were read. Before we start to route, we check that we have blockages in the areas that we don't want to route on, like memory blocks or areas that are saved for other purposes. We can use a blockage for a specific metal (like a blockage for metal 1) or for all of them. We also load a route guide for special nets, like clocks, that we prefer to route in higher metal. The route starts with special and/or sensitive nets like clock nets, and proceeds with all the other nets in three steps:
• Global route – maps the general pathway through the design for each unrouted net (with no physical layer)
• Track assignment – assigns nets to wire tracks, then places wires and VIAs to show the initial routing configuration
• Detail route – performs detail routing on the design and then writes the violations to a routed cell

Lastly, we fix the violations with the search & repair command, which finds the violations and fixes them automatically. The tool makes almost all the layout DRC fixes. These are very fast runs, and we can complete a full cluster within 2 days. As Figure 3 shows, our typical block passes the routing stage with only one small area of high congestion.

Figure 3 - Layout Typical Congestion

The database is now a verilog netlist and a SPEF file produced by STAR-RC.

2.5.2 Full Timing model for Static Timing Analysis

The database, which includes the new netlist & SPEF files from Astro, is read into Prime-Time and verified in both fast and slow corners. This is the netlist for the 1st timing closure. This is the time to verify timing violations and fix them. All working modes have to be tested, such as: System mode, Production mode, Debug mode and SCAN mode. For each working mode there is a different constraints file according to the risk and the importance of the mode. For example, a different MIN delay margin is defined for SCAN in shift and capture mode.

It is highly important to verify that all the un-annotated net warnings are due to net branches that don't appear in the layout (and therefore are not a real problem). In case those warnings do point to real problems, the issue should be checked & fixed. All net name mismatches between the SPEF files & the netlist file should also be fixed.

Since this stage may require a lot of checks, it may be very time consuming. We need to reload all the constraints for every checked mode (clocks & external definitions, different case analysis, relevant RC files, etc.). When an update is done, we need to update the design for every checked mode. Even though we use high-powered machines, a full chip update may take more than 30 minutes. Since we may check up to 12 different modes (doubled, for both corners), it would be a waste of time not to run all of them in parallel. For handling all this parallel running, we use about 5-6 linux machines, with enough CPUs for running up to 3 primetime licenses per machine. Therefore, we built an environment that automatically creates elaborate reports for all relevant checked modes. At the last stages of the project, we reload the DB (netlist & RC files) for safety reasons.

The four main violations that are handled are max & min delay, cap & transition violations. All violations should be handled, for example by adding buffers, up/down sizing or changing the location of a cell. We may decide not to fix a cap violation up to a relevant percentage of the violation (for example, one which is below 10% of the driving cell ability). Our experience shows that only 300-500 paths need to be fixed. When all violations are fixed and verified in the Prime-Time environment, we load the netlist to Design-Compiler and use DC commands to add all the changes to the netlist.

The conclusion from this step is that Physical-Compiler has a very good projection of timing, and the combination of Physical-Compiler as a placer and Astro as a router yields high efficiency.

The updated netlist then goes back to the layout tool, where spare (FIB and bonus) cells are also added; the amount of that kind of cell is mainly driven by the cluster utilization and the risk of the block.
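The parallel STA set-up described in section 2.5.2 can be sketched as a run matrix of modes and corners distributed over machines with a fixed number of PrimeTime licenses each. This is an illustrative sketch only: the mode labels, host names and function names are assumptions, reading "doubled, for both corners" as modes multiplied by corners is also an assumption, and no real PrimeTime invocation is shown.

```python
from itertools import product

# Illustrative labels only; the text names System, Production, Debug and SCAN
# (shift and capture) modes, each checked in the fast and slow corners.
MODES = ["system", "production", "debug", "scan_shift", "scan_capture"]
CORNERS = ["fast", "slow"]


def build_runs(modes, corners):
    """One STA run per (mode, corner) pair."""
    return list(product(modes, corners))


def schedule_waves(runs, machines, licenses_per_machine=3):
    """Split runs into waves; a wave uses each license slot at most once.

    Returns a list of waves, each wave a list of (machine, run) pairs.
    """
    slots = [m for m in machines for _ in range(licenses_per_machine)]
    waves = []
    for start in range(0, len(runs), len(slots)):
        batch = runs[start:start + len(slots)]
        waves.append(list(zip(slots, batch)))
    return waves


if __name__ == "__main__":
    runs = build_runs(MODES, CORNERS)
    hosts = [f"host{i}" for i in range(5)]  # hypothetical machine names
    for i, wave in enumerate(schedule_waves(runs, hosts), start=1):
        print(f"wave {i}: {len(wave)} runs")
```

With 12 modes in 2 corners (24 runs) and 5 machines holding 3 licenses each (15 slots), this gives two waves of 15 and 9 runs. Since a full chip update may take more than 30 minutes per mode, that is the difference between hours of serial runs and roughly two update cycles.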
2.5.3 Astro ECO mode

The updated netlist is loaded to Astro and a compare file is produced. New cells are located manually, in order to prevent any new violation, like MAX transition, that the automatic placer can cause. At this stage we also add our FIB cells [3] and bonus cells [4]. Astro knows how to place them homogeneously in the design.

Figure 4 - Bonus and FIB cell scattering

We repeat steps 2.5.2 and 2.5.3 for one additional loop, in which the number of violations is reduced by a factor of 10 each time. In some cases there is an additional loop to fix 1-10 violations, but in general two iterations are enough to close all issues.

[3] FIB cells – Extra STD library cells in which all the interface connections go to the upper metal, for easy design changes in the lab.
[4] Bonus cells – Extra STD library cells which will be used for bug fixes by metal changes only.

2.6 Final tuning

In many cases, there is still a need to balance clocks or to add/remove delay due to external AC timing constraints. In some other cases, interconnects between clusters cause some violations and there is a need to make a small ECO again to solve issues like MAX capacitance. Those tasks are done only in one top-level cluster, which is kept open till TO.

3.0 Conclusions and Recommendations

The paper presents a concept of taking advantage of advanced processes and paying with some extra area in order to get a stable and predictable flow from RTL to GDS. It can be used in communication products, most of which have low frequency demands compared to the process. Using a set of tools from the Synopsys house seems to add another level of stability and predictability to the Tape-Out flow. This combination proved itself in our product line in the past and we continue to work that way. The bottom line is that the concept presented in this paper keeps us in the time frame and enables us to produce TOs.

4.0 Acknowledgements

I want to thank my colleagues who work with this methodology and helped me collect the data for this paper: Sagy Eick, Oren Mamet, Oded Pilowsky, Shai Michaeli.