
DesignCon 2002

Electrical And Physical Design Conference

Top Down Implementation of a 1.6M Gate ASIC

Chris Smith, Devin Bright

Synopsys Professional Services.

Abstract
This paper describes recent experiences using physical synthesis tools and top down timing and physical budgeting to complete a 1.6M gate, 0.25u ASIC. The focus of the paper is on the use of floorplanning, top level routing, physical synthesis, and static timing analysis tools to minimize or remove the need for costly back end iterations to close timing on large designs. The design in question contains several hard and soft proprietary IP blocks, including third party IP for a PCI controller, on-chip memory BIST, and LogicBIST. Top down timing and physical budgeting is applied to partition the design into 11 core functional blocks and 4 JTAG boundary scan blocks. Block level placement and top level inter-block routing occur early in the process to drive the refinement of the initial timing and area budgets. Each of the blocks is then independently synthesized, placed, and routed using Synopsys and Cadence tools. After all blocks are complete, they are re-assembled and merged with the top level routing to create a final design that meets the originally budgeted timing in a single pass. The methods used are not particularly design specific and can be applied to similar or even larger SoC designs to help shorten the overall design cycle. Since this is a confidential customer design of Synopsys Professional Services, the application in question is not disclosed. The tools and flow used were installed both at the customer's site and at Synopsys design centers.

Author Biographies
Chris Smith received his Bachelor of Engineering (Honors) in EE from the University of Wales Institute of Science and Technology (UWIST) in 1981. Since then he has worked continuously in the electronics and EDA industries; his first ASIC, designed in 1984/85 using schematic capture on a Daisy Logician workstation, was a 6K gate device in a 3u DLM process. He has worked for British Aerospace, LSI Logic, Motorola, and Cascade Design Automation. Since 1998 he has been a consultant for Synopsys Professional Services.
Devin Bright received his Bachelor of Science degree in Electrical and Computer Engineering in 1991 and his Master of Science degree in Electrical Engineering in 1993, both from the University of Iowa. Since 1999 he has been a consultant with Synopsys Professional Services, where he has assisted multiple customers in designing numerous complex ASICs. Prior to joining Synopsys he worked for Motorola, architecting cellular infrastructure equipment and designing the necessary embedded silicon.
The authors have worked together on various Synopsys Professional Services consulting projects, introducing design flow methodologies similar to those discussed in this paper across a wide range of applications and technologies.

Introduction
Top Down Flow Overview And Flow Comparison
Implementation of the ASIC is primarily a top down approach, but some steps, such as initial block sizing estimates, are performed in a bottom up manner where necessary to reduce iterations. In a top down approach the chip is implemented from the die size, I/O timing, and block placement requirements downwards, while in a bottom up approach the chip is implemented from block level synthesis results upwards.
Traditional floorplanning based design flow
In many design flows floorplanning tools have been in use for several years as an interface to detailed place and route. Floorplanning has typically been seen as a task that aids the back end design team and is usually performed late in the design implementation schedule; this does little to enhance the efficiency of the front end design team during the early stages of design realization. Feedback to the synthesis process occurs during the later stages of design implementation, in the form of RC data extracted post route, and in general timing closure becomes a process of the back end team iterating the floorplan and place and route and re-generating RC data for the front end team to analyze. In addition, for large SoC designs the critical top level (inter-block) timing and routing information is generally not available until after all of the blocks have been through place and route; this is especially true for design flows that flatten the complete design prior to place and route. The inefficiencies resulting from this late stage data feedback often result in needless and costly iterations between the front end and back end design teams.

Figure 1: New Tools Old Flow Methodology. (Flow diagram elements: front end block level synthesis with block level constraints and netlist assembly; back end floorplanning and place & route with RC extraction; chip level STA against chip level constraints; timing closure loop back to the front end.)

Floorplanning as a front end tool
The floorplanning stage of the design flow can, and should, be moved into the front end of the design process to speed up timing closure for large SoC designs. The flow depicted in Figure 1 usually requires the handoff of data between multiple teams and, as a minimum, delays full, accurate knowledge of the chip level timing until all blocks have been placed and routed along with the top level interconnect. Typically chip level constraints are created in a manual process from the block level constraints and are seen as a separate step from creation of the block level constraints; in many cases full chip level constraints are created after the fact from the individual block level constraints used for synthesis. In a top down approach, physical constraints (e.g. block abstract, pin placement, top level routes, etc.) and timing constraints are pushed down from the chip level to the block level early on and used to drive the parallel implementation of each block. The overall chip is sized based on early area estimates for soft macros, which can be derived from a quick synthesis run (not necessarily with finalized RTL); the designer's judgment and experience are needed to estimate sizes for incomplete or missing blocks. The overall core area is then partitioned amongst the blocks proportionally, based on the estimated area resources required and the overall timing requirements. Similarly, timing constraints are defined at the chip I/O level and pushed down to the block level. Special consideration, or budgeting, needs to occur for those paths that are inter-block in nature (i.e. beginning in one block and ending in another, potentially passing through multiple blocks en route). For such a path, the overall timing is budgeted amongst the blocks through which it passes, including net delays for top level routing paths.
Using PrimeTime to generate block level constraints from a set of chip level constraints is key to this flow. Early definition of top level routing and timing is essential for large designs if timing closure iterations are to be minimized.
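As a loose illustration of the proportional budgeting of an inter-block path described above, the arithmetic might be sketched as follows. The block names, delay estimates, and clock period here are invented example values; the real flow derived block budgets from the chip constraints using PrimeTime, not this simplified formula.

```python
# Illustrative sketch of proportional timing budgeting for an inter-block path.
# Block names, delay estimates, and the clock period are example values only.

def budget_path(clock_period_ns, top_route_delay_ns, block_delay_estimates):
    """Split the time left after top-level routing among the blocks a path
    crosses, in proportion to each block's estimated internal delay."""
    available = clock_period_ns - top_route_delay_ns
    total_est = sum(block_delay_estimates.values())
    return {blk: available * est / total_est
            for blk, est in block_delay_estimates.items()}

budgets = budget_path(
    clock_period_ns=7.4,           # roughly a 135 MHz clock
    top_route_delay_ns=1.2,        # estimated top-level net delay
    block_delay_estimates={"blk_a": 2.0, "blk_b": 1.0},  # quick-synthesis estimates
)
# blk_a, with twice the estimated delay, receives twice the budget of blk_b
```

If every block then meets its own budget, the overall path meets the chip level constraint by construction, which is the property the flow relies on.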
Figure 2: Floorplan Centric Flow. (Flow diagram elements: front end floorplanning, module stubs, top level netlist assembly, top level routing, RC extraction, block level synthesis, and chip and block level constraint generation form the primary timing closure loop around chip level STA; back end block level route and final RC extraction form the secondary timing closure loop.)

Such an approach of early constraint allocation creates an environment in which the blocks can be realized independently in parallel. When the blocks are completed then, by virtue of meeting all pushed down block level constraints, the overall chip can be assembled. Assuming an accurate budgeting process was performed, the composite chip is then well positioned to meet chip level constraints in a single pass. The main timing closure loop occurs in the front end of the design flow, where tradeoffs between block size, RTL implementation, and top level routing can more easily be made. By providing accurate top level routing and good block level routing estimates, it is possible to generate a consistent set of block constraints which, if all met individually, will result in overall chip level timing closure. The back end process then becomes the relatively simple task of completing block level routing for each block in the design, driven by the physical and timing constraints generated during floorplanning and top level routing.

Design Description
The ASIC is an element of a highly integrated processing subsystem for a next generation communication system. As with many typical communication systems, embedded memory requirements were large, consisting of approximately 1.9M bits of memory. This logical memory was implemented in nearly 200 discrete memory macros in order to maintain the required memory cycle times. Additionally, another 1.6M discrete gates were required to implement the desired system level functionality in the selected 0.25u, 5 metal layer process. The design is mostly synchronous, using primarily 4 clocks derived from two multiplying analog PLLs (APLLs), with a few additional, non-synchronous clocks associated with external peripheral interfaces. The fastest clock output from the APLLs, and also the most prevalent, was a 135 MHz clock.
Figure 3: Chip Top Level Blocks. (Floorplan diagram showing the core logic blocks, the JTAG BSR blocks along the I/O, the APLLs, and the PCI interface at the top level.)

Top Level Floorplanning


Die Sizing
The ASIC was sized at 9.25 mm by 9.25 mm. The die size is the primary physical constraint; within reasonable bounds it can be varied to trade off and optimize numerous technical and non-technical criteria. These criteria include packaging requirements, wafer yield (and associated cost per die), ease of place and route, and timing requirements. For a hierarchical implementation, spacing for inter-block routing at the top level must also be accounted for, and may be traded off against increasing the die size, or against decreasing block sizes and increasing potential congestion issues.
I/O Placement
The ASIC was designed using standard perimeter I/O; a key constraint driving the I/O placements was the signal groupings required by the board layout. The I/O were positioned around the die such that, when bonded out via the package, they were placed consistent with the requirements derived from the board layout and routing tracks. In addition, the need to manage timing skew across a wide bus was another constraint on I/O placement. The skew sensitive I/O signals were evenly distributed along the two sides of the die's lower right corner. Given that the communicating block emanated from the same corner, this approach minimized the difference between the farthest and nearest I/O in the bus, a first step at proactively managing the route variability, or skew, across the bus. For the ASIC, three separate power supplies (core, I/O, and analog) were required in varying quantities to supply various parts of the design. The core supply pads were evenly distributed around all four sides of the die to power a core assumed to be uniform in its dissipation. The I/O power pads were placed such that the peak I/O power dissipation between any two adjacent pads would be nearly consistent. The analog power pads were positioned in close proximity to the APLL macros that required them.
Beyond the macro positioning discussed so far, there were also micro positioning requirements that required attention. Most notable was the minimum I/O pitch requirement, which was satisfied by respecting the spacing rules between I/O cells. Less obvious was the minimum wire pitch requirement in place for the bonding wires; this could not be checked until the die size was fixed and the package substrate designed. The likelihood of impact from this rule was minimized by providing extra space between the I/O cells near the corners, where extreme (i.e. non-orthogonal) wire angles were most likely to result in a spacing violation.
Block Sizing and Placement
Block sizing and positioning is, by necessity, an iterative process in which block aspect ratios, block area, and overall placement must be balanced to produce a floorplan that will meet the die area, routeability, and timing requirements. While automatic block level placement may be used to provide an initial starting point, for complex SoC designs it is usually necessary to use manual intervention to create an optimal floorplan solution. Determination of an optimal floorplan requires that the tools used provide meaningful metrics and feedback by which to measure the quality of the floorplan. Chip Architect provides visual connectivity feedback in the form of rat's nest fly-lines between the blocks, and weighted net connectivity, which shows the number of connections between blocks. Quantitative metrics such as total Steiner route wire length, individual net global route lengths, or estimated resistance/capacitance from global routes may be used to monitor whether a given floorplan improves upon its predecessors.
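A common cheap stand-in for such a total-wirelength metric is half-perimeter wirelength (HPWL), which bounds the Steiner length from below. This sketch, with invented net names and coordinates, shows how successive floorplans can be compared on that basis; it is an illustration of the metric idea, not the estimator Chip Architect itself uses.

```python
# Half-perimeter wirelength (HPWL): a cheap proxy metric for comparing
# candidate floorplans. Net names and pin coordinates are example data.

def hpwl(pins):
    """Half the bounding-box perimeter of a net's pin locations (um)."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wirelength(nets):
    """Sum the HPWL of every net; lower totals suggest a better floorplan."""
    return sum(hpwl(pins) for pins in nets.values())

floorplan_a = {"net1": [(0, 0), (3, 4)], "net2": [(1, 1), (1, 5), (4, 1)]}
print(total_wirelength(floorplan_a))  # 7 + 7 = 14
```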

Figure 4: Different Connectivity Displays. (Left: rat's nest fly-line display; right: weighted connectivity display.)

For the ASIC, a rough estimate of each block's area was obtained by running a quick synthesis (using Synopsys Design Compiler) to obtain a gate/instance count, and then basing the block size on the instance and/or gate count placed at a certain density (typically 80% was used). A physical hierarchical partitioning was then selected which attempted to keep related pieces of logic together while still balancing block sizes and minimizing block interconnectivity; a balance needs to be struck between the size of the core blocks and their number. This ultimately resulted in 11 core logic blocks and 4 BSCAN blocks. Once identified, these block sizes were used to create the initial floorplan layout. The floorplan was created and manipulated using Chip Architect. The four BSCAN blocks had extreme aspect ratios; their purpose was to provide an array in which the JTAG Boundary Scan Registers (BSR) could be placed. These blocks were long and narrow, effectively spanning the I/O on a side. As an early floorplan quality check, the Chip Architect pin assignment and global router capabilities were used to identify any major problems. Sufficient block perimeter to support the pins at the desired pitch, as well as channel utilization, were two of the metrics monitored. By this approach, major top level routing problems were identified early on, when they were easiest to correct.
First Pass Block Pin Placement
Initial soft macro block pin placement is determined using Chip Architect in a top down manner, pin assignment being driven only by the top level interconnect of the design. Global routing and congestion analysis are used to determine the QoR of the pin placement, and manual or scripted modifications to block pin locations are made to improve QoR in an iterative process. Block level pin constraints are used to restrict pin placement to specific routing layers, required grid spacing, and specific sides of each block.
For simple pin assignment tasks (such as straight bus connections between blocks and I/O pads) the Chip Architect results, which are based upon a fly-line connectivity analysis, are usually satisfactory. For certain other regular pin assignments, scripts were generated to create design specific pin placements; a specific example of this is discussed below for the BSCAN blocks. Once the top level netlist is stable and the top level floorplan is reasonably defined, further refinements to the block pin assignment and a more complete analysis of the top level routing are performed using FlexRoute.
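The quick-synthesis block sizing described earlier (instance count placed at roughly 80% density) reduces to simple arithmetic. This sketch uses invented instance counts and cell areas; the real estimates came from Design Compiler results and designer judgment.

```python
# Sketch: turn a quick-synthesis instance count into an initial block size
# estimate at a target placement density (~80% was typical in this flow).
# The instance count and average cell area are invented example numbers.
import math

def block_dimensions(instance_count, avg_cell_area_um2, density=0.80,
                     aspect_ratio=1.0):
    """Return (width, height) in microns for a soft block sized so that
    the standard cells occupy `density` of its area."""
    area = instance_count * avg_cell_area_um2 / density
    height = math.sqrt(area / aspect_ratio)
    return aspect_ratio * height, height

w, h = block_dimensions(100_000, avg_cell_area_um2=50.0)
# area = 100000 * 50 / 0.8 = 6.25e6 um^2 -> a block roughly 2.5 mm square
```

Varying `aspect_ratio` while holding the area fixed is what the iterative block sizing step trades off against routeability and pin perimeter.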

BSR Pin Placement
The 4 BSCAN blocks had extreme aspect ratios that resulted in a limited number of routing resources across the shorter width of the block. To avoid congestion problems inside the blocks, the pin assignment for each BSCAN cell input had to be closely aligned with the associated BSCAN cell output. Since neither Chip Architect nor FlexRoute has the capability to match input/output pin pairs, scripts were developed to assist in pin placement for this special case. The pins on the I/O side of the blocks were automatically aligned with their respective I/O cells as part of the default Chip Architect pin assignment. The pins on the core side were then aligned, using a design specific Tcl script, with the matching pin on the I/O side of the block. Without this scripting methodology the core side pins would tend to clump together around the top level routing channels into which their connections flowed; while this would ease top level routing, it would cause significant routing congestion inside the BSCAN blocks. It was felt that the greater constraint on these pin placements was the limited routing resources inside the blocks rather than the top level routing resources, so a small penalty in top level routing was traded for easier to route BSCAN blocks.
Loose Top Level Cells
The design includes a few (fewer than 20) top level cells that are used to implement system and test clock multiplexing and APLL test control logic. These cells did not lend themselves well to fly-line driven auto placement, and it was found more effective to place them manually inside Chip Architect. Since in many cases they form the start, or top, of a clock tree network, they tended to be rather simple to locate in the center channels of the floorplan.
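The production pin-matching script was design-specific Tcl run inside Chip Architect; this Python sketch only illustrates the pairing idea. The pin naming convention (`bsr_in_N` pairing with `bsr_out_N`) and the offsets are invented for the example.

```python
# Sketch of the BSR pin-pairing idea: place each core-side pin at the same
# offset along the block edge as its matching I/O-side pin, rather than
# letting the router clump core-side pins near the top-level channels.
# Pin names and the pairing rule (bsr_in_N <-> bsr_out_N) are assumptions.

def align_core_pins(io_side_pins):
    """io_side_pins: {pin_name: offset_um along the long block edge}.
    Returns core-side pin placements mirrored to the opposite edge."""
    core = {}
    for name, offset in io_side_pins.items():
        # bsr_in_3 on the I/O side pairs with bsr_out_3 on the core side
        core[name.replace("_in_", "_out_")] = offset
    return core

io_pins = {"bsr_in_0": 12.0, "bsr_in_1": 48.0, "bsr_in_2": 84.0}
print(align_core_pins(io_pins))
```

Keeping each input/output pair at the same offset means every BSR connection crosses the narrow block dimension directly, which is what protects the scarce routing resources inside these extreme-aspect-ratio blocks.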

Top Level Pin Refinement And Routing


Top level routing was implemented using FlexRoute, a gridless, n-layer, object-based router. The initial pass of FlexRoute is used to refine the block pin locations based upon real routing considerations. FlexRoute performs Routing Based Pin Assignment (RBPA) by running multiple global routes in a process that aims to find the optimal pin location for each signal that traverses the top level. As in Chip Architect, pin locations are also driven by block specific parameters for preferred layers and sides of blocks. The primary objectives for pin assignment are to place the pins so that the top level routes are of minimal length and do not cause congestion hot spots between the blocks. This tends to imply that, when possible, pins are assigned such that straight line, single metal nets provide the top level connectivity. Congestion analysis can be performed and further constraints applied to specific nets or routing channels within the chip to reduce congestion hot spots. Once the pin assignments have been updated (and fed back to Chip Architect) a full detailed route of the top level nets can be performed in FlexRoute. Although FlexRoute is gridless and will perform pin placement based upon simply meeting design rules, it is necessary to restrict the block pin placement to be on grid; this means that the block level router (which is a gridded router) has a simpler task when routing to the block boundaries from the inside. The FlexRoute global and detail routing engines can take account of net or net segment specific parameters such as metal width and spacing, net shielding, preferred layers, and coupling capacitance avoidance. Utilizing increased spacing between critical routes, such as clock lines, helps to minimize susceptibility to noise caused by long routes running in parallel with each other. For this particular design all blocks were designated as total keepouts for top level wiring.
It is possible to route over the blocks by allowing one or more routing channels in different metal layers but this does add some complexity to the extraction of both top level and block level parasitics in isolation since the interaction of the over-block and internal block routing must be considered.

Top Level Repeater Insertion
In a bottom up approach, block output loading constraints are conservatively estimated, or block output drivers are set to the largest available driver in the library. Whether these estimates are correct, or whether the largest driver in the library is sufficient, is not known until the top level timing is analyzed. Typically the top level timing cannot be analyzed until the final block integration and detailed route stages of the design flow are completed, at which point iteration back to the block level and/or insertion of top level buffers is required to implement any changes. For a top down flow, such as the one outlined here, it is critical that accurate top level timing and loading information is available early in the design flow to drive the budget generation process. Top level routing can be completed in FlexRoute without any knowledge of the timing requirements, but for many nets the resulting RC network may be too large to be driven by the output of a block. As a result it is also necessary to insert repeater cells (buffers or inverters) at the top level of the design, both to maintain block input edge rates and to reduce the loading on each block's output driving cell. Following top level routing, FlexRoute is used to insert buffers based upon the top level RC network (extracted by FlexRoute) and the driving and load cells within the blocks.
The information regarding the driving and load cells, and any RC network inside the soft macro blocks, is provided to FlexRoute by Chip Architect in a file format known as TBEF (Timing-Based Exchange Format), which contains the following data:
* Driver cell name and the distributed RC network from the driver output to the block pin location
* Receiver cell names and the distributed RC network from the block pin location to the receiver cell input
* Input rise time (maximum transition time) at each receiver cell input
* Net delay constraint from the driver cell input to each receiver cell input
* Database library cell name and the circuit's operating conditions (process, voltage, and temperature values)
Figure 5 shows how the various drivers, loads and RC networks are modeled using TBEF.

Figure 5: Top Level TBEF Modeling. (Diagram showing which portions of the network are modeled by the load/drive data from PrimeTime, which by the TBEF file, and which internally by FlexRoute.)

FlexRoute uses PrimeTime's delay calculator; when it calculates the cell delays for the hierarchical drivers and repeaters, both the hierarchical drivers/loads and the repeaters are analyzed during optimization.
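As a loose illustration of the decision behind repeater insertion, a lumped RC slew check might look like the sketch below. The 2.2RC transition approximation and all numeric values are assumptions for illustration only; the actual flow used PrimeTime's delay calculation on the extracted distributed RC network, not this formula.

```python
# Sketch: decide whether a top-level net needs a repeater from a lumped RC
# estimate of its far-end transition time. The 2.2*R*C slew approximation
# and all values are illustrative assumptions, not the tool's algorithm.
# Note: kohm * pF conveniently yields nanoseconds.

def needs_repeater(driver_res_kohm, wire_res_kohm, wire_cap_pf,
                   load_cap_pf, max_transition_ns):
    """Approximate 10-90% transition at the far end of a lumped RC net and
    compare it against the maximum transition (edge rate) constraint."""
    slew_ns = 2.2 * ((driver_res_kohm + wire_res_kohm)
                     * (wire_cap_pf + load_cap_pf))
    return slew_ns > max_transition_ns

# A long top-level route: 2 kohm of wire resistance, 1.5 pF of wire cap
print(needs_repeater(1.0, 2.0, 1.5, 0.1, max_transition_ns=1.0))  # True
```

This corresponds to the "fix input edge rates only" mode described below: any net whose estimated transition exceeds the limit gets a repeater, independent of path timing.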

At early stages in the design flow, such as when blocks are incomplete, default values for driver/load cells and/or block internal RC networks can be used. For this design the initial repeater insertion flow was run in a mode whose aim was to fix input edge rates only; this results in a top level netlist and associated RC data that meets the technology's basic DRC rules. This netlist and top level RC data is used to accurately model the block level interconnect during the push down of the top level timing constraints and block level budget generation. After synthesis and placement of the blocks (with the generated block level budgets), repeater insertion was re-run to update the top level netlist with any necessary repeater size changes and/or new repeater insertions; during this run, net delay constraints were used on paths identified as critical during the top level timing analysis.
Pin Placement Analysis
As mentioned above, the QoR of a floorplan and associated pin placement can be judged by both qualitative (e.g. are the congestion hotspots getting smaller?) and quantitative (e.g. is the total wirelength reduced?) metrics. Figure 6 shows three steps in pin assignment using the same floorplan, along with the associated total wirelength.

Figure 6: Wire Length Results For Various Pin Placements. (Three snapshots of the same floorplan, with total wirelengths of 36,155u, 25,309u, and 21,766u respectively.)

The first snapshot is of the initial floorplan with no top level pin assignment, so pin placement is driven (upwards) purely by the block level pin requirements. The second snapshot shows the results of a top down pin assignment using Chip Architect. The third snapshot shows the results of a combination of scripted pin placements for the BSCAN blocks and RBPA from FlexRoute. Clearly both the actual floorplan layout and the block level pin placement have a dramatic effect on the total routing length (and, by inference, the congestion) at the top level of the design.
Block Level Placement
Once the block aspect ratios and pin assignments have stabilized, block level internal refinement and macro placement can begin. For those blocks that have macros (e.g. RAMs, ROMs, etc.), manual pre-placement must be performed. These macros were placed such that the placement array, the area in which the discrete standard cells will be placed, was contiguous and free of needless fragmentation. Close attention was paid to macro and obstruction placement so that all pins would be fully accessible at the time of route; this meant taking into account the location of the memories with respect to block edges and power supply straps and trunks. In the case of this ASIC, 5 blocks had approximately 180 discrete memories that required placement. These physical memories formed a few dozen larger logical memories; hence a significant portion of the placement exercise was properly grouping related memories. Upon completion of any macro level placement, the block level power plan is finished inside Chip Architect to provide a true representation of the placement keepout areas and blockages within the block. For this particular design a series of block level rings was used to provide block level power, as opposed to a global power striping approach. Once the block is completely described, the block physical information can be exported for use in Physical Compiler, where the timing constraints and RTL or gate level netlist (as was the case for this design) are read in to provide both physical and timing constraints. A Gates to Placed Gates flow was used for this design: the design was synthesized to gate level using Design Compiler, and then Physical Compiler was used to place and optimize the resulting netlist. This choice of flow was driven by the BIST test insertion tool's requirement to operate on a single, chip level, gate level netlist in order to insert the requisite test points and test control circuitry. The test tool's lack of technology specific knowledge regarding cell drive strengths and circuit optimization strategies required that a preliminary optimization be performed using Design Compiler prior to loading the design into Physical Compiler. A better approach would be to use test tools that operate at the RTL level and are cognizant of the design timing requirements; this would allow a more flexible RTL to placed gates flow to be utilized. As of the writing of this paper no such tools are openly available.
Physical Compiler optimizes the design based upon several cost functions, such as routing congestion, DRC costs (e.g. transition time, fanout, etc.), and timing, while performing standard cell placement. Such an approach allows for the prediction of routability and timing performance, along with the ability to fix most DRC problems. Since the block constraints are developed top down and are independent of each other, multiple Physical Compiler runs can be used to process the blocks in parallel. There is no need to wait for one or more blocks to finish in order to derive the constraints for other connected blocks; in this way the overall design cycle time can be reduced significantly. Due to the speed of the flow it was not found necessary to implement an ECO type flow since, once the flow was set up, block level changes could be implemented very quickly from scratch.
Block Level Congestion Analysis
Chip Architect allows the user to generate routing congestion maps based upon a global route. While this is not the actual final route, the level of detail used in the global route can be shown to model the final Cadence SE detail route to a high level of accuracy. The global route takes into account the various routing layers available, the number of individual wires that can fit in a given routing channel, and obstructions to routing such as power rails, wiring keepout areas, and cell ports. In the congestion map, areas highlighted as over 100% utilization should be examined carefully; in some cases the detail router will be able to detour around the congested areas to find a solution, while in other cases, such as those close to block perimeters or in between large hard macros, even a single point of 100% utilization may cause routing issues later in the flow. In general, the more highly congested the routing area, and the more widespread the congestion, the less likely it is that there will be good correlation between the Chip Architect route estimates and the Cadence SE detail routes.

Figure 7: Block Level Congestion Analysis. (Congestion map with lower congestion areas across most of the block and a higher congestion area near one edge.)

The congestion map for one of the larger blocks is shown in Figure 7. It is difficult to interpret in black and white, but below is a textual report of the congestion (the actual display in Chip Architect is color coded). The report shows the percentage of global routing resources for each layer that fall into a particular utilization bin; the rows do not all add up to 100% since fixed obstructions such as power straps are excluded.

Metal  Peak  <80  81-92  93-104
ME1    114   49   15     1
ME2    147   41   18     6
ME3    143   84   12     1
ME4    149   61   9      7
ME5    108   78   5      0

The report does indicate that small percentages of the ME1 and ME3 (horizontal route) resources are over-utilized, while slightly larger percentages of ME2 and ME4 (vertical routes) are also over-utilized. From a textual report it is not possible to determine whether this is a problem, and reference must be made to the graphic congestion map for analysis. As it turns out, the reported congestion is only an issue in one very tiny area of the block, due to a lack of horizontal resources between a RAM macro and a power strap on the left of the block. Since this is not necessarily obvious from a cursory glance, careful analysis of all potential congestion areas is required; 110% utilization may be fine in open areas of the block, but 100.01% utilization can be completely unrouteable in tight corners. To confirm the validity of the Chip Architect congestion analysis, a comparison of individual route lengths was made for several blocks. Individual net route data extracted from both Chip Architect global routes and Cadence SE detail routes was compared for total route length and showed very high correlation (almost 100%) for placements with no congestion above 90%, with decreasing correlation on longer routes as the peak congestion increases to over 100%. The larger the area of peak congestion, the more routes will deviate significantly from their Chip Architect estimates.
This information was used as an aid in making a decision as to whether a particular block would route in Cadence SE or whether further optimization in Physical Compiler and/or Chip Architect was necessary.
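Automating that route-or-optimize decision can start from the textual congestion report itself. The sketch below parses a report in the layer/peak/bins layout shown above and flags layers whose peak utilization exceeds a threshold; the parsing format and the threshold rule are assumptions modeled on the report in the text, not a documented tool interface.

```python
# Sketch: parse a per-layer congestion report (layer, peak utilization %,
# then percentage-of-resources bins) and flag layers whose peak exceeds a
# threshold. Format and threshold are assumptions based on the report above.

report = """\
ME1 114 49 15 1
ME2 147 41 18 6
ME3 143 84 12 1
ME4 149 61 9 7
ME5 108 78 5 0"""

def flag_layers(text, peak_limit=100):
    """Return the layers whose peak utilization exceeds peak_limit (%)."""
    flagged = []
    for line in text.splitlines():
        layer, peak, *bins = line.split()   # bins kept for completeness
        if int(peak) > peak_limit:
            flagged.append(layer)
    return flagged

print(flag_layers(report))  # every layer in this report peaks above 100%
```

As the text notes, a flagged layer is only a prompt for graphical inspection: a small over-utilized pocket in an open area may route fine, while the same figure in a tight corner may not.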

It should be noted that Physical Compiler also incorporates a route estimation engine, but it is optimized for quick estimation during synthesis and is not currently as detailed or accurate as the Chip Architect global route.
Block Level Timing Analysis
Block level timing analysis is performed using individual block level extracted parasitics generated in Chip Architect and/or Cadence SE. Waiting for post route Cadence SE parasitics is time consuming, so a necessary part of the flow was to re-load the Physical Compiler database containing the complete block level placements back into the Chip Architect hierarchy. With the standard cell placement loaded, Chip Architect can be used to perform a global route and then extract DSPEF for each block. Since the Chip Architect global route is orders of magnitude faster than a full detail route, this enables the designer to quickly obtain feedback as to whether the block will meet timing after route. To verify the accuracy of the routing estimates made by Chip Architect against SE, several large blocks were routed in Cadence SE and extracted to DSPEF using HyperExtract (full 3D extraction); the total (lumped) capacitance of each individual net was then compared and sorted into bins. Shown below is a typical difference histogram for one of the larger blocks in the design. As can be seen, the vast majority of the nets exhibit zero, or close to zero, difference; the nets in the outlying regions are typically those with the longer routes in the block, due either to high fanout or to scenic routes related to congestion issues. Note that following CTGEN buffer insertion, and the resulting shorter routes, clock nets do not typically fall into the outlying category.
Figure 8: Block Level DSPF Comparison (histogram of per-net capacitance delta, Chip Architect vs HyperExtract)

Based upon these results we felt confident proceeding with timing analysis without detail routing every block as the design evolved; we relied upon the Chip Architect estimates to indicate whether we would achieve timing closure. This brings one of the major feedback loops for timing closure into the realm of the front end design team.

Top Level Timing Analysis

For the top level of the design we can use either the Chip Architect generated DSPEF or the FlexRoute generated DSPEF. The drawback to using the Chip Architect DSPEF is that top level routes are typically significantly longer than those seen inside a standard cell block and typically use non-minimum widths, spacing, and/or shielding. As shown in Figure 9 below, the Chip Architect extraction can vary significantly from HyperExtract for long nets. FlexRoute performs a quick 2.5D extraction of its routed database and is substantially more robust for extracting long nets or nets with non-default widths and/or spacing.

The histograms below show the results of comparing Chip Architect, FlexRoute, and HyperExtract following top level routing. HyperExtract is taken as the golden data in this case since it performs a full (and slow) 3D extraction. Based upon the results shown below (and other testcases) we are confident in using the extraction results from FlexRoute in guiding both our top level timing and our overall chip level timing closure.
Figure 9: Chip Architect Vs HyperExtract Capacitance Delta Histogram

As can be seen from the histogram Chip Architect extraction shows significant differences from the true HyperExtract data. This is due to the length of the top-level routes and the additional spacing and/or shielding that appears in the routing and is not modeled in the Chip Architect global routes.
Figure 10: FlexRoute Vs HyperExtract Capacitance Delta Histogram

The FlexRoute 2.5D extraction matches the HyperExtract results very well; there are still some differences, but these occur only on long nets where the actual delta is a small percentage of the total capacitance.

Clock Tree Methodology

The clock tree methodology planned up front was to create independent clock distribution trees in the hierarchical blocks and a tree at the top level of the design to balance skew between the blocks; for this approach to work, each block was required to have the same insertion delay for a given clock. The minimum insertion delay for a clock was determined by processing each clock/block combination through Cadence CTGEN. As expected, the largest blocks, with the largest number of clock endpoints, produced the largest insertion delays. A small overhead was added to the minimum insertion delay to ensure repeatability if the number of loads in a block increased, and this padded insertion delay was then used to create equivalent-depth clock trees in the smaller blocks. Since several blocks had multiple clocks, several initial runs of CTGEN were required to build up this information. As the insertion delays of the separate trees were matched, care was taken to keep the number of buffer levels in the trees constant; this helps maintain insertion delay tracking over process, temperature, and voltage.

With the blocks having equal insertion delays, the problem of clock tree insertion at the top level became one of creating a zero-skew network from the clock source (typically a loose standard cell mux at the top level) to the clock input pins on each top-level block. Attempts to use several different automated clock tree methodologies, including CTGEN, produced less than optimal results at the top level. This was concluded to be due to the very large area at the top level combined with sparse placement area availability and the small number of clock loads.
In the end a manual approach was taken to inserting the small tree required for top-level clock balancing. A Tcl script was developed which allowed the user, within Chip Architect, to manually insert a buffer into a clock net, place and route it in the floorplan, and receive immediate feedback on the balance of the loads it drives. To provide quick feedback, the output net connected to the newly inserted buffer was incrementally routed and a reduced SPEF (RSPEF) generated for the net and its loads. Since the RSPEF reduces the driven RC network into an equivalent RC load for the driving cell plus a set of endpoint delays for the loads on the net, it is a simple task to parse the resulting RSPEF and provide accurate feedback on the skew from the inserted buffer to the loads it drives. To speed up the analysis, each of the top-level blocks was replaced by an extracted STAMP model of its internal netlist, or by an estimated input pin loading where the block was incomplete. The use of abstracted models ensures that the RSPEF extraction and subsequent analysis run in a matter of seconds; without these models the RSPEF extraction takes a significant amount of time since it must traverse the hierarchy to each clock tree endpoint.

By working backwards from the load points (i.e. the inputs to the blocks) it was a simple matter to balance each level of the top-level clock tree. Due to the small number of block input pins to be driven, only three levels of clock tree, using 6 buffers, were required for the largest tree. After the complete tree was assembled in this piecemeal fashion, a full clock tree analysis was run in PrimeTime using post-route data from FlexRoute to verify the final results. The largest skew within any block was 280ps, the worst-case top level skew was 40ps, and the worst-case difference in insertion delay between blocks was 20ps. The final, post-route, measured clock skew across the chip was 330ps worst case.
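As a rough sanity check on the reported numbers (this is our own conservative bound, not a calculation the paper or the tools perform): the worst chip-level skew cannot exceed the sum of the worst block-internal skew, the top-level skew, and the block-to-block insertion-delay mismatch.

```python
# Conservative chip-level skew bound from the figures quoted above.
block_skew_ps = 280          # worst skew inside any one block
top_skew_ps = 40             # worst top-level tree skew
insertion_mismatch_ps = 20   # worst block-to-block insertion delay delta

bound_ps = block_skew_ps + top_skew_ps + insertion_mismatch_ps  # 340 ps
measured_ps = 330            # final post-route measured chip skew

# The measured value sits just inside the additive bound, as expected.
assert measured_ps <= bound_ps
```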

Multi Block Timing Analysis

Since DSPEF is not a hierarchical format, the individual DSPEF files for each block and for the top level must be merged. Full chip timing analysis is achieved by merging the multiple DSPEF files generated for each block and completing the RC networks at the interfaces to each block within PrimeTime. Rather than wait for all blocks to be completed, it is useful to perform partial chip timing analysis by annotating the top level and those blocks that are available. For the final timing analysis the top level DSPEF is extracted from the FlexRoute database, while the block level DSPEF files are extracted from a post-route Cadence SE database using Synopsys Arcadia or Cadence Hercules. At intermediate stages of block completion, DSPEF generated from Chip Architect may be used as an accurate estimate of the block level post-route DSPEF; as indicated previously, this can be done with a high degree of confidence that the timing will match post-route timing. The script environment used to back annotate the DSPEF used a search path to determine whether to annotate the actual Cadence DSPEF or, if that was not yet available, the Chip Architect estimated DSPEF.
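The search-path selection described above can be modeled as below. The actual flow used Tcl; this Python sketch uses hypothetical directory names and a hypothetical `.dspef` suffix purely for illustration.

```python
import os

# Search directories in priority order: post-route Cadence extraction
# first, Chip Architect estimate as the fallback (names are made up).
def pick_dspef(block, search_dirs=("cadence_spef", "ca_spef")):
    """Return the first DSPEF found for a block along the search path,
    or None if the block has no parasitics yet."""
    for d in search_dirs:
        path = os.path.join(d, block + ".dspef")
        if os.path.exists(path):
            return path
    return None
```

With this scheme a block automatically switches from estimated to actual parasitics as soon as its post-route extraction appears, with no change to the timing analysis scripts.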

Figure 11: Full Chip Timing Analysis Flow (Chip Architect generates the top level LEF/DEF for FlexRoute, which produces the top level DSPEF; PrimeTime combines this with the top level Verilog, block level Verilog, block level DSPEF, and the chip timing constraints, writing out a full chip SDF that subsequent PrimeTime runs can read directly)

As each block level DSPEF file is annotated in PrimeTime, an error message is generated for each primary I/O of the block. These messages occur because, at the time the block is read, the RC network at a primary I/O is incomplete (i.e. it does not yet contain a complete driver, RC network, and receiver). After the final top-level DSPEF file is annotated, a final check (using the PrimeTime command report_annotated_parasitics -list_not_annotated) is performed to ensure that all nets, especially those that cross hierarchical boundaries, have been annotated correctly. Once all of the sub-blocks and the interconnecting top-level design have been back annotated with DSPEF, the PrimeTime command complete_parasitics is used to merge the inter-block networks.

A technique used to speed up multiple chip level timing analysis runs was to back annotate all of the DSPEF files and then write out a full chip SDF file for each operating condition. Subsequent timing analysis runs then save time by simply reading the existing SDF file instead of the relatively slow process of annotating multiple DSPEF files and recalculating delay information. Tcl scripts using search paths and file time stamps can be used to automate this process.
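The time-stamp rule behind the SDF caching can be sketched as follows: reuse the saved full chip SDF only if it is newer than every DSPEF it was built from. Again this is a Python illustration of the Tcl automation, with hypothetical file names.

```python
import os

def sdf_is_current(sdf_path, dspef_paths):
    """True if the cached full-chip SDF exists and is newer than every
    DSPEF input; otherwise the SDF must be regenerated."""
    if not os.path.exists(sdf_path):
        return False
    sdf_time = os.path.getmtime(sdf_path)
    return all(os.path.exists(p) and os.path.getmtime(p) <= sdf_time
               for p in dspef_paths)
```

A stale or missing DSPEF simply forces the slow annotate-and-write path once, after which every later analysis run reads the SDF directly.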

Chip Assembly

As individual blocks finish routing and are verified for post-route timing, they can be individually checked for DRC/LVS issues in stand-alone mode. The final chip assembly operation occurs inside a GDSII editor, where the individual blocks are merged with the top level GDSII. This particular design used a Synopsys tool (SLE) for the GDSII merging, but the same process is easily accomplished with other tools. Once assembled, it is necessary to run LVS and DRC to check for problems at the interfaces between the top level routing structures and the block level structures.

Figure 12: GDSII Merging Process (block and top level GDSII merged in a layout editor)

Summary

Using the flow outlined, it has been shown on a real design that timing closure can be driven from the top level of a design, and that it need not wait for extraction of a full flat GDSII database. Using top down physical and timing budget generation, individual SoC blocks can be implemented in relative isolation from each other and still achieve timing closure at the chip level when they are reassembled. The critical top level detail routing can be completed early in the design flow, and the extracted RC information used to drive the block level budgeting and synthesis process. The back end design flow's involvement in the timing closure loop is reduced to the detail routing and extraction of the individual SoC blocks. Adoption of the flow, and the use of Chip Architect, FlexRoute, and Physical Compiler, may require a change in the organization or roles of typical design teams, as more of the physical implementation is performed in parallel with the development of the RTL, and ideally by the same engineers.
