Training 2

Institut für Integrierte Systeme / Integrated Systems Laboratory
VLSI II: Design of Highly Integrated Circuits

Dept. of Information Technology and Electrical Engineering, 7th Semester
Dr. H. Kaeslin, Dr. N. Felber, Prof. Dr. W. Fichtner
227-0147-00
Training 2: SoC Encounter for Designers II

Beat Muheim, Frank K. Gürkaynak
Issued: 26 October 2010
Supervised exercise sessions: 2 November 2010, ETZ D96.1; 9 November 2010, ETZ D96.1
Reminder: By working on this exercise you declare that you know and observe the rules for the use of CAD software at ETH Zurich. You can read these rules at any time at http://www.dz.ee.ethz.ch/en/our-range/regulations.html.
1 Overview
Unlike other exercises in the VLSI lectures, the back-end design flow requires you to learn how to use a commercial Electronic Design Automation (EDA) tool, in our case SoC-Encounter from Cadence Design Systems. These exercises are therefore called Trainings and will teach you the basics of SoC-Encounter so that you can use it for your semester projects. There will be three trainings:

Training 1  Introduction to SoC-Encounter and back-end design phases. This training is also suited for a general audience.
Training 2  Floorplanning, placement, clock tree synthesis, optimization, routing and timing analysis with SoC-Encounter.
Training 3  Tape-out preparation, performing Design Rule Check (DRC) and Layout Versus Schematic (LVS) on your final database.

Students who plan to work on an ASIC semester project should make sure to attend all three trainings.

Parts of the text that have a gray background, like the current paragraph, indicate steps required to complete the exercise.
2 Introduction
In this training we will start with a structural Verilog netlist (from synthesis) and create, step by step, a physical layout that can be manufactured. To keep runtimes reasonably low, we will use an example design with a (slightly) lower complexity than most student design projects.

2.1 Example Design
The example design is based on the FIR filter that we have been using in the past exercises. The filter has been changed to include several pipelined filter stages as shown in the block diagram below [1].

[1] The filter is basically useless and has only been engineered as an example circuit suitable for the exercise.
[Figure: block diagram of the example design. The hierarchy levels are filter, top and chip. The filter consists of pipelined stages (filter_stage1, filter_stage2, filter_stage3, filter_stage4, ..., filter_stage8), each with a look-up table (LUT), connected by 16-, 32- and 48-bit buses. The RAM macro-cell SY180_2048X16X1CM8 is connected through RamWDxD, RamAddrxD and RamRDxD. External signals: ClkxCI, ResetxRBI, ScanEnxTI, RamTestxTI, DataInxDI, DataInReqxSI, DataInAckxSO, DataOutxDO, DataOutReqxSO, DataOutAckxSI.]
Each filter stage contains a large multiplier, a look-up table and an accumulator. Note that the input of the first stage is tied to constants and therefore greatly simplified. The following is a short description of all pins of the circuit:

Pin Descriptions

Name            Bits  Dir  Description
ClkxCI          1     In   Clock input
ResetxRBI       1     In   Reset input, active low signal, 0: Reset
ScanEnxTI       1     In   Scan Enable for testing, 1: Scan
RamTestxTI      1     In   RAM bypass control, 1: Test (RAM bypassed)
DataInxDI       16    In   16-bit data input
DataInReqxSI    1     In   Request signal for data input
DataInAckxSO    1     Out  Acknowledge signal for data input
DataOutxDO      16    Out  16-bit data output
DataOutReqxSO   1     Out  Request signal for data output
DataOutAckxSI   1     In   Acknowledge signal for data output
3 Getting Started
You will need a terminal program to type in commands throughout this exercise. On the computers in ETZ D96 you can get a terminal by opening the menu in the top left corner and selecting "Applications → Accessories → Terminal".

Change to your home directory and install the training files with the script provided:

    cd ~
    /home/vlsi2/t2/install_t2

Change to the design directory:

    cd training_2

The copied files and folders are arranged in a certain structure which is described in the next section.

3.1 Directory Structure
The following figure shows the directory structure for a design directory that was created by the cockpit tool developed by the Design Zentrum (DZ) of ETH Zurich.
design/
    .cockpitrc        Configuration for the cockpit
    calibre/          Final layout, DRC and LVS
    docs/             Links to documents
    encounter/
        out/          Final output files: netlist, layout, timing (Verilog, GDSII, SDF)
        save/         Save files for Encounter (Encounter native format)
        scripts/      Example scripts, run scripts (TCL)
        src/          Input source files: netlist, constraints, I/O placement
            sample/   Sample input files
        tech/         Links to technology files, etc.
            lef/      Links to abstracts and technology
            lib/      Links to timing libraries
    modelsim/         Simulation tool
    simvectors/       Stimuli and expected responses
    sourcecode/       VHDL source code
    synopsys/         Synthesis environment
    tetramax/         Test vector generation, test coverage
In this structure, there are five subdirectories for SoC-Encounter. It is strongly recommended to use them in the following way:

out      Place all final data to be exported from SoC-Encounter in this directory. This includes the final netlist (the initial netlist gets modified by clock tree insertion, optimization etc.), layout and delay files that will be used for post-layout simulation and/or physical verification and chip finishing. A sample script that generates all these files is provided (scripts/exportall.tcl).
save     Put all SoC-Encounter save files, i.e. files in native SoC-Encounter format, in this directory.
scripts  Contains TCL scripts. By default several example scripts for common tasks are provided. It is highly recommended to develop a run script that contains all the commands used for your design.
src      All user input files should be placed here. These include the initial Verilog netlist, the I/O placement file, the timing constraints file and the clock tree definition file (all will be explained later in section 3.2).
tech     Holds links to technology-specific files. Cockpit manages this directory automatically.
3.2 Input Files
The input files required for back-end design with SoC-Encounter can be divided into two categories:

- Design files that describe (or are closely related to) the circuit, first of all the Verilog netlist of our synthesized design.
- Technology files that describe the technology itself as well as libraries of standard building blocks implemented in this technology.

Let's start with the first category.

3.2.1 Verilog Netlist
The Verilog netlist we obtain from synthesis contains standard cells, functional I/O pads and their interconnection information. While the functionality including scan circuitry is already complete, some special cells are still missing:

- Supply pads to provide power and ground to the core (pads VCCKD and GNDKD) and to the padframe (pads VCC3IOD and GNDIOD).
- Corner pads that need to be placed in the corners of the padframe to complete the power lines running inside the padframe (pad CORNERD).

Due to the arrangement we have with our ASIC manufacturer, student designs are strictly limited in size. As a consequence, at most 48 pads (not including the 4 corner pads) can be placed in the padframe. Furthermore, to ease chip testing on the ASIC tester, two predefined power schemes have been established:

1. 40 signal pads, 8 supply pads (recommended for normal designs)
2. 32 signal pads, 16 supply pads (extra power pads for fast designs)

Take a look at the following web page for an illustration of the power schemes and to obtain further information on constraints for the semester design projects.

http://www.dz.ee.ethz.ch/en/information/ic-technologies/umc/180/mini-asic-setup.html

With all this information we are now ready to add the missing corner and supply pads to our Verilog netlist.
A typical Verilog netlist that you will obtain from Synopsys will contain many levels of hierarchy. Each level of hierarchy is enclosed between the statements

    module name ( pin names separated by comma ) ... endmodule

where name refers to the name of the module (module is the Verilog equivalent of an entity in VHDL). In our case we need to add the pads to the top-level module, which contains the rest of the I/O pads. The top-level design is almost always the last module definition in a Verilog file [2].

Copy the Verilog netlist to encounter/src/ in order to have a clean copy of the initial netlist even if synthesis is rerun.

    cd encounter/src/
    cp -p ../../synopsys/netlists/chip.v chip.v.initial

The file specialpads.v contains four corner pads and 8 supply pads corresponding to power scheme 1. As our design uses power scheme 1, no changes are required to this file. For power scheme 2 we would have to uncomment the eight additional supply pads (comments in Verilog start with //).

What remains to be done is to add the contents of specialpads.v to the initial netlist at the right point, i.e. where the other pads are. Using a text editor [a], open chip.v.initial and find the definition of the top-level module chip by searching for:

    module chip

Below this declaration you should see lines that instantiate the pads. Insert the contents of specialpads.v at this point. As long as you are in the module body, it does not matter where exactly you insert them. Save the file as chip.v and exit the text editor.

[a] There are many text editors you can use. There are terminal-based editors (vi, vim, nvi, joe, jed, pico, nano etc.), editors that are mainly terminal-based but have a simple GUI (emacs, xemacs, gvim etc.), and GUI-based editors (mousepad, gedit, nedit, kate etc.). Out of these, emacs, vi (and derivatives) and nedit are the most advanced editors.

Remark: In the future you can use a small Perl script to add the special pads to the initial netlist, i.e.

    ./insert_specialpads ../../synopsys/netlists/chip.v ./specialpads.v > chip.v

inserts the contents of specialpads.v into the last module defined in ../../synopsys/netlists/chip.v and writes the modified netlist to chip.v.

[2] The content of a module needs to be defined before it can be instantiated by a different module. Consequently the top-level module is the last to be defined. However, not all Verilog files need to be hierarchical, and a design can also be spread over multiple files.
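The insertion described above can be sketched in a few lines of Python. This is not the provided Perl script, only a minimal illustration under the assumption that the last `endmodule` keyword in the file closes the top-level module; the function name and the instance name used in the usage note are hypothetical.

```python
def insert_into_last_module(netlist: str, extra: str) -> str:
    """Insert `extra` just before the last 'endmodule' of `netlist`.

    Assumption: the last 'endmodule' in the text closes the top-level
    module and does not occur inside a trailing comment.
    """
    pos = netlist.rfind("endmodule")
    if pos < 0:
        raise ValueError("no module definition found in netlist")
    # Keep exactly one newline between the inserted text and 'endmodule'.
    return netlist[:pos] + extra.rstrip("\n") + "\n" + netlist[pos:]
```

For example, calling the function with the text of chip.v.initial and the text of specialpads.v returns a netlist in which the special pads are the last instances of the top-level module. Inserting at the end of the module body is equivalent to inserting next to the other pads, since the position inside the module body does not matter.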
3.2.2 I/O File
After the last step our Verilog netlist contains all pads. However, there is no information that tells the tool where each pad should be placed. The pad placement is very important as it directly determines the PCB layout [3]. In our case, we want all designs to share common power and ground pad locations so that a single test board can be used on our ASIC tester. For practical reasons we have decided to use a 56-pin package for all designs. So even though the chip has only 48 physical pins, it will be placed in a package that contains 56 pins [4]. Depending on the power configuration, a different bonding scheme will be used. These two configurations can be seen on the following webpage:

http://www.dz.ee.ethz.ch/en/information/ic-technologies/umc/180/mini-asic-setup.html

The cockpit will copy sample I/O files automatically to the src/sample directory [5]. All lines starting with # are comments. The file consists of two main sections: globals and iopad.

    (globals
        [global definitions]
    )
    (iopad
        (topleft
            [pads that are on the top left]
        )
        (left
            [pads that are on the left side]
        )
        [definitions for other sides]
    )

For us the relevant part is the iopad section. This part contains eight subsections that define the names of the pad instances and their locations on the four sides and in the four corners. We do not have to touch the corner specifications [6] as they will be the same for all designs. We have to distribute the pads among the four sides of the chip: top, right, bottom, left. If you look at the sample file you will see that for each pad there is a single-line entry of the following form:

    (inst name="NAME_OF_PAD" offset=OFFSET_VALUE ) # pin no: PIN_NUMBER

The last part following # is a comment; it is there just for your information. Regardless of the power scheme you are using, we will use the same 56-pin package as illustrated on the webpage above. The PIN_NUMBER is just a reminder to show which particular location is being defined. The location is specified using the OFFSET_VALUE. SoC-Encounter uses a coordinate system that has its origin (0,0) in the bottom-left corner, as shown in the figure below:

[3] A good pinout could simplify the routing on the PCB, allow you to use fewer layers and result in fewer parasitics.
[4] 8 pins will be left unconnected.
[5] For this technology there will be four files. There will be two template files, chip.io-template and chip-ep.io-template, for the normal and extended power configuration respectively. These files have all the required power connections in place, and the data sections are commented out. There are also two example files that have fictional I/O placement where all pins are defined.
[6] topleft, topright, bottomleft, bottomright
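[Figure: padframe with the four sides (top, right, bottom, left) and the four corners (topleft, topright, bottomleft, bottomright). The origin (0,0) is at the bottom-left corner. Pads 1, 2, 3, ... on a side are located by their offset along that side.]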
On the left and right side the pads will be ordered from bottom to top, and on the top and bottom side the pads will be ordered from left to right. This ordering can be quite confusing, as it is neither clockwise nor counterclockwise. Therefore the aforementioned comments showing the actual pin numbers will be very useful.

The OFFSET_VALUEs given in the template represent fixed locations for the given pad. It is very important that you do not change these values, as the chip-finishing part will rely on the pads being located exactly at these locations.

You can assign your pads by writing the name of each pad into the corresponding NAME_OF_PAD. The name of the pad will be the name of the instance in the Verilog file. For example, assume that you are using the standard power scheme and your clock signal is assigned to a pad named pad_clock. In your Verilog file you would have the following entry for this pad:

    XMD pad_clock ( .I(ClkxCI) [other pin definitions] )

If you now want to place this pad on pin number 54 of your package, you will find the subsection top in the I/O file and edit the line for pin 54:

    ...
    (iopad
        ...
        (top
            ...
            (inst name="pad_clock" offset=328.6 ) # pin no: 54
            ...
        )
        ...
    )

Be careful not to modify the offset value while you are editing the I/O file [7]. Since we use a fixed bonding scheme for the power and ground pins, all we need to do is extract the instance names of all our signal pads and place each of them by inserting it into the inst name="" statement whose OFFSET_VALUE corresponds to the desired location.

[7] Please note that if you used the extended power scheme, pin number 54 would have a different offset (234.36), since in the extended power scheme the pin assignment is slightly different.

Preparing the I/O file from scratch can be a lengthy and tedious task. To avoid unnecessary work during this exercise we will start with an almost complete I/O file, but before doing so we will describe the full procedure recommended when starting from scratch:

1. Start SoC-Encounter and proceed to design import by selecting "Design → Design Import". In this form make sure that the "IO Assignment File" is empty.
2. If everything works well, the design will be loaded. Now we can write out a template file that will contain all the names of the pads. Use "Design → Save → I/O File ..." to save an I/O file src/chip-sequence.io. You can select the sequence checkbox, however it is not imperative. What we need is only the names of the pads.
3. Copy the template I/O file src/sample/chip.io-template to src/chip.io. As noted earlier, this file includes all offset= statements, and all statements for corner and supply pads.
4. Using a text editor, open the files src/chip.io and src/chip-sequence.io. You need to move the pad names from the file src/chip-sequence.io to the correct positions in the file src/chip.io.
5. All entries for data pins in the template file are by default commented out using the # character. Do not forget to remove the comment character for the pads you are using.

Now, for this exercise you can start with the almost complete I/O file src/chip.io-incomplete instead of the template file. This file has all the pads placed properly with the exception of the 16 pads of the input bus DataInxDI, which are still missing. Furthermore, the file src/chip.sequence.io mentioned above has already been generated for you. The desired I/O assignment is depicted in the figure below and can also be found in the file src/chip_io.ps [a]. Create the complete I/O file and save it as src/chip.io.

[a] Postscript viewers were very common in earlier days; you can use gv, kghostview or evince to view this file.
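Steps 4 and 5 of the procedure can also be scripted. The following Python sketch writes a pad name into the entry of a given pin, removes a leading comment character and leaves the offset untouched. It assumes that every pad entry sits on a single line of the form shown earlier and that a commented-out entry merely has a # in front; the exact layout of the template files is not reproduced here, and the function name is hypothetical.

```python
import re

# One pad entry per line, optionally commented out with a leading '#':
#   (inst name="NAME_OF_PAD" offset=OFFSET_VALUE ) # pin no: PIN_NUMBER
INST_LINE = re.compile(
    r'^(?P<indent>\s*)#?\s*\(inst\s+name="[^"]*"\s+'
    r'(?P<offset>offset=\S+)\s*\)\s*#\s*pin no:\s*(?P<pin>\d+)\s*$'
)

def assign_pad(io_lines, pin, pad_name):
    """Return a copy of io_lines with pad_name written into the entry for pin.

    The offset is copied verbatim, since it must not be changed.
    """
    result = []
    found = False
    for line in io_lines:
        match = INST_LINE.match(line)
        if match and int(match.group("pin")) == pin:
            line = (f'{match.group("indent")}(inst name="{pad_name}" '
                    f'{match.group("offset")} ) # pin no: {pin}')
            found = True
        result.append(line)
    if not found:
        raise KeyError(f"no entry for pin {pin}")
    return result
```

Applied to the line for pin 54 with the pad name pad_clock, the function produces the entry from the example above. Checking the result with src/io2ps.pl is still advisable.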
You can use the utility src/io2ps.pl to generate a postscript file from your I/O file. This utility will also verify that you have used the correct offset locations in your I/O file, and will report errors. For best results, you should also provide the Verilog netlist file, which will enable the script to make even more checks.

    ./io2ps.pl chip.io chip.v > chip_pin_diagram.ps

The src/io2ps.pl utility uses a configuration file with the extension .pads. By default the file src/io2ps.pads will be used. If you are planning to use the extended power scheme, you will have to add the configuration file src/io2ps-ep.pads to the command as well.
[Figure: desired I/O assignment on the 56-pin package, pins numbered 1 to 56 (see src/chip_io.ps). Signal pads: DataInxDI_PAD_0 to DataInxDI_PAD_15, DataInReqxSI_PAD, DataInAckxSO_PAD, DataOutxDO_PAD_0 to DataOutxDO_PAD_15, DataOutReqxSO_PAD, DataOutAckxSI_PAD, ClkxCI_PAD, ResetxRBI_PAD, RamTestxTI_PAD, ScanEnxTI_PAD. Supply pads: pad_vcc_c1, pad_gnd_c1, pad_vcc_c2, pad_gnd_c2, pad_vcc_p1, pad_gnd_p1, pad_vcc_p2, pad_gnd_p2. Eight package pins are marked NO_CONNECTION.]
3.2.3 Timing Constraints
Just as for synthesis, we need to specify timing constraints for the back-end design with SoC-Encounter. With decreasing process geometries the impact of placement and routing on timing, power, etc. is steadily increasing. Therefore, timing analysis and optimization have become very important in order to arrive at a layout that (still) satisfies all requirements.

As SoC-Encounter supports most of the more common Synopsys commands/constraints, it should be rather straightforward to create an appropriate timing constraints file based on the constraints used for synthesis. There is an example constraint file src/sample/chip.sdc-sample that contains the most commonly used commands along with many useful and important comments.

Copy this file to src/chip.sdc and modify it so that the following constraints get set (and nothing else!):

- Define a 125 MHz clock
- Specify 3.5 ns input delay for all inputs
- Specify 5.0 ns output delay for all outputs
- Specify an input transition time of 0.8 ns at all inputs
- Specify a 15 pF output load for all outputs
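To see what these numbers mean for the circuit, the short Python sketch below converts the clock frequency into a period and computes how much of each cycle remains for on-chip logic on input and output paths. It also prints one possible set of constraint lines using standard SDC commands. The clock name clk, the choice of attaching the clock to the port ClkxCI, and the units (ns and pF) are assumptions; the provided sample file may be organized differently, for instance by excluding the clock port from the input constraints.

```python
def timing_budget(clock_mhz, input_delay_ns, output_delay_ns):
    """Return (period, time left on input paths, time left on output paths) in ns."""
    period_ns = 1000.0 / clock_mhz
    return period_ns, period_ns - input_delay_ns, period_ns - output_delay_ns

def sdc_lines(period_ns, clock_port="ClkxCI", clock_name="clk"):
    """Build constraint lines; clock_name is a hypothetical label."""
    return [
        f"create_clock -name {clock_name} -period {period_ns:g} [get_ports {clock_port}]",
        f"set_input_delay 3.5 -clock {clock_name} [all_inputs]",
        f"set_output_delay 5.0 -clock {clock_name} [all_outputs]",
        "set_input_transition 0.8 [all_inputs]",
        "set_load 15 [all_outputs]",  # assumes the library capacitance unit is pF
    ]

period, input_budget, output_budget = timing_budget(125.0, 3.5, 5.0)
print(f"period = {period} ns")
print(f"left for logic after an input  = {input_budget} ns")
print(f"left for logic before an output = {output_budget} ns")
print("\n".join(sdc_lines(period)))
```

A 125 MHz clock has a period of 8 ns, so paths starting at an input have 4.5 ns and paths ending at an output have 3 ns available on the chip, including the delay of the pads.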
3.2.4 Technology Files
The tech directory and its two subdirectories contain technology files that describe the technology itself as well as libraries of standard building blocks implemented in this technology, i.e. standard cells, pads, RAM/ROM.

Technology files (UMCL180)
lef/header6_V55.lef  Base technology description, defines metal layers, vias, spacing rules, routing
umcL180.capTbl       Table used to extract parasitic capacitances and resistances for signal and power wires.
streamout.map        Layer mapping table used when exporting the final layout in GDSII format.

Library files (standard cells, pads, macro-cells)
lef/*.lef            Physical description, shape and allowed orientation of cells, layer and shape of pins, blockages, antenna information, ...
lib/*.lib            Functional description, timing and power information, maximum load/fanout or transition time allowed, ...

3.2.5 Macro-cells
The macro-cells for the umcL180 process are created using dedicated memory compilers. The specific memory compiler we have access to is able to create five different types of macro-cells with various capacities:

SU180_ : single-port static RAM
SJ180_ : dual-port [8] static RAM
SY180_ : single-port register-file [9]
SZ180_ : two-port [10] register-file
SP180_ : via programmable ROM

The following parameters are used for the macro-cells:

words                              Number of words in the memory.
sub-word size                      Number of bits within a sub-word of the memory. The sub-word is the smallest unit used for data access in the macro-cell [11].
number of sub-words per data word  This parameter allows creating multiple sub-words. Each sub-word can be written to separately. For example, a 32-bit RAM can be configured as having a single 32-bit sub-word, two 16-bit sub-words, four 8-bit sub-words and so on.
column or block multiplexer        This parameter affects the geometry of the macro-block. This can have a significant influence on the performance of the macro-block. There is no general rule to determine this parameter. Once the memory requirements are known, all possible geometries will be considered and the most suitable one will be determined.

[8] Dual-port memories have two completely independent access ports. Two separate memory addresses can be accessed at the same time, each for read or write.
[9] Although the name suggests that the memory is made out of individual registers, it is very similar in design to SRAM.
[10] In two-port memories, the read and write ports are separate, so you can simultaneously read and write. There are timing constraints for reads and writes to the same address; please refer to the memory compiler manual for details.
[11] In many places this sub-word is referred to as byte. This might be slightly confusing, since a byte is commonly accepted to be an information unit consisting of 8 bits.

There are several macro-cells available; their datasheets can be found under:
/usr/pack/designkits-1.0-ma/umc_L180/faraday/gen/memaker/200901.1.1/datasheet.dz

If none of the available macro-cells suits your needs, more can easily be generated on demand. Please contact the Microelectronics Design Center for this purpose.

Our example design uses a single-port RAM named SY180_2048X16X1CM8. This RAM has 2048 words of 16 bits each (single sub-word) and a block multiplexer of 8. All necessary preparations to work with this macro-cell have already been done, so you do not need to do anything additional for this exercise.
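The macro-cell name encodes the parameters described above. The following Python sketch splits a name into its parts and computes the capacity. The naming pattern is inferred from the single example SY180_2048X16X1CM8 and is therefore an assumption; names produced by the memory compiler for other configurations may deviate from it.

```python
import re

# Macro-cell families as listed in the text.
FAMILIES = {
    "SU180": "single-port static RAM",
    "SJ180": "dual-port static RAM",
    "SY180": "single-port register-file",
    "SZ180": "two-port register-file",
    "SP180": "via programmable ROM",
}

# Assumed pattern: <family>_<words>X<sub-word bits>X<sub-words>CM<multiplexer>
NAME = re.compile(r"^(S[UJYZP]180)_(\d+)X(\d+)X(\d+)CM(\d+)$")

def parse_macro_name(name):
    """Split a macro-cell name into its parameters (pattern is an assumption)."""
    match = NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected macro-cell name: {name}")
    family, words, bits, subwords, mux = match.groups()
    words, bits, subwords, mux = int(words), int(bits), int(subwords), int(mux)
    return {
        "kind": FAMILIES[family],
        "words": words,
        "subword_bits": bits,
        "subwords": subwords,
        "mux": mux,
        "total_bits": words * bits * subwords,
    }

info = parse_macro_name("SY180_2048X16X1CM8")
print(info["kind"], info["words"], "words,", info["total_bits"], "bits in total")
```

For the RAM of the example design this gives 2048 words of one 16-bit sub-word each, i.e. 32768 bits (4 KiB) of storage.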
4 Importing the Design
Start SoC-Encounter either from your design directory by using cockpit

    cd ~/training_2
    icdesign umcL180 &

or from the encounter directory by issuing the command

    cd ~/training_2/encounter
    cds_soc81 encounter

We will now import our design. SoC-Encounter uses a large configuration file that defines the design and technology files to be loaded as well as some global settings to be applied. Cockpit automatically generates an appropriate sample configuration file src/sample/chip.conf that should be used as a starting point.

Copy the sample file into the src directory.

    cp src/sample/chip.conf src/

Select "Design → Import Design ..." to open the design import form. This form contains fields for all configuration options. At the bottom of this window, there are buttons to load and save the configuration from/to a file. Use the Load ... button to load the configuration file we have just copied to the src directory.

On the Basic tab make sure that Verilog Netlist:, Timing Constraint File: and IO Assignment File: match your design. Common Timing Libraries: and LEF Files: should already be correct.

On the Advanced tab the only setting you might want to adapt for your design is the Default Delay Pin Limit: in the category Delay Calculation. We will explain this a bit later.

Once you are happy with the configuration, don't forget to save your changes to the configuration file. Click Ok to import your design. Monitor the messages on the console for errors [a]. Pay attention to the messages where the timing constraint file is loaded (Reading timing constraint file) to see if everything was accepted! If there are errors, you need to fix them!

[a] You can ignore warnings (SOCLF-58), (SOCLF-200), (TECHLIB-436), (SOCSYC-2), (EMS-27).
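When the console output gets long, a small filter helps to find the messages that deserve attention. The Python sketch below keeps every warning or error whose code is not in the list of ignorable warnings given above. The assumed message format, a line containing `**WARN:` or `**ERROR:` followed by the code in parentheses, is not taken from this document and should be checked against the actual log; the function name is hypothetical.

```python
import re

# Warning codes that may be ignored according to the text.
IGNORABLE = {"SOCLF-58", "SOCLF-200", "TECHLIB-436", "SOCSYC-2", "EMS-27"}

# Assumed message format: '**WARN: (CODE-123): text' or '**ERROR: (CODE-123): text'
MESSAGE = re.compile(r"\*\*(?:WARN|ERROR):\s*\((?P<code>[A-Z]+-\d+)\)")

def messages_to_check(log_lines):
    """Return all warning and error lines whose code is not ignorable."""
    keep = []
    for line in log_lines:
        match = MESSAGE.search(line)
        if match and match.group("code") not in IGNORABLE:
            keep.append(line.rstrip("\n"))
    return keep
```

The function can be applied to the lines of a saved log file; an empty result means that only the ignorable warnings were found in the assumed format, which does not replace reading the messages around the constraint file loading.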
We are now in the floorplan view of SoC-Encounter, which displays an empty floorplan with only the pads placed. All top-level module(s) of the netlist are shown as a pink/purple square to the left and all macro-cells to the right. Note that all standard cells are inside the module(s).
5 Floorplanning
Now we will have to decide how cells and macro-cells will be placed on our chip. This process is called floorplanning. For a standard design, our main concern would be to find a floorplan that will result in the smallest possible area, while fulfilling all performance and reliability requirements. This is purely driven by economic reasons, since chip costs are mainly determined by the area.

In some cases there are additional geometrical constraints. The manufacturing company may impose certain limits on the aspect ratio of the final layout [12], or even dictate the maximum height or width of the layout.

Back-end design is not only used for complete chips. Macro-cells that will be part of a larger system-on-chip design can also be designed in this way. In such cases there might be even more restrictions. For example, certain metal layers might be reserved for the system level.

So the question is: "How small can my layout be so that I am still able to fulfill all specifications?" As a lower bound, you will need enough area to place all your I/O pads and standard cells. Ideally, in terms of area (and assuming your design is not pad limited, see exercise 2), you will want to place standard cells without leaving extra space in between, completely filling out the core area. This is hardly ever possible because:

- The number of interconnections that can pass through a certain area is limited by the number of metal layers available [13], wire width and minimum spacing requirements. Depending on the interconnection overhead, the area above the cells [14] may not be sufficient for routing.
- Timing is greatly affected by the placement of your cells. Placing them next to each other with no space in between does not leave the tool any flexibility in placing cells. This in turn reduces the optimization options of the tool, like the ability to cluster cells that are closely interconnected.
- All designs require power routing for operation. Some wires of the power connection limit where the cells can be placed, or restrict signal routing, which in turn increases the area requirement.
- The majority of designs require a clock tree to function. This clock tree is added during the back-end design. This requires additional area for the buffers used in the clock tree. Furthermore, the clock tree synthesis algorithm can produce better results if it has more freedom to place its buffers.
- Macro-cells, like the RAM in our example, usually require some extra space along the edges so that they can properly be connected to power and signal lines.
- Designs that have a high switching activity require a lot of current for a short time, which is called a surge. The power distribution network may need additional decoupling capacitors to store some charge that can provide some of the current of the standard cells during such a surge. Additional space for these decoupling cells may be required during placement.

As a consequence, the standard cell rows (which form the core area) cannot be filled completely with standard cells; in other words, some free space needs to remain in between cells. Utilization indicates to what extent the standard cell rows are filled. 100% utilization is the upper bound where all cells are abutted and there is no extra space, while a utilization of 50% means that half of the core area is empty.

Usually, it is not possible to predict whether or not all requirements can be fulfilled with a certain utilization [15]. You will have to try and find out. This is the main reason why back-end design is an iterative process [16].

[12] Especially in MPW runs, a lot of silicon area is wasted if all designs have wildly different dimensions.
[13] For our technology there are 6 metal layers.
[14] Cells in our technology use mostly the lowest metal layer Metal-1 and very rarely Metal-2 for internal connections; all other layers are free for routing.
[15] Placement and routing are each NP-complete problems; without completing both you will not know whether it is possible to fulfill the requirements.
[16] Obviously, technology plays an important role, and it is possible to give certain guidelines for a technology. However, back-end design is always highly dependent on the design itself. You will usually see within a few iterations what is possible and what is not.

5.1 Semester Projects
The MPW provider used for the semester projects offers modules called Mini Asic (mini@sic) with a size of 1379.5 µm × 1379.5 µm. Therefore, the chip size for the semester project ASICs is fixed. Please refer to the following web page to learn the details.

http://www.dz.ee.ethz.ch/en/information/ic-technologies/umc/180/mini-asic-setup.html

As a consequence, we only have to make sure that our design fits on this area, and there is no need to find the smallest possible layout. We may however need to constrain the core area to make it smaller if the utilization is too low, since a spread-out design has longer interconnections that may adversely affect timing.
5.2
Sketching a Floorplan
Before we go on with SoC-Encounter we need to make some planning and understand some key concepts. The gure on the following page is an example oorplan (not a very ideal one) that shows the important concepts. In SoC-Encounter die area corresponds to the total silicon area available to place pads (excluding bonding area for this technology) and core cells. For the semester projects this is strictly limited to 1379.5 m 1379.5 m. All pads (I/O, power and corner) are placed in what is known as the padframe. The remaining area can be used for the core of the chip. For semester projects the theoretical maximum for core area is 1099.26 m 1099.26m = 1.21 mm2 . As can be seen from the gure, the core area is surrounded by a core power ring. In its simplest form this consists of two (one for VCC, one for GND) wide17 metal lines that evenly distribute the power all around the chip. In order to leave room for the power ring, we need to leave a certain I/O to core spacing. The standard cells are designed in such a way that, when placed next to each other their VCC and GND pins can be connected with a horizontal power line. These horizontal lines are then extended to the core power ring. These power connections are relatively narrow (0.76 m in the technology that we use) and run over the entire width of the core area. This could be a problem for designs that consume much power, since the cells towards the middle would not have a good power connection18 . To improve this, vertical power stripes that connect to the horizontal power lines can be added, thereby forming sort of a mesh. The core area is lled with standard cell rows on which later all standard cells will be placed. In the same area we will usually also need to make room for our macro-cells. Most macro-cells need some free space around themselves. 
This free space is required to make signal connections, add a block power ring around the macro-cell or simply to prevent standard cells from being placed too close to the macro-cell. We will dene a block halo to specify this free space. When placing a macro-cell, you should also take into account where the power and signal pins of the block are located and what metal layer they are on. Often signal connections are only on two edges and you want them to face the core and not the I/O pads. Now, when we consider all the above, the core area that remains free to place core cells on is much smaller than the 1.21 mm2 that we started with. Our example design has a total cell area (including RAM) of 0.82 mm2 and should therefore comfortably t into the designated area.
[17] The width of the metal line depends on the amount of current drawn from the line. You will be able to judge this better after Exercise 3, which is dedicated to estimating the power consumption. We will mostly use a width of 20 µm, since this is the widest metal that can be manufactured without slotting (wider metal lines require slots/holes which break up the metal shape).
[18] The problem is that if much current is drawn, there will be a significant IR drop along the power lines. The cells in the middle will be supplied with a lower VCC than the ones on the sides. This could dramatically affect the performance of the system.
[Figure: example floorplan. Dimensions shown: die edge 1379.5 µm, maximum core edge 1099.26 µm. Labels: VDD, GND, Power Stripe, Standard Cell Power Connections, Block Power Ring, Standard Cell Row, Standard Cells, Macro Cell (RAM), I/O and Corner Pads Placed on the Padframe, Block Power Connection, Block Halo, Core Power Ring, Power Pad Connections, I/O to Core Spacing]
5.3
Initialize Floorplan
We are now ready to proceed with SoC-Encounter. From the menu select "Floorplan → Specify Floorplan...". A large window will open.
Select the Die Size by: Width and Height option and make sure that both values are 1379.5. Now we need to specify the I/O to core spacing by filling in the four values under the Core Margins by: entry. There must be sufficient room for the power ring around the core area. Larger values will reduce the area available to place the core cells, thereby increasing core utilization. As noted earlier, some iterations are usually required to find optimal values for a particular design. In this exercise we will assume that we will use one VCC and one GND line of maximum width 20 µm. We need some extra space between the lines, so for the moment we can start with a distance of 45 µm for all sides. Click on OK.
The floorplan should now look as shown in the screenshot below. Note that the pads are all placed at their proper locations, as the I/O file used during design import specifies absolute locations and we made sure that the die size stays fixed to the proper size during the initialize floorplan step.
Next we need to place the RAM macro-cell. Change the cursor mode to Move/Resize/Reshape by selecting the appropriate icon (next to the ruler icon) or use the keyboard shortcut SHIFT-R. Now you can select the RAM macro-cell and drag it to any location you like. The blue lines displayed are so-called flightlines that show where the signal connections to the block are. You can change the orientation of the RAM either by using "Floorplan → Edit Floorplan → Flip/Rotate Instances ..." (or pressing r) or with the attribute editor (press q). Note that the RAM macro will completely block Metal-1, Metal-2, Metal-3 and Metal-4. Only Metal-5 and Metal-6 will be available for routing over the RAM macro-cell [19].
5.4
Power Planning
The next step is to create the power distribution network. The Verilog netlist that we started with does not contain any power connections, therefore we need to create this connectivity now. We have to connect the power/ground pins of all instances to the respective global power/ground net that was specified on the "Design Import" form (category Power on the Advanced tab) [20].
[19] By default, the internal structures within a cell or block are not displayed. You need to make Cell Blkg visible to see the so-called blockages within a cell.
[20] There is also a special rule required if there are logic one/zero values 1'b1/1'b0 instead of TIE1/TIE0 cells in your netlist. You should however not have such logic values in your netlist.
This can be done using the "Floorplan → Connect Global Nets ..." form, or you can use the globalnet.tcl script provided.
Execute the script provided by typing on the command line of SoC-Encounter (not the GUI):
source scripts/globalnet.tcl
Next we will add the core power rings that distribute power all around the core. Select the menu "Power → Power Planning → Add Rings...". A large window will appear. The Net(s) field at the top defines for which nets rings will be created. The default is to create a power (VCC) as well as a ground (GND) ring. In the Ring Configuration section you can specify on what layers the ring segments will be created. Select metal5 H for Top and Bottom and metal6 V for Left and Right. Specify Width as 20 µm, Spacing as 1.5 µm and Offset as 2 µm and click Ok.
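It can be worth checking that the ring settings fit into the 45 µm core margin chosen earlier. The sketch below assumes that Offset is measured from the core edge to the innermost ring, that Spacing is the gap between neighbouring rings, and that the core margin is measured from the core edge outwards. The function name is made up for illustration and is not part of SoC-Encounter.

```python
def ring_stack_extent_um(n_rings, width_um, spacing_um, offset_um):
    """Distance from the core edge to the outer edge of the outermost ring.

    Assumes the rings are stacked outwards from the core: first the offset,
    then n_rings metal lines with (n_rings - 1) gaps between them.
    """
    return offset_um + n_rings * width_um + (n_rings - 1) * spacing_um


CORE_MARGIN_UM = 45.0  # I/O to core spacing entered in the floorplan form

# One VCC and one GND ring, with the values entered in the Add Rings form.
extent = ring_stack_extent_um(n_rings=2, width_um=20.0, spacing_um=1.5, offset_um=2.0)
leftover = CORE_MARGIN_UM - extent

print(f"ring stack extent : {extent} um")
print(f"left towards pads : {leftover} um")
```

Under these assumptions the two rings occupy 43.5 µm of the 45 µm margin, which leaves 1.5 µm between the outer ring and the pads.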
There are many alternative power distribution schemes that can be used. The one that we have chosen here is a very simple one. We have selected the upper metal layers Metal-5 and Metal-6 for the ring, because in this technology Metal-6 is thicker and consequently has less parasitic resistance, which is desirable for power distribution. For your own designs, you should perform a power analysis (the topic of Exercise 3) to find out the power distribution approach that best matches your design.
The width has been chosen as 20 µm for convenience. Basically, the wider the power connection, the better. But as mentioned earlier, in this technology metal lines wider than 20 µm need to be slotted (stress relief slots), which requires extra effort. As an alternative to slotting it is also possible to create several narrower parallel rings, e.g. two VCC and two GND rings. Spacing determines the distance between the two nets and Offset determines the distance between the core area and the innermost ring.

We also need a (partial) ring around the macro-cell; you will see later why this is necessary. Select the menu "Power → Power Planning → Add Rings..." just like before. This time in the Ring Type box, select Block ring(s) around. You can leave the selection at Each block since we have only one block anyway. SoC-Encounter is usually smart enough to create wires only on the edges where there are no power lines yet, i.e. to not create new wires on top of the core ring. If this fails you can specify the segments and connections you want on the Advanced tab. Fill in values/settings similar to those used for the core ring and click on Ok.

If at any point you wish to delete part of the floorplan you can:
- use the Undo feature by simply pressing u
- select and remove objects of a specific class (press d)
- use the menu option "Floorplan → Edit Floorplan → Clear Floorplan..."
- select an object and hit the Del key on the keyboard

Also, you can save or load (restore) your floorplan at any time using the menu "Design → Save → Floorplan ..." and "Design → Load → Floorplan ..." respectively. Save your floorplan to the save directory.

At this point power reaches the standard cells only from the sides. Especially for fast designs, the standard cells in the middle of a standard cell row will not receive sufficient power, so it is important to add vertical stripes to improve the power distribution. Select "Power → Power Planning → Add Stripes ...".
The Set Configuration part of the window defines the properties of one stripe set. The Set Pattern part defines how many stripes will be added. We can either choose to insert a fixed number of sets or only specify the distance between two sets (Set-to-set distance:). In the First/Last Stripe part, we select Relative from core or selected area. Enter values for X from left and X from right that position the stripe sets in such a way that the standard cell rows get divided into three equally long pieces. See the screenshot for width, spacing and layer. Note: You can fine-tune this later by moving the stripe sets. By default stripes will continue over macro-cells. To prevent this, select the Omit stripes inside block rings option in the Stripe Breaking section of the Advanced tab.
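As a rough aid for choosing X from left and X from right, the sketch below computes where the stripe sets would have to sit to cut a row into equal pieces. It ignores the width of the stripe set itself, and the core width used here is only an assumption (the maximum core edge minus a 45 µm margin on each side); read the actual core box from your own floorplan.

```python
def stripe_offsets_um(core_width_um, n_pieces):
    """Positions, measured from the left core edge, of the stripe sets that
    divide a row of the given width into n_pieces equally long pieces."""
    if n_pieces < 2:
        return []
    piece = core_width_um / n_pieces
    return [round(i * piece, 3) for i in range(1, n_pieces)]


# Assumed core width: maximum core edge minus a 45 um margin on both sides.
CORE_WIDTH_UM = 1099.26 - 2 * 45.0

offsets = stripe_offsets_um(CORE_WIDTH_UM, 3)
x_from_left = offsets[0]
x_from_right = round(CORE_WIDTH_UM - offsets[-1], 3)

print(f"stripe set positions : {offsets}")
print(f"X from left          : {x_from_left} um")
print(f"X from right         : {x_from_right} um")
```

With two stripe sets the distance from the left edge to the first set equals the distance from the last set to the right edge, so the same value goes into both fields.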
It is rather easy to move wires in SoC-Encounter. Click on the move wires button (or press m), select the wires you want to move, and drag them to their new location. SoC-Encounter will make sure that electrical connections remain intact. If you want, you can use this to fine-tune the stripe placement. We still need to define a block halo for the RAM macro-cell. This is necessary to keep standard cells from being placed too close to the RAM and also to avoid problems when routing the power lines of the standard cell rows. The figure below illustrates one common problem with the block halo.
[Figure: block halo and standard cell power rails next to a macro-block. Labels: Terminated Power Line (good), Dangling Power Line (bad), Standard Cell Row, Macro-Block, Power Rails, Block Halo]
In this figure, only two standard cell rows are shown. The block halo around the first row extends far enough to cover the two power lines [21]. This is as it should be. For the second row, the block halo does not cover the power rails, and when making the power connections SoC-Encounter will try to extend the power connection past the power rails as shown in the figure. This leaves a dangling power line [22]. While this will not render your chip useless, it should be avoided. From the menu select "Floorplan → Edit Floorplan → Edit Halo...". A window will appear where you can specify a keep-out zone for routing and/or placement around the macro-cell. Usually we only need a Placement Halo. The size will depend on your power routing/floorplan. Create an appropriate Placement Halo.
Notice that the I/O pads are placed with some distance between them [23]. At some point in the design flow we need to close the gaps between the I/O pads in order to complete the supply rings that run around the core (within the pad cells) and are required to supply the circuitry within the pad cells. Instead of using wires, we will place so-called filler cells that completely fill the gaps and establish the required connectivity. There is a script that will automatically insert matching filler cells. Type the following in the SoC-Encounter console window:
source scripts/fillperi.tcl
Now we need to finalize the power connections of the chip. The following connections still need to be made:
- The core ring needs to be connected to the core supply pads (VCC3IOD and GNDIOD).
- All standard cells need to be connected to VCC and GND lines.
- All macro-cells need to be connected to VCC and GND lines.
[21] This is just for illustration. It is not possible to draw a block halo that has this (L) shape.
[22] This sort of dangling wire is known as a geometry antenna in SoC-Encounter.
[23] This is due to the constraints set by the company that bonds the chips. They specify that the minimum distance between two adjacent pads is 90 µm. Since even a core-limited pad in this technology is roughly 60 µm wide, we need to place them with gaps in between.
Select "Route → Special Route ..." from the menu. SRoute is the special net router, and is only used to make power connections. The Route: part contains the different connection types we have listed above. Block pins are macro-cell power connections, Pad pins are the connections from the core supply pads to the core ring. We will not need Pad rings since we have already used filler cells to complete these rings. Standard cell pins will add power lines to the standard cell rows. Finally, if you still have stripes that are not connected to power (not very likely) you can use the Stripes (unconnected) option. While it is possible to route all connections at the same time, it is strongly recommended to do it one by one:
1. Start with Pad pins. If nothing happens you have most likely forgotten to source the globalnet.tcl script.
2. Route Block pins. Check the result: did the router connect the macro-cell the way you wanted? If not, you may need to study the Advanced tab of the SRoute window. If all else fails you can edit the connections manually.
3. Route the Standard cell pins. This should create many horizontal Metal-1 lines that connect to the rings and stripes. Look for dangling wires around the block halo (adjust the block halo if necessary).
We are now finished with floorplanning. Your floorplan should look similar to the following screenshot.
Placement
We will now start with the placement of the standard cells in the core area. Placement is a very computation-intensive problem, and mostly heuristic algorithms are used for this purpose. Select "Place → Standard Cells ...".
We want to run a full placement and not an incremental or just the quick prototyping one. "Include Pre-Place Optimization", however, is very useful as it removes all buffer/inverter trees from the netlist, which will help us with timing analysis as you will see later. To set advanced options click Mode. Set "Congestion Effort" to "Low" and deselect "Run Timing Driven Placement", as timing-driven placement takes much longer and might not help that much to improve timing. There are several other options that you can set, but at this time we will leave them as they are. Apply the changes by pressing "OK". You will come back to the placement window seen below; click "OK" to start placement. This may take some time.
A word of warning about the various performance-related options such as "Congestion Effort" and "Run Timing Driven Placement" above: in the exercises we will sometimes advise you to use certain settings for these options in order to reduce runtime, or because we have found that a particular option gives better results for this particular design. When you do your own designs, you should evaluate which options are better suited rather than copying all options from this exercise.
For each standard cell, the placement algorithm will try to find the optimum location so that there is a feasible routing solution and the total length of the connections is minimized. Examine the placement by using the design browser (switch to the physical view). You will notice that standard cells within the same entity are mostly placed next to each other. The available space and the placement of macro-cells and I/O pads can have a great influence on the placement of standard cells. Even though more space seems to be a good idea, too much space sometimes results in placements where the average distance between standard cells, and consequently the delays caused by wire capacitance/resistance, become larger. Only experience and several iterations will allow you to find a placement for your circuit that is close to optimal. Note: Visibility of Special Net is turned off in the next screenshot.
The results for placement (and later routing) are strongly design dependent. For example, structures with many interconnections such as look-up tables will usually need much more space than synthesis predicted, as the cells need to be spread out in order to have enough space to route all the interconnections. This is why generalizations for back-end design, such as "During back-end design, your circuit area will increase by 10%", don't work very well. Let us save the entire design with "Design → Save Design As → SoCE". This will save the configuration file, netlist, floorplan, special route, placement and routing files as well as the current mode, options and preferences. A design saved in this way can be restored using "Design → Restore Design ... → SoCE". The space required is surprisingly small as most files are compressed and the library files do not get saved along with the design. Remember to save under the save directory. Alternatively you could also just save the placement. Select "Design → Save → Place ...".
During synthesis, Synopsys Design Compiler assigns constant logic values to two special standard cells named TIE0x and TIE1x, where x is a drive strength modifier. This creates a small inconvenience, as often one of these cells is assigned to drive many inputs at the same time, creating relatively long interconnections. There is sufficient space on the chip to place several of these cells. We will use a script that first removes all these cells. Then we will set the rules for placing these cells. The example script scripts/tiehilo.tcl sets the maximum number of connections driven by a single cell to 10, and the maximum distance between the pin and the tie cell to 100 µm. Finally we insert the tie cells according to the rules we have defined.
At the command line type: source scripts/tiehilo.tcl
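To get a feeling for what the two rules in the script imply, the following sketch groups constant-driven pins so that no group exceeds 10 pins and every pin lies within 100 µm of its tie cell. This is not the algorithm used by SoC-Encounter or by tiehilo.tcl. The greedy grouping, the use of Manhattan distance, the idea of placing the tie cell at the first pin of a group, and the pin coordinates are all assumptions made purely for illustration.

```python
import math

MAX_FANOUT = 10          # connections per tie cell, as set in the example script
MAX_DISTANCE_UM = 100.0  # pin-to-tie-cell distance limit, as set in the example script


def min_tie_cells(n_pins, max_fanout=MAX_FANOUT):
    """Lower bound on the number of tie cells from the fanout rule alone."""
    return math.ceil(n_pins / max_fanout)


def greedy_tie_groups(pins, max_fanout=MAX_FANOUT, max_dist=MAX_DISTANCE_UM):
    """Group (x, y) pin positions so that each group obeys both rules.

    A tie cell is imagined at the first ungrouped pin; it then picks up
    further pins within max_dist (Manhattan distance) until it is full.
    """
    remaining = list(pins)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        group = [seed]
        rest = []
        for pin in remaining:
            dist = abs(pin[0] - seed[0]) + abs(pin[1] - seed[1])
            if len(group) < max_fanout and dist <= max_dist:
                group.append(pin)
            else:
                rest.append(pin)
        remaining = rest
        groups.append(group)
    return groups


# Hypothetical pin positions in micrometres: 12 pins in a row, one far away.
pins = [(10.0 * i, 0.0) for i in range(12)] + [(900.0, 900.0)]
groups = greedy_tie_groups(pins)

print("lower bound from fanout rule:", min_tie_cells(len(pins)))
print("group sizes after both rules:", [len(g) for g in groups])
```

The fanout rule alone gives only a lower bound on the number of tie cells. A single pin that sits far away from the others forces an extra cell because of the distance rule.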
Timing
The synthesis tools we currently use for HDL synthesis (Synopsys DC Shell/Design Vision) are not aware of any instance placement information. Therefore the interconnects can only be estimated based on a statistical model, i.e. the fanout of a net determines its length, capacitance, resistance and area. Now that the placement and even trial routing are available, the timing might differ considerably from the numbers obtained from Synopsys.
7.1
Analysis
SoC-Encounter has a very practical timing analysis function, where you usually only have to specify the state of the design (see below) and the "Analysis Type" (Setup or Hold) you want to run.
- Pre-Place: design is not placed
- Pre-CTS: design is placed but the clock tree is not yet inserted
- Post-CTS: design is placed and the clock tree is inserted
- Post-Route: design is placed and routed
- Sign-Off: will use extra tools for even more precise analysis. We will not use this as these tools are not installed/set up.
Depending on this state, trial route (a very simple, but fast routing) and/or parasitic extraction might be run automatically prior to the timing analysis. This will improve the accuracy and help to avoid unnecessary iterations. Open "Timing → Analyze Timing" and make sure "Pre-CTS" and "Setup" are selected.
Start the timing analysis by clicking "Ok". Note: You could also do this from the command line with
timeDesign -preCTS
As the design is not routed, SoC-Encounter will perform trial route and parasitic extraction before doing the timing analysis. A short summary will be displayed on the console (the actual numbers may differ slightly):
------------------------------------------------------------
          timeDesign Summary
------------------------------------------------------------

+--------------------+---------+---------+---------+---------+---------+---------+
|     Setup mode     |   all   | reg2reg | in2reg  | reg2out | in2out  | clkgate |
+--------------------+---------+---------+---------+---------+---------+---------+
|           WNS (ns):| -7.815  | -5.368  | -7.815  | -0.582  | -7.110  |   N/A   |
|           TNS (ns):| -2113.3 | -1239.7 | -1969.2 | -1.269  | -38.582 |   N/A   |
|    Violating Paths:|   757   |   708   |   375   |    8    |    6    |   N/A   |
|          All Paths:|  1811   |  1344   |   819   |   18    |    6    |   N/A   |
+--------------------+---------+---------+---------+---------+---------+---------+

+----------------+---------------------------+--------------+
|                |           Real            |    Total     |
|    DRVs        +--------------+------------+--------------|
|                |Nr nets(terms)| Worst Vio  |Nr nets(terms)|
+----------------+--------------+------------+--------------+
|   max_cap      |  135 (135)   |   -3.518   |  136 (136)   |
|   max_tran     | 370 (14467)  |   -7.767   | 388 (14485)  |
|   max_fanout   |    0 (0)     |     0      |    0 (0)     |
+----------------+--------------+------------+--------------+

Density: 78.864%
Routing Overflow: 0.00% H and 0.23% V
------------------------------------------------------------
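One way to read the WNS (worst negative slack), TNS (total negative slack) and path-count rows together is to derive the share of violating paths and the mean slack of the violating paths for each group. The sketch below does this with the numbers copied from the summary above. It assumes that TNS is the sum of the slacks of exactly the paths counted under "Violating Paths", and the helper name is made up.

```python
# Values copied from the setup-mode summary:
# (WNS in ns, TNS in ns, violating paths, all paths)
summary = {
    "all":     (-7.815, -2113.3, 757, 1811),
    "reg2reg": (-5.368, -1239.7, 708, 1344),
    "in2reg":  (-7.815, -1969.2, 375, 819),
    "reg2out": (-0.582, -1.269, 8, 18),
    "in2out":  (-7.110, -38.582, 6, 6),
}


def severity(wns, tns, violating, total):
    """Return (share of paths violating, mean slack of the violating paths)."""
    share = violating / total
    mean_slack = tns / violating if violating else 0.0
    return share, mean_slack


for group, row in summary.items():
    share, mean_slack = severity(*row)
    print(f"{group:8s} {share:6.1%} violating, "
          f"mean slack {mean_slack:7.3f} ns, worst {row[0]:7.3f} ns")
```

A group such as reg2out, where the mean is small and only a few paths violate, is a very different problem from in2out, where every path violates by several nanoseconds. The latter points to a constraint problem rather than to slow logic.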
The summary gives a very good overview of the current design timing. Some explanations:
- The analysis was run in setup mode, i.e. setup time checks were performed but no hold time checks.
- The columns contain numbers for all paths in the design ("all") or for specific path groups, e.g. reg2reg for all register-to-register paths.
- Worst negative slack (WNS) reports the slack of the most critical path. Negative numbers mean that the constraints are violated by this value.
- Total negative slack (TNS) is the sum of the negative slacks of all violating paths. Together with the number of violating paths this figure helps to see how severe the violations are.
- Real/Total DRVs show (electrical) design rule violations; some libraries, for example, specify a maximum transition time for all nets. The report above shows that 370 nets have a transition violation (the signal takes too long to change from logic-1 to logic-0 or vice versa). In addition 135 nets have a maximum capacitance violation (the total amount of capacitance on the net exceeds the limit set by the design library). These violations are mostly related to excessive parasitic capacitance due to interconnections, and generally cause timing violations as well. However, even if a DRV does not cause a timing violation it needs to be fixed.
- "Density" and "Routing Overflow" show the placement utilization and the use of routing resources, i.e. they are a measure of the feasibility of the current floorplan/placement.
Remark: Refer to exercise 4 of VLSI I [24] if you have problems with timing concepts.
The summary looks really terrible. Obviously we have many timing violations that we need to have a closer look at before we try to optimize the timing with SoC-Encounter. Here are some important points to consider when doing so:
[24] You can access the exercise descriptions, files, and solutions under /home/vlsi1/u4.
- The timing depends entirely on the constraints you have specified in the file src/chip.sdc. The most common mistake is to have errors in this file. Before you go any further make sure that your timing constraints are correct.
- Make sure not to accidentally use constraints that were written for the core level (chip without pads) at the chip level (with pads) and vice versa. The pads affect the I/O timing quite a bit, and the drive capabilities of a standard cell and an output pad are entirely different, i.e. set_load needs to be very different.
- Inputs and outputs used for test and debugging may cause timing violations. Most of these signals are not dynamic (they are not toggled during normal operation) and the timing paths originating from these inputs or ending at these outputs should be ignored, i.e. left unconstrained or explicitly disabled.
- To speed up delay calculation SoC-Encounter does not compute the timing of nets with a fanout above a certain limit but rather swaps in predefined values for delay, capacitance and transition time. All these numbers are specified on the "Design Import" form on the "Advanced" tab in the "Delay Calculation" category. As a result you will not see the real timing [25] of these nets in timing analysis, and furthermore optimization will not see (and therefore not fix) violations [26] on these nets. However, this is usually the desired behavior as we give these nets a special treatment anyway (with CTS).

Let's now examine the detailed reports that were generated by timing analysis and can be found in the timingReports folder. Each analysis produces multiple files. Among these there are three files dedicated to design rule violations (max capacitance: *.cap, max fanout: *.fanout, max transition time: *.tran), and separate *.tarpt timing analysis report files for the different path groups (in2out, in2reg, reg2reg, reg2out).
- Where do the violating paths in the "in2out" path category start?
- Where do the violating paths in the "in2reg" path category start?
- Do the paths in "reg2out" and "reg2reg" look like normal paths that should be optimized to meet timing or is there something wrong?
- Why are the "reg2reg" paths too slow? Look for large numbers in the "Delay" column and check the drive strength of the corresponding cell.

There are several different problems in the .sdc file that we have used. First of all, two of our inputs should not be considered for timing analysis [27]. We also have several nets (clock, reset and scan enable) that we will take care of separately (using the clock tree synthesizer, which we will see later). These nets will show up in the DRV reports. We do not want to solve timing-related problems for these nets (since they will be solved later anyway); the time and effort required to optimize these nets could prevent other parts of the design from being optimized. We can use the Default Pin Limit feature of SoC-Encounter to stop SoC-Encounter from extracting timing information (and reporting timing violations) for the nets that we will be optimizing later on. By default the pin limit of SoC-Encounter is set to 1000. In our case this number is too high (we have slightly more than 400 flip-flops in our design).

Let us see the nets which have a large fanout. Report all nets with e.g. more than 400 pins. Use the console command:
report_net -min_fanout 400
Now set a suitable limit with the command setUseDefaultDelayLimit <number> so that the high fanout nets will not be considered for timing. Also make the necessary changes to the timing constraints file src/chip.sdc to disable the offending input ports. Reload the timing constraints by selecting the menu "Timing → Load Timing Constraint ...". Then rerun timing analysis. If you have done everything correctly, the only setup violations should be in the register-to-register and register-to-out path groups. There should no longer be pins that belong to the scan enable or reset network in the transition time violation report.

[25] To see the real timing you can change the limit on-the-fly from 1000 to a very high value in the console with setUseDefaultDelayLimit 100000. More on this topic later.
[26] DRV violations will be fixed but no setup/hold violations. Clock nets are even more special: no DRV fixing will be done there either.
[27] SoC-Encounter provides a special timing calculation mode that is called Multi-Mode Multi-Corner Analysis (MMMC). In this mode it is possible to define several scenarios (i.e. separate test and functional modes). The setup for MMMC is slightly involved and will not be covered as part of this exercise.
7.2
Optimization
In order to (better) meet the constraints, SoC-Encounter can try to optimize the design at every stage of the design process. In our case, the worst setup time violation is about 5.8 ns (for an 8 ns period), although the netlist delivered by the synthesis tool had no timing violations. This is due to differences in interconnect parasitics between the two tools. While the synthesis tool relies on an estimate (based on a statistical model), SoC-Encounter can use the real placement and (trial) routing at hand. Consider the following excerpt from a timing report (broken down over many lines for readability):
Path 1: VIOLATED Setup Check with Pin i_top/u_filter/u_filter_stage_4/RegxDP_reg_47_/CK
Endpoint:   i_top/u_filter/u_filter_stage_4/RegxDP_reg_47_/D (v) checked with leading edge of ClkxCI
Beginpoint: i_top/u_ram_wrapper/i_ram/DO7 (^) triggered by leading edge of ClkxCI
Path Groups: {reg2reg}
Other End Arrival Time          0.000
- Setup                         0.127
+ Phase Shift                   8.000
= Required Time                 7.873
- Arrival Time                 13.685
= Slack Time                   -5.812
     Clock Rise Edge                 0.000
     = Beginpoint Arrival Time       0.000
     Timing Path:
     +----------------------------------------+---------------+--------------------+-------+-------+-------+---------+
     | Instance                               | Arc           | Cell               | Slew  | Load  | Delay | Arrival |
     |                                        |               |                    |       |       |       | Time    |
     |----------------------------------------+---------------+--------------------+-------+-------+-------+---------|
     |                                        | ClkxCI ^      |                    | 0.000 | 1.828 |       | 0.000   |
     | ClkxCI_PAD                             | I ^ -> O ^    | XMD                | 0.000 | 0.000 | 0.000 | 0.000   |
     | i_top/u_ram_wrapper/i_ram              | CK ^ -> DO7 ^ | SY180_2048X16X1CM8 | 0.115 | 0.026 | 1.739 | 1.739   |
     | i_top/u_ram_wrapper/i_test_bypass_mux7 | A ^ -> O ^    | MUX2               | 8.451 | 1.876 | 3.975 | 5.715   |
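The arithmetic in the report header can be reproduced by hand. The short sketch below recomputes the required time and the slack from the four numbers printed in the report. The function and variable names are made up for illustration, and all values are in ns.

```python
# Numbers taken from the header of the timing report (all in ns).
other_end_arrival = 0.000  # clock arrival at the capturing register
setup_time = 0.127         # setup requirement of the capturing register
phase_shift = 8.000        # one clock period between launch and capture
arrival_time = 13.685      # data arrival at the endpoint


def setup_slack(other_end_arrival, setup, phase_shift, arrival):
    """Return (required time, slack) for a setup check."""
    required = other_end_arrival - setup + phase_shift
    return required, required - arrival


required, slack = setup_slack(other_end_arrival, setup_time, phase_shift, arrival_time)

print(f"required time : {required:.3f} ns")
print(f"slack         : {slack:.3f} ns")
```

A negative slack means that the data arrives later than required. Here the path would need to become about 5.8 ns faster to meet the 8 ns period.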
The last line of the timing path reports a standard cell instance of type MUX2 with low driving capability (2) that has to drive a big load on its output (1.876 pF). The propagation delay is therefore huge (3.975 ns). The timing of the same cell as reported by synthesis was: Delay: 0.15 ns, Slew: 0.09, Load: 0.01. While this is an extreme case, you see how wrong synthesis can be without knowing the actual placement and wire loads. Open the optimization form by selecting "Timing → Optimize ...".
"Design Stage" needs to be set to the current design stage. Some options are only available for certain stages, e.g. hold time optimization cannot be performed during "pre-CTS" as it doesn't make much sense. Timing is not the only thing that can be optimized. Most technologies specify design rules like maximum transition time, maximum capacitance driven by a certain cell or maximum fanout. After pressing the "Mode" button, within the "Thresholds" section you can find options that can be used to tighten the constraints in order to get some margin [a]. Set the options as shown in the figure below and hit "OK". Watch the progress of the optimization in the console window. SoC-Encounter is very verbose about its actions.
[a] SoC-Encounter will already automatically add a small margin on its own (internally).
During optimization SoC-Encounter can select different drive strengths for cells, add/remove buffers and inverters, move instances or even restructure part of the logic (just like synthesis does). Optimization is done using iterations of timing analysis, optimization, trial route and parasitic extraction. As a last step SoC-Encounter performs a timing analysis on the optimized design, prints the summary to the console and writes the detailed reports to the timingReports directory. Take a look at the summary and the final reports generated. There should be no violations left.
But what happens if we cannot fix the violations with optimization? Again, first make sure you understand what your constraints are and why they are violated. Often there are errors in converting the design specifications to constraints (is the input delay really 3.5 ns? Also for this pin?) and in describing them properly with the commands available. If you still have problems, there are three levels where you can reach a solution:
- Optimization during back-end design (SoC-Encounter): SoC-Encounter can optimize the design at every stage of the design process. In general, the earlier the stage, the more changes can be made, e.g. "Pre-CTS" optimization has much more flexibility than "Post-Route" optimization. At the "Pre-CTS" stage registers can be moved and resized; this will no longer be possible after clock tree insertion. On the other hand, the parasitic interconnect information is much more accurate at later stages of the design, so the timing information (and hence the optimization goals) will be more accurate. We can (re)run the optimization at various stages, try a new placement or even start with a new floorplan. It is impossible to give general guidelines; you will have to see what works best for your design. If you are far from meeting your target (e.g. for a 10 ns clock, if after all optimizations you still have a timing violation of 2 ns), you may need to go back to synthesis.
- Optimization during synthesis: Once you have tried to place and route a netlist you will get a better idea about the relationship between synthesis results and back-end results (area- and timing-wise). You may use this information to adjust the timing constraints and re-synthesize the circuit.
- Architectural optimizations: If nothing else helps, you will have to modify your architecture. During this iteration you will have a much better idea about what is critical for your circuit.
If all of the above fails, you will have to see if the specifications could be changed.
Your design has changed considerably as the optimization algorithms have modified the netlist and placement. Save it by using "Design → Save Design As".
Clock Tree Insertion
The fan-out of a net is the number of inputs driven by a particular output. High fan-out nets (those that drive hundreds or even thousands of inputs) need to be handled differently from standard interconnections. Note: For timing analysis we adjusted the pin limit (setUseDefaultDelayLimit) in order to treat them differently.

Every synchronous circuit has at least one high fan-out net, namely the clock net. For most circuits, reset and scan-enable signals have to be distributed to each and every flip-flop as well. The main problem with high fan-out nets is the large load capacitance that needs to be driven. Each driven input adds its own input capacitance to the total load, and the interconnect required to distribute the signal to all these inputs increases the load capacitance further.

There are three important parameters for such nets:

Transition time: The time it takes to change the logic level of a node (e.g. 0 → 1). The more load an output has to drive, the more time is required to charge this load. CMOS drivers consume additional short-circuit current during the transition, so long transition times are undesirable. Furthermore, noise on signals with long transition times can result in glitching. Most libraries set an upper limit for the transition time (for the technology we are using this is 1.79 ns for typical libraries). To lower the transition time, a tree of buffers can be inserted so that the total load is shared between the buffers. The lower the desired transition time, the more buffers are required.

Insertion delay: The time required for the signal to travel from the driver to the end-points. This delay is usually different for each end-point. Each level of buffers in the buffer tree adds a delay to the signal.

Skew: The difference between the insertion delays of different end-points. To minimize skew, a balanced buffer tree has to be built. Generally, the lower the desired skew, the more buffers are required.

Which parameters are most important depends on the type of net:

Clock: Our main concern is to reduce the skew, since it will affect our timing. The maximum acceptable skew depends on the clock period. As an example, for a 20 MHz clock a clock skew of 0.5 ns is acceptable. But for a 200 MHz clock, the same skew equals 10% of the clock period and would be too high. If you over-constrain your skew, you will need a deep (and large) clock tree and your insertion delay will rise, which will affect your input and output timing. Therefore you will want to balance the skew against insertion delay and the number of buffers. Constraining the maximum insertion delay too low will usually degrade results. Usually, a tree that gives you an acceptable skew will also give you a decent transition time, so you don't have to worry about that.

Reset: We are interested in propagating the reset to all flip-flops in our design within one clock cycle. For designs with on-chip reset synchronization this is strictly required. The insertion delay should therefore be less than the clock period and the transition times within the bounds imposed by the technology. Skew doesn't matter at all.

Scan enable: Very similar to the reset signal. Usually a slower clock is used for scan testing, therefore we can allow an even larger insertion delay. For transition time and skew the same holds true as for the reset.
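The 20 MHz versus 200 MHz comparison above can be reproduced with a few lines of arithmetic. The following is only a minimal sketch; the helper names clock_period_ns and skew_fraction are made up for this illustration and are not part of SoC-Encounter.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in ns for a clock frequency given in MHz."""
    return 1000.0 / freq_mhz


def skew_fraction(skew_ns: float, freq_mhz: float) -> float:
    """Clock skew expressed as a fraction of the clock period."""
    return skew_ns / clock_period_ns(freq_mhz)


# Same 0.5 ns skew, evaluated for the two clock frequencies from the text.
for freq_mhz in (20.0, 200.0):
    share = skew_fraction(0.5, freq_mhz)
    print(f"{freq_mhz:.0f} MHz: period {clock_period_ns(freq_mhz):.1f} ns, "
          f"0.5 ns skew = {share:.0%} of the period")
```

The same skew costs 1% of the period at 20 MHz but 10% at 200 MHz, which is why the skew target should be chosen relative to the clock period rather than as a fixed number.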
[Figure: buffer tree built by CTS from the AutoCTSRootPin to the sinks. The annotated constraints are Buf Tran (transition time at buffers), Sink Tran (transition time at sinks), Min Delay, Max Delay and Max Skew.]
In SoC-Encounter, clock tree synthesis (CTS) is used to generate optimized buffer trees to drive high fan-out nets. It can be configured to satisfy a variety of constraints. A sample clock tree synthesis configuration file can be found under src/sample/chip.ctstch-sample. The sample file contains three different configurations: one each for a clock, a reset and a scan-enable signal.

Copy this file to the src directory and adapt the AutoCTSRootPin statements to match your design. For educational purposes, change the clock tree specifications as follows: max. skew 0.2 ns, max. insertion delay 4 ns, max. transition time at buffers 0.6 ns and at clock pins 0.4 ns [a]. Take a closer look at the other two trees too.

[a] It is usually not a good idea to specify a max. insertion delay so small that it becomes a limiting factor for CTS. Results may degrade significantly, and for most designs the insertion delay is not very important anyway.
If the design employs a reset synchronization register (the example design has one), the source of the reset tree must be the output of the synchronization register. Note that there is a special option named "SetASyncSRPinAsSync YES" in the reset tree definition. This allows set and reset pins to be considered as targets for the clock tree optimization.

The scan-enable signal is also a special case. Normally the clock tree synthesis algorithm starts at the AutoCTSRootPin and traces through the netlist in order to find valid endpoints. By default, the tracer passes through combinational gates and stops at clock and asynchronous input pins of sequential elements (flip-flops). By specifying the "NoGating rising" option, we can make the tracer stop at the first gate encountered. This is necessary since the scan-enable signal is often connected to multiplexers and we want their input pins to be endpoints. Once this option is in use, you need to specify the internal pin of the pad driving the scan-enable signal, otherwise tracing will stop prematurely at the pad cell.

Read in the clock tree specification by selecting Clock → Design Clock ... from the menu. Using the browser, select the clock tree specification file you have just modified. Press Load Spec. DON'T PRESS OK yet [a]. You should now see a summary for all three clock specifications on the console; check it.

Our netlist may have some buffers on the high fan-out nets we want to build trees on. We need to remove them prior to CTS with the following command:

    deleteClockTree -all

[a] Pressing OK will start the clock tree insertion. We need to make sure that the clock tree specification is correct before we go ahead with this step. If you accidentally pressed OK here, it is advisable to restart from the last saved point.
A large number of errors can be discovered by analyzing the pins connected to these nets, even before building a clock tree.

Select "Clock Tracer Pre-CTS Clock Tree ..." from the menu. To start the trace, click on the icon on the top left and accept the default trace file name. A summary will be displayed on the console and the content of the trace file will be visualized in the GUI.
We can see what the trees currently look like and which pins are connected to them. Look also at the trace file directly. Things to look for include:

- Clock, reset, or scan-enable connecting to unexpected input pins, e.g. the reset signal should not connect to pins other than asynchronous set/reset pins of sequential elements. Unexpected latches on the clock tree can be discovered this way (G or GB pin).
- Discrepancies between the number of endpoints of the clock, reset and scan trees. For our example the numbers are as follows:
    clock tree: 443, with 442 flip-flop CK pins + 1 RAM CK pin
    reset tree: 441 flip-flop RB pins
    scan tree: 447, with 441 flip-flop SEL pins + 6 mux S pins (the muxes choose between the functional and the test (scan chain) output signal)
  As we see, 442 flip-flops are clocked but only 441 receive a reset signal. This is because the reset synchronization register is connected to the external reset signal rather than to the internal reset tree. The reset synchronization flip-flop is also not on the scan chain, and since we use full scan otherwise, the 441 flip-flops on the scan tree match perfectly. You get the idea...

Open the file chip.cts_trace and search for "Clock Tree" to examine the leaf pins. If everything looks OK we can proceed with clock synthesis. In the "Synthesize Clock Tree" form press "OK".
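The endpoint bookkeeping described above can be written down as a small consistency check. This is only an illustrative sketch: the counts are the ones quoted in the text, while the dictionary layout and the function names are assumptions made for this example and do not correspond to any SoC-Encounter report format.

```python
# Endpoint counts quoted in the text for the example design.
TREES = {
    "clock": {"flip-flop CK": 442, "RAM CK": 1},
    "reset": {"flip-flop RB": 441},
    "scan": {"flip-flop SEL": 441, "mux S": 6},
}

# Flip-flops that are left out of a tree on purpose. Here this is the reset
# synchronization flip-flop, which is neither reset internally nor scanned.
KNOWN_EXCEPTIONS = {"reset": 1, "scan": 1}


def total_endpoints(tree: str) -> int:
    """Total number of endpoints reported for one tree."""
    return sum(TREES[tree].values())


def unexplained_flip_flops(tree: str, ff_pin: str) -> int:
    """Clocked flip-flops the tree misses beyond the known exceptions."""
    clocked = TREES["clock"]["flip-flop CK"]
    reached = TREES[tree][ff_pin]
    return clocked - reached - KNOWN_EXCEPTIONS.get(tree, 0)


for tree in TREES:
    print(f"{tree} tree: {total_endpoints(tree)} endpoints")
print("reset tree unexplained:", unexplained_flip_flops("reset", "flip-flop RB"))
print("scan tree unexplained:", unexplained_flip_flops("scan", "flip-flop SEL"))
```

A non-zero "unexplained" count would be a reason to go back to the trace file and find out which flip-flops are missing from the tree.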
After a few minutes, clock tree synthesis will be completed. Detailed reports will be generated under the directory specified on the form (most likely clock_report). This directory includes a simple report file (clock.report). A summary report is also displayed on the SoC-Encounter console. The first column shows the achieved performance while the second column reports the target specified in the configuration file.

Check your results (summary and detailed reports). How many buffers were added? How many levels were created? What is the insertion delay? Are all constraints met?

Note 1: You will get a max transition time violation on ClkxCI_PAD/I which can safely be ignored. As we have specified an input transition time of 800 ps on all primary inputs, there is no way CTS could fulfill the 600 ps requirement at this point.

Note 2: Unless the RouteClkNet YES option was used (more on this later), the timing figures reported are only estimates and might change quite a bit with detailed routing.
Timing Revisited
At this point we will have to go into some more detail about timing. During different stages of the design flow, we have slightly different timing constraints (refer to the following figure for the differences between the three stages).

a) synthesis: Initially the design does not contain any pads. The input delay t_idel and the output delay t_odel should contain the contributions of the input pad (t_inpad) and the output pad (t_outpad).

b) pre-CTS: During the placement and routing phase, all required I/O pads and drivers are present. At this stage there is no clock tree yet. The timing should be adjusted, as the input delay t_idel and the output delay t_odel no longer include the pad delays.

c) post-CTS: Once the clock tree is inserted, the timing will change slightly again. Due to the clock insertion delay t_di, the internal clock will be slightly offset compared to the external clock. At the input, the data travelling towards the first flip-flop inside the chip will have more time, since this flip-flop is triggered by a clock signal that has been delayed by t_di. At the output, however, the data coming from the chip is launched with the internal clock but has to be sampled with the external clock. Consequently there will be less time for this signal.

It should now be clear why it might be desirable to constrain the clock insertion delay by specifying minimum and maximum values in the chip.ctstch file with the MinDelay and MaxDelay parameters. The clock insertion delay can play an important part in the I/O delay. You may want to keep the insertion delay within certain limits to ensure proper I/O timing.

Design tools have different mechanisms to deal with these three cases. The simple solution is to use multiple constraint files for the different stages. However, both Synopsys Design Compiler and Cadence SoC-Encounter accept several parameters to deal with this problem automatically.
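The effect of the insertion delay on the I/O paths can be made concrete with a simplified budget calculation. The sketch below assumes a single clock, an ideal external clock edge at t = 0 and a single-cycle transfer at both interfaces. The function names and the example numbers are hypothetical, only the symbols T_clk, t_idel, t_odel and t_di follow the text.

```python
def input_budget(t_clk: float, t_idel: float, t_di: float) -> float:
    """Time left for the on-chip input path (pad, logic and setup).

    The data becomes valid t_idel after the external clock edge and is
    captured by the internal clock, which arrives t_di later than the
    external one.
    """
    return t_clk + t_di - t_idel


def output_budget(t_clk: float, t_odel: float, t_di: float) -> float:
    """Time left for the on-chip output path (clock-to-output, logic, pad).

    The data is launched by the delayed internal clock and must be stable
    t_odel before the next external clock edge.
    """
    return t_clk - t_di - t_odel


# Hypothetical numbers in ns, chosen only for illustration.
T_CLK, T_IDEL, T_ODEL = 10.0, 3.0, 3.0

for stage, t_di in (("pre-CTS (ideal clock)", 0.0), ("post-CTS", 1.5)):
    print(f"{stage}: input budget {input_budget(T_CLK, T_IDEL, t_di):.1f} ns, "
          f"output budget {output_budget(T_CLK, T_ODEL, t_di):.1f} ns")
```

In this model the insertion delay only shifts time from the output side to the input side. The sum of the two budgets stays the same, which is why bounding the insertion delay with MinDelay and MaxDelay keeps both interfaces workable.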
In the following we will discuss how SoC-Encounter calculates delays in the presence and in the absence of a clock tree. The following table summarizes the most important settings:

timing analysis mode     clock propagation mode     clock latency
(setAnalysisMode)        (set_propagated_clock)     (set_clock_latency)
-noSkew                  forced ideal               no effect
-skew -noClockTree       forced ideal               SDCs in effect
-skew -clockTree         SDCs in effect [a]         SDCs in effect [b]

[a] Still ideal mode unless set_propagated_clock is set.
[b] The set_clock_latency command is overridden by overlapping set_propagated_clock constraints.
The timing analysis mode is automatically updated by SoC-Encounter to match the design stage, i.e. before clock tree insertion it is set to -skew -noClockTree and afterwards to -skew -clockTree. The analysis mode can also be changed manually with the setAnalysisMode command. The two Synopsys Design Constraints (SDC) set_propagated_clock and set_clock_latency are usually specified by the designer in the chip.sdc file. Furthermore, CTS tries to add a set_propagated_clock constraint on the fly (in memory), which can cause a number of problems:

- This constraint will only be added if the AutoCTSRootPin pin/port in chip.ctstch and the clock waveform source pin/port (from the create_clock command in chip.sdc) are perfectly identical, i.e. not port vs. instance pin etc.
- This constraint is never written to your chip.sdc file, so if you reload that file the constraint is lost.
- Before CTS, only a pointer to your constraints file is saved along with the database. If a constraint was added by CTS, all loaded constraints (including the new one) will be saved along with the database to a new file (*.pt). Restoring this database will then load this new constraints file instead of the one in encounter/src/ that you might have expected. Note: As soon as you manually (re-)load a constraints file, the behavior reverts to the normal one.

As can be seen from the table above, to get the actual timing of the buffers/inverters on the clock tree instead of ideal mode, both -skew -clockTree and set_propagated_clock are required. Also note that set_propagated_clock gets overridden for all pre-CTS design stages and could therefore be set right from the start (as already mentioned earlier).

In ideal mode, the clock tree insertion delay is zero unless the set_clock_latency command is used to specify a different number, preferably close to the delay of the real tree (that is still to be inserted). While this "placeholder" delay has the advantage that the I/O timing doesn't change between the pre-CTS and post-CTS phases, it makes timing reports less transparent and is not handled exactly the same way across different tools. Therefore, do not use this command unless you know what you are doing.

In conclusion, it is recommended to include set_propagated_clock right from the start, not to use set_clock_latency, and to load modified timing constraints after CTS only if required, i.e. when the I/O timing numbers (set_input_delay, set_output_delay) need to be adjusted to account for the actual clock tree [28]. For this training we will modify and reload the constraints [29].

[28] For slower clock speeds and/or uncritical I/O timing this is often not required.
[29] It might be more convenient to keep a separate post-CTS constraint file rather than changing the numbers back and forth when redoing the flow.
The following figure illustrates all three stages in some detail. Wherever possible the same naming conventions as in the textbook have been used [30].

[Figure: timing of the input, register-to-register and output paths in the three stages: a) synthesis (Top), b) pre-CTS (Chip), c) post-CTS (Chip). Each panel shows the clock Clk with period T_clk and the path segments t_idel, t_inpad, t_in2reg, t_reg2reg, t_reg2out, t_outpad and t_odel, built from the flip-flop delays t_pd ff and t_su ff and the logic delays t_pd a to t_pd e. Panel c) distinguishes the external clock from the internal clock, which is delayed by the clock insertion delay t_di: more time for the input, less time for the output.]

[30] Refer to page 235, "How to formulate timing constraints", and page 346, "How to achieve friendly input/output timing", for more on this topic.
Modify the I/O timing constraints to account for the insertion delay of the actual clock tree, make sure that the clock is set to propagated mode and load the constraints ("Timing → Load Timing Constraint ..." [a]).

Run timing analysis (make sure to select "Post-CTS" as design stage). Examine the reports timingReports/chip_postCTS*. You should now see the real timing on the clock network.

If you have violations, run a "Post-CTS" (!) optimization with default settings. This should fix all violations. Save the entire design.

[a] Currently loaded constraints will be purged before the new ones get loaded.
10
Signal Routing
We will now route the signal nets. What you have seen so far are only trial-route nets that are not DRC clean and therefore cannot be manufactured. There are two routing engines in SoC-Encounter: WRoute is the older one, and NanoRoute is supposed to be the latest and greatest.

Start NanoRoute by selecting "Route → NanoRoute → Route ...". A large window will open. Enable the "Insert Diodes" option (you can leave the "Diode Cell Name" field blank) and leave all other settings at their defaults [a]. Click OK to start routing. You can observe the progress in the console window.

[a] On multi-CPU or multi-core machines you can increase the number of CPUs used by selecting "Set Multiple CPU". This gives an almost linear speedup.
The "Fix Antenna" and "Insert Diodes" options cause the router to change layers and/or insert special protection diodes in order to avoid damage during manufacturing caused by charges that accumulate on the wires and stress the gate oxide of input pins. Note that this effect is usually referred to as "process antennas", which are entirely different from geometrical antennas (which are related to dangling wires).

Our example design should route without problems. This is not always the case, and we might get geometry violations. Geometry violations include shorts between nets and design rule violations (for example, metal lines drawn too close together to be manufactured as separate wires). Needless to say, we must resolve all these violations. You should always closely examine the violations in order to find out what causes them. Sometimes an unfortunate placement of macro-cells or power lines is to blame, and sometimes there is just not enough space to route all connections. Solutions range from re-running routing to completely reworking the floorplan.

Now that we have the real signal wiring, we need to perform a postroute timing analysis to see if we still meet all constraints. At this point not only a setup time analysis but also a hold time analysis needs to be run. Usually it is not necessary to deal with hold time before this point. Note that you have to do two separate runs, one for setup and one for hold, as it is not possible to do this in a single step.

Use the GUI (make sure to select "Post-Route") or type the commands below to perform the two analyses.

    timeDesign -postroute
    timeDesign -postroute -hold

Inspect the two summaries and the report files written to the timingReports directory. You will most likely have setup violations.

To fix violations or increase the hold margin we can now perform a postroute optimization. Internal hold time violations need to be fixed in any case because, unlike internal setup violations, they cannot be worked around later on (i.e. on the real chip) by lowering the clock speed [31]. Further possibilities to improve timing include over-constraining the "Post-CTS" optimization and enabling the "Timing Driven" option of NanoRoute. Earlier in the flow, "Timing Driven Placement" might be worth a try. Please note that the biggest improvements are possible with Pre-CTS optimization, as the registers can still be moved and resized at that stage. By default, clock tree insertion will "fix" the registers to preserve the clock tree, i.e. they can no longer be moved or resized.

If your "reg2reg" setup violations are larger than 0.2 ns, this step will take rather long, i.e. 30 minutes or even longer. Therefore we will change the clock period (only for this exercise!!!) in order to have only a small violation of about 0.1 ns.

Modify and reload the constraint file, then perform a postroute optimization ("Timing → Optimize ..."). Make sure to select hold time fixing and specify a small extra margin for hold slack, e.g. 0.2 ns, by selecting the "Mode" button. Optimization will delete and re-route all nets that are affected by the changes and run setup and hold mode timing analyses at the very end. Once again, inspect the reports.
[31] This does not necessarily hold true for multi-clock designs.
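Why a slower clock removes setup violations but not hold violations can be seen from the textbook slack equations for a single register-to-register path. The following sketch assumes one clock domain. The function names, the skew sign convention (capture clock arrival minus launch clock arrival) and all numbers are assumptions for illustration, not values from the example design.

```python
def setup_slack(t_clk, t_pd_ff, t_pd_logic, t_su_ff, skew=0.0):
    """Setup slack: the data must arrive t_su_ff before the NEXT capture edge."""
    return t_clk + skew - (t_pd_ff + t_pd_logic + t_su_ff)


def hold_slack(t_cd_ff, t_cd_logic, t_ho_ff, skew=0.0):
    """Hold slack: the data must stay stable t_ho_ff after the SAME edge.

    The clock period does not appear in this expression at all.
    """
    return (t_cd_ff + t_cd_logic) - (t_ho_ff + skew)


# Hypothetical path delays in ns.
slow_path = dict(t_pd_ff=0.5, t_pd_logic=4.6, t_su_ff=0.1)
fast_path = dict(t_cd_ff=0.2, t_cd_logic=0.0, t_ho_ff=0.3)

for t_clk in (5.0, 6.0):
    print(f"T_clk = {t_clk} ns: "
          f"setup slack {setup_slack(t_clk, **slow_path):+.2f} ns, "
          f"hold slack {hold_slack(**fast_path):+.2f} ns")
```

Lengthening the period turns the negative setup slack positive, while the hold slack stays negative. Within one clock domain, a hold violation can therefore only be repaired in the layout, for example by adding delay to the short path.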
Now let us have a look at the postroute timing of our clock tree(s):

    reportClockTree -postRoute

This will print a summary on the console and write a couple of report files chip.ctsrpt* to the encounter directory. There should be no (or only minor) violations of our clock tree constraints. Please note that the previous postCTS and postRoute setup (and hold) analyses already consider clock skew, as they time every single path from the clock root to the leaf pins separately. Therefore, even a rather big skew reported here doesn't really matter as long as the former analyses passed.

So far, the clock tree has been routed like any other signal net. This is usually good enough, but if you want, for whatever reason, to further improve clock net timing, you can do the following (in CTS):

- In the clock tree constraint file, set "RouteClkNet YES". This is a per-tree setting that instructs CTS to call NanoRoute in order to route this clock net during clock tree insertion. The wires get a status of "FIXED" and will therefore not be changed later during signal routing. While this improves timing on the clock tree, overall routability gets worse.
- To further improve timing, you can tell NanoRoute not to route this net like an ordinary signal net, but to create a balanced routing (by following the so-called "RouteGuide" computed by CTS). To do so, set "UseCTSRouteGuide YES" in the clock constraint file [32].

[32] This will persistently(!) alter the global CTS mode to setCTSMode -useCTSRouteGuide.
11
Timing Debug
To analyze timing violations, SoC-Encounter also offers a graphical interface ("Timing → Debug Timing") that visualizes paths and allows cross-probing with the layout. We will not explain the tool in detail here, but rather make some important notes:

- This functionality is more or less standalone: it does not use results from the "timeDesign" command but runs a new analysis that generates the file top.mtarpt. The paths in this file are then visualized.
- If the above file already exists, it will usually simply be loaded. This means that whenever your design has changed you have to regenerate this file in order to get up-to-date data. This can be done with the "Generate" switch on the form that opens when you click the folder icon.
- When generating top.mtarpt, the current timing mode is relevant, i.e. to analyze hold paths the timing mode has to be set to hold mode.
12
Finishing
We are almost done with the back-end design. Only a few steps are required to finish the layout and verify that everything is correct.
12.1
Insert Filler Cells
Now that we don't need the additional space within the standard cell rows anymore, we have to fill these gaps with filler cells. This is required for fabrication. In addition, some of them contain capacitors between VCC and GND that filter spikes on the power lines.

    source scripts/fillcore.tcl

Note that your row utilization will be 100% after this step. This means that you will have no room for further optimizations. Make sure to insert filler cells only after all optimizations have been completed. Note: It is also possible to remove the filler cells with "Place → Filler → Delete ..." or by using the script fillcore.tcl.
12.2
Checking Connectivity and Geometry Violations
Now that we are completely finished with the layout, we should make sure that we have no connection errors, i.e. that all logic connections from the netlist are also present in the physical layout.

Select "Verify → Verify Connectivity ..." from the menu. A window will appear. Run the analysis and check the console for the report summary. There should be no violations.

In a similar way, let us verify all geometrical shapes.

Select "Verify → Verify Geometry ..." from the menu. Run the analysis and check the report on the console. You should get no violations.

There is a script that will perform these last verification steps for you automatically. You can set a variable DESIGNNAME to assign the base name for all the files generated by this script.

    set DESIGNNAME MyBeautifulChip
    source scripts/checkdesign.tcl
12.3
Evaluate the Physical Design
Take the time to examine the routing. This is the main feedback you need for a second back-end iteration. Try to view all metal layers separately to see how congested your routing is. If you see a lot of Metal-6 (orange) you are probably close to the density limit. In our design you should not notice any congestion and Metal-6 will barely be used.

If your design routed without problems and the routing was rather sparse, you could assign a smaller core area next time and increase the row utilization. On the other hand, if the design barely routed, you have found the limits. In a second iteration you might consider assigning a little more core area, since timing degrades with congestion.

Check the connections of your macro-cells and pads. This may give you an idea of how to place the macro-cells the next time around. You need to get used to evaluating the results of different back-end design runs.
12.4
Generate Output Files
Congratulations, you have completed the back-end design. That was not so hard now, was it?

Save your design using "Design → Save Design As ... → SoCE" to the save directory and make sure that you use a name that shows this is a finished design (e.g. chip_final.enc).

Finally we need to export all data needed for post-layout simulation and physical verification (DRC/LVS). There is a script that will write out all relevant files to the out/ directory [a].

    source scripts/exportall.tcl

[a] To get complete supply net connectivity in the Verilog netlist for LVS, the missing connections for the power and ground pins (GNDIO/VCC3IO) of the pads are added and removed on the fly. We could also define and handle these two nets in the same way as VCC/GND, but there are more drawbacks than benefits.

Similar to the checkdesign.tcl file, the variable DESIGNNAME will be used to assign the base name of the files. If you do not specify a name, final will be used. After you complete this step you will have the following files:

*.v          The final netlist. Make sure to use this netlist for post-layout simulations.
*.gds.gz     The layout in GDSII (Graphic Design System II) format. This is the standard format for exchanging layout data.
*.sdf.gz     The SDF (Standard Delay Format) file to be used for post-layout simulation.
*.spef.gz    Standard Parasitic Exchange Format. Includes all parasitics; can be used for timing and/or power analysis.
