
A Sliding Window Scheme for Accurate Clock Mesh Analysis

H. Chen (UC San Diego, CA, USA), C. Yeh (UC Santa Barbara, CA, USA), G. Wilke (UFRGS, Brazil), S. Reddy, H. Nguyen, W. Walker, R. Murgai (Fujitsu Laboratories of America, Inc., CA, USA)

This work was done when these authors were at Fujitsu Labs of America as interns. Contact author: murgai@fla.fujitsu.com

Abstract

Mesh architectures are used for distributing critical global signals on a chip such as clock and power/ground. The inherent redundancy created by loops present in the mesh smooths out undesirable variations between signal nodes spatially distributed over the chip. However, one outstanding problem with mesh architectures is the difficulty in analyzing them with sufficient accuracy. In this paper, we present a new sliding window-based scheme to analyze the latency in clock meshes. We show that for small meshes, our scheme comes within 1% of the SPICE simulation of the complete mesh with respect to clock latency. Our scheme is ideally suited for distributed- or grid-computing. We show large design instances where SPICE could not finish, whereas our scheme could complete the analysis in less than 2 hours.

1 Introduction

Mesh or grid architectures are popular for distributing critical global signals on a chip such as clock and power/ground. The mesh architecture uses inherent redundancy created by loops to smooth out undesirable variations between signal nodes spatially distributed over the chip. These variations can be due to non-uniform switching activity in the design, within-die process variations and asymmetric distribution of circuit elements (such as flip-flops). For power/ground, a mesh can help reduce voltage variations at different nodes in the network due to non-uniform switching activities. For the clock signal, a mesh (Figure 1) has been shown to achieve very low skew in microprocessor designs, e.g., Digital 200-MHz Alpha [4] and 600-MHz Alpha [1]; IBM G5 S/390 [8], Power4 and PowerPC [10, 12]; SUN Sparc V9 [13]. A mesh also has excellent jitter mitigation properties.

Figure 1: Clock mesh architecture

However, one imposing problem that has limited the applicability of mesh architectures is the difficulty in analyzing them with sufficient accuracy. The main reasons are the huge number of circuit nodes needed to accurately model a fine mesh in a large design and the large number of metal loops present in the mesh structure. As a result, circuit simulators such as SPICE require an inordinate amount of either memory or run-time. In fact, HSPICE (Synopsys) and HSIM (Nassda) failed to analyze even coarse meshes for an industrial design. We are not aware of any satisfactory solutions to this problem.

In this paper, we propose a new scheme called the sliding window scheme (SWS) for analyzing clock meshes. In particular, our goal is to accurately compute the clock arrival time at the clock input pin of each flip-flop. We show that SWS is extremely accurate (almost always within 1% of SPICE), needs much less memory, and is capable of analyzing large industrial designs in a couple of hours. It is also easily amenable to distributed- or grid-computing. Thus, our scheme effectively solves the problems associated with traditional clock mesh analysis. This should enable widespread use of clock mesh architectures in ASIC and processor designs.

The paper is organized as follows. Section 2 describes previous and related work. Section 3 gives preliminaries. The sliding window-based scheme is described in Section 4. Results on the accuracy of SWS are presented in Section 5. Finally, we conclude with directions for future work in Section 6.

2 Related Work

Not much work has been published on the problem of clock mesh analysis. [12, 5] present a scheme to break the clock mesh into a tree and apply a smoothing algorithm to redistribute the mesh loads. The tree is analyzed for latency. However, no accuracy results are shown. In [1], the clock mesh is verified in two steps. First, AWE-based reduction [11] is performed on the mesh to simplify the mesh elements. Then, the simplified circuits are simulated using SPICE. The accuracy and efficiency of this method depend on the accuracy and stability of the moment matching technique.

[3] and [14] address a different mesh problem: that of sizing a clock mesh given constraints on clock latency. [3] breaks all the loops and converts the underlying mesh to a tree structure, since algorithmically it is easier to handle a tree. The RC delay at a grid node is approximated by the first order pole. Results on clock networks of two actual microprocessor designs are presented. [14] uses the dominant time constant as a measure of the signal delay (instead of Elmore delay) and formulates the sizing problem as a semidefinite programming problem. The results shown are for smaller mesh sizes. Both these methods use an approximate model of delays.

Recently, there has been some related work in power grid analysis. [2] presents a partitioning approach in which the power grid is partitioned into shells, which are analyzed independently. Our work and [2] have a common theme, in that both try to solve the problem using partitioning. However, there are significant differences. [2] ignores the region outside the shell. But for clock latency, we need to model the region outside the window, otherwise the accuracy goes down sharply (as explained later in Section 4.1, under the second approximation).

Figure 2: Mesh buffers connect global H-tree to mesh

0-7803-9254-X/05/$20.00 © 2005 IEEE.

3 Preliminaries

Figure 1 shows a typical mesh architecture used for distributing the clock signal from the PLL or root buffer to sequential elements such as flip-flops and latches on the chip. It has three main components: a (uniform) mesh,^1 a global tree that drives the mesh, and local interconnect, where the clock inputs of flip-flops (FFs) connect directly to the nearest point on the mesh. The mesh is a uniform rectangular grid of wires spanning the entire chip area,^2 driven by the mesh buffers and propagating the clock to the FFs. An $m \times n$ mesh has $m$ rows (horizontal wires) and $n$ columns (vertical wires). The size of a mesh stands for $m \times n$. For a given chip size, the greater the mesh size, the more fine-grain the mesh is. A mesh node (or grid node) is the point where a row is connected to a column. As shown in Figure 2, the global tree delivers the clock signal to the mesh nodes via buffers called mesh buffers. We assume a uniform array of mesh buffers, as in Figure 2. The mesh wire between two adjacent mesh nodes is called a mesh segment.

^1 Although non-uniform meshes are also used, for simplicity we focus on uniform meshes in this paper. It is straightforward to apply our ideas to non-uniform meshes.
^2 More accurately, the mesh should cover the smallest rectangular region spanned by FFs.

In any clock distribution scheme, one of the most important concerns is to accurately compute the clock arrival time (also called clock delay or latency) at the clock input pin of each FF. Assume we have a path $P$ in a design whose start and end gates are FFs $F_1$ and $F_2$. Let the clock arrival times at these FFs be $t_1$ and $t_2$ respectively. The maximum delay $d_{\max}$ allowed on $P$ is a function of $(t_2 - t_1)$, the difference in clock arrival times at the two FFs:

$$d_{\max} \le T + (t_2 - t_1) - t_s \qquad (1)$$

where $T$ is the clock cycle and $t_s$ is the set-up time for $F_2$. $(t_2 - t_1)$ is known as the skew between $F_1$ and $F_2$. By comparing the arrival times among all FFs, we can compute the worst relevant clock skew in the design. This is the maximum difference in arrival times at two FFs which are connected by a data path. The worst skew impacts the maximum operating frequency for the design, since it limits the maximum delay in the data path.

Traditional static timing analysis (STA) techniques assume an acyclic underlying structure for the logic and interconnect and cannot handle loops present in the clock mesh. Moreover, the industry-standard STA tools usually have up to 15% difference vis-a-vis SPICE with respect to cell and interconnect delays. Such a large inaccuracy in timing is unacceptable for the clock signal. So, we use SPICE for accurate timing analysis of the clock mesh.

Since it is relatively straightforward and fast to compute the latency on the global tree, we address only the mesh timing analysis problem. We assume that the design is already placed and FF locations are known. The same clock signal source is assumed to drive all mesh buffers. Our primary interest is in accurately computing the arrival time of the rising edge of the clock at each FF with respect to the clock source.

Mesh Model: The same clock source is applied to all the mesh buffer inputs. Each mesh buffer is accurately modeled using the BSIM3 transistor models for NMOS and PMOS. Since the mesh is largely composed of wires, it is important to have an accurate wire model. To model wires shorter than 100 µm, a single-π model, which has two capacitors, a resistor and an inductor, is used (Figure 3). For longer wires, a 3-π model is used, as shown in Figure 4. Our study on a 0.13 µm technology showed that this scheme is accurate within 0.5% of 4-π and 5-π models [15]. It helps reduce the number of nodes in the SPICE model. The same rule is used to model the wires that connect FFs to the mesh. The clock pin of a FF is modeled as a simple equivalent capacitance.

Figure 3: Single-π model for interconnect

Figure 4: 3-π model for interconnect

4 Sliding Window Scheme

The use of a mesh is severely limited by the difficulty of analyzing it. In the absence of any other tool, SPICE simulations are performed to analyze meshes. SPICE analysis fails on clock meshes for chip-level circuits; for instance, on a 65x65 mesh for a circuit with 100K FFs. Either it runs out of memory or the CPU time is exorbitant. The reasons are as follows.

1. The model size: Due to the importance of the interconnect in determining path delays, we need to model interconnect accurately. Each mesh wire segment contributes three nodes if a single-π model is used, and seven nodes for a 3-π model. Similarly, each FF is a node and its connection point to the mesh is one node. Thus, just the mesh and the local FF connections can contribute hundreds of thousands of nodes in the SPICE model. Either SPICE runs out of memory when generating the model, or the run-time is huge, since the run-time of SPICE grows superlinearly, as $n^{\alpha}$ with $\alpha > 1$, where $n$ is the number of nodes in the model.
2. Due to a large number of metal loops and redundancy present in the mesh structure, it is time consuming for a circuit simulator to simulate the mesh.

We propose a new method called the sliding window scheme (SWS) for analyzing latency in clock distribution networks involving meshes. The sliding window scheme is based on the observation that for each signal source, i.e., the mesh buffer, the clock mesh can be deemed a cascaded low-pass filter. For this filter, the attenuation of a ramp input signal is proportional to the exponential of the distance. Because of this exponential attenuation, if two nodes are geometrically far apart, they have very small electrical impact on each other. This phenomenon enables us to ignore some of the circuit details that are geometrically distant from the node we are interested in. In our proposed method, the mesh is modeled with two different resolutions: we use a detailed circuit model for the mesh elements geometrically close to the nodes we are measuring and a simplified model for the mesh elements far from the nodes being measured. The simplification is with respect to the local FF connections. We give more details on the justification of SWS in Section 4.1.

The basic idea of SWS is very simple. Given an $m \times n$ mesh, we define a rectangular window $W$ of size $m' \times n'$, where $m' \le m$ and $n' \le n$. If we fix the lower left corner of $W$ to a point on the mesh, $W$ covers some fixed region of the mesh (Figure 5). The connection of a FF within $W$ to the nearest mesh segment is modeled accurately by an appropriate π model, as described in Section 3 (single-π or 3-π, depending on the length of the connection). The clock input pin of the FF is modeled as a capacitance. If there are $k$ FFs connected to a mesh segment, the mesh segment is divided into at most $(k + 1)$ sub-segments. Each sub-segment is modeled with an appropriate π model.

FFs that lie outside $W$ and their connections to the mesh are modeled approximately. The wire connecting such a FF to the mesh is replaced by an equivalent single capacitance. The wire resistance is ignored. Given a mesh node $v$ outside $W$, the region covered by $v$ is the unit rectangle shown in Figure 5. Let $C_v$ be the sum of the clock input pin capacitances of all the FFs in this region along with the capacitances of the wires connecting them to the mesh. Then, $C_v$ is lumped as a single capacitance at $v$. The mesh segments outside $W$ are still modeled with appropriate π models.

The SPICE file corresponding to this model for the window location is generated and simulated. The clock latencies at all FFs in $W$ are measured. Next, the window is slid either horizontally or vertically so as not to overlap with the previous locations. Once again, a SPICE model is created and run. Simulating the entire mesh is thus broken down into multiple window-based simulations. In fact, roughly $(m/m')(n/n')$ SPICE simulations are needed to cover the entire mesh and thus all the FFs in the design.

Figure 5: Sliding window scheme
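The window decomposition described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tool used in the paper: the `FlipFlop` record, the function names, the uniform `pitch` between mesh nodes, and the choice to count mesh and window sizes in segments per side are all assumptions made for the example. It splits the FFs into those that fall inside a window (which would receive a detailed π-model connection) and those outside it, whose pin and wire capacitances are summed at the nearest mesh node with the wire resistance ignored. The mesh segments themselves are not touched, since they keep their π models everywhere.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlipFlop:
    """Hypothetical FF record; coordinates use the same unit as `pitch`."""
    x: float
    y: float
    c_pin: float   # clock input pin capacitance
    c_wire: float  # capacitance of the wire connecting the FF to the mesh


def window_origins(rows, cols, win_rows, win_cols):
    """Lower-left corners of non-overlapping windows tiling the mesh.

    All sizes are counted in mesh segments per side (an assumption).
    """
    return [(r, c)
            for r in range(0, rows, win_rows)
            for c in range(0, cols, win_cols)]


def cell_of(ff, pitch, rows, cols):
    """Mesh cell (row, col) containing the FF, clamped to the chip edge."""
    row = min(int(ff.y // pitch), rows - 1)
    col = min(int(ff.x // pitch), cols - 1)
    return row, col


def split_flipflops(ffs, pitch, rows, cols, origin, win_rows, win_cols):
    """Return (FFs inside the window, lumped capacitance per outside node)."""
    r0, c0 = origin
    inside = []
    lumped = defaultdict(float)  # mesh node (row, col) -> capacitance
    for ff in ffs:
        row, col = cell_of(ff, pitch, rows, cols)
        if r0 <= row < r0 + win_rows and c0 <= col < c0 + win_cols:
            inside.append(ff)  # modeled in detail
        else:
            # Lump pin + wire capacitance at the nearest mesh node;
            # the resistance of the connecting wire is dropped.
            node = (round(ff.y / pitch), round(ff.x / pitch))
            lumped[node] += ff.c_pin + ff.c_wire
    return inside, dict(lumped)
```

Calling `split_flipflops` once per entry of `window_origins` yields one independent simulation model per window location, and every FF is measured in exactly one of them.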


As we will show, our scheme can complete on fine meshes and is accurate within 1% of the complete mesh simulation. Also, it is naturally suited to parallelization or grid-computing, since different SPICE simulations are completely independent of each other.

The SWS scheme is a divide-and-conquer partitioning technique. Approximating the region outside the window reduces the number of nodes in the circuit model. Approximating each FF saves either 7 nodes if the wire is longer than 100 µm or 3 nodes otherwise. In a typical design, where there are hundreds of thousands of FFs, the reduction in the size of the SPICE model can be huge. We can obtain a CPU speed-up as well, as the following example illustrates.

Example 4.1 Assume a 65x65 mesh, and a design with 100K FFs. Also assume that these FFs are uniformly distributed over the chip. Let us assume that all the wires and mesh segments are modeled with a single-π model. Let $N_g$ be the number of nodes in the golden model, which is obtained when all the local wires that connect FFs to the mesh are modeled accurately. Each mesh segment is modeled with the single-π model and has 2 nodes (Figure 3). The number of mesh segments is 64 x 64 = 4096. Hence, the number of nodes on the mesh is about 8200. Each FF contributes three nodes: one for the FF, one for the point where it is hooked to the mesh, and one internal node in the π model. Thus FFs contribute about 300K nodes. Then, $N_g$ = 308K.

By using a window $W$ of size 17x17, for a given location of $W$, let the number of nodes in the SPICE model be $N_w$. As before, the mesh segments will contribute 8K nodes to the model. However, only about 1/16 of the total FFs lie within $W$. Then, only about 7K FFs are modeled accurately. They contribute 21K nodes. The FFs outside $W$ do not contribute any additional nodes, since they are lumped at the nearest mesh node. Then, $N_w$ = 29K. Thus, we see a 10X reduction in the model size using SWS.

Let us estimate the run-time of SWS. Let us assume that the SPICE run-time is $O(N^{1.5})$ for a model with $N$ nodes. Since the number of nodes reduces by a factor of 10, each window simulation is about $10^{1.5} \approx 32$ times faster than the golden model simulation. A total of 16 simulations are required to cover the entire mesh. Thus we can expect an overall speed-up of 2 for sequential execution on a single machine and a speed-up of 32 for parallel execution (assuming 16 machines are available).
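The arithmetic of Example 4.1 can be replayed with a short Python sketch. The function names and default parameters below are assumptions chosen to mirror the example (2 nodes per mesh segment, 3 nodes per accurately modeled FF, a run-time exponent of 1.5, and the 64 x 64 segment count used in the text); they are not taken from the authors' tool. Exact arithmetic gives slightly different figures from the rounded ones in the example (6250 FFs inside the window rather than "about 7K"), so the model-size reduction comes out a little above 10X.

```python
def golden_nodes(mesh_size, num_ffs, nodes_per_segment=2, nodes_per_ff=3):
    """Node count when every FF connection is modeled accurately."""
    segments = (mesh_size - 1) ** 2  # 64 x 64 = 4096, as counted in Example 4.1
    return segments * nodes_per_segment + num_ffs * nodes_per_ff


def window_nodes(mesh_size, num_ffs, window_fraction,
                 nodes_per_segment=2, nodes_per_ff=3):
    """Node count for one window; FFs outside it add no nodes (lumped)."""
    segments = (mesh_size - 1) ** 2
    ffs_inside = num_ffs * window_fraction
    return segments * nodes_per_segment + ffs_inside * nodes_per_ff


def speedups(node_reduction, num_windows, exponent=1.5):
    """Return (parallel, sequential) speed-up over the golden simulation.

    Assumes run-time proportional to nodes**exponent and one machine per
    window in the parallel case.
    """
    per_window = node_reduction ** exponent
    return per_window, per_window / num_windows


n_golden = golden_nodes(65, 100_000)           # 308192
n_window = window_nodes(65, 100_000, 1 / 16)   # 26942.0
parallel, sequential = speedups(10, 16)        # roughly 31.6 and 1.98
```

With the rounded 10X reduction used in the example, the sketch reproduces the quoted speed-ups of about 32 (parallel) and about 2 (sequential).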

4.1 Justification of SWS

To justify the basic idea of SWS, we performed a series of simple experiments in 0.13 µm technology on the circuit of Figure 6. An appropriately-sized buffer $b$ drives a series connection of $i$ identical wire segments $s$ ($i$ is varied from 4 to 8), each of length $l$. The final node $N_i$ fans out to $p$ branches, where each branch contains a single segment $s$. The value of $p$ is varied from 1 to 7, and that of $l$ from 100 to 300 µm. Nodes $out$ and $N_1$ through $N_i$ mimic the nodes inside the window of SWS, where we wish to make accurate delay measurements. The $p$ branches mimic the local FF-mesh connections outside the window. During simulation, each $s$ is replaced by a single-π model. This is the exact model. We measure the (rise) arrival times at $out$ and various $N_j$ nodes (the waveform at the input of $b$ serves as the timing reference). These arrival times form the basis for determining the accuracy of the next two approximations, in which the branches are represented by simpler models.

Figure 6: Model for justifying SWS

In the first approximation, we set the resistance of each of the branch wires at $N_i$ to zero. This makes all the branch wires purely capacitive and effectively lumps all branch capacitances at $N_i$. Let us call this approximation $A_1$. $A_1$ mimics the basic idea of SWS, where the local connections to FFs are replaced by appropriate capacitances at the nearest mesh nodes; the resistance components of local connections are set to zero. We measure the arrival times at various nodes and compare them with the corresponding values from the exact model.

In the second approximation, we remove all the branches. This mimics the extreme choice that the region outside the window is completely eliminated and not included in the model. Let us call this approximation $A_2$.

We simulated the exact model and the models obtained from $A_1$ and $A_2$ for different values of $i$, $p$, and $l$. Figure 7 shows the percentage delay error at various measurement nodes for $A_1$ and $A_2$. In all the experiments, delay errors for the approximation $A_1$ were found to be less than 0.5%. However, the errors for $A_2$ were as high as 65%. The error is higher if $l$ is longer (b vs. a), $i$ is smaller (c vs. b), or $p$ is larger (a vs. d). Also, the error increases sharply as the measured node shifts to the right and gets closer to the branch region being approximated. This confirms the basic premise of the window scheme: ignoring the resistance of the local FF wires outside the window introduces insignificant delay error, but completely ignoring the region outside the window can result in large errors.

Figure 7: Experimental data justifying SWS. Approximation $A_1$ mimics SWS; $A_2$ does not include a model of the circuit outside the region of interest.

Note: It is not necessary to consider segment lengths longer than 300 µm. If $l$ = 300 µm, the wire between $out$ and $N_i$ has at least 4 copies of $s$ and hence is at least 1.2mm long. A repeater is inserted on the clock after this distance to restore the signal rise and fall times. As for each of the $p$ branches to FFs, skew considerations dictate that these connections not be longer than 300 µm.
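A first-order intuition for this result can be obtained from the Elmore delay of an RC tree, sketched below in Python. This is a simplification and not the experiment reported above, which uses SPICE with RLC π models: the Elmore model, the node naming, the way each segment's capacitance is split between its two ends, and all numeric values are assumptions made for illustration. In the Elmore model the delay at a chain node depends only on the resistance it shares with each capacitor, so zeroing the branch resistances (the first approximation) leaves the delay at the chain nodes unchanged, whereas deleting the branches (the second approximation) removes load and underestimates the delay.

```python
def elmore_delays(nodes):
    """Elmore delay at every node of an RC tree.

    nodes: list of (name, parent, r_edge, c_node) with parents listed
    before their children; the root has parent None and r_edge equal to
    the driver resistance.
    """
    down = {name: c for name, _, _, c in nodes}
    for name, parent, _, _ in reversed(nodes):
        if parent is not None:
            down[parent] += down[name]  # accumulate downstream capacitance
    delay = {}
    for name, parent, r_edge, _ in nodes:
        upstream = delay[parent] if parent is not None else 0.0
        delay[name] = upstream + r_edge * down[name]
    return delay


def build_chain(i, p, r, c, r_drv, c_load, mode):
    """Chain of i segments ending in p branches, in the spirit of Figure 6.

    mode: "exact"   keeps branch resistance,
          "lumped"  moves all branch capacitance to the last chain node,
          "removed" drops the branches entirely.
    """
    nodes = [("out", None, r_drv, c / 2)]
    prev = "out"
    for k in range(1, i + 1):
        c_node = c if k < i else c / 2  # half of each adjacent segment
        nodes.append((f"N{k}", prev, r, c_node))
        prev = f"N{k}"
    name, parent, r_edge, c_node = nodes[-1]
    if mode == "exact":
        nodes[-1] = (name, parent, r_edge, c_node + p * c / 2)
        for b in range(p):
            nodes.append((f"B{b}", prev, r, c / 2 + c_load))
    elif mode == "lumped":
        nodes[-1] = (name, parent, r_edge, c_node + p * (c + c_load))
    elif mode != "removed":
        raise ValueError(f"unknown mode: {mode}")
    return nodes


# Hypothetical element values: 100 ohm and 20 fF per segment.
args = dict(i=4, p=3, r=100.0, c=20e-15, r_drv=50.0, c_load=5e-15)
exact = elmore_delays(build_chain(mode="exact", **args))
lumped = elmore_delays(build_chain(mode="lumped", **args))
removed = elmore_delays(build_chain(mode="removed", **args))
```

Comparing `exact["N4"]`, `lumped["N4"]` and `removed["N4"]` shows the same qualitative trend as the SPICE data: lumping is harmless at the measured nodes, while removal is not. The small non-zero error reported for the first approximation comes from higher-order effects that the Elmore model does not capture.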

4.2 Accuracy of SWS

The circuit set-up in the last section included a single buffer driving a simple tree of wire segments. It did not take into account loops and the multi-driver nature of the mesh. In this section, we model the actual mesh architecture to check the accuracy of SWS. The mesh size was set to 10x10, and the number of FFs to 10,000. The FFs are placed randomly with a uniform distribution on a 10mm x 10mm chip. Mesh buffers are assumed to be present on every other mesh node and are sized according to the load in the region around their respective mesh nodes. The flat simulation for the entire mesh could finish, yielding the golden clock latency value $l_i$ for each FF $i$. For SWS, we vary the window size from 2x2 to 10x10. For each window size $w \times w$, let $l_i^{w}$ be the latency for FF $i$ in the SWS. We compute $E_w$, the maximum percentage error in latency over all FFs, as follows:

$$E_w = \max_{i} \frac{|l_i^{w} - l_i|}{l_i} \times 100 \qquad (2)$$

We plot the maximum error, $E_w$, for each window size in Figure 9 as bars labeled without border. The size 10x10 means that the whole mesh is simulated in its entirety by including accurate models for the connections to all FFs. In other words, this simulation yields the golden latencies. It can be seen that $E_w$ ranges from 26% to 32%. Such large error values are unacceptable, given the stringent accuracy requirements for the clock signal.

Figure 9: Maximum error without and with border for 10mm x 10mm chip, 10x10 mesh, 10K FFs and a buffer on every other mesh node

4.3 Improving the Accuracy of SWS

Analysis of flip-flops with large latency errors revealed that such FFs were almost always close to the window boundary. SWS accuracy improves drastically when the latencies of FFs on the border are ignored. So we enhanced the basic SWS by expanding the original window $W$ to $W'$ by including a border around $W$, as shown in Figure 8. The window $W'$ is modeled accurately. However, the clock latencies are measured only for the FFs in the original window $W$. The latencies for FFs in $W' - W$ will be measured when they fall within the core of another appropriate window.

Figure 8: Window and its border

We reran the above experiment with the border-enhanced SWS. The border was chosen as one mesh segment along each of the four boundaries of the window. As shown in Figure 9 (bars labeled with border), the maximum error with the border is less than 4.5% over all window sizes. Interestingly, the maximum error is almost constant as the window size changes. Other criteria can therefore be used to select an optimal window size, as described in Section 5.3.

As another set of results, in Figure 10, we show the results for the same set-up as before, except that the mesh buffers are now present on every mesh point. Without border, the maximum error ranges from 12% to 17% for different window sizes. With the border, it is always less than 1%. Intuitively, there are large errors in the region outside $W'$. The border provides a buffer zone between the outside region and $W$, through which the latency inaccuracy reduces.

Figure 10: Maximum error without and with border for 10mm x 10mm chip, 10x10 mesh, 10K FFs and a buffer on every node

5 Experimental Results

We have developed a clock mesh analysis tool which reads in a chip specification (e.g., chip dimensions), FF locations, technology information, mesh buffer sizes and locations, and mesh parameters (such as mesh size, wire widths). It then uses SPICE transient simulation to compute clock latencies for the FFs with respect to the clock source (which is connected to the inputs of all the mesh buffers). The computation is based on the proposed sliding window scheme. Currently, it requires the window size as input. For each window location, the tool generates the SPICE model for the mesh, local wires and flip-flops within and outside the window. Unix shell scripts were written to manage the sliding windows generation, simulation, and extraction of clock latencies from the simulation output.

We designed our experiments to achieve three goals:

1. to show that our proposed scheme is accurate for measuring clock latencies. For this, we need to use mesh and circuit instances that do not exceed the capacity of SPICE and can be completed in one single simulation.
2. to show examples where the (flat) simulation for the whole mesh could not finish, but SWS could. Also, to study the CPU time and memory trade-offs as a function of the window size.
3. to come up with a method for determining the best window size. The best window size is the one that generates a SPICE model within the machine's memory capacity, minimizes the error, and either a) minimizes the overall simulation time, or b) minimizes the turn-around time for parallel simulation.

All the experiments were conducted in an industrial 0.13 µm technology. A typical run of a circuit simulator (e.g., HSPICE) session to simulate one SPICE file is called a simulation. A win-experiment is defined as the collection of simulations resulting from sliding the window across the mesh so as to cover all the FFs, with a given chip size, mesh size, FF count, window size and mesh buffer locations. The set of win-experiments obtained when the window size is varied from one to the maximum possible value is called an experiment. We carried out numerous experiments with different values of chip size, mesh size, and FF count. Two chip sizes were used: 5mm x 5mm and 10mm x 10mm. Three different mesh sizes were used: 10x10, 18x18, and 26x26. FF counts of 1K and 10K (1K = 1000) were used. FFs were placed randomly with a uniform distribution. Buffer steps used were 0 and 1; 0 means every mesh node has a mesh buffer and 1 means every other mesh node has a buffer.^3 Due to lack of space, we only present results for a selected window size for each experiment. This window size was picked to minimize an average of percentage maximum error, percentage average error and CPU time.

^3 These uniform buffer distributions were used to quickly generate several different simulation models. They do not imply any mesh buffer synthesis methodology.

The following labeling scheme is used for an experiment. For instance, the experiment with the chip size of 10mm x 10mm, FF count of 10K, mesh size of 10x10, and a buffer attached to each mesh node, is labeled c10 f10 m10 0. The accuracy results for this experiment were shown in Figure 10 for all window sizes.

5.1 Accuracy

Figures 9 and 10 have already shown SWS accuracy results for experiments c10 f10 m10 1 and c10 f10 m10 0 respectively over all window sizes. Figure 11 shows results for different experiments, but only for the selected window size (as described above) when running SWS without including the border. It can be seen that the maximum error in most of the cases is below 2.5%. However, for c10 f10 m10 1, the maximum error is 27.5%.

Figure 11: Accuracy of SWS for different experimental settings

We reran all the experiments using border-enhanced SWS, with a border of 1. For all the experiments except two, the maximum error is now below 1%. The other two experiments had maximum errors of 2.9% and 1.8%. Although not reported in the figures, for any given FF, the difference between the golden latency and the SWS-computed latency was almost always less than 1ps. The 2.9% error case was an exception, where a delay difference of 7ps was seen: the golden latency for a FF was 233ps, but the SWS computed a latency of 226ps. These results show that the border-enhanced sliding window scheme is extremely accurate for computing clock latencies.

5.2 Large (Mesh + Design) Instances

Another experiment was run using a 65x65 mesh with 100K FFs placed randomly with a uniform distribution. The simulations were run on a 1.1GHz Sparc64-V machine with 4GB main memory. The goal of this experiment was to demonstrate that there are cases where the flat golden simulation cannot complete, but SWS can. For SWS, window sizes equal to 8x8, 16x16 and 32x32 were evaluated. We used HSPICE from Synopsys for this experiment. It was found that HSPICE could not complete the flat simulation for the whole 65x65 mesh. HSPICE used up more than 2GB of memory and aborted after 36 hours. Note that 2GB is the address space limit of the HSPICE binary we used.

In Figure 12, we plot the total execution time for the win-experiment for each window size. Note that 64, 16, and 4 simulations are required for the window sizes 8x8, 16x16 and 32x32 respectively. We also show the maximum execution time for a single simulation (i.e., for a fixed location of the window). If different simulations for a given mesh could be done in parallel on different machines, the execution time would be determined by the maximum CPU time for simulating a single window location. The maximum memory required by the simulator is shown in Figure 13. The amount of memory shown for the flat simulation (i.e., window size 64) is a lower bound, since the simulation did not finish.

Figure 12: CPU time as a function of the window size. Total CPU time is relevant for sequential execution on a single machine. Max single CPU time is the turn-around time, assuming maximum parallel processing.

Figure 13: Memory usage as a function of window size

We note from these plots that as the window size increases, the amount of memory needed for the simulation increases and so does the simulation time for a fixed window location. This is expected, since a larger window covers more FFs whose connections to the mesh are modeled accurately in SWS. Hence the size of the SPICE model and the execution time grow. However, the total time over all simulations tends to be large for small window sizes and goes down as the window size increases to half the mesh size. This is because the number of simulations for the whole mesh also reduces with increasing window size.

We also tested the sliding window scheme on a real design, which had almost 300K FFs. The cell and FF placement had already been done using a commercial placement tool. We used two different mesh sizes: 65x65 and 129x129. The execution time is reported in Table 1 for two cases: sequential execution on one machine and parallel execution assuming 4 machines.

Table 1: Results on a real design with about 300K FFs. Parallel execution assumes 4 processors.

Mesh Size    Execution Time (Sequential)    Execution Time (Parallel)
65x65        6h 48min                       1h 46min
129x129      20h 22min                      5h 18min

The table shows that SWS can handle real designs overlaid with fine-grain clock meshes. Parallelization, even with only 4 machines, can make the turnaround time practical.
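The simulation counts and turn-around times quoted in this section can be cross-checked with a small Python sketch. The helper names are hypothetical, and the counting rule is an assumption inferred from the figures above: a 65x65 mesh is taken to have 64 segments per side, and a window size of 8x8, 16x16 or 32x32 is taken to be measured in segments. The ideal parallel time simply divides the sequential time evenly across machines, which ignores load imbalance between window locations.

```python
import math


def num_window_simulations(mesh_size, window_size):
    """Number of non-overlapping window locations needed to cover the mesh.

    Assumes mesh_size counts nodes per side and window_size counts
    segments per side.
    """
    per_side = math.ceil((mesh_size - 1) / window_size)
    return per_side * per_side


def ideal_parallel_minutes(sequential_minutes, machines):
    """Lower bound on turn-around time with perfectly balanced machines."""
    return sequential_minutes / machines


counts = {w: num_window_simulations(65, w) for w in (8, 16, 32)}
ideal_65 = ideal_parallel_minutes(6 * 60 + 48, 4)    # 102.0 minutes
ideal_129 = ideal_parallel_minutes(20 * 60 + 22, 4)  # 305.5 minutes
```

Under these assumptions the counts match the 64, 16 and 4 simulations stated above, and the parallel times reported in Table 1 (106 and 318 minutes) lie within a few percent of the ideal four-way split.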

5.3 Optimal Window Size

The main advantage of SWS is the ability to accurately simulate large meshes and designs that cannot be completed with a flat simulation. So the first requirement on the optimal window size is that the resulting model fits in the main memory. Next, we would like to complete the simulations quickly. As for accuracy, we note from Figures 9 and 10 (with border) that the SWS accuracy does not depend on the window size, and hence is not a determinant of window size. It is clear from Section 5.2 that a smaller window simulates much faster than a bigger one. So for parallel simulation, it is better to pick small window sizes. A smaller window also yields a smaller simulation model. On the other hand, larger windows tend to have a smaller total simulation time. So they are preferable for sequential simulation, as long as the SWS model fits in the machine memory.
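The selection rule described above can be written down as a short Python sketch. The `WindowProfile` record, the `pick_window` function and every number in the example list are hypothetical and serve only to illustrate the rule; they are not measurements from this paper. The sketch first discards window sizes whose model does not fit in memory, then minimizes the largest single-window time for parallel runs or the total time for sequential runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowProfile:
    """Hypothetical per-window-size measurements."""
    window_size: int
    peak_memory_gb: float
    total_hours: float       # sum over all window locations
    max_single_hours: float  # slowest single window location


def pick_window(profiles, memory_limit_gb, parallel):
    """Choose a window size that fits in memory and minimizes run-time."""
    feasible = [p for p in profiles if p.peak_memory_gb <= memory_limit_gb]
    if not feasible:
        raise ValueError("no window size fits in the available memory")
    if parallel:
        return min(feasible, key=lambda p: p.max_single_hours)
    return min(feasible, key=lambda p: p.total_hours)


# Illustrative numbers only.
profiles = [
    WindowProfile(8, 0.3, 5.0, 0.1),
    WindowProfile(16, 0.6, 3.0, 0.3),
    WindowProfile(32, 1.2, 1.5, 0.5),
    WindowProfile(64, 2.5, 1.0, 1.0),
]
best_sequential = pick_window(profiles, memory_limit_gb=2.0, parallel=False)
best_parallel = pick_window(profiles, memory_limit_gb=2.0, parallel=True)
```

With these illustrative numbers the sequential choice is the largest window that still fits in memory, and the parallel choice is the smallest window, which matches the qualitative guidance given in this section.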

6 Conclusions

Analyzing the clock mesh of a large industrial design has been a difficult problem. In this paper, we presented a new sliding window scheme to analyze the latency in clock meshes. We showed that HSPICE could not finish on a 65x65 mesh with 100K FFs. It needed more than 2GB of memory. Our technique could complete in less than 1.5 hours within 1GB memory. The border-enhanced SWS, when applied to smaller instances of mesh and FFs, almost always comes within 1% of the delay computed from the SPICE simulation of the complete mesh. We also proposed strategies for selecting the optimal window size, which take into account the total machine memory and the degree of parallelization. We applied our scheme successfully on a large, real industrial design. Our technique extends the capability of SPICE in handling large (RLC) clock meshes and designs. Finally, our scheme is naturally suited for parallelization and achieves a turn-around time of less than 2 hours on a design with almost 300K FFs and with a 65x65 mesh. Thus, our scheme effectively solves the problems associated with traditional clock mesh analysis. This should enable widespread use of clock mesh architectures in ASIC and processor designs.

Our technique can be further improved to reduce the memory requirement and run-time as follows.

1. We can reduce the complexity of the model by simplifying the region of the mesh outside the window. For instance, each mesh segment outside the window could always be modeled as a single-π, instead of the current threshold-length-triggered switch between single-π and 3-π models.
2. We need to explore the applicability of linear [11, 9, 6, 7] and non-linear model order reduction techniques in the context of the mesh. These techniques could potentially reduce the model size and help speed up the clock mesh analysis.
3. With technology scaling, the variations in voltage, temperature, and crosstalk noise lead to clock jitter. We plan to work on jitter measurement in clock meshes.

References

[1] D. W. Bailey and B. J. Benschneider. Clocking Design and Analysis for a 600-MHz Alpha Microprocessor. In IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, pages 1627-1633, November 1998.
[2] E. Chiprout. Fast Flip-chip Power Grid Analysis Via Locality and Grid Shells. In ICCAD, pages 485-488, 2004.
[3] M. P. Desai, R. Cvijetic, and J. Jensen. Sizing of Clock Distribution Networks for High Performance CPU Chips. In DAC, June 1996.
[4] D. W. Dobberpuhl et al. A 200-MHz 64-b Dual-Issue CMOS Microprocessor. In IEEE Journal of Solid-State Circuits, Vol. 27, No. 11, pages 1555-1567, November 1992.
[5] P. J. Camporese et al. X-Y Grid Tree Tuning Method. U.S. Patent No. 6,205,571 B1, March 2001.
[6] P. Feldmann and R. W. Freund. Efficient Linear Circuit Analysis by Pade Approximation via the Lanczos Process. In IEEE Transactions on CAD, pages 639-649, May 1995.
[7] R. W. Freund. SPRIM: Structure-Preserving Reduced-Order Interconnect Macromodeling. In ICCAD, pages 80-87, November 2004.
[8] G. Northrop et al. A 600-MHz G5 S/390 Microprocessor. In ISSCC Tech. Dig., pages 88-89, February 1999.
[9] A. Odabasioglu, M. Celik, and L. T. Pileggi. PRIMA: Passive Reduced-order Interconnect Macromodeling Algorithm. In IEEE Transactions on CAD, pages 645-654, August 1998.
[10] P. J. Restle et al. The Clock Distribution of the Power4 Microprocessor. In ISSCC Dig. Tech. Papers, pages 144-145, February 2002.
[11] L. T. Pillage and R. A. Rohrer. Asymptotic Waveform Evaluation for Timing Analysis. In IEEE Transactions on Computer-Aided Design, pages 352-366, April 1990.
[12] P. J. Restle et al. A Clock Distribution Network for Microprocessor. In IEEE Journal of Solid-State Circuits, Vol. 36, No. 5, May 2001.
[13] R. Heald et al. Implementation of a 3rd-Generation SPARC V9 64b Microprocessor. In ISSCC Dig. Tech. Papers, pages 412-413, February 2000.
[14] L. Vandenberghe, S. Boyd, and A. E. Gamal. Optimal Wire and Transistor Sizing for Circuits with Non-tree Topology. In ICCAD, pages 252-259, 1997.
[15] Gustavo Wilke and Rajeev Murgai. Accuracy of Interconnect Pi Models. Fujitsu Laboratories of America Internal Document, August 2004.

