
Copyright by K'Andrea Catherine Bickerstaff 2007
The Dissertation Committee for K'Andrea Catherine Bickerstaff certifies that this is the approved version of the following dissertation:
Optimization of Column Compression Multipliers
Committee:
Earl E. Swartzlander, Jr., Supervisor
Jacob A. Abraham
Anthony P. Ambler
Harvey G. Cragon
Donald S. Fussell
Eric J. Swanson
by
K'Andrea Catherine Bickerstaff, B.S.; M.S.
Dissertation
Presented to the Faculty of the Graduate School of the University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
The University of Texas at Austin August 2007
Dedication
Dedicated to my grandmothers, Louise Andrews and Mattiel Bickerstaff, for their appreciation of education and love for me.
Acknowledgements
I am very grateful to my graduate supervisor, Dr. Earl E. Swartzlander, Jr., for his encouragement, support, and patience. His guidance in matters both academic and professional has been invaluable. Many thanks to my fellow research group members, especially Dr. Michael J. Schulte, Dr. Edwin De Angel, Dr. William Lynn Gallagher, and Dr. Jason Arbaugh. It is an honor and a pleasure to work with each of you; I cherish our friendship.
I thank my management and colleagues at Crystal Semiconductor, Cirrus Logic, and Luminary Micro. I appreciate the excellent training and job opportunities from my managers, Greg North and Jeff Klaas, and mentors, Eric Swanson and Dr. Matt Perry.
I thank my mother, Doris, and my brother, Kenneth, for their enduring love and support. At the lowest points, my brother's "I'm proud of you!" helped me keep going. I also extend very special thanks to my many friends for staying beside me during this long journey: Liz Wright, Yolanda Torres, Judy Ko, Charles Robinson, Scott Haban, Montine Heim, and Annola Bailey. Their phone calls, hugs, and laughter are precious gifts to me.
Publication No. __________________
K'Andrea Catherine Bickerstaff, Ph.D.
The University of Texas at Austin, 2007
Supervisor: Earl E. Swartzlander, Jr.
With delay proportional to the logarithm of the multiplier word length, column compression multipliers are the fastest multipliers. Unfortunately, since the design community has assumed that fast multiplication can only be realized through custom design and layout, column compression multipliers are often dismissed as too time consuming and complex because of their irregular structure. This research demonstrates that an automated multiplier generation and layout process makes the column compression multiplier a viable option for application specific CMOS products. Techniques for optimal multiplier designs are identified through analysis of area, delay, and power characteristics of Wallace, Dadda, and Reduced Area multipliers.
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
Chapter 2: Past Work
  2.1 Array Multipliers
  2.2 Column Compression Multipliers
    2.2.1 Counters and Compressors
    2.2.2 Reduction Schemes
      2.2.2.1 Using Higher-Order Counters and Compressors
    2.2.3 The Final Carry Propagate Adder
    2.2.4 Layout Approaches
Chapter 3: Automated Multiplier Netlist Generation
  3.1 Basic Multiplier Design
    3.1.1 Signal Buffering
    3.1.2 Partial Product Reduction
    3.1.3 Carry Lookahead Adder
  3.2 Process Technologies
  3.3 Cell Libraries
  3.4 M x N Multiplier Generator
Chapter 4: Automated Multiplier Implementation and Verification
  4.1 Design Flow
  4.2 Formal Verification
  4.3 Layout Floorplanning
  4.4 Timing Driven Placement and Route
  4.5 RC Extraction
  4.6 Static Timing Analysis
  4.7 Power Simulation
Chapter 5: Multiplier Area
  5.1 Area Estimation
  5.2 Area Measurements
  5.3 Area of Multipliers in the 250 nm Process Technology
  5.4 Area of Multipliers in the 180 nm Process Technology
  5.5 Area of Multipliers in the 130 nm Process Technology
  5.6 Area of Multipliers in the 90 nm Process Technology
  5.7 Area Comparisons
  5.8 Area Summary
Chapter 6: Multiplier Delay
  6.1 Delay Estimation
  6.2 Delay Analysis
  6.3 Delay for Multipliers in the 250 nm Process Technology
  6.4 Delay for Multipliers in the 180 nm Process Technology
  6.5 Delay for Multipliers in the 130 nm Process Technology
  6.6 Delay for Multipliers in the 90 nm Process Technology
  6.7 Delay Comparisons
  6.8 Delay Summary
Chapter 7: Multiplier Power Consumption
  7.1 Power Estimation
  7.2 Power Simulations
  7.3 Power for Multipliers in the 250 nm Process Technology
  7.4 Power for Multipliers in the 180 nm Process Technology
  7.5 Power for Multipliers in the 130 nm Process Technology
  7.6 Power Comparisons
  7.7 Power Summary
Chapter 8: Conclusions
Bibliography
Vita
List of Figures
2.1 A square version of a 4 by 4 array multiplier (after [23])
2.2 Two's complement by modified Baugh-Wooley method
2.3 Steps for N by N unsigned parallel multiplication
2.4 Dot Diagram for a 12 by 12 Wallace Multiplier
2.5 Dot Diagram for a 12 by 12 Dadda Multiplier
2.6 Dot Diagram for a 12 by 12 Reduced Area Multiplier
2.7 (7,3) Counter design using (3,2) counters (after [30])
2.8 (15,4) Counter design using (3,2) counters (after [30])
2.9 4:2 Compressor using (3,2) counters
3.1 Block diagram for implemented column compression multipliers
3.2 Loading for each product bit
3.3 Modified Full Adder
3.4 Diagram of 16-bit Carry Lookahead Adder
3.5 Schematic of (3,2) counter standard cell
4.1 Design and Tool Flow
4.2 Conformal Process Flow
4.3 Power/Ground abutment in layout
4.4 Diagram of pin placement for N by N multipliers
4.5 Configuration of Common Timing Engine
4.6 Configuration of UltraSim
5.1 Block diagram of an N by N unsigned column compression multiplier
5.2 Dadda multiplier areas using different process technologies and standard cell libraries
5.3 Area pie charts of Wallace multipliers
5.4 Area pie charts of Dadda multipliers
5.5 Area pie charts of Reduced Area multipliers
6.1 Back-annotated delays for N by N Dadda multipliers
6.2 Delay pie charts for back-annotated Dadda multipliers
6.3 Back-annotated Dadda multiplier delays versus estimated delays
7.1 Average power consumption for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process
7.2 Average power consumption for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process
7.3 Average power consumption for Wallace, Dadda, and Reduced Area multipliers in 130g and 130p cell libraries
List of Tables
2.1 Radix-4 Modified Booth Algorithm
2.2 Truth table for special half adder
2.3 Number of Reduction Stages for a Dadda Multiplier
5.1 Number of D flip-flops, buffers, and AND gates used in the multipliers
5.2 Components for Wallace, Dadda, and Reduced Area multipliers
5.3 Hardware for Wallace, Dadda, and Reduced Area multipliers
5.4 Complexity of the multiplier components
5.5 Layout areas for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process
5.6 Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the 250 nm process
5.7 Layout areas for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process
5.8 Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the 180 nm process
5.9 Comparison of counter and CLA areas for 8 by 8 Wallace and Dadda multipliers
5.10 Layout areas for Wallace, Dadda, and Reduced Area multipliers in the generic 130 nm cell library
5.11 Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the generic 130 nm cell library
5.12 Layout areas for Wallace, Dadda, and Reduced Area multipliers in the low power 130 nm cell library
5.13 Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the low power 130 nm cell library
5.14 Percentage that multipliers in the 130p cell library are smaller than multipliers in the 130g cell library
5.15 Layout areas for Wallace, Dadda, and Reduced Area multipliers in the 90 nm cell library
5.16 Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the 90 nm cell library
5.17 Wallace multiplier areas relative to each process's 8 by 8 case
5.18 Dadda multiplier areas relative to each process's 8 by 8 case
5.19 Reduced Area multiplier areas relative to each process's 8 by 8 case
5.20 Wallace multiplier areas relative to 90 nm
5.21 Dadda multiplier areas relative to 90 nm
5.22 Reduced Area multiplier areas relative to 90 nm
5.23 Breakdown of multiplier areas by components
5.24 Comparison of estimated areas using general quadratic approximation versus measured areas of the multipliers
5.25 Comparison of general area approximations for geometries < 180 nm versus measured areas of the multipliers
5.26 Predicted areas for column compression multipliers in a 65 nm process technology
6.1 Delay values for Wallace multipliers in the 250 nm process
6.2 Delay values for Dadda multipliers in the 250 nm process
6.3 Delay values for Reduced Area multipliers in the 250 nm process
6.4 Critical section delays for 64 by 64 multipliers in the 250 nm process
6.5 Delay values for Wallace multipliers in the 180 nm process
6.6 Delay values for Dadda multipliers in the 180 nm process
6.7 Delay values for Reduced Area multipliers in the 180 nm process
6.8 Critical section delays for 64 by 64 multipliers in the 180 nm process
6.9 Delay values for Wallace multipliers in the 130g cell library
6.10 Delay values for Dadda multipliers in the 130g cell library
6.11 Delay values for Reduced Area multipliers in the 130g cell library
6.12 Critical section delays for 64 by 64 multipliers in the 130g cell library
6.13 Delay values for Wallace multipliers in the 130p cell library
6.14 Delay values for Dadda multipliers in the 130p cell library
6.15 Delay values for Reduced Area multipliers in the 130p cell library
6.16 Delay values for Wallace multipliers in the 90 nm process
6.17 Delay values for Dadda multipliers in the 90 nm process
6.18 Delay values for Reduced Area multipliers in the 90 nm process
6.19 Back-annotated delays for Wallace, Dadda, and Reduced Area multipliers developed in generic standard cell libraries
6.20 Back-annotated delays for Wallace, Dadda, and Reduced Area multipliers developed in 130g and 130p cell libraries
6.21 Wallace multipliers with back-annotated delays relative to each process's 8 by 8 case
6.22 Dadda multipliers with back-annotated delays relative to each process's 8 by 8 case
6.23 Reduced Area multipliers with back-annotated delays relative to each process's 8 by 8 case
6.24 Back-annotated Wallace multiplier delays relative to 90 nm
6.25 Back-annotated Dadda multiplier delays relative to 90 nm
6.26 Back-annotated Reduced Area multiplier delays relative to 90 nm
6.27 Predicted delays for column compression multipliers in a 65 nm process technology
7.1 Average power for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process
7.2 Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process
7.3 Average power for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process
7.4 Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process
7.5 Average power for Wallace, Dadda, and Reduced Area multipliers in the 130g cell library
7.6 Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 130g cell library
7.7 Average power for Wallace, Dadda, and Reduced Area multipliers in the 130p cell library
7.8 Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 130p cell library
7.9 Comparison of average power of a multiplier in the 130g cell library to the respective multiplier in the 130p cell library
7.10 Comparison of average power consumption for Wallace, Dadda, and Reduced Area multipliers
7.11 Power/Area for Wallace multipliers
7.12 Power/Area for Dadda multipliers
7.13 Power/Area for Reduced Area multipliers
7.14 Estimated average power for column compression multipliers in a 90 nm process technology
7.15 Estimated average power for column compression multipliers in a 65 nm process technology
Chapter 1 Introduction
High-speed multiplication has always been a fundamental requirement of high-performance processors and systems. In digital signal processing (DSP) applications, multiplication is one of the most utilized arithmetic operations, as part of filters, convolvers, and transform processors. Improving multiplier design directly benefits the high-performance embedded processors used in consumer and industrial electronic products.

In the past five decades, engineering ingenuity has moved multiplication away from the slow add-and-shift techniques [1] to faster, parallel multiplication schemes. Two classes of parallel multipliers exist: array multipliers and column compression multipliers. Array multipliers [2-5] are used most frequently in today's designs due to their lower design and layout complexities relative to column compression architectures. Array multipliers are easily pipelined [6]. Most recently, research on array multipliers has focused on power evaluation [9, 10] and power reduction [11-13].

Though not implemented as frequently as array multipliers, column compression multipliers continue to be studied due to their high speed performance. With total delays that are proportional to the logarithm of the operand word length, column compression multipliers are faster than array multipliers, whose delay grows linearly with operand word length. When column compression multipliers were first introduced by Wallace [14], and later refined by Dadda [15], interconnect delays and pipelining were not critical design issues. With the advent of VLSI, this type of multiplier often was difficult to design and exhibited high interconnect overhead. However, advances in computer-aided design and VLSI technology have helped alleviate these problems. In the literature, reports of fast CMOS implementations, alternative design schemes, and strategies for the pipelining [16-18] of column compression multipliers have begun to appear with increasing frequency.

Designs of column compression multipliers have mostly been recommended based on improved speed performance. The issues of power consumption, interconnect, and layout have not received as much attention, although Callaway's papers [9] suggest that column compression multipliers are much more power efficient than array multipliers. In particular, it is often unclear whether proposed strategies for improving delay will result in more irregular interconnect or more difficult layout. For the column compression multiplier class to emerge as a viable solution to the demand for high-speed multiplication, the primary characteristics of delay, area, and power, as they relate to interconnect and layout, need to be better understood.

This research identifies techniques for optimal computer-aided designs of column compression multipliers by analyzing delay, area, and power characteristics, with particular emphasis on interconnect and layout. Practical, realizable multiplier architectures have been investigated, using industry standard design and layout tools.

Chapter 2 provides an overview of past work performed in the area of parallel multipliers, with special attention given to research affecting column compression multipliers.

Chapter 3 outlines the gate-level implementations of the column compression multipliers. The design details of the Wallace, Dadda, and Reduced Area multipliers created for analysis are reported. To facilitate the creation of the three types of multipliers, an M by N multiplier generator has been written in Perl. An overview of the M by N multiplier generator is given.

Chapter 4 describes the implementation process used to design multipliers. The tools and scripts for functional verification, layout, parasitic extraction, timing analysis, and power estimation are detailed.

Chapter 5 discusses the layout areas of column compression multipliers implemented in this research. For different CMOS process technologies, standard cell libraries, and word sizes, sixty multipliers were placed and routed using industry standard tools. The layouts confirm predicted trends in size and layout complexity.

Chapter 6 examines multiplier worst case delay values. The delay values showcase the fast performance of column compression multipliers developed through an automated tool flow instead of fully custom design.

Chapter 7 presents power analysis for the column compression multipliers. Key to understanding the multipliers' power characteristics is to examine power consumption as word sizes, standard cell libraries, and process technologies change.

Chapter 8 summarizes the findings of this research. Based on understanding gained from analyzing area, delay, and power characteristics, recommendations are given for producing optimal column compression multipliers.
Chapter 2 Past Work

In the first large-scale digital systems, multiplication was performed as a series of additions and shifts [1]. The requisite hardware consisted only of a parallel adder and a few registers. In the early 1950s, multiplier performance was significantly improved with the introduction of Booth's method [7], the modified Booth multiplier [19], and the development of faster adders [20-22] and memory components. Booth's method and the modified Booth method do not require a correction of the product when either (or both) of the operands is negative for two's complement numbers. During the 1950s, adder designs moved away from the slow sequential formation of carries executed by ripple carry adders. Carry lookahead, carry select, and conditional sum adders yielded speedy sums through the faster simultaneous or parallel generation of carries.

In the 1960s two classes of parallel multipliers were defined. The first class [2-4] of parallel multiplier uses a rectangular array of identical combinatorial cells to generate and sum the partial product bits. Multipliers of this class are called iterative array multipliers or, more simply, array multipliers. They have a delay that is generally proportional to the word length of the multiplier input. Due to the regularity of their structures, array multipliers are easy to lay out and have been implemented frequently. The second class of parallel multiplier reduces a matrix of partial product bits to two words through the strategic application of counters or compressors. These two words are then summed using a fast carry-propagate adder to generate the product. This class of parallel multiplier is sometimes termed a column compression multiplier. Since the delay is proportional to the logarithm of the multiplier word length, these are also the fastest multipliers.

2.1 Array Multipliers
In array multipliers, the two basic functions of partial product generation and summation are combined. For unsigned N by N multiplication, $N^2 + N - 1$ cells are connected to produce a multiplier: $N^2$ cells each contain an AND gate for partial product generation and a full adder for summing, and $N - 1$ cells each contain only a full adder. The array generates N lower product bits directly and uses a carry-propagate adder, in this case a ripple carry adder, to form the upper N bits of the product. Replacing full adders with half adders where possible reduces the complexity to $N^2$ AND gates, $N$ half adders, and $N(N-2)$ full adders, as shown in Figure 2.1. This 4 by 4 multiplier is shown as a square array with modifications to the first two rows. Since the carry-in bits and the previous partial product bits are zero for the first row and the left column, only the AND gates are needed. With only two switching inputs, the second row employs half adders instead of full adders. The worst case delay is $(2N - 2)\,\Delta_c$, where $\Delta_c$ is the worst case adder delay.
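The component counts and worst case delay quoted above for the half-adder-optimized array can be tabulated with a few lines of Python. This is only an illustrative sketch: the function name `array_multiplier_cost` and the dictionary keys are hypothetical, and the delay is expressed as a count of worst case adder delays rather than in time units.

```python
def array_multiplier_cost(n):
    """Component counts and worst case delay for an n by n unsigned array
    multiplier in which full adders are replaced by half adders where possible.

    The delay is returned as a number of worst case adder delays.
    """
    if n < 2:
        raise ValueError("word length must be at least 2")
    return {
        "and_gates": n * n,           # one AND gate per partial product bit
        "half_adders": n,
        "full_adders": n * (n - 2),
        "adder_delays": 2 * n - 2,    # grows linearly with word length
    }


# Tabulate a few word lengths (chosen here only for illustration).
for n in (4, 8, 16, 32, 64):
    print(n, array_multiplier_cost(n))
```

For the 4 by 4 case this gives 16 AND gates, 4 half adders, 8 full adders, and 6 adder delays, which agrees with the adder cells labeled in Figure 2.1.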
[Figure 2.1 image not recovered. Labels from the figure: inputs a3 to a0 and b0 to b3; adder rows HA HA HA (b1), FA FA FA (b2), FA FA FA (b3), and FA FA HA (final row); outputs P0 to P7.]
Figure 2.1: A square version of a 4 by 4 array multiplier (after [23])

In order to design an array multiplier for two's complement operands, Booth's algorithm [7] can be employed. The implementation of a Booth's algorithm array multiplier computes the partial products by examining two multiplicand bits at a time. Except for enabling usage of two's complement operands, this Booth's algorithm array multiplier offers no performance or area advantage in comparison to the basic array multiplier.

Better delays, though, can be achieved by implementing a higher radix modified Booth algorithm. The radix-4 modified Booth multiplier described by MacSorley [19] examines three bits of the multiplicand to determine whether to add 0, 1x, -1x, 2x, or -2x of that rank of the multiplicand. The rules for the radix-4 modified Booth algorithm are listed in Table 2.1. Though the three-bit decode to five possible operations (add 2A, add A, add 0, subtract A, or subtract 2A) increases the hardware complexity slightly, the radix-4 modified Booth multiplier has only about half the delay of the Booth multiplier. It is possible to use higher radices, such as radix-8 or radix-16, but the additional complexity, due to non-power of two multiples of the multiplicand, compromises delay and area improvements.

Table 2.1: Radix-4 Modified Booth Algorithm
b_i   b_{i-1}   b_{i-2}   Operation
0     0         0         +0
0     0         1         +A
0     1         0         +A
0     1         1         +2A
1     0         0         -2A
1     0         1         -A
1     1         0         -A
1     1         1         +0
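The decode rules of Table 2.1 can be exercised in software. The sketch below is an illustration, not the hardware described in the text: it assumes that the recoded operand `b` is an n-bit two's complement number with n even, that overlapping three-bit groups are taken two bit positions apart with an implied zero below the least significant bit, and that `a` plays the role of A in the table. The names `BOOTH_RADIX4` and `booth_radix4_multiply` are hypothetical.

```python
# Table 2.1 as a lookup: (b_i, b_{i-1}, b_{i-2}) -> multiple of A to add.
BOOTH_RADIX4 = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_radix4_multiply(a, b, n):
    """Return a * b by radix-4 modified Booth recoding of the n-bit operand b."""
    if n % 2 != 0:
        raise ValueError("n must be even for this sketch")
    if not -(1 << (n - 1)) <= b < (1 << (n - 1)):
        raise ValueError("b does not fit in n bits")
    b_bits = b & ((1 << n) - 1)  # two's complement encoding of b

    def bit(k):
        # The bit below the least significant bit is taken to be zero.
        return 0 if k < 0 else (b_bits >> k) & 1

    product = 0
    for k in range(0, n, 2):  # n/2 partial products instead of n
        group = (bit(k + 1), bit(k), bit(k - 1))
        product += BOOTH_RADIX4[group] * a * (1 << k)
    return product


print(booth_radix4_multiply(-7, 5, 4))
```

The loop runs n/2 times, which reflects the halving of the number of partial products that motivates the radix-4 scheme.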
Another method for building an array multiplier that handles two's complement operands was presented by Baugh and Wooley [8, 24]. This method increases the maximum column height by two, which may lead to an additional stage of partial product reduction, thereby increasing overall delay. A modified form of the Baugh and Wooley strategy is more commonly used because it does not increase the maximum column height. The modified Baugh-Wooley method [24] is shown in Figure 2.2. This organization of partial product bits produces an easy to remember strategy for two's complement multiplication, which is to 1) invert the bits along the left edge and the bottom row, with the exception of the bottom left partial product bit, and 2) add a single one to the n+1 and 2n columns. Note that the one in the 2n column is not actually part of the final product and can be ignored. The negated partial product bits can be produced using a NAND gate instead of an AND gate, which may reduce the area slightly in CMOS. The one in the n+1 column is accommodated by using a special half adder on two partial product bits in the n+1 column. The truth table for this special half adder is given in Table 2.2. The sum is the complement of the sum of a normal half adder. The carry is formed by a OR b.

Figure 2.2: Two's complement by modified Baugh-Wooley method

Table 2.2: Truth table for special half adder
a   b   carry   sum
0   0   0       1
0   1   1       0
1   0   1       0
1   1   1       1
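The special half adder of Table 2.2 can be checked against its purpose, which is to add the two partial product bits together with the constant one in that column. The short sketch below uses a hypothetical function name, `special_half_adder`, and simply confirms that the carry and sum outputs encode a + b + 1 for every input combination.

```python
def special_half_adder(a, b):
    """Return (carry, sum) for two input bits plus a constant one."""
    carry = a | b           # carry is a OR b
    total = 1 - (a ^ b)     # complement of the normal half adder sum
    return carry, total


# Print the truth table in the column order a, b, carry, sum.
for a in (0, 1):
    for b in (0, 1):
        carry, total = special_half_adder(a, b)
        print(a, b, carry, total)
```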
Implementations of array multipliers were described by Pezaris [5] and McIver, et al. [25]. Pezaris, at Lincoln Laboratories, designed a board level 17 by 17 array multiplier for two's complement numbers. This multiplier generated the full 34 bit product in 40 nsec. A single chip array multiplier, reported by McIver, et al., implemented a 16 by 16 array multiplier with a two's complement algorithm. A revised design, the TRW MPY-16, was first sold commercially in 1976. This multiplier output its product in 160 nsec.
2.2 Column Compression Multipliers
In 1964, Wallace [14] introduced a scheme for fast multiplication based on using parallel pseudoadders. A pseudoadder is simply a (3,2) counter. Rather than generating a single sum output, a group of (3,2) counters adds together three numbers and produces two numbers whose sum equals the sum of the original three numbers. The primary advantage of the (3,2) counter is that it avoids carry propagation. Wallace proposed that the addition of partial products be performed as follows:
1) Group partial products into groups of three and input each group into individual sets of (3,2) counters.
2) Group the resulting bits from the 1st step into groups of three and input each group into sets of (3,2) counters.
3) Repeat the combining into groups of three and adding with sets of (3,2) counters until two numbers remain.
4) Add the final two numbers using a carry propagating adder to get the final product.
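Wallace's four steps can be mimicked at the word level in a few lines of Python. The sketch below is a behavioral illustration only, not a gate-level design: each partial product is held as an integer, a set of (3,2) counters is modeled by bitwise operations on three words, and rows left over when the row count is not a multiple of three are assumed to pass unchanged to the next stage. The names `carry_save_add` and `wallace_multiply` are hypothetical.

```python
def carry_save_add(x, y, z):
    """Model a set of (3,2) counters: three words in, two words out.

    The two outputs have the same total as the three inputs, and no carry
    propagates from one bit position to the next within this step.
    """
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def wallace_multiply(a, b, n):
    """Unsigned n by n multiplication; returns (product, reduction stages)."""
    # Partial products: a shifted copy of a for every set bit of b.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    stages = 0
    while len(rows) > 2:                       # steps 1 to 3
        grouped = len(rows) - len(rows) % 3
        next_rows = []
        for i in range(0, grouped, 3):
            next_rows.extend(carry_save_add(*rows[i:i + 3]))
        next_rows.extend(rows[grouped:])       # leftover rows pass through
        rows = next_rows
        stages += 1
    return sum(rows), stages                   # step 4: carry-propagate add


product, stages = wallace_multiply(0xABC, 0xDEF, 12)
print(hex(product), stages)
```

For 12 by 12 operands the row count falls from 12 to 8, 6, 4, 3, and finally 2, so the sketch reports five reduction stages before the final addition.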
Dadda [15] later refined Wallace's method by defining a counter placement strategy that required fewer counters in the partial product reduction stage at the cost of a larger carry-propagate adder. For both methods, the total delay is proportional to the logarithm of the operand word-length. Other partial product reduction methods have been proposed since the work of Wallace and Dadda. The Reduced Area [26] and the Windsor [27] methods are based on strategic utilization of (3,2) and (2,2) counters to improve area and layout, while maintaining the fast speed of the Wallace and Dadda designs. Oklobdzija, et al. [58] define an algorithm for partial product reduction based on understanding the unequal delay paths through counters and compressors. Oklobdzija's technique sorts and connects fast inputs and outputs in the critical delay paths while assigning slow inputs and outputs to signal paths that can tolerate an increase in delay. Other methods reduce the initial matrix of partial products using either compressors or higher order counters.
2.2.1 Counters and Compressors
The fast speed of column compression multipliers results from the parallel application of counters or compressors. It is important to note the differences between counters and compressors [15, 29, 30, 31]. A (q,r) counter is a combinational logic block where the number of inputs q and the number of outputs r are related by $r = 1 + \lfloor \log_2 q \rfloor$. For counters, the outputs express the count of the number of inputs that are ones; in other words, the counter determines how many inputs are active. The outputs for a counter have differing weights. A (q,r) counter with inputs from the ith column generates one bit in the ith column and one bit for each of the next r-1 columns. On the other hand, a q:r compressor consolidates q input bits in the ith column to r output bits, with one bit output in the ith column and one bit for each of the next r-1 columns. Additionally, there are L carry-in bits entering the compressor at different levels and also L carry-out bits leaving the compressor at different levels. These L carry signals enter the compressor from the i-1 column and exit to the i+1 column. The L carry-out signals are not dependent on the L carry-in signals to avoid the horizontal ripple of carries.
Since counters or compressors are critical, high-quantity components in column compression multipliers, any area or performance enhancements made to counters or compressors directly affect the multipliers. In [32] Kwon, et al., offer a fast 5:2 compressor that is used to implement a 16 by 16 multiplier-accumulator. Several researchers [33, 34, 35, 36, 37, 38, 39] have focused on developing optimal (3,2) counter designs.
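As a behavioral illustration of the (q,r) counter definition, the following Python sketch computes the output width and the weighted output bits. The function names are hypothetical, and the model describes only the logical function, not any circuit from the cited designs.

```python
def counter_output_width(q: int) -> int:
    """Number of outputs r of a (q,r) counter: r = 1 + floor(log2(q))."""
    return q.bit_length()  # equals 1 + floor(log2(q)) for q >= 1


def counter(bits: list[int]) -> list[int]:
    """Behavioral (q,r) counter: outputs the count of ones, least significant bit first.

    Output k has weight 2**k, so it belongs to column i + k when the
    inputs come from column i.
    """
    r = counter_output_width(len(bits))
    count = sum(bits)
    return [(count >> k) & 1 for k in range(r)]


print(counter_output_width(3), counter_output_width(7), counter_output_width(15))
print(counter([1, 0, 1, 1, 1, 0, 1]))  # five ones in, binary 101 out (LSB first)
```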
2.2.2 Reduction Schemes
As indicated in Figure 2.3, the multiplication of an N bit multiplicand by an N bit multiplier generates an N word by N bit matrix of partial products. The reduction of this partial product matrix to two words requires the parallel application of counters or compressors. The final two words are then summed using a carry-propagate adder to obtain the final product.
Figure 2.3: Steps for N by N unsigned parallel multiplication
The Wallace [14] and Dadda [15] reduction schemes are realized using (3,2) and (2,2) counters. During the reduction process, each (3,2) counter takes three inputs from a given column and outputs a sum bit which remains in that column and a carry bit which enters the next more significant column. Each (2,2) counter accepts two inputs from a column and produces a sum bit in the same column and a carry bit in the next column. A useful tool for illustrating partial product reduction using (3,2) and (2,2) counters is the dot diagram, developed by Dadda [15]. Each partial product bit is represented by a dot. The outputs of each (3,2) counter are depicted as two dots connected by a plain diagonal line. The outputs of each (2,2) counter are shown as two dots connected by a crossed diagonal line. For both types of counter, the dot representing the sum remains in the same column as the partial product bits that are being added. The dot representing the carry out is placed in the next column. The dot diagram for a 12 by 12 Wallace multiplier is shown in Figure 2.4. In each stage of the reduction, Wallace performs a preliminary grouping of partial product rows into sets of three. (3,2) and (2,2) counters are then employed within each three row set. In the 12 by 12 example, the counters shown in Stage 1 of the reduction are placed in four sections as determined by the preliminary grouping of partial product bits out of the
AND array into sets of three. If due to the preliminary grouping there is only one partial
product bit, then that bit is directly moved down to the next stage. The reduction of the partial product bits in Stage 1 by the counters shown in Stage 2 demonstrates that rows which are not part of a three row set are moved down into the next stage without modification. The complete partial product reduction of a 12 by 12 Wallace multiplier requires five stages (intermediate matrix heights of 8, 6, 4, 3, and 2) and uses 102 (3,2) counters and 34 (2,2) counters. To complete the multiplication, an 18 bit carry-propagate adder forms the final product by adding the final two rows of partial product bits shown in Stage 5.
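The sequence of Wallace matrix heights quoted above can be reproduced with a small Python sketch. It assumes, as the text describes, that each complete set of three rows is reduced to two rows and that leftover rows pass down unchanged. The function name is an illustrative choice.

```python
def wallace_heights(rows: int) -> list[int]:
    """Matrix heights after each Wallace stage, starting from `rows` partial product rows."""
    heights = []
    while rows > 2:
        # every full group of three rows becomes two; leftover rows move down unchanged
        rows = 2 * (rows // 3) + rows % 3
        heights.append(rows)
    return heights


print(wallace_heights(12))  # heights for the 12 by 12 example
```

For 12 rows this yields the five heights 8, 6, 4, 3, and 2 given in the text.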
Figure 2.4: Dot Diagram for a 12 by 12 Wallace Multiplier
(Counters per stage: Stage 1: (3,2): 40, (2,2): 8; Stage 2: (3,2): 20, (2,2): 6; Stage 3: (3,2): 20, (2,2): 8; Stage 4: (3,2): 11, (2,2): 5; Stage 5: (3,2): 11, (2,2): 7)
In the development of his reduction scheme using (3,2) and (2,2) counters, Dadda noted that there exists a sequence of intermediate matrix heights that minimizes the number of reduction stages. This sequence, determined by working back from the final two row matrix, limits the height of each matrix to the largest integer that is no more than 1.5 times the height of its subsequent matrix. Table 2.3 indicates the number of reduction stages based on the number of bits in the multiplier. For example, a 32 by 32 bit Dadda multiplier requires eight reduction stages with intermediate heights of 28, 19, 13, 9, 6, 4, 3, and finally 2. Although the heights of the intermediate matrices are not always the same for Wallace and Dadda multipliers, the two schemes utilize the same number of reduction stages.
Table 2.3: Number of Reduction Stages for a Dadda Multiplier

Bits in Multiplier (N)    Number of Stages
3                         1
4                         2
5 ≤ N ≤ 6                 3
7 ≤ N ≤ 9                 4
10 ≤ N ≤ 13               5
14 ≤ N ≤ 19               6
20 ≤ N ≤ 28               7
29 ≤ N ≤ 42               8
43 ≤ N ≤ 63               9
64 ≤ N ≤ 94               10
The recursive algorithm used to determine the application of counters for a Dadda multiplier is as follows:
1) Let $d_1 = 2$ and $d_{j+1} = \lfloor 1.5\, d_j \rfloor$, where $d_j$ is the matrix height for the jth stage from the end. Find the largest j such that at least one column of the original partial product matrix has more than $d_j$ bits.
2) In the jth stage from the end, apply (3,2) and (2,2) counters to obtain a reduced matrix with no more than $d_j$ bits in any column.
3) Let j = j - 1 and repeat step 2 until a matrix with a height of only two is achieved.
In Figure 2.5, the dot diagram for a 12 by 12 Dadda multiplier is shown. The first six matrix heights calculated using the recursive algorithm are 2, 3, 4, 6, 9, and 13. Since this is a 12 by 12 multiplier, the matrix height of 13 is unnecessary. The next matrix height to target is 9. Stage 1 of partial product reduction applies (3,2) and (2,2) counters only to the columns whose total height is greater than 9. In Stage 2, (3,2) and (2,2) counters are only used in columns whose total height is greater than 6. Note that when evaluating a column's height it is important to account for carries from the previous column. The 12 by 12 Dadda multiplier requires five reduction stages (matrix heights of 9, 6, 4, 3, and 2) and uses 99 (3,2) counters, 11 (2,2) counters, and a 22 bit carry-propagate adder.
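The algorithm above can be sketched in Python by tracking only column heights. This is a counting model under stated assumptions, not a netlist generator. Columns are processed from least to most significant, a (3,2) counter is used whenever a column exceeds the target by two or more, and a (2,2) counter is used when it exceeds the target by exactly one. The function names are illustrative.

```python
def dadda_heights(n: int) -> list[int]:
    """Target matrix heights, largest first, for an n by n multiplier (n >= 3)."""
    seq = [2]
    while seq[-1] * 3 // 2 < n:          # d_{j+1} = floor(1.5 * d_j)
        seq.append(seq[-1] * 3 // 2)
    return seq[::-1]


def dadda_counter_counts(n: int) -> tuple[int, int]:
    """Return the number of (3,2) and (2,2) counters used by the reduction."""
    cols = [min(i + 1, 2 * n - 1 - i) for i in range(2 * n - 1)]  # initial heights
    full = half = 0
    for target in dadda_heights(n):
        carry_in, reduced = 0, []
        for bits in cols:
            height = bits + carry_in     # include carries from the previous column
            carry_out = 0
            while height > target:
                if height - target >= 2:
                    full += 1            # (3,2) counter removes two bits from the column
                    height -= 2
                else:
                    half += 1            # (2,2) counter removes one bit from the column
                    height -= 1
                carry_out += 1
            reduced.append(height)
            carry_in = carry_out
        cols = reduced
    return full, half


print(dadda_heights(32))
print(dadda_counter_counts(12))
```

For N = 12 the model gives the same totals as the text, 99 (3,2) counters and 11 (2,2) counters, and the lengths of the height lists agree with Table 2.3.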
Figure 2.5: Dot Diagram for a 12 by 12 Dadda Multiplier
(Counters per stage: Stage 1: (3,2): 8, (2,2): 4; Stage 2: (3,2): 27, (2,2): 3; Stage 3: (3,2): 28, (2,2): 2; Stage 4: (3,2): 17, (2,2): 1; Stage 5: (3,2): 19, (2,2): 1)
Another reduction scheme, which uses (3,2) and (2,2) counters, is used for the Reduced Area (RA) multiplier [26, 40, 41]. The dot diagram for a 12 by 12 Reduced Area multiplier is shown in Figure 2.6. This multiplier requires five stages (matrix heights of 9, 6, 4, 3, and 2) and uses 104 (3,2) counters, 11 (2,2) counters, and a 17 bit carry-propagate adder. The reduction method for the Reduced Area multiplier is:
1) For each reduction stage, the number of (3,2) counters used in each column is $\lfloor k_i / 3 \rfloor$, where $k_i$ is the number of bits in column i.
2) (2,2) counters are used only (a) when required to reduce the number of bits in a column to the number specified by the Dadda sequence, or (b) to reduce the rightmost column containing exactly two bits.
Rule 1) for the Reduced Area multiplier results in the maximum reduction in the number of bits entering the next stage. In Figure 2.6, Rule 2a) directs that in the third reduction stage two (2,2) counters are used to reduce the number of bits in columns 12 and 13 to four. Rule 2b) adds one (2,2) counter to column i during reduction stage i. This has the advantage of decreasing the word length of the carry-propagate adder by an amount equal to the number of reduction stages.
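The two rules can be exercised with a column-height model in Python. This is a sketch under assumptions. Rule 2a is applied greedily from the least significant column upward, adding a (2,2) counter only when a column would otherwise exceed the Dadda target and two unused bits remain. Columns are indexed from zero, and all function names are illustrative.

```python
def dadda_targets(n: int) -> list[int]:
    """Dadda height sequence, largest first, for an n by n multiplier."""
    seq = [2]
    while seq[-1] * 3 // 2 < n:
        seq.append(seq[-1] * 3 // 2)
    return seq[::-1]


def ra_stage(heights: list[int], target: int) -> tuple[list[int], int, int]:
    """Apply one Reduced Area stage; return (new heights, (3,2) count, (2,2) count)."""
    fa = [k // 3 for k in heights]               # Rule 1: floor(k_i / 3) per column
    ha = [0] * len(heights)
    for i, k in enumerate(heights):              # Rule 2b: rightmost two-bit column
        if k == 2:
            ha[i] = 1
            break
    reduced, carry_in = [], 0
    for i, k in enumerate(heights):
        height = k - 2 * fa[i] - ha[i] + carry_in
        spare = k - 3 * fa[i] - 2 * ha[i]        # bits not yet feeding a counter
        if height > target and spare >= 2:       # Rule 2a (greedy, an assumption)
            ha[i] += 1
            height -= 1
        reduced.append(height)
        carry_in = fa[i] + ha[i]                 # each counter sends one carry left
    return reduced, sum(fa), sum(ha)


def ra_counter_counts(n: int) -> list[tuple[int, int]]:
    """Per-stage ((3,2), (2,2)) counter counts for an n by n Reduced Area multiplier."""
    heights = [min(i + 1, 2 * n - 1 - i) for i in range(2 * n - 1)]
    per_stage = []
    for target in dadda_targets(n):
        heights, full, half = ra_stage(heights, target)
        per_stage.append((full, half))
    return per_stage


stages = ra_counter_counts(12)
print(stages)
print(sum(f for f, _ in stages), sum(h for _, h in stages))
```

For the 12 by 12 case this model gives 104 (3,2) counters and 11 (2,2) counters in total, matching the counts stated in the text.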
Figure 2.6: Dot Diagram for a 12 by 12 Reduced Area Multiplier
(Counters per stage: Stage 1: (3,2): 40, (2,2): 1; Stage 2: (3,2): 27, (2,2): 1; Stage 3: (3,2): 17, (2,2): 3; Stage 4: (3,2): 12, (2,2): 1; Stage 5: (3,2): 8, (2,2): 5)
A fourth type of partial product reduction scheme has been proposed by Wang, et al. [27]. This technique, referred to in a subsequent paper as the Windsor multiplier, aims to maximize the area efficiency while reducing cross-stage interconnect. In Wang's research, area efficiency of the column compression part of the multiplier is defined as:
$$\text{area efficiency} = \frac{T}{K \cdot \max_k T(k)} \times 100\%$$
where T is the total number of (3,2) and (2,2) counters used in reduction, K is the required number of stages, and T(k) is the number of counters in stage k. High area efficiency percentages indicate close to even distributions of counters within each stage. Even or near even distributions of counters in each stage would provide more regular, similar sized, layout blocks of each stage. Routing together each similar sized stage would create a more regular, compact block of the overall partial product reduction section.
The first step in Wang's method is to determine the minimum number of (3,2) and (2,2) counters required for reduction. This total number of counters will be the same as the number of counters used by Dadda. The allocation of these counters for the Windsor multiplier attempts to distribute the same number of counters at each stage. The heuristic procedure for allocating counters for efficient layout is as follows:
1. Calculate the average number of counters for each stage, W0 = T / K.
2. If all of the stages can accommodate W0 counters, then place at most W0 counters in each stage, with all T counters allocated to the K stages. The algorithm terminates here.
3. If step 2 does not apply, and k lower stages cannot contain T counters, then fill those stages with as many counters as they can contain, and calculate the average number of counters for the remaining K - k stages, WK.
4. Check if each of the remaining K - k stages can accommodate the number of counters determined in step 3. If true, distribute at most WK counters to each of the K - k stages, under the condition that all counters have to be allocated to K stages.
5. If at least one stage of the remaining K - k stages cannot contain WK counters, go to step 3 and repeat steps 3-5 until an appropriate WK is determined.
For an 8 by 8 bit Windsor multiplier, an area efficiency of 95.5 percent was achieved in comparison to the 75 percent attained by an 8 by 8 Dadda multiplier. The performance of any of these four reduction methods (Wallace, Dadda, Reduced Area, and Windsor) can be improved by the design of faster or more area efficient (3,2) and (2,2) counters. Al-Twaijry and Flynn [42] used pass transistor (3,2) counters as well as domino (3,2) counters in their investigation of the relationship between the topology of partial product interconnections and possible circuit implementations.
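The area efficiency measure can be evaluated with a few lines of Python. As an illustration, the per-stage totals below are taken from the counter counts given with Figure 2.5 for the 12 by 12 Dadda multiplier (12, 30, 30, 18, and 20 counters). The function name is an assumption made for this sketch.

```python
def area_efficiency(counters_per_stage: list[int]) -> float:
    """Area efficiency in percent: T / (K * max T(k)) * 100."""
    total = sum(counters_per_stage)            # T
    stages = len(counters_per_stage)           # K
    return 100.0 * total / (stages * max(counters_per_stage))


dadda_12x12 = [8 + 4, 27 + 3, 28 + 2, 17 + 1, 19 + 1]   # (3,2) + (2,2) counters per stage
print(round(area_efficiency(dadda_12x12), 1))
```

A perfectly even distribution of counters gives 100 percent, which is the situation the Windsor allocation tries to approach.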
2.2.2.1 Using Higher-Order Counters and Compressors
Instead of using only (3,2) and (2,2) counters, multipliers can be designed using higher-order (q,r) counters. The most common realizations of higher order counters are (7,3) and (15,4) counters. Dadda [15], in preparation to use such counters for partial product reduction, defined new sets of intermediate matrix height sequences. He also offered designs of (3,2) and (7,3) counters using inverting threshold gates and resistor-transistor threshold gates. Swartzlander in [30, 43] discusses three ways to design higher order counters. The most straightforward implementations for VLSI technology are built using (3,2) counters as building blocks. Examples of (7,3) and (15,4) counters are shown in Figures 2.7 and 2.8, respectively. The delay to the most significant bit of a $(2^n - 1, n)$ counter is $2n - 3$ times the delay of a (3,2) counter.
Figure 2.7: (7,3) Counter design using (3,2) counters after [30]
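One way to build a (7,3) counter from four (3,2) counters is sketched below in Python. The exact wiring of Figure 2.7 is not recoverable from the text here, so this arrangement is an assumption. It does have a three-counter path to the most significant output, consistent with the $2n - 3$ delay for n = 3.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """(3,2) counter: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def counter_7_3(x: tuple[int, ...]) -> tuple[int, int, int]:
    """(7,3) counter from four (3,2) counters; returns output bits of weight 1, 2, 4."""
    s1, c1 = full_adder(x[0], x[1], x[2])
    s2, c2 = full_adder(x[3], x[4], x[5])
    out0, c3 = full_adder(s1, s2, x[6])      # weight-1 output
    out1, out2 = full_adder(c1, c2, c3)      # weight-2 and weight-4 outputs
    return out0, out1, out2


# Exhaustive check over all 128 input patterns.
for bits in product((0, 1), repeat=7):
    o0, o1, o2 = counter_7_3(bits)
    assert o0 + 2 * o1 + 4 * o2 == sum(bits)
```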
Figure 2.8: (15,4) Counter design using (3,2) counters after [30]
Other designs of higher order counters have been reported. Configurations [44, 45] of (7,3) and (15,4) counters have been logic synthesized under various delay and area constraints. These synthesized counters have smaller areas and faster delays than such counters built from (3,2) counters. Another approach [46] is to use a folded transistor, cross-coupled PMOS load implementation. This method, though, has three disadvantages: increased input capacitance, more intermediate node capacitance, and a long pull-down path. The reduction of partial product matrices by higher order counters has been examined in the literature. Dadda [29] examines partial product reduction using combinations of (7,3), (3,2), and (2,2) counters for board level multiplier designs. In VLSI technology, using synthesized (7,3) counters, the resultant 16 by 16 multiplier in [44] is two simple gate delays faster than Wallace or Dadda multipliers (i.e., (3,2) and (2,2) implementations), but with an approximate 10% overall gate count increase.
Compressors, especially 4:2 versions, have also been used to reduce partial product matrices. The 4:2 compressor can be easily realized from two (3,2) counters as shown in Figure 2.9. More recently, new 4:2 compressor circuits have been devised that speed up multiplication. In [47] Nagamatsu, et al. report a 15 nsec 32 by 32 bit CMOS multiplier, based on the Wallace reduction scheme, by using a specially designed 4:2 compressor cell. Ohkubo, et al. [48] achieved a 4.4 nsec, 54 by 54 bit CMOS multiplier using a 4:2 compressor designed with pass-transistor multiplexers.
Figure 2.9: 4:2 Compressor using (3,2) counters
(Diagram: inputs q3, q2, q1, and q0 feed two cascaded (3,2) counters, each with carry (c) and sum (s) outputs; one carry goes to the adjacent column, one carry arrives from the adjacent column, and the outputs are r1 and r0.)
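The two-counter 4:2 compressor can be modeled in Python as follows. Which three inputs feed the first (3,2) counter is an assumption, and the signal names `cin` and `cout` are introduced here for the carries from and to the adjacent columns. The checks confirm that the arithmetic value is preserved and that the outgoing carry does not depend on the incoming carry.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """(3,2) counter: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(q3: int, q2: int, q1: int, q0: int, cin: int) -> tuple[int, int, int]:
    """4:2 compressor from two (3,2) counters; returns (r1, r0, cout)."""
    s, cout = full_adder(q3, q2, q1)   # cout goes to the adjacent (more significant) column
    r0, r1 = full_adder(s, q0, cin)    # cin comes from the adjacent (less significant) column
    return r1, r0, cout


for q3, q2, q1, q0, cin in product((0, 1), repeat=5):
    r1, r0, cout = compressor_4_2(q3, q2, q1, q0, cin)
    # r0 stays in the column; r1 and cout both carry weight two
    assert q3 + q2 + q1 + q0 + cin == r0 + 2 * (r1 + cout)
    # cout is independent of cin, so carries do not ripple horizontally
    assert cout == compressor_4_2(q3, q2, q1, q0, 1 - cin)[2]
```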
Higher order compressors are possible, but are not frequently used. Song and De Micheli [46] examined 9:2 and 27:5 compressors, which are reported to have highly regular layout. In their analysis of 5:3, 6:3, and 7:3 compressors, Jones and Swartzlander [49] note that since compressors require connections between adjacent compressors for intermediate carries, layout may be more complex.
2.2.3 The Final Carry-Propagate Adder
The literature offers several different types of optimized carry-propagate adders, including carry lookahead [50, 51], carry-select [6], carry-skip [52, 53, 54], and modified Ling [55]. Such adder architectures have been evaluated and ranked on the basis of speed, size, and number of logic transitions [56]. More specifically, work has been done to aid in the estimation of power consumption of adders [57]. To better select and design adders for column compression multipliers, Oklobdzija [31, 58] studied the bit arrival times to the final adder. His analysis shows that for Dadda multipliers, the middle bit-pairs are the latest to arrive to the adder. Both the least significant and most significant bit-pairs are produced early. Therefore, it is possible to tailor the final carry-propagate adder to take advantage of the bit-pairs that arrive early. Oklobdzija suggests using either a ripple carry adder or a variable block adder to sum the early least significant bit-pairs, a carry lookahead adder to sum the middle region of bit-pairs, and either a carry select adder or conditional sum adder to sum the early most significant bit-pairs.
2.2.4 Layout Approaches
In the literature, the three main physical development strategies suggested for column compression multipliers are intended to facilitate custom layout. The first method tries to associate each counter or compressor in a structured grid array. This brute force method is subject to connectivity errors since there can be a very large number of identical counters with no visual patterns or clues to aid routing. The second way to lay out the reduction stages is the basis of the Windsor multiplier. By evenly allocating counters in each stage and eliminating many cross-stage interconnects, the Wang, et al. approach [27] attempts to maximize area efficiency. The third method, outlined in [59], divides the N by N partial products into two groups by an appropriate digit around the center of the initial parallelogram structure. The right-triangle halves are then rearranged to form a rectangular structure with less dead area. The partial products in each group are added in opposite directions using 4:2 compressors. In the case of a 54 by 54 bit multiplier, this layout method produced an area that is 19.6% smaller than the conventional multiplier's area.
Chapter 3 Automated Multiplier Netlist Generation

Sixty column compression multipliers were developed in order to examine their area, delay, and power characteristics. These sixty multipliers are realizations of Wallace, Dadda, and Reduced Area multipliers for four operand sizes, four process technologies, and five standard cell libraries. The daunting task of creating the design netlists of so many multipliers mandated a flexible, automated process. Therefore a multiplier generator was developed, in the scripting language Perl, to output the multipliers' netlists in gate-level Verilog or spice formats. This chapter details the actual designs of the multipliers. An overview is given of the M x N multiplier generator, genmult. The features of the process technologies and standard cell libraries are also discussed.
3.1 Basic Multiplier Design
Column compression multipliers with three different compression strategies were implemented: Wallace, Dadda, and Reduced Area. Figure 3.1 presents the basic top-level implementation for N by N unsigned multipliers. Mainly, 8 by 8, 16 by 16, 32 by 32, and 64 by 64 unsigned multipliers were developed in different process geometries. It is possible to implement two's complement multipliers by using both NAND and AND gates in the partial product array. The NAND gates can be used, as specified by Parhami [24] and outlined in Chapter 2, with little impact to layout complexity and delay.
Figure 3.1: Block diagram for implemented column compression multipliers
(Blocks, from inputs to outputs: N bit Multiplicand (A) and N bit Multiplier (B), D flip-flops, Buffers, AND Gate Array, Compression Strategy, Carry Lookahead Adder, D flip-flops + load caps.)
3.1.1 Signal Buffering
In order to approximate typical signal arrival times and drive strengths, D flip-flops are used on the primary inputs. D flip-flops drive multiple buffers to distribute input signals to $N^2$ AND gates. Delay simulations were performed for each cell library to resolve 1) the maximum number of buffers that a single D flip-flop can drive, and 2) the maximum number of AND gate inputs that a single buffer can drive.
3.1.2 Partial Product Reduction
The least significant bit of the final product is formed from $a_0 b_0$; therefore $p_0$ is available immediately from the AND gate array. The Wallace and Reduced Area reduction stages generate equal numbers of early product bits, while the Dadda reduction stages generate only the LSB, $p_0$. For Wallace and Reduced Area multipliers, the number of product bits that are produced early is equal to the number of column compression stages, S. The remaining product bits are available after the delay through the final carry-propagate adder. Figure 3.2 details the connection of a final product bit to a D flip-flop and capacitive load, which scales with process technology from 0.01 pF to 0.0025 pF.
Figure 3.2: Loading for each product bit
3.1.3 Carry Lookahead Adder
For each of the three types of multipliers implemented, a carry lookahead adder [60] is used for the final carry-propagate adder. Carry lookahead adders perform fast addition by generating the carries in parallel with the sum computations. Modified Full Adders (MFAs), shown in Figure 3.3, are used to sum each bit pair and determine if a carry has been generated or would be propagated. Carry generation means that both input bits are ONE and therefore, regardless of the carry-in, a carry-out of ONE is generated. Carry propagation means that at least one of the input bits is a ONE and that the carry-in will propagate directly to the carry-out; that is, a carry-in of ONE in this situation produces a carry-out of ONE. For an MFA, the generate and propagate signals are described by $g_k = x_k y_k$ and $p_k = x_k + y_k$.
Figure 3.3: Modified Full Adder
Based on the generate and propagate signals, lookahead logic blocks can quickly determine a series of next carries, as shown in Equations (3.1) to (3.4):
$$c_{k+1} = g_k + p_k c_k \tag{3.1}$$
$$c_{k+2} = g_{k+1} + p_{k+1} c_{k+1} = g_{k+1} + p_{k+1} g_k + p_{k+1} p_k c_k \tag{3.2}$$
$$c_{k+3} = g_{k+2} + p_{k+2} c_{k+2} = g_{k+2} + p_{k+2} g_{k+1} + p_{k+2} p_{k+1} g_k + p_{k+2} p_{k+1} p_k c_k \tag{3.3}$$
$$c_{k+4} = g_{k+3} + p_{k+3} c_{k+3} = g_{k+3} + p_{k+3} g_{k+2} + p_{k+3} p_{k+2} g_{k+1} + p_{k+3} p_{k+2} p_{k+1} g_k + p_{k+3} p_{k+2} p_{k+1} p_k c_k \tag{3.4}$$
Organizing the lookahead logic block in a 4-bit wide module, it is possible to express Equation (3.4) in terms of block generate and block propagate signals, $g_{k:k+3}$ and $p_{k:k+3}$, respectively:
$$c_{k+4} = g_{k:k+3} + p_{k:k+3} c_k \tag{3.5}$$
where
$$g_{k:k+3} = g_{k+3} + p_{k+3} g_{k+2} + p_{k+3} p_{k+2} g_{k+1} + p_{k+3} p_{k+2} p_{k+1} g_k \tag{3.6}$$
and
$$p_{k:k+3} = p_{k+3} p_{k+2} p_{k+1} p_k \tag{3.7}$$
Figure 3.4 shows the block diagram of a 16-bit carry lookahead adder. The carry lookahead logic blocks are organized in 4-bit modules. The operation of the 16-bit carry lookahead adder is as follows: 1) the inputs x, y, and $c_0$ are applied, 2) each MFA computes p and g, 3) the first level of lookahead logic blocks computes the carries and the block generate and propagate signals, 4) with the carry data, each MFA computes the sum outputs. The final carry-out, $c_{16}$, is made from the second level lookahead logic block. This simple style of a carry lookahead adder was used for all of the multipliers implemented for this research. Primarily, the lookahead logic blocks were organized in 4-bit modules. Where the number of input bits was not a multiple of four, 1-bit, 2-bit, or 3-bit lookahead logic blocks were applied as needed to the most significant bit pairs.
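The generate, propagate, and block lookahead relations can be exercised with a behavioral Python sketch. It follows the logic equations only; software evaluates them sequentially, so it says nothing about gate delay. The function names and the `width` and `block` parameters are assumptions made for this illustration, with a narrower block on the most significant bits when the width is not a multiple of four.

```python
def block_signals(p: list[int], g: list[int]) -> tuple[int, int]:
    """Block generate and block propagate for one lookahead block (LSB first)."""
    block_g, block_p = 0, 1
    for pk, gk in zip(p, g):
        block_g = gk | (pk & block_g)    # expands to g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0
        block_p &= pk                    # p3 p2 p1 p0
    return block_g, block_p


def cla_add(x: int, y: int, c0: int = 0, width: int = 16, block: int = 4) -> tuple[int, int]:
    """Return (sum, carry_out) of two width-bit operands."""
    xs = [(x >> k) & 1 for k in range(width)]
    ys = [(y >> k) & 1 for k in range(width)]
    g = [a & b for a, b in zip(xs, ys)]  # g_k = x_k y_k
    p = [a | b for a, b in zip(xs, ys)]  # p_k = x_k + y_k
    carries, block_cin = [], c0
    for start in range(0, width, block):
        pb, gb = p[start:start + block], g[start:start + block]
        c = block_cin
        for pk, gk in zip(pb, gb):
            carries.append(c)            # carry into this bit position
            c = gk | (pk & c)            # c_{k+1} = g_k + p_k c_k
        bg, bp = block_signals(pb, gb)
        block_cin = bg | (bp & block_cin)  # block carry-out from block g and p
    total = 0
    for k in range(width):
        total |= (xs[k] ^ ys[k] ^ carries[k]) << k
    return total, block_cin


print(cla_add(12345, 23456))
print(cla_add(0xFFFF, 1))
```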
Figure 3.4: Diagram of 16-bit Carry Lookahead Adder
3.2 Process Technologies
The four process technologies used in this research are from the same world-class semiconductor foundry. The four process technologies are 1) a 250 nm CMOS logic, single poly, five metal layer, salicide, 2.5 V process, 2) a 180 nm CMOS logic, single poly, six metal layer, salicide, 1.8 V process, 3) a 130 nm CMOS logic, single poly, eight metal layer, salicide, 1.2 V process, and 4) a 90 nm CMOS logic, single poly, nine metal layer, salicide, 1.0 V process. The 250 nm and 180 nm processes represent today's mainstream logic technologies. They are product-proven technologies and they offer the best overall value for mixed signal designs in the consumer and industrial marketplaces. The 130 nm and 90 nm processes are among the foundry's more advanced process technologies, offering many low power, high performance options, such as different core voltages and multiple threshold voltages.
3.3 Cell Libraries
The column compression multipliers were implemented using standard cells from state-of-the-art libraries. These libraries were created for mainstream applications with optimizations for speed and density. The library architecture for 180 nm, 130 nm, and 90 nm process technologies is an enhanced generation over the 250 nm library architecture. For the 130 nm process technology, two cell libraries were used. One is the generic cell library. The second cell library is specifically architected to be low power and high density. This additional cell library has been characterized down to 0.6 V to enable accurate timing simulations at low voltages.
The design kit for each standard cell library includes LEF files and timing files. A LEF (Library Exchange Format) file contains the physical information for a process technology as well as geometric abstracts of all of the cells. All of the timing files used for this research are for the nominal temperature, voltage, and process corner, often named typical.lib. The most critical logic cell of the Wallace, Dadda, and Reduced Area multipliers is the (3,2) counter. Figure 3.5 shows the schematic of the (3,2) counter that is a standard cell within each of the libraries. As expected, the slowest paths are from the a and b inputs to the carry out, cout.
Figure 3.5: Schematic of (3,2) counter standard cell
3.4 M x N Multiplier Generator
The automation of the design process is essential in order to create several multipliers in a short time frame. In the literature, various programming languages have been used to create VHDL, Verilog, or netlist files. Several generators [61, 62, 63] for Booth encoded multipliers with optimized Wallace trees have been written in Lisp, AWK, or C. Over time, such multiplier generators have improved in capability, offering pipeline insertion and opportunities for incremental optimization. In 2000, Hsiao and Jiang [64] produced a synthesizer which generates gate level Verilog code for a fast column compression multiplier. Their synthesizer connects the full adders of partial product reduction by choosing a connection pattern which minimizes the average input-to-output delay, offering global optimization for all of the available adders. In 2003, Qian and Dong-Hui [65] presented a Regularized Multiplier Generator, written in C++ and producing VHDL. This generator uses 4:2 compressors for the partial product reduction.
For this research, an M x N multiplier generator called genmult was created using the scripting language, Perl. The user can specify several options for the generation of all or parts of a column compression multiplier. genmult is invoked using

    genmult -t <type> -M <size> -N <size> -a <adder> -<all, and, comp, add> -p <proc>

where
    <type>  = <dad | wal | ra> = Dadda, Wallace, or Reduced Area multiplier
    <size>  = <8 - 64> = number of bits in the multiplicand (M) or multiplier (N)
    <adder> = <rc | cla | NA> = ripple carry adder, carry lookahead adder, or no adder
    <all>   = netlist complete multiplier with input and output wrapper of D flip-flops
    <and>   = netlist AND gate array only
    <comp>  = netlist column compression stage only
    <add>   = netlist final adder only
    <proc>  = process ID

Based on the input options, genmult creates a spice netlist. For example, it is possible to generate a 32-bit by 16-bit Wallace spice netlist, wal32x16.spi, or just a 14-bit wide Carry Lookahead Adder spice netlist, cla14.spi. The genmult script uses an additional input file, <proc>.list, which contains a mapping of genmult gate names to the appropriate names of logic gates in a particular process. In one process technology, a full adder cell may be labeled ADDFX1, while in another process, it is labeled fadder. The process map file is created manually but once completed is reused for all designs in that process. In order to have a gate-level Verilog netlist for layout, a spice to Verilog script, spi2ver, was created using Perl. This script takes a spice netlist as input and uses it to generate a Verilog netlist. For example, spi2ver wal32x16.spi is used to create the Verilog file wal32x16.g.
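The idea of a process map file can be illustrated with a short Python sketch. The text does not describe the format of <proc>.list, so the two-column layout, the `#` comment syntax, and the generic gate names `fa` and `and2` are all hypothetical. Only the cell names ADDFX1 and fadder and the netlist file names come from the text. This is not the genmult implementation, which was written in Perl.

```python
def parse_process_map(text: str) -> dict[str, str]:
    """Parse a hypothetical '<generic name> <library cell name>' map, one pair per line."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # assumed comment syntax
        if not line:
            continue
        generic, cell = line.split()
        mapping[generic] = cell
    return mapping


def netlist_name(kind: str, m: int, n: int) -> str:
    """Build a netlist file name in the style of the example wal32x16.spi."""
    return f"{kind}{m}x{n}.spi"


# Two hypothetical map files for two different processes.
process_a = parse_process_map("fa ADDFX1\nand2 AND2X1  # assumed cell name\n")
process_b = parse_process_map("fa fadder\n")
print(process_a["fa"], process_b["fa"], netlist_name("wal", 32, 16))
```

Keeping the generic names fixed and swapping only the map is what lets one generator target several cell libraries.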
Chapter 4 Automated Multiplier Implementation and Verification

A design flow encompassing industry standard layout tools and verification practices was used to develop sixty column compression multipliers. The multiplier netlists, created using the home-grown tool genmult, were checked using formal verification techniques and then placed and routed. The parasitic resistances and capacitances were then extracted from the layouts and used to back-annotate the netlists for delay analysis and power simulations. Layout tools and simulation practices were applied even-handedly to ensure fair representations of each type of multiplier. Scripting languages, like Perl and C shell, were used to automate often repeated tasks and streamline information extraction. This chapter outlines the design flow. A detailed overview of the tools and the process used for layout, parasitic extraction, delay simulation, and power estimation is given. Floor planning decisions are also reported.
4.1 Design Flow
Premier tools from Cadence Design Systems, Inc., form the backbone of the design environment. Figure 4.1 illustrates the design and tool flow for the development and verification of the column compression multipliers. The first important step is to verify that the generated gate-level Verilog netlist functions as the desired multiplier.
Figure 4.1: Design and Tool Flow

Step                          Tool
Verilog Netlist Generation    genmult and spi2ver Perl scripts
Functional Verification       Conformal Equivalence Checking, Conformal Ultra
Timing Driven Placement       Encounter NanoPlace
Timing Driven Route           Encounter NanoRoute
RC Extraction                 Encounter Native Extraction
Static Timing Analysis        Encounter Common Timing Engine
Power Analysis                Virtuoso UltraSim

This task is performed using Encounter Conformal Equivalence Checking, version 5.1, with the additional product of Conformal Ultra which targets datapath structures. The verified multiplier netlist is then placed and routed using NanoPlace
and NanoRoute of the Encounter platform. Parasitic extraction is performed using Encounter's native RC extraction program. The static timing analysis tool, Encounter's Common Timing Engine, uses this parasitic information to determine path delays. The parasitic data is also used in power simulations by Cadence's Virtuoso UltraSim, version 4.2. All of the tools and scripts were run on a desktop personal computer running the Red Hat Enterprise Linux 4.2 operating system. The desktop personal computer was built using 1 GB of memory and a 3.4 GHz Intel Pentium D, dual-core, 64-bit processor.
4.2 Formal Verification
Formal verification is a type of static analysis that applies mathematical techniques to rigorously prove that a design functions correctly. Equivalence checking uses formal techniques to determine whether two versions of a design are functionally equivalent. This powerful verification method can be performed quickly and without the need for test vectors. In order to ensure that each generated column compression multiplier operates correctly, formal verification was performed using Cadence's Encounter Conformal Equivalence Checking. The Conformal Ultra product was added to extend logic equivalence checking capability to complex datapaths. Conformal Ultra provides targeted support for analyzing adders and multipliers with standard architectures.
The flowchart in Figure 4.2 shows the Conformal process flow. Each generated gate-level Verilog netlist of a column compression multiplier is compared to a Verilog RTL multiplier design. The Verilog RTL multiplier is considered the Golden,
faultless design. The generated gate-level Verilog netlist is the Revised, to-be-verified design. After reading in the designs, the cell library information, and user-specified constraints and parameters, Conformal maps key points and compares the logic implemented to reach them. In the case of the multipliers, the key points are the primary inputs, primary outputs, and D flip-flops. When the comparison is complete, Conformal reports areas of equivalence and pinpoints differences. Conformal also assists in
diagnosing mismatches with error patterns and candidates, gate reporting, source code viewing, and schematic viewing with trace capability.
Figure 4.2: Conformal Process Flow
4.3 Layout Floorplanning

Initial floorplanning for the column compression multipliers was performed using Cadence's Encounter platform. The fundamental goal of the floorplanning was to prepare a physical structure such that the placement and route tools could operate on each design in a consistent, balanced manner. To build a floorplan, a minimal set of constraints and parameters was designated in a configuration file for each multiplier (e.g., dad8x8j250.conf). The main items specified in each configuration file include

1. the type of design netlist,
2. the filename of the timing data for the standard cell library,
3. the filename(s) of the LEF for the standard cell library,
4. a target aspect ratio for layout height and width,
5. the designation to flip cells to facilitate power abutment,
6. a target row utilization, and
7. the filename for the pin I/O assignments.

In this research, the type of netlist used for placement and route is gate-level Verilog. The timing file used for each standard cell library represents the typical
performance at nominal voltage, temperature, and process corner. The LEF provides the physical geometries of the process technology and the standard cells. For custom datapaths, a bit-slice of an arithmetic unit tends to fit in long, pitch-matched, rectangular channels. For this research, a key goal is to evaluate column compression multipliers in a sea-of-gates design flow. There is no requirement to pitch-match. Instead, the footprint should support high-density placement. To this end, a target aspect ratio of 0.95 for the layout's height versus width was given for each multiplier. This means that each layout would take on an almost square appearance. Power and ground strips were configured to abut power with power and ground with ground, as shown in Figure 4.3. No additional space was allotted for routing channels in metal 1.
Figure 4.3: Power/Ground abutment in layout

The pin I/O assignment defines the labels and placement order of input and output pins for the multipliers. As shown in Figure 4.4, each pin assignment file places the multiplicand, A, across the top of the layout, the multiplier, B, along the west side, the least significant half of the product along the east side, and the most significant half of the product on the bottom. Note that pin placement order was specified but the exact locations for each pin were not fixed. This allowed the tool to select the optimal pin placement and the overall layout to grow or shrink as needed.
Figure 4.4: Diagram of pin placement for N by N multipliers

One of the important input parameters for the cell placement tool is row utilization. Row utilization is the ratio of the total area occupied by the design's cells to the total area of the layout region. The tool user specifies a row utilization target before placement occurs. High row utilization numbers indicate a very dense cell placement. Depending on the size of the design, a cell placement with a high row utilization may not be routable. The initial row utilization target for each multiplier layout was set at 95%.
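The relationship between total cell area, row utilization, and aspect ratio can be illustrated with a short calculation. The following Python sketch is only an illustration of the definitions above, not the method used by the Encounter tools; the function name `core_dimensions` and the example cell area of 10,000 µm² are hypothetical.

```python
import math


def core_dimensions(cell_area_um2, row_utilization=0.95, aspect_ratio=0.95):
    """Return (width, height) in micrometers of a rectangular layout region.

    row_utilization = total cell area / total layout area
    aspect_ratio    = layout height / layout width
    Both defaults are the targets quoted in the text; the function itself
    is a hypothetical helper, not part of any Cadence tool.
    """
    if not 0.0 < row_utilization <= 1.0:
        raise ValueError("row utilization must be in (0, 1]")
    if aspect_ratio <= 0.0:
        raise ValueError("aspect ratio must be positive")
    layout_area = cell_area_um2 / row_utilization
    width = math.sqrt(layout_area / aspect_ratio)
    height = aspect_ratio * width
    return width, height


if __name__ == "__main__":
    # Hypothetical design whose cells occupy 10,000 square micrometers.
    w, h = core_dimensions(10_000.0)
    print(f"width = {w:.1f} um, height = {h:.1f} um, area = {w * h:.0f} um^2")
```

With a 95% utilization target, the layout region is only about 5% larger than the summed cell area, which is why the routability caveat above matters.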
4.4 Timing Driven Placement and Route

Timing driven placement and route offers the opportunity to realize optimized layout of the cells in a multiplier's critical path. Based on 1) the floorplan configuration, 2) the input clock period defined in the timing constraints file, 3) cell delays from timing libraries, and 4) net delays calculated using RC extracted from trial routes, the NanoPlace tool performs several iterations to determine an optimal placement. Following cell placement, filler cells are added as needed to extend power and ground lines. NanoRoute uses the same input information as NanoPlace to produce a routed design. Typically, digital designers allow NanoRoute to perform gate upsizing, signal buffering, and even logic optimization during the routing process. In order to maintain control over the cells used in a multiplier's implementation, additional buffering, gate resizing, and logic optimization were disabled during routing.

4.5 RC Extraction

The native RC extraction tool within Encounter offers two modes of operation: default and detailed. Per the Encounter User Guide [66], in the default mode the total capacitance for each net is calculated based on the net's geometry and the local wire density. Note that coupling capacitance is not calculated in default mode. In the detailed mode, the coupling capacitance is also evaluated by considering the actual geometries of neighboring nets on the same metal layer and the adjacent metal layer when a complete capacitance table is provided. Detailed mode offers RC values that contribute to more accurate timing results for a particular process technology.
The key to performing a detailed RC extraction is the provision of a capacitance table created from an IceCaps Technology (ICT) file. For this research, an ICT file was only available for the 250 nm process. Therefore it was possible to perform both default and detailed RC extractions when using the 250 nm process. For the 180 nm, 130 nm, and 90 nm processes, only the default RC extraction was conducted. For the multipliers developed in the 250 nm process, the delays obtained with the detailed RC extractions were 1.5% to 3.5% longer than the delays obtained with the default RC extractions. Since the delay differences between the two extraction modes were very small, the default RC extractions were deemed sufficient for the simulations in this research.

4.6 Static Timing Analysis

Given a specified set of operating constraints and timing libraries, timing analysis is used to fine tune and debug speed-limiting critical paths. All timing analysis in this research was performed using Encounter's Common Timing Engine (CTE). Figure 4.5 shows the basic configuration of CTE used for this research. For each multiplier, three main sets of timing data were reported: 1) the top 50 slowest paths, 2) the worst case delay to each final product bit, and 3) the worst case delay of the bit pairs into the Carry Look-Ahead Adder.
Figure 4.5: Configuration of Common Timing Engine

4.7 Power Simulation

Virtuoso UltraSim is designed to verify analog, mixed signal, and digital circuits using a multi-purpose, single engine, hierarchical simulator. It is promoted as ten to more than 10,000 times faster than SPICE [67]. In its most accurate mode, this simulator offers plus or minus one percent accuracy with respect to SPICE. For this research, Virtuoso UltraSim is used to perform dynamic power analysis, with monitoring of the average, maximum (i.e., peak), and RMS currents and power consumed by each multiplier. It is not possible to evaluate leakage current using UltraSim and the given standard cell libraries. Figure 4.6 shows an example configuration for an UltraSim simulation of a 32 by 32 Dadda multiplier. The primary input file, dad32x32.sp, provides pointers to required files, such as the circuit's netlist, the parasitic resistors and capacitors for back-annotation, and parameters of the process technology. The input test vector file includes stimulus values for the multiplier and the multiplicand as well as output values for the final product, making the simulations self-checking.
Figure 4.6: Configuration of UltraSim

The dad32x32.sp file also sets up power, ground, circuit clocking, simulation modes, and measurement commands. The simulation mode used for all runs was digital accurate mode. The digital accurate mode is used for timing verification of digital circuits with a simulation error target of less than 5%. In the trade-off of speed and accuracy, the simulations were customized to run at a slightly faster speed setting (speed=6) than the default (speed=5) of digital accurate mode, thereby setting the relative convergence criterion, tol, for the current and voltage calculations to 0.02 and the absolute current tolerance, iabstol, to $1 \times 10^{-10}$.
Chapter 5 Multiplier Area
Complex and unwieldy are two terms often used to describe the physical realization of column compression multipliers. Intuitively, current computer aided design techniques offer the opportunity to make the multiplier areas the smallest and most compact layouts realizable. Today's modern process technologies offer five or more layers of metal for signal routing. Thus it is possible to place all of the multiplier's cells without having to leave room for routing channels. This means that multiplier area is solely dependent on the area of the cells used in the design. There is no need for a route component when estimating area. Since, for N by N multipliers, the number of the largest cells, the (3,2) counters, grows as $N^2$, multiplier area is expected to be roughly proportional to $N^2$. In 1974, Dennard et al. [68] indicated that, to a first order, each new generation of process technology should be expected to scale all MOS physical dimensions in proportion to the minimum feature size, $\lambda$, of the process technology. Since the height and width of every cell are proportional to $\lambda$, the area of each standard cell is proportional to $\lambda^2$, and the total area of an N by N column compression multiplier is expected to be approximately equal to $k\lambda^2 N^2$, where k is a constant scaling factor.
This chapter presents the results of using the place and route tools, NanoPlace and NanoRoute within Cadence's Encounter platform, to lay out Wallace, Dadda, and Reduced Area multipliers. The multipliers have been developed in the standard cell libraries of four CMOS process technologies: 1) 250 nm, 2.5 V, 2) 180 nm, 1.8 V, 3) 130 nm, 1.2 V, and 4) 90 nm, 1.0 V. For the 130 nm process technology, multipliers were created using both a generic standard cell library and a low power standard cell library. Before the actual layout values are reported, a simple analysis is offered to predict the trends in area as multiplier sizes, standard cell libraries, and process technologies are changed. This chapter concludes with an analysis of the significance of the layout results.

5.1 Area Estimation

As noted above, a first order area estimate is based on the number of (3,2) counters, which is proportional to $N^2$. A more exact area estimate is based on the gate counts. As shown in Figure 5.1, each type of multiplier comprises 1) D flip-flops on the inputs and outputs, 2) buffers distributing multiplicand and multiplier bits to the AND gate array, 3) the AND gate array, 4) the reduction stages, and 5) the carry lookahead adder. In order to perform two's complement multiplication, a NAND/AND gate array is implemented to generate the partial products. The NAND/AND gate array has the same layout complexity as the AND gate array.
Figure 5.1: Block diagram of an N by N unsigned column compression multiplier

Table 5.1 indicates the number of D flip-flops, buffers, and AND gates used for each multiplier. The number of D flip-flops is equivalent to the total number of primary inputs and outputs, 4N. In examining the drive capabilities of D flip-flops and buffers, delay simulations showed that one D flip-flop could drive up to eight buffers and a buffer, depending on size, could drive up to eight AND gate inputs.
Table 5.1: Number of D flip-flops, buffers, and AND gates used in the multipliers

Multiplier Word Size   D flip-flops   Buffers   AND gates
8 by 8                 32             0         64
16 by 16               64             64        256
32 by 32               128            256       1024
64 by 64               256            1024      4096
Table 5.2 summarizes expressions, reported in [26, 60], for calculating the number of (3,2) and (2,2) counters used in partial product reduction as well as the word length of the final carry propagate adder as a function of the input multiplier size, N. The expressions for counters and adder word length assume N > 6 and are also based on the number of reduction stages, S, which is approximately equal to $\log_{1.4} N$. The number of (3,2) counters in Dadda multipliers comes simply from recognizing that there are $N^2$ bits in the original partial product matrix and $4N - 3$ bits in the final two row matrix. Compared to Dadda multipliers, the Reduced Area multipliers have S fewer bits in the final two row matrix that go to the carry propagate adder. This occurs because the Reduced Area multipliers use (2,2) counters to reduce the rightmost column that has exactly two bits in each reduction stage.

Table 5.2: Components for Wallace, Dadda, and Reduced Area multipliers

Method         (3,2) Counters         (2,2) Counters   CPA Length
Wallace        $N^2 - 4N + 2 + S$     $> N$            $2N - 1 - S$
Dadda          $N^2 - 4N + 3$         $N - 1$          $2N - 2$
Reduced Area   $N^2 - 4N + 3 + S$     $N - 1$          $2N - 2 - S$
For Wallace multipliers, the number of (3,2) counters is approximately $N^2 - 4N + 2 + S$. If N is greater than five, then there will be either one or two bits in the most significant column of the final two row matrix. In the former case one fewer (3,2) counter is required. The number of (2,2) counters is at least N, and is often much greater than N. Table 5.3 gives the (3,2) and (2,2) counter quantities and the word length of the carry propagate adder for 8 by 8, 16 by 16, 32 by 32, and 64 by 64 Wallace, Dadda, and Reduced Area multipliers. Compared to the Dadda multiplier, the N by N Reduced Area multiplier requires $2(\log_2 N - 1)$ more (3,2) counters but has a final adder that is $2(\log_2 N - 1)$ bits smaller. An N by N Wallace multiplier requires one fewer (3,2) counter than a Reduced Area multiplier and has a carry propagate adder that is one bit wider. The Wallace multiplier also requires roughly $N(2\log_2 N - 5)$ (2,2) counters.

Table 5.3: Hardware for Wallace, Dadda, and Reduced Area multipliers

Reduction Strategy   (3,2) Counters   (2,2) Counters   Adder Length
Wallace (8 by 8)     38               15               11
Dadda (8 by 8)       35               7                14
RA (8 by 8)          39               7                10
Wallace (16 by 16)   200              54               25
Dadda (16 by 16)     195              15               30
RA (16 by 16)        201              15               24
Wallace (32 by 32)   906              164              55
Dadda (32 by 32)     899              31               62
RA (32 by 32)        907              31               54
Wallace (64 by 64)   3852             459              117
Dadda (64 by 64)     3843             63               126
RA (64 by 64)        3853             63               116
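The closed-form expressions of Table 5.2 can be evaluated directly to reproduce the (3,2) counter counts and adder lengths of Table 5.3. The following Python sketch is a minimal illustration under two assumptions that are not stated in the text: the number of reduction stages S is taken from the standard Dadda height sequence 2, 3, 4, 6, 9, 13, ..., and the same S is used for the Wallace expressions. The function names are hypothetical. The Wallace (2,2) counter count is omitted because Table 5.2 gives only a lower bound for it.

```python
def reduction_stages(n):
    """Number of reduction stages S for an n by n partial product matrix.

    Assumption: S is the number of terms of the Dadda height sequence
    2, 3, 4, 6, 9, 13, ... that are smaller than n.
    """
    stages, height = 0, 2
    while height < n:
        stages += 1
        height = (3 * height) // 2  # next maximum height: floor(1.5 * previous)
    return stages


def component_counts(n):
    """Evaluate the Table 5.2 expressions (stated in the text for n > 6)."""
    s = reduction_stages(n)
    dadda_32 = n * n - 4 * n + 3
    return {
        "Wallace": {"(3,2)": n * n - 4 * n + 2 + s, "CPA": 2 * n - 1 - s},
        "Dadda": {"(3,2)": dadda_32, "(2,2)": n - 1, "CPA": 2 * n - 2},
        "Reduced Area": {"(3,2)": dadda_32 + s, "(2,2)": n - 1,
                         "CPA": 2 * n - 2 - s},
    }


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, reduction_stages(n), component_counts(n))
```

For the four word sizes studied, the stage counts from this sequence are 4, 6, 8, and 10, which also agree with the $2(\log_2 N - 1)$ expression quoted in the text.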
The carry lookahead adder designed for the multipliers used a maximum lookahead logic block width of four. When needed, 1-bit, 2-bit, and 3-bit lookahead logic blocks were available, so that no additional, unused hardware was included. Per [57], the complexity of an N-bit carry lookahead adder implemented with 4-bit lookahead logic blocks is approximately 1.4 times the complexity of N (3,2) counters. Based on the accumulated data regarding the component counts of the multipliers, Table 5.4 indicates, to a first order, the complexity of the D flip-flops, buffers, AND gates, (3,2) counters, (2,2) counters, and the carry lookahead adder for N by N multipliers. The D flip-flops and the adder length of the final two row matrix grow linearly with N, the number of bits of the multiplier. For Dadda and Reduced Area multipliers, the number of (2,2) counters increases proportionally with N. For Wallace multipliers, the number of (2,2) counters grows faster than linearly with N, but slower than $N^2$. Examination of several (2,2) counter quantities for different values of N shows that the number of (2,2) counters is roughly proportional to N log N. The complexity of the carry lookahead adder is proportional to N, since its word length is approximately proportional to N. Mainly, the area growth of the multipliers is dominated by the $N^2$ term for (3,2) counters, the largest standard cells in the design. The buffers and AND gates also contribute quadratic growth with N.
Table 5.4: Complexity of the multiplier components

Component               Wallace       Dadda      Reduced Area
D flip-flops            4N            4N         4N
Buffers                 O(N^2)        O(N^2)     O(N^2)
AND gates               N^2           N^2        N^2
(3,2) Counters          O(N^2)        O(N^2)     O(N^2)
(2,2) Counters          O(N log N)    O(N)       O(N)
Carry Lookahead Adder   O(N)          O(N)       O(N)
Therefore, based on gate counts and the complexity of the multiplier components, the following area estimates are offered:

For a given process technology and word size, the Wallace multipliers have the largest area, the Dadda multipliers are smaller, and the Reduced Area multipliers are the smallest. This is based on first examining the (3,2) counters, then the (2,2) counters, and finally the carry lookahead adder. The Wallace and Reduced Area multipliers have roughly equal numbers of (3,2) counters, but Wallace utilizes substantially more (2,2) counters. If the (2,2) counters are approximately half the size of (3,2) counters, then the number of (3,2) counters could be increased by half of the number of (2,2) counters when considering area. For the 16 by 16 multipliers, the area of the Wallace multipliers is formed by 227 (3,2) counter area equivalents versus 202.5 and 208.5 (3,2) counter area equivalents for Dadda and Reduced Area multipliers, respectively. A modified full adder in the carry lookahead adder is approximately the same size as a (3,2) counter. Based on the adder length, the number of modified full adders could be rolled into the (3,2) counter evaluation. Now, for a 16 by 16 multiplier, the relative area equivalents in (3,2) counters would be 252, 232.5, and 232.5 for Wallace, Dadda, and Reduced Area multipliers, respectively.

For a given process technology, a low power standard cell library will yield multipliers with smaller areas than those built using a generic standard cell library. The typical approach to developing a low power cell library is to scale down CMOS gate sizes, providing lower area, lower drive currents, and lower power consumption. Inspection of the two 130 nm standard cell libraries shows that the (3,2) counters in the low power library are 14% smaller in area than the (3,2) counters in the generic library. Therefore, with area dominated by the (3,2) counters, it is expected that the multipliers in the low power cell library will be approximately 14% smaller than those developed in the generic cell library.

5.2 Area Measurements

All of the multipliers were placed and routed using NanoPlace and NanoRoute of Cadence's Encounter platform. Power and ground were configured to abut in order to keep the area as compact as possible. No additional routing channels were provided. Filler cells were added as needed to extend power and ground lines. Row utilization for all of the multipliers was 95%. Visual inspection of each of the layouts showed dense, compact, very regular layouts. There were no empty, gate-free sections of any significant size. Though five or more layers of metal were available for each process, the small 8 by 8 multipliers were routed using only three layers of metal and the large 64 by 64 multipliers were routed using four layers of metal.
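The (3,2) counter area equivalents quoted in Section 5.1 for the 16 by 16 multipliers can be reproduced with a few lines of arithmetic. The Python sketch below is only an illustration of that estimate; the function name `area_equivalents` is hypothetical, and the weights (one half for a (2,2) counter, one for each adder bit) are the approximations stated in the text.

```python
# (3,2) counters, (2,2) counters, and adder length for the 16 by 16
# multipliers, taken from Table 5.3.
HARDWARE_16X16 = {
    "Wallace": (200, 54, 25),
    "Dadda": (195, 15, 30),
    "Reduced Area": (201, 15, 24),
}


def area_equivalents(counters_32, counters_22, adder_length, include_adder=True):
    """Express hardware as an equivalent number of (3,2) counters.

    A (2,2) counter is weighted as half of a (3,2) counter, and each bit of
    the carry lookahead adder (one modified full adder) as one (3,2) counter.
    """
    total = counters_32 + 0.5 * counters_22
    if include_adder:
        total += adder_length
    return total


if __name__ == "__main__":
    for name, hw in HARDWARE_16X16.items():
        without = area_equivalents(*hw, include_adder=False)
        with_adder = area_equivalents(*hw)
        print(f"{name}: {without} without adder, {with_adder} with adder")
```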
5.3 Area of Multipliers in the 250 nm Process Technology

Layouts of Wallace, Dadda, and Reduced Area multipliers were completed in the 250 nm process technology. The layout areas for the twelve multipliers are listed in Table 5.5. Each doubling of the operand size from 8 by 8 through 64 by 64 increases the total area by slightly less than a factor of four for all three types of multipliers. As shown in Table 5.6, the Reduced Area multipliers are the smallest in size, with the Wallace multipliers being 4% to 6% larger and the Dadda multipliers 0.3% to 6% larger.

Table 5.5: Layout areas for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process

Word Size   Wallace (µm²)   Dadda (µm²)   Reduced Area (µm²)
8 by 8      14,576          14,570        13,807
16 by 16    53,321          51,288        50,185
32 by 32    195,713         186,909       185,407
64 by 64    738,385         709,509       707,699

Table 5.6: Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the 250 nm process

Word Size   RA Multiplier Area (µm²)   Wallace   Dadda
8 by 8      13,807                     +5.6%     +5.5%
16 by 16    50,185                     +6.2%     +2.2%
32 by 32    185,407                    +5.6%     +0.8%
64 by 64    707,699                    +4.3%     +0.3%
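The percentages in Table 5.6 follow directly from the areas in Table 5.5. As a quick cross-check, the following Python sketch recomputes them; the names `AREAS_250NM`, `percent_larger`, and `compare_to_reduced_area` are introduced here for illustration only.

```python
# Layout areas in square micrometers from Table 5.5 (250 nm process):
# word size -> (Wallace, Dadda, Reduced Area)
AREAS_250NM = {
    8: (14576, 14570, 13807),
    16: (53321, 51288, 50185),
    32: (195713, 186909, 185407),
    64: (738385, 709509, 707699),
}


def percent_larger(area, reference):
    """Percentage by which area exceeds reference."""
    return 100.0 * (area - reference) / reference


def compare_to_reduced_area(areas):
    """Return {word size: (Wallace %, Dadda %)}, rounded to one decimal."""
    return {
        n: (round(percent_larger(w, ra), 1), round(percent_larger(d, ra), 1))
        for n, (w, d, ra) in areas.items()
    }


if __name__ == "__main__":
    for n, (w_pct, d_pct) in compare_to_reduced_area(AREAS_250NM).items():
        print(f"{n} by {n}: Wallace +{w_pct}%, Dadda +{d_pct}%")
```

The same calculation applies to the corresponding tables for the other process technologies.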
5.4 Area of Multipliers in the 180 nm Process Technology

Four Wallace, four Dadda, and four Reduced Area multipliers were completed in the 180 nm process technology. The layout areas for each multiplier are listed in Table 5.7. Generally, the area differences were small, as indicated in Table 5.8. Doubling the operand size from 8 by 8 through 64 by 64 increases the total area by slightly less than a factor of four for all three types of multipliers.

Table 5.7: Layout areas for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process

Word Size   Wallace (µm²)   Dadda (µm²)   Reduced Area (µm²)
8 by 8      8,400           8,421         7,990
16 by 16    30,221          29,174        28,551
32 by 32    109,880         105,230       104,386
64 by 64    412,456         397,137       396,111

Table 5.8: Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the 180 nm process

Word Size   RA Multiplier Area (µm²)   Wallace   Dadda
8 by 8      7,990                      +5.1%     +5.4%
16 by 16    28,551                     +5.8%     +2.2%
32 by 32    104,386                    +5.3%     +0.8%
64 by 64    396,111                    +4.1%     +0.3%
In all cases, the Reduced Area multipliers are the smallest. For the 16 by 16 and larger word sizes, Wallace multipliers are larger than the Dadda multipliers as expected, but for the 8 by 8 multipliers, the Dadda multiplier is very slightly larger. This is because the larger area of the carry lookahead adder outweighs the area savings from fewer (3,2) counters and (2,2) counters in the Dadda multiplier. Table 5.9 shows the sections where there are area differences between the Wallace and Dadda multipliers; this includes the areas for (3,2) counters and (2,2) counters used in partial product reduction and the carry lookahead adder.

Table 5.9: Comparison of counter and CLA areas for 8 by 8 Wallace and Dadda multipliers

Multiplier       Area of (3,2) counters (µm²)   Area of (2,2) counters (µm²)   Area of CLA (µm²)
Wallace 8 by 8   2,655                          599                            1,746
Dadda 8 by 8     2,445                          279                            2,295
Difference       -210                           -320                           +549
5.5 Area of Multipliers in the 130 nm Process Technology

For the 130 nm process technology, two standard cell libraries were used. First, twelve multipliers were designed using the generic 130 nm cell library, herein designated as 130g. The areas for these twelve multipliers are given in Table 5.10. For a given word size, the three types of multipliers are very close in size. For all three types of multipliers, doubling the operand size increases the total area by slightly less than a factor of four.

Table 5.11 compares Wallace and Dadda multipliers to the Reduced Area multipliers. The Wallace and Dadda multipliers are larger than the Reduced Area multipliers. For the 8 by 8 multipliers, the Dadda multiplier is slightly larger than the Wallace multiplier due to using the larger carry lookahead adder. For the larger word sizes, the higher number of reduction stage counters in the Wallace multipliers dominates the total area, resulting in the Wallace multipliers being the largest implementations.

Table 5.10: Layout areas for Wallace, Dadda, and Reduced Area multipliers in the generic 130 nm cell library

Word Size   Wallace (µm²)   Dadda (µm²)   Reduced Area (µm²)
8 by 8      4,388           4,428         4,181
16 by 16    15,661          15,161        14,811
32 by 32    56,584          54,257        53,783
64 by 64    211,551         203,784       203,207

Table 5.11: Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the generic 130 nm cell library

Word Size   RA Multiplier Area (µm²)   Wallace   Dadda
8 by 8      4,181                      +5.0%     +5.9%
16 by 16    14,811                     +5.7%     +2.4%
32 by 32    53,783                     +5.2%     +0.9%
64 by 64    203,207                    +4.1%     +0.3%
Using the same 130 nm process technology, twelve column compression multipliers were implemented using a standard cell library designed specifically for low power performance. Herein, this low power cell library is designated as 130p. Each low power cell contains the same logic as the generic cell, but with CMOS gate geometries down-sized for reduced power consumption. Table 5.12 lists the areas of the twelve multipliers. Table 5.13 reports the percentage by which Wallace or Dadda multipliers are larger than Reduced Area multipliers.

Table 5.12: Layout areas for Wallace, Dadda, and Reduced Area multipliers in the low power 130 nm cell library

Word Size   Wallace (µm²)   Dadda (µm²)   Reduced Area (µm²)
8 by 8      3,493           3,515         3,339
16 by 16    12,739          12,353        12,103
32 by 32    46,867          45,104        44,766
64 by 64    177,318         171,474       171,060

Table 5.13: Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the low power 130 nm cell library

Word Size   RA Multiplier Area (µm²)   Wallace   Dadda
8 by 8      3,339                      +4.6%     +5.3%
16 by 16    12,103                     +5.2%     +2.1%
32 by 32    44,766                     +4.7%     +0.8%
64 by 64    171,060                    +3.7%     +0.2%
Table 5.14 reports the area differences between the multipliers in the 130g cell library and the multipliers in the 130p cell library. The multipliers developed in the 130p cell library are 16% to 21% smaller than the multipliers created in the 130g cell library.

Table 5.14: Percentage by which multipliers in the 130p cell library are smaller than multipliers in the 130g cell library

Word Size   Wallace   Dadda   Reduced Area
8 by 8      20%       21%     20%
16 by 16    19%       19%     18%
32 by 32    17%       17%     17%
64 by 64    16%       16%     16%
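The savings in Table 5.14 can be recomputed from the 130g areas in Table 5.10 and the 130p areas in Table 5.12. The Python sketch below is offered only as a cross-check of that arithmetic; the names `AREAS_130G`, `AREAS_130P`, `percent_smaller`, and `savings_table` are hypothetical.

```python
# Areas in square micrometers for the 8 by 8, 16 by 16, 32 by 32, and
# 64 by 64 multipliers (Tables 5.10 and 5.12).
AREAS_130G = {
    "Wallace": (4388, 15661, 56584, 211551),
    "Dadda": (4428, 15161, 54257, 203784),
    "Reduced Area": (4181, 14811, 53783, 203207),
}
AREAS_130P = {
    "Wallace": (3493, 12739, 46867, 177318),
    "Dadda": (3515, 12353, 45104, 171474),
    "Reduced Area": (3339, 12103, 44766, 171060),
}


def percent_smaller(low_power_area, generic_area):
    """Percentage by which the low power layout is smaller than the generic one."""
    return 100.0 * (generic_area - low_power_area) / generic_area


def savings_table():
    """Return {multiplier type: [savings in whole percent per word size]}."""
    return {
        name: [round(percent_smaller(p, g))
               for p, g in zip(AREAS_130P[name], AREAS_130G[name])]
        for name in AREAS_130G
    }


if __name__ == "__main__":
    for name, savings in savings_table().items():
        print(name, savings)
```

The measured savings of 16% to 21% exceed the 14% predicted in Section 5.1 from the (3,2) counter areas alone.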
5.6 Area of Multipliers in the 90 nm Process Technology

Layouts of four Wallace, four Dadda, and four Reduced Area multipliers were completed in the 90 nm process technology. The layout areas for these twelve multipliers are listed in Table 5.15. Doubling the operand size from 8 by 8 through 64 by 64 increases the total area by slightly less than a factor of four for all three types of multipliers. As shown in Table 5.16, the Reduced Area multipliers are the smallest in size, with the Wallace multipliers being 4% to 6% larger and the Dadda multipliers 0.3% to 6% larger.

Table 5.15: Layout areas for Wallace, Dadda, and Reduced Area multipliers in the 90 nm process

Word Size   Wallace (µm²)   Dadda (µm²)   Reduced Area (µm²)
8 by 8      2,157           2,170         2,051
16 by 16    7,717           7,444         7,277
32 by 32    27,953          26,752        26,524
64 by 64    104,780         100,720       100,445

Table 5.16: Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the 90 nm process

Word Size   RA Multiplier Area (µm²)   Wallace   Dadda
8 by 8      2,051                      +5.2%     +5.8%
16 by 16    7,277                      +6.0%     +2.3%
32 by 32    26,524                     +5.4%     +0.9%
64 by 64    100,445                    +4.3%     +0.3%
5.7 Area Comparisons

For a given process technology, standard cell library, and word size, the area differences among the Wallace, Dadda, and Reduced Area multipliers are small; the largest are at most 6% bigger than the smallest. For multipliers larger than 8 by 8, Wallace multipliers will be the largest and Reduced Area multipliers the smallest. For 8 by 8 multipliers, Dadda multipliers will be the largest, due to the area of the carry lookahead adder. Figure 5.2 shows layout areas for all of the Dadda multipliers developed in this research. For each process, all of the multiplier areas show slightly less than quadratic growth with increases in N.

[Plot "Areas of Dadda Multipliers": area (µm²), 0 to 800,000, versus word size, N, 0 to 70; series: Dadda 250nm, Dadda 180nm, Dadda 130g, Dadda 130p, Dadda 90nm]

Figure 5.2: Dadda multiplier areas using different process technologies and standard cell libraries

Tables 5.17, 5.18, and 5.19 give the normalized area calculations for Wallace, Dadda, and Reduced Area multipliers in the generic libraries of the 250 nm, 180 nm, 130 nm, and 90 nm CMOS process technologies. The area for each multiplier is normalized to the area of the 8 by 8 multiplier in that particular process technology. These normalized areas show that doubling the operand size increases the total area by slightly less than a factor of four. This occurs because most, but not all, elements of the multiplier area are increasing quadratically with the operand size.
Table 5.17: Wallace multiplier areas relative to each process's 8 by 8 case

Multiplier   Normalized Area   Normalized Area   Normalized Area   Normalized Area
             250 nm            180 nm            130 nm            90 nm
8 by 8       1.0               1.0               1.0               1.0
16 by 16     3.7               3.6               3.6               3.6
32 by 32     13.4              13.1              12.9              13.0
64 by 64     50.7              49.1              48.2              48.6

Table 5.18: Dadda multiplier areas relative to each process's 8 by 8 case

Multiplier   Normalized Area   Normalized Area   Normalized Area   Normalized Area
             250 nm            180 nm            130 nm            90 nm
8 by 8       1.0               1.0               1.0               1.0
16 by 16     3.5               3.5               3.4               3.4
32 by 32     12.8              12.5              12.3              12.3
64 by 64     48.7              47.2              46.0              46.4

Table 5.19: Reduced Area multiplier areas relative to each process's 8 by 8 case

Multiplier   Normalized Area   Normalized Area   Normalized Area   Normalized Area
             250 nm            180 nm            130 nm            90 nm
8 by 8       1.0               1.0               1.0               1.0
16 by 16     3.6               3.6               3.5               3.5
32 by 32     13.4              13.1              12.9              12.9
64 by 64     51.3              49.6              48.6              49.0
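The normalized areas are simply each layout area divided by the area of the 8 by 8 multiplier in the same process. The Python sketch below reproduces the Wallace values as an illustration, using the measured areas reported earlier in this chapter (with the generic library for 130 nm); the names `WALLACE_AREAS` and `normalize` are introduced here and are not part of the original work.

```python
# Wallace layout areas in square micrometers for the 8 by 8, 16 by 16,
# 32 by 32, and 64 by 64 multipliers (Tables 5.5, 5.7, 5.10, and 5.15).
WALLACE_AREAS = {
    "250 nm": (14576, 53321, 195713, 738385),
    "180 nm": (8400, 30221, 109880, 412456),
    "130 nm": (4388, 15661, 56584, 211551),
    "90 nm": (2157, 7717, 27953, 104780),
}


def normalize(areas):
    """Normalize a sequence of areas to its first (8 by 8) entry."""
    base = areas[0]
    return [round(area / base, 1) for area in areas]


if __name__ == "__main__":
    for process, areas in WALLACE_AREAS.items():
        print(process, normalize(areas))
```

An ideal quadratic dependence on N would give 1, 4, 16, and 64; the measured values stay below that because the flip-flops and the final adder grow only linearly with N.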
Tables 5.20, 5.21, and 5.22 report the area ratios of the multipliers in the 250 nm, 180 nm, and 130 nm processes to the 90 nm multiplier in the same word size. These area ratios show that transitioning from 250 nm to 180 nm to 130 nm decreases multiplier area by slightly less than a factor of 2 with each process transition, with the average area reduction with each respective step being slightly less than $(0.25/0.18)^2$ and slightly more than $(0.18/0.13)^2$. Transitioning from 130 nm to 90 nm decreases multiplier area by approximately a factor of two, which is slightly less than $(0.13/0.09)^2$.

Table 5.20: Wallace multiplier areas relative to 90 nm

Multiplier   250 nm to 90 nm   180 nm to 90 nm   130 nm to 90 nm
8 by 8       6.8               3.9               2.0
16 by 16     6.9               3.9               2.0
32 by 32     7.0               3.9               2.0
64 by 64     7.0               3.9               2.0

Table 5.21: Dadda multiplier areas relative to 90 nm

Multiplier   250 nm to 90 nm   180 nm to 90 nm   130 nm to 90 nm
8 by 8       6.7               3.9               2.0
16 by 16     6.9               3.9               2.0
32 by 32     7.0               3.9               2.0
64 by 64     7.0               3.9               2.0

Table 5.22: Reduced Area multiplier areas relative to 90 nm

Multiplier   250 nm to 90 nm   180 nm to 90 nm   130 nm to 90 nm
8 by 8       6.7               3.9               2.0
16 by 16     6.9               3.9               2.0
32 by 32     7.0               3.9               2.0
64 by 64     7.0               3.9               2.0
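To make the comparison with ideal scaling concrete, the measured area ratio of each process transition can be set against the square of the feature size ratio. The Python sketch below does this for the 64 by 64 Dadda multipliers, using the measured areas reported earlier in this chapter (generic library for 130 nm). It is an illustration only; the names `DADDA_64X64_AREAS` and `ideal_area_ratio` are hypothetical.

```python
# Measured 64 by 64 Dadda areas in square micrometers, keyed by the
# minimum feature size in nanometers.
DADDA_64X64_AREAS = {250: 709509, 180: 397137, 130: 203784, 90: 100720}


def ideal_area_ratio(feature_from_nm, feature_to_nm):
    """Area ratio expected if every dimension scales with the feature size."""
    return (feature_from_nm / feature_to_nm) ** 2


def measured_area_ratio(feature_from_nm, feature_to_nm, areas=DADDA_64X64_AREAS):
    """Ratio of the measured layout areas for two processes."""
    return areas[feature_from_nm] / areas[feature_to_nm]


if __name__ == "__main__":
    for old, new in ((250, 180), (180, 130), (130, 90)):
        print(f"{old} nm to {new} nm: "
              f"measured {measured_area_ratio(old, new):.2f}, "
              f"ideal {ideal_area_ratio(old, new):.2f}")
```

For this multiplier, the measured reduction is below the ideal value for the 250 nm to 180 nm and 130 nm to 90 nm steps and slightly above it for the 180 nm to 130 nm step, which matches the trend described above.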
Figures 5.3, 5.4, and 5.5 show area pie charts of each type of multiplier. Clearly, as expected, the (3,2) counters form the biggest portion of each multiplier's area, ranging from 30% of the area of an 8 by 8 Dadda multiplier to 69% of the area of a 64 by 64 Reduced Area multiplier. This significant area contribution indicates that any efforts to appreciably reduce the area of column compression multipliers should be targeted at minimizing the size of the (3,2) counters.
Figure 5.3: Area pie charts of Wallace multipliers

Component   8 by 8   16 by 16   32 by 32   64 by 64
(3,2)       32%      48%        59%        65%
(2,2)       8%       7%         6%         5%
Filler      5%       5%         5%         5%
AND2        10%      11%        12%        13%
DFF         23%      12%        7%         4%
BUFF        0%       3%         3%         3%
CLA         22%      14%        8%         5%
Figure 5.4: Area pie charts of Dadda multipliers

Component   8 by 8   16 by 16   32 by 32   64 by 64
(3,2)       30%      48%        61%        69%
(2,2)       4%       2%         1%         1%
Filler      5%       5%         5%         5%
AND2        10%      11%        13%        13%
DFF         23%      13%        7%         4%
BUFF        0%       3%         3%         3%
CLA         28%      18%        10%        5%
Figure 5.5: Area pie charts of Reduced Area multipliers

Component   8 by 8   16 by 16   32 by 32   64 by 64
(3,2)       35%      50%        62%        69%
(2,2)       4%       4%         1%         1%
Filler      5%       5%         5%         5%
AND2        11%      12%        13%        13%
DFF         24%      13%        7%         4%
BUFF        0%       3%         3%         3%
CLA         21%      13%        9%         5%
At the beginning of this chapter, it was predicted that the area of the column compression multipliers could be estimated by k 2N2, where is the processs minimum feature size, N is the word size, and k is a constant scaling factor. This prediction was based on observing the large gate count of the (3,2) counters. The pie charts show that for the 16 by 16 and smaller multipliers, the area of the (3,2) counters is significant but not dominant, only accounting for 30% to 50% of the total area. Even adding in the areas of the other gates whose count grows as O(N2), such as AND gates and buffers, only accounts for 40% to 65% of multiplier area. For the 16 by 16 and smaller multipliers, the remaining components, such as the final carry-propagate adder, contribute as much as 35% to 60% of the overall area. This indicates that predicting area to grow as N2 does not completely describe the smaller multipliers. An equation estimating area for column compression multipliers needs to include both an N2 term and an N term. Table 5.23 details the percentage that the O(N2) cells, the carry lookahead adder, and the remaining cells contribute to overall multiplier area.
Table 5.23: Breakdown of multiplier areas by components

Multiplier  Word Size   % of Area formed    % of Area        % of Area formed
                        by O(N^2) cells     formed by CLA    by remaining cells
Wallace     8 by 8      42%                 22%              36%
Dadda       8 by 8      40%                 28%              32%
RA          8 by 8      46%                 21%              33%
Wallace     16 by 16    62%                 14%              24%
Dadda       16 by 16    62%                 18%              20%
RA          16 by 16    65%                 13%              22%
Wallace     32 by 32    74%                 8%               18%
Dadda       32 by 32    77%                 10%              13%
RA          32 by 32    78%                 9%               13%
Wallace     64 by 64    81%                 5%               14%
Dadda       64 by 64    85%                 5%               10%
RA          64 by 64    85%                 5%               10%
A least squares method was applied to the measured multiplier areas to calculate quadratic approximations in the form of

$$\text{Area} \approx k_1 \lambda^2 N^2 + k_2 \lambda^2 N + b \lambda^2 \tag{5.1}$$

where $\lambda$ is the process's minimum feature size, $N$ is the word size, $k_1$ and $k_2$ are coefficients, and $b$ is a constant. Equations 5.2, 5.3, and 5.4 provide area approximations for each type of column compression multiplier, with $\lambda$ in units of nanometers. Comparing the estimated areas from the equations to the measured areas, the error ranges for the Wallace, Dadda, and Reduced Area data sets are 8.9% to -4.6%, 10.2% to -4.1%, and 10.1% to -3.9%, respectively. The approximation equations provide the best estimates of the areas measured for the 180 nm, 130 nm, and 90 nm processes and libraries, with the error for the Wallace, Dadda, and Reduced Area data sets ranging from -0.1% to -4.6%, -0.2% to -4.1%, and -0.1% to -3.9%, respectively.

$$\text{Area}_{\text{Wallace}} \approx 0.00283\,\lambda^2 N^2 + 0.015\,\lambda^2 N - 0.0472\,\lambda^2 \tag{5.2}$$
$$\text{Area}_{\text{Dadda}} \approx 0.00275\,\lambda^2 N^2 + 0.0122\,\lambda^2 N - 0.0166\,\lambda^2 \tag{5.3}$$
$$\text{Area}_{\text{RA}} \approx 0.00276\,\lambda^2 N^2 + 0.00114\,\lambda^2 N - 0.0246\,\lambda^2 \tag{5.4}$$
Equation 5.5 is the general, combined form of a quadratic area approximation for any of the three types of column compression multiplier.

$$\text{Area}_{\text{CCmultiplier}} \approx 0.00277\,\lambda^2 N^2 + 0.0129\,\lambda^2 N - 0.0295\,\lambda^2 \tag{5.5}$$
Table 5.24 reports the differences between the measured areas and the estimated areas calculated using Equation 5.5 for the Wallace, Dadda, and Reduced Area multipliers, respectively. Note that Equation 5.5 overestimates areas by 2.8% to 13.6% for all of the multipliers in the 250 nm process and underestimates areas by 0.9% to 7.0% for all of the multipliers in the 90 nm process. The general area approximations differ the most from the measured areas for the Reduced Area multipliers in 250 nm, overestimating from 7.2% for the 64 by 64 multipliers to 13.6% for the 8 by 8 multipliers. For the other three process technologies, Equation 5.5 provides a very good area estimate, with error percentages ranging between 1.8% and -7.0%.
Table 5.24: Comparison of estimated areas using general quadratic approximation versus measured areas of the multipliers

Multiplier  Word Size   250 nm    180 nm    130 nm    90 nm
Wallace     8 by 8      +7.6%     -3.2%     -3.3%     -5.8%
Dadda       8 by 8      +7.7%     -3.4%     -4.2%     -6.3%
RA          8 by 8      +13.6%    +1.8%     +1.4%     -0.9%
Wallace     16 by 16    +3.9%     -5.0%     -4.4%     -7.0%
Dadda       16 by 16    +8.0%     -1.6%     -1.2%     -3.6%
RA          16 by 16    +10.3%    +0.5%     +1.1%     -1.4%
Wallace     32 by 32    +2.8%     -5.1%     -3.8%     -6.7%
Dadda       32 by 32    +7.7%     -0.9%     +0.3%     -2.5%
RA          32 by 32    +8.5%     -0.1%     +1.2%     -1.7%
Wallace     64 by 64    +2.8%     -4.6%     -3.0%     -6.1%
Dadda       64 by 64    +7.0%     -0.9%     +0.7%     -2.4%
RA          64 by 64    +7.2%     -0.7%     +1.0%     -2.1%
The area estimates for the process geometries that are smaller than 250 nm can be improved by removing the 250 nm area data from the calculations. This is a valid option because the 250 nm cell library belongs to an architecturally different family of cell libraries. The 180 nm, 130 nm, and 90 nm cell libraries are all part of the same design family. Equations 5.6, 5.7, 5.8, and 5.9 can be used to approximate area for designs in 180 nm or smaller process geometries.

$$\text{Area}_{\text{Wallace},\,\le 180\text{ nm}} \approx 0.00288\,\lambda^2 N^2 + 0.0156\,\lambda^2 N - 0.0479\,\lambda^2 \tag{5.6}$$
$$\text{Area}_{\text{Dadda},\,\le 180\text{ nm}} \approx 0.00280\,\lambda^2 N^2 + 0.0128\,\lambda^2 N - 0.0169\,\lambda^2 \tag{5.7}$$
$$\text{Area}_{\text{RA},\,\le 180\text{ nm}} \approx 0.00280\,\lambda^2 N^2 + 0.0120\,\lambda^2 N - 0.0252\,\lambda^2 \tag{5.8}$$
$$\text{Area}_{\text{CCmultiplier},\,\le 180\text{ nm}} \approx 0.00283\,\lambda^2 N^2 + 0.0134\,\lambda^2 N - 0.0300\,\lambda^2 \tag{5.9}$$
Comparing the area approximations from Equations 5.6, 5.7, and 5.8 to the measured areas, the error ranges for the Wallace, Dadda, and Reduced Area data sets are 1.8% to -1.9%, 1.8% to -1.6%, and 1.6% to -1.6%, respectively. When the 250 nm data is included in the development of the equations, the magnitude of the error is as high as 4.6% for multipliers in the 180 nm and smaller geometries. Excluding the 250 nm data allows the error for area approximations to be within 2%. Table 5.25 reports the differences between the measured areas and the estimated areas calculated using Equation 5.9 for the Wallace, Dadda, and Reduced Area multipliers. For the 180 nm and smaller process technologies, Equation 5.9 provides an excellent area estimate, with error percentages ranging between 4.8% and -4.6%.

Table 5.25: Comparison of general area approximations for geometries ≤ 180 nm versus measured areas of the multipliers

Multiplier  Word Size   180 nm    130 nm    90 nm
Wallace     8 by 8      -0.4%     -0.5%     -3.0%
Dadda       8 by 8      -0.6%     -1.4%     -3.6%
RA          8 by 8      +4.8%     +4.4%     +2.0%
Wallace     16 by 16    -2.6%     -1.9%     -4.6%
Dadda       16 by 16    +0.9%     +1.3%     -1.1%
RA          16 by 16    +3.1%     +3.7%     +1.2%
Wallace     32 by 32    -2.8%     -1.5%     -4.5%
Dadda       32 by 32    +1.5%     +2.7%     -0.2%
RA          32 by 32    +2.3%     +3.6%     +0.7%
Wallace     64 by 64    -2.4%     -0.8%     -4.0%
Dadda       64 by 64    +1.3%     +3.0%     -0.1%
RA          64 by 64    +1.6%     +3.3%     +0.2%
Using Equations 5.6, 5.7, 5.8, and 5.9, it is possible to predict the areas of column compression multipliers in smaller process geometries. Table 5.26 lists the predicted areas for Wallace, Dadda, and Reduced Area multipliers in a 65 nm process technology. The approximate area of the generalized column compression multiplier is also given.

Table 5.26: Predicted areas for column compression multipliers in a 65 nm process technology

Word Size   Wallace (µm²)   Dadda (µm²)   Reduced Area (µm²)   General CC Multiplier (µm²)
8 by 8      1,104           1,118         1,056                1,091
16 by 16    3,967           3,822         3,733                3,840
32 by 32    14,367          13,773        13,630               13,929
64 by 64    53,856          51,845        51,594               52,471
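The 65 nm predictions can be checked with a few lines of code. The following is a minimal sketch, assuming that Equations 5.6 through 5.9 take the minimum feature size in nanometers and return area in square micrometers (the unit inferred from Table 5.26); the names `AREA_COEFFS` and `predicted_area` are illustrative and are not part of the original design flow.

```python
# Coefficients (k1, k2, b) of Equations 5.6-5.9, used as
#   area = lam^2 * (k1 * N^2 + k2 * N + b)
# Assumption: lam is in nanometers and the result is read as square micrometers.
AREA_COEFFS = {
    "Wallace": (0.00288, 0.0156, -0.0479),       # Equation 5.6
    "Dadda": (0.00280, 0.0128, -0.0169),         # Equation 5.7
    "Reduced Area": (0.00280, 0.0120, -0.0252),  # Equation 5.8
    "General": (0.00283, 0.0134, -0.0300),       # Equation 5.9
}


def predicted_area(multiplier: str, word_size: int, lam_nm: float) -> float:
    """Evaluate the quadratic area approximation for one multiplier type."""
    k1, k2, b = AREA_COEFFS[multiplier]
    return lam_nm ** 2 * (k1 * word_size ** 2 + k2 * word_size + b)


if __name__ == "__main__":
    # Print one row per word size, in the column order of Table 5.26.
    for n in (8, 16, 32, 64):
        row = [round(predicted_area(m, n, 65)) for m in AREA_COEFFS]
        print(f"{n} by {n}:", row)
```

Evaluated at 65 nm, the sketch agrees with the entries of Table 5.26 to within rounding, which also confirms that the constant term in each equation is negative.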
5.8 Area Summary
In this research, the actual layouts for the Wallace, Dadda, and Reduced Area multipliers successfully establish the area differences among the three multiplier types. Where only gate-level area estimates existed in previous research, it has been shown with timing-driven placed and routed designs that Wallace multipliers are generally the largest of the three multipliers and Reduced Area multipliers the smallest. This order of multiplier size exists across process technologies and for various word sizes greater than 8 by 8. Careful examination of the data suggests that this will hold for multipliers even larger than 64 by 64.
Area for column compression multipliers shrinks by a factor of approximately 0.5 for each generational transition, in which the process minimum feature size, $\lambda$, scales by approximately $1/\sqrt{2}$. This 50% area reduction is supported by the MOSFET scaling rules outlined by Dennard et al. [68].
Timing-driven placement of column compression multipliers offers the opportunity to achieve extremely high row utilization, creating very compact designs. All placements of the multipliers achieved a 95% row utilization. In previous research [41], placement algorithms had only the connectivity from the Verilog netlist to indicate possible nearest neighbors in order to direct the cell placement. This type of connectivity-guided placement often yielded 5%-10% lower row utilizations with extra filler spacing required for additional routing tracks.
For N by N multipliers, the actual layouts also confirm the dominance of the multiplier components whose complexity is roughly proportional to $N^2$. As the largest component in the designs, the (3,2) counters, with first order complexity of $O(N^2)$, are a major portion of the overall area. The AND gates and buffers also contribute to quadratic growth with $N$. Across different process technologies, doubling the operand size will increase the total area by a factor of somewhat less than four for each type of multiplier.
The area for column compression multipliers has been estimated in terms of the word size, $N$, and the process's minimum feature size, $\lambda$. In order to minimize the error in an area approximation, the expression must include both $N^2$ and $N$ terms.
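The factor of somewhat less than four can be made explicit with a short calculation. As a sketch, assume the quadratic area model $\text{Area}(N) = \lambda^2 (k_1 N^2 + k_2 N + b)$ used for the approximations in this chapter; the feature size $\lambda$ cancels in the ratio:

```latex
\frac{\text{Area}(2N)}{\text{Area}(N)}
  = \frac{4 k_1 N^2 + 2 k_2 N + b}{k_1 N^2 + k_2 N + b}
  = 4 - \frac{2 k_2 N + 3 b}{k_1 N^2 + k_2 N + b}
```

The ratio stays below four whenever $2 k_2 N + 3 b > 0$ and approaches four as $N$ grows. With the general coefficients of Equation 5.9 ($k_1 = 0.00283$, $k_2 = 0.0134$, $b = -0.0300$), the condition holds for $N \ge 4$, and going from 8 by 8 to 16 by 16 gives a ratio of about 3.5.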
Chapter 6 Multiplier Delay

Column compression multipliers are often cited for their high speed. In the literature, most of the high-speed multipliers are implemented as fully custom designs. In such cases, design engineers expend significant time and effort manually constructing layouts to minimize routing loads and optimize timing. A relatively minor change in the design, such as increasing the word size or moving to a different process technology, can require a time-consuming, major redesign of the multiplier. The automation of column compression multiplier development yields not only compact layouts as discussed in Chapter 5, but also fast and consistent delay times. Though they differ slightly by the number and significantly by the method of application of (3,2) and (2,2) counters, intuitively, Wallace, Dadda, and Reduced Area multipliers should have approximately equal delay times for a given word size and process technology. This is due to the three multipliers using the same number of reduction stages. It is possible for some word sizes of Wallace and Reduced Area multipliers to be slightly faster than Dadda multipliers if their final carry-propagate adder is faster. If a carry lookahead adder is used to implement the carry-propagate adder, this will be a minor effect.
As Dadda demonstrates in the generation of his sequence of intermediate column heights, the number of reduction stages is proportional to the logarithm of the word size, $N$. With the partial product reduction dominating multiplier delay, the total delay for the column compression multipliers is expected to be proportional to the logarithm of $N$ for an N-bit multiplier. As process technologies scale down, overall multiplier delays are expected to decrease in proportion to $\lambda$, where $\lambda$ is the minimum feature size. At the smaller technology features, the question becomes whether route parasitics will begin to have a greater, more negative effect on timing. The post-layout delay simulations of this research will show the impact of parasitics for 250 nm to 90 nm process technologies.
This chapter presents the results of delay analysis using the Common Timing Engine within Cadence's Encounter platform. The designs of each multiplier were placed and routed in the standard cell libraries of four CMOS process technologies: 1) 250 nm, 2.5 V, 2) 180 nm, 1.8 V, 3) 130 nm, 1.2 V, and 4) 90 nm, 1.0 V. The worst case delays of Wallace, Dadda, and Reduced Area multipliers are examined both with and without the back-annotation of parasitic resistances and capacitances extracted from the layouts. Before the actual delay values are reported, a simple analysis of delays is provided to predict the trends in multiplier speed as a function of the input word sizes and process technologies.
6.1 Delay Estimation
The total delay of a column compression multiplier is the sum of delays through 1) input signal buffering, 2) partial product array formation, 3) the reduction, and 4) the final carry propagate adder (assumed to be a carry lookahead adder for this research). Since the (3,2) and (2,2) counters are applied in parallel in each stage, the delay of each stage is one (3,2) counter delay. Equations (6.1), (6.2), and (6.3) offer simple expressions for the total delay of Wallace, Dadda, and Reduced Area multipliers:

$$\text{Delay}_{\text{Wallace},\,N \times N} = t_{\text{buffer}} + t_{\text{AND}} + S\,t_{(3,2)} + t_{\text{CPA}}(2N-1-S) \tag{6.1}$$
$$\text{Delay}_{\text{Dadda},\,N \times N} = t_{\text{buffer}} + t_{\text{AND}} + S\,t_{(3,2)} + t_{\text{CPA}}(2N-2) \tag{6.2}$$
$$\text{Delay}_{\text{RA},\,N \times N} = t_{\text{buffer}} + t_{\text{AND}} + S\,t_{(3,2)} + t_{\text{CPA}}(2N-2-S) \tag{6.3}$$

where $S$ is the number of reduction stages, $S \approx \log_{1.4} N$, and $t_{\text{CPA}}(m)$ is the delay of an m-bit carry propagate adder.

For example, a 20 by 20 Dadda multiplier uses seven reduction stages. The total delay for the 20 by 20 Dadda multiplier implemented with a carry lookahead adder is:

$$\text{Delay}_{\text{Dadda},\,20 \times 20} = t_{\text{buffer}} + t_{\text{AND}} + 7\,t_{(3,2)} + t_{\text{CLA}}(38) \tag{6.4}$$

For the word sizes and process technologies used in this research, the Wallace, Dadda, and Reduced Area multipliers will have approximately the same total delays. Examining Equations (6.1), (6.2), and (6.3), the delays through the three types of multipliers are equal until the data flow reaches the final carry propagate adder. Wallace and Reduced Area multipliers require smaller final carry propagate adders than Dadda multipliers. Depending on the implementation of the final adder, the adder length may make a small difference in the total delays among the three multipliers.
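The stage count and final adder lengths in Equations (6.1) through (6.3) are easy to tabulate. The sketch below assumes the usual construction of Dadda's column-height sequence (2, 3, 4, 6, 9, 13, ..., each term the floor of 1.5 times the previous one) to count reduction stages; the function names are illustrative only.

```python
def reduction_stages(n: int) -> int:
    """Count reduction stages for an n-by-n multiplier.

    Walks Dadda's height sequence 2, 3, 4, 6, 9, 13, ... until it reaches
    or exceeds the initial column height n; each step is one stage.
    """
    height, stages = 2, 0
    while height < n:
        height = (3 * height) // 2  # floor(1.5 * height)
        stages += 1
    return stages


def final_adder_lengths(n: int) -> dict:
    """Carry propagate adder lengths from Equations (6.1)-(6.3)."""
    s = reduction_stages(n)
    return {
        "Wallace": 2 * n - 1 - s,
        "Dadda": 2 * n - 2,
        "Reduced Area": 2 * n - 2 - s,
    }


if __name__ == "__main__":
    for n in (8, 16, 20, 32, 64):
        print(n, reduction_stages(n), final_adder_lengths(n))
```

For N = 20 this gives seven stages and a 38-bit final adder for the Dadda multiplier, matching Equation (6.4).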
For carry lookahead adders, there are predictable points at which delay will be increased due to the addition of a new lookahead logic level. If 4-bit lookahead blocks are used, as in this research, these occur at adder lengths of $4^k$, for integer values of $k$. For example, the delay of one lookahead logic block is added when the adder length increases from 4 to 5, from 16 to 17, from 64 to 65, from 256 to 257, etc. To see an impact from differing numbers of lookahead levels, one would need to look at 34 by 34 Wallace, Dadda, and Reduced Area multipliers. The adder lengths would be 59, 66, and 58, respectively. The carry lookahead adder used for the 34 by 34 Dadda multiplier would have four levels of lookahead logic, whereas the Wallace and Reduced Area multipliers would have three levels of lookahead logic.
For a given word size and process technology, the delays of the Wallace, Dadda, and Reduced Area multipliers are proportional to $\log(N)$. Though different numbers of (3,2) and (2,2) counters are applied within each reduction stage, the number of reduction stages for each type of multiplier is the same. The number of reduction stages is proportional to the logarithm of the word size. Especially for large values of $N$, the delay through the reduction stages dominates the overall multiplier delay. Therefore, the overall multiplier delay is approximately proportional to the logarithm of the word size.
Using simple analytic models [23], it is possible to predict a gate's delay as an RC delay expressed in terms of channel width, $W$, channel length, $L$, supply voltage, $V$, current, $I$, and gate-oxide thickness, $t_{ox}$:
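$$t_{\text{gate}} \sim R\,C \sim \frac{V}{I}\,C \tag{6.5}$$

Using the gate capacitance and current expressions

$$C \sim \frac{W L}{t_{ox}} \tag{6.6}$$

and

$$I \sim \frac{W}{L} \cdot \frac{1}{t_{ox}} \cdot V^2 \tag{6.7}$$

the approximation for gate delay becomes

$$t_{\text{gate}} \sim \frac{(W L)(L)(t_{ox})\,V}{t_{ox}\,W\,V^2} \tag{6.8}$$

Cancelling the channel width, the supply voltage, and the gate-oxide thickness terms yields

$$t_{\text{gate}} \sim \frac{L^2}{V} \tag{6.9}$$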
To a first order, with both channel length and supply voltage proportional to $\lambda$, the process's minimum feature size, a gate's propagation delay is proportional to $\lambda$. Since each multiplier's delay is the sum of gate delays, multiplier delay is also proportional to $\lambda$. Based on these predictions of the effects of the word size, $N$, and process geometry, $\lambda$, on delay, a rough estimate of column compression multiplier delay is

$$\text{Delay}_{\text{CCMultiplier}} \approx k\,\lambda\,\log(N) \tag{6.10}$$

where $k$ is a constant scaling factor.
6.2 Delay Analysis
In this research, delay analysis is performed using two methods. The first method estimates delays based on the intrinsic delays listed in the datasheets for each library cell. The intrinsic delay is the delay through the cell when there is no load on the output. No routing delays are included. The intrinsic delay values are taken at 25 °C, nominal voltage, and typical process.
The second method uses the Common Timing Engine (CTE) of Cadence's Encounter platform. The Common Timing Engine takes as inputs a design's netlist (Verilog), cell library process information, parasitic resistance and capacitance data, and simulation environment parameters such as temperature and voltage. All of the timing analysis is performed at the nominal voltage level, 2.5 V, 1.8 V, 1.2 V, or 1.0 V, for the particular process technology. Temperature is set at 25 °C. Typical process models are used.
6.3 Delay for Multipliers in the 250 nm Process Technology
Following cell placement and route in the 250 nm cell library, parasitic resistances and capacitances were extracted for four Wallace multipliers, four Dadda multipliers, and four Reduced Area multipliers. Tables 6.1, 6.2, and 6.3 report the comparisons of the delay values for the Wallace, Dadda, and Reduced Area multipliers. At each word size, the estimated delays without routing are approximately the same for the three types of multipliers. The variations in the estimated delays at the smaller word sizes are due to delay increases in lookahead logic blocks as they increase widths from 2-wide to 3-wide or 3-wide to 4-wide. Comparing the estimated delays without route to the CTE delays that include route, the delay increase is 18% or less for word sizes equal to or smaller than 32 by 32. Generally, a 20% or less increase in delay due to routing is very reasonable. The impact of larger areas and longer routing is seen more clearly in the 64 by 64 multipliers. For these large multipliers, the delay increases from estimated delays without route to CTE delays with route range from 27% to 46%.
Table 6.1: Delay values for Wallace multipliers in the 250 nm process

Word Size   Estimated Delay        CTE Delay                Change due to
            w/o route (nsec)       w/ parasitics (nsec)     parasitics
8 by 8      4.1                    4.8                      17%
16 by 16    6.2                    6.9                      11%
32 by 32    7.5                    8.6                      12%
64 by 64    8.6                    12.6                     46%

Table 6.2: Delay values for Dadda multipliers in the 250 nm process

Word Size   Estimated Delay        CTE Delay                Change due to
            w/o route (nsec)       w/ parasitics (nsec)     parasitics
8 by 8      4.1                    4.7                      15%
16 by 16    6.4                    6.7                      5%
32 by 32    7.5                    8.5                      13%
64 by 64    8.6                    10.9                     27%

Table 6.3: Delay values for Reduced Area multipliers in the 250 nm process

Word Size   Estimated Delay        CTE Delay                Change due to
            w/o route (nsec)       w/ parasitics (nsec)     parasitics
8 by 8      4.0                    4.7                      18%
16 by 16    6.1                    6.7                      10%
32 by 32    7.5                    8.5                      13%
64 by 64    8.6                    11.4                     32%
There is a 3% or less difference among the CTE delays for the three types of multipliers at the 8 by 8, 16 by 16, and 32 by 32 word sizes. For the 64 by 64 word size, the Wallace multiplier was the slowest and the Dadda multiplier was the fastest. The Wallace multiplier is 15% slower and the Reduced Area multiplier is 6% slower than the Dadda multiplier. Closer inspection of simulation report files for all three multipliers reveals that the timing differences are due to routing load variations. Table 6.4 shows the delays through input buffering (D flip-flops and buffers), partial product generation and reduction, and the final carry propagate adder. The Wallace multiplier had significantly higher loading and slower slew rates for the D flip-flops, the buffers, and four of the ten (3,2) counters in the partial product reduction path. Replacing the more heavily loaded 1X-strength (3,2) counters with 2X-strength (3,2) counters would improve timing in the partial product reduction without impacting area, since the 1X and 2X (3,2) counters share the same footprints. Also, during timing-driven placement, the tools could be allowed to upsize cells when slew rate and timing budgets are not met.

Table 6.4: Critical section delays for 64 by 64 multipliers in 250 nm process

Multiplier section                          Wallace            Dadda              Reduced Area
                                            Multiplier (nsec)  Multiplier (nsec)  Multiplier (nsec)
Input buffering                             1.7                1.4                1.4
Partial product generation and reduction    6.6                5.0                5.7
Final carry propagate adder                 4.3                4.5                4.3
6.4 Delay for Multipliers in the 180 nm Process Technology
Tables 6.5, 6.6, and 6.7 list the delay estimates and CTE timing analysis results for the Wallace, Dadda, and Reduced Area multipliers, respectively, developed in 180 nm process technology. For each word size, the estimated delays without routing are approximately the same for the three types of multipliers. Comparing the estimated delays without route to the CTE delays that include route, the delay increase is 16% or less for word sizes equal to or smaller than 32 by 32. The impact of larger areas and longer routing is seen more clearly in the 64 by 64 multipliers. For these large multipliers, the percentage of increased delay ranges from 24% to 36%.
There is a 4% or less difference among the CTE delays for the three types of multipliers at the 8 by 8, 16 by 16, and 32 by 32 word sizes. For the 64 by 64 word size, the Wallace multiplier was the slowest and the Dadda multiplier was the fastest. The Wallace and Reduced Area multipliers are 8% and 4% slower, respectively, than the Dadda multiplier. Table 6.8 shows the delays through input buffering (D flip-flops and buffers), partial product generation and reduction, and the final carry propagate adder. Closer inspection of simulation report files reveals that the 0.6 nsec timing difference between the Dadda multiplier and the Wallace multiplier is due to slightly higher routing loads along the path of (3,2) counters in the partial product reduction stage.
Table 6.8: Critical section delays for 64 by 64 multipliers in 180 nm process

Multiplier section                          Wallace            Dadda              RA
                                            Multiplier (nsec)  Multiplier (nsec)  Multiplier (nsec)
Input buffering                             1.0                0.9                1.1
Partial product generation and reduction    4.0                3.4                3.6
Final carry propagate adder                 3.0                3.2                3.0
6.5 Delay for Multipliers in the 130 nm Process Technology
Two standard cell libraries were used to design column compression multipliers in the 130 nm process technology. The generic standard cell library is referred to as 130g and the low power library as 130p. Tables 6.9, 6.10, and 6.11 report the delay values for the Wallace multipliers, Dadda multipliers, and Reduced Area multipliers, respectively, developed using the 130g cell library. At each word size, the estimated delays without routing are approximately the same for the three types of multipliers. Comparing the estimated delays without route to the CTE delays that include route, the delay increase was 20% or less for word sizes equal to or smaller than 32 by 32. The impact of larger areas and longer routing is seen more clearly in the 64 by 64 multipliers. For these large multipliers, the percentage of increased delay ranges from 28% to 43%.
Table 6.9: Delay values for Wallace multipliers in the 130g cell library

Word Size   Estimated Delay        CTE Delay                Change due to
            w/o route (nsec)       w/ parasitics (nsec)     parasitics
8 by 8      2.2                    2.6                      18%
16 by 16    3.3                    3.8                      15%
32 by 32    4.0                    4.8                      20%
64 by 64    4.6                    6.6                      43%

Table 6.10: Delay values for Dadda multipliers in the 130g cell library

Word Size   Estimated Delay        CTE Delay                Change due to
            w/o route (nsec)       w/ parasitics (nsec)     parasitics
8 by 8      2.2                    2.6                      18%
16 by 16    3.4                    3.8                      12%
32 by 32    4.0                    4.6                      15%
64 by 64    4.6                    5.9                      28%

Table 6.11: Delay values for Reduced Area multipliers in the 130g cell library

Word Size   Estimated Delay        CTE Delay                Change due to
            w/o route (nsec)       w/ parasitics (nsec)     parasitics
8 by 8      2.2                    2.6                      18%
16 by 16    3.3                    3.8                      15%
32 by 32    4.0                    4.7                      18%
64 by 64    4.6                    6.3                      37%
There is a 4% or less difference among the CTE delays for the three types of multipliers at the 8 by 8, 16 by 16, and 32 by 32 word sizes. For the 64 by 64 word size, the Wallace multiplier was the slowest and the Dadda multiplier was the fastest. The Wallace multiplier is 13% slower and the Reduced Area multiplier is 7% slower than the Dadda multiplier. Table 6.12 shows the delays through input buffering (D flip-flops and buffers), partial product generation and reduction, and the final carry propagate adder. Closer inspection of simulation report files reveals that the 0.7 nsec timing difference between the Dadda multiplier and the Wallace multiplier is due to slightly higher routing loads along the path of (3,2) counters in the partial product reduction stage.

Table 6.12: Critical section delays for 64 by 64 multipliers in 130g cell library

Multiplier section                          Wallace            Dadda              RA
                                            Multiplier (nsec)  Multiplier (nsec)  Multiplier (nsec)
Input buffering                             0.7                0.7                0.8
Partial product generation and reduction    3.4                2.7                3.0
Final carry propagate adder                 2.5                2.5                2.5
Tables 6.13, 6.14, and 6.15 report the delay values for the Wallace multipliers, Dadda multipliers, and Reduced Area multipliers, respectively, developed using the low power standard cell library designated as 130p.
Table 6.13: Delay values for Wallace multipliers in the 130p cell library

Word Size   Estimated Delay        CTE Delay                Change due to
            w/o route (nsec)       w/ parasitics (nsec)     parasitics
8 by 8      2.6                    3.0                      15%
16 by 16    3.8                    4.3                      13%
32 by 32    4.7                    5.5                      17%
64 by 64    5.4                    7.6                      41%

Table 6.14: Delay values for Dadda multipliers in the 130p cell library

Word Size   Estimated Delay           CTE Delay                Change due to
            w/o parasitics (nsec)     w/ parasitics (nsec)     parasitics
8 by 8      2.5                       2.9                      16%
16 by 16    3.9                       4.2                      8%
32 by 32    4.7                       5.4                      15%
64 by 64    5.4                       7.2                      33%

Table 6.15: Delay values for Reduced Area multipliers in the 130p cell library

Word Size   Estimated Delay        CTE Delay                Change due to
            w/o route (nsec)       w/ parasitics (nsec)     parasitics
8 by 8      2.5                    2.9                      14%
16 by 16    3.7                    4.3                      16%
32 by 32    4.7                    5.4                      15%
64 by 64    5.4                    7.2                      33%
At each word size, the estimated delays without routing are approximately the same for the three types of multipliers developed with the 130p library. Comparing the estimated delays without route to the CTE delays that include route, the delay increase was 17% or less for word sizes equal to or smaller than 32 by 32. The impact of larger areas and longer routing is seen more clearly in the 64 by 64 multipliers. For these large multipliers, the percentage of increased delay ranges from 33% to 41%. At each word size, the CTE delays for the three multiplier types in the 130p library are approximately the same. The largest delay difference is only 0.4 nsec (5%) between the 64 by 64 Wallace and Dadda multipliers.
6.6 Delay for Multipliers in the 90 nm Process Technology
Following cell placement and route in the 90 nm cell library, parasitic resistances and capacitances were extracted for Wallace, Dadda, and Reduced Area multipliers. Tables 6.16, 6.17, and 6.18 report the comparisons of the delay values for each multiplier. Generally, a 20% or less increase in delay due to including the routing characteristics is very reasonable. With the exception of the 16 by 16 Dadda multiplier, the delay increases from estimated delays without route to CTE delays with route range from 22% to 50%; the delay increase for the 16 by 16 Dadda multiplier is only 16%. At each word size, the CTE delays for the three multiplier types are approximately the same. The delay differences among the three types of multipliers are less than 4% for each word size.
6.7 Delay Comparisons
Table 6.19 lists all of the back-annotated CTE delay data for the Wallace, Dadda, and Reduced Area multipliers developed in generic standard cell libraries. Overall, for the same word size and process technology, the three multipliers show approximately equal delays. For the 64 by 64 multipliers, the Wallace multipliers are slightly slower than the Dadda and Reduced Area multipliers. Figure 6.1 plots the back-annotated CTE delays for the Dadda multipliers. Inspection of the multipliers' delay data provides early support of the contention that delay is proportional to the logarithm of the word size $N$.

Table 6.19: Back-annotated delays for Wallace, Dadda, and Reduced Area multipliers developed in generic standard cell libraries

Word Size   Process   Wallace (nsec)   Dadda (nsec)   RA Mult (nsec)
8 by 8      250 nm    4.8              4.7            4.7
8 by 8      180 nm    3.2              3.2            3.1
8 by 8      130 nm    2.6              2.6            2.6
8 by 8      90 nm     1.5              1.5            1.5
16 by 16    250 nm    6.9              6.7            6.7
16 by 16    180 nm    4.7              4.5            4.6
16 by 16    130 nm    3.8              3.8            3.8
16 by 16    90 nm     2.2              2.2            2.2
32 by 32    250 nm    8.6              8.5            8.5
32 by 32    180 nm    5.9              5.8            5.8
32 by 32    130 nm    4.8              4.6            4.7
32 by 32    90 nm     2.9              2.8            2.8
64 by 64    250 nm    12.6             10.9           11.4
64 by 64    180 nm    8.0              7.4            7.7
64 by 64    130 nm    6.6              5.9            6.3
64 by 64    90 nm     3.9              3.9            3.8
[Plot omitted: Back-annotated Delay for N by N Dadda Multipliers. Delay (nsec, 0 to 12) versus Word Size, N (0 to 70), with one curve each for Dadda 250nm, Dadda 180nm, Dadda 130g, and Dadda 90nm.]
Figure 6.1: Back-annotated delay for N by N Dadda multipliers
Table 6.20 provides comparison of the delays for all of the multipliers developed in the 130 nm process. The multipliers implemented with the 130p library are 10% to 22% slower than the ones built in the 130g library.
Table 6.20: Back-annotated delays for Wallace, Dadda, and Reduced Area multipliers developed in 130g and 130p cell libraries

Word Size   Cell Library   Wallace (nsec)   Dadda (nsec)   RA Mult (nsec)
8 by 8      130g           2.6              2.6            2.6
8 by 8      130p           3.0              2.9            2.9
16 by 16    130g           3.8              3.8            3.8
16 by 16    130p           4.3              4.2            4.3
32 by 32    130g           4.8              4.6            4.7
32 by 32    130p           5.5              5.4            5.4
64 by 64    130g           6.6              5.9            6.3
64 by 64    130p           7.6              7.2            7.2
Tables 6.21, 6.22, and 6.23 give the normalized back-annotated delay values of the Wallace, Dadda, and Reduced Area multipliers in the 250 nm, 180 nm, 130 nm (generic library), and 90 nm CMOS technologies. The delay for each multiplier is normalized to the delay of the 8 by 8 multiplier in that particular process technology. These normalized delays show that as the operand size doubles the total delay increases by slightly less than 50%. Moreover, the consistency of the normalized delays shows that the multiplier delays are not adversely impacted by routing parasitics in the smaller process geometries as the word size increases. If disproportionate scaling in critical physical parameters or concentration densities had occurred in the 90 nm process technology, one effect would have been a significant imbalance between drive capabilities and routing parasitics.
Table 6.21: Wallace multipliers with back-annotated delays relative to each process's 8 by 8 case

Word Size   Normalized Delay   Normalized Delay   Normalized Delay   Normalized Delay
            250 nm             180 nm             130 nm             90 nm
8 by 8      1.0                1.0                1.0                1.0
16 by 16    1.4                1.5                1.5                1.5
32 by 32    1.8                1.8                1.8                1.9
64 by 64    2.6                2.5                2.5                2.6

Table 6.22: Dadda multipliers with back-annotated delays relative to each process's 8 by 8 case

Word Size   Normalized Delay   Normalized Delay   Normalized Delay   Normalized Delay
            250 nm             180 nm             130 nm             90 nm
8 by 8      1.0                1.0                1.0                1.0
16 by 16    1.4                1.4                1.5                1.5
32 by 32    1.8                1.8                1.8                1.9
64 by 64    2.3                2.3                2.3                2.6

Table 6.23: Reduced Area multipliers with back-annotated delays relative to each process's 8 by 8 case

Word Size   Normalized Delay   Normalized Delay   Normalized Delay   Normalized Delay
            250 nm             180 nm             130 nm             90 nm
8 by 8      1.0                1.0                1.0                1.0
16 by 16    1.4                1.5                1.5                1.5
32 by 32    1.8                1.9                1.8                1.9
64 by 64    2.4                2.5                2.4                2.5
Tables 6.24, 6.25, and 6.26 report the delay ratios of the multipliers in the 250 nm, 180 nm, and 130 nm processes to the 90 nm multiplier in the same word size. A column compression multiplier design that is ported from a 250 nm process to a 90 nm process will be approximately three times faster. Porting from 180 nm to 90 nm yields a multiplier that is approximately twice as fast. Porting from 130 nm to 90 nm improves delay by approximately 40%.

Table 6.24: Back-annotated Wallace multiplier delays relative to 90 nm

Multiplier   250 nm to 90 nm   180 nm to 90 nm   130 nm to 90 nm
8 by 8       3.2               2.1               1.7
16 by 16     3.1               2.1               1.7
32 by 32     3.0               2.0               1.7
64 by 64     3.2               2.0               1.7

Table 6.25: Back-annotated Dadda multiplier delays relative to 90 nm

Multiplier   250 nm to 90 nm   180 nm to 90 nm   130 nm to 90 nm
8 by 8       3.1               2.1               1.7
16 by 16     3.0               2.0               1.7
32 by 32     3.0               2.1               1.6
64 by 64     2.8               1.9               1.5

Table 6.26: Back-annotated Reduced Area multiplier delays relative to 90 nm

Multiplier   250 nm to 90 nm   180 nm to 90 nm   130 nm to 90 nm
8 by 8       3.1               2.1               1.7
16 by 16     3.0               2.1               1.7
32 by 32     3.0               2.1               1.7
64 by 64     3.0               2.0               1.7
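The ratios to 90 nm can be set beside the ratio of the feature sizes themselves, which is what a delay strictly proportional to the minimum feature size would give. The sketch below uses the rounded back-annotated delays of Table 6.19, so its ratios can differ from the tabulated ones in the last digit; the names `DELAYS` and `ratios_to_90nm` are illustrative.

```python
# Back-annotated CTE delays in nsec, transcribed from Table 6.19.
# multiplier -> process node (nm) -> delays for N = 8, 16, 32, 64
DELAYS = {
    "Wallace": {250: [4.8, 6.9, 8.6, 12.6], 180: [3.2, 4.7, 5.9, 8.0],
                130: [2.6, 3.8, 4.8, 6.6], 90: [1.5, 2.2, 2.9, 3.9]},
    "Dadda": {250: [4.7, 6.7, 8.5, 10.9], 180: [3.2, 4.5, 5.8, 7.4],
              130: [2.6, 3.8, 4.6, 5.9], 90: [1.5, 2.2, 2.8, 3.9]},
    "RA": {250: [4.7, 6.7, 8.5, 11.4], 180: [3.1, 4.6, 5.8, 7.7],
           130: [2.6, 3.8, 4.7, 6.3], 90: [1.5, 2.2, 2.8, 3.8]},
}


def ratios_to_90nm(multiplier: str, node_nm: int) -> list:
    """Delay at node_nm divided by the 90 nm delay, per word size."""
    base = DELAYS[multiplier][90]
    return [d / b for d, b in zip(DELAYS[multiplier][node_nm], base)]


if __name__ == "__main__":
    for node in (250, 180, 130):
        ideal = node / 90  # ratio if delay scaled exactly with feature size
        for name in DELAYS:
            text = ", ".join(f"{r:.2f}" for r in ratios_to_90nm(name, node))
            print(f"{name:8s} {node} nm / 90 nm: {text}   (feature-size ratio {ideal:.2f})")
```

The measured ratios sit at or slightly above the feature-size ratios of 2.78, 2.00, and 1.44, most visibly for the 130 nm to 90 nm transition.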
At the beginning of this chapter, a rough estimate of column compression delay was given as $k \lambda \log(N)$, where $\lambda$ is the process's minimum feature size, $N$ is the word size, and $k$ is a constant scaling factor. The delay approximations in Equations 6.11, 6.12, and 6.13 are realized by directly calculating average $k$ values for each type of multiplier, with $\lambda$ in units of nanometers. The generalized delay approximation in Equation 6.14 is produced by averaging all of the calculated $k$ values.

$$\text{Delay}_{\text{Wallace}} \approx 0.0069\,\lambda \log_2(N) \tag{6.11}$$
$$\text{Delay}_{\text{Dadda}} \approx 0.0066\,\lambda \log_2(N) \tag{6.12}$$
$$\text{Delay}_{\text{RA}} \approx 0.0067\,\lambda \log_2(N) \tag{6.13}$$
$$\text{Delay}_{\text{CCMultiplier}} \approx 0.0068\,\lambda \log_2(N) \tag{6.14}$$
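The averaging step can be illustrated directly. The sketch below assumes that each $k$ value is computed as the measured delay divided by $\lambda \log_2(N)$ and that the per-type coefficient is the plain mean over the sixteen entries of Table 6.19 (nsec, with $\lambda$ in nanometers); the names are illustrative.

```python
import math

WORD_SIZES = (8, 16, 32, 64)

# Back-annotated CTE delays in nsec, transcribed from Table 6.19.
# multiplier -> process node (nm) -> delays for N = 8, 16, 32, 64
DELAYS = {
    "Wallace": {250: [4.8, 6.9, 8.6, 12.6], 180: [3.2, 4.7, 5.9, 8.0],
                130: [2.6, 3.8, 4.8, 6.6], 90: [1.5, 2.2, 2.9, 3.9]},
    "Dadda": {250: [4.7, 6.7, 8.5, 10.9], 180: [3.2, 4.5, 5.8, 7.4],
              130: [2.6, 3.8, 4.6, 5.9], 90: [1.5, 2.2, 2.8, 3.9]},
    "RA": {250: [4.7, 6.7, 8.5, 11.4], 180: [3.1, 4.6, 5.8, 7.7],
           130: [2.6, 3.8, 4.7, 6.3], 90: [1.5, 2.2, 2.8, 3.8]},
}


def average_k(multiplier: str) -> float:
    """Mean of delay / (lambda * log2(N)) over all nodes and word sizes."""
    ks = [
        delay / (node * math.log2(n))
        for node, delays in DELAYS[multiplier].items()
        for n, delay in zip(WORD_SIZES, delays)
    ]
    return sum(ks) / len(ks)


if __name__ == "__main__":
    for name in DELAYS:
        print(f"{name}: k = {average_k(name):.4f}")
```

Under these assumptions the per-type averages round to the coefficients of Equations 6.11, 6.12, and 6.13.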
In a few cases, the delay approximations generated using Equations 6.11 through 6.14 are too rough. The differences between the estimated delay values from these equations and the measured delay values can be as large as 24%. This poor approximation of some delay values is due to the increasing delay of the carry-propagate adder as the multiplier's word size increases. Figure 6.2 shows the contributions of the three main design sections to the delays of the 250 nm Dadda multipliers: the partial product generation (PP Gen), which includes input buffering; the partial product reduction; and the carry lookahead adder. There needs to be an additional term that accounts for the delay caused by the input buffering and partial product generation and the carry-lookahead adder.
Multiplier   PP Gen          Reduction       CLA
8 by 8       0.7 ns (14%)    2.1 ns (45%)    1.9 ns (41%)
16 by 16     0.8 ns (12%)    3.1 ns (46%)    2.8 ns (42%)
32 by 32     0.8 ns (10%)    4.4 ns (52%)    3.3 ns (38%)
64 by 64     1.4 ns (13%)    5 ns (46%)      4.5 ns (41%)

Figure 6.2: Delay pie charts for back-annotated Dadda multipliers
Better approximations can be found through the application of a least squares method, solving for delay in the form of

$$\mathrm{Delay} \approx d_1\,\lambda\,\log_2(N) + d_2\,\lambda \tag{6.15}$$

where $\lambda$ is the process's minimum feature size, $N$ is the word size, and $d_1$ and $d_2$ are constant scaling factors. Equations 6.16, 6.17, and 6.18 provide delay approximations for each type of column compression multiplier, with $\lambda$ in units of nanometers.

$$\mathrm{Delay}_{\mathrm{Wallace}} \approx 0.0094\,\lambda\,\log_2(N) - 0.011\,\lambda \tag{6.16}$$

$$\mathrm{Delay}_{\mathrm{Dadda}} \approx 0.0082\,\lambda\,\log_2(N) - 0.0066\,\lambda \tag{6.17}$$

$$\mathrm{Delay}_{\mathrm{RA}} \approx 0.0087\,\lambda\,\log_2(N) - 0.0083\,\lambda \tag{6.18}$$

Equation 6.19 is the general, combined form of the delay approximation for any of the three types of column compression multiplier.

$$\mathrm{Delay}_{\mathrm{CCMultiplier}} \approx 0.0088\,\lambda\,\log_2(N) - 0.0085\,\lambda \tag{6.19}$$
The largest difference between the estimated delay values from these equations and the measured delay values is 14%, when using the generalized delay equation. The best fits are provided by the delay approximations for each specific type of multiplier. The magnitude of the error ranges from 1.9% to 14% in the comparison of the Wallace delay approximations from Equation 6.16 with the measured Wallace multiplier delays. The magnitude of the error ranges from 1.2% to 11% and from 1.1% to 13% in the comparison of approximated delays versus measured delays for Dadda and Reduced Area multipliers, respectively. Figure 6.3 plots the measured back-annotated Dadda delays and the approximated delays using Equation 6.17.
[Plot: delay (nsec) versus word size, N; measured Dadda delays and estimated delays for the 250 nm, 180 nm, 130 nm, and 90 nm processes]

Figure 6.3: Back-annotated Dadda multiplier delays versus estimated delays

Using Equations 6.16, 6.17, 6.18, and 6.19, the delays of column compression multipliers in smaller process geometries can be predicted. Table 6.27 lists the delay predictions for column compression multipliers in a 65 nm process technology. An 8 by 8 multiply is expected to complete in approximately 1.2 nsec, a 16 by 16 in 1.7 nsec, a 32 by 32 in 2.3 nsec, and a 64 by 64 in 2.9 nsec.
Table 6.27: Predicted delays for column compression multipliers in a 65 nm process technology

Word Size   Wallace (nsec)   Dadda (nsec)   General CC Multiplier (nsec)   Reduced Area (nsec)
8 by 8      1.1              1.2            1.2                            1.2
16 by 16    1.7              1.7            1.7                            1.7
32 by 32    2.3              2.2            2.3                            2.3
64 by 64    3.0              2.8            2.8                            2.9
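The 65 nm predictions can be recomputed from Equations 6.16 through 6.19 with a short script. The sketch below makes two assumptions that are not spelled out in the text: the constant term is multiplied by the feature size, and its coefficient is negative. With those assumptions the Wallace and Dadda columns of Table 6.27 are reproduced after rounding to one decimal place. The function and dictionary names are illustrative only.

```python
import math

# (d1, d2) pairs for Equations 6.16-6.19, with lambda in nanometers.
# ASSUMPTION: d2 is negative and multiplies lambda; this reading matches
# the Wallace and Dadda predictions listed in Table 6.27.
DELAY_COEFFS = {
    "Wallace": (0.0094, -0.011),
    "Dadda": (0.0082, -0.0066),
    "Reduced Area": (0.0087, -0.0083),
    "General": (0.0088, -0.0085),
}


def estimate_delay_ns(kind, feature_nm, word_size):
    """Two-term estimate d1*lambda*log2(N) + d2*lambda, in nsec."""
    d1, d2 = DELAY_COEFFS[kind]
    return d1 * feature_nm * math.log2(word_size) + d2 * feature_nm


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        row = [f"{estimate_delay_ns(kind, 65, n):.1f}" for kind in DELAY_COEFFS]
        print(f"{n} by {n}: " + "  ".join(row))
```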
6.8 Delay Summary

Timing analysis using Cadence's Common Timing Engine has provided significant insight into the delay characteristics of column compression multipliers. For a given process technology, the delays of Wallace, Dadda, and Reduced Area multipliers are approximately equal for 32 by 32 word sizes and smaller; the delay differences amongst the multipliers are at most 5%. For the 64 by 64 multipliers, the Wallace multipliers were slower than the Dadda and Reduced Area multipliers; the delay differences are 8% and higher. For 32 by 32 and smaller multipliers, these delay-related findings mean that an IC architect or designer can use other information, such as area or power consumption, to select amongst the three types of column compression multipliers. These findings also support the usage of automated design and layout tools since no multiplier showed parasitic capacitances that were significantly detrimental to delay.

The delay data confirm that column compression delay is proportional to the logarithm of the word size. Delay can be very roughly estimated by $k\lambda\log(N)$, where $k$ is a constant scaling factor, $\lambda$ is the minimum feature size, and $N$ is the word size. This research has shown that a better approximation to delay includes an additional linear term in $\lambda$. This additional term is needed because, while $\log(N)$ correctly represents the delay growth in the reduction stages, it is not sufficient to completely approximate the increasing delay of the input buffering, the partial product generation, and the final carry-propagate adder.
Chapter 7 Multiplier Power Consumption

For today's consumer and industrial product markets, the power consumption of IC components is a critical concern. Portable, battery-operated devices require conscientious power reduction techniques for all sub-components. Even products that utilize a supply cord are manufactured in compact form factors, requiring that the heat generation be minimized. With multiplication among the most common arithmetic operations performed for signal processing, it is important to examine the power characteristics of column compression multipliers across various operand word sizes and process technologies.

This chapter presents the results of multiplier simulations using Cadence's multipurpose, hierarchical simulator, Virtuoso UltraSim. The designs of each multiplier were placed and routed in the standard cell libraries of three CMOS process technologies: 1) 250 nm, 2.5 V, 2) 180 nm, 1.8 V, and 3) 130 nm, 1.2 V. For the 130 nm process technology, a generic standard cell library and a low power standard cell library are used to build multipliers. The average power consumption by Wallace, Dadda, and Reduced Area multipliers is examined with the back-annotation of parasitic resistances and capacitances extracted from the layouts. Before the actual power values are reported, a simple analysis is provided to predict the trends in multiplier power consumption as input word sizes and process technologies are changed.
7.1 Power Estimation

Unfortunately, in the literature there are no equations or even decent heuristics for calculating the average power consumption of column compression multipliers. Researchers have tried to offer relative measures for power characteristics, examining nodal toggle counts, attempting to reduce spurious transitions, and offering probabilistic analysis of switching activity. These power estimation attempts frequently fall short of giving realistic results. Dynamic power consumption in CMOS is described by

$$\mathrm{Power} = C\,V^2 f \tag{7.1}$$

where $C$ is capacitance, $V$ is supply voltage, and $f$ is operating frequency. Power for column compression multipliers can be expressed as

$$\mathrm{Power}_{\mathrm{CCMultiplier}} \propto \frac{\mathrm{Area}_{\mathrm{CCMultiplier}}}{t_{ox}}\;V^2\;\frac{1}{\max(\mathrm{Delay}_{\mathrm{CCMultiplier}})} \tag{7.2}$$

where $\mathrm{Area}_{\mathrm{CCMultiplier}}$ is the total layout area, $t_{ox}$ is the gate oxide thickness, and $\max(\mathrm{Delay}_{\mathrm{CCMultiplier}})$ is the maximum delay for the multiplier's word size. As discussed in Chapter 5, the total area of an N by N column compression multiplier is expected to be approximately equal to $k\lambda^2N^2$, where $k$ is a constant scaling factor and $\lambda$ is the minimum process geometry. To a first order, gate oxide thickness, supply voltage, and multiplier delay are proportional to $\lambda$. Therefore, the expression for power becomes

$$\mathrm{Power}_{\mathrm{CCMultiplier}} \sim \frac{k\lambda^2N^2}{\lambda}\;\lambda^2\;\frac{1}{\lambda} \tag{7.3}$$

Simplifying Equation 7.3, column compression multiplier power can be approximated by

$$\mathrm{Power}_{\mathrm{CCMultiplier}} \sim k\lambda^2N^2 \tag{7.4}$$
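The scaling behavior implied by Equation 7.4 can be checked numerically. The sketch below evaluates the first-order trend, read here as k times the square of the feature size times the square of the word size, for two cases: doubling the word size at a fixed node, and shrinking the feature size by a factor of 1/sqrt(2) at a fixed word size. The value of k is arbitrary because only ratios are compared, and the 180 nm starting point is just an example.

```python
import math


def relative_power(feature_nm, word_size, k=1.0):
    """First-order trend k * lambda^2 * N^2; k is arbitrary for ratios."""
    return k * feature_nm**2 * word_size**2


# Doubling the word size at a fixed process node.
word_size_ratio = relative_power(180, 32) / relative_power(180, 16)

# Shrinking the feature size by 1/sqrt(2) at a fixed word size.
shrink_ratio = relative_power(180 / math.sqrt(2), 32) / relative_power(180, 32)

print(f"doubling N scales power by {word_size_ratio:.2f}")
print(f"shrinking lambda by 1/sqrt(2) scales power by {shrink_ratio:.2f}")
```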
This approximation for power consumption in terms of $\lambda$ and $N$ indicates that power should scale in the same manner as area for column compression multipliers. Based on the similarities of the area and power approximations, for a given process technology, doubling the operand size is predicted to increase the average power consumption by approximately a factor of four. Also, average power consumption is estimated to decrease by approximately 50% for each generational transition by approximately $1/\sqrt{2}$ in the process minimum feature size. For the word sizes and process technologies used in this research, it is expected that the Wallace, Dadda, and Reduced Area multipliers will consume basically equivalent amounts of power since the complexities and delays are similar for all three. With the largest area and highest gate count, Wallace multipliers are expected to use the most power, but since the Wallace area and gate counts are only slightly bigger than those of the other two multipliers, the total average power consumptions should be very close.

7.2 Power Simulations

In this research, all power simulations were performed using Virtuoso UltraSim. UltraSim takes as inputs a design's netlist, an RC parasitics file in SPEF format, process technology information, simulation environment parameters such as
temperature and voltage, and a vector stimulus file. All simulations were performed at the nominal voltage level (2.5 V, 1.8 V, or 1.2 V) for the particular process technology. The simulation temperature was set at 25 °C. Typical process models were used. For each word size, the same vector stimulus files were applied for power analysis. Vector stimulus files containing randomly selected multiplier input values and the expected multiplication products were used to collect average current and power values. The average power values were determined by multiplying the nominal voltage by the average current. It was not possible to evaluate leakage current using UltraSim and the given standard cell libraries. One vector was applied each clock cycle. The period for each clock cycle was determined by the longest timing delay for a given word size and process technology. For example, in the 250 nm process technology, the worst case delays are 12.6 nsec, 10.9 nsec, and 11.4 nsec for the 64 by 64 Wallace, Dadda, and Reduced Area multipliers, respectively. The clock period for simulating the three 64 by 64 bit multipliers is set at 13 nsec.

For this research, the average power consumption of column compression multipliers is the main focus. Average power consumption is used to determine the duration of battery life for portable consumer and industrial products. Knowing the peak power is important for ensuring that the battery can provide the maximum instantaneous power needed. According to [69, 70, 71], RMS power is directly related to the Joule heating of the circuit, where high RMS current exacerbates electromigration effects and
creates thermal gradients across a chip. Therefore it is important to limit the RMS current density in a design. In order to properly evaluate RMS power for the multipliers, equivalent and extremely long time intervals would have been needed for each of the multipliers during the power simulations. RMS power was not closely examined due to time limitations on the availability of tools, libraries, and process technologies for this research.

7.3 Power for Multipliers in the 250 nm Process Technology

Table 7.1 reports the average power values for four Wallace multipliers, four Dadda multipliers, and four Reduced Area multipliers developed in the 250 nm process technology. All of the power simulations were performed with the back-annotation of parasitic resistances and capacitances. In all examined cases, the Reduced Area multipliers utilized the least power. The Wallace multipliers utilized the most power except at the 32 by 32 word size, where the Dadda multiplier did. As shown in Table 7.2, the Wallace and Dadda multipliers consumed significantly more power than the Reduced Area multipliers, ranging from 5% to 48% more. For each of these column compression multipliers, as the word size doubles, average power consumption increases by approximately a factor of five. Figure 7.1 shows plots of the average power for each of the multipliers.
Table 7.1: Average power for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process

Word Size   Wallace (µW/MHz)   Dadda (µW/MHz)   RA Mult (µW/MHz)
8 by 8      22                 21               20
16 by 16    114                110              102
32 by 32    539                658              442
64 by 64    3255               3034             2776

Table 7.2: Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process

Word Size   Wallace (relative to RA Mult)   Dadda (relative to RA Mult)   RA Mult (µW/MHz)
8 by 8      +10%                            +5%                           20
16 by 16    +12%                            +8%                           102
32 by 32    +22%                            +48%                          442
64 by 64    +17%                            +9%                           2776
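The relative figures in Table 7.2 and the growth per doubling of the word size follow directly from the entries of Table 7.1. The sketch below recomputes both from the rounded table values; the helper names are illustrative. Because the inputs are already rounded, a recomputed percentage can differ from the published one by about a point.

```python
# Average power per MHz for the 250 nm process, copied from Table 7.1
# in word-size order: 8 by 8, 16 by 16, 32 by 32, 64 by 64.
POWER_250NM = {
    "Wallace": (22, 114, 539, 3255),
    "Dadda": (21, 110, 658, 3034),
    "RA": (20, 102, 442, 2776),
}


def percent_over_reference(values, reference):
    """Percent by which each value exceeds the matching reference value."""
    return [100.0 * (v / r - 1.0) for v, r in zip(values, reference)]


def growth_factors(values):
    """Ratio of each entry to the previous one (one doubling of N)."""
    return [b / a for a, b in zip(values, values[1:])]


if __name__ == "__main__":
    for kind in ("Wallace", "Dadda"):
        diffs = percent_over_reference(POWER_250NM[kind], POWER_250NM["RA"])
        print(kind, "vs RA:", [f"{d:+.0f}%" for d in diffs])
    for kind, values in POWER_250NM.items():
        print(kind, "growth:", [f"{g:.1f}" for g in growth_factors(values)])
```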
[Plot: average power (µW/MHz) versus word size, N, for the Wallace, Dadda, and RA Mult multipliers in the 250 nm process]

Figure 7.1: Average power consumption for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process

7.4 Power for Multipliers in the 180 nm Process Technology

Table 7.3 gives the average power values for four Wallace multipliers, four Dadda multipliers, and four Reduced Area multipliers developed in the 180 nm process technology. All of the power simulations were performed with the back-annotation of parasitic resistances and capacitances. As shown in Table 7.4, the power differences amongst the three multipliers range from 3% to 23%. For all of the simulated multipliers, as the word size doubles, power consumption increases by approximately a factor of five. Figure 7.2 displays plots of average power values for each multiplier.
Table 7.3: Average power for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process

Word Size   Wallace (µW/MHz)   Dadda (µW/MHz)   RA Mult (µW/MHz)
8 by 8      6.5                6.3              5.9
16 by 16    36                 33               32
32 by 32    200                211              172
64 by 64    1022               926              1058

Table 7.4: Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process

Word Size   Wallace (relative to RA Mult)   Dadda (relative to RA Mult)   RA Mult (µW/MHz)
8 by 8      +10%                            +7%                           5.9
16 by 16    +13%                            +3%                           32
32 by 32    +16%                            +23%                          172
64 by 64    -3%                             -12%                          1058
[Plot: average power (µW/MHz) versus word size, N, for the Wallace, Dadda, and RA Mult multipliers in the 180 nm process]

Figure 7.2: Average power consumption for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process

7.5 Power for Multipliers in the 130 nm Process Technology

Two standard cell libraries were used to design column compression multipliers in the 130 nm process technology. The generic standard cell library is referred to as 130g. The low power standard cell library is referred to as 130p. Table 7.5 gives the average power values for four Wallace multipliers, four Dadda multipliers, and four Reduced Area multipliers developed in the 130g cell library. All of the power simulations were performed with the back-annotation of parasitic capacitances. In all examined cases, the Reduced Area multipliers utilized the least power and the Wallace multipliers the most. As shown in Table 7.6, the Wallace and Dadda multipliers consumed significantly more power than the Reduced Area multipliers, ranging from 9% to 23% more. For all of the simulated multipliers, as the word size doubles, power consumption increases by approximately a factor of five.

Table 7.5: Average power for Wallace, Dadda, and Reduced Area multipliers in the 130g cell library

Word Size   Wallace (µW/MHz)   Dadda (µW/MHz)   RA Mult (µW/MHz)
8 by 8      1.91               1.86             1.71
16 by 16    10.2               9.9              9.0
32 by 32    46                 43               39
64 by 64    281                270              229

Table 7.6: Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 130g cell library

Word Size   Wallace (relative to RA Mult)   Dadda (relative to RA Mult)   RA Mult (µW/MHz)
8 by 8      +12%                            +9%                           1.71
16 by 16    +13%                            +10%                          9.0
32 by 32    +18%                            +10%                          39
64 by 64    +23%                            +18%                          229
Table 7.7 gives the average power values for four Wallace multipliers, four Dadda multipliers, and four Reduced Area multipliers developed in the low power 130p cell library. As shown in Table 7.8, the power differences amongst the three multipliers range from 4% to 14%. For all of the simulated multipliers, as the word size doubles, power consumption increases by approximately a factor of five.

Table 7.7: Average power for Wallace, Dadda, and Reduced Area multipliers in the 130p cell library

Word Size   Wallace (µW/MHz)   Dadda (µW/MHz)   RA Mult (µW/MHz)
8 by 8      1.47               1.49             1.37
16 by 16    8.3                8.2              7.8
32 by 32    39                 37               34
64 by 64    203                199              212

Table 7.8: Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 130p cell library

Word Size   Wallace (relative to RA Mult)   Dadda (relative to RA Mult)   RA Mult (µW/MHz)
8 by 8      +7%                             +9%                           1.37
16 by 16    +6%                             +5%                           7.8
32 by 32    +14%                            +9%                           34
64 by 64    -4%                             -6%                           212
Using the 130p cell library does reduce power in comparison to the 130g cell library. Table 7.9 shows that the average power values decrease by 7% to 28%. Figure 7.3 shows plots of the average power for the multipliers in the 130g and 130p cell libraries.

Table 7.9: Comparison of average power of a multiplier in the 130g cell library to the respective multiplier in the 130p cell library

Word Size   Wallace % reduction   Dadda % reduction   RA Mult % reduction
8 by 8      23%                   20%                 25%
16 by 16    19%                   17%                 13%
32 by 32    15%                   14%                 13%
64 by 64    28%                   26%                 7%
[Plot: average power (µW/MHz) versus word size, N, for the Wallace, Dadda, and RA multipliers in the 130g and 130p cell libraries]

Figure 7.3: Average power consumption for Wallace, Dadda, and Reduced Area multipliers in 130g and 130p cell libraries

7.6 Power Comparisons

Table 7.10 lists all of the average power data for the Wallace, Dadda, and Reduced Area multipliers developed in this research. Contrary to initial estimations, the average power consumed by the three multipliers is not approximately equal for a given word size and process technology. In this research, a 5% or less difference in average power would be considered approximately equal. For a given word size, average power values differed amongst the three multipliers by as little as 3% and as much as 48%, with a 10% to 20% variation being typical. These differences are more noticeable when operating frequencies greater than 1 MHz are considered. For example, at 200 MHz in the 130g cell library, the average power consumed for 64 by 64 Wallace, Dadda, and Reduced Area multipliers would be 56.2 mW, 54.0 mW, and 45.8 mW, respectively.
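The 200 MHz example is a straightforward scaling of the per-MHz figures. The sketch below assumes the tabulated values are in µW/MHz, which is the unit that makes the 56.2 mW, 54.0 mW, and 45.8 mW figures come out; the function name is illustrative.

```python
def average_power_mw(power_uw_per_mhz, frequency_mhz):
    """Scale a per-MHz average power (assumed uW/MHz) to milliwatts."""
    return power_uw_per_mhz * frequency_mhz / 1000.0


if __name__ == "__main__":
    # 64 by 64 multipliers in the 130g cell library, from Table 7.10.
    for kind, per_mhz in (("Wallace", 281), ("Dadda", 270), ("Reduced Area", 229)):
        print(f"{kind}: {average_power_mw(per_mhz, 200):.1f} mW at 200 MHz")
```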
Table 7.10: Comparison of average power consumption for Wallace, Dadda, and Reduced Area multipliers

Word Size   Process   Wallace (µW/MHz)   Dadda (µW/MHz)   RA Mult (µW/MHz)
8 by 8      250 nm    22                 21               20
8 by 8      180 nm    6.5                6.3              5.9
8 by 8      130g      1.91               1.86             1.71
8 by 8      130p      1.47               1.49             1.37
16 by 16    250 nm    114                110              102
16 by 16    180 nm    36                 33               32
16 by 16    130g      10.2               9.9              9.0
16 by 16    130p      8.3                8.2              7.8
32 by 32    250 nm    539                658              442
32 by 32    180 nm    200                211              172
32 by 32    130g      46                 43               39
32 by 32    130p      39                 37               34
64 by 64    250 nm    3255               3034             2776
64 by 64    180 nm    1022               926              1058
64 by 64    130g      281                270              229
64 by 64    130p      203                199              212
The Reduced Area multiplier is the lowest average power choice for word sizes equal to or smaller than 32 by 32. For the larger multipliers, there is no clear lowest power performer. For the 64 by 64 word size, the Dadda multipliers exhibited the lowest average power in the 180 nm process and when using the low power cell library with the 130 nm process. These inconsistent power profiles among the larger 64 by 64 multipliers are caused by 1) differences in routing parasitics due to the non-uniform nature of placed and routed multiplier designs, and 2) spurious signal transitions in lower reduction stages and the final carry propagate adder as partial product bits are summed.

Across word sizes and standard cell libraries, each timing-driven cell placement and route provides a unique layout solution. The place and route of a 64 by 64 Dadda multiplier is not the place and route of a 32 by 32 Dadda multiplier extended by additional cells. The place and route of a 64 by 64 Dadda multiplier in the 130 nm standard cell library is not identical to the layout of a 64 by 64 Dadda multiplier in the 180 nm standard cell library with physical characteristics scaled down. Timing-driven placement and route constructs an optimized layout of the design's critical path according to user-specified timing constraints, but it does not attempt to optimize non-critical paths that meet basic timing requirements. This means that connected cells can be placed at various distances from each other, resulting in disparate routing parasitics for non-critical paths. The signal toggling along these non-uniform, non-optimized routes helps to create a unique power consumption signature for each automated multiplier layout.

Spurious signal transitions also contribute to higher than anticipated power consumption. For cells with multiple primary inputs, the propagation delays of each primary input to each primary output are not equal. This delay imbalance allows a cell's primary outputs to possibly transition unnecessarily before all of the input stimuli are resolved. These spurious transitions cause additional power to be consumed in the reduction stages as well as in the carry-propagate adder.

As the word size doubles, the average power increases by approximately a factor of five. In Section 7.1, it is predicted that the average power consumed in column compression multipliers would increase by approximately a factor of four. This prediction is based on examining power as a function of capacitance, voltage, and frequency. It does not take into account the multiplier's long combinatorial logic paths, which allow for significant numbers of spurious transitions during the partial product reduction and the addition by the final carry-propagate adder.

Also in Section 7.1, it is predicted that average power consumption should decrease by approximately 50% for each generational transition by approximately $1/\sqrt{2}$ in the process's minimum feature size, $\lambda$. This research finds that average power consumption decreases by approximately a factor of 3.5, or by roughly 70%. In [23], Weste and Eshraghian point out that more rigorous analysis would modify the first order approximations for the scaling of certain MOS device characteristics. They note that when all MOS dimensions, device voltages, and concentration densities are scaled by $1/\sqrt{2}$, dynamic power consumption will decrease by somewhat more than the expected factor of 2.
Examining the power/area ratios in Tables 7.11, 7.12, and 7.13 provides insight into possible high power consumption within a given area. Where low power is of utmost concern, detection of these possible hot spots allows the designer to adjust circuits accordingly before silicon manufacture. From the tables, it is clear that increasing the operand word size increases the power to area ratio. It is also important to note that porting from a process with larger cell features to one with smaller cell features reduces the power to area ratio for a given word size.
Table 7.11: Power/Area for Wallace multipliers

Word Size   Process   Power (µW/MHz)   Area (µm²)   Power/Area [µW/(MHz mm²)]
8 by 8      250 nm    22               14,576       1509
8 by 8      180 nm    6.5              8,400        774
8 by 8      130g      1.91             4,388        444
8 by 8      130p      1.47             3,493        421
16 by 16    250 nm    114              53,321       2138
16 by 16    180 nm    36               30,221       1191
16 by 16    130g      10.2             15,661       651
16 by 16    130p      8.3              12,739       652
32 by 32    250 nm    539              195,713      2754
32 by 32    180 nm    200              109,880      1821
32 by 32    130g      46               56,584       813
32 by 32    130p      39               46,867       832
64 by 64    250 nm    3255             738,385      4408
64 by 64    180 nm    1022             412,456      2478
64 by 64    130g      281              211,551      1328
64 by 64    130p      203              177,318      1144
Table 7.12: Power/Area for Dadda multipliers

Word Size   Process   Power (µW/MHz)   Area (µm²)   Power/Area [µW/(MHz mm²)]
8 by 8      250 nm    21               14,570       1441
8 by 8      180 nm    6.3              8,421        748
8 by 8      130g      1.86             4,428        420
8 by 8      130p      1.49             3,515        424
16 by 16    250 nm    110              51,288       2145
16 by 16    180 nm    33               29,174       1131
16 by 16    130g      9.9              15,161       653
16 by 16    130p      8.2              12,353       664
32 by 32    250 nm    658              186,909      3520
32 by 32    180 nm    211              105,230      2005
32 by 32    130g      43               54,257       793
32 by 32    130p      37               45,104       820
64 by 64    250 nm    3034             709,509      4276
64 by 64    180 nm    926              397,137      2332
64 by 64    130g      270              203,784      1325
64 by 64    130p      199              171,474      1161
Table 7.13: Power/Area for Reduced Area multipliers

Word Size   Process   Power (µW/MHz)   Area (µm²)   Power/Area [µW/(MHz mm²)]
8 by 8      250 nm    20               13,807       1449
8 by 8      180 nm    5.9              7,990        738
8 by 8      130g      1.71             4,181        409
8 by 8      130p      1.37             3,339        410
16 by 16    250 nm    102              50,185       2032
16 by 16    180 nm    32               28,551       1121
16 by 16    130g      9.0              14,811       608
16 by 16    130p      7.8              12,103       661
32 by 32    250 nm    442              185,407      2384
32 by 32    180 nm    172              104,386      1648
32 by 32    130g      39               53,783       725
32 by 32    130p      34               44,766       760
64 by 64    250 nm    2776             707,699      3923
64 by 64    180 nm    1058             396,111      2671
64 by 64    130g      229              203,207      1127
64 by 64    130p      212              171,060      1239
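The power/area column can be recomputed from the other two columns. The sketch below assumes power in µW/MHz and area in square micrometers, and converts the area to square millimeters before dividing; these units are inferred from the magnitudes in the tables rather than stated explicitly, and the function name is illustrative. Several entries, such as 1509 for the 8 by 8 Wallace multiplier in 250 nm, are reproduced exactly, while others differ slightly, presumably because the published ratios were computed from unrounded data.

```python
def power_density(power_uw_per_mhz, area_um2):
    """Power/area in uW/(MHz mm^2) from power in uW/MHz and area in um^2."""
    area_mm2 = area_um2 * 1e-6  # 1 mm^2 = 1e6 um^2
    return power_uw_per_mhz / area_mm2


if __name__ == "__main__":
    # (label, power, area) triples copied from Tables 7.11-7.13.
    samples = [
        ("Wallace 8 by 8, 250 nm", 22, 14576),
        ("Wallace 64 by 64, 250 nm", 3255, 738385),
        ("Dadda 8 by 8, 250 nm", 21, 14570),
        ("RA 8 by 8, 250 nm", 20, 13807),
    ]
    for label, power, area in samples:
        print(f"{label}: {power_density(power, area):.0f}")
```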
In order to find a better approximation of average power consumption in column compression multipliers than the expression $k\lambda^2N^2$ discussed in Section 7.1, this research initially attempted to find a constant scaling factor, $\gamma$, where average power consumption in µW/MHz for column compression multipliers could be estimated by the expression:

$$\mathrm{Power}_{\mathrm{CCMultiplier}} \approx \gamma\,\frac{\mathrm{Area}_{\mathrm{CCMultiplier}}}{t_{ox}}\,V^2 \tag{7.5}$$

With supply voltage and gate oxide thickness proportional to $\lambda$, the expression for average power consumption in µW/MHz becomes

$$\mathrm{Power}_{\mathrm{CCMultiplier}} \approx \gamma\,\frac{\mathrm{Area}_{\mathrm{CCMultiplier}}}{\lambda}\,\lambda^2 \tag{7.6}$$

Substituting the form of the quadratic approximation for area given in Equation 5.1,

$$\mathrm{Power}_{\mathrm{CCMultiplier}} \approx \gamma\,(k_1\lambda^3N^2 + k_2\lambda^3N + k_3\lambda^3) \tag{7.7}$$

where $N$ is the word size and $k_1$, $k_2$, and $k_3$ are constant scaling factors determined using a least squares method for each type of multiplier in Chapter 5. Examination of plots of the collected power data showed that average power consumption was better approximated in terms of $\lambda^4N^2$ rather than $\lambda^3N^2$. Attempts to determine a constant value for $\gamma$ were unsuccessful. Values for $\gamma$ ranged too widely. For example, for Wallace multipliers in the 250 nm process geometry, values of $\gamma$ would be $2.17\times10^{-8}$, $3.18\times10^{-8}$, $4.15\times10^{-8}$, and $6.66\times10^{-8}$, for 8 by 8, 16 by 16, 32 by 32, and 64 by 64 word sizes, respectively. Closer examination of the calculated values revealed that $\gamma$ was growing by factors of roughly $\sqrt{2}$ with each doubling of the word size $N$. Instead of being a constant, $\gamma$ is specified as

$$\gamma = q\left(\sqrt{2}\right)^{(\log_2 N) - 3} \tag{7.8}$$
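where $q$ is a constant scaling factor. Values of $q$ range between $2.0\times10^{-8}$ and $3.0\times10^{-8}$. Using area approximation Equations 5.2, 5.3, and 5.4, the average power for column compression multipliers is estimated by

$$\mathrm{Power}_{\mathrm{Wallace}} \approx 2.38\times10^{-8}\left(\sqrt{2}\right)^{(\log_2 N)-3}\lambda^4\,(0.00283\,N^2 + 0.015\,N - 0.0472) \tag{7.9}$$

$$\mathrm{Power}_{\mathrm{Dadda}} \approx 2.55\times10^{-8}\left(\sqrt{2}\right)^{(\log_2 N)-3}\lambda^4\,(0.00275\,N^2 + 0.0122\,N - 0.0166) \tag{7.10}$$

$$\mathrm{Power}_{\mathrm{RA}} \approx 2.14\times10^{-8}\left(\sqrt{2}\right)^{(\log_2 N)-3}\lambda^4\,(0.00276\,N^2 + 0.0014\,N - 0.0246) \tag{7.11}$$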
Comparing the average power from these equations to the measured power, the error ranges for the Wallace, Dadda, and Reduced Area data sets are 15% to -17%, 19% to -19%, and 20% to -28%, respectively. For the Reduced Area multipliers, the magnitude of the error only exceeds 20% in the case of the 64 by 64 multiplier developed in the 180 nm process technology.

For multipliers developed in process technologies that are smaller than 250 nm, the power approximations can be improved by using the area approximations given by Equations 5.6, 5.7, and 5.8. Estimates for the average power of multipliers in 180 nm and smaller process geometries can be calculated using

$$\mathrm{Power}_{\mathrm{Wallace},\,<180\,\mathrm{nm}} \approx 2.60\times10^{-8}\left(\sqrt{2}\right)^{(\log_2 N)-3}\lambda^4\,(0.00288\,N^2 + 0.0156\,N - 0.0479) \tag{7.12}$$

$$\mathrm{Power}_{\mathrm{Dadda},\,<180\,\mathrm{nm}} \approx 2.57\times10^{-8}\left(\sqrt{2}\right)^{(\log_2 N)-3}\lambda^4\,(0.0028\,N^2 + 0.0128\,N - 0.0169) \tag{7.13}$$

$$\mathrm{Power}_{\mathrm{RA},\,<180\,\mathrm{nm}} \approx 2.53\times10^{-8}\left(\sqrt{2}\right)^{(\log_2 N)-3}\lambda^4\,(0.0028\,N^2 + 0.012\,N - 0.0252) \tag{7.14}$$
Comparing the average power from Equations 7.12, 7.13, and 7.14 to the measured power, the error ranges for the Wallace, Dadda, and Reduced Area data sets for 180 nm and smaller process geometries are 10% to -7%, 13% to -17%, and 19% to -13%, respectively. When the 250 nm data is included in the development of the power
estimation equations, the magnitude of the error is as high as 28% for multipliers in the 180 nm and smaller geometries. Excluding the 250 nm data allows the error for the power consumption approximations to be within 20%. Using Equations 7.12, 7.13, and 7.14, it is possible to predict the average power of each type of column compression multiplier in smaller process geometries. Tables 7.14 and 7.15 list the predicted average power for Wallace, Dadda, and Reduced Area multipliers in 90 nm and 65 nm process technologies.
Table 7.14: Predicted average power for column compression multipliers in a 90 nm process technology

Word Size   Wallace (µW/MHz)   Dadda (µW/MHz)   Reduced Area (µW/MHz)
8 by 8      0.44               0.45             0.41
16 by 16    2.24               2.16             2.07
32 by 32    11.5               11.0             10.7
64 by 64    61.3               58.5             57.3
Table 7.15: Predicted average power for column compression multipliers in a 65 nm process technology

Word Size   Wallace (µW/MHz)   Dadda (µW/MHz)   Reduced Area (µW/MHz)
8 by 8      0.12               0.12             0.11
16 by 16    0.61               0.59             0.56
32 by 32    3.14               2.99             2.91
64 by 64    16.7               15.9             15.6
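The predictions in Tables 7.14 and 7.15 can be recomputed from Equations 7.12, 7.13, and 7.14. The sketch below rests on an assumed reading of those equations: the word-size growth term is taken as sqrt(2) raised to the power (log2(N) - 3), the feature size enters as its fourth power in nanometers, and the constant term of the area polynomial is negative. Under that reading the script agrees with the tabulated 90 nm and 65 nm values to within about 2%. All names in the code are illustrative.

```python
import math

# (q, k1, k2, k3) read from Equations 7.12-7.14 (180 nm and smaller).
# ASSUMPTION: k3 is negative; this matches Tables 7.14 and 7.15.
POWER_COEFFS = {
    "Wallace": (2.60e-8, 0.00288, 0.0156, -0.0479),
    "Dadda": (2.57e-8, 0.0028, 0.0128, -0.0169),
    "Reduced Area": (2.53e-8, 0.0028, 0.012, -0.0252),
}


def estimate_power(kind, feature_nm, word_size):
    """Estimated average power per MHz for an N by N multiplier."""
    q, k1, k2, k3 = POWER_COEFFS[kind]
    # Grows by sqrt(2) per doubling of N and equals 1 at N = 8.
    growth = math.sqrt(2) ** (math.log2(word_size) - 3)
    area_poly = k1 * word_size**2 + k2 * word_size + k3
    return q * growth * feature_nm**4 * area_poly


if __name__ == "__main__":
    for feature_nm in (90, 65):
        print(f"{feature_nm} nm")
        for n in (8, 16, 32, 64):
            row = [f"{estimate_power(kind, feature_nm, n):.2f}" for kind in POWER_COEFFS]
            print(f"  {n} by {n}: " + "  ".join(row))
```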
A generalized power approximation equation for all types of column compression multipliers is not given. The differences amongst the average power data for the three types of multipliers for a given word size and process technology were as little as 3% and as high as 48%. Attempts at developing one generalized power approximation equation would produce an error that significantly exceeds ±20%. For this research, and generally in design engineering practice, equations for approximating area, delay, or power are deemed unacceptable and worthless if the error exceeds ±20%.

7.7 Power Summary

The UltraSim simulations performed in this research provide insight into the power characteristics of Wallace, Dadda, and Reduced Area multipliers. One of the key conclusions from examining forty-eight column compression multipliers is that Wallace, Dadda, and Reduced Area multipliers do not consume equal power. Typically, average power varies by 10% to 20% amongst the multipliers for a given word size. For word sizes 32 by 32 and smaller, Reduced Area multipliers consistently consume the least power. Wallace multipliers usually, but not always, consume the most power.
As the word size doubles, the average power increases by approximately a factor of five. Initial analysis estimated a factor of four increase, not five, but power examinations that do not take into account the flow of signals, including spurious transitions, will underestimate power in these large, multi-staged multipliers.

Average power consumption decreases by approximately a factor of 3.5, or by roughly 70%, for each generational transition by approximately $1/\sqrt{2}$ in the process minimum feature size. This reduction in average power consumption is larger than the factor of 2, or 50%, initially given as a first order approximation. More rigorous examinations of power consumption support larger decreases than the expected 50% as MOS device characteristics are scaled by $1/\sqrt{2}$.
The Wallace, Dadda, and Reduced Area multipliers show very similar power to area ratios for a given word size and process technology. This indicates that these column compression multipliers would exhibit similar hot spot characteristics.

The average power for each of the three types of column compression multipliers has been estimated in terms of the word size, $N$, and the process's minimum feature size, $\lambda$. It is not feasible to create one generalized power approximation equation since the average power varies significantly amongst the types of column compression multipliers for a given word size and process technology.

If the design goal is to quickly develop a low power column compression multiplier, then a Reduced Area multiplier should be implemented using an automated place and route methodology. For a given word size and process technology, the Reduced Area multipliers consumed the least power 14 out of 16 times when compared to Wallace and Dadda multipliers.
Chapter 8 Conclusions
During the past decade, many chip architects and IC designers have insisted that fast multipliers are only realized by custom design and layout, which often takes an engineer three months or more to design, lay out, and verify. Column compression multipliers are dismissed as too time consuming and complex to lay out because of their irregular structure. Unlike the two to three year development periods allotted for large microprocessors, the time to market for application specific ICs is typically three to six months. This research demonstrates that an automated multiplier generation and layout process makes the column compression multiplier a viable option for application specific CMOS products.

In this research, sixty column compression multipliers were designed in order to better understand size and performance characteristics. These multipliers were developed and analyzed using industry standard design practices and tools. The resultant area, delay, and power data provide key insight into how the multipliers perform as the operand word size and the process technology are changed.

The place and route of the multipliers yields extremely compact and regular layouts. All of the multipliers show very high row utilizations at 95%. Across different process technologies, doubling the operand size will increase the total area by approximately a factor of four for each type of multiplier. Area for column compression multipliers decreases by approximately 50% for each generational transition by approximately $1/\sqrt{2}$ in the process minimum feature size.
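This research has shown that the area of an N by N column compression multiplier is best estimated by an approximation that includes both an N² term and an N term. For 180 nm and smaller process geometries, the area of column compression multipliers can be estimated to within ±2% for a given process geometry, λ, using the following type-specific equations or the generalized equation:
\[ \text{Area}_{\text{Wallace},\,\le 180\,\text{nm}} \approx 0.00288\,\lambda^2 N^2 + 0.0156\,\lambda^2 N - 0.0479\,\lambda^2 \tag{8.1} \]
\[ \text{Area}_{\text{Dadda},\,\le 180\,\text{nm}} \approx 0.0028\,\lambda^2 N^2 + 0.0128\,\lambda^2 N - 0.0169\,\lambda^2 \tag{8.2} \]
\[ \text{Area}_{\text{RA},\,\le 180\,\text{nm}} \approx 0.0028\,\lambda^2 N^2 + 0.012\,\lambda^2 N - 0.0252\,\lambda^2 \tag{8.3} \]
\[ \text{Area}_{\text{CCmultiplier},\,\le 180\,\text{nm}} \approx 0.00283\,\lambda^2 N^2 + 0.0134\,\lambda^2 N - 0.03\,\lambda^2 \tag{8.4} \]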
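The area approximations above can be evaluated with a few lines of code. The following Python sketch is illustrative only and is not part of the original tool flow: the function name `estimate_area` and the coefficient table are hypothetical, the constant term is assumed to be subtracted, and the feature size is assumed to enter as λ². The result carries whatever area units the dissertation's data use for the chosen unit of λ.

```python
import math

# Coefficients (a, b, c) for Area ≈ λ² · (a·N² + b·N − c), copied from
# Equations 8.1 to 8.4. The subtraction of c is an assumption of this sketch.
AREA_COEFFS = {
    "wallace": (0.00288, 0.0156, 0.0479),
    "dadda": (0.0028, 0.0128, 0.0169),
    "reduced_area": (0.0028, 0.012, 0.0252),
    "generalized": (0.00283, 0.0134, 0.03),
}


def estimate_area(kind, n, feature_size):
    """Estimate the area of an n-by-n multiplier for feature size λ.

    kind: one of the keys of AREA_COEFFS (hypothetical names).
    n: operand word size in bits.
    feature_size: process minimum feature size λ (180 nm and smaller).
    """
    a, b, c = AREA_COEFFS[kind]
    return feature_size ** 2 * (a * n * n + b * n - c)


if __name__ == "__main__":
    # Compare the three multiplier types at one word size and process node.
    for kind in ("wallace", "dadda", "reduced_area"):
        print(kind, estimate_area(kind, 16, 130))
    # Ratio of areas when the word size doubles at a fixed process node.
    print(estimate_area("wallace", 32, 130) / estimate_area("wallace", 16, 130))
    # Ratio of areas across one generational transition (λ scaled by 1/√2).
    print(estimate_area("dadda", 16, 130 / math.sqrt(2)) / estimate_area("dadda", 16, 130))
```

Because λ² multiplies every term, scaling λ by 1/√2 halves the estimate regardless of word size, which matches the generational trend stated in this chapter.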
The exclusion of the 250 nm data to develop the improved area approximations for designs in 180 nm and smaller process geometries is a valid step. The 250 nm cell library belongs to an architecturally different family of cell libraries, whereas the 180 nm, 130 nm, and 90 nm cell libraries are all part of the same design family. The semiconductor foundry for all of the process technologies and the supplier of all the standard cell libraries are the premier vendors for their respective markets. Other tier 1 and tier 2 merchant semiconductor foundries attempt to clone the process technologies of this semiconductor foundry. Utilizing this latest architectural family of standard cell libraries in conjunction with the process technologies from the premier merchant semiconductor foundry forms the best basis for predicting area for column compression multipliers for 180 nm and smaller geometries.
The delay data of this research challenge the prediction that multiplier delay increases in proportion to the logarithm of the word size. Using this logarithmic relationship alone provides a very rough estimate, with the error often exceeding ±20%. A better approximation for delay includes an additional term. The delay of column compression multipliers can be estimated to within ±14% for a given process geometry, λ, using the following type-specific equations or the generalized equation:
\[ \text{Delay}_{\text{Wallace}} \approx 0.0094\,\lambda \log_2(N) - 0.011\,\lambda \tag{8.5} \]
\[ \text{Delay}_{\text{Dadda}} \approx 0.0082\,\lambda \log_2(N) - 0.0066\,\lambda \tag{8.6} \]
\[ \text{Delay}_{\text{RA}} \approx 0.0087\,\lambda \log_2(N) - 0.0083\,\lambda \tag{8.7} \]
\[ \text{Delay}_{\text{CCmultiplier}} \approx 0.0088\,\lambda \log_2(N) - 0.0085\,\lambda \tag{8.8} \]
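The delay approximations can be sketched the same way. This Python example is a hedged illustration rather than the author's method: the function name `estimate_delay` is hypothetical, and it assumes that λ multiplies both terms and that the second term is subtracted, which the extracted text does not state explicitly. Units follow the dissertation's delay data and are not restated here.

```python
import math

# Coefficients (a, b) for Delay ≈ λ · (a·log2(N) − b), copied from
# Equations 8.5 to 8.8. Applying λ to both terms and subtracting b are
# assumptions of this sketch.
DELAY_COEFFS = {
    "wallace": (0.0094, 0.011),
    "dadda": (0.0082, 0.0066),
    "reduced_area": (0.0087, 0.0083),
    "generalized": (0.0088, 0.0085),
}


def estimate_delay(kind, n, feature_size):
    """Estimate the delay of an n-by-n multiplier for feature size λ."""
    a, b = DELAY_COEFFS[kind]
    return feature_size * (a * math.log2(n) - b)


if __name__ == "__main__":
    # Show how the estimate grows with word size at a fixed process node.
    for n in (8, 16, 32, 64):
        print(n, estimate_delay("generalized", n, 130))
```

Under these assumptions the estimate grows by a fixed increment each time the word size doubles, with the subtracted term acting as the correction to the purely logarithmic model.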
The average power values are very close, but it cannot be said that the three multipliers show approximately equal power consumption for a given word size and process technology. Power consumption in column compression multipliers does increase with an increase in word size. For a given word size and process technology, the Reduced Area multipliers consume the least average power 14 out of 16 times in comparison to Wallace and Dadda multipliers. As the word size doubles, the average power increases by approximately a factor of five. Average power consumption decreases by approximately a factor of 3.5, or roughly 70%, for each generational transition, that is, each scaling of the process minimum feature size by approximately 1/√2.
Using Equations 8.9 and 8.10, the average power of Wallace and Dadda multipliers can be estimated to within ±17% and ±19%, respectively, for a given process geometry. Using Equation 8.11, the average power for Reduced Area multipliers can be estimated to within ±28%. This significantly larger error for the Reduced Area multiplier is due to deriving the power equation using two power data points which seem unusually high. These higher than anticipated measured power values may be due to nonoptimized routing loads and spurious signal transitions.
\[ \text{Power}_{\text{Wallace}} \approx 2.38 \times 10^{-8}\,\lambda^2\,(\log_2 N)^{3/4}\,(0.00283\,N^2 + 0.015\,N - 0.0472) \tag{8.9} \]
\[ \text{Power}_{\text{Dadda}} \approx 2.55 \times 10^{-8}\,\lambda^2\,(\log_2 N)^{3/4}\,(0.00275\,N^2 + 0.0122\,N - 0.0166) \tag{8.10} \]
\[ \text{Power}_{\text{RA}} \approx 2.14 \times 10^{-8}\,\lambda^2\,(\log_2 N)^{3/4}\,(0.00276\,N^2 + 0.0014\,N - 0.0246) \tag{8.11} \]
Excluding 250 nm data from the development of the area equations yields power approximation equations with lower error magnitudes. The average power of Wallace, Dadda, and Reduced Area multipliers can be estimated to within ±10%, ±17%, and ±19%, respectively, for 180 nm and smaller geometries using the following equations:
\[ \text{Power}_{\text{Wallace},\,\le 180\,\text{nm}} \approx 2.60 \times 10^{-8}\,\lambda^2\,(\log_2 N)^{3/4}\,(0.00288\,N^2 + 0.0156\,N - 0.0479) \tag{8.12} \]
\[ \text{Power}_{\text{Dadda},\,\le 180\,\text{nm}} \approx 2.57 \times 10^{-8}\,\lambda^2\,(\log_2 N)^{3/4}\,(0.0028\,N^2 + 0.0128\,N - 0.0169) \tag{8.13} \]
\[ \text{Power}_{\text{RA},\,\le 180\,\text{nm}} \approx 2.53 \times 10^{-8}\,\lambda^2\,(\log_2 N)^{3/4}\,(0.0028\,N^2 + 0.012\,N - 0.0252) \tag{8.14} \]
Column compression multipliers should be used when fast multiplication is needed in CMOS products that are largely developed using automated design, layout, and verification practices. When the IC development schedule is short, these multipliers can be generated, placed and routed, and simulated in a matter of days instead of the two or three months required for a custom implementation. With the development of column compression multipliers automated, chip architects and designers can quickly accommodate changes in the word size or the selection of a different semiconductor foundry. Also, once developed, the multiplier's fully verified netlist can be easily reused for future products.
Based on the analysis of area, delay, and power data summarized herein, select the Reduced Area multiplier for implementation. The Reduced Area multiplier lives up to its name by having the smallest area of the three types of multipliers examined. In most cases, the Reduced Area multiplier consumes the least average power. While achieving the smallest area and the lowest power, the Reduced Area multiplier maintains the same fast delay as the Wallace and Dadda multipliers.
The choice of column compression multiplier type may be affected by architectural requirements on the final carry propagate adder. For example, if the final carry propagate adder is going to be used to sum operands from other arithmetic functions, the word length of the final carry propagate adder may need to be longer than the word length for a Reduced Area multiplier. In this case, the IC designer should consider the Wallace multiplier with its one bit pair longer adder word length or the Dadda multiplier with its S bit pairs longer adder word length, where S is the number of reduction stages.
Finally, this research shows the critical importance of the (3,2) counters. Regardless of which type of column compression multiplier is selected, an IC designer can significantly improve the multiplier's performance by adjusting the (3,2) counter cell. For the 130 nm and smaller process geometries, many standard cell library vendors offer special (3,2) counter cells that have been tailored to be low power or high speed. If time is available for any custom design and layout, then develop a new (3,2) counter that fits the performance goals. Note that for a new, custom (3,2) counter cell to be inserted into a project's set of standard cells, the new cell would require significant simulation time to fully characterize it and create its timing file, as well as the creation of multiple layout views.
Vita
K'Andrea Catherine Bickerstaff was born in Montgomery, Alabama, on May 28, 1967, the daughter of Pressley and Doris Bickerstaff. In 1985, as class valedictorian, she completed the high school curriculum of Saint Jude Educational Institute in Montgomery, Alabama, and entered the Massachusetts Institute of Technology in Cambridge, Massachusetts. During her undergraduate studies, she was employed as a summer intern at Hewlett Packard Company in Santa Rosa, California, NCR Corporation in Liberty, South Carolina, and Polaroid Corporation in Cambridge, Massachusetts. In 1989, she received the degree of Bachelor of Science in Electrical Engineering from the Massachusetts Institute of Technology. Awarded a GEM Fellowship with sponsorship from Polaroid Corporation, she received the Master of Science degree from the University of Texas at Austin in December 1992. From 1993 to 1995, she worked as a Development Engineer in the PCI Components Division at Intel Corporation in Folsom, California. Returning to Austin, Texas, in 1995, she has worked as a Senior Design Engineer at Crystal Semiconductor, a Technical Consultant at Brobeck, Phleger, and Harrison LLP, and a Design Manager at Cirrus Logic. In 2005, she founded KenQuest LLC, offering research, design, and management services. Most recently, as Acting Director of Engineering at Luminary Micro, Inc., she led the engineering team through the development of the first ARM Cortex-M3 based SoC.
Permanent Address: 5216 Crystal Water Drive, Austin, Texas, 78735 This dissertation was typed by the author.
