
Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich

Institut für Integrierte Systeme / Integrated Systems Laboratory

Lecture notes on

Computer Arithmetic:
Principles, Architectures,
and VLSI Design

March 16, 1999

Reto Zimmermann

Integrated Systems Laboratory


Swiss Federal Institute of Technology (ETH)
CH-8092 Zürich, Switzerland
zimmermann@iis.ee.ethz.ch

Copyright © 1999 by Integrated Systems Laboratory, ETH Zürich

http://www.iis.ee.ethz.ch/~zimmi/publications/comp_arith_notes.ps.gz
Contents

1 Introduction and Conventions
  1.1 Outline
  1.2 Motivation
  1.3 Conventions
  1.4 Recursive Function Evaluation
2 Arithmetic Operations
  2.1 Overview
  2.2 Implementation Techniques
3 Number Representations
  3.1 Binary Number Systems (BNS)
  3.2 Gray Numbers
  3.3 Redundant Number Systems
  3.4 Residue Number Systems (RNS)
  3.5 Floating-Point Numbers
  3.6 Logarithmic Number System
  3.7 Antitetrational Number System
  3.8 Composite Arithmetic
  3.9 Round-Off Schemes
4 Addition
  4.1 Overview
  4.2 1-Bit Adders, (m, k)-Counters
  4.3 Carry-Propagate Adders (CPA)
  4.4 Carry-Save Adder (CSA)
  4.5 Multi-Operand Adders
  4.6 Sequential Adders
5 Simple / Addition-Based Operations
  5.1 Complement and Subtraction
  5.2 Increment / Decrement
  5.3 Counting
  5.4 Comparison, Coding, Detection
  5.5 Shift, Extension, Saturation
  5.6 Addition Flags
  5.7 Arithmetic Logic Unit (ALU)
6 Multiplication
  6.1 Multiplication Basics
  6.2 Unsigned Array Multiplier
  6.3 Signed Array Multipliers
  6.4 Booth Recoding
  6.5 Wallace Tree Addition
  6.6 Multiplier Implementations
  6.7 Composition from Smaller Multipliers
  6.8 Squaring
7 Division / Square Root Extraction
  7.1 Division Basics
  7.2 Restoring Division
  7.3 Non-Restoring Division
  7.4 Signed Division
  7.5 SRT Division
  7.6 High-Radix Division
  7.7 Division by Multiplication
  7.8 Remainder / Modulus
  7.9 Divider Implementations
  7.10 Square Root Extraction
8 Elementary Functions
  8.1 Algorithms
  8.2 Integer Exponentiation
  8.3 Integer Logarithm
9 VLSI Design Aspects
  9.1 Design Levels
  9.2 Synthesis
  9.3 VHDL
  9.4 Performance
  9.5 Testability
Bibliography


1 Introduction and Conventions

1.1 Outline

  Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7]
  Circuit architectures and implementations of main arithmetic operations
  Aspects regarding VLSI design of arithmetic units

1.2 Motivation

  Arithmetic units are, among others, the core of every data path and addressing unit
  The data path is the core of:
    microprocessors (CPU)
    signal processors (DSP)
    data-processing application-specific ICs (ASIC) and programmable ICs (e.g. FPGA)
  Standard arithmetic units are available from libraries
  Design of arithmetic units is necessary for:
    non-standard operations
    high-performance components
    library development

1.3 Conventions

Naming conventions

  Signal buses: $A$ (1-D), $A_i$ (2-D), $a_{i:k}$ (subbus, 1-D)
  Signals: $a$, $a_i$ (1-D), $a_{i,k}$ (2-D), $A_{i:k}$ (group signal)
  Circuit complexity measures: $A$ (area), $T$ (cycle time, delay), $AT$ (area-time product), $L$ (latency, # cycles)
  Arithmetic operators: $+$, $-$, $\cdot$, $/$, $\log$ ($= \log_2$)
  Logic operators: $\lor$ (or), $\land$ (and), $\oplus$ (xor), $\odot$ (xnor), $\overline{\phantom{a}}$ (not)

Circuit complexity measures

  Unit-gate model ($\approx$ gate-equivalents (GE) model):
    Inverter, buffer: $A = 0$, $T = 0$ (i.e. ignored)
    Simple monotonic 2-input gates (AND, NAND, OR, NOR): $A = 1$, $T = 1$
    Simple non-monotonic 2-input gates (XOR, XNOR): $A = 2$, $T = 2$
    Complex gates: composed from simple gates
    Simple $m$-input gates: $A = m - 1$, $T = \lceil \log m \rceil$
  Wiring is not considered (acceptable for comparison purposes, local wiring, multilevel metallization)
  Only estimations are given for complex circuits

1.4 Recursive Function Evaluation

Given: inputs $a_i$, outputs $z_i$, function $f$ (graph symbol: •)

Non-recursive functions (n.)

  Output $z_i$ is a function of input $a_i$ (or: $a_i$ and a constant)
    $z_i = f(a_i)$ ; $i = 0, \ldots, n-1$
  parallel structure: $A = O(n)$, $T = O(1)$
  [Figure funn.epsi: inputs a3..a0, one node per bit, outputs z3..z0]

Recursive functions (r.)

  Output $z_i$ is a function of all inputs $a_k$, $k \le i$

  a) with single output $z = z_{n-1}$ (r.s.):
       $t_i = f(a_i, t_{i-1})$ ; $i = 0, \ldots, n-1$ ; $t_{-1} = 0/1$ ; $z = t_{n-1}$

     1. $f$ is non-associative (r.s.n.)
        serial structure: $A = O(n)$, $T = O(n)$
        [Figure funrsn.epsi: serial chain over a3..a0 with output z]

     2. $f$ is associative (r.s.a.)
        serial or single-tree structure: $A = O(n)$, $T = O(\log n)$
        [Figure funrsa.epsi: single tree over a3..a0 with output z]

  b) with multiple outputs $z_i$ (r.m.) (prefix problem):
       $z_i = f(a_i, z_{i-1})$ ; $i = 0, \ldots, n-1$ ; $z_{-1} = 0/1$

     1. $f$ is non-associative (r.m.n.)
        serial structure: $A = O(n)$, $T = O(n)$
        [Figure funrmn.epsi: serial chain over a3..a0 with outputs z3..z0]

     2. $f$ is associative (r.m.a.)
        serial or multi-tree structure: $A = O(n^2)$, $T = O(\log n)$
        [Figure funrma1.epsi: one tree per output z3, z2, z1, z0]
        or shared-tree structure: $A = O(n \log n)$, $T = O(\log n)$
        [Figure funrma2.epsi: trees sharing subtrees, outputs z3..z0]
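The difference between the serial and the tree structures of Section 1.4 can be sketched in a few lines of Python. This is an illustrative model only: the function names `serial_prefix` and `tree_reduce` are made up for this example, and integer addition stands in for an arbitrary associative function $f$.

```python
from operator import add

def serial_prefix(xs, op):
    """Multiple outputs z_i = op(x_i, z_{i-1}): n-1 operations, depth n-1."""
    zs = [xs[0]]
    for x in xs[1:]:
        zs.append(op(x, zs[-1]))
    return zs

def tree_reduce(xs, op):
    """Single output of an associative op via a balanced tree.

    Returns (value, depth); the depth grows like log2(n) instead of n.
    """
    level, depth = list(xs), 0
    while len(level) > 1:
        nxt = [op(level[i + 1], level[i]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element is passed on unchanged
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth

a = [3, 1, 4, 1, 5, 9, 2, 6]
prefix = serial_prefix(a, add)      # all partial results (r.m.)
total, depth = tree_reduce(a, add)  # last result only (r.s.a.)
```

The tree version only works because the operator is associative; for a non-associative $f$ the serial order is the only valid evaluation order.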
2 Arithmetic Operations

2.1 Overview

[Figure arithops.epsi: dependency graph of the arithmetic operations for fixed-point and floating-point numbers; edges mark "based on operation" and "related operation". Nodes: <<, >> ; =, < ; +1, −1 ; +/− ; +, − ; × ; ⁄ ; sqrt(x) ; exp(x) ; log(x) ; trig(x) ; hyp(x). For floating-point numbers the upper-level operations are the same as for fixed-point numbers.]

   1 shift/extension           7 division
   2 comparison                8 square root extraction
   3 increment/decrement       9 exponential function
   4 complement               10 logarithm function
   5 addition/subtraction     11 trigonometric functions
   6 multiplication           12 hyperbolic functions

2.2 Implementation Techniques

Direct implementation of dedicated units:
  always: 1 – 5
  in most cases: 6
  sometimes: 7, 8

Sequential implementation using simpler units and several clock cycles (decomposition):
  sometimes: 6
  in most cases: 7, 8, 9

Table look-up techniques using ROMs:
  universal: simple application to all operations
  efficient only for single-operand operations of high complexity (8 – 12) and small word length (note: ROM size $2^n \times n$)

Approximation techniques using simpler units: 7 – 12
  Taylor series expansion
  polynomial and rational approximations
  convergence of recursive equation systems
  CORDIC (COordinate Rotation DIgital Computer)

3 Number Representations

3.1 Binary Number Systems (BNS)

Radix-2, binary number system (BNS): irredundant, weighted, positional, monotonic [1, 2]

  An $n$-bit number is an ordered sequence of bits (binary digits):
    $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_2$ , $a_i \in \{0, 1\}$
  Simple and efficient implementation in digital circuits
  MSB/LSB (most-/least-significant bit): $a_{n-1}$ / $a_0$
  Represents an integer or fixed-point number, exact
  Fixed-point numbers: $(a_{m-1}, \ldots, a_0 \,.\, a_{-1}, \ldots, a_{-f})$ with $m$-bit integer part and $f$-bit fraction

Unsigned: positive or natural numbers

  Value: $A = \sum_{i=0}^{n-1} a_i 2^i$
  Range: $[0,\; 2^n - 1]$

Two's (2's) complement: standard representation of signed or integer numbers

  Value: $A = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$
  Range: $[-2^{n-1},\; 2^{n-1} - 1]$
  Complement: $-A = 2^n - A = \overline{A} + 1$ , where $\overline{A} = (\overline{a}_{n-1}, \overline{a}_{n-2}, \ldots, \overline{a}_0)$
  Sign: $a_{n-1}$
  Properties: asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)

One's (1's) complement: similar to 2's complement

  Value: $A = -a_{n-1} (2^{n-1} - 1) + \sum_{i=0}^{n-2} a_i 2^i$
  Range: $[-(2^{n-1} - 1),\; 2^{n-1} - 1]$
  Complement: $-A = 2^n - A - 1 = \overline{A}$
  Sign: $a_{n-1}$
  Properties: double representation of zero, symmetric range, modulo $(2^n - 1)$ number system

Sign-magnitude: alternative representation of signed numbers

  Value: $A = (-1)^{a_{n-1}} \sum_{i=0}^{n-2} a_i 2^i$
  Range: $[-(2^{n-1} - 1),\; 2^{n-1} - 1]$
  Complement: $-A = (\overline{a}_{n-1}, a_{n-2}, \ldots, a_0)$
  Sign: $a_{n-1}$
  Properties: double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 (low power)

Graphical representation

[Figure numrep.epsi: mapping of the binary number representations 000...0, 011...1, 100...0, 111...1 onto the number line from $-2^{n-1}$ through 0 and $2^{n-1}$ to $2^n$, for unsigned, 2's complement, 1's complement, and sign-magnitude numbers]

Conventions

  2's complement is used for signed numbers in these notes
  Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned

3.2 Gray Numbers

Gray numbers (code): binary, irredundant, non-weighted, non-monotonic

  + Property: unit-distance coding (i.e. exactly one bit toggles between adjacent numbers)
  Applications: counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)
  – Non-monotonic numbers: difficult arithmetic operations, e.g. addition, comparison (the 2-bit Gray codes in counting order are 00, 01, 11, 10, so code 11 precedes code 10)

  binary → Gray: $g_i = a_{i+1} \oplus a_i$ ; $i = 0, \ldots, n-1$ ; $a_n = 0$ (n.)
  Gray → binary: $a_i = a_{i+1} \oplus g_i$ ; $i = n-1, \ldots, 0$ ; $a_n = 0$ (r.m.a.)

       binary     Gray
   0   0 0 0 0    0 0 0 0
   1   0 0 0 1    0 0 0 1
   2   0 0 1 0    0 0 1 1
   3   0 0 1 1    0 0 1 0
   4   0 1 0 0    0 1 1 0
   5   0 1 0 1    0 1 1 1
   6   0 1 1 0    0 1 0 1
   7   0 1 1 1    0 1 0 0
   8   1 0 0 0    1 1 0 0
   9   1 0 0 1    1 1 0 1
  10   1 0 1 0    1 1 1 1
  11   1 0 1 1    1 1 1 0
  12   1 1 0 0    1 0 1 0
  13   1 1 0 1    1 0 1 1
  14   1 1 1 0    1 0 0 1
  15   1 1 1 1    1 0 0 0
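The two Gray conversions can be checked with a short Python sketch. The function names are illustrative; the binary → Gray direction is non-recursive, while the Gray → binary direction is an XOR prefix running from the MSB downwards, which is why it is classified as r.m.a. above.

```python
def binary_to_gray(a: int) -> int:
    """g_i = a_{i+1} XOR a_i: every output bit depends on two input bits only."""
    return a ^ (a >> 1)

def gray_to_binary(g: int) -> int:
    """a_i = a_{i+1} XOR g_i: XOR of all Gray bits at position i and above."""
    a = 0
    while g:
        a ^= g
        g >>= 1
    return a

codes = [binary_to_gray(i) for i in range(16)]

# Unit-distance property: adjacent codes differ in exactly one bit.
distances = [bin(codes[i] ^ codes[i + 1]).count("1") for i in range(15)]
```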

3.3 Redundant Number Systems

Non-binary, redundant, weighted number systems [1, 2]

  Digit set larger than radix (typically radix 2): multiple representations of the same number, i.e. redundancy
  + No carry-propagation in adders: more efficient implementation of adder-based units (e.g. multipliers and dividers)
  – Redundancy: no direct implementation of relational operators, conversion to irredundant numbers required
  – Several bits used to represent one digit: higher storage requirements
  – Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)

Delayed-carry or half-adder number representation:

  $r_i \in \{0, 1, 2\}$ , $c_i, s_i \in \{0, 1\}$ , $r_i = 2 c_{i+1} + s_i$ , $c_{i+1} s_i = 0$
  $R = \sum_{i=0}^{n-1} r_i 2^i = C + S$
  1 digit holds sum of 2 bits (no carry-out digit)

Carry-save number representation:

  $r_i \in \{0, 1, 2, 3\}$ , $c_i, s_i \in \{0, 1\}$ , $r_i = 2 c_{i+1} + s_i$
  $R = \sum_{i=0}^{n-1} r_i 2^i = C + S$
  1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved)
  standard redundant number system for fast addition

Signed-digit (SD) or redundant digit (RD) number representation:

  $r_i \in \{\overline{1}, 0, 1\}$ (with $\overline{1} = -1$) , $R = \sum_{i=0}^{n-1} r_i 2^i$
  no carry-propagation in addition: 1 digit holds sum of 2 digits (no carry-out digit)
  is redundant, e.g. $7 = 0111 = 1\overline{1}11 = 10\overline{1}1 = 100\overline{1} = 1\overline{1}\,\overline{1}11$
  minimal SD representation: minimal number of non-zero digits, $011 \ldots 110 \to 100 \ldots 0\overline{1}0$ (in the example above, $100\overline{1}$ is minimal)
    applications: sequential multiplication (fewer cycles), filters with constant coefficients (less hardware)
  canonical SD representation: minimal SD + no two non-zero digits in sequence
  SD → binary: carry-propagation necessary (adder)
  other applications: high-speed multipliers [9]
  similar to carry-save, simple use for signed numbers
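As a hedged illustration of the canonical signed-digit form, the following Python sketch recodes an integer so that no two adjacent digits are non-zero. The recoding rule used here (look at the value modulo 4) is the standard non-adjacent-form algorithm and is an assumption of this example, not a method taken from the notes; the function names are made up.

```python
def canonical_sd(value: int) -> list:
    """Digits in {-1, 0, 1}, MSB first, with no two adjacent non-zero digits."""
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)   # +1 if value mod 4 == 1, -1 if value mod 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits[::-1] or [0]

def sd_value(digits: list) -> int:
    """Evaluate a radix-2 signed-digit number given MSB first."""
    v = 0
    for d in digits:
        v = 2 * v + d
    return v

seven = canonical_sd(7)           # corresponds to 1 0 0 -1
```

For the example value 7 the result has two non-zero digits instead of the three of the binary form 0111, which is what saves cycles in sequential multiplication.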
3.4 Residue Number Systems (RNS)

Non-binary, irredundant, non-weighted number system [1]

  + Carry-free and fast additions and multiplications
  – Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection): because digits are not weighted, conversion to a weighted mixed-radix or binary system is required
  Codes for error detection and correction [1]
  Possible applications (but hardly used):
    digital filters: fast additions and multiplications
    error detection and correction for arithmetic operations in conventional and residue number systems

  Base is an $n$-tuple of integers $(m_{n-1}, m_{n-2}, \ldots, m_0)$, residues (or moduli) $m_i$ pairwise relatively prime
  $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_{m_{n-1}, \ldots, m_0}$ , $a_i \in \{0, 1, \ldots, m_i - 1\}$
  Range: $M = \prod_{i=0}^{n-1} m_i$ , anywhere in $\mathbb{Z}$
  $a_i = A \bmod m_i = |A|_{m_i}$

Arithmetic operations (each digit computed separately):

  $z_i = |a_i + b_i|_{m_i}$
  $z_i = |a_i - b_i|_{m_i}$
  $z_i = |a_i \cdot b_i|_{m_i}$
  $|a_i^{-1}|_{m_i} = |a_i^{m_i - 2}|_{m_i}$ (Fermat's theorem)

Best moduli $m_i$ are $2^k$ and $(2^k - 1)$:

  high storage efficiency with $k$ bits
  simple modular addition:
    $2^k$: $k$-bit adder without $c_{out}$
    $2^k - 1$: $k$-bit adder with end-around carry ($c_{in} = c_{out}$)

Example: $(m_1, m_0) = (3, 2)$ , $M = 6$

  A     −4 −3 −2 −1  0  1  2  3  4  5  6  7  8
  a_1    2  0  1  2  0  1  2  0  1  2  0  1  2
  a_0    0  1  0  1  0  1  0  1  0  1  0  1  0

  (any window of 6 consecutive values is a possible range)

  $|4 + 5|_6 = (1, 0) + (2, 1) = (|1 + 2|_3, |0 + 1|_2) = (0, 1) = 3$
  $|4 \cdot 5|_6 = (1, 0) \cdot (2, 1) = (|1 \cdot 2|_3, |0 \cdot 1|_2) = (2, 0) = 2$
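The digit-wise arithmetic of the residue number system can be tried out with the following Python sketch for the base (3, 2). The conversion back to a weighted number uses the Chinese remainder theorem, which is an assumption of this example (the notes only state that a conversion is required); all function names are illustrative. It needs Python 3.8 or later for `math.prod` and the modular inverse via `pow`.

```python
from math import prod

MODULI = (3, 2)   # (m1, m0), pairwise relatively prime, range M = 6

def to_rns(x, moduli=MODULI):
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    # Each digit is computed separately: no carries between digits.
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(digits, moduli=MODULI):
    """Chinese remainder theorem; result lies in the range [0, M)."""
    M = prod(moduli)
    x = 0
    for d, m in zip(digits, moduli):
        Mi = M // m
        x += d * Mi * pow(Mi, -1, m)   # pow(..., -1, m) is the inverse mod m
    return x % M

four, five = to_rns(4), to_rns(5)
total = rns_add(four, five)       # corresponds to |4 + 5|_6
product = rns_mul(four, five)     # corresponds to |4 * 5|_6
```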

3.5 Floating-Point Numbers

Larger range, smaller precision than fixed-point representation, inexact, real numbers [1, 2]

  Double-number form: discontinuous precision

    | S | biased exponent E | unsigned norm. mantissa M |

    $F = (-1)^S \cdot M \cdot 2^{E - \mathrm{bias}}$

  Basic arithmetic operations (with unbiased exponents $e = E - \mathrm{bias}$):
    $A \cdot B$: $M = M_A \cdot M_B$ , $e = e_A + e_B$
    $A \pm B$: $M = M_A \pm M_B \cdot 2^{-(e_A - e_B)}$ , $e = e_A$ (for $e_A \ge e_B$)
    based on fixed-point add, multiply, and shift operations
    postnormalization required (mantissa is brought back into its normalized range)

  Applications:
    processors: "real" floating-point formats (e.g. IEEE standard), large range due to universal use
    ASICs: usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers

  IEEE floating-point format:

    precision   word length   mantissa bits   exponent bits   bias   range                precision
    single      32            23              8               127    $3.8 \cdot 10^{38}$   $10^{-7}$
    double      64            52              11              1023   $9 \cdot 10^{307}$    $10^{-15}$

3.6 Logarithmic Number System

Alternative representation to floating-point (i.e. mantissa + integer exponent is replaced by a fixed-point exponent only) [1]

  Single-number form: continuous precision, hence higher accuracy, more reliable

    | S | biased fixed-point exponent E |

    $L = (-1)^S \cdot 2^{E - \mathrm{bias}}$ (signed-logarithmic)

  Basic arithmetic operations (additionally consider sign):
    multiplication, division: addition, subtraction of exponents
    exponentiation, root extraction: multiplication, division of the exponent
    $A \pm B$: by approximation or by addition in a conventional number system and double conversion
  + Simpler multiplication/exponentiation, more complex addition
  – Expensive conversion: (anti)logarithms (table look-up)
  Applications: real-time digital filters

3.7 Antitetrational Number System

  Tetration (t. $x = 2^{2^{\cdot^{\cdot^{\cdot^{2}}}}}$ , $x$ times) and antitetration (a.t. $x$) [10]
  Larger range, smaller precision than logarithmic representation, otherwise analogous (i.e. $2^x \to$ t. $x$ , $\log x \to$ a.t. $x$)
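The field layout of the IEEE single-precision row in the table above can be made concrete with a small Python sketch. It assumes a normalized number with the usual hidden leading one (mantissa in the range 1 ≤ M < 2), which is a property of the IEEE format rather than something stated in these notes; zero, denormalized numbers, infinities, and NaNs are deliberately ignored, and the function name is illustrative.

```python
import struct

def decode_single(x: float):
    """Split a normalized IEEE single into (S, E, M) and re-evaluate its value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                      # 1 sign bit
    e = (bits >> 23) & 0xFF             # 8-bit biased exponent
    frac = bits & 0x7FFFFF              # 23 stored mantissa bits
    m = 1 + frac / 2**23                # hidden one restored
    value = (-1) ** s * m * 2.0 ** (e - 127)
    return s, e, m, value

s, e, m, value = decode_single(-6.5)    # -6.5 = -1.625 * 2^2
```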
3.8 Composite Arithmetic

Proposal for a new standard of number representations [10]

  Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers
  Secondary forms are used for numbers not representable by primary ones (no over-/underflow handling necessary)
  Choice of number representation is hidden from the user, i.e. software/compiler selects the format for highest accuracy

  Number representations:

                      tag   value
    integer:          00    2's complement integer
    rational:         01    slash | denominator | numerator
    logarithmic:      10    log integer | log fraction
    antitetrational:  11    a.t. integer | a.t. fraction

  Rational numbers: slash position (i.e. size of numerator/denominator) is variable and stored (floating slash)
  Storage form sizes: 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)
  Implementation: mixed hardware/software solutions
  Hardware proposal: long accumulator (4096 bits) holds any floating-point number in fixed-point format, giving higher accuracy at large hardware/software overhead

3.9 Round-Off Schemes

Intermediate results with $d$ additional lower bits (higher accuracy):

  $A = (a_{n-1}, \ldots, a_0, a_{-1}, \ldots, a_{-d})$

Rounding: keeping the error $\varepsilon$ small during final word length reduction:

  $R = (r_{n-1}, \ldots, r_0) = A - \varepsilon$

Trade-off: numerical accuracy vs. implementation cost

Truncation:

  $R = (a_{n-1}, \ldots, a_0)$
  bias $= -\frac{1}{2} + \frac{1}{2^{d+1}}$ (= average error $\varepsilon$)

Round-to-nearest (i.e. normal rounding):

  $R$ is obtained by adding $\frac{1}{2}$ and truncating
  bias $= \frac{1}{2^{d+1}}$ (nearly symmetric)
  "$+ \frac{1}{2}$" can often be included in the previous operation

Round-to-nearest-even/-odd:

  as round-to-nearest, except that a tie (i.e. $a_{-1} \ldots a_{-d} = 10 \ldots 0$) is rounded to the nearest even (or odd) number
  bias $= 0$ (symmetric)
  mandatory in IEEE floating-point standard

3 guard bits for rounding after floating-point operations: guard bit (postnormalization), round bit (round-to-nearest), sticky bit (round-to-nearest-even)
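The bias figures of the three round-off schemes can be reproduced numerically. The Python sketch below models a fixed-point number with `d` extra lower bits as an integer `x` scaled by $2^d$; this encoding and all function names are assumptions made for the example. The bias is measured over all bit patterns of the lower bits for one even and one odd integer part.

```python
def truncate(x: int, d: int) -> int:
    return x >> d                            # drop the d lower bits

def round_nearest(x: int, d: int) -> int:
    return (x + (1 << (d - 1))) >> d         # add 1/2, then truncate

def round_nearest_even(x: int, d: int) -> int:
    q, rem = x >> d, x & ((1 << d) - 1)
    half = 1 << (d - 1)
    if rem > half or (rem == half and q & 1):   # ties go to the even neighbour
        q += 1
    return q

def bias(rounder, d: int) -> float:
    """Average of (rounded - exact), in units of the result LSB."""
    n = 1 << d
    samples = range(2 * n)                   # integer parts 0 (even) and 1 (odd)
    return sum(rounder(x, d) - x / n for x in samples) / (2 * n)

d = 3
b_trunc = bias(truncate, d)                  # -1/2 + 1/2^(d+1)
b_near = bias(round_nearest, d)              # 1/2^(d+1)
b_even = bias(round_nearest_even, d)         # 0
```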

4 Addition

4.1 Overview

[Figure adders.epsi: hierarchy of adder components. 1-bit adders (HA, FA, (m,k), (m,2)) are the basis of the carry-propagate adders (CPA: RCA, CSKA, CSLA, CIA, CLA, PPA, COSA) and of the carry-save adders (CSA); these in turn are the basis of 3-operand adders and of multi-operand adders (array adder, tree adder). Edges mark "based on component" and "related component".]

Legend:
  HA:    half-adder           CPA:  carry-propagate adder    CLA:  carry-lookahead adder
  FA:    full-adder           RCA:  ripple-carry adder       PPA:  parallel-prefix adder
  (m,k): (m,k)-counter        CSKA: carry-skip adder         COSA: conditional-sum adder
  (m,2): (m,2)-compressor     CSLA: carry-select adder
  CIA:   carry-increment adder                               CSA:  carry-save adder

4.2 1-Bit Adders, (m, k)-Counters

  Add up $m$ bits of same magnitude (i.e. 1-bit numbers)
  Output sum as $k$-bit number ($k = \lfloor \log m \rfloor + 1$)
  or: count 1's at inputs, hence (m, k)-counter [3] (combinational counters)

Half-adder (HA), (2, 2)-counter

  $(c_{out}, s) = 2 c_{out} + s = a + b$
  $s = a \oplus b$ (sum)
  $c_{out} = a b$ (carry-out)
  $A = 3$ , $T = 2\,(1)$

  [Figures hasym.epsi, haschema1.epsi, haschema2.epsi: half-adder symbol (reference) and two gate-level schematics with inputs a, b and outputs c_out, s]
Full-adder (FA), (3, 2)-counter

  $(c_{out}, s) = 2 c_{out} + s = a + b + c_{in}$
  $g = a b$ (generate)
  $p = a \oplus b$ (propagate)
  $s = a \oplus b \oplus c_{in} = p \oplus c_{in}$
  $c_{out} = a b \lor a c_{in} \lor b c_{in} = a b \lor (a \oplus b) c_{in} = g \lor p\, c_{in}$
  $A = 7$ , $T = 4\,(2)$

  [Figures fasymbol.epsi, faschematic1.epsi to faschematic5.epsi: full-adder symbol (reference) and gate-level schematics built from two half-adders or from generate/propagate logic, with inputs a, b, c_in and outputs c_out, s]

(m, k)-counters

  [Figure cntsymbol.epsi: (m,k)-counter symbol with inputs a_0 ... a_{m-1} and outputs s_{k-1} ... s_0]

  Usually built from full-adders
  Associativity of addition allows conversion from a linear to a tree structure: faster at the same number of FAs
  $A = 7 (m - k)$

  Example: (7, 3)-counter
    linear structure: $A = 28$ , $T = 14$
    tree structure:   $A = 28$ , $T = 10$

  [Figures count73ser.epsi, count73par.epsi: (7,3)-counter with inputs a0 ... a6 and outputs s2, s1, s0, built from four full-adders in linear structure and in tree structure]
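The 1-bit adders and a tree-structured (7, 3)-counter can be modelled at bit level as follows. This is a behavioural Python sketch with illustrative function names; the particular wiring of the four full-adders is one plausible tree arrangement and is not taken from the figures of the notes.

```python
from itertools import product

def half_adder(a, b):
    return a & b, a ^ b                      # (c_out, s)

def full_adder(a, b, c_in):
    g, p = a & b, a ^ b                      # generate, propagate
    return g | (p & c_in), p ^ c_in          # (c_out, s)

def counter_7_3(bits):
    """(7,3)-counter from four full-adders: counts the ones among 7 inputs."""
    a0, a1, a2, a3, a4, a5, a6 = bits
    c_a, s_a = full_adder(a0, a1, a2)        # first level: two FAs in parallel
    c_b, s_b = full_adder(a3, a4, a5)
    c_c, s0 = full_adder(s_a, s_b, a6)       # bits of weight 1
    s2, s1 = full_adder(c_a, c_b, c_c)       # bits of weight 2
    return s2, s1, s0

s2, s1, s0 = counter_7_3((1, 0, 1, 1, 0, 1, 1))   # five ones at the inputs
```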

4.3 Carry-Propagate Adders (CPA)

  Add two $n$-bit operands $A$ and $B$ and an optional carry-in $c_{in}$ by performing carry-propagation [1, 2, 11]
  Sum $(c_{out}, S)$ is an irredundant $(n+1)$-bit number

  $(c_{out}, S) = c_{out} 2^n + S = A + B + c_{in}$
  $2 c_{i+1} + s_i = a_i + b_i + c_i$ ; $i = 0, 1, \ldots, n-1$ ; $c_0 = c_{in}$ , $c_{out} = c_n$ (r.m.a.)

  [Figure cpasymbol.epsi: CPA symbol with inputs A, B, c_in and outputs c_out, S]

Ripple-carry adder (RCA)

  Serial arrangement of full-adders
  Simplest, smallest, and slowest CPA structure
  $A = 7n$ , $T = 2n$ , $AT = 14 n^2$

  [Figure rca.epsi: chain of full-adders with inputs a_{n-1} b_{n-1} ... a_0 b_0, carries c_in, c_1, c_2, ..., c_{n-1}, c_out, and outputs s_{n-1} ... s_0]

Carry-propagation speed-up techniques

  a) Concatenation of partial CPAs with fast $c_{in} \to c_{out}$

     [Figure speedup1.epsi: partial CPAs for the bit groups n-1:j, i-1:k, k-1:0, connected through the carries c_in, c_k, c_i, c_j, c_out]

  b) Fast carry look-ahead logic for the entire range of bits

     [Figure speedup2.epsi: preprocessing of a_{n-1} b_{n-1} ... a_0 b_0, carry propagation from c_in to c_out, postprocessing to s_{n-1} ... s_0]
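The carry recurrence of the carry-propagate adder, evaluated serially as in the ripple-carry adder, can be sketched in Python as follows. Bit vectors are modelled as LSB-first lists, and the helper names `to_bits` and `from_bits` are assumptions of this example.

```python
def ripple_carry_add(a_bits, b_bits, c_in=0):
    """a_bits, b_bits: LSB-first lists of 0/1. Returns (c_out, s_bits)."""
    c, s_bits = c_in, []
    for a, b in zip(a_bits, b_bits):
        p = a ^ b
        s_bits.append(p ^ c)
        c = (a & b) | (p & c)       # 2*c_{i+1} + s_i = a_i + b_i + c_i
    return c, s_bits

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

c_out, s = ripple_carry_add(to_bits(11, 4), to_bits(6, 4), c_in=1)
result = (c_out << 4) | from_bits(s)    # 11 + 6 + 1
```

The loop body is one full-adder; the carry `c` is the only value handed from one iteration to the next, which is why the delay grows linearly with the word length.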
Carry-skip adder (CSKA)

  Type a): partial CPA with fast $c_k \to c_i$
    $c_i = \overline{P}_{i-1:k}\, c'_i \lor P_{i-1:k}\, c_k$ (bit group $(a_{i-1:k})$)
    $P_{i-1:k} = p_{i-1}\, p_{i-2} \cdots p_k$ (group propagate)
  1) $P_{i-1:k} = 0$ : $c_k \not\to c'_i$ and $c'_i$ selected ($c'_i \to c_i$)
  2) $P_{i-1:k} = 1$ : $c_k \to c'_i$ but $c'_i$ skipped ($c_k \to c_i$)
  The path $c_k \to c'_i \to c_i$ is never sensitized, hence fast $c_k \to c_i$
  This is a false path: inherent logic redundancy causes problems in circuit optimization, timing analysis, and testing
  Variable group sizes (faster): larger groups in the middle (minimize delays $a_0 \to c_k \to s_{n-1}$ and $a_k \to c_i \to s_{n-1}$)
  Partial CPA typically is RCA or CSKA (multilevel CSKA)
  Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)
  $A = 8n$ , $T = 4 n^{1/2}$ , $AT = 32 n^{3/2}$

  [Figure cska.epsi: partial CPAs for the bit groups n-1:j, i-1:k, k-1:0; the carry-out c'_i of each group and its carry-in c_k feed a multiplexer controlled by P_{i-1:k} that delivers c_i]

Carry-select adder (CSLA)

  Type a): partial CPA with fast $c_k \to c_i$ and $c_k \to s_{i-1:k}$
    $s_{i-1:k} = \overline{c}_k\, s^0_{i-1:k} \lor c_k\, s^1_{i-1:k}$
    $c_i = \overline{c}_k\, c^0_i \lor c_k\, c^1_i$
  Two CPAs compute the two possible results ($c_{in} = 0/1$), the group carry-in $c_k$ selects the correct one afterwards
  Variable group sizes (faster): larger groups at end (MSB) (balance delays $a_0 \to c_k$ and $a_k \to c^0_i$)
  Partial CPA typically is RCA, CSLA (multilevel CSLA), or CLA
  High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)
  $A = 14n$ , $T = 2.8 n^{1/2}$ , $AT = 39 n^{3/2}$

  [Figure csla.epsi: two partial CPAs per bit group i-1:k with carry-in 0 and 1, producing (c^0_i, s^0_{i-1:k}) and (c^1_i, s^1_{i-1:k}); multiplexers controlled by c_k select c_i and s_{i-1:k}]
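The selection step of the carry-select adder can be illustrated with a word-level Python sketch. Each partial CPA is modelled by ordinary integer addition (the helper `add_block` is hypothetical), and the group sizes in the example call are an arbitrary choice with larger groups towards the MSB, not optimized sizes from the notes.

```python
def add_block(a, b, c_in, width):
    """Stand-in for a partial CPA of the given width: returns (c_out, sum)."""
    total = a + b + c_in
    return total >> width, total & ((1 << width) - 1)

def carry_select_add(a, b, n, group_sizes, c_in=0):
    assert sum(group_sizes) == n
    result, shift, carry = 0, 0, c_in
    for w in group_sizes:                      # LSB group first
        mask = (1 << w) - 1
        ga, gb = (a >> shift) & mask, (b >> shift) & mask
        c0, s0 = add_block(ga, gb, 0, w)       # result for group carry-in 0
        c1, s1 = add_block(ga, gb, 1, w)       # result for group carry-in 1
        c_out, s = (c1, s1) if carry else (c0, s0)   # MUX controlled by c_k
        result |= s << shift
        shift, carry = shift + w, c_out
    return carry, result

c_out, s = carry_select_add(200, 100, 8, [1, 2, 2, 3])
```

In hardware both candidate results of every group are available at the same time, so only the multiplexer delay per group lies on the carry path.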

Carry-increment adder (CIA)

  Type a): partial CPA with fast $c_k \to c_i$ and $c_k \to s_{i-1:k}$
    $s_{i-1:k} = s'_{i-1:k} + c_k$
    $c_i = c'_i \lor P_{i-1:k}\, c_k$
    $P_{i-1:k} = p_{i-1}\, p_{i-2} \cdots p_k$ (group propagate)
  Result is incremented after addition, if $c_k = 1$ [12, 11]
  Variable group sizes (faster): larger groups at end (MSB) (balance delays $a_0 \to c_k$ and $a_k \to c'_i$)
  Partial CPA typically is RCA, CIA (multilevel CIA) or CLA
  High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)
  Logic of CPA and incrementer can be merged [11]
  $A = 10n$ , $T = 2.8 n^{1/2}$ , $AT = 28 n^{3/2}$

  [Figure cia.epsi: partial CPAs with carry-in 0 for the bit groups i-1:k and k-1:0, producing c'_i and s'_{i-1:k}; an incrementer (+1) controlled by c_k delivers s_{i-1:k}, and c_i is formed from c'_i, P_{i-1:k}, and c_k]

Example: gate-level schematic of carry-increment adder (CIA)

  only 2 different logic cells (bit-slices): IHA and IFA

  $T_{max}$     4  6  10  12  14  16  18  20  22  24  26  28  ...  38
  group size   1  1   2   3   4   5   6   7   8   9  10  11  ...  16
  $n_{max}$     1  2   4   7  11  16  22  29  37  46  56  67  ...  137

  [Figure ciagate.epsi: bit group i-1...k built from (i-k-1) IFA + IHA with inputs a_{i-1} b_{i-1} ... a_k b_k, carries c_k, c_i, and outputs s_{i-1} ... s_k; the complete adder uses the groups ..., bits 6...4 (2 IFA + IHA), bits 3,2 (IFA + IHA), bit 1 (IHA), bit 0 (IHA) between c_in and c_out]
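The carry-increment principle (add with carry-in 0, then increment the group result if the group carry-in is 1) can be sketched at word level in Python. Integer addition stands in for the partial CPA, and the function name and group sizes are assumptions of this example.

```python
def carry_increment_add(a, b, n, group_sizes, c_in=0):
    assert sum(group_sizes) == n
    result, shift, carry = 0, 0, c_in
    for w in group_sizes:                      # LSB group first
        mask = (1 << w) - 1
        ga, gb = (a >> shift) & mask, (b >> shift) & mask
        raw = ga + gb                          # partial CPA with carry-in 0
        c_raw, s_raw = raw >> w, raw & mask    # c'_i and s'_{i-1:k}
        group_p = int((ga ^ gb) == mask)       # P = p_{i-1} p_{i-2} ... p_k
        s = (s_raw + carry) & mask             # incrementer controlled by c_k
        c_out = c_raw | (group_p & carry)      # c_i = c'_i OR P c_k
        result |= s << shift
        shift, carry = shift + w, c_out
    return carry, result

c_out, s = carry_increment_add(0b01111111, 0b00000001, 8, [1, 1, 2, 4])
```

Compared with the carry-select adder, only one partial CPA per group is needed; the second CPA is replaced by the cheaper incrementer.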
Conditional-sum adder (COSA)

  Type a): optimized multilevel CSLA with $\lceil \log n \rceil$ levels (i.e. double CPAs are merged at higher levels)
  Correct sum bits ($s^0_{i-1:k}$ or $s^1_{i-1:k}$) are (conditionally) selected through $\lceil \log n \rceil$ levels of multiplexers
  Bit groups of size $2^l$ at level $l$
  Higher parallelism, more balanced signal paths
  Highest speed-up at highest hardware overhead (2 RCA + more than $\log n$ MUX/bit)
  $A = 3 n \log n$ , $T = 2 \log n$ , $AT = 6 n \log^2 n$

  [Figure cosa.epsi: 4-bit conditional-sum adder with inputs a3 b3 ... a0 b0; level 0 consists of pairs of full-adders with carry-in 0 and 1, levels 1 and 2 consist of multiplexers; outputs c_out, s3 ... s0]

Carry-lookahead adder (CLA), traditional

  Type b): carries looked ahead before sum bits computed
  Typically 4-bit blocks used (e.g. standard IC SN74181)

  Carry-lookahead block (CLB):
    $c_0 = c'_0$
    $c_1 = g_0 \lor p_0 c_0$
    $c_2 = g_1 \lor p_1 g_0 \lor p_1 p_0 c_0$
    $c_3 = g_2 \lor p_2 g_1 \lor p_2 p_1 g_0 \lor p_2 p_1 p_0 c_0$
    $g'_3 = g_3 \lor p_3 g_2 \lor p_3 p_2 g_1 \lor p_3 p_2 p_1 g_0$
    $p'_3 = p_3 p_2 p_1 p_0$

  [Figure clbsymbol.epsi: CLB symbol with inputs (g3,p3) ... (g0,p0) and c'_0, outputs (g'_3,p'_3) and c3 ... c0]

  Hierarchical arrangement using $\frac{1}{2} \log n$ levels: $(g', p')$ passed up, $c'$ passed down between levels
  High speed-up at medium hardware overhead
  $A = 14n$ , $T = 4 \log n$ , $AT = 56 n \log n$
  + preprocessing: $g_i = a_i b_i$ , $p_i = a_i \oplus b_i$
  + postprocessing: $s_i = p_i \oplus c_i$

  [Figure cla.epsi: 16-bit CLA; four CLBs take (g15,p15) ... (g0,p0) and the carries c'_12, c'_8, c'_4, c'_0 and deliver c15 ... c0 together with (g'_15,p'_15), (g'_11,p'_11), (g'_7,p'_7), (g'_3,p'_3) to a second-level CLB with c_in and c_out]
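The equations of the 4-bit carry-lookahead block translate directly into code. The Python sketch below wraps one such block with the pre- and postprocessing into a 4-bit adder; the function names are illustrative, and deriving the carry-out from the group signals as `G | P & c_in` is an assumption of this example rather than a detail given in the notes.

```python
def clb4(g, p, c0):
    """4-bit carry-lookahead block; g, p are LSB-first lists of four bits."""
    c1 = g[0] | p[0] & c0
    c2 = g[1] | p[1] & g[0] | p[1] & p[0] & c0
    c3 = g[2] | p[2] & g[1] | p[2] & p[1] & g[0] | p[2] & p[1] & p[0] & c0
    G = g[3] | p[3] & g[2] | p[3] & p[2] & g[1] | p[3] & p[2] & p[1] & g[0]
    P = p[3] & p[2] & p[1] & p[0]
    return [c0, c1, c2, c3], G, P

def cla4_add(a, b, c_in=0):
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]        # preprocessing
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    carries, G, P = clb4(g, p, c_in)                       # carry look-ahead
    s = sum((p[i] ^ carries[i]) << i for i in range(4))    # postprocessing
    c_out = G | P & c_in
    return c_out, s

c_out, s = cla4_add(9, 7, 1)    # 9 + 7 + 1 = 17
```

All carries are two-level AND-OR functions of the inputs, so none of them waits for another carry inside the block.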

Parallel-prefix adders (PPA)

  Type b): universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA)
  Preprocessing, carry-lookahead, and postprocessing step
  Carries calculated using parallel-prefix algorithms
  + High regularity: suitable for synthesis and layout
  + High flexibility: special adders, other arithmetic operations, exchangeable prefix algorithms (i.e. speeds)
  + High performance: smallest and fastest adders
  $A = 5n + 3 A_\bullet$ , $T = 4 + 2 T_\bullet$

  preprocessing: $g_i = a_i b_i$ , $p_i = a_i \oplus b_i$
  carry-lookahead: prefix algorithm
  postprocessing: $s_i = p_i \oplus c_i$

  [Figure add.epsi: inputs a_{n-1} b_{n-1} ... a_0 b_0 and c_in; preprocessing delivers (g_{n-1}, p_{n-1}) ... (g_0, p_0); the prefix algorithm delivers the carries c_n ... c_0; postprocessing with p_{n-1} ... p_0 delivers c_out and s_{n-1} ... s_0]

Prefix problem

  Inputs $(x_{n-1}, \ldots, x_0)$ , outputs $(y_{n-1}, \ldots, y_0)$ , associative binary operator $\bullet$ [11, 13]

  $(y_{n-1}, \ldots, y_0) = (x_{n-1} \bullet \cdots \bullet x_0, \; \ldots, \; x_1 \bullet x_0, \; x_0)$
  or
  $y_0 = x_0$ ; $y_i = x_i \bullet y_{i-1}$ ; $i = 1, \ldots, n-1$ (r.m.a.)

  Associativity of $\bullet$ allows tree structures for evaluation:
    $x_3 \bullet (x_2 \bullet (x_1 \bullet x_0)) = (x_3 \bullet x_2) \bullet (x_1 \bullet x_0)$ with $Y^1_{3:2} = x_3 \bullet x_2$ , $Y^1_{1:0} = x_1 \bullet x_0$ , $Y^2_{3:0} = Y^1_{3:2} \bullet Y^1_{1:0}$ , but $y_2$ ?
  Group variables $Y^l_{i:k}$ : covers bits $(x_k, \ldots, x_i)$ at level $l$

  Carry-propagation is a prefix problem: $Y^l_{i:k} = (G^l_{i:k}, P^l_{i:k})$

    $(G^0_{i:i}, P^0_{i:i}) = (g_i, p_i)$
    $(G^l_{i:k}, P^l_{i:k}) = (G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1}) \bullet (G^{l-1}_{j:k}, P^{l-1}_{j:k}) = (G^{l-1}_{i:j+1} \lor P^{l-1}_{i:j+1} G^{l-1}_{j:k}, \; P^{l-1}_{i:j+1} P^{l-1}_{j:k})$ ; $k \le j \le i$
    $c_{i+1} = G^m_{i:0}$ ; $i = 0, \ldots, n-1$ ; $l = 1, \ldots, m$

  Parallel-prefix algorithms [11]:
    multi-tree structures ($A = O(n^2)$ , $T = O(\log n)$)
    sharing subtrees ($A = O(n \log n)$ , $T = O(\log n)$)
    different algorithms trading area vs. delay (influences also from wiring and maximum fan-out $FO_{max}$)
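The three steps of a prefix adder (preprocessing, prefix computation on the generate/propagate pairs, postprocessing) can be sketched in Python. The prefix step is evaluated serially here for brevity; the names `dot` and `prefix_add` are made up, and folding the carry-in into the carries as `G | P & c_in` is an assumption of this example.

```python
from itertools import product

def dot(hi, lo):
    """Prefix operator: (G, P) of a group from its upper and lower part."""
    (g1, p1), (g0, p0) = hi, lo
    return g1 | (p1 & g0), p1 & p0

def prefix_add(a, b, n, c_in=0):
    # Preprocessing: bit generate and propagate signals.
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1)
          for i in range(n)]
    # Carry-lookahead: group signals (G_{i:0}, P_{i:0}) by serial prefix.
    groups = [gp[0]]
    for i in range(1, n):
        groups.append(dot(gp[i], groups[-1]))
    carries = [c_in] + [G | (P & c_in) for G, P in groups]
    # Postprocessing: sum bits from propagate signals and carries.
    s = sum((gp[i][1] ^ carries[i]) << i for i in range(n))
    return carries[n], s

c_out, s = prefix_add(0b1011, 0b0110, 4)    # 11 + 6 = 17

pairs = list(product((0, 1), repeat=2))
associative = all(dot(dot(x, y), z) == dot(x, dot(y, z))
                  for x in pairs for y in pairs for z in pairs)
```

Because `dot` is associative, the serial loop can be replaced by any parallel-prefix structure without changing the result.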
Prefix algorithms

  Algorithms visualized by directed acyclic graphs (DAG) with array structure ($n$ bits $\times$ $m$ levels)
  Graph vertex symbols:
    black node (contains logic for $\bullet$): combines $(G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1})$ and $(G^{l-1}_{j:k}, P^{l-1}_{j:k})$ into $(G^l_{i:k}, P^l_{i:k})$
    white node (contains no logic): passes $(G^{l-1}_{i:k}, P^{l-1}_{i:k})$ on as $(G^l_{i:k}, P^l_{i:k})$
  Performance measures:
    $A_\bullet$ : graph size (number of black nodes)
    $T_\bullet$ : graph depth (number of black nodes on critical path)

Serial-prefix algorithm (RCA)

  $A_\bullet = n - 1$ , $T_\bullet = n - 1$ , $FO_{max} = 2$
  [Figure ser.epsi: 16-bit serial-prefix graph, bit columns 15 ... 0, levels 0 ... 15]

Sklansky parallel-prefix algorithm (PPA-SK)

  Tree-like collection, parallel redistribution of carries
  $A_\bullet = \frac{1}{2} n \log n$ , $T_\bullet = \log n$ , $FO_{max} = \frac{1}{2} n$
  [Figure sk.epsi: 16-bit Sklansky prefix graph, bit columns 15 ... 0, levels 0 ... 4]

Brent-Kung parallel-prefix algorithm (PPA-BK)

  Traditional CLA is PPA-BK with 4-bit groups
  Tree-like redistribution of carries (fan-out tree)
  $A_\bullet = 2n - \log n - 2$ , $T_\bullet = 2 \log n - 2$ , $FO_{max} = \log n$
  [Figure bk.epsi: 16-bit Brent-Kung prefix graph, bit columns 15 ... 0, levels 0 ... 6]

Kogge-Stone parallel-prefix algorithm (PPA-KS)

  very high wiring requirements
  $A_\bullet = n \log n - n + 1$ , $T_\bullet = \log n$ , $FO_{max} = 2$
  [Figure ks.epsi: 16-bit Kogge-Stone prefix graph, bit columns 15 ... 0, levels 0 ... 4]

Carry-increment parallel-prefix algorithm (CIA)

  $A_\bullet = 2n - 1.4 n^{1/2}$ , $T_\bullet = 1.4 n^{1/2}$ , $FO_{max} = 1.4 n^{1/2}$
  [Figure cia.epsi: 16-bit carry-increment prefix graph, bit columns 15 ... 0, levels 0 ... 5]

Mixed serial/parallel-prefix algorithm (RCA + PPA)

  linear size-depth trade-off using parameter $k$ : $0 \le k \le n - 2 \log n + 2$
  $k = 0$ : serial-prefix graph
  $k = n - 2 \log n + 1$ : Brent-Kung parallel-prefix graph
  fills gap between RCA and PPA-BK (i.e. CLA) in steps of single $\bullet$-operations
  $A_\bullet = n - 1 + k$ , $T_\bullet = n - 1 - k$ , $FO_{max} =$ var.
  [Figure var.epsi: 16-bit mixed serial/parallel-prefix graph, bit columns 15 ... 0, levels 0 ... 10]
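The size and depth figures quoted for the Sklansky and Kogge-Stone graphs can be checked by generating the graphs programmatically. The Python sketch below evaluates both algorithms for an arbitrary associative operator and counts the operator applications (black nodes); it is an illustrative model for word lengths that are powers of two, and the function names are assumptions of this example.

```python
def kogge_stone(x, op):
    """Returns (prefix results, number of black nodes, graph depth)."""
    y, n = list(x), len(x)
    nodes = depth = 0
    d = 1
    while d < n:
        # Every position i >= d combines with the result d positions below.
        y = [op(y[i], y[i - d]) if i >= d else y[i] for i in range(n)]
        nodes += n - d
        depth += 1
        d *= 2
    return y, nodes, depth

def sklansky(x, op):
    """Returns (prefix results, number of black nodes, graph depth)."""
    y, n = list(x), len(x)
    nodes = depth = 0
    d = 1
    while d < n:
        for i in range(n):
            if i & d:                       # upper half of a block of size 2d
                j = (i // d) * d - 1        # top position of the lower half
                y[i] = op(y[i], y[j])
                nodes += 1
        depth += 1
        d *= 2
    return y, nodes, depth

# String concatenation is associative but not commutative, so it also
# checks that the operands are combined in the right order.
cat = lambda hi, lo: lo + hi
ks_out, ks_nodes, ks_depth = kogge_stone(list("abcdefghijklmnop"), cat)
sk_out, sk_nodes, sk_depth = sklansky(list("abcdefghijklmnop"), cat)
```

For 16 bits both graphs have depth 4; the Kogge-Stone graph needs 49 black nodes and the Sklansky graph 32, in line with the expressions given for $A_\bullet$ above.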
Example: 4-bit parallel-prefix adder (PPA-SK)

  efficient AND-OR-prefix circuit for the generate and AND-prefix circuit for the propagate signals
  optimization: alternatingly AOI-/OAI- resp. NAND-/NOR-gates (inverting gates are smaller and faster)
  can also be realized using two MUX-prefix circuits

  [Figure askgate.epsi: gate-level schematic of the 4-bit Sklansky parallel-prefix adder with inputs a3 b3 ... a0 b0 and c_in, outputs c_out, P_{n-1:0}, and s3 ... s0]

Prefix adder synthesis

  Local prefix graph transformation:

  [Figures unfact.epsi, fact.epsi: two equivalent 4-bit prefix subgraphs; the depth-decreasing transform converts the first into the second, the size-decreasing transform converts the second back into the first]

  Repeated (local) prefix transformations result in overall minimization of graph depth or size: which sequence?
  Goal: minimal size (area) at given depth (delay)
  Simple algorithm for the sequence of applied transforms:
    Step 1: prefix graph compression (depth minimization): depth-decreasing transforms in right-to-left bottom-up order
    Step 2: prefix graph expansion (size minimization): size-decreasing transforms in left-to-right top-down order, if allowed depth is not exceeded
  Prefix adder synthesis: 1) generate serial-prefix graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prefix logic
  + Generates all previous prefix graphs (except PPA-KS)
  + Universal adder synthesis algorithm: generates area-optimal adders for any given timing constraints [11] (including non-uniform signal arrival times)

4 Addition 4.3 Carry-Propagate Adders (CPA) 4 Addition 4.3 Carry-Propagate Adders (CPA)

Multilevel adders

Multilevel versions of adders of type a) possible (CSKA,
CSLA, and CIA; notation: 2-level CIA = CIA-2L)
+ Delay is $O(n^{1/(m+1)})$ for $m$ levels
– Area increase small for CSKA and CIA,
  high for CSLA (=> COSA)
– Difficult computation of optimal group sizes

Hybrid adders

Arbitrary combinations of speed-up techniques possible
=> hybrid/mixed adder architectures
Often used combinations: CLA and CSLA [14]
– Pure architectures usually perform best (at gate-level)

Transistor-level adders

Influence of logic styles (e.g. dynamic logic,
pass-transistor logic => faster)
+ Efficient transistor-level implementation of ripple-carry
  chains (Manchester chain) [14]
+ Combinations of speed-up techniques make sense
– Much higher design effort
Many efficient implementations exist and have been published

Self-timed adders

Average carry-propagation length: $\log n$
+ RCA is fast in average case ($\tilde{T} = O(\log n)$), slow in worst
  case => suitable for self-timed asynchronous designs [15]
– Completion detection is not trivial

Adder performance comparisons

Standard-cell implementations, 0.8 µm process
[Figure (addperf.ps): area [lambda^2] versus delay [ns], log-log plot for RCA, CSKA-2L, CIA-1L, CIA-2L, PPA-SK, PPA-BK, CLA, and COSA at 8, 16, 32, 64, and 128 bits, with lines of constant AT]


Complexity comparison under the unit-gate model

adder     A               T                AT               opt.(1)  syn.(2)
RCA       7n              2n               14n^2            aaa
CSKA-1L   8n              4n^(1/2)         32n^(3/2)        aat
CSKA-2L   8n              ~n^(1/3) (4)     ~n^(4/3) (4)     -
CSLA-1L   14n             2.8n^(1/2)       39n^(3/2)
CIA-1L    10n             2.8n^(1/2)       28n^(3/2)        att
CIA-2L    10n             3.6n^(1/3)       36n^(4/3)        att
CIA-3L    10n             4.4n^(1/4)       44n^(5/4)
PPA-SK    (3/2) n log n   2 log n          3 n log^2 n      ttt
PPA-BK    10n             4 log n          40 n log n       att
PPA-KS    3 n log n       2 log n          6 n log^2 n
CLA (5)   14n             4 log n          56 n log n       -
COSA      3 n log n       2 log n          6 n log^2 n      -

(1) optimality regarding area and delay:
    aaa : smallest area, longest delay
    aat : small area, medium delay
    att : medium area, short delay
    ttt : large area, shortest delay
    -   : not optimal
(2) obtained from prefix adder synthesis
(3) automatic logic optimization not possible (redundancy)
(4) exact factors not calculated
(5) corresponds to 4-bit PPA-BK

4.4 Carry-Save Adder (CSA)

a) Adds three n-bit operands $A_0$, $A_1$, $A_2$ performing no
   carry-propagation (i.e. carries are saved) [1]
     $(C, S) = A_0 + A_1 + A_2$
     $2 c_{i+1} + s_i = a_{0,i} + a_{1,i} + a_{2,i}$ ; $i = 0, 1, \ldots, n-1$ (n.)
b) Adds one n-bit operand to an n-digit carry-save operand

[Figure (csasymbol.epsi): CSA symbol with inputs A0, A1, A2 and outputs C, S]

– Result is in redundant carry-save format (n digits),
  represented by two n-bit numbers $S$ (sum bits) and
  $C$ (carry bits)
+ Parallel arrangement of full-adders, constant delay
     $A = 7n$, $T = 4$

[Figure (csa.epsi): row of full-adders (FA) with inputs a_{0,i}, a_{1,i}, a_{2,i} (i = n-1, ..., 1, 0) and outputs c_n, s_{n-1}, ..., c_2, s_1, c_1, s_0]

Multi-operand carry-save adders ($m > 3$)
=> adder array (linear arrangement), adder tree (tree arr.)
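
The carry-save principle can be sketched at word level in Python. This is only a behavioural model under the assumption of non-negative integers; the function name `carry_save_add` is illustrative and does not come from the notes. The carry word is returned already shifted by one position, so that C + S equals the sum of the three operands.

```python
def carry_save_add(a0: int, a1: int, a2: int) -> tuple[int, int]:
    """Model of a row of full-adders: returns (C, S) with a0 + a1 + a2 == C + S."""
    s = a0 ^ a1 ^ a2                                # sum bit of every column
    c = ((a0 & a1) | (a0 & a2) | (a1 & a2)) << 1    # majority = carry, weight one position higher
    return c, s


if __name__ == "__main__":
    c, s = carry_save_add(0b1011, 0b0110, 0b1101)
    print(bin(c), bin(s), c + s)
```

No carry ripples between columns: every output bit depends on three input bits only, which is why the delay is independent of the word length.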

4.5 Multi-Operand Adders

Add three or more ($m > 2$) n-bit operands, yield
$(n + \lceil \log m \rceil)$-bit result in irredundant number rep. [1, 2]

Array adders

Realization by array adders (see figures below):
  a) linear arrangement of CPAs
  b) linear arr. of CSAs (adder array) and final CPA
a) and b) differ in bit arrival times at final CPA:
  => if CPA = RCA: a) and b) have same overall delay
  => if fast final CPA: uniform bit arrival times required
     => CSA array (b)
Fast implementation: CSA array + fast final CPA
(note: array of fast CPAs not efficient/necessary)

[Figure (mopadd.epsi): operands A0, A1, A2, A3, ..., A_{m-1} added by a chain of CSAs and a final CPA producing S]

  $A = (m-2) A_{CSA} + A_{CPA}$, $T = (m-2) T_{CSA} + T_{CPA}$
  CPA = RCA: $A = O(mn)$, $T = O(m + n)$
  Fast CPA:  $A = O(mn + n \log n)$, $T = O(m + \log n)$

a) 4-operand CPA (RCA) array:
[Figure (cparray.epsi): three rows of ripple-carry adders (FA ... FA HA) adding a_{0,i}, a_{1,i}, a_{2,i}, a_{3,i}; outputs s_n, s_{n-1}, ..., s_2, s_1, s_0]

b) 4-operand CSA array with final CPA (RCA):
[Figure (csarray.epsi): two CSA rows of full-adders followed by a ripple-carry CPA (FA ... FA HA); outputs s_n, s_{n-1}, ..., s_2, s_1, s_0]
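
The structure "CSA array plus final CPA" can be modelled in a few lines of Python. This is a sketch under stated assumptions: operands are non-negative integers, the helper names `csa` and `csa_array_add` are hypothetical, and for simplicity the loop starts from an all-zero carry-save pair, so it uses m CSA rows rather than the m-2 rows of the hardware array.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """One carry-save adder row: (carry word, sum word)."""
    return ((x & y) | (x & z) | (y & z)) << 1, x ^ y ^ z


def csa_array_add(operands: list[int]) -> int:
    """Accumulate all operands in carry-save form, then resolve once."""
    carry, total = 0, 0
    for op in operands:
        carry, total = csa(carry, total, op)   # no carry propagation inside the array
    return carry + total                       # the single final carry-propagate addition


if __name__ == "__main__":
    print(csa_array_add([13, 7, 9, 15]))
```

Only the last line performs a carry-propagate addition, which mirrors why a fast final CPA pays off while fast CPAs inside the array do not.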


(m, 2)-compressors

[Figure (cprsymbol.epsi): (m,2)-compressor symbol with inputs a_0 ... a_{m-1}, carry inputs c_in^0 ... c_in^{m-4}, carry outputs c_out^0 ... c_out^{m-4}, and outputs c, s]

1-bit adders (similar to (m, k)-counters) [16]
Compresses $m$ bits down to 2 by forwarding $(m - 3)$
intermediate carries to next higher bit position
Is bit-slice of multi-operand CSA array (see above)
+ No horizontal carry-propagation (i.e. carry output $k$ does
  not depend on carry inputs $k$ and above)
Built from full-adders (= (3, 2)-compressor) or
(4, 2)-compressors arranged in linear or tree structures

Example: 4-operand adder using (4, 2)-compressors
[Figure (cpradd.epsi): row of (4,2)-compressors forming the CSA stage for inputs a_{0,i} ... a_{3,i}, followed by a ripple-carry CPA (FA FA FA HA); outputs s_{n+1}, s_n, s_{n-1}, ..., s_2, s_1, s_0]

(4, 2)-compressor with full-adders:
  $A = 14$, $T = 8$
[Figure (cpr42fa.epsi): two cascaded full-adders with inputs a0, a1, a2, a3, c_in and outputs c_out, c, s]

Optimized (4, 2)-compressor:
2 full-adders merged and optimized (i.e. XORs
arranged in tree structure)
  $A = 14$, $T = 6$
[Figure (cpr42opt.epsi): optimized (4,2)-compressor with inputs a0, a1, a2, a3, c_in and outputs c_out, c, s]
+ same area, 25% shorter delay
SD-FA (signed-digit full-adder) is similar to
(4, 2)-compressor regarding structure and complexity
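
A bit-slice of the full-adder based (4, 2)-compressor can be checked exhaustively with a small Python model. The function names are illustrative assumptions; the wiring (first full-adder on a0, a1, a2, second on its sum, a3, and c_in) is one common arrangement and is assumed here rather than taken from the lost figure.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Returns (carry, sum) of three bits."""
    return (a & b) | (a & c) | (b & c), a ^ b ^ c


def compressor_4_2(a0: int, a1: int, a2: int, a3: int, c_in: int) -> tuple[int, int, int]:
    """Returns (c_out, c, s) with a0+a1+a2+a3+c_in == s + 2*(c + c_out)."""
    c_out, s1 = full_adder(a0, a1, a2)   # c_out does not depend on c_in
    c, s = full_adder(s1, a3, c_in)
    return c_out, c, s


if __name__ == "__main__":
    for bits in product((0, 1), repeat=5):
        c_out, c, s = compressor_4_2(*bits)
        assert sum(bits) == s + 2 * (c + c_out)
    print("all 32 input combinations verified")
```

Because c_out is computed before c_in enters the slice, chaining slices sideways never creates a ripple path.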


Advantages of (4, 2)-compressors over FAs for realizing
(m, 2)-compressors:
  higher compression rate (4:2 instead of 3:2)
  less deep and more regular trees

  tree depth          0  1  2  3  4  5   6   7   8   9  10
  # operands  FA      2  3  4  6  9  13  19  28  42  63  94
              (4,2)   2  4  8  16 32 64  128

Example: (8, 2)-compressor
  full-adder tree:          $A = 42$, $T = 16$
  (4, 2)-compressor tree:   $A = 42$, $T = 12$
[Figure (cpr82fa.epsi): (8,2)-compressor built as a tree of six full-adders, inputs a0 ... a7, carry inputs c_in^0 ... c_in^4, carry outputs c_out^0 ... c_out^4, outputs c, s]
[Figure (cpr82cpr42.epsi): (8,2)-compressor built as a tree of three (4,2)-compressors, same inputs and outputs]

Tree adders (Wallace tree)

Adder tree: n-bit m-operand carry-save adder
composed of n tree-structured (m, 2)-compressors [1, 17]
Tree adders: fastest multi-operand adders using an
adder tree and a fast final CPA
  $A = O(mn + n \log n)$, $T = O(\log m + \log n)$

Adder arrays and adder trees revisited

Some FA can often be replaced by HA or eliminated
(i.e. redundant due to constant inputs)
Number of (irredundant) FA does not depend on adder
structure, but number of HA does
An m-operand adder accommodates $(m - 1)$ carry inputs
Adder trees ($T = O(\log m)$) are faster than adder arrays
($T = O(m)$) at same amount of gates ($A = O(mn)$)
Adder trees are less regular and have more complex
routing than adder arrays
=> larger area, difficult layout
   (i.e. limited use in layout generators)
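
The operand counts in the tree-depth table follow from simple recurrences, sketched here in Python. The function names are hypothetical. The assumption is that one level of full-adders turns every 3 rows into 2 (so a tree of depth d+1 handles floor(3/2 * N_d) operands), and one level of (4, 2)-compressors turns every 4 rows into 2.

```python
def max_operands_fa(depth: int) -> int:
    """Largest operand count a full-adder tree of the given depth reduces to 2 rows."""
    n = 2
    for _ in range(depth):
        n = n * 3 // 2      # every 2 rows at this level can stem from 3 rows above
    return n


def max_operands_42(depth: int) -> int:
    """Same for a tree of (4,2)-compressors: the row count doubles per level."""
    return 2 ** (depth + 1)


if __name__ == "__main__":
    print([max_operands_fa(d) for d in range(11)])
    print([max_operands_42(d) for d in range(7)])
```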

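4.6 Sequential Adders

Bit-serial adder: sequential n-bit adder
[Figure (bitseradd.epsi): one full-adder with inputs a_i, b_i, a carry flip-flop, and output s_i]

Accumulators: sequential m-operand adders
  With CPA
  [Figure (accucpa.epsi): CPA with input A and a result register S fed back to the second input]
  With CSA and final CPA
    Allows higher clock rates
    Final CPA too slow:
    => pipelining or multiple cycles for evaluation
  [Figure (accucsa.epsi): CSA with input A and carry-save registers, followed by a final CPA producing S]
  Mixed CSA/CPA: CSA with partial CPAs (i.e. fewer
  carries saved), trade-off between speed and register size

5 Simple / Addition-Based Operations

5.1 Complement and Subtraction

2's complementer (negation)
  $Z = -A = \overline{A} + 1$
[Figure (neg.epsi): inverter followed by incrementer (+1), input A, output Z]

2's complement subtractor
  $S = A - B = A + \overline{B} + 1$
[Figure (sub.epsi): CPA with inputs A and inverted B, c_in = 1, outputs c_out, S]

2's complement adder/subtractor
  $S = A \pm B = A + (B \oplus sub) + sub$
[Figure (addsub.epsi): CPA with inputs A and B XOR sub, c_in = sub, outputs c_out, S]

1's complement adder
  $S = A + B \pmod{2^n - 1}$, $c_{in} = c_{out}$ (end-around carry)
[Figure (addmod.epsi): CPA with inputs A, B and c_out fed back to c_in, output S]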
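
The subtractor and the end-around-carry adder can be modelled on n-bit words in Python. This is a minimal sketch assuming unsigned n-bit inputs; the function names are illustrative and not part of the notes.

```python
def sub_twos_complement(a: int, b: int, n: int) -> tuple[int, int]:
    """S = A + not(B) + 1 on n bits; returns (S, c_out)."""
    mask = (1 << n) - 1
    total = a + (~b & mask) + 1      # inverted B plus carry-in 1
    return total & mask, total >> n


def add_ones_complement(a: int, b: int, n: int) -> int:
    """Adder with end-around carry: the carry-out is added back at the LSB."""
    mask = (1 << n) - 1
    total = a + b
    return (total & mask) + (total >> n)   # cannot overflow again for n-bit inputs


if __name__ == "__main__":
    print(sub_twos_complement(5, 3, 4))    # difference and carry-out
    print(add_ones_complement(9, 12, 4))
```

In the 1's complement adder both 0 and 2^n - 1 represent zero, so results agree with A + B only modulo 2^n - 1.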
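5.2 Increment / Decrement

Incrementer

Adds a single bit $c_{in}$ to an n-bit operand $A$
  $2^n c_{out} + Z = A + c_{in}$
  $z_i = a_i \oplus c_i$, $c_{i+1} = a_i c_i$ ; $i = 0, \ldots, n-1$ ; $c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)
[Figure (incsymbol.epsi): incrementer symbol (+1) with inputs A, c_in and outputs Z, c_out]
Corresponds to addition with $B = 0$ (=> FA -> HA)

Example: Ripple-carry incrementer using half-adders
  $A = 3n$, $T = n + 1$, $AT \approx 3n^2$
[Figure (incfa.epsi): chain of half-adders (HA) with inputs a_{n-1} ... a_1, a_0, c_in and outputs c_out, z_{n-1} ... z_1, z_0]
or using incrementer slices (= half-adder)
[Figure (inc.epsi): chain of incrementer slices with inputs a_{n-1} ... a_0, c_in and outputs c_out, z_{n-1} ... z_0]

Prefix problem: $P_{i:k} = P_{i:j+1} P_{j:k}$ => AND-prefix struct.
  $A \approx \frac{1}{2} n \log n + 2n$, $T \approx \log n + 2$, $AT \approx \frac{1}{2} n \log^2 n$

Decrementer
  $2^n c_{out} + Z = A - c_{in}$
  $z_i = a_i \oplus c_i$, $c_{i+1} = \overline{a_i} c_i$ ; $i = 0, \ldots, n-1$ ; $c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)
[Figure (dec.epsi): chain of decrementer slices with inputs a_{n-1} ... a_0, c_in and outputs c_out, z_{n-1} ... z_0]

Incrementer-decrementer
  $2^n c_{out} + Z = A \pm c_{in}$
  $z_i = a_i \oplus c_i$, $c_{i+1} = (a_i \oplus dec) c_i$ ; $i = 0, \ldots, n-1$
[Figure (incdec.epsi): chain of slices with inputs a_{n-1} ... a_0, c_in, control dec and outputs c_out, z_{n-1} ... z_0]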
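
The ripple-carry incrementer-decrementer slice can be written out bit by bit. The Python sketch below assumes bit lists with the LSB first; the function name `inc_dec` and the helper names are illustrative only.

```python
def inc_dec(a_bits: list[int], c_in: int, dec: int = 0) -> tuple[list[int], int]:
    """Ripple chain of slices: z_i = a_i XOR c_i, c_{i+1} = (a_i XOR dec) AND c_i."""
    c, z = c_in, []
    for a in a_bits:              # LSB first
        z.append(a ^ c)
        c = (a ^ dec) & c         # carry (inc) or borrow (dec) ripples on
    return z, c


def to_bits(value: int, n: int) -> list[int]:
    return [(value >> i) & 1 for i in range(n)]


def to_int(bits: list[int]) -> int:
    return sum(b << i for i, b in enumerate(bits))


if __name__ == "__main__":
    z, c_out = inc_dec(to_bits(0b0111, 4), c_in=1)
    print(to_int(z), c_out)
```

With dec = 1 the same chain propagates a borrow through the zero bits, which is why one XOR per slice is enough to turn the incrementer into a decrementer.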


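Fast incrementers

4-bit incrementer using multi-input gates:
[Figure (inccg.epsi): inputs a3 ... a0, c_in; multi-input AND gates and XORs; outputs c_out, z3 ... z0]

8-bit parallel-prefix incrementer (Sklansky AND-prefix
structure):
[Figure (incpp.epsi): inputs a7 ... a0, c_in; outputs c_out, z7 ... z0]

Gray incrementer

Increments in Gray number system
  $c_0 = a_{n-1} \oplus a_{n-2} \oplus \cdots \oplus a_0$ (parity)
  $c_{i+1} = \overline{a_i}\, c_i$ ; $i = 0, \ldots, n-3$ (r.m.a.)
  $z_0 = a_0 \oplus \overline{c_0}$
  $z_i = a_i \oplus a_{i-1} c_{i-1}$ ; $i = 1, \ldots, n-2$
  $z_{n-1} = a_{n-1} \oplus c_{n-2}$
Prefix problem => AND-prefix structure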
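
The Gray incrementer equations can be checked against the reflected binary Gray code with a short Python model. This is a sketch under the assumption that the code in question is the standard reflected Gray code g = b XOR (b >> 1) and that bits are stored LSB first; the function name is illustrative.

```python
def gray_increment(a: list[int]) -> list[int]:
    """a: Gray-code bits, LSB first, len(a) >= 2. Returns the next code word."""
    n = len(a)
    c = [0] * (n - 1)
    c[0] = sum(a) & 1                        # parity of the code word
    for i in range(n - 2):
        c[i + 1] = (1 - a[i]) & c[i]         # odd parity and all lower bits zero
    z = [0] * n
    z[0] = a[0] ^ (1 - c[0])                 # even parity: flip the LSB
    for i in range(1, n - 1):
        z[i] = a[i] ^ (a[i - 1] & c[i - 1])  # flip the bit left of the lowest 1
    z[n - 1] = a[n - 1] ^ c[n - 2]
    return z


if __name__ == "__main__":
    word = [0, 0, 0]
    for _ in range(8):
        print("".join(str(b) for b in reversed(word)))
        word = gray_increment(word)
```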


5.3 Counting

Count clock cycles => counter,
divide clock frequency => frequency divider ($c_{out}$)

Binary counter

Sequential in-/decrementer
Incrementer speed-up techniques applicable
[Figure (cntblock.epsi): incrementer (+1) with c_in, c_out and a clocked register fed back to its input]
Down- and up-down-counters using decrementers /
incrementer-decrementers
Example: Ripple-carry up-counter using counter slices
(= HA + FF), $c_{in}$ is count enable
[Figure (cntripple.epsi): chain of counter slices with c_in, c_out and outputs q_{n-1} ... q_2, q_1, q_0]
Asynchronous counter using toggle-flip-flops
(lower toggle rate => lower power)
[Figure (cntasync.epsi): chain of toggle flip-flops (T) clocked by clk, outputs q_{n-1} ... q_2, q_1, q_0]
Fast divider ($T = O(1)$) using delayed-carry numbers
(irredundant carry-save representation of 1 allows using
fast carry-save incrementer) [8]

Gray counter

Counter using Gray incrementer

Ring counters

Shift register connected to ring:
[Figure (cntring.epsi): shift register with output fed back to input, outputs q_{n-1} ... q_2, q_1, q_0]
State is not encoded
=> n FF for counting n states
Must be initialized correctly (e.g. 00...01)
Applications:
  fast dividers (no logic between FF)
  state counter for one-hot coded FSMs
Johnson / twisted-ring counter (inverted feed-back):
[Figure (cntjohnson.epsi): shift register with inverted output fed back to input, outputs q_{n-1} ... q_2, q_1, q_0]
=> n FF for counting 2n states
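
The state counts of the ring counter and the Johnson counter from 5.3 can be reproduced with a small Python simulation. The function names and the list representation of the flip-flop chain are assumptions made for this sketch.

```python
def ring_counter(n: int, steps: int) -> list[tuple[int, ...]]:
    """Shift register closed to a ring, initialized one-hot."""
    state = [1] + [0] * (n - 1)
    trace = []
    for _ in range(steps):
        trace.append(tuple(state))
        state = [state[-1]] + state[:-1]        # last FF feeds the first
    return trace


def johnson_counter(n: int, steps: int) -> list[tuple[int, ...]]:
    """Twisted ring: the feedback is inverted; the all-zero start state is valid."""
    state = [0] * n
    trace = []
    for _ in range(steps):
        trace.append(tuple(state))
        state = [1 - state[-1]] + state[:-1]    # inverted feedback
    return trace


if __name__ == "__main__":
    print(len(set(ring_counter(4, 16))), len(set(johnson_counter(4, 16))))
```

With n = 4 flip-flops the ring counter cycles through 4 states and the Johnson counter through 8, matching the n versus 2n rule.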

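5.4 Comparison, Coding, Detection

Comparison operations

  $EQ = (A = B)$                              (equal)
  $NE = (A \neq B) = \overline{EQ}$           (not equal)
  $GE = (A \geq B)$                           (greater or equal)
  $LT = (A < B) = \overline{GE}$              (less than)
  $GT = (A > B) = GE \cdot NE$                (greater than)
  $LE = (A \leq B) = \overline{GT}$           (less or equal)

Equality comparison

  $e_{i+1} = \overline{(a_i \oplus b_i)}\, e_i$ ; $i = 0, \ldots, n-1$ ; $e_0 = 1$, $EQ = e_n$ (r.s.a.)
[Figure (cmpeq.epsi): XNOR gates on a_i, b_i (i = n-1 ... 0) combined by an AND tree to EQ]

Magnitude comparison

  $g_{i+1} = a_i \overline{b_i} + \overline{(a_i \oplus b_i)}\, g_i$ ; $i = 0, \ldots, n-1$ ; $g_0 = 1$, $GE = g_n$ (r.s.a.)

Comparators

Subtractor ($A - B$):
[Figure (cmpsub.epsi): CPA computing A - B with c_in = 1]
  $GE = c_{out}$
  $EQ = P_{n-1:0}$ (for free in PPA)
  $A = 7n$, $T = 2n$ or
  $A \approx \frac{3}{2} n \log n$, $T \approx 2 \log n$
Optimized comparator:
  removing redundancies in subtractor (unused sum bits)
  single-tree structure => speed-up at no cost
  example: ripple comparator using comparator slices
[Figure (cmpripple.epsi): chain of comparator slices on a_i, b_i (i = n-1 ... 0) with an equality chain (EQ) and a magnitude chain (GE)]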
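
A ripple comparator built from comparator slices can be modelled as follows. The Python sketch assumes unsigned operands given as bit lists with the LSB first, and a chain that runs from LSB to MSB; the function name and the direction of the chain are illustrative choices, not taken from the lost figure.

```python
def ripple_compare(a_bits: list[int], b_bits: list[int]) -> tuple[int, int]:
    """Returns (EQ, GE) for unsigned operands, bits LSB first."""
    eq, ge = 1, 1                            # empty words are equal
    for a, b in zip(a_bits, b_bits):         # later (more significant) bits override
        same = 1 - (a ^ b)
        ge = (a & (1 - b)) | (same & ge)     # a_i > b_i decides, equal bits pass on
        eq = same & eq
    return eq, ge


def to_bits(value: int, n: int) -> list[int]:
    return [(value >> i) & 1 for i in range(n)]


if __name__ == "__main__":
    print(ripple_compare(to_bits(9, 4), to_bits(6, 4)))
```

The remaining comparison results follow from these two signals, for example LT as the complement of GE.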


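Decoder

Decodes binary number $A_{n-1:0}$ to vector $Z_{m-1:0}$ ($m = 2^n$)
  $z_i = 1$ if $A = i$, $0$ else ; $i = 0, \ldots, m-1$
[Figure (decodersym.epsi, decoder.epsi): decoder symbol and 3-to-8 decoder with inputs a2, a1, a0 and outputs z7 ... z0]

Encoder

Encodes vector $A_{m-1:0}$ to binary number $Z_{n-1:0}$ ($m = 2^n$)
  $Z = i$ if $a_i = 1$ ; $i = 0, \ldots, m-1$
  (condition: exactly one $a_i$ is 1)
[Figure (encodersym.epsi, encoder.epsi): encoder symbol and 8-to-3 encoder, z0 from a7, a5, a3, a1 and the remaining outputs z1, z2 from OR gates]

Detection operations

All-zeroes detection: $z = \overline{a_{n-1}}\, \overline{a_{n-2}} \cdots \overline{a_0}$
All-ones detection:   $z = a_{n-1} a_{n-2} \cdots a_0$
  $z_{i+1} = a_i z_i$ ; $i = 0, \ldots, n-1$ ; $z_0 = 1$, $z = z_n$ (r.s.a.)
  $A = n$, $T = \lceil \log n \rceil$

Leading-zeroes detection (LZD):
for scaling, normalization, priority encoding
a) non-encoded output:
     $z_i = \overline{a_{n-1}}\, \overline{a_{n-2}} \cdots \overline{a_{i+1}}\, a_i$ (e.g. 000101 -> 000100)
     prefix problem (r.m.a.) => AND-prefix structure
   [Figure (lzdnenc.epsi): inputs a_{n-1}, a_{n-2}, ..., a_1, a_0 and outputs z_{n-1}, z_{n-2}, ..., z_1, z_0]
   (note: connections according to PPA-SK)
b) encoded output: + encoder
signed numbers: + leading-ones detector (LOD)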
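
Leading-zeroes detection with non-encoded and encoded output can be sketched as a serial AND-prefix computation in Python. The bit order (MSB first) and both function names are assumptions made for this example.

```python
def lzd_non_encoded(a_bits: list[int]) -> list[int]:
    """a_bits MSB first. Marks the leading 1; all other outputs are 0."""
    z, all_zero_so_far = [], 1
    for a in a_bits:
        z.append(all_zero_so_far & a)     # 1 only at the first 1 seen from the MSB
        all_zero_so_far &= 1 - a          # AND-prefix of the inverted inputs
    return z


def lzd_encoded(a_bits: list[int]) -> int:
    """Number of leading zeroes (len(a_bits) if the word is all zero)."""
    z = lzd_non_encoded(a_bits)
    return z.index(1) if 1 in z else len(a_bits)


if __name__ == "__main__":
    word = [0, 0, 0, 1, 0, 1]
    print(lzd_non_encoded(word), lzd_encoded(word))
```

The running AND is written serially here; a parallel-prefix structure computes the same prefix signals in logarithmic depth.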

5.5 Shift, Extension, Saturation

Shift: a) shift n-bit vector by k bit positions
       b) select n out of more bits at position k
       also: logical (= unsigned), arithmetic (= signed)
Rotation by k bit positions, n constant (logic operation)
Extension of word length by k bits ($n \to n + k$)
(i.e. sign-extension for signed numbers)
Saturation to highest/lowest value after over-/underflow

  operation                 result                                              VHDL
  shift a)  unsigned  l.    $a_{n-2}, \ldots, a_0, 0$                           sll
                      r.    $0, a_{n-1}, \ldots, a_1$                           srl
            signed    l.    $a_{n-1}, a_{n-3}, \ldots, a_0, 0$                  sla
                      r.    $a_{n-1}, a_{n-1}, a_{n-2}, \ldots, a_1$            sra
  shift b)  unsigned        $a_{n+k-1}, \ldots, a_k$
            signed          $a_{2n-1}, a_{n+k-2}, \ldots, a_k$
  rotate              l.    $a_{n-2}, \ldots, a_0, a_{n-1}$                     rol
                      r.    $a_0, a_{n-1}, \ldots, a_1$                         ror
  extend    unsigned  l.    $0, a_{n-1}, \ldots, a_0$
                      r.    $a_{n-1}, \ldots, a_0, 0$
            signed    l.    $a_{n-1}, a_{n-1}, a_{n-2}, \ldots, a_0$
                      r.    $a_{n-1}, a_{n-2}, \ldots, a_0, 0$
  saturate  unsigned        $a_{n-1}, \ldots, a_{n-1}$
            signed          $\overline{a_{n-1}}, a_{n-1}, \ldots, a_{n-1}$

Applications:
  adaption of magnitude (shift a)) or word length
  (extension) of operands (e.g. for addition)
  multiplication/division by multiples of 2 (shift)
  logic bit/byte operations (shift, rotation)
  scaling of numbers for word-length reduction (i.e.
  ignore leading zeroes, shift b)) or normalization (e.g.
  of floating-point numbers, shift a)) using LZD
  reducing error after over-/underflow (saturation)

Implementation of shift/extension/rotation by
  constant values: hard-wired
  variable values: multiplexers
  n possible values: n-by-n barrel-shifter/rotator

Example: 4-by-4 barrel-rotator
  $A = O(n^2)$, $T = O(\log n)$
[Figure (muxshift.epsi, barshift.epsi): 4-by-4 barrel-rotator with inputs a3 ... a0, select signals s1, s0 and outputs z3 ... z0, realized with multiplexers and with tristate buffers]
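
A multiplexer-based barrel rotator can be modelled as a cascade of stages, each rotating by a power of two under control of one select bit. The Python sketch below assumes the word length n is a power of two and the rotate amount k is in 0 .. n-1; the function name is illustrative.

```python
def barrel_rotate_left(a: int, k: int, n: int) -> int:
    """Rotate the n-bit word a left by k using log2(n) multiplexer stages."""
    mask = (1 << n) - 1
    stage = 0
    while (1 << stage) < n:
        if (k >> stage) & 1:                         # select bit s_stage
            d = 1 << stage                           # this stage rotates by 2^stage
            a = ((a << d) | (a >> (n - d))) & mask
        stage += 1
    return a


if __name__ == "__main__":
    print(bin(barrel_rotate_left(0b1001, 1, 4)))
```

For n = 4 this uses two stages controlled by s0 and s1, the same arrangement as in the 4-by-4 multiplexer example.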

5.6 Addition Flags

  flag   formula                   description
  C      $c_n$                     carry flag
  V      $c_n \oplus c_{n-1}$      signed overflow flag
  Z      $(S = 0)$                 zero flag
  N      $s_{n-1}$                 negative flag, sign

Implementation of adder with flags
  C, N: for free
  V: fast $c_n$, $c_{n-1}$ computed by e.g. PPA
  Z: a) $c_{in} = 1$ (subtract.): $Z = (A = B) = P_{n-1:0}$ (of PPA)
     b) $c_{in} = 0, 1$:
        1) $Z = \overline{s_{n-1} + s_{n-2} + \cdots + s_0}$ (r.s.a.)
        2) faster without final sum (i.e. carry prop.) [18]

Basic and derived condition flags
[Table: condition flags (zero, negative, positive, overflow, underflow, and the comparison conditions) with their formulas for unsigned and signed operands]

Unsigned and signed addition/subtraction only differ
with respect to the condition flags
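
The four basic addition flags of 5.6 can be derived from an n-bit addition as in the following Python sketch. It assumes the usual definitions (C = carry out of the MSB, V = carry into the MSB XOR carry out of the MSB, Z = all sum bits zero, N = MSB of the sum); the function name and the dictionary layout are illustrative.

```python
def add_with_flags(a: int, b: int, n: int, c_in: int = 0) -> tuple[int, dict]:
    """n-bit addition returning the sum word and the flags C, V, Z, N."""
    mask = (1 << n) - 1
    low = (a & (mask >> 1)) + (b & (mask >> 1)) + c_in   # bits 0 .. n-2 only
    c_msb = low >> (n - 1)                               # carry into the MSB
    total = a + b + c_in
    s = total & mask
    c_out = total >> n                                   # carry out of the MSB
    flags = {"C": c_out, "V": c_out ^ c_msb, "Z": int(s == 0), "N": s >> (n - 1)}
    return s, flags


if __name__ == "__main__":
    print(add_with_flags(7, 1, 4))    # signed overflow: 7 + 1 does not fit in 4 bits
    print(add_with_flags(15, 1, 4))   # unsigned carry, signed -1 + 1 = 0
```

The same sum word serves unsigned and signed interpretation; only the choice of flag (C or V) differs.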

5.7 Arithmetic Logic Unit (ALU)

[Figure (alusymbol.epsi): ALU symbol with inputs A, B, c_in, op and outputs Z, c_out, flags]

ALU operations

  arithmetic     add ($A + B$), sub ($A - B$), inc ($A + 1$), dec ($A - 1$),
                 pass ($A$), neg ($-A$)
  logic          and, nand, or, nor, xor, xnor, pass, not
  shift/rotate   sll, srl, sla, sra, rol, ror
  s/ro : shift/rotate ; l/r : left/right ;
  l/a : logic (unsigned) / arithmetic (signed)

Logic of adder/subtractor can partly be shared with logic
operations

6 Multiplication

6.1 Multiplication Basics

Multiplies two n-bit operands $A$ and $B$ [1, 2]
Product $P$ is $(2n)$-bit unsigned number or $(2n - 1)$-bit
signed number
Example: unsigned multiplication
  $P = A \cdot B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$ or
  $P_{i+1} = P_i + a_i B\, 2^i$ ; $i = 0, \ldots, n-1$ ; $P_0 = 0$, $P = P_n$ (r.s.a.)

Algorithm
  1) Generation of n partial products
  2) Adding up partial products:
     a) sequentially (sequential shift-and-add),
     b) serially (combinational shift-and-add), or
     c) in parallel

Speed-up techniques
  Reduce number of partial products
  Accelerate addition of partial products
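
The two-step algorithm (generate partial products, then add them up) corresponds to the classic shift-and-add loop. The Python sketch below assumes unsigned n-bit operands; the function name is illustrative.

```python
def shift_and_add_multiply(a: int, b: int, n: int) -> int:
    """Unsigned multiplication: one partial product a_i * B * 2^i per multiplier bit."""
    p = 0
    for i in range(n):
        if (a >> i) & 1:        # partial product is either 0 or B shifted by i
            p += b << i
    return p


if __name__ == "__main__":
    print(shift_and_add_multiply(13, 11, 4))
```

A sequential multiplier performs one loop iteration per clock cycle, whereas array and parallel multipliers unroll the loop in hardware.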

Sequential multipliers:
  partial products generated and added sequentially (using
  accumulator)
  $A = O(n)$, $T = O(n \log n)$
[Figure (mulseq.epsi): partial product generator and CPA with accumulator register]

Array multipliers:
  partial products generated and added simultaneously in linear
  array (using array adder)
  $A = O(n^2)$, $T = O(n)$
[Figure (mularr.epsi): rows of partial product generation and CSAs, final CPA]

Parallel multipliers:
  partial products generated in parallel and added
  subsequently in multi-operand adder (using tree adder)
  $A = O(n^2)$, $T = O(\log n)$
[Figure (mulpar.epsi): parallel partial product generation, CSA tree, final CPA]

Signed multipliers:
  a) complement operands before and result after
     multiplication => unsigned multiplication
  b) direct implementation (dedicated multiplier structure)

6.2 Unsigned Array Multiplier

Braun multiplier: array multiplier for unsigned numbers
  $P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$
  $A = 8n^2 - 11n$, $T = 6n - 9$

                       a0b3 a0b2 a0b1 a0b0
                  a1b3 a1b2 a1b1 a1b0
             a2b3 a2b2 a2b1 a2b0
     +  a3b3 a3b2 a3b1 a3b0
     -------------------------------------
     p7  p6   p5   p4   p3   p2   p1   p0

[Figure (mulbraun.epsi): 4-by-4 Braun array with inputs a0 ... a3, b3 ... b0, AND gates, rows of HA and FA cells, and outputs p7 ... p0]

6.3 Signed Array Multipliers

Modified Braun multiplier

Subtract bits with negative weight => special FAs [1]
  (variants for 1 and for 2 negative input bits)
Replace FAs in regions 1, 2, and 3 of the array by these
special FAs (input at the marked position)
Otherwise exactly same structure and complexity as
Braun multiplier => efficient and flexible

Baugh-Wooley multiplier

Arithmetic transformations yield the following partial
products (two additional ones):
[Figure: Baugh-Wooley partial-product array for a 4-by-4 multiplication with product bits 7 ... 0]
– Less efficient and regular than modified Braun
  multiplier

6.4 Booth Recoding

Speed-up technique: reduction of partial products

Sequential multiplication

Minimal (or canonical) signed-digit (SD) represent. of $A$
+ One cycle per non-zero partial product (i.e. $a_i \neq 0$)
– Negative partial products
– Data-dependent reduction of partial products and latency

Combinational multiplication

Only fixed reduction of partial product possible
Radix-4 modified Booth recoding: 2 bits recoded to one
multiplier digit => $n/2$ partial products
  $A = \sum_{i=0}^{n/2-1} (a_{2i-1} + a_{2i} - 2 a_{2i+1})\, 2^{2i}$ ; $a_{-1} = 0$

  a_{2i+1}  a_{2i}  a_{2i-1}   recoded digit
     0        0        0            0
     0        0        1           +1
     0        1        0           +1
     0        1        1           +2
     1        0        0           -2
     1        0        1           -1
     1        1        0           -1
     1        1        1            0

[Figure (mulbooth.epsi): Booth recoding of the multiplier, partial product generation, CSA array/tree, final CPA]
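
Radix-4 modified Booth recoding can be sketched in Python as follows. The sketch assumes an even word length n and a multiplier given as an n-bit two's complement bit pattern; the function names are illustrative, and the multiplication uses the recoded digits directly instead of modelling the multiplexer and XOR logic.

```python
def booth_radix4_digits(a: int, n: int) -> list[int]:
    """Recodes the n-bit two's complement pattern a into n/2 digits in {-2,...,2}."""
    bits = [(a >> i) & 1 for i in range(n)]
    digits, prev = [], 0                     # prev plays the role of a_{-1} = 0
    for i in range(0, n, 2):
        digits.append(prev + bits[i] - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits                            # least significant digit first


def booth_multiply(a: int, b: int, n: int) -> int:
    """Signed product of the n-bit pattern a and the integer b via n/2 partial products."""
    return sum(d * b * 4 ** i for i, d in enumerate(booth_radix4_digits(a, n)))


if __name__ == "__main__":
    print(booth_radix4_digits(0b0110, 4))    # 6 = 2 * 4 - 2
    print(booth_multiply(0b1010, 3, 4))      # pattern 1010 is -6 in 4 bits
```

Each digit selects 0, +B, -B, +2B, or -2B, so every partial product needs at most a shift and a negation.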

Applicable to sequential, array, and parallel multipliers
– additional recoding logic and more
  complex partial product generation
  (MUX for shift, XOR for negation)
+ adder array/tree cut in half
  => considerably smaller (array and tree)
  => much faster for adder arrays
  => slightly or not faster for adder trees

Negative partial products (avoid sign-extension):
[Figure: partial-product array with sign extension bits replaced by constant 1s and inverted sign bits, product bits 6 ... 0]

Suited for signed multiplication (incl. Booth recod.)
Extend $A$ for unsigned multiplication: $a_n = 0$
Radix-8 (3-bit recoding) and higher radices:
precomputing $3B$ => larger overhead

6.5 Wallace Tree Addition

Speed-up technique: fast partial product addition
  $A = O(n^2)$, $T = O(\log n)$
Applicable to parallel multipliers: parallel partial
product generation (normal or Booth recoded)
– Irregular adder tree (Wallace tree) due to different
  number of bits per column
  => irregular wiring and/or layout
  => non-uniform bit arrival times at final adder

6.6 Multiplier Implementations

Sequential multipliers:
  low performance, small area, resource sharing (adder)
Braun or Baugh-Wooley multiplier (array multiplier):
  medium performance, high area, high regularity
  layout generators => data paths and macro-cells
  simple pipelining, faster CPA => higher speed
Booth-Wallace multiplier (parallel multiplier) [9]:
  high performance, high area, low regularity
  custom multipliers, netlist generators
  often pipelined (e.g. register between CSA-tree and CPA)
Signed-unsigned multiplier: signed multiplier with
  operands extended by 1 bit ($a_n = a_{n-1}$ or $0$, $b_n = b_{n-1}$ or $0$)

6.7 Composition from Smaller Multipliers

$(2n \times 2n)$-bit multiplier can be composed from 4
$(n \times n)$-bit multipliers (can be repeated recursively)
  $A \cdot B = (A_H 2^n + A_L)(B_H 2^n + B_L)$
  $= A_H B_H 2^{2n} + (A_H B_L + A_L B_H) 2^n + A_L B_L$
4 $(n \times n)$-bit multipliers
+ $(2n)$-bit CSA + $(3n)$-bit CPA
less efficient (area and speed)

6.8 Squaring

$P = A^2 = A \cdot A$ : multiplier optimizations possible
[Figure: partial-product array of a 4-bit squarer; symmetric products a_i a_j are merged, product bits 7 ... 0]
=> $n/2 + 1$ partial products (if no Booth recoding used)
+ optimized squarer more efficient than multiplier
Table look-up (ROM) less efficient for every $n$

7 Division / Square Root Extraction

7.1 Division Basics

  $Q = A / B$ ; $R = A$ rem $B$ (remainder) ; $A = Q \cdot B + R$
  $A \in [0, 2^{2n} - 1]$ ; $B, Q, R \in [0, 2^n - 1]$
  $A < 2^n B$, otherwise overflow
  => normalize $B$ before division ($2^{n-1} \leq B \leq 2^n - 1$)

Algorithms (radix-2)

Subtract-and-shift: partial remainders $R_i$ [1, 2]
Sequential algorithm: recursive, non-associative (r.m.n.)
Basic algorithm: compare and conditionally subtract
  => expensive comparison and CPA
Restoring division: subtract and conditionally restore
  (adder or multiplexer) => expensive CPA and restoring
Non-restoring division: detect sign, subtract/add, and
  correct by next steps => expensive CPA
SRT division: estimate range, subtract/add (CSA), and
  correct by next steps => inexpensive CSA
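
The composition of a (2n x 2n)-bit multiplier from four (n x n)-bit multipliers (section 6.7) can be verified with a short Python sketch. The function name is illustrative, and the four small products are computed with Python's `*` operator as a stand-in for the smaller multiplier blocks.

```python
def multiply_composed(a: int, b: int, n: int) -> int:
    """Multiplies two 2n-bit unsigned operands using four n x n products."""
    mask = (1 << n) - 1
    a_h, a_l = a >> n, a & mask          # split into high and low halves
    b_h, b_l = b >> n, b & mask
    hh, hl, lh, ll = a_h * b_h, a_h * b_l, a_l * b_h, a_l * b_l
    return (hh << (2 * n)) + ((hl + lh) << n) + ll


if __name__ == "__main__":
    print(multiply_composed(0xB7, 0x5C, 4), 0xB7 * 0x5C)
```

The three shifted terms overlap, which is why a carry-save stage and one final carry-propagate adder are needed in hardware.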
7.2 Restoring Division

In every step the divisor is subtracted from the partial remainder:
  $q_i = 1$ if the trial remainder is $\geq 0$ (result kept)
  $q_i = 0$ if the trial remainder is $< 0$ (previous remainder restored)

7.3 Non-Restoring Division

  $q_i = 1$ if the partial remainder is $\geq 0$ (subtract $B$)
  $q_i = -1$ if the partial remainder is $< 0$ (add $B$)
One subtraction/addition (CPA) per step
Final correction step for $R$ (additional CPA)
Simple quotient digit conversion: $q_i \in \{-1, 1\} \to \{0, 1\}$
(note: $q$ irredundant)
  $A = O(n^2)$ or $O(n^2 \log n)$
  $T = O(n^2)$ or $O(n \log n)$
[Figure (divnr.epsi): array of add/subtract CPA rows with sign detection, inputs A, B and outputs Q, R]

7.4 Signed Division

  $q_i = 1$ if partial remainder and $B$ have same sign
  $q_i = -1$ if partial remainder and $B$ have opposite sign

Example: signed non-restoring array divider
(simplifications: $B > 0$, final correction of $R$ omitted)
  $A = 9n^2$, $T = 2n^2 + 4n$
[Figure (divarray.epsi): 4-by-4 array of full-adder cells with inputs a6 ... a0, b3 ... b0, first quotient control a6 XOR b3, and outputs q3 ... q0, r3 ... r0]
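
Restoring and non-restoring division can be compared with a short integer model in Python. This is a sketch assuming unsigned operands with A < 2^n * B (the no-overflow condition) and B > 0; the function names and the use of unbounded signed Python integers for the partial remainder are assumptions of this example.

```python
def restoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Returns (Q, R); a failed trial subtraction is discarded (restored)."""
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        trial = r - (b << i)
        if trial >= 0:                 # keep the result, quotient bit 1
            r, q = trial, q | (1 << i)
    return q, r


def non_restoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Quotient digits +1/-1: subtract if the remainder is >= 0, add otherwise."""
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        if r >= 0:
            r, q = r - (b << i), q + (1 << i)    # digit +1
        else:
            r, q = r + (b << i), q - (1 << i)    # digit -1
    if r < 0:                                    # final correction step
        r, q = r + b, q - 1
    return q, r


if __name__ == "__main__":
    print(restoring_divide(100, 7, 4), non_restoring_divide(100, 7, 4))
```

Both functions return the same quotient and remainder; the non-restoring version always performs exactly one addition or subtraction per step plus the final correction.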
7.5 SRT Division (Sweeney, Robertson, Tocher)

Quotient digits $q_i \in \{-1, 0, 1\}$, selected by comparing the
partial remainder with $\pm B$ => $q$ is SD number
If $2^{n-1} \leq B < 2^n$, i.e. $B$ is normalized:
  comparison with the constants $\pm 2^{n-1}$ instead of $\pm B$
+ Only 3 MSB are compared => partial remainders are estimated
  => CSA instead of CPA can be used (precise enough) [19]
Correction in following steps (+ final correction step)
– Redundant representation of $q$ (SD representation)
  => final conversion necessary (CPA)
+ Highly regular and fast ($T = O(n)$) SRT array dividers
  => only slightly slower/larger than array multipliers
  $A = O(n^2)$, $T = O(n)$
[Figure (divsrt.epsi): array of add/subtract CSA rows with range estimation, final CPA rows, inputs A, B and outputs Q, R]

7.6 High-Radix Division

Radix $2^m$: $m$ quotient bits per step
=> fewer, but more complex steps
+ Suitable for SRT algorithm => faster
– Complex comparisons (more bits) and decisions
  => table look-up (=> Pentium bug!)

7.7 Division by Multiplication

Division by convergence

  $Q = \frac{A}{B} = \frac{A \cdot R_0 R_1 \cdots R_{m-1}}{B \cdot R_0 R_1 \cdots R_{m-1}} \to \frac{Q}{1}$
  $B$ normalized: $B \in [\frac{1}{2}, 1)$ resp. $[2^{n-1}, 2^n)$
Algorithm:
  $R_i = 2 - B_i$, $B_{i+1} = B_i R_i$, $A_{i+1} = A_i R_i$ ; $i = 0, \ldots, m-1$
  $A_0 = A$, $B_0 = B$, $Q = A_m$ (r.s.n.)
Quadratic convergence: $L = \lceil \log n \rceil$
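
Division by convergence can be tried out with exact rational arithmetic. The Python sketch below assumes the common formulation with multipliers R_i = 2 - B_i and a divisor normalized to [1/2, 1); the function name is illustrative, and `fractions.Fraction` is used so that the quadratic convergence is not hidden by rounding.

```python
from fractions import Fraction


def divide_by_convergence(a: Fraction, b: Fraction, iterations: int) -> Fraction:
    """Multiplies numerator and denominator by R_i = 2 - B_i until B_i is close to 1."""
    for _ in range(iterations):
        r = 2 - b
        a, b = a * r, b * r      # the quotient a/b is unchanged, b moves towards 1
    return a                     # approximates the quotient once b is close to 1


if __name__ == "__main__":
    exact = Fraction(5, 8) / Fraction(3, 4)
    for m in range(1, 6):
        q = divide_by_convergence(Fraction(5, 8), Fraction(3, 4), m)
        print(m, float(abs(q - exact)))
```

Since 1 - B_{i+1} = (1 - B_i)^2, the number of correct bits doubles per iteration, which is the reason for the logarithmic iteration count.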

Division by reciprocation

  $Q = \frac{A}{B} = A \cdot \frac{1}{B}$
Newton-Raphson iteration method:
  find $f(X) = 0$ by recursion $X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)}$
  $f(X) = \frac{1}{X} - B$, $f'(X) = -\frac{1}{X^2}$, $f(\frac{1}{B}) = 0$
Algorithm:
  $X_{i+1} = X_i (2 - B X_i)$ ; $i = 0, \ldots, m-1$
  $Q = A \cdot X_m$ (r.s.n.)
Quadratic convergence: $L = O(\log n)$
Speed-up: first approximation $X_0$ from table

7.8 Remainder / Modulus

Remainder (rem): signed remainder of a division
  $R = A$ rem $B$, sign($R$) = sign($A$)
Modulus (mod): positive remainder of a division
  $M = A$ mod $B$ = $R$ if $R \geq 0$, $R + B$ else

7.9 Divider Implementations

Iterative dividers (through multiplication):
  resource sharing of existing components (multiplier)
  medium performance, medium area
  high efficiency if components are shared
Sequential dividers (restoring, non-restoring, SRT):
  resource sharing of existing components (e.g. adder)
  low performance, low area
Array dividers (restoring, non-restoring, SRT):
  dedicated hardware component
  high performance, high area
  high regularity => layout generators, pipelining
  square root extraction possible by minor changes
  combination with multiplication or/and square root
No parallel dividers exist, as compared to parallel
multipliers (sequential nature of division)
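
The Newton-Raphson reciprocal iteration and the rem/mod distinction can be sketched in Python. Assumptions of this sketch: floating-point arithmetic instead of fixed-point hardware, a divisor normalized to [1/2, 1) with the crude starting value X_0 = 1, a positive divisor for `mod`, and illustrative function names.

```python
def reciprocal(b: float, x0: float, iterations: int) -> float:
    """Newton-Raphson for f(X) = 1/X - B: X_{i+1} = X_i * (2 - B * X_i)."""
    x = x0
    for _ in range(iterations):
        x = x * (2 - b * x)
    return x


def rem(a: int, b: int) -> int:
    """Remainder with the sign of the dividend a."""
    r = abs(a) % abs(b)
    return r if a >= 0 else -r


def mod(a: int, b: int) -> int:
    """Non-negative remainder for a positive divisor b."""
    r = rem(a, b)
    return r if r >= 0 else r + b


if __name__ == "__main__":
    print(5 * reciprocal(0.75, 1.0, 6))   # approximates 5 / 0.75
    print(rem(-7, 3), mod(-7, 3))
```

The error 1 - B * X_i is squared in every iteration, so a better first approximation from a table saves iterations.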

7.10 Square Root Extraction

  $Q = \sqrt{A}$ ; $A = Q^2 + R$
  $A \in [0, 2^{2n} - 1]$ ; $Q \in [0, 2^n - 1]$

Algorithm

Subtract-and-shift: partial remainders $R_i$ and quotients
$Q_i = (q_{n-1}, \ldots, q_i, 0, \ldots, 0)$ [1]
  recursive, non-associative (r.m.n.)

Implementation

+ Similar to division => same algorithms applicable
  (restoring, non-restoring, SRT, high-radix)
+ Combination with division in same component possible
Only triangular array required
(step $i$: $q_k = 0$ for $k < i$)
[Figure (sqrtnr.epsi): triangular array of add/subtract CPA rows, input A and outputs Q, R]

8 Elementary Functions

Exponential function: $e^x$ (exp x)
Logarithm function: $\ln x$, $\log x$
Trigonometric functions: $\sin x$, $\cos x$, $\tan x$
Inverse trig. functions: $\arcsin x$, $\arccos x$, $\arctan x$
Hyperbolic functions: $\sinh x$, $\cosh x$, $\tanh x$

8.1 Algorithms

Table look-up: inefficient for large word lengths [5]
Taylor series expansion: complex implementation
Polynomial and rational approximations [1, 5]
Shift-and-add algorithms [5]
Convergence algorithms [1, 2]:
  similar to division-by-convergence
  two (or more) recursive formulas: one formula
  converges to a constant, the other to the result
Coordinate rotation (CORDIC) [2, 5, 20]:
  3 equations for x-, y-coordinate, and angle
  computes all elementary functions by proper input
  settings and choice of modes and outputs
  simple, universal hardware, small look-up table
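
A restoring subtract-and-shift square root in the spirit of section 7.10 can be modelled on integers. The Python sketch assumes A < 2^(2n) and uses the identity (Q + 2^i)^2 - Q^2 = (2Q + 2^i) * 2^i for the trial subtraction; the function name is illustrative.

```python
def restoring_sqrt(a: int, n: int) -> tuple[int, int]:
    """Returns (Q, R) with a == Q*Q + R and 0 <= R <= 2*Q, for a < 2**(2*n)."""
    q, r = 0, a
    for i in range(n - 1, -1, -1):
        trial = r - ((2 * q + (1 << i)) << i)   # remainder if root bit i were set
        if trial >= 0:                          # accept the bit, else keep old remainder
            r, q = trial, q + (1 << i)
    return q, r


if __name__ == "__main__":
    print(restoring_sqrt(200, 4))
```

The value subtracted in step i depends only on the root bits found so far, so row i of a hardware array needs fewer cells than a divider row, which leads to the triangular array.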
8.2 Integer Exponentiation

Approximated exponentiation: $x^y = e^{y \ln x} = 2^{y \log x}$
Base-2 integer exponentiation: $Z = 2^A$
Integer exponentiation (exact): $Z = A^B$ (!)
Applications: modular exponentiation $A^B \pmod{C}$
in cryptographic algorithms (e.g. IDEA, RSA)
Algorithms: square-and-multiply
  a) $A^B = A^{b_{n-1} 2^{n-1}} \cdots A^{b_1 2^1} A^{b_0 2^0}$ (r.s.n.)
  b) $A^B = (\cdots((A^{b_{n-1}})^2 A^{b_{n-2}})^2 \cdots A^{b_1})^2 A^{b_0}$ (r.s.n.)

8.3 Integer Logarithm

  $Z = \lfloor \log_2 A \rfloor$
For detection/comparison of order of magnitude
Corresponds to leading-zeroes detection (LZD) with
encoded output
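
Both square-and-multiply variants of 8.2 and the integer logarithm of 8.3 can be sketched in Python. The function names are illustrative; the modulus is assumed to be greater than 1 and the exponent non-negative.

```python
def power_right_to_left(a: int, b: int, m: int) -> int:
    """Variant a): scan exponent bits from the LSB, keep squaring a running power of a."""
    result, square = 1, a % m
    while b:
        if b & 1:
            result = result * square % m
        square = square * square % m
        b >>= 1
    return result


def power_left_to_right(a: int, b: int, m: int) -> int:
    """Variant b): scan exponent bits from the MSB, square then conditionally multiply."""
    result = 1
    for bit in bin(b)[2:]:
        result = result * result % m
        if bit == "1":
            result = result * a % m
    return result


def integer_log2(a: int) -> int:
    """floor(log2(a)) for a > 0, i.e. the position of the leading 1."""
    return a.bit_length() - 1


if __name__ == "__main__":
    print(power_right_to_left(7, 13, 101), power_left_to_right(7, 13, 101))
    print(integer_log2(1000))
```

Both variants need about n squarings and at most n multiplications for an n-bit exponent; reducing modulo m after every step keeps the operands at the word length of m.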

9 VLSI Design Aspects

9.1 Design Levels

Transistor-level design

Circuit and layout designed by hand (full custom)
Low design efficiency
High circuit performance: high speed, low area
High flexibility: choice of architecture and logic style
Transistor-level circuit optimizations:
  logic style: static vs. dynamic logic,
  complementary CMOS vs. pass-transistor logic
  special arithmetic circuits: better than with gates
[Figure (carrychain.epsi): transistor-level carry chain from c_in to c_out with signals g_i, k_i, p_i and g_{i-1}, k_{i-1}, p_{i-1}, carries c_i, c_{i-1}]
[Figure (facmos.epsi): transistor-level full-adder with inputs a, b, c_in and outputs s, c_out]

Gate-level design

Cell-based design techniques: standard-cells, gate-array/
sea-of-gates, field-programmable gate-array (FPGA)
Circuit implemented by hand or by synthesis (library)
Layout implemented by automated place-and-route
Medium to high design efficiency
Medium to low circuit performance
Medium to low flexibility: full choice of architecture

Block-level design

Layout blocks and netlists from parameterized automatic
generators or compilers (library)
High design efficiency
Medium to high circuit performance
Low flexibility: limited choice of architectures
Implementations:
  data-path: bit-sliced, bus-oriented layout (array of
  cells: n bits x m operations), implementation of entire
  data paths, medium performance, medium diversity
  macro-cells: tiled layout, fixed/single-operation
  components, high performance, small diversity
  portable netlists: => gate-level design

9.2 Synthesis

High-level synthesis
  Synthesis from abstract, behavioral hardware description (e.g. data dependency graphs) using e.g. VHDL
  Involves architectural synthesis and arithmetic transformations
  High-level synthesis is still in its infancy

Low-level synthesis
  Layout and netlist generators
  Included in libraries and synthesis tools
  Low-level synthesis is state-of-the-art
  Basis for efficient ASIC design
  Limited diversity and flexibility of library components

Circuit optimization
  Efficient optimization of random logic is state-of-the-art
  Optimization of entire arithmetic circuits is not feasible → only local optimizations possible
  Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators

9.3 VHDL

Arithmetic types : unsigned, signed (2's complement)

Arithmetic packages
  numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys)
  contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types

Arithmetic operators (VHDL'87/93) [21]
  relational : =, /=, <, <=, >, >=
  shift, rotate ('93 only) : rol, ror, sla, sll, sra, srl
  adding : +, -
  sign (unary) : +, -
  multiplying : *, /, mod, rem
  exponent, absolute : **, abs

Synthesis
  Typical limitations of synthesis tools :
    /, mod, rem : both operands must be constant or divisor must be a power of two
    ** : for power-of-two bases only
  Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)
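
The arithmetic types and operators of Section 9.3 can be illustrated with a minimal numeric_std sketch. The entity name arith_ops_demo, its ports, and the generic width are hypothetical and chosen only for illustration. Division and remainder use the constant power-of-two divisor 4, which is the case covered by the typical tool limitations listed above; whether a given tool accepts more general operands is not assumed here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical demo entity: names and widths are illustrative assumptions.
entity arith_ops_demo is
  generic (width : positive := 8);  -- assumed >= 3 so that the divisor 4 fits
  port (
    a_u, b_u : in  unsigned(width-1 downto 0);
    a_s, b_s : in  signed(width-1 downto 0);
    sum_u    : out unsigned(width downto 0);      -- extra bit keeps the carry
    diff_s   : out signed(width downto 0);        -- extra bit avoids overflow
    prod_u   : out unsigned(2*width-1 downto 0);  -- "*" yields the sum of both lengths
    quot_u   : out unsigned(width-1 downto 0);
    rem_u    : out unsigned(width-1 downto 0);
    shl_u    : out unsigned(width-1 downto 0);
    less_s   : out std_logic
  );
end entity arith_ops_demo;

architecture rtl of arith_ops_demo is
begin
  -- adding operators; resize zero-extends unsigned and sign-extends signed
  sum_u  <= resize(a_u, width+1) + resize(b_u, width+1);
  diff_s <= resize(a_s, width+1) - resize(b_s, width+1);

  -- multiplying operators; the divisor is a constant power of two
  prod_u <= a_u * b_u;
  quot_u <= a_u / 4;
  rem_u  <= a_u rem 4;

  -- shift operator (VHDL'93 only)
  shl_u  <= a_u sll 1;

  -- relational operator on 2's complement operands
  less_s <= '1' when a_s < b_s else '0';
end architecture rtl;
```

After synthesis, the report should show one adder, one subtractor, one multiplier and one comparator; the division, remainder and shift by constants are expected to reduce to wiring.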


Resource sharing
  Sharing one resource for multiple operations
  Done automatically by some synthesis tools
  Otherwise, appropriate coding is necessary :
    a) 2 adders + 1 multiplexer :
         S <= A + C when SELA = '1' else B + C;
    b) 1 multiplexer + 1 adder :
         T <= A when SELA = '1' else B;
         S <= T + C;

Coding & synthesis hints
  Addition : single adder with carry-in/carry-out :
         Aext <= resize(A, width+1) & Cin;
         Bext <= resize(B, width+1) & '1';
         Sext <= Aext + Bext;
         S    <= Sext(width downto 1);
         Cout <= Sext(width+1);
    (Aext, Bext, Sext are width+2 bits wide; the appended LSBs Cin and '1' produce a carry into bit 1 exactly when Cin = '1')
  Synthesis : check synthesis result for allocated arithmetic units → code sanity check, control of circuit size

VHDL library of arithmetic units
  Structural, synthesizable VHDL code for most circuits described in this text is found in [22]

9.4 Performance

Pipelining
  Pipelining is basically possible with every combinational circuit → higher throughput
  Arithmetic circuits are well suited for pipelining due to high regularity
  Pipelining of arithmetic circuits can be very costly :
    large amount of internal signals in arithmetic circuits
    array structures : many small pipeline registers
    tree structures : few large pipeline registers
    → no advantage of tree structures anymore (except for smaller latency)
  Fine-grain pipelining → systolic arrays (often applied to arithmetic circuits)

High speed
  Fast circuit architectures, pipelining, replication (parallelization), and combinations of those
  Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment
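
As a sketch of what pipelining an arithmetic circuit costs, the following two-stage adder splits the addition into a lower and an upper half and reuses the single-adder carry-in/carry-out coding from the hints above in each stage. The entity name add_pipe2, its ports, and the split into two halves are assumptions made for illustration; this is not code from [22]. There is no reset, so the first two results after power-up are undefined.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical two-stage pipelined adder; all names are illustrative assumptions.
entity add_pipe2 is
  generic (width : positive := 16);  -- assumed >= 2
  port (
    clk  : in  std_logic;
    a, b : in  unsigned(width-1 downto 0);
    cin  : in  std_logic;
    s    : out unsigned(width-1 downto 0);
    cout : out std_logic
  );
end entity add_pipe2;

architecture rtl of add_pipe2 is
  constant half : positive := width / 2;
  -- Stage-1 pipeline register: low sum, its carry-out, and both
  -- still unprocessed upper operand halves.
  signal lo_sum_r       : unsigned(half-1 downto 0);
  signal lo_cout_r      : std_logic;
  signal a_hi_r, b_hi_r : unsigned(width-half-1 downto 0);
begin
  process (clk)
    variable lo_ext : unsigned(half+1 downto 0);
    variable hi_ext : unsigned(width-half+1 downto 0);
  begin
    if rising_edge(clk) then
      -- Stage 1: add the lower halves; cin enters through the appended LSB.
      lo_ext := (resize(a(half-1 downto 0), half+1) & cin)
              + (resize(b(half-1 downto 0), half+1) & '1');
      lo_sum_r  <= lo_ext(half downto 1);
      lo_cout_r <= lo_ext(half+1);
      a_hi_r    <= a(width-1 downto half);
      b_hi_r    <= b(width-1 downto half);

      -- Stage 2: add the registered upper halves with the registered carry.
      hi_ext := (resize(a_hi_r, width-half+1) & lo_cout_r)
              + (resize(b_hi_r, width-half+1) & '1');
      s    <= hi_ext(width-half downto 1) & lo_sum_r;
      cout <= hi_ext(width-half+1);
    end if;
  end process;
end architecture rtl;
```

The result appears two clock cycles after the operands are applied. The stage-1 register already has to carry the partial sum, a carry, and both upper operand halves, which illustrates the register cost noted under Pipelining.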


Low power
  Power-related properties of arithmetic circuits :
    High glitching activity due to high bit dependencies and large logic depth
  Power reduction in arithmetic circuits [23] :
    Reduce the switched capacitance by choosing an area-efficient circuit architecture
    Allow for lower supply voltage by speeding up the circuitry
    Reduce the transition activity :
      apply stable inputs while circuit is not in use (→ disabling subcircuits)
      reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
      reduce glitching transitions by reducing logic depth (pipelining)
      take advantage of correlated data streams
      choose appropriate number representations (e.g. Gray codes for counters)

9.5 Testability

Testability goal : high fault coverage with few test vectors that are easy to generate/apply
Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits
Special test vectors : sometimes hard to generate and apply, required for coverage of hard-to-detect faults, which are inherent in most arithmetic circuits
Hard-to-detect faults found in :
  circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder flags
  circuits using redundant number representations (→ redundant hardware) : dividers (Pentium bug!)
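
Why the MSBs of incrementers and counters are hard to cover with random test vectors (Section 9.5) can be seen from a short worked estimate. The symbols n (word length), a_j (input bits) and c_i (carry into bit i) are notation assumed here, and the input vectors are assumed to be uniformly random and independent.

```latex
% Carry into bit i of an n-bit incrementer (a + 1): all lower bits must be 1
c_i = \prod_{j=0}^{i-1} a_j , \qquad
P(c_i = 1) = 2^{-i}

% Exciting a stuck-at-0 fault on the carry into the MSB (i = n-1)
P(c_{n-1} = 1) = 2^{-(n-1)} , \qquad
E[\text{random vectors until first excitation}] = 2^{\,n-1}

% Example: n = 32
2^{31} \approx 2.1 \times 10^{9} \ \text{random vectors on average}
```

A single special vector with all lower n-1 bits set to 1 excites the same fault directly, which is why special test vectors are required for such circuits.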

Bibliography

[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993.

[2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.

[3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.

[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.

[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhäuser Boston, 1997.

[6] Proceedings of the Xth Symposium on Computer Arithmetic.

[7] IEEE Transactions on Computers.

[8] D. R. Lutz and D. N. Jayasimha, “Programmable modulo-k counters”, IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939–941, Nov. 1996.

[9] H. Makino et al., “An 8.8-ns 54 × 54-bit multiplier with high speed redundant binary architecture”, IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773–783, June 1996.

[10] W. N. Holmes, “Composite arithmetic: Proposal for a new standard”, IEEE Computer, vol. 30, no. 3, pp. 65–73, Mar. 1997.

[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.

[12] A. Tyagi, “A reduced-area scheme for carry-select adders”, IEEE Trans. Comput., vol. 42, no. 10, pp. 1162–1170, Oct. 1993.

[13] T. Han and D. A. Carlson, “Fast area-efficient VLSI adders”, in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49–56.

[14] D. W. Dobberpuhl et al., “A 200-MHz 64-b dual-issue CMOS microprocessor”, IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555–1564, Nov. 1992.

[15] A. De Gloria and M. Olivieri, “Statistical carry lookahead adders”, IEEE Trans. Comput., vol. 45, no. 3, pp. 340–347, Mar. 1996.

[16] V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach”, IEEE Trans. Comput., vol. 45, no. 3, pp. 294–305, Mar. 1996.

[17] Z. Wang, G. A. Jullien, and W. C. Miller, “A new design technique for column compression multipliers”, IEEE Trans. Comput., vol. 44, no. 8, pp. 962–970, Aug. 1995.


[18] J. Cortadella and J. M. Llaberia, “Evaluation of A + B = K conditions without carry propagation”, IEEE Trans. Comput., vol. 41, no. 11, pp. 1484–1488, Nov. 1992.

[19] S. E. McQuillan and J. V. McCanny, “Fast VLSI algorithms for division and square root”, J. VLSI Signal Processing, vol. 8, pp. 151–168, Oct. 1994.

[20] Y. H. Hu, “CORDIC-based VLSI architectures for digital signal processing”, IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16–35, July 1992.

[21] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997.

[22] R. Zimmermann, “VHDL Library of Arithmetic Units”, http://www.iis.ee.ethz.ch/~zimmi/arith_lib.html.

[23] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.
