Comp Arith Notes

Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich
Institut für Integrierte Systeme
Integrated Systems Laboratory
Lecture notes on
Computer Arithmetic:
Principles, Architectures,
and VLSI Design
March 16, 1999
Reto Zimmermann
Integrated Systems Laboratory

Swiss Federal Institute of Technology (ETH)
CH-8092 Zürich, Switzerland
zimmermann@iis.ee.ethz.ch
Copyright © 1999 by Integrated Systems Laboratory, ETH Zürich

http://www.iis.ee.ethz.ch/~zimmi/publications/comp_arith_notes.ps.gz
Contents

1 Introduction and Conventions
    1.1 Outline
    1.2 Motivation
    1.3 Conventions
    1.4 Recursive Function Evaluation
2 Arithmetic Operations
    2.1 Overview
    2.2 Implementation Techniques
3 Number Representations
    3.1 Binary Number Systems (BNS)
    3.2 Gray Numbers
    3.3 Redundant Number Systems
    3.4 Residue Number Systems (RNS)
    3.5 Floating-Point Numbers
    3.6 Logarithmic Number System
    3.7 Antitetrational Number System
    3.8 Composite Arithmetic
    3.9 Round-Off Schemes
4 Addition
    4.1 Overview
    4.2 1-Bit Adders, (m, k)-Counters
    4.3 Carry-Propagate Adders (CPA)
    4.4 Carry-Save Adder (CSA)
    4.5 Multi-Operand Adders
    4.6 Sequential Adders
5 Simple / Addition-Based Operations
    5.1 Complement and Subtraction
    5.2 Increment / Decrement
    5.3 Counting
    5.4 Comparison, Coding, Detection
    5.5 Shift, Extension, Saturation
    5.6 Addition Flags
    5.7 Arithmetic Logic Unit (ALU)
6 Multiplication
    6.1 Multiplication Basics
    6.2 Unsigned Array Multiplier
    6.3 Signed Array Multipliers
    6.4 Booth Recoding
    6.5 Wallace Tree Addition
    6.6 Multiplier Implementations
    6.7 Composition from Smaller Multipliers
    6.8 Squaring
7 Division / Square Root Extraction
    7.1 Division Basics
    7.2 Restoring Division
    7.3 Non-Restoring Division
    7.4 Signed Division
    7.5 SRT Division
    7.6 High-Radix Division
    7.7 Division by Multiplication
    7.8 Remainder / Modulus
    7.9 Divider Implementations
    7.10 Square Root Extraction
8 Elementary Functions
    8.1 Algorithms
    8.2 Integer Exponentiation
    8.3 Integer Logarithm
9 VLSI Design Aspects
    9.1 Design Levels
    9.2 Synthesis
    9.3 VHDL
    9.4 Performance
    9.5 Testability
Bibliography

1 Introduction and Conventions

1.1 Outline

Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7]
Circuit architectures and implementations of main arithmetic operations
Aspects regarding VLSI design of arithmetic units

1.2 Motivation

Arithmetic units are, among others, the core of every data path and addressing unit
Data path is the core of :
    microprocessors (CPU)
    signal processors (DSP)
    data-processing application specific ICs (ASIC) and programmable ICs (e.g. FPGA)
Standard arithmetic units are available from libraries
Design of arithmetic units is necessary for :
    non-standard operations
    high-performance components
    library development

1.3 Conventions

Naming conventions
    Signal buses : $A$ (1-D), $A_i$ (2-D), $a_{i:k}$ (subbus, 1-D)
    Signals : $a$, $a_i$ (1-D), $a_{i,k}$ (2-D), $A_{i:k}$ (group signal)
    Circuit complexity measures : $A$ (area), $T$ (cycle time, delay), $AT$ (area-time product), $L$ (latency, # cycles)
    Arithmetic operators : $+$, $-$, $\cdot$, $/$, $\log$ ($= \log_2$)
    Logic operators : $\vee$ (or), $\wedge$ (and), $\oplus$ (xor), $\odot$ (xnor), $\bar{a}$ (not)

Circuit complexity measures
    Unit-gate model (≈ gate-equivalents (GE) model) :
        Inverter, buffer : $A = 0$, $T = 0$ (i.e. ignored)
        Simple monotonic 2-input gates (AND, NAND, OR, NOR) : $A = 1$, $T = 1$
        Simple non-monotonic 2-input gates (XOR, XNOR) : $A = 2$, $T = 2$
        Simple $m$-input gates : $A = m - 1$, $T = \lceil \log m \rceil$
        Complex gates : composed from simple gates
    Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization)
    Only estimations given for complex circuits
1.4 Recursive Function Evaluation

Given : inputs $a_i$, outputs $z_i$, function $f$ (graph sym. : •)
[Figures: 4-input examples with inputs a3 ... a0 and outputs z3 ... z0 (or single output z) for each structure below]

Non-recursive functions (n.)
    Output $z_i$ is a function of input $a_i$ (or : const.) :
        $z_i = f(a_i)$ ; $i = 0, \ldots, n-1$
    parallel structure : $A = O(n)$, $T = O(1)$

Recursive functions (r.)
    Output $z_i$ is a function of all inputs $a_k$, $k \le i$

    a) with single output $z = z_{n-1}$ (r.s.) :
        $t_i = f(a_i, t_{i-1})$ ; $i = 0, \ldots, n-1$ ; $t_{-1} = 0/1$ ; $z = t_{n-1}$
        1. $f$ is non-associative (r.s.n.)
            serial structure : $A = O(n)$, $T = O(n)$
        2. $f$ is associative (r.s.a.)
            serial or single-tree structure : $A = O(n)$, $T = O(\log n)$

    b) with multiple outputs $z_i$ (r.m.) (⇒ prefix problem) :
        $z_i = f(a_i, z_{i-1})$ ; $i = 0, \ldots, n-1$ ; $z_{-1} = 0/1$
        1. $f$ is non-associative (r.m.n.)
            serial structure : $A = O(n)$, $T = O(n)$
        2. $f$ is associative (r.m.a.)
            serial or multi-tree structure : $A = O(n^2)$, $T = O(\log n)$
            or shared-tree structure : $A = O(n \log n)$, $T = O(\log n)$
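The difference between the serial and the shared-tree structure for the associative multiple-output case (r.m.a.) can be illustrated in software. The following is a minimal sketch, not part of the original notes; the function names `serial_prefix` and `shared_tree_prefix` are illustrative, and the tree version assumes that `op` is associative.

```python
def serial_prefix(xs, op):
    """z_i = op(x_i, z_{i-1}): n-1 operations on a chain of depth n-1."""
    out = [xs[0]]
    for x in xs[1:]:
        out.append(op(x, out[-1]))
    return out


def shared_tree_prefix(xs, op):
    """Divide-and-conquer evaluation with shared subtrees.

    Depth is ceil(log2 n); the result equals serial_prefix only if
    op is associative (assumption of this sketch).
    """
    n = len(xs)
    if n == 1:
        return list(xs)
    half = n // 2
    low = shared_tree_prefix(xs[:half], op)
    high = shared_tree_prefix(xs[half:], op)
    # Every output of the upper half absorbs the complete lower half.
    return low + [op(h, low[-1]) for h in high]


# Operator convention: op(upper, lower); string concatenation is
# associative but not commutative, so operand order matters.
concat = lambda upper, lower: lower + upper
```

Both functions return the same list for an associative operator; only the dependency depth between operations differs, which is what the $T = O(n)$ versus $T = O(\log n)$ entries express.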
2 Arithmetic Operations

2.1 Overview

[Figure: dependency graph of arithmetic operations for fixed-point numbers (<<, >> ; =, < ; +1, −1 ; +/− ; +, − ; × ; / ; sqrt(x) ; exp(x) ; log(x) ; trig(x) ; hyp(x)) and for floating-point numbers (+, − ; × ; remaining operations same as for fixed-point); arrows mark "based on operation" and "related operation"]

 1 shift/extension             7 division
 2 comparison                  8 square root extraction
 3 increment/decrement         9 exponential function
 4 complement                 10 logarithm function
 5 addition/subtraction       11 trigonometric functions
 6 multiplication             12 hyperbolic functions

2.2 Implementation Techniques

Direct implementation of dedicated units :
    always : 1 – 5
    in most cases : 6
    sometimes : 7, 8
Sequential implementation using simpler units and several clock cycles (⇒ decomposition) :
    sometimes : 6
    in most cases : 7, 8, 9
Table look-up techniques using ROMs :
    universal : simple application to all operations
    efficient only for single-operand operations of high complexity (8 – 12) and small word length (note: ROM size $= 2^n \times n$)
Approximation techniques using simpler units : 7 – 12
    Taylor series expansion
    polynomial and rational approximations
    convergence of recursive equation systems
    CORDIC (COordinate Rotation DIgital Computer)
3 Number Representations

3.1 Binary Number Systems (BNS)

Radix-2, binary number system (BNS) : irredundant, weighted, positional, monotonic [1, 2]
$n$-bit number is ordered sequence of bits (binary digits) :
    $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_2$ , $a_i \in \{0, 1\}$
Simple and efficient implementation in digital circuits
MSB/LSB (most-/least-significant bit) : $a_{n-1}$ / $a_0$
Represents an integer or fixed-point number, exact
Fixed-point numbers : $(a_{m-1}, \ldots, a_0 \,.\, a_{-1}, \ldots, a_{-f})$ with $m$-bit integer and $f$-bit fraction

Unsigned : positive or natural numbers
    Value : $A = \sum_{i=0}^{n-1} a_i 2^i$
    Range : $[0, 2^n - 1]$

Two's (2's) complement : standard representation of signed or integer numbers
    Value : $A = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$
    Range : $[-2^{n-1}, 2^{n-1} - 1]$
    Complement : $-A = 2^n - A = \bar{A} + 1$ , where $\bar{A} = (\bar{a}_{n-1}, \bar{a}_{n-2}, \ldots, \bar{a}_0)$
    Sign : $a_{n-1}$
    Properties : asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)

One's (1's) complement : similar to 2's complement
    Value : $A = -a_{n-1} (2^{n-1} - 1) + \sum_{i=0}^{n-2} a_i 2^i$
    Range : $[-(2^{n-1} - 1), 2^{n-1} - 1]$
    Complement : $-A = 2^n - A - 1 = \bar{A}$
    Sign : $a_{n-1}$
    Properties : double representation of zero, symmetric range, modulo $(2^n - 1)$ number system

Sign-magnitude : alternative representation of signed numbers
    Value : $A = (-1)^{a_{n-1}} \sum_{i=0}^{n-2} a_i 2^i$
    Range : $[-(2^{n-1} - 1), 2^{n-1} - 1]$
    Complement : $-A = (\bar{a}_{n-1}, a_{n-2}, \ldots, a_0)$
    Sign : $a_{n-1}$
    Properties : double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 (⇒ low power)

Graphical representation
[Figure: binary number representations 000...0, 011...1, 100...0, 111...1 plotted against their values between $-2^{n-1}$, 0, $2^{n-1}$ and $2^n$ for unsigned, 2's complement, 1's complement, and sign-magnitude numbers]

Conventions
    2's complement used for signed numbers in these notes
    Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned

3.2 Gray Numbers

Gray numbers (code) : binary, irredundant, non-weighted, non-monotonic
+ Property : unit-distance coding (i.e. exactly one bit toggles between adjacent numbers)
Applications : counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)
– Non-monotonic numbers : difficult arithmetic operations, e.g. addition, comparison :
    $(0,0)_{Gray} < (0,1)_{Gray}$ and $0 < 1$ in the LSB, but $(1,1)_{Gray} < (1,0)_{Gray}$ although $1 > 0$ in the LSB

binary → Gray : $g_i = b_{i+1} \oplus b_i$ ; $i = 0, \ldots, n-1$ ; $b_n = 0$ (n.)
Gray → binary : $b_i = b_{i+1} \oplus g_i$ ; $i = n-1, \ldots, 0$ ; $b_n = 0$ (r.m.a.)

    value   binary    Gray
      0     0 0 0 0   0 0 0 0
      1     0 0 0 1   0 0 0 1
      2     0 0 1 0   0 0 1 1
      3     0 0 1 1   0 0 1 0
      4     0 1 0 0   0 1 1 0
      5     0 1 0 1   0 1 1 1
      6     0 1 1 0   0 1 0 1
      7     0 1 1 1   0 1 0 0
      8     1 0 0 0   1 1 0 0
      9     1 0 0 1   1 1 0 1
     10     1 0 1 0   1 1 1 1
     11     1 0 1 1   1 1 1 0
     12     1 1 0 0   1 0 1 0
     13     1 1 0 1   1 0 1 1
     14     1 1 1 0   1 0 0 1
     15     1 1 1 1   1 0 0 0
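The two Gray conversions can be checked against the 4-bit table with a few lines of code. This is a minimal sketch on Python integers, not a hardware description; the function names are illustrative. The Gray-to-binary direction makes the recursive (prefix) nature of the conversion visible, since each binary bit is the XOR of all Gray bits at and above its position.

```python
def binary_to_gray(b: int) -> int:
    """g_i = b_{i+1} xor b_i, computed for all bit positions at once."""
    return b ^ (b >> 1)


def gray_to_binary(g: int) -> int:
    """b_i = g_i xor g_{i+1} xor ... xor g_{n-1} (XOR prefix from the MSB)."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b
```

For example, `binary_to_gray(6)` yields `5` (binary 0110 maps to Gray 0101), which agrees with the row for the value 6.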
3.3 Redundant Number Systems

Non-binary, redundant, weighted number systems [1, 2]
Digit set larger than radix (typically radix 2) ⇒ multiple representations of same number ⇒ redundancy
+ No carry-propagation in adders ⇒ more efficient impl. of adder-based units (e.g. multipliers and dividers)
– Redundancy ⇒ no direct implementation of relational operators ⇒ conversion to irredundant numbers
– Several bits used to represent one digit ⇒ higher storage requirements
– Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)

Delayed-carry or half-adder number representation :
    $r_i \in \{0, 1, 2\}$ , $c_i, s_i \in \{0, 1\}$
    $r_i = 2 c_{i+1} + s_i$ , $c_{i+1} s_i = 0$
    $A = \sum_{i=0}^{n-1} r_i 2^i$
    1 digit holds sum of 2 bits (no carry-out digit)
    irredundant representation of $-1$ [8]

Carry-save number representation :
    $r_i \in \{0, 1, 2, 3\}$ , $c_i, s_i \in \{0, 1\}$
    $r_i = 2 c_{i+1} + s_i$
    $A = \sum_{i=0}^{n-1} r_i 2^i$
    1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved)
    standard redundant number system for fast addition

Signed-digit (SD) or redundant digit (RD) number representation :
    $r_i \in \{-1, 0, 1\} = \{\bar{1}, 0, 1\}$ , $A = \sum_{i=0}^{n-1} r_i 2^i$
    no carry-propagation in $S = A + B$ : 1 digit holds sum of 2 digits (no carry-out digit)
    is redundant (e.g. $-1 = 0\bar{1} = \bar{1}1$)
    minimal SD representation : minimal number of non-zero digits, $011 \ldots 10 = 100 \ldots 0\bar{1}0$
        applications : sequential multiplication (less cycles), filters with constant coefficients (less hardware)
        example : $7 = 0111 = 1\bar{1}11 = 10\bar{1}1 = 100\bar{1}$ (minimal) $= 1\bar{1}\bar{1}11$
    canonical SD repres. : minimal SD + not two non-zero digits in sequence
    SD → binary : carry-propagation necessary (⇒ adder)
    other applications : high-speed multipliers [9]
    similar to carry-save, simple use for signed numbers
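The canonical SD representation (minimal number of non-zero digits, no two non-zero digits in sequence) can be produced by a simple right-to-left recoding. The sketch below is an illustration and not taken from the notes; it uses the common non-adjacent-form recoding rule, and the name `canonical_sd` is hypothetical.

```python
def canonical_sd(value: int) -> list[int]:
    """Return digits in {-1, 0, 1}, least significant first.

    An odd value with remainder 1 (mod 4) gets digit +1, with
    remainder 3 (mod 4) gets digit -1; this forces the next digit
    to be 0, so non-zero digits are never adjacent.
    """
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)  # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits
```

For the value 7 the result is `[-1, 0, 0, 1]` (least significant digit first), i.e. $100\bar{1}$, the minimal form given in the example.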
3.4 Residue Number Systems (RNS)

Non-binary, irredundant, non-weighted number system [1]
+ Carry-free and fast additions and multiplications
– Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection) because digits are not weighted, conversion to weighted mixed-radix or binary system required
Codes for error detection and correction [1]
Possible applications (but hardly used) :
    digital filters : fast additions and multiplications
    error detection and correction for arithmetic operations in conventional and residue number systems

Base is $n$-tuple of integers $(m_{n-1}, m_{n-2}, \ldots, m_0)$ , residues (or moduli) $m_i$ pairwise relatively prime
    $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_{m_{n-1}, \ldots, m_0}$ , $a_i \in \{0, 1, \ldots, m_i - 1\}$
    Range : $M = \prod_{i=0}^{n-1} m_i$ , anywhere in $\mathbb{Z}$
    $a_i = A \bmod m_i = |A|_{m_i}$

Arithmetic operations : (each digit computed separately)
    $z_i = |a_i \circ b_i|_{m_i}$ for $\circ \in \{+, -, \cdot\}$
    $|a_i^{-1}|_{m_i} = |a_i^{m_i - 2}|_{m_i}$ (Fermat's theorem)

Best moduli are $2^k$ and $2^k - 1$ :
    high storage efficiency with $k$ bits
    simple modular addition :
        $2^k$ : $k$-bit adder without $c_{out}$
        $2^k - 1$ : $k$-bit adder with end-around carry ($c_{in} = c_{out}$)

Example : $(m_1, m_0) = (3, 2)$ , range $M = 6$

    A     −4 −3 −2 −1  0  1  2  3  4  5  6  7  8
    a_1    2  0  1  2  0  1  2  0  1  2  0  1  2
    a_0    0  1  0  1  0  1  0  1  0  1  0  1  0
                      (possible range : any 6 consecutive values)

    $4 = (1, 0)$ , $5 = (2, 1)$
    $|4 + 5|_6 = (1, 0) + (2, 1) = (|1 + 2|_3, |0 + 1|_2) = (0, 1) = 3$
    $|4 \cdot 5|_6 = (1, 0) \cdot (2, 1) = (|1 \cdot 2|_3, |0 \cdot 1|_2) = (2, 0) = 2$
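The digit-wise, carry-free arithmetic of the (3, 2) example can be reproduced directly. This is a small sketch under the assumption that the result range is taken as $[0, M)$; the helper names are illustrative, and the reverse conversion is done by brute-force search instead of a proper mixed-radix or Chinese-remainder conversion.

```python
from math import prod

MODULI = (3, 2)  # base (m_1, m_0) from the example, range M = 6


def to_rns(a, moduli=MODULI):
    """Residue digits a_i = a mod m_i (Python's % is non-negative)."""
    return tuple(a % m for m in moduli)


def rns_add(x, y, moduli=MODULI):
    """Each digit is added separately, no carries between digits."""
    return tuple((xi + yi) % m for xi, yi, m in zip(x, y, moduli))


def rns_mul(x, y, moduli=MODULI):
    """Each digit is multiplied separately."""
    return tuple((xi * yi) % m for xi, yi, m in zip(x, y, moduli))


def from_rns(x, moduli=MODULI):
    """Brute-force conversion back to the range [0, M) (sketch only)."""
    return next(a for a in range(prod(moduli)) if to_rns(a, moduli) == tuple(x))
```

The slow part of an RNS is visible here as well: `from_rns` needs the complete digit tuple, whereas `rns_add` and `rns_mul` never look beyond a single digit position.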
3.5 Floating-Point Numbers

Larger range, smaller precision than fixed-point representation, inexact, real numbers [1, 2]
Double-number form ⇒ discontinuous precision
Format : sign S | biased exponent E | unsigned norm. mantissa M
Basic arithmetic operations :
    based on fixed-point add, multiply, and shift operations
    postnormalization required
Applications :
    processors : "real" floating-point formats (e.g. IEEE standard), large range due to universal use
    ASICs : usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers

IEEE floating-point format :

    precision   n    n_M   n_E   bias   range                 precision
    single      32   23    8     127    $3.8 \cdot 10^{38}$   $10^{-7}$
    double      64   52    11    1023   $9 \cdot 10^{307}$    $10^{-15}$

3.6 Logarithmic Number System

Alternative representation to floating-point (i.e. mantissa + integer exponent ⇒ only fixed-point exponent) [1]
Single-number form ⇒ continuous precision ⇒ higher accuracy, more reliable
Format : sign S | biased fixed-point exponent E (signed-logarithmic)
Basic arithmetic operations (additionally consider sign) :
    multiplication, division, and exponentiation : fixed-point operations on the exponents
    addition, subtraction : by approximation or addition in conventional number system and double conversion
+ Simpler multiplication/exponent., more complex addition
– Expensive conversion : (anti)logarithms (table look-up)
Applications : real-time digital filters

3.7 Antitetrational Number System

Tetration (t., iterated exponentiation $2^{2^{\cdot^{\cdot^{\cdot}}}}$) and antitetration (a.t.) [10]
Larger range, smaller precision than logarithmic repres., otherwise analogous (i.e. $2^x$ ⇒ t. $x$, $\log x$ ⇒ a.t. $x$)
3.8 Composite Arithmetic

Proposal for a new standard of number representations [10]
Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers
Secondary forms used for numbers not representable by primary ones (⇒ no over-/underflow handling necessary)
Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy
Number representations :

    form              tag   value
    integer           00    2's complement integer
    rational          01    slash | denominator | numerator
    logarithmic       10    log integer | log fraction
    antitetrational   11    a.t. integer | a.t. fraction

Rational numbers : slash position (i.e. size of numerator/denominator) is variable and stored (floating slash)
Storage form sizes : 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)
Implementation : mixed hardware/software solutions
Hardware proposal : long accumulator (4096 bits) holds any floating-point number in fixed-point format ⇒ higher accuracy ⇒ large hardware/software overhead

3.9 Round-Off Schemes

Intermediate results with $d$ additional lower bits (⇒ higher accuracy) :
    $A = (a_{n-1}, \ldots, a_0 \,.\, a_{-1}, \ldots, a_{-d})$
Rounding : keeping error $\epsilon$ small during final word length reduction :
    $R = (r_{n-1}, \ldots, r_0) = A - \epsilon$
Trade-off : numerical accuracy vs. implementation cost
Truncation :
    $R = (a_{n-1}, \ldots, a_0)$
    bias (= average error $\epsilon$) of magnitude $\frac{1}{2} - 2^{-(d+1)}$
Round-to-nearest (i.e. normal rounding) :
    $R = \mathrm{TRUNC}(A + \frac{1}{2})$
    bias of magnitude $2^{-(d+1)}$ (nearly symmetric)
    "$+\frac{1}{2}$" can often be included in previous operation
Round-to-nearest-even/-odd :
    as round-to-nearest, except that a tie (discarded bits $a_{-1}, \ldots, a_{-d} = 1, 0, \ldots, 0$) is rounded to the even/odd neighbour
    bias $= 0$ (symmetric)
    mandatory in IEEE floating-point standard
3 guard bits for rounding after floating-point operations : guard bit (postnormalization), round bit (round-to-nearest), sticky bit (round-to-nearest-even)
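The three round-off schemes can be compared on fixed-point values stored as integers with $d$ fraction bits. The sketch below is illustrative only; it assumes $d \ge 1$, relies on Python's arithmetic right shift (which rounds towards minus infinity), and uses hypothetical function names.

```python
def truncate(a: int, d: int) -> int:
    """Drop the d lowest bits."""
    return a >> d


def round_nearest(a: int, d: int) -> int:
    """Add 1/2 LSB of the result (2^(d-1)) before truncating."""
    return (a + (1 << (d - 1))) >> d


def round_nearest_even(a: int, d: int) -> int:
    """Like round_nearest, but a tie is resolved to the even neighbour."""
    r = round_nearest(a, d)
    tie = (a & ((1 << d) - 1)) == (1 << (d - 1))  # discarded bits are 10...0
    return r & ~1 if tie else r
```

With $d = 2$, the input 10 represents 2.5: truncation gives 2, round-to-nearest gives 3, and round-to-nearest-even gives 2, while the input 14 (3.5) is rounded to 4 by both rounding variants.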
4 Addition

4.1 Overview

[Figure: overview of adder components and their relations ("based on component", "related component"): 1-bit adders (HA, FA, (m,k), (m,2)); carry-propagate adders CPA (RCA, CSKA, CSLA, CIA, CLA, PPA, COSA); carry-save adders (3-operand CSA); multi-operand adders built from CPAs or from CSAs (adder array, adder tree)]

Legend :
    HA : half-adder                 CPA : carry-propagate adder     CLA : carry-lookahead adder
    FA : full-adder                 RCA : ripple-carry adder        PPA : parallel-prefix adder
    (m,k) : (m,k)-counter           CSKA : carry-skip adder         COSA : conditional-sum adder
    (m,2) : (m,2)-compressor        CSLA : carry-select adder
    CIA : carry-increment adder     CSA : carry-save adder

4.2 1-Bit Adders, (m, k)-Counters

Add up $m$ bits of same magnitude (i.e. 1-bit numbers)
Output sum as $k$-bit number ($k = \lfloor \log m \rfloor + 1$)
or : count 1's at inputs ⇒ (m, k)-counter [3] (combinational counters)

Half-adder (HA), (2, 2)-counter
    $2 c_{out} + s = a + b$
    $s = a \oplus b$ (sum)
    $c_{out} = a b$ (carry-out)
    $A = 3$ , $T = 2$ (1)
[Figure: half-adder symbol and two gate-level schematics with inputs a, b and outputs c_out, s]
Full-adder (FA), (3, 2)-counter
    $2 c_{out} + s = a + b + c_{in}$
    $g = a b$ (generate)
    $p = a \oplus b$ (propagate)
    $s = a \oplus b \oplus c_{in} = p \oplus c_{in}$
    $c_{out} = a b \vee a c_{in} \vee b c_{in} = g \vee p\, c_{in}$
    $A = 7$ , $T = 4$ (2)
[Figure: full-adder symbol and five gate-level schematics (HA-based, with generate/propagate signals, and multiplexer-based variants)]

(m, k)-counters
[Figure: (m,k)-counter symbol with inputs a_0 ... a_{m-1} and outputs s_{k-1} ... s_0]
    Usually built from full-adders
    Associativity of addition allows conversion from linear to tree structure ⇒ faster at same number of FAs

Example : (7, 3)-counter
    linear structure : $A = 28$ , $T = 14$
    tree structure : $A = 28$ , $T = 10$
[Figure: (7,3)-counter built from four full-adders in linear structure and in tree structure, inputs a_0 ... a_6, outputs s_2, s_1, s_0]
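A bit-level model helps to check the generate/propagate form of the full-adder and the construction of a (7, 3)-counter from four full-adders. The wiring below is one possible tree arrangement chosen for this sketch; it is not necessarily the arrangement drawn in the original figure, and the function names are illustrative.

```python
def full_adder(a, b, c_in):
    """Return (c_out, s) using generate g = a b and propagate p = a xor b."""
    p = a ^ b
    g = a & b
    return g | (p & c_in), p ^ c_in


def counter_7_3(bits):
    """Count the ones among 7 input bits, returned as (s2, s1, s0).

    Two FAs compress six inputs, a third FA adds their sum bits and
    the seventh input, a fourth FA adds the three weight-2 carries.
    """
    a0, a1, a2, a3, a4, a5, a6 = bits
    c1, t1 = full_adder(a0, a1, a2)
    c2, t2 = full_adder(a3, a4, a5)
    c3, s0 = full_adder(t1, t2, a6)
    s2, s1 = full_adder(c1, c2, c3)
    return s2, s1, s0
```

Four full-adders correspond to the area $A = 4 \cdot 7 = 28$ quoted for the example; the linear and tree structures differ only in how these four cells are chained.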
4.3 Carry-Propagate Adders (CPA)

Add two $n$-bit operands $A$ and $B$ and an optional carry-in $c_{in}$ by performing carry-propagation [1, 2, 11]
Sum $(c_{out}, S)$ is irredundant $(n+1)$-bit number
    $2^n c_{out} + S = A + B + c_{in}$
    $2 c_{i+1} + s_i = a_i + b_i + c_i$ ; $i = 0, 1, \ldots, n-1$ ; $c_0 = c_{in}$ , $c_{out} = c_n$ (r.m.a.)
[Figure: CPA symbol with inputs A, B, c_in and outputs c_out, S]

Ripple-carry adder (RCA)
    Serial arrangement of full-adders
    Simplest, smallest, and slowest CPA structure
    $A = 7n$ , $T = 2n$ , $AT = 14 n^2$
[Figure: ripple-carry adder, chain of FAs with carries c_in, c_1, c_2, ..., c_{n-1}, c_out]

Carry-propagation speed-up techniques
    a) Concatenation of partial CPAs with fast $c_{in} \rightarrow c_{out}$
[Figure: partial CPAs over the bit groups n-1:j, ..., i-1:k, k-1:0, linked by the carries c_j, c_i, c_k]
    b) Fast carry look-ahead logic for entire range of bits
[Figure: preprocessing, carry propagation, and postprocessing stages over bits n-1 ... 0]
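The ripple-carry recursion $2 c_{i+1} + s_i = a_i + b_i + c_i$ translates directly into a loop over the bit positions. The following sketch models the serial carry chain on Python integers; the word length `n` and the function name are choices made for illustration.

```python
def ripple_carry_add(a: int, b: int, c_in: int = 0, n: int = 8):
    """Return (c_out, s) for n-bit operands, one full-adder per bit."""
    s = 0
    c = c_in
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        p = ai ^ bi           # propagate
        g = ai & bi           # generate
        s |= (p ^ c) << i     # sum bit s_i
        c = g | (p & c)       # carry c_{i+1} depends on c_i: serial chain
    return c, s
```

Each loop iteration needs the carry of the previous one, which is the software counterpart of the delay $T = 2n$.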
Carry-skip adder (CSKA)

Type a) : partial CPA with fast $c_k \rightarrow c_i$
    $c_i = \bar{P}_{i-1:k}\, c'_i \vee P_{i-1:k}\, c_k$ (bit group $a_{i-1:k}$)
    $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
    1) $P_{i-1:k} = 0$ : $c_k \nrightarrow c'_i$ and $c'_i$ selected ($c'_i \rightarrow c_i$)
    2) $P_{i-1:k} = 1$ : $c_k \rightarrow c'_i$ but skipped ($c_k \rightarrow c_i$)
    ⇒ path $c_k \rightarrow c'_i \rightarrow c_i$ never sensitized ⇒ fast
    ⇒ false path ⇒ inherent logic redundancy ⇒ problems in circuit optimization, timing analysis, and testing
Variable group sizes (faster) : larger groups in the middle (minimize the delays of the longest paths)
Partial CPA typ. is RCA or CSKA (⇒ multilevel CSKA)
Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)
    $A = 8n$ , $T = 4 n^{1/2}$ , $AT = 32 n^{3/2}$
[Figure: carry-skip adder with partial CPAs over the groups n-1:j, i-1:k, k-1:0, group propagate P_{i-1:k}, and a multiplexer selecting between c'_i and c_k]

Carry-select adder (CSLA)

Type a) : partial CPA with fast $c_k \rightarrow c_i$ and $c_k \rightarrow s_{i-1:k}$
    $s_{i-1:k} = \bar{c}_k\, s^0_{i-1:k} \vee c_k\, s^1_{i-1:k}$
    $c_i = \bar{c}_k\, c^0_i \vee c_k\, c^1_i$
Two CPAs compute two possible results ($c_{in} = 0/1$), group carry-in selects correct one afterwards
Variable group sizes (faster) : larger groups at end (MSB) (balance the delays of the group carry-in and of the partial CPAs)
Part. CPA typ. is RCA, CSLA (⇒ multil. CSLA), or CLA
High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)
    $A = 14n$ , $T = 2.8 n^{1/2}$ , $AT = 39 n^{3/2}$
[Figure: carry-select adder, two partial CPAs per group with carry-in 0 and 1, multiplexers controlled by c_k select s_{i-1:k} and c_i]
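The carry-select principle (compute both candidate results per group, let the group carry-in choose) can be modelled as follows. This is a behavioural sketch only: the partial CPAs are modelled with integer addition, and the group sizes `(2, 3, 4, 7)` are an arbitrary illustration of "larger groups at the MSB end", not optimal values from the notes.

```python
def add_group(a, b, c_in, width):
    """Partial CPA model: returns (carry-out, sum) of a width-bit group."""
    total = a + b + c_in
    return total >> width, total & ((1 << width) - 1)


def carry_select_add(a, b, c_in=0, group_sizes=(2, 3, 4, 7)):
    """Return (c_out, s); the word length is sum(group_sizes), LSB group first."""
    s, c, k = 0, c_in, 0
    for width in group_sizes:
        mask = (1 << width) - 1
        ag, bg = (a >> k) & mask, (b >> k) & mask
        c0, s0 = add_group(ag, bg, 0, width)   # result assuming c_k = 0
        c1, s1 = add_group(ag, bg, 1, width)   # result assuming c_k = 1
        c, sg = (c1, s1) if c else (c0, s0)    # multiplexers driven by c_k
        s |= sg << k
        k += width
    return c, s
```

In hardware both candidates of all groups are computed in parallel, so only the chain of multiplexers remains on the carry path; the sequential loop here reproduces the function, not the timing.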
Carry-increment adder (CIA)

Type a) : partial CPA with fast $c_k \rightarrow c_i$ and $c_k \rightarrow s_{i-1:k}$
    $s_{i-1:k} = s'_{i-1:k} + c_k$
    $c_i = c'_i \vee P_{i-1:k}\, c_k$
    $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
Result is incremented after addition, if $c_k = 1$ [12, 11]
Variable group sizes (faster) : larger groups at end (MSB) (balance the delays of the group carry-in and of the partial CPAs)
Part. CPA typ. is RCA, CIA (⇒ multilevel CIA) or CLA
High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)
Logic of CPA and incrementer can be merged [11]
    $A = 10n$ , $T = 2.8 n^{1/2}$ , $AT = 28 n^{3/2}$
[Figure: carry-increment adder, partial CPAs with carry-in 0 produce s'_{i-1:k} and c'_i, followed by an incrementer (+1) controlled by c_k and the group propagate P_{i-1:k}]

Example : gate-level schematic of carry-incr. adder (CIA)
    only 2 different logic cells (bit-slices) : IHA and IFA

    T_max        4   6  10  12  14  16  18  20  22  24  26  28  ...  38
    group size   1   1   2   3   4   5   6   7   8   9  10  11  ...  16
    n_max        1   2   4   7  11  16  22  29  37  46  56  67  ... 137

[Figure: gate-level schematic of one bit group i-1 ... k built from IFA cells and one IHA cell, and the overall arrangement of groups: bit 0 (IHA), bit 1 (IHA), bits 3,2 (IFA + IHA), bits 6...4 (2 IFA + IHA), ..., bits i-1...k ((i-k-1) IFA + IHA)]
Conditional-sum adder (COSA)

Type a) : optimized multilevel CSLA with $\log n$ levels (i.e. double CPAs are merged at higher levels)
Correct sum bits ($s^0_{i-1:k}$ or $s^1_{i-1:k}$) are (conditionally) selected through $\log n$ levels of multiplexers
Bit groups of size $2^l$ at level $l$
Higher parallelism, more balanced signal paths
Highest speed-up at highest hardware overhead (2 RCA + more than $\log n$ MUX/bit)
    $A = 3 n \log n$ , $T = 2 \log n$ , $AT = 6 n \log^2 n$
[Figure: 4-bit conditional-sum adder, level 0 with pairs of FAs (carry-in 0 and 1), levels 1 and 2 with multiplexers, outputs c_out, s3 ... s0]

Carry-lookahead adder (CLA), traditional

Type b) : carries looked ahead before sum bits computed
Typically 4-bit blocks used (e.g. standard IC SN74181)
    $c_1 = g_0 \vee p_0 c_0$
    $c_2 = g_1 \vee p_1 g_0 \vee p_1 p_0 c_0$
    $c_3 = g_2 \vee p_2 g_1 \vee p_2 p_1 g_0 \vee p_2 p_1 p_0 c_0$
    $g'_3 = g_3 \vee p_3 g_2 \vee p_3 p_2 g_1 \vee p_3 p_2 p_1 g_0$
    $p'_3 = p_3 p_2 p_1 p_0$
[Figure: carry-lookahead block (CLB) with inputs (g3,p3) ... (g0,p0) and c'_0, outputs (g'_3,p'_3) and c3 ... c0]
Hierarchical arrangement using $\frac{1}{2} \log n$ levels : $(g', p')$ passed up, $c'_0$ passed down between levels
High speed-up at medium hardware overhead
    $A = 14n$ , $T = 4 \log n$ , $AT = 56 n \log n$
[Figure: 16-bit CLA, four CLBs over (g15,p15) ... (g0,p0) and one second-level CLB, carries c'_12, c'_8, c'_4, c'_0 passed down, c_in and c_out]
    + preprocessing : $g_i = a_i b_i$ , $p_i = a_i \oplus b_i$
    + postprocessing : $s_i = p_i \oplus c_i$
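Parallel-prefix adders (PPA)

Type b) : universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA)
Preprocessing, carry-lookahead, and postprocessing step
Carries calculated using parallel-prefix algorithms
+ High regularity : suitable for synthesis and layout
+ High flexibility : special adders, other arithmetic operations, exchangeable prefix algorithms (i.e. speeds)
+ High performance : smallest and fastest adders
[Figure: general prefix adder structure: preprocessing of (a_i, b_i) and c_in into (g_i, p_i), carry-lookahead by prefix algorithm, postprocessing of p_i and c_i into s_{n-1} ... s_0 and c_out]

Prefix problem

Inputs $(x_{n-1}, \ldots, x_0)$ , outputs $(y_{n-1}, \ldots, y_0)$ , associative binary operator • [11, 13]
    $(y_{n-1}, \ldots, y_0) = (x_{n-1} \bullet \cdots \bullet x_0, \; \ldots, \; x_1 \bullet x_0, \; x_0)$
    or
    $y_0 = x_0$ ; $y_i = x_i \bullet y_{i-1}$ ; $i = 1, \ldots, n-1$ (r.m.a.)
Associativity of • ⇒ tree structures for evaluation :
    $y_3 = x_3 \bullet (x_2 \bullet (x_1 \bullet x_0)) = (x_3 \bullet x_2) \bullet (x_1 \bullet x_0)$ , but $y_2$ = ?
    $Y^1_{1:0} = x_1 \bullet x_0$ , $Y^1_{3:2} = x_3 \bullet x_2$ , $y_1 = Y^1_{1:0}$
    $y_2 = Y^2_{2:0} = x_2 \bullet Y^1_{1:0}$ , $y_3 = Y^2_{3:0} = Y^1_{3:2} \bullet Y^1_{1:0}$
Group variables $Y^l_{i:k}$ : covers bits $(x_k, \ldots, x_i)$ at level $l$
Carry-propagation is prefix problem : $Y^l_{i:k} = (G^l_{i:k}, P^l_{i:k})$
    preprocessing :
        $g_i = a_i b_i$ , $p_i = a_i \oplus b_i$ ; $i = 0, \ldots, n-1$ (with $c_{in}$ merged into bit 0 : $g_0 = a_0 b_0 \vee a_0 c_0 \vee b_0 c_0$)
        $(G^0_{i:i}, P^0_{i:i}) = (g_i, p_i)$
    carry-lookahead (prefix algorithm) :
        $(G^l_{i:k}, P^l_{i:k}) = (G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1}) \bullet (G^{l-1}_{j:k}, P^{l-1}_{j:k}) = (G^{l-1}_{i:j+1} \vee P^{l-1}_{i:j+1} G^{l-1}_{j:k}, \; P^{l-1}_{i:j+1} P^{l-1}_{j:k})$ ; $k \le j < i$
        $c_{i+1} = G^m_{i:0}$ ; $i = 0, \ldots, n-1$
    postprocessing :
        $s_i = p_i \oplus c_i$ ; $i = 0, \ldots, n-1$ ; $c_{out} = c_n$

Parallel-prefix algorithms [11] :
    multi-tree structures ($A = O(n^2)$ , $T = O(\log n)$)
    sharing subtrees ($A = O(n \log n)$ , $T = O(\log n)$)
    different algorithms trading area vs. delay (influences also from wiring and maximum fan-out $FO_{max}$)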
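The carry-propagation prefix operator on (generate, propagate) pairs and the three adder steps can be written out as a small model. This is a sketch under two assumptions that are stated in the comments: the carry-in is merged into the generate signal of bit 0, and the prefix step is evaluated serially (any prefix algorithm could be substituted because the operator is associative). The names `prefix_op` and `prefix_add` are illustrative.

```python
def prefix_op(hi, lo):
    """(G, P) of a group from its upper part `hi` and lower part `lo`."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def prefix_add(a: int, b: int, c_in: int = 0, n: int = 8):
    """Return (c_out, s) for n-bit operands."""
    # Preprocessing: bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    # Assumption: c_in is merged into g_0, so that c_{i+1} = G_{i:0}.
    g0 = g[0] | (p[0] & c_in)
    # Carry-lookahead, here with the serial-prefix algorithm.
    carries = [c_in]
    acc = (g0, p[0])
    carries.append(acc[0])
    for i in range(1, n):
        acc = prefix_op((g[i], p[i]), acc)
        carries.append(acc[0])
    # Postprocessing: s_i = p_i xor c_i.
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return carries[n], s
```

Because `prefix_op` is associative, the serial loop can be replaced by any of the parallel-prefix schedules without changing the result; only the number of operator instances and the depth change.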
Prefix algorithms

Algorithms visualized by directed acyclic graphs (DAG) with array structure ($n$ bits × $m$ levels)
Graph vertex symbols :
    black node • : combines two group variables (contains logic for •)
    white node ○ : passes the group variable on unchanged (contains no logic)
Performance measures :
    $A_\bullet$ : graph size (number of black nodes)
    $T_\bullet$ : graph depth (number of black nodes on critical path)

Serial-prefix algorithm (⇒ RCA)
    $A_\bullet = n - 1$ , $T_\bullet = n - 1$
[Figure: 16-bit serial-prefix graph, bits 15 ... 0, levels 0 ... 15]

Sklansky parallel-prefix algorithm (⇒ PPA-SK)
    Tree-like collection, parallel redistribution of carries
    $A_\bullet = \frac{1}{2} n \log n$ , $T_\bullet = \log n$
[Figure: 16-bit Sklansky prefix graph, bits 15 ... 0, levels 0 ... 4]

Brent-Kung parallel-prefix algorithm (⇒ PPA-BK)
    Traditional CLA is PPA-BK with 4-bit groups
    Tree-like redistribution of carries (fan-out tree)
    $A_\bullet = 2n - \log n - 2$ , $T_\bullet = 2 \log n - 2$
[Figure: 16-bit Brent-Kung prefix graph, bits 15 ... 0, levels 0 ... 6]

Kogge-Stone parallel-prefix algorithm (⇒ PPA-KS)
    very high wiring requirements
    $A_\bullet = n \log n - n + 1$ , $T_\bullet = \log n$
[Figure: 16-bit Kogge-Stone prefix graph, bits 15 ... 0, levels 0 ... 4]

Carry-increment parallel-prefix algorithm (⇒ CIA)
[Figure: 16-bit carry-increment prefix graph, bits 15 ... 0, levels 0 ... 5]

Mixed serial/parallel-prefix algorithm (⇒ RCA + PPA)
    linear size-depth trade-off using parameter $k$ : $0 \le k \le n - 2 \log n + 2$
    $k = 0$ : serial-prefix graph
    $k = n - 2 \log n + 1$ : Brent-Kung parallel-prefix graph
    fills gap between RCA and PPA-BK (i.e. CLA) in steps of single •-operations
    $A_\bullet = n - 1 + k$ , $T_\bullet = n - 1 - k$ , $A_\bullet T_\bullet$ = var.
[Figure: 16-bit mixed serial/parallel-prefix graph, bits 15 ... 0, levels 0 ... 10]
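The size and depth figures of the prefix graphs can be reproduced by generating the graphs and counting black nodes. The sketch below represents a graph as a list of levels, each level being a list of pairs `(i, j)` meaning "column i is combined with column j"; this representation and all function names are assumptions made for illustration, and `n` is assumed to be a power of two.

```python
from math import log2


def serial(n):
    return [[(i, i - 1)] for i in range(1, n)]


def sklansky(n):
    return [[(i, ((i >> l) << l) - 1) for i in range(n) if (i >> l) & 1]
            for l in range(int(log2(n)))]


def kogge_stone(n):
    return [[(i, i - (1 << l)) for i in range(1 << l, n)]
            for l in range(int(log2(n)))]


def brent_kung(n):
    m = int(log2(n))
    # Up-sweep: binary tree collecting the carries.
    up = [[(i, i - (1 << l)) for i in range(n) if (i + 1) % (2 << l) == 0]
          for l in range(m)]
    # Down-sweep: fan-out tree redistributing the carries.
    down = [[(i, i - (1 << l)) for i in range(2 << l, n)
             if (i + 1) % (2 << l) == (1 << l)]
            for l in range(m - 2, -1, -1)]
    return up + down


def evaluate(levels, xs, op):
    """Apply the graph to xs; return (outputs, graph size, graph depth)."""
    y, depth, size = list(xs), [0] * len(xs), 0
    for level in levels:
        prev_y, prev_d = y[:], depth[:]
        for i, j in level:
            y[i] = op(prev_y[i], prev_y[j])   # one black node
            depth[i] = max(prev_d[i], prev_d[j]) + 1
            size += 1
    return y, size, max(depth)
```

Evaluating each graph with an associative but non-commutative operator such as string concatenation (`op = lambda hi, lo: lo + hi`) confirms that all prefixes are produced, and the returned size and depth can be compared with the formulas for $n = 16$.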
Example : 4-bit parallel-prefix adder (PPA-SK)

    efficient AND-OR-prefix circuit for the generate and AND-prefix circuit for the propagate signals
    optimization : alternatingly AOI-/OAI- resp. NAND-/NOR-gates (inverting gates are smaller and faster)
    can also be realized using two MUX-prefix circuits
[Figure: gate-level schematic of a 4-bit Sklansky parallel-prefix adder with inputs a3 b3 ... a0 b0, c_in and outputs c_out, P_{n-1:0}, s3 ... s0]

Prefix adder synthesis

Local prefix graph transformation :
[Figure: 4-bit prefix graphs before and after the local transformation; the depth-decreasing transform and the size-decreasing transform are inverse to each other]
Repeated (local) prefix transformations result in overall minimization of graph depth or size ⇒ which sequence ?
Goal : minimal size (area) at given depth (delay)
Simple algorithm for sequence of applied transforms :
    Step 1 : prefix graph compression (depth minimization) : depth-decreasing transforms in right-to-left bottom-up order
    Step 2 : prefix graph expansion (size minimization) : size-decreasing transforms in left-to-right top-down order, if allowed depth not exceeded
Prefix adder synthesis : 1) generate serial-prefix graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prefix logic
+ Generates all previous prefix graphs (except PPA-KS)
+ Universal adder synthesis algorithm : generates area-optimal adders for any given timing constraints [11] (including non-uniform signal arrival times)
Multilevel adders

Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L)
+ Delay is $O(n^{1/(m+1)})$ for $m$ levels
– Area increase small for CSKA and CIA, high for CSLA (⇒ COSA)
– Difficult computation of optimal group sizes

Hybrid adders

Arbitrary combinations of speed-up techniques possible ⇒ hybrid/mixed adder architectures
Often used combinations : CLA and CSLA [14]
– Pure architectures usually perform best (at gate-level)

Transistor-level adders

Influence of logic styles (e.g. dynamic logic, pass-transistor logic ⇒ faster)
+ Efficient transistor-level implementation of ripple-carry chains (Manchester chain) [14]
+ Combinations of speed-up techniques make sense
– Much higher design effort
Many efficient implementations exist and have been published

Self-timed adders

Average carry-propagation length : $\log n$
+ RCA is fast in average case ($\tilde{T} = O(\log n)$), slow in worst case ⇒ suitable for self-timed asynchronous designs [15]
– Completion detection is not trivial

Adder performance comparisons

Standard-cell implementations, 0.8 µm process
[Figure: area [lambda^2] versus delay [ns] on logarithmic axes for 8-bit, 16-bit, 32-bit, 64-bit, and 128-bit adders of type RCA, CSKA-2L, CIA-1L, CIA-2L, PPA-SK, PPA-BK, CLA, and COSA, with a line of constant AT]
Complexity comparison under the unit-gate model

    adder         A                        T               AT                 opt. (1)
    RCA           $7n$                     $2n$            $14 n^2$           aaa
    CSKA-1L (3)   $8n$                     $4 n^{1/2}$     $32 n^{3/2}$       aat
    CSKA-2L (3)   $8n$                     $x\, n^{1/3}$ (4)   $x\, n^{4/3}$ (4)   -
    CSLA-1L       $14n$                    $2.8 n^{1/2}$   $39 n^{3/2}$       -
    CIA-1L        $10n$                    $2.8 n^{1/2}$   $28 n^{3/2}$       att
    CIA-2L        $10n$                    $3.6 n^{1/3}$   $36 n^{4/3}$       att
    CIA-3L        $10n$                    $4.4 n^{1/4}$   $44 n^{5/4}$       -
    PPA-SK        $\frac{3}{2} n \log n$   $2 \log n$      $3 n \log^2 n$     ttt
    PPA-BK        $10n$                    $4 \log n$      $40 n \log n$      att
    PPA-KS        $3 n \log n$             $2 \log n$      $6 n \log^2 n$     -
    CLA (5)       $14n$                    $4 \log n$      $56 n \log n$      -
    COSA          $3 n \log n$             $2 \log n$      $6 n \log^2 n$     -

    (1) optimality regarding area and delay :
        aaa : smallest area, longest delay
        aat : small area, medium delay
        att : medium area, short delay
        ttt : large area, shortest delay
        - : not optimal
    (2) syn. : obtained from prefix adder synthesis
    (3) automatic logic optimization not possible (redundancy)
    (4) exact factors not calculated
    (5) corresponds to 4-bit PPA-BK

4.4 Carry-Save Adder (CSA)

a) Adds three $n$-bit operands $A_0$, $A_1$, $A_2$ performing no carry-propagation (i.e. carries are saved) [1]
    $(C, S) = A_0 + A_1 + A_2$
    $2 c_{i+1} + s_i = a_{0,i} + a_{1,i} + a_{2,i}$ ; $i = 0, 1, \ldots, n-1$ (n.)
b) Adds one $n$-bit operand to an $n$-digit carry-save operand
[Figure: CSA symbol with inputs A0, A1, A2 and outputs C, S]
– Result is in redundant carry-save format ($n$ digits), represented by two $n$-bit numbers $S$ (sum bits) and $C$ (carry bits)
+ Parallel arrangement of full-adders, constant delay
    $A = 7n$ , $T = 4$
[Figure: carry-save adder, row of FAs with inputs a_{0,i}, a_{1,i}, a_{2,i} and outputs c_n, s_{n-1}, ..., c_2, s_1, c_1, s_0]

Multi-operand carry-save adders ($m > 3$)
    ⇒ adder array (linear arrangement), adder tree (tree arr.)
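On word level, a carry-save adder is a bitwise XOR for the sum bits and a bitwise majority, shifted by one position, for the carry bits. The sketch below models this on non-negative Python integers and chains CSAs linearly for more than three operands, with one carry-propagating addition at the end; the function names are illustrative and the final `+` stands in for the final CPA.

```python
def carry_save_add(a0: int, a1: int, a2: int):
    """Return (C, S) with C + S == a0 + a1 + a2, no carry propagation."""
    s = a0 ^ a1 ^ a2                                # sum bits s_i
    c = ((a0 & a1) | (a0 & a2) | (a1 & a2)) << 1    # carry bits c_{i+1}
    return c, s


def multi_operand_add(operands):
    """Linear CSA array followed by one final carry-propagate addition."""
    c, s = 0, 0
    for x in operands:
        c, s = carry_save_add(c, s, x)   # add operand to carry-save number
    return c + s                         # final CPA
```

Every bit position of `carry_save_add` is computed independently of its neighbours, which is why the delay of a CSA does not depend on the word length.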
4.5 Multi-Operand Adders

Add three or more ($m > 2$) $n$-bit operands, yield $(n + \lceil \log m \rceil)$-bit result in irredundant number rep. [1, 2]

Array adders

Realization by array adders :
    a) linear arrangement of CPAs
    b) linear arr. of CSAs (adder array) and final CPA
a) and b) differ in bit arrival times at final CPA :
    if CPA = RCA : a) and b) have same overall delay
    if fast final CPA : uniform bit arrival times required ⇒ CSA array (b)
Fast implementation : CSA array + fast final CPA (note: array of fast CPAs not efficient/necessary)
    $A = (m - 2) A_{CSA} + A_{CPA}$
    $T = (m - 2) T_{CSA} + T_{CPA}$
    CPA = RCA : $T = O(m + n)$
    Fast CPA : $T = O(m + \log n)$
[Figure: multi-operand adder with operands A0, A1, A2, A3, ..., A_{m-1}, chain of CSAs, final CPA, result S]

a) 4-operand CPA (RCA) array :
[Figure: three rows of ripple-carry adders (FA cells, HA at the LSB) adding a_{0,i}, a_{1,i}, then a_{2,i}, then a_{3,i}; outputs s_n, s_{n-1}, ..., s_0]

b) 4-operand CSA array with final CPA (RCA) :
[Figure: two CSA rows of FAs adding a_{0,i}, a_{1,i}, a_{2,i} and then a_{3,i}, followed by a final ripple-carry CPA (FA cells, HA at the LSB); outputs s_n, s_{n-1}, ..., s_0]
(m, 2)-compressors

[Figure: (m,2)-compressor symbol with inputs a_0 ... a_{m-1}, carry inputs c_in^0 ... c_in^{m-4}, carry outputs c_out^0 ... c_out^{m-4}, and outputs c, s]
    $2 (c + \sum_{l=0}^{m-4} c_{out}^l) + s = \sum_{i=0}^{m-1} a_i + \sum_{l=0}^{m-4} c_{in}^l$
1-bit adders (similar to (m, k)-counters) [16]
Compresses $m$ bits down to 2 by forwarding $(m - 3)$ intermediate carries to next higher bit position
Is bit-slice of multi-operand CSA array (see prev. page)
+ No horizontal carry-propagation (i.e. a carry input never ripples through to the carry output of the same rank)
Built from full-adders (= (3, 2)-compressor) or (4, 2)-compressors arranged in linear or tree structures

Example : 4-operand adder using (4, 2)-compressors
[Figure: row of (4,2)-compressors (CSA) with inputs a_{0,i}, a_{1,i}, a_{2,i}, a_{3,i}, followed by a final CPA (FA cells, HA at the LSB); outputs s_{n+1}, s_n, ..., s_0]

Optimized (4, 2)-compressor :
    2 full-adders merged and optimized (i.e. XORs arranged in tree structure)
    with full-adders : $A = 14$ , $T = 8$
    optimized : $A = 14$ , $T = 6$
    + same area, 25% shorter delay
[Figure: (4,2)-compressor built from two full-adders and optimized gate-level (4,2)-compressor, inputs a_0 ... a_3, c_in, outputs c_out, c, s]
SD-FA (signed-digit full-adder) is similar to (4, 2)-compressor regarding structure and complexity
Advantages of (4, 2)-compressors over FAs for realizing (m, 2)-compressors :
  higher compression rate (4:2 instead of 3:2)
  less deep and more regular trees
  # operands vs. tree depth :
    FA tree depth            0  1  2   3   4   5    6   7   8   9  10
    # operands (FA)          2  3  4   6   9  13   19  28  42  63  94
    (4,2)-compressor levels  0  1  2   3   4   5    6
    # operands ((4,2))       2  4  8  16  32  64  128
Example : (8, 2)-compressor
  [Figures cpr82fa.epsi (full-adder tree) and cpr82cpr42.epsi ((4, 2)-compressor tree) : inputs a0 ... a7, carries c in 0 ... c in 4 and c out 0 ... c out 4, outputs c and s]
Tree adders (Wallace tree)
Adder tree : n-bit m-operand carry-save adder composed of n tree-structured (m, 2)-compressors [1, 17]
Tree adders : fastest multi-operand adders using an adder tree and a fast final CPA
Adder arrays and adder trees revisited
Some FA can often be replaced by HA or eliminated (i.e. redundant due to constant inputs)
Number of (irredundant) FA does not depend on adder structure, but number of HA does
An m-operand adder accommodates (m − 1) carry inputs
Adder trees (delay growing with log m) are faster than adder arrays (delay growing with m) at same amount of gates
Adder trees are less regular and have more complex routing than adder arrays → larger area, difficult layout (i.e. limited use in layout generators)
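The operand counts in the depth table can be reproduced with a short calculation. This is a sketch under the assumption that one FA level reduces three rows to two and one (4, 2)-compressor level reduces four rows to two; the function names are hypothetical.

```python
def max_operands_fa(depth):
    """Largest number of operands a full-adder tree of the given depth reduces to 2."""
    n = 2
    for _ in range(depth):
        n = (3 * n) // 2        # one more FA level: every 2 rows may stem from 3
    return n

def max_operands_42(levels):
    """Largest number of operands reduced to 2 by the given number of (4,2) levels."""
    return 2 ** (levels + 1)    # every level halves the number of rows

fa_row = [max_operands_fa(d) for d in range(11)]
cpr_row = [max_operands_42(k) for k in range(7)]
print(fa_row)
print(cpr_row)
```

The FA row grows by a factor of about 1.5 per level and the (4, 2) row by a factor of 2 per level, which is the "higher compression rate" named in the list of advantages.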
4.6 Sequential Adders
Bit-serial adder : Sequential n-bit adder
  [Figure bitseradd.epsi : single FA with inputs ai, bi and output si]
Accumulators : Sequential m-operand adders
  With CPA
    [Figure accucpa.epsi : CPA accumulating input A into S]
  With CSA and final CPA
    Allows higher clock rates
    Final CPA too slow : pipelining or multiple cycles for evaluation
    [Figure accucsa.epsi : CSA accumulating input A, final CPA delivers S]
  Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size
5 Simple / Addition-Based Operations
5.1 Complement and Subtraction
2's complementer (negation) : $Z = -A = \overline{A} + 1$
  [Figure neg.epsi]
2's complement subtractor : $S = A - B = A + \overline{B} + 1$
  [Figure sub.epsi : CPA with inputs A, B, carry-in 1, outputs S, c out]
2's complement adder/subtractor : $S = A \pm B = A + (B \oplus sub) + sub$
  [Figure addsub.epsi : CPA with inputs A, B, control input sub, outputs S, c out]
1's complement adder : $S = A + B \bmod (2^n - 1) = A + B + c_{out}$ (end-around carry)
  [Figure addmod.epsi : CPA with c out fed back to c in]
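The adder/subtractor and the end-around-carry adder of section 5.1 can be modelled in a few lines. This is a behavioural sketch, not a circuit description; the function names and the convention of passing operands as non-negative n-bit patterns are assumptions made for the example.

```python
def add_sub(a, b, sub, n):
    """n-bit 2's complement adder/subtractor: S = A + (B xor sub) + sub.

    a, b are n-bit patterns, sub is 0 (add) or 1 (subtract).
    Returns (c_out, S).
    """
    mask = (1 << n) - 1
    b_x = (b ^ (mask if sub else 0)) & mask   # XOR every bit of B with sub
    total = (a & mask) + b_x + sub            # sub doubles as carry-in
    return total >> n, total & mask

def ones_complement_add(a, b, n):
    """1's complement adder: the carry-out is added back in (end-around carry)."""
    mask = (1 << n) - 1
    total = a + b
    total = (total & mask) + (total >> n)     # fold c_out back into the LSB
    return total & mask

print(add_sub(5, 3, 1, 4))            # 5 - 3 on 4 bits
print(ones_complement_add(5, 12, 4))  # 17 mod 15
```

Note that the 1's complement result can be the all-ones pattern, which represents zero as well (double representation of zero).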
5.2 Increment / Decrement
Incrementer
  Adds a single bit c in to an n-bit operand A : $(c_{out}, Z) = A + c_{in}$
  [Figure incsymbol.epsi : +1 block with inputs A, c in and outputs Z, c out]
  $z_i = a_i \oplus c_i$ ; $c_{i+1} = a_i c_i$ ; $i = 0, \ldots, n-1$ ; $c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)
  Corresponds to addition with B = 0 (FA → HA)
  Example : Ripple-carry incrementer using half-adders
    [Figure incfa.epsi : chain of HA with inputs a n-1 ... a0, c in, outputs z n-1 ... z0, c out]
  or using incrementer slices (= half-adder)
    [Figure inc.epsi : incrementer slices with inputs a n-1 ... a0, c in, outputs z n-1 ... z0, c out]
  Prefix problem : $c_i = a_{i-1} a_{i-2} \cdots a_0 \, c_{in}$ → AND-prefix structure
Decrementer
  Subtracts a single bit c in from an n-bit operand A : $(c_{out}, Z) = A - c_{in}$
  $z_i = a_i \oplus c_i$ ; $c_{i+1} = \overline{a_i} c_i$ ; $i = 0, \ldots, n-1$ ; $c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)
  [Figure dec.epsi : decrementer slices with inputs a n-1 ... a0, c in, outputs z n-1 ... z0, c out]
Incrementer-decrementer
  [Figure incdec.epsi : slices with additional control input dec, inputs a n-1 ... a0, c in, outputs z n-1 ... z0, c out]
Fast incrementers
4-bit incrementer using multi-input gates :
  [Figure inccg.epsi : inputs a3 a2 a1 a0, c in, outputs z3 z2 z1 z0, c out]
8-bit parallel-prefix incrementer (Sklansky AND-prefix structure) :
  [Figure incpp.epsi : inputs a7 ... a0, c in, outputs z7 ... z0, c out]
Gray incrementer
Increments in Gray number system
  $c_0 = a_{n-1} \oplus a_{n-2} \oplus \cdots \oplus a_0$ (parity)
  $c_{i+1} = \overline{a_i} c_i$ ; $i = 0, \ldots, n-3$ (r.m.a.)
  $z_0 = a_0 \oplus \overline{c_0}$
  $z_i = a_i \oplus a_{i-1} c_{i-1}$ ; $i = 1, \ldots, n-2$
  $z_{n-1} = a_{n-1} \oplus c_{n-2}$
Prefix problem → AND-prefix structure
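A Gray incrementer can be sketched from the usual rule for binary-reflected Gray codes: with even parity the LSB toggles, with odd parity the bit left of the lowest 1 toggles. The model below assumes the binary-reflected Gray code and n ≥ 2; the function name and the explicit carry list are choices made for this example only.

```python
def gray_increment(a, n):
    """Increment an n-bit binary-reflected Gray code word a (n >= 2)."""
    bit = lambda i: (a >> i) & 1
    c = [bin(a).count("1") & 1]             # c_0 = parity of all bits of a
    for i in range(n - 2):                  # c_{i+1} = not(a_i) and c_i
        c.append((1 - bit(i)) & c[i])
    z = bit(0) ^ (1 - c[0])                 # LSB toggles when parity is even
    for i in range(1, n - 1):               # bit i toggles right above the lowest 1
        z |= (bit(i) ^ (bit(i - 1) & c[i - 1])) << i
    z |= (bit(n - 1) ^ c[n - 2]) << (n - 1) # MSB also handles the wrap-around
    return z

to_gray = lambda k: k ^ (k >> 1)
# The incrementer must follow the Gray sequence, including the wrap to 0.
for k in range(16):
    assert gray_increment(to_gray(k), 4) == to_gray((k + 1) % 16)
```

The carry chain `c` is an AND-prefix over inverted input bits, so the same prefix structures as for the binary incrementer apply.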
5.3 Counting
Count clock cycles → counter, divide clock frequency → frequency divider (c out)
Binary counter
  Sequential in-/decrementer
    [Figure cntblock.epsi : +1 block with register, clk, c in, c out]
  Incrementer speed-up techniques applicable
  Down- and up-down-counters using decrementers / incrementer-decrementers
  Example : Ripple-carry up-counter using counter slices (= HA + FF), c in is count enable
    [Figure cntripple.epsi : counter slices with outputs q n-1 ... q0, c in, c out]
  Asynchronous counter using toggle-flip-flops (lower toggle rate → lower power)
    [Figure cntasync.epsi : chain of T flip-flops clocked by clk, outputs q n-1 ... q0]
  Fast divider using delayed-carry numbers (irredundant carry-save representation of −1 allows using fast carry-save incrementer) [8]
Gray counter
  Counter using Gray incrementer
Ring counters
  Shift register connected to ring :
    [Figure cntring.epsi : shift register with outputs q n-1 ... q0]
    State is not encoded → n FF for counting n states
    Must be initialized correctly (e.g. 00...01)
    Applications:
      fast dividers (no logic between FF)
      state counter for one-hot coded FSMs
  Johnson / twisted-ring counter (inverted feed-back) :
    [Figure cntjohnson.epsi : shift register with inverted feed-back, outputs q n-1 ... q0]
    n FF for counting 2n states
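The state counts of the ring counter and the Johnson counter can be checked with a small simulation. This is a behavioural sketch; the function names, the left-shift direction, and the initial states are assumptions for the example.

```python
def ring_counter_states(n):
    """States visited by an n-bit ring counter initialised to 00...01."""
    mask = (1 << n) - 1
    state, seen = 1, []
    while state not in seen:
        seen.append(state)
        state = ((state << 1) | (state >> (n - 1))) & mask   # MSB fed back to LSB
    return seen

def johnson_counter_states(n):
    """States visited by an n-bit Johnson counter initialised to 00...00."""
    mask = (1 << n) - 1
    state, seen = 0, []
    while state not in seen:
        seen.append(state)
        msb = (state >> (n - 1)) & 1
        state = ((state << 1) | (1 - msb)) & mask            # inverted MSB fed back
    return seen

print([format(s, "04b") for s in ring_counter_states(4)])
print([format(s, "04b") for s in johnson_counter_states(4)])
```

With 4 flip-flops the ring counter cycles through 4 states and the Johnson counter through 8; in both cases there is no logic between the flip-flops apart from the optional inverter.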
5.4 Comparison, Coding, Detection
Comparison operations
  A = B (equal)
  A ≠ B (not equal)
  A ≥ B (greater or equal)
  A < B (less than)
  A > B (greater than)
  A ≤ B (less or equal)
Equality comparison
  $EQ = (a_{n-1} \equiv b_{n-1}) \cdots (a_1 \equiv b_1)(a_0 \equiv b_0)$ (r.s.a.)
  [Figure cmpeq.epsi : bit pairs a n-1 b n-1 ... a0 b0 combined into EQ]
Magnitude comparison
  evaluated recursively over the bit positions (r.s.a.)
Comparators
Subtractor (A − B) :
  GE = c out
  EQ = P n-1:0 (for free in PPA)
  [Figure cmpsub.epsi : CPA with inputs A, B and carry-in 1]
Optimized comparator :
  removing redundancies in subtractor (unused sum bits)
  single-tree structure → speed-up at no cost
  example : ripple comparator using comparator slices
  [Figure cmpripple.epsi : comparator slices over a n-1 b n-1 ... a0 b0 with equality and magnitude chains, outputs EQ and GE]
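The subtractor-based comparator can be sketched as follows: GE is the carry-out of A + not(B) + 1, and EQ is the AND of all propagate signals of that addition. The derived flags and the dictionary layout are assumptions chosen for this example; operands are taken as unsigned n-bit values.

```python
def compare(a, b, n):
    """Compare unsigned n-bit values via the subtractor A + not(B) + 1."""
    mask = (1 << n) - 1
    b_inv = ~b & mask
    ge = ((a + b_inv + 1) >> n) & 1        # GE = c_out of the subtraction
    eq = int((a ^ b_inv) == mask)          # EQ = all propagate bits a_i xor not(b_i) set
    return {"EQ": eq, "NE": 1 - eq,
            "GE": ge, "LT": 1 - ge,
            "GT": ge & (1 - eq), "LE": (1 - ge) | eq}

# Exhaustive check against Python's own comparison operators for 4-bit operands.
for a in range(16):
    for b in range(16):
        flags = compare(a, b, 4)
        assert flags["EQ"] == (a == b) and flags["GE"] == (a >= b)
        assert flags["GT"] == (a > b) and flags["LE"] == (a <= b)
```

Only the carry-out and the group propagate signal are needed, so the sum bits of the subtractor are redundant and can be removed.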
Decoder
  Decodes binary number A n-1:0 to vector Z m-1:0 (m = 2^n)
  z i = 1 if A = i, 0 else ; i = 0, ..., m − 1
  [Figures decodersym.epsi (symbol) and decoder.epsi : inputs a2 a1 a0, outputs z7 ... z0]
Encoder
  Encodes vector A m-1:0 to binary number Z n-1:0 (m = 2^n)
  (condition: exactly one bit of A is 1)
  Z = i if a i = 1 ; i = 0, ..., m − 1
  [Figures encodersym.epsi (symbol) and encoder.epsi : inputs a7 ... a0, outputs z2 z1 z0 (note: connections according to PPA-SK)]
Detection operations
  All-zeroes detection : $z = \overline{a_{n-1} + a_{n-2} + \cdots + a_0}$
  All-ones detection : $z = a_{n-1} a_{n-2} \cdots a_0$ (r.s.a.)
Leading-zeroes detection (LZD) :
  for scaling, normalization, priority encoding
  a) non-encoded output : (e.g. 000101 → 000100)
     [Figure lzdnenc.epsi : inputs a n-1 ... a0, outputs z n-1 ... z0]
     prefix problem (r.m.a.) → AND-prefix structure
  b) encoded output : + encoder
  signed numbers : + leading-ones detector (LOD)
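Leading-zeroes detection with both output styles can be sketched as a scan from the MSB. The running `prefix` variable plays the role of the AND-prefix signal; in hardware it would be computed by a prefix structure rather than serially. The function name and the returned tuple are assumptions for the example.

```python
def lzd(a, n):
    """Leading-zeroes detection on an n-bit word a.

    Returns (flags, count): flags marks the position of the leading one
    (non-encoded output), count is the number of leading zeroes (encoded output).
    """
    flags, count, prefix = 0, 0, 1          # prefix = AND of all inverted higher bits
    for i in range(n - 1, -1, -1):          # scan from MSB down to LSB
        bit = (a >> i) & 1
        flags |= (prefix & bit) << i        # first 1 seen while prefix is still set
        count += prefix & (1 - bit)         # still inside the run of leading zeroes
        prefix &= 1 - bit
    return flags, count

print(lzd(0b000101, 6))   # leading one at bit 2, three leading zeroes
```

For an all-zero input the non-encoded output is zero and the count equals the word length n.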
5.5 Shift, Extension, Saturation
Shift : a) shift n-bit vector by k bit positions
        b) select n out of more bits at position k
        also: logical (= unsigned), arithmetic (= signed)
Rotation by k bit positions, n constant (logic operation)
Extension of word length by k bits (n → n + k) (i.e. sign-extension for signed numbers)
Saturation to highest/lowest value after over-/underflow
Resulting bit vectors for A = (a n-1, ..., a0), shift and rotate by one position :
  shift a)  unsigned  l.  (a n-2, ..., a0, 0)               sll
            unsigned  r.  (0, a n-1, ..., a1)               srl
            signed    l.  (a n-1, a n-3, ..., a0, 0)        sla
            signed    r.  (a n-1, a n-1, a n-2, ..., a1)    sra
  rotate              l.  (a n-2, ..., a0, a n-1)           rol
                      r.  (a0, a n-1, ..., a1)              ror
  extend    unsigned  l.  (0, ..., 0, a n-1, ..., a0)
            signed    l.  (a n-1, ..., a n-1, a n-2, ..., a0)
                      r.  (a n-1, ..., a0, 0, ..., 0)
  shift b), saturate : unsigned and signed variants
Applications :
  adaption of magnitude (shift a)) or word length (extension) of operands (e.g. for addition)
  multiplication/division by powers of 2 (shift)
  logic bit/byte operations (shift, rotation)
  scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of floating-point numbers, shift a)) using LZD
  reducing error after over-/underflow (saturation)
Implementation of shift/extension/rotation by
  constant values : hard-wired
  variable values : multiplexers
  n possible values : n–by–n barrel-shifter/rotator
Example : 4–by–4 barrel-rotator
  [Figures muxshift.epsi (multiplexers) and barshift.epsi (tristate buffers) : inputs a3 a2 a1 a0, select s1 s0, outputs z3 z2 z1 z0]
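A multiplexer-based barrel rotator can be modelled as a cascade of stages, where stage j rotates by 2^j positions if select bit s j is set. The sketch below assumes rotation to the left and a word length that is a power of two; the function names are hypothetical.

```python
def rotate_left_stage(a, k, n):
    """Hard-wired rotation of an n-bit word by the constant k (one mux input)."""
    k %= n
    return ((a << k) | (a >> (n - k))) & ((1 << n) - 1)

def barrel_rotate_left(a, s, n=4):
    """Log-depth barrel rotator: stage j rotates by 2**j if select bit s_j is 1."""
    j = 0
    while (1 << j) < n:
        if (s >> j) & 1:                       # multiplexer controlled by s_j
            a = rotate_left_stage(a, 1 << j, n)
        j += 1
    return a

print(format(barrel_rotate_left(0b0001, 3), "04b"))
```

A 4-bit word needs two stages (s0 and s1), each consisting of one 2-input multiplexer per bit.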
5.6 Addition Flags
  flag  formula                 description
  C     $c_n$                   carry flag
  V     $c_n \oplus c_{n-1}$    signed overflow flag
  Z     $S = 0$                 zero flag
  N     $s_{n-1}$               negative flag, sign
Implementation of adder with flags
  C, N : for free
  V : fast c n, c n-1 computed by e.g. PPA → very cheap
  Z : a) c in = 1 (subtract.) : Z = P n-1:0 (of PPA)
      b) c in = 0/1 :
         1) $Z = \overline{s_{n-1} + s_{n-2} + \cdots + s_0}$ (r.s.a.)
         2) faster without final sum (i.e. carry prop.) [18]
            example : A = 01001, B = 10110, c in = 1 → S = 00000, Z = 1
Basic and derived condition flags
  [Table : conditions zero, negative, positive, overflow, and underflow with flag formulas for unsigned and signed operands]
Unsigned and signed addition/subtraction only differ with respect to the condition flags
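The four basic flags can be derived from an adder model as shown below. This is a sketch with assumed flag names C, V, Z, N; the carry into the MSB is recomputed from the lower bits here, whereas an adder circuit provides it directly.

```python
def add_with_flags(a, b, c_in, n):
    """n-bit addition S = A + B + c_in with carry, overflow, zero and negative flags."""
    mask = (1 << n) - 1
    total = (a & mask) + (b & mask) + c_in
    s = total & mask
    c_n = total >> n                                        # carry out of the MSB
    low = (1 << (n - 1)) - 1
    c_n1 = ((a & low) + (b & low) + c_in) >> (n - 1)        # carry into the MSB
    flags = {"C": c_n,
             "V": c_n ^ c_n1,       # signed overflow: carries in and out of MSB differ
             "Z": int(s == 0),
             "N": s >> (n - 1)}     # sign bit of the result
    return s, flags

print(add_with_flags(0b01001, 0b10110, 1, 5))
print(add_with_flags(0b0111, 0b0001, 0, 4))
```

The same sum bits serve unsigned and signed operands; only the interpretation of the flags differs (C for unsigned overflow, V for signed overflow).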
5.7 Arithmetic Logic Unit (ALU)
[Figure alusymbol.epsi : ALU with inputs A, B, c in, op and outputs Z, c out, flags]
ALU operations
  arithmetic :    add, sub, inc, dec, pass, neg
  logic :         and, nand, or, nor, xor, xnor, pass, not
  shift/rotate :  sll, srl, sla, sra, rol, ror
  s/ro : shift/rotate ; l/r : left/right ; l/a : logic (unsigned) / arithmetic (signed)
Logic of adder/subtractor can partly be shared with logic operations
6 Multiplication
6.1 Multiplication Basics [1, 2]
Multiplies two n-bit operands A and B
Product P is 2n-bit unsigned number or (2n − 1)-bit signed number
Example : unsigned multiplication
  $P = A \cdot B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$
  $P = \sum_{i=0}^{n-1} P_i$ with partial products $P_i = a_i B 2^i$ ; $i = 0, \ldots, n-1$ (r.s.a.)
Algorithm
  1) Generation of n partial products
  2) Adding up partial products :
     a) sequentially (sequential shift-and-add),
     b) serially (combinational shift-and-add), or
     c) in parallel
Speed-up techniques
  Reduce number of partial products
  Accelerate addition of partial products
Sequential multipliers :
  partial products generated and added sequentially (using accumulator)
  [Figure mulseq.epsi : partial product generator and CPA accumulator]
Array multipliers :
  partial products generated and added simultaneously in linear array (using array adder)
  [Figure mularr.epsi : partial product generators feeding a chain of CSAs and a final CPA]
Parallel multipliers :
  partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)
  [Figure mulpar.epsi : partial product generators feeding a CSA tree and a final CPA]
Signed multipliers :
  a) complement operands before and result after multiplication → unsigned multiplication
  b) direct implementation (dedicated multiplier structure)
6.2 Unsigned Array Multiplier
Braun multiplier : array multiplier for unsigned numbers
  [Partial product matrix : bits a i b j for i, j = 0 ... 3, summed column-wise into p7 ... p0]
  [Figure mulbraun.epsi : 4 × 4 array with inputs a3 ... a0, b3 ... b0, one row of HA HA HA, two rows of FA FA FA (CSA), final row FA FA HA (CPA), outputs p7 ... p0]
6.3 Signed Array Multipliers
Modified Braun multiplier
  Subtract bits with negative weight → special FAs [1] (for 1 neg. bit and for 2 neg. bits)
  Replace FAs in regions 1, 2, and 3 of the array by special FAs (input at mark)
  [Partial product matrix for 4 × 4 bits with regions 1, 2, 3, columns 7 ... 0]
  Otherwise exactly same structure and complexity as Braun multiplier → efficient and flexible
Baugh-Wooley multiplier
  Arithmetic transformations yield the following partial products (two additional ones) :
  [Partial product matrix for 4 × 4 bits, columns 7 ... 0]
  – Less efficient and regular than modified Braun multiplier
6.4 Booth Recoding
Speed-up technique : reduction of partial products
Sequential multiplication
  Minimal (or canonical) signed-digit (SD) representation of A
  + One cycle per non-zero partial product (i.e. a i ≠ 0)
  – Negative partial products
  – Data-dependent reduction of partial products and latency
Combinational multiplication
  Only fixed reduction of partial products possible
  Radix-4 modified Booth recoding : 2 bits recoded to one multiplier digit → n/2 partial products
  $A = \sum_{i=0}^{n/2-1} (a_{2i-1} + a_{2i} - 2 a_{2i+1}) \, 2^{2i}$ ; $a_{-1} = 0$
  Booth recoding :
    a 2i+1  a 2i  a 2i-1   digit
      0      0      0        0
      0      0      1       +1
      0      1      0       +1
      0      1      1       +2
      1      0      0       −2
      1      0      1       −1
      1      1      0       −1
      1      1      1        0
  [Figure mulbooth.epsi : Booth recoding of A, partial product generation from B, CSA array/tree, final CPA]
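The radix-4 modified Booth recoding can be sketched directly from the digit formula. The model assumes an even word length n and takes the multiplier as an n-bit 2's complement bit pattern; the function names are hypothetical, and the multiplication simply sums the recoded partial products instead of modelling an adder array or tree.

```python
def booth_radix4_digits(a, n):
    """Recode the n-bit 2's complement pattern a (n even) into n/2 digits in -2..2."""
    bit = lambda i: 0 if i < 0 else (a >> i) & 1          # a_{-1} = 0
    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(n // 2)]

def booth_multiply(a, b, n):
    """Product of signed a (given as n-bit pattern) and integer b via n/2 partial products."""
    digits = booth_radix4_digits(a, n)
    return sum(d * b * 4 ** i for i, d in enumerate(digits))

print(booth_radix4_digits(0b1101, 4))   # digits of -3, least significant first
print(booth_multiply(0b1101, 5, 4))     # (-3) * 5
```

Each digit selects 0, ±B or ±2B, so a partial product needs only a shift (for 2B) and an optional negation.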
Radix-4 modified Booth recoding (continued)
  Applicable to sequential, array, and parallel multipliers
  – additional recoding logic and more complex partial product generation (MUX for shift, XOR for negation)
  + adder array/tree cut in half → considerably smaller (array and tree)
    much faster for adder arrays
    slightly or not faster for adder trees
  Negative partial products (avoid sign-extension) :
    [Partial product diagrams with and without explicit sign extension, columns 6 ... 0]
  Suited for signed multiplication (incl. Booth recod.)
  Extend A for unsigned multiplication : a n = 0
  Radix-8 (3-bit recoding) and higher radices : precomputing 3B → larger overhead
Signed-unsigned multiplier : signed multiplier with operands extended by 1 bit (extension bit 0 for unsigned, sign bit for signed operands)
6.5 Wallace Tree Addition
Speed-up technique : fast partial product addition
Applicable to parallel multipliers : parallel partial product generation (normal or Booth recoded)
– Irregular adder tree (Wallace tree) due to different number of bits per column
  → irregular wiring and/or layout
  → non-uniform bit arrival times at final adder
6.6 Multiplier Implementations
Sequential multipliers :
  low performance, small area, resource sharing (adder)
Braun or Baugh-Wooley multiplier (array multiplier) :
  medium performance, high area, high regularity
  layout generators → data paths and macro-cells
  simple pipelining, faster CPA → higher speed
Booth-Wallace multiplier (parallel multiplier) [9] :
  high performance, high area, low regularity
  custom multipliers, netlist generators
  often pipelined (e.g. register between CSA-tree and CPA)
6.7 Composition from Smaller Multipliers
(2n × 2n)-bit multiplier can be composed from 4 (n × n)-bit multipliers (can be repeated recursively)
  $A \cdot B = (A_H 2^n + A_L)(B_H 2^n + B_L) = A_H B_H 2^{2n} + (A_H B_L + A_L B_H) 2^n + A_L B_L$
  4 (n × n)-bit multipliers + 2n-bit CSA + 3n-bit CPA
  less efficient (area and speed)
6.8 Squaring
$P = A^2$ : multiplier optimizations possible
  [Partial product diagram for 4-bit squaring, columns 7 ... 0]
  + only about half the partial product bits (if no Booth recoding used) → optimized squarer more efficient than multiplier
  Table look-up (ROM) less efficient for every n
7 Division / Square Root Extraction
7.1 Division Basics [1, 2]
$Q = A / B$ ; $R = A \;\mathrm{rem}\; B$ (remainder) ; $A = Q \cdot B + R$
  $A \in [0, 2^{2n} - 1]$ ; $B, Q, R \in [0, 2^n - 1]$
  $A < 2^n B$, otherwise overflow → normalize B before division ($2^{n-1} \le B \le 2^n - 1$)
Algorithms (radix-2)
  Subtract-and-shift : partial remainders R i
  Sequential algorithm : recursive, non-associative (r.m.n.)
  Basic algorithm : compare and conditionally subtract → expensive comparison and CPA
  Restoring division : subtract and conditionally restore (adder or multiplexer) → expensive CPA and restoring
  Non-restoring division : detect sign, subtract/add, and correct by next steps → expensive CPA
  SRT division : estimate range, subtract/add (CSA), and correct by next steps → inexpensive CSA
7.2 Restoring Division
  q i = 1 if the trial subtraction of the shifted divisor leaves a non-negative partial remainder
  q i = 0 otherwise, and the previous partial remainder is kept (restored)
7.3 Non-Restoring Division
  q i ∈ {−1, 1} : q i = 1 (subtract) if partial remainder ≥ 0, q i = −1 (add) if partial remainder < 0
  One subtraction/addition (CPA) per step
  Final correction step for R (additional CPA)
  Simple quotient digit conversion (note: irredundant)
  [Figure divnr.epsi : cascade of +/− CPA stages with sign detection, inputs A, B, outputs Q, R]
7.4 Signed Division
  q i = 1 if partial remainder and B have same sign
  q i = −1 if partial remainder and B have opposite sign
  Example : signed non-restoring array divider (simplified, final correction of R omitted)
  [Figure divarray.epsi : 4 rows of FA FA FA FA, inputs b3 ... b0 and a6 ... a0, first row controlled by a6 ⊕ b3, quotient bits q3 ... q0, remainder r3 ... r0]
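Restoring and non-restoring division for unsigned operands can be compared with two short models. These are sketches under the assumptions of section 7.1 (unsigned operands, A < 2^n B); the function names are hypothetical, and the quotient digits ±1 of the non-restoring variant are accumulated directly into an integer instead of being converted bit by bit.

```python
def restoring_divide(a, b, n):
    """Unsigned restoring division, requires 0 <= a < b * 2**n. Returns (q, r)."""
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        trial = r - (b << i)          # trial subtraction of the shifted divisor
        if trial >= 0:
            r = trial                 # keep the new partial remainder, q_i = 1
            q |= 1 << i
        # otherwise r stays unchanged (restored), q_i = 0
    return q, r

def nonrestoring_divide(a, b, n):
    """Unsigned non-restoring division with quotient digits in {-1, +1}."""
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        if r >= 0:                    # sign detection decides subtract or add
            r -= b << i
            q += 1 << i               # digit +1
        else:
            r += b << i
            q -= 1 << i               # digit -1
    if r < 0:                         # final correction step for R
        r += b
        q -= 1
    return q, r

print(restoring_divide(13, 3, 4), nonrestoring_divide(13, 3, 4))
```

The non-restoring variant performs exactly one addition or subtraction per step, at the price of the final correction.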
7.5 SRT Division (Sweeney, Robertson, Tocher)
  q i ∈ {−1, 0, 1}, i.e. Q is SD number
  q i is selected by comparing the partial remainder against range bounds ; if B is normalized, the bounds become constants
  + Only 3 MSB are compared → partial remainders are estimated, CSA instead of CPA can be used (precise enough) [19]
  Correction in following steps (+ final correction step)
  – Redundant representation of Q (SD representation) → final conversion necessary (CPA)
  + Highly regular and fast SRT array dividers → only slightly slower/larger than array multipliers
  [Figure divsrt.epsi : cascade of +/− CSA stages with range estimation, final +/− CPA, inputs A, B, outputs Q (via CPA), R]
7.6 High-Radix Division
  Radix β = 2^m, m > 1
  m quotient bits per step → fewer, but more complex steps
  + Suitable for SRT algorithm → faster
  – Complex comparisons (more bits) and decisions → table look-up (→ Pentium bug!)
7.7 Division by Multiplication
Division by convergence
  Numerator and denominator are multiplied by the same factors until the denominator converges to 1 ; the numerator then converges to the quotient (r.s.n.)
  Quadratic convergence
Division by reciprocation
  $Q = A / B = A \cdot (1/B)$
  Newton-Raphson iteration method : find $f(X) = 0$ by recursion $X_{i+1} = X_i - f(X_i) / f'(X_i)$
  Algorithm : with $f(X) = 1/X - B$ : $X_{i+1} = X_i (2 - B X_i)$ (r.s.n.)
  Quadratic convergence
  Speed-up : first approximation X 0 from table
7.8 Remainder / Modulus
Remainder (rem) : signed remainder of a division
  R = A rem B ; sign(R) = sign(A)
Modulus (mod) : positive remainder of a division
  M = A mod B ; M ≥ 0
7.9 Divider Implementations
Iterative dividers (through multiplication) :
  resource sharing of existing components (multiplier)
  medium performance, medium area
  high efficiency if components are shared
Sequential dividers (restoring, non-restoring, SRT) :
  resource sharing of existing components (e.g. adder)
  low performance, low area
Array dividers (restoring, non-restoring, SRT) :
  dedicated hardware component
  high performance, high area
  high regularity → layout generators, pipelining
  square root extraction possible by minor changes
  combination with multiplication or/and square root
No parallel dividers exist, as compared to parallel multipliers (sequential nature of division)
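The Newton-Raphson reciprocal iteration used for division by reciprocation can be illustrated in floating point. This sketch is not a hardware model: the starting value `x0` stands in for the table look-up, the divisor is assumed to be normalized to the range [0.5, 1), and the function names are hypothetical.

```python
def reciprocal_newton(b, x0, steps):
    """Approximate 1/b by the iteration X_{i+1} = X_i * (2 - b * X_i)."""
    x = x0
    for _ in range(steps):
        x = x * (2.0 - b * x)     # two multiplications and one subtraction per step
    return x

def divide_by_reciprocation(a, b, x0=1.5, steps=6):
    """Q = A * (1/B) with the reciprocal obtained iteratively."""
    return a * reciprocal_newton(b, x0, steps)

# The relative error 1 - b * x is squared in every step (quadratic convergence).
b, x = 0.75, 1.5
for step in range(4):
    print(step, abs(1.0 - b * x))
    x = x * (2.0 - b * x)
```

Because the error is squared in each step, the number of correct bits roughly doubles per iteration, so a better table value for the first approximation saves whole iterations.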
7.10 Square Root Extraction
$Q = \sqrt{A}$ ; $A = Q^2 + R$
  $A \in [0, 2^{2n} - 1]$ ; $Q \in [0, 2^n - 1]$
Algorithm [1]
  Subtract-and-shift : partial remainders R i and quotients Q i (r.m.n.)
Implementation
  + Similar to division → same algorithms applicable (restoring, non-restoring, SRT, high-radix)
  + Combination with division in same component possible
  Only triangular array required
  [Figure sqrtnr.epsi : cascade of +/− CPA stages, input A, outputs Q, R]
8 Elementary Functions
  Exponential function : e^x (exp x)
  Logarithm function : ln x, log x
  Trigonometric functions : sin x, cos x, tan x
  Inverse trig. functions : arcsin x, arccos x, arctan x
  Hyperbolic functions : sinh x, cosh x, tanh x
8.1 Algorithms
Table look-up : inefficient for large word lengths [5]
Taylor series expansion : complex implementation
Polynomial and rational approximations [1, 5]
Shift-and-add algorithms [5]
Convergence algorithms [1, 2] :
  similar to division-by-convergence
  two (or more) recursive formulas : one formula converges to a constant, the other to the result
Coordinate rotation (CORDIC) [2, 5, 20] :
  3 equations for x-, y-coordinate, and angle
  computes all elementary functions by proper input settings and choice of modes and outputs
  simple, universal hardware, small look-up table
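The three CORDIC equations for x, y and the angle can be sketched for one mode, here rotation mode computing sine and cosine. This is a floating-point illustration with assumed names; a hardware version would use fixed-point registers, replace the multiplications by 2^-i with shifts, and store the arctangent values in the small look-up table.

```python
import math

def cordic_sin_cos(theta, iterations=32):
    """Rotation-mode CORDIC, intended for |theta| <= pi/2. Returns (sin, cos)."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]   # small look-up table
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 4.0 ** -i)       # constant scaling of the micro-rotations
    x, y, z = 1.0 / gain, 0.0, theta             # pre-scaled start vector, residual angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0              # rotate towards residual angle zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x

s, c = cordic_sin_cos(0.5)
print(s, math.sin(0.5))
print(c, math.cos(0.5))
```

Other functions follow from different start values and from driving y instead of z towards zero (vectoring mode), which is what "proper input settings and choice of modes" refers to.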

8.2 Integer Exponentiation
Approximated exponentiation : $x^y = e^{y \ln x} = 2^{y \log_2 x}$
Base-2 integer exponentiation : $Z = 2^A$
Integer exponentiation (exact) : $Z = A^B$ (very large result word length !)
Applications : modular exponentiation $A^B \bmod C$ in cryptographic algorithms (e.g. IDEA, RSA)
Algorithms : square-and-multiply
  a) $A^B = A^{b_0} (A^2)^{b_1} (A^4)^{b_2} \cdots$
  b) $A^B = (\cdots((A^{b_{n-1}})^2 A^{b_{n-2}})^2 \cdots)^2 A^{b_0}$ (r.s.n.)
8.3 Integer Logarithm
$Z = \lfloor \log_2 A \rfloor$
For detection/comparison of order of magnitude
Corresponds to leading-zeroes detection (LZD) with encoded output
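The two square-and-multiply orderings of section 8.2 can be sketched for modular exponentiation, the application named there. Function names are hypothetical; the first variant scans the exponent from the LSB, the second from the MSB.

```python
def modexp_lsb_first(a, b, m):
    """A**B mod m, exponent bits scanned from the LSB (variant a)."""
    result, square = 1 % m, a % m
    while b:
        if b & 1:
            result = (result * square) % m   # multiply in A**(2**i) if b_i = 1
        square = (square * square) % m       # A, A**2, A**4, ...
        b >>= 1
    return result

def modexp_msb_first(a, b, m):
    """A**B mod m, exponent bits scanned from the MSB (variant b)."""
    result = 1 % m
    for bit in bin(b)[2:]:
        result = (result * result) % m       # square in every step
        if bit == "1":
            result = (result * a) % m        # multiply if the exponent bit is set
    return result

print(modexp_lsb_first(7, 13, 33), modexp_msb_first(7, 13, 33))
```

Both variants need about one squaring per exponent bit plus one multiplication per set bit; reducing modulo m after every step keeps the word length bounded.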
9 VLSI Design Aspects
9.1 Design Levels
Transistor-level design
  Circuit and layout designed by hand (full custom)
  Low design efficiency
  High circuit performance : high speed, low area
  High flexibility : choice of architecture and logic style
  Transistor-level circuit optimizations :
    logic style : static vs. dynamic logic, complementary CMOS vs. pass-transistor logic
    special arithmetic circuits : better than with gates
      carry chain : [Figure carrychain.epsi : chain from c in to c out with signals g i, k i, p i per stage]
      full-adder : [Figure facmos.epsi : transistor-level full-adder with inputs a, b, c in and outputs s, c out]
Gate-level design
  Cell-based design techniques : standard-cells, gate-array/sea-of-gates, field-programmable gate-array (FPGA)
  Circuit implemented by hand or by synthesis (library)
  Layout implemented by automated place-and-route
  Medium to high design efficiency
  Medium to low circuit performance
  Medium to low flexibility : full choice of architecture
Block-level design
  Layout blocks and netlists from parameterized automatic generators or compilers (library)
  High design efficiency
  Medium to high circuit performance
  Low flexibility : limited choice of architectures
  Implementations :
    data-path : bit-sliced, bus-oriented layout (array of cells: bits × operations), implementation of entire data paths, medium performance, medium diversity
    macro-cells : tiled layout, fixed/single-operation components, high performance, small diversity
    portable netlists : → gate-level design
9.2 Synthesis
High-level synthesis
  Synthesis from abstract, behavioral hardware description (e.g. data dependency graphs) using e.g. VHDL
  Involves architectural synthesis and arithmetic transformations
  High-level synthesis is still in its infancy
Low-level synthesis
  Layout and netlist generators
  Included in libraries and synthesis tools
  Low-level synthesis is state-of-the-art
  Basis for efficient ASIC design
  Limited diversity and flexibility of library components
Circuit optimization
  Efficient optimization of random logic is state-of-the-art
  Optimization of entire arithmetic circuits is not feasible → only local optimizations possible
  Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators
9.3 VHDL
Arithmetic types : unsigned, signed (2's complement)
Arithmetic packages
  numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys)
  contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types
Arithmetic operators (VHDL'87/93) [21]
  relational : =, /=, <, <=, >, >=
  shift, rotate ('93 only) : rol, ror, sla, sll, sra, srl
  adding : +, -
  sign (unary) : +, -
  multiplying : *, /, mod, rem
  exponent, absolute : **, abs
Synthesis
  Typical limitations of synthesis tools :
    /, mod, rem : both operands must be constant or divisor must be a power of two
    ** : for power-of-two bases only
  Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)
Resource sharing
  Sharing one resource for multiple operations
  Done automatically by some synthesis tools
  Otherwise, appropriate coding is necessary :
    a) 2 adders + 1 multiplexer :
       S <= A + C when SELA = '1' else B + C;
    b) 1 multiplexer + 1 adder :
       T <= A when SELA = '1' else B;
       S <= T + C;
Coding & synthesis hints
  Addition : single adder with carry-in/carry-out :
       Aext <= resize(A, width+1) & Cin;
       Bext <= resize(B, width+1) & '1';
       Sext <= Aext + Bext;
       S    <= Sext(width downto 1);
       Cout <= Sext(width+1);
  Synthesis : check synthesis result for allocated arithmetic units → code sanity check, control of circuit size
VHDL library of arithmetic units
  Structural, synthesizable VHDL code for most circuits described in this text is found in [22]
9.4 Performance
Pipelining
  Pipelining is basically possible with every combinational circuit → higher throughput
  Arithmetic circuits are well suited for pipelining due to high regularity
  Pipelining of arithmetic circuits can be very costly :
    large amount of internal signals in arithmetic circuits
    array structures : many small pipeline registers
    tree structures : few large pipeline registers
    → no advantage of tree structures anymore (except for smaller latency)
  Fine-grain pipelining → systolic arrays (often applied to arithmetic circuits)
High speed
  Fast circuit architectures, pipelining, replication (parallelization), and combinations of those
  Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment
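The single-adder coding hint of section 9.3 (appending Cin to A and a constant '1' to B as extra LSBs) can be checked with an integer model. The sketch mirrors the signal names of the VHDL fragment in lower case; it is an arithmetic illustration in Python, not synthesizable code, and it assumes unsigned operands.

```python
def add_with_carry(a, b, c_in, width):
    """Model of the Aext/Bext trick: one addition yields A + B + Cin and Cout."""
    a_ext = (a << 1) | c_in          # Aext <= resize(A, width+1) & Cin
    b_ext = (b << 1) | 1             # Bext <= resize(B, width+1) & '1'
    s_ext = a_ext + b_ext            # LSBs add to Cin + 1: carries into bit 1 iff Cin = 1
    s = (s_ext >> 1) & ((1 << width) - 1)     # S    <= Sext(width downto 1)
    c_out = (s_ext >> (width + 1)) & 1        # Cout <= Sext(width+1)
    return c_out, s

print(add_with_carry(7, 8, 1, 4))
print(add_with_carry(3, 4, 0, 4))
```

The extra LSB pair turns the carry-in into an ordinary operand bit, so a synthesis tool only has to allocate one adder of width + 2 bits.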
Low power
Power-related properties of arithmetic circuits :
  High glitching activity due to high bit dependencies and large logic depth
Power reduction in arithmetic circuits [23] :
  Reduce the switched capacitance by choosing an area efficient circuit architecture
  Allow for lower supply voltage by speeding up the circuitry
  Reduce the transition activity :
    apply stable inputs while circuit is not in use (→ disabling subcircuits)
    reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
    reduce glitching transitions by reducing logic depth (pipelining)
    take advantage of correlated data streams
    choose appropriate number representations (e.g. Gray codes for counters)
9.5 Testability
Testability goal : high fault coverage with few test vectors that are easy to generate/apply
Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits
Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits
Hard-detectable faults found in :
  circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder flags
  circuits using redundant number representations (→ redundant hardware) : dividers (Pentium bug!)
Bibliography
[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993.
[2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
[3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.
[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.
[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhauser Boston, 1997.
[6] Proceedings of the Xth Symposium on Computer Arithmetic.
[7] IEEE Transactions on Computers.
[8] D. R. Lutz and D. N. Jayasimha, "Programmable modulo-k counters", IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939–941, Nov. 1996.
[9] H. Makino et al., "An 8.8-ns 54 × 54-bit multiplier with high speed redundant binary architecture", IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773–783, June 1996.
[10] W. N. Holmes, "Composite arithmetic: Proposal for a new standard", IEEE Computer, vol. 30, no. 3, pp. 65–73, Mar. 1997.
[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.
[12] A. Tyagi, "A reduced-area scheme for carry-select adders", IEEE Trans. Comput., vol. 42, no. 10, pp. 1162–1170, Oct. 1993.
[13] T. Han and D. A. Carlson, "Fast area-efficient VLSI adders", in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49–56.
[14] D. W. Dobberpuhl et al., "A 200-MHz 64-b dual-issue CMOS microprocessor", IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555–1564, Nov. 1992.
[15] A. De Gloria and M. Olivieri, "Statistical carry lookahead adders", IEEE Trans. Comput., vol. 45, no. 3, pp. 340–347, Mar. 1996.
[16] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach", IEEE Trans. Comput., vol. 45, no. 3, pp. 294–305, Mar. 1996.
[17] Z. Wang, G. A. Jullien, and W. C. Miller, "A new design technique for column compression multipliers", IEEE Trans. Comput., vol. 44, no. 8, pp. 962–970, Aug. 1995.
[18] J. Cortadella and J. M. Llaberia, "Evaluation of A + B = K conditions without carry propagation", IEEE Trans. Comput., vol. 41, no. 11, pp. 1484–1488, Nov. 1992.
[19] S. E. McQuillan and J. V. McCanny, "Fast VLSI algorithms for division and square root", J. VLSI Signal Processing, vol. 8, pp. 151–168, Oct. 1994.
[20] Y. H. Hu, "CORDIC-based VLSI architectures for digital signal processing", IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16–35, July 1992.
[21] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997.
[22] R. Zimmermann, "VHDL Library of Arithmetic Units", http://www.iis.ee.ethz.ch/~zimmi/arith_lib.html.
[23] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.
Computer Arithmetic: Principles, Architectures, and VLSI Design 98

Comp Arith Notes

Diunggah oleh

Informasi Dokumen

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Comp Arith Notes

Diunggah oleh

Hak Cipta:

Format Tersedia

Eidgenossische

¨ Ecole polytechnique federale

Institut für Integrierte Systeme Integrated Systems Laboratory

March 16, 1999

Integrated Systems Laboratory

Copyright c 1999 by Integrated Systems Laboratory, ETH Zürich

Contents 4.3 Carry-Propagate Adders (CPA) 26

7.2 Restoring Division 78

Computer Arithmetic: Principles, Architectures, and VLSI Design 3

1 Introduction and Conventions 1.3 Conventions

1.1 Outline Naming conventions

Arithmetic units are, among others, core of every data

1.4 Recursive Function Evaluation 2. 

, outputs , function (graph sym. :

 (r.m.) (prefix problem) :

!   !1 119 "17 mm

Output  is a function of all inputs   #$ 1

' '&1 ;  0 

2 Arithmetic Operations 2.2 Implementation Techniques

2.1 Overview Direct implementation of dedicated units :

based on operation fixed-point floating-point always : 1 – 5

⁄ sqrt (x) (same as on Table look-up techniques using ROMs :

Properties : double representation of zero, symmetric 3.2 Gray Numbers

error detection and correction for arithmetic operations 2

in conventional and residue number systems

Base is -tuple of integers 0 ,

mod 0  , 

1 2  0 1   2 0 2 

3.5 Floating-Point Numbers 3.6 Logarithmic Number System

: by approximation or addition in conventional

base on fixed-point add, multiply, and shift operations

3.8 Composite Arithmetic 3.9 Round-Off Schemes

integer, secondary: rational) and inexact (primary:

4 Addition 4.1 Overview 4 Addition 4.2 1-Bit Adders, (m, k)-Counters

4 Addition 4.2 1-Bit Adders, (m, k)-Counters

bits of same magnitude (i.e. 1-bit numbers)

Output sum as #-bit number ( # log  1)

carry-propagate adders

Half-adder (HA), (2, 2)-counter

Full-adder (FA), (3, 2)-counter

(m, k)-counters

Example: (7, 3)-counter

4.3 Carry-Propagate Adders (CPA)

Carry-propagation speed-up techniques

a n-1:j b n-1:j a i-1:k b i-1:k a k-1:0 b k-1:0

Ripple-carry adder (RCA)

a n-1 b n-1 a1 b1 a0 b0 s n-1 s1 s0
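As a rough behavioural sketch of the ripple-carry adder, the following Python code chains one full-adder per bit position and passes the carry from bit 0 upward. The helper names (`to_bits`, `from_bits`, `ripple_carry_add`) and the LSB-first list representation are assumptions made for illustration only.

```python
def full_adder(a, b, c):
    """One full-adder cell: return (carry_out, sum)."""
    return (a & b) | (a & c) | (b & c), a ^ b ^ c


def to_bits(x, n):
    """Integer to a list of n bits, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """List of bits (LSB first) back to an integer."""
    return sum(b << i for i, b in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two equal-length bit lists; return (sum_bits, c_out)."""
    carry = c_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        # The carry ripples through every position in sequence,
        # so the delay grows linearly with the word length.
        carry, s = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


if __name__ == "__main__":
    s, c_out = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
    print(from_bits(s), c_out)
```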

Carry-skip adder (CSKA)

Carry-select adder (CSLA)

Carry-increment adder (CIA)

Example: gate-level schematic of carry-increment adder (CIA)

Conditional-sum adder (COSA) Carry-lookahead adder (CLA), traditional

(2 RCA + more than log MUX/bit)

High speed-up at medium hardware overhead

c′12 c′8 c′4 c′0

  

Parallel-prefix adders (PPA)

Prefix problem

also from wiring and maximum fan-out )

Prefix algorithms

Sklansky parallel-prefix algorithm
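The Sklansky scheme named here can be sketched in Python as a behavioural model of a parallel-prefix adder. The sketch assumes the usual generate/propagate formulation (g = a AND b, p = a XOR b) and folds the carry-in into bit 0. The function name, the parameters, and the in-place update order are illustrative assumptions, not taken from the notes.

```python
def sklansky_add(a, b, n=4, c_in=0):
    """Add two n-bit integers with a Sklansky prefix structure.

    Returns (sum, c_out), where sum is taken modulo 2**n.
    """
    # Bit-level generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Group signals; after the loop G[i] is the carry out of bit i.
    G, P = g[:], p[:]
    G[0] |= P[0] & c_in  # fold the carry-in into position 0

    span = 1
    while span < n:  # ceil(log2(n)) prefix levels
        for i in range(n):
            if i & span:
                # Node i lies in the upper half of its block of size 2*span
                # and combines with the last node j of the lower half.
                j = (i & ~(span - 1)) - 1
                G[i] |= P[i] & G[j]
                P[i] &= P[j]
        span *= 2

    carries = [c_in] + G[:-1]  # carry into each bit position
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[n - 1]


if __name__ == "__main__":
    print(sklansky_add(11, 6, n=4))
```

Each level doubles the group width, so the depth is logarithmic in n. In exchange, the last node of each lower half drives all nodes of the upper half, and that fan-out grows with the level.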

Example: 4-bit parallel-prefix adder (PPA-SK)

Prefix adder synthesis

Step 2: prefix graph expansion (size minimization):

Multilevel adders

Self-timed adders

high for CSLA ( COSA)

Difficult computation of optimal group sizes

Adder performance comparisons

Hybrid adders

Standard-cell implementations, 0.8 µm process

4 exact factors not calculated


if CPA = RCA: a) and b) have same overall delay

! log CPA

-operand adder accommodates

AND-prefix struct.

0 %&1 %&2 0 (parity)

c out %&1 %&1 %&2


A a7a5a3a1 prefix problem (r.m.a.) AND-prefix structure

extend un- l. 0 %&1 0 s1 s0

signed l. %&1 %&1 %&2 0 s0

%&1 %&2 0 0

saturate unsigned %&1 %&1 z3 z2 z1 z0 z3 z2 z1 z0

log

0 (2 &1&

0 3 0 2 0 1 0 0

irregular wiring and/or layout

2 1partial products (if no Booth recoding used)

%&1 %&2 %&3 0 1

! 2 or ! 2 log ≥ +/− CPA

1 if 2 $ 1

are estimated CSA Division by convergence

!log high performance, high area

1 2 2 1 2 1 2

Approximated exponentiation : ln 2 log

Integer exponentiation (exact) : 2

&1 1 0 %&1 (r.s.n.)