Anda di halaman 1dari 49

# VLSI Architecture :: MEL G642

## Dr. A. Amalin Prince

BITS Pilani K.K. Birla Goa Campus
Department of Electrical and Electronics Engineering

MEL G642
Contents

 Numerical representations
 Finite length data
 Signal processing under finite data precision
 Special cases

MEL G642
Numerical Representation

MEL G642
Numerical Representation

## Snapdragon SoC Qualcomm

MEL G642
General

 General
– Requirement on low silicon costs drives custom data types in
DSP applications.
– Word–16b; double word–32b;
– Custom word length = bus width = native
 Two’s complement
– From HW point of view, the addition and subtraction can be
performed independent of the sign of the operand.
 Fixed point two’s complement in our course

MEL G642
Fixed point

MEL G642
Fixed point numerical representation
 What is fixed point
– Integer or fractional representation, Dominate in DSP.
– Integer: between -2n-1 and +2n-1-1 or [-2n-1, 2n-1 -1]
– Fractional: between -1 and 1-2-n+1 or [-1, 1-2-n+1]
 Why fixed point (Important)
– Easy to implement data path HW
– Short physical critical path
– Low hardware (memory) costs, low power
– But, it must be the acceptable precision
 Why not fixed point (sometimes)
– Cannot separate dynamic range and precision
– High firmware design costs

MEL G642
Integer numerical representation

##  The integer representation

-an-12n-1 + an-22n-2+...... + a121+ a0
the range is −2n-1 ≤ X < 2n-1
 Example
01010011 = 83 (decimal)
= -0×27+1×26+0×25 +1×24+0×23+0×22+1×21+1×20
=26+24+21+20 = 64+16+2+1= 83
 Example
10101101 = -83 (decimal)
= -1×27+0×26+1×25 +0×24+1×23+1×22+0×21+1×20
=-27+25 +23+22+20 = -128+32 +8+4+1 = -83

MEL G642
Fractional numerical representation

##  The fractional representation

– - a0 + a-12-1 + a-22-2+...... + a-(n-1)2-(n-1) + an2-n
– a0 is the sign bit, the range is −1 ≤ X < 1
 Example
01010000 = 0.625 (decimal)
= -0×20+1×2-1+0×2-2 +1×2-3+0×2-4+0×2-5+0×2-6+0×2-7
=2-1+2-3 = 0.5 + 0.125 = 0.625
 Example (10101111+00000001)
10110000 = -0.625 (decimal)
= -1×20+0×2-1+1×2-2 +1×2-3+0×2-4+0×2-5+0×2-6+0×2-7
=-20+2-2+2-3 = -1+0.25 +0.125 = -0.625

MEL G642
General

 Precision
– The measurement of ability to distinguish between the closest
values, the distance between the smallest value that can be
represented.
 Example:
– Using a 16 bits fixed-point processor, the precision of the data
with native data width is 0.000-0000-0000-0001which is the
distanced between 0 and 0.000-0000-0000-0001.

MEL G642
General

 Dynamic range
– The ratio between the largest and the smallest numbers that can
be represented.
 Quantization error
– The numerical error introduced when a longer numeric format is
converted to a shorter one

MEL G642
Dynamic range and precision of 16b

## Dynamic range Dynamic range in DB Precision

Unsigned 1 to 65536 20 log10(216)=96dB 1
integer
Signed 1 to 32767 20 log10(215)=90dB 1
integer
Unsigned 2-16 to 0.99998474 96dB 2-16
fractional
Signed 2-15 to 0.99996948 90dB 2-15
fractional

## dynamic _ range  20 log  

positiveMAX
 positiveMIN 

MEL G642
Typical Sound Levels and Their
Decibel Levels

MEL G642

MEL G642
M.F format: QD(M,F) representation

##  M is the magnitude and F is the fractional

 D stands for base, for binary system, D=2
 An example of Q2(4.4), data width=M+F

M F

M F
.
23 = 8 2-4 =1/16=0.0625

## 22 =4 2-3 =1/8 =0.125

21 =2 2-2 =1/4 =0.25
20 =1 2-1 =1/2 = 0.5

MEL G642
Integer or Fractional

MEL G642
Example

##  Multiply 16-bits integer 02FF16 by 16-bits integer 011016

using a 16-bit fixed-point datapath and save the result in
a 16-bit register. Discuss the result.
– 32 bit result of the multiplication is 00032EF016
– The integer is right-aligned with the point at the right side of
LSB.
o the result after truncation must be aligned at the right side of the
result, and the truncation must be performed from the left side.
o The value of the result after truncation becomes 2EF016 or 0010
1110 1111 00002.

MEL G642
Dynamic range / precision
Format Largest positive value Largest negative value Precision
Q1.15 0.999969482421875 -1 0.00003051757813
Q2.14 1.99993896484375 -2 0.00006103515625
Q3.13 3.9998779296875 -4 0.00012207031250
Q4.12 7.999755859375 -8 0.00024414062500
Q5.11 15.99951171875 -16 0.00048828125000
Q6.10 31.9990234375 -32 0.00097656250000
Q7.9 63.998046875 -64 0.00195312500000
Q8.8 127.99609375 -128 0.00390625000000
Q9.7 255.9921875 -256 0.00781250000000
Q10.6 511.984375 -512 0.01562500000000
Q11.5 1023.96875 -1024 0.03125000000000
Q12.4 2047.9375 -2048 0.06250000000000
Q13.3 4095.875 -4096 0.12500000000000
Q14.2 8191.75 -8192 0.02500000000000
Q15.1 16383.5 -16384 0.05000000000000
Q16.0 32767 -32768 1.00000000000000
MEL G642
Example

 Discussion:
– We find that the result of an integer multiplication is wrong and
overflow when the result keeps the same precision as operands.
o A reliable way is to limit the range of both operands to be |x |< 007F16
– For iterative
o If there are L steps of integer multiplications in an iterative loop,
there will be L+ 1 identical sign bits
– reduced dynamic range after multiplication
o except for an extreme case dealing with the maximum negative
value, (-max)×(-max)=(+max)

MEL G642
Example

##  Multiply 16-bits fractional 02FF16 by 16-bits fractional

011016 using a 16-bit fixed-point datapath and save the
result in a 16-bit register. Discuss the result.
02FF 0000 0010 1111 1111 0.023406982
0110 0000 0001 0001 0000 0.008300781
00032EF0 0000 0000 0000 0011 0010 1110 111 000 0.000194296
– The fractional is left-aligned with the point at the left side of MSB.
o the result after truncation must be aligned at the leftt side of the
result, and the truncation must be performed from the right side.
o The value of the result after truncation becomes 000316 or 0000
0000 0000 00112 = 0.000091552.

MEL G642
Example

##  Calculate y=x8, where x=7F16( =0.9921875) is a fractional two’s

complement data, by using an 8-bit integer datapath. The output of
the multiplier is 16 bits when both operands are 8-bits. Discuss the
results.

## the intermediate 8-bit operand for

multiplication has to be
yn[7:0] =2-8 × Round (FF 00×Yn)
to match the data type of an
operand. (It is not a good
way, yet the only way to manage
the iteration.)

 Discussion
– The resolution of the result is lower after every step of the iteration. Integer-
based multiplication is not good for iterative computing.

MEL G642
Fraction MUL unit

##  How to design a fractional multiplier?

– Difficult?
– The integer two’s complement multiplier can be used for
fractional multiplication.
– After the multiplication, one of the two sign bits should be
removed. A saturation operation is required before removing one
of the sign bits to adapt the case when (-1)×(-1).
 Refer class notes

MEL G642
MEL G642
Fixed point numerical representation Integer and fractional multiplication

Integer Fractional

## Sign bit Sign bit Sign bit Sign bit

-23 22 21 20 × -23 22 21 20 -20 2-1 2-2 2-3 × -20 2-1 2-2 2-3

## Sign bit After multiplication

-21 20 2-1 2-2 -2-3 2-4 2-5 2-6
-27 26 25 24 -23 22 21 20 Sign bit

MEL G642
Fractional result
Truncate off
-20 2-1 2-2 2-3 -2-4 2-5 2-6 0

Truncate off
Example : Example :
0111 × 0111 0111 × 0111
= 7 × 7 = 00110001 = 0.875 × 0.875 = 00.110001=0.1100010
= (-03+23+21+20) × (-03+23+21+20) = 0.5 + 0.25 + 0.015625
= (-07+06+25+24+03+02+01+20) = 0.765625 (7-bit precision)
= 32 + 16 + 1 = -01+00+2-1+2-2+0-3 (4-bit precision)
= 49 = 0.75 (4-bit precision after truncation).
=01 (4-bit precision) unacceptable (Acceptable, not so bad)

MEL G642
Example - Extension

 Discussion
– it is not suitable to conduct integer multiplication if the ranges of operands are
unknown.
– The real result of y = (7F16)8 = 0.9391810; the radix point is right after the sign
bit.
– The result with finite precision hardware in Example is 0.9379.
– The error introduced by truncation is about 0.00128. The result of the fractional
multiplication is acceptable.
MEL G642
Floating point

MEL G642
Floating Point Numerical Representation

##  What is floating point

– Numbers are presented by a mantissa and an exponent
– value = (-1)S × mantissa × 2 exponent
 Why floating point
– to get a larger dynamic range
– The computing HW automatically scales each data to utilize the
full word length of mantissa after each arithmetic operation.
o Hence best resolution is maintained for small numbers
– Reduce firmware development time
 Why not floating point
– complicated data path, longer critical path in the data path, could
be more silicon cost
MEL G642
Floating Point Numerical Representation

## An example of IEEE 754 floating point numerical representation

Bit# 31 30 23 22 0

S EXPONENT FRACTION
MSB LSB MSB LSB

## S is the sign bit of the mantissa (0 is positive and 1 is negative)

EXPONENT is an unsigned 8-bit field that determines the location of the
binary point of the number being encoded.
FRACTION is a 23-bit field containing the fractional part of the mantissa
without sign bit
LSB is the lest significant bit
MSB is the most significant bit

MEL G642
Floating Point Numerical Representation

## S is the sign bit of the mantissa (0 is positive and 1 is negative)

EXPONENT is an unsigned 8-bit field that determines the location of the
binary point of the number being encoded.
FRACTION is a 23-bit field containing the fractional part of the mantissa
without sign bit
LSB is the lest significant bit
MSB is the most significant bit

MEL G642
MEL G642
Block floating point

##  A fixed-point processor does not have a floating-point

hardware unit
 Floating-point computing can be emulated by fixed-point
DSP processors when larger dynamic range is required.
 Block floating-point (also called dynamic signal scaling
technique) is an emulation method using a fixed-point
DSP processor hardware

MEL G642
Block floating point

## Larger dynamic range

The average amplitude
40dB

30dB

20dB
Exponent=7
Exponent=4

10dB
Exponent=0
0dB

10dB

Exponent=3
20dB

30dB
Time
80dB
Scale Scale
down down
Scale down 42db 24dB No scaling 18dB

MEL G642
Finite length DSP

MEL G642
Finite length DSP: the problem

results under a precision-limited datapath and storage
system
– The first problem: the finite resolution given by the analog to
– The second problem: most DSP processors are “fixed point
processors” to minimize the silicon costs
– The third problem: extra quantization error introduced while
scaling down signals within a finite data length system

MEL G642
Definitions

 Definition 1: Truncation:
– To convert a longer numerical format to a shorter one by
simple cutting off bits at the LSB part.
 Definition 2: Quantization error
– The numerical error introduced when a longer numeric
format is converted to a shorter one

MEL G642
Round operation
 Why: To eliminate bias error after truncation
 To round up the truncated part or round-to-nearest
technique
– To add the smallest value after truncation if the MSB of
 Implementation
o extra addition operation is needed.
o the hardware availability and timing for rounding have to be
considered during the instruction set design

MEL G642
A rounding example

## Round Arithmetic Example: Round 8 bits to 4 bits

Sign bit
Before round: 8 bits
A7 A6 A5 A4 A3 A2 A1 A0
Round to nearest
Sign bit b[3:0] = a[7:4] + {000, a}
After round: 4 bits
b7 b6 b5 b4

MEL G642
Overflow, saturation, and guard

 An overflow happens
– if X is not in the range -2N-1≤ X < 2N-1
 When overflow happens
– In a normal computer, overflow can be managed by
o exception handling
o Re-execution of the algorithm after scaling input data.
– In a DSP processor running real-time applications
o there is no time to handle exceptions and re-execution of the
application.
o overflows in a real-time system need to be handled by other
means.

MEL G642
Overflow, saturation, and guard

##  Three ways to deal with overflow

– To use a floating point processor
o Higher hardware costs
– To scale down the input data
o Precision will be lower
– To use guard and saturation arithmetic
o Good idea

##  DSP react on overflow

– might be exception or saturation

MEL G642
Manage Overflow: Guard/saturate

##  Most popular way in DSP hardware

 What is guard
– Add more sign extension bits to operands
– So that any results will/must be less than the maximum extended
value
– in the range that the guard can protect
2G×(–2N-1) ≤ X < 2G× (2N-1).

MEL G642
Guard bits requirements

##  In ASIC DSP Circuits

– design example discussed before?
 In DSP ASIP
– In an ALU without iteration support we require one guard bit
(G=1)
– No of guard bits for iteration

MEL G642
How many Guard bits required

## When accumulation needed, guard is needed.

Required Number of Guard bits is based on:

## both the scaling factor or the guard factor:

MEL G642
How many Guard bits required

##  For example, if there are a maximum of 512 iteration

steps
– all the coefficients are checked and the sum of 512 coefficients is
not more than 250, the factor of 28 = 256 or eight guard bits will
be enough

MEL G642
Manage Overflow: saturation

##  Saturation: after executing an algorithm

if result ≥1 then final_result = 1-2-n+1
elseif result <-1 then final_result = -1
else final_resul <= result // no overflow.
– Performed after an iterative accumulation
– Do not do it during convolution (exceptions)
 Discussion
– Better than exception for hard-real-time system

MEL G642
Manage Overflow: an example

## Sign bit Sign bit

Before guard: 4-bit OPA a3 a2 a1 a0 4-bit OPB b3 b2 b1 b0

## Sign bit Sign bit

After guard: 8-bit ga4 ga3 ga2 ga1 a3 a2 a1 a0 gb4 gb3 gb2 gb1 b3 b2 b1 b0
guards=a3 Accumulator guards=b3
8-bit full adder + Accumulator register
Data could be out of range
gc4 gc3 gc2 gc1 c3 c2 c1 c0
during accumulation
After accumulation
Saturation
Dynamic range
d3 d2 d1 d0
The final result is correct After saturation, the
iteration final result is 4-bits

MEL G642
Manage Overflow: saturation

##  One guard bit means one more sign bit extension to

operands and the arithmetic hardware.
– At the same time, saturation circuit is required at the output of
the arithmetic circuit.
 The saturation is enabled for single-precision computations
 In the following cases, saturation should be disabled:
– For lower subword computing of a double precession
algorithm.
– Unsigned arithmetic computing.
– During the execution of an iteration algorithm.

MEL G642
Manage Overflow: saturation

##  When saturation is disabled,

– the carry-out bit (flag) should be used to link the upper and
the lower part computing of double-precision arithmetic.

MEL G642
Examples of special cases
 Absolute operation on -1 in ALU.
 Without guard and saturation operations it will become:

A [3:0] = 4’b1000
ABS (A [3:0]) <= INV (1000)+0001 <= 0111+0001 <= 1000

##  The right result should be +7 (4’b0111). The correct operation with

guard and saturation should be:

## A [3:0] = 4’b1000; guarded A = GA [4:0] = 5’b11000

ABS (GA [4:0]) <= Truncate MSB (saturation (ABS (GA)))
<= Truncate MSB (saturation (INV(11000)+00001))
<= Truncate MSB (saturation (00111 + 00001))
<= Truncate MSB (saturation (01000))
<= Truncate MSB (00111) <= 0111

MEL G642
The right execution order

 Guarding operands
 Scaling and Rounding
 Saturation and removing guards
 Truncation

MEL G642
The End :: Thank you for your attention

Questions?

MEL G642