
Part I

Speech Processing

1 Introduction

Speech Processing:
1. Speech/Speaker Recognition
2. Synthesis
3. Speech Coding (Communication)
In this course:
1. Analysis (Modeling, Physiology)
(a) Time domain
(b) Frequency domain
2. Coding

2 Review of Digital Signal Processing Fundamentals

2.1 Discrete-Time Signals

Speech signals occur naturally as continuous-time acoustic signals, x_a(t). However, speech signals can be transduced into electrical signals, and sampled at period T = 1/F_s to become discrete-time signals or sequences:

x(n) = x_a(t)|_{t=nT} = x_a(nT).
Sampling Theorem: Let x_a(t) be a continuous-time signal, and let X_a(Ω) be the corresponding frequency representation. Suppose x_a(t) is bandlimited:

X_a(Ω) = 0, for |Ω| > Ω_a,

where Ω_a = 2πF_a. If x_a(t) is sampled at period T = 1/F_s to become x(n), then x_a(t) can be uniquely reconstructed from x(n) if F_s > 2F_a. Here, F_a is referred to as the bandwidth of x_a(t). The Nyquist rate is defined as F_N = 2F_a.
If the original continuous-time signal is not band-limited, the sampling component must be preceded by a low-pass filter with cut-off frequency F_c ≤ F_s/2 in order to avoid aliasing.
Note that speech signals are typically not band-limited. However, the majority of useful information is
contained below certain frequencies. For example, for speech coding, speech is generally sampled at 8 kHz
(narrowband coding) or 16 kHz (wideband coding). For music, the sampling rate is generally much higher,
such as 44.1 kHz for CD-quality PCM-coded music.

2.2 Important Discrete-Time Functions

Several discrete-time functions are important to digital speech processing, and will be used repeatedly in this course:

The delta function: x(n) = δ(n), where

δ(n) = 1 for n = 0, and 0 otherwise.

The step function: x(n) = u(n), where

u(n) = 1 for n ≥ 0, and 0 otherwise.

The one-sided exponential function: x(n) = a^n u(n), i.e.,

x(n) = a^n for n ≥ 0, and 0 otherwise.

Sinusoidal functions:

x(n) = cos(2πf n),
x(n) = sin(2πf n).

Euler's Identity: e^{jωn} = cos(ωn) + j sin(ωn). Note that Euler's Identity can be used to expand sinusoidal functions:

cos(ωn) = (1/2) (e^{jωn} + e^{−jωn}),
sin(ωn) = (1/2j) (e^{jωn} − e^{−jωn}).

2.3 Important Discrete-Time Frequency-Domain Transforms

Frequency-domain transforms allow for spectral analysis of signals and systems. In this section, several common discrete-time transforms will be reviewed:

The Z-transform expresses an input signal x(n) as a geometric series in the complex variable z = re^{jω}:

X(z) = Σ_{n=−∞}^{∞} x(n) z^{−n}.

An important topic regarding the Z-transform is the region of convergence (ROC), which is defined as all portions of the Z-plane for which:

Σ_{n=−∞}^{∞} |x(n) z^{−n}| = Σ_{n=−∞}^{∞} |x(n) r^{−n} e^{−jωn}| = Σ_{n=−∞}^{∞} |x(n) r^{−n}| < ∞.

Note that different discrete-time signals may have the same Z-transform, but different ROCs (e.g., x_1(n) = a^n u(n) and x_2(n) = −a^n u(−n − 1)).
The inverse Z-transform allows the time-domain signal to be obtained from the transform-domain representation:

x(n) = (1/2πj) ∮_C X(z) z^{n−1} dz,

where C denotes a path surrounding the origin and contained in the ROC.
The Discrete-Time Fourier Transform (DTFT) expresses an input signal as a geometric series in the complex variable e^{jω}:

X(e^{jω}) = Σ_{n=−∞}^{∞} x(n) e^{−jωn}.

The DTFT represents a transformation from discrete time, n, to continuous frequency, ω. By definition, the DTFT is periodic in frequency (ω) with period 2π. The inverse DTFT allows the time-domain signal to be obtained from the frequency-domain representation:

x(n) = (1/2π) ∫_{−π}^{π} X(e^{jω}) e^{jωn} dω.

Note that the DTFT can be interpreted as the Z-transform evaluated along the unit circle (|z| = 1):

X(e^{jω}) = X(z)|_{z=e^{jω}}.
The N-point Discrete Fourier Transform (DFT) provides a discrete frequency representation of a finite-length discrete-time signal x(n) of length L ≤ N:

X(k) = Σ_{n=0}^{L−1} x(n) e^{−j(2π/N)kn}, for 0 ≤ k ≤ N − 1.

The DFT represents a transformation from discrete time, n, to discrete frequency, k. Note that the length of the sequence is not required to be equal to the size of the DFT, as long as L ≤ N. If L < N, the sequence is said to be zero-padded with N − L zeros. The discrete frequency representation given by the DFT is often easier to handle in practice than the continuous representation provided by the DTFT.
Note that the DFT assumes implicit periodicity of period N, both in the time and frequency domain. The inverse DFT is given by:

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) e^{j(2π/N)kn}, for 0 ≤ n ≤ N − 1.

Note also that the DFT can be interpreted as a uniform sampling of the DTFT along the unit circle:

X(k) = X(e^{jω})|_{ω=2πk/N}.

There exist computationally efficient algorithms for determining the DFT, namely the Fast Fourier Transforms (FFTs), many of which require the length of the input time-domain signal to be a power of 2. Since this is not generally the case, time signals may be zero-padded to produce lengths of 2^ν (also known as radix-2).
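As a quick illustration of radix-2 zero-padding, the following minimal sketch (assuming NumPy; the helper name is ours) pads a signal to the next power-of-2 length before taking its FFT:

```python
import numpy as np

def radix2_fft(x):
    """Zero-pad x to the next power-of-2 length and return its FFT (a sketch)."""
    nfft = 1 << int(np.ceil(np.log2(len(x))))  # smallest 2**v >= len(x)
    return np.fft.fft(x, n=nfft)               # np.fft.fft zero-pads x to nfft

X = radix2_fft(np.ones(200))                   # 200 samples -> 256-point FFT
print(X.shape)                                 # (256,)
```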
Figure 1 provides views of the domains of the transforms previously discussed, in the Z-plane.

Figure 1: Views of the domains of various frequency transforms in the Z-plane: Panel a shows the Z-transform. Panel
b shows the Discrete-Time Fourier Transform (DTFT) as a subset of the Z-transform, evaluated along the unit circle
(r = 1). Panel c shows the Discrete Fourier Transform (DFT) as a uniform sampling of the DTFT along the unit circle
(in this case N = 8).

Example 1: Consider an input signal x(n) = [1, 2, 3, 2, 1], where the origin is at the first sample. The Z-transform can be determined as:

X(z) = Σ_{n=0}^{4} x(n) z^{−n} = 1 + 2z^{−1} + 3z^{−2} + 2z^{−3} + z^{−4}.

The DTFT can be determined as:

X(e^{jω}) = Σ_{n=0}^{4} x(n) e^{−jωn} = 1 + 2e^{−jω} + 3e^{−j2ω} + 2e^{−j3ω} + e^{−j4ω}
= e^{−j2ω} [3 + 4 cos(ω) + 2 cos(2ω)].

The DFT of length N = 8 (note that this implies zero-padding) is determined as:

X(k) = Σ_{n=0}^{7} x(n) e^{−j(2π/8)kn} = 1 + 2e^{−jπk/4} + 3e^{−jπk/2} + 2e^{−j3πk/4} + e^{−jπk}
= [9, −j5.828, −1, j0.172, 1, −j0.172, −1, j5.828].

The DFT can also be determined by sampling the DTFT uniformly along the unit circle:

X(k) = e^{−j2ω} [3 + 4 cos(ω) + 2 cos(2ω)]|_{ω=2πk/8}
= [9, −j5.828, −1, j0.172, 1, −j0.172, −1, j5.828].
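The numbers in Example 1 are easy to check numerically; a minimal sketch assuming NumPy:

```python
import numpy as np

x = np.array([1, 2, 3, 2, 1])
X = np.fft.fft(x, n=8)        # 8-point DFT; x is implicitly zero-padded
print(np.round(X, 3))
# [ 9.+0.j     0.-5.828j  -1.+0.j     0.+0.172j   1.+0.j
#   0.-0.172j  -1.+0.j     0.+5.828j]
```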

2.4 Properties of Frequency-Domain Transforms

There exist several important properties pertaining to frequency-domain transforms of discrete-time signals that assist in the analysis of signals and systems. If a property discussed in this section applies identically to all transforms previously discussed, it will only be explicitly formalized for the DFT, for the sake of brevity. However, where differences occur, the property will be formalized separately for each necessary transform.
Linearity:
Let x_1(n) and x_2(n) be time-domain signals, with X_1(k) and X_2(k) being the corresponding DFTs:

x_1(n) ↔ X_1(k),
x_2(n) ↔ X_2(k).

In this case, a linear combination of the time-domain signals will produce the same linear combination of the frequency-domain signals:

α x_1(n) + β x_2(n) ↔ α X_1(k) + β X_2(k).

One subtle aspect of the linearity property is that, when dealing with time-domain signals of various lengths, say N_1 and N_2, they must be initially zero-padded to at least length N = max(N_1, N_2) to avoid aliasing in the time domain.
Convolution:
Convolution of signals in the time domain results in multiplication of the corresponding signals in the frequency domain. Conversely, multiplication in the time domain results in convolution in the frequency domain. For the DTFT, this property refers to standard convolution, denoted by ∗:

x_1(n) ∗ x_2(n) ↔ X_1(e^{jω}) X_2(e^{jω}),
x_1(n) x_2(n) ↔ (1/2π) X_1(e^{jω}) ∗ X_2(e^{jω}).

However, due to the implied periodicity of the DFT, this property refers to circular convolution, denoted by ⊛, in this case:

x_1(n) ⊛ x_2(n) ↔ X_1(k) X_2(k),
x_1(n) x_2(n) ↔ (1/N) X_1(k) ⊛ X_2(k).
Time-Shifts:
Introducing a time shift results in modulation in the frequency domain by a complex exponential. For the DTFT, this refers to a standard time shift:

x(n − n_o) ↔ e^{−jωn_o} X(e^{jω}).

However, for the DFT, due to the implied periodicity, this refers to a modular time shift:

x((n − n_o) mod N) ↔ e^{−j2πkn_o/N} X(k).

Real Time-Domain Signals:
Real time-domain signals result in frequency-domain transforms which are conjugate symmetric. That is:

x(n) real ↔ X(k) = X*((−k) mod N).

This property is important for speech processing since speech signals are naturally real.
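This conjugate symmetry is easy to verify numerically; a small sketch assuming NumPy:

```python
import numpy as np

N = 8
x = np.random.randn(N)                        # a real time-domain signal
X = np.fft.fft(x)
k = np.arange(N)
# X(k) = X*((-k) mod N) for real x(n)
print(np.allclose(X, np.conj(X[(-k) % N])))   # True
```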

2.5 Insight Into Discrete-Time Frequencies

For continuous-time signals, frequencies can take values within the entire range (−∞, ∞). Generally, continuous-time frequencies are denoted Ω (radians per second) or F (hertz). However, as can be interpreted from the sampling theorem previously discussed, sampling limits the range of possible frequency components in a signal. For a discrete-time signal, frequencies are generally denoted by ω (radians per sample), which take values within [0, 2π). Figure 2 provides examples of unit-amplitude discrete-time signals at various frequencies.

Figure 2: Examples of Discrete-Time Frequencies: Panel a shows a unit-amplitude signal of frequency ω = 0. Panel b shows a unit-amplitude signal of frequency ω = π/2. Panel c shows a unit-amplitude signal of frequency ω = π, which is the highest possible frequency. Note that the phase is zero in each case.

The relationship between discrete-time frequencies and corresponding continuous-time frequencies is dependent on the sampling frequency, F_s:

ω = Ω/F_s = 2πF/F_s.

Furthermore, when the DFT is used for spectral analysis, the spacing between frequency bins is given as:

Δω = 2π/N.

Example 2:
Let x(n) be a discrete-time signal with sampling rate F_s = 8 kHz. If a segment of the signal corresponding to 25 ms is extracted for spectral analysis, and a radix-2 FFT is to be used, what is the minimum length N of the signal used (after zero-padding) during the transform?
A signal of duration T = 25 ms with sampling rate F_s comprises:

N = 0.025 s × 8000 samples/s = 200 samples.

However, since the length of the signal must be of the form 2^ν, the final length of the signal after zero-padding should be:

ν = ⌈log_2(N)⌉ = 8  ⟹  N_FFT = 2^8 = 256 samples.
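The same arithmetic in code (a sketch, assuming NumPy):

```python
import numpy as np

Fs = 8000                      # sampling rate (samples/s)
N = int(0.025 * Fs)            # 25 ms segment -> 200 samples
v = int(np.ceil(np.log2(N)))   # v = 8
print(N, 2 ** v)               # 200 256
```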
Example 3:
Let x (n) be a discrete-time signal with sampling rate Fs = 8 kHz. A segment of the signal
corresponding to 25 ms is extracted for spectral analysis. If the DFT is used for spectral analysis,
what is the minimum length, N , of signal used in order to provide frequency resolution F 20 Hz?
(In this case, dont worry about using FFTs.)
Note that the minimum frequency resolution is given in the continuous-frequency domain, so the
discrete-frequency resolution must be determined:
=

2F
=
0.005.
Fs
Fs

The frequency resolution provided by the DFT is given by =


can be determined as:
2
0.005
N

2
N,

so the minimum signal length

N 400 samples.

The minimum length of the discrete signal, after zero-padding, to guarantee frequency resolution of
20 Hz is N = 400 samples.

2.6 Linear Time-Invariant Systems

The properties of linearity and time-invariance greatly simplify the analysis of systems. Consider the system:

x(n) → [ H(z) ] → y(n)

Here, Y(z) = H(z) X(z), and y(n) is determined as:

y(n) = (h ∗ x)(n) = Σ_{m=−∞}^{∞} h(m) x(n − m),

where h(n) is the impulse response. Note that if the system applies the DFT, circular convolution (⊛) is used instead. To make the outputs of standard and circular convolution equivalent, the size of the DFT needs to be chosen as N ≥ N_1 + N_2 − 1, where N_1 and N_2 are the lengths of the signals being convolved.
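The equivalence of circular and linear convolution under sufficient zero-padding can be demonstrated with the DFT; a sketch assuming NumPy:

```python
import numpy as np

x1, x2 = np.array([1., 2., 3.]), np.array([4., 5., 6.])

# Circular convolution (length 3) via the DFT convolution property:
circ = np.fft.ifft(np.fft.fft(x1) * np.fft.fft(x2)).real    # [31. 31. 28.]

# Zero-padding to N >= N1 + N2 - 1 recovers standard (linear) convolution:
N = len(x1) + len(x2) - 1
lin = np.fft.ifft(np.fft.fft(x1, N) * np.fft.fft(x2, N)).real
print(np.allclose(lin, np.convolve(x1, x2)))                 # True
```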
Systems comprise zeros and poles (which will be discussed later during the topic of speech production). Zeros and poles are characterized by a spectral location as well as a bandwidth. The effect of a zero is an attenuation in the frequency response, whereas the effect of a pole is a gain in the frequency response.
Linearity:
Let x_1(n) and x_2(n) be input signals, and let y_1(n) and y_2(n) be the corresponding output signals. If the system H(z) is linear, then:

α x_1(n) + β x_2(n) → α y_1(n) + β y_2(n).

Time Invariance:
The property of time-invariance describes a system H(z) as remaining constant throughout time. If H(z) is time-invariant, then:

x(n − n_o) → y(n − n_o).

As will be seen later in the course, the speech production system is highly time-varying. However, aspects of the speech production system can be sampled at high enough rates to allow the assumption of stationarity and thus assume time-invariance.
If a system is LTI, then the transfer function H(z), or equivalently the impulse response h(n), completely characterizes the system.
Additionally, systems that are in series can be combined to form a single cascaded system:

x(n) → [ H_1(z) ] → [ H_2(z) ] → y(n)   ≡   x(n) → [ H(z) ] → y(n),

where H(z) = H_1(z) H_2(z), or equivalently, h(n) = (h_1 ∗ h_2)(n).

3 The Speech Production System

Figure 3 provides a simplified overview of the speech production system. Here, the speech production
organs can be divided into three main groups: the lungs, the larynx, and the vocal tract. The lungs provide
airflow to the larynx stage of the production system. The larynx modulates the flow of air, resulting either
in a periodic series of pulses, or a noisy flow. The vocal tract spectrally shapes the flow of air. In this
section, a linear model will be used to approximate the speech production system, and signal processing
techniques will be utilized to analyze the resulting speech signal.

Figure 3: An overview of the speech production system.

The speech production system can be approximated as a linear system:

u(n) → [ V(z) ] → s(n)
(source signal) → (vocal tract) → (speech signal)

Here, the source signal, u(n), is created by the lungs and larynx components of the overall production system illustrated in Figure 3. The source signal then passes through a linear time-invariant system, V(z), representing the vocal tract. The resulting speech signal, s(n), is then emitted.

3.1 The Source Signal

As discussed previously, the source signal, u (n), is created by the lungs and larynx. During speech, the
lungs maintain a slightly raised air pressure, causing a steady flow of air through the trachea. The larynx
modulates the air flow from the lungs.
The larynx is a system of muscles, cartilage, and ligaments, whose function is to control the vocal
cords or vocal folds. The vocal cords are a set of ligaments which stretch from the back to the front of the

larynx. The slit located between the vocal cords is called the glottis. Figure 4 shows the vocal folds and
glottis, as a downward-looking view of the human larynx.

Figure 4: The vocal folds and glottis, as a downward-looking view of the human larynx.

The vocal cords can be operated in three different states: breathing, unvoiced, and voiced. In the
breathing state, the vocal cords are held apart with little tension to allow steady airflow without resistance.
In the unvoiced state, the cords are held apart, though closer and more tense than in the breathing state.
This results in some turbulence, referred to as aspiration. In the voiced state, the vocal cords are brought
together and held tense, resulting in oscillation of the cords. The breathing state is not directly involved
in speech production, so in this section we will analyze the source signals produced during the unvoiced
and voiced vocal cord states.
3.1.1 Unvoiced Source Signal

When the vocal cords are in the unvoiced state, air flows through the larynx with slight turbulence, or
aspiration. The actual signal produced by unvoiced vocal cords is similar to white noise. The aspiration
caused by low tension in the vocal cords results in whisper-like speech.
3.1.2 Voiced Source Signal

The close proximity and high tension of vocal cords during the voiced state result in oscillation of the
vocal cords, which leads to periodic modulation of airflow through the larynx. Each period is roughly
characterized by a period of closure (closed phase), then a slow opening of the cords (open phase), followed
by a rapid closure (return phase).
A study of the voiced source signal is presented in Fig. 5. The top left panel illustrates the airflow
corresponding to one period of a simulated glottal closure. Note that the various phases of the vocal cord
formations are labeled. The top right panel shows the spectrum of one period of the voiced source function.
Note that the spectrum is monotonically decreasing with respect to frequency.
The bottom left panel illustrates a series of glottal closures. The time between adjacent peaks is referred to as the pitch period of the signal. The bottom right panel provides the spectrum of the series of glottal closures. Note that the spectrum approximates a sampled version of the spectrum above.
The pitch period of a speaker determines how high or how low the resulting speech sounds. The pitch period is denoted by T_0. A related parameter, the pitch frequency or fundamental frequency, is denoted by F_0, and is determined as:

F_0 = 1/T_0 ≈ (1/2π) · 1/√(MC),

where M is the mass of the vocal folds per unit length and C is a measure of compliance of the vocal folds, C = (stiffness)^{−1}: if you stiffen them, F_0 goes up; if you slacken them, F_0 goes down.
Although pitch varies in time during speech, there exists a dependency of pitch on sex and age: females and children have shorter vocal cords than adult males. Table 1 provides ranges of pitch frequencies for males, females, and children.

Figure 5: The Voiced Source Signal: The top panels show the time waveform and spectrum of one period of a simulated voiced source signal. The various temporal segments are labeled. The bottom panels illustrate the time waveform and spectrum of the complete voiced source signal.

                 F_0 (Hz)
           ave.   min.   max.
men        125     80    200
women      225    150    350
children   300    200    500

Table 1: The dependency of pitch (F_0) on sex and age.

3.2 The Vocal Tract

The vocal tract comprises the oral cavity and the nasal passage, and the two are coupled via the velum
(see Figure 3). The main function of the vocal tract is to spectrally shape the source signal to produce
desired speech sounds. Another function of the vocal tract is to generate new sources of sound.

3.2.1 Spectral Shaping

It is commonly assumed that the relationship between the source signal airflow and the airflow outputted by the vocal tract can be approximated by a linear filter, V(z). Certain configurations of the vocal tract components create specific resonant frequencies, referred to as formant frequencies or formants, and denoted by F_i. Note that the term formant can refer to information regarding both the spectral location and bandwidth of the corresponding resonance. In terms of the expression for V(z), the energy gain present at formants is due to its p poles. Since speech is naturally a real signal, the vocal tract transfer function's poles are either real or complex-conjugate pairs:

V(z) = G / [ ∏_{k=1}^{p_1/2} (1 − c_k z^{−1})(1 − c_k* z^{−1}) · ∏_{ℓ=1}^{p_2} (1 − r_ℓ z^{−1}) ],    p = p_1 + p_2.

Figure 6 illustrates an approximation of the frequency response of the vocal tract transfer function for the vowel sound in "boot", Figure 7 shows the corresponding pole-zero plot, and |V(z)| is plotted in Fig. 8. Note that three formants are noticeable in this case. Also, F_1 shows a high peak and narrow bandwidth as it is located near the unit circle of the Z-plane. Conversely, F_2 and F_3 show lower energy and wider bandwidths, since they are placed farther from the unit circle.

Figure 6: The approximated vocal tract transfer function for the vowel sound in boot.

3.2.2 Source Generation

Besides spectrally shaping the airflow signal from the larynx, the vocal tract can also generate source
signals. Components of the vocal tract can generate two types of source signal: impulse sources and
frication. Impulse sources are produced by creating a full closure in the oral cavity (such as by the tongue
or lips), followed by a quick release. Frication sources are produced by creating a steady partial closure
(such as between the tongue and palate, or between the tongue and teeth), causing turbulence.

3.3 The Complete Speech Production Model

As can be interpreted from the linear speech production model, the final speech signal is determined as the convolution of the source signal, u(n), with the impulse response of the vocal tract transfer function, v(n):

s(n) = Σ_{τ=−∞}^{∞} v(n − τ) u(τ).

In the transform domain, this can be expressed as:

S(k) = V(k) U(k).
Example:
Figure 9 shows the spectrum of a steady-state vowel. (Note that in the figure, the harmonics appear as lobes, rather than as clean spikes, which is an effect of windowing. This will be addressed as a future topic.) What is the pitch period in samples? If the sampling frequency is known to be F_s = 8 kHz, what is the pitch frequency in hertz? Is the speaker more likely to be an adult male, an adult female, or a child?

Figure 9: The spectrum of a steady-state vowel. (Note that in the figure, the harmonics appear as lobes, rather than as clean spikes, which is an effect of windowing. This will be addressed as a future topic.)

The harmonics can be observed to be uniformly spaced by π/24. This corresponds to a pitch period of 48 samples. If the signal is sampled at F_s = 8 kHz, the pitch frequency (in hertz) can be determined as:

F_0 = (1 / 48 samples) × 8000 samples/s ≈ 167 Hz.

Using Table 1, it can be determined that the speaker is most likely an adult male.
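A sketch of the arithmetic in this example:

```python
import math

Fs = 8000                       # sampling rate (Hz)
omega0 = math.pi / 24           # observed harmonic spacing (rad/sample)
T0 = 2 * math.pi / omega0       # pitch period: 48 samples
F0 = Fs / T0                    # fundamental frequency
print(T0, round(F0))            # 48.0 167
```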

3.4 Sound Classification

The basic unit of speech is referred to as a phoneme. Table 2 provides a condensed list of General American
English phonemes, along with their International Phonetic Alphabet (IPA) and ARPABET symbols. The
phonemes can be classified by considering the previous discussion on the physiological aspects of human
speech production. Discriminative features used to classify phonemes include the presence of voicing, the
location of formant frequencies, resonance within the nasal cavity, the presence of sound generation within
the vocal tract, and many others. The major classes of phonemes are vowels and consonants. The latter
include nasals, fricatives, and plosives. Some speech sounds are transitional and can be consonants, vowels,
or intermediate sounds.

3.4.1 Vowels

Vowels are characterized by a voiced source signal. Each vowel corresponds to a different vocal tract
configuration, resulting in specific formants. In English speech, the nasal cavity is decoupled from the oral
cavity across all vowels.
3.4.2 Consonants

Consonants are characterized by a restriction of the airflow somewhere in the vocal tract.


3.4.3 Nasals

Nasals are also characterized by a voiced source signal. For nasals, the velum is lowered, which couples the
nasal cavity and oral cavity. Additionally, the oral tract may be constricted, resulting in output airflow
being radiated from the nostrils. Various nasal phonemes are distinguished by the location at which the
oral cavity is constricted.
3.4.4 Fricatives

The discriminative feature of fricatives is that they entail source generation within the oral cavity. The
oral cavity is constricted to a certain extent, causing turbulence. Examples include frication between the
tongue and the palate (/s/) or between the upper teeth and the lower lip (/f/). Additionally, fricatives
can include a voiced or unvoiced source signal.
3.4.5 Plosives (Stops)

Plosives are characterized by an impulsive burst of airflow generated within the oral cavity. This is
produced by creating a full closure with the tongue or lips, followed by a quick release. Plosives can include
voiced or unvoiced source signals.
3.4.6 Transitional Speech Sounds

The phoneme types previously discussed assume a degree of stationarity of the speech production system. However, certain phonemes are defined by the transition between steady-state sounds. These phoneme classes include diphthongs, glides, semi-vowels, and affricates.

Phoneme  ARPABET  Example    |  Phoneme  ARPABET  Example
/i/      IY       beat       |  /ŋ/      NG       sing
/ɪ/      IH       bit        |  /p/      P        pet
/eʲ/     EY       bait       |  /t/      T        ten
/ɛ/      EH       bet        |  /k/      K        kit
/æ/      AE       bat        |  /b/      B        bet
/ɑ/      AA       bob        |  /d/      D        debt
/ʌ/      AH       but        |  /g/      G        get
/ɔ/      AO       bought     |  /h/      HH       hat
/oʷ/     OW       boat       |  /f/      F        fat
/ʊ/      UH       book       |  /θ/      TH       thing
/u/      UW       boot       |  /s/      S        sat
/ə/      AX       about      |  /ʃ/      SH       shut
/ɨ/      IX       roses      |  /v/      V        vat
/ɝ/      ER       bird       |  /ð/      DH       that
/ɚ/      AXR      butter     |  /z/      Z        zoo
/aʷ/     AW       down       |  /ʒ/      ZH       azure
/aʲ/     AY       bite       |  /tʃ/     CH       church
/ɔʲ/     OY       boy        |  /dʒ/     JH       judge
/j/      Y        you        |  /ʍ/      WH       which
/w/      W        wit        |  /l̩/      EL       battle
/ɹ/      R        rent       |  /m̩/      EM       bottom
/l/      L        let        |  /n̩/      EN       button
/m/      M        met        |  /ɾ/      DX       batter
/n/      N        net        |  /ʔ/      Q        (glottal stop)

Table 2: Condensed IPA and ARPABET lists for General American English.

Possible classification of General American English phonemes:

sounds → voiced / unvoiced (vocal-cord vibration)

vowels (characterized by formant frequencies or resonances):
  monophthongs: /i, ɪ, e, ɛ, æ, ɑ, ʌ, ɔ, o, ʊ, u/
  diphthongs: /aʲ, eʲ, aʷ, oʷ/

consonants:
  aspirant: /h/
  semivowels (glides): /w, j/
  liquids: /l, ɹ/
  stops: /p, t, k, b, d, g/ (silence + burst)
  fricatives: /s, ʃ, z, ʒ, v, f, θ, ð/ (turbulence noise generation)
  nasals: /m, n, ŋ/
  affricates: /tʃ, dʒ/ (stops + fricatives)

4 Tube Model of the Vocal Tract

How can one find the transfer function V(z)?


Wave equations:

−∂p(x,t)/∂x = (ρ/A) ∂u(x,t)/∂t,
−∂u(x,t)/∂x = (A/(ρc²)) ∂p(x,t)/∂t,

where
A = cross-sectional area (assumed uniform),
t = time,
ρ = 1.14 × 10⁻³ g/cm³ (density of air),
u = volume velocity, ranging from 0 cm³/s to 1000 cm³/s,
c ≈ 340 m/s to 350 m/s (speed of sound in air),
p = pressure.
The pressure is usually given in terms of the sound pressure level, SPL, in dB:

SPL = 20 log₁₀(P_rms / P_ref),

where P_ref = 20 µPa is the reference pressure, considered to be the audibility threshold for humans at 1 kHz.
Note that in the equations above we can draw parallels to electric circuits by making the following analogies:

Acoustic quantity                   Analogous electric quantity
p, pressure                         v, voltage
u, volume velocity                  i, current
ρ/A, acoustic inductance            L, inductance
A/(ρc²), acoustic capacitance       C, capacitance

If the width of the vocal tract (on average around 2 cm) is much smaller than λ = c/F then, to a first-order approximation, the pressure waves can be considered planar, and the vocal tract can be modeled as a series of tubes. Three main tubes are considered: the quarter-wavelength tube (closed at one end and open at the other) and the two half-wavelength tubes (one closed at both ends and the other open at both ends). For a tube of length l, the resonant frequencies are:

quarter-wavelength tube:  F = (c/4l)(2n − 1),  n = 1, 2, . . .
half-wavelength tubes:    F = (c/2l) n,        n = 1, 2, . . .

In addition, small constriction tubes that connect larger tubes can be modeled as Helmholtz resonators (like a wine bottle), characterized by a resonant frequency given by

F = (c/2π) √( A_1 / (l_1 l_2 A_2) ),

where l_1 and A_1 are the length and cross-sectional area of the constriction, and l_2 and A_2 are the length and area of the previous tube; see Fig. 10.

Figure 10: Helmholtz resonator.

5 Short-Time Speech Analysis

We have previously looked at properties and applications of linear, time-invariant (LTI) systems. However, when modeling the speech production system as a linear system, it must be considered time-varying due to the dynamic nature of speech. Thus, in order to correctly analyze the spectral or temporal characteristics of speech, we can extract short segments that can be assumed stationary. That is, we can window speech segments wherein information regarding pitch, formants, etc., doesn't change significantly.

5.1 The Short-Time Fourier Transform

Common tools used to analyze the spectral characteristics of speech with respect to time are the discrete-time short-time Fourier Transform (STFT) and the discrete STFT. For an input speech signal s(n), the discrete-time STFT is defined as:

S_n(e^{jω}) = Σ_{m=−∞}^{∞} s(m) w(n − m) e^{−jωm},

where w(n) is the analysis window used to extract the current speech signal segment, and is nonzero only on the range n ∈ [0, N_w − 1].
The STFT provides information regarding the spectral characteristics of speech signals as a function of two variables, n and ω. The STFT can be interpreted in two ways:
1. Fixed time (n):
If the time index, n, is fixed, the STFT provides the Fourier Transform of x_n(m) = s(m) w(n − m), a speech signal windowed on the range m ∈ [n − N_w + 1, n]:

S_n(e^{jω}) = Σ_{m=−∞}^{∞} x_n(m) e^{−jωm}.

2. Fixed frequency (ω):
If the frequency, ω, is fixed, |S_n(e^{jω})|² provides the trajectory of energy values with respect to time contained by the input speech signal at frequency ω.

Note that the discrete-time STFT provides the frequency variable as a continuous function, which may not be feasible for some applications. Thus the discrete STFT can be used:

S_n(k) = Σ_{m=−∞}^{∞} s(m) w(n − m) e^{−j(2π/N)km}.

The discrete STFT can also be expressed as the discrete-time STFT, sampled uniformly in the frequency domain:

S_n(k) = S_n(e^{jω})|_{ω=2πk/N}.
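A minimal discrete-STFT sketch (frame, window, zero-padded DFT), assuming NumPy; the frame length, hop, and DFT size below are illustrative choices, not values fixed by the notes:

```python
import numpy as np

def discrete_stft(s, Nw=256, hop=128, N=512):
    """Rows of the result are S_n(k) for successive frame positions n."""
    w = np.hamming(Nw)                       # analysis window w(n)
    frames = [s[m:m + Nw] * w for m in range(0, len(s) - Nw + 1, hop)]
    return np.array([np.fft.fft(f, n=N) for f in frames])

S = discrete_stft(np.random.randn(4000))
print(S.shape)                               # (30, 512): 30 frames, 512 bins
```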

5.2 The Analysis Window

Two common analysis windows used are the rectangular window and the Hamming window:

The Rectangular Window:

w_r(n) = 1 for 0 ≤ n ≤ N_w − 1, and 0 otherwise.

The Hamming Window:

w_h(n) = 0.54 − 0.46 cos(2πn/(N_w − 1)) for 0 ≤ n ≤ N_w − 1, and 0 otherwise.

The rectangular and Hamming windows are illustrated, along with their spectra, in Figure 11. The top panels illustrate the time waveform and spectrum of the rectangular window. The bottom panels illustrate the time waveform and spectrum of the Hamming window. Note that the rectangular window has a small main-lobe bandwidth (≈ 4π/N_w), but its side-lobes are very prominent. The Hamming window shows a main lobe of greater bandwidth (≈ 8π/N_w), but has highly attenuated side-lobes.
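The main-lobe widths quoted above can be measured numerically by sampling each window's DTFT densely with a zero-padded FFT; a sketch assuming NumPy:

```python
import numpy as np

Nw, Nfft = 64, 8192
for name, w in [("rectangular", np.ones(Nw)), ("Hamming", np.hamming(Nw))]:
    W = np.abs(np.fft.fft(w, Nfft))          # dense sampling of the DTFT
    null = np.argmax(W[1:] > W[:-1])         # first bin where |W| starts rising
    width = 2 * null * 2 * np.pi / Nfft      # full main-lobe width (rad)
    print(name, round(width / (np.pi / Nw), 1))   # ~4 (rect) and ~8 (Hamming)
```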
Figure 11: Examples of common analysis windows: The top panels illustrate the time waveform, w_r(n), and spectrum, W_r(ω), of the rectangular window. The bottom panels illustrate the time waveform, w_h(n), and spectrum, W_h(ω), of the Hamming window.

5.3 Spectral Leakage Associated with Windowing

Applying analysis windows to speech segments during spectral analysis causes energy naturally located in a given spectral location to leak into nearby locations. Since the window is multiplied with the speech signal in the time domain, spectral leakage can be interpreted as convolution of the true speech spectrum with the spectrum of the analysis window:

w(n) s(n) ↔ (1/2π) W(e^{jω}) ∗ S(e^{jω}).

Thus, the optimal (yet unrealistic) window would have a frequency response composed of a delta function located at ω = 0. Since this window would be infinite in duration, we instead desire to use windows which have spectra showing low-bandwidth main lobes and highly attenuated side-lobes, thus approximating the optimal delta function case.
In the case of voiced speech, spectral leakage will cause harmonics to appear as lobes, as opposed to clean spikes.

Example:
Figure 12 shows the time waveform and discrete STFT of the speech signal "seven". As was discussed, the STFT can be interpreted in two ways, for a fixed time index n, or for a fixed frequency index k. Figure 13 illustrates both cases: for a fixed time index during the steady-state vowel /e/, and for the fixed frequency bin corresponding to 1813 Hz.

Figure 12: Example of the Discrete STFT: the top panel shows the time waveform for the speech signal "seven". The bottom panel shows the corresponding discrete STFT, using a Hamming window.

Example:
Figure 14 provides an example spectrum of a steady-state vowel. If spectral analysis was carried out using a Hamming window, what was the length of the analysis window? (Assume negligible effects due to side-lobes.) If the sampling rate was F_s = 8 kHz, what was the duration of the analysis window, T_w, in seconds?
The bandwidth of the main lobe of a Hamming window is 8π/N_w. From Figure 9, it can be observed that the bandwidth of the analysis window is roughly equal to the harmonic spacing, i.e., π/24. Thus:

8π/N_w = π/24  ⟹  N_w = 192 samples.

Figure 13: Illustrating the various interpretations of the STFT: the top panel shows the STFT for a fixed time index during the steady-state vowel /e/; the bottom panel shows the STFT for a fixed frequency bin corresponding to 1813 Hz.

If the sampling rate is F_s = 8 kHz, then:

T_w = 192 samples × (1 / 8000 samples/s) = 0.024 s.

5.4 Wideband vs. Narrowband Analysis

Recall the inverse relationship between analysis window length (in the time domain), and main lobe
bandwidth (in the frequency domain). Thus, as we decrease the duration of the analysis window, we
increase the amount of spectral leakage, and smear the resulting spectrum. In such a scenario, harmonics
will become blurred, but formants will become resolved. Additionally, better time resolution is achieved.
This corresponds to wideband (WB) analysis.
Conversely, if we increase the duration of the analysis window, we decrease the spectral leakage, and
reduce the amount of smearing of the resulting spectrum. In this scenario, harmonics will remain
resolved, but it may be difficult to accurately locate formants. Here, better frequency resolution is achieved.
This corresponds to narrowband (NB) analysis.
Figure 15 provides examples of wideband and narrowband STFT analysis. The top panel shows WB analysis; the bottom panel shows NB analysis. Note the resolved harmonics in the NB case, whereas the harmonics have been blurred due to spectral leakage in the WB case. However, note that the formants are clearer for the WB case. Similarly, Fig. 16 shows that formants are more clearly visible in the WB analysis of vowels, whereas NB analysis can be used to discern harmonics. WB analysis can be used for segmentation, as shown by the crisper transitions in the WB spectrogram of Fig. 17. Also note the three occurrences of /æ/ at 200 ms, 600 ms and 1200 ms from the start of the utterance: the wideband spectrogram in Fig. 17 shows that the formants are the same, while the narrowband spectrogram indicates that the pitch varies.

Figure 14: The spectrum of a steady-state vowel.

5.5 Time-Domain Analysis

Although short-time spectral analysis is widely used for speech processing applications, there exist a
number of short-time time-domain speech features which are also important.
Short-Time Energy:

STE(n) = Σ_{m=−∞}^{∞} |s(m)|² w(n − m).

The short-time energy is commonly used in voice-activity detection (VAD) and automatic speech recognition (ASR) of high-energy segments (i.e., vowels). It helps detect pauses, boundaries between phonemes, words, or syllables, and voiced vs. unvoiced sounds.
Zero-Crossing Rate:

ZCR(n) = (1/2N_w) Σ_{m=−∞}^{∞} |sgn(s(m)) − sgn(s(m − 1))| w_r(n − m).

The zero-crossing rate is used in applications such as VAD to differentiate between periodic signals (low ZCR) and noisy signals (high ZCR). Typical values of ZCRs are 1400 crossings per second for voiced sounds and 4900 crossings per second for unvoiced sounds. A threshold at about 2500 crossings per second can be used to discriminate voiced vs. unvoiced sounds. ZCR can be used to estimate the frequency of a periodic sound: note that the frequency F_a of a sinusoid is related to the zero-crossing rate by

F_a = ZCR · F_s / 2.
Short-Time Autocorrelation:

R_n(k) = Σ_{m=−∞}^{∞} x_n(m) x_n(m − k),        (1)

where x_n(m) = s(m) w(n − m). The short-time autocorrelation is an important tool in determining Linear Predictive Coefficients (LPCs), which are used in ASR as well as speech coding.
The (non-windowed) autocorrelation function, R(k) = Σ_{m=−∞}^{∞} s(m) s(m − k), has several important properties:

Figure 15: Wideband vs. narrowband STFT analysis: the top panel illustrates wideband analysis of the speech signal "seven"; the bottom panel illustrates narrowband analysis of the speech signal "seven".

1. R(k) is an even function.
2. If s(n) is periodic, i.e., s(n) = s(n + N) for some N, then R(k) is periodic:

R(k) = Σ_m s(m) s(m − k) = Σ_m s(m) s(m − k − N) = R(k + N).

3. R(k) has a maximum value at R(0).

The autocorrelation function of a voiced speech segment has peaks occurring at intervals equal to 1/F_0 seconds.

Short-time average magnitude difference function (AMDF):

AMDF_n(k) = Σ_{m=−∞}^{∞} |s(m) w_r(n − m) − s(m − k) w_r(n − m + k)|.

The AMDF of voiced (i.e., quasi-periodic) speech segments has nulls at intervals equal to 1/F_0 seconds.
See also Sections 6.1–6.4.3 of the textbook.
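The short-time autocorrelation suggests a simple pitch estimator: pick the lag of the largest autocorrelation peak within a plausible pitch range. A sketch assuming NumPy (the helper name, search range, and test signal are illustrative):

```python
import numpy as np

def estimate_f0(frame, Fs, f0_min=80.0, f0_max=500.0):
    """Pitch via short-time autocorrelation: lag of the strongest peak."""
    x = frame * np.hamming(len(frame))
    R = np.correlate(x, x, mode="full")[len(x) - 1:]   # R(0), R(1), ...
    lo, hi = int(Fs / f0_max), int(Fs / f0_min)
    lag = lo + int(np.argmax(R[lo:hi]))
    return Fs / lag

Fs = 8000
t = np.arange(int(0.025 * Fs)) / Fs
frame = np.sign(np.sin(2 * np.pi * 125 * t))   # crude 125 Hz "voiced" frame
print(round(estimate_f0(frame, Fs)))            # 125
```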

(a) Wideband analysis of five vowels.

(b) Narrowband analysis of five vowels.


Figure 16: Wideband vs. narrowband STFT of five vowels.


Figure 17: Wideband (top) vs. narrowband (bottom) STFT analysis of the sentence "Is Pat sad or mad?" The fifth, tenth, and fifteenth harmonics of two vowels are marked with white squares in the bottom panel. From P. Ladefoged, A Course in Phonetics, 2001.

6 Linear Prediction Analysis

Linear prediction analysis is based on the notion that speech samples can be predicted accurately from
previous speech. That is, speech samples can be estimated as a linear combination of past samples. In
determining the optimal weights by which to combine previous samples, we reveal important spectral
information regarding the signal.

6.1 The All-Pole Vocal Tract Model

As discussed previously, we assume an all-pole model for the vocal-tract transfer function:

V(z) = G / [ ∏_{k=1}^{p_1/2} (1 − c_k z^{−1})(1 − c_k* z^{−1}) · ∏_{ℓ=1}^{p_2} (1 − r_ℓ z^{−1}) ],    p = p_1 + p_2.

Since speech is naturally a real signal, the poles appear either as complex-conjugate pairs, (c_k, c_k*), or as real poles, r_ℓ. An alternative form of the vocal tract transfer function, which will prove useful for linear prediction analysis, is:

V(z) = G / (1 − Σ_{k=1}^{p} a_k z^{−k}).        (2)

Recall the expression that relates the source signal U(z) and the output speech S(z) in a linear system:

S(z) = V(z) U(z).        (3)

By substituting the expression for the all-pole transfer function from (2) into the expression for the linear speech production model (3), we obtain:

S(z) = Σ_{k=1}^{p} a_k S(z) z^{−k} + G U(z).

By taking the inverse Z-transform we obtain:

s(n) = Σ_{k=1}^{p} a_k s(n − k) + G u(n).

Thus, the current speech sample, s(n), can be predicted as a function of past speech samples {s(n − 1), . . . , s(n − p)} and the current source signal sample, u(n). The parameters a_k are referred to as the Linear Predictive Coefficients (LPCs).
The expression A(z) = 1 − Σ_{k=1}^{p} a_k z^{−k} is referred to as the inverse filter, since it theoretically inverts the effect of the vocal tract and returns the source signal. If the vocal tract truly is an all-pole system and is modeled perfectly during linear predictive analysis, then inverse filtering simply gives the source function:

A(z) S(z) = A(z) V(z) U(z) = A(z) (G / A(z)) U(z) = G U(z).

6.2 Deriving Linear Predictive Coefficients

To derive the linear predictive coefficients, a_k, we first define the least-square error function:

E = Σ_{n=−∞}^{∞} (s(n) − ŝ(n))² = Σ_{n=−∞}^{∞} ( s(n) − Σ_{k=1}^{p} a_k s(n − k) )²,        (4)

where ŝ(n) is the estimate of s(n) based on the past samples.


To derive each optimal coefficient, we minimize the least-square error function by setting the derivative of E with respect to that coefficient to zero:

∂E/∂a_i = ∂/∂a_i Σ_{n=−∞}^{∞} ( s(n) − Σ_{k=1}^{p} a_k s(n − k) )²
= −2 Σ_{n=−∞}^{∞} ( s(n) − Σ_{k=1}^{p} a_k s(n − k) ) s(n − i)
= −2 ( Σ_{n=−∞}^{∞} s(n) s(n − i) − Σ_{k=1}^{p} a_k Σ_{n=−∞}^{∞} s(n − k) s(n − i) )
= 0.

This leads to the normal equations:

Σ_{n=−∞}^{∞} s(n) s(n − i) = Σ_{k=1}^{p} a_k Σ_{n=−∞}^{∞} s(n − k) s(n − i),    i = 1, . . . , p.        (5)

Note that using (5) in (4) yields the minimum least-square error:

E_min = Σ_{n=−∞}^{∞} ( s²(n) − Σ_{k=1}^{p} a_k s(n − k) s(n) ).        (6)

The minimum error is actually the square of the gain used for the LPC-based vocal tract transfer function, if the excitation u(n) is normalized such that Σ_{n=−∞}^{∞} u²(n) = 1.
The normal equations can be solved by means of two methods: using the autocorrelation function or using the covariance function. The covariance method is more accurate, but induces a higher computational complexity. There exist efficient algorithms for determining LPCs using the autocorrelation method.

6.3 The Autocorrelation Method

The autocorrelation method for linear prediction analysis assumes that a signal is only nonzero over an interval of N_w samples, with N_w > p. That is, the signal was windowed prior to analysis. In the following, let us assume that that interval is [0, N_w − 1], and, in the notation of (1), let R(k) = R_n(k), with n = N_w − 1. When applying the autocorrelation function:

R(i) = Σ_{m=i}^{N_w−1} x(m) x(m − i),    i = 0, . . . , N_w − 1,

where x(m) = s(m) w(n − m), n = N_w − 1. Because x(m) is non-zero only for m ∈ [0, N_w − 1], and x(m − i) is non-zero for m ∈ [i, N_w + i − 1], the summation only needs to be carried out over the range [i, N_w − 1]. In this case, the normal equations (5), where we use the windowed signal, x(n), instead of s(n), become:

R(i) = Σ_{k=1}^{p} a_k R(i − k), and more generally

R(i) = Σ_{k=1}^{p} a_k R(|i − k|),    i = 1, . . . , p.

Note that in the above equation we took advantage of the property that the autocorrelation is an even function. The normal equations (5) can be written in matrix form as:

⎡ R(0)     R(1)    · · ·  R(p−1) ⎤ ⎡ a_1 ⎤   ⎡ R(1) ⎤
⎢ R(1)     R(0)    · · ·  R(p−2) ⎥ ⎢ a_2 ⎥ = ⎢ R(2) ⎥
⎢   ⋮        ⋮       ⋱      ⋮    ⎥ ⎢  ⋮  ⎥   ⎢  ⋮   ⎥
⎣ R(p−1)   R(p−2)  · · ·  R(0)   ⎦ ⎣ a_p ⎦   ⎣ R(p) ⎦

i.e., Ra = r. The matrix R is Toeplitz, meaning that it is symmetric with identical elements along each diagonal. There exist efficient algorithms to invert Toeplitz matrices, so the vector a can be determined as:

a = R⁻¹ r.

The minimum least-square error (6) is then calculated as:

E_min = R(0) − Σ_{k=1}^{p} a_k R(k).

Example: Consider the speech segment [3, 2, −1, 1] (note that we used N_w = 4). Find the 2nd-order vocal tract transfer function using linear predictive analysis.

R(i) = [15, 3, −1, 3].

For order p = 2:

R = [15, 3; 3, 15],   r = [3, −1]ᵀ,   a = R⁻¹ r = [2/9, −1/9]ᵀ,

E_min = R(0) − Σ_{k=1}^{2} a_k R(k) = 15 − (2/9)(3) − (−1/9)(−1) = 128/9 ≈ 14.22,

G = √E_min ≈ 3.77,

V(z) = 3.77 / (1 − (2/9) z^{−1} + (1/9) z^{−2}),

with poles at z = c_1 and z = c_1*, where c_1 = (1 + j√8)/9 ≈ 0.11 + j0.31. This corresponds to an estimated formant F_1 ≈ 0.196 F_s.
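The example can be reproduced with a short autocorrelation-method LPC routine; a sketch assuming NumPy and SciPy (Levinson-Durbin would be the usual fast Toeplitz solver, but a direct solve suffices here):

```python
import numpy as np
from scipy.linalg import toeplitz

def lpc_autocorr(x, p):
    """LPC via the autocorrelation method: solve R a = r for the a_k."""
    N = len(x)
    R = np.array([np.dot(x[i:], x[:N - i]) for i in range(p + 1)])
    a = np.linalg.solve(toeplitz(R[:p]), R[1:p + 1])
    Emin = R[0] - np.dot(a, R[1:p + 1])
    return a, np.sqrt(Emin)

a, G = lpc_autocorr(np.array([3., 2., -1., 1.]), p=2)
print(np.round(a, 4), round(G, 2))   # [ 0.2222 -0.1111] 3.77, i.e. a = [2/9, -1/9]
```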
The choice of the order p during LPC analysis should reflect the number of formants expected. Recall
that a pair of conjugate poles are required to create a formant. For speech sampled at 8 kHz, p is usually
chosen between 10 and 13. Figure 18 shows the effect of model order on LPC analysis. The top panel
shows the original spectrum of the steady-state vowel /i/. The bottom panels show the vocal tract transfer
function obtained from LPC analysis using orders p = 2, 10, and 25, respectively. Note that when p is
chosen to be too large, a number of spurious peaks may turn up.

Figure 18: The effect of model order on LPC analysis: The top panel shows the original spectrum of the steady-state
vowel /i/. The bottom panels show the vocal tract transfer function obtained from LPC analysis using orders p = 2, 10,
and 25, respectively.

6.4 The Covariance Method

In the autocorrelation method, it was assumed that the signal is nonzero only on the range [0, N_w − 1]. In the covariance method, however, we drop this assumption, and instead define the covariance function as:

φ(i, k) = Σ_{n=0}^{p} s(n − k) s(n − i),    0 ≤ i ≤ p, 0 ≤ k ≤ p.        (7)

Note that since the signal is not windowed, as in the autocorrelation case, the summation in (7) is always carried out on the range [0, p].
The normal equations for the covariance method become:

φ(i, 0) = Σ_{k=1}^{p} a_k φ(i, k), for 1 ≤ i ≤ p.

In matrix form, the normal equations can be expressed as:

⎡ φ(1,1)  φ(1,2)  · · ·  φ(1,p) ⎤ ⎡ a_1 ⎤   ⎡ φ(1,0) ⎤
⎢ φ(2,1)  φ(2,2)  · · ·  φ(2,p) ⎥ ⎢ a_2 ⎥ = ⎢ φ(2,0) ⎥
⎢   ⋮       ⋮       ⋱      ⋮    ⎥ ⎢  ⋮  ⎥   ⎢   ⋮    ⎥
⎣ φ(p,1)  φ(p,2)  · · ·  φ(p,p) ⎦ ⎣ a_p ⎦   ⎣ φ(p,0) ⎦

i.e., Φa = ψ. Note that the matrix Φ is symmetric but not necessarily Toeplitz. Thus it is generally more difficult to determine the inverse of Φ than of R. Nevertheless, the linear predictive coefficients can be solved as:

a = Φ⁻¹ ψ.

Furthermore, the minimum error is determined as:

E_min = φ(0, 0) − Σ_{k=1}^{p} a_k φ(0, k).

The covariance matrix can be interpreted as the error (4) being windowed, as opposed to the signal
being windowed (as in the autocorrelation case).
Example: Consider the speech segment [. . . , 3, 1, −2, 3, 2, −1, 1, 2, 0, 1, . . . ], where the fourth sample shown (of value 3) is the origin n = 0 (compare the samples from the origin onward with the previous example). Find the 2nd-order vocal tract transfer function using the covariance method of linear predictive analysis.
For order p = 2:

Φ = [17, −2; −2, 14],   ψ = [−2, −4]ᵀ,   a = Φ⁻¹ ψ = −(1/117) [18, 36]ᵀ,

E_min = φ(0,0) − Σ_{k=1}^{2} a_k φ(0,k) = 14 − (−18/117)(−2) − (−36/117)(−4) = 1458/117 ≈ 12.46,

G = √E_min ≈ 3.53,

V(z) = 3.53 / (1 + (18/117) z^{−1} + (36/117) z^{−2}),

with poles at z = c_1 and z = c_1*, where c_1 = (1/117)(−9 + j√4131) ≈ −0.077 + j0.55, corresponding to an estimated formant F_1 ≈ 0.272 F_s.
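A sketch of the covariance-method computation, assuming NumPy and following the definition in (7) and the worked example (the array below places the origin at index 3; the helper name is ours):

```python
import numpy as np

def lpc_covariance(s, origin, p):
    """LPC via the covariance method of (7): phi(i,k) summed over n in [0, p]."""
    def phi(i, k):
        return sum(s[origin + n - k] * s[origin + n - i] for n in range(p + 1))
    Phi = np.array([[phi(i, k) for k in range(1, p + 1)] for i in range(1, p + 1)])
    psi = np.array([phi(i, 0) for i in range(1, p + 1)])
    a = np.linalg.solve(Phi, psi)
    Emin = phi(0, 0) - np.dot(a, [phi(0, k) for k in range(1, p + 1)])
    return a, np.sqrt(Emin)

s = np.array([3., 1., -2., 3., 2., -1., 1., 2., 0., 1.])   # s[3] is n = 0
a, G = lpc_covariance(s, origin=3, p=2)
print(np.round(a, 4), round(G, 2))   # [-0.1538 -0.3077] 3.53, i.e. a = -[18, 36]/117
```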

Part II

Image Processing

7 Examples of Applications
Photography
Computer vision
Remote sensing: planetary exploration, environmental monitoring, geology, reconnaissance
Medicine: magnetic resonance imaging (MRI), positron emission tomography (PET), angiograms, digital radiography
Communications: videoconferencing, HDTV
Chemistry
Etc.

8 Image Processing vs. 1D DSP

Similarities:
Sampling
Filtering
Impulse response
Transforms

Differences:
Data size
Complexity (1D vs. 2D)
Conceptual challenges due to dimensionality
Interface with the human visual system

9 Overview

Introduction
2D linear system theory: math preliminaries; continuous 2D systems, including convolution and the FT; discrete systems
Image sampling
Image transforms: Fourier (2D discrete); discrete cosine transform (DCT); other (KL, etc.)
Image enhancement: histogram modification; edge detection; noise filtering
Image compression
Human visual perception vs. digital image processing

10 1D vs. 2D

In 1D, a linear, time-invariant (LTI) system satisfies:

If f(x) →LTI→ g(x),
then f(x − a) →LTI→ g(x − a),
and k_1 f_1(x) + k_2 f_2(x) →LTI→ k_1 g_1(x) + k_2 g_2(x).

The functions f(x) and g(x) are related by convolution with the impulse response h(x):

g(x) = ∫_{−∞}^{∞} f(u) h(x − u) du = (f ∗ h)(x).

Examples (sketches): convolving f(x) with h(x) = δ(x) returns g(x) = f(x); convolving with h(x) = δ(x − a) returns the shifted copy g(x) = f(x − a). [Further sketches of convolutions with non-impulsive h(x) are omitted.]

Recall that

rect(x) = 1 for |x| < 1/2, and 0 for |x| > 1/2,

and

δ(x) = lim_{ε→0} (1/ε) rect(x/ε),    δ(x) = 0 for x ≠ 0,    ∫_{−∞}^{∞} δ(x) dx = 1,    δ(ax) = (1/|a|) δ(x).

The 2D impulse is defined

δ(x, y) = lim_{ε→0} (1/ε²) rect(x/ε) rect(y/ε) = δ(x) δ(y),

with δ(x, y) = 0 for (x, y) ≠ (0, 0), and

∬ δ(x, y) dx dy = 1.

Note that

f(x, y) ∗ δ(x − a, y − b) = f(x − a, y − b).

A 2D linear, shift-invariant (LSI) system satisfies:

If f(x, y) →LSI→ g(x, y),
then f(x − a, y − b) →LSI→ g(x − a, y − b),
and k_1 f_1(x, y) + k_2 f_2(x, y) →LSI→ k_1 g_1(x, y) + k_2 g_2(x, y).

An LSI system can be completely defined by its impulse response:

δ(x, y) →LSI→ h(x, y).

Example: Astronomical imaging: the actual image of a star resembles a δ(x − a, y − b) and the resulting image is h(x − a, y − b). Several stars produce similar images.
In general, the output of a 2D LSI system can be understood as the superposition of shifted and scaled versions of h(x, y):

g(x, y) = ∬ f(x′, y′) h(x − x′, y − y′) dx′ dy′ = (f ∗ h)(x, y) = (h ∗ f)(x, y).
Examples (sketches; a dot represents a delta): convolving f(x, y) = δ(x − a, y) with h(x, y) = δ(x − b, y) yields g(x, y) = δ(x − a − b, y). [A second sketch involving several impulses is omitted.]

11 The Fourier Transform

The FT is probably the single most important mathematical tool in image processing. We will review the 1D FT and then study the 2D FT.
Although image processing generally involves 2D discrete functions, it uses many techniques modeled on analog processes. A good understanding of 1D and 2D continuous FT theory is essential to image processing.
Given a 1D continuous function f(x), the FT and inverse FT (IFT) are defined as follows:

F(u) := ∫_{−∞}^{∞} f(x) e^{−j2πux} dx = F[f](u),        (8a)

f(x) := ∫_{−∞}^{∞} F(u) e^{j2πux} du = F⁻¹[F](x).        (8b)

The units of u are the inverse of the units of x, i.e., if x is measured in seconds, then u is measured in hertz; if x is in meters, u is in cycles per meter (spatial frequency).
F(u) is a decomposition of f(x) into its component frequencies.
Example:

F(u) = (1/2) δ(u + a) + (1/2) δ(u − a).

Using the sifting property of the delta function:

∫_{−∞}^{∞} δ(x − x_0) f(x) dx = f(x_0),

we obtain

f(x) = ∫_{−∞}^{∞} [ (1/2) δ(u + a) + (1/2) δ(u − a) ] e^{j2πux} du = (1/2) e^{−j2πax} + (1/2) e^{j2πax} = cos(2πax).
2
2

In general, the 1D FT provides the decomposition of f(x) in terms of cosines of the form A cos(2πf x + φ). The set of all cosines, at all combinations of A, f, and φ, provides a complete basis to build 1D functions.
Some important transforms:

δ(x) ↔ 1,
b rect(x/a) ↔ b|a| sinc(au),
e^{−πx²} ↔ e^{−πu²},

where

sinc(x) := sin(πx) / (πx)

is the normalized cardinal sine function.


The Convolution Theorem states that if

g(x) = (f ∗ h)(x) = ∫_{−∞}^{∞} f(z) h(x − z) dz = ∫_{−∞}^{∞} h(z) f(x − z) dz,

then

G(u) = F(u) H(u),

where G(u), F(u), and H(u) are the FTs of g(x), f(x), and h(x), respectively.

The 2D FT of f(x, y) expresses f(x, y) as a sum of 2D sinusoids A cos(ax + by + φ). Each sinusoid has spatial frequency ρ = √(a² + b²), period 1/ρ, and its contours of constant amplitude make an angle θ = tan⁻¹(a/b) with respect to the x-axis.
Given a 2D continuous function f(x, y):

F(u, v) := ∬ f(x, y) e^{−j2π(ux+vy)} dx dy = F[f](u, v),        (9a)

f(x, y) := ∬ F(u, v) e^{j2π(ux+vy)} du dv = F⁻¹[F](x, y).        (9b)

The 2D FT expresses f(x, y) in terms of a combination of 2D sinusoidal corrugations.


Example:

F(u, v) = (1/2) δ(u + a, v) + (1/2) δ(u − a, v)  ⟹  f(x, y) = cos(2πax).

In this case, f(x, y) depends only on x and the corrugations are parallel with the y-axis.
A rotation of f(x, y) causes a corresponding rotation of F(u, v).

Figure 19: Illustration of the Fourier transform of a sinusoidal corrugation.

The set of all sinusoidal corrugations (at all amplitudes, frequencies, phase shifts, and rotations) constitutes a complete basis for any function f(x, y). The 2D FT expresses f(x, y) as a linear combination of such corrugations.
Theorems for the 2D FT are generally analogous to their 1D counterparts:

              1D: f(x) ↔ F(u)                        2D: f(x, y) ↔ F(u, v)
Shift:        f(x − a) ↔ e^{−j2πau} F(u)             f(x − a, y − b) ↔ e^{−j2π(au+bv)} F(u, v)
Stretch:      f(ax) ↔ (1/|a|) F(u/a)                 f(ax, by) ↔ (1/|ab|) F(u/a, v/b)
Convolution:  (f ∗ g)(x) ↔ F(u) G(u)                 (f ∗ g)(x, y) ↔ F(u, v) G(u, v)
Correlation:  r_fg(x) ↔ F(u) G*(u)                   r_fg(x, y) ↔ F(u, v) G*(u, v)
Conjugation:  f*(x) ↔ F*(−u)                         f*(x, y) ↔ F*(−u, −v)
              f(−x) ↔ F(−u)                          f(−x, −y) ↔ F(−u, −v)
              f*(−x) ↔ F*(u)                         f*(−x, −y) ↔ F*(u, v)

The cross-correlation between f and g is defined as

r_fg(x) := ∫_{−∞}^{∞} f(u) g*(u − x) du.

Note from the table above that

F[r_fg](u, v) = (F[r_gf](u, v))*,

which shows, upon IFT, that

r_gf(x, y) = r_fg*(−x, −y).


Figure 20: Examples of 2D convolutions.

11.1 FT of Separable Functions

If f(x, y) = f_1(x) f_2(y) then the FT can be written

F(u, v) = ∬ f_1(x) e^{−j2πux} f_2(y) e^{−j2πvy} dx dy = F_1(u) F_2(v),

where F_1(u) = F[f_1](u) and F_2(v) = F[f_2](v). Therefore, if f(x, y) is separable, so is F(u, v).
Example:

f(x, y) = 1 for |x| < W_x/2 and |y| < W_y/2, and 0 elsewhere.

Using the rect() function

rect(x) = 1 for |x| < 1/2, and 0 elsewhere,

we have that

f(x, y) = rect(x/W_x) rect(y/W_y).

We also know that

F[rect(x/W_x)](u) = W_x sinc(W_x u),
F[rect(y/W_y)](v) = W_y sinc(W_y v).

Therefore

F(u, v) = W_x W_y sinc(W_x u) sinc(W_y v).
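The separability of the FT carries over to the discrete case; a quick numerical check assuming NumPy:

```python
import numpy as np

f1, f2 = np.random.randn(8), np.random.randn(8)
f = np.outer(f1, f2)                       # separable f(m, n) = f1(m) f2(n)
# The 2D DFT factors into the outer product of the 1D DFTs:
print(np.allclose(np.fft.fft2(f),
                  np.outer(np.fft.fft(f1), np.fft.fft(f2))))   # True
```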

12 Discrete Signals

A general 2D discrete signal (or sequence) has the form f(m, n), where m and n are integers. The region of the (m, n) plane where a function f(m, n) can take on non-zero values is the region of support of f(m, n). For example, a photograph has a region of support determined by the dimensions of the photo and the sampling rate.
Examples: 2D impulse

δ(m, n) = 1 for m = n = 0, and 0 otherwise.

Line impulse

δ_T(m) = 1 for m = 0, and 0 otherwise.

We have that δ(m, n) = δ_T(m) δ_T(n).
Step functions

u(m, n) = 1 for m, n ≥ 0, and 0 otherwise,

and

u_T(m) = 1 for m ≥ 0, and 0 otherwise.

Figure 21: Examples of a 2D separable function.

We have that u(m, n) = u_T(m) u_T(n).

A discrete sequence f(m, n) is separable if f(m, n) = x_1(m) x_2(n). Both δ(m, n) and u(m, n) are separable.


Figure 22: Examples of a 2D non-separable function.

12.1 Periodicity

A sequence f(m, n) is periodic if there exist integers M and N such that

f(m, n) = f(m + M, n + N)

for all m, n.

12.2 Input-Output

A 2D system produces an output sequence g(m, n) from an input sequence f(m, n):

g(m, n) = T[f(m, n)].

Linearity:
T[k_1 f_1(m, n) + k_2 f_2(m, n)] = k_1 g_1(m, n) + k_2 g_2(m, n).

Shift-invariance:
If T[f(m, n)] = g(m, n), then T[f(m − m_1, n − n_1)] = g(m − m_1, n − n_1).

The response h(m, n) of a 2D LSI system to the input δ(m, n) is the impulse response. The function |h(m, n)|² is sometimes called the point spread function.

12.3 2D Discrete Convolution (Linear Convolution)

Aperiodic signal case:

g(m, n) = (f ∗ h)(m, n) = Σ_{k_1=−∞}^{∞} Σ_{k_2=−∞}^{∞} h(k_1, k_2) f(m − k_1, n − k_2).

Suppose h(m, n) = δ(m, n); then g(m, n) = (f ∗ δ)(m, n) = f(m, n). In general, for integers c and d:

f(m, n) ∗ δ(m − c, n − d) = f(m − c, n − d).

Figure 23: Examples of 2D discrete convolutions.


A general sequence f(m, n) can be expressed as a sum of weighted, shifted impulses:

f(m, n) = f(0, 0) δ(m, n) + f(1, 0) δ(m − 1, n) + f(0, 1) δ(m, n − 1) + · · ·
= Σ_{k_1=−∞}^{∞} Σ_{k_2=−∞}^{∞} f(k_1, k_2) δ(m − k_1, n − k_2).

Therefore the output of an LSI system can be written as

g(m, n) = f(0, 0) h(m, n) + f(1, 0) h(m − 1, n) + f(0, 1) h(m, n − 1) + · · ·
= Σ_{k_1=−∞}^{∞} Σ_{k_2=−∞}^{∞} f(k_1, k_2) h(m − k_1, n − k_2) = (f ∗ h)(m, n).

As with 1D, 2D convolution is commutative, associative, and distributive.
An LSI system is separable if h(m, n) is separable. A separable system's response can be computed with 1D convolutions and requires fewer multiplications to calculate, as the sketch below illustrates.
An LSI system is stable if every bounded input produces a bounded output.
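A sketch of the separable-system shortcut, assuming NumPy and SciPy: a separable 2D convolution computed as two passes of 1D convolutions matches the direct 2D result.

```python
import numpy as np
from scipy.signal import convolve, convolve2d

f = np.random.randn(16, 16)                         # input image
h1, h2 = np.array([1., 2., 1.]), np.array([1., 0., -1.])
h = np.outer(h1, h2)                                # h(m, n) = h1(m) h2(n)

direct = convolve2d(f, h)                           # direct 2D convolution
rows = np.apply_along_axis(lambda r: convolve(r, h2), 1, f)    # 1D along rows
sep = np.apply_along_axis(lambda c: convolve(c, h1), 0, rows)  # then columns
print(np.allclose(direct, sep))                     # True
```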

12.4 Cross-correlation

For aperiodic signals

r_fg(m, n) := Σ_{k_1=−∞}^{∞} Σ_{k_2=−∞}^{∞} f(k_1, k_2) g*(k_1 − m, k_2 − n) = f(m, n) ∗ g*(−m, −n).

Figure 24: Examples of 2D discrete cross-correlations.

12.5 2D DTFT

Figure 25: 2D convolutions vs. cross-correlations.

For a stable aperiodic sequence with

Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} |u(m, n)| < ∞,

the 2D DTFT is given by

V(ω_1, ω_2) = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} u(m, n) e^{−jmω_1} e^{−jnω_2},        (10a)

u(m, n) = (1/(2π)²) ∫_{−π}^{π} ∫_{−π}^{π} V(ω_1, ω_2) e^{jmω_1} e^{jnω_2} dω_1 dω_2.        (10b)

It is clear that V(ω_1 + 2πk, ω_2 + 2πℓ) = V(ω_1, ω_2). In image processing, the 2D DTFT is approximated using the 2D DFT. All properties and theorems for the 1D DTFT hold for the 2D DTFT with the appropriate modifications.
Example:

(f ∗ h)(m, n) ↔ F(ω_1, ω_2) H(ω_1, ω_2).

Example: Real f(m, n) produces a Hermitian F(ω_1, ω_2) = F*(−ω_1, −ω_2).
Example:

u(m, n) = (1/4) δ(m − 1, n) + (1/4) δ(m + 1, n) + (1/8) δ(m, n − 1) + (1/8) δ(m, n + 1)

has 2D DTFT

V(ω_1, ω_2) = (1/4) e^{−jω_1} + (1/4) e^{jω_1} + (1/8) e^{−jω_2} + (1/8) e^{jω_2} = (1/2) cos(ω_1) + (1/4) cos(ω_2).

13 Image Sampling

Recall that in 1D a comb function (pulse train) is its own transform:

Σ_{n=−∞}^{∞} δ(x − n) ↔ Σ_{k=−∞}^{∞} δ(u − k),

Σ_{n=−∞}^{∞} δ(x − nT) ↔ (1/T) Σ_{k=−∞}^{∞} δ(u − k/T).

A 2D comb function, or bed of nails, is also self-transforming.
Sampling can be described as follows:

Space domain: multiplication of f(x, y) by a bed of nails with spacing of Δx in the x dimension and Δy in the y dimension.
Frequency domain: convolution of F(u, v) with a bed of nails with spacing of 1/Δx in the u dimension and 1/Δy in the v dimension.

A bandlimited 2D function can be completely reconstructed from its samples provided that the Nyquist rates are satisfied. A 2D function is bandlimited if F_a(u, v) is zero outside a bounded region in the (u, v) plane:

F_a(u, v) = 0, |u| > U, |v| > V.

As with the 1D case, sampling replicates the spectrum periodically at periods given by the reciprocals of the sampling intervals, Δx and Δy. Note that Δx and Δy have units of meters. The sampled image spectrum is given by

F_s(u, v) = (1/(ΔxΔy)) Σ_{k=−∞}^{∞} Σ_{ℓ=−∞}^{∞} F_a(u − k/Δx, v − ℓ/Δy).

The spectrum is replicated in both the u and v dimensions, at spacings that are multiples of 1/Δx and 1/Δy, respectively.
Reconstruction is accomplished by using an ideal 2D LPF:

H_LPF(u, v) = ΔxΔy for |u| < 1/(2Δx) and |v| < 1/(2Δy), and 0 otherwise.

Recall that

A rect(x/W) ↔ AW sinc(Wu).

The inverse FT of H_LPF(u, v) gives

h_LPF(x, y) = sinc(x/Δx) sinc(y/Δy).

Therefore

f_a(x, y) = Σ_{k=−∞}^{∞} Σ_{ℓ=−∞}^{∞} f_s(kΔx, ℓΔy) sinc(x/Δx − k) sinc(y/Δy − ℓ).

Reconstruction is less important in image processing because the image is often left in discrete form.

Figure 26: Bed of nails (N = 8).


Figure 27: Bed of nails (N = 4).

14 Image Transforms

Topics:
1. 1D DFT
2. Concept of unitary transforms
3. 2D DFT
4. DCT

14.1 1D Unitary DFT

The DFT is used for spectral analysis, filtering, convolution, etc. It can be viewed as:
1) a transform from time (1D) / space (2D) to frequency,
2) an expansion of a function into orthogonal basis functions,
3) a coordinate rotation.

Let u(n) ↔ V(k) denote the (unitary) DFT pair:

V(k) = (1/√N) Σ_{n=0}^{N−1} u(n) e^{−j2πkn/N},        (11a)

u(n) = (1/√N) Σ_{k=0}^{N−1} V(k) e^{j2πkn/N}.        (11b)

The DFT can be expressed as a matrix operation

$$V = A u$$

where
u is the input (column vector, size $N \times 1$),
A is an $N \times N$ transform matrix,
V is the output (column vector, size $N \times 1$).

The columns of $A^H$ are the basis vectors of the expansion.

$$\begin{bmatrix} V(0) \\ V(1) \\ \vdots \\ V(N-1) \end{bmatrix} = \frac{1}{\sqrt{N}} \begin{bmatrix} e^{-j\frac{2\pi}{N} 0 \cdot 0} & e^{-j\frac{2\pi}{N} 0 \cdot 1} & \cdots & e^{-j\frac{2\pi}{N} 0 (N-1)} \\ e^{-j\frac{2\pi}{N} 1 \cdot 0} & e^{-j\frac{2\pi}{N} 1 \cdot 1} & \cdots & e^{-j\frac{2\pi}{N} 1 (N-1)} \\ \vdots & \vdots & \ddots & \vdots \\ e^{-j\frac{2\pi}{N} (N-1) 0} & e^{-j\frac{2\pi}{N} (N-1) 1} & \cdots & e^{-j\frac{2\pi}{N} (N-1)(N-1)} \end{bmatrix} \begin{bmatrix} u(0) \\ u(1) \\ \vdots \\ u(N-1) \end{bmatrix}$$

Example: N = 4

$$\begin{bmatrix} V(0) \\ V(1) \\ V(2) \\ V(3) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{bmatrix} \begin{bmatrix} u(0) \\ u(1) \\ u(2) \\ u(3) \end{bmatrix}$$


Orthogonality
Basis vectors of the DFT are orthogonal. Let $a_k$ denote the $k$th basis vector, $0 \le k \le N-1$.
Example: $N = 4$,
$$a_1 = \frac{1}{2}\begin{bmatrix} 1 \\ j \\ -1 \\ -j \end{bmatrix}, \qquad a_1(0) = \frac{1}{2},\; a_1(1) = \frac{j}{2},\; a_1(2) = -\frac{1}{2},\; a_1(3) = -\frac{j}{2}.$$
Orthogonality:
$$\sum_{n=0}^{N-1} a_k(n)\, a_l^*(n) = \begin{cases} 1, & k = l \\ 0, & k \neq l \end{cases}$$
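A short numpy sketch (variable names are illustrative) that builds the unitary DFT matrix and verifies that its basis vectors are orthonormal:

import numpy as np

N = 4
k = np.arange(N).reshape(-1, 1)
n = np.arange(N).reshape(1, -1)
A = np.exp(-2j * np.pi * k * n / N) / np.sqrt(N)   # A[k, n] = e^{-j 2 pi k n / N} / sqrt(N)

# Unitarity: A A^H = I, i.e., the columns of A^H are orthonormal.
print(np.allclose(A @ A.conj().T, np.eye(N)))      # True

a1 = A.conj().T[:, 1]                              # basis vector a_1
print(np.round(2 * a1, 3))                         # (up to rounding) [1, j, -1, -j]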

Recall from EE113 that if an input $u(n)$, $0 \le n \le N-1$, has exactly one non-zero element, its transform $V(k)$ has the form
$$V(k) = \frac{A}{\sqrt{N}}\, e^{-j2\pi n_0 k / N}$$
where $n_0$ is the location of the non-zero term and $A$ is its value, i.e., $u(n) = A\,\delta(n - n_0)$.
$V(k)$ is a single, rotating phasor with amplitude $A/\sqrt{N}$, rotating $\Delta\theta = -2\pi n_0 / N$ radians each time $k$ is incremented.

Example 1:
(a) $u(n) = [1, 0, 0, 0]$: $A = 1$, $n_0 = 0 \Rightarrow \Delta\theta = 0$,
$$V(k) = \tfrac{1}{2}[1, 1, 1, 1].$$
(b) $u(n) = [0, 1, 0, 0]$: $A = 1$, $n_0 = 1 \Rightarrow \Delta\theta = -\tfrac{2\pi}{4} = -\tfrac{\pi}{2}$,
$$V(k) = \tfrac{1}{2}[1, -j, -1, j].$$
(c) $u(n) = [0, 0, 0, j]$: $A = j$, $n_0 = 3 \Rightarrow \Delta\theta = -\tfrac{3\pi}{2} \equiv +\tfrac{\pi}{2}$,
$$V(k) = \tfrac{j}{2}[1, j, -1, -j] = \tfrac{1}{2}[j, -1, -j, 1].$$

The inverse DFT uses an identical procedure, except that $\Delta\theta = +2\pi k_0 / N$.
Example 2: $V(k) = [0, 0, 1, 0, 0, 0, 0, 0]$: $\Delta\theta = 2\pi \cdot \tfrac{2}{8} = \tfrac{\pi}{2}$,
$$u(n) = \tfrac{1}{\sqrt{8}}[1, j, -1, -j, 1, j, -1, -j].$$

Any DFT can be found using linear combinations of these techniques.
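These examples can be verified numerically: numpy's fft/ifft with norm="ortho" implement exactly the unitary convention used here.

import numpy as np

# Example 1(b): a single non-zero sample at n0 = 1.
V = np.fft.fft([0, 1, 0, 0], norm="ortho")
print(np.round(V, 3))               # (1/2)[1, -j, -1, j]

# Example 2: inverse DFT of V(k) = delta(k - 2), N = 8.
V = np.zeros(8); V[2] = 1
u = np.fft.ifft(V, norm="ortho")    # (1/sqrt(8))[1, j, -1, -j, 1, j, -1, -j]
print(np.round(np.sqrt(8) * u, 3))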


Example 3:
(a)
$$u(n) = [0,\, 1,\, 0,\, 2-j] \;\longrightarrow\; V(k) = \tfrac{1}{2}[1, -j, -1, j] + \tfrac{2-j}{2}[1, j, -1, -j]$$
$$= \tfrac{1}{2}[1, -j, -1, j] + \tfrac{1}{2}[2-j,\; 1+2j,\; -2+j,\; -1-2j] = \tfrac{1}{2}[3-j,\; 1+j,\; -3+j,\; -1-j].$$
(b) $V(k) = [0, 0, 0, 0, 1, 0, 0, 0]$: $\Delta\theta = 2\pi \cdot \tfrac{4}{8} = \pi$,
$$u(n) = \tfrac{1}{\sqrt{8}}[1, -1, 1, -1, 1, -1, 1, -1].$$
(c)
$$V(k) = \tfrac{1}{\sqrt{8}}[j, 1, -j, -1, j, 1, -j, -1]$$
is the forward DFT of $j\,\delta(n - n_0)$, where $\Delta\theta = -\tfrac{2\pi n_0}{8} = -\tfrac{\pi}{2} \Rightarrow n_0 = 2$:
$$u(n) = [0, 0, j, 0, 0, 0, 0, 0].$$
(d)
$$V(k) = \tfrac{1}{2}[j, -1, -j, 1]$$
is the forward DFT of $j\,\delta(n - n_0)$, where $\Delta\theta = -\tfrac{2\pi n_0}{4} = +\tfrac{\pi}{2} \Rightarrow n_0 = -1$. Since $u(n)$ (and $V(k)$) have period 4, $n_0 = -1$ is equivalent to $n_0 = -1 + 4 = 3$:
$$u(n) = [0, 0, 0, j].$$


Example 4: With $N = 4$, let $u(n) = [4\; 1\; 0\; 1]$. Decomposing $u(n)$ into impulses and transforming each:

$$[4, 0, 0, 0] \rightarrow \tfrac{1}{2}[4, 4, 4, 4], \qquad [0, 1, 0, 0] \rightarrow \tfrac{1}{2}[1, -j, -1, j], \qquad [0, 0, 0, 1] \rightarrow \tfrac{1}{2}[1, j, -1, -j],$$

so

$$[4, 1, 0, 1] \rightarrow \tfrac{1}{2}[6, 4, 2, 4] = [3, 2, 1, 2].$$

Using matrix multiplication:

$$V = A u = \frac{1}{2}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{bmatrix} \begin{bmatrix} 4 \\ 1 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ 1 \\ 2 \end{bmatrix}.$$

$V$ is an expansion of $u$ in terms of the $N$ basis vectors given by the conjugate columns of $A$:

$$u = 3 \cdot \frac{1}{2}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} + 2 \cdot \frac{1}{2}\begin{bmatrix} 1 \\ j \\ -1 \\ -j \end{bmatrix} + 1 \cdot \frac{1}{2}\begin{bmatrix} 1 \\ -1 \\ 1 \\ -1 \end{bmatrix} + 2 \cdot \frac{1}{2}\begin{bmatrix} 1 \\ -j \\ -1 \\ j \end{bmatrix} = \begin{bmatrix} 4 \\ 1 \\ 0 \\ 1 \end{bmatrix}.$$

Example 5: With $N = 4$, let $u(n) = [3\; 1\; 0\; 2]$.
$$V(0) = \tfrac{1}{2}[u(0) + u(1) + u(2) + u(3)] = 3$$
$$V(1) = \tfrac{1}{2}[u(0) - j u(1) - u(2) + j u(3)] = \tfrac{3}{2} + j\tfrac{1}{2}$$
$$V(2) = 0$$
$$V(3) = \tfrac{3}{2} - j\tfrac{1}{2} = V^*(1).$$

Result: $V(k) = \left[3,\;\; \tfrac{3}{2} + j\tfrac{1}{2},\;\; 0,\;\; \tfrac{3}{2} - j\tfrac{1}{2}\right]$. $V(k)$ can be viewed as an expansion of $u(n)$ into a basis given by the columns of $A^H$:

$$u(n) = 3 \cdot \frac{1}{2}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} + \left(\tfrac{3}{2} + j\tfrac{1}{2}\right) \frac{1}{2}\begin{bmatrix} 1 \\ j \\ -1 \\ -j \end{bmatrix} + \left(\tfrac{3}{2} - j\tfrac{1}{2}\right) \frac{1}{2}\begin{bmatrix} 1 \\ -j \\ -1 \\ j \end{bmatrix} = \begin{bmatrix} 3 \\ 1 \\ 0 \\ 2 \end{bmatrix}.$$

Recall: if $u(n)$ is real, $V(k)$ has Hermitian symmetry. Since $u(n)$ and $V(k)$ are periodic,
$$V^*(k) = V(-k) = V(N-k), \qquad V^*(1) = V(4-1) = V(3).$$
Symmetry of discrete, periodic functions:
Even: $f(n) = f(N-n)$
Odd: $f(n) = -f(N-n)$
Symmetry properties:
$u(n)$ real, even $\leftrightarrow$ $V(k)$ real, even
$u(n)$ real, odd $\leftrightarrow$ $V(k)$ imaginary, odd
(Others follow from the above.)

14.2 Periodicity and Indexing

Note that the DFT definition implies that both $u(n)$ and $V(k)$ are periodic with period $N$:

$$u(n+N) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} V(k)\, e^{j2\pi \frac{(n+N)k}{N}} = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} V(k)\, e^{j2\pi \frac{nk}{N}}\, e^{j2\pi k} = u(n).$$

In general, $u(n) = u(n + \ell N)$ and $V(k) = V(k + \ell N)$, for $\ell$ any integer.

Example:
$$u(n) = [0\; 1\; 2\; 3] \;\Longrightarrow\; u(n-1) = [3\; 0\; 1\; 2].$$
Although any window of length $N$ is sufficient, by convention we use $0 \le n \le N-1$. This places the origin on the left, not the center. Make sure you define vectors this way every time you use functions such as fft in Matlab.

14.3 DFT Properties

Parseval:
$$\sum_{n=0}^{N-1} |u(n)|^2 = \sum_{k=0}^{N-1} |V(k)|^2.$$

Reversal: if $u(n) \leftrightarrow V(k)$, then $u(-n) \leftrightarrow V(-k)$. Note that if
$$u(n) = [a\;\; b\;\; c\;\; d\;\; e\;\; f\;\; g\;\; h]$$
then
$$u(-n) = [a\;\; h\;\; g\;\; f\;\; e\;\; d\;\; c\;\; b].$$

Circular convolution:
$$(u_1 \circledast u_2)(n) \leftrightarrow V_1(k)\, V_2(k).$$

Shift: if $u(n) \leftrightarrow V(k)$, then
$$u(n - n_0) \leftrightarrow e^{-j2\pi n_0 k / N}\, V(k).$$

Example:
$$u(n) = [3\;\; 1\;\; 0\;\; 2] \leftrightarrow V(k) = \left[3,\;\; \tfrac{3}{2} + j\tfrac{1}{2},\;\; 0,\;\; \tfrac{3}{2} - j\tfrac{1}{2}\right]$$
then
$$u(n-1) = [2\;\; 3\;\; 1\;\; 0] \leftrightarrow e^{-j2\pi k/4}\, V(k) = \left[3,\;\; \tfrac{1}{2} - j\tfrac{3}{2},\;\; 0,\;\; \tfrac{1}{2} + j\tfrac{3}{2}\right].$$

Shifting $u(n)$ imposes a progressive phase term on $V(k)$.

15 Unitary Transformations

The general form for a 1D unitary transformation is

$$V(k) = \sum_{n=0}^{N-1} u(n)\, a(k,n) \qquad (12a)$$
$$u(n) = \sum_{k=0}^{N-1} V(k)\, a^*(k,n). \qquad (12b)$$

In matrix form,
$$V = A u,$$
where $A$ is unitary, i.e.,
$$A^{-1} = A^H = (A^*)^T.$$

A unitary transform $V = Au$ provides an expansion of $u$ in terms of basis vectors obtained as the columns of $A^H$. The coefficients of the expansion are given by the elements of $V$.
Example: DFT with N = 4

$$A = \frac{1}{2}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{bmatrix}.$$

The four basis vectors are

$$a_0 = \frac{1}{2}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \quad a_1 = \frac{1}{2}\begin{bmatrix} 1 \\ j \\ -1 \\ -j \end{bmatrix}, \quad a_2 = \frac{1}{2}\begin{bmatrix} 1 \\ -1 \\ 1 \\ -1 \end{bmatrix}, \quad a_3 = \frac{1}{2}\begin{bmatrix} 1 \\ -j \\ -1 \\ j \end{bmatrix}.$$

For $u = [3\;\; 1\;\; 0\;\; 2]^T$,

$$V = \begin{bmatrix} 3 \\ \tfrac{3}{2} + j\tfrac{1}{2} \\ 0 \\ \tfrac{3}{2} - j\tfrac{1}{2} \end{bmatrix}, \qquad u = 3 a_0 + \left(\tfrac{3}{2} + j\tfrac{1}{2}\right) a_1 + \left(\tfrac{3}{2} - j\tfrac{1}{2}\right) a_3.$$

A 2D unitary transform pair is given by

$$V(k,l) = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} u(m,n)\, a(k,l,m,n) \qquad (13a)$$
$$u(m,n) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} V(k,l)\, a^*(k,l,m,n). \qquad (13b)$$

The term $a(k,l,m,n)$ is called the kernel of the transformation. If $a(k,l,m,n)$ can be factored into $a(k,l,m,n) = b(k,m)\, c(l,n)$, the 2D transform is called separable: one can transform all the rows and then all the columns, or vice versa. Among the unitary transformations are the DFT, DCT, Hadamard, Haar, and Karhunen-Loève transforms.
Letting $b(k,m)$ be the elements of $B$ and $c(l,n)$ the elements of $C$, a separable 2D unitary transform can be expressed as
$$V = B u C^T,$$
where $V$ and $u$ are $N \times N$ matrices, and so are $B$ and $C$. If $C = B = A$, then
$$V = A u A^T.$$
This expresses $u$ in terms of the basis images $A_{kl}$:
$$u = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} V(k,l)\, A_{kl},$$
where
$$A_{kl} := a_k a_l^T,$$
and $a_k$ is the $k$th column of $A^H$.
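A short numpy sketch of the separable form $V = A u A^T$, using the unitary DFT as $A$ (the fft2 comparison reflects numpy's non-normalized convention):

import numpy as np

N = 4
idx = np.arange(N)
A = np.exp(-2j * np.pi * np.outer(idx, idx) / N) / np.sqrt(N)   # unitary DFT matrix

u = np.random.rand(N, N)
V = A @ u @ A.T                                # separable 2D unitary transform
print(np.allclose(V, np.fft.fft2(u) / N))      # True: same up to the 1/N scale

# Expansion of u in the basis images A_kl = a_k a_l^T, a_k = columns of A^H.
AH = A.conj().T
u_rec = sum(V[k, l] * np.outer(AH[:, k], AH[:, l])
            for k in range(N) for l in range(N))
print(np.allclose(u_rec, u))                   # True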

16 2D Unitary DFT

$$V(k,l) = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} u(m,n)\, a(k,l,m,n) \qquad (14a)$$
$$u(m,n) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} V(k,l)\, a^*(k,l,m,n), \qquad (14b)$$

where the kernel of the transformation is

$$a(k,l,m,n) = \frac{1}{N}\, e^{-j2\pi \frac{mk + nl}{N}}.$$

Since the kernel is factorable, the 2D DFT is separable, i.e., it can be implemented by 1D transforms of rows, followed by 1D transforms of columns (or vice versa).
Example:
$$u(m,n) = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.$$
1D transform of columns:
$$\begin{bmatrix} 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$
1D transform of rows:
$$V = \begin{bmatrix} 1 & -j & -1 & j \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

Example of 2D DFTs. 2D DFT, N = 4:

$$A = \frac{1}{2}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{bmatrix}, \qquad A^H = \frac{1}{2}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & j & -1 & -j \\ 1 & -1 & 1 & -1 \\ 1 & -j & -1 & j \end{bmatrix},$$

and

$$a_0 = \frac{1}{2}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \qquad a_1 = \frac{1}{2}\begin{bmatrix} 1 \\ j \\ -1 \\ -j \end{bmatrix}, \qquad \text{etc.}$$

Basis images (or basis maps):

$$A_{00} = a_0 a_0^T = \frac{1}{4}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}, \qquad \text{etc.}$$

Example:

$$u = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \;\longrightarrow\; V = \begin{bmatrix} 1 & -j & -1 & j \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

This means that we can express $u$ as follows:

$$u = (1)A_{00} + (-j)A_{01} + (-1)A_{02} + (j)A_{03}.$$

16.1 2D Periodic (Cyclic) Convolution

$$(u_1 \circledast u_2)(m,n) \leftrightarrow V_1(k,l)\, V_2(k,l).$$

For instance, $u_1(m,n)$ may be an image and $u_2(m,n)$ the impulse response of a digital filter, such as a low-pass filter. The filtering operation can be applied in the frequency domain by simply multiplying $V_1(k,l)$ and $V_2(k,l)$ element by element, and then inverse transforming.
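Since numpy's fft2 follows the non-normalized DFT convention, the cyclic-convolution property holds with no extra scale factor; a sketch with an illustrative low-pass impulse response:

import numpy as np

u1 = np.random.rand(8, 8)                          # "image"
u2 = np.zeros((8, 8))
u2[0, 0], u2[0, 1], u2[1, 0] = 0.5, 0.25, 0.25     # crude low-pass filter

y = np.real(np.fft.ifft2(np.fft.fft2(u1) * np.fft.fft2(u2)))

# Direct check of the cyclic convolution sum at one output sample.
m, n = 3, 5
s = sum(u2[k1, k2] * u1[(m - k1) % 8, (n - k2) % 8]
        for k1 in range(8) for k2 in range(8))
print(np.isclose(y[m, n], s))                      # True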

17 The Unitary Discrete Cosine Transform (DCT)

The DCT is a real transform, i.e., a real $u(n)$ produces a real $V(k)$. It can be obtained using a DFT, and most fast DCT algorithms use FFTs. Of all unitary transforms, only the DFT is used more than the DCT.
For a 1D sequence $u(n)$, $0 \le n \le N-1$, the DCT $V(k)$, $0 \le k \le N-1$, is given by $V = Cu$, where

$$C(k,n) = \begin{cases} \frac{1}{\sqrt{N}}, & k = 0,\; 0 \le n \le N-1, \\ \sqrt{\frac{2}{N}} \cos\left(\frac{\pi(2n+1)k}{2N}\right), & 1 \le k \le N-1,\; 0 \le n \le N-1. \end{cases}$$

As with the DFT, the DCT basis vectors are sinusoidal. The DCT is best understood in terms of a DFT of a sequence of length $2N$ derived from $u(n)$.
The DFT implies a periodicity $N$ in the sequence $u(n)$. This periodicity can result in discontinuities that do not reflect the nature of the signal $u(n)$. Consider $u(n) = [1\; 2\; 3\; 4]$: the periodic signal has discontinuities that cause energy to be placed in high-frequency bins of $V(k)$, the DFT of $u(n)$.
Principle behind the DCT: make the input symmetric by doubling the length of $u(n)$, defining $u'(n)$ of period $2N$, where $u'(n) = u(n)$ for $0 \le n \le N-1$, and $u'(n) = u(2N-1-n)$ for $N \le n \le 2N-1$. The non-unitary definition of the DCT is thus:

$$V(k) = \sum_{n=0}^{N-1} 2u(n) \cos\left(\frac{2\pi k \left(n + \frac{1}{2}\right)}{2N}\right), \quad 0 \le k \le N-1.$$

The inverse transform is:

$$u(n) = \frac{1}{2N}\left[ V(0) + \sum_{k=1}^{N-1} 2V(k) \cos\left(\frac{2\pi k \left(n + \frac{1}{2}\right)}{2N}\right) \right].$$

Note that $V(0)$ is considered separately because in a symmetric DFT of length $2N$ the DC element is represented only once: $V(1) = V(2N-1)$, $V(2) = V(2N-2)$, and so on, while $V(0)$ pairs only with itself. The 1D unitary DCT is given by:

$$V(k) = \alpha(k) \sum_{n=0}^{N-1} u(n) \cos\left(\frac{\pi(2n+1)k}{2N}\right), \quad 0 \le k \le N-1 \qquad (15a)$$
$$u(n) = \sum_{k=0}^{N-1} \alpha(k)\, V(k) \cos\left(\frac{\pi(2n+1)k}{2N}\right), \quad 0 \le n \le N-1 \qquad (15b)$$
$$\alpha(0) = \sqrt{\frac{1}{N}}, \qquad \alpha(k) = \sqrt{\frac{2}{N}}, \quad 1 \le k \le N-1. \qquad (15c)$$
The 2D unitary DCT is given by:

$$V(k,l) = \alpha(k)\alpha(l) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} u(m,n) \cos\left(\frac{\pi(2m+1)k}{2N}\right) \cos\left(\frac{\pi(2n+1)l}{2N}\right) \qquad (16a)$$
$$u(m,n) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \alpha(k)\alpha(l)\, V(k,l) \cos\left(\frac{\pi(2m+1)k}{2N}\right) \cos\left(\frac{\pi(2n+1)l}{2N}\right). \qquad (16b)$$

For signals that generate a smooth $u'(n)$ the DCT is well localized. In general, the $N \to 2N$ mapping implicit in the DCT gives smoother sequences than the periodic $N$ extension of the DFT. This is why the DCT has better energy compaction for most sequences.
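A quick comparison in scipy, whose DCT-II with norm="ortho" matches (15a); the ramp input is smooth within the window but discontinuous under N-periodic extension:

import numpy as np
from scipy.fft import dct, fft

u = np.array([1.0, 2.0, 3.0, 4.0])       # smooth ramp; discontinuous when repeated

V_dct = dct(u, norm="ortho")             # unitary DCT-II, i.e., (15a)
V_dft = fft(u) / np.sqrt(len(u))         # unitary DFT

print(np.round(np.abs(V_dct), 3))        # energy packed into low-order coefficients
print(np.round(np.abs(V_dft), 3))        # discontinuity spreads energy to high bins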
Example:
$$u(n) = [0.65\;\; 0.27\;\; -0.27\;\; -0.65]$$
has DCT
$$[0\;\; 1\;\; 0\;\; 0]$$
and DFT
$$[0,\;\; 0.46 - j0.46,\;\; 0.38,\;\; 0.46 + j0.46].$$
Because the $2N$-extension
$$[0.65\;\; 0.27\;\; -0.27\;\; -0.65\;\; -0.65\;\; -0.27\;\; 0.27\;\; 0.65]$$
is smooth under $2N$-periodicity, its DCT is compact.

Example:
$$[1\;\; 1\;\; 1\;\; 1]$$
has DFT and DCT equal to
$$[2\;\; 0\;\; 0\;\; 0].$$

Example:
$$[1\;\; 0\;\; 0\;\; 0]$$
has DCT equal to
$$[0.5\;\; 0.65\;\; 0.5\;\; 0.27]$$
and DFT
$$[0.5\;\; 0.5\;\; 0.5\;\; 0.5].$$
The DCT of an impulse is not constant ("DC"), unlike its DFT.

The DFT kernel is symmetric, i.e., exchanging $n$ with $k$ does not affect the transform. In the DCT the kernel is not symmetric with respect to $n$ and $k$.

Example (mid-frequency):
$$[1\;\; 0\;\; -1\;\; 0]$$
has DCT
$$[0\;\; 0.92\;\; 1\;\; -0.38]$$
and DFT
$$[0\;\; 1\;\; 0\;\; 1].$$

Example (high-frequency):
$$[1\;\; -1\;\; 1\;\; -1]$$
has DCT
$$[0\;\; 0.765\;\; 0\;\; 1.848]$$
and DFT
$$[0\;\; 0\;\; 2\;\; 0].$$

For signals whose periodic extension is smooth, the energy localization with the DFT can be better than with the DCT. This is because the assumption of periodicity of the DFT does not apply to the DCT.
For the DCT, $C \neq C^T$, and the expansion using the DCT is fundamentally different from the expansion using the IDCT.

18 Image Enhancement

Image enhancement techniques are used to improve the perception of images by human observers. There are subjective and objective measures of quality.
The image may also be processed for further machine-based analysis.
Among image enhancement techniques are
1. Contrast enhancement
(a) Histogram (gray scale) modification
(b) High-pass (linear) filtering
(c) Homomorphic processing
2. Denoising and noise smoothing (background noise, Gaussian noise, salt-and-pepper noise, speckle
noise, quantization noise)
(a) Low-pass (linear) filtering
(b) Median (nonlinear) filtering
(c) Out-of-range pixel smoothing (nonlinear)
3. Edge detection
(a) Gradient-based edge detection
(b) Laplacian-based edge detection

18.1 Contrast Enhancement

18.1.1 Histogram modification

Let $u(m,n)$ be the intensity level of pixel $(m,n)$ of a given image. Histogram modification allows one to change the range of values taken on by the intensity function, thereby improving the contrast of the image. Let $y(m,n)$ be the result of the transformation
$$y = T[u].$$
As an example, consider the $4 \times 4$ image with intensities given by
$$u = \begin{bmatrix} 3 & 3 & 4 & 5 \\ 2 & 3 & 4 & 4 \\ 2 & 3 & 4 & 4 \\ 3 & 3 & 4 & 5 \end{bmatrix},$$
where 0 is black and 7 is white. The histogram and the cumulative histogram of the image are given by

u    : 0  1  2  3  4  5  6  7
p(u) : 0  0  2  6  6  2  0  0
P(u) : 0  0  2  8  14 16 16 16

Assume that ideally we would like to have a cumulative histogram given by

y     : 0  1  2  3  4  5  6  7
Pd(y) : 2  4  6  8  10 12 14 16

The transformation from u to y can now be obtained as follows:

1. Pick a value of u
2. Look up P(u)
3. Find Pd(y) that is closest to P(u)
4. Choose the corresponding y

In our example:
1. u = 0 → P(u) = 0 → Pd(y) = 2 → y = 0
2. u = 1 → P(u) = 0 → Pd(y) = 2 → y = 0
3. u = 2 → P(u) = 2 → Pd(y) = 2 → y = 0
4. u = 3 → P(u) = 8 → Pd(y) = 8 → y = 3
5. u = 4 → P(u) = 14 → Pd(y) = 14 → y = 6
6. u = 5 → P(u) = 16 → Pd(y) = 16 → y = 7
7. u = 6 → P(u) = 16 → Pd(y) = 16 → y = 7
8. u = 7 → P(u) = 16 → Pd(y) = 16 → y = 7
which corresponds to the transformation

u : 0 1 2 3 4 5 6 7
y : 0 0 0 3 6 7 7 7

The transformed image is given by
$$y = \begin{bmatrix} 3 & 3 & 6 & 7 \\ 0 & 3 & 6 & 6 \\ 0 & 3 & 6 & 6 \\ 3 & 3 & 6 & 7 \end{bmatrix}.$$
The histogram and cumulative histogram of the transformed image are given by

y    : 0  1  2  3  4  5  6  7
p(y) : 2  0  0  6  0  0  6  2
P(y) : 2  2  2  8  8  8  14 16
Fig. 28b shows an example of histogram modification based on the image in Fig. 28a.

(a) Original unprocessed image (from Matlab's demo photo library). (b) Image processed by histogram modification.

Figure 28: Example of histogram modification.

18.1.2 High-pass filtering

The image details, such as hair, edges, specks, etc., correspond to high-frequency components, while flat and smooth surfaces correspond to low-frequency components. High-pass filtering is used to increase contrast and sharpen the image. High-pass filtering may also accentuate any noise that is present in the image. In 2D, high-pass filtering may be obtained by using any of the following $3 \times 3$ filters:

$$h_1 = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}, \quad h_2 = \begin{bmatrix} 1 & -2 & 1 \\ -2 & 5 & -2 \\ 1 & -2 & 1 \end{bmatrix}, \quad h_3 = \begin{bmatrix} -1 & -2 & -1 \\ -2 & 13 & -2 \\ -1 & -2 & -1 \end{bmatrix}.$$

Note that these filters are such that $\sum_m \sum_n h_i(m,n) = H_i(k,l)\big|_{k=0,\, l=0} = 1$, $i = 1, 2, 3$, which guarantees that the DC component (i.e., the average intensity) of the image is unaltered. High-pass filtering can also be performed in the log-intensity domain, which can be seen as a form of homomorphic processing; see below. Examples of the effects of high-pass filtering are shown in Fig. 29.
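A minimal scipy sketch applying $h_1$ (the random image is a stand-in, and the boundary mode is a free design choice):

import numpy as np
from scipy.ndimage import convolve

h1 = np.array([[ 0, -1,  0],
               [-1,  5, -1],
               [ 0, -1,  0]], dtype=float)     # coefficients sum to 1

img = np.random.rand(64, 64)                   # stand-in gray-scale image
sharp = convolve(img, h1, mode="nearest")      # sharpened image

print(img.mean(), sharp.mean())                # average intensity nearly unaltered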

(a) Original unprocessed image.

(b) Image processed by high-pass filtering using h1 .

(c) Image processed by high-pass filtering using h2 .

(d) Image processed by high-pass filtering using h3 .

Figure 29: Example of high-pass filtering.

18.1.3 Homomorphic processing

Homomorphic processing is based on the observation that the intensity of an image is the product of illumination and reflectance,
$$u(m,n) = i(m,n) \cdot r(m,n).$$
The illumination is the principal contributor to the dynamic range of the image's intensity, and it varies slowly across the image. The reflectance contributes to the image's detail, and it changes much more quickly than the illumination. Imagine a source of light (sun, lamp, etc.) illuminating an object composed of different parts, each with a different reflective characteristic. When the light source is very bright, the contrast is diminished. The goal is to reduce the illumination component and enhance the reflectance component, so as to increase the overall contrast of the image. Because of the multiplicative relationship between illumination and reflectance, it is convenient to work in the log domain:
$$\log u(m,n) = \log i(m,n) + \log r(m,n).$$
In this way the illumination can be extracted by low-pass filtering, and the reflectance by high-pass filtering. The low-passed $\log i(m,n)$ is multiplied by a factor $\alpha < 1$ and the high-passed $\log r(m,n)$ is multiplied by $\beta > 1$. After exponentiation, the transformed image is given by
$$y(m,n) = (i(m,n))^{\alpha}\, (r(m,n))^{\beta}.$$
The use of homomorphic processing, where the image transformation is achieved in the log-intensity domain, is also justified by the way the human visual system operates. An example of homomorphic processing applied to the image in Fig. 28a is shown in Fig. 30.
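A rough sketch of the chain, with a box low-pass standing in for the illumination estimate; the function name and the values of alpha, beta, and the window size are illustrative:

import numpy as np
from scipy.ndimage import uniform_filter

def homomorphic(u, alpha=0.5, beta=1.5, size=15):
    log_u = np.log(u + 1e-6)                   # log u = log i + log r (offset avoids log 0)
    log_i = uniform_filter(log_u, size=size)   # slowly varying part ~ illumination
    log_r = log_u - log_i                      # detail part ~ reflectance
    return np.exp(alpha * log_i + beta * log_r)

u = np.random.rand(64, 64) + 0.5               # stand-in image, strictly positive
y = homomorphic(u)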

Figure 30: Example of homomorphic processing.

18.2 Noise Smoothing

Images may contain different forms of noise: Gaussian noise, speckle, or salt-and-pepper noise. Different types of linear and nonlinear filtering may be applied to mitigate their effects.
18.2.1 Low-pass filtering

Most of the image's energy is contained in the low-frequency components. High-frequency components contain detail as well as much of the noise. Low-pass filtering may reduce much of the noise, at the expense of also reducing some of the detail, i.e., blurring the image. Low-pass filtering is a linear 2D filtering operation that can be achieved by using one of the following filters:

$$h_4 = \frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \quad h_5 = \frac{1}{10}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \quad h_6 = \frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}.$$

Again note that the filters are normalized to guarantee that the average intensity of the image is unchanged.
Fig. 31 demonstrates the effects of low-pass filtering on an image corrupted by Gaussian noise.

(a) Original image corrupted by Gaussian noise.

(b) Image processed by low-pass filtering using h4 .

(c) Image processed by low-pass filtering using h5 .

(d) Image processed by low-pass filtering using h6 .

Figure 31: Example of low-pass filtering.

18.2.2 Median filtering

Median filtering is a form of nonlinear filtering that is used to combat impulsive, salt-and-pepper noise. This kind of noise may be the result of random bit flips that occur in the communication of an image.
Consider first a 1D median filter. Median filtering is obtained by sliding a window of odd length over the sequence of interest, and replacing the intensity of the middle pixel with the median intensity of all pixels in the window (the median value out of $N$, $N$ odd, is the one such that $(N-1)/2$ values are lower than that value, and $(N-1)/2$ are higher; for instance, the median of 4, 7, 15, 33, 255 is 15). 1D median filtering is ideal when some pixels have outlier values of intensity: such outliers will be removed. It is also used when the edges (discontinuities) need to be preserved. Consider for instance the effects of a low-pass filter given by $[0.2\;\; 0.2\;\; 0.2\;\; 0.2\;\; 0.2]$ and a 5-point median filter on the sequence
$$0.1,\; 0.2,\; 0.1,\; 0.1,\; 0.0,\; 0.1,\; 0.0,\; 1.2,\; 0.9,\; 1.0,\; 1.1,\; 0.9,\; 1.2,\; 1.0,\; 1.1,\; 1.2$$
An important parameter in median filtering is the length of the sliding window: if outliers can come in pairs (two adjacent large values surrounded by small values), then windows of length less than 5 (such as 3) will not be effective at removing the impulsive values.
2D median filtering is obtained by applying 1D median filtering in the horizontal and vertical directions separately. The reason is that having a 2D sliding window and applying median filtering in 2D may result in the altering of edge information where edges meet. Try to apply 2D median filtering to the unit step $u(m,n)$, and compare with 1D filtering applied separately in the horizontal and vertical directions. The result of median filtering on an image corrupted by salt-and-pepper noise is demonstrated in Fig. 32.
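Using scipy on the sequence above, the two filters can be compared directly (5-point windows, as in the text); the moving average smears the step while the median preserves it:

import numpy as np
from scipy.ndimage import median_filter, uniform_filter1d

x = np.array([0.1, 0.2, 0.1, 0.1, 0.0, 0.1, 0.0, 1.2,
              0.9, 1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 1.2])

print(np.round(uniform_filter1d(x, size=5), 2))   # 5-tap moving average
print(np.round(median_filter(x, size=5), 2))      # 5-point median filter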

(a) Original image corrupted by salt-and-pepper noise.

(b) Image processed by median filtering.

Figure 32: Example of median filtering.

18.2.3 Out-of-range pixel smoothing

Similarly to median filtering, out-of-range pixel smoothing is a nonlinear filtering operation obtained by sliding a window over an image and comparing the intensity value of the middle pixel with the average of all the other pixels' intensities; if its intensity is significantly different (based on some predetermined threshold), then its value is replaced by the average. Out-of-range pixel smoothing is used to mitigate salt-and-pepper noise. Fig. 33 shows the effects of out-of-range pixel smoothing on the image of Fig. 32a.
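A minimal sketch of the rule (threshold, window size, and function name are illustrative); the center-excluded average is derived from a box filter:

import numpy as np
from scipy.ndimage import uniform_filter

def out_of_range_smooth(u, threshold=0.5, size=3):
    n = size * size
    avg_excl = (uniform_filter(u, size=size) * n - u) / (n - 1)   # window mean without center
    return np.where(np.abs(u - avg_excl) > threshold, avg_excl, u)

u = np.zeros((5, 5)); u[2, 2] = 1.0       # one salt "outlier"
print(out_of_range_smooth(u))             # outlier replaced by its neighbors' average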

Figure 33: Example of out-of-range pixel smoothing.

19 Edge Detection

An edge is a boundary between two regions of an image that differ by their reflectivity, illumination, distance, etc. Edge detection can be used to segment an image into different regions. Segmentation is useful for image understanding, for instance, or object detection and identification.

19.0.4 Gradient-based edge detection

A sudden change in intensity may indicate an edge. Consider a 1D intensity function that suddenly changes from low values to large values. The point of change can be determined by computing the gradient of that function, which in 1D is simply a derivative. The amplitude of that derivative is compared to a predetermined threshold, and if it exceeds it, and is a local maximum, then an edge is detected. An edge detection system may comprise a derivative calculator, an absolute value, a comparator to a given threshold, and a way to determine if the point is a local maximum. In 2D the gradient is given by

$$\nabla u(x,y) = \frac{\partial u(x,y)}{\partial x}\,\vec{\imath} + \frac{\partial u(x,y)}{\partial y}\,\vec{\jmath}.$$
The edge detection mechanism is similar to the 1D system, but in addition the values of $|\nabla u(x,y)|$ for all the candidate edge points need to be considered for a number of specified directions. If all the candidate $|\nabla u(x,y)|$ are local maxima in at least one direction, then an edge is detected. Normally, the specified directions are the horizontal and vertical directions. In order to avoid minor, false edge lines from being detected, additional constraints may be added, such as the following:
1. If $|\nabla u(x,y)|$ has a local maximum at $(x_0, y_0)$ in the horizontal direction but not in the vertical direction, and $\left|\frac{\partial u}{\partial x}\right| > 2\left|\frac{\partial u}{\partial y}\right|$, then $(x_0, y_0)$ is an edge point;
2. If $|\nabla u(x,y)|$ has a local maximum at $(x_0, y_0)$ in the vertical direction but not in the horizontal direction, and $\left|\frac{\partial u}{\partial y}\right| > 2\left|\frac{\partial u}{\partial x}\right|$, then $(x_0, y_0)$ is an edge point.
An edge detector that only considers a specific direction is called a directional edge detector. One that is based on the gradient $|\nabla u(x,y)|$ is non-directional.
In the discrete 2D domain, the partial derivatives and gradients can be approximated by linear filters. Among filters that can be used for vertical, horizontal, and diagonal (directional) edge detection are

$$h_7 = \begin{bmatrix} -1 & 1 \\ -1 & 1 \end{bmatrix}, \quad h_8 = \begin{bmatrix} -1 & -1 \\ 1 & 1 \end{bmatrix}, \quad h_9 = \begin{bmatrix} 0 & 1 & 1 \\ -1 & 0 & 1 \\ -1 & -1 & 0 \end{bmatrix}, \quad h_{10} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & -1 \\ 0 & -1 & -1 \end{bmatrix}.$$

The nondirectional edge detector can be obtained from the approximation of

$$|\nabla u(x,y)| = \sqrt{\left(\frac{\partial u(x,y)}{\partial x}\right)^2 + \left(\frac{\partial u(x,y)}{\partial y}\right)^2}$$

as in

$$\sqrt{(u_x(m,n))^2 + (u_y(m,n))^2} \qquad (17)$$

where
$$u_x(m,n) = (u * h_x)(m,n), \qquad u_y(m,n) = (u * h_y)(m,n),$$
and
$$h_x = \begin{bmatrix} -1 & 1 \\ -2 & 2 \\ -1 & 1 \end{bmatrix}, \qquad h_y = \begin{bmatrix} 1 & 2 & 1 \\ -1 & -2 & -1 \end{bmatrix}.$$
Gradient-based edge detection methods can be sensitive to noise. Some form of noise smoothing may have
to be applied to the image before using these techniques. Additional removal of false edges may require
some post-processing.
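A sketch of the nondirectional detector (17) using the masks $h_x$, $h_y$ above (the step image and threshold are illustrative, and the local-maximum test is omitted):

import numpy as np
from scipy.ndimage import convolve

hx = np.array([[-1, 1], [-2, 2], [-1, 1]], dtype=float)
hy = np.array([[1, 2, 1], [-1, -2, -1]], dtype=float)

def gradient_edges(u, threshold):
    ux = convolve(u, hx)                  # approximate partial derivative in x
    uy = convolve(u, hy)                  # approximate partial derivative in y
    return np.sqrt(ux**2 + uy**2) > threshold

u = np.zeros((16, 16)); u[:, 8:] = 1.0    # vertical step edge
edges = gradient_edges(u, 2.0)
print(np.unique(np.where(edges)[1]))      # flagged columns cluster around n = 8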
19.0.5 Laplacian-based edge detection

An alternative to gradient-based processing can be derived by considering that the second derivative of the intensity function has a zero-crossing in correspondence with an edge. In 2D, the second derivative is replaced by the Laplacian
$$\nabla^2 u(x,y) = \frac{\partial^2 u(x,y)}{\partial x^2} + \frac{\partial^2 u(x,y)}{\partial y^2}.$$
In discrete domains, the Laplacian is approximated by computing second-order differences, which can be computed by using 2D linear filters. If
$$\frac{\partial u(x,y)}{\partial x} \approx u(m+1,n) - u(m,n) \quad \text{or} \quad u(m,n) - u(m-1,n),$$
then
$$\frac{\partial^2 u(x,y)}{\partial x^2} \approx u(m+1,n) - 2u(m,n) + u(m-1,n)$$
and
$$\nabla^2 u(x,y) \approx u(m+1,n) + u(m-1,n) + u(m,n+1) + u(m,n-1) - 4u(m,n),$$
which corresponds to the filter
$$h_{11} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$$
Other filters can be used, such as
$$h_{12} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \qquad h_{13} = \begin{bmatrix} -1 & 2 & -1 \\ 2 & -4 & 2 \\ -1 & 2 & -1 \end{bmatrix}.$$

The Laplacian-based edge detector may generate many false edges. One way to avoid that is to require that the local variance at a candidate edge point be larger than a predetermined threshold. The rationale is that if the candidate point is indeed an edge point, then the intensity of the pixels around that point should vary greatly. The variance should be small if the candidate point is simply an outlier. The variance over a $(2M+1) \times (2M+1)$ window is computed as follows:
$$\sigma^2(m,n) = \frac{1}{(2M+1)^2} \sum_{k=m-M}^{m+M} \sum_{l=n-M}^{n+M} \left(u(k,l) - \mu(m,n)\right)^2,$$
where
$$\mu(m,n) = \frac{1}{(2M+1)^2} \sum_{k=m-M}^{m+M} \sum_{l=n-M}^{n+M} u(k,l)$$
is the sample mean. The value of $M$ is a design parameter, and is typically $M = 2$. The threshold will depend on the nature of the image. Examples of edge detection are shown in Fig. 34.
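A sketch combining the mask $h_{11}$ with the variance test; thresholds are illustrative, and a simple magnitude test stands in for a true zero-crossing detector:

import numpy as np
from scipy.ndimage import convolve, uniform_filter

h11 = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def laplacian_edges(u, lap_thresh, var_thresh, M=2):
    lap = convolve(u, h11)                           # discrete Laplacian
    size = 2 * M + 1                                 # (2M+1) x (2M+1) window
    mu = uniform_filter(u, size=size)                # local sample mean
    var = uniform_filter(u * u, size=size) - mu**2   # local variance
    return (np.abs(lap) > lap_thresh) & (var > var_thresh)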


(a) Original unprocessed image (from Matlab's demo photo library). (b) Image processed by gradient-based edge detection using h7. (c) Image processed by gradient-based edge detection using h8. (d) Image processed by gradient-based edge detection using h9. (e) Image processed by gradient-based edge detection using h10. (f) Image processed by gradient-based edge detection using (17). (g) Image processed by Laplacian-based edge detection using h11, with no zero-crossing detection or variance thresholding. (h) Image processed by Laplacian-based edge detection using h12, with no zero-crossing detection or variance thresholding.

Figure 34: Example of edge detection.

20 Image Compression and Coding

Goal: to reduce the number of bits needed to encode an image by eliminating redundancy.
Current and future application examples (1 Mibit = $2^{20}$ bits):

Color image (256 × 256, 24-bit color): 1.5 Mibit
Color video (1920 × 1080, 24-bit color, 60 fps): 2.98 Gibit/s
Digital radiology facility: 64 Mibit/image, 1800 images per month
SAR facility: 800 Mibit/image, 5 to 10 images per day
Fax: 2 Mibit/page

Almost all compression techniques use one or more of the following principles:
1. Resources, i.e., storage space and transmission time, will be conserved if one uses short messages to
communicate the occurrence of common events
2. Pixels within a given region are often correlated
3. If loss must occur, discard information less vital to the image
Compression techniques are divided into two classes
1. Reversible, error-free, distortionless: all information is preserved, complete reconstruction is possible.
Typical compression 3 : 1
2. Irreversible, lossy: some information is lost. Typical compression 20 : 1 for single images.

20.1 Information Theory Background

The information of a random event with probability $P(E)$ is
$$I(E) = -\log P(E).$$
The base of the logarithm determines the units of information: a bit if the base is 2, a nat if the base is $e$.
Example: $E = \{\text{coin toss} = \text{heads}\}$:
$$P(E) = \frac{1}{2}, \qquad I(E) = -\log_2 \frac{1}{2} = 1 \text{ bit.}$$

Now consider an information source with an alphabet of $L$ possible symbols, each with probability $p_i$, $i = 0, 1, \ldots, L-1$,
$$\sum_{i=0}^{L-1} p_i = 1.$$
If $M \gg 1$ symbols are generated by this source, symbol $i$ will appear $M p_i$ times. The information contributed by symbol $i$ is
$$I_i = -M p_i \log_2 p_i \text{ bits.}$$
The total information in the string of $M$ symbols is
$$I = -M \sum_{i=0}^{L-1} p_i \log_2 p_i.$$
The average information per symbol, $H$, is
$$H = -\sum_{i=0}^{L-1} p_i \log_2 p_i = \text{entropy.}$$

According to Shannon's noiseless coding theorem, it is always possible to devise a code with an average of $B$ bits per symbol where
$$H \le B \le H + \frac{1}{n},$$
where $n$ is the number of symbols encoded at one time. Usually $n = 1$.

20.2 Huffman Coding

Huffman coding is a common, reversible image compression technique that encodes one symbol at a time and results in a variable-length code with the lowest possible bit rate among such symbol-by-symbol codes.
To generate a Huffman codebook:
1. Map initial symbols to symbol nodes and order based on symbol probabilities.
2. Combine the two symbol nodes having the lowest probabilities into a new symbol node, using 0 and 1 to link the new node to the former symbols.
3. Repeat until only one node (the root, with probability 1) remains.

20.3 How to generate a Huffman tree

Let us use an example. Assume your source's alphabet is given in the following table, together with the symbols' probabilities:

Symbol : Probability
a : 0.1
b : 0.4
c : 0.04
d : 0.1
e : 0.07
f : 0.29

Let's first rank the symbols by descending probability and create a node for each symbol (which will become a leaf in the tree):

b (0.4), f (0.29), a (0.1), d (0.1), e (0.07), c (0.04).

Next, let's connect the two nodes with the lowest probability and generate a node with the combined probability: e (0.07) and c (0.04) are joined into a node with probability 0.11.

The next two least probable nodes are a and d; they can now be connected to form a new node with combined probability of 0.2.

At each stage, continue connecting the two least probable nodes: the 0.2 node and the 0.11 node are joined into a node with probability 0.31; then f (0.29) and the 0.31 node are joined into a node with probability 0.6.

Finally, b (0.4) and the 0.6 node are connected, and the node with probability 1.0 will be the root of the tree.

Now, starting from the root, trace your way back to the leaves. Each internal node (including the root) has two edges leaving it; label one with a 0 and the other with a 1.

The strings of binary digits associated to each symbol can be read by traversing the tree from the root to each leaf, and are given in the following table:

Symbol : Binary string
a : 0011
b : 1
c : 0000
d : 0010
e : 0001
f : 01

The average number of binary digits per symbol is

$$1 \cdot 0.4 + 2 \cdot 0.29 + 4 \cdot 0.1 + 4 \cdot 0.1 + 4 \cdot 0.07 + 4 \cdot 0.04 = 2.22.$$

Compare this with the source entropy in bits per symbol, given by

$$-0.4 \log_2(0.4) - 0.29 \log_2(0.29) - 0.1 \log_2(0.1) - 0.1 \log_2(0.1) - 0.07 \log_2(0.07) - 0.04 \log_2(0.04) \approx 2.165.$$
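A compact heap-based sketch of the codebook construction for this alphabet (0/1 edge labels, and hence the exact code strings, can differ from the tree above, but the code lengths, the 2.22 average, and the entropy match):

import heapq
from math import log2

probs = {"a": 0.1, "b": 0.4, "c": 0.04, "d": 0.1, "e": 0.07, "f": 0.29}

# Min-heap of (probability, tie-breaker, {symbol: code-so-far}).
heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
count = len(heap)
while len(heap) > 1:
    p1, _, c1 = heapq.heappop(heap)       # the two least probable nodes
    p2, _, c2 = heapq.heappop(heap)
    merged = {s: "0" + c for s, c in c1.items()}
    merged.update({s: "1" + c for s, c in c2.items()})
    heapq.heappush(heap, (p1 + p2, count, merged)); count += 1

code = heap[0][2]
print(code)
print(sum(probs[s] * len(code[s]) for s in probs))   # 2.22
print(-sum(p * log2(p) for p in probs.values()))     # ~2.165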
Example: Consider a digital image, where each pixel has 8 bits. The probability of the $i$th gray level, $0 \le i \le 255$, is $p_i$. If the pdf is uniform, i.e., $p_i = 2^{-8}$ for all $i$,
$$H = \sum_{i=0}^{255} \frac{1}{256} \log_2 2^8 = 8.$$
The minimum coding rate is 8 bit/pixel. Result: when the pdf is uniform, standard PCM coding is optimal.

Example: Suppose the image pdf is
$$p_i = \begin{cases} 2^{-7}, & 0 \le i \le 127, \\ 0, & 128 \le i \le 255. \end{cases}$$
Then $H = 7$, and sending 8 bit/pixel is not optimal.


If an image having $M$ bits per pixel has entropy $H$, the minimum possible distortionless coding scheme uses $H$ bits per pixel, and the maximum compression ratio is
$$C = \frac{M}{H},$$
and the efficiency of a code using $B$ bits per pixel is
$$\text{efficiency} = \frac{H}{B}.$$
Note that although the codewords are of variable length, they are uniquely decodable. No end-of-word code needs to be sent.
Problem: if $L$ different symbols are possible, there are $L$ codewords, and the longest codeword can have length on the order of $L$. In modified Huffman coding, only the first $L_1$ symbols are Huffman coded, with symbol $i$ represented as
$$i = q L_1 + j.$$

20.4 Run-length Coding (RLC)

RLC is especially good for documents, drawings, and other two-tone images. RLC is the basis for fax transmission.
Principle: encode the lengths of runs of consecutive pixels. Possibilities include:
1. Encoding the number of zeros between two consecutive 1s (good when $P(1) \ll 1$)
2. Encoding the distance to the next 1 → 0 or 0 → 1 transition.
Often the run-lengths themselves are Huffman coded.
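A sketch of the first possibility (lengths of zero runs between consecutive 1s; the bit string is illustrative):

def rle_encode(bits):
    runs, count = [], 0
    for b in bits:
        if b == 0:
            count += 1          # extend the current run of zeros
        else:
            runs.append(count)  # a 1 terminates the run
            count = 0
    runs.append(count)          # trailing zeros
    return runs

print(rle_encode([0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0]))   # [3, 1, 4, 2]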

20.5 Bit-plane Encoding

Consider an 8-bit image as 8 separate two-tone images. Encode each plane separately.

20.6 Distortion Function

In communication, the rate distortion function, $R_D$, gives the minimum possible coding rate for a given channel-induced distortion. In image compression, the channel is the image compression/decompression process. A quantizer that achieves $R_D$ is called a Shannon quantizer.
Examples:
1. Binary source: $d = 1$ if a coding error occurs, $d = 0$ if no coding error occurs, $p(1) = \frac{1}{2}$. Note: if $D = 0$, the minimum rate is 1 bit/symbol; if $D = 0.5$, $R_D = 0$.
2. Gaussian source with variance $\sigma^2$:
$$R_D = \max\left(0,\; \frac{1}{2} \log_2 \frac{\sigma^2}{D}\right).$$
Note: if we accept $D = \sigma^2$, no information needs to be sent.

21 Transform Coding and Compression

Steps:
1. Transform $u(m,n)$ to get $V(k,l)$.
2. Quantization: each element of $V(k,l)$ is called a coefficient. Each coefficient is quantized using $n_{k,l}$ bits, where $n_{k,l}$ is larger for $(k,l)$ where the variance of $V(k,l)$, $\sigma^2_{k,l}$, is large. The value $n_{k,l}$ is called the bit allocation. Example of bit allocation with $N = 4$, coded at 0.5 bit/pixel, i.e., 8 bits per $4 \times 4$ block:
$$\begin{bmatrix} 3 & 2 & 0 & 0 \\ 2 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
The coded image is $\hat{V}(k,l)$.
3. Send $\hat{V}(k,l)$ to the receiver.
4. Inverse-transform $\hat{V}(k,l)$ to get $\hat{u}(m,n)$.

22 The JPEG Standard

JPEG = Joint Photographic Experts Group, with representatives from industry and academia from all over the world. "Joint" refers to ISO (the International Organization for Standardization) and CCITT (Comité Consultatif International Téléphonique et Télégraphique, now ITU-T, the International Telecommunication Union, Telecommunication Standardization Sector).
Goal: a compression standard ensuring compatibility, universality, etc.
There are several modes (progressive, lossless, etc.) in JPEG. We consider only the basic lossy mode.
The algorithm (assume the image consists of $M$-bit integers):
1. Break the image up into non-overlapping $8 \times 8$ blocks.
2. For each block $u(m,n)$, take the $8 \times 8$ DCT: $u(m,n) \rightarrow V(k,l)$.
3. Quantize each DCT coefficient. The quantization step size for each $(k,l)$ is stored in a look-up table as $q(k,l)$, $1 \le q(k,l) \le 255$:
$$\hat{V}(k,l) = \text{integer round}\left(\frac{V(k,l)}{q(k,l)}\right),$$
where $q(k,l)$ is determined by the just-noticeable threshold: if the contribution of $V(k,l)$ will not be noticed, then $\hat{V}(k,l) = 0$. The DC coefficient, $V(0,0)$, is then converted to a differential value:
$$\Delta V = \hat{V}(0,0)\big|_{\text{block } n} - \hat{V}(0,0)\big|_{\text{block } n-1}.$$
4. The 64 DCT coefficients are then mapped to a 1D vector of length 64 by using a zig-zag scan. This orders $\hat{V}(k,l)$ in approximate order of increasing spatial frequency. Many coefficients of $\hat{V}(k,l)$ are zero or near zero.
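A sketch of steps 2 and 3 on a single block, using scipy's 2D DCT and a hypothetical uniform quantization table (real JPEG tables are frequency-dependent and come from the standard):

import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
block = rng.integers(0, 256, (8, 8)).astype(float) - 128   # level-shifted 8-bit block

V = dctn(block, norm="ortho")              # 8x8 2D DCT
q = np.full((8, 8), 16.0)                  # hypothetical uniform table q(k, l)
V_hat = np.round(V / q)                    # quantized coefficients
block_rec = idctn(V_hat * q, norm="ortho") # decoder: dequantize + inverse DCT

print(np.count_nonzero(V_hat), "of 64 coefficients are non-zero")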

5. The 1D vector of length $L_1 = 64$ is run-length coded (DC element excluded). Each non-zero value is represented by two symbols:
Symbol 1: (run length, size $S$), where $S$ is the number of bits used to code Symbol 2;
Symbol 2: amplitude, expressed as an $S$-bit number. For $M = 8$-bit images, shifted to a range $-127 \le p \le 128$, the maximum possible DCT coefficient amplitude is $\frac{1}{8} \cdot 64 \cdot 128 = 1024$, so $S_{\max} = M + 3$.
6. Symbols are entropy-coded using a modified Huffman code.
Color images contain three components: one component of luminance (gray scale), and two components of chrominance. In JPEG, chrominance is coded with less precision than luminance.
JPEG performance for color images (depends on the quantization table, image statistics, etc.):

0.25 to 0.5 bits per pixel : moderate quality, but degradation noticeable
1 bit per pixel : excellent, acceptable for most uses
1.5 to 2 bits per pixel : usually indistinguishable from original

Other JPEG features include:
1. Progressive encoding: send the most important information first. Two options:
(a) Send the most important DCT coefficients first
(b) Send the most significant bits of all DCT coefficients first
2. A lossless predictive coding algorithm based on 2D prediction

23 Video Compression

1. Intraframe techniques compress single images and use spatial redundancy to reduce the bit rate.
2. Interframe techniques use temporal redundancies between frames.
A full video compression scheme incorporates both types of redundancy.
Video applications can be classified as:
Symmetric: compression and decompression are performed with equal frequency. Example: video conferencing, video telephony, etc.
Asymmetric: compress only once. Example: movies, electronic publishing, etc.
The goal of most video compression schemes is often a rate of about 1.5 Mibit/s. This is a good rate for LANs, existing communication channels, etc. One of the most recent standards is MPEG (Motion Picture Experts Group). Achieved and projected bit rates (for 30 frames/s, non-interlaced) are:

Resolution : Compressed bit rate (Mibit/s)
352 × 240 : 1.5 (VHS-like quality)
720 × 480 : 7.5
1920 × 1080 (HDTV) : 30

In addition to achieving reduced bit rates, video compression algorithms must enable
Random access
Audio-visual synchronization
Coding/decoding delay of less than 150 ms
Recovery from transmission errors
Rate control in video coding is an extremely difficult issue: the goal is to meet an overall constant rate, and there are buffering considerations. Standards may codify quantization step size, inter-/intra-frame compression, and block sizes.
Audio in video: quantization precision is based on a perceptual model. Quantization introduces loss depending on precision. A non-linear quantizer may be used (where step sizes are based on a power law). Stereo may be handled differently. Example: one can code L+R and then separately L-R. Or one can exploit the fact that it is hard for humans to locate very low and very high frequencies, so those can be coded on a single channel. After quantization, Huffman coding is applied.
CD data rate: 44.1 kHz sampling rate, 16 bits/sample, giving about 700 Kibit/s per channel. Two channels give about 1.5 Mibit/s.
MP3: initially defined in the early 1990s as MPEG-1/2 Layer 3. MPEG is the name of a family of audio-visual coding standards; it originally specified three different layers of audio codec, where higher layers give better quality at a given bit rate but require more complexity. Originally (in MPEG-1) MP3 allowed output rates from 32 Kibit/s to 320 Kibit/s; MPEG-2 extended this down to 8 Kibit/s. MP3 supports sampling rates of 16 kHz, 22.05 kHz, 24 kHz, 32 kHz, 44.1 kHz, and 48 kHz; usually 44.1 kHz is used for compatibility with CD audio rates.

Figure 35: MP3 block diagram.

MP3 is based on subband filtering. The first step is decomposition into 32 subbands; the MDCT is then used to create 18 more frequency components for each, for a total of 576 frequency bins. (The MDCT is a modified DCT: modified because it has overlap between successive windows.)
The psychoacoustic (or perceptual) model exploits masking. Masking describes the fact that the perception of a given sound depends not only on the level of the sound itself but also on the context. In coding, this allows assigning fewer bits to portions of the audio signal that humans would be unlikely to hear.


AAC (Advanced Audio Coding): also developed by MPEG, in the mid 1990s. Like MP3, it uses the MDCT and perceptual coding. It performs better than MP3: music will sound better at the same bit rate, or have the same quality at a lower bit rate. It is supported in iTunes, many video games, standards such as MPEG-4, etc.
