DIGITAL COMMUNICATIONS
Poompat Saengudomlert
Asian Institute of Technology
February 2012
Contents
1 Introduction
2 Review of Related Mathematics
3 Source Coding
3.1 Binary Source Code for Discrete Sources
3.2 Entropy of Discrete Random Variables
3.3 Source Coding Theorem for Discrete Sources
3.4 Asymptotic Equipartition Property
3.5 Source Coding for Discrete Sources with Memory
3.6 Source Coding for Continuous Sources
3.7 Vector Quantization
3.8 Summary
3.9 Practice Problems
4 Communication Signals
4.1 L2 Signal Space
4.2 Pulse Amplitude Modulation
4.3 Nyquist Criterion for No ISI
4.4 Passband Modulation: DSB-AM and QAM
4.5 K-Dimensional Signal Sets
4.6 Summary
4.7 Practice Problems
5 Signal Detection
5.1 Hypothesis Testing
5.2 AWGN Channel Model
5.3 Optimal Receiver for AWGN Channels
5.4 Performance of Optimal Receivers
5.5 Detection of Multiple Transmitted Symbols
5.6
5.7
5.8
6 Channel Coding
6.1 Hard Decision and Soft Decision Decoding
6.2 Binary Linear Block Codes
6.3 Binary Linear Convolutional Codes
6.4 Summary
6.5 Practice Problems
Chapter 1
Introduction
Chapter 1
Introduction
In this course, we discuss the principles of digital communications. We shall focus on
the fundamental knowledge behind the construction of practical systems, rather than
on detailed specifications of particular standards or commercial systems. Having
mastered the fundamental knowledge, you should be able to read and understand
the technical specifications of practical systems over the course of your career. For most
of the course, we shall focus our attention on point-to-point digital communication
systems, leaving the networking aspects of digital communications to other courses.
Figure 1.1 shows a block diagram of a typical point-to-point communication system.
We discuss different parts of the block diagram below.
[Figure 1.1: Block diagram of a typical point-to-point communication system. Transmit side: input, source encoder, bits, channel encoder, bits, modulator, signal waveform, physical channel. Receive side: signal waveform, demodulator, bits, channel decoder, bits, source decoder, output.]
decoder is to convert the possibly corrupted received bit sequence back to the
information bits, or as close as possible to the information bits.
Modulation: The function of a modulator is to map the coded bit sequence into
a signal waveform suitable for the transmission over the physical channel. The
function of a demodulator is to convert the possibly corrupted received signal
waveform back to the transmitted bit sequence, or as close as possible to the
transmitted bit sequence.
Note that the structure in figure 1.1 is common, but it is not the only possibility. For
example, in some cases, it is desirable to perform channel coding and modulation
together in a single step called coded modulation. Breaking the overall communication
problem into different steps is in general suboptimal. However, such separations
are often practical; different parts of the system can be designed and constructed
separately.
If the information signal from the source is an analog waveform, the source encoder
typically needs to perform sampling and quantization on the input. Sampling refers
to obtaining sample values from the waveform, while quantization refers to converting
the sample values to information bits. Sampling and quantization are in general lossy,
i.e. even if the physical channel is ideal, the system output will be distorted
and cannot be used to retrieve the original waveform exactly.
For applications that require encryption, we can add the encryptor after the source
encoder, and the decryptor after the channel decoder. Encryption is, however, beyond
the scope of this course and will not be discussed.
The subsequent chapters discuss various components of a typical digital communication system. As a note to the student reader, the sections that are marked with
are optional materials; you will not be responsible for them in the examinations.
Chapter 2
Review of Related Mathematics
In this chapter, we give a brief review on basic mathematical tools that we shall use
in the analysis of digital communication systems. The review includes probability,
Fourier analysis, linear algebra, and random processes. The review is not meant to
be comprehensive, but is used to help refresh relevant concepts that we shall use in
this course. Several results are stated without proofs. However, references are given
for more detailed information.
2.1 Review of Probability
The sample space $S$ of an experiment is the set of all possible outcomes. An event
is a set of outcomes, or a subset of the sample space. For an event $E$, we shall use
$\Pr\{E\}$ to denote the probability of $E$. We first present the axioms that a probability
measure must satisfy.
Axiom 2.1 (Axiom of probability): Let $S$ be the sample space and $E, F \subseteq S$ be
events.
1. $\Pr\{S\} = 1$.
2. $0 \le \Pr\{E\} \le 1$.
3. If $E$ and $F$ are disjoint, then $\Pr\{E \cup F\} = \Pr\{E\} + \Pr\{F\}$.
The above axiom can be used to prove basic properties such as $\Pr\{E^c\} = 1 - \Pr\{E\}$
and $\Pr\{E \cup F\} = \Pr\{E\} + \Pr\{F\} - \Pr\{E, F\}$.^1 For example, since $E$ and $E^c$ are
disjoint and their union is $S$, from statement 3 of the axiom, $\Pr\{S\} = \Pr\{E\} + \Pr\{E^c\}$.
From statement 1, we obtain the desired property: $1 = \Pr\{E\} + \Pr\{E^c\}$.
By induction, statement 3 can be extended to three or more events. In particular,
if $E_1, \ldots, E_n$ are disjoint, then $\Pr\left\{\bigcup_{j=1}^{n} E_j\right\} = \sum_{j=1}^{n} \Pr\{E_j\}$.
The conditional probability of event $E$ given that event $F$ happens (or in short
given event $F$), denoted by $\Pr\{E|F\}$, is defined as $\Pr\{E|F\} = \Pr\{E, F\}/\Pr\{F\}$.^2
^1 $E^c$ denotes the complement of $E$, i.e. $E^c = S - E$.
^2 $\Pr\{E, F\}$ denotes $\Pr\{E \cap F\}$.
Random Variables
A random variable (RV) is a mapping from a sample space S to a set of finite real
numbers. By convention, we use capital letters to denote RVs and use lower case
letters to denote their values. Strictly speaking, the value of a RV must be a real
number. Note that a result of an experiment, e.g. head or tail in a coin toss, may
not be a RV. (Some prefer to use the term chance variable for such a quantity.)
However, if we assign real numbers to outcomes, e.g. 1 for head and 0 for tail, it is
straightforward to turn such an experimental result into a RV. For this reason, we
shall use the term RV to refer to any experimental result.
A discrete RV takes on a discrete, i.e. countable, set of values.^3 A continuous
RV takes on a continuous set of values. A RV can be neither discrete nor continuous.
A probability distribution, or in short a distribution, of a RV $X$, denoted by $F_X(x)$, is
defined as $F_X(x) = \Pr\{X \le x\}$. An example of a distribution for a RV that is neither
discrete nor continuous is given below. Note that $X$ is equal to $-1$ with probability
1/2, or else is uniformly distributed between 0 and 1.
$$F_X(x) = \begin{cases} 0, & x \in (-\infty, -1) \\ 1/2, & x \in [-1, 0) \\ (x+1)/2, & x \in [0, 1) \\ 1, & x \in [1, \infty) \end{cases}$$
The probability mass function (PMF) for a discrete RV $X$, denoted by $f_X(x)$,
is defined as $f_X(x) = \Pr\{X = x\}$. The probability density function (PDF) for a
continuous RV $X$, also denoted by $f_X(x)$, is defined as $f_X(x) = dF_X(x)/dx$ when the
derivative exists. Note that we use the same notation for PMF and PDF. It will be
clear from the context whether $f_X(x)$ is a PMF or a PDF.
A joint distribution of RVs $X_1, \ldots, X_n$, denoted by $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$, is defined
as $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \Pr\{X_1 \le x_1, \ldots, X_n \le x_n\}$. The joint PDFs and PMFs are
defined similarly to the case of a single RV. In particular, we can write the joint PDF
of RVs $X_1, \ldots, X_n$ as
$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \frac{\partial^n F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n}.$$
RVs $X$ and $Y$ are independent if $F_{X,Y}(x, y) = F_X(x)F_Y(y)$ for all $x$ and $y$.
When the PDFs/PMFs exist, we can write $f_{X,Y}(x, y) = f_X(x)f_Y(y)$ if $X$ and $Y$ are independent.
^3 A set $A$ is countable if we can assign a one-to-one mapping from its elements to a subset of
positive integers $\{1, 2, \ldots\}$.
Conditional PDFs/PMFs
In this section, we discuss conditional PDFs/PMFs involving two RVs in four different
cases.
1. Consider two discrete RVs $X$ and $Y$. The conditional PMF of $X$ given $Y$,
denoted by $f_{X|Y}(x|y)$, is equal to the conditional probability $\Pr\{X = x|Y = y\}$.
It follows that
$$f_{X|Y}(x|y) = \frac{\Pr\{X = x, Y = y\}}{\Pr\{Y = y\}} = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$
Pr{X x|y y Y y} =
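Finally, taking the derivative with respect to $x$ and using the definition of PDF,
we can write
$$f_{X|Y}(x|y) = \frac{\partial}{\partial x}\left(\frac{\partial F_{X,Y}(x,y)/\partial y}{\partial F_Y(y)/\partial y}\right) = \frac{\partial^2 F_{X,Y}(x, y)/\partial x\,\partial y}{\partial F_Y(y)/\partial y} = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$
Therefore, for continuous RVs $X$ and $Y$, we can write $f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}$.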
In summary, in all cases of discrete and continuous RVs $X$ and $Y$, we can write
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}, \text{ or equivalently } f_{X,Y}(x, y) = f_{X|Y}(x|y)f_Y(y). \tag{2.1}$$
It is worth noting that, if X and Y are independent, then fX|Y (x|y) = fX (x), i.e.
knowing Y does not alter the statistics of X compared to knowing X alone.
denoted by $\mathrm{var}[X]$ or $\sigma_X^2$, is defined as $\mathrm{var}[X] = \int_{S_X} (x - \overline{X})^2 f_X(x)\,dx$. (Note that we
can also write $\mathrm{var}[X] = E[(X - \overline{X})^2]$, as will be seen shortly.) The standard deviation
of $X$, denoted by $\sigma_X$, is the positive square root of the variance.
The conditional mean of RV $X$ given that RV $Y$ is equal to $y$, denoted by $E[X|Y = y]$,
is defined as $E[X|Y = y] = \int_{S_{X|Y=y}} x f_{X|Y}(x|y)\,dx$, where $S_{X|Y=y}$ is the sample space
of $X$ given that $Y = y$. The conditional variance, denoted by $\sigma^2_{X|Y=y}$, can be defined
similarly, i.e. $\sigma^2_{X|Y=y} = \int_{S_{X|Y=y}} (x - E[X|Y = y])^2 f_{X|Y}(x|y)\,dx$.
The $j$th moment of $X$ is defined as $E[X^j]$. The moment generating function (MGF)
of $X$, denoted by $\Phi_X(s)$, is defined as $\Phi_X(s) = E[e^{sX}]$. As the name suggests, the $j$th
moment of $X$ can be obtained from the relationship $E[X^j] = d^j \Phi_X(s)/ds^j\,|_{s=0}$.
Functions of RVs
Let $Y = g(X)$, where $g$ is a monotonically increasing and differentiable function. It
is known that the PDF of $Y$ is
$$f_Y(y) = \frac{f_X(x)}{dg(x)/dx}, \tag{2.2}$$
use the relationship E[Y ] = SX g(x)fX (x)dx (e.g. [?, p. 142] or [?, p. 560]). The
relationship can be extended for multiple RVs, i.e. if Y = g(X1 , . . . , Xn ), then
E[Y ] =
SXn
n
j=1
SXn
)
gn (xn )fXn (xn )dxn
(2.4)
j=1
Sum of RVs
Let $X$ and $Y$ be any two RVs. Then $X + Y$ is another RV. More generally, if $X_1, \ldots, X_n$
are RVs, so is the sum $\sum_{j=1}^{n} X_j$. Below are useful properties of the mean and the
variance of a sum of RVs. These properties can be derived from the definitions of
mean and variance.
For RVs $X$ and $Y$ and real numbers $a$ and $b$, $E[aX + bY] = aE[X] + bE[Y]$. More
generally, for RVs $X_1, \ldots, X_n$ and real numbers $a_1, \ldots, a_n$,
$$E\left[\sum_{j=1}^{n} a_j X_j\right] = \sum_{j=1}^{n} a_j E[X_j]. \tag{2.5}$$
For RVs $X$ and $Y$ and real numbers $a$ and $b$, $\mathrm{var}[aX + bY] = a^2\,\mathrm{var}[X] + b^2\,\mathrm{var}[Y]$
if $X$ and $Y$ are uncorrelated. More generally, for uncorrelated RVs $X_1, \ldots, X_n$ and
real numbers $a_1, \ldots, a_n$,
$$\mathrm{var}\left[\sum_{j=1}^{n} a_j X_j\right] = \sum_{j=1}^{n} a_j^2\,\mathrm{var}[X_j]. \tag{2.6}$$
Note that the statement in (2.5) does not require the RVs to be uncorrelated to
be valid, while the statement in (2.6) does.
fX (x)dx
E[X]
.
a
x
f (x)dx
a X
1
a
xfX (x)dx =
E[X]
.
a
2
X
.
b2
Roughly speaking, the WLLN states that, as $n$ gets large, the empirical average
$S_n$ is close to the mean $E[X]$ with high probability. There is a stronger version of the law of large numbers
called the strong law of large numbers, which states that $\Pr\{\lim_{n\to\infty} S_n = E[X]\} = 1$
(e.g. [?, p. 566] or [?, p. 258]). However, we shall use only the weak law in this
course. Finally, we present the central limit theorem without proof (e.g. [?, p. 258]).
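Theorem 2.4 Central Limit Theorem (CLT): Consider IID RVs $X_1, \ldots, X_n$
with mean $E[X]$ and variance $\sigma_X^2$. Define the average $S_n = \frac{1}{n}\sum_{j=1}^{n} X_j$. Then,
$$\lim_{n\to\infty} \Pr\left\{\frac{S_n - E[X]}{\sigma_X/\sqrt{n}} \le a\right\} = \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx.$$
Roughly speaking, the CLT states that, as $n$ gets large, the distribution of
$\frac{S_n - E[X]}{\sigma_X/\sqrt{n}}$ approaches that of a zero-mean unit-variance Gaussian RV.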
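The statement of the CLT can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes IID Uniform(0, 1) RVs (so $E[X] = 1/2$ and $\sigma_X^2 = 1/12$), and the function names, the sample size and the number of trials are illustrative choices only.

```python
import math
import random


def standardized_average(n, rng):
    """Return (S_n - E[X]) / (sigma_X / sqrt(n)) for n IID Uniform(0, 1) samples."""
    mean = 0.5                      # E[X] for Uniform(0, 1)
    sigma = math.sqrt(1.0 / 12.0)   # standard deviation of Uniform(0, 1)
    s_n = sum(rng.random() for _ in range(n)) / n
    return (s_n - mean) / (sigma / math.sqrt(n))


def gaussian_cdf(a):
    """CDF of a zero-mean unit-variance Gaussian RV evaluated at a."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def empirical_cdf_at(a, n, trials, seed=0):
    """Estimate Pr{standardized average <= a} from repeated experiments."""
    rng = random.Random(seed)
    hits = sum(standardized_average(n, rng) <= a for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for a in (-1.0, 0.0, 1.0):
        print(a, empirical_cdf_at(a, n=30, trials=20000), gaussian_cdf(a))
```

With these assumed settings, the empirical values should be close to the Gaussian CDF; the agreement typically improves as `n` and `trials` grow.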
2.2
We shall assume for general discussion that the signals are complex. Initially, when
we discuss baseband communications, this assumption is not required since we are
dealing with real signals. However, when we discuss passband communications, it is
convenient to consider complex signals.
The energy of a complex signal $u(t)$ is defined as $\int_{-\infty}^{\infty} |u(t)|^2\,dt$. We shall focus on
signals whose energies are finite. Such signals are called $L_2$ signals. Two $L_2$ signals
$u(t)$ and $v(t)$ are $L_2$-equivalent if their difference has zero energy, i.e.
$\int_{-\infty}^{\infty} |u(t) - v(t)|^2\,dt = 0$.
We focus on L2 signals partly because L2 signals always have Fourier transforms
and their inverse transforms always exist in the L2 -equivalent sense [?, p. 118]. For
practical purposes, if two signals are L2 -equivalent, they are considered the same.
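More specifically, let $u(t)$ be an $L_2$ signal. The Fourier transform of $u(t)$, denoted
by $\mathcal{F}\{u(t)\}$ or $\hat{u}(f)$, is equal to
$$\hat{u}(f) = \mathcal{F}\{u(t)\} = \int_{-\infty}^{\infty} u(t)e^{-i2\pi f t}\,dt. \tag{2.7}$$
For an $L_2$ signal $u(t)$ in the time domain, $\hat{u}(f)$ is an $L_2$ signal in the frequency
domain. The inverse Fourier transform of $\hat{u}(f)$, denoted by $\mathcal{F}^{-1}\{\hat{u}(f)\}$ or $u_{\mathrm{inv}}(t)$, is
equal to
$$u_{\mathrm{inv}}(t) = \mathcal{F}^{-1}\{\hat{u}(f)\} = \int_{-\infty}^{\infty} \hat{u}(f)e^{i2\pi f t}\,df. \tag{2.8}$$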
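Property | Time domain | Frequency domain
linearity | $a u(t) + b v(t)$ | $a\hat{u}(f) + b\hat{v}(f)$
conjugation | $u^*(t)$ | $\hat{u}^*(-f)$
time/frequency duality | $\hat{u}(t)$ | $u(-f)$
time shift | $u(t - t_0)$ | $e^{-i2\pi f t_0}\hat{u}(f)$
frequency shift | $e^{i2\pi f_0 t}u(t)$ | $\hat{u}(f - f_0)$
scaling ($T > 0$) | $u(t/T)$ | $T\hat{u}(fT)$
differentiation | $du(t)/dt$ | $i2\pi f\hat{u}(f)$
convolution | $\int_{-\infty}^{\infty} u(\tau)v(t - \tau)\,d\tau$ | $\hat{u}(f)\hat{v}(f)$
correlation | $\int_{-\infty}^{\infty} u(\tau)v^*(\tau - t)\,d\tau$ | $\hat{u}(f)\hat{v}^*(f)$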
From the definition of the Fourier transform and its inverse, note that we have the
following identities in the special cases when $t = 0$ and $f = 0$: $u(0) = \int_{-\infty}^{\infty} \hat{u}(f)\,df$ and
$\hat{u}(0) = \int_{-\infty}^{\infty} u(t)\,dt$. Using the first identity and the correlation property, we obtain
the Parseval theorem
$$\int_{-\infty}^{\infty} u(t)v^*(t)\,dt = \int_{-\infty}^{\infty} \hat{u}(f)\hat{v}^*(f)\,df. \tag{2.9}$$
If we set $v(t) = u(t)$ in the Parseval theorem, we obtain the energy equation
$$\int_{-\infty}^{\infty} |u(t)|^2\,dt = \int_{-\infty}^{\infty} |\hat{u}(f)|^2\,df. \tag{2.10}$$
The quantity $|\hat{u}(f)|^2$ is called the spectral density of $u(t)$, which describes the
amount of energy contained per unit frequency around $f$.
^4 $\mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x}$.
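1. $\int_{-\infty}^{\infty} \delta(t)u(t)\,dt = u(0)$
2. $\delta(t) = \frac{d}{dt}s(t)$
3. $\delta(t) \leftrightarrow 1$
4. $1 \leftrightarrow \delta(f)$
By assuming the above properties and manipulating $\delta(t)$ as if it were an ordinary
function, we can carry out most analysis in digital communications.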
Example 2.2 The signal $\cos(2\pi f_c t)$ is not an $L_2$ signal. Its Fourier transform can
be evaluated using the above properties of the unit impulse. In particular, from
the Fourier transform pair $e^{i2\pi f_c t}u(t) \leftrightarrow \hat{u}(f - f_c)$, setting $u(t) = 1$
yields $e^{i2\pi f_c t} \leftrightarrow \delta(f - f_c)$. Similarly, from $e^{-i2\pi f_c t}u(t) \leftrightarrow \hat{u}(f + f_c)$, we can write
$e^{-i2\pi f_c t} \leftrightarrow \delta(f + f_c)$. It follows that
$$\cos(2\pi f_c t) = \frac{1}{2}e^{-i2\pi f_c t} + \frac{1}{2}e^{i2\pi f_c t} \leftrightarrow \frac{1}{2}\delta(f + f_c) + \frac{1}{2}\delta(f - f_c),$$
which gives us the Fourier transform of $\cos(2\pi f_c t)$.
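A discrete analogue of Example 2.2 can be observed with a discrete Fourier transform (DFT). The sketch below is only an illustration under stated assumptions: it samples a cosine that completes an integer number of cycles over the window, uses a direct (slow) DFT normalized by the number of samples, and the names `dft`, `N` and `FC_BIN` are not from the original notes. In this setting the two impulses of weight 1/2 at $\pm f_c$ show up as two nonzero DFT bins of value 1/2.

```python
import cmath
import math


def dft(samples):
    """Direct DFT normalized by the number of samples (O(n^2), fine for small n)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n
        for k in range(n)
    ]


N = 64        # number of samples in the window (assumed value)
FC_BIN = 5    # the cosine completes 5 full cycles over the window (assumed value)

x = [math.cos(2 * math.pi * FC_BIN * t / N) for t in range(N)]
X = dft(x)

# Only two bins are (numerically) nonzero: k = FC_BIN and k = N - FC_BIN,
# where bin N - FC_BIN plays the role of the negative frequency -FC_BIN.
peaks = [k for k in range(N) if abs(X[k]) > 1e-9]

if __name__ == "__main__":
    print(peaks, [round(abs(X[k]), 6) for k in peaks])
```

The continuous-time impulses themselves cannot be represented in a finite DFT; the example only mirrors the structure of the result.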
For an $L_2$ signal $u(t)$ that is time-limited to the time interval $\left[-\frac{T}{2}, \frac{T}{2}\right]$, the following
set of Fourier series coefficients exists.^6
$$\hat{u}_k = \frac{1}{T}\int_{-T/2}^{T/2} u(t)e^{-i2\pi kt/T}\,dt, \quad k \in \mathbb{Z}. \tag{2.11}$$
In addition, the following signal reconstructed from the above coefficients is $L_2$-equivalent to $u(t)$ [?, p. 110].
$$u_{\mathrm{rec}}(t) = \sum_{k=-\infty}^{\infty} \hat{u}_k e^{i2\pi kt/T}, \quad t \in \left[-\frac{T}{2}, \frac{T}{2}\right]. \tag{2.12}$$
In addition, if the signal $u(t)$ is continuous, then the reconstruction is perfect, i.e.
$u_{\mathrm{rec}}(t)$ and $u(t)$ are exactly the same.
^5 $s(t) = \begin{cases} 0, & t < 0 \\ 1, & t \ge 0 \end{cases}$
^6 $\mathbb{Z}$ denotes the set of all integers, while $\mathbb{Z}^+$ denotes the set of all nonnegative integers.
2.3 Vector Spaces
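A vector space $V$ is a set of elements defined over a field $F$ according to the following
vector space axioms. The elements of the field are called scalars. The elements of a
vector space are called vectors.
Axiom 2.3 (Vector space axioms): For all $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$ and $\alpha, \beta \in F$, we have
the following properties.
1. Closure: $\mathbf{u} + \mathbf{v} \in V$, $\alpha\mathbf{u} \in V$
2. Axioms for addition
Commutativity: $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$
Associativity: $(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$
Existence of identity: There exists an element in $V$, denoted by $\mathbf{0}$, such that
$\mathbf{u} + \mathbf{0} = \mathbf{u}$ for all $\mathbf{u} \in V$.
Existence of inverse: For each $\mathbf{u} \in V$, there exists an element in $V$, denoted by
$-\mathbf{u}$, such that $\mathbf{u} + (-\mathbf{u}) = \mathbf{0}$.
3. Axioms for multiplication
Associativity: $(\alpha\beta)\mathbf{u} = \alpha(\beta\mathbf{u})$
Unit multiplication: $1\mathbf{u} = \mathbf{u}$
Distributivity: $\alpha(\mathbf{u} + \mathbf{v}) = \alpha\mathbf{u} + \alpha\mathbf{v}$, $(\alpha + \beta)\mathbf{u} = \alpha\mathbf{u} + \beta\mathbf{u}$
In this course, we consider three different scalar fields: $\mathbb{R}$, $\mathbb{C}$, and $\mathbb{F}_2$. A vector
space with scalar field $\mathbb{R}$ is called a real vector space. A vector space with scalar field
$\mathbb{C}$ is called a complex vector space. A vector space with scalar field $\mathbb{F}_2$ is called a
binary vector space.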
defined as $\langle \mathbf{u}, \mathbf{v} \rangle = \sum_{j=1}^{n} u_j v_j$.^8 The corresponding norm is $\|\mathbf{u}\| = \sqrt{\sum_{j=1}^{n} u_j^2}$. A real
linear vector space with a defined inner product is called a Euclidean space. Therefore,
$\mathbb{R}^n$ is a Euclidean space.
The vector space $\mathbb{C}^n$ consisting of all complex $n$-tuples can be made an inner
product space by defining the inner product of $\mathbf{u} = (u_1, \ldots, u_n)$ and $\mathbf{v} = (v_1, \ldots, v_n)$
as $\langle \mathbf{u}, \mathbf{v} \rangle = \sum_{j=1}^{n} u_j v_j^*$. The corresponding norm is $\|\mathbf{u}\| = \sqrt{\sum_{j=1}^{n} |u_j|^2}$.
$$\mathbf{u}_{|\mathbf{v}} = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{v}\|^2}\mathbf{v}. \tag{2.13}$$
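From the definition, we can verify that the difference between $\mathbf{u}$ and $\mathbf{u}_{|\mathbf{v}}$ is orthogonal to $\mathbf{v}$ as follows.
$$\langle \mathbf{u} - \mathbf{u}_{|\mathbf{v}}, \mathbf{v} \rangle = \langle \mathbf{u}, \mathbf{v} \rangle - \left\langle \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{v}\|^2}\mathbf{v}, \mathbf{v} \right\rangle = \langle \mathbf{u}, \mathbf{v} \rangle - \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{v}\|^2}\langle \mathbf{v}, \mathbf{v} \rangle = \langle \mathbf{u}, \mathbf{v} \rangle - \langle \mathbf{u}, \mathbf{v} \rangle = 0$$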
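Let $\mathbf{u}_{\perp\mathbf{v}} = \mathbf{u} - \mathbf{u}_{|\mathbf{v}}$. Since $\langle \mathbf{u}_{\perp\mathbf{v}}, \mathbf{v} \rangle = 0$ and $\mathbf{u}_{|\mathbf{v}}$ is a scalar multiple of $\mathbf{v}$, it follows
that $\mathbf{u}_{\perp\mathbf{v}}$ and $\mathbf{u}_{|\mathbf{v}}$ are orthogonal, i.e. $\langle \mathbf{u}_{\perp\mathbf{v}}, \mathbf{u}_{|\mathbf{v}} \rangle = 0$, and $\mathbf{u}$ can be expressed as a
sum of two orthogonal components: $\mathbf{u} = \mathbf{u}_{\perp\mathbf{v}} + \mathbf{u}_{|\mathbf{v}}$, as illustrated in figure 2.4 for
Euclidean space $\mathbb{R}^2$.
The following theorem states that $\mathbf{u}_{|\mathbf{v}}$ is the best approximation of $\mathbf{u}$ among the
vectors in the subspace spanned by $\mathbf{v}$ based on the square error.
Theorem 2.6 $\langle \mathbf{u}, \mathbf{v} \rangle/\|\mathbf{v}\|^2 = \arg\min_{\alpha \in \mathbb{R}} \|\mathbf{u} - \alpha\mathbf{v}\|^2$.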
$$\langle \boldsymbol{\phi}_j, \boldsymbol{\phi}_k \rangle = \begin{cases} 0, & j \ne k \\ 1, & j = k \end{cases} \tag{2.14}$$
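Example 2.6 Consider the Euclidean space $\mathbb{R}^2$. One possible orthonormal basis is
$\left\{\begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}\right\}$. A different orthonormal basis is
$\left\{\begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}\right\}$.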
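that $\mathbf{u}_{|S} = \sum_{j=1}^{n} \langle \mathbf{u}, \boldsymbol{\phi}_j \rangle \boldsymbol{\phi}_j$, i.e. the projection of $\mathbf{u}$ on $S$ is the summation of one-dimensional projections of $\mathbf{u}$ on all the basis vectors. One can easily check that the
corresponding $\mathbf{u}_{\perp S}$ is orthogonal to all vectors in $S$.
If $\mathbf{u}$ is itself in $S$, then $\mathbf{u}_{|S} = \mathbf{u}$ and $\mathbf{u}$ can be expressed as $\mathbf{u} = \sum_{j=1}^{n} \langle \mathbf{u}, \boldsymbol{\phi}_j \rangle \boldsymbol{\phi}_j$.
Such an expression for $\mathbf{u}$ in terms of a linear combination of orthonormal basis vectors
is called an orthonormal expansion of $\mathbf{u}$.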
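Given a set of orthogonal vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots$, we can create an orthonormal set by
normalizing them. The resultant normalized vectors are $\mathbf{v}_1/\|\mathbf{v}_1\|, \mathbf{v}_2/\|\mathbf{v}_2\|, \ldots$. If
a given set of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots$ is linearly independent but is not orthogonal, then
we can use the Gram-Schmidt procedure to create an orthonormal set $\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \ldots$ that
spans the same vector space as follows.
Gram-Schmidt procedure:
1. Set $\boldsymbol{\phi}_1 = \mathbf{v}_1/\|\mathbf{v}_1\|$.
2. At each step $j \in \{2, 3, \ldots\}$, subtract from $\mathbf{v}_j$ its projection on the subspace
spanned by $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_{j-1}$ to create an intermediate result $\boldsymbol{\theta}_j$, i.e.
$$\boldsymbol{\theta}_j = \mathbf{v}_j - \sum_{k=1}^{j-1} \langle \mathbf{v}_j, \boldsymbol{\phi}_k \rangle \boldsymbol{\phi}_k.$$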
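The Gram-Schmidt procedure can be sketched in a few lines of code. This is an illustrative sketch for real vectors in $\mathbb{R}^n$ only; it assumes that each intermediate result is normalized to unit length to obtain the next orthonormal vector (the usual final step of the procedure), and the names `inner` and `gram_schmidt` as well as the tolerance are hypothetical choices.

```python
import math


def inner(u, v):
    """Inner product of two real vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, tol=1e-12):
    """Return an orthonormal list spanning the same space as `vectors`.

    Assumes the input vectors are real and linearly independent.
    """
    basis = []
    for v in vectors:
        # Subtract the projections of v on the orthonormal vectors found so far.
        residual = list(v)
        for phi in basis:
            coeff = inner(v, phi)
            residual = [r - coeff * p for r, p in zip(residual, phi)]
        norm = math.sqrt(inner(residual, residual))
        if norm < tol:
            raise ValueError("input vectors are not linearly independent")
        # Assumed step: normalize the intermediate result to unit length.
        basis.append([r / norm for r in residual])
    return basis


if __name__ == "__main__":
    for phi in gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]):
        print([round(c, 4) for c in phi])
```

For numerically ill-conditioned inputs, a modified Gram-Schmidt variant (projecting the running residual instead of the original vector) is often preferred; the version above follows the procedure as stated.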
2.4
Recall that a random variable (RV) is a mapping from the sample space $S$ to the
set of real numbers $\mathbb{R}$. In comparison, a stochastic process or random process is a
mapping from the sample space $S$ to the set of real-valued functions called sample
functions. We can denote a stochastic process as $\{X(t), t \in \mathbb{R}\}$ to emphasize that
it consists of a set of RVs, one for each time $t$. However, for convenience, we shall
simply use $X(t)$ instead of $\{X(t), t \in \mathbb{R}\}$ to denote a random process in this course.
A random process is strict sense stationary (SSS) if, for all values of $n \in \mathbb{Z}^+$,
$t_1, \ldots, t_n$, and $\tau \in \mathbb{R}$, the joint PDF satisfies
$$f_{X(t_1),\ldots,X(t_n)}(x_1, \ldots, x_n) = f_{X(t_1+\tau),\ldots,X(t_n+\tau)}(x_1, \ldots, x_n)$$
for all $x_1, \ldots, x_n$. Roughly speaking, the statistics of the random process look the
same at all times.
Let $E[X(t)]$ and $\overline{X}(t)$ denote the mean of the random process $X(t)$ at time $t$. The
covariance function, denoted by $K_X(t_1, t_2)$, is defined as
$$K_X(t_1, t_2) = E\left[\left(X(t_1) - \overline{X}(t_1)\right)\left(X(t_2) - \overline{X}(t_2)\right)\right].$$
For the purpose of analyzing communication systems, it is usually sufficient to
assume a stationary condition that is weaker than SSS. In particular, a random process
$X(t)$ is wide-sense stationary (WSS) if, for all $t_1, t_2 \in \mathbb{R}$,
$$E[X(t_1)] = E[X(0)] \text{ and } K_X(t_1, t_2) = K_X(t_1 - t_2, 0).$$
Roughly speaking, for a WSS random process, the first and second order statistics
look the same at all times.
Since the covariance function $K_X(t_1, t_2)$ of a WSS random process only depends on
the time difference $t_1 - t_2$, we usually write $K_X(t_1, t_2)$ as a function with one argument
$K_X(t_1 - t_2)$. Note that a SSS random process is always WSS, but the converse is not
always true.
Define the correlation function of a WSS random process $X(t)$ as
$$R_X(\tau) = E[X(t)X(t - \tau)].$$
The power spectral density (PSD), denoted by $S_X(f)$, is defined as the Fourier transform of $R_X(\tau)$, i.e. $R_X(\tau) \leftrightarrow S_X(f)$. It is possible to show that $S_X(f)$ is real and
non-negative, and can be thought of as the power per unit frequency at $f$ (e.g. [?, p.
68]).
A complex-valued random process $Z(t)$ is defined as $Z(t) = X(t) + iY(t)$ where
$X(t)$ and $Y(t)$ are random processes. The joint PDF of complex-valued RVs $Z(t_1)$,
$\ldots$, $Z(t_n)$ is given by the joint PDF of their components
$$f_{X(t_1),\ldots,X(t_n),Y(t_1),\ldots,Y(t_n)}(x_1, \ldots, x_n, y_1, \ldots, y_n).$$
The covariance function of a complex-valued random process $Z(t)$ is defined as
$$K_Z(t_1, t_2) = \frac{1}{2}E\left[\left(Z(t_1) - \overline{Z}(t_1)\right)\left(Z(t_2) - \overline{Z}(t_2)\right)^*\right],$$
where the scaling factor 1/2 is introduced for convenience in the analysis. Finally, we
can extend the definition of SSS, WSS, and PSD for complex-valued random processes
in a straightforward fashion.
Gaussian Processes
A set of RVs $X_1, \ldots, X_n$ are zero-mean jointly Gaussian if there is a set of IID zero-mean unit-variance Gaussian RVs $N_1, \ldots, N_m$ such that, for each $k \in \{1, \ldots, n\}$,
$X_k$ can be expressed as a linear combination of $N_1, \ldots, N_m$, i.e. $X_k = \sum_{j=1}^{m} \alpha_{k,j} N_j$.
For convenience, define a random vector $\mathbf{X} = (X_1, \ldots, X_n)$ and a random
vector $\mathbf{N} = (N_1, \ldots, N_m)$. In addition, define a matrix
$$\mathbf{A} = \begin{bmatrix} \alpha_{1,1} & \cdots & \alpha_{1,m} \\ \vdots & \ddots & \vdots \\ \alpha_{n,1} & \cdots & \alpha_{n,m} \end{bmatrix}$$
so that we can write $\mathbf{X} = \mathbf{A}\mathbf{N}$.
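Let $\mathbf{K}_X$ be the covariance matrix for random vector $\mathbf{X}$, i.e.
$$\mathbf{K}_X = \begin{bmatrix} E[(X_1 - \overline{X}_1)(X_1 - \overline{X}_1)] & \cdots & E[(X_1 - \overline{X}_1)(X_n - \overline{X}_n)] \\ \vdots & \ddots & \vdots \\ E[(X_n - \overline{X}_n)(X_1 - \overline{X}_1)] & \cdots & E[(X_n - \overline{X}_n)(X_n - \overline{X}_n)] \end{bmatrix}.$$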
The PDF of a zero-mean jointly Gaussian random vector $\mathbf{X}$ is
$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\sqrt{\det \mathbf{K}_X}} e^{-\frac{1}{2}\mathbf{x}^T \mathbf{K}_X^{-1}\mathbf{x}}. \tag{2.15}$$
The above PDF can be derived from the PDF of $\mathbf{N}$ together with (2.2). Note
that, for an IID zero-mean unit-variance Gaussian random vector $\mathbf{N}$, the PDF
has the following simple form.
$$f_{\mathbf{N}}(\mathbf{n}) = \frac{1}{(2\pi)^{m/2}} e^{-\frac{1}{2}\mathbf{n}^T\mathbf{n}} \tag{2.16}$$
(2)n/2
1
1
1
e 2 (x ) KX (x ) .
det KX
(2.17)
Some important properties of jointly Gaussian random vector X are listed below [?, chp. 7].
1. A linear transformation of X yields another jointly Gaussian random vector.
2. The PDF of X is fully determined by the mean and the covariance matrix
KX , which are the first-order and second-order statistics.
3. Jointly Gaussian RVs that are uncorrelated are independent.
We are now ready to define a Gaussian process. We say that $X(t)$ is a zero-mean
Gaussian process if, for all $n \in \mathbb{Z}^+$ and $t_1, \ldots, t_n \in \mathbb{R}$, $(X(t_1), \ldots, X(t_n))$ is a zero-mean jointly Gaussian random vector. In addition, we say that $X'(t)$ is a Gaussian
process if it is the sum of a zero-mean Gaussian process $X(t)$ and a deterministic
function $\mu(t)$.
Some important properties of Gaussian process $X'(t)$ are listed below [?, chp. 7].
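$$\begin{aligned} K_Y(\tau) &= E\left[\left(\int_{-\infty}^{\infty} h(\alpha)X(\tau - \alpha)\,d\alpha\right)\left(\int_{-\infty}^{\infty} h(\beta)X(-\beta)\,d\beta\right)\right] \\ &= E\left[\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(\alpha)h(\beta)X(\tau - \alpha)X(-\beta)\,d\alpha\,d\beta\right] \\ &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(\alpha)h(\beta)K_X(\tau - \alpha + \beta)\,d\alpha\,d\beta \\ &= \int_{-\infty}^{\infty} h(\beta)\left(h(\tau + \beta) * K_X(\tau + \beta)\right)d\beta \\ &= h(\tau) * h(-\tau) * K_X(\tau) \end{aligned} \tag{2.18}$$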
2.5 Practice Problems
Problem 2.3 (Sum of independent Gaussian RVs): Let $X$ and $Y$ be two independent Gaussian RVs. Let $\overline{X}$ and $\overline{Y}$ denote their means, and $\sigma_X^2$ and $\sigma_Y^2$ denote
their variances. Using the fact that the MGF of a Gaussian RV with mean $\mu$ and
variance $\sigma^2$ is given by $\Phi(s) = e^{s\mu + s^2\sigma^2/2}$, argue that the sum $X + Y$ is another Gaussian RV with mean $\overline{X} + \overline{Y}$ and variance $\sigma_X^2 + \sigma_Y^2$. (HINT: Find the MGF of the RV
$X + Y$.)
s=0
Problem 2.5 (Uncorrelated and dependent RVs): Verify that the RVs $X$ and
$Y$ with the following joint PMF are uncorrelated but are not independent.
$$f_{X,Y}(1, 0) = f_{X,Y}(-1, 0) = f_{X,Y}(0, 1) = f_{X,Y}(0, -1) = 1/4$$
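A quick numerical sanity check of the claim in Problem 2.5 is sketched below. It is not a substitute for the requested argument. It assumes that the four mass points are $(1, 0)$, $(-1, 0)$, $(0, 1)$ and $(0, -1)$, each with probability 1/4; the helper names are hypothetical.

```python
from fractions import Fraction

# Assumed joint PMF: four equally likely points on the axes.
pmf = {
    (1, 0): Fraction(1, 4),
    (-1, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (0, -1): Fraction(1, 4),
}


def expect(g):
    """Expectation of g(X, Y) under the joint PMF."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())


def marginal_x(x0):
    return sum(p for (x, _), p in pmf.items() if x == x0)


def marginal_y(y0):
    return sum(p for (_, y), p in pmf.items() if y == y0)


mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
covariance = expect(lambda x, y: (x - mean_x) * (y - mean_y))

# Independence requires f_{X,Y}(x, y) = f_X(x) f_Y(y) at every point.
independent = all(
    pmf.get((x, y), Fraction(0)) == marginal_x(x) * marginal_y(y)
    for x in (-1, 0, 1)
    for y in (-1, 0, 1)
)

if __name__ == "__main__":
    print("covariance:", covariance)
    print("independent:", independent)
```

The covariance evaluates to zero, while the product of marginals at $(0, 0)$ is 1/4 although the joint PMF is zero there, so the independence test fails.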
Problem 2.6 (Sample mean and sample variance): Consider $n$ IID RVs $X_1, \ldots, X_n$.
Let $\mu$ and $\sigma^2$ denote the mean and the variance of each $X_j$ respectively.
(a) The quantity $S_\mu = \frac{1}{n}\sum_{j=1}^{n} X_j$ is known as the sample mean. Show that $E[S_\mu] = \mu$.
(b) The quantity $S_{\sigma^2} = \frac{1}{n-1}\sum_{j=1}^{n} (X_j - S_\mu)^2$ is known as the sample variance. Show
that $E[S_{\sigma^2}] = \sigma^2$.
Chapter 3
Source Coding
In this chapter, we shall consider the problem of source coding. With respect to
the block diagram of the point-to-point digital communication system in figure 1.1,
we shall discuss in detail the operations of the source encoder and decoder. We
first focus on source coding for discrete sources. For continuous sources, we shall
discuss sampling and quantization that convert a continuous source to a discrete one.
During these discussions, we shall introduce basic definitions of several quantities in
information theory, including entropy and mutual information.
3.1 Binary Source Code for Discrete Sources
$$L(C) = \sum_{x \in \mathcal{X}} f_X(x)\,l(C(x)) \tag{3.1}$$
Example 3.1 Let $\mathcal{X} = \{a, b, c, d\}$ with the PMF $\{1/2, 1/4, 1/8, 1/8\}$, i.e. $f_X(a) = 1/2, \ldots, f_X(d) = 1/8$. Consider the code $C = \{0, 10, 110, 111\}$, where the codewords
are for $a$, $b$, $c$, and $d$ respectively. It follows that $L(C) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75$
bits.
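The computation in Example 3.1 can be reproduced with a short script. This is a minimal sketch with hypothetical function names; the PMF and the code are the ones from the example, and exact fractions are used to avoid rounding. The decoder relies on the fact that no codeword of this particular code is a prefix of another.

```python
from fractions import Fraction


def expected_length(pmf, code):
    """Expected codeword length: sum over symbols of f_X(x) * l(C(x))."""
    return sum(p * len(code[symbol]) for symbol, p in pmf.items())


def encode(message, code):
    """Concatenate the codewords of the symbols in the message."""
    return "".join(code[symbol] for symbol in message)


def decode(bits, code):
    """Decode a bit string; valid because no codeword is a prefix of another."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    return "".join(symbols)


pmf = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

if __name__ == "__main__":
    print(float(expected_length(pmf, code)))   # expected codeword length in bits
    print(encode("abad", code), decode(encode("abad", code), code))
```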
One fundamental problem that we shall consider is the design of a code C that
minimizes the expected codeword length L(C) in (3.1).
Figure 3.3: Systematic construction of a code tree from the codeword lengths $l_1 \le \ldots \le l_M$.
loss of generality, assume that $l_1 \le \ldots \le l_M$. Start with a full binary code tree of
depth $l_M$, i.e. $l_M$ branches between the root and each leaf. Pick any node at depth $l_1$
to be the first codeword leaf. At this point, all nodes at depth $l_1$ are still available
except for a fraction $2^{-l_1}$ of nodes stemming from the first codeword. Next, pick any
node at depth $l_2$ to be the second codeword leaf. At this point, all nodes at depth
$l_2$ are still available except for a fraction $2^{-l_1} + 2^{-l_2}$ of nodes stemming from the
first and second codewords, as illustrated in figure 3.3.
Repeat the process until the last codeword. Since the Kraft inequality is satisfied,
i.e. $\sum_{m=1}^{M} 2^{-l_m} \le 1$, there is always at least a fraction $2^{-l_{j+1}}$ of nodes available at
depth $l_{j+1}$ after each step $j$, $j \in \{1, \ldots, M - 1\}$, of the process. This means there
is never a problem of finding a free leaf to use as a codeword.
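The construction described above can be sketched in code. The text allows picking any free node at each depth; the sketch below makes the specific (assumed) choice of always taking the leftmost free node, which yields what is often called a canonical code. Function names are hypothetical, and codeword lengths are assumed to be positive integers.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality: sum of 2^(-l_m)."""
    return sum(Fraction(1, 2 ** l) for l in lengths)


def prefix_free_code(lengths):
    """Build a prefix-free code with the given codeword lengths.

    Lengths are processed in nondecreasing order; at each depth the
    leftmost node that does not stem from an earlier codeword is used.
    """
    if kraft_sum(lengths) > 1:
        raise ValueError("Kraft inequality is violated")
    codewords = []
    next_node = 0      # index of the leftmost free node at the current depth
    depth = 0
    for l in sorted(lengths):
        next_node <<= (l - depth)          # descend to depth l
        depth = l
        codewords.append(format(next_node, "0{}b".format(l)))
        next_node += 1                     # the node just used is no longer free
    return codewords


if __name__ == "__main__":
    print(prefix_free_code([1, 2, 3, 3]))
    print(prefix_free_code([2, 2, 3, 3, 3]))
```

Because the lengths are sorted, each codeword is numerically larger than the prefix of the same length of any later codeword, so no codeword is a prefix of another.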
It can be shown that the Kraft inequality also holds for any uniquely decodable
code. For the proof of this fact, see [?, p. 115]. Therefore, given a uniquely decodable
code, we can use its codeword lengths to construct a prefix-free code.
We have so far described a property that the codeword lengths of a prefix-free code
must satisfy, i.e. the Kraft inequality. We shall now present a procedure to construct
a prefix-free code with the minimum expected codeword length. This procedure is
called the Huffman algorithm. The resultant code is called a Huffman code.
[Figure 3.4: Huffman code tree for the PMF {0.35, 0.2, 0.2, 0.15, 0.1}, with codewords 00, 10, 11, 010, and 011 respectively.]
Huffman Algorithm
Suppose that the alphabet is $\mathcal{X} = \{a_1, \ldots, a_M\}$ with the PMF $\{p_1, \ldots, p_M\}$, i.e.
$f_X(a_1) = p_1, \ldots, f_X(a_M) = p_M$. For a prefix-free code, two codewords are siblings of
each other if they differ only in the last bit. The Huffman algorithm is an iterative
process that proceeds as follows.
1. In each step, take the two least likely symbols, say with probabilities $q_1$ and $q_2$,
and make them siblings. The pair of siblings is regarded as one symbol with
probability $q_1 + q_2$ in the next step.
2. Repeat the process until only one symbol remains. The resultant code tree
yields an optimal prefix-free code.
Example 3.4 Suppose that $M = 5$ and the PMF is $\{0.35, 0.2, 0.2, 0.15, 0.1\}$. The
Huffman code tree is shown in figure 3.4. The corresponding set of codewords is $C =
\{00, 10, 11, 010, 011\}$. For this code, $L(C) = (0.35 + 0.2 + 0.2) \cdot 2 + (0.15 + 0.1) \cdot 3 = 2.25$
bits.
Note that a Huffman code is not unique. For example, we can arbitrarily interchange bit 0 and bit 1 in each branching step of the code tree without changing the
value of $L(C)$.
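The Huffman algorithm can be sketched with a priority queue. The following is a minimal illustration, not the notes' own implementation: it assumes at least two symbols, uses exact fractions for the probabilities of Example 3.4, and the symbol names `a1` to `a5` and the function name are hypothetical. Since a Huffman code is not unique, the codewords produced may differ from those in figure 3.4, but the codeword lengths and the expected length agree.

```python
import heapq
import itertools
from fractions import Fraction


def huffman_code(pmf):
    """Return a dict mapping each symbol to its codeword (a bit string).

    Each heap entry holds (probability, tie-breaker, partial code), where
    the partial code maps the symbols in that subtree to their bits so far.
    """
    tie = itertools.count()   # tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {symbol: ""}) for symbol, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two least likely (super)symbols and make them siblings.
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in code1.items()}
        merged.update({s: "1" + w for s, w in code2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]


example_pmf = {
    "a1": Fraction(35, 100),
    "a2": Fraction(20, 100),
    "a3": Fraction(20, 100),
    "a4": Fraction(15, 100),
    "a5": Fraction(10, 100),
}
example_code = huffman_code(example_pmf)
example_length = sum(p * len(example_code[s]) for s, p in example_pmf.items())

if __name__ == "__main__":
    print(example_code)
    print(float(example_length))
```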
This section provides a proof of optimality for the Huffman algorithm. The proof
approach is based on [?, p. 123]. We start by proving some useful facts, and then go
on to the optimality proof.
Lemma 3.1 An optimal code must satisfy the following properties.
1. If $p_j > p_k$, then $l_j \le l_k$.
2. The two longest codewords have the same length.
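$$L(C_M) = \sum_{m=1}^{M} p_m l_m = \sum_{m=1}^{M-2} p_m l_m + p_{M-1} l_{M-1} + p_M l_M.$$
Let the primed quantities refer to the reduced code $C'_{M-1}$ in which the two least likely symbols are merged, i.e. $p'_{M-1} = p_{M-1} + p_M$, $l'_{M-1} = l_M - 1$, and $l'_m = l_m$ for $m \le M - 2$. Since $l_{M-1} = l_M$,
$$L(C_M) = \sum_{m=1}^{M-2} p_m l_m + p'_{M-1} l_M = \left(\sum_{m=1}^{M-2} p_m l'_m\right) + p'_{M-1}(l'_{M-1} + 1) = L(C'_{M-1}) + p_{M-1} + p_M.$$
The above relationship tells us that minimizing $L(C_M)$ can be done through minimizing $L(C'_{M-1})$. Therefore, we can reduce the problem of finding $M$ codewords to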
[Figure 3.5: Huffman code tree for three equally likely symbols (PMF {1/3, 1/3, 1/3}).]
as $\overline{L}(C) = L(C)/n$, where $n$ is the number of symbols per block.
Example 3.5 Suppose that $\mathcal{X} = \{a, b, c\}$ and the symbols are equally likely. The
corresponding Huffman code tree is shown in figure 3.5. For this code, we can compute
$\overline{L}(C) = \frac{1}{3} \cdot 1 + 2 \cdot \frac{1}{3} \cdot 2 = \frac{5}{3} \approx 1.67$ bit/symbol.
Suppose now that a block of two symbols is encoded at a time. Then the
new alphabet is $\mathcal{X} = \{aa, ab, ac, ba, bb, bc, ca, cb, cc\}$ with equally likely symbols. The
corresponding Huffman code tree is shown in figure 3.6. For this code, we can compute
$\overline{L}(C) = \frac{1}{2}\left(7 \cdot \frac{1}{9} \cdot 3 + 2 \cdot \frac{1}{9} \cdot 4\right) = \frac{29}{18} \approx 1.61$ bit/symbol. Note the decrease in the
value of $\overline{L}(C)$.
The process can be repeated to show that, for $n = 1, 2, 3, 4, 5, \ldots$, the corresponding values of $\overline{L}(C)$
are $1.67, 1.61, 1.60, 1.60, 1.59, \ldots$. The specific Huffman codes are
not presented here.
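The per-symbol values quoted in Example 3.5 can be explored numerically. The sketch below is an illustration under stated assumptions: it does not build the codewords, but uses the fact that each merging step of the Huffman algorithm adds the merged probability to the expected codeword length, so summing the merged probabilities gives $L(C)$. The function names are hypothetical.

```python
import heapq
import math
from fractions import Fraction


def huffman_expected_length(probabilities):
    """Expected codeword length of a Huffman code for the given PMF.

    Every merge of the two least likely entries lengthens all codewords
    below the merged node by one bit, i.e. adds the merged probability.
    """
    heap = list(probabilities)
    heapq.heapify(heap)
    total = Fraction(0)
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def per_symbol_length(alphabet_size, n):
    """Expected bits per source symbol when blocks of n equally likely symbols are coded."""
    block_count = alphabet_size ** n
    block_pmf = [Fraction(1, block_count)] * block_count
    return huffman_expected_length(block_pmf) / n


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, float(per_symbol_length(3, n)))
    print("log2(3) =", math.log2(3))
```

The last line prints $\log_2 3$ for comparison with the per-symbol values as $n$ grows.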
From the above example, there seems to be a lower bound on the value of $\overline{L}(C)$;
this bound is approached as we increase $n$. In the next section, we define the entropy
of a discrete random variable (RV), which is the quantity that serves as the limit
value of $\overline{L}(C)$
as $n$ grows large.
3.2 Entropy of Discrete Random Variables
Consider a discrete RV $X$ with the alphabet $\mathcal{X}$ and the PMF $f_X(x)$. The entropy of
$X$, denoted by $H(X)$, is defined as
$$H(X) = -\sum_{x \in \mathcal{X}} f_X(x) \log f_X(x). \tag{3.2}$$
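The entropy definition translates directly into code. This minimal sketch assumes base-2 logarithms (entropy in bits) and the usual convention that terms with zero probability contribute zero; the function name is hypothetical.

```python
import math


def entropy_bits(pmf):
    """Entropy in bits of a PMF given as an iterable of probabilities.

    Zero-probability terms are skipped (assumed convention 0 * log 0 = 0).
    """
    return -sum(p * math.log2(p) for p in pmf if p > 0)


if __name__ == "__main__":
    print(entropy_bits([1 / 2, 1 / 4, 1 / 8, 1 / 8]))   # PMF of Example 3.1
    print(entropy_bits([1 / 3, 1 / 3, 1 / 3]))          # three equally likely symbols
```

For the PMF of Example 3.1 the entropy equals the expected codeword length of the code given there, and for three equally likely symbols it equals $\log_2 3 \approx 1.585$ bits.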
[Figure 3.6: Huffman code tree for blocks of two symbols, i.e. nine equally likely symbols aa, ab, ac, ba, bb, bc, ca, cb, cc, each with probability 1/9.]
[Figure: Plot of the binary entropy function $H_{\mathrm{bin}}(p)$ versus $p$.]
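$$\begin{aligned} H(X) &= \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1}{f_X(x)} = \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1/M}{f_X(x) \cdot 1/M} \\ &= \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1}{1/M} + \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1/M}{f_X(x)} \\ &= \log M + \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1/M}{f_X(x)} \end{aligned}$$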
Assume for now that the logarithm has base $e$. Using the fact that $\ln x \le x - 1$,
as shown in figure 3.8, we can bound $H(X)$ by
$$H(X) \le \ln M + \sum_{x \in \mathcal{X}} f_X(x)\left(\frac{1/M}{f_X(x)} - 1\right) = \ln M + \sum_{x \in \mathcal{X}} \left(\frac{1}{M} - f_X(x)\right) = \ln M + 1 - 1 = \ln M$$
We see from figure 3.8 that the bound $\ln x \le x - 1$ holds with equality if and
only if $x = 1$. In the derivation above, we see that $H(X) = \ln M$ if and only if
$f_X(x) = 1/M$ for all $x \in \mathcal{X}$, i.e. the symbols are equally likely.
Finally, if the logarithm has base 2, then we can use the bound $\log_2 x = \frac{\ln x}{\ln 2} \le \frac{x - 1}{\ln 2}$
to show that $H(X) \le \log_2 M$. The argument is the same as before and is
thus omitted.
[Figure 3.8: Plots of $\ln(x)$ and $x - 1$ versus $x$.]
$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log f_{X,Y}(x, y), \tag{3.4}$$
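where $f_{X,Y}(x, y)$ is the joint PMF of $X$ and $Y$. The joint entropy $H(X, Y)$ can be considered as the amount of uncertainty in RVs $X$ and $Y$. The definition can be extended to $n$ RVs $X_1, \ldots, X_n$; their joint entropy is written as
$$H(X_1, \ldots, X_n) = -\sum_{x_1 \in \mathcal{X}_1} \cdots \sum_{x_n \in \mathcal{X}_n} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) \log f_{X_1,\ldots,X_n}(x_1, \ldots, x_n). \tag{3.5}$$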
Consider again two discrete RVs $X$ and $Y$ with the alphabets $\mathcal{X}$ and $\mathcal{Y}$ and the PMFs $f_X(x)$ and $f_Y(y)$ respectively. The conditional entropy of $X$ given that $Y = y$, denoted by $H(X|Y = y)$, is defined as
$$H(X|Y = y) = -\sum_{x \in \mathcal{X}} f_{X|Y}(x|y) \log f_{X|Y}(x|y) \tag{3.6}$$
where $f_{X|Y}(x|y)$ is the conditional PMF of $X$ given $Y$. The conditional entropy $H(X|Y = y)$ can be considered as the amount of uncertainty left in RV $X$ given that we know $Y = y$. The average conditional entropy of $X$ given the RV $Y$, or in short the conditional entropy of $X$ given $Y$, denoted by $H(X|Y)$, is defined as
$$H(X|Y) = \sum_{y \in \mathcal{Y}} f_Y(y) H(X|Y = y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log f_{X|Y}(x|y). \tag{3.7}$$
Two special cases should be noted. First, if X and Y are independent, then
H(X|Y ) = H(X). Intuitively, in this case, the knowledge of Y does not change the
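$$\begin{aligned}
H(X|Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{1}{f_{X|Y}(x|y)} = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \left( \frac{1}{f_X(x)} \cdot \frac{f_X(x)}{f_{X|Y}(x|y)} \right) \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{1}{f_X(x)} + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{f_X(x)}{f_{X|Y}(x|y)} \\
&= \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1}{f_X(x)} + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{f_X(x) f_Y(y)}{f_{X,Y}(x, y)} \\
&= H(X) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{f_X(x) f_Y(y)}{f_{X,Y}(x, y)}
\end{aligned}$$
Assuming that the logarithm has base $e$ and applying $\ln x \le x - 1$, we obtain
$$\begin{aligned}
H(X|Y) &\le H(X) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \left( \frac{f_X(x) f_Y(y)}{f_{X,Y}(x, y)} - 1 \right) \\
&= H(X) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_X(x) f_Y(y) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \\
&= H(X) + 1 - 1 = H(X).
\end{aligned}$$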
Note that the equality $H(X|Y) = H(X)$ holds if and only if the logarithm argument is equal to 1 while applying $\ln x \le x - 1$. This happens when the ratio $f_X(x) f_Y(y) / f_{X,Y}(x, y)$ is equal to 1 for all $x$ and $y$. This is equivalent to having $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ for all $x$ and $y$, i.e. $X$ and $Y$ are independent.
Finally, as for the proof of theorem 3.2, we can use the bound $\frac{\ln x}{\ln 2} \le \frac{x - 1}{\ln 2}$ to prove the theorem if the logarithm has base 2.
The following theorem tells us that the joint entropy can be written as the sum of conditional entropies.
Theorem 3.4 (Chain rule for the joint entropy): For discrete RVs $X_1, \ldots, X_n$,
$$H(X_1, \ldots, X_n) = H(X_1) + \sum_{j=2}^{n} H(X_j | X_1, \ldots, X_{j-1}).$$
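Proof: We provide below a proof for two RVs $X_1$ and $X_2$. The proof is essentially based on the fact that $f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2|X_1}(x_2|x_1)$.
$$\begin{aligned}
H(X_1, X_2) &= -\sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} f_{X_1,X_2}(x_1, x_2) \log f_{X_1,X_2}(x_1, x_2) \\
&= -\sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} f_{X_1,X_2}(x_1, x_2) \log \left( f_{X_1}(x_1) f_{X_2|X_1}(x_2|x_1) \right) \\
&= -\sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} f_{X_1,X_2}(x_1, x_2) \log f_{X_1}(x_1) - \sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} f_{X_1,X_2}(x_1, x_2) \log f_{X_2|X_1}(x_2|x_1) \\
&= H(X_1) + H(X_2|X_1)
\end{aligned}$$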
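The definitions (3.4) and (3.7), the bound $H(X|Y) \le H(X)$, and the chain rule for two RVs can be checked numerically. The sketch below uses a hypothetical joint PMF chosen only for illustration; none of the numbers come from the text.

```python
import math

def H(probabilities):
    """Entropy in bits of a collection of probabilities (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical joint PMF f_{X,Y}(x, y), keyed by (x, y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal PMFs f_X(x) and f_Y(y).
f_x, f_y = {}, {}
for (x, y), p in joint.items():
    f_x[x] = f_x.get(x, 0.0) + p
    f_y[y] = f_y.get(y, 0.0) + p

h_xy = H(joint.values())     # joint entropy, as in (3.4)
h_x = H(f_x.values())
h_y = H(f_y.values())
# Conditional entropy, as in (3.7), with f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y).
h_x_given_y = -sum(p * math.log2(p / f_y[y]) for (x, y), p in joint.items() if p > 0)

print(h_xy, h_y + h_x_given_y)   # chain rule: H(X, Y) = H(Y) + H(X|Y)
print(h_x_given_y, h_x)          # conditioning does not increase entropy
```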
$$\cdots \sum_{j=1}^{n} H(X_j), \tag{3.8}$$
3.3
Source Coding Theorem for Discrete Sources
Consider a discrete RV $X$ with the alphabet $\mathcal{X}$ and entropy $H(X)$. Consider constructing a prefix-free source code for $X$. Let $L_{\min}(C)$ be the minimum expected codeword length (in bits). Recall that $L_{\min}(C)$ can be achieved using the Huffman algorithm. The following theorem relates $L_{\min}(C)$ to $H(X)$.
Theorem 3.5 (Entropy bound for prefix-free codes): Assume that all quantities are in bits.
1. $H(X) \le L_{\min}(C) < H(X) + 1$
2. $H(X) = L_{\min}(C)$ if and only if the PMF $f_X(\cdot)$ is dyadic, i.e. $f_X(x)$ is a negative integer power of 2 for all $x \in \mathcal{X}$.
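Proof: For convenience, let $M = |\mathcal{X}|$. Denote the PMF values by $p_1, \ldots, p_M$, and the corresponding codeword lengths of a code achieving $L_{\min}(C)$ by $l_1, \ldots, l_M$. Using the bound $\log_2 x \le \frac{x - 1}{\ln 2}$, we can write
$$\begin{aligned}
H(X) - L_{\min}(C) &= \sum_{m=1}^{M} p_m \log_2 \frac{1}{p_m} - \sum_{m=1}^{M} p_m l_m = \sum_{m=1}^{M} p_m \log_2 \frac{2^{-l_m}}{p_m} \\
&\le \frac{1}{\ln 2} \sum_{m=1}^{M} p_m \left( \frac{2^{-l_m}}{p_m} - 1 \right) = \frac{1}{\ln 2} \left( \sum_{m=1}^{M} 2^{-l_m} - 1 \right)
\end{aligned}$$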
Using the Kraft inequality (see theorem 3.1), i.e. $\sum_{m=1}^{M} 2^{-l_m} \le 1$, $H(X) - L_{\min}(C)$ is further upper bounded by $\frac{1}{\ln 2}(1 - 1) = 0$, yielding $H(X) \le L_{\min}(C)$.
We now prove the upper bound $L_{\min}(C) < H(X) + 1$ by showing the existence of one prefix-free code with the expected codeword length $L(C) < H(X) + 1$. In this code, we choose the codeword lengths to be $l_m = \lceil -\log_2 p_m \rceil$, $m \in \{1, \ldots, M\}$. (This is also referred to as the Shannon-Fano-Elias coding.) Note that the following inequality holds: $-\log_2 p_m \le l_m < -\log_2 p_m + 1$.
From the bound $-\log_2 p_m \le l_m$, or equivalently $2^{-l_m} \le p_m$, we obtain $\sum_{m=1}^{M} 2^{-l_m} \le \sum_{m=1}^{M} p_m = 1$. Thus, the Kraft inequality holds, and there exists a prefix-free code with the above choice of codeword lengths. From $l_m < -\log_2 p_m + 1$, we can write
$$L(C) = \sum_{m=1}^{M} p_m l_m < \sum_{m=1}^{M} p_m (-\log_2 p_m + 1) = -\sum_{m=1}^{M} p_m \log_2 p_m + \sum_{m=1}^{M} p_m = H(X) + 1. \tag{3.9}$$
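The construction used in the proof can be illustrated numerically. The sketch below chooses the codeword lengths $l_m = \lceil -\log_2 p_m \rceil$, checks the Kraft inequality, and compares the expected length with the entropy. Both PMFs are hypothetical examples, one non-dyadic and one dyadic; the function names are illustrative.

```python
import math

def shannon_lengths(pmf):
    """Codeword lengths l_m = ceil(-log2 p_m) used in the proof of the upper bound."""
    return [math.ceil(-math.log2(p)) for p in pmf]

def kraft_sum(lengths):
    return sum(2.0 ** (-l) for l in lengths)

def expected_length(pmf, lengths):
    return sum(p * l for p, l in zip(pmf, lengths))

def entropy_bits(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

pmf = [0.4, 0.3, 0.2, 0.1]              # hypothetical non-dyadic PMF
lengths = shannon_lengths(pmf)           # [2, 2, 3, 4]
print(kraft_sum(lengths))                # 0.6875, so a prefix-free code exists
print(entropy_bits(pmf), expected_length(pmf, lengths))

dyadic = [0.5, 0.25, 0.125, 0.125]       # hypothetical dyadic PMF
print(entropy_bits(dyadic), expected_length(dyadic, shannon_lengths(dyadic)))
```

For the dyadic PMF the expected length coincides with the entropy, in line with statement 2 of the theorem.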
Finally, if the PMF of $X$ is dyadic, then $l_m = -\log_2 p_m$, and $L(C) = L_{\min}(C) = H(X)$. On the other hand, if $L_{\min}(C) = H(X)$, then the inequality $\frac{\ln x}{\ln 2} \le \frac{x - 1}{\ln 2}$ in the proof for $H(X) \le L_{\min}(C)$ must be satisfied with equality. This means that
3.4
Asymptotic Equipartition Property
$$\left| -\frac{\log f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{n} - H(X) \right| < \varepsilon \tag{3.10}$$
Any length-n sequence (x1 , ..., xn ) satisfying (3.10) is called a typical sequence.
Example 3.7 Consider a DMS with two symbols 0 and 1 with probabilities 0.8 and 0.2 respectively. The typical set $T_\varepsilon^n$ with $n = 6$ and $\varepsilon = 0.1$ is found as follows. Note that $H(X) = H_{\mathrm{bin}}(0.2)$. It follows that the condition for a typical sequence is
$$0.033 \approx 2^{-6(H_{\mathrm{bin}}(0.2) + 0.1)} < f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) < 2^{-6(H_{\mathrm{bin}}(0.2) - 0.1)} \approx 0.075.$$
For $j \in \{0, \ldots, 6\}$, define a group-$j$ sequence to be a sequence $(x_1, \ldots, x_6)$ with bit 1 appearing $j$ times. We now check whether a group-$j$ sequence is in $T_{0.1}^6$ for each $j$.
Group-0: $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8^6 \approx 0.26$
Group-1: $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8^5 \cdot 0.2 \approx 0.066$
Group-2: $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8^4 \cdot 0.2^2 \approx 0.016$
Group-3: $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8^3 \cdot 0.2^3 \approx 4.1 \times 10^{-3}$
Group-4: $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8^2 \cdot 0.2^4 \approx 1.0 \times 10^{-3}$
Group-5: $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8 \cdot 0.2^5 \approx 2.6 \times 10^{-4}$
Group-6: $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.2^6 \approx 6.4 \times 10^{-5}$
We see that only group-1 sequences are in $T_{0.1}^6$. Therefore, $T_{0.1}^6 = \{100000, 010000, 001000, 000100, 000010, 000001\}$.
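The typical set in this example can also be found by brute-force enumeration. The sketch below applies the condition $2^{-n(H(X)+\varepsilon)} < f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) < 2^{-n(H(X)-\varepsilon)}$ used in the example to all $2^6$ binary sequences; the function name and argument names are illustrative.

```python
import math
from itertools import product

def typical_set(p1, n, eps):
    """Binary length-n sequences whose probability lies strictly between the two typicality bounds."""
    p = {"0": 1.0 - p1, "1": p1}
    h = -sum(q * math.log2(q) for q in p.values())    # H(X) = Hbin(p1)
    lower = 2.0 ** (-n * (h + eps))
    upper = 2.0 ** (-n * (h - eps))
    members = []
    for bits in product("01", repeat=n):
        prob = math.prod(p[b] for b in bits)          # IID symbols
        if lower < prob < upper:
            members.append("".join(bits))
    return members

t = typical_set(0.2, 6, 0.1)
print(t)
```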
The following theorem lists important properties of typical sets and typical sequences.
Theorem 3.7 (AEP):
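1. Since $X_1, \ldots, X_n$ are IID, $-\frac{\log f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{n} = -\frac{1}{n} \sum_{j=1}^{n} \log f_X(x_j)$. Define a RV $W_j = -\log f_X(X_j)$. Note that $W_1, \ldots, W_n$ are IID with mean $H(X)$. Let $\sigma_W^2$ denote the variance of $W_j$. From the definition of $T_\varepsilon^n$, $1 - \Pr\{T_\varepsilon^n\} = \Pr\left\{ \left| \frac{1}{n} \sum_{j=1}^{n} W_j - H(X) \right| \ge \varepsilon \right\}$. Using the weak law of large numbers (WLLN), we obtain the following bound.
$$1 - \Pr\{T_\varepsilon^n\} \le \frac{\sigma_W^2}{n \varepsilon^2}, \text{ or equivalently } \Pr\{T_\varepsilon^n\} \ge 1 - \frac{\sigma_W^2}{n \varepsilon^2}$$
The term $\frac{\sigma_W^2}{n \varepsilon^2}$ can be made arbitrarily small by choosing $n$ sufficiently large.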
2. The statement follows directly from the definition of a typical sequence in (3.10).
3. Since $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) > 2^{-n(H(X)+\varepsilon)}$ for each typical sequence in $T_\varepsilon^n$,
$L(C) < H(X) + \varepsilon + 1/n$. Since $\Pr\{T_\varepsilon^n\}$ approaches 1 for large $n$ (statement 1 of theorem 3.7), the probability of coding failure approaches 0 for large $n$. Since $\varepsilon$ can be chosen to be arbitrarily small, it follows that, for large $n$, this coding scheme yields $L(C) \approx H(X)$, which is optimal.
3.5
Source Coding for Discrete Sources with Memory
$$H_\infty(X) = \lim_{n \to \infty} \cdots \tag{3.11}$$
It is known that the limit in (3.11) exists and hence the definition is valid (see [?, p. 103] or [?, p. 74]).
Even though the Huffman algorithm can be used as an optimal coding procedure for discrete sources with memory, it requires knowledge of the joint PMF of the source symbols. In practice, this information may not be available or may be difficult to estimate. To overcome this requirement, the Lempel-Ziv (LZ) algorithm was proposed as a source coding procedure that does not require knowledge of the source statistics. Due to this property, the LZ algorithm is considered a universal source coding algorithm. Various versions of the LZ algorithm have been implemented in practice, e.g. the compress command in Unix.
Lempel-Ziv Algorithm
We now discuss the operations of the LZ algorithm and give a rough explanation of why it is efficient. The discussion is taken from [?, p. 51].
The LZ algorithm we describe is a variable-to-variable length coding process.$^2$ At each step, the algorithm maps a variable number of symbols to a variable-length codeword. In addition, the code $C$ adapts or changes over time depending on the statistics in the recent past.
Let $X_1, X_2, \ldots$ denote the sequence of identically distributed source symbols. Let $\mathcal{X}$ denote the alphabet, and define $M = |\mathcal{X}|$. Let $x_m^n$, $m \le n$, denote the subsequence $(x_m, \ldots, x_n)$. The algorithm operates by keeping a sliding window of size $W = 2^k$, where $k$ is some large positive integer. The operations of the algorithm are as follows.
1. Encode the first $W$ symbols using a fixed-length code with $\lceil \log_2 M \rceil$ bits per symbol. (In terms of the overall efficiency, it does not really matter how efficiently these $W$ symbols are coded since the $W \lceil \log_2 M \rceil$ bits used in this step are a negligible fraction of the total number of encoded bits.)
2. Set the pointer $P = W$, indicating that all symbols up to $x_P$ have been coded.
3. Find the largest positive integer $n \ge 2$ (if it exists) such that $x_{P+1}^{P+n} = x_{P+1-u}^{P+n-u}$ for some $u \in \{1, \ldots, W\}$. (In other words, find the longest match between the symbol sequence starting at index $P + 1$ and a symbol subsequence starting in the sliding window.) Encode $x_{P+1}^{P+n}$ by encoding $n$ and then encoding $u$. Figure 3.9 gives some example values of $n$ and $u$.
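The matching operation in step 3 can be sketched as follows. This is only an illustration of the search for $n$ and $u$, not a full encoder: the function name, the 0-based indexing (so that `x[P]` plays the role of $x_{P+1}$), and the tie-breaking rule (the smallest $u$ among equally long matches) are assumptions made here. As in the condition of step 3, a match may start in the window and run past position $P$.

```python
def longest_match(x, P, W):
    """Return (n, u) for the longest match of x[P:], or None if no match of length >= 2 exists.

    x : sequence of source symbols (0-indexed); x[0:P] has already been coded.
    W : sliding window size; the match must start u positions back, 1 <= u <= W.
    """
    best_n, best_u = 0, 0
    for u in range(1, min(W, P) + 1):
        n = 0
        # Extend the match while x_{P+1+n} equals x_{P+1+n-u} (1-indexed notation).
        while P + n < len(x) and x[P + n] == x[P + n - u]:
            n += 1
        if n > best_n:
            best_n, best_u = n, u
    return (best_n, best_u) if best_n >= 2 else None

print(longest_match("abcabcabcd", P=3, W=3))   # (6, 3): the match overlaps the uncoded part
print(longest_match("aab", P=2, W=2))          # None: no match of length >= 2
```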
$^2$ A different version of the LZ algorithm is a variable-to-fixed length coding process [?, p. 106].
Roughly speaking, the longest match will occur for length $n$ such that $2^{n H_\infty(X)} \approx W$, or equivalently $n \approx \frac{\log_2 W}{H_\infty(X)}$ symbols.
Recall that, in the algorithm, coding for a match of length $n$ and a match position $u$ uses $2\lfloor \log_2 n \rfloor + 1$ bits for $n$ (unary-binary code) and $\log_2 W$ bits for $u$ (fixed-length code), for a total of $(2\lfloor \log_2 n \rfloor + 1) + \log_2 W$ bits. Since $n \approx \frac{\log_2 W}{H_\infty(X)}$, the term
3.6
Source Coding for Continuous Sources
$\mathbb{Z}$. In addition, $x(t)$ can be reconstructed from $\hat{x}_k$ by $x(t) = \sum_{k \in \mathbb{Z}} \hat{x}_k e^{i 2\pi k t / T}$, $t \in \left[ -\frac{T}{2}, \frac{T}{2} \right]$.
By applying the same properties in the frequency domain, we can establish the sampling theorem as follows. Since $x(t) \leftrightarrow \hat{x}(f)$ and $x(t)$ is an L2 signal, $\hat{x}(f)$ is also an L2 signal. Since $\hat{x}(f) = 0$ for $f \notin [-W, W]$, there exists a set of Fourier series coefficients
$$x_k = \frac{1}{2W} \int_{-W}^{W} \hat{x}(f) e^{-i \frac{2\pi k f}{2W}} df, \quad k \in \mathbb{Z}. \tag{3.12}$$
$$\hat{x}(f) = \sum_{k \in \mathbb{Z}} x_k e^{i \frac{2\pi k f}{2W}}, \quad f \in [-W, W] \tag{3.13}$$
From the inverse Fourier transform formula, we see that $x_k$ in (3.12) is $\frac{1}{2W}$ times the sampling value $x\left( -\frac{k}{2W} \right)$ in the time domain. Since $\hat{x}(f)$ uniquely determines $x(t)$, we can reconstruct $x(t)$ perfectly from the samples $x\left( \frac{k}{2W} \right)$, $k \in \mathbb{Z}$.
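The reconstruction formula can be obtained by taking the inverse Fourier transform of the expression in (3.13). We can reexpress (3.13) for $f \in \mathbb{R}$ as$^3$
$$\hat{x}(f) = \sum_{k \in \mathbb{Z}} x_k e^{i \frac{2\pi k f}{2W}} \mathrm{rect}\left( \frac{f}{2W} \right), \quad f \in \mathbb{R}. \tag{3.14}$$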
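From the Fourier transform pairs
$$2W \mathrm{sinc}(2Wt) \leftrightarrow \mathrm{rect}\left( \frac{f}{2W} \right), \qquad 2W \mathrm{sinc}\left( 2W \left( t + \frac{k}{2W} \right) \right) \leftrightarrow e^{i \frac{2\pi k f}{2W}} \mathrm{rect}\left( \frac{f}{2W} \right),$$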
we can write the inverse Fourier transform of (3.14) as
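$$x(t) = \sum_{k \in \mathbb{Z}} 2W x_k \mathrm{sinc}(2Wt + k) = \sum_{k \in \mathbb{Z}} x\left( -\frac{k}{2W} \right) \mathrm{sinc}(2Wt + k) = \sum_{j=-\infty}^{\infty} x\left( \frac{j}{2W} \right) \mathrm{sinc}(2Wt - j). \tag{3.15}$$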
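The reconstruction formula (3.15) can be tried out numerically. The following is a minimal sketch, assuming the test signal $x(t) = \mathrm{sinc}^2(Wt)$, which is band-limited to $[-W, W]$, and a truncation of the infinite sum to $|j| \le K$; the signal, the value of $K$, and the function names are choices made here for illustration.

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def reconstruct(sample, W, t, K=200):
    """Truncated form of (3.15): sum over j = -K..K of x(j / 2W) sinc(2Wt - j)."""
    return sum(sample(j / (2 * W)) * sinc(2 * W * t - j) for j in range(-K, K + 1))

W = 1.0
x = lambda t: sinc(W * t) ** 2      # assumed test signal, band-limited to [-W, W]

for t in (0.3, 0.77, -1.25):
    print(t, x(t), reconstruct(x, W, t))
```

The truncation error is small here because the samples of this particular signal decay quadratically; a slowly decaying signal would need a larger $K$.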
Scalar Quantization
In this section, we discuss quantization of a single symbol produced from a source, e.g. its sample value. A scalar quantizer with $M$ levels partitions the set $\mathbb{R}$ into $M$ subsets $R_1, \ldots, R_M$ called quantization regions. Each region $R_m$, $m \in \{1, \ldots, M\}$, is then represented by a quantization point $q_m \in R_m$. If a symbol $u \in R_m$ is produced from the source, then $u$ is quantized to $q_m$.
Our goal is to treat the following problem. Let $U$ be a RV denoting a source symbol with probability density function (PDF) $f_U(u)$. Let $q(U)$ be a RV denoting its quantized value. Given the number of quantization levels $M$, we want to find the
$^3$ Recall that $\mathrm{rect}\left( \frac{f}{2W} \right) = \begin{cases} 1, & f \in [-W, W] \\ 0, & \text{otherwise} \end{cases}$
For the time being, let us assume that $R_1, \ldots, R_M$ are intervals, as shown in figure 3.11. We ask two simplified questions.
1. Given $q_1, \ldots, q_M$, how do we choose $R_1, \ldots, R_M$?
2. Given $R_1, \ldots, R_M$, how do we choose $q_1, \ldots, q_M$?
We first consider the problem of choosing $R_1, \ldots, R_M$ given $q_1, \ldots, q_M$. For a given $u \in \mathbb{R}$, the square error to $q_m$ is $(u - q_m)^2$. To minimize the MSE, $u$ should be quantized to the closest quantization point, i.e. $q(u) = q_m$ where $m = \arg\min_{j \in \{1, \ldots, M\}} (u - q_j)^2$. It follows that the boundary point $b_m$ between $R_m$ and $R_{m+1}$ must be the halfway point between $q_m$ and $q_{m+1}$, i.e. $b_m = (q_m + q_{m+1})/2$. In addition, we can say that $R_1, \ldots, R_M$ must be intervals.
We now consider the problem of choosing $q_1, \ldots, q_M$ given $R_1, \ldots, R_M$. Given $R_1, \ldots, R_M$, the MSE in (3.16) can be written as
$$\mathrm{MSE} = \sum_{m=1}^{M} \int_{R_m} (u - q_m)^2 f_U(u) du.$$
To minimize the MSE, we can consider each quantization region separately from the rest. Define a RV $V$ such that $V = m$ if $U \in R_m$, and let $p_m = \Pr\{V = m\}$. The conditional PDF of $U$ given that $V = m$ can be written as
$$f_{U|V}(u|m) = \frac{f_{U,V}(u, m)}{f_V(m)} = \frac{f_{V|U}(m|u) f_U(u)}{f_V(m)} = \frac{f_U(u)}{f_V(m)} = \frac{f_U(u)}{p_m}, \quad u \in R_m.$$
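$$\begin{aligned}
\int_{R_m} (u - q_m)^2 f_U(u) du &= p_m \int_{R_m} (u - q_m)^2 \frac{f_U(u)}{p_m} du \\
&= p_m \int_{R_m} (u - q_m)^2 f_{U|V}(u|m) du = p_m E\left[ (U - q_m)^2 | V = m \right].
\end{aligned} \tag{3.17}$$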
It is known that the value of $a$ that minimizes $E[(X - a)^2]$ is the mean of $X$, i.e. $E[X] = \arg\min_{a \in \mathbb{R}} E[(X - a)^2]$.$^4$ Therefore, the MSE is minimized when we set $q_m$ equal to the conditional mean of $U$ given $V = m$, i.e.
$$q_m = E[U | V = m] = E[U | U \in R_m]. \tag{3.18}$$
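The two conditions derived above (boundaries halfway between neighboring quantization points, and quantization points equal to conditional means) can be applied alternately, which is the idea behind the iteration commonly known as the Lloyd-Max algorithm. The sketch below is an illustrative version that works on a finite list of samples, so the conditional mean in (3.18) is replaced by an empirical mean; the function name, the fixed iteration count, and the handling of empty regions are assumptions made here.

```python
def lloyd_max(samples, points, iterations=20):
    """Alternate between the midpoint-boundary rule and the conditional-mean rule."""
    q = sorted(points)
    for _ in range(iterations):
        # Region boundaries: b_m = (q_m + q_{m+1}) / 2.
        b = [(q[m] + q[m + 1]) / 2 for m in range(len(q) - 1)]
        # Assign each sample to its region, i.e. to the nearest quantization point.
        regions = [[] for _ in q]
        for u in samples:
            m = sum(1 for bm in b if u > bm)
            regions[m].append(u)
        # Quantization points: empirical means of the regions (empty regions keep their point).
        q = [sum(r) / len(r) if r else q[m] for m, r in enumerate(regions)]
    return q

print(lloyd_max([0.0, 1.0, 10.0, 11.0], [0.0, 11.0]))   # [0.5, 10.5]
```

The iteration only guarantees that the MSE does not increase from one step to the next, so the result may depend on the initial quantization points.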
We now consider the special case of high-rate uniform scalar quantization. In this scenario, we assume that $U$ is in a finite interval $[u_{\min}, u_{\max}]$. Consider using $M$ quantization regions of equal lengths, i.e. uniform quantization. In addition, assume that $M$ is large, i.e. high-rate quantization. Let $\Delta$ denote the length of each quantization region. Note that $\Delta = (u_{\max} - u_{\min})/M$. When $M$ is sufficiently large (and hence
$^4$ To see why, we can write $E[(X - a)^2] = E[X^2] - 2aE[X] + a^2$. Differentiating the expression with respect to $a$ and setting the result to zero, we can solve for the optimal value of $a$.
Under this approximation, the quantization point in each region is the midpoint of the region. From (3.19), the corresponding MSE can be expressed as
$$\mathrm{MSE} \approx \sum_{m=1}^{M} \frac{p_m}{\Delta} \int_{R_m} (u - q_m)^2 du = \sum_{m=1}^{M} \frac{p_m}{\Delta} \int_{-\Delta/2}^{\Delta/2} w^2 dw = \sum_{m=1}^{M} \frac{p_m}{\Delta} \left( \frac{\Delta^3}{12} \right) = \frac{\Delta^2}{12}, \tag{3.20}$$
where we use the fact that $\int_{R_m} (u - q_m)^2 du = \int_{-\Delta/2}^{\Delta/2} w^2 dw$ for each length-$\Delta$ quantization region with the quantization point in the middle. Therefore, the approximate MSE does not depend on the form of $f_U(u)$ for a high-rate uniform quantizer.
If we represent the quantization points using a fixed-length code, then the codeword length is $L(C) = \log_2 M$ (assuming $M = 2^k$ for some $k \in \mathbb{Z}^+$), and is related to the MSE by
$$\mathrm{MSE} = \frac{(u_{\max} - u_{\min})^2}{12 \cdot 2^{2L(C)}}. \tag{3.21}$$
Therefore, the MSE decreases exponentially with the number of bits used for a high-rate uniform quantizer; each extra bit reduces the MSE by a factor of 4.
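The approximation $\mathrm{MSE} \approx \Delta^2/12$ can be compared against a direct numerical evaluation of the MSE. The sketch below assumes a hypothetical PDF $f_U(u) = 2u$ on $[0, 1]$ and a uniform quantizer with midpoint quantization points; the integral is evaluated with a simple midpoint rule, and all names are illustrative.

```python
def uniform_quantizer_mse(pdf, u_min, u_max, M, steps=200):
    """Numerically integrate (u - q_m)^2 f_U(u) over M equal-length regions with midpoint points."""
    delta = (u_max - u_min) / M
    h = delta / steps
    mse = 0.0
    for m in range(M):
        left = u_min + m * delta
        q = left + delta / 2               # quantization point in the middle of the region
        for s in range(steps):
            u = left + (s + 0.5) * h       # midpoint rule inside the region
            mse += (u - q) ** 2 * pdf(u) * h
    return mse, delta

mse, delta = uniform_quantizer_mse(lambda u: 2.0 * u, 0.0, 1.0, M=64)
print(mse, delta ** 2 / 12)

# Doubling M (one extra bit) should reduce the MSE by a factor of about 4.
mse2, _ = uniform_quantizer_mse(lambda u: 2.0 * u, 0.0, 1.0, M=128)
print(mse / mse2)
```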
3.7
Vector Quantization
When the source produces successive symbols U1 , U2 , . . . that are continuous RVs, it
is possible to use scalar quantization to quantize these symbols one by one. However,
3.8
Summary
In this chapter, we considered the problem of source coding. We showed that, for a discrete memoryless source (DMS), the entropy serves as a fundamental limit on the average number of bits required to represent each source symbol. For sources with memory, we defined the entropy rate, which serves as a fundamental limit for these sources. In either case, we can use the Huffman algorithm to efficiently encode source symbols, assuming knowledge of the joint probability mass function (PMF) of the source symbols. In cases where the joint PMF is not available, the Lempel-Ziv universal encoding algorithm can be applied.
When the source produces a continuous waveform, we can convert the source to a discrete source by sampling the source output and quantizing the sample values. We saw that, for a band-limited continuous waveform, we can perfectly represent the waveform by its sample values with the sampling rate equal to twice the source bandwidth. In addition, we discussed a heuristic for finding quantization regions and quantization points for scalar quantization. Compared to scalar quantization using the same data bit rate, we saw that vector quantization can reduce the distortion even when the source symbols are independent.
Quantization can be studied using the information theory framework. Using mutual information between the symbol and its quantized value, we can dene the rate
distortion function which gives a theoretical lower bound on the data rate subject to
the constraint on the distortion. See [?, p. 108] or [?, p. 301] for the discussion on
rate distortion theory.
In practice, there are source coding techniques that are specialized to particular applications. For example, for speech coding in cellular networks, model-based source coding based on linear predictive coding (LPC) is commonly used. Such specialized source coding techniques are beyond the scope of this course. See [?, p. 125] for more detailed discussions.
3.9
Practice Problems
Problem 3.1 (Problem 3.7 in [Pro95]): A DMS has an alphabet containing eight letters $a_1, \ldots, a_8$ with probabilities 0.25, 0.2, 0.15, 0.12, 0.1, 0.08, 0.05, and 0.05.
(a) Assuming that we encode one source symbol at a time, find an optimal prefix-free code $C$ for this source.
(b) Compute the expected codeword length $L(C)$ for the code in part (a).
(c) Compute the entropy $H(X)$ of the source symbol.
Problem 3.3 (Problem 2.16 in [CT91]): Consider two discrete RVs $X$ and $Y$ with the joint PMF given below.
          Y = 0    Y = 1
X = 0      1/3      1/3
X = 1       0       1/3
Problem 3.4 (Entropy computation from cards): Consider drawing two cards randomly from a deck of 8 different cards without putting the 1st card back before drawing the 2nd card. Let the cards be numbered by $1, \ldots, 8$. Let $X$ and $Y$ denote the numerical values of the 1st and 2nd cards respectively. Note that $X$ and $Y$ are RVs.
(a) Compute H(X) and H(Y ).
(b) Compute H(X, Y ) and H(X|Y ).
(c) Suppose that you put the 1st card back into the deck before randomly drawing
the 2nd card. Compute H(X, Y ) in this case.
$$f_X(x) = \begin{cases} 1/2, & x = 0 \\ 1/2, & x = 1 \end{cases} \qquad f_Y(y) = \begin{cases} 1/4, & y = -1 \\ 1/2, & y = 0 \\ 1/4, & y = 1 \end{cases}$$
Let $Z = X + Y$. Compute $H(Z)$ and $H(Z|X)$.
(b) Suppose we want to transmit the values of $X$ and $Y$ in part (a). Find an optimal source code that minimizes the expected number of bits used in the transmission.
(c) Let $X_1, X_2, \ldots$ be a sequence of IID RVs with the PMF $f_X(x)$ in part (a). Let $Y_1, Y_2, \ldots$ be a sequence of IID RVs, independent of the sequence $X_1, X_2, \ldots$, with the PMF $f_Y(y)$ in part (a). Suppose we want to transmit the two sequences. What is the minimum number of bits per symbol pair $(X_j, Y_j)$ required for the transmission?
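Show that the code $C$ must satisfy the Kraft inequality, i.e. $\sum_{x \in \mathcal{X}} 2^{-l(x)} \le 1$, by following the steps below.
(a) Write $\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^k$ as $\sum_{(x_1, \ldots, x_k) \in \mathcal{X}^k} 2^{-l(x_1, \ldots, x_k)}$.
(b) Define $l_{\max} = \max_{x \in \mathcal{X}} l(x)$. In addition, let $a(m)$ denote the number of symbol sequences $(x_1, \ldots, x_k)$ such that $l(x_1, \ldots, x_k) = m$. Rewrite $\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^k$ in part (a) as $\sum_{m=1}^{k l_{\max}} a(m) 2^{-m}$.
(c) Argue that, for unique decodability, we must have $a(m) \le 2^m$. Use this inequality to upper bound $\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^k$ in part (a) by $k l_{\max}$.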
$$f_X(x) = \begin{cases} 1/4, & \ldots \\ 1/2, & \ldots \\ 1/4, & \ldots \end{cases}$$
Let $T_\varepsilon^n$ denote the typical set of length-$n$ sequences $(x_1, \ldots, x_n)$ with respect to $f_X(x)$. Compute the probabilities of the sets $T_{0.2}^2$ and $T_{0.6}^2$.
Problem 3.9 (LZ universal source coding): Use the LZ universal source coding (as discussed in class) to encode the following bit sequence. Assume that the size of the sliding window is 8.
0010011010101110010
$$\sum_{j} p_j H(X_2 | X_1 = x_j),$$
where we assume that $X_1$ has the PMF according to the steady-state probabilities.
(a) Determine the entropy rate of the binary stationary first-order Markov source with two states as shown below. Note that the source has transition probabilities between the two states equal to $p_{2|1} = 0.2$ and $p_{1|2} = 0.3$. (HINT: The steady-state probabilities $p_1$ and $p_2$ can be found in this case as follows. In the steady state, the probability of being in state 1 and moving to state 2 must be equal to the probability of being in state 2 and moving to state 1, i.e. $p_1 p_{2|1} = p_2 p_{1|2}$.)
(b) How does the entropy rate compare with the entropy of a binary DMS with the same output symbol probabilities $p_1$ and $p_2$?
Problem 3.13 (Quantization with Huffman coding): Consider a scalar quantizer shown below together with the PDF of a DMS.
[Figure: PDF of the source with levels 1/4, 1/8, and 1/16, together with the quantization points and the quantization region boundaries; axis ticks at 1, 2, 3, 4.]
(a) Can the given scalar quantizer be a result of the Lloyd-Max algorithm? Why?
(b) Compute the associated MSE for the given quantizer.
(c) Let a RV $V$ denote the quantized value. Suppose that we use a fixed-length code for $V$. What is the number of bits per symbol required for source coding?
(d) Find the optimal (variable-length) source code for the quantized value V . What
is the number of bits per symbol required for source coding in this case?
Chapter 4
Communication Signals
Physically, communication signals are continuous waveforms. In this chapter, we
show how to represent communication signals as vectors in a linear vector space.
This vector representation is a powerful tool for analysis of communication systems.
It also allows us to understand communication theory using geometric visualization.
In addition, we discuss various modulation schemes including pulse amplitude
modulation (PAM), quadrature amplitude modulation (QAM), and other modulations with higher dimensions. We assume throughout the chapter that the transmission channel is ideal and the system is noise-free. We shall relax these two assumptions
in the next chapter.
4.1
L2 Signal Space
Recall that a signal $u(t)$ is an L2 signal if $\int_{-\infty}^{\infty} |u(t)|^2 dt < \infty$. The set of L2 signals together with the complex scalar field $\mathbb{C}$ forms a vector space called the L2 signal space. Most communication signals of interest can be reasonably modeled as L2 signals. Consequently, we shall view a signal as a vector in the L2 signal space, and use the terms signal and vector interchangeably.
To make the L2 signal space an inner product space, we can define the inner product between two L2 signals $u(t)$ and $v(t)$, denoted by $\langle u(t), v(t) \rangle$, as
$$\langle u(t), v(t) \rangle = \int_{-\infty}^{\infty} u(t) v^*(t) dt. \tag{4.1}$$
Note that we need to consider the equal sign = in (4.1) in terms of L2 equivalence. Otherwise, the above definition is not a valid inner product. For example, consider the following signal.
$$u(t) = \begin{cases} 1, & t = 0 \\ 0, & \text{otherwise} \end{cases}$$
The above signal has $\langle u(t), u(t) \rangle = 0$, but it is not a zero vector. Without the notion of L2 equivalence, the positivity property of the inner product, i.e. $\langle u(t), u(t) \rangle \ge 0$ with equality if and only if $u(t) = 0$, is violated. Based on the inner product
(4.2)
j=1
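$$\phi_k(t) = \begin{cases} \frac{1}{\sqrt{T}} e^{i 2\pi k t / T}, & t \in \left[ -\frac{T}{2}, \frac{T}{2} \right] \\ 0, & \text{otherwise} \end{cases}$$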
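where $k \in \mathbb{Z}$. Given a signal $u(t)$ in this vector space, the corresponding orthonormal expansion is
$$u(t) = \frac{1}{\sqrt{T}} \sum_{k=-\infty}^{\infty} u_k e^{i 2\pi k t / T}, \quad t \in \left[ -\frac{T}{2}, \frac{T}{2} \right],$$
where $u_k = \left\langle u(t), \frac{1}{\sqrt{T}} e^{i 2\pi k t / T} \right\rangle = \frac{1}{\sqrt{T}} \int_{-T/2}^{T/2} u(t) e^{-i 2\pi k t / T} dt$.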
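Example 4.2 From the sampling theorem, the set of L2 signals band-limited to the frequency range $[-W, W]$ forms an infinite-dimensional complex vector space. The reconstruction formula in (3.15) tells us that one orthonormal basis for this vector space is the set of vectors $\{ \sqrt{2W} \mathrm{sinc}(2Wt - k), k \in \mathbb{Z} \}$, with the orthonormal expansion
$$u(t) = \sum_{k=-\infty}^{\infty} u_k \sqrt{2W} \mathrm{sinc}(2Wt - k), \quad u_k = \frac{1}{\sqrt{2W}} u\left( \frac{k}{2W} \right).$$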
the inner product between $u(t)$ and $v(t)$ can be computed from the coefficients of their orthonormal expansions as shown below.
$$\langle u(t), v(t) \rangle = \left\langle \sum_{j=1}^{n} u_j \phi_j(t), \sum_{k=1}^{n} v_k \phi_k(t) \right\rangle = \sum_{j=1}^{n} \sum_{k=1}^{n} u_j v_k^* \langle \phi_j(t), \phi_k(t) \rangle = \sum_{j=1}^{n} u_j v_j^*$$
Note that the last equality follows from the fact that $\phi_1(t), \ldots, \phi_n(t)$ are orthonormal.
The relationship is also valid for the infinite-dimensional L2 signal space, i.e.
$$\langle u(t), v(t) \rangle = \sum_{j=1}^{\infty} u_j v_j^*. \tag{4.3}$$
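The relationship between the inner product and the expansion coefficients can be checked in a finite-dimensional setting. The sketch below uses vectors in $\mathbb{C}^4$ as a stand-in for signals and the normalized discrete Fourier vectors as an orthonormal basis; the dimension, the basis, and the example vectors are all assumptions made for illustration.

```python
import cmath

N = 4
# Orthonormal basis of C^N: phi_j[n] = exp(i 2 pi j n / N) / sqrt(N).
basis = [[cmath.exp(2j * cmath.pi * j * n / N) / N ** 0.5 for n in range(N)]
         for j in range(N)]

def inner(u, v):
    """Inner product <u, v> = sum_n u[n] v[n]^*."""
    return sum(complex(a) * complex(b).conjugate() for a, b in zip(u, v))

u = [1 + 2j, 0.5, -1j, 3]          # hypothetical vectors
v = [2, 1 - 1j, 0.25j, -1]

u_coef = [inner(u, phi) for phi in basis]    # u_j = <u, phi_j>
v_coef = [inner(v, phi) for phi in basis]    # v_j = <v, phi_j>

direct = inner(u, v)
via_coefficients = sum(a * b.conjugate() for a, b in zip(u_coef, v_coef))
print(direct, via_coefficients)
```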
4.2
Pulse Amplitude Modulation
j=1
j=1
$$s(t) = \sum_{j=0}^{\infty} a_j p(t - jT). \tag{4.5}$$
One possible choice for $p(t)$ is the rectangular pulse shown below.$^3$
$$p(t) = \mathrm{rect}(t/T) = \begin{cases} 1, & t \in \left[ -\frac{T}{2}, \frac{T}{2} \right] \\ 0, & \text{otherwise} \end{cases} \tag{4.6}$$
However, the rectangular pulse is not practical since its bandwidth is infinite; the pulse cannot be transmitted over a bandlimited channel. Another choice for $p(t)$ is the sinc pulse $\mathrm{sinc}(t/T)$. Strictly speaking, although the sinc pulse is bandlimited, it cannot be generated perfectly in practice since its support, i.e. the time interval during which $\mathrm{sinc}(t/T)$ is nonzero, is infinite. To generate the sinc pulse in practice, we need to approximate it by truncating the pulse to be time-limited. Later on in the chapter, we shall see other choices of pulses that are more practical than the sinc pulse.
$^2$ For notational convenience, we start indexing the amplitudes from 0 so that the amplitude $a_j$ is used to modulate the pulse delayed by time $jT$, i.e. $p(t - jT)$.
$^3$ We are not concerned about the non-causality of $p(t)$ since in practice the modulator can be made causal if we allow some delay, e.g. $T/2$ for the rectangular pulse.
[Figure: block diagram containing an LTI filter whose output is sampled.]
=
M
=
(( )
( )2
(
)2 )
2
d
3d
(M 1)d
+
+ ... +
2
2
2
d2 (M 2 1)
d2 (22b 1)
=
.
12
12
(4.7)
$\sum_{j=1}^{n} j = \frac{n(n+1)}{2}$ and $\sum_{j=1}^{n} j^2 = \frac{n(n+1)(2n+1)}{6}$.
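$$v(t) = \int_{-\infty}^{\infty} r(\tau) q(t - \tau) d\tau = \sum_{j=0}^{\infty} a_j \int_{-\infty}^{\infty} p(\tau - jT) q(t - \tau) d\tau$$
For convenience, define $g(t) = p(t) * q(t)$. We can then write $v(t)$ as
$$v(t) = \sum_{j=0}^{\infty} a_j g(t - jT) \tag{4.8}$$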
To obtain $v(jT) = a_j$ for $j \in \{0, 1, \ldots\}$, it suffices to choose $g(t)$ with the following property.
$$g(kT) = \begin{cases} 1, & k = 0 \\ 0, & k \in \mathbb{Z}, k \ne 0 \end{cases} \tag{4.9}$$
A signal $g(t)$ that satisfies the condition in (4.9) is called ideal Nyquist with period $T$. Note that the rectangular pulse in (4.6) and the sinc pulse $\mathrm{sinc}(t/T)$ are both ideal Nyquist with period $T$. If $g(t)$ is not ideal Nyquist, then
$$v(jT) = g(0) a_j + \sum_{k \ne j} a_k g(jT - kT). \tag{4.10}$$
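The effect of the second term in (4.10), i.e. the intersymbol interference, can be illustrated with a small numerical sketch. The amplitudes and the non-ideal samples of $g(t)$ below are hypothetical values chosen for illustration; $g$ is represented only through its samples $g(kT)$, indexed by the integer $k$.

```python
def sampled_output(a, g, j):
    """v(jT) = sum_k a_k g(jT - kT), with g given as a function of the integer index j - k."""
    return sum(a_k * g(j - k) for k, a_k in enumerate(a))

a = [1.0, -1.0, 3.0, -3.0]                        # hypothetical PAM amplitudes a_0, ..., a_3

ideal = lambda k: 1.0 if k == 0 else 0.0          # ideal Nyquist samples, as in (4.9)
leaky = lambda k: {0: 1.0, 1: 0.2, -1: 0.1}.get(k, 0.0)   # hypothetical non-ideal samples

print([sampled_output(a, ideal, j) for j in range(4)])    # recovers a exactly
print(sampled_output(a, leaky, 2))                        # 3.0 - 0.2 - 0.3 = 2.5
```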
4.3
Nyquist Criterion for No ISI
In this section, we develop a general condition that makes an L2 signal g(t) ideal
Nyquist with period T . Before doing so, it is useful to develop the sampling theorem
for passband signals and the aliasing theorem.
$$\hat{x}(f) = \sum_{k=-\infty}^{\infty} x_k e^{i \frac{2\pi k f}{2W}}, \quad f \in [f_c - W, f_c + W], \tag{4.11}$$
(4.12)
From the inverse Fourier transform formula, we see that $x_k$ in (4.12) is $\frac{1}{2W}$ times the sampling value $x\left( -\frac{k}{2W} \right)$ in the time domain. Since $\hat{x}(f)$ uniquely determines $x(t)$, we can reconstruct $x(t)$ perfectly from the samples $x\left( \frac{k}{2W} \right)$, $k \in \mathbb{Z}$.
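The reconstruction formula can be obtained by taking the inverse Fourier transform of the expression in (4.11). We can reexpress (4.11) for $f \in \mathbb{R}$ as follows.
$$\hat{x}(f) = \sum_{k=-\infty}^{\infty} x_k e^{i \frac{2\pi k f}{2W}} \mathrm{rect}\left( \frac{f - f_c}{2W} \right), \quad f \in \mathbb{R}, \quad \text{where } \mathrm{rect}(f) = \begin{cases} 1, & |f| \le 1/2 \\ 0, & \text{otherwise} \end{cases} \tag{4.13}$$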
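From the Fourier transform pair $2W \mathrm{sinc}(2Wt) \leftrightarrow \mathrm{rect}\left( \frac{f}{2W} \right)$, basic properties of the Fourier transform pair yield
$$2W e^{i 2\pi f_c \left( t + \frac{k}{2W} \right)} \mathrm{sinc}(2Wt + k) \leftrightarrow e^{i \frac{2\pi k f}{2W}} \mathrm{rect}\left( \frac{f - f_c}{2W} \right). \tag{4.14}$$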
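From (4.13) and (4.14), we obtain
$$x(t) = \sum_{k=-\infty}^{\infty} 2W x_k \mathrm{sinc}(2Wt + k) e^{i 2\pi f_c \left( t + \frac{k}{2W} \right)}. \tag{4.15}$$
Since $x_k = \frac{1}{2W} x\left( -\frac{k}{2W} \right)$, we can rewrite (4.15) as shown below.
$$x(t) = \sum_{k=-\infty}^{\infty} x\left( \frac{k}{2W} \right) \mathrm{sinc}(2Wt - k) e^{i 2\pi f_c \left( t - \frac{k}{2W} \right)} \tag{4.16}$$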
Aliasing Theorem
Consider sampling a continuous L2 signal $x(t)$ at the sampling period $\frac{1}{2W}$. However, we do not assume that $x(t)$ is band-limited to the frequency range $[-W, W]$. Let $z(t)$ be the reconstructed signal from the samples of $x(t)$ according to the reconstruction formula in (3.15), i.e.
$$z(t) = \sum_{k=-\infty}^{\infty} x\left( \frac{k}{2W} \right) \mathrm{sinc}(2Wt - k). \tag{4.17}$$
The aliasing theorem below gives an explicit expression for the Fourier transform of the reconstructed signal, i.e. $\hat{z}(f)$.
Theorem 4.2 (Aliasing theorem): The Fourier transform of the reconstructed signal $z(t)$ is given by
$$\hat{z}(f) = \begin{cases} \sum_{j=-\infty}^{\infty} \hat{x}(f - 2Wj), & f \in [-W, W] \\ 0, & \text{otherwise} \end{cases}$$
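[Figure 4.3, panels (a) and (b).]
$$\hat{x}_j(f) = \begin{cases} \hat{x}(f), & f \in [2Wj - W, 2Wj + W] \\ 0, & \text{otherwise} \end{cases}$$
Figure 4.3a illustrates the frequency components $\hat{x}_j(f)$. Using the rectangle function $\mathrm{rect}\left( \frac{f}{2W} \right)$, we can write $\hat{x}_j(f) = \hat{x}(f) \mathrm{rect}\left( \frac{f}{2W} - j \right)$.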
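From (4.17) and $x(t) = \sum_{j=-\infty}^{\infty} x_j(t)$, we can write
$$z(t) = \sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} x_j\left( \frac{k}{2W} \right) \mathrm{sinc}(2Wt - k). \tag{4.18}$$
Applying the passband reconstruction formula (4.16) to each $x_j(t)$ with $f_c = 2Wj$ gives $\sum_{k=-\infty}^{\infty} x_j\left( \frac{k}{2W} \right) \mathrm{sinc}(2Wt - k) = x_j(t) e^{-i 2\pi (2Wj) t}$, so that
$$z(t) = \sum_{j=-\infty}^{\infty} x_j(t) e^{-i 2\pi (2Wj) t},$$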
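where we use the fact that $e^{-i 2\pi j k} = 1$. It follows that, in the frequency domain, $\hat{z}(f) = \sum_{j=-\infty}^{\infty} \hat{x}_j(f + 2Wj)$. From $\hat{x}_j(f) = \hat{x}(f) \mathrm{rect}\left( \frac{f}{2W} - j \right)$, we can write
$$\hat{z}(f) = \sum_{j=-\infty}^{\infty} \hat{x}(f + 2Wj) \mathrm{rect}\left( \frac{f}{2W} \right) = \sum_{j=-\infty}^{\infty} \hat{x}(f - 2Wj) \mathrm{rect}\left( \frac{f}{2W} \right),$$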
It should be clear from the example in figure 4.3 that, unless $x(t)$ is band-limited to the frequency range $[-W, W]$, the reconstructed signal $z(t)$ is not equal to $x(t)$.
Nyquist Criterion
We are now ready to develop a condition that makes an L2 signal g(t) ideal Nyquist
with period T . The condition is called the Nyquist criterion and is given in the
following theorem.
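Theorem 4.3 (Nyquist criterion): A continuous L2 signal $g(t)$ is ideal Nyquist with period $T$ if and only if
$$\frac{1}{T} \sum_{j=-\infty}^{\infty} \hat{g}\left( f - \frac{j}{T} \right) = 1, \quad f \in \left[ -\frac{1}{2T}, \frac{1}{2T} \right].$$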
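Proof: Let $z(t)$ be the signal reconstructed from the samples $g(kT)$, $k \in \mathbb{Z}$, i.e.
$$z(t) = \sum_{k=-\infty}^{\infty} g(kT) \mathrm{sinc}\left( \frac{t}{T} - k \right).$$
Note that $g(t)$ is ideal Nyquist with period $T$ if and only if $z(t) = \mathrm{sinc}(t/T)$, or equivalently $\hat{z}(f) = T \mathrm{rect}(fT)$. From the aliasing theorem, we can write
$$\hat{z}(f) = \sum_{j=-\infty}^{\infty} \hat{g}\left( f - \frac{j}{T} \right) \mathrm{rect}(fT),$$
yielding $T \mathrm{rect}(fT) = \sum_{j=-\infty}^{\infty} \hat{g}\left( f - \frac{j}{T} \right) \mathrm{rect}(fT)$, or equivalently
$$\frac{1}{T} \sum_{j=-\infty}^{\infty} \hat{g}\left( f - \frac{j}{T} \right) = 1, \quad f \in \left[ -\frac{1}{2T}, \frac{1}{2T} \right],$$
which is the desired result.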
From the above observations, we are interested in finding a signal $g(t)$ whose bandwidth is slightly above $\frac{1}{2T}$, e.g. $(1 + \alpha) \frac{1}{2T}$ for some small $\alpha > 0$.
$$T - \hat{g}^*\left( \frac{1}{2T} - \Delta \right) = \hat{g}\left( \frac{1}{2T} + \Delta \right), \tag{4.20}$$
and is referred to as the band edge symmetry. Figure 4.4 illustrates the band edge symmetry when $\hat{g}(f)$ is real. If $\hat{g}(f)$ is complex, then the figure illustrates the symmetry for the real part of $\hat{g}(f)$.
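It can be verified that the following choice of $g(t)$, called the raised cosine pulse, satisfies the band edge symmetry in (4.20). The raised cosine pulse with parameter $\alpha$ and period $T$ is shown below together with its Fourier transform.
$$g_{rc,\alpha}(t) = \mathrm{sinc}\left( \frac{t}{T} \right) \left( \frac{\cos(\pi \alpha t / T)}{1 - 4\alpha^2 t^2 / T^2} \right) \tag{4.21}$$
$$\hat{g}_{rc,\alpha}(f) = \begin{cases} T, & 0 \le |f| \le \frac{1 - \alpha}{2T} \\ T \cos^2\left( \frac{\pi T}{2\alpha} \left( |f| - \frac{1 - \alpha}{2T} \right) \right), & \frac{1 - \alpha}{2T} < |f| \le \frac{1 + \alpha}{2T} \\ 0, & |f| > \frac{1 + \alpha}{2T} \end{cases} \tag{4.22}$$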
The Fourier transform of a raised cosine pulse is called a raised cosine spectrum. Note that, when $\alpha = 0$, the raised cosine pulse is the same as the sinc pulse. Compared to the sinc pulse, a raised cosine pulse decays faster in time, as illustrated in figure 4.5, and is more desirable when there is ISI due to signal distortion in the nonideal channel.
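The ideal Nyquist property of the raised cosine pulse in (4.21) can be checked numerically. The sketch below evaluates the pulse at the sampling instants $t = kT$ and compares the decay for different values of $\alpha$. The function names and default parameter values are illustrative; the removable singularity at $|t| = T/(2\alpha)$ is handled with the limit value $\pi/4$ of the second factor.

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def raised_cosine(t, T=1.0, alpha=0.5):
    """Raised cosine pulse (4.21) with parameter alpha and period T."""
    x = t / T
    denom = 1.0 - (2.0 * alpha * x) ** 2
    if abs(denom) < 1e-12:
        # Limit of cos(pi alpha x) / (1 - 4 alpha^2 x^2) as |x| -> 1 / (2 alpha).
        return sinc(x) * math.pi / 4.0
    return sinc(x) * math.cos(math.pi * alpha * x) / denom

for alpha in (0.0, 0.5, 1.0):
    samples = [raised_cosine(k, alpha=alpha) for k in range(0, 4)]
    print(alpha, samples)                       # 1 at k = 0, (numerically) 0 elsewhere

# Between the sampling instants, a larger alpha gives a smaller tail.
print(abs(raised_cosine(2.5, alpha=0.0)), abs(raised_cosine(2.5, alpha=1.0)))
```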
[Figure 4.5: plots of $g_{rc,\alpha}(t)$ versus $t$ and $\hat{g}_{rc,\alpha}(f)$ versus $f$ for $\alpha = 1$, $\alpha = 0.5$, and $\alpha = 0$.]
Figure 4.5: Raised cosine pulses with parameter $\alpha$, where $0 \le \alpha \le 1$ and $T = 1$. The higher the value of $\alpha$, the faster the pulse decays.
$$|\hat{p}(f)| = |\hat{q}(f)| = \sqrt{\hat{g}(f)}. \tag{4.23}$$
Since $\hat{g}(f)$ is real, it follows that $\hat{q}(f) = \hat{p}^*(f)$, or equivalently $q(t) = p^*(-t)$. With this choice of $p(t)$ and $q(t)$, it turns out that the set of pulses $\{p(t - jT), j \in \mathbb{Z}\}$ is a set of orthonormal signals, as stated formally in the following theorem.
Theorem 4.4 Let $g(t)$ be ideal Nyquist with period $T$. In addition, assume that $\hat{g}(f)$ is real and nonnegative. Let $|\hat{p}(f)| = \sqrt{\hat{g}(f)}$. Then, $\{p(t - jT), j \in \mathbb{Z}\}$ is a set of orthonormal signals.
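Proof: Since $q(t) = p^*(-t)$, we can write $g(t) = p(t) * q(t)$ as
$$g(t) = \int_{-\infty}^{\infty} p(\tau) q(t - \tau) d\tau = \int_{-\infty}^{\infty} p(\tau) p^*(\tau - t) d\tau.$$
In particular,
$$g(kT) = \int_{-\infty}^{\infty} p(\tau) p^*(\tau - kT) d\tau = \begin{cases} 1, & k = 0 \\ 0, & k \ne 0 \end{cases} \tag{4.24}$$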
[Figure 4.6: plot of the pulse versus $t$ for $\alpha = 1$, $\alpha = 0.5$, and $\alpha = 0$.]
Figure 4.6: Square root of raised cosine pulses with parameter $\alpha$, where $0 \le \alpha \le 1$ and $T = 1$. The higher the value of $\alpha$, the faster the pulse decays.
where the last equality follows from the assumption that $g(t)$ is ideal Nyquist with period $T$. Thus, $p(t)$ is orthogonal to $p(t - kT)$, $k \ne 0$. By the change of variable $\tau' = \tau - jT$, we can establish that $p(t - jT)$ is orthogonal to $p(t - kT)$, $k \ne j$. In addition, from (4.24), we see that $\|p(t)\| = 1$. It follows that $\|p(t - jT)\| = \|p(t)\| = 1$, $j \in \mathbb{Z}$. In conclusion, $\{p(t - jT), j \in \mathbb{Z}\}$ is an orthonormal set.
If we use the raised cosine spectrum for $\hat{g}(f)$, then the choice of $\hat{p}(f)$ in (4.23), i.e. $\hat{p}(f) = \sqrt{\hat{g}_{rc,\alpha}(f)}$, is called a square root of raised cosine spectrum. In the time domain, a square root of raised cosine pulse is equal to the inverse Fourier transform of $\sqrt{\hat{g}_{rc,\alpha}(f)}$, which is given below [?, p. 228] and illustrated in figure 4.6.
$$p_{sqrc,\alpha}(t) = \frac{4\alpha}{\pi \sqrt{T}} \left( \frac{\cos((1 + \alpha)\pi t / T) + T \sin((1 - \alpha)\pi t / T) / (4\alpha t)}{1 - (4\alpha t / T)^2} \right) \tag{4.25}$$
From (4.26), we see that $a_j$ can be obtained by passing $s(t)$ through an LTI filter with impulse response $q(t) = p^*(-t)$ and sampling the output at time $t = jT$. This is exactly the operation of the receiver in figure 4.2. Therefore, we can think of the receiver operations in figure 4.2 as finding the coefficients of the orthonormal expansion of the PAM signal.
4.4
Passband Modulation: DSB-AM and QAM
[Figure 4.7, panels (a) and (b).]
(4.27)
Figure 4.7b illustrates the spectrum of $s_{\text{DSB-AM}}(t)$. Note that, since $s_b(t)$ is real, $\hat{s}_b(f)$ must have the conjugate symmetry. Consequently, $\hat{s}_{\text{DSB-AM}}(f)$ in the frequency range $[f_c - W, f_c]$ can be determined from $\hat{s}_{\text{DSB-AM}}(f)$ in the frequency range $[f_c, f_c + W]$. This redundancy indicates an inefficient use of bandwidth by DSB-AM.
$$s_b(t) = \sum_{j=0}^{\infty} a_j p(t - jT) \tag{4.28}$$
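we can add its complex conjugate $s_b^*(t) e^{-i 2\pi f_c t}$. In addition, it is convenient to add a scaling factor of $1/\sqrt{2}$, yielding the following expressions for a QAM signal.
$$\begin{aligned}
s_{\text{QAM}}(t) &= \frac{1}{\sqrt{2}} s_b(t) e^{i 2\pi f_c t} + \frac{1}{\sqrt{2}} s_b^*(t) e^{-i 2\pi f_c t} \\
&= \sqrt{2} \mathrm{Re}\left\{ s_b(t) e^{i 2\pi f_c t} \right\} \\
&= \sqrt{2} \mathrm{Re}\{s_b(t)\} \cos(2\pi f_c t) - \sqrt{2} \mathrm{Im}\{s_b(t)\} \sin(2\pi f_c t)
\end{aligned} \tag{4.29}$$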
[Figure: (a) transmitter and (b) receiver block diagrams, with signals labeled as complex baseband, complex passband, and real passband.]
[Figure 4.9: QAM transmitter built from two PAM transmitters, with real baseband and real passband signals.]
QAM Implementation
To avoid the use of complex signals in QAM implementation, we can view the complex baseband signal $s_b(t)$ in (4.28) as two real signals $\mathrm{Re}\{s_b(t)\}$ and $\mathrm{Im}\{s_b(t)\}$. In
particular, we can write
$$\mathrm{Re}\{s_b(t)\} = \sum_{j=0}^{\infty} \mathrm{Re}\{a_j\}\, p(t - jT), \quad \mathrm{Im}\{s_b(t)\} = \sum_{j=0}^{\infty} \mathrm{Im}\{a_j\}\, p(t - jT).$$
We can view the transmissions of $\mathrm{Re}\{s_b(t)\}$ and $\mathrm{Im}\{s_b(t)\}$ as transmissions over
two parallel baseband PAM systems. From the expression of $s_{QAM}(t)$ in (4.29), i.e.
$s_{QAM}(t) = \sqrt{2}\,\mathrm{Re}\{s_b(t)\}\cos(2\pi f_c t) - \sqrt{2}\,\mathrm{Im}\{s_b(t)\}\sin(2\pi f_c t)$, we have the transmitter
implementation in figure 4.9. Notice that all involved signals are real.
To recover the complex baseband signal $s_b(t)$, we can separately recover $\mathrm{Re}\{s_b(t)\}$
and $\mathrm{Im}\{s_b(t)\}$. From trigonometric identities $2\cos^2 x = 1 + \cos(2x)$ and $2\sin x\cos x = \sin(2x)$, we can write
$$\begin{aligned}
\sqrt{2}\, s_{QAM}(t)\cos(2\pi f_c t) &= 2\left[\mathrm{Re}\{s_b(t)\}\cos(2\pi f_c t) - \mathrm{Im}\{s_b(t)\}\sin(2\pi f_c t)\right]\cos(2\pi f_c t) \\
&= \mathrm{Re}\{s_b(t)\}\left[1 + \cos(4\pi f_c t)\right] - \mathrm{Im}\{s_b(t)\}\sin(4\pi f_c t).
\end{aligned}$$
It follows that, after multiplying $s_{QAM}(t)$ with $\sqrt{2}\cos(2\pi f_c t)$, we can use a low pass
filter (LPF) passing the frequency range $[-W, W]$ to recover $\mathrm{Re}\{s_b(t)\}$. Figure 4.10
shows the demodulation of $\mathrm{Re}\{s_b(t)\}$.
Since $\mathrm{Re}\{s_b(t)\}$ is a baseband PAM signal, it is passed through a PAM receiver
that contains a matched filter $q(t) = p(-t)$. Since $p(t)$ is band-limited to $[-W, W]$,
so is $q(t)$. (Note that $\hat{q}(f) = \hat{p}^*(f)$.) It follows that the LPF in figure 4.10 is in
fact redundant. Therefore, to recover $\mathrm{Re}\{a_0\}, \mathrm{Re}\{a_1\}, \ldots$, we can use the receiver
structure shown in figure 4.11.
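The demodulation identity above can be checked numerically. The Python sketch below is only an illustration under simplifying assumptions: the baseband value $s_b(t)$ is held constant over one carrier period, and averaging over that period stands in for the low pass filter. All function names are hypothetical.

```python
import math

def qam_passband(sb, fc, t):
    """s_QAM(t) as in (4.29) for a complex baseband value sb = s_b(t)."""
    return (math.sqrt(2) * sb.real * math.cos(2 * math.pi * fc * t)
            - math.sqrt(2) * sb.imag * math.sin(2 * math.pi * fc * t))

def mixed_in_phase(sb, fc, t):
    """Mixer output sqrt(2) * s_QAM(t) * cos(2 pi fc t), before low pass filtering."""
    return math.sqrt(2) * qam_passband(sb, fc, t) * math.cos(2 * math.pi * fc * t)

def expanded_in_phase(sb, fc, t):
    """Same quantity written as a baseband term plus terms at frequency 2 fc."""
    return (sb.real * (1 + math.cos(4 * math.pi * fc * t))
            - sb.imag * math.sin(4 * math.pi * fc * t))

def average_over_carrier_period(sb, fc, samples=64):
    """Crude stand-in for the LPF: average the mixer output over one carrier period."""
    times = [i / (samples * fc) for i in range(samples)]
    return sum(mixed_in_phase(sb, fc, t) for t in times) / samples
```

With `sb = 0.7 - 1.3j` and `fc = 10.0`, `average_over_carrier_period(sb, fc)` returns a value close to `0.7`, i.e. the real part of the baseband value, since the double-frequency terms average out.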
$\sqrt{2}\,p(t - jT)\cos(2\pi f_c t),\ -\sqrt{2}\,p(t - jT)\sin(2\pi f_c t),\ j \in \mathbb{Z}$
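$$\begin{aligned}
s_{QAM}(t) &= \sqrt{2}\,\mathrm{Re}\{s_b(t)\}\cos(2\pi f_c t) - \sqrt{2}\,\mathrm{Im}\{s_b(t)\}\sin(2\pi f_c t) \\
&= \sum_{j=0}^{\infty} \mathrm{Re}\{a_j\}\,\sqrt{2}\,p(t - jT)\cos(2\pi f_c t) - \sum_{j=0}^{\infty} \mathrm{Im}\{a_j\}\,\sqrt{2}\,p(t - jT)\sin(2\pi f_c t)
\end{aligned} \tag{4.30}$$
From (4.30), we see that the coefficients of expansion are equal to $\mathrm{Re}\{a_0\}, \mathrm{Re}\{a_1\}, \ldots$
and $\mathrm{Im}\{a_0\}, \mathrm{Im}\{a_1\}, \ldots$. Being a coefficient of an orthonormal expansion, $\mathrm{Re}\{a_j\}$ can
be retrieved from the inner product
$$\begin{aligned}
\mathrm{Re}\{a_j\} &= \int_{-\infty}^{\infty} s_{QAM}(t)\,\sqrt{2}\,p(t - jT)\cos(2\pi f_c t)\,dt \\
&= \int_{-\infty}^{\infty} \left[ s_{QAM}(t)\,\sqrt{2}\cos(2\pi f_c t) \right] p(t - jT)\,dt \\
&= \left. \left[ s_{QAM}(t)\,\sqrt{2}\cos(2\pi f_c t) \right] * p(-t) \right|_{t = jT}
\end{aligned}$$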
The coefficient $\mathrm{Im}\{a_j\}$ can be retrieved similarly from the inner product
,..., , ,...,
A =
.
2
2 2
2
(4.31)
Figure 4.13 shows the standard $4 \times 4$-QAM signal set with spacing $d$. For QAM,
the energy per symbol, denoted by $E_s$, is defined as $E_s = E\left[|A_k|^2\right]$. For the standard
$M \times M$-QAM signal set with $M' = M^2$ signal points,
$$E_{s,M \times M\text{-QAM}} = \frac{d^2(M' - 1)}{6}. \tag{4.32}$$
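The energy expression for the standard $M \times M$-QAM signal set, $E_s = d^2(M^2 - 1)/6$, can be checked by brute force. The sketch below assumes equally likely signal points whose real and imaginary parts each take the $M$ values $\pm d/2, \pm 3d/2, \ldots$; the helper names are hypothetical.

```python
def standard_qam_points(M, d):
    """Signal points of the standard M x M QAM set with spacing d (assumed layout)."""
    levels = [(2 * i - (M - 1)) * d / 2 for i in range(M)]
    return [complex(x, y) for x in levels for y in levels]

def average_symbol_energy(points):
    """E_s = E[|A|^2] for equally likely signal points."""
    return sum(abs(a) ** 2 for a in points) / len(points)

def qam_energy_formula(M, d):
    """Closed form d^2 (M^2 - 1) / 6 for the M x M set, i.e. M' = M^2 points."""
    return d ** 2 * (M ** 2 - 1) / 6
```

For the $4 \times 4$ set with $d = 2$, both `average_symbol_energy(standard_qam_points(4, 2))` and `qam_energy_formula(4, 2)` give 10.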
$$d_{\min,\text{BPSK}} = 2\sqrt{E_s}, \quad d_{\min,\text{QPSK}} = \sqrt{2E_s}, \quad d_{\min,\text{8PSK}} = \sqrt{(2 - \sqrt{2})E_s}. \tag{4.34}$$
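The minimum distances of the PSK signal sets can be reproduced by placing $M$ points uniformly on a circle of radius $\sqrt{E_s}$ and searching over all pairs. This is a small sketch with hypothetical helper names; the starting phase of the constellation is an arbitrary choice that does not affect the distances.

```python
import cmath
import itertools
import math

def psk_points(M, Es):
    """M-PSK signal points on a circle of radius sqrt(Es), starting at phase 0."""
    return [cmath.rect(math.sqrt(Es), 2 * math.pi * k / M) for k in range(M)]

def minimum_distance(points):
    """Smallest Euclidean distance over all pairs of distinct signal points."""
    return min(abs(a - b) for a, b in itertools.combinations(points, 2))
```

With $E_s = 2$, the three values $2\sqrt{E_s}$, $\sqrt{2E_s}$, and $\sqrt{(2-\sqrt{2})E_s}$ are recovered for $M = 2, 4, 8$ respectively.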
One fundamental question is how to choose an $M$-point QAM signal set such that
it has the maximum value of $d_{\min}$ subject to a fixed value of $E_s$. In general, optimal
signal sets are difficult to derive. In addition, the performance gain is limited and
often not worth the additional complexity involved in signal detection. As a result,
several simple but suboptimal signal sets are often used in practice. We shall not
investigate the problem of finding optimal signal sets any further.
4.5
K-Dimensional Signal Sets
So far, we have seen one-dimensional signal sets for PAM and two-dimensional signal
sets for QAM. It is possible to generalize to $K$-dimensional signal sets. For a transmission system that uses a $K$-dimensional signal set, the $j$th transmitted symbol is
a signal point that can be described as a $K$-dimensional vector $\mathbf{a}_j = (a_{j,1}, \ldots, a_{j,K})$.
As with PAM and QAM, we can view $a_{j,k}$, $j \in \mathbb{Z}^+$, $k \in \{1, \ldots, K\}$, as the coefficients of an orthonormal expansion. In particular, let $\{\phi_1(t), \ldots, \phi_K(t)\}$ be the set
of $K$ orthonormal signals corresponding to the 0th transmission. In addition, assume
that $\{\phi_1(t - jT), \ldots, \phi_K(t - jT), j \in \mathbb{Z}^+\}$ is an orthonormal set, where $T$ is the symbol
period. Then, the transmitted signal can be described as
$$s_{K\text{-dim}}(t) = \sum_{j=0}^{\infty} \sum_{k=1}^{K} a_{j,k}\,\phi_k(t - jT). \tag{4.35}$$
As with PAM and QAM, the process of retrieving the coefficient $a_{j,k}$ from $s_{K\text{-dim}}(t)$
is equivalent to computing the inner product $a_{j,k} = \langle s_{K\text{-dim}}(t), \phi_k(t - jT) \rangle$.
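In an orthogonal signal set with $M$ signal points, we can describe the $M$ signal
points as the following $M$ vectors in $M$ dimensions.
$$\mathcal{A} = \left\{ \begin{bmatrix} \sqrt{E_s} \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ \sqrt{E_s} \\ \vdots \\ 0 \end{bmatrix}, \ldots, \begin{bmatrix} 0 \\ 0 \\ \vdots \\ \sqrt{E_s} \end{bmatrix} \right\} \tag{4.36}$$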
Figure 4.15 shows the orthogonal signal set with 3 signal points. Note that, for an
orthogonal signal set, the number of dimensions $K$ is equal to the number of signal
points $M$. One example of an orthogonal signal set is a set of $M$-point pulse position
modulation ($M$-PPM) shown in figure 4.16 for $M = 4$.
A biorthogonal signal set with $M$ signal points ($M$ even) is obtained from an
orthogonal signal set with $M/2$ signal points by including the negatives of those
signal points. In particular, if $\{\mathbf{s}_1, \ldots, \mathbf{s}_{M/2}\}$ is the $M/2$-point orthogonal signal
set, then the corresponding biorthogonal signal set is $\{\mathbf{s}_1, \ldots, \mathbf{s}_{M/2}, -\mathbf{s}_1, \ldots, -\mathbf{s}_{M/2}\}$.
Figure 4.17 shows the 6-point biorthogonal signal set constructed from the signal set
in figure 4.15. Note that, for a biorthogonal signal set, $K = M/2$.
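The two constructions can be sketched in a few lines of Python. The function names below are hypothetical; signal points are represented as plain lists of coordinates.

```python
import itertools
import math

def orthogonal_set(M, Es):
    """M signal points in M dimensions, each with a single entry equal to sqrt(Es)."""
    return [[math.sqrt(Es) if k == m else 0.0 for k in range(M)] for m in range(M)]

def biorthogonal_set(M, Es):
    """M-point biorthogonal set (M even): an M/2-point orthogonal set plus its negatives."""
    if M % 2 != 0:
        raise ValueError("M must be even")
    base = orthogonal_set(M // 2, Es)
    return base + [[-x for x in s] for s in base]

def minimum_distance(points):
    """Smallest Euclidean distance over all pairs of distinct signal points."""
    return min(math.dist(a, b) for a, b in itertools.combinations(points, 2))
```

For instance, `biorthogonal_set(6, 4.0)` yields six points in three dimensions, consistent with $K = M/2$, and its minimum distance equals that of the underlying 3-point orthogonal set.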
4.6
Summary
We started the chapter by discussing the L2 signal space. We showed that transmitted signals in PAM, QAM, and higher dimensional modulation techniques can be
conveniently viewed as orthonormal expansions in the L2 signal space. Based on this
viewpoint, the process of retrieving the transmitted symbols is equivalent to computing the inner product between the transmitted signal and the appropriate basis
vectors.
Our discussion on modulations was based on the assumption of an ideal channel
with no noise. Under this perfect condition, there is a problem of designing a PAM
pulse so that there is no ISI. We described the Nyquist criterion which can be used to
identify PAM pulses with no ISI, e.g. sinc pulse and square root of raised cosine pulse.
We also specified the Nyquist bandwidth of $\frac{1}{2T}$, which is the bandwidth required for
a PAM pulse with period $T$ to have no ISI.
Our discussion on modulation techniques is by no means complete. In particular, we only focused on linear modulations with no memory where each symbol is
modulated onto a waveform in a linear fashion and is modulated independently from
the other symbols. Nonlinear modulations and modulations with memory are discussed in [?, sec. 4.3]. Their advantages include the improvement of characteristics
of the transmit signal spectrum, and the ability to perform signal detection without
synchronization at the receiver.
4.7
Practice Problems
Problem 4.1 (TRUE or FALSE): Indicate whether each of the following statements is true or false (i.e. not always true). Justify your answer.
(a) The set of signals $\{\mathrm{sinc}(t - 2k), k \in \mathbb{Z}\}$ forms an orthonormal signal set.
(b) The set of signals $\{\mathrm{sinc}(3Wt - k), k \in \mathbb{Z}\}$ is a basis for the vector space of
continuous $L_2$ signals band-limited to the frequency band $[-W, W]$.
(c) A continuous $L_2$ passband signal band-limited to the frequency band $[f_c - W, f_c + W]$ can be uniquely determined by its samples taken at the sampling
rate $2W$ (in sample/s).
(d) Suppose $p(t)$ is ideal Nyquist with period $T$. Let $W$ be the bandwidth of $p(t)$,
i.e. $\hat{p}(f) = 0$ for $f \notin [-W, W]$. Then, $W$ is bounded by $\frac{1}{2T} \leq W \leq \frac{1}{T}$.
(e) Consider baseband transmission using the standard 4-PAM signal set. Suppose
we want to double the transmission bit rate while keeping the same signal
spacing. If we use the same amount of channel bandwidth, we need to increase
the expected symbol energy by a factor of 4.
Problem 4.2 (Properties of PAM signals): Consider using the standard 4-PAM
signal set with signal spacing d for the transmission of independent and equally likely
data bits that enter the baseband modulator at the rate of 4 Mbps. Suppose that we
want to transmit a PAM signal over the baseband channel. Assume that we use the
sinc function as the pulse shape, i.e. the transmitted signal is
$$s(t) = \sum_{j=0}^{\infty} a_j\,\mathrm{sinc}\left(\frac{t}{T} - j\right),$$
where $T$ is the symbol period and $a_j$ is the signal value for symbol $j \in \{0, 1, \ldots\}$.
(a) Write down all possible values for each aj .
(b) Find the amount of bandwidth (in Hz) required to transmit the above information.
(c) Suppose that the transmission lasts for 1 s, i.e. we only transmit 4 million bits.
Express the expected energy of the PAM signal that is used to carry this amount
of information bits in terms of d. (HINT: Use an orthonormal expansion.)
(d) Repeat parts (a), (b), and (c) for the standard 8-PAM with signal spacing d.
In addition, what is the ratio between the bandwidth required in this case and
that in part (b)?
Problem 4.3 (Necessity of zero mean for PAM signal sets): Consider a PAM
signal set with $M$ signal points $a_1, \ldots, a_M$. Let $m$ be the mean signal point defined
by $m = \frac{1}{M}\sum_{j=1}^{M} a_j$. Let $E_s$ denote the expected symbol energy for this signal set.
(a) Show that, if $m \neq 0$, then the symbol energy can be reduced further by
constructing a modified signal set with signal points $a'_1, \ldots, a'_M$, where $a'_j = a_j - m$, $j \in \{1, \ldots, M\}$. In particular, let $E'_s$ be the symbol energy of the
modified signal set. Write $E'_s$ in terms of $E_s$ and $m$.
(b) Compute the expected symbol energy of the following $M$-PAM signal set, where
$d > 0$ and $M$ is a positive integer power of 2.
$$\left\{ -\left(\frac{M}{2} - 1\right)d, \ldots, -d, 0, d, \ldots, \frac{M}{2}d \right\}$$
Problem 4.4 (Symbol energy of standard $M \times M$-QAM): Show that, for the
standard $M \times M$-QAM signal set with the minimum distance $d$ between signal points,
the expected symbol energy is given by
$$E_{s,M \times M\text{-QAM}} = \frac{d^2(M' - 1)}{6}, \quad \text{where } M' = M^2.$$
Problem 4.5 (Symbol energy of QAM signal sets): Consider the following 8-point QAM signal sets. Note that each signal set has zero mean and the minimum
distance $d_{\min}$ equal to $d$.
(a) For each signal set, compute the expected symbol energy Es in terms of d.
(b) Which of the three signal sets has the lowest symbol energy Es ?
Problem 4.6 (Orthonormal basis for QAM signals): Let the set of signals
$\{p(t - jT), j \in \{0, 1, \ldots\}\}$ be an orthonormal set band-limited to the frequency range
$[-W, W]$. Let $f_c$ be the carrier frequency with $f_c > W$. Show that the following set
of vectors or signals in the $L_2$ signal space is an orthonormal set.
[Figure for Problem 4.5: three 8-point constellations labeled signal set 1, signal set 2, and signal set 3.]
$$\sum_{j=0}^{\infty} a_j\,\mathrm{sinc}\left(\frac{t}{T} - j\right),$$
where a0 , a1 , . . . are the complex signal amplitudes and T is the symbol period. What is the minimum value of the channel bandwidth required for this
transmission?
(b) Suppose that we want to have the minimum distance of dmin between signal
points. What is the expected symbol energy Es of the QPSK signal set?
(c) Continuing from (a) and (b), suppose that the transmission lasts for 1 s, i.e. only
1,000 bits are transmitted. Express the expected signal energy of the passband
QAM signal $s_{QAM}(t) = \mathrm{Re}\left\{\sqrt{2}\,s_b(t)e^{i2\pi f_c t}\right\}$ in terms of $d_{\min}$.
Problem 4.8 (Pulse position modulation (PPM)): Consider a 4-point orthogonal signal set constructed based on the four orthonormal signals or vectors in the L2
signal space as shown below.
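Consider the transmission of a single symbol. Let $E_s$ denote the expected symbol
energy. The corresponding transmitted signal is
$$s(t) = \sum_{k=1}^{4} a_k \phi_k(t), \quad \text{where } \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} \in \left\{ \begin{bmatrix} \sqrt{E_s} \\ 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ \sqrt{E_s} \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ \sqrt{E_s} \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 0 \\ \sqrt{E_s} \end{bmatrix} \right\}.$$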
The corresponding waveforms for s(t) are called the set of 4-point pulse position
modulation (4-PPM) signals.
(a) Specify the value of (in terms of T ) that makes the signals 1 (t), 2 (t), 3 (t),
4 (t) orthonormal.
(b) Draw all possible 4-PPM signal waveforms associated with the transmission of
a single symbol. Specify the signal values in your drawing.
(c) What is the minimum distance dmin between signal points in the given 4-point
signal set?
(d) Consider constructing a simplex signal set from the given 4-point signal set.
Draw all signal waveforms associated with the transmission of a single symbol
based on this simplex signal set. Specify the signal values in your drawing.
Problem 4.9 (Symbol energy and dimension of a simplex signal set): Consider the simplex signal set constructed from an M -point orthogonal signal set {s1 , . . .,
sM }.
(a) Show that the expected symbol energy of the simplex signal set is lower than
that of the orthogonal signal set by a factor of $(1 - 1/M)$.
(b) Let $d_{\min}$ be the minimum distance between signal points in the orthogonal signal
set. What is the minimum distance between signal points for the corresponding
simplex signal set?
(c) Show that the dimension of the subspace of $\mathbb{R}^M$ spanned by the simplex signal
set is $M - 1$.
Chapter 5
Signal Detection
In this chapter, we consider the presence of noise in a communication channel. We
shall focus our attention on additive white Gaussian noise (AWGN) channels and
investigate how to perform signal detection for various modulation schemes discussed
in the previous chapter. Since the problem of signal detection involves hypothesis
testing, we shall start our discussion there.
5.1
Hypothesis Testing
In hypothesis testing, there are M possible outcomes in the sample space. Each
outcome is called a hypothesis. We shall index these hypotheses from 1 to M . Let H
be a discrete random variable (RV) whose value is equal to h if hypothesis h actually
occurs, where $h \in \{1, \ldots, M\}$. Denote the probability mass function (PMF) values of
H by p1 , . . . , pM . In the context of hypothesis testing, p1 , . . . , pM are called a priori
probabilities. We assume that a priori probabilities p1 , . . . , pM are known.
Assume there is an observation RV $R$ (or random vector $\mathbf{R}$) whose statistics
depend on the hypothesis. In addition, assume that the conditional probability density function (PDF) or the conditional probability mass function (PMF), denoted by
$f_{R|H}(r|h)$, is known.
For our discussion on digital communications, M hypotheses correspond to M
possible signal points with probabilities p1 , . . . , pM . An observation R corresponds to a
received signal. The conditional PDF/PMF fR|H (r|h) characterizes a communication
channel. In what follows, we assume that R is a continuous RV. Note, however,
that the discussion is also valid for a discrete observation RV R, as well as for an
observation random vector R.
Given $R$, the goal of hypothesis testing is to decide which event $h$ actually occurs
while minimizing the probability of decision error or equivalently maximizing the
probability of correct decision. Let $\hat{H}$ denote the decision value that is a function of
$R$. Since $R$ is a RV, so is $\hat{H}$. Note that $\hat{H} \in \{1, \ldots, M\}$. In addition, the probability
of correct decision is equal to $\Pr\{\hat{H} = H\}$. Using the conditional probability, we can
write
$$\Pr\{\hat{H} = H\} = \int f_R(r)\,\Pr\{\hat{H} = H \mid R = r\}\,dr.$$
Since $f_R(r)$ is nonnegative, maximizing $\Pr\{\hat{H} = H\}$ is equivalent to maximizing, for each $r$,
$$\Pr\{\hat{H} = H \mid R = r\}. \tag{5.1}$$
Given $R = r$, this quantity is maximized by choosing $\hat{H}$ to be a value of $h$ that achieves
$$\max_{h \in \{1, \ldots, M\}} f_{H|R}(h|r). \tag{5.2}$$
The decision rule in (5.2) is called the maximum a posteriori (MAP) decision rule.
Observe that, in case of a tie, i.e. more than one value of $h$ maximizes $f_{H|R}(h|r)$, we
can arbitrarily select one of the optimal values of h without changing the probability
of correct decision.
Since we are not given the values of $f_{H|R}(h|r)$, it is convenient to rewrite the
MAP decision rule in (5.2) in terms of the known quantities. Note that we can write
$f_{H|R}(h|r) = \frac{f_{R|H}(r|h)\,p_h}{f_R(r)}$. Since $f_R(r)$ is independent of $h$, we can express the MAP rule
in (5.2) as follows.
$$\text{MAP rule: } \hat{H} = \arg\max_{h \in \{1, \ldots, M\}} f_{R|H}(r|h)\,p_h \tag{5.3}$$
For equally likely hypotheses, i.e. $p_1 = \ldots = p_M = 1/M$, the MAP rule in (5.3)
can be simplified as follows.
$$\text{ML rule: } \hat{H} = \arg\max_{h \in \{1, \ldots, M\}} f_{R|H}(r|h). \tag{5.4}$$
The decision rule in (5.4) is called the maximum likelihood (ML) decision rule.
Note that the ML decision rule can be applied in cases where we know fR|H (r|h) but
do not know p1 , . . . , pM .
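The MAP and ML rules translate directly into code. The sketch below is a minimal illustration with hypothetical function names: each conditional PDF $f_{R|H}(\cdot|h)$ is passed in as a Python callable, hypotheses are indexed from 1 as in the text, and a tie is broken by picking the smallest index, which is one of the arbitrary but valid choices.

```python
import math

def gaussian_pdf(r, mean, sigma):
    """PDF of a Gaussian RV with the given mean and standard deviation."""
    return math.exp(-((r - mean) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def map_decision(r, likelihoods, priors):
    """Return the hypothesis index h (1-based) maximizing f_{R|H}(r|h) * p_h, as in (5.3)."""
    scores = [f(r) * p for f, p in zip(likelihoods, priors)]
    return scores.index(max(scores)) + 1  # ties: smallest index wins

def ml_decision(r, likelihoods):
    """ML rule (5.4): the MAP rule with equal a priori probabilities."""
    return map_decision(r, likelihoods, [1.0] * len(likelihoods))
```

As a usage example, take two Gaussian likelihoods with means $-1$ and $1$ and unit variance. For $r = -0.2$ the ML rule returns hypothesis 1, while the MAP rule with a priori probabilities $(0.1, 0.9)$ returns hypothesis 2, because the prior outweighs the likelihood.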
$$f_{R|H}(r|1)\,p_1 \underset{\hat{H}=2}{\overset{\hat{H}=1}{\gtrless}} f_{R|H}(r|2)\,p_2$$
The above expression means that we set $\hat{H} = 1$ if the left hand side (LHS) is at
least the right hand side (RHS). On the other hand, we set $\hat{H} = 2$ if the RHS is more
than the LHS. We can rewrite the above MAP decision rule as
$$L(r) = \frac{f_{R|H}(r|1)}{f_{R|H}(r|2)} \underset{\hat{H}=2}{\overset{\hat{H}=1}{\gtrless}} \frac{p_2}{p_1}. \tag{5.5}$$
The quantity $L(r)$ is called the likelihood ratio, and the decision rule of the form
in (5.5) is called a likelihood ratio test (LRT).
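Example 5.1 Consider binary hypothesis testing in which the two hypotheses are
equally likely and the observation RV $R$ is given by (assuming $\mu > 0$)
$$R = \begin{cases} -\mu + N, & h = 1 \\ \mu + N, & h = 2 \end{cases}$$
where $N$ is a Gaussian RV with mean 0 and variance $\sigma^2$. Note that we can write
$$f_{R|H}(r|1) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(r+\mu)^2}{2\sigma^2}}, \quad f_{R|H}(r|2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(r-\mu)^2}{2\sigma^2}}.$$
Since the hypotheses are equally likely, the LRT in (5.5) becomes
$$\frac{e^{-\frac{(r+\mu)^2}{2\sigma^2}}}{e^{-\frac{(r-\mu)^2}{2\sigma^2}}} \underset{\hat{H}=2}{\overset{\hat{H}=1}{\gtrless}} 1 \quad \Longleftrightarrow \quad r \underset{\hat{H}=1}{\overset{\hat{H}=2}{\gtrless}} 0.$$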
Figure 5.1 shows the conditional PDFs $f_{R|H}(r|1)$ and $f_{R|H}(r|2)$. From figure 5.1,
it is easy to see that 0 is the threshold of the decision rule.
The probability of decision error, denoted by $P_e$, is equal to
$$\begin{aligned}
P_e = \Pr\{\hat{H} \neq H\} &= \frac{1}{2}\Pr\{\hat{H} \neq H \mid H = 1\} + \frac{1}{2}\Pr\{\hat{H} \neq H \mid H = 2\} \\
&= \frac{1}{2}\Pr\{R > 0 \mid H = 1\} + \frac{1}{2}\Pr\{R \leq 0 \mid H = 2\} \\
&= \Pr\{R > 0 \mid H = 1\}
\end{aligned}$$
where the last equality follows from symmetry. Note that $P_e$ is equal to the area of
the shaded region in figure 5.1. Since $R = -\mu + N$ under hypothesis 1, we can write
$\Pr\{R > 0 \mid H = 1\} = \Pr\{N > \mu \mid H = 1\} = \Pr\{N > \mu\}$, where the last equality
follows from the fact that $N$ is independent of $H$.
Let $Q$ denote the complementary cumulative distribution function of a zero-mean
unit-variance Gaussian RV, i.e. $Q(x) = \int_x^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\,dy$. In terms of the $Q$ function,
we can express $P_e$ as
$$P_e = \Pr\{N > \mu\} = \Pr\left\{\frac{N}{\sigma} > \frac{\mu}{\sigma}\right\} = Q\left(\frac{\mu}{\sigma}\right).$$
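The $Q$ function has no closed form, but it can be computed from the complementary error function through the standard identity $Q(x) = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$. The sketch below evaluates the error probability of Example 5.1 and compares it against a simple Monte Carlo simulation of the threshold rule. Function names, the number of trials, and the seed are our own choices for illustration.

```python
import math
import random

def q_function(x):
    """Q(x) = Pr{Z > x} for a zero-mean unit-variance Gaussian RV Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def error_probability(mu, sigma):
    """P_e = Q(mu / sigma) for the binary test of Example 5.1."""
    return q_function(mu / sigma)

def simulate_error_rate(mu, sigma, trials=100_000, seed=1):
    """Monte Carlo estimate of P_e with equally likely hypotheses and threshold 0."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        h = rng.choice((1, 2))
        mean = -mu if h == 1 else mu          # R = -mu + N under h = 1, mu + N under h = 2
        r = mean + rng.gauss(0.0, sigma)
        decision = 1 if r <= 0 else 2          # decide 1 when r <= 0, else 2
        errors += decision != h
    return errors / trials
```

With $\mu = \sigma = 1$, `error_probability(1.0, 1.0)` is approximately 0.1587, and the simulated error rate should agree to within the statistical fluctuation of the simulation.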
Figure 5.1: Conditional PDFs for binary hypothesis testing with Gaussian noise.
5.2
AWGN Channel Model
Figure 5.2 shows an additive white Gaussian noise (AWGN) channel model in which
the received signal R(t) is the sum of the transmitted signal S(t) and a zero-mean
white Gaussian random process N (t). AWGN channel models are often used in
practice.
We shall assume that N (t) is wide-sense stationary (WSS). Let SN (f ) denote the
power spectral density (PSD) of N (t). By convention, we normally set
SN (f ) = N0 /2.
(5.6)
Equivalently, the autocorrelation function of $N(t)$ is
$$\frac{N_0}{2}\,\delta(\tau). \tag{5.7}$$
The Gaussian noise assumption is a result of the central limit theorem (CLT)
which tells us that a superposition of a large number of waveforms associated with
filtered impulse noises in electronics converges to a Gaussian random process.
The white noise assumption is for modeling convenience. Although white noise
does not exist in practice, as long as the PSD is approximately constant over the
transmission band, the noise behaves as if it were white, i.e. with infinite bandwidth.
In particular, note that we usually pass $R(t)$ through a receiver filter. Figure 5.3
illustrates that the filtered noises are the same for white and non-white noises whose
PSDs are constant in the transmission band.
Figure 5.3: White noise yields the same filtered noise PSD as wideband non-white
noise.
N0
N0
( ) =
q( ) q( ),
2
2
(5.8)
N0
|
q (f )|2 .
2
(5.9)
=0
Now consider taking two samples W (t1 ) and W (t2 ). The covariance between the
(5.11)
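Finally, consider splitting and passing $N(t)$ through two LTI filters $q_1(t)$ and $q_2(t)$,
as shown in figure 5.4. Let $W_1(t) = N(t) * q_1(t)$ and $W_2(t) = N(t) * q_2(t)$. Consider
taking two samples $W_1(t_1)$ and $W_2(t_2)$. The covariance between the two samples is
given by
$$\begin{aligned}
E\left[W_1(t_1)W_2(t_2)\right] &= E\left[\int\!\!\int q_1(\alpha)N(t_1 - \alpha)\,q_2(\beta)N(t_2 - \beta)\,d\alpha\,d\beta\right] \\
&= \frac{N_0}{2}\int\!\!\int q_1(\alpha)q_2(\beta)\,\delta(t_1 - t_2 + \beta - \alpha)\,d\alpha\,d\beta \\
&= \frac{N_0}{2}\int q_1(\alpha)\,q_2(\alpha + t_2 - t_1)\,d\alpha.
\end{aligned} \tag{5.12}$$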
5.3
Optimal Receiver for AWGN Channels
$$S(t) = \sum_{k=1}^{K} A_k \phi_k(t), \tag{5.13}$$
$\phi_2(t) = -\sqrt{\frac{2}{T}}\,\mathrm{sinc}\left(\frac{t}{T}\right)\sin(2\pi f_c t)$, where $f_c$ is the carrier frequency.
Figure 5.5: Optimal receiver structure for a single symbol transmission over an AWGN
channel.
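Assume that the signal points are equally likely. Denote the set of signal points
by
$$\{\mathbf{s}_1, \ldots, \mathbf{s}_M\} = \left\{ \begin{bmatrix} s_{1,1} \\ \vdots \\ s_{1,K} \end{bmatrix}, \ldots, \begin{bmatrix} s_{M,1} \\ \vdots \\ s_{M,K} \end{bmatrix} \right\}.$$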
Note that A takes its value in {s1 , . . . , sM }. Suppose that we transmit the symbol
through an AWGN channel whose noise PSD is equal to N0 /2. Let N (t) denote the
noise process. The received signal R(t) is given by
R(t) = S(t) + N (t).
(5.14)
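Consider the $K$-dimensional signal space $\mathcal{S}$ spanned by the orthonormal set $\{\phi_1(t), \ldots, \phi_K(t)\}$. We shall see shortly that we can project the received signal $R(t)$ on $\mathcal{S}$;
the noise components outside $\mathcal{S}$ can be ignored without loss of optimality in terms
of the probability of decision error. In particular, given $R(t)$, the receiver can use a
bank of $K$ matched filters to compute
$$R_k = \langle R(t), \phi_k(t) \rangle = A_k + N_k, \quad k \in \{1, \ldots, K\}, \tag{5.15}$$
where we define $N_k = \langle N(t), \phi_k(t) \rangle$. Figure 5.5 shows the receiver structure corresponding to the computation in (5.15).
From (5.10), the variance of $N_k$ is equal to $\frac{N_0}{2}\int |\phi_k(\tau)|^2 d\tau = N_0/2$. From (5.12),
$N_1, \ldots, N_K$ are uncorrelated; being jointly Gaussian, they are independent. The joint PDF of $\mathbf{N} = (N_1, \ldots, N_K)$ is therefore
$$f_{\mathbf{N}}(\mathbf{n}) = \frac{1}{(\pi N_0)^{K/2}}\,e^{-\sum_{j=1}^{K} n_j^2 / N_0}. \tag{5.16}$$
In vector form, (5.15) can be written as
$$\mathbf{R} = \mathbf{A} + \mathbf{N}. \tag{5.17}$$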
Compared to the expression of the AWGN channel in (5.14), we see that the
channel can be described in (5.17) using vectors in the signal space $\mathcal{S}$ instead of
waveforms. From (5.17), we see that detection of the transmitted signal point can
be viewed as a hypothesis testing problem. There are $M$ hypotheses indexed from 1
to $M$. Under hypothesis $m \in \{1, \ldots, M\}$, the observation random vector $\mathbf{R}$ is given
in (5.17).
Irrelevant Noise
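$$\sum_{k=1}^{K} R_k \phi_k(t) = \sum_{k=1}^{K} A_k \phi_k(t) + \sum_{k=1}^{K} N_k \phi_k(t),$$
where the last term, denoted by $N_{\mathcal{S}}(t)$, is the component of the noise in $\mathcal{S}$.
The above expression implies that the receiver in figure 5.5 discards the noise
component $N(t) - N_{\mathcal{S}}(t)$. In what follows, we argue that this noise component is in
fact irrelevant to the receiver's decision, and hence can be ignored. The discussion is
based on [?, p. 220]. We start by proving a useful theorem.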
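Theorem 5.1 (Theorem of irrelevance): Let vectors $\mathbf{R}$ and $\mathbf{R}'$ be two observations at the receiver after the signal point $\mathbf{A}$ is sent. An optimal receiver can disregard
$\mathbf{R}'$ if and only if $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r},\mathbf{a}) = f_{\mathbf{R}'|\mathbf{R}}(\mathbf{r}'|\mathbf{r})$. In addition, a sufficient condition for
disregarding $\mathbf{R}'$ is $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r},\mathbf{a}) = f_{\mathbf{R}'}(\mathbf{r}')$.
Proof: Since the hypotheses are equally likely, the MAP decision rule is equal to the
ML decision rule. In particular, the ML decision rule compares
$$f_{\mathbf{R}',\mathbf{R}|\mathbf{A}}(\mathbf{r}',\mathbf{r}|\mathbf{s}_m) = f_{\mathbf{R}|\mathbf{A}}(\mathbf{r}|\mathbf{s}_m)\,f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r},\mathbf{s}_m), \quad m \in \{1, \ldots, M\}.$$
If $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r},\mathbf{a}) = f_{\mathbf{R}'|\mathbf{R}}(\mathbf{r}'|\mathbf{r})$, then the last term can be ignored since it is the
same for all $m$. Thus, we can simplify the decision rule to be based only on $\mathbf{R}$.
Conversely, if the last term can be ignored, it cannot depend on $m$, and we necessarily
have $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r},\mathbf{a}) = f_{\mathbf{R}'|\mathbf{R}}(\mathbf{r}'|\mathbf{r})$.
Finally, if $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r},\mathbf{a}) = f_{\mathbf{R}'}(\mathbf{r}')$, then the last term can be ignored since it
is the same for all $m$. Thus, $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r},\mathbf{a}) = f_{\mathbf{R}'}(\mathbf{r}')$ is a sufficient condition for
disregarding $\mathbf{R}'$.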
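Let us now extend the orthonormal set $\{\phi_1(t), \ldots, \phi_K(t)\}$ to an infinite orthonormal set $\{\phi_1(t), \phi_2(t), \ldots\}$ that spans the $L_2$ signal space. For a finite observation
time, we can view each realization of $N(t)$ as an $L_2$ signal. Define $N_k = \langle N(t), \phi_k(t) \rangle$
for $k \in \mathbb{Z}^+$ and let $\mathbf{N}' = (N_{K+1}, N_{K+2}, \ldots)$. Note that $\mathbf{N}'$ completely specifies the
noise component $N(t) - N_{\mathcal{S}}(t)$.
Define $R_k = \langle R(t), \phi_k(t) \rangle$ for $k \in \mathbb{Z}^+$ and let $\mathbf{R}' = (R_{K+1}, R_{K+2}, \ldots)$. Note that $\mathbf{R}$
and $\mathbf{R}'$ completely specify $R(t)$. In addition, note that $\mathbf{R}' = \mathbf{N}'$. From the theorem
of irrelevance, in order to discard $\mathbf{N}'$ from the decision rule, it suffices to show that
$$f_{\mathbf{N}'|\mathbf{R},\mathbf{A}}(\mathbf{n}'|\mathbf{r},\mathbf{a}) = f_{\mathbf{N}'}(\mathbf{n}').$$
Since knowing $\mathbf{R}$ and $\mathbf{A}$ is equivalent to knowing $\mathbf{N}$ and $\mathbf{A}$ (note that $\mathbf{R} = \mathbf{N} + \mathbf{A}$),
$$f_{\mathbf{N}'|\mathbf{R},\mathbf{A}}(\mathbf{n}'|\mathbf{r},\mathbf{a}) = f_{\mathbf{N}'|\mathbf{N},\mathbf{A}}(\mathbf{n}'|\mathbf{n},\mathbf{a}).$$
From the definition of $\mathbf{N}'$, $\mathbf{N}'$ is independent of $\mathbf{A}$ given $\mathbf{N}$, so
$$f_{\mathbf{N}'|\mathbf{N},\mathbf{A}}(\mathbf{n}'|\mathbf{n},\mathbf{a}) = f_{\mathbf{N}'|\mathbf{N}}(\mathbf{n}'|\mathbf{n}).$$
Therefore, it remains to show $f_{\mathbf{N}'|\mathbf{N}}(\mathbf{n}'|\mathbf{n}) = f_{\mathbf{N}'}(\mathbf{n}')$, or equivalently $\mathbf{N}'$ and $\mathbf{N}$
are independent. We prove that $\mathbf{N}'$ and $\mathbf{N}$ are independent by showing that, for any
finite subset $\tilde{\mathbf{N}}' = (\tilde{N}'_1, \ldots, \tilde{N}'_P)$ of $\mathbf{N}'$ and any subset $\tilde{\mathbf{N}} = (\tilde{N}_1, \ldots, \tilde{N}_Q)$ of $\mathbf{N}$, we can
write $f_{\tilde{\mathbf{N}}',\tilde{\mathbf{N}}}(\tilde{\mathbf{n}}', \tilde{\mathbf{n}}) = f_{\tilde{\mathbf{N}}'}(\tilde{\mathbf{n}}')\,f_{\tilde{\mathbf{N}}}(\tilde{\mathbf{n}})$.
Since the components of $\tilde{\mathbf{N}}'$ and $\tilde{\mathbf{N}}$ are projections of $N(t)$ on distinct orthonormal signals, they are IID Gaussian RVs with zero mean and variance $N_0/2$ (see (5.10) and (5.12)). Therefore,
$$\begin{aligned}
f_{\tilde{\mathbf{N}}',\tilde{\mathbf{N}}}(\tilde{\mathbf{n}}', \tilde{\mathbf{n}}) &= \frac{1}{(\pi N_0)^{(P+Q)/2}}\,e^{-\sum_{j=1}^{P} \tilde{n}_j'^2/N_0 \,-\, \sum_{k=1}^{Q} \tilde{n}_k^2/N_0} \\
&= \left( \frac{1}{(\pi N_0)^{P/2}}\,e^{-\sum_{j=1}^{P} \tilde{n}_j'^2/N_0} \right) \left( \frac{1}{(\pi N_0)^{Q/2}}\,e^{-\sum_{k=1}^{Q} \tilde{n}_k^2/N_0} \right) \\
&= f_{\tilde{\mathbf{N}}'}(\tilde{\mathbf{n}}')\,f_{\tilde{\mathbf{N}}}(\tilde{\mathbf{n}})
\end{aligned}$$
from which we conclude that $\mathbf{N}'$ and $\mathbf{N}$ are independent. In conclusion, $\mathbf{N}'$ can be
ignored in optimal detection at the receiver.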
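It follows that the optimal decision rule can be based on $\mathbf{R}$ alone, i.e.
$$\hat{H} = \arg\max_{m \in \{1, \ldots, M\}} f_{\mathbf{R}|\mathbf{A}}(\mathbf{r}|\mathbf{s}_m). \tag{5.18}$$
Since $f_{\mathbf{R}|\mathbf{A}}(\mathbf{r}|\mathbf{s}_m) = f_{\mathbf{N}}(\mathbf{r} - \mathbf{s}_m)$ with $f_{\mathbf{N}}$ given in (5.16), the rule in (5.18) is equivalent to
$$\hat{H} = \arg\max_{m \in \{1, \ldots, M\}} e^{-\|\mathbf{r} - \mathbf{s}_m\|^2 / N_0} = \arg\min_{m \in \{1, \ldots, M\}} \|\mathbf{r} - \mathbf{s}_m\|. \tag{5.19}$$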
Since the quantity $\|\mathbf{r} - \mathbf{s}_m\|$ is the distance between the received signal $\mathbf{r}$ and the
signal point $\mathbf{s}_m$ in the signal space, the decision rule of the form in (5.19) is called the
minimum distance decision rule. The minimum distance decision rule is quite simple
intuitively. Given an observation point $\mathbf{r}$, the most likely transmitted signal point is
the one closest to the observation $\mathbf{r}$.
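A minimum distance detector is only a few lines of code. The sketch below uses a hypothetical function name, represents signal points as tuples of coordinates, and returns a 1-based index to match the indexing of hypotheses in the text. Ties are broken in favor of the smallest index.

```python
import math

def minimum_distance_decision(r, signal_points):
    """Return the index m (1-based) of the signal point closest to the observation r."""
    distances = [math.dist(r, s) for s in signal_points]
    return distances.index(min(distances)) + 1
```

For example, with the four signal points `[(1, 1), (-1, 1), (-1, -1), (1, -1)]`, the observation `(0.9, 1.2)` is mapped to signal point 1 and the observation `(-0.1, -2.0)` to signal point 3.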
5.4
Performance of Optimal Receivers
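Define the pairwise error probability $P_{m'|m}$ as the probability that the received signal
$\mathbf{r}$ is closer to $\mathbf{s}_{m'}$ than to $\mathbf{s}_m$ when $\mathbf{s}_m$ is transmitted. From the illustration of pairwise error probability in
figure 5.6, $\mathbf{r}$ is closer to $\mathbf{s}_{m'}$ than to $\mathbf{s}_m$ when the noise component $\tilde{N}$ along the
direction $\mathbf{s}_{m'} - \mathbf{s}_m$ is greater than $\frac{1}{2}d(\mathbf{s}_m, \mathbf{s}_{m'})$, where $d(\mathbf{s}_m, \mathbf{s}_{m'}) = \|\mathbf{s}_m - \mathbf{s}_{m'}\|$.
It follows that $P_{m'|m} = \Pr\left\{\tilde{N} > \frac{1}{2}d(\mathbf{s}_m, \mathbf{s}_{m'})\right\}$. From (5.16), note that the joint
PDF of $\mathbf{N}$ is spherically symmetric. It follows that $\tilde{N}$ is a zero-mean Gaussian RV
with variance $N_0/2$. Using the $Q$ function, we can write
$$P_{m'|m} = Q\left(\frac{d(\mathbf{s}_m, \mathbf{s}_{m'})}{\sqrt{2N_0}}\right). \tag{5.20}$$
$$\Pr\left\{\bigcup_{j \geq 1} E_j\right\} \leq \sum_{j \geq 1} \Pr\{E_j\}. \tag{5.21}$$
Let $E_m$ denote the event in which a decision error is made given $H = m$, i.e.
$\mathbf{a} = \mathbf{s}_m$. Let $E_{m'|m}$ denote the event in which $\mathbf{r}$ is closer to $\mathbf{s}_{m'}$ than to $\mathbf{s}_m$. By the
definition of $E_{m'|m}$, we can write $E_m = \bigcup_{m' \neq m} E_{m'|m}$. Using the union bound,
$$\Pr\{E_m\} \leq \sum_{m' \neq m} \Pr\{E_{m'|m}\} = \sum_{m' \neq m} P_{m'|m}.$$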
Let dmin be the minimum distance between signal points of the signal set. The
union bound estimate of Pr{Em } is based on the idea that the nearest neighbors to
sm at distance dmin will dominate the summation in the union bound.
Figure 5.7: Gray encoding for PAM and QAM signal sets.
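Let $K_{\min,m}$ be the number of neighbors of $\mathbf{s}_m$ that are at distance $d_{\min}$ away. The
union bound estimate of $\Pr\{E_m\}$ is
$$\Pr\{E_m\} \approx K_{\min,m}\,Q\left(\frac{d_{\min}}{\sqrt{2N_0}}\right).$$
Let $K_{\min}$ be the average value of $K_{\min,m}$ over all $m$. The overall union bound
estimate of the symbol error probability $P_s$ is
$$P_s = \frac{1}{M}\sum_{m=1}^{M} \Pr\{E_m\} \approx K_{\min}\,Q\left(\frac{d_{\min}}{\sqrt{2N_0}}\right). \tag{5.22}$$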
Instead of the symbol error probability, it is conventional to express the transmission system performance in terms of the bit error probability or the bit error rate
(BER). Let $P_b$ denote the bit error probability. In terms of $P_s$, we can approximate
$P_b$ as
$$P_b \approx P_s / \log_2 M \tag{5.23}$$
with the assumption that a symbol error leads to only one bit error. This assumption
is reasonable if we can map $\log_2 M$ information bits to $M$ signal points such that
adjacent points differ in only one information bit. For PAM and QAM signal sets,
such a mapping is called Gray encoding [?, p. 175]. Figure 5.7 illustrates examples
of Gray encoding.
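One common way to generate such a labeling for points on a line, e.g. a PAM signal set, is the reflected binary Gray code, in which the label of the $i$th point is $i \oplus \lfloor i/2 \rfloor$. The sketch below uses this construction as an illustration; the text does not state which construction was used to produce figure 5.7, and the function names are hypothetical.

```python
def gray_code(index):
    """Reflected binary Gray code of a nonnegative integer."""
    return index ^ (index >> 1)

def gray_labels(num_bits):
    """Bit labels for 2**num_bits points on a line, ordered from one end to the other."""
    return [format(gray_code(i), f"0{num_bits}b") for i in range(2 ** num_bits)]

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length labels differ."""
    return sum(x != y for x, y in zip(a, b))
```

Calling `gray_labels(3)` gives the eight labels `000, 001, 011, 010, 110, 111, 101, 100`, in which every pair of adjacent labels differs in exactly one bit.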
[Figure 5.8: $\log_{10} P_b$ versus $E_b/N_0$ (dB) for 2-PAM, 4-PAM, 8-PAM, and 16-PAM.]
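Given $E_b$, the two signal points are $-\sqrt{E_b}$ and $\sqrt{E_b}$, and the distance between signal
points is $2\sqrt{E_b}$. Using (5.22), we can write
$$P_{b,2\text{-PAM}} = 1 \cdot Q\left(\frac{2\sqrt{E_b}}{\sqrt{2N_0}}\right) = Q\left(\sqrt{\frac{2E_b}{N_0}}\right). \tag{5.24}$$
For $M$-PAM, $E_s = \frac{d_{\min}^2(M^2 - 1)}{12}$ or equivalently $d_{\min} = \sqrt{\frac{12E_s}{M^2 - 1}}$ (see (4.7)). In
addition, note that $E_b = E_s / \log_2 M$ and $P_b \approx P_s / \log_2 M$. Note also that
$$K_{\min} = \frac{1}{M}\left(2 \cdot 1 + (M - 2) \cdot 2\right) = \frac{2(M - 1)}{M}.$$
Using (5.22) and (5.23), we can write
$$\begin{aligned}
P_{b,M\text{-PAM}} &\approx \frac{1}{\log_2 M} \cdot \frac{2(M - 1)}{M}\,Q\left(\frac{\sqrt{\frac{12E_s}{M^2 - 1}}}{\sqrt{2N_0}}\right) \\
&= \frac{2(M - 1)}{M\log_2 M}\,Q\left(\sqrt{\frac{6\log_2 M}{M^2 - 1} \cdot \frac{E_b}{N_0}}\right).
\end{aligned} \tag{5.25}$$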
Figure 5.8 shows the bit error probability for $M$-PAM signal sets according to (5.25)
for different values of $M$.
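Curves such as those in figure 5.8 can be regenerated by evaluating the union bound estimate $P_b \approx \frac{2(M-1)}{M\log_2 M}\,Q\left(\sqrt{\frac{6\log_2 M}{M^2-1}\cdot\frac{E_b}{N_0}}\right)$ on a grid of $E_b/N_0$ values. The sketch below is a direct transcription of that formula with hypothetical function names; it takes $E_b/N_0$ in dB and converts it to a linear ratio first.

```python
import math

def q_function(x):
    """Q(x) for a zero-mean unit-variance Gaussian RV, via erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def pam_bit_error_estimate(M, ebn0_db):
    """Union bound estimate of P_b for M-PAM at the given Eb/N0 in dB."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)   # dB to linear ratio
    bits = math.log2(M)
    return (2.0 * (M - 1) / (M * bits)) * q_function(math.sqrt(6.0 * bits / (M ** 2 - 1) * ebn0))
```

For $M = 2$ the expression reduces to $Q(\sqrt{2E_b/N_0})$, and at a fixed $E_b/N_0$ the estimate grows with $M$, which is the ordering of the curves in the figure.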
[Figure 5.9: $\log_{10} P_b$ versus $E_b/N_0$ (dB) for $2 \times 2$-QAM, $4 \times 4$-QAM, $8 \times 8$-QAM, and $16 \times 16$-QAM.]
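For the standard $M \times M$-QAM signal set, which has $M^2$ signal points, each symbol carries $2\log_2 M$ bits, so that $E_b = E_s / (2\log_2 M)$ and
$$P_b \approx \frac{P_s}{2\log_2 M}.$$
Counting the nearest neighbors of the $(M - 2)^2$ interior points, the $4(M - 2)$ edge points, and the 4 corner points gives
$$K_{\min} = \frac{1}{M^2}\left((M - 2)^2 \cdot 4 + 4(M - 2) \cdot 3 + 4 \cdot 2\right) = \frac{4(M - 1)}{M}.$$
Using (5.22) with $d_{\min} = \sqrt{\frac{6E_s}{M^2 - 1}}$, we can write
$$\begin{aligned}
P_{b,M \times M\text{-QAM}} &\approx \frac{1}{2\log_2 M} \cdot \frac{4(M - 1)}{M}\,Q\left(\frac{\sqrt{\frac{6E_s}{M^2 - 1}}}{\sqrt{2N_0}}\right) \\
&= \frac{2(M - 1)}{M\log_2 M}\,Q\left(\sqrt{\frac{6\log_2 M}{M^2 - 1} \cdot \frac{E_b}{N_0}}\right).
\end{aligned} \tag{5.26}$$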
Figure 5.9 shows the bit error probability for standard $M \times M$-QAM signal sets
according to (5.26) for different values of $M$.
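Two steps behind the QAM estimate can be verified numerically: the average number of nearest neighbors on an $M \times M$ grid, which should equal $4(M-1)/M$, and the agreement between the estimate written in terms of $E_s$ and the one written in terms of $E_b$, assuming $E_s = 2\log_2 M \cdot E_b$ and $d_{\min} = \sqrt{6E_s/(M^2-1)}$. The function names below are hypothetical.

```python
import math

def q_function(x):
    """Q(x) for a zero-mean unit-variance Gaussian RV, via erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def qam_average_nearest_neighbors(M):
    """Average number of grid neighbors at minimum distance over an M x M grid."""
    total = 0
    for row in range(M):
        for col in range(M):
            for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                if 0 <= row + d_row < M and 0 <= col + d_col < M:
                    total += 1
    return total / (M * M)

def qam_ber_from_es(M, Es, N0):
    """Estimate via P_s / (2 log2 M) with K_min and d_min written out explicitly."""
    d_min = math.sqrt(6.0 * Es / (M ** 2 - 1))
    k_min = 4.0 * (M - 1) / M
    return k_min * q_function(d_min / math.sqrt(2.0 * N0)) / (2.0 * math.log2(M))

def qam_ber_from_eb(M, Eb, N0):
    """Same estimate expressed in terms of Eb/N0."""
    bits = math.log2(M)
    return (2.0 * (M - 1) / (M * bits)) * q_function(math.sqrt(6.0 * bits / (M ** 2 - 1) * Eb / N0))
```

Evaluating `qam_ber_from_es(M, 2 * math.log2(M) * Eb, N0)` and `qam_ber_from_eb(M, Eb, N0)` for the same `Eb` and `N0` should give matching values up to floating-point rounding.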
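Consider $M$-point phase shift keying (PSK) signal sets with the expected symbol
energy $E_s$. Note that, for $M = 2$, 2-PSK is the same as binary PAM. For $M = 4$,
4-PSK is the same as $2 \times 2$-QAM. We now consider 8-PSK. Note that $d_{\min} = \sqrt{(2 - \sqrt{2})E_s}$, $K_{\min} = 2$, and $E_b = E_s/3$. Using (5.22) and (5.23), we can write
$$P_{b,8\text{-PSK}} \approx \frac{1}{3} \cdot 2\,Q\left(\frac{\sqrt{(2 - \sqrt{2})E_s}}{\sqrt{2N_0}}\right) = \frac{2}{3}\,Q\left(\sqrt{(2 - \sqrt{2}) \cdot \frac{3}{2} \cdot \frac{E_b}{N_0}}\right). \tag{5.27}$$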
Figure 5.10 shows the bit error probability for $M$-PSK signal sets for different $M$.
[Figure 5.10: $\log_{10} P_b$ versus $E_b/N_0$ (dB) for 2-PSK, 4-PSK, and 8-PSK.]
that $d_{\min} = \sqrt{2E_s}$ and $E_b = E_s / \log_2 M$. Unlike previous signal sets discussed so far,
we cannot well approximate $P_b \approx P_s / \log_2 M$ since all distances between signal points
are the same and thus Gray encoding is not applicable. However, we can approximate
$P_b \approx \frac{M/2}{M - 1}P_s$ since, for each bit position, there are $M/2$ out of $M - 1$ error symbols
with the incorrect bit value in that position. In addition, note that $K_{\min} = M - 1$.
Using (5.22), we can write
$$\begin{aligned}
P_{b,M\text{-orthogonal}} &\approx \frac{M/2}{M - 1} \cdot (M - 1)\,Q\left(\frac{\sqrt{2E_s}}{\sqrt{2N_0}}\right) \\
&= \frac{M}{2}\,Q\left(\sqrt{\log_2 M \cdot \frac{E_b}{N_0}}\right).
\end{aligned} \tag{5.28}$$
Figure 5.11 shows the bit error probability for $M$-point orthogonal signal sets
according to (5.28) for different values of $M$.
5.5
Detection of Multiple Transmitted Symbols
$$S(t) = \sum_{j=0}^{J-1} \sum_{k=1}^{K} A_{j,k}\,\phi_k(t - jT), \tag{5.29}$$
Figure 5.11: Bit error probability for M -point orthogonal signal sets.
Figure 5.12: Optimal receiver structure for J symbol transmissions over an AWGN
channel.
where $T$ is the symbol period, $\{\phi_1(t), \ldots, \phi_K(t)\}$ is a set of orthonormal signals, and
$\mathbf{A}_j = (A_{j,1}, \ldots, A_{j,K})$ denotes the $j$th transmitted signal point. Denote the set of $M$
signal points by $\{\mathbf{s}_1, \ldots, \mathbf{s}_M\}$. Note that $\mathbf{A}_j$ takes its value in $\{\mathbf{s}_1, \ldots, \mathbf{s}_M\}$ for each $j$.
For no intersymbol interference (ISI), assume that $\{\phi_k(t - jT), k \in \{1, \ldots, K\}, j \in \mathbb{Z}\}$
is an orthonormal set.
In the context of hypothesis testing, there are $M^J$ hypotheses. We can describe
a hypothesis using vector $\mathbf{m} = (m_0, \ldots, m_{J-1})$, where $m_j \in \{1, \ldots, M\}$ for $j \in \{0, \ldots, J - 1\}$. Note that, under hypothesis $\mathbf{m}$, the $J$ transmitted signal points are
$\mathbf{s}_{m_0}, \ldots, \mathbf{s}_{m_{J-1}}$.
Consider transmitting the signal in (5.29) through the AWGN channel whose
noise PSD is $N_0/2$. Let $N(t)$ denote the noise and $R(t)$ denote the received signal,
i.e. $R(t) = S(t) + N(t)$. Viewing the signal in (5.29) as an orthonormal expansion, it
follows that the optimal receiver has the structure shown in figure 5.12.
Note that the receiver in figure 5.12 only preserves the signal and noise components
in the signal space spanned by $\{\phi_k(t - jT), k \in \{1, \ldots, K\}, j \in \{0, \ldots, J - 1\}\}$.
The justification that we can throw away noise components outside the signal space
without loss of optimality in detection performance is the same as in the case of a
single symbol transmission and is omitted here.
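From figure 5.12, the optimal receiver computes $R_{j,k} = \langle R(t), \phi_k(t - jT) \rangle$ for
$k \in \{1, \ldots, K\}$ and $j \in \{0, \ldots, J - 1\}$. Define $N_{j,k} = \langle N(t), \phi_k(t - jT) \rangle$. Note
that these $N_{j,k}$'s are IID Gaussian RVs with zero mean and variance $N_0/2$. For
convenience, define the following vectors.
$\mathbf{A} = (\mathbf{A}_0, \ldots, \mathbf{A}_{J-1})$
$\mathbf{s}_{\mathbf{m}} = (\mathbf{s}_{m_0}, \ldots, \mathbf{s}_{m_{J-1}})$
$\mathbf{N} = (\mathbf{N}_0, \ldots, \mathbf{N}_{J-1})$, where $\mathbf{N}_j = (N_{j,1}, \ldots, N_{j,K})$
$\mathbf{R} = (\mathbf{R}_0, \ldots, \mathbf{R}_{J-1})$, where $\mathbf{R}_j = (R_{j,1}, \ldots, R_{j,K})$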
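It follows that, under hypothesis $\mathbf{m}$, we can write $\mathbf{R} = \mathbf{s}_{\mathbf{m}} + \mathbf{N}$. For optimal
detection, we use the ML decision rule (equivalent to the MAP decision rule) given
below.
$$\hat{H} = \arg\max_{\mathbf{m} \in \{1, \ldots, M\}^J} f_{\mathbf{R}|\mathbf{A}}(\mathbf{r}|\mathbf{s}_{\mathbf{m}}) \tag{5.30}$$
Since the components of $\mathbf{N}$ are IID Gaussian RVs with zero mean and variance $N_0/2$, the rule in (5.30) is equivalent to
$$\hat{H} = \arg\max_{\mathbf{m} \in \{1, \ldots, M\}^J} e^{-\|\mathbf{r} - \mathbf{s}_{\mathbf{m}}\|^2 / N_0} = \arg\min_{\mathbf{m} \in \{1, \ldots, M\}^J} \|\mathbf{r} - \mathbf{s}_{\mathbf{m}}\|, \tag{5.31}$$
which is the minimum distance decision rule for multiple symbol transmissions. We
can rewrite the decision rule in (5.31) as
$$\hat{H} = \arg\min_{\mathbf{m} \in \{1, \ldots, M\}^J} \sum_{j=0}^{J-1} \|\mathbf{r}_j - \mathbf{s}_{m_j}\|^2.$$
Since the $j$th term of the sum depends only on $m_j$, the sum is minimized by minimizing each term separately, i.e. by choosing for the $j$th symbol
$$\arg\min_{m_j \in \{1, \ldots, M\}} \|\mathbf{r}_j - \mathbf{s}_{m_j}\|^2 = \arg\min_{m_j \in \{1, \ldots, M\}} \|\mathbf{r}_j - \mathbf{s}_{m_j}\|.$$
Note that the above decision rule for the $j$th symbol is exactly the minimum
distance decision rule that we saw before in (5.19). In conclusion, when the set $\{\phi_k(t - jT), k \in \{1, \ldots, K\}, j \in \mathbb{Z}\}$ is an orthonormal set, we can detect transmitted symbols
separately from the observations $\mathbf{R}_0, \mathbf{R}_1, \ldots$ respectively without loss of optimality.
In other words, symbol-by-symbol detection is optimal when there is no ISI.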
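The equivalence between joint detection and symbol-by-symbol detection can be illustrated by brute force for a small number of symbols. In the sketch below, which uses hypothetical function names, `detect_jointly` searches over all $M^J$ hypotheses for the one minimizing the total squared distance, while `detect_symbol_by_symbol` applies the minimum distance rule to each observation separately. Indices are 1-based as in the text, and ties are not treated specially.

```python
import itertools

def squared_distance(x, y):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def detect_symbol_by_symbol(received, signal_points):
    """Apply the minimum distance rule to each received vector r_j separately."""
    decisions = []
    for r_j in received:
        costs = [squared_distance(r_j, s) for s in signal_points]
        decisions.append(costs.index(min(costs)) + 1)
    return tuple(decisions)

def detect_jointly(received, signal_points):
    """Search all M**J hypotheses for the minimum total squared distance."""
    indices = range(1, len(signal_points) + 1)
    best, best_cost = None, float("inf")
    for m in itertools.product(indices, repeat=len(received)):
        cost = sum(squared_distance(r_j, signal_points[m_j - 1])
                   for r_j, m_j in zip(received, m))
        if cost < best_cost:
            best, best_cost = m, cost
    return best
```

For the 4-PAM points `[(-3,), (-1,), (1,), (3,)]` and the observations `[(0.2,), (-2.7,), (3.5,)]`, both functions return `(3, 1, 4)`, but the joint search visits $4^3 = 64$ hypotheses whereas the symbol-by-symbol rule needs only $3 \times 4 = 12$ distance computations.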
$$\mathbf{R}_j = \mathbf{A}_j + \mathbf{N}_j, \tag{5.32}$$
where $\mathbf{R}_j$, $\mathbf{A}_j$, $\mathbf{N}_j$ are the received signal, the transmitted signal, and the Gaussian
noise for the $j$th transmitted symbol respectively. Recall that each of these vectors
has $K$ components for $K$-dimensional modulation. In addition, the $K$ components of
$\mathbf{N}_j$ are IID Gaussian RVs with zero mean and variance $N_0/2$, and are independent of
the components of $\mathbf{N}_{j'}$, $j' \neq j$.
For analysis of communication systems, it is usually more convenient to work with
the discrete-time AWGN channel model in (5.32) than to work with the continuous-time channel model. For the rest of the course, we shall use the discrete-time channel
model whenever it is possible to do so.
5.6
We compare different modulation schemes based on the approach in [?, sec. 5.2.10]. In particular, for each modulation scheme, we consider two performance parameters. The first parameter, called the bandwidth efficiency, is the transmission bit rate (in bps or bit/s) obtained per unit of bandwidth (in Hz). Let $R$ and $W$ be the transmission bit rate and the bandwidth; then the bandwidth efficiency is the ratio $R/W$ (in bit/s/Hz).
The second performance parameter is the value of $E_b/N_0$ associated with a certain requirement on the bit error probability $P_b$. We shall assume $10^{-5}$ as this requirement in the following discussion, and denote the corresponding $E_b/N_0$ by $(E_b/N_0)_{10^{-5}}$.
$M$-PAM: We assume that $M$-PAM utilizes the orthonormal set of baseband signals
\[
\left\{ \frac{1}{\sqrt{T}}\,\mathrm{sinc}\left(\frac{t}{T} - j\right),\ j \in \mathbb{Z} \right\},
\]
where $T$ is the symbol period. For $M$-PAM, note that $W = \frac{1}{2T}$, $R = \frac{\log_2 M}{T}$, and $\frac{R}{W} = 2\log_2 M$. The value of $(E_b/N_0)_{10^{-5}}$ can be obtained from the union bound estimate of $P_b$ shown in figure 5.8.
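$M \times M$-QAM: We assume that $M \times M$-QAM utilizes the orthonormal set of passband signals
\[
\left\{ \sqrt{\frac{2}{T}}\,\mathrm{sinc}\left(\frac{t}{T} - j\right)\cos(2\pi f_c t),\ \sqrt{\frac{2}{T}}\,\mathrm{sinc}\left(\frac{t}{T} - j\right)\sin(2\pi f_c t),\ j \in \mathbb{Z} \right\},
\]
where $T$ is the symbol period and $f_c$ is the carrier frequency. For $M \times M$-QAM, note that $W = \frac{1}{T}$, $R = \frac{2\log_2 M}{T}$, and $\frac{R}{W} = 2\log_2 M$. The value of $(E_b/N_0)_{10^{-5}}$ can be obtained from the union bound estimate of $P_b$ shown in figure 5.9.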
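$M$-PSK: We assume that $M$-PSK utilizes the orthonormal set of passband signals
\[
\left\{ \sqrt{\frac{2}{T}}\,\mathrm{sinc}\left(\frac{t}{T} - j\right)\cos(2\pi f_c t),\ \sqrt{\frac{2}{T}}\,\mathrm{sinc}\left(\frac{t}{T} - j\right)\sin(2\pi f_c t),\ j \in \mathbb{Z} \right\},
\]
where $T$ is the symbol period and $f_c$ is the carrier frequency. For $M$-PSK, note that $W = \frac{1}{T}$, $R = \frac{\log_2 M}{T}$, and $\frac{R}{W} = \log_2 M$. The value of $(E_b/N_0)_{10^{-5}}$ can be obtained from the union bound estimate of $P_b$ shown in figure 5.10.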
$M$-point orthogonal modulation: We assume that $M$-point orthogonal modulation utilizes the orthonormal set of signals
\[
\left\{ \frac{1}{\sqrt{T/M}}\,\mathrm{sinc}\left(\frac{t}{T/M} - k\right),\ k \in \{0, \ldots, M-1\} \right\}
\]
for the 0th symbol, where $T$ is the symbol period. (Note that this is the same as $M$-point pulse position modulation ($M$-PPM) shown in figure 4.16.) In addition, for the $j$th symbol, we use
\[
\left\{ \frac{1}{\sqrt{T/M}}\,\mathrm{sinc}\left(\frac{t}{T/M} - jM - k\right),\ k \in \{0, \ldots, M-1\} \right\},
\]
where $j \in \mathbb{Z}^+$. For $M$-point orthogonal modulation, note that $W = \frac{M}{2T}$, $R = \frac{\log_2 M}{T}$, and $\frac{R}{W} = \frac{2\log_2 M}{M}$. The value of $(E_b/N_0)_{10^{-5}}$ can be obtained from the union bound estimate of $P_b$ shown in figure 5.11.
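The four bandwidth efficiency expressions above can be collected in a small helper. This is only a sketch of the formulas as stated; the function name and the scheme labels are illustrative choices, and for QAM the argument `M` is the number of levels per dimension of an $M \times M$ constellation.

```python
import math


def bandwidth_efficiency(scheme, M):
    """Return R/W in bit/s/Hz for the signal sets described above."""
    bits = math.log2(M)
    if scheme == "PAM":         # W = 1/(2T), R = log2(M)/T
        return 2 * bits
    if scheme == "QAM":         # M x M points: W = 1/T, R = 2*log2(M)/T
        return 2 * bits
    if scheme == "PSK":         # W = 1/T, R = log2(M)/T
        return bits
    if scheme == "orthogonal":  # W = M/(2T), R = log2(M)/T
        return 2 * bits / M
    raise ValueError("unknown scheme: " + scheme)


if __name__ == "__main__":
    for scheme, M in [("PAM", 2), ("QAM", 2), ("PSK", 4), ("PSK", 2), ("orthogonal", 4)]:
        print(scheme, M, bandwidth_efficiency(scheme, M))
```

For instance, 2-PAM, $2 \times 2$-QAM, and 4-PSK all give $R/W = 2$, while 2-PSK and 4-point orthogonal modulation both give $R/W = 1$.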
Figure 5.13 shows the curve of $R/W$ versus $(E_b/N_0)_{10^{-5}}$ for different modulation schemes that we discussed above. Shown also is the upper bound on $R/W$ as a function of $E_b/N_0$. This upper limit is called the channel capacity, a quantity that we shall define and study in more detail in a later chapter. For now, it suffices to say that, for a given $E_b/N_0$, if the bandwidth efficiency $R/W$ does not exceed the channel capacity, then we can make the bit error probability as small as we want. Conversely, if the bandwidth efficiency $R/W$ exceeds the channel capacity, then we cannot make the bit error rate approach zero.
From figure 5.13, we see that there is a trade-off between the bandwidth efficiency $R/W$ and the power efficiency $E_b/N_0$. We can categorize communication system scenarios into two regions: the bandwidth-limited region with $R/W > 1$ and the power-limited region with $R/W < 1$. Examples of systems in the bandwidth-limited region are Asymmetric Digital Subscriber Line (ADSL) systems, cellular phone systems, and wireless local area networks (WLANs). Examples of systems in the power-limited region are optical communication systems and communications in deep space.
Based on figure 5.13, for bandwidth-limited communications, we should consider the following modulation schemes: $M$-PAM, $M$-PSK, and $M \times M$-QAM with large $M$. On the other hand, for power-limited communications, orthogonal signal sets should be considered. Since bi-orthogonal and simplex signal sets are created from orthogonal signal sets, they are also good candidates for power-limited communications.
[Plot: $R/W$ (dB) for 2-, 4-, 8-PAM; $2 \times 2$-, $4 \times 4$-, $8 \times 8$-QAM; 2-, 4-, 8-PSK; and 2-, 4-, 8-, 16-point orthogonal modulation, together with the channel capacity curve.]
Figure 5.13: The curve of $R/W$ versus $(E_b/N_0)_{10^{-5}}$ for different modulation schemes (similar to [?, Fig. 5.2.17]). Note that, to obtain the above figure, the bit error probabilities are computed based on the union bound estimate of $P_b$.
5.7
Summary
5.8
Practice Problems
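\[
\begin{bmatrix} R_1 \\ R_2 \\ R_3 \\ R_4 \\ R_5 \end{bmatrix}
= A \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}
+ \begin{bmatrix} N_1 \\ N_2 \\ N_3 \\ N_4 \\ N_5 \end{bmatrix},
\]
where $N_1, \ldots, N_5$ are IID zero-mean Gaussian RVs with variance $\sigma^2$, and $A$ is the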
signal amplitude equal to under hypothesis 1 and equal to under hypothesis 2.
Assume that > 0.
(a) Assume that the two hypotheses are equally likely. Find the optimal decision
rule that minimizes the probability of decision error, i.e., the MAP decision
rule, and its associated probability of decision error.
(b) Consider now a hard decision procedure in which we make 5 separate decisions based on $R_1, \ldots, R_5$, and use the majority rule for the final decision. For example, if the 5 decisions are $\hat{H}_1 = 1$, $\hat{H}_2 = 2$, $\hat{H}_3 = 1$, $\hat{H}_4 = 1$, and $\hat{H}_5 = 2$, then the final decision is 1. Express the probability of decision error for this hard decision procedure.
(c) Suppose we use the decision rule in part (b) when 1 = 2 , i.e. suboptimal
decision rule. Compare the probability of decision error to that of part (a).
NOTE: The optimal combining of independent observations taking into account different attenuation parameters in part (a) is called maximum ratio combining (MRC).
If we use the suboptimal decision rule in part (b) based on the assumption that
1 = 2 , the corresponding combining is called equal gain combining (EGC).
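\[
f_{X_1,X_2}(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}
\exp\left[ -\frac{1}{2(1-\rho^2)} \left( \frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} \right) \right]
\]
where $\mu_1, \mu_2$ and $\sigma_1^2, \sigma_2^2$ are the means and the variances of $X_1, X_2$, and $\rho$ is the covariance coefficient equal to $\frac{E[(X_1-\mu_1)(X_2-\mu_2)]}{\sigma_1\sigma_2}$.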
(b) Find the corresponding probability of an incorrect decision.
(c) Suppose that the observation contains only R1 . Show that the probability of an
incorrect decision under the optimal decision rule in this case is strictly higher
than that in part (b).
(d) Assume that we transmit over an AWGN channel with noise PSD equal to $N_0/2$. Draw the optimal receiver structure and specify the optimal decision rule for a single symbol transmission.
(e) Use the union bound estimate to express the symbol error probability $P_s$ for the decision rule in part (d) in terms of $E_b/N_0$.
(f) Describe how we can further reduce $P_s$ for a fixed $E_b/N_0$ without any channel coding.
Problem 5.6 (Optimal detection of orthogonal signal sets): Consider an orthogonal signal set with $M$ equally likely signals
\[
s_m(t) =
\begin{cases}
\sqrt{E_s}\,\phi_m(t), & m \in \left\{1, \ldots, \frac{M}{2}\right\} \\
-\sqrt{E_s}\,\phi_{m-M/2}(t), & m \in \left\{\frac{M}{2}+1, \ldots, M\right\}
\end{cases}
\]
where $\{\phi_1(t), \ldots, \phi_{M/2}(t)\}$ is an orthonormal set of signals.
(a) Sketch the signal set in the signal space diagram for $M = 4$. HINT: For $K$-dimensional modulation, the signal space diagram has $K$ dimensions. The $k$th axis specifies the signal amplitude in the $k$th dimension, i.e. the coefficient of $\phi_k(t)$ for $k \in \{1, \ldots, K\}$.
(b) Consider a single symbol transmission through an AWGN channel with noise
PSD equal to N0 /2. Describe the optimal detection at the receiver. (In other
words, describe the optimal receiver structure and the optimal decision rule.)
(c) Compute the union bound estimate of the symbol error probability Ps for the
decision rule in part (b). Express your answer in terms of M , Es , and N0 .
Chapter 6
Channel Coding
In this chapter, we discuss the functions of the channel encoder and decoder in the schematic diagram of a communication system shown in figure 1.1. Studying in detail the subject of channel coding is beyond the scope of this course. For our course, we shall not study how to construct a channel code, but we shall study how to evaluate the performance of a given code.
We shall focus on binary block codes and binary convolutional codes in this chapter. For such codes, a block of information bits is mapped to an encoded bit sequence that contains additional bits.$^1$ Such a mapping from information bits to encoded bits for transmission is called channel coding. The redundancy introduced by channel coding can improve the bit error rate (BER) of a system in the presence of noise.
6.1
Let us start with an example. The simplest but not the most efficient channel code is a repetition code, which simply repeats the information bit multiple times. In particular, consider a repetition code in which each bit is repeated 3 times: $0 \to 000$ and $1 \to 111$. For transmission, suppose that we use binary pulse amplitude modulation (PAM) through an additive white Gaussian noise (AWGN) channel with noise power spectral density (PSD) equal to $N_0/2$.
Note that the observations for detection are the output of a matched filter sampled at 3 successive symbol periods. Let $\mathbf{R} = (R_1, R_2, R_3)$ denote the observation. Let hypotheses 0 and 1 correspond to the transmission of bit 0 and of bit 1 respectively. In particular, we write
\[
\mathbf{R} = \mathbf{s}_m + \mathbf{N} \text{ under hypothesis } m \in \{0, 1\},
\]
Instead of raw data bits, information bits can also be the output of a source encoder.
In the presence of channel coding, there is a difference between information bits and encoded
decision rule. Assuming equally likely hypotheses, the optimal MAP decision rule has the following form. (The derivation is left as an exercise.)
\[
r_1 + r_2 + r_3 \underset{\hat{H}=0}{\overset{\hat{H}=1}{\gtrless}} 0
\]
We call the above decision process that jointly utilizes the exact values of $r_1, r_2, r_3$ soft decision decoding. Alternatively, we can perform 3 separate hypothesis tests based on $r_1, r_2, r_3$, and then use a majority rule for a final decision. More specifically, if the 3 decisions are $\hat{H}_1 = 1$, $\hat{H}_2 = 0$, $\hat{H}_3 = 1$, then the final decision is $\hat{H} = 1$. Such a decision process that involves separate bit decisions based on different observations is called hard decision decoding.
While soft decision decoding performs better in terms of the error probability, hard decision decoding can be attractive since it usually requires less computational effort for the decoding process.$^3$ Because of its optimality, we shall focus on soft decision decoding in this chapter.
Following our example, the corresponding probability of decision error for soft decision decoding, denoted by $P_b^{\mathrm{SOFT}}$, is
\[
P_b^{\mathrm{SOFT}} = \Pr\{R_1 + R_2 + R_3 > 0 \mid m = 0\} = \Pr\left\{N_1 + N_2 + N_3 > 3\sqrt{E_d}\right\}.
\]
Since $N_1 + N_2 + N_3$ is a Gaussian random variable (RV) with mean zero and variance $3N_0/2$, it follows that
\[
P_b^{\mathrm{SOFT}} = \Pr\left\{ \frac{N_1 + N_2 + N_3}{\sqrt{3N_0/2}} > \frac{3\sqrt{E_d}}{\sqrt{3N_0/2}} \right\} = Q\left(\sqrt{\frac{6E_d}{N_0}}\right).
\]
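For hard decision decoding, we make 3 decisions $\hat{H}_1, \hat{H}_2, \hat{H}_3$ from $r_1, r_2, r_3$. Note that each decision $\hat{H}_j$ has the error probability $p$ equal to that of binary PAM, i.e. $p = Q\left(\sqrt{2E_d/N_0}\right)$. From the majority rule, the overall bit error occurs when there are two or three errors in $\hat{H}_1, \hat{H}_2, \hat{H}_3$. Therefore, the overall probability of decision error, denoted by $P_b^{\mathrm{HARD}}$, is given by
\[
\begin{aligned}
P_b^{\mathrm{HARD}} &= \binom{3}{2} p^2 (1-p) + \binom{3}{3} p^3 \\
&= 3Q^2\left(\sqrt{\frac{2E_d}{N_0}}\right)\left(1 - Q\left(\sqrt{\frac{2E_d}{N_0}}\right)\right) + Q^3\left(\sqrt{\frac{2E_d}{N_0}}\right).
\end{aligned}
\]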
or transmitted bits. We shall use Eb to denote the expected energy per information bit (as before),
and Ed to denote the expected energy per transmission of encoded bits or equivalently the expected
energy per dimension (as will be seen later).
3
Note that the discussions on bit error detection and bit error correction are only relevant when
we discuss hard decision decoding.
Figure 6.1: $P_b^{\mathrm{SOFT}}$ and $P_b^{\mathrm{HARD}}$ (on a $\log_{10}$ scale) for different values of $E_b/N_0$ (dB) for the repetition code.
Figure 6.1 compares $P_b^{\mathrm{SOFT}}$ and $P_b^{\mathrm{HARD}}$ for different values of $E_b/N_0$, where $E_b = 3E_d$. The figure verifies that soft decision decoding outperforms hard decision decoding in our example.
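The two error probabilities can also be evaluated numerically. The sketch below assumes only the expressions derived above, with $E_d = E_b/3$, and uses the identity $Q(x) = \frac{1}{2}\mathrm{erfc}(x/\sqrt{2})$; the function names are illustrative.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def repetition3_error_probs(eb_n0_db):
    """Return (Pb_soft, Pb_hard) for the 3-fold repetition code."""
    eb_n0 = 10.0 ** (eb_n0_db / 10.0)
    ed_n0 = eb_n0 / 3.0                       # Eb = 3 Ed
    pb_soft = q_function(math.sqrt(6.0 * ed_n0))
    p = q_function(math.sqrt(2.0 * ed_n0))    # error probability of one hard decision
    pb_hard = 3.0 * p ** 2 * (1.0 - p) + p ** 3
    return pb_soft, pb_hard


if __name__ == "__main__":
    for db in range(0, 11, 2):
        soft, hard = repetition3_error_probs(db)
        print(f"{db:2d} dB  soft={soft:.3e}  hard={hard:.3e}")
```

Note that $6E_d/N_0 = 2E_b/N_0$, so in terms of $E_b/N_0$ the soft decision result coincides with the expression $Q(\sqrt{2E_b/N_0})$ for uncoded binary PAM.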
6.2
To understand the operations of binary linear block codes, it is convenient to use the vector space viewpoint.
The binary field, or equivalently the Galois field of order 2, is denoted by $\mathbb{F}_2$ and contains two elements: 0 and 1. Its addition and multiplication are given by the rules of modulo-2 or mod-2 arithmetic. A vector space defined over the scalar field $\mathbb{F}_2$ is called a binary vector space. Examples of binary vector spaces are given below.
Example 6.1 The set of all binary $n$-tuples (i.e. vectors with $n$ components), denoted by $\mathbb{F}_2^n$, with componentwise mod-2 addition and mod-2 scalar multiplication is a binary vector space.
Example 6.2 Let $\{\mathbf{g}_1, \ldots, \mathbf{g}_k\}$ be a set of linearly independent vectors in $\mathbb{F}_2^n$, where $k \leq n$. Then the set of all binary linear combinations
\[
C = \left\{ \sum_{j=1}^{k} \alpha_j \mathbf{g}_j : \alpha_1, \ldots, \alpha_k \in \mathbb{F}_2 \right\}
\]
is a binary vector space.
For an $(n, k)$ code that contains all binary linear combinations of linearly independent vectors $\mathbf{g}_1, \ldots, \mathbf{g}_k$ in $\mathbb{F}_2^n$, we can define the $k \times n$ generator matrix $G$ such that its rows are $\mathbf{g}_1^T, \ldots, \mathbf{g}_k^T$, i.e.
\[
G = \begin{bmatrix} \mathbf{g}_1^T \\ \vdots \\ \mathbf{g}_k^T \end{bmatrix}.
\]
The encoding operation can then be viewed as computing the product of the information bit vector $\mathbf{b} = [b_1, \ldots, b_k]$ and $G$; the encoded sequence is $\mathbf{b}G$.$^4$
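\[
G = \begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 1
\end{bmatrix}
\]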
By convention, we write the information bit vector b as a row vector [?, p. 417].
shown below.
x16 = [0000]G = [0000000]
x1 = [0001]G = [0001011]
x2 = [0010]G = [0010110]
x3 = [0011]G = [0011101]
x4 = [0100]G = [0100111]
x5 = [0101]G = [0101100]
x6 = [0110]G = [0110001]
x7 = [0111]G = [0111010]
x8 = [1000]G = [1000101]
x9 = [1001]G = [1001110]
x10 = [1010]G = [1010011]
x11 = [1011]G = [1011000]
x12 = [1100]G = [1100010]
x13 = [1101]G = [1101001]
x14 = [1110]G = [1110100]
x15 = [1111]G = [1111111]
Note that, for convenience, we let the all-zero codeword be the 16th codeword so that the index of any other codeword corresponds to the decimal value of the information bits.
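The codeword list can be regenerated mechanically from the encoding rule $\mathbf{x} = \mathbf{b}G$. The sketch below assumes that the four rows of $G$ are the codewords listed for the information bits 1000, 0100, 0010, and 0001 (that is, $x_8$, $x_4$, $x_2$, $x_1$); the helper names are illustrative.

```python
from itertools import product

# Assumed generator matrix: rows are the codewords x8, x4, x2, x1 listed above.
G = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1, 1],
]


def encode(b, G):
    """Compute the codeword bG with mod-2 arithmetic."""
    n = len(G[0])
    return [sum(b[i] * G[i][j] for i in range(len(G))) % 2 for j in range(n)]


# Map each 4-bit information vector to its 7-bit codeword.
codebook = {bits: encode(bits, G) for bits in product([0, 1], repeat=4)}

if __name__ == "__main__":
    for bits, codeword in codebook.items():
        print("".join(map(str, bits)), "->", "".join(map(str, codeword)))
```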
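The Hamming metric or Hamming weight of a binary vector $\mathbf{x}$ in the binary vector space $\mathbb{F}_2^n$, denoted by $w_H(\mathbf{x})$, is defined as
\[
w_H(\mathbf{x}) = \text{number of ones in } \mathbf{x}. \tag{6.1}
\]
The Hamming distance between two binary vectors $\mathbf{x}$ and $\mathbf{y}$ in $\mathbb{F}_2^n$, denoted by $d_H(\mathbf{x}, \mathbf{y})$, is defined as
\[
d_H(\mathbf{x}, \mathbf{y}) = w_H(\mathbf{x} + \mathbf{y}), \tag{6.2}
\]
and can be thought of as the number of bit positions that are different between $\mathbf{x}$ and $\mathbf{y}$. For example, let $\mathbf{x} = [001]$ and $\mathbf{y} = [100]$. Both $\mathbf{x}$ and $\mathbf{y}$ have Hamming weight 1. Their Hamming distance is 2.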
An $(n, k)$ binary linear block code $C$ has the minimum Hamming distance $d$, and is called an $(n, k, d)$ binary linear block code if
\[
d = \min_{\mathbf{x}, \mathbf{y} \in C,\ \mathbf{x} \neq \mathbf{y}} d_H(\mathbf{x}, \mathbf{y}). \tag{6.3}
\]
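As an illustration of definitions (6.1) to (6.3), the sketch below computes the minimum Hamming distance by brute force over all pairs of codewords, and also the smallest weight among nonzero codewords for comparison. The generator matrix is an assumption: its rows are taken from the codewords $x_8$, $x_4$, $x_2$, $x_1$ of the (7,4) code listed earlier.

```python
from itertools import combinations, product


def hamming_weight(x):
    """Number of ones in x, as in (6.1)."""
    return sum(x)


def hamming_distance(x, y):
    """Weight of the mod-2 sum x + y, as in (6.2)."""
    return hamming_weight([(a + b) % 2 for a, b in zip(x, y)])


def all_codewords(G):
    """All binary linear combinations of the rows of G."""
    k, n = len(G), len(G[0])
    return [
        tuple(sum(b[i] * G[i][j] for i in range(k)) % 2 for j in range(n))
        for b in product([0, 1], repeat=k)
    ]


def minimum_distance(G):
    """Brute-force evaluation of (6.3) over all distinct codeword pairs."""
    return min(hamming_distance(x, y) for x, y in combinations(all_codewords(G), 2))


def minimum_nonzero_weight(G):
    return min(hamming_weight(x) for x in all_codewords(G) if any(x))


# Assumed generator matrix of the (7,4) code (rows x8, x4, x2, x1).
G74 = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1, 1],
]

if __name__ == "__main__":
    print(minimum_distance(G74), minimum_nonzero_weight(G74))
```

For this code both quantities agree, which is the kind of shortcut that linearity makes possible: the pairwise search needs $\binom{2^k}{2}$ comparisons, while the weight search needs only $2^k - 1$.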
Since an $(n, k, d)$ binary linear block code $C$ is itself a binary vector space, it has the closure property; a mod-2 addition of any two codewords in $C$ yields a codeword in $C$. This closure property allows us to easily identify $d$ as described next.
We first argue that, for an arbitrary codeword $\mathbf{y} \in C$, the set $C_{\mathbf{y}} = \{\mathbf{y} + \mathbf{x} : \mathbf{x} \in C\}$ is the same as $C$. To see this, note that each addition $\mathbf{y} + \mathbf{x}$ is a codeword in $C$ from the closure property. In addition, for any codeword $\mathbf{z} \in C$, we see that $\mathbf{z}$ is also in $C_{\mathbf{y}}$ since $\mathbf{z}$ is equal to $\mathbf{y} + (\mathbf{y} + \mathbf{z})$ and $\mathbf{y} + \mathbf{z}$ is in $C$ by the closure property.
Let $\mathcal{W}(C)$ be the set of Hamming weights of the codewords in $C$, i.e. $\mathcal{W}(C) = \{w_H(\mathbf{x}) : \mathbf{x} \in C\}$. Consider an arbitrary codeword $\mathbf{y} \in C$. Let $\mathcal{D}_{\mathbf{y}}(C)$ be the set of Hamming distances between $\mathbf{y}$ and all the codewords in $C$, i.e. $\mathcal{D}_{\mathbf{y}}(C) = \{w_H(\mathbf{x} + \mathbf{y}) : \mathbf{x} \in C\}$. Since $C_{\mathbf{y}} = C$, it follows that $\mathcal{D}_{\mathbf{y}}(C) = \mathcal{W}(C)$ for all $\mathbf{y}$. This observation yields the following theorem.
Theorem 6.1 (Minimum Hamming distance of binary linear block codes): An $(n, k, d)$ binary linear block code $C$ has the following properties.
binary PAM that maps bit 0 to amplitude $-\sqrt{E_d}$ and bit 1 to amplitude $+\sqrt{E_d}$, where $E_d$ is the expected energy per transmission of encoded bits. An $(n, k)$ binary linear block code $C$ that is a subspace of $\mathbb{F}_2^n$ can be mapped into a $2^k$-point signal set $S$ in the signal space $\mathbb{R}^n$ by using the mapping of the given binary PAM. We make the following observations from such mapping.
1. The set of all binary $n$-tuples $\mathbb{F}_2^n$ is mapped to the set of $2^n$ vertices of an $n$-cube centered at the origin and with side length $2\sqrt{E_d}$.
2. The set of codewords of an $(n, k)$ code $C$ is mapped to a subset of $2^k$ vertices of this $n$-cube.
We shall refer to the set of signal points $S$ obtained from the above mapping of $C$ to $\mathbb{R}^n$ as the signal space image of the code $C$.
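Example 6.9 Consider again the (3,2,2) SPC code $C = \{000, 011, 101, 110\}$. The corresponding signal set $S$ in $\mathbb{R}^3$ is
\[
S = \left\{
\sqrt{E_d}\begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix},
\sqrt{E_d}\begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix},
\sqrt{E_d}\begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix},
\sqrt{E_d}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}
\right\}
\]
which is illustrated in figure 6.2.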
\[
d_{\min} = 2\sqrt{E_d d}. \tag{6.5}
\]
Since for an $(n, k, d, N_d)$ code the number of codewords at Hamming distance $d$ from each codeword is $N_d$, the number of nearest neighbors to each signal point in $S$ is $N_d$. Therefore, the average number of nearest neighbors of the signal set $S$ is
\[
K_{\min} = N_d. \tag{6.6}
\]
Consider the transmission through an AWGN channel with noise PSD $N_0/2$. It follows that the union bound estimate for the symbol error probability $P_s$ of the signal set is equal to
\[
P_s \approx K_{\min}\, Q\left(\frac{d_{\min}/2}{\sqrt{N_0/2}}\right) = N_d\, Q\left(\sqrt{\frac{2E_d d}{N_0}}\right).
\]
To estimate the bit error probability $P_b$, we follow the convention of normalizing $P_s$ by a factor of $k$ to get the error probability per information bit [?, p. 2391], i.e. we can approximate$^5$
\[
P_b \approx \frac{1}{k} P_s = \frac{N_d}{k}\, Q\left(\sqrt{\frac{2E_d d}{N_0}}\right).
\]
$^5$ Note that this error probability per information bit is only an approximation, and is not the same as the bit error probability, which is difficult to obtain analytically.
\[
P_b \approx \frac{N_d}{k}\, Q\left(\sqrt{\frac{kd}{n} \cdot \frac{2E_b}{N_0}}\right). \tag{6.7}
\]
[Figure: $\log_{10} P_b$ versus $E_b/N_0$ (dB) for uncoded and coded transmissions.]
Table 6.1: Coding gains at $P_b = 10^{-6}$ for some known block codes [?, p. 2392].

(n, k, d, N_d)          n/k     coding gain (dB)
(8,4,4,14)              2.0     2.6
(16,5,8,30)             3.2     3.5
(24,12,8,759)           2.0     4.8
(32,6,16,62)            5.3     4.1
(32,16,8,620)           2.0     5.0
(64,7,32,126)           9.1     4.6
(64,22,16,2604)         2.9     6.0
(128,8,64,254)          16      5.0
(128,64,16,10668)       2.0     6.9
(128,64,16,94488)       2.0     6.9
(256,9,128,510)         28.4    5.4
(256,37,64,43180)       6.9     7.6
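A coding gain of this kind can be estimated numerically. The sketch below is one possible approach, not necessarily the one used to produce the table: it assumes the estimate $P_b \approx \frac{N_d}{k} Q\left(\sqrt{\frac{kd}{n}\cdot\frac{2E_b}{N_0}}\right)$ for the coded system and $Q\left(\sqrt{2E_b/N_0}\right)$ for uncoded binary PAM, finds the $E_b/N_0$ at which each reaches the target $P_b$ by bisection, and takes the difference in dB. The function names and the search interval are illustrative assumptions.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def pb_uncoded(eb_n0):
    return q_function(math.sqrt(2.0 * eb_n0))


def pb_coded(eb_n0, n, k, d, n_d):
    """Union bound estimate of the error probability per information bit."""
    return (n_d / k) * q_function(math.sqrt((k * d / n) * 2.0 * eb_n0))


def required_eb_n0_db(pb_func, target, lo_db=-5.0, hi_db=20.0):
    """Bisection on Eb/N0 in dB; pb_func must decrease with Eb/N0."""
    for _ in range(60):
        mid_db = 0.5 * (lo_db + hi_db)
        if pb_func(10.0 ** (mid_db / 10.0)) > target:
            lo_db = mid_db
        else:
            hi_db = mid_db
    return 0.5 * (lo_db + hi_db)


def coding_gain_db(n, k, d, n_d, target=1e-6):
    uncoded_db = required_eb_n0_db(pb_uncoded, target)
    coded_db = required_eb_n0_db(lambda x: pb_coded(x, n, k, d, n_d), target)
    return uncoded_db - coded_db


if __name__ == "__main__":
    print(round(coding_gain_db(8, 4, 4, 14), 2))
```

For the (8,4,4,14) code this calculation lands close to the 2.6 dB listed in the table; other entries may differ by about a tenth of a dB depending on how the published values were rounded or approximated.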
6.3
Similar to the discussion on binary linear block codes, we can discuss binary linear convolutional codes using the binary vector space $\mathbb{F}_2^n$ with mod-2 arithmetic. A binary linear convolutional code is defined by an encoding structure that contains a shift register; the outputs are linear combinations of the contents of the shift register.
A rate-$k/n$ binary convolutional code uses a shift register of length $Kk$, where $K$ is called the constraint length of the code. At each time step, $k$ information bits are shifted into the register, and $n$ encoded bits are formed as linear combinations of the $Kk$ bits in the register. Let vector $\mathbf{b} = [b_1, \ldots, b_{Kk}]$ contain the bits in the shift register. Define the $Kk \times n$ generator matrix $G$ such that the encoded bits are $\mathbf{x} = \mathbf{b}G$ at each time step, as illustrated in figure 6.3.
Figure 6.4: Encoding structure for the convolutional code in example 6.11.
Figure 6.6: State transition diagram for the convolutional code in example 6.11.
Figure 6.7: Trellis diagram for the convolutional code in example 6.11.
From each state, there are $2^k$ links indicating $2^k$ possible transitions. The $n$ output bits for that time step are labeled on the transition link.
Example 6.12 The state transition diagram of the rate-1/3 binary linear convolutional code in example 6.11 is shown in figure 6.6.
Another way to represent a binary linear convolutional code is to use the trellis diagram, which is essentially the state transition diagram with the states at different time steps drawn separately.
Example 6.13 For the rate-1/3 binary linear convolutional code in example 6.11, assuming that the original state is 00 and the final two information bits are zero, we can draw the trellis diagram as shown in figure 6.7. Note that, regardless of the current state, the appearance of two zero tail bits always leads to state 00.
Given six input bits 110101 and two tail bits 00, the encoded bit sequence corresponds to a path in the trellis diagram, as shown in figure 6.8. From the link labels along the path, we can read out the encoded bits, i.e. 111 010 001 011 101 011 101 100.
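The encoded sequence above can be reproduced with a short shift-register simulation. The generator matrix used in this sketch is an assumption inferred from the link labels of the example (current bit contributes 111, the most recent stored bit 101, the oldest stored bit 100); the ordering of the register contents and the function name are illustrative.

```python
# Assumed 3 x 3 generator matrix: row 0 is the current bit, rows 1 and 2 are
# the stored bits from most recent to oldest.
G = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
]


def conv_encode(info_bits, G, num_tail_bits=2):
    """Encode info bits followed by zero tail bits; returns one label per step."""
    state = [0] * (len(G) - 1)            # stored bits, starting from the all-zero state
    labels = []
    for bit in list(info_bits) + [0] * num_tail_bits:
        register = [bit] + state
        outputs = [
            sum(register[i] * G[i][j] for i in range(len(G))) % 2
            for j in range(len(G[0]))
        ]
        labels.append("".join(str(x) for x in outputs))
        state = [bit] + state[:-1]        # shift the register by one position
    return labels


if __name__ == "__main__":
    print(" ".join(conv_encode([1, 1, 0, 1, 0, 1], G)))
```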
Figure 6.8: Correspondence between an encoded bit sequence and a path in the trellis diagram.
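\[
\begin{aligned}
\hat{H} &= \arg\min_{m \in \{1,\ldots,2^J\}} \|\mathbf{r} - \mathbf{s}_m\|^2 \\
&= \arg\min_{m \in \{1,\ldots,2^J\}} \sum_{j=1}^{J+K-1} \|\mathbf{r}_j - \mathbf{s}_{m,j}\|^2 \\
&= \arg\min_{m \in \{1,\ldots,2^J\}} \left[ \sum_{j=1}^{J+K-1} \|\mathbf{r}_j\|^2 + \sum_{j=1}^{J+K-1} \|\mathbf{s}_{m,j}\|^2 - 2\sum_{j=1}^{J+K-1} \langle \mathbf{r}_j, \mathbf{s}_{m,j} \rangle \right].
\end{aligned}
\]
Notice that the first and the second terms are the same for all $m$ (the second because every transmitted amplitude is $\pm\sqrt{E_d}$). Therefore, we can simplify the decision rule to
\[
\hat{H} = \arg\min_{m \in \{1,\ldots,2^J\}} \; -\sum_{j=1}^{J+K-1} \langle \mathbf{r}_j, \mathbf{s}_{m,j} \rangle. \tag{6.9}
\]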
The Viterbi algorithm is an efficient recursive algorithm that can be used for soft-decision decoding for binary linear convolutional codes. The underlying ideas are based on the following observations.
1. If we assign to each branch or link $l$ in the trellis diagram a metric equal to $-\langle \mathbf{r}_l, \mathbf{x}_l \rangle$, where $\mathbf{r}_l$ and $\mathbf{x}_l$ are the $n$-dimensional observations and encoded outputs associated with that branch, then the optimal decision is equivalent to finding the shortest path through the trellis diagram from the all-zero state at time 1 to the all-zero state in the final time step.
2. The initial segment of the shortest path, say from time 1 to time $m$, must be the shortest path to whatever state $S_m$ is passed at time $m$; if there were any shorter path to $S_m$, it could replace the segment of the shortest path to create an even shorter path, yielding a contradiction. Therefore, it suffices at time $m$ to determine and keep, for each state $S_m$, only the shortest path from the all-zero state at time 1 to $S_m$. These paths are called survivors.
3. The survivors at time $m + 1$ can be found from the survivors at time $m$ using the following recursive operations:
(i) For each branch from a state at time $m$ to a state at time $m + 1$, add the metric of that branch to the metric of the time-$m$ survivor to get a candidate path metric at time $m + 1$.
(ii) For each state at time $m + 1$, compare the candidate path metrics arriving at that state and select the path with the smallest metric as the survivor. Store the survivor path history, i.e. the previous node along the survivor path, as well as its metric.
(iii) Since we have the all-zero state in the final time step, it is clear that there is only one survivor in the end.
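The recursion in the three observations above can be sketched in a few lines. This is a minimal illustration rather than an optimized decoder, under several assumptions: the rate-1/3 code with the generator matrix inferred from the link labels of example 6.13, the branch metric $-\langle \mathbf{r}_l, \mathbf{x}_l \rangle$ with $\mathbf{x}_l$ taken as the 0/1 encoded bits, binary PAM with bit 0 mapped to $-1$ and bit 1 to $+1$, and survivors stored as full bit histories for simplicity.

```python
import random

# Assumed generator matrix (current bit, then stored bits from newest to oldest).
G = [[1, 1, 1], [1, 0, 1], [1, 0, 0]]


def branch_output(bit, state):
    """Encoded bits for input `bit` when the stored bits are `state`."""
    register = (bit,) + state
    return [sum(register[i] * G[i][j] for i in range(3)) % 2 for j in range(3)]


def encode(info_bits):
    """Encode info bits followed by two zero tail bits."""
    state, out = (0, 0), []
    for bit in list(info_bits) + [0, 0]:
        out.extend(branch_output(bit, state))
        state = (bit, state[0])
    return out


def viterbi_decode(r, num_info_bits):
    """Soft-decision Viterbi decoding; r holds 3 real observations per step."""
    survivors = {(0, 0): (0.0, [])}  # state -> (path metric, input bits so far)
    for t in range(num_info_bits + 2):
        r_t = r[3 * t: 3 * t + 3]
        allowed = (0, 1) if t < num_info_bits else (0,)  # tail bits are zero
        candidates = {}
        for state, (metric, bits) in survivors.items():
            for bit in allowed:
                x = branch_output(bit, state)
                branch = -sum(ri * xi for ri, xi in zip(r_t, x))  # -<r_l, x_l>
                next_state = (bit, state[0])
                candidate = (metric + branch, bits + [bit])
                # Keep only the smallest-metric path into each state.
                if next_state not in candidates or candidate[0] < candidates[next_state][0]:
                    candidates[next_state] = candidate
        survivors = candidates
    return survivors[(0, 0)][1][:num_info_bits]


if __name__ == "__main__":
    rng = random.Random(1)
    info = [1, 1, 0, 1, 0, 1]
    received = [(2 * b - 1) + rng.gauss(0.0, 0.3) for b in encode(info)]
    print(viterbi_decode(received, len(info)))
```

A practical implementation would store only the previous node for each survivor, as described in step (ii), and trace back at the end; the full histories kept here trade memory for brevity.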
Figure 6.10: Modified state transition diagram for the derivation of the transfer function of the convolutional code in example 6.11.
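An error event corresponds to some finite error bit sequence $\mathbf{e}$. Suppose that there is only one error event, i.e. we can write $\mathbf{y} = \mathbf{x} + \mathbf{e}$. The probability of such a finite error bit sequence $\mathbf{e}$ is given by
\[
Q\left(\frac{\|\mathbf{s}(\mathbf{y}) - \mathbf{s}(\mathbf{x})\|/2}{\sqrt{N_0/2}}\right)
\]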
where $u$ is the Hamming weight of the link label for the transition from state $U$ to state $S$, and $v$ is the Hamming weight of the link label for the transition from state $V$ to state $S$. The transfer function $T(D)$ is defined as the ratio $S_{00}^d / S_{00}^s$. For the rate-1/3 convolutional code in example 6.11, the state transition equations are
\[
\begin{aligned}
S_{10} &= D^3 S_{00}^s + D^2 S_{01} \\
S_{11} &= D S_{10} + D^2 S_{11} \\
S_{01} &= D^2 S_{10} + D S_{11} \\
S_{00}^d &= D S_{01}
\end{aligned}
\]
From the above state transition equations, we can write the transfer function
\[
T(D) = \frac{S_{00}^d}{S_{00}^s} = \frac{2D^6 - D^8}{1 - D^2 - 2D^4 + D^6} = 2D^6 + D^8 + 5D^{10} + \ldots, \tag{6.10}
\]
from which the exponent of $D$ in the first term (with the smallest exponent) is $d_{\mathrm{free}}$ (= 6) and the coefficient of the first term is $N_{\mathrm{free}}$ (= 2).
In general, the powers of $D$ in $T(D)$ indicate the Hamming weights of error events, and the coefficient of $D^w$ indicates how many error events have Hamming weight $w$. For example, the transfer function in (6.10) indicates that there are 2 error events with Hamming weight 6, one error event with Hamming weight 8, 5 error events with Hamming weight 10, and so on.
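The series expansion in (6.10) can be verified by power series division of the numerator by the denominator. The sketch below represents each polynomial as a list of integer coefficients indexed by the power of $D$ and assumes the denominator has constant term 1; the function name is illustrative.

```python
def series_coefficients(num, den, terms):
    """First `terms` power series coefficients of num(D) / den(D).

    Both polynomials are lists indexed by the power of D, and den[0] must be 1.
    """
    assert den[0] == 1
    coeffs = []
    for n in range(terms):
        acc = num[n] if n < len(num) else 0
        for i in range(1, min(n, len(den) - 1) + 1):
            acc -= den[i] * coeffs[n - i]
        coeffs.append(acc)
    return coeffs


# T(D) = (2 D^6 - D^8) / (1 - D^2 - 2 D^4 + D^6) from (6.10)
numerator = [0, 0, 0, 0, 0, 0, 2, 0, -1]
denominator = [1, 0, -1, 0, -2, 0, 1]
coeffs = series_coefficients(numerator, denominator, 13)

# The smallest power with a nonzero coefficient gives d_free and N_free.
d_free = next(w for w, c in enumerate(coeffs) if c != 0)
n_free = coeffs[d_free]

if __name__ == "__main__":
    print({w: c for w, c in enumerate(coeffs) if c != 0}, d_free, n_free)
```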
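For error probability, we consider the probability of an error event starting at a given time, assuming that no error event is in progress. The union bound estimate for this probability is
\[
P_s \approx K_{\min}\, Q\left(\frac{d_{\min}/2}{\sqrt{N_0/2}}\right) = N_{\mathrm{free}}\, Q\left(\sqrt{\frac{2E_d d_{\mathrm{free}}}{N_0}}\right).
\]
Since $E_d = \frac{k}{n} E_b$, we can write $P_s \approx N_{\mathrm{free}}\, Q\left(\sqrt{\frac{k d_{\mathrm{free}}}{n} \cdot \frac{2E_b}{N_0}}\right)$. As with binary linear block codes, we can normalize $P_s$ by $k$ to obtain the error probability per information bit, denoted by $P_b$, as shown below.
\[
P_b \approx \frac{N_{\mathrm{free}}}{k}\, Q\left(\sqrt{\frac{k d_{\mathrm{free}}}{n} \cdot \frac{2E_b}{N_0}}\right) \tag{6.11}
\]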
By comparing $P_b$ in (6.11) to $P_b^{\mathrm{UNCODED}}$ in (6.8) for uncoded binary PAM, we can quantify the coding gain, as illustrated in the following example.
Example 6.15 For the rate-1/3 convolutional code in example 6.11, we have $d_{\mathrm{free}} = 6$ and $N_{\mathrm{free}} = 2$. Therefore,
\[
P_b \approx 2Q\left(\sqrt{\frac{4E_b}{N_0}}\right).
\]
From the plot of $P_b$ versus $E_b/N_0$ in figure 6.11, we see that the coding gain is about 2.6 dB for $P_b = 10^{-5}$.
Figure 6.11: Error performance of the rate-1/3 convolutional code in example 6.11 ($\log_{10} P_b$ versus $E_b/N_0$ in dB, uncoded and coded).
Table 6.2 shows the coding gains for some known codes with soft decision decoding. These gains are computed based on the estimate of $P_b$ in (6.11) and the baseline $P_b^{\mathrm{UNCODED}}$ in (6.8). As previously mentioned, we shall not discuss the theory behind the construction of these codes in this course.
6.4
Summary
We discussed two fundamental types of channel coding: binary linear block codes and binary linear convolutional codes. For the purpose of the course, we did not discuss how to construct these codes, but discussed how to evaluate the error performance of given codes. By comparing to uncoded binary PAM, we see that a rate-$k/n$ channel code can reduce the requirement of $E_b/N_0$ for a given bit error rate by the amount called the coding gain at the expense of using more bandwidth by a factor of $n/k$.
Channel coding is a rich subject that requires a separate course to master the material. For more information, see for example the online materials of MIT OpenCourseWare available for free at ocw.mit.edu.
Table 6.2: Coding gains at $P_b = 10^{-6}$ for some known convolutional codes [?, p. 2393].

(n, k, K, d_free, N_free)    n/k    coding gain (dB)
(2,1,1,3,1)                  2.0    1.8
(2,1,2,5,1)                  2.0    4.0
(2,1,3,6,1)                  2.0    4.8
(2,1,4,7,2)                  2.0    5.2
(2,1,5,8,2)                  2.0    5.8
(2,1,6,10,12)                2.0    6.3
(2,1,7,10,1)                 2.0    7.0
(2,1,8,12,10)                2.0    7.1
(3,1,1,5,1)                  3.0    2.2
(3,1,2,8,2)                  3.0    4.1
(3,1,3,10,3)                 3.0    4.9
(3,1,4,12,5)                 3.0    5.6
(3,1,5,13,1)                 3.0    6.4
(3,1,6,15,3)                 3.0    6.7
(3,1,7,16,1)                 3.0    7.3
(3,1,8,18,5)                 3.0    7.4
(4,1,1,7,1)                  4.0    2.4
(4,1,2,10,1)                 4.0    4.0
(4,1,3,13,2)                 4.0    4.9
(4,1,4,16,4)                 4.0    5.6
(4,1,5,18,3)                 4.0    6.2
(4,1,6,20,10)                4.0    6.4
(4,1,7,22,1)                 4.0    7.4
(4,1,8,24,2)                 4.0    7.6

6.5
Practice Problems
Problem 6.1 (Bit error probability of a binary linear block code): Consider the generator matrix $G$ for an $(n, k, d, N_d)$ binary linear block code given below.
\[
G = \begin{bmatrix}
1 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 1
\end{bmatrix}
\]
(a) Specify the values of $n$, $k$, $d$, and $N_d$.
(b) Assume that the uncoded binary PAM system has the bit error probability $Q\left(\sqrt{\frac{2E_b}{N_0}}\right)$, where $E_b$ is the energy per bit and $N_0/2$ is the PSD of AWGN. Find the union bound estimate of the error probability per bit $P_b$ in terms of $E_b/N_0$ for the above block code.
(c) Compute the coding gain (numerically in dB) for this block code at $P_b = 10^{-4}$. NOTE: You will need a calculator to evaluate the Q function, e.g. MATLAB.
Problem 6.2: Suppose you have to choose between two binary linear block codes with the following generator matrices. Based on the union bound estimate of the error probability per bit, which one would you choose and why?
\[
G_1 = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 \end{bmatrix}, \quad
G_2 = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \end{bmatrix}
\]
\[
G = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}
\]
(a) Draw the state transition diagram for this code.
(b) Assume that we use binary PAM transmission through an AWGN channel with the expected energy per dimension (or per transmission) $E_d = 1$. In addition, bits 0 and 1 are mapped to amplitudes $-1$ and $1$ respectively. Assume that 3 information bits followed by 2 zero tail bits are transmitted. Suppose that the received signals are given by
(0.1, 0.4, 0.2, 0.3, 0.2, 0.4, 0.1, 0.4, 0.2, 0.1, 0.1, 0.3, 0.2, 0.1, 0.3).
Use the Viterbi algorithm to perform soft decision decoding of this received sequence. Identify the 3 information bits.
(c) Assume that uncoded binary PAM transmission has the bit error probability $Q\left(\sqrt{\frac{2E_b}{N_0}}\right)$, where $E_b$ is the energy per bit and $N_0/2$ is the PSD of AWGN. Specify $d_{\mathrm{free}}$ and $N_{\mathrm{free}}$, and use the union bound estimate to express the error probability per bit $P_b$.
[Figure: encoder structure with a current bit and stored bits; 1 input bit is shifted in and 2 output bits are produced in each step.]
Chapter 7
Capacities of Communication Channels
In this chapter, we discuss fundamental limits on transmission rates of digital communication systems. We shall focus on discrete-time channel models. In the first part, we shall consider discrete channels, where both inputs and outputs of the communication systems belong to finite sets of possible values. In the second part, we shall consider discrete-time additive white Gaussian noise (AWGN) channels, where outputs are inputs plus Gaussian noise random variables (RVs) and are therefore continuous RVs. To treat AWGN channels, we introduce the differential entropy and its related definitions. After that, we derive the channel capacity formula for AWGN channels.
Since the discussion involves information theory, reviewing the concept of the entropy and its basic properties can be helpful. In addition, we shall introduce the mutual information, which is another important fundamental quantity in information theory.
7.1
(7.1)
(7.2)
j=1
7.2
Mutual Information
\[
I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{f_{X,Y}(x, y)}{f_X(x) f_Y(y)}. \tag{7.3}
\]
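The double sum in (7.3) translates directly into code. The sketch below assumes base-2 logarithms (so that the result is in bits) and uses a binary symmetric channel with crossover probability `eps` and a given input PMF purely as an illustrative joint PMF; neither the channel nor the function names come from the text above.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits from a joint PMF given as a 2-D list joint[x][y]."""
    f_x = [sum(row) for row in joint]
    f_y = [sum(col) for col in zip(*joint)]
    total = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0.0:  # terms with zero probability contribute nothing
                total += p * math.log2(p / (f_x[x] * f_y[y]))
    return total


def bsc_joint(eps, p1=0.5):
    """Joint PMF of (X, Y) for an assumed binary symmetric channel."""
    return [
        [(1.0 - p1) * (1.0 - eps), (1.0 - p1) * eps],
        [p1 * eps, p1 * (1.0 - eps)],
    ]


def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.3, 0.5):
        print(eps, round(mutual_information(bsc_joint(eps)), 4))
```

With equally likely inputs, the value for this assumed channel equals $1 - H_b(\text{eps})$, where $H_b$ is the binary entropy function, and it drops to zero when the output is independent of the input.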
\[
I(X; Y|Z) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} f_{X,Y,Z}(x, y, z) \log \frac{f_{X,Y|Z}(x, y|z)}{f_{X|Z}(x|z) f_{Y|Z}(y|z)} \tag{7.4}
\]
The conditional mutual information $I(X; Y|Z)$ can be considered as the amount of information that $Y$ provides about $X$ given $Z$.
As for the definition of conditional entropy, the definition of conditional mutual information can be extended in a straightforward fashion to more than two RVs. The following theorem provides a useful identity for multiple RVs. The proof can be done using induction.
Theorem 7.1 (Chain rule for mutual information): Consider discrete RVs $X_1, \ldots, X_n, Y$.
\[
I(X_1, \ldots, X_n; Y) = I(X_1; Y) + \sum_{j=2}^{n} I(X_j; Y | X_1, \ldots, X_{j-1})
\]
[Figure: curves of $H(X|Y)$ and $I(X;Y)$ for parameter values 0, 0.1, 0.3, and 0.5.]
7.3
Capacity of a DMC
An $(n, k)$ channel code (binary or nonbinary) for the DMC with $\mathcal{X}$, $\mathcal{Y}$, and $f_{Y|X}(y|x)$ consists of the following.
1. The index set $\{1, \ldots, 2^k\}$ for $2^k$ possible hypotheses or signal points. Let $U$ be the hypothesis RV.$^1$
2. An encoding mapping $\mathbf{x} : \{1, \ldots, 2^k\} \to \mathcal{X}^n$ that maps each index $m$ in $\{1, \ldots, 2^k\}$ to a codeword $\mathbf{x}_m = (x_{m,1}, \ldots, x_{m,n})$ in $\mathcal{X}^n$. The set of codewords $C$ is called a codebook or simply a code.
3. A decoding mapping $\hat{u} : \mathcal{Y}^n \to \{1, \ldots, 2^k\}$ that maps each possible received vector $\mathbf{y} = (y_1, \ldots, y_n)$ in $\mathcal{Y}^n$ to a hypothesis index $\hat{u}(\mathbf{y})$ in $\{1, \ldots, 2^k\}$.
Let $\lambda_m = \Pr\{\hat{u}(\mathbf{Y}) \neq U \mid U = m\}$ be the decision error probability under hypothesis $m$. Let $\lambda^n_{\max} = \max_{m \in \{1,\ldots,2^k\}} \lambda_m$ be the maximum error probability for an $(n, k)$ code. Let $P_e^n = \frac{1}{2^k}\sum_{m=1}^{2^k} \lambda_m$ be the average error probability for an $(n, k)$ code. Note that the rate of an $(n, k)$ code is $k/n$ bit/dimension, or equivalently $k/n$ bits per transmission.
Let $R$ be the information bit rate (in bit/dimension). We say that $R$ is achievable if there is a sequence of $(n, Rn)$ codes with $\lambda^n_{\max} \to 0$ as $n \to \infty$. The channel capacity of a DMC with $\mathcal{X}$, $\mathcal{Y}$, and $f_{Y|X}(y|x)$, denoted by $C$, is defined as
\[
C = \max_{f_X(x)} I(X; Y), \tag{7.5}
\]
where $I(X; Y)$ denotes the mutual information between $X$ and $Y$. Note that the maximization is performed over all input PMFs. In a later section, we shall prove the following theorem on the capacity of a DMC.
Theorem 7.2 (Channel coding theorem for a DMC): All rates $R < C$ are achievable. Conversely, if a rate $R$ is achievable, then $R \leq C$.
Before we can justify theorem 7.2, we need to develop additional analytical tools in information theory. The next two sections are for this purpose.
$^1$ Unlike in the previous two chapters, we use $U$ instead of $H$ to denote the hypothesis RV because $H$ will denote the entropy in this chapter.
7.4
First, let us briefly review the asymptotic equipartition property (AEP). Consider a sequence of independent and identically distributed (IID) RVs $X_1, X_2, \ldots$, where each $X_j$ takes its value in the set $\mathcal{X}$ according to the PMF $f_X(x)$. Let $H(X)$ denote the entropy of $X$. The typical set $T_\epsilon^n$ with respect to the PMF $f_X(x)$ is the set of sequences $\mathbf{x} = (x_1, \ldots, x_n)$ that satisfy the inequality
\[
2^{-n(H(X)+\epsilon)} < f_{\mathbf{X}}(\mathbf{x}) < 2^{-n(H(X)-\epsilon)},
\]
or equivalently $\left| -\frac{1}{n} \log f_{\mathbf{X}}(\mathbf{x}) - H(X) \right| < \epsilon$.
The AEP states that, for sufficiently large $n$, we have the following properties.
1. $\Pr\{T_\epsilon^n\} > 1 - \epsilon$.
2. $(1 - \epsilon)2^{n(H(X)-\epsilon)} < |T_\epsilon^n| < 2^{n(H(X)+\epsilon)}$.
Roughly speaking, as n , there are about 2nH(X) equally likely sequences. Now
consider a sequence of IID pairs of RVs (X1 , Y1 ), (X2 , Y2 ), . . ., where each (Xj , Yj )
takes its value in the set X Y with joint PMF fX,Y (x, y). The jointly typical set
An with respect to fX,Y (x, y) is the set of sequences (x, y) = (x1 , . . . , xn , y1 , . . . , yn )
such that
1
1
log f X (x) H(X) < , log f Y (y) H(Y ) < ,
n
n
1
log f X,Y (x, y) H(X, Y ) < .
(7.6)
n
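The membership test for the jointly typical set can be written directly from its three defining conditions. The sketch below is a minimal illustration: the function name, the dictionary representation of the joint PMF, the use of base-2 logarithms, and the example PMF (uniform binary input through a BSC with crossover probability 0.1) are all assumptions made for this example.

```python
import math

def is_jointly_typical(xs, ys, pxy, eps):
    """Return True if (xs, ys) meets the three conditions of joint typicality.

    pxy maps (x, y) pairs to joint probabilities; logarithms are base 2.
    """
    n = len(xs)
    px, py = {}, {}
    for (x, y), p in pxy.items():                 # marginal PMFs of X and Y
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    def entropy(pmf):
        return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

    def sample_entropy(pmf, seq):
        # -(1/n) log2 of the probability of an IID sequence
        total = 0.0
        for s in seq:
            p = pmf.get(s, 0.0)
            if p == 0.0:
                return math.inf                   # impossible sequences are never typical
            total -= math.log2(p)
        return total / n

    pairs = list(zip(xs, ys))
    return (abs(sample_entropy(px, xs) - entropy(px)) < eps
            and abs(sample_entropy(py, ys) - entropy(py)) < eps
            and abs(sample_entropy(pxy, pairs) - entropy(pxy)) < eps)

# Hypothetical joint PMF: uniform binary input through a BSC with crossover 0.1
pxy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
xs = [0, 1] * 10
ys_two_flips = [1 - xs[0], 1 - xs[1]] + xs[2:]   # 2 of 20 positions differ
print(is_jointly_typical(xs, ys_two_flips, pxy, 0.05))
print(is_jointly_typical(xs, xs, pxy, 0.05))
```

In this example a pair that disagrees in 10% of the positions matches the channel statistics and passes the test, whereas a pair with no disagreements at all fails the joint condition even though each sequence is typical on its own.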
We next derive the AEP for jointly typical sequences. This AEP will be useful in
proving the forward part of theorem 7.2.
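Theorem 7.3 (Joint AEP): For sufficiently large $n$, we have the following.
1. $\Pr\{A_\epsilon^n\} > 1 - \epsilon$.
2. $(1 - \epsilon) 2^{n(H(X,Y)-\epsilon)} < |A_\epsilon^n| < 2^{n(H(X,Y)+\epsilon)}$.
3. Let $(\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) = (\tilde{X}_1, \ldots, \tilde{X}_n, \tilde{Y}_1, \ldots, \tilde{Y}_n)$ be a sequence of RVs. If their joint PMF satisfies $f_{\tilde{\mathbf{X}},\tilde{\mathbf{Y}}}(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) = f_{\mathbf{X}}(\tilde{\mathbf{x}}) f_{\mathbf{Y}}(\tilde{\mathbf{y}})$, then
\[ (1 - \epsilon) 2^{-n(I(X;Y)+3\epsilon)} < \Pr\left\{ (\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) \in A_\epsilon^n \right\} < 2^{-n(I(X;Y)-3\epsilon)}. \]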
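Proof:
1. Define the RV $W_j = -\log f_X(X_j)$. Note that $E[W_j] = H(X)$. Let $\sigma_W^2$ denote the variance of $W_j$. From the weak law of large numbers (WLLN), we can write
\[ \Pr\left\{ \left| \frac{1}{n} \sum_{j=1}^{n} W_j - H(X) \right| \ge \epsilon \right\} \le \frac{\sigma_W^2}{n \epsilon^2}. \]
By choosing $n$ large enough, say $n \ge n_1$, so that $\frac{\sigma_W^2}{n \epsilon^2} < \frac{\epsilon}{3}$, we have
\[ \Pr\left\{ \left| -\frac{1}{n} \log f_{\mathbf{X}}(\mathbf{X}) - H(X) \right| \ge \epsilon \right\} < \frac{\epsilon}{3}. \tag{7.7} \]
Similarly, we can choose $n$ large enough so that
\[ \Pr\left\{ \left| -\frac{1}{n} \log f_{\mathbf{Y}}(\mathbf{Y}) - H(Y) \right| \ge \epsilon \right\} < \frac{\epsilon}{3}, \tag{7.8} \]
\[ \Pr\left\{ \left| -\frac{1}{n} \log f_{\mathbf{X},\mathbf{Y}}(\mathbf{X}, \mathbf{Y}) - H(X, Y) \right| \ge \epsilon \right\} < \frac{\epsilon}{3}. \tag{7.9} \]
By the union bound, (7.7)-(7.9) imply that $\Pr\{A_\epsilon^n\} > 1 - \epsilon$ for sufficiently large $n$.
2. For every $(\mathbf{x}, \mathbf{y}) \in A_\epsilon^n$, $f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) > 2^{-n(H(X,Y)+\epsilon)}$, so that $1 \ge \sum_{(\mathbf{x},\mathbf{y}) \in A_\epsilon^n} f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) > |A_\epsilon^n| \, 2^{-n(H(X,Y)+\epsilon)}$, yielding $|A_\epsilon^n| < 2^{n(H(X,Y)+\epsilon)}$. From statement 1 of the theorem and $f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) < 2^{-n(H(X,Y)-\epsilon)}$, with $n$ large enough,
\[ 1 - \epsilon < \sum_{(\mathbf{x},\mathbf{y}) \in A_\epsilon^n} f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) < |A_\epsilon^n| \, 2^{-n(H(X,Y)-\epsilon)}, \]
yielding $|A_\epsilon^n| > (1 - \epsilon) 2^{n(H(X,Y)-\epsilon)}$.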
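3. From the assumption that $f_{\tilde{\mathbf{X}},\tilde{\mathbf{Y}}}(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) = f_{\mathbf{X}}(\tilde{\mathbf{x}}) f_{\mathbf{Y}}(\tilde{\mathbf{y}})$, we can write
\[ \Pr\left\{ (\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) \in A_\epsilon^n \right\} = \sum_{(\tilde{\mathbf{x}},\tilde{\mathbf{y}}) \in A_\epsilon^n} f_{\mathbf{X}}(\tilde{\mathbf{x}}) f_{\mathbf{Y}}(\tilde{\mathbf{y}}). \]
From $|A_\epsilon^n| < 2^{n(H(X,Y)+\epsilon)}$, $f_{\mathbf{X}}(\mathbf{x}) < 2^{-n(H(X)-\epsilon)}$, and $f_{\mathbf{Y}}(\mathbf{y}) < 2^{-n(H(Y)-\epsilon)}$,
\[ \Pr\left\{ (\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) \in A_\epsilon^n \right\} < 2^{n(H(X,Y)+\epsilon)} \, 2^{-n(H(X)-\epsilon)} \, 2^{-n(H(Y)-\epsilon)} = 2^{-n(I(X;Y)-3\epsilon)}. \]
From $|A_\epsilon^n| > (1 - \epsilon) 2^{n(H(X,Y)-\epsilon)}$, $f_{\mathbf{X}}(\mathbf{x}) > 2^{-n(H(X)+\epsilon)}$, and $f_{\mathbf{Y}}(\mathbf{y}) > 2^{-n(H(Y)+\epsilon)}$,
\[ \Pr\left\{ (\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) \in A_\epsilon^n \right\} > (1 - \epsilon) 2^{n(H(X,Y)-\epsilon)} \, 2^{-n(H(X)+\epsilon)} \, 2^{-n(H(Y)+\epsilon)} = (1 - \epsilon) 2^{-n(I(X;Y)+3\epsilon)}. \]
Combining the above two inequalities, we obtain statement 3.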
Roughly speaking, the joint AEP tells us that there are about $2^{nH(X)}$ typical $\mathbf{x}$-sequences, about $2^{nH(Y)}$ typical $\mathbf{y}$-sequences, but only about $2^{nH(X,Y)}$ jointly typical $(\mathbf{x}, \mathbf{y})$-sequences. Thus, the probability that a randomly selected $(\mathbf{x}, \mathbf{y})$-sequence is jointly typical is about
\[ \frac{2^{nH(X,Y)}}{2^{nH(X)} \, 2^{nH(Y)}} = 2^{-nI(X;Y)}. \]
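The exponent in this rough statement can be checked exactly in a small special case. The sketch below assumes a uniform binary input through a BSC (an illustrative choice, not from the text). With uniform marginals, every binary sequence is typical on its own, so joint typicality depends only on the number of positions where the two sequences differ, and the probability for an independently drawn pair reduces to a sum of binomial terms.

```python
import math

def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def independent_pair_typical_prob(n, crossover, eps):
    """Exact probability that independent uniform binary sequences of length n
    are jointly typical with respect to a uniform-input BSC joint PMF."""
    h_xy = 1.0 + binary_entropy(crossover)        # H(X, Y) = H(X) + H(Y|X)
    log_same = math.log2(0.5 * (1 - crossover))   # log2 f_{X,Y}(x, y) when y == x
    log_diff = math.log2(0.5 * crossover)         # log2 f_{X,Y}(x, y) when y != x
    count = 0
    for d in range(n + 1):                        # d = number of positions with y != x
        sample = -((n - d) * log_same + d * log_diff) / n
        if abs(sample - h_xy) < eps:
            count += math.comb(n, d)              # patterns of d disagreements
    return count / 2**n                           # each pattern has probability 2^-n

n, crossover, eps = 100, 0.1, 0.05
prob = independent_pair_typical_prob(n, crossover, eps)
i_xy = 1 - binary_entropy(crossover)
print(-math.log2(prob) / n, i_xy)
```

For these values the normalized exponent is close to the mutual information, and the probability lies between the two bounds of statement 3 of the joint AEP.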
7.5
In this section, we establish two inequalities that will be useful for proving the
converse part of theorem 7.2.
Three RVs $X, Y, Z$ form a Markov chain, denoted by $X \to Y \to Z$, if the conditional PMF of $Z$ given $X$ and $Y$ depends only on $Y$, i.e.
\[ f_{Z|Y,X}(z|y, x) = f_{Z|Y}(z|y). \]
In particular, consider three RVs $X, Y, Z$. If $Z$ is a function of $Y$, i.e. $Z = g(Y)$, then $X, Y, Z$ form a Markov chain. We now establish a useful result called the data processing inequality.
Theorem 7.4 (Data processing inequality): If $X \to Y \to Z$, then $I(X; Y) \ge I(X; Z)$ and $I(Y; Z) \ge I(X; Z)$.
Proof: Using the chain rule for mutual information, we can expand $I(X; Y, Z)$ in two ways, as shown below.
\[ I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y) \]
Since $X \to Y \to Z$, $f_{X,Z|Y}(x, z|y) = f_{X|Y}(x|y) f_{Z|Y,X}(z|y, x) = f_{X|Y}(x|y) f_{Z|Y}(z|y)$. Thus, $X$ and $Z$ are independent given $Y$. Consequently, $I(X; Z|Y) = 0$ and
\[ I(X; Z) + I(X; Y|Z) = I(X; Y). \]
Since $I(X; Y|Z) \ge 0$ (nonnegativity of mutual information), $I(X; Y) \ge I(X; Z)$. The proof of $I(Y; Z) \ge I(X; Z)$ is similar (start with $I(Z; Y, X)$ instead of with $I(X; Y, Z)$) and is thus omitted.
Note that, when $Z = g(Y)$, the data processing inequality tells us that no clever processing of data $Y$ can increase the amount of information about $X$ from that already contained in $Y$.
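The data processing inequality can be checked numerically on a simple Markov chain. The sketch below assumes a uniform binary input passed through two cascaded BSCs; the crossover probabilities `eps1` and `eps2` and the helper names are hypothetical values chosen for illustration.

```python
import math
from itertools import product

def mutual_information(pab):
    """I(A;B) in bits from a joint PMF given as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in pab.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items() if p > 0)

def bsc(inp, out, eps):
    """Transition probability of a BSC with crossover probability eps."""
    return 1 - eps if inp == out else eps

eps1, eps2 = 0.1, 0.2                             # hypothetical crossover probabilities
pxy, pyz, pxz = {}, {}, {}
for x, y, z in product((0, 1), repeat=3):
    p = 0.5 * bsc(x, y, eps1) * bsc(y, z, eps2)   # f(x) f(y|x) f(z|y): X -> Y -> Z
    pxy[(x, y)] = pxy.get((x, y), 0.0) + p
    pyz[(y, z)] = pyz.get((y, z), 0.0) + p
    pxz[(x, z)] = pxz.get((x, z), 0.0) + p

i_xy, i_yz, i_xz = (mutual_information(t) for t in (pxy, pyz, pxz))
print(i_xy, i_yz, i_xz)
```

The end-to-end mutual information `i_xz` should not exceed either `i_xy` or `i_yz`, in agreement with the theorem.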
Consider now a discrete RV $X$ taking a value from the set $\mathcal{X}$. Suppose we guess the value of $X$ from a RV $Y$ that is related to $X$ by the conditional PMF $f_{Y|X}(y|x)$. Let $\hat{X} = g(Y)$ be the guess. Let $P_e = \Pr\{\hat{X} \neq X\}$ be the probability of error. Intuitively, if the conditional entropy $H(X|Y)$ is small, then $P_e$ should be small, and vice versa. The Fano inequality quantifies the bound on $P_e$ based on $H(X|Y)$.
Theorem 7.5 (Fano inequality): The probability of error $P_e$ in guessing $X$ from $Y$ is bounded by
\[ 1 + P_e \log(|\mathcal{X}| - 1) \ge H(X|Y), \quad \text{or equivalently} \quad P_e \ge \frac{H(X|Y) - 1}{\log(|\mathcal{X}| - 1)}. \]
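A small numerical check of the Fano inequality is sketched below. It assumes a uniform input on four symbols and a symmetric channel in which the output equals the input with probability 0.8 and is otherwise uniform over the remaining symbols; the guess is taken to be the output itself, so the error probability equals 0.2. The channel, the function names, and the use of base-2 logarithms are assumptions for illustration.

```python
import math

def conditional_entropy(pxy):
    """H(X|Y) in bits from a joint PMF given as {(x, y): probability}."""
    py = {}
    for (x, y), p in pxy.items():
        py[y] = py.get(y, 0.0) + p
    return -sum(p * math.log2(p / py[y]) for (x, y), p in pxy.items() if p > 0)

def symmetric_channel_joint(size, p_err):
    """Uniform X on {0, ..., size-1}; Y = X w.p. 1 - p_err, else uniform over the others."""
    pxy = {}
    for x in range(size):
        for y in range(size):
            p_y_given_x = 1 - p_err if y == x else p_err / (size - 1)
            pxy[(x, y)] = p_y_given_x / size
    return pxy

size, p_err = 4, 0.2
pxy = symmetric_channel_joint(size, p_err)
h_x_given_y = conditional_entropy(pxy)
fano_lhs = 1 + p_err * math.log2(size - 1)        # 1 + Pe log(|X| - 1)
pe_lower_bound = (h_x_given_y - 1) / math.log2(size - 1)
print(h_x_given_y, fano_lhs, pe_lower_bound)
```

Here the conditional entropy stays below the left-hand side of the inequality, and the implied lower bound on the error probability is below the actual value of 0.2.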
7.6
We now prove theorem 7.2 in two steps. First, we justify the forward part of the
theorem by showing that any information bit rate R < C is achievable. Then we
justify the converse part of the theorem by showing that any rate R > C is not
achievable.
Achievability of R < C
The proof that $R < C$ is achievable is based on the random coding argument. Consider the transmission scheme described below.
Transmission and detection: We first randomly generate an $(n, Rn)$ code $\mathcal{C}$ with $2^{Rn}$ codewords according to PMF $f_X(x)$.$^2$ In particular, we can write the codebook as a matrix
\[ \mathcal{C} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{2^{Rn},1} & x_{2^{Rn},2} & \cdots & x_{2^{Rn},n} \end{bmatrix}, \]
where the entries are generated IID with PMF $f_X(x)$, and the codewords are listed as the rows of the matrix. Note that the probability of obtaining a particular $\mathcal{C}$ is given by
\[ \Pr\{\mathcal{C}\} = \prod_{m=1}^{2^{Rn}} \prod_{j=1}^{n} f_X(x_{m,j}). \]
Assume that the transmitter and the receiver both know the code $\mathcal{C}$ and the channel description $f_{Y|X}(y|x)$. The receiver observes $\mathbf{y} = (y_1, \ldots, y_n)$ and uses typical set decoding; the decision is to set $\hat{U}(\mathbf{y}) = m$ if the following two conditions hold.
1. $(\mathbf{x}_m, \mathbf{y})$ is jointly typical, i.e. $(\mathbf{x}_m, \mathbf{y}) \in A_\epsilon^n$.
2. There is no other index $m' \neq m$ such that $(\mathbf{x}_{m'}, \mathbf{y})$ is jointly typical.
Otherwise, the receiver declares an error, i.e. the transmission is not successful.
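Analysis of error probability: Let $\Pr\{\mathcal{E}\}$ denote the error probability averaged over all the codewords as well as over all the codebooks, i.e.
\[ \Pr\{\mathcal{E}\} = \sum_{\mathcal{C}} \Pr\{\mathcal{C}\} P_e^n(\mathcal{C}), \]
where $P_e^n(\mathcal{C})$ is the error probability averaged over all the codewords for codebook $\mathcal{C}$. From the definition $P_e^n = \frac{1}{2^{Rn}} \sum_{m=1}^{2^{Rn}} \lambda_m$, where $\lambda_m = \Pr\{\hat{U}(\mathbf{y}) \neq U \mid U = m\}$,
\[ \Pr\{\mathcal{E}\} = \frac{1}{2^{Rn}} \sum_{m=1}^{2^{Rn}} \sum_{\mathcal{C}} \Pr\{\mathcal{C}\} \lambda_m(\mathcal{C}). \]
By the symmetry of the code construction, $\sum_{\mathcal{C}} \Pr\{\mathcal{C}\} \lambda_m(\mathcal{C})$ does not depend on $m$. Therefore, we can write
\[ \Pr\{\mathcal{E}\} = \sum_{\mathcal{C}} \Pr\{\mathcal{C}\} \lambda_1(\mathcal{C}) = \Pr\{\mathcal{E} \mid U = 1\}. \]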
$^2$ For simplicity, we assume that $Rn$ is an integer. Note that this is always possible for large $n$ if $R$ is rational.
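Let $\mathcal{E}_m$ denote the event that $(\mathbf{x}_m, \mathbf{y})$ is jointly typical, i.e. $(\mathbf{x}_m, \mathbf{y}) \in A_\epsilon^n$, and let $\mathcal{E}_m^c$ denote the complement of $\mathcal{E}_m$. Under typical set decoding, an error under hypothesis 1 occurs when $\mathcal{E}_1^c$ or any of $\mathcal{E}_2, \ldots, \mathcal{E}_{2^{Rn}}$ occurs. Using the union bound,
\[ \Pr\{\mathcal{E} \mid U = 1\} \le \Pr\{\mathcal{E}_1^c\} + \sum_{m=2}^{2^{Rn}} \Pr\{\mathcal{E}_m\}. \]
From joint AEP (statement 1 of theorem 7.3), $\Pr\{\mathcal{E}_1^c\} < \epsilon$ for large $n$. From the code construction, for $m \neq 1$, $\mathbf{x}_m$ and $\mathbf{y}$ are independent, i.e. their joint PMF is $f_{\mathbf{X}}(\mathbf{x}_m) f_{\mathbf{Y}}(\mathbf{y})$. From joint AEP (statement 3 of theorem 7.3), $\Pr\{\mathcal{E}_m\} < 2^{-n(I(X;Y)-3\epsilon)}$. It follows that we can bound $\Pr\{\mathcal{E} \mid U = 1\}$ as follows.
\[ \Pr\{\mathcal{E} \mid U = 1\} < \epsilon + \sum_{m=2}^{2^{Rn}} 2^{-n(I(X;Y)-3\epsilon)} = \epsilon + (2^{Rn} - 1) 2^{-n(I(X;Y)-3\epsilon)} < \epsilon + 2^{-n(I(X;Y)-3\epsilon-R)} \]
If $R < I(X; Y)$, then it is possible to choose $\epsilon$ small and $n$ large enough so that $2^{-n(I(X;Y)-3\epsilon-R)} < \epsilon$, yielding $\Pr\{\mathcal{E} \mid U = 1\} < 2\epsilon$.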
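To get a feel for how large the block length must be, the sketch below finds the smallest n for which the second term of the bound, 2 to the power -n(I(X;Y) - 3 eps - R), drops below eps. The function name and the numerical values are hypothetical, and the result only reflects this particular bound (it also presumes n is large enough for the joint AEP statements to hold).

```python
import math

def min_block_length(i_xy, rate, eps):
    """Smallest n with 2^(-n (I - 3 eps - R)) < eps; requires R < I - 3 eps."""
    gap = i_xy - 3 * eps - rate
    if gap <= 0:
        raise ValueError("need rate < i_xy - 3 * eps for the bound to decay")
    n = max(1, math.ceil(math.log2(1 / eps) / gap))   # analytical starting point
    while 2.0 ** (-n * gap) >= eps:                   # guard against rounding at the boundary
        n += 1
    return n

n_needed = min_block_length(i_xy=0.531, rate=0.4, eps=0.01)
print(n_needed)
```

Shrinking the gap between the rate and the mutual information, or tightening eps, increases the required block length, which is consistent with the asymptotic nature of the argument.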
Existence of capacity achieving code: We finish the proof by arguing that there exists a codebook whose rate achieves the capacity as follows.
1. By selecting the PMF $f_X(x)$ to be the capacity achieving PMF $f_X^*(x)$, i.e. setting $f_X^*(x) = \arg\max_{f_X(x)} I(X; Y)$, we can replace the condition $R < I(X; Y)$ by $R < C$.
2. Since the average error probability over all codebooks is less than $2\epsilon$, there must exist at least one codebook $\mathcal{C}^*$ with error probability $P_e^n(\mathcal{C}^*) < 2\epsilon$.
3. From $P_e^n = \frac{1}{2^{Rn}} \sum_{m=1}^{2^{Rn}} \lambda_m$ and $P_e^n(\mathcal{C}^*) < 2\epsilon$, throwing away the worse half (based on $\lambda_m$) of the codewords of $\mathcal{C}^*$ yields a codebook with half the size and $\lambda^n_{\max} < 4\epsilon$. (Note that if more than half the codewords of $\mathcal{C}^*$ have $\lambda_m \ge 4\epsilon$, then $P_e^n(\mathcal{C}^*) \ge 2\epsilon$, yielding a contradiction.) We can reindex the codewords in the modified codebook with $2^{Rn}/2 = 2^{Rn-1}$ codewords. The rate of this code is reduced from $R$ to $R - 1/n$.
In conclusion, given any $\epsilon > 0$, if $R < C$, there exists a code with rate $R - 1/n$ and $\lambda^n_{\max} < 4\epsilon$ for sufficiently large $n$. As $n \to \infty$, the code rate approaches $R$. Thus, $R$ is achievable.
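The step of throwing away the worse half of the codewords can be sketched as follows. The list of per-codeword error probabilities is made up for illustration, and the function name is hypothetical. The point of the example is that the largest error probability among the kept codewords cannot exceed twice the average over the original codebook.

```python
def expurgate(error_probs):
    """Return the (sorted) indices of the better half of the codewords,
    i.e. those with the smallest per-codeword error probabilities."""
    order = sorted(range(len(error_probs)), key=lambda m: error_probs[m])
    return sorted(order[: len(error_probs) // 2])

# Hypothetical per-codeword error probabilities for a codebook of 8 codewords
error_probs = [0.001, 0.2, 0.002, 0.0, 0.05, 0.003, 0.0, 0.004]
kept = expurgate(error_probs)
average = sum(error_probs) / len(error_probs)
worst_kept = max(error_probs[m] for m in kept)
print(kept, average, worst_kept)
```

Halving the codebook removes exactly one information bit, which is why the rate drops from R to R - 1/n.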
Non-Achievability of R > C
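Proving that $R > C$ is not achievable is equivalent to showing that, for any sequence of $(n, Rn)$ codes with $\lambda^n_{\max} \to 0$, we have $R \le C$.
Note that $\lambda^n_{\max} \to 0$ implies $P_e^n \to 0$. Since $P_e^n = \Pr\{\hat{U} \neq U\}$, we can apply the Fano inequality (theorem 7.5) for guessing $U$ from $\mathbf{Y} = (Y_1, \ldots, Y_n)$ to write
\[ H(U|\mathbf{Y}) \le 1 + P_e^n \log(2^{Rn} - 1) < 1 + P_e^n Rn. \tag{7.10} \]
To relate $R$ to $C$, we first note that $H(U) = Rn$. Then, we can use the equality $H(U) = H(U|\mathbf{Y}) + I(U; \mathbf{Y})$ and (7.10) to write
\[ Rn = H(U) = H(U|\mathbf{Y}) + I(U; \mathbf{Y}) < 1 + P_e^n Rn + I(U; \mathbf{Y}). \tag{7.11} \]
From the data processing inequality (theorem 7.4), $I(U; \mathbf{Y}) \le I(\mathbf{X}(U); \mathbf{Y})$, where $\mathbf{X}(U) = \mathbf{x}_m$ under hypothesis $m$. It follows from (7.11) that
\[ Rn < 1 + P_e^n Rn + I(\mathbf{X}(U); \mathbf{Y}). \tag{7.12} \]
We now show that $I(\mathbf{X}(U); \mathbf{Y}) \le nC$. For convenience, we drop the notation $U$ below. Using the chain rule on the entropy and the memoryless property of a DMC, we can write $I(\mathbf{X}; \mathbf{Y})$ as
\[ \begin{aligned} I(\mathbf{X}; \mathbf{Y}) &= H(\mathbf{Y}) - H(\mathbf{Y}|\mathbf{X}) \\ &= H(\mathbf{Y}) - \sum_{j=1}^{n} H(Y_j \mid \mathbf{X}, Y_{j-1}, \ldots, Y_1) \\ &= H(\mathbf{Y}) - \sum_{j=1}^{n} H(Y_j|X_j). \end{aligned} \]
Using the inequality $H(\mathbf{Y}) \le \sum_{j=1}^{n} H(Y_j)$,
\[ I(\mathbf{X}; \mathbf{Y}) \le \sum_{j=1}^{n} H(Y_j) - \sum_{j=1}^{n} H(Y_j|X_j) = \sum_{j=1}^{n} I(X_j; Y_j) \le nC. \]
Substituting this bound into (7.12) and dividing by $n$ gives $R < \frac{1}{n} + P_e^n R + C$. Since $\frac{1}{n} + P_e^n R \to 0$ as $n \to \infty$, we have $R \le C$.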
7.7 Differential Entropy
[...] $\frac{1}{2} \log(2\pi e \sigma^2)$.
Note that, for any real constant $a$, $h(X) = h(X + a)$. It follows that, for Gaussian RV $Y = X + a$ with mean $a$ and variance $\sigma^2$, $h(Y) = h(X) = \frac{1}{2} \log(2\pi e \sigma^2)$.
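The closed form for the Gaussian differential entropy and its invariance to a shift of the mean can be checked by numerical integration. The sketch below uses a plain midpoint rule, base-2 logarithms (so the result is in bits), and hypothetical values for the variance and the shift; the integration range of 12 standard deviations on each side is an assumption that makes the truncation error negligible.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def differential_entropy(pdf, lo, hi, steps=20000):
    """Midpoint-rule approximation of -integral f(x) log2 f(x) dx over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        f = pdf(lo + (k + 0.5) * dx)
        if f > 0.0:                               # skip points where f underflows to 0
            total -= f * math.log2(f) * dx
    return total

var, shift = 2.0, 5.0                             # hypothetical variance and mean shift
half_width = 12 * math.sqrt(var)
h_x = differential_entropy(lambda x: gaussian_pdf(x, 0.0, var),
                           -half_width, half_width)
h_y = differential_entropy(lambda x: gaussian_pdf(x, shift, var),
                           shift - half_width, shift + half_width)
closed_form = 0.5 * math.log2(2 * math.pi * math.e * var)
print(h_x, h_y, closed_form)
```

All three printed values should agree to several decimal places.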
It is left as an exercise for the reader to show that, among all the RVs with variance $\sigma^2$, the Gaussian RV has the highest differential entropy. We state this result formally below.
Theorem 7.6 (Maximum entropy of Gaussian RV): Let $X$ be a continuous RV with variance $\sigma^2$. Then, $h(X) \le \frac{1}{2} \log(2\pi e \sigma^2)$, where the equality holds if and only if $X$ is Gaussian.
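As a quick sanity check of the maximum entropy property, the sketch below compares the well-known closed-form differential entropies (in bits) of three distributions with the same variance: Gaussian, uniform, and Laplace. The uniform and Laplace distributions are chosen here only as illustrative competitors; they do not appear in the text.

```python
import math

def h_gaussian(var):
    """Differential entropy of a Gaussian RV: 0.5 log2(2 pi e var)."""
    return 0.5 * math.log2(2 * math.pi * math.e * var)

def h_uniform(var):
    """Uniform RV on an interval of width w has variance w^2 / 12 and entropy log2(w)."""
    width = math.sqrt(12 * var)
    return math.log2(width)

def h_laplace(var):
    """Laplace RV with scale b has variance 2 b^2 and entropy log2(2 b e)."""
    b = math.sqrt(var / 2)
    return math.log2(2 * b * math.e)

for var in (0.1, 1.0, 10.0):
    print(var, h_gaussian(var), h_uniform(var), h_laplace(var))
```

For every variance tried, the Gaussian value is the largest of the three, and the gaps do not depend on the variance.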
[...] $= \sum_{j=1}^{n} h(X_j \mid X_{j-1}, \ldots, X_1)$.
Theorem 7.8 (Conditioning can only reduce entropy): For continuous RVs $X$ and $Y$,
\[ h(X|Y) \le h(X). \]
As for discrete RVs, we have the following theorems. The proofs are omitted.
Theorem 7.9 (Chain rule for mutual information): For continuous RVs $X_1, \ldots, X_n, Y$,
\[ \begin{aligned} I(X_1, \ldots, X_n; Y) &= I(X_1; Y) + I(X_2; Y|X_1) + \ldots + I(X_n; Y|X_{n-1}, \ldots, X_1) \\ &= \sum_{j=1}^{n} I(X_j; Y \mid X_{j-1}, \ldots, X_1). \end{aligned} \tag{7.21} \]
In the remaining part of this section, we shall establish some inequalities that are
useful in proving the channel coding theorem for AWGN channels. Since the
derivations are similar in nature to the results for discrete RVs, we omit the proofs
in what follows. For more details, see [?].
\[ \mathrm{vol}(A) = \int_A dx_1 \ldots dx_n. \tag{7.22} \]
Theorem 7.10 (AEP): For sufficiently large $n$, we have the following properties.
1. $\Pr\{T_\epsilon^n\} > 1 - \epsilon$.
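[...] $(1 - \epsilon) 2^{-n(I(X;Y)+3\epsilon)} < \Pr\left\{ (\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) \in A_\epsilon^n \right\}$ [...]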
Roughly speaking, the typical $\mathbf{x}$-sequences form a set of volume $2^{nh(X)}$ in $\mathbb{R}^n$. The typical $\mathbf{y}$-sequences form a set of volume $2^{nh(Y)}$ in $\mathbb{R}^n$. The jointly typical $(\mathbf{x}, \mathbf{y})$-sequences form a set of volume $2^{nh(X,Y)}$ in $\mathbb{R}^{2n}$. Thus, the probability that a randomly selected $(\mathbf{x}, \mathbf{y})$-sequence is jointly typical is about
\[ \frac{2^{nh(X,Y)}}{2^{nh(X)} \, 2^{nh(Y)}} = 2^{-nI(X;Y)}. \]
7.8
We now define the channel capacity of the AWGN channel and then prove the channel coding theorem for the AWGN channel. Note that we use the definitions of an achievable rate as well as various error probabilities as in the case of a DMC. The channel capacity of the AWGN channel with energy constraint $E_d$ is defined as
\[ C = \max_{f_X(x) : E[X^2] \le E_d} I(X; Y). \tag{7.24} \]
Theorem 7.14 (Channel coding theorem for AWGN channel): Any rate $R < C$ is achievable. Conversely, if a rate $R$ is achievable, then $R \le C$.
Before proving the channel coding theorem, we first derive the capacity expression for the AWGN channel.
Theorem 7.15 (Capacity formula for AWGN channel): Let $\mathrm{SNR} = \frac{E_d}{N_0/2}$. Then,
\[ C = \frac{1}{2} \log_2(1 + \mathrm{SNR}) \quad \text{(in bit/dimension)}. \tag{7.25} \]
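Proof: We first write
\[ I(X; Y) = h(Y) - h(Y|X) = h(Y) - h(X + N|X) = h(Y) - h(N) = h(Y) - \frac{1}{2} \log(\pi e N_0). \]
Since $h(Y)$ is maximized when $Y$ is Gaussian and the variance of $Y$ is $E_d + N_0/2$,
\[ I(X; Y) \le \frac{1}{2} \log(2\pi e (E_d + N_0/2)) - \frac{1}{2} \log(\pi e N_0) = \frac{1}{2} \log\left(1 + \frac{E_d}{N_0/2}\right) = \frac{1}{2} \log(1 + \mathrm{SNR}), \]
where the equality holds when $X$ is Gaussian with mean 0 and variance $E_d$.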
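The capacity formula can be cross-checked against the entropy difference used in its derivation. The sketch below evaluates one half of log2(1 + SNR) with SNR defined as the energy per dimension divided by N0/2, and compares it with the difference between the Gaussian differential entropies of the output and of the noise. The numerical values of the energy and noise level are hypothetical.

```python
import math

def awgn_capacity(e_d, n0):
    """Capacity in bit/dimension with SNR = E_d / (N0 / 2)."""
    snr = e_d / (n0 / 2)
    return 0.5 * math.log2(1 + snr)

def gaussian_entropy(var):
    """Differential entropy (bits) of a Gaussian RV with the given variance."""
    return 0.5 * math.log2(2 * math.pi * math.e * var)

e_d, n0 = 3.0, 2.0                                # hypothetical energy and noise level
c_formula = awgn_capacity(e_d, n0)
# h(Y) - h(N) with Gaussian Y of variance E_d + N0/2 and noise variance N0/2
c_entropy = gaussian_entropy(e_d + n0 / 2) - gaussian_entropy(n0 / 2)
print(c_formula, c_entropy)
```

With these values the SNR equals 3, so both expressions evaluate to 1 bit/dimension.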
Before formally proving the channel coding theorem for AWGN channels, we make
the following comments.
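[...] $(\sqrt{nN_0/2})^n$ codewords, yielding the capacity of
\[ \log_2 \frac{\left(\sqrt{n(E_d + N_0/2)}\right)^n}{\left(\sqrt{nN_0/2}\right)^n} = \frac{n}{2} \log_2\left(1 + \frac{E_d}{N_0/2}\right) \text{ bit}/n \text{ dimension} = \frac{1}{2} \log_2(1 + \mathrm{SNR}) \text{ bit/dimension}. \]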
Achievability of R < C
The proof is based on the random coding argument. The main ideas are similar to the proof of the channel coding theorem for DMCs. Consider the transmission scheme described below.
Transmission and detection: We first randomly generate an $(n, Rn)$ codebook $\mathcal{C}$ with $2^{Rn}$ codewords according to a Gaussian PDF with mean 0 and variance $E_d - \epsilon$. In particular, we can write the codebook as a matrix
\[ \mathcal{C} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{2^{Rn},1} & x_{2^{Rn},2} & \cdots & x_{2^{Rn},n} \end{bmatrix}, \]
where the entries are generated IID according to the above Gaussian PDF.
The transmitter and the receiver both know the code $\mathcal{C}$ and the channel $f_{Y|X}(y|x)$. The receiver observes $\mathbf{y} = (y_1, \ldots, y_n)$ and uses the following typical set decoding; the decision is to set $\hat{U}(\mathbf{y}) = m$ if the following conditions hold.
1. $(\mathbf{x}_m, \mathbf{y})$ is jointly typical, i.e. $(\mathbf{x}_m, \mathbf{y}) \in A_\epsilon^n$.
2. There is no other index $m' \neq m$ such that $(\mathbf{x}_{m'}, \mathbf{y})$ is jointly typical.
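[...] where $P_e^n(\mathcal{C})$ is the error probability averaged over all the codewords for a codebook $\mathcal{C}$.
By the symmetry of the code construction,
\[ \Pr\{\mathcal{E}\} = \Pr\{\mathcal{E} \mid U = 1\}. \]
Let $\mathcal{E}_m$ denote the event that $(\mathbf{x}_m, \mathbf{y})$ is jointly typical, i.e. $(\mathbf{x}_m, \mathbf{y}) \in A_\epsilon^n$. Let $\mathcal{E}_m^c$ denote the complement of $\mathcal{E}_m$. Let $\mathcal{F}$ denote the event that $\frac{1}{n} \sum_{j=1}^{n} x_{1,j}^2 > E_d$. From the modified typical set decoding, $\Pr\{\mathcal{E} \mid U = 1\} = \Pr\{\mathcal{F} \cup \mathcal{E}_1^c \cup \mathcal{E}_2 \cup \ldots \cup \mathcal{E}_{2^{Rn}}\}$. Using the union bound,
\[ \Pr\{\mathcal{E} \mid U = 1\} \le \Pr\{\mathcal{F}\} + \Pr\{\mathcal{E}_1^c\} + \sum_{m=2}^{2^{Rn}} \Pr\{\mathcal{E}_m\}. \]
By the WLLN, $\Pr\{\mathcal{F}\} < \epsilon$ for large $n$. From joint AEP (statement 1 of theorem 7.11), $\Pr\{\mathcal{E}_1^c\} < \epsilon$ for large $n$. From the code construction, for $m \neq 1$, $\mathbf{x}_m$ and $\mathbf{y}$ are independent, i.e. their joint PDF is $f_{\mathbf{X}}(\mathbf{x}_m) f_{\mathbf{Y}}(\mathbf{y})$. From joint AEP (statement 3 of theorem 7.11), $\Pr\{\mathcal{E}_m\} < 2^{-n(I(X;Y)-3\epsilon)}$. It follows that
\[ \Pr\{\mathcal{E} \mid U = 1\} < 2\epsilon + (2^{Rn} - 1) 2^{-n(I(X;Y)-3\epsilon)} < 2\epsilon + 2^{-n(I(X;Y)-3\epsilon-R)}. \]
Note that, if $R < I(X; Y)$, we can choose $\epsilon$ small and $n$ large enough so that $2^{-n(I(X;Y)-3\epsilon-R)} < \epsilon$, yielding $\Pr\{\mathcal{E} \mid U = 1\} < 3\epsilon$.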
Existence of capacity achieving code: We finish the proof by arguing that there exists a code whose rate achieves the capacity as follows.
1. By selecting the PDF $f_X(x)$ to be the capacity achieving PDF $f_X^*(x)$, i.e. setting $f_X^*(x) = \arg\max_{f_X(x)} I(X; Y)$, we can replace the condition $R < I(X; Y)$ by $R < C$.
2. Since the average error probability over all codebooks is less than $3\epsilon$, there must be at least one codebook $\mathcal{C}^*$ with error probability $P_e^n(\mathcal{C}^*) < 3\epsilon$.
3. From $P_e^n = \frac{1}{2^{Rn}} \sum_{m=1}^{2^{Rn}} \lambda_m$, throwing away the worse half (based on $\lambda_m$) of the codewords of $\mathcal{C}^*$ yields a codebook with half the size and $\lambda^n_{\max} < 6\epsilon$. (Note that if more than half the codewords of $\mathcal{C}^*$ have $\lambda_m \ge 6\epsilon$, then $P_e^n(\mathcal{C}^*) \ge 3\epsilon$, yielding a contradiction.) We can reindex the codewords in the modified codebook with $2^{Rn}/2 = 2^{Rn-1}$ codewords. The rate of this code is reduced from $R$ to $R - 1/n$.
In conclusion, if $R < C$, there exists a code with rate $R - 1/n$ with $\lambda^n_{\max} < 6\epsilon$ for sufficiently large $n$. As $n \to \infty$, the code rate approaches $R$. Thus, $R$ is achievable.
Non-Achievability of R > C
In addition to data processing and Fano inequalities, we shall make use of the Jensen inequality described next. A function $f$ is convex over an interval $(a, b)$ if, for every $x_1, x_2 \in (a, b)$ and $\alpha \in [0, 1]$, $f$ satisfies
\[ f(\alpha x_1 + (1 - \alpha) x_2) \le \alpha f(x_1) + (1 - \alpha) f(x_2). \]
Theorem 7.16 (Jensen inequality): If a function $f$ is convex and $X$ is a RV, then
\[ E[f(X)] \ge f(E[X]). \]
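The Jensen inequality is easy to check numerically for a discrete RV. The sketch below computes the gap between the expectation of f(X) and f evaluated at the expectation of X, for the exponential function and for the negated logarithmic function of the kind that appears in AWGN capacity expressions. The sample values, the uniform PMF, and the noise level `n0` are hypothetical.

```python
import math

def jensen_gap(f, values, probs):
    """E[f(X)] - f(E[X]) for a discrete RV; nonnegative when f is convex."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * f(v) for v, p in zip(values, probs)) - f(mean)

values = [0.5, 1.0, 4.0]                          # hypothetical sample values
probs = [1 / 3, 1 / 3, 1 / 3]
n0 = 2.0                                          # hypothetical noise level

gap_exp = jensen_gap(math.exp, values, probs)
# -0.5 log2(1 + 2 e / N0) is convex in e because the logarithm is concave
gap_neg_log = jensen_gap(lambda e: -0.5 * math.log2(1 + 2 * e / n0), values, probs)
gap_linear = jensen_gap(lambda x: 2 * x + 1, values, probs)
print(gap_exp, gap_neg_log, gap_linear)
```

Both convex functions give a positive gap, while the linear function gives a gap of zero up to rounding, which is the boundary case of the inequality.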
We now proceed with the proof that $R > C$ is not achievable. Proving that $R > C$ is not achievable is equivalent to showing that, for any sequence of $(n, Rn)$ codes with $\lambda^n_{\max} \to 0$, we must have $R \le C$. Note that $\lambda^n_{\max} \to 0$ implies that $P_e^n \to 0$. Since $P_e^n = \Pr\{\hat{U} \neq U\}$, we can apply the Fano inequality for guessing $U$ from $\mathbf{Y} = (Y_1, \ldots, Y_n)$ to write
\[ H(U|\mathbf{Y}) \le 1 + P_e^n \log(2^{Rn} - 1) < 1 + P_e^n Rn. \tag{7.27} \]
To relate $R$ to $C$, we first note that $H(U) = Rn$. Then, we can use the equality $H(U) = H(U|\mathbf{Y}) + I(U; \mathbf{Y})$ and (7.27) to write
\[ Rn = H(U) = H(U|\mathbf{Y}) + I(U; \mathbf{Y}) < 1 + P_e^n Rn + I(U; \mathbf{Y}). \tag{7.28} \]
From the data processing inequality, $I(U; \mathbf{Y}) \le I(\mathbf{X}(U); \mathbf{Y})$, where $\mathbf{X}(U) = \mathbf{x}_m$ under hypothesis $m$. It follows from (7.28) that
\[ Rn < 1 + P_e^n Rn + I(\mathbf{X}(U); \mathbf{Y}). \tag{7.29} \]
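For convenience, we drop the notation $U$ below. Using the chain rule on the entropy and the memoryless property of the discrete-time channel, we can write and bound $I(\mathbf{X}; \mathbf{Y})$ as
\[ \begin{aligned} I(\mathbf{X}; \mathbf{Y}) &= h(\mathbf{Y}) - h(\mathbf{Y}|\mathbf{X}) = h(\mathbf{Y}) - \sum_{j=1}^{n} h(Y_j|X_j) \\ &= h(\mathbf{Y}) - \sum_{j=1}^{n} h(N_j) \\ &\le \sum_{j=1}^{n} h(Y_j) - \sum_{j=1}^{n} h(N_j) = \sum_{j=1}^{n} [h(Y_j) - h(N_j)]. \end{aligned} \tag{7.30} \]
Let $E_j = \frac{1}{2^{Rn}} \sum_{m=1}^{2^{Rn}} x_{m,j}^2$ be the average energy of position $j$ in the codeword. Since $Y_j$ has variance at most $E_j + N_0/2$ and $N_j$ is a zero-mean Gaussian RV with variance $N_0/2$, theorem 7.6 gives
\[ h(Y_j) - h(N_j) \le \frac{1}{2} \log(2\pi e (E_j + N_0/2)) - \frac{1}{2} \log(\pi e N_0) = \frac{1}{2} \log\left(1 + \frac{2E_j}{N_0}\right). \]
Combining this bound with (7.29) and (7.30) and dividing by $n$,
\[ R < \frac{1}{n} + P_e^n R + \frac{1}{n} \sum_{j=1}^{n} \frac{1}{2} \log\left(1 + \frac{2E_j}{N_0}\right). \]
Since the logarithm is concave, the Jensen inequality gives
\[ \frac{1}{n} \sum_{j=1}^{n} \frac{1}{2} \log\left(1 + \frac{2E_j}{N_0}\right) \le \frac{1}{2} \log\left(1 + \frac{2}{N_0} \cdot \frac{1}{n} \sum_{j=1}^{n} E_j\right) \le \frac{1}{2} \log\left(1 + \frac{2E_d}{N_0}\right) = C, \]
where the last inequality assumes that the codewords satisfy the energy constraint, i.e. $\frac{1}{n} \sum_{j=1}^{n} E_j \le E_d$. Since $\frac{1}{n} + P_e^n R \to 0$ as $n \to \infty$, we have $R \le C$ as $n \to \infty$.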
7.9 Summary
In this chapter, we defined the channel capacities of DMCs and AWGN channels. Using the AEP and related inequalities in information theory, we proved the channel coding theorems for both DMCs and AWGN channels. We observed that the proofs of the coding theorems guarantee the existence of good codes, but do not tell us specifically how to construct good codes. Nevertheless, the channel capacity can provide us with a fundamental limit on what we can achieve in terms of the communication rate subject to the availability in bandwidth (for both DMCs and AWGN channels) as well as in transmit power (for AWGN channels). It is interesting to note that the random coding argument assumes no restriction on the coding delay. In particular, by allowing the sequence length to go to infinity, we allow the coding delay to become arbitrarily large.
Finally, we derived the channel capacity formula for AWGN channels. The formula provides the bound in figure 5.13. In addition, it is used as a benchmark to evaluate the performances of practical systems.
7.10
A function $f$ is convex over an interval $(a, b)$ if, for every $x_1, x_2 \in (a, b)$ and $\alpha \in [0, 1]$, $f$ satisfies
\[ f(\alpha x_1 + (1 - \alpha) x_2) \le \alpha f(x_1) + (1 - \alpha) f(x_2). \tag{7.31} \]
Theorem 7.17 If a function $f$ is twice differentiable in $(a, b)$, then $f$ is convex if and only if its second derivative $f''$ is nonnegative in $(a, b)$.
Proof: We first show that, if $f''(x) \ge 0$ in $(a, b)$, then $f(x)$ is convex in $(a, b)$. Consider the second-order Taylor series expansion around a point $x_0$
\[ f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x^*)}{2} (x - x_0)^2, \]
where $x^*$ lies somewhere between $x$ and $x_0$. Since $f''(x) \ge 0$, [...]
Example 7.6 We can check the second derivatives to verify that $e^x$ and $-\log x$ are convex functions.
[...] $\frac{f''(x^*)}{2} (x - x_0)^2$, [...]
7.11 Practice Problems
[Figure: channel diagram labeled DMC 2]
Write the expressions for the mutual information I(X; Y ) between the input and the
output as well as the conditional entropy H(X|Y ) for each channel.
Problem 7.2 (Cascade of two BSCs): Consider the cascade of two BSCs with the crossover probabilities given below. (An example scenario of this model is a satellite transmission system in which the first BSC corresponds to the uplink while the second BSC corresponds to the downlink.)
Let [...] be the crossover probability of the first BSC, and [...] be the crossover probability of the second BSC. Let RVs $X$, $Y$, and $Z$ denote the input, the intermediate output, and the final output of the transmission system respectively.
(a) Assume that we want to support the bit rate of 6 Mbps. Find the amount of bandwidth required to transmit at this bit rate.
(b) Draw the decision regions for the received signals according to the optimal decision rule that minimizes the probability of decision error.
(c) Find the union bound estimate of the symbol error probability associated with the optimal decision rule in part (b). Express the bound in terms of the energy per bit $E_b$ and $N_0$.
(d) Assume that a decision error occurs only for nearest neighbors. In other words, assume that the probability that a signal point $(a_1, a_2)$ is sent but a non-nearest neighbor is decided is negligible.
View the channel as a DMC with 8 inputs and 8 outputs (after the receiver's decision). Let [...] $= Q\left(\frac{d_{\min}}{\sqrt{2N_0}}\right)$, where $d_{\min}$ denotes the minimum distance between signal points. Find the channel capacity of this DMC in terms of [...] with the unit of bit/transmission (or equivalently bit/dimension).
Problem 7.4 : Consider the additive-noise DMC shown below. In the figure, $X$, $Y$, and $N$ denote the input, the output, and the additive noise RVs respectively.
Bibliography