Preface xi
1 Introduction 1
1.1 Estimation in Signal Processing . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Mathematical Estimation Problem . . . . . . . . . . . . . . . . . . 7
1.3 Assessing Estimator Performance . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Some Notes to the Reader . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Linear Models 83
4.1 Introduction . . . . . . . . 83
4.2 Summary . . . . . . . . 83
4.3 Definition and Properties . . . . . . . . 83
4.4 Linear Model Examples . . . . . . . . 86
4.5 Extension to the Linear Model . . . . . . . . 94

5 General Minimum Variance Unbiased Estimation 101
5.1 Introduction . . . . . . . . 101
5.2 Summary . . . . . . . . 101
5.3 Sufficient Statistics . . . . . . . . 102
5.4 Finding Sufficient Statistics . . . . . . . . 104
5.5 Using Sufficiency to Find the MVU Estimator . . . . . . . . 107
5.6 Extension to a Vector Parameter . . . . . . . . 116
5A Proof of Neyman-Fisher Factorization Theorem (Scalar Parameter) . . . . . . . . 127
5B Proof of Rao-Blackwell-Lehmann-Scheffe Theorem (Scalar Parameter) . . . . . . . . 130

6 Best Linear Unbiased Estimators 133
6.1 Introduction . . . . . . . . 133
6.2 Summary . . . . . . . . 133
6.3 Definition of the BLUE . . . . . . . . 134
6.4 Finding the BLUE . . . . . . . . 136
6.5 Extension to a Vector Parameter . . . . . . . . 139
6.6 Signal Processing Example . . . . . . . . 141
6A Derivation of Scalar BLUE . . . . . . . . 151
6B Derivation of Vector BLUE . . . . . . . . 153

7 Maximum Likelihood Estimation 157
7.1 Introduction . . . . . . . . 157
7.2 Summary . . . . . . . . 157
7.3 An Example . . . . . . . . 158
7.4 Finding the MLE . . . . . . . . 162
7.5 Properties of the MLE . . . . . . . . 164
7.6 MLE for Transformed Parameters . . . . . . . . 173
7.7 Numerical Determination of the MLE . . . . . . . . 177
7.8 Extension to a Vector Parameter . . . . . . . . 182
7.9 Asymptotic MLE . . . . . . . . 190
7.10 Signal Processing Examples . . . . . . . . 191
7A Monte Carlo Methods . . . . . . . . 205
7B Asymptotic PDF of MLE for a Scalar Parameter . . . . . . . . 211
7C Derivation of Conditional Log-Likelihood for EM Algorithm Example . . . . . . . . 214

8 Least Squares 219
8.1 Introduction . . . . . . . . 219
8.2 Summary . . . . . . . . 219
8.3 The Least Squares Approach . . . . . . . . 220
8.4 Linear Least Squares . . . . . . . . 223
8.5 Geometrical Interpretations . . . . . . . . 226
8.6 Order-Recursive Least Squares . . . . . . . . 232
8.7 Sequential Least Squares . . . . . . . . 242
8.8 Constrained Least Squares . . . . . . . . 251
8.9 Nonlinear Least Squares . . . . . . . . 254
8.10 Signal Processing Examples . . . . . . . . 260
8A Derivation of Order-Recursive Least Squares . . . . . . . . 282
8B Derivation of Recursive Projection Matrix . . . . . . . . 285
8C Derivation of Sequential Least Squares . . . . . . . . 286

9 Method of Moments 289
9.1 Introduction . . . . . . . . 289
9.2 Summary . . . . . . . . 289
9.3 Method of Moments . . . . . . . . 289
9.4 Extension to a Vector Parameter . . . . . . . . 292
9.5 Statistical Evaluation of Estimators . . . . . . . . 294
9.6 Signal Processing Example . . . . . . . . 299

10 The Bayesian Philosophy 309
10.1 Introduction . . . . . . . . 309
10.2 Summary . . . . . . . . 309
10.3 Prior Knowledge and Estimation . . . . . . . . 310
10.4 Choosing a Prior PDF . . . . . . . . 316
10.5 Properties of the Gaussian PDF . . . . . . . . 321
10.6 Bayesian Linear Model . . . . . . . . 325
10.7 Nuisance Parameters . . . . . . . . 328
10.8 Bayesian Estimation for Deterministic Parameters . . . . . . . . 330
10A Derivation of Conditional Gaussian PDF . . . . . . . . 337

11 General Bayesian Estimators 341
11.1 Introduction . . . . . . . . 341
11.2 Summary . . . . . . . . 341
11.3 Risk Functions . . . . . . . . 342
11.4 Minimum Mean Square Error Estimators . . . . . . . . 344
11.5 Maximum A Posteriori Estimators . . . . . . . . 350
11.6 Performance Description . . . . . . . . 359
11.7 Signal Processing Example . . . . . . . . 365
11A Conversion of Continuous-Time System to Discrete-Time System . . . . . . . . 375

12 Linear Bayesian Estimators 379
12.1 Introduction . . . . . . . . 379
12.2 Summary . . . . . . . . 379
12.3 Linear MMSE Estimation . . . . . . . . 380
and matrix algebra. This book can also be used for self-study and so should be useful to the practicing engineer as well as the student.

The author would like to acknowledge the contributions of the many people who over the years have provided stimulating discussions of research problems, opportunities to apply the results of that research, and support for conducting research. Thanks are due to my colleagues L. Jackson, R. Kumaresan, L. Pakula, and D. Tufts of the University of Rhode Island, and L. Scharf of the University of Colorado. Exposure to practical problems, leading to new research directions, has been provided by H. Woodsum of Sonetech, Bedford, New Hampshire, and by D. Mook, S. Lang, C. Myers, and D. Morgan of Lockheed-Sanders, Nashua, New Hampshire. The opportunity to apply estimation theory to sonar and the research support of J. Kelly of the Naval Undersea Warfare Center, Newport, Rhode Island, J. Salisbury of Analysis and Technology, Middletown, Rhode Island (formerly of the Naval Undersea Warfare Center), and D. Sheldon of the Naval Undersea Warfare Center, New London, Connecticut, are also greatly appreciated. Thanks are due to J. Sjogren of the Air Force Office of Scientific Research, whose continued support has allowed the author to investigate the field of statistical estimation. A debt of gratitude is owed to all my current and former graduate students. They have contributed to the final manuscript through many hours of pedagogical and research discussions as well as by their specific comments and questions. In particular, P. Djuric of the State University of New York proofread much of the manuscript, and V. Nagesha of the University of Rhode Island proofread the manuscript and helped with the problem solutions.

Steven M. Kay
University of Rhode Island
Kingston, RI 02881

Chapter 1

Introduction

1.1 Estimation in Signal Processing

Modern estimation theory can be found at the heart of many electronic signal processing systems designed to extract information. These systems include

1. Radar
2. Sonar
3. Speech
4. Image analysis
5. Biomedicine
6. Communications
7. Control
8. Seismology,

and all share the common problem of needing to estimate the values of a group of parameters. We briefly describe the first three of these systems. In radar we are interested
in determining the position of an aircraft, as for example, in airport surveillance radar [Skolnik 1980]. To determine the range R we transmit an electromagnetic pulse that is reflected by the aircraft, causing an echo to be received by the antenna τ0 seconds later, as shown in Figure 1.1a. The range is determined by the equation τ0 = 2R/c, where c is the speed of electromagnetic propagation. Clearly, if the round trip delay τ0 can be measured, then so can the range. A typical transmit pulse and received waveform are shown in Figure 1.1b. The received echo is decreased in amplitude due to propagation losses and hence may be obscured by environmental noise. Its onset may also be perturbed by time delays introduced by the electronics of the receiver. Determination of the round trip delay can therefore require more than just a means of detecting a jump in the power level at the receiver. It is important to note that a typical modern
Figure 1.1 (a) Radar. (b) Transmit and received waveforms.

Figure 1.2 (a) Passive sonar. (b) Received signals at array sensors.
radar system will input the received continuous-time waveform into a digital computer by taking samples via an analog-to-digital convertor. Once the waveform has been sampled, the data compose a time series. (See also Examples 3.13 and 7.15 for a more detailed description of this problem and optimal estimation procedures.)

Another common application is in sonar, in which we are also interested in the position of a target, such as a submarine [Knight et al. 1981, Burdic 1984]. A typical passive sonar is shown in Figure 1.2a. The target radiates noise due to machinery on board, propellor action, etc. This noise, which is actually the signal of interest, propagates through the water and is received by an array of sensors. The sensor outputs are then transmitted to a tow ship for input to a digital computer. Because of the positions of the sensors relative to the arrival angle of the target signal, we receive the signals shown in Figure 1.2b. By measuring τ0, the delay between sensors, we can determine the bearing β from the expression

β = arccos(cτ0/d)    (1.1)

where c is the speed of sound in water and d is the distance between sensors (see Examples 3.15 and 7.17 for a more detailed description).

Figure 1.3 Examples of speech sounds
Figure 1.4 LPC spectral modeling

Again, however, the received
waveforms are not "clean" as shown in Figure 1.2b but are embedded in noise, making the determination of τ0 more difficult. The value of β obtained from (1.1) is then only an estimate.

Another application is in speech processing systems [Rabiner and Schafer 1978]. A particularly important problem is speech recognition, which is the recognition of speech by a machine (digital computer). The simplest example of this is in recognizing individual speech sounds or phonemes. Phonemes are the vowels, consonants, etc., or the fundamental sounds of speech. As an example, the vowels /a/ and /e/ are shown in Figure 1.3. Note that they are periodic waveforms whose period is called the pitch. To recognize whether a sound is an /a/ or an /e/ the following simple strategy might be employed. Have the person whose speech is to be recognized say each vowel three times and store the waveforms. To recognize the spoken vowel, compare it to the stored vowels and choose the one that is closest to the spoken vowel or the one that minimizes some distance measure. Difficulties arise if the pitch of the speaker's voice changes from the time he or she records the sounds (the training session) to the time when the speech recognizer is used. This is a natural variability due to the nature of human speech. In practice, attributes other than the waveforms themselves are used to measure distance. Attributes are chosen that are less susceptible to variation. For example, the spectral envelope will not change with pitch since the Fourier transform of a periodic signal is a sampled version of the Fourier transform of one period of the signal. The period affects only the spacing between frequency samples, not the values. To extract the spectral envelope we employ a model of speech called linear predictive coding (LPC). The parameters of the model determine the spectral envelope. For the speech sounds in Figure 1.3 the power spectrum (magnitude-squared Fourier transform divided by the number of time samples) or periodogram and the estimated LPC spectral envelope are shown in Figure 1.4. (See Examples 3.16 and 7.18 for a description of how
the parameters of the model are estimated and used to find the spectral envelope.) It is interesting that in this example a human interpreter can easily discern the spoken vowel. The real problem then is to design a machine that is able to do the same. In the radar/sonar problem a human interpreter would be unable to determine the target position from the received waveforms, so that the machine acts as an indispensable tool.

In all these systems we are faced with the problem of extracting values of parameters based on continuous-time waveforms. Due to the use of digital computers to sample and store the continuous-time waveform, we have the equivalent problem of extracting parameter values from a discrete-time waveform or a data set. Mathematically, we have the N-point data set {x[0], x[1], ..., x[N-1]} which depends on an unknown parameter θ. We wish to determine θ based on the data or to define an estimator

θ̂ = g(x[0], x[1], ..., x[N-1])    (1.2)

where g is some function. This is the problem of parameter estimation, which is the subject of this book. Although electrical engineers at one time designed systems based on analog signals and analog circuits, the current and future trend is based on discrete-time signals or sequences and digital circuitry. With this transition the estimation problem has evolved into one of estimating a parameter based on a time series, which is just a discrete-time process. Furthermore, because the amount of data is necessarily finite, we are faced with the determination of g as in (1.2). Therefore, our problem has now evolved into one which has a long and glorious history, dating back to Gauss who in 1795 used least squares data analysis to predict planetary movements [Gauss 1963 (English translation)]. All the theory and techniques of statistical estimation are at our disposal [Cox and Hinkley 1974, Kendall and Stuart 1976-1979, Rao 1973, Zacks 1981].

Before concluding our discussion of application areas we complete the previous list.

4. Image analysis - estimate the position and orientation of an object from a camera image, necessary when using a robot to pick up an object [Jain 1989]

5. Biomedicine - estimate the heart rate of a fetus [Widrow and Stearns 1985]

6. Communications - estimate the carrier frequency of a signal so that the signal can be demodulated to baseband [Proakis 1983]

7. Control - estimate the position of a powerboat so that corrective navigational action can be taken, as in a LORAN system [Dabbous 1988]

8. Seismology - estimate the underground distance of an oil deposit based on sound reflections due to the different densities of oil and rock layers [Justice 1985].

Finally, the multitude of applications stemming from analysis of data from physical experiments, economics, etc., should also be mentioned [Box and Jenkins 1970, Holm and Hovem 1979, Schuster 1898, Taylor 1986].

1.2 The Mathematical Estimation Problem

In determining good estimators the first step is to mathematically model the data. Because the data are inherently random, we describe them by the probability density function (PDF) or p(x[0], x[1], ..., x[N-1]; θ). The PDF is parameterized by the unknown parameter θ, i.e., we have a class of PDFs where each one is different due to a different value of θ. We will use a semicolon to denote this dependence. As an example, if N = 1 and θ denotes the mean, then the PDF of the data might be

p(x[0]; θ) = (1/√(2πσ²)) exp[-(1/(2σ²))(x[0] - θ)²]

which is shown in Figure 1.5 for various values of θ.

Figure 1.5 Dependence of PDF on unknown parameter

It should be intuitively clear that because the value of θ affects the probability of x[0], we should be able to infer the value of θ from the observed value of x[0]. For example, if the value of x[0] is negative, it is doubtful that θ = θ2. The value θ = θ1 might be more reasonable. This specification of the PDF is critical in determining a good estimator. In an actual problem we are not given a PDF but must choose one that is not only consistent with the problem constraints and any prior knowledge, but one that is also mathematically tractable. To illustrate the approach consider the hypothetical Dow-Jones industrial average shown in Figure 1.6. It might be conjectured that these data, although appearing to fluctuate wildly, actually are "on the average" increasing. To determine if this is true we could assume that the data actually consist of a straight line embedded in random noise or

x[n] = A + Bn + w[n]    n = 0, 1, ..., N-1.

Figure 1.6 Hypothetical Dow-Jones average

A reasonable model for the noise is that w[n] is white Gaussian noise (WGN) or each sample of w[n] has the PDF N(0, σ²) (denotes a Gaussian distribution with a mean of 0 and a variance of σ²) and is uncorrelated with all the other samples. Then, the unknown parameters are A and B, which arranged as a vector become the vector parameter θ = [A B]ᵀ. Letting x = [x[0] x[1] ... x[N-1]]ᵀ, the PDF is

p(x; θ) = (1/(2πσ²)^(N/2)) exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - A - Bn)²].    (1.3)

The choice of a straight line for the signal component is consistent with the knowledge that the Dow-Jones average is hovering around 3000 (A models this) and the conjecture that it is increasing (B > 0 models this). The assumption of WGN is justified by the need to formulate a mathematically tractable model so that closed form estimators can be found. Also, it is reasonable unless there is strong evidence to the contrary, such as highly correlated noise. Of course, the performance of any estimator obtained will be critically dependent on the PDF assumptions. We can only hope the estimator obtained is robust, in that slight changes in the PDF do not severely affect the performance of the estimator. More conservative approaches utilize robust statistical procedures [Huber 1981].

Estimation based on PDFs such as (1.3) is termed classical estimation in that the parameters of interest are assumed to be deterministic but unknown. In the Dow-Jones average example we know a priori that the mean is somewhere around 3000. It seems inconsistent with reality, then, to choose an estimator of A that can result in values as low as 2000 or as high as 4000. We might be more willing to constrain the estimator to produce values of A in the range [2800, 3200]. To incorporate this prior knowledge we can assume that A is no longer deterministic but a random variable and assign it a PDF, possibly uniform over the [2800, 3200] interval. Then, any subsequent estimator will yield values in this range. Such an approach is termed Bayesian estimation. The parameter we are attempting to estimate is then viewed as a realization of a random variable. As such, the data are described by the joint PDF

p(x, θ) = p(x|θ)p(θ).

Once the PDF has been specified, the problem becomes one of determining an optimal estimator or function of the data, as in (1.2). Note that an estimator may depend on other parameters, but only if they are known. An estimator may be thought of as a rule that assigns a value to θ for each realization of x. The estimate of θ is the value of θ̂ obtained for a given realization of x. This distinction is analogous to a random variable (which is a function defined on the sample space) and the value it takes on. Although some authors distinguish between the two by using capital and lowercase letters, we will not do so. The meaning will, hopefully, be clear from the context.

1.3 Assessing Estimator Performance

Consider the data set shown in Figure 1.7. From a cursory inspection it appears that x[n] consists of a DC level A in noise. (The use of the term DC is in reference to direct current, which is equivalent to the constant function.) We could model the data as

x[n] = A + w[n]

where w[n] denotes some zero mean noise process.

Figure 1.7 Realization of DC level in noise

Based on the data set {x[0], x[1], ..., x[N-1]}, we would like to estimate A. Intuitively, since A is the average level of x[n] (w[n] is zero mean), it would be reasonable to estimate A as

Â = (1/N) Σ_{n=0}^{N-1} x[n]

or by the sample mean of the data. Another estimator might be Ǎ = x[0], which uses only the first observed sample.
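The model and the two estimates just described can be tried on a synthetic realization. The following sketch is purely illustrative: the values A = 1, σ = 1, and N = 100 are arbitrary choices, and Gaussian noise stands in for the unspecified zero mean noise process.

```python
import numpy as np

# DC-level-in-noise model x[n] = A + w[n] and two candidate estimates
# of A.  All numerical values here are arbitrary illustrative choices.
rng = np.random.default_rng(0)
A, sigma, N = 1.0, 1.0, 100

x = A + sigma * rng.standard_normal(N)  # one realization of the data

A_hat = x.mean()   # sample mean: averages all N samples
A_check = x[0]     # first-sample estimate: no averaging at all

print(A_hat, A_check)
```

On a typical realization the sample mean lands much nearer A than the single sample does, but, as argued next, no single realization settles which estimator is better.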
Intuitively, we would not expect this estimator to perform as well since it does not make use of all the data. There is no averaging to reduce the noise effects. However, for the data set in Figure 1.7, Ǎ = 0.95, which is closer to the true value of A than the sample mean estimate. Can we conclude that Ǎ is a better estimator than Â? The answer is of course no. Because an estimator is a function of the data, which are random variables, it too is a random variable, subject to many possible outcomes. The fact that Ǎ is closer to the true value only means that for the given realization of data, as shown in Figure 1.7, the estimate Ǎ = 0.95 (or realization of Ǎ) is closer to the true value than the estimate Â = 0.9 (or realization of Â). To assess performance we must do so statistically. One possibility would be to repeat the experiment that generated the data and apply each estimator to every data set. Then, we could ask which estimator produces a better estimate in the majority of the cases. Suppose we repeat the experiment by fixing A = 1 and adding different noise realizations of w[n] to generate an ensemble of realizations of x[n]. Then, we determine the values of the two estimators for each data set and finally plot the histograms. (A histogram describes the number of times the estimator produces a given range of values and is an approximation to the PDF.) For 100 realizations the histograms are shown in Figure 1.8.

Figure 1.8 Histograms for sample mean and first sample estimator

It should now be evident that Â is a better estimator than Ǎ because the values obtained are more concentrated about the true value of A = 1. Hence, Â will usually produce a value closer to the true one than Ǎ. The skeptic, however, might argue that if we repeat the experiment 1000 times instead, then the histogram of Ǎ will be more concentrated. To dispel this notion, we cannot repeat the experiment 1000 times, for surely the skeptic would then reassert his or her conjecture for 10,000 experiments. To prove that Â is better we could establish that the variance is less. The modeling assumptions that we must employ are that the w[n]'s, in addition to being zero mean, are uncorrelated and have equal variance σ². Then, we first show that the mean of each estimator is the true value or

E(Â) = E[(1/N) Σ_{n=0}^{N-1} x[n]]
     = (1/N) Σ_{n=0}^{N-1} E(x[n])
     = A

E(Ǎ) = E(x[0])
     = A

so that on the average the estimators produce the true value. Second, the variances are

var(Â) = var[(1/N) Σ_{n=0}^{N-1} x[n]]
       = (1/N²) Σ_{n=0}^{N-1} var(x[n])
       = (1/N²) Nσ²
       = σ²/N

since the w[n]'s are uncorrelated, and thus

var(Ǎ) = var(x[0])
       = σ²
       > var(Â).
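The histogram experiment and the variance calculation are easy to reproduce numerically. The sketch below is illustrative only: it uses 1000 realizations rather than 100, with N = 100 and σ = 1 as arbitrary choices.

```python
import numpy as np

# Monte Carlo version of the Figure 1.8 experiment: fix A = 1, generate
# many independent noise realizations, and compare the spread of the
# sample mean A_hat with that of the first-sample estimator A_check.
rng = np.random.default_rng(1)
A, sigma, N, trials = 1.0, 1.0, 100, 1000

x = A + sigma * rng.standard_normal((trials, N))  # one realization per row

A_hat = x.mean(axis=1)   # sample mean of each realization
A_check = x[:, 0]        # first sample of each realization

# Both estimators are correct on the average (unbiased), but the sample
# mean is far more concentrated: empirically var(A_hat) is near
# sigma^2 / N, while var(A_check) is near sigma^2.
print(A_hat.mean(), A_check.mean())
print(A_hat.var(), A_check.var())
```

Increasing the number of trials only sharpens the agreement of the empirical variances with σ²/N and σ²; it never reverses the ordering, which is the point of the variance argument above.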
Furthermore, if we could assume that w[n] is Gaussian, we could also conclude that the probability of a given magnitude error is less for Â than for Ǎ (see Problem 2.7).

Several important points are illustrated by the previous example, which should always be kept in mind.

1. An estimator is a random variable. As such, its performance can only be completely described statistically or by its PDF.

2. The use of computer simulations for assessing estimation performance, although quite valuable for gaining insight and motivating conjectures, is never conclusive. At best, the true performance may be obtained to the desired degree of accuracy. At worst, for an insufficient number of experiments and/or errors in the simulation techniques employed, erroneous results may be obtained (see Appendix 7A for a further discussion of Monte Carlo computer techniques).

Another theme that we will repeatedly encounter is the tradeoff between performance and computational complexity. As in the previous example, even though Â has better performance, it also requires more computation. We will see that optimal estimators can sometimes be difficult to implement, requiring a multidimensional optimization or integration. In these situations, alternative estimators that are suboptimal, but which can be implemented on a digital computer, may be preferred. For any particular application, the user must determine whether the loss in performance is offset by the reduced computational complexity of a suboptimal estimator.

1.4 Some Notes to the Reader

Our philosophy in presenting a theory of estimation is to provide the user with the main ideas necessary for determining optimal estimators. We have included results that we deem to be most useful in practice, omitting some important theoretical issues. The latter can be found in many books on statistical estimation theory which have been written from a more theoretical viewpoint [Cox and Hinkley 1974, Kendall and Stuart 1976-1979, Rao 1973, Zacks 1981]. As mentioned previously, our goal is to obtain an optimal estimator, and we resort to a suboptimal one if the former cannot be found or is not implementable. The sequence of chapters in this book follows this approach, so that optimal estimators are discussed first, followed by approximately optimal estimators, and finally suboptimal estimators. In Chapter 14 a "road map" for finding a good estimator is presented along with a summary of the various estimators and their properties. The reader may wish to read this chapter first to obtain an overview.

We have tried to maximize insight by including many examples and minimizing long mathematical expositions, although much of the tedious algebra and proofs have been included as appendices. The DC level in noise described earlier will serve as a standard example in introducing almost all the estimation approaches. It is hoped that in doing so the reader will be able to develop his or her own intuition by building upon previously assimilated concepts. Also, where possible, the scalar estimator is presented first followed by the vector estimator. This approach reduces the tendency of vector/matrix algebra to obscure the main ideas. Finally, classical estimation is described first, followed by Bayesian estimation, again in the interest of not obscuring the main issues. The estimators obtained using the two approaches, although similar in appearance, are fundamentally different.

The mathematical notation for all common symbols is summarized in Appendix 2. The distinction between a continuous-time waveform and a discrete-time waveform or sequence is made through the symbolism x(t) for continuous-time and x[n] for discrete-time. Plots of x[n], however, appear continuous in time, the points having been connected by straight lines for easier viewing. All vectors and matrices are boldface with all vectors being column vectors. All other symbolism is defined within the context of the discussion.

References

Box, G.E.P., G.M. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco, 1970.
Burdic, W.S., Underwater Acoustic System Analysis, Prentice-Hall, Englewood Cliffs, N.J., 1984.
Cox, D.R., D.V. Hinkley, Theoretical Statistics, Chapman and Hall, New York, 1974.
Dabbous, T.E., N.U. Ahmed, J.C. McMillan, D.F. Liang, "Filtering of Discontinuous Processes Arising in Marine Integrated Navigation," IEEE Trans. Aerosp. Electron. Syst., Vol. 24, pp. 85-100, 1988.
Gauss, K.G., Theory of Motion of Heavenly Bodies, Dover, New York, 1963.
Holm, S., J.M. Hovem, "Estimation of Scalar Ocean Wave Spectra by the Maximum Entropy Method," IEEE J. Ocean Eng., Vol. 4, pp. 76-83, 1979.
Huber, P.J., Robust Statistics, J. Wiley, New York, 1981.
Jain, A.K., Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, N.J., 1989.
Justice, J.H., "Array Processing in Exploration Seismology," in Array Signal Processing, S. Haykin, ed., Prentice-Hall, Englewood Cliffs, N.J., 1985.
Kendall, Sir M., A. Stuart, The Advanced Theory of Statistics, Vols. 1-3, Macmillan, New York, 1976-1979.
Knight, W.S., R.G. Pridham, S.M. Kay, "Digital Signal Processing for Sonar," Proc. IEEE, Vol. 69, pp. 1451-1506, Nov. 1981.
Proakis, J.G., Digital Communications, McGraw-Hill, New York, 1983.
Rabiner, L.R., R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978.
Rao, C.R., Linear Statistical Inference and Its Applications, J. Wiley, New York, 1973.
Schuster, A., "On the Investigation of Hidden Periodicities with Application to a Supposed 26 Day Period of Meteorological Phenomena," Terrestrial Magnetism, Vol. 3, pp. 13-41, March 1898.
Skolnik, M.I., Introduction to Radar Systems, McGraw-Hill, New York, 1980.
Taylor, S., Modeling Financial Time Series, J. Wiley, New York, 1986.
Widrow, B., S.D. Stearns, Adaptive Signal Processing, Prentice-Hall, Englewood Cliffs, N.J., 1985.
Zacks, S., Parametric Statistical Inference, Pergamon, New York, 1981.
Problems

1. In a radar system an estimator of round trip delay τ0 has the PDF τ̂0 ~ N(τ0, σ²_τ̂0), where τ0 is the true value. If the range is to be estimated, propose an estimator R̂ and find its PDF. Next determine the standard deviation σ_τ̂0 so that 99% of the time the range estimate will be within 100 m of the true value. Use c = 3 × 10⁸ m/s for the speed of electromagnetic propagation.

2. An unknown parameter θ influences the outcome of an experiment which is modeled by the random variable x. The PDF of x is

p(x; θ) = (1/√(2π)) exp[-(1/2)(x - θ)²].

A series of experiments is performed, and x is found to always be in the interval [97, 103]. As a result, the investigator concludes that θ must have been 100. Is this assertion correct?

3. Let x = θ + w, where w is a random variable with PDF p_w(w). If θ is a deterministic parameter, find the PDF of x in terms of p_w and denote it by p(x; θ). Next assume that θ is a random variable independent of w and find the conditional PDF p(x|θ). Finally, do not assume that θ and w are independent and determine p(x|θ). What can you say about p(x; θ) versus p(x|θ)?

4. It is desired to estimate the value of a DC level A in WGN or

x[n] = A + w[n]    n = 0, 1, ..., N - 1

where w[n] is zero mean and uncorrelated, and each sample has variance σ² = 1. Consider the two estimators

Â = (1/N) Σ_{n=0}^{N-1} x[n]

Chapter 2

Minimum Variance Unbiased Estimation

2.1 Introduction

In this chapter we will begin our search for good estimators of unknown deterministic parameters. We will restrict our attention to estimators which on the average yield the true parameter value. Then, within this class of estimators the goal will be to find the one that exhibits the least variability. Hopefully, the estimator thus obtained will produce values close to the true value most of the time. The notion of a minimum variance unbiased estimator is examined within this chapter, but the means to find it will require some more theory. Succeeding chapters will provide that theory as well as apply it to many of the typical problems encountered in signal processing.
2.3 Unbiased Estimators

Example 2.1 - Unbiased Estimator for DC Level in White Gaussian Noise

Consider the observations

x[n] = A + w[n]    n = 0, 1, ..., N - 1

where A is the parameter to be estimated and w[n] is WGN. The parameter A can take on any value in the interval -∞ < A < ∞. Then, a reasonable estimator for the average value of x[n] is

Â = (1/N) Σ_{n=0}^{N-1} x[n]    (2.2)

or the sample mean. Due to the linearity properties of the expectation operator

E(Â) = E[(1/N) Σ_{n=0}^{N-1} x[n]]
     = (1/N) Σ_{n=0}^{N-1} E(x[n])
     = (1/N) Σ_{n=0}^{N-1} A
     = A

for all A. The sample mean estimator is therefore unbiased. ◇

In this example A can take on any value, although in general the values of an unknown parameter may be restricted by physical considerations. Estimating the resistance R of an unknown resistor, for example, would necessitate an interval 0 < R < ∞.

Unbiased estimators tend to have symmetric PDFs centered about the true value of θ, although this is not necessary (see Problem 2.5). For Example 2.1 the PDF is shown in Figure 2.1 and is easily shown to be N(A, σ²/N) (see Problem 2.3).

The restriction that E(θ̂) = θ for all θ is an important one. Letting θ̂ = g(x), where x = [x[0] x[1] ... x[N-1]]ᵀ, it asserts that

E(θ̂) = ∫ g(x) p(x; θ) dx = θ    for all θ.    (2.3)

It is possible, however, that (2.3) may hold for some values of θ and not others, as the next example illustrates.

Example 2.2 - Biased Estimator for DC Level in White Noise

Consider again Example 2.1 but with the modified sample mean estimator

Ǎ = (1/(2N)) Σ_{n=0}^{N-1} x[n].

Then,

E(Ǎ) = A/2
     = A    if A = 0
     ≠ A    if A ≠ 0.

It is seen that (2.3) holds for the modified estimator only for A = 0. Clearly, Ǎ is a biased estimator. ◇

That an estimator is unbiased does not necessarily mean that it is a good estimator. It only guarantees that on the average it will attain the true value. On the other hand, biased estimators are ones that are characterized by a systematic error, which presumably should not be present. A persistent bias will always result in a poor estimator. As an example, the unbiased property has an important implication when several estimators are combined (see Problem 2.4). It sometimes occurs that multiple estimates of the same parameter are available, i.e., {θ̂₁, θ̂₂, ..., θ̂ₙ}. A reasonable procedure is to combine these estimates into, hopefully, a better one by averaging them to form

θ̂ = (1/n) Σ_{i=1}^{n} θ̂ᵢ.    (2.4)

Assuming the estimators are unbiased, with the same variance, and uncorrelated with each other,

E(θ̂) = θ
and

var(θ̂) = (1/n²) Σ_{i=1}^{n} var(θ̂ᵢ) = var(θ̂₁)/n
so that as more estimates are averaged, the variance will decrease. Ultimately, as n → ∞, θ̂ → θ. However, if the estimators are biased, or E(θ̂ᵢ) = θ + b(θ), then

E(θ̂) = (1/n) Σ_{i=1}^{n} E(θ̂ᵢ) = θ + b(θ)

and no matter how many estimators are averaged, θ̂ will not converge to the true value. This is depicted in Figure 2.2. Note that, in general,

b(θ) = E(θ̂) - θ

is defined as the bias of the estimator.

[Figure 2.2 Effect of combining estimators: (a) unbiased estimator, for which p(θ̂) concentrates about the true value θ as n increases; (b) biased estimator, for which p(θ̂) concentrates about E(θ̂) = θ + b(θ), not θ.]

2.4 Minimum Variance Criterion

In searching for optimal estimators we need to adopt some optimality criterion. A natural one is the mean square error (MSE), defined as

mse(θ̂) = E[(θ̂ - θ)²].    (2.5)

This measures the average mean squared deviation of the estimator from the true value. Unfortunately, adoption of this natural criterion leads to unrealizable estimators, ones that cannot be written solely as a function of the data. To understand the problem which arises, we first rewrite the MSE as

mse(θ̂) = E{[(θ̂ - E(θ̂)) + (E(θ̂) - θ)]²} = var(θ̂) + [E(θ̂) - θ]² = var(θ̂) + b²(θ)    (2.6)

which shows that the MSE is composed of errors due to the variance of the estimator as well as the bias. As an example, for the problem in Example 2.1 consider the modified estimator

Ǎ = (a/N) Σ_{n=0}^{N-1} x[n]

for some constant a. We will attempt to find the a which results in the minimum MSE. Since E(Ǎ) = aA and var(Ǎ) = a²σ²/N, we have, from (2.6),

mse(Ǎ) = a²σ²/N + (a - 1)²A².

Differentiating the MSE with respect to a yields

d mse(Ǎ)/da = 2aσ²/N + 2(a - 1)A²

which upon setting to zero and solving yields the optimum value

a_opt = A² / (A² + σ²/N).

It is seen that, unfortunately, the optimal value of a depends upon the unknown parameter A. The estimator is therefore not realizable. In retrospect the estimator depends upon A since the bias term in (2.6) is a function of A. It would seem that any criterion which depends on the bias will lead to an unrealizable estimator. Although this is generally true, on occasion realizable minimum MSE estimators can be found [Bibby and Toutenburg 1977, Rao 1973, Stoica and Moses 1990].
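The variance-averaging result in (2.4) and the bias/variance trade-off in (2.6) are easy to check numerically. The following sketch (Python/NumPy; the values A = 1, σ² = 1, N = 10, n = 4 and the trial count are illustrative choices, not from the text) verifies that the sample mean is unbiased with variance σ²/N, that averaging n uncorrelated unbiased estimates divides the variance by n, and that the unrealizable scaling a_opt attains a smaller MSE than the unbiased estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
A, sigma2, N = 1.0, 1.0, 10       # assumed values, not from the text
trials = 200_000

# Sample mean of N points (Example 2.1): unbiased, variance sigma^2/N.
x = rng.normal(A, np.sqrt(sigma2), size=(trials, N))
A_hat = x.mean(axis=1)

# Averaging n uncorrelated unbiased estimates as in (2.4) divides the
# variance by n.
n = 4
A_comb = A_hat.reshape(-1, n).mean(axis=1)

# The MSE-optimal but unrealizable scaling a_opt = A^2/(A^2 + sigma^2/N)
# trades a small bias for lower variance, beating the unbiased MSE.
a_opt = A**2 / (A**2 + sigma2 / N)
mse_unbiased = np.mean((A_hat - A) ** 2)        # ~ sigma^2/N
mse_shrunk = np.mean((a_opt * A_hat - A) ** 2)  # smaller
```

Note that a_opt is computed here from the true A, which is exactly why the minimum MSE estimator is not realizable in practice.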
From a practical viewpoint the minimum MSE estimator needs to be abandoned. An alternative approach is to constrain the bias to be zero and find the estimator which minimizes the variance. Such an estimator is termed the minimum variance unbiased (MVU) estimator. Note from (2.6) that the MSE of an unbiased estimator is just the variance.

Minimizing the variance of an unbiased estimator also has the effect of concentrating the PDF of the estimation error, θ̂ - θ, about zero (see Problem 2.7). The estimation error will therefore be less likely to be large.

2.5 Existence of the Minimum Variance Unbiased Estimator

The question arises as to whether a MVU estimator exists, i.e., an unbiased estimator with minimum variance for all θ. Two possible situations are described in Figure 2.3. If there are three unbiased estimators whose variances are as shown in Figure 2.3a, then clearly θ̂₃ is the MVU estimator. If the situation in Figure 2.3b exists, however, then there is no MVU estimator, since for θ < θ₀, θ̂₂ is better, while for θ > θ₀, θ̂₃ is better. In the former case θ̂₃ is sometimes referred to as the uniformly minimum variance unbiased estimator to emphasize that its variance is smallest for all θ. In general, the MVU estimator does not always exist, as the following example illustrates.

[Figure 2.3 Possible dependence of estimator variance with θ: (a) θ̂₃ is the MVU estimator; (b) no MVU estimator exists.]

Example 2.3 - Counterexample to Existence of MVU Estimator

If the form of the PDF changes with θ, then it would be expected that the best estimator would also change with θ. Assume that we have two independent observations x[0] and x[1] with PDF

x[0] ~ N(θ, 1)
x[1] ~ N(θ, 1)  if θ ≥ 0
x[1] ~ N(θ, 2)  if θ < 0.

The two estimators

θ̂₁ = (1/2)(x[0] + x[1])
θ̂₂ = (2/3)x[0] + (1/3)x[1]

can easily be shown to be unbiased. To compute the variances we have that

var(θ̂₁) = (1/4)(var(x[0]) + var(x[1]))
var(θ̂₂) = (4/9)var(x[0]) + (1/9)var(x[1])

so that

var(θ̂₁) = 18/36 and var(θ̂₂) = 20/36  if θ ≥ 0

and

var(θ̂₁) = 27/36 and var(θ̂₂) = 24/36  if θ < 0.

The variances are shown in Figure 2.4. Clearly, between these two estimators no MVU estimator exists. It is shown in Problem 3.6 that for θ ≥ 0 the minimum possible variance of an unbiased estimator is 18/36, while that for θ < 0 is 24/36. Hence, no single estimator can have a variance uniformly less than or equal to the minima shown in Figure 2.4. ◇

[Figure 2.4 Illustration of nonexistence of minimum variance unbiased estimator]

To conclude our discussion of existence we should note that it is also possible that there may not exist even a single unbiased estimator (see Problem 2.11). In this case any search for a MVU estimator is fruitless.

2.6 Finding the Minimum Variance Unbiased Estimator

Even if a MVU estimator exists, we may not be able to find it. There is no known "turn-the-crank" procedure which will always produce the estimator. In the next few chapters we shall discuss several possible approaches. They are:
1. Determine the Cramer-Rao lower bound (CRLB) and check to see if some estimator satisfies it (Chapters 3 and 4).

2. Apply the Rao-Blackwell-Lehmann-Scheffe (RBLS) theorem (Chapter 5).

3. Further restrict the class of estimators to be not only unbiased but also linear. Then, find the minimum variance estimator within this restricted class (Chapter 6).

Approaches 1 and 2 may produce the MVU estimator, while 3 will yield it only if the MVU estimator is linear in the data.

The CRLB allows us to determine that for any unbiased estimator the variance must be greater than or equal to a given value, as shown in Figure 2.5. If an estimator exists whose variance equals the CRLB for each value of θ, then it must be the MVU estimator. In this case, the theory of the CRLB immediately yields the estimator. It may happen that no estimator exists whose variance equals the bound. Yet, a MVU estimator may still exist, as for instance in the case of θ̂₁ in Figure 2.5. Then, we must resort to the Rao-Blackwell-Lehmann-Scheffe theorem. This procedure first finds a sufficient statistic, one which uses all the data efficiently, and then finds a function of the sufficient statistic which is an unbiased estimator of θ. With a slight restriction on the PDF of the data this procedure will then be guaranteed to produce the MVU estimator. The third approach requires the estimator to be linear, a sometimes severe restriction, and chooses the best linear estimator. Of course, only for particular data sets can this approach produce the MVU estimator.

[Figure 2.5 Cramer-Rao lower bound on variance of unbiased estimator]

2.7 Extension to a Vector Parameter

If θ = [θ₁ θ₂ ... θₚ]ᵀ is a vector of unknown parameters, then we say that an estimator θ̂ = [θ̂₁ θ̂₂ ... θ̂ₚ]ᵀ is unbiased if

E(θ̂ᵢ) = θᵢ,    aᵢ < θᵢ < bᵢ    (2.7)

for i = 1, 2, ..., p. An unbiased estimator can then equivalently be defined to have the property E(θ̂) = θ for every θ contained within the space defined in (2.7). A MVU estimator has the additional property that var(θ̂ᵢ), for i = 1, 2, ..., p, is minimum among all unbiased estimators.

References

Bibby, J., H. Toutenburg, Prediction and Improved Estimation in Linear Models, J. Wiley, New York, 1977.
Rao, C.R., Linear Statistical Inference and Its Applications, J. Wiley, New York, 1973.
Stoica, P., R. Moses, "On Biased Estimators and the Unbiased Cramer-Rao Lower Bound," Signal Process., vol. 21, pp. 349-350, 1990.

Problems

2.1 The data {x[0], x[1], ..., x[N-1]} are observed where the x[n]'s are independent and identically distributed (IID) as N(0, σ²). We wish to estimate the variance σ² as

σ̂² = (1/N) Σ_{n=0}^{N-1} x²[n].

Is this an unbiased estimator? Find the variance of σ̂² and examine what happens as N → ∞.

2.2 Consider the data {x[0], x[1], ..., x[N-1]}, where each sample is distributed as U[0, θ] and the samples are IID. Can you find an unbiased estimator for θ? The range of θ is 0 < θ < ∞.

2.3 Prove that the PDF of Â given in Example 2.1 is N(A, σ²/N).

2.4 The heart rate h of a patient is automatically recorded by a computer every 100 ms. In 1 s the measurements {h₁, h₂, ..., h₁₀} are averaged to obtain ĥ. If E(hᵢ) = αh and the estimator ĥ = Σ aₙhₙ is proposed, find the aₙ's so that the estimator is unbiased and the variance is minimized. Hint: Use Lagrangian multipliers with unbiasedness as the constraint equation.

2.6 ... ∫ g(u) du = 1. Next, prove that a function g cannot be found to satisfy this condition for all θ > 0.

2.7 Two unbiased estimators are proposed whose variances satisfy var(θ̂) < var(θ̌). If both estimators are Gaussian, prove that

Pr{|θ̂ - θ| > ε} < Pr{|θ̌ - θ| > ε}

for any ε > 0. This says that the estimator with less variance is to be preferred since its PDF is more concentrated about the true value.

2.8 For the problem described in Example 2.1 show that as N → ∞, Â → A by using the results of Problem 2.3. To do so prove that

lim_{N→∞} Pr{|Â - A| > ε} = 0

for any ε > 0. In this case the estimator Â is said to be consistent. Investigate what happens if the alternative estimator Ǎ = (1/(2N)) Σ_{n=0}^{N-1} x[n] is used instead.

2.9 This problem illustrates what happens to an unbiased estimator when it undergoes a nonlinear transformation. In Example 2.1, if we choose to estimate the unknown parameter θ = A² by

θ̂ = ((1/N) Σ_{n=0}^{N-1} x[n])²,

can we say that the estimator is unbiased? What happens as N → ∞?

2.10 In Example 2.1 assume now that in addition to A, the value of σ² is also unknown. We wish to estimate the vector parameter θ = [A σ²]ᵀ. Is the estimator

θ̂ = [Â  σ̂²]ᵀ,  where Â = (1/N) Σ_{n=0}^{N-1} x[n] and σ̂² = (1/(N-1)) Σ_{n=0}^{N-1} (x[n] - Â)²,

unbiased?
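The consistency claim of Problem 2.8 and the nonlinear-transformation effect of Problem 2.9 can be illustrated with a short simulation. A sketch follows (Python/NumPy; parameter values and trial counts are arbitrary choices). Since Â ~ N(A, σ²/N) by Problem 2.3, the sample mean is drawn directly from that PDF rather than by averaging raw samples.

```python
import numpy as np

rng = np.random.default_rng(1)
A, sigma2 = 1.0, 1.0              # assumed values, not from the text
trials = 200_000

def sample_mean(N):
    # Ahat ~ N(A, sigma^2/N); draw it directly from its known PDF.
    return rng.normal(A, np.sqrt(sigma2 / N), size=trials)

# Problem 2.8 (consistency): Pr{|Ahat - A| > eps} shrinks as N grows.
eps = 0.1
p_far_10 = np.mean(np.abs(sample_mean(10) - A) > eps)
p_far_1000 = np.mean(np.abs(sample_mean(1000) - A) > eps)

# Problem 2.9 (nonlinear transformation): E[Ahat^2] = A^2 + sigma^2/N, so
# Ahat^2 is biased for A^2, but the bias dies out as N -> infinity.
bias_10 = np.mean(sample_mean(10) ** 2) - A**2      # ~ sigma^2/10
bias_1000 = np.mean(sample_mean(1000) ** 2) - A**2  # ~ sigma^2/1000
```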
Chapter 3

Cramer-Rao Lower Bound

3.1 Introduction
Being able to place a lower bound on the variance of any unbiased estimator proves
to be extremely useful in practice. At best, it allows us to assert that an estimator is
the MVU estimator. This will be the case if the estimator attains the bound for all
values of the unknown parameter. At worst, it provides a benchmark against which we
can compare the performance of any unbiased estimator. Furthermore, it alerts us to
the physical impossibility of finding an unbiased estimator whose variance is less than
the bound. The latter is often useful in signal processing feasibility studies. Although
many such variance bounds exist [McAulay and Hofstetter 1971, Kendall and Stuart
1979, Seidman 1970, Ziv and Zakai 1969], the Cramer-Rao lower bound (CRLB) is by
far the easiest to determine. Also, the theory allows us to immediately determine if
an estimator exists that attains the bound. If no such estimator exists, then all is not
lost since estimators can be found that attain the bound in an approximate sense, as
described in Chapter 7. For these reasons we restrict our discussion to the CRLB.
3.2 Summary
The CRLB for a scalar parameter is given by (3.6). If the condition (3.7) is satisfied,
then the bound will be attained and the estimator that attains it is readily found.
An alternative means of determining the CRLB is given by (3.12). For a signal with
an unknown parameter in WGN, (3.14) provides a convenient means to evaluate the
bound. When a function of a parameter is to be estimated, the CRLB is given by
(3.16). Even though an efficient estimator may exist for θ, in general there will not be one for a function of θ (unless the function is linear). For a vector parameter the CRLB
is determined using (3.20) and (3.21). As in the scalar parameter case, if condition
(3.25) holds, then the bound is attained and the estimator that attains the bound is
easily found. For a function of a vector parameter (3.30) provides the bound. A general
formula for the Fisher information matrix (used to determine the vector CRLB) for a
multivariate Gaussian PDF is given by (3.31). Finally, if the data set comes from a
WSS Gaussian random process, then an approximate CRLB, which depends on the PSD, is given by (3.34). It is valid asymptotically, or as the data record length becomes large.

3.3 Estimator Accuracy Considerations

Before stating the CRLB theorem, it is worthwhile to expose the hidden factors that determine how well we can estimate a parameter. Since all our information is embodied in the observed data and the underlying PDF for that data, it is not surprising that the estimation accuracy depends directly on the PDF. For instance, we should not expect to be able to estimate a parameter with any degree of accuracy if the PDF depends only weakly upon that parameter, or in the extreme case, if the PDF does not depend on it at all. In general, the more the PDF is influenced by the unknown parameter, the better we should be able to estimate it.

Example 3.1 - PDF Dependence on Unknown Parameter

Consider the single observation x[0] = A + w[0] with w[0] ~ N(0, σ²), where the PDFs for σ₁ = 1/3 and σ₂ = 1 are those shown in Figure 3.1, and suppose the value x[0] = 3 is observed.

[Figure 3.1 PDF dependence on unknown parameter: (a) σ₁ = 1/3; (b) σ₂ = 1]

To quantify the dependence we determine the probability of observing x[0] in the interval [x[0] - δ/2, x[0] + δ/2] = [3 - δ/2, 3 + δ/2] when A takes on a given value, or

Pr{3 - δ/2 ≤ x[0] ≤ 3 + δ/2} = ∫_{3-δ/2}^{3+δ/2} pᵢ(u; A) du

which for δ small is pᵢ(x[0] = 3; A)δ. But p₁(x[0] = 3; A = 4)δ = 0.01δ, while p₁(x[0] = 3; A = 3)δ = 1.20δ. The probability of observing x[0] in a small interval centered about x[0] = 3 when A = 4 is small with respect to that when A = 3. Hence, the values of A > 4 can be eliminated from consideration. It might be argued that values of A in the interval 3 ± 3σ₁ = [2, 4] are viable candidates for the true value. In Figure 3.1b there is a much weaker dependence on A. Here our viable candidates are in the much wider interval 3 ± 3σ₂ = [0, 6]. ◇

When the PDF is viewed as a function of the unknown parameter (with x fixed), it is termed the likelihood function. Two examples of likelihood functions were shown in Figure 3.1. Intuitively, the "sharpness" of the likelihood function determines how accurately we can estimate the unknown parameter. To quantify this notion, observe that the sharpness is effectively measured by the negative of the second derivative of the logarithm of the likelihood function at its peak. This is the curvature of the log-likelihood function. In Example 3.1, if we consider the natural logarithm of the PDF

ln p(x[0]; A) = -ln √(2πσ²) - (1/(2σ²))(x[0] - A)²

then the first derivative is

∂ln p(x[0]; A)/∂A = (1/σ²)(x[0] - A).    (3.2)
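The likelihood values quoted in Example 3.1 (p₁(x[0] = 3; A = 3) ≈ 1.20 versus p₁(x[0] = 3; A = 4) ≈ 0.01) and the curvature comparison can be reproduced with a few lines of Python; this is an illustrative check, not part of the text.

```python
import math

def normal_pdf(x, mean, sigma):
    # Gaussian density evaluated at x for the given mean and standard deviation.
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Sharp likelihood (sigma1 = 1/3): observing x[0] = 3 strongly favors A near 3.
p_A3 = normal_pdf(3.0, 3.0, 1.0 / 3.0)   # approximately 1.20
p_A4 = normal_pdf(3.0, 4.0, 1.0 / 3.0)   # approximately 0.01

# Broad likelihood (sigma2 = 1): the same observation discriminates far less.
q_A3 = normal_pdf(3.0, 3.0, 1.0)
q_A4 = normal_pdf(3.0, 4.0, 1.0)

# Curvature of the log-likelihood, -d^2 ln p / dA^2 = 1/sigma^2, is larger
# for the sharper PDF, anticipating the bound var(Ahat) >= sigma^2.
curvature_sharp = 1.0 / (1.0 / 3.0) ** 2   # = 9
curvature_broad = 1.0 / 1.0 ** 2           # = 1
```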
3.4 Cramer-Rao Lower Bound

We are now ready to state the CRLB theorem.

Theorem 3.1 (Cramer-Rao Lower Bound - Scalar Parameter) It is assumed that the PDF p(x; θ) satisfies the "regularity" condition

E[∂ln p(x; θ)/∂θ] = 0   for all θ

where the expectation is taken with respect to p(x; θ). Then, the variance of any unbiased estimator θ̂ must satisfy

var(θ̂) ≥ 1 / ( -E[∂²ln p(x; θ)/∂θ²] )    (3.6)

where the derivative is evaluated at the true value of θ and the expectation is taken with respect to p(x; θ). Furthermore, an unbiased estimator may be found that attains the bound for all θ if and only if

∂ln p(x; θ)/∂θ = I(θ)(g(x) - θ)    (3.7)

for some functions g and I. That estimator, which is the MVU estimator, is θ̂ = g(x), and the minimum variance is 1/I(θ).

The expectation in (3.6) is explicitly given by

E[∂²ln p(x; θ)/∂θ²] = ∫ (∂²ln p(x; θ)/∂θ²) p(x; θ) dx

since the second derivative is a random variable dependent on x. Also, the bound will depend on θ in general, so that it is displayed as in Figure 2.5 (dashed curve). An example of a PDF that does not satisfy the regularity condition is given in Problem 3.1. For a proof of the theorem see Appendix 3A.

Some examples are now given to illustrate the evaluation of the CRLB.

Example 3.2 - CRLB for Example 3.1

For Example 3.1 we see from (3.3) and (3.6) that

var(Â) ≥ σ²

for all A. Moreover, by comparing (3.2) with (3.7) we see that the condition is satisfied with I(θ) = 1/σ² and g(x[0]) = x[0]. Hence, Â = g(x[0]) = x[0] is the MVU estimator. Also, note that var(Â) = σ² = 1/I(θ), so that according to (3.6) we must have attained the minimum variance. We will return to this after the next example. See also Problem 3.2 for a generalization to the non-Gaussian case. ◇

Example 3.3 - DC Level in White Gaussian Noise

Generalizing Example 3.1, consider the multiple observations

x[n] = A + w[n],    n = 0, 1, ..., N - 1

where w[n] is WGN with variance σ². To determine the CRLB for A, write

p(x; A) = ∏_{n=0}^{N-1} (1/√(2πσ²)) exp[-(1/(2σ²))(x[n] - A)²]
        = (1/(2πσ²)^{N/2}) exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - A)²].

Taking the first derivative,

∂ln p(x; A)/∂A = ∂/∂A [-ln (2πσ²)^{N/2} - (1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - A)²]
              = (1/σ²) Σ_{n=0}^{N-1} (x[n] - A)
              = (N/σ²)(x̄ - A)    (3.8)
where x̄ is the sample mean. Differentiating again,

∂²ln p(x; A)/∂A² = -N/σ²

and noting that the second derivative is a constant, we have from (3.6)

var(Â) ≥ σ²/N    (3.9)

as the CRLB. Also, by comparing (3.7) and (3.8) we see that the sample mean estimator attains the bound and must therefore be the MVU estimator. Also, once again the minimum variance is given by the reciprocal of the constant N/σ² in (3.8). (See also Problems 3.3-3.5 for variations on this example.) ◇

We now prove that when the CRLB is attained,

var(θ̂) = 1/I(θ)

where

I(θ) = -E[∂²ln p(x; θ)/∂θ²].

From (3.6) and (3.7),

var(θ̂) = 1 / ( -E[∂²ln p(x; θ)/∂θ²] )

and

∂ln p(x; θ)/∂θ = I(θ)(θ̂ - θ).

Differentiating the latter produces

∂²ln p(x; θ)/∂θ² = (∂I(θ)/∂θ)(θ̂ - θ) - I(θ)

and upon taking the negative expected value, since E(θ̂) = θ,

-E[∂²ln p(x; θ)/∂θ²] = I(θ)

and therefore

var(θ̂) = 1/I(θ).    (3.10)

In the next example we will see that the CRLB is not always satisfied.

Example 3.4 - Phase Estimation

Assume that we wish to estimate the phase φ of a sinusoid embedded in WGN, or

x[n] = A cos(2πf₀n + φ) + w[n],    n = 0, 1, ..., N - 1.

The amplitude A and frequency f₀ are assumed known (see Example 3.14 for the case when they are unknown). The PDF is

p(x; φ) = (1/(2πσ²)^{N/2}) exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - A cos(2πf₀n + φ))²].

Differentiating the log-likelihood function produces

∂ln p(x; φ)/∂φ = -(1/σ²) Σ_{n=0}^{N-1} [x[n] - A cos(2πf₀n + φ)] A sin(2πf₀n + φ)
              = -(A/σ²) Σ_{n=0}^{N-1} [x[n] sin(2πf₀n + φ) - (A/2) sin(4πf₀n + 2φ)]

and

∂²ln p(x; φ)/∂φ² = -(A/σ²) Σ_{n=0}^{N-1} [x[n] cos(2πf₀n + φ) - A cos(4πf₀n + 2φ)].

Upon taking the negative expected value we have

-E[∂²ln p(x; φ)/∂φ²] = (A/σ²) Σ_{n=0}^{N-1} [A cos²(2πf₀n + φ) - A cos(4πf₀n + 2φ)]
                     = (A²/σ²) Σ_{n=0}^{N-1} [1/2 + (1/2)cos(4πf₀n + 2φ) - cos(4πf₀n + 2φ)]
                     ≈ NA²/(2σ²)

since the cosine terms average to approximately zero for f₀ not near 0 or 1/2. Therefore,

var(φ̂) ≥ 2σ²/(NA²).

In this example the condition (3.7) for the bound to hold is not satisfied. Hence, a phase estimator does not exist which is unbiased and attains the CRLB. It is still possible, however, that an MVU estimator may exist. At this point we do not know how to
determine whether an MVU estimator exists, and if it does, how to find it. The theory of sufficient statistics presented in Chapter 5 will allow us to answer these questions. ◇

An estimator which is unbiased and attains the CRLB, as the sample mean estimator in Example 3.3 does, is said to be efficient in that it efficiently uses the data. An MVU estimator may or may not be efficient. For instance, in Figure 3.2 the variances of all possible estimators (for purposes of illustration there are three unbiased estimators) are displayed. In Figure 3.2a, θ̂₁ is efficient in that it attains the CRLB. Therefore, it is also the MVU estimator. On the other hand, in Figure 3.2b, θ̂₁ does not attain the CRLB, and hence it is not efficient. However, since its variance is uniformly less than that of all other unbiased estimators, it is the MVU estimator.

[Figure 3.2 Efficiency versus minimum variance: (a) θ̂₁ efficient and MVU; (b) θ̂₁ MVU but not efficient]

The CRLB given by (3.6) may also be expressed in a slightly different form. Although (3.6) is usually more convenient for evaluation, the alternative form is sometimes useful for theoretical work. It follows from the identity (see Appendix 3A)

-E[∂²ln p(x; θ)/∂θ²] = E[(∂ln p(x; θ)/∂θ)²]    (3.11)

so that

var(θ̂) ≥ 1 / E[(∂ln p(x; θ)/∂θ)²]    (3.12)

(see Problem 3.8).

The denominator in (3.6) is referred to as the Fisher information I(θ) for the data x, or

I(θ) = -E[∂²ln p(x; θ)/∂θ²].

As we saw previously, when the CRLB is attained, the variance is the reciprocal of the Fisher information. Intuitively, the more information, the lower the bound. It has the essential properties of an information measure in that it is

1. nonnegative, due to (3.11)
2. additive for independent observations.

The latter property leads to the result that the CRLB for N IID observations is 1/N times that for one observation. To verify this, note that for independent observations

ln p(x; θ) = Σ_{n=0}^{N-1} ln p(x[n]; θ).

This results in

-E[∂²ln p(x; θ)/∂θ²] = -Σ_{n=0}^{N-1} E[∂²ln p(x[n]; θ)/∂θ²]

and finally, for identically distributed observations,

I(θ) = N i(θ)

where

i(θ) = -E[∂²ln p(x[n]; θ)/∂θ²]

is the Fisher information for one sample. For nonindependent samples we might expect that the information will be less than N i(θ), as Problem 3.9 illustrates. For completely dependent samples, as for example x[0] = x[1] = ... = x[N-1], we will have I(θ) = i(θ) (see also Problem 3.9). Therefore, additional observations carry no information, and the CRLB will not decrease with increasing data record length.

3.5 General CRLB for Signals in White Gaussian Noise

Since it is common to assume white Gaussian noise, it is worthwhile to derive the CRLB for this case. Later, we will extend this to nonwhite Gaussian noise and a vector parameter as given by (3.31). Assume that a deterministic signal with an unknown parameter θ is observed in WGN as

x[n] = s[n; θ] + w[n],    n = 0, 1, ..., N - 1.

The dependence of the signal on θ is explicitly noted. The likelihood function is

p(x; θ) = (1/(2πσ²)^{N/2}) exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - s[n; θ])²].    (3.13)

Differentiating once produces

∂ln p(x; θ)/∂θ = (1/σ²) Σ_{n=0}^{N-1} (x[n] - s[n; θ]) ∂s[n; θ]/∂θ
and a second differentiation results in

∂²ln p(x; θ)/∂θ² = (1/σ²) Σ_{n=0}^{N-1} {(x[n] - s[n; θ]) ∂²s[n; θ]/∂θ² - (∂s[n; θ]/∂θ)²}.

Upon taking the expected value, since E(x[n]) = s[n; θ], we have

-E[∂²ln p(x; θ)/∂θ²] = (1/σ²) Σ_{n=0}^{N-1} (∂s[n; θ]/∂θ)²

so that finally

var(θ̂) ≥ σ² / Σ_{n=0}^{N-1} (∂s[n; θ]/∂θ)².    (3.14)

The form of the bound demonstrates the importance of the signal dependence on θ. Signals that change rapidly as the unknown parameter changes result in accurate estimators. A simple application of (3.14) to Example 3.3, in which s[n; θ] = θ, produces a CRLB of σ²/N. The reader should also verify the results of Example 3.4. As a final example we examine the problem of frequency estimation.

Example 3.5 - Sinusoidal Frequency Estimation

If the sinusoidal frequency is the unknown parameter, so that s[n; f₀] = A cos(2πf₀n + φ) with A and φ known, then ∂s[n; f₀]/∂f₀ = -2πnA sin(2πf₀n + φ), and (3.14) gives

var(f̂₀) ≥ σ² / (A² Σ_{n=0}^{N-1} [2πn sin(2πf₀n + φ)]²).

[Figure: CRLB for sinusoidal frequency estimation versus frequency, plotted over 0 ≤ f₀ ≤ 0.5]

3.6 Transformation of Parameters

It frequently occurs in practice that the parameter we wish to estimate is a function of some more fundamental parameter. For instance, in Example 3.3 we may not be interested in the sign of A but instead may wish to estimate A², or the power of the signal. Knowing the CRLB for A, we can easily obtain it for A², or in general for any function of A. As shown in Appendix 3A, if it is desired to estimate α = g(θ), then the CRLB is

var(α̂) ≥ (∂g/∂θ)² / ( -E[∂²ln p(x; θ)/∂θ²] ).    (3.16)

For a linear transformation g(θ) = aθ + b, the bound is a² var(θ̂); but var(g(θ̂)) = var(aθ̂ + b) = a² var(θ̂), so that the CRLB is achieved. Although efficiency is preserved only over linear transformations, it is approximately maintained over nonlinear transformations if the data record is large enough. This has great practical significance in that we are frequently interested in estimating functions of parameters. To see why this property holds, we return to the previous example of estimating A² by x̄². Although x̄² is biased, we note from (3.18) that x̄² is asymptotically unbiased, or unbiased as N → ∞. Furthermore, since x̄ ~ N(A, σ²/N), we can evaluate the variance

var(x̄²) = E(x̄⁴) - E²(x̄²)

by using the result that if ξ ~ N(μ, σ²), then E(ξ⁴) = μ⁴ + 6μ²σ² + 3σ⁴.

As N increases, the values of x̄ that are observed lie in a small interval about x̄ = A (the 3-standard-deviation interval). Over this small interval the nonlinear transformation is approximately linear. Therefore, the transformation may be replaced by a linear one, since a value of x̄ in the nonlinear region rarely occurs. In fact, if we linearize g about A, we have the approximation

g(x̄) ≈ g(A) + (dg(A)/dA)(x̄ - A).

It follows that, to within this approximation,

E[g(x̄)] = g(A) = A²
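A short simulation illustrates this (Python/NumPy; A = 1, σ² = 1, N = 1000 are arbitrary choices, not from the text): x̄² is asymptotically unbiased for A², and its variance is well predicted by the linearization var(g(x̄)) ≈ (dg/dA)² var(x̄) = 4A²σ²/N.

```python
import numpy as np

rng = np.random.default_rng(2)
A, sigma2, N = 1.0, 1.0, 1000     # assumed values, not from the text
trials = 200_000

# Draw the sample mean directly from its known PDF, xbar ~ N(A, sigma^2/N).
xbar = rng.normal(A, np.sqrt(sigma2 / N), size=trials)
est = xbar**2                     # estimator of alpha = A^2

bias = est.mean() - A**2          # exact bias is sigma^2/N, vanishing as N grows
var_mc = est.var()
var_lin = 4 * A**2 * sigma2 / N   # linearization: (dg/dA)^2 * var(xbar)
```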
3.7 Extension to a Vector Parameter

We now consider a vector parameter θ = [θ₁ θ₂ ... θₚ]ᵀ with an unbiased estimator θ̂, as defined in Section 2.7. The vector parameter CRLB will allow us to place a bound on the variance of each element. As derived in Appendix 3B, the CRLB is found as the [i, i] element of the inverse of a matrix, or

var(θ̂ᵢ) ≥ [I⁻¹(θ)]ᵢᵢ    (3.20)

where I(θ) is the p × p Fisher information matrix. The latter is defined by

[I(θ)]ᵢⱼ = -E[∂²ln p(x; θ)/∂θᵢ∂θⱼ]    (3.21)

for i = 1, 2, ..., p; j = 1, 2, ..., p. In evaluating (3.21) the true value of θ is used. Note that in the scalar case (p = 1), I(θ) reduces to the scalar Fisher information and we have the scalar CRLB. Some examples follow.

Example 3.6 - DC Level in White Gaussian Noise (Revisited)

We now extend Example 3.3 to the case where in addition to A the noise variance σ² is also unknown. The parameter vector is θ = [A σ²]ᵀ, and hence p = 2. The 2 × 2 Fisher information matrix is

I(θ) = [ -E[∂²ln p(x; θ)/∂A²]       -E[∂²ln p(x; θ)/∂A∂σ²]
         -E[∂²ln p(x; θ)/∂σ²∂A]     -E[∂²ln p(x; θ)/∂(σ²)²] ].

The log-likelihood function is

ln p(x; θ) = -(N/2) ln 2π - (N/2) ln σ² - (1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - A)².

The derivatives are easily found as

∂ln p(x; θ)/∂A = (1/σ²) Σ_{n=0}^{N-1} (x[n] - A)
∂ln p(x; θ)/∂σ² = -N/(2σ²) + (1/(2σ⁴)) Σ_{n=0}^{N-1} (x[n] - A)²
∂²ln p(x; θ)/∂A² = -N/σ²
∂²ln p(x; θ)/∂A∂σ² = -(1/σ⁴) Σ_{n=0}^{N-1} (x[n] - A)
∂²ln p(x; θ)/∂(σ²)² = N/(2σ⁴) - (1/σ⁶) Σ_{n=0}^{N-1} (x[n] - A)².

Upon taking the negative expectations, the Fisher information matrix becomes

I(θ) = [ N/σ²    0
         0       N/(2σ⁴) ].

Although not true in general, for this example the Fisher information matrix is diagonal and hence easily inverted to yield

var(Â) ≥ σ²/N
var(σ̂²) ≥ 2σ⁴/N.

Note that the CRLB for Â is the same as for the case when σ² is known, due to the diagonal nature of the matrix. Again this is not true in general, as the next example illustrates. ◇

Example 3.7 - Line Fitting

Consider the observations x[n] = A + Bn + w[n], n = 0, 1, ..., N - 1, where w[n] is WGN with known variance σ² and the parameter vector θ = [A B]ᵀ is to be estimated. The likelihood function is

p(x; θ) = (1/(2πσ²)^{N/2}) exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - A - Bn)²]

from which the derivatives follow as

∂ln p(x; θ)/∂A = (1/σ²) Σ_{n=0}^{N-1} (x[n] - A - Bn)
∂ln p(x; θ)/∂B = (1/σ²) Σ_{n=0}^{N-1} (x[n] - A - Bn)n
and

∂²ln p(x; θ)/∂A² = -N/σ²
∂²ln p(x; θ)/∂A∂B = -(1/σ²) Σ_{n=0}^{N-1} n
∂²ln p(x; θ)/∂B² = -(1/σ²) Σ_{n=0}^{N-1} n².

Since the second derivatives do not depend on x, we have immediately that

I(θ) = (1/σ²) [ N                Σ_{n=0}^{N-1} n
                Σ_{n=0}^{N-1} n  Σ_{n=0}^{N-1} n² ]
     = (1/σ²) [ N            N(N-1)/2
                N(N-1)/2     N(N-1)(2N-1)/6 ]

using the identities Σ_{n=0}^{N-1} n = N(N-1)/2 and Σ_{n=0}^{N-1} n² = N(N-1)(2N-1)/6. Inverting the matrix yields

I⁻¹(θ) = σ² [ 2(2N-1)/(N(N+1))    -6/(N(N+1))
              -6/(N(N+1))          12/(N(N²-1)) ].

It follows from (3.20) that the CRLB is

var(Â) ≥ 2(2N-1)σ² / (N(N+1))
var(B̂) ≥ 12σ² / (N(N²-1)).

Some interesting observations follow from examination of the CRLB. Note first that the CRLB for A has increased over that obtained when B is known, for in the latter case we have var(Â) ≥ σ²/N. Also, the CRLB for B̂ falls below that for Â for N ≥ 3. Hence, B is easier to estimate, its CRLB decreasing as 1/N³ as opposed to the 1/N dependence for the CRLB of A. These differing dependences indicate that x[n] is more sensitive to changes in B than to changes in A. A simple calculation reveals

Δx[n] ≈ (∂x[n]/∂A) ΔA = ΔA
Δx[n] ≈ (∂x[n]/∂B) ΔB = n ΔB.

Changes in B are magnified by n, as illustrated in Figure 3.5. This effect is reminiscent of (3.14), and indeed a similar type of relationship is obtained in the vector parameter case (see (3.33)). See Problem 3.13 for a generalization of this example. ◇

[Figure 3.5 Sensitivity of observations to parameter changes (no noise): (a) A = 0, B = 0 to A = 1, B = 0; (b) A = 0, B = 0 to A = 0, B = 1]

As an alternative means of computing the CRLB we can use the identity

E[(∂ln p(x; θ)/∂θᵢ)(∂ln p(x; θ)/∂θⱼ)] = -E[∂²ln p(x; θ)/∂θᵢ∂θⱼ]    (3.23)

as shown in Appendix 3B. The form given on the right-hand side is usually easier to evaluate, however.
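The line-fitting bound can be checked numerically by forming the Fisher information matrix from its definition and inverting it (a sketch, not from the text; N = 10 and σ² = 1 are arbitrary values).

```python
import numpy as np

N, sigma2 = 10, 1.0                      # assumed values, not from the text
n = np.arange(N)

# Fisher information matrix for theta = [A, B]^T in x[n] = A + B*n + w[n] (WGN):
# I(theta) = (1/sigma^2) [[N, sum n], [sum n, sum n^2]].
I = np.array([[N, n.sum()], [n.sum(), (n**2).sum()]], dtype=float) / sigma2
crlb = np.linalg.inv(I)                  # CRLB matrix, per (3.20)

# Closed forms for the diagonal elements.
crlb_A = 2 * (2 * N - 1) * sigma2 / (N * (N + 1))
crlb_B = 12 * sigma2 / (N * (N**2 - 1))
```

For N = 10 the bound on B̂ (about 0.012σ²) is far below the bound on Â (about 0.345σ²), reflecting the greater sensitivity of the data to the slope.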
We now formally state the CRLB theorem for a vector parameter. Included in the theorem are conditions for equality. The bound is stated in terms of the covariance matrix of θ̂, denoted Cθ̂: assuming the regularity conditions hold, Cθ̂ - I⁻¹(θ) is positive semidefinite, where the derivatives are evaluated at the true value of θ and the expectation is taken with respect to p(x; θ). Furthermore, an unbiased estimator may be found that attains the bound, in that Cθ̂ = I⁻¹(θ), if and only if

∂ln p(x; θ)/∂θ = I(θ)(g(x) - θ)    (3.25)

for some p-dimensional function g and some p × p matrix I. That estimator, which is the MVU estimator, is θ̂ = g(x), and its covariance matrix is I⁻¹(θ).

The proof is given in Appendix 3B. That (3.20) follows from (3.24) is shown by noting that for a positive semidefinite matrix the diagonal elements are nonnegative. Hence

[Cθ̂ - I⁻¹(θ)]ᵢᵢ ≥ 0

and therefore

var(θ̂ᵢ) ≥ [I⁻¹(θ)]ᵢᵢ.

If the equality conditions hold, the reader may ask whether we can be assured that θ̂ is unbiased. Because the regularity conditions

E[∂ln p(x; θ)/∂θ] = 0

are always assumed to hold, we can apply them to (3.25). This then yields E[g(x)] = E(θ̂) = θ.

In finding MVU estimators for a vector parameter the CRLB theorem provides a powerful tool. In particular, it allows us to find the MVU estimator for an important class of data models. This class is the linear model and is described in detail in Chapter 4. The line fitting example just discussed is a special case. Suffice it to say that if we can model our data in the linear model form, then the MVU estimator and its performance are easily found.

For a deterministic signal with an unknown parameter observed in WGN,

x[n] = s[n; θ] + w[n],    n = 0, 1, ..., N - 1

the covariance matrix is C = σ²I and does not depend on θ. The second term in (3.32) is therefore zero, which agrees with the results in Example 3.6. A slightly more complicated example follows. ◇
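The identity (3.23) (equivalently (3.11) in the scalar case) and the regularity condition can be verified by Monte Carlo for the DC level example, where ∂ln p(x; A)/∂A = (1/σ²) Σ (x[n] - A). A sketch follows (Python/NumPy; parameter values are arbitrary choices, not from the text).

```python
import numpy as np

rng = np.random.default_rng(3)
A, sigma2, N = 1.0, 1.0, 5        # assumed values for the check
trials = 500_000

x = rng.normal(A, np.sqrt(sigma2), size=(trials, N))

# Score: d ln p(x; A)/dA = (1/sigma^2) * sum(x[n] - A), evaluated at the true A.
score = (x - A).sum(axis=1) / sigma2

# Regularity condition: E[score] = 0.
score_mean = score.mean()

# Identity (3.23)/(3.11): E[score^2] equals -E[d^2 ln p/dA^2] = N/sigma^2.
fisher_mc = np.mean(score**2)
fisher_exact = N / sigma2
```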
.50 51
CHAPTER 3. CRAMER-RAO LOWER BOUND 3.1 O. ASYMPTOTIC CRLB
where 1 = [1 1 ... 1]ᵀ. Using Woodbury's identity (see Appendix 1), the required inverse is evaluated, and substituting it in (3.32) produces the CRLB.

In the next example we wish to determine the CRLB for fc assuming that Q(f) and σ² are known. We view the process as consisting of a random signal embedded in WGN. The center frequency of the signal PSD is to be estimated. The real function Q(f) and the signal PSD are shown in Figure 3.6. Note that the possible center frequencies are constrained to be in the interval [f1, 1/2 − f2], or

fc,min = f1,  fc,max = 1/2 − f2.

For these center frequencies the signal PSD for f ≥ 0 will be contained in the [0, 1/2] interval. Then, since θ = fc is a scalar,
3.11 Signal Processing Examples
where f2 ≤ 1/2, so that Q(f) is bandlimited as shown in Figure 3.6. Then, if Q(f) ≫ σ², we have approximately

∂ln Pxx(f; fc)/∂fc = ∂ln[Q(f − fc) + Q(−f − fc) + σ²]/∂fc
                   = [∂Q(f − fc)/∂fc + ∂Q(−f − fc)/∂fc] / [Q(f − fc) + Q(−f − fc) + σ²].

This is an odd function of f, so that only the band [0, 1/2] need be considered in evaluating the bound. Narrower bandwidth (smaller f2) spectra yield lower bounds for the center frequency since the PSD changes more rapidly as fc changes. See also Problem 3.16 for another example. ◊
The noise w(t) is bandlimited with PSD Pww(F) = N0/2 for |F| ≤ B, so that its ACF is

rww(τ) = N0 B sin(2πτB)/(2πτB)

as shown in Figure 3.7, and samples taken at the spacing Δ = 1/(2B) are uncorrelated. The data are

x[n] = w[n],               0 ≤ n ≤ n0 − 1
x[n] = s(nΔ − τ0) + w[n],  n0 ≤ n ≤ n0 + M − 1     (3.36)
x[n] = w[n],               n0 + M ≤ n ≤ N − 1

where M is the length of the sampled signal and n0 = τ0/Δ is the delay in samples. (For simplicity we assume that Δ is so small that τ0/Δ can be approximated by an integer.) With this formulation we can apply (3.14) in evaluating the CRLB, which involves

Σ from n = n0 to n0 + M − 1 of (ds(t)/dt at t = nΔ − τ0)²

since τ0 = n0Δ. Assuming that Δ is small enough to approximate the sum by an integral, the bound becomes

var(τ̂0) ≥ 1 / [ (ε/(N0/2)) F̄² ]     (3.38)

where ε is the signal energy and

F̄² = ∫ from 0 to Ts of (ds(t)/dt)² dt / ∫ from 0 to Ts of s²(t) dt.

It can be shown that ε/(N0/2) is an SNR [Van Trees 1968]. Also, F̄² is a measure of the bandwidth of the signal since, using standard Fourier transform properties,

F̄² = ∫ (2πF)² |S(F)|² dF / ∫ |S(F)|² dF     (3.39)

with both integrals over (−∞, ∞), where F denotes continuous-time frequency and S(F) is the Fourier transform of s(t). In this form it becomes clear that F̄² is the mean square bandwidth of the signal. From (3.38) and (3.39), the larger the mean square bandwidth, the lower the CRLB. For instance, assume that the signal is a Gaussian pulse given by s(t) = exp(−(1/2)σF²(t − Ts/2)²) and that s(t) is essentially nonzero over the interval [0, Ts]. Then |S(F)|² is also a Gaussian function, whose mean square bandwidth increases with σF.
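The mean square bandwidth of (3.39) is easy to evaluate numerically. A minimal sketch (my own illustration with an assumed Gaussian pulse width, not the text's numbers): for s(t) = exp(−t²/(2σt²)) the energy spectrum is |S(F)|² proportional to exp(−(2πF)²σt²), for which F̄² = 1/(2σt²), so shorter pulses have larger mean square bandwidth and hence a lower time-delay CRLB.

```python
import numpy as np

# Numeric evaluation of F2bar = int((2*pi*F)^2 |S(F)|^2 dF) / int(|S(F)|^2 dF)
# for a Gaussian pulse of (assumed) width sigma_t; compare to 1/(2*sigma_t^2).
sigma_t = 1e-3                                   # pulse width in seconds (assumed)
F = np.linspace(-2000.0, 2000.0, 200001)         # frequency grid (Hz)
S2 = np.exp(-((2*np.pi*F)**2) * sigma_t**2)      # energy spectrum |S(F)|^2
F2bar = np.sum((2*np.pi*F)**2 * S2) / np.sum(S2) # grid spacing cancels in the ratio
print(F2bar, 1/(2*sigma_t**2))                   # numeric vs. closed form
```

The grid extends to many standard deviations of |S(F)|², so truncation error is negligible.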
Example 3.14 - Sinusoidal Parameter Estimation

In many fields we are confronted with the problem of estimating the parameters of a sinusoidal signal. Economic data which are cyclical in nature may naturally fit such a model, while in sonar and radar physical mechanisms cause the observed signal to be sinusoidal. Hence, we examine the determination of the CRLB for the amplitude A, frequency f0, and phase φ of a sinusoid embedded in WGN. This example generalizes Examples 3.4 and 3.5. The data are assumed to be

x[n] = A cos(2πf0 n + φ) + w[n],  n = 0, 1, ..., N − 1

where A > 0 and 0 < f0 < 1/2 (otherwise the parameters are not identifiable, as is verified by considering A = 1, φ = 0 versus A = −1, φ = π, or f0 = 0 with A = 1/2, φ = 0 versus A = 1/√2, φ = π/4). Since multiple parameters are unknown, we use (3.33):

[I(θ)]ij = (1/σ²) Σ from n = 0 to N − 1 of (∂s[n;θ]/∂θi)(∂s[n;θ]/∂θj)

where θ = [A f0 φ]ᵀ. In evaluating the sums, approximations of the form (1/N) Σ nⁱ cos(4πf0n + 2φ) ≈ 0 hold for i = 0, 1, 2 (see Problem 3.7). Using these approximations and letting α = 2πf0n + φ, we have

[I(θ)]11 = (1/σ²) Σ cos²α = (1/σ²) Σ (1/2 + (1/2)cos 2α) ≈ N/(2σ²)

[I(θ)]12 = −(1/σ²) Σ A 2πn cos α sin α = −(Aπ/σ²) Σ n sin 2α ≈ 0

and similarly for the remaining entries. Using (3.22), we have upon inversion

var(Â) ≥ 2σ²/N
var(f̂0) ≥ 12 / ((2π)² η N(N² − 1))
var(φ̂) ≥ 2(2N − 1) / (η N(N + 1))     (3.41)

where η = A²/(2σ²) is the SNR. ◊
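The exact Fisher matrix can be compared against the widely quoted large-N frequency bound var(f̂0) ≳ 12/((2π)²ηN(N²−1)), with η = A²/(2σ²). The sketch below is my own numeric check with assumed values of A, f0, φ, and σ² (not from the text):

```python
import numpy as np

# Exact Fisher matrix for x[n] = A*cos(2*pi*f0*n + phi) + w[n] via (3.33),
# compared with the large-N approximation of the frequency CRLB.
N, A, f0, phi, sigma2 = 100, 1.0, 0.12, 0.3, 0.25   # assumed values
n = np.arange(N)
alpha = 2*np.pi*f0*n + phi
# rows of J: partials of s[n; theta] w.r.t. A, f0, phi
J = np.vstack([np.cos(alpha),
               -A*2*np.pi*n*np.sin(alpha),
               -A*np.sin(alpha)])
I = J @ J.T / sigma2                      # exact Fisher information
crlb_f0 = np.linalg.inv(I)[1, 1]          # exact CRLB for the frequency
eta = A**2 / (2*sigma2)                   # SNR
approx = 12 / ((2*np.pi)**2 * eta * N * (N**2 - 1))
print(crlb_f0, approx)                    # close for f0 away from 0 and 1/2
```

The agreement degrades as f0 approaches 0 or 1/2, where the neglected oscillatory sums are no longer small.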
Example 3.15 - Bearing Estimation

In sonar it is of interest to estimate the bearing to a target. To do so the acoustic pressure field is observed by an array of sensors.
As shown in Figure 3.8, a line array of M sensors with spacing d observes planar wavefronts arriving at bearing β. Assuming that the target radiates a sinusoidal signal A cos(2πF0t + φ), the received signal at the nth sensor is A cos(2πF0(t − tn) + φ), where tn is the propagation time to the nth sensor. If the array is located far from the target, then the circular wavefronts can be considered to be planar at the array. The wavefront at the (n + 1)st sensor lags that at the nth sensor by d cos β/c due to the extra propagation distance. Thus, the propagation time to the nth sensor is

tn = t0 − (nd/c) cos β,  n = 0, 1, ..., M − 1

where t0 is the propagation time to the zeroth sensor and c is the propagation speed, and the observed signal at the nth sensor is

sn(t) = A cos[2πF0(t − t0 + (nd/c) cos β) + φ].

If a single "snapshot" of data is taken or the array element outputs are sampled at a given time ts, then

sn(ts) = A cos[2π(F0 (d/c) cos β)n + φ']     (3.42)

where φ' = φ + 2πF0(ts − t0). In this form it becomes clear that the spatial observations are sinusoidal with frequency fs = F0(d/c) cos β. To complete the description of the data we assume that the sensor outputs are corrupted by Gaussian noise with zero mean and variance σ² which is independent from sensor to sensor. The data are modeled as

x[n] = sn(ts) + w[n],  n = 0, 1, ..., M − 1

where w[n] is WGN. Since typically A and φ' are unknown, as well as β, we have the problem of estimating {A, fs, φ'} based on (3.42) as in Example 3.14. Once the CRLB for these parameters is determined, we can use the transformation of parameters formula. The transformation is, for θ = [A fs φ']ᵀ,

α = g(θ) = [A  arccos(cfs/(F0d))  φ']ᵀ

so that β is the second element of α. The Jacobian is diagonal, with

[∂g(θ)/∂θ]22 = ∂β/∂fs = −c/(F0 d sin β)

so that from (3.30)

[Cα̂ − (∂g(θ)/∂θ) I⁻¹(θ) (∂g(θ)/∂θ)ᵀ]22 ≥ 0.

Because of the diagonal Jacobian this yields

var(β̂) ≥ [∂g(θ)/∂θ]²22 [I⁻¹(θ)]22.

But from (3.41) we have

[I⁻¹(θ)]22 = 12 / ((2π)² η M(M² − 1))

and therefore

var(β̂) ≥ (12 / ((2π)² η M(M² − 1))) · c² / (F0² d² sin² β)

or finally, with λ = c/F0 the wavelength and L = (M − 1)d the array length,

var(β̂) ≥ 12 / [ (2π)² M η ((M + 1)/(M − 1)) (L/λ)² sin² β ].     (3.43)

◊
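The two algebraic forms of the bearing bound, the chain-rule form through [I⁻¹(θ)]22 and the final form in terms of array length and wavelength, can be checked against each other. A sketch with my own assumed sonar parameters (not from the text):

```python
import numpy as np

# Bearing CRLB evaluated two ways: directly from the Jacobian and [I^{-1}]_22,
# and from the closed form in terms of L = (M-1)*d and lam = c/F0.
M, eta = 20, 10.0                  # sensors and SNR A^2/(2*sigma^2), assumed
c, F0, d = 1500.0, 1000.0, 0.5     # sound speed (m/s), tone (Hz), spacing (m)
beta = np.deg2rad(60.0)
lam, L = c / F0, (M - 1) * d
i22 = 12 / ((2*np.pi)**2 * eta * M * (M**2 - 1))        # spatial-frequency CRLB
direct = i22 * c**2 / (F0**2 * d**2 * np.sin(beta)**2)  # chain rule through arccos
final = 12 / ((2*np.pi)**2 * M * eta * ((M + 1)/(M - 1))
              * (L/lam)**2 * np.sin(beta)**2)
print(direct, final)   # identical: two forms of the same bound
```

Evaluating the bound over a range of β also shows the familiar blow-up near endfire (β near 0 or π), where sin β is small.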
Example 3.16 - Autoregressive Parameter Estimation

In speech processing an important model for speech production is the autoregressive (AR) process. As shown in Figure 3.9, the data are modeled as the output of a causal all-pole discrete filter excited at the input by WGN u[n]. The excitation noise u[n] is an inherent part of the model, necessary to ensure that x[n] is a WSS random process. The all-pole filter acts to model the vocal tract, while the excitation noise models the forcing of air through a constriction in the throat necessary to produce an unvoiced sound such as an "s." The effect of the filter is to color the white noise so as to model PSDs with several resonances. This model is also referred to as a linear predictive coding (LPC) model [Makhoul 1975]. Since the AR model is capable of producing a variety of PSDs, depending on the choice of the AR filter parameters {a[1], a[2], ..., a[p]} and
excitation white noise variance σu², it has also been successfully used for high-resolution spectral estimation. Based on observed data {x[0], x[1], ..., x[N − 1]}, the parameters are estimated (see Example 7.18), and hence the PSD is estimated by substituting the estimates into the model PSD [Kay 1988].

Figure 3.9 depicts the autoregressive model: the excitation u[n] drives the all-pole filter 1/A(z) to produce x[n], where

A(z) = 1 + Σ from m = 1 to p of a[m] z^(−m).

The derivation of the exact CRLB proves to be a difficult task. The interested reader may consult [Box and Jenkins 1970, Porat and Friedlander 1987] for details. In practice the asymptotic CRLB given by (3.34) is quite accurate, even for short data records of about N = 100 points, if the poles are not too close to the unit circle in the z plane. Therefore, we now determine the asymptotic CRLB. The PSD implied by the AR model is

Pxx(f; θ) = σu² / |A(f)|²

where θ = [a[1] a[2] ... a[p] σu²]ᵀ and A(f) = 1 + Σ from m = 1 to p of a[m] exp(−j2πfm). The partial derivatives are

∂ln Pxx(f; θ)/∂a[k] = −∂ln|A(f)|²/∂a[k] = −[A(f) exp(j2πfk) + A*(f) exp(−j2πfk)] / |A(f)|²
∂ln Pxx(f; θ)/∂σu² = 1/σu².

For k = 1, 2, ..., p; l = 1, 2, ..., p we have from (3.34)

[I(θ)]kl = (N/2) ∫ (1/|A(f)|⁴) [A(f)exp(j2πfk) + A*(f)exp(−j2πfk)] [A(f)exp(j2πfl) + A*(f)exp(−j2πfl)] df

= (N/2) ∫ [ (1/A*(f)²) exp[j2πf(k + l)] + (1/|A(f)|²) exp[j2πf(k − l)]
           + (1/|A(f)|²) exp[j2πf(l − k)] + (1/A(f)²) exp[−j2πf(k + l)] ] df

with all integrals over [−1/2, 1/2]. Noting that

∫ (1/A(f)²) exp[−j2πf(k + l)] df = ∫ (1/A*(f)²) exp[j2πf(k + l)] df
∫ (1/|A(f)|²) exp[j2πf(l − k)] df = ∫ (1/|A(f)|²) exp[j2πf(k − l)] df

which follows from the Hermitian property of the integrand (due to A(−f) = A*(f)), we have

[I(θ)]kl = N ∫ (1/|A(f)|²) exp[j2πf(k − l)] df + N ∫ (1/A*(f)²) exp[j2πf(k + l)] df.

The second integral is the inverse Fourier transform of 1/A*(f)² evaluated at n = k + l > 0. This term is zero since the sequence is the convolution of two anticausal sequences, that is, if

F⁻¹{1/A(f)} = h[n]  with  h[n] = 0 for n < 0

then

F⁻¹{1/A*(f)²} = h[−n] ∗ h[−n] = 0  for n > 0.
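The anticausality argument is easy to confirm numerically. The following sketch (my own check, with assumed AR(2) coefficients whose poles lie inside the unit circle) evaluates the second integral by a Riemann sum and shows it vanishes for n > 0:

```python
import numpy as np

# The integral of exp(j*2*pi*f*n) / conj(A(f))^2 over one period is the
# coefficient of an anticausal sequence, hence zero for n > 0.
a = np.array([1.0, -1.2, 0.72])            # A(z) = 1 - 1.2 z^-1 + 0.72 z^-2 (stable)
M = 4096
f = (np.arange(M) + 0.5) / M - 0.5         # midpoint grid on [-1/2, 1/2)
A = a[0] + a[1]*np.exp(-2j*np.pi*f) + a[2]*np.exp(-4j*np.pi*f)
vals = []
for n in (1, 2, 5):
    val = np.mean(np.exp(2j*np.pi*f*n) / np.conj(A)**2)  # Riemann sum of the integral
    vals.append(abs(val))
    print(n, abs(val))                                   # essentially zero
```

Replacing the exponent sign (n < 0) picks out the nonzero anticausal coefficients instead, which is a quick way to see that the sign conventions above are consistent.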
For k = 1, 2, ..., p and l = p + 1,

[I(θ)]k,p+1 = −(N/2) ∫ (1/σu²)(1/|A(f)|²) [A(f) exp(j2πfk) + A*(f) exp(−j2πfk)] df
            = −(N/σu²) ∫ (1/A*(f)) exp(j2πfk) df
            = 0

where again we have used the Hermitian property of the integrand and the anticausality of F⁻¹{1/A*(f)}. Finally, for k = p + 1 and l = p + 1,

[I(θ)]kl = (N/2) ∫ (1/σu⁴) df = N/(2σu⁴)

so that

I(θ) = [ (N/σu²) Rxx    0
         0ᵀ             N/(2σu⁴) ]     (3.44)

References

Box, G.E.P., G.M. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco, 1970.
Brockwell, P.J., R.A. Davis, Time Series: Theory and Methods, Springer-Verlag, New York, 1987.
Kay, S.M., Modern Spectral Estimation: Theory and Application, Prentice-Hall, Englewood Cliffs, N.J., 1988.
Kendall, Sir M., A. Stuart, The Advanced Theory of Statistics, Vol. 2, Macmillan, New York, 1979.
Makhoul, J., "Linear Prediction: A Tutorial Review," IEEE Proc., Vol. 63, pp. 561-580, April 1975.
McAulay, R.J., E.M. Hofstetter, "Barankin Bounds on Parameter Estimation," IEEE Trans. Inform. Theory, Vol. 17, pp. 669-676, Nov. 1971.
Porat, B., B. Friedlander, "Computation of the Exact Information Matrix of Gaussian Time Series with Stationary Random Components," IEEE Trans. Acoust., Speech, Signal Process., Vol. 34, pp. 118-130, Feb. 1986.
Porat, B., B. Friedlander, "The Exact Cramer-Rao Bound for Gaussian Autoregressive Processes," IEEE Trans. Aerosp. Electron. Syst., Vol. 23, pp. 537-541, July 1987.
Seidman, L.P., "Performance Limitations and Error Calculations for Parameter Estimation," Proc. IEEE, Vol. 58, pp. 644-652, May 1970.
Stoica, P., R.L. Moses, B. Friedlander, T. Soderstrom, "Maximum Likelihood Estimation of the Parameters of Multiple Sinusoids from Noisy Measurements," IEEE Trans. Acoust., Speech, Signal Process., Vol. 37, pp. 378-392, March 1989.
Van Trees, H.L., Detection, Estimation, and Modulation Theory, Part I, J. Wiley, New York, 1968.
Ziv, J., M. Zakai, "Some Lower Bounds on Signal Parameter Estimation," IEEE Trans. Inform. Theory, Vol. 15, pp. 386-391, May 1969.
where [Rxx]ij = rxx[i − j] is a p × p Toeplitz autocorrelation matrix and 0 is a p × 1 vector of zeros. Upon inverting the Fisher information matrix we have that

var(â[k]) ≥ (σu²/N) [Rxx⁻¹]kk,  k = 1, 2, ..., p
var(σ̂u²) ≥ 2σu⁴/N.     (3.45)

As an illustration, if p = 1,

var(â[1]) ≥ σu² / (N rxx[0]).

But

rxx[0] = σu² / (1 − a²[1])

so that

var(â[1]) ≥ (1 − a²[1]) / N

indicating that it is easier to estimate the filter parameter when |a[1]| is closer to one than to zero. Since the pole of the filter is at −a[1], this means that the filter parameters of processes with PSDs having sharp peaks are more easily estimated (see also Problem 3.20).
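A Monte Carlo sketch (my own illustration, with assumed values of N and the pole locations) confirms this behavior: the least squares estimate of a[1] has variance close to (1 − a[1]²)/N, smaller when the pole is near the unit circle.

```python
import numpy as np

# Simulate AR(1): x[n] + a1*x[n-1] = u[n], estimate a1 by least squares,
# and compare the empirical variance with the asymptotic bound (1 - a1^2)/N.
rng = np.random.default_rng(1)
N, trials, sigma_u = 500, 1000, 1.0
results = {}
for a1 in (-0.3, -0.9):
    ests = []
    for _ in range(trials):
        u = rng.normal(scale=sigma_u, size=N)
        x = np.empty(N)
        x[0] = u[0]
        for n in range(1, N):
            x[n] = -a1 * x[n - 1] + u[n]          # AR(1) recursion
        ests.append(-np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1]))
    results[a1] = np.var(ests)
    print(a1, results[a1], (1 - a1**2) / N)        # empirical vs. bound
```

Least squares here stands in for the estimator of Example 7.18; it is approximately efficient for large N.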
Problems

3.1 If x[n] for n = 0, 1, ..., N − 1 are IID according to U[0, θ], show that the regularity condition does not hold or that

E[∂ln p(x; θ)/∂θ] ≠ 0

for all θ > 0. Hence, the CRLB cannot be applied to this problem.

3.2 In Example 3.1 assume that w[0] has the PDF p(w[0]) which can now be arbitrary. Show that the CRLB for A is

var(Â) ≥ [ ∫ (dp(u)/du)² / p(u) du ]⁻¹

with the integral over (−∞, ∞). Evaluate this for the Laplacian PDF

p(w[0]) = (1/(√2 σ)) exp(−√2 |w[0]|/σ)

and compare the result to the Gaussian case.

3.3 The data x[n] = Arⁿ + w[n] for n = 0, 1, ..., N − 1 are observed, where w[n] is WGN with variance σ² and r > 0 is known. Find the CRLB for A. Show that an efficient estimator exists and find its variance. What happens to the variance as N → ∞ for various values of r?

3.4 If x[n] = rⁿ + w[n] for n = 0, 1, ..., N − 1 are observed, where w[n] is WGN with variance σ² and r is to be estimated, find the CRLB. Does an efficient estimator exist and if so find its variance?

3.5 If x[n] = A + w[n] for n = 0, 1, ..., N − 1 are observed and w = [w[0] w[1] ... w[N − 1]]ᵀ ~ N(0, C), find the CRLB for A. Does an efficient estimator exist and if so what is its variance?

3.6 For Example 2.3 compute the CRLB. Does it agree with the results given?

3.7 Prove that in Example 3.4

(1/N) Σ from n = 0 to N − 1 of cos(4πf0 n + 2φ) ≈ 0.

What conditions on f0 are required for this to hold? Hint: Note that

Σ cos(αn + β) = Re( Σ exp[j(αn + β)] )

and use the geometric progression sum formula.

3.8 Repeat the computation of the CRLB for Example 3.3 by using the alternative expression (3.12).

3.9 We observe two samples of a DC level in correlated Gaussian noise

x[0] = A + w[0]
x[1] = A + w[1]

where w = [w[0] w[1]]ᵀ is zero mean with covariance matrix

C = σ² [ 1  ρ
         ρ  1 ].

The parameter ρ is the correlation coefficient between w[0] and w[1]. Compute the CRLB for A and compare it to the case when w[n] is WGN or ρ = 0. Also, explain what happens when ρ → 1. Finally, comment on the additivity property of the Fisher information for nonindependent observations.

3.10 By using (3.23) prove that the Fisher information matrix is positive semidefinite for all θ. In practice, we assume it to be positive definite and hence invertible, although this is not always the case. Consider the data model in Problem 3.3 with the modification that r is unknown. Find the Fisher information matrix for θ = [A r]ᵀ. Are there any values of θ for which I(θ) is not positive definite?

3.11 For a 2 × 2 Fisher information matrix

I(θ) = [ a  b
         b  c ]

which is positive definite, show that

[I⁻¹(θ)]11 = c/(ac − b²) ≥ 1/a = 1/[I(θ)]11.

What does this say about estimating a parameter when a second parameter is either known or unknown? When does equality hold and why?

3.12 Prove that

[I⁻¹(θ)]ii ≥ 1/[I(θ)]ii.

This generalizes the result of Problem 3.11. Additionally, it provides another lower bound on the variance, although it is usually not attainable. Under what conditions will the new bound be achieved? Hint: Apply the Cauchy-Schwarz inequality to eiᵀ √I(θ) √I⁻¹(θ) ei, where ei is the vector of all zeros except for a 1 as the ith element. The square root of a positive definite matrix A is defined to be the matrix with the same eigenvectors as A but whose eigenvalues are the square roots of those of A.

3.13 Consider a generalization of the line fitting problem as described in Example 3.7, termed polynomial or curve fitting. The data model is

x[n] = Σ from k = 0 to p − 1 of Ak nᵏ + w[n]

for n = 0, 1, ..., N − 1. As before, w[n] is WGN with variance σ². It is desired to estimate {A0, A1, ..., Ap−1}. Find the Fisher information matrix for this problem.

3.14 For the data model in Example 3.11 consider the estimator given by the sample mean Â. Assume we observe a given data set in which the realization of the random variable A is the value A0. Show that Â → A0 as N → ∞ by verifying that

E(Â | A = A0) = A0
var(Â | A = A0) = σ²/N.
∫ from −1/2 to 1/2 of Q(f) df = 1

and Q(f) is known. If N observations are available, find the CRLB for the total power using the exact form (3.32) as well as the asymptotic approximation (3.34) and compare.

Appendix 3A

Derivation of Scalar Parameter CRLB

We consider estimators α̂ that are unbiased, so that

E(α̂) = α = g(θ)

or

∫ α̂ p(x; θ) dx = g(θ).     (3A.1)

Differentiating with respect to θ produces

∫ α̂ (∂p(x; θ)/∂θ) dx = ∂g(θ)/∂θ
or

∫ α̂ (∂ln p(x; θ)/∂θ) p(x; θ) dx = ∂g(θ)/∂θ.

We can modify this using the regularity condition to produce

∫ (α̂ − α)(∂ln p(x; θ)/∂θ) p(x; θ) dx = ∂g(θ)/∂θ

since

∫ α (∂ln p(x; θ)/∂θ) p(x; θ) dx = α E[∂ln p(x; θ)/∂θ] = 0.

Note also that differentiating the regularity condition

∫ (∂ln p(x; θ)/∂θ) p(x; θ) dx = 0

once more gives

∫ [ (∂²ln p(x; θ)/∂θ²) p(x; θ) + (∂ln p(x; θ)/∂θ)(∂p(x; θ)/∂θ) ] dx = 0

so that

−E[∂²ln p(x; θ)/∂θ²] = ∫ (∂ln p(x; θ)/∂θ)² p(x; θ) dx = E[(∂ln p(x; θ)/∂θ)²].

We now apply the Cauchy-Schwarz inequality to these relations to obtain the scalar bound.
Appendix 3B

Derivation of Vector Parameter CRLB

In this appendix the CRLB for a vector parameter α = g(θ) is derived. The PDF is characterized by θ. We consider unbiased estimators such that

E(α̂i) = αi = [g(θ)]i,  i = 1, 2, ..., r.

The regularity conditions are

E[∂ln p(x; θ)/∂θ] = 0

and the Fisher information matrix is

[I(θ)]ij = E[ (∂ln p(x; θ)/∂θi)(∂ln p(x; θ)/∂θj) ] = −E[ ∂²ln p(x; θ)/∂θi ∂θj ].

Consider

∫ (α̂i − αi)(∂ln p(x; θ)/∂θj) p(x; θ) dx = ∫ (α̂i − αi)(∂p(x; θ)/∂θj) dx
= (∂/∂θj) ∫ α̂i p(x; θ) dx − αi E[∂ln p(x; θ)/∂θj]
= ∂[g(θ)]i/∂θj.     (3B.2)

Combining (3B.1) and (3B.2) into matrix form, we have

∫ (α̂ − α)(∂ln p(x; θ)/∂θ)ᵀ p(x; θ) dx = ∂g(θ)/∂θ.

Now let

w(x) = p(x; θ)
g(x) = aᵀ(α̂ − α)
h(x) = (∂ln p(x; θ)/∂θ)ᵀ b

and apply the Cauchy-Schwarz inequality of (3A.5):

( aᵀ (∂g(θ)/∂θ) b )² ≤ [ ∫ aᵀ(α̂ − α)(α̂ − α)ᵀ a p(x; θ) dx ] [ ∫ bᵀ (∂ln p(x; θ)/∂θ)(∂ln p(x; θ)/∂θ)ᵀ b p(x; θ) dx ]
                     = aᵀ Cα̂ a · bᵀ I(θ) b

since as in the scalar case the two integrals reduce to the covariance and the Fisher information. Choosing

b = I⁻¹(θ) (∂g(θ)/∂θ)ᵀ a

produces

aᵀ ( Cα̂ − (∂g(θ)/∂θ) I⁻¹(θ) (∂g(θ)/∂θ)ᵀ ) a ≥ 0.

Since I(θ) is positive definite, so is I⁻¹(θ), and (∂g(θ)/∂θ) I⁻¹(θ) (∂g(θ)/∂θ)ᵀ is at least positive semidefinite. Recall that a was arbitrary, so that (3.30) follows. If α = g(θ) = θ, then ∂g(θ)/∂θ = I and (3.24) follows. The conditions for equality are g(x) = c h(x), where c is a constant not dependent on x. This condition becomes

aᵀ(α̂ − α) = c (∂ln p(x; θ)/∂θ)ᵀ I⁻¹(θ) (∂g(θ)/∂θ)ᵀ a.
In the equality case with α = θ this condition reads ∂ln p(x; θ)/∂θ = I(θ)(θ̂ − θ)/c(θ), and differentiating once more

∂²ln p(x; θ)/∂θi ∂θj = Σ from k = 1 to p of [ ∂([I(θ)]ik/c(θ))/∂θj ] (θ̂k − θk) − [I(θ)]ij/c(θ).

Finally, we have

[I(θ)]ij = −E[ ∂²ln p(x; θ)/∂θi ∂θj ] = [I(θ)]ij / c(θ)

since E(θ̂k) = θk. Clearly, c(θ) = 1 and the condition for equality follows.

Appendix 3C

Derivation of General Gaussian CRLB

Assume that x ~ N(μ(θ), C(θ)), where μ(θ) is the N × 1 mean vector and C(θ) is the N × N covariance matrix, both of which depend on θ. Then the PDF is

p(x; θ) = 1/((2π)^(N/2) det^(1/2)[C(θ)]) exp[ −(1/2)(x − μ(θ))ᵀ C⁻¹(θ)(x − μ(θ)) ].

We will make use of the following identities

∂ln det[C(θ)]/∂θk = tr( C⁻¹(θ) ∂C(θ)/∂θk )     (3C.1)

where ∂C(θ)/∂θk is the N × N matrix with [i, j] element ∂[C(θ)]ij/∂θk, and

∂C⁻¹(θ)/∂θk = −C⁻¹(θ) (∂C(θ)/∂θk) C⁻¹(θ).     (3C.2)

To establish (3C.1) we first note

∂ln det[C(θ)]/∂θk = (1/det[C(θ)]) ∂det[C(θ)]/∂θk.     (3C.3)

Since det[C(θ)] depends on all the elements of C(θ),

∂det[C(θ)]/∂θk = Σi Σj (∂det[C(θ)]/∂[C(θ)]ij)(∂[C(θ)]ij/∂θk) = tr( (∂det[C(θ)]/∂C(θ)) ∂Cᵀ(θ)/∂θk )     (3C.4)

where ∂det[C(θ)]/∂C(θ) is an N × N matrix with [i, j] element ∂det[C(θ)]/∂[C(θ)]ij and the identity

tr(ABᵀ) = Σi Σj [A]ij [B]ij
has been used. Now by the definition of the determinant

det[C(θ)] = Σ from i = 1 to N of [C(θ)]ij [M]ij

where M is the N × N cofactor matrix and j can take on any value from 1 to N. Thus,

∂det[C(θ)]/∂[C(θ)]ij = [M]ij

or

∂det[C(θ)]/∂C(θ) = M.

It is well known, however, that

C⁻¹(θ) = Mᵀ / det[C(θ)]

so that, C(θ) being symmetric,

∂det[C(θ)]/∂C(θ) = C⁻¹(θ) det[C(θ)].

Substituting into (3C.4) and (3C.3) yields

∂ln det[C(θ)]/∂θk = tr( C⁻¹(θ) ∂C(θ)/∂θk ).     (3C.5)

The second identity (3C.2) is easily established as follows. Consider C⁻¹(θ)C(θ) = I; differentiating with respect to θk gives

(∂C⁻¹(θ)/∂θk) C(θ) + C⁻¹(θ) (∂C(θ)/∂θk) = 0

from which (3C.2) follows. For the derivative of the quadratic form we have

∂/∂θk [ (x − μ(θ))ᵀ C⁻¹(θ)(x − μ(θ)) ] = −2 (∂μ(θ)/∂θk)ᵀ C⁻¹(θ)(x − μ(θ)) + (x − μ(θ))ᵀ (∂C⁻¹(θ)/∂θk)(x − μ(θ)).

Let y = x − μ(θ).
where we note that all odd order moments of y are zero. Continuing, we have

[I(θ)]kl = (∂μ(θ)/∂θk)ᵀ C⁻¹(θ)(∂μ(θ)/∂θl) − (1/4) tr( C⁻¹(θ) ∂C(θ)/∂θk ) tr( C⁻¹(θ) ∂C(θ)/∂θl )
         + (1/4) E[ yᵀ (∂C⁻¹(θ)/∂θk) y · yᵀ (∂C⁻¹(θ)/∂θl) y ]     (3C.6)

where (3C.1) and (3C.2) have been used. To evaluate the last term we use [Porat and Friedlander 1986]

E(yᵀAy yᵀBy) = tr(AC) tr(BC) + 2 tr(ACBC)     (3C.7)

for y an N × 1 zero mean Gaussian vector with covariance C. Next, using the relationship (3C.2), this term becomes

(1/4) [ tr( C⁻¹(θ) ∂C(θ)/∂θk ) tr( C⁻¹(θ) ∂C(θ)/∂θl ) + 2 tr( C⁻¹(θ)(∂C(θ)/∂θk) C⁻¹(θ)(∂C(θ)/∂θl) ) ]

and finally, using (3C.7) in (3C.6) produces the desired result.

Appendix 3D

Derivation of Asymptotic CRLB

The process x[n] is represented as the output of a causal filter driven by white noise. With this representation the PSD of x[n] is

Pxx(f) = |H(f)|² σu²

where σu² is the variance of u[n] and H(f) = Σ from k = 0 to ∞ of h[k] exp(−j2πfk) is the filter frequency response. If the observations are {x[0], x[1], ..., x[N − 1]} and N is large, then the representation

x[n] = Σ from k = 0 to n of h[k] u[n − k] + Σ from k = n + 1 to ∞ of h[k] u[n − k]

is approximated by retaining only the first sum.     (3D.2)
the correlation time or effective duration of rxx[k] is the same as the impulse response length. Hence, because the CRLB to be derived is based on (3D.2), the asymptotic CRLB will be a good approximation if the data record length is much greater than the correlation time.

To find the PDF of x we use (3D.2), which is a transformation from u = [u[0] u[1] ... u[N − 1]]ᵀ to x = [x[0] x[1] ... x[N − 1]]ᵀ, or x = Hu, where

H = [ h[0]     0        0        ...  0
      h[1]     h[0]     0        ...  0
      ...
      h[N−1]   h[N−2]   h[N−3]   ...  h[0] ]

is lower triangular with h[0] = 1, so that the Jacobian of the transformation is unity and

p(x; θ) = (2πσu²)^(−N/2) exp( −(1/(2σu²)) uᵀu ).     (3D.3)

From (3D.2) we have approximately

X(f) = H(f) U(f)

where

X(f) = Σ from n = 0 to N − 1 of x[n] exp(−j2πfn)
U(f) = Σ from n = 0 to N − 1 of u[n] exp(−j2πfn)

are the Fourier transforms of the truncated sequences. By Parseval's theorem

(1/σu²) uᵀu = (1/σu²) ∫ |U(f)|² df ≈ ∫ |X(f)|² / (σu² |H(f)|²) df = ∫ |X(f)|² / Pxx(f) df     (3D.4)

with all integrals over [−1/2, 1/2]. Also,

ln σu² = ∫ ln σu² df = ∫ ln Pxx(f) df − ∫ ln |H(f)|² df.

But

∫ ln |H(f)|² df = ∫ [ln H(f) + ln H*(f)] df = 2 Re ∫ ln H(f) df = 2 Re ∮ over C of ln H(z) dz/(2πjz) = 2 Re [ Z⁻¹{ln H(z)} at n = 0 ]

where C is the unit circle in the z plane. Since H(z) corresponds to the system function of a causal filter, it converges outside a circle of radius r < 1 (since H(z) is assumed to exist on the unit circle for the frequency response to exist). Hence, ln H(z) also converges outside a circle of radius r < 1 so that the corresponding sequence is causal. By the initial value theorem, which is valid for a causal sequence,

Z⁻¹{ln H(z)} at n = 0 = lim as z → ∞ of ln H(z) = ln lim as z → ∞ of H(z) = ln h[0] = 0.

Therefore,

∫ from −1/2 to 1/2 of ln |H(f)|² df = 0.
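The zero-integral property of the log spectrum is easy to confirm numerically. A sketch (my own check, with an assumed causal, stable, monic filter with poles and zeros inside the unit circle):

```python
import numpy as np

# For H(z) = (1 - 0.5 z^-1)/(1 - 0.8 z^-1 + 0.3 z^-2), a minimum-phase filter
# with h[0] = 1, the integral of ln|H(f)|^2 over one period should be zero.
M = 8192
f = (np.arange(M) + 0.5) / M - 0.5                # midpoint grid on [-1/2, 1/2)
z = np.exp(2j * np.pi * f)
H = (1 - 0.5 / z) / (1 - 0.8 / z + 0.3 / z**2)
integral = np.mean(np.log(np.abs(H)**2))          # Riemann sum of ln|H(f)|^2
print(integral)                                   # essentially zero
```

Scaling the filter so that h[0] differs from 1 shifts the integral to 2 ln h[0], in line with the initial value theorem argument above.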
Hence, the asymptotic log PDF is

ln p(x; θ) = −(N/2) ln 2π − (N/2) ∫ ln Pxx(f) df − (1/2) ∫ |X(f)|²/Pxx(f) df.     (3D.6)

Differentiating, we have

∂ln p(x; θ)/∂θi = −(N/2) ∫ ∂ln Pxx(f)/∂θi df + (1/2) ∫ (|X(f)|²/Pxx(f)) ∂ln Pxx(f)/∂θi df

and differentiating once more produces (3D.7), in which the data-dependent term carries the factor |X(f)|²/N − Pxx(f). Upon taking expectations in (3D.7), the first term is zero, and finally

[I(θ)]ij = (N/2) ∫ (∂ln Pxx(f)/∂θi)(∂ln Pxx(f)/∂θj) df

which is (3.34) without the explicit dependence of the PSD on θ shown. In taking the expectation we have used

E[ (1/N)|X(f)|² ] = (1/N) Σm Σn rxx[m − n] exp[−j2πf(m − n)] = Σ from k = −(N−1) to N − 1 of (1 − |k|/N) rxx[k] exp(−j2πfk)     (3D.8)

which follows from

Σ from m = 0 to N − 1 Σ from n = 0 to N − 1 of g[m − n] = Σ from k = −(N−1) to N − 1 of (N − |k|) g[k].
Chapter 4
Linear Models
4.1 Introduction
The determination of the MVU estimator is in general a difficult task. It is fortunate,
however, that a large number of signal processing estimation problems can be represented by a data model that allows us to easily determine this estimator. This class of
models is the linear model. Not only is the MVU estimator immediately evident once
the linear model has been identified, but in addition, the statistical performance follows
naturally. The key, then, to finding the optimal estimator is in structuring the problem
in the linear model form to take advantage of its unique properties.
4.2 Summary

The linear model is defined by (4.8). When this data model can be assumed, the MVU (and also efficient) estimator is given by (4.9), and the covariance matrix by (4.10). A more general model, termed the general linear model, allows the noise to have an arbitrary covariance matrix, as opposed to σ²I for the linear model. The MVU (and also efficient) estimator for this model is given by (4.25), and its corresponding covariance matrix by (4.26). A final extension allows for known signal components in the data to yield the MVU (and also efficient) estimator of (4.31). The covariance matrix is the same as for the general linear model.
4.3 Definition and Properties
where w[n] is WGN and the slope B and intercept A were to be estimated. In matrix notation the model is written more compactly as

x = Hθ + w.     (4.1)

Assuming that HᵀH is invertible,

∂ln p(x; θ)/∂θ = (HᵀH/σ²) [ (HᵀH)⁻¹Hᵀx − θ ].     (4.4)

(Figure 4.1 illustrates why H must have full rank: otherwise all θ on a line in parameter space produce the same noiseless observations s(t) versus time t.)
The observation matrix for this example has the special form of a Vandermonde matrix. Note that the resultant curve fit is

s(t) = Σ from i = 1 to p of θ̂i t^(i−1).

◊

Example 4.2 - Fourier Analysis

The data are modeled as a sum of harmonically related sinusoids in noise,

x[n] = Σ from k = 1 to M of ak cos(2πkn/N) + Σ from k = 1 to M of bk sin(2πkn/N) + w[n],  n = 0, 1, ..., N − 1     (4.12)

where w[n] is WGN. The frequencies are assumed to be harmonically related or multiples of the fundamental f1 = 1/N as fk = k/N. The amplitudes ak, bk of the cosines and sines are to be estimated. To reformulate the problem in terms of the linear model we let θ = [a1 ... aM b1 ... bM]ᵀ and take the columns of H to be the sampled cosines cos(2πkn/N) and sines sin(2πkn/N) for k = 1, ..., M and n = 0, 1, ..., N − 1. These columns are orthogonal:

Σ from n = 0 to N − 1 of sin(2πin/N) sin(2πjn/N) = (N/2) δij
Σ from n = 0 to N − 1 of cos(2πin/N) sin(2πjn/N) = 0  for all i, j.     (4.13)

An outline of the orthogonality proof is given in Problem 4.5. This property is quite useful in that

HᵀH = (N/2) I

so that the MVU estimator of the amplitudes is
âk = (2/N) Σ from n = 0 to N − 1 of x[n] cos(2πkn/N)
b̂k = (2/N) Σ from n = 0 to N − 1 of x[n] sin(2πkn/N).

These are recognized as the discrete Fourier transform coefficients. From the properties of the linear model we can immediately conclude that the means are

E(âk) = ak,  E(b̂k) = bk     (4.15)

and the covariance matrix is

Cθ̂ = (2σ²/N) I.     (4.16)

Because θ̂ is Gaussian and the covariance matrix is diagonal, the amplitude estimates are independent (see Problem 4.6 for an application to sinusoidal detection). It is seen from this example that a key ingredient in simplifying the computation of the MVU estimator and its covariance matrix is the orthogonality of the columns of H. Note that this property does not hold if the frequencies are arbitrarily chosen. ◊
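The identity between the MVU estimator and the scaled DFT coefficients follows purely from the orthogonality HᵀH = (N/2)I. A minimal sketch (my own numbers, assumed amplitudes and noise) checks this numerically for a single harmonic:

```python
import numpy as np

# For a harmonic frequency k/N the columns of H are orthogonal, so the MVU
# estimate (H^T H)^{-1} H^T x equals the scaled DFT coefficients (2/N) sum x*cos,
# (2/N) sum x*sin.
rng = np.random.default_rng(2)
N, k, a_k, b_k = 64, 5, 1.3, -0.7               # assumed values
n = np.arange(N)
x = a_k*np.cos(2*np.pi*k*n/N) + b_k*np.sin(2*np.pi*k*n/N) + rng.normal(size=N)
H = np.column_stack([np.cos(2*np.pi*k*n/N), np.sin(2*np.pi*k*n/N)])
mvu = np.linalg.solve(H.T @ H, H.T @ x)          # linear model MVU estimator
dft = np.array([2/N * np.sum(x*np.cos(2*np.pi*k*n/N)),
                2/N * np.sum(x*np.sin(2*np.pi*k*n/N))])
print(mvu, dft)                                  # identical by orthogonality
```

Picking a non-harmonic frequency instead breaks the orthogonality, and the two computations no longer coincide.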
Example 4.3 - System Identification

As shown in Figure 4.3, a known input u[n] is applied to a tapped delay line (TDL) or FIR filter with output Σ h[k]u[n − k], and we would like to estimate the TDL weights h[k], or equivalently, the impulse response of the FIR filter. In practice, however, the output is corrupted by noise, so that the model in Figure 4.3 is more appropriate. Assume that u[n] is provided for n = 0, 1, ..., N − 1 and that the output is observed over the same interval. We then have

x[n] = Σ from k = 0 to p − 1 of h[k] u[n − k] + w[n],  n = 0, 1, ..., N − 1     (4.17)

where it is assumed that u[n] = 0 for n < 0. In matrix form we have

x = [ u[0]     0        ...  0
      u[1]     u[0]     ...  0
      ...
      u[N−1]   u[N−2]   ...  u[N−p] ] [ h[0]
                                        h[1]
                                        ...
                                        h[p−1] ] + w.     (4.18)
Assuming w[n] is WGN, (4.18) is in the form of the linear model, and so the MVU estimator of the impulse response is

θ̂ = (HᵀH)⁻¹Hᵀx.

The covariance matrix is Cθ̂ = σ²(HᵀH)⁻¹. It is now clear that to minimize the variance of the MVU estimator, u[n] should be chosen, if possible, to make HᵀH diagonal. To see this, note that

var(θ̂i) = eiᵀ Cθ̂ ei

where ei = [0 ... 0 1 0 ... 0]ᵀ with the 1 occupying the ith place. Since Cθ̂⁻¹ can be factored as DᵀD with D an invertible p × p matrix, we can use the Cauchy-Schwarz inequality as follows. Noting that

1 = (eiᵀ Dᵀ D⁻ᵀ ei)²

we can let ε1 = Dei and ε2 = D⁻ᵀei to yield the inequality

(ε1ᵀε2)² ≤ (ε1ᵀε1)(ε2ᵀε2).

Because ε1ᵀε2 = 1, we have

1 ≤ (eiᵀ DᵀD ei)(eiᵀ D⁻¹D⁻ᵀ ei) = (eiᵀ Cθ̂⁻¹ ei)(eiᵀ Cθ̂ ei)

or finally

var(θ̂i) ≥ 1/(eiᵀ Cθ̂⁻¹ ei) = σ² / [HᵀH]ii.

Equality holds, or the minimum variance is attained, if and only if ε1 = ci ε2 for ci a constant, or

Dei = ci D⁻ᵀ ei,  i = 1, 2, ..., p.

If we combine these equations in matrix form, then the conditions for achieving the minimum possible variances are that HᵀH be diagonal. Since [H]ij = u[i − j],

[HᵀH]ij = Σ from n = 1 to N of u[n − i] u[n − j],  i = 1, 2, ..., p; j = 1, 2, ..., p     (4.19)

and for N large we have (see Problem 4.7)

[HᵀH]ij ≈ Σ from n = 0 to N − 1 − |i − j| of u[n] u[n + |i − j|]     (4.20)

which can be recognized as a correlation function of the deterministic sequence u[n]. Also, with this approximation HᵀH becomes a symmetric Toeplitz autocorrelation matrix

HᵀH = N [ ruu[0]    ruu[1]    ...  ruu[p−1]
          ruu[1]    ruu[0]    ...  ruu[p−2]
          ...
          ruu[p−1]  ruu[p−2]  ...  ruu[0] ]

where

ruu[k] = (1/N) Σ from n = 0 to N − 1 − k of u[n] u[n + k]

may be viewed as an autocorrelation function of u[n]. For HᵀH to be diagonal we require ruu[k] ≈ 0 for k ≠ 0, which is approximately realized if we use a PRN (pseudorandom noise) sequence as our input signal. Finally, under these conditions HᵀH = N ruu[0] I, and hence

var(ĥ[i]) = σ² / (N ruu[0]),  i = 0, 1, ..., p − 1.     (4.21)
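The effect of a white-like input is easy to demonstrate. The sketch below (my own illustration, with an assumed ±1 pseudorandom input and assumed filter weights) builds H from the input, estimates the weights, and checks that the estimator covariance diagonal is close to σ²/(N ruu[0]):

```python
import numpy as np

# System identification with a PRN-like +/-1 input: H^T H is nearly N*I
# (ruu[0] = 1), so each weight variance is close to sigma^2/N as in (4.21).
rng = np.random.default_rng(3)
N, p, sigma2 = 4096, 4, 0.1
h = np.array([1.0, 0.5, -0.25, 0.125])       # true impulse response (assumed)
u = rng.choice([-1.0, 1.0], size=N)          # pseudorandom input
H = np.column_stack([np.concatenate([np.zeros(k), u[:N-k]]) for k in range(p)])
x = H @ h + rng.normal(scale=np.sqrt(sigma2), size=N)
h_hat = np.linalg.solve(H.T @ H, H.T @ x)    # MVU estimate of the weights
cov = sigma2 * np.linalg.inv(H.T @ H)        # its covariance sigma^2 (H^T H)^{-1}
print(h_hat)
print(np.diag(cov), sigma2 / N)              # diagonal ~ sigma^2/(N*ruu[0])
```

A strongly correlated input (for example a slow sinusoid) makes HᵀH ill conditioned and inflates the weight variances well beyond σ²/N.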
Example 4.5 - DC Level and Exponential in White Noise

If x[n] = A + rⁿ + w[n] for n = 0, 1, ..., N − 1, where r is known, A is to be estimated, and w[n] is WGN, the model is

x = 1A + s + w

where 1 = [1 1 ... 1]ᵀ and s = [1 r ... r^(N−1)]ᵀ is a known signal component.

MacWilliams, F.J., N.J. Sloane, "Pseudo-Random Sequences and Arrays," Proc. IEEE, Vol. 64, pp. 1715-1729, Dec. 1976.

Problems

4.1 We wish to estimate the amplitudes of exponentials in noise. The observed data are

x[n] = Σ from i = 1 to p of Ai riⁿ + w[n],  n = 0, 1, ..., N − 1
to be a WSS random process, as is the output x[n]. Prove that the cross-power spectral density between the input and output is Pux(f) = H(f)Puu(f).

4.5 Verify the orthogonality relation

Σ from n = 0 to N − 1 of cos(2πkn/N) cos(2πln/N) = (N/2) δkl

by using the trigonometric identity

cos ω1 cos ω2 = (1/2) cos(ω1 + ω2) + (1/2) cos(ω1 − ω2)

and noting that

Σ cos(αn) = Re( Σ exp(jαn) ).

4.6 Assume that in Example 4.2 we have a single sinusoidal component at fk = k/N. The model as given by (4.12) is

x[n] = ak cos(2πfkn) + bk sin(2πfkn) + w[n],  n = 0, 1, ..., N − 1.

Using the identity A cos ω + B sin ω = √(A² + B²) cos(ω − φ), where φ = arctan(B/A), we can rewrite the model as

x[n] = √(ak² + bk²) cos(2πfkn − φ) + w[n].

An MVU estimator is used for ak, bk, so that the estimated power of the sinusoid is

P̂ = (âk² + b̂k²)/2.

4.10 If in Example 4.4 the noise samples are uncorrelated but of unequal variance, or

C = diag(σ0², σ1², ..., σ²N−1)

find Â and interpret the results. What would happen to Â if a single σn² were equal to zero?

4.11 Assume the data model described in Problem 3.9. Find the MVU estimator of A and its variance. Why is the estimator independent of ρ? What happens if ρ → 1?

4.12 In this problem we investigate the estimation of a linear function of θ. Letting this new parameter be α = Aθ, where A is a known r × p matrix with r < p and rank r, show that the MVU estimator is

α̂ = Aθ̂

where θ̂ is the MVU estimator of θ. Also, find the covariance matrix. Hint: Replace x by

x' = A(HᵀH)⁻¹Hᵀx

where x' is r × 1. This results in the reduced linear model. It can be shown that x' contains all the information about θ so that we may base our estimator on this lower dimensionality data set [Graybill 1976].
"1 d 1" x - H6 + w but with
4.13 In practice we sometimes encounter t he mear mo e. .-
R composed of mndom variables. Suppose we ignore thIS dIfference and use our
usual estimator
where we assume that the particular realization of H is k?own to us. Show that
if R and ware independent, the mean and covariance of 6 are Chapter 5
E(8) = 6
Co = (J2 EH [(HTH)-l]
General Minimum Variance Unbiased
where EH denotes the expectation with respect to the PDF of H. What happens
if the independence assumption is not made? Estimation
4 Suppose we observe a signal, which is subject to fading, in noise. We model
4 1
the fading process as resulting in a signa.1 t h a tIS el.ther "on"or "off" . As
_ a
simple illustration, consider the DC level m WGN or x[n] = A + w[n] for n -
0,1, ... , N - 1. When the signal fades, the data model becomes 5.1 Introduction
A + w[n] n = 0,1, ... ,M - 1
x[n] ={ w[n] n = M, M + 1, ... ,N - 1 We have seen that the evaluation of the CRLB sometimes results in an efficient and
hence MVU estimator. In articular, the linear model rovides a useful exam Ie of
where the probability of a fade is f. Assuming we know w?en the s~gnal has this approac . If, however, an efficient estimator does not exist, it is still of interest
experienced a fade, use the results of Problem 4.13 to determme a~ estImator of to be able to find the MVU estimator (assuming of course that it exists). To do so
A and also its variance. Compare your results to the case of no fadmg. requires the concept of sufficient statistics and the important Rao-Blackwell-Lehmann-
SChefte theorem. Armed with this theory it is possible in many cases to determine the
MVU estimator by a simple inspection of the PDF. How this is done is explored in this
chapter.
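The claim in Problem 4.13 can be explored by simulation. A sketch in Python (the sizes, parameter values, and trial count are illustrative choices, not from the text): the estimator remains unbiased, and its covariance matches σ² E_H[(HᵀH)⁻¹] estimated over realizations of H.

```python
# Monte Carlo sketch for Problem 4.13: x = H theta + w with H random but known
# per realization, w ~ N(0, sigma2 I) independent of H. All values illustrative.
import numpy as np

rng = np.random.default_rng(5)
N, p, sigma2, trials = 20, 2, 0.5, 20_000
theta = np.array([1.0, -2.0])

ests, cov_terms = [], []
for _ in range(trials):
    H = rng.standard_normal((N, p))              # random model matrix
    x = H @ theta + np.sqrt(sigma2) * rng.standard_normal(N)
    HtH_inv = np.linalg.inv(H.T @ H)
    ests.append(HtH_inv @ H.T @ x)               # usual estimator (H^T H)^{-1} H^T x
    cov_terms.append(HtH_inv)

ests = np.array(ests)
emp_cov = np.cov(ests.T)                         # empirical covariance of theta_hat
pred_cov = sigma2 * np.mean(cov_terms, axis=0)   # sigma^2 E_H[(H^T H)^{-1}]
print(ests.mean(axis=0))                         # ~ theta (unbiased)
print(emp_cov)
print(pred_cov)
```

The two printed covariance matrices agree to within Monte Carlo error, consistent with the result to be shown in the problem.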
5.2 Summary
In Section 5.3 we define a sufficient statistic as one which summarizes the data in the sense that if we are given the sufficient statistic, the PDF of the data no longer depends on the unknown parameter. The Neyman-Fisher factorization theorem (see Theorem 5.1) enables us to easily find sufficient statistics by examining the PDF. Once the sufficient statistic has been found, the MVU estimator may be determined by finding a function of the sufficient statistic that is an unbiased estimator. The essence of the approach is summarized in Theorem 5.2 and is known as the Rao-Blackwell-Lehmann-Scheffe theorem. In using this theorem we must also ensure that the sufficient statistic is complete, or that there exists only one function of it that is an unbiased estimator. The extension to the vector parameter case is given in Theorems 5.3 and 5.4. The approach is basically the same as for a scalar parameter. We must first find as many sufficient statistics as unknown parameters and then determine an unbiased estimator based on the sufficient statistics.
5.3 Sufficient Statistics

Recall that for a DC level A in WGN

    Â = (1/N) Σ_{n=0}^{N-1} x[n]

was the MVU estimator, having minimum variance σ²/N. If, on the other hand, we had chosen

    Ǎ = x[0]

as our estimator, it is immediately clear that even though Ǎ is unbiased, its variance is much larger (being σ²) than the minimum. Intuitively, the poor performance is a direct result of discarding the data points {x[1], x[2], ..., x[N - 1]}, which carry information about A. A reasonable question to ask is: Which data samples are pertinent to the estimation problem? Or, is there a set of data that is sufficient? The following data sets may be claimed to be sufficient in that they may be used to compute Â:

    S_1 = {x[0], x[1], ..., x[N - 1]}
    S_2 = {x[0] + x[1], x[2], x[3], ..., x[N - 1]}
    S_3 = {Σ_{n=0}^{N-1} x[n]}.

S_1 represents the original data set, which, as expected, is always sufficient for the problem. S_2 and S_3 are also sufficient. It is obvious that for this problem there are many sufficient data sets. The data set that contains the least number of elements is called the minimal one. If we now think of the elements of these sets as statistics, we say that the N statistics of S_1 are sufficient, as well as the statistics of S_2 and the single statistic of S_3. This latter statistic, Σ_{n=0}^{N-1} x[n], in addition to being a sufficient statistic, is the minimal sufficient statistic. For estimation of A, once we know Σ_{n=0}^{N-1} x[n], we no longer need the individual data values since all information has been summarized in the sufficient statistic. To quantify what we mean by this, consider the PDF of the data

    p(x; A) = (1/(2πσ²)^{N/2}) exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - A)²]    (5.1)

and assume that T(x) = Σ_{n=0}^{N-1} x[n] = T_0 has been observed. Knowledge of the value of this statistic will change the PDF to the conditional one p(x | Σ_{n=0}^{N-1} x[n] = T_0; A), which now gives the PDF of the observations after the sufficient statistic has been observed. Since the statistic is sufficient for the estimation of A, this conditional PDF should not depend on A. If it did, then we could infer some additional information about A from the data in addition to that already provided by the sufficient statistic. As an example, in Figure 5.1a, if x = x_0 for an arbitrary x_0, then values of A near A_0 would be more likely. This violates our notion that Σ_{n=0}^{N-1} x[n] is a sufficient statistic. On the other hand, in Figure 5.1b, any value of A is as likely as any other, so that after observing T(x) the data may be discarded. Hence, to verify that a statistic is sufficient, we need to determine the conditional PDF and confirm that there is no dependence on A.

[Figure 5.1 Sufficient statistic definition: (a) observations provide information after T(x) is observed, so T(x) is not sufficient; (b) no information from observations after T(x) is observed, so T(x) is sufficient.]

Example 5.1 - Verification of a Sufficient Statistic

Consider the PDF of (5.1). To prove that Σ_{n=0}^{N-1} x[n] is a sufficient statistic we need to determine p(x | T(x) = T_0; A), where T(x) = Σ_{n=0}^{N-1} x[n]. By the definition of the conditional PDF we have

    p(x | T(x) = T_0; A) = p(x, T(x) = T_0; A) / p(T(x) = T_0; A).

But note that T(x) is functionally dependent on x, so that the joint PDF p(x, T(x) = T_0; A) takes on nonzero values only when x satisfies T(x) = T_0. The joint PDF is therefore p(x; A)δ(T(x) - T_0), where δ is the Dirac delta function (see also Appendix 5A for a further discussion). Thus, we have that

    p(x | T(x) = T_0; A) = p(x; A)δ(T(x) - T_0) / p(T(x) = T_0; A).    (5.2)
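The defining property, that the conditional PDF of the data given T(x) no longer depends on A, can be checked by simulation. A minimal sketch with illustrative values: for two different true values of A, the conditional mean of x[0] given Σ x[n] ≈ T_0 is the same (namely T_0/N).

```python
# Sufficiency check by simulation: condition on T(x) = sum(x) falling in a
# narrow bin around T0; the resulting distribution of x[0] does not depend on A.
# All parameter values below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
N, sigma, T0, width = 4, 1.0, 2.0, 0.05
trials = 1_000_000

cond_means = {}
for A in (0.0, 1.0):                      # two different true DC levels
    x = A + sigma * rng.standard_normal((trials, N))
    T = x.sum(axis=1)
    kept = x[np.abs(T - T0) < width]      # realizations consistent with T ~ T0
    cond_means[A] = kept[:, 0].mean()     # conditional mean of x[0] given T = T0
    print(A, cond_means[A])               # ~ T0/N = 0.5 for both values of A
```

Once T(x) is fixed, the data carry no further information about A, exactly as Figure 5.1b suggests.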
Clearly, T(x) ~ N(NA, Nσ²), so that

    p(x; A)δ(T(x) - T_0)
      = (1/(2πσ²)^{N/2}) exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - A)²] δ(T(x) - T_0)
      = (1/(2πσ²)^{N/2}) exp[-(1/(2σ²)) (Σ_{n=0}^{N-1} x²[n] - 2AT(x) + NA²)] δ(T(x) - T_0).

5.4 Finding Sufficient Statistics

The Neyman-Fisher factorization theorem is now stated, after which we will use it to find sufficient statistics in several examples.

Theorem 5.1 (Neyman-Fisher Factorization) If we can factor the PDF p(x; θ) as

    p(x; θ) = g(T(x), θ) h(x)    (5.3)

where g is a function depending on x only through T(x) and h is a function depending only on x, then T(x) is a sufficient statistic for θ. Conversely, if T(x) is a sufficient statistic for θ, then the PDF can be factored as in (5.3).

A proof of this theorem is contained in Appendix 5A. It should be mentioned that at times it is not obvious if the PDF can be factored in the required form. If this is the case, then a sufficient statistic may not exist. Some examples are now given to illustrate the use of this powerful theorem.

Similarly, when A = 0 and the noise power σ² is the unknown, the PDF may be written

    p(x; σ²) = (1/(2πσ²)^{N/2}) exp[-(1/(2σ²)) Σ_{n=0}^{N-1} x²[n]] · 1

where the first factor is g(T(x), σ²) and the second factor is h(x) = 1. Again it is immediately obvious from the factorization theorem that T(x) = Σ_{n=0}^{N-1} x²[n] is a sufficient statistic for σ². See also Problem 5.1.
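One consequence of the factorization (5.3) is easy to check numerically: for two data sets with the same value of T(x), the likelihood ratio p(x_1; θ)/p(x_2; θ) equals h(x_1)/h(x_2) and so is independent of θ. A sketch for the DC-level PDF of (5.1), with made-up data values:

```python
# Factorization check for Theorem 5.1: if T(x1) = T(x2), the log-likelihood
# difference loglik(x1, A) - loglik(x2, A) is the same for every A, since the
# A-dependence enters only through T(x) = sum(x). Data values are made up.
import numpy as np

def loglik(x, A, sigma2=1.0):
    # log p(x; A) for the DC-level-in-WGN model of (5.1)
    N = len(x)
    return -N / 2 * np.log(2 * np.pi * sigma2) - ((x - A) ** 2).sum() / (2 * sigma2)

x1 = np.array([0.0, 1.0, 2.0, 3.0])      # T(x1) = 6
x2 = np.array([1.5, 1.5, 1.5, 1.5])      # T(x2) = 6 as well
ratios = [loglik(x1, A) - loglik(x2, A) for A in (-1.0, 0.0, 2.0, 5.0)]
print(ratios)                            # identical for all A
```

Here the common value is -(Σx_1² - Σx_2²)/2, a function of the data alone, illustrating that once T(x) is known the remaining dependence on the data is through h(x) only.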
Example 5.4 - Phase of Sinusoid

Recall the problem in Example 3.4 in which we wish to estimate the phase of a sinusoid embedded in WGN or

    x[n] = A cos(2πf_0 n + φ) + w[n]    n = 0, 1, ..., N - 1.

Here, the amplitude A and frequency f_0 of the sinusoid are known, as is the noise variance σ². The PDF is

    p(x; φ) = (1/(2πσ²)^{N/2}) exp{-(1/(2σ²)) Σ_{n=0}^{N-1} [x[n] - A cos(2πf_0 n + φ)]²}.    (5.4)

The exponent may be expanded as

    Σ_{n=0}^{N-1} x²[n] - 2A Σ_{n=0}^{N-1} x[n] cos(2πf_0 n + φ) + Σ_{n=0}^{N-1} A² cos²(2πf_0 n + φ)
      = Σ_{n=0}^{N-1} x²[n] - 2A (Σ_{n=0}^{N-1} x[n] cos 2πf_0 n) cos φ
        + 2A (Σ_{n=0}^{N-1} x[n] sin 2πf_0 n) sin φ + Σ_{n=0}^{N-1} A² cos²(2πf_0 n + φ).

In this example it does not appear that the PDF is factorable as required by the Neyman-Fisher theorem. Hence, no single sufficient statistic exists. However, it can be factored as

    p(x; φ) = (1/(2πσ²)^{N/2}) exp{-(1/(2σ²)) [Σ_{n=0}^{N-1} A² cos²(2πf_0 n + φ) - 2A T_1(x) cos φ + 2A T_2(x) sin φ]}
              × exp[-(1/(2σ²)) Σ_{n=0}^{N-1} x²[n]]

where the first factor is g(T_1(x), T_2(x), φ), the second factor is h(x), and

    T_1(x) = Σ_{n=0}^{N-1} x[n] cos 2πf_0 n
    T_2(x) = Σ_{n=0}^{N-1} x[n] sin 2πf_0 n.

By a slight generalization of the Neyman-Fisher theorem we can conclude that T_1(x) and T_2(x) are jointly sufficient statistics for the estimation of φ. However, no single sufficient statistic exists. The reason why we wish to restrict our attention to single sufficient statistics will become clear in the next section.

The concept of jointly sufficient statistics is a simple extension of our previous definition. The r statistics T_1(x), T_2(x), ..., T_r(x) are jointly sufficient statistics if the conditional PDF p(x | T_1(x), T_2(x), ..., T_r(x); θ) does not depend on θ. The generalization of the Neyman-Fisher theorem asserts that if p(x; θ) can be factored as [Kendall and Stuart 1979]

    p(x; θ) = g(T_1(x), T_2(x), ..., T_r(x), θ) h(x)

then {T_1(x), T_2(x), ..., T_r(x)} are sufficient statistics for θ. It is clear then that the original data are always sufficient statistics since we can let r = N and

    T_{n+1}(x) = x[n]    n = 0, 1, ..., N - 1

so that g = p(x; θ), h(x) = 1, and the factorization holds identically. Of course, they are seldom the minimal set of sufficient statistics.

5.5 Using Sufficiency to Find the MVU Estimator

Assuming that we have been able to find a sufficient statistic T(x) for θ, we can make use of the Rao-Blackwell-Lehmann-Scheffe (RBLS) theorem to find the MVU estimator. We will first illustrate the approach with an example and then state the theorem formally.

Example 5.5 - DC Level in WGN

We will continue Example 5.2. Although we already know that Â = x̄ is the MVU estimator (since it is efficient), we will use the RBLS theorem, which can be used even when an efficient estimator does not exist and hence the CRLB method is no longer viable. The procedure for finding the MVU estimator Â may be implemented in two different ways. They are both based on the sufficient statistic T(x) = Σ_{n=0}^{N-1} x[n]:

1. Find any unbiased estimator of A, say Ǎ = x[0], and determine Â = E(Ǎ|T). The expectation is taken with respect to p(Ǎ|T).

2. Find some function g so that Â = g(T) is an unbiased estimator of A.

For the first approach we can let the unbiased estimator be Ǎ = x[0] and determine Â = E(x[0] | Σ_{n=0}^{N-1} x[n]). To do so we will need some properties of the conditional
Gaussian PDF. For [x y]ᵀ a Gaussian random vector with mean vector μ = [E(x) E(y)]ᵀ and covariance matrix

    C = [ var(x)     cov(x, y) ]
        [ cov(y, x)  var(y)    ]

the conditional mean is

    E(x|y) = ∫ x p(x|y) dx = ∫ x p(x, y)/p(y) dx = E(x) + (cov(x, y)/var(y)) (y - E(y)).    (5.5)

Applying this result, we let x = x[0] and y = Σ_{n=0}^{N-1} x[n] and note that x and y are jointly Gaussian with E(x) = A, E(y) = NA, var(y) = Nσ², and cov(x, y) = σ². Hence, we have finally from (5.5) that

    E(x|y) = A + (σ²/(Nσ²)) (Σ_{n=0}^{N-1} x[n] - NA) = (1/N) Σ_{n=0}^{N-1} x[n]

which is the MVU estimator. This approach, requiring evaluation of a conditional expectation, is usually mathematically intractable. Turning our attention to the second approach, we need to find some function g so that

    Â = g(Σ_{n=0}^{N-1} x[n]) = g(T(x))    (5.6)

is an unbiased estimator of A. By inspection this is g(x) = x/N, which yields

    Â = (1/N) Σ_{n=0}^{N-1} x[n]

as the MVU estimator. This alternative method is much easier to apply, and therefore in practice, it is the one we generally employ.

[Figure 5.2 RBLS argument for MVU estimator: starting from any unbiased estimator, the conditional expectation operation E(θ̌|T(x)) can only decrease the variance, leading to the MVU estimator.]

Theorem 5.2 (Rao-Blackwell-Lehmann-Scheffe) If θ̌ is an unbiased estimator of θ and T(x) is a sufficient statistic for θ, then θ̂ = E(θ̌|T(x)) is

1. a valid estimator for θ (not dependent on θ)
2. unbiased
3. of lesser or equal variance than that of θ̌, for all θ.

Additionally, if the sufficient statistic is complete, then θ̂ is the MVU estimator.

A proof is given in Appendix 5B. In the previous example we saw that E(x[0] | Σ_{n=0}^{N-1} x[n]) = x̄ did not depend on A, making it a valid estimator, was unbiased, and had less variance than x[0]. That there is no other estimator with less variance, as Theorem 5.2 asserts, follows from the property that the sufficient statistic Σ_{n=0}^{N-1} x[n] is a complete sufficient statistic. In essence, a statistic is complete if there is only one function of the statistic that is unbiased. The argument that θ̂ = E(θ̌|T(x)) is the MVU estimator is now given. Consider all possible unbiased estimators of θ, as depicted in Figure 5.2. By determining E(θ̌|T(x)) we can lower the variance of the estimator (property 3 of Theorem 5.2) and still remain within the class (property 2 of Theorem 5.2). But E(θ̌|T(x)) is solely a function of the sufficient statistic T(x) since

    E(θ̌|T(x)) = ∫ θ̌ p(θ̌|T(x)) dθ̌.
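The conditional mean formula (5.5), and its consequence E(x[0] | Σ x[n]) = x̄, can be verified by simulation. A sketch with illustrative values:

```python
# Numerical check of the conditional-mean formula (5.5) for the jointly
# Gaussian pair x = x[0], y = sum_n x[n]. All parameter values illustrative.
import numpy as np

rng = np.random.default_rng(1)
A, sigma, N, trials = 1.0, 1.0, 8, 1_000_000
data = A + sigma * rng.standard_normal((trials, N))
x, y = data[:, 0], data.sum(axis=1)

# Theory: cov(x, y) = sigma^2 and var(y) = N sigma^2, so (5.5) gives
# E(x|y) = A + (sigma^2 / (N sigma^2)) (y - N A) = y / N.
y0 = 7.0                                  # condition on y near this value
sel = x[np.abs(y - y0) < 0.1]
empirical = sel.mean()
predicted = A + (sigma**2 / (N * sigma**2)) * (y0 - N * A)   # = y0 / N
print(empirical, predicted)
```

The empirical conditional mean of x[0] agrees with the prediction y_0/N, which is exactly the sample mean evaluated at the conditioning value.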
Since the statistic is complete, there is only one function of the sufficient statistic that leads to an unbiased estimator, and we need only find the unique g to make the sufficient statistic unbiased. For this latter approach we found in Example 5.5 that g(Σ_{n=0}^{N-1} x[n]) = Σ_{n=0}^{N-1} x[n]/N.

The property of completeness depends on the PDF of x, which in turn determines the PDF of the sufficient statistic. For many practical cases of interest it holds. In particular, for the exponential family of PDFs (see Problems 5.14 and 5.15) this condition is satisfied. To validate that a sufficient statistic is complete is in general quite difficult, and we refer the reader to the discussions in [Kendall and Stuart 1979]. A flavor for the concept of completeness is provided by the next two examples.

Consider first the completeness of T(x) = Σ_{n=0}^{N-1} x[n] for the DC level in WGN. Suppose g and h are two functions of the sufficient statistic that both yield unbiased estimators. Expressing their difference v' as a function of the sample mean τ = T/N, which is distributed N(A, σ²/N), the difference of the two unbiasedness conditions requires

    ∫_{-∞}^{∞} v'(τ) (1/√(2πσ²/N)) exp[-(N/(2σ²)) (A - τ)²] dτ = 0    for all A    (5.7)

which may be recognized as the convolution of a function v'(τ) with a Gaussian pulse w(τ) (see Figure 5.3). For the result to be zero for all A, v'(τ) must be identically zero. To see this, recall that a signal is zero if and only if its Fourier transform is identically zero, resulting in the condition

    V'(f)W(f) = 0    for all f

where V'(f) = F{v'(τ)} and W(f) is the Fourier transform of the Gaussian pulse in (5.7). Since W(f) is also Gaussian and therefore positive for all f, we have that the condition is satisfied if and only if V'(f) = 0 for all f. Hence, we must have that v'(τ) = 0 for all τ. This implies that g = h or that the function g is unique.

[Figure 5.3 Completeness condition for sufficient statistic: (a) integral equals zero (condition satisfied); (b) integral does not equal zero (condition not satisfied).]

In general, a statistic T is complete if

    ∫ v(T) p(x; A) dx = 0    for all A

is satisfied only by v(T) = 0 for all T. Now consider instead a single observation x[0] of a DC level A in uniform noise. For this problem, however, x = x[0] = T, so that the condition becomes

    ∫ v(T) p(T; A) dT = 0    for all A

with

    p(T; A) = { 1,    A - 1/2 ≤ T ≤ A + 1/2
              { 0,    otherwise

so that the condition reduces to

    ∫_{A-1/2}^{A+1/2} v(T) dT = 0    for all A.
This condition is satisfied not only by v(T) = 0 but also, for example, by v(T) = sin 2πT, since the integral of sin 2πT over any interval of length one is zero. Hence more than one function of the sufficient statistic is unbiased: both g(T) = T and

    h(T) = T - sin 2πT

have expected value A, and the sufficient statistic is not complete. For a complete statistic the condition is satisfied only by the zero function, or by v(T) = 0 for all T.

At this point it is worthwhile to review our results and then apply them to an estimation problem for which we do not know the MVU estimator. The procedure is as follows (see also Figure 5.5):

1. Find a single sufficient statistic for θ, that is, T(x), by using the Neyman-Fisher factorization theorem.

2. Determine if the sufficient statistic is complete and, if so, proceed; if not, this approach cannot be used.

3. Find the function g of the sufficient statistic that yields an unbiased estimator θ̂ = g(T(x)). Alternatively, the MVU estimator may be evaluated as E(θ̌|T(x)) for θ̌ any unbiased estimator. However, in practice the conditional expectation evaluation is usually too tedious.

[Figure 5.5 Procedure for finding the MVU estimator: use the Neyman-Fisher factorization theorem to find the sufficient statistic T(x), check completeness, and then transform to an unbiased estimator.]

The next example illustrates the overall procedure.

Example 5.8 - Mean of Uniform Noise

Consider the IID data x[n] for n = 0, 1, ..., N - 1, uniformly distributed on [0, β], where the mean θ = β/2 is to be estimated. This PDF does not satisfy the CRLB regularity conditions (see Problem 3.1). A natural estimator is the sample mean

    θ̌ = (1/N) Σ_{n=0}^{N-1} x[n].

The sample mean is easily shown to be unbiased and to have variance

    var(θ̌) = var(x[n])/N = β²/(12N).    (5.9)

The PDF of the data will be nonzero only if 0 < x[n] < β for all x[n], so that

    p(x; β) = { 1/β^N,   0 < x[n] < β for all n
              { 0,       otherwise

or, with u denoting the unit step function,

    p(x; β) = (1/β^N) u(β - max x[n]) · u(min x[n])

where the first two factors constitute g(T(x), β) and the last factor is h(x). By the Neyman-Fisher factorization theorem T(x) = max x[n] is a sufficient statistic for θ. Furthermore, it can be shown that the sufficient statistic is complete. We omit the proof.

Next we need to determine a function of the sufficient statistic to make it unbiased. To do so requires us to determine the expected value of T = max x[n]. The statistic T is known as an order statistic. Its PDF is now derived. Evaluating first the cumulative distribution function,

    Pr{T ≤ ξ} = Pr{x[0] ≤ ξ, x[1] ≤ ξ, ..., x[N-1] ≤ ξ} = Π_{n=0}^{N-1} Pr{x[n] ≤ ξ} = (ξ/β)^N    for 0 ≤ ξ ≤ β

and differentiating yields the PDF p_T(ξ) = N(ξ/β)^{N-1}(1/β) for 0 ≤ ξ ≤ β. We now have

    E(T) = ∫_0^β ξ p_T(ξ) dξ = ∫_0^β ξ N (ξ/β)^{N-1} (1/β) dξ = (N/(N+1)) β = (2N/(N+1)) θ.

To make this unbiased we let θ̂ = [(N+1)/(2N)] T, so that finally the MVU estimator is

    θ̂ = ((N+1)/(2N)) max x[n].

Somewhat contrary to intuition, the sample mean is not the MVU estimator of the mean for uniformly distributed noise! It is of interest to compare the minimum variance to that of the sample mean estimator variance. Noting that

    var(θ̂) = ((N+1)/(2N))² var(T) = ((N+1)/(2N))² · Nβ²/((N+2)(N+1)²) = β²/(4N(N+2))

which is less than β²/(12N) for N > 1, the order-statistic estimator is the better of the two.
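The two estimators of Example 5.8 can be compared by Monte Carlo simulation. A sketch (β, N, and the trial count are illustrative; the theoretical variance β²/(4N(N+2)) follows from the moments of T):

```python
# Monte Carlo comparison for Example 5.8: sample mean vs. the order-statistic
# MVU estimator (N+1)/(2N) * max x[n] for x[n] ~ U[0, beta]. Values illustrative.
import numpy as np

rng = np.random.default_rng(2)
beta, N, trials = 2.0, 10, 200_000
theta = beta / 2                          # mean to be estimated
x = rng.uniform(0.0, beta, size=(trials, N))

theta_mean = x.mean(axis=1)               # sample mean estimator
theta_mvu = (N + 1) / (2 * N) * x.max(axis=1)

print("sample mean: bias", theta_mean.mean() - theta, "var", theta_mean.var())
print("MVU:         bias", theta_mvu.mean() - theta, "var", theta_mvu.var())
print("theory:", beta**2 / (12 * N), beta**2 / (4 * N * (N + 2)))
```

Both estimators come out unbiased, but the order-statistic estimator's variance is smaller by the factor 3/(N+2), a substantial gain for large N.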
5.6 Extension to a Vector Parameter

The extension of sufficiency to a vector parameter θ requires that the conditional PDF not depend on θ. This is to say that p(x|T(x); θ) = p(x|T(x)). In general, many such jointly sufficient statistics exist. For the sinusoidal data of Example 5.9, for instance, the PDF factors as p(x; θ) = g(T(x), θ)h(x) with

    T(x) = [ Σ_{n=0}^{N-1} x[n] cos 2πf_0 n ]
           [ Σ_{n=0}^{N-1} x²[n]            ].

Hence, T(x) is a sufficient statistic for A and σ², but only if f_0 is known (see also Problem 5.17). The next example continues this problem.

In the vector parameter case completeness means that for v(T), an arbitrary r x 1 function of T, if

    E(v(T)) = ∫ v(T) p(T; θ) dT = 0    for all θ

then v(T) = 0 for all T. As before, this can be difficult to verify. Also, to be able to determine the MVU estimator directly from T(x) without having to evaluate E(θ̌|T(x)), the dimension of the sufficient statistic should be equal to the dimension of the unknown parameter, or r = p. If this is satisfied, then we need only find an r-dimensional function g such that E(g(T)) = θ.

Example 5.10 - DC Level in White Noise with Unknown Noise Power

Proceeding as in the previous example, we have as our sufficient statistic

    T(x) = [ Σ_{n=0}^{N-1} x[n]  ]
           [ Σ_{n=0}^{N-1} x²[n] ].

Note that we had already observed that when σ² is known, Σ_{n=0}^{N-1} x[n] is a sufficient statistic for A (see Example 5.2), and that when A is known (A = 0), Σ_{n=0}^{N-1} x²[n] is a sufficient statistic for σ² (see Example 5.3). In this example, the same statistics are jointly sufficient. This will not always be true, however. Since we have two parameters to be estimated and we have found two sufficient statistics, we should be able to find the MVU estimator for this example.

Before actually finding the MVU estimator for the previous example, we need to state the RBLS theorem for vector parameters.

Theorem 5.4 (Rao-Blackwell-Lehmann-Scheffe, Vector Parameter) If θ̌ is an unbiased estimator of θ and T(x) is an r x 1 sufficient statistic for θ, then θ̂ = E(θ̌|T(x)) is

1. a valid estimator for θ (not dependent on θ)
2. unbiased
3. of lesser or equal variance than that of θ̌ (each element of θ̂ has lesser or equal variance).

Additionally, if the sufficient statistic is complete, then θ̂ is the MVU estimator.

Example 5.11 - DC Level in WGN with Unknown Noise Power (continued)

We first need to find the mean of the sufficient statistic

    T(x) = [ Σ_{n=0}^{N-1} x[n]  ]
           [ Σ_{n=0}^{N-1} x²[n] ].

Taking the expected value produces

    E(T(x)) = [ NA        ]  =  [ NA          ]
              [ NE(x²[n]) ]     [ N(σ² + A²)  ].

Clearly, we could remove the bias of the first component of T(x) by dividing the statistic by N. However, this would not help the second component. It should be obvious that T_2(x) = Σ_{n=0}^{N-1} x²[n] estimates the second moment and not the variance, as desired. If we transform T(x) as
    [ (1/N) T_1(x)                    ]
    [ (1/N) T_2(x) - ((1/N) T_1(x))²  ]

then it can be shown that x̄ = (1/N)T_1(x) and the second component are independent, and that

    E(x̄) = A

and

    E[(1/N) Σ_{n=0}^{N-1} x²[n] - x̄²] = σ² + A² - E(x̄²) = σ² + A² - (σ²/N + A²) = ((N-1)/N) σ².

If we multiply this statistic by N/(N - 1), it will then be unbiased for σ². Finally, the appropriate transformation is

    g(T(x)) = [ (1/N) T_1(x)                      ]  =  [ x̄                                        ]
              [ (1/(N-1)) [T_2(x) - (1/N) T_1²(x)] ]     [ (1/(N-1)) (Σ_{n=0}^{N-1} x²[n] - N x̄²)  ]

which is the MVU estimator of θ = [A σ²]ᵀ. The normalizing factor of 1/(N - 1) for σ̂² (the sample variance) is due to the one degree of freedom lost in estimating the mean. In this example θ̂ is not efficient. It can be shown [Hoel, Port, and Stone 1971] that the variance of the sample variance exceeds the CRLB for σ². Thus, the MVU estimator could not have been obtained by examining the CRLB.

Finally, we should observe that if we had known beforehand or suspected that the sample mean and sample variance were the MVU estimators, then we could have reduced our work. The PDF

    p(x; θ) = (1/(2πσ²)^{N/2}) exp[-(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] - A)²]

can be factored more directly by noting that

    Σ_{n=0}^{N-1} (x[n] - A)² = Σ_{n=0}^{N-1} (x[n] - x̄ + x̄ - A)²
                              = Σ_{n=0}^{N-1} (x[n] - x̄)² + 2(x̄ - A) Σ_{n=0}^{N-1} (x[n] - x̄) + N(x̄ - A)².

The middle term is zero, so that we have the factorization in terms of

    T'(x) = [ x̄                           ]
            [ Σ_{n=0}^{N-1} (x[n] - x̄)²   ].

Dividing the second component by (N - 1) would produce σ̂². Of course, T'(x) and T(x) are related by a one-to-one transformation, since

    Σ_{n=0}^{N-1} (x[n] - x̄)² = Σ_{n=0}^{N-1} x²[n] - 2 Σ_{n=0}^{N-1} x[n] x̄ + N x̄² = Σ_{n=0}^{N-1} x²[n] - N x̄²

illustrating once again that the sufficient statistic is unique to within these transformations.

In asserting that θ̂ is the MVU estimator we have not verified the completeness of the statistic. Completeness follows because the Gaussian PDF is a special case of the vector exponential family of PDFs, which are known to be complete [Kendall and Stuart 1979] (see Problem 5.14 for the definition of the scalar exponential family of PDFs).
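The transformation g(T(x)) can be checked numerically: it reproduces the sample mean and the 1/(N-1)-normalized sample variance, and the latter is unbiased for σ². A sketch with illustrative values:

```python
# Check of g(T(x)) from Example 5.11: it equals [sample mean, sample variance],
# and the 1/(N-1) normalization makes the variance estimate unbiased.
# All parameter values below are illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
A, sigma2, N = 1.5, 2.0, 12
x = A + np.sqrt(sigma2) * rng.standard_normal(N)

T1, T2 = x.sum(), (x ** 2).sum()          # the two jointly sufficient statistics
A_hat = T1 / N
var_hat = (T2 - N * A_hat ** 2) / (N - 1)

assert np.isclose(A_hat, x.mean())        # matches the sample mean
assert np.isclose(var_hat, x.var(ddof=1)) # matches the 1/(N-1) sample variance

# Unbiasedness of the 1/(N-1) normalization over many realizations
many = A + np.sqrt(sigma2) * rng.standard_normal((100_000, N))
s2 = many.var(axis=1, ddof=1)
print(s2.mean())                          # ~ sigma2
```

With ddof=0 (a 1/N normalization) the average would instead come out near ((N-1)/N)σ², the bias removed above by the N/(N-1) factor.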
References

Hoel, P.G., S.C. Port, C.J. Stone, Introduction to Statistical Theory, Houghton Mifflin, Boston, 1971.
Hoskins, R.F., Generalized Functions, J. Wiley, New York, 1979.
Kendall, Sir M., A. Stuart, The Advanced Theory of Statistics, Vol. 2, Macmillan, New York, 1979.
Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, 1965.

Problems

5.1 For Example 5.3 prove that Σ_{n=0}^{N-1} x²[n] is a sufficient statistic for σ² by using the definition that p(x | Σ_{n=0}^{N-1} x²[n] = T_0; σ²) does not depend on σ². Hint: Note that the PDF of s = Σ_{n=0}^{N-1} x²[n]/σ² is a chi-squared distribution with N degrees of freedom or

    p(s) = { (1/(2^{N/2} Γ(N/2))) exp(-s/2) s^{N/2 - 1},   s > 0
           { 0,                                            s < 0.

5.2 The IID observations x[n] for n = 0, 1, ..., N - 1 have the Rayleigh PDF

    p(x[n]; σ²) = { (x[n]/σ²) exp(-x²[n]/(2σ²)),   x[n] > 0
                  { 0,                             x[n] < 0.

Find a sufficient statistic for σ².

5.3 The IID observations x[n] for n = 0, 1, ..., N - 1 have the exponential PDF

    p(x[n]; λ) = { λ exp(-λ x[n]),   x[n] > 0
                 { 0,                x[n] < 0.

Find a sufficient statistic for λ.

5.4 The IID observations x[n] for n = 0, 1, ..., N - 1 are distributed according to N(θ, θ), where θ > 0. Find a sufficient statistic for θ.

5.5 The IID observations x[n] for n = 0, 1, ..., N - 1 are distributed according to U[-θ, θ], where θ > 0. Find a sufficient statistic for θ.

5.6 If x[n] = A + w[n] for n = 0, 1, ..., N - 1 are observed, where w[n] is WGN with variance σ², find the MVU estimator for σ², assuming that A is known. You may assume that the sufficient statistic is complete.

5.7 Consider the frequency estimation of a sinusoid embedded in WGN or

    x[n] = cos 2πf_0 n + w[n]    n = 0, 1, ..., N - 1

where w[n] is WGN of known variance σ². Show that it is not possible to find a sufficient statistic for the frequency f_0.

5.8 In a similar fashion to Problem 5.7 consider the estimation of the damping constant r for the data

    x[n] = r^n + w[n]    n = 0, 1, ..., N - 1

where w[n] is WGN with known variance σ². Again show that a sufficient statistic does not exist for r.

5.9 Assume that x[n] is the result of a Bernoulli trial (a coin toss) with

    Pr{x[n] = 1} = θ
    Pr{x[n] = 0} = 1 - θ

and that N IID observations have been made. Assuming the Neyman-Fisher factorization theorem holds for discrete random variables, find a sufficient statistic for θ. Then, assuming completeness, find the MVU estimator of θ.

5.10 For Example 5.4 the following estimator is proposed:

    φ̂ = -arctan(T_2(x)/T_1(x)).

Show that at high SNR, or σ² → 0, for which

    T_1(x) ≈ Σ_{n=0}^{N-1} A cos(2πf_0 n + φ) cos 2πf_0 n
    T_2(x) ≈ Σ_{n=0}^{N-1} A cos(2πf_0 n + φ) sin 2πf_0 n

the estimator satisfies φ̂ ≈ φ. Use any approximations you deem appropriate. Is the proposed estimator the MVU estimator?

5.11 For Example 5.2 it is desired to estimate θ = 2A + 1 instead of A. Find the MVU estimator of θ. What if the parameter θ = A³ is desired? Hint: Reparameterize the PDF in terms of θ and use the standard approach.
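The high-SNR behavior claimed in Problem 5.10 can be explored numerically. A sketch with illustrative values (the frequency is kept away from 0 and 1/2 so the double-frequency terms are small):

```python
# Numerical exploration of Problem 5.10: at high SNR the estimator
# phi_hat = -arctan(T2/T1) approaches the true phase phi. Values illustrative.
import numpy as np

rng = np.random.default_rng(6)
N, A, f0, phi = 100, 1.0, 0.123, 0.7
n = np.arange(N)

def phase_est(sigma):
    x = A * np.cos(2 * np.pi * f0 * n + phi) + sigma * rng.standard_normal(N)
    T1 = (x * np.cos(2 * np.pi * f0 * n)).sum()   # ~ (NA/2) cos(phi) at high SNR
    T2 = (x * np.sin(2 * np.pi * f0 * n)).sum()   # ~ -(NA/2) sin(phi) at high SNR
    return -np.arctan(T2 / T1)

print(phase_est(0.0))    # noise-free: close to phi = 0.7
print(phase_est(0.5))    # noisy: a rougher estimate
```

In the noise-free case the small residual error comes from the neglected double-frequency sums, which are bounded while the dominant terms grow with N.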
5.12 This problem examines the idea that sufficient statistics are unique to within one-to-one transformations. Once again, consider Example 5.2 but with the sufficient statistics

    T_1(x) = Σ_{n=0}^{N-1} x[n]

and T_2(x). Find functions g_1 and g_2 so that the unbiased condition is satisfied or

    E[g_1(T_1(x))] = A
    E[g_2(T_2(x))] = A

and hence the MVU estimator. Is the transformation from T_1 to T_2 one-to-one? What would happen if the statistic T_3(x) = (Σ_{n=0}^{N-1} x[n])² were considered? Could you find a function g_3 to satisfy the unbiased constraint?

5.13 In this problem we derive the MVU estimator for an example that we have not encountered previously. If N IID observations are made according to the PDF

    p(x[n]; θ) = { exp[-(x[n] - θ)],   x[n] > θ
                 { 0,                  x[n] < θ

find the MVU estimator for θ. Note that θ represents the minimum value that x[n] may attain. Assume that the sufficient statistic is complete.

5.14 For the random variable x consider the PDF

    p(x; θ) = exp[A(θ)B(x) + C(x) + D(θ)].

A PDF of this type is said to belong to the scalar exponential family of PDFs. Many useful properties are associated with this family, and in particular, that of sufficient statistics, which is explored in this and the next problem. Show that the following PDFs are within this class:

a. Gaussian
b. Rayleigh (as in Problem 5.2, nonzero for x > 0)
c. Exponential (as in Problem 5.3, nonzero for x > 0), where λ > 0.

5.15 If we observe x[n] for n = 0, 1, ..., N - 1, which are IID and whose PDF belongs to the exponential family of PDFs, show that

    T(x) = Σ_{n=0}^{N-1} B(x[n])

is a sufficient statistic for θ. Next apply this result to determine the sufficient statistic for the Gaussian, Rayleigh, and exponential PDFs in Problem 5.14. Finally, assuming that the sufficient statistics are complete, find the MVU estimator in each case (if possible). (You may want to compare your results to Example 5.2 and Problems 5.2 and 5.3, respectively.)

5.16 In this problem an example is given where there are fewer sufficient statistics than parameters to be estimated [Kendall and Stuart 1979]. Assume that x[n] for n = 0, 1, ..., N - 1 are observed, where x ~ N(μ, C) and

    μ = [Nμ  0  ...  0]ᵀ

and

    C = [ N - 1 + σ²   -1ᵀ ]
        [ -1            I  ].

The covariance matrix C has the dimensions

    [ 1 x 1          1 x (N - 1)       ]
    [ (N - 1) x 1    (N - 1) x (N - 1) ].

Show that with θ = [μ σ²]ᵀ

    p(x; θ) = (1/((2π)^{N/2} σ)) exp{-(1/2) [(N²/σ²)(x̄ - μ)² + Σ_{n=1}^{N-1} x²[n]]}.

What are the sufficient statistics for θ? Hint: You will need to find the inverse and determinant of a partitioned matrix and to use Woodbury's identity.

5.17 Consider a sinusoid of known frequency embedded in WGN or

    x[n] = A cos 2πf_0 n + w[n]    n = 0, 1, ..., N - 1

where w[n] is WGN with variance σ². Find the MVU estimator of

a. the amplitude A, assuming σ² is known
b. the amplitude A and noise variance σ².

You may assume that the sufficient statistics are complete. Hint: The results of Examples 5.9 and 5.11 may prove helpful.

5.18 If x[n] for n = 0, 1, ..., N - 1 are observed, where the samples are IID and distributed according to U[θ_1, θ_2], find a sufficient statistic for θ = [θ_1 θ_2]ᵀ.
5.19 Recall the linear model in Chapter 4 in which the data are modeled as

    x = Hθ + w

where w ~ N(0, σ²I) with σ² known. Find the MVU estimator of θ by using the Neyman-Fisher and RBLS theorems. Assume the sufficient statistic is complete. Compare your result with that described in Chapter 4. Hint: First prove that the following identity holds:

    (x - Hθ)ᵀ(x - Hθ) = (x - Hθ̂)ᵀ(x - Hθ̂) + (θ - θ̂)ᵀHᵀH(θ - θ̂)

where θ̂ = (HᵀH)⁻¹Hᵀx.

Appendix 5A

Proof of Neyman-Fisher Factorization Theorem (Scalar Parameter)

Consider the joint PDF p(x, T(x); θ). In evaluating this PDF it should be noted that T(x) is functionally dependent on x. Hence, the joint PDF must be zero when evaluated at x = x_0, T(x) = T_0, unless T(x_0) = T_0. This is analogous to the situation where we have two random variables x and y with x = y always. The joint PDF p(x, y) is p(x)δ(y - x), a line impulse in the two-dimensional plane, and is a degenerate two-dimensional PDF. In our case we have p(x, T(x) = T_0; θ) = p(x; θ)δ(T(x) - T_0), which we will use in our proof. A second result that we will need is a shorthand way of representing the PDF of a function of several random variables. If y = g(x), for x a vector random variable, the PDF of y can be written as

    p(y; θ) = ∫ p(x; θ) δ(y - g(x)) dx.    (5A.1)

Using properties of the Dirac delta function [Hoskins 1979], this can be shown to be equivalent to the usual formula for transformation of random variables [Papoulis 1965]. If, for example, x is a scalar random variable, then

    δ(y - g(x)) = Σ_{i=1}^{k} δ(x - x_i) / |dg/dx|_{x=x_i}

where {x_1, x_2, ..., x_k} are all the solutions of y = g(x). Substituting this into (5A.1) and evaluating produces the usual formula.

We begin by proving that T(x) is a sufficient statistic when the factorization holds:

    p(x | T(x) = T_0; θ) = p(x, T(x) = T_0; θ) / p(T(x) = T_0; θ)
                         = p(x; θ)δ(T(x) - T_0) / p(T(x) = T_0; θ).    (5A.2)
Using the factorization (5.3),

    p(T(x) = T_0; θ) = ∫ g(T(x), θ) h(x) δ(T(x) - T_0) dx = g(T(x) = T_0, θ) ∫ h(x) δ(T(x) - T_0) dx.

The latter step is possible because the integral is zero except over the surface in R^N for which T(x) = T_0. Over this surface g is a constant. Using this in (5A.2) produces

    p(x | T(x) = T_0; θ) = h(x) δ(T(x) - T_0) / ∫ h(x) δ(T(x) - T_0) dx

which does not depend on θ. Hence, we conclude that T(x) is a sufficient statistic.

Next we prove that if T(x) is a sufficient statistic, then the factorization holds. Consider the joint PDF and choose the functions so that (5A.4) is satisfied. Then, (5A.5) becomes

    p(T(x) = T_0; θ) = g(T(x) = T_0; θ) ∫ h(x) δ(T(x) - T_0) dx.
Appendix 5B

Proof of Rao-Blackwell-Lehmann-Scheffe Theorem (Scalar Parameter)

To prove (1) of Theorem 5.2, that θ̂ is a valid estimator of θ (not a function of θ), note that

    θ̂ = E(θ̌|T(x)) = ∫ θ̌(x) p(x|T(x); θ) dx.    (5B.1)

By the definition of a sufficient statistic, p(x|T(x); θ) does not depend on θ but only on x and T(x). Therefore, after the integration is performed over x, the result will be solely a function of T and therefore of x only.

To prove (2), that θ̂ is unbiased, we use (5B.1) and the fact that θ̂ is a function of T only:

    E(θ̂) = ∫ θ̌(x) p(x; θ) dx = E(θ̌).

But θ̌ is unbiased by assumption, and therefore

    E(θ̂) = E(θ̌) = θ.

To prove (3), that

    var(θ̂) ≤ var(θ̌)

we have

    var(θ̌) = E[(θ̌ - θ̂ + θ̂ - θ)²] = E[(θ̌ - θ̂)²] + 2E[(θ̌ - θ̂)(θ̂ - θ)] + E[(θ̂ - θ)²].

The cross-term in the above expression, E[(θ̌ - θ̂)(θ̂ - θ)], is now shown to be zero. We know that θ̂ is solely a function of T. Thus, the expectation of the cross-term is with respect to the joint PDF of T and θ̌, or

    E[(θ̌ - θ̂)(θ̂ - θ)] = E_T{ E_{θ̌|T}[(θ̌ - θ̂)(θ̂ - θ)] }.

But conditioned on T the factor θ̂ - θ is a constant, and also

    E_{θ̌|T}[θ̌ - θ̂] = E_{θ̌|T}(θ̌|T) - θ̂ = θ̂ - θ̂ = 0.

Thus, we have the desired result

    var(θ̌) = E[(θ̌ - θ̂)²] + var(θ̂) ≥ var(θ̂).

Finally, that θ̂ is the MVU estimator if the sufficient statistic is complete follows from the discussion given in Section 5.5.
Chapter 6

Best Linear Unbiased Estimators

6.1 Introduction
It frequently occurs in practice that the MVU estimator, if it exists, cannot be found.
For instance, we may not know the PDF of the data or even be willing to assume a
model for it. In this case our previous methods, which rely on the CRLB and the
theory of sufficient statistics, cannot be applied. Even if the PDF is known, the latter
approaches are not guaranteed to produce the MVU estimator. Faced with our inability
to determine the optimal MVU estimator, it is reasonable to resort to a suboptimal
estimator. In doing so, we are never sure how much performance we may have lost
(since the minimum variance of the MVU estimator is unknown). However, if the
variance of the suboptimal estimator can be ascertained and if it meets our system
specifications, then its use may be justified as being adequate for the problem at hand.
If its variance is too large, then we will need to look at other suboptimal estimators,
hoping to find one that meets our specifications. A common approach is to restrict the
estimator to be linear in the data and find the linear estimator that is unbiased and
has minimum variance. As we will see, this estimator, which is termed the best linear
unbiased estimator (BLUE), can be determined with knowledge of only the first and
second moments of the PDF. Since complete knowledge of the PDF is not necessary,
the BLUE is frequently more suitable for practical implementation.
6.2 Summary
The BLUE is based on the linear estimator defined in (6.1). If we constrain this
estimator to be unbiased as in (6.2) and to minimize the variance of (6.3), then the
BLUE is given by (6.5). The minimum variance of the BLUE is (6.6). To determine the
BLUE requires knowledge of only the mean and covariance of the data. In the vector
parameter case the BLUE is given by (6.16) and the covariance by (6.17). In either the
scalar parameter or vector parameter case, if the data are Gaussian, then the BLUE is
also the MVU estimator.
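As a numerical preview of the scalar BLUE, consider a DC level in correlated noise of known covariance C. The weights a = C⁻¹1/(1ᵀC⁻¹1) used below are the standard Gauss-Markov solution, assumed here since the derivation comes later in the chapter; the AR(1)-style covariance model is an arbitrary illustrative choice.

```python
# Sketch of a scalar BLUE for a DC level: x = A*1 + w, w ~ N(0, C) with known C.
# Weights a = C^{-1} 1 / (1^T C^{-1} 1) are the standard Gauss-Markov solution
# (assumed here). The covariance model and all values are illustrative.
import numpy as np

rng = np.random.default_rng(4)
N, A, rho = 16, 3.0, 0.8
C = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))

ones = np.ones(N)
Cinv1 = np.linalg.solve(C, ones)
a = Cinv1 / (ones @ Cinv1)                # BLUE weights; they sum to 1 (unbiased)

L = np.linalg.cholesky(C)                 # generate correlated noise w = L z
x = A + rng.standard_normal((50_000, N)) @ L.T

blue = x @ a
mean_est = x.mean(axis=1)                 # ordinary sample mean for comparison
print("BLUE var:", blue.var(), "sample-mean var:", mean_est.var())
print("theory (BLUE):", 1.0 / (ones @ Cinv1))
```

With correlated noise the equal-weight sample mean is no longer the best linear unbiased choice; the BLUE's empirical variance matches 1/(1ᵀC⁻¹1) and is smaller, using only the first two moments of the data as the summary above states.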
6.3 Definition of the BLUE

We observe the data set {x[0], x[1], ..., x[N-1]} whose PDF p(x; θ) depends on an unknown parameter θ. The BLUE restricts the estimator to be linear in the data or

$$\hat{\theta} = \sum_{n=0}^{N-1} a_n x[n] \qquad (6.1)$$

where the a_n's are constants yet to be determined. (See also Problem 6.6 for a similar definition except for the addition of a constant.) Depending on the a_n's chosen, we may generate a large number of different estimators of θ. However, the best estimator or BLUE is defined to be the one that is unbiased and has minimum variance. Before determining the a_n's that yield the BLUE, some comments about the optimality of the BLUE are in order. Since we are restricting the class of estimators to be linear, the BLUE will be optimal (that is to say, the MVU estimator) only when the MVU estimator turns out to be linear. For example, for the problem of estimating the value of a DC level in WGN (see Example 3.3) the MVU estimator is the sample mean

$$\hat{A} = \bar{x} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$$

which is clearly linear in the data. Hence, if we restrict our attention to only linear estimators, then we will lose nothing since the MVU estimator is within this class. Figure 6.1a depicts this idea. On the other hand, for the problem of estimating the mean of uniformly distributed noise (see Example 5.8), the MVU estimator was found to be

$$\hat{\theta} = \frac{N+1}{2N} \max_n x[n]$$

which is nonlinear in the data. If we restrict our estimator to be linear, then the BLUE is the sample mean, as we will see shortly. The BLUE for this problem is suboptimal, as illustrated in Figure 6.1b. As further shown in Example 5.8, the difference in performance is substantial. Unfortunately, without knowledge of the PDF there is no way to determine the loss in performance by resorting to a BLUE.

Figure 6.1 (a) DC level in WGN; BLUE is optimal. (b) Mean of uniform noise; BLUE is suboptimal.

Finally, for some estimation problems the use of a BLUE can be totally inappropriate. Consider the estimation of the power of WGN. It is easily shown that the MVU estimator is (see Example 3.6)

$$\widehat{\sigma^2} = \frac{1}{N}\sum_{n=0}^{N-1} x^2[n]$$

which is nonlinear in the data. If we force the estimator to be linear as per (6.1), so that

$$\widehat{\sigma^2} = \sum_{n=0}^{N-1} a_n x[n],$$

the expected value of the estimator becomes

$$E(\widehat{\sigma^2}) = \sum_{n=0}^{N-1} a_n E(x[n]) = 0$$

since E(x[n]) = 0 for all n. Thus, we cannot even find a single linear estimator that is unbiased, let alone one that has minimum variance. Although the BLUE is unsuitable for this problem, a BLUE utilizing the transformed data y[n] = x²[n] would produce a viable estimator since for

$$\widehat{\sigma^2} = \sum_{n=0}^{N-1} a_n y[n] = \sum_{n=0}^{N-1} a_n x^2[n]$$

the unbiased constraint yields

$$E(\widehat{\sigma^2}) = \sum_{n=0}^{N-1} a_n \sigma^2 = \sigma^2.$$

There are many values of the a_n's that could satisfy this constraint. Can you guess what the a_n's should be to yield the BLUE? (See also Problem 6.5 for an example of this data transformation approach.) Hence, with enough ingenuity the BLUE may still be used if the data are first transformed suitably.
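The data-transformation idea is easy to check numerically. Below is a small sketch (the Gaussian draws are only used to generate WGN samples; the estimators themselves use first and second moments only, and the weights a_n = 1/N are one choice satisfying the unbiased constraint):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 2.0      # true WGN power sigma^2 to be estimated
N = 10            # data record length
M = 100_000       # Monte Carlo trials

x = rng.normal(0.0, np.sqrt(sigma2), size=(M, N))

# Any estimator linear in x[n] has expected value 0, since E(x[n]) = 0,
# so it cannot be unbiased for sigma^2 no matter how the a_n are chosen.
a = rng.uniform(-1.0, 1.0, size=N)      # arbitrary weights a_n
linear_est = x @ a
print(np.mean(linear_est))              # close to 0

# Transformed data y[n] = x[n]^2 with a_n = 1/N satisfies the unbiased
# constraint E(sum a_n y[n]) = sigma^2 (the familiar sample power).
power_est = np.mean(x**2, axis=1)
print(np.mean(power_est))               # close to sigma2
```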
6.4 Finding the BLUE

To determine the BLUE we constrain θ̂ to be linear and unbiased and then find the a_n's to minimize the variance. The unbiased constraint is, from (6.1),

$$E(\hat{\theta}) = \sum_{n=0}^{N-1} a_n E(x[n]) = \theta. \qquad (6.2)$$

The variance of θ̂ is

$$\mathrm{var}(\hat{\theta}) = E\left[\left(\sum_{n=0}^{N-1} a_n x[n] - E\left(\sum_{n=0}^{N-1} a_n x[n]\right)\right)^2\right].$$

But by using (6.2) and letting a = [a_0 a_1 ... a_{N-1}]^T we have

$$\mathrm{var}(\hat{\theta}) = E\left[\left(a^T x - a^T E(x)\right)^2\right] = a^T C a \qquad (6.3)$$

where C is the covariance matrix of the data. For the unbiased constraint (6.2) to hold for every value of θ, E(x[n]) must be linear in θ, or

$$E(x[n]) = s[n]\theta \qquad (6.4)$$

where the s[n]'s are known. The BLUE is therefore found by minimizing the variance of (6.3) subject to the unbiased constraint, which from (6.2) and (6.4) becomes

$$\sum_{n=0}^{N-1} a_n s[n]\theta = \theta$$

or

$$\sum_{n=0}^{N-1} a_n s[n] = 1, \qquad \text{i.e.,} \qquad a^T s = 1$$

where s = [s[0] s[1] ... s[N-1]]^T. The solution to this minimization problem is derived in Appendix 6A as

$$a_{\mathrm{opt}} = \frac{C^{-1}s}{s^T C^{-1} s}$$

so that the BLUE is

$$\hat{\theta} = a_{\mathrm{opt}}^T x = \frac{s^T C^{-1} x}{s^T C^{-1} s} \qquad (6.5)$$

with minimum variance

$$\mathrm{var}(\hat{\theta}) = \frac{1}{s^T C^{-1} s}. \qquad (6.6)$$

Example 6.1 - DC Level in White Noise

For x[n] = A + w[n], where w[n] is white noise with variance σ², we have s[n] = 1, so that from (6.5) and (6.6)

$$\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n] = \bar{x}, \qquad \mathrm{var}(\hat{A}) = \frac{\sigma^2}{N}.$$

Hence, the sample mean is the BLUE independent of the PDF of the data. In addition, as already discussed, it is the MVU estimator for Gaussian noise. ◇

Example 6.2 - DC Level in Uncorrelated Noise

Now let w[n] be zero mean uncorrelated noise with var(w[n]) = σ_n² and repeat Example 6.1. As before, s = 1, and from (6.5) and (6.6)

$$\hat{A} = \frac{\mathbf{1}^T C^{-1} x}{\mathbf{1}^T C^{-1} \mathbf{1}}, \qquad \mathrm{var}(\hat{A}) = \frac{1}{\mathbf{1}^T C^{-1} \mathbf{1}}.$$

The covariance matrix is

$$C = \mathrm{diag}(\sigma_0^2, \sigma_1^2, \ldots, \sigma_{N-1}^2)$$

so that

$$\hat{A} = \frac{\displaystyle\sum_{n=0}^{N-1} \frac{x[n]}{\sigma_n^2}}{\displaystyle\sum_{n=0}^{N-1} \frac{1}{\sigma_n^2}} \qquad (6.7)$$

and

$$\mathrm{var}(\hat{A}) = \frac{1}{\displaystyle\sum_{n=0}^{N-1} \frac{1}{\sigma_n^2}}. \qquad (6.8)$$

The BLUE weights those samples most heavily with smallest variances in an attempt to equalize the noise contribution from each sample. The denominator in (6.7) is the scale factor needed to make the estimator unbiased. See also Problem 6.2 for some further results. ◇

In general, the presence of C⁻¹ in the BLUE acts to prewhiten the data prior to averaging. This was previously encountered in Example 4.4, which discussed estimation of a DC level in colored Gaussian noise. Also, in that example the identical estimator for A was obtained. Because of the Gaussian noise assumption, however, the estimator could be said to be efficient and hence MVU. It is not just coincidental that the MVU estimator for Gaussian noise and the BLUE are identical for this problem. This is a general result which says that for estimation of the parameters of a linear model (see Chapter 4) the BLUE is identical to the MVU estimator for Gaussian noise. We will say more about this in the next section.
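The closed forms (6.7)-(6.8) can be checked against the general expressions (6.5)-(6.6). A brief sketch (the noise variances σ_n² = n + 1 are an arbitrary choice, borrowed from Problem 6.2; Gaussian draws are used only to generate one data realization):

```python
import numpy as np

rng = np.random.default_rng(1)
A = 3.0                                  # true DC level
N = 8
var_n = np.arange(1.0, N + 1.0)          # sigma_n^2 = n + 1
C = np.diag(var_n)                       # uncorrelated noise covariance
s = np.ones(N)                           # E(x[n]) = A * s[n] with s[n] = 1

x = A + rng.normal(0.0, np.sqrt(var_n))  # one data realization

# General scalar BLUE, eqs. (6.5)-(6.6)
Ci = np.linalg.inv(C)
A_blue = (s @ Ci @ x) / (s @ Ci @ s)
var_blue = 1.0 / (s @ Ci @ s)

# Closed form for diagonal C, eqs. (6.7)-(6.8)
A_closed = np.sum(x / var_n) / np.sum(1.0 / var_n)
var_closed = 1.0 / np.sum(1.0 / var_n)

print(A_blue, A_closed)      # identical weighted averages
print(var_blue, var_closed)  # identical minimum variances
```

Note that the BLUE variance is never worse than that of the equal-weight sample mean, which for these variances is sum(var_n)/N².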
6.5 Extension to a Vector Parameter

If the parameter to be estimated is a p × 1 vector parameter, then for the estimator to be linear in the data we require

$$\hat{\theta}_i = \sum_{n=0}^{N-1} a_{in} x[n] \qquad i = 1, 2, \ldots, p \qquad (6.9)$$

or in matrix form

$$\hat{\theta} = A x \qquad (6.10)$$

where A is a p × N matrix of weighting coefficients. For θ̂ to be unbiased we require

$$E(\hat{\theta}) = A E(x) = \theta. \qquad (6.11)$$

To satisfy (6.11), E(x) must be linear in the unknown parameters, or

$$E(x) = H\theta \qquad (6.12)$$

for H a known N × p matrix. In the scalar parameter case (see (6.4))

$$E(x) = \begin{bmatrix} s[0] \\ s[1] \\ \vdots \\ s[N-1] \end{bmatrix}\theta$$

so that (6.12) is truly the vector parameter generalization. Now, substitution of (6.12) into (6.11) produces the unbiased constraint

$$AH = I. \qquad (6.13)$$

If we define a_i = [a_{i0} a_{i1} ... a_{i(N-1)}]^T, so that θ̂_i = a_i^T x, the unbiased constraint may be rewritten for each a_i by noting that

$$a_i^T h_j = \delta_{ij} \qquad j = 1, 2, \ldots, p \qquad (6.14)$$

where h_j is the jth column of H and δ_ij is the Kronecker delta. The variance of θ̂_i is

$$\mathrm{var}(\hat{\theta}_i) = a_i^T C a_i \qquad (6.15)$$

using the results for the scalar case (see (6.3)). The BLUE for a vector parameter is found by minimizing (6.15) subject to the constraints of (6.14), repeating the minimization for each i. In Appendix 6B this minimization is carried out to yield the BLUE as

$$\hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} x \qquad (6.16)$$

and a covariance matrix of

$$C_{\hat{\theta}} = (H^T C^{-1} H)^{-1}. \qquad (6.17)$$

The form of the BLUE is reminiscent of the MVU estimator for Gaussian data. To see why, assume the data have the form x = Hθ + w, where w ~ N(0, C). With this model we are attempting to estimate θ, where E(x) = Hθ. This is exactly the assumption made in (6.12). But from (4.25) the MVU estimator for Gaussian data is

$$\hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} x$$

which is clearly linear in x. Hence, restricting our estimator to be linear does not lead to a suboptimal estimator since the MVU estimator is within the linear class. We may conclude that if the data are truly Gaussian, then the BLUE is also the MVU estimator. To summarize our discussions we now state the general BLUE for a vector parameter of a general linear model. In the present context the general linear model does not assume Gaussian noise. We refer to the data as having the general linear model form. The following theorem is termed the Gauss-Markov theorem.

Theorem 6.1 (Gauss-Markov Theorem) If the data are of the general linear model form

$$x = H\theta + w \qquad (6.18)$$

where H is a known N × p matrix, θ is a p × 1 vector of parameters to be estimated, and w is an N × 1 noise vector with zero mean and covariance C (the PDF of w is otherwise arbitrary), then the BLUE of θ is

$$\hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} x \qquad (6.19)$$

and the minimum variance of θ̂_i is

$$\mathrm{var}(\hat{\theta}_i) = \left[(H^T C^{-1} H)^{-1}\right]_{ii}. \qquad (6.20)$$

In addition, the covariance matrix of θ̂ is

$$C_{\hat{\theta}} = (H^T C^{-1} H)^{-1}. \qquad (6.21)$$

6.6 Signal Processing Example

In the design of many signal processing systems we are given measurements that may not correspond to the "raw data." For instance, in optical interferometry the input to the system is light, and the output is the photon count as measured by a photodetector [Chamberlain 1979]. The system is designed to mix the incoming light signal with a delayed version of itself (through spatially separated mirrors and a nonlinear device).
Figure 6.2 Localization of aircraft by time difference of arrival measurements (t_i = time of received signal, emitted by the aircraft, at antenna i).

Figure 6.3 Source localization geometry (antennas at known positions (x_i, y_i); nominal source position (x_n, y_n)).
The data then is more nearly modeled as an autocorrelation function. The raw data, that is, the light intensity, is unavailable. From the given information it is desired to estimate the spectrum of the incoming light. To assume that the measurements are Gaussian would probably be unrealistic. This is because if a random process is Gaussian, then the autocorrelation data are certainly non-Gaussian [Anderson 1971]. Furthermore, the exact PDF is impossible to find in general, being mathematically intractable. Such a situation lends itself to the use of the BLUE. A second example is explored in some detail next.

The intersection of these lines gives the aircraft position. In the more general situation for which the TDOAs are not zero, the dashed lines become hyperbolas, the intersection again giving the position. It is necessary to have at least three antennas for localization, and when noise is present, it is desirable to have more.

We now examine the localization problem using estimation theory [Lee 1975]. To do so we assume that N antennas have been placed at known locations and that the time of arrival measurements t_i for i = 0, 1, ..., N-1 are available. The problem is to estimate the source position (x_s, y_s) as illustrated in Figure 6.3. The arrival times are assumed to be corrupted by noise with zero mean and known covariance but otherwise unknown PDF. For a signal emitted by the source at time t = T_0, the measurements are modeled by

$$t_i = T_0 + \frac{R_i}{c} + \epsilon_i \qquad i = 0, 1, \ldots, N-1 \qquad (6.22)$$

where the ε_i's are measurement noises and c denotes the propagation speed. The noise samples are assumed to be zero mean with variance σ² and uncorrelated with each other. No assumptions are made about the PDF of the noise. To proceed further we must relate the range from each antenna, R_i, to the unknown position θ = [x_s y_s]^T. Letting the position of the ith antenna be (x_i, y_i), which is assumed to be known, we have that

$$R_i = \sqrt{(x_s - x_i)^2 + (y_s - y_i)^2}.$$

The model is nonlinear in (x_s, y_s). Linearizing R_i about a nominal position (x_n, y_n), with δx_s = x_s - x_n, δy_s = y_s - y_n, and nominal ranges R_{n_i} = √((x_n - x_i)² + (y_n - y_i)²), yields

$$R_i \approx R_{n_i} + \frac{x_n - x_i}{R_{n_i}}\,\delta x_s + \frac{y_n - y_i}{R_{n_i}}\,\delta y_s.$$

See also Chapter 8 for a more complete description of the linearization of a nonlinear model. With this approximation we have upon substitution in (6.22)

$$t_i = T_0 + \frac{R_{n_i}}{c} + \frac{x_n - x_i}{R_{n_i} c}\,\delta x_s + \frac{y_n - y_i}{R_{n_i} c}\,\delta y_s + \epsilon_i$$

which is now linear in the unknown parameters δx_s and δy_s. Alternatively, since, as defined in Figure 6.3,

$$\cos\alpha_i = \frac{x_n - x_i}{R_{n_i}}, \qquad \sin\alpha_i = \frac{y_n - y_i}{R_{n_i}},$$
upon defining τ_i = t_i - R_{n_i}/c we have

$$\tau_i = T_0 + \frac{\cos\alpha_i}{c}\,\delta x_s + \frac{\sin\alpha_i}{c}\,\delta y_s + \epsilon_i \qquad (6.25)$$

where the unknown parameters are T_0, δx_s, δy_s. By assuming the time T_0 when the source emits a signal is unknown, we are bowing to practical considerations. Knowledge of T_0 would require accurate clock synchronization between the source and the receiver, which for economic reasons is avoided. It is customary to consider time difference of arrival or TDOA measurements to eliminate T_0 in (6.25). We generate the TDOAs as

$$\xi_1 = \tau_1 - \tau_0$$
$$\xi_2 = \tau_2 - \tau_1$$
$$\vdots$$
$$\xi_{N-1} = \tau_{N-1} - \tau_{N-2}.$$

Then, from (6.25) we have as our final linear model

$$\xi_i = \frac{1}{c}(\cos\alpha_i - \cos\alpha_{i-1})\,\delta x_s + \frac{1}{c}(\sin\alpha_i - \sin\alpha_{i-1})\,\delta y_s + \epsilon_i - \epsilon_{i-1} \qquad (6.26)$$

for i = 1, 2, ..., N-1. An alternative but equivalent approach is to use (6.25) and estimate T_0 as well as the source position. Now we have reduced the estimation problem to that described by the Gauss-Markov theorem where

$$\theta = [\delta x_s \ \ \delta y_s]^T$$

$$H = \frac{1}{c}\begin{bmatrix} \cos\alpha_1 - \cos\alpha_0 & \sin\alpha_1 - \sin\alpha_0 \\ \cos\alpha_2 - \cos\alpha_1 & \sin\alpha_2 - \sin\alpha_1 \\ \vdots & \vdots \\ \cos\alpha_{N-1} - \cos\alpha_{N-2} & \sin\alpha_{N-1} - \sin\alpha_{N-2} \end{bmatrix}$$

$$w = [\epsilon_1 - \epsilon_0 \ \ \epsilon_2 - \epsilon_1 \ \cdots \ \epsilon_{N-1} - \epsilon_{N-2}]^T.$$

The noise vector may be written as w = Aε, where

$$A = \begin{bmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{bmatrix}$$

has dimension (N-1) × N. Since the covariance matrix of ε is σ²I, we have

$$C = E(A\epsilon\epsilon^T A^T) = \sigma^2 A A^T.$$

From (6.19) the BLUE of the source position parameters is

$$\hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} \xi = \left[H^T (A A^T)^{-1} H\right]^{-1} H^T (A A^T)^{-1} \xi \qquad (6.27)$$

and the minimum variance, is from (6.20),

$$\mathrm{var}(\hat{\theta}_i) = \sigma^2 \left[\left(H^T (A A^T)^{-1} H\right)^{-1}\right]_{ii} \qquad (6.28)$$

or the covariance matrix, is from (6.21),

$$C_{\hat{\theta}} = \sigma^2 \left(H^T (A A^T)^{-1} H\right)^{-1}. \qquad (6.29)$$

As an illustration, consider a symmetric three-antenna array (N = 3) with interelement spacing d, with the source direction parameterized by the angle α. Then

$$H = \frac{1}{c}\begin{bmatrix} -\cos\alpha & 1 - \sin\alpha \\ -\cos\alpha & -(1 - \sin\alpha) \end{bmatrix}, \qquad A = \begin{bmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix}.$$

After substituting in (6.29) we have for the covariance matrix

$$C_{\hat{\theta}} = \sigma^2 c^2 \begin{bmatrix} \dfrac{1}{2\cos^2\alpha} & 0 \\ 0 & \dfrac{3}{2(1-\sin\alpha)^2} \end{bmatrix}.$$

For the best localization we would like α to be small. This is accomplished by making the spacing between antennas d large, so that the baseline of the array, which is the total length, is large. Note also that the localization accuracy is range dependent, with the best accuracy for short ranges or for α small.
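The linearized TDOA estimator (6.27) can be exercised numerically. A sketch with a hypothetical four-antenna layout (the positions, nominal source location, and small offset below are all made-up values); in the noise-free case the BLUE recovers the offset exactly:

```python
import numpy as np

c = 1.0                                   # propagation speed (normalized)
ants = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.5, -1.0]])  # N = 4
nom = np.array([0.2, 3.0])                # nominal source position (x_n, y_n)
true_delta = np.array([0.03, -0.02])      # (dx_s, dy_s) to be estimated

d = nom - ants                            # vectors from antennas to nominal position
Rn = np.linalg.norm(d, axis=1)            # nominal ranges R_{n_i}
cos_a, sin_a = d[:, 0] / Rn, d[:, 1] / Rn # cos(alpha_i), sin(alpha_i)

# H from (6.26): rows (1/c)[cos a_i - cos a_{i-1}, sin a_i - sin a_{i-1}]
H = np.column_stack([np.diff(cos_a), np.diff(sin_a)]) / c
# Differencing matrix A of size (N-1) x N, so that w = A eps and C = sigma^2 A A^T
N = len(ants)
A = -np.eye(N - 1, N) + np.eye(N - 1, N, k=1)

xi = H @ true_delta                       # noise-free TDOAs under the linear model

W = np.linalg.inv(A @ A.T)                # (A A^T)^{-1}; the sigma^2 factors cancel
est = np.linalg.solve(H.T @ W @ H, H.T @ W @ xi)   # BLUE, eq. (6.27)
print(est)                                # recovers true_delta
```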
References

Anderson, T.W., The Statistical Analysis of Time Series, J. Wiley, New York, 1971.
Chamberlain, J., The Principles of Interferometric Spectroscopy, J. Wiley, New York, 1979.
Lee, H.B., "A Novel Procedure for Assessing the Accuracy of Hyperbolic Multilateration Systems," IEEE Trans. Aerosp. Electron. Syst., Vol. 11, pp. 2-15, Jan. 1975.

Problems

6.1 If x[n] = Ar^n + w[n] for n = 0, 1, ..., N-1, where A is an unknown parameter, r is a known constant, and w[n] is zero mean white noise with variance σ², find the BLUE of A and the minimum variance. Does the minimum variance approach zero as N → ∞?

6.2 In Example 6.2 let the noise variances be given by σ_n² = n + 1 and examine what happens to the variance of the BLUE as N → ∞. Repeat for σ_n² = (n + 1)².

6.4 The observed samples {x[0], x[1], ..., x[N-1]} are IID according to the following PDFs:

a. Laplacian

$$p(x[n]; \mu) = \frac{1}{2}\exp\left(-|x[n] - \mu|\right)$$

b. Gaussian

$$p(x[n]; \mu) = \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}(x[n] - \mu)^2\right].$$

Find the BLUE of the mean μ in both cases. What can you say about the MVU estimator for μ?

6.5 The observed samples {x[0], x[1], ..., x[N-1]} are IID according to the lognormal PDF

$$p(x[n]; \theta) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,x[n]}\exp\left[-\dfrac{1}{2}(\ln x[n] - \theta)^2\right] & x[n] > 0 \\ 0 & x[n] < 0. \end{cases}$$

Prove that the mean is exp(θ + 1/2) and therefore that the unbiased constraint cannot be satisfied. Using the transformation of random variables approach with y[n] = ln x[n], find the BLUE for θ.

6.6 In this problem we extend the scalar BLUE results. Assume that E(x[n]) = θs[n] + β, where θ is the unknown parameter to be estimated and β is a known constant. The data vector x has covariance matrix C. We define a modified linear (actually affine) estimator for this problem as

$$\hat{\theta} = \sum_{n=0}^{N-1} a_n x[n] + b.$$
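Problem 6.5's transformation can be sketched numerically: with y[n] = ln x[n] the transformed data are Gaussian with mean θ and unit variance, so the BLUE of θ from the y[n] is their sample mean (the θ value and sizes below are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 0.7
N, M = 50, 20_000

y = rng.normal(theta, 1.0, size=(M, N))   # y[n] = ln x[n] ~ N(theta, 1)
x = np.exp(y)                             # lognormal samples

print(np.mean(x))                # close to exp(theta + 1/2), not theta
theta_blue = np.mean(y, axis=1)  # BLUE from transformed data
print(np.mean(theta_blue))       # close to theta (unbiased)
print(N * np.var(theta_blue))    # close to 1, i.e., variance 1/N
```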
these eigenvectors are orthonormal due to the symmetric nature of C. Hence, an orthogonal representation of the signal is

$$s = \sum_{i=0}^{N-1} a_i v_i$$

where {v_0, v_1, ..., v_{N-1}} are the orthonormal eigenvectors of C. Prove that the minimum variance of the BLUE is

$$\mathrm{var}(\hat{A}) = \frac{1}{\displaystyle\sum_{i=0}^{N-1} \frac{a_i^2}{\lambda_i}}$$

where λ_i is the eigenvalue of C corresponding to the eigenvector v_i. Next show that the signal energy ε = s^T s is given by Σ_{i=0}^{N-1} a_i². Finally, prove that if the energy is constrained to be some value ε_0, then the smallest possible variance of the BLUE is obtained by choosing the signal

$$s = c\,v_{\min}$$

where v_min is the eigenvector of C corresponding to the minimum eigenvalue and c is chosen to satisfy the energy constraint ε_0 = s^T s = c². Assume that the eigenvalues of C are distinct. Explain why this result makes sense.

6.9 In an on-off keyed (OOK) communication system we transmit one of two signals,

$$s_0(t) = 0$$

to represent a binary 0 or

$$s_1(t) = A\cos 2\pi f_1 t$$

to represent a binary 1. Compare the variances for a → 0, a = 1, and a → ∞.

6.10 We continue Problem 6.9 by examining the question of signal selection for an OOK system in the presence of colored noise. Let the noise w[n] be a zero mean WSS random process with an ACF r_ww[k] that is nonzero only for lags k = 0, 1, 2, with r_ww[k] = 0 for k ≥ 3. Find the PSD and plot it for frequencies 0 ≤ f ≤ 1/2. As in the previous problem, find the frequency which yields the smallest value of the BLUE variance for N = 50. Explain your results. Hint: You will need a computer to do this.

6.11 Consider the problem of estimating a DC level in WSS noise or given

$$x[n] = A + w[n] \qquad n = 0, 1, \ldots, N-1$$

where w[n] is a zero mean WSS random process with ACF r_ww[k], estimate A. It is proposed to estimate A by using the output of the FIR filter

$$\mathcal{H}(z) = \sum_{k=0}^{N-1} h[k] z^{-k}$$

shown in Figure 6.5 at time n = N - 1. Note that the estimator is given by

$$\hat{A} = \sum_{k=0}^{N-1} h[k]\,x[N-1-k].$$

Figure 6.5 FIR filter estimator for Problem 6.11 (the data x[n] drive the filter H(z) and the output is sampled at n = N - 1).
Expanding G, we have

$$G = a^T C a - 2a_{\mathrm{opt}}^T C a + a_{\mathrm{opt}}^T C a_{\mathrm{opt}}$$
$$= a^T C a - 2\,\frac{s^T C^{-1} C a}{s^T C^{-1} s} + \frac{1}{s^T C^{-1} s}$$
$$= a^T C a - \frac{2}{s^T C^{-1} s} + \frac{1}{s^T C^{-1} s} = a^T C a - \frac{1}{s^T C^{-1} s}$$

where we have used the constraint equation (s^T C⁻¹ C a = s^T a = 1). Hence,

$$a^T C a = (a - a_{\mathrm{opt}})^T C (a - a_{\mathrm{opt}}) + \frac{1}{s^T C^{-1} s}$$

which is uniquely minimized for a = a_opt. This is because the positive definite property of C makes the first term on the right-hand side greater than zero for all (a - a_opt) ≠ 0. The minimum variance is then

$$\mathrm{var}(\hat{\theta}) = a_{\mathrm{opt}}^T C a_{\mathrm{opt}} = \frac{s^T C^{-1} C C^{-1} s}{(s^T C^{-1} s)^2} = \frac{1}{s^T C^{-1} s}.$$

Appendix 6B

Derivation of Vector BLUE

To derive the BLUE for a vector parameter we need to minimize

$$\mathrm{var}(\hat{\theta}_i) = a_i^T C a_i$$

for i = 1, 2, ..., p subject to the constraints

$$a_i^T h_j = \delta_{ij} \qquad j = 1, 2, \ldots, p.$$

This problem is much the same as that of the scalar BLUE (see Appendix 6A) except for the additional constraints on a_i. Now we have p constraints on a_i instead of just one. Since each a_i is free to assume any value, independently of the others, we actually have p separate minimization problems linked only by the constraints. Proceeding in a similar manner to that of Appendix 6A, we consider the Lagrangian function for a_i:

$$J_i = a_i^T C a_i + \sum_{j=1}^{p} \lambda_j^{(i)} \left(a_i^T h_j - \delta_{ij}\right).$$

The gradient is

$$\frac{\partial J_i}{\partial a_i} = 2C a_i + \sum_{j=1}^{p} \lambda_j^{(i)} h_j.$$

We now let λ_i = [λ_1^{(i)} λ_2^{(i)} ... λ_p^{(i)}]^T and note that H = [h_1 h_2 ... h_p] to yield

$$\frac{\partial J_i}{\partial a_i} = 2C a_i + H\lambda_i.$$

Setting the gradient equal to zero yields

$$a_i = -\frac{1}{2} C^{-1} H \lambda_i. \qquad (6B.1)$$

To find the vector of Lagrangian multipliers we use the constraint equations

$$a_i^T h_j = \delta_{ij} \qquad j = 1, 2, \ldots, p$$

or in combined form

$$H^T a_i = e_i$$

where e_i denotes the vector with one in the ith place and zeros elsewhere. Then, from (6B.1),

$$H^T a_i = -\frac{1}{2} H^T C^{-1} H \lambda_i = e_i$$

so that, assuming invertibility of H^T C⁻¹ H, the Lagrangian multiplier vector is

$$\lambda_i = -2\left(H^T C^{-1} H\right)^{-1} e_i.$$

Substituting into (6B.1) produces a_i = C⁻¹ H (H^T C⁻¹ H)⁻¹ e_i, and stacking the a_i^T as the rows of A yields

$$\hat{\theta} = A x = (H^T C^{-1} H)^{-1} H^T C^{-1} x$$

since [e_1 e_2 ... e_p]^T is the identity matrix. Also, the covariance matrix of θ̂ is

$$C_{\hat{\theta}} = A C A^T = (H^T C^{-1} H)^{-1}$$

with the minimum variances given by the diagonal elements of C_θ̂, or

$$\mathrm{var}(\hat{\theta}_i) = \left[(H^T C^{-1} H)^{-1}\right]_{ii}.$$
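The decomposition at the heart of Appendix 6A, a^T C a = (a − a_opt)^T C (a − a_opt) + 1/(s^T C⁻¹ s), is easy to spot-check numerically: any weight vector on the constraint surface a^T s = 1 does no better than a_opt. (C and s below are random; C is merely forced positive definite.)

```python
import numpy as np

rng = np.random.default_rng(3)
N = 6
B = rng.normal(size=(N, N))
C = B @ B.T + N * np.eye(N)          # a positive definite covariance
s = rng.normal(size=N)

Cis = np.linalg.solve(C, s)          # C^{-1} s
a_opt = Cis / (s @ Cis)              # optimal weights from Appendix 6A
v_min = a_opt @ C @ a_opt            # equals 1 / (s^T C^{-1} s)

for _ in range(1000):
    a = rng.normal(size=N)
    a = a / (a @ s)                  # scale onto the constraint a^T s = 1
    lhs = a @ C @ a
    rhs = (a - a_opt) @ C @ (a - a_opt) + v_min
    assert np.isclose(lhs, rhs)      # the decomposition holds
    assert lhs >= v_min - 1e-9       # a_opt is never beaten

print(v_min, 1.0 / (s @ Cis))        # identical
```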
Chapter 7

Maximum Likelihood Estimation
7.1 Introduction
We now investigate an alternative to the MVU estimator, which is desirable in situations
where the MVU estimator does not exist or cannot be found even if it does exist. This
estimator, which is based on the maximum likelihood principle, is overwhelmingly the
most popular approach to obtaining practical estimators. It has the distinct advantage
of being a "turn-the-crank" procedure, allowing it to be implemented for complicated
estimation problems. Additionally, for most cases of practical interest its performance
is optimal for large enough data records. Specifically, it is approximately the MVU
estimator due to its approximate efficiency. For these reasons almost all practical
estimators are based on the maximum likelihood principle.
7.2 Summary
In Section 7.3 we discuss an estimation problem for which the CRLB cannot be achieved
and hence an efficient estimator does not exist, and for which the sufficient statistic
approach in Chapter 5 to finding the MVU estimator cannot be implemented. An es-
timator is examined, however, that is asymptotically (for large data records) efficient.
This is the maximum likelihood estimator (MLE), defined as the value of θ that maximizes the likelihood function. In general, as stated in Theorem 7.1, the MLE has the
asymptotic properties of being unbiased, achieving the CRLB, and having a Gaussian
PDF. Thus, it can be said to be asymptotically optimal. In Example 7.6 it is also
shown that for signal in noise problems, the MLE achieves the CRLB for high SNRs.
The MLE of θ can be used to find the MLE of a function of θ by invoking the invariance property as described by Theorem 7.2. When a closed form expression cannot
be found for the MLE, a numerical approach employs either a grid search or an itera-
tive maximization of the likelihood function. The iterative approaches, which are used
only when a grid search is not practical, are the Newton-Raphson (7.29) and scoring
(7.33) methods. Convergence to the MLE, however, is not guaranteed. The MLE for
a vector parameter θ is the value maximizing the likelihood function, which is now a
function of the components of θ. The asymptotic properties of the vector parameter MLE are summarized in Theorem 7.3, and the invariance property in Theorem 7.4. For the linear model the MLE achieves the CRLB for finite data records. The resultant estimator is thus the usual MVU estimator as summarized in Theorem 7.5. Numerical determination of the vector parameter MLE is based on the Newton-Raphson (7.48) or scoring (7.50) approach. Another technique is the expectation-maximization (EM) algorithm of (7.56) and (7.57), which is iterative in nature. Finally, for WSS Gaussian random processes the MLE can be found approximately by maximizing the asymptotic log-likelihood function of (7.60). This approach is computationally less intensive than determination of the exact MLE.

7.3 An Example

By resorting to the estimator based on the maximum likelihood principle, termed the maximum likelihood estimator (MLE), we can obtain an estimator that is approximately the MVU estimator. The nature of the approximation relies on the property that the MLE is asymptotically (for large data records) efficient.

Example 7.1 - DC Level in White Gaussian Noise - Modified

Consider the observed data set

$$x[n] = A + w[n] \qquad n = 0, 1, \ldots, N-1$$

where A is an unknown level, which is assumed to be positive (A > 0), and w[n] is WGN with unknown variance A. This problem differs from our usual one (see Example 3.3) in that the unknown parameter A is reflected in the mean and the variance. In searching for the MVU estimator of A we first determine the CRLB (see Chapter 3) to see if it is satisfied with equality. The PDF is

$$p(x; A) = \frac{1}{(2\pi A)^{N/2}} \exp\left[-\frac{1}{2A}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2\right]. \qquad (7.1)$$

Taking the derivative of the log-likelihood function (the logarithm of the PDF considered as a function of A), we have

$$\frac{\partial \ln p(x; A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}\left(x[n] - A\right) + \frac{1}{2A^2}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2.$$

It is certainly not obvious if the derivative of the log-likelihood function can be put in the required form

$$I(A)(\hat{A} - A).$$

It appears from a casual observation that it cannot, and therefore, an efficient estimator does not exist. We can still determine the CRLB for this problem to find that (see Problem 7.1)

$$\mathrm{var}(\hat{A}) \ge \frac{A^2}{N\left(A + \frac{1}{2}\right)}. \qquad (7.2)$$

We next try to find the MVU estimator by resorting to the theory of sufficient statistics (see Chapter 5). Attempting to factor (7.1) into the form of (5.3), we note that

$$\frac{1}{A}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2 = \frac{1}{A}\sum_{n=0}^{N-1}x^2[n] - 2N\bar{x} + NA$$

so that

$$p(x; A) = \underbrace{\frac{1}{(2\pi A)^{N/2}} \exp\left[-\frac{1}{2}\left(\frac{1}{A}\sum_{n=0}^{N-1}x^2[n] + NA\right)\right]}_{g\left(\sum_{n=0}^{N-1}x^2[n],\,A\right)}\ \underbrace{\exp(N\bar{x})}_{h(x)}.$$

Based on the Neyman-Fisher factorization theorem a single sufficient statistic for A is T(x) = Σ_{n=0}^{N-1} x²[n]. The next step is to find a function of the sufficient statistic that produces an unbiased estimator, assuming that T(x) is a complete sufficient statistic. To do so we need to find a function g such that

$$E\left[g\left(\sum_{n=0}^{N-1}x^2[n]\right)\right] = A \qquad \text{for all } A > 0.$$

Since

$$E\left[\sum_{n=0}^{N-1}x^2[n]\right] = N E\left[x^2[n]\right] = N\left[\mathrm{var}(x[n]) + E^2(x[n])\right] = N(A + A^2)$$

it is not obvious how to choose g. We cannot simply scale the sufficient statistic to make it unbiased as we did in Example 5.8. A second approach would be to determine the conditional expectation E(Ǎ | Σ_{n=0}^{N-1} x²[n]), where Ǎ is any unbiased estimator. As an example, if we were to choose the unbiased estimator Ǎ = x[0], then the MVU estimator would take the form

$$\hat{A} = E\left(x[0] \,\Big|\, \sum_{n=0}^{N-1}x^2[n]\right). \qquad (7.3)$$
Unfortunately, the evaluation of the conditional expectation appears to be a formidable task.

We have now exhausted our possible optimal approaches. This is not to say that we could not propose some estimators. One possibility considers A to be the mean, so that

$$\check{A}_1 = \begin{cases} \bar{x} & \text{if } \bar{x} > 0 \\ 0 & \text{if } \bar{x} \le 0 \end{cases}$$

since we know that A > 0. Another estimator considers A to be the variance, to yield

$$\check{A}_2 = \frac{1}{N-1}\sum_{n=0}^{N-1}\left(x[n] - \check{A}_1\right)^2$$

(see also Problem 7.2). However, these estimators cannot be claimed to be optimal in any sense.

Faced with our inability to find the MVU estimator we will propose an estimator that is approximately optimal. We claim that for large data records or as N → ∞, the proposed estimator is efficient. This means that as N → ∞

$$E(\hat{A}) \to A \qquad (7.4)$$
$$\mathrm{var}(\hat{A}) \to \text{CRLB}, \qquad (7.5)$$

the CRLB being given by (7.2). An estimator Â that satisfies (7.4) is said to be asymptotically unbiased. If, in addition, the estimator satisfies (7.5), then it is said to be asymptotically efficient. For finite data records, however, we can say nothing about its optimality (see Problem 7.4). Better estimators may exist, but finding them may not be easy!

Example 7.2 - DC Level in White Gaussian Noise - Modified (continued)

The proposed estimator is

$$\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}, \qquad (7.6)$$

the positive root of Â² + Â = (1/N)Σ x²[n]. Although nothing can be said about its optimality for finite data records, it is nonetheless a reasonable estimator in that as N → ∞, we have by the law of large numbers

$$\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] \to E(x^2[n]) = A + A^2$$

and therefore from (7.6)

$$\hat{A} \to A.$$

The estimator Â is said to be a consistent estimator (see Problems 7.5 and 7.6). To find the mean and variance of Â as N → ∞ we use the statistical linearization argument described in Section 3.6. For this example the PDF of (1/N)Σ_{n=0}^{N-1} x²[n] will be concentrated about its mean, A + A², for large data records. This allows us to linearize the function given in (7.6) that transforms (1/N)Σ_{n=0}^{N-1} x²[n] into Â. Let g be that function, so that

$$\hat{A} = g(u)$$

where u = (1/N)Σ_{n=0}^{N-1} x²[n], and therefore,

$$g(u) = -\frac{1}{2} + \sqrt{u + \frac{1}{4}}.$$

Linearizing about u_0 = E(u) = A + A², we have

$$g(u) \approx g(u_0) + \left.\frac{dg(u)}{du}\right|_{u=u_0}(u - u_0)$$

or

$$\hat{A} \approx A + \frac{1}{1+2A}\left[\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] - (A + A^2)\right]. \qquad (7.7)$$

It now follows that the asymptotic mean is

$$E(\hat{A}) = A$$

so that Â is asymptotically unbiased, while the asymptotic variance is

$$\mathrm{var}(\hat{A}) = \left(\frac{1}{1+2A}\right)^2 \mathrm{var}\left(\frac{1}{N}\sum_{n=0}^{N-1}x^2[n]\right) = \frac{A^2}{N\left(A + \frac{1}{2}\right)}$$

which is the CRLB of (7.2), so that Â is also asymptotically efficient.
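The linearization result can be spot-checked by simulation. A quick Monte Carlo sketch (the parameter values, record length, and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
A, N, M = 1.0, 200, 50_000

x = rng.normal(A, np.sqrt(A), size=(M, N))             # mean A, variance A
A_hat = -0.5 + np.sqrt(np.mean(x**2, axis=1) + 0.25)   # the estimator (7.6)

crlb = A**2 / (N * (A + 0.5))   # asymptotic variance, from (7.2)
print(np.mean(A_hat))           # close to A: asymptotically unbiased
print(np.var(A_hat) / crlb)     # close to 1: asymptotically efficient
```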
7.4 Finding the MLE

The MLE for a scalar parameter is defined to be the value of θ that maximizes p(x; θ) for x fixed, i.e., the value that maximizes the likelihood function. The maximization is performed over the allowable range of θ. In the previous example this was A > 0. Since p(x; θ) will also be a function of x, the maximization produces a θ̂ that is a function of x. The rationale for the MLE hinges on the observation that p(x; θ) dx gives the probability of observing x in a small volume for a given θ. In Figure 7.1 the PDF is evaluated for x = x_0 and then plotted versus θ. The value of p(x = x_0; θ) dx for each θ tells us the probability of observing x in the region in R^N centered around x_0 with volume dx, assuming the given value of θ. If x = x_0 had indeed been observed, then inferring that θ = θ_1 would be unreasonable. Because if θ = θ_1, the probability of actually observing x = x_0 would be small. It is more "likely" that θ = θ_2 is the true value. It yields a high probability of observing x = x_0, the data that were actually observed. Thus, we choose θ̂ = θ_2 as our estimate or the value that maximizes p(x = x_0; θ) over the allowable range of θ. We now continue with our example.

Figure 7.1 Rationale for maximum likelihood estimator (p(x_0; θ) plotted versus θ).

To actually find the MLE for this problem we first write the PDF from (7.1) as

$$p(x; A) = \frac{1}{(2\pi A)^{N/2}} \exp\left[-\frac{1}{2A}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2\right].$$

Considering this as a function of A, it becomes the likelihood function. Differentiating the log-likelihood function, we have

$$\frac{\partial \ln p(x; A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}\left(x[n] - A\right) + \frac{1}{2A^2}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2$$

and setting it equal to zero produces

$$\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^2[n] + \frac{1}{4}}$$

where the positive root has been chosen to correspond to the permissible range of A or A > 0. Note that Â > 0 for all possible values of Σ_{n=0}^{N-1} x²[n]. Finally, that Â indeed maximizes the log-likelihood function is verified by examining the second derivative. ◇

Not only does the maximum likelihood procedure yield an estimator that is asymptotically efficient, it also sometimes yields an efficient estimator for finite data records. This is illustrated in the following example.

Example 7.4 - DC Level in White Gaussian Noise

For the received data

$$x[n] = A + w[n] \qquad n = 0, 1, \ldots, N-1$$

where A is the unknown level to be estimated and w[n] is WGN with known variance σ², the PDF is

$$p(x; A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2\right].$$

Taking the derivative of the log-likelihood function produces

$$\frac{\partial \ln p(x; A)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\right)$$
which being set equal to zero yields the MLE

$$\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1}x[n].$$

But we have already seen that the sample mean is an efficient estimator (see Example 3.3). Hence, the MLE is efficient. ◇

This result is true in general. If an efficient estimator exists, the maximum likelihood procedure will produce it. See Problem 7.12 for an outline of the proof.

7.5 Properties of the MLE

The example discussed in Section 7.3 led to an estimator that for large data records (or asymptotically) was unbiased, achieved the CRLB, and had a Gaussian PDF. In summary, the MLE was distributed as

$$\hat{A} \stackrel{a}{\sim} N\left(A, \frac{A^2}{N\left(A + \frac{1}{2}\right)}\right) \qquad (7.8)$$

where "ᵃ∼" denotes "asymptotically distributed according to." This result is quite general and forms the basis for claiming optimality of the MLE. Of course, in practice it is seldom known in advance how large N must be in order for (7.8) to hold. An analytical expression for the PDF of the MLE is usually impossible to derive. As an alternative means of assessing performance, a computer simulation is usually required, as discussed next.

Example 7.5 - DC Level in White Gaussian Noise - Modified (continued)

A computer simulation was performed to determine how large the data record had to be for the asymptotic results to apply. In principle the exact PDF of Â (see (7.6)) could be found but would be extremely tedious. Using the Monte Carlo method (see Appendix 7A), M = 1000 realizations of Â were generated for various data record lengths. The mean E(Â) and variance var(Â) were estimated by

$$\widehat{E(\hat{A})} = \frac{1}{M}\sum_{i=1}^{M}\hat{A}_i$$

$$\widehat{\mathrm{var}(\hat{A})} = \frac{1}{M}\sum_{i=1}^{M}\left(\hat{A}_i - \widehat{E(\hat{A})}\right)^2.$$

For a value of A equal to 1 the results are shown in Table 7.1 for various data record lengths. Instead of the asymptotic variance or the CRLB of (7.2), we tabulate

$$N\,\mathrm{var}(\hat{A}) = \frac{A^2}{A + \frac{1}{2}}$$

since this is independent of N, allowing us to check the convergence more readily. The theoretical asymptotic values of the mean and normalized variance are

$$E(\hat{A}) = A = 1, \qquad N\,\mathrm{var}(\hat{A}) = \frac{2}{3}.$$

TABLE 7.1 Theoretical Asymptotic and Actual Mean and Variance for Estimator in Example 7.2

Data Record Length, N      Mean, E(Â)       N × Variance, N var(Â)
5                          0.954            0.624
10                         0.976            0.648
15                         0.991            0.696
20                         0.996 (0.987)    0.707 (0.669)
25                         0.994            0.656
Theoretical asymptotic value   1            0.667

It is observed from Table 7.1 that the mean converges at about N = 20 samples, while the variance jumps around somewhat for N ≥ 15 samples. The latter is due to the statistical fluctuations in estimating the variance via a computer simulation, as well as possible inaccuracies in the random number generator. To check this the number of realizations was increased to M = 5000 for a data record length of N = 20. This resulted in the mean and normalized variance shown in parentheses. The normalized variance is now nearly identical to its asymptotic value, whereas for some unknown reason (presumably the random number generator) the mean is off slightly from its asymptotic value. (See also Problem 9.8 for a more accurate formula for E(Â).)

Next, the PDF of Â was determined using a Monte Carlo computer simulation. This was done for data record lengths of N = 5 and N = 20. According to (7.8) the asymptotic PDF is

$$\hat{A} \stackrel{a}{\sim} N\left(A, I^{-1}(A)\right), \qquad (7.9)$$

which for A = 1 becomes, upon using (7.2),

$$\hat{A} \stackrel{a}{\sim} N\left(1, \frac{2}{3N}\right). \qquad (7.10)$$

In Figure 7.2 the theoretical PDF and estimated PDF or histogram (see Appendix 7A) are shown. To construct the histogram we used M = 5000 realizations of Â and divided the horizontal axis into 100 cells or divisions. Note that for N = 5 the estimated PDF is somewhat displaced to the left, in accordance with the mean being too small (see Table 7.1). For N = 20, however, the match is better, although the estimated PDF still appears to be skewed to the left. Presumably for larger data records the asymptotic PDF will more closely match the true one. ◇
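The Monte Carlo procedure of Example 7.5 is compact to reproduce. This sketch re-estimates the N = 20 row of Table 7.1 (the seed is arbitrary, so the digits will differ slightly from the table's):

```python
import numpy as np

rng = np.random.default_rng(7)
A, N, M = 1.0, 20, 5000

x = rng.normal(A, np.sqrt(A), size=(M, N))
A_hat = -0.5 + np.sqrt(np.mean(x**2, axis=1) + 0.25)   # MLE realizations, eq. (7.6)

mean_est = np.mean(A_hat)                        # estimate of E(A_hat)
norm_var = N * np.mean((A_hat - mean_est)**2)    # estimate of N var(A_hat)
print(mean_est, norm_var)                        # compare with 1 and 2/3
```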
166 CHAPTER 7. MAXIMUM LIKELIHOOD ESTIMATION 7.5. PROPERTIES OF THE MLE 167
[Figure 7.2 Theoretical PDF and histogram: (a) N = 5, (b) N = 20]

of nonlinear estimators. For this reason it is important to be able to carry out such a simulation. In Appendix 7A a description of the computer methods used in the previous example is given. Of course, the subject of Monte Carlo computer methods for statistical evaluation warrants a more complete discussion. The interested reader should consult the following references: [Bendat and Piersol 1971, Schwartz and Shaw 1975]. Some practice in performing these simulations is provided by Problems 7.13 and 7.14. We now summarize the asymptotic properties of the MLE in a theorem.

Theorem 7.1 (Asymptotic Properties of the MLE) If the PDF p(x; θ) of the data x satisfies some "regularity" conditions, then the MLE of the unknown parameter θ is asymptotically distributed (for large data records) according to

    θ̂ ~ᵃ N(θ, I⁻¹(θ))    (7.11)

where I(θ) is the Fisher information evaluated at the true value of the unknown parameter.

The regularity conditions require the existence of the derivatives of the log-likelihood function, as well as the Fisher information being nonzero, as described more fully in Appendix 7B. An outline of the proof for IID observations is also given there.

From the asymptotic distribution, the MLE is seen to be asymptotically unbiased and asymptotically attains the CRLB. It is therefore asymptotically efficient, and hence asymptotically optimal. Of course, in practice the key question is always: How large does N have to be for the asymptotic properties to apply? Fortunately, for many cases of interest the data record lengths are not excessive, as illustrated by Example 7.5. Another example follows.

Example 7.6 — MLE of the Sinusoidal Phase

We now reconsider the problem in Example 3.4 in which we wish to estimate the phase φ of a sinusoid embedded in noise or

    x[n] = A cos(2πf₀n + φ) + w[n]    n = 0, 1, ..., N − 1

where w[n] is WGN with variance σ² and the amplitude A and frequency f₀ are assumed to be known. We saw in Chapter 5 that no single sufficient statistic exists for this problem. The sufficient statistics were

    T₁(x) = Σ_{n=0}^{N−1} x[n] cos 2πf₀n

    T₂(x) = Σ_{n=0}^{N−1} x[n] sin 2πf₀n.    (7.12)

The MLE of φ is found by maximizing the likelihood function or, equivalently, by minimizing

    J(φ) = Σ_{n=0}^{N−1} (x[n] − A cos(2πf₀n + φ))².    (7.13)
Differentiating with respect to φ produces

    ∂J(φ)/∂φ = 2 Σ_{n=0}^{N−1} (x[n] − A cos(2πf₀n + φ)) A sin(2πf₀n + φ)

and setting it equal to zero yields

    Σ_{n=0}^{N−1} x[n] sin(2πf₀n + φ̂) = A Σ_{n=0}^{N−1} sin(2πf₀n + φ̂) cos(2πf₀n + φ̂).    (7.14)

But the right-hand side may be approximated since (see Problem 3.7)

    (1/N) Σ_{n=0}^{N−1} sin(2πf₀n + φ̂) cos(2πf₀n + φ̂) = (1/2N) Σ_{n=0}^{N−1} sin(4πf₀n + 2φ̂) ≈ 0    (7.15)

for f₀ not near 0 or 1/2. Thus, the left-hand side of (7.14), when divided by N and set equal to zero, will produce an approximate MLE, which satisfies

    Σ_{n=0}^{N−1} x[n] sin(2πf₀n + φ̂) = 0.    (7.16)

Upon expanding this we have

    Σ_{n=0}^{N−1} x[n] sin 2πf₀n cos φ̂ = − Σ_{n=0}^{N−1} x[n] cos 2πf₀n sin φ̂

or finally the MLE of phase is given approximately as

    φ̂ = −arctan [ Σ_{n=0}^{N−1} x[n] sin 2πf₀n / Σ_{n=0}^{N−1} x[n] cos 2πf₀n ].    (7.17)

It is interesting to note that the MLE is a function of the sufficient statistics. In hindsight, this should not be surprising if we keep in mind the Neyman-Fisher factorization theorem: in this example the two sufficient statistics T₁(x) and T₂(x) effect the required factorization.

From Example 3.4

    I(φ) = NA²/(2σ²)

so that the asymptotic variance is

    var(φ̂) = 2σ²/(NA²) = 1/(Nη)    (7.19)

where η = (A²/2)/σ² is the SNR. To determine the data record length for the asymptotic mean and variance to apply we performed a computer simulation using A = 1, f₀ = 0.08, φ = π/4, and σ² = 0.05. The results are listed in Table 7.2.

TABLE 7.2 Theoretical Asymptotic and Actual Mean and Variance for Estimator in Example 7.6

    Data Record Length, N | Mean, E(φ̂) | N × Variance, N var(φ̂)
    20 | 0.732 | 0.0978
    40 | 0.746 | 0.108
    60 | 0.774 | 0.110
    80 | 0.789 | 0.0990
    Theoretical asymptotic values: E(φ̂) = 0.785, 1/η = 0.1

It is seen that the asymptotic mean and normalized variance are attained for N = 80. For shorter data records the estimator is considerably biased. Part of the bias is due to the assumption made in (7.15). The MLE given by (7.17) is actually valid only for large N. To find the exact MLE we would have to minimize J as given by (7.13) by evaluating it for all φ. This could be done by using a grid search to find the minimum. Also, observe from the table that the variance for N = 20 is below the CRLB. This is possible due to the bias of the estimator, thereby invalidating the CRLB, which assumes an unbiased estimator.

Next we fixed the data record length at N = 80 and varied the SNR. Plots of the mean and variance versus the SNR, as well as the asymptotic values, are shown in Figures 7.3 and 7.4. As shown in Figure 7.3, the estimator attains the asymptotic mean above about −10 dB. In Figure 7.4 we have plotted 10 log₁₀ var(φ̂). This has the desirable effect of causing the CRLB to be a straight line when plotted versus the SNR in dB. In particular, from (7.19) the asymptotic variance or CRLB satisfies

    10 log₁₀ var(φ̂) = −10 log₁₀ N − 10 log₁₀ η

which is linear in the SNR expressed in dB.
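The simulation behind Table 7.2 can be sketched as follows (not the book's code; it uses NumPy and the approximate MLE (7.17) with the book's parameters A = 1, f₀ = 0.08, φ = π/4, σ² = 0.05, N = 80).

```python
import numpy as np

rng = np.random.default_rng(1)
A, f0, phi, var_w, N, M = 1.0, 0.08, np.pi / 4, 0.05, 80, 2000

n = np.arange(N)
s = A * np.cos(2 * np.pi * f0 * n + phi)
x = s + rng.normal(0.0, np.sqrt(var_w), size=(M, N))  # M Monte Carlo records

# Approximate MLE (7.17): phi_hat = -arctan( sum x sin / sum x cos )
num = x @ np.sin(2 * np.pi * f0 * n)
den = x @ np.cos(2 * np.pi * f0 * n)
phi_hat = -np.arctan(num / den)

eta = (A**2 / 2) / var_w     # SNR = 10, so asymptotic N*var = 1/eta = 0.1
mean_est = phi_hat.mean()
nvar_est = N * phi_hat.var()
```

At N = 80 the sample mean should sit near π/4 ≈ 0.785 and N·var near 1/η = 0.1, as in the last row of Table 7.2.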
[Figure 7.3 Actual vs. asymptotic mean for phase estimator]

[Figure 7.4 Actual vs. asymptotic variance for phase estimator]

[Figure 7.5 Typical realizations of log-likelihood function for phase: (a) High SNR (10 dB), (b) Low SNR (−15 dB)]
To understand why this occurs we plot typical realizations of the log-likelihood function for different SNRs. As seen in Figure 7.5, for a high SNR the maximum is relatively stable from realization to realization, and hence the MLE exhibits a low variance. For lower SNRs, however, the effect of the increased noise is to cause other peaks to occur. Occasionally these peaks are larger than the peak near the true value, causing a large estimation error and ultimately a larger variance. These large error estimates are said to be outliers and cause the threshold effect seen in Figures 7.3 and 7.4. Nonlinear estimators nearly always exhibit this effect. ◇

In summary, the asymptotic PDF of the MLE is valid for large enough data records. For signal in noise problems the CRLB may be attained even for short data records if the SNR is high enough. To see why this is so, the phase estimator can be written from (7.17) as

    φ̂ = −arctan [ Σ_{n=0}^{N−1} (A cos(2πf₀n + φ) + w[n]) sin 2πf₀n / Σ_{n=0}^{N−1} (A cos(2πf₀n + φ) + w[n]) cos 2πf₀n ]

where we have used the same type of approximation as in (7.15) and some standard trigonometric identities. Simplifying, we have

    φ̂ ≈ arctan [ (sin φ − (2/NA) Σ_{n=0}^{N−1} w[n] sin 2πf₀n) / (cos φ + (2/NA) Σ_{n=0}^{N−1} w[n] cos 2πf₀n) ].    (7.20)

If the data record is large and/or the sinusoidal power is large, the noise terms will be small. It is this condition, that the estimation error is small, that allows the MLE to attain its asymptotic distribution. See also Problem 7.15 for a further discussion of this point.

In some cases the asymptotic distribution does not hold, no matter how large the data record and/or the SNR becomes. This tends to occur when the estimation error cannot be reduced due to a lack of averaging in the estimator. An example follows.

Example 7.7 — DC Level in Nonindependent Non-Gaussian Noise

Here the noise samples are completely dependent, so that no averaging over the data record is possible. To find the MLE of A based on only x[0], we first note that the PDF of x[0] is a shifted version of p(w[n]), where the shift is A, as shown in Figure 7.6b. This is because p_{x[0]}(x[0]; A) = p_{w[0]}(x[0] − A). The MLE of A is the value that maximizes p_{w[0]}(x[0] − A), which, because the PDF of w[0] has a maximum at w[0] = 0, becomes

    Â = x[0].

[Figure 7.6 Non-Gaussian PDF for Example 7.7: (a) p(w[n]), (b) p(x[n]; A)]

This estimator has the mean

    E(Â) = E(x[0]) = A

since the noise PDF is symmetric about w[0] = 0. The variance of Â is the same as the variance of x[0] or of w[0]. Hence,

    var(Â) = var(w[0]).

The CRLB, on the other hand, depends on the form of p(w[0]), and the two are not in general equal (see Problem 7.16). In this example, then, the estimation error does not decrease as the data record length increases but remains the same. Furthermore, the PDF of Â = x[0] as shown in Figure 7.6b is a shifted version of p(w[n]), clearly not Gaussian. Finally, Â is not even consistent as N → ∞. ◇

7.6 MLE for Transformed Parameters

In many instances we wish to estimate a function of θ, the parameter characterizing the PDF. For example, we may not be interested in the value of a DC level A in WGN but only in the power A². In such a situation the MLE of A² is easily found from the MLE of A. Some examples illustrate how this is done.

Example 7.8 — Transformed DC Level in WGN
Consider the data

    x[n] = A + w[n]    n = 0, 1, ..., N − 1

where w[n] is WGN with variance σ², and suppose we wish to estimate the transformed parameter α = exp(A). The MLE of α is readily shown to be α̂ = exp(x̄). But x̄ is just the MLE of A, so that

    α̂ = exp(Â).

The MLE of the transformed parameter is found by substituting the MLE of the original parameter into the transformation. This property of the MLE is termed the invariance property. ◇

Now consider instead the transformation α = A², so that A = ±√α and the transformation is not one-to-one. If we choose A = √α, then some of the possible PDFs of (7.21) will be missing. We actually require two sets of PDFs,

    p_{T₁}(x; α) = (1/(2πσ²)^{N/2}) exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − √α)²]    α ≥ 0

    p_{T₂}(x; α) = (1/(2πσ²)^{N/2}) exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] + √α)²]    α > 0    (7.23)

to characterize all possible PDFs. It is possible to find the MLE of α as the value of α that yields the maximum of p_{T₁}(x; α) and p_{T₂}(x; α) or

    α̂ = arg max_α {p_{T₁}(x; α), p_{T₂}(x; α)}.    (7.24)

Alternatively, we can find the maximum in two steps using the modified likelihood function

    p̄_T(x; α) = max{p_{T₁}(x; α), p_{T₂}(x; α)}.

1. For a given value of α, say α₀, determine whether p_{T₁}(x; α₀) or p_{T₂}(x; α₀) is larger. If, for example, p_{T₁}(x; α₀) > p_{T₂}(x; α₀), then denote the value of p_{T₁}(x; α₀) as p̄_T(x; α₀). Repeat for all α > 0 to form p̄_T(x; α). (Note that p̄_T(x; α = 0) = p(x; A = 0).)

2. The MLE is given as the α that maximizes p̄_T(x; α) over α ≥ 0.

The procedure is illustrated in Figure 7.7. The function p̄_T(x; α) can be thought of as a modified likelihood function, having been derived from the original likelihood function by retaining, for each α, the value of A that yields the maximum likelihood. In this example, for each α the possible values of A are ±√α. Now, from (7.24) the MLE α̂ is

    α̂ = arg max_{α ≥ 0} {p(x; √α), p(x; −√α)} = (arg max_{−∞ < A < ∞} p(x; A))² = Â²

so that again the invariance property holds. The understanding is that α̂ maximizes the modified likelihood function p̄_T(x; α), since the standard likelihood function with α as a parameter cannot be defined. More generally, for a transformation α = g(θ) that is not one-to-one, the modified likelihood function is defined as

    p̄_T(x; α) = max_{θ: α = g(θ)} p(x; θ).

[Figure 7.7 Construction of the modified likelihood function p̄_T(x; α)]

[Figure 7.8 Grid search for the MLE]
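The invariance property can be checked numerically. The sketch below (not from the text) takes the DC level in WGN with α = A²: it compares α̂ obtained by transforming the MLE of A against a direct grid maximization of the modified likelihood p̄_T(x; α) = max{p(x; √α), p(x; −√α)}; the grid range and spacing are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
A_true, var_w, N = 1.3, 1.0, 200
x = rng.normal(A_true, np.sqrt(var_w), N)

A_mle = x.mean()            # MLE of A for a DC level in WGN
alpha_inv = A_mle**2        # invariance property: MLE of alpha = A^2

def ll(A):
    # log-likelihood of A (constant terms dropped); A is an array of candidates
    return -((x[None, :] - A[:, None])**2).sum(axis=1) / (2 * var_w)

# Direct route: maximize pT_bar(x; alpha) = max{ p(x; +sqrt(a)), p(x; -sqrt(a)) }
alphas = np.linspace(0.0, 4.0, 4001)
pT_bar = np.maximum(ll(np.sqrt(alphas)), ll(-np.sqrt(alphas)))
alpha_direct = alphas[np.argmax(pT_bar)]
```

The two routes agree to within the grid spacing, which is the content of the invariance property for this non-invertible transformation.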
requiring the maximization of a random function. Nevertheless, these methods can at
times produce good results. We now describe some of the more common ones. The
interested reader is referred to [Bard 1974] for a more complete description of methods
for nonlinear optimization as applied to estimation problems.
As a means of comparison, we will apply the methods to the following example.
Example 7.11 — Exponential Signal in WGN

Consider the data x[n] = rⁿ + w[n] for n = 0, 1, ..., N − 1, where w[n] is WGN with variance σ² and r is to be estimated. The MLE of r is the value that maximizes the likelihood or, equivalently, the value that minimizes

    J(r) = Σ_{n=0}^{N−1} (x[n] − rⁿ)².

Differentiating produces

    Σ_{n=0}^{N−1} (x[n] − rⁿ) n r^{n−1} = 0.

This is a nonlinear equation in r and cannot be solved directly. We will now consider the iterative methods of Newton-Raphson and scoring. ◇

The iterative methods attempt to maximize the log-likelihood function by finding a zero of the derivative function. To do so the derivative is taken and set equal to zero, yielding

    ∂ ln p(x; θ)/∂θ = 0.    (7.26)

Then, the methods attempt to solve this equation iteratively. Let

    g(θ) = ∂ ln p(x; θ)/∂θ

and assume that we have an initial guess for the solution to (7.26). Call this guess θ₀. Then, if g(θ) is approximately linear near θ₀, we can approximate it by

    g(θ) ≈ g(θ₀) + (dg(θ)/dθ)|_{θ=θ₀} (θ − θ₀).    (7.27)

Setting this approximation equal to zero and solving for θ yields the next guess,

    θ₁ = θ₀ − g(θ₀) / (dg(θ)/dθ)|_{θ=θ₀}.

Again we linearize g but use the new guess, θ₁, as our point of linearization and repeat the previous procedure to find the new zero. As shown in Figure 7.9, the sequence of guesses will converge to the true zero of g(θ). In general, the Newton-Raphson iteration finds the new guess, θ_{k+1}, based on the previous one, θ_k, using

    θ_{k+1} = θ_k − g(θ_k) / (dg(θ)/dθ)|_{θ=θ_k}.    (7.28)

Note that at convergence θ_{k+1} = θ_k, and from (7.28) g(θ_k) = 0, as desired. Since g(θ) is the derivative of the log-likelihood function, we find the MLE as

    θ_{k+1} = θ_k − [∂² ln p(x; θ)/∂θ²]⁻¹ (∂ ln p(x; θ)/∂θ) |_{θ=θ_k}.    (7.29)

Several points need to be raised concerning the Newton-Raphson iterative procedure.

1. The iteration may not converge. This will be particularly evident when the second derivative of the log-likelihood function is small. In this case it is seen from (7.29) that the correction term may fluctuate wildly from iteration to iteration.

2. Even if the iteration converges, the point found may not be the global maximum but possibly only a local maximum or even a local minimum. Hence, to avoid these possibilities it is best to use several starting points and at convergence choose the one that yields the maximum. Generally, if the initial point is close to the global maximum, the iteration will converge to it, which underscores the importance of a good initial guess.
Applying the Newton-Raphson iteration to the problem in Example 7.11, we have

    ∂ ln p(x; r)/∂r = (1/σ²) Σ_{n=0}^{N−1} (x[n] − rⁿ) n r^{n−1}

    ∂² ln p(x; r)/∂r² = (1/σ²) [ Σ_{n=0}^{N−1} n(n−1) x[n] r^{n−2} − Σ_{n=0}^{N−1} n(2n−1) r^{2n−2} ].

[TABLE 7.3 Sequence of Iterates for Newton-Raphson Method (initial guesses r₀ = 0.8, 0.2, 1.2)]

As an example of the iterative approach, we implement the Newton-Raphson method for Example 7.11 using a computer simulation. Using N = 50, r = 0.5, and σ² = 0.01, we generated a realization of the process. For a particular outcome −J(r) is plotted in Figure 7.10 for 0 < r < 1. The peak of the function, which is the MLE, occurs at r = 0.493. It is seen that the function is fairly broad for r < 0.5 but rather sharp for larger values of r. In fact, for r > 1 it was difficult to plot due to the exponential signal rⁿ, which causes the function to become very large and negative. We applied the Newton-Raphson method as given by (7.31) using several initial guesses. The resulting iterates are listed in Table 7.3. For r₀ = 0.8 or r₀ = 0.2 the iteration quickly converged to the true maximum. However, for r₀ = 1.2 the convergence was much slower, although the true maximum was attained after 29 iterations. If the initial guess was less than about 0.18 or greater than about 1.2, the succeeding iterates exceeded 1 and kept increasing, so that the iteration diverged.

A second common iterative procedure is the method of scoring. It recognizes that the second derivative of the log-likelihood function may be replaced by its expected value, the negative of the Fisher information I(θ). Indeed, for IID samples we have

    (1/N) ∂² ln p(x; θ)/∂θ² → E[∂² ln p(x[n]; θ)/∂θ²] = −i(θ)

by the law of large numbers, so that ∂² ln p(x; θ)/∂θ² ≈ −N i(θ) = −I(θ). Presumably, replacing the second derivative by its expected value will increase the stability of the iteration. The method then becomes

    θ_{k+1} = θ_k + I⁻¹(θ_k) (∂ ln p(x; θ)/∂θ)|_{θ=θ_k}.    (7.33)

This approach is termed the method of scoring. It too suffers from the same convergence problems as the Newton-Raphson iteration. As applied to the problem in Example 7.11, the method of scoring uses the Fisher information I(r) = (1/σ²) Σ_{n=0}^{N−1} n² r^{2n−2} in place of the second derivative.
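The Newton-Raphson iteration for Example 7.11 is short to code. The sketch below (not the book's program; seed and data realization are arbitrary) uses N = 50, r = 0.5, σ² = 0.01, and the initial guess r₀ = 0.8 from Table 7.3; the common 1/σ² factor cancels in the ratio g/g′ and is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
N, r_true, var_w = 50, 0.5, 0.01
n = np.arange(N)
x = r_true**n + rng.normal(0.0, np.sqrt(var_w), N)

def g(r):
    # derivative of the log-likelihood (1/var_w factor omitted)
    return np.sum((x - r**n) * n * r**(n - 1))

def gprime(r):
    # second derivative of the log-likelihood (1/var_w factor omitted)
    return (np.sum(n * (n - 1) * x * r**(n - 2))
            - np.sum(n * (2 * n - 1) * r**(2 * n - 2)))

r = 0.8                              # initial guess (in the convergent region)
for _ in range(100):
    step = g(r) / gprime(r)          # Newton-Raphson correction (7.28)
    r -= step
    if abs(step) < 1e-10:
        break
```

At convergence g(r) ≈ 0 and r is close to the true value 0.5, mirroring the behavior reported for r₀ = 0.8 in Table 7.3.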
[Figure 7.10 Plot of −J(r) = −Σ_{n=0}^{N−1} (x[n] − rⁿ)² versus r]

If multiple solutions exist, then the one that maximizes the likelihood function is the MLE. We now find the MLE for the problem in Example 3.6.

Example 7.12 — DC Level in WGN

Consider the data

    x[n] = A + w[n]    n = 0, 1, ..., N − 1

where w[n] is WGN with variance σ² and the vector parameter θ = [A σ²]ᵀ is to be estimated. Then, from Example 3.6, the MLE is

    θ̂ = [ x̄, (1/N) Σ_{n=0}^{N−1} (x[n] − x̄)² ]ᵀ    (7.37)

where x̄ = (1/N) Σ_{n=0}^{N−1} x[n]. Note that x̄ ~ N(A, σ²/N) for any N.
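A quick Monte Carlo check of the MLE (7.37) and its asymptotic moments is sketched below (not from the text; the parameter values A = 1, σ² = 2, N = 100 are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(4)
A, var_w, N, M = 1.0, 2.0, 100, 5000
x = rng.normal(A, np.sqrt(var_w), size=(M, N))      # M independent records

A_hat = x.mean(axis=1)                               # MLE of A
var_hat = ((x - A_hat[:, None])**2).mean(axis=1)     # MLE of sigma^2

mean_A, mean_v = A_hat.mean(), var_hat.mean()
var_A, var_v = A_hat.var(), var_hat.var()
```

The sample moments should match E(Â) = A, E(σ̂²) = (N−1)σ²/N, var(Â) = σ²/N, and var(σ̂²) = 2(N−1)σ⁴/N², in agreement with (7.38).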
and that the two statistics are independent. Since T = N σ̂²/σ² ~ χ²_{N−1}, and for large N it is well known from the central limit theorem that

    χ²_N ~ᵃ N(N, 2N)

we have

    T ~ᵃ N(N − 1, 2(N − 1))

or

    σ̂² = (1/N) Σ_{n=0}^{N−1} (x[n] − x̄)² ~ᵃ N( (N−1)σ²/N, 2(N−1)σ⁴/N² ).    (7.38)

The asymptotic PDF of θ̂ is jointly Gaussian since the statistics are individually Gaussian and independent. Furthermore, from (7.37) and (7.38) for large N

    E(θ̂) = [ A, (N−1)σ²/N ]ᵀ → [ A, σ² ]ᵀ = θ

    C(θ̂) = diag( σ²/N, 2(N−1)σ⁴/N² ) → diag( σ²/N, 2σ⁴/N ) = I⁻¹(θ)

where the inverse of the Fisher information matrix has previously been given in Example 3.6. Hence, the asymptotic distribution is described by (7.36). ◇

In some instances, the asymptotic PDF of the MLE is not given by (7.36). These situations generally occur when the number of parameters to be estimated is too large relative to the number of data samples available. As in Example 7.7, this situation restricts the averaging in the estimator.

Example 7.13 — Signal in Non-Gaussian Noise

Consider the data

    x[n] = s[n] + w[n]    n = 0, 1, ..., N − 1

where w[n] is zero mean IID noise with the Laplacian PDF

    p(w[n]) = (1/2) exp[ −|w[n]| ].

The signal samples {s[0], s[1], ..., s[N−1]} are to be estimated. The PDF of the data is

    p(x; θ) = Π_{n=0}^{N−1} (1/2) exp[ −|x[n] − s[n]| ]    (7.39)

where θ = [s[0] s[1] ... s[N−1]]ᵀ. The MLE of θ is easily seen to be

    ŝ[n] = x[n]    n = 0, 1, ..., N − 1

or, equivalently,

    θ̂ = x.

It is clear that the MLE is not Gaussian even as N → ∞. The PDF of θ̂ is given by (7.39) with x replaced by θ̂. The difficulty is that no averaging is possible since we have chosen to estimate as many parameters as data points. Therefore, the central limit theorem, which accounts for the asymptotic Gaussian PDF of the MLE (see Appendix 7B), does not apply. ◇

As in the scalar case, the invariance property holds, as summarized by the following theorem.

Theorem 7.4 (Invariance Property of MLE (Vector Parameter)) The MLE of the parameter α = g(θ), where g is an r-dimensional function of the p × 1 parameter θ, and the PDF p(x; θ) is parameterized by θ, is given by

    α̂ = g(θ̂)

for θ̂, the MLE of θ. If g is not an invertible function, then α̂ maximizes the modified likelihood function p̄_T(x; α), defined as

    p̄_T(x; α) = max_{θ: α = g(θ)} p(x; θ).

An example of the use of this theorem can be found in Problem 7.21.

In Chapter 3 we discussed the computation of the CRLB for the general Gaussian case in which the observed data vector had the PDF

    x ~ N(μ(θ), C(θ)).

In Appendix 3C (see (3C.5)) it was shown that the partial derivatives of the log-likelihood function were

    ∂ ln p(x; θ)/∂θ_k = −(1/2) tr[ C⁻¹(θ) ∂C(θ)/∂θ_k ] + (∂μ(θ)/∂θ_k)ᵀ C⁻¹(θ)(x − μ(θ)) − (1/2)(x − μ(θ))ᵀ (∂C⁻¹(θ)/∂θ_k)(x − μ(θ))    (7.40)

for k = 1, 2, ..., p. By setting (7.40) equal to zero we can obtain necessary conditions for the MLE and on occasion, if solvable, the exact MLE. A somewhat more convenient form uses (3C.2) in Appendix 3C to replace ∂C⁻¹(θ)/∂θ_k in (7.40). An important example of the latter is the linear model described in detail in Chapter 4. Recall that the general linear data model is

    x = Hθ + w    (7.41)

where H is a known N × p matrix and w is a noise vector of dimension N × 1 with PDF N(0, C). Under these conditions the PDF is

    p(x; θ) = (1/((2π)^{N/2} det^{1/2}(C))) exp[ −(1/2)(x − Hθ)ᵀ C⁻¹ (x − Hθ) ]
so that the MLE of θ is found by minimizing

    J(θ) = (x − Hθ)ᵀ C⁻¹ (x − Hθ).    (7.42)

Since this is a quadratic function of the elements of θ and C⁻¹ is a positive definite matrix, differentiation will produce the global minimum. Now, using (7.40) and noting that

    μ(θ) = Hθ

and that the covariance matrix does not depend on θ, we have

    ∂ ln p(x; θ)/∂θ_k = (∂(Hθ)ᵀ/∂θ_k) C⁻¹ (x − Hθ).

Combining the partial derivatives to form the gradient, we have

    ∂ ln p(x; θ)/∂θ = (∂(Hθ)ᵀ/∂θ) C⁻¹ (x − Hθ) = Hᵀ C⁻¹ (x − Hθ)

so that upon setting the gradient equal to zero we have

    Hᵀ C⁻¹ (x − Hθ̂) = 0.

Solving for θ̂ produces the MLE

    θ̂ = (Hᵀ C⁻¹ H)⁻¹ Hᵀ C⁻¹ x.    (7.43)

But this estimator was shown in Chapter 4 to be the MVU estimator as well as an efficient estimator. As such, θ̂ is unbiased and has a covariance of

    C_θ̂ = (Hᵀ C⁻¹ H)⁻¹.    (7.44)

Finally, the PDF of the MLE is Gaussian (being a linear function of x), so that the asymptotic properties of Theorem 7.3 are attained even for finite data records. For the linear model it can be said that the MLE is optimal. These results are summarized by the following theorem.

Theorem 7.5 (Optimality of the MLE for the Linear Model) If the observed data x are described by the general linear model

    x = Hθ + w    (7.45)

where H is a known N × p matrix with N > p and of rank p, θ is a p × 1 parameter vector to be estimated, and w is a noise vector with PDF N(0, C), then the MLE of θ is

    θ̂ = (Hᵀ C⁻¹ H)⁻¹ Hᵀ C⁻¹ x.    (7.46)

θ̂ is also an efficient estimator in that it attains the CRLB and hence is the MVU estimator. The PDF of θ̂ is

    θ̂ ~ N(θ, (Hᵀ C⁻¹ H)⁻¹).    (7.47)

Many examples are given in Chapter 4. The preceding result can be generalized to assert that if an efficient estimator exists, it is given by the MLE (see Problem 7.12).

In finding the MLE it is quite common to have to resort to numerical techniques of maximization. The Newton-Raphson and scoring methods were described in Section 7.7. They are easily extended to the vector parameter case. The Newton-Raphson iteration becomes

    θ_{k+1} = θ_k − [ ∂² ln p(x; θ)/∂θ ∂θᵀ ]⁻¹ (∂ ln p(x; θ)/∂θ) |_{θ=θ_k}    (7.48)

where

    [ ∂² ln p(x; θ)/∂θ ∂θᵀ ]_{ij} = ∂² ln p(x; θ)/∂θ_i ∂θ_j    i = 1, 2, ..., p;  j = 1, 2, ..., p

is the Hessian of the log-likelihood function and ∂ ln p(x; θ)/∂θ is the p × 1 gradient vector. In implementing (7.48) inversion of the Hessian is not required. Rewriting (7.48) as

    (∂² ln p(x; θ)/∂θ ∂θᵀ)|_{θ=θ_k} θ_{k+1} = (∂² ln p(x; θ)/∂θ ∂θᵀ)|_{θ=θ_k} θ_k − (∂ ln p(x; θ)/∂θ)|_{θ=θ_k}    (7.49)

we see that the new iterate, θ_{k+1}, can be found from the previous iterate, θ_k, by solving a set of p simultaneous linear equations.

The scoring method is obtained from the Newton-Raphson method by replacing the Hessian by the negative of the Fisher information matrix to yield

    θ_{k+1} = θ_k + I⁻¹(θ_k) (∂ ln p(x; θ)/∂θ)|_{θ=θ_k}.    (7.50)

As explained previously, the inversion may be avoided by writing (7.50) in the form of (7.49). As in the scalar case, the Newton-Raphson and scoring methods may suffer from convergence problems (see Section 7.7). Care must be exercised in using them. Typically, as the data record becomes large, the log-likelihood function becomes more nearly quadratic near the maximum, and the iterative procedure will produce the MLE.

A third method of numerically determining the MLE is the expectation-maximization (EM) algorithm [Dempster, Laird, and Rubin 1977]. This method, although iterative in nature, is guaranteed under certain mild conditions to converge, and at convergence to produce at least a local maximum. It has the desirable property of increasing the likelihood at each step. The EM algorithm exploits the observation that some data sets may allow easier determination of the MLE than the given one. As an example, consider

    x[n] = Σ_{i=1}^{p} cos 2πf_i n + w[n]    n = 0, 1, ..., N − 1

where w[n] is WGN with variance σ² and the frequencies f = [f₁ f₂ ... f_p]ᵀ are to be estimated. The MLE would require a multidimensional minimization of

    J(f) = Σ_{n=0}^{N−1} ( x[n] − Σ_{i=1}^{p} cos 2πf_i n )².
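Before turning to the EM decoupling of this problem, the linear model result (7.43)/(7.46) is simple to verify numerically. The sketch below (not from the text) fits a line in colored Gaussian noise with a diagonal, known covariance; the particular H, θ, and noise variances are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 50
n = np.arange(N)
H = np.column_stack([np.ones(N), n])       # line-fit linear model, p = 2
theta = np.array([1.0, 0.1])
Cdiag = 0.1 * (1 + 0.05 * n)               # known, nonuniform noise variances
x = H @ theta + rng.normal(0.0, np.sqrt(Cdiag))

# MLE for the linear model (7.46): theta_hat = (H^T C^-1 H)^-1 H^T C^-1 x
Cinv = np.diag(1.0 / Cdiag)
theta_hat = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ x)

# Gradient of the log-likelihood at theta_hat; should vanish at the maximum
grad = H.T @ Cinv @ (x - H @ theta_hat)
```

The zero gradient confirms that θ̂ solves the likelihood equations exactly, with no iteration required, which is the special feature of the linear model.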
On the other hand, if the original data could be replaced by the independent data sets

    y_i[n] = cos 2πf_i n + w_i[n]    n = 0, 1, ..., N − 1;  i = 1, 2, ..., p    (7.51)

where w_i[n] is WGN with variance σ_i², then the problem would be decoupled. It is easily shown that due to the independence assumption the PDF would factor into the PDF for each data set. Consequently, the MLE of f_i could be obtained from a minimization of

    J(f_i) = Σ_{n=0}^{N−1} (y_i[n] − cos 2πf_i n)²    i = 1, 2, ..., p.

The original p-dimensional minimization has been reduced to p separate one-dimensional minimizations, in general, an easier problem. The new data set {y₁[n], y₂[n], ..., y_p[n]} is termed the complete data and can be related to the original data as

    x[n] = Σ_{i=1}^{p} y_i[n]    (7.52)

if the noise is decomposed as

    w[n] = Σ_{i=1}^{p} w_i[n].

For this decomposition to hold we will assume that the w_i[n] noise processes are independent of each other and

    Σ_{i=1}^{p} σ_i² = σ².    (7.53)

The key question remains as to how to obtain the complete data from the original or incomplete data. The reader should also note that the decomposition is not unique. We could have just as easily hypothesized the complete data to be

    y₁[n] = Σ_{i=1}^{p} cos 2πf_i n
    y₂[n] = w[n]

and thus a signal and noise decomposition. In general, we suppose that there is a complete-to-incomplete data transformation given as

    x = g(y₁, y₂, ..., y_M)    (7.54)

where in the previous example M = p and the elements of y_i are given by (7.51). The function g is a many-to-one transformation. We wish to find the MLE of θ by maximizing ln p_x(x; θ). Finding this too difficult, we instead maximize ln p_y(y; θ). Since y is unavailable, we replace the log-likelihood function by its conditional expectation

    E_{y|x}[ ln p_y(y; θ) ] = ∫ ln p_y(y; θ) p(y|x; θ) dy.    (7.55)

Finally, since we need to know θ to determine p(y|x; θ) and hence the expected log-likelihood function, we use the current guess. Letting θ_k denote the kth guess of the MLE of θ, we then have the following iterative algorithm.

Expectation (E): Determine the average log-likelihood of the complete data

    U(θ, θ_k) = E[ ln p_y(y; θ) | x; θ_k ].    (7.56)

Maximization (M): Maximize the average log-likelihood function of the complete data

    θ_{k+1} = arg max_θ U(θ, θ_k).    (7.57)

At convergence we hopefully will have the MLE. This approach is termed the EM algorithm. Applying it to our frequency estimation problem, we have the following iteration as derived in Appendix 7C.

E Step: For i = 1, 2, ..., p

    ŷ_i[n] = cos 2πf_{i_k} n + β_i ( x[n] − Σ_{j=1}^{p} cos 2πf_{j_k} n )    (7.58)

where the β_i's can be arbitrarily chosen as long as Σ_{i=1}^{p} β_i = 1.

M Step: For i = 1, 2, ..., p

    f_{i_{k+1}} = arg max_{f_i} Σ_{n=0}^{N−1} ŷ_i[n] cos 2πf_i n.    (7.59)

Note that β_i(x[n] − Σ_{j=1}^{p} cos 2πf_{j_k} n) in (7.58) is an estimate of w_i[n]. Hence, the algorithm iteratively decouples the original data set into p separate data sets, with each one consisting of a single sinusoid in WGN. The maximization given above corresponds to the MLE of a single sinusoid with the data set given by the estimated complete data (see Problem 7.19). Good results have been obtained with this approach. Its disadvantages are the difficulty of determining the conditional expectation in closed form and the arbitrariness in the choice of the complete data. Nonetheless, for the Gaussian problem this method can easily be applied. The reader should consult [Feder and Weinstein 1988] for further details.
In many cases it is difficult to evaluate the MLE of a parameter whose PDF is Gaussian due to the need to invert a large dimension covariance matrix. For example, if x ~ N(0, C(θ)), the MLE of θ is obtained by maximizing

    p(x; θ) = (1/((2π)^{N/2} det^{1/2}(C(θ)))) exp[ −(1/2) xᵀ C⁻¹(θ) x ].

If the covariance matrix cannot be inverted in closed form, then a search technique will require inversion of the N × N matrix for each value of θ to be searched. An alternative approximate method can be applied when x is data from a zero mean WSS random process, so that the covariance matrix is Toeplitz (see Appendix 1). In such a case, it is shown in Appendix 3D that the asymptotic (for large data records) log-likelihood function is given by (see (3D.6))

    ln p(x; θ) = −(N/2) ln 2π − (N/2) ∫_{−1/2}^{1/2} [ ln P_xx(f) + I(f)/P_xx(f) ] df    (7.60)

where

    I(f) = (1/N) | Σ_{n=0}^{N−1} x[n] exp(−j2πfn) |²

is the periodogram of the data and P_xx(f) is the PSD. The dependence of the log-likelihood function on θ is through the PSD. Differentiation of (7.60) produces the necessary conditions for the MLE

    ∂ ln p(x; θ)/∂θ_i = −(N/2) ∫_{−1/2}^{1/2} [ 1/P_xx(f) − I(f)/P²_xx(f) ] (∂P_xx(f)/∂θ_i) df

or

    ∫_{−1/2}^{1/2} [ 1/P_xx(f) − I(f)/P²_xx(f) ] (∂P_xx(f)/∂θ_i) df = 0.    (7.61)

The second derivative is given by (3D.7) in Appendix 3D, so that the Newton-Raphson or scoring method may be implemented using the asymptotic likelihood function. This leads to simpler iterative procedures and is commonly used in practice. An example follows.

Consider a process with the PSD

    P_xx(f) = |1 + b[1] exp(−j2πf) + b[2] exp(−j4πf)|².

The process x[n] is obtained by passing WGN of variance 1 through a filter with system function B(z) = 1 + b[1]z⁻¹ + b[2]z⁻². It is usually assumed that the filter B(z) is minimum-phase or that the zeros z₁, z₂ are within the unit circle. This is termed a moving average (MA) process of order 2. In finding the MLE of the MA filter parameters b[1], b[2], we would need to invert the covariance matrix. Instead, we can use (7.60) to obtain an approximate MLE. Noting that, since the filter is minimum-phase,

    ∫_{−1/2}^{1/2} ln P_xx(f) df = 0

for this process (see Problem 7.22), the approximate MLE is found by minimizing

    ∫_{−1/2}^{1/2} I(f) / |1 + b[1] exp(−j2πf) + b[2] exp(−j4πf)|² df.

However,

    B(z) = (1 − z₁z⁻¹)(1 − z₂z⁻¹)

with |z₁| < 1, |z₂| < 1, so the minimization may be carried out over the allowable zero values and the results converted to the MA parameters by

    b[1] = −(z₁ + z₂)
    b[2] = z₁z₂.

For values of z₁ that are complex we would employ the constraint z₂ = z₁*. For real values of z₁ we must ensure that z₂ is also real. These constraints are necessitated by the coefficients of B(z) being real. Either a grid search or one of the iterative techniques could then be used. ◇

Another example of the asymptotic MLE is given in the next section.
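The MA example above can be sketched numerically. The code below (not from the text) generates an MA(2) process, forms the periodogram, and minimizes the approximate criterion ∫ I(f)/|B(f)|² df by a coarse grid search directly over (b[1], b[2]), keeping only minimum-phase candidates; the true parameters, record length, and grid are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 2048
b_true = np.array([0.5, 0.25])        # minimum-phase MA(2) (zeros at radius 0.5)

# x[n] = u[n] + b1 u[n-1] + b2 u[n-2], with u[n] WGN of variance 1
u = rng.normal(0.0, 1.0, N + 2)
x = u[2:] + b_true[0] * u[1:-1] + b_true[1] * u[:-2]

freqs = np.fft.rfftfreq(N)
I = np.abs(np.fft.rfft(x))**2 / N     # periodogram I(f) on [0, 1/2]
e1 = np.exp(-2j * np.pi * freqs)

def J(b1, b2):
    # approximate criterion: average of I(f)/|B(f)|^2 (even in f, so [0,1/2] suffices)
    B2 = np.abs(1 + b1 * e1 + b2 * e1**2)**2
    return np.mean(I / B2)

grid = np.arange(-0.9, 0.91, 0.05)
candidates = []
for b1 in grid:
    for b2 in grid:
        # keep only minimum-phase B(z): zeros strictly inside the unit circle
        if np.all(np.abs(np.roots([1.0, b1, b2])) < 1.0):
            candidates.append((J(b1, b2), b1, b2))
_, b1_hat, b2_hat = min(candidates)
b_hat = np.array([b1_hat, b2_hat])
```

For a long record the grid minimizer lands near the true MA parameters, with no covariance matrix inversion required, which is the point of the asymptotic form (7.60).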
p(x; n₀) = Π_{n=0}^{n₀−1} (1/√(2πσ²)) exp[ −x²[n]/(2σ²) ]
         × Π_{n=n₀}^{n₀+M−1} (1/√(2πσ²)) exp[ −(x[n] − s[n − n₀])²/(2σ²) ]
         × Π_{n=n₀+M}^{N−1} (1/√(2πσ²)) exp[ −x²[n]/(2σ²) ].

In this case the continuous parameter τ₀ has been discretized as n₀ = τ₀/Δ. The likelihood function simplifies to

    p(x; n₀) ∝ exp[ −(1/(2σ²)) Σ_{n=n₀}^{n₀+M−1} ( −2x[n]s[n − n₀] + s²[n − n₀] ) ].

Note that the MLE of the delay n₀ is found by correlating the data with all possible received signals and then choosing the maximum. ◇

For the sinusoidal parameter estimation problem the PDF is

    p(x; θ) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A cos(2πf₀n + φ))² ]

where A > 0 and 0 < f₀ < 1/2. The MLE of amplitude A, frequency f₀, and phase φ is found by minimizing

    J(A, f₀, φ) = Σ_{n=0}^{N−1} (x[n] − A cos(2πf₀n + φ))².

We first expand the cosine to yield

    J(A, f₀, φ) = Σ_{n=0}^{N−1} (x[n] − A cos φ cos 2πf₀n + A sin φ sin 2πf₀n)².

Also, let α₁ = A cos φ and α₂ = −A sin φ, and define the vectors c = [1 cos 2πf₀ ... cos 2πf₀(N−1)]ᵀ and s = [0 sin 2πf₀ ... sin 2πf₀(N−1)]ᵀ. Then, with α = [α₁ α₂]ᵀ and H = [c s], the function to be minimized over α is exactly that encountered in the linear model in (7.42) with C = I. The minimizing solution is, from (7.43),

    α̂ = (HᵀH)⁻¹Hᵀx.
Substituting α̂ back into J shows that the MLE of frequency is found by maximizing

    (2/N)[ (cᵀx)² + (sᵀx)² ]

over f₀, which is, to within a scale factor, the periodogram of the data.

The bearing β of a sinusoidal signal impinging on a line array is related to the spatial frequency f_s as

    f_s = F₀ (d/c) cos β.

Assuming the amplitude, phase, and spatial frequency are to be estimated, we have the same form of problem as in the previous example, and by the invariance property the bearing estimate follows from the spatial frequency estimate as β̂ = arccos( f̂_s c/(F₀ d) ).
To find f̂_s it is common practice to use the approximate MLE obtained from the peak of the periodogram (7.66) or

    I(f_s) = (1/M) | Σ_{n=0}^{M−1} x[n] exp(−j2πf_s n) |²

for M sensors. Here, x[n] is the sample obtained from sensor n. Alternatively, the approximate MLE of bearing can be found directly by maximizing

    I'(β) = (1/M) | Σ_{n=0}^{M−1} x[n] exp(−j2πF₀(d/c)n cos β) |²

over β. ◇

Example 7.18 — Autoregressive Parameter Estimation (Example 3.16)

To find the MLE we will use the asymptotic form of the log-likelihood given by (7.60). Then, since the PSD is

    P_xx(f) = σ_u² / |A(f)|²

we have

    ln p(x; a, σ_u²) = −(N/2) ln 2π − (N/2) ∫_{−1/2}^{1/2} [ ln(σ_u²/|A(f)|²) + I(f)|A(f)|²/σ_u² ] df.

As shown in Problem 7.22, since A(z) is minimum-phase (required for the stability of 1/A(z)), then

    ∫_{−1/2}^{1/2} ln |A(f)|² df = 0

and therefore

    ln p(x; a, σ_u²) = −(N/2) ln 2π − (N/2) ln σ_u² − (N/(2σ_u²)) ∫_{−1/2}^{1/2} |A(f)|² I(f) df.    (7.67)

Differentiating with respect to σ_u² and setting the result equal to zero produces

    σ̂_u² = ∫_{−1/2}^{1/2} |A(f)|² I(f) df.

Substituting this into (7.67) results in

    ln p(x; a, σ̂_u²) = −(N/2) ln 2π − (N/2) ln σ̂_u² − N/2.

To find â we must minimize σ̂_u² or

    J(a) = ∫_{−1/2}^{1/2} |A(f)|² I(f) df.

Note that this function is quadratic in a, resulting in the global minimum upon differentiation. For k = 1, 2, ..., p we have

    ∂J(a)/∂a[k] = ∫_{−1/2}^{1/2} [ (∂A(f)/∂a[k]) A*(f) + A(f) (∂A*(f)/∂a[k]) ] I(f) df
                = ∫_{−1/2}^{1/2} [ A*(f) exp(−j2πfk) + A(f) exp(j2πfk) ] I(f) df.

Since A(−f) = A*(f) and I(−f) = I(f), we can rewrite this as

    ∂J(a)/∂a[k] = 2 ∫_{−1/2}^{1/2} A(f) I(f) exp(j2πfk) df.

Setting it equal to zero produces

    ∫_{−1/2}^{1/2} ( 1 + Σ_{l=1}^{p} a[l] exp(−j2πfl) ) I(f) exp(j2πfk) df = 0    k = 1, 2, ..., p.    (7.68)

But ∫_{−1/2}^{1/2} I(f) exp(j2πfk) df is just the inverse Fourier transform of the periodogram evaluated at k. This can be shown to be the estimated ACF (see Problem 7.25):

    r̂_xx[k] = (1/N) Σ_{n=0}^{N−1−|k|} x[n] x[n + |k|]    |k| ≤ N − 1
    r̂_xx[k] = 0    |k| ≥ N.
The set of equations to be solved for the approximate MLE of the AR filter parameters
and thus a becomes
1 p
l
References

Bard, Y., Nonlinear Parameter Estimation, Academic Press, New York, 1974.

Bendat, J.S., A.G. Piersol, Random Data: Analysis and Measurement Procedures, J. Wiley, New York, 1971.

Bickel, P.J., K.A. Doksum, Mathematical Statistics, Holden-Day, San Francisco, 1977.

Dempster, A.P., N.M. Laird, D.B. Rubin, "Maximum Likelihood From Incomplete Data via the EM Algorithm," J. Roy. Statist. Soc., Vol. 39, pp. 1-38, Dec. 1977.

Dudewicz, E.J., Introduction to Probability and Statistics, Holt, Rinehart, and Winston, New York, 1976.

Feder, M., E. Weinstein, "Parameter Estimation of Superimposed Signals Using the EM Algorithm," IEEE Trans. Acoust., Speech, Signal Process., Vol. 36, pp. 477-489, April 1988.

Hoel, P.G., S.C. Port, C.J. Stone, Introduction to Statistical Theory, Houghton Mifflin, Boston, 1971.

Kay, S.M., Modern Spectral Estimation: Theory and Application, Prentice-Hall, Englewood Cliffs, N.J., 1988.

Rao, C.R., Linear Statistical Inference and Its Applications, J. Wiley, New York, 1973.

Schwartz, M., L. Shaw, Signal Processing: Discrete Spectral Analysis, Detection, and Estimation, McGraw-Hill, New York, 1975.
Problems

7.6 Another consistency result asserts that if α = g(θ) for g a continuous function, and θ̂ is consistent for θ, then α̂ = g(θ̂) is consistent for α. Using the statistical linearization argument and the formal definition of consistency, show why this is true. Hint: Linearize g about the true value of θ.

7.7 Consider N IID observations from the exponential family of PDFs

    p(x; θ) = exp [ A(θ)B(x) + C(x) + D(θ) ]

where A, B, C, and D are functions of their respective arguments. Find an equation to be solved for the MLE. Now apply your results to the PDFs in Problem 7.3.

7.8 If we observe N IID samples from a Bernoulli experiment (coin toss) with the probabilities

    Pr{x[n] = 1} = p
    Pr{x[n] = 0} = 1 - p

find the MLE of p.

7.9 For N IID observations from a U[0, θ] PDF find the MLE of θ.

7.10 If the data set

    x[n] = A s[n] + w[n],    n = 0, 1, ..., N-1

is observed, where s[n] is known and w[n] is WGN with known variance σ², find the MLE of A. Determine the PDF of the MLE and whether or not the asymptotic PDF holds.

7.11 Find an equation to be solved for the MLE of the correlation coefficient ρ in Problem 3.15. Can you find an approximate solution if N → ∞? Hint: Let Σ_{n=0}^{N-1} x1²[n]/N → 1 and Σ_{n=0}^{N-1} x2²[n]/N → 1 in the equation, where x[n] = [x1[n] x2[n]]^T.

7.13 Consider the data x[n] = A + w[n] for n = 0, 1, ..., N-1, where A is to be estimated and w[n] is WGN with known variance σ². Using a Monte Carlo computer simulation, verify that the PDF of the MLE or sample mean is N(A, σ²/N). Plot the theoretical and computer-generated PDFs for comparison. Use A = 1, σ² = 0.1, N = 50, and M = 1000 realizations. What happens if M is increased to 5000? Hint: Use the computer subroutines given in Appendix 7A.

7.14 Consider the IID data samples x[n] for n = 0, 1, ..., N-1, where x[n] ~ N(0, σ²). Then, according to Slutsky's theorem [Bickel and Doksum 1977], if x̄ denotes the sample mean and

    σ̂² = (1/N) Σ_{n=0}^{N-1} (x[n] - x̄)²

denotes the sample variance, then

    x̄ / (σ̂/√N)  ~a  N(0, 1).

Although this may be proven analytically, it requires familiarity with convergence of random variables. Instead, implement a Monte Carlo computer simulation to find the PDF of x̄/(σ̂/√N) for σ² = 1 and N = 10, N = 100. Compare your results to the theoretical asymptotic PDF. Hint: Use the computer subroutines given in Appendix 7A.

7.15 In this problem we show that the MLE attains its asymptotic PDF when the estimation error is small. Consider Example 7.6 and let

    εs = (2/(NA)) Σ_{n=0}^{N-1} w[n] sin 2πf0n
    εc = (2/(NA)) Σ_{n=0}^{N-1} w[n] cos 2πf0n

in (7.20). Assume that f0 is not near 0 or 1/2, so that the same approximations can be made as in Example 3.4. Prove first that εs and εc are approximately
uncorrelated and hence independent Gaussian random variables. Next, determine the PDF of εs, εc. Then, assuming εs and εc to be small, use a truncated Taylor series expansion of φ̂ about the true value φ, where the function g is given in (7.20). Use this expansion to find the PDF of φ̂. Compare the variance with the CRLB in Example 3.4. Note that for εs, εc to be small, either N → ∞ and/or A → ∞.

7.16 For Example 7.7 determine the var(Â) as well as the CRLB for the non-Gaussian PDF (termed a Laplacian PDF)

    p(w[0]) = (1/2) exp(-|w[0]|).

Does the MLE attain the CRLB as N → ∞?

7.18 Plot the given function over the domain -3 ≤ x ≤ 13 and find the maximum of the function from the graph. Then, use a Newton-Raphson iteration to find the maximum. In doing so use the initial guesses of x0 = 0.5 and x0 = 9.5. What can you say about the importance of the initial guess?

7.19 If

    x[n] = cos 2πf0n + w[n],    n = 0, 1, ..., N-1

where w[n] is WGN with known variance σ², show that the MLE of frequency is obtained approximately by maximizing (f0 not near 0 or 1/2)

    Σ_{n=0}^{N-1} x[n] cos 2πf0n

over the interval 0 < f0 < 1/2. Next, use a Monte Carlo computer simulation (see Appendix 7A) for N = 10, f0 = 0.25, σ² = 0.01 and plot the function to be maximized. Apply the Newton-Raphson method for determining the maximum and compare the results to those from a grid search.

7.20 Consider the data set

    x[n] = s[n] + w[n],    n = 0, 1, ..., N-1

where s[n] is unknown for n = 0, 1, ..., N-1 and w[n] is WGN with known variance σ². Determine the MLE of s[n] and also its PDF. Do the asymptotic MLE properties hold? That is, is the MLE unbiased, efficient, Gaussian, and consistent?

7.21 For N IID observations from the PDF N(A, σ²), where A and σ² are both unknown, find the MLE of the SNR α = A²/σ².

7.22 Prove that if

    A(z) = 1 + Σ_{k=1}^{p} a[k] z^{-k}

is a minimum-phase polynomial (all roots inside the unit circle of the z plane), then

    ∫_{-1/2}^{1/2} ln |A(f)|² df = 0.

Hint: First show that

    (1/(2πj)) ∮ ln A(z) (dz/z)

is the inverse z transform of ln A(z) evaluated at n = 0. Finally, use the fact that a minimum-phase A(z) results in an inverse z transform of ln A(z) that is causal.

7.23 Find the asymptotic MLE for the total power P0 of the PSD

    Pxx(f) = P0 Q(f)

where

    ∫_{-1/2}^{1/2} Q(f) df = 1.

If Q(f) = 1 for all f so that the process is WGN, simplify your results. Hint: Use the results from Problem 7.25 for the second part.

7.24 The peak of the periodogram was shown in Example 7.16 to be the MLE for frequency under the condition that f0 is not near 0 or 1/2. Plot the periodogram for N = 10 for the frequencies f0 = 0.25 and f0 = 0.05. Use the noiseless data

    x[n] = cos 2πf0n,    n = 0, 1, ..., N-1.
What happens if this approximation is not valid? Repeat the problem using the exact function

    x^T H (H^T H)^{-1} H^T x

where H is defined in Example 7.16.

7.25 Prove that the inverse Fourier transform of the periodogram is

    r̂xx[k] = (1/N) Σ_{n=0}^{N-1-|k|} x[n] x[n+|k|]    for |k| ≤ N-1
    r̂xx[k] = 0                                        for |k| ≥ N.

Hint: Note that the periodogram can be written as

    I(f) = (1/N) X'(f) X'*(f)

where X'(f) is the Fourier transform of the sequence x[n] extended with zeros outside the interval 0 ≤ n ≤ N-1.

Appendix 7A

Monte Carlo Methods

We now describe the computer methods employed to generate realizations of the random variable (see Example 7.1)
    Â = -1/2 + √( (1/N) Σ_{n=0}^{N-1} x²[n] + 1/4 )

and to determine its statistical properties:

1. Determine the mean using (7.9).
2. Determine the variance using (7.10).
3. Determine the PDF using a histogram.

A pseudorandom number generator produces independent U[0,1] random variates. As an example, on a VAX 11/780 the intrinsic function RAN can be used. In MONTECARLO the subroutine RANDOM calls the RAN function N times (for N even). Next, we convert these independent uniformly distributed random variates into independent Gaussian random variates with mean 0 and variance 1 by using the Box-Mueller transformation

    w1 = √(-2 ln u1) cos 2πu2
    w2 = √(-2 ln u1) sin 2πu2

where u1, u2 are independent U[0,1] random variables and w1, w2 are independent N(0,1) random variables. Since the transformation operates on two random variables at a time, N needs to be even (if odd, just increment by 1 and discard the extra random variate). To convert to N(0, σ²) random variates simply multiply by σ. This entire procedure is implemented in subroutine WGN. Next, A is added to w[n] to generate one time series realization x[n] for n = 0, 1, ..., N-1. For this realization of x[n], Â is computed. We repeat the procedure M times to yield the M realizations of Â. The mean and variance are determined using (7.9) and (7.10), respectively. The alternative form of the variance estimate

    var̂(Â) = (1/M) Σ_{i=1}^{M} Âi² - [Ê(Â)]²

is used by the subroutine STATS. The number of realizations, M, needed to provide an accurate estimate of the mean and variance can be determined by finding the variance of the estimators since (7.9) and (7.10) are nothing more than estimators themselves. A simpler procedure is to keep increasing M until the numbers given by (7.9) and (7.10) converge.

Finally, to determine the PDF of Â we use an estimated PDF called the histogram. The histogram estimates the PDF by determining the number of times Â falls within a specified interval. Then, a division by the total number of realizations to yield the probability, followed by a division by the interval length, produces the PDF estimate. A typical histogram is shown in Figure 7A.1. Since the probability of falling within a cell is the integral of p(x) over the cell, the histogram value over a cell approximates the average PDF over the cell. Unfortunately, as the cell width becomes smaller (more cells), the probability of a realization falling into the cell becomes smaller. This yields highly variable PDF estimates. As before, a good strategy keeps increasing M until the estimated PDF appears to converge. The histogram approach is implemented in the subroutine HISTOG. A further discussion of PDF estimation can be found in [Bendat and Piersol 1971].

Fortran Program MONTECARLO

C MONTECARLO
C This program determines the asymptotic properties of
C the MLE for a N(A,A) PDF (see Examples 7.1-7.3, 7.5).
C
C The array dimensions are given as variables. Replace them
C with numerical values. To use this program you will need
C a plotting subroutine to replace PLOT and a random number
C generator to replace the intrinsic function RAN in the
C subroutine RANDOM.
C
      DIMENSION X(N),W(N),AHAT(M),PDF(NCELLS),HIST(NCELLS)
     *,XX(NCELLS)
      PI=4.*ATAN(1.)
C Input the value of A, the number of data points N, and the
C number of realizations M.
      WRITE(6,10)
10    FORMAT(' INPUT A, N, AND M')
      READ(5,*)A,N,M
C Generate M realizations of the estimate of A.
      DO 40 K=1,M
C Generate the noise samples and add A to each one.
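The Monte Carlo procedure described above (Box-Mueller generation of WGN, repeated computation of Â, and a histogram estimate of its PDF) can also be sketched compactly in modern Python. This is an illustrative translation, not the book's program; numpy is assumed, and the estimator is the Â of Example 7.1 for the N(A, A) PDF:

```python
import numpy as np

def box_mueller(n, var, rng):
    # Generate n zero mean white Gaussian noise samples of variance var
    # from pairs of U[0,1] variates (mirrors subroutine WGN).
    n1 = n + (n % 2)                      # round up to an even count
    u = rng.random(n1)
    u1, u2 = u[0::2], u[1::2]
    temp = np.sqrt(-2.0 * np.log(u1))
    w = np.empty(n1)
    w[0::2] = temp * np.cos(2 * np.pi * u2)
    w[1::2] = temp * np.sin(2 * np.pi * u2)
    return np.sqrt(var) * w[:n]           # discard the extra variate if n odd

def monte_carlo(A, N, M, seed=0):
    # M realizations of the MLE of A for the N(A, A) PDF (Example 7.1).
    rng = np.random.default_rng(seed)
    a_hat = np.empty(M)
    for i in range(M):
        x = A + box_mueller(N, A, rng)    # noise variance equals A here
        a_hat[i] = -0.5 + np.sqrt(np.mean(x**2) + 0.25)
    mean = a_hat.mean()                           # as in (7.9)
    var = np.mean(a_hat**2) - mean**2             # alternative form (STATS)
    pdf, edges = np.histogram(a_hat, bins=30, density=True)  # as in HISTOG
    return mean, var, pdf, edges

mean, var, pdf, edges = monte_carlo(A=1.0, N=50, M=1000)
print(mean, var)   # mean should be near A = 1
```

Increasing M, as the text recommends, makes both the moment estimates and the histogram converge.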
      DIMENSION X(1),Y1(1),Y2(1)
C Replace this subroutine with any standard Fortran-compatible
C plotting routine.
      RETURN
      END

      SUBROUTINE WGN(N,VAR,W)
C This subroutine generates samples of zero mean white
C Gaussian noise.
C
C Input parameters:
C
C    N   - Number of noise samples desired
C    VAR - Variance of noise desired
C
C Output parameters:
C
C    W   - Array of dimension Nx1 containing noise samples
C
      DIMENSION W(1)
      PI=4.*ATAN(1.)
C Add 1 to desired number of samples if N is odd.
      N1=N
      IF(MOD(N,2).NE.0)N1=N+1
C Generate N1 independent and uniformly distributed random
C variates on [0,1].
      CALL RANDOM(N1,VAR,W)
      L=N1/2
C Convert uniformly distributed random variates to Gaussian
C ones using a Box-Mueller transformation.
      DO 10 I=1,L
      U1=W(2*I-1)
      U2=W(2*I)
      TEMP=SQRT(-2.*ALOG(U1))
      W(2*I-1)=TEMP*COS(2.*PI*U2)*SQRT(VAR)
10    W(2*I)=TEMP*SIN(2.*PI*U2)*SQRT(VAR)
      RETURN
      END

      SUBROUTINE RANDOM(N1,VAR,W)
      DIMENSION W(1)
      DO 10 I=1,N1
C For machines other than DEC VAX 11/780 replace RAN(ISEED)
C with a random number generator.
10    W(I)=RAN(11111)
      RETURN
      END

Appendix 7B

Asymptotic PDF of MLE for a Scalar Parameter

We now give an outline of the proof of Theorem 7.1. A rigorous proof of consistency can be found in [Dudewicz 1976] and of the asymptotic Gaussian property in [Rao 1973]. To simplify the discussion the observations are assumed to be IID. We further assume the following regularity conditions.

1. The first-order and second-order derivatives of the log-likelihood function are well defined.

2. E[ ∂ ln p(x[n]; θ)/∂θ ] = 0.

We first show that the MLE is consistent. To do so we will need the following inequality (see [Dudewicz 1976] for proof), which is related to the Kullback-Leibler information:

    ∫ ln p(x; θ1) p(x; θ1) dx ≥ ∫ ln p(x; θ2) p(x; θ1) dx    (7B.1)

with equality if and only if θ1 = θ2. Now, in maximizing the log-likelihood function, we are equivalently maximizing

    (1/N) ln p(x; θ) = (1/N) ln Π_{n=0}^{N-1} p(x[n]; θ) = (1/N) Σ_{n=0}^{N-1} ln p(x[n]; θ).

But as N → ∞, this converges to the expected value by the law of large numbers. Hence, if θ0 denotes the true value of θ, we have

    (1/N) Σ_{n=0}^{N-1} ln p(x[n]; θ) → ∫ ln p(x[n]; θ) p(x[n]; θ0) dx[n].    (7B.2)
However, from (7B.1) the right-hand side of (7B.2) is maximized for θ = θ0. Hence, the value of θ maximizing the left-hand side, which is the MLE, must converge to θ0 as N → ∞, and the MLE is consistent.

To derive the asymptotic PDF of the MLE we first use a Taylor expansion about the true value of θ, which is θ0. Then, by the mean value theorem

    ∂ ln p(x; θ)/∂θ |_{θ=θ̂} = ∂ ln p(x; θ)/∂θ |_{θ=θ0} + ∂² ln p(x; θ)/∂θ² |_{θ=θ̄} (θ̂ - θ0)

where θ0 < θ̄ < θ̂. But

    ∂ ln p(x; θ)/∂θ |_{θ=θ̂} = 0

by the definition of the MLE, so that

    0 = ∂ ln p(x; θ)/∂θ |_{θ=θ0} + ∂² ln p(x; θ)/∂θ² |_{θ=θ̄} (θ̂ - θ0).    (7B.3)

Now consider √N(θ̂ - θ0), so that (7B.3) becomes

    √N(θ̂ - θ0) = [ (1/√N) ∂ ln p(x; θ)/∂θ |_{θ=θ0} ] / [ -(1/N) ∂² ln p(x; θ)/∂θ² |_{θ=θ̄} ].    (7B.4)

For the denominator, since the MLE is consistent, θ̄ → θ0 and

    (1/N) ∂² ln p(x; θ)/∂θ² |_{θ=θ̄} = (1/N) Σ_{n=0}^{N-1} ∂² ln p(x[n]; θ)/∂θ² |_{θ=θ̄}
                                     → E[ ∂² ln p(x[n]; θ)/∂θ² |_{θ=θ0} ] = -i(θ0)

where the last convergence is due to the law of large numbers. Also, the numerator is a random variable, being a function of x[n]. Additionally, since the x[n]'s are IID, so are the terms ∂ ln p(x[n]; θ)/∂θ. By the central limit theorem the numerator term in (7B.4) has a Gaussian PDF with mean

    E[ (1/√N) Σ_{n=0}^{N-1} ∂ ln p(x[n]; θ)/∂θ |_{θ=θ0} ] = 0

and variance

    E[ ( (1/√N) Σ_{n=0}^{N-1} ∂ ln p(x[n]; θ)/∂θ |_{θ=θ0} )² ]
        = (1/N) Σ_{n=0}^{N-1} E[ ( ∂ ln p(x[n]; θ)/∂θ |_{θ=θ0} )² ] = i(θ0)

due to the independence of the random variables. We now resort to Slutsky's theorem [Bickel and Doksum 1977], which says that if the sequence of random variables xn has the asymptotic PDF of the random variable x and the sequence of random variables yn converges to a constant c, then xn/yn has the same asymptotic PDF as the random variable x/c. In our case, the numerator of (7B.4) is asymptotically N(0, i(θ0)) and the denominator converges to the constant c = i(θ0), or finally

    √N(θ̂ - θ0)  ~a  N(0, 1/i(θ0)).
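The conclusion above is easy to check numerically. In this sketch (illustrative Python, not part of the original text; numpy is assumed) we take x[n] ~ N(A, σ²) with σ² known, for which the MLE is the sample mean and i(θ0) = 1/σ², so that √N(θ̂ - θ0) should behave as N(0, σ²):

```python
import numpy as np

rng = np.random.default_rng(1)
A0, sigma2, N, M = 2.0, 3.0, 200, 20000

# M realizations of sqrt(N) * (theta_hat - theta0), theta_hat = sample mean
x = A0 + np.sqrt(sigma2) * rng.standard_normal((M, N))
z = np.sqrt(N) * (x.mean(axis=1) - A0)

# Asymptotic (here exact) PDF is N(0, 1/i(theta0)) = N(0, sigma2)
print(z.mean(), z.var())   # should be near 0 and near 3
```

For this Gaussian example the asymptotic PDF holds for every N; for other PDFs (e.g., the N(A, A) model of Example 7.1) the same experiment shows the Gaussian shape emerging only as N grows.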
Appendix 7C

Derivation for EM Algorithm Example

Since the yi's are independent, the log-likelihood function of the complete data is

    ln py(y; θ) = Σ_{i=1}^{p} ln p(yi; θi)

and the incomplete data are related to the complete data by

    x = Σ_{i=1}^{p} yi = [I I ... I] y

where I is the N x N identity matrix and the transformation matrix is composed of p identity matrices. Writing out the log-likelihood,

    ln py(y; θ) = Σ_{i=1}^{p} ln { (2πσi²)^{-N/2} exp[ -(1/(2σi²)) Σ_{n=0}^{N-1} (yi[n] - cos 2πfi n)² ] }

                = c - Σ_{i=1}^{p} (1/(2σi²)) Σ_{n=0}^{N-1} (yi[n] - cos 2πfi n)²

                = g(y) + Σ_{i=1}^{p} (1/σi²) Σ_{n=0}^{N-1} ( yi[n] cos 2πfi n - (1/2) cos² 2πfi n )

where c is a constant and g(y) does not depend on the frequencies. Making the approximation that Σ_{n=0}^{N-1} cos² 2πfi n ≈ N/2 for fi not near 0 or 1/2, we have

    ln py(y; θ) = h(y) + Σ_{i=1}^{p} (1/σi²) Σ_{n=0}^{N-1} yi[n] cos 2πfi n
                = h(y) + Σ_{i=1}^{p} (1/σi²) ci^T yi    (7C.1)

where ci = [1 cos 2πfi ... cos 2πfi(N-1)]^T. Using the standard result for conditional expectations of jointly Gaussian random vectors (see Appendix 10A), we have

    E(y|x; θk) = E(y) + Cyx Cxx^{-1} (x - E(x)).

The means are given by

    E(yi) = ci,    E(x) = Σ_{i=1}^{p} ci

while the covariance matrices follow from the independence of the yi's, each with covariance σi² I, so that

    Cyx = [σ1² I  σ2² I  ...  σp² I]^T,    Cxx = ( Σ_{i=1}^{p} σi² ) I = σ² I.

Hence

    E(yi|x; θk) = ci + (σi²/σ²) ( x - Σ_{j=1}^{p} cj ).    (7C.4)

Finally, since the σi²'s are not unique, they can be chosen arbitrarily as long as (see (7.53))

    Σ_{i=1}^{p} σi² = σ²

or, equivalently, σi² = βi σ² with Σ_{i=1}^{p} βi = 1, so that

    E(yi|x; θk) = ci + βi ( x - Σ_{j=1}^{p} cj ),    i = 1, 2, ..., p

where ci is computed using θk. Note that E(yi|x; θk) can be thought of as an estimate of the yi[n] data set since, letting ŷi = E(yi|x; θk), the function to be maximized becomes

    U'(θ, θk) = Σ_{i=1}^{p} ci^T ŷi.
Chapter 8
Least Squares
8.1 Introduction
In previous chapters we have attempted to find an optimal or nearly optimal (for large
data records) estimator by considering the class of unbiased estimators and determining
the one exhibiting minimum variance, the so-called MVU estimator. We now depart
from this philosophy to investigate a class of estimators that in general have no optimality properties associated with them but make good sense for many problems of interest. This is the method of least squares, which dates back to 1795 when Gauss used
the method to study planetary motions. A salient feature of the method is that no
probabilistic assumptions are made about the data, only a signal model is assumed.
The advantage, then, is its broader range of possible applications. On the negative
side, no claims about optimality can be made, and furthermore, the statistical perfor-
mance cannot be assessed without some specific assumptions about the probabilistic
structure of the data. Nonetheless, the least squares estimator is widely used in practice
due to its ease of implementation, amounting to the minimization of a least squares
error criterion.
8.2 Summary
The least squares approach to parameter estimation chooses θ to minimize (8.1), where the signal depends on θ. Linear versus nonlinear least squares problems are described
in Section 8.3. The general linear least squares problem which minimizes (8.9) leads
to the least squares estimator of (8.10) and the minimum least squares error of (8.11)-
(8.13). A weighted least squares error criterion of (8.14) results in the estimator of
(8.16) and the minimum least squares error of (8.17). A geometrical interpretation
of least squares is described in Section 8.5 and leads to the important orthogonality
principle. When the dimensionality of the vector parameter is not known, an order-
recursive least squares approach can be useful. It computes the least squares estimator
recursively as the number of unknown parameters increases. It is summarized by (8.28)-
(8.31). If it is desired to update in time the least squares estimator as more data become
available, then a sequential approach can be employed. It determines the least squares estimator based on the estimate at the previous time and the new data. The equations (8.46)-(8.48) summarize the calculations required. At times the parameter vector is constrained, as in (8.50). In such a case the constrained least squares estimator is given by (8.52). Nonlinear least squares is discussed in Section 8.9. Some methods for converting the problem to a linear one are described, followed by iterative minimization approaches if this is not possible. The two methods that are generally used are the Newton-Raphson iteration of (8.61) and the Gauss-Newton iteration of (8.62).

8.3 The Least Squares Approach

Our focus in determining a good estimator has been to find one that was unbiased and had minimum variance. In choosing the variance as our measure of goodness we implicitly sought to minimize the discrepancy (on the average) between our estimate and the true parameter value. In the least squares (LS) approach we attempt to minimize the squared difference between the given data x[n] and the assumed signal or noiseless data. This is illustrated in Figure 8.1, which shows (a) the signal model and (b) the least squares error. The signal is generated by some model which in turn depends upon our unknown parameter θ. The signal s[n] is purely deterministic. Due to observation noise or model inaccuracies we observe a perturbed version of s[n], which we denote by x[n]. The least squares estimator (LSE) of θ chooses the value that makes s[n] closest to the observed data, the closeness being measured by the LS error criterion of (8.1).

Example 8.1 - DC Level Signal

Assume that the signal model in Figure 8.1 is s[n] = A and we observe x[n] for n = 0, 1, ..., N-1. Then, according to the LS approach, we can estimate A by minimizing (8.1) or

    J(A) = Σ_{n=0}^{N-1} (x[n] - A)².

Differentiating with respect to A and setting the result equal to zero produces

    Â = (1/N) Σ_{n=0}^{N-1} x[n]

or the sample mean estimator. Our familiar estimator, however, cannot be claimed to be optimal in the MVU sense but only in that it minimizes the LS error. We know, however, from our previous discussions that if x[n] = A + w[n], where w[n] is zero mean WGN, then the LSE will also be the MVU estimator, but otherwise not. To underscore the potential difficulties, consider what would happen if the noise were not zero mean. Then, the sample mean estimator would actually be an estimator of A + E(w[n]) since w[n] could be written as

    w[n] = E(w[n]) + w'[n]

where w'[n] is zero mean noise. The data are more appropriately described by

    x[n] = A + E(w[n]) + w'[n].

It should be clear that in using this approach, it must be assumed that the observed data are composed of a deterministic signal and zero mean noise. If this is the case, the error ε[n] = x[n] - s[n] will tend to be zero on the average for the correct choice of the signal parameters. The minimization of (8.1) is then a reasonable approach. The reader might also consider what would happen if the assumed DC level signal model were incorrect.
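The bias effect just described is easy to see numerically. In this sketch (illustrative Python, not from the text; numpy assumed) the noise has mean 0.5, and the LS estimate of the DC level converges to A + E(w[n]) rather than to A:

```python
import numpy as np

rng = np.random.default_rng(0)
A, noise_mean, N = 1.0, 0.5, 100000

# x[n] = A + w[n], where w[n] is NOT zero mean
x = A + (noise_mean + rng.standard_normal(N))

# LSE of A: minimize sum (x[n] - A)^2, which gives the sample mean
A_hat = x.mean()
print(A_hat)   # near A + E(w[n]) = 1.5, not A = 1
```

No amount of data removes the bias: the LS criterion faithfully fits the deterministic part of the data, which here is A + E(w[n]).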
Example 8.2 - Sinusoidal Signal

If the signal is s[n] = A cos 2πf0n, where f0 is known and A is to be estimated, then the LSE minimizes

    J(A) = Σ_{n=0}^{N-1} (x[n] - A cos 2πf0n)²

which is quadratic in A and so is easily minimized in closed form. If, however, the frequency f0 must also be estimated, the LS error is quadratic in A but nonquadratic in f0. The net result is that J can be minimized in closed form with respect to A for a given f0, reducing the minimization of J to one over f0 alone.

For the scalar linear LS problem, in which s[n] = θh[n] for a known sequence h[n], minimizing J(θ) = Σ_{n=0}^{N-1} (x[n] - θh[n])² produces the LSE

    θ̂ = Σ_{n=0}^{N-1} x[n]h[n] / Σ_{n=0}^{N-1} h²[n]    (8.4)

and the minimum LS error is

    Jmin = J(θ̂) = Σ_{n=0}^{N-1} (x[n] - θ̂h[n])(x[n] - θ̂h[n])
         = Σ_{n=0}^{N-1} x[n](x[n] - θ̂h[n]) - θ̂ Σ_{n=0}^{N-1} h[n](x[n] - θ̂h[n])
         = Σ_{n=0}^{N-1} x²[n] - θ̂ Σ_{n=0}^{N-1} x[n]h[n]    (8.5)
         = Σ_{n=0}^{N-1} x²[n] - ( Σ_{n=0}^{N-1} x[n]h[n] )² / Σ_{n=0}^{N-1} h²[n]    (8.6)

since the second sum in the second line is zero at the minimum. For Example 8.1 in which θ = A we have h[n] = 1, so that from (8.4) Â = x̄ and from (8.5)

    Jmin = Σ_{n=0}^{N-1} x²[n] - N x̄².

If the data were noiseless so that x[n] = A, then Jmin = 0 or we would have a perfect LS fit to the data. On the other hand, if x[n] = A + w[n], where E(w²[n]) >> A², then Σ_{n=0}^{N-1} x²[n]/N >> x̄². The minimum LS error would then be

    Jmin ≈ Σ_{n=0}^{N-1} x²[n]

or not much different than the original error. It can be shown (see Problem 8.2) that the minimum LS error is always between these two extremes or

    0 ≤ Jmin ≤ Σ_{n=0}^{N-1} x²[n].    (8.7)

The extension of these results to a vector parameter θ of dimension p x 1 is straightforward and of great practical utility. For the signal s = [s[0] s[1] ... s[N-1]]^T to be linear in the unknown parameters, we assume, using matrix notation,

    s = Hθ    (8.8)

where H is a known N x p matrix (N > p) of full rank p. The matrix H is referred to as the observation matrix. This is, of course, the linear model, albeit without the usual noise PDF assumption. Many examples of signals satisfying this model can be found in Chapter 4. The LSE is found by minimizing

    J(θ) = Σ_{n=0}^{N-1} (x[n] - s[n])² = (x - Hθ)^T (x - Hθ).    (8.9)

This is easily accomplished (since J is a quadratic function of θ) by using (4.3). Since

    J(θ) = x^T x - 2 x^T Hθ + θ^T H^T Hθ

(note that x^T Hθ is a scalar), the gradient is

    ∂J(θ)/∂θ = -2 H^T x + 2 H^T Hθ.

Setting the gradient equal to zero yields the LSE

    θ̂ = (H^T H)^{-1} H^T x.    (8.10)

The equations H^T Hθ̂ = H^T x to be solved for θ̂ are termed the normal equations. The assumed full rank of H guarantees the invertibility of H^T H. See also Problem 8.4 for another derivation. Somewhat surprisingly, we obtain an estimator that has the identical functional form as the efficient estimator for the linear model as well as the BLUE. That θ̂ as given by (8.10) is not the identical estimator stems from the assumptions made about the data. For it to be the BLUE would require E(x) = Hθ and Cx = σ²I (see Chapter 6), and to be efficient would in addition to these properties require x to be Gaussian (see Chapter 4). As a side issue, if these assumptions hold, we can easily determine the statistical properties of the LSE (see Problem 8.6), these having been given in Chapters 4 and 6. Otherwise, this may be quite difficult. The minimum LS error is found from (8.9) and (8.10) as

    Jmin = J(θ̂) = (x - Hθ̂)^T (x - Hθ̂)
         = (x - H(H^T H)^{-1}H^T x)^T (x - H(H^T H)^{-1}H^T x)
         = x^T (I - H(H^T H)^{-1}H^T)(I - H(H^T H)^{-1}H^T) x
         = x^T (I - H(H^T H)^{-1}H^T) x.    (8.11)

The last step results from the fact that I - H(H^T H)^{-1}H^T is an idempotent matrix, or it has the property A² = A. Other forms for Jmin are

    Jmin = x^T x - x^T H(H^T H)^{-1}H^T x    (8.12)
         = x^T (x - Hθ̂).    (8.13)

An extension of the linear LS problem is weighted LS. Instead of minimizing (8.9), we include an N x N positive definite (and by definition therefore symmetric) weighting matrix W, so that

    J(θ) = (x - Hθ)^T W (x - Hθ).    (8.14)

If, for instance, W is diagonal with diagonal elements [W]ii = wi > 0, then the LS error for Example 8.1 will be

    J(A) = Σ_{n=0}^{N-1} wn (x[n] - A)².

The rationale for introducing weighting factors into the error criterion is to emphasize the contributions of those data samples that are deemed to be more reliable. Again, considering Example 8.1, if x[n] = A + w[n], where w[n] is zero mean uncorrelated noise with variance σn², then it is reasonable to choose wn = 1/σn².
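A minimal numerical check of (8.10)-(8.13) (illustrative Python, numpy assumed; the line-fit H below is chosen only as an example and anticipates Section 8.6):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
n = np.arange(N)
H = np.column_stack([np.ones(N), n])          # s[n] = A + B n
theta_true = np.array([1.0, 0.05])
x = H @ theta_true + 0.1 * rng.standard_normal(N)

# LSE (8.10): solve the normal equations H^T H theta = H^T x
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)

# The three forms of Jmin, (8.11)-(8.13), agree
P = H @ np.linalg.solve(H.T @ H, H.T)         # H (H^T H)^{-1} H^T
J11 = x @ (np.eye(N) - P) @ x
J12 = x @ x - x @ H @ theta_hat
J13 = x @ (x - H @ theta_hat)
print(theta_hat, J11, J12, J13)
```

In practice one would use a numerically safer routine such as `numpy.linalg.lstsq` rather than forming the normal equations explicitly, but the solve above mirrors the textbook derivation.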
This choice will result in the estimator (see Problem 8.8)

    Â = ( Σ_{n=0}^{N-1} x[n]/σn² ) / ( Σ_{n=0}^{N-1} 1/σn² ).    (8.15)

This familiar estimator is of course the BLUE since the w[n]'s are uncorrelated so that W = C^{-1} (see Example 6.2). The general form of the weighted LSE is readily shown to be

    θ̂ = (H^T W H)^{-1} H^T W x.    (8.16)

8.5 Geometrical Interpretations

We now reexamine the linear LS approach from a geometrical perspective. This has the advantage of more clearly revealing the essence of the approach and leads to additional useful properties and insights into the estimator. Recall the general signal model s = Hθ. If we denote the columns of H by hi, we have

    s = Hθ = Σ_{i=1}^{p} θi hi

so that the signal model is seen to be a linear combination of the "signal" vectors {h1, h2, ..., hp}.

Example 8.4 - Fourier Analysis

Referring to Example 4.2 (with M = 1), we suppose the signal model to be

    s[n] = a cos 2πf0n + b sin 2πf0n,    n = 0, 1, ..., N-1

where f0 is a known frequency and θ = [a b]^T is to be estimated. Then, in vector form we have

    [ s[0]   ]   [ 1               0              ]
    [ s[1]   ] = [ cos 2πf0        sin 2πf0       ] [ a ]    (8.18)
    [  ...   ]   [  ...              ...          ] [ b ]
    [ s[N-1] ]   [ cos 2πf0(N-1)   sin 2πf0(N-1)  ]

It is seen that the columns of H are composed of the samples of the cosinusoidal and sinusoidal sequences. Alternatively, since

    h1 = [1  cos 2πf0  ...  cos 2πf0(N-1)]^T
    h2 = [0  sin 2πf0  ...  sin 2πf0(N-1)]^T

the signal model is s = a h1 + b h2. If we further define the Euclidean length of an N x 1 vector ξ = [ξ1 ξ2 ... ξN]^T as

    ||ξ|| = √( Σ_{i=1}^{N} ξi² )

then the LS error can be written as

    J(θ) = ||x - Hθ||² = || x - Σ_{i=1}^{p} θi hi ||².    (8.19)

We now see that the linear LS approach attempts to minimize the square of the distance from the data vector x to a signal vector Σ_{i=1}^{p} θi hi, which must be a linear combination of the columns of H. The data vector can lie anywhere in an N-dimensional space, termed R^N, while all possible signal vectors, being linear combinations of p < N vectors, must lie in a p-dimensional subspace of R^N, termed S^p. (The full rank of H assumption assures us that the columns are linearly independent and hence the subspace spanned is truly p-dimensional.) For N = 3 and p = 2 we illustrate this in Figure 8.2. Note that all possible choices of θ1, θ2 (where we assume -∞ < θ1 < ∞ and -∞ < θ2 < ∞) produce signal vectors constrained to lie in the subspace S² and that in general x does not lie in the subspace. It should be intuitively clear that the vector ŝ that lies in S² and
that is closest to x in the Euclidean sense is the component of x in S². Alternatively, ŝ is the orthogonal projection of x onto S². This means that the error vector x - ŝ must be orthogonal to all vectors in S². Two vectors x, y in R^N are defined to be orthogonal if x^T y = 0. To actually determine ŝ for this example we use the orthogonality condition. This says that the error vector is orthogonal to the signal subspace or

    (x - ŝ) ⊥ S²

where ⊥ denotes orthogonal (or perpendicular). For this to be true we must have

    (x - ŝ) ⊥ h1
    (x - ŝ) ⊥ h2

since then the error vector will be orthogonal to any linear combination of h1 and h2. Using the definition of orthogonality, we have

    (x - ŝ)^T h1 = 0
    (x - ŝ)^T h2 = 0

or

    (x - θ̂1 h1 - θ̂2 h2)^T h1 = 0
    (x - θ̂1 h1 - θ̂2 h2)^T h2 = 0.

In matrix form this is

    (x - Hθ̂)^T h1 = 0
    (x - Hθ̂)^T h2 = 0

or

    H^T (x - Hθ̂) = 0.    (8.20)

Finally, we have as our LSE

    θ̂ = (H^T H)^{-1} H^T x.

Note that if ε = x - Hθ̂ denotes the error vector, then the LSE is found from (8.20) by invoking the condition

    H^T ε = 0.    (8.21)

The error vector must be orthogonal to the columns of H. This is the well-known orthogonality principle. In effect, the error represents the part of x that cannot be described by the signal model. A similar orthogonality principle will arise in Chapter 12 in our study of estimation of random parameters. Figure 8.2 illustrates the geometrical viewpoint of linear least squares in R³: (a) the signal subspace spanned by {h1, h2} = S², and (b) the orthogonal projection that determines the signal estimate.

Again referring to Figure 8.2b, the minimum LS error is ||x - ŝ||² or

    Jmin = ||x - Hθ̂||² = (x - Hθ̂)^T (x - Hθ̂).

In evaluating this error we can make use of (8.21) (we have already done so for the scalar case in arriving at (8.5)). This produces

    Jmin = (x - Hθ̂)^T (x - Hθ̂)
         = x^T (x - Hθ̂) - θ̂^T H^T (x - Hθ̂)
         = x^T x - x^T Hθ̂
         = x^T (I - H(H^T H)^{-1}H^T) x.    (8.22)

In summary, the LS approach can be interpreted as the problem of fitting or approximating a data vector x in R^N by another vector ŝ, which is a linear combination of vectors {h1, h2, ..., hp} that lie in a p-dimensional subspace of R^N. The problem is solved by choosing ŝ in the subspace to be the orthogonal projection of x. Many of our intuitive notions about vector geometry may be used to our advantage once this connection is made. We now discuss some of these consequences.

Referring to Figure 8.3a, if it had happened that h1 and h2 were orthogonal, then ŝ could have easily been found. This is because the component of ŝ along h1 or ŝ1 does not contain a component of ŝ along h2. If it did, then we would have the situation in Figure 8.3b. Making the orthogonality assumption and also assuming that ||h1|| = ||h2|| = 1 (orthonormal vectors), we have

    ŝ = ŝ1 + ŝ2 = (h1^T x) h1 + (h2^T x) h2
where hi^T x is the length of the vector x along hi. In matrix notation this is

    ŝ = H H^T x

so that θ̂ = H^T x. This result is due to the orthonormal columns of H. No inversion is necessary. An example follows.

Example 8.5 - Fourier Analysis (continued)

Continuing Example 8.4, if f0 = k/N, where k is an integer taking on any of the values k = 1, 2, ..., N/2 - 1, it is easily shown (see (4.13)) that

    h1^T h2 = 0

and also

    h1^T h1 = N/2,    h2^T h2 = N/2

so that h1 and h2 are orthogonal but not orthonormal. Combining these results produces H^T H = (N/2) I, and therefore

    θ̂ = [â  b̂]^T = (2/N) [h1^T x  h2^T x]^T.

If we had instead used the signal model

    s[n] = a' √(2/N) cos( 2π (k/N) n ) + b' √(2/N) sin( 2π (k/N) n )

then the columns of H would have been orthonormal.

In general, the columns of H will not be orthogonal, so that the signal vector estimate is obtained as

    ŝ = Hθ̂ = H(H^T H)^{-1}H^T x.

The signal estimate is the orthogonal projection of x onto the p-dimensional subspace. The N x N matrix P = H(H^T H)^{-1}H^T is known as the orthogonal projection matrix or just the projection matrix. It has the properties

1. P^T = P, symmetric
2. P² = P, idempotent.

That the projection matrix must be symmetric is shown in Problem 8.11; that it must be idempotent follows from the observation that if P is applied to Px, then the same vector must result since Px is already in the subspace. Additionally, the projection matrix must be singular (for independent columns of H it has rank p, as shown in Problem 8.12). If it were not, then x could be recovered from ŝ, which is clearly impossible since many x's have the same projection, as shown in Figure 8.4 (vectors with the same projection).
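The projection-matrix properties and the orthogonality principle (8.21) can be verified numerically. A short illustrative Python check (numpy assumed), using the Fourier pair of Example 8.4 with f0 = k/N so that the columns are orthogonal:

```python
import numpy as np

N, k = 16, 3
n = np.arange(N)
f0 = k / N
H = np.column_stack([np.cos(2 * np.pi * f0 * n),
                     np.sin(2 * np.pi * f0 * n)])

# H^T H = (N/2) I: the columns are orthogonal (Example 8.5)
print(np.allclose(H.T @ H, (N / 2) * np.eye(2)))    # True

P = H @ np.linalg.inv(H.T @ H) @ H.T                # projection matrix
print(np.allclose(P, P.T), np.allclose(P @ P, P))   # symmetric, idempotent

rng = np.random.default_rng(0)
x = rng.standard_normal(N)
eps = x - P @ x                                     # error vector
print(np.allclose(H.T @ eps, 0))                    # orthogonality principle
```

Note that P has rank p = 2 here, so it is singular for N > 2, consistent with the discussion above.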
Likewise, the error vector ε = x - ŝ = (I - P)x is the projection of x onto the complement subspace or the subspace orthogonal to the signal subspace. The matrix P⊥ = I - P is also a projection matrix, as can be easily verified from the properties given above. As a result, the minimum LS error is, from (8.22),

    Jmin = x^T (I - P) x = x^T P⊥ x = x^T P⊥^T P⊥ x = ||P⊥ x||².

In the next section we further utilize the geometrical theory to derive an order-recursive LS solution.

8.6 Order-Recursive Least Squares

In many cases the signal model is unknown and must be assumed. For example, consider the experimental data shown in Figure 8.5a. The following models might be assumed:

    s1(t) = A
    s2(t) = A + Bt

for 0 ≤ t ≤ T. If data x[n] are obtained by sampling x(t) at times t = nΔ, where Δ = 1 and n = 0, 1, ..., N-1, the corresponding discrete-time signal models become

    s1[n] = A
    s2[n] = A + Bn.

Figure 8.5 shows experimental data fitting by least squares: (a) the experimental data, (b) a one-parameter fit s1(t) = A1, and (c) a two-parameter fit.
Using a L8E with 0.5.
0.0 +1--r----r----r---"T--r-----r-----r-----,---r--~
[:1 1 o 10 20 30 40 50 60 70 80 90 100
H2 =[: : Time, t
l I N -1
Figure 8.5 Experimental data fitting by least squares
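The projection form of the minimum LS error and the line model above can be checked numerically. This is an illustrative sketch, not from the text; the data are synthetic, generated from s(t) = 1 + 0.03t plus unit-variance noise as in the figure discussion.

```python
import numpy as np

# Illustrative sketch: J_min = x^T (I - P) x = ||x - H theta_hat||^2
# for the line model s[n] = A + B n. Data are synthetic.
rng = np.random.default_rng(0)
N = 100
n = np.arange(N)
x = 1.0 + 0.03 * n + rng.standard_normal(N)

H = np.column_stack([np.ones(N), n.astype(float)])
theta = np.linalg.solve(H.T @ H, H.T @ x)      # LSE
resid = x - H @ theta

P = H @ np.linalg.inv(H.T @ H) @ H.T
Pperp = np.eye(N) - P
Jmin_proj = x @ Pperp @ x                      # x^T P_perp x
Jmin_direct = resid @ resid                    # direct residual energy
```

Both expressions give the same minimum LS error, as the projection interpretation requires.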
would produce the estimates for the intercept and slope as

    Â1 = x̄                                                        (8.23)

and (see Problem 8.13)

    Â2 = [2(2N-1)/(N(N+1))] Σ_{n=0}^{N-1} x[n] - [6/(N(N+1))] Σ_{n=0}^{N-1} n x[n]
    B̂2 = -[6/(N(N+1))] Σ_{n=0}^{N-1} x[n] + [12/(N(N²-1))] Σ_{n=0}^{N-1} n x[n].   (8.24)

In Figures 8.5b and 8.5c we have plotted ŝ1(t) = Â1 and ŝ2(t) = Â2 + B̂2 t, where T = 100. The fit using two parameters is better, as expected. We will later show that the minimum LS error must decrease as we add more parameters. The question might be asked whether or not we should add a quadratic term to the signal model. If we did, the fit would become better yet. Realizing that the data are subject to error, we may very well be fitting the noise. Such a situation is undesirable but in the absence of a known signal model may be unavoidable to some extent. In practice, we choose the simplest signal model that adequately describes the data. One such scheme might be to increase the order of the polynomial until the minimum LS error decreases only slightly as the order is increased. For the data in Figure 8.5 the minimum LS error versus the number of parameters is shown in Figure 8.6. Models s1(t) and s2(t) correspond to k = 1 and k = 2, respectively. It is seen that a large drop occurs at k = 2. For larger orders there is only a slight decrease, indicating a modeling of the noise. In this example the signal was actually

    s(t) = 1 + 0.03t

with noise variance σ², so that when the true order is reached, J_min ≈ Nσ² (here about 10). This is verified in Figure 8.6 and also increases our confidence in the chosen model.

[Figure 8.6 Effect of chosen number of parameters on minimum least squares error: J_min versus number of parameters k]

In the preceding example we saw that it was important to be able to determine the LSE for several signal models. A straightforward approach would compute the LSE for each model using (8.10). Alternatively, the computation may be reduced by using an order-recursive LS approach. In this method we update the LSE in order. Specifically, we are able to compute the LSE based on an H of dimension N × (k + 1) from the solution based on an H of dimension N × k. For the previous example this update in order would have been exceedingly simple had the columns of H2 been orthogonal. To see why, assume the signal models to be

    s1[n] = A
    s2[n] = A + Bn

for n = -M, ..., 0, ..., M, so that

    H2 = [ 1    -M
           1  -(M-1)
           ⋮     ⋮
           1     M ].

The LSE is readily found since

    H2^T H2 = [ 2M+1          0
                  0    Σ_{n=-M}^{M} n² ]

is a diagonal matrix. The solutions are

    Â1 = [1/(2M+1)] Σ_{n=-M}^{M} x[n]

for the one-parameter model and

    Â2 = [1/(2M+1)] Σ_{n=-M}^{M} x[n]
    B̂2 = Σ_{n=-M}^{M} n x[n] / Σ_{n=-M}^{M} n²

for the two-parameter model. In this case the LSE of A does not change as we add a parameter to the model. This follows algebraically from the diagonal nature of H2^T H2. In geometric terms this result is readily apparent from Figure 8.3a. The orthogonality of the h_i's allows us to project x along h1 and h2 separately and then add the results. Each projection is independent of the other.

In general, the column vectors will not be orthogonal but can be replaced by another set of p vectors that are orthogonal. The procedure for doing so is shown in Figure 8.7 and is called a Gram-Schmidt orthogonalization. The new column h2 is projected onto the subspace orthogonal to h1. Then, since h1 and h2' are orthogonal, the LS signal estimate becomes

    ŝ = h1 θ̂1 + h2' θ̂2'

where h1 θ̂1 is also the LSE of the signal based on H = h1 only. This procedure recursively adds terms to the signal model; it updates the order. In Problem 8.14 this geometrical viewpoint is used together with a Gram-Schmidt orthogonalization to derive the order-update equations. A purely algebraic derivation is given in Appendix 8A. It is now summarized.

Denote the N × k observation matrix as H_k, and the LSE based on H_k as θ̂_k, so that

    θ̂1 = (H1^T H1)^{-1} H1^T x                                    (8.25)

    J_min_1 = (x - H1 θ̂1)^T (x - H1 θ̂1).                          (8.26)

The order update for the LSE is

    θ̂_{k+1} = [ θ̂_k - D_k H_k^T h_{k+1} (h_{k+1}^T P_k⊥ x)/(h_{k+1}^T P_k⊥ h_{k+1})
                     (h_{k+1}^T P_k⊥ x)/(h_{k+1}^T P_k⊥ h_{k+1}) ]                    (8.28)

where

    P_k⊥ = I - H_k (H_k^T H_k)^{-1} H_k^T                          (8.29)

is the projection matrix onto the subspace orthogonal to that spanned by the columns of H_k. To avoid inverting H_k^T H_k we let D_k = (H_k^T H_k)^{-1}, which is itself updated as

    D_{k+1} = [ D_k + (D_k H_k^T h_{k+1} h_{k+1}^T H_k D_k)/(h_{k+1}^T P_k⊥ h_{k+1})    -(D_k H_k^T h_{k+1})/(h_{k+1}^T P_k⊥ h_{k+1})
                -(h_{k+1}^T H_k D_k)/(h_{k+1}^T P_k⊥ h_{k+1})                            1/(h_{k+1}^T P_k⊥ h_{k+1}) ]   (8.30)

and the minimum LS error is updated as

    J_min_{k+1} = J_min_k - (h_{k+1}^T P_k⊥ x)² / (h_{k+1}^T P_k⊥ h_{k+1}).   (8.31)

The entire algorithm requires no matrix inversions. The recursion begins by determining θ̂1, J_min_1, and D1 using (8.25), (8.26), and (8.29), respectively. We term (8.28), (8.30), and (8.31) the order-recursive least squares method. To illustrate the computations involved we apply the method to the previous line fitting example.

Example 8.6 - Line Fitting

Since s1[n] = A1 and s2[n] = A2 + B2 n for n = 0, 1, ..., N - 1, we have

    H1 = [1 1 ... 1]^T,   h2 = [0 1 ... N-1]^T.

From (8.25) and (8.26)

    Â1 = θ̂1 = (H1^T H1)^{-1} H1^T x = x̄

    J_min_1 = Σ_{n=0}^{N-1} (x[n] - x̄)².

Applying the order updates (8.28) and (8.31) then yields

    B̂2 = [Σ_{n=0}^{N-1} n x[n] - x̄ Σ_{n=0}^{N-1} n] / [Σ_{n=0}^{N-1} n² - (1/N)(Σ_{n=0}^{N-1} n)²]

    Â2 = x̄ - (1/N)(Σ_{n=0}^{N-1} n) B̂2 = x̄ - [(N-1)/2] B̂2

    J_min_2 = J_min_1 - [Σ_{n=0}^{N-1} n x[n] - x̄ Σ_{n=0}^{N-1} n]² / [Σ_{n=0}^{N-1} n² - (1/N)(Σ_{n=0}^{N-1} n)²].   ◊
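As a numerical check (not part of the original development), the error update (8.31) can be verified against the batch minimum LS errors for the line fitting example. The data here are synthetic and the constants arbitrary.

```python
import numpy as np

# Illustrative sketch: verify the order update (8.31) for the line model.
# Adding column h2 reduces J_min by (h2^T P1_perp x)^2 / (h2^T P1_perp h2).
rng = np.random.default_rng(1)
N = 50
n = np.arange(N)
x = 1.0 + 0.03 * n + 0.3 * rng.standard_normal(N)

def jmin(H, x):
    """Batch minimum LS error (x - H theta)^T (x - H theta)."""
    theta = np.linalg.lstsq(H, x, rcond=None)[0]
    r = x - H @ theta
    return r @ r

H1 = np.ones((N, 1))                      # k = 1 model: s[n] = A
h2 = n.astype(float)                      # new column for s[n] = A + B n
P1perp = np.eye(N) - H1 @ np.linalg.inv(H1.T @ H1) @ H1.T

J1 = jmin(H1, x)
J2_rec = J1 - (h2 @ P1perp @ x) ** 2 / (h2 @ P1perp @ h2)   # (8.31)
J2_batch = jmin(np.column_stack([H1, h2]), x)               # direct solution
```

The recursive and batch errors agree, and the error cannot increase with order, consistent with (8.32).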
In solving for the LSE of the parameters of the second-order model we first solved for the LSE of the first-order model. In general, the recursive procedure solves successive LS problems until the desired order is attained. Hence, the recursive order-update solution not only determines the LSE for the desired model but also determines the LSE for all lower-order models as well. As discussed previously, this property is useful when the model order is not known a priori.

Several interesting observations can be made based on (8.28).

1. If the new column h_{k+1} is orthogonal to all the previous ones, then H_k^T h_{k+1} = 0 and from (8.28)

       θ̂_{k+1} = [ θ̂_k
                   (h_{k+1}^T x)/(h_{k+1}^T h_{k+1}) ]

   or the LSE for the first k components of θ̂_{k+1} remains the same. This generalizes the geometrical illustration in Figure 8.3a.

2. The term P_k⊥ x = (I - H_k(H_k^T H_k)^{-1} H_k^T)x = x - H_k θ̂_k is the LS error vector or the data residual that cannot be modeled by the columns of H_k. It represents the component of x orthogonal to the space spanned by the columns of H_k (see Problem 8.17). We can view P_k⊥ x, the residual, as the part of x yet to be modeled.

3. If h_{k+1} is nearly in the space spanned by the columns of H_k, then, as shown in Figure 8.8, P_k⊥ h_{k+1}, the new information in h_{k+1}, will be small. Consequently,

       h_{k+1}^T P_k⊥ h_{k+1} = h_{k+1}^T P_k⊥^T P_k⊥ h_{k+1} = ||P_k⊥ h_{k+1}||² ≈ 0

   and the recursive procedure will "blow up" (see (8.28)). In essence, since h_{k+1} nearly lies in the subspace spanned by the columns of H_k, the new observation matrix H_{k+1} will be nearly of rank k and hence H_{k+1}^T H_{k+1} will be nearly singular. In practice, we could monitor the term h_{k+1}^T P_k⊥ h_{k+1} and exclude from the recursion those column vectors that produce small values of this term.

4. The minimum LS error can also be written in a more suggestive fashion, indicating the contribution of the new parameter in reducing the error. From (8.22) and (8.31) we have

       J_min_{k+1} = x^T P_k⊥ x - (h_{k+1}^T P_k⊥ x)²/(h_{k+1}^T P_k⊥ h_{k+1})
                   = x^T P_k⊥ x [1 - (h_{k+1}^T P_k⊥ x)²/(x^T P_k⊥ x · h_{k+1}^T P_k⊥ h_{k+1})]
                   = J_min_k (1 - r²_{k+1})                                       (8.32)

   where

       r²_{k+1} = [(P_k⊥ h_{k+1})^T (P_k⊥ x)]² / (||P_k⊥ h_{k+1}||² ||P_k⊥ x||²).   (8.33)

   Letting (x, y) = x^T y denote the inner product in R^N, r²_{k+1} is seen to be the square of a correlation coefficient and as such has the property 0 ≤ r²_{k+1} ≤ 1. Intuitively, P_k⊥ x represents the residual or error in modeling x based on k parameters, while P_k⊥ h_{k+1} represents the new model information contributed by the (k+1)st parameter (see Figure 8.8). For instance, if P_k⊥ x and P_k⊥ h_{k+1} are collinear, then r²_{k+1} = 1 and J_min_{k+1} = 0. This says that the part of x that could not be modeled by the columns of H_k can be perfectly modeled by h_{k+1}. Assuming that r_{k+1} ≠ 0, the expression of (8.32) also shows that the minimum LS error monotonically decreases with order, as in Figure 8.6.

5. Recall that the LS signal estimate is ŝ = Hθ̂ = Px. If we define the unit length vector

       u_{k+1} = (I - P_k) h_{k+1} / ||(I - P_k) h_{k+1}|| = P_k⊥ h_{k+1} / ||P_k⊥ h_{k+1}||

   then the recursive projection operator becomes

       P_{k+1} = P_k + u_{k+1} u_{k+1}^T.                                          (8.34)

   This is termed the recursive orthogonal projection matrix. See Problem 8.18 for its utility in determining the recursive formula for the minimum LS error. Here u_{k+1} points in the direction of the new information (see Figure 8.8), and by letting ŝ_k = P_k x, we have

       ŝ_{k+1} = P_{k+1} x = P_k x + (u_{k+1}^T x) u_{k+1} = ŝ_k + (u_{k+1}^T x) u_{k+1}.

8.7 Sequential Least Squares

In many signal processing applications the received data are obtained by sampling a continuous-time waveform. Such data are on-going in that as time progresses, more data become available. We have the option of either waiting for all the available data or, as we now describe, of processing the data sequentially in time. This yields a sequence of LSEs in time. Specifically, assume we have determined the LSE θ̂ based on {x[0], x[1], ..., x[N-1]}. If we now observe x[N], can we update θ̂ (in time) without having to re-solve the linear equations of (8.10)? The answer is yes, and the procedure is termed sequential least squares to distinguish it from our original approach, in which we processed all the data at once, termed the batch approach.

Consider Example 8.1 in which the DC signal level is to be estimated. We saw that the LSE is

    Â[N-1] = (1/N) Σ_{n=0}^{N-1} x[n]

where the argument of Â denotes the index of the most recent data point observed. If we now observe the new data sample x[N], then the LSE becomes Â[N] = [1/(N+1)] Σ_{n=0}^{N} x[n]. In computing this new estimator we do not have to recompute the sum of the observations since

    Â[N] = [1/(N+1)] (Σ_{n=0}^{N-1} x[n] + x[N])
         = [N/(N+1)] Â[N-1] + [1/(N+1)] x[N].                                      (8.35)

The new LSE is found by using the previous one and the new observation. The sequential approach also lends itself to an interesting interpretation. Rearranging (8.35), we have

    Â[N] = Â[N-1] + [1/(N+1)] (x[N] - Â[N-1]).                                     (8.36)

The new estimate is equal to the old one plus a correction term. The correction term decreases with N, reflecting the fact that the estimate Â[N-1] is based on many more data samples and therefore should be given more weight. Also, x[N] - Â[N-1] can be thought of as the error in predicting x[N] by the previous samples, which are summarized by Â[N-1]. If this error is zero, then no correction takes place for that update. Otherwise, the new estimate differs from the old one.
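The scalar update (8.36) can be checked in a few lines; this sketch is not from the text, and the DC level and noise settings are arbitrary.

```python
import numpy as np

# Illustrative sketch: the sequential update (8.36),
# A_hat[N] = A_hat[N-1] + (x[N] - A_hat[N-1]) / (N + 1),
# reproduces the batch sample mean at every step.
rng = np.random.default_rng(2)
x = 10.0 + rng.standard_normal(200)        # DC level A = 10 in unit-variance noise

A_hat = x[0]                               # initialize with the first sample
seq = [A_hat]
for N in range(1, len(x)):
    A_hat = A_hat + (x[N] - A_hat) / (N + 1)   # correction shrinks with N
    seq.append(A_hat)

batch = np.cumsum(x) / np.arange(1, len(x) + 1)  # batch means for comparison
```

Each sequential estimate equals the corresponding batch mean, so no re-solution of (8.10) is needed.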
The minimum LS error may also be computed recursively. Based on the data samples up to time N - 1, the error is

    J_min[N-1] = Σ_{n=0}^{N-1} (x[n] - Â[N-1])²

and thus, using (8.36),

    J_min[N] = Σ_{n=0}^{N} (x[n] - Â[N])²
             = Σ_{n=0}^{N-1} [x[n] - Â[N-1] - (1/(N+1))(x[N] - Â[N-1])]² + (x[N] - Â[N])²
             = J_min[N-1] - (2/(N+1)) Σ_{n=0}^{N-1} (x[n] - Â[N-1])(x[N] - Â[N-1])
               + [N/(N+1)²](x[N] - Â[N-1])² + (x[N] - Â[N])².

Noting that the middle term on the right-hand side is zero (since Σ_{n=0}^{N-1}(x[n] - Â[N-1]) = 0) and that x[N] - Â[N] = [N/(N+1)](x[N] - Â[N-1]) from (8.36), we have after some simplification

    J_min[N] = J_min[N-1] + [N/(N+1)] (x[N] - Â[N-1])².

The apparent paradoxical behavior of an increase in the minimum LS error is readily explained if we note that with each new data point, the number of squared error terms increases. Thus, we need to fit more points with the same number of parameters.

A more interesting example of a sequential LS approach arises in the weighted LS problem. For the present example, if the weighting matrix W is diagonal, with [W]_ii = 1/σ²_i, then the weighted LSE is, from (8.15),

    Â[N-1] = [Σ_{n=0}^{N-1} x[n]/σ²_n] / [Σ_{n=0}^{N-1} 1/σ²_n].

To find the sequential version we write

    Â[N] = [Σ_{n=0}^{N-1} x[n]/σ²_n + x[N]/σ²_N] / [Σ_{n=0}^{N} 1/σ²_n]

or finally

    Â[N] = Â[N-1] + [(1/σ²_N)/(Σ_{n=0}^{N} 1/σ²_n)] (x[N] - Â[N-1]).               (8.37)

As expected, if σ²_n = σ² for all n, we have our previous result. The gain factor that multiplies the correction term now depends on our confidence in the new data sample. If the new sample is noisy, or σ²_N → ∞, we do not correct the previous LSE. On the other hand, if the new sample is noise-free, or σ²_N → 0, then Â[N] → x[N]; we discard all the previous samples. The gain factor thus represents our confidence in the new data sample relative to the previous ones. We can make a further interpretation of our results if indeed x[n] = A + w[n], where w[n] is zero mean uncorrelated noise with variance σ²_n. Then, we know from Chapter 6 (see Example 6.2) that the LSE is actually the BLUE, and therefore

    var(Â[N-1]) = 1 / [Σ_{n=0}^{N-1} 1/σ²_n].
To find the sequential version of the gain, note that the gain factor for the Nth correction is, from (8.37),

    K[N] = (1/σ²_N) / [Σ_{n=0}^{N} 1/σ²_n]
         = (1/σ²_N) / [1/σ²_N + 1/var(Â[N-1])]
         = var(Â[N-1]) / [var(Â[N-1]) + σ²_N].                                     (8.38)

Since 0 ≤ K[N] ≤ 1, the correction is large if K[N] is large, which occurs when var(Â[N-1]) is large relative to σ²_N. Likewise, if the variance of the previous estimator is small, then so is the correction. Further expressions may be developed for determining the gain recursively, since K[N] depends only on var(Â[N-1]) and σ²_N, and the variance itself may be updated as

    var(Â[N]) = 1 / [Σ_{n=0}^{N} 1/σ²_n]
              = 1 / [1/var(Â[N-1]) + 1/σ²_N]
              = var(Â[N-1]) σ²_N / [var(Â[N-1]) + σ²_N]
              = [1 - var(Â[N-1])/(var(Â[N-1]) + σ²_N)] var(Â[N-1])

or finally

    var(Â[N]) = (1 - K[N]) var(Â[N-1]).                                            (8.39)
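As a numerical check (not from the text), the gain and variance recursions (8.38) and (8.39) can be compared with the BLUE variance 1/Σ(1/σ²_n); the variances used are arbitrary.

```python
import numpy as np

# Illustrative sketch of (8.38)-(8.39): recursing
#   K[N]   = var / (var + sig2[N])
#   var   <- (1 - K[N]) * var
# reproduces var(A_hat[N]) = 1 / sum(1/sig2[n]).
rng = np.random.default_rng(4)
M = 50
sig2 = rng.uniform(0.5, 2.0, M)

var = sig2[0]                          # var(A_hat[0]) = sig2[0]
for N in range(1, M):
    K = var / (var + sig2[N])          # (8.38)
    var = (1.0 - K) * var              # (8.39)

var_batch = 1.0 / np.sum(1.0 / sig2)   # BLUE variance based on all M samples
```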
To find K[N] recursively we can use (8.38) and (8.39), as summarized below. In the process of recursively finding the gain, the variance of the LSE is also found recursively. Summarizing our results, we have

Estimator Update:

    Â[N] = Â[N-1] + K[N] (x[N] - Â[N-1])                                           (8.40)

where

    K[N] = var(Â[N-1]) / [var(Â[N-1]) + σ²_N]                                      (8.41)

Variance Update:

    var(Â[N]) = (1 - K[N]) var(Â[N-1]).                                            (8.42)

To start the recursion we use

    Â[0] = x[0]
    var(Â[0]) = σ²_0.

Then, K[1] is found from (8.41), and Â[1] from (8.40). Next, var(Â[1]) is determined from (8.42). Continuing in this manner, we find K[2], Â[2], var(Â[2]), etc. An example of the sequential LS approach is shown in Figure 8.9 for A = 10, σ²_n = 1 using a Monte Carlo computer simulation.

[Figure 8.9 Sequential least squares for DC level in white Gaussian noise: (a) gain K[N], (b) variance var(Â[N]), (c) estimate Â[N], each versus current sample N]

The variance and gain sequences have been computed
recursively from (8.42) and (8.41), respectively. Note that they decrease to zero since

    var(Â[N]) = 1 / [Σ_{n=0}^{N} 1/σ²_n] = 1/(N+1)

and from (8.41)

    K[N] = (1/N) / (1/N + 1) = 1/(N+1).

Also, as seen in Figure 8.9c, the estimate appears to be converging to the true value of A = 10. This is in agreement with the variance approaching zero. Finally, it can be shown (see Problem 8.19) that the minimum LS error can be computed recursively as

    J_min[N] = J_min[N-1] + (x[N] - Â[N-1])² / [var(Â[N-1]) + σ²_N].               (8.43)

[Figure 8.10 Sequential least squares estimator: block diagram with inputs σ²_n, h[n], Σ[n-1]]
where C is the covariance matrix of the noise. If C is diagonal, that is, the noise is uncorrelated, then θ̂ may be computed sequentially in time, but otherwise not. Assuming this condition to hold, let

    C[n] = diag(σ²_0, σ²_1, ..., σ²_n)

    H[n] = [ H[n-1] ]   (n × p)
           [ h^T[n] ]   (1 × p)

    x[n] = [x[0] x[1] ... x[n]]^T

and denote the weighted LSE of θ based on x[n], or the (n + 1) data samples, as θ̂[n]. Then, the batch estimator is

    θ̂[n] = (H^T[n] C^{-1}[n] H[n])^{-1} H^T[n] C^{-1}[n] x[n]                      (8.44)

with covariance

    Σ[n] = (H^T[n] C^{-1}[n] H[n])^{-1}.                                           (8.45)

The sequential equations are

Estimator Update:

    θ̂[n] = θ̂[n-1] + K[n] (x[n] - h^T[n] θ̂[n-1])                                   (8.46)

where

    K[n] = Σ[n-1] h[n] / (σ²_n + h^T[n] Σ[n-1] h[n])                               (8.47)

Covariance Update:

    Σ[n] = (I - K[n] h^T[n]) Σ[n-1].                                               (8.48)

The gain factor K[n] is a p × 1 vector, and the covariance matrix Σ[n] has dimension p × p. It is of great interest that no matrix inversions are required. The estimator update is summarized in Figure 8.10, where the thick arrows indicate vector processing. To start the recursion we need to specify initial values for θ̂[n-1] and Σ[n-1], so that K[n] can be determined from (8.47) and then θ̂[n] from (8.46). In deriving (8.46)-(8.48) it was assumed that θ̂[n-1] and Σ[n-1] were available, or that H^T[n-1] C^{-1}[n-1] H[n-1] was invertible as per (8.44) and (8.45). For this to be invertible H[n-1] must have rank greater than or equal to p. Since H[n-1] is an n × p matrix, we must have n ≥ p (assuming all its columns are linearly independent). Hence, the sequential LS procedure normally determines θ̂[p-1] and Σ[p-1] using the batch estimator (8.44) and (8.45), and then employs the sequential equations (8.46)-(8.48) for n ≥ p. A second method of initializing the recursion is to assign values for θ̂[-1] and Σ[-1]. Then, the sequential LS estimator is computed for n ≥ 0. This has the effect of biasing the estimator toward θ̂[-1]. Typically, to minimize the biasing effect we choose Σ[-1] to be large (little confidence in θ̂[-1]), or Σ[-1] = αI, where α is large, and also θ̂[-1] = 0. The LSE for n ≥ p will be the same as when the batch estimator is used for initialization if α → ∞ (see Problem 8.23). In the next example we show how to set up the equations. In Example 8.13 we apply this procedure to a signal processing problem.
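A compact numerical check (not from the text) of the sequential equations (8.46)-(8.48) and the error recursion (8.49), using batch initialization on the first p samples. The line model, equal noise variances, and constants are arbitrary choices.

```python
import numpy as np

# Illustrative sketch: sequential LS (8.46)-(8.48) with batch initialization,
# plus the sequential minimum LS error (8.49), checked against batch results.
rng = np.random.default_rng(5)
N, p = 40, 2
H = np.column_stack([np.ones(N), np.arange(N, dtype=float)])
sig2 = 0.5                                    # equal noise variances
x = H @ np.array([1.0, 0.03]) + np.sqrt(sig2) * rng.standard_normal(N)

# Batch initialization (8.44), (8.45) on n = 0, ..., p-1 (exact fit, Jmin = 0)
Hp = H[:p]
Sigma = sig2 * np.linalg.inv(Hp.T @ Hp)       # Sigma[p-1]
theta = np.linalg.solve(Hp.T @ Hp, Hp.T @ x[:p])
Jmin = 0.0

for n in range(p, N):
    h = H[n]
    denom = sig2 + h @ Sigma @ h
    K = Sigma @ h / denom                     # gain (8.47)
    err = x[n] - h @ theta                    # prediction error
    Jmin += err ** 2 / denom                  # (8.49)
    theta = theta + K * err                   # estimator update (8.46)
    Sigma = (np.eye(p) - np.outer(K, h)) @ Sigma   # covariance update (8.48)

theta_batch = np.linalg.solve(H.T @ H, H.T @ x)
resid = x - H @ theta_batch
Jmin_batch = resid @ resid / sig2             # weighted criterion, equal variances
```

No matrix inversion appears inside the loop; the only inversion is in the batch initialization, as the text notes.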
Example 8.7 - Fourier Analysis

We now continue Example 8.4 in which the signal model is

    s[n] = a cos 2πf₀n + b sin 2πf₀n,   n ≥ 0

and θ = [a b]^T is to be estimated by a sequential LSE. We furthermore assume that the noise is uncorrelated (C[n] must be diagonal for sequential LS to apply) and has equal variance σ² for each data sample. Since there are two parameters to be estimated, we need at least two observations, x[0] and x[1], to initialize the procedure using a batch initialization approach. We compute our first LSE, using (8.44), as

    θ̂[1] = [H^T[1] ((1/σ²)I) H[1]]^{-1} H^T[1] ((1/σ²)I) x[1]
         = (H^T[1] H[1])^{-1} H^T[1] x[1]

where

    H[1] = [ 1          0
             cos 2πf₀   sin 2πf₀ ],   x[1] = [x[0] x[1]]^T.

(H[1] is the 2 × 2 partition of H given in (8.18).) The initial covariance matrix is, from (8.45),

    Σ[1] = [H^T[1] ((1/σ²)I) H[1]]^{-1} = σ² (H^T[1] H[1])^{-1}.

Next we determine θ̂[2]. To do so we first compute the 2 × 1 gain vector from (8.47) as

    K[2] = Σ[1] h[2] / (σ² + h^T[2] Σ[1] h[2])

where h^T[2] is the new row of the H[2] matrix, or

    h^T[2] = [cos 4πf₀   sin 4πf₀].

Once the gain vector has been found, θ̂[2] is determined from (8.46) as

    θ̂[2] = θ̂[1] + K[2] (x[2] - h^T[2] θ̂[1]).

Finally, the 2 × 2 LSE covariance matrix is updated as per (8.48) for use in the next computation of the gain vector, or

    Σ[2] = (I - K[2] h^T[2]) Σ[1].

It should be clear that, in general, computer evaluation of θ̂[n] is necessary. Also, except for the initialization procedure, no matrix inversion is required. Alternatively, we could have avoided even the matrix inversion of the initialization by assuming θ̂[-1] = 0 and Σ[-1] = αI with α large. The recursion then would have begun at n = 0, and for n ≥ 2 we would have the same result as before for large enough α.   ◊

Finally, we remark that if the minimum LS error is desired, it too can be found sequentially as (see Appendix 8C)

    J_min[n] = J_min[n-1] + (x[n] - h^T[n] θ̂[n-1])² / (σ²_n + h^T[n] Σ[n-1] h[n]).  (8.49)

8.8 Constrained Least Squares

At times we are confronted with LS problems whose unknown parameters must be constrained. Such would be the case if we wished to estimate the amplitudes of several signals but knew a priori that some of the amplitudes were equal. Then, the total number of parameters to be estimated should be reduced to take advantage of the a priori knowledge. For this example the parameters are linearly related, leading to a linear least squares problem with linear constraints. This problem is easily solved, as we now show.

Assume that the parameter θ is subject to r < p linear constraints. The constraints must be independent, ruling out the possibility of redundant constraints such as θ₁ + θ₂ = 0, 2θ₁ + 2θ₂ = 0, where θᵢ is the ith element of θ. We summarize the constraints as

    Aθ = b                                                                          (8.50)

where A is a known r × p matrix and b is a known r × 1 vector. If, for instance, p = 2 and one parameter is known to be the negative of the other, then the constraint would be θ₁ + θ₂ = 0. We would then have A = [1 1] and b = 0. It is always assumed that the matrix A is full rank (equal to r), which is necessary for the constraints to be independent. It should be realized that in the constrained LS problem there are actually only (p - r) independent parameters.

To find the LSE subject to the linear constraints we use the technique of Lagrangian multipliers. We determine θ̂_c (c denotes the constrained LSE) by minimizing the Lagrangian

    J_c = (x - Hθ)^T (x - Hθ) + λ^T (Aθ - b).
Expanding and differentiating with respect to θ produces

    θ̂_c = (H^T H)^{-1} H^T x - (H^T H)^{-1} A^T (λ/2) = θ̂ - (H^T H)^{-1} A^T (λ/2)   (8.51)

where θ̂ = (H^T H)^{-1} H^T x is the unconstrained LSE. Imposing the constraint Aθ̂_c = b and hence

    λ/2 = [A (H^T H)^{-1} A^T]^{-1} (Aθ̂ - b).

Substituting into (8.51) produces the solution

    θ̂_c = θ̂ - (H^T H)^{-1} A^T [A (H^T H)^{-1} A^T]^{-1} (Aθ̂ - b).                  (8.52)

The constrained LSE is a corrected version of the unconstrained LSE. If it happens that the constraint is fortuitously satisfied by θ̂, or Aθ̂ = b, then according to (8.52) the estimators are identical. Such is usually not the case, however. We consider a simple example.

Example 8.8 - Constrained Signal

Assume the signal model

    s[n] = θ₁   n = 0
           θ₂   n = 1
           0    n = 2

so that s = [θ₁ θ₂ 0]^T must lie in the plane shown in Figure 8.11a. With

    H = [ 1 0
          0 1
          0 0 ]

the unconstrained LSE is

    θ̂ = [x[0] x[1]]^T

and the signal estimate is

    ŝ = Hθ̂ = [x[0] x[1] 0]^T.

As shown in Figure 8.11a, this is intuitively reasonable. Now assume that we know a priori that θ₁ = θ₂. In terms of (8.50) we have A = [1 -1] and b = 0, and from (8.52)

    θ̂_c = [ ½(x[0] + x[1])
            ½(x[0] + x[1]) ]

and the constrained signal estimate becomes

    ŝ_c = Hθ̂_c = [ ½(x[0] + x[1])
                   ½(x[0] + x[1])
                   0 ]

as illustrated in Figure 8.11b. Since θ₁ = θ₂, we just average the two observations, which is again intuitively reasonable. In this simple problem we could have just as easily incorporated our parameter constraints into the signal model directly to yield

    s[n] = θ   n = 0
           θ   n = 1
           0   n = 2

and estimated θ. This would have produced the same result, as it must. This new model is sometimes referred to as the reduced model [Graybill 1976] for obvious reasons. It is worthwhile to view the constrained LS problem geometrically. Again referring to Figure 8.11b, if θ₁ = θ₂, then s must lie in the constraint subspace shown. The constrained signal estimate ŝ_c may be viewed as the projection of the unconstrained signal estimate ŝ onto the constraint subspace. This accounts for the correction term of (8.52). In fact, (8.52) may be obtained geometrically using projection theory. See also Problem 8.24.   ◊
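The correction form (8.52) can be checked against the reduced-model solution for this example. This sketch is not from the text; the observation values are arbitrary.

```python
import numpy as np

# Illustrative sketch of (8.52) for the constrained-signal example:
# constraint theta1 = theta2, i.e. A = [1 -1], b = 0.
x = np.array([1.0, 3.0, 0.5])
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])              # s = [theta1, theta2, 0]^T
A = np.array([[1.0, -1.0]])
b = np.zeros(1)

G = np.linalg.inv(H.T @ H)
theta = G @ H.T @ x                                    # unconstrained LSE
corr = G @ A.T @ np.linalg.inv(A @ G @ A.T) @ (A @ theta - b)
theta_c = theta - corr                                 # constrained LSE (8.52)

# Reduced model: with theta1 = theta2 = theta, s = theta * [1, 1, 0]^T
h = np.array([1.0, 1.0, 0.0])
theta_reduced = (h @ x) / (h @ h)                      # average of x[0], x[1]
```

Both routes give the average of the first two observations, as the example argues.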
8.9 Nonlinear Least Squares

In Section 8.3 we introduced the nonlinear LS problem. We now investigate this in more detail. Recall that the LS procedure estimates model parameters θ by minimizing the LS error criterion

    J = (x - s(θ))^T (x - s(θ))

where s(θ) is the signal model for x, with its dependence on θ explicitly shown. (Note that if x - s(θ) ~ N(0, σ²I), the LSE is also the MLE.) In the linear LS problem the signal takes on the special form s(θ) = Hθ, which leads to the simple linear LSE. In general, s(θ) cannot be expressed in this manner but is an N-dimensional nonlinear function of θ. In such a case the minimization of J becomes much more difficult, if not impossible. This type of nonlinear LS problem is often termed a nonlinear regression problem in statistics [Bard 1974, Seber and Wild 1989], and much theoretical work on it can be found. Practically, the determination of the nonlinear LSE must be based on iterative approaches and so suffers from the same limitations discussed in Chapter 7 for MLE determination by numerical methods. The preferable method of a grid search is practical only if the dimensionality of θ is small, perhaps p ≤ 5.

Before discussing general methods for determining nonlinear LSEs we first describe two methods that can reduce the complexity of the problem. They are

1. transformation of parameters
2. separability of parameters.

In the first case we seek a one-to-one transformation of θ that produces a linear signal model in the new space. To do so we let

    α = g(θ)

where g is a p-dimensional function of θ whose inverse exists. If a g can be found so that

    s(θ(α)) = s(g^{-1}(α)) = Hα

then the signal model will be linear in α. We can then easily find the linear LSE of α, and thus the nonlinear LSE of θ, by

    α̂ = (H^T H)^{-1} H^T x,   θ̂ = g^{-1}(α̂).

This approach relies on the property that the minimization can be carried out in any transformed space that is obtained by a one-to-one mapping and then converted back to the original space (see Problem 8.26). The determination of the transformation g, if it exists, is usually quite difficult. Suffice it to say, only a few nonlinear LS problems may be solved in this manner. For a sinusoidal signal model

    s[n] = A cos(2πf₀n + φ),   n = 0, 1, ..., N - 1

it is desired to estimate the amplitude A, where A > 0, and phase φ. The frequency f₀ is assumed known. The LSE is obtained by minimizing

    J = Σ_{n=0}^{N-1} (x[n] - A cos(2πf₀n + φ))²

over A and φ, a nonlinear LS problem. However, because

    A cos(2πf₀n + φ) = A cos φ cos 2πf₀n - A sin φ sin 2πf₀n,

if we let

    α₁ = A cos φ
    α₂ = -A sin φ

then the signal model becomes

    s[n] = α₁ cos 2πf₀n + α₂ sin 2πf₀n.

In matrix form this is s = Hα, where

    H = [ 1               0
          cos 2πf₀        sin 2πf₀
          ⋮               ⋮
          cos 2πf₀(N-1)   sin 2πf₀(N-1) ]

which is now linear in the new parameters. The LSE of α is

    α̂ = (H^T H)^{-1} H^T x

and to find θ̂ we must find the inverse transformation g^{-1}(α̂), or

    θ̂ = [ Â ]  =  [ √(α̂₁² + α̂₂²)      ]
         [ φ̂ ]     [ arctan(-α̂₂/α̂₁)   ].

A second type of nonlinear LS problem that is less complex than the general one exhibits the separability property. Although the signal model is nonlinear, it may be linear in some of the parameters, as illustrated in Example 8.3. In general, a separable signal model has the form

    s = H(α)β

where θ = [α^T β^T]^T, with α a (p - q) × 1 vector, β a q × 1 vector, and H(α) an N × q matrix dependent on α. This model is linear in β but nonlinear in α. As a result, the LS error may be minimized with respect to β and thus reduced to a function of α only. Since

    J(α, β) = (x - H(α)β)^T (x - H(α)β)

the β that minimizes J for a given α is

    β̂ = (H^T(α) H(α))^{-1} H^T(α) x

and the resulting LS error is, from (8.22),

    J(α, β̂) = x^T [I - H(α) (H^T(α) H(α))^{-1} H^T(α)] x.

The problem now reduces to a maximization of

    x^T H(α) (H^T(α) H(α))^{-1} H^T(α) x

over α. If, for instance, q = p - 1, so that α is a scalar, then a grid search can possibly be used. This should be contrasted with the original minimization of a p-dimensional function. (See also Example 7.16.) As an example, for the damped exponential model

    s[n] = A₁ rⁿ + A₂ r²ⁿ + A₃ r³ⁿ,   n = 0, 1, ..., N - 1

the unknown parameters are {A₁, A₂, A₃, r}, and it is known that 0 < r < 1. Then, the model is linear in the amplitudes β = [A₁ A₂ A₃]^T and nonlinear in the damping factor r, with

    H(r) = [ 1        1          1
             r        r²         r³
             ⋮        ⋮          ⋮
             r^{N-1}  r^{2(N-1)}  r^{3(N-1)} ].

Once r̂ is found, we have the LSE for the amplitudes

    β̂ = (H^T(r̂) H(r̂))^{-1} H^T(r̂) x.
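A numerical sketch (not from the text) of the separable approach for the damped exponential model: grid-search the scalar r, then solve for the amplitudes linearly. The true parameter values and grid are arbitrary choices.

```python
import numpy as np

# Illustrative sketch: separable LS for s[n] = A1 r^n + A2 r^(2n) + A3 r^(3n).
# For each candidate r, the amplitudes are linear; r is found by maximizing
# x^T H(r) (H(r)^T H(r))^{-1} H(r)^T x over a grid.
rng = np.random.default_rng(6)
N = 60
n = np.arange(N)
r_true, amps = 0.9, np.array([1.0, -0.5, 0.25])

def Hmat(r):
    return np.column_stack([r ** n, r ** (2 * n), r ** (3 * n)])

x = Hmat(r_true) @ amps + 0.01 * rng.standard_normal(N)

grid = np.linspace(0.5, 0.99, 491)
def crit(r):
    H = Hmat(r)
    return x @ (H @ np.linalg.solve(H.T @ H, H.T @ x))   # projected energy

r_hat = grid[np.argmax([crit(r) for r in grid])]
H = Hmat(r_hat)
amps_hat = np.linalg.solve(H.T @ H, H.T @ x)             # amplitude LSE at r_hat
```

The four-parameter nonlinear problem is thus reduced to a one-dimensional search, the point of the separability property.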
Setting the gradient of the LS error to zero produces the necessary conditions

    ∂J/∂θⱼ = -2 Σ_{n=0}^{N-1} (x[n] - s[n]) ∂s[n]/∂θⱼ = 0,   j = 1, 2, ..., p.

If we define an N × p Jacobian matrix as

    [∂s(θ)/∂θ]ᵢⱼ = ∂s[i]/∂θⱼ,   i = 0, 1, ..., N - 1;  j = 1, 2, ..., p

then a Newton-Raphson solution of these equations requires, in addition, the second-derivative matrices Gₙ(θ), with [Gₙ(θ)]ᵢⱼ = ∂²s[n]/(∂θᵢ∂θⱼ) for i, j = 1, 2, ..., p, since

    ∂²J/(∂θᵢ∂θⱼ) = -2 Σ_{n=0}^{N-1} [ (x[n] - s[n]) [Gₙ(θ)]ᵢⱼ - [H(θ)]ₙᵢ [H(θ)]ₙⱼ ].

The Gauss method instead linearizes the signal model about a current guess θ₀:

    s[n; θ] ≈ s[n; θ₀] + (∂s[n; θ]/∂θ |_{θ=θ₀})^T (θ - θ₀)

where the dependence of s[n] on θ is now shown explicitly. The LS error becomes

    J = Σ_{n=0}^{N-1} (x[n] - s[n; θ])²
      ≈ (x - s(θ₀) + H(θ₀)θ₀ - H(θ₀)θ)^T (x - s(θ₀) + H(θ₀)θ₀ - H(θ₀)θ).

Since x - s(θ₀) + H(θ₀)θ₀ is known, this is a linear LS problem, and we have as the LSE

    θ̂ = (H^T(θ₀) H(θ₀))^{-1} H^T(θ₀) (x - s(θ₀) + H(θ₀)θ₀)
      = θ₀ + (H^T(θ₀) H(θ₀))^{-1} H^T(θ₀) (x - s(θ₀)).

Iterating on this solution gives

    θ_{k+1} = θ_k + (H^T(θ_k) H(θ_k))^{-1} H^T(θ_k) (x - s(θ_k))                    (8.62)

where

    [H(θ)]ᵢⱼ = ∂s[i]/∂θⱼ.

This is identical to the Newton-Raphson iteration except for the omission of the second derivatives, that is, the Gₙ terms. The linearization method is termed the Gauss-Newton method and can easily be generalized to the vector parameter case as in (8.62). In Example 8.14 the Gauss method is illustrated. Both the Newton-Raphson and the Gauss methods can have convergence problems. It has been argued that neither method is reliable enough to use without safeguards. The interested reader should consult [Seber and Wild 1989] for additional implementation details.
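A minimal sketch (not from the text) of the Gauss-Newton iteration (8.62) for a scalar-parameter model s[n; θ] = exp(-θn); the true value, initial guess, and noise level are arbitrary, and no safeguards (step damping, divergence checks) are included.

```python
import numpy as np

# Illustrative sketch of Gauss-Newton (8.62) for s[n; theta] = exp(-theta n).
# Here H(theta) is the N x 1 Jacobian column ds[n]/dtheta = -n exp(-theta n).
rng = np.random.default_rng(7)
N = 30
n = np.arange(N)
theta_true = 0.2
x = np.exp(-theta_true * n) + 0.001 * rng.standard_normal(N)

theta = 0.3                            # initial guess theta_0 (arbitrary)
for _ in range(20):
    s = np.exp(-theta * n)             # current signal model s(theta_k)
    Hcol = -n * s                      # Jacobian [H(theta_k)]
    theta = theta + (Hcol @ (x - s)) / (Hcol @ Hcol)   # (8.62) with p = 1
```

As the text warns, convergence depends on the initial guess; in practice the step would be monitored or damped.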
8.10 Signal Processing Examples

We now describe some typical signal processing problems for which a LSE is used. In these applications the optimal MVU estimator is unavailable. The statistical characterization of the noise may not be known or, even if it is, the optimal MVU estimator cannot be found. For known noise statistics the asymptotically optimal MLE is generally too complicated to implement. Faced with these practical difficulties we resort to least squares.

Example 8.11 - Digital Filter Design

A common problem in digital signal processing is to design a digital filter whose frequency response closely matches a given frequency response specification [Oppenheim and Schafer 1975]. Alternatively, we could attempt to match the impulse response, which is the approach examined in this example. A general infinite impulse response (IIR) filter has the system function

    H(z) = B(z)/A(z) = (b[0] + b[1]z^{-1} + ... + b[q]z^{-q}) / (1 + a[1]z^{-1} + ... + a[p]z^{-p}).

If the desired frequency response is H_d(f) = H_d(exp[j2πf]), then the inverse Fourier transform is the desired impulse response, or h_d[n]. The filter's impulse response satisfies

    h[n] = -Σ_{k=1}^{p} a[k] h[n-k] + Σ_{k=0}^{q} b[k] δ[n-k],   n ≥ 0
    h[n] = 0,   n < 0.

A straightforward LS solution would seek to choose {a[k], b[k]} to minimize

    J = Σ_{n=0}^{N-1} (h_d[n] - h[n])²

where N is some sufficiently large integer for which h_d[n] is essentially zero. Unfortunately, this approach produces a nonlinear LS problem (see Problem 8.28 for the exact solution). (The reader should note that h_d[n] plays the role of the "data," and h[n] that of the "signal" model. Also, the "noise" in the data is attributable to modeling error, whose statistical characterization is unknown.) As an example, if

    H(z) = b[0]/(1 + a[1]z^{-1})

then

    h[n] = b[0] (-a[1])ⁿ   n ≥ 0
    h[n] = 0               n < 0

and

    J = Σ_{n=0}^{N-1} (h_d[n] - b[0](-a[1])ⁿ)²
262 CHAPTER 8. LEAST SQUARES 8.lD. SIGNAL PROCESSING EXAMPLES 263
h[n] - which is now a quadratic function of the a[k]'s and b[k]'s. Alternatively,
6[n]
-I 'H.(z) = B(z)
A(z) 2:
+
E[n]
J, = N-l[
~ hd[n]- (P
- {; a[k]hd[n - k] + b[n] )]2
hd[n] In minimizing this over the filter coefficients note that the b[n]'s appear only in the
first (q + 1) terms since b[n] = 0 for n > q. As a result, we have upon letting a =
(a) True least squares error
[a[l] a[2] ... a[p]f and b = [b[O] b[l] ... b[q]f,
6[n]
-I 'H.(z) = B(z)
A(z)
h[n]
Ef[n]
~ [hd[n] - (- ~ a[k]hd[n - k] + b[n]) r
+ n~l [hd[n]- (- ~ a[k]hd[n - k])] 2
n = 0, 1, ... , q
(b) Filtered least squares error
or
P
Figure 8.12 Conversion of nonlinear least squares filter design to lin- b[n] = hd[n] + L a[k]hd[n - k].
ear least squares k=l
which is clearly very nonlinear in a[l]. In fact, it is the presence of A(z) that causes
the L8 error to be nonquadratic. To alleviate this problem we can filter both hd[n] and
h[n] by .A.(z), as shown in Figure 8.12. Then, we minimize the filtered L8 error where
hd! [n] =L
k=O
P
a[k]hd[n - k]
Ho hd[l]
hd[q - 1] hd[q - 2]
hd[O]
h,JJ
The vector h has dimension (q + 1) x 1, while the matrix Ho has dimension (q + 1) x p.
To find the L8E of the denominator coefficients or Ii we must minimize
and a[O] = 1. The filtered L8 error then becomes
N-l [
n~l hd[n] -
(P
- {; a[k]hd[n - k]
)]2
N-l(P
J, = ~ t;a[k]hd[n - k]- b[n]
)2
(x - H8f(x - H8)
- - r - - - - - - - - - - - - - - - - -
j :::~
x [ hd[q+1] hd[N -1] (N -l-q xl)
hd[q - 1] h,lq - p+ I] ]
[ h,lql hd[q] hd [q-p+2]
_ hd[q;+ 1]
H (N - 1- q x p).
'"
.~
'" D.OO-/.
The LSE of the denominator coefficients is
Q I
-O.05-t1----r-----=-r-----r----..::~-...,_---r__--___j1
o 00
This method for designing digital filters is termed the least squares Prony method [Parks and Burrus 1987]. As an example, consider the design of a low-pass filter for which the desired frequency response is

    H_d(f) = { 1   |f| < f_c
             { 0   |f| > f_c.

The corresponding desired impulse response is

    h_d[n] = sin(2 pi f_c n) / (pi n)

and as expected is not causal (due to the zero phase frequency response assumption). To ensure causality we delay the impulse response by n_0 samples and then set the resultant impulse response equal to zero for n < 0. Next, to approximate the desired impulse response using the LS Prony method we assume that N samples are available or

    h_d[n] = sin( 2 pi f_c (n - n_0) ) / ( pi (n - n_0) )   n = 0, 1, ..., N-1.

For a cutoff frequency of f_c = 0.1 and a delay of n_0 = 25 samples, the desired impulse response is shown in Figure 8.13a for N = 50. The corresponding desired magnitude frequency response is shown in Figure 8.13b. The effects of truncating the impulse response are manifested by the sidelobe structure of the response. Using the LS Prony method with p = q = 10, a digital filter was designed to match the desired low-pass filter frequency response. The result is shown in Figure 8.13c, where the magnitude of the frequency response or |H_hat(exp(j 2 pi f))| has been plotted in dB. The agreement is generally good, with the Prony digital filter exhibiting a peaky structure in the passband (frequencies below the cutoff) and a smooth rolloff in the stopband (frequencies above the cutoff) in contrast to the desired response. Larger values for p and q would presumably improve the match.

[Figure 8.13 Low-pass filter design by least squares Prony method: (a) desired impulse response versus sample number n, (b) desired magnitude frequency response, (c) magnitude frequency response of the Prony design in dB versus frequency]
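The two-step procedure above (linear LS for the a[k]'s, then direct computation of the b[n]'s) can be sketched as follows. This is a minimal illustrative implementation assuming numpy; the function name ls_prony and the one-pole check are not from the text.

```python
# Sketch of the least squares Prony method described above (assumes numpy).
import numpy as np

def ls_prony(hd, p, q):
    """Fit H(z) = B(z)/A(z) to a desired impulse response hd[0..N-1]."""
    N = len(hd)
    hd = np.asarray(hd, dtype=float)
    # Linear LS over the equations for n = q+1, ..., N-1 (hd[n] = 0 for n < 0).
    x = hd[q + 1:N]
    H = np.zeros((N - 1 - q, p))
    for i, n in enumerate(range(q + 1, N)):
        for k in range(1, p + 1):
            H[i, k - 1] = -hd[n - k] if n - k >= 0 else 0.0
    a, *_ = np.linalg.lstsq(H, x, rcond=None)      # a = [a[1], ..., a[p]]
    # b[n] = hd[n] + sum_k a[k] hd[n-k] for n = 0, ..., q zeroes the first sum.
    b = np.array([hd[n] + sum(a[k - 1] * (hd[n - k] if n - k >= 0 else 0.0)
                              for k in range(1, p + 1)) for n in range(q + 1)])
    return a, b

# Quick check: an impulse response that is exactly one-pole,
# h[n] = b0 * (-a1)^n, is recovered exactly (p = 1, q = 0).
b0, a1 = 2.0, -0.5
hd = [b0 * (-a1) ** n for n in range(20)]
a, b = ls_prony(hd, p=1, q=0)
```

For a truly rational h_d[n] the filtered LS error can be driven to zero, so the recovery is exact; for the truncated sinc of the low-pass example the fit is only approximate.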
We next consider estimation of the AR parameters of an ARMA process, whose PSD is P_xx(f) = sigma_u^2 |B(f)|^2 / |A(f)|^2, where

    A(f) = 1 + sum_{k=1}^{p} a[k] exp(-j 2 pi f k).

The b[k]'s are termed the MA filter parameters, the a[k]'s the AR filter parameters, and sigma_u^2 the driving white noise variance. The process is obtained by exciting a causal filter whose frequency response is B(f)/A(f) with white noise of variance sigma_u^2. If the b[k]'s are zero, then we have the AR process introduced in Example 3.16. The estimation of the parameters of an AR model using an asymptotic MLE was discussed in Example 7.18. For the ARMA process even the asymptotic MLE proves to be intractable. An alternative approach relies on equation error modeling of the ACF. It estimates the AR filter parameters, leaving the MA filter parameters and white noise variance to be found by other means.

To determine the ACF we can take the inverse z-transform of the PSD extended to the z-plane or

    P_xx(z) = sigma_u^2 B(z) B(z^{-1}) / ( A(z) A(z^{-1}) )

evaluating P_xx(z) on the unit circle of the z-plane or for z = exp(j 2 pi f). It is convenient to develop a difference equation for the ACF. This difference equation will serve as the basis for the equation error modeling approach. Taking the inverse z-transform of A(z) P_xx(z), we have

    Z^{-1}{ A(z) P_xx(z) } = Z^{-1}{ sigma_u^2 B(z) B(z^{-1}) / A(z^{-1}) }.

Since the filter impulse response is causal, it follows that

    h[n] = Z^{-1}{ B(z)/A(z) } = 0   for n < 0.

Consequently, we have that

    h[-n] = Z^{-1}{ B(z^{-1})/A(z^{-1}) } = 0   for n > 0.

Finally, a difference equation for the ACF can be written as

    sum_{k=0}^{p} a[k] r_xx[n-k] = 0   for n > q   (8.63)

where a[0] = 1. These equations are called the modified Yule-Walker equations. Recall from Example 7.18 that the Yule-Walker equations for an AR process were identical except that they held for n > 0 (since q = 0 for an AR process). Only the AR filter parameters appear in these equations.

In practice, we must estimate the ACF lags as

    r_hat_xx[k] = (1/N) sum_{n=0}^{N-1-|k|} x[n] x[n+|k|]

assuming x[n] is available for n = 0, 1, ..., N-1. Substituting these into (8.63) yields

    sum_{k=0}^{p} a[k] r_hat_xx[n-k] = epsilon[n]   n > q

where epsilon[n] denotes the error due to the effect of errors in the ACF estimate. The model becomes

    r_hat_xx[n] = - sum_{k=1}^{p} a[k] r_hat_xx[n-k] + epsilon[n]   n > q

which can be seen to be linear in the unknown AR filter parameters. If the ACF is estimated for lags n = 0, 1, ..., M (where we must have M <= N-1), then a LSE of the a[k]'s will minimize

    J = sum_{n=q+1}^{M} [ r_hat_xx[n] - ( - sum_{k=1}^{p} a[k] r_hat_xx[n-k] ) ]^2
      = ( x - H theta )^T ( x - H theta )   (8.64)

where

    x = [ r_hat_xx[q+1] r_hat_xx[q+2] ... r_hat_xx[M] ]^T
    theta = [ a[1] a[2] ... a[p] ]^T

    H = - [ r_hat_xx[q]     r_hat_xx[q-1]   ...  r_hat_xx[q-p+1]
            r_hat_xx[q+1]   r_hat_xx[q]     ...  r_hat_xx[q-p+2]
            ...
            r_hat_xx[M-1]   r_hat_xx[M-2]   ...  r_hat_xx[M-p] ].

The LSE of theta is (H^T H)^{-1} H^T x and is termed the least squares modified Yule-Walker equations. It is interesting to observe that in this problem we use a LSE for the estimated ACF data, not the original data. Additionally, the observation matrix H, which is usually a known deterministic matrix, is now a random matrix. As expected, the statistics of the LSE are difficult to determine. In practice, M should not be chosen too large since the ACF estimate is less reliable at higher lags due to the fewer (N - k) lag products averaged in the estimate of r_hat_xx[k]. Some researchers advocate a weighted LSE to reflect the tendency for the errors of (8.64) to increase as n increases. As an example, the choice w_n = 1 - n/(M+1) might be made and the weighted LSE of (8.16) implemented. Further discussions of this problem can be found in [Kay 1988].

[Figure 8.14 Adaptive noise canceler for 60 Hz interference]

[Figure: Adaptive noise canceler structure - a primary channel x[n] containing the interference, a reference channel x_R[n], and an adaptive FIR filter H_n(z) = sum_{l=0}^{p-1} h_n[l] z^{-l} whose output is subtracted from x[n] to form the error epsilon[n]]
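The least squares modified Yule-Walker estimator of (8.64) can be sketched as below. This is an illustrative implementation assuming numpy; the AR(1) usage example (where the MYW equations reduce to the ordinary Yule-Walker equations) is not from the text.

```python
# Sketch of the least squares modified Yule-Walker estimator (8.64).
import numpy as np

def ls_modified_yule_walker(x, p, q, M):
    """Estimate the ARMA AR parameters a[1..p] from data x."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Biased ACF estimate r_hat_xx[k], k = 0, ..., M
    rxx = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(M + 1)])
    # Equations: rxx[n] = -sum_{k=1}^p a[k] rxx[n-k] + error, n = q+1, ..., M
    rhs = rxx[q + 1:M + 1]
    H = np.array([[-rxx[abs(n - k)] for k in range(1, p + 1)]
                  for n in range(q + 1, M + 1)])
    a, *_ = np.linalg.lstsq(H, rhs, rcond=None)
    return a

# Usage on a synthetic AR(1) process (q = 0): x[n] = -a1 x[n-1] + u[n]
rng = np.random.default_rng(0)
a1 = -0.8
u = rng.standard_normal(20000)
x = np.zeros_like(u)
for n in range(1, len(u)):
    x[n] = -a1 * x[n - 1] + u[n]
a_hat = ls_modified_yule_walker(x, p=1, q=0, M=10)
```

As the text cautions, for real ARMA data M should be kept modest, since the higher ACF lags are estimated from fewer lag products and are therefore noisier.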
Example 8.13 - Adaptive Noise Canceler

If the interference changes abruptly, as illustrated in Figure 8.16b, then the adaptive filter must quickly change its coefficients to respond to the changing interference for t > t_0. How fast it can change depends upon how many error terms are included in J[n] before and after the transition at t = t_0. The chosen weights for t > t_0 will be a compromise between the old weights and the desired new ones. If the transition occurs in discrete time at n = n_0, then we should probably expect the wrong weights until n >> n_0. To allow the filter to adapt more quickly we can downweight previous errors in J[n] by incorporating a weighting or "forgetting factor" lambda, where 0 < lambda < 1, as follows:

    J[n] = sum_{k=0}^{n} lambda^{n-k} ( x[k] - h^T[k] theta )^2.

This modification downweights previous errors exponentially, allowing the filter to react more quickly to interference changes. The penalty paid is that the estimates of the filter weights are noisier, relying on fewer effective LS errors. Note that the solution will not change if we minimize instead

    J'[n] = sum_{k=0}^{n} lambda^{-k} ( x[k] - h^T[k] theta )^2   (8.65)

for each n. Now we have our standard sequential weighted LS problem described in Section 8.7. Referring to (8.46), we identify the sequential LSE of the filter weights as

    h[n] = [ x_R[n] x_R[n-1] ... x_R[n-p+1] ]^T

    K[n] = Sigma[n-1] h[n] / ( lambda^n + h^T[n] Sigma[n-1] h[n] )

    theta_hat[n] = theta_hat[n-1] + K[n] ( x[n] - h^T[n] theta_hat[n-1] )

    Sigma[n] = ( I - K[n] h^T[n] ) Sigma[n-1].

The forgetting factor is chosen as 0 < lambda < 1, with lambda near 1 or 0.9 < lambda < 1 being typical. As an example, for a sinusoidal interference

    x[n] = 10 cos( 2 pi (0.1) n + pi/4 )

and a reference signal

    x_R[n] = cos( 2 pi (0.1) n )

we implement the ANC using two filter coefficients (p = 2) since the reference signal need only be modified in amplitude and phase to match the interference. To initialize the sequential LS estimator we choose

    theta_hat[-1] = 0
    Sigma[-1] = 10^5 I

and a forgetting factor of lambda = 0.99 is assumed. The interference x[n] and output epsilon[n] of the ANC are shown in Figures 8.17a and 8.17b, respectively. As expected, the interference is canceled. The LSE of the filter coefficients is shown in Figure 8.17c and is seen to rapidly converge to the steady-state values. To find the steady-state values of the weights we need to solve

    H_hat( exp(j 2 pi (0.1)) ) = 10 exp(j pi/4)

which says that the adaptive filter must increase the gain of the reference signal by 10 and advance the phase by pi/4 to match the interference. Solving, we have

    h_hat[0] + h_hat[1] exp( -j 2 pi (0.1) ) = 10 exp(j pi/4)

which results in h_hat[0] = 16.8 and h_hat[1] = -12.0.
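The exponentially weighted sequential LSE of Example 8.13 can be sketched as below. Note one assumption: the text's gain uses lambda^n in the denominator (from the unnormalized form (8.65)); the sketch uses the common normalized recursion with lambda in the denominator and Sigma divided by lambda, which minimizes the same J[n]. Everything else (signals, p = 2, lambda = 0.99) follows the example.

```python
# Sketch of recursive LS with a forgetting factor, applied to the
# adaptive noise canceler of Example 8.13 (assumes numpy).
import numpy as np

def rls(x, xr, p=2, lam=0.99, sigma0=1e5):
    theta = np.zeros(p)                 # theta_hat[-1] = 0
    Sigma = sigma0 * np.eye(p)          # Sigma[-1] = 10^5 I
    for n in range(p - 1, len(x)):      # start where xr[n-p+1] is available
        h = np.array([xr[n - i] for i in range(p)])
        K = Sigma @ h / (lam + h @ Sigma @ h)
        theta = theta + K * (x[n] - h @ theta)
        Sigma = (Sigma - np.outer(K, h) @ Sigma) / lam
    return theta

n = np.arange(200)
x = 10 * np.cos(2 * np.pi * 0.1 * n + np.pi / 4)   # interference
xr = np.cos(2 * np.pi * 0.1 * n)                   # reference
theta = rls(x, xr)
# Steady state solves h[0] + h[1] exp(-j 2 pi 0.1) = 10 exp(j pi/4),
# i.e. approximately h[0] = 16.8, h[1] = -12.0.
```

Because the data are noiseless and exactly representable by the two-tap filter, the weights converge rapidly to the steady-state values quoted in the example.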
[Figure 8.17 Interference cancellation example: (a) interference x[n] versus sample number n, (b) ANC output epsilon[n] versus sample number n, (c) filter coefficients converging to the steady-state values h[0] and h[1]]

Example 8.14 - Phase-Locked Loop

We now consider the problem of carrier recovery in a communication system, which is necessary for coherent demodulation. It is assumed that the carrier is received but is embedded in noise. The received noise-free carrier is

    s[n] = cos( 2 pi f_0 n + phi )   n = -M, ..., 0, ..., M

where the frequency f_0 and phase phi are to be estimated. A symmetric observation interval is chosen to simplify the algebra. A LSE is to be used. Due to the nonlinearity we employ the linearization technique of (8.62). Determining first H(theta), where theta = [f_0 phi]^T, we have

    ds[n]/df_0 = -2 pi n sin( 2 pi f_0 n + phi )
    ds[n]/dphi = - sin( 2 pi f_0 n + phi )

so that

    H(theta) = - [ -M 2 pi sin( -2 pi f_0 M + phi )             sin( -2 pi f_0 M + phi )
                   -(M-1) 2 pi sin( -2 pi f_0 (M-1) + phi )     sin( -2 pi f_0 (M-1) + phi )
                   ...                                          ...
                   M 2 pi sin( 2 pi f_0 M + phi )               sin( 2 pi f_0 M + phi ) ]

and

    H^T(theta) H(theta) =
    [ 4 pi^2 sum_{n=-M}^{M} n^2 sin^2( 2 pi f_0 n + phi )    2 pi sum_{n=-M}^{M} n sin^2( 2 pi f_0 n + phi )
      2 pi sum_{n=-M}^{M} n sin^2( 2 pi f_0 n + phi )        sum_{n=-M}^{M} sin^2( 2 pi f_0 n + phi )       ].

But

    sum_{n=-M}^{M} n^2 sin^2( 2 pi f_0 n + phi ) = sum_{n=-M}^{M} n^2 [ 1/2 - (1/2) cos( 4 pi f_0 n + 2 phi ) ] ~ (1/2) sum_{n=-M}^{M} n^2 ~ M^3/3

    sum_{n=-M}^{M} sin^2( 2 pi f_0 n + phi ) = sum_{n=-M}^{M} [ 1/2 - (1/2) cos( 4 pi f_0 n + 2 phi ) ] ~ (2M+1)/2 ~ M

    sum_{n=-M}^{M} n sin^2( 2 pi f_0 n + phi ) ~ 0

for M >> 1, the cosine sums being negligible as long as f_0 is not near 0 or 1/2. We have then from (8.62)

    f_0_{k+1} = f_0_k - ( 3/(2 pi M^3) ) sum_{n=-M}^{M} n sin( 2 pi f_0_k n + phi_k ) ( x[n] - cos( 2 pi f_0_k n + phi_k ) )

    phi_{k+1} = phi_k - ( 1/M ) sum_{n=-M}^{M} sin( 2 pi f_0_k n + phi_k ) ( x[n] - cos( 2 pi f_0_k n + phi_k ) )

or, since

    sum_{n=-M}^{M} sin( 2 pi f_0_k n + phi_k ) cos( 2 pi f_0_k n + phi_k ) = (1/2) sum_{n=-M}^{M} sin( 4 pi f_0_k n + 2 phi_k ) ~ 0

it follows that

    f_0_{k+1} ~ f_0_k - ( 3/(2 pi M^3) ) sum_{n=-M}^{M} n x[n] sin( 2 pi f_0_k n + phi_k )

    phi_{k+1} ~ phi_k - ( 1/M ) sum_{n=-M}^{M} x[n] sin( 2 pi f_0_k n + phi_k ).

References

Parks, T.W., C.S. Burrus, Digital Filter Design, J. Wiley, New York, 1987.
Proakis, J.G., Digital Communications, McGraw-Hill, New York, 1983.
Scharf, L.L., Statistical Signal Processing, Addison-Wesley, New York, 1991.
Seber, G.A.F., C.J. Wild, Nonlinear Regression, J. Wiley, New York, 1989.
Stoica, P., R.L. Moses, B. Friedlander, T. Soderstrom, "Maximum Likelihood Estimation of the Parameters of Multiple Sinusoids from Noisy Measurements," IEEE Trans. Acoust., Speech, Signal Process., Vol. 37, pp. 378-392, March 1989.
Widrow, B., S.D. Stearns, Adaptive Signal Processing, Prentice-Hall, Englewood Cliffs, N.J., 1985.
Problems

8.1 The LS error

    J = sum_{n=0}^{N-1} ( x[n] - A cos( 2 pi f_0 n ) - B r^n )^2

is to be minimized to find the LSE of theta = [A f_0 B r]^T, where 0 < r < 1. Is this a linear or nonlinear LS problem? Is the LS error quadratic in any of the parameters, and if so, which ones? How could you solve this minimization problem using a digital computer?

8.2 Show that the inequality given in (8.7) holds.

8.3 For the signal model

    s[n] = {  A   0 <= n <= M-1
           { -A   M <= n <= N-1
find the LSE of A and the minimum LS error. Assume that x[n] = s[n] + w[n] for n = 0, 1, ..., N-1 are observed. If now w[n] is WGN with variance sigma^2, find the PDF of the LSE.

8.4 Derive the LSE for a vector parameter as given by (8.10) by verifying the identity

    J = ( x - H theta )^T ( x - H theta )
      = ( x - H theta_hat )^T ( x - H theta_hat ) + ( theta - theta_hat )^T H^T H ( theta - theta_hat )

where theta_hat = (H^T H)^{-1} H^T x. To complete the proof show that J is minimized when theta = theta_hat, assuming that H is full rank and therefore that H^T H is positive definite.

8.5 For the signal model

    s[n] = sum_{i=1}^{p} A_i cos( 2 pi f_i n )

where the frequencies f_i are known and the amplitudes A_i are to be estimated, find the LSE normal equations (do not attempt to solve them). Then, if the frequencies are specifically known to be f_i = i/N, explicitly find the LSE and the minimum LS error. Finally, if x[n] = s[n] + w[n], where w[n] is WGN with variance sigma^2, determine the PDF of the LSE, assuming the given frequencies. Hint: The columns of H are orthogonal for the given frequencies.

8.6 For the general LSE of (8.10) find the PDF of the LSE if it is known that x ~ N(H theta, sigma^2 I). Is the LSE unbiased?

8.7 In this problem we consider the estimation of the noise variance sigma^2 in the model x = H theta + w, where w is zero mean with covariance matrix sigma^2 I. The estimator

    sigma_hat^2 = (1/N) J_min = (1/N) ( x - H theta_hat )^T ( x - H theta_hat )

where theta_hat is the LSE of (8.10), is proposed. Is this estimator unbiased, and if not, can you propose one that is? Explain your results. Hint: The identities E(x^T y) = E(tr(y x^T)) = tr(E(y x^T)) and tr(AB) = tr(BA) will be useful.

8.8 Verify the LSE for A given in (8.15). Find the mean and variance of A_hat if x[n] = A + w[n], where w[n] is zero mean uncorrelated noise with variance sigma_n^2.

8.9 Verify (8.16) and (8.17) for the weighted LSE by noting that if W is positive definite, we can write it as W = D^T D, where D is an invertible N x N matrix.

8.10 Referring to Figure 8.2b, prove that

    ||x||^2 = ||s_hat||^2 + ||x - s_hat||^2.

This can be thought of as the least squares Pythagorean theorem.

8.11 In this problem we prove that a projection matrix P must be symmetric. Let x = e + e_perp, where e lies in a subspace which is the range of the projection matrix or Px = e, and e_perp lies in the orthogonal subspace or P e_perp = 0. For arbitrary vectors x_1, x_2 in R^N show that (P x_1)^T ( x_2 - P x_2 ) = 0 by decomposing x_1 and x_2 as discussed above. Finally, prove the desired result.

8.12 Prove the following properties of the projection matrix P = H (H^T H)^{-1} H^T:

a. P is idempotent.
b. P is positive semidefinite.
c. The eigenvalues of P are either 1 or 0.
d. The rank of P is p. Use the fact that the trace of a matrix is equal to the sum of its eigenvalues.

8.13 Verify the result given in (8.24).

8.14 In this problem we derive the order-update LS equations using a geometrical argument. Assume that theta_hat_k is available and therefore the LS signal estimate s_hat_k is known. Now, if h_{k+1} is used, the LS signal estimate becomes

    s_hat_{k+1} = s_hat_k + alpha h_{k+1}^perp

where h_{k+1}^perp is the component of h_{k+1} orthogonal to the subspace spanned by {h_1, h_2, ..., h_k}. First, find h_{k+1}^perp and then determine alpha by noting that alpha h_{k+1}^perp is the projection of x onto h_{k+1}^perp. Finally, since s_hat_{k+1} = H_{k+1} theta_hat_{k+1}, determine theta_hat_{k+1}.

8.15 Use order-recursive LS to find the values of A and B that minimize

    J = sum_{n=0}^{N-1} ( x[n] - A - B r^n )^2.

The parameter r is assumed to be known.
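The projection-matrix properties asked for in Problems 8.11 and 8.12 are easy to confirm numerically before proving them. A minimal sketch, assuming numpy and a randomly generated full-rank H:

```python
# Numerical check of the projection-matrix properties of Problems 8.11-8.12:
# P = H (H^T H)^{-1} H^T for a full-rank N x p matrix H.
import numpy as np

rng = np.random.default_rng(1)
N, p = 8, 3
H = rng.standard_normal((N, p))
P = H @ np.linalg.inv(H.T @ H) @ H.T

idempotent = np.allclose(P @ P, P)       # a. P^2 = P
symmetric = np.allclose(P, P.T)          # Problem 8.11: P = P^T
eigs = np.sort(np.linalg.eigvalsh(P))    # c. eigenvalues are 0 or 1
rank_is_p = np.isclose(np.trace(P), p)   # d. trace = sum of eigenvalues = p
```

Since the trace equals the sum of the eigenvalues, and each eigenvalue is 0 or 1, the trace directly counts the rank, which is the content of part d.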
8.16 The second signal is useful for modeling a jump in level at n = M. We can express the second model in an alternative form. Now let theta_1 = A and theta_2 = [A (B - A)]^T and find the LSE and minimum LS error for each model using order-recursive LS. Discuss how you might use it to detect a jump in level.

8.17 Prove that P_k^perp x is orthogonal to the space spanned by the columns of H_k.

8.18 Using the recursive orthogonal projection matrix formula (8.34), derive the update formula for the minimum LS error (8.31).

8.19 Verify the sequential formula for the minimum LS error (8.43) by using the sequential LSE update (8.40).

8.20 Let x[n] = A r^n + w[n], where w[n] is WGN with variance sigma^2 = 1. Find the sequential LSE for A, assuming that r is known. Also, determine the variance of the LSE in sequential form and then solve explicitly for the variance as a function of n. Let A_hat[0] = x[0] and var(A_hat[0]) = sigma^2 = 1.

8.21 Using the sequential update formulas for the gain (8.41) and variance (8.42), solve for the gain and variance sequences if sigma_n^2 = r^n. Use var(A_hat[0]) = var(x[0]) = sigma_0^2 = 1 for initialization. Then, examine what happens to the gain and variance as N -> infinity if r = 1, 0 < r < 1, and r > 1. Hint: Solve for 1/var(A_hat[N]).

8.22 By implementing a Monte Carlo computer simulation plot A_hat[N] as given by (8.40). Assume that the data are given by

    x[n] = A + w[n]

where w[n] is zero mean WGN with sigma_n^2 = r^n. Use A = 10 and r = 1, 0.95, 1.05. Initialize the estimator by using A_hat[0] = x[0] and var(A_hat[0]) = var(x[0]) = sigma_0^2 = 1. Also, plot the gain and variance sequences.

8.23 The sequential LSE may be initialized by applying a batch estimator to p initial observation vectors, i.e., by choosing theta_hat[-1] and Sigma[-1] based on data observed for n = -p, ..., -1, where

    H[-1] = [ h^T[-p]
              ...
              h^T[-1] ]

    C[-1] = diag( sigma_{-p}^2, sigma_{-(p-1)}^2, ..., sigma_{-1}^2 ).

Thus, we may view the initial estimate for the sequential LSE as the result of applying a batch estimator to the initial observation vectors. Since the sequential LSE initialized using a batch estimator is identical to the batch LSE using all the observation vectors, we have for the sequential LSE with the assumed initial conditions

    theta_hat[n] = ( sum_{k=-p}^{n} (1/sigma_k^2) h[k] h^T[k] )^{-1} ( sum_{k=-p}^{n} (1/sigma_k^2) x[k] h[k] ).

Prove that this can be rewritten as

    theta_hat[n] = ( Sigma^{-1}[-1] + sum_{k=0}^{n} (1/sigma_k^2) h[k] h^T[k] )^{-1} ( Sigma^{-1}[-1] theta_hat[-1] + sum_{k=0}^{n} (1/sigma_k^2) x[k] h[k] ).

Then examine what happens as the initial variances become large, for 0 <= n <= p-1 and for n >= p.

8.24 Let alpha = g(theta), where g is a one-to-one mapping. Prove that if alpha_hat minimizes J(g^{-1}(alpha)), then theta_hat = g^{-1}(alpha_hat).

8.28 In this problem we derive the equations that must be optimized to find the true LSE [Scharf 1991]. It is assumed that p = q + 1. First, show that the signal model can be written in terms of s = [h[0] h[1] ... h[N-1]]^T using the lower triangular convolution matrix

    G = [ g[0]  0     ...  0
          g[1]  g[0]  ...  0
          ...                ]

and the matrix

    A^T = [ a[p]  a[p-1]  ...  1     0  ...  0
            0     a[p]    ...  a[1]  1  ...  0
            ...                              ]

where the elements in A are replaced by the LSE of a. To prove (8.66) consider L = [A G], which is invertible (since it is full rank). You will also need to observe that A^T G = 0, which follows from a[n] * g[n] = delta[n]. Note that this problem is an example of a separable nonlinear LS problem.

8.29 In Example 8.14 assume that the phase-locked loop converges so that f_0_{k+1} = f_0_k and phi_{k+1} = phi_k. For a high SNR so that x[n] ~ cos( 2 pi f_0 n + phi ), show that the final iterates will be the true values of frequency and phase.
Appendix 8A

Derivation of Order-Recursive Least Squares

In this appendix we derive the order-recursive least squares formulas (8.28)-(8.31). Referring to (8.25), we partition the Grammian H_{k+1}^T H_{k+1} and apply the partitioned matrix inversion formula

    [ A    b ]^{-1}   [ ( A - b b^T/c )^{-1}               -(1/c) ( A - b b^T/c )^{-1} b ]
    [ b^T  c ]      = [ -(1/c) b^T ( A - b b^T/c )^{-1}    1/( c - b^T A^{-1} b )        ]

together with

    ( A - b b^T/c )^{-1} = A^{-1} + A^{-1} b b^T A^{-1} / ( c - b^T A^{-1} b ).

Letting D_k = (H_k^T H_k)^{-1} and P_k^perp = I - H_k D_k H_k^T, and using

    h_{k+1}^T P_k^perp h_{k+1} + h_{k+1}^T ( I - P_k^perp ) h_{k+1} = h_{k+1}^T h_{k+1}

the algebra collapses to the order update

    theta_hat_{k+1} = [ theta_hat_k - D_k H_k^T h_{k+1} ( h_{k+1}^T P_k^perp x ) / ( h_{k+1}^T P_k^perp h_{k+1} )
                        ( h_{k+1}^T P_k^perp x ) / ( h_{k+1}^T P_k^perp h_{k+1} ) ].

Finally, the update for the minimum LS error is

    J_min_{k+1} = ( x - H_{k+1} theta_hat_{k+1} )^T ( x - H_{k+1} theta_hat_{k+1} )
                = x^T x - x^T H_{k+1} theta_hat_{k+1}

which, upon substituting the partitioned theta_hat_{k+1}, yields (8.31).
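The partitioned-inverse identities used in this appendix can be checked numerically. A minimal sketch, assuming numpy and a randomly generated symmetric positive definite block matrix:

```python
# Numerical check of the partitioned-inverse identity of Appendix 8A:
# (A - b b^T / c)^{-1} = A^{-1} + A^{-1} b b^T A^{-1} / (c - b^T A^{-1} b)
# for a symmetric positive definite matrix [[A, b], [b^T, c]].
import numpy as np

rng = np.random.default_rng(2)
G = rng.standard_normal((4, 4))
M = G @ G.T + 4 * np.eye(4)        # positive definite block matrix
A, b, c = M[:3, :3], M[:3, 3], M[3, 3]

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A - np.outer(b, b) / c)
rhs = Ainv + np.outer(Ainv @ b, Ainv @ b) / (c - b @ Ainv @ b)
ok = np.allclose(lhs, rhs)
```

This is the rank-one (Sherman-Morrison) special case of the Woodbury identity invoked again in Appendix 8C.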
Appendix 8B

Derivation of Recursive Projection Matrix

We now derive the recursive update for the projection matrix given by (8.34).

Appendix 8C

Derivation of Sequential Least Squares

The batch LSE based on data up to time n-1 is

    theta_hat[n-1] = ( H^T[n-1] C^{-1}[n-1] H[n-1] )^{-1} H^T[n-1] C^{-1}[n-1] x[n-1]

and we let

    Sigma[n-1] = ( H^T[n-1] C^{-1}[n-1] H[n-1] )^{-1}

so that theta_hat[n-1] = Sigma[n-1] H^T[n-1] C^{-1}[n-1] x[n-1]. When the new sample x[n] is observed,

    theta_hat[n] = ( [ H^T[n-1] h[n] ] [ C[n-1]  0         ]^{-1} [ H[n-1] ] )^{-1}
                   ( [ H^T[n-1] h[n] ] [ 0^T     sigma_n^2 ]      [ h^T[n] ] )
                     ...                                          [ x[n-1] ]
                                                                  [ x[n]   ].

Since the covariance matrix is diagonal, it is easily inverted. The first term in parentheses is just Sigma^{-1}[n]. We can use Woodbury's identity (see Appendix 1) to obtain Sigma[n], which produces

    theta_hat[n] = theta_hat[n-1] + (1/sigma_n^2) Sigma[n-1] h[n] x[n] - K[n] h^T[n] theta_hat[n-1]
                   - (1/sigma_n^2) K[n] h^T[n] Sigma[n-1] h[n] x[n].

But with the gain K[n] = Sigma[n-1] h[n] / ( sigma_n^2 + h^T[n] Sigma[n-1] h[n] ),

    (1/sigma_n^2) Sigma[n-1] h[n] - (1/sigma_n^2) K[n] h^T[n] Sigma[n-1] h[n] = K[n]

and therefore

    theta_hat[n] = theta_hat[n-1] + K[n] x[n] - K[n] h^T[n] theta_hat[n-1]
                 = theta_hat[n-1] + K[n] ( x[n] - h^T[n] theta_hat[n-1] ).
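The sequential update just derived should reproduce the batch LSE exactly when initialized with a batch solve over the first samples. A minimal numerical check, assuming numpy and unit noise variances:

```python
# Check that the sequential LS update reproduces the batch LSE
# (assumes numpy; sigma_n^2 = 1 for all n).
import numpy as np

rng = np.random.default_rng(3)
N, p = 40, 2
H = rng.standard_normal((N, p))
x = H @ np.array([1.0, -2.0]) + 0.1 * rng.standard_normal(N)

# Batch solution
theta_batch, *_ = np.linalg.lstsq(H, x, rcond=None)

# Sequential solution: batch-initialize on the first p observations,
# then update one sample at a time.
Hp, xp = H[:p], x[:p]
Sigma = np.linalg.inv(Hp.T @ Hp)
theta = Sigma @ Hp.T @ xp
for n in range(p, N):
    h = H[n]
    K = Sigma @ h / (1.0 + h @ Sigma @ h)
    theta = theta + K * (x[n] - h @ theta)
    Sigma = Sigma - np.outer(K, h) @ Sigma
```

The agreement is exact up to floating-point roundoff, which is the claim the derivation establishes algebraically.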
The minimum LS error may be updated in a similar fashion. Writing it in partitioned form,

    J_min[n] = [ x^T[n-1] x[n] ] [ C^{-1}[n-1]  0           ] [ x[n-1] - H[n-1] theta_hat[n] ]
                                 [ 0^T          1/sigma_n^2 ] [ x[n] - h^T[n] theta_hat[n]   ]
             = x^T[n-1] C^{-1}[n-1] ( x[n-1] - H[n-1] theta_hat[n] ) + (1/sigma_n^2) x[n] ( x[n] - h^T[n] theta_hat[n] ).

Let e[n] = x[n] - h^T[n] theta_hat[n-1]. Using the update for theta_hat[n] produces

    J_min[n] = x^T[n-1] C^{-1}[n-1] ( x[n-1] - H[n-1] theta_hat[n-1] - H[n-1] K[n] e[n] )
               + (1/sigma_n^2) x[n] ( x[n] - h^T[n] theta_hat[n-1] - h^T[n] K[n] e[n] )
             = J_min[n-1] - x^T[n-1] C^{-1}[n-1] H[n-1] K[n] e[n] + (1/sigma_n^2) x[n] ( 1 - h^T[n] K[n] ) e[n].

But x^T[n-1] C^{-1}[n-1] H[n-1] = theta_hat^T[n-1] Sigma^{-1}[n-1], so that

    J_min[n] = J_min[n-1] - theta_hat^T[n-1] Sigma^{-1}[n-1] K[n] e[n] + (1/sigma_n^2) x[n] ( 1 - h^T[n] K[n] ) e[n]

and also

    (1/sigma_n^2) x[n] ( 1 - h^T[n] K[n] ) - theta_hat^T[n-1] Sigma^{-1}[n-1] K[n]
      = (1/sigma_n^2) x[n] ( 1 - h^T[n] Sigma[n-1] h[n] / ( sigma_n^2 + h^T[n] Sigma[n-1] h[n] ) )
        - theta_hat^T[n-1] h[n] / ( sigma_n^2 + h^T[n] Sigma[n-1] h[n] )
      = ( x[n] - h^T[n] theta_hat[n-1] ) / ( sigma_n^2 + h^T[n] Sigma[n-1] h[n] )
      = e[n] / ( sigma_n^2 + h^T[n] Sigma[n-1] h[n] ).

Hence, we have finally

    J_min[n] = J_min[n-1] + e^2[n] / ( sigma_n^2 + h^T[n] Sigma[n-1] h[n] ).

Chapter 9

Method of Moments

9.1 Introduction

The approach known as the method of moments is described in this chapter. It produces an estimator that is easy to determine and simple to implement. Although the estimator has no optimality properties, it is useful if the data record is long enough. This is because the method of moments estimator is usually consistent. If the performance is not satisfactory, then it can be used as an initial estimate, which is subsequently improved through a Newton-Raphson implementation of the MLE. After obtaining the method of moments estimator, we describe some approximate approaches for analyzing its statistical performance. The techniques are general enough to be used to evaluate the performance of many other estimators and are therefore useful in their own right.

9.2 Summary

The method of moments approach to estimation is illustrated in Section 9.3 with some examples. In general, the estimator is given by (9.11) for a vector parameter. The performance of the method of moments estimator may be partially characterized by the approximate mean of (9.15) and the approximate variance of (9.16). Also, any estimator that depends on a group of statistics whose PDF is concentrated about its mean can make use of the same expressions. Another case where an approximate mean and variance evaluation is useful involves signal in noise problems when the SNR is high. Then, the estimation performance can be partially described by the approximate mean of (9.18) and the approximate variance of (9.19).

9.3 Method of Moments

The method of moments approach to estimation is based on the solution of a theoretical equation involving the moments of a PDF. As an example, assume we observe x[n] for n = 0, 1, ..., N-1, which are IID samples from the Gaussian mixture PDF
(see also Problem 6.14)

    p(x[n]; epsilon) = (1 - epsilon)/sqrt(2 pi sigma_1^2) exp( -(1/2) x^2[n]/sigma_1^2 )
                       + epsilon/sqrt(2 pi sigma_2^2) exp( -(1/2) x^2[n]/sigma_2^2 )

or in more succinct form

    p(x[n]; epsilon) = (1 - epsilon) phi_1(x[n]) + epsilon phi_2(x[n])

where

    phi_i(x[n]) = 1/sqrt(2 pi sigma_i^2) exp( -(1/2) x^2[n]/sigma_i^2 ).

The parameter epsilon is termed the mixture parameter, which satisfies 0 < epsilon < 1, and sigma_1^2, sigma_2^2 are the variances of the individual Gaussian PDFs. The Gaussian mixture PDF may be thought of as the PDF of a random variable obtained from a N(0, sigma_1^2) PDF with probability 1 - epsilon and from a N(0, sigma_2^2) PDF with probability epsilon. Now if sigma_1^2, sigma_2^2 are known and epsilon is to be estimated, all our usual MVU estimation methods will fail. The MLE will require the maximization of a very nonlinear function of epsilon. Although this can be implemented using a grid search, the method of moments provides a much simpler estimator. Note that

    E(x^2[n]) = (1 - epsilon) sigma_1^2 + epsilon sigma_2^2   (9.1)

so that solving for epsilon and replacing the second moment by its natural estimator produces

    epsilon_hat = ( (1/N) sum_{n=0}^{N-1} x^2[n] - sigma_1^2 ) / ( sigma_2^2 - sigma_1^2 ).   (9.2)

But it is easily shown that

    E(x^4[n]) = 3 (1 - epsilon) sigma_1^4 + 3 epsilon sigma_2^4

which when combined with (9.1) produces the variance

    var(epsilon_hat) = [ 3 (1 - epsilon) sigma_1^4 + 3 epsilon sigma_2^4
                         - ( (1 - epsilon) sigma_1^2 + epsilon sigma_2^2 )^2 ] / ( N ( sigma_2^2 - sigma_1^2 )^2 ).   (9.4)

To determine the loss in performance we could evaluate the CRLB (we would need to do this numerically) and compare it to (9.4). If the increase in variance were substantial, we could attempt to implement the MLE, which would attain the bound for large data records. It should be observed that the method of moments estimator is consistent in that epsilon_hat -> epsilon as N -> infinity. This is because as N -> infinity, from (9.3) E(epsilon_hat) = epsilon and from (9.4) var(epsilon_hat) -> 0. Since the estimates of the moments that are substituted into the theoretical equation approach the true moments for large data records, the equation to be solved approaches the theoretical equation. As a result, the method of moments estimator will in general be consistent (see Problem 7.5). In (9.2), for example, as N -> infinity, the sample second moment approaches E(x^2[n]), so that epsilon_hat -> epsilon.

If x[n] = A + w[n] is observed for n = 0, 1, ..., N-1, where w[n] is WGN with variance sigma^2 and A is to be estimated, then we know that

    mu_1 = E(x[n]) = A.
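The mixture-parameter estimator (9.2) and its variance (9.4) can be sketched and checked by simulation. This is an illustrative implementation assuming numpy; the particular values of epsilon, sigma_1^2, sigma_2^2 are chosen for the demonstration.

```python
# Sketch of the method of moments mixture-parameter estimator (9.2),
# with sigma_1^2 and sigma_2^2 known (assumes numpy).
import numpy as np

def mom_mixture(x, var1, var2):
    mu2_hat = np.mean(np.asarray(x) ** 2)      # (1/N) sum x^2[n]
    return (mu2_hat - var1) / (var2 - var1)

# Usage: draw N samples from the two-component Gaussian mixture.
rng = np.random.default_rng(4)
N, eps, var1, var2 = 100000, 0.3, 1.0, 5.0
which = rng.random(N) < eps                    # component 2 with prob. eps
x = np.where(which, np.sqrt(var2), np.sqrt(var1)) * rng.standard_normal(N)
eps_hat = mom_mixture(x, var1, var2)
```

For these values (9.4) predicts a standard deviation of about 0.0035, so a single run should land very close to the true epsilon = 0.3.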
This is the theoretical equation of (9.5). According to (9.7), we replace mu_1 by its natural estimator, resulting in the method of moments estimator

    A_hat = (1/N) sum_{n=0}^{N-1} x[n].

As a second example, suppose the x[n]'s are IID samples from the exponential PDF, for which the first moment is

    mu_1 = E(x[n]) = integral_0^infinity xi lambda exp(-lambda xi) d xi = 1/lambda.

Solving for lambda

    lambda = 1/mu_1

and substituting the natural estimator results in the method of moments estimator

    lambda_hat = 1 / ( (1/N) sum_{n=0}^{N-1} x[n] ).

9.4 Extension to a Vector Parameter

Now consider a vector parameter theta of dimension p x 1. It is obvious that to solve for theta requires p theoretical moment equations. Hence, we suppose

    mu_1 = h_1( theta_1, theta_2, ..., theta_p )
    mu_2 = h_2( theta_1, theta_2, ..., theta_p )
    ...
    mu_p = h_p( theta_1, theta_2, ..., theta_p )   (9.8)

or in matrix form

    mu = h(theta).   (9.9)

It may occur that the first p moments are insufficient to determine all the parameters to be estimated (see Example 9.3). In this case we need to find some set of p moment equations allowing (9.9) to be solved for theta to obtain (9.10). In practice, it is desirable to use the lowest order moments possible. This is because the variance of the moment estimator generally increases with order. Also, to be able to solve the resulting equations we would like them to be linear or at least mildly nonlinear. Otherwise, a nonlinear optimization may be needed, defeating the original motivation for the method of moments estimator as an easily implemented estimator. In the vector parameter case we may also need cross-moments, as the signal processing example in Section 9.6 illustrates (see also Problem 9.5). We now continue the Gaussian mixture example.

Example 9.3 - Gaussian Mixture PDF

Returning to the introductory example of a Gaussian mixture PDF, assume now that in addition to epsilon, the Gaussian variances sigma_1^2 and sigma_2^2 are unknown as well. To estimate all three parameters we require three moment equations. Noting that the PDF is an even function so that all odd order moments are zero, we utilize

    mu_2 = E(x^2[n]) = (1 - epsilon) sigma_1^2 + epsilon sigma_2^2
    mu_4 = E(x^4[n]) = 3 (1 - epsilon) sigma_1^4 + 3 epsilon sigma_2^4
    mu_6 = E(x^6[n]) = 15 (1 - epsilon) sigma_1^6 + 15 epsilon sigma_2^6.

Although nonlinear, these equations may be solved by letting [Rider 1961]

    u = sigma_1^2 + sigma_2^2
    v = sigma_1^2 sigma_2^2.   (9.12)
Then it can be shown by direct substitution that

    u = ( mu_6 - 5 mu_4 mu_2 ) / ( 5 mu_4 - 15 mu_2^2 )
    v = mu_2 u - mu_4 / 3.

Once u is found, v can be determined. Then, sigma_1^2 and sigma_2^2 are obtained by solving (9.12).

Under this latter assumption the PDF of [T_1 T_2 ... T_r]^T will be concentrated about its mean. In Example 9.2, for instance, the estimator of lambda can be written as

    lambda_hat = g( T_1(x) )

where T_1(x) = (1/N) sum_{n=0}^{N-1} x[n] and g(T_1) = 1/T_1. For large N the PDF of T_1 will be heavily concentrated about its mean since var(T_1) = 1/(N lambda^2), as will be shown in Example 9.4. Using the statistical linearization argument of Chapter 3, we can use a first-order Taylor expansion of g about the mean of T_1. In general, we assume that

    theta_hat = g(T)

where T = [T_1 T_2 ... T_r]^T. We then perform a first-order Taylor expansion of g about the mean of T.
Consider again the estimator lambda_hat = 1/( (1/N) sum_{n=0}^{N-1} x[n] ), where the x[n]'s are IID and each has an exponential distribution. To find the approximate mean and variance we use (9.15) and (9.16). In this case we have

    T_1 = (1/N) sum_{n=0}^{N-1} x[n]

and g(T_1) = 1/T_1. The mean of T_1 is

    mu_{T_1} = E(T_1) = (1/N) sum_{n=0}^{N-1} E(x[n]) = E(x[n]) = 1/lambda

using the results of Example 9.2. The variance of T_1 is

    var(T_1) = var( (1/N) sum_{n=0}^{N-1} x[n] ) = var(x[n]) / N.

But

    var(x[n]) = integral_0^infinity x^2[n] lambda exp(-lambda x[n]) dx[n] - 1/lambda^2
              = 2/lambda^2 - 1/lambda^2 = 1/lambda^2

so that

    var(T_1) = 1/(N lambda^2).

Also,

    dg/dT_1 |_{T_1 = mu_{T_1}} = -1/mu_{T_1}^2 = -lambda^2.

From (9.15) we have the approximate mean

    E(lambda_hat) ~ g( mu_{T_1} ) = lambda

and from (9.16) the approximate variance

    var(lambda_hat) ~ ( dg/dT_1 )^2 var(T_1) = lambda^4 (1/(N lambda^2)) = lambda^2 / N.

The estimator is seen to be approximately unbiased and has an approximate variance decreasing with N. It should be emphasized that these expressions are only approximate. They are only as accurate as is the linearization of g. In fact, for this problem lambda_hat can be easily shown to be the MLE. As such, its asymptotic PDF (as N -> infinity) can be shown to be (see (7.8)) N(lambda, lambda^2/N). Hence, the approximation is accurate only as N -> infinity. (In Problem 9.10 quadratic expansions of g are employed to derive the second-order approximations to the mean and variance of lambda_hat.) To determine how large N must be for the mean and variance expressions to hold requires a Monte Carlo computer simulation.

The basic premise behind the Taylor series approach is that the function g be approximately linear over the range of T for which p(T; theta) is essentially nonzero. This occurs naturally when

1. The data record is large so that p(T; theta) is concentrated about its mean, as in the previous example. There the PDF of T_1 = (1/N) sum_{n=0}^{N-1} x[n] becomes more concentrated about E(T_1) as N -> infinity since var(T_1) = 1/(N lambda^2) -> 0 as N -> infinity.

2. The problem is to estimate the parameter of a signal in noise, and the SNR is high. In this case, by expanding the function about the signal we can obtain the approximate mean and variance. For a high SNR the results will be quite accurate. This is because at a high SNR the noise causes only a slight perturbation of the estimator value obtained in the case of no noise. We now develop this second case.

We consider the data

    x[n] = s[n; theta] + w[n]   n = 0, 1, ..., N-1

where w[n] is zero mean noise with covariance matrix C. A general estimator of the scalar parameter theta is

    theta_hat = g(x) = g( s(theta) + w ) = h(w).
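A Monte Carlo check of the approximate mean and variance just derived for lambda_hat = 1/T_1 can be sketched as below, assuming numpy; the particular lambda, N, and trial count are illustrative.

```python
# Monte Carlo check of the Taylor-expansion approximations
# E(lambda_hat) ~ lambda and var(lambda_hat) ~ lambda^2 / N
# for exponential data (assumes numpy).
import numpy as np

rng = np.random.default_rng(5)
lam, N, trials = 2.0, 1000, 2000
x = rng.exponential(scale=1.0 / lam, size=(trials, N))   # IID exponential
lam_hat = 1.0 / x.mean(axis=1)                           # 1 / T_1 per trial
mean_emp = lam_hat.mean()
var_emp = lam_hat.var()
```

As the text notes, the expressions are first-order approximations: the empirical mean shows the small O(1/N) bias that a second-order expansion (Problem 9.10) would capture.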
In this case we may choose the statistic T as the original data since as the SNR becomes larger, the PDF of T = x becomes more concentrated about its mean, which is the signal. We now use a first-order Taylor expansion of g about the mean of x, which is mu = s(theta), or, equivalently, of h about w = 0. This results in

    theta_hat ~ h(0) + sum_{n=0}^{N-1} ( dh/dw[n] )|_{w=0} w[n].   (9.17)

For the data

    x[n] = r^n + w[n]   n = 0, 1, 2

where w[n] is zero mean uncorrelated noise with variance sigma^2, the damping factor r is to be estimated. Wishing to avoid a maximization of the likelihood to find the MLE (see Example 7.11), we propose the estimator

    r_hat = ( x[2] + x[1] ) / ( x[1] + x[0] ).

In terms of the signal and noise we have

    r_hat = h(w) = ( r^2 + w[2] + r + w[1] ) / ( r + w[1] + 1 + w[0] )

and according to (9.18)

    E(r_hat) ~ h(0) = ( r^2 + r ) / ( r + 1 ) = r

or the estimator is approximately unbiased. To find the variance

    ( dh/dw[0] )|_{w=0} = - ( r^2 + r ) / ( r + 1 )^2 = - r/(r+1).

Similarly,

    ( dh/dw[1] )|_{w=0} = - ( r - 1 ) / ( r + 1 )
    ( dh/dw[2] )|_{w=0} = 1 / ( r + 1 ).

It is also worthwhile to remark that from (9.14) and (9.17) the estimators are approximately linear functions of T and w, respectively. If these are Gaussian, then theta_hat will also be Gaussian - at least to within the approximation assumed by the Taylor expansion. Finally, we should note that for a vector parameter the approximate mean and variance for each component can be obtained by applying these techniques. Obtaining the covariance matrix of theta_hat is possible using a first-order Taylor expansion but can be extremely tedious.

9.6 Signal Processing Example

We now apply the method of moments and the approximate performance analysis to the problem of frequency estimation. Assume that we observe

    x[n] = A cos( 2 pi f_0 n + phi ) + w[n]   n = 0, 1, ..., N-1

where w[n] is zero mean white noise with variance sigma^2. The frequency f_0 is to be estimated. This problem was discussed in Example 7.16 in which the MLE of frequency was shown to be approximately given by the peak location of a periodogram. In an effort to reduce the computation involved in searching for the peak location, we now describe a method of moments estimator. To do so we depart slightly from the usual sinusoidal model to assume that the phase phi is a random variable independent of w[n] and distributed as phi ~ U[0, 2 pi]. With this assumption the signal s[n] = A cos( 2 pi f_0 n + phi ) can be viewed as the realization of a WSS random process. That this is the case is verified by determining the mean and ACF of s[n]. The mean is zero, since phi ~ U[0, 2 pi].
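The high-SNR analysis of r_hat can be verified by simulation. From the partial derivatives listed above with C = sigma^2 I, (9.19) gives var(r_hat) ~ sigma^2 ( r^2 + (r-1)^2 + 1 ) / (r+1)^2; the sketch below (assuming numpy, with illustrative values of r and sigma) checks both the approximate mean and this variance.

```python
# Monte Carlo check of the high-SNR approximations for
# r_hat = (x[2] + x[1]) / (x[1] + x[0]), x[n] = r^n + w[n] (assumes numpy).
import numpy as np

rng = np.random.default_rng(6)
r, sigma, trials = 0.5, 0.01, 200000
w = sigma * rng.standard_normal((trials, 3))
x = r ** np.arange(3) + w                    # x[n] = r^n + w[n], n = 0, 1, 2
r_hat = (x[:, 2] + x[:, 1]) / (x[:, 1] + x[:, 0])
# First-order variance prediction from the listed partial derivatives:
var_pred = sigma**2 * (r**2 + (1 - r) ** 2 + 1) / (r + 1) ** 2
```

At this SNR the empirical mean and variance agree closely with the first-order predictions; as sigma grows, the linearization degrades and so do the approximations.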
and the ACF is

    r_ss[k] = E( s[n] s[n+k] ) = (A^2/2) cos( 2 pi f_0 k ).

Solving the lag-one equation r_ss[1] = (A^2/2) cos(2 pi f_0) for the frequency and replacing the theoretical ACF by the sample average

    (1/(N-1)) sum_{n=0}^{N-2} x[n] x[n+1]

yields the method of moments frequency estimator of (9.20), which requires an arccos(x) evaluation in place of a periodogram peak search. To evaluate its performance we apply (9.19) with C = sigma^2 I, which requires the partial derivatives of the estimator with respect to each x[n]; these involve the terms s[n-1] + s[n+1]. But s[n-1] + s[n+1] = 2 cos(2 pi f_0) s[n], as can easily be verified, so that finally the variance of the frequency estimator at a high SNR is given by (9.22).

[Figure: Monte Carlo performance of the method of moments frequency estimator versus SNR in dB - the estimates concentrate about the true frequency as the SNR increases]
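The ACF-based frequency estimator of this section can be sketched as below, in the normalized form discussed in Problem 9.12 (valid at high SNR, and usable when A is unknown). This is an illustrative implementation assuming numpy; the signal parameters are chosen for the demonstration.

```python
# Sketch of the ACF-based method of moments frequency estimator
# (normalized form of Problem 9.12; assumes numpy and high SNR).
import numpy as np

def mom_frequency(x):
    x = np.asarray(x, dtype=float)
    N = len(x)
    r1 = np.dot(x[:-1], x[1:]) / (N - 1)     # sample ACF at lag 1
    r0 = np.dot(x, x) / N                    # sample ACF at lag 0
    return np.arccos(np.clip(r1 / r0, -1.0, 1.0)) / (2 * np.pi)

rng = np.random.default_rng(7)
N, f0, A, sigma = 5000, 0.17, 1.0, 0.01
n = np.arange(N)
phi = rng.uniform(0, 2 * np.pi)
x = A * np.cos(2 * np.pi * f0 * n + phi) + sigma * rng.standard_normal(N)
f0_hat = mom_frequency(x)
```

Note the design caveat raised in Problem 9.12: the lag-zero term includes the noise power sigma^2, so at low SNR the ratio r1/r0 is biased toward zero and the estimator degrades.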
Problems

9.1 If N IID observations \{x[0], x[1], \ldots, x[N-1]\} are made from the Rayleigh PDF

9.3 Assume that N IID samples from a bivariate Gaussian PDF are observed or \{x_0, x_1, \ldots, x_{N-1}\}, where each x is a 2 x 1 random vector with PDF x ~ N(0, C). If

where \{a[1], a[2], \ldots, a[p]\} denote the AR filter parameters and q is the MA order. Propose a method of moments estimator for the AR filter parameters.

9.6 For a DC level in WGN or

x[n] = A + w[n] \qquad n = 0, 1, \ldots, N - 1

where w[n] is WGN with variance \sigma^2, the parameter A^2 is to be estimated. It is proposed to use

\widehat{A^2} = \bar{x}^2.

For this estimator find the approximate mean and variance using a first-order Taylor expansion approach.

9.7 For the observed data

9.10 Using the results of Problems 9.8 and 9.9 find the approximate mean and variance to second order for the estimator of \lambda discussed in Example 9.4. How do your results compare to the first-order result? Be sure to justify the approximate Gaussian PDF of \bar{x}, required to apply the variance expression. Also, compare the results to those predicted from asymptotic MLE theory (recall that the estimator is also the MLE).

9.11 Prove for the sinusoid

s[n] = A\cos(2\pi f_0 n + \phi)

where the phase is deterministic but unknown, that

\frac{1}{N - 1}\sum_{n=0}^{N-2} s[n]s[n+1] \to \frac{A^2}{2}\cos 2\pi f_0

as N \to \infty. Assume that f_0 is not near 0 or 1/2. Hence, comment on the use of the method of moments estimator proposed in the signal processing example in Section 9.6 for the deterministic phase sinusoid.

9.12 To extend the applicability of the frequency estimator proposed in the signal processing example in Section 9.6 now assume that the signal amplitude A is unknown. Why can't the proposed estimator (9.20) be used in this case? Consider the estimator

\hat{f}_0 = \frac{1}{2\pi}\arccos\left[\frac{\frac{1}{N-1}\sum_{n=0}^{N-2} x[n]x[n+1]}{\frac{1}{N}\sum_{n=0}^{N-1} x^2[n]}\right]

Show that at a high SNR and for large N, E(\hat{f}_0) = f_0. Justify this estimator as a method of moments estimator based on the ACF. Would you expect this estimator to work well at a lower SNR?
Chapter 10

The Bayesian Philosophy

10.1 Introduction
We now depart from the classical approach to statistical estimation in which the parameter \theta of interest is assumed to be a deterministic but unknown constant. Instead, we assume that \theta is a random variable whose particular realization we must estimate. This is the Bayesian approach, so named because its implementation is based directly on Bayes' theorem. The motivation for doing so is twofold. First, if we have available some prior knowledge about \theta, we can incorporate it into our estimator. The mechanism for doing this requires us to assume that \theta is a random variable with a given prior PDF. Classical estimation, on the other hand, finds it difficult to make use of any prior knowledge. The Bayesian approach, when applicable, can therefore improve the estimation accuracy. Second, Bayesian estimation is useful in situations where an MVU estimator cannot be found, as for example, when the variance of an unbiased estimator may not be uniformly less than that of all other unbiased estimators. In this instance, it may be true that for most values of the parameter an estimator can be found whose mean square error may be less than that of all other estimators. By assigning a PDF to \theta we can devise strategies to find that estimator. The resultant estimator can then be said to be optimal "on the average," or with respect to the assumed prior PDF of \theta. In this chapter we attempt to motivate the Bayesian approach and to discuss some of the issues surrounding its use. The reader should be aware that this approach to estimation has had a long and controversial history. For a more definitive account [Box and Tiao 1973] is recommended.
10.2 Summary

The Bayesian MSE is defined in (10.2) and is minimized by the estimator of (10.5), which is the mean of the posterior PDF. The example of a DC level in WGN with a Gaussian prior PDF is described in Section 10.4. The minimum MSE estimator for this example is given by (10.11) and represents a weighting between the data knowledge and prior knowledge. The corresponding minimum MSE is given by (10.14). The ability to
estimate the realization of a random variable based on data is described in Section 10.5, using the result that for jointly Gaussian random variables the conditional PDF is also Gaussian, having a mean (10.24) and a covariance (10.25). This is then applied to the Bayesian linear model of (10.26) to yield the posterior PDF as summarized in Theorem 10.3. As will be shown in Chapter 11, the mean of the posterior PDF (10.28) is the minimum MSE estimator for a vector parameter. Section 10.7 discusses nuisance parameters from a Bayesian viewpoint, while Section 10.8 describes the potential difficulties of using a Bayesian estimator in a classical estimation problem.

10.3 Prior Knowledge and Estimation

It is a fundamental rule of estimation theory that the use of prior knowledge will lead to a more accurate estimator. For example, if a parameter is constrained to lie in a known interval, then any good estimator should produce only estimates within that interval. In Example 3.1 it was shown that the MVU estimator of A is the sample mean \bar{x}. However, this assumed that A could take on any value in the interval -\infty < A < \infty. Due to physical constraints it may be more reasonable to assume that A can take on only values in the finite interval -A_0 \le A \le A_0. To retain \hat{A} = \bar{x} as the best estimator may then be undesirable since \bar{x} may yield values outside the known interval. As shown in Figure 10.1a, this is due to noise effects. Certainly, we would expect to improve our estimation if we used the truncated sample mean estimator

\hat{A} = \begin{cases} -A_0 & \bar{x} < -A_0 \\ \bar{x} & -A_0 \le \bar{x} \le A_0 \\ A_0 & \bar{x} > A_0 \end{cases}

which would be consistent with the known constraints. Such an estimator would have the PDF

p_{\hat{A}}(\xi; A) = \Pr\{\bar{x} \le -A_0\}\,\delta(\xi + A_0) + p_{\bar{x}}(\xi; A)\,[u(\xi + A_0) - u(\xi - A_0)] + \Pr\{\bar{x} \ge A_0\}\,\delta(\xi - A_0) \qquad (10.1)

where u(x) is the unit step function. This is shown in Figure 10.1b. It is seen that \hat{A} is a biased estimator.

Figure 10.1 Improvement of estimator via prior knowledge: (a) PDF of sample mean, (b) PDF of truncated sample mean.

However, if we compare the MSE of the two estimators, we note that for any A in the interval -A_0 \le A \le A_0

mse(\bar{x}) = \int_{-\infty}^{\infty} (\xi - A)^2 p_{\bar{x}}(\xi; A)\,d\xi

= \int_{-\infty}^{-A_0} (\xi - A)^2 p_{\bar{x}}(\xi; A)\,d\xi + \int_{-A_0}^{A_0} (\xi - A)^2 p_{\bar{x}}(\xi; A)\,d\xi + \int_{A_0}^{\infty} (\xi - A)^2 p_{\bar{x}}(\xi; A)\,d\xi

> \int_{-\infty}^{-A_0} (-A_0 - A)^2 p_{\bar{x}}(\xi; A)\,d\xi + \int_{-A_0}^{A_0} (\xi - A)^2 p_{\bar{x}}(\xi; A)\,d\xi + \int_{A_0}^{\infty} (A_0 - A)^2 p_{\bar{x}}(\xi; A)\,d\xi

= mse(\hat{A}).

Hence, \hat{A}, the truncated sample mean estimator, is better than the sample mean estimator in terms of MSE. Although \bar{x} is still the MVU estimator, we have been able to reduce the mean square error by allowing the estimator to be biased. Inasmuch as we have been able to produce a better estimator, the question arises as to whether an optimal estimator exists for this problem. (The reader may recall that in the classical case the MSE criterion of optimality usually led to unrealizable estimators. We shall see that this is not a problem in the Bayesian approach.) We can answer affirmatively but only after reformulating the data model. Knowing that A must lie in a known interval, we suppose that the true value of A has been chosen from that interval. We then model the process of choosing a value as a random event to which a PDF can be assigned. With knowledge only of the interval and no inclination as to whether A should be nearer any particular value, it makes sense to assign a U[-A_0, A_0] PDF to the random variable A. The overall data model then appears as in Figure 10.2. As shown there, the act of choosing A according to the given PDF represents the departure of the Bayesian approach from the classical approach. The problem, as always, is to estimate the value of A or the realization of the random variable. However, now we can incorporate our knowledge of how A was chosen. For example, we might attempt to find an estimator \hat{A} that would minimize the Bayesian MSE defined as

Bmse(\hat{A}) = E[(A - \hat{A})^2]. \qquad (10.2)

We choose to define the error as A - \hat{A} in contrast to the classical estimation error of \hat{A} - A. This definition will be useful later when we discuss a vector space interpretation of the Bayesian estimator. In (10.2) we emphasize that since A is a random variable, the expectation operator is with respect to the joint PDF p(x, A). This is a fundamentally different MSE than in the classical case. We distinguish it by using the Bmse notation. To appreciate the difference compare the classical MSE

mse(\hat{A}) = \int (\hat{A} - A)^2 p(x; A)\,dx \qquad (10.3)
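The MSE inequality above can be checked by simulation. The following sketch is my own illustration, not from the text; the values of N, A_0, \sigma, and the true A are assumed for the example:

```python
import numpy as np

# Monte Carlo check that truncating the sample mean to [-A0, A0]
# does not increase the MSE when the true A lies in that interval.
rng = np.random.default_rng(1)
N, A0, sigma, trials = 10, 1.0, 1.0, 200000
A = 0.8                              # true DC level, inside [-A0, A0]

w = sigma * rng.standard_normal((trials, N))
xbar = A + w.mean(axis=1)            # sample mean realizations
Ahat = np.clip(xbar, -A0, A0)        # truncated sample mean

mse_xbar = np.mean((xbar - A) ** 2)  # approximately sigma^2 / N = 0.1
mse_trunc = np.mean((Ahat - A) ** 2)
print(mse_xbar, mse_trunc)
```

Since clipping can only move an estimate closer to any A in [-A_0, A_0], every realization satisfies |\hat{A} - A| \le |\bar{x} - A|, and the simulated MSE of the truncated estimator comes out smaller.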
[Figure 10.2: the Bayesian data model. A is first chosen according to the prior PDF p(A) = 1/(2A_0) for |A| \le A_0, and then x[n], n = 0, 1, \ldots, N - 1, is observed. Figure 10.3: (a) the uniform prior PDF p(A), (b) the posterior PDF p(A|x) with mean E(A|x).]
In determining the MMSE estimator we first require the posterior PDF. We can use Bayes' rule to determine it as

p(A|x) = \frac{p(x|A)p(A)}{p(x)} = \frac{p(x|A)p(A)}{\int p(x|A)p(A)\,dA}. \qquad (10.6)

Note that w[n] is independent of A. Then, for n = 0, 1, \ldots, N - 1

p_x(x[n]\,|\,A) = p_w(x[n] - A\,|\,A) = p_w(x[n] - A) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2\sigma^2}(x[n] - A)^2\right]

and therefore

p(x|A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)^2\right]. \qquad (10.7)

It is apparent that the PDF is identical in form to the usual classical PDF p(x; A). In the Bayesian case, however, the PDF is a conditional PDF, hence the "|" separator, while in the classical case, it represents an unconditional PDF, albeit parameterized by A, hence the separator ";" (see also Problem 10.6). Using (10.6) and (10.7), the posterior PDF becomes

p(A|x) = \begin{cases} \dfrac{\dfrac{1}{2A_0}\dfrac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\dfrac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)^2\right]}{\displaystyle\int_{-A_0}^{A_0}\dfrac{1}{2A_0(2\pi\sigma^2)^{N/2}}\exp\left[-\dfrac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)^2\right]dA} & |A| \le A_0 \\ 0 & |A| > A_0. \end{cases}

But

\sum_{n=0}^{N-1}(x[n] - A)^2 = \sum_{n=0}^{N-1}x^2[n] - 2NA\bar{x} + NA^2 = N(A - \bar{x})^2 + \sum_{n=0}^{N-1}x^2[n] - N\bar{x}^2

so that we have

p(A|x) = \begin{cases} \dfrac{\dfrac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\dfrac{1}{2\frac{\sigma^2}{N}}(A - \bar{x})^2\right]}{\displaystyle\int_{-A_0}^{A_0}\dfrac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\dfrac{1}{2\frac{\sigma^2}{N}}(A - \bar{x})^2\right]dA} & |A| \le A_0 \\ 0 & |A| > A_0. \end{cases} \qquad (10.8)

The PDF is seen to be a truncated Gaussian, as shown in Figure 10.3b. The MMSE estimator, which is the mean of p(A|x), is

\hat{A} = E(A|x) = \int A\,p(A|x)\,dA = \frac{\displaystyle\int_{-A_0}^{A_0} A\,\frac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\frac{1}{2\frac{\sigma^2}{N}}(A - \bar{x})^2\right]dA}{\displaystyle\int_{-A_0}^{A_0}\frac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\frac{1}{2\frac{\sigma^2}{N}}(A - \bar{x})^2\right]dA}. \qquad (10.9)

Although this cannot be evaluated in closed form, we note that \hat{A} will be a function of \bar{x} as well as of A_0 and \sigma^2 (see Problem 10.7). The MMSE estimator will not be \bar{x} due to the truncation shown in Figure 10.3b unless A_0 is so large that there is effectively no truncation. This will occur if A_0 \gg \sqrt{\sigma^2/N}. Otherwise, the estimator will be "biased" towards zero as opposed to being equal to \bar{x}. This is because the prior knowledge embodied in p(A) would in the absence of the data x produce the MMSE estimator (see Problem 10.8)

\hat{A} = E(A) = 0.

The effect of the data is to position the posterior mean between A = 0 and A = \bar{x} in a compromise between the prior knowledge and that contributed by the data. To further appreciate this weighting consider what happens as N becomes large so that the data knowledge becomes more important. As shown in Figure 10.4, as N increases, we have from (10.8) that the posterior PDF becomes more concentrated about \bar{x} (since \sigma^2/N decreases). Hence, it becomes nearly Gaussian, and its mean becomes just \bar{x}. The MMSE estimator relies less and less on the prior knowledge and more on the data. It is said that the data "swamps out" the prior knowledge.
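Although (10.9) has no closed form, it is easy to evaluate numerically. The sketch below is my own illustration (not from the text); it normalizes the truncated Gaussian with a simple discrete sum rather than a library quadrature, and all names are chosen for the example:

```python
import numpy as np

def posterior_mean(xbar, var_over_N, A0, grid=20001):
    """Mean of the truncated Gaussian posterior (10.9), computed numerically."""
    A = np.linspace(-A0, A0, grid)
    kernel = np.exp(-0.5 * (A - xbar) ** 2 / var_over_N)  # unnormalized posterior
    weights = kernel / kernel.sum()                       # discrete normalization
    return np.sum(A * weights)

# Heavy truncation (A0 = 1) pulls the estimate toward zero;
# with A0 = 10 there is effectively no truncation and the mean is near xbar.
print(posterior_mean(1.5, 1.0, 1.0))
print(posterior_mean(1.5, 1.0, 10.0))
```

The two printed values illustrate the "biasing towards zero" discussed above: the first lies strictly inside (-1, 1) even though \bar{x} = 1.5, while the second essentially reproduces \bar{x}.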
10.4 Choosing a Prior PDF

In the introductory example the posterior PDF p(A|x) as given by (10.8) could not be found explicitly due to the need to normalize p(x|A)p(A) so that it integrates to 1. Additionally, the posterior mean could not be found, as evidenced by (10.9). We would have to resort to numerical integration to actually implement the MMSE estimator. This problem is compounded considerably in the vector parameter case. There the posterior PDF becomes

p(\theta|x) = \frac{p(x|\theta)p(\theta)}{\int p(x|\theta)p(\theta)\,d\theta}

[Figure 10.4: as N increases, the posterior PDF p(A|x) becomes more concentrated about \bar{x}, and E(A|x) approaches \bar{x}.]

The posterior mean may then be compared with the MVU estimator in the classical approach. The only practical stumbling block that remains, however, is whether or not E(\theta|x) can be determined in closed form. For the data model of Example 10.1 we have

p(x|A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)^2\right]
so that

p(A|x) = \frac{p(x|A)p(A)}{\int p(x|A)p(A)\,dA} \propto \exp\left[-\frac{1}{2}Q(A)\right].

Note, however, that the denominator does not depend on A, being a normalizing factor, and the argument of the exponential is quadratic in A. Hence, p(A|x) must be a Gaussian PDF whose mean and variance depend on x. Continuing, we have for Q(A)

Q(A) = \frac{N}{\sigma^2}A^2 - \frac{2NA\bar{x}}{\sigma^2} + \frac{A^2}{\sigma_A^2} - \frac{2\mu_A A}{\sigma_A^2} + \frac{\mu_A^2}{\sigma_A^2}

and, completing the square with

\frac{1}{\sigma_{A|x}^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_A^2} \qquad \mu_{A|x} = \left(\frac{N}{\sigma^2}\bar{x} + \frac{\mu_A}{\sigma_A^2}\right)\sigma_{A|x}^2

we obtain

Q(A) = \frac{1}{\sigma_{A|x}^2}(A - \mu_{A|x})^2 - \frac{\mu_{A|x}^2}{\sigma_{A|x}^2} + \frac{\mu_A^2}{\sigma_A^2}

so that

p(A|x) = \frac{\exp\left[-\frac{1}{2\sigma_{A|x}^2}(A - \mu_{A|x})^2\right]\exp\left[-\frac{1}{2}\left(\frac{\mu_A^2}{\sigma_A^2} - \frac{\mu_{A|x}^2}{\sigma_{A|x}^2}\right)\right]}{\displaystyle\int_{-\infty}^{\infty}\exp\left[-\frac{1}{2\sigma_{A|x}^2}(A - \mu_{A|x})^2\right]\exp\left[-\frac{1}{2}\left(\frac{\mu_A^2}{\sigma_A^2} - \frac{\mu_{A|x}^2}{\sigma_{A|x}^2}\right)\right]dA} = \frac{1}{\sqrt{2\pi\sigma_{A|x}^2}}\exp\left[-\frac{1}{2\sigma_{A|x}^2}(A - \mu_{A|x})^2\right]

where the last step follows from the requirement that p(A|x) integrate to 1. The posterior PDF is also Gaussian, as claimed. (This result could also have been obtained by noting that x and A are jointly Gaussian; see Section 10.5.) The MMSE estimator is the mean of the posterior PDF,

\hat{A} = E(A|x) = \mu_{A|x} = \frac{\frac{N}{\sigma^2}\bar{x} + \frac{\mu_A}{\sigma_A^2}}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}

or finally, the MMSE estimator is

\hat{A} = \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}}\,\bar{x} + \frac{\frac{\sigma^2}{N}}{\sigma_A^2 + \frac{\sigma^2}{N}}\,\mu_A. \qquad (10.11)
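With a Gaussian prior the posterior quantities are therefore available in closed form. The following sketch is my own illustration under assumed example values (all names are mine); it also checks numerically that the data weighting \alpha and the posterior variance are consistent:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma2, muA, sigA2 = 20, 1.0, 0.0, 0.25

A = rng.normal(muA, np.sqrt(sigA2))              # realization to be estimated
x = A + np.sqrt(sigma2) * rng.standard_normal(N)
xbar = x.mean()

post_var = 1.0 / (N / sigma2 + 1.0 / sigA2)      # sigma^2_{A|x}: informations add
alpha = sigA2 / (sigA2 + sigma2 / N)             # weight on the data
Ahat = alpha * xbar + (1.0 - alpha) * muA        # MMSE estimate of (10.11)
print(Ahat, post_var)
```

Note that alpha * sigma2 / N equals the posterior variance exactly, which is one way to see that the prior "information" 1/\sigma_A^2 and the data "information" N/\sigma^2 simply add.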
\frac{1}{\sigma_{A|x}^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}

10.5 Properties of the Gaussian PDF

p(x, y) = \frac{1}{2\pi\det^{1/2}(C)}\exp\left[-\frac{1}{2}\begin{bmatrix} x - E(x) \\ y - E(y) \end{bmatrix}^T C^{-1}\begin{bmatrix} x - E(x) \\ y - E(y) \end{bmatrix}\right] \qquad (10.15)
[Figure 10.6 Contours of constant density for the bivariate Gaussian PDF, showing the cross section at x = x_0.]

This is also termed the bivariate Gaussian PDF. The mean vector and covariance matrix are

\begin{bmatrix} E(x) \\ E(y) \end{bmatrix} \quad\text{and}\quad C = \begin{bmatrix} \mathrm{var}(x) & \mathrm{cov}(x, y) \\ \mathrm{cov}(y, x) & \mathrm{var}(y) \end{bmatrix}

so that the conditional PDF of y is that of the cross section shown in Figure 10.6 when suitably normalized to integrate to 1. It is readily seen that since p(x_0, y) (where x_0 is a fixed number) has the Gaussian form in y (from (10.15) the exponential argument is quadratic in y), the conditional PDF must also be Gaussian. Since p(y) is also Gaussian, we may view this property as saying that if x and y are jointly Gaussian, the prior PDF p(y) and posterior PDF p(y|x) are both Gaussian. In Appendix 10A we derive the exact PDF as summarized in the following theorem: if x and y are distributed according to a bivariate Gaussian PDF, then the conditional PDF p(y|x) is also Gaussian and

E(y|x) = E(y) + \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}(x - E(x))

\mathrm{var}(y|x) = \mathrm{var}(y) - \frac{\mathrm{cov}^2(x, y)}{\mathrm{var}(x)}. \qquad (10.18)

In normalized form (a random variable with zero mean and unity variance) this becomes

\frac{\hat{y} - E(y)}{\sqrt{\mathrm{var}(y)}} = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}\cdot\frac{x - E(x)}{\sqrt{\mathrm{var}(x)}}

or

\hat{y}_n = \rho\,x_n \qquad (10.21)

where \rho = \mathrm{cov}(x, y)/\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)} is the correlation coefficient and the subscript n denotes the normalized variables.
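These conditional-Gaussian facts are easy to see by simulation. The sketch below is my own illustration, not from the text; the correlation coefficient and sample size are assumed values:

```python
import numpy as np

# For normalized jointly Gaussian (x, y): the regression of y on x is the
# line y = rho * x, and the minimum MSE is var(y)(1 - rho^2).
rng = np.random.default_rng(3)
rho, n = 0.9, 500000
C = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], C, size=n)
x, y = xy[:, 0], xy[:, 1]

sel = np.abs(x - 1.0) < 0.05        # samples with x near 1
print(y[sel].mean())                # empirical E(y|x=1), near rho * 1 = 0.9
print(np.var(y - rho * x))          # residual variance, near 1 - rho^2 = 0.19
```

The conditioning slice gives the empirical conditional mean, and the residual variance does not depend on where the slice is taken, mirroring the statement that var(y|x) is independent of x.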
[Figure 10.7 Contours of constant density of normalized bivariate PDF.]

The correlation coefficient then acts to scale the normalized observation x_n to obtain the MMSE estimator of the normalized realization of the random variable y_n. If the random variables are already normalized (E(x) = E(y) = 0, var(x) = var(y) = 1), the constant PDF contours appear as in Figure 10.7. The locations of the peaks of p(x, y), when considered as a function of y for each x, is the dashed line y = \rho x, and it is readily shown that \hat{y} = E(y|x) = \rho x (see Problem 10.12). The MMSE estimator therefore exploits the correlation between the random variables to estimate the realization of one based on the realization of the other.

The minimum MSE is, from (10.13) and (10.18),

Bmse(\hat{y}) = \int \mathrm{var}(y|x)\,p(x)\,dx = \mathrm{var}(y|x) = \mathrm{var}(y)(1 - \rho^2) \qquad (10.22)

since the posterior variance does not depend on x (var(y) and \rho depend on the covariance matrix only). Hence, the quality of our estimator also depends on the correlation coefficient, which is a measure of the statistical dependence between x and y.

To generalize these results consider a jointly Gaussian vector [x^T y^T]^T, where x is k x 1 and y is l x 1. In other words, [x^T y^T]^T is distributed according to a multivariate Gaussian PDF. Then, the conditional PDF of y for a given x is also Gaussian, as summarized in the following theorem (see Appendix 10A for proof).

Theorem 10.2 (Conditional PDF of Multivariate Gaussian) If x and y are jointly Gaussian, where x is k x 1 and y is l x 1, with mean vector [E(x)^T E(y)^T]^T and partitioned covariance matrix

C = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix} = \begin{bmatrix} k \times k & k \times l \\ l \times k & l \times l \end{bmatrix} \qquad (10.23)

so that

p(x, y) = \frac{1}{(2\pi)^{\frac{k+l}{2}}\det^{1/2}(C)}\exp\left[-\frac{1}{2}\begin{bmatrix} x - E(x) \\ y - E(y) \end{bmatrix}^T C^{-1}\begin{bmatrix} x - E(x) \\ y - E(y) \end{bmatrix}\right]

then the conditional PDF p(y|x) is also Gaussian and

E(y|x) = E(y) + C_{yx}C_{xx}^{-1}(x - E(x)) \qquad (10.24)

C_{y|x} = C_{yy} - C_{yx}C_{xx}^{-1}C_{xy}. \qquad (10.25)

Note that the covariance matrix of the conditional PDF does not depend on x, although this property is not generally true. This will be useful later. As in the bivariate case, the prior PDF p(y) is Gaussian, as well as the posterior PDF p(y|x). The question may arise as to when the jointly Gaussian assumption may be made. In the next section we examine an important data model for which this holds, the Bayesian linear model.

10.6 Bayesian Linear Model

Recall that in Example 10.1 the data model was

x[n] = A + w[n] \qquad n = 0, 1, \ldots, N - 1

where A ~ N(\mu_A, \sigma_A^2), and w[n] is WGN independent of A. In terms of vectors we have the equivalent data model

x = 1A + w.

This appears to be of the form of the linear model described in Chapter 4 except for the assumption that A is a random variable. It should not then be surprising that a Bayesian equivalent of the general linear model can be defined. In particular let the data be modeled as

x = H\theta + w \qquad (10.26)

where x is an N x 1 data vector, H is a known N x p matrix, \theta is a p x 1 random vector with prior PDF N(\mu_\theta, C_\theta), and w is an N x 1 noise vector with PDF N(0, C_w) and independent of \theta. This data model is termed the Bayesian general linear model. It differs from the classical general linear model in that \theta is modeled as a random variable with a Gaussian prior PDF. It will be of interest in deriving Bayesian estimators to have an explicit expression for the posterior PDF p(\theta|x). From Theorem 10.2 we know that if x and \theta are jointly Gaussian, then the posterior PDF is also Gaussian. Hence, it only remains to verify that this is indeed the case. Let z = [x^T \theta^T]^T, so that from (10.26) we have

z = \begin{bmatrix} H\theta + w \\ \theta \end{bmatrix} = \begin{bmatrix} H & I \\ I & 0 \end{bmatrix}\begin{bmatrix} \theta \\ w \end{bmatrix}

where the identity matrices are of dimension N x N (upper right) and p x p (lower left), and 0 is a p x N matrix of zeros. Since \theta and w are independent of each other and each one is Gaussian, they are jointly Gaussian. Furthermore, because z is a linear
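Applying Theorem 10.2 to z = [x^T \theta^T]^T gives the posterior mean and covariance of \theta directly, since C_{xx} = HC_\theta H^T + C_w and C_{x\theta} = HC_\theta. The following sketch is my own numerical illustration (H, the prior, and the noise covariance are assumed example values, not from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 2
H = np.column_stack([np.ones(N), np.arange(N)])   # example observation matrix
mu_t = np.zeros(p)
C_t = np.eye(p)                                   # prior covariance of theta
C_w = 0.5 * np.eye(N)                             # noise covariance

theta = rng.multivariate_normal(mu_t, C_t)        # draw a realization of theta
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

# Theorem 10.2 applied to z = [x^T theta^T]^T:
Cxx = H @ C_t @ H.T + C_w                         # covariance of x
Cxt = H @ C_t                                     # cross-covariance of x and theta
K = Cxt.T @ np.linalg.inv(Cxx)                    # C_t H^T (H C_t H^T + C_w)^{-1}
post_mean = mu_t + K @ (x - H @ mu_t)             # E(theta | x)
post_cov = C_t - K @ Cxt                          # C_{theta|x}, independent of x
print(post_mean, theta)
```

With this much data the posterior mean lands close to the drawn realization, and the posterior covariance is a valid (symmetric, nonnegative definite) matrix that does not involve x.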
x = 1A + w.
Using (10.30) once again, we obtain var(A|x). In the succeeding chapters we will make extensive use of the Bayesian linear model. For future reference we point out that the mean (10.28) and covariance (10.29) of the posterior PDF can be expressed in alternative forms as (see Problem 10.13)

E(\theta|x) = \mu_\theta + (C_\theta^{-1} + H^T C_w^{-1}H)^{-1}H^T C_w^{-1}(x - H\mu_\theta) \qquad (10.32)

C_{\theta|x} = (C_\theta^{-1} + H^T C_w^{-1}H)^{-1}. \qquad (10.33)

This form lends itself to the interpretation that the "information" or reciprocal of the variance of the prior knowledge 1/\sigma_A^2 and the "information" of the data 1/(\sigma^2/N) add to yield the information embodied in the posterior PDF.

10.7 Nuisance Parameters

Many estimation problems are characterized by a set of unknown parameters, of which we are really interested only in a subset. The remaining parameters, which serve only to complicate the problem, are referred to as nuisance parameters. Such would be the case if for a DC level in WGN, we were interested in estimating \sigma^2 but A was unknown. The DC level A would be the nuisance parameter. If we assume that the parameters are deterministic, as in the classical estimation approach, then in general we have no alternative but to estimate \sigma^2 and A. In the Bayesian approach we can rid ourselves of nuisance parameters by "integrating them out." Suppose the unknown parameters to be estimated are \theta and some additional nuisance parameters \alpha are present. Then, if p(\theta, \alpha|x) denotes the joint posterior PDF, we can determine the posterior PDF of \theta only as

p(\theta|x) = \frac{p(x|\theta)p(\theta)}{\int p(x|\theta)p(\theta)\,d\theta} \qquad (10.36)

where

p(x|\theta) = \int p(x|\theta, \alpha)p(\alpha|\theta)\,d\alpha. \qquad (10.37)

If we furthermore assume that the nuisance parameters are independent of the desired parameters, then (10.37) reduces to

p(x|\theta) = \int p(x|\theta, \alpha)p(\alpha)\,d\alpha. \qquad (10.38)

Assume that we observe the N x 1 data vector x whose conditional PDF p(x|\theta, \sigma^2) is N(0, \sigma^2 C(\theta)). (The reader should not confuse C(\theta), the scaled covariance matrix of x, with C_\theta, the covariance matrix of \theta.) The parameter \theta is to be estimated, and \sigma^2 is to be regarded as a nuisance parameter. The covariance matrix depends on \theta in some unspecified manner. We assign the prior PDF to \sigma^2 of

p(\sigma^2) = \frac{\lambda}{\sigma^4}\exp\left(-\frac{\lambda}{\sigma^2}\right) \qquad \sigma^2 > 0 \qquad (10.39)
where \lambda > 0, and assume \sigma^2 to be independent of \theta. The prior PDF is a special case of the inverted gamma PDF. Then, from (10.38) we have

p(x|\theta) = \int_0^\infty \frac{1}{(2\pi)^{N/2}\det^{1/2}[\sigma^2 C(\theta)]}\exp\left[-\frac{1}{2}x^T(\sigma^2 C(\theta))^{-1}x\right]\frac{\lambda}{\sigma^4}\exp\left(-\frac{\lambda}{\sigma^2}\right)d\sigma^2

= \int_0^\infty \frac{1}{(2\pi)^{N/2}\sigma^N\det^{1/2}[C(\theta)]}\exp\left[-\frac{1}{2\sigma^2}x^T C^{-1}(\theta)x\right]\frac{\lambda}{\sigma^4}\exp\left(-\frac{\lambda}{\sigma^2}\right)d\sigma^2.

Letting \xi = 1/\sigma^2, we have

p(x|\theta) = \frac{\lambda}{(2\pi)^{N/2}\det^{1/2}[C(\theta)]}\int_0^\infty \xi^{N/2}\exp\left[-\left(\lambda + \frac{1}{2}x^T C^{-1}(\theta)x\right)\xi\right]d\xi.

But

\int_0^\infty x^{m-1}\exp(-ax)\,dx = a^{-m}\Gamma(m)

for a > 0 and m > 0, resulting from the properties of the gamma integral. Hence, the integral can be evaluated to yield

p(x|\theta) = \frac{\lambda\,\Gamma\left(\frac{N}{2} + 1\right)}{(2\pi)^{N/2}\det^{1/2}[C(\theta)]\left(\lambda + \frac{1}{2}x^T C^{-1}(\theta)x\right)^{\frac{N}{2}+1}}.

The posterior PDF may be found by substituting this into (10.36) (at least in theory!).

10.8 Bayesian Estimation for Deterministic Parameters

Although strictly speaking the Bayesian approach can be applied only when \theta is random, in practice it is often used for deterministic parameter estimation. By this we mean that the Bayesian assumptions are made to obtain an estimator, the MMSE estimator for example, and then used as if \theta were nonrandom. Such might be the case if no MVU estimator existed. For instance, we may not be able to find an unbiased estimator that is uniformly better in terms of variance than all others (see Example 2.3). Within the Bayesian framework, however, the MMSE estimator always exists and thus provides an estimator which at least "on the average" (as different values of \theta are chosen) works well. Of course, for a particular \theta it may not perform well, and this is the risk we take by applying it for a deterministic parameter. To illustrate this potential pitfall consider Example 10.1 for which (see (10.11))

\hat{A} = \alpha\bar{x} + (1 - \alpha)\mu_A

and 0 < \alpha < 1. If A is a deterministic parameter, we can evaluate the MSE using (2.6) as

mse(\hat{A}) = \mathrm{var}(\hat{A}) + b^2(A)

where b(A) = E(\hat{A}) - A is the bias. Then,

mse(\hat{A}) = \alpha^2\,\mathrm{var}(\bar{x}) + [\alpha A + (1 - \alpha)\mu_A - A]^2 = \alpha^2\frac{\sigma^2}{N} + (1 - \alpha)^2(A - \mu_A)^2. \qquad (10.40)

[Figure 10.8 Mean square error of MVU and Bayesian estimators for a deterministic DC level in WGN.]

It is seen that the use of the Bayesian estimator reduces the variance, since 0 < \alpha < 1, but may substantially increase the bias component of the MSE. As shown further in Figure 10.8, the Bayesian estimator exhibits less MSE than the MVU estimator \bar{x} only if A is close to the prior mean \mu_A. Otherwise, it is a poorer estimator. It does not have the desirable property of being uniformly less in MSE than all other estimators but only "on the average" or in a Bayesian sense. Hence, only if A is random and mse(\hat{A}) can thus be considered to be the MSE conditioned on a known value of A, do we obtain

Bmse(\hat{A}) = E_A[mse(\hat{A})] = \alpha^2\frac{\sigma^2}{N} + (1 - \alpha)^2 E_A[(A - \mu_A)^2] = \alpha^2\frac{\sigma^2}{N} + (1 - \alpha)^2\sigma_A^2 = \frac{\frac{\sigma^2}{N}\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}} < \frac{\sigma^2}{N} = Bmse(\bar{x}),
or that the Bayesian MSE is less. In effect, the Bayesian MMSE estimator trades off bias for variance in an attempt to reduce the overall MSE. In doing so it benefits from the prior knowledge that A ~ N(\mu_A, \sigma_A^2). This allows it to adjust \alpha so that the overall MSE, on the average, is less. The large MSEs shown in Figure 10.8 for A not near \mu_A are of no consequence to the choice of \alpha since they occur infrequently. Of course, the classical estimation approach will not have this advantage, being required to produce an estimator that has good performance for all A.

A second concern is the choice of the prior PDF. If no prior knowledge is available, as we assume in classical estimation, then we do not want to apply a Bayesian estimator based on a highly concentrated prior PDF. Again considering the same example, if A is deterministic but we apply the Bayesian estimator anyway, we see that

\hat{A} = \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}}\,\bar{x} + \frac{\frac{\sigma^2}{N}}{\sigma_A^2 + \frac{\sigma^2}{N}}\,\mu_A.

The reader is referred to [Box and Tiao 1973] for further details and philosophy.

References

Ash, R., Information Theory, J. Wiley, New York, 1965.

Box, G.E.P., G.C. Tiao, Bayesian Inference in Statistical Analysis, Addison-Wesley, Reading, Mass., 1973.

Gallager, R.G., Information Theory and Reliable Communication, J. Wiley, New York, 1968.

Zacks, S., Parametric Statistical Inference, Pergamon, New York, 1981.

Problems

10.1 In this problem we attempt to apply the Bayesian approach to the estimation of a deterministic parameter. Since the parameter is deterministic, we assign the prior PDF p(\theta) = \delta(\theta - \theta_0), where \theta_0 is the true value. Find the MMSE estimator for this prior PDF and explain your results.

10.2 Two random variables x and y are said to be conditionally independent of each other if the joint conditional PDF factors as

p(x, y|z) = p(x|z)p(y|z)

where z is the conditioning random variable. We observe x[n] = A + w[n] for n = 0, 1, where A, w[0], and w[1] are random variables. If A, w[0], w[1] are all independent, prove that x[0] and x[1] are conditionally independent. The conditioning random variable is A. Are x[0] and x[1] independent unconditionally or is it true that

p(x[0], x[1]) = p(x[0])p(x[1])?

To answer this consider the case where A, w[0], w[1] are independent and each random variable has the PDF N(0, 1).

10.3 The data x[n] for n = 0, 1, \ldots, N - 1 are observed, each sample having the conditional PDF

p(x[n]|\theta) = \begin{cases} \frac{1}{\theta} & 0 \le x[n] \le \theta \\ 0 & \text{otherwise} \end{cases}

and the uniform prior PDF \theta ~ U[0, \beta]. What happens if \beta is very large so that there is little prior knowledge?

10.5 Rederive the MMSE estimator by letting

Bmse(\hat{\theta}) = E_{x,\theta}[(\theta - \hat{\theta})^2] = E_{x,\theta}\{[(\theta - E(\theta|x)) + (E(\theta|x) - \hat{\theta})]^2\}

and evaluating. Hint: Use the result E_{x,\theta}(\theta) = E_x[E_{\theta|x}(\theta)].

10.6 In Example 10.1 modify the data model as follows:

x[n] = A + w[n] \qquad n = 0, 1, \ldots, N - 1

where w[n] is WGN with variance \sigma_1^2 if A \ge 0 and \sigma_2^2 if A < 0. Find the PDF p(x|A). Compare this to p(x; A) for the classical case in which A is deterministic
and w[n] is WGN with variance \sigma^2 for

a. \sigma_1^2 = \sigma_2^2

b. \sigma_1^2 \ne \sigma_2^2.

10.7 Plot \hat{A} as given by (10.9) as a function of \bar{x} if \sigma^2/N = 1 for A_0 = 3 and A_0 = 10. Compare your results to the estimator \hat{A} = \bar{x}. Hint: Note that the numerator can be evaluated in closed form and the denominator is related to the cumulative distribution function of a Gaussian random variable.

[Figure 10.9 Height-weight data: (a) data for planet Earth, (b) data for a faraway planet; scatter plots of weight w (lb) versus height h (ft).]
10.8 A random variable \theta has the PDF p(\theta). It is desired to estimate a realization of \theta without the availability of any data. To do so a MMSE estimator is proposed that minimizes E[(\theta - \hat{\theta})^2], where the expectation is with respect to p(\theta) only. Prove that the MMSE estimator is \hat{\theta} = E(\theta). Apply your results to Example 10.1 to show that the minimum Bayesian MSE is reduced when data are incorporated into the estimator.

10.9 A quality assurance inspector has the job of monitoring the resistance values of manufactured resistors. He does so by choosing a resistor from a batch and measuring its resistance with an ohmmeter. He knows that the ohmmeter is of poor quality and imparts an error to the measurement which he models as a N(0, 1) random variable. Hence, he takes N independent measurements. Also, he knows that the resistors should be 100 ohms. Due to manufacturing tolerances, however, they generally are in error by \epsilon, where \epsilon ~ N(0, 0.011). If the inspector chooses a resistor, how many ohmmeter measurements are necessary to ensure that a MMSE estimator of the resistance R yields the correct resistance to 0.1 ohms "on the average" or as he continues to choose resistors throughout the day? How many measurements would he need if he did not have any prior knowledge about the manufacturing tolerances?

10.10 In this problem we discuss reproducing PDFs. Recall that

p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}

where the denominator does not depend on \theta. If p(\theta) is chosen so that when multiplied by p(x|\theta) we obtain the same form of PDF in \theta, then the posterior PDF p(\theta|x) will have the same form as p(\theta). Such was the case in Example 10.1 for the Gaussian PDF. Now assume that the PDF of x[n] conditioned on \theta is the exponential PDF

p(x[n]|\theta) = \begin{cases} \theta\exp(-\theta x[n]) & x[n] > 0 \\ 0 & x[n] < 0 \end{cases}

where the x[n]'s are conditionally independent (see Problem 10.2). Next, assume the gamma prior PDF

p(\theta) = \begin{cases} \frac{\lambda^\alpha\theta^{\alpha-1}}{\Gamma(\alpha)}\exp(-\lambda\theta) & \theta > 0 \\ 0 & \theta < 0 \end{cases}

where \lambda > 0, \alpha > 0, and find the posterior PDF. Compare it to the prior PDF. Such a PDF, in this case the gamma, is termed a conjugate prior PDF.

10.11 It is desired to estimate a person's weight based on his height. To see if this is feasible, data were taken for N = 100 people to generate the ordered pairs (h, w), where h denotes the height and w the weight. The data that were obtained appear as shown in Figure 10.9a. Explain how you might be able to guess someone's weight based on his height using a MMSE estimator. What modeling assumptions would you have to make about the data? Next, the same experiment was performed for people on a planet far away. The data obtained are shown in Figure 10.9b. What would the MMSE estimator of weight be now?

10.12 If [x y]^T ~ N(0, C), where

C = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}

let g(y) = p(x_0, y) for some x = x_0. Prove that g(y) is maximized for y = \rho x_0. Also, show that E(y|x_0) = \rho x_0. Why are they the same? If \rho = 0, what is the MMSE estimator of y based on x?

10.13 Verify (10.32) and (10.33) by using the matrix inversion lemma. Hint: Verify (10.33) first and use it to verify (10.32).

10.14 The data

x[n] = Ar^n + w[n] \qquad n = 0, 1, \ldots, N - 1

where r is known, w[n] is WGN with variance \sigma^2, and A ~ N(0, \sigma_A^2) independent of w[n], are observed. Find the MMSE estimator of A as well as the minimum Bayesian MSE.
10.15 A measure of the randomness of a random variable \theta is its entropy defined as

H(\theta) = E[-\ln p(\theta)] = -\int p(\theta)\ln p(\theta)\,d\theta.

(This approach is described more fully in [Zacks 1981] and relies upon the standard concept of mutual information in information theory [Ash 1965, Gallager 1968].)

10.16 For Example 10.1 show that the information gained by observing the data is

10.17 In choosing a prior PDF that is noninformative or does not assume any prior knowledge, the argument is made that we should do so to gain maximum information from the data. In this way the data are the principal contributor to our state of knowledge about the unknown parameter. Using the results of Problem 10.15, this approach may be implemented by choosing p(\theta) for which I is maximum. For the Gaussian prior PDF for Example 10.1 how should \mu_A and \sigma_A^2 be chosen so that p(A) is noninformative?

Appendix 10A Derivation of Conditional Gaussian PDF

We have

p(x, y) = \frac{1}{(2\pi)^{\frac{k+l}{2}}\det^{1/2}(C)}\exp\left[-\frac{1}{2}\begin{bmatrix} x - E(x) \\ y - E(y) \end{bmatrix}^T C^{-1}\begin{bmatrix} x - E(x) \\ y - E(y) \end{bmatrix}\right]

and

p(x) = \frac{1}{(2\pi)^{\frac{k}{2}}\det^{1/2}(C_{xx})}\exp\left[-\frac{1}{2}(x - E(x))^T C_{xx}^{-1}(x - E(x))\right].

That p(x) can be written this way or that x is also multivariate Gaussian with the given mean and covariance follows from the fact that x and y are jointly Gaussian. To verify this let

x = \begin{bmatrix} I & 0 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}

and apply the property of a Gaussian vector that a linear transformation also produces a Gaussian vector. We next examine the determinant of the partitioned covariance matrix. Since the determinant of a partitioned matrix may be evaluated as

\det\left(\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}\right) = \det(A_{11})\det(A_{22} - A_{21}A_{11}^{-1}A_{12})
it follows that

det(C)/det(C_xx) = det(C_yy − C_yx C_xx^{-1} C_xy)

and thus

p(y|x) = p(x, y)/p(x) = 1/[(2π)^{l/2} det^{1/2}(C_yy − C_yx C_xx^{-1} C_xy)] exp( −(1/2) Q )

where

Q = [x − E(x); y − E(y)]^T C^{-1} [x − E(x); y − E(y)] − (x − E(x))^T C_xx^{-1} (x − E(x)).

To evaluate Q we use the matrix inversion formula for a partitioned symmetric matrix

(A_11 − A_12 A_22^{-1} A_21)^{-1} = A_11^{-1} + A_11^{-1} A_12 (A_22 − A_21 A_11^{-1} A_12)^{-1} A_21 A_11^{-1}

so that

C^{-1} = [ C_xx^{-1} + C_xx^{-1} C_xy B^{-1} C_yx C_xx^{-1}     −C_xx^{-1} C_xy B^{-1}
           −B^{-1} C_yx C_xx^{-1}                                B^{-1} ]

where

B = C_yy − C_yx C_xx^{-1} C_xy.

The inverse can be written in factored form as

C^{-1} = [ I  −C_xx^{-1} C_xy ; 0  I ] [ C_xx^{-1}  0 ; 0  B^{-1} ] [ I  0 ; −C_yx C_xx^{-1}  I ]

so that upon letting x̃ = x − E(x) and ỹ = y − E(y) we have

Q = x̃^T C_xx^{-1} x̃ + (ỹ − C_yx C_xx^{-1} x̃)^T B^{-1} (ỹ − C_yx C_xx^{-1} x̃) − x̃^T C_xx^{-1} x̃
  = (ỹ − C_yx C_xx^{-1} x̃)^T B^{-1} (ỹ − C_yx C_xx^{-1} x̃)

or finally

p(y|x) = 1/[(2π)^{l/2} det^{1/2}(C_yy − C_yx C_xx^{-1} C_xy)] exp{ −(1/2) [y − (E(y) + C_yx C_xx^{-1}(x − E(x)))]^T [C_yy − C_yx C_xx^{-1} C_xy]^{-1} [y − (E(y) + C_yx C_xx^{-1}(x − E(x)))] }.

The mean of the posterior PDF is therefore given by (10.24), and the covariance by (10.25).
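The conditioning result above is easy to check numerically. The following sketch (function and variable names are ours, not the text's) evaluates the conditional mean and covariance and compares them with the classical scalar result E(y|x) = E(y) + ρ(x − E(x)) and var(y|x) = 1 − ρ² for unit variances.

```python
import numpy as np

# Conditional mean and covariance of jointly Gaussian (x, y); a minimal sketch.
def gaussian_conditional(mu_x, mu_y, Cxx, Cxy, Cyx, Cyy, x):
    """Mean and covariance of y given x for jointly Gaussian (x, y)."""
    mean = mu_y + Cyx @ np.linalg.solve(Cxx, x - mu_x)     # (10.24)
    cov = Cyy - Cyx @ np.linalg.solve(Cxx, Cxy)            # (10.25)
    return mean, cov

rho = 0.8
m, C = gaussian_conditional(np.array([1.0]), np.array([2.0]),
                            np.array([[1.0]]), np.array([[rho]]),
                            np.array([[rho]]), np.array([[1.0]]),
                            np.array([2.0]))
print(m[0], C[0, 0])    # approximately 2.8 and 0.36
```

The same function applies unchanged when x and y are blocks of arbitrary dimension.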
Chapter 11

General Bayesian Estimators
11.1 Introduction
Having introduced the Bayesian approach to parameter estimation in the last chapter,
we now study more general Bayesian estimators and their properties. To do so, the
concept of the Bayesian risk function is discussed. Minimization of this criterion results
in a variety of estimators. The ones that we will concentrate on are the MMSE estimator
and the maximum a posteriori estimator. These two estimators are the principal ones
used in practice. Also, the performance of Bayesian estimators is discussed, leading
to the concept of error ellipses. The use of Bayesian estimators in signal processing
is illustrated by a deconvolution problem. As a special case we consider in detail the
noise filtering problem, in which we use the MMSE estimator in conjunction with the
Bayesian linear model to yield the important Wiener filter. In the succeeding chapter
we will describe more fully the properties and extensions of the Wiener filter.
11.2 Summary
The Bayes risk is defined in (11.1). For a quadratic cost function the mean of the
posterior PDF or the usual MMSE estimator minimizes the risk. A proportional cost
function (11.2) results in an optimal estimator which is the median of the posterior PDF.
For a "hit-or-miss" cost function (11.3) the mode or maximum location of the posterior
PDF is the optimal estimator. The latter is termed the maximum a posteriori (MAP)
estimator. The MMSE estimator for a vector parameter is given in (11.10), and the
corresponding minimum Bayesian MSE by (11.12). Some examples of the computation
of the MAP estimator are included in Section 11.5, followed by the definition for a
vector MAP estimator, (11.23) or (11.24). The vector MAP estimator is not a simple
extension of the scalar MAP estimator but minimizes a slightly different Bayes risk. It is
pointed out that the MMSE and MAP estimators commute over linear transformations,
but this property does not carry over to nonlinear ones. The performance of the MMSE
estimator is characterized by the PDF of its error (11.26). For the Bayesian linear model
this may be determined explicitly from (11.29) and (11.30) and leads to the concept of
an error ellipse as discussed in Example 11.7. Finally, Theorem 11.1 summarizes the MMSE estimator and its performance for the important Bayesian linear model.

11.3 Risk Functions

The Bayes risk R = E[C(ε)], for a cost function C(ε) of the error ε = θ − θ̂, is defined in (11.1) and measures the performance of a given estimator. If C(ε) = ε², then the cost function is quadratic and the Bayes risk is just the MSE. Of course, there is no need to restrict ourselves to quadratic cost functions, although from a mathematical tractability standpoint, they are highly desirable. Other possible cost functions are shown in Figures 11.1b and 11.1c. In Figure 11.1b we have

C(ε) = |ε|.   (11.2)

This cost function penalizes errors proportionally. In Figure 11.1c the "hit-or-miss" cost function is displayed. It assigns no cost for small errors and a cost of 1 for all errors in excess of a threshold error, or

C(ε) = { 0   |ε| < δ
       { 1   |ε| > δ   (11.3)

where δ > 0. If δ is small, we can think of this cost function as assigning the same penalty for any error (a "miss") and no penalty for no error (a "hit"). Note that in all three cases the cost function is symmetric in ε, reflecting the implicit assumption that positive errors are just as bad as negative errors. Of course, in general this need not be the case.

[Figure 11.1. Examples of cost functions: (a) quadratic error, (b) absolute error, (c) hit-or-miss error]

We already have seen that the Bayes risk is minimized for a quadratic cost function by the MMSE estimator θ̂ = E(θ|x). We now determine the optimal estimators for the other cost functions. The Bayes risk R is

R = E[C(ε)] = ∫∫ C(θ − θ̂) p(x, θ) dx dθ = ∫ [ ∫ C(θ − θ̂) p(θ|x) dθ ] p(x) dx.   (11.4)

As we did for the MMSE case in Chapter 10, we will attempt to minimize the inner integral for each x. By holding x fixed, θ̂ becomes a scalar variable. First, considering the absolute error cost function, we have for the inner integral of (11.4)

g(θ̂) = ∫ |θ − θ̂| p(θ|x) dθ = ∫_{−∞}^{θ̂} (θ̂ − θ) p(θ|x) dθ + ∫_{θ̂}^{∞} (θ − θ̂) p(θ|x) dθ.

To differentiate we use Leibniz's rule

d/du ∫_{φ₁(u)}^{φ₂(u)} h(u, v) dv = h(u, φ₂(u)) dφ₂(u)/du − h(u, φ₁(u)) dφ₁(u)/du + ∫_{φ₁(u)}^{φ₂(u)} ∂h(u, v)/∂u dv.

Letting h(θ̂, θ) = (θ̂ − θ) p(θ|x) for the first integral, we have h(θ̂, θ̂) = 0 at the upper limit and dφ₁(u)/du = 0 since the lower limit does not depend on u. Similarly, for the second integral the corresponding terms are zero. Hence, we can differentiate the integrand only to yield

dg(θ̂)/dθ̂ = ∫_{−∞}^{θ̂} p(θ|x) dθ − ∫_{θ̂}^{∞} p(θ|x) dθ = 0

or

∫_{−∞}^{θ̂} p(θ|x) dθ = ∫_{θ̂}^{∞} p(θ|x) dθ.

By definition θ̂ is the median of the posterior PDF or the point for which Pr{θ ≤ θ̂ | x} = 1/2.
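The correspondence between cost functions and optimal estimators can be illustrated numerically. The sketch below uses a skewed Gamma-shaped posterior chosen purely for illustration; it checks that the candidate estimator minimizing each empirical Bayes risk lands near the mean, median, and mode respectively.

```python
import numpy as np

# Illustration of Section 11.3: for a skewed posterior the quadratic, absolute,
# and hit-or-miss Bayes risks are minimized by its mean, median, and mode.
theta = np.linspace(0.001, 20.0, 20001)
dx = theta[1] - theta[0]
post = theta**2 * np.exp(-theta)            # Gamma(3,1)-shaped "posterior"
post /= post.sum() * dx                     # normalize numerically

mean = (theta * post).sum() * dx            # analytically 3
median = theta[np.searchsorted(np.cumsum(post) * dx, 0.5)]   # about 2.674
mode = theta[np.argmax(post)]               # analytically 2

cand = theta[::100]                         # candidate estimators
def bayes_risk(cost):
    return np.array([(cost(theta - t) * post).sum() * dx for t in cand])

quad = bayes_risk(lambda e: e**2)
abse = bayes_risk(np.abs)
hit = bayes_risk(lambda e: (np.abs(e) > 0.05).astype(float))
print(cand[np.argmin(quad)], cand[np.argmin(abse)], cand[np.argmin(hit)])
# the three minimizers fall near the mean, median, and mode respectively
```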
For the "hit-or-miss" cost function we have C(ε) = 1 for ε > δ and ε < −δ, or for θ > θ̂ + δ and θ < θ̂ − δ, so that the inner integral in (11.4) is

∫_{−∞}^{θ̂−δ} p(θ|x) dθ + ∫_{θ̂+δ}^{∞} p(θ|x) dθ

and since ∫_{−∞}^{∞} p(θ|x) dθ = 1, this may be written as

1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ|x) dθ.

For δ arbitrarily small this is minimized, or equivalently ∫_{θ̂−δ}^{θ̂+δ} p(θ|x) dθ is maximized, by choosing θ̂ to correspond to the location of the maximum of p(θ|x). The estimator that minimizes the Bayes risk for the "hit-or-miss" cost function is therefore the mode (location of the maximum) of the posterior PDF. It is termed the maximum a posteriori (MAP) estimator and will be described in more detail later.

In summary, the estimators that minimize the Bayes risk for the cost functions of Figure 11.1 are the mean, median, and mode of the posterior PDF. This is illustrated in Figure 11.2a. For some posterior PDFs these three estimators are identical. A notable example is the Gaussian posterior PDF

p(θ|x) = 1/√(2πσ_{θ|x}²) exp[ −(1/(2σ_{θ|x}²)) (θ − μ_{θ|x})² ].

The mean μ_{θ|x} is identical to the median (due to the symmetry) and the mode, as illustrated in Figure 11.2b. (See also Problem 11.2.)

[Figure 11.2. Estimators for different cost functions: (a) a posterior PDF whose mode, median, and mean differ; (b) a Gaussian posterior PDF, for which mean = median = mode]

11.4 Minimum Mean Square Error Estimators

In Chapter 10 the MMSE estimator was determined to be E(θ|x) or the mean of the posterior PDF. For this reason it is also commonly referred to as the conditional mean estimator. We continue our discussion of this important estimator by first extending it to the vector parameter case and then studying some of its properties.

If θ is a vector parameter of dimension p × 1, then to estimate θ₁, for example, we may view the remaining parameters as nuisance parameters (see Chapter 10). If p(x|θ) is the conditional PDF of the data and p(θ) the prior PDF of the vector parameter, we may obtain the posterior PDF for θ₁ as

p(θ₁|x) = ∫ ⋯ ∫ p(θ|x) dθ₂ ⋯ dθ_p   (11.5)

where

p(θ|x) = p(x|θ) p(θ) / ∫ p(x|θ) p(θ) dθ.   (11.6)

Then, by the same reasoning as in Chapter 10 we have
θ̂₁ = E(θ₁|x) = ∫ θ₁ p(θ₁|x) dθ₁   (11.7)
    = ∫ θ₁ [ ∫ ⋯ ∫ p(θ|x) dθ₂ ⋯ dθ_p ] dθ₁   (11.8)
    = ∫ θ₁ p(θ|x) dθ.

Repeating this for each component, we have

θ̂ = [ ∫ θ₁ p(θ|x) dθ ; ∫ θ₂ p(θ|x) dθ ; ⋯ ; ∫ θ_p p(θ|x) dθ ] = ∫ θ p(θ|x) dθ   (11.9)

or

θ̂ = E(θ|x)   (11.10)

where the expectation is with respect to the posterior PDF of the vector parameter or p(θ|x). Note that the vector MMSE estimator E(θ|x) minimizes the MSE for each component of the unknown vector parameter, or [θ̂]_i = [E(θ|x)]_i minimizes E[(θ_i − θ̂_i)²]. This follows from the derivation.

As discussed in Chapter 10, the minimum Bayesian MSE for a scalar parameter is the posterior PDF variance when averaged over the PDF of x (see (10.13)). This is

Bmse(θ̂₁) = ∫ var(θ₁|x) p(x) dx.   (11.11)

The inner integral in (11.11) is the variance of θ₁ for the posterior PDF p(θ|x). This is just the [1,1] element of C_{θ|x}, the covariance matrix of the posterior PDF. Hence, in general we have that the minimum Bayesian MSE is

Bmse(θ̂_i) = ∫ [C_{θ|x}]_{ii} p(x) dx.   (11.12)

Example 11.1 - Bayesian Fourier Analysis

We reconsider Example 4.2, but to simplify the calculations we let M = 1 so that our data model becomes

x[n] = a cos 2πf₀n + b sin 2πf₀n + w[n]   n = 0, 1, …, N − 1

where f₀ is a multiple of 1/N, excepting 0 or 1/2 (for which sin 2πf₀n is identically zero), and w[n] is WGN with variance σ². It is desired to estimate θ = [a b]^T. We depart from the classical model by assuming a, b are random variables with the prior PDF

θ ~ N(0, σ_θ² I)

and θ is independent of w[n]. This type of model is referred to as a Rayleigh fading sinusoid [Van Trees 1968] and is frequently used to represent a sinusoid that has
propagated through a dispersive medium (see also Problem 11.6). To find the MMSE estimator we need to evaluate E(θ|x). The data model is rewritten as

x = Hθ + w

where

H = [ 1                  0
      cos 2πf₀           sin 2πf₀
      ⋮                  ⋮
      cos[2πf₀(N−1)]     sin[2πf₀(N−1)] ]

which is recognized as the Bayesian linear model. From Theorem 10.3 we can obtain the mean as well as the covariance of the posterior PDF. To do so we let μ_θ = 0, C_θ = σ_θ² I, and C_w = σ² I to obtain

θ̂ = E(θ|x) = σ_θ² H^T (H σ_θ² H^T + σ² I)^{-1} x
C_{θ|x} = σ_θ² I − σ_θ² H^T (H σ_θ² H^T + σ² I)^{-1} H σ_θ².

A somewhat more convenient form is given by (10.32) and (10.33) as

E(θ|x) = ( (1/σ_θ²) I + H^T (1/σ²) H )^{-1} H^T (1/σ²) x
C_{θ|x} = ( (1/σ_θ²) I + H^T (1/σ²) H )^{-1}.

Now, because the columns of H are orthogonal (due to the choice of frequency), we have (see Example 4.2)

H^T H = (N/2) I

and therefore

â = [ 1/(1 + 2σ²/(N σ_θ²)) ] (2/N) Σ_{n=0}^{N−1} x[n] cos 2πf₀n
b̂ = [ 1/(1 + 2σ²/(N σ_θ²)) ] (2/N) Σ_{n=0}^{N−1} x[n] sin 2πf₀n.   (11.13)

The results differ from the classical case only in the scale factor, and if σ_θ² ≫ 2σ²/N, the two results are identical. This corresponds to little prior knowledge compared to the data knowledge. The posterior covariance matrix is

C_{θ|x} = [ 1/(1/σ_θ² + N/(2σ²)) ] I

which does not depend on x. Hence, from (11.12)

Bmse(â) = 1/(1/σ_θ² + N/(2σ²))
Bmse(b̂) = 1/(1/σ_θ² + N/(2σ²)).   ◇

It is interesting to note that in the absence of prior knowledge in the Bayesian linear model the MMSE estimator yields the same form as the MVU estimator for the classical linear model. Many fortuitous circumstances enter into making this so, as described in Problem 11.7. To verify this result note that from (10.32)

θ̂ = (C_θ^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} x.

For no prior knowledge C_θ^{-1} → 0, and therefore,

θ̂ → (H^T C_w^{-1} H)^{-1} H^T C_w^{-1} x   (11.14)

which is recognized as the MVU estimator for the general linear model (see Chapter 4).
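The equivalent expressions for the MMSE estimator in Example 11.1 can be verified numerically. A minimal sketch, with illustrative parameter values of our choosing:

```python
import numpy as np

# Sketch of Example 11.1: three equivalent forms of the MMSE estimator of
# theta = [a, b]^T for a Rayleigh fading sinusoid in WGN.
rng = np.random.default_rng(0)
N, sig2, sigtheta2 = 64, 1.0, 2.0
f0 = 4.0/N                                  # a multiple of 1/N (not 0 or 1/2)
n = np.arange(N)
H = np.column_stack((np.cos(2*np.pi*f0*n), np.sin(2*np.pi*f0*n)))
theta = rng.normal(0, np.sqrt(sigtheta2), 2)
x = H @ theta + rng.normal(0, np.sqrt(sig2), N)

# form 1: sigtheta2 H^T (H sigtheta2 H^T + sig2 I)^{-1} x
th1 = sigtheta2 * H.T @ np.linalg.solve(sigtheta2*(H @ H.T) + sig2*np.eye(N), x)
# form 2: (I/sigtheta2 + H^T H/sig2)^{-1} H^T x / sig2
th2 = np.linalg.solve(np.eye(2)/sigtheta2 + H.T @ H/sig2, H.T @ x/sig2)
# closed form using H^T H = (N/2) I
th3 = (1.0/(1.0 + 2*sig2/(N*sigtheta2))) * (2.0/N) * (H.T @ x)
print(np.allclose(th1, th2), np.allclose(th2, th3))   # True True
```

As σ_θ² grows, the shrinkage factor approaches 1 and the estimate approaches the classical (2/N)H^T x.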
The MMSE estimator has several useful properties that will be exploited in our study of Kalman filters in Chapter 13 (see also Problems 11.8 and 11.9). First, it commutes over linear (actually affine) transformations. Assume that we wish to estimate α for

α = Aθ + b.   (11.15)

Then the MMSE estimator of α is

α̂ = E(α|x) = E(Aθ + b|x) = A E(θ|x) + b = A θ̂ + b.   (11.16)

This holds regardless of the joint PDF p(x, θ). A second important property focuses on the MMSE estimator based on two data vectors x₁, x₂. We assume that θ, x₁, x₂ are jointly Gaussian and the data vectors are independent. Letting x = [x₁^T x₂^T]^T, we have from Theorem 10.2

θ̂ = E(θ|x) = E(θ) + C_{θx} C_{xx}^{-1} (x − E(x)).   (11.17)

Since x₁, x₂ are independent,

C_{xx}^{-1} = [ C_{x₁x₁}  0 ; 0  C_{x₂x₂} ]^{-1} = [ C_{x₁x₁}^{-1}  0 ; 0  C_{x₂x₂}^{-1} ]

and with C_{θx} = [ C_{θx₁}  C_{θx₂} ] this yields

θ̂ = E(θ) + C_{θx₁} C_{x₁x₁}^{-1} (x₁ − E(x₁)) + C_{θx₂} C_{x₂x₂}^{-1} (x₂ − E(x₂)).

11.5 Maximum A Posteriori Estimators

In the MAP estimation approach we choose θ̂ to maximize the posterior PDF or

θ̂ = arg max_θ p(θ|x).

This was shown to minimize the Bayes risk for a "hit-or-miss" cost function. In finding the maximum of p(θ|x) we observe that

p(θ|x) = p(x|θ) p(θ) / p(x)

so an equivalent maximization is of p(x|θ)p(θ). This is reminiscent of the MLE except for the presence of the prior PDF. Hence, the MAP estimator is

θ̂ = arg max_θ p(x|θ) p(θ)   (11.18)

or, equivalently,

θ̂ = arg max_θ [ ln p(x|θ) + ln p(θ) ].   (11.19)

Before extending the MAP estimator to the vector case we give some examples.

Example 11.2 - Exponential PDF

Assume that

p(x[n]|θ) = { θ exp(−θ x[n])   x[n] > 0
            { 0                x[n] < 0

where the x[n]'s are conditionally IID, or p(x|θ) = ∏_{n=0}^{N−1} p(x[n]|θ), and that the prior PDF is

p(θ) = { λ exp(−λθ)   θ > 0
       { 0            θ < 0.

Then, with g(θ) = ln p(x|θ) + ln p(θ) = N ln θ − Nθx̄ + ln λ − λθ, we have

dg(θ)/dθ = N/θ − Nx̄ − λ

and setting it equal to zero yields the MAP estimator

θ̂ = 1/(x̄ + λ/N).
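The closed form just derived can be compared with a brute-force maximization of ln p(x|θ) + ln p(θ); a sketch with illustrative values:

```python
import numpy as np

# Sketch of Example 11.2: closed-form MAP estimator 1/(xbar + lam/N) versus a
# grid maximization of the log posterior (up to a constant).
rng = np.random.default_rng(1)
N, lam, theta_true = 50, 2.0, 0.8
x = rng.exponential(1.0/theta_true, N)
xbar = x.mean()

theta_map = 1.0/(xbar + lam/N)              # closed-form MAP estimator

grid = np.linspace(1e-3, 5.0, 200001)
logpost = N*np.log(grid) - grid*N*xbar + np.log(lam) - lam*grid
theta_grid = grid[np.argmax(logpost)]
print(abs(theta_map - theta_grid) < 1e-3)   # True
```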
Note that as N → ∞, θ̂ → 1/x̄. Also, recall that E(x[n]|θ) = 1/θ (see Example 9.2), so that the MAP estimator is the reciprocal of the sample mean, which estimates E(x[n]|θ), confirming the reasonableness of the MAP estimator. Also, if λ → 0 so that the prior PDF is nearly uniform, we obtain the estimator 1/x̄. In fact, this is the Bayesian MLE (the estimator obtained by maximizing p(x|θ)) since as λ → 0 we have the situation in Figure 11.3 in which the conditional PDF dominates the prior PDF. The maximum of g is then unaffected by the prior PDF.   ◇

Example 11.3 - DC Level in WGN - Uniform Prior PDF

Recall the introductory example in Chapter 10. There we discussed the MMSE estimator of A for a DC level in WGN with a uniform prior PDF. The MMSE estimator as given by (10.9) could not be obtained in explicit form due to the need to evaluate the integrals. The posterior PDF was given as

p(A|x) = { [1/√(2πσ²/N)] exp[ −(1/(2σ²/N))(A − x̄)² ] / ∫_{−A₀}^{A₀} [1/√(2πσ²/N)] exp[ −(1/(2σ²/N))(u − x̄)² ] du    |A| ≤ A₀
         { 0    |A| > A₀.   (11.20)

This is a Gaussian PDF truncated to [−A₀, A₀], so its maximum is located at x̄ when |x̄| ≤ A₀ and at the nearer endpoint otherwise. The MAP estimator is therefore

Â = { −A₀   x̄ < −A₀
    { x̄     |x̄| ≤ A₀
    { A₀    x̄ > A₀

as illustrated in Figure 11.4. The MAP estimator is easily found, as opposed to the difficulties encountered in determining the MLE (see Section 7.7).   ◇

[Figure 11.4. Posterior PDF p(A|x) and location of its maximum: (a) |x̄| ≤ A₀, (b) x̄ > A₀, (c) x̄ < −A₀]
To extend the MAP estimator to the vector parameter case in which the posterior PDF is now p(θ|x), we again employ the marginal posterior PDF

p(θ_i|x) = ∫ ⋯ ∫ p(θ|x) dθ₁ ⋯ dθ_{i−1} dθ_{i+1} ⋯ dθ_p   (11.21)

so that

θ̂₁ = arg max_{θ₁} p(θ₁|x)

or in general

θ̂_i = arg max_{θ_i} p(θ_i|x)   i = 1, 2, …, p.   (11.22)

This estimator minimizes the average "hit-or-miss" cost function

R_i = E[C(θ_i − θ̂_i)]

for each i, where the expectation is over p(x, θ_i).

One of the advantages of the MAP estimator for a scalar parameter is that to numerically determine it we need only maximize p(x|θ)p(θ). No integration is required. This desirable property of the MAP estimator for a scalar parameter does not carry over to the vector parameter case due to the need to obtain p(θ_i|x) as per (11.21). However, we might propose the following vector MAP estimator

θ̂ = arg max_θ p(θ|x)   (11.23)

in which the posterior PDF for the vector parameter θ is maximized to find the estimator. Now we no longer need to determine the marginal PDFs, eliminating the integration steps, since, equivalently,

θ̂ = arg max_θ p(x|θ) p(θ).   (11.24)

That this estimator is not in general the same as (11.22) is illustrated by the example shown in Figure 11.5. There the posterior PDF p(θ₁, θ₂|x) is constant and equal to 1/6 on the rectangular regions shown in Figure 11.5a, so the vector MAP estimator is any point in those regions. The marginal posterior PDF p(θ₂|x), obtained by integrating p(θ₁, θ₂|x) over θ₁, is shown in Figure 11.5b. Clearly, the scalar MAP estimator is any value 1 < θ̂₂ < 2, which differs from the vector MAP estimator. It can be shown, however, that the vector MAP estimator does indeed minimize a Bayes risk, although a different one than the "hit-or-miss" cost function (see Problem 11.11). We will henceforth refer to the vector MAP estimator as just the MAP estimator.

[Figure 11.5. Comparison of scalar and vector MAP estimators: (a) posterior PDF p(θ₁, θ₂|x), (b) marginal posterior PDF p(θ₂|x)]

Example 11.4 - DC Level in WGN - Unknown Variance

We observe

x[n] = A + w[n]   n = 0, 1, …, N − 1.

In contrast to the usual example in which only A is to be estimated, we now assume the variance σ² of the WGN w[n] is also unknown. The vector parameter is θ = [A σ²]^T. We assume the conditional PDF

p(x|A, σ²) = [1/(2πσ²)^{N/2}] exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ]

together with the prior PDFs A|σ² ~ N(μ_A, ασ²), so that σ_A² = ασ², and

p(σ²) = (λ/σ⁴) exp(−λ/σ²)   σ² > 0

for λ > 0. The MAP estimator is found by maximizing

g(A, σ²) = p(x|A, σ²) p(A, σ²) = p(x|A, σ²) p(A|σ²) p(σ²)
over A and σ². To find the value of A that maximizes g we can equivalently maximize

h(A) = p(x|A, σ²) p(A|σ²)

since p(σ²) does not depend on A. Maximizing h(A) over A is equivalent to minimizing the exponent

Q(A) = Σ_{n=0}^{N−1} (x[n] − A)² + (1/α)(A − μ_A)²

so that, completing the square (or appealing to the Gaussian posterior of A for known σ², which has mean μ_{A|x} and variance σ_{A|x}² = 1/(N/σ² + 1/(ασ²))),

Â = μ_{A|x} = [ (N/σ²) x̄ + μ_A/(ασ²) ] / [ N/σ² + 1/(ασ²) ] = (N x̄ + μ_A/α) / (N + 1/α).

Since this does not depend on σ², it is the MAP estimator of A. We next find the MAP estimator for σ² by maximizing g(Â, σ²) = h(Â) p(σ²). First, note that

h(Â) ∝ (1/σ²)^{(N+1)/2} exp[ −Q(Â)/(2σ²) ]

where

Q(Â) = Σ_{n=0}^{N−1} x²[n] + μ_A²/α − Â²(N + 1/α).

Therefore, we must maximize over σ²

g(Â, σ²) ∝ (1/σ²)^{(N+5)/2} exp[ −(Q(Â) + 2λ)/(2σ²) ]

and differentiating the logarithm with respect to σ² and setting it to zero yields

σ̂² = (Q(Â) + 2λ)/(N + 5) = [1/(N + 5)] [ Σ_{n=0}^{N−1} x²[n] + μ_A²/α − Â²(N + 1/α) + 2λ ].

The MAP estimator is therefore θ̂ = [Â σ̂²]^T, where the second component may also be written as

σ̂² = [N/(N + 5)] [ (1/N) Σ_{n=0}^{N−1} x²[n] − Â² ] + [1/((N + 5)α)] (μ_A² − Â²) + 2λ/(N + 5).

As expected, if N → ∞ so that the conditional or data PDF dominates the prior PDF, we have

Â → x̄
σ̂² → (1/N) Σ_{n=0}^{N−1} x²[n] − x̄² = (1/N) Σ_{n=0}^{N−1} (x[n] − x̄)²

which is the Bayesian MLE. Recall that the Bayesian MLE is the value of θ that maximizes p(x|θ).   ◇
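The closed forms of Example 11.4 can be checked against a grid maximization of ln g(A, σ²). The sketch below assumes the priors stated above, A|σ² ~ N(μ_A, ασ²) and p(σ²) = (λ/σ⁴)exp(−λ/σ²); all numerical values are illustrative.

```python
import numpy as np

# Numerical check of Example 11.4: closed-form MAP of (A, sigma2) versus a
# brute-force maximization of the log posterior (up to a constant).
rng = np.random.default_rng(2)
N, muA, alpha, lam = 20, 1.0, 0.5, 1.0
x = rng.normal(1.2, 1.0, N)
xbar, sx2 = x.mean(), (x**2).sum()

A_map = (N*xbar + muA/alpha) / (N + 1.0/alpha)
s2_map = (sx2 + muA**2/alpha - A_map**2*(N + 1.0/alpha) + 2*lam) / (N + 5)

def logpost(A, s2):                         # log g(A, sigma2) up to a constant
    q = ((x - A[..., None])**2).sum(axis=-1)
    return -(N + 5)/2*np.log(s2) - q/(2*s2) - (A - muA)**2/(2*alpha*s2) - lam/s2

As = np.linspace(A_map - 1, A_map + 1, 801)
Ss = np.linspace(max(s2_map - 1, 0.05), s2_map + 1, 801)
L = logpost(As[:, None], Ss[None, :])
i, j = np.unravel_index(np.argmax(L), L.shape)
print(abs(As[i] - A_map) < 5e-3, abs(Ss[j] - s2_map) < 5e-3)   # True True
```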
It is true in general that as N → ∞, the MAP estimator becomes the Bayesian MLE. To find the MAP estimator we must maximize p(x|θ)p(θ). If the prior p(θ) is uniform over the range of θ for which p(x|θ) is essentially nonzero (as will be the case if N → ∞ so that the data PDF dominates the prior PDF), then the maximization will be equivalent to a maximization of p(x|θ). If p(x|θ) has the same form as the PDF family p(x; θ) (as it does for the previous example), then the Bayesian MLE and the classical MLE will have the same form. The reader is cautioned, however, that the estimators are inherently different due to the contrasting underlying experiments.

Some properties of the MAP estimator are of importance. First, if the posterior PDF is Gaussian, as in the Bayesian linear model for example, the mode or peak location is identical to the mean. Hence, the MAP estimator is identical to the MMSE estimator if x and θ are jointly Gaussian. Second, the invariance property encountered in maximum likelihood theory does not hold for the MAP estimator. The next example illustrates this.

Example 11.5 - Exponential PDF

In Example 11.2 we found the MAP estimator of θ to be θ̂ = 1/(x̄ + λ/N). Consider now the MAP estimator of the transformed parameter α = 1/θ. If the invariance property held, it would be given by

α̂ = 1/θ̂ = x̄ + λ/N.   (11.25)

We now show that this is not true. As before

p(x[n]|θ) = { θ exp(−θ x[n])   x[n] > 0
            { 0                x[n] < 0.

The conditional PDF based on observing α is

p(x[n]|α) = { (1/α) exp(−x[n]/α)   x[n] > 0
            { 0                    x[n] < 0

since knowing α is equivalent to knowing θ. The prior PDF

p_θ(θ) = { λ exp(−λθ)   θ > 0
         { 0            θ < 0

cannot be transformed to p(α) by just letting θ = 1/α. This is because θ is a random variable, not a deterministic parameter. The PDF of α must account for the derivative of the transformation or

p_α(α) = p_θ(1/α) |dθ/dα| = { (λ/α²) exp(−λ/α)   α > 0
                            { 0                  α < 0.

Otherwise, the prior PDF would not integrate to 1. Completing the problem

g(α) = ln p(x|α) + ln p(α)
     = ln[ (1/α)^N exp( −(1/α) Σ_{n=0}^{N−1} x[n] ) ] + ln[ (λ/α²) exp(−λ/α) ]
     = −N ln α − Nx̄/α + ln λ − λ/α − 2 ln α
     = −(N + 2) ln α − (Nx̄ + λ)/α + ln λ.

Then

dg(α)/dα = −(N + 2)/α + (Nx̄ + λ)/α²

and setting it equal to zero yields the MAP estimator

α̂ = (Nx̄ + λ)/(N + 2)

which is not the same as (11.25). Hence, it is seen that the MAP estimator does not commute over nonlinear transformations, although it does so for linear transformations (see Problem 11.12).   ◇
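The non-invariance in Example 11.5 is easy to exhibit numerically; a sketch with illustrative values:

```python
import numpy as np

# Sketch of Example 11.5: the MAP estimator does not commute over the
# nonlinear transformation alpha = 1/theta.
rng = np.random.default_rng(3)
N, lam = 25, 2.0
x = rng.exponential(0.5, N)
xbar = x.mean()

alpha_wrong = xbar + lam/N                  # the naive guess (11.25)
alpha_map = (N*xbar + lam) / (N + 2)        # the MAP estimator of alpha

# brute-force check: maximize ln p(x|alpha) + ln p(alpha)
grid = np.linspace(1e-3, 5.0, 200001)
logpost = -(N + 2)*np.log(grid) - (N*xbar + lam)/grid
alpha_grid = grid[np.argmax(logpost)]
print(abs(alpha_map - alpha_grid) < 1e-3, abs(alpha_wrong - alpha_map) > 1e-3)
# True True: the numerical maximum matches (N*xbar+lam)/(N+2), not (11.25)
```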
11.6 Performance Description

In the classical estimation problem we were interested in the mean and variance of an estimator. Assuming the estimator was Gaussian, we could then immediately obtain the PDF. If the PDF was concentrated about the true value of the parameter, then we could say that the estimator performed well. In the case of a random parameter the same approach cannot be used. Now the randomness of the parameter results in a different PDF of the estimator for each realization of θ. We denote this PDF by p(θ̂|θ).
To perform well the estimate should be close to θ for every possible value of θ, or the error

ε = θ − θ̂

should be small. This is illustrated in Figure 11.6, where several conditional PDFs p(θ̂|θ) are shown. Each PDF corresponds to that obtained for a given realization of θ. For θ̂ to be a good estimator, p(θ̂|θ) should be concentrated about θ for all possible θ. This reasoning led to the MMSE estimator, which minimizes E_{x,θ}[(θ − θ̂)²]. For an arbitrary Bayesian estimator it thus makes sense to assess the performance by determining the PDF of the error. This PDF, which now accounts for the randomness of θ, should be concentrated about zero. For the MMSE estimator θ̂ = E(θ|x), the error is

ε = θ − E(θ|x)

and the mean of the error is

E_{x,θ}[θ − E(θ|x)] = E_x[ E_{θ|x}(θ) − E(θ|x) ] = E_x[ E(θ|x) − E(θ|x) ] = 0

so that the error of the MMSE estimator is on the average (with respect to p(x, θ)) zero. The variance of the error for the MMSE estimator is

var(ε) = E_{x,θ}(ε²) = Bmse(θ̂)

which is just the minimum Bayesian MSE. Finally, if ε is Gaussian, we have that

ε ~ N(0, Bmse(θ̂)).   (11.26)

[Figure 11.6. Conditional PDFs p(θ̂|θ) for several realizations of the parameter]

As an illustration, recall the DC level in WGN with the Gaussian prior A ~ N(μ_A, σ_A²), for which

Bmse(Â) = 1/(N/σ² + 1/σ_A²).

In this case the error is ε = A − Â. Because Â depends linearly on x, and x and A are jointly Gaussian, ε is Gaussian. As a result, we have from (11.26)

ε ~ N(0, 1/(N/σ² + 1/σ_A²)).

As N → ∞, the PDF collapses about zero and the estimator can be said to be consistent in the Bayesian sense. Consistency means that for a large enough data record Â will always be close to the realization of A, regardless of the realization value (see also Problem 11.10).   ◇

For a vector parameter θ the error can be defined as

ε = θ − θ̂.

As before, if θ̂ is the MMSE estimator, then ε will have a zero mean. Its covariance matrix is

M_θ = E_{x,θ}(ε ε^T)

where the expectation is with respect to the PDF p(x, θ). Note that

[E(θ|x)]_i = E(θ_i|x) = ∫ θ_i p(θ|x) dθ = ∫ θ_i p(θ_i|x) dθ_i

upon integrating with respect to θ₁, …, θ_{i−1}, θ_{i+1}, …, θ_p, so that each component of the vector MMSE estimator is the scalar MMSE estimator of that component. Since for jointly Gaussian x and θ the posterior covariance C_{θ|x} does not depend on x, we have

M_θ = C_{θ|x}.   (11.27)

Specializing to the Bayesian linear model, we have from (10.29) (shortening the notation from C_θθ to C_θ)

M_θ = C_θ − C_θ H^T (H C_θ H^T + C_w)^{-1} H C_θ   (11.28)

or from (10.33)

M_θ = (C_θ^{-1} + H^T C_w^{-1} H)^{-1}.   (11.29)

Finally, it should be observed that for the Bayesian linear model the error ε = θ − θ̂ is Gaussian since from (10.28)

θ̂ − μ_θ = C_θ H^T (H C_θ H^T + C_w)^{-1} (x − H μ_θ)

and thus ε is a linear transformation of x and θ, which are themselves jointly Gaussian. Hence, the error vector for the MMSE estimator of the parameters of a Bayesian linear model is characterized by

ε ~ N(0, M_θ)   (11.30)

where M_θ is given by (11.28) or by (11.29). An example follows.

Example 11.1 - Bayesian Fourier Analysis (continued)

We now again consider Example 11.1. Recall that

C_θ = σ_θ² I,  C_w = σ² I,  H^T H = (N/2) I.

Hence, from (11.29)

M_θ = (1/σ_θ² + N/(2σ²))^{-1} I

and the error ε = [ε_a ε_b]^T has the PDF ε ~ N(0, M_ε) where

M_ε = [ 1/(1/σ_θ² + N/(2σ²))    0
        0                       1/(1/σ_θ² + N/(2σ²)) ].

The error components are seen to be independent and, furthermore,

Bmse(â) = [M_ε]₁₁
Bmse(b̂) = [M_ε]₂₂

in agreement with the results of Example 11.1. Also, for this example the PDF of the error vector has contours of constant probability density that are circular, as shown in Figure 11.7a.

[Figure 11.7. Error ellipses for Bayesian Fourier analysis: (a) contours of constant probability density, (b) error "ellipses" for σ_θ² = 1, σ²/N = 1/2]
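The two forms (11.28) and (11.29) of the minimum MSE matrix can be checked against each other for this example; a sketch with illustrative values:

```python
import numpy as np

# Check that (11.28) and (11.29) agree and that the diagonal of M_theta equals
# the Bmse of Example 11.1 (C_theta = sigtheta2*I, C_w = sig2*I).
N, sig2, sigtheta2 = 32, 1.0, 1.0
f0 = 3.0/N
n = np.arange(N)
H = np.column_stack((np.cos(2*np.pi*f0*n), np.sin(2*np.pi*f0*n)))
Ctheta, Cw = sigtheta2*np.eye(2), sig2*np.eye(N)

M1 = Ctheta - Ctheta @ H.T @ np.linalg.solve(H @ Ctheta @ H.T + Cw, H @ Ctheta)
M2 = np.linalg.inv(np.linalg.inv(Ctheta) + H.T @ np.linalg.solve(Cw, H))
bmse = 1.0/(1.0/sigtheta2 + N/(2*sig2))     # predicted diagonal element

print(np.allclose(M1, M2), np.allclose(np.diag(M1), bmse))   # True True
```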
or concentration ellipse (in this case it is a circle). It is in general an ellipse within which the error vector will lie with probability P. Let

ε^T M_ε^{-1} ε = c².   (11.31)

Then, the probability P that ε will lie within the ellipse described by (11.31) is

P = Pr{ ε^T M_ε^{-1} ε ≤ c² }.

But u = ε^T M_ε^{-1} ε is a χ₂² random variable with PDF (see Problem 11.13)

p(u) = { (1/2) exp(−u/2)   u ≥ 0
       { 0                 u < 0

so that

P = Pr{ u ≤ c² } = ∫₀^{c²} (1/2) exp(−u/2) du = 1 − exp(−c²/2)

or

c² = 2 ln[ 1/(1 − P) ]

describes the error ellipse for a probability P. The error vector will lie within this ellipse with probability P. An example is shown in Figure 11.7b for σ_θ² = 1, σ²/N = 1/2 so that

M_ε = [ 1/2   0
        0     1/2 ]

and thus the ellipse is the circle

ε^T ε = ln[ 1/(1 − P) ].

In general, the contours will be elliptical if the minimum MSE matrix is not a scaled identity matrix (see also Problems 11.14 and 11.15).   ◇

We now summarize our results in a theorem.

Theorem 11.1 (Performance of the MMSE Estimator for the Bayesian Linear Model) If the observed data x can be modeled by the Bayesian linear model of Theorem 10.3, the MMSE estimator is

θ̂ = E(θ) + C_θ H^T (H C_θ H^T + C_w)^{-1} (x − H E(θ))   (11.32)
  = E(θ) + (C_θ^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x − H E(θ)).   (11.33)

The performance of the estimator is measured by the error ε = θ − θ̂, whose PDF is Gaussian with mean zero and covariance matrix

M_θ = E_{x,θ}(ε ε^T)
    = C_θ − C_θ H^T (H C_θ H^T + C_w)^{-1} H C_θ   (11.34)
    = (C_θ^{-1} + H^T C_w^{-1} H)^{-1}.   (11.35)

The error covariance matrix is also the minimum MSE matrix M_θ, whose diagonal elements yield the minimum Bayesian MSE or

Bmse(θ̂_i) = [M_θ]_{ii}.   (11.36)

See Problems 11.16 and 11.17 for an application to line fitting.
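The error-ellipse radius c² = 2 ln[1/(1 − P)] derived above can be checked by Monte Carlo; a sketch using σ_θ² = 1, σ²/N = 1/2 as in Figure 11.7b:

```python
import numpy as np

# Monte Carlo check: for 2-D Gaussian errors, u = e^T M^{-1} e is chi-squared
# with 2 degrees of freedom, so Pr{u <= 2*ln(1/(1-P))} should equal P.
rng = np.random.default_rng(4)
P = 0.9
c2 = 2*np.log(1/(1 - P))
Me = 0.5*np.eye(2)                          # M_e for the example in the text

e = rng.multivariate_normal([0, 0], Me, 200000)
u = (e @ np.linalg.inv(Me) * e).sum(axis=1)     # e^T M^{-1} e for each sample
print(np.mean(u <= c2))                     # close to 0.9
```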
11.7 Signal Processing Example

There are many signal processing problems that fit the Bayesian linear model. We mention but a few. Consider a communication problem in which we transmit a signal s(t) through a channel with impulse response h(t). At the channel output we observe a noise corrupted waveform as shown in Figure 11.8. The problem is to estimate s(t) over the interval 0 ≤ t ≤ T_s. Because the channel will distort and lengthen the signal, we observe x(t) over the longer interval 0 ≤ t ≤ T. Such a problem is sometimes referred to as a deconvolution problem [Haykin 1991]. We wish to deconvolve s(t) from a noise corrupted version of s_o(t) = s(t) * h(t), where * denotes convolution. This problem also arises in seismic processing in which s(t) represents a series of acoustic reflections of an explosively generated signal due to rock inhomogeneities [Robinson and Treitel 1980]. The filter impulse response h(t) models the earth medium. In image processing the same problem arises for two-dimensional signals in which the two-dimensional version of s(t) represents an image, while the two-dimensional version of h(t) models a distortion due to a poor lens, for example [Jain 1989]. To make any progress in this problem we assume that h(t) is known. Otherwise, the problem is one of blind deconvolution, which is much more difficult [Haykin 1991]. We further assume that s(t) is a realization of a random process. This modeling is appropriate for speech, as an example, for which the signal changes with speaker and content. Hence, the Bayesian assumption appears to be a reasonable one. The observed continuous-time data are

x(t) = ∫₀^t h(t − τ) s(τ) dτ + w(t)   0 ≤ t ≤ T.   (11.37)

It is assumed that s(t) is nonzero over the interval [0, T_s], and h(t) is nonzero over the interval [0, T_h], so that the observation interval is chosen to be [0, T], where T = T_s + T_h. We observe the output signal s_o(t) embedded in noise w(t).

[Figure 11.8. Generic deconvolution problem: (a) system model, (b) typical signals s(t) and s_o(t) = s(t) * h(t), (c) typical data waveform x(t)]

In converting the problem to discrete time we will assume that s(t) is essentially bandlimited to B Hz so that the output signal s_o(t) is also bandlimited to B Hz. The continuous-time noise is assumed to be a zero mean WSS Gaussian random process with PSD

P_ww(F) = { N₀/2   |F| < B
          { 0      |F| > B.

We sample x(t) at times t = nΔ, where Δ = 1/(2B), to produce the equivalent discrete-time data (see Appendix 11A)

x[n] = Σ_{k=0}^{n_s−1} h[n − k] s[k] + w[n]

where x[n] = x(nΔ), h[n] = Δh(nΔ), s[n] = s(nΔ), and w[n] = w(nΔ). The discrete-time signal s[n] is nonzero over the interval [0, n_s − 1], and h[n] is nonzero over the interval [0, n_h − 1]. The largest integers less than or equal to T_s/Δ and T_h/Δ are denoted by n_s − 1 and n_h − 1, respectively. Therefore, N = n_s + n_h − 1. Finally, it follows that w[n] is WGN with variance σ² = N₀B. This is because the ACF of w(t) is a sinc function and the samples of w(t) correspond to the zeros of the sinc function (see Example 3.13). If we assume that s(t) is a Gaussian process, then s[n] is a discrete-time Gaussian process. In vector-matrix notation we have

[ x[0]        [ h[0]      0          ⋯   0            [ s[0]
  x[1]     =    h[1]      h[0]       ⋯   0              s[1]        + w   (11.38)
  ⋮             ⋮         ⋮          ⋱   ⋮              ⋮
  x[N−1] ]      h[N−1]    h[N−2]     ⋯   h[N−n_s] ]     s[n_s−1] ]

Note that in H the elements h[n] = 0 for n > n_h − 1. This is exactly in the form of the Bayesian linear model with p = n_s. Also, H is lower triangular due to the causal nature of h(t) and thus h[n]. We need only specify the mean and covariance of θ = s to complete the Bayesian linear model description. If the signal is zero mean (as in speech, for example), we can assume the prior PDF

s ~ N(0, C_s).

An explicit form for the covariance matrix can be found by assuming the signal is WSS, at least over a short time interval (as in speech, for example). Then,

[C_s]_{ij} = r_ss[i − j]

where r_ss[k] is the ACF. If the PSD is known, then the ACF may be found and therefore C_s specified. As a result, the MMSE estimator of s is from (11.32)

ŝ = C_s H^T (H C_s H^T + σ² I)^{-1} x.   (11.39)
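The discrete-time model (11.38) and estimator (11.39) can be sketched directly. The channel, AR(1) signal covariance, and all constants below are illustrative choices of ours:

```python
import numpy as np

# Sketch of the deconvolution model (11.38)-(11.39): build the N x ns
# convolution matrix H from a known h[n], model s[n] with an AR(1) covariance,
# and form the MMSE estimate.
rng = np.random.default_rng(5)
h = np.array([1.0, 0.7, 0.3])               # channel impulse response, nh = 3
ns, nh = 30, 3
N = ns + nh - 1

H = np.zeros((N, ns))
for k in range(ns):                          # column k holds h[n] delayed by k
    H[k:k + nh, k] = h

a1, sigu2, sig2 = -0.9, 1.0, 0.5
k = np.arange(ns)
Cs = sigu2/(1 - a1**2) * (-a1)**np.abs(k[:, None] - k[None, :])  # [Cs]_ij = rss[i-j]

s = rng.multivariate_normal(np.zeros(ns), Cs)        # signal realization
x = H @ s + rng.normal(0, np.sqrt(sig2), N)          # observed data
shat = Cs @ H.T @ np.linalg.solve(H @ Cs @ H.T + sig2*np.eye(N), x)   # (11.39)

M = Cs - Cs @ H.T @ np.linalg.solve(H @ Cs @ H.T + sig2*np.eye(N), H @ Cs)
print(np.trace(M) < np.trace(Cs))            # True: the data reduce uncertainty
```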
An interesting special case occurs when H = I, where I is the n_s × n_s identity matrix. In this case the channel impulse response is h[n] = δ[n], so that the Bayesian linear model becomes

x = θ + w.

In effect the channel is transparent. Note that in the classical linear model we always assumed that H was N × p with N > p. This was important from a practical viewpoint since for H = I the MVU estimator is

θ̂ = x.

In this case there is no averaging because we are estimating as many parameters as observations. In the Bayesian case we can let N = p or even N < p since we have prior knowledge about θ. If N < p, (11.39) still exists, while in classical estimation, the MVU estimator does not. We now assume that H = I. The MMSE estimator of s is

ŝ = C_s (C_s + σ² I)^{-1} x.   (11.40)

Note that the estimator may be written as ŝ = Ax, where A is an n_s × n_s matrix. The A matrix is called the Wiener filter. We will have more to say about Wiener filters in Chapter 12. As an example, for the scalar case in which we estimate s[0] based on x[0] the Wiener filter becomes

ŝ[0] = [ r_ss[0] / (r_ss[0] + σ²) ] x[0] = [ η / (η + 1) ] x[0]

where η = r_ss[0]/σ² is the SNR. For a high SNR we have ŝ[0] ≈ x[0], while for a low SNR ŝ[0] ≈ 0. A more interesting example can be obtained by assuming that s[n] is a realization of an AR(1) process or

s[n] = −a[1] s[n − 1] + u[n]

where u[n] is WGN with variance σ_u² and a[1] is the filter coefficient. The ACF for this process can be shown to be (see Appendix 1)

r_ss[k] = [ σ_u² / (1 − a²[1]) ] (−a[1])^{|k|}

and the PSD is

P_ss(f) = σ_u² / |1 + a[1] exp(−j2πf)|².

For a[1] < 0 the PSD is that of a low-pass process, an example of which is shown in Figure 11.9 for a[1] = −0.95 and σ_u² = 1.

[Figure 11.9. Power spectral density of AR(1) process]
For the same AR parameters a realization of s[n] is shown in Figure 11.10a as the dashed curve. When WGN is added to yield an SNR of 5 dB, the data x[n] become the solid curve shown in Figure 11.10a. The Wiener filter given by (11.40) will smooth the noise fluctuations, as shown in Figure 11.10b by the solid curve. However, the price paid is that the signal will suffer some smoothing as well. This is a typical tradeoff. Finally, as might be expected, it can be shown that the Wiener filter acts as a low-pass filter (see Problem 11.18).

[Figure 11.10. Wiener filtering example: (a) signal and data, (b) signal and signal estimate]

References

Haykin, S., Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, N.J., 1991.
Jain, A.K., Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, N.J., 1989.
Robinson, E.A., S. Treitel, Geophysical Signal Analysis, Prentice-Hall, Englewood Cliffs, N.J., 1980.
Van Trees, H.L., Detection, Estimation, and Modulation Theory, Part I, J. Wiley, New York, 1968.

Problems

11.1 The data x[n], n = 0, 1, …, N − 1 are observed having the PDF

p(x[n]|μ) = [1/√(2πσ²)] exp[ −(1/(2σ²))(x[n] − μ)² ].

The x[n]'s are independent when conditioned on μ. The mean μ has the prior PDF μ ~ N(μ₀, σ_μ²). Find the MMSE and MAP estimators of μ. What happens as σ_μ² → 0 and σ_μ² → ∞?

11.2 For the posterior PDF

p(θ|x) = (ε/√(2π)) exp[ −(1/2)(θ − x)² ] + ((1 − ε)/√(2π)) exp[ −(1/2)(θ + x)² ]

plot the PDF for ε = 1/2 and ε = 3/4. Next, find the MMSE and MAP estimators for the same values of ε.

11.3 For the posterior PDF

p(θ|x) = { exp[−(θ − x)]   θ > x
         { 0               θ < x

find the MMSE and MAP estimators.

11.4 The data x[n] = A + w[n] for n = 0, 1, …, N − 1 are observed. The unknown parameter A is assumed to have the prior PDF

p(A) = { λ exp(−λA)   A > 0
       { 0            A < 0

where λ > 0, and w[n] is WGN with variance σ² and is independent of A. Find the MAP estimator of A.

11.5 If θ and x are jointly Gaussian so that

[θ; x] ~ N( [E(θ); E(x)], [C_θθ  C_θx ; C_xθ  C_xx] )

find the MMSE estimator of θ based on x. Also, determine the minimum Bayesian MSE for each component of θ. Explain what happens if C_θx = 0.

11.6 Consider the data model in Example 11.1 for a single sinusoid in WGN. Rewrite the model as

x[n] = A cos(2πf₀n + φ) + w[n]

where

A = √(a² + b²)
φ = arctan(−b/a).

If θ = [a b]^T ~ N(0, σ_θ² I), show that the PDF of A is Rayleigh, the PDF of φ is U[0, 2π], and that A and φ are independent.

11.7 In the classical general linear model the MVU estimator is also efficient and therefore the maximum likelihood method produces it, or the MVU estimator is found by maximizing

p(x; θ) = 1/[(2π)^{N/2} det^{1/2}(C)] exp[ −(1/2)(x − Hθ)^T C^{-1} (x − Hθ) ].

For the Bayesian linear model the MMSE estimator is identical to the MAP estimator. Argue why in the absence of prior information the MMSE estimator for the Bayesian linear model is identical in form to the MVU estimator for the classical general linear model.

11.8 Consider a parameter θ which changes with time according to the deterministic relation

θ[n] = Aθ[n − 1]   n ≥ 1

where A is a known p × p invertible matrix and θ[0] is an unknown parameter which is modeled as a random vector. Note that once θ[0] is specified, so is θ[n] for n ≥ 1. Prove that the MMSE estimator of θ[n] is

θ̂[n] = A^n θ̂[0]
where θ̂[0] is the MMSE estimator of θ[0] or, equivalently,

$$ \hat{\boldsymbol\theta}[n] = \mathbf{A}\hat{\boldsymbol\theta}[n-1]. $$

11.9 A vehicle starts at an unknown position (x[0], y[0]) and moves at a constant velocity according to

$$ x[n] = x[0] + v_x n $$
$$ y[n] = y[0] + v_y n $$

11.14 If ε is a 2 x 1 random vector and ε ~ N(0, M_ε), plot the error ellipse for P = 0.9 for the following cases:

a. $\mathbf{M}_\epsilon = \begin{bmatrix} \cdot & \cdot \\ \cdot & \cdot \end{bmatrix}$

b. $\mathbf{M}_\epsilon = \begin{bmatrix} \cdot & \cdot \\ \cdot & \cdot \end{bmatrix}$
for i = 0, 1, ..., N-1. Next, extend this result to the case where s[n] is to be estimated for |n| ≤ M based on x[n] for |n| ≤ M, and let M → ∞ to obtain

$$ \sum_{j=-\infty}^{\infty} r_{xx}[i-j]\,\hat s[j] = \sum_{j=-\infty}^{\infty} r_{ss}[i-j]\,x[j] $$

for -∞ < i < ∞. Using Fourier transforms, find the frequency response of the Wiener filter H(f) or

$$ H(f) = \frac{\hat S(f)}{X(f)}. $$

(You may assume that the Fourier transforms of x[n] and s[n] exist.) Explain your results.

Appendix 11A

Conversion of Continuous-Time System to Discrete-Time System

A continuous-time signal s(t) is the input to a linear time invariant system with impulse response h(t). The output s_o(t) is

$$ s_o(t) = \int_{-\infty}^{\infty} h(t - \tau)s(\tau)\,d\tau. $$
We assume that s(t) is bandlimited to B Hz. The output signal must therefore also be bandlimited to B Hz, and hence we can assume that the frequency response of the system or F{h(t)} is also bandlimited to B Hz. As a result, the original system shown in Figure 11A.1a may be replaced by the one shown in Figure 11A.1b, where the operator shown in the dashed box is a sampler and low-pass filter. The sampling function is

$$ p(t) = \sum_{n=-\infty}^{\infty} \delta(t - n\Delta) $$

and the low-pass filter has frequency response

$$ H_{lpf}(F) = \begin{cases} \Delta & |F| < B \\ 0 & |F| > B. \end{cases} $$

For bandlimited signals with bandwidth B the system within the dashed box does not change the signal. This is because the sampling operation creates replicas of the spectrum of s(t), while the low-pass filter retains only the original spectrum scaled by Δ to cancel the 1/Δ factor introduced by sampling. Next, note that the first low-pass filter can be combined with the system since h(t) is also assumed to be bandlimited to B Hz. The result is shown in Figure 11A.1c. To reconstruct s_o(t) we need only the
samples s_o(nΔ). From Figure 11A.1c the output of the sampled-data system is

$$ s_o(t) = \sum_{m=-\infty}^{\infty} \Delta\,s(m\Delta)h(t - m\Delta) $$

so that

$$ s_o(n\Delta) = \sum_{m=-\infty}^{\infty} \Delta h(n\Delta - m\Delta)s(m\Delta) $$

or, letting s_o[n] = s_o(nΔ), h[n] = Δh(nΔ), and s[n] = s(nΔ),

$$ s_o[n] = \sum_{m=-\infty}^{\infty} h[n-m]s[m]. $$

[Figure 11A.1 shows the conversion: the sampled signals Σₙ s[n]δ(t - nΔ) and Σₙ s_o[n]δ(t - nΔ), the sampling function p(t), the discrete-time system h[n] = Δh(nΔ), and (c) the sampled data system.]

The continuous-time output signal is obtained by passing the signal

$$ \sum_{n=-\infty}^{\infty} s_o[n]\delta(t - n\Delta) $$

through the low-pass filter as shown in Figure 11A.1c. In any practical system we will sample the continuous-time signal s_o(t) to produce our data set s_o[n]. Hence, we may replace the continuous-time data s_o(t) by the discrete-time data s_o[n], as shown in Figure 11A.1d, without a loss of information. Note finally that in reality s(t) cannot be bandlimited since it is time-limited. We generally assume, and it is borne out in practice, that the signal is approximately bandlimited. The loss of information is usually negligible.
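The replacement of the continuous-time convolution by the discrete convolution with h[n] = Δh(nΔ) can be illustrated numerically. A minimal sketch, assuming Python with NumPy and using Gaussian pulses as stand-ins for approximately bandlimited signals (the particular pulses and step size are our own choices):

```python
import numpy as np

# For approximately bandlimited s(t) and h(t), the continuous convolution
# sampled at t = n*Delta matches the discrete convolution with
# h[n] = Delta * h(n*Delta).
Delta = 0.05
t = np.arange(-10, 10 + Delta / 2, Delta)   # symmetric grid including t = 0
s = np.exp(-t**2)                           # samples s[n] = s(n*Delta)
h = Delta * np.exp(-4 * t**2)               # discrete equivalent h[n] = Delta*h(n*Delta)

so = np.convolve(s, h, mode="same")         # s_o[n] = sum_m h[n-m] s[m]

# Reference: Riemann sum of the continuous convolution integral at t = 0
n0 = len(t) // 2                            # index of t = 0
ref = np.sum(np.exp(-4 * (t[n0] - t)**2) * np.exp(-t**2)) * Delta
err = abs(so[n0] - ref)
print(err)                                  # agreement at machine-precision level
```

The exact value of the integral at t = 0 is √(π/5), which the discrete sum reproduces to high accuracy because the Gaussians are effectively bandlimited.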
Chapter 12

Linear Bayesian Estimators

12.1 Introduction
The optimal Bayesian estimators discussed in the previous chapter are difficult to determine in closed form, and in practice too computationally intensive to implement. They
involve multidimensional integration for the MMSE estimator and multidimensional
maximization for the MAP estimator. Although under the jointly Gaussian assump-
tion these estimators are easily found, in general, they are not. When we are unable to
make the Gaussian assumption, another approach must be used. To fill this gap we can
choose to retain the MMSE criterion but constrain the estimator to be linear. Then,
an explicit form for the estimator may be determined which depends only on the first
two moments of the PDF. In many ways this approach is analogous to the BLUE in
classical estimation, and some parallels will become apparent. In practice, this class of
estimators, which are generically termed Wiener filters, is extensively utilized.
12.2 Summary
The linear estimator is defined by (12.1), and the corresponding Bayesian MSE by
(12.2). Minimizing the Bayesian MSE results in the linear MMSE (LMMSE) estimator
of (12.6) and the minimum Bayesian MSE of (12.8). The estimator may also be derived
using a vector space viewpoint as described in Section 12.4. This approach leads to
the important orthogonality principle which says that the error of the linear MMSE
estimator must be uncorrelated with the data. The vector LMMSE estimator is given
by (12.20), and the minimum Bayesian MSE by (12.21) and (12.22). The estimator
commutes over linear transformations (12.23) and has an additivity property (12.24).
For a data vector having the Bayesian linear model form (the Bayesian linear model
without the Gaussian assumption), the LMMSE estimator and its performance are
summarized in Theorem 12.1. If desired, the estimator for the Bayesian linear model
form can be implemented sequentially in time using (12.47)-(12.49).
379
12.3 Linear MMSE Estimation

We begin our discussion by assuming a scalar parameter θ is to be estimated based on the data set {x[0], x[1], ..., x[N-1]} or in vector form x = [x[0] x[1] ... x[N-1]]ᵀ. The unknown parameter is modeled as the realization of a random variable. We do not assume any specific form for the joint PDF p(x, θ), but as we shall see shortly, only a knowledge of the first two moments. That θ may be estimated from x is due to the assumed statistical dependence of θ on x as summarized by the joint PDF p(x, θ), and in particular, for a linear estimator we rely on the correlation between θ and x. We now consider the class of all linear (actually affine) estimators of the form

$$ \hat\theta = \sum_{n=0}^{N-1} a_n x[n] + a_N \tag{12.1} $$

and choose the weighting coefficients aₙ's to minimize the Bayesian MSE

$$ \mathrm{Bmse}(\hat\theta) = E\left[(\theta - \hat\theta)^2\right] \tag{12.2} $$

where the expectation is with respect to the PDF p(x, θ). The resultant estimator is termed the linear minimum mean square error (LMMSE) estimator. Note that we have included the a_N coefficient to allow for nonzero means of x and θ. If the means are both zero, then this coefficient may be omitted, as will be shown later.

Before determining the LMMSE estimator we should keep in mind that the estimator will be suboptimal unless the MMSE estimator happens to be linear. Such would be the case, for example, if the Bayesian linear model applied (see Section 10.6). Otherwise, better estimators will exist, although they will be nonlinear (see the introductory example in Section 10.3). Since the LMMSE estimator relies on the correlation between random variables, a parameter uncorrelated with the data cannot be linearly estimated. Consequently, the proposed approach is not always feasible. This is illustrated by the following example. Consider a parameter θ to be estimated based on the single data sample x[0], where x[0] ~ N(0, σ²). If the parameter to be estimated is the power of the x[0] realization or θ = x²[0], then a perfect estimator will be

$$ \hat\theta = x^2[0] $$

since the minimum Bayesian MSE will be zero. This estimator is clearly nonlinear. If, however, we attempt to use a LMMSE estimator or

$$ \hat\theta = a_0 x[0] + a_1 $$

then the optimal weighting coefficients a₀ and a₁ can be found by minimizing

$$ \mathrm{Bmse}(\hat\theta) = E\left[(\theta - \hat\theta)^2\right] = E\left[\left(\theta - a_0 x[0] - a_1\right)^2\right]. $$

We differentiate this with respect to a₀ and a₁ and set the results equal to zero to produce

$$ E\left[\left(\theta - a_0 x[0] - a_1\right)x[0]\right] = 0 $$
$$ E\left(\theta - a_0 x[0] - a_1\right) = 0 $$

or

$$ a_0 E(x^2[0]) + a_1 E(x[0]) = E(\theta x[0]) $$
$$ a_0 E(x[0]) + a_1 = E(\theta). $$

But E(x[0]) = 0 and E(θx[0]) = E(x³[0]) = 0, so that

$$ a_0 = 0 \qquad a_1 = E(\theta) = E(x^2[0]) = \sigma^2. $$

Therefore, the LMMSE estimator is θ̂ = σ² and does not depend on the data. This is because θ and x[0] are uncorrelated. The minimum MSE is

$$ \mathrm{Bmse}(\hat\theta) = E\left[(\theta - \hat\theta)^2\right] = E\left[(x^2[0] - \sigma^2)^2\right] = E(x^4[0]) - 2\sigma^2 E(x^2[0]) + \sigma^4 = 3\sigma^4 - 2\sigma^4 + \sigma^4 = 2\sigma^4 $$

as opposed to a minimum MSE of zero for the nonlinear estimator θ̂ = x²[0]. Clearly, the LMMSE estimator is inappropriate for this problem. Problem 12.1 explores how to modify the LMMSE estimator to make it applicable.
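The conclusions of this example are easy to verify by simulation. A short Monte Carlo sketch (Python with NumPy; the parameter value is our own choice):

```python
import numpy as np

# x[0] ~ N(0, sigma^2) and theta = x[0]^2.  The LMMSE estimator is the
# constant sigma^2 and its Bayesian MSE is 2*sigma^4.
rng = np.random.default_rng(0)
sigma2 = 2.0
x0 = rng.normal(0.0, np.sqrt(sigma2), size=1_000_000)
theta = x0**2

# theta and x[0] are uncorrelated, so the linear estimator ignores the data
print(np.mean(theta * x0))         # approximately 0 since E(x^3[0]) = 0

mse_lmmse = np.mean((theta - sigma2)**2)
print(mse_lmmse, 2 * sigma2**2)    # both approximately 2*sigma^4 = 8
```

The nonlinear estimator θ̂ = x²[0] would achieve zero MSE on the same data, which is the point of the example.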
We now derive the optimal weighting coefficients for use in (12.1). Substituting (12.1) into (12.2), differentiating with respect to a_N, and setting the result equal to zero produces

$$ a_N = E(\theta) - \sum_{n=0}^{N-1} a_n E(x[n]) \tag{12.3} $$

which as asserted earlier is zero if the means are zero. Continuing, we need to minimize

$$ \mathrm{Bmse}(\hat\theta) = E\left\{\left[\sum_{n=0}^{N-1} a_n\left(x[n] - E(x[n])\right) - \left(\theta - E(\theta)\right)\right]^2\right\} $$

over the remaining aₙ's, where a_N has been replaced by (12.3). Letting a = [a₀ a₁ ... a_{N-1}]ᵀ, we have

$$ \begin{aligned} \mathrm{Bmse}(\hat\theta) &= E\left\{\left[\mathbf{a}^T(\mathbf{x} - E(\mathbf{x})) - (\theta - E(\theta))\right]^2\right\} \\ &= E\left[\mathbf{a}^T(\mathbf{x} - E(\mathbf{x}))(\mathbf{x} - E(\mathbf{x}))^T\mathbf{a}\right] - E\left[\mathbf{a}^T(\mathbf{x} - E(\mathbf{x}))(\theta - E(\theta))\right] \\ &\quad - E\left[(\theta - E(\theta))(\mathbf{x} - E(\mathbf{x}))^T\mathbf{a}\right] + E\left[(\theta - E(\theta))^2\right] \\ &= \mathbf{a}^T\mathbf{C}_{xx}\mathbf{a} - \mathbf{a}^T\mathbf{C}_{x\theta} - \mathbf{C}_{\theta x}\mathbf{a} + C_{\theta\theta} \end{aligned} \tag{12.4} $$

where C_xx is the N x N covariance matrix of x, and C_θx is the 1 x N cross-covariance vector having the property that C_θxᵀ = C_xθ, and C_θθ is the variance of θ. Making use of (4.3) we can minimize (12.4) by taking the gradient to yield

$$ \frac{\partial\,\mathrm{Bmse}(\hat\theta)}{\partial\mathbf{a}} = 2\mathbf{C}_{xx}\mathbf{a} - 2\mathbf{C}_{x\theta} $$

which when set to zero results in

$$ \mathbf{a} = \mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta}. \tag{12.5} $$

Using (12.3) and (12.5) in (12.1) produces

$$ \hat\theta = \mathbf{a}^T\mathbf{x} + a_N = \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{x} + E(\theta) - \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}E(\mathbf{x}) $$

or finally the LMMSE estimator is

$$ \hat\theta = E(\theta) + \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\left(\mathbf{x} - E(\mathbf{x})\right). \tag{12.6} $$

Note that it is identical in form to the MMSE estimator for jointly Gaussian x and θ, as can be verified from (10.24). This is because in the Gaussian case the MMSE estimator happens to be linear, and hence our constraint is automatically satisfied. If the means of θ and x are zero, then

$$ \hat\theta = \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{x}. \tag{12.7} $$

The minimum Bayesian MSE is obtained by substituting (12.5) into (12.4) to yield

$$ \begin{aligned} \mathrm{Bmse}(\hat\theta) &= \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{xx}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta} - \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta} - \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta} + C_{\theta\theta} \\ &= \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta} - 2\mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta} + C_{\theta\theta} \end{aligned} $$

or finally

$$ \mathrm{Bmse}(\hat\theta) = C_{\theta\theta} - \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta}. \tag{12.8} $$

Again this is identical to that obtained by substituting (10.25) into (11.12). An example follows.

Example 12.1 - DC Level in WGN with Uniform Prior PDF

Consider the introductory example in Chapter 10. The data model is

$$ x[n] = A + w[n] \qquad n = 0, 1, \ldots, N-1 $$

where A ~ U[-A₀, A₀], w[n] is WGN with variance σ², and A and w[n] are independent. We wish to estimate A. The MMSE estimator cannot be obtained in closed form due to the integration required (see (10.9)). Applying the LMMSE estimator, we first note that E(A) = 0, and hence E(x[n]) = 0. Since E(x) = 0, the covariances are

$$ \begin{aligned} \mathbf{C}_{xx} &= E(\mathbf{x}\mathbf{x}^T) = E\left[(A\mathbf{1} + \mathbf{w})(A\mathbf{1} + \mathbf{w})^T\right] = E(A^2)\mathbf{1}\mathbf{1}^T + \sigma^2\mathbf{I} \\ \mathbf{C}_{\theta x} &= E(A\mathbf{x}^T) = E\left[A(A\mathbf{1} + \mathbf{w})^T\right] = E(A^2)\mathbf{1}^T \end{aligned} $$

where 1 is an N x 1 vector of all ones. Hence, from (12.7)

$$ \hat A = \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{x} = \sigma_A^2\mathbf{1}^T\left(\sigma_A^2\mathbf{1}\mathbf{1}^T + \sigma^2\mathbf{I}\right)^{-1}\mathbf{x} $$

where we have let σ_A² = E(A²). But the form of the estimator is identical to that encountered in Example 10.2 if we let μ_A = 0, so that from (10.31)

$$ \hat A = \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}}\,\bar x. $$

Since σ_A² = E(A²) = (2A₀)²/12 = A₀²/3, the LMMSE estimator of A is

$$ \hat A = \frac{\frac{A_0^2}{3}}{\frac{A_0^2}{3} + \frac{\sigma^2}{N}}\,\bar x. \tag{12.9} $$

As opposed to the original MMSE estimator which required integration, we have obtained the LMMSE estimator in closed form. Also, note that we did not really need to know that A was uniformly distributed but only its mean and variance, or that w[n] was Gaussian but only that it is white and its variance. Likewise, independence of A and w was not required, only that they were uncorrelated. In general, all that is required to determine the LMMSE estimator are the first two moments of p(x, θ) or

$$ \begin{bmatrix} E(\theta) \\ E(\mathbf{x}) \end{bmatrix}, \qquad \begin{bmatrix} C_{\theta\theta} & \mathbf{C}_{\theta x} \\ \mathbf{C}_{x\theta} & \mathbf{C}_{xx} \end{bmatrix}. $$

However, we must realize that the LMMSE of (12.9) will be suboptimal since it has been constrained to be linear. The optimal estimator for this problem is given by (10.9). ◇
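A small numerical check that the matrix form (12.7) reduces to the scalar form of the example (a sketch in Python with NumPy; the parameter values are illustrative):

```python
import numpy as np

# A_hat = sigA2 * 1^T (sigA2*11^T + sig2*I)^{-1} x  should equal
# (sigA2 / (sigA2 + sig2/N)) * xbar, as in Example 12.1.
rng = np.random.default_rng(1)
N, A0, sig2 = 10, 3.0, 0.5
sigA2 = A0**2 / 3.0                       # variance of U[-A0, A0]

A = rng.uniform(-A0, A0)
x = A + rng.normal(0.0, np.sqrt(sig2), N)

ones = np.ones(N)
A_hat_matrix = sigA2 * ones @ np.linalg.solve(
    sigA2 * np.outer(ones, ones) + sig2 * np.eye(N), x)
A_hat_scalar = sigA2 / (sigA2 + sig2 / N) * x.mean()
print(A_hat_matrix, A_hat_scalar)         # identical up to rounding
```

Note that only the prior mean and variance of A enter, in keeping with the remark that the uniform shape of the prior is never actually used.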
12.4 Geometrical Interpretations

In Chapter 8 we discussed a geometrical interpretation of the LSE based on the concept of a vector space. The LMMSE estimator admits a similar interpretation, although now the "vectors" are random variables. (Actually, both vector spaces are special cases of the more general Hilbert space [Luenberger 1969].) This alternative formulation assumes that θ and x are zero mean. If they are not, we can always define the zero mean random variables θ' = θ - E(θ) and x' = x - E(x), and consider estimation of θ' by a linear function of x' (see also Problem 12.5). Now we wish to find the aₙ's so that

$$ \hat\theta = \sum_{n=0}^{N-1} a_n x[n] $$

minimizes

$$ \mathrm{Bmse}(\hat\theta) = E\left[(\theta - \hat\theta)^2\right]. $$

Let us now think of the random variables θ, x[0], x[1], ..., x[N-1] as elements in a vector space as shown symbolically in Figure 12.1. The reader may wish to verify that the properties of a vector space are satisfied, such as vector addition and multiplication by a scalar, etc. Since, as is usually the case, θ cannot be perfectly expressed as a linear combination of the x[n]'s (if it could, then our estimator would be perfect), we picture θ as only partially lying in the subspace spanned by the x[n]'s.

[Figure 12.1: Vector space interpretation of random variables. Figure 12.2: Orthogonal random variables: y cannot be linearly estimated based on x.]

We may define the "length" of each vector x as ||x|| = √E(x²) or the square root of the variance. Longer length vectors are those with larger variances. The zero length vector is the random variable with zero variance or, therefore, the one that is identically zero (actually not random at all). Finally, to complete our description we require the notion of an inner product between two vectors. (Recall that if x, y are Euclidean vectors in R³, then the inner product is (x, y) = xᵀy = ||x|| ||y|| cos α, where α is the angle between the vectors.) It can be shown that an appropriate definition, i.e., one that satisfies the properties of an inner product between the vectors x and y, is (see Problem 12.4)

$$ (x, y) = E(xy) \tag{12.10} $$

consistent with our earlier definition of the length of a vector. Also, we can now define two vectors to be orthogonal if

$$ (x, y) = E(xy) = 0. \tag{12.12} $$

Since the vectors are zero mean this is equivalent to saying that two vectors are orthogonal if and only if they are uncorrelated. (In R³ two Euclidean vectors are orthogonal if the angle between them is α = 90° so that (x, y) = ||x|| ||y|| cos α = 0.) Recalling our discussions from the previous section, this implies that if two vectors are orthogonal, we cannot use one to estimate the other. As shown in Figure 12.2, since there is no component of y along x, we cannot use x to estimate y. Attempting to do so means that we need to find an a so that ŷ = ax minimizes the Bayesian MSE, Bmse(ŷ) = E[(y - ŷ)²]. To find the optimal value of a

$$ \frac{d\,\mathrm{Bmse}(\hat y)}{da} = \frac{d}{da}E\left[(y - ax)^2\right] = \frac{d}{da}\left[E(y^2) - 2aE(xy) + a^2E(x^2)\right] = -2E(xy) + 2aE(x^2) = 0 $$

which yields

$$ a = \frac{E(xy)}{E(x^2)} = 0. $$

The LMMSE estimator of y is just ŷ = 0. Of course, this is a special case of (12.7) where N = 1, θ = y, x[0] = x, and C_θx = 0.

With these ideas in mind we proceed to determine the LMMSE estimator using the vector space viewpoint. This approach is useful for conceptualization of the LMMSE estimation process and will be used later to derive the sequential LMMSE estimator. As before, we assume that
the estimator is of the form

$$ \hat\theta = \sum_{n=0}^{N-1} a_n x[n] $$

and we choose the aₙ's to minimize the MSE

$$ \mathrm{Bmse}(\hat\theta) = E\left[\left(\theta - \sum_{n=0}^{N-1} a_n x[n]\right)^2\right] = \left\|\theta - \sum_{n=0}^{N-1} a_n x[n]\right\|^2. $$

But this means that minimization of the MSE is equivalent to a minimization of the squared length of the error vector ε = θ - θ̂. The error vector is shown in Figure 12.3b for several candidate estimates.

[Figure 12.3: Orthogonality principle for LMMSE estimation, panels (a) and (b).]

Clearly, the length of the error vector is minimized when ε is orthogonal to the subspace spanned by {x[0], x[1], ..., x[N-1]}. Hence, we require

$$ \epsilon \perp x[0], x[1], \ldots, x[N-1] $$

or, using the definition of orthogonality,

$$ E\left[\left(\theta - \sum_{n=0}^{N-1} a_n x[n]\right)x[m]\right] = 0 \qquad m = 0, 1, \ldots, N-1. $$

Carrying out the expectations yields

$$ \sum_{n=0}^{N-1} a_n E(x[m]x[n]) = E(\theta x[m]) \qquad m = 0, 1, \ldots, N-1 $$

or in matrix form

$$ \begin{bmatrix} E(x^2[0]) & \cdots & E(x[0]x[N-1]) \\ \vdots & \ddots & \vdots \\ E(x[N-1]x[0]) & \cdots & E(x^2[N-1]) \end{bmatrix} \begin{bmatrix} a_0 \\ \vdots \\ a_{N-1} \end{bmatrix} = \begin{bmatrix} E(\theta x[0]) \\ \vdots \\ E(\theta x[N-1]) \end{bmatrix}. $$

These are the normal equations. The matrix is recognized as C_xx, and the right-hand vector as C_xθ. Therefore,

$$ \mathbf{C}_{xx}\mathbf{a} = \mathbf{C}_{x\theta} \tag{12.16} $$

and

$$ \mathbf{a} = \mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta}. \tag{12.17} $$

The LMMSE estimator of θ is

$$ \hat\theta = \mathbf{a}^T\mathbf{x} = \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{x} \tag{12.18} $$

in agreement with (12.7). The minimum Bayesian MSE is the squared length of the error vector or

$$ \mathrm{Bmse}(\hat\theta) = \|\epsilon\|^2 = E(\theta^2) - \sum_{n=0}^{N-1} a_n E(x[n]\theta) - \sum_{m=0}^{N-1} a_m E\left[\left(\theta - \sum_{n=0}^{N-1} a_n x[n]\right)x[m]\right] $$

where the last term is zero by the orthogonality principle, so that the minimum Bayesian MSE agrees with (12.8) for zero means.
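The normal equations and the orthogonality principle can be checked numerically at the covariance level. A sketch in Python with NumPy, using a randomly generated joint covariance of [θ, xᵀ]ᵀ (an assumption of ours, for illustration only):

```python
import numpy as np

# With a = Cxx^{-1} Cxtheta, the error correlation
# E[(theta - a^T x) x^T] = Cthetax - a^T Cxx must vanish.
rng = np.random.default_rng(2)
N = 5
G = rng.standard_normal((N + 1, N + 1))
C = G @ G.T                              # random positive definite joint covariance
Ctt, Ctx, Cxx = C[0, 0], C[0, 1:], C[1:, 1:]

a = np.linalg.solve(Cxx, Ctx)            # normal equations Cxx a = Cxtheta
err_corr = Ctx - a @ Cxx                 # orthogonality: should be ~0
bmse = Ctt - Ctx @ a                     # (12.8): Ctheta_theta - Cthetax Cxx^{-1} Cxtheta
print(np.max(np.abs(err_corr)), bmse)
```

The minimum Bayesian MSE computed this way is always nonnegative, being a Schur complement of a positive definite matrix.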
Example 12.2

Assume that x[0] and x[1] are zero mean and uncorrelated with each other. However, they are each correlated with θ. Figure 12.4a illustrates this case. The LMMSE estimator of θ based on x[0] and x[1] is the sum of the projections of θ on x[0] and x[1] as shown in Figure 12.4b or

$$ \hat\theta = \hat\theta_0 + \hat\theta_1 = \frac{(\theta, x[0])}{(x[0], x[0])}x[0] + \frac{(\theta, x[1])}{(x[1], x[1])}x[1]. $$

By the definition of the inner product (12.10) we have

$$ \hat\theta = \frac{E(\theta x[0])}{E(x^2[0])}x[0] + \frac{E(\theta x[1])}{E(x^2[1])}x[1] = \begin{bmatrix} E(\theta x[0]) & E(\theta x[1]) \end{bmatrix} \begin{bmatrix} E(x^2[0]) & 0 \\ 0 & E(x^2[1]) \end{bmatrix}^{-1} \begin{bmatrix} x[0] \\ x[1] \end{bmatrix} = \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{x}. $$

Clearly, the ease with which this result was obtained is due to the orthogonality of x[0], x[1] or, equivalently, the diagonal nature of C_xx. For nonorthogonal data samples the same approach can be used if we first orthogonalize the samples or replace the data by uncorrelated samples that span the same subspace. We will have more to say about this approach in Section 12.6. ◇

12.5 The Vector LMMSE Estimator

For a vector parameter θ = [θ₁ θ₂ ... θ_p]ᵀ each component is estimated by the scalar LMMSE estimator (12.6) as

$$ \hat\theta_i = E(\theta_i) + \mathbf{C}_{\theta_i x}\mathbf{C}_{xx}^{-1}\left(\mathbf{x} - E(\mathbf{x})\right) \qquad i = 1, 2, \ldots, p $$

and the minimum Bayesian MSE is from (12.8)

$$ \mathrm{Bmse}(\hat\theta_i) = C_{\theta_i\theta_i} - \mathbf{C}_{\theta_i x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta_i} \qquad i = 1, 2, \ldots, p. $$

The scalar LMMSE estimators can be combined into a vector estimator as

$$ \hat{\boldsymbol\theta} = \begin{bmatrix} E(\theta_1) \\ \vdots \\ E(\theta_p) \end{bmatrix} + \begin{bmatrix} \mathbf{C}_{\theta_1 x} \\ \vdots \\ \mathbf{C}_{\theta_p x} \end{bmatrix} \mathbf{C}_{xx}^{-1}\left(\mathbf{x} - E(\mathbf{x})\right) = E(\boldsymbol\theta) + \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\left(\mathbf{x} - E(\mathbf{x})\right) \tag{12.20} $$
where now C_θx is a p x N matrix. By a similar approach we find that the Bayesian MSE matrix is (see Problem 12.7)

$$ \mathbf{M}_{\hat\theta} = E\left[(\boldsymbol\theta - \hat{\boldsymbol\theta})(\boldsymbol\theta - \hat{\boldsymbol\theta})^T\right] = \mathbf{C}_{\theta\theta} - \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta} $$

with θ̂ given by (12.20). The second property states that the LMMSE estimator of a sum of unknown parameters is the sum of the individual estimators. Specifically, if we wish to estimate α = θ₁ + θ₂, then

$$ \hat{\boldsymbol\alpha} = \hat{\boldsymbol\theta}_1 + \hat{\boldsymbol\theta}_2 \tag{12.24} $$

where

$$ \hat{\boldsymbol\theta}_1 = E(\boldsymbol\theta_1) + \mathbf{C}_{\theta_1 x}\mathbf{C}_{xx}^{-1}\left(\mathbf{x} - E(\mathbf{x})\right) $$
$$ \hat{\boldsymbol\theta}_2 = E(\boldsymbol\theta_2) + \mathbf{C}_{\theta_2 x}\mathbf{C}_{xx}^{-1}\left(\mathbf{x} - E(\mathbf{x})\right). $$

The proof of these properties is left to the reader as an exercise (see Problem 12.8).

In analogy with the BLUE there is a corresponding Gauss-Markov theorem for the Bayesian case. It asserts that for data having the Bayesian linear model form, i.e., the Bayesian linear model without the Gaussian assumption, an optimal linear estimator exists. Optimality is measured by the Bayesian MSE. The theorem is just the application of our LMMSE estimator to the Bayesian linear model. More specifically, the data are assumed to be

$$ \mathbf{x} = \mathbf{H}\boldsymbol\theta + \mathbf{w} $$

where θ is a random vector to be estimated and has mean E(θ) and covariance C_θθ, H is a known observation matrix, w is a random vector with zero mean and covariance C_w, and θ and w are uncorrelated. Then, the LMMSE estimator of θ is given by (12.20). The performance of the estimator is measured by the error ε = θ - θ̂, whose mean is zero and whose covariance matrix is

$$ \begin{aligned} \mathbf{C}_\epsilon = E_{x,\theta}\left(\boldsymbol\epsilon\boldsymbol\epsilon^T\right) &= \mathbf{C}_{\theta\theta} - \mathbf{C}_{\theta\theta}\mathbf{H}^T\left(\mathbf{H}\mathbf{C}_{\theta\theta}\mathbf{H}^T + \mathbf{C}_w\right)^{-1}\mathbf{H}\mathbf{C}_{\theta\theta} \\ &= \left(\mathbf{C}_{\theta\theta}^{-1} + \mathbf{H}^T\mathbf{C}_w^{-1}\mathbf{H}\right)^{-1}. \end{aligned} \tag{12.28, 12.29} $$

The error covariance matrix is also the minimum MSE matrix M_θ̂ whose diagonal elements yield the minimum Bayesian MSE

$$ \left[\mathbf{M}_{\hat\theta}\right]_{ii} = \left[\mathbf{C}_\epsilon\right]_{ii} = \mathrm{Bmse}(\hat\theta_i). \tag{12.30} $$

These results are identical to those in Theorem 11.1 for the Bayesian linear model except that the error vector is not necessarily Gaussian. An example of the determination of this estimator and its minimum Bayesian MSE has already been given in Section 10.6. The Bayesian Gauss-Markov theorem states that within the class of linear estimators the one that minimizes the Bayesian MSE for each element of θ is given by (12.26) or (12.27). It will not be optimal unless the conditional expectation E(θ|x) happens to be linear. Such was the case for the jointly Gaussian PDF. Although suboptimal, the LMMSE estimator is in practice quite useful, being available in closed form and depending only on the means and covariances.
12.6 Sequential LMMSE Estimation

In Chapter 8 we discussed the sequential LS procedure as the process of updating in time the LSE estimator as new data become available. An analogous procedure can be used for LMMSE estimators. That this is possible follows from the vector space viewpoint. In Example 12.2 the LMMSE estimator was obtained by adding to the old estimate θ̂₀, that based on x[0], the estimate θ̂₁, that based on the new datum x[1]. This was possible because x[0] and x[1] were orthogonal. When they are not, the algebra becomes more tedious, although the approach is similar. Before stating the general results we will illustrate the basic operations involved by considering the DC level in white noise, which has the Bayesian linear model form. The derivation will be purely algebraic. Then we will repeat the derivation but appeal to the vector space approach. The reason for doing so will be to "lay the groundwork" for the Kalman filter in Chapter 13, which will be derived using the same approach. We will assume a zero mean for the DC level A, so that the vector space approach is applicable.

To begin the algebraic derivation we have from Example 10.1 with μ_A = 0 (see (10.11)) the estimator based on {x[0], x[1], ..., x[N-1]} with minimum Bayesian MSE

$$ \mathrm{Bmse}(\hat A[N-1]) = \frac{\sigma_A^2\sigma^2}{N\sigma_A^2 + \sigma^2}. \tag{12.31} $$

Upon reception of the new datum x[N] the estimator becomes

$$ \hat A[N] = \hat A[N-1] + \frac{\sigma_A^2}{(N+1)\sigma_A^2 + \sigma^2}\left(x[N] - \hat A[N-1]\right). \tag{12.32} $$

Similar to the sequential LS estimator we correct the old estimate Â[N-1] by a scaled version of the prediction error x[N] - Â[N-1]. The scaling or gain factor is from (12.32) and (12.31)

$$ K[N] = \frac{\sigma_A^2}{(N+1)\sigma_A^2 + \sigma^2} = \frac{\mathrm{Bmse}(\hat A[N-1])}{\mathrm{Bmse}(\hat A[N-1]) + \sigma^2} \tag{12.33} $$

and decreases to zero as N → ∞, reflecting the increased confidence in the old estimator. We can also update the minimum Bayesian MSE since from (12.31)

$$ \mathrm{Bmse}(\hat A[N]) = \frac{\sigma_A^2\sigma^2}{(N+1)\sigma_A^2 + \sigma^2} = \frac{N\sigma_A^2 + \sigma^2}{(N+1)\sigma_A^2 + \sigma^2}\cdot\frac{\sigma_A^2\sigma^2}{N\sigma_A^2 + \sigma^2} = \left(1 - K[N]\right)\mathrm{Bmse}(\hat A[N-1]). $$

We now use the vector space viewpoint to derive the same results. (The general sequential LMMSE estimator is derived in Appendix 12A using the vector space approach. It is a straightforward extension of the following derivation.)
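The algebraic recursion above is easy to run directly. A sketch in Python with NumPy (the prior and noise variances are our own choices) that checks the closed-form Bmse at every step and the agreement with the batch estimator:

```python
import numpy as np

# Sequential LMMSE for the DC level A ~ (0, sigA2) in white noise of variance sig2:
# K[N] = Bmse[N-1]/(Bmse[N-1] + sig2),  Bmse[N] = (1 - K[N])*Bmse[N-1].
rng = np.random.default_rng(4)
sigA2, sig2, Nmax = 1.0, 0.2, 50
A = rng.normal(0.0, np.sqrt(sigA2))
x = A + rng.normal(0.0, np.sqrt(sig2), Nmax)

A_hat, bmse = 0.0, sigA2            # prior mean E(A) = 0 and prior variance
for n in range(Nmax):
    K = bmse / (bmse + sig2)
    A_hat = A_hat + K * (x[n] - A_hat)
    bmse = (1.0 - K) * bmse
    # closed form after n+1 samples: sigA2*sig2 / ((n+1)*sigA2 + sig2)
    assert abs(bmse - sigA2 * sig2 / ((n + 1) * sigA2 + sig2)) < 1e-12

# the recursion reproduces the batch estimator (sigA2/(sigA2 + sig2/N)) * xbar
print(A_hat, sigA2 / (sigA2 + sig2 / Nmax) * x.mean())
```

The gain shrinks toward zero as data accumulate, exactly the "increased confidence in the old estimator" described in the text.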
Assume that the LMMSE estimator Â[1] is to be found, which is based on the data {x[0], x[1]} as shown in Figure 12.5a. Because x[0] and x[1] are not orthogonal, we cannot simply add the estimate based on x[0] to the estimate based on x[1]. If we did, we would be adding in an extra component along x[0]. However, we can form Â[1] as the sum of Â[0] and a component orthogonal to Â[0] as shown in Figure 12.5b. That component is ΔÂ[1]. To find a vector in the direction of this component recall that the LMMSE estimator has the property that the error is orthogonal to the data. Hence, if we find the LMMSE estimator of x[1] based on x[0], call it x̂[1|0], the error x[1] - x̂[1|0] will be orthogonal to x[0]. This is shown in Figure 12.5c.

[Figure 12.5: Sequential estimation using vector space approach. Panel (b) shows the correction ΔÂ[1]; panel (c) shows the innovation x̃[1] = x[1] - x̂[1|0].]

Since x[0] and x[1] - x̂[1|0] are orthogonal, we can project A onto each vector separately and add the results, so that

$$ \hat A[1] = \hat A[0] + \Delta\hat A[1]. $$

To find the correction term ΔÂ[1] first note that the LMMSE estimator of a random variable y based on x, where both are zero mean, is from (12.7)

$$ \hat y = \frac{E(xy)}{E(x^2)}x. \tag{12.37} $$

Thus, we have that

$$ \hat x[1|0] = \frac{E(x[0]x[1])}{E(x^2[0])}x[0] = \frac{E\left[(A + w[0])(A + w[1])\right]}{E\left[(A + w[0])^2\right]}x[0] = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2}x[0] \tag{12.38} $$

and the error vector x̃[1] = x[1] - x̂[1|0] represents the new information that x[1] contributes to the estimation of A. As such, it is called the innovation. The projection of A along this error vector is the desired correction

$$ \Delta\hat A[1] = \left(A, \frac{\tilde x[1]}{\|\tilde x[1]\|}\right)\frac{\tilde x[1]}{\|\tilde x[1]\|} = \frac{E(A\tilde x[1])}{E(\tilde x^2[1])}\tilde x[1]. $$

If we let K[1] = E(Ax̃[1])/E(x̃²[1]), we have for the LMMSE estimator

$$ \hat A[1] = \hat A[0] + K[1]\left(x[1] - \hat x[1|0]\right). $$

To evaluate x̂[1|0] we note that x[1] = A + w[1]. Hence, from the additive property (12.24) x̂[1|0] = Â[0] + ŵ[1|0]. Since w[1] is uncorrelated with w[0], we have ŵ[1|0] = 0. Finally, then x̂[1|0] = Â[0], and Â[0] is found as

$$ \hat A[0] = \frac{E(Ax[0])}{E(x^2[0])}x[0] = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2}x[0]. $$

Thus,

$$ \hat A[1] = \hat A[0] + K[1]\left(x[1] - \hat A[0]\right). $$

It remains only to determine the gain K[1] for the correction. Since

$$ \tilde x[1] = x[1] - \hat x[1|0] = x[1] - \hat A[0] = x[1] - \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2}x[0] $$

the gain becomes

$$ K[1] = \frac{E\left[A\left(x[1] - \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2}x[0]\right)\right]}{E\left[\left(x[1] - \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2}x[0]\right)^2\right]} $$

which agrees with (12.33) for N = 1. We can continue this procedure to find Â[2], Â[3], .... In general, we

1. find the LMMSE estimator of A based on x[0], yielding Â[0]

2. find the LMMSE estimator of x[1] based on x[0], yielding x̂[1|0]
3. determine the innovation of the new datum x[1], yielding x[1] - x̂[1|0]

4. add to Â[0] the LMMSE estimator of A based on the innovation, yielding Â[1]

5. continue the process.

In essence we are generating a set of uncorrelated or orthogonal random variables, namely, the innovations {x[0], x[1] - x̂[1|0], x[2] - x̂[2|0,1], ...}. This procedure is termed a Gram-Schmidt orthogonalization (see Problem 12.10). To find the LMMSE estimator of A based on {x[0], x[1], ..., x[N-1]} we simply add the individual estimators.

To complete the derivation we must find x̂[N|0, 1, ..., N-1] and the gain K[N]. We first use the additive property (12.24). Since A and w[N] are uncorrelated and zero mean, the LMMSE estimator of x[N] = A + w[N] is

$$ \hat x[N|0, 1, \ldots, N-1] = \hat A[N|0, 1, \ldots, N-1] + \hat w[N|0, 1, \ldots, N-1]. $$

But Â[N|0, 1, ..., N-1] is by definition the LMMSE estimator of the realization of A at time N based on {x[0], x[1], ..., x[N-1]} or Â[N-1] (recalling that A does not change over the observation interval). Also, ŵ[N|0, 1, ..., N-1] is zero since w[N] is uncorrelated with the previous data samples, because A and w[n] are uncorrelated and w[n] is white noise. Thus,

$$ \hat x[N|0, 1, \ldots, N-1] = \hat A[N-1]. \tag{12.42} $$

To find the gain we use (12.40) and (12.42):

$$ K[N] = \frac{E\left[A\left(x[N] - \hat A[N-1]\right)\right]}{E\left[\left(x[N] - \hat A[N-1]\right)^2\right]}. $$

But

$$ E\left[A\left(x[N] - \hat A[N-1]\right)\right] = E\left[\left(A - \hat A[N-1]\right)\left(x[N] - \hat A[N-1]\right)\right] \tag{12.43} $$

since x[N] - Â[N-1] is the innovation of x[N], which is orthogonal to {x[0], x[1], ..., x[N-1]} and hence to Â[N-1] (a linear combination of these data samples). Also, E[w[N](A - Â[N-1])] = 0 as explained previously, so that

$$ E\left[A\left(x[N] - \hat A[N-1]\right)\right] = E\left[\left(A - \hat A[N-1]\right)^2\right] = \mathrm{Bmse}(\hat A[N-1]). \tag{12.44} $$

Also, since w[N] is uncorrelated with A - Â[N-1],

$$ E\left[\left(x[N] - \hat A[N-1]\right)^2\right] = \sigma^2 + \mathrm{Bmse}(\hat A[N-1]). \tag{12.45} $$

The minimum Bayesian MSE is then

$$ \begin{aligned} \mathrm{Bmse}(\hat A[N]) &= E\left[\left(A - \hat A[N-1] - K[N]\left(x[N] - \hat A[N-1]\right)\right)^2\right] \\ &= E\left[\left(A - \hat A[N-1]\right)^2\right] - 2K[N]E\left[\left(A - \hat A[N-1]\right)\left(x[N] - \hat A[N-1]\right)\right] \\ &\quad + K^2[N]E\left[\left(x[N] - \hat A[N-1]\right)^2\right] \\ &= \mathrm{Bmse}(\hat A[N-1]) - 2K[N]\,\mathrm{Bmse}(\hat A[N-1]) + K^2[N]\left(\sigma^2 + \mathrm{Bmse}(\hat A[N-1])\right) \end{aligned} $$

where we have used (12.43)-(12.45). Using (12.46), we have

$$ \mathrm{Bmse}(\hat A[N]) = \left(1 - K[N]\right)\mathrm{Bmse}(\hat A[N-1]). $$

We have now derived the sequential LMMSE estimator equations based on the vector space viewpoint.

In Appendix 12A we generalize the preceding vector space derivation for the sequential vector LMMSE estimator. To do so we must assume the data has the Bayesian linear model form (see Theorem 12.1) and that C_w is a diagonal matrix, so that the w[n]'s are uncorrelated with variance E(w²[n]) = σₙ². The latter is a critical assumption and was used in the previous example. Fortuitously, the set of equations obtained are valid for the case of nonzero means as well (see Appendix 12A). Hence, the equations to follow are the sequential implementation of the vector LMMSE estimator of (12.26)
[Block diagram: the sequential LMMSE estimator processes x[n], using σₙ², h[n], and M[n - 1] to compute the gain, and produces θ̂[n].]

Since no data have been observed at n = -1, then from (12.19) our estimator becomes the constant θ̂ᵢ = a_{i0}. It is easily shown that the LMMSE estimator is just the mean of θᵢ (see also (12.3)) or

$$ \hat{\boldsymbol\theta}[-1] = E(\boldsymbol\theta). $$

As a result, the minimum MSE matrix is

$$ \mathbf{M}[-1] = \mathbf{C}_{\theta\theta}. $$
The 2 x 1 gain vector is, from (12.48),

$$ \mathbf{K}[0] = \frac{\mathbf{M}[-1]\mathbf{h}[0]}{\sigma^2 + \mathbf{h}^T[0]\mathbf{M}[-1]\mathbf{h}[0]} $$

where M[-1] = σ_θ²I. Once the gain vector has been found, θ̂[0] can be computed. Finally, the MSE matrix update is, from (12.49),

$$ \mathbf{M}[0] = \left(\mathbf{I} - \mathbf{K}[0]\mathbf{h}^T[0]\right)\mathbf{M}[-1] $$

and the procedure continues in like manner for n ≥ 1. ◇
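The full recursion can be sketched in code. The correction step (12.47) is not written out in this excerpt; we use the standard form θ̂[n] = θ̂[n-1] + K[n](x[n] - hᵀ[n]θ̂[n-1]), consistent with the scalar DC-level case, and the observation rows h[n] = [1, n] (a line fit) are our own assumed example:

```python
import numpy as np

# Sequential vector LMMSE for x[n] = h^T[n] theta + w[n], white noise variance sig2.
# Gain and MSE-matrix updates follow the (12.48), (12.49) forms quoted in the text.
rng = np.random.default_rng(5)
p, sigT2, sig2 = 2, 1.0, 0.1
theta = rng.normal(0.0, np.sqrt(sigT2), p)   # realization of theta ~ N(0, sigT2*I)

M = sigT2 * np.eye(p)                        # M[-1] = sigT2 * I
theta_hat = np.zeros(p)                      # theta_hat[-1] = E(theta) = 0
H_rows, xs = [], []
for n in range(20):
    h = np.array([1.0, float(n)])
    xn = h @ theta + rng.normal(0.0, np.sqrt(sig2))
    K = M @ h / (sig2 + h @ M @ h)                     # gain vector
    theta_hat = theta_hat + K * (xn - h @ theta_hat)   # estimator correction
    M = (np.eye(p) - np.outer(K, h)) @ M               # MSE matrix update
    H_rows.append(h)
    xs.append(xn)

H, xv = np.array(H_rows), np.array(xs)
# the recursion reproduces the batch (12.29)-form LMMSE solution
batch = np.linalg.solve(np.eye(p) / sigT2 + H.T @ H / sig2, H.T @ xv / sig2)
print(np.max(np.abs(theta_hat - batch)))     # ~0
```

As the text notes, the same equations remain valid for nonzero means, with the prior mean as the initial estimate.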
12.7 Signal Processing Examples - Wiener Filtering

We now examine in detail some of the important applications of the LMMSE estimator. In doing so we will assume that the data {x[0], x[1], ..., x[N-1]} is WSS with zero mean. As such, the N x N covariance matrix C_xx takes the symmetric Toeplitz form

$$ \mathbf{C}_{xx} = \begin{bmatrix} r_{xx}[0] & r_{xx}[1] & \cdots & r_{xx}[N-1] \\ r_{xx}[1] & r_{xx}[0] & \cdots & r_{xx}[N-2] \\ \vdots & \vdots & \ddots & \vdots \\ r_{xx}[N-1] & r_{xx}[N-2] & \cdots & r_{xx}[0] \end{bmatrix} = \mathbf{R}_{xx} \tag{12.50} $$

where r_xx[k] is the ACF of the x[n] process and R_xx denotes the autocorrelation matrix. Furthermore, the parameter θ to be estimated is also assumed to be zero mean. The group of applications that we will describe are generically called Wiener filters. There are three main problems that we will study. They are (see Figure 12.7) the filtering, smoothing, and prediction problems.

[Figure 12.7, "Wiener filtering problem definitions": (a) filtering, (b) smoothing, (c) prediction, (d) interpolation of the data x[n].]
Also,

C_θx = E(s xᵀ) = E(s(s + w)ᵀ) = R_ss

recalling that s and w are uncorrelated. Therefore, the Wiener estimator of the signal is, from (12.51),

ŝ = R_ss (R_ss + R_ww)⁻¹ x.                                              (12.53)

The N × N matrix

W = R_ss (R_ss + R_ww)⁻¹                                                 (12.54)

is referred to as the Wiener smoothing matrix. The corresponding minimum MSE matrix is, from (12.52),

M_ŝ = (I − W) R_ss.                                                      (12.55)

For the single data sample case (N = 1) the smoothing matrix reduces to the scalar

W = r_ss[0] / (r_ss[0] + r_ww[0]) = η / (η + 1)                          (12.56)

where η = r_ss[0]/r_ww[0] is the SNR. For a high SNR so that W → 1, we have ŝ[0] → x[0], while for a low SNR so that W → 0, we have ŝ[0] → 0. The corresponding minimum MSE is, from (12.55), M_ŝ = (1 − W) r_ss[0].

In the filtering problem the Wiener estimator of the signal is ŝ[n] = aᵀx, where a = [a₀ a₁ … a_n]ᵀ as in our original formulation of the scalar LMMSE estimator of (12.1). We may interpret the process of forming the estimator as n increases as a filtering operation if we define a time varying impulse response h⁽ⁿ⁾[k]. Specifically, we let h⁽ⁿ⁾[k] be the response of a filter at time n to an impulse applied k samples before. To make the filtering correspondence we let

h⁽ⁿ⁾[n − k] = a_k,    k = 0, 1, …, n.

Note for future reference that the vector h = [h⁽ⁿ⁾[0] h⁽ⁿ⁾[1] … h⁽ⁿ⁾[n]]ᵀ is just the vector a when flipped upside down. Then,

ŝ[n] = Σ_{k=0}^{n} a_k x[k] = Σ_{k=0}^{n} h⁽ⁿ⁾[n − k] x[k]

or

ŝ[n] = Σ_{k=0}^{n} h⁽ⁿ⁾[k] x[n − k].

The equations to be solved for the impulse response, the Wiener-Hopf filtering equations, are

Σ_{k=0}^{n} h⁽ⁿ⁾[k] r_xx[l − k] = r_ss[l],    l = 0, 1, …, n.

The same set of equations result if we attempt to estimate s[n] based on the present and infinite past, or based on x[m] for m ≤ n. This is termed the infinite Wiener filter. To see why this is so we let

ŝ[n] = Σ_{k=0}^{∞} h[k] x[n − k]

and use the orthogonality principle (12.13). Then, we have

s[n] − ŝ[n] ⊥ {…, x[n − 1], x[n]}

or by using the definition of orthogonality

E[(s[n] − ŝ[n]) x[n − l]] = 0,    l = 0, 1, ….

Hence,

Σ_{k=0}^{∞} h[k] r_xx[l − k] = r_ss[l],    l = 0, 1, ….                  (12.60)

A little thought will convince the reader that the solutions must be identical since the problem of estimating s[n] based on x[m] for 0 ≤ m ≤ n as n → ∞ is really just that of using the present and infinite past to estimate the current sample. The time invariance of the filter makes the solution independent of which sample is to be estimated or independent of n. The solution of (12.60) utilizes spectral factorization and is explored in Problem 12.16. At first glance it might appear that (12.60) could be solved by using Fourier transform techniques since the left-hand side of the equation is a convolution of two sequences. This is not the case, however, since the equations hold only for l ≥ 0. If, indeed, they were valid for l < 0 as well, then the Fourier transform approach would be viable. Such a set of equations arises in the smoothing problem in which s[n] is estimated based on all the data, past, present, and future. Replacing the time varying impulse response h⁽ⁿ⁾[k] by its time invariant version h[k], we have

ŝ[n] = Σ_{k=−∞}^{∞} h[k] x[n − k]

where h[k] is the impulse response of an infinite two-sided time invariant filter. The Wiener-Hopf equations become

Σ_{k=−∞}^{∞} h[k] r_xx[l − k] = r_ss[l],    −∞ < l < ∞                   (12.61)

(see Problem 12.17). The difference from the filtering case is that now the equations must be satisfied for all l, and there is no constraint that h[k] must be causal. Hence, we can use Fourier transform techniques to solve for the impulse response since (12.61) becomes

h[n] ∗ r_xx[n] = r_ss[n]                                                 (12.62)

where ∗ denotes convolution. Letting H(f) be the frequency response of the infinite Wiener smoother, we have upon taking the Fourier transform of (12.62)

H(f) = P_ss(f) / (P_ss(f) + P_ww(f))

since r_xx[n] = r_ss[n] + r_ww[n]. Before leaving this topic it is worthwhile to point out that the Wiener smoother emphasizes portions of the frequency spectrum of the data where the SNR is high and attenuates those where it is low. This is evident if we define the "local SNR" as the SNR in a narrow band of frequencies centered about f as

η(f) = P_ss(f) / P_ww(f).

Then, the optimal filter frequency response becomes

H(f) = η(f) / (η(f) + 1).
406 CHAPTER 12. LINEAR BAYESIAN ESTIMATORS 12.7. SIGNAL PROCESSING EXAMPLES - WIENER FILTERING 407
Clearly, the filter response satisfies 0 < H(f) < 1, and thus the Wiener smoother response is H(f) ≈ 0 when η(f) ≈ 0 (low local SNR) and H(f) ≈ 1 when η(f) → ∞ (high local SNR). The reader may wish to compare these results to those of the Wiener filter given by (12.56). See also Problem 12.18 for further results on the infinite Wiener smoother.
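For concreteness, the smoothing equations (12.53)-(12.55) are easy to exercise numerically. The following is a sketch, not part of the text; it assumes, purely for illustration, an exponential signal ACF r_ss[k] = P a^|k| in white noise of variance σ², with all constants chosen arbitrarily.

```python
import numpy as np

# Sketch of the Wiener smoother (12.53)-(12.55): s_hat = R_ss (R_ss + R_ww)^{-1} x.
# Assumed example: signal ACF r_ss[k] = P * a**|k|, white observation noise.
N, a, P, sigma2 = 20, 0.9, 2.0, 1.0
k = np.arange(N)
Rss = P * a ** np.abs(k[:, None] - k[None, :])   # Toeplitz signal covariance
Rww = sigma2 * np.eye(N)                         # white noise covariance

W = Rss @ np.linalg.inv(Rss + Rww)               # Wiener smoothing matrix (12.54)
Ms = (np.eye(N) - W) @ Rss                       # minimum MSE matrix (12.55)

rng = np.random.default_rng(0)
s = rng.multivariate_normal(np.zeros(N), Rss)    # signal realization
x = s + rng.standard_normal(N) * np.sqrt(sigma2) # noisy data
s_hat = W @ x                                    # Wiener estimate (12.53)

# For N = 1 the matrix collapses to the scalar gain eta/(eta+1) of (12.56).
eta = P / sigma2
W1 = P / (P + sigma2)
```

The diagonal of Ms gives the per-sample minimum MSE, which lies below the prior signal power P, reflecting the gain from smoothing.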
Finally, we examine the prediction problem in which we wish to estimate θ = x[N − 1 + l] for l ≥ 1 based on x = [x[0] x[1] … x[N − 1]]ᵀ. The resulting estimator is termed the l-step linear predictor. We use (12.51) for which

x̂[N − 1 + l] = C_θx C_xx⁻¹ x

where C_xx = R_xx is of dimension N × N and

C_θx = E[x[N − 1 + l] [x[0] x[1] … x[N − 1]]] = [r_xx[N − 1 + l]  r_xx[N − 2 + l]  …  r_xx[l]].

Let the latter vector be denoted by r′ᵀ_xx. Then,

x̂[N − 1 + l] = r′ᵀ_xx R_xx⁻¹ x.

Recalling that

a = R_xx⁻¹ r′_xx                                                         (12.63)

we have

x̂[N − 1 + l] = Σ_{k=0}^{N−1} a_k x[k].

Equivalently, letting h[k] = a_{N−k} so that x̂[N − 1 + l] = Σ_{k=1}^{N} h[k] x[N − k], the equations to be solved are

[ r_xx[0]     r_xx[1]     ⋯  r_xx[N−1] ] [ h[1] ]   [ r_xx[l]       ]
[ r_xx[1]     r_xx[0]     ⋯  r_xx[N−2] ] [ h[2] ] = [ r_xx[l + 1]   ]    (12.65)
[    ⋮           ⋮        ⋱      ⋮     ] [  ⋮   ]   [      ⋮        ]
[ r_xx[N−1]   r_xx[N−2]   ⋯  r_xx[0]  ] [ h[N] ]   [ r_xx[N−1+l]   ]

These are the Wiener-Hopf prediction equations for the l-step linear predictor based on N past samples. A computationally efficient method for solving these equations is the Levinson recursion [Marple 1987]. For the specific case where l = 1, the one-step linear predictor, the values of −h[n] are termed the linear prediction coefficients, which are used extensively in speech modeling [Makhoul 1975]. Also, for l = 1 the resulting equations are identical to the Yule-Walker equations used to solve for the AR filter parameters of an AR(N) process (see Example 7.18 and Appendix 1). The minimum MSE for the l-step linear predictor is, from (12.52),

M_x̂ = r_xx[0] − r′ᵀ_xx R_xx⁻¹ r′_xx                                      (12.66)

or, equivalently,

M_x̂ = r_xx[0] − Σ_{k=0}^{N−1} a_k r_xx[N − 1 + l − k] = r_xx[0] − Σ_{k=0}^{N−1} h[N − k] r_xx[N − 1 + l − k].
h[k] = { −a[1],   k = 1
       { 0,       k = 2, 3, …, N.

Hence, the one-step linear predictor is

x̂[N] = −a[1] x[N − 1]

and depends only on the previous sample. This is not surprising in light of the fact that the AR(1) process satisfies

x[n] = −a[1] x[n − 1] + u[n]

where u[n] is white noise, so that the sample to be predicted satisfies

x[N] = −a[1] x[N − 1] + u[N].

The predictor cannot predict u[N] since it is uncorrelated with the past data samples (recall that since the AR(1) filter is causal, x[n] is a linear combination of {u[n], u[n − 1], …} and thus x[n] is uncorrelated with all future samples of u[n]). The prediction error is x[N] − x̂[N] = u[N], and the minimum MSE is just the driving noise power σ_u². To verify this we have from (12.66), using r_xx[k] = (σ_u²/(1 − a²[1])) (−a[1])^{|k|},

M_x̂ = r_xx[0] − h[1] r_xx[1] = r_xx[0] + a[1] r_xx[1] = (σ_u²/(1 − a²[1]))(1 − a²[1]) = σ_u².

Similar reasoning shows that for the l-step problem h[1] = (−a[1])^l with all other coefficients zero, and therefore the l-step predictor is

x̂[(N − 1) + l] = (−a[1])^l x[N − 1].                                     (12.67)

[Figure 12.8 Linear prediction for realization of AR(1) process]
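The closed-form results above can be checked by solving (12.65) directly. The sketch below is not from the text; the AR(1) parameter and noise power are arbitrary.

```python
import numpy as np

# Numerical check of the AR(1) predictor: solving (12.65) should give
# h = [-a1, 0, ..., 0] for l = 1 (with minimum MSE sigma_u2 from (12.66))
# and h = [(-a1)**l, 0, ..., 0] in general, cf. (12.67).
a1, sigma_u2, N = 0.8, 0.5, 6          # model: x[n] = -a1*x[n-1] + u[n]

def rxx(k):
    """AR(1) ACF: r_xx[k] = sigma_u^2/(1 - a1^2) * (-a1)^|k|."""
    return sigma_u2 / (1.0 - a1 ** 2) * (-a1) ** abs(k)

Rxx = np.array([[rxx(i - j) for j in range(N)] for i in range(N)])

def predictor(l):
    """Solve the Wiener-Hopf prediction equations (12.65) for the l-step weights."""
    rhs = np.array([rxx(l + i) for i in range(N)])
    return np.linalg.solve(Rxx, rhs)    # h[1], ..., h[N]

h1 = predictor(1)                       # expect [-a1, 0, ..., 0]
mmse1 = rxx(0) - h1 @ np.array([rxx(1 + i) for i in range(N)])  # expect sigma_u2
h3 = predictor(3)                       # expect [(-a1)**3, 0, ..., 0]
```

For production use, the Toeplitz structure of R_xx would normally be exploited (the Levinson recursion mentioned above) instead of a general linear solve.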
References

Kay, S., "Some Results in Linear Interpolation Theory," IEEE Trans. Acoust., Speech, Signal Process., Vol. 31, pp. 746-749, June 1983.
Luenberger, D.G., Optimization by Vector Space Methods, J. Wiley, New York, 1969.
Makhoul, J., "Linear Prediction: A Tutorial Review," Proc. IEEE, Vol. 63, pp. 561-580, April 1975.
Marple, S.L., Jr., Digital Spectral Analysis, Prentice-Hall, Englewood Cliffs, N.J., 1987.
Orfanidis, S.J., Optimum Signal Processing, Macmillan, New York, 1985.

Problems

12.1 Consider the quadratic estimator

θ̂ = a x²[0] + b x[0] + c

of a scalar parameter θ based on the single data sample x[0]. Find the coefficients a, b, c that minimize the Bayesian MSE. If x[0] ~ U[−1/2, 1/2], find the LMMSE estimator and the quadratic MMSE estimator if θ = cos 2πx[0]. Also, compare the minimum MSEs.

12.2 Consider the data

x[n] = A rⁿ + w[n],    n = 0, 1, …, N − 1

where A is a parameter to be estimated, r is a known constant, and w[n] is zero mean white noise with variance σ². The parameter A is modeled as a random variable with mean μ_A and variance σ_A² and is independent of w[n]. Find the LMMSE estimator of A and the minimum Bayesian MSE.

12.3 A Gaussian random vector x = [x₁ x₂]ᵀ has zero mean and covariance matrix C_xx. If x₂ is to be linearly estimated based on x₁, find the estimator that minimizes the Bayesian MSE. Also, find the minimum MSE and prove that it is zero if and only if C_xx is singular. Extend your results to show that if the covariance matrix of an N × 1 zero mean Gaussian random vector is not positive definite, then any random variable may be perfectly estimated by a linear combination of the others. Hint: Note that

12.4 An inner product (x, y) between two vectors x and y of a vector space must satisfy the following properties:

a. (x, y) = (y, x)
b. (c₁x₁ + c₂x₂, y) = c₁(x₁, y) + c₂(x₂, y)
c. (x, x) ≥ 0, with equality if and only if x = 0.

Prove that the definition (x, y) = E(xy) for x and y zero mean random variables satisfies these properties.

12.5 If we assume nonzero mean random variables, then a reasonable approach is to define the inner product between x and y as (x, y) = cov(x, y). With this definition x and y are orthogonal if and only if they are uncorrelated. For this "inner product" which of the properties given in Problem 12.4 is violated?

12.6 We observe the data x[n] = s[n] + w[n] for n = 0, 1, …, N − 1, where s[n] and w[n] are zero mean WSS random processes which are uncorrelated with each other. The ACFs are given. Determine the LMMSE estimator of s = [s[0] s[1] … s[N − 1]]ᵀ based on x = [x[0] x[1] … x[N − 1]]ᵀ and the corresponding minimum MSE matrix.

12.7 Derive the Bayesian MSE matrix for the vector LMMSE estimator as given by (12.21), as well as the minimum Bayesian MSE of (12.22). Keep in mind the averaging PDFs implied by the expectation operator.

12.8 Prove the commutative and additive properties of the LMMSE estimator as given by (12.23) and (12.24).

12.9 Derive a sequential LMMSE estimator analogous to (12.34)-(12.36) but for the case where μ_A ≠ 0. Hint: Use an algebraic approach based on the results in Example 10.1.

12.10 Given a set of vectors {x₁, x₂, …, x_n}, the Gram-Schmidt orthogonalization procedure finds a new set of vectors {e₁, e₂, …, e_n} which are orthonormal (orthogonal and having unit length) or (e_i, e_j) = δ_ij. The procedure is

a. e₁ = x₁ / ‖x₁‖
b. z₂ = x₂ − (x₂, e₁)e₁
   e₂ = z₂ / ‖z₂‖
c. and so on, or in general for n ≥ 2

   z_n = x_n − Σ_{i=1}^{n−1} (x_n, e_i) e_i
   e_n = z_n / ‖z_n‖.
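For readers who want to experiment with the procedure in Problem 12.10, a direct transcription (a sketch, not part of the text) using the Euclidean inner product and some arbitrary test vectors:

```python
import numpy as np

def gram_schmidt(xs):
    """Gram-Schmidt orthonormalization of linearly independent vectors."""
    es = []
    for x in xs:
        z = x.astype(float)
        for e in es:
            z = z - (x @ e) * e        # subtract projection (x_n, e_i) e_i
        es.append(z / np.linalg.norm(z))  # normalize: e_n = z_n / ||z_n||
    return es

vecs = [np.array([1.0, 1.0, 0.0]),
        np.array([1.0, 0.0, 1.0]),
        np.array([0.0, 1.0, 1.0])]
E = gram_schmidt(vecs)
G = np.array([[ei @ ej for ej in E] for ei in E])  # Gram matrix, should be I
```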
Give a geometrical interpretation of the procedure. For the Euclidean vectors

12.15 Consider the Wiener smoother for a single data sample as given by (12.56). Rewrite W and M_ŝ as a function of the correlation coefficient

ρ = cov(s[0], x[0]) / √(var(s[0]) var(x[0]))

between s[0] and x[0]. Explain your results.

12.16 In this problem we explore the solution of the Wiener-Hopf filter equations based on the present and infinite past or the solution of

Σ_{k=0}^{∞} h[k] r_xx[l − k] = r_ss[l],    l ≥ 0.

Noting that Q(z) is the z transform of a causal sequence, show that

H(z) = (1/B(z)) [ P_ss(z) / B(z⁻¹) ]₊

where [·]₊ denotes the causal part. For an example of the computations involved in determining the Wiener filter, see [Orfanidis 1985].
12.17 Rederive the infinite Wiener smoother (12.61) by first assuming that

ŝ[n] = Σ_{k=−∞}^{∞} h[k] x[n − k]

and showing that the minimum MSE is

M_ŝ = r_ss[0] − Σ_{k=−∞}^{∞} h[k] r_ss[k].

Then, using Fourier transform techniques, show that this can be rewritten as

M_ŝ = ∫_{−1/2}^{1/2} P_ss(f) (1 − H(f)) df

where

H(f) = P_ss(f) / (P_ss(f) + P_ww(f)).

12.18 Evaluate the Wiener smoother and minimum MSE if

P_ss(f) = { P₀,   |f| ≤ 1/4
          { 0,    1/4 < |f| ≤ 1/2

and explain your results.

12.19 Assume that x[n] is a zero mean WSS random process. We wish to predict x[n] based on {x[n − 1], x[n − 2], …, x[n − N]}. Use the orthogonality principle to derive the LMMSE estimator or predictor. Explain why the equations to be solved are the same as (12.65) for l = 1, i.e., they are independent of n. Also, rederive the minimum MSE (12.66) by again invoking the orthogonality principle.

12.20 Consider an AR(N) process

x[n] = − Σ_{k=1}^{N} a[k] x[n − k] + u[n]

where u[n] is white noise with variance σ_u². Prove that the optimal one-step linear predictor of x[n] is

x̂[n] = − Σ_{k=1}^{N} a[k] x[n − k].

Also, find the minimum MSE. Hint: Compare the equations to be solved to the Yule-Walker equations (see Appendix 1).

Appendix 12A

Derivation of Sequential LMMSE Estimator for the Bayesian Linear Model

We will use a vector space approach to update the ith component of θ̂[n − 1] as follows:

θ̂ᵢ[n] = θ̂ᵢ[n − 1] + Kᵢ[n](x[n] − x̂[n|n − 1])                             (12A.1)

where x̂[n|n − 1] is the LMMSE estimator of x[n] based on {x[0], x[1], …, x[n − 1]}. We now shorten the notation from that used in Section 12.6 where we denoted the same LMMSE estimator as x̂[n|0, 1, …, n − 1]. The motivation for (12A.1) follows by the same rationale as depicted in Figure 12.5 (see also (12.41)). Also, we let x̃[n] = x[n] − x̂[n|n − 1], which is the innovation sequence. We can then form the update for the entire parameter as

θ̂[n] = θ̂[n − 1] + K[n](x[n] − x̂[n|n − 1]).

Before beginning the derivation we state some properties which are needed. The reader should recall that all random variables are assumed to be zero mean.

1. The LMMSE estimator of Aθ is Aθ̂, the commutative property of (12.23).

2. The LMMSE estimator of θ₁ + θ₂ is θ̂₁ + θ̂₂, the additive property of (12.24).

3. Since θ̂ᵢ[n − 1] is a linear combination of the samples {x[0], x[1], …, x[n − 1]}, and the innovation x[n] − x̂[n|n − 1] is uncorrelated with the past data samples, it follows that

E[θ̂ᵢ[n − 1](x[n] − x̂[n|n − 1])] = 0.

4. Since θ and w[n] are uncorrelated as assumed in the Bayesian linear model and θ̂[n − 1] and w[n] are also uncorrelated, it follows that

E[(θ − θ̂[n − 1]) w[n]] = 0.

To see why θ̂[n − 1] and w[n] are uncorrelated note first that θ̂[n − 1] depends linearly on the past data samples or on θ and {w[0], w[1], …, w[n − 1]}. But w[n] is uncorrelated with θ and, by the assumption that C_w is diagonal or that w[n] is a sequence of uncorrelated random variables, w[n] is uncorrelated with the past noise samples.

With these results the derivation simplifies. To begin we note that

Kᵢ[n] = E[θᵢ(x[n] − x̂[n|n − 1])] / E[(x[n] − x̂[n|n − 1])²].              (12A.2)

Evaluating the denominator,

E[(x[n] − x̂[n|n − 1])²] = E[(hᵀ[n]θ + w[n] − hᵀ[n]θ̂[n − 1])²]
                        = E[(hᵀ[n](θ − θ̂[n − 1]) + w[n])²]
                        = hᵀ[n] E[(θ − θ̂[n − 1])(θ − θ̂[n − 1])ᵀ] h[n] + E(w²[n])
                        = hᵀ[n] M[n − 1] h[n] + σₙ²

where we have used property 4. Now, using properties 3 and 4 to evaluate the numerator,

E[θᵢ(x[n] − x̂[n|n − 1])] = E[(θᵢ − θ̂ᵢ[n − 1])(x[n] − x̂[n|n − 1])]
                         = E[(θᵢ − θ̂ᵢ[n − 1])(hᵀ[n]θ + w[n] − hᵀ[n]θ̂[n − 1])]
                         = E[(θᵢ − θ̂ᵢ[n − 1])(hᵀ[n](θ − θ̂[n − 1]))]
                         = E[(θᵢ − θ̂ᵢ[n − 1])(θ − θ̂[n − 1])ᵀ] h[n]       (12A.3)

so that

Kᵢ[n] = E[(θᵢ − θ̂ᵢ[n − 1])(θ − θ̂[n − 1])ᵀ] h[n] / (hᵀ[n] M[n − 1] h[n] + σₙ²)

and therefore

K[n] = M[n − 1] h[n] / (hᵀ[n] M[n − 1] h[n] + σₙ²).                       (12A.4)

To determine the update for the MSE matrix

M[n] = E[(θ − θ̂[n])(θ − θ̂[n])ᵀ]

we note from (12A.4) that

K[n] E[(x[n] − x̂[n|n − 1])²] = M[n − 1] h[n]

and from property 3 and (12A.3)

E[(θ − θ̂[n − 1])(x[n] − x̂[n|n − 1])] = E[θ(x[n] − x̂[n|n − 1])] = M[n − 1] h[n]

so that, expanding M[n] using θ̂[n] = θ̂[n − 1] + K[n]x̃[n] and the results above,

M[n] = M[n − 1] − M[n − 1] h[n] Kᵀ[n] − K[n] hᵀ[n] M[n − 1] + M[n − 1] h[n] Kᵀ[n]
     = (I − K[n] hᵀ[n]) M[n − 1].

We next show that the same equations result if θ and x[n] are not zero mean. Since the sequential implementation of the minimum MSE matrix (12.49) must be identical to (12.28), which does not depend on the means, M[n] likewise must be independent of the means. Also, since M[n] depends on K[n], the gain vector must also be independent of the means. Finally, the estimator update equation (12.47) is valid for zero means, as we have already shown in this appendix. If the means are not zero, we may still apply (12.47) to the data set x[n] − E(x[n]) and view the estimate as that of θ − E(θ). By the commutative property, however, the LMMSE estimator of θ + b for b a constant vector is θ̂ + b (see (12.23)). Thus, (12.47) becomes

θ̂[n] − E(θ) = θ̂[n − 1] − E(θ) + K[n][x[n] − E(x[n]) − hᵀ[n](θ̂[n − 1] − E(θ))]

where θ̂[n] is the LMMSE estimator for nonzero means. Rearranging and canceling terms we have

θ̂[n] = θ̂[n − 1] + K[n][x[n] − hᵀ[n]θ̂[n − 1] − (E(x[n]) − hᵀ[n]E(θ))]

and since E(x[n]) = hᵀ[n]E(θ) for the Bayesian linear model (the noise is zero mean), the term in parentheses vanishes and the update is identical in form to (12.47).
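The derivation above reduces to three update formulas: the gain (12A.4), the estimator update (12.47), and M[n] = (I − K[n]hᵀ[n])M[n − 1]. A minimal sketch (not from the text), run on a made-up measurement model with arbitrary constants:

```python
import numpy as np

def seq_lmmse_update(theta, M, x_n, h_n, sigma2_n):
    """One step of the sequential LMMSE estimator:
    gain (12A.4), innovation-driven estimate update, and MSE matrix update."""
    K = M @ h_n / (h_n @ M @ h_n + sigma2_n)          # gain vector (12A.4)
    theta = theta + K * (x_n - h_n @ theta)           # update by the innovation
    M = (np.eye(len(theta)) - np.outer(K, h_n)) @ M   # M[n] = (I - K h^T) M[n-1]
    return theta, M

# Assumed toy model: x[n] = h^T[n] theta + w[n], prior theta ~ N(0, I), var(w[n]) = 0.1.
rng = np.random.default_rng(1)
theta_true = np.array([1.0, -2.0])
theta_hat, M = np.zeros(2), np.eye(2)
for n in range(200):
    h = rng.standard_normal(2)
    x = h @ theta_true + 0.1 ** 0.5 * rng.standard_normal()
    theta_hat, M = seq_lmmse_update(theta_hat, M, x, h, 0.1)
```

As the derivation predicts, the MSE matrix shrinks as data accumulate and the estimate concentrates around the true parameter.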
Kalman Filters
13.1 Introduction
We now discuss an important generalization of the Wiener filter. The significance of the
extension is in its ability to accommodate vector signals and noises which additionally
may be nonstationary. This is in contrast to the Wiener filter, which is restricted to
stationary scalar signals and noises. This generalization is termed the Kalman filter. It
may be thought of as a sequential MMSE estimator of a signal embedded in noise, where
the signal is characterized by a dynamical or state model. It generalizes the sequential
MMSE estimator in Section 12.6, to allow the unknown parameters to evolve in time
according to a dynamical model. If the signal and noise are jointly Gaussian, then the
Kalman filter is an optimal MMSE estimator, and if not, it is the optimal LMMSE
estimator.
13.2 Summary
The scalar Gauss-Markov signal model is given in recursive form by (13.1) and explicitly
by (13.2). Its mean, covariance, and variance are given by (13.4), (13.5), and (13.6),
respectively. Generalizing the model to a vector signal results in (13.12) with the
statistical assumptions summarized just below. Also, the explicit representation is given
in (13.13). The corresponding mean and covariances are (13.14), (13.15), and (13.16) or
in recursive form by (13.17) and (13.18). A summary of the vector Gauss-Markov signal
model is given in Theorem 13.1. The sequential MMSE estimator or Kalman filter for
a scalar signal and scalar observations is given by (13.38)-(13.42). If the filter attains
steady-state, then it becomes the infinite length Wiener filter as described in Section
13.5. Two generalizations of the Kalman filter are to a vector signal, whose equations
are (13.50)-(13.54), and to a vector signal and vector observations, whose assumptions
and implementation are described in Theorem 13.2. When the signal model and/or
observation model is nonlinear, the preceding Kalman filter cannot be applied directly.
However, using a linearization approach, one obtains the suboptimal extended Kalman
filter, whose equations are (13.67)-(13.71).
13.3 Dynamical Signal Models
noise w[n] might model the error introduced by an inaccurate voltmeter as successive measurements are taken. Hence, x[n] represents the noise corrupted observations of the power supply output. Now, even though the power supply should generate a constant voltage of A, in practice the true voltage will vary slightly as time progresses. This is due to the effects of temperature, component aging, etc., on the circuitry. Hence, a more accurate measurement model would be

x[n] = A[n] + w[n]

where A[n] is the true voltage at time n. However, with this model our estimation problem becomes considerably more complicated since we will need to estimate A[n] for n = 0, 1, …, N − 1 instead of just the single parameter A. To underscore the difficulty assume that we model the voltage A[n] as a sequence of unknown deterministic parameters. Then, the MVU estimator of A[n] is easily shown to be

Â[n] = x[n].

The estimates will be inaccurate due to a lack of averaging, and in fact the variability will be identical to that of the noise or var(Â[n]) = σ². This estimator is undesirable in that it allows estimates such as those shown in Figure 13.1. We can expect that if the power supply is set for A = 10 volts, then the true voltage will be near this and the variation over time will be slow (otherwise it's time to buy a new power supply!). An example of the true voltage is given in Figure 13.1 and is seen to vary about 10 volts.

[Figure 13.1 True voltage and MVU estimator]

Successive samples of A[n] will not be too different, leading us to conclude that they display a high degree of "correlation." This reasoning naturally leads us to consider A[n] as a realization of a random process with a mean of 10 and some correlation between samples. The imposition of a correlation constraint will prevent the estimate of A[n] from fluctuating too wildly in time. Thus, we will consider A[n] to be a realization of a random process to be estimated, for which Bayesian approaches are appropriate. This type of modeling was used in Chapter 12 in our discussion of Wiener filtering. There the signal to be estimated was termed s[n], and it was assumed to be zero mean. In keeping with this standard notation, we now adopt s[n] as our notation as opposed to θ[n]. Also, because of the zero mean assumption, s[n] will represent the signal model for A[n] − 10. Once we specify the signal model for zero mean s[n], it is easily modified to accommodate nonzero mean processes by adding E(s[n]) to it. We will always assume that the mean is known.

A simple model for s[n] which allows us to specify the correlation between samples is the first-order Gauss-Markov process

s[n] = a s[n − 1] + u[n],    n ≥ 0                                        (13.1)

where u[n] is WGN with variance σ_u², s[−1] ~ N(μ_s, σ_s²), and s[−1] is independent of u[n] for all n ≥ 0. (The reader should not confuse the Gauss-Markov process with the model considered in the Gauss-Markov theorem since they are different.) The noise u[n] is termed the driving or excitation noise since s[n] may be viewed as the output of a linear time invariant system driven by u[n]. In the control literature the system is referred to as the plant, and u[n] is termed the plant noise [Jaswinski 1970]. The model of (13.1) is also called the dynamical or state model. The current output s[n] depends only on the state of the system at the previous time, or s[n − 1], and the current input u[n]. The state of a system at time n₀ is generally defined to be the amount of information, which together with the input for n ≥ n₀ determines the output for n ≥ n₀ [Chen 1970]. Clearly, the state is s[n − 1], and it summarizes the effect of all past inputs to the system. We will henceforth refer to (13.1) as the Gauss-Markov model, where it is understood to be first order.

The signal model of (13.1) resembles an AR(1) process except that the signal starts at n = 0, and hence may not be WSS. We shall see shortly that as n → ∞, so that the effect of the initial condition is negligible, the process is actually WSS and may be regarded as an AR(1) process with filter parameter a[1] = −a. A typical realization of s[n] is shown in Figure 13.2 for a = 0.98, σ_u² = 0.1, μ_s = 5, σ_s² = 1. Note that the mean starts off at about 5 but quickly decreases to zero. This behavior is just the transient
response of the system to the large initial sample s[−1] ≈ μ_s = 5. Also, the samples are heavily correlated. We may quantify these results by determining the mean and covariance of s[n]. Then, a complete statistical description will have been specified since s[n] is a Gaussian process, as we will now show.

First, we express s[n] as a function of the initial condition and the inputs as

s[0] = a s[−1] + u[0]
s[1] = a s[0] + u[1] = a² s[−1] + a u[0] + u[1]

etc. In general, we have

s[n] = a^{n+1} s[−1] + Σ_{k=0}^{n} a^k u[n − k].                          (13.2)

Taking the expectation produces the mean

E(s[n]) = a^{n+1} μ_s                                                     (13.4)

and the covariance follows as

c_s[m, n] = E[(s[m] − E(s[m]))(s[n] − E(s[n]))]
          = E[(a^{m+1}(s[−1] − μ_s) + Σ_{k=0}^{m} a^k u[m − k])(a^{n+1}(s[−1] − μ_s) + Σ_{l=0}^{n} a^l u[n − l])]
          = a^{m+1} a^{n+1} σ_s² + σ_u² Σ_{k=0}^{n} a^{m−n+2k},    m ≥ n   (13.5)

with c_s[m, n] = c_s[n, m] for m < n. Setting m = n yields the variance

var(s[n]) = a^{2(n+1)} σ_s² + σ_u² Σ_{k=0}^{n} a^{2k}.                    (13.6)

Clearly, s[n] is not WSS since the mean depends on n and the covariance depends on m and n, not the difference. However, as n → ∞, we have from (13.4) and (13.5)

E(s[n]) → 0
c_s[m, n] → σ_u² a^{m−n} / (1 − a²),    m ≥ n

so that the process approaches a WSS process with zero mean and ACF r_ss[k] = σ_u² a^{|k|}/(1 − a²). The mean and variance, together with the steady-state covariance or ACF, are shown in Figure 13.3.

Because of the special form of the Gauss-Markov process, the mean and variance can also be expressed recursively. This is useful for conceptualization purposes as well as for extending the results to the vector Gauss-Markov process. The mean and variance are obtained directly from (13.1) as

E(s[n]) = a E(s[n − 1]) + E(u[n])

or

E(s[n]) = a E(s[n − 1])                                                   (13.7)

and

var(s[n]) = E[(s[n] − E(s[n]))²] = a² var(s[n − 1]) + σ_u².               (13.8)
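The recursions (13.7) and (13.8) can be checked against the closed forms (13.4) and (13.6). A numerical sketch (not from the text), using the same parameter values quoted for Figure 13.2:

```python
import numpy as np

# Scalar Gauss-Markov model s[n] = a s[n-1] + u[n]: propagate the mean and
# variance recursively and compare with the closed-form expressions.
a, sigma_u2, mu_s, sigma_s2 = 0.98, 0.1, 5.0, 1.0   # values used for Figure 13.2

n_max = 100
mean_rec, var_rec = mu_s, sigma_s2                  # statistics of s[-1]
means, variances = [], []
for n in range(n_max):
    mean_rec = a * mean_rec                         # (13.7)
    var_rec = a ** 2 * var_rec + sigma_u2           # (13.8)
    means.append(mean_rec)
    variances.append(var_rec)

# Closed forms (13.4), (13.6) with the geometric sum evaluated explicitly.
n = np.arange(n_max)
mean_cf = a ** (n + 1) * mu_s
var_cf = a ** (2 * (n + 1)) * sigma_s2 + sigma_u2 * (1 - a ** (2 * (n + 1))) / (1 - a ** 2)

steady_var = sigma_u2 / (1 - a ** 2)                # steady-state variance as n -> infinity
```

The variance settles toward σ_u²/(1 − a²), the steady-state value discussed above.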
The first-order model can be generalized to a pth-order Gauss-Markov process

s[n] = − Σ_{k=1}^{p} a[k] s[n − k] + u[n],    n ≥ 0.                      (13.9)

Because s[n] now depends on the p previous samples, the mean and variance propagation equations become more complicated. To extend our previous results we first note that the state of the system at time n is {s[n − 1], s[n − 2], …, s[n − p]} since the previous p samples together with u[n] determine the output. We thus define the state vector as

s[n − 1] = [ s[n − p]
             s[n − p + 1]
                 ⋮
             s[n − 1] ].                                                  (13.10)

[Figure 13.3 Statistics of first-order Gauss-Markov process]
With this definition we can rewrite (13.9) in the form

[ s[n − p + 1] ]   [ 0      1       0       ⋯  0     ] [ s[n − p]     ]   [ 0 ]
[ s[n − p + 2] ]   [ 0      0       1       ⋯  0     ] [ s[n − p + 1] ]   [ 0 ]
[      ⋮       ] = [ ⋮      ⋮       ⋮       ⋱  ⋮     ] [      ⋮       ] + [ ⋮ ] u[n]
[ s[n − 1]     ]   [ 0      0       0       ⋯  1     ] [ s[n − 2]     ]   [ 0 ]
[ s[n]         ]   [ −a[p]  −a[p−1] −a[p−2] ⋯  −a[1] ] [ s[n − 1]     ]   [ 1 ]
                    ————————————— A —————————————                          — B —

where the additional (p − 1) equations are just identities. Hence, using the definition of the state vector, we have

s[n] = A s[n − 1] + B u[n]                                                (13.11)

where A is a p × p nonsingular matrix (termed the state transition matrix) and B is a p × 1 vector. Now we have the desired form (compare this to (13.1)) in which the vector signal s[n] is easily computed based on its value at the previous time instant, the state vector, and the input. This is termed the vector Gauss-Markov model. A final level of generality allows the input u[n] to be an r × 1 vector so that, as shown in Figure 13.4, we have a model for a vector signal as the output of a linear time invariant system (A and B are constant matrices) excited by a vector input. In Figure 13.4a H(z) is the p × r matrix system function.

[Figure 13.4 Vector Gauss-Markov signal system model: (a) Multi-input-multi-output model, (b) Equivalent vector model]

Summarizing, our general vector Gauss-Markov model takes the form

s[n] = A s[n − 1] + B u[n]                                                (13.12)

where A, B are constant matrices with dimension p × p and p × r, respectively, s[n] is the p × 1 signal vector, and u[n] is the r × 1 driving noise vector. We will from time to time refer to (13.12) as the state model. The statistical assumptions are that

1. The input u[n] is a vector WGN sequence, i.e., u[n] is a sequence of uncorrelated jointly Gaussian vectors with E(u[n]) = 0. As a result, we have that

E(u[m]uᵀ[n]) = 0,    m ≠ n

and the covariance of u[n] is

E(u[n]uᵀ[n]) = Q

where Q is an r × r positive definite matrix. Note that the vector samples are independent due to the jointly Gaussian assumption.

2. The initial state or s[−1] is a random vector with

s[−1] ~ N(μ_s, C_s)

and is independent of u[n] for all n ≥ 0.

We now illustrate with an example.

Example 13.1 - Two DC Power Supplies

Recalling the introductory example, consider now the model for the outputs of two DC power supplies that vary with time. If we assume that the outputs are independent (in a functional sense) of each other, then a reasonable model would be the scalar model of (13.1) for each output or

s₁[n] = a₁ s₁[n − 1] + u₁[n]
s₂[n] = a₂ s₂[n − 1] + u₂[n]

where s₁[−1] ~ N(μ_s₁, σ_s₁²), s₂[−1] ~ N(μ_s₂, σ_s₂²), u₁[n] is WGN with variance σ_u₁², u₂[n] is WGN with variance σ_u₂², and all random variables are independent of each other. Considering s[n] = [s₁[n] s₂[n]]ᵀ as the vector parameter to be estimated, we have the model
s[n] = [ a₁  0  ] s[n − 1] + u[n]
       [ 0   a₂ ]

so that A = diag(a₁, a₂) and B = I, and

Q = [ σ_u₁²   0
      0       σ_u₂² ]

and

s[−1] ~ N( [ μ_s₁ ] , [ σ_s₁²   0
           [ μ_s₂ ]     0       σ_s₂² ] ).

If, on the other hand, the two outputs were generated from the same source (maybe some of the circuitry was shared), then they would undoubtedly be correlated. As we will see shortly, we could model this by letting any of the matrices A, B, or Q be nondiagonal (see (13.26)). ◇

We now complete our discussion of the vector Gauss-Markov model by deriving its statistical properties. The computations are simple extensions of those for the scalar model. First, we determine an explicit expression for s[n]. From (13.12) we have

s[0] = A s[−1] + B u[0]
s[1] = A s[0] + B u[1] = A² s[−1] + A B u[0] + B u[1]

etc. In general, we have

s[n] = A^{n+1} s[−1] + Σ_{k=0}^{n} A^k B u[n − k]                         (13.13)

where A⁰ = I. It is seen that s[n] is a linear function of the initial condition and the driving noise inputs. As a result, s[n] is a Gaussian random process. It remains only to determine the mean and covariance. From (13.13)

E(s[n]) = A^{n+1} E(s[−1]) = A^{n+1} μ_s.                                 (13.14)

The covariance is

C_s[m, n] = E[(s[m] − E(s[m]))(s[n] − E(s[n]))ᵀ]
          = E[(A^{m+1}(s[−1] − μ_s) + Σ_{k=0}^{m} A^k B u[m − k])(A^{n+1}(s[−1] − μ_s) + Σ_{l=0}^{n} A^l B u[n − l])ᵀ]
          = A^{m+1} C_s A^{(n+1)ᵀ} + Σ_{k=0}^{m} Σ_{l=0}^{n} A^k B E(u[m − k]uᵀ[n − l]) Bᵀ A^{lᵀ}.

But E(u[m − k]uᵀ[n − l]) = Q for m − k = n − l and 0 otherwise. Thus, for m ≥ n

C_s[m, n] = A^{m+1} C_s A^{(n+1)ᵀ} + Σ_{k=m−n}^{m} A^k B Q Bᵀ A^{(n−m+k)ᵀ}   (13.15)

and for m < n

C_s[m, n] = C_sᵀ[n, m].

The covariance matrix for s[n] is

C[n] = C_s[n, n] = A^{n+1} C_s A^{(n+1)ᵀ} + Σ_{k=0}^{n} A^k B Q Bᵀ A^{kᵀ}.   (13.16)

Also, note that the mean and covariance propagation equations can be written as

E(s[n]) = A E(s[n − 1])                                                   (13.17)
C[n] = A C[n − 1] Aᵀ + B Q Bᵀ                                             (13.18)

which follow from (13.12) (see Problem 13.5). As in the scalar case, the covariance matrix decreases due to the A C[n − 1] Aᵀ term but increases due to the B Q Bᵀ term. (It can be shown that for a stable process the eigenvalues of A must be less than 1 in magnitude. See Problem 13.6.) Steady-state properties similar to those of the scalar Gauss-Markov model are evident from (13.14) and (13.16). As n → ∞, the mean will converge to zero or

E(s[n]) = A^{n+1} μ_s → 0
since the eigenvalues of A are all less than 1 in magnitude. Also, from (13.16) it can be shown that as n → ∞ (see Problem 13.7)

C[n] → C = Σ_{k=0}^{∞} A^k B Q Bᵀ A^{kᵀ}.                                 (13.19)

It is interesting to note that the steady-state covariance is also the solution of (13.18) if we set C[n − 1] = C[n] = C in (13.18). Then, the steady-state covariance satisfies

C = A C Aᵀ + B Q Bᵀ                                                       (13.20)

and (13.19) is the solution, as can be verified by direct substitution. This is known as the Lyapunov equation.

Although in our definition of the Gauss-Markov model we assumed that the matrices A, B, and Q did not depend on n, it is perfectly permissible to let them do so, and in some cases quite useful. Similar expressions for the mean and covariance can be developed. One major difference though is that the process may not attain a statistical steady-state. We now summarize the model and its properties.

Theorem 13.1 (Vector Gauss-Markov Model) The Gauss-Markov model for a p × 1 vector signal s[n] is

s[n] = A s[n − 1] + B u[n],    n ≥ 0.                                     (13.21)

The A, B are known matrices having dimensions p × p and p × r, respectively, and it is assumed that the eigenvalues of A are less than 1 in magnitude. The driving noise vector u[n] has dimension r × 1 and is vector WGN or u[n] ~ N(0, Q) with the u[n]'s independent. The initial condition s[−1] is a p × 1 random vector distributed according to s[−1] ~ N(μ_s, C_s) and is independent of the u[n]'s. Then, the signal process is Gaussian with mean

E(s[n]) = A^{n+1} μ_s                                                     (13.22)

and covariance for m ≥ n

C_s[m, n] = E[(s[m] − E(s[m]))(s[n] − E(s[n]))ᵀ] = A^{m+1} C_s A^{(n+1)ᵀ} + Σ_{k=m−n}^{m} A^k B Q Bᵀ A^{(n−m+k)ᵀ}   (13.23)

and for m < n

C_s[m, n] = C_sᵀ[n, m]

and covariance matrix

C[n] = C_s[n, n] = A^{n+1} C_s A^{(n+1)ᵀ} + Σ_{k=0}^{n} A^k B Q Bᵀ A^{kᵀ}.   (13.24)

The mean and covariance propagation equations are

E(s[n]) = A E(s[n − 1])                                                   (13.25)
C[n] = A C[n − 1] Aᵀ + B Q Bᵀ.                                            (13.26)

13.4 Scalar Kalman Filter

The scalar Gauss-Markov signal model discussed in the previous section had the form

s[n] = a s[n − 1] + u[n],    n ≥ 0.

We now describe a sequential MMSE estimator which will allow us to estimate s[n] based on the data {x[0], x[1], …, x[n]} as n increases. Such an operation is referred to as filtering. The approach computes the estimator ŝ[n] based on the estimator for the previous time sample ŝ[n − 1] and so is recursive in nature. This is the so-called Kalman filter. As explained in the introduction, the versatility of the Kalman filter accounts for its widespread use. It can be applied to estimation of a scalar Gauss-Markov signal as well as to its vector extension. Furthermore, the data, which previously in all our discussions consisted of a scalar sequence such as {x[0], x[1], …, x[n]}, can be extended to vector observations or {x[0], x[1], …, x[n]}. A common example occurs in array processing in which at each time instant we sample the outputs of a group of sensors. If we have M sensors, then each data sample x[n] or observation will be an M × 1 vector. Three different levels of generality are now summarized in hierarchical order:

1. scalar state - scalar observation (s[n − 1], x[n])
2. vector state - scalar observation (s[n − 1], x[n])
3. vector state - vector observation (s[n − 1], x[n]).

In this section we discuss the first case, leaving the remaining ones to Section 13.6. Consider the scalar state equation and the scalar observation equation

s[n] = a s[n − 1] + u[n]
x[n] = s[n] + w[n]                                                        (13.27)

where u[n] is zero mean Gaussian noise with independent samples and E(u²[n]) = σ_u², and w[n] is zero mean Gaussian noise with independent samples and E(w²[n]) = σₙ². We further assume that s[−1], u[n], and w[n] are all independent. Finally, we assume that s[−1] ~ N(μ_s, σ_s²). The noise process w[n] differs from WGN only in that its variance is allowed to change with time. To simplify the derivation we will assume that μ_s = 0, so that according to (13.4) E(s[n]) = 0 for n ≥ 0. Later we will account for a nonzero initial signal mean. We wish to estimate s[n] based on the observations {x[0], x[1], …, x[n]} or to filter x[n] to produce ŝ[n]. More generally, the estimator of s[n] based on the observations {x[0], x[1], …, x[m]} will be denoted by ŝ[n|m]. Our criterion of optimality will be the minimum Bayesian MSE or

E[(s[n] − ŝ[n|n])²]
432 CHAPTER 13. KALMAN FILTERS 13.4. SCALAR KALMAN FILTER 433
where the expectation is with respect to p(x[0], x[1], ..., x[n], s[n]). But the MMSE estimator is just the mean of the posterior PDF or

    ŝ[n|n] = E(s[n]|x[0], x[1], ..., x[n]).                          (13.28)

Using Theorem 10.2 with zero means this becomes

    ŝ[n|n] = C_θx C_xx^(−1) x                                        (13.29)

since θ = s[n] and x = [x[0] x[1] ... x[n]]^T are jointly Gaussian. Because we are assuming Gaussian statistics for the signal and noise, the MMSE estimator is linear and is identical in algebraic form to the LMMSE estimator. The algebraic properties allow us to utilize the vector space approach to find the estimator. The implicit linear constraint does not detract from the generality since we already know that the optimal estimator is linear. Furthermore, if the Gaussian assumption is not valid, then the resulting estimator is still valid but can only be said to be the optimal LMMSE estimator. Returning to the sequential computation of (13.29), we note that if x[n] is uncorrelated with {x[0], x[1], ..., x[n−1]}, then from (13.28) and the orthogonality principle we will have (see Example 12.2)

    ŝ[n|n] = E(s[n]|x[0], x[1], ..., x[n−1]) + E(s[n]|x[n])
           = ŝ[n|n−1] + E(s[n]|x[n])

which has the desired sequential form. Unfortunately, the x[n]'s are correlated due to their dependence on s[n], which is correlated from sample to sample. From our discussions in Chapter 12 of the sequential LMMSE estimator, we can use our vector space interpretation to determine the correction of the old estimator ŝ[n|n−1] due to the observation of x[n]. Before doing so we will summarize some properties of the MMSE estimator that will be used.

1. The MMSE estimator of θ based on two uncorrelated data vectors, assuming jointly Gaussian statistics and zero means, is (see Section 11.4)

    E(θ|x₁, x₂) = E(θ|x₁) + E(θ|x₂).

2. The MMSE estimator of a sum of jointly Gaussian random variables is the sum of the individual MMSE estimators, or

    E(θ₁ + θ₂|x) = E(θ₁|x) + E(θ₂|x).

With these properties we begin the derivation of (13.38)-(13.42). Let X[n] = [x[0] x[1] ... x[n]]^T (we now use X as our notation to avoid confusion with our previous notation x[n], which will subsequently be used for vector observations) and let x̃[n] denote the innovation. Recall that the innovation is the part of x[n] that is uncorrelated with the previous samples {x[0], x[1], ..., x[n−1]} or

    x̃[n] = x[n] − x̂[n|n−1].                                         (13.30)

This is because by the orthogonality principle x̂[n|n−1] is the MMSE estimator of x[n] based on the data {x[0], x[1], ..., x[n−1]}, the error or x̃[n] being orthogonal (uncorrelated) with the data. The data X[n−1], x̃[n] are equivalent to the original data set since x[n] may be recovered from

    x[n] = x̃[n] + x̂[n|n−1]
         = x̃[n] + Σ_{k=0}^{n−1} a_k x[k]

where the a_k's are the optimal weighting coefficients of the MMSE estimator of x[n] based on {x[0], x[1], ..., x[n−1]}. Now we can rewrite (13.28) as

    ŝ[n|n] = E(s[n]|X[n−1], x̃[n])

and because X[n−1] and x̃[n] are uncorrelated, we have from property 1 that

    ŝ[n|n] = E(s[n]|X[n−1]) + E(s[n]|x̃[n]).                         (13.31)

But E(s[n]|X[n−1]) is the prediction of s[n] based on the previous data, and we denote it by ŝ[n|n−1]. Explicitly the prediction is, from (13.1) and property 2,

    ŝ[n|n−1] = E(s[n]|X[n−1])
             = E(a s[n−1] + u[n]|X[n−1])
             = a E(s[n−1]|X[n−1])
             = a ŝ[n−1|n−1]

since E(u[n]|X[n−1]) = 0, u[n] being independent of the past observations. Thus, from (13.31)

    ŝ[n|n] = ŝ[n|n−1] + E(s[n]|x̃[n])

where

    ŝ[n|n−1] = a ŝ[n−1|n−1].
To determine E(s[n]|x̃[n]) we note that it is the MMSE estimator of s[n] based on x̃[n]. As such, it is linear, and because of the zero mean assumption on s[n], it takes the form

    E(s[n]|x̃[n]) = K[n] x̃[n]
                 = K[n](x[n] − x̂[n|n−1])

where

    K[n] = E(s[n]x̃[n]) / E(x̃²[n]).                                  (13.32)

This follows from the general MMSE estimator for jointly Gaussian θ and x

    θ̂ = C_θx C_xx^(−1) x = (E(θx)/E(x²)) x.

But x[n] = s[n] + w[n], so that by property 2

    x̂[n|n−1] = ŝ[n|n−1] + ŵ[n|n−1]
             = ŝ[n|n−1]

since ŵ[n|n−1] = 0 due to w[n] being independent of {x[0], x[1], ..., x[n−1]}. Thus,

    E(s[n]|x̃[n]) = K[n](x[n] − ŝ[n|n−1])

and from (13.31) we now have

    ŝ[n|n] = ŝ[n|n−1] + K[n](x[n] − ŝ[n|n−1])                        (13.33)

where

    ŝ[n|n−1] = a ŝ[n−1|n−1].                                         (13.34)

It remains only to determine the gain factor K[n]. From (13.32) the gain factor is

    K[n] = E[s[n](x[n] − ŝ[n|n−1])] / E[(x[n] − ŝ[n|n−1])²]          (13.35)

where we have used the innovation x̃[n] = x[n] − ŝ[n|n−1]. To evaluate these expectations note that the numerator may be written as E[(s[n] − ŝ[n|n−1])(x[n] − ŝ[n|n−1])] and the denominator as E[(s[n] − ŝ[n|n−1] + w[n])²]. The first result follows because the innovation x[n] − ŝ[n|n−1] is uncorrelated with the past data and hence with ŝ[n|n−1], which is a linear combination of x[0], x[1], ..., x[n−1]. The second result follows from s[n] and w[n] being uncorrelated and w[n] being uncorrelated with the past data (since w[n] is an uncorrelated process). Using these properties, the gain becomes

    K[n] = E[(s[n] − ŝ[n|n−1])(x[n] − ŝ[n|n−1])] / E[(s[n] − ŝ[n|n−1] + w[n])²]     (13.36)
         = E[(s[n] − ŝ[n|n−1])²] / (σ_n² + E[(s[n] − ŝ[n|n−1])²]).

But the numerator is just the minimum MSE incurred when s[n] is estimated based on the previous data or the minimum one-step prediction error. We will denote this by M[n|n−1], so that

    K[n] = M[n|n−1] / (σ_n² + M[n|n−1]).                             (13.37)

To evaluate the gain we need an expression for the minimum prediction error. Using the state equation and (13.34)

    M[n|n−1] = E[(s[n] − ŝ[n|n−1])²]
             = E[(a s[n−1] + u[n] − a ŝ[n−1|n−1])²]
             = E[(a(s[n−1] − ŝ[n−1|n−1]) + u[n])²].

We note that

    E[(s[n−1] − ŝ[n−1|n−1]) u[n]] = 0

since s[n−1] depends on {u[0], u[1], ..., u[n−1], s[−1]}, which are independent of u[n], and ŝ[n−1|n−1] depends on past data samples or {s[0] + w[0], s[1] + w[1], ..., s[n−1] + w[n−1]}, which also are independent of u[n]. Thus,

    M[n|n−1] = a² M[n−1|n−1] + σ_u².

Finally, we require a recursion for M[n|n]. Using (13.33), we have
This completes the derivation of the scalar state-scalar observation Kalman filter. Although tedious, the final equations are actually quite simple and intuitive. We summarize them below. For n ≥ 0:

Prediction:

    ŝ[n|n−1] = a ŝ[n−1|n−1].                                         (13.38)

Minimum Prediction MSE:

    M[n|n−1] = a² M[n−1|n−1] + σ_u².                                 (13.39)

Kalman Gain:

    K[n] = M[n|n−1] / (σ_n² + M[n|n−1]).                             (13.40)

Correction:

    ŝ[n|n] = ŝ[n|n−1] + K[n](x[n] − ŝ[n|n−1]).                       (13.41)

Minimum MSE:

    M[n|n] = (1 − K[n]) M[n|n−1].                                    (13.42)

Although derived for μ_s = 0 so that E(s[n]) = 0, the same equations result for μ_s ≠ 0 (see Appendix 13A). Hence, to initialize the equations we use ŝ[−1|−1] = E(s[−1]) = μ_s and M[−1|−1] = σ_s², since this amounts to the estimation of s[−1] without any data. A block diagram of the Kalman filter is given in Figure 13.5. It is interesting to note that the dynamical model for the signal is an integral part of the estimator. Furthermore, we may view the output of the gain block as an estimate û[n] of u[n]. From (13.38) and (13.41) the signal estimate is

    ŝ[n|n] = a ŝ[n−1|n−1] + û[n]

where û[n] = K[n](x[n] − ŝ[n|n−1]). To the extent that this estimate is approximately u[n], we will have ŝ[n|n] ≈ s[n], as desired. We now consider an example to illustrate the flow of the Kalman filter.

[Figure 13.5 Scalar state-scalar observation Kalman filter and relationship to dynamical model: (a) dynamical model, (b) Kalman filter]

As an example, assume that a = 1/2 and σ_u² = 2, and that w[n] is zero mean Gaussian noise with independent samples, a variance of σ_n² = (1/2)^n, and independent of s[−1] and u[n] for n ≥ 0. We initialize the filter with

    ŝ[−1|−1] = E(s[−1]) = 0
    M[−1|−1] = E[(s[−1] − ŝ[−1|−1])²] = E(s²[−1]) = 1.

According to (13.38) and (13.39), we first predict the ŝ[0] sample
to obtain

    ŝ[0|−1] = a ŝ[−1|−1] = 0
    M[0|−1] = a² M[−1|−1] + σ_u² = 9/4.

The gain is, from (13.40),

    K[0] = M[0|−1] / (σ_0² + M[0|−1]) = (9/4)/(1 + 9/4) = 9/13

and the correction yields

    ŝ[0|0] = ŝ[0|−1] + K[0](x[0] − ŝ[0|−1]) = (9/13) x[0]
    M[0|0] = (1 − K[0]) M[0|−1] = 9/13.

For n = 1 we obtain the results

    ŝ[1|0] = (9/26) x[0]
    M[1|0] = 113/52
    K[1]   = 113/129
    ŝ[1|1] = (9/26) x[0] + (113/129)(x[1] − (9/26) x[0])
    M[1|1] = 452/1677

which the reader should verify. ◇

We should note that the same set of equations result if E(s[−1]) ≠ 0. In this case the mean of s[n] will be nonzero since E(s[n]) = a^(n+1) E(s[−1]) (see (13.3)). This extension is explored in Problem 13.13. Recall that the reason for the zero mean assumption was to allow us to use the orthogonality principle. We now discuss some important properties of the Kalman filter. They are:

1. The Kalman filter extends the sequential MMSE estimator in Chapter 12 to the case where the unknown parameter evolves in time according to the dynamical model. In Chapter 12 we derived the equations for the sequential LMMSE estimator (see (12.47)-(12.49)), which are identical in form to those for the sequential MMSE estimator which assumes Gaussian statistics. In particular, for the scalar parameter case these take the form of (13.43)-(13.45). The Kalman filter will reduce to these equations when the parameter to be estimated does not evolve in time. This follows by assuming the driving noise to be zero or σ_u² = 0 and also a = 1. Then, the state equation becomes from (13.1) s[n] = s[n−1] or explicitly s[n] = s[−1]. The parameter to be estimated is a constant, which is modeled as the realization of a random variable. Then, the prediction is ŝ[n|n−1] = ŝ[n−1|n−1] with minimum MSE M[n|n−1] = M[n−1|n−1], so that we can omit the prediction stage of the Kalman filter. This says that the prediction is just the last estimate of s[n]. The correction stage reduces to (13.43)-(13.45) since we can let ŝ[n|n] = ŝ[n], ŝ[n|n−1] = ŝ[n−1], M[n|n] = M[n], and M[n|n−1] = M[n−1].

2. No matrix inversions are required. This should be compared to the batch method of estimating θ = s[n] as

    θ̂ = C_θx C_xx^(−1) x

where x = [x[0] x[1] ... x[n]]^T. The use of this formula requires us to invert C_xx for each sample of s[n] to be estimated. And in fact, the dimension of the matrix is (n + 1) × (n + 1), becoming larger with n.

3. The Kalman filter is a time varying linear filter. Note from (13.38) and (13.41) that

    ŝ[n|n] = a ŝ[n−1|n−1] + K[n](x[n] − a ŝ[n−1|n−1])
           = a(1 − K[n]) ŝ[n−1|n−1] + K[n] x[n].

This is a first-order recursive filter with time varying coefficients as shown in Figure 13.6.

[Figure 13.6 First-order recursive filter implementation of the Kalman filter]
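The first-order recursion above is straightforward to code. The sketch below implements (13.38)-(13.42) directly; the argument names (a, var_u, the per-sample observation noise variances var_w, and the initialization mu_s, var_s) are illustrative, not notation from the text.

```python
# Sketch of the scalar state-scalar observation Kalman filter (13.38)-(13.42).
def scalar_kalman(x, a, var_u, var_w, mu_s, var_s):
    s_est = mu_s              # s^[-1|-1]: estimate of s[-1] with no data
    M = var_s                 # M[-1|-1]
    history = []
    for n, xn in enumerate(x):
        s_pred = a * s_est                   # (13.38) prediction
        M_pred = a * a * M + var_u           # (13.39) minimum prediction MSE
        K = M_pred / (var_w[n] + M_pred)     # (13.40) Kalman gain
        s_est = s_pred + K * (xn - s_pred)   # (13.41) correction
        M = (1.0 - K) * M_pred               # (13.42) minimum MSE
        history.append((s_est, K, M))
    return history
```

Note that with var_u = 0 and a = 1 the prediction stage becomes the identity and the recursion reduces to the sequential estimation of a constant parameter, as discussed in property 1.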
[Figure 13.7 Prediction and correction minimum MSE]
[Figure 13.8 Innovation-driven Kalman filter]
[Figure 13.9 Orthogonality (uncorrelated property) of the innovation sequence x̃[n]]

4. The Kalman filter provides its own performance measure. From (13.42) the minimum Bayesian MSE is computed as an integral part of the estimator. Also, the error measure may be computed off-line, i.e., before any data are collected. This is because M[n|n] depends only on (13.39) and (13.40), which are independent of the data. We will see later that for the extended Kalman filter the minimum MSE sequence must be computed on-line (see Section 13.7).

5. The prediction stage increases the error, while the correction stage decreases it. There is an interesting interplay between the minimum prediction MSE and the minimum MSE. If as n → ∞ a steady-state condition is achieved, then M[n|n] and M[n|n−1] each become constant with M[n|n−1] > M[n−1|n−1] (see Problem 13.14). Hence, the error will increase after the prediction stage. When the new data sample is obtained, we correct the estimate, which decreases the MSE according to (13.42) (since K[n] < 1). An example of this is shown in Figure 13.7, in which a = 0.99, σ_u² = 0.1, σ_n² = 0.9^(n+1), and M[−1|−1] = 1. We will say more about this in the next section when we discuss the relationship of the Kalman filter to the Wiener filter.

6. Prediction is an integral part of the Kalman filter. It is seen from (13.38) that to determine the best filtered estimate of s[n] we employ predictions. We can find the best one-step prediction of s[n] based on {x[0], x[1], ..., x[n−1]} from (13.38). If we desire the best two-step prediction, we can obtain it easily by noting that this is the best estimate of s[n+1] based on {x[0], x[1], ..., x[n−1]}. To find this we let σ_n² → ∞, implying that x[n] is so noisy that the Kalman filter will not use it. Then, ŝ[n+1|n−1] is just the optimal two-step prediction. To evaluate this we have from (13.38)

    ŝ[n+1|n] = a ŝ[n|n]

and since K[n] → 0, we have from (13.41) that ŝ[n|n] → ŝ[n|n−1], where ŝ[n|n−1] = a ŝ[n−1|n−1]. Thus, the optimal two-step prediction is

    ŝ[n+1|n−1] = ŝ[n+1|n]
               = a ŝ[n|n]
               = a ŝ[n|n−1]
               = a² ŝ[n−1|n−1].

This can be generalized to the l-step predictor, as shown in Problem 13.15.

7. The Kalman filter is driven by the uncorrelated innovation sequence and in steady-state can also be viewed as a whitening filter. Note from (13.38) and (13.41) that

    ŝ[n|n] = a ŝ[n−1|n−1] + K[n](x[n] − ŝ[n|n−1])

so that the input to the Kalman filter is the innovation sequence x̃[n] = x[n] − ŝ[n|n−1] (see (13.35)) as shown in Figure 13.8. We know from our discussion of the vector space approach that x̃[n] is uncorrelated with {x[0], x[1], ..., x[n−1]}, which translates into a sequence of uncorrelated random variables as shown in Figure 13.9. Alternatively, if we view x̃[n] as the Kalman filter output and if the filter attains steady-state, then it becomes a linear time invariant whitening filter as shown in Figure 13.10. This is discussed in detail in the next section.
[Figure 13.10 Whitening filter interpretation of the Kalman filter: (a) Kalman filter, (b) system model]

8. The Kalman filter is optimal in that it minimizes the Bayesian MSE for each estimate ŝ[n]. If the Gaussian assumption is not valid, then it is still the optimal linear MMSE estimator as described in Chapter 12.

All these properties of the Kalman filter carry over to the vector state case, except for property 2, if the observations are also vectors.

13.5 Kalman Versus Wiener Filters

The causal infinite length Wiener filter described in Chapter 12 produced an estimate of s[n] as the output of a linear time invariant filter or

    ŝ[n] = Σ_{k=0}^{∞} h[k] x[n−k].

The estimator of s[n] is based on the present data sample and the infinite past. To determine the filter impulse response h[k] analytically we needed to assume that the signal s[n] and noise w[n] were WSS processes, so that the Wiener-Hopf equations could be solved (see Problem 12.16). In the Kalman filter formulation the signal and noise need not be WSS. The variance of w[n] may change with n, and furthermore, s[n] will only be WSS as n → ∞. Additionally, the Kalman filter produces estimates based on only the data samples from 0 to n, not the infinite past as assumed by the infinite length Wiener filter. The two filters will, however, be the same as n → ∞ if σ_n² = σ². This is because s[n] will approach statistical steady-state as shown in Section 13.3, i.e., it will become an AR(1) process, and the estimator will then be based on the present data sample and the infinite past. Since the Kalman filter will approach a linear time invariant filter, we can let K[n] → K[∞], M[n|n] → M[∞], and M[n|n−1] → M_p[∞], where M_p[∞] is the steady-state one-step prediction error. To find the steady-state Kalman filter we need to first find M[∞]. From (13.39), (13.40), and (13.42) we have

    M[∞] = σ² M_p[∞] / (M_p[∞] + σ²)
         = σ² (a² M[∞] + σ_u²) / (a² M[∞] + σ_u² + σ²)               (13.46)

which must be solved for M[∞]. The resulting equation is termed the steady-state Ricatti equation and is seen to be quadratic. Once M[∞] has been found, M_p[∞] can be determined and finally K[∞]. Then, the steady-state Kalman filter takes the form of the first-order recursive filter shown in Figure 13.6 with K[n] replaced by the constant K[∞]. Note that a simple way of solving the Ricatti equation numerically is to run the Kalman filter until it converges. This will produce the desired time invariant filter. The steady-state filter will have the form

    ŝ[n|n] = a ŝ[n−1|n−1] + K[∞](x[n] − a ŝ[n−1|n−1])
           = a(1 − K[∞]) ŝ[n−1|n−1] + K[∞] x[n]

so that its steady-state transfer function is

    H_∞(z) = K[∞] / (1 − a(1 − K[∞]) z^(−1)).                        (13.47)

As an example, if a = 0.9, σ_u² = 1, σ² = 1, we will find from (13.46) that M[∞] = 0.5974 (the other solution is negative). Hence, from (13.39) M_p[∞] = 1.4839, and from (13.40) K[∞] = 0.5974. The steady-state frequency response is

    H_∞(exp(j2πf)) = 0.5974 / (1 − 0.3623 exp(−j2πf))

whose magnitude is shown in Figure 13.11 as a solid line versus the PSD of the steady-state signal

    P_ss(f) = σ_u² / |1 − a exp(−j2πf)|²
            = 1 / |1 − 0.9 exp(−j2πf)|²

shown as a dashed line. The same results would be obtained if the Wiener-Hopf equation had been solved for the causal infinite length Wiener filter.
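The steady-state values quoted above can be checked numerically. The sketch below solves the steady-state Ricatti equation both in closed form and, as suggested in the text, by simply running the MSE recursion until it converges.

```python
import math

# Sketch for the example values a = 0.9, var_u = 1, var_w = 1. Rearranging
# (13.46), the steady-state Ricatti equation is quadratic in M[oo]:
#   a^2 M^2 + (var_u + var_w (1 - a^2)) M - var_w var_u = 0.
a, var_u, var_w = 0.9, 1.0, 1.0
a2 = a * a

b = var_u + var_w * (1.0 - a2)
M_quad = (-b + math.sqrt(b * b + 4.0 * a2 * var_w * var_u)) / (2.0 * a2)

# Alternatively, iterate (13.39), (13.40), (13.42) to convergence.
M = 1.0
for _ in range(1000):
    Mp = a2 * M + var_u           # M[n|n-1]
    K = Mp / (var_w + Mp)         # K[n]
    M = (1.0 - K) * Mp            # M[n|n]

print(round(M_quad, 4), round(Mp, 4), round(K, 4))   # 0.5974 1.4839 0.5974
```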
[Figure 13.11 Signal PSD and steady-state Kalman filter magnitude response]
[Figure 13.12 Whitening filter property of steady-state Kalman filter]

Hence, the steady-state Kalman filter is identical to the causal infinite length Wiener filter. Recall that the innovation is

    x̃[n] = x[n] − a ŝ[n−1|n−1].                                      (13.48)
But in steady-state ŝ[n|n] is the output of the filter with system function H_∞(z) driven by x[n]. Thus, the system function relating the input x[n] to the output x̃[n] is, from (13.48) and (13.47),

    H_w(z) = 1 − a z^(−1) H_∞(z)
           = 1 − a K[∞] z^(−1) / (1 − a(1 − K[∞]) z^(−1))
           = (1 − a z^(−1)) / (1 − a(1 − K[∞]) z^(−1)).

For this example we have the whitening filter frequency response

    H_w(f) = (1 − 0.9 exp(−j2πf)) / (1 − 0.3623 exp(−j2πf))

whose magnitude is plotted in Figure 13.12 as a solid line along with the PSD of x[n], which for this example is

    P_xx(f) = 1 + 1 / |1 − 0.9 exp(−j2πf)|².

The PSD at the output of the whitening filter is therefore

    |H_w(f)|² P_xx(f) = (1 + |1 − 0.9 exp(−j2πf)|²) / |1 − 0.3623 exp(−j2πf)|²

which can be verified to be the constant PSD P_x̃x̃(f) = 2.48. In general, we have

    |H_w(f)|² P_xx(f) = σ_x̃²

where σ_x̃² is the variance of the innovation. Thus, as shown in Figure 13.13, the output PSD of a filter with frequency response H_w(f) is a flat PSD with height σ_x̃².
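The whitening property is easy to confirm numerically. The sketch below evaluates |H_w(f)|² P_xx(f) on a frequency grid for the example values (with K[∞] rounded to 0.5974) and shows the output PSD is flat at about 2.48.

```python
import numpy as np

# Sketch: output PSD of the whitening filter for a = 0.9, K[oo] = 0.5974,
# so that a(1 - K[oo]) = 0.3623.
a, K = 0.9, 0.5974
f = np.linspace(-0.5, 0.5, 1001)
e = np.exp(-2j * np.pi * f)

Pxx = 1.0 + 1.0 / np.abs(1.0 - a * e) ** 2        # noise PSD plus signal PSD
Hw = (1.0 - a * e) / (1.0 - a * (1.0 - K) * e)    # whitening filter response
Pout = np.abs(Hw) ** 2 * Pxx

print(round(Pout.min(), 2), round(Pout.max(), 2))  # 2.48 2.48
```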
13.6 Vector Kalman Filter
where h[n] is a known p × 1 vector and w[n] is zero mean Gaussian noise with uncorrelated samples, with variance σ_n², and also independent of s[−1] and u[n]. The data model of (13.49) is called the observation or measurement equation. An example is given in Section 13.8, where we wish to track the coefficients of a random time varying FIR filter. For that example, the state is comprised of the impulse response values. The Kalman filter for this setup is derived in exactly the same manner as for the scalar state case. The derivation is included in Appendix 13A. We now summarize the results. The reader should note the similarity to (13.38)-(13.42).

Prediction:

    ŝ[n|n−1] = A ŝ[n−1|n−1].                                         (13.50)

For vector observations recall the Bayesian linear model

    x = Hθ + w

where each vector observation has this form. Indexing the quantities by n and replacing θ by s[n], we have as our observation model

    x[n] = H[n] s[n] + w[n]                                          (13.57)

where H[n] is a known M × p matrix, x[n] is an M × 1 observation vector, and w[n] is an M × 1 observation noise sequence. The w[n]'s are independent of each other and of u[n] and s[−1], and w[n] ~ N(0, C[n]). Except for the dependence of the covariance matrix on n, w[n] can be thought of as vector WGN. For the array processing problem s[n] represents a vector of p transmitted signals, which are modeled as random,
and H[n] models the linear transformation due to the medium. The medium may be modeled as time varying since H[n] depends on n. Hence, H[n]s[n] is the signal output at the M sensors. Also, the sensor outputs are corrupted by noise w[n]. The statistical assumptions on w[n] indicate the noise is correlated from sensor to sensor at the same time instant with covariance C[n], and this correlation varies with time. From time instant to time instant, however, the noise samples are independent since E(w[i]w^T[j]) = 0 for i ≠ j. With this data model we can derive the vector state-vector observation Kalman filter, the most general estimator. Because of the large number of assumptions required, we summarize the results as a theorem.

Theorem 13.2 (Vector Kalman Filter) The p × 1 signal vector s[n] evolves in time according to the Gauss-Markov model

    s[n] = A s[n−1] + B u[n]

where A, B are known matrices of dimension p × p and p × r, respectively. The driving noise vector u[n] has the PDF u[n] ~ N(0, Q) and is independent from sample to sample, so that E(u[m]u^T[n]) = 0 for m ≠ n (u[n] is vector WGN). The initial state vector s[−1] has the PDF s[−1] ~ N(μ_s, C_s) and is independent of u[n]. The M × 1 observation vectors x[n] are modeled by the Bayesian linear model

    x[n] = H[n] s[n] + w[n]

where H[n] is a known M × p observation matrix (which may be time varying) and w[n] is an M × 1 observation noise vector with PDF w[n] ~ N(0, C[n]) and is independent from sample to sample, so that E(w[m]w^T[n]) = 0 for m ≠ n. (If C[n] did not depend on n, then w[n] would be vector WGN.)

The MMSE estimator of s[n] based on {x[0], x[1], ..., x[n]} or

    ŝ[n|n] = E(s[n]|x[0], x[1], ..., x[n])

can be computed sequentially in time using the following recursion:

Prediction:

    ŝ[n|n−1] = A ŝ[n−1|n−1].                                         (13.58)

Minimum Prediction MSE Matrix (p × p):

    M[n|n−1] = A M[n−1|n−1] A^T + B Q B^T.                           (13.59)

Kalman Gain Matrix (p × M):

    K[n] = M[n|n−1] H^T[n] (C[n] + H[n] M[n|n−1] H^T[n])^(−1).       (13.60)

Correction:

    ŝ[n|n] = ŝ[n|n−1] + K[n](x[n] − H[n] ŝ[n|n−1]).                  (13.61)

Minimum MSE Matrix (p × p):

    M[n|n] = (I − K[n] H[n]) M[n|n−1].                               (13.62)

The recursion is initialized by ŝ[−1|−1] = μ_s and M[−1|−1] = C_s.

All the comments of the scalar state-scalar observation Kalman filter apply here as well, with the exception of the need for matrix inversions. We now require the inversion of an M × M matrix to find the Kalman gain. If the dimension of the state vector p is less than the dimension of the observation vector M, a more efficient implementation of the Kalman filter can be obtained. Referred to as the information form, this alternative Kalman filter is described in [Anderson and Moore 1979].

Before concluding the discussion of linear Kalman filters, it is worthwhile to note that the Kalman filter summarized in the previous theorem is still not the most general one. It is, however, adequate for many practical problems. Extensions can be made by letting the matrices A, B, and Q be time varying. Fortuitously, the equations that result are identical to those of the previous theorem when we replace A by A[n], B by B[n], and Q by Q[n]. Also, it is possible to extend the results to colored observation noise and to signal models with deterministic inputs (in addition to the driving noise). Finally, smoothing equations have also been derived based on the Kalman philosophy. These extensions are described in [Anderson and Moore 1979, Gelb 1974, Mendel 1987].

13.7 Extended Kalman Filter

In practice we are often faced with a state equation and/or an observation equation which is nonlinear. The previous approaches then are no longer valid. A simple example that will be explored in some detail in Section 13.8 is vehicle tracking. For this problem the observations or measurements are range estimates R̂[n] and bearing estimates β̂[n]. If the vehicle state is the position (r_x, r_y) in Cartesian coordinates (it is assumed to travel in the x-y plane), then the noiseless measurements are related to the unknown parameters by

    R[n] = √(r_x²[n] + r_y²[n])
    β[n] = arctan(r_y[n] / r_x[n]).                                  (13.63)

Due to measurement errors, however, we obtain the estimates R̂[n] and β̂[n], which are assumed to be the true range and bearing plus measurement noise. Hence, we have for our measurements
or

    R̂[n] = √(r_x²[n] + r_y²[n]) + w_R[n]
    β̂[n] = arctan(r_y[n] / r_x[n]) + w_β[n].

Clearly, we cannot express these in the linear model form or as

    x[n] = H[n] θ[n] + w[n]

where θ[n] = [r_x[n] r_y[n]]^T. The observation equation is nonlinear.

An example of a nonlinear state equation occurs if we assume that the vehicle is traveling in a given direction at a known fixed speed and we choose polar coordinates, range and bearing, to describe the state. (Note that this choice would render the measurement equation linear as described by (13.63).) Then, ignoring the driving noise, the state equation becomes

    r_x[n] = v_x nΔ + r_x[0]
    r_y[n] = v_y nΔ + r_y[0]                                         (13.64)

where (v_x, v_y) is the known velocity, Δ is the time interval between samples, and (r_x[0], r_y[0]) is the initial position. This can be expressed alternatively as

    r_x[n] = r_x[n−1] + v_x Δ
    r_y[n] = r_y[n−1] + v_y Δ

so that the range evolves as

    R[n] = √(r_x²[n−1] + r_y²[n−1] + 2 v_x Δ r_x[n−1] + 2 v_y Δ r_y[n−1] + (v_x² + v_y²)Δ²)
         = √(R²[n−1] + 2 R[n−1] Δ (v_x cos β[n−1] + v_y sin β[n−1]) + (v_x² + v_y²)Δ²).

This is clearly very nonlinear in range and bearing. In general, we may be faced with sequential state estimation where the state and/or observation equations are nonlinear. Then, instead of our linear Kalman filter models

    s[n] = A s[n−1] + B u[n]
    x[n] = H[n] s[n] + w[n]

we would have

    s[n] = a(s[n−1]) + B u[n]                                        (13.65)
    x[n] = h(s[n]) + w[n]                                            (13.66)

where a is a p-dimensional function and h is an M-dimensional function. The dimensions of the remaining matrices and vectors are the same as before. Now a(s[n−1]) represents the true physical model for the evolution of the state, while u[n] accounts for the modeling errors, unforeseen inputs, etc. Likewise, h(s[n]) represents the transformation from the state variables to the ideal observations (without noise). For this case the MMSE estimator is intractable. The only hope is an approximate solution based on linearizing a and h, much the same as was done for nonlinear LS, where the data were nonlinearly related to the unknown parameters. The result of this linearization and the subsequent application of the linear Kalman filter of (13.58)-(13.62) results in the extended Kalman filter. It has no optimality properties, and its performance will depend on the accuracy of the linearization. Being a dynamic linearization, there is no way to determine its performance beforehand.

Proceeding with the derivation, we linearize a(s[n−1]) about the estimate of s[n−1] or about ŝ[n−1|n−1]. Likewise, we linearize h(s[n]) about the estimate of s[n] based on the previous data or ŝ[n|n−1] since from (13.61) we will need the linearized observation equation to determine ŝ[n|n]. Hence, a first-order Taylor expansion yields

    a(s[n−1]) ≈ a(ŝ[n−1|n−1]) + ∂a/∂s[n−1] |_{s[n−1]=ŝ[n−1|n−1]} (s[n−1] − ŝ[n−1|n−1])
    h(s[n]) ≈ h(ŝ[n|n−1]) + ∂h/∂s[n] |_{s[n]=ŝ[n|n−1]} (s[n] − ŝ[n|n−1]).

We let the Jacobians be denoted by

    A[n−1] = ∂a/∂s[n−1] |_{s[n−1]=ŝ[n−1|n−1]}
    H[n] = ∂h/∂s[n] |_{s[n]=ŝ[n|n−1]}

so that the linearized state and observation equations become from (13.65) and (13.66)

    s[n] = A[n−1] s[n−1] + B u[n] + (a(ŝ[n−1|n−1]) − A[n−1] ŝ[n−1|n−1])
    x[n] = H[n] s[n] + w[n] + (h(ŝ[n|n−1]) − H[n] ŝ[n|n−1]).

The equations differ from our standard ones in that A is now time varying and both equations have known terms added to them. It is shown in Appendix 13B that the linear Kalman filter for this model, which is the extended Kalman filter, is

Prediction:

    ŝ[n|n−1] = a(ŝ[n−1|n−1]).                                        (13.67)

Minimum Prediction MSE Matrix (p × p):

    M[n|n−1] = A[n−1] M[n−1|n−1] A^T[n−1] + B Q B^T.                 (13.68)

Kalman Gain Matrix (p × M):

    K[n] = M[n|n−1] H^T[n] (C[n] + H[n] M[n|n−1] H^T[n])^(−1).       (13.69)
Correction:

    ŝ[n|n] = ŝ[n|n−1] + K[n](x[n] − h(ŝ[n|n−1])).                    (13.70)

Minimum MSE Matrix (p × p):

    M[n|n] = (I − K[n] H[n]) M[n|n−1]                                (13.71)

where

    A[n−1] = ∂a/∂s[n−1] |_{s[n−1]=ŝ[n−1|n−1]}
    H[n] = ∂h/∂s[n] |_{s[n]=ŝ[n|n−1]}.

Note that in contrast to the linear Kalman filter the gain and MSE matrices must be computed on-line, as they depend upon the state estimates via A[n−1] and H[n]. Also, the use of the term MSE matrix is itself a misnomer since the MMSE estimator has not been implemented but only an approximation to it. In the next section we will apply the extended Kalman filter to vehicle tracking.
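One recursion of (13.67)-(13.71) can be sketched as below. The argument names (a_fn, a_jac, h_fn, h_jac) are illustrative labels for the user-supplied nonlinear functions and their Jacobians, which are re-evaluated at the current estimates (the dynamic linearization).

```python
import numpy as np

# Sketch of one extended Kalman filter recursion (13.67)-(13.71).
def ekf_step(x_n, s_est, M, a_fn, a_jac, h_fn, h_jac, B, Q, C_n):
    A = a_jac(s_est)                      # A[n-1] evaluated at s^[n-1|n-1]
    s_pred = a_fn(s_est)                  # (13.67) prediction
    M_pred = A @ M @ A.T + B @ Q @ B.T    # (13.68) prediction MSE matrix
    H = h_jac(s_pred)                     # H[n] evaluated at s^[n|n-1]
    S = C_n + H @ M_pred @ H.T            # M x M innovation covariance
    K = M_pred @ H.T @ np.linalg.inv(S)   # (13.69) gain matrix
    s_est = s_pred + K @ (x_n - h_fn(s_pred))      # (13.70) correction
    M = (np.eye(len(s_est)) - K @ H) @ M_pred      # (13.71) MSE matrix
    return s_est, M
```

With linear a_fn and h_fn (and their exact Jacobians) the recursion coincides with the linear Kalman filter of (13.58)-(13.62).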
13.8 Signal Processing Examples

Example 13.3 - Time Varying Channel Estimation

Many transmission channels can be characterized as being linear but not time invariant. These are referred to by various names such as fading dispersive channels or fading multipath channels. They arise in communication problems in which the troposphere is used as a medium or in sonar in which the ocean is used [Kennedy 1969]. In either case, the medium acts as a linear filter, causing an impulse at the input to appear as a continuous waveform at the output (the dispersive or multipath nature), as shown in Figure 13.14a. This effect is the result of a continuum of propagation paths, i.e., multipath, each of which delays and attenuates the input signal. Additionally, however, a sinusoid at the input will appear as a narrowband signal at the output or one whose amplitude is modulated (the fading nature), as shown in Figure 13.14b. This effect is due to the changing character of the medium, for example, the movement of the scatterers. A little thought will convince the reader that the channel is acting as a linear time varying filter. If we sample the output of the channel, then it can be shown that a good model is the low-pass tapped delay line model as shown in Figure 13.15 [Van Trees 1971]. The input-output description of this system is

    y[n] = Σ_{k=0}^{p−1} h_n[k] v[n−k].

[Figure 13.14 Input-output waveforms for fading and multipath channels]
[Figure 13.15 Tapped delay line channel model]

This is really nothing more than an FIR filter with time-varying coefficients. To design effective communication or sonar systems it is necessary to have knowledge of these coefficients. Hence, the problem becomes one of estimating h_n[k] based on the noise corrupted output of the channel

    x[n] = Σ_{k=0}^{p−1} h_n[k] v[n−k] + w[n]

where w[n] is assumed to be WGN with variance σ² and the v[n] sequence is assumed known (since we provide the input to the channel). We can now form the MMSE estimator for the tapped delay line weights recursively in time using the Kalman filter equations for a vector state and scalar observations. With obvious changes in notation we have from (13.50)-(13.54)
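A sketch of this setup follows. The state is the p × 1 tap vector, and each scalar observation is the inner product of the known input samples with the current taps, so the gain needs only a scalar division. The tap dynamics (A, Q), noise variance, and data below are assumed for illustration only; they are not values from the text.

```python
import numpy as np

# Sketch: track the tap vector [h_n[0], ..., h_n[p-1]]^T from the scalar
# observations x[n] = v_row^T h + w[n], v_row = [v[n], ..., v[n-p+1]]^T.
p = 3
A = 0.99 * np.eye(p)                 # assumed slowly varying tap dynamics
Q = 1e-3 * np.eye(p)
var_w = 0.1

v = np.array([1.0, -1.0, 0.5, 2.0, 0.0, 1.0])   # known channel input
h_est = np.zeros(p)                  # tap estimate
M = np.eye(p)
for n in range(p - 1, len(v)):
    v_row = v[n - p + 1:n + 1][::-1]             # [v[n], ..., v[n-p+1]]
    x_n = 0.7 * v[n] + 0.2 * v[n - 1]            # stand-in for measured output
    h_pred = A @ h_est                           # prediction
    M_pred = A @ M @ A.T + Q
    K = M_pred @ v_row / (var_w + v_row @ M_pred @ v_row)  # scalar denominator
    h_est = h_pred + K * (x_n - v_row @ h_pred)  # correction
    M = (np.eye(p) - np.outer(K, v_row)) @ M_pred
```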
In this example we use an extended Kalman filter to track the position and velocity of a vehicle moving in a nominal given direction and at a nominal speed. The measurements are noisy versions of the range and bearing. Such a track is shown in Figure 13.21. In arriving at a model for the dynamics of the vehicle we assume a constant velocity, perturbed only by wind gusts, slight speed corrections, etc., as might occur in an aircraft. We model these perturbations as noise inputs, so that the velocity components in the x and y directions at time n are

    v_x[n] = v_x[n−1] + u_x[n]
    v_y[n] = v_y[n−1] + u_y[n].                                      (13.74)
[Figure 13.18 Kalman filter estimate]
[Figure 13.19 Kalman filter gains]
Without the noise perturbations u_x[n], u_y[n] the velocities would be constant, and hence
the vehicle would be modeled as traveling in a straight line as indicated by the dashed
line in Figure 13.21. From the equations of motion the position at time n is

r_x[n] = r_x[n-1] + v_x[n-1]Δ
r_y[n] = r_y[n-1] + v_y[n-1]Δ   (13.75)

where Δ is the time interval between samples. In this discretized model of the equations
of motion the vehicle is modeled as moving at the velocity of the previous time instant
and then changing abruptly at the next time instant, an approximation to the true
continuous behavior. Now, we choose the signal vector as consisting of the position and
velocity components or

s[n] = [r_x[n] r_y[n] v_x[n] v_y[n]]ᵀ

and from (13.74) and (13.75) it is seen to satisfy

[r_x[n]]   [1 0 Δ 0] [r_x[n-1]]   [   0   ]
[r_y[n]] = [0 1 0 Δ] [r_y[n-1]] + [   0   ]   (13.76)
[v_x[n]]   [0 0 1 0] [v_x[n-1]]   [u_x[n] ]
[v_y[n]]   [0 0 0 1] [v_y[n-1]]   [u_y[n] ]

or s[n] = A s[n-1] + u[n].
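The dynamics of (13.76) are straightforward to simulate. A minimal Python/NumPy sketch (the driving-noise variance σ_u² = 0.0001 and Δ = 1 are the values used later in this example; the initial state is the straight-line one):

```python
import numpy as np

rng = np.random.default_rng(0)

delta = 1.0    # time between samples (Delta)
var_u = 1e-4   # driving noise variance (sigma_u^2)
N = 100

# State transition matrix A of (13.76); state is s[n] = [rx, ry, vx, vy]^T
A = np.array([[1.0, 0.0, delta, 0.0],
              [0.0, 1.0, 0.0, delta],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

s = np.array([10.2, -5.2, -0.2, 0.2])   # initial state s[-1]
track = np.empty((N, 4))
for n in range(N):
    # Only the velocity components are driven by noise, as in (13.76).
    u = np.array([0.0, 0.0, *rng.normal(0.0, np.sqrt(var_u), 2)])
    s = A @ s + u
    track[n] = s
print(track[-1, :2])   # final position (rx, ry)
```

With var_u set to zero the recursion reproduces the straight-line track exactly, which is a quick way to check the transition matrix.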
[Figure 13.20 Kalman filter minimum MSE]
[Figure 13.21 Typical track of vehicle moving in given direction at constant speed]

The measurements are noisy observations of the range and bearing

R[n] = sqrt(r_x²[n] + r_y²[n])
β[n] = arctan(r_y[n]/r_x[n])

or

x[n] = [R̂[n]]   [R[n] + w_R[n]]
       [β̂[n]] = [β[n] + w_β[n]].   (13.77)

Unfortunately, the measurement vector is nonlinear in the signal parameters. To esti-
mate the signal vector we will need to apply an extended Kalman filter (see (13.67)-
(13.71)). Since the state equation of (13.76) is linear, we need only determine

H[n] = ∂h/∂s[n] evaluated at s[n] = ŝ[n|n-1]

because A[n] is just A as given in (13.76). Differentiating the observation equation, we
have the Jacobian

H[n] = [ r_x[n]/R[n]     r_y[n]/R[n]    0  0 ]
       [ -r_y[n]/R²[n]   r_x[n]/R²[n]   0  0 ].

Finally, we need to specify the covariances of the driving noise and observation noise.
If we assume that the wind gusts, speed corrections, etc., are just as likely to occur
in any direction and with the same magnitude, then it seems reasonable to assign the
same variances to u_x[n] and u_y[n] and to assume that they are independent. Call the
common variance σ_u². Then, we have

Q = [0 0 0     0   ]
    [0 0 0     0   ]
    [0 0 σ_u²  0   ]
    [0 0 0     σ_u²].

The exact value to use for σ_u² should depend on the possible change in the velocity
component from sample to sample since u_x[n] = v_x[n] - v_x[n-1]. This is just the
acceleration times Δ and should be derivable from the physics of the vehicle. In speci-
fying the variances of the measurement noise we note that the measurement error can
be thought of as the estimation error of R[n] and β[n] as seen from (13.77). We usu-
ally assume the estimation errors w_R[n], w_β[n] to be zero mean. Then, the variance of
w_R[n], for example, is E(w_R²[n]) = E[(R̂[n] - R[n])²]. This variance is sometimes deriv-
able but in most instances is not. One possibility is to assume that E(w_R²[n]) does not
depend on the PDF of R[n], so that E(w_R²[n]) = E[(R̂[n] - R[n])² | R[n]]. Equivalently,
we could regard R[n] as a deterministic parameter so that the variance of w_R[n] is just
the classical estimator variance. As such, if R̂[n] were the MLE, then assuming long
data records and/or high SNRs, we could assume that the variance attains the CRLB.
Using this approach, we could then make use of the CRLB for range and bearing such
as was derived in Examples 3.13 and 3.15 to set the variances. For simplicity we usually
assume the estimation errors to be independent and the variances to be time invariant
(although this is not always valid). Hence, we have

C[n] = C = [σ_R²  0   ]
           [0     σ_β²].

[Figure 13.22 Realization of vehicle track]

In summary, the extended Kalman filter equations for this problem are, from
(13.67)-(13.71),

ŝ[n|n-1] = Aŝ[n-1|n-1]
M[n|n-1] = AM[n-1|n-1]Aᵀ + Q
K[n] = M[n|n-1]Hᵀ[n](C + H[n]M[n|n-1]Hᵀ[n])⁻¹
ŝ[n|n] = ŝ[n|n-1] + K[n](x[n] - h(ŝ[n|n-1]))
M[n|n] = (I - K[n]H[n])M[n|n-1]

where

x[n] = [R̂[n] β̂[n]]ᵀ

h(s[n]) = [sqrt(r_x²[n] + r_y²[n])]
          [arctan(r_y[n]/r_x[n]) ]

A = [1 0 Δ 0]       Q = [0 0 0     0   ]
    [0 1 0 Δ]           [0 0 0     0   ]
    [0 0 1 0]           [0 0 σ_u²  0   ]
    [0 0 0 1]           [0 0 0     σ_u²]

H[n] = [ r_x[n]/sqrt(r_x²[n]+r_y²[n])   r_y[n]/sqrt(r_x²[n]+r_y²[n])   0  0 ]
       [ -r_y[n]/(r_x²[n]+r_y²[n])      r_x[n]/(r_x²[n]+r_y²[n])       0  0 ]

evaluated at s[n] = ŝ[n|n-1], and

C = [σ_R²  0   ]
    [0     σ_β²]

and the initial conditions are ŝ[-1|-1] = μ_s, M[-1|-1] = C_s. As an example,
consider the ideal straight line trajectory shown in Figure 13.22 as a dashed line. The
coordinates are given by

r_x[n] = 10 - 0.2n
r_y[n] = -5 + 0.2n

for n = 0, 1, ..., 100, where we have assumed Δ = 1 for convenience. From (13.75) this
trajectory assumes v_x = -0.2, v_y = 0.2. To accommodate a more realistic vehicle track
we introduce driving or plant noise, so that the vehicle state is described by (13.76)
with a driving noise variance of σ_u² = 0.0001. With an initial state of

s[-1] = [10.2  -5.2  -0.2  0.2]ᵀ   (13.78)

which is identical to that of the initial state of the straight line trajectory, a realization
of the vehicle position [r_x[n] r_y[n]]ᵀ is shown in Figure 13.22 as the solid curve. The
state equation of (13.76) has been used to generate the realization. Note that as time
increases, the true trajectory gradually deviates from the straight line. It can be shown
that the variances of v_x[n] and v_y[n] will eventually increase to infinity (see Problem
13.22), causing r_x[n] and r_y[n] to quickly become unbounded. Thus, this modeling
is valid for only a portion of the trajectory. The true range and bearing are shown
in Figure 13.23. We assume the measurement noise variances to be σ_R² = 0.1 and
σ_β² = 0.01, where β is measured in radians. In Figure 13.24 we compare the true
trajectory with the noise corrupted one as obtained from

r̂_x[n] = R̂[n] cos β̂[n]
r̂_y[n] = R̂[n] sin β̂[n]

and so as not to "bias" the extended Kalman filter we assume a large initial MSE or
M[-1|-1] = 100I. The results of an extended Kalman filter are shown in Figure 13.25
as the solid curve. Initially, because of the poor state estimate, the error is large. This
is also reflected in the MSE curves in Figure 13.26 (actually these are only estimates
based on our linearization). However, after about 20 samples the extended Kalman
filter attains the track. Also, it is interesting to note that the minimum MSE does
not monotonically decrease as it did in the previous example. On the contrary, it
increases for part of the time. This is explained by contrasting the Kalman filter with
the sequential LMMSE estimator in Chapter 12. For the latter we estimated the same
parameter as we received more and more data. Consequently, the minimum MSE
decreased or at worst remained constant. Here, however, each time we receive a new
data sample we are estimating a new parameter. The increased uncertainty of the new
parameter due to the influence of the driving noise input may be large enough to offset
the knowledge gained by observing a new data sample, causing the minimum MSE
to increase (see Problem 13.23). As a final remark, in this simulation the extended
Kalman filter appeared to be quite tolerant of linearization errors due to a poor initial
state estimate. In general, however, we cannot expect to be so fortunate.

[Figure 13.23 Range and bearing of true vehicle track]
[Figure 13.24 Noise corrupted and true vehicle track]
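The whole example — track generation, noisy range/bearing data, and the extended Kalman filter recursion — fits in a short script. The following is a sketch under the example's values, with a deliberately poor initial state estimate (plotting is omitted, and the exact realizations of course differ from the book's):

```python
import numpy as np

delta, var_u = 1.0, 1e-4        # sample interval and driving-noise variance
var_R, var_beta = 0.1, 0.01     # measurement noise variances

A = np.array([[1, 0, delta, 0],
              [0, 1, 0, delta],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
Q = np.diag([0.0, 0.0, var_u, var_u])
C = np.diag([var_R, var_beta])

def h(s):                        # range and bearing, (13.77)
    return np.array([np.hypot(s[0], s[1]), np.arctan2(s[1], s[0])])

def jac(s):                      # Jacobian of h (the H[n] of the text)
    R = np.hypot(s[0], s[1])
    return np.array([[s[0] / R, s[1] / R, 0, 0],
                     [-s[1] / R**2, s[0] / R**2, 0, 0]])

rng = np.random.default_rng(1)
s_true = np.array([10.2, -5.2, -0.2, 0.2])
est = np.array([8.0, -8.0, 0.0, 0.0])   # deliberately poor initial estimate
M = 100.0 * np.eye(4)                   # large initial MSE, M[-1|-1] = 100 I
err, mse_trace = [], []
for n in range(100):
    s_true = A @ s_true + np.concatenate(([0.0, 0.0],
                                          rng.normal(0, np.sqrt(var_u), 2)))
    x = h(s_true) + rng.normal(0, np.sqrt([var_R, var_beta]))
    pred = A @ est                      # s[n|n-1]
    Mp = A @ M @ A.T + Q                # M[n|n-1]
    H = jac(pred)                       # linearize about the prediction
    K = Mp @ H.T @ np.linalg.inv(C + H @ Mp @ H.T)
    est = pred + K @ (x - h(pred))      # s[n|n]
    M = (np.eye(4) - K @ H) @ Mp        # M[n|n]
    err.append(float(np.hypot(*(est[:2] - s_true[:2]))))
    mse_trace.append(M[0, 0])           # need not decrease monotonically
print(err[0], err[-1])
```

Tracking mse_trace over the run reproduces the qualitative behavior discussed above: the minimum MSE estimate can rise when the driving-noise uncertainty outweighs the information in a new sample.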
References

Anderson, B.D.O., J.B. Moore, Optimal Filtering, Prentice-Hall, Englewood Cliffs, N.J., 1979.
Chen, C.T., Introduction to Linear System Theory, Holt, Rinehart, and Winston, New York, 1970.
Gelb, A., Applied Optimal Estimation, M.I.T. Press, Cambridge, Mass., 1974.
Jazwinski, A.H., Stochastic Processes and Filtering Theory, Academic Press, New York, 1970.
Kennedy, R.S., Fading Dispersive Communication Channels, J. Wiley, New York, 1969.
Mendel, J.M., Lessons in Digital Estimation Theory, Prentice-Hall, Englewood Cliffs, N.J., 1987.
Van Trees, H.L., Detection, Estimation, and Modulation Theory, Part III, J. Wiley, New York, 1971.
Problems
13.1 A random process is Gaussian if for arbitrary samples {s[n₁], s[n₂], ..., s[n_k]} and
for any k the random vector s = [s[n₁] s[n₂] ... s[n_k]]ᵀ is distributed according to
a multivariate Gaussian PDF. If s[n] is given by (13.2), prove that it is a Gaussian
random process.

13.2 Consider a scalar Gauss-Markov process. Show that if μ_s = 0 and σ_s² = σ_u²/(1 -
a²), then the process will be WSS for n ≥ 0 and explain why this is so.
13.3 Plot the mean, variance, and steady-state covariance of a scalar Gauss-Markov
process if a = 0.98, σ_u² = 0.1, μ_s = 5, and σ_s² = 1. What is the PSD of the
steady-state process?
13.4 For a scalar Gauss-Markov process derive a covariance propagation equation, i.e.,
a formula relating c_s[m, n] to c_s[n, n] for m ≥ n. To do so first show that

c_s[m, n] = a^(m-n) var(s[n])

for m ≥ n.
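The claimed propagation formula is easy to check by Monte Carlo simulation; the parameter values in this sketch are arbitrary:

```python
import numpy as np

# Monte Carlo check of c_s[m, n] = a^(m-n) var(s[n]) for a scalar
# Gauss-Markov process s[n] = a s[n-1] + u[n] (illustrative values).
rng = np.random.default_rng(0)
a, var_u, var_s = 0.8, 1.0, 2.0
trials, n, m = 200_000, 3, 7
s = rng.normal(0.0, np.sqrt(var_s), trials)   # realizations of s[-1]
samples = []
for k in range(m + 1):
    s = a * s + rng.normal(0.0, np.sqrt(var_u), trials)
    samples.append(s.copy())
cov_mn = np.mean(samples[m] * samples[n])      # zero mean, so this is c_s[m, n]
pred = a ** (m - n) * np.var(samples[n])
print(cov_mn, pred)
```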
13.5 Verify (13.17) and (13.18) for the covariance matrix propagation of a vector Gauss-
Markov process.
13.6 Show that the mean of a vector Gauss-Markov process in general will grow with
n if any eigenvalue of A is greater than 1 in magnitude. What happens to the
steady-state mean if all the eigenvalues are less than 1 in magnitude? To simplify
matters assume that A is symmetric, so that it can be written as A = Σ_{i=1}^{p} λᵢvᵢvᵢᵀ,
where vᵢ is the ith eigenvector of A and λᵢ is the corresponding real eigenvalue.
13.8 Consider the recursive difference equation (for q < p)

r[n] = -Σ_{k=1}^{p} a[k]r[n-k] + u[n] + Σ_{k=1}^{q} b[k]u[n-k]

where u[n] is WGN with variance σ_u². (In steady-state this would be an ARMA
process.) Let the state vector be defined as

s[n-1] = [r[n-p] r[n-p+1] ... r[n-1]]ᵀ

where s[-1] ~ N(μ_s, C_s) and is independent of u[n]. Also, define the vector
driving noise sequence as

u[n] = [u[n-q] u[n-q+1] ... u[n]]ᵀ.

Rewrite the process as in (13.11). Explain why this is not a vector Gauss-Markov
model by examining the assumptions on u[n].

13.9 For Problem 13.8 show that the process may alternatively be expressed by the
difference equations

s[n] = -Σ_{k=1}^{p} a[k]s[n-k] + u[n]
r[n] = s[n] + Σ_{k=1}^{q} b[k]s[n-k]

for n ≥ 0. Assume that s[-1] = [s[-p] s[-p+1] ... s[-1]]ᵀ ~ N(μ_s, C_s) and
that we observe x[n] = r[n] + w[n], where w[n] is WGN with variance σ_w², and
s[-1], u[n], and w[n] are independent. Show how to set up the vector state-scalar
observation Kalman filter.

13.10 Assume that we observe x[n] = A + w[n] for n = 0, 1, ..., where A is the realiza-
tion of a random variable with PDF N(0, σ_A²) and w[n] is WGN with variance σ².
Using the scalar state-scalar observation Kalman filter find a sequential estimator
of A based on {x[0], x[1], ..., x[n]} or Â[n]. Solve explicitly for Â[n], the Kalman
gain, and the minimum MSE.

13.11 In this problem we implement a scalar state-scalar observation Kalman filter
(see (13.38)-(13.42)). A computer solution is advised. If a = 0.9, σ_u² = 1, μ_s = 0,
σ_s² = 1, find the Kalman gain and minimum MSE if

a. σ_n² = (0.9)ⁿ
b. σ_n² = 1
c. σ_n² = (1.1)ⁿ.

Explain your results. Using a Monte Carlo computer simulation generate a real-
ization of the signal and noise and apply your Kalman filter to estimation of the
signal for all three cases. Plot the signal as well as the Kalman filter estimate.

13.12 For the scalar state-scalar observation Kalman filter assume that σ_n² = 0 for all
n so that we observe s[n] directly. Find the innovation sequence. Is it white?

13.13 In this problem we show that the same set of equations result for the scalar
state-scalar observation Kalman filter even if E(s[-1]) ≠ 0. To do so let s'[n] =
s[n] - E(s[n]) and x'[n] = x[n] - E(x[n]), so that equations (13.38)-(13.42) apply
for s'[n]. Now determine the corresponding equations for ŝ[n]. Recall that the
MMSE estimator of θ + c for c a constant is θ̂ + c, where θ̂ is the MMSE estimator of θ.

13.14 Prove that for the scalar state-scalar observation Kalman filter

M[n|n-1] > M[n-1|n-1]

for large enough n or in steady-state. Why is this reasonable?

13.15 Prove that the optimal 1-step predictor for a scalar Gauss-Markov process s[n]
is

ŝ[n+1|n] = a ŝ[n|n]

where ŝ[n|n] and ŝ[n+1|n] are based on {x[0], x[1], ..., x[n]}.

13.16 Find the transfer function of the steady-state, scalar state-scalar observation
Kalman filter if a = 0.8, σ_u² = 1, and σ² = 1. Give its time domain form as a
recursive difference equation.

13.17 For the scalar state-scalar observation Kalman filter let a = 0.9, σ_u² = 1, σ² = 1
and find the steady-state gain and minimum MSE by running the filter until
convergence, i.e., compute equations (13.39), (13.40), and (13.42). Compare your
results to those given in Section 13.5.

13.18 Assume we observe the data

x[k] = Arᵏ + w[k]

for k = 0, 1, ..., n, where A is the realization of a random variable with PDF
N(μ_A, σ_A²), 0 < r < 1, and the w[k]'s are samples of WGN with variance σ².
Furthermore, assume that A is independent of the w[k]'s. Find the sequential
MMSE estimator of A based on {x[0], x[1], ..., x[n]}.

13.19 Consider the vector state-vector observation Kalman filter for which H[n] is
assumed to be invertible. If a particular observation is noiseless, so that C[n] = 0,
find ŝ[n|n] and explain your results. What happens if C[n] → ∞?
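Problem 13.11 advises a computer solution. One possible sketch of the scalar recursions (13.38)-(13.42), here run on case b (the variable names and simulated data are my own):

```python
import numpy as np

def scalar_kalman(x, a, var_u, mu_s, var_s, var_n):
    """Scalar state-scalar observation Kalman filter, (13.38)-(13.42).

    x      : observations x[0..N-1]
    var_n  : array of observation noise variances sigma_n^2[n]
    Returns the estimates s[n|n] and minimum MSEs M[n|n].
    """
    est, M = mu_s, var_s             # s[-1|-1] and M[-1|-1]
    ests, mses = [], []
    for n in range(len(x)):
        pred = a * est               # (13.38) prediction
        Mp = a**2 * M + var_u        # (13.39) prediction MSE
        K = Mp / (var_n[n] + Mp)     # (13.40) Kalman gain
        est = pred + K * (x[n] - pred)   # (13.41) correction
        M = (1 - K) * Mp             # (13.42) minimum MSE
        ests.append(est)
        mses.append(M)
    return np.array(ests), np.array(mses)

# Case b: a = 0.9, sigma_u^2 = 1, sigma_n^2 = 1 for all n
rng = np.random.default_rng(0)
s, sig, xs = 0.0, [], []
for n in range(100):
    s = 0.9 * s + rng.normal()       # Gauss-Markov signal realization
    sig.append(s)
    xs.append(s + rng.normal())      # noisy observation
shat, M = scalar_kalman(np.array(xs), 0.9, 1.0, 0.0, 1.0, np.ones(100))
print(M[-1])                          # converged minimum MSE
```

Cases a and c follow by passing `0.9**np.arange(100)` or `1.1**np.arange(100)` for the noise variances; plotting the signal against `shat` completes the exercise.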
13.20 Prove that the optimal 1-step predictor for a vector Gauss-Markov process s[n]
is

ŝ[n+1|n] = A ŝ[n|n]

where ŝ[n|n] and ŝ[n+1|n] are based on {x[0], x[1], ..., x[n]}. Use the vector
state-vector observation Kalman filter.
13.21 In this problem we set up an extended Kalman filter for the frequency tracking
application. Specifically, we wish to track the frequency of a sinusoid in noise.
The frequency is assumed to follow the model

f₀[n] = a f₀[n-1] + u[n]

and the observed data are

x[n] = cos(2π f₀[n]) + w[n]

where w[n] is WGN with variance σ² and is independent of u[n] and f₀[-1]. Write
down the extended Kalman filter equations for this problem.

13.22 For the vehicle position model in Example 13.4 we had

v_x[n] = v_x[n-1] + u_x[n]

where u_x[n] is WGN with variance σ_u². Find the variance of v_x[n] to show that
it increases with n. What might be a more suitable model for v_x[n]? (See also
[Anderson and Moore 1979] for a further discussion of the modeling issue.)

13.23 For the scalar state-scalar observation Kalman filter find an expression relating
M[n|n] to M[n-1|n-1]. Now, let a = 0.9, σ_u² = 1, and σ_n² = n + 1. If
M[-1|-1] = 1, determine M[n|n] for n ≥ 0. Explain your results.

Appendix 13A

Vector Kalman Filter Derivation

We derive the vector state-vector observation Kalman filter with the vector state-
scalar observation being a special case. Theorem 13.2 contains the modeling assump-
tions. In our derivation we will assume μ_s = 0, so that all random variables are zero
mean. With this assumption the vector space viewpoint is applicable. For μ_s ≠ 0 it
can be shown that the same equations result. The reason for this is a straightforward
generalization of the comments made at the end of Appendix 12A (see also Problem
13.13).

The properties of MMSE estimators that we will use are

1. The MMSE estimator of θ based on two uncorrelated data samples x₁ and x₂,
assuming jointly Gaussian statistics (see Section 11.4), is

θ̂ = E(θ|x₁, x₂) = E(θ|x₁) + E(θ|x₂)

if θ, x₁, x₂ are zero mean.

2. The MMSE estimator is linear in that if θ = A₁θ₁ + A₂θ₂, then

θ̂ = E(θ|x) = E(A₁θ₁ + A₂θ₂|x) = A₁E(θ₁|x) + A₂E(θ₂|x) = A₁θ̂₁ + A₂θ̂₂.

The derivation follows that for the scalar state-scalar observation Kalman filter with
obvious adjustments for vector quantities. We assume μ_s = 0 so that all random vectors
are zero mean. The MMSE estimator of s[n] based on {x[0], x[1], ..., x[n]} is the mean
of the posterior PDF

ŝ[n|n] = E(s[n]|x[0], x[1], ..., x[n]).   (13A.1)

But s[n], x[0], ..., x[n] are jointly Gaussian since from (13.13) s[n] depends linearly on
{s[-1], u[0], ..., u[n]} and x[n] depends linearly on s[n], w[n]. Hence, we have a linear
dependence on the set of random vectors S = {s[-1], u[0], ..., u[n], w[0], ..., w[n]},
where each random vector is independent of the others. As a result, the vectors in
S are jointly Gaussian and any linear transformation also produces jointly Gaussian
random vectors. Now, from (10.24) with zero means we have

ŝ[n|n] = E(s[n]|X[n-1], x̃[n])

since x[n] is recoverable from X[n-1] and x̃[n] (see comments in Section 13.4). Because
X[n-1] and x̃[n] are uncorrelated, we have from property 1

ŝ[n|n] = ŝ[n|n-1] + E(s[n]|x̃[n])

where

ŝ[n|n-1] = Aŝ[n-1|n-1].

But

x[n] = H[n]s[n] + w[n]

so that from property 2 the prediction of the observation is x̂[n|n-1] = H[n]ŝ[n|n-1].

Appendix 13B

Extended Kalman Filter Derivation

To derive the extended Kalman filter equations we need to first determine the equa-
tions for a modified state model that has a known deterministic input or

s[n] = As[n-1] + Bu[n] + v[n]   (13B.1)

where v[n] is known. The presence of v[n] at the input will produce a deterministic
component in the output, so that the mean of s[n] will no longer be zero. We assume
that E(s[-1]) = 0, so that if v[n] = 0, then E(s[n]) = 0. Hence, the effect of the
deterministic input is to produce a nonzero mean signal vector

s[n] = s'[n] + E(s[n])   (13B.2)

where s'[n] is the value of s[n] when v[n] = 0. Then, we can write a state equation for
the zero mean signal vector s'[n] = s[n] - E(s[n]) as

s'[n] = As'[n-1] + Bu[n].

Note that the mean satisfies

E(s[n]) = AE(s[n-1]) + v[n]   (13B.3)

which follows from (13B.1). Likewise, the observation equation for the zero mean
process is x'[n] = x[n] - E(x[n]) = x[n] - H[n]E(s[n]).

The Kalman filter for s'[n] can be found using (13.58)-(13.62) with x[n] replaced by
x'[n]. Then, the MMSE estimator of s[n] can easily be found by the relations

ŝ'[n|n-1] = ŝ[n|n-1] - E(s[n])
ŝ'[n-1|n-1] = ŝ[n-1|n-1] - E(s[n-1]).

Using these in (13.58), we have for the prediction equation

ŝ'[n|n-1] = Aŝ'[n-1|n-1]

or

ŝ[n|n-1] - E(s[n]) = A(ŝ[n-1|n-1] - E(s[n-1]))

and from (13B.3) this reduces to

ŝ[n|n-1] = Aŝ[n-1|n-1] + v[n].

For the correction we have from (13.61)

ŝ'[n|n] = ŝ'[n|n-1] + K[n](x'[n] - H[n]ŝ'[n|n-1])

or

ŝ[n|n] - E(s[n]) = ŝ[n|n-1] - E(s[n]) + K[n][x[n] - H[n]E(s[n]) - H[n](ŝ[n|n-1] - E(s[n]))].

This reduces to the usual equation

ŝ[n|n] = ŝ[n|n-1] + K[n](x[n] - H[n]ŝ[n|n-1]).

The Kalman gain as well as the MSE matrices remain the same since the MMSE
estimator will not incur any additional error due to known constants. Hence, the only
revision is in the prediction equation.

Returning to the extended Kalman filter, we also have the modified observation
equation

x[n] = H[n]s[n] + w[n] + z[n]

where z[n] is known. With x'[n] = x[n] - z[n] we have the usual observation equation.
Hence, our two equations become, upon replacing A by A[n],

ŝ[n|n-1] = A[n-1]ŝ[n-1|n-1] + v[n]
ŝ[n|n] = ŝ[n|n-1] + K[n](x[n] - z[n] - H[n]ŝ[n|n-1]).
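The conclusion that a known input changes only the prediction equation can be checked numerically. The sketch below (scalar stand-in for the vector result; all values are illustrative) runs the filter two ways: (i) subtracting the known mean, filtering the zero-mean process, and adding the mean back, and (ii) using the modified prediction directly. The two agree to machine precision:

```python
import numpy as np

rng = np.random.default_rng(2)
a, var_u, var_w = 0.9, 0.5, 1.0
v = 0.3 * np.cos(0.1 * np.arange(50))   # known deterministic input v[n]

# Simulate s[n] = a s[n-1] + u[n] + v[n], x[n] = s[n] + w[n], E(s[-1]) = 0.
s, xs, mean = 0.0, [], [0.0]
for n in range(50):
    s = a * s + rng.normal(0, np.sqrt(var_u)) + v[n]
    xs.append(s + rng.normal(0, np.sqrt(var_w)))
    mean.append(a * mean[-1] + v[n])     # E(s[n]) via (13B.3)
mean = np.array(mean[1:])

def run(xseq, vseq):
    est, M, out = 0.0, 1.0, []
    for n in range(len(xseq)):
        pred = a * est + vseq[n]         # modified prediction (Appendix 13B)
        Mp = a**2 * M + var_u
        K = Mp / (var_w + Mp)            # gain and MSE are unchanged
        est = pred + K * (xseq[n] - pred)
        M = (1 - K) * Mp
        out.append(est)
    return np.array(out)

direct = run(np.array(xs), v)                              # method (ii)
shifted = run(np.array(xs) - mean, np.zeros(50)) + mean    # method (i)
print(np.max(np.abs(direct - shifted)))
```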
Chapter 14

Summary of Estimators
14.1 Introduction
The choice of an estimator that will perform well for a particular application depends
upon many considerations. Of primary concern is the selection of a good data model.
It should be complex enough to describe the principal features of the data, but at the
same time simple enough to allow an estimator that is optimal and easily implemented.
We have seen that at times we were unable to determine the existence of an optimal
estimator, an example being the search for the MVU estimator in classical estimation.
In other instances, even though the optimal estimator could easily be found, it could
not be implemented, an example being the MMSE estimator in Bayesian estimation.
For a particular problem we are neither assured of finding an optimal estimator nor,
even if we are fortunate enough to do so, of being able to implement it. Therefore, it
becomes critical to have at one's disposal a knowledge of the estimators that are opti-
mal and easily implemented, and furthermore, to understand under what conditions we
may justify their use. To this end we now summarize the approaches, assumptions, and
for the linear data model, the explicit estimators obtained. Then, we will illustrate the
decision making process that one must go through in order to choose a good estima-
tor. Also, in our discussions we will highlight some relationships between the various
estimators.
479
summarized by the joint PDF p(x, θ) or, equivalently, by the conditional PDF p(x|θ)
(data information) and the prior PDF p(θ) (prior information).

Classical Estimation Approaches

1. Cramer-Rao Lower Bound (CRLB)

a. Data Model/Assumptions
PDF p(x; θ) is known.

b. Estimator
If the equality condition for the CRLB

∂ ln p(x; θ)/∂θ = I(θ)(g(x) - θ)

is satisfied, then the estimator is

θ̂ = g(x)

where I(θ) is a p × p matrix dependent only on θ and g(x) is a p-dimensional
function of the data x.

c. Optimality/Error Criterion
θ̂ achieves the CRLB, the lower bound on the variance for any unbiased
estimator (and hence is said to be efficient), and is therefore the minimum
variance unbiased (MVU) estimator. The MVU estimator is the one whose
variance for each component is minimum among all unbiased estimators.

d. Performance
It is unbiased or

E(θ̂ᵢ) = θᵢ   i = 1, 2, ..., p

and has minimum variance

var(θ̂ᵢ) = [I⁻¹(θ)]ᵢᵢ

where

[I(θ)]ᵢⱼ = E[(∂ ln p(x; θ)/∂θᵢ)(∂ ln p(x; θ)/∂θⱼ)].

e. Comments
An efficient estimator may not exist, and hence this approach may fail.

f. Reference
Chapter 3

2. Rao-Blackwell-Lehmann-Scheffe

a. Data Model/Assumptions
PDF p(x; θ) is known.

b. Estimator
i. Find a sufficient statistic T(x) by factoring the PDF as

p(x; θ) = g(T(x), θ)h(x)

where T(x) is a p-dimensional function of x, g is a function depending
only on T and θ, and h depends only on x.
ii. If E[T(x)] = θ, then θ̂ = T(x). If not, we must find a p-dimensional
function g so that E[g(T)] = θ, and then θ̂ = g(T).

c. Optimality/Error Criterion
θ̂ is the MVU estimator.

d. Performance
θ̂ᵢ for i = 1, 2, ..., p is unbiased. The variance depends on the PDF; no
general formula is available.

e. Comments
Also, "completeness" of the sufficient statistic must be checked. A p-dimensional
sufficient statistic may not exist, so that this method may fail.

f. Reference
Chapter 5

3. Best Linear Unbiased Estimator (BLUE)

a. Data Model/Assumptions

E(x) = Hθ

where H is an N × p (N > p) known matrix and C, the covariance matrix of
x, is known. Equivalently, we have

x = Hθ + w

where E(w) = 0 and C_w = C.

b. Estimator

θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x.

c. Optimality/Error Criterion
θ̂ᵢ for i = 1, 2, ..., p has the minimum variance of all unbiased estimators
that are linear in x.

d. Performance
θ̂ᵢ for i = 1, 2, ..., p is unbiased. The variance is

var(θ̂ᵢ) = [(HᵀC⁻¹H)⁻¹]ᵢᵢ   i = 1, 2, ..., p.

4. Maximum Likelihood Estimator (MLE)

e. Comments
If an MVU estimator exists, the maximum likelihood procedure will produce
it.

f. Reference
Chapter 7

6. Method of Moments

b. Estimator
θ̂ = h⁻¹(μ̂), where

μ̂ = [ (1/N)Σ_{n=0}^{N-1} x[n]  ]
     [ (1/N)Σ_{n=0}^{N-1} x²[n] ]
     [            ⋮             ]
     [ (1/N)Σ_{n=0}^{N-1} xᵖ[n] ].

7. Minimum Mean Square Error Estimator (MMSE)

e. Comments
In the non-Gaussian case, this will be difficult to implement.

f. Reference
Chapters 10 and 11

9. Linear Minimum Mean Square Error Estimator (LMMSE)

e. Comments
If x and θ are jointly Gaussian, this is identical to the MMSE and MAP estima-
tors.

f. Reference
Chapter 12

The reader may observe that we have omitted a summary of the Kalman filter. This is
because it is a particular implementation of the MMSE estimator and so is contained
within that discussion.
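For the general linear model the BLUE/MVU formula summarized above is a one-liner in practice; a small numerical sketch (the line-fit data here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
n = np.arange(N)
H = np.column_stack([np.ones(N), n])        # model x[n] = A + B n + w[n]
theta_true = np.array([1.0, 0.5])
C = np.diag(0.1 * (1.0 + 0.05 * n))         # known (here diagonal) noise covariance
w = rng.multivariate_normal(np.zeros(N), C)
x = H @ theta_true + w

Ci = np.linalg.inv(C)
# theta_hat = (H^T C^-1 H)^-1 H^T C^-1 x, via a linear solve for stability
theta_hat = np.linalg.solve(H.T @ Ci @ H, H.T @ Ci @ x)
cov_hat = np.linalg.inv(H.T @ Ci @ H)       # covariance of the estimator
print(theta_hat, np.sqrt(np.diag(cov_hat)))
```

Using `np.linalg.solve` rather than explicitly inverting HᵀC⁻¹H is the usual numerically preferable choice.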
TABLE 14.1 Properties of θ̂ for Classical General Linear Model

Model:       x = Hθ + w
Assumptions: E(w) = 0, C_w = C
Estimator:   θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x

Property                 w Gaussian (linear model)    w Non-Gaussian
Efficient                *
Sufficient Statistic     *
MVU                      *
BLUE                     *                            *
MLE                      *
WLS (W = C⁻¹)            *                            *

* Property holds.

For the Bayesian linear model the data are x = Hθ + w and the prior PDF of θ is

θ ~ N(μ_θ, C_θ).

The posterior PDF p(θ|x) is again Gaussian with mean and covariance

E(θ|x) = μ_θ + C_θHᵀ(HC_θHᵀ + C_w)⁻¹(x - Hμ_θ)   (14.6)
       = μ_θ + (C_θ⁻¹ + HᵀC_w⁻¹H)⁻¹HᵀC_w⁻¹(x - Hμ_θ)   (14.7)
C_θ|x = C_θ - C_θHᵀ(HC_θHᵀ + C_w)⁻¹HC_θ   (14.8)
      = (C_θ⁻¹ + HᵀC_w⁻¹H)⁻¹.   (14.9)

7. Minimum Mean Square Error Estimator
The MMSE estimator is just the mean of the posterior PDF given by (14.6) or
(14.7). The minimum Bayesian MSE, or E((θᵢ - θ̂ᵢ)²), is from (14.3)

Bmse(θ̂ᵢ) = [C_θ|x]ᵢᵢ

since C_θ|x does not depend on x (see (14.8)).

8. Maximum A Posteriori Estimator
Because the location of the peak (or mode) of the Gaussian PDF is equal to the
mean, the MAP estimator is identical to the MMSE estimator.

9. Linear Minimum Mean Square Error Estimator
Since the MMSE estimator is linear in x, the LMMSE estimator is just given by
(14.6) or (14.7).

Hence, for the Bayesian linear model the MMSE estimator, MAP estimator, and the
LMMSE estimator are identical. A last comment concerns the form of the estimator
when there is no prior information. This may be modeled by letting C_θ⁻¹ → 0. Then,
from (14.7) we have

θ̂ = (HᵀC_w⁻¹H)⁻¹HᵀC_w⁻¹x

which is recognized as having the identical form as the MVU estimator for the classical
general linear model. Of course, the estimators cannot really be compared since they
have been derived under different data modeling assumptions (see Problem 11.7). How-
ever, this apparent equivalence has often been identified by asserting that the Bayesian
approach with no prior information is equivalent to the classical approach. When viewed
in its proper statistical context, this assertion is incorrect.

14.4 Choosing an Estimator

We now illustrate the decision making process involved in choosing an estimator. In
doing so our goal is always to find the optimal estimator for a given data model. If this
is not possible, we consider suboptimal estimation approaches. We will consider our
old friend the data set

x[n] = A[n] + w[n]   n = 0, 1, ..., N-1

where the unknown parameters are {A[0], A[1], ..., A[N-1]}. We have allowed the
parameter A to change with time, as most parameters normally change to some extent
in real world problems. Depending on our assumptions on A[n] and w[n], the data may
have the form of the classical or Bayesian linear model. If this is the case, the optimal
estimator is easily found as explained previously. However, even if the estimator is
optimal for the assumed data model, its performance may not be adequate. Thus, the
data model may need to be modified, as we now discuss. We will refer to the flowchart
in Figure 14.1 as we describe the considerations in the selection of an estimator. Since
we are attempting to estimate as many parameters as data points, we can expect poor
estimation performance due to a lack of averaging (see Figure 14.1a). With prior
knowledge such as the PDF of θ = [A[0] A[1] ... A[N-1]]ᵀ, we could use a Bayesian
approach as detailed in Figure 14.1b. Based on the PDF p(x, θ) we can in theory find
the MMSE estimator. This will involve a multidimensional integration and may not in
practice be possible. Failing to determine the MMSE estimator we could attempt to
maximize the posterior PDF to produce the MAP estimator, either analytically or at
least numerically. As a last resort, if the first two joint moments of x and θ are available,
we could determine the LMMSE estimator in explicit form. Even if dimensionality is
not a problem, as for example if A[n] = A, the use of prior knowledge as embodied
by the prior PDF will improve the estimation accuracy in the Bayesian sense. That
is to say, the Bayesian MSE will be reduced. If no prior knowledge is available, we
will be forced to reevaluate our data model or else obtain more data. For example, we
might suppose that A[n] = A or even A[n] = A + Bn, reducing the dimensionality of
the problem. This may result in bias errors due to modeling inaccuracies, but at least
the variability of any resultant estimator would be reduced. Then, we could resort
to a classical approach (see Figure 14.1c). If the PDF is known, we first compute
the equality condition for the CRLB, and if satisfied, an efficient and hence MVU
estimator will be found. If not, we could attempt to find a sufficient statistic, make it
unbiased, and if complete, this would produce the MVU estimator. If these approaches
fail, a maximum likelihood approach could be tried if the likelihood function (PDF
with x replaced by the observed data values) can be maximized analytically or at least
numerically. Finally, the moments could be found and a method of moments estimator
tried. Note that the entire PDF need not be known for a method of moments estimator.
If the PDF is unknown but the problem is one of a signal in noise, then either a BLUE
or LS approach could be tried. If the signal is linear in θ and the first two moments of
the noise are known, a BLUE can be found. Otherwise, a LS and a possibly nonlinear
LS estimator must be employed.

[Figure 14.1 Flowchart for choosing an estimator: (a) Classical versus Bayesian; (b) Bayesian approach; (c) Classical approach]

In general, the choice of an appropriate estimator for a signal processing problem
should begin with the search for an optimal estimator that is computationally feasible.
If the search proves to be futile, then suboptimal estimators should be investigated.
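The two forms (14.6) and (14.7) of the posterior mean, and (14.8)/(14.9) of the posterior covariance, are related by the matrix inversion lemma. A quick numerical confirmation (random but reproducible values of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 8, 3
H = rng.normal(size=(N, p))
Ctheta = np.eye(p) * 2.0           # prior covariance C_theta
Cw = np.eye(N) * 0.5               # noise covariance C_w
mu = np.array([1.0, -1.0, 0.5])    # prior mean mu_theta
x = rng.normal(size=N)

# Form (14.6)/(14.8)
G = Ctheta @ H.T @ np.linalg.inv(H @ Ctheta @ H.T + Cw)
mean1 = mu + G @ (x - H @ mu)
cov1 = Ctheta - G @ H @ Ctheta

# Form (14.7)/(14.9)
cov2 = np.linalg.inv(np.linalg.inv(Ctheta) + H.T @ np.linalg.inv(Cw) @ H)
mean2 = mu + cov2 @ H.T @ np.linalg.inv(Cw) @ (x - H @ mu)

print(np.max(np.abs(mean1 - mean2)), np.max(np.abs(cov1 - cov2)))
```

Form (14.6) inverts an N × N matrix while (14.7) inverts p × p matrices, so the cheaper one depends on whether N or p is larger.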
Chapter 15

Extensions for Complex Data and Parameters
15.1 Introduction
For many signal processing applications the data samples are more conveniently modeled as being complex, typically the concatenation of two time series of real data into
a single time series of complex data. Also, the same technique can be useful in rep-
resenting a real parameter vector of dimension 2 x 1 by a single complex parameter.
In doing so it is found that the representation is considerably more intuitive, is ana-
lytically more tractable, and lends itself to easier manipulation by a digital computer.
An analogous situation arises in the more convenient use of a complex Fourier series
composed of complex exponentials for a real data signal as opposed to a real Fourier
series of sines and cosines. Furthermore, once the complex Fourier series representation
has been accepted, its extension to complex data is trivial, requiring only the Hermi-
tian symmetry property of the Fourier coefficients to be relaxed. In this chapter we
reformulate much of our previous theory to accommodate complex data and complex
parameters. In doing so we do not present any new theory but only an algebra for
manipulating complex data and parameters.
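The complex-envelope idea can be demonstrated numerically: build a real bandpass signal from a complex envelope via s(t) = 2Re[s̃(t)exp(j2πF₀t)] (equation (15.2) below), then recover s̃(t) by mixing down and lowpass filtering. A sketch; the tone frequency, sample rate, and FFT-bin lowpass are illustrative choices of mine:

```python
import numpy as np

Fs, F0, N = 1000.0, 100.0, 1000        # sample rate, carrier, record length
t = np.arange(N) / Fs
s_env = np.exp(2j * np.pi * 5.0 * t)   # narrowband complex envelope (5 Hz tone)
s = 2.0 * np.real(s_env * np.exp(2j * np.pi * F0 * t))   # real bandpass signal

# Demodulate: mix down, then ideal lowpass (FFT bin masking) to remove the
# image at -2 F0 produced by taking the real part.
z = s * np.exp(-2j * np.pi * F0 * t)
Z = np.fft.fft(z)
f = np.fft.fftfreq(N, 1.0 / Fs)
Z[np.abs(f) > 50.0] = 0.0              # keep only the baseband
s_env_hat = np.fft.ifft(Z)
print(np.max(np.abs(s_env_hat - s_env)))
```

Because every tone here falls on an exact FFT bin, the recovery is essentially exact; for general signals a realistic lowpass filter introduces some distortion.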
15.2 Summary
The need for complex data models with complex parameters is discussed in Section
15.3. A particularly important model is the complex envelope representation of a real
bandpass signal as given by (15.2). Next, the tedious process of minimizing a function
with respect to the real and imaginary parts of a complex parameter is illustrated in
Example 15.2. It is shown how the introduction of a complex derivative greatly simpli-
fies the algebra. Complex random variables are defined and their properties described
in Section 15.4. The important complex Gaussian PDF for a complex random variable
is given by (15.16), while for a complex random vector the corresponding PDF is given
by (15.22). These PDFs assume that the real covariance matrix has the special form
493
[Figure 15.1 Definition of complex envelope: (a) Fourier transform of real bandpass signal; (b) Fourier transform of complex envelope]
[Figure 15.2 Extraction of discrete-time complex envelope from bandpass signal]

of (15.19) or, equivalently, the covariances satisfy (15.20) for the case of two random
variables. Properties of complex Gaussian random variables are summarized in that
section as well. If a complex WSS random process has an autocorrelation and a cross-
correlation that satisfy (15.33), then it is said to be a complex Gaussian WSS random
process. An example is the complex envelope of a real bandpass Gaussian random
process, as described in Example 15.5. The complex derivative of a real function with
respect to a complex variable is formally defined in (15.40). The associated complex
gradient can be used to minimize Hermitian functions by employing (15.44)-(15.46).
If the Hermitian function to be minimized has a linear constraint on its parameters,
then the solution is given by (15.51). Classical estimation based on complex Gaussian
data with real parameters employs the CRLB of (15.52) and the equality condition of
(15.53). When the Fisher information matrix has a special form, however, the equality
condition of (15.54) can be used. This involves a complex Fisher information matrix.
One important example is the complex linear model in Example 15.9, in which (15.58)
is the efficient estimator and (15.59) is its corresponding covariance. Bayesian estima-
tion for the complex Bayesian linear model is discussed in Section 15.8. The MMSE
estimator is given by (15.64) or (15.65), while the minimum Bayesian MSE is given by
(15.66) or (15.67). For large data records an approximate PDF for a complex Gaussian
WSS random process is (15.68) and an approximate CRLB can be computed based on

15.3 Complex Data and Parameters

The relationship between the Fourier transforms of the real bandpass signal and its complex
envelope is

S(F) = S̃(F - F₀) + S̃*(-(F + F₀)).   (15.1)

To obtain S(F) we shift the complex envelope spectrum up to F = F₀ and also, after
first "flipping it around" in F and conjugating it, down to F = -F₀. Taking the
inverse Fourier transform of (15.1) produces

s(t) = s̃(t) exp(j2πF₀t) + [s̃(t) exp(j2πF₀t)]*

or

s(t) = 2Re[s̃(t) exp(j2πF₀t)].   (15.2)

Alternatively, if we let s̃(t) = s_R(t) + js_I(t), where R and I refer to the real and
imaginary parts of the complex envelope, we have
(15.69). These forms are frequently easier to use in deriving estimators. (15.3)
From (15.3), the output of the low-pass filter in the "in-phase" channel of Figure 15.2 is

    [2s̃_R(t) cos² 2πF0t - 2s̃_I(t) sin 2πF0t cos 2πF0t]_LPF
    = [s̃_R(t) + s̃_R(t) cos 4πF0t - s̃_I(t) sin 4πF0t]_LPF
    = s̃_R(t)

since the other signals have spectra centered about 2F0. The "LPF" designation means the output of the low-pass filter in Figure 15.2. Similarly, the output of the low-pass filter in the "quadrature" channel is s̃_I(t). Hence, the discrete signal at the output of the processor in Figure 15.2 is s̃_R(nΔ) + js̃_I(nΔ) = s̃(nΔ). It is seen that complex data naturally arise in bandpass systems. An important example follows.

Example 15.1 - Complex Envelope for Sinusoids

It frequently occurs that the bandpass signal of interest is sinusoidal, as for example,

    s(t) = Σ_{i=1}^p A_i cos(2πF_i t + φ_i)

where it is known that F0 - B/2 ≤ F_i ≤ F0 + B/2 for all i. The complex envelope for this signal is easily found by noting that it may be written as

    s(t) = 2Re[ Σ_{i=1}^p (A_i/2) exp[j(2π(F_i - F0)t + φ_i)] exp(j2πF0t) ]

so that from (15.2)

    s̃(t) = Σ_{i=1}^p (A_i/2) exp(jφ_i) exp[j2π(F_i - F0)t].

Note that the complex envelope is composed of complex sinusoids with frequencies F_i - F0, which may be positive or negative, and complex amplitudes (A_i/2) exp(jφ_i). If we sample at the Nyquist rate of Fs = 1/Δ = B, we have as our signal data

    s̃(nΔ) = Σ_{i=1}^p (A_i/2) exp(jφ_i) exp[j2π(F_i - F0)nΔ].

Furthermore, we can let s̃[n] = s̃(nΔ), so that

    s̃[n] = Σ_{i=1}^p Ã_i exp(j2πf_i n)

where Ã_i = (A_i/2) exp(jφ_i) is the complex amplitude and f_i = (F_i - F0)Δ is the frequency of the ith discrete sinusoid. For the purposes of designing a signal processor we may assume the data model

    x[n] = Σ_{i=1}^p Ã_i exp(j2πf_i n) + w[n]

where w[n] is a complex noise sequence. This is a commonly used model.    ◇

A complex signal can also arise when the analytic signal s_a(t) is used to represent a real low-pass signal. It is formed as

    s_a(t) = s(t) + jH[s(t)]

where H denotes the Hilbert transform [Papoulis 1965]. The effect of forming this complex signal is to remove the redundant negative frequency components of the Fourier transform. Hence, if S(F) is bandlimited to B Hz, then S_a(F) will have a Fourier transform that is zero for F < 0 and so can be sampled at B complex samples/sec. As an example, if

    s(t) = Σ_{i=1}^p A_i cos(2πF_i t + φ_i)

then it can be shown that

    s_a(t) = Σ_{i=1}^p A_i exp[j(2πF_i t + φ_i)].

Now that the use of complex data has been shown to be a natural outgrowth of bandpass signal processing, how do complex parameters come about? Example 15.1 illustrates this possibility. Suppose we wanted to estimate the amplitudes and phases of the p sinusoids. Then, a possible parameter set would be {A_1, φ_1, A_2, φ_2, ..., A_p, φ_p}, which consists of 2p real parameters. But, equivalently, we could estimate the complex parameters {A_1 exp(jφ_1), A_2 exp(jφ_2), ..., A_p exp(jφ_p)}, which is only a p-dimensional but complex parameter set. The equivalence of the two parameter sets is evident from the transformation

    Ã = A exp(jφ)

and inverse transformation

    A = |Ã|,    φ = arctan(Ã_I / Ã_R).

Another common example where complex parameters occur is in spectral modeling of a nonsymmetric PSD. For example, an AR model of the PSD of a complex process x[n] would assume the form [Kay 1988]

    P_xx(f) = σ_u² / |1 + a[1] exp(-j2πf) + ... + a[p] exp(-j2πfp)|².
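The complex envelope relation (15.2) for a single sinusoid is easy to confirm numerically. The sketch below, with assumed carrier, frequency, amplitude, and phase values (none come from the text), builds s̃(t) directly from Example 15.1 and checks that 2Re[s̃(t) exp(j2πF0t)] reproduces the real bandpass signal.

```python
import numpy as np

# Hypothetical parameters: one sinusoid inside the band about F0 (assumed values)
F0 = 1000.0                     # carrier (Hz)
Fi, Ai, phi = 1020.0, 2.0, 0.7  # sinusoid frequency (Hz), amplitude, phase
fs = 20 * F0                    # fine time grid to mimic continuous time
t = np.arange(0, 0.5, 1 / fs)

# Real bandpass signal s(t) = Ai cos(2 pi Fi t + phi)
s = Ai * np.cos(2 * np.pi * Fi * t + phi)

# Complex envelope from Example 15.1: (Ai/2) exp(j phi) exp(j 2 pi (Fi - F0) t)
env = (Ai / 2) * np.exp(1j * phi) * np.exp(1j * 2 * np.pi * (Fi - F0) * t)

# (15.2): s(t) = 2 Re[ env(t) exp(j 2 pi F0 t) ]
s_reconstructed = 2 * np.real(env * np.exp(1j * 2 * np.pi * F0 * t))
assert np.allclose(s, s_reconstructed)
```

The identity holds exactly (it is just a trigonometric rearrangement), so the assertion passes to numerical precision.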
If the PSD is nonsymmetric about f = 0, corresponding to a complex ACF and thus a complex random process [Papoulis 1965], then the AR filter parameters {a[1], a[2], ..., a[p]} will be complex. Otherwise, for real a[k]'s we would have P_xx(-f) = P_xx(f). It seems, therefore, natural to let the filter parameters be complex to allow for all possible spectra.

In dealing with complex data and/or complex parameters we can always just decompose them into their real and imaginary parts and proceed as we normally would for real data vectors and/or real parameter vectors. That there is a distinct disadvantage in doing so is illustrated by the following example.

Example 15.2 - Minimization with Respect to a Complex Parameter

Consider the estimation of a complex amplitude A = A_R + jA_I from complex data x[n] and a known complex signal s[n] by minimization of the LS error

    J'(A_R, A_I) = Σ_{n=0}^{N-1} |x[n] - As[n]|²
                 = Σ_{n=0}^{N-1} (x_R[n] - A_R s_R[n] + A_I s_I[n])² + (x_I[n] - A_R s_I[n] - A_I s_R[n])².

This is a standard quadratic form in the real variables A_R and A_I. Hence, we can let x_R = [x_R[0] x_R[1] ... x_R[N-1]]^T, x_I = [x_I[0] x_I[1] ... x_I[N-1]]^T, s_R = [s_R[0] s_R[1] ... s_R[N-1]]^T, s_I = [s_I[0] s_I[1] ... s_I[N-1]]^T, so that

    J'(A_R, A_I) = (x_R - A_R s_R + A_I s_I)^T (x_R - A_R s_R + A_I s_I)
                 + (x_I - A_R s_I - A_I s_R)^T (x_I - A_R s_I - A_I s_R).

Setting the gradient with respect to A_R and A_I equal to zero and solving produces

    Â_R = (s_R^T s_R + s_I^T s_I)^{-1} (s_R^T x_R + s_I^T x_I)
    Â_I = (s_R^T s_R + s_I^T s_I)^{-1} (s_R^T x_I - s_I^T x_R)

or, recombining as Â = Â_R + jÂ_I,

    Â = ( Σ_{n=0}^{N-1} x[n] s*[n] ) / ( Σ_{n=0}^{N-1} |s[n]|² )

a result analogous to the real case. We can simplify the minimization process using complex variables as follows. First, define a complex derivative as (see Section 15.6 for further discussions) [Brandwood 1983]

    ∂J/∂A = (1/2)( ∂J/∂A_R - j ∂J/∂A_I ).

With this definition

    ∂A/∂A = 1.    (15.4)
It may also be verified that

    ∂A*/∂A = 0    (15.5)

and

    ∂(AA*)/∂A = (∂A/∂A) A* + A (∂A*/∂A) = A*.    (15.6)

These relations can be formally justified, as we shall do in Section 15.6. Now, we can minimize J using the complex derivative

    ∂J/∂A = ∂/∂A Σ_{n=0}^{N-1} |x[n] - As[n]|²
          = Σ_{n=0}^{N-1} ∂/∂A ( |x[n]|² - x[n]A*s*[n] - As[n]x*[n] + AA*|s[n]|² )

and can be shown to reduce to

    ∂J/∂A = Σ_{n=0}^{N-1} ( -s[n]x*[n] + A*|s[n]|² ).

Setting this equal to zero and solving reproduces Â, with considerably less algebra than the real-variable approach.    ◇
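The two routes to Â in Example 15.2 (real/imaginary decomposition versus the complex derivative) give the same answer, which can be checked numerically. In this sketch the signal, the true amplitude, and the noise level are assumed values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
s = np.exp(1j * 2 * np.pi * 0.1 * np.arange(N))   # known complex signal s[n] (assumed)
A_true = 1.5 * np.exp(1j * 0.4)                   # complex amplitude to estimate
x = A_true * s + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

# Complex-derivative solution: A_hat = sum x[n] s*[n] / sum |s[n]|^2
A_hat = np.sum(x * np.conj(s)) / np.sum(np.abs(s) ** 2)

# Same answer via the real/imaginary decomposition of Example 15.2
sR, sI, xR, xI = s.real, s.imag, x.real, x.imag
den = sR @ sR + sI @ sI
AR = (sR @ xR + sI @ xI) / den
AI = (sR @ xI - sI @ xR) / den
assert np.allclose(A_hat, AR + 1j * AI)
```

The equivalence is an algebraic identity, so the assertion holds for any data; with the low noise level assumed here, Â also lands close to the true amplitude.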
15.4 Complex Random Variables and PDFs

A complex random variable is defined as x̃ = u + jv, where u and v are real random variables. Its mean is E(x̃) = E(u) + jE(v), and its variance is defined as

    var(x̃) = E( |x̃ - E(x̃)|² )    (15.9)

which can easily be shown to reduce to

    var(x̃) = E(|x̃|²) - |E(x̃)|².    (15.10)

If we now have two complex random variables x̃₁ and x̃₂ and a joint PDF for the real random vector [u₁ u₂ v₁ v₂]^T, then we can define a cross-moment as

    E(x̃₁* x̃₂) = E[(u₁ - jv₁)(u₂ + jv₂)]
              = [E(u₁u₂) + E(v₁v₂)] + j[E(u₁v₂) - E(u₂v₁)]

which is seen to involve all the possible real cross-moments. The covariance between x̃₁ and x̃₂ is defined as

    cov(x̃₁, x̃₂) = E[ (x̃₁ - E(x̃₁))* (x̃₂ - E(x̃₂)) ].    (15.11)

For a complex random vector x̃ = [x̃₁ x̃₂ ... x̃_n]^T the covariance matrix is

    C_x̃ = E[ (x̃ - E(x̃))(x̃ - E(x̃))^H ]

where H denotes the conjugate transpose of a matrix. Note that the covariance matrix is Hermitian or the diagonal elements are real and the off-diagonal elements are complex conjugates of each other so that C_x̃^H = C_x̃. Also, C_x̃ can be shown to be positive semidefinite (see Problem 15.2).

We frequently need to compute the moments of linear transformations of random vectors. For example, if ỹ = Ax̃ + b, where x̃ is a complex n × 1 random vector, A is a complex m × n matrix, and b is a complex m × 1 vector, then

    E(ỹ) = AE(x̃) + b
    C_ỹ = AC_x̃A^H.

The first result follows by noting that

    E(ỹ_i) = Σ_{j=1}^n [A]_ij E(x̃_j) + b_i

and expressing the results in matrix form. The second result uses the first to yield

    C_ỹ = E[ (ỹ - E(ỹ))(ỹ - E(ỹ))^H ] = E[ A(x̃ - E(x̃))(x̃ - E(x̃))^H A^H ].

But

    E{ [A(x̃ - E(x̃))(x̃ - E(x̃))^H A^H]_ij } = E{ Σ_{k=1}^n Σ_{l=1}^n [A]_ik [(x̃ - E(x̃))(x̃ - E(x̃))^H]_kl [A^H]_lj }
                                            = Σ_{k=1}^n Σ_{l=1}^n [A]_ik [C_x̃]_kl [A^H]_lj

and expressing this in matrix form yields the desired result. It is also of interest to determine the first two moments of a positive definite Hermitian form or of

    Q = x̃^H A x̃

where x̃ is a complex n × 1 random vector and A is an n × n positive definite Hermitian (A^H = A) matrix. Note that Q is real since

    Q^H = (x̃^H A x̃)^H = x̃^H A^H x̃ = x̃^H A x̃ = Q.

Also, since A is assumed to be positive definite, we have Q > 0 for all x̃ ≠ 0. To find the moments of Q we assume E(x̃) = 0. (If this is not the case, we can easily modify the results by replacing x̃ by ỹ = x̃ - E(x̃) and then evaluating the expressions.) Thus,

    E(Q) = E(x̃^H A x̃) = E(tr(A x̃ x̃^H)) = tr(A C_x̃).

The variance requires the second moment E(Q²), which will require fourth-order moments of x̃ (see Example 15.4). Recall that for a real Gaussian PDF the fourth-order moments were functions of second-order moments, considerably simplifying the evaluation. We will show next that we can define a complex Gaussian PDF for complex random variables and that it will have many of the same properties as the real one. Then, we will be able to evaluate complicated expressions such as (15.15). Furthermore, the complex Gaussian PDF arises naturally from a consideration of the distribution of the complex envelope of a bandpass process.

We begin our discussion by defining the complex Gaussian PDF of a scalar complex random variable x̃. Since x̃ = u + jv, any complete statistical description will involve the joint PDF of u and v. The complex Gaussian PDF assumes u to be independent of v. Furthermore, it assumes that the real and imaginary parts are distributed as N(μ_u, σ²/2) and N(μ_v, σ²/2), respectively. Hence, the joint PDF of the real random variables becomes

    p(u, v) = (1/(πσ²)) exp{ -(1/σ²)[(u - μ_u)² + (v - μ_v)²] }.

But letting μ̃ = E(x̃) = μ_u + jμ_v, we have in more succinct form

    p(x̃) = (1/(πσ²)) exp( -(1/σ²)|x̃ - μ̃|² ).    (15.16)

Since the joint PDF depends on u and v only through x̃, we can view the PDF to be that of the scalar random variable x̃, as our notation suggests. This is called the complex Gaussian PDF for a scalar complex random variable and is denoted by CN(μ̃, σ²). Note the similarity to the usual real Gaussian PDF. We would expect p(x̃) to have many of the same algebraic properties as the real Gaussian PDF, and indeed it does.
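A quick Monte Carlo sketch makes the scalar CN(μ̃, σ²) model concrete: drawing the real and imaginary parts independently with variance σ²/2 each, the sample mean approaches μ̃ and the sample value of E|x̃ - μ̃|² approaches σ². The mean and variance values here are assumed for illustration only.

```python
import numpy as np

# Sample x ~ CN(mu, sigma^2): u, v independent, each N(., sigma^2/2) (assumed values)
rng = np.random.default_rng(1)
mu, sigma2, M = 1.0 + 2.0j, 4.0, 200_000

u = rng.normal(mu.real, np.sqrt(sigma2 / 2), M)
v = rng.normal(mu.imag, np.sqrt(sigma2 / 2), M)
x = u + 1j * v

print(np.mean(x))                     # ≈ mu = 1 + 2j
print(np.mean(np.abs(x - mu) ** 2))   # ≈ sigma^2 = 4, split evenly between u and v
```

Note that σ² is the total variance of x̃; each of the real and imaginary parts carries only σ²/2 of it.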
To extend these results we next consider a complex random vector x̃ = [x̃₁ x̃₂ ... x̃_n]^T. Assume the components of x̃ are each distributed as CN(μ̃_i, σ_i²) and are also independent. By independence we mean that the real random vectors [u₁ v₁]^T, [u₂ v₂]^T, ..., [u_n v_n]^T are independent. Then, the multivariate complex Gaussian PDF is just the product of the marginal PDFs or

    p(x̃) = Π_{i=1}^n (1/(πσ_i²)) exp( -(1/σ_i²)|x̃_i - μ̃_i|² )
         = (1/(π^n Π_{i=1}^n σ_i²)) exp[ -Σ_{i=1}^n (1/σ_i²)|x̃_i - μ̃_i|² ]

which follows from the usual property of PDFs for real independent random variables. But noting that for independent complex random variables the covariances are zero, we have the covariance matrix for x̃

    C_x̃ = diag(σ₁², σ₂², ..., σ_n²)

so that the PDF becomes

    p(x̃) = (1/(π^n det(C_x̃))) exp[ -(x̃ - μ̃)^H C_x̃^{-1} (x̃ - μ̃) ].    (15.17)

This is the multivariate complex Gaussian PDF, and it is denoted by CN(μ̃, C_x̃). Again note the similarity to the usual real multivariate Gaussian PDF. But (15.17) is more general than we have assumed in our derivation. It is actually valid for complex covariance matrices other than just diagonal ones. To define the general complex Gaussian PDF we need to restrict the form of the underlying real covariance matrix, as we now show. Recall that for a scalar complex Gaussian random variable x̃ = u + jv the real covariance matrix of [u v]^T is

    [[σ²/2, 0], [0, σ²/2]].

For a 2 × 1 complex random vector x̃ = [x̃₁ x̃₂]^T = [u₁ + jv₁ u₂ + jv₂]^T with x̃₁ independent of x̃₂, as we have assumed, the real covariance matrix of [u₁ v₁ u₂ v₂]^T is

    (1/2) diag(σ₁², σ₁², σ₂², σ₂²).

If we rearrange the real random vector as x = [u₁ u₂ v₁ v₂]^T, then the real covariance matrix becomes

    C_x = (1/2) diag(σ₁², σ₂², σ₁², σ₂²).    (15.18)

This is a special case of the more general form of the real 4 × 4 covariance matrix

    C_x = (1/2) [[A, -B], [B, A]]    (15.19)

where A and B are each 2 × 2 matrices. This form of C_x allows us to define a complex Gaussian PDF for complex Gaussian random variables that are correlated. By letting u = [u₁ u₂]^T and v = [v₁ v₂]^T we see that the submatrices of C_x are

    (1/2)A = E[(u - E(u))(u - E(u))^T] = E[(v - E(v))(v - E(v))^T]
    (1/2)B = E[(v - E(v))(u - E(u))^T] = -E[(u - E(u))(v - E(v))^T].

Thus A is symmetric and B is skew-symmetric or B^T = -B. Explicitly, the real covariance matrix is

    C_x = [ var(u₁)     cov(u₁,u₂)  | cov(u₁,v₁)  cov(u₁,v₂)
            cov(u₂,u₁)  var(u₂)     | cov(u₂,v₁)  cov(u₂,v₂)
            ----------------------------------------------------
            cov(v₁,u₁)  cov(v₁,u₂)  | var(v₁)     cov(v₁,v₂)
            cov(v₂,u₁)  cov(v₂,u₂)  | cov(v₂,v₁)  var(v₂)   ].

For C_x to have the form of (15.19) we require (in addition to the variances being equal or var(u_i) = var(v_i) for i = 1, 2, and the covariance between the real and imaginary parts being zero or cov(u_i, v_i) = 0 for i = 1, 2, as we have assumed for the scalar complex Gaussian random variable) the covariances to satisfy

    cov(u₁, u₂) = cov(v₁, v₂)
    cov(u₁, v₂) = -cov(v₁, u₂).    (15.20)

The covariance between the real parts is identical to that between the imaginary parts, and the covariance between the real part of x̃₁ and the imaginary part of x̃₂ is the negative of the covariance between the real part of x̃₂ and the imaginary part of x̃₁. With this form of the real covariance matrix, as in (15.19), we can prove the following (see Appendix 15A) for x̃ = [x̃₁ x̃₂ ... x̃_n]^T = [u₁ + jv₁ u₂ + jv₂ ... u_n + jv_n]^T and for A and B both having dimensions n × n.

1. The complex covariance matrix of x̃ is C_x̃ = A + jB. As an example, for n = 2 and independent x̃₁, x̃₂, C_x is given by (15.18) and thus

    C_x̃ = A + jB = A = [[σ₁², 0], [0, σ₂²]].
2. The quadratic form in the exponent of the real multivariate Gaussian PDF may be expressed as

    (x - μ)^T C_x^{-1} (x - μ) = 2(x̃ - μ̃)^H C_x̃^{-1} (x̃ - μ̃)

where μ̃ = E(x̃) = μ_u + jμ_v. As an example, for n = 2 and independent x̃₁, x̃₂, we have from (15.18) and (15.19)

    C_x^{-1} = 2 [[A^{-1}, 0], [0, A^{-1}]]

so that

    (x - μ)^T C_x^{-1} (x - μ) = [ (u - μ_u)^T (v - μ_v)^T ] 2 [[A^{-1}, 0], [0, A^{-1}]] [ u - μ_u ; v - μ_v ]
                               = 2 Σ_{i=1}^2 [ (u_i - μ_{u_i})² + (v_i - μ_{v_i})² ] / σ_i²
                               = 2(x̃ - μ̃)^H C_x̃^{-1} (x̃ - μ̃).

3. The determinant of the real covariance matrix satisfies

    det(C_x) = (1/2^{2n}) det²(C_x̃).

Note that det(C_x̃) is real and positive since C_x̃ is Hermitian and positive definite. As an example, for n = 2 and independent x̃₁, x̃₂ we have C_x̃ = A, and from (15.18) and (15.19)

    det(C_x) = (σ₁²/2)²(σ₂²/2)² = (1/2⁴)(σ₁²σ₂²)² = (1/2⁴) det²(C_x̃).

With these results we can rewrite the real multivariate Gaussian PDF of x = [u₁ ... u_n v₁ ... v_n]^T as

    p(x) = p(x̃) = (1/(π^n det(C_x̃))) exp[ -(x̃ - μ̃)^H C_x̃^{-1} (x̃ - μ̃) ]

where μ̃ is the mean and C_x̃ is the covariance matrix of x̃. Whereas the real Gaussian PDF involves 2n × 1 real vectors and 2n × 2n real matrices, the complex Gaussian PDF involves n × 1 complex vectors and n × n complex matrices. We now summarize our results in a theorem.

Theorem 15.1 (Complex Multivariate Gaussian PDF) If a real random vector x of dimension 2n × 1 can be partitioned as

    x = [ u ; v ]

where u and v are real random vectors of dimension n × 1, and if x is Gaussian with a covariance matrix satisfying

    cov(u_i, u_j) = cov(v_i, v_j)
    cov(u_i, v_j) = -cov(v_i, u_j)    (15.21)

then, defining the n × 1 complex random vector x̃ = u + jv, x̃ has the complex multivariate Gaussian PDF

    x̃ ~ CN(μ̃, C_x̃)

where

    μ̃ = μ_u + jμ_v
    C_x̃ = 2(C_uu + jC_vu).    (15.22)

It is important to realize that the complex Gaussian PDF is actually just a different algebraic form of the real Gaussian PDF. What we have essentially done is to replace 2 × 1 vectors by complex numbers or [u v]^T → x̃ = u + jv. This will simplify subsequent estimator calculations, but no new theory should be expected.
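The claim that the complex Gaussian PDF is only an algebraic repackaging of the real one can be verified directly: build a real covariance with the structure of (15.19), evaluate the real 2n-dimensional Gaussian PDF at a point, and compare with the complex form. The matrices A and B and the evaluation point below are assumed values satisfying the required symmetry and skew-symmetry.

```python
import numpy as np

# Structured real covariance (15.19) and its complex counterpart Cx = A + jB
A = np.array([[2.0, 0.5], [0.5, 1.0]])     # symmetric (assumed)
B = np.array([[0.0, 0.3], [-0.3, 0.0]])    # skew-symmetric (assumed)
Creal = 0.5 * np.block([[A, -B], [B, A]])  # covariance of [u1 u2 v1 v2]
C = A + 1j * B                             # complex covariance matrix
assert np.allclose(C, C.conj().T)          # Hermitian, as the text asserts

# Evaluation point (zero mean assumed)
u = np.array([0.7, -0.1]); v = np.array([0.2, 0.9])
xr = np.concatenate([u, v]); x = u + 1j * v
n = 2

# Real 2n-dimensional Gaussian PDF versus complex Gaussian PDF (15.17)
p_real = np.exp(-0.5 * xr @ np.linalg.solve(Creal, xr)) / (
    (2 * np.pi) ** n * np.sqrt(np.linalg.det(Creal)))
p_cplx = np.exp(-np.real(x.conj() @ np.linalg.solve(C, x))) / (
    np.pi ** n * np.real(np.linalg.det(C)))
assert np.isclose(p_real, p_cplx)
```

The agreement rests on exactly the three properties listed above: C_x̃ = A + jB, the factor-of-two relation between the quadratic forms, and det(C_x) = det²(C_x̃)/2^{2n}.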
The deeper reason that allowed us to perform this trick is the isomorphism that exists between the vector spaces M2 (the space of all special 2 × 2 real matrices) and C1 (the space of all complex numbers). We can transform matrices from M2 to complex numbers in C1, perform calculations in C1, and after we're done transform back to M2. As an example, if we wish to multiply the matrices

    [[a, -b], [b, a]]  and  [[e, -f], [f, e]]

we can, equivalently, multiply the complex numbers

    (a + jb)(e + jf) = (ae - bf) + j(af + be)

and then transform back to M2. We will explore this further in Problem 15.5.

There are many properties of the complex Gaussian PDF that mirror those of the real one. We now summarize the important properties with the proofs given in Appendix 15B.

1. Any subvector of a complex Gaussian random vector is also complex Gaussian. In particular, the marginals are complex Gaussian.

2. If x̃ = [x̃₁ x̃₂ ... x̃_n]^T is complex Gaussian and {x̃₁, x̃₂, ..., x̃_n} are uncorrelated, then they are also independent.

3. If {x̃₁, x̃₂, ..., x̃_n} are independent, each one being complex Gaussian, then x̃ = [x̃₁ x̃₂ ... x̃_n]^T is complex Gaussian.

4. Linear (actually affine) transformations of complex Gaussian random vectors are again complex Gaussian. More specifically, if ỹ = Ax̃ + b, where A is a complex m × n matrix with m ≤ n and full rank (so that C_ỹ is invertible), b is a complex m × 1 vector, and if x̃ ~ CN(μ̃, C_x̃), then

    ỹ ~ CN(Aμ̃ + b, AC_x̃A^H).

5. The sum of independent complex Gaussian random variables is also complex Gaussian.

6. If x̃ = [x̃₁ x̃₂ x̃₃ x̃₄]^T ~ CN(0, C_x̃), then

    E(x̃₁*x̃₂x̃₃*x̃₄) = E(x̃₁*x̃₂)E(x̃₃*x̃₄) + E(x̃₁*x̃₄)E(x̃₂x̃₃*).    (15.24)

7. If x̃ and ỹ are jointly complex Gaussian, then the conditional PDF p(ỹ|x̃) is also complex Gaussian with

    E(ỹ|x̃) = E(ỹ) + C_ỹx̃ C_x̃x̃^{-1} (x̃ - E(x̃))    (15.25)
    C_ỹ|x̃ = C_ỹỹ - C_ỹx̃ C_x̃x̃^{-1} C_x̃ỹ.    (15.26)

This is identical in form to the real case (see (10.24) and (10.25)).

These properties appear reasonable, and the reader will be inclined to accept them. The only one that seems strange is property 6, which seems to imply that for a zero mean complex Gaussian random vector we have E(x̃₁x̃₂) = 0. To better appreciate this conjecture we determine this cross-moment as follows:

    E(x̃₁x̃₂) = E[(u₁ + jv₁)(u₂ + jv₂)]
             = E(u₁u₂) - E(v₁v₂) + j[E(u₁v₂) + E(v₁u₂)]
             = cov(u₁, u₂) - cov(v₁, v₂) + j[cov(u₁, v₂) + cov(u₂, v₁)]

which from (15.20) is zero. In fact, we have actually defined E(x̃₁x̃₂) = 0 in arriving at the form of the covariance matrix in (15.19) (see Problem 15.9). It is this constraint that leads to the special form assumed for the real covariance matrix. We will now illustrate some of the foregoing properties by applying them in an important example.

Example 15.3 - PDF of Discrete Fourier Transform of WGN

Consider the real data {x[0], x[1], ..., x[N-1]} which are samples of a WGN process with zero mean and variance σ²/2. We take the discrete Fourier transform (DFT) of the samples to produce

    X(f_k) = Σ_{n=0}^{N-1} x[n] exp(-j2πf_k n)

for f_k = k/N and k = 0, 1, ..., N-1. We inquire as to the PDF of the complex DFT outputs for k = 1, 2, ..., N/2 - 1, neglecting those for k = N/2 + 1, N/2 + 2, ..., N - 1 since these are related to the chosen outputs by X(f_k) = X*(f_{N-k}). We also neglect X(f₀) and X(f_{N/2}) since these are purely real and are usually of little interest anyway, occurring at DC and at the Nyquist frequency. We will show that the PDF is

    X(f_k) ~ CN(0, Nσ²/2)    k = 1, 2, ..., N/2 - 1

with the coefficients independent. Consider the real random vector [U(f₁) V(f₁) ... U(f_{N/2-1}) V(f_{N/2-1})]^T of real and imaginary parts of the DFT outputs.
This is a real Gaussian random vector since it is a linear transformation of x. Thus, we need only show that the special form of the real covariance matrix is satisfied. First, we show that the real and imaginary parts of X(f_k) are uncorrelated. Let X(f_k) = U(f_k) + jV(f_k), where U and V are both real. Then

    U(f_k) = Σ_{n=0}^{N-1} x[n] cos 2πf_k n
    V(f_k) = -Σ_{n=0}^{N-1} x[n] sin 2πf_k n

and therefore

    E(U(f_k)) = Σ_{n=0}^{N-1} E(x[n]) cos 2πf_k n = 0

and similarly E(V(f_k)) = 0. Now

    cov(U(f_k), U(f_l)) = E(U(f_k)U(f_l))
    = E[ Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} x[m]x[n] cos 2πf_k m cos 2πf_l n ]
    = (σ²/2) Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} δ[m - n] cos 2πf_k m cos 2πf_l n
    = (σ²/2) Σ_{n=0}^{N-1} cos 2πf_k n cos 2πf_l n
    = (σ²/2) Σ_{n=0}^{N-1} (1/2)[ cos 2π(f_k + f_l)n + cos 2π(f_k - f_l)n ]
    = (σ²/4) Σ_{n=0}^{N-1} cos 2π(f_k - f_l)n

due to the DFT relation in Appendix 1. Now if f_k ≠ f_l, cov(U(f_k), U(f_l)) = 0 by the same DFT relation. If f_k = f_l, we have cov(U(f_k), U(f_k)) = Nσ²/4. A similar calculation shows that

    cov(V(f_k), V(f_l)) = (σ²/4) Σ_{n=0}^{N-1} cos 2π(f_k - f_l)n

and so

    C_uu = C_vv = (Nσ²/4) I

as required. To verify that C_uv = -C_vu we show that these are both zero.

    E(U(f_k)V(f_l)) = E( -Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} x[m]x[n] cos 2πf_k m sin 2πf_l n )
    = -(σ²/2) Σ_{n=0}^{N-1} cos 2πf_k n sin 2πf_l n
    = 0    for all k, l

by the DFT relation in Appendix 1. Similarly, cov(V(f_k), U(f_l)) = 0 for all k, l. Hence, C_uv = -C_vu = 0. From Theorem 15.1 the complex covariance matrix becomes

    C_X = 2(C_uu + jC_vu) = (Nσ²/2) I

and the assertion is proved. Observe that the DFT coefficients are independent and identically distributed. As an application of this result we may wish to estimate σ² using

    σ̂² = (1/c) Σ_{k=1}^{N/2-1} |X(f_k)|².

This is frequently required for normalization purposes in determining a threshold for a detector [Knight, Pridham, and Kay 1981]. For the estimator to be unbiased we need to choose an appropriate value for c. But

    E(σ̂²) = (1/c) Σ_{k=1}^{N/2-1} E(|X(f_k)|²).

Since X(f_k) ~ CN(0, Nσ²/2), we have that

    E(|X(f_k)|²) = Nσ²/2

and so

    E(σ̂²) = (1/c)(N/2 - 1)(Nσ²/2).

Hence, c should be chosen to be N(N/2 - 1)/2.    ◇
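The conclusions of Example 15.3 are easy to check by simulation. The sketch below, with assumed values of N and σ², takes the DFT of many realizations of real WGN, confirms that E|X(f_k)|² ≈ Nσ²/2 for the interior bins, and checks that c = N(N/2 - 1)/2 makes σ̂² unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
N, sigma2, M = 64, 2.0, 2000                       # assumed sizes
x = rng.normal(0.0, np.sqrt(sigma2 / 2), (M, N))   # real WGN, variance sigma^2/2

X = np.fft.fft(x, axis=1)                          # X(f_k), f_k = k/N
k = np.arange(1, N // 2)                           # keep k = 1, ..., N/2 - 1

# Each X(f_k) ~ CN(0, N sigma^2 / 2)
var_k = np.mean(np.abs(X[:, k]) ** 2, axis=0)
print(var_k.mean())                                # ≈ N * sigma^2 / 2 = 64

# Unbiased estimator of sigma^2 with c = N (N/2 - 1) / 2
c = N * (N // 2 - 1) / 2
sigma2_hat = np.sum(np.abs(X[:, k]) ** 2, axis=1) / c
print(sigma2_hat.mean())                           # ≈ sigma^2 = 2
```

Note that `np.fft.fft` uses exactly the sign convention of the example, X(f_k) = Σ x[n] exp(-j2πkn/N).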
Example 15.4 - Variance of Hermitian Form

Consider the Hermitian form Q = x̃^H A x̃, where x̃ ~ CN(0, C_x̃) and A is Hermitian. The mean has already been given in (15.14) as tr(AC_x̃). We first determine the second moment

    E(Q²) = E(x̃^H A x̃ x̃^H A x̃) = Σ_{i=1}^n Σ_{j=1}^n Σ_{k=1}^n Σ_{l=1}^n [A]_ij [A]_kl E(x̃_i* x̃_j x̃_k* x̃_l).

Using the fourth-order moment property (15.24), we have

    E(x̃_i* x̃_j x̃_k* x̃_l) = E(x̃_i* x̃_j)E(x̃_k* x̃_l) + E(x̃_i* x̃_l)E(x̃_j x̃_k*)
                          = [C_x̃^T]_ij [C_x̃^T]_kl + [C_x̃^T]_il [C_x̃]_kj.

Substituting, the first term in the resulting expression for E(Q²) is just tr²(C_x̃A) or tr²(AC_x̃), the squared mean. Let D = AC_x̃. Then, the second term is

    Σ_{i=1}^n Σ_{k=1}^n [D]_ik [D]_ki = tr(DD) = tr(AC_x̃AC_x̃)

so that finally we have for x̃ ~ CN(0, C_x̃) and A Hermitian

    E(x̃^H A x̃) = tr(AC_x̃)    (15.29)
    var(x̃^H A x̃) = tr(AC_x̃AC_x̃).    (15.30)

An application of this result is given in Problem 15.10.    ◇
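The moment formulas (15.29) and (15.30) are simple to verify by Monte Carlo. The covariance matrix and Hermitian matrix below are assumed values, and the circular Gaussian samples are generated by coloring unit-variance CN(0, I) noise with a Cholesky factor.

```python
import numpy as np

rng = np.random.default_rng(4)
n, M = 2, 200_000
Cx = np.array([[2.0, 0.5j], [-0.5j, 1.0]])   # Hermitian positive definite (assumed)
A = np.array([[1.0, 0.2], [0.2, 3.0]])       # Hermitian (real symmetric, assumed)
L = np.linalg.cholesky(Cx)

# x ~ CN(0, Cx): color w ~ CN(0, I), whose parts each have variance 1/2
w = (rng.standard_normal((n, M)) + 1j * rng.standard_normal((n, M))) / np.sqrt(2)
x = L @ w
Q = np.real(np.einsum('im,ij,jm->m', x.conj(), A, x))   # Q = x^H A x, one per trial

print(Q.mean())   # ≈ tr(A Cx) = 5
print(Q.var())    # ≈ tr(A Cx A Cx) ≈ 14.64
```

Both sample moments settle on the theoretical values as the number of trials grows.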
15.5 Complex WSS Random Processes

We now consider a complex WSS random process x̃[n] with zero mean and ACF r_x̃x̃[k] = E(x̃*[n]x̃[n + k]), and as usual the PSD is the Fourier transform of r_x̃x̃[k]. If the random vector formed from any subset of samples, say {x̃[n₁], x̃[n₂], ..., x̃[n_k]}, has a multivariate complex Gaussian PDF, then we say that x̃[n] is a WSS complex Gaussian random process.
Its complete statistical description is then summarized by the ACF, which determines the covariance matrix, or, equivalently, by the PSD (since the mean is zero). Since the real covariance matrix is constrained to be of the form (15.19), this translates into a constraint on r_x̃x̃[k]. To interpret this constraint recall that we require

    C_uu = C_vv
    C_uv = -C_vu

for all u = [u[n₁] ... u[n_k]]^T and v = [v[n₁] ... v[n_k]]^T. This means that the correlations of the real (in-phase) and imaginary (quadrature) processes must satisfy

    r_uu[k] = r_vv[k]
    r_uv[k] = -r_vu[k].

Thus, the complex envelope or the analytic signal of a WSS Gaussian random process is a complex WSS Gaussian random process (see Problem 15.12). Since these results are standard fare in many textbooks [Papoulis 1965], we only briefly illustrate them. We consider a bandpass noise process using the complex envelope representation. What we will show is that by concatenating the in-phase and quadrature processes together into a single complex envelope process, a complex WSS Gaussian random process results if the bandpass process is a real WSS Gaussian process. Since for sampled processes

    E(x₁[n]x₂[n + k]) = E(x₁(nΔ)x₂(nΔ + kΔ)) = r_x₁x₂(kΔ)

the discrete-time in-phase and quadrature processes inherit these correlation properties from the continuous-time ones, as required. Thus, x̃[n] is a complex Gaussian WSS random process. As an example, if x̃(t) is the result of filtering a continuous-time WGN process with a PSD of N₀/2 over a band of B Hz as shown in Figure 15.2, then the samples taken at the rate Fs = B are uncorrelated, and all the samples of the process are independent.

Before concluding this section let us point out that a real Gaussian random process is not a special case of a complex Gaussian random process. As a counterexample, consider the second-order moment E(x̃[m]x̃[n]), which for a complex Gaussian process is known to be zero for all m and n. If x[n] is a real Gaussian random process, then the second moment need not be identically zero.

15.6 Derivatives, Gradients, and Optimization

In order to determine estimators such as the LSE or the MLE we must optimize a function over a complex parameter space. We explored one such problem in Example 15.2. We now extend that approach so that a general method can be used for easily finding estimators. The complex derivative of a real scalar function J with respect to a complex parameter θ = α + jβ was defined to be

    ∂J/∂θ = (1/2)( ∂J/∂α - j ∂J/∂β ).    (15.40)

Consider J = |θ|² = θθ*. It is well known that the complex function (actually a real function) |θ|² is not an analytic function and so cannot be differentiated. Yet by our definition a stationary point of the real function can be found by solving ∂J/∂θ = 0, producing θ* = 0 or α = β = 0 as expected. Note that if we were to rewrite J as θθ*, then the same result would have been obtained by considering θ* to be a constant in the partial differentiation. Specifically, consider J to be a function of the two independent complex variables θ and θ*, so that we denote it by J(θ, θ*). It is easily shown for this example that J is an analytic function with respect to θ, holding θ* constant, and also with respect to θ*, holding θ constant. To evaluate ∂θ/∂θ we use the same definition of ∂/∂θ to yield

    ∂θ/∂θ = (1/2)( ∂/∂α - j ∂/∂β )(α + jβ)
          = (1/2)( ∂α/∂α + j ∂β/∂α - j ∂α/∂β + ∂β/∂β )
          = (1/2)( 1 + j0 - j0 + 1 )
          = 1    (15.41)

which is consistent with known differentiation results for analytic functions. Similarly, it is easily shown that (see Problem 15.13)

    ∂θ*/∂θ = 0    (15.42)

so that in differentiating with respect to θ we may treat θ* as a constant and apply the usual rules. Also,

    ∂J/∂θ* = ( ∂J/∂θ )*    (15.43)

and setting ∂J/∂θ* = 0 will produce the same solutions as ∂J/∂θ = 0.

The complex gradient of a real function J with respect to the complex vector parameter θ is defined as the p × 1 vector

    ∂J/∂θ = [ ∂J/∂θ₁  ∂J/∂θ₂  ...  ∂J/∂θ_p ]^T

where each element is defined by (15.40). Again note that the complex gradient is zero if and only if each element is zero or ∂J/∂θ_i = 0 for i = 1, 2, ..., p, and hence if and only if ∂J/∂α_i = ∂J/∂β_i = 0 for i = 1, 2, ..., p. Stationary points of J may be found by setting the complex gradient equal to zero and solving.

For the most part we will be interested in differentiating linear and Hermitian forms such as b^H θ and θ^H A θ (where A is Hermitian), respectively. We now derive the complex gradient of these functions. First, consider l(θ) = b^H θ, where we note that l is complex. Then

    ∂l/∂θ = ∂/∂θ ( Σ_{i=1}^p b_i* θ_i ) = b*.    (15.44)
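The gradient formulas of this section can be checked numerically by implementing the defining relation ∂J/∂θ_i = (1/2)(∂J/∂α_i - j ∂J/∂β_i) with central finite differences. The matrix, vector, and evaluation point below are assumed values.

```python
import numpy as np

# Finite-difference implementation of the complex derivative definition
def complex_grad(J, theta, h=1e-6):
    g = np.zeros_like(theta, dtype=complex)
    for i in range(theta.size):
        da = np.zeros_like(theta)
        da[i] = h
        dJa = (J(theta + da) - J(theta - da)) / (2 * h)            # dJ/dalpha_i
        dJb = (J(theta + 1j * da) - J(theta - 1j * da)) / (2 * h)  # dJ/dbeta_i
        g[i] = 0.5 * (dJa - 1j * dJb)
    return g

A = np.array([[2.0, 0.5 + 0.1j], [0.5 - 0.1j, 1.0]])   # Hermitian (assumed)
b = np.array([1.0 - 1j, 0.5j])
theta = np.array([0.3 + 0.7j, -1.0 + 0.2j])

# Linear form: gradient of b^H theta is b*
assert np.allclose(complex_grad(lambda th: b.conj() @ th, theta), np.conj(b), atol=1e-5)

# Hermitian form: gradient of theta^H A theta is (A theta)*
J = lambda th: np.real(th.conj() @ A @ th)
assert np.allclose(complex_grad(J, theta), np.conj(A @ theta), atol=1e-5)
```

Both checks pass, which is a useful sanity test when deriving gradients of more complicated likelihood functions.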
We let the reader show that (see Problem 15.14)

    ∂(θ^H b)/∂θ = 0.    (15.45)

Next, consider the Hermitian form J = θ^H A θ. Since A^H = A, J must be real, which follows from J^H = θ^H A^H θ = θ^H A θ = J. Now

    J = Σ_{i=1}^p Σ_{j=1}^p θ_i* [A]_ij θ_j

so that, treating θ* as a constant,

    ∂J/∂θ_k = Σ_{i=1}^p θ_i* [A]_ik = Σ_{i=1}^p [A^T]_ki θ_i*

and therefore

    ∂J/∂θ = A^T θ* = (Aθ)*.    (15.46)

As alluded to earlier, this result appears somewhat strange since if θ were real, we would have had ∂J/∂θ = 2Aθ (see (4.3)). The real case then is not a special case.

Because we will wish to differentiate the likelihood function corresponding to the complex Gaussian PDF to determine the CRLB as well as the MLE, the following formulas are quite useful. If the covariance matrix C_x̃ depends upon a number of real parameters {ξ₁, ξ₂, ..., ξ_p}, then denoting the covariance matrix as C_x̃(ξ), it can be shown that

    ∂ ln det(C_x̃(ξ))/∂ξ_i = tr( C_x̃^{-1}(ξ) ∂C_x̃(ξ)/∂ξ_i )    (15.47)

    ∂( x̃^H C_x̃^{-1}(ξ) x̃ )/∂ξ_i = -x̃^H C_x̃^{-1}(ξ) ( ∂C_x̃(ξ)/∂ξ_i ) C_x̃^{-1}(ξ) x̃.    (15.48)

These are derived in Appendix 3C for a real covariance matrix. For a complex covariance matrix the derivations are easily extended. The definitions and formulas are summarized in Table 15.1. We now illustrate their use with an example.

TABLE 15.1 Useful Formulas

Definitions
    θ: complex scalar parameter (θ = α + jβ)
    J: real scalar function of θ
    ∂J/∂θ = (1/2)( ∂J/∂α - j ∂J/∂β )
    θ: complex vector parameter
    ξ_i: real parameter of the covariance matrix C(ξ)

Formulas
    ∂θ/∂θ = 1        ∂θ*/∂θ = 0
    ∂(b^H θ)/∂θ = b*
    ∂(θ^H A θ)/∂θ = (Aθ)*    (A Hermitian)
    ∂ ln det(C(ξ))/∂ξ_i = tr( C^{-1}(ξ) ∂C(ξ)/∂ξ_i )
    ∂( x̃^H C^{-1}(ξ) x̃ )/∂ξ_i = -x̃^H C^{-1}(ξ) ( ∂C(ξ)/∂ξ_i ) C^{-1}(ξ) x̃

Example 15.6 - Minimization of Hermitian Functions

Suppose we wish to minimize the LS error

    J = (x̃ - Hθ)^H C^{-1} (x̃ - Hθ)

where x̃ is a complex N × 1 vector, H is a complex N × p matrix with N > p and full rank, C is a complex N × N covariance matrix, and θ is a complex p × 1 parameter vector. We first note that J is a real function since J^H = J (recall that C^H = C). To find the value of θ that minimizes J we expand the function and then make use of our formulas.

    J = x̃^H C^{-1} x̃ - x̃^H C^{-1} Hθ - θ^H H^H C^{-1} x̃ + θ^H H^H C^{-1} Hθ

so that

    ∂J/∂θ = 0 - (H^H C^{-1} x̃)* - 0 + (H^H C^{-1} Hθ)*
          = -[ H^H C^{-1} (x̃ - Hθ) ]*    (15.49)
using (15.44), (15.45), and (15.46). We can find the minimizing solution by setting the complex gradient equal to zero, yielding the LS solution

    θ̂ = (H^H C^{-1} H)^{-1} H^H C^{-1} x̃.    (15.50)

To confirm that θ̂ produces the global minimum the reader should verify that

    J = (θ - θ̂)^H H^H C^{-1} H (θ - θ̂) + (x̃ - Hθ̂)^H C^{-1} (x̃ - Hθ̂)
      ≥ (x̃ - Hθ̂)^H C^{-1} (x̃ - Hθ̂)

with equality if and only if θ = θ̂. As an example, using (15.50), the solution of the LS problem in Example 15.2 is easily found by letting θ = A, H = [s[0] s[1] ... s[N-1]]^T, and C = I, so that from (15.50)

    Â = (H^H H)^{-1} H^H x̃ = ( Σ_{n=0}^{N-1} x[n]s*[n] ) / ( Σ_{n=0}^{N-1} |s[n]|² )

in agreement with the earlier result.    ◇

A final useful result concerns the minimization of a Hermitian form subject to linear constraints. Consider the real function

    g(α) = α^H W α

where α is a complex n × 1 vector and W is a complex n × n positive definite (and Hermitian) matrix. To avoid the trivial solution α = 0 we assume that α satisfies the linear constraint Bα = b, where B is a complex r × n matrix with r < n and full rank and b is a complex r × 1 vector. A solution to this constrained minimization problem requires the use of Lagrangian multipliers. To extend the use of real Lagrangian multipliers to the complex case we let B = B_R + jB_I, α = α_R + jα_I, and b = b_R + jb_I. Then the constraint equation Bα = b is equivalent to

    (B_R + jB_I)(α_R + jα_I) = b_R + jb_I

and the Lagrangian becomes

    J(α) = α^H W α + λ_R^T (B_R α_R - B_I α_I - b_R) + λ_I^T (B_I α_R + B_R α_I - b_I)
         = α^H W α + λ_R^T Re(Bα - b) + λ_I^T Im(Bα - b)

where λ_R, λ_I are each real r × 1 Lagrangian multiplier vectors. But letting λ = λ_R + jλ_I be a complex r × 1 Lagrangian multiplier vector, the constraint terms are simplified to yield

    J(α) = α^H W α + Re[ (λ_R - jλ_I)^T (Bα - b) ]
         = α^H W α + (1/2)λ^H (Bα - b) + (1/2)λ^T (B*α* - b*).

We can now perform the constrained minimization. Using (15.44), (15.45), and (15.46), we have

    ∂J/∂α = (Wα)* + (B^H λ/2)*

and setting it equal to zero produces

    α_opt = -W^{-1} B^H (λ/2).

Imposing the constraint Bα_opt = b, we have

    Bα_opt = -BW^{-1} B^H (λ/2) = b.

Since B is assumed to be full rank and W to be positive definite, BW^{-1}B^H is invertible and

    λ/2 = -(BW^{-1}B^H)^{-1} b

so that finally

    α_opt = W^{-1} B^H (BW^{-1}B^H)^{-1} b.    (15.51)

To show that this is indeed the global minimizing solution we need only verify the identity

    g(α) = (α - α_opt)^H W (α - α_opt) + α_opt^H W α_opt ≥ α_opt^H W α_opt

which holds with equality if and only if α = α_opt. We illustrate this constrained minimization with an example.
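The constrained minimizer (15.51) can be tested numerically: form α_opt, confirm the constraint holds, and confirm that moving in any feasible direction (any perturbation in the null space of B) does not decrease the Hermitian form. The matrices W, B, and b below are random assumed values.

```python
import numpy as np

rng = np.random.default_rng(7)
n, r = 4, 2
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
W = G @ G.conj().T + n * np.eye(n)                 # positive definite Hermitian (assumed)
B = rng.standard_normal((r, n)) + 1j * rng.standard_normal((r, n))
b = rng.standard_normal(r) + 1j * rng.standard_normal(r)

# (15.51): alpha_opt = W^-1 B^H (B W^-1 B^H)^-1 b
Wi = np.linalg.inv(W)
alpha_opt = Wi @ B.conj().T @ np.linalg.solve(B @ Wi @ B.conj().T, b)
assert np.allclose(B @ alpha_opt, b)               # constraint satisfied

# Any other feasible alpha = alpha_opt + d with B d = 0 gives a larger value
g = lambda a: np.real(a.conj() @ W @ a)
for _ in range(100):
    d = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    d -= B.conj().T @ np.linalg.solve(B @ B.conj().T, B @ d)   # project onto null(B)
    assert g(alpha_opt + d) >= g(alpha_opt) - 1e-9
```

The inequality holds because the cross term d^H W α_opt vanishes for every d in the null space of B, exactly the identity used in the text to establish the global minimum.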
CHAPTER 15. COMPLEX DATA AND PARAMETERS
15.7. CLASSICAL ESTIMATION WITH COMPLEX DATA 525
524
The para~eter vector 8 is to be estimated based on the complex data x = [X[O] x[l] ...
where A is an unbiased estimator and has minimum variance among all linear estima- x[N - 1]] and may have components that are real and/or complex. Many of the
tors. To be unbiased we require results encountered for real data/real parameter estimation theory can be extended to
the complex case. A full account, however, would require another book-hence we will
E(A) = aHE(x) = a HAl = A mention the results we believe to be most useful. The reader will undoubtedly wish to
or a H I = = 1,_which can be verified by
1. This constraint can also be written as ITa extend other real estimators and is certainly encouraged to do so.
We first consider the MVU estimator of 8. Because 8 may have complex and real
replacing a by its real and imaginary parts. The variance of A is components, the most general approach assumes that 8 is a real vector. To avoid
confusion with the case when 8 is purely complex we will denote the vector of real
var(A) E [IA - E(A)1 2 ] parameters as e. For example, if we wish to estimate the complex amplitude A and the
H
E [laHx - a A112] e
frequency fo of a sinusoid, we let = [AR AI fo]T. Then, the MVU estimator has its
usual meaning-it is the unbiased estimator having the minimum variance or
E [a H(x - Al)(x - Al)H a]
E(~i)=~i i=1,2 ... ,p
aHCa. var(~i) is minimum i = 1,2, ... ,po
We now wish to minimize the variance subject to the linear constraint on a. L~tting
A good starting point is the CRLB, which for a complex Gaussian PDF results in a
B = IT, b = 1, and W = C, we have from (15.51) that Fisher information matrix (see Appendix 15C)
Iiopt C- I l(I T C- I I)-ll
C-Il
lTC-lI .
A a~tx
for i,j = 1,2, ... ,po The derivatives are defined, as usual, as the matrix or vector
lTC-IX of partial derivatives (see Section 3.9), where we differentiate each complex element
lTC-II 9 = gR + jgI as
which is identical in form to the real case (see Example 6.2). A subtle difference, Bg Bg R .BgI
B~i = B~i + J B~i .
however, is that A minimizes var(A), or if A = A.R + jA.I,
The equality condition for the CRLB to be attained is, from (3.25),
var(A) E (IA - E(AW)
(15.53)
E (I(A. R - E(A. R)) + j(A. I - E(A.I )W)
var(A. R ) + var(A. I ) e
v:here = g(x) is. the efficient estimator of e.
Note that p(x; e)
is the same as p(x; 8)
smce the former IS the complex Gaussian PDF parameterized using real parameters.
which is actua.lly the sum of the variances for each component of the estimator. <>
The example to follow illustrates the computation of the CRLB for the estimation of
a real parameter based on complex data. This situation frequently occurs when the
covariance matrix depends on an unknown real parameter.
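As a quick numerical illustration of Example 15.7, the sketch below forms the complex BLUE Â = 1^T C^{-1} x̃ / (1^T C^{-1} 1) of a complex DC level in colored noise and compares it with the plain sample mean. The covariance C, the level A, and all sizes are arbitrary choices for this demo, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
A = 1.5 - 0.7j                      # complex DC level to estimate

# Arbitrary Hermitian positive definite noise covariance (illustration only)
G = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
C = G @ G.conj().T / N + np.eye(N)

def blue_dc_level(x, C):
    """Complex BLUE of a DC level: A_hat = 1^T C^{-1} x / (1^T C^{-1} 1)."""
    ones = np.ones(len(x))
    Ci1 = np.linalg.solve(C, ones)          # C^{-1} 1
    return (Ci1.conj() @ x) / (Ci1.conj() @ ones)

Lc = np.linalg.cholesky(C)
est_blue, est_avg = [], []
for _ in range(2000):
    z = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    x = A + Lc @ z                          # noise with covariance C
    est_blue.append(blue_dc_level(x, C))
    est_avg.append(x.mean())
est_blue, est_avg = np.array(est_blue), np.array(est_avg)

print(np.mean(est_blue))                    # should be close to A (unbiased)
# BLUE sample variance should not exceed the sample mean's (up to MC error)
print(np.var(est_blue), np.var(est_avg))
```

Since the BLUE has minimum variance among linear unbiased estimators, its Monte Carlo variance should be no larger than that of the unweighted average.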
Example 15.8 - Random Sinusoid in CWGN

Assume that

x̃[n] = Ã exp(j2πf₀n) + w̃[n]    n = 0, 1, ..., N − 1
where w̃[n] is CWGN with variance σ², Ã ~ CN(0, σ_A²), and Ã is independent of w̃[n]. In matrix form we have

x̃ = Ãe + w̃

where e = [1 exp(j2πf₀) ... exp(j2πf₀(N − 1))]^T. Since sums of independent complex Gaussian random variables are also complex Gaussian (see Section 15.4), we have

x̃ ~ CN(0, C_x̃).

To find the covariance matrix

C_x̃ = E(x̃x̃^H) = E((Ãe + w̃)(Ãe + w̃)^H) = E(|Ã|²)ee^H + σ²I = σ_A² ee^H + σ²I

since Ã and w̃ are independent. We now examine the problem of estimating σ_A², the variance of the complex amplitude, assuming f₀ and σ² are known. The PDF is

p(x̃; σ_A²) = (1/(π^N det(C_x̃(σ_A²)))) exp[−x̃^H C_x̃^{-1}(σ_A²) x̃].

To see if the CRLB is attained, we compute

∂ ln p(x̃; σ_A²)/∂σ_A² = −∂ ln det(C_x̃(σ_A²))/∂σ_A² − x̃^H (∂C_x̃^{-1}(σ_A²)/∂σ_A²) x̃.

Using formulas (15.47) and (15.48), and noting that ∂C_x̃/∂σ_A² = ee^H, this becomes

∂ ln p(x̃; σ_A²)/∂σ_A² = −tr(C_x̃^{-1} ee^H) + x̃^H C_x̃^{-1} ee^H C_x̃^{-1} x̃.

But

C_x̃^{-1} e = (1/σ²)e − (Nσ_A²/(σ⁴ + Nσ_A²σ²)) e = (1/(σ² + Nσ_A²)) e

so that

∂ ln p(x̃; σ_A²)/∂σ_A² = −N/(σ² + Nσ_A²) + |e^H x̃|²/(σ² + Nσ_A²)²
  = (N²/(Nσ_A² + σ²)²) (|x̃^H e|²/N² − σ²/N − σ_A²).

The CRLB is satisfied with

σ̂_A² = |x̃^H e|²/N² − σ²/N = (1/N²) |Σ_{n=0}^{N−1} x̃[n] exp(−j2πf₀n)|² − σ²/N

which is an efficient estimator and has minimum variance

var(σ̂_A²) = (Nσ_A² + σ²)²/N² = (σ_A² + σ²/N)².

Note that the variance does not decrease to zero with N, and hence σ̂_A² is not consistent. The reader may wish to review the results of Problem 3.14 in explaining why this is so. ◇

When the parameter to be estimated is complex, we can sometimes determine if an efficient estimator exists more easily than by expressing the parameter as a real vector parameter and using (15.53). We now examine this special case. First, consider the complex parameter θ, where θ = α + jβ. Then, concatenate the real and imaginary parts into the real parameter vector ξ = [α^T β^T]^T. If the Fisher information matrix for ξ takes the form

I(ξ) = 2 [ E  −F
           F   E ]

then the equality condition of the CRLB may be simplified.
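The efficient estimator of Example 15.8 and its non-vanishing variance are easy to check by simulation. This is a minimal sketch; f₀, σ², σ_A², and the trial counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
f0, sig2, sigA2 = 0.12, 1.0, 2.0    # arbitrary illustrative values

def estimate_sigA2(N, trials=4000):
    """Monte Carlo runs of sigA2_hat = |x^H e|^2 / N^2 - sig2 / N."""
    n = np.arange(N)
    e = np.exp(1j * 2 * np.pi * f0 * n)
    ests = np.empty(trials)
    for t in range(trials):
        A = np.sqrt(sigA2 / 2) * (rng.standard_normal() + 1j * rng.standard_normal())
        w = np.sqrt(sig2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
        x = A * e + w
        ests[t] = np.abs(e.conj() @ x) ** 2 / N**2 - sig2 / N
    return ests

for N in (16, 256):
    ests = estimate_sigA2(N)
    # mean stays near sigA2 (unbiased); variance stays near
    # (sigA2 + sig2/N)^2 and does not go to zero as N grows
    print(N, ests.mean(), ests.var())
```

The printed variances should hover near (σ_A² + σ²/N)² ≈ σ_A⁴ for both record lengths, illustrating the inconsistency noted in the example.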
For the complex linear model x̃ = Hθ + w̃ with w̃ ~ CN(0, C), the Fisher information matrix has this special form, and

θ̂ = (H^H C^{-1} H)^{-1} H^H C^{-1} x̃

is an efficient estimator, and hence the MVU estimator of θ. Also, from (15.54), (15.56), and (15.57),

C_θ̂ = (H^H C^{-1} H)^{-1}    (15.59)

is its covariance matrix. The covariance matrix of the real parameter vector ξ is

C_ξ̂ = I^{-1}(ξ) = (2 [ E  −F ; F  E ])^{-1} = (1/2) [ A  −B ; B  A ]

because of the special form of I(ξ) (see property 4 in Appendix 15A). But (A + jB)(E + jF) = I, so that

A + jB = (E + jF)^{-1} = I^{-1}(θ) = (H^H C^{-1} H)^{-1}

and hence

C_ξ̂ = (1/2) [ Re[(H^H C^{-1} H)^{-1}]   −Im[(H^H C^{-1} H)^{-1}]
              Im[(H^H C^{-1} H)^{-1}]    Re[(H^H C^{-1} H)^{-1}] ].

To check that the real Fisher information matrix has the special form, (15.55) and (15.57) may be used.

When the CRLB is not satisfied, we can always resort to an MLE. In practice, the use of sufficient statistics for real or complex data appears to be of limited value. Hence, our approach in Chapter 5, although it should be tried, frequently does not produce a practical estimator. The MLE, on the other hand, being a "turn-the-crank" procedure, is usually of more value. The general MLE equations for real parameters of a complex Gaussian PDF can be expressed in closed form. For complex parameters there does not seem to be any general simplification of these equations. Letting ln p(x̃; ξ) denote the log-likelihood of the complex Gaussian PDF, we differentiate using the formulas (15.47) and (15.48). As shown in Appendix 15C, this leads to

∂ ln p(x̃; ξ)/∂ξ_i = −tr(C_x̃^{-1}(ξ) ∂C_x̃(ξ)/∂ξ_i)
  + (x̃ − μ̃(ξ))^H C_x̃^{-1}(ξ) (∂C_x̃(ξ)/∂ξ_i) C_x̃^{-1}(ξ)(x̃ − μ̃(ξ))
  + 2 Re[(x̃ − μ̃(ξ))^H C_x̃^{-1}(ξ) ∂μ̃(ξ)/∂ξ_i]    (15.60)

which when set equal to zero can sometimes be used to find the MLE. An example follows.

Example 15.10 - Phase of a Complex Sinusoid in CWGN

Consider the data

x̃[n] = A exp[j(2πf₀n + φ)] + w̃[n]    n = 0, 1, ..., N − 1

where the real amplitude A and frequency f₀ are known, the phase φ is to be estimated, and w̃[n] is CWGN with variance σ². Then, x̃[n] is a complex Gaussian process (although not WSS due to the nonstationary mean), so that to find the MLE of φ we can use (15.60). But C_x̃ = σ²I does not depend on φ, and [μ̃(φ)]_n = A exp[j(2πf₀n + φ)], so from (15.60)

∂ ln p(x̃; φ)/∂φ = (2/σ²) Re[(x̃ − μ̃(φ))^H ∂μ̃(φ)/∂φ]
  = (2/σ²) Re[ Σ_{n=0}^{N−1} (x̃*[n] − A exp[−j(2πf₀n + φ)]) jA exp[j(2πf₀n + φ)] ]
  = −(2A/σ²) Im[ Σ_{n=0}^{N−1} x̃*[n] exp[j(2πf₀n + φ)] − NA ]
  = −(2A/σ²) Im[ exp(jφ) Σ_{n=0}^{N−1} x̃*[n] exp(j2πf₀n) − NA ].

Setting this equal to zero, we have upon letting

X(f₀) = Σ_{n=0}^{N−1} x̃[n] exp(−j2πf₀n)

denote the Fourier transform of the data

Im[exp(jφ) X*(f₀)] = 0    (15.63)

or

Im[(cos φ + j sin φ)(Re(X(f₀)) − j Im(X(f₀)))] = 0

or

sin φ Re(X(f₀)) = cos φ Im(X(f₀)).

The MLE follows as

φ̂ = arctan[ Im(X(f₀)) / Re(X(f₀)) ]

which may be compared to the results in Example 7.6. ◇
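Example 15.10 can be verified numerically. The sketch below uses np.arctan2, which agrees with the arctan form above while also resolving the quadrant; all parameter values are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
N, A, f0, phi = 200, 1.0, 0.1, 0.5    # phi chosen in (-pi/2, pi/2)
n = np.arange(N)
# complex sinusoid in CWGN with variance 0.1 (0.05 per real/imag part)
x = A * np.exp(1j * (2 * np.pi * f0 * n + phi)) \
    + np.sqrt(0.05) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

# Fourier transform of the data at the known frequency f0
X = np.sum(x * np.exp(-1j * 2 * np.pi * f0 * n))
phi_hat = np.arctan2(X.imag, X.real)   # MLE of the phase
print(phi_hat)                         # should be close to 0.5
```

At this SNR the phase error is on the order of a hundredth of a radian, consistent with the CRLB behavior discussed later in the chapter.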
Because of the form of the posterior PDF, the MAP and MMSE estimators are identical. It can further be shown that the MMSE estimator based on x̃ is just E(θ_i|x̃) and that it minimizes the Bayesian MSE

Bmse(θ̂_i) = E[|θ_i − θ̂_i|²]

where the expectation is with respect to p(x̃, θ_i). Furthermore, if θ_i = α_i + jβ_i, then α̂_i minimizes E[(α_i − α̂_i)²] and β̂_i minimizes E[(β_i − β̂_i)²] (see Problem 15.23).

An important example is the complex Bayesian linear model. It is defined as

x̃ = Hθ + w̃

where H is a complex N × p matrix (with N ≤ p possible), θ is a complex p × 1 random vector with θ ~ CN(μ_θ, C_θθ), and w̃ is a complex N × 1 random vector with w̃ ~ CN(0, C_w̃) and is independent of θ. The MMSE estimator of θ is given by E(θ|x̃), so that we need only find the first two moments of the PDF p(x̃, θ), according to (15.61) and (15.62). The mean is

E(x̃) = H E(θ) + E(w̃) = Hμ_θ

and the covariances are

[ C_θθ  C_θx̃ ]   [ E[(θ − E(θ))(θ − E(θ))^H]    E[(θ − E(θ))(x̃ − E(x̃))^H] ]
[ C_x̃θ  C_x̃x̃ ] = [ E[(x̃ − E(x̃))(θ − E(θ))^H]   E[(x̃ − E(x̃))(x̃ − E(x̃))^H] ]

with dimensions

[ p × p   p × N ]
[ N × p   N × N ].

x̃ and θ are therefore jointly complex Gaussian. Finally, we have as the MMSE estimator for the complex Bayesian linear model, from (15.61),

θ̂ = μ_θ + C_θθ H^H (H C_θθ H^H + C_w̃)^{-1} (x̃ − Hμ_θ)    (15.64)
  = μ_θ + (C_θθ^{-1} + H^H C_w̃^{-1} H)^{-1} H^H C_w̃^{-1} (x̃ − Hμ_θ)    (15.65)
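The two forms (15.64) and (15.65) are algebraically identical by the matrix inversion lemma, which is easy to confirm numerically. The sketch below uses arbitrary random H, μ_θ, C_θθ, C_w̃, and data x̃ (all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 8, 3

def crand(*shape):
    """Complex standard normal array (illustrative helper)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

H = crand(N, p)
mu = crand(p)
# Hermitian positive definite covariances (arbitrary choices)
Lt = crand(p, p); Ctt = Lt @ Lt.conj().T + np.eye(p)
Lw = crand(N, N); Cw = Lw @ Lw.conj().T + np.eye(N)
x = crand(N)

# (15.64): mu + Ctt H^H (H Ctt H^H + Cw)^{-1} (x - H mu)
g1 = mu + Ctt @ H.conj().T @ np.linalg.solve(H @ Ctt @ H.conj().T + Cw,
                                             x - H @ mu)
# (15.65): mu + (Ctt^{-1} + H^H Cw^{-1} H)^{-1} H^H Cw^{-1} (x - H mu)
CwiH = np.linalg.solve(Cw, H)                 # Cw^{-1} H
g2 = mu + np.linalg.solve(np.linalg.inv(Ctt) + H.conj().T @ CwiH,
                          CwiH.conj().T @ (x - H @ mu))
assert np.allclose(g1, g2)
```

Form (15.65) inverts p × p matrices rather than N × N ones, which is the usual reason to prefer it when N is large.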
and the minimum Bayesian MSE can be shown to be, from (15.62),

Bmse(θ̂_i) = [C_θθ − C_θθ H^H (H C_θθ H^H + C_w̃)^{-1} H C_θθ]_ii    (15.66)
          = [(C_θθ^{-1} + H^H C_w̃^{-1} H)^{-1}]_ii    (15.67)

which is just the [i, i] element of C_θ|x̃. The reader will note that the results are identical to those for the real case with the transposes replaced by conjugate transposes (see Section 11.6). An example follows.

Example 15.11 - Random Amplitude Signal in CWGN

Consider the data set

x̃[n] = Ã s[n] + w̃[n]    n = 0, 1, ..., N − 1

where Ã ~ CN(0, σ_A²), s[n] is a known signal, and w̃[n] is CWGN with variance σ². Applying (15.65) and (15.67) with θ = Ã, H = s = [s[0] s[1] ... s[N − 1]]^T, μ_θ = 0, C_θθ = σ_A², and C_w̃ = σ²I yields

Â = (1/σ_A² + s^H s/σ²)^{-1} s^H x̃/σ² = σ_A² s^H x̃ / (σ² + σ_A² s^H s)

Bmse(Â) = (1/σ_A² + s^H s/σ²)^{-1} = σ_A² σ² / (σ² + σ_A² s^H s).

In particular, for s[n] = 1 the estimator becomes

Â = (σ_A² / (σ_A² + σ²/N)) x̄

where x̄ is the sample mean of the x̃[n]'s. This is identical in form to the real case (see (10.31)). ◇

15.9 Asymptotic Complex Gaussian PDF

Similar to the real case, if the data record is large and the data are a segment from a complex WSS Gaussian random process with zero mean, then the PDF simplifies. In essence the covariance matrix becomes an autocorrelation matrix which, if large enough in dimension, is expressible in terms of the PSD. As we now show, the log-PDF becomes approximately

ln p(x̃; ξ) = −N ln π − N ∫_{−1/2}^{1/2} [ln P_x̃x̃(f) + I(f)/P_x̃x̃(f)] df    (15.68)

where

I(f) = (1/N) |Σ_{n=0}^{N−1} x̃[n] exp(−j2πfn)|²

is the periodogram of the data. We will first find the PDF of the DFT coefficients X̃(f_k), where f_k = k/N for k = 0, 1, ..., N − 1, and then transform back to x̃[n] to yield the PDF p(x̃; ξ). In matrix form the DFT can be expressed as

X̃ = E x̃
where X̃ = [X̃(f₀) X̃(f₁) ... X̃(f_{N−1})]^T and [E]_kn = exp(−j2πf_k n). If we normalize E by letting U = E/√N, so that X̃ = √N U x̃, then we note that U is a unitary matrix or U^H = U^{-1}. Next, since X̃ is the result of a linear transformation of x̃, X̃ is also complex Gaussian and has zero mean and covariance matrix N U C_x̃ U^H. As an example, if x̃ is CWGN so that C_x̃ = σ²I, then X̃ will have the PDF

X̃ ~ CN(0, C_X̃)

where C_X̃ = Nσ²I. The reader may wish to compare this to the results in Example 15.3 to determine the reason for the difference in the covariance matrices.

In general, to find the covariance matrix of the DFT coefficients we consider the [k, l] element of C_X̃ or E[X̃(f_k)X̃*(f_l)]. Because the DFT is periodic with period N, we can rewrite it as (for N even)

X̃(f_k) = Σ_{n=−N/2}^{N/2−1} x̃[n] exp(−j2πf_k n).

Then,

E[X̃(f_k)X̃*(f_l)] = Σ_{m=−N/2}^{N/2−1} Σ_{n=−N/2}^{N/2−1} E(x̃[m]x̃*[n]) exp[−j2π(f_k m − f_l n)]
                 = Σ_{m=−N/2}^{N/2−1} Σ_{n=−N/2}^{N/2−1} r_x̃x̃[m − n] exp[−j2π(f_k m − f_l n)].

Letting i = m − n and summing over i, the term in brackets approaches the PSD as N → ∞. Thus we have asymptotically

E[X̃(f_k)X̃*(f_l)] = { N P_x̃x̃(f_k)    k = l
                   { 0              otherwise.    (15.70)

The DFT coefficients are approximately uncorrelated, and since they are jointly complex Gaussian, they are also approximately independent. In summary, we have

X̃(f_k) ~a CN(0, N P_x̃x̃(f_k))    k = 0, 1, ..., N − 1    (15.71)

where "a" denotes the asymptotic PDF and the X̃(f_k)'s are asymptotically independent. Recall that for CWGN these properties are exact, not requiring N → ∞. This is due to the scaled identity covariance matrix of x̃ as well as to the unitary transformation of the DFT. In general, however, C_X̃ is only approximately diagonal due to the property that the columns of U or, equivalently, E are only approximately the eigenvectors of C_x̃, which is a Hermitian Toeplitz matrix [Grenander and Szego 1958]. In other words, C_X̃ = U(N C_x̃)U^H is approximately an eigendecomposition of N C_x̃. We can now write the asymptotic PDF of X̃ as X̃ ~ CN(0, C_X̃), where from (15.71)

C_X̃ = N diag(P_x̃x̃(f₀), P_x̃x̃(f₁), ..., P_x̃x̃(f_{N−1})).

To transform the PDF back to that of x̃ we note that X̃ = E x̃, and since U = E/√N,

x̃ = E^{-1} X̃ = (1/√N) U^{-1} X̃ = (1/√N) U^H X̃ = (1/N) E^H X̃

so that x̃ ~ CN(0, (1/N²) E^H C_X̃ E). In explicit form this becomes

ln p(x̃; ξ) = −N ln π − ln[ det((1/N) E^H E) det((1/N) C_X̃) ] − x̃^H E^H C_X̃^{-1} E x̃
           = −N ln π − Σ_{k=0}^{N−1} ln P_x̃x̃(f_k) − Σ_{k=0}^{N−1} ((1/N)|X̃(f_k)|² / P_x̃x̃(f_k))
           ≈ −N ln π − N ∫_0^1 ln P_x̃x̃(f) df − N ∫_0^1 (I(f)/P_x̃x̃(f)) df
           = −N ln π − N ∫_{−1/2}^{1/2} [ln P_x̃x̃(f) + I(f)/P_x̃x̃(f)] df.

In the last step we have used the fact that I(f) and P_x̃x̃(f) are periodic with period 1 to change the integration limits. To find the Fisher information matrix we need only compute the second partial derivatives of ln p(x̃; ξ), noting that

E(I(f)) = E((1/N)|X̃(f)|²) = (1/N) var(X̃(f)) ≈ P_x̃x̃(f)

for large data records.

To determine the quality of the periodogram as an estimator of the PSD we compute its mean and variance using the results of (15.71). The mean has already been shown to be approximately P_x̃x̃(f_k), so that it is asymptotically unbiased. The variance is asymptotically

var(I(f_k)) = E[I²(f_k)] − E²[I(f_k)] = E[I²(f_k)] − P_x̃x̃²(f_k).

But

E[I²(f_k)] = (1/N²) E[|X̃(f_k)|⁴] = 2 P_x̃x̃²(f_k)

making use of the fourth-order moment properties of jointly complex Gaussian random variables. Thus, we have that asymptotically

var(I(f_k)) = P_x̃x̃²(f_k)

and does not decrease with increasing N. Unfortunately, the periodogram is inconsistent, an example of which is shown in Figure 15.4 for CWGN with σ² = 1. As can be seen, the average value of the periodogram is the PSD value P_x̃x̃(f) = σ² = 1. However, the variance does not decrease as the data record length increases. The increasingly "ragged" nature of the spectral estimate results from the larger number of points in the PSD that we are estimating. As previously shown, these estimates are independent (since X̃(f_k) is independent of X̃(f_l) for k ≠ l), causing the more rapid fluctuations with increasing N. ◇

An example of the utility of the asymptotic results follows.
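As a numerical aside, the inconsistency just described is easy to reproduce: for CWGN the DFT-coefficient properties are exact at any N, so both the mean and the variance of a periodogram bin stay near P_x̃x̃ = σ² and σ⁴ regardless of record length. Bin index and trial counts below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
sig2 = 1.0   # CWGN, so Pxx(f) = sig2 = 1

def periodogram_bin(N, k, trials=3000):
    """Monte Carlo values of I(f_k) = |X(f_k)|^2 / N for CWGN data."""
    vals = np.empty(trials)
    for t in range(trials):
        x = np.sqrt(sig2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
        X = np.fft.fft(x)
        vals[t] = np.abs(X[k]) ** 2 / N
    return vals

for N in (32, 512):
    I = periodogram_bin(N, k=N // 4)
    # mean ~ Pxx = 1 and variance ~ Pxx^2 = 1, independent of N
    print(N, I.mean(), I.var())
```

Averaging over independent segments (as in Bartlett or Welch spectral estimation) is the usual remedy; the raw periodogram alone never converges.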
15.10 Signal Processing Examples

We first determine the MLE of the parameters of a complex sinusoid in CWGN; the reader may wish to consult Examples 3.14 and 7.16 to compare the use of real to complex data. Next, we show how an adaptive beamformer results from classical estimation theory.

Example 15.13 - Sinusoidal Parameter Estimation

Consider a data model for a complex sinusoid in CWGN or

x̃[n] = Ã exp(j2πf₀n) + w̃[n]    n = 0, 1, ..., N − 1

where Ã is a complex amplitude, f₀ is the frequency, and w̃[n] is a CWGN process with variance σ². We wish to estimate Ã and f₀. Since the linear model does not apply because of the unknown frequency, we resort to an MLE. To determine the asymptotic performance of the MLE we also compute the CRLB. Recall from Chapter 7 that at least at high SNRs and/or large data records the MLE is usually efficient. We observe that one parameter, Ã, is complex and one, f₀, is real. The simplest way to find the CRLB is to let Ã = A exp(jφ) and find it for the real parameter vector ξ = [A f₀ φ]^T
[Figure 15.4. Periodograms of CWGN realizations for increasing data record lengths (horizontal axes: frequency, −0.5 to 0.5; vertical axes in dB).]
where s̃[n] = Ã exp(j2πf₀n) = A exp[j(2πf₀n + φ)]. Now the partial derivatives are

∂s̃[n]/∂A = exp[j(2πf₀n + φ)]
∂s̃[n]/∂f₀ = j2πn A exp[j(2πf₀n + φ)]
∂s̃[n]/∂φ = jA exp[j(2πf₀n + φ)].

Then, from (15.52) we have

I(ξ) = (2/σ²) [ N   0                        0
                0   A² Σ_{n=0}^{N−1} (2πn)²   A² Σ_{n=0}^{N−1} 2πn
                0   A² Σ_{n=0}^{N−1} 2πn      NA² ].

Upon inversion and using (3.22) we have

var(Â) ≥ σ²/(2N)
var(f̂₀) ≥ 6σ² / ((2π)² A² N(N² − 1))
var(φ̂) ≥ σ²(2N − 1) / (A² N(N + 1))

and in terms of the SNR, which is η = A²/σ², this reduces to

var(Â) ≥ σ²/(2N)
var(f̂₀) ≥ 6 / ((2π)² η N(N² − 1))
var(φ̂) ≥ (2N − 1) / (η N(N + 1)).    (15.72)

The bounds for the complex case are one-half those for the real case for f̂₀ and φ̂, and one-fourth that for Â. (The reader should be warned, however, about drawing further conclusions. We are actually comparing "apples to oranges" since the data models are different.) To find the MLE we must maximize

p(x̃; ξ) = (1/(π^N det(σ²I))) exp[ −(1/σ²) Σ_{n=0}^{N−1} |x̃[n] − Ã exp(j2πf₀n)|² ]

or, equivalently, we must minimize

J(Ã, f₀) = Σ_{n=0}^{N−1} |x̃[n] − Ã exp(j2πf₀n)|².

The minimization can be done with respect to A, φ or, equivalently, with respect to Ã. The latter approach is easier since for f₀ fixed this is just a complex linear LS problem. In matrix form we have

J(Ã, f₀) = (x̃ − eÃ)^H (x̃ − eÃ)

where x̃ = [x̃[0] x̃[1] ... x̃[N − 1]]^T and e = [1 exp(j2πf₀) ... exp(j2πf₀(N − 1))]^T. But we have already shown in (15.49) that

∂J/∂Ã = −[e^H (x̃ − eÃ)]*.

Setting this equal to zero and solving produces

Ã̂ = e^H x̃ / (e^H e) = (1/N) Σ_{n=0}^{N−1} x̃[n] exp(−j2πf₀n).

Substituting back into J

J(Ã̂, f₀) = x̃^H (x̃ − eÃ̂) = x̃^H x̃ − x̃^H e e^H x̃ / (e^H e) = x̃^H x̃ − |e^H x̃|² / (e^H e).

To minimize J over f₀ we need to maximize

|e^H x̃|² / (e^H e) = (1/N) |Σ_{n=0}^{N−1} x̃[n] exp(−j2πf₀n)|²

which is recognized as the periodogram. The MLE f̂₀ is therefore the location of the peak of the periodogram, and the MLEs of A and φ are found from

Â = |Ã̂| = |(1/N) Σ_{n=0}^{N−1} x̃[n] exp(−j2πf̂₀n)|    (15.73)

φ̂ = arctan[ Im(Ã̂)/Re(Ã̂) ] = arctan[ Im(Σ_{n=0}^{N−1} x̃[n] exp(−j2πf̂₀n)) / Re(Σ_{n=0}^{N−1} x̃[n] exp(−j2πf̂₀n)) ].    (15.74)

Note that the MLE is exact and does not assume N is large, as was required for the real case. The need for large N in the real case so that f₀ is not near 0 or 1/2 can be explained by the inability of the periodogram to resolve complex sinusoids closer than about 1/N in frequency. For a real sinusoid

s[n] = A cos(2πf₀n + φ) = (A/2) exp[j(2πf₀n + φ)] + (A/2) exp[−j(2πf₀n + φ)]
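The periodogram-based MLE above translates directly into a short numerical sketch. A zero-padded FFT serves as the fine frequency grid for the maximization, and np.angle plays the role of the arctan in (15.74) while also resolving the quadrant. All signal parameters below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)
N, A, f0, phi = 128, 1.0, 0.21, 1.1
n = np.arange(N)
x = A * np.exp(1j * (2 * np.pi * f0 * n + phi)) \
    + np.sqrt(0.05) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

# f0_hat maximizes the periodogram; a zero-padded FFT gives a fine grid
M = 8192
X = np.fft.fft(x, M)
I = np.abs(X) ** 2 / N
k = np.argmax(I)
f0_hat = k / M

# (15.73), (15.74): amplitude and phase from the complex amplitude estimate
A_tilde = X[k] / N            # (1/N) sum x[n] exp(-j 2 pi f0_hat n)
A_hat = np.abs(A_tilde)
phi_hat = np.angle(A_tilde)
print(f0_hat, A_hat, phi_hat)  # should be near 0.21, 1.0, 1.1
```

A practical refinement would interpolate around the peak bin instead of enlarging M, but the coarse grid already localizes f₀ to within 1/M.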
the peak of the periodogram (even in the absence of noise) will shift from the true frequency if the frequency difference between the complex sinusoidal components or |f₀ − (−f₀)| = 2f₀ is not much greater than 1/N. As an example, in Figure 15.5 we plot the periodogram for A = 1, φ = π/2, f₀ = 0.1, 0.05, 0.025, and N = 20. No noise is present. The peak locations for f₀ = 0.1 and f₀ = 0.05 are slightly shifted towards f = 0. However, for f₀ = 0.025 the peak location is at f = 0. This problem does not occur with a complex sinusoid since there is only one complex sinusoidal component and therefore no interaction. See also Problem 15.21 for the extension to two complex sinusoids. ◇

[Figure 15.5. Periodogram of a real sinusoid with no noise present; panel (a): f₀ = 0.1.]
so that the signal is passed undistorted but the noise at the output is minimized. By undistorted we mean that if x̃(t) = s̃(t) = Ã exp(j2πF₀t)e, then at the beamformer output we should have

ỹ(t) = Ã exp(j2πF₀t).

This requires that

a^H s̃(t) = Ã exp(j2πF₀t)

or

a^H e Ã exp(j2πF₀t) = Ã exp(j2πF₀t)

or finally the constraint

e^H a = 1.

The noise at the beamformer output has variance

E[a^H w̃(t) w̃^H(t) a] = a^H C a

which is also the variance of ỹ(t). Hence, the optimal beamformer weights are given as the solution to the following problem: Minimize a^H C a subject to the constraint e^H a = 1. As a result, this is sometimes called a minimum variance distortionless response (MVDR) beamformer [Owsley 1985]. The reader will recognize that we have already solved this problem in connection with the BLUE. In fact, from (15.51) with W = C, B = e^H, and b = 1,

a_opt = C^{-1} e / (e^H C^{-1} e)

so that

ỹ(t) = a_opt^H x̃(t) = e^H C^{-1} x̃(t) / (e^H C^{-1} e).    (15.75)

As can be verified, ỹ(t) is just the BLUE of Ã exp(j2πF₀t) for a given t. In Example 15.7 we derived this result for F₀ = 0, and hence for e = 1. The name "adaptive beamformer" comes from the practical implementation of (15.75) in which the covariance matrix of the noise is usually unknown and hence must be estimated before a signal is present. The beamformer is then said to "adapt" to the noise field present. Of course, when an estimated covariance matrix is used, there is no optimality associated with the beamformer. The performance may be poor if a signal is present when the covariance is estimated [Cox 1973]. Note that if C = σ²I, so that the noise at the sensors is uncorrelated and of equal variance ("spatially white"), then (15.75) reduces to

ỹ(t) = (1/M) Σ_{n=0}^{M−1} x̃_n(t) exp(−j2πf_s n).

The beamformer first phases the signals at the various sensors to align them in time, because of the varying propagation delays, and then averages them. This is the so-called conventional beamformer [Knight, Pridham, and Kay 1981].

To illustrate the effect of a nonwhite spatial noise field assume that in addition to white noise we have an interfering plane wave at the same temporal frequency but a different spatial frequency f_i or arrival angle. Then, the model for the received data is

x̃(t) = Ã exp(j2πF₀t)e + B̃ exp(j2πF₀t)i + ñ(t)    (15.76)

where i = [1 exp(j2πf_i) ... exp(j2πf_i(M − 1))]^T and ñ(t) is a spatial white noise process; ñ(t) has zero mean and a covariance matrix σ²I for a fixed t. If we model the complex amplitude of the interference B̃ as a complex random variable with zero mean and variance P and independent of ñ(t), then the interference plus noise can be represented as

w̃(t) = B̃ exp(j2πF₀t)i + ñ(t)

where E(w̃(t)) = 0 and

C = E(w̃(t)w̃^H(t)) = P i i^H + σ²I.

Then, using Woodbury's identity (see Appendix 1),

C^{-1} = (1/σ²) (I − (P/(MP + σ²)) i i^H).

We see that the beamformer attempts to subtract out the interference, with the amount depending upon the interference-to-noise ratio (P/σ²) as well as the separation in arrival angles between the signal and the interference (e^H i). Note that if e^H i = 0, we will have a conventional beamformer because then a_opt = e/M. This is because the interference is then orthogonal to the signal direction vector. The interference response is unity when β_i = 90°, or when the interference arrival angle is the same as that of the signal. However, it quickly drops to about −60 dB when the difference in arrival angles is about 6°. Recall that the adaptive beamformer is constrained to pass the signal undistorted, so that when β_i = β_s, the interference cannot be attenuated. ◇
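The MVDR weights a_opt = C^{-1}e/(e^H C^{-1} e) and their interference-rejection behavior can be sketched as follows. Array size, spatial frequencies, and powers are arbitrary assumptions for the demo:

```python
import numpy as np

M, fs, fi = 10, 0.25, 0.1     # sensors, signal & interference spatial freqs
P, sig2 = 100.0, 1.0          # interference power, white-noise power
m = np.arange(M)
e = np.exp(1j * 2 * np.pi * fs * m)   # signal direction vector
i = np.exp(1j * 2 * np.pi * fi * m)   # interference direction vector

# interference-plus-noise spatial covariance: C = P i i^H + sig2 I
C = P * np.outer(i, i.conj()) + sig2 * np.eye(M)
Cinv_e = np.linalg.solve(C, e)
a = Cinv_e / (e.conj() @ Cinv_e)      # a_opt = C^{-1} e / (e^H C^{-1} e)

print(abs(a.conj() @ e))   # 1.0: signal passed undistorted
print(abs(a.conj() @ i))   # much less than 1: interference suppressed
```

Note that the distortionless response is exact by construction, while the interference gain shrinks as P/σ² grows, matching the discussion above.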
Problems

15.3 … Can we define a complex Gaussian random vector, and if so, what is its covariance matrix?

15.4 For the real covariance matrix in Problem 15.3 verify directly the following:
a. x^T C_x^{-1} x = 2 x̃^H C_x̃^{-1} x̃, where x = [u^T v^T]^T and x̃ = u + jv

b. det(C_x) = det²(C_x̃)/16

15.5 Let M₂ denote the vector space of real 2 × 2 matrices of the form

[ a  −b
  b   a ]

and C¹ the vector space of scalar complex numbers. Define addition and multiplication in the vector spaces in the usual way. Then, let m₁, m₂ be two vectors in M₂ and consider the operations

a. αm₁, where α is a real scalar
b. m₁ + m₂
c. m₁m₂

Show that these operations can be performed as follows.

a. Transform m₁, m₂ to C¹ using the transformation

[ a  −b
  b   a ]  →  a + jb.

b. Carry out the equivalent operation in C¹.

c. Transform the result back to M₂ using the inverse transformation

e + jf  →  [ e  −f
             f   e ].

Because operations in M₂ are equivalent to those in C¹, we say that M₂ is isomorphic to C¹.

15.6 Using the ideas in Problem 15.5, compute the matrix product A^T A, where

A = [ a₁  −b₁
      b₁   a₁
      a₂  −b₂
      b₂   a₂
      a₃  −b₃
      b₃   a₃ ]

by manipulating complex numbers.

15.7 In this problem we prove that all third-order moments of a complex Gaussian PDF are zero. The third-order moments are E(x̃³), E(x̃*³), E(x̃*x̃²), E(x̃x̃*²). Prove that these are all zero by using the characteristic function (see Appendix 15B)

φ_x̃(ω) = exp(−(1/4)σ²|ω|²).

15.8 Consider the zero mean complex random variable x̃. If we assume that E(x̃²) = 0, what does this say about the variances and covariances of the real and imaginary parts of x̃?

15.9 Show that the assumption E[(x̃ − μ̃)(x̃ − μ̃)^T] = 0, where x̃ = u + jv, leads to the special form of the real covariance matrix for [u^T v^T]^T given by (15.19).

15.10 A complex Gaussian random vector x̃ has the PDF x̃ ~ CN(0, σ²B), where B is a known complex covariance matrix. It is desired to estimate σ² using the estimator σ̂² = x̃^H A x̃, where A is a Hermitian matrix. Find A so that σ̂² is unbiased and has minimum variance. If B = I, what is σ̂²? Hint: Use (15.29) and (15.30) and the fact that tr(D^k) = Σ_{i=1}^N λ_i^k, where the λ_i's are the eigenvalues of the N × N matrix D and k is a positive integer.

15.11 If x̃[n] is CWGN with variance σ², find the mean and variance of Σ_{n=0}^{N−1} |x̃[n]|². Propose an estimator for σ². Compare its variance to that of σ̂² = (1/N) Σ_{n=0}^{N−1} x²[n] for the real WGN case. Explain your results.

15.12 In this problem we show that the random analytic signal is a complex Gaussian process. First consider the real zero mean WSS Gaussian random process u[n]. Assume that u[n] is input to a Hilbert transformer, which is a linear time-invariant system with frequency response

H(f) = { −j    0 ≤ f < 1/2
       {  j    −1/2 ≤ f < 0

and denote the output by v[n]. Show that v[n] is also a real zero mean WSS Gaussian process. Then, show that x̃[n] = u[n] + jv[n], termed the analytic signal, is a complex WSS Gaussian random process by verifying that r_uu[k] = r_vv[k] and r_uv[k] = −r_vu[k]. What is the PDF of x̃[n]?

15.13 Prove that ∂θ*/∂θ = 0. What would ∂θ/∂θ and ∂θ*/∂θ be if we used the alternative definition for the complex derivative?

15.14 Prove that … where b and θ are complex.

15.15 Determine the complex gradient of θ^H A θ with respect to complex θ for an arbitrary complex p × p matrix A, not necessarily Hermitian. Note that in this case θ^H A θ is not necessarily real.
15.16 If we observe the complex data

x̃[n] = Ã γⁿ + w̃[n]    n = 0, 1, ..., N − 1

where Ã is a complex deterministic amplitude, γ is a known complex constant, and w̃[n] is CWGN with variance σ², find the LSE and also the MLE of Ã. Also, find its mean and variance. Explain what happens as N → ∞ if |γ| < 1, |γ| = 1, and |γ| > 1.

15.17 We observe the complex data x̃[n] for n = 0, 1, ..., N − 1. It is known that the x̃[n]'s are uncorrelated and have equal variance σ². The mean is E(x̃[n]) = A, where A is real. Find the BLUE of A and compare it to the case when A is complex (see Example 15.7). Explain your results.

15.18 If we observe the complex data x̃[n] = s̃[n; θ] + w̃[n], where the deterministic signal is known to within a real parameter θ and w̃[n] is CWGN with variance σ², find the general CRLB for θ. Compare it to the case of real data (see Section 3.5) and explain the difference.

15.19 If x̃[n] is CWGN with variance σ², find the CRLB for σ² based on N complex samples. Can the bound be attained, and if so what is the efficient estimator?

15.20 In this problem we study the complex equivalent of the Gauss-Markov theorem (see Chapter 6). The data model is

x̃ = Hθ + w̃

where H is a known complex N × p matrix with N > p and full rank, θ is a complex p × 1 parameter vector to be estimated, and w̃ is a complex N × 1 noise vector with zero mean and covariance matrix C. Show that the BLUE of θ is given by (15.58). Hint: Let θ̂_i = a_i^H x̃ and use (15.51).

15.21 The MLE for the frequencies of two complex sinusoids in CWGN is explored in this problem, extending the results in Example 15.13. The data model is

x̃[n] = Ã₁ exp(j2πf₁n) + Ã₂ exp(j2πf₂n) + w̃[n]    n = 0, 1, ..., N − 1

where Ã₁, Ã₂ are complex deterministic amplitudes that are unknown, f₁, f₂ are unknown frequencies which are of primary interest to us, and w̃[n] is CWGN with variance σ². We wish to estimate the frequencies, but since the amplitudes are also unknown, we will need to estimate them as well. Show that to find the MLE of the frequencies we will need to maximize the function

J(f₁, f₂) = x̃^H E(E^H E)^{-1} E^H x̃

where E = [e₁ e₂] and e_i = [1 exp(j2πf_i) ... exp(j2πf_i(N − 1))]^T for i = 1, 2. To do so first note that for known frequencies the data model is linear in the amplitudes and so the PDF can be maximized easily over Ã₁, Ã₂. Finally, show that under the constraint |f₁ − f₂| ≫ 1/N, the function J decouples into the sum of two periodograms. Now determine the MLE. Hint: Show that E^H E is approximately diagonal when the frequencies are spaced far apart.

15.22 A signal frequently used in radar/sonar is a chirp whose discrete-time equivalent is

s̃[n] = Ã exp[j2π(f₀n + (1/2)αn²)]

where the parameter α is real. The instantaneous frequency of the chirp may be defined as the difference of the phase for successive samples or

f_i[n] = [f₀n + (1/2)αn²] − [f₀(n − 1) + (1/2)α(n − 1)²] = (f₀ − (1/2)α) + αn.

The parameter α is thus the frequency sweep rate. Assume the chirp signal is embedded in CWGN with variance σ² and N samples are observed. Show that the function to be maximized to yield the MLE of f₀ and α is

|Σ_{n=0}^{N−1} x̃[n] exp[−j2π(f₀n + (1/2)αn²)]|².

Assume that Ã is an unknown deterministic constant. Show how the function can be computed efficiently using an FFT for each assumed sweep rate.

15.23 Consider the complex random parameter θ = α + jβ, which is to be estimated based on the complex data vector x̃ = u + jv. The real Bayesian MMSE estimator is to be used for each real parameter. Hence, we wish to find the estimators that minimize the Bayesian MSEs

Bmse(α̂) = E[(α − α̂)²]
Bmse(β̂) = E[(β − β̂)²]

where the expectations are with respect to the PDFs p(u, v, α) and p(u, v, β), respectively. Show that

θ̂ = α̂ + jβ̂ = E(θ|u, v) = E(θ|x̃)

is the MMSE estimator. Comment on the PDF used to implement the expectation operator of E(θ|x̃).

15.24 Assume that x̃[n] is a zero mean complex Gaussian WSS random process whose PSD is given as P_x̃x̃(f) = P₀Q(f), where P₀ is a real deterministic parameter to be estimated. The function Q(f) satisfies

∫_{−1/2}^{1/2} Q(f) df = 1
so that P₀ is the total power in x̃[n]. Find the CRLB and the MLE for P₀ by using the exact results as well as the asymptotic results (see Section 15.9) and compare.

15.25 In this problem we examine the "processing gain" of a DFT in detecting a complex sinusoid in CWGN. Assume that we observe …
APPENDIX 15A. PROPERTIES OF COMPLEX COVARIANCE MATRICES

x̃^H C_x̃ x̃ = 2 x^T C_x x. Thus, if C_x is positive definite or x^T C_x x > 0 for all x ≠ 0, it follows that x̃^H C_x̃ x̃ > 0 for all x̃ ≠ 0 since x̃ = 0 if and only if x = 0.

Since

C_x = (1/2) [ A  −B
              B   A ]

the inverse is

C_x^{-1} = 2 [ (A + BA^{-1}B)^{-1}           (A + BA^{-1}B)^{-1} BA^{-1}
               −(A + BA^{-1}B)^{-1} BA^{-1}   (A + BA^{-1}B)^{-1} ]

and we can use the determinant formula for partitioned matrices (see Appendix 1) to yield

det(C_x) = (1/2)^{2n} det([ A  −B
                            B   A ]).

If [A −B; B A]^{-1} = [E −F; F E], then expanding [A −B; B A][E −F; F E] = I gives AE − BF = I and BE + AF = 0, so that (A + jB)(E + jF) = I.
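The determinant relation det(C_x) = (1/2)^{2n} det([A −B; B A]) = det²(C_x̃)/2^{2n} can be spot-checked numerically. The Hermitian covariance below is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Cx_c = G @ G.conj().T + np.eye(n)        # Hermitian p.d. complex covariance
A, B = Cx_c.real, Cx_c.imag

# Real covariance of [u^T v^T]^T has the special form (1/2)[[A, -B], [B, A]]
Cx_r = 0.5 * np.block([[A, -B], [B, A]])

lhs = np.linalg.det(Cx_r)
rhs = np.linalg.det(Cx_c).real ** 2 / 2 ** (2 * n)
assert np.allclose(lhs, rhs)
```

With n = 2 this reduces to det(C_x) = det²(C_x̃)/16, the relation verified in Problem 15.4b.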
APPENDIX 15B. PROPERTIES OF COMPLEX GAUSSIAN PDF

1. Any subset of a complex Gaussian random vector is also complex Gaussian.

This property follows from property 4 to be proven. If we let ỹ be a subset of the elements of x̃ by setting ỹ = Ax̃, where A is an appropriate linear transformation, then by property 4 ỹ is also complex Gaussian. For example, if x̃ = [x̃₁ x̃₂]^T is a complex Gaussian random vector, then x̃₁ = Ax̃ with A = [1 0] is complex Gaussian.

2. If x̃₁ and x̃₂ are jointly complex Gaussian and uncorrelated, they are also independent.

By inserting the covariance matrix

C_x̃ = [ σ₁²  0
        0    σ₂² ]

into (15.22), it is easy to see that the PDF factors as p(x̃) = p(x̃₁)p(x̃₂). Then, we note that p(x̃₁) is equivalent to p(u₁, v₁) and p(x̃₂) is equivalent to p(u₂, v₂). Hence, [u₁ v₁]^T is independent of [u₂ v₂]^T. Of course, this property extends to any number of uncorrelated random variables.

3. If x̃₁ and x̃₂ are each complex Gaussian and also independent, then x̃ = [x̃₁ x̃₂]^T is complex Gaussian.

This follows by forming p(x̃₁)p(x̃₂) and noting that it is equal to p(x̃) of (15.22). Again this extends to any number of independent complex Gaussian random variables.

4. A linear transformation of a complex Gaussian random vector is also complex Gaussian.

To show this, note that w^T μ = w_u^T μ_u + w_v^T μ_v = Re(w̃^H μ̃) and 2w^T C_x w = w̃^H C_x̃ w̃ from Appendix 15A. Likewise, we have that w^T x = Re(w̃^H x̃). Hence we can define the characteristic function of x̃ as

φ_x̃(w̃) = E[exp(j Re(w̃^H x̃))].    (15B.1)

Now, if ỹ = Ax̃ + b̃, we have

φ_ỹ(w̃) = E[exp(j Re(w̃^H ỹ))]
        = E[exp(j Re(w̃^H Ax̃ + w̃^H b̃))]
        = exp[j Re(w̃^H b̃)] φ_x̃(A^H w̃)
        = exp[j Re(w̃^H b̃)] exp[j Re(w̃^H Aμ̃)] exp[−(1/4) w̃^H A C_x̃ A^H w̃]
        = exp[j Re(w̃^H (Aμ̃ + b̃)) − (1/4) w̃^H A C_x̃ A^H w̃].

By identifying the characteristic function with that for the complex Gaussian PDF we have ỹ ~ CN(Aμ̃ + b̃, A C_x̃ A^H).
5. The sum of two independent complex Gaussian random variables is also complex Gaussian.

Using the characteristic function, we have

φ_{x̃₁+x̃₂}(w̃) = E[exp(j Re(w̃*(x̃₁ + x̃₂)))]
  = E[exp(j Re(w̃* x̃₁)) exp(j Re(w̃* x̃₂))]
  = E[exp(j Re(w̃* x̃₁))] E[exp(j Re(w̃* x̃₂))]
  = φ_x̃₁(w̃) φ_x̃₂(w̃)
  = exp[j Re(w̃* μ̃₁) − (1/4)|w̃|² σ₁²] exp[j Re(w̃* μ̃₂) − (1/4)|w̃|² σ₂²]
  = exp[j Re(w̃*(μ̃₁ + μ̃₂)) − (1/4)|w̃|²(σ₁² + σ₂²)].

Hence, we have x̃₁ + x̃₂ ~ CN(μ̃₁ + μ̃₂, σ₁² + σ₂²). Of course, this property extends to any number of independent complex Gaussian random variables.

6. The fourth moment of a complex Gaussian random vector is given by (15.24).

Consider the characteristic function for x̃ = [x̃₁ x̃₂ x̃₃ x̃₄]^T, where x̃ is complex Gaussian with zero mean. Since Re(w̃^H x̃) = (1/2)(w̃^H x̃ + x̃^H w̃),

∂φ_x̃(w̃)/∂w̃_i* = E{ (j/2) x̃_i exp[(j/2)(w̃^H x̃ + x̃^H w̃)] }

since ∂w̃_i/∂w̃_i = 1 and ∂w̃_i*/∂w̃_i = 0 (similar to the results of (15.41) and (15.42)). By repeated application (15B.3) follows. We now need to evaluate the fourth partial derivative of

φ_x̃(w̃) = exp(−(1/4) w̃^H C_x̃ w̃) = exp(−(1/4) Σ_i Σ_j w̃_i* [C_x̃]_ij w̃_j)

which is straightforward but tedious. Only the last two terms will be nonzero after differentiation and setting w̃ = 0. Hence, (15.24) follows.
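A scalar consequence of the fourth-moment property (15.24) is E|x̃|⁴ = 2σ⁴ for x̃ ~ CN(0, σ²), along with E(x̃²) = 0. A Monte Carlo spot-check (sample size and σ² are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
sig2 = 3.0
# scalar complex Gaussian samples, variance sig2 (sig2/2 per component)
x = np.sqrt(sig2 / 2) * (rng.standard_normal(200_000)
                         + 1j * rng.standard_normal(200_000))

print(np.mean(np.abs(x) ** 2))   # ~ sig2
print(np.mean(np.abs(x) ** 4))   # ~ 2 * sig2**2
print(np.mean(x ** 2))           # ~ 0 (circularity)
```

The factor of 2 (rather than 3 in the real Gaussian case) and the vanishing pseudo-moment E(x̃²) are both distinctive features of the circular complex Gaussian model used throughout the chapter.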
7. The conditional PDF of a complex Gaussian PDF is also complex Gaussian with mean and covariance given by (15.61) and (15.62). The easiest way to verify the form of the conditional PDF is to consider the complex random vector z = y - C_yx C_xx⁻¹ x. The PDF of z is complex Gaussian, being a linear transformation of jointly complex Gaussian random vectors, with mean

   E(z) = E(y) - C_yx C_xx⁻¹ E(x)

   and covariance

   C_z = E[(y - E(y) - C_yx C_xx⁻¹(x - E(x))) (y - E(y) - C_yx C_xx⁻¹(x - E(x)))^H]
       = C_yy - C_yx C_xx⁻¹ C_xy - C_yx C_xx⁻¹ C_xy + C_yx C_xx⁻¹ C_xx C_xx⁻¹ C_xy
       = C_yy - C_yx C_xx⁻¹ C_xy.

   But conditioned on x we have y = z + C_yx C_xx⁻¹ x, where x is just a constant. Then, p(y|x) must be a complex Gaussian PDF since z is complex Gaussian and x is a constant. The mean and covariance are found as

   E(y|x) = E(z) + C_yx C_xx⁻¹ x
          = E(y) - C_yx C_xx⁻¹ E(x) + C_yx C_xx⁻¹ x
          = E(y) + C_yx C_xx⁻¹ (x - E(x))

   and

   C_{y|x} = C_z = C_yy - C_yx C_xx⁻¹ C_xy.

Appendix 15C

Derivation of CRLB and MLE Formulas

We first prove (15.60) using (15.47). We require the derivatives of ln det(C_x(ξ)) and of (x̃ - μ̃(ξ))^H C_x⁻¹(ξ)(x̃ - μ̃(ξ)) with respect to ξ_i:

∂ln p(x̃; ξ)/∂ξ_i = -tr(C_x⁻¹(ξ) ∂C_x(ξ)/∂ξ_i) - (x̃ - μ̃(ξ))^H ∂/∂ξ_i [C_x⁻¹(ξ)(x̃ - μ̃(ξ))]
                    + ∂μ̃^H(ξ)/∂ξ_i C_x⁻¹(ξ)(x̃ - μ̃(ξ)).

But

∂/∂ξ_i [C_x⁻¹(ξ)(x̃ - μ̃(ξ))] = -C_x⁻¹(ξ) ∂μ̃(ξ)/∂ξ_i + ∂C_x⁻¹(ξ)/∂ξ_i (x̃ - μ̃(ξ))
    = -C_x⁻¹(ξ) ∂μ̃(ξ)/∂ξ_i - C_x⁻¹(ξ) ∂C_x(ξ)/∂ξ_i C_x⁻¹(ξ)(x̃ - μ̃(ξ)).

The last step follows from C_x C_x⁻¹ = I, so that

∂C_x/∂ξ_i C_x⁻¹ + C_x ∂C_x⁻¹/∂ξ_i = 0.

Thus,

∂ln p(x̃; ξ)/∂ξ_i = -tr(C_x⁻¹(ξ) ∂C_x(ξ)/∂ξ_i)
                    + (x̃ - μ̃(ξ))^H C_x⁻¹(ξ) ∂C_x(ξ)/∂ξ_i C_x⁻¹(ξ)(x̃ - μ̃(ξ))
                    + 2 Re[∂μ̃^H(ξ)/∂ξ_i C_x⁻¹(ξ)(x̃ - μ̃(ξ))]
from which (15.60) follows. Next we prove (15.52). To do so we need the following lemma. If x̃ ~ CN(0, C), then for A and B Hermitian matrices [Miller 1974]

E(x̃^H A x̃ x̃^H B x̃) = tr(AC) tr(BC) + tr(ACBC).

We note that all first- and third-order moments of ỹ = x̃ - μ̃ are zero. Also, all second-order moments of the form E(ỹỹ^T), and therefore E(ỹ*ỹ^H) = [E(ỹỹ^T)]*, are zero (see Problem 15.7), so that only the even-order terms survive in the expectation E[∂ln p/∂ξ_i · ∂ln p/∂ξ_j]. Note that the last two terms of ∂ln p(x̃; ξ)/∂ξ_i are complex conjugates of each other. Now, using the lemma with

A = C_x⁻¹(ξ) ∂C_x(ξ)/∂ξ_i C_x⁻¹(ξ)    and    B = C_x⁻¹(ξ) ∂C_x(ξ)/∂ξ_j C_x⁻¹(ξ)

and simplifying, the products of the form

tr(C_x⁻¹(ξ) ∂C_x(ξ)/∂ξ_i) tr(C_x⁻¹(ξ) ∂C_x(ξ)/∂ξ_j)

cancel with those arising from the -tr(·) terms, from which (15.52) follows.
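The matrix derivative identities used in this appendix, ∂ln det C/∂ξ = tr(C⁻¹ ∂C/∂ξ) and ∂C⁻¹/∂ξ = -C⁻¹(∂C/∂ξ)C⁻¹, can be checked numerically by finite differences. A minimal sketch (the covariance family C(ξ) = C₀ + ξD is an arbitrary illustrative choice):

```python
import numpy as np

# Covariance family C(xi) = C0 + xi * D, with C0 positive definite and D symmetric
C0 = np.array([[4.0, 1.0, 0.5],
               [1.0, 3.0, 0.2],
               [0.5, 0.2, 2.0]])
D = np.array([[1.0, 0.3, 0.0],
              [0.3, 0.5, 0.1],
              [0.0, 0.1, 0.2]])

def C(xi):
    return C0 + xi * D

xi0, h = 0.7, 1e-6
Ci = np.linalg.inv(C(xi0))

# dC^-1/dxi = -C^-1 (dC/dxi) C^-1, checked against a central difference
num_dCinv = (np.linalg.inv(C(xi0 + h)) - np.linalg.inv(C(xi0 - h))) / (2 * h)
ana_dCinv = -Ci @ D @ Ci
print(np.max(np.abs(num_dCinv - ana_dCinv)))     # tiny (finite-difference error)

# d ln det C / dxi = tr(C^-1 dC/dxi)
num_dld = (np.log(np.linalg.det(C(xi0 + h)))
           - np.log(np.linalg.det(C(xi0 - h)))) / (2 * h)
ana_dld = np.trace(Ci @ D)
print(num_dld, ana_dld)                           # agree closely
```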
Appendix 1

Review of Important Concepts

A1.1 Linear and Matrix Algebra

A1.1.1 Definitions

Consider an m x n matrix A with elements a_ij, i = 1, 2, ..., m; j = 1, 2, ..., n. A shorthand notation for describing A is

[A]_ij = a_ij.

The transpose of A, which is denoted by A^T, is defined as the n x m matrix with elements a_ji or

[A^T]_ij = a_ji.

A square matrix is one for which m = n. A square matrix is symmetric if A^T = A. The rank of a matrix is the number of linearly independent rows or columns, whichever is less. The inverse of a square n x n matrix is the square n x n matrix A⁻¹ for which

A⁻¹A = AA⁻¹ = I

where I is the n x n identity matrix. The inverse will exist if and only if the rank of A is n. If the inverse does not exist, then A is singular. The determinant of a square n x n matrix is denoted by det(A). It is computed as

det(A) = sum_{j=1}^{n} a_ij C_ij

where C_ij = (-1)^{i+j} M_ij and M_ij is the minor of a_ij obtained by deleting the ith row and jth column of A.
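The cofactor expansion of the determinant can be implemented directly. A small recursive sketch (illustrative only; it expands along the first row with 0-based indices) agrees with np.linalg.det:

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row:
    det(A) = sum_j a_0j * (-1)^j * M_0j, M_0j being the minor of a_0j."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # delete row 0 and column j to form the minor
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        total += ((-1) ** j) * A[0, j] * det_cofactor(minor)
    return total

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
print(det_cofactor(A), np.linalg.det(A))   # both 18 (up to roundoff)
```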
The quadratic form Q is defined as

Q = sum_{i=1}^{n} sum_{j=1}^{n} a_ij x_i x_j.

In defining the quadratic form it is assumed that a_ji = a_ij. This entails no loss in generality since any quadratic function may be expressed in this manner. Q may also be expressed as

Q = x^T A x

where x = [x_1 x_2 ... x_n]^T and A is a square n x n matrix with a_ji = a_ij, or A is a symmetric matrix.

A square n x n matrix A is positive semidefinite if A is symmetric and

x^T A x >= 0

for all x != 0. If the quadratic form is strictly positive, then A is positive definite. When referring to a matrix as positive definite or positive semidefinite, it is always assumed that the matrix is symmetric.

The trace of a square n x n matrix is the sum of its diagonal elements or

tr(A) = sum_{i=1}^{n} a_ii.

A partitioned m x n matrix A is one that is expressed in terms of its submatrices. An example is the 2 x 2 partitioning

A = [ A_11  A_12 ]
    [ A_21  A_22 ].

Each "element" A_ij is a submatrix of A. The dimensions of the partitions are given as

[ k x l          k x (n - l)        ]
[ (m - k) x l    (m - k) x (n - l)  ].

A block-diagonal matrix is a square matrix of the form

A = [ A_11   0     ...   0    ]
    [ 0      A_22  ...   0    ]
    [ .      .     .     .    ]
    [ 0      0     ...   A_kk ]

in which all submatrices A_ii are square and the other submatrices are identically zero. The dimensions of the submatrices need not be identical. For instance, if k = 2, A_11 might have dimension 2 x 2 while A_22 might be a scalar. If all A_ii are nonsingular, then the inverse is easily found as the block-diagonal matrix of the inverses A_ii⁻¹. Also, the determinant is

det(A) = prod_{i=1}^{k} det(A_ii).

A1.1.2 Special Matrices

A diagonal matrix is a square n x n matrix with a_ij = 0 for i != j, or all elements off the principal diagonal are zero. A diagonal matrix appears as

A = diag(a_11, a_22, ..., a_nn).

A square n x n matrix is orthogonal if A⁻¹ = A^T. For a matrix to be orthogonal the columns (and rows) must be orthonormal, or if A = [a_1 a_2 ... a_n], where a_i denotes the ith column, the conditions

a_i^T a_j = { 0,  i != j
              1,  i = j }

must be satisfied. An important example of an orthogonal matrix arises in modeling of data by a sum of harmonically related sinusoids or by a discrete Fourier series. As an example, for n even the n x n matrix whose kth row (k = 0, 1, ..., n - 1) is

sqrt(2/n) [ 1/sqrt(2)   cos(2 pi k/n)  ...  cos(2 pi k (n/2 - 1)/n)   (1/sqrt(2)) cos(pi k)   sin(2 pi k/n)  ...  sin(2 pi k (n/2 - 1)/n) ]

is an orthogonal matrix.
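This harmonic matrix is easily constructed and checked numerically. A short sketch (the choice n = 8 is an arbitrary illustration) confirms that A^T A = I:

```python
import numpy as np

# Build the discrete Fourier series (harmonic) matrix for even n and verify
# that its columns are orthonormal.
n = 8
k = np.arange(n)

cols = [np.full(n, 1 / np.sqrt(2))]                       # constant column
cols += [np.cos(2 * np.pi * k * i / n) for i in range(1, n // 2)]
cols += [np.cos(np.pi * k) / np.sqrt(2)]                  # the i = n/2 column
cols += [np.sin(2 * np.pi * k * i / n) for i in range(1, n // 2)]
A = np.sqrt(2 / n) * np.column_stack(cols)

print(np.allclose(A.T @ A, np.eye(n)))   # True: A is orthogonal
```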
This follows from the orthogonality relationships: for i, j = 0, 1, ..., n/2

sum_{k=0}^{n-1} cos(2 pi k i/n) cos(2 pi k j/n) = { 0,    i != j
                                                    n/2,  i = j = 1, 2, ..., n/2 - 1
                                                    n,    i = j = 0, n/2 }

and for i, j = 1, 2, ..., n/2 - 1

sum_{k=0}^{n-1} sin(2 pi k i/n) sin(2 pi k j/n) = (n/2) delta_ij

and finally for i = 0, 1, ..., n/2; j = 1, 2, ..., n/2 - 1

sum_{k=0}^{n-1} cos(2 pi k i/n) sin(2 pi k j/n) = 0.

These orthogonality relationships may be proven by expressing the sines and cosines in terms of complex exponentials and using the result

sum_{k=0}^{n-1} exp(j 2 pi k l/n) = n delta_{l0}

for l = 0, 1, ..., n - 1 [Oppenheim and Schafer 1975].

An idempotent matrix is a square n x n matrix which satisfies

A² = A.

This condition implies that A^l = A for l >= 1. An example is the projection matrix

P = H (H^T H)⁻¹ H^T

where H is an m x n full rank matrix with m > n.

A square n x n Toeplitz matrix is defined as

[A]_ij = a_{i-j}

or each element along a northwest-southeast diagonal is the same.

A1.1.3 Matrix Manipulation and Formulas

Some useful formulas for the algebraic manipulation of matrices are summarized in this section. For n x n matrices A and B the following relationships are useful:

(AB)^T     = B^T A^T
(A^T)⁻¹    = (A⁻¹)^T
(AB)⁻¹     = B⁻¹ A⁻¹
det(A^T)   = det(A)
det(cA)    = c^n det(A)    (c a scalar)
det(AB)    = det(A) det(B)
det(A⁻¹)   = 1 / det(A)
tr(AB)     = tr(BA)
tr(A^T B)  = sum_{i=1}^{n} sum_{j=1}^{n} [A]_ij [B]_ij.

Also, for vectors x and y we have

x^T y = tr(x y^T).

It is frequently necessary to determine the inverse of a matrix analytically. To do so one can make use of the following formula. The inverse of a square n x n matrix is

A⁻¹ = C^T / det(A)

where C is the square n x n matrix of cofactors of A. The cofactor matrix is defined by

[C]_ij = (-1)^{i+j} M_ij

where M_ij is the minor of a_ij obtained by deleting the ith row and jth column of A. Another formula which is quite useful is the matrix inversion lemma

(A + BCD)⁻¹ = A⁻¹ - A⁻¹ B (D A⁻¹ B + C⁻¹)⁻¹ D A⁻¹.
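The matrix inversion lemma can be verified numerically for randomly chosen conformable matrices. A minimal sketch (the diagonal choices for A and C simply guarantee invertibility; the dimensions are illustrative):

```python
import numpy as np

# Matrix inversion lemma (Woodbury identity):
# (A + B C D)^-1 = A^-1 - A^-1 B (D A^-1 B + C^-1)^-1 D A^-1
rng = np.random.default_rng(1)
n, k = 4, 2
A = np.diag(rng.uniform(1.0, 2.0, n))     # easy-to-invert base matrix
B = rng.standard_normal((n, k))
C = np.diag(rng.uniform(1.0, 2.0, k))
D = rng.standard_normal((k, n))

lhs = np.linalg.inv(A + B @ C @ D)
Ai = np.linalg.inv(A)
rhs = Ai - Ai @ B @ np.linalg.inv(D @ Ai @ B + np.linalg.inv(C)) @ D @ Ai
print(np.allclose(lhs, rhs))    # True
```

The lemma is most useful when A⁻¹ is already known (here a diagonal A), so that only a k x k system needs to be solved instead of an n x n one.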
For partitioned matrices the submatrices which are multiplied together must be conformable. As an illustration, for 2 x 2 partitioned matrices

AB = [ A_11 B_11 + A_12 B_21    A_11 B_12 + A_12 B_22 ]
     [ A_21 B_11 + A_22 B_21    A_21 B_12 + A_22 B_22 ].

The transposition of a partitioned matrix is formed by transposing the submatrices of the matrix and applying T to each submatrix. For a 2 x 2 partitioned matrix

A^T = [ A_11^T  A_21^T ]
      [ A_12^T  A_22^T ].

A1.1.4 Theorems

Some important theorems used throughout the text are summarized in this section.

1. For a square n x n matrix A with eigenvalues lambda_i and corresponding eigenvectors as the columns of V,

   AV = V Lambda    (A1.4)

   where Lambda = diag(lambda_1, lambda_2, ..., lambda_n), and det(A) = prod_{i=1}^{n} lambda_i.

2. A symmetric matrix A is positive definite if and only if it can be written as in (A1.2) with C square and full rank, or equivalently the principal minors are all positive. (The ith principal minor is the determinant of the submatrix formed by deleting all rows and columns with an index greater than i.) If A can be written as in (A1.2), but C is not full rank, or the principal minors are only nonnegative, then A is positive semidefinite.

3. If A is positive definite, then the inverse exists and may be found from (A1.2) as A⁻¹ = (C⁻¹)^T (C⁻¹).

4. Let A be positive definite. If B is an m x n matrix of full rank with m < n, then B A B^T is also positive definite.

5. If A is positive definite (positive semidefinite), then the diagonal elements are positive (nonnegative).
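The principal-minor test for positive definiteness can be illustrated numerically. The sketch below (the tridiagonal test matrix is an arbitrary positive definite example) compares the leading principal minors with an eigenvalue check:

```python
import numpy as np

def principal_minors(A):
    """Leading principal minors det(A[:i, :i]), i = 1, ..., n."""
    return [np.linalg.det(A[:i, :i]) for i in range(1, A.shape[0] + 1)]

A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])   # a symmetric positive definite example

minors = principal_minors(A)
print(minors)                               # [2.0, 3.0, 4.0] (up to roundoff): all positive
print(np.all(np.linalg.eigvalsh(A) > 0))    # True, consistent with the minors test
```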
A1.2 Probability, Random Processes, and Time Series Models

An assumption is made that the reader already has some familiarity with probability theory and basic random process theory. This chapter serves as a review of these topics. For those readers needing a more extensive treatment, the text by Papoulis [1965] on probability and random processes is recommended. For a discussion of time series modeling see [Kay 1988].

A1.2.1 Useful Probability Density Functions

A probability density function (PDF) which is frequently used to model the statistical behavior of a random variable is the Gaussian distribution. A random variable x with mean mu_x and variance sigma_x² is distributed according to a Gaussian or normal distribution as

p(x) = (1 / sqrt(2 pi sigma_x²)) exp[-(1 / (2 sigma_x²)) (x - mu_x)²],    -infinity < x < infinity.    (A1.5)

The shorthand notation x ~ N(mu_x, sigma_x²) is often used, where ~ means "is distributed according to." If x ~ N(0, sigma_x²), then the moments of x are

E(x^k) = { 1 · 3 · ... · (k - 1) sigma_x^k,   k even
           0,                                  k odd. }

If a multivariate Gaussian random vector x ~ N(mu_x, C_x) is linearly transformed as

y = Ax + b

where A is m x n and b is m x 1 with m < n and A full rank (so that C_y is nonsingular), then y is also distributed according to a multivariate Gaussian distribution with

E(y) = mu_y = A mu_x + b

and

E[(y - mu_y)(y - mu_y)^T] = C_y = A C_x A^T.

Another useful PDF is the chi-squared distribution, which is derived from the Gaussian distribution. If x is composed of independent and identically distributed random variables with x_i ~ N(0, 1), i = 1, 2, ..., n, then

y = sum_{i=1}^{n} x_i² ~ chi_n²

where chi_n² denotes a chi-squared distribution with n degrees of freedom. Its PDF is

p(y) = { (1 / (2^{n/2} Gamma(n/2))) y^{n/2 - 1} exp(-y/2),   y >= 0
         0,                                                   y < 0 }

where Gamma(u) is the gamma integral. The mean and variance of y are

E(y) = n
var(y) = 2n.
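The Gaussian moment formula and the chi-squared mean and variance are easy to confirm by simulation. A short sketch (the sample size, seed, and parameter choices are arbitrary; agreement is only to within Monte Carlo error):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 200_000, 5

# Gaussian moments: for x ~ N(0, sigma^2), E(x^k) = 1*3*...*(k-1) sigma^k for k even,
# so E(x^4) = 3 sigma^4.
sigma = 2.0
x = sigma * rng.standard_normal(N)
print(np.mean(x**4), 3 * sigma**4)     # close to 48

# Chi-squared: y is a sum of n squared N(0,1) variables, so E(y) = n, var(y) = 2n.
y = np.sum(rng.standard_normal((N, n))**2, axis=1)
print(np.mean(y), np.var(y))           # close to 5 and 10
```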
A1.2.2 Random Process Characterization

A discrete random process x[n] is a sequence of random variables defined for every integer n. If the discrete random process is wide sense stationary (WSS), then it has a mean

E(x[n]) = mu_x

which does not depend on n, and an autocorrelation function (ACF)

r_xx[k] = E(x[n] x[n + k])

which depends only on the lag k between the samples. In a similar manner, two jointly WSS random processes x[n] and y[n] have a cross-correlation function (CCF)

r_xy[k] = E(x[n] y[n + k]).

Some useful properties of the ACF and CCF are

r_xx[0] >= |r_xx[k]|
r_xx[-k] = r_xx[k]
r_xy[-k] = r_yx[k].

Note that r_xx[0] is positive, which follows from (A1.7); the last property follows from the definition of the CCF.

The z transforms of the ACF and CCF are defined as

P_xx(z) = sum_{k=-infinity}^{infinity} r_xx[k] z^{-k}
P_xy(z) = sum_{k=-infinity}^{infinity} r_xy[k] z^{-k}.

Evaluating the z transform on the unit circle z = exp(j 2 pi f) yields the power spectral density (PSD)

P_xx(f) = sum_{k=-infinity}^{infinity} r_xx[k] exp(-j 2 pi f k)    (A1.9)

and similarly the cross-spectral density P_xy(f). For white noise the ACF is a discrete delta function, r_xx[k] = sigma² delta[k], so that P_xx(f) = sigma² and is observed to be completely flat with frequency. Alternatively, white noise is composed of equi-power contributions from all frequencies.

For a linear shift invariant (LSI) system with impulse response h[n] and with a WSS random process input, various relationships between the correlations and spectral density functions of the input process x[n] and output process y[n] hold. The correlation relationship is

r_xy[k] = h[k] * r_xx[k] = sum_{l=-infinity}^{infinity} h[l] r_xx[k - l]

where * denotes convolution. In particular, letting H(f) = H(exp(j 2 pi f)) be the frequency response of the LSI system results in

P_xy(f) = H(f) P_xx(f)
P_yx(f) = H*(f) P_xx(f)
P_yy(f) = |H(f)|² P_xx(f).

For the special case of a white noise input process the output PSD becomes

P_yy(f) = sigma_u² |H(f)|²    (A1.10)

since P_xx(f) = sigma_u². This will form the basis of the time series models to be described shortly.

A1.2.3 Gaussian Random Process

For N samples of a WSS Gaussian random process the covariance matrix, or more appropriately the autocorrelation matrix,

R_xx = [ r_xx[0]        r_xx[1]        ...  r_xx[N - 1] ]
       [ r_xx[1]        r_xx[0]        ...  r_xx[N - 2] ]
       [ .              .              .    .           ]
       [ r_xx[N - 1]    r_xx[N - 2]    ...  r_xx[0]     ]

has the special symmetric Toeplitz structure of (A1.1) with a_k = a_{-k}. An important Gaussian random process is the white process. As discussed previously, the ACF for a white process is a discrete delta function. In light of the definition of a Gaussian random process, a white Gaussian random process x[n] with mean zero and variance sigma² is one for which

x[n] ~ N(0, sigma²),    -infinity < n < infinity

and r_xx[k] = sigma² delta[k].

The first time series model is termed an autoregressive (AR) process, which has the time domain representation

x[n] = - sum_{k=1}^{p} a[k] x[n - k] + u[n]

where u[n] is white Gaussian noise with variance sigma_u². It is said to be an AR process of order p and is denoted by AR(p). The AR parameters consist of the filter coefficients {a[1], a[2], ..., a[p]} and the driving white noise variance sigma_u². Since the frequency response of the generating filter is

H(f) = 1 / (1 + sum_{k=1}^{p} a[k] exp(-j 2 pi f k))

the AR PSD is

P_xx(f) = sigma_u² / |1 + sum_{k=1}^{p} a[k] exp(-j 2 pi f k)|².

The ACF of an AR(p) process satisfies the Yule-Walker equations

r_xx[k] = { - sum_{l=1}^{p} a[l] r_xx[k - l],             k >= 1
            - sum_{l=1}^{p} a[l] r_xx[l] + sigma_u²,      k = 0. }

In matrix form this becomes, for k = 1, 2, ..., p,

[ r_xx[0]        r_xx[1]        ...  r_xx[p - 1] ] [ a[1] ]        [ r_xx[1] ]
[ r_xx[1]        r_xx[0]        ...  r_xx[p - 2] ] [ a[2] ]  =  -  [ r_xx[2] ]
[ .              .              .    .           ] [ .    ]        [ .       ]
[ r_xx[p - 1]    r_xx[p - 2]    ...  r_xx[0]     ] [ a[p] ]        [ r_xx[p] ]

and, from the k = 0 equation,

sigma_u² = r_xx[0] + sum_{l=1}^{p} a[l] r_xx[l].
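The Yule-Walker equations suggest a simple estimator of the AR parameters: estimate the ACF from the data and solve the p x p Toeplitz system. A sketch for an AR(2) process (the coefficients, sample size, and seed are illustrative assumptions):

```python
import numpy as np

# Simulate an AR(2) process, estimate its ACF, and solve R a = -r for the AR
# coefficients; then recover sigma_u^2 from the k = 0 Yule-Walker equation.
rng = np.random.default_rng(0)
a_true = np.array([-0.75, 0.5])        # x[n] = 0.75 x[n-1] - 0.5 x[n-2] + u[n]
sigma_u2 = 1.0
N = 200_000

u = np.sqrt(sigma_u2) * rng.standard_normal(N)
x = np.zeros(N)
for n in range(2, N):
    x[n] = -a_true[0] * x[n - 1] - a_true[1] * x[n - 2] + u[n]

# Biased ACF estimate r_xx[k] = (1/N) sum_n x[n] x[n+k]
r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(3)])

R = np.array([[r[0], r[1]],
              [r[1], r[0]]])           # symmetric Toeplitz autocorrelation matrix
a_hat = np.linalg.solve(R, -r[1:3])
sigma_hat = r[0] + a_hat @ r[1:3]      # k = 0 equation
print(a_hat)                           # close to [-0.75, 0.5]
print(sigma_hat)                       # close to 1.0
```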
For an AR(1) process the Yule-Walker equations give r_xx[k] = -a[1] r_xx[k - 1] for k >= 1, and from

sigma_u² = r_xx[0] + a[1] r_xx[1]

we can solve for r_xx[0] to produce the ACF of an AR(1) process

r_xx[k] = (sigma_u² / (1 - a²[1])) (-a[1])^{|k|}.

The corresponding PSD is

P_xx(f) = sigma_u² / |1 + a[1] exp(-j 2 pi f)|².

The AR(1) PSD will have most of its power at lower frequencies if a[1] < 0, and at higher frequencies if a[1] > 0. The system function for the AR(1) process is

H(z) = 1 / (1 + a[1] z^{-1})

and has a pole at z = -a[1]. Hence, for a stable process we must have |a[1]| < 1.

While the AR process is generated as the output of a filter having only poles, the moving average (MA) process is formed by passing white noise through a filter whose system function has only zeros. The time domain representation of an MA(q) process is

x[n] = u[n] + sum_{k=1}^{q} b[k] u[n - k].

Since the frequency response of the generating filter is

H(f) = 1 + sum_{k=1}^{q} b[k] exp(-j 2 pi f k)

its PSD is

P_xx(f) = sigma_u² |1 + sum_{k=1}^{q} b[k] exp(-j 2 pi f k)|².

References

Graybill, F.A., Introduction to Matrices with Application in Statistics, Wadsworth, Belmont, Calif., 1969.
Kay, S., Modern Spectral Estimation: Theory and Application, Prentice-Hall, Englewood Cliffs, N.J., 1988.
Noble, B., J.W. Daniel, Applied Linear Algebra, Prentice-Hall, Englewood Cliffs, N.J., 1977.
Oppenheim, A.V., R.W. Schafer, Digital Signal Processing, Prentice-Hall, Englewood Cliffs, N.J., 1975.
Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, 1965.
Appendix 2

Glossary of Symbols and Abbreviations

Symbols

(Boldface characters denote vectors or matrices. All others are scalars.)

*                 complex conjugate
*                 convolution
^                 denotes estimator
~                 denotes is distributed according to
a~                denotes is asymptotically distributed according to
arg max_θ g(θ)    denotes the value of θ that maximizes g(θ)
δ(t)              Dirac delta function
δ[n]              discrete-time impulse sequence
δ_ij              Kronecker delta
Δ                 time sampling interval
det(A)            determinant of matrix A
diag(···)         diagonal matrix with elements ··· on main diagonal
e_i               natural unit vector in ith direction
E                 expected value
E_x               expected value with respect to PDF of x
E_{x|θ} or E(x|θ) conditional expected value with respect to PDF of x conditioned on θ
ℰ                 energy
η                 signal-to-noise ratio
f                 discrete-time frequency
F                 continuous-time frequency
𝓕                 Fourier transform
𝓕⁻¹               inverse Fourier transform
H                 conjugate transpose (superscript)
H                 observation matrix
⟨x, y⟩            inner product of x and y
i(θ)              Fisher information for single data sample and scalar θ
I(θ)              Fisher information for scalar θ
I                 identity matrix
I(θ)              Fisher information matrix for vector θ
I(f)              periodogram
Im( )             imaginary part
j                 sqrt(-1)
mse(θ̂)            mean square error of θ̂ (classical)
M_θ̂               mean square error matrix of θ̂
μ                 mean
n                 sequence index
N                 length of observed data set
N(μ, σ²)          normal distribution with mean μ and variance σ²
N(μ, C)           multivariate normal distribution with mean μ and covariance C
‖x‖               norm of x
1                 vector of all ones
p(x) or p_x(x)    probability density function of x
p(x; θ)           probability density function of x with θ as a parameter
p(x|θ)            conditional probability density function of x conditioned on θ
P                 projection matrix
P⊥                orthogonal projection matrix
∂/∂x              gradient vector with respect to x
∂²/(∂x ∂x^T)      Hessian matrix with respect to x
Pr{ }             probability
P_xx(f)           power spectral density of discrete-time process x[n]
P_xy(f)           cross-spectral density of discrete-time processes x[n] and y[n]
P_xx(F)           power spectral density of continuous-time process x(t)
ρ                 correlation coefficient
r_xx[k]           autocorrelation function of discrete-time process x[n]
r_xx(τ)           autocorrelation function of continuous-time process x(t)
r_xy[k]           cross-correlation function of discrete-time processes x[n] and y[n]
r_xy(τ)           cross-correlation function of continuous-time processes x(t) and y(t)
R_xx              autocorrelation matrix of x
Re( )             real part
σ²                variance
s[n]              discrete-time signal
s                 vector of signal samples
s(t)              continuous-time signal
t                 continuous time
tr(A)             trace of matrix A
θ (θ)             unknown parameter (vector)
θ̂ (θ̂)             estimator of θ (θ)
T                 transpose (superscript)
U[a, b]           uniform distribution over the interval [a, b]
var(x)            variance of x
var(x|θ)          variance of conditional PDF p(x|θ)
w[n]              observation noise sequence
Index
Examples (Contd.): Grid search, 177 Linear model (classical): Maximum likelihood estimator (Contd.):
sinusoidal amplitude, LSE, 255-56 CRLB,85 complex data, 530-31, 563-65
sinusoidal complex amplitude, MMSE es- Hermitian form: definition, 84, 94-95, 97, 529-30 definition, 162, 182
timator, 534-35 definition, 502 efficiency, 85-86 efficiency, 164, 187
sinusoidal modeling, complex, 496-97 minimization, 521-23 estimator and properties, 85, 486-88 Gaussian PDF, 185
sinusoidal parameters, complex MLE, moments, 502-3, 513 line fitting, 45 invariance, 174-76, 185
539-44 Histogram, 10, 165, 206-7, 209 MLE,186 numerical determination, 177-82, 187-89
sinusoidal parameters, CRLB, 56-57 reduced, 99, 254 probability density function, asymptotic,
sinusoidal parameters, LSE, 222-23 Image signal processing, 365 sufficient statistics, 126 167, 183, 211-13
sinusoidal parameters, MLE, 193-95 Innovations, 396, 433, 441 Linear predictive coding, 5, 59, 198, 407 properties, asymptotic, 172, 201-2
sinusoidal parameters, sufficient statistics. In-phase signal, 495-96 Linear random process, 77 Mean square bandwidth, 55
117-18 Interference suppression, 270 Line arrays, 58, 145 Mean square error:
sinusoidal power. complex MVU estimator. Interpolation, 412 Line fitting, 41, 83-84, 237-40, 373 Bayesian, 311, 320, 347, 533
525-27 LMMSE (see Linear minimum mean square classical, 19
sufficient statistic, completeness of, 110-11 Kalman filter: error estimator) Mean square error matrix, 361-62, 390
sufficient statistic, incompleteness of, definition, 436, 446-49, 455 Localization, source, 142-46, 456-66 Minimal sufficient statistic, 102, 117
111-12 derivation, 471-75 LPC (see Linear predictive coding) Minimum mean square error estimator:
sufficient statistic verification, 103-4 extended, 451-52, 462, 476-77 LS, LSE (see Least squares) Bayesian:
vehicle tracking, 456-66 gain, 436, 447 Lyapunov equation, 430 definition, 313, 316, 346
Wiener filtering, 365-70, 400-409, 443-45
Expectation-maximization, 182, 187-89
properties, 349-50
Exponential PDF family: MAP (see Maximum a posteriori estimator)
classical, 19, 311
definition, see Probability density func- Least squares: Matrix:
Minimum variance distortionless response,
tions BLUE, relationship with, 225 autocorrelation, 62, 93
546
MLE,200 constrained, 252 determinant. 567
Minimum variance unbiased estimator:
Exponential signals: definition. 220-21 diagonaL 568-69
definition, 20
estimation, 257-58, 298-99 estimator, 225 eigenanalysis, 573
determination of, 109, 112-13
modified Yule-Walker equations, 268 Hermitian. 501
linear model, 85-86
Fading signal, 100, 452 nonlinear, 222, 254 idempotent, 194, 570
MLE (see Maximum likelihood estimator)
Finite impulse response filter, 90-94 numerical determination, 259-60 ill-conditioned, 85, 98, 240-41
MMSE (see Minimum mean square error
FIR (see Finite impulse response filter) order-recursive, 237, 282-84 inversion:
estimator)
Fisher information: separable, 222-23, 256-57 definition. 567
Modeling:
decoupled matrix, 41, 65 sequential, 249. 279, 286-88 lemma, 571
Woodbury's identity, 571 dynamical signal, 421
definition, 34, 40 weighted, 150, 225-26, 244-48, 270 identifiability, 85
properties, 35, 65 Levinson recursion, 198, 403 orthogonal, 569
partitioned. 571-72 least squares, 232-34
Fourier analysis, 88-90, 226-27, 250-51, Likelihood function:
positive definite (semidefinite), 568, 572 linearization, 143, 259, 273, 451, 461
347-49, 362-64, 399-400 definition, 29
projection, 231. 242, 277, 285 speech spectrum, 5 (see also Autoregres-
Frequency estimation (see Sinusoidal estima- modified, 175, 185
square. 567 sive and Linear predictive coding)
tion and Examples) Linear minimum mean square error estima-
symmetric, 567 Moments, method of:
tor:
Toeplitz. 62, 93, 570 definition, 293
Gaussian random process, 467, 513, 577-78 definition, 380--82, 389
trace, 568 exponential parameter, estimator, 292,
Gauss-Markov process: properties, 390
transpose, 567 295-97
definition, 421, 426, 430--31 sequential, 393, 398, 415-18
Maximum a posteriori estimator: Gaussian mixture, 290--91, 293-94
properties, 424, 429 vector space interpretation, 386
definition, 344, 351, 354, 372 Monte Carlo method, 10, 164-167, 205-10
Gauss-Markov theorem, 141, 143, 552 Linear model (Bayesian):
properties, 358, 372 Moving average:
Gauss-Newton iteration, 260 definition, 325
Gradient formulas, 73-74, 84, 519-21 Maximum likelihood estimator: asymptotic MLE, 190--91
Kalman filter modeling, 447
Gram-Schmidt orthogonalization, 236, 396, MMSE estimator, 364-65, 533-34 asymptotic, 190 definition, 580
411 properties, 487-89 Bayesian. 352 MSE (see Mean square error)
MVU (see Minimum variance unbiased esti- Probability density functions (Contd.): Sinusoidal estimation (Contd.): Unbiased estimator, 16, 22
mator) Laplacian, 63 phase estimator, 123, 167-72
lognormal, 147 sufficient statistics, 117-18 Vector spaces:
Narrowband representation, 495 Rayleigh, 122, 371 Sinusoidal modeling, complex, 496 least squares, 227-30
Newton-Raphson iteration, 179-82, 187, 259 Processing gain, 554 Slutsky's theorem, 201 random variables, 384
Neyman-Fisher factorization, 104-5, 117, Projection theorem, orthogonal, 228-29, 386 Smoothing, Wiener, 400
127-29 Prony method, 264 Sonar signal processing, 2 Wavenumber (see Spatial frequency)
Normal equations, 225, 387 PSD (see Power spectral density) Spatial frequency, 58, 195 WGN (see White Gaussian noise)
Notational conventions, 13 (see also Ap- Pseudorandom noise, 92, 165, 206 Spectral estimation: White Gaussian noise:
pendix 2) Pythagorean theorem, least squares, 276 autoregressive, 60 real, 7
Nuisance parameters, 329 Fourier analysis, 88-90 complex, 517
Quadratic form: periodogram, 204, 538-39, 543, 552 Whitening:
Observation equation, 446 definition. 568 Speech recognition, 4 Kalman, 441, 444
Observation matrix, 84, 100, 140, 224 moments, 76 State transition matrix, 426 matrix transformation, 94-96
Order statistics, 114 Quadrature signal, 495-96 State vector, 424 White noise, 576
Orthogonality, 89, 385 (see also Projection Statistical linearization, 39, 200 (see also Wide sense stationary, 575
theorem, orthogonal) Radar signal processing, 1 Modeling) Wiener filtering, 365-70, 373-74, 379, 400--409,
Outliers, 170 Random number generator (see Pseudoran- Sufficient statistic, 22, 102-3, 107, 116 443
dom noise) System identification: Wiener-Hopf equations:
PDF (see Probability density functions) Random variable, complex, 500-501 nonrandom FIR, 90-94, 99
Periodogram, 80, 190, 195, 197, 204 (see also Range estimation, 1, 14, 53-56, 192 random FIR, 452-55
Spectral estimation) Rao-Blackwell-Lehmann-Scheffe theorem, WSS (see Wide sense stationary)
Phase-locked loop, 273-75 RBLS (see Rao-Blackwell-Lehmann-Scheffe Tapped delay line (see FIR)
Posterior PDF: Rayleigh fading, 347 Threshold effect, 170 Yule-Walker equations:
Bayesian linear model, 326, 533 RBLS (see Rao-B1ackwell-Lehmann-Scheffe Time delay estimation, 53-56, 142-46 AR, 198, 579
definition, 313, 317 theorem) Time difference of arrival, 142 ARMA,267
Power estimation, random process, 66, 203, Regression, nonlinear, 254 Time series, 6
553-54 Regularity conditions, 30, 44, 63, 67, 70 Tracking:
Power spectral density, 576-77 Reproducing PDF, 321. 334-35 frequency, 470 (see also Phase-locked loop)
Prediction: Ricatti equation, 443 vehicle position, 456-66
Kalman, 440-41. 469-70 Risk, Bayes, 342
Wiener, 400
Prior PDF: Sample mean estimator, 115, 121, 164 (see
conjugate, 335 (see also Reproducing also DC level in noise)
PDF) Sample variance estimator, 121, 164
definition. 313 Scoring, 180, 187
noninformative. 332, 336 Seismic signal processing, 365
Probability density functions: Separability, least squares, 222-23, 256
chi-squared, 122.575 Signal amplitude estimator, 136, 498-500
complex Gaussian: Sinusoidal estimation:
conditional, 508-9, 562 amplitudes, 88-90
definition, 503-4, 507 complex data. 525-27, 531-32, 534-35, 543
properties, 508-9, 550, 558-62 CRLB for frequency, 36
exponential, 122 CRLB for parameters, 56-57, 542
exponential family, 110, 124 CRLB for phase, 33
gamma, inverted, 329-30, 355 EM for frequency, 187-89
Gaussian, 574 least squares for parameters, 255-56
Gaussian, conditional, 323-25, 337-39 method of moments for frequency, 300, 306
Gaussian mixture, 150 MLE for parameters, 193-95, 203-4