
Lectures On Statistics
Robert B. Ash

Preface
These notes are based on a course that I gave at UIUC in 1996 and again in 1997. No
prior knowledge of statistics is assumed. A standard first course in probability is a prerequisite, but the first 8 lectures review results from basic probability that are important
in statistics. Some exposure to matrix algebra is needed to cope with the multivariate
normal distribution in Lecture 21, and there is a linear algebra review in Lecture 19. Here
are the lecture titles:
1. Transformation of Random Variables
2. Jacobians
3. Moment-generating functions
4. Sampling from a normal population
5. The T and F distributions
6. Order statistics
7. The weak law of large numbers
8. The central limit theorem
9. Estimation
10. Confidence intervals
11. More confidence intervals
12. Hypothesis testing
13. Chi square tests
14. Sufficient statistics
15. Rao-Blackwell theorem
16. Lehmann-Scheffé theorem
17. Complete sufficient statistics for the exponential class
18. Bayes estimates
19. Linear algebra review
20. Correlation
21. The multivariate normal distribution
22. The bivariate normal distribution
23. Cramer-Rao inequality
24. Nonparametric statistics
25. The Wilcoxon test
© 2007 by Robert B. Ash. Paper or electronic copies for personal use may be made freely without explicit permission of the author. All other rights are reserved.

Lecture 1. Transformation of Random Variables


Suppose we are given a random variable X with density fX (x). We apply a function g
to produce a random variable Y = g(X). We can think of X as the input to a black
box, and Y the output. We wish to find the density or distribution function of Y. We
illustrate the technique for the example in Figure 1.1.

Figure 1.1: the transformation Y = X² applied to the density fX(x), which equals 1/2 for −1 ≤ x ≤ 0 and (1/2)e^{−x} for x > 0; the points −√y and √y are marked on the x-axis.
The distribution function method finds FY directly, and then fY by differentiation.
We have FY(y) = 0 for y < 0. If y ≥ 0, then P{Y ≤ y} = P{−√y ≤ X ≤ √y}.

Case 1. 0 ≤ y ≤ 1 (Figure 1.2). Then

FY(y) = (1/2)√y + ∫_0^√y (1/2)e^{−x} dx = (1/2)√y + (1/2)(1 − e^{−√y}).

Figure 1.2: the interval −√y ≤ x ≤ √y for 0 ≤ y ≤ 1, superimposed on the density fX(x).
Case 2. y > 1 (Figure 1.3). Then

FY(y) = 1/2 + ∫_0^√y (1/2)e^{−x} dx = 1/2 + (1/2)(1 − e^{−√y}).

Figure 1.3: the interval −√y ≤ x ≤ √y for y > 1; since the density vanishes below −1, only the part of the interval from −1 to √y contributes.

The density of Y is 0 for y < 0 and

fY(y) = (1/(4√y))(1 + e^{−√y}),   0 < y < 1;

fY(y) = (1/(4√y)) e^{−√y},   y > 1.

See Figure 1.4 for a sketch of fY and FY . (You can take fY (y) to be anything you like at
y = 1 because {Y = 1} has probability zero.)

Figure 1.4: sketches of fY(y) and of FY(y), which equals (1/2)√y + (1/2)(1 − e^{−√y}) on [0, 1] and 1/2 + (1/2)(1 − e^{−√y}) for y > 1.
The density function method finds fY directly, and then FY by integration; see Figure 1.5. We have fY(y)|dy| = fX(√y) dx + fX(−√y) dx; we write |dy| because probabilities are never negative. Thus

fY(y) = fX(√y)/|dy/dx|_{x=√y} + fX(−√y)/|dy/dx|_{x=−√y}

with y = x², dy/dx = 2x, so

fY(y) = fX(√y)/(2√y) + fX(−√y)/(2√y).

(Note that |−2√y| = 2√y.) We have fY(y) = 0 for y < 0, and:


Case 1. 0 < y < 1 (see Figure 1.2).

fY(y) = (1/2)e^{−√y}/(2√y) + (1/2)/(2√y) = (1/(4√y))(1 + e^{−√y}).

Case 2. y > 1 (see Figure 1.3).

fY(y) = (1/2)e^{−√y}/(2√y) + 0 = (1/(4√y)) e^{−√y}
as before.

Figure 1.5: the graph of y = x²; the horizontal level y pulls back to the two points x = √y and x = −√y.
The distribution function method generalizes to situations where we have a single output but more than one input. For example, let X and Y be independent, each uniformly
distributed on [0, 1]. The distribution function of Z = X + Y is
 
FZ(z) = P{X + Y ≤ z} = ∫∫_{x+y≤z} fXY(x, y) dx dy

with fXY(x, y) = fX(x)fY(y) by independence. Now FZ(z) = 0 for z < 0 and FZ(z) = 1 for z > 2 (because 0 ≤ Z ≤ 2).
Case 1. If 0 ≤ z ≤ 1, then FZ(z) is the shaded area in Figure 1.6, which is z²/2.
Case 2. If 1 ≤ z ≤ 2, then FZ(z) is the shaded area in Figure 1.7, which is 1 − [(2 − z)²/2].
Thus (see Figure 1.8)

fZ(z) = z,  0 ≤ z ≤ 1;   fZ(z) = 2 − z,  1 ≤ z ≤ 2;   fZ(z) = 0 elsewhere.
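The triangular density just derived is easy to check numerically. The following sketch is an illustration added here (not part of the original notes) and assumes NumPy is available; it compares a histogram of simulated values of Z = X + Y with the formula.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.uniform(0, 1, n) + rng.uniform(0, 1, n)   # Z = X + Y, X and Y independent uniform(0,1)

# Empirical density of Z from a normalized histogram
hist, edges = np.histogram(z, bins=50, range=(0, 2), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])

# Exact triangular density: z on [0,1], 2 - z on [1,2]
exact = np.where(mid <= 1, mid, 2 - mid)

print("max |empirical - exact| =", np.max(np.abs(hist - exact)))   # should be small (a few hundredths)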

Problems
1. Let X, Y, Z be independent, identically distributed (from now on, abbreviated iid) random variables, each with density f(x) = 6x⁵ for 0 ≤ x ≤ 1, and 0 elsewhere. Find the distribution and density functions of the maximum of X, Y and Z.
2. Let X and Y be independent, each with density e^{−x}, x ≥ 0. Find the distribution (from now on, an abbreviation for "find the distribution or density function") of Z = Y/X.
3. A discrete random variable X takes values x1 , . . . , xn , each with probability 1/n. Let
Y = g(X) where g is an arbitrary real-valued function. Express the probability function
of Y (pY (y) = P {Y = y}) in terms of g and the xi .

Figures 1.6 and 1.7: the region {x + y ≤ z} intersected with the unit square. For 0 ≤ z ≤ 1 it is a triangle of area z²/2; for 1 ≤ z ≤ 2 it is the square minus a triangle of area (2 − z)²/2.

Figure 1.8: the triangular density fZ(z), rising to 1 at z = 1.
4. A random variable X has density f(x) = ax² on the interval [0, b]. Find the density of Y = X³.
5. The Cauchy density is given by f(y) = 1/[π(1 + y²)] for all real y. Show that one way to produce this density is to take the tangent of a random variable X that is uniformly distributed between −π/2 and π/2.

Lecture 2. Jacobians
We need this idea to generalize the density function method to problems where there are
k inputs and k outputs, with k ≥ 2. However, if there are k inputs and j < k outputs,
often extra outputs can be introduced, as we will see later in the lecture.

2.1 The Setup


Let X = X(U, V), Y = Y(U, V). Assume a one-to-one transformation, so that we can solve for U and V. Thus U = U(X, Y), V = V(X, Y). Look at Figure 2.1. If u changes by du then x changes by (∂x/∂u) du and y changes by (∂y/∂u) du. Similarly, if v changes by dv then x changes by (∂x/∂v) dv and y changes by (∂y/∂v) dv. The small rectangle in the u-v plane corresponds to a small parallelogram in the x-y plane (Figure 2.2), with A = (∂x/∂u, ∂y/∂u, 0) du and B = (∂x/∂v, ∂y/∂v, 0) dv. The area of the parallelogram is |A × B| and

A × B = det[ I  J  K ; ∂x/∂u  ∂y/∂u  0 ; ∂x/∂v  ∂y/∂v  0 ] du dv = det[ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] du dv K.

(A determinant is unchanged if we transpose the matrix, i.e., interchange rows and columns.)

Figure 2.1: the small rectangle R with sides du and dv in the u-v plane.
Figure 2.2: the corresponding parallelogram S in the x-y plane, with sides given by the vectors A and B.

2.2 Definition and Discussion

The Jacobian of the transformation is

J = det[ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ],   written as ∂(x, y)/∂(u, v).

Thus |A × B| = |J| du dv. Now P{(X, Y) ∈ S} = P{(U, V) ∈ R}, in other words, fXY(x, y) times the area of S is fUV(u, v) times the area of R. Thus

fXY(x, y)|J| du dv = fUV(u, v) du dv

and

fUV(u, v) = fXY(x, y) |∂(x, y)/∂(u, v)|.

The absolute value of the Jacobian ∂(x, y)/∂(u, v) gives a magnification factor for area in going from u-v coordinates to x-y coordinates. The magnification factor going the other way is |∂(u, v)/∂(x, y)|. But the magnification factor from u-v to u-v is 1, so

fUV(u, v) = fXY(x, y) / |∂(u, v)/∂(x, y)|.

In this formula, we must substitute x = x(u, v), y = y(u, v) to express the final result in terms of u and v.
In three dimensions, a small rectangular box with volume du dv dw corresponds to a parallelepiped in xyz space, determined by vectors

A = (∂x/∂u, ∂y/∂u, ∂z/∂u) du,   B = (∂x/∂v, ∂y/∂v, ∂z/∂v) dv,   C = (∂x/∂w, ∂y/∂w, ∂z/∂w) dw.

The volume of the parallelepiped is the absolute value of the dot product of A with B × C, and the dot product can be written as a determinant with rows (or columns) A, B, C. This determinant is the Jacobian of x, y, z with respect to u, v, w [written ∂(x, y, z)/∂(u, v, w)], times du dv dw. The volume magnification from uvw to xyz space is |∂(x, y, z)/∂(u, v, w)| and we have

fUVW(u, v, w) = fXYZ(x, y, z) / |∂(u, v, w)/∂(x, y, z)|

with x = x(u, v, w), y = y(u, v, w), z = z(u, v, w).


The Jacobian technique extends to higher dimensions. The transformation formula is a natural generalization of the two and three-dimensional cases:

fY1···Yn(y1, ..., yn) = fX1···Xn(x1, ..., xn) / |∂(y1, ..., yn)/∂(x1, ..., xn)|

where ∂(y1, ..., yn)/∂(x1, ..., xn) is the determinant of the n × n matrix whose (i, j) entry is ∂yi/∂xj.

To help you remember the formula, think f(y) dy = f(x) dx.

2.3 A Typical Application


Let X and Y be independent, positive random variables with densities fX and fY , and let
Z = XY. We find the density of Z by introducing a new random variable W, as follows:
Z = XY,   W = Y

(W = X would be equally good). The transformation is one-to-one because we can solve for X, Y in terms of Z, W by X = Z/W, Y = W. In a problem of this type, we must always pay attention to the range of the variables: x > 0, y > 0 is equivalent to z > 0, w > 0. Now

fZW(z, w) = fXY(x, y) / |∂(z, w)/∂(x, y)|  evaluated at x = z/w, y = w,

with

∂(z, w)/∂(x, y) = det[ ∂z/∂x  ∂z/∂y ; ∂w/∂x  ∂w/∂y ] = det[ y  x ; 0  1 ] = y.

Thus

fZW(z, w) = fX(x)fY(y)/w = fX(z/w)fY(w)/w

and we are left with the problem of finding the marginal density from a joint density:

fZ(z) = ∫_0^∞ fZW(z, w) dw = ∫_0^∞ (1/w) fX(z/w) fY(w) dw.
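As a concrete check of this marginal-density formula, the sketch below (an illustration added here, not part of the original notes; it assumes NumPy and SciPy are available) takes X and Y to be independent gamma(α = 2, β = 1) random variables and compares the integral fZ(z) = ∫ (1/w) fX(z/w) fY(w) dw, evaluated numerically, with a histogram of simulated products.

import numpy as np
from scipy import integrate

rng = np.random.default_rng(1)
f = lambda x: x * np.exp(-x)          # gamma density with alpha = 2, beta = 1 (same for X and Y)

def f_Z(z):
    # f_Z(z) = integral over w > 0 of (1/w) f_X(z/w) f_Y(w) dw
    integrand = lambda w: (1.0 / w) * f(z / w) * f(w)
    val, _ = integrate.quad(integrand, 0, np.inf)
    return val

z_sim = rng.gamma(2.0, 1.0, 200_000) * rng.gamma(2.0, 1.0, 200_000)
hist, edges = np.histogram(z_sim, bins=60, range=(0, 15), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])
formula = np.array([f_Z(z) for z in mid])
print("max |histogram - formula| =", round(np.max(np.abs(hist - formula)), 4))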

Problems
1. The joint density of two random variables X1 and X2 is f(x1, x2) = 2e^{−x1}e^{−x2}, where 0 < x1 < x2 < ∞; f(x1, x2) = 0 elsewhere. Consider the transformation Y1 = 2X1, Y2 = X2 − X1. Find the joint density of Y1 and Y2, and conclude that Y1 and Y2 are independent.
2. Repeat Problem 1 with the following new data. The joint density is given by f(x1, x2) = 8x1x2, 0 < x1 < x2 < 1; f(x1, x2) = 0 elsewhere; Y1 = X1/X2, Y2 = X2.
3. Repeat Problem 1 with the following new data. We now have three iid random variables Xi, i = 1, 2, 3, each with density e^{−x}, x > 0. The transformation equations are given by Y1 = X1/(X1 + X2), Y2 = (X1 + X2)/(X1 + X2 + X3), Y3 = X1 + X2 + X3. As before, find the joint density of the Yi and show that Y1, Y2 and Y3 are independent.

Comments on the Problem Set


In Problem 3, notice that Y1Y2Y3 = X1 and Y2Y3 = X1 + X2, so X2 = Y2Y3 − Y1Y2Y3 and X3 = (X1 + X2 + X3) − (X1 + X2) = Y3 − Y2Y3.
If fXY(x, y) = g(x)h(y) for all x, y, then X and Y are independent, because

f(y|x) = fXY(x, y)/fX(x) = g(x)h(y) / [g(x) ∫ h(y) dy]
which does not depend on x. The set of points where g(x) = 0 (equivalently fX (x) = 0)
can be ignored because it has probability zero. It is important to realize that in this
argument, "for all x, y" means that x and y must be allowed to vary independently of each other, so the set of possible x and y must be of the rectangular form a < x < b, c < y < d. (The constants a, b, c, d can be infinite.) For example, if fXY(x, y) = 2e^{−x}e^{−y} for 0 < y < x, and 0 elsewhere, then X and Y are not independent. Knowing x forces 0 < y < x, so the conditional distribution of Y given X = x certainly depends on x. Note that fXY(x, y) is not a function of x alone times a function of y alone. We have

fXY(x, y) = 2e^{−x}e^{−y} I[0 < y < x]

where the indicator I is 1 for 0 < y < x and 0 elsewhere.
In Jacobian problems, pay close attention to the range of the variables. For example, in Problem 1 we have y1 = 2x1, y2 = x2 − x1, so x1 = y1/2, x2 = (y1/2) + y2. From these equations it follows that 0 < x1 < x2 < ∞ is equivalent to y1 > 0, y2 > 0.

Lecture 3. Moment-Generating Functions


3.1 Definition
The moment-generating function of a random variable X is defined by

M(t) = MX(t) = E[e^{tX}]

where t is a real number. To see the reason for the terminology, note that M(t) is the expectation of 1 + tX + t²X²/2! + t³X³/3! + ···. If μn = E(Xⁿ), the n-th moment of X, and we can take the expectation term by term, then

M(t) = 1 + μ1 t + μ2 t²/2! + ··· + μn tⁿ/n! + ···.

Since the coefficient of tⁿ in the Taylor expansion is M⁽ⁿ⁾(0)/n!, where M⁽ⁿ⁾ is the n-th derivative of M, we have μn = M⁽ⁿ⁾(0).

3.2 The Key Theorem

If Y = Σ_{i=1}^n Xi where X1, ..., Xn are independent, then MY(t) = Π_{i=1}^n MXi(t).
Proof. First note that if X and Y are independent, then

E[g(X)h(Y)] = ∫∫ g(x)h(y) fXY(x, y) dx dy.

Since fXY(x, y) = fX(x)fY(y), the double integral becomes

[∫ g(x)fX(x) dx][∫ h(y)fY(y) dy] = E[g(X)]E[h(Y)]

and similarly for more than two random variables. Now if Y = X1 + ··· + Xn with the Xi's independent, we have

MY(t) = E[e^{tY}] = E[e^{tX1} ··· e^{tXn}] = E[e^{tX1}] ··· E[e^{tXn}] = MX1(t) ··· MXn(t).

3.3 The Main Application


Given independent random variables X1, ..., Xn with densities f1, ..., fn respectively, find the density of Y = Σ_{i=1}^n Xi.
Step 1. Compute Mi(t), the moment-generating function of Xi, for each i.
Step 2. Compute MY(t) = Π_{i=1}^n Mi(t).
Step 3. From MY(t) find fY(y).
This technique is known as a transform method. Notice that the moment-generating function and the density of a random variable are related by M(t) = ∫ e^{tx} f(x) dx. With t replaced by −s we have a Laplace transform, and with t replaced by it we have a Fourier transform. The strategy works because at step 3, the moment-generating function determines the density uniquely. (This is a theorem from Laplace or Fourier transform theory.)

3.4 Examples
1. Bernoulli Trials. Let X be the number of successes in n trials with probability of success p on a given trial. Then X = X1 + ··· + Xn, where Xi = 1 if there is a success on trial i and Xi = 0 if there is a failure on trial i. Thus

Mi(t) = E[e^{tXi}] = P{Xi = 1}e^{t·1} + P{Xi = 0}e^{t·0} = pe^t + q

with p + q = 1. The moment-generating function of X is

MX(t) = (pe^t + q)^n = Σ_{k=0}^n C(n, k) p^k q^{n−k} e^{tk}.

This could have been derived directly:

MX(t) = E[e^{tX}] = Σ_{k=0}^n P{X = k} e^{tk} = Σ_{k=0}^n C(n, k) p^k q^{n−k} e^{tk} = (pe^t + q)^n

by the binomial theorem.

2. Poisson. We have P{X = k} = e^{−λ} λ^k/k!,  k = 0, 1, 2, .... Thus

M(t) = Σ_{k=0}^∞ (e^{−λ} λ^k/k!) e^{tk} = e^{−λ} Σ_{k=0}^∞ (λe^t)^k/k! = exp(−λ) exp(λe^t) = exp[λ(e^t − 1)].

We can compute the mean and variance from the moment-generating function:

E(X) = M′(0) = [exp(λ(e^t − 1)) λe^t]_{t=0} = λ.

Let h(λ, t) = exp[λ(e^t − 1)]. Then

E(X²) = M″(0) = [h(λ, t)λe^t + λe^t h(λ, t)λe^t]_{t=0} = λ + λ²

hence

Var X = E(X²) − [E(X)]² = λ + λ² − λ² = λ.
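The Poisson calculation can be reproduced symbolically. The following sketch is an added illustration (not part of the original notes) and assumes SymPy is available; it differentiates M(t) = exp[λ(e^t − 1)] at t = 0.

import sympy as sp

t = sp.symbols('t')
lam = sp.symbols('lam', positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))           # Poisson moment-generating function

EX = sp.diff(M, t, 1).subs(t, 0)            # first moment  M'(0)
EX2 = sp.diff(M, t, 2).subs(t, 0)           # second moment M''(0)
var = sp.simplify(EX2 - EX**2)

print(EX, var)                              # both simplify to lam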
3. Normal(0,1). The moment-generating function is

M(t) = E[e^{tX}] = (1/√(2π)) ∫_{−∞}^{∞} e^{tx} e^{−x²/2} dx.

Now −(x²/2) + tx = −(1/2)(x² − 2tx + t² − t²) = −(1/2)(x − t)² + (1/2)t², so

M(t) = e^{t²/2} (1/√(2π)) ∫_{−∞}^{∞} exp[−(x − t)²/2] dx.

The integral is the area under a normal density (mean t, variance 1), which is 1. Consequently,

M(t) = e^{t²/2}.

4. Normal(μ, σ²). If X is normal(μ, σ²), then Y = (X − μ)/σ is normal(0,1). This is a good application of the density function method from Lecture 1:

fY(y) = fX(x)/|dy/dx| evaluated at x = μ + σy, which equals (1/√(2π)) e^{−y²/2}.

We have X = μ + σY, so

MX(t) = E[e^{tX}] = e^{μt} E[e^{tσY}] = e^{μt} MY(σt).

Thus

MX(t) = e^{μt} e^{σ²t²/2}.

Remember this technique, which is especially useful when Y = aX + b and the moment-generating function of X is known.

3.5 Theorem
If X is normal(μ, σ²) and Y = aX + b, then Y is normal(aμ + b, a²σ²).
Proof. We compute

MY(t) = E[e^{tY}] = E[e^{t(aX+b)}] = e^{bt} MX(at) = e^{bt} e^{aμt} e^{a²σ²t²/2}.

Thus

MY(t) = exp[t(aμ + b)] exp(t²a²σ²/2).
Here is another basic result.

3.6 Theorem
Let X1, ..., Xn be independent, with Xi normal (μi, σi²). Then Y = Σ_{i=1}^n Xi is normal with mean μ = Σ_{i=1}^n μi and variance σ² = Σ_{i=1}^n σi².

Proof. The moment-generating function of Y is

MY(t) = Π_{i=1}^n exp(tμi + t²σi²/2) = exp(tμ + t²σ²/2).

A similar argument works for the Poisson distribution; see Problem 4.

3.7 The Gamma Distribution


First, we dene the gamma function () =
properties:
(a) ( + 1) = (), the recursion formula;
(b) (n + 1) = n!, n = 0, 1, 2, . . . ;

y 1 ey dy,

> 0. We need three

12
(c) (1/2) =

To prove (a), integrate by parts: () = 0 ey d(y /). Part (b) is a special case of (a).
For (c) we make the change of variable y = z 2 /2 and compute


2
(1/2) =
y 1/2 ey dy =
2z 1 ez /2 z dz.
0

The
is 2 times half the area under the normal(0,1) density, that is,
second integral

2 (1/2) = .
The gamma density is
f (x) =

1
x1 ex/
()

where α and β are positive constants. The moment-generating function is

M(t) = ∫_0^∞ [Γ(α)β^α]^{−1} x^{α−1} e^{tx} e^{−x/β} dx.

Change variables via y = ((1/β) − t)x to get

[Γ(α)β^α]^{−1} ∫_0^∞ (y/((1/β) − t))^{α−1} e^{−y} dy/((1/β) − t)

which reduces to

[1/(β((1/β) − t))]^α = (1 − βt)^{−α}.

In this argument, t must be less than 1/β so that the integrals will be finite.
Since M(0) = ∫ f(x) dx = ∫_0^∞ f(x) dx in this case, with f ≥ 0, M(0) = 1 implies that we have a legal probability density. As before, moments can be calculated efficiently from the moment-generating function:

E(X) = M′(0) = αβ(1 − βt)^{−α−1}|_{t=0} = αβ;
E(X²) = M″(0) = α(α + 1)β²(1 − βt)^{−α−2}|_{t=0} = α(α + 1)β².

Thus

Var X = E(X²) − [E(X)]² = αβ².

3.8 Special Cases


The exponential density is a gamma density with α = 1: f(x) = (1/β)e^{−x/β}, x ≥ 0, with E(X) = β, E(X²) = 2β², Var X = β².
A random variable X has the chi square density with r degrees of freedom (X = χ²(r) for short, where r is a positive integer) if its density is gamma with α = r/2 and β = 2. Thus

f(x) = [1/(Γ(r/2)2^{r/2})] x^{(r/2)−1} e^{−x/2},   x ≥ 0

and

M(t) = 1/(1 − 2t)^{r/2},   t < 1/2.

Therefore E[χ²(r)] = αβ = r,  Var[χ²(r)] = αβ² = 2r.

3.9 Lemma
If X is normal(0,1) then X² is χ²(1).
Proof. We compute the moment-generating function of X² directly:

MX²(t) = E[e^{tX²}] = (1/√(2π)) ∫_{−∞}^{∞} e^{tx²} e^{−x²/2} dx.

Let y = √(1 − 2t) x; the integral becomes

∫_{−∞}^{∞} (1/√(2π)) e^{−y²/2} dy/√(1 − 2t) = (1 − 2t)^{−1/2}

which is the moment-generating function of χ²(1).

3.10 Theorem

If X1, ..., Xn are independent, each normal (0,1), then Y = Σ_{i=1}^n Xi² is χ²(n).
Proof. By (3.9), each Xi² is χ²(1) with moment-generating function (1 − 2t)^{−1/2}. Thus MY(t) = (1 − 2t)^{−n/2} for t < 1/2, which is the moment-generating function of χ²(n).

3.11 Another Method


Another way to find the density of Z = X + Y where X and Y are independent random variables is by the convolution formula

fZ(z) = ∫ fX(x)fY(z − x) dx = ∫ fY(y)fX(z − y) dy.

To see this intuitively, reason as follows. The probability that Z lies near z (between z and z + dz) is fZ(z) dz. Let us compute this in terms of X and Y. The probability that X lies near x is fX(x) dx. Given that X lies near x, Z will lie near z if and only if Y lies near z − x, in other words, z − x ≤ Y ≤ z − x + dz. By independence of X and Y, this probability is fY(z − x) dz. Thus fZ(z) dz is a sum of terms of the form fX(x) dx fY(z − x) dz. Cancel the dz's and replace the sum by an integral to get the result. A formal proof can be given using Jacobians.

3.12 The Poisson Process


This process occurs in many physical situations, and provides an application of the gamma distribution. For example, particles can arrive at a counting device, customers at a serving counter, airplanes at an airport, or phone calls at a telephone exchange. Divide the time interval [0, t] into a large number n of small subintervals of length dt, so that n dt = t. If Ii, i = 1, ..., n, is one of the small subintervals, we make the following assumptions:
(1) The probability of exactly one arrival in Ii is λ dt, where λ is a constant.
(2) The probability of no arrivals in Ii is 1 − λ dt.
(3) The probability of more than one arrival in Ii is zero.
(4) If Ai is the event of an arrival in Ii, then the Ai, i = 1, ..., n, are independent.
As a consequence of these assumptions, we have n = t/dt Bernoulli trials with probability of success p = λ dt on a given trial. As dt → 0 we have n → ∞ and p → 0, with np = λt. We conclude that the number N[0, t] of arrivals in [0, t] is Poisson (λt):

P{N[0, t] = k} = e^{−λt}(λt)^k/k!,   k = 0, 1, 2, ....

Since E(N[0, t]) = λt, we may interpret λ as the average number of arrivals per unit time.
Now let W1 be the waiting time for the first arrival. Then

P{W1 > t} = P{no arrival in [0, t]} = P{N[0, t] = 0} = e^{−λt},   t ≥ 0.

Thus FW1(t) = 1 − e^{−λt} and fW1(t) = λe^{−λt}, t ≥ 0. From the formulas for the mean and variance of an exponential random variable we have E(W1) = 1/λ and Var W1 = 1/λ².
Let Wk be the (total) waiting time for the k-th arrival. Then Wk is the waiting time for the first arrival, plus the time after the first up to the second arrival, plus ···, plus the time after arrival k − 1 up to the k-th arrival. Thus Wk is the sum of k independent exponential random variables, and

MWk(t) = [1/(1 − (t/λ))]^k

so Wk is gamma with α = k, β = 1/λ. Therefore

fWk(t) = [1/(k − 1)!] λ^k t^{k−1} e^{−λt},   t ≥ 0.
Problems
1. Let X1 and X2 be independent, and assume that X1 is χ²(r1) and Y = X1 + X2 is χ²(r), where r > r1. Show that X2 is χ²(r2), where r2 = r − r1.
2. Let X1 and X2 be independent, with Xi gamma with parameters αi and βi, i = 1, 2. If c1 and c2 are positive constants, find convenient sufficient conditions under which c1X1 + c2X2 will also have the gamma distribution.
3. If X1, ..., Xn are independent random variables with moment-generating functions M1, ..., Mn, and c1, ..., cn are constants, express the moment-generating function M of c1X1 + ··· + cnXn in terms of the Mi.
4. If X1, ..., Xn are independent, with Xi Poisson(λi), i = 1, ..., n, show that the sum Y = Σ_{i=1}^n Xi has the Poisson distribution with parameter λ = Σ_{i=1}^n λi.
5. An unbiased coin is tossed independently n1 times and then again tossed independently n2 times. Let X1 be the number of heads in the first experiment, and X2 the number of tails in the second experiment. Without using moment-generating functions, in fact without any calculation at all, find the distribution of X1 + X2.

Lecture 4. Sampling From a Normal Population


4.1 Definitions and Comments
Let X1, ..., Xn be iid. The sample mean of the Xi is

X̄ = (1/n) Σ_{i=1}^n Xi

and the sample variance is

S² = (1/n) Σ_{i=1}^n (Xi − X̄)².

If the Xi have mean μ and variance σ², then

E(X̄) = (1/n) Σ_{i=1}^n E(Xi) = (1/n) nμ = μ

and

Var X̄ = (1/n²) Σ_{i=1}^n Var Xi = nσ²/n² = σ²/n → 0   as n → ∞.

Thus X̄ is a good estimate of μ. (For large n, the variance of X̄ is small, so X̄ is concentrated near its mean.) The sample variance is an average squared deviation from the sample mean, but it is a biased estimate of the true variance σ²:

E[(Xi − X̄)²] = E[(Xi − μ) − (X̄ − μ)]² = Var Xi + Var X̄ − 2E[(Xi − μ)(X̄ − μ)].

Notice the centralizing technique. We subtract and add back the mean of Xi, which will make the cross terms easier to handle when squaring. The above expression simplifies to

σ² + σ²/n − 2E[(Xi − μ)(1/n) Σ_{j=1}^n (Xj − μ)] = σ² + σ²/n − (2/n)E[(Xi − μ)²].

Thus

E[(Xi − X̄)²] = σ²(1 + 1/n − 2/n) = (n − 1)σ²/n.

Consequently, E(S²) = (n − 1)σ²/n, not σ². Some books define the sample variance as

[1/(n − 1)] Σ_{i=1}^n (Xi − X̄)² = nS²/(n − 1)

where S² is our sample variance. This adjusted estimate of the true variance is unbiased (its expectation is σ²), but biased does not mean bad. If we measure performance by asking for a small mean square error, the biased estimate is better in the normal case, as we will see at the end of the lecture.
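The bias E(S²) = (n − 1)σ²/n is easy to see numerically. The following sketch is an added illustration (not part of the original notes) and assumes NumPy is available; it averages the sample variance over many normal samples.

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 1.0, 2.0, 5, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1)                    # divides by n: this is S^2 as defined above
s2_adjusted = x.var(axis=1, ddof=1)   # divides by n - 1: the unbiased version

print("E(S^2) approx:", round(s2.mean(), 3), " theory:", (n - 1) * sigma**2 / n)
print("adjusted version approx:", round(s2_adjusted.mean(), 3), " theory:", sigma**2)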

4.2 The Normal Case

We now assume that the Xi are normally distributed, and find the distribution of S². Let y1 = x̄ = (x1 + ··· + xn)/n, y2 = x2 − x̄, ..., yn = xn − x̄. Then y1 + y2 = x2, y1 + y3 = x3, ..., y1 + yn = xn. Add these equations to get (n − 1)y1 + y2 + ··· + yn = x2 + ··· + xn, or

ny1 + (y2 + ··· + yn) = (x2 + ··· + xn) + y1.    (1)

But ny1 = nx̄ = x1 + ··· + xn, so by cancelling x2, ..., xn in (1), x1 + (y2 + ··· + yn) = y1. Thus we can solve for the x's in terms of the y's:

x1 = y1 − y2 − ··· − yn
x2 = y1 + y2
x3 = y1 + y3
 ⋮
xn = y1 + yn    (2)

The Jacobian of the transformation is

dn = ∂(x1, ..., xn)/∂(y1, ..., yn),

the determinant of the n × n matrix whose first row is (1, −1, −1, ..., −1) (from x1 = y1 − y2 − ··· − yn) and whose i-th row, i = 2, ..., n, has a 1 in column 1, a 1 in column i, and 0's elsewhere (from xi = y1 + yi). To see the pattern, look at the 4 by 4 case and expand via the last row (1, 0, 0, 1):

d4 = det[ 1 −1 −1 −1 ; 1 1 0 0 ; 1 0 1 0 ; 1 0 0 1 ]
   = (−1) det[ −1 −1 −1 ; 1 0 0 ; 0 1 0 ] + det[ 1 −1 −1 ; 1 1 0 ; 1 0 1 ].

The first 3 by 3 determinant is −1 and the second is d3, so d4 = 1 + d3. In general, dn = 1 + dn−1, and since d2 = 2 by inspection, we have dn = n for all n ≥ 2. Now
Σ_{i=1}^n (xi − μ)² = Σ_{i=1}^n (xi − x̄ + x̄ − μ)² = Σ_{i=1}^n (xi − x̄)² + n(x̄ − μ)²    (3)

because Σ(xi − x̄) = 0. By (2), x1 − x̄ = x1 − y1 = −y2 − ··· − yn and xi − x̄ = xi − y1 = yi for i = 2, ..., n. (Remember that y1 = x̄.) Thus

Σ_{i=1}^n (xi − x̄)² = (y2 + ··· + yn)² + Σ_{i=2}^n yi²    (4)

Now

fY1···Yn(y1, ..., yn) = n fX1···Xn(x1, ..., xn).

By (3) and (4), the right side becomes, in terms of the yi's,

n (1/(σ√(2π)))^n exp[ −(1/(2σ²)) ( (y2 + ··· + yn)² + Σ_{i=2}^n yi² + n(y1 − μ)² ) ].

The joint density of Y1, ..., Yn is a function of y1 times a function of (y2, ..., yn), so Y1 and (Y2, ..., Yn) are independent. Since X̄ = Y1 and [by (4)] S² is a function of (Y2, ..., Yn),

X̄ and S² are independent

Dividing Equation (3) by σ² we have

Σ_{i=1}^n ((Xi − μ)/σ)² = nS²/σ² + ((X̄ − μ)/(σ/√n))².

But (Xi − μ)/σ is normal (0,1) and

(X̄ − μ)/(σ/√n) is normal (0,1)

so χ²(n) = (nS²/σ²) + χ²(1) with the two random variables on the right independent. If M(t) is the moment-generating function of nS²/σ², then (1 − 2t)^{−n/2} = M(t)(1 − 2t)^{−1/2}. Therefore M(t) = (1 − 2t)^{−(n−1)/2}, i.e.,

nS²/σ² is χ²(n − 1)

The random variable

T = (X̄ − μ)/(S/√(n − 1))

is useful in situations where μ is to be estimated but the true variance σ² is unknown. It turns out that T has a T distribution, which we study in the next lecture.

4.3 Performance of Various Estimates


Let S² be the sample variance of iid normal (μ, σ²) random variables X1, ..., Xn. We will look at estimates of σ² of the form cS², where c is a constant. Once again employing the centralizing technique, we write

E[(cS² − σ²)²] = E[(cS² − cE(S²) + cE(S²) − σ²)²]

which simplifies to

c² Var S² + (cE(S²) − σ²)².

Since nS²/σ² is χ²(n − 1), which has variance 2(n − 1), we have n²(Var S²)/σ⁴ = 2(n − 1). Also nE(S²)/σ² is the mean of χ²(n − 1), which is n − 1. (Or we can recall from (4.1) that E(S²) = (n − 1)σ²/n.) Thus the mean square error is

2c²σ⁴(n − 1)/n² + (c(n − 1)σ²/n − σ²)².

We can drop the σ⁴ and use n² as a common denominator, which can also be dropped. We are then trying to minimize

2c²(n − 1) + c²(n − 1)² − 2c(n − 1)n + n².

Differentiate with respect to c and set the result equal to zero:

4c(n − 1) + 2c(n − 1)² − 2(n − 1)n = 0.

Dividing by 2(n − 1), we have 2c + c(n − 1) − n = 0, so c = n/(n + 1). Thus the best estimate of the form cS² is

[1/(n + 1)] Σ_{i=1}^n (Xi − X̄)².

If we use S² then c = 1. If we use the unbiased version then c = n/(n − 1). Since [n/(n + 1)] < 1 < [n/(n − 1)] and a quadratic function decreases as we move toward its minimum, we see that the biased estimate S² is better than the unbiased estimate nS²/(n − 1), but neither is optimal under the minimum mean square error criterion. Explicitly, when c = n/(n − 1) we get a mean square error of 2σ⁴/(n − 1) and when c = 1 we get

(σ⁴/n²)[2(n − 1) + (n − 1 − n)²] = (2n − 1)σ⁴/n²

which is always smaller, because [(2n − 1)/n²] < 2/(n − 1) iff 2n² > 2n² − 3n + 1 iff 3n > 1, which is true for every positive integer n.
For large n all these estimates are good and the difference between their performance is small.
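The comparison of the three choices of c can be checked directly. The sketch below is an added illustration (not part of the original notes) and assumes NumPy is available; it estimates the mean square error E[(cS² − σ²)²] for c = n/(n + 1), 1 and n/(n − 1).

import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 0.0, 1.0, 10, 400_000
sigma2 = sigma**2

x = rng.normal(mu, sigma, size=(reps, n))
S2 = x.var(axis=1)                                   # sample variance with divisor n

for label, c in [("n/(n+1)", n / (n + 1)), ("1 (S^2 itself)", 1.0), ("n/(n-1) (unbiased)", n / (n - 1))]:
    mse = np.mean((c * S2 - sigma2) ** 2)
    print(f"c = {label:20s} estimated MSE = {mse:.4f}")

# Theoretical values (sigma = 1): 2/(n+1), (2n-1)/n^2, 2/(n-1)
print("theory:", 2 / (n + 1), (2 * n - 1) / n**2, 2 / (n - 1))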

Problems
1. Let X1, ..., Xn be iid, each normal (μ, σ²), and let X̄ be the sample mean. If c is a constant, we wish to make n large enough so that P{μ − c < X̄ < μ + c} ≥ .954. Find the minimum value of n in terms of σ² and c. (It is independent of μ.)
2. Let X1, ..., Xn1, Y1, ..., Yn2 be independent random variables, with the Xi normal (μ1, σ1²) and the Yi normal (μ2, σ2²). If X̄ is the sample mean of the Xi and Ȳ is the sample mean of the Yi, explain how to compute the probability that X̄ > Ȳ.
3. Let X1, ..., Xn be iid, each normal (μ, σ²), and let S² be the sample variance. Explain how to compute P{a < S² < b}.
4. Let S² be the sample variance of iid normal (μ, σ²) random variables Xi, i = 1, ..., n. Calculate the moment-generating function of S² and from this, deduce that S² has a gamma distribution.

Lecture 5. The T and F Distributions


5.1 Definition and Discussion
The T distribution is defined as follows. Let X1 and X2 be independent, with X1 normal (0,1) and X2 chi-square with r degrees of freedom. The random variable Y1 = √r X1/√X2 has the T distribution with r degrees of freedom.
To find the density of Y1, let Y2 = X2. Then X1 = Y1√Y2/√r and X2 = Y2. The transformation is one-to-one, with −∞ < X1 < ∞, X2 > 0 corresponding to −∞ < Y1 < ∞, Y2 > 0. The Jacobian is given by

∂(x1, x2)/∂(y1, y2) = det[ √(y2/r)  y1/(2√(ry2)) ; 0  1 ] = √(y2/r).

Thus fY1Y2(y1, y2) = fX1X2(x1, x2) √(y2/r), which upon substitution for x1 and x2 becomes

(1/√(2π)) exp[−y1²y2/(2r)] · [1/(Γ(r/2)2^{r/2})] y2^{(r/2)−1} e^{−y2/2} · √(y2/r).

The density of Y1 is

[1/(√(2πr) Γ(r/2) 2^{r/2})] ∫_0^∞ y2^{[(r+1)/2]−1} exp[−(1 + (y1²/r)) y2/2] dy2.

With z = (1 + (y1²/r))y2/2 and the observation that all factors of 2 cancel, this becomes (with y1 replaced by t)

Γ((r + 1)/2) / [√(πr) Γ(r/2) (1 + (t²/r))^{(r+1)/2}],   −∞ < t < ∞,

the T density with r degrees of freedom.

In sampling from a normal population, (X̄ − μ)/(σ/√n) is normal (0,1), and nS²/σ² is χ²(n − 1). Thus

√(n − 1) (X̄ − μ)/(σ/√n)  divided by  √(nS²/σ²)  is  T(n − 1).

Since σ and √n disappear after cancellation, we have

(X̄ − μ)/(S/√(n − 1))  is  T(n − 1)

Advocates of defining the sample variance with n − 1 in the denominator point out that one can simply replace σ by S in (X̄ − μ)/(σ/√n) to get the T statistic.
Intuitively, we expect that for large n, (X̄ − μ)/(S/√(n − 1)) has approximately the same distribution as (X̄ − μ)/(σ/√n), i.e., normal (0,1). This is in fact true, as suggested by the following computation:

(1 + t²/r)^{(r+1)/2} = (1 + t²/r)^{r/2} (1 + t²/r)^{1/2} → e^{t²/2} · 1 = e^{t²/2}

as r → ∞.
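The claim that (X̄ − μ)/(S/√(n − 1)) is T(n − 1) can be checked by simulation. The sketch below is an added illustration (not part of the original notes) and assumes NumPy and SciPy are available; it compares simulated values of the statistic with SciPy's Student t distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, sigma, n, reps = 3.0, 2.0, 8, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
S = x.std(axis=1)                          # divisor n, matching the definition in Lecture 4
T = (xbar - mu) / (S / np.sqrt(n - 1))

ks_stat, p = stats.kstest(T, stats.t(df=n - 1).cdf)
print("KS statistic vs T(n-1):", round(ks_stat, 4))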

5.2 A Preliminary Calculation


Before turning to the F distribution, we calculate the density of U = X1/X2 where X1 and X2 are independent, positive random variables. Let Y = X2, so that X1 = UY, X2 = Y (X1, X2, U, Y are all greater than zero). The Jacobian is

∂(x1, x2)/∂(u, y) = det[ y  u ; 0  1 ] = y.

Thus fUY(u, y) = fX1X2(x1, x2) y = fX1(uy) fX2(y) y, and the density of U is

h(u) = ∫_0^∞ y fX1(uy) fX2(y) dy.

Now we take X1 to be χ²(m), and X2 to be χ²(n). The density of X1/X2 is

h(u) = [1/(2^{(m+n)/2} Γ(m/2)Γ(n/2))] u^{(m/2)−1} ∫_0^∞ y^{[(m+n)/2]−1} e^{−y(1+u)/2} dy.

The substitution z = y(1 + u)/2 gives

h(u) = [1/(2^{(m+n)/2} Γ(m/2)Γ(n/2))] u^{(m/2)−1} ∫_0^∞ [z^{[(m+n)/2]−1}/((1 + u)/2)^{[(m+n)/2]−1}] e^{−z} [2/(1 + u)] dz.

We abbreviate Γ(a)Γ(b)/Γ(a + b) by β(a, b). (We will have much more to say about this when we discuss the beta distribution later in the lecture.) The above formula simplifies to

h(u) = [1/β(m/2, n/2)] u^{(m/2)−1}/(1 + u)^{(m+n)/2},   u ≥ 0.

5.3 Definition and Discussion

The F density is defined as follows. Let X1 and X2 be independent, with X1 = χ²(m) and X2 = χ²(n). With U as in (5.2), let

W = (X1/m)/(X2/n) = (n/m)U

so that

fW(w) = fU(u) |du/dw| = fU((m/n)w) (m/n).

Thus W has density

[(m/n)^{m/2}/β(m/2, n/2)] w^{(m/2)−1}/[1 + (m/n)w]^{(m+n)/2},   w ≥ 0,

the F density with m and n degrees of freedom.

5.4 Definitions and Calculations

The beta function is given by

β(a, b) = ∫_0^1 x^{a−1}(1 − x)^{b−1} dx,   a, b > 0.

We will show that

β(a, b) = Γ(a)Γ(b)/Γ(a + b)

which is consistent with our use of β(a, b) as an abbreviation in (5.2). We make the change of variable t = x² to get

Γ(a) = ∫_0^∞ t^{a−1} e^{−t} dt = 2∫_0^∞ x^{2a−1} e^{−x²} dx.

We now use the familiar trick of writing Γ(a)Γ(b) as a double integral and switching to polar coordinates. Thus

Γ(a)Γ(b) = 4∫_0^∞ ∫_0^∞ x^{2a−1} y^{2b−1} e^{−(x²+y²)} dx dy
         = 4∫_0^{π/2} dθ ∫_0^∞ (cos θ)^{2a−1}(sin θ)^{2b−1} e^{−r²} r^{2a+2b−1} dr.

The change of variable u = r² yields

∫_0^∞ r^{2a+2b−1} e^{−r²} dr = (1/2)∫_0^∞ u^{a+b−1} e^{−u} du = Γ(a + b)/2.

Thus

Γ(a)Γ(b)/(2Γ(a + b)) = ∫_0^{π/2} (cos θ)^{2a−1}(sin θ)^{2b−1} dθ.

Let z = cos²θ, 1 − z = sin²θ, dz = −2 cos θ sin θ dθ = −2z^{1/2}(1 − z)^{1/2} dθ. The above integral becomes

(1/2)∫_0^1 z^{a−1}(1 − z)^{b−1} dz = (1/2)β(a, b)

as claimed. The beta density is

f(x) = [1/β(a, b)] x^{a−1}(1 − x)^{b−1},   0 ≤ x ≤ 1   (a, b > 0).

Problems
1. Let X have the beta distribution with parameters a and b. Find the mean and variance of X.
2. Let T have the T distribution with 15 degrees of freedom. Find the value of c which makes P{−c ≤ T ≤ c} = .95.
3. Let W have the F distribution with m and n degrees of freedom (abbreviated W = F(m, n)). Find the distribution of 1/W.
4. A typical table of the F distribution gives values of P{W ≤ c} for c = .9, .95, .975 and .99. Explain how to find P{W ≤ c} for c = .1, .05, .025 and .01. (Use the result of Problem 3.)
5. Let X have the T distribution with n degrees of freedom (abbreviated X = T(n)). Show that T²(n) = F(1, n), in other words, T² has an F distribution with 1 and n degrees of freedom.
6. If X has the exponential density e^{−x}, x ≥ 0, show that 2X is χ²(2). Deduce that the quotient of two exponential random variables is F(2, 2).

Lecture 6. Order Statistics


6.1 The Multinomial Formula
Suppose we pick a letter from {A, B, C}, with P(A) = p1 = .3, P(B) = p2 = .5, P(C) = p3 = .2. If we do this independently 10 times, we will find the probability that the resulting sequence contains exactly 4 A's, 3 B's and 3 C's.
The probability of AAAABBBCCC, in that order, is p1⁴p2³p3³. To generate all favorable cases, select 4 positions out of 10 for the A's, then 3 positions out of the remaining 6 for the B's. The positions for the C's are then determined. One possibility is BCAABACCAB. The number of favorable cases is

C(10, 4) C(6, 3) = [10!/(4!6!)][6!/(3!3!)] = 10!/(4!3!3!).

Therefore the probability of exactly 4 A's, 3 B's and 3 C's is

[10!/(4!3!3!)] (.3)⁴(.5)³(.2)³.
In general, consider n independent trials such that on each trial, the result is exactly one of the events A1, ..., Ar, with probabilities p1, ..., pr respectively. Then the probability that A1 occurs exactly n1 times, ..., Ar occurs exactly nr times, is

C(n, n1) C(n − n1, n2) C(n − n1 − n2, n3) ··· C(n − n1 − ··· − nr−2, nr−1) C(nr, nr) p1^{n1} ··· pr^{nr}

which reduces to the multinomial formula

[n!/(n1! ··· nr!)] p1^{n1} ··· pr^{nr}

where the pi are nonnegative real numbers that sum to 1, and the ni are nonnegative integers that sum to n.
Now let X1, ..., Xn be iid, each with density f(x) and distribution function F(x). Let Y1 < Y2 < ··· < Yn be the Xi's arranged in increasing order, so that Yk is the k-th smallest. In particular, Y1 = min Xi and Yn = max Xi. The Yk's are called the order statistics of the Xi's.
The distributions of Y1 and Yn can be computed without developing any new machinery. The probability that Yn ≤ x is the probability that Xi ≤ x for all i, which is Π_{i=1}^n P{Xi ≤ x} by independence. But P{Xi ≤ x} is F(x) for all i, hence

FYn(x) = [F(x)]^n   and   fYn(x) = n[F(x)]^{n−1} f(x).

Similarly,

P{Y1 > x} = Π_{i=1}^n P{Xi > x} = [1 − F(x)]^n.

Therefore

FY1(x) = 1 − [1 − F(x)]^n   and   fY1(x) = n[1 − F(x)]^{n−1} f(x).

We compute fYk(x) by asking how it can happen that x ≤ Yk ≤ x + dx (see Figure 6.1). There must be k − 1 random variables less than x, one random variable between x and x + dx, and n − k random variables greater than x. (We are taking dx so small that the probability that more than one random variable falls in [x, x + dx] is negligible, and P{Xi > x} is essentially the same as P{Xi > x + dx}. Not everyone is comfortable with this reasoning, but the intuition is very strong and can be made precise.) By the multinomial formula,

fYk(x) dx = [n!/((k − 1)! 1! (n − k)!)] [F(x)]^{k−1} f(x) dx [1 − F(x)]^{n−k}

so

fYk(x) = [n!/((k − 1)! 1! (n − k)!)] [F(x)]^{k−1} [1 − F(x)]^{n−k} f(x).

Similar reasoning (see Figure 6.2) allows us to write down the joint density fYjYk(x, y) of Yj and Yk for j < k, namely

[n!/((j − 1)!(k − j − 1)!(n − k)!)] [F(x)]^{j−1} [F(y) − F(x)]^{k−j−1} [1 − F(y)]^{n−k} f(x)f(y)

for x < y, and 0 elsewhere. [We drop the term 1! (= 1), which we retained for emphasis in the formula for fYk(x).]

Figure 6.1: k − 1 observations fall below x, one falls in [x, x + dx], and n − k fall above x + dx.
Figure 6.2: j − 1 observations fall below x, one in [x, x + dx], k − j − 1 between x + dx and y, one in [y, y + dy], and n − k above y + dy.

Problems
1. Let Y1 < Y2 < Y3 be the order statistics of X1, X2 and X3, where the Xi are uniformly distributed between 0 and 1. Find the density of Z = Y3 − Y1.
2. The formulas derived in this lecture assume that we are in the continuous case (the distribution function F is continuous). The formulas do not apply if the Xi are discrete. Why not?
3. Consider order statistics where the Xi, i = 1, ..., n, are uniformly distributed between 0 and 1. Show that Yk has a beta distribution, and express the parameters α and β in terms of k and n.
4. In Problem 3, let 0 < p < 1, and express P{Yk > p} as the probability of an event associated with a sequence of n Bernoulli trials with probability of success p on a given trial. Write P{Yk > p} as a finite sum involving n, p and k.

Lecture 7. The Weak Law of Large Numbers


7.1 Chebyshev's Inequality
(a) If X ≥ 0 and a > 0, then P{X ≥ a} ≤ E(X)/a.
(b) If X is an arbitrary random variable, c any real number, and ε > 0, m > 0, then P{|X − c| ≥ ε} ≤ E(|X − c|^m)/ε^m.
(c) If X has finite mean μ and finite variance σ², then P{|X − μ| ≥ kσ} ≤ 1/k².
This is a universal bound, but it may be quite weak in specific cases. For example, if X is normal (μ, σ²), abbreviated N(μ, σ²), then

P{|X − μ| ≥ 1.96σ} = P{|N(0, 1)| ≥ 1.96} = 2(1 − Φ(1.96)) = .05

where Φ is the distribution function of a normal (0,1) random variable. But the Chebyshev bound is 1/(1.96)² = .26.
Proof.
(a) If X has density f, then

E(X) = ∫_0^∞ x f(x) dx = ∫_0^a x f(x) dx + ∫_a^∞ x f(x) dx

so

E(X) ≥ 0 + ∫_a^∞ a f(x) dx = aP{X ≥ a}.

(b) P{|X − c| ≥ ε} = P{|X − c|^m ≥ ε^m} ≤ E(|X − c|^m)/ε^m by (a).

(c) By (b) with c = μ, ε = kσ, m = 2, we have

P{|X − μ| ≥ kσ} ≤ E[(X − μ)²]/(k²σ²) = 1/k².

7.2 Weak Law of Large Numbers


Let X1, ..., Xn be iid with finite mean μ and finite variance σ². For large n, the arithmetic average of the observations is very likely to be very close to the true mean μ. Formally, if Sn = X1 + ··· + Xn, then for any ε > 0,

P{|Sn/n − μ| ≥ ε} → 0  as  n → ∞.

Proof.

P{|Sn/n − μ| ≥ ε} = P{|Sn − nμ| ≥ nε} ≤ E[(Sn − nμ)²]/(n²ε²)

by Chebyshev (b). The term on the right is

Var Sn/(n²ε²) = nσ²/(n²ε²) = σ²/(nε²) → 0.

7.3 Bernoulli Trials


Let Xi = 1 if there is a success on trial i, and Xi = 0 if there is a failure. Thus Xi is the
indicator of a success on trial i, often written as I[Success on trial i]. Then Sn /n is the
relative frequency of success, and for large n, this is very likely to be very close to the
true probability p of success.
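A coin-flipping simulation shows the weak law in action. The sketch below is an added illustration (not part of the original notes) and assumes NumPy is available; it estimates how often the relative frequency Sn/n is farther than ε from p as n grows.

import numpy as np

rng = np.random.default_rng(6)
p, eps, reps = 0.3, 0.02, 100_000

for n in (100, 1_000, 10_000):
    rel_freq = rng.binomial(n, p, size=reps) / n    # S_n / n for `reps` independent experiments
    prob_far = np.mean(np.abs(rel_freq - p) >= eps)
    print(f"n = {n:6d}   P(|S_n/n - p| >= {eps}) ~ {prob_far:.4f}")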

7.4 Definitions and Comments

The convergence illustrated by the weak law of large numbers is called convergence in probability. Explicitly, Sn/n converges in probability to μ. In general, Xn →P X means that for every ε > 0, P{|Xn − X| ≥ ε} → 0 as n → ∞. Thus for large n, Xn is very likely to be very close to X. If Xn converges in probability to X, then Xn converges to X in distribution: if Fn is the distribution function of Xn and F is the distribution function of X, then Fn(x) → F(x) at every x where F is continuous. To see that the continuity requirement is needed, look at Figure 7.1. In this example, Xn is uniformly distributed between 0 and 1/n, and X is identically 0. We have Xn →P 0 because P{|Xn| ≥ ε} is actually 0 for large n. However, Fn(x) → F(x) for x ≠ 0, but not at x = 0.
To prove that convergence in probability implies convergence in distribution:

Fn(x) = P{Xn ≤ x} = P{Xn ≤ x, X > x + ε} + P{Xn ≤ x, X ≤ x + ε}
      ≤ P{|Xn − X| ≥ ε} + P{X ≤ x + ε}
      = P{|Xn − X| ≥ ε} + F(x + ε);

F(x − ε) = P{X ≤ x − ε} = P{X ≤ x − ε, Xn > x} + P{X ≤ x − ε, Xn ≤ x}
         ≤ P{|Xn − X| ≥ ε} + P{Xn ≤ x}
         = P{|Xn − X| ≥ ε} + Fn(x).

Therefore

F(x − ε) − P{|Xn − X| ≥ ε} ≤ Fn(x) ≤ P{|Xn − X| ≥ ε} + F(x + ε).

Since Xn converges in probability to X, we have P{|Xn − X| ≥ ε} → 0 as n → ∞. If F is continuous at x, then F(x − ε) and F(x + ε) approach F(x) as ε → 0. Thus Fn(x) is boxed between two quantities that can be made arbitrarily close to F(x), so Fn(x) → F(x).

7.5 Some Sufficient Conditions

In practice, P{|Xn − X| ≥ ε} may be difficult to compute, and it is useful to have sufficient conditions for convergence in probability that can often be easily checked.

(1) If E[(Xn − X)²] → 0 as n → ∞, then Xn →P X.

(2) If E(Xn) → E(X) and Var(Xn − X) → 0, then Xn →P X.

Proof. The first statement follows from Chebyshev (b):

P{|Xn − X| ≥ ε} ≤ E[(Xn − X)²]/ε² → 0.

To prove (2), note that

E[(Xn − X)²] = Var(Xn − X) + [E(Xn) − E(X)]² → 0.

In this result, if X is identically equal to a constant c, then Var(Xn − X) is simply Var Xn. Condition (2) then becomes E(Xn) → c and Var Xn → 0, which implies that Xn converges in probability to c.

7.6 An Application
In normal sampling, let Sn² be the sample variance based on n observations. Let's show that Sn² is a consistent estimate of the true variance σ², that is, Sn² →P σ². Since nSn²/σ² is χ²(n − 1), we have E(nSn²/σ²) = n − 1 and Var(nSn²/σ²) = 2(n − 1). Thus E(Sn²) = (n − 1)σ²/n → σ² and Var(Sn²) = 2(n − 1)σ⁴/n² → 0, and the result follows.

Figure 7.1: Fn(x), the distribution function of the uniform (0, 1/n) random variable Xn; its limit lim Fn(x); and F(x), the distribution function of the constant 0. The limit agrees with F(x) everywhere except at x = 0.

Problems
1. Let X1, ..., Xn be independent, not necessarily identically distributed random variables. Assume that the Xi have finite means μi and finite variances σi², and the variances are uniformly bounded, i.e., for some positive number M we have σi² ≤ M for all i. Show that (Sn − E(Sn))/n converges in probability to 0. This is a generalization of the weak law of large numbers. For if μi = μ and σi² = σ² for all i, then E(Sn) = nμ, so (Sn/n) − μ →P 0, i.e., Sn/n →P μ.
2. Toss an unbiased coin once. If heads, write down the sequence 10101010 ..., and if tails, write down the sequence 01010101 .... If Xn is the n-th term of the sequence and X = X1, show that Xn converges to X in distribution but not in probability.
3. Let X1, ..., Xn be iid with finite mean μ and finite variance σ². Let X̄n be the sample mean (X1 + ··· + Xn)/n. Find the limiting distribution of X̄n, i.e., find a random variable X such that X̄n →d X.
4. Let Xn be uniformly distributed between n and n + 1. Show that Xn does not have a limiting distribution. Intuitively, the probability has "run away to infinity".

Lecture 8. The Central Limit Theorem


Intuitively, any random variable that can be regarded as the sum of a large number
of small independent components is approximately normal. To formalize, we need the
following result, stated without proof.

8.1 Theorem
If Yn has moment-generating function Mn, Y has moment-generating function M, and Mn(t) → M(t) as n → ∞ for all t in some open interval containing the origin, then Yn →d Y.

8.2 Central Limit Theorem


Let X1 , X2 , . . . be iid, each with nite mean , nite variance 2 , and moment-generating
function M . Then
n
Xi n
Yn = i=1
n
converges in distribution to a random variable that is normal (0,1). Thus for large n,

n
i=1 Xi is approximately normal.
n
We will give an informal sketch of the proof. The numerator of Yn is i=1 (Xi ),
and the random variables Xi are iid with mean 0 and variance 2 . Thus we may
assume without loss of generality that = 0. We have

tYn

MYn (t) = E[e


The moment-generating function of

] = E exp

n
i=1


n
t

Xi .
n i=1

Xi is [M (t)]n , so

t n
.
MYn (t) = M
n
Now if the density of the Xi is f (x), then

t 
tx 
M
exp
=
f (x) dx
n
n




t2 x2
t3 x3
tx
+
=
+
+ f (x) dx
1+
2
3/2
3
n 2!n
3!n

=1+0+

t4 4
t2
t3 3
+
+ 3/2 3 +
2n 6n
24n2 4

where k = E[(Xi )k ]. If we neglect the terms after t2 /2n we have, approximately,



t2  n
MYn (t) = 1 +
2n

9
2

which approaches the normal (0,1) moment-generating function et /2 as n . This


argument is very loose but it can be made precise by some estimates based on Taylors
formula with remainder.
We proved that if Xn converges in probability to X, then Xn convergence in distribution to X. There is a partial converse.
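The approach to normality can be seen numerically. The sketch below is an added illustration (not part of the original notes) and assumes NumPy and SciPy are available; it standardizes sums of iid exponential random variables and measures their distance from the normal (0,1) distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
mu, sigma = 1.0, 1.0                 # exponential(1) has mean 1 and variance 1
reps = 100_000

for n in (2, 10, 50):
    s = rng.exponential(1.0, size=(reps, n)).sum(axis=1)
    y = (s - n * mu) / (sigma * np.sqrt(n))      # Y_n from the statement of the CLT
    ks, _ = stats.kstest(y, stats.norm.cdf)
    print(f"n = {n:3d}   KS distance from normal(0,1) = {ks:.4f}")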

8.3 Theorem
If Xn converges in distribution to a constant c, then Xn converges in probability to c.
Proof. We estimate the probability that |Xn − c| ≥ ε, as follows.

P{|Xn − c| ≥ ε} = P{Xn ≥ c + ε} + P{Xn ≤ c − ε}
                = 1 − P{Xn < c + ε} + P{Xn ≤ c − ε}.

Now P{Xn ≤ c + (ε/2)} ≤ P{Xn < c + ε}, so

P{|Xn − c| ≥ ε} ≤ 1 − P{Xn ≤ c + (ε/2)} + P{Xn ≤ c − ε}
                = 1 − Fn(c + (ε/2)) + Fn(c − ε)

where Fn is the distribution function of Xn. But as long as x ≠ c, Fn(x) converges to the distribution function of the constant c, so Fn(x) → 1 if x > c, and Fn(x) → 0 if x < c. Therefore P{|Xn − c| ≥ ε} → 1 − 1 + 0 = 0 as n → ∞.

8.4 Remarks
If Y is binomial (n, p), the normal approximation to the binomial allows us to regard Y as approximately normal with mean np and variance npq (with q = 1 − p). According to Box, Hunter and Hunter, Statistics for Experimenters, page 130, the approximation works well in practice if n > 5 and

(1/√n) |√(q/p) − √(p/q)| < .3.

If, for example, we wish to estimate the probability that Y = 50 or 51 or 52, we may write this probability as P{49.5 < Y < 52.5}, and then evaluate as if Y were normal with mean np and variance np(1 − p). This turns out to be slightly more accurate in practice than using P{50 ≤ Y ≤ 52}.
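For a numerical illustration of the continuity correction (an addition to the notes, assuming SciPy is available), compare the exact binomial probability of {50 ≤ Y ≤ 52} with the two normal approximations when n = 100 and p = 1/2.

from scipy import stats
import math

n, p = 100, 0.5
mu, sd = n * p, math.sqrt(n * p * (1 - p))

exact = stats.binom(n, p).pmf([50, 51, 52]).sum()

norm = stats.norm(mu, sd)
with_correction = norm.cdf(52.5) - norm.cdf(49.5)      # P{49.5 < Y < 52.5}
without_correction = norm.cdf(52) - norm.cdf(50)       # naive P{50 <= Y <= 52}

print(f"exact = {exact:.4f}  with correction = {with_correction:.4f}  without = {without_correction:.4f}")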

8.5 Simulation
Most computers can simulate a random variable that is uniformly distributed between 0 and 1. But what if we need a random variable with an arbitrary distribution function F? For example, how would we simulate the random variable with the distribution function of Figure 8.1? The basic idea is illustrated in Figure 8.2. If Y = F(X) where X has the continuous distribution function F, then Y is uniformly distributed on [0,1]. (In Figure 8.2 we have, for 0 ≤ y ≤ 1, P{Y ≤ y} = P{X ≤ x} = F(x) = y.)
Thus if X is uniformly distributed on [0,1] and we want Y to have distribution function F, we set X = F(Y), Y = F^{−1}(X).
In Figure 8.1 we must be more precise:
Case 1. 0 ≤ X ≤ .3. Let X = (3/70)Y + (15/70), Y = (70X − 15)/3.
Case 2. .3 ≤ X ≤ .8. Let Y = 4, so P{Y = 4} = .5 as required.
Case 3. .8 ≤ X ≤ 1. Let X = (1/10)Y + (4/10), Y = 10X − 4.
In Figure 8.1, replace the F(y)-axis by an x-axis to visualize X versus Y. If y = y0 corresponds to x = x0 [i.e., x0 = F(y0)], then

P{Y ≤ y0} = P{X ≤ x0} = x0 = F(y0)

as desired.

Figure 8.1: a distribution function F(y) that increases linearly as (3/70)y + (15/70) from y = −5 up to y = 2 (where it reaches .3), stays flat at .3 until y = 4, jumps from .3 to .8 at y = 4, and then increases linearly as (1/10)y + (4/10) until it reaches 1 at y = 6.

Figure 8.2: the graph of a continuous distribution function F; the point x on the horizontal axis corresponds to y = F(x) on the vertical axis, so Y = F(X) is uniform on [0,1].
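The inverse-transform recipe is easy to code. The sketch below is an added illustration (not part of the original notes) and assumes NumPy is available; it implements the three cases for the distribution function of Figure 8.1 and checks the probability of the atom at Y = 4.

import numpy as np

rng = np.random.default_rng(8)

def simulate_Y(size):
    x = rng.random(size)                    # X uniform on [0, 1]
    y = np.empty(size)
    low = x <= 0.3                          # Case 1: invert F(y) = (3/70) y + 15/70
    mid = (x > 0.3) & (x < 0.8)             # Case 2: the jump at y = 4
    high = x >= 0.8                         # Case 3: invert F(y) = (1/10) y + 4/10
    y[low] = (70 * x[low] - 15) / 3
    y[mid] = 4.0
    y[high] = 10 * x[high] - 4
    return y

y = simulate_Y(200_000)
print("P{Y = 4} ~", np.mean(y == 4.0), " (should be about .5)")
print("range of simulated Y:", y.min(), "to", y.max())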

Problems
1. Let Xn be gamma (n, β), i.e., Xn has the gamma distribution with parameters α = n and β. Show that Xn is a sum of n independent exponential random variables, and from this derive the limiting distribution of Xn/n.
2. Show that χ²(n) is approximately normal for large n (with mean n and variance 2n).
3. Let X1, ..., Xn be iid with density f. Let Yn be the number of observations that fall into the interval (a, b). Indicate how to use a normal approximation to calculate probabilities involving Yn.
4. If we have 3 observations 6.45, 3.14, 4.93, and we round off to the nearest integer, we get 6, 3, 5. The sum of the integers is 14, but the actual sum is 14.52. Let Xi, i = 1, ..., n, be the round-off error of the i-th observation, and assume that the Xi are iid and uniformly distributed on (−1/2, 1/2). Indicate how to use a normal approximation to calculate probabilities involving the total round-off error Yn = Σ_{i=1}^n Xi.
5. Let X1, ..., Xn be iid with continuous distribution function F, and let Y1 < ··· < Yn be the order statistics of the Xi. Then F(X1), ..., F(Xn) are iid and uniformly distributed on [0,1] (see the discussion of simulation), with order statistics F(Y1), ..., F(Yn). Show that n(1 − F(Yn)) converges in distribution to an exponential random variable.

Lecture 9. Estimation
9.1 Introduction
In effect the statistician plays a game against nature, who first chooses the "state of nature" θ (a number or k-tuple of numbers in the usual case) and performs a random experiment. We do not know θ but we are allowed to observe the value of a random variable (or random vector) X, called the observable, with density fθ(x).
After observing X = x we estimate θ by θ̂(x), which is called a point estimate because it produces a single number which we hope is close to θ. The main alternative is an interval estimate or confidence interval, which will be discussed in Lectures 10 and 11.
For a point estimate θ̂(x) to make sense physically, it must depend only on x, not on the unknown parameter θ. There are many possible estimates, and there are no general rules for choosing a best estimate. Some practical considerations are:
(a) How much does it cost to collect the data?
(b) Is the performance of the estimate easy to measure, for example, can we compute P{|θ̂(x) − θ| < ε}?
(c) Are the advantages of the estimate appropriate for the problem at hand?
We will study several estimation methods:
1. Maximum likelihood estimates.
These estimates usually have highly desirable theoretical properties (consistency), and are frequently not difficult to compute.
2. Confidence intervals.
These estimates have a very useful practical feature. We construct an interval from the data, and we will know the probability that our (random) interval actually contains the unknown (but fixed) parameter.
3. Uniformly minimum variance unbiased estimates (UMVUEs).
Mathematical theory generates a large number of examples of these, but as we know, a biased estimate can sometimes be superior.
4. Bayes estimates.
These estimates are appropriate if it is reasonable to assume that the state of nature is a random variable with a known density.
In general, statistical theory produces many reasonable candidates, and practical experience will dictate the choice in a given physical situation.

9.2 Maximum Likelihood Estimates


We choose θ̂(x) = θ̂, a value of θ that makes what we have observed as likely as possible. In other words, let θ̂ maximize the likelihood function L(θ) = fθ(x), with x fixed. This corresponds to basic statistical philosophy; if what we have observed is more likely under θ2 than under θ1, we prefer θ2 to θ1.

9.3 Example
Let X be binomial (n, θ). Then the probability that X = x when the true parameter is θ is

fθ(x) = C(n, x) θ^x (1 − θ)^{n−x},   x = 0, 1, ..., n.

Maximizing fθ(x) is equivalent to maximizing ln fθ(x):

(∂/∂θ) ln fθ(x) = (∂/∂θ)[x ln θ + (n − x) ln(1 − θ)] = x/θ − (n − x)/(1 − θ) = 0.

Thus x − xθ − nθ + xθ = 0, so θ̂ = X/n, the relative frequency of success.
Notation: θ̂ will be written in terms of random variables, in this case X/n rather than x/n. Thus θ̂ is itself a random variable.
We have E(θ̂) = nθ/n = θ, so θ̂ is unbiased. By the weak law of large numbers, θ̂ →P θ, i.e., θ̂ is consistent.
9.4 Example
Let X1 , . . . , Xn be iid, normal (, 2 ), = (, 2 ). Then, with x = (x1 , . . . , xn ),



n
n
(xi )2
1
f (x) =
exp
2 2
2
i=1
and
ln f (x) =

n
n
1

(xi )2 ;
ln 2 n ln 2
2
2 i=1

(xi ) = 0,
ln f (x) = 2

i=1

xi n = 0,

= x;

i=1

n
n


n
n
1

ln f (x) = + 3
(xi )2 = 3 2 +
(xi )2 = 0

i=1

n i=1

with = x. Thus
1

(xi x)2 = s2 .
n i=1
n

2 =

Case 1. and are both unknown. Then = (X, S 2 ).


Case 2. 2 is known. Then = and = X as above. (Dierentiation with respect to
is omitted.)

14
Case 3. is known. Then = 2 and the equation (/) ln f (x) = 0 becomes
1

(xi )2
n i=1
n

2 =
so

=
(Xi )2 .
n i=1
n

The sample mean X is an unbiased and (by the weak law of large numbers) consistent
estimate of . The sample variance S 2 is a biased but consistent estimate of 2 (see
Lectures 4 and 7).
Notation: We will abbreviate maximum likelihood estimate by MLE.

9.5 The MLE of a Function of Theta

Suppose that for a fixed x, fθ(x) is a maximum when θ = θ0. Then the value of θ² when fθ(x) is a maximum is θ0². Thus to get the MLE of θ², we simply square the MLE of θ. In general, if h is any function, then the MLE of h(θ) is h(θ̂).
If h is continuous, then consistency is preserved, in other words:
If h is continuous and θ̂ →P θ, then h(θ̂) →P h(θ).
Proof. Given ε > 0, there exists δ > 0 such that if |θ̂ − θ| < δ, then |h(θ̂) − h(θ)| < ε. Consequently,

P{|h(θ̂) − h(θ)| ≥ ε} ≤ P{|θ̂ − θ| ≥ δ} → 0   as n → ∞.

(To justify the above inequality, note that if the occurrence of an event A implies the occurrence of an event B, then P(A) ≤ P(B).)
9.6 The Method of Moments

This is sometimes a quick way to obtain reasonable estimates. We set the observed k-th moment (1/n) Σ_{i=1}^n xi^k equal to the theoretical k-th moment E(Xi^k) (which will depend on the unknown parameter θ). Or we set the observed k-th central moment (1/n) Σ_{i=1}^n (xi − x̄)^k equal to the theoretical k-th central moment E[(Xi − μ)^k]. For example, let X1, ..., Xn be iid, gamma with α = θ1, β = θ2, with θ1, θ2 > 0. Then E(Xi) = μ = θ1θ2 and Var Xi = σ² = θ1θ2² (see Lecture 3). We set

X̄ = θ1θ2,   S² = θ1θ2²

and solve to get estimates θ̂i of θi, i = 1, 2, namely

θ̂2 = S²/X̄,   θ̂1 = X̄/θ̂2 = X̄²/S².
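The gamma example translates directly into code. The sketch below is an added illustration (not part of the original notes) and assumes NumPy is available; it recovers θ1 and θ2 from simulated data via X̄ and S².

import numpy as np

rng = np.random.default_rng(10)
theta1, theta2, n = 3.0, 2.0, 50_000        # true shape and scale

x = rng.gamma(shape=theta1, scale=theta2, size=n)
xbar = x.mean()
s2 = x.var()                                 # divisor n, as in Lecture 4

theta2_hat = s2 / xbar                       # method-of-moments estimates
theta1_hat = xbar / theta2_hat               # = xbar**2 / s2

print("theta1_hat =", round(theta1_hat, 3), " theta2_hat =", round(theta2_hat, 3))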

Problems
1. In this problem, X1, ..., Xn are iid with density fθ(x) or probability function pθ(x), and you are asked to find the MLE of θ.
(a) Poisson (θ),  θ > 0.
(b) fθ(x) = θx^{θ−1}, 0 < x < 1, where θ > 0. The probability is concentrated near the origin when θ < 1, and near 1 when θ > 1.
(c) Exponential with parameter θ, i.e., fθ(x) = (1/θ)e^{−x/θ}, x > 0, where θ > 0.
(d) fθ(x) = (1/2)e^{−|x−θ|}, where θ and x are arbitrary real numbers.
(e) Translated exponential, i.e., fθ(x) = e^{−(x−θ)}, where θ is an arbitrary real number and x ≥ θ.
2. Let X1, ..., Xn be iid, each uniformly distributed between θ − (1/2) and θ + (1/2). Find more than one MLE of θ (so MLEs are not necessarily unique).
3. In each part of Problem 1, calculate E(Xi) and derive an estimate based on the method of moments by setting the sample mean equal to the true mean. In each case, show that the estimate is consistent.
4. Let X be exponential with parameter θ, as in Problem 1(c). If r > 0, find the MLE of P{X > r}.
5. If X is binomial (n, θ) and a and b are integers with 0 ≤ a ≤ b ≤ n, find the MLE of P{a ≤ X ≤ b}.

Lecture 10. Confidence Intervals


10.1 Predicting an Election
There are two candidates A and B. If a voter is selected at random, the probability that the voter favors A is p, where p is fixed but unknown. We select n voters independently and ask their preference.
The number Yn of A voters is binomial (n, p), which (for sufficiently large n) is approximately normal with μ = np and σ² = np(1 − p). The relative frequency of A voters is Yn/n. We wish to estimate the minimum value of n such that we can predict A's percentage of the vote within 1 percent, with 95 percent confidence. Thus we want

P{|Yn/n − p| < .01} > .95.

Note that |(Yn/n) − p| < .01 means that p is within .01 of Yn/n. So this inequality can be written as

Yn/n − .01 < p < Yn/n + .01.

Thus the probability that the random interval In = ((Yn/n) − .01, (Yn/n) + .01) contains the true probability p is greater than .95. We say that In is a 95 percent confidence interval for p.
In general, we find confidence intervals by calculating or estimating the probability of the event that is to occur with the desired level of confidence. In this case,

P{|Yn/n − p| < .01} = P{|Yn − np| < .01n} = P{|Yn − np|/√(np(1 − p)) < .01√n/√(p(1 − p))}

and this is approximately

Φ(.01√n/√(p(1 − p))) − Φ(−.01√n/√(p(1 − p))) = 2Φ(.01√n/√(p(1 − p))) − 1 > .95

where Φ is the normal (0,1) distribution function. Since 1.95/2 = .975 and Φ(1.96) = .975, we have

.01√n/√(p(1 − p)) > 1.96,   n > (196)² p(1 − p).

But (by calculus) p(1 − p) is maximized when 1 − 2p = 0, p = 1/2, p(1 − p) = 1/4. Thus n > (196)²/4 = (98)² = (100 − 2)² = 10000 − 400 + 4 = 9604.
If we want to get within one tenth of one percent (.001) of p with 99 percent confidence, we repeat the above analysis with .01 replaced by .001, 1.99/2 = .995 and Φ(2.6) = .995. Thus

.001√n/√(p(1 − p)) > 2.6,   n > (2600)²/4 = (1300)² = 1,690,000.

To get within 3 percent with 95 percent confidence, we have

.03√n/√(p(1 − p)) > 1.96,   n > (196/3)²(1/4) = 1067.

If the experiment is repeated independently a large number of times, it is very likely that our result will be within .03 of the true probability p at least 95 percent of the time. The usual statement "The margin of error of this poll is ±3%" does not capture this idea.
Note that the accuracy of the prediction depends only on the number of voters polled and not on the total number of voters in the population. But the model assumes sampling with replacement. (Theoretically, the same voter can be polled more than once since the voters are selected independently.) In practice, sampling is done without replacement, but if the number n of voters polled is small relative to the population size N, the error is very small.
The normal approximation to the binomial (based on the central limit theorem) is quite reliable, and is used in practice even for modest values of n; see (8.4).
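The arithmetic above generalizes to any margin of error and confidence level. The following sketch is an added illustration (not part of the original notes) and assumes SciPy is available; it reproduces the three sample sizes computed in this section (the lecture rounds the normal quantiles to 1.96 and 2.6, so its figures can differ slightly).

from scipy import stats
import math

def poll_sample_size(margin, confidence):
    # n must satisfy (margin * sqrt(n)) / sqrt(p(1-p)) > z, with the worst case p = 1/2
    z = stats.norm.ppf((1 + confidence) / 2)
    return math.ceil((z / (2 * margin)) ** 2)

for margin, conf in [(0.01, 0.95), (0.001, 0.99), (0.03, 0.95)]:
    print(f"margin {margin:>6}, confidence {conf}: n >= {poll_sample_size(margin, conf)}")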

10.2 Estimating the Mean of a Normal Population

Let X1, ..., Xn be iid, each normal (μ, σ²). We will find a confidence interval for μ.
Case 1. The variance σ² is known. Then X̄ is normal (μ, σ²/n), so

(X̄ − μ)/(σ/√n)   is normal (0,1),

hence

P{−b < (X̄ − μ)/(σ/√n) < b} = Φ(b) − Φ(−b) = 2Φ(b) − 1

and the inequality defining the confidence interval can be written as

X̄ − bσ/√n < μ < X̄ + bσ/√n.

We choose a symmetrical interval to minimize the length, because the normal density with zero mean is symmetric about 0. The desired confidence level determines b, which then determines the confidence interval.
Case 2. The variance σ² is unknown. Recall from (5.1) that

(X̄ − μ)/(S/√(n − 1))   is   T(n − 1)

hence

P{−b < (X̄ − μ)/(S/√(n − 1)) < b} = 2FT(b) − 1

and the inequality defining the confidence interval can be written as

X̄ − bS/√(n − 1) < μ < X̄ + bS/√(n − 1).
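Case 2 is what statistical software computes routinely. The sketch below is an added illustration (not part of the original notes) and assumes NumPy and SciPy are available; it builds the 95 percent interval X̄ ± bS/√(n − 1), with b taken from the T(n − 1) distribution, and checks its coverage by simulation.

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
mu, sigma, n, reps = 10.0, 3.0, 12, 20_000
b = stats.t(df=n - 1).ppf(0.975)             # P{-b < T < b} = .95

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
S = x.std(axis=1)                            # divisor n, as in these notes
half = b * S / np.sqrt(n - 1)

covered = np.mean((xbar - half < mu) & (mu < xbar + half))
print("observed coverage:", round(covered, 4), " (target .95)")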

10.3 A Correction Factor When Sampling Without Replacement

The following results will not be used and may be omitted, but it is interesting to measure quantitatively the effect of sampling without replacement. In the election prediction problem, let Xi be the indicator of success (i.e., selecting an A voter) on trial i. Then P{Xi = 1} = p and P{Xi = 0} = 1 − p. If sampling is done with replacement, then the Xi are independent and the total number X = X1 + ··· + Xn of A voters in the sample is binomial (n, p). Thus the variance of X is np(1 − p). However, if sampling is done without replacement, then in effect we are drawing n balls from an urn containing N balls (where N is the size of the population), with Np balls labeled A and N(1 − p) labeled B. Recall from basic probability theory that

Var X = Σ_{i=1}^n Var Xi + 2 Σ_{i<j} Cov(Xi, Xj)

where Cov stands for covariance. (We will prove this in a later lecture.) If i ≠ j, then

E(XiXj) = P{Xi = Xj = 1} = P{X1X2 = 1} = (Np/N)((Np − 1)/(N − 1))

and

Cov(Xi, Xj) = E(XiXj) − E(Xi)E(Xj) = p(Np − 1)/(N − 1) − p²

which reduces to −p(1 − p)/(N − 1). Now Var Xi = p(1 − p), so

Var X = np(1 − p) − 2 C(n, 2) p(1 − p)/(N − 1)

which reduces to

np(1 − p)[1 − (n − 1)/(N − 1)] = np(1 − p)(N − n)/(N − 1).

Thus if SE is the standard error (the standard deviation of X), then SE (without replacement) = SE (with replacement) times a correction factor, where the correction factor is

√((N − n)/(N − 1)) = √((1 − (n/N))/(1 − (1/N))).

The correction factor is less than 1, and approaches 1 as N → ∞, as long as n/N → 0.
Note also that in sampling without replacement, the probability of getting exactly k A's in n trials is

C(Np, k) C(N(1 − p), n − k) / C(N, n)

with the standard pattern Np + N(1 − p) = N and k + (n − k) = n.

Problems
1. In the normal case [see (10.2)], assume that σ² is known. Explain how to compute the length of the confidence interval for μ.
2. Continuing Problem 1, assume that σ² is unknown. Explain how to compute the length of the confidence interval for μ, in terms of the sample standard deviation S.
3. Continuing Problem 2, explain how to compute the expected length of the confidence interval for μ, in terms of the unknown standard deviation σ. (Note that when σ is unknown, we expect a larger interval since we have less information.)
4. Let X1, ..., Xn be iid, each gamma with parameters α and β. If α is known, explain how to compute a confidence interval for the mean μ = αβ.
5. In the binomial case [see (10.1)], suppose we specify the level of confidence and the length of the confidence interval. Explain how to compute the minimum value of n.

Lecture 11. More Confidence Intervals

11.1 Differences of Means

Let X1, ..., Xn be iid, each normal (μ1, σ²), and let Y1, ..., Ym be iid, each normal
(μ2, σ²). Assume that (X1, ..., Xn) and (Y1, ..., Ym) are independent. We will construct
a confidence interval for μ1 − μ2. In practice, the interval is often used in the following
way. If the interval lies entirely to the left of 0, we have reason to believe that μ1 < μ2.
Since Var(X̄ − Ȳ) = Var X̄ + Var Ȳ = (σ²/n) + (σ²/m),

    [X̄ − Ȳ − (μ1 − μ2)] / [σ√(1/n + 1/m)]  is normal (0,1).

Also, nS1²/σ² is χ²(n − 1) and mS2²/σ² is χ²(m − 1). But χ²(r) is the sum of squares of r independent normal (0,1) random variables, so

    nS1²/σ² + mS2²/σ²  is  χ²(n + m − 2).

Thus if

    R = √[(nS1² + mS2²)/(n + m − 2)] · √(1/n + 1/m)

then

    T = [X̄ − Ȳ − (μ1 − μ2)] / R  is  T(n + m − 2).

Our assumption that both populations have the same variance is crucial, because the
unknown variance σ can be cancelled.
If P{−b < T < b} = .95 we get a 95 percent confidence interval for μ1 − μ2:

    −b < [X̄ − Ȳ − (μ1 − μ2)]/R < b

or

    (X̄ − Ȳ) − bR < μ1 − μ2 < (X̄ − Ȳ) + bR.

If the variances σ1² and σ2² are known but possibly unequal, then

    [X̄ − Ȳ − (μ1 − μ2)] / √(σ1²/n + σ2²/m)

is normal (0,1). If R0 is the denominator of the above fraction, we can get a 95 percent
confidence interval as before: Φ(b) − Φ(−b) = 2Φ(b) − 1 ≥ .95,

    (X̄ − Ȳ) − bR0 < μ1 − μ2 < (X̄ − Ȳ) + bR0.
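A short sketch of the equal-variance interval (NumPy/SciPy assumed; diff_means_ci is our name), using the notes' convention Si² = MLE variance:

    import numpy as np
    from scipy.stats import t

    def diff_means_ci(x, y, confidence=0.95):
        # CI for mu1 - mu2 under a common unknown variance
        x, y = np.asarray(x, float), np.asarray(y, float)
        n, m = len(x), len(y)
        s1sq, s2sq = x.var(ddof=0), y.var(ddof=0)      # divide by n and by m
        R = np.sqrt((n * s1sq + m * s2sq) / (n + m - 2)) * np.sqrt(1/n + 1/m)
        b = t.ppf(0.5 + confidence / 2, n + m - 2)
        d = x.mean() - y.mean()
        return d - b * R, d + b * R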

11.2 Example

Let Y1 and Y2 be binomial (n1, p1) and (n2, p2) respectively. Then

    Y1 = X1 + ··· + X_{n1}   and   Y2 = Z1 + ··· + Z_{n2}

where the Xi and Zj are indicators of success on trials i and j respectively. Assume
that X1, ..., X_{n1}, Z1, ..., Z_{n2} are independent. Now E(Y1/n1) = p1 and Var(Y1/n1) =
n1 p1(1 − p1)/n1² = p1(1 − p1)/n1, with similar formulas for Y2/n2. Thus for large n,

    Y1/n1 − Y2/n2 − (p1 − p2)

divided by

    √[p1(1 − p1)/n1 + p2(1 − p2)/n2]

is approximately normal (0,1). But this expression cannot be used to construct confidence
intervals for p1 − p2 because the denominator involves the unknown quantities p1 and p2.
However, Y1/n1 converges in probability to p1 and Y2/n2 converges in probability to p2,
and this justifies replacing p1 by Y1/n1 and p2 by Y2/n2 in the denominator.
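A numerical sketch of the resulting approximate interval (NumPy/SciPy assumed; the function name is ours):

    import numpy as np
    from scipy.stats import norm

    def diff_props_ci(y1, n1, y2, n2, confidence=0.95):
        # Approximate CI for p1 - p2, plugging in phat_i = y_i / n_i
        p1, p2 = y1 / n1, y2 / n2
        se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        b = norm.ppf(0.5 + confidence / 2)
        d = p1 - p2
        return d - b * se, d + b * se

    print(diff_props_ci(520, 1000, 480, 1000))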

11.3 The Variance

We will construct confidence intervals for the variance of a normal population. Let
X1, ..., Xn be iid, each normal (μ, σ²), so that nS²/σ² is χ²(n − 1). If h_{n−1} is the
χ²(n − 1) density and a and b are chosen so that ∫_a^b h_{n−1}(x) dx = 1 − α, then

    P{a < nS²/σ² < b} = 1 − α.

But a < nS²/σ² < b is equivalent to

    nS²/b < σ² < nS²/a,

so we have a confidence interval for σ² at confidence level 1 − α. In practice, a and b are
chosen so that ∫_b^∞ h_{n−1}(x) dx = ∫_0^a h_{n−1}(x) dx. For example, if H_{n−1} is the χ²(n − 1)
distribution function and the confidence level is 95 percent, we take H_{n−1}(a) = .025
and H_{n−1}(b) = 1 − .025 = .975. This is optimal (the length of the confidence interval
is minimized) when the density is symmetric about zero, and in the symmetric case we
would have a = −b. In the nonsymmetric case (as we have here), the error is usually
small.
In this example, μ is unknown. If the mean μ is known, we can make use of this
knowledge to improve performance. Note that

    Σ_{i=1}^n [(Xi − μ)/σ]²  is  χ²(n),

so if

    W = Σ_{i=1}^n (Xi − μ)²

and we choose a and b so that ∫_a^b h_n(x) dx = 1 − α, then P{a < W/σ² < b} = 1 − α.
The inequality defining the confidence interval can be written as

    W/b < σ² < W/a.
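A small sketch of the μ-unknown case (NumPy/SciPy assumed; variance_ci is our own name):

    import numpy as np
    from scipy.stats import chi2

    def variance_ci(x, confidence=0.95):
        # CI for sigma^2 with mu unknown: nS^2/sigma^2 ~ chi2(n-1), S^2 the MLE
        x = np.asarray(x, float)
        n = len(x)
        nssq = n * x.var(ddof=0)           # n S^2 = sum (x_i - xbar)^2
        alpha = 1 - confidence
        a = chi2.ppf(alpha / 2, n - 1)
        b = chi2.ppf(1 - alpha / 2, n - 1)
        return nssq / b, nssq / a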

11.4 Ratios of Variances

Here we see an application of the F distribution. Let X1, ..., X_{n1} be iid, each normal
(μ1, σ1²), and let Y1, ..., Y_{n2} be iid, each normal (μ2, σ2²). Assume that (X1, ..., X_{n1}) and
(Y1, ..., Y_{n2}) are independent. Then ni Si²/σi² is χ²(ni − 1), i = 1, 2. Thus

    [(n2 S2²/σ2²)/(n2 − 1)] / [(n1 S1²/σ1²)/(n1 − 1)]  is  F(n2 − 1, n1 − 1).

Let V² be the unbiased version of the sample variance, i.e.,

    V² = [n/(n − 1)] S² = [1/(n − 1)] Σ_{i=1}^n (Xi − X̄)².

Then

    V2² σ1² / (V1² σ2²)  is  F(n2 − 1, n1 − 1)

and this allows construction of confidence intervals for σ1²/σ2² in the usual way.
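A sketch of "the usual way" in code (NumPy/SciPy assumed; var_ratio_ci is our name):

    import numpy as np
    from scipy.stats import f

    def var_ratio_ci(x, y, confidence=0.95):
        # CI for sigma1^2/sigma2^2 from V2^2 sigma1^2 / (V1^2 sigma2^2) ~ F(n2-1, n1-1)
        v1, v2 = np.var(x, ddof=1), np.var(y, ddof=1)   # unbiased sample variances
        n1, n2 = len(x), len(y)
        alpha = 1 - confidence
        lo = f.ppf(alpha / 2, n2 - 1, n1 - 1)
        hi = f.ppf(1 - alpha / 2, n2 - 1, n1 - 1)
        ratio = v1 / v2
        return ratio * lo, ratio * hi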

Problems
1. In (11.1), suppose the variances σ1² and σ2² are unknown and possibly unequal. Explain
   why the analysis of (11.1) breaks down.
2. In (11.1), again assume that the variances are unknown, but σ1² = cσ2² where c is a
   known positive constant. Show that confidence intervals for the difference of means
   can be constructed.

Lecture 12. Hypothesis Testing

12.1 Basic Terminology

In our general statistical model (Lecture 9), suppose that the set of possible values of θ
is partitioned into two subsets A0 and A1, and the problem is to decide between the two
possibilities H0: θ ∈ A0, the null hypothesis, and H1: θ ∈ A1, the alternative. Mathematically, it doesn't make any difference which possibility you call the null hypothesis,
but in practice, H0 is the "default setting". For example, H0: θ ≤ θ0 might mean that
a drug is no more effective than existing treatments, while H1: θ > θ0 might mean that
the drug is a significant improvement.
We observe x and make a decision via φ(x) = 0 or 1. There are two types of errors. A
type 1 error occurs if H0 is true but φ(x) = 1, in other words, we declare that H1 is true.
Thus in a type 1 error, we reject H0 when it is true.
A type 2 error occurs if H0 is false but φ(x) = 0, i.e., we declare that H0 is true. Thus
in a type 2 error, we accept H0 when it is false.
If H0 [resp. H1] means that a patient does not have [resp. does have] a particular
disease, then a type 1 error is also called a false positive, and a type 2 error is also called
a false negative.
If φ(x) is always 0, then a type 1 error can never occur, but a type 2 error will always
occur. Symmetrically, if φ(x) is always 1, then there will always be a type 1 error, but
never an error of type 2. Thus by ignoring the data altogether we can reduce one of the
error probabilities to zero. To get both error probabilities to be small, in practice we must
increase the sample size.
We say that H0 [resp. H1] is simple if A0 [resp. A1] contains only one element,
composite if A0 [resp. A1] contains more than one element. So in the case of simple
hypothesis vs. simple alternative, we are testing θ = θ0 vs. θ = θ1. The standard example
is to test the hypothesis that X has density f0 vs. the alternative that X has density f1.

12.2 Likelihood Ratio Tests

In the case of simple hypothesis vs. simple alternative, if we require that the probability
of a type 1 error be at most α and try to minimize the probability of a type 2 error, the
optimal test turns out to be a likelihood ratio test (LRT), defined as follows. Let L(x),
the likelihood ratio, be f1(x)/f0(x), and let λ be a constant. If L(x) > λ, reject H0; if
L(x) < λ, accept H0; if L(x) = λ, do anything.
Intuitively, if what we have observed seems significantly more likely under H1, we will
tend to reject H0. If H0 or H1 is composite, there is no general optimality result as there
is in the simple vs. simple case. In this situation, we resort to basic statistical philosophy:
if, assuming that H0 is true, we witness a rare event, we tend to reject H0.
The statement that LRTs are optimal is the Neyman-Pearson lemma, to be proved
at the end of the lecture. In many common examples (normal, Poisson, binomial, exponential), L(x1, ..., xn) can be expressed as a function of the sum of the observations,
or equivalently as a function of the sample mean. This motivates consideration of tests
based on Σ_{i=1}^n Xi or on X̄.
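A minimal sketch of an LRT in code (NumPy assumed; lrt_decision and the choice λ = 1 are ours, purely for illustration). It uses the densities of Example (12.4) below as a demo.

    import numpy as np

    def lrt_decision(x, f0, f1, lam):
        # Likelihood ratio test for simple H0 (density f0) vs. H1 (density f1);
        # reject H0 when L(x) = prod f1(x_i) / prod f0(x_i) exceeds lam.
        x = np.asarray(x, float)
        log_L = np.sum(np.log(f1(x))) - np.sum(np.log(f0(x)))   # log form avoids underflow
        return log_L > np.log(lam)      # True means "reject H0"

    # H0: uniform on (0,1); H1: density 3x^2 on (0,1); one observation x = 0.9
    print(lrt_decision([0.9], lambda x: np.ones_like(x), lambda x: 3 * x**2, lam=1.0))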

12.3 Example

Let X1, ..., Xn be iid, each normal (μ, σ²), with σ² known. We will test H0: μ ≤ μ0 vs. H1: μ > μ0.
Under H1, X̄ will tend to be larger, so let's reject H0 when X̄ > c. The power function
of the test is defined by

    K(μ) = P_μ{reject H0},

the probability of rejecting the null hypothesis when the true parameter is μ. In this case,

    P_μ{X̄ > c} = P{ (X̄ − μ)/(σ/√n) > (c − μ)/(σ/√n) } = 1 − Φ( (c − μ)/(σ/√n) )

(see Figure 12.1). Suppose that we specify the probability α of a type 1 error when μ = μ1,
and the probability β of a type 2 error when μ = μ2. Then

    K(μ1) = 1 − Φ( (c − μ1)/(σ/√n) ) = α

and

    K(μ2) = 1 − Φ( (c − μ2)/(σ/√n) ) = 1 − β.

If α, β, σ, μ1 and μ2 are known, we have two equations that can be solved for c and n.

Figure 12.1 [sketch of the power function K(μ), increasing toward 1]

The critical region is the set of observations that lead to rejection. In this case, it is
{(x1, ..., xn): (1/n) Σ_{i=1}^n xi > c}.
The significance level is the largest type 1 error probability. Here it is K(μ0), since
K(μ) increases with μ.
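Solving the two equations numerically is routine; here is a hedged sketch (SciPy assumed; design_test and the sample values are ours):

    import numpy as np
    from scipy.stats import norm

    def design_test(mu1, mu2, sigma, alpha, beta):
        # Solve K(mu1) = alpha and K(mu2) = 1 - beta for the cutoff c and sample size n
        za, zb = norm.ppf(1 - alpha), norm.ppf(1 - beta)
        n = int(np.ceil(((za + zb) * sigma / (mu2 - mu1)) ** 2))
        c = mu1 + za * sigma / np.sqrt(n)
        return c, n

    print(design_test(mu1=0.0, mu2=1.0, sigma=2.0, alpha=0.05, beta=0.10))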

12.4 Example

Let H0: X is uniformly distributed on (0,1), so f0(x) = 1, 0 < x < 1, and 0 elsewhere.
Let H1: f1(x) = 3x², 0 < x < 1, and 0 elsewhere. We take only one observation, and
reject H0 if x > c, where 0 < c < 1. Then

    K(0) = P0{X > c} = 1 − c,    K(1) = P1{X > c} = ∫_c^1 3x² dx = 1 − c³.

If we specify the probability α of a type 1 error, then α = 1 − c, which determines c. If β
is the probability of a type 2 error, then 1 − β = 1 − c³, so β = c³. Thus (see Figure 12.2)

    β = (1 − α)³.

If α = .05 then β = (.95)³ ≈ .86, which indicates that you usually can't do too well with
only one observation.

Figure 12.2

12.5 Tests Derived From Confidence Intervals

Let X1, ..., Xn be iid, each normal (μ, σ²). In Lecture 10, we found a confidence interval
for μ, assuming σ² unknown, via

    P{ −b < (X̄ − μ)/(S/√(n − 1)) < b } = 2F_T(b) − 1,   where   T = (X̄ − μ)/(S/√(n − 1))

has the T distribution with n − 1 degrees of freedom.
Say 2F_T(b) − 1 = .95, so that (when μ = μ0)

    P{ |X̄ − μ0| / (S/√(n − 1)) ≥ b } = .05.

If μ actually equals μ0, we are witnessing an event of low probability. So it is natural to
test μ = μ0 vs. μ ≠ μ0 by rejecting if

    |X̄ − μ0| / (S/√(n − 1)) ≥ b,

in other words, μ0 does not belong to the confidence interval. As the true mean μ
moves away from μ0 in either direction, the probability of this event will increase, since
X̄ − μ0 = (X̄ − μ) + (μ − μ0).
Tests of μ = μ0 vs. μ ≠ μ0 are called two-sided, as opposed to μ = μ0 vs. μ > μ0 (or
μ = μ0 vs. μ < μ0), which are one-sided. In the present case, if we test μ = μ0 vs. μ > μ0,
we reject if

    (X̄ − μ0)/(S/√(n − 1)) ≥ b.

The power function K(μ) is difficult to compute for μ ≠ μ0, because (X̄ − μ0)/(σ/√n)
no longer has mean zero. The noncentral T distribution becomes involved.

12.6 The Neyman-Pearson Lemma

Assume that we are testing the simple hypothesis that X has density f0 vs. the simple
alternative that X has density f1. Let φ* be an LRT with parameter λ (a nonnegative
constant), in other words, φ*(x) is the probability of rejecting H0 when x is observed,
and

    φ*(x) = 1 if L(x) > λ,   0 if L(x) < λ,   anything if L(x) = λ.

Suppose that the probability of a type 1 error using φ* is α*, and the probability of a
type 2 error is β*. Let φ be an arbitrary test with error probabilities α and β. If α ≤ α*,
then β ≥ β*. In other words, the LRT has maximum power among all tests at significance
level α*.
Proof. We are going to assume that f0 and f1 are one-dimensional, but the argument
works equally well when X = (X1, ..., Xn) and the fi are n-dimensional joint densities.
We recall from basic probability theory the theorem of total probability, which says that
if X has density f, then for any event A,

    P(A) = ∫ P(A | X = x) f(x) dx.

A companion theorem which we will also use later is the theorem of total expectation,
which says that if X has density f, then for any random variable Y,

    E(Y) = ∫ E(Y | X = x) f(x) dx.

By the theorem of total probability,

    α* = ∫ φ*(x) f0(x) dx,    1 − β* = ∫ φ*(x) f1(x) dx

and similarly

    α = ∫ φ(x) f0(x) dx,    1 − β = ∫ φ(x) f1(x) dx.

We claim that for all x,

    [φ*(x) − φ(x)][f1(x) − λ f0(x)] ≥ 0.

For if f1(x) > λ f0(x) then L(x) > λ, so φ*(x) = 1 ≥ φ(x), and if f1(x) < λ f0(x) then
L(x) < λ, so φ*(x) = 0 ≤ φ(x), proving the assertion. Now if a function is always
nonnegative, its integral must be nonnegative, so

    ∫ [φ*(x) − φ(x)][f1(x) − λ f0(x)] dx ≥ 0.

The terms involving f0 translate to statements about type 1 errors, and the terms involving
f1 translate to statements about type 2 errors. Thus

    (1 − β*) − (1 − β) + λ(α − α*) ≥ 0,

which says that β − β* ≥ λ(α* − α) ≥ 0, completing the proof.

12.7 Randomization

If L(x) = λ, then "do anything" means that randomization is possible, e.g., we can flip
a possibly biased coin to decide whether or not to accept H0. (This may be significant
in the discrete case, where L(x) = λ may have positive probability.) Statisticians tend
to frown on this practice because two statisticians can look at exactly the same data and
come to different conclusions. It is possible to adjust the significance level (by replacing
"do anything" by a definite choice of either H0 or H1) to avoid randomization.

Problems
1. Consider the problem of testing μ = μ0 vs. μ > μ0, where μ is the mean of a normal
   population with known variance. Assume that the sample size n is fixed. Show that
   the test given in Example 12.3 (reject H0 if X̄ > c) is uniformly most powerful. In
   other words, if we test μ = μ0 vs. μ = μ1 for any given μ1 > μ0, and we specify the
   probability α of a type 1 error, then the probability of a type 2 error is minimized.
2. It is desired to test the null hypothesis that a die is unbiased vs. the alternative that
   the die is loaded, with faces 1 and 2 having probability 1/4 and faces 3, 4, 5 and 6 having
   probability 1/8. The die is to be tossed once. Find a most powerful test at level α = .1,
   and find the type 2 error probability β.
3. We wish to test a binomial random variable X with n = 400 and H0: p = 1/2 vs.
   H1: p > 1/2. The random variable Y = (X − np)/√(np(1 − p)) = (X − 200)/10 is
   approximately normal (0,1), and we will reject H0 if Y > c. If we specify α = .05,
   then c = 1.645. Thus the critical region is X > 216.45. Suppose the actual result is
   X = 220, so that H0 is rejected. Find the minimum value of α (sometimes called the
   p-value) for which the given data lead to the opposite conclusion (acceptance of H0).

Lecture 13. Chi-Square Tests

13.1 Introduction

Let X1, ..., Xk be multinomial, i.e., Xi is the number of occurrences of the event Ai in
n generalized Bernoulli trials (Lecture 6). Then

    P{X1 = n1, ..., Xk = nk} = [n!/(n1! ··· nk!)] p1^{n1} ··· pk^{nk}

where the ni are nonnegative integers whose sum is n. Consider k = 2. Then X1 is
binomial (n, p1) and (X1 − np1)/√(np1(1 − p1)) is approximately normal (0,1). Consequently, the random
variable (X1 − np1)²/np1(1 − p1) is approximately χ²(1). But

    (X1 − np1)²/[np1(1 − p1)] = [(X1 − np1)²/n][1/p1 + 1/(1 − p1)] = (X1 − np1)²/(np1) + (X2 − np2)²/(np2).

(Note that since k = 2 we have p2 = 1 − p1 and X1 − np1 = n − X2 − np1 = np2 −
X2 = −(X2 − np2), and the outer minus sign disappears when squaring.) Therefore
[(X1 − np1)²/np1] + [(X2 − np2)²/np2] is approximately χ²(1). More generally, it can be shown that

    Q = Σ_{i=1}^k (Xi − npi)²/(npi)  is approximately  χ²(k − 1),

where

    (Xi − npi)²/(npi) = (observed frequency − expected frequency)² / expected frequency.

We will consider three types of chi-square tests.

13.2 Goodness of Fit

We ask whether X has a specified distribution (normal, Poisson, etc.). The null hypothesis
is that the multinomial probabilities are p = (p1, ..., pk), and the alternative is that
p ≠ (p1, ..., pk).
Suppose that P{χ²(k − 1) > c} is at the desired level of significance (for example,
.05). If Q > c we will reject H0. The idea is that if H0 is in fact true, we have witnessed
a rare event, so rejection is reasonable. If H0 is false, it is reasonable to expect that some
of the Xi will be far from npi, so Q will be large.
Some practical considerations: Take n large enough so that each npi ≥ 5. Each time a
parameter is estimated from the sample, reduce the number of degrees of freedom by 1. (A typical
case: the null hypothesis is that X is Poisson (λ), but the mean λ is unknown, and is
estimated by the sample mean.)
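A small sketch of the computation (NumPy/SciPy assumed; goodness_of_fit is our own name):

    import numpy as np
    from scipy.stats import chi2

    def goodness_of_fit(observed, probs, alpha=0.05, estimated_params=0):
        # Pearson chi-square statistic Q and cutoff c, as in (13.1)-(13.2)
        observed = np.asarray(observed, float)
        expected = observed.sum() * np.asarray(probs, float)
        Q = np.sum((observed - expected) ** 2 / expected)
        df = len(observed) - 1 - estimated_params
        c = chi2.ppf(1 - alpha, df)
        return Q, c, Q > c      # (statistic, cutoff, reject H0?)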


13.3 Equality of Distributions

We ask whether two or more samples come from the same underlying distribution. The
observed results are displayed in a contingency table. This is an h × k matrix whose rows
are the samples and whose columns are the attributes to be observed. For example, row
i might be (7, 11, 15, 13, 4), with the interpretation that in a class of 50 students taught
by method of instruction i, there were 7 grades of A, 11 of B, 15 of C, 13 of D and 4
of F. The null hypothesis H0 is that there is no difference between the various methods
of instruction, i.e., P(A) is the same for each group, and similarly for the probabilities of
the other grades. We estimate P(A) from the sample by adding all entries in column A
and dividing by the total number of observations in the entire experiment. We estimate
P(B), P(C), P(D) and P(F) in a similar fashion. The expected frequencies in row i are
found by multiplying the grade probabilities by the number of entries in row i.
If there are h groups (samples), each with k attributes, then each group generates a chi-square (k − 1), and k − 1 probabilities are estimated from the sample (the last probability
is determined). The number of degrees of freedom is h(k − 1) − (k − 1) = (h − 1)(k − 1),
call it r. If P{χ²(r) > c} is the desired significance level, we reject H0 if the chi-square
statistic is greater than c.

13.4 Testing For Independence

Again we have a contingency table with h rows corresponding to the possible values xi of
a random variable X, and k columns corresponding to the possible values yj of a random
variable Y. We are testing the null hypothesis that X and Y are independent.
Let Ri be the sum of the entries in row i, and let Cj be the sum of the entries
in column j. Then the sum of all observations is T = Σ_i Ri = Σ_j Cj. We estimate
P{X = xi} by Ri/T, and P{Y = yj} by Cj/T. Under the independence hypothesis H0,
P{X = xi, Y = yj} = P{X = xi}P{Y = yj} = Ri Cj/T². Thus the expected frequency
of (xi, yj) is Ri Cj/T. (This gives another way to calculate the expected frequencies in
(13.3). In that case, we estimated the j-th column probability by Cj/T, and multiplied
by the sum of the entries in row i, namely Ri.)
In an h × k contingency table, the number of degrees of freedom is hk − 1 minus the
number of estimated parameters:

    hk − 1 − (h − 1 + k − 1) = hk − h − k + 1 = (h − 1)(k − 1).

The chi-square statistic is calculated as in (13.3). Similarly, if there are 3 attributes to
be tested for independence and we form an h × k × m contingency table, the number of
degrees of freedom is

    hkm − 1 − [(h − 1) + (k − 1) + (m − 1)] = hkm − h − k − m + 2.
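A sketch of the h × k independence test (NumPy/SciPy assumed; independence_test is our name):

    import numpy as np
    from scipy.stats import chi2

    def independence_test(table, alpha=0.05):
        # Chi-square test for independence in an h x k contingency table, as in (13.4)
        table = np.asarray(table, float)
        R = table.sum(axis=1, keepdims=True)   # row totals R_i
        C = table.sum(axis=0, keepdims=True)   # column totals C_j
        expected = R @ C / table.sum()         # expected frequencies R_i C_j / T
        Q = np.sum((table - expected) ** 2 / expected)
        df = (table.shape[0] - 1) * (table.shape[1] - 1)
        return Q, Q > chi2.ppf(1 - alpha, df)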

Problems
1. Use a chi-square procedure to test the null hypothesis that a random variable X has
   the following distribution:

       P{X = 1} = .5,   P{X = 2} = .3,   P{X = 3} = .2.

   We take 100 independent observations of X, and it is observed that 1 occurs 40 times,
   2 occurs 33 times, and 3 occurs 27 times. Determine whether or not we will reject the
   null hypothesis at significance level .05.
2. Use a chi-square test to decide (at significance level .05) whether the two samples corresponding to the rows of the contingency table below came from the same underlying
   distribution.

                 A      B      C
       Sample 1  33    147    114
       Sample 2  67    153     86

3. Suppose we are testing for independence in a 2 × 2 contingency table

       a   b
       c   d

   Show that the chi-square statistic is

       (ad − bc)²(a + b + c + d) / [(a + b)(c + d)(a + c)(b + d)].

   (The number of degrees of freedom is 1 · 1 = 1.)

Lecture 14. Sufficient Statistics

14.1 Definitions and Comments

Let X1, ..., Xn be iid with P{Xi = 1} = θ and P{Xi = 0} = 1 − θ, so P{Xi = x} =
θ^x (1 − θ)^{1−x}, x = 0, 1. Let Y be a statistic for θ, i.e., a function of the observables
X1, ..., Xn. In this case we take Y = X1 + ··· + Xn, the total number of successes in n
Bernoulli trials with probability of success θ on a given trial.
We claim that the conditional distribution of X1, ..., Xn given Y is free of θ, in other
words, does not depend on θ. We say that Y is sufficient for θ.
To prove this, note that

    P{X1 = x1, ..., Xn = xn | Y = y} = P{X1 = x1, ..., Xn = xn, Y = y} / P{Y = y}.

This is 0 unless y = x1 + ··· + xn, in which case we get

    θ^y (1 − θ)^{n−y} / [ (n choose y) θ^y (1 − θ)^{n−y} ] = 1 / (n choose y).

For example, if we know that there were 3 heads in 5 tosses, the probability that the
actual tosses were HTHHT is 1/(5 choose 3).

14.2 The Key Idea

For the purpose of making a statistical decision, we can ignore the individual random
variables Xi and base the decision entirely on X1 + ··· + Xn.
Suppose that statistician A observes X1, ..., Xn and makes a decision. Statistician
B observes X1 + ··· + Xn only, and constructs X1', ..., Xn' according to the conditional
distribution of X1, ..., Xn given Y, i.e.,

    P{X1' = x1, ..., Xn' = xn | Y = y} = 1 / (n choose y).

This construction is possible because the conditional distribution does not depend on the
unknown parameter θ. We will show that under θ, (X1, ..., Xn) and (X1', ..., Xn') have
exactly the same distribution, so anything A can do, B can do at least as well, even though
B has less information.
Given x1, ..., xn, let y = x1 + ··· + xn. The only way we can have X1' = x1, ..., Xn' =
xn is if Y = y and then B's experiment produces X1' = x1, ..., Xn' = xn given y. Thus

    P{X1' = x1, ..., Xn' = xn} = P{Y = y} P{X1' = x1, ..., Xn' = xn | Y = y}
      = (n choose y) θ^y (1 − θ)^{n−y} · [1/(n choose y)] = θ^y (1 − θ)^{n−y} = P{X1 = x1, ..., Xn = xn}.

14.3 The Factorization Theorem

Let Y = u(X) be a statistic for θ (X can be (X1, ..., Xn), and usually is). Then
Y is sufficient for θ if and only if the density f_θ(x) of X under θ can be factored as
f_θ(x) = g(θ, u(x)) h(x).
[In the Bernoulli case, f_θ(x1, ..., xn) = θ^y (1 − θ)^{n−y} where y = u(x) = Σ_{i=1}^n xi and
h(x) = 1.]
Proof. (Discrete case.) If Y is sufficient, then

    P_θ{X = x} = P_θ{X = x, Y = u(x)} = P_θ{Y = u(x)} P{X = x | Y = u(x)} = g(θ, u(x)) h(x).

Conversely, assume f_θ(x) = g(θ, u(x)) h(x). Then

    P_θ{X = x | Y = y} = P_θ{X = x, Y = y} / P_θ{Y = y}.

This is 0 unless y = u(x), in which case it becomes

    P_θ{X = x} / P_θ{Y = y} = g(θ, u(x)) h(x) / Σ_{z: u(z)=y} g(θ, u(z)) h(z).

The g terms in both numerator and denominator are g(θ, y), which can be cancelled to
obtain

    P{X = x | Y = y} = h(x) / Σ_{z: u(z)=y} h(z),

which is free of θ.

14.4 Example

Let X1, ..., Xn be iid, each normal (μ, σ²), so that

    f_θ(x1, ..., xn) = [1/(σ√(2π))]^n exp[ −(1/2σ²) Σ_{i=1}^n (xi − μ)² ].

Take θ = (μ, σ²) and let x̄ = (1/n) Σ_{i=1}^n xi, s² = (1/n) Σ_{i=1}^n (xi − x̄)². Then

    xi − x̄ = (xi − μ) − (x̄ − μ)

and

    s² = (1/n) [ Σ_{i=1}^n (xi − μ)² − 2(x̄ − μ) Σ_{i=1}^n (xi − μ) + n(x̄ − μ)² ].

Thus

    s² = (1/n) Σ_{i=1}^n (xi − μ)² − (x̄ − μ)².

The joint density is given by

    f_θ(x1, ..., xn) = (2πσ²)^{−n/2} e^{−ns²/2σ²} e^{−n(x̄ − μ)²/2σ²}.

If μ and σ² are both unknown then (X̄, S²) is sufficient (take h(x) = 1). If σ² is known
then we can take h(x) = (2πσ²)^{−n/2} e^{−ns²/2σ²}, θ = μ, and X̄ is sufficient. If μ is known
then (h(x) = 1) θ = σ² and Σ_{i=1}^n (Xi − μ)² is sufficient.

Problems
In Problems 1-6, show that the given statistic u(X) = u(X1, ..., Xn) is sufficient for θ
and find appropriate functions g and h for the factorization theorem to apply.
1. The Xi are Poisson (θ) and u(X) = X1 + ··· + Xn.
2. The Xi have density A(θ)B(xi), 0 < xi < θ (and 0 elsewhere), where θ is a positive real
   number; u(X) = max Xi. As a special case, the Xi are uniformly distributed between
   0 and θ, and A(θ) = 1/θ, B(xi) = 1 on (0, θ).
3. The Xi are geometric with parameter θ, i.e., if θ is the probability of success on a given
   Bernoulli trial, then P{Xi = x} = (1 − θ)^x θ is the probability that there will be x
   failures followed by the first success; u(X) = Σ_{i=1}^n Xi.
4. The Xi have the exponential density (1/θ)e^{−x/θ}, x > 0, and u(X) = Σ_{i=1}^n Xi.
5. The Xi have the beta density with parameters a = θ and b = 2, and u(X) = Π_{i=1}^n Xi.
6. The Xi have the gamma density with parameters α = θ and β, an arbitrary positive
   number, and u(X) = Π_{i=1}^n Xi.
7. Show that the result in (14.2), that statistician B can do at least as well as statistician
   A, holds in the general case of arbitrary iid random variables Xi.

Lecture 15. Rao-Blackwell Theorem

15.1 Background From Basic Probability

To better understand the steps leading to the Rao-Blackwell theorem, consider a typical
two stage experiment:
Step 1. Observe a random variable X with density (1/2)x² e^{−x}, x > 0.
Step 2. If X = x, let Y be uniformly distributed on (0, x).
Find E(Y).
Method 1, via the joint density:

    f(x, y) = f_X(x) f_Y(y|x) = (1/2) x² e^{−x} (1/x) = (1/2) x e^{−x},  0 < y < x.

In general, E[g(X, Y)] = ∫∫ g(x, y) f(x, y) dx dy. In this case, g(x, y) = y and

    E(Y) = ∫_{x=0}^∞ ∫_{y=0}^x y (1/2) x e^{−x} dy dx = ∫_0^∞ (x³/4) e^{−x} dx = 3!/4 = 3/2.

Method 2, via the theorem of total expectation:

    E(Y) = ∫ f_X(x) E(Y | X = x) dx.

Method 2 works well when the conditional expectation is easy to compute. In this case
it is x/2 by inspection. Thus

    E(Y) = ∫_0^∞ (1/2) x² e^{−x} (x/2) dx = 3/2,  as before.

15.2 Comment On Notation

If, for example, it turns out that E(Y | X = x) = x² + 3x + 4, we can write E(Y | X) =
X² + 3X + 4. Thus E(Y | X) is a function g(X) of the random variable X. When X = x
we have g(x) = E(Y | X = x).
We now proceed to the Rao-Blackwell theorem via several preliminary lemmas.

15.3 Lemma

E[E(X2 | X1)] = E(X2).
Proof. Let g(X1) = E(X2 | X1). Then

    E[g(X1)] = ∫ g(x) f1(x) dx = ∫ E(X2 | X1 = x) f1(x) dx = E(X2)

by the theorem of total expectation.

15.4 Lemma

If μi = E(Xi), i = 1, 2, then

    E[{X2 − E(X2 | X1)}{E(X2 | X1) − μ2}] = 0.

Proof. The expectation is

    ∫∫ [x2 − E(X2 | X1 = x1)][E(X2 | X1 = x1) − μ2] f1(x1) f2(x2 | x1) dx1 dx2
      = ∫ f1(x1)[E(X2 | X1 = x1) − μ2] { ∫ [x2 − E(X2 | X1 = x1)] f2(x2 | x1) dx2 } dx1.

The inner integral (with respect to x2) is E(X2 | X1 = x1) − E(X2 | X1 = x1) = 0, and the
result follows.

15.5 Lemma

Var X2 ≥ Var[E(X2 | X1)].
Proof. We have

    Var X2 = E[(X2 − μ2)²] = E[ ({X2 − E(X2 | X1)} + {E(X2 | X1) − μ2})² ]
      = E[{X2 − E(X2 | X1)}²] + E[{E(X2 | X1) − μ2}²]    by (15.4)
      ≥ E[{E(X2 | X1) − μ2}²]

since both terms are nonnegative.
But by (15.3), E[E(X2 | X1)] = E(X2) = μ2, so the above term is the variance of
E(X2 | X1).

15.6 Lemma

Equality holds in (15.5) if and only if X2 is a function of X1.
Proof. The argument of (15.5) shows that equality holds iff E[{X2 − E(X2 | X1)}²] = 0,
in other words, X2 = E(X2 | X1). This implies that X2 is a function of X1. Conversely, if
X2 = h(X1), then E(X2 | X1) = h(X1) = X2, and therefore equality holds.

15.7 Rao-Blackwell Theorem

Let X1, ..., Xn be iid, each with density f_θ(x). Let Y1 = u1(X1, ..., Xn) be a sufficient statistic for θ, and let Y2 = u2(X1, ..., Xn) be an unbiased estimate of θ [or more
generally, of a function of θ, say r(θ)]. Then
(a) Var[E(Y2 | Y1)] ≤ Var Y2, with strict inequality unless Y2 is a function of Y1 alone.
(b) E[E(Y2 | Y1)] = θ [or more generally, r(θ)].
Thus in searching for a minimum variance unbiased estimate of θ [or more generally, of
r(θ)], we may restrict ourselves to functions of the sufficient statistic Y1.
Proof. Part (a) follows from (15.5) and (15.6), and (b) follows from (15.3).

15.8 Theorem

Let Y1 = u1(X1, ..., Xn) be a sufficient statistic for θ. If the maximum likelihood estimate θ̂
of θ is unique, then θ̂ is a function of Y1.
Proof. The joint density of the Xi can be factored as

    f_θ(x1, ..., xn) = g(θ, z) h(x1, ..., xn)

where z = u1(x1, ..., xn). Let θ0 maximize g(θ, z). Given z, we find θ0 by looking
at all g(θ, z), so that θ0 is a function of u1(X1, ..., Xn) = Y1. But θ0 also maximizes
f_θ(x1, ..., xn), so by uniqueness, θ̂ = θ0.
In Lectures 15-17, we are developing methods for finding uniformly minimum variance
unbiased estimates. Exercises will be deferred until Lecture 17.

Lecture 16. Lehmann-Scheffé Theorem

16.1 Definition

Suppose that Y is a sufficient statistic for θ. We say that Y is complete if there are no
nontrivial unbiased estimates of 0 based on Y, i.e., if E_θ[g(Y)] = 0 for all θ, then
P_θ{g(Y) = 0} = 1 for all θ. Thus if we have two unbiased estimates of θ based on Y, say
φ1(Y) and φ2(Y), then E_θ[φ1(Y) − φ2(Y)] = 0 for all θ, so that regardless of θ, φ1(Y) and
φ2(Y) coincide (with probability 1). So if we find one unbiased estimate of θ based on Y,
we have essentially found all of them.

16.2 Theorem (Lehmann-Scheffé)

Suppose that Y1 = u1(X1, ..., Xn) is a complete sufficient statistic for θ. If φ(Y1) is
an unbiased estimate of θ based on Y1, then among all possible unbiased estimates of θ
(whether based on Y1 or not), φ(Y1) has minimum variance. We say that φ(Y1) is
a uniformly minimum variance unbiased estimate of θ, abbreviated UMVUE. The term
"uniformly" is used because the result holds for all possible values of θ.
Proof. By Rao-Blackwell, if Y2 is any unbiased estimate of θ, then E[Y2 | Y1] is an unbiased
estimate of θ with Var[E(Y2 | Y1)] ≤ Var Y2. But E(Y2 | Y1) is a function of Y1, so by
completeness it must coincide with φ(Y1). Thus regardless of the particular value of θ,
Var_θ[φ(Y1)] ≤ Var_θ(Y2).
Note that just as in the Rao-Blackwell theorem, the Lehmann-Scheffé result holds
equally well if we are seeking a UMVUE of a function of θ. Thus we look for an unbiased
estimate of r(θ) based on the complete sufficient statistic Y1.

16.3 Definition and Remarks

There are many situations in which complete sufficient statistics can be found quickly.
The exponential class (or exponential family) consists of densities of the form

    f_θ(x) = a(θ) b(x) exp[ Σ_{j=1}^m pj(θ) Kj(x) ]

where a(θ) > 0, b(x) > 0, γ < x < δ, θ = (θ1, ..., θk) with αj < θj < βj, 1 ≤ j ≤ k
(γ, δ, αj, βj are constants).
There are certain regularity conditions that are assumed, but they will always be
satisfied in the examples we consider, so we will omit the details. In all our examples, k
and m will be equal. This is needed in the proof of completeness of the statistic to be
discussed in Lecture 17. (It is not needed for sufficiency.)

16.4 Examples

1. Binomial (n, θ), where n is known. We have f_θ(x) = (n choose x) θ^x (1 − θ)^{n−x}, x = 0, 1, ..., n,
where 0 < θ < 1. Take a(θ) = (1 − θ)^n, b(x) = (n choose x), p1(θ) = ln θ − ln(1 − θ), K1(x) = x.
Note that k = m = 1.

2. Poisson (θ). The probability function is f_θ(x) = e^{−θ} θ^x / x!, x = 0, 1, ..., where θ > 0.
We can take a(θ) = e^{−θ}, b(x) = 1/x!, p1(θ) = ln θ, K1(x) = x, and k = m = 1.

3. Normal (μ, σ²). The density is

    f_θ(x) = [1/(σ√(2π))] exp[−(x − μ)²/2σ²],  −∞ < x < ∞,  θ = (μ, σ²).

Take a(θ) = [1/(σ√(2π))] exp[−μ²/2σ²], b(x) = 1, p1(θ) = −1/2σ², K1(x) = x², p2(θ) =
μ/σ², K2(x) = x, and k = m = 2.

4. Gamma (α, β). The density is x^{α−1} e^{−x/β} / [Γ(α) β^α], x > 0, θ = (α, β). Take a(θ) =
1/[Γ(α) β^α], b(x) = 1, p1(θ) = α − 1, K1(x) = ln x, p2(θ) = −1/β, K2(x) = x,
and k = m = 2.

5. Beta (a, b). The density is [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1}(1 − x)^{b−1}, 0 < x < 1, θ = (a, b).
Take a(θ) = Γ(a + b)/(Γ(a)Γ(b)), b(x) = 1, p1(θ) = a − 1, K1(x) = ln x, p2(θ) =
b − 1, K2(x) = ln(1 − x), and k = m = 2.

6. Negative Binomial.
First we derive some properties of this distribution. In a possibly infinite sequence of
Bernoulli trials, let Yr be the number of trials required to obtain the r-th success (assume
r is a known positive integer). Then P{Y1 = k} is the probability of k − 1 failures followed
by a success, which is q^{k−1} p, where q = 1 − p and k = 1, 2, .... The moment-generating
function of Y1 is

    M_{Y1}(t) = E[e^{tY1}] = Σ_{k=1}^∞ q^{k−1} p e^{tk}.

Write e^{tk} as e^{t(k−1)} e^t. We get

    M_{Y1}(t) = p e^t [1 + q e^t + (q e^t)² + ···] = p e^t / (1 − q e^t),  |q e^t| < 1.

The random variable Y1 is said to have the geometric distribution. (The slightly different
random variable appearing in Problem 3 of Lecture 14 is also frequently referred to as
geometric.) Now Yr (the negative binomial random variable) is the sum of r independent
random variables, each geometric, so

    M_{Yr}(t) = [p e^t / (1 − q e^t)]^r.

The event {Yr = k} occurs iff there are r − 1 successes in the first k − 1 trials, followed
by a success on trial k. Therefore

    P{Yr = k} = (k − 1 choose r − 1) p^{r−1} q^{k−r} p,  k = r, r + 1, r + 2, ....

We can calculate the mean and variance of Yr from the moment-generating function,
but the differentiation is not quite as messy if we introduce another random variable.
Let Xr be the number of failures preceding the r-th success. Then Xr plus the number
of successes preceding the r-th success is the total number of trials preceding the r-th
success. Thus

    Xr + (r − 1) = Yr − 1,   so   Xr = Yr − r

and

    M_{Xr}(t) = e^{−rt} M_{Yr}(t) = [p / (1 − q e^t)]^r.

When r = 1 we have

    M_{X1}(t) = p / (1 − q e^t),   E(X1) = [ p q e^t / (1 − q e^t)² ]_{t=0} = q/p.

Since Y1 = X1 + 1 we have E(Y1) = 1 + (q/p) = 1/p and E(Yr) = r/p. Differentiating
the moment-generating function of X1 again, we find that

    E(X1²) = [(1 − q)² pq + pq² · 2(1 − q)] / (1 − q)^4 = pq(1 − q)[1 − q + 2q] / (1 − q)^4 = pq(1 + q)/p³ = q(1 + q)/p².

Thus Var X1 = Var Y1 = [q(1 + q)/p²] − [q²/p²] = q/p², hence Var Yr = rq/p².
Now to show that the negative binomial distribution belongs to the exponential class:

    P_θ{Yr = x} = (x − 1 choose r − 1) θ^r (1 − θ)^{x−r},  x = r, r + 1, r + 2, ...,  θ = p.

Take

    a(θ) = [θ/(1 − θ)]^r,  b(x) = (x − 1 choose r − 1),  p1(θ) = ln(1 − θ),  K1(x) = x,  k = m = 1.

Here is the reason for the terminology "negative binomial":

    M_{Yr}(t) = [p e^t / (1 − q e^t)]^r = p^r e^{rt} (1 − q e^t)^{−r}.

To expand the moment-generating function, we use the binomial theorem with a negative
exponent:

    (1 + x)^{−r} = Σ_{k=0}^∞ (−r choose k) x^k,   where   (−r choose k) = (−r)(−r − 1)···(−r − k + 1)/k!.

Problems are deferred to Lecture 17.

Lecture 17. Complete Sufficient Statistics For The Exponential Class

17.1 Deriving the Complete Sufficient Statistic

The density of a member of the exponential class is

    f_θ(x) = a(θ) b(x) exp[ Σ_{j=1}^m pj(θ) Kj(x) ]

so the joint density of n independent observations is

    f_θ(x1, ..., xn) = (a(θ))^n [ Π_{i=1}^n b(xi) ] exp[ Σ_{j=1}^m pj(θ) Kj(x1) ] ··· exp[ Σ_{j=1}^m pj(θ) Kj(xn) ].

Since e^r e^s e^t = e^{r+s+t}, it follows that pj(θ) appears in the exponent multiplied by the
factor Kj(x1) + Kj(x2) + ··· + Kj(xn), so by the factorization theorem,

    ( Σ_{i=1}^n K1(xi), ..., Σ_{i=1}^n Km(xi) )

is sufficient for θ. This statistic is also complete. First consider m = 1:

    f_θ(x1, ..., xn) = (a(θ))^n [ Π_{i=1}^n b(xi) ] exp[ p(θ) Σ_{i=1}^n K(xi) ].

Let Y1 = Σ_{i=1}^n K(Xi); then E_θ[g(Y1)] is given by

    ∫ ··· ∫ g( Σ_{i=1}^n K(xi) ) f_θ(x1, ..., xn) dx1 ··· dxn.

If E_θ[g(Y1)] = 0 for all θ, then for all θ, g(Y1) = 0 with probability 1.
What we have here is analogous to a result from Laplace or Fourier transform theory:
if for all t between a and b we have

    ∫_{−∞}^∞ g(y) e^{ty} dy = 0

then g ≡ 0. It is also analogous to the result that the moment-generating function
determines the density uniquely.
When m > 1, the exponent in the formula for f_θ(x1, ..., xn) becomes

    p1(θ) Σ_{i=1}^n K1(xi) + ··· + pm(θ) Σ_{i=1}^n Km(xi)

and the argument is essentially the same as in the one-dimensional case. The transform
result is as follows. If

    ∫ ··· ∫ exp[t1 y1 + ··· + tm ym] g(y1, ..., ym) dy1 ··· dym = 0

when ai < ti < bi, i = 1, ..., m, then g ≡ 0. The above integral defines a joint moment-generating
function, which will appear again in connection with the multivariate normal
distribution.

17.2 Example

Let X1, ..., Xn be iid, each normal (μ, σ²), where σ² is known. The normal distribution belongs to the exponential class (see (16.4), Example 3), but in this case
the term exp[−x²/2σ²] can be absorbed in b(x), so only K2(x) = x is relevant. Thus Σ_{i=1}^n Xi,
or equivalently X̄, is sufficient (as found in Lecture 14) and complete. Since E(X̄) = μ, it
follows that X̄ is a UMVUE of μ.
Let's find a UMVUE of μ². The natural conjecture that it is (X̄)² is not quite correct.
Since X̄ = (1/n) Σ_{i=1}^n Xi, we have Var X̄ = σ²/n. Thus

    σ²/n = E[(X̄)²] − (E X̄)² = E[(X̄)²] − μ²,

hence

    E[ (X̄)² − σ²/n ] = μ²

and we have an unbiased estimate of μ² based on the complete sufficient statistic X̄.
Therefore (X̄)² − [σ²/n] is a UMVUE of μ².

17.3 A Cautionary Tale

Restricting to unbiased estimates is not always a good idea. Let X be Poisson (λ), and
take n = 1, i.e., only one observation is made. From (16.4), Example 2, X is a complete
sufficient statistic for λ. Now

    E[(−1)^X] = Σ_{k=0}^∞ (−1)^k e^{−λ} λ^k / k! = e^{−λ} Σ_{k=0}^∞ (−λ)^k / k! = e^{−λ} e^{−λ} = e^{−2λ},

thus (−1)^X is a UMVUE of e^{−2λ}. But Y ≡ 1 is certainly a better estimate, since 1 is
closer to e^{−2λ} than is −1. Estimating a positive quantity e^{−2λ} by a random variable which
can be negative is not sensible.
Note also that the maximum likelihood estimate of λ is X (Lecture 9, Problem 1a), so
the MLE of e^{−2λ} is e^{−2X}, which looks better than Y.
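A quick simulation sketch of the unbiasedness claim in (17.2) (NumPy assumed; the variable names are ours):

    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma, n, reps = 2.0, 1.5, 10, 200_000
    x = rng.normal(mu, sigma, size=(reps, n))
    xbar = x.mean(axis=1)
    umvue = xbar**2 - sigma**2 / n      # the UMVUE of mu^2 from (17.2)
    print(umvue.mean(), mu**2)          # both close to 4.0
    print((xbar**2).mean(), mu**2)      # (xbar)^2 alone is biased upward by sigma^2/n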

Problems
1. Let X be a random variable that has zero mean for all possible values of θ. For
   example, X can be uniformly distributed between −θ and θ, or normal with mean 0
   and variance θ. Give an example of a sufficient statistic for θ that is not complete.
2. Let f_θ(x) = exp[−(x − θ)], θ < x < ∞, and 0 elsewhere. Show that the first order
   statistic Y1 = min Xi is a complete sufficient statistic for θ, and find a UMVUE of θ.
3. Let f_θ(x) = θ x^{θ−1}, 0 < x < 1, where θ > 0. Show that u(X1, ..., Xn) = ( Π_{i=1}^n Xi )^{1/n}
   is a complete sufficient statistic for θ, and that the maximum likelihood estimate θ̂ is
   a function of u(X1, ..., Xn).
4. The density f_θ(x) = θ² x exp[−θx], x > 0, where θ > 0, belongs to the exponential class,
   and Y = Σ_{i=1}^n Xi is a complete sufficient statistic for θ. Compute the expectation of
   1/Y under θ, and from the result find the UMVUE of θ.
5. Let Y1 be binomial (n, θ), so that Y1 = Σ_{i=1}^n Xi, where Xi is the indicator of a success
   on trial i. [Thus each Xi is binomial (1, θ).] By Example 1 of (16.4), the Xi, as well
   as Y1, belong to the exponential class, and Y1 is a complete sufficient statistic for θ.
   Since E(Y1) = nθ, Y1/n is a UMVUE of θ.
   Let Y2 = (X1 + X2)/2. In an effortless manner, find E(Y2 | Y1).
6. Let X be normal with mean 0 and variance θ, so that by Example 3 of (16.4), Y =
   Σ_{i=1}^n Xi² is a complete sufficient statistic for θ. Find the distribution of Y/θ, and from
   this find the UMVUE of θ².
7. Let X1, ..., Xn be iid, each Poisson (θ), where θ > 0. (Then Y = Σ_{i=1}^n Xi is a
   complete sufficient statistic for θ.) Let I be the indicator of {X1 ≤ 1}.
   (a) Show that E(I | Y) is the UMVUE of P{X1 ≤ 1} = (1 + θ) exp(−θ). Thus we need
   to evaluate P{X1 = 0 | Y = y} + P{X1 = 1 | Y = y}. When y = 0, the first term is 1
   and the second term is 0.
   (b) Show that if y > 0, the conditional distribution of X1 (or equally well, of any Xi)
   is binomial (y, 1/n).
   (c) Show that

       E(I | Y) = [(n − 1)/n]^Y [1 + Y/(n − 1)].

8. Let θ = (θ1, θ2) and f_θ(x) = (1/θ2) exp[−(x − θ1)/θ2], x > θ1 (and 0 elsewhere), where θ1
   is an arbitrary real number and θ2 > 0. Show that the statistic (min_i Xi, Σ_{i=1}^n Xi) is
   sufficient for θ.

Lecture 18. Bayes Estimates

18.1 Basic Assumptions

Suppose we are trying to estimate the state of nature θ. We observe X = x, where X has
density f_θ(x), and make decision δ(x) = our estimate of θ when x is observed. We incur
a loss L(θ, δ(x)), assumed nonnegative. We now assume that θ is random with density
h(θ). The Bayes solution minimizes the Bayes risk or average loss

    B(δ) = ∫∫ h(θ) f_θ(x) L(θ, δ(x)) dθ dx.

Note that h(θ)f_θ(x) = h(θ)f(x | θ) is the joint density of θ and x, which can also be
expressed as f(x)f(θ | x). Thus

    B(δ) = ∫ f(x) [ ∫ L(θ, δ(x)) f(θ | x) dθ ] dx.

Since f(x) is nonnegative, it is sufficient to minimize ∫ L(θ, δ(x)) f(θ | x) dθ for each x.
The resulting δ is called the Bayes estimate of θ. Similarly, to estimate a function of θ,
say γ(θ), we minimize ∫ L(γ(θ), δ(x)) f(θ | x) dθ.
We can jettison a lot of terminology by recognizing that our problem is to observe
a random variable X and estimate a random variable Y by g(X). We must minimize
E[L(Y, g(X))].

18.2 Quadratic Loss Function

We now assume that L(Y, g(X)) = (Y − g(X))². By the theorem of total expectation,

    E[(Y − g(X))²] = ∫ E[(Y − g(X))² | X = x] f(x) dx

and as above, it suffices to minimize the quantity in brackets for each x. If we let z = g(x),
we are minimizing z² − 2E(Y | X = x)z + E(Y² | X = x) by choice of z. Now Az² − 2Bz + C
is a minimum when z = B/A = E(Y | X = x)/1, and we conclude that

    E[(Y − g(X))²] is minimized when g(x) = E(Y | X = x).

What we are doing here is minimizing E[(W − c)²] = c² − 2E(W)c + E(W²) by our choice
of c, and the minimum occurs when c = E(W).

18.3 A Different Loss Function

Suppose that we want to minimize E(|W − c|). We have

    E(|W − c|) = ∫_{−∞}^c (c − w) f(w) dw + ∫_c^∞ (w − c) f(w) dw
      = c ∫_{−∞}^c f(w) dw − ∫_{−∞}^c w f(w) dw + ∫_c^∞ w f(w) dw − c ∫_c^∞ f(w) dw.

Differentiating with respect to c, we get

    c f(c) + ∫_{−∞}^c f(w) dw − c f(c) − c f(c) + c f(c) − ∫_c^∞ f(w) dw

which is 0 when ∫_{−∞}^c f(w) dw = ∫_c^∞ f(w) dw, in other words when c is a median of W.
Thus E(|Y − g(X)|) is minimized when g(x) is a median of the conditional distribution
of Y given X = x.

18.4 Back To Quadratic Loss

In the statistical decision problem with quadratic loss, the Bayes estimate is

    δ(x) = E[θ | X = x] = ∫ θ f(θ | x) dθ

and

    f(θ | x) = f(θ, x)/f(x) = h(θ) f(x | θ) / f(x).

Thus

    δ(x) = ∫ θ h(θ) f_θ(x) dθ / ∫ h(θ) f_θ(x) dθ.

If we are estimating a function of θ, say γ(θ), replace θ by γ(θ) in the integral in the
numerator.
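The ratio of integrals in (18.4) can be evaluated numerically; here is a hedged sketch (SciPy assumed; bayes_estimate, the flat prior, and the exponential likelihood are our illustrative choices, not taken from the notes):

    import numpy as np
    from scipy.integrate import quad

    def bayes_estimate(x, f_x_given_theta, prior, lo, hi):
        # Posterior-mean (quadratic-loss) Bayes estimate of theta, via the two
        # integrals of (18.4); lo, hi bound the prior's support.
        num = quad(lambda th: th * prior(th) * f_x_given_theta(x, th), lo, hi)[0]
        den = quad(lambda th: prior(th) * f_x_given_theta(x, th), lo, hi)[0]
        return num / den

    # One exponential observation with rate theta and a flat prior on (0, 10)
    like = lambda x, th: th * np.exp(-th * x)
    print(bayes_estimate(2.0, like, lambda th: 0.1, 0.0, 10.0))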

Problems
1. Let X be binomial (n, θ), and let the density of θ be

       h(θ) = θ^{r−1}(1 − θ)^{s−1} / B(r, s)    [beta(r, s)].

   Show that the Bayes estimate with quadratic loss is

       δ(x) = (r + x)/(r + s + n),   x = 0, 1, ..., n.

2. For this estimate, show that the risk function R_δ(θ), defined as the average loss using
   δ when the parameter is θ, is

       [ ((r + s)² − n)θ² + (n − 2r(r + s))θ + r² ] / (r + s + n)².

3. Show that if r = s = √n/2, then R_δ(θ) is a constant, independent of θ.
4. Show that a Bayes estimate δ with constant risk (as in Problem 3) is minimax, that
   is, δ minimizes max_θ R_δ(θ).

Lecture 19. Linear Algebra Review

19.1 Introduction

We will assume for the moment that matrices have complex numbers as entries, but the
complex numbers will soon disappear. If A is a matrix, the conjugate transpose of A will
be denoted by A*. Thus if

    A = | a+bi  c+di |        then    A* = | a−bi  e−fi |
        | e+fi  g+hi |                     | c−di  g−hi |.

The transpose is

    A' = | a+bi  e+fi |
         | c+di  g+hi |.

Vectors X, Y, etc., will be regarded as column vectors. The inner product (dot product)
of n-vectors X and Y is

    <X, Y> = x1 ȳ1 + ··· + xn ȳn

where the overbar indicates complex conjugate. Thus <X, Y> = Y*X. If c is any
complex number, then <cX, Y> = c<X, Y> and <X, cY> = c̄<X, Y>. The
vectors X and Y are said to be orthogonal (perpendicular) if <X, Y> = 0. For an
arbitrary n by n matrix B,

    <BX, Y> = <X, B*Y>

because <X, B*Y> = (B*Y)*X = Y*B**X = Y*BX = <BX, Y>.
Our interest is in real symmetric matrices, and "symmetric" will always mean real
symmetric. If A is symmetric then

    <AX, Y> = <X, A*Y> = <X, AY>.

The eigenvalue problem is AX = λX, or (A − λI)X = 0, where I is the identity matrix,
i.e., the matrix with 1's down the main diagonal and 0's elsewhere. A nontrivial solution
(X ≠ 0) exists iff det(A − λI) = 0. In this case, λ is called an eigenvalue of A and a
nonzero solution X is called an eigenvector.

19.2 Theorem

If A is symmetric then A has real eigenvalues.
Proof. Suppose AX = λX with X ≠ 0. Then <AX, Y> = <X, AY> with Y = X gives
<λX, X> = <X, λX>, so (λ − λ̄)<X, X> = 0. But <X, X> = Σ_{i=1}^n |xi|² ≠ 0, and
therefore λ = λ̄, so λ is real.
The important conclusion is that for a symmetric matrix, the eigenvalue problem can
be solved using only real numbers.

19.3 Theorem

If A is symmetric, then eigenvectors of distinct eigenvalues are orthogonal.
Proof. Suppose AX1 = λ1X1 and AX2 = λ2X2. Then <AX1, X2> = <X1, AX2>, so
<λ1X1, X2> = <X1, λ2X2>. Since λ2 is real we have (λ1 − λ2)<X1, X2> = 0. But
we are assuming that we have two distinct eigenvalues, so that λ1 ≠ λ2. Therefore we
must have <X1, X2> = 0.

19.4 Orthogonal Decomposition Of Symmetric Matrices

Assume A symmetric with distinct eigenvalues λ1, ..., λn. The assumption that the λi
are distinct means that the equation det(A − λI) = 0, a polynomial equation in λ of
degree n, has no repeated roots. This assumption is actually unnecessary, but it makes
the analysis much easier.
Let AXi = λiXi with Xi ≠ 0, i = 1, ..., n. Normalize the eigenvectors so that ‖Xi‖,
the length of Xi, is 1 for all i. (The length of the vector x = (x1, ..., xn) is

    ‖x‖ = ( Σ_{i=1}^n |xi|² )^{1/2},

hence ‖x‖² = <x, x>.) Thus we have AL = LD, where

    L = [X1 | X2 | ··· | Xn]   and   D = diag(λ1, ..., λn).

To verify this, note that multiplying L on the right by a diagonal matrix with entries
λ1, ..., λn multiplies column i of L (namely Xi) by λi. (Multiplying on the left by D
would multiply row i by λi.) Therefore

    LD = [λ1X1 | λ2X2 | ··· | λnXn] = AL.

The columns of the square matrix L are mutually perpendicular unit vectors; such a
matrix is said to be orthogonal. The transpose of L can be pictured as the matrix whose
rows are X1', X2', ..., Xn'.
Consequently L'L = I. Since L is nonsingular (det I = 1 = det L' det L), L has an inverse,
which must be L'. To see this, multiply the equation L'L = I on the right by L^{−1} to get
L'I = L^{−1}, i.e., L' = L^{−1}. Thus LL' = I.
Since a matrix and its transpose have the same determinant, (det L)² = 1, so the
determinant of L is ±1.
Finally, from AL = LD we get

    L'AL = D.

We have shown that every symmetric matrix (with distinct eigenvalues) can be orthogonally diagonalized.

19.5 Application To Quadratic Forms

Consider a quadratic form

    X'AX = Σ_{i,j=1}^n aij xi xj.

If we change variables by X = LY, then

    X'AX = Y'L'ALY = Y'DY = Σ_{i=1}^n λi yi².

The symmetric matrix A is said to be nonnegative definite if X'AX ≥ 0 for all X.
Equivalently, Σ_{i=1}^n λi yi² ≥ 0 for all Y. Set yi = 1, yj = 0 for all j ≠ i to conclude that A
is nonnegative definite if and only if all eigenvalues of A are nonnegative. The symmetric
matrix A is said to be positive definite if X'AX > 0 except when all xi = 0. Equivalently,
all eigenvalues of A are strictly positive.

19.6 Example

Consider the quadratic form

    q = 3x² + 2xy + 3y² = (x, y) | 3  1 | (x)
                                 | 1  3 | (y).

Then

    A = | 3  1 |,    det(A − λI) = | 3−λ   1  | = λ² − 6λ + 8 = 0
        | 1  3 |                   |  1   3−λ |

and the eigenvalues are λ = 2 and λ = 4. When λ = 2, the equation A(x, y)' = λ(x, y)'
reduces to x + y = 0. Thus (1, −1)' is an eigenvector. Normalize it to get (1/√2, −1/√2)'.
When λ = 4 we get −x + y = 0 and the normalized eigenvector is (1/√2, 1/√2)'. Consequently,

    L = |  1/√2   1/√2 |
        | −1/√2   1/√2 |

and a direct matrix computation yields

    L'AL = | 2  0 | = D
           | 0  4 |

as expected.
If (x, y)' = L(v, w)', i.e., x = (1/√2)v + (1/√2)w, y = −(1/√2)v + (1/√2)w, then

    q = 3( v²/2 + vw + w²/2 ) + (w² − v²) + 3( v²/2 − vw + w²/2 ).

Thus q = 2v² + 4w² = (v, w)D(v, w)', as expected.
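A quick numerical check of this example (NumPy assumed); note that numpy.linalg.eigh may order the eigenvectors or choose their signs differently than the hand computation above:

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 3.0]])
    eigvals, L = np.linalg.eigh(A)    # orthonormal eigenvectors in the columns of L
    print(eigvals)                    # [2. 4.]
    print(L.T @ A @ L)                # the diagonal matrix D, up to rounding
    print(np.round(L.T @ L))          # identity: L is orthogonal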


Lecture 20. Correlation

20.1 Definitions and Comments

Let X and Y be random variables with finite mean and variance. Denote the mean of X
by μ1 and the mean of Y by μ2, and let σ1² = Var X and σ2² = Var Y. Note that E(XY)
must be finite also, because −X² − Y² ≤ 2XY ≤ X² + Y². The covariance of X and Y
is defined by

    Cov(X, Y) = E[(X − μ1)(Y − μ2)]

and it follows that

    Cov(X, Y) = E(XY) − μ1 E(Y) − μ2 E(X) + μ1μ2 = E(XY) − E(X)E(Y).

Thus Cov(X, Y) = Cov(Y, X). Since expectation is linear, we have Cov(aX, bY) =
ab Cov(X, Y), Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z), Cov(X + Y, Z) = Cov(X, Z) +
Cov(Y, Z), and Cov(X + a, Y + b) = Cov(X, Y). Also, Cov(X, X) = E(X²) − (EX)² =
Var X.
The correlation coefficient is a normalized covariance:

    ρ = Cov(X, Y)/(σ1σ2).

The correlation coefficient is a measure of linear dependence between X and Y. To see
this, estimate Y by aX + b, equivalently (to simplify the calculation) estimate Y − μ2 by
c(X − μ1) + d, choosing c and d to minimize

    E[(Y − μ2 − (c(X − μ1) + d))²] = σ2² − 2c Cov(X, Y) + c²σ1² + d².

Note that E[2cd(X − μ1)] = 0 since E(X) = μ1, and similarly E[2d(Y − μ2)] = 0. We
can't do any better than to take d = 0, so we need to minimize σ2² − 2cρσ1σ2 + c²σ1² by
choice of c. Differentiating with respect to c, we have −2ρσ1σ2 + 2cσ1², hence

    c = ρσ2/σ1.

The minimum expectation is

    σ2² − 2(ρσ2/σ1)ρσ1σ2 + (ρ²σ2²/σ1²)σ1² = σ2²(1 − ρ²).

The expectation of a nonnegative random variable is nonnegative, so

    −1 ≤ ρ ≤ 1.

For a fixed σ2, the closer |ρ| is to 1, the better the estimate of Y by aX + b. If |ρ| = 1
then the minimum expectation is 0, so (with probability 1)

    Y − μ2 = c(X − μ1) = ±(σ2/σ1)(X − μ1),   with ρ = ±1.

20.2 Theorem

If X and Y are independent then X and Y are uncorrelated (ρ = 0), but not conversely.
Proof. Assume X and Y are independent. Then

    E[(X − μ1)(Y − μ2)] = E(X − μ1)E(Y − μ2) = 0.

For the counterexample to the converse, let X = cos θ, Y = sin θ, where θ is uniformly
distributed on (0, 2π). Then

    E(X) = (1/2π) ∫_0^{2π} cos θ dθ = 0,    E(Y) = (1/2π) ∫_0^{2π} sin θ dθ = 0,

and

    E(XY) = E[(1/2) sin 2θ] = (1/4π) ∫_0^{2π} sin 2θ dθ = 0,

so ρ = 0. But X² + Y² = 1, so X and Y are not independent.

20.3 The Cauchy-Schwarz Inequality

This result, namely

    |E(XY)|² ≤ E(X²)E(Y²),

is closely related to −1 ≤ ρ ≤ 1. Indeed, if we replace X by X − μ1 and Y by Y − μ2,
the inequality says that [Cov(X, Y)]² ≤ σ1²σ2², i.e., (ρσ1σ2)² ≤ σ1²σ2², which gives ρ² ≤ 1.
Thus Cauchy-Schwarz implies −1 ≤ ρ ≤ 1.
Proof. Let h(λ) = E[(λX + Y)²] = λ²E(X²) + 2λE(XY) + E(Y²). Since h(λ) ≥ 0 for
all λ, the quadratic equation h(λ) = 0 has no real roots or at worst a real repeated root.
Therefore the discriminant is negative or at worst 0. Thus [2E(XY)]² − 4E(X²)E(Y²) ≤ 0,
and the result follows.
As a special case, let P{X = xi} = 1/n, 1 ≤ i ≤ n. If X = xi, take Y = yi. (The xi
and yi are arbitrary real numbers.) Then the Cauchy-Schwarz inequality becomes

    ( Σ_{i=1}^n xi yi )² ≤ ( Σ_{i=1}^n xi² )( Σ_{i=1}^n yi² ).

(There will be a factor of 1/n² on each side of the inequality, which will cancel.) This is the
result originally proved by Cauchy. Schwarz proved the analogous formula for integrals:

    ( ∫_a^b f(x)g(x) dx )² ≤ ∫_a^b [f(x)]² dx ∫_a^b [g(x)]² dx.

Since an integral can be regarded as the limit of a sum, the integral result can be proved
from the result for sums.
We know that if X1, ..., Xn are independent, then the variance of the sum of the Xi
is the sum of the variances. If we drop the assumption of independence, we can still say
something.

20.4 Theorem

Let X1, ..., Xn be arbitrary random variables (with finite mean and variance). Then

    Var(X1 + ··· + Xn) = Σ_{i=1}^n Var Xi + 2 Σ_{i<j} Cov(Xi, Xj).

For example, the variance of X1 + X2 + X3 + X4 is

    Σ_{i=1}^4 Var Xi + 2[Cov(X1, X2) + Cov(X1, X3) + Cov(X1, X4)
                         + Cov(X2, X3) + Cov(X2, X4) + Cov(X3, X4)].

Proof. We have

    E[ ((X1 − μ1) + ··· + (Xn − μn))² ] = E[ Σ_{i=1}^n (Xi − μi)² ] + 2E[ Σ_{i<j} (Xi − μi)(Xj − μj) ],

as asserted.
The reason for the i < j restriction in the summation can be seen from an expansion
such as

    (x + y + z)² = x² + y² + z² + 2xy + 2xz + 2yz.

It is correct, although a bit inefficient, to replace i < j by i ≠ j and drop the factor of 2.
This amounts to writing 2xy as xy + yx.

20.5 Least Squares

Let (x1, y1), ..., (xn, yn) be points in the plane. The problem is to find the line y = ax + b
that minimizes Σ_{i=1}^n [yi − (axi + b)]². (The numbers a and b are to be determined.)
Consider the following random experiment. Choose X with P{X = xi} = 1/n for
i = 1, ..., n. If X = xi, set Y = yi. [This is the same setup as in the special case of the
Cauchy-Schwarz inequality in (20.3).] Then

    E[(Y − (aX + b))²] = (1/n) Σ_{i=1}^n [yi − (axi + b)]²,

so the least squares problem is equivalent to finding the best estimate of Y of the form
aX + b, where "best" means that the mean square error is to be minimized. This is the
problem that we solved in (20.1). The least squares line is

    y − μY = ρ (σY/σX) (x − μX).

To evaluate μX, μY, σX, σY, ρ:

    μX = (1/n) Σ_{i=1}^n xi = x̄,    μY = (1/n) Σ_{i=1}^n yi = ȳ,

    σX² = E[(X − μX)²] = (1/n) Σ_{i=1}^n (xi − x̄)² = sx²,    σY² = (1/n) Σ_{i=1}^n (yi − ȳ)² = sy²,

    ρ = E[(X − μX)(Y − μY)] / (σXσY),    ρ σY/σX = E[(X − μX)(Y − μY)] / σX².

The last entry is the slope of the least squares line, which after cancellation of 1/n in
numerator and denominator, becomes

    Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)².

If ρ > 0, then the least squares line has positive slope, and y tends to increase with x. If
ρ < 0, then the least squares line has negative slope and y tends to decrease as x increases.
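A minimal sketch of the slope formula in code (NumPy assumed; least_squares_line and the sample points are ours):

    import numpy as np

    def least_squares_line(x, y):
        # slope and intercept of the least squares line of (20.5)
        x, y = np.asarray(x, float), np.asarray(y, float)
        slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        intercept = y.mean() - slope * x.mean()
        return slope, intercept

    print(least_squares_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))   # slope near 2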

Problems
In Problems 1-5, assume that X and Y are independent random variables, and that we
know μX = E(X), μY = E(Y), σX² = Var X, and σY² = Var Y. In Problem 2, we also
know ρ, the correlation coefficient between X and Y.
1. Find the variance of XY.
2. Find the variance of aX + bY, where a and b are arbitrary real numbers.
3. Find the covariance of X and X + Y.
4. Find the correlation coefficient between X and X + Y.
5. Find the covariance of XY and X.
6. Under what conditions will there be equality in the Cauchy-Schwarz inequality?

Lecture 21. The Multivariate Normal Distribution

21.1 Definitions and Comments

The joint moment-generating function of X1, ..., Xn [also called the moment-generating
function of the random vector (X1, ..., Xn)] is defined by

    M(t1, ..., tn) = E[exp(t1X1 + ··· + tnXn)].

Just as in the one-dimensional case, the moment-generating function determines the density uniquely. The random variables X1, ..., Xn are said to have the multivariate normal
distribution or to be jointly Gaussian (we also say that the random vector (X1, ..., Xn)
is Gaussian) if

    M(t1, ..., tn) = exp(t1μ1 + ··· + tnμn) exp[ (1/2) Σ_{i,j=1}^n ti aij tj ]

where the ti and μj are arbitrary real numbers, and the matrix A is symmetric and
positive definite.
Before we do anything else, let us indicate the notational scheme we will be using.
Vectors will be written with an underbar, and are assumed to be column vectors unless
otherwise specified. If t is a column vector with components t1, ..., tn, then to save space
we write t = (t1, ..., tn)'. The row vector with these components is the transpose of t,
written t'. The moment-generating function of jointly Gaussian random variables has the
form

    M(t1, ..., tn) = exp(t'μ) exp[ (1/2) t'At ].

We can describe Gaussian random vectors much more concretely.

21.2 Theorem

Jointly Gaussian random variables arise from linear transformations on independent normal
random variables.
Proof. Let X1, ..., Xn be independent, with Xi normal (0, λi), and let X = (X1, ..., Xn)'.
Let Y = BX + μ where B is nonsingular. Then Y is Gaussian, as can be seen by computing
the moment-generating function of Y:

    MY(t) = E[exp(t'Y)] = E[exp(t'BX)] exp(t'μ).

But

    E[exp(u'X)] = Π_{i=1}^n E[exp(uiXi)] = exp( Σ_{i=1}^n λi ui²/2 ) = exp[ (1/2) u'Du ]

where D is a diagonal matrix with the λi down the main diagonal. Set u = B't, u' = t'B;
then

    MY(t) = exp(t'μ) exp[ (1/2) t'BDB't ]

and BDB' is symmetric since D is symmetric. Since t'BDB't = u'Du, which is greater
than 0 except when u = 0 (equivalently when t = 0, because B is nonsingular), BDB' is
positive definite, and consequently Y is Gaussian.
Conversely, suppose that the moment-generating function of Y is exp(t'μ) exp[(1/2)t'At],
where A is symmetric and positive definite. Let L be an orthogonal matrix such that
L'AL = D, where D is the diagonal matrix of eigenvalues of A. Set X = L'(Y − μ), so
that Y = μ + LX. The moment-generating function of X is

    E[exp(t'X)] = exp(−t'L'μ) E[exp(t'L'Y)].

The last term is the moment-generating function of Y with t' replaced by t'L', or equivalently, t replaced by Lt. Thus the moment-generating function of X becomes

    exp(−t'L'μ) exp(t'L'μ) exp[ (1/2) t'L'ALt ].

This reduces to

    exp[ (1/2) t'Dt ] = exp[ (1/2) Σ_{i=1}^n λi ti² ].

Therefore the Xi are independent, with Xi normal (0, λi).
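A simulation sketch of the forward direction (NumPy assumed; the particular B, λi and μ are arbitrary choices of ours): the sample covariance of Y should be close to BDB'.

    import numpy as np

    rng = np.random.default_rng(4)
    lam = np.array([1.0, 2.0, 0.5])                  # variances of the independent X_i
    B = np.array([[1.0, 0.0, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.2, -0.3, 1.0]])                 # any nonsingular B
    mu = np.array([1.0, -1.0, 0.0])

    X = rng.normal(0.0, np.sqrt(lam), size=(100_000, 3))   # independent normals
    Y = X @ B.T + mu                                        # Y = BX + mu, row by row
    print(np.cov(Y, rowvar=False))    # approximately B D B'
    print(B @ np.diag(lam) @ B.T)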

21.3 A Geometric Interpretation

Assume for simplicity that all random variables have zero mean, so that the covariance
of U and V is E(UV), which can be regarded as an inner product. Then Y1, ..., Yn span
an n-dimensional space, and X1, ..., Xn is an orthogonal basis for that space. We will
see later in the lecture that orthogonality is equivalent to independence. (Orthogonality
means that the Xi are uncorrelated, i.e., E(XiXj) = 0 for i ≠ j.)

21.4 Theorem

Let Y = μ + LX as in the proof of (21.2), and let A be the symmetric, positive definite
matrix appearing in the moment-generating function of the Gaussian random vector Y.
Then E(Yi) = μi for all i, and furthermore, A is the covariance matrix of the Yi, in other
words, aij = Cov(Yi, Yj) (and aii = Cov(Yi, Yi) = Var Yi).
It follows that the means of the Yi and their covariance matrix determine the moment-generating function, and therefore the density.
Proof. Since the Xi have zero mean, we have E(Yi) = μi. Let K be the covariance matrix
of the Yi. Then K can be written in the following peculiar way:

    K = E[ (Y1 − μ1, ..., Yn − μn)' (Y1 − μ1, ..., Yn − μn) ].

Note that if a matrix M is n by 1 and a matrix N is 1 by n, then MN is n by n. In this
case, the ij entry is E[(Yi − μi)(Yj − μj)] = Cov(Yi, Yj). Thus

    K = E[(Y − μ)(Y − μ)'] = E(LXX'L') = L E(XX') L'

since expectation is linear. [For example, E(MX) = M E(X) because E(Σ_j mij Xj) =
Σ_j mij E(Xj).] But E(XX') is the covariance matrix of the Xi, which is D. Therefore
K = LDL' = A (because L'AL = D).
21.5 Finding the Density

From Y = μ + LX we can calculate the density of Y. The Jacobian of the transformation
from X to Y is det L = ±1, and

    fX(x1, ..., xn) = [1/((√(2π))^n √(λ1···λn))] exp[ −Σ_{i=1}^n xi²/2λi ].

We have λ1···λn = det D = det K because det L = det L' = ±1. Thus

    fX(x1, ..., xn) = [1/((√(2π))^n √(det K))] exp[ −(1/2) x'D^{−1}x ].

But y = μ + Lx, x = L'(y − μ), x'D^{−1}x = (y − μ)'LD^{−1}L'(y − μ), and [see the end
of (21.4)] K = LDL', K^{−1} = LD^{−1}L'. The density of Y is

    fY(y1, ..., yn) = [1/((√(2π))^n √(det K))] exp[ −(1/2)(y − μ)'K^{−1}(y − μ) ].
21.6 Individually Gaussian Versus Jointly Gaussian

If X1, ..., Xn are jointly Gaussian, then each Xi is normally distributed (see Problem 4),
but not conversely. For example, let X be normal (0,1) and flip an unbiased coin. If the
coin shows heads, set Y = X, and if tails, set Y = −X. Then Y is also normal (0,1) since

    P{Y ≤ y} = (1/2)P{X ≤ y} + (1/2)P{−X ≤ y} = P{X ≤ y}

because −X is also normal (0,1). Thus FX = FY. But with probability 1/2, X + Y = 2X,
and with probability 1/2, X + Y = 0. Therefore P{X + Y = 0} = 1/2. If X and Y were
jointly Gaussian, then X + Y would be normal (Problem 4). We conclude that X and Y
are individually Gaussian but not jointly Gaussian.

21.7 Theorem

If X1, ..., Xn are jointly Gaussian and uncorrelated (Cov(Xi, Xj) = 0 for all i ≠ j), then
the Xi are independent.
Proof. The moment-generating function of X = (X1, ..., Xn)' is

    MX(t) = exp(t'μ) exp[ (1/2) t'Kt ]

where K is a diagonal matrix with entries σ1², σ2², ..., σn² down the main diagonal, and 0's
elsewhere. Thus

    MX(t) = Π_{i=1}^n exp(tiμi) exp[ (1/2) σi² ti² ],

which is the joint moment-generating function of independent random variables X1, ..., Xn,
where Xi is normal (μi, σi²).
21.8 A Conditional Density

Assume X_1, . . . , X_n are jointly Gaussian. We find the conditional density of X_n given
X_1, . . . , X_{n−1}:

    f(x_n | x_1, . . . , x_{n−1}) = f(x_1, . . . , x_n) / f(x_1, . . . , x_{n−1})

with

    f(x_1, . . . , x_n) = (2π)^{−n/2} (det K)^{−1/2} exp(−(1/2) Σ_{i,j=1}^n y_i q_ij y_j)

where Q = K⁻¹ = [q_ij] and y_i = x_i − μ_i. Also,

    f(x_1, . . . , x_{n−1}) = ∫_{−∞}^{∞} f(x_1, . . . , x_{n−1}, x_n) dx_n = B(y_1, . . . , y_{n−1}).

Now

    Σ_{i,j=1}^n y_i q_ij y_j = Σ_{i,j=1}^{n−1} y_i q_ij y_j + y_n Σ_{j=1}^{n−1} q_nj y_j + y_n Σ_{i=1}^{n−1} q_in y_i + q_nn y_n².

Thus the conditional density has the form

    [A(y_1, . . . , y_{n−1}) / B(y_1, . . . , y_{n−1})] exp[−(C y_n² + D(y_1, . . . , y_{n−1}) y_n)]

with C = (1/2) q_nn and D = Σ_{j=1}^{n−1} q_nj y_j = Σ_{i=1}^{n−1} q_in y_i, since Q = K⁻¹ is symmetric.
The conditional density may now be expressed as

    (A/B) exp(D²/4C) exp[−C(y_n + D/2C)²].

We conclude that

    given X_1, . . . , X_{n−1}, X_n is normal.

The conditional variance of X_n (the same as the conditional variance of Y_n = X_n − μ_n) is

    1/(2C) = 1/q_nn

because matching exp[−C(y_n + D/2C)²] with a normal density exp[−(y_n − m)²/2σ²] gives
1/(2σ²) = C, hence σ² = 1/(2C). Thus

    Var(X_n | X_1, . . . , X_{n−1}) = 1/q_nn

and the conditional mean of Y_n is

    −D/(2C) = −(1/q_nn) Σ_{j=1}^{n−1} q_nj Y_j

so the conditional mean of X_n is

    E(X_n | X_1, . . . , X_{n−1}) = μ_n − (1/q_nn) Σ_{j=1}^{n−1} q_nj (X_j − μ_j).

Recall from Lecture 18 that E(Y|X) is the best estimate of Y based on X, in the sense
that the mean square error is minimized. In the joint Gaussian case, the best estimate of
X_n based on X_1, . . . , X_{n−1} is linear, and it follows that the best linear estimate is in fact
the best overall estimate. This has important practical applications, since linear systems
are usually much easier than nonlinear systems to implement and analyze.
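
The conditional formulas above are easy to try numerically. The following sketch (not
from the notes; it assumes numpy, and the covariance matrix K, mean vector μ, and
observed values are arbitrary examples) computes Var(X_n | X_1, . . . , X_{n−1}) = 1/q_nn and
the conditional mean from the precision matrix Q = K⁻¹, and cross-checks the variance
against the familiar regression form K_nn − K_nr K_rr⁻¹ K_rn.

    import numpy as np

    K = np.array([[4.0, 1.2, 0.5],
                  [1.2, 3.0, 0.9],
                  [0.5, 0.9, 2.0]])       # example covariance matrix
    mu = np.array([1.0, 0.0, -1.0])
    Q = np.linalg.inv(K)                  # precision matrix [q_ij]

    n = K.shape[0]
    q_nn = Q[n - 1, n - 1]
    cond_var = 1.0 / q_nn
    print("Var(X3 | X1, X2) =", cond_var)

    # Conditional mean at a particular observed value of (X1, X2), say (2.0, 1.0).
    x_given = np.array([2.0, 1.0])
    cond_mean = mu[n - 1] - (1.0 / q_nn) * (Q[n - 1, :n - 1] @ (x_given - mu[:n - 1]))
    print("E(X3 | X1=2, X2=1) =", cond_mean)

    # Cross-check of the variance: K_nn - K_nr K_rr^{-1} K_rn gives the same number.
    K_rr, K_rn = K[:n - 1, :n - 1], K[:n - 1, n - 1]
    print(K[n - 1, n - 1] - K_rn @ np.linalg.solve(K_rr, K_rn))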

Problems
1. Let K be the covariance matrix of arbitrary random variables X_1, . . . , X_n. Assume
that K is nonsingular to avoid degenerate cases. Show that K is symmetric and positive
definite. What can you conclude if K is singular?
2. If X is a Gaussian n-vector and Y = AX with A nonsingular, show that Y is Gaussian.
3. If X_1, . . . , X_n are jointly Gaussian, show that X_1, . . . , X_m are jointly Gaussian for
m ≤ n.
4. If X_1, . . . , X_n are jointly Gaussian, show that c_1 X_1 + ⋯ + c_n X_n is a normal random
variable (assuming it is nondegenerate, i.e., not identically constant).

Lecture 22. The Bivariate Normal Distribution


22.1 Formulas
The general formula for the n-dimensional normal density is

    f_X(x_1, . . . , x_n) = [1 / ((√(2π))^n √(det K))] exp(−(1/2)(x − μ)′K⁻¹(x − μ))

where E(X) = μ and K is the covariance matrix of X. We specialize to the case n = 2:

    μ = (μ_1, μ_2)′,   σ_12 = Cov(X_1, X_2) = ρσ_1σ_2;

    K = [ σ_1²      ρσ_1σ_2 ]
        [ ρσ_1σ_2   σ_2²    ]

    K⁻¹ = [1 / (σ_1²σ_2²(1 − ρ²))] [ σ_2²       −ρσ_1σ_2 ]
                                   [ −ρσ_1σ_2   σ_1²     ]

        = [1 / (1 − ρ²)] [ 1/σ_1²        −ρ/(σ_1σ_2) ]
                         [ −ρ/(σ_1σ_2)   1/σ_2²      ].

Thus the joint density of X_1 and X_2 is

    f(x_1, x_2) = [1 / (2πσ_1σ_2√(1 − ρ²))]
        × exp{ −[1/(2(1 − ρ²))] [ ((x_1 − μ_1)/σ_1)² − 2ρ((x_1 − μ_1)/σ_1)((x_2 − μ_2)/σ_2) + ((x_2 − μ_2)/σ_2)² ] }.

The moment-generating function of X is

    M_X(t_1, t_2) = exp(t′μ) exp((1/2) t′Kt)
                  = exp[ t_1μ_1 + t_2μ_2 + (1/2)(σ_1² t_1² + 2ρσ_1σ_2 t_1 t_2 + σ_2² t_2²) ].

If X_1 and X_2 are jointly Gaussian and uncorrelated, then ρ = 0, so that f(x_1, x_2) is the
product of a function g(x_1) of x_1 alone and a function h(x_2) of x_2 alone. It follows that
X_1 and X_2 are independent. (We proved independence in the general n-dimensional case
in Lecture 21.)
From the results at the end of Lecture 21, the conditional distribution of X_2 given X_1
is normal, with

    E(X_2 | X_1 = x_1) = μ_2 − (q_21/q_22)(x_1 − μ_1)

where

    q_21/q_22 = [−ρ/(σ_1σ_2)] / [1/σ_2²] = −ρ σ_2/σ_1

(the common factor 1/(1 − ρ²) cancels). Thus

    E(X_2 | X_1 = x_1) = μ_2 + ρ (σ_2/σ_1)(x_1 − μ_1)

and

    Var(X_2 | X_1 = x_1) = 1/q_22 = σ_2²(1 − ρ²).

For E(X_1 | X_2 = x_2) and Var(X_1 | X_2 = x_2), interchange μ_1 and μ_2, and interchange σ_1
and σ_2.
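
As a sanity check, the explicit two-dimensional formula can be compared with the general
matrix form of the density. The sketch below is not part of the notes; it assumes numpy
and scipy are available, and the means, standard deviations, correlation and evaluation
point are arbitrary example values.

    import numpy as np
    from scipy.stats import multivariate_normal

    mu1, mu2 = 1.0, -2.0
    s1, s2, rho = 2.0, 1.5, 0.6

    mu = np.array([mu1, mu2])
    K = np.array([[s1**2,         rho * s1 * s2],
                  [rho * s1 * s2, s2**2        ]])

    def bivariate_density(x1, x2):
        # Explicit two-dimensional formula from Section 22.1.
        z1 = (x1 - mu1) / s1
        z2 = (x2 - mu2) / s2
        quad = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)
        return np.exp(-quad / 2) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

    x = np.array([0.5, -1.0])
    print(bivariate_density(*x))              # explicit formula
    print(multivariate_normal(mu, K).pdf(x))  # general matrix formula, same value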

22.2 Example
Let X be the height of the father, Y the height of the son, in a sample of father-son pairs.
Assume X and Y bivariate normal, as found by Karl Pearson around 1900. Assume
E(X) = 68 (inches), E(Y) = 69, σ_X = σ_Y = 2, ρ = .5. (We expect ρ to be positive
because on the average, the taller the father, the taller the son.)
Given X = 80 (6 feet 8 inches), Y is normal with mean

    μ_Y + ρ (σ_Y/σ_X)(x − μ_X) = 69 + .5(80 − 68) = 75

which is 6 feet 3 inches. The variance of Y given X = 80 is

    σ_Y²(1 − ρ²) = 4(3/4) = 3.

Thus the son will tend to be of above average height, but not as tall as the father. This
phenomenon is often called regression, and the line y = μ_Y + ρ(σ_Y/σ_X)(x − μ_X) is called
the line of regression or the regression line.
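
Here is a quick numerical sketch of the example (not from the notes; it assumes numpy).
It reproduces the conditional mean and variance from the formulas of Section 22.1 and
then checks them by simulation at a less extreme father's height (72 inches, chosen only
so that simulated data near that height is plentiful).

    import numpy as np

    mu_x, mu_y = 68.0, 69.0           # mean heights in inches
    sigma_x, sigma_y, rho = 2.0, 2.0, 0.5

    def cond_mean_var(x):
        # E(Y | X = x) and Var(Y | X = x) from Section 22.1.
        mean = mu_y + rho * (sigma_y / sigma_x) * (x - mu_x)
        var = sigma_y**2 * (1 - rho**2)
        return mean, var

    print(cond_mean_var(80.0))        # (75.0, 3.0), as in the example

    # Simulation check near x = 72, where fathers are not rare.
    rng = np.random.default_rng(1)
    K = [[sigma_x**2, rho * sigma_x * sigma_y],
         [rho * sigma_x * sigma_y, sigma_y**2]]
    fathers, sons = rng.multivariate_normal([mu_x, mu_y], K, size=500_000).T
    near_72 = sons[np.abs(fathers - 72.0) < 0.25]
    print(near_72.mean(), near_72.var())   # approximately (71.0, 3.0)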

Problems
1. Let X and Y have the bivariate normal distribution. The following facts are known:
μ_X = −1, σ_X = 2, and the best estimate of Y based on X, i.e., the estimate that
minimizes the mean square error, is given by 3X + 7. The minimum mean square error
is 28. Find σ_Y, μ_Y and the correlation coefficient ρ between X and Y.
2. Show that the bivariate normal density belongs to the exponential class, and find the
corresponding complete sufficient statistic.

Lecture 23. Cramér-Rao Inequality

23.1 A Strange Random Variable
Given a density f_θ(x), −∞ < x < ∞, a < θ < b. We have found maximum likelihood
estimates by computing (∂/∂θ) ln f_θ(x). If we replace x by X, we have a random variable.
To see what is going on, let's look at a discrete example. If X takes on values x_1, x_2, x_3, x_4
with p(x_1) = .5, p(x_2) = p(x_3) = .2, p(x_4) = .1, then p(X) is a random variable with the
following distribution:

    P{p(X) = .5} = .5,   P{p(X) = .2} = .4,   P{p(X) = .1} = .1.

For example, if X = x_2 then p(X) = p(x_2) = .2, and if X = x_3 then p(X) = p(x_3) = .2.
The total probability that p(X) = .2 is .4.
The continuous case is, at first sight, easier to handle. If X has density f and X = x,
then f(X) = f(x). But what is the density of f(X)? We will not need the result, but the
question is interesting and is considered in Problem 1.
The following two lemmas will be needed to prove the Cramér-Rao inequality, which
can be used to compute uniformly minimum variance unbiased estimates. In the calcula-
tions to follow, we are going to assume that all differentiations under the integral sign
are legal.

23.2 Lemma
E_θ[(∂/∂θ) ln f_θ(X)] = 0.

Proof. The expectation is

    ∫_{−∞}^{∞} [(∂/∂θ) ln f_θ(x)] f_θ(x) dx = ∫_{−∞}^{∞} (1/f_θ(x)) (∂f_θ(x)/∂θ) f_θ(x) dx

which reduces to

    ∫_{−∞}^{∞} (∂/∂θ) f_θ(x) dx = (∂/∂θ)(1) = 0.

23.3 Lemma
Let Y = g(X) and assume E_θ(Y) = k(θ). If k′(θ) = dk(θ)/dθ, then

    k′(θ) = E_θ[Y (∂/∂θ) ln f_θ(X)].

Proof. We have

    k′(θ) = (∂/∂θ) E_θ[g(X)] = (∂/∂θ) ∫_{−∞}^{∞} g(x) f_θ(x) dx = ∫_{−∞}^{∞} g(x) (∂f_θ(x)/∂θ) dx

    = ∫_{−∞}^{∞} g(x) (∂f_θ(x)/∂θ) (1/f_θ(x)) f_θ(x) dx = ∫_{−∞}^{∞} g(x) [(∂/∂θ) ln f_θ(x)] f_θ(x) dx

    = E_θ[g(X) (∂/∂θ) ln f_θ(X)] = E_θ[Y (∂/∂θ) ln f_θ(X)].
23.4 Cramér-Rao Inequality
Under the assumptions of (23.3), we have

    Var_θ Y ≥ [k′(θ)]² / E_θ[((∂/∂θ) ln f_θ(X))²].

Proof. By the Cauchy-Schwarz inequality,

    [Cov(V, W)]² = (E[(V − μ_V)(W − μ_W)])² ≤ Var V Var W

hence

    [Cov_θ(Y, (∂/∂θ) ln f_θ(X))]² ≤ Var_θ Y Var_θ[(∂/∂θ) ln f_θ(X)].

Since E_θ[(∂/∂θ) ln f_θ(X)] = 0 by (23.2), this becomes

    (E_θ[Y (∂/∂θ) ln f_θ(X)])² ≤ Var_θ Y E_θ[((∂/∂θ) ln f_θ(X))²].

By (23.3), the left side is [k′(θ)]², and the result follows.

23.5 A Special Case

Let X_1, . . . , X_n be iid, each with density f_θ(x), and take X = (X_1, . . . , X_n). Then
f_θ(x_1, . . . , x_n) = Π_{i=1}^n f_θ(x_i) and by (23.2),

    E_θ[((∂/∂θ) ln f_θ(X))²] = Var_θ[(∂/∂θ) ln f_θ(X)] = Var_θ[ Σ_{i=1}^n (∂/∂θ) ln f_θ(X_i) ]

    = n Var_θ[(∂/∂θ) ln f_θ(X_i)] = n E_θ[((∂/∂θ) ln f_θ(X_i))²].
23.6 Theorem
Let X_1, . . . , X_n be iid, each with density f_θ(x). If Y = g(X_1, . . . , X_n) is an unbiased
estimate of θ, then

    Var_θ Y ≥ 1 / ( n E_θ[((∂/∂θ) ln f_θ(X_i))²] ).

Proof. Applying (23.5), we have a special case of the Cramér-Rao inequality (23.4) with
k(θ) = θ, k′(θ) = 1.
The lower bound in (23.6) is 1/(nI(θ)), where

    I(θ) = E_θ[((∂/∂θ) ln f_θ(X_i))²]

is called the Fisher information.
It follows from (23.6) that if Y is an unbiased estimate that attains the Cramér-Rao
bound for all θ (an efficient estimate), then Y must be a UMVUE of θ.

23.7 A Computational Simplification

From (23.2) we have

    ∫_{−∞}^{∞} [(∂/∂θ) ln f_θ(x)] f_θ(x) dx = 0.

Differentiate again to obtain

    ∫_{−∞}^{∞} [(∂²/∂θ²) ln f_θ(x)] f_θ(x) dx + ∫_{−∞}^{∞} [(∂/∂θ) ln f_θ(x)] (∂f_θ(x)/∂θ) dx = 0.

Thus

    ∫_{−∞}^{∞} [(∂²/∂θ²) ln f_θ(x)] f_θ(x) dx + ∫_{−∞}^{∞} [(∂/∂θ) ln f_θ(x)] [ (∂f_θ(x)/∂θ)(1/f_θ(x)) ] f_θ(x) dx = 0.

But the term in brackets on the right is (∂/∂θ) ln f_θ(x), so we have

    ∫_{−∞}^{∞} [(∂²/∂θ²) ln f_θ(x)] f_θ(x) dx + ∫_{−∞}^{∞} [((∂/∂θ) ln f_θ(x))²] f_θ(x) dx = 0.

Therefore

    E_θ[((∂/∂θ) ln f_θ(X_i))²] = −E_θ[(∂²/∂θ²) ln f_θ(X_i)].
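
Both expressions for the Fisher information are easy to check by simulation. The sketch
below (not in the notes; it assumes numpy, and the parameter values λ = 3 and n = 25 are
arbitrary examples) does this for the Poisson(λ) family, in the spirit of Problem 2 below,
and also compares Var(X̄) = λ/n with the Cramér-Rao bound 1/(nI(λ)) = λ/n.

    import numpy as np

    rng = np.random.default_rng(2)
    lam = 3.0
    x = rng.poisson(lam, size=1_000_000)

    # ln f(x) = -lam + x*ln(lam) - ln(x!); the last term does not involve lam.
    score = -1.0 + x / lam              # (d/d lam) ln f(x)
    score_deriv = -x / lam**2           # (d^2/d lam^2) ln f(x)

    print("E[score^2]            :", np.mean(score**2))      # ~ 1/lam
    print("-E[second derivative] :", -np.mean(score_deriv))  # ~ 1/lam, as in 23.7

    # Cramér-Rao bound for an unbiased estimate based on n observations.
    n = 25
    fisher = 1.0 / lam
    print("bound 1/(n*I(lam))    :", 1.0 / (n * fisher))      # = lam/n
    print("Var of the sample mean:", lam / n)                 # attains the bound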

Problems
1. If X is a random variable with density f(x), explain how to find the distribution of
the random variable f(X).
2. Use the Cramér-Rao inequality to show that the sample mean is a UMVUE of the true
mean in the Bernoulli, normal (with σ² known) and Poisson cases.

Lecture 24. Nonparametric Statistics


We wish to make a statistical inference about a random variable X even though we know
nothing at all about its underlying distribution.

24.1 Percentiles
Assume F continuous and strictly increasing. If 0 < p < 1, then the equation F(x) = p
has a unique solution ξ_p, so that P{X ≤ ξ_p} = p. When p = 1/2, ξ_p is the median; when
p = .3, ξ_p is the 30-th percentile, and so on.
Let X_1, . . . , X_n be iid, each with distribution function F, and let Y_1, . . . , Y_n be the
order statistics. We will consider the problem of estimating ξ_p.

24.2 Point Estimates


On the average, np of the observations will be less than ξ_p. (We have n Bernoulli trials,
with probability of success P{X_i < ξ_p} = F(ξ_p) = p.) It seems reasonable to use Y_k as an
estimate of ξ_p, where k is approximately np. We can be a bit more precise. The random
variables F(X_1), . . . , F(X_n) are iid, uniform on (0,1) [see (8.5)]. Thus F(Y_1), . . . , F(Y_n)
are the order statistics from a uniform (0,1) sample. We know from Lecture 6 that the
density of F(Y_k) is

    [n! / ((k − 1)!(n − k)!)] x^{k−1} (1 − x)^{n−k},   0 < x < 1.

Therefore

    E[F(Y_k)] = ∫_0^1 [n! / ((k − 1)!(n − k)!)] x^k (1 − x)^{n−k} dx = [n! / ((k − 1)!(n − k)!)] B(k + 1, n − k + 1).

Now B(k + 1, n − k + 1) = Γ(k + 1)Γ(n − k + 1)/Γ(n + 2) = k!(n − k)!/(n + 1)!, and
consequently

    E[F(Y_k)] = k/(n + 1),   1 ≤ k ≤ n.

Define Y_0 = −∞ and Y_{n+1} = ∞, so that

    E[F(Y_{k+1}) − F(Y_k)] = 1/(n + 1),   0 ≤ k ≤ n.

(Note that when k = n, the expectation is 1 − [n/(n + 1)] = 1/(n + 1), as asserted.)
The key point is that on the average, each interval [Y_k, Y_{k+1}] produces area 1/(n + 1)
under the density f of the X_i. This is true because

    ∫_{Y_k}^{Y_{k+1}} f(x) dx = F(Y_{k+1}) − F(Y_k)

and we have just seen that the expectation of this quantity is 1/(n + 1), k = 0, 1, . . . , n.
If we want to accumulate area p, set k/(n + 1) = p, that is, k = (n + 1)p.
Conclusion: If (n + 1)p is an integer, estimate ξ_p by Y_{(n+1)p}.
If (n + 1)p is not an integer, we can use a weighted average. For example, if p = .6 and
n = 13 then (n + 1)p = 14 × .6 = 8.4. Now if (n + 1)p were 8, we would use Y_8, and if
(n + 1)p were 9 we would use Y_9. If (n + 1)p = 8 + ε, we use (1 − ε)Y_8 + εY_9. In the
present case, ε = .4, so we use .6Y_8 + .4Y_9 = Y_8 + .4(Y_9 − Y_8).
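
The interpolated estimate is short to code. The following sketch is not part of the notes;
it assumes numpy, and the choice of an exponential(1) sample is just an example of a
continuous distribution whose percentiles are known exactly (with only n = 13 observations
the estimate is necessarily rough).

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 13, 0.6
    x = rng.exponential(scale=1.0, size=n)
    y = np.sort(x)                        # order statistics Y_1 <= ... <= Y_n

    k = (n + 1) * p                       # 8.4 in this example
    lo = int(np.floor(k))
    eps = k - lo
    # Y_lo and Y_{lo+1} are y[lo-1] and y[lo] with zero-based indexing.
    estimate = (1 - eps) * y[lo - 1] + eps * y[lo]
    print("estimate of the 60th percentile:", estimate)
    print("true 60th percentile           :", -np.log(1 - p))  # exponential(1)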

24.3 Confidence Intervals


Select order statistics Y_i and Y_j, where i and j are (approximately) symmetrical about
(n + 1)p. Then P{Y_i < ξ_p < Y_j} is the probability that the number of observations less
than ξ_p is at least i but less than j, i.e., between i and j − 1, inclusive. The probability
that exactly k observations will be less than ξ_p is C(n, k) p^k (1 − p)^{n−k}, hence

    P{Y_i < ξ_p < Y_j} = Σ_{k=i}^{j−1} C(n, k) p^k (1 − p)^{n−k}.

Thus (Y_i, Y_j) is a confidence interval for ξ_p, and we can find the confidence level by
evaluating the above sum, possibly with the aid of the normal approximation to the
binomial.
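
The binomial sum for the confidence level is easy to evaluate exactly. A small sketch (not
from the notes; it assumes scipy, and the choice n = 13, p = 0.5, interval (Y_4, Y_10) is an
arbitrary example for the median):

    from scipy.stats import binom

    def coverage(n, p, i, j):
        # P{Y_i < xi_p < Y_j} = sum_{k=i}^{j-1} C(n,k) p^k (1-p)^(n-k)
        return sum(binom.pmf(k, n, p) for k in range(i, j))

    print(coverage(13, 0.5, 4, 10))   # roughly 0.91, the confidence level of (Y_4, Y_10)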

24.4 Hypothesis Testing


First let's look at a numerical example. The 30-th percentile ξ_.3 will be less than 68
precisely when F(ξ_.3) < F(68), because F is continuous and strictly increasing. Therefore
ξ_.3 < 68 iff F(68) > .3. Similarly, ξ_.3 > 68 iff F(68) < .3, and ξ_.3 = 68 iff F(68) = .3. In
general,

    ξ_{p_0} < ξ_0 iff F(ξ_0) > p_0,   ξ_{p_0} > ξ_0 iff F(ξ_0) < p_0

and

    ξ_{p_0} = ξ_0 iff F(ξ_0) = p_0.

In our numerical example, if F(68) were actually .4, then on the average, 40 percent of
the observations will be 68 or less, as opposed to 30 percent if F(68) = .3. Thus a larger
than expected number of observations less than or equal to 68 will tend to make us reject
the hypothesis that the 30-th percentile is exactly 68. In general, our problem will be

    H_0: ξ_{p_0} = ξ_0   (equivalently, F(ξ_0) = p_0)

    H_1: ξ_{p_0} < ξ_0   (equivalently, F(ξ_0) > p_0)

where p_0 and ξ_0 are specified. If Y is the number of observations less than or equal to ξ_0,
we propose to reject H_0 if Y ≥ c. (If H_1 is ξ_{p_0} > ξ_0, i.e., F(ξ_0) < p_0, we reject if Y ≤ c.)
Note that Y is the number of nonpositive signs in the sequence X_1 − ξ_0, . . . , X_n − ξ_0, and
for this reason, the terminology sign test is used.
Since we are trying to determine whether F(ξ_0) is equal to p_0 or greater than p_0, we
may regard θ = F(ξ_0) as the unknown state of nature. The power function of the test is

    K(θ) = P_θ{Y ≥ c} = Σ_{k=c}^n C(n, k) θ^k (1 − θ)^{n−k}

and in particular, the significance level (probability of a type 1 error) is α = K(p_0).

The above confidence interval estimates and the sign test are distribution free, that is,
independent of the underlying distribution function F.
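
The power function K(θ) of the sign test is just a binomial tail probability, so it can be
evaluated directly. The sketch below is not from the notes; it assumes scipy, and the
values n = 20, c = 10 and p_0 = .3 are arbitrary examples.

    from scipy.stats import binom

    def power(theta, n, c):
        # K(theta) = P{Y >= c} when Y ~ binomial(n, theta)
        return binom.sf(c - 1, n, theta)   # survival function gives P{Y > c-1}

    n, c, p0 = 20, 10, 0.3
    print("significance level alpha = K(p0):", power(p0, n, c))
    print("power at theta = 0.5            :", power(0.5, n, c))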
Problems are deferred to Lecture 25.


Lecture 25. The Wilcoxon Test


We will need two formulas:

    Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6,   Σ_{k=1}^n k³ = [n(n + 1)/2]².

For a derivation via the calculus of finite differences, see my on-line text A Course in
Commutative Algebra, Section 5.1.
The hypothesis testing problem addressed by the Wilcoxon test is the same as that
considered by the sign test, except that:
(1) We are restricted to testing the median ξ_.5.
(2) We assume that X_1, . . . , X_n are iid and the underlying density is symmetric about
the median (so we are not quite nonparametric). There are many situations where we
suspect an underlying normal distribution but are not sure. In such cases, the symmetry
assumption may be reasonable.
(3) We use the magnitudes as well as the signs of the deviations X_i − ξ_.5, so the Wilcoxon
test should be more accurate than the sign test.

25.1 How The Test Works


Suppose we are testing H_0: ξ_.5 = m vs. H_1: ξ_.5 > m based on observations X_1, . . . , X_n.
We rank the absolute values |X_i − m| from smallest to largest. For example, let n = 5
and X_1 − m = 2.7, X_2 − m = −1.3, X_3 − m = 0.3, X_4 − m = −3.2, X_5 − m = 2.4. Then

    |X_3 − m| < |X_2 − m| < |X_5 − m| < |X_1 − m| < |X_4 − m|.

Let R_i be the rank of |X_i − m|, so that R_3 = 1, R_2 = 2, R_5 = 3, R_1 = 4, R_4 = 5. Let Z_i be
the sign of X_i − m, so that Z_i = ±1. Then Z_3 = 1, Z_2 = −1, Z_5 = 1, Z_1 = 1, Z_4 = −1.
The Wilcoxon statistic is

    W = Σ_{i=1}^n Z_i R_i.

In this case, W = 1 − 2 + 3 + 4 − 5 = 1. Because the density is symmetric about
the median, if R_i is given then Z_i is still equally likely to be ±1, so (R_1, . . . , R_n) and
(Z_1, . . . , Z_n) are independent. (Note that if R_j is given, the odds about Z_i (i ≠ j) are
unaffected since the observations X_1, . . . , X_n are independent.) Now the R_i are simply a
permutation of (1, 2, . . . , n), so

W is a sum of independent random variables V_i where V_i = ±i with equal probability.

25.2 Properties Of The Wilcoxon Statistic


Under H_0, E(V_i) = 0 and Var V_i = E(V_i²) = i², so

    E(W) = Σ_{i=1}^n E(V_i) = 0,   Var W = Σ_{i=1}^n i² = n(n + 1)(2n + 1)/6.

The V_i do not have the same distribution, but the central limit theorem still applies
because Liapounov's condition is satisfied:

    Σ_{i=1}^n E[|V_i − μ_i|³] / (Σ_{i=1}^n σ_i²)^{3/2} → 0 as n → ∞.

Now the V_i have mean μ_i = 0, so |V_i − μ_i|³ = |V_i|³ = i³ and σ_i² = Var V_i = i². Thus the
Liapounov fraction is the sum of the first n cubes divided by the 3/2 power of the sum of
the first n squares, which is

    [n²(n + 1)²/4] / [n(n + 1)(2n + 1)/6]^{3/2}.

For large n, the numerator is of the order of n⁴ and the denominator is of the order of
(n³)^{3/2} = n^{9/2}. Therefore the fraction is of the order of 1/√n → 0 as n → ∞. By the
central limit theorem, [W − E(W)]/σ(W) is approximately normal (0,1) for large n, with
E(W) = 0 and σ²(W) = n(n + 1)(2n + 1)/6.
If the median is larger than its value m under H_0, we expect W to have a positive
bias. Thus we reject H_0 if W ≥ c. (If H_1 were ξ_.5 < m, we would reject if W ≤ c.) The
value of c is determined by our choice of the significance level α.
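
The statistic and its normal approximation can be computed in a few lines. The sketch
below (not from the notes; it assumes numpy and scipy) reproduces the five-observation
example of 25.1; the one-sided tail probability from the normal approximation is of course
crude for n this small.

    import numpy as np
    from scipy.stats import norm

    def wilcoxon_W(x, m):
        # Signed-rank statistic W = sum_i Z_i R_i for H0: median = m.
        d = np.asarray(x, dtype=float) - m
        ranks = np.argsort(np.argsort(np.abs(d))) + 1   # ranks of |X_i - m|, smallest = 1
        return int(np.sum(np.sign(d) * ranks))

    deviations = [2.7, -1.3, 0.3, -3.2, 2.4]            # the X_i - m values from 25.1
    W = wilcoxon_W(deviations, 0.0)
    print("W =", W)                                      # 1, as computed in the text

    n = len(deviations)
    sigma_W = np.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    print("approximate P{W >= 1 under H0}:", norm.sf(W / sigma_W))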

Problems
1. Suppose we are using a sign test with n = 12 observations to decide between the null
hypothesis H_0: m = 40 and the alternative H_1: m > 40, where m is the median. We
use the statistic Y = the number of observations that are less than or equal to 40.
We reject H_0 if and only if Y ≤ c. Find the power function K(p) in terms of c and
p = F(40), and the probability of a type 1 error if c = 2.
2. Let m be the median of a random variable with density symmetric about m. Using
the Wilcoxon test, we are testing H_0: m = 160 vs. H_1: m > 160 based on n = 16
observations, which are as follows: 176.9, 158.3, 152.1, 158.8, 172.4, 169.8, 159.7, 162.7,
156.6, 174.5, 184.4, 165.2, 147.8, 177.8, 160.1, 160.5. Compute the Wilcoxon statistic
and determine whether H_0 is rejected at the .05 significance level, i.e., the probability
of a type 1 error is .05.
3. When n is small, the distribution of W can be found explicitly. Do it for n = 1, 2, 3.

Solutions to Problems
Lecture 1
1. P {max(X, Y, Z) t} = P {X t and Y t and Z t} = P {X t}3 by
independence. Thus the distribution function of the maximum is (t6 )3 = t18 , and the
density is 18t17 , 0 t 1.
2. See Figure S1.1. We have
 
P {Z z} =

yzx

zx

fXY (x, y) dx dy =

FZ (z) =

x=0

ex (1 ezx ) dx = 1

fZ (z) =

1
,
(z + 1)2

ex ey dy dx

y=0

1
,
1+z

z0

z0

FZ (z) = fZ (z) = 0 for z < 0.


3. P {Y = y} = P {g(X) = y} = P {X g 1 (y)}, which is the number of xi s that map to
y, divided by n. In particular, if g is one-to-one, then pY (g(xi )) = 1/n for i = 1, . . . , n.
4. Since the area under the density function must be 1, we have ab3 /3 = 1. Then (see
Figure S1.2) fY (y) = fX (y 1/3 )/|dy/dx| with y = x3 , dy/dx = 3x2 . In dy/dx we
substitute x = y 1/3 to get
fY (y) =

fX (y 1/3 )
3 y 2/3
1
= 3 2/3 = 3
2/3
b 3y
b
3y

for 0 < y 1/3 < b, i.e., 0 < y < b3 .


5. Let Y = tan X where X is uniformly distributed between /2 and /2. Then (see
Figure S1.3)
fY (y) =

fX (tan1 y)
1/
=
|dy/dx|x=tan1 y
sec2 x

with x = tan1 y, i.e., y = tan x. But sec2 x = 1 + tan2 x = 1 + y 2 , so fY (y) =


1/[(1 + y 2 )], the Cauchy density.

Lecture 2
1. We have y1 = 2x1 , y2 = x2 x1 , so x1 = y1 /2, x2 = (y1 /2) + y2 , and


 2 0
(y1 , y2 )
 = 2.
= 
1 1
(x1 , x2 )

y = zx
x
Figure S1.1

Y = X
y
y 1/3

Figure S1.2
Thus fY1 Y2 (y1 , y2 ) = (1/2)fX1 X2 (x1 , x2 ) = ex1 x2 = exp[(y1 /2) (y1 /2) y2 ] =
ey1 ey2 . As indicated in the comments, the range of the ys is 0 < y1 < 1, 0 < y2 < 1.
Therefore the joint density of Y1 and Y2 is the product of a function of y1 alone and
a function of y2 alone, which forces independence.
2. We have y1 = x1 /x2 , y2 = x2 , so x1 = y1 y2 , x2 = y2 and


(x1 , x2 ) y2 y1 
=
= y2 .
0 1
(y1 , y2 )
Thus fY1 Y2 (y1 , y2 ) = fX1 X2 (x1 , x2 )|(x1 , x2 )/(y1 , y2 )| = (8y1 y2 )(y2 )(y2 ) = 2y1 (4y23 ).
Since 0 < x1 < x2 < 1 is equivalent to 0 < y1 < 1, 0 < y2 < 1, it follows just as in
Problem 1 that X1 and X2 are independent.
3. The Jacobian (x1 , x2 , x3 )/(y1 , y2 , y3 ) is given by


 y2 y3
y1 y3
y1 y2 

y2 y3 y3 y1 y3 y2 y1 y2 


 0
y3
1 y2 
= (y2 y32 y1 y2 y32 )(1 y2 ) + y1 y22 y32 + y3 (y2 y1 y2 )y2 y3 + (1 y2 )y1 y2 y32
which cancels down to y2 y32 . Thus
fY1 Y2 Y3 (y1 , y2 , y3 ) = exp[(x1 + x2 + x3 )]y2 y32 = y2 y32 ey3 .
This can be expressed as (1)(2y2 )(y32 ey3 /2), and since x1 , x2 , x3 > 0 is equivalent to
0 < y1 < 1, 0 < y2 < 1, y3 > 0, it follows as before that Y1 , Y2 , Y3 are independent.

Lecture 3
1. MX2 (t) = MY (t)/MX1 (t) = (1 2t)r/2 /(1 2t)r1 /2 = (1 2t)(rr1 )/2 , which is
2 (r r1 ).

'

ar

ct

an

/2

y
2

'

Figure S1.3
2. The moment-generating function of c1 X1 + c2 X2 is
E[et(c1 X1 +c2 X2 ) ] = E[etc1 X1 ]E[etc2 X2 ] = (1 1 c1 t)1 (1 2 c2 t)2 .
If 1 c1 = 2 c2 , then X1 + X2 is gamma with = 1 + 2 and = i ci .
n
n
n
3. M (t) = E[exp( i=1 ci Xi )] = i=1 E[exp(tci Xi )] = i=1 Mi (ci t).
4. Apply Problem 3 with ci = 1 for all i. Thus
MY (t) =

n


Mi (t) =

i=1

n

i=1

exp[i (e 1)] = exp


t


n


(e 1)
t

i=1

which is Poisson (1 + + n ).
5. Since the coin is unbiased, X2 has the same distribution as the number of heads in the
second experiment. Thus X1 + X2 has the same distribution as the number of heads
in n1 + n2 tosses, namely binomial with n = n1 + n2 and p = 1/2.

Lecture 4
1. Let be the normal (0,1) distribution function, and recall that (x) = 1 (x).
Then

n
X
n
<c
P { c < X < + c} = P {c
<
}

/ n

= (c n/) (c n/) = 2(c n/) 1 .954.

Thus (c n/) 1.954/2 = .977. From tables, c n/ 2, so n 4 2 /c2 .


2. If Z = X Y , we want P {Z > 0}. But Z is normal with mean = 1 2 and
variance 2 = (12 /n1 ) + (22 /n2 ). Thus
P {Z > 0} = P {

>
} = 1 (/) = (/).

4
3. Since nS 2 / 2 is 2 (n 1), we have
P {a < S 2 < b} = P {

na
nb
< 2 (n 1) < 2 }.
2

If F is the 2 (n 1) distribution function, the desired probability is F (nb/ 2 )


F (na/ 2 ), which can be found using chi-square tables.
4. The moment-generating function is

 2 2

2
nS t
E[etS ] = E exp
= E[exp(t 2 X/n)]
2 n
where the random variable X is 2 (n 1), and therefore has moment-generating function M (t) = (1 2t)(n1)/2 . Replacing t by t 2 /n we get

MS 2 (t) =

2t 2
n

(n1)/2

so S 2 is gamma with = (n 1)/2 and = 2 2 /n.

Lecture 5
1. By denition of the beta density,
E(X) =

(a + b)
(a)(b)

xa (1 x)b1 dx

and the integral is (a + 1, b) = (a + 1)(b)/(a + b + 1). Thus E(X) = a/(a + b).


Now

(a + b) a+1
2
E(X ) =
x (1 x)b1 dx
(a)(b) 0
and the integral is (a + 2, b) = (a + 2)(b)/(a + b + 2). Thus
E(X 2 ) =

(a + 1)a
.
(a + b + 1)(a + b)

and
Var X = E(X 2 ) [E(X)]2
=

1
ab
[(a + 1)a(a + b) a2 (a + b + 1)] =
.
(a + b)2 (a + b + 1)
(a + b)2 (a + b + 1)

2. P {c T c} = FT (c) FT (c) = FT (c) (1 FT (c)) = 2FT (c) 1 = .95, so


FT (c) = 1.95/2 = .975. From the T table, c = 2.131.
3. W = (X1 /m)/(X2 /n) where X1 = 2 (m) and X2 = 2 (n). Consequently, 1/W =
(X2 /n)/(X1 /m), which is F (n, m).

5
4. Suppose we want P {W c} = .05. Equivalently, P {1/W 1/c} = .05, hence
P {1/W 1/c} = .95. By Problem 3, 1/W is F (n, m), so 1/c can be found from the
F table, and we can then compute c. The analysis is similar for .1, .025 and .01.

5. If N is normal (0,1), then T (n) = N/( 2 (n)/n). Thus T 2 (n) = N 2 /(2 (n)/n). But
N 2 is 2 (1), and the result follows.
6. If Y = 2X then fY (y) = fX (x)|dx/dy| = (1/2)ex = (1/2)ey/2 , y 0, the chi-square
density with two degrees of freedom. If X1 and X2 are independent exponential random
variables, then
X1
(2X1 )/2
2 (2)/2
=
= 2
= F (2, 2).
X2
(2X2 )/2
(2)/2

Lecture 6
1. Apply the formula for the joint density of Yj and Yk with j = 1, k = 3, n = 3, F (x) =
x, f (x) = 1, 0 < x < 1. The result is fY1 Y3 (x, y) = 6(y x), 0 < x < y < 1. Now let
Z = Y3 Y1 , W = Y3 . The Jacobian of the transformation has absolute value 1, so
fZW (z, w) = fY1 Y3 (y1 , y3 ) = 6(y3 y1 ) = 6z, 0 < z < w < 1. Thus
 1
fZ (z) =
6z dw = 6z(1 z), 0 < z < 1.
w=z

2. The probability that more than one random variable falls in [x, x + dx] need not be
negligible. For example, there can be a positive probability that two observations
coincide with x.
3. The density of Yk is
fYk (x0 =

n!
xk1 (1 x)nk ,
(k 1)!(n k)!

0<x<1

which is beta with = k and = n k + 1. (Note that (k) = (k 1)!, (n k + 1) =


(n k)!, (k + n k + 1) = (n + 1) = n!.)
4. We have Yk > p if and only if at most k1 observations are in [0, p]. But the probability
that a particular observation lies in [0, p] is p/1 = p. Thus we have n Bernoulli trials
with probability of success p on a given trial. Explicitly,
k1
n

P {Yk > p} =
pi (1 p)ni .
i
i=0

Lecture 7
1. Let Wn = (Sn E(Sn ))/n; then E(Wn ) = 0 for all n, and
Var Wn =
P

It follows that Wn 0.

n
Var Sn
1 2
nM
M
=
2 =
0.
n2
n2 i=1 i
n
n

6
d

2. All Xi and X have the same distribution (p(1) = p(0) = 1/2), so Xn


0. But if
0 < - < 1 then P {|Xn X| -} = P {Xn = X}, which is 0 for n odd and 1 for n even.
Therefore P {|Xn X| -} oscillates and has no limit as n .
3. By the weak law of large numbers, X n converges in probability to , hence converges
in distribution to . Thus we can take X to have a distribution function F that is
degenerate at , in other words,

0,
F (x) =
1,

x<
x .

4. Let Fn be the distribution function of Xn . For all x, Fn (x) = 0 for suciently large
n. Since the identically zero function cannot be a distribution function, there is no
limiting distribution.

Lecture 8
1. Note that MXn = 1/(1t)n where 1/(1t) is the moment-generating function of an
exponential random variable (which has mean ). By the weak law of large numbers,
P

Xn /n , hence Xn /n
.

n
2. 2 (n) = i=1 Xi2 , where the Xi are iid, each normal (0,1). Thus the central limit
theorem applies.
b
3. We have n Bernoulli trials, with probability of success p = a f (x) dx on a given trial.
Thus Yn is binomial (n, p). If n and p satisfy the sucient condition given in the text,
the normal approximation with E(Yn ) = np and Var Yn = np(1 p) should work well
in practice.
4. We have E(Xi ) = 0 and

Var Xi =

E(Xi2 )

1/2
2

=
1/2

1/2

x2 dx = 1/12.

x dx = 2
0

By the central limit theorem, Yn is approximately normal with E(Yn ) = 0 and Var Yn =
n/12.
5. Let Wn = n(1 F (Yn )). Then
P {Wn w} = P {F (Yn ) 1 (w/n)} = P {max F (Xi ) 1 (w/n)}
hence

w n
P {Wn w} = 1
,
n

0 w n,

which approaches ew as n . Therefore the limiting distribution of Wn is exponential.

Lecture 9
1. (a) We have
f (x1 , . . . , xn ) = x1 ++xn

en
.
x1 ! xn !

With x = x1 + + xn , take logarithms and dierentiate to get

x
(x ln n) = n = 0,

= X.

(b) f (x1 , . . . , xn ) = n (x1 xn )1 , > 0, and


n
n

n
ln xi ) = +
ln xi = 0,
(n ln + ( 1)

i=1
i=1

n
= n

ln xi .

i=1

Note that 0 < xi < 1, so ln xi < 0 for all i and > 0.


n
n
(c) f (x1 , . . . , xn ) = (1/n ) exp[( i=1 xi )/]. With x = i=1 xi we have

x
n
x
(n ln ) = + 2 = 0,

= X

n
n
(d) f (x1 , . . . , xn ) = (1/2)n exp[ i=1 |xi |]. We must minimize i=1 |xi |,
and we must be careful when dierentiating because of the absolute values. If the
order statistics of the xi are yi , i = 1, . . . , n, and yk < < yk+1 , then the sum to be
minimized is
( y1 ) + + ( yk ) + (yk+1 ) + + (yn ).
The derivative of the sum is the number
n of yi s less than minus the number of yi s
greater than . Thus as increases, i=1 |xi | decreases until the number of yi s
less than equals the number of yi s greater than . We conclude that is the median
of the Xi .
n
(e) f (x1 , . . . , xn ) = exp[ i=1 xi ]en if all xi , and 0 elsewhere. Thus
f (x1 , . . . , xn ) = exp[

xi ]en I[ min(x1 , . . . , xn )].

i=1

The indicator I prevents us from dierentiating blindly. As increases, so does en ,


but if > mini xi , the indicator drops to 0. Thus = min(X1 , . . . , Xn ).
2. f (x1 , . . . , xn ) = 1 if (1/2) xi +(1/2) for all i, and 0 elsewhere. If Y1 , . . . , Yn
are the order statistics of the Xi , then f (x1 , . . . , xn ) = I[yn (1/2) y1 +(1/2)],
where y1 = min xi and yn = max xi . Thus any function h(X1 , . . . , Xn ) such that
Yn

1
1
h(X1 , . . . , Xn ) Y1 +
2
2

8
for all X1 , . . . , Xn ) is an MLE of . Some solutions are h = Y1 +(1/2), h = Yn (1/2),
h = (Y1 + Yn )/2, h = (2Y1 + 4Yn 1)/6 and h = (4Y1 + 2Yn + 1)/6. In all cases, the
inequalities reduce to Yn Y1 1, which is true.
3. (a) Xi is Poisson () so E(Xi ) = . The method of moments sets X = , so the
estimate of is = X, which is consistent by the weak law of large numbers.
1
(b) E(Xi ) = 0 x d = /( + 1) = X, = X + X, so
=

X
/( + 1)
P
=

1 [/( + 1)]
1X

hence is consistent.
(c) E(Xi ) = = X, so = X, consistent by the weak law of large numbers.
(d) By symmetry, E(Xi ) = so = X as in (a) and (c).


(e) E(Xi ) = xe(x) dx = (with y = x ) 0 (y + )ey dy = 1 + = X. Thus
= X 1 which converges in probability to (1 + ) 1 = , proving consistency.

r
r
4. P {X r} = 0 (1/)ex/ dx = ex/ 0 = 1 er/ . The MLE of is = X [see
Problem 1(c)], so the MLE of 1 er/ is 1 er/X .
5. The MLE of is X/n, the relative frequency of success. Since
b 


n k
P {a X b} =
(1 )nk ,
k
k=a

the MLE of P {a X b} is found by replacing by X/n in the above summation.

Lecture 10
1. Set 2(b) 1 equal to the desired condence level. This, along with the table of the
normal
(0,1) distribution function, determines b. The length of the condence interval
is 2b/ n.
2. Set 2FT (b) 1 equal to the desired condence level. This, along with the table of the
T (n
1) distribution function, determines b. The length of the condence interval is
2bS/ n 1.
3. In order to compute the expected length of the condence interval, we must compute
E(S), and the key observation is


nS 2
2

S=
=
(n 1).
2
n
n
If f (x) is the chi-square density with r = n 1 degrees of freedom [see (3.8)], then the
expected length is

2b

x1/2 f (x) dx
n1 n 0
and an appropriate change of variable reduces the integral to a gamma function which
can be evaluated explicitly.

9
4. We have E(Xi ) = and Var(Xi ) = 2 . For large n,
X
X

/ n
/ n

is approximately normal (0,1) by the central limit theorem. With c = 1/ n we have


P {b <

X
< b} = (b) (b) = 2(b) 1
c

and if we set this equal to the desired level of condence, then b is determined. The
condence interval is given by (1 bc) < X < (1 + bc), or
X
X
<<
1 + bc
1 bc
where c 0 as n .
5. A condence interval of length L corresponds to |(Yn /n) p| < L/2, an event with
probability


L n/2
2
1.
p(1 p)
Setting this probability equal to the desired condence level gives an inequality of the
form

L n/2

> c.
p(1 p)
As in the text, we can replace p(1p) by its maximum value 1/4. We nd the minimum
value of n by squaring both sides.
In the rst example in (10.1), we have L = .02, L/2 = .01 and c = 1.96. This problem
essentially reproduces the analysis in the text in a more abstract form. Specifying how
close to p we want our estimate to be (at the desired level of condence) is equivalent
to specifying the length of the condence interval.

Lecture 11
1. Proceed as in (11.1):

Z = X Y (1 2 )

divided by

12
2
+ 2
n
m

is normal (0,1), and W = (nS12 /12 )+(mS22 /22 ) is 2 (n+m2). Thus


is T (n + m 2), but the unknown variances cannot be eliminated.

n + m 2Z/ W

10
2. If 12 = c22 , then
1
12
2
1 
+ 2 = c22
+
n
m
n cm
and
nS12
mS 2
nS12 + cmS22
+ 22 =
.
2
1
2
c22
Thus 22 can again be eliminated, and condence intervals can be constructed, assuming
c known.

Lecture 12
1. The given test is an LRT and is completely determined by c, independent of > 0 .
2. The likelihood ratio is L(x) = f1 (x)/f0 (x) = (1/4)/(1/6) = 3/2 for x = 1, 2, and
L(x) = (1/8)/(1/6) = 3/4 for x = 3, 4, 5, 6. If 0 < 3/4, we reject for all x, and
= 1, = 0. If 3/4 < < 3/2, we reject for x = 1, 2 and accept for x = 3, 4, 5, 6, with
= 1/3 and = 1/2. If 3/2 < , we accept for all x, with = 0, = 1.
For = .1, set = 3/2, accept when x = 3, 4, 5, 6, reject with probability a when
x = 1, 2. Then = (1/3)a = .1, a = .3 and = (1/2) + (1/2)(1 a) = .85.
3. Since (220-200)/10=2, it follows that when c reaches 2, the null hypothesis is accepted.
The associated type 1 error probability is = 1 (2) = 1 .977 = .023. Thus the
given result is signicant even at the signicance level .023. If we were to take additional
observations, enough to drive the probability of a type 1 error down to .023, we would
still reject H0 . Thus the p-value is a concise way of conveying a lot of information
about the test.

Lecture 13
1. We sum (Xi npi )2 /npi , i = 1, 2, 3, where the Xi are the observed frequencies and the
npi = 50, 30, 20 are the expected frequencies. The chi-square statistic is
(40 50)2
(33 30)2
(27 20)2
+
+
= 2 + .3 + 2.45 = 4.75
50
30
20
Since P {2 (2) > 5.99} = .05 and 4.75 < 5.99, we accept H0 .
2. The expected frequencies are given by
1
2

A
49
51

B
147
153

C
98
102

For example, to nd the entry in the 2C position, we can multiply the row 2 sum by
the column 3 sum and divide by the total number of observations (namely 600) to get

11
(306)(200)/600=102. Alternatively, we can compute P (C) = (98 + 102)/600 = 1/3.
We multiply this by the row 2 sum 306 to get 306/3=102. The chi-square statistic is
(33 49)2
(147 147)2
(114 98)2
(67 51)2
(153 153)2
(86 102)2
+
+
+
+
+
49
147
98
51
153
102
which is 5.224+0+2.612+5.02+0+2.510 = 15.366. There are (h1)(k1) = 12 = 2
degrees of freedom, and P {2 (2) > 5.99} = .05. Since 15.366 > 5.94, we reject H0 .
3. The observed frequencies minus the expected frequencies are
a

(a + b)(a + c)
ad bc
=
,
a+b+c+d
a+b+c+d

(a + b)(b + d)
bc ad
=
,
a+b+c+d
a+b+c+d

(a + c)(c + d)
bc ad
=
,
a+b+c+d
a+b+c+d

(c + d)(b + d)
ad bc
=
.
a+b+c+d
a+b+c+d

The chi-square statistic is




(ad bc)2
1

a + b + c + d (a + b)(c + d)(a + c)(b + d)


[(c + d)(b + d) + (a + c)(c + d) + (a + b)(b + d) + (a + b)(a + c)]
and the expression in small brackets simplies to (a + b + c + d)2 , and the result follows.

Lecture 14
1. The joint probability function is
f (x1 , . . . , xn ) =

xi
n

e

i=1

xi !

en u(x)
.
x1 ! xn !

Take g(, u(x)) = en u(x) and h(x) = 1/(x1 ! xn !).


2. f (x1 , . . . , xn ) = [A()]n B(x1 ) B(xn ) if 0 < xi < for all i, and 0 elsewhere. This
can be written as
[A()]n

n

i=1

where
I is an indicator.
n
i=1 B(xi ).



B(xi )I max xi <
1in

We take g(, u(x)) = An ()I[max xi < ] and h(x) =

3. f (x1 , . . . , xn ) = n (1 )u(x) , and the factorization theorem applies with h(x) = 1.


n
4. f (x1 , . . . , xn ) = n exp[( i=1 xi )/], and the factorization theorem applies with
h(x) = 1.

12
5. f (x) = (a + b)/[(a)(b)]xa1 (1 x)b1 on (0,1). In this case, a = and b = 2.
Thus f (x) = ( + 1)x1 (1 x), so
n n

f (x1 , . . . , xn ) = ( + 1)

n


xi

n
1 

i=1

(1 xi )

i=1

and the factorization theorem applies with


g(, u(x)) = ( + 1)n n u(x)1
and h(x) =

n

i=1 (1 xi ).

1 x/

6. f (x) = (1/[() ])x


is

, x > 0, with = and arbitrary. The joint density

n

1
1
f (x1 , . . . , xn ) =
u(x)
exp[
xi /]
[()]n n
i=1

and the factorization theorem applies with h(x) = exp[


to the remaining factors.

xi /] and g(, u(x)) equal

7. We have
P {X1 = x1 , . . . , Xn = xn } = P {Y = y}P {X1 = x1 , . . . , Xn = xn |Y = y}
We can drop the subscript since Y is sucient, and we can replace Xi by Xi by
denition of Bs experiment. The result is
P {X1 = x1 , . . . , Xn = xn } = P {X1 = x1 , . . . , Xn = xn }
as desired.

Lecture 17
1. Take u(X) = X.
2. The joint density is


f (x1 , . . . , xn ) = exp


(xi ) I[min xi > ]

i=1

so Y1 is sucient. Now if y > , then


P {Y1 > y} = (P {X1 > y})n =

n

exp[(x )] dx

= exp[n(y )],

so
FY1 (y) = 1 en(y) ,

fY1 (y) = nen(y) ,

y > .

13
The expectation of g(Y1 ) under is

E [g(Y1 )] =

g(y)n exp[n(y )] dy.

If this is 0 for all , divide by en to get



g(y)n exp(ny) dy = 0.

Dierentiating with respect to , we have g()n exp(n) = 0, so g() = 0 for all ,


proving completeness. The expectation of Y1 under is



yn exp[n(y )] dy =
(y )n exp[n(y )] dy +
n exp[n(y )] dy

zn exp(nz) dz + =
0

1
+ .
n

Thus E [Y1 (1/n)] = , so Y1 (1/n) is a UMVUE of .


3. 
Since f (x) = exp[( 1) ln x], the density belongs to the exponential


nclass. Thus
n
ln
X
is
a
complete
sucient
statistic,
hence
so
is
exp
(1/n)
=
i
i=1
i=1 ln Xi
u(X1 , . . . , Xn ). The key observation is that if Y is sucient and g is one-to-one,
then g(Y )a is also sucient, since g(Y ) conveys exactly the same information as Y
does; similarly for completeness.
To compute the
likelihood estimate, note that the joint density is f (x1 , . . . , xn ) =
maximum
n
n exp[( 1) i=1 ln xi ]. Take logarithms, dierentiate with respect to , and set the
n
result equal to 0. We get = n/ i=1 ln Xi , which is a function of u(X1 , . . . , Xn ).
4. Each Xi is gamma with = 2, = 1/, so (see Lecture 3) Y is gamma (2n, 1/). Thus

E (1/Y ) =

(1/y)
0

1
y 2n1 ey dy
(2n)(1/)2n

which becomes, under the change of variable z = y,


2n
(2n)


0

z 2n2 z dz
2n (2n 1)

= 2n1
=
.
e
2n2

(2n)
2n 1

Therefore E [(2n 1)/Y ] = , and (2n 1)/Y is the UMVUE of .


5. We have E(Y2 ) = [E(X1 ) + E(X2 )]/2 = , hence E[E(Y2 |Y1 )] = E(Y2 ) = . By
completeness, E(Y2 |Y1 ) must be Y1 /n.

6. Since Xi / is normal (0,1), Y / is 2 (n), which has mean n and variance 2n. Thus
E[(Y /)2 ] = n2 +2n, so E(Y 2 ) = 2 (n2 +2n).Therefore the UMVUE of 2 ) is Y 2 /(n2 +
2n).

14
7. (a) E[E(I|Y )] = E(I) = P {X1 1}, and the result follows by completeness.
(b) We compute
P {X1 = r|X1 + + Xn = s} =

P {X1 = r, X2 + + Xn = s r}
.
P {X1 + + Xn } = s

The numerator is
e r (n1) [(n 1)]sr
e
r!
(s r)!
and the denominator is
en (n)s
s!
so the conditional probability is





sr 
r
s (n 1)sr
1
s
n1
=
s
r
r
n
n
n
which is the probability of r successes in s Bernoulli trials, with probability of success
1/n on a given trial. Intuitively, if the sum is s, then each contribution to the sum is
equally likely to come from X1 , . . . , Xn .
(c) By (b), P {X1 = 0|Y } + P {X1 = 1|Y } is given by

1

1
n

Y
+Y

Y 1 

Y 

1
Y /n
n1
1
1+
=
1
n
n
n
(n 1)/n

=

n1
n

Y 
1+


Y
.
n1

This formula also works for Y = 0 because it evaluates to 1.


8. The joint density is


n

(xi 1 ) 
1
f (x1 , . . . , xn ) = n exp
I min Xi > 1 .
i
2

2
i=1
Since
n

(xi 1 )
i=1

n
1
xi n1 ,
2 i=1

the result follows from the factorization theorem.

15

Lecture 18
1. By (18.4), the numerator of (x) is


 1
n x
(1 )nx d
r1 (1 )s1
x
0
and the denominator is


r1

(1 )

s1

n x
(1 )nx d.
x

Thus (x) is
(r + x + 1, n x + s)
(r + x + 1) (r + s + n)
r+x
=
=
.
(r + x, n x + s)
(r + x) (r + s + n + 1)
r+s+n
2. The risk function is


2
r+X
1
E
=
E [(X n + r r s)2 ]

r+s+n
(r + s + n)2
with E (X n) = 0, E [(X n)2 = Var X = n(1 ). Thus
R () =

1
[n(1 ) + (r r s)2 ].
(r + s + n)2

The quantity in brackets is


n n2 + r2 + r2 2 + s2 2 2r2 2rs + 2rs2
which simplies to
((r + s)2 n)2 + (n 2r(r + s)) + r2
and the result follows.

3. If r = s = n/2, then (r + s)2 n = 0 and n 2r(r + s) = 0, so


R () =

r2
.
(r + s + n)2


4. The average loss using is B() = h()R () d. If (x) has a smaller maximum
risk than (x), then since R is constant, we have R () < R () for all . Therefore
B() < B(), contradicting the fact that is a Bayes estimate.

Lecture 20
1.
Var(XY ) = E[(XY )2 ] (EXEY )2 = E(X 2 )E(Y 2 ) (EX)2 (EY )2
2
2 2
2
(X
+ 2X )(Y2 + 2Y ) 2X 2Y = X
Y + 2X Y2 + 2Y X
.

16
2.
Var(aX + bY ) = Var(aX) + Var(bY ) + 2ab Cov(X, Y )
2
= a2 X
+ b2 Y2 + 2abX Y .

3.
2
Cov(X, X + Y ) = Cov(X, X) + Cov(X, Y ) = Var X + 0 = X
.

4. By Problem 3,
X,X+Y =

2
X
X
= 2
.
X X+Y
X + Y2

5.
Cov(XY, X) = E(X 2 )E(Y ) E(X)2 E(Y )
2
2
= (X
+ 2X )Y 2X Y = X
Y .

6. We can assume without loss of generality that E(X 2 ) > 0 and E(Y 2 ) > 0. We will
have equality i the discriminant b2 4ac = 0, which holds i h() = 0 for some .
Equivalently, X + Y = 0 for some . We conclude that equality holds if and only if
X and Y are linearly dependent.

Lecture 21

 n
2
1. Let Yi = Xi E(Xi ); then E[
i=1 ti Yi ] 0 for all t. But this expectation is



E[
ti Yi
tj Yj ] =
ti ij tj = t Kt
i

i,j

where ij = Cov(Xi , Xj ). By denition of covariance, K is symmetric, and K is


always nonnegative denite because t Kt 0 for all t. Thus all eigenvalues i of K
are nonnegative. But K = LDL , so det K = det D = 1 n . If K is nonsingular
then all i > 0 and K is positive denite.
2. We have X = CZ + where C is nonsingular and the Zi are independent normal
random variables with zero mean. Then Y = AX = ACZ + A, which is Gaussian.
3. The moment-generating function of (X1 , . . . , Xm ) is the moment-generating function
of (X1 , . . . , Xn ) with tm+1 = = tn = 0. We recognize the latter moment-generating
function as Gaussian; see (21.1).
n
4. Let Y = i=1 ci Xi ; then
tY

E(e ) = E exp

n

i=1

ci tXi



= MX (c1 t, . . . , cn t)

17
n
n


1 2

= exp t
ci i exp t
ci aij cj
2
i=1
i,j=1

which is the moment-generating function of a normally distributed random variable.


Another method: Let W = c1 X1 + + cn Xn = c X = c (AY + ), where the Yi are
independent normal random variables with zero mean. Thus W = b Y + c where
b = c A. But b Y is a linear combination of independent normal random variables,
hence is normal.

Lecture 22
1. If y is the best estimate of Y given X = x, then
y Y =

Y
(x X )
X

and [see (20.1)] the minimum mean square error is Y2 (1 2 ), which in this case is 28.
We are given that Y /X = 3, so Y = 3 2 = 6 and 2 = 36/Y2 . Therefore
Y2 (1

36
) = Y2 36 = 28,
Y2

Y = 8,

2 =

36
,
64

= .75.

Finally, y = Y + 3x 3X = Y + 3x + 3 = 3x + 7, so Y = 4.
2. The bivariate normal density is of the form
f (x, y) = a()b(x, y) exp[p1 ()x2 + p2 ()y 2 + p3 ()xy + p4 ()x + p5 ()y]
so we are in the exponential class. Thus
 2 2
Xi ,
Yi ,
Xi Yi ,

Xi ,

Yi

2
is a complete sucient statistic for = (X
, Y2 , , X , Y ). Note also that any statistic
in one-to-one correspondence with this one is also complete and sucient.

Lecture 23
1. The probability of any event is found by integrating the density on the set dened by
the event. Thus

P {a f (X) b} =
f (x) dx, A = {x : a f (x) b}.
A

2. Bernoulli: f (x) = x (1 )1x , x = 0, 1

x 1x

ln f (x) =
[x ln + (1 x) ln(1 )] =

18
2
x
1x
ln f (x) = 2
2

(1 )2

I() = E

X
1X  1
1
1
+
= +
=
2

(1 )2
1
(1 )

since E (X) = . Now


Var Y

1
(1 )
=
.
nI()
n

But
Var X =

1
n(1 )
(1 )
Var[binomial(n, )] =
=
n2
n2
n

so X is a UMVUE of .
Normal:
f (x) =

1
exp[(x )2 /2 2 ]
2

 (x )2  x
=
ln f (x) =

2 2
2
1
2
ln f (x) = 2 ,
2

I() =

1
,
2

Var Y

2
n

But Var X = 2 /n, so X is a UMVUE of .


Poisson: f (x) = e x /x!, x = 0, 1, 2 . . .

ln f (x) =
( + x ln ) = 1 +

2
x
ln f (x) = 2 ,
2

Var Y
so X is a UMVUE of .

I() = E

X 
1

= 2 =
2

= Var X
n

19

Lecture 25
1.
K(p) =

c 


n
k=0

pk (1 p)nk

with c = 2 and p = 1/2 under H0 . Therefore






12
12
12
79
=
+
+
(1/2)n =
= .019.
0
1
2
4096
2. The deviations, with ranked absolute values in parentheses, are
16.9(14), -1.7(5), -7.9(9), -1.2(4), 12.4(12), 9.8(10), -.3(2), 2.7(6), -3.4(7), 14.5(13),
24.4(16), 5.2(8), -12.2(11), 17.8(15), .1(1), .5(3)
The Wilcoxon statistic is W =1-2+3-4-5+6-7+8-9+10-11+12+13+14+15+16=60
Under H0 , E(W ) = 0 and Var W = n(n + 1)(2n + 1)/6 = 1496,

W = 38.678

Now W/38.678 is approximately normal (0,1) and P {W c} = P {W/38.678


c/38.678} = .05. From a normal table, c/38.678 = 1.645, c = 63.626. Since
60 < 63.626, we accept H0 .
3. The moment-generating function of Vj is
MVj (t) = (1/2)(ejt +ejt ) and the momentn
generating function of W is MW (t) = j=1 MVj (t). When n = 1, W = 1 with
equal probability. When n = 2,
MW (t) =

1 t
1
1
(e + et ) (e2t + e2t ) = (e3t + et + et + e3t )
2
2
4

so W takes on the values 3, 1, 1, 3 with equal probability. When n = 3,


MW (t) =

1 3t
1
+ et + et + e3t ) (e3t + e3t )
(e
4
2

1 6t
+ e4t + e2t + 1 + 1 + e2t + e4t + e6t ).
(e
8

Therefore P{W = k} = 1/8 for k = −6, −4, −2, 2, 4, 6, P{W = 0} = 1/4, and
P{W = k} = 0 for other values of k.

Index
Bayes estimate, 18.1
Bernoulli trials, 3.4, 7.3
beta distribution, 5.4
bivariate normal distribution, 22.1
Cauchy density, Lecture 2, Problem 5
Cauchy-Schwarz inequality, 20.3
central limit theorem, 8.2
Chebyshevs inequality, 7.1
chi square distribution, 3.8, 4.2
chi square tests, 13.1
complete sucient statistic, 16.1, 17.1
condence intervals, Lectures 10, 11, 24.3
consistent estimate, 7.6, 9.3
convergence in distribution, 7.4, 8.3
convergence in probability, 7.4, 8.3
convolution, 3.11
correlation coecient, 20.1
covariance, 20.1
covariance matrix, 21.4, 22.1
Cramer-Rao inequality, 23.4
critical region, 12.3
density function method, Lecture 1
distribution function method, Lecture 1
eigenvalues and eigenvectors, 19.1
equality of distributions, 13.3
estimation, 9.1
exponential class (exponential family), 16.3, 17.1
exponential distribution, 3.8
F distribution, 5.3
factorization theorem, 14.3
Fisher information, 23.6
gamma distribution, 3.7
goodness of t, 13.2
hypothesis testing, 12.1, 24.4
inner product (dot product), 19.1
Jacobian, 2.1
jointly Gaussian random variables, 21.1
least squares, 20.5
Lehmann-Schee theorem, 16.2
Liapounov condition, 25.2
likelihood ratio tests, 12.2
limiting distribution, Lecture 7, Problems 3,4
maximum likelihood estimate, 9.2
method of moments, 9.6

2
moment-generating functions, 3.1, 21.1
multivariate normal distribution, 21.1
negative binomial distribution, 16.5
Neyman-Pearson lemma, 12.6
nonnegative denite, 19.5
nonparametric statistics, Lectures 24, 25
normal approximation to the binomial, 8.4
normal distribution, 3.4,
normal sampling, 4.2
order statistics, 6.1
orthogonal decomposition, 19.4
p-value, Lecture 12, Problem 3
percentiles, 24.1
point estimates, 9.1, 24.2
Poisson distribution, 3.4
Poisson process, 3.12
positive denite, 19.5
power function, 12.3
quadratic form, 19.5
quadratic loss function, 18.2
Rao-Blackwell theorem, 15.7
regression line, 22.2
sample mean, 4.1
sample variance, 4.1
sampling without replacement, 10.3
sign test, 24.4
signicance level, 12.3
simulation, 8.5
sucient statistics, 14.1
symmetric matrices, Lecture 19
T distribution, 4.2, 5.1
testing for independence, 13.4
transformation of random variables, Lecture 1
type 1 and type 2 errors, 12.1
unbiased estimate, 4.1, 17.3
uniformly minimum variance unbiased estimate (UMVUE), 16.2
weak law of large numbers, 7.2
Wilcoxon test, 25.1

Errata
There are some minor typos in the following locations.
Section 2.1, line 6 (a section heading counts as line 0) Change d to
Section 2.2, line -6 Capitalize j
Section 3.6, line -2 Change ti to t
Section 4.3, line -9 Change us to use
Section 4.3, line -11 Change w to We
Section 5.2, line 5 Insert y on the right side of the equation
Section 5.2, line 6 Insert y before dy
Section 5.4, line 2 Change to 1
Section 6.1, very end of the third display. Change 4 to r
Section 7.5, line 6 Delete the asterisk
Section 7.6, remove from two of the gures
Section 8.3, line 1 Change X to c, change then Xn converges to then Xn also converges
Section 8.4, line -3 Close up the space
Section 8.5, line 1 change an to can
Section 8.5, Figure 8.1 Remove
Section 9.3, line 5 ln should be roman, not italic
Section 10.3, line 3 Add a space after the comma
Section 10.3, line 10 In the rst summation, change Xi to Var Xi
Section 12.1, line -6 Change mall to small
Section 12.2, line -6 Change are to rare
Section 12.7, line -1 Add right parenthesis after H1
Section 13.2, line -3 Change reduced to reduce and change degrees to degrees of freedom
Section 16.1, line 2 unbiased should only appear once
Section 16.2, line -2 unbiased has only one s
Section 16.4, Example 6, displayed equation beginning with P {Yr = k}, change x to k
Section 17.2, line 6 Change N to M
Section 21.2, line 3 Change i to i
Section 23.3, line 4 Put brackets around g(X)
Section 23.6, line -1 Put right parenthesis after estimate
Solution to Lecture 6, Problem 3, second line, change the 0 before the equals sign to a
right parenthesis
