Lectures On Statistics
Robert B. Ash
Preface
These notes are based on a course that I gave at UIUC in 1996 and again in 1997. No
prior knowledge of statistics is assumed. A standard rst course in probability is a prerequisite, but the rst 8 lectures review results from basic probability that are important
in statistics. Some exposure to matrix algebra is needed to cope with the multivariate
normal distribution in Lecture 21, and there is a linear algebra review in Lecture 19. Here
are the lecture titles:
1. Transformation of Random Variables
2. Jacobians
3. Moment-generating functions
4. Sampling from a normal population
5. The T and F distributions
6. Order statistics
7. The weak law of large numbers
8. The central limit theorem
9. Estimation
10. Condence intervals
11. More condence intervals
12. Hypothesis testing
13. Chi square tests
14. Sucient statistics
15. Rao-Blackwell theorem
16. Lehmann-Schee theorem
17. Complete sucient statistics for the exponential class
18. Bayes estimates
19. Linear algebra review
20. Correlation
21. The multivariate normal distribution
22. The bivariate normal distribution
23. Cramer-Rao inequality
24. Nonparametric statistics
25. The Wilcoxon test
c
copyright
2007 by Robert B. Ash. Paper or electronic copies for personal use may be
made freely without explicit permission of the author. All other rights are reserved.
1/2
Y = X
f (x)
X
1 -x
- e
2
-1
-S
x-axis
Sq
qr
rt
t[
[y
y]
Figure 1.1
The distribution function method nds FY directly, and then fY by dierentiation.
1
y+
2
1 x
1
1
y + (1 e y ).
e dx =
2
2
2
1/2
-1
]
[y
t
qr
-S
f (x)
X
Sqr
t[y
]
Figure 1.2
Case 2. y > 1 (Figure 1.3). Then
1
FY (y) = +
2
1 x
1 1
e dx = + (1 e y ).
2
2 2
x-axis
f (x)
X
1/2
'
]
[y
t
r
q
-S
-1
Sqr
t[y
]
x-axis
Figure 1.3
1
fY (y) = (1 + e y ),
4 y
1
fY (y) = e y ,
4 y
0 < y < 1;
y > 1.
See Figure 1.4 for a sketch of fY and FY . (You can take fY (y) to be anything you like at
y = 1 because {Y = 1} has probability zero.)
f (y)
Y
1 1
- y )
2 + 2 (1 - e
F (y)
Y
1
2
'
1
'1
y + 12 (1 - e- y )
y
Figure 1.4
The density function method nds fY directly, and then FY by integration; see
Figure 1.5. We have fY (y)|dy| = fX ( y)dx + fX ( y)dx; we write |dy| because probabilities are never negative. Thus
fX ( y)
fX ( y)
fY (y) =
+
|dy/dx|x=y
|dy/dx|x=y
with y = x2 ,
dy/dx = 2x, so
fX ( y) fX ( y)
+
.
2 y
2 y
fY (y) =
(1/2)e
fY (y) =
2 y
1/2
1
+ =
y(1 + e y ).
2 y
4
3
Case 2. y > 1 (see Figure 1.3).
fY (y) =
(1/2)e
2 y
1
+ 0 = e y
4 y
as before.
Y
y
X
- y
Figure 1.5
The distribution function method generalizes to situations where we have a single output but more than one input. For example, let X and Y be independent, each uniformly
distributed on [0, 1]. The distribution function of Z = X + Y is
FZ (z) = P {X + Y z} =
fXY (x, y) dx dy
x+yz
with fXY (x, y) = fX (x)fY (y) by independence. Now FZ (z) = 0 for z < 0 and FZ (z) = 1
for z > 2 (because 0 Z 2).
Case 1. If 0 z 1, then FZ (z) is the shaded area in Figure 1.6, which is z 2 /2.
Case 2. If 1 z 2, then FZ (z) is the shaded area in FIgure 1.7, which is 1 [(2 z)2 /2].
Thus (see Figure 1.8)
0z1
z,
fZ (z) = 2 z, 1 z 2 .
0
elsewhere
Problems
1. Let X, Y, Z be independent, identically distributed (from now on, abbreviated iid)
random variables, each with density f (x) = 6x5 for 0 x 1, and 0 elsewhere. Find
the distribution and density functions of the maximum of X, Y and Z.
2. Let X and Y be independent, each with density ex , x 0. Find the distribution (from
now on, an abbreviation for Find the distribution or density function) of Z = Y /X.
3. A discrete random variable X takes values x1 , . . . , xn , each with probability 1/n. Let
Y = g(X) where g is an arbitrary real-valued function. Express the probability function
of Y (pY (y) = P {Y = y}) in terms of g and the xi .
y
1
2-z
2-z
z
z
x
x+y = z
1 z 2
f (z)
Z
1-
'1
Figure 1.8
4. A random variable X has density f (x) = ax2 on the interval [0, b]. Find the density
of Y = X 3 .
5. The Cauchy density is given by f (y) = 1/[(1 + y 2 )] for all real y. Show that one way
to produce this density is to take the tangent of a random variable X that is uniformly
distributed between /2 and /2.
Lecture 2. Jacobians
We need this idea to generalize the density function method to problems where there are
k inputs and k outputs, with k 2. However, if there are k inputs and j < k outputs,
often extra outputs can be introduced, as we will see later in the lecture.
y
x
u
du
R
dv
du
du
Figure 2.1
S
B
A
Figure 2.2
written as
(x, y)
.
(u, v)
6
Thus |A B| = |J| du dv. Now P {(X, Y ) S} = P {(U, V ) R}, in other words,
fXY (x, y) times the area of S is fU V (u, v) times the area of R. Thus
fXY (x, y)|J| du dv = fU V (u, v) du dv
and
(x, y)
.
fU V (u, v) = fXY (x, y)
(u, v)
The absolute value of the Jacobian (x, y)/(u, v) gives a magnication factor for area
in going from u v coordinates to x y coordinates. The magnication factor going the
other way is |(u, v)/(x, y)|. But the magnication factor from u v to u v is 1, so
fXY (x, y)
.
fU V (u, v) =
(u, v)/(x, y)
In this formula, we must substitute x = x(u, v), y = y(u, v) to express the nal result in
terms of u and v.
In three dimensions, a small rectangular box with volume du dv dw corresponds to a
parallelepiped in xyz space, determined by vectors
x y z
x y z
x y
z
A=
du, B =
dv, C =
dw.
u u u
v v v
w w w
The volume of the parallelepiped is the absolute value of the dot product of A with B C,
and the dot product can be written as a determinant with rows (or columns) A, B, C. This
determinant is the Jacobian of x, y, z with respect to u, v, w [written (x, y, z)/(u, v, w)],
times du dv dw. The volume magnication from uvw to xyz space is |(x, y, z)/(u, v, w)|
and we have
fU V W (u, v, w) =
fXY Z (x, y, z)
|(u, v, w)/(x, y, z)|
fX1 Xn (x1 , . . . , xn )
|(y1 , . . . , yn )/(x1 , . . . , xn )|
where
y1
x
1
(y1 , . . . , yn )
=
(x1 , . . . , xn ) y
n
x1
..
.
y1
xn
.
yn
xn
W =Y
fXY (x, y)
|(z, w)/(x, y)|x=z/w,y=w
(z, w) z/x
=
w/x
(x, y)
z/y y
=
w/y 0
x
= y.
1
Thus
fZW (z, w) =
fX (x)fY (y)
fX (z/w)fY (w)
=
w
w
and we are left with the problem of nding the marginal density from a joint density:
1
fZ (z) =
fX (z/w)fY (w) dw.
fZW (z, w) dw =
w
Problems
1. The joint density of two random variables X1 and X2 is f (x1 , x2 ) = 2ex1 ex2 ,
where 0 < x1 < x2 < ; f (x1 , x2 ) = 0 elsewhere. Consider the transformation
Y1 = 2X1 , Y2 = X2 X1 . Find the joint density of Y1 and Y2 , and conclude that Y1
and Y2 are independent.
2. Repeat Problem 1 with the following new data. The joint density is given by f (x1 , x2 ) =
8x1 x2 , 0 < x1 < x2 < 1; f (x1 , x2 ) = 0 elsewhere; Y1 = X1 /X2 , Y2 = X2 .
3. Repeat Problem 1 with the following new data. We now have three iid random variables
Xi , i = 1, 2, 3, each with density ex , x > 0. The transformation equations are given
by Y1 = X1 /(X1 + X2 ), Y2 = (X1 + X2 )/(X1 + X2 + X3 ), Y3 = X1 + X2 + X3 . As
before, nd the joint density of the Yi and show that Y1 , Y2 and Y3 are independent.
fXY (x, y)
g(x)h(y)
=
fX (x)
g(x) h(y) dy
X3 =
8
which does not depend on x. The set of points where g(x) = 0 (equivalently fX (x) = 0)
can be ignored because it has probability zero. It is important to realize that in this
argument, for all x, y means that x and y must be allowed to vary independently of each
other, so the set of possible x and y must be of the rectangular form a < x < b, c < y < d.
(The constants a, b, c, d can be innite.) For example, if fXY (x, y) = 2ex ey , 0 < y < x,
and 0 elsewhere, then X and Y are not independent. Knowing x forces 0 < y < x, so the
conditional distribution of Y given X = x certainly depends on x. Note that fXY (x, y)
is not a function of x alone times a function of y alone. We have
fXY (x, y) = 2ex ey I[0 < y < x]
where the indicator I is 1 for 0 < y < x and 0 elsewhere.
In Jacobian problems, pay close attention to the range of the variables. For example, in
Problem 1 we have y1 = 2x1 , y2 = x2 x1 , so x1 = y1 /2, x2 = (y1 /2) + y2 . From these
equations it follows that 0 < x1 < x2 < is equivalent to y1 > 0, y2 > 0.
2 t2
n tn
+ +
+ .
2!
n!
Since the coecient of tn in the Taylor expansion is M (n) (0)/n!, where M (n) is the n-th
derivative of M , we have n = M (n) (0).
n
n
If Y = i=1 Xi where X1 , . . . , Xn are independent, then MY (t) = i=1 MXi (t).
Proof. First note that if X and Y are independent, then
E[g(X)h(Y )] =
g(x)h(y)fXY (x, y) dx dy.
and similarly for more than two random variables. Now if Y = X1 + + Xn with the
Xi s independent, we have
MY (t) = E[etY ] = E[etX1 etXn ] = E[etX1 ] E[etXn ] = MX1 (t) MXn (t).
function and the density of a random variable are related by M (t) = etx f (x) dx.
With t replaced by s we have a Laplace transform, and with t replaced by it we have a
Fourier transform. The strategy works because at step 3, the moment-generating function
determines the density uniquely. (This is a theorem from Laplace or Fourier transform
theory.)
10
3.4 Examples
1. Bernoulli Trials. Let X be the number of successes in n trials with probability of
success p on a given trial. Then X = X1 + + Xn , where Xi = 1 if there is a success on
trial i and Xi = 0 if there is a failure on trial i. Thus
Mi (t) = E[etXi ] = P {Xi = 1}et1 + P {Xi = 0}et0 = pet + q
with p + q = 1. The moment-generating function of X is
n
n k nk tk
MX (t) = (pet + q)n =
p q
e .
k
k=0
P {X = k}etk =
k=0
k=0
e k
k=0
k!
etk = e
n
n
(et )k
k=0
k!
k = 0, 1, 2, . . . . Thus
= exp() exp(et ) = exp[(et 1)].
We can compute the mean and variance from the moment-generating function:
E(X) = M (0) = [exp((et 1))et ]t=0 = .
Let h(, t) = exp[(et 1)]. Then
E(X 2 ) = M (0) = [h(, t)et + et h(, t)et ]t=0 = + 2
hence
Var X = E(X 2 ) [E(X)]2 = + 2 2 = .
3. Normal(0,1). The moment-generating function is
2
1
M (t) = E[etX ] =
etx ex /2 dx
2
The integral is the area under a normal density (mean t, variance 1), which is 1. Consequently,
2
M (t) = et
/2
11
4. Normal (, 2 ). If X is normal(, 2 ), then Y = (X )/ is normal(0,1). This is a
good application of the density function method from Lecture 1:
fY (y) =
2
fX (x)
1
=
ey /2 .
|dy/dx|x=+y
2
We have X = + Y , so
MX (t) = E[etX ] = et E[etY ] = et MY (t).
Thus
2
MX (t) = et et
2 /2
Remember this technique, which is especially useful when Y = aX + b and the momentgenerating function of X is known.
3.5 Theorem
If X is normal(, 2 ) and Y = aX + b, then Y is normal(a + b, a2 2 ).
Proof. We compute
2 2
t 2 /2
Thus
MY (t) = exp[t(a + b)] exp(t2 a2 2 /2).
Here is another basic result.
3.6 Theorem
2
Let X1 , . . . , Xn
be independent, with Xi normal
i , i ). Then Y =
n (
n
2
2
with mean = i=1 i and variance = i=1 i .
n
i=1
Xi is normal
n
i=1
y 1 ey dy,
12
(c) (1/2) =
To prove (a), integrate by parts: () = 0 ey d(y /). Part (b) is a special case of (a).
For (c) we make the change of variable y = z 2 /2 and compute
2
(1/2) =
y 1/2 ey dy =
2z 1 ez /2 z dz.
0
The
is 2 times half the area under the normal(0,1) density, that is,
second integral
2 (1/2) = .
The gamma density is
f (x) =
1
x1 ex/
()
[() ]1
y
t + (1/)
ey
dy
t + (1/)
which reduces to
1
1 t
= (1 t) .
In this argument, t must be less than 1/ so that the integrals will be nite.
Since M (0) = f (x) dx = 0 f (x) dx in this case, with f 0, M (0) = 1 implies that
we have a legal probability density. As before, moments can be calculated eciently from
the moment-generating function:
E(X) = M (0) = (1 t)1 ()|t=0 = ;
E(X 2 ) = M (0) = ( 1)(1 t)2 ()2 |t=0 = ( + 1) 2 .
Thus
Var X = E(X 2 ) [E(X)]2 = 2 .
13
A random variable X has the chi square density with r degrees of freedom (X = 2 (r)
for short, where r is a positive integer) if its density is gamma with = r/2 and = 2.
Thus
f (x) =
1
x(r/2)1 ex/2 ,
(r/2)2r/2
x0
and
M (t) =
Therefore E[2 (r)] = = r,
1
,
(1 2t)r/2
t < 1/2.
3.9 Lemma
If X is normal(0,1) then X 2 is 2 (1).
Proof. We compute the moment-generating function of X 2 directly:
2
2
2
1
MX 2 (t) = E[etX ] =
etx ex /2 dx.
2
2t
2
which is 2 (1).
3.10 Theorem
n
If X1 , . . . , Xn are independent, each normal (0,1), then Y = i=1 Xi2 is 2 (n).
Proof. By (3.9), each Xi2 is 2 (1) with moment-generating function (1 2t)1/2 . Thus
MY (t) = (1 2t)n/2 for t < 1/2, which is 2 (n).
To see this intuitively, reason as follows. The probability that Z lies near z (between z
and z + dz) is fZ (z) dz. Let us compute this in terms of X and Y . The probability that
X lies near x is fX (x) dx. Given that X lies near x, Z will lie near z if and only if Y lies
near z x, in other words, z x Y z x + dz. By independence of X and Y , this
probability is fY (z x) dz. Thus fZ (z) is a sum of terms of the form fX (x) dx fY (z x) dz.
Cancel the dzs and replace the sum by an integral to get the result. A formal proof can
be given using Jacobians.
14
1
k tk1 et , t 0.
(k 1)!
Problems
1. Let X1 and X2 be independent, and assume that X1 is 2 (r1 ) and Y = X1 + X2 is
2 (r), where r > r1 . Show that X2 is 2 (r2 ), where r2 = r r1 .
2. Let X1 and X2 be independent, with Xi gamma with parameters i and i , i = 1, 2.
If c1 and c2 are positive constants, nd convenient sucient conditions under which
c1 X1 + c2 X2 will also have the gamma distribution.
3. If X1 , . . . , Xn are independent random variables with moment-generating functions
M1 , . . . , Mn , and c1 , . . . , cn are constants, express the moment-generating function M
of c1 X1 + + cn Xn in terms of the Mi .
15
4. If X1
, . . . , Xn are independent, with Xi Poisson(i ), i = 1, . . . ,
n, show that the sum
n
n
Y = i=1 Xi has the Poisson distribution with parameter = i=1 i .
5. An unbiased coin is tossed independently n1 times and then again tossed independently
n2 times. Let X1 be the number of heads in the rst experiment, and X2 the number
of tails in the second experiment. Without using moment-generating functions, in fact
without any calculation at all, nd the distribution of X1 + X2 .
16
X=
and the sample variance is
1
(Xi X)2 .
n i=1
n
S2 =
E(X) =
and
Var X =
n
1
n 2
2
0
Var
X
=
=
i
n2 i=1
n2
n
as
n .
2 +
Thus
E[(Xi X)2 ] = 2 (1 +
1
2
n1 2
)=
.
n n
n
Consequently, E(S 2 ) = (n 1) 2 /n, not 2 . Some books dene the sample variance as
1
n
(Xi X)2 =
S2
n 1 i=1
n1
n
where S 2 is our sample variance. This adjusted estimate of the true variance is unbiased
(its expectation is 2 ), but biased does not mean bad. If we measure performance by
asking for a small mean square error, the biased estimate is better in the normal case, as
we will see at the end of the lecture.
17
(1)
(2)
1
1
(x1 , . . . , xn ) 1
dn =
=
..
(y1 , . . . , yn )
.
1
1
1
0
1
0
1
1
0
0
1
1 1
0 + 1
0 1
1
1
0
1
0
1
(xi )2 =
(xi x + x )2 =
(3)
i=1
because (xi x) = 0. By (2), x1 x = x1 y1 = y2 yn and xi x = xi y1 = yi
for i = 2, . . . , n. (Remember that y1 = x.) Thus
n
i=1
yi2
i=2
Now
fY1 Yn (y1 , . . . , yn ) = nfX1 Xn (x1 , . . . , xn ).
(4)
18
By (3) and (4), the right side becomes, in terms of the yi s,
n
n
n
1
1
2
2
2
n
.
exp
y
n(y
)
i
1
i
2 2
2
i=2
i=2
The joint density of Y1 , . . . , Yn is a function of y1 times a function of (y2 , . . . , yn ), so
Y1 and (Y2 , . . . , Yn ) are independent. Since X = Y1 and [by (4)] S 2 is a function of
(Y2 , . . . , Yn ),
X
and S 2
are independent
nS 2
+
2
/ n
2
.
/ n
is normal (0,1)
so 2 (n) = (nS 2 / 2 ) + 2 (1) with the two random variables on the right independent. If
M (t) is the moment-generating function of nS 2 / 2 , then (1 2t)n/2 = M (t)(1 2t)1/2 .
Therefore M (t) = (1 2t)(n1)/2 , i.e.,
nS 2
2
is
2 (n 1)
S/ n 1
19
Since nS 2 / 2 is 2 (n 1), which has variance 2(n 1), we have n2 (Var S 2 )/ 4 = 2(n 1).
Also nE(S 2 )/ 2 is the mean of 2 (n 1), which is n 1. (Or we can recall from (4.1)
that E(S 2 ) = (n 1) 2 /n.) Thus the mean square error is
2
c2 2 4 (n 1) (n 1) 2
2 .
+ c
2
n
n
We can drop the 4 and use n2 as a common denominator, which can also be dropped.
We are then trying to minimize
c2 2(n 1) + c2 (n 1)2 2c(n 1)n + n2 .
Dierentiate with respect to c and set the result equal to zero:
4c(n 1) + 2c(n 1)2 2(n 1)n = 0.
Dividing by 2(n 1), we have 2c + c(n 1) n = 0, so c = n/(n + 1). Thus the best
estimate of the form cS 2 is
n
1
(Xi X)2 .
n + 1 i=1
If we use S 2 then c = 1. If we us the unbiased version then c = n/(n 1). Since
[n/(n + 1)] < 1 < [n/(n 1)] and a quadratic function decreases as we move toward
its minimum, w see that the biased estimate S 2 is better than the unbiased estimate
nS 2 /(n 1), but neither is optimal under the minimum mean square error criterion.
Explicitly, when c = n/(n 1) we get a mean square error of 2 4 /(n 1) and when c = 1
we get
(2n 1) 4
4
2(n 1) + (n 1 n)2 =
2
n
n2
which is always smaller, because [(2n 1)/n2 ] < 2/(n 1) i 2n2 > 2n2 3n + 1 i
3n > 1, which is true for every positive integer n.
For large n all these estimates are good and the dierence between their performance
is small.
Problems
1. Let X1 , . . . , Xn be iid, each normal (, 2 ), and let X be the sample mean. If c is a
constant, we wish to make n large enough so that P { c < X < + c} .954. Find
the minimum value of n in terms of 2 and c. (It is independent of .)
2. Let X1 , . . . , Xn1 , Y1 , . . . Yn2 be independent random variables, with the Xi normal
(1 , 12 ) and the Yi normal (2 , 22 ). If X is the sample mean of the Xi and Y is the
sample mean of the Yi , explain how to compute the probability that X > Y .
3. Let X1 , . . . , Xn be iid, each normal (, 2 ), and let S 2 be the sample variance. Explain
how to compute P {a < S 2 < b}.
4. Let S 2 be the sample variance of iid normal (, 2 ) random variables Xi , i = 1 . . . , n.
Calculate the moment-generating function of S 2 and from this, deduce that S 2 has a
gamma distribution.
20
(0,1) and X2 chi-square with r degrees of freedom. The random variable Y1 = rX1 / X 2
has the T distribution with r degrees of freedom.
2(r/2)2r/2
(y12 /r))y2 /2
With z = (1 +
(with y1 replaced by t)
[(r+1)/2]1
y2
((r + 1)/2)
1
, < t < ,
2
r(r/2) (1 + (t /r))(r+1)/2
the T density with r degrees of freedom.
Since and
n1
(X )
/ n
divided by
nS/
is
T (n 1).
S/ n 1
is
T (n 1)
Advocates of dening the sample variance withn 1 in the denominator point out that
one can simply replace by S in (X )/(/ n) to get the T statistic.
2
t2
t2
t2
1+
1+
1+
=
et2 1 = et /2
r
r
r
as r .
21
1
u(m/2)1
(m+n)/2
2
(m/2)(n/2)
z [(m+n)/2]1
2
ez
dz.
[(m+n)/2]1
1+u
[(1 + u)/2]
We abbreviate (a)(b)/(a + b) by (a, b). (We will have much more to say about this
when we discuss the beta distribution later in the lecture.) The above formula simplies
to
h(u) =
1
u(m/2)1
,
(m/2, n/2) (1 + u)(m+n)/2
u 0.
X1 /m
n
= U
X2 /n
m
du m
m
fW (w) = fU (u)
w .
= fU
dw
n
n
w 0,
22
(a, b) =
a, b > 0.
(a, b) =
which is consistent with our use of (a, b) as an abbreviation in (5.2). We make the change
of variable t = x2 to get
(a) =
a1 t
dt = 2
x2a1 ex dx.
2
We now use the familiar trick of writing (a)(b) as a double integral and switching to
polar coordinates. Thus
(a)(b) = 4
0
0
/2
=4
d
0
+y 2 )
dx dy
2a+2b1 r 2
ua+b1 eu du = (a + b)/2.
dr = (1/2)
Thus
(a)(b)
=
2(a + b)
/2
Let z = cos2 , 1 z = sin2 , dz = 2 cos sin d = 2z 1/2 (1 z)1/2 dz. The above
integral becomes
1
a1
(1 z)
b1
1
dz =
2
z a1 (1 z)b1 dz =
0
1
(a, b)
2
1
xa1 (1 x)b1 ,
(a, b)
0x1
23
Problems
1. Let X have the beta distribution with parameters a and b. Find the mean and variance
of X.
2. Let T have the T distribution with 15 degrees of freedom. Find the value of c which
makes P {c T c} = .95.
3. Let W have the F distribution with m and n degrees of freedom (abbreviated W =
F (m, n)). Find the distribution of 1/W .
4. A typical table of the F distribution gives values of P {W c} for c = .9, .95, .975 and
.99. Explain how to nd P {W c} for c = .1, .05, .025 and .01. (Use the result of
Problem 3.)
5. Let X have the T distribution with n degrees of freedom (abbreviated X = T (n)).
Show that T 2 (n) = F (1, n), in other words, T 2 has an F distribution with 1 and n
degrees of freedom.
6. If X has the exponential density ex , x 0, show that 2X is 2 (2). Deduce that the
quotient of two exponential random variables is F (2, 2).
n1
n2
n3
nr1
nr
which reduces to the multinomial formula
n!
pn1 pnr r
n1 ! nr ! 1
where the pi are nonnegative real numbers that sum to 1, and the ni are nonnegative
integers that sum to n.
Now let X1 , . . . , Xn be iid, each with density f (x) and distribution function F (x).
Let Y1 < Y2 < < Yn be the Xi s arranged in increasing order, so that Yk is the k-th
smallest. In particular, Y1 = min Xi and Yn = max Xi . The Yk s are called the order
statistics of the Xi s
The distributions of Y1 and Yn can be computed without developing any new machinery.
The probability that Yn x is the probability that Xi x for all i, which is
n
P
{X
i x} by independence. But P {Xi x} is F (x) for all i, hence
i=1
FYn (x) = [F (x)]n
Similarly,
P {Y1 > x} =
n
i=1
2
Therefore
FY1 (x) = 1 [1 F (x)]n
We compute fYk (x) by asking how it can happen that x Yk x + dx (see Figure
6.1). There must be k 1 random variables less than x, one random variable between
x and x + dx, and n k random variables greater than x. (We are taking dx so small
that the probability that more one random variable falls in [x, x + dx] is negligible, and
P {Xi > x} is essentially the same as P {Xi > x + dx}. Not everyone is comfortable
with this reasoning, but the intuition is very strong and can be made precise.) By the
multinomial formula,
n!
[F (x)]k1 f (x) dx[1 F (x)]nk
(k 1)!1!(n k)!
fYk (x) dx =
so
fYk (x) =
n!
[F (x)]k1 [1 F (x)]nk f (x).
(k 1)!1!(n k)!
Similar reasoning (see Figure 6.2) allows us to write down the joint density fYj Yk (x, y) of
Yj and Yk for j < k, namely
n!
[F (x)]j1 [F (y) F (x)]kj1 [1 F (y)]nk f (x)f (y)
(j 1)!(k j 1)!(n k)!
for x < y, and 0 elsewhere. [We drop the term 1! (=1), which we retained for emphasis
in the formula for fYk (x).]
k-1
n-k
'
'
x + dx
Figure 6.1
j-1
k-j-1
'
'
x + dx
'
y
n-k
'
y + dy
Figure 6.2
Problems
1. Let Y1 < Y2 < Y3 be the order statistics of X1 , X2 and X3 , where the Xi are uniformly
distributed between 0 and 1. Find the density of Z = Y3 Y1 .
2. The formulas derived in this lecture assume that we are in the continuous case (the
distribution function F is continuous). The formulas do not apply if the Xi are discrete.
Why not?
3
3. Consider order statistics where the Xi , i = 1, . . . , n, are uniformly distributed between
0 and 1. Show that Yk has a beta distribution, and express the parameters and in
terms of k and n.
4. In Problem 3, let 0 < p < 1, and express P {Yk > p} as the probability of an event
associated with a sequence of n Bernoulli trials with probability of success p on a given
trial. Write P {Yk > p} as a nite sum involving n, p and k.
so
E(X) 0 +
af (x) dx = aP {X a}.
E[(X )2 ]
1
= 2.
2
2
k
k
E[(Xn X)2 ]
0.
2
6
To prove (2), note that
E[(Xn X)2 ] = Var(Xn X) + [E(Xn ) E(X)]2 0.
In this result, if X is identically equal to a constant c, then Var(Xn X) is simply
Var Xn . Condition (2) then becomes E(Xn ) c and Var Xn 0, which implies that Xn
converges in probability to c.
7.6 An Application
In normal sampling, let Sn2 be the sample variance based on n observations. Lets show
P
that Sn2 is a consistent estimate of the true variance 2 , that is, Sn2 2 . Since nSn2 / 2
is 2 (n 1), we have E(nSn2 / 2 ) = (n 1) and Var(nSn2 / 2 ) = 2(n 1). Thus E(Sn2 ) =
(n 1) 2 /n 2 and Var(Sn2 ) = 2(n 1) 4 /n2 0, and the result follows.
Fn(x)
lim Fn(x)
'
1/n
F(x)
Figure 7.1
Problems
1. Let X1 , . . . , Xn be independent, not necessarily identically distributed random variables. Assume that the Xi have nite means i and nite variances i2 , and the
variances are uniformly bounded, i.e., for some positive number M we have i2 M
for all i. Show that (Sn E(Sn ))/n converges in probability to 0. This is a generalization of the weak law of large numbers. For if i = and i2 = 2 for all i, then
P
7
3. Let X1 , . . . , Xn be iid with nite mean and nite variance 2 . Let X n be the sample
mean (X1 + + Xn )/n. Find the limiting distribution of X n , i.e., nd a random
d
8.1 Theorem
If Yn has moment-generating function Mn , Y has moment-generating function M , and
Mn (t) M (t) as n for all t in some open interval containing the origin, then
d
Yn
Y.
] = E exp
n
i=1
n
t
Xi .
n i=1
Xi is [M (t)]n , so
t n
.
MYn (t) = M
n
Now if the density of the Xi is f (x), then
t
tx
M
exp
=
f (x) dx
n
n
t2 x2
t3 x3
tx
+
=
+
+ f (x) dx
1+
2
3/2
3
n 2!n
3!n
=1+0+
t4 4
t2
t3 3
+
+ 3/2 3 +
2n 6n
24n2 4
9
2
8.3 Theorem
If Xn converges in distribution to a constant c, then Xn converges in probability to X.
Proof. We estimate the probability that |Xn X| , as follows.
P {|Xn X| } = P {Xn c + } + P {Xn c }
= 1 P {Xn < c + } + P {Xn c }
Now P {Xn c + (/2)} P {Xn < c + }, so
P {|Xn c| } 1 P {Xn c + (/2)} + P {Xn c }
= 1 Fn (c + (/2)) + Fn (c ).
where Fn is the distribution function of Xn . But as long as x = c, Fn (x) converges to the
distribution function of the constant c, so Fn (x) 1 if x > c, and Fn (x) 0 if x < c.
Therefore P {|Xn c| } 1 1 + 0 = 0 as n .
8.4 Remarks
If Y is binomial (n, p), the normal approximation to the binomial allows us to regard Y
as approximately normal with mean np and variance npq (with q = 1 p). According
to Box, Hunter and Hunter, Statistics for Experimenters, page 130, the approximation
works well in practice if n > 5 and
1 q
p
< .3
p
q
n
If, for example, we wish to estimate the probability that Y = 50 or 51 or 52, we may write
this probability as P {49.5 < Y < 52.5} , and then evaluate as if Y were normal with
mean np and variance np(1 p). This turns out to be slightly more accurate in practice
than using P {50 Y 52}.
8.5 Simulation
Most computers an simulate a random variable that is uniformly distributed between 0
and 1. But what if we need a random variable with an arbitrary distribution function F ?
For example, how would we simulate the random variable with the distribution function
of Figure 8.1? The basic idea is illustrated in Figure 8.2. If Y = F (X) where X has the
10
continuous distribution function F , then Y is uniformly distributed on [0,1]. (In Figure
8.2 we have, for 0 y 1, P {Y y} = P {X x} = F (x) = y.)
Thus if X is uniformly distributed on [0,1] and w want Y to have distribution function
F , we set X = F (Y ), Y = F 1 (X).
In Figure 8.1 we must be more precise:
Case 1. 0 X 3. Let X = (3/70)Y + (15/70), Y = (70X 15)/3.
Case 2. .3 X .8. Let Y = 4, so P {Y = 4} = .5 as required.
Case 3. .8 X 1. Let X = (1/10)Y + (4/10), Y = 10X 4.
In Figure 8.1, replace the F (y)-axis by an x-axis to visualize X versus Y . If y = y0
corresponds to x = x0 [i.e., x0 = F (y0 )], then
P {Y y0 } = P {X x0 } = x0 = F (y0 )
as desired.
F(y)
0
+ 1
.8 15
3 y + 70 .3 70
-5
'2
10
o
'
4
'6
Figure 8.1
Y = F(X)
1
y
x
Figure 8.2
Problems
1. Let Xn be gamma (n, ), i.e., Xn has the gamma distribution with parameters n and
. Show that Xn is a sum of n independent exponential random variables, and from
this derive the limiting distribution of Xn /n.
2. Show that 2 (n) is approximately normal for large n (with mean n and variance 2n).
11
3. Let X1 , . . . , Xn be iid with density f . Let Yn be the number of observations that
fall into the interval (a, b). Indicate how to use a normal approximation to calculate
probabilities involving Yn .
4. If we have 3 observations 6.45, 3.14, 4.93, and we round o to the nearest integer, we
get 6, 3, 5. The sum of integers is 14, but the actual sum is 14.52. Let Xi , i = 1, . . . , n
be the round-o error of the i-th observation, and assume that the Xi are iid and
uniformly distributed on (1/2, 1/2). Indicate how to use a normal
n approximation to
calculate probabilities involving the total round o error Yn = i=1 Xi .
5. Let X1 , . . . , Xn be iid with continuous distribution function F , and let Y1 < < Yn
be the order statistics of the Xi . Then F (X1 ), . . . , F (Xn ) are iid and uniformly distributed on [0,1] (see the discussion of simulation), with order statistics F (Y1 ), . . . , F (Yn ).
Show that n(1 F (Yn )) converges in distribution to an exponential random variable.
12
Lecture 9. Estimation
9.1 Introduction
In eect the statistician plays a game against nature, who rst chooses the state of
nature (a number or k-tuple of numbers in the usual case) and performs a random
experiment. We do not know but we are allowed to observe the value of a random
variable (or random vector) X, called the observable, with density f (x).
After observing X = x we estimate by (x), which is called a point estimate because
it produces a single number which we hope is close to . The main alternative is an
interval estimate or condence interval, which will be discussed in Lectures 10 and 11.
For a point estimate (x) to make sense physically, it must depend only on x, not on
the unknown parameter . There are many possible estimates, and there are no general
rules for choosing a best estimate. Some practical considerations are:
(a) How much does it cost to collect the data?
(b) Is the performance of the estimate easy to measure, for example, can we compute
P {|(x) | < }?
(c) Are the advantages of the estimate appropriate for the problem at hand?
We will study several estimation methods:
1. Maximum likelihood estimates.
These estimates usually have highly desirable theoretical properties (consistency), and
are frequently not dicult to compute.
2. Condence intervals.
These estimates have a very useful practical feature. We construct an interval from
the data, and we will know the probability that our (random) interval actually contains
the unknown (but xed) parameter.
3. Uniformly minimum variance unbiased estimates (UMVUEs).
Mathematical theory generates a large number of examples of these, but as we know,
a biased estimate can sometimes be superior.
4. Bayes estimates.
These estimates are appropriate if it is reasonable to assume that the state of nature
is a random variable with a known density.
In general, statistical theory produces many reasonable candidates, and practical experience will dictate the choice in a given physical situation.
13
9.3 Example
Let X be binomial (n, ). Then the probability that X = x when the true parameter is
is
n x
f (x) =
(1 )nx , x = 0, 1, . . . , n.
x
Maximizing f (x) is equivalent to maximizing ln f (x):
x nx
ln f (x) =
[x ln + (n x)ln(1 )] =
= 0.
1
Thus x x n + x = 0, so = X/n, the relative frequency of success.
Notation: will be written in terms of random variables, in this case X/n rather than
x/n. Thus is itself a random variable.
P
9.4 Example
Let X1 , . . . , Xn be iid, normal (, 2 ), = (, 2 ). Then, with x = (x1 , . . . , xn ),
n
n
(xi )2
1
f (x) =
exp
2 2
2
i=1
and
ln f (x) =
n
n
1
(xi )2 ;
ln 2 n ln 2
2
2 i=1
(xi ) = 0,
ln f (x) = 2
i=1
xi n = 0,
= x;
i=1
n
n
n
n
1
ln f (x) = + 3
(xi )2 = 3 2 +
(xi )2 = 0
i=1
n i=1
with = x. Thus
1
(xi x)2 = s2 .
n i=1
n
2 =
14
Case 3. is known. Then = 2 and the equation (/) ln f (x) = 0 becomes
1
(xi )2
n i=1
n
2 =
so
=
(Xi )2 .
n i=1
n
The sample mean X is an unbiased and (by the weak law of large numbers) consistent
estimate of . The sample variance S 2 is a biased but consistent estimate of 2 (see
Lectures 4 and 7).
Notation: We will abbreviate maximum likelihood estimate by MLE.
h()
If h is continuous, then consistency is preserved, in other words:
P
P
If h is continuous and , then h()
h().
h()| < .
Proof. Given > 0, there exists > 0 such that if | | < , then |h()
Consequently,
h()| } P {| | } 0
P {|h()
as
n .
(To justify the above inequality, note that if the occurrence of an event A implies the
occurrence of an event B, then P (A) P (B).)
S 2 = 1 22
S2
,
X
1 =
X
X
= 2
2
S
15
Problems
1. In this problem, X1 , . . . , Xn are iid with density f (x) or probability function p (x),
and you are asked to nd the MLE of .
(a) Poisson (),
> 0.
(b) f (x) = x , 0 < x < 1, where > 0. The probability is concentrated near the
origin when < 1, and near 1 when > 1.
(c) Exponential with parameter , i.e., f (x) = (1/)ex/ , x > 0, where > 0.
(d) f (x) = (1/2)e|x| , where and x are arbitrary real numbers.
(e) Translated exponential, i.e., f (x) = e(x) , where is an arbitrary real number
and x .
2. let X1 , . . . , Xn be iid, each uniformly distributed between (1/2) and + (1/2).
Find more than one MLE of (so MLEs are not necessarily unique).
3. In each part of Problem 1, calculate E(Xi ) and derive an estimate based on the method
of moments by setting the sample mean equal to the true mean. In each case, show
that the estimate is consistent.
4. Let X be exponential with parameter , as in Problem 1(c). If r > 0, nd the MLE of
P {X r}.
5. If X is binomial (n, ) and a and b are integers with 0 a b n, nd the MLE of
P {a X b}.
16
Yn
Yn np
< .01 n
P
p < .01 = P {|Yn np| < .01n} = P
n
np(1 p)
p(1 p)
and this is approximately
.01 n
.01 n
.01 n
= 2
1 > .95
p(1 p)
p(1 p)
p(1 p)
where is the normal (0,1) distribution function. Since 1.95/2 = .975 and (1.96) = .975,
we have
.01 n
> 1.96, n > (196)2 p(1 p).
p(1 p)
But (by calculus) p(1 p) is maximized when 1 2p = 0, p = 1/2, p(1 p) = 1/4.
Thus n > (196)2 /4 = (98)2 = (100 2)2 = 10000 400 + 4 = 9604.
If we want to get within one tenth of one percent (.001) of p with 99 percent condence,
we repeat the above analysis with .01 replaced by .001, 1.99/2=.995 and (2.6) = .995.
Thus
.001 n
> 2.6, n > (2600)2 /4 = (1300)2 = 1, 690, 000.
p(1 p)
17
To get within 3 percent with 95 percent condence, we have
2
.03 n
196
1
> 1.96, n >
= 1067.
3
4
p(1 p)
If the experiment is repeated independently a large number of times, it is very likely that
our result will be within .03 of the true probability p at least 95 percent of the time. The
usual statement The margin of error of this poll is 3% does not capture this idea.
Note that the accuracy of the prediction depends only on the number of voters polled
and not in total number of votes in the population. But the model assumes sampling
with replacement. (Theoretically, the same voter can be polled more than once since the
voters are selected independently.) In practice, sampling is done without replacement,
but if the number n of voters polled is small relative to the population size N , the error
is very small.
The normal approximation to the binomial (based on the central limit theorem) is
quite reliable, and is used in practice even for modest values of n; see (8.4).
/ n
hence
P {b <
n
is normal (0,1),
< b} = (b) (b) = 2(b) 1
S/ n 1
is
T (n 1)
hence
P {b <
bS
bS
<<X+
.
n1
n1
18
Var Xi + 2
i=1
Cov(Xi , Xj )
i<j
whee Cov stands for covariance. (We will prove this in a later lecture.) If i = j, then
E(Xi Xj ) = P {Xi = Xj = 1} = P {X1 X2 = 1} =
and
Cov(Xi , Xj ) = E(Xi Xj ) E(Xi )E(Xj ) = p
Np Np 1
N
N 1
Np 1
N 1
p2
n1
N n
np(1 p) 1
= np(1 p)
.
N 1
N 1
Thus if SE is the standard error (the standard deviation of X), then SE (without replacement) = SE (with replacement) times a correction factor, where the correction factor
is
N n
1 (n/N )
=
.
N 1
1 (1/N )
The correction factor is less than 1, and approaches 1 as N , as long as n/N 0.
Note also that in sampling without replacement, the probability of getting exactly k
As in n trials is
N p
N (1p)
k
Nnk
n
19
Problems
1. In the normal case [see (10.2)], assume that 2 is known. Explain how to compute the
length of the condence interval for .
2. Continuing Problem 1, assume that 2 is unknown. Explain how to compute the length
of the condence interval for , in terms of the sample standard deviation S.
3. Continuing Problem 2, explain how to compute the expected length of the condence
interval for , in terms of the unknown standard deviation . (Note that when is
unknown, we expect a larger interval since we have less information.)
4. Let X1 , . . . , Xn be iid, each gamma with parameters and . If is known, explain
how to compute a condence interval for the mean = .
5. In the binomial case [see (10.1)], suppose we specify the level of condence and the
length of the condence interval. Explain how to compute the minimum value of n.
is normal (0,1).
Also, nS12 / 2 is 2 (n1) and mS22 / 2 is 2 (m1). But 2 (r) is the sum of r independent,
normal (0,1) random variables, so
nS12
mS22
+
2
2
Thus if
R=
is
2 (n + m 2).
nS12 + mS22
n+m2
1
1
+
n m
then
T =
X Y (1 2 )
R
is
T (n + m 2).
Our assumption that both populations have the same variance is crucial, because the
unknown variance can be cancelled.
If P {b < T < b} = .95 we get a 95 percent condence interval for 1 2 :
b <
X Y (1 2 )
<b
R
or
(X Y ) bR < 1 2 < (X Y ) + bR.
If the variances 12 and 22 are known but possibly unequal, then
X Y (1 2 )
12
22
n + m
is normal (0,1). If R0 is the denominator of the above fraction, we can get a 95 percent
condence interval as before: (b) (b) = 2(b) 1 > .95,
(X Y ) bR0 < 1 2 < (X Y ) + bR0 .
11.2 Example
Let Y1 and Y2 be binomial (n1 , p1 ) and (n2 , p2 ) respectively. Then
Y1 = X1 + + Xn1
and Y2 = Z1 + + Zn2
where the Xi and Zj are indicators of success on trials i and j respectively. Assume
that X1 , . . . Xn1 , Z1 , . . . , Zn2 are independent. Now E(Y1 /n1 ) = p1 and Var(Y1 /n1 ) =
n1 p1 (1 p1 )/n21 = p1 (1 p1 )/n1 , with similar formulas for Y2 /n2 . Thus for large n,
Y1
Y2
(p1 p2 )
n1
n2
divided by
p1 (1 p1 ) p2 (1 p2 )
+
n1
n2
is approximately normal (0,1). But this expression cannot be used to construct condence
intervals for p1 p2 because the denominator involves the unknown quantities p1 and p2 .
However, Y1 /n1 converges in probability to p1 and Y2 /n2 converges in probability to p2 ,
and this justies replacing p1 by Y1 /n1 and p2 by Y2 /n2 in the denominator.
nS 2
< b} = 1 .
2
i=1
3
so if
W =
n
(Xi )2
i=1
b
is
F (n2 1, n1 1).
Then
V22 12
V12 22
is
F (n2 1, n1 1)
and this allows construction of condence intervals for 12 /22 in the usual way.
Problems
1. In (11.1), suppose the variances 12 and 22 are unknown and possibly unequal. Explain
why the analysis of (11.1) breaks down.
2. In (11.1), again assume that the variances are unknown, but 12 = c22 where c is a
known positive constant. Show that condence intervals for the dierence of means
can be constructed.
12.3 Example
Let X1 , . . . , Xn be iid, each normal (, 2 ). We will test H0 : 0 vs. H1 : > 0 .
Under H1 , X will tend to be larger, so lets reject H0 when X > c. The power function
of the test is dened by
K() = P {reject H0 },
the probability of rejecting the null hypothesis when the true parameter is . In this case,
X
c
c
>
P {X > c} = P
=1
/ n
/ n
/ n
(see Figure 12.1). Suppose that we specify the probability of a type 1 error when = 1 ,
and the probability of a type 2 error when = 2 . Then
c 1
K(1 ) = 1
=
/ n
and
K(2 ) = 1
c 2
/ n
= 1 .
If , , , 1 and 2 are known, we have two equations that can be solved for c and n.
K( )
1
Figure 12.1
The critical region
nis the set of observations that lead to rejection. In this case, it is
{(x1 , . . . , xn ) : n1 i=1 xi > c}.
The signicance level is the largest type 1 error probability. Here it is K(0 ), since
K() increases with .
12.4 Example
Let H0 : X is uniformly distributed on (0,1), so f0 (x) = 1, 0 < x < 1, and 0 elsewhere.
Let H1 : f1 (x) = 3x2 , 0 < x < 1, and 0 elsewhere. We take only one observation, and
reject H0 if x > c, where 0 < c < 1. Then
1
K(0) = P0 {X > c} = 1 c, K(1) = P1 {X > c} =
3x2 dx = 1 c3 .
c
6
If we specify the probability of a type 1 error, then = 1 c, which determines c. If
is the probability of a type 2 error, then 1 = 1 c3 , so = c3 . Thus (see Figure 12.2)
= (1 )3 .
If = .05 then = (.95)3 .86, which indicates that you usually cant do too well with
only one observation.
Figure 12.2
X 0
X 0
P b <
< b = 2FT (b) 1 where T =
S/ n 1
S/ n 1
has the T distribution with n 1 degrees of freedom.
Say 2FT (b) 1 = .95, so that
X 0
P
b = .05
S/ n 1
If actually equals 0 , we are witnessing an event of low probability. So it is natural to
test = 0 vs.
= 0 by rejecting if
X 0
S/ n 1 b,
in other words, 0 does not belong to the condence interval. As the true mean
moves away from 0 in either direction, the probability of this event will increase, since
X 0 = (X ) + ( 0 ).
Tests of = 0 vs.
= 0 are called two-sided, as opposed to = 0 vs. > 0 (or
= 0 vs. < 0 ), which are one-sided. In the present case, if we test = 0 vs. > 0 ,
we reject if
X 0
b.
S/ n 1
if L(x) >
1
(x) = 0
if L(x) <
anything if L(x) =
Suppose that the probability of a type 1 error using is , and the probability of a
type 2 error is . Let be an arbitrary test with error probabilities and . If
then . In other words, the LRT has maximum power among all tests at signicance
level .
Proof. We are going to assume that f0 and f1 are one-dimensional, but the argument
works equally well when X = (X1 , . . . , Xn ) and the fi are n-dimensional joint densities.
We recall from basic probability theory the theorem of total probability, which says that
if X has density f , then for any evert A,
P (A) =
P (A|X = x)f (x) dx.
A companion theorem which we will also use later is the theorem of total expectation,
which says that if X has density f , then for any random variable Y ,
E(Y ) =
E(Y |X = x)f (x) dx.
1 =
(x)f1 (x) dx
and similarly
=
1 =
8
The terms involving f0 translate to statements about type 1 errors, and the terms involving
f1 translate to statements about type 2 errors. Thus
(1 ) (1 ) + 0,
which says that ( ) 0, completing the proof.
12.7 Randomization
If L(x) = , then do anything means that randomization is possible, e.g., we can ip
a possibly biased coin to decide whether or not to accept H0 . (This may be signicant
in the discrete case, where L(x) = may have positive probability.) Statisticians tend
to frown on this practice because two statisticians can look at exactly the same data and
come to dierent conclusions. It is possible to adjust the signicance level (by replacing
do anything by a denite choice of either H0 or H1 to avoid randomization.
Problems
1. Consider the problem of testing = 0 vs. > 0 , where is the mean of a normal
population with known variance. Assume that the sample size n is xed. Show that
the test given in Example 12.3 (reject H0 if X > c) is uniformly most powerful. In
other words, if we test = 0 vs. = 1 for any given 1 > 0 , and we specify the
probability of a type 1 error, then the probability of a type 2 error is minimized.
2. It is desired to test the null hypothesis that a die is unbiased vs. the alternative that
the die is loaded, with faces 1 and 2 having probability 1/4 and faces 3,4,5 and 6 having
probability 1/8. The die is to be tossed once. Find a most powerful test at level = .1,
and nd the type 2 error probability .
3. We wish to test a binomial random variable X with
n = 400 and H0 : p = 1/2 vs.
H1 : p > 1/2. The random variable Y = (X np)/ np(1 p) = (X 200)/10 is
approximately normal (0,1), and we will reject H0 if Y > c. If we specify = .05,
then c = 1.645. Thus the critical region is X > 216.45. Suppose the actual result is
X = 220, so that H0 is rejected. Find the minimum value of (sometimes called the
p-value) for which the given data lead to the opposite conclusion (acceptance of H0 ).
n!
pn1 pnk k
n1 ! nk ! 1
k
(Xi npi )2
i=1
npi
2 (k 1).
where
(Xi npi )2
(observed frequency-expected frequency)2
=
.
npi
expected frequency
We will consider three types of chi-square tests.
10
Problems
1. Use a chi-square procedure to tests the null hypothesis that a random variable X has
the following distribution:
P {X = 1} = .5,
P {X = 2} = .3,
P {X = 3} = .2
11
We take 100 independent observations of X, and it is observed that 1 occurs 40 times,
2 occurs 33 times, and 3 occurs 27 times. Determine whether or not we will reject the
null hypothesis at signicance level .05
2. Use a chi-square test to decide (at signicance level .05) whether the two samples corresponding to the rows of the contingency table below came from the same underlying
distribution.
Sample 1
Sample 2
A
33
67
B
147
153
C
114
86
12
P {X1 = x1 , . . . , Xn = xn , Y = y}
.
P {Y = y}
This construction is possible because the conditional distribution does not depend on the
unknown parameter . We will show that under , (X1 , . . . , Xn ) and (X1 , . . . , Xn ) have
exactly the same distribution, so anything A can do, B can do at least as well, even though
B has less information.
Given x1 , . . . , xn , let y = x1 + + xn . The only way we can have X1 = x1 , . . . , Xn =
xn is if Y = y and then Bs experiment produces X1 = x1 , . . . , Xn = xn given y. Thus
P {X1 = x1 , . . . , Xn = xn } = P {Y = y}P {X1 = x1 , . . . , Xn = xn |Y = y}
n y
1
(1 )ny n = y (1 )ny = P {X1 = x1 , . . . , Xn = xn }.
y
y
13
P {X = x, Y = y}
.
P {Y = y}
h(x)
{z:u(z)=y}
h(z)
which is free of .
14.4 Example
Let X1 , . . . , Xn be iid, each normal (, 2 ), so that
f (x1 , . . . , xn ) =
Take = (, 2 ) and let x = n1
1
2
n
i=1
n
n
1
exp 2
(xi )2 .
2 i=1
xi , s2 = n1
n
i=1 (xi
x)2 . Then
xi x = xi (x )
and
n
n
1
2
2
s =
(xi ) 2(x )
(xi ) + n(x ) .
n 1
1
2
14
Thus
1
(xi )2 (x )2 .
n 1
n
s2 =
The joint density is given by
/2 2 n(x)2 /2 2
If and 2 are both unknown then (X, S 2 ) is sucient (take h(x) = 1). If 2 is known
2 n/2 ns2 /2 2
then we can take h(x) = (2
)
e
, = , and X is sucient. If is known
n
then (h(x) = 1) = 2 and i=1 (Xi )2 is sucient.
Problems
In Problems 1-6, show that the given statistic u(X) = u(X1 , . . . , Xn ) is sucient for
and nd appropriate functions g and h for the factorization theorem to apply.
1. The Xi are Poisson () and u(X) = X1 + + Xn .
2. The Xi have density A()B(xi ), 0 < xi < (and 0 elsewhere), where is a positive real
number; u(X) = max Xi . As a special case, the Xi are uniformly distributed between
0 and , and A() = 1/, B(xi ) = 1 on (0, ).
3. The Xi are geometric with parameter , i.e., if is the probability of success on a given
Bernoulli trial, then P {Xi = x} = (1 )x is the probability that there will be x
n
failures followed by the rst success; u(X) = i=1 Xi .
n
4. The Xi have the exponential density (1/)ex/ , x > 0, and u(X) = i=1 Xi .
n
5. The Xi have the beta density with parameters a = and b = 2, and u(X) = i=1 Xi .
6. The Xi have the gamma
n density with parameters = , an arbitrary positive
number, and u(X) = i=1 Xi .
7. Show that the result in (14.2) that statistician B can do at least as well as statistician
A, holds in the general case of arbitrary iid random variables Xi .
15
E(Y ) =
x=0
1 2 x 1
1
x e ( ) = xex , 0 < y < x.
2
x
2
y(1/2)xex dy dx =
(x3 /4)ex dx =
y=0
3!
3
= .
4
2
Method 2 works well when the conditional expectation is easy to compute. In this case
it is x/2 by inspection. Thus
3
E(Y ) =
as before.
(1/2)x2 ex (x/2) dx =
2
0
15.3 Lemma
E[E(X2 |X1 )] = E(X2 ).
Proof. Let g(X1 ) = E(X2 |X1 ). Then
E[g(X1 )] =
g(x)f1 (x) dx =
16
15.4 Lemma
If i = E(Xi ), i = 1, 2, then
E[{X2 E(X2 |X1 )}{E(X2 |X1 ) 2 }] = 0.
Proof. The expectation is
[x2 E(X2 |X1 = x1 )][E(X2 |X1 = x1 ) 2 ]f1 (x1 )f2 (x2 |x1 ) dx1 dx2
f1 (x1 )[E(X2 |X1 = x1 ) 2 ]
The inner integral (with respect to x2 ) is E(X2 |X1 = x1 ) E(X2 |X1 = x1 ) = 0, and the
result follows.
15.5 Lemma
Var X2 Var[E(X2 |X1 )].
Proof. We have
Var X2 = E[(X2 2 )2 ] = E [{X2 E(X2 |X1 } + {E(X2 |X1 ) 2 }]2
= E[{X2 E(X2 |X1 )}2 ] + E[{E(X2 |X1 ) 2 }2 ]
E[{E(X2 |X1 ) 2 }2 ]
by (15.4)
But by (15.2), E[E(X2 |X1 )] = E(X2 ) = 2 , so the above term is the variance of
E(X2 |X1 ).
15.6 Lemma
Equality holds in (15.5) if and only if X2 is a function of X1 .
Proof. The argument of (15.5) shows that equality holds i E[{X2 E(X2 |X1 )}2 ] = 0,
in other words, X2 = E(X2 |X1 ). This implies that X2 is a function of X1 . Conversely, if
X2 = h(X1 ), then E(X2 |X1 ) = h(X1 ) = X2 , and therefore equality holds.
17
15.8 Theorem
Let Y1 = u1 (X1 , . . . , Xn ) be a sucient statistic for . If the maximum likelihood estimate
of is unique, then is a function of Y1 .
Proof. The joint density of the Xi can be factored as
f (x1 , . . . , xn ) = g(, z)h(x1 , . . . , xn )
where z = u1 (x1 , . . . , xn ). Let 0 maximize g(, z). Given z, we nd 0 by looking
at all g(, z), so that 0 is a function of u1 (X1 , . . . , Xn ) = Y1 . But 0 also maximizes
f (x1 , . . . , xn ), so by uniqueness, = 0 .
In Lectures 15-17, we are developing methods for nding uniformly minimum variance
unbiased estimates. Exercises will be deferred until Lecture 17.
m
pj ()Kj (x)
j=1
where a() > 0, b(x) > 0, < x < , = (1 , . . . , k ) with j < j < j , 1 j k
(, , j , j are constants).
There are certain regularity conditions that are assumed, but they will always be
satised in the examples we consider, so we will omit the details. In all our examples, k
and m will be equal. This is needed in the proof of completeness of the statistic to be
discussed in Lecture 17. (It is not needed for suciency.)
16.4 Examples
1. Binomial(n, ) where n is known. We have f (x)
= nx x (1 )nx , x = 0, 1, . . . , n,
where 0 < < 1. Take a() = (1)n , b(x) = nx , p1 () = ln ln(1), K1 (x) = x.
Note that k = m = 1.
2
2. Poisson(). The probability function is f (x) = e x /x!, x = 0, 1, . . . , where > 0.
We can take a() = e , b(x) = 1/x!, p1 () = ln , K1 (x) = x, and k = m = 1.
3. Normal(, 2 ). The density is
f (x) =
1
exp[(x )2 /2 2 ],
2
< x < ,
b(x) = 1,
= (, 2 ).
p1 () = 1/2 2 ,
K1 (x) = x2 ,
p2 () =
q k1 petk .
k=1
pet
,
1 qet
|qet | < 1.
The random variable Y1 is said to have the geometric distribution. (The slightly dierent
random variable appearing in Problem 3 of Lecture 14 is also frequently referred to as
geometric.) Now Yr (the negative binomial random variable) is the sum of r independent
random variables, each geometric, so
r
pet
MYr (t) =
.
1 qet
The event {Yr = k} occurs i there are r 1 successes in the rst k 1 trials, followed
by a success on trial k. Therefore
k 1 r1 kr
P {Yr = k} =
p q
p, x = r, r + 1, r + 2, . . . .
r1
3
We can calculate the mean and variance of Yr from the moment-generating function,
but the dierentiation is not quite as messy if we introduce another random variable.
Let Xr be the number of failures preceding the r-th success. Then Xr plus the number
of successes preceding the r-th success is the total number of trials preceding the r-th
success. Thus
Xr + (r 1) = Yr 1,
so
and
MXr (t) = ert MYr (t) =
When r = 1 we have
MX1 (t) =
p
,
1 qet
E(X1 ) =
Xr = Yr r
p
1 qet
r
.
pqet
q
= .
(1 qet )2 t=0
p
(1 q)2 pq + pq 2 2(1 q)
pq(1 q)[1 q + 2q]
pq(1 + q)
q(1 + q)
=
=
=
.
(1 q)4
(1 q)4
p3
p2
Thus Var X1 = Var Y1 = [q(1 + q)/p2 ] [q 2 /p2 ] = q/p2 , hence Var Yr = rq/p2 .
Now to show that the negative binomial distribution belongs to the exponential class:
x1 r
P {Yr = x} =
(1 )xr , x = r, r + 1, r + 2, . . . , = p.
r1
Take
a() =
r
,
b(x) =
x1
,
r1
p1 () = ln(1 ),
K1 (x) = x,
k = m = 1.
where
r
k
=
r(r 1 (r k + 1)
.
k!
m
pj ()Kj (x)
j=1
b(xi ) exp
i=1
m
m
pj ()Kj (x1 ) exp
pj ()Kj (xn ) .
j=1
j=1
K1 (xi ), . . . ,
i=1
n
Km (xi )
i=1
f (x1 , . . . , xn ) = a()
b(xi ) exp p()
K(xi ) .
i=1
Let Y1 =
n
i=1
i=1
n
K(xi ) f (x1 , . . . , xn ) dx1 dxn .
i=1
n
i=1
K1 (xi ) + + pm ()
n
i=1
Km (xi )
5
and the argument is essentially the same as in the one-dimensional case. The transform
result is as follows. If
when ai < ti < bi , i = 1, . . . , m, then g = 0. The above integral denes a joint momentgenerating function, which will appear again in connection with the multivariate normal
distribution.
17.2 Example
Let X1 , . . . , Xn be iid, each normal(, 2 ) where 2 is known. The normal distribution belongs to the exponential class (see (16.4), Example 3), but in this case
the term
n
exp[x2 /2 2 ] can be absorbed in b(x), so only K2 (x) = x is relevant. Thus i=1 Xi ,
equivalently X, is sucient (as found in Lecture 14) and complete. Since E(X) = , it
follows that X is a UMVUE of .
Lets nd a
UNVUE of 2 . The natural conjecture that it is (X)2 is not quite correct.
n
1
2
Since X = n
i=1 Xi , we have Var X = /n. Thus
2
= E[(X)2 ] (EX)2 = E[(X)2 ] 2 ,
n
hence
2
E (X)2
= 2
n
and we have an unbiased estimate of 2 based on the complete sucient statistic X.
Therefore (X)2 [ 2 /n] is a UMVUE of 2 .
k=0
(1)k
()k
e k
= e
= e e = e2 .
k!
k!
k=0
Problems
1. Let X be a random variable that has zero mean for all possible values of . For
example, X can be uniformly distributed between and , or normal with mean 0
and variance . Give an example of a sucient statistic for that is not complete.
2. Let f (x) = exp[(x )], < x < , and 0 elsewhere. Show that the rst order
statistic Y1 = min Xi is a complete sucient statistic for , and nd a UMVUE of .
n
1/n
3. Let f (x) = x1 , 0 < x < 1, where > 0. Show that u(X1 , . . . , Xn ) =
i=1 Xi
is a complete sucient statistic for , and that the maximum likelihood estimate is
a function of u(X1 , . . . , Xn ).
2
4. The density
nf (x) = x exp[x], x > 0, where > 0, belongs to the exponential class,
and Y = i=1 Xi is a complete sucient statistic for . Compute the expectation of
1/Y under , and from the result nd the UMVUE of .
n
5. Let Y1 be binomial (n, ), so that Y1 = i=1 Xi , where Xi is the indicator of a success
on trial i. [Thus each Xi is binomial (1, ).] By Example 1 of (16.4), the Xi , as well
as Y1 , belong to the exponential class, and Y1 is a complete sucient statistic for .
Since E(Y1 ) = n, Y1 /n is a UMVUE of .
n1
n
Y
1+
Y
n1
Note that h()f (x) = h()f (x|) is the joint density of and x, which can also be
expressed as f (x)f (|x). Thus
B() =
f (x)
L(, (x))f (|x) d dx.
Since f (x) is nonnegative, it is sucient to minimize L(, (x))f (|x) d for each x.
The resulting is called
the Bayes estimate of . Similarly, to estimate a function of ,
say (), we minimize L((), (x))f (|x) d.
We can jettison a lot of terminology by recognizing that our problem is to observe
a random variable X and estimate a random variable Y by g(X). We must minimize
E[L(Y, g(X)].
and as above, it suces to minimize the quantity in brackets for each x. If we let z = g(x),
we are minimizing z 2 2E(Y |X = x)z + E(Y 2 |X = x) by choice of z. Now Az 2 2Bz + C
is a minimum when z = B/A = E(Y |X = x)/1, and we conclude that
E[(Y g(X))2 ] is minimized when g(x) = E(Y |X = x).
What we are doing here is minimizing E[(W c)2 ] = c2 2E(W )c + E(W 2 ) by our choice
of c, and the minimum occurs when c = E(W ).
=c
f (w) dw
wf (w) dw +
wf (w) dw c
f (w) dw.
c
f (w) dw
c
which is 0 when f (w) dw = c f (w) dw, in other words when C is a median of W .
Thus E(|Y g(X)|) is minimized when g(x) is a median of the conditional distribution
of Y given X = x.
and
f (|x) =
Thus
f (, x)
h()f (x|)
=
.
f (x)
f (x)
(x) =
h()f (x) d
h()f (x) d
Problems
1. Let X be binomial(n, ), and let the density of be
h() =
r1 (1 )s1
(r, s)
[beta(r, s)].
A=
then A =
.
e + f i g + hi
c di g hi
The transpose is
a + bi e + f i
A =
.
c + di g + hi
Vectors X, Y , etc., will be regarded as column vectors. The inner product (dot product)
of n-vectors X and Y is
< X, Y >= x1 y 1 + + xn y n
where the overbar indicates complex conjugate. Thus < X, Y >= Y X. If c is any
complex number, then < cX, Y >= c < X, Y > and < X, cY >= c < X, Y >. The
vectors X and Y are said to be orthogonal (perpendicular) if < X, Y >= 0. For an
arbitrary n by n matrix B,
< BX, Y >=< X, B Y >
because < X, B Y >= (B Y ) X = Y B X = Y BX =< BX, Y >.
Our interest is in real symmetric matrices, and symmetric will always mean real
symmetric. If A is symmetric then
< AX, Y >=< X, A Y >=< X, AY > .
The eigenvalue problem is AX = X, or (A I)X = 0, where I is the identity matrix,
i.e., the matrix with 1s down the main diagonal and 0s elsewhere. A nontrivial solution
(X = 0) exists i det(A I) = 0. In this case, is called an eigenvalue of A and a
nonzero solution is called an eigenvector.
19.2 Theorem
If A is symmetric then A has real eigenvalues.
Proof. Suppose AX = X with X = 0. then < AX, Y >=< X, AY >
Y = X gives
with
n
< X, X >=< X, X >, so ( ) < X, X >= 0. But < X, X >= i=1 |xi |2 = 0, and
therefore = , so is real.
The important conclusion is that for a symmetric matrix, the eigenvalue problem can
be solved using only real numbers.
10
19.3 Theorem
If A is symmetric, then eigenvectors of distinct eigenvalues are orthogonal.
Proof. Suppose AX1 = 1 X1 and AX2 = 2 X2 . Then < AX1 , X2 >=< X1 , AX2 >, so
< 1 X1 , X2 >=< X1 , 2 X2 >. Since 2 is real we have (1 2 ) < X1 , X2 >= 0. But
we are assuming that we have two distinct eigenvalues, so that 1 = 2 . Therefore we
must have < X1 , X2 >= 0.
n
|xi |2
1/2
i=1
0
..
To verify this, note that multiplying L on the right by a diagonal matrix with entries
1 , . . . , n multiplies column i of L (namely Xi ) by i . (Multiplying on the left by D
would multiply row i by i .) Therefore
LD = [1 X1 |2 X2 | |n Xn |] = AL.
The columns of the square matrix L are mutually perpendicular unit vectors; such a
matrix is said to be orthogonal. The transpose of L can be pictured as follows:
X1
X2
L = .
..
Xn
Consequently L L = I. Since L is nonsingular (det I = 1 = det L det L), L has an inverse,
which must be L . to see this, multiply the equation L L = I on the right by L1 to get
L I = L1 , i.e., L = L1 . Thus LL = I.
Since a matrix and its transpose have the same determinant, (det L)2 = 1, so the
determinant of L is 1.
11
Finally, from AL = LD we get
L AL = D
We have shown that every symmetric matrix (with distinct eigenvalues) can be orthogonally diagonalized.
The quadratic form associated with A is
X'AX = Σᵢ,ⱼ₌₁ⁿ aᵢⱼ xᵢxⱼ.
If we change coordinates by X = LY, then
X'AX = Y'L'ALY = Y'DY = Σᵢ₌₁ⁿ λᵢyᵢ².
The symmetric matrix A is said to be nonnegative definite if X'AX ≥ 0 for all X.
Equivalently, Σᵢ₌₁ⁿ λᵢyᵢ² ≥ 0 for all Y. Set yᵢ = 1, yⱼ = 0 for all j ≠ i to conclude that A
is nonnegative definite if and only if all eigenvalues of A are nonnegative. The symmetric
matrix is said to be positive definite if X'AX > 0 except when all xᵢ = 0. Equivalently,
all eigenvalues of A are strictly positive.
19.6 Example
Consider the quadratic form
q = 3x² + 2xy + 3y² = (x, y) [ 3  1 ] (x, y)'.
                             [ 1  3 ]
Then
A = [ 3  1 ],   det(A − λI) = det [ 3−λ   1  ] = λ² − 6λ + 8 = 0
    [ 1  3 ]                      [  1   3−λ ]
and the eigenvalues are λ = 2 and λ = 4. When λ = 2, the equation A(x, y)' = λ(x, y)'
reduces to x + y = 0. Thus (1, −1)' is an eigenvector. Normalize it to get
(1/√2, −1/√2)'.
When λ = 4 we get −x + y = 0 and the normalized eigenvector is (1/√2, 1/√2)'. Consequently,
L = [ 1/√2    1/√2 ],   L'AL = [ 2  0 ] = D
    [ −1/√2   1/√2 ]           [ 0  4 ]
as expected.
If (x, y)' = L(v, w)', i.e., x = (1/√2)v + (1/√2)w, y = −(1/√2)v + (1/√2)w, then
q = 2v² + 4w².
The correlation coefficient of X and Y is
ρ = Cov(X, Y)/(σ₁σ₂),
where σ₁² = Var X and σ₂² = Var Y. The best estimate of Y of the form aX + b (best in the
sense of minimum mean square error) turns out to be
μ₂ + ρ (σ₂/σ₁)(X − μ₁),
and the minimum mean square error is
σ₂² + ρ²σ₂² − 2ρ²σ₂² = σ₂²(1 − ρ²).
20.2 Theorem
If X and Y are independent then X and Y are uncorrelated (ρ = 0), but not conversely.
Proof. Assume X and Y are independent. Then
E[(X − μ₁)(Y − μ₂)] = E(X − μ₁)E(Y − μ₂) = 0.
For the counterexample to the converse, let X = cos θ, Y = sin θ, where θ is uniformly
distributed on (0, 2π). Then
E(X) = (1/2π) ∫₀^{2π} cos θ dθ = 0,   E(Y) = (1/2π) ∫₀^{2π} sin θ dθ = 0,
and
E(XY) = E[(1/2) sin 2θ] = (1/4π) ∫₀^{2π} sin 2θ dθ = 0,
so ρ = 0. But X and Y are not independent, since X² + Y² = 1.
(Σᵢ₌₁ⁿ xᵢyᵢ)² ≤ (Σᵢ₌₁ⁿ xᵢ²)(Σᵢ₌₁ⁿ yᵢ²).
(There will be a factor of 1/n on each side of the inequality, which will cancel.) This is the
result originally proved by Cauchy. Schwarz proved the analogous formula for integrals:
(∫ₐᵇ f(x)g(x) dx)² ≤ ∫ₐᵇ [f(x)]² dx ∫ₐᵇ [g(x)]² dx.
Since an integral can be regarded as the limit of a sum, the integral result can be proved
from the result for sums.
We know that if X1 , . . . , Xn are independent, then the variance of the sum of the Xi
is the sum of the variances. If we drop the assumption of independence, we can still say
something.
20.4 Theorem
Let X₁, ..., Xₙ be arbitrary random variables (with finite mean and variance). Then
Var(X₁ + ··· + Xₙ) = Σᵢ₌₁ⁿ Var Xᵢ + 2 Σ_{i<j} Cov(Xᵢ, Xⱼ).
Proof. The variance of the sum is
E[(Σᵢ₌₁ⁿ (Xᵢ − μᵢ))²] = E[Σᵢ₌₁ⁿ (Xᵢ − μᵢ)²] + 2E[Σ_{i<j} (Xᵢ − μᵢ)(Xⱼ − μⱼ)]
= Σᵢ₌₁ⁿ Var Xᵢ + 2 Σ_{i<j} Cov(Xᵢ, Xⱼ),
as asserted.
The reason for the i < j restriction in the summation can be seen from an expansion
such as
(x + y + z)² = x² + y² + z² + 2xy + 2xz + 2yz.
It is correct, although a bit inefficient, to replace i < j by i ≠ j and drop the factor of 2.
This amounts to writing 2xy as xy + yx.
so the least squares problem is equivalent to finding the best estimate of Y of the form
aX + b, where "best" means that the mean square error is to be minimized. This is the
problem that we solved in (20.1). The least squares line is
y − μ_Y = ρ (σ_Y/σ_X)(x − μ_X).
To evaluate μ_X, μ_Y, σ_X, σ_Y, ρ:
μ_X = (1/n) Σᵢ₌₁ⁿ xᵢ = x̄,    μ_Y = (1/n) Σᵢ₌₁ⁿ yᵢ = ȳ,
σ_X² = E[(X − μ_X)²] = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)² = s_x²,    σ_Y² = (1/n) Σᵢ₌₁ⁿ (yᵢ − ȳ)² = s_y²,
ρ = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y),    ρ σ_Y/σ_X = E[(X − μ_X)(Y − μ_Y)] / σ_X².
The last entry is the slope of the least squares line, which after cancellation of 1/n in
numerator and denominator, becomes
Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)².
If ρ > 0, then the least squares line has positive slope, and y tends to increase with x. If
ρ < 0, then the least squares line has negative slope and y tends to decrease as x increases.
Problems
In Problems 1–5, assume that X and Y are independent random variables, and that we
know μ_X = E(X), μ_Y = E(Y), σ_X² = Var X, and σ_Y² = Var Y. In Problem 2, drop the
independence assumption and assume that we also know ρ, the correlation coefficient
between X and Y.
1. Find the variance of XY .
2. Find the variance of aX + bY , where a and b are arbitrary real numbers.
3. Find the covariance of X and X + Y.
4. Find the correlation coefficient between X and X + Y.
5. Find the covariance of XY and X.
6. Under what conditions will there be equality in the Cauchy-Schwarz inequality?
M(t₁, ..., tₙ) = exp(t₁μ₁ + ··· + tₙμₙ) exp[ (1/2) Σᵢ,ⱼ₌₁ⁿ tᵢaᵢⱼtⱼ ]
where the tᵢ and μⱼ are arbitrary real numbers, and the matrix A is symmetric and
positive definite.
Before we do anything else, let us indicate the notational scheme we will be using.
Vectors will be written with an underbar, and are assumed to be column vectors unless
otherwise specified. If t is a column vector with components t₁, ..., tₙ, then to save space
we write t = (t₁, ..., tₙ)'. The row vector with these components is the transpose of t,
written t'. The moment-generating function of jointly Gaussian random variables has the
form
M(t₁, ..., tₙ) = exp(t'μ) exp[(1/2) t'At].
We can describe Gaussian random vectors much more concretely.
21.2 Theorem
Joint Gaussian random variables arise from linear transformations on independent normal
random variables.
Proof. Let X₁, ..., Xₙ be independent, with Xᵢ normal (0, λᵢ), and let X = (X₁, ..., Xₙ)'.
Let Y = BX + μ where B is nonsingular. Then Y is Gaussian, as can be seen by computing
the moment-generating function of Y:
M_Y(t) = E[exp(t'Y)] = E[exp(t'BX)] exp(t'μ).
But
E[exp(u'X)] = Πᵢ₌₁ⁿ E[exp(uᵢXᵢ)] = exp[ Σᵢ₌₁ⁿ λᵢuᵢ²/2 ] = exp[(1/2) u'Du]
where D is a diagonal matrix with the λᵢ's down the main diagonal. Set u = B't, u' = t'B;
then
M_Y(t) = exp(t'μ) exp[(1/2) t'BDB't]
and BDB' is symmetric since D is symmetric. Since t'BDB't = u'Du, which is greater
than 0 except when u = 0 (equivalently when t = 0, because B is nonsingular), BDB' is
positive definite, and consequently Y is Gaussian.
Conversely, suppose that the moment-generating function of Y is exp(t'μ) exp[(1/2)t'At]
where A is symmetric and positive definite. Let L be an orthogonal matrix such that
L'AL = D, where D is the diagonal matrix of eigenvalues of A. Set X = L'(Y − μ), so
that Y = μ + LX. The moment-generating function of X is
E[exp(t'X)] = exp(−t'L'μ) E[exp(t'L'Y)].
The last term is the moment-generating function of Y with t' replaced by t'L', or equivalently, t replaced by Lt. Thus the moment-generating function of X becomes
exp(−t'L'μ) exp(t'L'μ) exp[(1/2) t'L'ALt].
This reduces to
exp[(1/2) t'Dt] = exp[(1/2) Σᵢ₌₁ⁿ λᵢtᵢ²].
Therefore the Xᵢ are independent, with Xᵢ normal (0, λᵢ).
21.4 Theorem
Let Y = μ + LX as in the proof of (21.2), and let A be the symmetric, positive definite
matrix appearing in the moment-generating function of the Gaussian random vector Y.
Then E(Yᵢ) = μᵢ for all i, and furthermore, A is the covariance matrix of the Yᵢ, in other
words, aᵢⱼ = Cov(Yᵢ, Yⱼ) (and aᵢᵢ = Cov(Yᵢ, Yᵢ) = Var Yᵢ).
It follows that the means of the Yᵢ and their covariance matrix determine the moment-generating function, and therefore the density.
Proof. Since the Xᵢ have zero mean, we have E(Yᵢ) = μᵢ. Let K be the covariance matrix
of the Yᵢ. Then K can be written in the following peculiar way:
K = E [ ( Y₁ − μ₁, ..., Yₙ − μₙ )' ( Y₁ − μ₁, ..., Yₙ − μₙ ) ],
the column vector Y − μ times the row vector (Y − μ)'.
Note that if a matrix M is n by 1 and a matrix N is 1 by n, then MN is n by n. In this
case, the ij entry is E[(Yᵢ − μᵢ)(Yⱼ − μⱼ)] = Cov(Yᵢ, Yⱼ). Thus
K = E[(Y − μ)(Y − μ)'] = E(LXX'L') = L E(XX') L'
since expectation is linear. [For example, E(MX) = M E(X) because E(Σⱼ mᵢⱼXⱼ) =
Σⱼ mᵢⱼE(Xⱼ).] But E(XX') is the covariance matrix of the Xᵢ, which is D. Therefore
K = LDL' = A (because L'AL = D).
Since the Xᵢ are independent, with Xᵢ normal (0, λᵢ), their joint density is
f_X(x₁, ..., xₙ) = 1/[(√(2π))ⁿ √(λ₁ ··· λₙ)] exp[ −Σᵢ₌₁ⁿ xᵢ²/2λᵢ ].
Since λ₁ ··· λₙ = det D = det K (because K = LDL' and (det L)² = 1), this can be written as
f_X(x₁, ..., xₙ) = 1/[(√(2π))ⁿ √(det K)] exp[ −(1/2) x'D⁻¹x ].
But y = μ + Lx, x = L'(y − μ), x'D⁻¹x = (y − μ)'LD⁻¹L'(y − μ), and [see the end
of (21.4)] K = LDL', K⁻¹ = LD⁻¹L'. Since the Jacobian of the transformation y = μ + Lx
has absolute value |det L| = 1, the density of Y is
f_Y(y₁, ..., yₙ) = 1/[(√(2π))ⁿ √(det K)] exp[ −(1/2)(y − μ)'K⁻¹(y − μ) ].
For an example of random variables that are individually normal but not jointly Gaussian,
let X be normal (0,1) and let Y = ±X, where the sign is chosen by an independent toss of a
fair coin. Then
F_Y(y) = (1/2) P{X ≤ y} + (1/2) P{−X ≤ y} = P{X ≤ y}
because −X is also normal (0,1). Thus F_X = F_Y. But with probability 1/2, X + Y = 2X,
and with probability 1/2, X + Y = 0. Therefore P{X + Y = 0} = 1/2. If X and Y were
jointly Gaussian, then X + Y would be normal (Problem 4). We conclude that X and Y
are individually Gaussian but not jointly Gaussian.
21.7 Theorem
If X₁, ..., Xₙ are jointly Gaussian and uncorrelated (Cov(Xᵢ, Xⱼ) = 0 for all i ≠ j), then
the Xᵢ are independent.
Proof. The moment-generating function of X = (X₁, ..., Xₙ)' is
M_X(t) = exp(t'μ) exp[(1/2) t'Kt]
where K is a diagonal matrix with entries σ₁², σ₂², ..., σₙ² down the main diagonal, and 0's
elsewhere. Thus
M_X(t) = Πᵢ₌₁ⁿ exp(tᵢμᵢ) exp[(1/2) tᵢ²σᵢ²],
which is the product of the individual moment-generating functions, so the Xᵢ are independent.
The conditional density of Xₙ given X₁ = x₁, ..., Xₙ₋₁ = xₙ₋₁ is
f(xₙ | x₁, ..., xₙ₋₁) = f(x₁, ..., xₙ) / f(x₁, ..., xₙ₋₁)
with
f(x₁, ..., xₙ) = (2π)^{−n/2} (det K)^{−1/2} exp[ −(1/2) Σᵢ,ⱼ₌₁ⁿ yᵢqᵢⱼyⱼ ],
where yᵢ = xᵢ − μᵢ and Q = [qᵢⱼ] = K⁻¹. Now
Σᵢ,ⱼ₌₁ⁿ yᵢqᵢⱼyⱼ = Σᵢ,ⱼ₌₁ⁿ⁻¹ yᵢqᵢⱼyⱼ + yₙ Σⱼ₌₁ⁿ⁻¹ qₙⱼyⱼ + yₙ Σᵢ₌₁ⁿ⁻¹ yᵢqᵢₙ + qₙₙyₙ².
As a function of yₙ, (1/2) times this expression has the form A + Dyₙ + Cyₙ², where
C = qₙₙ/2 and D = Σⱼ₌₁ⁿ⁻¹ qₙⱼyⱼ (recall that qₙⱼ = qⱼₙ). Completing the square, the
conditional density of Xₙ is proportional to
exp[ −A + D²/4C ] exp[ −C(yₙ + D/2C)² ].
We conclude that given X₁, ..., Xₙ₋₁, Xₙ is normal, because the above expression is a
normal density with 1/(2σ²) = C, σ² = 1/(2C). Thus
Var(Xₙ | X₁, ..., Xₙ₋₁) = 1/(2C) = 1/qₙₙ
and
E(Xₙ | X₁, ..., Xₙ₋₁) = μₙ − D/(2C) = μₙ − (1/qₙₙ) Σⱼ₌₁ⁿ⁻¹ qₙⱼ(Xⱼ − μⱼ).
Recall from Lecture 18 that E(Y |X) is the best estimate of Y based on X, in the sense
that the mean square error is minimized. In the joint Gaussian case, the best estimate of
Xn based on X1 , . . . , Xn1 is linear, and it follows that the best linear estimate is in fact
the best overall estimate. This has important practical applications, since linear systems
are usually much easier than nonlinear systems to implement and analyze.
Problems
1. Let K be the covariance matrix of arbitrary random variables X₁, ..., Xₙ. Assume
that K is nonsingular to avoid degenerate cases. Show that K is symmetric and positive
definite. What can you conclude if K is singular?
2. If X is a Gaussian n-vector and Y = AX with A nonsingular, show that Y is Gaussian.
3. If X₁, ..., Xₙ are jointly Gaussian, show that X₁, ..., Xₘ are jointly Gaussian for
m ≤ n.
4. If X₁, ..., Xₙ are jointly Gaussian, show that c₁X₁ + ··· + cₙXₙ is a normal random
variable (assuming it is nondegenerate, i.e., not identically constant).
Recall that the Gaussian density is
f_X(x₁, ..., xₙ) = 1/[(√(2π))ⁿ √(det K)] exp[ −(1/2)(x − μ)'K⁻¹(x − μ) ]
where E(X) = μ and K is the covariance matrix of X. We specialize to the case n = 2:
K = [ σ₁²      ρσ₁σ₂ ],   σ₁₂ = Cov(X₁, X₂) = ρσ₁σ₂;
    [ ρσ₁σ₂    σ₂²   ]
K⁻¹ = 1/[σ₁²σ₂²(1 − ρ²)] [ σ₂²       −ρσ₁σ₂ ]
                          [ −ρσ₁σ₂    σ₁²    ]
    = 1/(1 − ρ²) [ 1/σ₁²       −ρ/σ₁σ₂ ].
                 [ −ρ/σ₁σ₂     1/σ₂²   ]
The bivariate normal density is therefore
f(x₁, x₂) = 1/[2πσ₁σ₂√(1 − ρ²)] exp{ −1/[2(1 − ρ²)] [ ((x₁ − μ₁)/σ₁)² − 2ρ(x₁ − μ₁)(x₂ − μ₂)/(σ₁σ₂) + ((x₂ − μ₂)/σ₂)² ] }.
The moment-generating function of X is
M_X(t₁, t₂) = exp(t'μ) exp[(1/2) t'Kt].
By the results at the end of Lecture 21,
E(X₂ | X₁ = x₁) = μ₂ + ρ (σ₂/σ₁)(x₁ − μ₁)
and
Var(X₂ | X₁ = x₁) = 1/q₂₂ = σ₂²(1 − ρ²).
For E(X₁ | X₂ = x₂) and Var(X₁ | X₂ = x₂), interchange μ₁ and μ₂, and interchange σ₁
and σ₂.
22.2 Example
Let X be the height of the father, Y the height of the son, in a sample of father-son pairs.
Assume X and Y bivariate normal, as found by Karl Pearson around 1900. Assume
E(X) = 68 (inches), E(Y) = 69, σ_X = σ_Y = 2, ρ = .5. (We expect ρ to be positive
because, on the average, the taller the father, the taller the son.)
Given X = 80 (6 feet 8 inches), Y is normal with mean
μ_Y + ρ (σ_Y/σ_X)(x − μ_X) = 69 + .5(80 − 68) = 75
and variance σ_Y²(1 − ρ²) = 4(1 − .25) = 3.
Problems
1. Let X and Y have the bivariate normal distribution. The following facts are known:
μ_X = −1, σ_X = 2, and the best estimate of Y based on X, i.e., the estimate that
minimizes the mean square error, is given by 3X + 7. The minimum mean square error
is 28. Find μ_Y, σ_Y and the correlation coefficient ρ between X and Y.
2. Show that the bivariate normal density belongs to the exponential class, and find the
corresponding complete sufficient statistic.
P{p(X) = .1} = .1.
For example, if X = x₂ then p(X) = p(x₂) = .2, and if X = x₃ then p(X) = p(x₃) = .2.
The total probability that p(X) = .2 is .4.
The continuous case is, at first sight, easier to handle. If X has density f and X = x,
then f(X) = f(x). But what is the density of f(X)? We will not need the result, but the
question is interesting and is considered in Problem 1.
The following two lemmas will be needed to prove the Cramer-Rao inequality, which
can be used to compute uniformly minimum variance unbiased estimates. In the calculations to follow, we are going to assume that all differentiations under the integral sign
are legal.
23.2 Lemma
E_θ[ ∂/∂θ ln f_θ(X) ] = 0.
Proof. The expectation is
∫ [∂/∂θ ln f_θ(x)] f_θ(x) dx = ∫ [∂f_θ(x)/∂θ] (1/f_θ(x)) f_θ(x) dx = ∫ ∂f_θ(x)/∂θ dx,
which reduces to
∂/∂θ ∫ f_θ(x) dx = ∂/∂θ (1) = 0.
23.3 Lemma
Let Y = g(X) and assume E_θ(Y) = k(θ). If k'(θ) = dk(θ)/dθ, then
k'(θ) = E_θ[ Y ∂/∂θ ln f_θ(X) ].
Proof. We have
k'(θ) = ∂/∂θ E_θ[g(X)] = ∂/∂θ ∫ g(x) f_θ(x) dx = ∫ g(x) ∂f_θ(x)/∂θ dx
= ∫ g(x) [∂f_θ(x)/∂θ] (1/f_θ(x)) f_θ(x) dx = ∫ g(x) [∂/∂θ ln f_θ(x)] f_θ(x) dx
= E_θ[ g(X) ∂/∂θ ln f_θ(X) ] = E_θ[ Y ∂/∂θ ln f_θ(X) ].
23.4 Cramér-Rao Inequality
Under the assumptions of (23.3), we have
Var_θ Y ≥ [k'(θ)]² / E_θ[ (∂/∂θ ln f_θ(X))² ].
If X = (X₁, ..., Xₙ) where the Xᵢ are iid with density f_θ(x), then since ∂/∂θ ln f_θ(X) has
mean 0 by (23.2),
E_θ[ (∂/∂θ ln f_θ(X))² ] = Var_θ[ ∂/∂θ ln f_θ(X) ] = Var_θ[ Σᵢ₌₁ⁿ ∂/∂θ ln f_θ(Xᵢ) ]
= n Var_θ[ ∂/∂θ ln f_θ(Xᵢ) ] = n E_θ[ (∂/∂θ ln f_θ(Xᵢ))² ].
23.6 Theorem
Let X₁, ..., Xₙ be iid, each with density f_θ(x). If Y = g(X₁, ..., Xₙ) is an unbiased
estimate of θ, then
Var_θ Y ≥ 1 / ( n E_θ[ (∂/∂θ ln f_θ(Xᵢ))² ] ).
Proof. Applying (23.5), we have a special case of the Cramer-Rao inequality (23.4) with
k(θ) = θ, k'(θ) = 1.
The lower bound in (23.6) is 1/nI(θ), where
I(θ) = E_θ[ (∂/∂θ ln f_θ(Xᵢ))² ]
is the Fisher information. There is another formula for I(θ). Differentiating the identity
∫ [∂/∂θ ln f_θ(x)] f_θ(x) dx = 0
with respect to θ, we get
∫ [∂²/∂θ² ln f_θ(x)] f_θ(x) dx + ∫ [∂/∂θ ln f_θ(x)] [∂f_θ(x)/∂θ] (1/f_θ(x)) f_θ(x) dx = 0,
i.e.,
∫ [∂²/∂θ² ln f_θ(x)] f_θ(x) dx + ∫ [∂/∂θ ln f_θ(x)]² f_θ(x) dx = 0.
Therefore
E_θ[ (∂/∂θ ln f_θ(Xᵢ))² ] = −E_θ[ ∂²/∂θ² ln f_θ(Xᵢ) ].
Problems
1. If X is a random variable with density f(x), explain how to find the distribution of
the random variable f(X).
2. Use the Cramer-Rao inequality to show that the sample mean is a UMVUE of the true
mean in the Bernoulli, normal (with σ² known) and Poisson cases.
24.1 Percentiles
Assume F continuous and strictly increasing. If 0 < p < 1, then the equation F(x) = p
has a unique solution ξ_p, so that P{X ≤ ξ_p} = p. When p = 1/2, ξ_p is the median; when
p = .3, ξ_p is the 30-th percentile, and so on.
Let X₁, ..., Xₙ be iid, each with distribution function F, and let Y₁, ..., Yₙ be the
order statistics. We will consider the problem of estimating ξ_p.
The density of F(Y_k) is n!/[(k − 1)!(n − k)!] x^{k−1}(1 − x)^{n−k}, 0 < x < 1, because F(Xᵢ) is
uniformly distributed on (0,1). Thus
E[F(Y_k)] = ∫₀¹ n!/[(k − 1)!(n − k)!] x^k (1 − x)^{n−k} dx = n!/[(k − 1)!(n − k)!] β(k + 1, n − k + 1)
= k/(n + 1),   1 ≤ k ≤ n.
Consequently
E[F(Y_{k+1}) − F(Y_k)] = 1/(n + 1),   0 ≤ k ≤ n,
where we take Y₀ = −∞ and Y_{n+1} = +∞.
(Note that when k = n, the expectation is 1 − [n/(n + 1)] = 1/(n + 1), as asserted.)
The key point is that on the average, each [Yk , Yk+1 ] produces area 1/(n + 1) under
the density f of the Xi . This is true because
∫_{Y_k}^{Y_{k+1}} f(x) dx = F(Y_{k+1}) − F(Y_k)
and we have just seen that the expectation of this quantity is 1/(n + 1), k = 0, 1, . . . , n.
If we want to accumulate area p, set k/(n + 1) = p, that is, k = (n + 1)p.
Conclusion: If (n + 1)p is an integer, estimate ξ_p by Y_{(n+1)p}.
If (n + 1)p is not an integer, we can use a weighted average. For example, if p = .6 and
n = 13 then (n + 1)p = 14 × .6 = 8.4. Now if (n + 1)p were 8, we would use Y₈, and if
(n + 1)p were 9 we would use Y₉. If (n + 1)p = 8 + ε, we use (1 − ε)Y₈ + εY₉. In the
present case, ε = .4, so we use .6Y₈ + .4Y₉ = Y₈ + .4(Y₉ − Y₈).
P{Yᵢ < ξ_p < Yⱼ} = Σ_{k=i}^{j−1} C(n, k) p^k (1 − p)^{n−k}.
Thus (Yᵢ, Yⱼ) is a confidence interval for ξ_p, and we can find the confidence level by
evaluating the above sum, possibly with the aid of the normal approximation to the
binomial.
Note that ξ_{p₀} > ξ iff F(ξ) < p₀, and ξ_{p₀} = ξ iff F(ξ) = p₀.
In our numerical example, if F(68) were actually .4, then on the average, 40 percent of
the observations will be 68 or less, as opposed to 30 percent if F(68) = .3. Thus a larger
than expected number of observations less than or equal to 68 will tend to make us reject
the hypothesis that the 30-th percentile is exactly 68. In general, our problem will be
H₀: ξ_{p₀} = ξ   (⟺ F(ξ) = p₀)
H₁: ξ_{p₀} < ξ   (⟺ F(ξ) > p₀)
where p₀ and ξ are specified. If Y is the number of observations less than or equal to ξ,
we propose to reject H₀ if Y ≥ c. (If H₁ is ξ_{p₀} > ξ, i.e., F(ξ) < p₀, we reject if Y ≤ c.)
Note that Y is the number of nonpositive signs in the sequence X₁ − ξ, ..., Xₙ − ξ, and
for this reason, the terminology "sign test" is used.
Since we are trying to determine whether F(ξ) is equal to p₀ or greater than p₀, we
may regard θ = F(ξ) as the unknown state of nature. The power function of the test is
K(θ) = P_θ{Y ≥ c} = Σ_{k=c}^{n} C(n, k) θ^k (1 − θ)^{n−k}.
Σ_{k=1}^{n} k² = n(n + 1)(2n + 1)/6,    Σ_{k=1}^{n} k³ = [n(n + 1)/2]².
For a derivation via the calculus of finite differences, see my on-line text A Course in
Commutative Algebra, Section 5.1.
The hypothesis testing problem addressed by the Wilcoxon test is the same as that
considered by the sign test, except that:
(1) We are restricted to testing the median ξ_{.5}.
(2) We assume that X₁, ..., Xₙ are iid and the underlying density is symmetric about
the median (so we are not quite nonparametric). There are many situations where we
suspect an underlying normal distribution but are not sure. In such cases, the symmetry
assumption may be reasonable.
(3) We use the magnitudes as well as the signs of the deviations Xᵢ − ξ_{.5}, so the Wilcoxon
test should be more accurate than the sign test.
Let Rᵢ be the rank of |Xᵢ − ξ_{.5}| when the absolute deviations are arranged in increasing
order, and let Zᵢ = +1 if Xᵢ > ξ_{.5}, Zᵢ = −1 if Xᵢ < ξ_{.5}. The Wilcoxon statistic is
W = Σᵢ₌₁ⁿ Zᵢ Rᵢ.
Under H₀, W has the same distribution as Σᵢ₌₁ⁿ Vᵢ, where the Vᵢ are independent and
Vᵢ = ±i with probability 1/2 each. Thus
E(Vᵢ) = 0,    Var W = Σᵢ₌₁ⁿ i² = n(n + 1)(2n + 1)/6.
The Vᵢ do not have the same distribution, but the central limit theorem still applies
because Liapounov's condition is satisfied:
Σᵢ₌₁ⁿ E[|Vᵢ − μᵢ|³] / (Σᵢ₌₁ⁿ σᵢ²)^{3/2} → 0  as n → ∞.
Now the Vᵢ have mean μᵢ = 0, so |Vᵢ − μᵢ|³ = |Vᵢ|³ = i³ and σᵢ² = Var Vᵢ = i². Thus the
Liapounov fraction is the sum of the first n cubes divided by the 3/2 power of the sum of
the first n squares, which is
[n²(n + 1)²/4] / [n(n + 1)(2n + 1)/6]^{3/2}.
For large n, the numerator is of the order of n⁴ and the denominator is of the order of
(n³)^{3/2} = n^{4.5}, so the fraction is of order n^{−1/2} → 0 and Liapounov's condition holds.
Thus under H₀, W/√(n(n + 1)(2n + 1)/6) is approximately normal (0,1) for large n.
Problems
1. Suppose we are using a sign test with n = 12 observations to decide between the null
hypothesis H₀: m = 40 and the alternative H₁: m > 40, where m is the median. We
use the statistic Y = the number of observations that are less than or equal to 40.
We reject H₀ if and only if Y ≤ c. Find the power function K(p) in terms of c and
p = F(40), and the probability of a type 1 error if c = 2.
2. Let m be the median of a random variable with density symmetric about m. Using
the Wilcoxon test, we are testing H0 : m = 160 vs. H1 : m > 160 based on n = 16
observations, which are as follows: 176.9, 158.3, 152.1, 158.8, 172.4, 169.8, 159.7, 162.7,
156.6, 174.5, 184.4, 165.2, 147.8, 177.8, 160.1, 160.5. Compute the Wilcoxon statistic
and determine whether H₀ is rejected at the .05 significance level, i.e., the probability
of a type 1 error is .05.
3. When n is small, the distribution of W can be found explicitly. Do it for n = 1, 2, 3.
Solutions to Problems
Lecture 1
1. P{max(X, Y, Z) ≤ t} = P{X ≤ t and Y ≤ t and Z ≤ t} = P{X ≤ t}³ by
independence. Thus the distribution function of the maximum is (t⁶)³ = t¹⁸, and the
density is 18t¹⁷, 0 ≤ t ≤ 1.
2. See Figure S1.1. We have
P{Z ≤ z} = ∫∫_{y ≤ zx} f_{XY}(x, y) dx dy = ∫_{x=0}^{∞} ∫_{y=0}^{zx} e^{−x} e^{−y} dy dx,
so
F_Z(z) = ∫_{x=0}^{∞} e^{−x}(1 − e^{−zx}) dx = 1 − 1/(1 + z),   z ≥ 0,
f_Z(z) = 1/(z + 1)²,   z ≥ 0.
f_Y(y) = f_X(y^{1/3}) / (3y^{2/3}) = (3y^{2/3}/b³) / (3y^{2/3}) = 1/b³.
f_Y(y) = f_X(tan⁻¹ y) / |dy/dx|_{x = tan⁻¹ y} = (1/π) / sec² x = 1/[π(1 + y²)],
the Cauchy density.
Lecture 2
1. We have y₁ = 2x₁, y₂ = x₂ − x₁, so x₁ = y₁/2, x₂ = (y₁/2) + y₂, and
∂(y₁, y₂)/∂(x₁, x₂) = det [ 2   0 ] = 2.
                          [ −1  1 ]
[Figure S1.1: the region y ≤ zx in the (x, y) plane.]
[Figure S1.2: the curve Y = X³, with the point x = y^{1/3} marked.]
Thus f_{Y₁Y₂}(y₁, y₂) = (1/2)f_{X₁X₂}(x₁, x₂) = e^{−x₁−x₂} = exp[−(y₁/2) − (y₁/2) − y₂] =
e^{−y₁}e^{−y₂}. As indicated in the comments, the range of the y's is 0 < y₁ < ∞, 0 < y₂ < ∞.
Therefore the joint density of Y₁ and Y₂ is the product of a function of y₁ alone and
a function of y₂ alone, which forces independence.
2. We have y₁ = x₁/x₂, y₂ = x₂, so x₁ = y₁y₂, x₂ = y₂ and
∂(x₁, x₂)/∂(y₁, y₂) = det [ y₂  y₁ ] = y₂.
                          [ 0   1  ]
Thus f_{Y₁Y₂}(y₁, y₂) = f_{X₁X₂}(x₁, x₂)|∂(x₁, x₂)/∂(y₁, y₂)| = (8y₁y₂)(y₂)(y₂) = 2y₁(4y₂³).
Since 0 < x₁ < x₂ < 1 is equivalent to 0 < y₁ < 1, 0 < y₂ < 1, it follows just as in
Problem 1 that Y₁ and Y₂ are independent.
3. The Jacobian ∂(x₁, x₂, x₃)/∂(y₁, y₂, y₃) is given by
det [ y₂y₃     y₁y₃         y₁y₂      ]
    [ −y₂y₃    y₃ − y₁y₃    y₂ − y₁y₂ ]
    [ 0        −y₃          1 − y₂    ]
= (y₂y₃² − y₁y₂y₃²)(1 − y₂) + y₁y₂²y₃² + y₃(y₂ − y₁y₂)y₂y₃ + (1 − y₂)y₁y₂y₃²,
which cancels down to y₂y₃². Thus
f_{Y₁Y₂Y₃}(y₁, y₂, y₃) = exp[−(x₁ + x₂ + x₃)] y₂y₃² = y₂y₃² e^{−y₃}.
This can be expressed as (1)(2y₂)(y₃²e^{−y₃}/2), and since x₁, x₂, x₃ > 0 is equivalent to
0 < y₁ < 1, 0 < y₂ < 1, y₃ > 0, it follows as before that Y₁, Y₂, Y₃ are independent.
Lecture 3
1. M_{X₂}(t) = M_Y(t)/M_{X₁}(t) = (1 − 2t)^{−r/2} / (1 − 2t)^{−r₁/2} = (1 − 2t)^{−(r−r₁)/2}, which is
χ²(r − r₁).
[Figure S1.3: the graph of y = arctan x, with asymptotes at ±π/2.]
2. The moment-generating function of c₁X₁ + c₂X₂ is
E[e^{t(c₁X₁+c₂X₂)}] = E[e^{tc₁X₁}] E[e^{tc₂X₂}] = (1 − β₁c₁t)^{−α₁}(1 − β₂c₂t)^{−α₂}.
If β₁c₁ = β₂c₂, then c₁X₁ + c₂X₂ is gamma with α = α₁ + α₂ and β = βᵢcᵢ.
3. M(t) = E[exp(t Σᵢ₌₁ⁿ cᵢXᵢ)] = Πᵢ₌₁ⁿ E[exp(tcᵢXᵢ)] = Πᵢ₌₁ⁿ Mᵢ(cᵢt).
4. Apply Problem 3 with cᵢ = 1 for all i. Thus
M_Y(t) = Πᵢ₌₁ⁿ Mᵢ(t) = Πᵢ₌₁ⁿ exp[λᵢ(e^t − 1)] = exp[(λ₁ + ··· + λₙ)(e^t − 1)],
which is Poisson (λ₁ + ··· + λₙ).
5. Since the coin is unbiased, X2 has the same distribution as the number of heads in the
second experiment. Thus X1 + X2 has the same distribution as the number of heads
in n1 + n2 tosses, namely binomial with n = n1 + n2 and p = 1/2.
Lecture 4
1. Let Φ be the normal (0,1) distribution function, and recall that Φ(−x) = 1 − Φ(x).
Then
P{μ − c < X̄ < μ + c} = P{ −c√n/σ < (X̄ − μ)/(σ/√n) < c√n/σ } = 2Φ(c√n/σ) − 1.
P{X̄ − μ > c} = 1 − Φ(c√n/σ) = Φ(−c√n/σ).
3. Since nS²/σ² is χ²(n − 1), we have
P{a < S² < b} = P{ na/σ² < χ²(n − 1) < nb/σ² }.
E[e^{tS²}] = E[exp((nS²/σ²)(tσ²/n))] = E[exp(tσ²X/n)]
where the random variable X is χ²(n − 1), and therefore has moment-generating function M(t) = (1 − 2t)^{−(n−1)/2}. Replacing t by tσ²/n we get
M_{S²}(t) = (1 − 2tσ²/n)^{−(n−1)/2}.
Lecture 5
1. By definition of the beta density,
E(X) = [Γ(a + b)/(Γ(a)Γ(b))] ∫₀¹ x^a (1 − x)^{b−1} dx = a/(a + b),
and similarly
E(X²) = (a + 1)a / [(a + b + 1)(a + b)].
Thus
Var X = E(X²) − [E(X)]² = [1/((a + b)²(a + b + 1))] [(a + 1)a(a + b) − a²(a + b + 1)]
= ab / [(a + b)²(a + b + 1)].
4. Suppose we want P{W ≤ c} = .05. Equivalently, P{1/W ≥ 1/c} = .05, hence
P{1/W ≤ 1/c} = .95. By Problem 3, 1/W is F(n, m), so 1/c can be found from the
F table, and we can then compute c. The analysis is similar for .1, .025 and .01.
5. If N is normal (0,1), then T(n) = N/√(χ²(n)/n). Thus T²(n) = N²/(χ²(n)/n). But
N² is χ²(1), and the result follows.
6. If Y = 2X then f_Y(y) = f_X(x)|dx/dy| = (1/2)e^{−x} = (1/2)e^{−y/2}, y ≥ 0, the chi-square
density with two degrees of freedom. If X₁ and X₂ are independent exponential random
variables, then
X₁/X₂ = [(2X₁)/2] / [(2X₂)/2] = [χ²(2)/2] / [χ²(2)/2] = F(2, 2).
Lecture 6
1. Apply the formula for the joint density of Yⱼ and Y_k with j = 1, k = 3, n = 3, F(x) =
x, f(x) = 1, 0 < x < 1. The result is f_{Y₁Y₃}(x, y) = 6(y − x), 0 < x < y < 1. Now let
Z = Y₃ − Y₁, W = Y₃. The Jacobian of the transformation has absolute value 1, so
f_{ZW}(z, w) = f_{Y₁Y₃}(y₁, y₃) = 6(y₃ − y₁) = 6z, 0 < z < w < 1. Thus
f_Z(z) = ∫_{w=z}^{1} 6z dw = 6z(1 − z),   0 < z < 1.
2. The probability that more than one random variable falls in [x, x + dx] need not be
negligible. For example, there can be a positive probability that two observations
coincide with x.
3. The density of Y_k is
f_{Y_k}(x) = n!/[(k − 1)!(n − k)!] x^{k−1}(1 − x)^{n−k},   0 < x < 1,
and
P{Y_k > p} = Σᵢ₌₀^{k−1} C(n, i) p^i (1 − p)^{n−i}.
Lecture 7
1. Let Wₙ = (Sₙ − E(Sₙ))/n; then E(Wₙ) = 0 for all n, and
Var Wₙ = Var Sₙ/n² = (1/n²) Σᵢ₌₁ⁿ σᵢ² ≤ nM/n² = M/n → 0.
It follows that Wₙ → 0 in probability.
4. Let Fₙ be the distribution function of Xₙ. For all x, Fₙ(x) = 0 for sufficiently large
n. Since the identically zero function cannot be a distribution function, there is no
limiting distribution.
Lecture 8
1. Note that M_{X_n}(t) = 1/(1 − βt)ⁿ, where 1/(1 − βt) is the moment-generating function of an
exponential random variable (which has mean β). By the weak law of large numbers,
X_n/n → β in probability, hence X_n/n → β in distribution.
2. χ²(n) = Σᵢ₌₁ⁿ Xᵢ², where the Xᵢ are iid, each normal (0,1). Thus the central limit
theorem applies.
3. We have n Bernoulli trials, with probability of success p = ∫ₐᵇ f(x) dx on a given trial.
Thus Yₙ is binomial (n, p). If n and p satisfy the sufficient condition given in the text,
the normal approximation with E(Yₙ) = np and Var Yₙ = np(1 − p) should work well
in practice.
4. We have E(Xᵢ) = 0 and
Var Xᵢ = E(Xᵢ²) = ∫_{−1/2}^{1/2} x² dx = 2 ∫₀^{1/2} x² dx = 1/12.
By the central limit theorem, Yₙ is approximately normal with E(Yₙ) = 0 and Var Yₙ =
n/12.
5. Let Wₙ = n(1 − F(Yₙ)). Then
P{Wₙ ≥ w} = P{F(Yₙ) ≤ 1 − (w/n)} = P{max F(Xᵢ) ≤ 1 − (w/n)},
hence
P{Wₙ ≥ w} = (1 − w/n)ⁿ,   0 ≤ w ≤ n,
which converges to e^{−w} as n → ∞.
Lecture 9
1. (a) We have
f(x₁, ..., xₙ) = θ^{x₁+···+xₙ} e^{−nθ} / (x₁! ··· xₙ!),
so
∂/∂θ [ (Σᵢ xᵢ) ln θ − nθ ] = (Σᵢ xᵢ)/θ − n = 0,   θ̂ = X̄.
(b) ∂/∂θ [ n ln θ + (θ − 1) Σᵢ₌₁ⁿ ln xᵢ ] = n/θ + Σᵢ₌₁ⁿ ln xᵢ = 0,   θ̂ = −n / Σᵢ₌₁ⁿ ln xᵢ.
(c) ∂/∂θ [ −n ln θ − (Σᵢ xᵢ)/θ ] = −n/θ + (Σᵢ xᵢ)/θ² = 0,   θ̂ = X̄.
(d) f(x₁, ..., xₙ) = (1/2)ⁿ exp[ −Σᵢ₌₁ⁿ |xᵢ − θ| ]. We must minimize Σᵢ₌₁ⁿ |xᵢ − θ|,
and we must be careful when differentiating because of the absolute values. If the
order statistics of the xᵢ are yᵢ, i = 1, ..., n, and y_k < θ < y_{k+1}, then the sum to be
minimized is
(θ − y₁) + ··· + (θ − y_k) + (y_{k+1} − θ) + ··· + (yₙ − θ).
The derivative of the sum with respect to θ is the number of yᵢ's less than θ minus the
number of yᵢ's greater than θ. Thus as θ increases, Σᵢ₌₁ⁿ |xᵢ − θ| decreases until the
number of yᵢ's less than θ equals the number of yᵢ's greater than θ. We conclude that
θ̂ is the median of the Xᵢ.
(e) f(x₁, ..., xₙ) = exp[ −Σᵢ₌₁ⁿ xᵢ ] e^{nθ} if all xᵢ ≥ θ, and 0 elsewhere. Thus
f(x₁, ..., xₙ) is maximized by taking θ as large as possible subject to θ ≤ min xᵢ, so
θ̂ = min(X₁, ..., Xₙ) = Y₁.
2. Any function h satisfying
Yₙ − (1/2) ≤ h(X₁, ..., Xₙ) ≤ Y₁ + (1/2)
for all X₁, ..., Xₙ is an MLE of θ. Some solutions are h = Y₁ + (1/2), h = Yₙ − (1/2),
h = (Y₁ + Yₙ)/2, h = (2Y₁ + 4Yₙ − 1)/6 and h = (4Y₁ + 2Yₙ + 1)/6. In all cases, the
inequalities reduce to Yₙ − Y₁ ≤ 1, which is true.
3. (a) Xᵢ is Poisson (θ) so E(Xᵢ) = θ. The method of moments sets X̄ = θ, so the
estimate of θ is θ̂ = X̄, which is consistent by the weak law of large numbers.
(b) E(Xᵢ) = ∫₀¹ x θx^{θ−1} dx = θ/(θ + 1) = X̄, so θ = X̄θ + X̄ and
θ̂ = X̄/(1 − X̄), which converges in probability to [θ/(θ + 1)] / [1 − θ/(θ + 1)] = θ,
hence θ̂ is consistent.
(c) E(Xᵢ) = θ = X̄, so θ̂ = X̄, consistent by the weak law of large numbers.
(d) By symmetry, E(Xᵢ) = θ so θ̂ = X̄ as in (a) and (c).
(e) E(Xᵢ) = ∫_θ^∞ x e^{−(x−θ)} dx = (with y = x − θ) ∫₀^∞ (y + θ)e^{−y} dy = 1 + θ = X̄. Thus
θ̂ = X̄ − 1, which converges in probability to (1 + θ) − 1 = θ, proving consistency.
4. P{X ≤ r} = ∫₀^r (1/θ)e^{−x/θ} dx = [−e^{−x/θ}]₀^r = 1 − e^{−r/θ}. The MLE of θ is θ̂ = X̄ [see
Problem 1(c)], so the MLE of 1 − e^{−r/θ} is 1 − e^{−r/X̄}.
5. The MLE of θ is X/n, the relative frequency of success. Since
P{a ≤ X ≤ b} = Σ_{k=a}^{b} C(n, k) θ^k (1 − θ)^{n−k},
Lecture 10
1. Set 2Φ(b) − 1 equal to the desired confidence level. This, along with the table of the
normal (0,1) distribution function, determines b. The length of the confidence interval
is 2bσ/√n.
2. Set 2F_T(b) − 1 equal to the desired confidence level. This, along with the table of the
T(n − 1) distribution function, determines b. The length of the confidence interval is
2bS/√(n − 1).
3. In order to compute the expected length of the confidence interval, we must compute
E(S), and the key observation is
S = (σ/√n) √(nS²/σ²),   where nS²/σ² is χ²(n − 1).
If f(x) is the chi-square density with r = n − 1 degrees of freedom [see (3.8)], then the
expected length is
[2b/√(n − 1)] (σ/√n) ∫₀^∞ x^{1/2} f(x) dx,
and an appropriate change of variable reduces the integral to a gamma function which
can be evaluated explicitly.
4. We have E(Xᵢ) = θ and Var(Xᵢ) = θ². For large n, (X̄ − θ)/(θ/√n) is approximately
normal (0,1), so with c = 1/√n,
P{ −b < (X̄ − θ)/(θ/√n) < b } ≈ Φ(b) − Φ(−b) = 2Φ(b) − 1,
and if we set this equal to the desired level of confidence, then b is determined. The
confidence interval is given by θ(1 − bc) < X̄ < θ(1 + bc), or
X̄/(1 + bc) < θ < X̄/(1 − bc),
where c → 0 as n → ∞.
5. A confidence interval of length L corresponds to |(Yₙ/n) − p| < L/2, an event with
probability approximately
2Φ( L√n / (2√(p(1 − p))) ) − 1.
Setting this probability equal to the desired confidence level gives an inequality of the
form
L√n / (2√(p(1 − p))) > c.
As in the text, we can replace p(1 − p) by its maximum value 1/4. We find the minimum
value of n by squaring both sides.
In the first example in (10.1), we have L = .02, L/2 = .01 and c = 1.96. This problem
essentially reproduces the analysis in the text in a more abstract form. Specifying how
close to p we want our estimate to be (at the desired level of confidence) is equivalent
to specifying the length of the confidence interval.
Lecture 11
1. Proceed as in (11.1):
Z = [X̄ − Ȳ − (μ₁ − μ₂)] / √(σ₁²/n + σ₂²/m)
is normal (0,1), and with W = nS₁²/σ₁² + mS₂²/σ₂², the statistic √(n + m − 2) Z/√W has
a T distribution.
2. If σ₁² = cσ₂², then
σ₁²/n + σ₂²/m = cσ₂² [ 1/n + 1/(cm) ]
and
nS₁²/σ₁² + mS₂²/σ₂² = (nS₁² + cmS₂²) / (cσ₂²).
Thus σ₂² can again be eliminated, and confidence intervals can be constructed, assuming
c known.
Lecture 12
1. The given test is an LRT and is completely determined by c, independently of θ > θ₀.
2. The likelihood ratio is L(x) = f₁(x)/f₀(x) = (1/4)/(1/6) = 3/2 for x = 1, 2, and
L(x) = (1/8)/(1/6) = 3/4 for x = 3, 4, 5, 6. If 0 < λ < 3/4, we reject for all x, and
α = 1, β = 0. If 3/4 < λ < 3/2, we reject for x = 1, 2 and accept for x = 3, 4, 5, 6, with
α = 1/3 and β = 1/2. If λ > 3/2, we accept for all x, with α = 0, β = 1.
For α = .1, set λ = 3/2, accept when x = 3, 4, 5, 6, reject with probability a when
x = 1, 2. Then α = (1/3)a = .1, a = .3 and β = (1/2) + (1/2)(1 − a) = .85.
3. Since (220 − 200)/10 = 2, it follows that when c reaches 2, the null hypothesis is accepted.
The associated type 1 error probability is α = 1 − Φ(2) = 1 − .977 = .023. Thus the
given result is significant even at the significance level .023. If we were to take additional
observations, enough to drive the probability of a type 1 error down to .023, we would
still reject H₀. Thus the p-value is a concise way of conveying a lot of information
about the test.
Lecture 13
1. We sum (Xᵢ − npᵢ)²/npᵢ, i = 1, 2, 3, where the Xᵢ are the observed frequencies and the
npᵢ = 50, 30, 20 are the expected frequencies. The chi-square statistic is
(40 − 50)²/50 + (33 − 30)²/30 + (27 − 20)²/20 = 2 + .3 + 2.45 = 4.75.
Since P{χ²(2) > 5.99} = .05 and 4.75 < 5.99, we accept H₀.
2. The expected frequencies are given by

        A     B     C
  1    49   147    98
  2    51   153   102

For example, to find the entry in the 2C position, we can multiply the row 2 sum by
the column C sum and divide by the total number of observations (namely 600) to get
(306)(200)/600 = 102. Alternatively, we can compute P(C) = (98 + 102)/600 = 1/3.
We multiply this by the row 2 sum 306 to get 306/3 = 102. The chi-square statistic is
(33 − 49)²/49 + (147 − 147)²/147 + (114 − 98)²/98 + (67 − 51)²/51 + (153 − 153)²/153 + (86 − 102)²/102,
which is 5.224 + 0 + 2.612 + 5.020 + 0 + 2.510 = 15.366. There are (h − 1)(k − 1) = 1 × 2 = 2
degrees of freedom, and P{χ²(2) > 5.99} = .05. Since 15.366 > 5.99, we reject H₀.
3. The observed frequencies minus the expected frequencies are
a − (a + b)(a + c)/(a + b + c + d) = (ad − bc)/(a + b + c + d),
b − (a + b)(b + d)/(a + b + c + d) = (bc − ad)/(a + b + c + d),
c − (a + c)(c + d)/(a + b + c + d) = (bc − ad)/(a + b + c + d),
d − (c + d)(b + d)/(a + b + c + d) = (ad − bc)/(a + b + c + d).
Lecture 14
1. The joint probability function is
f(x₁, ..., xₙ) = Πᵢ₌₁ⁿ [ θ^{xᵢ} e^{−θ} / xᵢ! ] = e^{−nθ} θ^{u(x)} / (x₁! ··· xₙ!),
where u(x) = Σᵢ₌₁ⁿ xᵢ.
f(x₁, ..., xₙ) = Πᵢ₌₁ⁿ A(θ)B(xᵢ) I[ max_{1≤i≤n} xᵢ < θ ] = [A(θ)]ⁿ I[ max_{1≤i≤n} xᵢ < θ ] Πᵢ₌₁ⁿ B(xᵢ),
where I is an indicator, so by the factorization theorem max Xᵢ is sufficient, with
h(x₁, ..., xₙ) = Πᵢ₌₁ⁿ B(xᵢ).
5. f(x) = [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1}(1 − x)^{b−1} on (0,1). In this case, a = θ and b = 2.
Thus f(x) = θ(θ + 1)x^{θ−1}(1 − x), so
f(x₁, ..., xₙ) = θⁿ(θ + 1)ⁿ ( Πᵢ₌₁ⁿ xᵢ )^{θ−1} Πᵢ₌₁ⁿ (1 − xᵢ),
and by the factorization theorem Πᵢ₌₁ⁿ Xᵢ is sufficient, with h(x₁, ..., xₙ) = Πᵢ₌₁ⁿ (1 − xᵢ).
6. We have
f(x) = [1/(Γ(α)θ^α)] x^{α−1} e^{−x/θ},   x > 0,
so
f(x₁, ..., xₙ) = [1/(Γ(α))ⁿ θ^{nα}] [u(x)]^{α−1} exp[ −Σᵢ₌₁ⁿ xᵢ/θ ],   u(x) = Πᵢ₌₁ⁿ xᵢ.
7. We have
P_θ{X′₁ = x₁, ..., X′ₙ = xₙ} = P_θ{Y = y} P{X′₁ = x₁, ..., X′ₙ = xₙ | Y = y}.
We can drop the subscript θ since Y is sufficient, and we can replace X′ᵢ by Xᵢ by
definition of B's experiment. The result is
P_θ{X′₁ = x₁, ..., X′ₙ = xₙ} = P_θ{X₁ = x₁, ..., Xₙ = xₙ}
as desired.
Lecture 17
1. Take u(X) = X.
2. The joint density is
f(x₁, ..., xₙ) = exp[ −Σᵢ₌₁ⁿ (xᵢ − θ) ] I[ min xᵢ > θ ].
Since
P{Y₁ > y} = ∫_y^∞ n exp[−n(x − θ)] dx = exp[−n(y − θ)],
we have
F_{Y₁}(y) = 1 − e^{−n(y−θ)},   y > θ.
The expectation of g(Y₁) under θ is
E_θ[g(Y₁)] = ∫₀^∞ z n exp(−nz) dz + θ = (1/n) + θ.
E(1/Y) = ∫₀^∞ (1/y) y^{2n−1} e^{−y/θ} / [Γ(2n)θ^{2n}] dy = (with z = y/θ)
∫₀^∞ z^{2n−2} e^{−z} dz / [Γ(2n)θ] = Γ(2n − 1)/[Γ(2n)θ] = 1/[(2n − 1)θ].
6. Since Xᵢ/√θ is normal (0,1), Y/θ is χ²(n), which has mean n and variance 2n. Thus
E[(Y/θ)²] = n² + 2n, so E(Y²) = θ²(n² + 2n). Therefore the UMVUE of θ² is Y²/(n² +
2n).
7. (a) E[E(I|Y)] = E(I) = P{X₁ ≤ 1}, and the result follows by completeness.
(b) We compute
P{X₁ = r | X₁ + ··· + Xₙ = s} = P{X₁ = r, X₂ + ··· + Xₙ = s − r} / P{X₁ + ··· + Xₙ = s}.
The numerator is
[ e^{−θ} θ^r / r! ] [ e^{−(n−1)θ} ((n − 1)θ)^{s−r} / (s − r)! ]
and the denominator is
e^{−nθ} (nθ)^s / s!,
so the conditional probability is
C(s, r) (n − 1)^{s−r} / n^s = C(s, r) (1/n)^r (1 − 1/n)^{s−r},
which is the probability of r successes in s Bernoulli trials, with probability of success
1/n on a given trial. Intuitively, if the sum is s, then each contribution to the sum is
equally likely to come from X₁, ..., Xₙ.
(c) By (b), P{X₁ = 0 | Y} + P{X₁ = 1 | Y} is given by
(1 − 1/n)^Y + Y (1/n)(1 − 1/n)^{Y−1} = (1 − 1/n)^Y [ 1 + (Y/n) / ((n − 1)/n) ]
= ((n − 1)/n)^Y [ 1 + Y/(n − 1) ].
Since
−(1/2θ₂) Σᵢ₌₁ⁿ (xᵢ − θ₁)² = −(1/2θ₂) Σᵢ₌₁ⁿ xᵢ² + (θ₁/θ₂) Σᵢ₌₁ⁿ xᵢ − nθ₁²/(2θ₂),
Lecture 18
1. By (18.4), the numerator of θ̂(x) is
∫₀¹ θ · [ θ^{r−1}(1 − θ)^{s−1} / β(r, s) ] C(n, x) θ^x (1 − θ)^{n−x} dθ
and the denominator is
∫₀¹ [ θ^{r−1}(1 − θ)^{s−1} / β(r, s) ] C(n, x) θ^x (1 − θ)^{n−x} dθ.
Thus θ̂(x) is
β(r + x + 1, n − x + s) / β(r + x, n − x + s) = [Γ(r + x + 1)/Γ(r + x)] [Γ(r + s + n)/Γ(r + s + n + 1)]
= (r + x)/(r + s + n).
2. The risk function is
R_{θ̂}(θ) = E_θ[ ( (r + X)/(r + s + n) − θ )² ] = [1/(r + s + n)²] E_θ[ (X − nθ + r − rθ − sθ)² ]
with E_θ(X − nθ) = 0 and E_θ[(X − nθ)²] = Var X = nθ(1 − θ). Thus
R_{θ̂}(θ) = [1/(r + s + n)²] [ nθ(1 − θ) + (r − rθ − sθ)² ].
In particular, the risk at θ = 0 is r²/(r + s + n)².
4. The average loss using δ is B(δ) = ∫ h(θ) R_δ(θ) dθ. If δ*(x) has a smaller maximum
risk than δ(x), then since R_δ is constant, we have R_{δ*}(θ) < R_δ(θ) for all θ. Therefore
B(δ*) < B(δ), contradicting the fact that δ is a Bayes estimate.
Lecture 20
1.
Var(XY) = E[(XY)²] − (E(X)E(Y))² = E(X²)E(Y²) − (EX)²(EY)²
= (σ_X² + μ_X²)(σ_Y² + μ_Y²) − μ_X²μ_Y² = σ_X²σ_Y² + μ_X²σ_Y² + μ_Y²σ_X².
2.
Var(aX + bY) = Var(aX) + Var(bY) + 2ab Cov(X, Y)
= a²σ_X² + b²σ_Y² + 2abρσ_Xσ_Y.
3.
Cov(X, X + Y) = Cov(X, X) + Cov(X, Y) = Var X + 0 = σ_X².
4. By Problem 3,
ρ_{X,X+Y} = σ_X² / (σ_X σ_{X+Y}) = σ_X / √(σ_X² + σ_Y²).
5.
Cov(XY, X) = E(X²)E(Y) − E(X)²E(Y) = (σ_X² + μ_X²)μ_Y − μ_X²μ_Y = σ_X²μ_Y.
6. We can assume without loss of generality that E(X²) > 0 and E(Y²) > 0. We will
have equality iff the discriminant b² − 4ac = 0, which holds iff h(λ) = 0 for some λ.
Equivalently, λX + Y = 0 for some λ. We conclude that equality holds if and only if
X and Y are linearly dependent.
Lecture 21
1. Let Yᵢ = Xᵢ − E(Xᵢ); then E[(Σᵢ₌₁ⁿ tᵢYᵢ)²] ≥ 0 for all t. But this expectation is
E[ Σᵢ tᵢYᵢ Σⱼ tⱼYⱼ ] = Σᵢ,ⱼ tᵢ σᵢⱼ tⱼ = t'Kt,
so K is nonnegative definite, and being nonsingular, positive definite (it is symmetric
because σᵢⱼ = σⱼᵢ). If K is singular, then t'Kt = 0 for some t ≠ 0, so some nontrivial
linear combination of the Xᵢ is constant with probability 1.
4. With Z = c₁X₁ + ··· + cₙXₙ,
E(e^{tZ}) = E[ exp( Σᵢ₌₁ⁿ cᵢtXᵢ ) ] = M_X(c₁t, ..., cₙt)
= exp( t Σᵢ₌₁ⁿ cᵢμᵢ ) exp( (1/2) t² Σᵢ,ⱼ₌₁ⁿ cᵢaᵢⱼcⱼ ),
which is the moment-generating function of a normal random variable with mean Σᵢ cᵢμᵢ
and variance Σᵢ,ⱼ cᵢaᵢⱼcⱼ.
Lecture 22
1. If y is the best estimate of Y given X = x, then
y − μ_Y = ρ (σ_Y/σ_X)(x − μ_X)
and [see (20.1)] the minimum mean square error is σ_Y²(1 − ρ²), which in this case is 28.
We are given that ρσ_Y/σ_X = 3, so ρσ_Y = 3 × 2 = 6 and ρ² = 36/σ_Y². Therefore
σ_Y²(1 − 36/σ_Y²) = σ_Y² − 36 = 28,   σ_Y² = 64,   σ_Y = 8,   ρ² = 36/64,   ρ = .75.
Finally, y = μ_Y + 3x − 3μ_X = μ_Y + 3x + 3 = 3x + 7, so μ_Y = 4.
2. The bivariate normal density is of the form
f(x, y) = a(θ)b(x, y) exp[ p₁(θ)x² + p₂(θ)y² + p₃(θ)xy + p₄(θ)x + p₅(θ)y ],
so we are in the exponential class. Thus
( ΣXᵢ², ΣYᵢ², ΣXᵢYᵢ, ΣXᵢ, ΣYᵢ )
is a complete sufficient statistic for θ = (σ_X², σ_Y², ρ, μ_X, μ_Y). Note also that any statistic
in one-to-one correspondence with this one is also complete and sufficient.
Lecture 23
1. The probability of any event is found by integrating the density over the set defined by
the event. Thus
P{a ≤ f(X) ≤ b} = ∫_A f(x) dx,   A = {x : a ≤ f(x) ≤ b}.
2. Bernoulli:
∂/∂θ ln f_θ(x) = ∂/∂θ [ x ln θ + (1 − x) ln(1 − θ) ] = x/θ − (1 − x)/(1 − θ),
∂²/∂θ² ln f_θ(x) = −x/θ² − (1 − x)/(1 − θ)²,
I(θ) = E_θ[ X/θ² + (1 − X)/(1 − θ)² ] = 1/θ + 1/(1 − θ) = 1/[θ(1 − θ)],
so the lower bound is
1/[nI(θ)] = θ(1 − θ)/n.
But
Var X̄ = (1/n²) Var[binomial(n, θ)] = nθ(1 − θ)/n² = θ(1 − θ)/n,
so X̄ is a UMVUE of θ.
Normal:
f_θ(x) = [1/(σ√(2π))] exp[ −(x − θ)²/2σ² ],
∂/∂θ ln f_θ(x) = ∂/∂θ [ −(x − θ)²/2σ² ] = (x − θ)/σ²,
∂²/∂θ² ln f_θ(x) = −1/σ²,   I(θ) = 1/σ²,
so
Var Y ≥ σ²/n = Var X̄,
and X̄ is a UMVUE of θ.
Poisson:
ln f_θ(x) = −θ + x ln θ − ln x!,   ∂/∂θ ln f_θ(x) = −1 + x/θ,
∂²/∂θ² ln f_θ(x) = −x/θ²,   I(θ) = E_θ[X/θ²] = 1/θ,
so
Var Y ≥ θ/n = Var X̄,
and X̄ is a UMVUE of θ.
Lecture 25
1. With n = 12,
K(p) = Σ_{k=0}^{c} C(12, k) p^k (1 − p)^{12−k},
and the probability of a type 1 error for c = 2 is K(1/2) = (1 + 12 + 66)/2¹² = 79/4096 ≈ .019.
2. Here n = 16, so Var W = n(n + 1)(2n + 1)/6 = 1496 and √(Var W) = √1496 = 38.678.
3. For n = 1, the moment-generating function of W is (1/2)(e^t + e^{−t}), so W = ±1 with
probability 1/2 each. For n = 2,
M_W(t) = (1/2)(e^t + e^{−t}) (1/2)(e^{2t} + e^{−2t}) = (1/4)(e^{3t} + e^t + e^{−t} + e^{−3t}),
so W takes the values ±3, ±1, each with probability 1/4. For n = 3,
M_W(t) = (1/4)(e^{3t} + e^t + e^{−t} + e^{−3t}) (1/2)(e^{3t} + e^{−3t})
= (1/8)(e^{6t} + e^{4t} + e^{2t} + 1 + 1 + e^{−2t} + e^{−4t} + e^{−6t}),
so W takes the values ±6, ±4, ±2 with probability 1/8 each, and the value 0 with probability 1/4.
Index
Bayes estimate, 18.1
Bernoulli trials, 3.4, 7.3
beta distribution, 5.4
bivariate normal distribution, 22.1
Cauchy density, Lecture 2, Problem 5
Cauchy-Schwarz inequality, 20.3
central limit theorem, 8.2
Chebyshev's inequality, 7.1
chi square distribution, 3.8, 4.2
chi square tests, 13.1
complete sufficient statistic, 16.1, 17.1
confidence intervals, Lectures 10, 11, 24.3
consistent estimate, 7.6, 9.3
convergence in distribution, 7.4, 8.3
convergence in probability, 7.4, 8.3
convolution, 3.11
correlation coefficient, 20.1
covariance, 20.1
covariance matrix, 21.4, 22.1
Cramer-Rao inequality, 23.4
critical region, 12.3
density function method, Lecture 1
distribution function method, Lecture 1
eigenvalues and eigenvectors, 19.1
equality of distributions, 13.3
estimation, 9.1
exponential class (exponential family), 16.3, 17.1
exponential distribution, 3.8
F distribution, 5.3
factorization theorem, 14.3
Fisher information, 23.6
gamma distribution, 3.7
goodness of fit, 13.2
hypothesis testing, 12.1, 24.4
inner product (dot product), 19.1
Jacobian, 2.1
jointly Gaussian random variables, 21.1
least squares, 20.5
Lehmann-Scheffé theorem, 16.2
Liapounov condition, 25.2
likelihood ratio tests, 12.2
limiting distribution, Lecture 7, Problems 3,4
maximum likelihood estimate, 9.2
method of moments, 9.6
moment-generating functions, 3.1, 21.1
multivariate normal distribution, 21.1
negative binomial distribution, 16.5
Neyman-Pearson lemma, 12.6
nonnegative definite, 19.5
nonparametric statistics, Lectures 24, 25
normal approximation to the binomial, 8.4
normal distribution, 3.4,
normal sampling, 4.2
order statistics, 6.1
orthogonal decomposition, 19.4
p-value, Lecture 12, Problem 3
percentiles, 24.1
point estimates, 9.1, 24.2
Poisson distribution, 3.4
Poisson process, 3.12
positive definite, 19.5
power function, 12.3
quadratic form, 19.5
quadratic loss function, 18.2
Rao-Blackwell theorem, 15.7
regression line, 22.2
sample mean, 4.1
sample variance, 4.1
sampling without replacement, 10.3
sign test, 24.4
significance level, 12.3
simulation, 8.5
sufficient statistics, 14.1
symmetric matrices, Lecture 19
T distribution, 4.2, 5.1
testing for independence, 13.4
transformation of random variables, Lecture 1
type 1 and type 2 errors, 12.1
unbiased estimate, 4.1, 17.3
uniformly minimum variance unbiased estimate (UMVUE), 16.2
weak law of large numbers, 7.2
Wilcoxon test, 25.1
Errata
There are some minor typos in the following locations.
Section 2.1, line 6 (a section heading counts as line 0) Change d to
Section 2.2, line -6 Capitalize j
Section 3.6, line -2 Change ti to t
Section 4.3, line -9 Change us to use
Section 4.3, line -11 Change w to We
Section 5.2, line 5 Insert y on the right side of the equation
Section 5.2, line 6 Insert y before dy
Section 5.4, line 2 Change to 1
Section 6.1, very end of the third display. Change 4 to r
Section 7.5, line 6 Delete the asterisk
Section 7.6, remove from two of the figures
Section 8.3, line 1 Change X to c, change then Xn converges to then Xn also converges
Section 8.4, line -3 Close up the space
Section 8.5, line 1 change an to can
Section 8.5, Figure 8.1 Remove
Section 9.3, line 5 ln should be roman, not italic
Section 10.3, line 3 Add a space after the comma
Section 10.3, line 10 In the rst summation, change Xi to Var Xi
Section 12.1, line -6 Change mall to small
Section 12.2, line -6 Change are to rare
Section 12.7, line -1 Add right parenthesis after H1
Section 13.2, line -3 Change reduced to reduce and change degrees to degrees of freedom
Section 16.1, line 2 unbiased should only appear once
Section 16.2, line -2 unbiased has only one s
Section 16.4, Example 6, displayed equation beginning with P {Yr = k}, change x to k
Section 17.2, line 6 Change N to M
Section 21.2, line 3 Change i to i
Section 23.3, line 4 Put brackets around g(X)
Section 23.6, line -1 Put right parenthesis after estimate
Solution to Lecture 6, Problem 3, second line, change the 0 before the equals sign to a
right parenthesis