PROBABILITY AND STATISTICS

Some Basic Definitions

A random variable has a large set of possible values, of which only one will occur. The set of possible values together with their probabilities is the probability distribution.
A continuous random variable has a probability density function (pdf) $f_X$ such that
$$P(X \in A) = \int_A f_X(u)\,du.$$
Its CDF is
$$F_X(x) := \int_{-\infty}^{x} f_X(u)\,du.$$
Quantiles: if the CDF of $X$ is continuous and strictly increasing, then it has an inverse function $F^{-1}$. For $q$ between 0 and 1, $F^{-1}(q)$ is called the $q$th quantile or $100q$th percentile.

The probability that $X$ is below its $q$th quantile is exactly $q$:
$$P\{X \le F^{-1}(q)\} = q.$$
This is also called the lower quantile; the $q$th upper quantile is the $(1-q)$th lower quantile.
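As a concrete illustration (a sketch assuming Python with scipy; the notes themselves use MATLAB/MINITAB), the inverse CDF $F^{-1}$ is available as the "percent point function" ppf:

```python
# Minimal sketch: quantiles via the inverse CDF F^{-1} (scipy's ppf).
from scipy.stats import norm

q = 0.975
x = norm.ppf(q)          # 0.975 quantile of the standard normal, about 1.96
print(x)                 # F^{-1}(q)
print(norm.cdf(x))       # check: P{X <= F^{-1}(q)} = q, prints ~0.975
```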
[Figure: graphs of a CDF (vertical axis from 0 to 1) illustrating the lower quantile $F^{-1}(q)$ and the corresponding upper quantile $F^{-1}(1-q)$.]
The median is the 50th percentile (.5 quantile). The 25th and 75th percentiles (.25 and .75 quantiles) are called the first and third quartiles. For 95% confidence intervals we use the 0.025 and 0.975 quantiles, i.e., the 0.025 lower and 0.025 upper quantiles.
The expected value of $X$ is
$$E(X) := \int x f_X(x)\,dx.$$
The variance of $X$ is
$$\sigma_X^2 := E\{X - E(X)\}^2.$$
Given a sample $X_1, \ldots, X_n$, the sample mean is
$$\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$$
and the sample variance is
$$s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}.$$
Correlation and Covariance

The covariance between $X$ and $Y$ is
$$\sigma_{XY} = E[\{X - E(X)\}\{Y - E(Y)\}].$$
If $(X, Y)$ are continuously distributed, then
$$\sigma_{XY} = \int\!\!\int \{x - E(X)\}\{y - E(Y)\}\, f_{XY}(x, y)\,dx\,dy.$$
The correlation coefficient is $\rho_{XY} = \sigma_{XY}/(\sigma_X \sigma_Y)$.
Given a bivariate sample $\{(X_i, Y_i)\}_{i=1}^{n}$, the sample correlation coefficient is
$$r = \frac{(n-1)^{-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{s_X s_Y} \qquad (1)$$
where $\bar{X}$ and $\bar{Y}$ are the sample means and $s_X$ and $s_Y$ are the sample standard deviations.
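Formula (1) is easy to compute directly and to check against a library routine (a sketch assuming Python/numpy; the arrays x and y are made-up illustrative data):

```python
# Minimal sketch of formula (1) on two hypothetical data arrays.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
n = len(x)

# (n-1)^{-1} sum (X_i - Xbar)(Y_i - Ybar), divided by s_X s_Y
sxy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
r = sxy / (x.std(ddof=1) * y.std(ddof=1))   # ddof=1 gives the (n-1) divisor

print(r)
print(np.corrcoef(x, y)[0, 1])              # should agree with r
```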
[Figure: scatterplots of bivariate samples with sample correlations $r$ = 0.01, 0.25, 0.5, 0.95, 0.11, 0.83, 0.89, and 1.]
An absolute correlation of .25 is very weak; an absolute correlation of .5 is only moderate; an absolute correlation of .95 is rather strong; an absolute correlation of 1 implies a linear relationship. A strong nonlinear relationship may or may not imply a high correlation.
$X$ and $Y$ are independent if, for all sets $A$ and $B$,
$$P(X \in A \text{ and } Y \in B) = P(X \in A)\,P(Y \in B).$$
If $X$ and $Y$ are independent, then for all functions $g$ and $h$,
$$E\{g(X)h(Y)\} = E\{g(X)\}\,E\{h(Y)\}.$$
In particular, if $X$ and $Y$ are independent, then $\sigma_{XY} = 0$. But $\sigma_{XY} = 0$ does not imply independence: there could be a strong nonlinear association between $X$ and $Y$.
Best Linear Prediction

Idea: use $X$ (observed) to predict $Y$ (unobserved). A linear predictor is
$$\hat{Y} := \beta_0 + \beta_1 X.$$
The prediction error is $Y - \hat{Y}$, and the squared error is
$$\{Y - \hat{Y}\}^2 = \{Y - (\beta_0 + \beta_1 X)\}^2.$$
Best linear prediction means finding $\beta_0$ and $\beta_1$ to minimize the expected squared prediction error, $E\{Y - (\beta_0 + \beta_1 X)\}^2$. This is similar to linear regression, but uses populations rather than samples.
Minimizing the expected squared prediction error gives $\beta_1 = \sigma_{XY}/\sigma_X^2$ and
$$\beta_0 = E(Y) - \beta_1 E(X) = E(Y) - \{\sigma_{XY}/\sigma_X^2\}\,E(X).$$
Thus, the best linear predictor of $Y$ is
$$\hat{Y} := \beta_0 + \beta_1 X = E(Y) + \frac{\sigma_{XY}}{\sigma_X^2}\{X - E(X)\}.$$
Another way to look at this (exercise):
$$\frac{\hat{Y} - E(Y)}{\sigma_Y} = \rho_{XY}\,\frac{X - E(X)}{\sigma_X}.$$
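For completeness, here is the minimization step behind $\beta_0$ and $\beta_1$; it is a standard calculation, spelled out here rather than in the original slides:

```latex
% Differentiate Q(b0, b1) = E{Y - (b0 + b1 X)}^2 and set the partials to zero.
\begin{align*}
\frac{\partial Q}{\partial \beta_0} &= -2\,E\{Y - \beta_0 - \beta_1 X\} = 0
  &&\Rightarrow\; \beta_0 = E(Y) - \beta_1 E(X), \\
\frac{\partial Q}{\partial \beta_1} &= -2\,E[X\{Y - \beta_0 - \beta_1 X\}] = 0
  &&\Rightarrow\; \beta_1 = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}
      = \frac{\sigma_{XY}}{\sigma_X^2}.
\end{align*}
```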
The prediction error is $Y - \hat{Y}$. It can be proved that $E\{Y - \hat{Y}\} = 0$, i.e., the prediction is unbiased. The expected squared prediction error is
$$E\{Y - \hat{Y}\}^2 = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2} = \sigma_Y^2\,(1 - \rho_{XY}^2).$$
How much does $X$ help us predict $Y$? Suppose we do not observe $X$ and must predict $Y$ using a constant; denote this constant by $c$. The expected squared prediction error is
$$E(Y - c)^2 = \mathrm{Var}(Y) + \{c - E(Y)\}^2$$
(exercise: check), which is minimized by $c = E(Y)$. With this choice, the expected squared prediction error is $\sigma_Y^2$.
Comparing the two cases, $\rho_{XY}^2$ is the fraction by which the expected squared prediction error is reduced when $X$ is known. This is an important fact that we will see again.
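A quick simulation makes the $\rho_{XY}^2$ reduction concrete (a sketch assuming Python/numpy; all names and parameter values are illustrative):

```python
# Sketch: simulate correlated (X, Y) and check that the best linear
# predictor cuts the mean squared prediction error by the factor (1 - rho^2).
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
n = 200_000

x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # Corr(X, Y) = rho

# Best linear predictor E(Y) + (sigma_XY / sigma_X^2)(X - E(X)) = rho * X here
y_hat = rho * x

mse_with_x = np.mean((y - y_hat) ** 2)          # about 1 - rho^2 = 0.36
mse_without = np.var(y)                         # about sigma_Y^2 = 1
print(mse_with_x, mse_without * (1 - rho**2))   # the two should be close
```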
Conditional Distributions

Let $f_{XY}(x, y)$ be the joint density of a pair of random variables $(X, Y)$. The marginal density of $X$ is
$$f_X(x) := \int f_{XY}(x, y)\,dy,$$
and similarly for $f_Y$. The conditional density of $Y$ given $X$ is
$$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}.$$
The conditional expectation of $Y$ given $X$ is the expectation calculated using $f_{Y|X}(y|x)$:
$$E(Y|X = x) = \int y\,f_{Y|X}(y|x)\,dy,$$
which is a function of $x$. The conditional variance of $Y$ given $X$ is
$$\mathrm{Var}(Y|X = x) = \int \{y - E(Y|X = x)\}^2\,f_{Y|X}(y|x)\,dy.$$
The Normal Distribution

The standard normal distribution has density
$$\phi(x) := \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right).$$
The $N(\mu, \sigma^2)$ density is
$$\frac{1}{\sigma}\,\phi\!\left(\frac{x - \mu}{\sigma}\right).$$
The standard normal CDF is
$$\Phi(x) := \int_{-\infty}^{x} \phi(u)\,du.$$
$\Phi$ can be evaluated using tables or, more easily, using software such as MATLAB or MINITAB.
Important: If $X \sim N(\mu, \sigma^2)$ then $P(X \le x) = \Phi\{(x - \mu)/\sigma\}$.

Example: If $X \sim N(5, 4)$ then what is $P(X \le 7)$?

Answer: Using $x = 7$, $\mu = 5$, $\sigma^2 = 4$ we have $(x - \mu)/\sigma = (7 - 5)/2 = 1$ and then $\Phi(1) = .8413$. In MATLAB 6, cdfn(1) gives ans = 0.8413.
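The same calculation in, say, Python with scipy (a sketch; not the software the notes use):

```python
# Sketch of the example P(X <= 7) for X ~ N(5, 4) using scipy.
from scipy.stats import norm

# scale is the standard deviation sigma = sqrt(4) = 2, not the variance
p = norm.cdf(7, loc=5, scale=2)
print(p)   # 0.8413..., matching Phi(1)
```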
Conditional expectations and variances

Calculation of conditional expectations and variances can be difficult, but is easy for a bivariate normal distribution. For a bivariate normal pair, the conditional expectation equals the best linear predictor:
$$E(Y|X) = E(Y) + \frac{\sigma_{XY}}{\sigma_X^2}\{X - E(X)\}.$$
The conditional variance is the expected squared prediction error:
$$\mathrm{Var}(Y|X) = \sigma_Y^2\,(1 - \rho_{XY}^2).$$
Note that
$$\mathrm{Var}(w_1 X + w_2 Y) = \begin{pmatrix} w_1 & w_2 \end{pmatrix} \begin{pmatrix} \mathrm{Var}(X) & \mathrm{Cov}(X, Y) \\ \mathrm{Cov}(X, Y) & \mathrm{Var}(Y) \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}.$$
Fact:
$$\begin{pmatrix} w_1 & \cdots & w_N \end{pmatrix} \begin{pmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix} = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j a_{ij}.$$
Suppose
$$X = \begin{pmatrix} X_1 \\ \vdots \\ X_N \end{pmatrix} \quad \text{and} \quad w = \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix}.$$
Then
$$w^T X = \sum_{i=1}^{N} w_i X_i$$
and
$$E(w^T X) = w^T \{E(X)\} = \sum_{i=1}^{N} w_i\,E(X_i).$$
Moreover,
$$\mathrm{Var}(w^T X) = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j\,\mathrm{Cov}(X_i, X_j) = w^T\,\mathrm{COV}(X)\,w.$$
Example: Suppose that $X = (X_1\ X_2\ X_3)^T$, $\mathrm{Var}(X_1) = 2$, $\mathrm{Var}(X_2) = 3$, $\mathrm{Var}(X_3) = 5$, $\rho_{X_1, X_2} = .6$, and that $X_1$ and $X_2$ are independent of $X_3$. Find $\mathrm{Var}(X_1 + X_2 + \frac{1}{2}X_3)$.

Answer: The covariance between $X_1$ and $X_3$ is 0 by independence; the same is true of $X_2$ and $X_3$. The covariance between $X_1$ and $X_2$ is $(.6)\sqrt{(2)(3)} = 1.47$.
$$\mathrm{Var}\!\left(X_1 + X_2 + \frac{X_3}{2}\right) = \begin{pmatrix} 1 & 1 & \frac{1}{2} \end{pmatrix} \begin{pmatrix} 2 & 1.47 & 0 \\ 1.47 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ \frac{1}{2} \end{pmatrix} = 9.19.$$
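This quadratic form is easy to check numerically (a sketch assuming Python/numpy):

```python
# Sketch: verify Var(X1 + X2 + X3/2) = w^T COV(X) w = 9.19 with numpy.
import numpy as np

cov = np.array([[2.00, 1.47, 0.0],
                [1.47, 3.00, 0.0],
                [0.00, 0.00, 5.0]])
w = np.array([1.0, 1.0, 0.5])

print(w @ cov @ w)   # 9.19
```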
In particular, if $X_1, \ldots, X_N$ are uncorrelated, then
$$\mathrm{Var}(w^T X) = \sum_{i=1}^{N} w_i^2\,\mathrm{Var}(X_i).$$
Example: Suppose again that $\mathrm{Var}(X_1) = 2$, $\mathrm{Var}(X_2) = 3$, $\mathrm{Var}(X_3) = 5$, $\rho_{X_1, X_2} = .6$, and that $X_1$ and $X_2$ are independent of $X_3$. Find the covariance between $(X_1 + X_2 + X_3)/3$ and $(X_1 + X_2)/2$.

Answer: Let
$$w_1 = \begin{pmatrix} \frac{1}{3} \\ \frac{1}{3} \\ \frac{1}{3} \end{pmatrix} \quad \text{and} \quad w_2 = \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \\ 0 \end{pmatrix}.$$
Then
$$\mathrm{Cov}\!\left(\frac{X_1 + X_2 + X_3}{3},\, \frac{X_1 + X_2}{2}\right) = w_1^T\,\mathrm{COV}(X)\,w_2$$
$$= \begin{pmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{pmatrix} \begin{pmatrix} 2 & 1.47 & 0 \\ 1.47 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix} \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \\ 0 \end{pmatrix}$$
$$= \begin{pmatrix} 1.157 & 1.490 & 1.667 \end{pmatrix} \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \\ 0 \end{pmatrix} = 1.323.$$
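This, too, checks out numerically (sketch assuming Python/numpy):

```python
# Sketch: verify Cov((X1+X2+X3)/3, (X1+X2)/2) = w1^T COV(X) w2 with numpy.
import numpy as np

cov = np.array([[2.00, 1.47, 0.0],
                [1.47, 3.00, 0.0],
                [0.00, 0.00, 5.0]])
w1 = np.array([1/3, 1/3, 1/3])
w2 = np.array([1/2, 1/2, 0.0])

print(w1 @ cov @ w2)   # 1.3233...
```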
Important fact: If $X$ has a multivariate normal distribution, then $w^T X$ is a normal random variable.
Example: Suppose that $E(X_1) = 1$, $E(X_2) = 1.5$, $\sigma_{X_1}^2 = 1$, $\sigma_{X_2}^2 = 2$, and $\mathrm{Cov}(X_1, X_2) = .5$. Find $E(.3X_1 + .7X_2)$ and $\mathrm{Var}(.3X_1 + .7X_2)$. If $(X_1\ X_2)^T$ is bivariate normal, find $P(.3X_1 + .7X_2 < 2)$.

Answers:
$$E(.3X_1 + .7X_2) = (.3)(1) + (.7)(1.5) = 1.35$$
$$\mathrm{Var}(.3X_1 + .7X_2) = (.3)^2(1) + (.7)^2(2) + (2)(.3)(.7)(.5) = 1.28$$
$$P(.3X_1 + .7X_2 < 2) = \Phi\{(2 - 1.35)/\sqrt{1.28}\} = \Phi(.5745) = .7172$$
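The same three numbers can be checked with scipy (a sketch, not from the original notes):

```python
# Sketch: check E, Var, and the normal probability from the example.
import numpy as np
from scipy.stats import norm

mu = np.array([1.0, 1.5])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
w = np.array([0.3, 0.7])

mean = w @ mu            # 1.35
var = w @ cov @ w        # 1.28
p = norm.cdf(2, loc=mean, scale=np.sqrt(var))   # 0.7172...
print(mean, var, p)
```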
Hypothesis Testing

Example: $H_0: \mu = 1$ versus $H_1: \mu \ne 1$.

Rejection region: the set of possible samples that lead us to reject $H_0$. Example: reject $H_0$ if $|\bar{X} - 1|$ exceeds a cutoff $c$.

Type I error: the null hypothesis is true but we reject it. Type II error: the null hypothesis is false but we accept it.
The rejection region is chosen so that P(type I error) is below a pre-specified value $\alpha$, called the level of the test. Typical values of $\alpha$ used in practice are .01, .05, or .1. As $\alpha$ is made smaller, the rejection region must be made smaller.
p-values

The p-value for a sample is defined as the smallest value of $\alpha$ for which the null hypothesis is rejected for that sample. To do the test, we find (typically using statistical software) the p-value of that sample. $H_0$ is rejected if we decide to use an $\alpha$ larger than the p-value. Example: if $p = .005$ and $\alpha = .01$, then reject $H_0$. $H_0$ is accepted if we use an $\alpha$ smaller than the p-value. Example: if $p = .03$ and $\alpha = .01$, then accept $H_0$.
Thus a small p-value is evidence against the null hypothesis, while a large p-value shows that the data are consistent with the null hypothesis.
Maximum Likelihood Estimation

$Y = (Y_1, \ldots, Y_n)^T$ is the vector of data. $\theta = (\theta_1, \ldots, \theta_p)^T$ is the vector of parameters. $f(y; \theta)$ is the density of $Y$, which depends on $\theta$.
Example: Suppose $Y_1, \ldots, Y_n$ are IID $N(\mu, \sigma^2)$, so $\theta = (\mu, \sigma^2)$. Then
$$f(y; \theta) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left\{-\frac{(Y_i - \mu)^2}{2\sigma^2}\right\} = \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\!\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \mu)^2\right\}.$$
$L(\theta) := f(Y; \theta)$ is the likelihood function. The maximum likelihood estimator (MLE) is the value of $\theta$ that maximizes $L(\theta)$; denote the MLE by $\hat{\theta}_{ML}$. Often it is mathematically easier to maximize $\log\{L(\theta)\}$.
In the normal example, for fixed $\sigma^2$, maximizing $\log\{L(\theta)\}$ over $\mu$ is equivalent to minimizing $\sum_{i=1}^{n} (Y_i - \mu)^2$. Setting the derivative with respect to $\mu$ to zero,
$$0 = \sum_{i=1}^{n} (Y_i - \mu), \quad \text{so} \quad \sum_{i=1}^{n} Y_i = n\mu \quad \Rightarrow \quad \hat{\mu}_{ML} = \bar{Y}.$$
Substituting $\hat{\mu}_{ML} = \bar{Y}$ and then maximizing $\log\{L(\theta)\}$ over $\sigma^2$ gives
$$\hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$
In this textbook example there is an explicit formula for the MLE. With more complex models, there is no explicit formula. Rather, one writes a program to compute $\log\{L(\theta)\}$ for any $\theta$ and uses optimization software to maximize this function numerically. For some models, such as the ARIMA time series models, there are software packages, e.g., MINITAB and SAS, that compute the MLE.
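As an illustration of the numerical approach (a sketch assuming Python with numpy/scipy; the notes themselves mention MINITAB and SAS), one can maximize the normal log-likelihood numerically and recover the explicit MLEs:

```python
# Sketch: numerically maximize the N(mu, sigma^2) log-likelihood and compare
# with the explicit MLEs mu_hat = Ybar and sigma2_hat = n^{-1} sum (Y-Ybar)^2.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.5, size=500)     # simulated data

def neg_log_lik(theta):
    mu, log_sigma = theta                        # optimize log(sigma) so sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                   - 0.5 * ((y - mu) / sigma) ** 2)

fit = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])

print(mu_hat, y.mean())                          # should agree
print(sigma_hat**2, np.mean((y - y.mean())**2))  # should agree
```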
Likelihood Ratio Tests (LRTs)

LRTs are a convenient, all-purpose tool. Let
$$\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}.$$
We want to test a hypothesis about $\theta_1$ without making any hypothesis about the value of $\theta_2$. Example: we want to test that the population mean $\mu$ is zero; then $\theta_1 = \mu$ and $\theta_2 = \sigma^2$. Let $\theta_{1,0}$ be the hypothesized value of $\theta_1$. Example: $\theta_{1,0} = 0$ if we want to test that $\mu$ is zero.
Example: testing that $\mu$ is zero, the hypotheses are $H_0: \mu = 0$ and $H_1: \mu \ne 0$. Neither hypothesis specifies anything about $\sigma$. Let $\hat{\theta}_{ML}$ be the maximum likelihood estimator, and let $\hat{\theta}_{2,0}$ be the value of $\theta_2$ that maximizes $L(\theta)$ when $\theta_1 = \theta_{1,0}$.
The likelihood ratio test rejects $H_0$ if
$$2\left[\log\{L(\hat{\theta}_{ML})\} - \log\{L(\theta_{1,0}, \hat{\theta}_{2,0})\}\right] > \chi^2_{\alpha,\, \dim(\theta_1)},$$
where $\dim(\theta_1)$ is the number of components of $\theta_1$ and $\chi^2_{\alpha, k}$ is the upper-$\alpha$ probability value of the chi-squared distribution with $k$ degrees of freedom.
Example: $Y_1, \ldots, Y_n$ are IID $N(\mu, \sigma^2)$ and $\theta = (\mu, \sigma^2)$. We want to test that $\mu$ is zero.
$$\log(L) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (Y_i - \mu)^2.$$
$\log(L)$ at the MLE is
$$\log\{L(\bar{Y}, \hat{\sigma}^2_{ML})\} = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat{\sigma}^2_{ML}) - \frac{n}{2},$$
where $\hat{\sigma}^2_{ML} = n^{-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$. Under $H_0$ ($\mu = 0$), the maximizing variance is $\hat{\sigma}^2_0 = n^{-1}\sum_{i=1}^{n} Y_i^2$. (Exercise: check.)
Therefore,
$$2\left[\log\{L(\bar{Y}, \hat{\sigma}^2_{ML})\} - \log\{L(0, \hat{\sigma}^2_0)\}\right] = n\left[\log(\hat{\sigma}^2_0) - \log(\hat{\sigma}^2_{ML})\right] = n \log\!\left(\frac{\hat{\sigma}^2_0}{\hat{\sigma}^2_{ML}}\right) = n \log\!\left\{\frac{\sum_{i=1}^{n} Y_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}\right\}.$$
So the LRT rejects $H_0$ if
$$n \log\!\left\{\frac{\sum_{i=1}^{n} Y_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}\right\} > \chi^2_{\alpha, 1}. \qquad (2)$$
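A quick numerical sketch of test (2) (assuming Python with numpy/scipy; the data array y is illustrative):

```python
# Sketch: carry out the likelihood ratio test (2) of H0: mu = 0 at level alpha.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
y = rng.normal(loc=0.5, scale=1.0, size=50)   # simulated data with mu != 0
n = len(y)
alpha = 0.05

lrt_stat = n * np.log(np.sum(y**2) / np.sum((y - y.mean())**2))
cutoff = chi2.ppf(1 - alpha, df=1)            # upper-alpha chi-squared value

# True means reject H0; likely True here since the data have mu != 0
print(lrt_stat, cutoff, lrt_stat > cutoff)
```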
To appreciate why (2) is a reasonable test, first consider the case $\mu = 0$. Simple algebra shows that
$$\sum_{i=1}^{n} Y_i^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 + n\bar{Y}^2.$$
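The "simple algebra" is just adding and subtracting $\bar{Y}$, spelled out here for completeness:

```latex
% Expand (Y_i - Ybar + Ybar)^2; the cross term vanishes since sum (Y_i - Ybar) = 0.
\begin{align*}
\sum_{i=1}^{n} Y_i^2
  &= \sum_{i=1}^{n} \{(Y_i - \bar{Y}) + \bar{Y}\}^2 \\
  &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2
     + 2\bar{Y} \sum_{i=1}^{n} (Y_i - \bar{Y})
     + n\bar{Y}^2
   = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 + n\bar{Y}^2 .
\end{align*}
```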
If $\mu = 0$, then $\bar{Y}$ will be close to $\mu = 0$, so the fraction inside the log will be close to 1. The log of 1 is 0, so the left-hand side of (2) will be small and we do not reject the null (the right decision).
If $\mu$ is not 0, then $\bar{Y} \approx \mu \ne 0$, so $n\bar{Y}^2$ and hence the left-hand side of (2) will be large, and we reject the null (again the correct decision).