Statistics 203: Introduction to Regression and Analysis of Variance
Simple Linear Regression: Inference + Diagnostics
Jonathan Taylor
- Outline
- Distribution of $\hat{\beta}$, $e$
- t-random variables
- F-random variables
- Inference for $\beta$: t-statistics
- Why reject for large |T|?
- t vs. Normal
- Confidence interval for $\beta_1$
- Linear combinations of $\beta_0$, $\beta_1$
- A new observation: forecasting
- Goodness of fit
- Goodness of fit
- F test for significance of regression
- What can go wrong?
- Problems in the regression function
- Problems with the errors
Outline
- Inference for vector of coefficients $\beta$.
- Diagnostics: what can go wrong in our model?
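Distribution of $\hat{\beta}$, $e$
- The vector $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)$ is a function of $\hat{Y}$ so is independent of $e$.
- Both $\hat{\beta}$ and $\hat{Y}$ are linear transformations of $Y$ so they are normally distributed.
- We will prove $E((\hat{\beta}_0, \hat{\beta}_1)) = (\beta_0, \beta_1)$ and that $\hat{\beta}$ has covariance matrix
$$
\mathrm{Var}(\hat{\beta}) =
\begin{pmatrix}
\dfrac{\sigma^2}{n} + \dfrac{\sigma^2 \bar{X}^2}{S_{xx}} & -\dfrac{\sigma^2 \bar{X}}{S_{xx}} \\
-\dfrac{\sigma^2 \bar{X}}{S_{xx}} & \dfrac{\sigma^2}{S_{xx}}
\end{pmatrix},
\qquad S_{xx} = \sum_{i=1}^n (X_i - \bar{X})^2.
$$
- Natural estimate of the covariance matrix:
$$
\widehat{\mathrm{Var}}(\hat{\beta}) =
\begin{pmatrix}
\dfrac{\hat{\sigma}^2}{n} + \dfrac{\hat{\sigma}^2 \bar{X}^2}{S_{xx}} & -\dfrac{\hat{\sigma}^2 \bar{X}}{S_{xx}} \\
-\dfrac{\hat{\sigma}^2 \bar{X}}{S_{xx}} & \dfrac{\hat{\sigma}^2}{S_{xx}}
\end{pmatrix}
$$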
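
A minimal sketch of how the estimated covariance matrix might be computed in plain Python. The data are hypothetical, and the sketch assumes the usual variance estimate $\hat{\sigma}^2 = SSE/(n-2)$ and $S_{xx} = \sum_i (X_i - \bar{X})^2$, neither of which is spelled out on this slide.

```python
# Hypothetical toy data, for illustration only (not from the course).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares estimates of the slope and intercept.
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals and the assumed variance estimate sigma_hat^2 = SSE / (n - 2).
resid = [yi - beta0_hat - beta1_hat * xi for xi, yi in zip(x, y)]
sigma2_hat = sum(r ** 2 for r in resid) / (n - 2)

# Estimated covariance matrix of (beta0_hat, beta1_hat), entry by entry.
var_b0 = sigma2_hat / n + sigma2_hat * x_bar ** 2 / s_xx
cov_b0_b1 = -sigma2_hat * x_bar / s_xx
var_b1 = sigma2_hat / s_xx
cov_hat = [[var_b0, cov_b0_b1], [cov_b0_b1, var_b1]]

print("beta_hat =", (beta0_hat, beta1_hat))
print("estimated covariance matrix =", cov_hat)
```

With all $X_i$ positive, as here, the off-diagonal entry is negative: overestimating the slope goes with underestimating the intercept.
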
t-random variables
- Start with $Z \sim N(0, 1)$, a standard normal, and $G \sim \chi^2_\nu$, independent of $Z$.
- Compute
$$T = \frac{Z}{\sqrt{G/\nu}}.$$
- Then $T \sim t_\nu$ has a t-distribution with $\nu$ degrees of freedom.
- Where do they come up in regression?
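
The construction above can be sketched as a small simulation. This is an illustration under stated assumptions: $\nu = 10$, the number of draws, and the seed are arbitrary choices, and the chi-squared variable is built as a sum of $\nu$ squared independent standard normals.

```python
import math
import random

rng = random.Random(0)   # fixed seed so the run is reproducible
nu = 10                  # degrees of freedom (hypothetical choice)
n_draws = 20000


def draw_t(nu):
    """One draw of T = Z / sqrt(G / nu) with Z ~ N(0,1), G ~ chi^2_nu."""
    z = rng.gauss(0.0, 1.0)
    # A sum of nu squared independent N(0,1) draws is chi-squared with nu df.
    g = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return z / math.sqrt(g / nu)


draws = [draw_t(nu) for _ in range(n_draws)]
mean = sum(draws) / n_draws
var = sum((t - mean) ** 2 for t in draws) / (n_draws - 1)

# For nu > 2 the t distribution has mean 0 and variance nu / (nu - 2).
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
print(f"theoretical variance = {nu / (nu - 2):.3f}")
```

The sample variance should land near $\nu/(\nu-2) = 1.25$, larger than the variance 1 of a standard normal, which reflects the heavier tails of the t distribution.
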
F-random variables
- Start with $G_1 \sim \chi^2_{\nu_1}$ and another independent $G_2 \sim \chi^2_{\nu_2}$.
- Compute
$$F = \frac{G_1/\nu_1}{G_2/\nu_2}.$$
- Then $F \sim F_{\nu_1,\nu_2}$ has an F-distribution with $\nu_1$ degrees of freedom in the numerator and $\nu_2$ in the denominator.
- Note: if $T \sim t_\nu$ then $T^2 \sim F_{1,\nu}$.
- Where do they come up in regression?
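
The note relating $t$ and $F$ follows directly from the two constructions. A short worked derivation, using only the definitions of $T$ and $F$ given on these slides:

```latex
% T is built from Z ~ N(0,1) and an independent G ~ chi^2_nu.
\begin{align*}
T^2 &= \left( \frac{Z}{\sqrt{G/\nu}} \right)^2
     = \frac{Z^2 / 1}{G / \nu}, \\
Z^2 &\sim \chi^2_1 \quad \text{and is independent of } G \sim \chi^2_\nu, \\
\text{so}\quad T^2 &\sim F_{1,\nu}
  \quad \text{(take } G_1 = Z^2,\ \nu_1 = 1,\ G_2 = G,\ \nu_2 = \nu \text{)}.
\end{align*}
```
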
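Inference for $\beta$: t-statistics
- Because $e$ is independent of $\hat{\beta}$ it follows that $\widehat{\mathrm{Var}}(\hat{\beta}_1)$ and $\widehat{\mathrm{Var}}(\hat{\beta}_0)$ are independent of $\hat{\beta}$.
- Under the hypothesis $H_0: \beta_1 = \beta_1^0$,
$$T = \frac{\hat{\beta}_1 - \beta_1^0}{\sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_1)}} \sim t_{n-2}.$$
  (Why?)
- To test this hypothesis, compare $|T|$ to $t_{n-2,1-\alpha/2}$, the $1-\alpha/2$ quantile of the t distribution with $n-2$ degrees of freedom.
- Reject $H_0$ if $|T| > t_{n-2,1-\alpha/2}$.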
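
One way to answer the "(Why?)" is to match $T$ to the definition of a t-random variable. The sketch below assumes $\hat{\sigma}^2 = SSE/(n-2)$ and $S_{xx} = \sum_i (X_i - \bar{X})^2$, and uses the standard fact that $SSE/\sigma^2 \sim \chi^2_{n-2}$ under the normal model.

```latex
% Assumes \hat\sigma^2 = SSE/(n-2) and S_{xx} = \sum_i (X_i - \bar X)^2.
\begin{align*}
Z &= \frac{\hat\beta_1 - \beta_1^0}{\sigma / \sqrt{S_{xx}}} \sim N(0,1)
  && \text{under } H_0, \text{ since } \mathrm{Var}(\hat\beta_1) = \sigma^2 / S_{xx}, \\
G &= \frac{(n-2)\,\hat\sigma^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}
  && \text{a function of } e, \text{ hence independent of } Z, \\
T &= \frac{Z}{\sqrt{G/(n-2)}}
   = \frac{\hat\beta_1 - \beta_1^0}{\hat\sigma / \sqrt{S_{xx}}}
   = \frac{\hat\beta_1 - \beta_1^0}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_1)}}
   \sim t_{n-2}.
\end{align*}
```

The unknown $\sigma$ cancels in the ratio, which is what makes $T$ computable from the data.
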
Why reject for large |T|?
- Observing a large $|T|$ is unlikely if $\beta_1 = \beta_1^0$: reasonable to conclude that $H_0$ is false.
- Common to report p-value
$$p\text{-value} = 2 \int_{|T|}^{\infty} f_{t_{n-2}}(s)\, ds.$$
- Above, $f_{t_{n-2}}$ is the density of a t-random variable with $n-2$ degrees of freedom.
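
The p-value integral can be evaluated numerically. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and Simpson's rule on $[0, |T|]$ is one convenient choice, not necessarily how statistical software does it.

```python
import math


def t_density(s, df):
    """Density of a t random variable with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + s * s / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """p-value = 2 * integral of the t density from |T| to infinity.

    By symmetry this equals 1 - 2 * (integral from 0 to |T|), and the
    finite integral is approximated with Simpson's rule (steps must be even).
    """
    a = abs(t_stat)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central = total * h / 3   # integral of the density over [0, |T|]
    return 1.0 - 2.0 * central


# 2.306 is the tabulated 0.975 quantile of t with 8 degrees of freedom,
# so the two-sided p-value should be close to 0.05.
print(round(two_sided_p_value(2.306, 8), 4))
```

Rewriting the tail integral as one minus a finite integral avoids truncating an infinite range.
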
t vs. Normal
[Figure: density $f(s)$ plotted for $s$ from -3 to 3 (vertical axis 0.0 to 0.4), comparing the t distribution with 10 df and the Normal.]
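Confidence interval for $\beta_1$
- For simplicity, write
$$\mathrm{SE}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2}}.$$
- Under the model assumptions
$$1 - \alpha = P\left( \left| \frac{\hat{\beta}_1 - \beta_1}{\mathrm{SE}(\hat{\beta}_1)} \right| < t_{n-2,1-\alpha/2} \right) = P\left( \beta_1 \in \hat{\beta}_1 \pm t_{n-2,1-\alpha/2}\, \mathrm{SE}(\hat{\beta}_1) \right).$$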
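
A minimal sketch of the resulting 95% interval for $\beta_1$. The data are hypothetical, $\hat{\sigma}^2 = SSE/(n-2)$ is assumed as the variance estimate, and the critical value 2.306 is the tabulated 0.975 quantile of the t distribution with 8 degrees of freedom, hard-coded here because the standard library has no t quantile function.

```python
import math

# Hypothetical data (n = 10), for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Assumed variance estimate: sigma_hat^2 = SSE / (n - 2).
sse = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(sse / (n - 2))

# SE(beta1_hat) = sigma_hat / sqrt(sum of squared deviations of X).
se_beta1 = sigma_hat / math.sqrt(s_xx)

# Tabulated 0.975 quantile of t with n - 2 = 8 degrees of freedom.
t_crit = 2.306

lower = beta1_hat - t_crit * se_beta1
upper = beta1_hat + t_crit * se_beta1
print(f"beta1_hat = {beta1_hat:.4f}")
print(f"95% CI for beta_1: ({lower:.4f}, {upper:.4f})")
```

For a different sample size or confidence level, `t_crit` would need to be replaced by the matching quantile.
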
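Linear combinations of $\beta_0$, $\beta_1$
- It is not too hard to prove that $a_0 \hat{\beta}_0 + a_1 \hat{\beta}_1$ is normally distributed and its standard deviation can be estimated by
$$\mathrm{SE}(a_0 \hat{\beta}_0 + a_1 \hat{\beta}_1) = \hat{\sigma} \sqrt{\frac{a_0^2}{n} + \frac{(a_0 \bar{X} - a_1)^2}{\sum_{i=1}^n (X_i - \bar{X})^2}}.$$
- As in last slide, confidence interval is
$$a_0 \hat{\beta}_0 + a_1 \hat{\beta}_1 \pm t_{n-2,1-\alpha/2}\, \mathrm{SE}(a_0 \hat{\beta}_0 + a_1 \hat{\beta}_1).$$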
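
The standard-error formula can be checked against the covariance matrix of $\hat{\beta}$ given earlier. A short derivation, writing $S_{xx} = \sum_{i=1}^n (X_i - \bar{X})^2$ (notation assumed here):

```latex
\begin{align*}
\mathrm{Var}(a_0 \hat\beta_0 + a_1 \hat\beta_1)
 &= a_0^2\, \mathrm{Var}(\hat\beta_0) + a_1^2\, \mathrm{Var}(\hat\beta_1)
    + 2 a_0 a_1\, \mathrm{Cov}(\hat\beta_0, \hat\beta_1) \\
 &= a_0^2 \left( \frac{\sigma^2}{n} + \frac{\sigma^2 \bar X^2}{S_{xx}} \right)
    + a_1^2\, \frac{\sigma^2}{S_{xx}}
    - 2 a_0 a_1\, \frac{\sigma^2 \bar X}{S_{xx}} \\
 &= \sigma^2 \left( \frac{a_0^2}{n}
    + \frac{a_0^2 \bar X^2 - 2 a_0 a_1 \bar X + a_1^2}{S_{xx}} \right) \\
 &= \sigma^2 \left( \frac{a_0^2}{n} + \frac{(a_0 \bar X - a_1)^2}{S_{xx}} \right).
\end{align*}
```

Replacing $\sigma$ by $\hat{\sigma}$ and taking the square root gives the estimated standard error. Setting $a_0 = 0$, $a_1 = 1$ recovers $\mathrm{SE}(\hat{\beta}_1) = \hat{\sigma}/\sqrt{S_{xx}}$.
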
A new observation: forecasting
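- New observation
$$Y_{new} = \beta_0 + \beta_1 X_{new} + \varepsilon_{new}.$$
- $$\mathrm{SE}(Y_{new}) = \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(\bar{X} - X_{new})^2}{\sum_{i=1}^n (X_i - \bar{X})^2}}.$$
- Again, prediction interval is
$$\hat{\beta}_0 + \hat{\beta}_1 X_{new} \pm t_{n-2,1-\alpha/2}\, \mathrm{SE}(Y_{new}).$$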
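
A minimal sketch of the prediction interval as a function. The function name, the data, and the choice of `x_new` are hypothetical; $\hat{\sigma}^2 = SSE/(n-2)$ is assumed, and the caller supplies the t critical value (2.306 is the tabulated 0.975 quantile for 8 degrees of freedom).

```python
import math


def prediction_interval(x, y, x_new, t_crit):
    """Return (lower, upper, se_new) for a new observation at x_new."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar

    # Assumed variance estimate: sigma_hat^2 = SSE / (n - 2).
    sse = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))

    # The leading 1 accounts for the noise in the new observation itself.
    se_new = sigma_hat * math.sqrt(1 + 1 / n + (x_bar - x_new) ** 2 / s_xx)
    center = beta0_hat + beta1_hat * x_new
    return center - t_crit * se_new, center + t_crit * se_new, se_new


# Hypothetical data (n = 10), for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

lo_far, hi_far, se_far = prediction_interval(x, y, 12.0, 2.306)
lo_mid, hi_mid, se_mid = prediction_interval(x, y, 5.5, 2.306)
print(f"at x_new = 12.0: ({lo_far:.3f}, {hi_far:.3f})")
print(f"at x_new = 5.5:  ({lo_mid:.3f}, {hi_mid:.3f})")
```

The interval is narrowest at $X_{new} = \bar{X}$ and widens as $X_{new}$ moves away from the observed predictors.
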
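Goodness of fit
The variation in $Y$, SST, can be decomposed into two parts: one for the regression, SSR, and one for the error, SSE.
$$
\begin{aligned}
SST &= \sum_{i=1}^n (Y_i - \bar{Y})^2 = SSE + SSR \\
SSE &= \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 \\
SSR &= \sum_{i=1}^n (\bar{Y} - \hat{Y}_i)^2 = \sum_{i=1}^n (\bar{Y} - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2
\end{aligned}
$$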
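
The decomposition can be checked numerically on any data set fitted by least squares. A minimal sketch with hypothetical data:

```python
# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.5, 3.7, 2.9, 5.6, 6.1, 8.4, 7.2, 9.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values from the least-squares line.
y_hat = [beta0_hat + beta1_hat * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                  # total variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))     # error part
ssr = sum((y_bar - fi) ** 2 for fi in y_hat)              # regression part

print(f"SST       = {sst:.6f}")
print(f"SSR + SSE = {ssr + sse:.6f}")
```

The identity relies on the fit being the least-squares fit; for an arbitrary line the cross term does not vanish and the two printed numbers differ.
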
Goodness of fit
- $$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \widehat{\mathrm{Cor}}(X, Y)^2.$$
- $R^2$ tells us how much variability in the $Y$'s is explained by the regression.
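
The three expressions for $R^2$ can be compared directly. A minimal sketch with hypothetical data, where the sample correlation is computed from sums of squares and cross products:

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.5, 3.7, 2.9, 5.6, 6.1, 8.4, 7.2, 9.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)   # this is SST
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = [beta0_hat + beta1_hat * xi for xi in x]

sst = s_yy
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ssr = sum((y_bar - fi) ** 2 for fi in y_hat)

r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst
cor_xy = s_xy / math.sqrt(s_xx * s_yy)      # sample correlation of X and Y

print(f"SSR/SST     = {r2_from_ssr:.6f}")
print(f"1 - SSE/SST = {r2_from_sse:.6f}")
print(f"Cor(X,Y)^2  = {cor_xy ** 2:.6f}")
```
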
F test for significance of regression
- Under $H_0: \beta_1 = 0$:
$$SSR \sim \sigma^2 \chi^2_1, \qquad SSE \sim \sigma^2 \chi^2_{n-2}.$$
- Therefore
$$F = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)} \sim F_{1,n-2}.$$
  (Why?)
- Reject $H_0$ for large values of $F$.
- General form of the $F$: a ratio of dispersions. The numerator is the dispersion of $\hat{Y}$ around $\bar{Y}$ while the denominator is the dispersion of $e$.
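
In simple linear regression the F statistic for $H_0: \beta_1 = 0$ equals the square of the t-statistic for the same hypothesis, matching the earlier note that $T^2 \sim F_{1,\nu}$ when $T \sim t_\nu$. A minimal numerical check with hypothetical data, assuming $\hat{\sigma}^2 = MSE = SSE/(n-2)$:

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.5, 3.7, 2.9, 5.6, 6.1, 8.4, 7.2, 9.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = [beta0_hat + beta1_hat * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ssr = sum((y_bar - fi) ** 2 for fi in y_hat)

msr = ssr / 1          # one degree of freedom for the regression
mse = sse / (n - 2)    # n - 2 degrees of freedom for the error
f_stat = msr / mse

# t-statistic for H0: beta_1 = 0, with Var_hat(beta1_hat) = MSE / S_xx.
t_stat = beta1_hat / math.sqrt(mse / s_xx)

print(f"F   = {f_stat:.6f}")
print(f"T^2 = {t_stat ** 2:.6f}")
```

The agreement is exact up to rounding because $SSR = \hat{\beta}_1^2 S_{xx}$ for the least-squares fit.
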
What can go wrong?
- Regression function can be wrong: missing predictors, nonlinear.
- Assumptions about the errors can be wrong.
- Outliers: both in predictors and observations.
- Influential points: these points have undue influence on the regression function.
- Examples:
  - Example #1: diagnostics for usual linear model
  - Example #2: t density
  - Example #3: misspecified model
Problems in the regression function
- True regression function may have higher-order non-linear terms, e.g. $X_1^2$, or may truly be non-linear.
- How to fix? Sometimes things can be transformed to linearity: suppose
$$Y_i = \beta_0 e^{\beta_1 X_i} \varepsilon_i.$$
  Then
$$\log Y_i = \log \beta_0 + \beta_1 X_i + \log \varepsilon_i$$
  is a linear model, and if the $\varepsilon_i$'s are independent lognormal random variables, then this transformed model has the same form as the original model!
- Later, we will see Box-Cox transformations to choose a transformation that optimally linearizes the model.
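
The log transformation above can be sketched on simulated data. Everything specific here is a hypothetical choice for illustration: the true values $\beta_0 = 2$, $\beta_1 = 0.5$, the grid of $X$ values, the seed, and lognormal errors whose logarithm is $N(0, 0.1^2)$.

```python
import math
import random

rng = random.Random(1)   # fixed seed so the run is reproducible

# Hypothetical true parameters of Y = beta0 * exp(beta1 * X) * eps.
beta0, beta1 = 2.0, 0.5
x = [0.1 * i for i in range(50)]

# eps is lognormal, so log(eps) is normal with mean 0 and sd 0.1.
y = [beta0 * math.exp(beta1 * xi) * rng.lognormvariate(0.0, 0.1) for xi in x]

# Ordinary least squares of log(Y) on X.
log_y = [math.log(yi) for yi in y]
n = len(x)
x_bar = sum(x) / n
ly_bar = sum(log_y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope_hat = sum((xi - x_bar) * (li - ly_bar) for xi, li in zip(x, log_y)) / s_xx
intercept_hat = ly_bar - slope_hat * x_bar

# The intercept estimates log(beta0); exponentiate to get back to beta0.
beta0_hat = math.exp(intercept_hat)

print(f"slope estimate (beta1): {slope_hat:.3f}")
print(f"exp(intercept) estimate (beta0): {beta0_hat:.3f}")
```

On the log scale the errors are additive and normal, so the inference tools from the earlier slides apply to the transformed model.
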
Problems with the errors
- Errors may not be normally distributed. We will look at a QQ plot for a graphical check. This may not affect inference in large samples.
- Variance may not be constant. We will see some graphical checks of this and (later) some transformations that might help correct this.
- Errors may not be independent. This seriously affects our estimates of SE, which can change t and F statistics substantially!
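
As a rough illustration of the QQ-plot check for normality, the sketch below computes the points such a plot would show, without drawing it. The function name and the residuals are hypothetical, and the plotting positions $(i - 0.5)/n$ are one common convention, assumed here.

```python
from statistics import NormalDist


def normal_qq_points(residuals):
    """Pairs (theoretical N(0,1) quantile, ordered residual) for a QQ plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals, for illustration only.
resid = [0.3, -1.2, 0.8, -0.1, 1.5, -0.7, 0.2, -0.4]
points = normal_qq_points(resid)
for q, r in points:
    print(f"{q:7.3f}  {r:6.2f}")
```

If the errors are roughly normal, the points fall close to a straight line; systematic curvature in the tails suggests heavier or lighter tails than the normal.
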
