
Estadística II

Chapter 4: Simple linear regression


Chapter 4. Simple linear regression

Contents
I Objectives of the analysis.
I Model specification.
I Least squares estimators (LSE): construction and properties
I Statistical inference:
    I For the slope.
    I For the variance.
I Prediction for a new observation (the actual value or the average value)
Chapter 4. Simple linear regression

Learning objectives
I Ability to construct a model to describe the influence of X on Y
I Ability to find estimates
I Ability to construct confidence intervals and carry out tests of hypothesis
I Ability to estimate the average value of Y for a given x (point estimate and confidence intervals)
I Ability to estimate the individual value of Y for a given x (point estimate and confidence intervals)
Chapter 4. Simple Linear Regression

Bibliography
I Newbold, P. Statistics for Business and Economics (2013), Ch. 10
I Ross, S. Introductory Statistics (2005), Ch. 12
Introduction

A regression model is a model that allows us to describe the effect of a variable $X$ on a variable $Y$.

I $X$: independent or explanatory or exogenous variable
I $Y$: dependent or response or endogenous variable

The objective is to obtain reasonable estimates of $Y$ for a given value of $X$, based on a sample of $n$ bivariate observations $(x_1, y_1), \ldots, (x_n, y_n)$.
Introduction

Examples
I Study how the father's height influences the son's height.
I Estimate the price of an apartment depending on its size.
I Predict an unemployment rate for a given age group.
I Approximate a final grade in Est II based on the weekly number of study hours.
I Predict the computing time as a function of the processor speed.
Introduction

Types of relationships
I Deterministic: Given a value of $X$, the value of $Y$ can be perfectly identified.
$$y = f(x)$$
Example: The relationship between the temperature in degrees Celsius ($X$) and in degrees Fahrenheit ($Y$) is:
$$y = 1.8x + 32$$
[Figure: plot of degrees Fahrenheit (vertical axis, 32 to 112) against degrees Celsius (horizontal axis, 0 to 40).]
Introduction
Types of relationships
I Nondeterministic (random/stochastic): Given a value of $X$, the value of $Y$ cannot be perfectly known.
$$y = f(x) + u$$
where $u$ is an unknown (random) perturbation (a random variable). Example: Production ($X$) and price ($Y$).
[Figure: scatterplot of cost (Costos, vertical axis, 0 to 80) against volume (Volumen, horizontal axis, 26 to 56).]

There is a linear pattern, but not perfect.


Introduction
Types of relationships
I Linear: When the function $f(x)$ is linear,
$$f(x) = \beta_0 + \beta_1 x$$
    I If $\beta_1 > 0$ there is a positive linear relationship.
    I If $\beta_1 < 0$ there is a negative linear relationship.
[Figure: two scatterplots of $Y$ against $X$, titled "Positive linear relationship" and "Negative linear relationship".]

The scatterplot is (American) football-shaped.


Introduction

Types of relationships
I Nonlinear: When $f(x)$ is nonlinear. For example, $f(x) = \log(x)$, $f(x) = x^2 + 3$, ...
[Figure: scatterplot of $Y$ against $X$, titled "Nonlinear relationship".]

The scatterplot is not (American) football-shaped.


Introduction

Types of relationships
I Lack of relationship: When f (x) = 0.
[Figure: scatterplot of $Y$ against $X$, titled "Absence of relationship".]
Measures of linear dependence

Covariance
The covariance is defined as
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{n - 1}$$

I If there is a positive linear relationship, $\mathrm{cov} > 0$
I If there is a negative linear relationship, $\mathrm{cov} < 0$
I If there is no relationship or the relationship is nonlinear, $\mathrm{cov} \approx 0$

Problem: Covariance depends on the units of $X$ and $Y$.


Measures of linear dependence
Correlation coefficient
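The correlation coefficient (unitless) is defined as
$$r_{(x,y)} = \mathrm{cor}(x, y) = \frac{\mathrm{cov}(x, y)}{s_x s_y}$$
where
$$s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \quad \text{and} \quad s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

I $-1 \le \mathrm{cor}(x, y) \le 1$
I $\mathrm{cor}(x, y) = \mathrm{cor}(y, x)$
I $\mathrm{cor}(ax + b, cy + d) = \mathrm{sign}(a)\,\mathrm{sign}(c)\,\mathrm{cor}(x, y)$ for arbitrary numbers $a, b, c, d$ with $a \ne 0$ and $c \ne 0$.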
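
As a small illustration of the two measures, the sketch below computes the covariance and the correlation for the wheat data used in Example 4.1 later in this chapter, and then repeats the calculation after a change of units. The helper names and the rescaling factor of 1000 are illustrative assumptions, not part of the course material.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance with the n - 1 divisor used in the text."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Sample correlation: cov(x, y) / (s_x * s_y)."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Wheat data from Example 4.1
production = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
price = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]

cov_xy = covariance(production, price)
cor_xy = correlation(production, price)

# Hypothetical change of units: production multiplied by 1000
production_rescaled = [1000 * xi for xi in production]
cov_rescaled = covariance(production_rescaled, price)
cor_rescaled = correlation(production_rescaled, price)

print("original units:", round(cov_xy, 3), round(cor_xy, 4))
print("rescaled units:", round(cov_rescaled, 1), round(cor_rescaled, 4))
```

The covariance is multiplied by the rescaling factor, while the correlation is unchanged, which is the reason for preferring the correlation as a measure of linear dependence.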
Simple linear regression model

The simple linear regression model assumes that
$$Y_i = \beta_0 + \beta_1 x_i + u_i$$
where
I $Y_i$ is the value of the dependent variable $Y$ when the random variable $X$ takes a specific value $x_i$
I $x_i$ is the specific value of the random variable $X$
I $u_i$ is an error, a random variable that is assumed to be normal with mean 0 and unknown variance $\sigma^2$, $u_i \sim N(0, \sigma^2)$
I $\beta_0$ and $\beta_1$ are the population coefficients:
    I $\beta_0$: population intercept
    I $\beta_1$: population slope
The (population) parameters that we need to estimate are: $\beta_0$, $\beta_1$ and $\sigma^2$.
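
To make the roles of the three parameters concrete, here is a minimal simulation sketch of the model. The function name, the seed and the parameter values (70, -1.3 and 5) are hypothetical choices for illustration only.

```python
import random


def simulate_responses(beta0, beta1, sigma, xs, seed=0):
    """Draw Y_i = beta0 + beta1 * x_i + u_i with u_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]


xs = [22, 25, 28, 31, 34, 37, 40]

# Hypothetical parameter values chosen only for illustration
ys = simulate_responses(beta0=70.0, beta1=-1.3, sigma=5.0, xs=xs)

for x, y in zip(xs, ys):
    print(x, round(y, 2))
```

With `sigma=0.0` every simulated point falls exactly on the population line, which is the deterministic case; a larger `sigma` gives more vertical scatter around it.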
Simple linear regression model
Our objective is to find the estimators/estimates $\hat{\beta}_0$, $\hat{\beta}_1$ of $\beta_0$, $\beta_1$ in order to obtain the regression line:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
which is the best fit to the data with a linear pattern. Example: Let's say that the regression line for the last example is
$$\widehat{\text{Price}} = -15.65 + 1.29\,\text{Production}$$
[Figure: plot of the fitted model, cost (Costos, vertical axis) against volume (Volumen, horizontal axis), with the estimated regression line.]

Based on the regression line, we can estimate the price when Production is 25 million:
$$\widehat{\text{Price}} = -15.65 + 1.29(25) = 16.6$$
Simple linear regression model
The difference between the observed value of the response variable $y_i$ and its estimate $\hat{y}_i$ is called a residual:
$$e_i = y_i - \hat{y}_i$$
[Figure: an observed data point $(x_i, y_i)$ and the estimated regression line.]

Example (cont.): Clearly, if for a given year the production is 25 million, the price will not be exactly 16.6 thousand euros. That small difference, the residual, will in that case (with an observed price of 18) be
$$e_i = 18 - 16.6 = 1.4$$
Simple linear regression model: model assumptions
I Linearity: The underlying relationship between $X$ and $Y$ is linear,
$$f(x) = \beta_0 + \beta_1 x$$
I Homogeneity: The errors have mean zero,
$$E[u_i] = 0$$
I Homoscedasticity: The variance of the errors is constant,
$$\mathrm{Var}(u_i) = \sigma^2$$
I Independence: The errors are independent,
$$E[u_i u_j] = 0 \quad \text{for } i \ne j$$
I Normality: The errors follow a normal distribution,
$$u_i \sim N(0, \sigma^2)$$
Simple linear regression model: model assumptions
Linearity
The scatterplot should have an (American) football shape, i.e., it should show scatter around a straight line.
[Figure: plot of the fitted model, cost (Costos) against volume (Volumen), with the points scattered around the fitted line.]

If not, the regression line is not an adequate model for the data.
[Figure: plot of a fitted model for $Y$ where the straight line is not adequate.]
Simple linear regression model: model assumptions
Homoscedasticity
The vertical spread around the line should roughly remain constant.
[Figure: scatterplot of cost (Costos) against volume (Volumen).]

If that's not the case, heteroscedasticity is present.

Simple linear regression model: model assumptions

Independence
I The observations should be independent.
I One observation doesn't imply any information about another.
I In general, time series fail this assumption.

Normality
I A priori, we assume that the observations are normal.
[Figure: the model $y_i = \beta_0 + \beta_1 x_i + u_i$, $u_i \sim N(0, \sigma^2)$, drawn as a normal density for $y_i$ centred on the line $\beta_0 + \beta_1 x$ at each $x_i$; $\beta_0$, $\beta_1$, $\sigma^2$ are unknown parameters.]
(Ordinary) Least Squares Estimators: LSE
In 1809 Gauss proposed the least squares method to obtain the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ that provide the best fit
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
The method is based on a criterion in which we minimize the sum of squares of the residuals, SSR, that is, the sum of squared vertical distances between the observed $y_i$ and predicted $\hat{y}_i$ values
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$$
[Figure: residuals. For each $x_i$, the observed value $y_i$, the predicted value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and the residual $e_i$.]
Least Squares Estimators
The resulting estimators are
$$\hat{\beta}_1 = \frac{\mathrm{cov}(x, y)}{s_x^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
[Figure: the regression line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, with slope $\hat{\beta}_1$ and intercept $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.]
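
The slides state the least squares estimators without a derivation. A short sketch of the usual calculus argument follows, writing $b_0$, $b_1$ for candidate values of the intercept and slope and $S(b_0, b_1)$ for the sum of squares to be minimized (these symbols are introduced here only for the derivation).

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
  &\;\Longrightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\
\frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0
  &\;\Longrightarrow\; \sum_{i=1}^{n} x_i (y_i - \bar{y}) = b_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) \\
\text{since } \sum_{i=1}^{n} (y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x}) = 0:
  &\quad b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The two solutions are exactly $\hat{\beta}_1 = \mathrm{cov}(x, y)/s_x^2$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, because the common factor $n - 1$ cancels in the ratio.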
Fitting the regression line
Example 4.1. For the Spanish wheat production data from the 80s with production ($X$) and price per kilo in pesetas ($Y$) we have the following table

production  30  28  32  25  25  25  22  24  35  40
price       25  30  27  40  42  40  50  45  30  25

Fit a least squares regression line to the data.

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{10} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{10} x_i^2 - n\,\bar{x}^2} = \frac{9734 - 10 \times 28.6 \times 35.4}{8468 - 10 \times 28.6^2} = -1.3537$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 35.4 + 1.3537 \times 28.6 = 74.116$$

Regression line is
$$\hat{y} = 74.116 - 1.3537x$$
Fitting the regression line in software
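
As one possible way to reproduce Example 4.1 in software, the following plain-Python sketch applies the same computational formulas as the hand calculation. The variable names are illustrative; any statistical package would give the same coefficients.

```python
# Wheat data from Example 4.1
production = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]  # x
price = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]        # y

n = len(production)
x_bar = sum(production) / n
y_bar = sum(price) / n

sum_xy = sum(x * y for x, y in zip(production, price))
sum_xx = sum(x * x for x in production)

# Slope and intercept from the least squares formulas
beta1_hat = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(f"slope     = {beta1_hat:.4f}")
print(f"intercept = {beta0_hat:.4f}")
```

The intermediate sums `sum_xy` and `sum_xx` correspond to the values 9734 and 8468 in the worked solution.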
Estimating the error variance
To estimate the error variance, $\sigma^2$, we can simply take the uncorrected sample variance,
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n}$$
which is the so-called maximum likelihood estimator of $\sigma^2$.

However, this estimator is biased.

The unbiased estimator of $\sigma^2$ is called the residual variance,
$$s_R^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 2} = \frac{SSR}{n - 2}$$
Estimating the error variance

Exercise 4.2. Find the residual variance for Example 4.1.

First, we find the residuals, $e_i$, using the regression line
$$\hat{y}_i = 74.116 - 1.3537 x_i$$

x_i                               30     28     32     25     25     25     22     24     35     40
y_i                               25     30     27     40     42     40     50     45     30     25
\hat{y}_i = 74.116 - 1.3537 x_i   33.5   36.21  30.79  40.27  40.27  40.27  44.33  41.62  26.73  19.96
e_i = y_i - \hat{y}_i             -8.50  -6.21  -3.79  -0.27  1.72   -0.27  5.66   3.37   3.26   5.03

The residual variance is then
$$s_R^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 2} = \frac{207.92}{8} = 25.99$$
Estimating the error variance in software
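
The calculation of Exercise 4.2 can be sketched in plain Python as follows. The code refits the line of Example 4.1, forms the residuals and computes both the biased and the unbiased variance estimators; variable names are illustrative.

```python
# Wheat data from Example 4.1
production = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
price = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]
n = len(production)

x_bar = sum(production) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in production)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(production, price))
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values and residuals e_i = y_i - y_hat_i
fitted = [beta0_hat + beta1_hat * x for x in production]
residuals = [y - y_hat for y, y_hat in zip(price, fitted)]

ssr = sum(e ** 2 for e in residuals)
sigma2_mle = ssr / n      # biased maximum likelihood estimator
s2_r = ssr / (n - 2)      # unbiased residual variance

print([round(e, 2) for e in residuals])
print(round(ssr, 2), round(sigma2_mle, 2), round(s2_r, 2))
```

The residuals printed here are rounded, whereas the table in Exercise 4.2 truncates some entries, so individual values can differ in the last digit.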
Statistical inference in simple linear regression model

I Up to this point we only talked about point estimation.
I With confidence intervals for model parameters, we can obtain information about the estimation error.
I And tests of hypothesis will help us to decide if a given parameter is statistically significant.
I In statistical inference, we begin with the distribution of the estimators.
Statistical inference of the slope
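The estimator $\hat{\beta}_1$ follows a normal distribution because it is a linear combination of normally distributed random variables
$$\hat{\beta}_1 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{(n - 1) s_X^2}\, Y_i = \sum_{i=1}^{n} w_i Y_i$$
where $Y_i = \beta_0 + \beta_1 x_i + u_i$, and satisfies $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$.
In addition, $\hat{\beta}_1$ is an unbiased estimator of $\beta_1$,
$$E[\hat{\beta}_1] = \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{(n - 1) s_X^2}\, E[Y_i] = \beta_1$$
whose variance is,
$$\mathrm{Var}[\hat{\beta}_1] = \sum_{i=1}^{n} \left(\frac{(x_i - \bar{x})}{(n - 1) s_X^2}\right)^2 \mathrm{Var}[Y_i] = \frac{\sigma^2}{(n - 1) s_X^2}$$
Thus,
$$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{(n - 1) s_X^2}\right)$$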
Confidence interval for the slope
We wish to obtain a $(1 - \alpha)$ confidence interval for $\beta_1$. Since $\sigma^2$ is unknown, we estimate it using $s_R^2$. The corresponding theoretical result, when the error variance is unknown, is then
$$\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\dfrac{s_R^2}{(n - 1) s_X^2}}} \sim t_{n-2}$$
based on which we obtain a $(1 - \alpha)$ confidence interval for $\beta_1$:
$$\hat{\beta}_1 \pm t_{n-2,\alpha/2} \sqrt{\frac{s_R^2}{(n - 1) s_X^2}}$$

The length of the interval decreases if:
I The sample size increases.
I The variance of $x_i$ increases.
I The residual variance decreases.
Hypothesis testing for the slope
In a similar manner, we construct a hypothesis test for $\beta_1$. In particular, if the true value of $\beta_1$ is zero, this means that the variable $Y$ does not depend on $X$ in a linear fashion. Thus, we are mainly interested in a two-sided test:
$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \ne 0$$
The rejection region is:
$$RR = \left\{ t : \left| \overbrace{\frac{\hat{\beta}_1}{\sqrt{s_R^2 / ((n - 1) s_X^2)}}}^{t} \right| > t_{n-2,\alpha/2} \right\}$$
Equivalently, if 0 is outside a $(1 - \alpha)$ confidence interval for $\beta_1$, we reject the null at significance level $\alpha$.
The p-value is:
$$\text{p-value} = 2 \Pr\left( T_{n-2} > \left| \frac{\hat{\beta}_1}{\sqrt{s_R^2 / ((n - 1) s_X^2)}} \right| \right)$$
Inference for the slope
Exercise 4.3
1. Find a 95% CI for the slope of the (population) regression model from Example 4.1.
2. Test the hypothesis that the price of wheat depends linearly on the production at a 0.05 significance level.

1. Since $t_{n-2,\alpha/2} = t_{8,0.025} = 2.306$ and $(n - 1) s_X^2 = 9 \times 32.04$,
$$-2.306 \le \frac{-1.3537 - \beta_1}{\sqrt{\dfrac{25.99}{9 \times 32.04}}} \le 2.306$$
$$-2.046 \le \beta_1 \le -0.661$$
2. Since the interval (with the same $\alpha$) doesn't contain 0, we reject the null $\beta_1 = 0$ at 0.05 level. Also, the (observed) test statistic is
$$t = \frac{\hat{\beta}_1}{\sqrt{s_R^2 / ((n - 1) s_X^2)}} = \frac{-1.3537}{\sqrt{\dfrac{25.99}{9 \times 32.04}}} = -4.509.$$
Thus, we have p-value $= 2 \Pr(T_8 > |-4.509|) = 0.002$


Inference for $\beta_1$ in software
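
A plain-Python sketch of Exercise 4.3 is given below. It takes the quantile 2.306 from the exercise rather than computing it, and it approximates the p-value by integrating the Student t density with Simpson's rule; the helper functions `t_pdf` and `two_sided_p_value` are written for this illustration and are not part of any library.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_obs, df, steps=2000):
    """Approximate 2 * Pr(T_df > |t_obs|) with Simpson's rule (steps must be even)."""
    b = abs(t_obs)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_half = total * h / 3  # Pr(0 < T < |t_obs|)
    return 2 * (0.5 - central_half)


# Wheat data from Example 4.1
production = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
price = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]
n = len(production)

x_bar = sum(production) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in production)  # equals (n - 1) * s_X^2
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(production, price))
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar
ssr = sum((y - beta0_hat - beta1_hat * x) ** 2 for x, y in zip(production, price))
s2_r = ssr / (n - 2)

se_beta1 = sqrt(s2_r / sxx)
t_quantile = 2.306  # t_{8, 0.025}, the tabulated value quoted in Exercise 4.3
ci_slope = (beta1_hat - t_quantile * se_beta1, beta1_hat + t_quantile * se_beta1)
t_stat = beta1_hat / se_beta1
p_value = two_sided_p_value(t_stat, n - 2)

print("95% CI for the slope:", tuple(round(v, 3) for v in ci_slope))
print("t statistic:", round(t_stat, 3))
print("p-value:", round(p_value, 4))
```

In practice a statistics library would supply the t quantile and the p-value directly; the numerical integration is used here only to keep the sketch free of external dependencies.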
Statistical inference for the intercept
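The estimator $\hat{\beta}_0$ follows a normal distribution because it is a linear combination of normal random variables,
$$\hat{\beta}_0 = \sum_{i=1}^{n} \left(\frac{1}{n} - \bar{x} w_i\right) Y_i$$
where $w_i = (x_i - \bar{x}) / ((n - 1) s_X^2)$ and $Y_i = \beta_0 + \beta_1 x_i + u_i$, which satisfies $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$. Additionally, $\hat{\beta}_0$ is an unbiased estimator of $\beta_0$,
$$E[\hat{\beta}_0] = \sum_{i=1}^{n} \left(\frac{1}{n} - \bar{x} w_i\right) E[Y_i] = \beta_0$$
whose variance is,
$$\mathrm{Var}[\hat{\beta}_0] = \sum_{i=1}^{n} \left(\frac{1}{n} - \bar{x} w_i\right)^2 \mathrm{Var}[Y_i] = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{(n - 1) s_X^2}\right).$$
Thus,
$$\hat{\beta}_0 \sim N\left(\beta_0, \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{(n - 1) s_X^2}\right)\right)$$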
Confidence interval for the intercept
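We wish to find a $(1 - \alpha)$ confidence interval for $\beta_0$. Since $\sigma^2$ is unknown, we estimate it with $s_R^2$ as before. We obtain:
$$\frac{\hat{\beta}_0 - \beta_0}{\sqrt{s_R^2 \left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{(n - 1) s_X^2}\right)}} \sim t_{n-2}$$
which yields the following confidence interval for $\beta_0$:
$$\hat{\beta}_0 \pm t_{n-2,\alpha/2} \sqrt{s_R^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{(n - 1) s_X^2}\right)}$$

The length of the interval decreases if:
I The sample size increases.
I Variance of $x_i$ increases.
I The residual variance decreases.
I The mean of $x_i$ decreases in absolute value.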
Hypothesis test for the intercept
Based on the distribution of the estimator, we can carry out the test of hypothesis. In particular, if the true value of $\beta_0$ is 0, it means that the population regression line goes through the origin. For this case we would test:
$$H_0: \beta_0 = 0$$
$$H_1: \beta_0 \ne 0$$
The rejection region is:
$$RR = \left\{ t : \left| \overbrace{\frac{\hat{\beta}_0}{\sqrt{s_R^2 \left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{(n - 1) s_X^2}\right)}}}^{t} \right| > t_{n-2,\alpha/2} \right\}$$
Equivalently, if 0 is outside the $(1 - \alpha)$ confidence interval for $\beta_0$ we reject the null.
The p-value is
$$\text{p-value} = 2 \Pr\left( T_{n-2} > \left| \frac{\hat{\beta}_0}{\sqrt{s_R^2 \left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{(n - 1) s_X^2}\right)}} \right| \right)$$
Inference for the intercept
Exercise 4.4
1. Find a 95% CI for the intercept of the population regression line of Example 4.1.
2. Test the hypothesis that the population regression line passes through the origin at a 0.05 significance level.

1. The quantile is $t_{n-2,\alpha/2} = t_{8,0.025} = 2.306$ so
$$-2.306 \le \frac{74.1151 - \beta_0}{\sqrt{25.99 \left(\dfrac{1}{10} + \dfrac{28.6^2}{9 \times 32.04}\right)}} \le 2.306$$
$$53.969 \le \beta_0 \le 94.261$$
2. Since the interval (with the same $\alpha$) doesn't contain 0, we reject the null hypothesis that $\beta_0 = 0$. Also, the (observed) test statistic is
$$t = \frac{\hat{\beta}_0}{\sqrt{s_R^2 \left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{(n - 1) s_X^2}\right)}} = \frac{74.1151}{\sqrt{25.99 \left(\dfrac{1}{10} + \dfrac{28.6^2}{9 \times 32.04}\right)}} = 8.484$$
Thus, we have: p-value $= 2 \Pr(T_8 > |8.484|) = 0.000$ to three decimal places.


Inference for the intercept in software
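
Exercise 4.4 can be checked with a short plain-Python sketch along the same lines as the hand calculation. The quantile 2.306 is taken from the exercise, and the variable names are illustrative.

```python
from math import sqrt

# Wheat data from Example 4.1
production = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
price = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]
n = len(production)

x_bar = sum(production) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in production)  # equals (n - 1) * s_X^2
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(production, price))
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar
ssr = sum((y - beta0_hat - beta1_hat * x) ** 2 for x, y in zip(production, price))
s2_r = ssr / (n - 2)

# Standard error of the intercept estimator
se_beta0 = sqrt(s2_r * (1 / n + x_bar ** 2 / sxx))

t_quantile = 2.306  # t_{8, 0.025}, the tabulated value quoted in Exercise 4.4
ci_intercept = (beta0_hat - t_quantile * se_beta0, beta0_hat + t_quantile * se_beta0)
t_stat = beta0_hat / se_beta0

print("95% CI for the intercept:", tuple(round(v, 3) for v in ci_intercept))
print("t statistic:", round(t_stat, 3))
```

Small differences in the third decimal place with respect to the hand calculation come from rounding $s_R^2$ to 25.99 in the exercise.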
Inference for the error variance

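We have:
$$\frac{(n - 2)\, s_R^2}{\sigma^2} \sim \chi^2_{n-2}$$
Which means that:

I The $(1 - \alpha)$ confidence interval for $\sigma^2$ is:
$$\frac{(n - 2)\, s_R^2}{\chi^2_{n-2,\alpha/2}} \le \sigma^2 \le \frac{(n - 2)\, s_R^2}{\chi^2_{n-2,1-\alpha/2}}$$
I Which can be used to solve the test:
$$H_0: \sigma^2 = \sigma_0^2$$
$$H_1: \sigma^2 \ne \sigma_0^2$$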
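
As an illustration, the interval can be evaluated for the wheat data of Example 4.1 at the 95% level. The chi-square quantiles below (17.535 and 2.180 for 8 degrees of freedom) are standard table values supplied here as an assumption; they do not appear in the slides. The value 10 used for $\sigma_0^2$ is likewise hypothetical.

```python
# Quantities from Exercise 4.2: n = 10 observations, SSR = (n - 2) * s_R^2
n = 10
s2_r = 207.92 / 8

# Chi-square quantiles with n - 2 = 8 degrees of freedom (standard table values,
# assumed here): upper-tail probability 0.025 and upper-tail probability 0.975
chi2_upper = 17.535  # chi^2_{8, 0.025}
chi2_lower = 2.180   # chi^2_{8, 0.975}

ci_variance = ((n - 2) * s2_r / chi2_upper, (n - 2) * s2_r / chi2_lower)
print("95% CI for sigma^2:", tuple(round(v, 2) for v in ci_variance))

# Two-sided test of H0: sigma^2 = sigma0_sq via the interval
sigma0_sq = 10.0  # hypothetical null value, for illustration only
reject_h0 = not (ci_variance[0] <= sigma0_sq <= ci_variance[1])
print("reject H0:", reject_h0)
```

Note that the interval is not symmetric around $s_R^2$, because the chi-square distribution is skewed.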
Average and individual predictions
We consider two situations:
1. We wish to estimate/predict the average value of $Y$ for a given $X = x_0$.
2. We wish to estimate/predict the actual value of $Y$ for a given $X = x_0$.

For example in Ex. 4.1
1. What would be the average wheat price for all years in which the production was 30?
2. If in a given year, the production was 30, what would be the corresponding price of wheat?

In both cases:
$$\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 = \bar{y} + \hat{\beta}_1 (x_0 - \bar{x})$$
But the estimation errors are different.


Estimating/predicting the average value

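Remember that:
$$\mathrm{Var}(\hat{Y}_0) = \mathrm{Var}(\bar{Y}) + (x_0 - \bar{x})^2\, \mathrm{Var}(\hat{\beta}_1) = \sigma^2 \left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{(n - 1) s_X^2}\right)$$
The confidence interval for the mean prediction $E[Y_0 | X = x_0]$ is:
$$\hat{Y}_0 \pm t_{n-2,\alpha/2} \sqrt{s_R^2 \left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{(n - 1) s_X^2}\right)}$$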
Estimating/predicting the actual value
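The variance for the prediction of the actual value is the mean squared error:
$$E\left[(Y_0 - \hat{Y}_0)^2\right] = \mathrm{Var}(Y_0) + \mathrm{Var}(\hat{Y}_0) = \sigma^2 \left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{(n - 1) s_X^2}\right)$$
And thus the confidence interval for the actual value $Y_0$ is:
$$\hat{Y}_0 \pm t_{n-2,\alpha/2} \sqrt{s_R^2 \left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{(n - 1) s_X^2}\right)}$$
This interval is always wider than that for the average prediction, because of the extra term 1 inside the square root.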
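
To compare the two intervals numerically, the sketch below evaluates both at $x_0 = 30$ for the wheat data of Example 4.1, using the quantile $t_{8,0.025} = 2.306$ quoted in the exercises. The function name `prediction_intervals` is an illustrative choice.

```python
from math import sqrt

# Wheat data from Example 4.1
production = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
price = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]
n = len(production)

x_bar = sum(production) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in production)  # equals (n - 1) * s_X^2
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(production, price))
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar
ssr = sum((y - beta0_hat - beta1_hat * x) ** 2 for x, y in zip(production, price))
s2_r = ssr / (n - 2)


def prediction_intervals(x0, t_quantile):
    """Return the point estimate, the interval for the average value
    and the interval for the actual value at X = x0."""
    y0_hat = beta0_hat + beta1_hat * x0
    spread = 1 / n + (x0 - x_bar) ** 2 / sxx
    half_mean = t_quantile * sqrt(s2_r * spread)
    half_actual = t_quantile * sqrt(s2_r * (1 + spread))
    return (
        y0_hat,
        (y0_hat - half_mean, y0_hat + half_mean),
        (y0_hat - half_actual, y0_hat + half_actual),
    )


y0_hat, ci_mean, ci_actual = prediction_intervals(30, 2.306)
print("point estimate:", round(y0_hat, 2))
print("average value :", tuple(round(v, 2) for v in ci_mean))
print("actual value  :", tuple(round(v, 2) for v in ci_actual))
```

Both intervals are centred at the same point estimate and are narrowest at $x_0 = \bar{x}$, since the term $(x_0 - \bar{x})^2$ vanishes there.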
Estimating/predicting the average and actual values

In red: confidence intervals for the prediction of average value.

In pink: confidence intervals for the prediction of actual value.

[Figure: plot of the fitted model, price in pesetas (Precio en ptas., vertical axis, 25 to 50) against production in kg (Produccion en kg., horizontal axis, 22 to 40), with both sets of interval bands.]
Regression line: R-squared and variability decomposition
I Coefficient of determination, R-squared, is used to assess the goodness-of-fit of the model. It is defined as
$$R^2 = r_{(x,y)}^2 \in [0, 1]$$
I $R^2$ tells us what percentage of the sample variability in the $y$ variable is explained by the model, that is, by its linear dependence on $x$
I Values close to 100% indicate that the regression model is a good fit to the data (less than 60%, not so good)
I Variability decomposition and $R^2$: The Total Sum of Squares $\sum_i (y_i - \bar{y})^2$ can be decomposed into the Residual Sum of Squares $\sum_i (y_i - \hat{y}_i)^2$ + the Model Sum of Squares $\sum_i (\hat{y}_i - \bar{y})^2$
$$SST = SSR + SSM$$
and we have $R^2 = 1 - \frac{SSR}{SST} = \frac{SSM}{SST}$
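
The decomposition can be verified numerically on the wheat data of Example 4.1. The sketch below computes the three sums of squares and checks that the two expressions for $R^2$ agree with the squared correlation; variable names are illustrative.

```python
# Wheat data from Example 4.1
production = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
price = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]
n = len(production)

x_bar = sum(production) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in production)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(production, price))
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar
fitted = [beta0_hat + beta1_hat * x for x in production]

sst = sum((y - y_bar) ** 2 for y in price)                    # total
ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(price, fitted))  # residual
ssm = sum((y_hat - y_bar) ** 2 for y_hat in fitted)           # model

r_squared = ssm / sst
cor_squared = sxy ** 2 / (sxx * sst)  # squared correlation coefficient

print(round(sst, 2), round(ssr, 2), round(ssm, 2))
print(round(r_squared, 4), round(1 - ssr / sst, 4), round(cor_squared, 4))
```

For these data roughly 72% of the variability in price is explained by the linear dependence on production.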
Regression line: R-squared and variability decomposition

[Figure from Wikipedia, not reproduced here.]
ANOVA table

ANOVA (Analysis of Variance) table for the simple linear regression model

Source of variability   SS    DF      Mean                   F ratio
Model                   SSM   1       SSM/1                  SSM/s_R^2
Residuals/errors        SSR   n - 2   SSR/(n - 2) = s_R^2
Total                   SST   n - 1

Note that the value of the F statistic is the square of that for the t
statistic in the simple regression significance test.
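
The relation between the F ratio and the t statistic can be checked on the wheat data of Example 4.1 with the following sketch, which builds the ANOVA quantities and compares $F$ with $t^2$ for the slope. Variable names are illustrative.

```python
from math import sqrt

# Wheat data from Example 4.1
production = [30, 28, 32, 25, 25, 25, 22, 24, 35, 40]
price = [25, 30, 27, 40, 42, 40, 50, 45, 30, 25]
n = len(production)

x_bar = sum(production) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in production)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(production, price))
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar
fitted = [beta0_hat + beta1_hat * x for x in production]

ssm = sum((y_hat - y_bar) ** 2 for y_hat in fitted)
ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(price, fitted))
s2_r = ssr / (n - 2)

f_ratio = (ssm / 1) / s2_r                 # Model mean square / residual mean square
t_stat = beta1_hat / sqrt(s2_r / sxx)      # t statistic for H0: beta_1 = 0

print("F ratio:", round(f_ratio, 2))
print("t^2    :", round(t_stat ** 2, 2))
```

The equality holds because $SSM = \hat{\beta}_1^2 (n - 1) s_X^2$, so $SSM / s_R^2$ is the square of $\hat{\beta}_1 / \sqrt{s_R^2 / ((n - 1) s_X^2)}$.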