
# Simple Linear Regression: Introduction

AS202 Applied Statistical Models

## What is Linear Regression?

It is a statistical technique that attempts to explore
and model the relationship between two or more
variables using a straight line.

Regression describes a numerical relationship between the variables; it is not necessarily a causal relationship.

Functional relation (a perfect fit: all values fall exactly on the straight line):

$Y = f(X) = a + bX$ or $Y = a + bX + cZ$

Regression model (not a perfect fit: values do not fall exactly on the line, so errors exist):

$Y = a + bX + \text{error}$ or $Y = a + b_1 X_1 + b_2 X_2 + \text{error}$

Type of relationship:

1. Linear relationship (refers to the slope parameters, not the regressors):

   Simple (one regressor):

   $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

   Multiple (k regressors):

   $y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon_i$

   Polynomial and interaction models are still linear in the parameters:

   $y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon_i$

   $y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \varepsilon_i$

[Figure: scatter of house price (RM1000) against size (square feet) and age of home (years).]

## 2. Non-linear relationship in one or more variables

Exponential, logarithmic, logistic, etc. For example:

$y_i = \beta_0 e^{\beta_1 x_i} \varepsilon_i$ (exponential)

$y_i = \frac{1}{1 + e^{\beta_0 + \beta_1 x_i}} + \varepsilon_i$ (logistic)

## We use regression for

1. Data description (modeling)
2. Parameter estimation
3. Prediction/forecasting
4. System control
5. Data reduction

Some applications:
1. An economist wants to investigate the relationship between the petrol price and the inflation rate.
2. A sales manager wants to predict next year's total sales based on the number of staff and the square footage of space in the store.
3. A policy maker wants to identify the main factors (e.g. speed limit, road condition, weather) that contribute to the number of road accidents.
4. A scientist wants to know at which level sound pollution will affect human health.
5. A computer scientist wants to compress an image for minimum storage.

# Simple Linear Regression: The Model

Example:
You want to know whether there is a relationship between the monthly personal income and the age of a worker, and then to forecast your monthly income when you are 50 years old. There are five workers in your study.
The two variables are:
1. Age of worker (years): independent variable (X)
2. Monthly personal income (RM): dependent variable (Y)
Sample size (n): 5 workers

| Respondent | Age (years) | Income (RM) |
|---|---|---|
| 1 | 34 | 2950 |
| 2 | 45 | 4000 |
| 3 | 29 | 2430 |
| 4 | 32 | 3000 |
| 5 | 23 | 1790 |

Question: what is the forecast monthly income at age 50? (The fitted model will give RM4565.87.)

## Mathematical equation for the straight line

$Y = \beta_0 + \beta_1 X$

The gaps between the points and the line are the errors of the model.

## The Simple Linear Regression model represents the straight line with errors:

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$ (Eq 1.1)

where $y_i$ is the dependent variable, $x_i$ is the regressor, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the random error.

Assumptions:

1) The error term $\varepsilon_i$ is normally distributed with mean $E(\varepsilon_i) = 0$ and constant variance $Var(\varepsilon_i) = \sigma^2$;
2) The errors are uncorrelated: $Cov(\varepsilon_i, \varepsilon_j) = 0,\ i \neq j$.

In short, $\varepsilon_i \sim NID(0, \sigma^2)$.

This implies that the dependent variable $y$ follows a normal distribution with

$E(y \mid x) = \beta_0 + \beta_1 x, \qquad Var(y \mid x) = \sigma^2$

Some properties:

[Figure: at each value of $x$, $y$ is normally distributed about the regression line $E(y \mid x) = \beta_0 + \beta_1 x$ with standard deviation $\sigma$. For example, at $x = 23$ the mean of $y$ is $\beta_0 + \beta_1(23)$, and at $x = 45$ it is $\beta_0 + \beta_1(45)$; the standard deviation is $\sigma$ in both cases.]

# Simple Linear Regression: Least Squares Estimation

## Find the line that fits the data best

That is, estimate the values of $\beta_0$ and $\beta_1$ that minimize the errors $\varepsilon_i$ in

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$

## Parameter estimation of $\beta_0$ and $\beta_1$: the least squares method

To eliminate the negative signs of the error terms, consider the squared errors. For $n$ pairs of sample data we have the error sum of squares

$S(\beta_0, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

The least squares criterion is to choose the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize $S(\beta_0, \beta_1)$. Setting the partial derivatives to zero gives

$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $S_{xx} = \sum x_i^2 - (\sum x_i)^2 / n$ and $S_{xy} = \sum x_i y_i - (\sum x_i)(\sum y_i) / n$.


| Respondent | Age (x) | Income (y) | xy | x² |
|---|---|---|---|---|
| 1 | 34 | 2950 | 100300 | 1156 |
| 2 | 45 | 4000 | 180000 | 2025 |
| 3 | 29 | 2430 | 70470 | 841 |
| 4 | 32 | 3000 | 96000 | 1024 |
| 5 | 23 | 1790 | 41170 | 529 |
| Sum | 163 | 14170 | 487940 | 5575 |

$\sum x_i = 163, \quad \sum y_i = 14170, \quad \bar{x} = 32.6, \quad \bar{y} = 2834, \quad \sum x_i y_i = 487940, \quad \sum x_i^2 = 5575$

## Thus, the fitted SLR model is

$\hat{y} = -410.77 + 99.53\,x$
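The fit can be reproduced with a few lines of code. Below is a minimal sketch in plain Python (variable names such as `b0` and `b1` are ours, not from the slides):

```python
# Least-squares fit of monthly income (RM) on age (years) for the 5 workers.
ages = [34, 45, 29, 32, 23]
incomes = [2950, 4000, 2430, 3000, 1790]
n = len(ages)

sum_x, sum_y = sum(ages), sum(incomes)
sum_xy = sum(x * y for x, y in zip(ages, incomes))   # 487940
sum_x2 = sum(x * x for x in ages)                    # 5575

Sxx = sum_x2 - sum_x ** 2 / n        # corrected sum of squares of x
Sxy = sum_xy - sum_x * sum_y / n     # corrected sum of cross-products
b1 = Sxy / Sxx                       # slope: about 99.53
b0 = sum_y / n - b1 * sum_x / n      # intercept: about -410.77
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```

Running it reproduces the fitted line obtained by hand from the table of sums.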

Properties of the LSE:

1. The LSEs are linear combinations of the observations $y_i$:

$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \sum_{i=1}^{n} \frac{(x_i - \bar{x})\,y_i}{S_{xx}} = \sum_{i=1}^{n} c_i y_i, \qquad c_i = \frac{x_i - \bar{x}}{S_{xx}}$

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \sum_{i=1}^{n} \left( \frac{1}{n} - c_i \bar{x} \right) y_i$

2. The LSEs are unbiased:

$E(\hat{\beta}_1) = \beta_1, \qquad E(\hat{\beta}_0) = \beta_0$

The variances of the LSEs are

$Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, \qquad Var(\hat{\beta}_0) = Var(\bar{y} - \hat{\beta}_1 \bar{x}) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)$

## Gauss-Markov Theorem (the best linear unbiased estimators)

Under the conditions of the regression model, the least squares estimators are unbiased and have minimum variance among all unbiased linear estimators.

Summary

1. The SLR model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$

# Simple Linear Regression: Forecasting Using SLR

Suppose you constructed the SLR model using $n$ pairs of sample data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, and the range of $x$ is $[a, b]$. There are two types of forecasting:

- Extrapolation: predict the value of $y$ using an $x$ outside the range $[a, b]$.
- Interpolation: predict the value of $y$ using an $x$ inside the range $[a, b]$.


Using the fitted model $\hat{y} = -410.77 + 99.53\,x$:

- At age 50: $\hat{y} =$ RM4565.87. Age 50 is outside the data range [23, 45], so this is extrapolation!
- At age 30: $\hat{y} =$ RM2575.21. Age 30 is inside the range, so this is interpolation!
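Both forecasts can be checked in code. A small sketch, using the fitted coefficients rounded to four decimal places (the helper name `predict` is ours):

```python
# Coefficients from the least-squares fit (rounded to 4 dp).
b0, b1 = -410.7733, 99.5329

def predict(age):
    """Point forecast of monthly income (RM) at the given age."""
    return b0 + b1 * age

print(round(predict(50), 2))  # 4565.87 (extrapolation: 50 is outside [23, 45])
print(round(predict(30), 2))  # 2575.21 (interpolation: 30 is inside [23, 45])
```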

1) The fitted equation is intended as an interpolation model over the range of the regressor variable. We must be careful if we extrapolate outside this range.

## Properties of the Fitted Regression Model

1. The difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$ is a residual:

$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i), \qquad \sum_{i=1}^{n} e_i = 0$

2. The LS regression line always passes through the centroid $(\bar{x}, \bar{y})$ of the data.

3. The sum of the observed values $y_i$ equals the sum of the fitted values:

$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{y}_i$

4. The sum of the residuals weighted by the corresponding value of the regressor variable always equals zero:

$\sum_{i=1}^{n} x_i e_i = 0$

5. The sum of the residuals weighted by the corresponding fitted value always equals zero:

$\sum_{i=1}^{n} \hat{y}_i e_i = 0$
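These properties can be verified numerically for the age/income data. A quick sketch in plain Python (no libraries assumed):

```python
# Check residual properties 1, 3, 4 and 5 on the age/income data.
ages = [34, 45, 29, 32, 23]
incomes = [2950, 4000, 2430, 3000, 1790]
n = len(ages)
xbar, ybar = sum(ages) / n, sum(incomes) / n
Sxx = sum((x - xbar) ** 2 for x in ages)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(ages, incomes))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * x for x in ages]
resid = [y - f for y, f in zip(incomes, fitted)]

print(abs(sum(resid)) < 1e-6)                                 # property 1
print(abs(sum(fitted) - sum(incomes)) < 1e-6)                 # property 3
print(abs(sum(x * e for x, e in zip(ages, resid))) < 1e-6)    # property 4
print(abs(sum(f * e for f, e in zip(fitted, resid))) < 1e-6)  # property 5
```

All four checks print `True` (up to floating-point rounding).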

# Simple Linear Regression: Interval Estimation

Estimation of the variance $\sigma^2$:

Method 1: based on several observations (replication) on $y$ for at least one value of $x$.
Method 2: when prior information concerning $\sigma^2$ is available.
Method 3: estimate based on the residual (error) sum of squares:

$SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

$\hat{\sigma}^2 = MS_{Res} = \frac{SS_{Res}}{n - 2}$

This unbiased estimator of $\sigma^2$ is called the Residual Mean Square, and its square root is called the standard error of regression.

For the example, $\hat{y}_i = -410.77 + 99.53\,x_i$ and

$SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 66063.02$

$MS_{Res} = \frac{SS_{Res}}{n - 2} = \frac{66063.02}{5 - 2} = 22021.01$
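The same numbers drop out of a direct computation. A minimal sketch:

```python
# Residual sum of squares and residual mean square for the example.
ages = [34, 45, 29, 32, 23]
incomes = [2950, 4000, 2430, 3000, 1790]
n = len(ages)
xbar, ybar = sum(ages) / n, sum(incomes) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(ages, incomes)) \
     / sum((x - xbar) ** 2 for x in ages)
b0 = ybar - b1 * xbar

ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, incomes))
ms_res = ss_res / (n - 2)          # unbiased estimate of sigma^2
print(round(ss_res, 2), round(ms_res, 2))  # 66063.02 22021.01
```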

## Interval Estimation in Simple Linear Regression

If the errors are NID, then the $100(1 - \alpha)\%$ confidence intervals for $\beta_1$, $\beta_0$ and $\sigma^2$ are

$\hat{\beta}_1 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_1)$

$\hat{\beta}_0 - t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0) \le \beta_0 \le \hat{\beta}_0 + t_{\alpha/2,\,n-2}\, se(\hat{\beta}_0)$

$\frac{(n-2)\,MS_{Res}}{\chi^2_{\alpha/2,\,n-2}} \le \sigma^2 \le \frac{(n-2)\,MS_{Res}}{\chi^2_{1-\alpha/2,\,n-2}}$
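For the worked example, the 95% intervals for the slope and intercept can be computed as follows. A sketch: the critical value $t_{0.025,3} \approx 3.1824$ is hard-coded from t tables rather than computed.

```python
import math

# 95% confidence intervals for slope and intercept of the age/income fit.
ages = [34, 45, 29, 32, 23]
incomes = [2950, 4000, 2430, 3000, 1790]
n = len(ages)
xbar, ybar = sum(ages) / n, sum(incomes) / n
Sxx = sum((x - xbar) ** 2 for x in ages)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(ages, incomes)) / Sxx
b0 = ybar - b1 * xbar
ms_res = sum((y - (b0 + b1 * x)) ** 2
             for x, y in zip(ages, incomes)) / (n - 2)

t_crit = 3.1824                                 # t(0.025, n-2 = 3), from tables
se_b1 = math.sqrt(ms_res / Sxx)
se_b0 = math.sqrt(ms_res * (1 / n + xbar ** 2 / Sxx))
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
print(ci_b1)   # roughly (70.31, 128.75)
print(ci_b0)   # roughly (-1386.5, 565.0)
```

The slope interval excludes zero, which already hints at a significant linear relationship.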

## Interval Estimation of the Mean Response

Let $x_0$ be any value of the regressor variable within the range of the original data on $x$ used to fit the model. Then the mean response $E(y \mid x_0)$ can be estimated by

$\widehat{E(y \mid x_0)} = \hat{\mu}_{y \mid x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$

where

$Var(\hat{\mu}_{y \mid x_0}) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)$

Then a $100(1 - \alpha)\%$ confidence interval on the mean response at the point $x = x_0$ is

$\hat{\mu}_{y \mid x_0} - t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)} \le E(y \mid x_0) \le \hat{\mu}_{y \mid x_0} + t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)}$
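As a sketch, the 95% confidence interval for the mean income at age $x_0 = 30$ (inside the data range; $t_{0.025,3} \approx 3.1824$ hard-coded from tables):

```python
import math

# 95% CI for the mean response E(y | x0) at x0 = 30.
ages = [34, 45, 29, 32, 23]
incomes = [2950, 4000, 2430, 3000, 1790]
n = len(ages)
xbar, ybar = sum(ages) / n, sum(incomes) / n
Sxx = sum((x - xbar) ** 2 for x in ages)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(ages, incomes)) / Sxx
b0 = ybar - b1 * xbar
ms_res = sum((y - (b0 + b1 * x)) ** 2
             for x, y in zip(ages, incomes)) / (n - 2)

x0 = 30
mu_hat = b0 + b1 * x0                      # centre: about 2575.21
half = 3.1824 * math.sqrt(ms_res * (1 / n + (x0 - xbar) ** 2 / Sxx))
print(mu_hat - half, mu_hat + half)
```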

# Simple Linear Regression: Prediction Interval

## Prediction of New Observations

If $x_0$ is the value of the regressor of interest, the point estimate of the new value of the response $y_0$ is

$\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$

Note that the random variable

$y_0 - \hat{y}_0 \sim N\left( 0,\ \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right) \right)$

so a $100(1 - \alpha)\%$ prediction interval on a future observation at $x_0$ is

$\hat{y}_0 - t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)} \le y_0 \le \hat{y}_0 + t_{\alpha/2,\,n-2} \sqrt{MS_{Res} \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)}$
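The matching 95% prediction interval for a single new worker aged 30 is wider than the confidence interval for the mean, because of the extra "1 +" term inside the square root. A sketch (again with $t_{0.025,3} \approx 3.1824$ hard-coded):

```python
import math

# 95% prediction interval for a new observation at x0 = 30.
ages = [34, 45, 29, 32, 23]
incomes = [2950, 4000, 2430, 3000, 1790]
n = len(ages)
xbar, ybar = sum(ages) / n, sum(incomes) / n
Sxx = sum((x - xbar) ** 2 for x in ages)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(ages, incomes)) / Sxx
b0 = ybar - b1 * xbar
ms_res = sum((y - (b0 + b1 * x)) ** 2
             for x, y in zip(ages, incomes)) / (n - 2)

x0 = 30
y0_hat = b0 + b1 * x0
half = 3.1824 * math.sqrt(ms_res * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))
print(y0_hat - half, y0_hat + half)   # noticeably wider than the mean CI
```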

# Simple Linear Regression: Hypothesis Testing

## Hypothesis Testing on the Parameters

Assumption: the errors $\varepsilon_i$ are normally distributed, $\varepsilon_i \sim NID(0, \sigma^2)$. This implies that $y_i \sim NID(\beta_0 + \beta_1 x_i,\ \sigma^2)$.

Since

$\hat{\beta}_1 = \sum_{i=1}^{n} c_i y_i \sim N\left( \beta_1,\ \frac{\sigma^2}{S_{xx}} \right)$

we can test

$H_0: \beta_1 = \beta_{10} \quad \text{vs} \quad H_1: \beta_1 \neq \beta_{10}$

$Z_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1)$

Typically $\sigma^2$ is unknown and the unbiased estimator $MS_{Res}$ is used. Then the test statistic becomes

$t_0 = \frac{\hat{\beta}_1 - \beta_{10}}{\sqrt{MS_{Res} / S_{xx}}} \sim t_{n-2}$

We reject the null hypothesis if $|t_0| > t_{\alpha/2,\,n-2}$.
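For the example, testing $H_0: \beta_1 = 0$ looks like this (a sketch; the critical value is from t tables):

```python
import math

# t test of H0: beta1 = 0 against H1: beta1 != 0 for the age/income fit.
ages = [34, 45, 29, 32, 23]
incomes = [2950, 4000, 2430, 3000, 1790]
n = len(ages)
xbar, ybar = sum(ages) / n, sum(incomes) / n
Sxx = sum((x - xbar) ** 2 for x in ages)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(ages, incomes)) / Sxx
b0 = ybar - b1 * xbar
ms_res = sum((y - (b0 + b1 * x)) ** 2
             for x, y in zip(ages, incomes)) / (n - 2)

t0 = b1 / math.sqrt(ms_res / Sxx)
print(round(t0, 4))  # 10.8401: well beyond t(0.025, 3) = 3.1824, reject H0
```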

Here $se(\hat{\beta}_1) = \sqrt{MS_{Res} / S_{xx}}$.

Similarly, for the intercept we can test

$H_0: \beta_0 = \beta_{00} \quad \text{vs} \quad H_1: \beta_0 \neq \beta_{00}$

using the test statistic

$t_0 = \frac{\hat{\beta}_0 - \beta_{00}}{se(\hat{\beta}_0)} = \frac{\hat{\beta}_0 - \beta_{00}}{\sqrt{MS_{Res} \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}} \sim t_{n-2}$

and we reject $H_0$ if $|t_0| > t_{\alpha/2,\,n-2}$.
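The same data give the intercept test statistic. A sketch:

```python
import math

# t test of H0: beta0 = 0 for the age/income fit.
ages = [34, 45, 29, 32, 23]
incomes = [2950, 4000, 2430, 3000, 1790]
n = len(ages)
xbar, ybar = sum(ages) / n, sum(incomes) / n
Sxx = sum((x - xbar) ** 2 for x in ages)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(ages, incomes)) / Sxx
b0 = ybar - b1 * xbar
ms_res = sum((y - (b0 + b1 * x)) ** 2
             for x, y in zip(ages, incomes)) / (n - 2)

se_b0 = math.sqrt(ms_res * (1 / n + xbar ** 2 / Sxx))
t0 = b0 / se_b0
print(round(t0, 4))  # -1.3398: |t0| < 3.1824, so do not reject H0
```

This matches the intercept row of the regression output below (t Stat about 1.34 in absolute value, p-value 0.27).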

$H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0$

Testing approaches: (i) the t statistic, or (ii) analysis of variance (ANOVA).

1) Accepting the null hypothesis suggests no linear relationship between $x$ and $y$: either (i) $x$ is of little value in explaining the variation of $y$, or (ii) the true relationship between $x$ and $y$ is not linear.

2) Rejecting the null hypothesis suggests that (i) $x$ is of value in explaining the variability of $y$, and (ii) either the straight-line model is adequate, or better results could be obtained with the addition of higher-order polynomial terms in $x$.

Measures of Variation

Total variation is made up of two parts: $SST = SSR + SSE$

- Total Sum of Squares: $SST = \sum (Y_i - \bar{Y})^2$, which measures the variation of the $Y_i$ values around their mean $\bar{Y}$ (total variation).
- Regression Sum of Squares: $SSR = \sum (\hat{Y}_i - \bar{Y})^2$, the variation attributable to the relationship between $X$ and $Y$ (explained variation).
- Error Sum of Squares: $SSE = \sum (Y_i - \hat{Y}_i)^2$, the variation in $Y$ attributable to factors other than $X$ (unexplained variation).

where $\hat{Y}_i$ is the predicted value of $Y$ for the given $X_i$.

[Figure: at a point $X_i$, the deviation of $Y_i$ from $\bar{Y}$ splits into the explained part $\hat{Y}_i - \bar{Y}$ (contributing to SSR) and the residual $Y_i - \hat{Y}_i$ (contributing to SSE).]

## F Test for Significance

F test statistic:

$F_{STAT} = \frac{MSR}{MSE}, \qquad MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1}$

where $F_{STAT}$ follows an F distribution with $k$ numerator and $(n - k - 1)$ denominator degrees of freedom ($k$ = the number of independent variables in the regression model).
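For the simple linear regression example ($k = 1$), the F statistic reproduces the ANOVA value and equals the square of the slope t statistic. A sketch:

```python
# ANOVA F test for significance of regression (k = 1 regressor).
ages = [34, 45, 29, 32, 23]
incomes = [2950, 4000, 2430, 3000, 1790]
n, k = len(ages), 1
xbar, ybar = sum(ages) / n, sum(incomes) / n
Sxx = sum((x - xbar) ** 2 for x in ages)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(ages, incomes)) / Sxx
b0 = ybar - b1 * xbar

sst = sum((y - ybar) ** 2 for y in incomes)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, incomes))
ssr = sst - sse
f_stat = (ssr / k) / (sse / (n - k - 1))
print(round(f_stat, 2))  # 117.51, which is the slope t statistic squared
```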

# Simple Linear Regression: Linear Association between X and Y

Coefficient of Determination $R^2$:

$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}$

where $SS_{Res}$ is the residual or error sum of squares, $SS_R$ is the regression or model sum of squares, and $SS_T$ is a measure of the variability in $y$ without considering the effect of the regressor variable $x$:

$SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad SS_T = SS_R + SS_{Res}$

$R^2$ measures the proportion of the variation in $y$ explained by the regressor $x$.
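For the worked example, $R^2$ matches the regression output. A sketch:

```python
# Coefficient of determination for the age/income fit.
ages = [34, 45, 29, 32, 23]
incomes = [2950, 4000, 2430, 3000, 1790]
n = len(ages)
xbar, ybar = sum(ages) / n, sum(incomes) / n
Sxx = sum((x - xbar) ** 2 for x in ages)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(ages, incomes)) / Sxx
b0 = ybar - b1 * xbar

sst = sum((y - ybar) ** 2 for y in incomes)
ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, incomes))
r2 = 1 - ss_res / sst
print(round(r2, 5))  # 0.97511: about 97.5% of the variation is explained
```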

SUMMARY OUTPUT (Excel)

| Regression Statistics | |
|---|---|
| Multiple R | 0.98747 |
| R Square | 0.97511 |
| Standard Error | 148.395 |
| Observations | 5 |

ANOVA:

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 2587656.98 | 2587656.98 | 117.509 | 0.00168 |
| Residual | 3 | 66063.02 | 22021.01 | | |
| Total | 4 | 2653720 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | -410.77 | 306.598 | -1.33978 | 0.2727 | -1386.51 | 564.96 |
| Age (x) | 99.53 | 9.182 | 10.84014 | 0.0016 | 70.31 | 128.75 |

# Simple Linear Regression: Some Remarks (Optional)

1) The disposition of the x values plays an important role in the least-squares fit.

2) Outliers can seriously disturb the least-squares fit.

3) When a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in any causal sense (cause and effect).

4) In some applications, the value of the regressor variable x required to predict y is unknown. Thus, to predict y, we must first predict x. The accuracy of the prediction of y depends on the accuracy of the prediction of x.

## Parameter Estimation: Maximum Likelihood Estimation (MLE)

If the form of the distribution of the errors is known, e.g. normal, MLE is an alternative way of estimating the parameters. The likelihood function, from the joint distribution of the observations, is

$L(y_i, x_i; \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[ -\frac{1}{2\sigma^2} (y_i - \beta_0 - \beta_1 x_i)^2 \right]$

The MLEs are obtained by solving

$\frac{\partial \ln L}{\partial \beta_0} \bigg|_{\tilde{\beta}_0, \tilde{\beta}_1, \tilde{\sigma}^2} = 0, \qquad \frac{\partial \ln L}{\partial \beta_1} \bigg|_{\tilde{\beta}_0, \tilde{\beta}_1, \tilde{\sigma}^2} = 0, \qquad \frac{\partial \ln L}{\partial \sigma^2} \bigg|_{\tilde{\beta}_0, \tilde{\beta}_1, \tilde{\sigma}^2} = 0$
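Solving these equations (a standard derivation, not shown on the slides) yields estimators of the regression coefficients that coincide with the least squares estimators, while the variance estimator divides by $n$ rather than $n - 2$:

```latex
\tilde{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \hat{\beta}_1, \qquad
\tilde{\beta}_0 = \bar{y} - \tilde{\beta}_1 \bar{x} = \hat{\beta}_0, \qquad
\tilde{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \tilde{\beta}_0 - \tilde{\beta}_1 x_i \right)^2
```

So $\tilde{\sigma}^2$ is a biased estimator of $\sigma^2$, unlike the residual mean square $MS_{Res} = SS_{Res}/(n-2)$.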