Regression
Prepared by:
Sutikno
Department of Statistics
Faculty of Mathematics and Natural Sciences
Sepuluh Nopember Institute of Technology (ITS)
Surabaya
Y Y
X X
Y Y
X X
Department of Statistics, ITS Surabaya Slide-6
Types of Relationships
(continued)
Strong relationships Weak relationships
Y Y
X X
Y Y
X X
Department of Statistics, ITS Surabaya Slide-7
Types of Relationships
(continued)
No relationship
X
Department of Statistics, ITS Surabaya Slide-8
Simple Linear Regression
Model
Population Random
Population Independent Error
Slope
Y intercept Variable term
Coefficient
Dependent
Variable
Yi = β0 + β1Xi + ε i
Linear component Random Error
component
Y Yi = β0 + β1Xi + ε i
Observed Value
of Y for Xi
εi Slope = β1
Predicted Value Random Error
of Y for Xi
for this Xi value
Intercept = β0
Xi X
Department of Statistics, ITS Surabaya Slide-10
Simple Linear Regression
Equation (Prediction Line)
The simple linear regression equation provides an
estimate of the population regression line
Estimated
(or predicted) Estimate of Estimate of the
Y value for the regression regression slope
observation i
intercept
Value of X for
Ŷi = b0 + b1Xi
observation i
300
250
200
150
100
50
0
0 500 1000 1500 2000 2500 3000
Square Feet
Department of Statistics, ITS Surabaya Slide-17
Regression Using Excel
Tools / Data Analysis / Regression
ANOVA df SS MS F Significance F
300 = 0.10977
250
200
150
100
Intercept 50
= 98.248 0
0 500 1000 1500 2000 2500 3000
Square Feet
house price = 98.24833 + 0.10977 (square feet)
= 98.25 + 0.1098(2000)
= 317.85
The predicted price for a house with 2000
square feet is 317.85($1,000s) = $317,850
Department of Statistics, ITS Surabaya Slide-23
Interpolation vs. Extrapolation
When using a regression model for prediction,
only predict within the relevant range of data
Relevant range for
interpolation
450
400
350
House Price ($1000s)
300
250
200
Do not try to
150
extrapolate
100
beyond the range
50
0
of observed X’s
0 500
Department of Statistics, ITS Surabaya
1000 1500 2000 2500 3000 Slide-24
Measures of Variation
Xi X
Department of Statistics, ITS Surabaya Slide-27
Coefficient of Determination, r2
The coefficient of determination is the portion
of the total variation in the dependent variable
that is explained by variation in the
independent variable
The coefficient of determination is also called
r-squared and is denoted as r2
SSR regression sum of squares
r =
2
=
SST total sum of squares
note: 0 ≤r ≤1
2
X
r =1
2
X
Department of Statistics, ITS Surabaya Slide-30
Examples of Approximate
r2 Values
r2 = 0
Y
No linear relationship
between X and Y:
SSE ∑ i i
( Y − Ŷ ) 2
S YX = = i=1
n−2 n−2
Where
SSE = error sum of squares
n = sample size
ANOVA df SS MS F Significance F
small s YX X large s YX X
Independence of Errors
Error values are statistically independent
Normality of Error
Error values (ε) are normally distributed for any given value of X
Y Y
x x
residuals
x residuals x
Not Linear
Department of Statistics, ITS Surabaya
Linear
Slide-38
Residual Analysis for
Independence
Not Independent
Independent
residuals
residuals
X
residuals
Percent
100
0
-3 -2 -1 0 1 2 3
Residual
Y Y
x x
residuals
x residuals x
Non-constant variance
Constant variance
RESIDUAL OUTPUT
House Price Model Residual Plot
Predicted Residuals
House Price
80
1 251.92316 -6.923162
2 273.87671 38.12329 60
3 284.85348 -5.853484 40
4 304.06284 3.937162
Residuals
20
5 218.99284 -19.99284
6 268.38832 -49.38832 0
0 1000 2000 3000
7 356.20251 48.79749 -20
8 367.17929 -43.17929 -40
9 254.6674 64.33264
-60
10 284.85348 -29.85348
Square Feet
Does not appear to violate
any regression assumptions
Department of Statistics, ITS Surabaya Slide-42
Measuring Autocorrelation:
The Durbin-Watson Statistic
15
Here, residuals show a 10
cyclic pattern, not Residuals 5
random. Cyclical 0
patterns are a sign of -5 0 2 4 6 8
positive autocorrelation -10
-15
Time (t)
∑ i
e
i=1
2
D less than 2 may signal positive
autocorrelation, D greater than 2 may
signal negative autocorrelation
0 dL dU 2
Department of Statistics, ITS Surabaya Slide-46
Testing for Positive
Autocorrelation
(continued)
160
140
120
100
y = 30.65 + 4.7038x
Sales
80
2
60 R = 0.8976
40
20
Is there autocorrelation?
0
0 5 10 15 20 25 30
Tim e
y = 30.65 + 4.7038x
Sales
Sum of Squared 3296.18 80
2
Difference of Residuals 60 R = 0.8976
Sum of Squared 3279.98 40
Residuals 20
Durbin-Watson Statistic 1.00494 0
0 5 10 15 20 25 30
Tim e
∑ (e − ei i −1 )2
3296.18
D= i= 2
n
= = 1.00494
3279.98
∑ ei
2
i =1
Department of Statistics, ITS Surabaya Slide-48
Testing for Positive
Autocorrelation
(continued)
Here, n = 25 and there is k = 1 one independent variable
Using the Durbin-Watson table, dL = 1.29 and dU = 1.45
D = 1.00494 < dL = 1.29, so reject H0 and conclude that
significant positive autocorrelation exists
Therefore the linear model is not the appropriate model
to forecast sales
Decision: reject H0 since
D = 1.00494 < dL
S YX S YX
Sb1 = =
SSX ∑ (X − X)
i
2
where:
Sb1 = Estimate of the standard error of the least squares slope
SSE
S YX = = Standard error of the estimate
n−2
Department of Statistics, ITS Surabaya Slide-50
Excel Output
Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error
Observations
41.33032
10
Sb1 = 0.03297
ANOVA df SS MS F Significance F
Y Y
t= b1 = regression slope
Sb1 coefficient
β1 = hypothesized slope
Sb 1= standard
d.f. = n − 2 error of the slope
Department of Statistics, ITS Surabaya Slide-53
Inference about the Slope:
t Test
(continued)
b1 Sb1
H0: β1 = 0 From Excel output:
H1: β1 ≠ 0 Coefficients Standard Error t Stat P-value
Intercept 98.24833 58.03348 1.69296 0.12892
Square Feet 0.10977 0.03297 3.32938 0.01039
b1 − β1 0.10977 − 0
t= = t = 3.32938
Sb1 0.03297
d.f. = 10-2 = 8
Decision:
α /2=.025 α /2=.025 Reject H0
Conclusion:
Reject H0 Do not reject H0 Reject H
There is sufficient evidence
-tα/2 tα/2 0
SSR
MSR =
k
SSE
MSE =
n − k −1
where F follows an F distribution with k numerator and (n – k - 1)
denominator degrees of freedom
Test statistic
r -ρ
= n – 2 degrees of freedom)
t (with
1− r 2
where
n−2 r=+ r 2
if b1 > 0
r = − r 2 if b1 < 0
Department of Statistics, ITS Surabaya Slide-63
Example: House Prices
Is there evidence of a linear relationship
between square feet and house price at
the .05 level of significance?
r −ρ .762 − 0
t= = = 3.329
1− r 2 1− .762 2
n−2 10 − 2
Department of Statistics, ITS Surabaya Slide-64
Example: Test Solution
r −ρ .762 − 0 Decision:
t= = = 3.329
1− r 2 1− .762 2 Reject H0
n−2 10 − 2 Conclusion:
There is
d.f. = 10-2 = 8
evidence of a
linear association
α /2=.025 α /2=.025
at the 5% level of
significance
Reject H0 Do not reject H0 Reject H0
-tα/2 tα/2
0
-2.3060 2.3060
3.329
Department of Statistics, ITS Surabaya Slide-65
Estimating Mean Values and
Predicting Individual Values
Goal: Form intervals around Y to express
uncertainty about the value of Y for a given Xi
Confidence
Interval for Y ∧
the mean of Y
Y, given Xi
∧
Y = b0+b1Xi
Prediction Interval
for an individual Y,
given Xi
Xi X
Department of Statistics, ITS Surabaya Slide-66
Confidence Interval for
the Average Y, Given X
Confidence interval estimate for the
mean value of Y given a particular Xi
Ŷ ± t n−2S YX hi
Size of interval varies according
to distance away from mean, X
Ŷ ± t n−2S YX 1 + hi
1 (Xi − X)2
Ŷ ± t n-2S YX + = 317.85 ± 37.12
n ∑ (Xi − X) 2
1 (Xi − X)2
Ŷ ± t n-1S YX 1+ + = 317.85 ± 102.28
n ∑ (Xi − X) 2
In Excel, use
Check the
“confidence and prediction interval for X=”
box and enter the X-value and confidence level
desired
Input values
∧
Y