
Chapter 11: Simple Linear Regression

Where We’ve Been

 Presented methods for estimating and testing population parameters for a single sample
 Extended those methods to allow for a comparison of population parameters for multiple samples

Where We’re Going
 Introduce the straight-line (simple) linear regression model as a means of relating one quantitative variable to another
 Introduce the correlation coefficient as a means of measuring the strength of the linear relationship between two quantitative variables
 Assess how well the simple linear regression model fits the sample data
 Use the simple linear regression model to predict the value of one variable given the value of another
11.1: Probabilistic Models
 There may be a deterministic reality connecting two variables, y and x.
 But we may not know exactly what that reality is, or there may be an imprecise, or random, connection between the variables. The unknown/unknowable influence is referred to as the random error.
 So our probabilistic models refer to a specific connection between the variables, as well as influences we can't specify exactly in each case:
y = f(x) + random error
11.1: Probabilistic Models
The relationship between home runs and runs in baseball seems at first glance to be deterministic …

[Scatter plot: Runs vs. Home Runs]
11.1: Probabilistic Models
But if you consider how many runners are on base when the home run is hit, or even how often the batter misses a base and is called out, the rigid model becomes more variable.

[Scatter plot: Runs vs. Home Runs, with added scatter]
11.1: Probabilistic Models
 General Form of Probabilistic Models

y = Deterministic component + Random error,
where y is the variable of interest, and the mean value of the random error is assumed to be 0: E(y) = Deterministic component.

11.1: Probabilistic Models
 The goal of regression analysis is to find the straight line that comes closest to all of the points in the scatter plot simultaneously.

[Scatter plot: Attendance vs. Week, with fitted line y = -0.192x + 150.3]

11.1: Probabilistic Models

 A First-Order Probabilistic Model

y = β₀ + β₁x + ε

where y = dependent variable
x = independent variable
β₀ + β₁x = E(y) = deterministic component
ε = random error component
β₀ = y-intercept
β₁ = slope of the line
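A minimal simulation sketch of this model (not from the text; the parameter values below are made up) shows how the random error ε scatters observations around the deterministic line:

```python
# Sketch: simulate y = beta0 + beta1*x + epsilon with hypothetical parameters.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0            # made-up parameter values
x = np.linspace(0, 10, 50)
epsilon = rng.normal(0.0, sigma, size=x.size)  # random error, E(epsilon) = 0
y = beta0 + beta1 * x + epsilon                # observed responses
# Because E(epsilon) = 0, y averages out close to the deterministic component:
print(y.mean(), (beta0 + beta1 * x).mean())
```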
11.1: Probabilistic Models
β₀, the y-intercept, and β₁, the slope of the line, are population parameters, and invariably unknown. Regression analysis is designed to estimate these parameters.

11.2: Fitting the Model: The
Least Squares Approach

Step 1: Hypothesize the deterministic component of the probabilistic model: E(y) = β₀ + β₁x

Step 2: Use sample data to estimate the unknown parameters in the model
11.2: Fitting the Model: The
Least Squares Approach
[Scatter plot: Total Offering vs. Average Offering, with fitted line]

Values on the line are the predicted values of total offerings given the average offering. The distances between the scattered dots and the line are the errors of prediction.

11.2: Fitting the Model: The
Least Squares Approach
The line's estimated parameters are the values that minimize the sum of the squared errors of prediction; the method of finding those values is called the method of least squares.

11.2: Fitting the Model: The
Least Squares Approach

 Model: y = β₀ + β₁x + ε
 Estimates: ŷ = β̂₀ + β̂₁x
 Deviation: (yᵢ - ŷᵢ) = [yᵢ - (β̂₀ + β̂₁xᵢ)]
 SSE: Σ[yᵢ - (β̂₀ + β̂₁xᵢ)]²

11.2: Fitting the Model: The
Least Squares Approach

 The least squares line ŷ = β̂₀ + β̂₁x is the line that has the following two properties:
1. The sum of the errors (SE) equals 0.
2. The sum of squared errors (SSE) is smaller than that for any other straight-line model.

11.2: Fitting the Model: The
Least Squares Approach
Formulas for the Least Squares Estimates

Slope: β̂₁ = SSxy / SSxx, where
SSxy = Σ(xᵢ - x̄)(yᵢ - ȳ) = Σxᵢyᵢ - (Σxᵢ)(Σyᵢ)/n
SSxx = Σ(xᵢ - x̄)² = Σxᵢ² - (Σxᵢ)²/n

y-intercept: β̂₀ = ȳ - β̂₁x̄

11.2: Fitting the Model: The
Least Squares Approach
Can home runs be used to predict errors?

 Is there a relationship between the number of home runs a team hits and the quality of its fielding?

11.2: Fitting the Model: The
Least Squares Approach
Home Runs (x) | Errors (y) | xᵢ² | xᵢyᵢ
158 | 126 | 24,964 | 19,908
155 | 87 | 24,025 | 13,485
139 | 65 | 19,321 | 9,035
191 | 95 | 36,481 | 18,145
124 | 119 | 15,376 | 14,756
Σxᵢ = 767 | Σyᵢ = 492 | Σxᵢ² = 120,167 | Σxᵢyᵢ = 75,329

11.2: Fitting the Model: The
Least Squares Approach
Formulas for the Least Squares Estimates

Slope: β̂₁ = SSxy/SSxx = [Σxᵢyᵢ - (Σxᵢ)(Σyᵢ)/n] / [Σxᵢ² - (Σxᵢ)²/n]
= [75,329 - (767)(492)/5] / [120,167 - (767)²/5]
= -143.8 / 2,509.2 ≈ -.0573

y-intercept: β̂₀ = ȳ - β̂₁x̄ = 98.4 - (-.0573)(153.4) ≈ 107.2

11.2: Fitting the Model: The
Least Squares Approach
These results suggest that teams which hit more home runs are (slightly) better fielders, which is maybe not what we expected. There are, however, only five observations in the sample. It is important to take a closer look at the assumptions we made and the results we got.

11.3: Model Assumptions

Assumptions
1. The mean of the probability distribution of ε is 0.
2. The variance, σ², of the probability distribution of ε is constant.
3. The probability distribution of ε is normal.
4. The values of ε associated with any two values of y are independent.
11.3: Model Assumptions
 The variance, σ², is used in every test statistic and confidence interval used to evaluate the model.
 Invariably, σ² is unknown and must be estimated.

11.3: Model Assumptions
Estimation of σ² for a (First-Order) Straight-Line Model

s² = SSE/(n - 2), where SSE = Σ(yᵢ - ŷᵢ)² = SSyy - β̂₁SSxy,
SSyy = Σ(yᵢ - ȳ)², and SSxy = Σ(xᵢ - x̄)(yᵢ - ȳ).

The estimated standard deviation of ε is the square root of the variance:
s = √(s²) = √(SSE/(n - 2))

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁

Note: There may be many different patterns in the scatter plot when there is no linear relationship.

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
A critical step in the evaluation of the model is to test whether β₁ = 0.

[Three scatter plots: positive relationship (β₁ > 0), no relationship (β₁ = 0), negative relationship (β₁ < 0)]

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
H₀: β₁ = 0
Hₐ: β₁ ≠ 0

[Three scatter plots, as on the previous slide: positive relationship (β₁ > 0), no relationship (β₁ = 0), negative relationship (β₁ < 0)]

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
 The four assumptions described above produce a normal sampling distribution for the slope estimate:
β̂₁ ~ N(β₁, σ_β̂₁), where σ_β̂₁ = σ/√SSxx,
and σ̂_β̂₁ = s_β̂₁ = s/√SSxx is called the estimated standard error of the least squares slope estimate.
11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
A Test of Model Utility: Simple Linear Regression

One-Tailed Test: H₀: β₁ = 0, Hₐ: β₁ < 0 (or Hₐ: β₁ > 0)
Two-Tailed Test: H₀: β₁ = 0, Hₐ: β₁ ≠ 0

Test statistic: t = β̂₁/s_β̂₁ = β̂₁/(s/√SSxx)

Rejection region: t < -t_α (or t > t_α) for the one-tailed test; |t| > t_α/2 for the two-tailed test
Degrees of freedom = n - 2
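As a sketch of the two-tailed version of this test in Python (assuming SciPy is available; the numbers are from the home run and errors example):

```python
from scipy import stats

b1, s, ss_xx, n = -0.0573, 28.49, 2509.2, 5
se_b1 = s / ss_xx ** 0.5                    # estimated standard error of b1
t = b1 / se_b1                              # test statistic
p = 2 * stats.t.sf(abs(t), df=n - 2)        # two-tailed p-value, 3 df
print(round(t, 3), round(p, 2))             # about -0.101 and 0.93
```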

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁

Home Runs (x) | Errors (y) | xᵢ - x̄ | (xᵢ - x̄)² | ŷᵢ = E(Errors|HRs) | yᵢ - ŷᵢ | (yᵢ - ŷᵢ)²
158 | 126 | 4.6 | 21.16 | 98.14 | 27.86 | 776.4
155 | 87 | 1.6 | 2.56 | 98.31 | -11.31 | 127.9
139 | 65 | -14.4 | 207.4 | 99.23 | -34.23 | 1171
191 | 95 | 37.6 | 1414 | 96.25 | -1.245 | 1.55
124 | 119 | -29.4 | 864.4 | 100.1 | 18.92 | 357.8
Σxᵢ = 767 (x̄ = 153.4) | Σyᵢ = 492 | | SSxx = 2509 | | | SSE = 2435

s = √(SSE/(n - 2)) = √(2435/3) = 28.49
11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
s = 28.49
t = (β̂₁ - β₁,H₀)/s_β̂₁ = (-.0573 - 0)/(28.49/√2509) ≈ -.101

 Since the t-value does not lead to rejection of the null hypothesis, the possibilities include:
 A different set of data may yield different results.
 There may be a more complicated relationship.
 There may be no relationship (non-rejection does not lead to this conclusion automatically).

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁

 Interpreting p-Values for β₁
 Software packages report two-tailed p-values.
 To conduct one-tailed tests of hypotheses, the reported p-value p must be adjusted:
Upper-tailed test (Hₐ: β₁ > 0): p-value = p/2 if t > 0, or 1 - p/2 if t < 0
Lower-tailed test (Hₐ: β₁ < 0): p-value = p/2 if t < 0, or 1 - p/2 if t > 0

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁

 A Confidence Interval on β₁
β̂₁ ± t_α/2 · s_β̂₁,
where the estimated standard error is s_β̂₁ = s/√SSxx
and t_α/2 is based on (n - 2) degrees of freedom.
11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
 In the home runs and errors example, the estimated β₁ was -.0573, and the estimated standard error was .569. With 3 degrees of freedom, t.025 = 3.182.
 The confidence interval is, therefore,
β̂₁ ± t_α/2 · s_β̂₁ = -.0573 ± 3.182(.569) = -.0573 ± 1.81,
which includes 0, so there may be no relationship between the two variables.
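The same interval can be computed with SciPy supplying the critical value (a sketch using the numbers above):

```python
from scipy import stats

b1, se_b1, n = -0.0573, 0.569, 5
t_crit = stats.t.ppf(0.975, df=n - 2)      # about 3.182 for 3 df, 95% level
print(round(b1 - t_crit * se_b1, 2), round(b1 + t_crit * se_b1, 2))
# about -1.87 and 1.75: the interval includes 0
```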
11.5: The Coefficients of
Correlation and Determination

 The coefficient of correlation, r, is a measure of the strength of the linear relationship between two variables. It is computed as follows:
r = SSxy / √(SSxx · SSyy)

11.5: The Coefficients of
Correlation and Determination
[Three scatter plots: positive linear relationship (r → +1), no linear relationship (r ≈ 0), negative linear relationship (r → -1)]

Values of r equal to +1 or -1 require each point in the scatter plot to lie on a single straight line.

11.5: The Coefficients of
Correlation and Determination

 In the example about home runs and errors, SSxy = -143.8 and SSxx = 2509.
 SSyy is computed as
SSyy = Σyᵢ² - (Σyᵢ)²/n = 50,856 - 492²/5 = 2443.2,
so r = SSxy/√(SSxx · SSyy) = -143.8/√((2509)(2443.2)) ≈ -.058
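As a cross-check, NumPy's built-in correlation function reproduces this value (an equivalent computation, not the book's software):

```python
import numpy as np

home_runs = [158, 155, 139, 191, 124]
errors = [126, 87, 65, 95, 119]
r = np.corrcoef(home_runs, errors)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))                         # about -0.058
```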
11.5: The Coefficients of
Correlation and Determination

 An r value that close to zero suggests there may not be a linear relationship between the variables, which is consistent with our earlier look at the null hypothesis and the confidence interval on β₁.

11.5: The Coefficients of
Correlation and Determination

 The coefficient of determination, r², represents the proportion of the total sample variability around the mean of y that is explained by the linear relationship between x and y:

r² = (SSyy - SSE)/SSyy = 1 - SSE/SSyy, with 0 ≤ r² ≤ 1

11.5: The Coefficients of
Correlation and Determination
 With no other information, values of y are predicted with the mean of y.
 With a hypothesized linear relationship, values of y|x are predicted from the model.
 The coefficient of determination r² evaluates the power of x to predict values of y:
• High r²: x provides important information about y, and predictions based on the model are more accurate.
• Low r²: knowing values of x does not substantially improve predictions of y; there may be no relationship between x and y, or it may be more subtle than a linear relationship.

11.6: Using the Model for
Estimation and Prediction

Statistical inference based on the linear regression model ŷ = β̂₀ + β̂₁x takes two forms:
 Estimate the mean of y for a specific value of x, E(y)|x (over many experiments with this x-value).
 Estimate an individual value of y for a given x value (for a single experiment with this value of x).

11.6: Using the Model for
Estimation and Prediction
Sampling Error for the Estimator of the Mean of y at x = xp

σ_ŷ (the standard error of ŷ) = σ √(1/n + (xp - x̄)²/SSxx),
where σ is the standard deviation of the error term ε.
Using s to estimate σ, a 100(1 - α)% confidence interval for the mean of y is
ŷ ± t_α/2 · s √(1/n + (xp - x̄)²/SSxx),
with n - 2 degrees of freedom.

11.6: Using the Model for
Estimation and Prediction
 Based on our model results, a team that hits 140 home runs is expected to make about 99.2 errors:
ŷ = 107.2 - .0573x = 107.2 - .0573(140) ≈ 99.2
In our example, s = 28.49, n = 5, x̄ = 153.4, and SSxx = 2509, so
σ̂_ŷ = s √(1/5 + (140 - 153.4)²/2509) = 28.49(.5211) = 14.85
A 95% confidence interval for E(y | x = 140) is
ŷ ± t.025,df=3 · σ̂_ŷ = 99.2 ± 3.182(14.85) = 99.2 ± 47.2
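A sketch of this interval computation in Python (assuming SciPy; the inputs are the corrected example values above):

```python
from scipy import stats

b0, b1 = 107.19, -0.0573           # estimates from Section 11.2
s, n, xbar, ss_xx, xp = 28.49, 5, 153.4, 2509.2, 140.0

y_hat = b0 + b1 * xp
se_mean = s * (1 / n + (xp - xbar) ** 2 / ss_xx) ** 0.5
t_crit = stats.t.ppf(0.975, df=n - 2)
print(round(y_hat, 1), round(t_crit * se_mean, 1))   # about 99.2 and 47.2
```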
11.6: Using the Model for
Estimation and Prediction
Sampling Error for the Predictor of an Individual Value of y at x = xp

σ_(y-ŷ) (the standard error of prediction) = σ √(1 + 1/n + (xp - x̄)²/SSxx),
where σ is the standard deviation of the error term ε.
Using s to estimate σ, a 100(1 - α)% prediction interval for y|xp is
ŷ ± t_α/2 · s √(1 + 1/n + (xp - x̄)²/SSxx),
with n - 2 degrees of freedom.

11.6: Using the Model for
Estimation and Prediction
 A 95% Prediction Interval for an Individual Team's Errors
ŷ = 107.2 - .0573x = 107.2 - .0573(140) ≈ 99.2
In our example, s = 28.49, n = 5, x̄ = 153.4, and SSxx = 2509, so
σ̂_(y-ŷ) = s √(1 + 1/5 + (140 - 153.4)²/2509) = 28.49(1.128) = 32.13
A 95% prediction interval for y | x = 140 is
ŷ ± t.025,df=3 · σ̂_(y-ŷ) = 99.2 ± 3.182(32.13) = 99.2 ± 102.2
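The matching sketch for the prediction interval; the extra "1 +" under the square root is what widens it:

```python
from scipy import stats

b0, b1 = 107.19, -0.0573           # estimates from Section 11.2
s, n, xbar, ss_xx, xp = 28.49, 5, 153.4, 2509.2, 140.0

y_hat = b0 + b1 * xp
se_pred = s * (1 + 1 / n + (xp - xbar) ** 2 / ss_xx) ** 0.5  # extra "1 +"
t_crit = stats.t.ppf(0.975, df=n - 2)
print(round(y_hat, 1), round(t_crit * se_pred, 1))   # about 99.2 and 102.2
```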
11.6: Using the Model for
Estimation and Prediction
Prediction intervals for individual new values of y are wider than confidence intervals on the mean of y because of the extra source of error: the error in predicting a specific value of y|xp combines the error in estimating the mean E(y|xp) with the sampling error from the y population.
11.6: Using the Model for
Estimation and Prediction
 Estimating y beyond the range of values associated with
the observed values of x can lead to large prediction
errors.
 Beyond the range of observed x values, the relationship
may look very different.
[Plot: within the range of observed x values, the estimated and true relationships nearly coincide; beyond that range they can diverge sharply]
11.7: A Complete Example

 Step 1
 How does the proximity of a fire house (x) affect the damages (y) from a fire?
 y = f(x)
 y = β₀ + β₁x + ε

11.7: A Complete Example

 Step 2
 The data (found in Table 11.7) produce the following estimates (in thousands of dollars):
β̂₀ = 10.28
β̂₁ = 4.92
 The estimated damages equal $10,280 plus $4,920 for each mile from the fire station, or
ŷ = 10.28 + 4.92x

11.7: A Complete Example

 Step 3
 The estimate of the standard deviation, σ, of ε is
s = 2.31635
 Most of the observed fire damages will be within 2s ≈ 4.64 thousand dollars of the predicted value

11.7: A Complete Example

 Step 4
 Test that the true slope is 0:
H₀: β₁ = 0
Hₐ: β₁ > 0
 SAS automatically performs a two-tailed test, with a reported p-value < .0001. The one-tailed p-value is < .00005, which provides strong evidence to reject the null.
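The SAS printout itself is not reproduced here. As a sketch, comparable output can be generated with Python's statsmodels (an assumption, not the book's tool); for illustration it is applied to the earlier home run and errors data, since the fire damage data of Table 11.7 are not included in the deck:

```python
import statsmodels.api as sm

home_runs = [158, 155, 139, 191, 124]
errors = [126, 87, 65, 95, 119]

X = sm.add_constant(home_runs)     # prepend an intercept column
fit = sm.OLS(errors, X).fit()
print(fit.summary())               # coefficients, t-tests, p-values, r^2
```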
11.7: A Complete Example

 Step 4 (continued)
 A 95% confidence interval on β₁ from the SAS output is 4.071 ≤ β₁ ≤ 5.768.
 The coefficient of determination, r², is .9235.
 The coefficient of correlation, r, is
r = √r² = √.9235 ≈ .96

11.7: A Complete Example
 Suppose the distance from the nearest station is 3.5 miles. We can estimate the damage with the model estimates.
ŷ = β̂₀ + β̂₁x = 10.28 + 4.92(3.5) ≈ 27.5
95% prediction interval: (22.324, 32.667)
We're 95% sure the damage for a fire 3.5 miles from the nearest station will be between $22,324 and $32,667.
11.7: A Complete Example
 Since the x-values in our sample range from .7 to 6.1, predictions about y for x-values beyond this range will be unreliable.
