
Chapter 11: Simple Linear Regression

Where We’ve Been

 Presented methods for estimating and testing population parameters for a single sample
 Extended those methods to allow for a comparison of population parameters for multiple samples

Where We’re Going
 Introduce the straight-line (simple) linear regression model as a means of relating one quantitative variable to another
 Introduce the correlation coefficient as a means of measuring the strength of the linear relationship between two quantitative variables
 Assess how well the simple linear regression model fits the sample data
 Use the simple linear regression model to predict the value of one variable given the value of another
11.1: Probabilistic Models
 There may be a deterministic reality connecting two variables, y and x.
 But we may not know exactly what that reality is, or there may be an imprecise, or random, connection between the variables. The unknown/unknowable influence is referred to as the random error.
 So our probabilistic models refer to a specific connection between the variables, as well as influences we can't specify exactly in each case:
y = f(x) + random error
11.1: Probabilistic Models
The relationship between home runs and runs in baseball seems at first glance to be deterministic …

[Scatter plot: Runs vs. Home Runs]
11.1: Probabilistic Models
But if you consider how many runners are on base when the home run is hit, or even how often the batter misses a base and is called out, the rigid model becomes more variable.

[Scatter plot: Runs vs. Home Runs, with added scatter]
11.1: Probabilistic Models
 General Form of Probabilistic Models

y = Deterministic component + Random error,
where y is the variable of interest, and the mean value of the random error is assumed to be 0: E(y) = Deterministic component.

11.1: Probabilistic Models
 The goal of regression analysis is to find the straight line that comes closest to all of the points in the scatter plot simultaneously.

[Scatter plot: Attendance vs. Week, with fitted line y = -0.192x + 150.3]

11.1: Probabilistic Models

 A First-Order Probabilistic Model

y = β₀ + β₁x + ε

where y = dependent variable
x = independent variable
β₀ + β₁x = E(y) = deterministic component
ε = random error component
β₀ = y-intercept
β₁ = slope of the line
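A minimal simulation sketch of this model (not from the text; the parameter values below are made up) shows how the random error ε scatters observations around the deterministic line:

```python
# Sketch: simulate y = beta0 + beta1*x + epsilon with hypothetical parameters.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0            # made-up parameter values
x = np.linspace(0, 10, 50)
epsilon = rng.normal(0.0, sigma, size=x.size)  # random error, E(epsilon) = 0
y = beta0 + beta1 * x + epsilon                # observed responses
# Because E(epsilon) = 0, y averages out close to the deterministic component:
print(y.mean(), (beta0 + beta1 * x).mean())
```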
11.1: Probabilistic Models
β₀, the y-intercept, and β₁, the slope of the line, are population parameters, and invariably unknown. Regression analysis is designed to estimate these parameters.

11.2: Fitting the Model: The
Least Squares Approach

Step 1: Hypothesize the deterministic component of the probabilistic model: E(y) = β₀ + β₁x

Step 2: Use sample data to estimate the unknown parameters in the model
11.2: Fitting the Model: The
Least Squares Approach
[Scatter plot: Total Offering vs. Average Offering, with fitted line]

Values on the line are the predicted values of total offerings given the average offering. The distances between the scattered dots and the line are the errors of prediction.

11.2: Fitting the Model: The
Least Squares Approach
The line's estimated parameters are the values that minimize the sum of the squared errors of prediction; the method of finding those values is called the method of least squares.

11.2: Fitting the Model: The
Least Squares Approach

 Model: y = β₀ + β₁x + ε
 Estimates: ŷ = β̂₀ + β̂₁x
 Deviation: (yᵢ - ŷᵢ) = [yᵢ - (β̂₀ + β̂₁xᵢ)]
 SSE: Σ[yᵢ - (β̂₀ + β̂₁xᵢ)]²

11.2: Fitting the Model: The
Least Squares Approach

 The least squares line ŷ = β̂₀ + β̂₁x is the line that has the following two properties:
1. The sum of the errors (SE) equals 0.
2. The sum of squared errors (SSE) is smaller than that for any other straight-line model.

11.2: Fitting the Model: The
Least Squares Approach
Formulas for the Least Squares Estimates

Slope: β̂₁ = SSxy / SSxx, where
SSxy = Σ(xᵢ - x̄)(yᵢ - ȳ) = Σxᵢyᵢ - (Σxᵢ)(Σyᵢ)/n
SSxx = Σ(xᵢ - x̄)² = Σxᵢ² - (Σxᵢ)²/n

y-intercept: β̂₀ = ȳ - β̂₁x̄

11.2: Fitting the Model: The
Least Squares Approach
Can home runs be used to predict errors?

 Is there a relationship between the number of home runs a team hits and the quality of its fielding?

11.2: Fitting the Model: The
Least Squares Approach
Home Runs (x) | Errors (y) | xᵢ² | xᵢyᵢ
158 | 126 | 24,964 | 19,908
155 | 87 | 24,025 | 13,485
139 | 65 | 19,321 | 9,035
191 | 95 | 36,481 | 18,145
124 | 119 | 15,376 | 14,756
Σxᵢ = 767 | Σyᵢ = 492 | Σxᵢ² = 120,167 | Σxᵢyᵢ = 75,329

11.2: Fitting the Model: The
Least Squares Approach
Formulas for the Least Squares Estimates

Slope: β̂₁ = SSxy/SSxx = [Σxᵢyᵢ - (Σxᵢ)(Σyᵢ)/n] / [Σxᵢ² - (Σxᵢ)²/n]
= [75,329 - (767)(492)/5] / [120,167 - (767)²/5]
= -143.8 / 2,509.2 ≈ -.0573

y-intercept: β̂₀ = ȳ - β̂₁x̄ = 98.4 - (-.0573)(153.4) ≈ 107.2

11.2: Fitting the Model: The
Least Squares Approach
These results suggest that teams which hit more home runs are (slightly) better fielders, which is maybe not what we expected. There are, however, only five observations in the sample. It is important to take a closer look at the assumptions we made and the results we got.

11.3: Model Assumptions

Assumptions
1. The mean of the probability distribution of ε is 0.
2. The variance, σ², of the probability distribution of ε is constant.
3. The probability distribution of ε is normal.
4. The values of ε associated with any two values of y are independent.
11.3: Model Assumptions
 The variance, σ², is used in every test statistic and confidence interval used to evaluate the model.
 Invariably, σ² is unknown and must be estimated.

11.3: Model Assumptions
Estimation of σ² for a (First-Order) Straight-Line Model

s² = SSE/(n - 2), where SSE = Σ(yᵢ - ŷᵢ)² = SSyy - β̂₁SSxy,
SSyy = Σ(yᵢ - ȳ)², and SSxy = Σ(xᵢ - x̄)(yᵢ - ȳ).

The estimated standard deviation of ε is the square root of the variance:
s = √(s²) = √(SSE/(n - 2))

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁

Note: There may be many different patterns in the scatter plot when there is no linear relationship.

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
A critical step in the evaluation of the model is to test whether β₁ = 0.

[Three scatter plots: positive relationship (β₁ > 0), no relationship (β₁ = 0), negative relationship (β₁ < 0)]

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
H₀: β₁ = 0
Hₐ: β₁ ≠ 0

[Three scatter plots, as on the previous slide: positive relationship (β₁ > 0), no relationship (β₁ = 0), negative relationship (β₁ < 0)]

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
 The four assumptions described above produce a normal sampling distribution for the slope estimate:
β̂₁ ~ N(β₁, σ_β̂₁), where σ_β̂₁ = σ/√SSxx,
and σ̂_β̂₁ = s_β̂₁ = s/√SSxx is called the estimated standard error of the least squares slope estimate.
11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
A Test of Model Utility: Simple Linear Regression

One-Tailed Test: H₀: β₁ = 0, Hₐ: β₁ < 0 (or Hₐ: β₁ > 0)
Two-Tailed Test: H₀: β₁ = 0, Hₐ: β₁ ≠ 0

Test statistic: t = β̂₁/s_β̂₁ = β̂₁/(s/√SSxx)

Rejection region: t < -t_α (or t > t_α) for the one-tailed test; |t| > t_α/2 for the two-tailed test
Degrees of freedom = n - 2
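As a sketch of the two-tailed version of this test in Python (assuming SciPy is available; the numbers are from the home run and errors example):

```python
from scipy import stats

b1, s, ss_xx, n = -0.0573, 28.49, 2509.2, 5
se_b1 = s / ss_xx ** 0.5                    # estimated standard error of b1
t = b1 / se_b1                              # test statistic
p = 2 * stats.t.sf(abs(t), df=n - 2)        # two-tailed p-value, 3 df
print(round(t, 3), round(p, 2))             # about -0.101 and 0.93
```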

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁

Home Runs (x) | Errors (y) | xᵢ - x̄ | (xᵢ - x̄)² | ŷᵢ = E(Errors|HRs) | yᵢ - ŷᵢ | (yᵢ - ŷᵢ)²
158 | 126 | 4.6 | 21.16 | 98.14 | 27.86 | 776.4
155 | 87 | 1.6 | 2.56 | 98.31 | -11.31 | 127.9
139 | 65 | -14.4 | 207.4 | 99.23 | -34.23 | 1171
191 | 95 | 37.6 | 1414 | 96.25 | -1.245 | 1.55
124 | 119 | -29.4 | 864.4 | 100.1 | 18.92 | 357.8
Σxᵢ = 767 (x̄ = 153.4) | Σyᵢ = 492 | | SSxx = 2509 | | | SSE = 2435

s = √(SSE/(n - 2)) = √(2435/3) = 28.49
11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
s = 28.49
t = (β̂₁ - β₁,H₀)/s_β̂₁ = (-.0573 - 0)/(28.49/√2509) ≈ -.101

 Since the t-value does not lead to rejection of the null hypothesis, the possibilities include:
 A different set of data may yield different results.
 There may be a more complicated relationship.
 There may be no relationship (non-rejection does not lead to this conclusion automatically).

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁

 Interpreting p-Values for β₁
 Software packages report two-tailed p-values.
 To conduct one-tailed tests of hypotheses, the reported p-value p must be adjusted:
Upper-tailed test (Hₐ: β₁ > 0): p-value = p/2 if t > 0, or 1 - p/2 if t < 0
Lower-tailed test (Hₐ: β₁ < 0): p-value = p/2 if t < 0, or 1 - p/2 if t > 0

11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁

 A Confidence Interval on β₁
β̂₁ ± t_α/2 · s_β̂₁,
where the estimated standard error is s_β̂₁ = s/√SSxx
and t_α/2 is based on (n - 2) degrees of freedom.
11.4: Assessing the Utility of the Model:
Making Inferences about the Slope β₁
 In the home runs and errors example, the estimated β₁ was -.0573, and the estimated standard error was .569. With 3 degrees of freedom, t.025 = 3.182.
 The confidence interval is, therefore,
β̂₁ ± t_α/2 · s_β̂₁ = -.0573 ± 3.182(.569) = -.0573 ± 1.81,
which includes 0, so there may be no relationship between the two variables.
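The same interval can be computed with SciPy supplying the critical value (a sketch using the numbers above):

```python
from scipy import stats

b1, se_b1, n = -0.0573, 0.569, 5
t_crit = stats.t.ppf(0.975, df=n - 2)      # about 3.182 for 3 df, 95% level
print(round(b1 - t_crit * se_b1, 2), round(b1 + t_crit * se_b1, 2))
# about -1.87 and 1.75: the interval includes 0
```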
11.5: The Coefficients of
Correlation and Determination

 The coefficient of correlation, r, is a measure of the strength of the linear relationship between two variables. It is computed as follows:
r = SSxy / √(SSxx · SSyy)

11.5: The Coefficients of
Correlation and Determination
[Three scatter plots: positive linear relationship (r → +1), no linear relationship (r ≈ 0), negative linear relationship (r → -1)]

Values of r equal to +1 or -1 require each point in the scatter plot to lie on a single straight line.

11.5: The Coefficients of
Correlation and Determination

 In the example about home runs and errors, SSxy = -143.8 and SSxx = 2509.
 SSyy is computed as
SSyy = Σyᵢ² - (Σyᵢ)²/n = 50,856 - 492²/5 = 2443.2,
so r = SSxy/√(SSxx · SSyy) = -143.8/√((2509)(2443.2)) ≈ -.058
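As a cross-check, NumPy's built-in correlation function reproduces this value (an equivalent computation, not the book's software):

```python
import numpy as np

home_runs = [158, 155, 139, 191, 124]
errors = [126, 87, 65, 95, 119]
r = np.corrcoef(home_runs, errors)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))                         # about -0.058
```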
11.5: The Coefficients of
Correlation and Determination

 An r value that close to zero suggests there may not be a linear relationship between the variables, which is consistent with our earlier look at the null hypothesis and the confidence interval on β₁.

11.5: The Coefficients of
Correlation and Determination

 The coefficient of determination, r², represents the proportion of the total sample variability around the mean of y that is explained by the linear relationship between x and y:

r² = (SSyy - SSE)/SSyy = 1 - SSE/SSyy, with 0 ≤ r² ≤ 1

11.5: The Coefficients of
Correlation and Determination
 With no other information, values of y are predicted with the mean of y.
 With a hypothesized linear relationship, values of y|x are predicted from the model.
 The coefficient of determination r² evaluates the power of x to predict values of y:
• High r²: x provides important information about y, and predictions based on the model are more accurate.
• Low r²: knowing values of x does not substantially improve predictions of y; there may be no relationship between x and y, or it may be more subtle than a linear relationship.

11.6: Using the Model for
Estimation and Prediction

Statistical inference based on the linear regression model ŷ = β̂₀ + β̂₁x takes two forms:
 Estimate the mean of y for a specific value of x, E(y)|x (over many experiments with this x-value).
 Estimate an individual value of y for a given x value (for a single experiment with this value of x).

11.6: Using the Model for
Estimation and Prediction
Sampling Error for the Estimator of the Mean of y at x = xp

σ_ŷ (the standard error of ŷ) = σ √(1/n + (xp - x̄)²/SSxx),
where σ is the standard deviation of the error term ε.
Using s to estimate σ, a 100(1 - α)% confidence interval for the mean of y is
ŷ ± t_α/2 · s √(1/n + (xp - x̄)²/SSxx),
with n - 2 degrees of freedom.

11.6: Using the Model for
Estimation and Prediction
 Based on our model results, a team that hits 140 home runs is expected to make about 99.2 errors:
ŷ = 107.2 - .0573x = 107.2 - .0573(140) ≈ 99.2
In our example, s = 28.49, n = 5, x̄ = 153.4, and SSxx = 2509, so
σ̂_ŷ = s √(1/5 + (140 - 153.4)²/2509) = 28.49(.5211) = 14.85
A 95% confidence interval for E(y | x = 140) is
ŷ ± t.025,df=3 · σ̂_ŷ = 99.2 ± 3.182(14.85) = 99.2 ± 47.2
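A sketch of this interval computation in Python (assuming SciPy; the inputs are the corrected example values above):

```python
from scipy import stats

b0, b1 = 107.19, -0.0573           # estimates from Section 11.2
s, n, xbar, ss_xx, xp = 28.49, 5, 153.4, 2509.2, 140.0

y_hat = b0 + b1 * xp
se_mean = s * (1 / n + (xp - xbar) ** 2 / ss_xx) ** 0.5
t_crit = stats.t.ppf(0.975, df=n - 2)
print(round(y_hat, 1), round(t_crit * se_mean, 1))   # about 99.2 and 47.2
```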
11.6: Using the Model for
Estimation and Prediction
Sampling Error for the Predictor of an Individual Value of y at x = xp

σ_(y-ŷ) (the standard error of prediction) = σ √(1 + 1/n + (xp - x̄)²/SSxx),
where σ is the standard deviation of the error term ε.
Using s to estimate σ, a 100(1 - α)% prediction interval for y|xp is
ŷ ± t_α/2 · s √(1 + 1/n + (xp - x̄)²/SSxx),
with n - 2 degrees of freedom.

11.6: Using the Model for
Estimation and Prediction
 A 95% Prediction Interval for an Individual Team's Errors
ŷ = 107.2 - .0573x = 107.2 - .0573(140) ≈ 99.2
In our example, s = 28.49, n = 5, x̄ = 153.4, and SSxx = 2509, so
σ̂_(y-ŷ) = s √(1 + 1/5 + (140 - 153.4)²/2509) = 28.49(1.128) = 32.13
A 95% prediction interval for y | x = 140 is
ŷ ± t.025,df=3 · σ̂_(y-ŷ) = 99.2 ± 3.182(32.13) = 99.2 ± 102.2
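The matching sketch for the prediction interval; the extra "1 +" under the square root is what widens it:

```python
from scipy import stats

b0, b1 = 107.19, -0.0573           # estimates from Section 11.2
s, n, xbar, ss_xx, xp = 28.49, 5, 153.4, 2509.2, 140.0

y_hat = b0 + b1 * xp
se_pred = s * (1 + 1 / n + (xp - xbar) ** 2 / ss_xx) ** 0.5  # extra "1 +"
t_crit = stats.t.ppf(0.975, df=n - 2)
print(round(y_hat, 1), round(t_crit * se_pred, 1))   # about 99.2 and 102.2
```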
11.6: Using the Model for
Estimation and Prediction
Prediction intervals for individual new values of y are wider than confidence intervals on the mean of y because of the extra source of error: the error in predicting a specific value of y|xp combines the error in estimating the mean E(y|xp) with the sampling error from the y population.
11.6: Using the Model for
Estimation and Prediction
 Estimating y beyond the range of values associated with
the observed values of x can lead to large prediction
errors.
 Beyond the range of observed x values, the relationship
may look very different.
[Plot: within the range of observed x values, the estimated and true relationships nearly coincide; beyond that range they can diverge sharply]
11.7: A Complete Example

 Step 1
 How does the proximity of a fire house (x) affect the damages (y) from a fire?
 y = f(x)
 y = β₀ + β₁x + ε

11.7: A Complete Example

 Step 2
 The data (found in Table 11.7) produce the following estimates (in thousands of dollars):
β̂₀ = 10.28
β̂₁ = 4.92
 The estimated damages equal $10,280 plus $4,920 for each mile from the fire station, or
ŷ = 10.28 + 4.92x

11.7: A Complete Example

 Step 3
 The estimate of the standard deviation, σ, of ε is
s = 2.31635
 Most of the observed fire damages will be within 2s ≈ 4.64 thousand dollars of the predicted value

11.7: A Complete Example

 Step 4
 Test that the true slope is 0:
H₀: β₁ = 0
Hₐ: β₁ > 0
 SAS automatically performs a two-tailed test, with a reported p-value < .0001. The one-tailed p-value is < .00005, which provides strong evidence to reject the null.
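The SAS printout itself is not reproduced here. As a sketch, comparable output can be generated with Python's statsmodels (an assumption, not the book's tool); for illustration it is applied to the earlier home run and errors data, since the fire damage data of Table 11.7 are not included in the deck:

```python
import statsmodels.api as sm

home_runs = [158, 155, 139, 191, 124]
errors = [126, 87, 65, 95, 119]

X = sm.add_constant(home_runs)     # prepend an intercept column
fit = sm.OLS(errors, X).fit()
print(fit.summary())               # coefficients, t-tests, p-values, r^2
```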
11.7: A Complete Example

 Step 4 (continued)
 A 95% confidence interval on β₁ from the SAS output is 4.071 ≤ β₁ ≤ 5.768.
 The coefficient of determination, r², is .9235.
 The coefficient of correlation, r, is
r = √r² = √.9235 ≈ .96

11.7: A Complete Example
 Suppose the distance from the nearest station is 3.5 miles. We can estimate the damage with the model estimates.
ŷ = β̂₀ + β̂₁x = 10.28 + 4.92(3.5) ≈ 27.5
95% prediction interval: (22.324, 32.667)
We're 95% sure the damage for a fire 3.5 miles from the nearest station will be between $22,324 and $32,667.
11.7: A Complete Example
 Since the x-values in our sample range from .7 to 6.1, predictions about y for x-values beyond this range will be unreliable.
