
Chapter 11B

Chapter 11 Linear Regression

 

We continue to build our statistical confidence (in) intervals of class meetings

 

…and so much more


The Rest of the Story…


11-5 Confidence Intervals

11-5.1 Confidence Intervals on the Slope and Intercept

A 100(1 − α)% confidence interval on the intercept β₀ is

$$\hat{\beta}_0 \pm t_{\alpha/2,\,n-2}\,\sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)}$$

and on the slope β₁ is

$$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\,\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}$$
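As a numerical illustration of the two interval formulas above, here is a minimal Python sketch on a small made-up data set (not the NFL data from the text); the t critical value is hard-coded from a t table:

```python
import math

# Hypothetical data (x, y) -- illustrative only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.9, 4.1, 5.8, 8.2, 9.9]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx                 # slope estimate (beta_1 hat)
b0 = ybar - b1 * xbar          # intercept estimate (beta_0 hat)

# Residual variance estimate: sigma^2 hat = SS_E / (n - 2)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sigma2 = sse / (n - 2)

t = 3.182  # t_{0.025, n-2} for n = 5 (df = 3), from a t table

# 95% CI on the slope: b1 +/- t * sqrt(sigma2 / Sxx)
slope_half = t * math.sqrt(sigma2 / Sxx)
# 95% CI on the intercept: b0 +/- t * sqrt(sigma2 * (1/n + xbar^2/Sxx))
int_half = t * math.sqrt(sigma2 * (1 / n + xbar ** 2 / Sxx))

print((b1 - slope_half, b1 + slope_half))
print((b0 - int_half, b0 + int_half))
```

The same arithmetic, with the NFL data and n = 30, gives the intervals requested in the problem below.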

NFL Quarterback Ratings – problem 11-39

Find a 95% CI on the slope and on the intercept

Confidence Intervals on the Mean Response

Of course, errors in the slope and intercept propagate through to our predictions of mean response and of individual observations.

The point estimator of the mean response at x = x₀ is

$$\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0, \quad\text{using}\quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$$

so that

$$\hat{\mu}_{Y|x_0} = \bar{y} + \hat{\beta}_1(x_0 - \bar{x})$$

Its variance follows because $\mathrm{cov}(\bar{Y}, \hat{\beta}_1) = 0$ and $V(\hat{\beta}_1) = \sigma^2/S_{xx}$:

$$V\!\left(\hat{\mu}_{Y|x_0}\right) = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

This gives the 100(1 − α)% confidence interval on the mean response:

$$\hat{\mu}_{Y|x_0} \pm t_{\alpha/2,\,n-2}\,\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$
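The mean-response interval can be wrapped as a small function; a Python sketch, with made-up data and a hard-coded t quantile for illustration:

```python
import math

def mean_response_ci(x, y, x0, t_crit):
    """CI on the mean response at x0:
    mu_hat +/- t * sqrt(sigma2 * (1/n + (x0 - xbar)^2 / Sxx)).
    t_crit is t_{alpha/2, n-2}, e.g. from a t table."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    Sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = sse / (n - 2)
    mu_hat = b0 + b1 * x0
    half = t_crit * math.sqrt(sigma2 * (1 / n + (x0 - xbar) ** 2 / Sxx))
    return mu_hat - half, mu_hat + half

# Hypothetical data; t_{0.025, 3} = 3.182 for n = 5.
lo, hi = mean_response_ci([1, 2, 3, 4, 5], [1.9, 4.1, 5.8, 8.2, 9.9],
                          3.0, 3.182)
print(lo, hi)
```

Note the interval is narrowest at x₀ = x̄ (the (x₀ − x̄)² term vanishes) and widens toward the edges of the data.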

11-5.2 Confidence Interval on the Mean Response

Definition

NFL Quarterback Ratings – more of problem 11-39

Find a 95% CI on the mean rating when the average yards per attempt is 8.0


Prediction Intervals

Recall the extra term that got added to the standard error when we constructed prediction intervals in an earlier chapter. The concept is exactly the same here.


For a new observation Y₀ at x = x₀, the prediction error Y₀ − Ŷ₀ has variance

$$V\!\left(Y_0 - \hat{Y}_0\right) = V(Y_0) + V(\hat{Y}_0) = \sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

so the 100(1 − α)% prediction interval is

$$\hat{y}_0 - t_{\alpha/2,\,n-2}\sqrt{\hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \;\le\; Y_0 \;\le\; \hat{y}_0 + t_{\alpha/2,\,n-2}\sqrt{\hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$

with $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$. The leading "1" term accounts for the inherent variability of the data itself.
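In code, the prediction interval differs from the mean-response interval only by that extra 1 under the square root; a Python sketch on made-up data:

```python
import math

def prediction_interval(x, y, x0, t_crit):
    # PI: y0_hat +/- t * sqrt(sigma2 * (1 + 1/n + (x0 - xbar)^2 / Sxx))
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    Sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
    b0 = ybar - b1 * xbar
    sigma2 = sum((yi - (b0 + b1 * xi)) ** 2
                 for xi, yi in zip(x, y)) / (n - 2)
    y0 = b0 + b1 * x0
    # The leading "1" is the extra term for the variability of a
    # single new observation around the regression line.
    half = t_crit * math.sqrt(sigma2 * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))
    return y0 - half, y0 + half

# Hypothetical data; t_{0.025, 3} = 3.182 for n = 5.
lo, hi = prediction_interval([1, 2, 3, 4, 5], [1.9, 4.1, 5.8, 8.2, 9.9],
                             3.0, 3.182)
print(lo, hi)
```

Run at the same x₀ as the mean-response example, this interval is noticeably wider, exactly because of the extra term.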

11-6 Prediction of New Observations

Definition


NFL Quarterback Ratings – even more of problem 11-39

Find a 95% prediction interval on the rating when the average yards per attempt is 8.0
Confidence/Prediction Intervals

[Regression Plot: Y = -5.55762 + 12.6519X, R-Sq = 78.7%. Rating Pts (50–120) versus Yds per Att (6–9), showing the fitted regression line with 95% CI and 95% PI bands.]

11-7 Adequacy of the Regression Model

Fitting a regression model requires several assumptions:

  • 1. Errors are uncorrelated random variables with mean zero;

  • 2. Errors have constant variance; and

  • 3. Errors are normally distributed.

The analyst should always consider the validity of these assumptions to be doubtful and conduct analyses to examine the adequacy of the model.

Caution - Warning

A good statistical fit does not imply a cause-and-effect relationship.

Extrapolating beyond the data is indeed risky: the confidence bands become larger the further the x value is from the center of the data, i.e. the mean.

$$\hat{\mu}_{Y|x_0} \pm t_{\alpha/2,\,n-2}\,\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]}$$

Cause and Effect?

Year   Auto accident deaths (1,000s)   Sunspots
1970   54.6                            165
1971   53.3                             89
1972   56.3                             55
1973   49.6                             34
1974   47.1                              9
1975   45.9                             30
1976   48.5                             59
1977   50.1                             83
1978   52.4                            109
1979   52.5                            127
1980   53.2                            153
1981   51.4                            112
1982   46.0                             80
1983   44.6                             45

sources: Fundamentals and Frontiers of Astronomy – Jastrow & Thompson; General Statistics of the US (1985)
 

Excel regression output (deaths regressed on sunspots):

Regression Statistics
  Multiple R          0.626258
  R Square            0.392199
  Adjusted R Square   0.341549
  Standard Error      2.904781
  Observations        14

ANOVA
               df    SS         MS         F         Significance F
  Regression    1    65.33623   65.33623   7.743319  0.016567818
  Residual     12   101.2531     8.437755
  Total        13   166.5893

               Coefficients   Standard Error   t Stat     P-value
  Intercept    46.4668526     1.610358049      28.85498   1.87E-12
  Sunspots      0.04779484    0.017175818       2.782682  0.016568
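The coefficient estimates in this output can be reproduced directly by ordinary least squares on the slide's table; a Python check:

```python
# Reproducing the Excel output by least squares: auto accident
# deaths (in 1,000s) regressed on sunspots, data from the slide.
sunspots = [165, 89, 55, 34, 9, 30, 59, 83, 109, 127, 153, 112, 80, 45]
deaths = [54.6, 53.3, 56.3, 49.6, 47.1, 45.9, 48.5, 50.1,
          52.4, 52.5, 53.2, 51.4, 46.0, 44.6]

n = len(sunspots)
xbar = sum(sunspots) / n
ybar = sum(deaths) / n
Sxx = sum((x - xbar) ** 2 for x in sunspots)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(sunspots, deaths))
Syy = sum((y - ybar) ** 2 for y in deaths)

b1 = Sxy / Sxx                # matches the "Sunspots" coefficient
b0 = ybar - b1 * xbar         # matches the "Intercept" coefficient
r2 = Sxy ** 2 / (Sxx * Syy)   # matches "R Square"

print(b0, b1, r2)
```

The fit is statistically significant (p ≈ 0.017), yet sunspots obviously do not cause traffic deaths — which is exactly the point of the slide.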

11-7.1 Residual Analysis

The residuals from a regression model are eᵢ = yᵢ − ŷᵢ, where yᵢ is an actual observation and ŷᵢ is the corresponding fitted value from the regression model.

• Analysis of the residuals is frequently helpful in checking the assumption that the errors are approximately normally distributed with constant variance, and in determining whether additional terms in the model would be useful.

Model Adequacy

Was NID(0, σ²) true? – Plots of residuals

  • - heteroscedasticity

  • - residuals are the difference between observed and predicted

  • - you can standardize the residuals: $d_i = e_i / \sqrt{\hat{\sigma}^2}$
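Standardizing the residuals is a one-liner once σ̂² is in hand; a Python sketch on hypothetical data (not the text's example):

```python
import math

# Standardized residuals d_i = e_i / sqrt(sigma2_hat), on made-up data.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.3, 11.7]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # raw residuals
sigma2 = sum(ei ** 2 for ei in e) / (n - 2)
d = [ei / math.sqrt(sigma2) for ei in e]           # standardized

# If the NID(0, sigma^2) assumption holds, roughly 95% of the d_i
# should fall in (-2, 2); values far outside flag potential outliers.
print(d)
```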

Figure 11-9

Minitab Options


A Plot

[Normal Probability Plot of the Residuals (response is Rating P): normal scores from −2 to 2 plotted against the residuals.]

Another Plot

[Residuals Versus the Fitted Values (response is Rating P): standardized residuals from −2 to 2 plotted against the fitted values.]

11-7.2 Coefficient of Determination (R²)

• The quantity $R^2 = SS_R / SS_T$ is called the coefficient of determination and is often used to judge the adequacy of a regression model.

• 0 ≤ R² ≤ 1.

• We often refer (loosely) to R² as the amount of variability in the data explained or accounted for by the regression model.

• R is often referred to as the index of fit, where −1 ≤ R ≤ 1.

• R and R² only measure the improvement in fit when using the model $y = \beta_0 + \beta_1 x$ over the model $y = \beta_0$.
R² – Coefficient of Determination

$$R^2 = \frac{SS_R}{SS_T} = \frac{3378.5}{4292.2} = 0.7871$$

Also, as noted in the text, a very weak relation that is strictly observed will produce a high R². Adjusted R² is a somewhat better metric:

$$R^2_{adj} = 1 - \frac{SS_E/(n-p)}{SS_T/(n-1)} = 1 - \frac{913.7/28}{4292.2/29} = 0.7795$$
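The adjusted-R² arithmetic above is easy to verify:

```python
# Checking the slide's numbers: SS_R = 3378.5, SS_T = 4292.2,
# so SS_E = SS_T - SS_R = 913.7, with n = 30 observations and
# p = 2 estimated parameters (intercept and slope).
ss_r = 3378.5
ss_t = 4292.2
ss_e = ss_t - ss_r
n, p = 30, 2

r2 = ss_r / ss_t
r2_adj = 1 - (ss_e / (n - p)) / (ss_t / (n - 1))

print(round(r2, 4), round(r2_adj, 4))
```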

11-8 Correlation

We may also write the sample correlation coefficient in terms of the regression sums:

$$R = \frac{S_{xy}}{\sqrt{S_{xx}\,SS_T}}$$

It is often useful to test the hypotheses

$$H_0\!: \rho = 0 \quad\text{versus}\quad H_1\!: \rho \neq 0$$

The appropriate test statistic for these hypotheses is

$$T_0 = \frac{R\sqrt{n-2}}{\sqrt{1-R^2}}$$

Reject H₀ if |t₀| > t_{α/2,n−2}.

More NFL – problem 11-73

  • (a) estimate the correlation coefficient between quarterback ratings and average yards per attempt

  • (b) test the hypothesis H₀: ρ = 0 (two-tailed); α = .05

$$R^2 = \frac{SS_R}{SS_T} = \frac{3378.5}{4292.2} = 0.7871$$
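Both parts can be sketched from the slide's sums of squares, taking n = 30 (consistent with the degrees of freedom 28 and 29 used in the adjusted-R² slide); the t critical value is from a t table:

```python
import math

# (a) R from R^2 = SS_R / SS_T; (b) t0 = R*sqrt(n-2)/sqrt(1-R^2).
r2 = 3378.5 / 4292.2
n = 30

r = math.sqrt(r2)      # take the positive root: the fitted slope is positive
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r2)

t_crit = 2.048         # t_{0.025, 28} from a t table
print(round(r, 3), round(t0, 2), abs(t0) > t_crit)
```

Since |t₀| far exceeds the critical value, H₀: ρ = 0 is rejected at α = .05.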

11-9 Transformation and Logistic Regression

Intrinsically Linear Models

$$Y = \beta_0 + \beta_1 x^2 + \epsilon \qquad \text{let } z = x^2\!: \quad Y = \beta_0 + \beta_1 z + \epsilon$$

$$Y = \beta_0 + \beta_1 (1/x) + \epsilon \qquad \text{let } z = 1/x\!: \quad Y = \beta_0 + \beta_1 z + \epsilon$$

$$Y = \frac{1}{\beta_0 + \beta_1 x + \epsilon} \qquad \text{let } Y^* = 1/Y\!: \quad Y^* = \beta_0 + \beta_1 x + \epsilon$$

$$Y = e^{\beta_0 + \beta_1 x + \epsilon} \qquad \text{let } Y^* = \ln Y\!: \quad Y^* = \beta_0 + \beta_1 x + \epsilon$$

More Intrinsically Linear Models

The Power Function:

$$Y = \beta_0 x^{\beta_1}\epsilon \quad\Rightarrow\quad \ln Y = \ln\beta_0 + \beta_1\ln x + \ln\epsilon$$

The Exponential Function:

$$Y = \beta_0 e^{\beta_1 x}\epsilon \quad\Rightarrow\quad \ln Y = \ln\beta_0 + \beta_1 x + \ln\epsilon$$

The Log Function:

$$Y = \beta_0 + \beta_1\ln x + \epsilon \qquad \text{let } z = \ln x\!: \quad Y = \beta_0 + \beta_1 z + \epsilon$$
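Each of these reduces to ordinary least squares on transformed variables. A Python sketch for the power function, on noise-free made-up data so the recovered parameters are exact:

```python
import math

# Linearizing the power function Y = b0 * x^b1 (error term omitted
# in this noise-free sketch): regress ln Y on ln x, then back-transform.
b0_true, b1_true = 2.0, 1.5       # hypothetical "true" parameters
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [b0_true * xi ** b1_true for xi in xs]

u = [math.log(xi) for xi in xs]   # z = ln x
v = [math.log(yi) for yi in ys]   # ln Y

n = len(u)
ubar = sum(u) / n
vbar = sum(v) / n
Suu = sum((ui - ubar) ** 2 for ui in u)
slope = sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v)) / Suu
intercept = vbar - slope * ubar

# slope estimates beta_1; exp(intercept) estimates beta_0
print(slope, math.exp(intercept))
```

With real (noisy) data the multiplicative error ε becomes additive ln ε on the log scale, which is what makes the least-squares fit appropriate there.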

Next Week: Multiple Linear Regression

More Variables, More Coefficients, More Regression…