Chapter 19

Building Multiple Regression Models

19.1 Indicator (or Dummy) Variables


What makes a good roller coaster ride?
Speed
Height
Duration
Inversions
Other

Do you expect the duration of the ride to be related to the length of the ride?


19.1 Indicator (or Dummy) Variables


From a sample of coasters
worldwide:
Duration on Length looks
strong and positive.


19.1 Indicator (or Dummy) Variables


From a sample of coasters worldwide:
The regression looks strong and the conditions seem to be met.


19.1 Indicator (or Dummy) Variables


The duration of a ride increases by about 23 seconds for
each additional 1000 ft of track.


19.1 Indicator (or Dummy) Variables


Many rides have inversions: loops, corkscrews, etc.
These features impose limitations on the speed of the coaster.
How do we introduce the categorical variable Inversions?


19.1 Indicator (or Dummy) Variables


Let's analyze each group (rides with inversions and rides without inversions) separately:
The slopes are very similar.
The intercepts are different.


19.1 Indicator (or Dummy) Variables


When the data can be divided into two
groups with similar regression slopes,
we can incorporate the group
information into a single regression
model.
We accomplish this by introducing an
indicator (or dummy) variable that
indicates whether a coaster has an
inversion:
Inversions = 1 if Yes
Inversions = 0 if No
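A minimal sketch of this coding in Python, using synthetic data simulated to mimic the slide's fitted model (the real coaster dataset isn't shown here):

```python
# Sketch with synthetic data: code the indicator and fit the multiple regression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 60
length = rng.uniform(1000, 6000, n)        # track length in feet (assumed range)
inversions = rng.integers(0, 2, n)         # 1 = has inversions, 0 = none
# simulate durations roughly consistent with the slide's fitted coefficients
duration = 22.39 + 0.028 * length + 30.08 * inversions + rng.normal(0, 15, n)

df = pd.DataFrame({"Duration": duration, "Length": length,
                   "Inversions": inversions})
fit = smf.ols("Duration ~ Length + Inversions", data=df).fit()
print(fit.params)    # intercept, Length slope, and the Inversions shift
```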

19.1 Indicator (or Dummy) Variables


Re-analyze with a multiple regression that includes both
Length and Inversions:

$\widehat{Duration} = 22.39 + 0.028\,Length + 30.08\,Inversions$


The R² is larger for the multiple regression (70.4%) than for the simple regression (62.0%).
The t-ratios of both coefficients are large. (The residuals look
reasonable as well.)

19.1 Indicator (or Dummy) Variables


Notice how the indicator variable works in the model.
$\widehat{Duration} = 22.39 + 0.028\,Length + 30.08\,Inversions$
Inversions turns on (1) and turns off (0) an additional 30.08
seconds of duration, depending on whether the ride has an
inversion.
Turning this factor on shifts the
intercept upward by about 30
seconds while leaving the slope
unaffected. This is consistent with
the simple regression analyses of
the separate groups (at right).


19.1 Indicator (or Dummy) Variables


Indicators for Three or More Categories
What if a categorical variable has more than two levels?
Construct indicators by creating a separate indicator for each of these levels. If a variable has k levels, then create k - 1 indicators. Choose one category as a baseline and leave out its indicator.
Regression coefficients are interpreted as the amount by which
their categories differ from the baseline, after allowing for the
linear effects of the other variables in the model.


19.1 Indicator (or Dummy) Variables


Indicators for Three or More Levels
For example, the variable Month has 12 levels.
Temptation: Create a single indicator variable thus:
Month = 1 for January, Month = 2 for February, . . . Month =
12 for December

This is not recommended!



19.1 Indicator (or Dummy) Variables


Indicators for Three or More Levels
Rather, introduce 11 indicator variables, one that turns on and
off the month of February, another that turns on and off the
month of March, etc.
If no month is turned on, then the model defaults to the
baseline month (January), so there is no need to include a
separate indicator variable for this month.
Clearly, the model will give results when more than one month
is turned on, but these results should not be interpreted.
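A sketch of the k - 1 coding in pandas, on hypothetical monthly data. Note that get_dummies with drop_first=True drops the first level alphabetically ("Dec" here), and that dropped level becomes the baseline:

```python
# Sketch: pandas builds k - 1 month indicators when drop_first=True.
import pandas as pd

df = pd.DataFrame({"Month": ["Jan", "Feb", "Mar", "Jan", "Dec"],
                   "Sales": [10.0, 12.0, 9.0, 11.0, 14.0]})
# the dropped (first alphabetically) level becomes the baseline category
dummies = pd.get_dummies(df, columns=["Month"], drop_first=True)
print(dummies)
```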


19.1 Indicator (or Dummy) Variables


Example: Traffic Delays
The Texas Transportation Institute (tti.tamu.edu) studies traffic
delays. Data the institute published in 2001 include information
on the Total Delay per Person (hours per year spent delayed by
traffic), the Average Arterial Road Speed (mph), the Average
Highway Road Speed (mph), and the Size of the city (small,
medium, large, very large). The regression model is:
Dependent variable is: Delay/person
R-squared = 79.1%   R-squared (adjusted) = 77.4%
s = 6.474 with 68 - 6 = 62 degrees of freedom

Variable        Coeff       SE(Coeff)   t-ratio   P-value
Intercept       139.104     16.69         8.33    0.0001
HiWay MPH        -1.07347    0.2474      -4.34    0.0001
Arterial MPH     -2.04836    0.6672      -3.07    0.0032
Small            -3.58970    2.953       -1.22    0.2287
Large             5.00967    2.104        2.38    0.0203
Very Large        3.41058    3.230        1.06    0.2951

19.1 Indicator (or Dummy) Variables


Example: Traffic Delays
The regression model is:
Dependent variable is: Delay/person
R-squared = 79.1%   R-squared (adjusted) = 77.4%
s = 6.474 with 68 - 6 = 62 degrees of freedom

Variable        Coeff       SE(Coeff)   t-ratio   P-value
Intercept       139.104     16.69         8.33    0.0001
HiWay MPH        -1.07347    0.2474      -4.34    0.0001
Arterial MPH     -2.04836    0.6672      -3.07    0.0032
Small            -3.58970    2.953       -1.22    0.2287
Large             5.00967    2.104        2.38    0.0203
Very Large        3.41058    3.230        1.06    0.2951

Why is there no coefficient for Medium?
Interpret the coefficient for Small.


19.1 Indicator (or Dummy) Variables


Example: Traffic Delays
The regression model is:
Dependent variable is: Delay/person
R-squared = 79.1%   R-squared (adjusted) = 77.4%
s = 6.474 with 68 - 6 = 62 degrees of freedom

Variable        Coeff       SE(Coeff)   t-ratio   P-value
Intercept       139.104     16.69         8.33    0.0001
HiWay MPH        -1.07347    0.2474      -4.34    0.0001
Arterial MPH     -2.04836    0.6672      -3.07    0.0032
Small            -3.58970    2.953       -1.22    0.2287
Large             5.00967    2.104        2.38    0.0203
Very Large        3.41058    3.230        1.06    0.2951

Why is there no coefficient for Medium? Medium was selected as the baseline.
Interpret the coefficient for Small. It says that traffic delays are about 3.6 hours per person per year lower in small cities than in medium cities, after allowing for the effects of highway and arterial speeds.


19.2 Adjusting for Different Slopes


Interaction Terms
Indicator variables can account for differences in the intercepts of
different groups.
But, what if the slopes of groups differ?
Example: Carbohydrates vs. Calories for selected Burger
King products

[Scatterplots of Calories vs. Carbs for meat-based dishes and for non-meat dishes.]
The slopes of the two groups are different.

19.2 Adjusting for Different Slopes


Interaction Terms
Start as before by introducing an indicator variable, Meat.
Meat = 1 if meat is present in the dish; Meat = 0 if it isn't
Adding Meat to the model can adjust the intercept.
To adjust the slope, add the interaction term Carbs*Meat.
Note that Carbs*Meat equals just Carbs when Meat = 1 and
equals 0 (disappears) when Meat = 0.
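A sketch in Python, again on synthetic data simulated from the slide's fitted coefficients. In a statsmodels formula, Carbs:Meat denotes just the product term (Carbs*Meat would expand to all three terms):

```python
# Sketch: fit the model with an interaction term (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 40
carbs = rng.uniform(10, 80, n)
meat = rng.integers(0, 2, n)
calories = (137.40 + 3.93 * carbs - 26.16 * meat
            + 7.88 * carbs * meat + rng.normal(0, 25, n))
df = pd.DataFrame({"Calories": calories, "Carbs": carbs, "Meat": meat})

# Carbs:Meat is the interaction (product) term
fit = smf.ols("Calories ~ Carbs + Meat + Carbs:Meat", data=df).fit()
print(fit.params)
```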


19.2 Adjusting for Different Slopes


Interaction Terms
Re-analyze with the interaction term:

$\widehat{Calories} = 137.40 + 3.93\,Carbs - 26.16\,Meat + 7.88\,Carbs{\times}Meat$


The predictor Meat is not significant whereas the interaction
Carbs*Meat is significant.

19.2 Adjusting for Different Slopes


Interaction Terms
For Meat = 0, the intercept is 137.40 and the slope is 3.93:
$\widehat{Calories} = 137.40 + 3.93\,Carbs - 26.16(0) + 7.88\,Carbs(0) = 137.40 + 3.93\,Carbs$
For Meat = 1, the intercept drops to 111.24 and the slope increases to 11.81:
$\widehat{Calories} = 137.40 + 3.93\,Carbs - 26.16(1) + 7.88\,Carbs(1) = (137.40 - 26.16) + (3.93 + 7.88)\,Carbs = 111.24 + 11.81\,Carbs$
(The drop in intercept is not significant.)

19.2 Adjusting for Different Slopes


Interaction Terms
Introducing an interaction term produces a result consistent with
simple regressions of the two groups:

Simple regressions of
the meat group and the
non-meat group.


19.3 Multiple Regression Diagnostics


In multiple regression, a case can be extraordinary by having:
1) an unusual y-value compared to the model (check for
outliers in the residuals plot)
2) an outstanding x-value for one of the predictors (check the
scatterplot for each predictor)
3) an outstanding combination of x-values (not easily
assessed with our current tools!)


19.3 Multiple Regression Diagnostics


Leverage
In simple regression, leverage is easy to see because a high-leverage point lies far from the mean of the x-values in a scatterplot.
In a multiple regression with k predictors, a point might not be far
from any of the x means and yet still exert large leverage
because it has an unusual combination of predictor values. To
determine leverage, calculate leverage values for each case.


19.3 Multiple Regression Diagnostics


Leverage
Using Leverage to Detect Extraordinary Cases:
Keeping everything else the same, increase (or decrease) the y-value of a case by 1.0.
Find a new regression and calculate the new predicted value of the case.
Define leverage as the amount by which the predicted value changes. The leverage $h_i$ of a case is always between 0 and 1.
Make a histogram of the distribution of leverages. (There is no
statistical test for whether a leverage is too large).
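A sketch of this definition in Python on synthetic data: bump one case's y by 1, refit, and the change in that case's prediction equals the hat-matrix diagonal h_i that software reports.

```python
# Sketch: leverage as "change in prediction per unit change in y".
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(30, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=30)
fit = sm.OLS(y, X).fit()

i = 0                           # the case to perturb
y2 = y.copy()
y2[i] += 1.0                    # increase its y-value by 1.0
fit2 = sm.OLS(y2, X).fit()

change = fit2.fittedvalues[i] - fit.fittedvalues[i]
h_i = fit.get_influence().hat_matrix_diag[i]
print(change, h_i)              # the two numbers agree
```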

19.3 Multiple Regression Diagnostics


Recall Speed of Coasters Predicted by Height and Drop
Both Height and Drop are significant and R² is high…

…and the scatterplot of residuals shows nothing unusual.


19.3 Multiple Regression Diagnostics


Recall Speed of Coasters Predicted by Height and Drop
…but the distribution of leverages is interesting.
England's Oblivion

Oblivion has an unusual combination of Height and Drop:
Height = only 65 ft
Drop = 180 ft (it drops into a hole in the ground!)

19.3 Multiple Regression Diagnostics


Oblivion's leverage should make us think about conducting further regression analyses: one without Oblivion and one without Height as a predictor.
The more complex the regression model, the more important it
is to look for high-leverage cases and their effects.


19.3 Multiple Regression Diagnostics


Residuals and Standardized Residuals
Consider a sample with a high-leverage
data point.
The high-leverage point strongly attracts
the regression line.
So, the residual for this point may not
accurately reflect the true underlying
variance.
We correct for this effect by working with
Standardized or Studentized residuals.


19.3 Multiple Regression Diagnostics


Residuals and Standardized Residuals
A residual is standardized when it is divided by an
estimate of its standard deviation.
Technology may permit you to choose between externally Studentized residuals and internally Studentized residuals.
It is the externally Studentized version that follows a t-distribution.
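For instance, statsmodels exposes both versions; a sketch on synthetic data:

```python
# Sketch: internally vs. externally Studentized residuals in statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(30, 2)))
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=30)

infl = sm.OLS(y, X).fit().get_influence()
internal = infl.resid_studentized_internal   # scaled by the full-sample s_e
external = infl.resid_studentized_external   # case i left out; follows a t-distribution
print(internal[:3])
print(external[:3])
```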


19.3 Multiple Regression Diagnostics


Influence Measures
A case with both high leverage and large Studentized residual is
likely to change the regression model substantially all by itself.
Such a case is said to be influential and cries out for special
attention.
Use Cook's Distance to evaluate influence:
$$D_i = \frac{e_i^2}{k\,s_e^2} \cdot \frac{h_i}{(1 - h_i)^2}$$

If $D_i$ is unusually large (examine a histogram of the distances), the case should be checked as a possible influential point.
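A sketch comparing that formula with the library value on synthetic data (note that statsmodels counts the intercept among the k estimated coefficients):

```python
# Sketch: Cook's D from statsmodels vs. the formula computed by hand.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(40, 2)))
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=40)

fit = sm.OLS(y, X).fit()
infl = fit.get_influence()
d_lib, _ = infl.cooks_distance        # library version (distances, p-values)

e = fit.resid
h = infl.hat_matrix_diag
k = X.shape[1]                        # number of estimated coefficients
s2 = fit.mse_resid                    # s_e squared
d_hand = e**2 / (k * s2) * h / (1 - h)**2
print(np.allclose(d_lib, d_hand))     # True
```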

19.3 Multiple Regression Diagnostics


Influence Measures
Cook's Distances
Coaster Speed Example

Four coasters have high Cook's D.



19.3 Multiple Regression Diagnostics


Influence Measures

We already understand Oblivion's outstanding status (most of the Drop is underground and doesn't contribute to the Height).
Further investigation reveals that the other three are
mechanically accelerated (not simply gravity-driven).


19.3 Multiple Regression Diagnostics


Influence Measures
Re-analysis without the three blast coasters has a striking
effect.

Height is no longer important to the model.


19.3 Multiple Regression Diagnostics


Influence Measures
Re-analyzing yet again with a simple linear regression, Speed vs. Drop, produces a strong model. (The 3 blast coasters are excluded.)


19.3 Multiple Regression Diagnostics


Indicators for Influence
A good strategy for an extraordinary case: construct an indicator variable that is zero for all cases except the one we wish to isolate.
This is equivalent to omitting the case, but with two advantages:
It makes clear to others which case is extraordinary.
The t-statistic for the indicator variable's coefficient tests whether the case is influential.
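A sketch of the strategy on synthetic data (one wild case is planted at index 7, a hypothetical choice):

```python
# Sketch: give one suspect case its own indicator and read off its t-test.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 30
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
y[7] += 10.0                              # plant one extraordinary case

df = pd.DataFrame({"x": x, "y": y})
df["case7"] = 0.0
df.loc[7, "case7"] = 1.0                  # 1 for the suspect case, 0 elsewhere

fit = smf.ols("y ~ x + case7", data=df).fit()
# a significant coefficient says the case doesn't fit with the others
print(fit.tvalues["case7"], fit.pvalues["case7"])
```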


19.3 Multiple Regression Diagnostics


Indicators of Influence
Coaster Speed Example

The P-values confirm that the blast coasters don't fit with the other ones.
Note: The coefficient for Drop is the same as in the model without the three blast coasters.

19.3 Multiple Regression Diagnostics


Diagnosis Wrapup
When both high leverage and large Studentized residual cases
are present, it is irresponsible to report only the regression on
all the data.
You should compute and discuss regressions with such cases
removed. Discuss extraordinary cases individually if they offer
insight.


19.3 Multiple Regression Diagnostics


Example: (continued) Traffic Delays
Here's a scatterplot of the residuals from the regression on Traffic Delays plotted against mean Highway Speed (mph).
The point plotted with an X is Los Angeles. Explain.
Is Los Angeles likely to be an influential point in this regression?

19.3 Multiple Regression Diagnostics


Example: (continued) Traffic Delays
Here's a scatterplot of the residuals from the regression on Traffic Delays plotted against mean Highway Speed (mph).
The point plotted with an X is Los Angeles. Explain.
LA has the lowest average highway speed (about 35 mph) of all the cities in this study. The positive residual says that drivers lose about 7.5 hours per person per year more than the model predicts.

19.3 Multiple Regression Diagnostics


Example: (continued) Traffic Delays
Here's a scatterplot of the residuals from the regression on Traffic Delays plotted against mean Highway Speed (mph).
Is Los Angeles likely to be an influential point in this regression?
It may have elevated leverage, but the residual isn't particularly large. It's probably not influential.


19.4 Building Regression Models


What constitutes the best regression model depends on the situation. Best models should have:
Relatively few predictors, each reliably measured and relatively unrelated
A relatively high R²
A relatively small value of $s_e$
Relatively small P-values for the F- and t-statistics
No cases with extraordinarily high leverage
No cases with extraordinarily large residuals, and Studentized residuals that appear to be nearly Normal

19.4 Building Regression Models


Best Subsets and Stepwise Regression
There are several automated systems for searching out best
models once a single criterion is selected for optimization,
including:
Best Subsets Regression
Stepwise Regression
These methods do not check the assumptions and conditions
and are vulnerable to other complications.
Even though the process is automated, human oversight and analysis are still necessary to guard against misleading results.


19.4 Building Regression Models


Best Subsets Regression
Choose a single criterion of best. For example, consider the best model to be the one with the highest R².
Choose a modest set of potential predictors.
A computer searches all possible models and reports the best
two-predictor model, the best three-predictor model, the best
four-predictor model, etc.
The technique becomes increasingly impracticable as the size
of the data set increases.
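A brute-force sketch of the idea on synthetic data, with adjusted R² as the single criterion:

```python
# Sketch: exhaustive best-subsets search, best model at each size.
from itertools import combinations
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 1 + 2 * df["x1"] - df["x3"] + rng.normal(size=100)

predictors = ["x1", "x2", "x3", "x4"]
for size in range(1, len(predictors) + 1):
    fits = [(smf.ols("y ~ " + " + ".join(combo), data=df).fit(), combo)
            for combo in combinations(predictors, size)]
    best_fit, best_combo = max(fits, key=lambda fc: fc[0].rsquared_adj)
    print(size, best_combo, round(best_fit.rsquared_adj, 3))
```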


19.4 Building Regression Models


Stepwise Regression
The regression builds the model stepwise from a given initial
model. At each step, a predictor is either added to or removed
from the model.
The predictor chosen to add is the one whose addition increases the adjusted R² the most.
The predictor chosen to remove is the one whose removal reduces the adjusted R² the least.
Hopefully, the computer settles on a good model.
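A sketch of forward stepwise selection under the adjusted-R² criterion, on synthetic data (real implementations also support removal steps):

```python
# Sketch: forward stepwise selection, stopping when adjusted R^2 stops improving.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 1 + 2 * df["x1"] - df["x3"] + rng.normal(size=100)

remaining = ["x1", "x2", "x3", "x4"]
chosen = []
best_adj = -np.inf
while remaining:
    # try adding each remaining predictor; keep the one that helps most
    trials = [(smf.ols("y ~ " + " + ".join(chosen + [p]), data=df).fit()
               .rsquared_adj, p) for p in remaining]
    adj, p = max(trials)
    if adj <= best_adj:        # no addition improves the model: stop
        break
    best_adj = adj
    chosen.append(p)
    remaining.remove(p)
print(chosen, round(best_adj, 3))
```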


19.4 Building Regression Models


Challenges in Building Regression Models
For large data sets, checking the data for accuracy, missing
values, consistency, and reasonableness can be a major part of
the effort.
The probability that a Type I error will occur increases as the number of models considered increases.


19.5 Collinearity
Predictor variables exhibit collinearity when one of the
predictors can be predicted well from the others.
Consequences of Collinearity:
Coefficients in a multiple regression model can be surprising,
taking on an unanticipated sign or being unexpectedly large or
small.
The stronger the correlation between predictors, the more the variance of their coefficients increases when both are included in the model (variance inflation). This can lead to a smaller t-statistic and correspondingly large P-value.


19.5 Collinearity
Recall Housing Prices based on Living Area and Bedrooms
(Simple Regression Models):

Simple regression predicts $113.12 increase in


price for each additional square foot of space.

Simple regression predicts $48,218 increase in


price for each additional bedroom.


19.5 Collinearity
Recall Housing Prices based on Living Area and Bedrooms
(Multiple Regression Model):

The coefficient on Bedrooms seems counterintuitive.


19.5 Collinearity
Recall Housing Prices based on Living Area and Rooms
(Simple Regression Models):

Simple regression predicts $113.12 increase in


price for each additional square foot of space.

Simple regression predicts $22,572.90 increase


in price for each additional room.

19.5 Collinearity
Recall Housing Prices based on Living Area and Rooms (Multiple Regression Model):

The Standard Errors have undergone a significant percentage change from the simple regression models to the multiple regression model.


19.5 Collinearity
The statistic that measures the degree of collinearity of the jth predictor with the others is called the Variance Inflation Factor (VIF). It is found as
$$VIF_j = \frac{1}{1 - R_j^2}$$
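A sketch computing VIF both from the formula and with statsmodels, on synthetic, deliberately collinear predictors:

```python
# Sketch: VIF by the formula 1 / (1 - R_j^2) and via statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(9)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)   # x2 is nearly a copy of x1
df = pd.DataFrame({"x1": x1, "x2": x2})

# by the formula: regress x1 on the other predictor(s), then 1 / (1 - R^2)
r2 = smf.ols("x1 ~ x2", data=df).fit().rsquared
print(1.0 / (1.0 - r2))

# via statsmodels (the design matrix needs an explicit constant column)
X = sm.add_constant(df)
print(variance_inflation_factor(X.values, 1))   # column 1 is x1
```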


19.5 Collinearity
Facts about Collinearity
The collinearity of any predictor with the others in the model
can be measured with its Variance Inflation Factor (VIF).
High collinearity leads to the coefficient being poorly estimated and having a large standard error (and correspondingly low t-statistic). The coefficient may seem to be the wrong size or even have the wrong sign.


19.5 Collinearity
Facts about Collinearity

If a multiple regression model has a high R² and a large F, but the individual t-statistics are not significant, you should suspect collinearity.
Collinearity is measured in terms of the $R_j^2$ between a predictor and all of the other predictors in the model. It is not measured in terms of the correlation between any two predictors.


19.5 Collinearity
Dealing with Collinearity:
Simplify the model and improve the t-statistic by removing some
of the predictors. Which should you keep?
Variables that are most reliably measured
Variables that are least expensive to find
Variables that are inherently important to the problem
New variables formed by combining variables


19.6 Quadratic Terms


2002 Winter Olympic Games Downhill Time vs Start Order
(linear model):
Do downhill ski runs get slower as the day wears on?

Time is expected to rise 0.109 seconds for each subsequent position in the StartOrder, but…


19.6 Quadratic Terms

…the residuals exhibit a bend.


19.6 Quadratic Terms


Downhill Time vs Start Order (quadratic model)

Plotting the original data, a quadratic model of the form
$$\hat{y} = b_0 + b_1\,StartOrder + b_2\,StartOrder^2$$
seems more appropriate. Caution: use quadratic models with care, especially if attempting to extrapolate beyond the range of x-values.
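A sketch of fitting such a quadratic with a formula, on synthetic run times (I() keeps the squaring inside the formula rather than being parsed as an interaction):

```python
# Sketch: quadratic term via I(...) in a statsmodels formula (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
start = np.arange(1, 61, dtype=float)
time = 100 + 0.4 * start - 0.004 * start**2 + rng.normal(0, 1.0, 60)
df = pd.DataFrame({"Time": time, "StartOrder": start})

quad = smf.ols("Time ~ StartOrder + I(StartOrder**2)", data=df).fit()
print(quad.params)
```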

19.6 Quadratic Terms


Downhill Time vs Start Order (quadratic model)

Fitting the quadratic model gives a better R-squared value and a reasonable residual plot, however…


19.6 Quadratic Terms

…the coefficient on StartOrder changed from significant and positive in the linear model to significant and negative in the quadratic model.
Possible
collinearity?


19.6 Quadratic Terms


To account for the collinearity between StartOrder and StartOrder², fit a quadratic model with
$$(StartOrder - \overline{StartOrder})^2$$
instead of StartOrder². The form with the mean subtracted has zero correlation with the linear term.
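A quick numerical check of that claim, using hypothetical start orders 1 to 60 (which are symmetric about their mean):

```python
# Sketch: centering before squaring removes the linear/quadratic correlation.
import numpy as np

x = np.arange(1.0, 61.0)                  # start orders 1..60
print(np.corrcoef(x, x**2)[0, 1])                 # near 1: strong collinearity
print(np.corrcoef(x, (x - x.mean())**2)[0, 1])    # 0 for a symmetric x
```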


19.6 Quadratic Terms


Regression Roles
Reasons for building regression models
to try to understand relationships among variables.
want simple models with unrelated predictors
want large t-statistics
to predict response variable values when given predictor
variable values.
want large R-squared values
collinearity is less of a concern


19.6 Quadratic Terms


Example: Lobster Industry 2008 Water Temps
The Maine lobster fishing season is seasonal, but has been shifting later in the fall. Scientists suggest the biggest factor in predicting the peak time is water temperature. A scatterplot reveals a curved pattern.

Would re-expressing
either variable help?


19.6 Quadratic Terms


Example: Lobster Industry 2008 Water Temps
The Maine lobster fishing season is seasonal, but has been shifting later in the fall. Scientists suggest the biggest factor in predicting the peak time is water temperature. A scatterplot reveals a curved pattern. Would re-expressing either variable help?
The Linearity Condition is violated. But the curve is not monotonic (consistently rising or falling), so no re-expression can help. A regression with a quadratic term might produce a useful model.

19.6 Quadratic Terms


Example: (continued) Lobster Industry 2008
Given below is output with a quadratic term. Recall from the
scatterplot that temperatures have not been dropping at 19
degrees per year.
Dependent variable is: Water Temperature
57 total cases of which 1 is missing
R-squared   R-squared (adjusted)
s = 0.9852 with 56 - 3 = 53 degrees of freedom

Variable    Coeff          SE(Coeff)   t-ratio   P-value
Intercept   19208.2        2204          8.71    0.0001
Year          -19.3947        2.229      -8.70    0.0001
Year^2          4.90774e-3    0.0006      8.71    0.0001

Explain why the coefficient of Year is now strongly negative.


19.6 Quadratic Terms


Example: (continued) Lobster Industry 2008
Given below is output with a quadratic term. Recall from the
scatterplot that temperatures have not been dropping at 19
degrees per year.
Dependent variable is: Water Temperature
57 total cases of which 1 is missing
R-squared   R-squared (adjusted)
s = 0.9852 with 56 - 3 = 53 degrees of freedom

Variable    Coeff          SE(Coeff)   t-ratio   P-value
Intercept   19208.2        2204          8.71    0.0001
Year          -19.3947        2.229      -8.70    0.0001
Year^2          4.90774e-3    0.0006      8.71    0.0001

Explain why the coefficient of Year is now strongly negative.
Year and Year² are highly correlated. The collinearity accounts for the change in the coefficient of Year.

What Can Go Wrong?

Be alert for collinearity when you set aside an influential point. Removing a high-influence point may surprise you with unexpected collinearity.
Beware missing data. Missing data may make it difficult to compare regression models with different predictors.
Don't forget linearity. If the Linearity Assumption is violated, everything else about a regression model may be invalid.
Check for parallel regression lines. If an indicator variable is introduced, an interaction term may need to be added if the slopes differ significantly between groups.

What Have We Learned?


Use indicator (dummy) variables intelligently.

An indicator variable that is 1 for a group and 0 for others is appropriate when the slope for that group is the same as for the others, but the intercept may be different.
If the slope for the indicated group is different, then it may be appropriate to include an interaction term in the regression model.
When there are three or more categories, use a separate indicator variable for each, but leave one out to avoid collinearity.


What Have We Learned?


Diagnose multiple regressions to expose any undue influence of
individual cases.

Leverage measures how far a case is from the mean of all cases when measured on the x-variables.
The leverage of a case tells how much the predicted value of that case would change if the y-value of that case were changed by adding 1 and nothing else in the regression changed.
Studentized residuals are residuals divided by their individual standard errors. Externally Studentized residuals follow a t-distribution when the regression assumptions are satisfied.

What Have We Learned?


Diagnose multiple regressions to expose any undue influence of
individual cases. (continued)

A case is influential if it has both sufficient leverage and a large enough residual. Removing an influential case from the data will change the model in ways that matter to your interpretation or intended use. Measures such as Cook's D and DFFITS combine leverage and Studentized residuals into a single measure of influence.
By assigning an indicator variable to a single influential case, we can remove its influence from the model and test (using the P-value of its coefficient) whether it is in fact influential.

What Have We Learned?


Build multiple regression models when many predictors are
available.

Seek models with few predictors, a relatively high R², a relatively small residual standard deviation, relatively small P-values for the coefficients, no cases that are unduly influential, and predictors that are reliably measured and relatively unrelated to each other.
Automated methods for seeking regression models include Best Subsets and Stepwise regression. Neither one should be used without carefully diagnosing the resulting model.


What Have We Learned?


Recognize collinearity and deal with it to improve your
regression model. Collinearity occurs when one predictor can
be well predicted from the others.
The R² of the regression of one predictor on the others is a suitable measure. Alternatively, the Variance Inflation Factor, which is based on this R², is often reported by statistics programs.
Collinearity can have the effect of making the P-values for the coefficients large (not significant) even though the overall regression fits the data well.
Removing some predictors from the model or making combinations of those that are collinear can reduce this problem.

What Have We Learned?


Consider fitting quadratic terms in your regression model when the residuals show a bend. Re-expressing y is another option, unless the bent relationship between y and the x's is not monotonic.
Recognize why a particular regression model is fit.
We may want to understand some of the individual coefficients.
We may simply be interested in prediction and not be concerned about the coefficients themselves.

