Chapter 19

Building Multiple Regression Models

19.1 Indicator (or Dummy) Variables


What makes a good roller coaster ride?
Speed
Height
Duration
Inversions
Other

Do you expect the duration of the ride to be related to the length of the ride?


19.1 Indicator (or Dummy) Variables


From a sample of coasters
worldwide:
Duration on Length looks
strong and positive.


19.1 Indicator (or Dummy) Variables


From a sample of coasters worldwide:
The regression looks strong and the conditions seem to be met.


19.1 Indicator (or Dummy) Variables


The duration of a ride increases by about 23 seconds for
each additional 1000 ft of track.


19.1 Indicator (or Dummy) Variables


Many rides have inversions: loops, corkscrews, etc.
These features impose limitations on the speed of the coaster.
How do we introduce the categorical variable Inversions?


19.1 Indicator (or Dummy) Variables


Let's analyze each group (rides with inversions and rides without inversions) separately:
The slopes are very similar.
The intercepts are different.


19.1 Indicator (or Dummy) Variables


When the data can be divided into two
groups with similar regression slopes,
we can incorporate the group
information into a single regression
model.
We accomplish this by introducing an
indicator (or dummy) variable that
indicates whether a coaster has an
inversion:
Inversions = 1 if Yes
Inversions = 0 if No
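A minimal sketch of this coding in Python, using synthetic data simulated to mimic the slide's fitted model (the real coaster dataset isn't shown here):

```python
# Sketch with synthetic data: code the indicator and fit the multiple regression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 60
length = rng.uniform(1000, 6000, n)        # track length in feet (assumed range)
inversions = rng.integers(0, 2, n)         # 1 = has inversions, 0 = none
# simulate durations roughly consistent with the slide's fitted coefficients
duration = 22.39 + 0.028 * length + 30.08 * inversions + rng.normal(0, 15, n)

df = pd.DataFrame({"Duration": duration, "Length": length,
                   "Inversions": inversions})
fit = smf.ols("Duration ~ Length + Inversions", data=df).fit()
print(fit.params)    # intercept, Length slope, and the Inversions shift
```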

19.1 Indicator (or Dummy) Variables


Re-analyze with a multiple regression that includes both
Length and Inversions:

$\widehat{Duration} = 22.39 + 0.028\,Length + 30.08\,Inversions$


The R² is larger for the multiple regression (70.4%) than for the simple regression (62.0%).
The t-ratios of both coefficients are large. (The residuals look
reasonable as well.)

19.1 Indicator (or Dummy) Variables


Notice how the indicator variable works in the model.
$\widehat{Duration} = 22.39 + 0.028\,Length + 30.08\,Inversions$
Inversions turns on (1) and turns off (0) an additional 30.08
seconds of duration, depending on whether the ride has an
inversion.
Turning this factor on shifts the
intercept upward by about 30
seconds while leaving the slope
unaffected. This is consistent with
the simple regression analyses of
the separate groups (at right).


19.1 Indicator (or Dummy) Variables


Indicators for Three or More Categories
What if a categorical variable has more than two levels?
Construct indicators by creating a separate indicator for each of these levels. If a variable has k levels, then create k - 1 indicators. Choose one category as a baseline and leave out its indicator.
Regression coefficients are interpreted as the amount by which
their categories differ from the baseline, after allowing for the
linear effects of the other variables in the model.


19.1 Indicator (or Dummy) Variables


Indicators for Three or More Levels
For example, the variable Month has 12 levels.
Temptation: Create a single indicator variable thus:
Month = 1 for January, Month = 2 for February, . . . Month =
12 for December

This is not recommended!



19.1 Indicator (or Dummy) Variables


Indicators for Three or More Levels
Rather, introduce 11 indicator variables, one that turns on and
off the month of February, another that turns on and off the
month of March, etc.
If no month is turned on, then the model defaults to the
baseline month (January), so there is no need to include a
separate indicator variable for this month.
Clearly, the model will give results when more than one month
is turned on, but these results should not be interpreted.
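A sketch of the k - 1 coding in pandas, on hypothetical monthly data. Note that get_dummies with drop_first=True drops the first level alphabetically ("Dec" here), and that dropped level becomes the baseline:

```python
# Sketch: pandas builds k - 1 month indicators when drop_first=True.
import pandas as pd

df = pd.DataFrame({"Month": ["Jan", "Feb", "Mar", "Jan", "Dec"],
                   "Sales": [10.0, 12.0, 9.0, 11.0, 14.0]})
# the dropped (first alphabetically) level becomes the baseline category
dummies = pd.get_dummies(df, columns=["Month"], drop_first=True)
print(dummies)
```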


19.1 Indicator (or Dummy) Variables


Example: Traffic Delays
The Texas Transportation Institute (tti.tamu.edu) studies traffic
delays. Data the institute published in 2001 include information
on the Total Delay per Person (hours per year spent delayed by
traffic), the Average Arterial Road Speed (mph), the Average
Highway Road Speed (mph), and the Size of the city (small,
medium, large, very large). The regression model is:
Dependent variable is: Delay/person
R-squared = 79.1%   R-squared (adjusted) = 77.4%
s = 6.474 with 68 - 6 = 62 degrees of freedom

Variable        Coeff       SE(Coeff)   t-ratio   P-value
Intercept       139.104     16.69         8.33    0.0001
HiWay MPH        -1.07347    0.2474      -4.34    0.0001
Arterial MPH     -2.04836    0.6672      -3.07    0.0032
Small            -3.58970    2.953       -1.22    0.2287
Large             5.00967    2.104        2.38    0.0203
Very Large        3.41058    3.230        1.06    0.2951

19.1 Indicator (or Dummy) Variables


Example: Traffic Delays
The regression model is:
Dependent variable is: Delay/person
R-squared = 79.1%   R-squared (adjusted) = 77.4%
s = 6.474 with 68 - 6 = 62 degrees of freedom

Variable        Coeff       SE(Coeff)   t-ratio   P-value
Intercept       139.104     16.69         8.33    0.0001
HiWay MPH        -1.07347    0.2474      -4.34    0.0001
Arterial MPH     -2.04836    0.6672      -3.07    0.0032
Small            -3.58970    2.953       -1.22    0.2287
Large             5.00967    2.104        2.38    0.0203
Very Large        3.41058    3.230        1.06    0.2951

Why is there no coefficient for Medium?
Interpret the coefficient for Small.


19.1 Indicator (or Dummy) Variables


Example: Traffic Delays
The regression model is:
Dependent variable is: Delay/person
R-squared = 79.1%   R-squared (adjusted) = 77.4%
s = 6.474 with 68 - 6 = 62 degrees of freedom

Variable        Coeff       SE(Coeff)   t-ratio   P-value
Intercept       139.104     16.69         8.33    0.0001
HiWay MPH        -1.07347    0.2474      -4.34    0.0001
Arterial MPH     -2.04836    0.6672      -3.07    0.0032
Small            -3.58970    2.953       -1.22    0.2287
Large             5.00967    2.104        2.38    0.0203
Very Large        3.41058    3.230        1.06    0.2951

Why is there no coefficient for Medium? Medium was selected as the baseline.
Interpret the coefficient for Small. It says that traffic delays are about 3.6 hours per person per year lower in small cities than in medium cities, after allowing for the effects of highway and arterial speeds.


19.2 Adjusting for Different Slopes


Interaction Terms
Indicator variables can account for differences in the intercepts of
different groups.
But, what if the slopes of groups differ?
Example: Carbohydrates vs. Calories for selected Burger
King products

[Scatterplots of Calories vs. Carbs for meat-based dishes and for non-meat dishes.]
The slopes of the two groups are different.

19.2 Adjusting for Different Slopes


Interaction Terms
Start as before by introducing an indicator variable, Meat.
Meat = 1 if meat is present in the dish; Meat = 0 if it isn't
Adding Meat to the model can adjust the intercept.
To adjust the slope, add the interaction term Carbs*Meat.
Note that Carbs*Meat equals just Carbs when Meat = 1 and
equals 0 (disappears) when Meat = 0.
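A sketch in Python, again on synthetic data simulated from the slide's fitted coefficients. In a statsmodels formula, Carbs:Meat denotes just the product term (Carbs*Meat would expand to all three terms):

```python
# Sketch: fit the model with an interaction term (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 40
carbs = rng.uniform(10, 80, n)
meat = rng.integers(0, 2, n)
calories = (137.40 + 3.93 * carbs - 26.16 * meat
            + 7.88 * carbs * meat + rng.normal(0, 25, n))
df = pd.DataFrame({"Calories": calories, "Carbs": carbs, "Meat": meat})

# Carbs:Meat is the interaction (product) term
fit = smf.ols("Calories ~ Carbs + Meat + Carbs:Meat", data=df).fit()
print(fit.params)
```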


19.2 Adjusting for Different Slopes


Interaction Terms
Re-analyze with the interaction term:

$\widehat{Calories} = 137.40 + 3.93\,Carbs - 26.16\,Meat + 7.88\,Carbs{\times}Meat$


The predictor Meat is not significant whereas the interaction
Carbs*Meat is significant.

19.2 Adjusting for Different Slopes


Interaction Terms
For Meat = 0, the intercept is 137.40 and the slope is 3.93:
$\widehat{Calories} = 137.40 + 3.93\,Carbs - 26.16(0) + 7.88\,Carbs(0) = 137.40 + 3.93\,Carbs$
For Meat = 1, the intercept drops to 111.24 and the slope increases to 11.81:
$\widehat{Calories} = 137.40 + 3.93\,Carbs - 26.16(1) + 7.88\,Carbs(1) = (137.40 - 26.16) + (3.93 + 7.88)\,Carbs = 111.24 + 11.81\,Carbs$
(The drop in intercept is not significant.)

19.2 Adjusting for Different Slopes


Interaction Terms
Introducing an interaction term produces a result consistent with
simple regressions of the two groups:

Simple regressions of
the meat group and the
non-meat group.


19.3 Multiple Regression Diagnostics


In multiple regression, a case can be extraordinary by having:
1) an unusual y-value compared to the model (check for
outliers in the residuals plot)
2) an outstanding x-value for one of the predictors (check the
scatterplot for each predictor)
3) an outstanding combination of x-values (not easily
assessed with our current tools!)


19.3 Multiple Regression Diagnostics


Leverage
In simple regression, leverage is easy to see because a high-leverage point lies far from the mean of the x-values in a scatterplot.
In a multiple regression with k predictors, a point might not be far
from any of the x means and yet still exert large leverage
because it has an unusual combination of predictor values. To
determine leverage, calculate leverage values for each case.


19.3 Multiple Regression Diagnostics


Leverage
Using Leverage to Detect Extraordinary Cases:
Keeping everything else the same, increase (or decrease) the y-value of a case by 1.0.
Find a new regression and calculate the new predicted value of the case.
Define leverage as the amount by which the predicted value changes. The leverage $h_i$ of a case is always between 0 and 1.
Make a histogram of the distribution of leverages. (There is no
statistical test for whether a leverage is too large).
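A sketch of this definition in Python on synthetic data: bump one case's y by 1, refit, and the change in that case's prediction equals the hat-matrix diagonal h_i that software reports.

```python
# Sketch: leverage as "change in prediction per unit change in y".
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(30, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=30)
fit = sm.OLS(y, X).fit()

i = 0                           # the case to perturb
y2 = y.copy()
y2[i] += 1.0                    # increase its y-value by 1.0
fit2 = sm.OLS(y2, X).fit()

change = fit2.fittedvalues[i] - fit.fittedvalues[i]
h_i = fit.get_influence().hat_matrix_diag[i]
print(change, h_i)              # the two numbers agree
```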

19.3 Multiple Regression Diagnostics


Recall Speed of Coasters Predicted by Height and Drop
Both Height and Drop are significant and R² is high…

…and the scatterplot of residuals shows nothing unusual.


19.3 Multiple Regression Diagnostics


Recall Speed of Coasters Predicted by Height and Drop
…but the distribution of leverages is interesting.
England's Oblivion

Oblivion has an unusual combination of Height and Drop:
Height = only 65 ft
Drop = 180 ft (it drops into a hole in the ground!)

19.3 Multiple Regression Diagnostics


Oblivion's leverage should make us think about conducting further regression analyses: one without Oblivion and one without Height as a predictor.
The more complex the regression model, the more important it
is to look for high-leverage cases and their effects.


19.3 Multiple Regression Diagnostics


Residuals and Standardized Residuals
Consider a sample with a high-leverage
data point.
The high-leverage point strongly attracts
the regression line.
So, the residual for this point may not
accurately reflect the true underlying
variance.
We correct for this effect by working with
Standardized or Studentized residuals.


19.3 Multiple Regression Diagnostics


Residuals and Standardized Residuals
A residual is standardized when it is divided by an
estimate of its standard deviation.
Technology may permit you to choose between externally Studentized residuals and internally Studentized residuals.
It is the externally Studentized version that follows a t-distribution.
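For instance, statsmodels exposes both versions; a sketch on synthetic data:

```python
# Sketch: internally vs. externally Studentized residuals in statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(30, 2)))
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=30)

infl = sm.OLS(y, X).fit().get_influence()
internal = infl.resid_studentized_internal   # scaled by the full-sample s_e
external = infl.resid_studentized_external   # case i left out; follows a t-distribution
print(internal[:3])
print(external[:3])
```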


19.3 Multiple Regression Diagnostics


Influence Measures
A case with both high leverage and large Studentized residual is
likely to change the regression model substantially all by itself.
Such a case is said to be influential and cries out for special
attention.
Use Cook's Distance to evaluate influence:
$$D_i = \frac{e_i^2}{k\,s_e^2} \cdot \frac{h_i}{(1 - h_i)^2}$$

If $D_i$ is unusually large (examine a histogram of the distances), the case should be checked as a possible influential point.
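A sketch comparing that formula with the library value on synthetic data (note that statsmodels counts the intercept among the k estimated coefficients):

```python
# Sketch: Cook's D from statsmodels vs. the formula computed by hand.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(40, 2)))
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=40)

fit = sm.OLS(y, X).fit()
infl = fit.get_influence()
d_lib, _ = infl.cooks_distance        # library version (distances, p-values)

e = fit.resid
h = infl.hat_matrix_diag
k = X.shape[1]                        # number of estimated coefficients
s2 = fit.mse_resid                    # s_e squared
d_hand = e**2 / (k * s2) * h / (1 - h)**2
print(np.allclose(d_lib, d_hand))     # True
```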

19.3 Multiple Regression Diagnostics


Influence Measures
Cook's Distances
Coaster Speed Example

Four coasters have high Cook's D.



19.3 Multiple Regression Diagnostics


Influence Measures

We already understand Oblivion's outstanding status (most of the Drop is underground and doesn't contribute to the Height).
Further investigation reveals that the other three are
mechanically accelerated (not simply gravity-driven).


19.3 Multiple Regression Diagnostics


Influence Measures
Re-analysis without the three blast coasters has a striking
effect.

Height is no longer important to the model.


19.3 Multiple Regression Diagnostics


Influence Measures
Re-analyzing yet again with a simple linear regression, Speed vs. Drop, produces a strong model. (The 3 blast coasters are excluded.)


19.3 Multiple Regression Diagnostics


Indicators for Influence
A good strategy for an extraordinary case: construct an indicator variable that is zero for all cases except the one we wish to isolate.
This is equivalent to omitting the case, but with two advantages:
It makes clear to others which case is extraordinary.
The t-statistic for the indicator variable's coefficient tests whether the case is influential.
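A sketch of the strategy on synthetic data (one wild case is planted at index 7, a hypothetical choice):

```python
# Sketch: give one suspect case its own indicator and read off its t-test.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 30
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
y[7] += 10.0                              # plant one extraordinary case

df = pd.DataFrame({"x": x, "y": y})
df["case7"] = 0.0
df.loc[7, "case7"] = 1.0                  # 1 for the suspect case, 0 elsewhere

fit = smf.ols("y ~ x + case7", data=df).fit()
# a significant coefficient says the case doesn't fit with the others
print(fit.tvalues["case7"], fit.pvalues["case7"])
```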


19.3 Multiple Regression Diagnostics


Indicators of Influence
Coaster Speed Example

The P-values confirm that the blast coasters don't fit with the other ones.
Note: The coefficient for Drop is the same as in the model without the three blast coasters.

19.3 Multiple Regression Diagnostics


Diagnosis Wrapup
When both high leverage and large Studentized residual cases
are present, it is irresponsible to report only the regression on
all the data.
You should compute and discuss regressions with such cases
removed. Discuss extraordinary cases individually if they offer
insight.


19.3 Multiple Regression Diagnostics


Example: (continued) Traffic Delays
Here's a scatterplot of the residuals from the regression on Traffic Delays plotted against mean Highway Speed (mph).
The point plotted with an X is Los Angeles. Explain.
Is Los Angeles likely to be an influential point in this regression?

19.3 Multiple Regression Diagnostics


Example: (continued) Traffic Delays
Here's a scatterplot of the residuals from the regression on Traffic Delays plotted against mean Highway Speed (mph).
The point plotted with an X is Los Angeles. Explain.
LA has the lowest average highway speed (about 35 mph) of all the cities in this study. The positive residual says that drivers lose about 7.5 hours per person per year more than the model predicts.

19.3 Multiple Regression Diagnostics


Example: (continued) Traffic Delays
Here's a scatterplot of the residuals from the regression on Traffic Delays plotted against mean Highway Speed (mph).
Is Los Angeles likely to be an influential point in this regression?
It may have elevated leverage, but the residual isn't particularly large. It's probably not influential.


19.4 Building Regression Models


What constitutes the best regression model depends on the situation. Best models should have:
Relatively few predictors, each reliably measured and relatively unrelated
A relatively high R²
A relatively small value of $s_e$
Relatively small P-values for the F- and t-statistics
No cases with extraordinarily high leverage
No cases with extraordinarily large residuals, and Studentized residuals that appear to be nearly Normal

19.4 Building Regression Models


Best Subsets and Stepwise Regression
There are several automated systems for searching out best
models once a single criterion is selected for optimization,
including:
Best Subsets Regression
Stepwise Regression
These methods do not check the assumptions and conditions
and are vulnerable to other complications.
Even though the process is automated, human oversight and analysis are still necessary to guard against misleading results.


19.4 Building Regression Models


Best Subsets Regression
Choose a single criterion of best. For example, consider the best model to be the one with the highest R².
Choose a modest set of potential predictors.
A computer searches all possible models and reports the best
two-predictor model, the best three-predictor model, the best
four-predictor model, etc.
The technique becomes increasingly impracticable as the size
of the data set increases.
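A brute-force sketch of the idea on synthetic data, with adjusted R² as the single criterion:

```python
# Sketch: exhaustive best-subsets search, best model at each size.
from itertools import combinations
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 1 + 2 * df["x1"] - df["x3"] + rng.normal(size=100)

predictors = ["x1", "x2", "x3", "x4"]
for size in range(1, len(predictors) + 1):
    fits = [(smf.ols("y ~ " + " + ".join(combo), data=df).fit(), combo)
            for combo in combinations(predictors, size)]
    best_fit, best_combo = max(fits, key=lambda fc: fc[0].rsquared_adj)
    print(size, best_combo, round(best_fit.rsquared_adj, 3))
```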


19.4 Building Regression Models


Stepwise Regression
The regression builds the model stepwise from a given initial
model. At each step, a predictor is either added to or removed
from the model.
The predictor chosen to add is the one whose addition increases the adjusted R² the most.
The predictor chosen to remove is the one whose removal reduces the adjusted R² the least.
Hopefully, the computer settles on a good model.
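A sketch of forward stepwise selection under the adjusted-R² criterion, on synthetic data (real implementations also support removal steps):

```python
# Sketch: forward stepwise selection, stopping when adjusted R^2 stops improving.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 1 + 2 * df["x1"] - df["x3"] + rng.normal(size=100)

remaining = ["x1", "x2", "x3", "x4"]
chosen = []
best_adj = -np.inf
while remaining:
    # try adding each remaining predictor; keep the one that helps most
    trials = [(smf.ols("y ~ " + " + ".join(chosen + [p]), data=df).fit()
               .rsquared_adj, p) for p in remaining]
    adj, p = max(trials)
    if adj <= best_adj:        # no addition improves the model: stop
        break
    best_adj = adj
    chosen.append(p)
    remaining.remove(p)
print(chosen, round(best_adj, 3))
```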


19.4 Building Regression Models


Challenges in Building Regression Models
For large data sets, checking the data for accuracy, missing
values, consistency, and reasonableness can be a major part of
the effort.
The probability that a Type I error will occur increases as the number of models considered increases.


19.5 Collinearity
Predictor variables exhibit collinearity when one of the
predictors can be predicted well from the others.
Consequences of Collinearity:
Coefficients in a multiple regression model can be surprising,
taking on an unanticipated sign or being unexpectedly large or
small.
The stronger the correlation between predictors, the more the variance of their coefficients increases when both are included in the model (variance inflation). This can lead to a smaller t-statistic and correspondingly large P-value.


19.5 Collinearity
Recall Housing Prices based on Living Area and Bedrooms
(Simple Regression Models):

Simple regression predicts $113.12 increase in


price for each additional square foot of space.

Simple regression predicts $48,218 increase in


price for each additional bedroom.


19.5 Collinearity
Recall Housing Prices based on Living Area and Bedrooms
(Multiple Regression Model):

The coefficient on Bedrooms seems counterintuitive.


19.5 Collinearity
Recall Housing Prices based on Living Area and Rooms
(Simple Regression Models):

Simple regression predicts $113.12 increase in


price for each additional square foot of space.

Simple regression predicts $22,572.90 increase


in price for each additional room.

19.5 Collinearity
Recall Housing Prices based on Living Area and Rooms (Multiple Regression Model):

The Standard Errors have undergone a significant percentage change from the simple regression models to the multiple regression model.


19.5 Collinearity
The statistic that measures the degree of collinearity of the jth predictor with the others is called the Variance Inflation Factor (VIF). It is found as
$$VIF_j = \frac{1}{1 - R_j^2}$$
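A sketch computing VIF both from the formula and with statsmodels, on synthetic, deliberately collinear predictors:

```python
# Sketch: VIF by the formula 1 / (1 - R_j^2) and via statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(9)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)   # x2 is nearly a copy of x1
df = pd.DataFrame({"x1": x1, "x2": x2})

# by the formula: regress x1 on the other predictor(s), then 1 / (1 - R^2)
r2 = smf.ols("x1 ~ x2", data=df).fit().rsquared
print(1.0 / (1.0 - r2))

# via statsmodels (the design matrix needs an explicit constant column)
X = sm.add_constant(df)
print(variance_inflation_factor(X.values, 1))   # column 1 is x1
```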


19.5 Collinearity
Facts about Collinearity
The collinearity of any predictor with the others in the model
can be measured with its Variance Inflation Factor (VIF).
High collinearity leads to the coefficient being poorly estimated and having a large standard error (and correspondingly low t-statistic). The coefficient may seem to be the wrong size or even have the wrong sign.


19.5 Collinearity
Facts about Collinearity

If a multiple regression model has a high R² and a large F, but the individual t-statistics are not significant, you should suspect collinearity.
Collinearity is measured in terms of the $R_j^2$ between a predictor and all of the other predictors in the model. It is not measured in terms of the correlation between any two predictors.


19.5 Collinearity
Dealing with Collinearity:
Simplify the model and improve the t-statistic by removing some
of the predictors. Which should you keep?
Variables that are most reliably measured
Variables that are least expensive to find
Variables that are inherently important to the problem
New variables formed by combining variables


19.6 Quadratic Terms


2002 Winter Olympic Games Downhill Time vs Start Order
(linear model):
Do downhill ski runs get slower as the day wears on?

Time is expected to rise 0.109 seconds for each subsequent position in the StartOrder, but…


19.6 Quadratic Terms

…the residuals exhibit a bend.


19.6 Quadratic Terms


Downhill Time vs Start Order (quadratic model)

Plotting the original data, a quadratic model of the form
$$\hat{y} = b_0 + b_1\,StartOrder + b_2\,StartOrder^2$$
seems more appropriate. Caution: use quadratic models with care, especially if attempting to extrapolate beyond the range of x-values.
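A sketch of fitting such a quadratic with a formula, on synthetic run times (I() keeps the squaring inside the formula rather than being parsed as an interaction):

```python
# Sketch: quadratic term via I(...) in a statsmodels formula (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
start = np.arange(1, 61, dtype=float)
time = 100 + 0.4 * start - 0.004 * start**2 + rng.normal(0, 1.0, 60)
df = pd.DataFrame({"Time": time, "StartOrder": start})

quad = smf.ols("Time ~ StartOrder + I(StartOrder**2)", data=df).fit()
print(quad.params)
```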

19.6 Quadratic Terms


Downhill Time vs Start Order (quadratic model)

Fitting the quadratic model gives a better R-squared value and a reasonable residual plot, however…


19.6 Quadratic Terms

…the coefficient on StartOrder changed from significant and positive in the linear model to significant and negative in the quadratic model.
Possible
collinearity?


19.6 Quadratic Terms


To account for the collinearity between StartOrder and StartOrder², fit a quadratic model with
$$(StartOrder - \overline{StartOrder})^2$$
instead of StartOrder². The form with the mean subtracted has zero correlation with the linear term.
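A quick numerical check of that claim, using hypothetical start orders 1 to 60 (which are symmetric about their mean):

```python
# Sketch: centering before squaring removes the linear/quadratic correlation.
import numpy as np

x = np.arange(1.0, 61.0)                  # start orders 1..60
print(np.corrcoef(x, x**2)[0, 1])                 # near 1: strong collinearity
print(np.corrcoef(x, (x - x.mean())**2)[0, 1])    # 0 for a symmetric x
```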


19.6 Quadratic Terms


Regression Roles
Reasons for building regression models
to try to understand relationships among variables.
want simple models with unrelated predictors
want large t-statistics
to predict response variable values when given predictor
variable values.
want large R-squared values
collinearity is less of a concern


19.6 Quadratic Terms


Example: Lobster Industry 2008 Water Temps
The Maine lobster fishing season is seasonal, but has been shifting later in the fall. Scientists suggest the biggest factor in predicting the peak time is water temperature. A scatterplot reveals a curved pattern.

Would re-expressing
either variable help?


19.6 Quadratic Terms


Example: Lobster Industry 2008 Water Temps
The Maine lobster fishing season is seasonal, but has been shifting later in the fall. Scientists suggest the biggest factor in predicting the peak time is water temperature. A scatterplot reveals a curved pattern. Would re-expressing either variable help?
The Linearity Condition is violated. But the curve is not monotonic (consistently rising or falling), so no re-expression can help. A regression with a quadratic term might produce a useful model.

19.6 Quadratic Terms


Example: (continued) Lobster Industry 2008
Given below is output with a quadratic term. Recall from the
scatterplot that temperatures have not been dropping at 19
degrees per year.
Dependent variable is: Water Temperature
57 total cases of which 1 is missing
R-squared   R-squared (adjusted)
s = 0.9852 with 56 - 3 = 53 degrees of freedom

Variable    Coeff          SE(Coeff)   t-ratio   P-value
Intercept   19208.2        2204          8.71    0.0001
Year          -19.3947        2.229      -8.70    0.0001
Year^2          4.90774e-3    0.0006      8.71    0.0001

Explain why the coefficient of Year is now strongly negative.


19.6 Quadratic Terms


Example: (continued) Lobster Industry 2008
Given below is output with a quadratic term. Recall from the
scatterplot that temperatures have not been dropping at 19
degrees per year.
Dependent variable is: Water Temperature
57 total cases of which 1 is missing
R-squared   R-squared (adjusted)
s = 0.9852 with 56 - 3 = 53 degrees of freedom

Variable    Coeff          SE(Coeff)   t-ratio   P-value
Intercept   19208.2        2204          8.71    0.0001
Year          -19.3947        2.229      -8.70    0.0001
Year^2          4.90774e-3    0.0006      8.71    0.0001

Explain why the coefficient of Year is now strongly negative.
Year and Year² are highly correlated. The collinearity accounts for the change in the coefficient of Year.

What Can Go Wrong?

Be alert for collinearity when you set aside an influential point. Removing a high-influence point may surprise you with unexpected collinearity.
Beware missing data. Missing data may make it difficult to compare regression models with different predictors.
Don't forget linearity. If the Linearity Assumption is violated, everything else about a regression model may be invalid.
Check for parallel regression lines. If an indicator variable is introduced, an interaction term may need to be added if the slopes differ significantly between groups.

What Have We Learned?


Use indicator (dummy) variables intelligently.

An indicator variable that is 1 for a group and 0 for others is appropriate when the slope for that group is the same as for the others, but the intercept may be different.
If the slope for the indicated group is different, then it may be appropriate to include an interaction term in the regression model.
When there are three or more categories, use a separate indicator variable for each, but leave one out to avoid collinearity.


What Have We Learned?


Diagnose multiple regressions to expose any undue influence of
individual cases.

Leverage measures how far a case is from the mean of all cases when measured on the x-variables.
The leverage of a case tells how much the predicted value of that case would change if the y-value of that case were changed by adding 1 and nothing else in the regression changed.
Studentized residuals are residuals divided by their individual standard errors. Externally Studentized residuals follow a t-distribution when the regression assumptions are satisfied.

What Have We Learned?


Diagnose multiple regressions to expose any undue influence of
individual cases. (continued)

A case is influential if it has both sufficient leverage and a large enough residual. Removing an influential case from the data will change the model in ways that matter to your interpretation or intended use. Measures such as Cook's D and DFFITS combine leverage and Studentized residuals into a single measure of influence.
By assigning an indicator variable to a single influential case, we can remove its influence from the model and test (using the P-value of its coefficient) whether it is in fact influential.

What Have We Learned?


Build multiple regression models when many predictors are
available.

Seek models with few predictors, a relatively high R², a relatively small residual standard deviation, relatively small P-values for the coefficients, no cases that are unduly influential, and predictors that are reliably measured and relatively unrelated to each other.
Automated methods for seeking regression models include Best Subsets and Stepwise regression. Neither one should be used without carefully diagnosing the resulting model.


What Have We Learned?


Recognize collinearity and deal with it to improve your
regression model. Collinearity occurs when one predictor can
be well predicted from the others.
The R² of the regression of one predictor on the others is a suitable measure. Alternatively, the Variance Inflation Factor, which is based on this R², is often reported by statistics programs.
Collinearity can have the effect of making the P-values for the coefficients large (not significant) even though the overall regression fits the data well.
Removing some predictors from the model or making combinations of those that are collinear can reduce this problem.

What Have We Learned?


Consider fitting quadratic terms in your regression model when the residuals show a bend. Re-expressing y is another option, unless the bent relationship between y and the x's is not monotonic.
Recognize why a particular regression model is fit.
We may want to understand some of the individual coefficients.
We may simply be interested in prediction and not be concerned about the coefficients themselves.

