
Points to highlight
— The Simple Linear Regression Model
— Least Square Method
— Assessing the Fit of the Simple Linear Regression
Model
— The Multiple Regression Model
— Inference and Regression
— Categorical Independent Variables
— Modeling Nonlinear Relationships
— Model Fitting


Introduction
— Managerial decisions are often based on the
relationship between two or more variables
— Example: After considering the relationship between
advertising expenditures and sales, a marketing manager
might attempt to predict sales for a given level of
advertising expenditures
— Sometimes a manager will rely on intuition to judge
how two variables are related
— If data can be obtained, a statistical procedure called
regression analysis can be used to develop an
equation showing how the variables are related

Introduction
— Regression analysis
- Describe a relationship between one variable and
the other variables in mathematical terms.
- Predict the value of a dependent variable based on
the value of at least one independent variable
- Explain the impact of changes in an independent
variable on the dependent variable


Introduction
— Dependent variable or response: Variable being
predicted (Variable we wish to explain)
— Independent variables or predictor variables: Variables
being used to predict the value of the dependent variable
(Variable used to explain the dependent variable)
— Linear regression: A regression analysis involving at least
one independent variable and one dependent variable
— In statistical notation:
y = dependent variable
x = independent variable

Examples
Example 1: The product manager of a particular
brand of children’s breakfast cereal would like to
predict the demand for cereal during the next
year. To use regression analysis, she and her staff
list the following variables as likely to affect sales:
•Price of the product
•Number of children 5 to 12 years of age (the target market)
•Price of competitors’ products
•Effectiveness of advertising
•Annual sales this year
•Annual sales in previous year


Examples
Example 2: A real estate agent wants to predict
the selling price of houses more accurately. She
believes that the following variables affect the
price of a house:
• Size of the house (number of square feet)
• Number of bedrooms
• Frontage of the lot
• Condition
• Location

Introduction
— Simple linear regression: A regression analysis for
only one independent variable, x, and one dependent
variable, y
— Multiple linear regression: Regression analysis
involving two or more independent variables


§ Regression Model
§ Estimated Regression Equation

The Simple Linear Regression Model


Regression Model
— The equation that describes how y is related to x and
an error term (ε)
— Simple Linear Regression Model:
y = β0 + β1x + ε
— Parameters: The characteristics of the population, β0
and β1
— Random variable: Error term, ε
— The error term accounts for the variability in y that
cannot be explained by the linear relationship between
x and y


The Simple Linear Regression Model


— The parameter values are usually not known and must
be estimated using sample data
— Sample statistics (denoted b0 and b1) are computed as
estimates of the population parameters β0 and β1

Estimated Regression Equation


— The equation obtained by substituting the values of
the sample statistics b0 and b1 for β0 and β1 in the
regression equation:
ŷ = b0 + b1x



Figure 7.1: The Estimation Process in Simple Linear Regression

Figure 7.2: Possible Regression Lines in Simple Linear Regression


§ Least Squares Estimates of the Regression Parameters


§ Using Excel’s Chart Tools to Compute the Estimated
Regression Equation

Least Squares Method


— Least squares method: A procedure for using
sample data to find the estimated regression equation
— Determine the values of b0 and b1
— Interpretation of b0 and b1:
— The slope b1 is the estimated change in the mean of the
dependent variable y that is associated with a one unit
increase in the independent variable x
— The y-intercept b0 is the estimated value of the
dependent variable y when the independent variable x
is equal to 0



Table 7.1: Miles Traveled and Travel Time for 10 Butler Trucking Company Driving Assignments

Figure 7.3: Scatter Chart of Miles Traveled and Travel Time for Sample of 10 Butler Trucking Company Driving Assignments



Least Squares Method

— The least squares method minimizes the sum of squared residuals:
min Σ(yi − ŷi)² = min Σ(yi − b0 − b1xi)²


The Least Squares Equation


— The formulas for b1 and b0 are:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   (sums over i = 1, …, n)

b0 = ȳ − b1x̄

where
xi: value of the independent variable for the ith observation
yi: value of the dependent variable for the ith observation
x̄: mean value for the independent variable
ȳ: mean value for the dependent variable
n: total number of observations
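Assuming NumPy and a small made-up data set (the Butler Trucking values live in Table 7.1, which is not reproduced here), a minimal sketch of these formulas:

```python
import numpy as np

# Hypothetical sample data (not the Butler Trucking values)
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # independent variable
y = np.array([1.5, 2.4, 3.1, 4.2, 4.8])       # dependent variable

x_bar, y_bar = x.mean(), y.mean()

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(f"yhat = {b0:.4f} + {b1:.4f} x")
```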



Least Squares Method


— For the Butler Trucking sample, the estimated regression
equation is ŷ = 1.2739 + 0.0678x
— Interpretation of b1: If the length of a driving
assignment were 1 unit (1 mile) longer, the mean travel
time for that driving assignment would be 0.0678
units (0.0678 hours, or approximately 4 minutes)
longer
— Interpretation of b0: If the driving distance for a
driving assignment were 0 units (0 miles), the mean
travel time would be 1.2739 units (1.2739 hours, or
approximately 76 minutes)

Least Squares Method


— Experimental region: The range of values of the
independent variables in the data used to estimate the
model
— The regression model is valid only over this region
— Extrapolation: Prediction of the value of the
dependent variable outside the experimental region
— Extrapolation is risky, because the fitted linear
relationship may not hold outside the experimental region



Table 7.2: Predicted Travel Time and Residuals for 10 Butler Trucking Company Driving Assignments


Figure 7.4: Scatter Chart of Miles Traveled and Travel Time for Butler Trucking Company Driving Assignments with Regression Line Superimposed

Figure 7.5: A Geometric Interpretation of the Least Squares Method


Least Squares Method


Using Excel’s Chart Tools to Compute the Estimated
Regression Equation
— After constructing a scatter chart with Excel’s chart tools:
1. Right-click on any data point and select Add
Trendline…
2. In the Format Trendline task pane, in the Trendline
Options area:
— Select Linear
— Select Display Equation on chart

Figure 7.6: Scatter Chart and Estimated Regression Line for Butler Trucking Company


§ The Sums of Squares


§ The Coefficient of Determination
§ Using Excel’s Chart Tools to Compute the Coefficient of
Determination

Assessing the Fit of the Simple Linear Regression Model
The Sums of Squares
— Sum of squares due to error (SSE): The value of SSE is
a measure of the error in using the estimated
regression equation to predict the values of the
dependent variable in the sample:
SSE = Σ(yi − ŷi)²
— From Table 7.2, SSE = 8.0288


Figure 7.7: The Sample Mean as a Predictor of Travel Time for Butler Trucking Company



Table 7.3: Calculations for the Sum of Squares Total for the Butler Trucking Simple Linear Regression

Figure 7.8: Deviations About the Estimated Regression Line and the Line ŷ = ȳ for the Third Butler Trucking Company Driving Assignment



Explained and Unexplained Variation

— SST = total sum of squares
— Measures the variation of the yi values around their
mean ȳ
— SSE = sum of squares due to error
— Variation attributable to factors other than the
relationship between x and y
— SSR = sum of squares due to regression
— Explained variation attributable to the relationship
between x and y
— The three measures are related by SST = SSR + SSE


Assessing the Fit of the Simple Linear Regression Model
— Coefficient of determination: r² = SSR/SST, the proportion
of the variation in y explained by the estimated regression
equation

Figure 7.9: Scatter Chart and Estimated Regression Line with Coefficient of Determination r² for Butler Trucking Company
Interpretation of r²?


Examples of Approximate r² Values

— r² = 1: Perfect linear relationship between x and y;
100% of the variation in y is explained by the variation in x

Examples of Approximate r² Values

— 0 < r² < 1: Weaker linear relationship between x and y;
some but not all of the variation in y is explained by the
variation in x


Examples of Approximate r² Values

— r² = 0: No linear relationship between x and y; the value
of y does not depend linearly on x, and none of the
variation in y is explained by the variation in x

§ Regression Model
§ Estimated Multiple Regression Equation
§ Least Squares Method and Multiple Regression
§ Butler Trucking Company and Multiple Regression
§ Using Excel’s Regression Tool to Develop the Estimated
Multiple Regression Equation


The Multiple Regression Model


Regression Model
— Multiple regression model
y = β0 + β1x1 + β2x2 + ∙ ∙ ∙ + βqxq + ε
— y = dependent variable
— x1, x2, . . . , xq = independent variables
— β0, β1, β2, . . . , βq = parameters
— ε = error term (accounts for the variability in y that
cannot be explained by the linear effect of the q
independent variables)


The Multiple Regression Model


— Interpretation of slope coefficient βj: Represents the
change in the mean value of the dependent variable y
that corresponds to a one unit increase in the
independent variable xj, holding the values of all other
independent variables in the model constant
— The multiple regression equation that describes how
the mean value of y is related to x1, x2, . . . , xq:
E( y | x1, x2, . . . , xq) = β0 + β1x1 + β2x2 + ∙ ∙ ∙ + βqxq



The Multiple Regression Model
— Estimated multiple regression equation:
ŷ = b0 + b1x1 + b2x2 + ∙ ∙ ∙ + bqxq

The Multiple Regression Model

Least Squares Method and Multiple Regression
Figure 7.10: The Estimation Process for Multiple Regression


Estimates b0, b1, b2, …, bq
The normal equations:

Σy = nb0 + b1Σx1 + b2Σx2 + ⋯ + bqΣxq
Σx1y = b0Σx1 + b1Σx1² + b2Σx1x2 + ⋯ + bqΣx1xq
Σx2y = b0Σx2 + b1Σx1x2 + b2Σx2² + ⋯ + bqΣx2xq
…
Σxqy = b0Σxq + b1Σx1xq + b2Σx2xq + ⋯ + bqΣxq²

Interpretation of Estimated Coefficients


— Slope (bi)
— Estimates that the average value of y changes by bi
units for each 1-unit increase in xi, given that all other
variables are held unchanged
— Intercept (b0)
— The estimated average value of y when all xi = 0



Multiple Coefficient of Determination


— Reports the proportion of total variation in y
explained by all x variables taken together

R² = SSR/SST = (sum of squares due to regression)/(total sum of squares)

Example
A distributor of frozen dessert pies wants
to evaluate factors thought to influence
demand

Data are collected for 15 weeks


Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7

Example
Dependent variable (y): Pie sales

Independent variable 1 (x1): Price ($)

Independent variable 2 (x2): Advertising ($100s)

Estimated (predicted) regression equation:

ŷ = b0 + b1x1 + b2x2


Estimates b0, b1, b2

Σy = nb0 + b1Σx1 + b2Σx2
Σx1y = b0Σx1 + b1Σx1² + b2Σx1x2
Σx2y = b0Σx2 + b1Σx1x2 + b2Σx2²

Example calculation
Σy = 5990        Σx1x2 = 345.46
Σx1 = 99.2       Σx1² = 675.26
Σx2 = 52.2       Σx2² = 185
Σx1y = 39152     Σy² = 2448500
Σx2y = 21087


Example calculation
5990 = 15b0 + 99.2b1 + 52.2b2
39152 = 99.2b0 + 675.26b1 + 345.46b2
21087 = 52.2b0 + 345.46b1 + 185b2

Solving the three equations gives:
b0 = 306.526
b1 = −24.975
b2 = 74.131

Example calculation
Estimated (predicted) regression equation:

ŷ = 306.526 − 24.975x1 + 74.131x2

Interpretation of b0, b1, b2?
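As a cross-check, a short sketch that solves the same least squares problem with NumPy, using the pie sales data from the table above; it should reproduce the coefficients obtained by hand:

```python
import numpy as np

# Pie sales data from the table above: price ($), advertising ($100s), sales
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                        3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300], dtype=float)

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones(len(sales)), price, advertising])

# Least squares solution of the normal equations
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)  # expected: approximately [306.526, -24.975, 74.131]
```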


The Multiple Regression Equation

Sales = 306.526 − 24.975(Price) + 74.131(Advertising)

where
Sales is in number of pies per week
Price is in $
Advertising is in $100s
— b1 = −24.975: sales will decrease, on average, by 24.975
pies per week for each $1 increase in selling price, net of
the effects of changes due to advertising
— b2 = 74.131: sales will increase, on average, by 74.131
pies per week for each $100 increase in advertising, net of
the effects of changes due to price

Using the Model to Make Predictions
Predict sales for a week in which the selling
price is $5.50 and advertising is $350:

Sales = 306.526 − 24.975(Price) + 74.131(Advertising)
      = 306.526 − 24.975(5.50) + 74.131(3.5)
      = 428.62

Note that advertising enters as 3.5 because it is measured
in $100s ($350 = 3.5 hundreds)

Predicted sales: 428.62 pies


Example calculation
SSR = Σ(ŷi − ȳ)² = 29459.96
SST = Σ(yi − ȳ)² = 56493.33

R² = 29459.96 / 56493.33 = 0.521

Interpretation? About 52.1% of the variation in pie sales is
explained by the variation in price and advertising
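A minimal continuation of the NumPy sketch above (X, b, and sales already defined) to reproduce this R² calculation:

```python
# Continuing the NumPy sketch above (X, b, and sales already defined)
y_hat = X @ b                               # fitted values
ssr = np.sum((y_hat - sales.mean()) ** 2)   # SSR, explained variation
sst = np.sum((sales - sales.mean()) ** 2)   # SST, total variation

print(f"R^2 = {ssr / sst:.3f}")             # expected: about 0.521
```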

Using Excel's Regression Tool to Develop the Estimated Multiple Regression Equation
Figure 7.11: Data Analysis Tools Box


Figure 7.12: Regression Dialog Box


Figure 7.13: Excel Regression Output for the Butler Trucking Company with Miles and Deliveries as Independent Variables


Figure 7.14: Graph of the Regression Equation for Multiple Regression Analysis with Two Independent Variables

§ Conditions Necessary for Valid Inference in the Least Squares Regression Model
§ Testing Individual Regression Parameters
§ Addressing Nonsignificant Independent Variables
§ Multicollinearity
§ Inference and Very Large Samples



Inference and Regression


Conditions Necessary for Valid Inference in the
Least Squares Regression Model
— For any given combination of values of the
independent variables x1, x2, . . . , xq, the population of
potential error terms ε is normally distributed with a
mean of 0 and a constant variance σ² [ε ~ N(0, σ²)]
— Practical implication: the regression estimates are
unbiased, have consistent precision, and tend to err by
small amounts

— The values of ε are statistically independent



Figure 7.15: Illustration of the Conditions for Valid Inference in Regression

Figure 7.16: Example of a Random Error Pattern in a Scatter Chart of Residuals and Predicted Values of the Dependent Variable


Figure 7.17: Examples of Diagnostic Scatter Charts of Residuals from Four Regressions

Figure 7.18: Excel Residual Plots for the Butler Trucking Company Multiple Regression


Inference and Regression


Testing Individual Regression Parameters:
— To determine whether statistically significant
relationships exist between the dependent variable y
and each of the independent variables x1, x2, . . . , xq
individually
— If βj = 0, there is no linear relationship between the
dependent variable y and the independent variable xj
— If βj ≠ 0, there is a linear relationship between y and
xj


Testing Individual Regression Parameters


— Hypotheses:
— H0: βj = 0 (no linear relationship)
— HA: βj ≠ 0 (linear relationship does exist
between xj and y)


Testing Individual Regression Parameters
— Test statistic: t = bj/sbj, where sbj is the estimated
standard error of bj; under H0 the statistic follows a t
distribution with n − q − 1 degrees of freedom

Testing Individual Regression Parameters
— α: level of significance
— Two-tailed rejection region: reject H0 if t < −tα/2 or
t > tα/2; otherwise do not reject H0


Confidence Interval Estimate


— Confidence interval can be used to test whether each of
the regression parameters β0, β1, β2, . . . , βq is equal to zero
— Confidence interval: An estimate of a population
parameter that provides an interval believed to contain
the value of the parameter at some level of confidence
— Confidence level: Indicates how frequently interval
estimates based on samples of the same size taken from
the same population using identical sampling techniques
will contain the true value of the parameter we are
estimating


Confidence Interval Estimate

Confidence interval for the population slope βj:

bj ± tα/2 sbj, where tα/2 is based on (n − q − 1) degrees of freedom
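A sketch of these t tests and interval estimates, assuming the statsmodels library and reusing the price, advertising, and sales arrays defined in the earlier sketch:

```python
import numpy as np
import statsmodels.api as sm

# price, advertising, sales as defined in the earlier NumPy sketch
X = sm.add_constant(np.column_stack([price, advertising]))
fit = sm.OLS(sales, X).fit()

print(fit.tvalues)        # t statistic for each of b0, b1, b2
print(fit.pvalues)        # p-values for H0: beta_j = 0
print(fit.conf_int(0.05)) # 95% confidence intervals b_j +/- t * s_bj
```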


Inference and Regression


Addressing Nonsignificant Independent Variables
— If practical experience dictates that the nonsignificant
independent variable has a relationship with the
dependent variable, the independent variable should be
left in the model
— If the model sufficiently explains the dependent variable
without the nonsignificant independent variable, then
consider rerunning the regression without the
nonsignificant independent variable
— The appropriate treatment of the inclusion or exclusion of
the y-intercept when b0 is not statistically significant may
require special consideration


Inference and Regression


Multicollinearity
— Multicollinearity refers to the correlation among the
independent variables in multiple regression analysis
— In t tests for the significance of individual parameters, the
difficulty caused by multicollinearity is that it is possible
to conclude that a parameter associated with one of the
multicollinear independent variables is not significantly
different from zero when the independent variable actually
has a strong relationship with the dependent variable
— This problem is avoided when there is little correlation
among the independent variables



Inference and Regression


Inference and Very Large Samples
— If the sample size is sufficiently large, virtually all
relationships between the independent variables and the
dependent variable will be statistically significant:
— Inference can then no longer be used to discriminate
between meaningful and specious relationships
— This is because the variability in potential values of an estimator bj
of a regression parameter βj depends on two factors:
(1) How closely the members of the population adhere to the
relationship between xj and y that is implied by βj
(2) The size of the sample on which the value of the estimator bj is
based

Inference and Regression


— Testing for an overall regression relationship:
— Use an F test based on the F probability
distribution
— If the F test leads us to reject the hypothesis that
the values of β1, β2, . . . , βq are all zero:
— Conclude that there is an overall regression
relationship
— Otherwise, conclude that there is no overall
regression relationship



Testing for an overall regression relationship


— Hypotheses:
— H0: β1 = β2 = … = βq = 0 (no linear relationship)
— HA: at least one βj ≠ 0 (at least one independent
variable affects y)

Testing for an overall regression relationship
— Test statistic: F = MSR/MSE = (SSR/q) / [SSE/(n − q − 1)]


Testing for an overall regression relationship

Compare the test statistic with the critical value Fα,q,n−q−1:
reject H0 if F > Fα,q,n−q−1; otherwise do not reject H0

§ Butler Trucking Company and Rush Hour


§ Interpreting the Parameters
§ More Complex Categorical Variables


Categorical Independent Variables


Butler Trucking Company and Rush Hour
— Dependent variable, y: Travel time
— Independent variables: miles traveled (x1) and number
of deliveries (x2)
— Categorical (dummy) variable: rush hour (x3)
— x3 = 1 if an assignment included travel on the
congested segment of highway during the afternoon
rush hour
— x3 = 0 if an assignment did not include travel on the
congested segment of highway during the afternoon
rush hour

Figure 7.25: Histograms of the Residuals for Driving Assignments That Included Travel on a Congested Segment of a Highway During the Afternoon Rush Hour and Residuals for Driving Assignments That Did Not


Figure 7.26: Excel Data and Output for Butler Trucking with Miles Traveled (x1), Number of Deliveries (x2), and the Highway Rush Hour Dummy Variable (x3) as the Independent Variables

Categorical Independent Variables


Interpreting the Parameters
— The model estimates that travel time increases by:
— 0.0672 hours for every increase of 1 mile traveled,
holding constant the number of deliveries and
whether the driving assignment route requires the
driver to travel on the congested segment of a highway
during the afternoon rush hour period
— 0.6735 hours for every delivery, holding constant the
number of miles traveled and whether the driving
assignment route requires the driver to travel on the
congested segment of a highway during the afternoon
rush hour period



Categorical Independent Variables


— The model estimates that travel time increases by:
— 0.9980 hours if the driving assignment route requires
the driver to travel on the congested segment of a
highway during the afternoon rush hour period,
holding constant the number of miles traveled and the
number of deliveries
— R2 = 0.8838 indicates that the regression model
explains approximately 88.4 percent of the variability
in travel time for the driving assignments in the
sample


Categorical Independent Variables


— The mean or expected value of travel time for driving
assignments given no rush hour driving:
E(y|x3 = 0) = β0 + β1x1 + β2x2 + β3(0) = β0 + β1x1 + β2x2
— The mean or expected value of travel time for driving
assignments given rush hour driving:
E(y|x3 = 1) = β0 + β1x1 + β2x2 + β3(1) = (β0 + β3) + β1x1 + β2x2


Categorical Independent Variables


— Using the estimated multiple regression equation:
ŷ = −0.3302 + 0.0672x1 + 0.6735x2 + 0.9980x3
— When x3 = 0:
ŷ = −0.3302 + 0.0672x1 + 0.6735x2 + 0.9980(0)
= −0.3302 + 0.0672x1 + 0.6735x2
— When x3 = 1:
ŷ = −0.3302 + 0.0672x1 + 0.6735x2 + 0.9980(1)
= 0.6678 + 0.0672x1 + 0.6735x2

Categorical Independent Variables


More Complex Categorical Variables
— If a categorical variable has k levels, k – 1 dummy variables
are required, with each dummy variable corresponding to
one of the levels of the categorical variable and coded as 0
or 1
— Example:
— Suppose a manufacturer of vending machines organized
the sales territories for a particular state into three regions:
A, B, and C
— The managers want to use regression analysis to help
predict the number of vending machines sold per week
— Suppose the managers believe sales region is one of the
important factors in predicting the number of units sold



Categorical Independent Variables


— Example (contd.):
— Sales region: categorical variable with three levels (A,
B, and C)
— Number of dummy variables = 3 – 1 = 2
— Each variable can be coded 0 or 1 as:
x1 = 1 if Sales Region B, 0 otherwise
x2 = 1 if Sales Region C, 0 otherwise
— The values of x1 and x2 are:
Region A: x1 = 0, x2 = 0
Region B: x1 = 1, x2 = 0
Region C: x1 = 0, x2 = 1

Categorical Independent Variables


— Example (contd.):
— The regression equation relating the expected value of the
number of units sold, E( y|x1, x2), to the dummy variables:
E( y|x1, x2) = β0 + β1x1 + β2 x2

— Observations corresponding to Sales Region A are coded
x1 = 0, x2 = 0
— Regression equation:
E(y|x1 = 0, x2 = 0) = E(y|Sales Region A) = β0 + β1(0) + β2(0) = β0
— Observations corresponding to Sales Region C are coded
x1 = 0, x2 = 1
— Regression equation:
E(y|x1 = 0, x2 = 1) = E(y|Sales Region C) = β0 + β1(0) + β2(1) = β0 + β2
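A sketch of this k − 1 dummy coding, assuming pandas and a hypothetical column of region labels:

```python
import pandas as pd

# Hypothetical sales-region labels for eight observations
df = pd.DataFrame({"region": ["A", "B", "C", "A", "C", "B", "A", "C"]})

# k = 3 levels -> k - 1 = 2 dummy variables; drop_first makes
# Region A the baseline (coded x1 = 0, x2 = 0)
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)  # columns region_B and region_C, coded 0/1
```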



§ Quadratic Regression Models


§ Piecewise Linear Regression Models
§ Interaction Between Independent Variables

Figure 7.27: Scatter Chart for the Reynolds Example



Figure 7.28: Excel Regression Output for the Reynolds Example

Figure 7.29: Scatter Chart of the Residuals and Predicted Values of the Dependent Variable for the Reynolds Simple Linear Regression


Modeling Nonlinear Relationships

Quadratic Regression Model
— An estimated regression equation of the form:
ŷ = b0 + b1x1 + b2x1²
— In the Reynolds example, to account for the
curvilinear relationship between months employed
and scales sold, we could include the square of the
number of months the salesperson has been
employed in the model as a second independent
variable
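A sketch of fitting such a quadratic model with NumPy; the data below are hypothetical, since the Reynolds values appear only in the figures:

```python
import numpy as np

# Hypothetical months-employed and scales-sold values
months = np.array([5, 20, 40, 60, 80, 100, 110], dtype=float)
scales = np.array([100, 230, 300, 350, 370, 340, 300], dtype=float)

# Design matrix: intercept, x1, x1 squared
X = np.column_stack([np.ones_like(months), months, months ** 2])
b, *_ = np.linalg.lstsq(X, scales, rcond=None)

b0, b1, b2 = b
print(f"yhat = {b0:.3f} + {b1:.3f} x1 + {b2:.5f} x1^2")
```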


Figure 7.30: Relationships That Can Be Fit with a Quadratic Regression Model


Figure 7.31: Excel Data for the Reynolds Quadratic Regression Model

Figure 7.32: Excel Output for the Reynolds Quadratic Regression Model


Figure 7.33: Scatter Chart of the Residuals and Predicted Values of the Dependent Variable for the Reynolds Quadratic Regression Model

Modeling Nonlinear Relationships


Piecewise Linear Regression Models
— For the Reynolds data, as an alternative to a quadratic
regression model
— Recognize that below some value of Months Employed, the
relationship between Months Employed and Sales appears
to be positive and linear
— Whereas the relationship between Months Employed and
Sales appears to be negative and linear for the remaining
observations
— Piecewise linear regression model: This model will
allow us to fit these relationships as two linear regressions
that are joined at the value of Months at which the
relationship between Months Employed and Sales changes



Modeling Nonlinear Relationships


— Knot: The value of the independent variable at which
the relationship between dependent variable and
independent variable changes
— For the Reynolds data, knot is the value of the
independent variable Months Employed at which the
relationship between Months Employed and Sales
changes


Figure 7.34: Possible Position of Knot x(k)



Modeling Nonlinear Relationships

— Define a dummy variable xk based on the knot x(k):
xk = 0 if x1 ≤ x(k)
xk = 1 if x1 > x(k)
— x1 = Months
— x(k) = value of the knot (90 months for the Reynolds
example)
— xk = the knot dummy variable

— Fit the following regression model:
ŷ = b0 + b1x1 + b2(x1 − x(k))xk

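A sketch of this piecewise model, assuming the 90-month knot and hypothetical data:

```python
import numpy as np

# Hypothetical data; knot at 90 months as in the Reynolds example
months = np.array([10, 30, 50, 70, 85, 95, 105, 115], dtype=float)
sales = np.array([150, 240, 310, 360, 380, 370, 350, 320], dtype=float)

knot = 90.0
xk = (months > knot).astype(float)        # knot dummy variable

# Columns: intercept, x1, (x1 - knot) * xk
X = np.column_stack([np.ones_like(months), months, (months - knot) * xk])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)
```

Beyond the knot, the slope changes from b1 to b1 + b2, so the two linear pieces join at x1 = x(k).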

Modeling Nonlinear Relationships


Interaction Between Independent Variables
— Interaction: This occurs when the relationship
between the dependent variable and one independent
variable is different at various values of a second
independent variable
— The model is given by:
y = β0 + β1x1 + β2x2 + β3x1x2 + ε
— The estimated model is given by:
ŷ = b0 + b1x1 + b2x2 + b3x1x2


§ Variable Selection Procedures


§ Overfitting

Model Fitting
Variable Selection Procedures
— Special procedures are sometimes employed to select
the independent variables to include in the regression
model
— Iterative procedures (at each step of the procedure a
single independent variable is added or removed and the
new model is evaluated):
— Stepwise regression
— Forward selection procedure
— Sequential replacement procedure
— Best-subsets procedure: Evaluates regression models
involving different subsets of the independent variables


Model Fitting
— Variable Selection Procedures
— Backward elimination
— Forward selection
— Stepwise selection
— Best subsets
— Forward selection procedure:
— The analyst establishes a criterion for allowing independent
variables to enter the model
— Example: The independent variable j with the smallest p-
value associated with the test of the hypothesis βj = 0,
subject to some predetermined maximum p-value for which
a potential independent variable will be allowed to enter the
model


Model Fitting
— Forward selection procedure (contd.):
— First step: The independent variable that best satisfies
the criterion is added to the model
— Each subsequent step: The remaining independent
variables not in the current model are evaluated, and
the one that best satisfies the criterion is added to the
model
— Procedure stops: When there are no independent
variables not currently in the model that meet the
criterion for being added to the regression model
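A minimal sketch of this forward selection rule, assuming statsmodels and a p-value entry threshold of 0.05 (both the library and the threshold are illustrative choices):

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(columns, y, max_p=0.05):
    """Greedy forward selection sketch: at each step, add the candidate
    variable with the smallest p-value, if it is below max_p."""
    selected = []                   # names of variables in the model
    remaining = set(columns)        # names of candidate variables
    while remaining:
        pvals = {}
        for name in remaining:
            X = sm.add_constant(np.column_stack(
                [columns[n] for n in selected] + [columns[name]]))
            fit = sm.OLS(y, X).fit()
            pvals[name] = fit.pvalues[-1]  # p-value of the newly added variable
        best = min(pvals, key=pvals.get)
        if pvals[best] > max_p:
            break                   # no candidate meets the entry criterion
        selected.append(best)
        remaining.remove(best)
    return selected

# Example with the pie sales arrays from the earlier sketches:
# forward_selection({"price": price, "advertising": advertising}, sales)
```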



Model Fitting
— Backward selection procedure:
— The analyst establishes a criterion for allowing
independent variables to remain in the model.
— Example: The independent variable with the largest p-value
associated with the test of the hypothesis βj = 0, subject to
some predetermined maximum p-value for which an
independent variable will be allowed to remain in the model.


Model Fitting
— Backward selection procedure (contd.):
— First step: The independent variable that violates this
criterion to the greatest degree is removed from the model
— Each subsequent step: The independent variables in the
current model are evaluated, and the one that violates this
criterion to the greatest degree is removed from the model
— Procedure stops: When there are no independent variables
currently in the model that violate the criterion for remaining
in the regression model



Model Fitting
— Stepwise procedure:
— The analyst establishes both a criterion for allowing
independent variables to enter the model and a
criterion for allowing independent variables to remain
in the model
— In the first step of the procedure, the independent
variable that best satisfies the criterion for entering
the model is added
— In each subsequent step, first the remaining independent
variables not in the current model are evaluated, and the
one that best satisfies the criterion for entering is added
to the model


Model Fitting
— Stepwise procedure (contd.):
— Then the independent variables in the current model
are evaluated, and the one that violates the criterion
for remaining in the model to the greatest degree is
removed
— The procedure stops when no independent variables
not currently in the model meet the criterion for
being added to the regression model, and no
independent variables currently in the model violate
the criterion for remaining in the regression model



Model Fitting
— Best-subsets procedure:
— Simple linear regressions for each of the independent
variables under consideration are generated, and then
the multiple regressions with all combinations of two
independent variables under consideration are
generated, and so on
— Once a regression has been generated for every
possible subset of the independent variables under
consideration, an output that provides some criteria
for selecting regression models is produced for all
models generated
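A sketch of this exhaustive enumeration, assuming adjusted R² as the comparison criterion (the criterion itself is an illustrative choice):

```python
from itertools import combinations

import numpy as np
import statsmodels.api as sm

def best_subsets(columns, y):
    """Fit a regression for every non-empty subset of candidate
    variables and rank the subsets by adjusted R-squared."""
    names = list(columns)
    results = []
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            X = sm.add_constant(np.column_stack([columns[n] for n in subset]))
            fit = sm.OLS(y, X).fit()
            results.append((subset, fit.rsquared_adj))
    # best model (by this criterion) first
    return sorted(results, key=lambda r: r[1], reverse=True)
```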

Model Fitting
Overfitting
— Results from creating an overly complex model to explain
idiosyncrasies in the sample data
— Results from the use of complex functional forms or
independent variables that do not have meaningful
relationships with the dependent variable
— If a model is overfit to the sample data, it will perform
better on the sample data used to fit the model than it will
on other data from the population
— Thus, an overfit model can be misleading about its
predictive capability and its interpretation



Model Fitting
— How does one avoid overfitting a model?
— Use only independent variables that you expect to have
real and meaningful relationships with the dependent
variable
— Use complex models, such as quadratic models and
piecewise linear regression models, only when you have
a reasonable expectation that such complexity provides
a more accurate depiction of what you are modeling
— Do not let software dictate your model; use iterative
modeling procedures, such as the stepwise and best-
subsets procedures, only for guidance and not to
generate your final model

Model Fitting
— How does one avoid overfitting a model? (contd.)
— If you have access to a sufficient quantity of data, assess
your model on data other than the sample data that
were used to generate the model (this is referred to as
cross-validation)
— It is recommended to divide the original sample data
into training and validation sets
— Training set: The data set used to build the candidate
models that appear to make practical sense
— Validation set: The set of data used to compare model
performances and ultimately pick a model for
predicting values of the dependent variable


Model Fitting
— Holdout method: The sample data are randomly divided
into mutually exclusive and collectively exhaustive training
and validation sets
— k-fold cross-validation: The sample data are randomly
divided into k equal-sized, mutually exclusive, and
collectively exhaustive subsets called folds, and k iterations
are executed
— Leave-one-out cross-validation: For a sample of n
observations, an iteration consists of estimating the model
on n – 1 observations and evaluating the model on the
single observation that was omitted from the training data
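A sketch of the holdout and k-fold procedures, assuming scikit-learn and hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                        # hypothetical predictors
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=40)

# Holdout method: one random split into training and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)
holdout_r2 = LinearRegression().fit(X_tr, y_tr).score(X_val, y_val)

# k-fold cross-validation: k iterations, each fold held out once;
# leave-one-out is the special case k = n (sklearn's LeaveOneOut)
fold_r2 = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                random_state=0).split(X):
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_r2.append(fit.score(X[val_idx], y[val_idx]))

print(holdout_r2, np.mean(fold_r2))
```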
