
Points to highlight
— The Simple Linear Regression Model
— Least Square Method
— Assessing the Fit of the Simple Linear Regression
Model
— The Multiple Regression Model
— Inference and Regression
— Categorical Independent Variables
— Modeling Nonlinear Relationships
— Model Fitting


Introduction
— Managerial decisions are often based on the
relationship between two or more variables
— Example: After considering the relationship between
advertising expenditures and sales, a marketing manager
might attempt to predict sales for a given level of
advertising expenditures
— Sometimes a manager will rely on intuition to judge
how two variables are related
— If data can be obtained, a statistical procedure called
regression analysis can be used to develop an
equation showing how the variables are related

Introduction
— Regression analysis
- Describe a relationship between one variable and
the other variables in mathematical terms.
- Predict the value of a dependent variable based on
the value of at least one independent variable
- Explain the impact of changes in an independent
variable on the dependent variable


Introduction
— Dependent variable or response: Variable being
predicted (Variable we wish to explain)
— Independent variables or predictor variables: Variables
being used to predict the value of the dependent variable
(Variable used to explain the dependent variable)
— Linear regression: A regression analysis involving at least
one independent variable and one dependent variable
— In statistical notation:
y = dependent variable
x = independent variable

Examples
Example 1: The product manager of a particular
brand of children’s breakfast cereal would like to
predict the demand for cereal during the next
year. To use regression analysis, she and her staff
list the following variables as likely to affect sales:
•Price of the product
•Number of children 5 to 12 years of age (the target market)
•Price of competitors’ products
•Effectiveness of advertising
•Annual sales this year
•Annual sales in previous year


Examples
Example 2: A real estate agent wants to predict
the selling price of houses more accurately. She
believes that the following variables affect the
price of a house:
• Size of the house (number of square feet)
• Number of bedrooms
• Frontage of the lot
• Condition
• Location

Introduction
— Simple linear regression: A regression analysis for
only one independent variable, x, and one dependent
variable, y
— Multiple linear regression: Regression analysis
involving two or more independent variables


§ Regression Model
§ Estimated Regression Equation

The Simple Linear Regression Model


Regression Model
— The equation that describes how y is related to x and
an error term (ε)
— Simple Linear Regression Model:
y = β0 + β1x + ε
— Parameters: The characteristics of the population, β0
and β1
— Random variable: Error term, ε
— The error term accounts for the variability in y that
cannot be explained by the linear relationship between
x and y


The Simple Linear Regression Model


— The parameter values are usually not known and must
be estimated using sample data
— Sample statistics (denoted b0 and b1) are computed as
estimates of the population parameters β0 and β1

Estimated Regression Equation


— The equation obtained by substituting the values of
the sample statistics b0 and b1 for β0 and β1 in the
regression equation:
ŷ = b0 + b1x



Figure 7.1: The Estimation Process in Simple Linear Regression

Figure 7.2: Possible Regression Lines in Simple Linear Regression


§ Least Squares Estimates of the Regression Parameters


§ Using Excel’s Chart Tools to Compute the Estimated
Regression Equation

Least Squares Method


— Least squares method: A procedure for using
sample data to find the estimated regression equation
— Determine the values of b0 and b1
— Interpretation of b0 and b1:
— The slope b1 is the estimated change in the mean of the
dependent variable y that is associated with a one unit
increase in the independent variable x
— The y-intercept b0 is the estimated value of the
dependent variable y when the independent variable x
is equal to 0



Table 7.1: Miles Traveled and Travel Time for 10 Butler Trucking Company Driving Assignments

Figure 7.3: Scatter Chart of Miles Traveled and Travel Time for Sample of 10 Butler Trucking Company Driving Assignments



Least Squares Method

— The least squares method minimizes the sum of squared residuals:
min Σ(yi − ŷi)² = min Σ(yi − b0 − b1xi)²


The Least Squares Equation


— The formulas for b1 and b0 are:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   (sums over i = 1, …, n)

b0 = ȳ − b1x̄

where
xi: value of the independent variable for the ith observation
yi: value of the dependent variable for the ith observation
x̄: mean value for the independent variable
ȳ: mean value for the dependent variable
n: total number of observations
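Assuming NumPy and a small made-up data set (the Butler Trucking values live in Table 7.1, which is not reproduced here), a minimal sketch of these formulas:

```python
import numpy as np

# Hypothetical sample data (not the Butler Trucking values)
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # independent variable
y = np.array([1.5, 2.4, 3.1, 4.2, 4.8])       # dependent variable

x_bar, y_bar = x.mean(), y.mean()

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(f"yhat = {b0:.4f} + {b1:.4f} x")
```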



Least Squares Method


— For the Butler Trucking sample, the estimated regression
equation is ŷ = 1.2739 + 0.0678x
— Interpretation of b1: If the length of a driving
assignment were 1 unit (1 mile) longer, the mean travel
time for that driving assignment would be 0.0678
units (0.0678 hours, or approximately 4 minutes)
longer
— Interpretation of b0: If the driving distance for a
driving assignment were 0 units (0 miles), the mean
travel time would be 1.2739 units (1.2739 hours, or
approximately 76 minutes)

Least Squares Method


— Experimental region: The range of values of the
independent variables in the data used to estimate the
model
— The regression model is valid only over this region
— Extrapolation: Prediction of the value of the
dependent variable outside the experimental region
— Extrapolation is risky, because the fitted linear
relationship may not hold outside the experimental region



Table 7.2: Predicted Travel Time and Residuals for 10 Butler Trucking Company Driving Assignments


Figure 7.4: Scatter Chart of Miles Traveled and Travel Time for Butler Trucking Company Driving Assignments with Regression Line Superimposed

Figure 7.5: A Geometric Interpretation of the Least Squares Method


Least Squares Method


Using Excel’s Chart Tools to Compute the Estimated
Regression Equation
— After constructing a scatter chart with Excel’s chart tools:
1. Right-click on any data point and select Add
Trendline…
2. In the Format Trendline task pane, in the Trendline
Options area:
— Select Linear
— Select Display Equation on chart

Figure 7.6: Scatter Chart and Estimated Regression Line for Butler Trucking Company


§ The Sums of Squares


§ The Coefficient of Determination
§ Using Excel’s Chart Tools to Compute the Coefficient of
Determination

Assessing the Fit of the Simple Linear Regression Model
The Sums of Squares
— Sum of squares due to error (SSE): The value of SSE is
a measure of the error in using the estimated
regression equation to predict the values of the
dependent variable in the sample:
SSE = Σ(yi − ŷi)²
— From Table 7.2, SSE = 8.0288


Figure 7.7: The Sample Mean as a Predictor of Travel Time for Butler Trucking Company



Table 7.3: Calculations for the Sum of Squares Total for the Butler Trucking Simple Linear Regression

Figure 7.8: Deviations About the Estimated Regression Line and the Line ŷ = ȳ for the Third Butler Trucking Company Driving Assignment



Explained and Unexplained Variation

— SST = total sum of squares
— Measures the variation of the yi values around their
mean ȳ
— SSE = sum of squares due to error
— Variation attributable to factors other than the
relationship between x and y
— SSR = sum of squares due to regression
— Explained variation attributable to the relationship
between x and y
— The three measures are related by SST = SSR + SSE


Assessing the Fit of the Simple Linear Regression Model
— Coefficient of determination: r² = SSR/SST, the proportion
of the variation in y explained by the estimated regression
equation

Figure 7.9: Scatter Chart and Estimated Regression Line with Coefficient of Determination r² for Butler Trucking Company
Interpretation of r²?


Examples of Approximate r² Values

— r² = 1: Perfect linear relationship between x and y;
100% of the variation in y is explained by the variation in x

Examples of Approximate r² Values

— 0 < r² < 1: Weaker linear relationship between x and y;
some but not all of the variation in y is explained by the
variation in x


Examples of Approximate r² Values

— r² = 0: No linear relationship between x and y; the value
of y does not depend linearly on x, and none of the
variation in y is explained by the variation in x

§ Regression Model
§ Estimated Multiple Regression Equation
§ Least Squares Method and Multiple Regression
§ Butler Trucking Company and Multiple Regression
§ Using Excel’s Regression Tool to Develop the Estimated
Multiple Regression Equation


The Multiple Regression Model


Regression Model
— Multiple regression model
y = β0 + β1x1 + β2x2 + ∙ ∙ ∙ + βqxq + ε
— y = dependent variable
— x1, x2, . . . , xq = independent variables
— β0, β1, β2, . . . , βq = parameters
— ε = error term (accounts for the variability in y that
cannot be explained by the linear effect of the q
independent variables)


The Multiple Regression Model


— Interpretation of slope coefficient βj: Represents the
change in the mean value of the dependent variable y
that corresponds to a one unit increase in the
independent variable xj, holding the values of all other
independent variables in the model constant
— The multiple regression equation that describes how
the mean value of y is related to x1, x2, . . . , xq:
E( y | x1, x2, . . . , xq) = β0 + β1x1 + β2x2 + ∙ ∙ ∙ + βqxq



The Multiple Regression Model
— Estimated multiple regression equation:
ŷ = b0 + b1x1 + b2x2 + ∙ ∙ ∙ + bqxq

The Multiple Regression Model

Least Squares Method and Multiple Regression
Figure 7.10: The Estimation Process for Multiple Regression


Estimates b0, b1, b2, …, bq
The normal equations:

Σy = nb0 + b1Σx1 + b2Σx2 + ⋯ + bqΣxq
Σx1y = b0Σx1 + b1Σx1² + b2Σx1x2 + ⋯ + bqΣx1xq
Σx2y = b0Σx2 + b1Σx1x2 + b2Σx2² + ⋯ + bqΣx2xq
…
Σxqy = b0Σxq + b1Σx1xq + b2Σx2xq + ⋯ + bqΣxq²

Interpretation of Estimated Coefficients


— Slope (bi)
— Estimates that the average value of y changes by bi
units for each 1-unit increase in xi, given that all other
variables are held unchanged
— Intercept (b0)
— The estimated average value of y when all xi = 0



Multiple Coefficient of Determination


— Reports the proportion of total variation in y
explained by all x variables taken together

R² = SSR/SST = (sum of squares due to regression)/(total sum of squares)

Example
A distributor of frozen dessert pies wants
to evaluate factors thought to influence
demand

Data are collected for 15 weeks


Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7

Example
Dependent variable (y): Pie sales

Independent variable 1 (x1): Price ($)

Independent variable 2 (x2): Advertising ($100s)

Estimated (predicted) regression equation:

ŷ = b0 + b1x1 + b2x2


Estimates b0, b1, b2

Σy = nb0 + b1Σx1 + b2Σx2
Σx1y = b0Σx1 + b1Σx1² + b2Σx1x2
Σx2y = b0Σx2 + b1Σx1x2 + b2Σx2²

Example calculation
Σy = 5990        Σx1x2 = 345.46
Σx1 = 99.2       Σx1² = 675.26
Σx2 = 52.2       Σx2² = 185
Σx1y = 39152     Σy² = 2448500
Σx2y = 21087


Example calculation
5990 = 15b0 + 99.2b1 + 52.2b2
39152 = 99.2b0 + 675.26b1 + 345.46b2
21087 = 52.2b0 + 345.46b1 + 185b2

Solving the three equations gives:
b0 = 306.526
b1 = −24.975
b2 = 74.131

Example calculation
Estimated (predicted) regression equation:

ŷ = 306.526 − 24.975x1 + 74.131x2

Interpretation of b0, b1, b2?
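As a cross-check, a short sketch that solves the same least squares problem with NumPy, using the pie sales data from the table above; it should reproduce the coefficients obtained by hand:

```python
import numpy as np

# Pie sales data from the table above: price ($), advertising ($100s), sales
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                        3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300], dtype=float)

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones(len(sales)), price, advertising])

# Least squares solution of the normal equations
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)  # expected: approximately [306.526, -24.975, 74.131]
```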


The Multiple Regression Equation

Sales = 306.526 − 24.975(Price) + 74.131(Advertising)

where
Sales is in number of pies per week
Price is in $
Advertising is in $100s
— b1 = −24.975: sales will decrease, on average, by 24.975
pies per week for each $1 increase in selling price, net of
the effects of changes due to advertising
— b2 = 74.131: sales will increase, on average, by 74.131
pies per week for each $100 increase in advertising, net of
the effects of changes due to price

Using the Model to Make Predictions
Predict sales for a week in which the selling
price is $5.50 and advertising is $350:

Sales = 306.526 − 24.975(Price) + 74.131(Advertising)
      = 306.526 − 24.975(5.50) + 74.131(3.5)
      = 428.62

Note that advertising enters as 3.5 because it is measured
in $100s ($350 = 3.5 hundreds)

Predicted sales: 428.62 pies


Example calculation
SSR = Σ(ŷi − ȳ)² = 29459.96
SST = Σ(yi − ȳ)² = 56493.33

R² = 29459.96 / 56493.33 = 0.521

Interpretation? About 52.1% of the variation in pie sales is
explained by the variation in price and advertising
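A minimal continuation of the NumPy sketch above (X, b, and sales already defined) to reproduce this R² calculation:

```python
# Continuing the NumPy sketch above (X, b, and sales already defined)
y_hat = X @ b                               # fitted values
ssr = np.sum((y_hat - sales.mean()) ** 2)   # SSR, explained variation
sst = np.sum((sales - sales.mean()) ** 2)   # SST, total variation

print(f"R^2 = {ssr / sst:.3f}")             # expected: about 0.521
```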

Using Excel's Regression Tool to Develop the Estimated Multiple Regression Equation
Figure 7.11: Data Analysis Tools Box


Figure 7.12: Regression Dialog Box


Figure 7.13: Excel Regression Output for the Butler Trucking Company with Miles and Deliveries as Independent Variables


Figure 7.14: Graph of the Regression Equation for Multiple Regression Analysis with Two Independent Variables

§ Conditions Necessary for Valid Inference in the Least Squares Regression Model
§ Testing Individual Regression Parameters
§ Addressing Nonsignificant Independent Variables
§ Multicollinearity
§ Inference and Very Large Samples



Inference and Regression


Conditions Necessary for Valid Inference in the
Least Squares Regression Model
— For any given combination of values of the
independent variables x1, x2, . . . , xq, the population of
potential error terms ε is normally distributed with a
mean of 0 and a constant variance σ² [ε ~ N(0, σ²)]
— Practical implication: the regression estimates are
unbiased, have consistent precision, and tend to err by
small amounts

— The values of ε are statistically independent



Figure 7.15: Illustration of the Conditions for Valid Inference in Regression

Figure 7.16: Example of a Random Error Pattern in a Scatter Chart of Residuals and Predicted Values of the Dependent Variable


Figure 7.17: Examples of Diagnostic Scatter Charts of Residuals from Four Regressions

Figure 7.18: Excel Residual Plots for the Butler Trucking Company Multiple Regression


Inference and Regression


Testing Individual Regression Parameters:
— To determine whether statistically significant
relationships exist between the dependent variable y
and each of the independent variables x1, x2, . . . , xq
individually
— If βj = 0, there is no linear relationship between the
dependent variable y and the independent variable xj
— If βj ≠ 0, there is a linear relationship between y and
xj


Testing Individual Regression Parameters


— Hypotheses:
— H0: βj = 0 (no linear relationship)
— HA: βj ≠ 0 (linear relationship does exist
between xj and y)


Testing Individual Regression Parameters
— Test statistic: t = bj/sbj, where sbj is the estimated
standard error of bj; under H0 the statistic follows a t
distribution with n − q − 1 degrees of freedom

Testing Individual Regression Parameters
— α: level of significance
— Two-tailed rejection region: reject H0 if t < −tα/2 or
t > tα/2; otherwise do not reject H0


Confidence Interval Estimate


— Confidence interval can be used to test whether each of
the regression parameters β0, β1, β2, . . . , βq is equal to zero
— Confidence interval: An estimate of a population
parameter that provides an interval believed to contain
the value of the parameter at some level of confidence
— Confidence level: Indicates how frequently interval
estimates based on samples of the same size taken from
the same population using identical sampling techniques
will contain the true value of the parameter we are
estimating


Confidence Interval Estimate

Confidence interval for the population slope βj:

bj ± tα/2 sbj, where tα/2 is based on (n − q − 1) degrees of freedom
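A sketch of these t tests and interval estimates, assuming the statsmodels library and reusing the price, advertising, and sales arrays defined in the earlier sketch:

```python
import numpy as np
import statsmodels.api as sm

# price, advertising, sales as defined in the earlier NumPy sketch
X = sm.add_constant(np.column_stack([price, advertising]))
fit = sm.OLS(sales, X).fit()

print(fit.tvalues)        # t statistic for each of b0, b1, b2
print(fit.pvalues)        # p-values for H0: beta_j = 0
print(fit.conf_int(0.05)) # 95% confidence intervals b_j +/- t * s_bj
```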


Inference and Regression


Addressing Nonsignificant Independent Variables
— If practical experience dictates that the nonsignificant
independent variable has a relationship with the
dependent variable, the independent variable should be
left in the model
— If the model sufficiently explains the dependent variable
without the nonsignificant independent variable, then
consider rerunning the regression without the
nonsignificant independent variable
— The appropriate treatment of the inclusion or exclusion of
the y-intercept when b0 is not statistically significant may
require special consideration


Inference and Regression


Multicollinearity
— Multicollinearity refers to the correlation among the
independent variables in multiple regression analysis
— In t tests for the significance of individual parameters, the
difficulty caused by multicollinearity is that it is possible
to conclude that a parameter associated with one of the
multicollinear independent variables is not significantly
different from zero when the independent variable actually
has a strong relationship with the dependent variable
— This problem is avoided when there is little correlation
among the independent variables



Inference and Regression


Inference and Very Large Samples
— If the sample size is sufficiently large, virtually all
relationships between the independent variables and the
dependent variable will be statistically significant:
— Inference can then no longer be used to discriminate
between meaningful and specious relationships
— This is because the variability in potential values of an estimator bj
of a regression parameter βj depends on two factors:
(1) How closely the members of the population adhere to the
relationship between xj and y that is implied by βj
(2) The size of the sample on which the value of the estimator bj is
based

Inference and Regression


— Testing for an overall regression relationship:
— Use an F test based on the F probability
distribution
— If the F test leads us to reject the hypothesis that
the values of β1, β2, . . . , βq are all zero:
— Conclude that there is an overall regression
relationship
— Otherwise, conclude that there is no overall
regression relationship



Testing for an overall regression relationship


— Hypotheses:
— H0: β1 = β2 = … = βq = 0 (no linear relationship)
— HA: at least one βj ≠ 0 (at least one independent
variable affects y)

Testing for an overall regression relationship
— Test statistic: F = MSR/MSE = (SSR/q) / [SSE/(n − q − 1)]


Testing for an overall regression relationship

Compare the test statistic with the critical value Fα,q,n−q−1:
reject H0 if F > Fα,q,n−q−1; otherwise do not reject H0

§ Butler Trucking Company and Rush Hour


§ Interpreting the Parameters
§ More Complex Categorical Variables


Categorical Independent Variables


Butler Trucking Company and Rush Hour
— Dependent variable, y: Travel time
— Independent variables: miles traveled (x1) and number
of deliveries (x2)
— Categorical (dummy) variable: rush hour (x3)
— x3 = 1 if an assignment included travel on the
congested segment of highway during the afternoon
rush hour
— x3 = 0 if an assignment did not include travel on the
congested segment of highway during the afternoon
rush hour

Figure 7.25: Histograms of the Residuals for Driving Assignments That Included Travel on a Congested Segment of a Highway During the Afternoon Rush Hour and Residuals for Driving Assignments That Did Not


Figure 7.26: Excel Data and Output for Butler Trucking with Miles Traveled (x1), Number of Deliveries (x2), and the Highway Rush Hour Dummy Variable (x3) as the Independent Variables

Categorical Independent Variables


Interpreting the Parameters
— The model estimates that travel time increases by:
— 0.0672 hours for every increase of 1 mile traveled,
holding constant the number of deliveries and
whether the driving assignment route requires the
driver to travel on the congested segment of a highway
during the afternoon rush hour period
— 0.6735 hours for every delivery, holding constant the
number of miles traveled and whether the driving
assignment route requires the driver to travel on the
congested segment of a highway during the afternoon
rush hour period



Categorical Independent Variables


— The model estimates that travel time increases by:
— 0.9980 hours if the driving assignment route requires
the driver to travel on the congested segment of a
highway during the afternoon rush hour period,
holding constant the number of miles traveled and the
number of deliveries
— R2 = 0.8838 indicates that the regression model
explains approximately 88.4 percent of the variability
in travel time for the driving assignments in the
sample


Categorical Independent Variables


— The mean or expected value of travel time for driving
assignments given no rush hour driving:
E(y|x3 = 0) = β0 + β1x1 + β2x2 + β3(0) = β0 + β1x1 + β2x2
— The mean or expected value of travel time for driving
assignments given rush hour driving:
E(y|x3 = 1) = β0 + β1x1 + β2x2 + β3(1) = (β0 + β3) + β1x1 + β2x2


Categorical Independent Variables


— Using the estimated multiple regression equation:
ŷ = −0.3302 + 0.0672x1 + 0.6735x2 + 0.9980x3
— When x3 = 0:
ŷ = −0.3302 + 0.0672x1 + 0.6735x2 + 0.9980(0)
= −0.3302 + 0.0672x1 + 0.6735x2
— When x3 = 1:
ŷ = −0.3302 + 0.0672x1 + 0.6735x2 + 0.9980(1)
= 0.6678 + 0.0672x1 + 0.6735x2

Categorical Independent Variables


More Complex Categorical Variables
— If a categorical variable has k levels, k – 1 dummy variables
are required, with each dummy variable corresponding to
one of the levels of the categorical variable and coded as 0
or 1
— Example:
— Suppose a manufacturer of vending machines organized
the sales territories for a particular state into three regions:
A, B, and C
— The managers want to use regression analysis to help
predict the number of vending machines sold per week
— Suppose the managers believe sales region is one of the
important factors in predicting the number of units sold



Categorical Independent Variables


— Example (contd.):
— Sales region: categorical variable with three levels (A,
B, and C)
— Number of dummy variables = 3 – 1 = 2
— Each variable can be coded 0 or 1 as:
x1 = 1 if Sales Region B, 0 otherwise
x2 = 1 if Sales Region C, 0 otherwise
— The values of x1 and x2 are:
Region A: x1 = 0, x2 = 0
Region B: x1 = 1, x2 = 0
Region C: x1 = 0, x2 = 1

Categorical Independent Variables


— Example (contd.):
— The regression equation relating the expected value of the
number of units sold, E( y|x1, x2), to the dummy variables:
E( y|x1, x2) = β0 + β1x1 + β2 x2

— Observations corresponding to Sales Region A are coded
x1 = 0, x2 = 0
— Regression equation:
E(y|x1 = 0, x2 = 0) = E(y|Sales Region A) = β0 + β1(0) + β2(0) = β0
— Observations corresponding to Sales Region C are coded
x1 = 0, x2 = 1
— Regression equation:
E(y|x1 = 0, x2 = 1) = E(y|Sales Region C) = β0 + β1(0) + β2(1) = β0 + β2
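A sketch of this k − 1 dummy coding, assuming pandas and a hypothetical column of region labels:

```python
import pandas as pd

# Hypothetical sales-region labels for eight observations
df = pd.DataFrame({"region": ["A", "B", "C", "A", "C", "B", "A", "C"]})

# k = 3 levels -> k - 1 = 2 dummy variables; drop_first makes
# Region A the baseline (coded x1 = 0, x2 = 0)
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)  # columns region_B and region_C, coded 0/1
```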



§ Quadratic Regression Models


§ Piecewise Linear Regression Models
§ Interaction Between Independent Variables

Figure 7.27: Scatter Chart for the Reynolds Example



Figure 7.28: Excel Regression Output for the Reynolds Example

Figure 7.29: Scatter Chart of the Residuals and Predicted Values of the Dependent Variable for the Reynolds Simple Linear Regression


Modeling Nonlinear Relationships

Quadratic Regression Model
— An estimated regression equation of the form:
ŷ = b0 + b1x1 + b2x1²
— In the Reynolds example, to account for the
curvilinear relationship between months employed
and scales sold, we could include the square of the
number of months the salesperson has been
employed in the model as a second independent
variable
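A sketch of fitting such a quadratic model with NumPy; the data below are hypothetical, since the Reynolds values appear only in the figures:

```python
import numpy as np

# Hypothetical months-employed and scales-sold values
months = np.array([5, 20, 40, 60, 80, 100, 110], dtype=float)
scales = np.array([100, 230, 300, 350, 370, 340, 300], dtype=float)

# Design matrix: intercept, x1, x1 squared
X = np.column_stack([np.ones_like(months), months, months ** 2])
b, *_ = np.linalg.lstsq(X, scales, rcond=None)

b0, b1, b2 = b
print(f"yhat = {b0:.3f} + {b1:.3f} x1 + {b2:.5f} x1^2")
```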


Figure 7.30: Relationships That Can Be Fit with a Quadratic Regression Model


Figure 7.31: Excel Data for the Reynolds Quadratic Regression Model

Figure 7.32: Excel Output for the Reynolds Quadratic Regression Model


Figure 7.33: Scatter Chart of the Residuals and Predicted Values of the Dependent Variable for the Reynolds Quadratic Regression Model

Modeling Nonlinear Relationships


Piecewise Linear Regression Models
— For the Reynolds data, as an alternative to a quadratic
regression model
— Recognize that below some value of Months Employed, the
relationship between Months Employed and Sales appears
to be positive and linear
— Whereas the relationship between Months Employed and
Sales appears to be negative and linear for the remaining
observations
— Piecewise linear regression model: This model will
allow us to fit these relationships as two linear regressions
that are joined at the value of Months at which the
relationship between Months Employed and Sales changes



Modeling Nonlinear Relationships


— Knot: The value of the independent variable at which
the relationship between dependent variable and
independent variable changes
— For the Reynolds data, knot is the value of the
independent variable Months Employed at which the
relationship between Months Employed and Sales
changes


Figure 7.34: Possible Position of Knot x(k)



Modeling Nonlinear Relationships

— Define a dummy variable xk based on the knot x(k):
xk = 0 if x1 ≤ x(k)
xk = 1 if x1 > x(k)
— x1 = Months
— x(k) = value of the knot (90 months for the Reynolds
example)
— xk = the knot dummy variable

— Fit the following regression model:
ŷ = b0 + b1x1 + b2(x1 − x(k))xk

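A sketch of this piecewise model, assuming the 90-month knot and hypothetical data:

```python
import numpy as np

# Hypothetical data; knot at 90 months as in the Reynolds example
months = np.array([10, 30, 50, 70, 85, 95, 105, 115], dtype=float)
sales = np.array([150, 240, 310, 360, 380, 370, 350, 320], dtype=float)

knot = 90.0
xk = (months > knot).astype(float)        # knot dummy variable

# Columns: intercept, x1, (x1 - knot) * xk
X = np.column_stack([np.ones_like(months), months, (months - knot) * xk])
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(b)
```

Beyond the knot, the slope changes from b1 to b1 + b2, so the two linear pieces join at x1 = x(k).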

Modeling Nonlinear Relationships


Interaction Between Independent Variables
— Interaction: This occurs when the relationship
between the dependent variable and one independent
variable is different at various values of a second
independent variable
— The model is given by:
y = β0 + β1x1 + β2x2 + β3x1x2 + ε
— The estimated model is given by:
ŷ = b0 + b1x1 + b2x2 + b3x1x2


§ Variable Selection Procedures


§ Overfitting

Model Fitting
Variable Selection Procedures
— Special procedures are sometimes employed to select
the independent variables to include in the regression
model
— Iterative procedures (at each step of the procedure a
single independent variable is added or removed and the
new model is evaluated):
— Stepwise regression
— Forward selection procedure
— Sequential replacement procedure
— Best-subsets procedure: Evaluates regression models
involving different subsets of the independent variables


Model Fitting
— Variable Selection Procedures
— Backward elimination
— Forward selection
— Stepwise selection
— Best subsets
— Forward selection procedure:
— The analyst establishes a criterion for allowing independent
variables to enter the model
— Example: The independent variable j with the smallest p-
value associated with the test of the hypothesis βj = 0,
subject to some predetermined maximum p-value for which
a potential independent variable will be allowed to enter the
model


Model Fitting
— Forward selection procedure (contd.):
— First step: The independent variable that best satisfies
the criterion is added to the model
— Each subsequent step: The remaining independent
variables not in the current model are evaluated, and
the one that best satisfies the criterion is added to the
model
— Procedure stops: When there are no independent
variables not currently in the model that meet the
criterion for being added to the regression model
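A minimal sketch of this forward selection rule, assuming statsmodels and a p-value entry threshold of 0.05 (both the library and the threshold are illustrative choices):

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(columns, y, max_p=0.05):
    """Greedy forward selection sketch: at each step, add the candidate
    variable with the smallest p-value, if it is below max_p."""
    selected = []                   # names of variables in the model
    remaining = set(columns)        # names of candidate variables
    while remaining:
        pvals = {}
        for name in remaining:
            X = sm.add_constant(np.column_stack(
                [columns[n] for n in selected] + [columns[name]]))
            fit = sm.OLS(y, X).fit()
            pvals[name] = fit.pvalues[-1]  # p-value of the newly added variable
        best = min(pvals, key=pvals.get)
        if pvals[best] > max_p:
            break                   # no candidate meets the entry criterion
        selected.append(best)
        remaining.remove(best)
    return selected

# Example with the pie sales arrays from the earlier sketches:
# forward_selection({"price": price, "advertising": advertising}, sales)
```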



Model Fitting
— Backward selection procedure:
— The analyst establishes a criterion for allowing
independent variables to remain in the model.
— Example: The independent variable with the largest p-value
associated with the test of the hypothesis βj = 0, subject to
some predetermined maximum p-value for which an
independent variable will be allowed to remain in the model.


Model Fitting
— Backward selection procedure (contd.):
— First step: The independent variable that violates this
criterion to the greatest degree is removed from the model
— Each subsequent step: The independent variables in the
current model are evaluated, and the one that violates this
criterion to the greatest degree is removed from the model
— Procedure stops: When there are no independent variables
currently in the model that violate the criterion for remaining
in the regression model



Model Fitting
— Stepwise procedure:
— The analyst establishes both a criterion for allowing
independent variables to enter the model and a
criterion for allowing independent variables to remain
in the model
— In the first step of the procedure, the independent
variable that best satisfies the criterion for entering
the model is added
— In each subsequent step, first the remaining independent
variables not in the current model are evaluated, and the
one that best satisfies the criterion for entering is added
to the model


Model Fitting
— Stepwise procedure (contd.):
— Then the independent variables in the current model
are evaluated, and the one that violates the criterion
for remaining in the model to the greatest degree is
removed
— The procedure stops when no independent variables
not currently in the model meet the criterion for
being added to the regression model, and no
independent variables currently in the model violate
the criterion for remaining in the regression model



Model Fitting
— Best-subsets procedure:
— Simple linear regressions for each of the independent
variables under consideration are generated, and then
the multiple regressions with all combinations of two
independent variables under consideration are
generated, and so on
— Once a regression has been generated for every
possible subset of the independent variables under
consideration, an output that provides some criteria
for selecting regression models is produced for all
models generated
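A sketch of this exhaustive enumeration, assuming adjusted R² as the comparison criterion (the criterion itself is an illustrative choice):

```python
from itertools import combinations

import numpy as np
import statsmodels.api as sm

def best_subsets(columns, y):
    """Fit a regression for every non-empty subset of candidate
    variables and rank the subsets by adjusted R-squared."""
    names = list(columns)
    results = []
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            X = sm.add_constant(np.column_stack([columns[n] for n in subset]))
            fit = sm.OLS(y, X).fit()
            results.append((subset, fit.rsquared_adj))
    # best model (by this criterion) first
    return sorted(results, key=lambda r: r[1], reverse=True)
```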

Model Fitting
Overfitting
— Results from creating an overly complex model to explain
idiosyncrasies in the sample data
— Results from the use of complex functional forms or
independent variables that do not have meaningful
relationships with the dependent variable
— If a model is overfit to the sample data, it will perform
better on the sample data used to fit the model than it will
on other data from the population
— Thus, an overfit model can be misleading about its
predictive capability and its interpretation



Model Fitting
— How does one avoid overfitting a model?
— Use only independent variables that you expect to have
real and meaningful relationships with the dependent
variable
— Use complex models, such as quadratic models and
piecewise linear regression models, only when you have
a reasonable expectation that such complexity provides
a more accurate depiction of what you are modeling
— Do not let software dictate your model; use iterative
modeling procedures, such as the stepwise and best-
subsets procedures, only for guidance and not to
generate your final model

Model Fitting
— How does one avoid overfitting a model? (contd.)
— If you have access to a sufficient quantity of data, assess
your model on data other than the sample data that
were used to generate the model (this is referred to as
cross-validation)
— It is recommended to divide the original sample data
into training and validation sets
— Training set: The data set used to build the candidate
models that appear to make practical sense
— Validation set: The set of data used to compare model
performances and ultimately pick a model for
predicting values of the dependent variable


Model Fitting
— Holdout method: The sample data are randomly divided
into mutually exclusive and collectively exhaustive training
and validation sets
— k-fold cross-validation: The sample data are randomly
divided into k equal-sized, mutually exclusive, and
collectively exhaustive subsets called folds, and k iterations
are executed
— Leave-one-out cross-validation: For a sample of n
observations, an iteration consists of estimating the model
on n – 1 observations and evaluating the model on the
single observation that was omitted from the training data
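A sketch of the holdout and k-fold procedures, assuming scikit-learn and hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                        # hypothetical predictors
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=40)

# Holdout method: one random split into training and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)
holdout_r2 = LinearRegression().fit(X_tr, y_tr).score(X_val, y_val)

# k-fold cross-validation: k iterations, each fold held out once;
# leave-one-out is the special case k = n (sklearn's LeaveOneOut)
fold_r2 = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                random_state=0).split(X):
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_r2.append(fit.score(X[val_idx], y[val_idx]))

print(holdout_r2, np.mean(fold_r2))
```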
