A Decision-Making Approach
6th Edition
Chapter 14
Multiple Regression Analysis
and Model Building
Chap 14-1
Chapter Goals
After completing this chapter, you should be
able to:
understand model building using multiple
regression analysis
apply multiple regression analysis to business
decision-making situations
analyze and interpret the computer output for a
multiple regression model
test the significance of the independent variables
in a multiple regression model
Chap 14-2
Chapter Goals
(continued)
Chap 14-3
The Multiple Regression
Model
Idea: Examine the linear relationship between
1 dependent (y) and 2 or more independent variables (xi)

Population model:

  y = β0 + β1x1 + β2x2 + ... + βkxk + ε

  (β0 = y-intercept, β1, ..., βk = population slopes, ε = random error)

Estimated multiple regression model:

  ŷ = b0 + b1x1 + b2x2 + ... + bkxk

  (ŷ = estimated (predicted) value of y, b0 = estimated intercept,
   b1, ..., bk = estimated slope coefficients)
Chap 14-4
Multiple Regression Model
Two variable model

  ŷ = b0 + b1x1 + b2x2

[Figure: the estimated regression plane over the (x1, x2)
plane; b1 is the slope for variable x1 and b2 is the slope
for variable x2]
Chap 14-5
Multiple Regression Model
Two variable model

  ŷ = b0 + b1x1 + b2x2

[Figure: a sample observation yi at (x1i, x2i) and its
fitted value ŷi on the plane; the residual is e = (yi − ŷi)]

The best-fit equation, ŷ, is found by minimizing the
sum of squared errors, Σe²
Chap 14-6
Multiple Regression
Assumptions
The model errors are assumed to be independent and
normally distributed, with mean 0 and constant
variance, where the error is e = (y − ŷ)
Chap 14-7
Model Specification
Chap 14-8
The Correlation Matrix
Chap 14-9
Example
A distributor of frozen dessert pies wants to
evaluate factors thought to influence demand

Dependent variable:    Pie sales (units per week)
Independent variables: Price (in $)
                       Advertising ($100s)
Chap 14-10
Pie Sales Model

       Pie Sales  Price   Advertising
Week   (units)    ($)     ($100s)
  1    350        5.50    3.3
  2    460        7.50    3.3
  3    350        8.00    3.0
  4    430        8.00    4.5
  5    350        6.80    3.0
  6    380        7.50    4.0
  7    430        4.50    3.0
  8    470        6.40    3.7
  9    450        7.00    3.5
 10    490        5.00    4.0
 11    340        7.20    3.5
 12    300        7.90    3.2
 13    440        5.90    4.0
 14    450        5.00    3.5
 15    300        7.00    2.7

Multiple regression model:

  Sales = b0 + b1(Price) + b2(Advertising)

Correlation matrix:

              Pie Sales  Price    Advertising
Pie Sales     1
Price         -0.44327   1
Advertising   0.55632    0.03044  1
Chap 14-11
Interpretation of Estimated
Coefficients
Slope (bi)
Estimates that the average value of y changes by bi
units for each 1-unit increase in xi, holding all other
variables constant
Example: if b1 = -20, then sales (y) is expected to
decrease by an estimated 20 pies per week for each $1
increase in selling price (x1), net of the effects of
changes due to advertising (x2)
y-intercept (b0)
The estimated average value of y when all x i = 0
(assuming all xi = 0 is within the range of observed
values)
Chap 14-12
Pie Sales Correlation Matrix
              Pie Sales  Price    Advertising
Pie Sales     1
Price         -0.44327   1
Advertising   0.55632    0.03044  1
Chap 14-13
Scatter Diagrams
[Scatter diagrams: Sales vs. Price (negative relationship)
and Sales vs. Advertising (positive relationship)]
Chap 14-14
Estimating a Multiple Linear
Regression Equation
Computer software is generally used to
generate the coefficients and measures of
goodness of fit for multiple regression
Excel:
Tools / Data Analysis... / Regression
PHStat:
PHStat / Regression / Multiple Regression
Chap 14-15
Multiple Regression Output
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

  Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

ANOVA        df   SS          MS          F         Significance F
Regression    2   29460.027   14730.013   6.53861   0.01201
Residual     12   27033.306    2252.776
Total        14   56493.333

              Coefficients  Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     306.52619     114.25389        2.68285   0.01993    57.58835   555.46404
Price         -24.97509      10.83213       -2.30565   0.03979   -48.57626    -1.37392
Advertising    74.13096      25.96732        2.85478   0.01449    17.55303   130.70888
Chap 14-16
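The Excel output above can be reproduced in base R with lm() on the pie-sales data (a sketch; the vector names y, x, z follow the "R commands" slides later in this chapter):

```r
# Pie-sales data from the table on slide Chap 14-11
y <- c(350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300)  # sales
x <- c(5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0)  # price ($)
z <- c(3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7)  # advertising ($100s)

reg <- lm(y ~ x + z)
round(coef(reg), 3)        # intercept 306.526, price -24.975, advertising 74.131
summary(reg)$r.squared     # 0.52148, matching the Excel R Square
```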
The Multiple Regression
Equation

  Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

  where Sales is in number of pies per week,
  Price is in $, and Advertising is in $100s

Chap 14-17
Using The Model to Make
Predictions
Predict sales for a week in which the selling
price is $5.50 and advertising is $350:

  Sales = 306.526 - 24.975(5.50) + 74.131(3.5) = 428.62

Note that Advertising is in $100s, so $350 means x2 = 3.5;
predicted sales ≈ 428.62 pies per week
Chap 14-19
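The prediction can be checked with a one-line R computation using the estimated coefficients from the output slide:

```r
# Predicted pie sales at Price = $5.50 and Advertising = $350 (= 3.5 hundred $)
b0 <- 306.526; b1 <- -24.975; b2 <- 74.131   # estimates from slide Chap 14-16
sales_hat <- b0 + b1*5.50 + b2*3.5
round(sales_hat, 2)   # 428.62 pies per week
```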
Multiple Coefficient of
Determination
(continued)
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

  R² = SSR/SST = 29460.0/56493.3 = 0.52148

52.1% of the variation in pie sales
is explained by the variation in
price and advertising

(remaining ANOVA and coefficient output as on slide Chap 14-16)
Chap 14-20
Adjusted R2
R2 never decreases when a new x variable is
added to the model
This can be a disadvantage when comparing
models
What is the net effect of adding a new variable?
We lose a degree of freedom when a new x
variable is added
Did the new x variable add enough
explanatory power to offset the loss
of a degree of freedom?
Chap 14-21
Adjusted R2
(continued)
Shows the proportion of variation in y explained by all
x variables, adjusted for the number of x variables used:

  R²A = 1 - (1 - R²) [(n - 1)/(n - k - 1)]

(where n = sample size, k = number of independent variables)
Chap 14-22
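The formula above reproduces the Adjusted R Square in the Excel output (a quick check in R, using R² = .52148, n = 15, k = 2 from the pie-sales model):

```r
# Adjusted R^2: penalizes R^2 for the number of predictors used
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2(0.52148, 15, 2), 5)   # 0.44173, matching the Excel output (0.44172)
```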
Multiple Coefficient of
Determination
(continued)
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

  R²A = 0.44172

44.2% of the variation in pie sales is
explained by the variation in price and
advertising, taking into account the sample
size and number of independent variables

(remaining ANOVA and coefficient output as on slide Chap 14-16)
Chap 14-23
Is the Model Significant?
F-Test for Overall Significance of the Model
Shows if there is a linear relationship between all
of the x variables considered together and y
Use F test statistic
Hypotheses:
H0: β1 = β2 = ... = βk = 0 (no linear relationship)
HA: at least one βi ≠ 0 (at least one independent
variable affects y)
Chap 14-24
F-Test for Overall
Significance
(continued)
Test statistic:

  F = MSR/MSE = (SSR/k) / (SSE/(n - k - 1))

where F has (numerator) D1 = k and
(denominator) D2 = (n - k - 1)
degrees of freedom
Chap 14-25
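Plugging the pie-sales ANOVA sums of squares into the test statistic reproduces the F value on the output slide (a sketch in R; pf() gives the matching p-value):

```r
# F statistic for overall significance of the pie-sales model
SSR <- 29460.027; SSE <- 27033.306; n <- 15; k <- 2
F_stat <- (SSR / k) / (SSE / (n - k - 1))
round(F_stat, 4)                                # 6.5386
pf(F_stat, k, n - k - 1, lower.tail = FALSE)    # p-value, approx .012
```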
F-Test for Overall
Significance
(continued)
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

  F = MSR/MSE = 14730.0/2252.8 = 6.5386

With 2 and 12 degrees of freedom;
P-value for the F-Test = .01201

(remaining ANOVA and coefficient output as on slide Chap 14-16)
Chap 14-26
F-Test for Overall
Significance
(continued)

  F = 6.5386 with Significance F (p-value) = .01201
  Since the p-value is less than α = .05, reject H0:
  there is evidence that at least one independent
  variable affects y

Chap 14-28
Are Individual Variables
Significant?
(continued)
Test Statistic:

  t = (bi - 0) / sbi     (df = n - k - 1)
Chap 14-29
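For the Price coefficient, the statistic above can be computed directly from the output slide's estimates (b1 = -24.97509, sb1 = 10.83213, df = 15 - 2 - 1 = 12); pt() gives the two-tailed p-value:

```r
# t statistic for Price in the pie-sales model
t_price <- (-24.97509 - 0) / 10.83213
t_price                                            # matches -2.30565 in the Excel output
2 * pt(abs(t_price), df = 12, lower.tail = FALSE)  # two-tailed p-value, approx .0398
```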
Are Individual Variables
Significant?
(continued)
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

t-value for Price is t = -2.306, with
p-value .0398

t-value for Advertising is t = 2.855,
with p-value .0145

(remaining ANOVA and coefficient output as on slide Chap 14-16)
Chap 14-30
Inferences about the Slope:
t Test Example
From Excel output:

  H0: βi = 0
  HA: βi ≠ 0

              Coefficients  Standard Error   t Stat    P-value
Intercept     306.52619     114.25389        2.68285   0.01993
Price         -24.97509      10.83213       -2.30565   0.03979
Advertising    74.13096      25.96732        2.85478   0.01449

Standard Deviation of the Regression Model

The estimate of the standard deviation of the model:

  s = sqrt( SSE / (n - k - 1) ) = sqrt(MSE)

Is this value large or small? Must compare it to the
mean size of y for context
Chap 14-33
Standard Deviation of the
Regression Model
(continued)
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

The standard deviation of the
regression model is 47.46

(remaining ANOVA and coefficient output as on slide Chap 14-16)
Chap 14-34
Standard Deviation of the
Regression Model
(continued)
The standard deviation of the regression model is
47.46
A rough prediction range for pie sales in a given
week is ± 2(47.46) ≈ 94.9
Pie sales in the sample were in the 300 to 500
per week range, so this range is probably too
large to be acceptable. The analyst may want to
look for additional variables that can explain more
of the variation in weekly sales
Chap 14-35
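As a sketch, the rough ± 2s range around the predicted value from the earlier slide (ŷ ≈ 428.62 at Price $5.50, Advertising 3.5 hundred $) works out to:

```r
# Rough prediction range: y_hat +/- 2s, with s = 47.46 from the output slide
s     <- 47.46341
y_hat <- 428.62    # predicted sales at Price = $5.50, Advertising = 3.5 ($100s)
c(y_hat - 2*s, y_hat + 2*s)   # roughly 333.7 to 523.5 pies per week
```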
R commands
paste("Today is", date())
y=c(350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300,
440, 450, 300)
#y is apple pie sales
x=c(5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0,
7.0)
#price charged
z=c(3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5,
2.7)
#advertising expenses
t.test(x); sort(x); t.test(y); sort(y)
t.test(z); sort(z)
library(fBasics)
Chap 14-36
R commands set 2
cs=colStats(cbind(y,x,z), FUN=basicStats); cs
#following function automatically computes outliers
get.outliers = function(x) {
  # function to compute the number of outliers automatically
  # author: H. D. Vinod, Fordham University, New York, 24 March 2006
  su = summary(x)
  if (ncol(as.matrix(x)) > 1) {
    print("Error: input to get.outliers function has 2 or more columns")
    return(0) }
  iqr = su[5] - su[2]        # inter-quartile range Q3 - Q1
  dn  = su[2] - 1.5*iqr      # lower limit
  up  = su[5] + 1.5*iqr      # upper limit
  LO  = x[x < dn]            # vector of values below the lower limit
  nLO = length(LO)
  UP  = x[x > up]            # vector of values above the upper limit
  nUP = length(UP)
  print(c(" Q1-1.5*(inter quartile range)=", as.vector(dn),
          "number of outliers below it are=", as.vector(nLO)), quote=F)
  if (nLO > 0) {
    print(c("Actual values below the lower limit are:", LO), quote=F) }
  print(c(" Q3+1.5*(inter quartile range)=", as.vector(up),
          " number of outliers above it are=", as.vector(nUP)), quote=F)
  if (nUP > 0) {
    print(c("Actual values above the upper limit are:", UP), quote=F) }
  list(below=LO, nLO=nLO, above=UP, nUP=nUP, low.lim=dn, up.lim=up) }
#xx=get.outliers(x)
# function ends here = = = = = =
Chap 14-37
R commands set 3
xx=get.outliers(x)
xx=get.outliers(y)
xx=get.outliers(z)
#Tests for Correlation Coefficients
cor.test(x,y)
#capture.output(cor.test(x,y), file="c:/stat2/PieSaleOutput.txt", append=T)
cor.test(z,y)
#capture.output(cor.test(z,y), file="c:/stat2/PieSaleOutput.txt", append=T)
cor.test(x,z)
#capture.output(cor.test(x,z), file="c:/stat2/PieSaleOutput.txt", append=T)
#Now regression analysis
reg1=lm(y~x+z)
summary(reg1) #plot(reg1)
library(car); confint(reg1) #prints confidence intervals
Chap 14-38
Multicollinearity

High correlation exists between two or more
independent variables (the correlated variables
contribute redundant information to the model)

Chap 14-39
Multicollinearity
(continued)
Including two highly correlated independent
variables can adversely affect the regression
results
No new information provided
Can lead to unstable coefficients (large
standard error and low t-values)
Coefficient signs may not match prior
expectations
Chap 14-40
Some Indications of Severe
Multicollinearity
Incorrect signs on the coefficients
Large change in the value of a previous
coefficient when a new variable is added to the
model
A previously significant variable becomes
insignificant when a new independent variable
is added
The estimate of the standard deviation of the
model increases when a variable is added to
the model
Chap 14-41
Detect Collinearity
(Variance Inflationary Factor)
VIFj is used to measure collinearity:

  VIFj = 1 / (1 - R²j)

where R²j is the coefficient of determination from
regressing xj on all the other x variables
Chap 14-42
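For the pie-sales model, the formula above can be applied directly in R by regressing Price (x) on Advertising (z), the only other independent variable (a sketch using the data from the earlier slides):

```r
# VIF for Price: regress x_j (Price) on the other x's (here just Advertising)
x <- c(5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0)
z <- c(3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7)
r2_j <- summary(lm(x ~ z))$r.squared   # R^2_j from regressing x_j on the rest
1 / (1 - r2_j)                         # approx 1.00093, matching the PHStat VIF
```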
Detect Collinearity in PHStat
Regression Analysis

Adjusted R Square   -0.075925366
Standard Error       1.21527235
Observations        15
VIF                  1.000927305

VIF is < 5: there is no evidence of
collinearity between Price and Advertising
Chap 14-43
Qualitative (Dummy)
Variables
Chap 14-44
Dummy-Variable Model
Example (with 2 Levels)
Let:
  y  = pie sales
  x1 = price
  x2 = holiday  (x2 = 1 if a holiday occurred during the week;
                 x2 = 0 if there was no holiday that week)

  ŷ = b0 + b1x1 + b2x2
Chap 14-45
Dummy-Variable Model
Example
(with 2 Levels)
(continued)
Different intercept, same slope:

[Figure: two parallel sales lines with the same slope b1
and different intercepts — b0 + b2 for holiday weeks and
b0 for no-holiday weeks]

If H0: β2 = 0 is rejected, then
Holiday has a significant effect
on pie sales
Chap 14-47
Dummy-Variable Models
(more than 2 Levels)
The number of dummy variables is one less than
the number of levels
Example:
y = house price ; x1 = square feet
Chap 14-48
Dummy-Variable Models
(more than 2 Levels)
(continued)
Let the default category be "condo":

  ŷ = b0 + b1x1 + b2x2 + b3x3

  (x2 = 1 if ranch, 0 otherwise; x3 = 1 if split level, 0 otherwise)

b2 shows the impact on price if the house is a
ranch style, compared to a condo
b3 shows the impact on price if the house is a
split-level style, compared to a condo
Chap 14-49
Interpreting the Dummy
Variable Coefficients (with 3
Levels)
Suppose the estimated equation is

  ŷ = 20.43 + 0.045x1 + 23.53x2 + 18.84x3

For a condo (x2 = x3 = 0):
  ŷ = 20.43 + 0.045x1

For a ranch (x2 = 1, x3 = 0):
  ŷ = 20.43 + 0.045x1 + 23.53
  With the same square feet, a ranch will have an
  estimated average price of 23.53 thousand dollars
  more than a condo

For a split level (x2 = 0, x3 = 1):
  ŷ = 20.43 + 0.045x1 + 18.84
  With the same square feet, a split level will have an
  estimated average price of 18.84 thousand dollars
  more than a condo
Chap 14-50
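The interpretation above can be verified with a small R function for the estimated equation (the square-footage value 2000 is an arbitrary illustration, not from the slides):

```r
# Predicted house price ($ thousands) by style, at the same square footage
price_hat <- function(sqft, ranch = 0, split = 0)
  20.43 + 0.045*sqft + 23.53*ranch + 18.84*split

price_hat(2000)                                # condo
price_hat(2000, ranch = 1) - price_hat(2000)   # 23.53: ranch premium over condo
price_hat(2000, split = 1) - price_hat(2000)   # 18.84: split-level premium over condo
```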
Nonlinear Relationships
Chap 14-51
Polynomial Regression Model
General form:

  y = β0 + β1xj + β2xj² + ... + βpxj^p + ε

where:
  β0 = population regression constant
  βj = population regression coefficient for variable xj, j = 1, 2, ..., k
  p  = order of the polynomial
  ε  = model error

Quadratic (second-order) form:

  y = β0 + β1xj + β2xj² + ε
Chap 14-52
Linear vs. Nonlinear Fit
[Figure: four panels of fitted quadratic curves with their
residual plots; the direction and curvature of each fit
depend on the signs of the coefficients:
  β1 < 0, β2 > 0    β1 > 0, β2 > 0    β1 < 0, β2 < 0    β1 > 0, β2 < 0]

β1 = the coefficient of the linear term
β2 = the coefficient of the squared term
Chap 14-54
Testing for Significance:
Quadratic Model
Test for Overall Relationship

  F test statistic = MSR/MSE

Testing the Quadratic Effect

Compare the quadratic model

  y = β0 + β1xj + β2xj² + ε

with the linear model

  y = β0 + β1xj + ε

Hypotheses:
  H0: β2 = 0 (no 2nd-order polynomial term)
  HA: β2 ≠ 0 (2nd-order polynomial term is needed)
Chap 14-55
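In R, the nested linear and quadratic models can be compared with anova(), which performs exactly this kind of test. A sketch on hypothetical data (xj and y below are made up for illustration; they are not the pie-sales data):

```r
# Hypothetical data with a genuine quadratic effect
set.seed(42)
xj <- seq(0, 10, length.out = 30)
y  <- 5 + 2*xj + 0.8*xj^2 + rnorm(30, sd = 3)   # true model includes xj^2

lin  <- lm(y ~ xj)            # linear model
quad <- lm(y ~ xj + I(xj^2))  # quadratic model
anova(lin, quad)              # small p-value -> the 2nd-order term is needed
```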
Higher Order Models
Cubic (third-order) form:

  y = β0 + β1xj + β2xj² + β3xj³ + ε
Chap 14-56
Interaction Effects
Hypothesizes interaction between pairs of x
variables
  The response to one x variable varies at different
  levels of another x variable
Chap 14-57
Effect of Interaction
Given:

  y = β0 + β1x1 + β2x2 + β3x1x2 + ε
Chap 14-58
Interaction Example
Suppose x2 is a dummy variable (x2 = 0 or 1) and the
estimated equation is

  ŷ = 1 + 2x1 + 3x2 + 4x1x2

If x2 = 1:
  ŷ = 1 + 2x1 + 3(1) + 4x1(1) = 4 + 6x1

If x2 = 0:
  ŷ = 1 + 2x1 + 3(0) + 4x1(0) = 1 + 2x1

[Figure: the two lines plotted for 0 ≤ x1 ≤ 1.5]

The effect (slope) of x1 on y does depend on the x2 value
Chap 14-59
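The slope change can be confirmed with a tiny R function for the estimated equation:

```r
# Interaction example: the slope of x1 depends on the level of x2
y_hat <- function(x1, x2) 1 + 2*x1 + 3*x2 + 4*x1*x2

y_hat(1, 0) - y_hat(0, 0)   # slope when x2 = 0: 2
y_hat(1, 1) - y_hat(0, 1)   # slope when x2 = 1: 2 + 4 = 6
```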
Interaction Regression Model
Worksheet
  y = β0 + β1x1 + β2x2 + β3x1x2 + ε

Hypotheses:
  H0: β3 = 0 (no interaction between x1 and x2)
  HA: β3 ≠ 0 (x1 interacts with x2)
Chap 14-61
Model Building
Goal is to develop a model with the best set of
independent variables
Easier to interpret if unimportant variables are
removed
Lower probability of collinearity
Stepwise regression procedure
Provides an evaluation of alternative models as variables
are added
Best-subset approach
Try all combinations and select the best using the
highest adjusted R2 and lowest s
Chap 14-62
Stepwise Regression
Chap 14-63
Best Subsets Regression
Chap 14-64
Aptness of the Model
[Figure: residual plots vs. x — a random, patternless
scatter suggests an apt model; a systematic (e.g. curved)
pattern suggests the model form is not appropriate]
Chap 14-67
Chapter Summary
Chap 14-68
Chapter Summary
(continued)
Chap 14-69