
Chapter 12

Multiple Regression

Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc.


Chapter Goals
After completing this chapter, you should be able to:

- apply multiple regression analysis to business decision-making situations
- analyze and interpret the computer output for a multiple regression model
- perform residual analysis for the multiple regression model
- test the significance of the independent variables in a multiple regression model

Chapter Goals
(continued)

After completing this chapter, you should be able to:

- use a coefficient of partial determination to test portions of the multiple regression model
- incorporate qualitative variables into the regression model by using dummy variables
- use interaction terms in regression models

Multiple Regression Model


Idea: examine the linear relationship between one dependent variable (Y) and two or more independent variables (Xi).

Multiple regression model with k independent variables:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where β0 is the Y-intercept, β1, …, βk are the population slopes, and εi is the random error.

12.2 Multiple Regression Equation


The coefficients of the multiple regression model are estimated using sample data.

Multiple regression equation with k independent variables:

Ŷi = b0 + b1X1i + b2X2i + … + bkXki

where Ŷi is the estimated (or predicted) value of Y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients.

In this chapter we will always use Excel to obtain the regression slope coefficients and other regression summary measures.

Multiple Regression Equation


(continued)

Two-variable model:

Ŷ = b0 + b1X1 + b2X2

[Figure: the fitted regression plane over the (X1, X2) axes, with b1 the slope for variable X1 and b2 the slope for variable X2]

Example:
Two Independent Variables

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand.

- Dependent variable: pie sales (units per week)
- Independent variables: price (in $) and advertising ($100s)

Data are collected for 15 weeks.

Pie Sales Example


Week   Pie Sales   Price ($)   Advertising ($100s)
  1       350        5.50             3.3
  2       460        7.50             3.3
  3       350        8.00             3.0
  4       430        8.00             4.5
  5       350        6.80             3.0
  6       380        7.50             4.0
  7       430        4.50             3.0
  8       470        6.40             3.7
  9       450        7.00             3.5
 10       490        5.00             4.0
 11       340        7.20             3.5
 12       300        7.90             3.2
 13       440        5.90             4.0
 14       450        5.00             3.5
 15       300        7.00             2.7

Multiple regression equation:

Sales = b0 + b1(Price) + b2(Advertising)
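
The slides obtain these results with Excel/PHStat; as a cross-check, here is a minimal Python sketch (assuming pandas and statsmodels are installed; the variable names pie and model are illustrative) that fits the same model:

```python
import pandas as pd
import statsmodels.api as sm

# The 15 weeks of pie sales data from the table above
pie = pd.DataFrame({
    "sales": [350, 460, 350, 430, 350, 380, 430, 470, 450,
              490, 340, 300, 440, 450, 300],
    "price": [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
              7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00],
    "advertising": [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                    3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7],
})

X = sm.add_constant(pie[["price", "advertising"]])  # adds the intercept column b0
model = sm.OLS(pie["sales"], X).fit()
print(model.summary())  # coefficients, t stats, F, r-squared, as in the Excel output
```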

Multiple Regression Output


Regression Statistics
  Multiple R           0.72213
  R Square             0.52148
  Adjusted R Square    0.44172
  Standard Error      47.46341
  Observations              15

ANOVA          df          SS          MS         F    Significance F
  Regression    2   29460.027   14730.013   6.53861           0.01201
  Residual     12   27033.306    2252.776
  Total        14   56493.333

                Coefficients   Standard Error     t Stat   P-value    Lower 95%   Upper 95%
  Intercept        306.52619        114.25389    2.68285   0.01993     57.58835   555.46404
  Price            -24.97509         10.83213   -2.30565   0.03979    -48.57626    -1.37392
  Advertising       74.13096         25.96732    2.85478   0.01449     17.55303   130.70888

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

Multiple Regression Equation


Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

where Sales is in number of pies per week, Price is in $, and Advertising is in $100s.

- b1 = -24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising
- b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price

Using The Equation to Make Predictions

Predict sales for a week in which the selling price is $5.50 and advertising is $350:

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
      = 306.526 - 24.975(5.50) + 74.131(3.5)
      = 428.62

Predicted sales is 428.62 pies.

Note that Advertising is in $100s, so $350 means that X2 = 3.5.
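
The same arithmetic as a quick check in plain Python (the coefficient names are illustrative):

```python
# Predicted pie sales at Price = $5.50 and Advertising = $350 (i.e., 3.5 in $100s)
b0, b1, b2 = 306.526, -24.975, 74.131
sales_hat = b0 + b1 * 5.50 + b2 * 3.5
print(round(sales_hat, 2))  # 428.62 pies
```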

Predictions in PHStat

PHStat | regression | multiple regression

Check the confidence and prediction interval estimates box.

Predictions in PHStat
(continued)

The PHStat output shows:

- the input values
- the predicted Y value
- the confidence interval for the mean Y value, given these Xs
- the prediction interval for an individual Y value, given these Xs

12.6 Coefficient of
Multiple Determination

Reports the proportion of total variation in Y explained by all X variables taken together:

r²Y.12…k = SSR / SST = regression sum of squares / total sum of squares

Multiple Coefficient of Determination


(continued)
r²Y.12 = SSR / SST = 29460.0 / 56493.3 = 0.52148

52.1% of the variation in pie sales is explained by the variation in price and advertising (see the R Square line and ANOVA table of the regression output above).

Adjusted r²

- r² never decreases when a new X variable is added to the model
- This can be a disadvantage when comparing models

What is the net effect of adding a new variable?

- We lose a degree of freedom when a new X variable is added
- Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?

Adjusted r²
(continued)

Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

r²adj = 1 - [(1 - r²Y.12…k)((n - 1) / (n - k - 1))]

(where n = sample size, k = number of independent variables)

- Penalizes excessive use of unimportant independent variables
- Smaller than r²
- Useful in comparing among models
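
As a sketch, the formula above as a small Python helper (the function name is illustrative):

```python
def adjusted_r2(r2, n, k):
    """Adjust r-squared for sample size n and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.52148, n=15, k=2))  # ~0.4417, matching the Excel output up to rounding
```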

Adjusted r²
(continued)

r²adj = 0.44172

44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables (the Adjusted R Square line of the regression output above).

12.10 Residuals in Multiple Regression (Model Checking)

Two-variable model: Ŷ = b0 + b1X1 + b2X2

Residual: ei = (Yi - Ŷi)

[Figure: a sample observation (x1i, x2i, Yi) plotted above the fitted plane; the residual is the vertical distance between Yi and Ŷi]

The best-fit equation, Ŷ, is found by minimizing the sum of squared errors, Σe².

Multiple Regression Assumptions


Errors (residuals) from the regression model:

ei = (Yi - Ŷi)

Assumptions:
- The errors are normally distributed
- Errors have a constant variance
- The model errors are independent

Residual Plots Used in Multiple Regression

These residual plots are used in multiple regression:

- Residuals vs. Ŷi
- Residuals vs. X1i
- Residuals vs. X2i
- Residuals vs. time (if time-series data)

Use the residual plots to check for violations of the regression assumptions.
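
A minimal sketch of these plots, assuming matplotlib is available and that model and pie come from the earlier fitting sketch:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(model.fittedvalues, model.resid)
axes[0].set_xlabel("predicted Y")
axes[1].scatter(pie["price"], model.resid)
axes[1].set_xlabel("X1: price")
axes[2].scatter(pie["advertising"], model.resid)
axes[2].set_xlabel("X2: advertising")
for ax in axes:
    ax.axhline(0, linewidth=1)  # residuals should scatter randomly about zero
    ax.set_ylabel("residual")
plt.tight_layout()
plt.show()
```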

Is the Model Significant?

F-Test for Overall Significance of the Model

- Shows if there is a linear relationship between all of the X variables considered together and Y
- Use the F test statistic

Hypotheses:
  H0: β1 = β2 = … = βk = 0 (no linear relationship)
  H1: at least one βi ≠ 0 (at least one independent variable affects Y)

F-Test for Overall Significance

Test statistic:

F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))

where F has k degrees of freedom in the numerator and (n - k - 1) degrees of freedom in the denominator.
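
Computed by hand from the pie sales ANOVA quantities (a sketch assuming scipy is available):

```python
from scipy import stats

SSR, SSE, n, k = 29460.027, 27033.306, 15, 2
MSR, MSE = SSR / k, SSE / (n - k - 1)
F = MSR / MSE
p_value = stats.f.sf(F, k, n - k - 1)  # upper-tail area beyond F
print(round(F, 4), round(p_value, 5))  # 6.5386 0.01201
```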

F-Test for Overall Significance
(continued)

From the ANOVA table of the regression output above:

F = MSR / MSE = 14730.0 / 2252.8 = 6.5386

with 2 and 12 degrees of freedom; the P-value for the F-test (Significance F) is 0.01201.

F-Test for Overall Significance
(continued)

H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05, df1 = 2, df2 = 12

Critical value: F.05 = 3.885

Test statistic: F = MSR / MSE = 6.5386

Decision: since the F test statistic is in the rejection region (F = 6.5386 > 3.885, p-value < .05), reject H0.

Conclusion: there is evidence that at least one independent variable affects Y.

[Figure: F distribution with the α = .05 rejection region to the right of F.05 = 3.885]

Are Individual Variables Significant?

Use t-tests of individual variable slopes.

- Shows if there is a linear relationship between the variable Xi and Y

Hypotheses:
  H0: βi = 0 (no linear relationship)
  H1: βi ≠ 0 (linear relationship does exist between Xi and Y)

Are Individual Variables Significant?
(continued)

H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (linear relationship does exist between Xi and Y)

Test statistic:

t = (bi - 0) / Sbi    (df = n - k - 1)
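
The Price slope from the pie sales output, tested by hand (a sketch assuming scipy is available):

```python
from scipy import stats

b_price, s_b_price, n, k = -24.97509, 10.83213, 15, 2
t = (b_price - 0) / s_b_price
p_value = 2 * stats.t.sf(abs(t), df=n - k - 1)  # two-tailed
print(round(t, 5), round(p_value, 5))  # -2.30565 0.03979
```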

Are Individual Variables Significant?
(continued)

From the coefficients table of the regression output above:

- t-value for Price is t = -2.306, with p-value .0398
- t-value for Advertising is t = 2.855, with p-value .0145

Inferences about the Slope: t Test Example

H0: βi = 0
H1: βi ≠ 0

From Excel output, d.f. = 15 - 2 - 1 = 12:

               Coefficients   Standard Error    t Stat    P-value
  Price          -24.97509        10.83213     -2.30565   0.03979
  Advertising     74.13096        25.96732      2.85478   0.01449

At α = .05, tα/2 = 2.1788. The test statistic for each variable falls in the rejection region (p-values < .05).

Decision: reject H0 for each variable.

Conclusion: there is evidence that both Price and Advertising affect pie sales at α = .05.

[Figure: t distribution with two-tailed rejection regions beyond ±2.1788]

12.12 Nonlinear Relationships

- The relationship between the dependent variable and an independent variable may not be linear
- Can review the scatter diagram to check for nonlinear relationships

Example: quadratic model

Yi = β0 + β1X1i + β2X1i² + εi

The second independent variable is the square of the first variable.

Quadratic Regression Model

Model form:

Yi = β0 + β1X1i + β2X1i² + εi

where:
  β0 = Y intercept
  β1 = regression coefficient for linear effect of X on Y
  β2 = regression coefficient for quadratic effect on Y
  εi = random error in Y for observation i
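
In code, the quadratic model is just a multiple regression whose second regressor is the square of the first. A sketch with statsmodels, using the recoverable rows of the purity example that follows (so the estimates will not exactly match the slide, which used the full data set):

```python
import numpy as np
import statsmodels.api as sm

time = np.array([10, 12, 13, 14, 15, 15, 16, 17], dtype=float)    # filter time
purity = np.array([40, 54, 67, 70, 78, 85, 87, 99], dtype=float)  # purity

X = sm.add_constant(np.column_stack([time, time ** 2]))  # [1, X1, X1^2]
quad = sm.OLS(purity, X).fit()
print(quad.params)        # b0, b1 (linear effect), b2 (quadratic effect)
print(quad.rsquared_adj)  # compare against the simple linear model
```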

Linear vs. Nonlinear Fit

[Figure: Y vs. X scatter with a linear fit and a nonlinear fit, each with its residual plot]

- A linear fit does not give random residuals
- A nonlinear fit gives random residuals

Testing the Overall Quadratic Model

Estimate the quadratic model to obtain the regression equation:

Ŷi = b0 + b1X1i + b2X1i²

Test for overall relationship:

H0: β1 = β2 = 0 (no overall relationship between X and Y)
H1: β1 and/or β2 ≠ 0 (there is a relationship between X and Y)

F test statistic = MSR / MSE

Testing for Significance: Quadratic Effect

Testing the quadratic effect:

H0: β2 = 0 (the quadratic term does not improve the model)
H1: β2 ≠ 0 (the quadratic term improves the model)

The test statistic is

t = (b2 - β2) / Sb2    (d.f. = n - 3)

where:
  b2 = squared-term slope coefficient
  β2 = hypothesized slope (zero)
  Sb2 = standard error of the slope

Testing for Significance: Quadratic Effect
(continued)

Testing the quadratic effect:

- Compare r² from the simple regression to the adjusted r² from the quadratic model
- If adj. r² from the quadratic model is larger than r² from the simple model, then the quadratic model is the better model

Example: Quadratic Model

Purity   Filter Time
  15          –
  22          –
  33          –
  40         10
  54         12
  67         13
  70         14
  78         15
  85         15
  87         16
  99         17

Purity increases as filter time increases:

[Figure: scatter plot of purity vs. filter time]

Example: Quadratic Model
(continued)

Simple regression results:

Ŷ = -11.283 + 5.985(Time)

            Coefficients   Standard Error     t Stat     P-value
  Intercept    -11.28267        3.46805      -3.25332    0.00691
  Time           5.98520        0.30966      19.32819    2.078E-10

Regression Statistics
  R Square            0.96888
  Adjusted R Square   0.96628
  Standard Error      6.15997

F = 373.57904, Significance F = 2.0778E-10

The t statistic, F statistic, and r² are all high, but the residuals are not random:

[Figure: residual plot for the linear fit, showing a systematic curved pattern]

Example: Quadratic Model
(continued)

Quadratic regression results:

Ŷ = 1.539 + 1.565(Time) + 0.245(Time)²

               Coefficients   Standard Error    t Stat    P-value
  Intercept       1.53870         2.24465       0.68550   0.50722
  Time            1.56496         0.60179       2.60052   0.02467
  Time-squared    0.24516         0.03258       7.52406   1.165E-05

Regression Statistics
  R Square            0.99494
  Adjusted R Square   0.99402
  Standard Error      2.59513

F = 1080.7330, Significance F = 2.368E-13

The quadratic term is significant and improves the model: adj. r² is higher, SYX is lower, and the residuals are now random.

12.9 Model Building

Goal is to develop a model with the best set of independent variables.

Stepwise regression procedure:
- Easier to interpret if unimportant variables are removed
- Lower probability of collinearity
- Provides evaluation of alternative models as variables are added

Best-subset approach:
- Try all combinations and select the best using the highest adjusted r² and lowest standard error

Stepwise Regression

- Idea: develop the least squares regression equation in steps, adding one explanatory variable at a time and evaluating whether existing variables should remain or be removed
- The coefficient of partial determination is the measure of the marginal contribution of each independent variable, given that other independent variables are in the model

Best Subsets Regression

- Idea: estimate all possible regression equations using all possible combinations of independent variables
- Choose the best fit by looking for the highest adjusted r² and lowest standard error (a sketch follows below)
- Stepwise regression and best subsets regression can be performed using PHStat
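
A minimal best-subsets sketch (assuming statsmodels; the helper name is illustrative) that fits every combination of candidate X columns and ranks them by adjusted r²:

```python
from itertools import combinations
import statsmodels.api as sm

def best_subsets(y, X):
    """Fit all combinations of columns of X; return (adj r2, columns), best first."""
    results = []
    for r in range(1, len(X.columns) + 1):
        for cols in combinations(X.columns, r):
            fit = sm.OLS(y, sm.add_constant(X[list(cols)])).fit()
            results.append((fit.rsquared_adj, cols))
    return sorted(results, reverse=True)

# e.g. best_subsets(pie["sales"], pie[["price", "advertising"]])
```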

12.11 Alternative Best Subsets Criterion

- Calculate the Cp value for each potential regression model
- Consider models with Cp values close to or below k + 1, where k is the number of independent variables in the model under consideration

Alternative Best Subsets Criterion
(continued)

The Cp statistic:

Cp = ((1 - R²k)(n - T)) / (1 - R²T) - (n - 2(k + 1))

where:
  k = number of independent variables included in a particular regression model
  T = total number of parameters to be estimated in the full regression model
  R²k = coefficient of multiple determination for the model with k independent variables
  R²T = coefficient of multiple determination for the full model with all T estimated parameters
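
The same statistic as a small Python helper (a sketch; the function name is illustrative):

```python
def cp_statistic(r2_k, r2_T, n, k, T):
    """Cp for a model with k independent variables inside a full model with T parameters."""
    return (1 - r2_k) * (n - T) / (1 - r2_T) - (n - 2 * (k + 1))

# Sanity check: for the full model itself, k + 1 = T and r2_k = r2_T, so Cp = T = k + 1.
```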

10 Steps in Model Building

1. Choose explanatory variables to include in the model
2. Estimate the full model and check VIFs
3. Check if any VIFs > 5
4. If no VIF > 5, go to step 5; if exactly one VIF > 5, remove that variable; if more than one, eliminate the variable with the highest VIF and go back to step 2 (a VIF-check sketch appears after step 10)
5. Perform best subsets regression with the remaining variables

10 Steps in Model Building
(continued)

6. List all models with Cp close to or less than (k + 1)
7. Choose the best model (consider parsimony: do the extra variables make a significant contribution?)
8. Perform a complete analysis with the chosen model, including residual analysis
9. Transform the model if necessary to deal with violations of linearity or other model assumptions
10. Use the model for prediction
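
Steps 2-4 as a sketch, using the variance_inflation_factor helper from statsmodels (pie is the data frame from the earlier fitting sketch):

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(pie[["price", "advertising"]])
for i, name in enumerate(X.columns):
    if name != "const":  # skip the intercept column
        print(name, round(variance_inflation_factor(X.values, i), 2))
# Any VIF > 5 flags collinearity: drop the worst offender and refit (steps 3-4).
```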

Model Building Flowchart

1. Choose X1, X2, …, Xk
2. Run the regression to find VIFs
3. Any VIF > 5?
   - Yes, more than one: remove the variable with the highest VIF and return to step 2
   - Yes, exactly one: remove this X and return to step 2
   - No: continue to step 4
4. Run subsets regression to obtain the best models in terms of Cp
5. Do a complete analysis; add quadratic terms and/or transform variables as indicated
6. Perform predictions
