
# STAT 512 Homework #6 Solution

1. (a)

It appears that the lines have both different intercepts and different slopes. Since the latter represents the
relationship between selling price and assessed value, we suspect this relationship depends on whether a
house is located on a corner lot. But since there are fewer data points for houses not on corner lots, we
cannot know for sure just by examining the scatter plot.
(b)

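Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 3 | 4237.05022 | 1412.35007 | 93.21 | <.0001 |
| Error | 60 | 909.10463 | 15.15174 | | |
| Corrected Total | 63 | 5146.15484 | | | |

| Statistic | Value |
|---|---|
| Root MSE | 3.89252 |
| Dependent Mean | 79.02344 |
| Coeff Var | 4.92578 |
| R-Square | 0.8233 |
| Adj R-Sq | 0.8145 |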

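Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| | Type I SS | Type II SS | 95% CL Lower | 95% CL Upper |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 1 | -126.90517 | 14.72247 | -8.62 | <.0001 | 399661 | 1125.79655 | -156.35450 | -97.45585 |
| valuation | 1 | 2.77590 | 0.19628 | 14.14 | <.0001 | 3670.90425 | 3030.46056 | 2.38328 | 3.16852 |
| corner | 1 | 76.02153 | 30.13136 | 2.52 | 0.0143 | 453.14744 | 96.44917 | 15.74985 | 136.29322 |
| valuationCorner | 1 | -1.10748 | 0.40554 | -2.73 | 0.0083 | 112.99852 | 112.99852 | -1.91868 | -0.29629 |

i. Full model:
price = -126.90517 + 2.77590(valuation) + 76.02153(corner) - 1.10748(valuation)(corner)
ii. Houses not on a corner (corner = 0):
price = -126.90517 + 2.77590(valuation)
iii. Houses on a corner (corner = 1):
price = -50.88364 + 1.66842(valuation)
Here the intercept is -126.90517 + 76.02153 = -50.88364 and the slope is 2.77590 - 1.10748 = 1.66842.
(c) Write $X_1$ = valuation and $X_2$ = corner, so that $\beta_2$ and $\beta_3$ are the coefficients of $X_2$ and $X_1X_2$. Testing whether the lines are the same is equivalent to the following:
$H_0: \beta_2 = \beta_3 = 0$
$H_a$: not both $\beta_2 = 0$ and $\beta_3 = 0$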

The test statistic is given by

$$
\begin{aligned}
F &= \frac{SSR(X_2, X_1X_2 \mid X_1)/2}{SSE(X_1, X_2, X_1X_2)/60} \\
  &= \frac{[SSE(X_1) - SSE(X_1, X_2, X_1X_2)]/2}{SSE(X_1, X_2, X_1X_2)/60} \\
  &= \frac{(1475.25059 - 909.10463)/2}{909.10463/60} \\
  &= 18.6825,
\end{aligned}
$$

with numerator degrees of freedom $p - q = 2$ and denominator degrees of freedom $n - p = 60$. To control $\alpha$ at level .05, we require $F(.95; 2, 60) = 3.15$. Since $F = 18.6825 > 3.15$ (and since $F > 7.77$, the value corresponding to the 99.9% significance level), we conclude $H_a$ ($p < .001$), that the regression lines are not the same. In other words, the data give strong evidence for the conclusion that the relationship between a house's price and its assessed valuation is not the same for those on corner lots and those not on corner lots (either the slopes, intercepts, or both are different).
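The arithmetic of this general linear test can be checked with a short script. The sketch below is illustrative only: the function name `general_linear_f` and the variable names are assumptions, and the critical value 3.15 is taken from the text rather than computed.

```python
def general_linear_f(sse_reduced, sse_full, df_reduced, df_full):
    """F statistic comparing a reduced model against a full model."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator

# Values from the regression output in part (b), where n = 64.
sse_full = 909.10463                    # SSE(X1, X2, X1X2), 60 df
sse_reduced = 5146.15484 - 3670.90425   # SSE(X1) = SSTO - SSR(X1), 62 df

f_stat = general_linear_f(sse_reduced, sse_full, df_reduced=62, df_full=60)
f_crit = 3.15   # F(.95; 2, 60), as quoted in the text

print(f"F = {f_stat:.4f}, reject H0: {f_stat > f_crit}")
```

The reduced-model SSE is obtained here from the Type I sum of squares for valuation, which is how the value 1475.25059 in the test statistic arises.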
(d) Testing whether the slopes are the same is equivalent to the following:
$H_0: \beta_3 = 0$
$H_a: \beta_3 \neq 0$
The test statistic is given by

$$
\begin{aligned}
F &= \frac{SSR(X_1X_2 \mid X_1, X_2)}{SSE(X_1, X_2, X_1X_2)/60} \\
  &= \frac{SSE(X_1, X_2) - SSE(X_1, X_2, X_1X_2)}{SSE(X_1, X_2, X_1X_2)/60} \\
  &= \frac{1022.10315 - 909.10463}{909.10463/60} \\
  &= 7.4578,
\end{aligned}
$$

with numerator degrees of freedom 1 and denominator degrees of freedom 60. To control $\alpha$ at level .05, we require $F(.95; 1, 60) = 4.00$. Since $F = 7.4578 > 4.00$ (and since $F > 7.08$, the value corresponding to the 99% significance level), we conclude $H_a$ ($p < .01$), that the slopes of the regression lines are not the same. Thus the data give strong evidence for the conclusion that the relationship between a house's price and its assessed valuation is not the same for those on corner lots and those not on corner lots.
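The difference in slopes can be estimated by obtaining a 95% confidence interval for $\beta_3$. With $t(.975; 60) = 2.000$ and the estimate and standard error from the Parameter Estimates table, this gives the limits $-1.10748 \pm 2.000(0.40554)$ for the interval
$-1.9186 \le \beta_3 \le -0.2964$,
which agrees with the 95% confidence limits in the table up to rounding of the t quantile.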
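With a single numerator degree of freedom, the partial F statistic equals the square of the t statistic for the interaction coefficient, and the confidence limits follow directly from the estimate and its standard error. A minimal sketch using the valuationCorner row of the Parameter Estimates table in part (b); the variable names are illustrative, and $t(.975; 60) = 2.000$ is taken from the text rather than computed.

```python
# Values from the valuationCorner row of the Parameter Estimates table.
b3 = -1.10748               # estimated interaction coefficient
se_b3 = 0.40554             # its standard error
type2_ss = 112.99852        # SSR(X1X2 | X1, X2)
mse_full = 909.10463 / 60   # MSE of the full model

f_stat = type2_ss / mse_full   # partial F with (1, 60) degrees of freedom
t_stat = b3 / se_b3            # t statistic for H0: beta3 = 0

t_crit = 2.000                 # t(.975; 60), as quoted in the text
lower = b3 - t_crit * se_b3
upper = b3 + t_crit * se_b3

print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
print(f"95% CI for beta3: ({lower:.4f}, {upper:.4f})")
```

The two statistics differ only in the later decimals because the table reports rounded estimates. The whole interval lies below zero, consistent with rejecting $H_0$.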

2. (a)

$C_p$ criterion.

Of the three models with the smallest $C_p$ value, there is only one for which $C_p < p$ (model 1 below); it is also the one with the smallest $C_p$ value overall, hence the others are both less adequate and more biased.

| Model Ranking | Number of Variables | Variables | $R^2$ | Adjusted $R^2$ | $C_p$ |
|---|---|---|---|---|---|
| 1 | 7 | $X_1, X_2, X_3, X_1^2, X_3^2, X_1X_2, X_1X_3$ | 0.6437 | 0.6096 | 6.4118 |
| 2 | 4 | $X_1, X_2, X_3, X_1^2$ | 0.6131 | 0.5927 | 6.5641 |
| 3 | 5 | $X_1, X_2, X_3, X_1^2, X_1X_2$ | 0.6220 | 0.5968 | 6.7706 |

Here $X_1$ = age, $X_2$ = expenses, $X_3$ = area.

The plot below shows only these three models for clarity.
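The comparison of $C_p$ with $p$ can be made explicit. In the sketch below, $p$ counts the intercept along with the predictors (an assumption, consistent with the usual definition of Mallows' $C_p$); the dictionary layout and names are illustrative.

```python
# (number of predictors, Cp) for the three models ranked by Cp in part 2(a).
models = {
    "model 1": (7, 6.4118),
    "model 2": (4, 6.5641),
    "model 3": (5, 6.7706),
}

for name, (k, cp) in models.items():
    p = k + 1   # parameters including the intercept (assumed convention)
    print(f"{name}: p = {p}, Cp = {cp}, Cp < p: {cp < p}")

# Models whose Cp falls below p, suggesting little bias.
little_bias = [name for name, (k, cp) in models.items() if cp < k + 1]
```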

$R^2$ criterion.

From the plot of $R^2$ values for all possible models, it appears that including more than four variables results in only a small increase in the maximum $R^2$. Thus one possibility for the best three models using this criterion is to select the two best four-variable models and the single best three-variable model, since it is almost as good as the four-variable models. Using this criterion is inherently subjective, but model 1 here is the same as model 2 in the $C_p$ selection.

| Model Ranking | Number of Variables | Variables | $R^2$ |
|---|---|---|---|
| 1 | 4 | $X_1, X_2, X_3, X_1^2$ | 0.6131 |
| 2 | 4 | $X_1, X_2, X_3, X_1X_2$ | 0.6027 |
| 3 | 3 | $X_1, X_2, X_3$ | 0.5830 |

The adjusted $R^2$ criterion gives an absolute ranking, so the three best models are simply those with the highest adjusted $R^2$. Model 1 here matches model 1 from the $C_p$ selection.

| Model Ranking | Number of Variables | Variables | Adjusted $R^2$ |
|---|---|---|---|
| 1 | 7 | $X_1, X_2, X_3, X_1^2, X_3^2, X_1X_2, X_1X_3$ | 0.6096 |
| 2 | 8 | $X_1, X_2, X_3, X_1^2, X_2^2, X_3^2, X_1X_2, X_1X_3$ | 0.6053 |
| 3 | 8 | $X_1, X_2, X_3, X_1^2, X_3^2, X_1X_2, X_1X_3, X_2X_3$ | 0.6042 |

While this criterion penalizes excess parameters and identifies an ostensible best model, it is better to use it as part of a more heuristic method, as with the $R^2$ criterion. The plot helps us judge the point where i) the subset of variables produces a reasonably high adjusted $R^2$ and ii) adding variables does little to improve the fit.
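The penalty for excess parameters can be seen by recomputing adjusted $R^2$ from $R^2$ for the three models ranked by $C_p$ in part (a). This is a sketch under stated assumptions: the sample size $n = 81$ is inferred from the Corrected Total DF of 80 in the ANOVA table of part (d), and the function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p parameters (including the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

n = 81   # assumed: Corrected Total DF = 80 in the ANOVA table of part (d)

# (number of predictors, R^2, reported adjusted R^2)
reported = [
    (7, 0.6437, 0.6096),
    (4, 0.6131, 0.5927),
    (5, 0.6220, 0.5968),
]

for k, r2, adj in reported:
    computed = adjusted_r2(r2, n, p=k + 1)
    print(f"{k} predictors: computed {computed:.4f}, reported {adj:.4f}")
```

Small discrepancies in the fourth decimal place are expected because the inputs are rounded.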
(b) Forward stepwise regression selects the four-variable model with $X_1, X_2, X_3, X_1^2$. This model was also selected by the $C_p$ and $R^2$ criteria.
(c) Forward selection produces the same seven-variable model as the $C_p$ and adjusted $R^2$ criteria. Backward elimination results in the four-variable model in part (b).
(d) The simple three-variable model with $X_1, X_2, X_3$ seems to be the best compromise. It apparently does not have as much explanatory power as the four- or seven-variable models, but it is almost as good; further, it avoids interpretive problems, and is no worse in terms of violating assumptions than the larger models.
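Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 3 | 137.90716 | 45.96905 | 35.88 | <.0001 |
| Error | 77 | 98.65034 | 1.28117 | | |
| Corrected Total | 80 | 236.55750 | | | |

| Statistic | Value |
|---|---|
| Root MSE | 1.13189 |
| Dependent Mean | 15.13889 |
| Coeff Var | 7.47670 |
| R-Square | 0.5830 |
| Adj R-Sq | 0.5667 |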

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1 | 15.13889 | 0.12577 | 120.37 | <.0001 |
| ageStd | 1 | -0.14416 | 0.02092 | -6.89 | <.0001 |
| expensesStd | 1 | 0.26717 | 0.05729 | 4.66 | <.0001 |
| areaStd | 1 | 0.00000818 | 0.00000131 | 6.27 | <.0001 |

The fit of the three-variable model is almost as good as that of the larger ones. For example, the seven-variable model includes over twice as many variables, yet it improves adjusted $R^2$ by only about 8% in relative terms (from 0.5667 to 0.6096).
(e)
Correlation.

| Variable | ageStd | expensesStd | areaStd | rate |
|---|---|---|---|---|
| ageStd | 1.0000 | 0.3888 | 0.2886 | -0.2503 |
| expensesStd | 0.3888 | 1.0000 | 0.4407 | 0.4138 |
| areaStd | 0.2886 | 0.4407 | 1.0000 | 0.5353 |
| rate | -0.2503 | 0.4138 | 0.5353 | 1.0000 |

This model eliminates the correlation between age and age$^2$. The variable age$^2$ had the lowest t-value, and dropping it did not affect the significance of age. Still, the correlation between area and expenses is a concern.
Normality of Residuals.

The residuals do not indicate any gross departures from normality. The plot of residuals vs. area may
have some outliers, and residuals vs. age appear to be divided between high and low values. One way
to deal with this may be to convert age to a categorical variable.