Stat 512 Homework 6 Solution

1. (a)
It appears that the lines have both different intercepts and different slopes. Since the latter represents the
relationship between selling price and assessed value, we suspect this relationship depends on whether a
house is located on a corner lot. But since there are fewer data points for houses not on corner lots, we
cannot know for sure just by examining the scatter plot.
(b)
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        4237.05022     1412.35007      93.21    <.0001
Error              60         909.10463       15.15174
Corrected Total    63        5146.15484

Root MSE           3.89252    R-Square    0.8233
Dependent Mean    79.02344    Adj R-Sq    0.8145
Coeff Var          4.92578

Parameter Estimates

Variable           DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept           1            -126.90517          14.72247      -8.62      <.0001
valuation           1               2.77590           0.19628      14.14      <.0001
corner              1              76.02153          30.13136       2.52      0.0143
valuationCorner     1              -1.10748           0.40554      -2.73      0.0083

Variable            Type I SS     Type II SS      95% Confidence Limits
Intercept              399661     1125.79655     -156.35450     -97.45585
valuation          3670.90425     3030.46056        2.38328       3.16852
corner              453.14744       96.44917       15.74985     136.29322
valuationCorner     112.99852      112.99852       -1.91868      -0.29629

i. Full model:
price = -126.90517 + 2.77590(valuation) + 76.02153(corner) - 1.10748(valuation)(corner)
ii. Houses not on a corner (corner = 0):
price = -126.90517 + 2.77590(valuation)
iii. Houses on a corner (corner = 1):
price = -50.88364 + 1.66842(valuation),
since the intercept is -126.90517 + 76.02153 = -50.88364 and the slope is 2.77590 - 1.10748 = 1.66842.
(c) Testing whether the lines are the same is equivalent to the following:
$H_0: \beta_2 = \beta_3 = 0$
$H_a$: not both $\beta_2 = 0$ and $\beta_3 = 0$
where $\beta_2$ and $\beta_3$ are the coefficients of corner ($X_2$) and of the valuation-by-corner interaction
($X_1 X_2$).
The test statistic is given by

\[
F = \frac{SSR(X_2, X_1 X_2 \mid X_1)/2}{SSE(X_1, X_2, X_1 X_2)/60}
  = \frac{[SSE(X_1) - SSE(X_1, X_2, X_1 X_2)]/2}{SSE(X_1, X_2, X_1 X_2)/60}
  = \frac{(1475.25059 - 909.10463)/2}{909.10463/60}
  = 18.6825,
\]

with numerator degrees of freedom $p - q = 2$ and denominator degrees of freedom $n - p = 60$. To control
$\alpha$ at level .05, we require $F(.95; 2, 60) = 3.15$. Since F = 18.6825 > 3.15 (and since F > 7.77, the value
of $F(.999; 2, 60)$), we conclude $H_a$ (P-value < .001): the regression lines are not the same.
In other words, the data give strong evidence for the conclusion that the relationship between
a house's price and its assessed valuation is not the same for those on corner lots and those not on corner
lots (either the slopes, intercepts, or both are different).
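The arithmetic of the test in part (c) can be checked with a short script. This is a minimal sketch in Python (the output in this solution comes from SAS), and the function names are illustrative rather than part of the original solution. It relies on the fact that an F distribution with exactly 2 numerator degrees of freedom has a closed-form upper tail, so no statistics library is needed. That shortcut does not carry over to other numerator degrees of freedom.

```python
# Sums of squares quoted in the solution (SAS output for part (b)).
sse_reduced = 1475.25059   # SSE(X1): valuation only
sse_full = 909.10463       # SSE(X1, X2, X1X2): full model
df_num = 2                 # parameters set to zero under H0 (beta2, beta3)
df_den = 60                # error degrees of freedom of the full model


def general_linear_f(sse_r, sse_f, df_n, df_d):
    """General linear test statistic: [(SSE_R - SSE_F)/df_n] / [SSE_F/df_d]."""
    return ((sse_r - sse_f) / df_n) / (sse_f / df_d)


def f_upper_tail_df1_2(f_value, df_d):
    """P(F > f_value) for F(2, df_d); valid only for 2 numerator df."""
    return (1.0 + 2.0 * f_value / df_d) ** (-df_d / 2.0)


def f_quantile_df1_2(prob, df_d):
    """Quantile of F(2, df_d), obtained by inverting the tail formula above."""
    return (df_d / 2.0) * ((1.0 - prob) ** (-2.0 / df_d) - 1.0)


f_stat = general_linear_f(sse_reduced, sse_full, df_num, df_den)
p_value = f_upper_tail_df1_2(f_stat, df_den)
crit_95 = f_quantile_df1_2(0.95, df_den)
crit_999 = f_quantile_df1_2(0.999, df_den)

print(f"F              = {f_stat:.4f}")
print(f"F(.95; 2, 60)  = {crit_95:.2f}")
print(f"F(.999; 2, 60) = {crit_999:.2f}")
print(f"P-value        = {p_value:.2e}")
```

The computed statistic and both critical values should agree with the figures quoted above to the precision shown there.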
(d) Testing whether the slopes are the same is equivalent to the following:
$H_0: \beta_3 = 0$
$H_a: \beta_3 \neq 0$
The test statistic is given by

\[
F = \frac{SSR(X_1 X_2 \mid X_1, X_2)/1}{SSE(X_1, X_2, X_1 X_2)/60}
  = \frac{SSE(X_1, X_2) - SSE(X_1, X_2, X_1 X_2)}{SSE(X_1, X_2, X_1 X_2)/60}
  = \frac{1022.10315 - 909.10463}{909.10463/60}
  = 7.4578,
\]

with numerator degrees of freedom 1 and denominator degrees of freedom 60. To control $\alpha$ at level .05,
we require $F(.95; 1, 60) = 4.00$. Since F = 7.4578 > 4.00 (and since F > 7.08, the value of
$F(.99; 1, 60)$), we conclude $H_a$ (P-value < .01): the slopes of the regression lines are not
the same. Thus the data give strong evidence for the conclusion that the relationship between a house's
price and its assessed valuation is not the same for those on corner lots and those not on corner lots.
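The difference in slopes can be estimated by obtaining a 95% confidence interval for $\beta_3$. With
$t(.975; 60) = 2.000$ and the estimate and standard error of valuationCorner from the parameter estimates
in part (b), this gives the limits $-1.10748 \pm 2.000(0.40554)$, for the interval
$-1.9186 \le \beta_3 \le -0.2964$,
which agrees with the 95% confidence limits reported in part (b) up to rounding of the t value.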
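As a cross-check on part (d), the sketch below recomputes the F statistic from the two error sums of squares and compares it with the square of the t statistic for valuationCorner. The two should match, because a one-degree-of-freedom partial F test is equivalent to the two-sided t test. It also forms a 95% interval from the estimate and standard error in the parameter estimates of part (b), using the rounded value t(.975; 60) = 2.000 quoted in the text. The variable names are illustrative only.

```python
# Values quoted in the solution (SAS output for part (b)).
sse_additive = 1022.10315   # SSE(X1, X2): model without the interaction
sse_full = 909.10463        # SSE(X1, X2, X1X2): full model
df_error = 60               # error degrees of freedom of the full model
b3 = -1.10748               # estimate for valuationCorner
se_b3 = 0.40554             # its standard error
t_crit = 2.000              # t(.975; 60), rounded as in the text

# Partial F test for the single interaction term (1 numerator df).
mse_full = sse_full / df_error
f_stat = (sse_additive - sse_full) / mse_full

# Equivalent t statistic; t^2 should equal F up to rounding of the inputs.
t_stat = b3 / se_b3

# 95% confidence interval for the difference in slopes.
ci_low = b3 - t_crit * se_b3
ci_high = b3 + t_crit * se_b3

print(f"F   = {f_stat:.4f}")
print(f"t   = {t_stat:.2f}, t^2 = {t_stat ** 2:.4f}")
print(f"95% CI for beta3: ({ci_low:.4f}, {ci_high:.4f})")
```

Because the whole interval lies below zero, it leads to the same conclusion as the F test: the slope for corner lots is estimated to be smaller than the slope for other lots.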
2. (a)
Cp criterion.
Of the three models with the smallest Cp values, there is only one for which Cp < p, where p is the number
of parameters including the intercept (model 1 below, with Cp = 6.4118 < p = 8); it is also the one with the
smallest Cp value overall. The other two have Cp > p (6.5641 > 5 and 6.7706 > 6), hence they are both less
adequate and more biased.

Model Ranking    Number of Variables    Variables                             R^2       Adj. R^2    Cp
1                7                      X1, X2, X3, X1^2, X3^2, X1X2, X1X3    0.6437    0.6096      6.4118
2                4                      X1, X2, X3, X1^2                      0.6131    0.5927      6.5641
3                5                      X1, X2, X3, X1^2, X1X2                0.6220    0.5968      6.7706

Note: X1 = age, X2 = expenses, X3 = area.

The plot below shows only these three models for clarity.
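The comparison of Cp with p for the three models listed above can be written out explicitly. The sketch below is only an illustration. It assumes the usual convention that p counts the intercept, so p is the number of variables plus one, and the helper name `cp_summary` is hypothetical.

```python
# (number of predictor variables, Cp) for the three models listed above.
models = {
    "model 1": (7, 6.4118),
    "model 2": (4, 6.5641),
    "model 3": (5, 6.7706),
}


def cp_summary(num_predictors, cp):
    """Return p (parameters including the intercept) and the gap Cp - p."""
    p = num_predictors + 1
    return p, cp - p


for name, (k, cp) in models.items():
    p, gap = cp_summary(k, cp)
    verdict = "Cp < p" if gap < 0 else "Cp > p"
    print(f"{name}: p = {p}, Cp = {cp:.4f}, Cp - p = {gap:+.4f} ({verdict})")
```

Only model 1 has a negative gap. A positive gap is conventionally read as a sign of bias in the fitted model.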
R^2 criterion.
From the plot of R^2 values for all possible models, it appears that including more than four variables
results in only a small increase in the maximum R^2. Thus one possibility for the best three models using
this criterion is to select the two best four-variable models and the single best three-variable model, since
it is almost as good as the four-variable models. Using this criterion is inherently subjective, but
model 1 here is the same as model 2 in the Cp selection.

Model Ranking    Number of Variables    Variables            R^2
1                4                      X1, X2, X3, X1^2     0.6131
2                4                      X1, X2, X3, X1X2     0.6027
3                3                      X1, X2, X3           0.5830

Adjusted R^2 criterion.
The adjusted R^2 criterion gives an absolute ranking, so the three best models are simply those with
the highest adjusted R^2. Model 1 here matches model 1 from the Cp selection.

Model Ranking    Number of Variables    Variables                                     Adj. R^2
1                7                      X1, X2, X3, X1^2, X3^2, X1X2, X1X3            0.6096
2                8                      X1, X2, X3, X1^2, X2^2, X3^2, X1X2, X1X3      0.6053
3                8                      X1, X2, X3, X1^2, X3^2, X1X2, X1X3, X2X3      0.6042

While this criterion penalizes excess parameters and identifies an ostensible best model, it is better to
use it as part of a more heuristic method, as with the R^2 criterion. The plot helps us judge the point
where i) the subset of variables produces a reasonably high adjusted R^2 and ii) adding variables does
little to improve the fit.
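To see how the penalty for extra parameters works, the adjusted R^2 values reported next to R^2 in the Cp table can be reproduced from the standard formula 1 - (1 - R^2)(n - 1)/(n - p). The sketch below is a hedged check, not the method used to produce the SAS output. It assumes n = 81 observations, inferred from the corrected total of 80 degrees of freedom in the ANOVA table of part (d), and the function name is illustrative.

```python
def adjusted_r2(r2, n, num_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p), with p counting the intercept."""
    p = num_predictors + 1
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)


n = 81  # assumption: corrected total DF of 80 in the part (d) ANOVA table

# (number of predictors, R^2, adjusted R^2 as reported in the solution)
reported = [
    (7, 0.6437, 0.6096),
    (4, 0.6131, 0.5927),
    (5, 0.6220, 0.5968),
    (3, 0.5830, 0.5667),
]

for k, r2, adj_reported in reported:
    adj = adjusted_r2(r2, n, k)
    print(f"{k} variables: computed {adj:.4f}, reported {adj_reported:.4f}")
```

Small differences in the last digit are expected, because the R^2 inputs are themselves rounded to four decimals.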
(b) Forward stepwise regression selects the four-variable model with X1, X2, X3, X1^2. This model was also
selected by the Cp and R^2 criteria.
(c) Forward selection produces the same seven-variable model as the Cp and adjusted R^2 criteria. Backward
elimination results in the four-variable model in part (b).
(d) The simple three-variable model with X1, X2, X3 seems to be the best compromise. It apparently does not
have as much explanatory power as the four- or seven-variable models, but it is almost as good; further,
it avoids interpretive problems, and is no worse in terms of violating assumptions than the larger models.
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3         137.90716       45.96905      35.88    <.0001
Error              77          98.65034        1.28117
Corrected Total    80         236.55750

Root MSE           1.13189    R-Square    0.5830
Dependent Mean    15.13889    Adj R-Sq    0.5667
Coeff Var          7.47670

Parameter Estimates

Variable       DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept       1              15.13889           0.12577     120.37      <.0001
ageStd          1              -0.14416           0.02092      -6.89      <.0001
expensesStd     1               0.26717           0.05729       4.66      <.0001
areaStd         1            0.00000818        0.00000131       6.27      <.0001

The fit of the three-variable model is almost as good as that of the larger ones. For example, the
seven-variable model includes over twice as many variables, yet it raises adjusted R^2 only from 0.5667 to
0.6096, a relative improvement of about 8%.
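The size of that improvement can be stated in both absolute and relative terms. The short sketch below uses the adjusted R^2 values reported in this solution (0.5667 for the three-variable model and 0.6096 for the seven-variable model). It is only an arithmetic illustration.

```python
adj_r2_three = 0.5667   # three-variable model (part (d) output)
adj_r2_seven = 0.6096   # seven-variable model (Cp and adjusted R^2 tables)

absolute_gain = adj_r2_seven - adj_r2_three
relative_gain = absolute_gain / adj_r2_three

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {100 * relative_gain:.1f}%")
```

The relative figure is the one that rounds to about 8%; in absolute terms the gain is a little over four percentage points of adjusted R^2.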
(e)
Correlation.

Variable        ageStd     expensesStd    areaStd    rate
ageStd          1.0000     0.3888         0.2886     -0.2503
expensesStd     0.3888     1.0000         0.4407      0.4138
areaStd         0.2886     0.4407         1.0000      0.5353
rate           -0.2503     0.4138         0.5353      1.0000

This model eliminates the correlation between age and age^2. The variable age^2 had the lowest t-value,
and dropping it did not affect the significance of age. Still, the correlation between area and expenses
(0.4407) is a concern.
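One common way to gauge how much these pairwise correlations matter is the variance inflation factor (VIF), which for standardized predictors is the corresponding diagonal element of the inverse of the predictor correlation matrix. The solution itself does not report VIFs, so the sketch below is an added illustration under that assumption. It uses only the three predictor correlations from the table above, and the function name is hypothetical.

```python
# Pairwise correlations among the three predictors, from the correlation table.
r_age_exp = 0.3888
r_age_area = 0.2886
r_exp_area = 0.4407


def vifs_three_predictors(r12, r13, r23):
    """VIFs for three predictors: diagonal of the inverse 3x3 correlation matrix.

    Each diagonal element of the inverse is the matching cofactor divided by
    the determinant of the correlation matrix.
    """
    det = 1.0 + 2.0 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2
    vif1 = (1.0 - r23 ** 2) / det
    vif2 = (1.0 - r13 ** 2) / det
    vif3 = (1.0 - r12 ** 2) / det
    return vif1, vif2, vif3


vif_age, vif_exp, vif_area = vifs_three_predictors(r_age_exp, r_age_area, r_exp_area)

print(f"VIF(age)      = {vif_age:.3f}")
print(f"VIF(expenses) = {vif_exp:.3f}")
print(f"VIF(area)     = {vif_area:.3f}")
```

All three values come out well below 2, which by this measure suggests only mild variance inflation. The caution about interpreting the area and expenses coefficients separately still applies.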
Normality of Residuals.
The residuals do not indicate any gross departures from normality. The plot of residuals vs. area may
have some outliers, and residuals vs. age appear to be divided between high and low values. One way
to deal with this may be to convert age to a categorical variable.
The histogram and QQ Plot look normal.
