
Examples showing what happens when you have omitted variables or irrelevant variables


All of these examples use artificial (that is, made-up) data.[1] The dependent and independent
variables are as follows:

$Y_i$ = kidiq, the IQ of the child
$X_{1i}$ = momiq, the IQ of the child's mother
$X_{2i}$ = dadiq, the IQ of the child's father

Section I: Omitted Variables
Case I: An omitted variable strongly correlated with an included variable
In the first example, the truth is that both dadiq and momiq matter, each with a true coefficient
(beta) of 0.3. The two explanatory variables, dadiq and momiq, are highly correlated with one
another. Here is the result of estimating the true model, containing both independent variables.
Our focus is going to be on $\hat{\beta}_1$, the estimated coefficient of momiq.
. regress kidiq momiq dadiq

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  2,    97) =   18.56
       Model |  7097.20748     2  3548.60374           Prob > F      =  0.0000
    Residual |  18544.7424    97  191.182911           R-squared     =  0.2768
-------------+------------------------------           Adj R-squared =  0.2619
       Total |  25641.9498    99  259.009594           Root MSE      =  13.827

------------------------------------------------------------------------------
       kidiq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       momiq |   .3799036   .1555899     2.44   0.016     .0711008    .6887065
       dadiq |   .2149943   .1500841     1.43   0.155     -.082881    .5128697
       _cons |   39.40444   9.992795     3.94   0.000     19.57151    59.23737
------------------------------------------------------------------------------

The estimated coefficients differ, of course, from the true values of 0.3, but the estimated
coefficients are not terribly far from the true coefficients, and the true coefficients are well inside
their 95% confidence intervals.
Now suppose that, because mothers are traditionally the primary caregivers, you began the
exercise uncertain whether dadiq would matter. You note the low t-value on dadiq and decide to
omit it. In short, you estimate a misspecified regression that suffers from omitted variable bias.

[1] The dataset is FakeIQ.dta.

. regress kidiq momiq


      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  1,    98) =   34.70
       Model |  6704.89339     1  6704.89339           Prob > F      =  0.0000
    Residual |  18937.0565    98   193.23527           R-squared     =  0.2615
-------------+------------------------------           Adj R-squared =  0.2539
       Total |  25641.9498    99  259.009594           Root MSE      =  13.901

------------------------------------------------------------------------------
       kidiq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       momiq |   .5573807   .0946235     5.89   0.000     .3696034    .7451581
       _cons |      43.97   9.521605     4.62   0.000     25.07469    62.86532
------------------------------------------------------------------------------

The coefficient of momiq is now much larger than its true value of 0.3, and the 95% confidence
interval doesn't even contain the true value. Because momiq is strongly positively correlated
with dadiq, the momiq coefficient is biased in a positive direction. Smart moms are married to
smart dads, and the double boost to kidiq is entirely attributed to momiq when dadiq is dropped
from the regression. If you compare the standard error of $\hat{\beta}_1$, the momiq coefficient,
in the two equations, you will see that it falls from .1556 in the correctly specified equation to
.0946 in the incorrectly specified one. The standard error, you may recall, tells you how precise
the coefficient estimate is, and generally a small standard error is better than a large one, so
you may mistakenly think the misspecified equation is estimating $\beta_1$ more precisely. The
greater precision is an illusion and a consequence of the misspecification. This is obvious from
how far the point estimate and confidence interval are from the true value of 0.3. In the correct
specification, we were asking the computer to determine the marginal effect of momiq holding
dadiq constant. Since smart moms and smart dads go together in the data, this is a hard question,
which is reflected in the larger standard error. When we estimate the misspecified equation, we
have preemptively decided that only momiq matters by leaving out dadiq. If you start from the
presumption that dadiq is irrelevant, and you aren't trying to hold dadiq constant, the question
is intrinsically easier, and the standard error is therefore smaller.
There is an interesting relationship among the coefficients. Let's run an auxiliary regression
explaining dadiq with momiq.
. regress dadiq momiq

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  1,    98) =  169.81
       Model |  14706.8117     1  14706.8117           Prob > F      =  0.0000
    Residual |  8487.50103    98  86.6071534           R-squared     =  0.6341
-------------+------------------------------           Adj R-squared =  0.6303
       Total |  23194.3127    99  234.285987           Root MSE      =  9.3063

------------------------------------------------------------------------------
       dadiq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       momiq |   .8254966    .063348    13.03   0.000     .6997846    .9512086
       _cons |   21.23573   6.374467     3.33   0.001     8.585807    33.88565
------------------------------------------------------------------------------

Let's call the intercept and slope of this regression $\hat{\alpha}_0 = 21.23573$ and
$\hat{\alpha}_1 = .8254966$.

If we call the estimated coefficients in the regression of kidiq on momiq and dadiq
$\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$, and call the estimated coefficients in the
regression of kidiq on momiq alone $\tilde{\beta}_0$, $\tilde{\beta}_1$, then it is true that

$$\tilde{\beta}_0 = \hat{\beta}_0 + \hat{\beta}_2 \hat{\alpha}_0 = 39.4044 + (.2149943)(21.23573) = 43.97$$

$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \hat{\alpha}_1 = .3799036 + (.2149943)(.8254966) = 0.55738$$

This shows how the estimated slope coefficient in the correctly specified model, $\hat{\beta}_1$,
and the slope coefficient in the incorrectly specified model, $\tilde{\beta}_1$, differ by an
amount equal to $\hat{\beta}_2 \hat{\alpha}_1$, a magnitude that is greatest when the omitted
variable is important in the original regression (that is, large $\hat{\beta}_2$) and the included
variable is important in predicting the excluded variable (that is, large $\hat{\alpha}_1$). Not
only is the estimated coefficient of momiq in the misspecified model biased, it is not even
consistent. Even an infinitely large sample would not yield an estimate close to 0.3.
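
The arithmetic behind this relationship can be checked directly from the reported output. The sketch below is only an illustration: the variable names are made up here, and the numbers are copied from the three Stata tables above. It rebuilds the intercept and slope of the short regression (kidiq on momiq alone) from the full regression and the auxiliary regression.

```python
# Coefficients copied from the Stata output above (variable names are illustrative).
b0_full, b1_full, b2_full = 39.40444, 0.3799036, 0.2149943  # kidiq on momiq and dadiq
a0_aux, a1_aux = 21.23573, 0.8254966                        # dadiq on momiq (auxiliary)
b0_short, b1_short = 43.97, 0.5573807                       # kidiq on momiq only

# Short-regression coefficients implied by the full and auxiliary regressions.
implied_intercept = b0_full + b2_full * a0_aux
implied_slope = b1_full + b2_full * a1_aux

print(f"intercept: implied {implied_intercept:.4f}, reported {b0_short}")
print(f"slope:     implied {implied_slope:.7f}, reported {b1_short}")
```

Any small discrepancy between the implied and reported values comes from the rounding of the printed coefficients, not from the relationship itself.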
Case II: An omitted variable (practically) uncorrelated with included variables.
This algebra suggests that omitted variables are less problematic when the variable you
omit is uncorrelated with the included variables, and therefore has a negligible slope
$\hat{\alpha}_1$ in the auxiliary regression. Here is an example. As before, the truth is that
both momiq and dadiq matter, and each has a true beta of 0.3. Here, however, the two variables
are practically uncorrelated. We begin with the correctly specified model.
. regress kidiq3 momiq dadiq2

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  2,    97) =   11.08
       Model |  3139.11495     2  1569.55748           Prob > F      =  0.0000
    Residual |  13740.4033    97  141.653642           R-squared     =  0.1860
-------------+------------------------------           Adj R-squared =  0.1692
       Total |  16879.5182    99  170.500184           Root MSE      =  11.902

------------------------------------------------------------------------------
      kidiq3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       momiq |    .229673   .0810167     2.83   0.006     .0688773    .3904686
      dadiq2 |   .3096717   .0826819     3.75   0.000     .1455711    .4737723
       _cons |   46.94512   11.66244     4.03   0.000     23.79841    70.09183
------------------------------------------------------------------------------

The 95% confidence interval for each coefficient brackets its true value of 0.3. Suppose for
some reason you estimated the regression without dadiq2, omitting this relevant variable.

. regress kidiq3 momiq


      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  1,    98) =    7.18
       Model |  1152.05868     1  1152.05868           Prob > F      =  0.0087
    Residual |  15727.4596    98  160.484281           R-squared     =  0.0683
-------------+------------------------------           Adj R-squared =  0.0587
       Total |  16879.5182    99  170.500184           Root MSE      =  12.668

------------------------------------------------------------------------------
      kidiq3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       momiq |   .2310433   .0862328     2.68   0.009     .0599172    .4021694
       _cons |    78.1805   8.677272     9.01   0.000     60.96074    95.40026
------------------------------------------------------------------------------

The momiq variable's coefficient has barely changed between the correctly and incorrectly
specified regressions. Here is the auxiliary regression.
. regress dadiq2 momiq

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  1,    98) =    0.00
       Model |  .422582718     1  .422582718           Prob > F      =  0.9644
    Residual |  20720.8297    98  211.437037           R-squared     =  0.0000
-------------+------------------------------           Adj R-squared = -0.0102
       Total |  20721.2522    99  209.305578           Root MSE      =  14.541

------------------------------------------------------------------------------
      dadiq2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       momiq |    .004425   .0989798     0.04   0.964    -.1919972    .2008472
       _cons |   100.8661   9.959958    10.13   0.000     81.10091    120.6313
------------------------------------------------------------------------------

You can see that $\hat{\alpha}_1$, the slope of the auxiliary regression, is tiny, because momiq
and dadiq2 have only a negligible linear association. Therefore $\hat{\beta}_2 \hat{\alpha}_1$
(the dadiq2 coefficient from the full regression times the auxiliary slope), which is the
difference between the estimated momiq coefficient in the correctly and incorrectly specified
models, is also negligible. If the sample correlation of momiq and dadiq2 were exactly zero,
$\hat{\alpha}_1$ would be zero, and there would be no difference at all. Even though the
coefficient of momiq is essentially unchanged, there is a noticeable shift in its estimated
standard error. In the correctly specified equation it is 0.0810167, and in the incorrectly
specified equation it is greater, 0.0862328. This is the reverse of our first example, where the
standard error was smaller in the incorrectly specified equation. The improved standard error in
the first example came about because ignoring the role of dadiq gave an inflated idea of the
importance and statistical significance of momiq. Because momiq and dadiq were strongly
correlated, the momiq variable picked up the influence of the omitted dadiq variable. In the
Case II example, the momiq variable is practically uncorrelated with the dadiq2 variable, so the
momiq variable doesn't pick anything up. On the other hand, the misspecified regression in
Case II missed a chance to explain an important bit of variation arising from dadiq2. As a
result, the variability arising from dadiq2 was dropped into the error term of the misspecified
regression, raising the estimated variance of the error term from 141.653642 in the correctly
specified model to 160.484281 in the incorrectly specified model. The more variable error term
in the misspecified model led to a bigger variance (or equivalently, standard error) in the
estimated slope coefficient.
If the sample correlation between dadiq2 and momiq were precisely zero, the estimated
coefficient of momiq in the misspecified model would still be unbiased and consistent, although
it would not be efficient.
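
The two opposite movements in the standard error of the momiq coefficient can be reproduced from the printed output. The sketch below relies on the standard textbook formula for a slope's standard error with two regressors, which the handout does not state explicitly, so treat it as an assumption: the standard error is proportional to the root of the estimated error variance and to 1/sqrt(1 - r^2), where r^2 is the squared sample correlation of the two regressors. The function name `se_ratio` is made up for this illustration, and all numbers are copied from the Stata tables above.

```python
import math

def se_ratio(ms_resid_full, ms_resid_short, r2_between_regressors):
    """Ratio se(full) / se(short) for the momiq coefficient.

    ms_resid_*: residual mean squares (estimated error variances) of the two fits.
    r2_between_regressors: R-squared of the auxiliary regression of one
    regressor on the other.
    """
    noise_effect = math.sqrt(ms_resid_full / ms_resid_short)
    collinearity_effect = 1.0 / math.sqrt(1.0 - r2_between_regressors)
    return noise_effect * collinearity_effect

# Case I: dadiq strongly correlated with momiq (auxiliary Model SS / Total SS).
r2_case1 = 14706.8117 / 23194.3127
ratio_case1 = se_ratio(191.182911, 193.23527, r2_case1)
reported_case1 = 0.1555899 / 0.0946235

# Case II: dadiq2 practically uncorrelated with momiq.
r2_case2 = 0.422582718 / 20721.2522
ratio_case2 = se_ratio(141.653642, 160.484281, r2_case2)
reported_case2 = 0.0810167 / 0.0862328

print(f"Case I:  implied {ratio_case1:.4f}, reported {reported_case1:.4f}")
print(f"Case II: implied {ratio_case2:.4f}, reported {reported_case2:.4f}")
```

In Case I the collinearity term dominates, so the full model has the larger standard error. In Case II that term is essentially 1, and the only thing that moves is the estimated error variance, which is smaller in the full model.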

Section II: Irrelevant Variables


Case III: An irrelevant variable strongly correlated with an included variable.
Now let's change the model to one in which dadiq is irrelevant, and the coefficient of momiq is
0.6. We begin by fitting this correctly as a function of only momiq.
. regress kidiq2 momiq

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  1,    98) =   33.57
       Model |  5211.50746     1  5211.50746           Prob > F      =  0.0000
    Residual |  15213.4866    98  155.239659           R-squared     =  0.2552
-------------+------------------------------           Adj R-squared =  0.2476
       Total |   20424.994    99  206.313071           Root MSE      =   12.46

------------------------------------------------------------------------------
      kidiq2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       momiq |   .4914029    .084812     5.79   0.000     .3230962    .6597096
       _cons |   52.62826   8.534308     6.17   0.000      35.6922    69.56431
------------------------------------------------------------------------------

The true value of the coefficient, 0.6, is well within the 95% confidence interval.
Now we add dadiq to the regression; dadiq is an irrelevant variable.
. regress kidiq2 momiq dadiq

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  2,    97) =   17.04
       Model |  5311.40567     2  2655.70283           Prob > F      =  0.0000
    Residual |  15113.5884    97  155.810189           R-squared     =  0.2600
-------------+------------------------------           Adj R-squared =  0.2448
       Total |   20424.994    99  206.313071           Root MSE      =  12.482

------------------------------------------------------------------------------
      kidiq2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       momiq |   .4018449   .1404607     2.86   0.005     .1230693    .6806205
       dadiq |   .1084898   .1354902     0.80   0.425    -.1604208    .3774004
       _cons |    50.3244   9.021118     5.58   0.000     32.41997    68.22882
------------------------------------------------------------------------------

The confidence interval for momiq still contains the true value of 0.6, but the interval is wider
now. The same relationship between coefficients of the simple and multiple regressions that we
outlined earlier still exists. We've got $\hat{\beta}_2$ = .1084898 (the coefficient of dadiq in
the multiple regression), and our $\hat{\alpha}_1$ (the slope of the auxiliary regression of
dadiq on momiq, which was 0.8254966) doesn't change. Therefore the shift in the coefficient of
momiq between the two regressions is the product of these, or 0.08955796. Note that we can't
infer anything about whether dadiq is relevant or irrelevant from the fact that the coefficient
shift obeys this rule. It happens regardless of the true specification.
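
That the rule holds regardless of the true specification can be illustrated with simulated data. The sketch below is not the author's FakeIQ.dta; the data-generating numbers, seeds, and function names are all assumptions chosen for illustration. It fits the short, full, and auxiliary regressions by hand and checks that the gap between the short-regression slope and "full slope plus omitted coefficient times auxiliary slope" is zero up to floating-point error, whether the second regressor is truly irrelevant or not.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simple_ols(y, x):
    """Return (intercept, slope) from regressing y on x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def two_var_ols(y, x1, x2):
    """Return (intercept, b1, b2) from regressing y on x1 and x2."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

def shift_gap(true_b2, n=100, seed=0):
    """Short slope minus (full slope + b2 * auxiliary slope) on simulated data."""
    rng = random.Random(seed)
    x1 = [rng.gauss(100, 15) for _ in range(n)]          # stand-in for momiq
    x2 = [0.8 * a + rng.gauss(20, 9) for a in x1]        # stand-in for dadiq
    y = [40 + 0.6 * a + true_b2 * b + rng.gauss(0, 12) for a, b in zip(x1, x2)]
    _, b1_full, b2_full = two_var_ols(y, x1, x2)
    _, b1_short = simple_ols(y, x1)
    _, a1 = simple_ols(x2, x1)
    return b1_short - (b1_full + b2_full * a1)

gap_irrelevant = shift_gap(true_b2=0.0)   # x2 truly irrelevant
gap_relevant = shift_gap(true_b2=0.3)     # x2 truly relevant
print(gap_irrelevant, gap_relevant)
```

Both gaps are numerically zero because the relationship is an algebraic property of least squares in any sample, which is why it carries no information about whether the extra variable belongs in the model.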

What is different in this case is that the estimator of the coefficient of momiq is still unbiased
and consistent even when the irrelevant variable is present. In fact, the model containing an
irrelevant variable is true in a sense, because it is possible to have $\beta_2 = 0$.
This doesn't imply that one can include irrelevant variables without incurring a cost. Notice how
the standard error of the estimated coefficient of momiq increases when dadiq is added to the
regression (from .085 to .140). The effect is so large because dadiq and momiq are strongly
correlated. If you leave out dadiq you are imposing a zero coefficient on it. In this instance,
that is the correct thing to do, and it permits a more precise estimate of the momiq coefficient.
Including dadiq in the regression leaves it to the data to sort out how much of the effect is due
to momiq and how much is due to dadiq, which is difficult because of the strong correlation of
the two variables; as a result the coefficient of momiq is less precisely estimated.
A second, much smaller effect arose from the fact that the estimated variance of the error term
rose slightly (from 155.2 to 155.8) in the incorrectly specified model. This higher estimated
variance played a minor role in increasing the estimated standard error of the coefficient of
momiq. The estimated variance of the error term only increased, however, by chance. As a
matter of fact, it only happened because dadiq's coefficient had a t-value less than one in
absolute value. If the dadiq coefficient had had a t-value greater than one in absolute value
(something that happens about a third of the time, even when the true coefficient is zero) the
estimated variance of the error term would have decreased by a small amount, slightly offsetting
the first effect.
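
The link between the t-value and the estimated error variance can be verified with the Case III output. The sketch below uses a standard least-squares identity that the handout does not spell out, so treat it as an assumption: when one regressor with t-statistic t is dropped from a model with df residual degrees of freedom, the estimated error variance is multiplied by (df + t^2) / (df + 1). That factor is below one exactly when |t| < 1. The "about a third of the time" figure is approximated here with the normal distribution rather than the exact t distribution with 97 degrees of freedom. Variable names are illustrative.

```python
import math

# Case III, full model (kidiq2 on momiq and dadiq), copied from the output above.
df_full = 97
t_dadiq = 0.1084898 / 0.1354902        # coefficient / standard error, about 0.80
s2_full = 155.810189                   # residual mean square with dadiq included
s2_short_reported = 155.239659         # residual mean square with dadiq excluded

# Error variance of the short model implied by the full model and the t-value.
s2_short_implied = s2_full * (df_full + t_dadiq ** 2) / (df_full + 1)

# Normal approximation to P(|t| > 1) when the true coefficient is zero.
p_abs_t_above_1 = 1.0 - math.erf(1.0 / math.sqrt(2.0))

print(f"implied {s2_short_implied:.4f}, reported {s2_short_reported}")
print(f"approximate P(|t| > 1) under a zero coefficient: {p_abs_t_above_1:.3f}")
```

Because |t| is below one here, the short model has the smaller estimated error variance, matching the rise from 155.2 to 155.8 when dadiq is added.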
Case IV: An irrelevant variable essentially uncorrelated with the included variable.
Once again, the true model is that only momiq matters, and dadiq is an irrelevant variable. The
true coefficient of momiq is 0.6. In this example the irrelevant variable is essentially
uncorrelated. We begin by estimating the true specification.
. regress kidiq4 momiq

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  1,    98) =   85.63
       Model |  9943.72409     1  9943.72409           Prob > F      =  0.0000
    Residual |  11380.7762    98   116.13037           R-squared     =  0.4663
-------------+------------------------------           Adj R-squared =  0.4609
       Total |  21324.5003    99  215.398993           Root MSE      =  10.776

------------------------------------------------------------------------------
      kidiq4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       momiq |   .6787824   .0733549     9.25   0.000      .533212    .8243527
       _cons |   31.65574   7.381419     4.29   0.000     17.00755    46.30393
------------------------------------------------------------------------------

The estimate looks good. The true value of the coefficient is well within the 95% confidence
interval.
Next we estimate the specification including the irrelevant variable.

. regress kidiq4 momiq dadiq2

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------           F(  2,    97) =   43.62
       Model |  10097.0142     2   5048.5071           Prob > F      =  0.0000
    Residual |  11227.4861    97  115.747279           R-squared     =  0.4735
-------------+------------------------------           Adj R-squared =  0.4626
       Total |  21324.5003    99  215.398993           Root MSE      =  10.759

------------------------------------------------------------------------------
      kidiq4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       momiq |   .6784018   .0732345     9.26   0.000     .5330515     .823752
      dadiq2 |   .0860109   .0747398     1.15   0.253    -.0623269    .2343487
       _cons |   22.98015   10.54219     2.18   0.032     2.056822    43.90348
------------------------------------------------------------------------------

Comparing the estimated coefficients of momiq in the correctly and incorrectly specified models
shows that they are almost the same. If the sample correlation between momiq and dadiq2 were
precisely zero in this sample, these two estimated coefficients would be identical. Comparing the
standard errors and confidence intervals shows they barely changed either. Because dadiq2 and
momiq have negligible correlation in these data, and the irrelevance of dadiq2 means its presence
doesn't affect the variance of the error term, it is no harder or easier to estimate the momiq
coefficient holding dadiq2 constant than to estimate it while imposing a zero coefficient on
dadiq2. In this example the t-value on dadiq2 happened to be greater than one, and as a result
the estimated variance of the error term was slightly lower in the misspecified model (115.75)
than in the true model (116.13), which lowered the estimated standard error of the momiq
coefficient by a minuscule amount (from .0734 to .0732). This is possible, as this example
illustrates, but will rarely happen in practice. It only happens when the irrelevant variable has
little or no correlation with included variables and the t-statistic on the irrelevant variable is
greater than one in absolute value. In real-world data, irrelevant variables usually have a
nontrivial correlation with included variables, and two thirds of the time the t-statistic on an
irrelevant variable will be less than one in absolute value. Case III, where including an
irrelevant variable raised the standard error of the coefficient, is by far the more common case
in actual practice.
My version of Table 6.1 follows, summarizing these results.

Table 6.1 Effect of Omitted Variables and Irrelevant Variables on the Coefficient Estimates

Effect on coefficient      | Omitted variable          | Omitted variable          | Irrelevant variable
estimates                  | uncorrelated in sample    | correlated in sample      |
                           | with included variables   | with included variables   |
---------------------------+---------------------------+---------------------------+--------------------
Is there bias?             | No                        | Yes                       | No
Their standard error       | Usually increases         | May increase or decrease  | Usually increases
(variance)                 |                           |                           |
Are estimates consistent?  | Yes                       | No                        | Yes
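
The bias entries for the two omitted-variable columns can be illustrated with a small Monte Carlo experiment. This is a sketch on freshly simulated standardized data, not on FakeIQ.dta: the sample size, number of replications, seed, correlation values, and function names are all assumptions made for illustration. Both regressors have a true coefficient of 0.3, and the second one is omitted from every fitted regression.

```python
import random

def short_slope(rho, n, rng):
    """Slope from regressing y on x1 alone when x2 (correlation rho with x1) is omitted."""
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
    y = [0.3 * a + 0.3 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    xbar, ybar = sum(x1) / n, sum(y) / n
    sxy = sum((a - xbar) * (c - ybar) for a, c in zip(x1, y))
    sxx = sum((a - xbar) ** 2 for a in x1)
    return sxy / sxx

def average_short_slope(rho, n=200, reps=300, seed=1):
    """Average of the misspecified slope over many simulated samples."""
    rng = random.Random(seed)
    return sum(short_slope(rho, n, rng) for _ in range(reps)) / reps

avg_correlated = average_short_slope(rho=0.8)     # omitted variable correlated with x1
avg_uncorrelated = average_short_slope(rho=0.0)   # omitted variable uncorrelated with x1

print(f"average slope, correlated omitted variable:   {avg_correlated:.3f}")
print(f"average slope, uncorrelated omitted variable: {avg_uncorrelated:.3f}")
```

With a correlation of 0.8 the average slope should settle near 0.3 + 0.3 × 0.8 = 0.54 rather than 0.3, while with no correlation it should stay close to 0.3. Increasing `n` does not remove the gap in the correlated case, which is what inconsistency means.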
