
UNIVERSITY OF ESSEX

DEPARTMENT OF GOVERNMENT
ASSESSED WORK COVER SHEET

Module: GV903-7 Advanced Research Methods
Dr. Alejandro Quiroz Flores

Family Name: Goncalves Freitas
Given Names: Carlos Henrique
Registration number: PRID GONCA97206
Programme: MSc Political Economy
Word count (to nearest 100 words): 900 words.

Please read the following carefully:


In signing this cover sheet you confirm:
1. That the attached work complies with University regulations governing academic offences, in particular regulations 6.20 and 6.21. Please follow the link Information and Publications from the University's website, to University Regulations, Policy and Procedures, then to Academic Offences Procedures.
2. That you are aware of the guidelines set out in the Postgraduate Taught Student Handbook, Department of Government (as appropriate), and in particular that:
(a) You have read and understood "A Guide to Good Practice in Assessed Work" in the Postgraduate Taught Student Handbook, Department of Government.
(b) All material copied from any other source is referenced in accordance with the guidelines set out in the Postgraduate Taught Student Handbook, Department of Government.
(c) You have acknowledged the assistance of other people who have contributed substantively to your submitted work.
(d) You have acknowledged any overlap between the present submission and other assessed work, either at the University of Essex or elsewhere.
3. That the electronic file you submit exactly matches your printed submission.
4. That you consent that the electronic file may be submitted to turnitinUK (the JISC Plagiarism Detection Service).
5. That you are aware of the University of Essex Code of Practice on intellectual property rights (see the Postgraduate Taught Student Handbook, Department of Government, or the University website: go to Information and Publications from the University's website, to University Regulations, Policy and Procedures, and then to Regulations relating to Academic Affairs).
6. That you consent for your submission (term paper or dissertation) to be considered for publication in the appropriate Essex journal (the journal publishes a selection of the best term papers and dissertations). If you wish to deny consent, please delete (strike through) the previous sentence.
The following statement must be signed:
I confirm that I take personal responsibility for complying with the University regulations governing academic offences
and that I consent for my submission to be processed in the context of the JISC Plagiarism Detection Service.

Signed: Carlos Henrique G. Freitas

Date: 09/11/2014

If you have worked closely with another student or sought advice from a proof-reader in the preparation of this assessed work, please ask that person to complete the section below:

GV903-7 ASSIGNMENT 1 -- Carlos HENRIQUE Goncalves Freitas -- PRID GONCA97206

Name: Signed:


1. The random variable Y comes from a distribution with finite mean μ and variance σ². Show that the sample average, Ȳ, is an unbiased estimator of μ.

I) It is relevant to note1 that the unbiasedness of such an estimator does not necessarily imply that the average (Ȳ) of any particular sample equals μ, but rather that, if we drew infinite samples of the same size from the population and took the average of their sample means (i.e. the average of all the estimates of μ), this would converge to the population mean (μ). This relates to the sampling distribution of Ȳ and indicates that the mean of any particular sample may be above or below the population mean, but that there is no systematic tendency, across the infinite possible samples, to undervalue or overvalue the population mean. Although unbiasedness may be an abstract idea, and might not be a sufficient criterion on its own for designing an efficient estimator, it is one of the criteria for constructing such a model. The estimate is, in fact, just an outcome of a rule (a mathematical construction) that helps us estimate the value of a parameter: a significant, real value of the population distribution which we do not know. I think it is relevant to mention these considerations, for they helped me understand a major distinction between descriptive statistics and econometrics, and their subtly different implications.

II) An estimator Ȳ = (1/n) Σ Yi (with the sum running over i = 1, …, n) is unbiased if E[Ȳ] = μ, so that its bias is zero (Bias(Ȳ) = 0).
The proof applies the following properties of expectations: (i) the expectation of a sum multiplied by a constant equals that constant times the expectation of the sum; (ii) the expectation of a sum of random variables is the sum of their individual expectations; (iii) the expected value of each Yi equals the population mean (μ); and (iv) μ is obviously the same throughout the population distribution. We can then write:


E[Ȳ] = E[(1/n) Σ Yi]

E[Ȳ] = (1/n) E[Σ Yi]   (i)

E[Ȳ] = (1/n) Σ E[Yi]   (ii)
1 As presented in Wooldridge, J. M., 2014. Introduction to Econometrics (European Ed.). Andover: Cengage Learning EMEA, p. 750.


E[Ȳ] = (1/n) (μ1 + μ2 + … + μn)   (iii)

E[Ȳ] = (1/n) (nμ)   (iv)

E[Ȳ] = μ

As Bias(Ȳ) = E[Ȳ] − μ, then Bias(Ȳ) = μ − μ = 0.
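The algebra above can also be checked by simulation. A minimal sketch (in Python rather than Stata, standard library only; the exponential population with mean 2 is an arbitrary choice, since unbiasedness holds for any distribution with finite mean and variance):

```python
import random
import statistics

random.seed(42)

mu = 2.0        # population mean of the exponential distribution chosen here
n = 10          # size of each sample
reps = 100_000  # number of repeated samples

# Any single sample mean may be above or below mu, but the average of
# many sample means settles on the population mean: E[Ybar] = mu.
sample_means = [
    statistics.mean(random.expovariate(1 / mu) for _ in range(n))
    for _ in range(reps)
]
grand_mean = statistics.mean(sample_means)
print(grand_mean)
```

With 100,000 replications the average of the sample means typically lands within a few thousandths of 2, while individual sample means vary widely around it.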

2. The random variable Y comes from a Chi-squared distribution with 50 df. Show how the Lindeberg-Levy central limit theorem works by taking the mean of a sample of size 10 and then size 100. For each sample size, first do 50 iterations, then 100, then 1,000 and finally 100,000. For each iteration plot the histogram and a kernel density on top of it.

According to Greene2, the Lindeberg-Levy Central Limit Theorem shows that, regardless of the type of distribution a sample of random variables might originate from, the distribution of the sample mean converges to a normal distribution as the sample size grows; increasing the number of repeated observations (iterations) of such a sample makes this convergence visible in the histogram of the sample means and its kernel density curve.

We express it as: if {y1, y2, …, yn} are randomly drawn from a probability distribution P of any form with finite population mean (μ) and finite population variance (σ²), then:

√n (ȳn − μ) →d N(0, σ²), where ȳn = (1/n) Σ yi
We can see this behaviour of the resulting distributions in our example through their histograms and density curves:
Sample Size    Iterations
1) 10          50; 100; 1,000; 100,000
2) 100         50; 100; 1,000; 100,000

(By "iterations" we should understand the number of repeated draws or observations, of 10 and 100 random values respectively for each step, taken from a population with a chi-squared distribution with 50 df3.)

Graph Set 1: For samples of size 10 we will have, with 50, 100, 1,000 and 100,000 iterations respectively:

[Histograms with kernel density overlays of exp_val1, exp_val2, exp_val3 and exp_val4; horizontal axes: sample means (roughly 30-70), vertical axes: density.]
2 Greene, W. H., 2012. Econometric Analysis (International Ed., 7th). Harlow: Pearson, p. 1120.
3 Degrees of freedom should not be seen as a universal expression such as (n-1), but as the relation between the number of independent pieces of information and the total information necessary to have a meaningful representation of our statistics and their estimates. In a multiple regression it is the number of observations minus the number of estimated parameters. In a chi-squared distribution, the degrees of freedom are the number of independent values of the underlying standard normal distribution which have been squared and summed up. Cf. Wooldridge, op. cit., p. 570 and Greene, op. cit., pp. 1061-1062.


GV903-7 ASSIGNMENT 1 -- Carlos HENRIQUE Goncalves Freitas -- PRID GONCA97206

Graph Set 2: For samples of size 100 we will have, with 50, 100, 1,000 and 100,000 iterations respectively:

[Histograms with kernel density overlays of exp_val5, exp_val6, exp_val7 and exp_val8; horizontal axes: sample means (roughly 46-54), vertical axes: density.]

Table 1: To understand the C.L.T.'s implications better, I look into the generated expected values and their means, (1/n) Σ exp_vali:

[Table of the means of exp_val1-exp_val4 (sample size 10) and exp_val5-exp_val8 (sample size 100) over 50, 100, 1,000 and 100,000 iterations; with 100,000 iterations every mean lies close to the population mean of 50.]
The graphs and tables were generated in Stata by the following syntax:
clear all
set seed 649349
set obs 100000
// sample size 10, 50 iterations
capture gen exp_val1 =.
forvalues i=1/50 {
qui gen r`i' = rchi2(50) if _n <= 10
qui sum r`i'
qui replace exp_val1 = r(mean) if _n==`i'
qui drop r`i'
}
histogram exp_val1, kdensity
// sample size 10, 100 iterations
capture gen exp_val2 =.
forvalues i=1/100 {
qui gen r`i' = rchi2(50) if _n <= 10
qui sum r`i'
qui replace exp_val2 = r(mean) if _n==`i'
qui drop r`i'
}
histogram exp_val2, kdensity
// sample size 10, 1,000 iterations
capture gen exp_val3 =.
forvalues i=1/1000 {
qui gen r`i' = rchi2(50) if _n <= 10
qui sum r`i'
qui replace exp_val3 = r(mean) if _n==`i'
qui drop r`i'
}
histogram exp_val3, kdensity
// sample size 10, 100,000 iterations
capture gen exp_val4 =.
forvalues i=1/100000 {
qui gen r`i' = rchi2(50) if _n <= 10
qui sum r`i'
qui replace exp_val4 = r(mean) if _n==`i'
qui drop r`i'
}

clear all
set seed 496703
set obs 100000
// sample size 100, 50 iterations
capture gen exp_val5 =.
forvalues i=1/50 {
qui gen r`i' = rchi2(50) if _n <= 100
qui sum r`i'
qui replace exp_val5 = r(mean) if _n==`i'
qui drop r`i'
}
histogram exp_val5, kdensity
// sample size 100, 100 iterations
capture gen exp_val6 =.
forvalues i=1/100 {
qui gen r`i' = rchi2(50) if _n <= 100
qui sum r`i'
qui replace exp_val6 = r(mean) if _n==`i'
qui drop r`i'
}
histogram exp_val6, kdensity
// sample size 100, 1,000 iterations
capture gen exp_val7 =.
forvalues i=1/1000 {
qui gen r`i' = rchi2(50) if _n <= 100
qui sum r`i'
qui replace exp_val7 = r(mean) if _n==`i'
qui drop r`i'
}
histogram exp_val7, kdensity
// sample size 100, 100,000 iterations
capture gen exp_val8 =.
forvalues i=1/100000 {
qui gen r`i' = rchi2(50) if _n <= 100
qui sum r`i'
qui replace exp_val8 = r(mean) if _n==`i'
qui drop r`i'
}



histogram exp_val4, kdensity

histogram exp_val8, kdensity
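The same experiment can be sketched outside Stata as a cross-check. A standard-library Python version (a chi-squared(50) draw is built as the sum of 50 squared standard normals; the seed and the 500-iteration count are arbitrary choices, smaller than the Stata runs):

```python
import random
import statistics

random.seed(649349)  # arbitrary seed, echoing the Stata script's set seed

def rchi2_50() -> float:
    """One chi-squared(50) draw: the sum of 50 squared standard normals."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(50))

def sample_means(sample_size: int, iterations: int) -> list[float]:
    """Mean of `sample_size` chi-squared(50) draws, repeated `iterations` times."""
    return [
        statistics.mean(rchi2_50() for _ in range(sample_size))
        for _ in range(iterations)
    ]

for n in (10, 100):
    means = sample_means(n, 500)
    # The C.L.T. says the means centre on 50 with standard deviation
    # sqrt(2 * 50 / n) = 10 / sqrt(n): about 3.2 for n = 10, 1.0 for n = 100.
    print(n, statistics.mean(means), statistics.stdev(means))
```

Plotting `means` as a histogram reproduces the convergence seen in Graph Sets 1 and 2.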

We can then verify, in Graph Sets 1 and 2, that the samples' distributions progressively tend to a normal distribution as we increase the number of observations (iterations of draws of a certain number of random values from a chi-squared distribution), which is consistent with the C.L.T. Now, if we also take into account the behaviour of the expected values of the sample means, we can relate questions 1 and 2. An unbiased estimator gains efficiency as the variance diminishes and the distribution tends to a normal curve. The table shows that the estimator of μ does not improve in the same proportion as the quality of the final distribution. With 100,000 iterations, moving from samples of 10 to samples of 100 random variables, the resulting normal distribution offers a significant gain in efficiency, as its tails shrink, compressing the P.D.F. even more around the centre of the normal distribution.
3. Use the Stata dataset Presidential Approval Ratings with Covariates 2014.dta

Syntax for Question 3, from a) to c):

clear all
cd "C:\Users\Henrique_CHGF\MY_STUDIES\Essex_903_7\Assign_1"
use "Presidential Approval Ratings with Covariates 2014", clear
scatter month_app unempl
reg month_app unempl
reg month_app cpi
reg month_app unempl cpi
scatter month_app unempl || lfit month_app unempl
// also possible: twoway (scatter month_app unempl)

a) Present a scatterplot of Approval Ratings (month_app) and Unemployment (unempl):

[Scatterplot of month_app (20-100) against unempl.]

Interpretation: The graph shows the monthly approval ratings of the American presidents (from Truman to Obama, with a few months missing, especially in the first years) as a function of the unemployment rate. It is interesting to note how some key periods in American history may have an effect on this regression. For example: (i) the peak of approval around 9/11 (clearly related not to economic indices but to the patriotic claims of a nation under threat); (ii) Reagan's higher approval in the first half of his second term (i.e. the consensus around the Washington neo-liberal ideology and the results of America's partially positive response to the debt crisis); (iii) Nixon's very low approval at the end of his administration (obviously due to political rather than socio-economic factors); and (iv) Lyndon Johnson's fall after the emotions of Kennedy's assassination had faded away and anti-Vietnam protests rose. The way these factors affect this model may be seen in the statistics summary below4.
4 All interpretations of the different Stata outputs presented in this Assignment were done with the help of the following sites' pages: (i) Stata Annotated Output: Regression Analysis. UCLA: Statistical Consulting Group. From http://www.ats.ucla.edu/stat/stata/output/reg_output.htm/; accessed on November 09, 2014. (ii) Interpreting Regression Output. Princeton University Library, Data & Statistical Services. From


b) Run a linear model of Approval Ratings (month_app) on Unemployment (unempl). Then run a
linear model of Approval Ratings (month_app) on Inflation (cpi). Then run a model of Approval
Ratings (month_app) on Unemployment (unempl) and Inflation (cpi).
I. reg month_app unempl

      Source |       SS       df       MS              Number of obs =     702
-------------+------------------------------           F(  1,   700) =    7.15
       Model |  1204.83108     1  1204.83108           Prob > F      =  0.0077
    Residual |  118012.321   700   168.58903           R-squared     =  0.0101
-------------+------------------------------           Adj R-squared =  0.0087
       Total |  119217.152   701  170.067264           Root MSE      =  12.984

------------------------------------------------------------------------------
   month_app |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      unempl |  -.7942154   .2970912    -2.67   0.008    -1.377512   -.2109187
       _cons |   57.89172   1.806404    32.05   0.000      54.3451    61.43834
------------------------------------------------------------------------------
II. reg month_app cpi

      Source |       SS       df       MS              Number of obs =     702
-------------+------------------------------           F(  1,   700) =   15.03
       Model |  2506.56316     1  2506.56316           Prob > F      =  0.0001
    Residual |  116710.589   700  166.729413           R-squared     =  0.0210
-------------+------------------------------           Adj R-squared =  0.0196
       Total |  119217.152   701  170.067264           Root MSE      =  12.912

------------------------------------------------------------------------------
   month_app |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         cpi |  -.0278407   .0071804    -3.88   0.000    -.0419384   -.0137431
       _cons |   56.08053   .8790848    63.79   0.000     54.35457    57.80649
------------------------------------------------------------------------------

In I and II, the coefficients of the independent variables (unemployment and inflation) and of the model constants are statistically significant (p < 0.05), and they lie within their confidence intervals. However, the R-squared statistic is very low here: only about 1% and 2% of the variance of the approval rate, for regressions I and II respectively, may be explained by the level of unemployment or inflation. This may reflect situations that escape the scope of a purely economic explanation of approval. Yet the relations between the dependent and independent variables are still correctly reflected: as one would expect, approval varies inversely with inflation and unemployment.
III. reg month_app unempl cpi

      Source |       SS       df       MS              Number of obs =     702
-------------+------------------------------           F(  2,   699) =    8.78
       Model |  2919.93424     2  1459.96712           Prob > F      =  0.0002
    Residual |  116297.218   699  166.376564           R-squared     =  0.0245
-------------+------------------------------           Adj R-squared =  0.0217
       Total |  119217.152   701  170.067264           Root MSE      =  12.899

------------------------------------------------------------------------------
   month_app |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      unempl |  -.4887759   .3100886    -1.58   0.115    -1.097593    .1200408
         cpi |  -.0241964   .0075362    -3.21   0.001    -.0389927   -.0094001
       _cons |   58.56965   1.806892    32.41   0.000     55.02207    62.11724
------------------------------------------------------------------------------

On III, the coefficients for inflation and the constant are statistically significant (p-values < 0.05), but the p-value is well above 10% for unemployment, so we could not reject the
http://dss.princeton.edu/online_help/analysis/interpreting_regression.htm#ptse; accessed on November 09, 2014. (iii) Reading and Using STATA
Output; http://web.mit.edu/course/17/17.846/OldFiles/www/Readout.html; accessed on November 09, 2014.


null hypothesis for the unemployment coefficient. Yet the inverse relationship between approval and both inflation and unemployment has been maintained.

c) Present a scatterplot of Approval Ratings (month_app) and Unemployment (unempl) and add the least squares line of the regression of Approval Ratings (month_app) on Unemployment (unempl). For one data point, draw the residual:

[Scatterplot of month_app (20-100) against unempl, with the fitted least squares line.]

The regression line does not visually allow us to assert that there is a significant bias, which may be intriguing if we take into account the comparatively low R-squared statistic.

4. Suppose you have the following data. Y = 3.5, 5.7; X = 2, 4. What is the linear algebra formula for the least squares estimates? Use linear algebra (by hand) to compute the least squares estimates. Now use Mata to compute the estimates.

In least squares we have5: b = (X'X)^(-1) X'Y, given that:

X = | 1  2 |     Y = | 3.5 |
    | 1  4 |         | 5.7 |

X' = | 1  1 |
     | 2  4 |

X'X = | 1*1 + 1*1   1*2 + 1*4 | = | 2   6 |
      | 2*1 + 4*1   2*2 + 4*4 |   | 6  20 |

(i) For A = | a  b |, (ii) A^(-1) = (1 / det A) * |  d  -b |
            | c  d |                              | -c   a |

det A = 2*20 - 6*6 = 4, so (X'X)^(-1) = (1/4) * | 20  -6 | = |  5    -1.5 |
                                                | -6   2 |   | -1.5   0.5 |

(X'X)^(-1) X' = |  5    -1.5 | * | 1  1 | = |  2    -1  |
                | -1.5   0.5 |   | 2  4 |   | -0.5   0.5 |

b = (X'X)^(-1) X'Y = |  2    -1  | * | 3.5 | = |  2*3.5 - 1*5.7     | = | 1.3 |
                     | -0.5   0.5 |  | 5.7 |   | -0.5*3.5 + 0.5*5.7 |   | 1.1 |

Stata syntax and output:

. mata
------------------------------------------------- mata (type end to exit) ---
: x = (1,2\1,4)
: x
       1   2
    +---------+
  1 |  1   2  |
  2 |  1   4  |
    +---------+
: y = (3.5\5.7)
: y
        1
    +-------+

5 For a review on matrices, see http://www.mathwords.com/i/inverse_of_a_matrix.htm. Accessed on November 08, 2014.


  1 |  3.5  |
  2 |  5.7  |
    +-------+
: invsym(x'*x)*x'*y
        1
    +-------+
  1 |  1.3  |
  2 |  1.1  |
    +-------+
: end
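The same computation can be mirrored in plain Python as a cross-check (standard library only, with the 2x2 inverse written out by hand exactly as above):

```python
# Data from the question: X carries a constant column, Y the two outcomes.
X = [[1.0, 2.0],
     [1.0, 4.0]]
Y = [3.5, 5.7]

# X'X and X'Y, written out for the 2x2 case.
xtx = [[sum(X[k][i] * X[k][j] for k in range(2)) for j in range(2)]
       for i in range(2)]
xty = [sum(X[k][i] * Y[k] for k in range(2)) for i in range(2)]

# Inverse of a 2x2 matrix A = [[a, b], [c, d]]: (1/det A) * [[d, -b], [-c, a]].
a, b = xtx[0]
c, d = xtx[1]
det = a * d - b * c
inv = [[d / det, -b / det], [-c / det, a / det]]

# b = (X'X)^(-1) X'Y
coefs = [inv[i][0] * xty[0] + inv[i][1] * xty[1] for i in range(2)]
print(coefs)
```

Up to floating-point rounding this returns 1.3 and 1.1, matching both the hand computation and the Mata output.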

5. Simulate 3 independent variables X1, X2, and X3. They have mean 1, variance 1, and they
have correlation .3 (that is, each pair has a correlation of .3). Sample size is 1000. Simulate one
disturbance with mean 0 and variance 1. This disturbance should be independent of the
covariates (i.e. covariates are exogenous).

Syntax:

clear all
set seed 5789
set obs 1000
matrix m = (1,1,1)
matrix sd = (1,1,1)
matrix c = (1, .3, .3\ .3, 1, .3\ .3, .3, 1)
drawnorm x1 x2 x3, n(1000) means(m) sds(sd) corr(c)
summarize x1 x2 x3
corr x1 x2 x3
gen e1=rnormal(0,1)
summarize x1 x2 x3 e1
Stata Output

. summarize x1 x2 x3 e1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          x1 |      1000    .9925327    1.001764  -2.176162    4.10027
          x2 |      1000    .9917389    .9710396  -2.563787   3.761583
          x3 |      1000    1.048413     .973184  -1.886655   4.515096
          e1 |      1000    .0627629    1.003885  -3.210835   2.864533

. corr x1 x2 x3
(obs=1000)

             |       x1       x2       x3
-------------+---------------------------
          x1 |   1.0000
          x2 |   0.2496   1.0000
          x3 |   0.2752   0.3340   1.0000
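The drawnorm step can also be sketched without Stata. A standard-library Python version builds the three equicorrelated variables (means 1, variances 1, pairwise correlation .3) from independent standard normals through a hand-coded Cholesky factor; the seed is arbitrary:

```python
import math
import random
import statistics

random.seed(5789)  # arbitrary seed, echoing the Stata script

rho, n = 0.3, 1000

# Cholesky factor of the 3x3 equicorrelation matrix
# [[1, rho, rho], [rho, 1, rho], [rho, rho, 1]], written out by hand.
l21 = rho
l22 = math.sqrt(1 - rho ** 2)
l31 = rho
l32 = (rho - rho ** 2) / l22
l33 = math.sqrt(1 - l31 ** 2 - l32 ** 2)

x1, x2, x3 = [], [], []
for _ in range(n):
    z1, z2, z3 = (random.gauss(0, 1) for _ in range(3))
    # Shift by 1 for mean 1; variances stay 1 and correlations rho by construction.
    x1.append(1 + z1)
    x2.append(1 + l21 * z1 + l22 * z2)
    x3.append(1 + l31 * z1 + l32 * z2 + l33 * z3)

def corr(u, v):
    """Sample (Pearson) correlation of two equal-length lists."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)
    return cov / (statistics.stdev(u) * statistics.stdev(v))

print(corr(x1, x2), corr(x1, x3), corr(x2, x3))
```

The three sample correlations come out near .3, as in the Stata corr output above.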

The disturbance may be defined as the other factors which may affect y (the dependent variable) but which are not accounted for in the constructed model used in the regression (WOOLDRIDGE, op. cit., pp. 20-21). Greene (op. cit., pp. 53-55) takes a broader view, defining it as a stochastic variable that translates the shades of inherent randomness of real-world situations and our impossibility of exhausting causality relations and fitting them into a mathematical model. Yet, in the end, both treat the exogeneity of the independent variables as an assumption of the Linear Regression Model, which implies that they offer no relevant information on the disturbance. This assumption improves the chance of obtaining reliable estimators of the constant and of the independent variables' coefficients. Finally, an attempt to sustain such an assumption based only on the absence of a direct correlation between a disturbance (ε) and an independent variable (z) would not be enough, as there could still exist a relation between ε and a third function of z, f(z). So this assumption is better translated as E[ε|z] = E[ε].
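The last point, that zero correlation is weaker than mean independence, can be illustrated with a small sketch (Python, standard library only; the construction eps = z² − 1 is a textbook counterexample, not part of the assignment's data):

```python
import random
import statistics

random.seed(7)  # arbitrary seed

n = 50_000
z = [random.gauss(0, 1) for _ in range(n)]
# eps = z^2 - 1 has mean 0 and cov(eps, z) = E[z^3] = 0, so eps and z are
# uncorrelated, yet E[eps | z] = z^2 - 1 clearly depends on z (via f(z) = z^2).
eps = [v * v - 1 for v in z]

mz, me = statistics.mean(z), statistics.mean(eps)
cov = sum((a - mz) * (b - me) for a, b in zip(z, eps)) / (n - 1)
correlation = cov / (statistics.stdev(z) * statistics.stdev(eps))

# Conditional means of eps over different ranges of z are far from zero.
tail = statistics.mean(b for a, b in zip(z, eps) if a < -1)
centre = statistics.mean(b for a, b in zip(z, eps) if -0.5 < a < 0.5)
print(correlation, tail, centre)
```

The correlation is essentially zero while the conditional means differ sharply, so E[ε|z] = E[ε] fails even though corr(ε, z) = 0.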
a) Generate a process Y=.5+1(X1)+2(X2)+3(X3)+epsilon. Now run a regression of Y on X1, X2,
and X3. Present the results.


Syntax:


gen y=0.5+1*x1+2*x2+3*x3+e1
reg y x1 x2 x3
outreg2 using question5a, word replace
Stata Output
. reg y x1 x2 x3

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  3,   996) = 6400.93
       Model |  19331.9484     3  6443.98279           Prob > F      =  0.0000
    Residual |  1002.69962   996  1.00672653           R-squared     =  0.9507
-------------+------------------------------           Adj R-squared =  0.9505
       Total |   20334.648   999   20.355003           Root MSE      =  1.0034

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.039887   .0334725    31.07   0.000     .9742026    1.105572
          x2 |      1.946   .0352197    55.25   0.000     1.876886    2.015113
          x3 |   2.977187   .0353974    84.11   0.000     2.907724    3.046649
       _cons |   .6006455   .0551984    10.88   0.000     .4923271     .708964
------------------------------------------------------------------------------

b) Now run a regression of Y on X1 and X2 only. Present the results. Do you see any problems
with this regression?

Syntax:

reg y x1 x2
outreg2 using question5b, word replace
Stata Output
. reg y x1 x2

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  2,   997) =  749.21
       Model |  12210.3137     2  6105.15684           Prob > F      =  0.0000
    Residual |  8124.33432   997  8.14878066           R-squared     =  0.6005
-------------+------------------------------           Adj R-squared =  0.5997
       Total |   20334.648   999   20.355003           Root MSE      =  2.8546

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    1.63178   .0931025    17.53   0.000     1.449081    1.814479
          x2 |   2.790044   .0960483    29.05   0.000     2.601564    2.978524
       _cons |   2.297421   .1461781    15.72   0.000     2.010569    2.584273
------------------------------------------------------------------------------

In both cases, as the p-values are less than 0.05, the coefficients for x1 and x2 (and for x1, x2 and x3) and for their respective constants are statistically significant (i.e. we reject the null hypothesis that each coefficient is zero). They also lie within their confidence intervals, but the R-squared and Adjusted R-squared went down from a) to b), i.e. when we took x3 out of the regression model. In this case the Adjusted R-squared reduction is even more significant, and it is precisely this statistic's variation we should observe: when we introduce a new independent variable to the model and the Adjusted R-squared increases, it reveals the new variable's explanatory power, and vice versa. So we can say that x3 makes a difference, improving the model.
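The coefficient changes between a) and b) (x1 rising from about 1.04 to 1.63, x2 from 1.95 to 2.79) are omitted-variable bias: x3 is correlated with the included regressors. A minimal sketch of the mechanism in a two-variable setting (Python, standard library only; the variable names, the seed and the expected slope of 1 + 3*rho follow the textbook omitted-variable bias formula, not the assignment's data):

```python
import math
import random
import statistics

random.seed(97206)  # arbitrary seed

n, rho = 20_000, 0.3
x1, x3, y = [], [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    a = z1                                       # included regressor
    b = rho * z1 + math.sqrt(1 - rho ** 2) * z2  # omitted regressor, corr rho with x1
    x1.append(a)
    x3.append(b)
    # True model: y = 1*x1 + 3*x3 + disturbance.
    y.append(1.0 * a + 3.0 * b + random.gauss(0, 1))

# Simple regression of y on x1 alone: slope = cov(x1, y) / var(x1).
mx, my = statistics.mean(x1), statistics.mean(y)
cov = sum((u - mx) * (v - my) for u, v in zip(x1, y)) / (n - 1)
slope = cov / statistics.variance(x1)

# Omitting x3 pushes the slope towards 1 + 3 * rho = 1.9 instead of the true 1.
print(slope)
```

The inflated slope mirrors what happens to x1 and x2 when x3 is dropped from the regression above.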
c) Now simulate the same variables X1, X2, and X3, but make sure that they are independent
of each other (i.e. there is no correlation between them). Now generate a process
Y=.5+1(X1)+2(X2)+3(X3)+epsilon. Now run 4 regressions: 1) Y on X1; 2) Y on X2; 3) Y on X3;
4) Y on X1, X2, and X3. Compare the results from the 4 regressions.

Syntax:

clear
set obs 1000
matrix m = (1,1,1)
matrix sd = (1,1,1)
matrix c = (1, 0, 0\ 0, 1, 0\ 0, 0, 1)
drawnorm x4 x5 x6, n(1000) means(m) sds(sd) corr(c)
summarize x4 x5 x6
corr x4 x5 x6
gen e2=rnormal(0,1)
summarize x4 x5 x6 e2
gen y=0.5+1*x4+2*x5+3*x6+e2
reg y x4
reg y x5
reg y x6
reg y x4 x5 x6


Stata Output
. corr x4 x5 x6
(obs=1000)

             |       x4       x5       x6
-------------+---------------------------
          x4 |   1.0000
          x5 |   0.0201   1.0000
          x6 |  -0.0052  -0.0017   1.0000

. summarize x4 x5 x6 e2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          x4 |      1000    1.022751    1.020198   -2.59362   4.401528
          x5 |      1000    1.037923    1.022858  -3.037209   4.473898
          x6 |      1000    .9941706    .9952134  -1.988218   4.504817
          e2 |      1000    -.025378    .9870653   -3.21099   2.962861

. reg y x4

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =   72.18
       Model |   1030.6776     1   1030.6776           Prob > F      =  0.0000
    Residual |  14250.5234   998  14.2790816           R-squared     =  0.0674
-------------+------------------------------           Adj R-squared =  0.0665
       Total |   15281.201   999  15.2964975           Root MSE      =  3.7788

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x4 |   .9956211   .1171879     8.50   0.000     .7656581    1.225584
       _cons |   5.537458   .1692456    32.72   0.000      5.20534    5.869576
------------------------------------------------------------------------------

. reg y x5

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) =  387.59
       Model |   4274.5819     1   4274.5819           Prob > F      =  0.0000
    Residual |  11006.6191   998  11.0286765           R-squared     =  0.2797
-------------+------------------------------           Adj R-squared =  0.2790
       Total |   15281.201   999  15.2964975           Root MSE      =  3.3209

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x5 |   2.022315    .102722    19.69   0.000     1.820739    2.223891
       _cons |   4.456724   .1496529    29.78   0.000     4.163054    4.750395
------------------------------------------------------------------------------

. reg y x6

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  1,   998) = 1443.49
       Model |  9034.74116     1  9034.74116           Prob > F      =  0.0000
    Residual |  6246.45986   998  6.25897781           R-squared     =  0.5912
-------------+------------------------------           Adj R-squared =  0.5908
       Total |   15281.201   999  15.2964975           Root MSE      =  2.5018

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x6 |   3.021753    .079534    37.99   0.000      2.86568    3.177826
       _cons |   3.551593    .111853    31.75   0.000     3.332099    3.771088
------------------------------------------------------------------------------

. reg y x4 x5 x6

      Source |       SS       df       MS              Number of obs =    1000
-------------+------------------------------           F(  3,   996) = 4890.52
       Model |  14309.7625     3  4769.92083           Prob > F      =  0.0000
    Residual |  971.438543   996  .975339903           R-squared     =  0.9364
-------------+------------------------------           Adj R-squared =  0.9362
       Total |   15281.201   999  15.2964975           Root MSE      =  .98759

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x4 |    .970581   .0306341    31.68   0.000     .9104663    1.030696
          x5 |   2.007876    .030554    65.72   0.000     1.947918    2.067833
          x6 |   3.030517   .0313968    96.52   0.000     2.968905    3.092128
       _cons |   .4661972   .0625328     7.46   0.000     .3434861    .5889083
------------------------------------------------------------------------------

Here again we see the effect that a variable with good explanatory power has on the model. The behaviour of the Adjusted R-squared across the models suggests a rising hierarchy of explanatory power from x4 to x5 to x6. In all models the constants are statistically significant, but the rise of the Adjusted R-squared statistic points to the important role x6 plays in the model.