
Quantitative Methods 2:

“Decision Making Under Uncertainty”


Lecture 1
IRCO 454
Professor Edmund Malesky
Outline of Today’s Lecture
 Introduction to QM2
 Flip to the last page of the novel –
What is linear modeling and how is it
used?
 A brief review of critical concepts that
you learned in QM1.
Goals of the Course
 Learn to do quantitative empirical work for
use in economic analysis, public policy
and social sciences.
 Learn the basic properties of the
regression estimator
 Learn to diagnose and address problems
with fit between data and estimator
 Learn to present results in a meaningful
way
 Learn STATA
Topics We Will Address
 ONE basic equation:
 Y = β0 + β1X + u
 This is a VERY flexible model for
understanding social, political, economic
behavior
 First part of course will be about HOW to
estimate β0 and β1
 Also about what ASSUMPTIONS are
needed to make those estimates
Topics We Will Address
 Y = β0 + β1X + u
 The rest of the course will be about what
to do if those assumptions are not
reasonable
 How do we make sure that our estimates
of β1 are unbiased, or at least consistent?
Problems with u
(the error term / residual)
 Omitted Variable Bias
 Heteroskedasticity
 Dichotomous Dependent Variables
 Autocorrelation
Problems with X
 Measurement Error
 Multicollinearity
Problems with β0 & β1
 Dummy variables for new intercepts
 Non-linear effects
 Interaction Effects
Problems with Y
 Endogeneity Bias
 Selection Bias
 The use (abuse) of R-squared and “curve
fitting”
Course Structure
(Two Components)
 Monday: A theory-based lecture on the
mathematical properties of the linear
regression technique and problems with
its application. No laptops!
 Wednesday: A practical, hands-on lab,
where we will learn how to program
statistical code in STATA. Bring your
laptops!!
Course Requirements
 50% - 4 Problem Sets
 You will write your own computer code (A.K.A. “The .Do File”)
• File sent in one hour before class on the Wednesday
following distribution.
• Send to “QM2 Homework” Folder
• “LastName_ProblemSet#”
• Whether the .do file runs is worth 20% of the HW grade
 Word write-up handed in before the class lecture.
 50% - Final Take-Home Exam
 Will test cumulative knowledge.
 Will involve a .do file.
 Grade will be determined by your answers to the
questions and whether I can successfully run your .do files
without error.
Required Readings
Wooldridge, Jeffrey M. 2006. Introductory
Econometrics: A Modern Approach, 3rd
Edition.
 King, Keohane, and Verba (KKV). 1994.
Designing Social Inquiry.
 Other brief reading assignments sent out
by professor.
Optional Readings
• Xiao Chen, Philip B. Ender, Michael Mitchell & Christine
Wells. 2006. Stata Web Books: Regression with Stata.
http://www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm
• Zorn, Christopher. Stata for Dummies 2.
http://www.buec.udel.edu/yatawarr/Stata4Dummies.pdf
• Kohler, Ulrich and Frauke Kreuter. 2005. Data Analysis
Using Stata. http://www.stata.com/bookstore/statabooks.html
• Acock, Alan. 2005. A Gentle Introduction to Stata.
http://www.stata.com/bookstore/statabooks.html
TA Availability
Chris
 Office Hours: Tuesday 3:30 - 5:30pm

Nora
 Breakout: Thursday 11:00-12:30pm

Connie
 Office Hours: Tuesday 5:30 - 7:30pm
Any Questions?
The Linear Regression Model
Approach to Research
Otherwise known as……
Advanced Line Drawing
General Linear Model
 The “General Linear Model” refers to a
class of statistical models which are
“generalizations” of simple linear
regression analysis.
 Regression is the predominant statistical
tool used in the social sciences due to its
simplicity and versatility.
 Also called Linear Regression Analysis.
Notations for the Regression
Line
 Alternate mathematical notations for
the straight line:
 10th Grade Geometry: y = mx + b
 Statistics Literature: Yi = a + bXi + ei
 Econometrics Literature: Y = β0 + β1X + u
 Wooldridge uses the econometrics
specification, so we will too!
Translating Math into English
 The linear model states that the
dependent variable is directly proportional
to the value of the independent variable.
 Thus if a theory implies that Y increases in
direct proportion to an increase in X, it
implies a specific mathematical model of
behavior - the linear model.
 E.g. “It’s the economy, stupid!”
Simple Linear Regression:
The Basic Mathematical Model

 Regression is based on the concept


of the simple proportional relationship
 A.K.A . . . the straight line.
 We can express this idea
mathematically!
 Y = β 0 + β 1X + u
The Theory Implies the Math
 ALL statements of relationships between
variables imply a mathematical structure.
 Even if we don’t like to phrase our theories
in these terms, they DO imply
mathematical relationships.
 Much of this course is about elaborating
the basic model to fit our more nuanced
theories.
Implications of a Linear Model
 The linear aspect means that the same
increase in inflation will always produce
the same reduction in presidential
approval.
 This is perhaps the most restrictive of all
the assumptions of OLS.
 We will work to loosen this assumption
through the quarter.
The Regression Parameters
 β0 = the intercept
 the point where the line crosses the Y-axis
(the value of the dependent variable when all of
the independent variables = 0)
 β1 = the slope
 the increase in the dependent variable per unit
change in the independent variable (also known
as the “rise over the run”)
Regression in a Perfect World…
 Y = 1X
[Figure: “The Straight Line” — Y plotted against X for X = 1 to 10; every point falls exactly on the line]
…but life is full of errors…
 Y = 1X + u
[Figure: “Simple Linear Regression” — the same line, but the points scatter around it]
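The two pictures above can be reproduced numerically. Below is a minimal sketch (NumPy assumed available; the seed and noise scale are my own, purely illustrative): in the "perfect world" every point sits exactly on Y = 1X, while adding the disturbance u scatters the points around the line.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 11)               # X values 1..10, as in the figures
y_perfect = 1 * x                  # Y = 1X: every point lies on the line
u = rng.normal(0, 1, size=x.size)  # the disturbance term
y_noisy = y_perfect + u            # Y = 1X + u: points scatter around the line

# In the perfect world the deviations from the line are exactly zero
deviation_total = (y_perfect - x).sum()
```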
The Error Term
 Our models do not predict behavior
perfectly.
 So we add a term to adjust or compensate
for the errors in prediction (u).
 Much of our ability to estimate β1 depends
upon the assumptions we make about the
errors (u).
 Sometimes u is called the “Disturbance”
The 'Goal' of Ordinary Least
Squares
 Ordinary Least Squares (OLS) is a
method of finding the linear model which
minimizes the sum of the squared errors.
 Such a model provides the best
explanation/prediction of the data.
 It is the “Best Linear Unbiased Estimator”

It’s BLUE
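The claim that OLS minimizes the sum of squared errors can be checked directly: fit the closed-form estimates, then perturb them and watch the SSR rise. A hedged sketch on synthetic data (variable names, seed, and perturbation sizes are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.0 * x + rng.normal(0, 1, size=x.size)

def ssr(b0, b1):
    """Sum of squared errors for the candidate line y = b0 + b1*x."""
    return ((y - (b0 + b1 * x)) ** 2).sum()

# Closed-form OLS estimates
b1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0_hat = y.mean() - b1_hat * x.mean()
best = ssr(b0_hat, b1_hat)

# Nudging either parameter in any direction can only increase the SSR
worse = min(ssr(b0_hat + d0, b1_hat + d1)
            for d0 in (-0.1, 0.1) for d1 in (-0.1, 0.1))
```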
Other Goals are Possible
 Minimize total errors
 Minimize Absolute Value of Errors
 Maximum Likelihood Models
 OLS is a special case of MLE
Why Least Squared Error?
 Why not simply minimum error?
 The errors about the line sum to 0.0!
 Minimum absolute deviation (error)
models now exist, but they are
mathematically cumbersome.
 Try algebra with | Absolute Value | signs!
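The point that "the errors about the line sum to 0.0" is easy to verify: with an intercept in the model, the OLS first-order conditions force the residuals to sum to exactly zero (up to floating-point noise), so minimizing the plain sum of errors cannot discriminate between lines. A sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 1.0 + 0.5 * x + rng.normal(0, 2, size=30)

# Closed-form OLS fit
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# The residuals cancel out: their sum is zero by construction
residual_sum = residuals.sum()
```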
Implications of Squared Errors
 This model seeks to avoid BIG misses
 A big u for one case leads to a REALLY
big u2.
 This means regression results can be
heavily influenced by outlier cases
 Some feel this is theoretically appropriate
 Always look at your data
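A quick illustration of how one big miss moves the whole line: on clean Y = 2X data the estimated slope is exactly 2, but adding 50 to a single observation drags the estimate far away, because that one squared error dominates the SSR. (The data here are invented for the demonstration.)

```python
import numpy as np

def slope(x, y):
    """OLS slope estimate: Cov(x, y) / Var(x)."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

x = np.arange(10, dtype=float)
y = 2 * x                      # perfectly linear data: slope is 2

clean = slope(x, y)
y_out = y.copy()
y_out[-1] += 50                # one large miss; squared, it dominates
contaminated = slope(x, y_out)
```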
Minimizing the Sum of Squared
Errors
 How to put the Least in OLS?
 In mathematical jargon we seek to
minimize the residual sum of squares
(SSR), where:
SSR = ∑ᵢ₌₁ⁿ (ŷᵢ − yᵢ)² = ∑ᵢ₌₁ⁿ ûᵢ²
Picking the Parameters
 To Minimize SSR, we need
parameter estimates.
 In calculus, if you wish to know when
a function is at its minimum, you take
the first derivative.
 In this case we must take partial
derivatives since we have two
parameters (β0 & β1) to worry about.
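Setting the two partial derivatives of the SSR to zero and solving gives the familiar closed-form estimates: β̂1 = ∑(xᵢ − x̄)(yᵢ − ȳ) / ∑(xᵢ − x̄)² and β̂0 = ȳ − β̂1x̄. A sketch checking them against NumPy's built-in least-squares fit (the simulated data and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 100)
y = 3.0 - 2.0 * x + rng.normal(0, 1, size=100)

# Solutions of the two first-order conditions
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

# Should agree with NumPy's degree-1 least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, 1)
```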
How “good” does it fit?
 To measure the “reduction in errors” we
need a relevant and tractable benchmark
for comparing predictions:
 The mean of the dependent variable.
 The mean of Y represents our “best
guess” at the value of Yi absent other
information.
Sums of Squares
 This gives us the following 'sum-of-
squares' measures:
 SST = Total Sum of Squares
 SSE = Explained Sum of Squares
 SSR = Residual (Unexplained) Sum of Squares
 Total Variation (SST) = Explained Variation (SSE) +
Unexplained Variation (SSR)

SST = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²
SSE = ∑ᵢ₌₁ⁿ (ŷᵢ − ȳ)²
SSR = ∑ᵢ₌₁ⁿ (ŷᵢ − yᵢ)²
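The decomposition can be confirmed directly: compute all three sums of squares from a fitted line and check the identity SST = SSE + SSR, plus the R-squared it implies. A minimal sketch on synthetic data (names and seed are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 50)
y = 1.0 + 0.8 * x + rng.normal(0, 2, size=50)

# OLS fit and predicted values
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = ((y - y.mean()) ** 2).sum()      # total variation
SSE = ((y_hat - y.mean()) ** 2).sum()  # explained variation
SSR = ((y - y_hat) ** 2).sum()         # unexplained variation

r_squared = SSE / SST                  # share of variation explained
```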
“Explained” and “Unexplained”
Variation
ŷ = β̂0 + β̂1x
[Figure: the fitted regression line with intercept β̂0 and slope β̂1, shown
over two slides. For an observation at Xi: the residual ûᵢ = yᵢ − ŷᵢ,
squared and summed across all observations, gives the SSR (Residual Sum
of Squares); the deviation (ŷᵢ − ȳ), squared and summed, gives the SSE
(Explained Sum of Squares); and the total deviation (yᵢ − ȳ), squared and
summed, gives the SST (Total Sum of Squares).]
Some Confusing Terminology
 Occasionally you may see people refer
instead to USS (Unexplained) and ESS
(Error)
 These terms are interchangeable, but…
 ESS can be confused with explained sum
of squares
 USS is not confused with any
mathematical jargon, but does pose
issues for statistical work on the US Navy.
Let’s Test Some “Theories”
 Presidential approval depends upon the
performance of the US economy
 The development of US military power
was a response to America’s threatening
environment
Plotting Approval and Inflation
[Figure: scatter plot of (mean) approve, ranging 28.3 to 76.2, against (mean) inflat, ranging −0.26 to 12.86]
Regressing Approval on
Inflation
. reg approve inflat

Source | SS df MS Number of obs = 46
---------+------------------------------ F( 1, 44) = 17.20
Model | 1960.60398 1 1960.60398 Prob > F = 0.0002
Residual | 5015.26094 44 113.983203 R-squared = 0.2811
---------+------------------------------ Adj R-squared = 0.2647
Total | 6975.86492 45 155.01922 Root MSE = 10.676

------------------------------------------------------------------------------
approve | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
inflat | -2.213684 .5337539 -4.147 0.000 -3.289394 -1.137973
_cons | 63.80565 2.711964 23.527 0.000 58.34004 69.27125
------------------------------------------------------------------------------
Fitting Inflation to Approval
[Figure: the same scatter of (mean) approve against (mean) inflat with the fitted regression line overlaid]
Plotting US Power & Disputes
[Figure: scatter plot of uscapbl, ranging .03 to .38, against numtargt, ranging 0 to 7]
Regress US Power on Disputes
. reg uscapbl numtargt

Source | SS df MS Number of obs = 177
---------+------------------------------ F( 1, 175) = 18.61
Model | .110444241 1 .110444241 Prob > F = 0.0000
Residual | 1.03834672 175 .00593341 R-squared = 0.0961
---------+------------------------------ Adj R-squared = 0.0910
Total | 1.14879096 176 .006527221 Root MSE = .07703

------------------------------------------------------------------------------
uscapbl | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
numtargt | .0201142 .0046621 4.314 0.000 .010913 .0293155
_cons | .1455665 .0067132 21.684 0.000 .1323172 .1588157
------------------------------------------------------------------------------
Fitting Disputes to US Power
[Figure: the same scatter of uscapbl against numtargt with the fitted regression line overlaid]
A Brief Review of Critical Concepts
 Measures of Central Tendency (Mean, Median, Mode)
x̄ = (1/n) ∑ᵢ₌₁ⁿ xᵢ
 Population Variance
Var(X) = E[(X − E(X))²] = (1/n) ∑ᵢ₌₁ⁿ (xᵢ − x̄)² = σ²
 Standard Deviation
sd(X) = √σ²
 Covariance
Cov(X, Y) = E[(X − E(X))(Y − E(Y))]
 Correlation
Corr(X, Y) = Cov(X, Y) / (sd(X)·sd(Y)) = σXY / (σX σY)
 Marginal Effect
Δy = β1Δx
Distributions – The Usual Suspects
 Normal Distribution
 Standard Normal
 Chi-Square
 t
 F
The Normal Distribution
(Probability Density Function)
f(x) = (1 / (σ√(2π))) exp[−(x − μ)² / (2σ²)]
X ~ Normal(μ, σ²)
[Figure: bell curve centered at μ]
The Standard Normal
Distribution (PDF)
φ(z) = (1 / √(2π)) exp[−z² / 2]
Z ~ Normal(0, 1)
[Figure: standard normal bell curve centered at 0]
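The standard normal density is short enough to type in directly. A sketch using only the standard library: the peak at z = 0 is 1/√(2π) ≈ 0.3989, and the curve is symmetric about zero.

```python
import math

def phi(z):
    """Standard normal PDF: (1 / sqrt(2*pi)) * exp(-z**2 / 2)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

peak = phi(0)                          # 1/sqrt(2*pi), about 0.3989
tails_match = phi(1.96) == phi(-1.96)  # symmetry about zero
```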
Chi-Square Distribution
Let Zi, i = 1, 2, …, n be independent random variables, each distributed
standard normal. Then
χ² = ∑ᵢ₌₁ⁿ Zᵢ²
χ² ~ (mean n, variance 2n)
[Figure: chi-square densities for df = 2, 4, and 6]
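The stated mean and variance can be checked by simulation: square n independent standard normal draws, sum them, and compare the sample moments to n and 2n. A sketch (seed and sample size arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4                           # degrees of freedom
z = rng.standard_normal((200_000, n))
chi2 = (z ** 2).sum(axis=1)     # sum of n squared standard normals

sim_mean = chi2.mean()          # should be close to n = 4
sim_var = chi2.var()            # should be close to 2n = 8
```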
t-distribution:
The Statistical Workhorse
Let χ² have a chi-square distribution with n degrees of freedom. Then
T = Z / √(χ²/n)
t ~ (mean 0, variance n/(n − 2))
As the degrees of freedom increase, the t-distribution approaches the
normal distribution.
[Figure: t densities for df = 2, 4, and 6 over the range −3 to 3]
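The construction T = Z / √(χ²/n) can also be simulated to watch the variance n/(n − 2) emerge; with n = 30 it is about 1.07, already close to the standard normal's variance of 1. (Seed and draw count below are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(6)
N, n = 200_000, 30              # simulation draws, degrees of freedom
z = rng.standard_normal(N)
chi2 = (rng.standard_normal((N, n)) ** 2).sum(axis=1)
t = z / np.sqrt(chi2 / n)       # T = Z / sqrt(chi2 / n)

sim_var = t.var()               # should be close to n/(n-2) = 30/28
```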
Quick Review:
Hypothesis Testing
 In STATA, the null hypothesis for a two-
tailed t-test is:
H0: βj = 0
Quick Review:
Hypothesis Testing
 To test the hypothesis, I need a rejection rule. That
is, I will reject the null hypothesis if |t| is greater than some
critical value (c):
|t| > c
 c is up to me to some extent: I must determine what level of
significance I am willing to accept. For instance, if my t-
value is 1.85 with 40 df and I am willing to reject only at the
5% level, my c equals 2.021 and I do not reject the
null. On the other hand, if I am willing to reject at the 10%
level, my c is 1.684, and I reject the null
hypothesis.
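The two critical values quoted above come from the t table with 40 degrees of freedom, but they can also be pulled from software. A sketch using SciPy (assumed installed; `t.ppf` is the inverse CDF, so the two-tailed critical value puts α/2 in each tail):

```python
from scipy.stats import t

df = 40
c_5pct = t.ppf(1 - 0.05 / 2, df)    # two-tailed 5% critical value, ~2.021
c_10pct = t.ppf(1 - 0.10 / 2, df)   # two-tailed 10% critical value, ~1.684

# With an observed t of 1.85: fail to reject at 5%, reject at 10%
reject_5 = abs(1.85) > c_5pct
reject_10 = abs(1.85) > c_10pct
```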
t-distribution:
5% rejection rule for H0: βj = 0
with 25 degrees of freedom
Looking at table G-2, I find the critical
value for a two-tailed test is 2.06.
[Figure: t density with rejection regions of area = .025 in each tail, beyond −2.06 and 2.06]
Quick Review:
 But this operation hides some very useful
information.
 STATA has decided that it is more useful to
provide the smallest level of significance
at which the null hypothesis would be rejected.
This is known as the p-value.
 In the previous example, we know that
.05 < p < .10.
 To calculate p, STATA computes the area
under the probability density function.
t-distribution:
Obtaining the p-value against a two-sided
alternative, when t = 1.85 and df = 40
p-value = P(|T| > t)
In this case, P(|T| > 1.85) = 2P(T > 1.85) = 2(.0359) = .0718
[Figure: t density with central area = .9282 and rejection regions of area = .0359 in each tail]
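The p-value computation shown in the figure is one line in software: double the upper-tail area beyond the observed t. A sketch using SciPy's survival function `t.sf` (assumed installed), reproducing the .0718 above:

```python
from scipy.stats import t

t_stat, df = 1.85, 40
p_one_tail = t.sf(t_stat, df)     # P(T > 1.85), about .0359
p_two_tail = 2 * p_one_tail       # P(|T| > 1.85), about .0718
```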
F Distribution
F = (χ²₁ / k₁) / (χ²₂ / k₂)
F and Chi-Square testing involves
only a one-tailed test of the area
underneath the right portion of the
curve.
[Figure: F densities for df = (2,8), (6,20), and (6,8)]