
Econometric Analysis of Panel and Cross Section Data

Lecture Notes for Econ 220C

Yixiao Sun

Department of Economics,

University of California, San Diego

Spring 2019

Contents

Preface ix

1 Prediction Analysis versus Causal/Structural Inference 1

1.1 Predictive model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Causal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3 Difference between Predictive Analysis and Causal Inference . . . . . . . . . 6

1.4 Some Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.4.1 Example 1: No causality does not imply no predictability . . . . . . . . 6

1.4.2 Example 2: Causality does not imply predictability . . . . . . . . . . . . 7

2 Introduction to Panel Data Modeling 9

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Benefits of Using Panel Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3 Unobserved Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 Types of Panel Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.5 Clustered Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.6 Seemingly Unrelated Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Static Panel Data Models 21

3.1 The Static Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.1.1 The Model: Fixed Effects or Random Effects? . . . . . . . . . 21

3.2 Estimation: Random-effects Approach . . . . . . . . . . . . . . . . . . . . . 22

3.2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.2.2 Asymptotic Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.2.3 Understanding the GLS Estimator . . . . . . . . . . . . . . . . . . . . . 25

3.3 Estimation: Fixed-effects Approach . . . . . . . . . . . . . . . . . . . . . . . 27

3.3.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.3.2 Estimation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.3.3 Asymptotic Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.3.4 Dummy Variable Regression . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.3.5 Incidental parameter problem . . . . . . . . . . . . . . . . . . . . . . . . 30

3.4 Estimation: First Differencing Approach . . . . . . . . . . . . . . . . . . . . 31

3.5 Comparison: FE and FD Estimators . . . . . . . . . . . . . . . . . . . . . . . . 33

3.6 Comparison: RE and FE Estimators . . . . . . . . . . . . . . . . . . . . . . . . 33

3.7 Robust Variance Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


3.8 Hausman-Wu Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.8.1 General Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.8.2 The Hausman-Wu Specification Test . . . . . . . . . . . . . . . . . . . . 36

3.8.3 Caveats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.8.4 Alternative Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.8.5 Hausman Test for Endogeneity . . . . . . . . . . . . . . . . . . . . . . . 38

3.9 Differences-in-Differences (DD/DiD) Estimator . . . . . . . . . . . . . . . . 40

3.9.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.9.2 Some Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.9.3 DiD in the Counterfactual Framework . . . . . . . . . . . . . . . . . . . 43

3.9.4 Examples and Applications . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.10 Problem Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4 Dynamic Panel Data Models 57

4.1 Models with Sequentially Exogenous Variables . . . . . . . . . . . . . . . . . . 57

4.2 Properties of FE and FD Estimators under SeqEx . . . . . . . . . . . . . . . . 58

4.2.1 Inconsistency of the FE Estimator . . . . . . . . . . . . . . . . . . . . . 58

4.2.2 Inconsistency of the FD Estimator . . . . . . . . . . . . . . . . . . . . . 59

4.3 FD+IV Estimator (Anderson and Hsiao) . . . . . . . . . . . . . . . . . . . . . . 62

4.4 Panel GMM estimator (Arellano and Bond) . . . . . . . . . . . . . . . . . . . . 63

4.4.1 The GMM Estimator: Definition . . . . . . . . . . . . . . . . . . . . . . 63

4.4.2 The GMM Estimator: Asymptotics . . . . . . . . . . . . . . . . . . . . . 65

4.4.3 Selecting the Weighting Matrix . . . . . . . . . . . . . . . . . . . . . . . 66

4.4.4 Inference Based on the Optimal GMM Estimator . . . . . . . . . . . . . 67

4.5 Models with Other Types of Indep. Variables . . . . . . . . . . . . . . . . . . . 68

4.5.1 Strictly Exogenous Variables that Are Correlated with αi . . . . . . . . 68

4.5.2 SeqEx Variables that Are Uncorrelated with αi . . . . . . . . . . . . . . 68

4.5.3 Strictly Exogenous Variables that Are Uncorrelated with αi . . . . . . 70

4.6 Number of Moment Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.7 Initialization and Maximum Likelihood Estimator . . . . . . . . . . . . . . . . . 70

4.8 Problem Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5 Extremum Estimators 77

5.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.2 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.2.1 Consistency Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.2.2 Primitive sufficient conditions for uniform convergence . . . . . . . . . . 81

5.2.3 Primitive sufficient conditions for identification . . . . . . . . . . . . . . 85

5.3 Asymptotic Normality of Extremum Estimators . . . . . . . . . . . . . . . . . . 90

5.3.1 Assumptions and the Theorem . . . . . . . . . . . . . . . . . . . . . . . 90

5.3.2 ML Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.3.3 NLS Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.3.4 GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.3.5 MD Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.3.6 Two Step GMM or MD Estimator . . . . . . . . . . . . . . . . . . . . . 96


5.3.7 Two Step WLS Estimator. . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.4 Covariance Matrix Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.5 Optimal Weight Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.6 Non-differentiable Objective Function . . . . . . . . . . . . . . . . . . . . . 102

5.6.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.6.2 Asymptotic Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.6.3 Quantile Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6 Binary Choice Models 109

6.1 Linear Probability Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

6.1.1 Introduction and estimation . . . . . . . . . . . . . . . . . . . . . . . . . 109

6.1.2 Pros and Cons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6.2 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6.2.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6.2.2 Extensions: Soft Margin Classification. . . . . . . . . . . . . . . . . . . . 114

6.3 Probit and Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

6.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

6.3.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

6.3.3 Report the Results for Probit and Logit . . . . . . . . . . . . . . . . . . 118

6.4 Probit with Endogenous Covariates . . . . . . . . . . . . . . . . . . . . . . . . . 119

6.4.1 IV and Control Function Approach . . . . . . . . . . . . . . . . . . . . . 119

6.4.2 Probit with Endogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

6.4.3 Average Partial Effects under Endogeneity . . . . . . . . . . . . . . . . 123

6.4.4 Can We Use a Two-step IV Approach? No. . . . . . . . . . . . . . . . . 126

6.4.5 Conditional Likelihood Approach . . . . . . . . . . . . . . . . . . . . . . 127

6.5 Panel Logit and Probit Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

6.5.1 Pooled Probit and Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

6.5.2 Unobserved-effect Probit Model under Strict Exogeneity . . . . . . . . . 131

6.5.3 Unobserved-effect Logit Model under Strict Exogeneity (Conditional Logit) . . . . . . . . . . . . 136

6.5.4 Dynamic Unobserved Effect Model . . . . . . . . . . . . . . . . . . . . . 138

6.6 Problem Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

7 Multinomial Response Models 142

7.1 Probabilistic Choice Model for Unordered Response . . . . . . . . . . . . . . . 142

7.2 Conditional and Multinomial Logit Models . . . . . . . . . . . . . . . . . . . . 144

7.2.1 The model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

7.2.2 Estimation of the CL Model . . . . . . . . . . . . . . . . . . . . . . . . . 146

7.2.3 The limitation of the model . . . . . . . . . . . . . . . . . . . . . . . . . 149

7.3 Multinomial Probit Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

7.4 Nested Logit Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

7.5 Ordered Probit and Logit Model . . . . . . . . . . . . . . . . . . . . . . . . . . 154

7.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155


8 Truncation, Censoring, and Sample Selection 159

8.1 Truncated Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

8.1.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

8.1.2 Moments of Truncated Normal Variables . . . . . . . . . . . . . . . . . . 159

8.1.3 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . 163

8.2 Tobit and Censored Regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

8.2.1 Derivation of Expected Values . . . . . . . . . . . . . . . . . . . . . . . 167

8.2.2 Inconsistency of the OLS Estimator . . . . . . . . . . . . . . . . . . . . 168

8.2.3 Estimation and Inference with Censored Tobit . . . . . . . . . . . . . . 168

8.2.4 Mostly Harmless Econometrics . . . . . . . . . . . . . . . . . . . . . . . 170

8.3 Tobit Models With Endogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . 172

8.4 Sample Selection with Probit Selection . . . . . . . . . . . . . . . . . . . . . . . 173

8.4.1 Heckit Two-step Estimator . . . . . . . . . . . . . . . . . . . . . . . . . 174

8.4.2 Partial Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 176

8.5 Sample Selection with Tobit Selection . . . . . . . . . . . . . . . . . . . . . . . 178

8.5.1 Two-Step Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

8.5.2 Partial MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

8.6 Problem Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

8.6.1 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

8.6.2 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

9 Causal Inference 185

9.1 The Framework of Potential Outcomes¹ . . . . . . . . . . . . . . . . . . . . . 185

9.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

9.1.2 Objects of Interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

9.2 Randomized Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

9.3 Strongly Ignorable Treatment Assignment . . . . . . . . . . . . . . . . . . . . . 196

9.3.1 Unconfoundedness Assumption . . . . . . . . . . . . . . . . . . . . . . . 196

9.3.2 Overlap Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

9.3.3 Example: job training program . . . . . . . . . . . . . . . . . . . . . . 199

9.4 Identification under Strong Ignorability . . . . . . . . . . . . . . . . . . . . . . 199

9.5 Parametric Methods under Strong Ignorability . . . . . . . . . . . . . . . . . . 201

9.5.1 Parametric Method: Discrete Cases . . . . . . . . . . . . . . . . . . . . 201

9.5.2 Parametric Method: Continuous Cases . . . . . . . . . . . . . . . . . . . 202

9.6 Nonparametric Methods under Strong Ignorability . . . . . . . . . . . . . . . . 206

9.6.1 Nonparametric Regression for ATE . . . . . . . . . . . . . . . . . . . . . 206

9.6.2 Nonparametric Regression for ATT and ATC . . . . . . . . . . . . . . . 206

9.7 Propensity Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

9.7.1 A Key Result and Its implications . . . . . . . . . . . . . . . . . . . . . 208

9.7.2 Estimation of the Propensity Score . . . . . . . . . . . . . . . . . . . . . 210

9.7.3 Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

9.7.4 Inverse Probability Weighting . . . . . . . . . . . . . . . . . . . . . . . . 214

9.7.5 Mixed Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

¹This chapter is less polished than other chapters, so use it at your own risk.


9.8 Doubly Robust Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

Preface

The primary goal of Econ 220C is to introduce tools necessary to understand and implement

empirical studies in economics focusing on issues other than time-series analysis. This course

contains two parts. The first part deals with panel data models: (1) static panel data models and (2) dynamic panel data models. The second part of the course deals with limited-dependent-variable models: (1) discrete choice models; (2) censored and truncated regression models; (3) sample selection models; and (4) evaluation of treatment effects. While the second part

focuses mainly on cross sectional data, it also covers panel Probit/Logit, panel Tobit and

panel attrition models. From an econometric theory perspective, the unifying framework for

the second part is the asymptotic theory of extremum estimators, which includes GMM as a

special case.

We will study different issues in the specification, estimation and testing of these models

with cross-sectional data and with panel data. The emphasis of the course is on both econo-

metric ideas and econometric techniques. For some of the problem sets you will have to deal

with actual data or perform simulation experiments. You should become familiar as soon as

possible with some general features of the econometric package that you choose. MATLAB

is widely used by econometricians. STATA has gained increasing popularity in recent years

among applied micro economists. R has been widely used in statistics but not as much in

economics. SAS is another option.

Please do not circulate or post these lecture notes on the Internet.


Chapter 1

Prediction Analysis versus

Causal/Structural Inference

1.1 Predictive model

We compare predictive modeling with causal modeling. Let us first define what a (linear)

predictive model is.

Given two scalar random variables (X, Y ) , suppose we want to predict Y based on X. The

starting point of a linear predictive model is to define

β∗ = cov(X, Y )/var(X) and α∗ = EY − (EX)β∗.

Note that these are purely statistical objects. They may not contain any physical, chemical,

biological, or economic meaning. With these definitions, we define e to be the difference

between Y and the linear function α∗ + Xβ ∗ :

e = Y − (α∗ + Xβ ∗ ) .

I want to emphasize that this is just a mathematical definition. The mathematical equation

can be rewritten as

Y = (α∗ + Xβ ∗ ) + e.

We add whatever is needed to bring α∗ + Xβ∗ up to Y. The added amount may not represent any real effect.

Because of the definition of β ∗ , we know that

cov(X, e) = cov [X, Y − (α∗ + Xβ ∗ )]

= cov(X, Y ) − cov(X, X)β ∗ = 0.

Because of the definition of α∗ , we know that

Ee = EY − (α∗ + EXβ ∗ ) = 0.

So we have constructed a “model”:

Y = (α∗ + Xβ ∗ ) + e


where e satisfies

Ee = 0 and cov(X, e) = 0.

This is our linear predictive model. We choose the coefficients α∗ and β∗ such that the

prediction residual e has mean zero and is uncorrelated with X. That is, the prediction residual

can not be (linearly) predicted by the predictor X.

If we want to predict X using Y, we can first define

γ∗ = cov(X, Y )/var(Y ) and δ∗ = EX − (EY )γ∗

and then define

ẽ = X − δ ∗ − Y γ ∗ .

Then we have

X = δ ∗ + Y γ ∗ + ẽ

and Eẽ = 0 and cov (Y, ẽ) = 0

Interpretation of β ∗ : If we observe that X changes by 1 unit, then we expect Y to change

by β ∗ units. When we observe such a change in X, other things, including both observables

and unobservables, may have changed too. So the ‘all else being equal’ condition may not be met.

The expected change of β ∗ units could be due to the change of X and/or other variables that

change with X.

Here is an alternative interpretation. Suppose there are two individuals i and j. If we

observe that Xi is higher than Xj by one unit, then we expect Yi to be higher than Yj by β ∗

units. Here we do not know whether other things are the same across the two individuals. In

fact, we let other things change freely with X. That is, individuals make their own choice

of X and other variables and we happen to observe Xi = Xj + 1.

Checking your understanding. Suppose X = Y1 + Y2 and Y = Y1 − Y2 where Y1 and Y2

are iid. What is the (linear) statistical prediction model between Y and X? That is, what are

α∗ , β ∗ and e such that

Y = α∗ + Xβ ∗ + e

and E(e) = 0, cov(X, e) = 0? By definition,

β∗ = cov(X, Y )/var(X) = cov(Y1 + Y2, Y1 − Y2)/var(X) = 0,

because cov(Y1 + Y2, Y1 − Y2) = var(Y1) − var(Y2) = 0 when Y1 and Y2 are iid. Then

α∗ = EY − (EX)β∗ = 0,

so

e = Y − (α∗ + Xβ∗) = Y1 − Y2.

Now what if Y = Y1? Answer:

Y = 0 + (1/2)X + (1/2)(Y1 − Y2),

so α∗ = 0, β∗ = 1/2, and e = (1/2)(Y1 − Y2). Again this is a purely statistical decomposition, which is useful for prediction.
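As a quick numerical check, here is a minimal Python/NumPy sketch of this example (the simulation design and variable names are my own, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
Y1, Y2 = rng.normal(size=n), rng.normal(size=n)   # iid draws
X = Y1 + Y2

def lin_pred(X, Y):
    """Sample analogues of beta* = cov(X, Y)/var(X) and alpha* = EY - (EX) beta*."""
    beta = np.mean((X - X.mean()) * (Y - Y.mean())) / np.mean((X - X.mean()) ** 2)
    return Y.mean() - X.mean() * beta, beta

print(lin_pred(X, Y1 - Y2))         # approx (0, 0):   beta* = 0 when Y = Y1 - Y2
alpha, beta = lin_pred(X, Y1)       # approx (0, 0.5): beta* = 1/2 when Y = Y1
e = Y1 - (alpha + beta * X)
print(np.mean((X - X.mean()) * e))  # approx 0: the residual is uncorrelated with X
```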


A statistical relationship does not have to be linear. How should we interpret (α∗ + Xβ∗)? We need an alternative interpretation or definition of α∗ and β∗:

(α∗, β∗) = arg min_{a,b} f(a, b), where f(a, b) = E[(Y − a − Xb)²].

Check that this definition is the same as before.

(α∗, β∗) are the population regression coefficients. Compare them with the sample regression coefficients (i.e., the OLS estimator):

(α̂∗_OLS, β̂∗_OLS) = arg min_{a,b} f̂n(a, b), where f̂n(a, b) = (1/n) Σ_{i=1}^n (Yi − a − Xi b)²,

or equivalently

β̂∗_OLS = ĉov(X, Y )/v̂ar(X) and α̂∗_OLS = Ȳ − X̄ β̂∗_OLS.

Define

m(x) = E(Y |X = x)

which is the conditional mean function. This is a function of x : for each given x, we can

compute E(Y |X = x) and assign this value to m (x) . When we do not look at a particular

realized value of X, we write m(X) = E(Y |X). Now

f(a, b) = E{Y − m(X) + m(X) − a − Xb}²
       = E{[Y − m(X)]² + [m(X) − a − Xb]² + 2[Y − m(X)][m(X) − a − Xb]}
       = E{[Y − m(X)]²} + E{[m(X) − a − Xb]²} + 2E{[Y − m(X)][m(X) − a − Xb]}.

To understand the above steps, we can take

e1 = Y − m(X) and e2 = m(X) − a − Xb,

and then

f(a, b) = E[(e1 + e2)²] = E(e1²) + E(e2²) + 2E(e1 e2).

A crucial step is to show the last term is zero. That is,

E {e} = 0 for e = e1 e2 .

By the Law of Iterated Expectation (LIE), which is also called “law of total expectation”, we

have

E {e} = E {E(e|X)} = E {E(e1 e2 |X)} .

Note that e2 is a function of X. Once we know X, we know e2 completely. So given X, e2 is

a deterministic number. A direct implication is E(e1 e2 |X) = e2 E (e1 |X) . But

E (e1 |X) = E {[Y − m(X)] |X} = E (Y |X) − E {m(X)|X}

= m(X) − m(X) = 0.


Figure 1.1: Predictive Decomposition in the Absence of the Intercept

Hence E(e1 e2 |X) = 0 and E {e} = 0.

It now follows that

f(a, b) = E{[Y − m(X)]²} + E{[m(X) − a − Xb]²}.

The first term E{[Y − m(X)]²} does not depend on (a, b). Its presence does not change the argument that minimizes the function. So

arg min_{a,b} f(a, b) = arg min_{a,b} E{[m(X) − a − Xb]²}.

To summarize,

(α̂_OLS, β̂_OLS) →p (α∗, β∗) = arg min_{a,b} E{[m(X) − a − Xb]²}.

Remark 1 By definition, α∗ + xβ∗ is the best linear approximation to the conditional expectation function (CEF) m(x). α∗ + xβ∗ is the closest linear function to m(x) according to the criterion

E{[m(X) − a − Xb]²} = ∫ [m(x) − a − xb]² fX(x) dx.

This criterion is called the Mean Square Error (MSE).

Remark 2 We also say that α∗ + Xβ ∗ is the best linear prediction of Y given X in the MSE

sense.


Remark 3 If E(Y |X) = α0 + Xβ 0 , then α∗ = α0 and β ∗ = β 0 (under some additional

identification conditions)

Remark 4 As a by-product of the above argument, we can write Y = m (X) + e1 as an

identity, but we know that E (e1 |X) = 0. So any random variable can be decomposed into

two pieces: a piece that is statistically explained by X, that is, the CEF, and a piece left over

that is orthogonal to (i.e., uncorrelated with) any function of X. This decomposition is totally

statistical.

Example 1: Suppose Y = Xβ 0 + u and u = Xγ + v where v is independent of X and

E (v) = 0. What is β ∗ ?

Example 2: Suppose Y = X 2 +u where X and u are independent standard normals. What

is β ∗ ?

1.2 Causal Model

For a linear causal model

Y ← α + Xβ + u

where u stands for other and possibly unobserved causal factors.

Interpretation of β: If we intervene and set X to change by 1 unit while keeping all else constant, then Y will change by β units. The difference between β∗ and β lies in whether all else has been kept equal.

Here is an alternative interpretation of β. Consider a given individual i, suppose we change

Xi by one unit while keeping all else constant, then we expect Yi to change by β units. We

compute the causality e§ect by looking at the same individual under two di§erent scenarios:

one is observed and the other is counterfactual. That is, we observe one scenario. Then we

ask: what would have happened if Xi was increased by one unit while keeping all else being

equal?

To estimate β, ideally we hope to observe the changes of Y in response to changes in X

while holding all else being equal. This is only possible in ideal and controlled experiments.

In an observational study, we do not control the causal factors so X, the causal factor of

interest, may covary with unobserved causal factors collected in u. For example, X is years

of schooling, Y is hourly wage rate and one causal factor in u is ability. In observational

studies where individuals make their decision on X, X is likely to be correlated with u. When

we observe a unit increase in X from individual i to individual j, we have also implicitly gone

through a change in ability. The change in Y (from individual i to individual j) may be partly

due to the change in X (from individual i to individual j) and may be partly due to the change

in ability (from individual i to individual j). Running an OLS of Y on X will give us a good

estimator of β ∗ , which aggregates the two e§ects. However, in policy analysis, we ask: what

would happen to individual i’s Y if we intervene and change his or her X by one unit. That

is, we care about the effect of education on Y while keeping ability equal.


1.3 Difference between Predictive Analysis and Causal Inference

Predictive Analysis
  Model: Y = α∗ + Xβ∗ + e
  Correlation: by construction, E(e) = 0 and cov(X, e) = 0
  Interpretation of the slope: let other variables run their own course; all else may not be equal

Causal Inference
  Model: Y = α + Xβ + u
  Correlation: cov(X, u) may not be zero
  Interpretation of the slope: all other variables have to be kept constant

1.4 Some Examples

1.4.1 Example 1: No causality does not imply no predictability

Suppose that we have the following simple causal relations:

y ← az
x ← bz

for b ≠ 0. Graphically, z is a common cause of both variables: z → x and z → y, with no arrow between x and y.

Let z be generated as a sequence of iid random variables Zi so that in the absence of inter-

vention for x and y we observe

Xi = Zi × b,
Yi = Zi × a = (a/b) Xi.

For the purpose of this example, we assume that we do not observe the Zi's. Thus our observations

consist of (Xi , Yi ) lying on the line y = (a/b) x. Given any Xi , the best prediction of Yi is

m (Xi ) = (a/b) Xi . Thus Xi is useful for predicting Yi , even though there is no causal relation

between Xi and Yi. Furthermore, the regression coefficient a/b definitely does not measure the effect on y caused by a change in x. Intervening to change x (while keeping z constant) has no effect on y. Instead, the regression coefficient a/b works together with Xi to give an optimal

prediction of Yi .
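A small simulation sketch makes the point concrete (a, b, and the sample size are arbitrary choices of mine): the fitted slope recovers a/b, which is useful for predicting Yi from Xi but says nothing about what happens if we intervene on x.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n = 2.0, 0.5, 100_000
Z = rng.normal(size=n)
X, Y = b * Z, a * Z       # z causes both; there is no causal link from X to Y

slope = np.mean((X - X.mean()) * (Y - Y.mean())) / np.var(X)
print(slope, a / b)       # both approx 4.0: the regression slope equals a/b
```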


In any equation system, like the one above, if we intervene on a variable (say x), then the

equation that determines this variable has to be crossed out. The equation does not describe

how x is determined any more. In this example, the system becomes

x ← x0 (set)
y ← az

Graphically, the arrow from z into x is removed: x is now set to x0, while z still determines y (x0 → x, z → y).

x and y are not connected in any way: the causal effect is zero. The causal effect is (when x is set at two different values x0, x00)

y(x00 ) − y(x0 ) = 0

where y(x0 ) is the value of y when x is set at x0 .

1.4.2 Example 2: Causality does not imply predictability

Consider the following causal system

y ← ax + u,
x ← by + v,

or graphically: v → x, u → y, and x and y cause each other (x ⇄ y).

Example: x : crime rate; y : police spending. (The two different causal directions may not

happen at exactly the same time, but if we observe the variables not very frequently, then


y ← ax + u and x ← by + v can be regarded as happening simultaneously over each observation

interval).

Suppose that the values of (u, v) are generated as an iid sequence of pairs (Ui, Vi) such that

(Ui, Vi)′ ∼ N(0, [σuu σuv; σuv σvv]).

We do not observe (Ui, Vi). The reduced form (the equilibrium solution in terms of Ui and Vi) is given by

Xi = (bUi + Vi)/(1 − ab),
Yi = (Ui + aVi)/(1 − ab).

It is now easy to show that

β∗ = cov(Xi, Yi)/var(Xi) = [bσuu + (1 + ab)σuv + aσvv] / [b²σuu + 2bσuv + σvv].

Sufficient freedom exists to deliver a wide range of possible values for β∗. For example, when σuu = 0, we have

β∗ = aσvv/σvv = a,

whereas if σvv = 0, we have β∗ = 1/b. Picking

σuv = −(bσuu + aσvv)/(1 + ab)

gives

β∗ = 0

so that Xi is useless as a predictor of Yi (under normality, the best prediction is linear in X), even though causal relations exist between them.
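These claims are easy to verify by simulation (a sketch; the parameter values below are my own illustrative choices). With σuv = −(bσuu + aσvv)/(1 + ab), the fitted slope of Yi on Xi is approximately zero even though x and y cause each other:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 0.5, 0.3
s_uu, s_vv = 1.0, 2.0
s_uv = -(b * s_uu + a * s_vv) / (1 + a * b)   # the choice that makes beta* = 0

n = 500_000
U, V = rng.multivariate_normal([0, 0], [[s_uu, s_uv], [s_uv, s_vv]], size=n).T
X = (b * U + V) / (1 - a * b)                 # reduced form
Y = (U + a * V) / (1 - a * b)

beta_star = np.mean((X - X.mean()) * (Y - Y.mean())) / np.var(X)
print(beta_star)                              # approx 0
```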

The optimal linear prediction interpretation of β∗ holds regardless of which of the following cases we are in:

(a) x is the cause of y

(b) y is the cause of x

(c) x and y are mutually non-causal, although both have a common cause

(d) x and y mutually cause each other in the presence of additional causal variables.

In the first three cases, the predictions are in fact perfect, while in the last case we can

have Xi and Yi useless as predictors of one another despite their causal relationships.

While in case (a) the conditional mean coincides with the causal function, this is not true in any of the other cases. This conditional expectation cannot by itself tell us what we should

expect to happen when we intervene to set Xi to a particular value. Rather, it predicts: it

tells us what we can expect Yi to be given Xi when Yi and Xi are generated by whatever

process is operational for observation i.

Chapter 2

Introduction to Panel Data

Modeling

2.1 Introduction

Recently empirical research in economics has been enriched by the availability of a wealth of

new sources of data: cross sections of individuals observed over time. This type of dataset is called panel data. Other terms used for such data include longitudinal data and repeated

measures (Statistics and Biostatistics). The availability of panel data has stimulated a rapid

growth in both methodological approaches and applications during the last thirty years.

The basic linear panel data model is:

Yit = Xit β + εit , i = 1, ..., N, t = 1, ..., T (2.1)

If Xit contains no lagged dependent variables, the model is a static linear panel data model.

Otherwise, it is a dynamic linear panel data model. The set of explanatory variables may

include:

• variables that vary across individuals and time periods, e.g., wage, age, and years of

experience. Denote them as Xit .

• variables that are time-invariant, i.e., vary only across individuals, e.g., race and sex.

Denote them as Xi .

• variables that vary only over time but not across individuals, e.g., economy-wide un-

employment, minimum-wage level, and other macroeconomic factors. Denote them as

Xt .

The model given above is not meaningful unless we explain what εit is. Consider the case that Xit is a scalar random variable. If we define

β = cov(Xit, Yit)/var(Xit)

and let εit = Yit − Xit β, then εit is the prediction error and Xit β is the best linear prediction of Yit given Xit. In this case, we have cov(Xit, εit) = 0 by definition and the above model is a

predictive model.


If we assume that (i) Xit is a causal factor of Yit , (ii) the causal link from Xit to Yit is

linear, and (iii) εit contains all other (unobserved) causal factors, then the above model is a causal model or a structural/behavioral model. In a causal/structural model, εit and Xit are possibly correlated. By default, the linear model we consider in this course should be

interpreted as a linear causal/structural/behavioral model.

2.2 Benefits of Using Panel Data

What is generally referred to as the panel data approach to economics research provides several major advantages over conventional cross-sectional or time-series data approaches. Both Hsiao

(2014) in his seminal monograph and Baltagi (2013) in his excellent book provide extensive

summaries.

• more informative data, more variability, more degrees of freedom and more efficiency.

• dynamics of adjustment. A cross section may yield an unemployment rate of 6%; we

need panel data to determine whether the same 6% are unemployed each year.

• Repeated observations on the same unit allow identification in the presence of some types of unobservables, specifically, “permanent” unobserved differences across the countries,

firms, or individuals that are related to the causal variable of interest.

Figure 2.1: We are interested in the slope of the thin lines. If we have only cross sectional

data the fitted line will be the thick one. The estimated slope is obviously biased downward.


Figure 2.2: We are interested in the slope of the thin lines. If we have only cross sectional

data the fitted line will be the thick one, the estimated slope is obviously biased upward.

2.3 Unobserved Heterogeneity

Example 1. Consider agricultural Cobb-Douglas production function. Let

Yit = log output

Xit = log of a variable input

Zi = an input that remains constant over time (soil quality)

uit = a stochastic input which is outside the farmers’ control (rainfall)

We have

Yit = Xit β + εit for εit = Zi γ + uit.

Suppose Zi is known by the farmer but not by the econometrician. Then the profit-maximizing choice of Xit will depend on Zi. Therefore, Xit will be (positively) correlated with Zi and hence with εit. A pooled panel regression estimator of β will have an upward bias.

Example 2: State Traffic Fatality Data

• The data are for 48 states, where each state is observed in T = 7 time periods (each of

years 1982, . . . , 1988)

• 40,000 highway traffic fatalities each year in the US

• Approx. 1/3 of fatal crashes involve a driver who was drinking

• A study estimated that 25% of drivers on the road between 1am and 3am have been

drinking

• A driver who is legally drunk is 13 times as likely to cause a fatal crash


Objective: the effect of government policies designed to discourage drunk driving on the fatality rate

Y ⇒ Fatality rate: the number of annual traffic deaths per 10,000 people in a state

X ⇒ Beer tax: the “real” tax on a case of beer, i.e. the beer tax put into 1988 dollars.

Scatterplot: [Figure: The Traffic Fatality Rate and the Tax on Beer (1982 data); fatality rate (vfrall) against the tax on a case of beer in 1988 dollars, with fitted line]

[Figure: The Traffic Fatality Rate and the Tax on Beer (1988 data); fatality rate (vfrall) against the tax on a case of beer in 1988 dollars, with fitted line]

1982 Estimation:

Fatalityrate = 2.01 + 0.15 Beertax   (2.2)
              (0.15)  (0.13)


1988 Estimation

Fatalityrate = 1.86 + 0.44 Beertax   (2.3)
              (0.11)  (0.13)

• t1982 is not significant at the 10% level while t1988 is significant at the 1% level

• Higher taxes are associated with more, not fewer traffic fatalities??? But if we focus on one state, say California, we obtain Figure 2.3.

Figure 2.3: Fatality rate vs Beertax in California (fatality rate, vfrall, plotted against the tax on a case of beer for the years 1982–1988)

• Omitted variable bias: quality of the auto, highway conditions, social attitude toward

drinking and driving

• Suppose: High traffic density means more traffic deaths; (Western) states with lower traffic density have lower alcohol taxes

beer tax → (−) traffic deaths ← (+) traffic density

• Solution: Collect all the relevant data and augment the simple regression. However,

some of these variables are not observable or measurable


• Keep those variables constant across different periods ⇒ fixed-effects model

Let Zi be a variable that determines the fatality rate in state i but does not change over

time. Let Y = Fatality rate and X = Beertax, then

Yit = b0 + b1 Xit + b2 Zi + uit (2.4)

When t = 1982, we have

Yi1982 = b0 + b1 Xi1982 + b2 Zi + ui1982 (2.5)

When t = 1988, we have

Yi1988 = b0 + b1 Xi1988 + b2 Zi + ui1988 (2.6)

Subtracting (2.5) from (2.6), we get

Yi,1988 − Yi1982 = b1 (Xi1988 − Xi1982 ) + (ui1988 − ui1982 ) (2.7)

Cultural attitudes toward drinking and driving affect the level of drunk driving and thus the

fatality rate. However, if they do not change over time, then they do not produce any change

in fatalities in the state. The changes must arise from other sources.

[Figure: Change in Fatality Rate (1982–1988) plotted against Change in Beer Tax, with fitted line]

For more examples on unobserved heterogeneity, see Arellano (2003, pages 8-10).

2.4 Types of Panel Data Sets

• Large N and T (panel time series, data fields)


For some macro panels and financial panels, both N and T can be large. We need to allow both N and T to go to infinity. This is the so-called multidimensional asymptotics. We may let N → ∞ first and then let T → ∞, or let T → ∞ first and then let N → ∞, or let N and T go to ∞ at the same time but control the relative rate of expansion (i.e., √N/T → 0). The first two asymptotics are called sequential asymptotics and the last one is called joint

asymptotics.

Data of this kind have been prominent, for example, in research on models of growth

and convergence. See Penn World Table http://pwt.econ.upenn.edu/ or Global Development

Network Growth Database at the World Bank.

• Small N and T (hopeless!)

• Small T, large N (traditionally considered in panel data econometrics)

In micro panels, N is typically very large (several hundreds or even thousands) while T is

quite small (ranging from 2 to 10 in most cases, and very rarely exceeding 20). If T is much

smaller than N, the usual asymptotics is to let N → ∞ with T fixed. Panel data sets with a small T dimension are often called traditional panels or micro panels.

Examples of micro panels are household or firm level panels, which are based on surveys,

census, administrative records or company balance accounts. Two widely used data sets are

Panel Study of Income Dynamics (PSID) available from

http://psidonline.isr.umich.edu/ and National Longitudinal Surveys of Labor Market Experi-

ence (NLS) available from http://www.bls.gov/nls/

• Small N, large T (Seemingly Unrelated Regression Equation (SURE)). This type of data

set is referred to as time series and cross sectional data (TSCS) in political science.

2.5 Clustered Sampling

N and T do not necessarily refer to number of individuals and time periods respectively.

Other examples include families and family members, schools and classes, industries and firms.

Many types of cross sectional survey data are obtained through “cluster” sampling. Certain

geographical units are first selected (e.g. villages), then individuals are sampled within each

village. Thus, the village from which the individual observation comes may be thought of as

one dimension of the data. Thus panel data methods are of special importance in research in

developing countries. A simple model is

yci = αc + Xci β + uci , c = 1, ..., C, i = 1, ..., Ic

where c indexes the cluster and i indexes individuals in the cluster. If we have a large number

of clusters and relatively small group sizes (max(Ic ) is small), then we have a traditional linear

panel data model.

If you will do research in development and deal with survey data, it is worthwhile reading

the book by Deaton (1997).


2.6 Seemingly Unrelated Regression

A SURE system is a system of K equations without any feedback mechanism. It can be

represented as:

y1j = X1j β1 + e1j
y2j = X2j β2 + e2j
...
yKj = XKj βK + eKj

for j = 1, 2, ..., N, where Xij are regressors that are assumed to be exogenous and βk are vectors of parameters. The equations in the system may be related as the error terms in different equations may be correlated. For example, ykj may be individual j's expenditure on good k or budget share for good k. In a SURE system, K is typically small while N is large.

Now, we can write our panel model with small N and large T as a SURE system:

y1t = X1t β1 + e1t
y2t = X2t β2 + e2t
...
yNt = XNt βN + eNt

for t = 1, 2, ..., T where Xit ∈ R1×d. In matrix form, we have

yt = Xt β + et

where

yt = (y1t, y2t, ..., yNt)′, N × 1
et = (e1t, e2t, ..., eNt)′, N × 1
β = (β1′, β2′, ..., βN′)′, N d × 1

and

Xt = diag(X1t, X2t, ..., XNt), an N × (N d) block-diagonal matrix with Xit in the i-th diagonal block.

Here d is the dimension of each Xit. Let y = (y1′, ..., yT′)′ ∈ RNT×1 and define X ∈ RNT×Nd and e ∈ RNT×1 similarly. We can write the above system as

y = Xβ + e


with

E(e|X) = 0

and

E(ee′|X) = Φ.

A special case is

Φ = IT ⊗ Σ, where Σ is the N × N matrix with (i, j) element σij.

That is, et |X ∼ iid(0, Σ) across t = 1, ..., T. There is no serial autocorrelation or heteroscedas-

ticity. If you are overwhelmed by the notation, you can consider the special case N = 1, which

gives us a typical time series regression.

Note that by definition, A ⊗ B is the block matrix whose (i, j) block is aij B:

A ⊗ B = [ a11 B  a12 B  ...  a1N B
          a21 B  a22 B  ...  a2N B
          ...    ...    ...  ...
          aN1 B  aN2 B  ...  aNN B ].

Some formulae: (A ⊗ B)′ = A′ ⊗ B′, (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹, (A ⊗ B)(C ⊗ D) = AC ⊗ BD, vec(ABC) = (C′ ⊗ A) vec(B).
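These identities are easy to check numerically; here is a small NumPy sketch with arbitrary random matrices (vec() below stacks columns, i.e., flattens in Fortran order):

```python
import numpy as np

rng = np.random.default_rng(3)
A, C = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
B, D = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))

assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))            # (A⊗B)' = A'⊗B'
assert np.allclose(np.linalg.inv(np.kron(A, B)),
                   np.kron(np.linalg.inv(A), np.linalg.inv(B)))   # (A⊗B)^{-1} = A^{-1}⊗B^{-1}
assert np.allclose(np.kron(A, B) @ np.kron(C, D),
                   np.kron(A @ C, B @ D))                         # (A⊗B)(C⊗D) = AC⊗BD

M, N, P = rng.normal(size=(2, 3)), rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
vec = lambda Z: Z.flatten(order="F")
assert np.allclose(vec(M @ N @ P), np.kron(P.T, M) @ vec(N))      # vec(ABC) = (C'⊗A)vec(B)
```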

The GLS estimator of β is

β̂_GLS = (X′Φ⁻¹X)⁻¹ X′Φ⁻¹y,

which is a BLUE. (Here X′Φ⁻¹ is Nd × NT and X is NT × Nd.)

Under one of the following two conditions, OLS applied to each equation is equivalent to

GLS when Φ = IT ⊗ Σ :

• (i) Σ = IN

• (ii) X1t = X2t = ... = XNt = xt ∈ R1×d

The proof of this last condition involves rewriting the SUR system individual-by-individual

as follows:

ỹi = X̃i β i + ẽi


where

ỹi = (yi1, yi2, ..., yiT)′ (T × 1), X̃i = (Xi1′, Xi2′, ..., XiT′)′ (T × d), ẽi = (ei1, ei2, ..., eiT)′ (T × 1).

That is,

ỹ = X̃β + ẽ,

where ỹ = (ỹ1′, ..., ỹN′)′ (NT × 1), X̃ = diag(X̃1, X̃2, ..., X̃N) (NT × Nd), ẽ = (ẽ1′, ..., ẽN′)′ (NT × 1), and now var(ẽ|X̃) = Σ ⊗ IT.

When X1t = X2t = ... = XNt = xt, we have

X̃i = (x1′, x2′, ..., xT′)′ := xT×d for every i, and X̃ = IN ⊗ xT×d.

So

β̂_GLS = [X̃′(Σ ⊗ IT)⁻¹X̃]⁻¹ [X̃′(Σ ⊗ IT)⁻¹ỹ]
      = [(IN ⊗ x′T×d)(Σ⁻¹ ⊗ IT)(IN ⊗ xT×d)]⁻¹ (IN ⊗ x′T×d)(Σ⁻¹ ⊗ IT) ỹ
      = [Σ⁻¹ ⊗ (x′T×d xT×d)]⁻¹ (Σ⁻¹ ⊗ x′T×d) ỹ
      = [Σ ⊗ (x′T×d xT×d)⁻¹] (Σ⁻¹ ⊗ x′T×d) ỹ
      = [IN ⊗ (x′T×d xT×d)⁻¹ x′T×d] ỹ.


Writing this out more explicitly, we have

β̂_GLS = diag((x′T×d xT×d)⁻¹x′T×d, ..., (x′T×d xT×d)⁻¹x′T×d) (ỹ1′, ..., ỹN′)′
      = ((x′T×d xT×d)⁻¹x′T×d ỹ1, (x′T×d xT×d)⁻¹x′T×d ỹ2, ..., (x′T×d xT×d)⁻¹x′T×d ỹN)′,

which is the same as the equation-by-equation OLS estimator. This is a purely algebraic

result. There does not seem to be any good intuition on why OLS is numerically identical to

GLS. The numerical equivalence between OLS and GLS in this case is a well-known result in

econometrics.
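Even so, the identity can be confirmed numerically. Below is a minimal sketch (dimensions, Σ, and data are arbitrary choices of mine): with identical regressors xt in every equation, GLS built from Φ = Σ ⊗ IT coincides with equation-by-equation OLS.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, d = 3, 50, 2                       # equations, time periods, regressors per equation

x = rng.normal(size=(T, d))              # common regressor matrix x_{T x d}
A = rng.normal(size=(N, N))
Sigma = A @ A.T + np.eye(N)              # cross-equation error covariance
e = rng.multivariate_normal(np.zeros(N), Sigma, size=T)   # T x N errors
beta = rng.normal(size=(N, d))
Y = x @ beta.T + e                       # T x N; column i holds y_i

y_tilde = Y.T.reshape(-1)                # (y_1', ..., y_N')'
X_tilde = np.kron(np.eye(N), x)          # block-diagonal regressor matrix I_N kron x
Vinv = np.linalg.inv(np.kron(Sigma, np.eye(T)))            # (Sigma kron I_T)^{-1}

b_gls = np.linalg.solve(X_tilde.T @ Vinv @ X_tilde, X_tilde.T @ Vinv @ y_tilde)
b_ols = np.concatenate([np.linalg.solve(x.T @ x, x.T @ Y[:, i]) for i in range(N)])
print(np.max(np.abs(b_gls - b_ols)))     # numerically zero
```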

For more details on estimating systems of equations, read Chapter 7 in Wooldridge (2010).

A good reference for the Kronecker product and matrix algebra in general is Abadir and

Magnus (2005).

Bibliography

[1] Abadir K. and J. Magnus (2005): Matrix Algebra, Cambridge University Press.

[2] Arellano, M. (2003): Panel Data Econometrics. Oxford University Press.

[3] Baltagi, Badi H. (2013): Econometric Analysis of Panel Data, John Wiley & Sons.

[4] Deaton, Angus (1997): The Analysis of Household Surveys, The Johns Hopkins University Press.

[5] Hsiao, Cheng (2014): Analysis of Panel Data, Cambridge University Press.

[6] Stock, J. and M. Watson (2007): Introduction to Econometrics, 2nd Edition, Addison and

Wesley.

[7] Wooldridge J. (2010): Econometric Analysis of Cross Section and Panel Data, The MIT

Press.


Chapter 3

Static Panel Data Models

3.1 The Static Model

3.1.1 The Model: Fixed Effects or Random Effects?

Consider the structural model

Yit = Xit β + εit , i = 1, ..., N, t = 1, ..., Ti . (3.1)

In this chapter, we consider only the static model so that Xit contains no lagged dependent

variables.

Notation: all vectors except Xit are column vectors. Denote Xi = (Xi1′, Xi2′, ..., XiTi′)′ ∈ RTi×k, where k is the number of elements in Xit, and Yi = (Yi1, ..., YiTi)′ ∈ RTi×1.

We assume that εit can be decomposed into two components:

εit = αi + uit.

In our simple example

αi = b2 Zi (3.2)

where Zi consists of time invariant covariates.

• αi: individual effect or individual heterogeneity; uit: the idiosyncratic error.

• In methodological papers, one often discusses whether αi should be treated as a random effect (random variable) or a fixed effect (parameters to be estimated).

• The key issue is whether αi is correlated with Xi or, put in a stronger form, whether E(αi|Xi) = 0. Following Wooldridge (2010, Chapter 10), we always treat αi as a random variable. When E(αi|Xi) = 0, we say the model is a random-effects model. Otherwise, it is a fixed-effects model. Some authors refer to the random-effects model and fixed-effects model as the uncorrelated effect model and correlated effect model, respectively.

• Regardless of the type of model, there are two approaches to estimating β : the random-

effects approach and the fixed-effects approach. In the former approach, αi is not treated

as a parameter to be estimated while in the latter approach αi is treated as a parameter

to be estimated.



3.2 Estimation: Random-effects Approach

3.2.1 Assumptions

Assumption RE.1(a): E (αi |Xi ) = E (αi ) = 0; E (ui |Xi ) = E (ui ) = 0.

E(αi|Xi) = E(αi) = 0 is the random-effects assumption. E(ui|Xi) = 0 is the so-called

strong exogeneity assumption. More specifically, we assume

E (uit |Xi1 , Xi2 , ..., XiTi ) = 0, (3.3)

which implies that

E (uit Xis ) = 0 for all s and t. (3.4)

This is certainly stronger than zero contemporaneous correlation. The strong exogeneity

assumption may not hold for some economic applications.

Example 5 Program Evaluation:

log(Wit) = θt + Zit γ + δ1 Progit + αi + uit, (3.5)

where Progit is a binary variable indicating program participation.

*Feedback effect: If one attributes the higher wage at time t, which is really due to the random shock uit, to the training program, then he is more likely to join the training program in the future. In this scenario, a higher uit leads to a high value for Progi,t+1, and so uit and Progi,t+1 can be correlated. This violates the strong exogeneity assumption.

Assumption RE.1(b): (Xi , αi , ui ) is i.i.d. over i.

Assumption RE.1(b) is very strong as it rules out cross-sectional dependence. It is difficult

to allow cross-sectional dependence, especially when N is large and time series are short. Panel

data models with cross sectional dependence have attracted much attention in recent years.

In this course, we maintain the assumption of cross-section independence.

Under Assumption RE.1, the random-effects model can be represented graphically as

Xi → Yi ← ui
      ↑
      αi

where the absence of a link between Xi and ui indicates E (ui |Xi ) = 0 and the absence of a

link between Xi and αi indicates that E (αi |Xi ) = 0.

Assumption RE.2. Rank(Σ_{i=1}^N E Xi′Ωi⁻¹Xi) = k, where Ωi = E(εiεi′|Xi) and εi = (εi1, ..., εiTi)′.

Assumption RE.3. E(αi²|Xi) = σα², E(αi ui|Xi) = 0, and E(ui ui′|Xi) = σu² ITi.


The presence of the time-invariant random effect αi implies the presence of persistent unobserved heterogeneity and the variance-covariance structure:

E(εit εjs|Xi) = σα² + σu²  if i = j and t = s;  σα²  if i = j and t ≠ s;  0  if i ≠ j.   (3.6)

Such a covariance structure implies serial correlation in the error terms. Hence OLS estimation

in such models will not be BLUE.

Under Assumption RE.3, we have

Ωi = the Ti × Ti matrix with σα² + σu² on the diagonal and σα² in every off-diagonal element   (3.7)
   = σu² ITi + σα² JTi,   (3.8)

where ITi is the Ti × Ti identity matrix and JTi is the Ti × Ti matrix with unity in every element. Sometimes it is more convenient to rewrite

Ωi = σu² (ITi − JTi/Ti) + (σu² + Ti σα²) JTi/Ti
   = σu² QTi + (σu² + Ti σα²) PTi

for QTi = ITi − JTi/Ti and PTi = JTi/Ti.

The GLS estimator is

β̂_GLS = (Σ_{i=1}^N Xi′Ωi⁻¹Xi)⁻¹ (Σ_{i=1}^N Xi′Ωi⁻¹Yi).   (3.9)

Let X = (X1′, X2′, ..., XN′)′, Y = (Y1′, Y2′, ..., YN′)′, and V = diag(Ω1, Ω2, ..., ΩN). We can write β̂_GLS as

β̂_GLS = (X′V⁻¹X)⁻¹ X′V⁻¹Y.


3.2.2 Asymptotic Inference

• Under assumptions RE.1 and RE.2, the GLS estimator is consistent.

• Under assumptions RE.1, RE.2 and RE.3, the GLS estimator is efficient in the class of

linear and unbiased estimators.

Traditionally, the statistical and econometric literature focuses on the class of unbiased

estimators. Nowadays, unbiasedness is just one of many different criteria to evaluate an esti-

mator.

The estimator in (3.9) is not feasible. As in the typical GMM setup, we replace Ωi by ITi

to get an initial consistent estimator of β. This estimator is the pooled OLS estimator.

β̂_OLS = (Σ_{i=1}^N Xi′Xi)⁻¹ (Σ_{i=1}^N Xi′Yi)   (3.10)
      = Σ_{i=1}^N (Σ_{j=1}^N Xj′Xj)⁻¹ Xi′Xi (Xi′Xi)⁻¹ Xi′Yi   (3.11)
      = Σ_{i=1}^N Wi (Xi′Xi)⁻¹ Xi′Yi = Σ_{i=1}^N Wi β̂⁽ⁱ⁾_OLS   (3.12)

where

Wi = (Σ_{j=1}^N Xj′Xj)⁻¹ Xi′Xi,   (3.13)

and

β̂⁽ⁱ⁾_OLS = (Xi′Xi)⁻¹ Xi′Yi   (3.14)

is the OLS estimator using only the time series observations for individual i. With β̂_OLS, we can construct estimators of σα² and σu². For example, σε² = σα² + σu² can be estimated by

σ̂ε² = [Σ_{i=1}^N Ti − k]⁻¹ Σ_{i=1}^N Σ_{t=1}^{Ti} (Yit − Xit β̂_OLS)².   (3.15)

Note that σα² = E(εit εis) for t ≠ s, so it can be estimated by

σ̂α² = [Σ_{i=1}^N Ti(Ti − 1)/2 − k]⁻¹ Σ_{i=1}^N Σ_{t=1}^{Ti−1} Σ_{s=t+1}^{Ti} (Yit − Xit β̂_OLS)(Yis − Xis β̂_OLS).   (3.16)

Plugging in σ̂ε² and σ̂α² into the definition of Ωi yields Ω̂i⁻¹. Using Ω̂i⁻¹, we get the feasible GLS estimator

β̂_REFGLS := β̂_RE = (Σ_{i=1}^N Xi′Ω̂i⁻¹Xi)⁻¹ (Σ_{i=1}^N Xi′Ω̂i⁻¹Yi).   (3.17)


As a practical matter, σ̂α² may not be positive. A negative value of σ̂α² indicates negative correlation in uit, probably a substantial amount, which means one of our assumptions is violated.

Under assumptions RE.1, RE.2 and RE.3, β̂ RE is asymptotically equivalent to the infea-

sible GLS estimator β̂ GLS .

3.2.3 Understanding the GLS Estimator

It is possible to obtain a closed-form expression for Ωi⁻¹ (and Ω̂i⁻¹). Recall that

ΩTi = σu² QTi + (σu² + Ti σα²) PTi

for

PTi = JTi/Ti = J̄Ti ∈ RTi×Ti and QTi = ITi − J̄Ti ∈ RTi×Ti.

It is easy to see that both PTi and QTi are projection matrices and PTi QTi = 0. We have

PTi Xi = X̄i,· , QTi Xi = Xi − X̄i,· ,   (3.18)

and

ΩTi^{−1/2} = (1/σu) QTi + (1/√(σu² + Ti σα²)) PTi.

To see the transformation behind ΩTi^{−1/2}, we note that

ΩTi^{−1/2} Xi = (1/σu) QTi Xi + (1/√(σu² + Ti σα²)) PTi Xi
            = (1/σu)(Xi − X̄i,·) + (1/√(σu² + Ti σα²)) X̄i,·
            = (1/σu)[Xi − (1 − σu/√(σu² + Ti σα²)) X̄i,·]
            = (1/σu)(Xi − θi X̄i,·),

where

θi = 1 − σu/√(σu² + Ti σα²).   (3.19)

If we premultiply the regression model Yi = Xi β + εi by σu ΩTi^{−1/2}, we obtain

Yit − θi Ȳi,· = (Xit − θi X̄i,·)β + (εit − θi ε̄i,·).   (3.20)

This is the quasi-demeaned equation, but θi is individual dependent. By definition, the variance of εit − θi ε̄i,· is σu². So, OLS is BLUE if it is based on the above regression model.
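A minimal implementation sketch of the feasible random-effects estimator via this quasi-demeaning, for a balanced panel (the variable names and simulated design are mine; the variance components are estimated from pooled OLS residuals in the spirit of (3.15)–(3.16), ignoring the degrees-of-freedom corrections):

```python
import numpy as np

def random_effects(Y, X):
    """Feasible RE-GLS by quasi-demeaning; Y is (N, T), X is (N, T, k), balanced panel."""
    N, T, k = X.shape
    Xp, Yp = X.reshape(N * T, k), Y.reshape(N * T)

    # Step 1: pooled OLS and crude variance components
    b_ols = np.linalg.solve(Xp.T @ Xp, Xp.T @ Yp)
    res = Y - X @ b_ols                          # (N, T) residuals
    sig_eps2 = np.mean(res ** 2)                 # estimates sigma_alpha^2 + sigma_u^2
    sig_a2 = np.mean([np.mean(res[:, t] * res[:, s])
                      for t in range(T) for s in range(t + 1, T)])  # sigma_alpha^2
    sig_u2 = max(sig_eps2 - sig_a2, 1e-8)

    # Step 2: quasi-demean with theta = 1 - sigma_u / sqrt(sigma_u^2 + T*sigma_alpha^2)
    theta = 1.0 - np.sqrt(sig_u2 / (sig_u2 + T * max(sig_a2, 0.0)))
    Yq = (Y - theta * Y.mean(axis=1, keepdims=True)).reshape(N * T)
    Xq = (X - theta * X.mean(axis=1, keepdims=True)).reshape(N * T, k)

    # Step 3: pooled OLS on the quasi-demeaned data = feasible GLS
    return np.linalg.solve(Xq.T @ Xq, Xq.T @ Yq)

rng = np.random.default_rng(5)
N, T = 500, 5
X = rng.normal(size=(N, T, 2))
Y = X @ np.array([1.0, -0.5]) + rng.normal(size=(N, 1)) + rng.normal(size=(N, T))
print(random_effects(Y, X))                      # approx [1.0, -0.5]
```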

It can be shown that β̂_GLS and β̂_REFGLS are asymptotically equivalent in the sense that

β̂_GLS − β̂_REFGLS = Op(1/√N),

so that the asymptotic distribution of √N(β̂_GLS − β) is the same as that of √N(β̂_REFGLS − β). To show this, we consider the case with a balanced panel for simplicity. It is not hard to show that

√N(β̂_REFGLS − β)
= [ (1/N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(Xi − θ̂X̄i,·) ]⁻¹ [ (1/√N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(εi − θ̂ε̄i,·) ]
= [ (1/N) Σ_{i=1}^N (Xi − θX̄i,·)′(Xi − θX̄i,·) + op(1) ]⁻¹ [ (1/√N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(εi − θ̂ε̄i,·) ]
= [ (1/N) Σ_{i=1}^N (Xi − θX̄i,·)′(Xi − θX̄i,·) ]⁻¹ [ (1/√N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(εi − θ̂ε̄i,·) ] (1 + op(1)).

We now consider

(1/√N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(εi − θ̂ε̄i,·)
= (1/√N) Σ_{i=1}^N (Xi − θX̄i,·)′(εi − θε̄i,·)
  − √N(θ̂ − θ) (1/N) Σ_{i=1}^N X̄i,·′εi − √N(θ̂ − θ) (1/N) Σ_{i=1}^N Xi′ε̄i,· + √N(θ̂² − θ²) (1/N) Σ_{i=1}^N X̄i,·′ε̄i,·.

The above holds by taking a second-order Taylor expansion and noting that (1/√N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(εi − θ̂ε̄i,·) is a quadratic function in θ̂, and so there is no approximation error.

By the LLN, we have

(1/N) Σ_{i=1}^N X̄i,·′εi = op(1), (1/N) Σ_{i=1}^N Xi′ε̄i,· = op(1) and (1/N) Σ_{i=1}^N X̄i,·′ε̄i,· = op(1).

So, if √N(θ − θ̂) = Op(1), we have

(1/√N) Σ_{i=1}^N (Xi − θ̂X̄i,·)′(εi − θ̂ε̄i,·) = (1/√N) Σ_{i=1}^N (Xi − θX̄i,·)′(εi − θε̄i,·) + op(1).

Therefore,

√N(β̂_REFGLS − β)
= [ (1/N) Σ_{i=1}^N (Xi − θX̄i,·)′(Xi − θX̄i,·) ]⁻¹ [ (1/√N) Σ_{i=1}^N (Xi − θX̄i,·)′(εi − θε̄i,·) + op(1) ] (1 + op(1))
= [ (1/N) Σ_{i=1}^N (Xi − θX̄i,·)′(Xi − θX̄i,·) ]⁻¹ (1/√N) Σ_{i=1}^N (Xi − θX̄i,·)′(εi − θε̄i,·) + op(1)
= √N(β̂_GLS − β) + op(1)

as desired.

3.3 Estimation: Fixed-effects Approach

3.3.1 Assumptions

Again consider the linear unobserved effect model:

Yit = Xit β + αi + uit (3.21)

Now we assume that Xit and αi are correlated. In this case, the random-effects estimator is

biased.

Assumption FE.1: E(ui |Xi ) = 0.

• If Xit contains some time-invariant variables, then we cannot identify the effects of these time-invariant variables on Yit. For individuals, factors such as race and gender cannot be included in Xit. For firms, industry cannot be included in Xit.

• For identification, we only require that each element of Xit varies over time for some

cross sectional units. See Assumption FE.2 below.

Xi → Yi ← ui
      ↑
      αi

(with a dashed link between Xi and αi, since they may now be correlated)

3.3.2 Estimation Strategy

The idea is to transform the equation to eliminate the unobserved effect αi. There are several transformations that can be used for this purpose. Recall that we already used “first difference” for a two-period model. Now we consider the fixed-effects transformation, which is also called the within transformation.

• Averaging equation Yit = Xit β + αi + uit over t to get

Ȳi,· = X̄i,· β + αi + ūi,· .   (3.22)

• Subtracting the above equation from Yit = Xit β + αi + uit to get

Yit − Ȳi,· = (Xit − X̄i,·)β + (uit − ūi,·), or   (3.23)

Ÿit = Ẍit β + üit


Can we use the OLS estimator on (3.23)? Note that

E[(uit − ūi,·)(Xit − X̄i,·)] = 0   (3.24)

or the following stronger orthogonality condition:

E(uit − ūi,· | Xis − X̄i,·) = 0 for all t and s.   (3.25)

So the OLS estimator is consistent and unbiased. Note that the above condition may not hold

if we only assume that E(uit |Xit ) = 0.

FE.2 Rank condition: rank{Σ_{i=1}^N E[(Xi − X̄i,·)′(Xi − X̄i,·)]} = k.

The fixed-effects estimator is

β̂_FE = [Σ_{i=1}^N Σ_{t=1}^{Ti} (Xit − X̄i,·)′(Xit − X̄i,·)]⁻¹ Σ_{i=1}^N Σ_{t=1}^{Ti} (Xit − X̄i,·)′(Yit − Ȳi,·).   (3.26)
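A minimal sketch of the within (fixed-effects) estimator for a balanced panel (my own variable names; data are stored as Y with shape (N, T) and X with shape (N, T, k)):

```python
import numpy as np

def fixed_effects(Y, X):
    """Within estimator (3.26) for a balanced panel: Y is (N, T), X is (N, T, k)."""
    N, T, k = X.shape
    Xd = (X - X.mean(axis=1, keepdims=True)).reshape(N * T, k)   # X_it - Xbar_i
    Yd = (Y - Y.mean(axis=1, keepdims=True)).reshape(N * T)      # Y_it - Ybar_i
    return np.linalg.solve(Xd.T @ Xd, Xd.T @ Yd)

# It remains consistent when alpha_i is correlated with X_it (the fixed-effects case):
rng = np.random.default_rng(6)
N, T = 500, 5
alpha = rng.normal(size=(N, 1))
X = 0.8 * alpha[:, :, None] + rng.normal(size=(N, T, 1))         # X correlated with alpha
Y = 2.0 * X[:, :, 0] + alpha + rng.normal(size=(N, T))
print(fixed_effects(Y, X))                                       # approx [2.0]
```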

If we estimate β using OLS based on equation (3.22), then we get the (weighted) between

estimator:

β̂_BE = [Σ_{i=1}^N (1/var(αi + ūi)) X̄i,·′X̄i,·]⁻¹ [Σ_{i=1}^N (1/var(αi + ūi)) X̄i,·′Ȳi,·]   (3.27)
     = [Σ_{i=1}^N Σ_{t=1}^{Ti} (1/σ1i²) X̄i,·′X̄i,·]⁻¹ [Σ_{i=1}^N Σ_{t=1}^{Ti} (1/σ1i²) X̄i,·′Ȳi,·],   (3.28)

where σ1i² = Ti · var(αi + ūi) = Ti · (σα² + σu²/Ti) = Ti σα² + σu².

• β̂_BE is inconsistent under the fixed-effects assumption because X̄i,· and αi are correlated.

• However, it is consistent under the random-effects assumption. It is inefficient because it discards the time series information in the data set.

3.3.3 Asymptotic Inference

We maintain the following assumption:

Assumption FE.3: E(ui ui′|Xi) = σu² ITi.

Let

P = diag(JT1/T1, JT2/T2, ..., JTN/TN)


and Q = I − P, where I = I_{T1+...+TN} is the (Σ_{i=1}^N Ti) × (Σ_{i=1}^N Ti) identity matrix. Then

β̂_FE − β = (X′QX)⁻¹(X′Qu) ∼a N(0, σu² (E X′QX)⁻¹),   (3.29)

where ∼a denotes “is distributed approximately as”. More precisely, we should write

√N(β̂_FE − β) →d N(0, σu² [lim_{N→∞} (1/N) E X′QX]⁻¹).   (3.30)

Now define the fixed-effects residual

û = QY − QX(X′QX)⁻¹X′Q′QY
  = QXβ + Qu − QX(X′QX)⁻¹X′Q′(QXβ + Qu)
  = Qu − QX(X′QX)⁻¹X′Q′Qu
  = (I − QX(X′QX)⁻¹X′Q′)Qu.   (3.31)

Note that

û′û = u′Q(I − QX(X′QX)⁻¹X′Q′)Qu
    = u′Qu − u′QX(X′QX)⁻¹X′Q′u.   (3.32)

So

E(û′û) = E(u′Qu) − E[u′QX(X′QX)⁻¹X′Q′u]
       = E tr(Q u u′) − E tr[QX(X′QX)⁻¹X′Q′ u u′]
       = Σ_{i=1}^N (Ti − 1) σu² − tr[QX(X′QX)⁻¹X′Q′] σu²
       = [Σ_{i=1}^N (Ti − 1) − k] σu².

Thus, an unbiased estimate of σu² is

σ̂u² = SSR / [Σ_{i=1}^N (Ti − 1) − k] = û′û / [Σ_{i=1}^N (Ti − 1) − k].   (3.33)
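Continuing the sketch above, σ̂u² and the conventional variance estimator of β̂_FE under FE.3 follow from the within residuals and the degrees-of-freedom correction in (3.33) (balanced panel, my own variable names):

```python
import numpy as np

def fe_with_se(Y, X):
    """Within estimator plus sigma_u^2 from (3.33) and standard errors based on (3.29)."""
    N, T, k = X.shape
    Xd = (X - X.mean(axis=1, keepdims=True)).reshape(N * T, k)
    Yd = (Y - Y.mean(axis=1, keepdims=True)).reshape(N * T)
    XdXd = Xd.T @ Xd
    beta = np.linalg.solve(XdXd, Xd.T @ Yd)
    resid = Yd - Xd @ beta
    sigma_u2 = resid @ resid / (N * (T - 1) - k)      # SSR / (sum_i (T_i - 1) - k)
    V = sigma_u2 * np.linalg.inv(XdXd)                # approximate var(beta_FE)
    return beta, np.sqrt(np.diag(V))
```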

3.3.4 Dummy Variable Regression

Given that

Yit = Xit β + αi + uit

and there is a correlation between αi and Xit , we need to control for αi in order to obtain

a consistent estimator of β. That is, we need to look at the data individual by individual

and then aggregate the individual information on β to obtain our final estimator. From the

perspective of the modern control function approach, this is exactly what the least squares

dummy variable (LSDV) regression does.


Mathematically, we introduce a set of dummy variables to represent αi. Let D1, . . . , DN be the individual-specific and time-invariant dummy variables:

    Dji = 1 if i = j, and Dji = 0 otherwise.

Then we can write

    αi = D1i × α1 + D2i × α2 + . . . + DNi × αN

for all i = 1, . . . , N. It then follows that

    Yit = Xit β + D1i × α1 + D2i × α2 + . . . + DNi × αN + uit
        = Xit β + Di α + uit
        = (Xit , Di) (β', α')' + uit,

where

    Di = (D1i , D2i , . . . , DNi),  α = (α1 , α2 , . . . , αN)'.    (3.34)

Therefore, the linear causal model reduces to the usual linear statistical model if E(ui | Xi , Di) = 0.

Important difference between αi and β:

• β̂_FE is consistent with fixed Ti as N → ∞.

• α̂i is an unbiased estimator for αi but may not be consistent for αi when Ti is fixed.

• This is an instance of the incidental parameter problem.

3.3.5 Incidental parameter problem

The problem was originally pointed out by Neyman and Scott (Econometrica, 1948).

In our notation, the problem can be described as follows. For Yi = (Yi1 , Yi2), the vectors (Y1 , . . . , Yn) are iid across i. In addition,

    Yit ∼ N(αi , σ²) and Yi1 is independent of Yi2.

We can think of this as a panel data set with two periods and no covariate Xit. In this problem, there are (n + 1) parameters. Neyman and Scott (1948) consider the MLE of {αi} and σ². The likelihood function is

    L(α1 , . . . , αn , σ²) = Π_{i=1}^n (1/(2πσ²)) exp( −(1/(2σ²)) Σ_{t=1}^2 (Yit − αi)² ).


It is easy to show that the MLEs are

    α̂i = (Yi1 + Yi2)/2,  i = 1, 2, . . . , n;
    σ̂² = (1/(2n)) Σ_{i=1}^n Σ_{t=1}^2 (Yit − α̂i)².

Note that α̂i does not converge to αi, and we can show that σ̂² converges in probability to σ²/2 as n → ∞. To show the latter fact, we note that

    σ̂² = (1/(4n)) Σ_{i=1}^n (Yi1 − Yi2)² →p (1/4) E(Yi1 − Yi2)² = σ²/2

by the WLLN (weak law of large numbers).

When the number of parameters grows with the sample size, the usual argument for consistency may not work any more. In the model of Neyman and Scott (1948), the αi's are the incidental parameters, because they are deemed to be of secondary importance. Depending on the problem under consideration, the αi's can be of primary importance. Nowadays, we use "incidental parameter problem" to refer to any problem in which the number of parameters grows with the sample size.

How about the clustered robust variance (see the section on Robust Variance Estimator)

    σ̃² = (1/n) Σ_{i=1}^n [ Σ_{t=1}^2 (Yit − α̂i) ]² ?
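A quick simulation makes the Neyman–Scott result concrete: with T = 2 the MLE σ̂² concentrates around σ²/2 rather than σ², no matter how large n is. The sketch below is only a numerical check; the chosen values of σ² and the distribution of the αi are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 100_000, 4.0
alpha = rng.normal(size=n)                       # incidental parameters alpha_i
Y = alpha[:, None] + np.sqrt(sigma2) * rng.normal(size=(n, 2))

alpha_hat = Y.mean(axis=1)                       # MLE of each alpha_i
sigma2_mle = ((Y - alpha_hat[:, None]) ** 2).sum() / (2 * n)
print(sigma2_mle)                                # close to sigma2 / 2 = 2.0, not 4.0
```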

3.4 Estimation: First Differencing Approach

Lagging

    Yit = Xit β + αi + uit    (3.35)

one period and subtracting yields

    ∆Yit = ∆Xit β + ∆uit.    (3.36)

The first-difference (FD) estimator is the pooled OLS estimator for the above regression:

    β̂_FD = [ Σ_{i=1}^N Σ_{t=2}^{Ti} ∆Xit' ∆Xit ]^{-1} [ Σ_{i=1}^N Σ_{t=2}^{Ti} ∆Xit' ∆Yit ].

Assumption FD.1: Same as Assumption FE.1.

Under the above assumption, we have E(∆uit | ∆Xi2 , ∆Xi3 , . . . , ∆XiT) = 0. So β̂_FD is consistent and unbiased. Note that

    ∆Xit' ∆uit = (Xit − Xit−1)'(uit − uit−1)
               = Xit' uit − Xit−1' uit − Xit' uit−1 + Xit−1' uit−1.    (3.37)


So E ∆Xit' ∆uit may not equal zero if uit is correlated with Xit−1, Xit or Xit+1.

Assumption FD.2: rank( Σ_{i=1}^N Σ_{t=2}^{Ti} E ∆Xit' ∆Xit ) = k.

• A computational warning: if you stack the data, the "differences" formed across different individuals should be discarded.

• The FD estimator is less efficient than the FE estimator under the FE assumptions.

Assumption FD.3: E ei ei' := E(∆ui ∆ui' | Xi) = σ²_e I_{Ti−1}.

• Under Assumptions FD.1–FD.3, β̂_FD is the most efficient estimator.

• Avar̂(β̂_FD) = σ̂²_e [ Σ_{i=1}^N Σ_{t=2}^{Ti} ∆Xit' ∆Xit ]^{-1}, where σ̂²_e = Σ_{i=1}^N Σ_{t=2}^{Ti} ê²_it / ( Σ_{i=1}^N (Ti − 1) − k ).

• êit = ∆Yit − ∆Xit β̂_FD.

What if Assumption FE.3, E(ui ui' | Xi) = σ²_u I_{Ti}, holds instead? In this case, β̂_FD is less efficient than β̂_FE because var(∆ui2 , ∆ui3 , . . . , ∆uiTi) is not a diagonal matrix.

Let Di be the (Ti − 1) × Ti first-differencing matrix

    Di = [ −1  1
               −1  1
                   . .
                     −1  1 ],

so that

    ∆Xi = Di Xi,  ∆Yi = Di Yi,  ∆ui = Di ui.

The variance of ∆ui is then σ²_u Di Di'. The GLS estimator based on the first-differenced model

    Di Yi = Di Xi β + Di ui

is

    β̂_FD,GLS = [ Σ_{i=1}^N Xi' Di' (Di Di')^{-1} Di Xi ]^{-1} [ Σ_{i=1}^N Xi' Di' (Di Di')^{-1} Di Yi ].

Note that Di' (Di Di')^{-1} Di is a projection matrix projecting onto the row space of Di. Since Di ℓ_{Ti} = 0, ℓ_{Ti} = (1, . . . , 1)' is orthogonal to the row space of Di. So, projecting onto the row space of Di is the same as projecting onto the space orthogonal to ℓ_{Ti}. Therefore

    Di' (Di Di')^{-1} Di = I_{Ti} − ℓ_{Ti} (ℓ_{Ti}' ℓ_{Ti})^{-1} ℓ_{Ti}'.

As a consequence

    β̂_FD,GLS = β̂_FE.


It is worth pointing out that β̂_FD,GLS is the OLS estimator based on the transformed model

    (Di Di')^{-1/2} Di Yi = (Di Di')^{-1/2} Di Xi β + (Di Di')^{-1/2} Di ui.

A natural question is: what is u*_i = (Di Di')^{-1/2} Di ui for a given vector ui = (ui,1 , . . . , ui,Ti)'? Note that (Di Di')^{-1/2} is not unique. If we use the Cholesky decomposition Di Di' = Ri' Ri for some upper triangular Ri and take

    (Di Di')^{-1/2} = Ri^{-1},

which is still upper triangular, then some algebra shows that we can take

    u*_{it} = c_{it} [ u_{it} − (u_{i,t+1} + . . . + u_{i,Ti}) / (Ti − t) ],

where

    c²_{it} = (Ti − t) / (Ti − t + 1).

We refer to this transformation as the forward orthogonal transformation. Thus, if var(ui) = σ²_u I_{Ti}, then var(u*_i) = σ²_u I_{Ti−1}. Therefore, the forward transformation can be regarded as an alternative transformation which, in common with first differencing, eliminates the individual effects, but in contrast does not introduce serial correlation in the transformed errors. The forward transformation turns out to be very useful in dynamic models.
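The key property of the forward orthogonal deviations, var(u*_i) = σ²_u I_{Ti−1}, is easy to check numerically. The sketch below applies the formula for u*_{it} to i.i.d. error vectors and verifies that the transformed errors have (approximately) an identity covariance matrix; it is only a numerical check with made-up values of T.

```python
import numpy as np

def forward_orthogonal(u):
    """Forward orthogonal deviations of a T-vector; returns a (T-1)-vector."""
    T = len(u)
    out = np.empty(T - 1)
    for t in range(T - 1):                      # t = 0,...,T-2 here (1,...,T-1 in the notes)
        remaining = T - (t + 1)                 # number of remaining future observations
        c = np.sqrt(remaining / (remaining + 1.0))
        out[t] = c * (u[t] - u[t + 1:].mean())
    return out

rng = np.random.default_rng(2)
T, reps = 5, 50_000
U = rng.normal(size=(reps, T))                  # var(u_i) = I_T
Ustar = np.array([forward_orthogonal(u) for u in U])
print(np.round(np.cov(Ustar.T), 2))             # approximately the identity matrix
```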

3.5 Comparison: FE and FD Estimators

• They are identical for a balanced data set with T = 2 for all individuals. In that case, the FD model and the FE model are numerically equivalent, since Yi2 − Yi1 = 2 (Yi2 − (Yi1 + Yi2)/2).

• When T > 2, the choice between FD and FE hinges on the assumption on {uit}.

• The FD estimator and FE estimator will have different probability limits when the strict exogeneity assumption is violated.

• Correlation between uit and Xis leads to inconsistent FD and FE estimators.

3.6 Comparison: RE and FE Estimators

Noting that, for σ²_{1i} = σ²_u + Ti σ²_α,

    Ωi^{-1/2} Xi = (1/σ_u) Q_{Ti} Xi + (1/σ_{1i}) P_{Ti} Xi,

we have

    Xi' Ωi^{-1} Xi = [ (1/σ_u) Q_{Ti} Xi + (1/σ_{1i}) P_{Ti} Xi ]' [ (1/σ_u) Q_{Ti} Xi + (1/σ_{1i}) P_{Ti} Xi ]
                   = (1/σ²_u) Xi' Q_{Ti} Xi + (1/σ²_{1i}) Xi' P_{Ti} Xi.


So,

    β̂_RE = [ Σ_{i=1}^N ( (1/σ²_u) Xi' Q_{Ti} Xi + (1/σ²_{1i}) Xi' P_{Ti} Xi ) ]^{-1} [ Σ_{i=1}^N ( (1/σ²_u) Xi' Q_{Ti} Yi + (1/σ²_{1i}) Xi' P_{Ti} Yi ) ].    (3.38)

Therefore

    β̂_RE = W1 β̂_between + (I − W1) β̂_within,    (3.39)

where

    W1 = [ Σ_{i=1}^N ( (1/σ²_u) Xi' Q_{Ti} Xi + (1/σ²_{1i}) Xi' P_{Ti} Xi ) ]^{-1} [ Σ_{i=1}^N (1/σ²_{1i}) Xi' P_{Ti} Xi ].

• Recall σ²_{1i} = Ti σ²_α + σ²_u. If σ²_α = 0, then σ²_{1i} = σ²_u. So

    β̂_RE = ( Σ_{i=1}^N Xi' Xi )^{-1} Σ_{i=1}^N Xi' Yi = β̂_POLS.

• If min Ti → ∞, then σ_u/σ_{1i} → 0. As a result, β̂_RE → β̂_within = β̂_FE ("within variation" dominates "between variation").

• If σ²_α → ∞, β̂_RE → β̂_within = β̂_FE. The larger σ²_α is, the closer β̂_RE is to β̂_FE.

• Under Assumption RE.3,

    asymvar(β̂_RE) = [ Σ_{i=1}^N ( (1/σ²_u) Xi' Q_{Ti} Xi + (1/σ²_{1i}) Xi' P_{Ti} Xi ) ]^{-1}

and

    asymvar(β̂_within) = [ (1/σ²_u) Σ_{i=1}^N Xi' Q_{Ti} Xi ]^{-1}.

Hence asymvar(β̂_RE) ≤ asymvar(β̂_within).

3.7 Robust Variance Estimator

The panel estimators in the previous sections can be obtained by OLS estimation of β in the pooled regression

    Ỹit = X̃it β + ε̃it,

where Ỹit, X̃it are the demeaned or quasi-demeaned versions of Yit and Xit for the FE and RE estimators. For the FD estimator, Ỹit, X̃it are the first differences of Yit and Xit.

All the panel estimators can be written in the form

    β̃ − β = [ Σ_{i=1}^N Σ_{t=1}^{Ti} X̃it' X̃it ]^{-1} Σ_{i=1}^N Σ_{t=1}^{Ti} X̃it' ε̃it
           = [ Σ_{i=1}^N Σ_{t=1}^{Ti} X̃it' X̃it ]^{-1} Σ_{i=1}^N vi,


where

    vi = Σ_{t=1}^{Ti} X̃it' ε̃it.

To derive the asymptotic variance that is robust to heteroscedasticity and autocorrelation, we note that, under some mild conditions,

    plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=1}^{Ti} X̃it' X̃it = S_XX

for some matrix S_XX. Next, under cross-sectional independence,

    (1/√N) Σ_{i=1}^N vi →d N(0, S_Xε),

where

    S_Xε = plim_{N→∞} (1/N) Σ_{i=1}^N E(vi vi').

As a consequence,

    asymVar( √N (β̃ − β) ) = S_XX^{-1} S_Xε S_XX^{-1},    (3.40)

where S_XX and S_Xε can be consistently estimated by

    Ŝ_XX = (1/N) Σ_{i=1}^N Σ_{t=1}^{Ti} X̃it' X̃it,
    Ŝ_Xε = (1/N) Σ_{i=1}^N Σ_{t=1}^{Ti} Σ_{s=1}^{Ti} X̃it' ε̃^est_it ε̃^est_is X̃is = (1/N) Σ_{i=1}^N X̃i' ε̃^est_i (ε̃^est_i)' X̃i,

where ε̃^est_it is the estimated residual.

• The above estimator is called the clustered (cluster-robust) standard error.

• A large N is needed for this to be a good variance estimator.

• The above variance formula is valid regardless of heteroscedasticity and autocorrelation.

• The validity of the above formula depends crucially on the cross-sectional independence assumption.
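The sandwich formula above is only a few lines of code: after pooled OLS on the transformed data, the "meat" sums X̃i' ε̂i ε̂i' X̃i over individuals. The sketch below is a minimal NumPy version; the names Xt, yt, ids are placeholders for the transformed regressors, transformed outcome, and individual identifiers, and the usage comment refers to the hypothetical variables from the earlier within-estimator sketch.

```python
import numpy as np

def cluster_robust_vcov(Xt, yt, ids):
    """Pooled OLS on transformed data with a cluster-robust (by id) variance."""
    XtX = Xt.T @ Xt
    beta = np.linalg.solve(XtX, Xt.T @ yt)
    resid = yt - Xt @ beta
    k = Xt.shape[1]
    meat = np.zeros((k, k))
    for i in np.unique(ids):
        Xi, ei = Xt[ids == i], resid[ids == i]
        s = Xi.T @ ei                          # k-vector: X_i' eps_i for cluster i
        meat += np.outer(s, s)
    bread = np.linalg.inv(XtX)
    return beta, bread @ meat @ bread          # sandwich variance of beta-hat

# usage (with within-transformed Xw, yw and the id vector from the earlier sketch):
# beta_fe, V = cluster_robust_vcov(Xw, yw, ids); se = np.sqrt(np.diag(V))
```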


3.8 Hausman-Wu Test

One might expect that the random effects estimator is superior to the fixed effects estimator. After all, it is the GLS estimator; moreover, the previous discussion shows that the fixed effects estimator is a limiting case of the RE estimator, corresponding to situations where the variation in the individual effects is large. However, there is a very strong assumption built into the random effects estimator: αi is orthogonal to the explanatory variables. In this section, we test the null H0: αi and Xit are uncorrelated.

3.8.1 General Principle

Suppose we have two alternative estimators, β̂_I and β̂_II, for a true parameter vector β. Further suppose that if the null hypothesis H0 is correct, both estimators are consistent and asymptotically normal with (approximate) variance-covariance matrices V_I and V_II, and (approximate) matrix of covariances between the two estimators V_I,II. Finally, suppose that if the null hypothesis is false the two estimators converge to different limits: for example, one of them might remain consistent while the other one becomes inconsistent, or both of them might become inconsistent but idiosyncratically so. Then, under H0, the Wu-Hausman quadratic form

    m = (β̂_I − β̂_II)' (V_I + V_II − V_I,II − V_II,I)^− (β̂_I − β̂_II)    (3.41)

converges in distribution to a χ²(k), where k is the rank of the asymptotic variance of β̂_I − β̂_II. In the case that one of the estimators, say β̂_I, is efficient under H0, it follows from the Rao-Blackwell theorem that V_I,II = V_II,I = V_I. Hence, the variance-covariance expression in the middle of m simplifies to V_II − V_I. In the above, A^− denotes a generalized inverse of A.

The intuition behind the Rao-Blackwell result is as follows. Suppose we have two consistent estimators β̂_I and β̂_II, and β̂_I is an efficient estimator. Then the variance var(a β̂_I + (1 − a) β̂_II) is smallest when a = 1. But the FOC for the minimization problem min_a var(a β̂_I + (1 − a) β̂_II) is

    2a var(β̂_I) − 2(1 − a) var(β̂_II) + (2 − 4a) cov(β̂_I , β̂_II) = 0.    (3.42)

Setting a = 1 yields

    var(β̂_I) = cov(β̂_I , β̂_II),    (3.43)

as desired.

3.8.2 The Hausman-Wu Specification Test

Applying this approach to the linear panel data problem, we can use the m statistic based on β̂_RE and β̂_FE to test the null H0: cov(αi , Xit) = 0 for all t.

Under Assumptions RE.1–RE.3 and H0:

• β̂_RE is consistent, asymptotically normal and efficient; β̂_FE is consistent and asymptotically normal.

In contrast, under Assumptions RE.1–RE.3 (barring E(αi | Xi) = E(αi) = 0) and H1:

• β̂_RE is inconsistent while β̂_FE is consistent.


    m = (β̂_RE − β̂_FE)' [ Var(β̂_RE − β̂_FE) ]^{-1} (β̂_RE − β̂_FE).    (3.44)

For the purpose of exposition, suppose X'QX is of full rank, which rules out a constant regressor in Xi. In this case

    β̂_RE − β̂_FE = (X'V^{-1}X)^{-1} X'V^{-1}ε − (X'QX)^{-1} X'Qε.    (3.45)

So, under H0: β̂_RE − β̂_FE ≃ 0 and

    cov(β̂_RE , β̂_FE) = E[ (X'V^{-1}X)^{-1} X'V^{-1} εε' QX (X'QX)^{-1} ]
                       = E[ (X'V^{-1}X)^{-1} X'V^{-1} V QX (X'QX)^{-1} ]
                       = E (X'V^{-1}X)^{-1} = var(β̂_RE).    (3.46)

Thus

    Var(β̂_RE − β̂_FE) = Var(β̂_FE) − Var(β̂_RE)
                       = σ²_u (X'QX)^{-1} − (X'V^{-1}X)^{-1},    (3.47)

which can be estimated by

    σ̂²_u (X'QX)^{-1} − (X'V̂^{-1}X)^{-1}.    (3.48)

Alternatively,

    Var̂(β̂_RE) = σ̂²_u [ Σ_{i=1}^N Σ_{t=1}^{Ti} (Xit − θ̂i X̄i,·)'(Xit − θ̂i X̄i,·) ]^{-1},    (3.49)
    Var̂(β̂_FE) = σ̂²_u [ Σ_{i=1}^N Σ_{t=1}^{Ti} (Xit − X̄i,·)'(Xit − X̄i,·) ]^{-1}.    (3.50)

To ensure the positive definiteness of Var(β̂_RE − β̂_FE), we need to use the same σ̂²_u in all the places.
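Given the two estimates and their variance matrices (computed with the same σ̂²_u, as emphasized above), the m statistic is a one-liner. The following is a minimal sketch; the inputs b_fe, b_re, V_fe, V_re are assumed to have been computed elsewhere.

```python
import numpy as np
from scipy import stats

def hausman_m(b_fe, b_re, V_fe, V_re):
    """Quadratic-form Hausman-Wu statistic m and its chi-square p-value."""
    d = b_fe - b_re
    V = V_fe - V_re                        # valid when the RE estimator is efficient under H0
    m = float(d @ np.linalg.pinv(V) @ d)   # pinv plays the role of the generalized inverse
    k = np.linalg.matrix_rank(V)
    return m, 1 - stats.chi2.cdf(m, df=k)
```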

3.8.3 Caveats

• Strict exogeneity (i.e., E(ui | Xi) = 0) is maintained under both H0 and H1.

• RE.3 may not hold. In this case, we need to implement a robust Hausman test, which

is explained in the next section.

3.8.4 Alternative Implementation

The quadratic form of the Hausman-Wu test does not extend easily to other situations, such

as heteroscedasticity. As pointed out by Hausman (1978), there is a regression-based form of

the test that turns out to be asymptotically equivalent to the quadratic form of the test.


Consider the random effects model for balanced panels:

    Yit* = Xit* β + αi (1 − θ) + uit*,

where

    Yit* = Yit − θ Ȳi,·  and  Xit* = Xit − θ X̄i,·.    (3.51)

Here we have assumed a balanced panel for simplicity. If αi is correlated with {Xit , t = 1, . . . , T}, then we can follow Mundlak (1978) and assume that the linear projection of αi onto {Xit , t = 1, . . . , T} is

    αi = γ0 + X̄i,· γ + ei,    (3.52)

where ei is uncorrelated with Xi. To test whether αi is correlated with {Xit}, we can test H0: γ = 0.

Plugging (3.52) into the quasi-demeaned equation yields

    Yit* = γ0 (1 − θ) + Xit* β + X̄i,· (1 − θ)γ + (uit − θ ūi,·) + (1 − θ) ei.    (3.53)

The key is that the new error term (uit − θūi,· ) + (1 − θ)ei is uncorrelated with the regressors

Xit∗ and X̄i,· . Therefore, a test of H0 : γ = 0 can be done using the standard Wald test on the

variable X̄i,· in an OLS regression that includes both Xit∗ and X̄i,· . Such a test is equivalent to

the quadratic-form Hausman-Wu test. The advantage of the Wald test is that we can easily

take the possible heteroscedasticity into account when computing the asymptotic variance of

γ̂.

Note that the linear projection in (3.52) is indeed an assumption. In general, we have the

linear projection

αi = γ̃ 0 + Xi γ̃ + ẽi , (3.54)

where ẽi is uncorrelated with Xi . In the above more general projection, the covariates consist

of the whole trajectory Xi not merely X̄i,· . The above holds as long as the variances of αi and

Xi are finite. Methods based on (3.54) are often said to implement the Chamberlain device,

after Chamberlain (1982).
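A regression-based version along the lines of (3.53) is easy to implement: quasi-demean Y and X, add the group means X̄i,· as extra regressors, and run a Wald test (possibly with cluster-robust standard errors) on their coefficients. The sketch below only builds the design matrix; θ̂ is taken as given, and the variable names (y, X, ids, theta) are illustrative.

```python
import numpy as np

def mundlak_design(y, X, ids, theta):
    """Quasi-demeaned regression augmented with group means (Mundlak device)."""
    n_ids = ids.max() + 1
    counts = np.bincount(ids).astype(float)
    Xbar = np.zeros((n_ids, X.shape[1]))
    np.add.at(Xbar, ids, X)
    Xbar /= counts[:, None]
    ybar = np.bincount(ids, weights=y) / counts
    y_star = y - theta * ybar[ids]
    X_star = X - theta * Xbar[ids]
    const = np.ones((len(y), 1)) * (1 - theta)
    Z = np.hstack([const, X_star, Xbar[ids]])   # last k columns carry gamma
    return y_star, Z

# H0: the coefficients gamma on the group means are zero; test them with a Wald
# statistic, using the cluster-robust variance from the earlier sketch if desired.
```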

3.8.5 Hausman Test for Endogeneity

Consider the causal model

    Yi = Xi β + εi.

We are concerned that X may be endogenous. Suppose there is an instrumental variable Zi available, which we are confident is independent of εi. We are considering the null hypothesis that X is exogenous. Formally, we want to test

    H0: E Xi εi = 0 against H1: E Xi εi ≠ 0.

To this end, we can estimate the model under the null hypothesis:

    β̂_OLS = (X'X)^{-1} X'Y

with approximate variance

    Var(β̂_OLS) = σ²_ε (X'X)^{-1}.


We can also estimate β by

    β̂_IV = (X̂'X̂)^{-1} X̂'Y

with approximate variance

    Var(β̂_IV) = σ²_ε (X̂'X̂)^{-1},

where

    X̂ = P_Z X = Z (Z'Z)^{-1} Z'X.

If β̂_OLS is asymptotically efficient, then we can construct the Hausman test statistic as follows:

    W = (β̂_OLS − β̂_IV)' Ω̂^− (β̂_OLS − β̂_IV),

where Ω̂^− is the generalized inverse of Ω̂ and

    Ω̂ = σ̂²_ε [ (X̂'X̂)^{-1} − (X'X)^{-1} ].

Under the null hypothesis, W →d χ²_k, where k is the rank of (X̂'X̂)^{-1} − (X'X)^{-1}, which is the same as the column rank of X̂ − X. That is, only regressors not included in the set of instruments are counted in the degrees of freedom. To show this, suppose X1 is a part of Z and X2 is not; then

    X̂ = (X1 , X̂2)

and so

    X̂'X̂ = [ X1'X1   X1'X̂2
             X̂2'X1   X̂2'X̂2 ].

Let

    D = X̂'X̂ − X'X
      = [ X1'X1   X1'X̂2 ]   [ X1'X1   X1'X2 ]   [ 0               X1'(X̂2 − X2)    ]
        [ X̂2'X1   X̂2'X̂2 ] − [ X2'X1   X2'X2 ] = [ (X̂2 − X2)'X1   X̂2'X̂2 − X2'X2 ],

then

    (X̂'X̂)^{-1} − (X'X)^{-1} = (X'X + D)^{-1} − (X'X)^{-1}
                             = − [ I + (X'X)^{-1} D ]^{-1} (X'X)^{-1} D (X'X)^{-1}.

So rank[ (X̂'X̂)^{-1} − (X'X)^{-1} ] ≤ rank(D) = # of X2 = # of regressors not included in the set of instruments. Under additional identification conditions, we have rank[ (X̂'X̂)^{-1} − (X'X)^{-1} ] = rank(D).

To construct a regression-based Hausman test, we can consider the augmented pseudo-regression

    Yi = Xi β + Vi γ + error_i,

where Vi = Xi − X̂i is the first-stage prediction residual. If we ignore the estimation uncertainty in the first-stage regression, then Vi is approximately equal to Xi − Zi π, which is exogenous under the null. So the regression coefficient γ in the pseudo-regression is expected to be zero. On the other hand, when Xi is not exogenous, Vi will be correlated with εi and we expect γ to be different from zero. Therefore, to test the exogeneity of X, we can perform the standard Wald test based on the above pseudo-regression, with possibly robust standard errors. The test can be carried out in Stata using commands similar to

    ivregress 2sls y (x=z1 z2) w1 w2 w3
    estat endogenous

3.9 Differences-in-Differences (DD/DiD) Estimator

3.9.1 Basic Idea

One way to measure the impact of a treatment in the setting of a natural experiment is to use the differences-in-differences (DD or DiD) method. To apply this method, longitudinal or repeated cross-section data are needed, with at least one period before and one period after the program change. The time line can be illustrated as follows:

    Period 1  |  Training period  |  Period 2

In period 1, there is no program for either group. Program participation only occurs between periods one and two.

Let

    Yit : outcome variable
    Progit : program participation dummy variable.

Then the value of Progit is given by:

    Values of Progit      Period 1    Period 2
    Control group            0           0
    Treatment group          0           1

We consider the simple unobserved effect model:

    Yit = αi + λt + progit × β + uit,    (3.55)

where λt is the time effect and αi is the time-invariant unobserved effect. Note that αi is likely to be correlated with progit, as program participation is NOT randomly assigned.

The OLS estimator of β has a very simple representation. It follows from

    Ȳtreat,2 = ᾱtreat + λ2 + β + ūtreat,2    (3.56)


and

    Ȳtreat,1 = ᾱtreat + λ1 + ūtreat,1    (3.57)

that

    Ȳtreat,2 − Ȳtreat,1 = (λ2 − λ1) + β + ūtreat,2 − ūtreat,1.    (3.58)

Similarly, we can show that

    Ȳcontrol,2 − Ȳcontrol,1 = (λ2 − λ1) + ūcontrol,2 − ūcontrol,1.    (3.59)

If E ūtreat,2 − E ūtreat,1 = E ūcontrol,2 − E ūcontrol,1, then

    β = E[ (Ȳtreat,2 − Ȳtreat,1) − (Ȳcontrol,2 − Ȳcontrol,1) ].

The OLS estimator of β is then

    β̂_OLS = (Ȳtreat,2 − Ȳtreat,1) − (Ȳcontrol,2 − Ȳcontrol,1).    (3.60)

Note that E ūtreat,2 − E ūtreat,1 = E ūcontrol,2 − E ūcontrol,1 is an important assumption. It assumes that the changes in the averaged u are not systematically different across the two groups. This is the so-called parallel paths assumption (or common trend assumption) in the DD literature.

The parallel paths assumption can be rearranged into

    E ūtreat,2 − E ūcontrol,2 = E ūtreat,1 − E ūcontrol,1.

This is the so-called constant bias assumption.

We can also illustrate the various effects using the table below:

                            Period 1              Period 2                     Difference
    Treatment (Ȳtreat,t)    ᾱtreat + λ1           ᾱtreat + λ2 + β              (λ2 − λ1) + β
    Control (Ȳcontrol,t)    ᾱcontrol + λ1         ᾱcontrol + λ2                (λ2 − λ1)
    Difference              ᾱtreat − ᾱcontrol     ᾱtreat − ᾱcontrol + β        β  [diff-in-diff]

Note that in the above table the individual effects matter only via ᾱtreat and ᾱcontrol, the group averages for the treatment and control groups. The DD estimator would be the same if we had the following model

    Yit = α̃g(i) + λt + progit × β + eit,    (3.61)

where g(i) ∈ {treatment, control} indicates individual i's group. Now if we introduce the group dummy

    Gi = 1 if individual i is in the treatment group, and Gi = 0 otherwise,


and the time dummy

    Bt = 1 if t = 2, and Bt = 0 otherwise,

we can write

    Yit = α̃control + (α̃treat − α̃control) × Gi + λ1 + (λ2 − λ1) Bt + progit × β + eit
        = c0 + Gi × cG + Bt × cB + progit × β + eit    (3.62)

for some constants c0, cG and cB. Now it is easy to see that progit = Gi × Bt, and so

    Yit = c0 + Gi × cG + Bt × cB + (Gi × Bt) × β + eit.    (3.63)

This is a common formulation in empirical studies. The OLS estimator of the coefficient on the interaction dummy is the DD estimator.

For the above formulation, we don't actually need panel data. It is enough to have repeated cross-section data. The difference between panel data and repeated cross-section data is that for panel data the same individuals appear in both periods, while for repeated cross-section data the same individuals may not appear in both periods. For repeated cross-section data, we can write the model as

    Yi = c0 + Gi × cG + Bt(i) × cB + (Gi × Bt(i)) × β + ei,    (3.64)

where t(i) indicates the period that individual i's observation belongs to. As an example, consider the case with two states, CA and NY. There is a policy change in CA only. In each time period, we have observations for some individuals from each of the two states. For a given state, the groups of individuals do not have to be the same across the two periods. In this example, Gi is the state dummy.
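Because the regression in (3.63) is saturated in the 2 × 2 design, the OLS coefficient on the interaction Gi × Bt equals the difference-in-differences of the four group-period means exactly. The sketch below verifies this numerically on simulated repeated cross-section data; all numbers and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
G = rng.integers(0, 2, size=n)          # group dummy (1 = treatment)
B = rng.integers(0, 2, size=n)          # period dummy (1 = period 2)
beta_true = 0.5
y = 1.0 + 0.3 * G + 0.2 * B + beta_true * G * B + rng.normal(size=n)

# OLS with group, period, and interaction dummies
X = np.column_stack([np.ones(n), G, B, G * B])
coef = np.linalg.lstsq(X, y, rcond=None)[0]

# difference-in-differences of the four cell means
dd = (y[(G == 1) & (B == 1)].mean() - y[(G == 1) & (B == 0)].mean()) \
   - (y[(G == 0) & (B == 1)].mean() - y[(G == 0) & (B == 0)].mean())
print(coef[3], dd)                      # numerically identical
```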

3.9.2 Some Extensions

What if E ūtreat,2 − E ūtreat,1 = E ūcontrol,2 − E ūcontrol,1 does not hold but we have

    E(ūtreat,2 − ūtreat,1 | X = x) = E(ūcontrol,2 − ūcontrol,1 | X = x)?

In this case, we have to control for X; that is, we can do a DiD analysis conditional on each value of X.

In the panel data case, we may add the time-varying variables Xit as a regression control:

    Yit = αi + λt + progit × β + Xit γ + uit.    (3.65)

Again, let's assume that program participation only occurs in the second period. Then

    Yi2 = αi + λ2 + progi2 × β + Xi2 γ + ui2    (3.66)

and

    Yi1 = αi + λ1 + Xi1 γ + ui1.    (3.67)


So

    ∆Yi = λ2 − λ1 + progi2 × β + (∆Xi) γ + ∆ui.    (3.68)

We can first construct the adjusted difference ∆Ỹi = ∆Yi − (∆Xi) γ̂ and then estimate β by

    β̂ = ∆Ỹ_treat − ∆Ỹ_control.    (3.69)

A more general formulation that accommodates more than two periods and other time-varying variables is:

    Yit = c0 + Gi × cG + Bt × cB + (Gi × Bt) × β + Xit γ + eit.    (3.70)

This can be estimated by the least squares dummy variable regression. When computing the standard errors, make sure to use the cluster-robust standard errors.

A similar idea applies to repeated cross-section data, leading to

    Yi = c0 + Gi × cG + Bt(i) × cB + (Gi × Bt(i)) × β + Xi,t(i) γ + ei.    (3.71)

3.9.3 DiD in the Counterfactual Framework

New notation:

    Yit(1) : the value of the outcome variable at period t for individual i had the individual participated in the program in the training period.
    Yit(0) : the value of the outcome variable at period t for individual i had the individual NOT participated in the program in the training period.

Observations Yit:

    Yit = Di Yit(1) + (1 − Di) Yit(0).

The causal effect:

    τ2 = E[ Yi2(1) − Yi2(0) | Di = 1 ]

or

    τ2(x) = E[ Yi2(1) − Yi2(0) | Di = 1, Xi = x ].

Note that

    τ2(x) = E[ Yi2(1) | Di = 1, Xi = x ] − E[ Yi2(0) | Di = 1, Xi = x ]
          = E[ Yi2 | Di = 1, Xi = x ] − E[ Yi2(0) | Di = 1, Xi = x ],

where the first term is identified from the data. The challenge is to identify the second term E[ Yi2(0) | Di = 1, Xi = x ]. If the following common trend assumption

    E[ Yi2(0) | Di = 1, Xi = x ] − E[ Yi1(0) | Di = 1, Xi = x ]
    = E[ Yi2(0) | Di = 0, Xi = x ] − E[ Yi1(0) | Di = 0, Xi = x ]

holds, then

    E[ Yi2(0) | Di = 1, Xi = x ]
    = E[ Yi2(0) | Di = 0, Xi = x ] − E[ Yi1(0) | Di = 0, Xi = x ] + E[ Yi1(0) | Di = 1, Xi = x ]
    = E[ Yi2 | Di = 0, Xi = x ] − E[ Yi1 | Di = 0, Xi = x ] + E[ Yi1(0) | Di = 1, Xi = x ],

where the first two terms on the last line are identified. Assume further that

    E[ Yi1(0) | Di = 1, Xi = x ] = E[ Yi1(1) | Di = 1, Xi = x ] = E[ Yi1 | Di = 1, Xi = x ],

or equivalently τ1(x) = 0 for

    τ1(x) := E[ Yi1(1) | Di = 1, Xi = x ] − E[ Yi1(0) | Di = 1, Xi = x ].

Then

    E[ Yi2(0) | Di = 1, Xi = x ]
    = E[ Yi2 | Di = 0, Xi = x ] − E[ Yi1 | Di = 0, Xi = x ] + E[ Yi1 | Di = 1, Xi = x ],

where every term on the right-hand side is identified. Therefore,

    τ2(x) = { E[ Yi2 | Di = 1, Xi = x ] − E[ Yi1 | Di = 1, Xi = x ] }
          − { E[ Yi2 | Di = 0, Xi = x ] − E[ Yi1 | Di = 0, Xi = x ] },

and both terms in braces are identified from the data.

3.9.4 Examples and Applications

Example 6 Consider a simple example to illustrate the basic philosophy behind the differences-in-differences approach. Suppose we are evaluating a program whose purpose is to increase employment. We have a group that participates in the program (treatment group) and a comparison group (control group) of non-participants. We also have data on the outcome measure for the participants and the comparison group in the time prior to and after the program. The data are summarized in the table below. The number in each cell is the average employment rate for each group.

                        Before the program    After the program
    Treatment group          14.7%                 17.6%            (3.72)
    Control group            16.7%                 18.4%

Graphically, the four group-period averages can be plotted against time (t = 1, 2): the control group rises from 16.7% to 18.4%, the treatment group rises from 14.7% to 17.6%, and a dashed line starting from 14.7% and parallel to the control group's path signifies the counterfactual outcome had the treatment group not been exposed to the training program.

Let us consider di§erent ways to evaluate this program based on the data presented in this

table.

Method 1: Suppose that we look simply at the employment rate for participants after

the program and compare that to the employment rate for the comparison group after the

program. If we do this, we must conclude that the program actually reduces employment since

17.6% − 18.4% = −0.8%. Obviously this is a very unsatisfying result since, just by looking at

the table, we can see that it neglects to take into account the fact that participants started off at a much lower level than the comparison group.

Method 2: Another approach to evaluating this program is to conduct a pre-post evalu-

ation. That is, we can look at program participants before and after the program. By doing

this, we see a very strong result of the program: 17.6%−14.7% = 2.9%. Yet this answer is also

open to criticism. By looking at the table, we see that the comparison group also improved

between the before and after time periods. This leads us to wonder if there is some external

force acting on everyone — both the control group and treatment group — that leads to higher

employment rates. If that is the case, then some portion of the improvement for participants

may be due to this external force rather than the program itself. For example, if the overall

employment rate has been rising, then both the treatment and control group members would

see an increase in employment.

Difference-in-Differences: An alternative approach to these two evaluation methods, one that takes into account all of the information in the table above, is the "difference-in-differences" approach. First, compute the difference in employment for the treatment group before and after the program:

    17.6% − 14.7% = 2.9%.

Second, compute the difference in employment for the control group before and after the program:

    18.4% − 16.7% = 1.7%.

Now, compute the difference between these two differences:

    2.9% − 1.7% = 1.2%.

By subtracting off the 1.7%, we are removing the increased employment that would have occurred anyway (the benefit of an improving economy, for example), leaving us with an estimate of the increased employment due just to the program itself.

Many papers use the DD approach to identify and estimate the causal effect of a policy change. For example, Eissa and Liebman (1996) want to estimate the effect of the earned income tax credit (EITC) on the labor supply of women. The EITC is a subsidy that goes mostly to low-income women who have children. Eissa and Liebman evaluate the effect of the EITC from the Tax Reform Act of 1986, at which time only people with children were eligible. Their treatment group consists of single women with kids, and their control group consists of single women without kids. They compare the variable of interest before and after the EITC.

As a second example, Richardson and Troost (2009) study whether the Federal Reserve would have been able to mitigate the banking crisis that preceded the Great Depression. In order to overcome many obstacles to answering this question, Richardson and Troost (2009) find a group of banks within an economically similar environment which were subject to the same state regulations but influenced by different monetary policies. Banks in Mississippi fit the bill. In 1913, the state was split evenly into two Federal Reserve districts (district borders were determined by the population size in 1913 at the birth of the Federal Reserve System). The top half of the state was placed in the Eighth District, presided over by the St. Louis Federal Reserve Bank. The lower half was part of the Sixth District, which was the domain of the Atlanta Fed. While the Atlanta Fed acted as a lender of last resort and provided credit to troubled institutions, the St. Louis Fed allowed the supply of credit to contract as the economy contracted, because less credit is demanded during times of weak economic activity.

    Number of banks open for business
                                          1930 (Before the Crisis)    1931 (After the Crisis)
    Sixth District (Treatment group)              135                          121
    Eighth District (Control group)               165                          132


The textbook by Stock and Watson (2002) provides a very nice discussion on program eval-

uation. Although it is a textbook for undergraduate econometrics, graduate students would

also benefit from some chapters such as Ch 11: Experiments and Quasi Experiments. For a

modern treatment and an excellent survey on the DiD in the potential outcomes framework,

see Lechner (2010).

3.10 Problem Set

1. Consider the panel data linear regression model

Yit = Xit β + "it for i = 1, 2, ..., N and t = 1, 2, ..., T (3.73)

where

"it = αi + uit (3.74)

αi ∼ iid(0, σ²_α) and uit ∼ iid(0, σ²_u), independent of each other and among themselves. The

Hausman test is designed to test the null hypothesis

H0 : cov(αi , Xit ) = 0. (3.75)

Hausman suggested computing the difference between the random effects estimator and the fixed effects estimator and constructing the m-statistic

    m = q' var^{-1}(q) q.    (3.76)


where q = β̂_RE − β̂_FE.

(a) Show that

    q = [ Σ_{i=1}^N Xi' Ω_T^{-1} Xi ]^{-1} [ Σ_{i=1}^N Xi' Ω_T^{-1} εi ] − [ Σ_{i=1}^N Xi' Q_T Xi ]^{-1} [ Σ_{i=1}^N Xi' Q_T εi ]    (3.77)

and

    var(q) = σ²_u [ Σ_{i=1}^N Xi' Q_T Xi ]^{-1} − [ Σ_{i=1}^N Xi' Ω_T^{-1} Xi ]^{-1},    (3.78)

where Ω_T = σ²_1 P_T + σ²_u Q_T and σ²_1 = T σ²_α + σ²_u.

(b) Consider

    Yi* = Xi* β + X̃i γ + vi,    (3.79)

where

    Yi* = σ_u Ω_T^{-1/2} Yi,  Yit* = Yit − θ Ȳi,·,
    Xi* = σ_u Ω_T^{-1/2} Xi,  Xit* = Xit − θ X̄i,·,
    X̃i = Q_T Xi,  X̃it = Xit − X̄i,·.    (3.80)

Show that the OLS estimates of β and γ based on (3.79) are

    β̃_OLS = β̂_between,  γ̃_OLS = β̂_within − β̂_between.    (3.81)

(c) Show that under H0, var(γ̃_OLS) = var(β̂_within) + var(β̂_between). Prove that the Wald statistic w = γ̂' (var̂(γ̂))^{-1} γ̂ is identical to m.

2. Use your favorite package to answer this question. A sample Matlab program is posted

on the TED. I encourage you to write your own program before reading the sample program.

The sample program works for a scalar Xit . For vector cases, some modifications are required.

Publish your code and report the URL on the TED. Group study is encouraged but you have

to write your own code and report your own analysis.

Consider the following data generating process:

Yit = Xit β + uit , i = 1, 2, ..., N and t = 1, 2, ..., T (3.82)

where β = 1, Xit ∼ iid N(0, 1) across i and t, and uit | Xi ∼ iid across t with distribution N(0, |Xit|²). uit is independent of ujs for any i ≠ j. Now suppose we use the FE estimator to estimate β:

    β̂ = β + [ Σ_{i=1}^N Σ_{t=1}^T (Xit − X̄i,·)² ]^{-1} [ Σ_{i=1}^N Σ_{t=1}^T (Xit − X̄i,·)(uit − ūi,·) ].    (3.83)


(a) Let N = 500 and T = 5. Simulate the sampling distribution of β̂ using 1000 simulation replications. For each simulated sample, compute the robust standard errors σ̂_β and σ̃_β according to

    σ̂²_β = S_XX^{-2} Σ_{i=1}^N [ Σ_{t=1}^T (Xit − X̄i,·)(ûit − ûi,·) ]²
          = S_XX^{-2} Σ_{i=1}^N Σ_{t=1}^T Σ_{s=1}^T (Xit − X̄i,·)(ûit − ûi,·)(ûis − ûi,·)(Xis − X̄i,·)    (3.84)

and

    σ̃²_β = S_XX^{-2} Σ_{i=1}^N Σ_{t=1}^T (Xit − X̄i,·)² (ûit − ûi,·)²,    (3.85)

where

    S_XX = Σ_{i=1}^N Σ_{t=1}^T (Xit − X̄i,·)²

and ûit is the estimated residual.

(b) Compute the standard deviation sd(β̂) of the finite sample distribution of β̂.

(c) Compute the bias, std and rmse (root mean squared error) of σ̂ β and σ̃ β . According to

the rmse criterion, which estimator is better?

(d) Repeat (a)–(c) for T = 10, 20. Does the relative rmse advantage of the two estimators remain valid for different values of T?

(e) Can you provide a theoretical explanation of your findings?

(f) Finally, do you think your results will change if you change the value of β, say let

β = 314.15926?

3. The debate regarding crime and guns is of course long running. The book ‘More Guns,

Less Crime: Understanding Crime and Gun Control Laws’ by Lott (American Enterprise

Institute) loudly made the claim that ‘shall’ laws reduce crime based on correlation analysis.

In this question, we will evaluate the claim and see whether we can shoot down the ‘More

Guns, Less Crime’ hypothesis (Ayres and Donohue III in the Stanford Law Review (2003)).

The book received 4.5 out of five stars at Amazon.com and there are 175 customer reviews.

Everybody has something to say about this issue. Let’s see what we can conclude from

econometric analysis.

The questions are based on the dataset handguns.dta which you can download from the

Ted. The data consists of data from 50 States plus DC for each year from 1977 to 1999. The

data we will be analyzing are crimes rates for various crime definitions provided by the Bureau

of Justice Statistics. The variables are described in the STATA data set. The main regressor

we will be focussing on is a dummy variable for whether or not the state allows widespread

carrying of concealed weapons. The variable shall is one for states which have ‘shall issue’

laws, which means that licenses must be given to all applicants that are citizens, mentally

competent and have not been convicted of a felony.

For additional background, you may want to read

http://en.wikipedia.org/wiki/More_Guns,_Less_Crime#Shall_issue_laws


or http://islandia.law.yale.edu/ayers/Ayres_Donohue_article.pdf

Note: you do not need to submit your STATA output. However, please submit your Stata

do file.

I. We will examine the effect of shall on rates of violent crime, murder rates and robberies.

To this end, run regressions of the logs of each of these variables on shall (including an

intercept) with the robust option. Report the results in a table with a column for each

regression and the values and their standard errors in rows. That is, fill in the following table:

    Dependent Var =   ln(vio)    ln(mur)    ln(rob)
    β̂_0               6.13
                       (0.02)     (   )      (   )
    β̂_1 (shall)       -0.443
                       (0.048)    (   )      (   )
    R²                 0.09

(a) What is the effect of 'shall' laws on each of the crime rates? Are the effects large statistically? Explain.

To get started, you can first download the file ‘handguns.dta’ from the course webpage

and then use the following commands in your STATA do file. A do file is a text file that

contains a sequence of STATA commands. If you do not feel like using do files, you can type

the commands in the STATA command window. In this case, omit the delimiter semicolon ‘;’.

clear

clear matrix

#delimit ;

set memory 50m;

set more off;

set matsize 300;

capture log close;

/* ps3 example using handguns data */

cd "D:\Teaching\";

log using shall.log, replace;

use handguns.dta;

desc;

summarize;

gen log_vio=log(vio);

gen log_mur=log(mur);

gen log_rob=log(rob);

/************ Question 1 *******/

reg log_vio shall, r;


reg log_mur shall, r;

reg log_rob shall, r;

For this problem set, Stata may need some help with memory allocation. Because we will

set up lots of dummy variables, we need to allocate memory in a way to do this. So, at the

start of your do file, include the commands: set memory 50m and set matsize 300

II. Now we will control for a number of variables. First, it is well understood that de-

mographic variables play a role. Many have argued socioeconomic variables also play a part.

Most also would at least hope that jail is a deterrent. Run the above regressions but now

add the variables incarc_rate, density, pop, pm1029, and avginc to the regression. Report the

results in a table given below.

(a) What is the effect of the 'shall' laws now?

(b) Is the difference between the results here and in the results from Question (I) large in a practical sense?

    Dependent Var =   ln(vio)    ln(mur)    ln(rob)
    β̂_0                          -0.17
                       (   )      (0.29)     (   )
    β̂_1 (shall)                  -0.309
                       (   )      (0.037)    (   )
    R²                 0.55

Note: incarc_rate, density, pop, pm1029, and avginc should be included in the regression but you do not have to report their coefficients.

III. One omitted variable from the above analysis is differences in laws and law enforcement across states and time. We want to understand how this might affect results, to provide more foundation for the internal validity of the results. Recall the omitted variable bias formula:

    β̂_1 →p β_1 + cov(X1i , ui)/var(X1i).

Stronger laws would hopefully deter crime, especially crimes that are more rational in nature

like robberies, and perhaps violence. In this sense we would expect that stronger laws would

be associated with less crime and hence lower values for ui .

(a) Typically 'shall' laws are pushed using law and order arguments. States with a larger 'law and order' constituency would have stronger laws and would be more likely to have 'shall' laws. What does this suggest about the sign of cov(X1i , ui), where X1i is the dummy variable for 'shall'?

(b) If there is a bias in β̂_1 (the coefficient on shall), which direction is it?

IV. Since we have a panel data set, we are able to control for omitted variables that are

constant over time. We want to run the same regressions (i.e. use the same control variables)


as in QII, but now add state fixed effects. Do this for each of the three dependent variables we have examined, and construct three tables (one for each dependent variable). In each table, report the coefficient on 'shall' along with its standard error, and a test for the inclusion of state effects where they are included.

Each table should look like the following (with the entries added instead of the XX's, of course).

    Dep = ln(violence)          1           2
    Shall                       XX (XX)     XX (XX)
    State Fixed Effects?        No          Yes
    F test for state effects    -           XX

(a) Describe the effect of controlling for state effects on the coefficient estimate for the effect of 'shall' laws on crime.

(b) What does this tell us about omitted variables in the specification without state or time effects?

(c) What is the statistical evidence that state dummy variables should be included?

(d) Do these results suggest that the arguments in QIII are correct?

(e) What types of effects do you think the time fixed effects are capturing?

Stata issues:

The command tab state, generate(statedummy) will take a variable in your data

set called state which has a number for each state and construct dummy variables named

statedummy1 through to the highest number statedummy51 where statedummy1=1

for state equal to 1 and zero otherwise, statedummy2=1 for state equal to 2 and zero

otherwise, etc.

The following code can be used to produce the required table.

tab state, gen(statedummy);

/* column 1 in the table */

reg log_vio shall incarc_rate density pop pm1029 avginc, r;

/* column 2 in the table */

reg log_vio shall incarc_rate density pop pm1029 avginc statedummy*, r;

testparm statedummy*;

/* if you want to compute standard error that is robust to the time series correlation

in uit , you can use the following commands. */

reg log_vio shall incarc_rate density pop pm1029 avginc statedummy*, cluster(state) r;

testparm statedummy*;


Note: testparm provides a useful alternative to test that permits varlist rather than a list

of coefficients (which is often nothing more than a list of variables), allowing use of standard

Stata notation, including the wild card ‘*’.

4.(Estimating Linear Panel Data Models. Not covered in year 2017) For this problem,

please form a group of two students. The group is required to use both Matlab and STATA

to solve the problem. Presumably, one student in the group focuses on the Matlab solution

and the other one focuses on the STATA solution. The group should communicate and share

their programming experiences. Please include names of the group members in your code and

publish the code to TED. That is, post your code and results to a homepage and report the

URL on the TED.

Download the data file ret_edu.xls from the course web site. The panel data are drawn

from years 1976-1982 of the non-Survey of Economic Opportunity portion of the Panel Study

of Income Dynamics (PSID). The individuals in the sample are 595 heads of household between

the ages of 18 and 65 in 1976, who report a positive wage in some private, non-farm employment

for all 7 years. The following variables are included:

(1) WKS = (Weeks worked).

(2) SOUTH = (SOUTH=1, if the individual resides in the South).

(3) SMSA = (SMSA=1, if the individual resides in a standard metropolitan statistical

area).

(4) MS = (MS=1, if the individual is married).

(5) EXP =(Years of full-time work experience).

(6) EXP2 = (EXP^2)

(7) OCC = (OCC=1, if the individual is in a blue-collar occupation).

(8) IND = (IND=1, if the individual works in a manufacturing industry).

(9) UNION = (UNION=1, if the individual’s wage is set by a union contract).

(10) FEM = (FEM=1, if the individual is female).

(11) BLK = (BLK=1, if the individual is black).

(12) ED = (Years of education).

(13) LWAGE = (Logarithm of wage).

Our interest lies in the following wage equation

Yit = Xit β + Zi γ + αi + uit (3.86)

where Y = LW AGE, Xit = (X1it , X2it ), Zi = (Z1i , Z2i ) and

X1 = (W KS, SOU T H, SM SA, M S) (3.87)

X2 = (EXP, EXP 2, OCC, IN D, U N ION ) (3.88)

Z1 = (F EM, BLACK), Z2 = ED (3.89)

X1 and Z1 are assumed to be exogenous so that X1it and Z1i are uncorrelated with αi and

uis , for all t and s, while X2 and Z2 are endogenous because X2it and Z2i are correlated with

αi but not uis (for all s).

(a) Estimate the model using the within estimator β̂_within.

(b) Estimate the model using the GLS estimator (please follow the procedure outlined in the lecture note). The GLS estimator is OLS based on the following equation:

    Yit − θ Ȳi,· = (Xit − θ X̄i,·) β + (1 − θ) Zi γ + εit − θ ε̄i,·,    (3.90)

where the preliminary estimate θ̂ is given in the lecture note.

(c) Compare (a) and (b) through the use of Hausman (1978) specification tests. What can

you conclude from the Hausman test?

(d) Using the following instruments suggested by Hausman and Taylor (1981),

    A_HT = (QX1 , QX2 , P X1 , Z1),    (3.91)

estimate (3.90) by the IV approach. What assumptions do you need to ensure that A_HT contains valid instruments? Compare the estimate with (a) through the use of Hausman tests.

(e) Consider the alternative instrument set suggested by Amemiya and MaCurdy (1986),

    A_AM = (QX1 , QX2 , P X1 , (QX1)* , Z1),    (3.92)

where

    (QX1)* = [ X1,11 − X̄1,1.   X1,12 − X̄1,1.   X1,13 − X̄1,1.   . . .   X1,1T − X̄1,1.
               . . .
               X1,i1 − X̄1,i.   X1,i2 − X̄1,i.   X1,i3 − X̄1,i.   . . .   X1,iT − X̄1,i.
               . . .
               X1,N1 − X̄1,N.   X1,N2 − X̄1,N.   X1,N3 − X̄1,N.   . . .   X1,NT − X̄1,N. ] ⊗ ℓ_T.    (3.93)

Estimate (3.90) using A_AM as instruments. What assumptions do you need to ensure that A_AM contains valid instruments?

(f) As more instruments are added, we expect to have more efficient estimates. Comparing the standard errors in (d) and (e), do you find any noticeable reduction in standard errors?

(g) Discuss your estimation results with an emphasis on the return to education.


Chapter 4

Dynamic Panel Data Models



4.1 Models with Sequentially Exogenous Variables


Consider the linear panel data model below:


    Yi,t = Xi,t β0 + εi,t,  i = 1, . . . , N,  t = 1, . . . , T,
    εi,t = αi + ui,t.    (4.1)


In addition to allowing αi and Xi,t to be arbitrarily correlated, we now allow ui,t to be correlated with future values of Xi,t, i.e., with (Xi,t+1 , Xi,t+2 , . . . , Xi,T). Schematically, each ui,t may feed into the whole future trajectory of the regressors:

    Xi,1   Xi,2   Xi,3   · · ·   Xi,T
    ui,1   ui,2   ui,3   · · ·   ui,T

with ui,t affecting Xi,s for s > t.
In a dynamic panel data model where Xi,t = Yi,t−1, ui,t is obviously correlated with Yi,t , Yi,t+1 , . . . , Yi,T. This is because we have

    Yi,t = Σ_{s=0}^{t−1} β0^s εi,t−s + β0^t Yi,0 = αi (1 − β0^t)/(1 − β0) + Σ_{s=0}^{t−1} β0^s ui,t−s + β0^t Yi,0,    (4.2)

using

    Yi,t = Yi,t−1 β0 + εi,t.    (4.3)

Note on notation: in the case of dynamic panels, we assume that Yi0 is available.

To identify the model parameters, we introduce the so-called sequential moment restriction (sequential exogeneity):

    E(ui,t | Xi,t , Xi,t−1 , . . . , Xi,1 , αi) = 0,    (4.4)

which implies that

    E(Yi,t | Xi,t , Xi,t−1 , . . . , Xi,1 , αi) = Xi,t β0 + αi.    (4.5)


Example 7 Suppose

    Yi,t = ρ0 Yi,t−1 + Zi,t β0 + εi,t,    (4.6)

then the sequential moment restriction becomes

    E(ui,t | Zi,t , Yi,t−1 , Zi,t−1 , Yi,t−2 , . . . , Zi,1 , Yi,0 , αi) = 0.    (4.7)

So ui,t is correlated with future values of Yi,t−1 and is allowed to be correlated with future values of Zi,t. If

    E(Zi,t ui,s) = 0 for all t and s,    (4.8)

then Zi,t is strictly exogenous, and we have additional moment conditions.

Example 8 Euler Equation:

    E[ (U'_{c,t}(Ct) / U'_{c,t−1}(Ct−1)) Rt | It−1 ] = 1.    (4.9)

Suppose

    Ut = (Ct^{1−γ} − 1)/(1 − γ),  U'_{c,t} = Ct^{−γ}.    (4.10)

Then

    E[ (Ct/Ct−1)^{−γ} Rt | It−1 ] = 1,    (4.11)

or

    E[ Ct^{−γ} Rt | It−1 ] = Ct−1^{−γ}.    (4.12)

So, the moment conditions are

    E[ (Ct^{−γ} Rt − Ct−1^{−γ}) Zt−1 ] = 0,    (4.13)

for any Zt−1 ∈ It−1.

4.2 Properties of FE and FD Estimators under SeqEx

4.2.1 Inconsistency of the FE Estimator

For a balanced panel,

    plim_{N→∞} β̂_FE = β0 + plim_{N→∞} [ Σ_{i=1}^N Σ_{t=1}^T (Xi,t − X̄i,·)'(Xi,t − X̄i,·) ]^{-1} [ Σ_{i=1}^N Σ_{t=1}^T (Xi,t − X̄i,·)'(ui,t − ūi,·) ]
                     = β0 + [ T^{-1} Σ_{t=1}^T E (Xi,t − X̄i,·)'(Xi,t − X̄i,·) ]^{-1} [ T^{-1} Σ_{t=1}^T E (Xi,t − X̄i,·)'(ui,t − ūi,·) ].    (4.14)


Now

    T^{-1} Σ_{t=1}^T E (Xi,t − X̄i,·)'(ui,t − ūi,·)
    = T^{-1} Σ_{t=1}^T E ( Xi,t' ui,t − Xi,t' ūi,· − X̄i,·' ui,t + X̄i,·' ūi,· )
    = −E X̄i,·' ūi,· = −(1/T) · (1/T) Σ_{t=1}^T Σ_{s=1}^T E Xi,t' ui,s.    (4.15)

Here T^{-1} Σ_{t=1}^T Σ_{s=1}^T E Xi,t' ui,s can be regarded as the finite sample version of the long-run covariance between Xi,t and ui,t. It is reasonable to assume that

    (1/T) Σ_{t=1}^T Σ_{s=1}^T E Xi,t' ui,s = O(1).

Under this assumption, we have

    plim_{N→∞} β̂_FE = β0 + O(1/T) as T → ∞.    (4.16)

So under sequential exogeneity the FE estimator is inconsistent for fixed T, but its bias shrinks at rate 1/T as T grows.

4.2.2 Inconsistency of the FD Estimator

Let ∆Xi,t = Xi,t − Xi,t−1 and ∆ui,t = ui,t − ui,t−1. Then

    plim_{N→∞} β̂_FD = β0 + plim_{N→∞} [ Σ_{i=1}^N Σ_{t=2}^T ∆Xi,t' ∆Xi,t ]^{-1} [ Σ_{i=1}^N Σ_{t=2}^T ∆Xi,t' ∆ui,t ]
                     = β0 + [ (1/(T−1)) Σ_{t=2}^T E ∆Xi,t' ∆Xi,t ]^{-1} [ (1/(T−1)) Σ_{t=2}^T E ∆Xi,t' ∆ui,t ].    (4.17)

Now

    (1/(T−1)) Σ_{t=2}^T E ∆Xi,t' ∆ui,t = (1/(T−1)) Σ_{t=2}^T E (Xi,t − Xi,t−1)'(ui,t − ui,t−1)
    = (1/(T−1)) Σ_{t=2}^T ( E Xi,t' ui,t − E Xi,t' ui,t−1 − E Xi,t−1' ui,t + E Xi,t−1' ui,t−1 )
    = −(1/(T−1)) Σ_{t=2}^T E Xi,t' ui,t−1.

It is reasonable to assume that each of the summands is O(1) and that

    (1/(T−1)) Σ_{t=2}^T E ∆Xi,t' ∆Xi,t ∈ (0, ∞).    (4.18)


Hence

    plim_{N→∞} β̂_FD = β0 + O(1) as T → ∞,    (4.19)

so the FD estimator is inconsistent and, unlike the FE estimator, its bias does not vanish as T grows.

Example 9 (Nickell (1981)) Dynamic Panel Data Model: Yi,t = Yi,t−1 β0 + αi + ui,t. Assume |β0| < 1 so that Yi,t is (weakly) stationary (we will investigate the consequences of relaxing this condition later on) with

    Yi,0 = Σ_{s=0}^∞ β0^s εi,−s.    (4.20)

Assume ui,t is iid over i and t. The fixed effects (LSDV) estimator is

    β̂ = [ Σ_{i=1}^N Σ_{t=1}^T (Yi,t − Ȳi,·)(Yi,t−1 − Ȳi,−1) ] / [ Σ_{i=1}^N Σ_{t=1}^T (Yi,t−1 − Ȳi,−1)² ]
      = β0 + [ Σ_{i=1}^N Σ_{t=1}^T (ui,t − ūi,·)(Yi,t−1 − Ȳi,−1)/(NT) ] / [ Σ_{i=1}^N Σ_{t=1}^T (Yi,t−1 − Ȳi,−1)²/(NT) ].    (4.21)

The probability limit of the second term of (4.21) gives the asymptotic bias of the LSDV estimator of the autocorrelation coefficient.

Note that

    Yi,t−1 = Σ_{s=0}^∞ β0^s εi,t−1−s = αi/(1 − β0) + Σ_{s=0}^∞ β0^s ui,t−1−s.    (4.22)

As N tends to infinity, the numerator of the second term converges to

    −E Ȳi,−1 ūi,· = −E[ ( αi/(1 − β0) + (1/T) Σ_{t=1}^T Σ_{s=0}^∞ β0^s ui,t−1−s ) ( (1/T) Σ_{τ=1}^T ui,τ ) ]
                  = −(1/T²) Σ_{t=1}^T Σ_{τ=1}^{t−1} β0^{t−1−τ} σ²_u
                  = −(1/T²) · (T − 1 − β0 T + β0^T)/(β0 − 1)² · σ²_u.    (4.23)


Similarly,

    T^{-1} Σ_{t=1}^T E (Yi,t−1 − Ȳi,−1)²
    = T^{-1} Σ_{t=1}^T E [ Σ_{s=0}^∞ β0^s ui,t−1−s − (1/T) Σ_{τ=1}^T Σ_{s=0}^∞ β0^s ui,τ−1−s ]².    (4.24)

Evaluating this expectation for each t (equation (4.25)) and averaging over t, some tedious algebra gives

    T^{-1} Σ_{t=1}^T E (Yi,t−1 − Ȳi,−1)² = −(σ²_u/T²) · (β0² T² + T β0² − 2 β0 T² + T² − T + 2 β0 − 2 β0^{T+1}) / ( (β0 − 1)² (β0² − 1) ).    (4.26)

The asymptotic bias of the fixed effects (LSDV) estimator is therefore the ratio of (4.23) to (4.26):

    (T − 1 − β0 T + β0^T)(β0² − 1) / ( β0² T² + T β0² − 2 β0 T² + T² − T + 2 β0 − 2 β0^{T+1} ).    (4.27)

When T = 2, the bias reduces to −(β0/2 + 1/2). When T = 3, the bias reduces to −(β0 + 1)(β0 + 2)/(2(β0 + 3)). When T is large, the dominating bias term is

    −[ (β0 − 1)(β0² − 1) / (β0² − 2β0 + 1) ] (1/T)(1 + o(1)) = −(β0 + 1)/T.    (4.28)

• For small T and β0 > 0, we can see that the bias is always negative.

• The bias does not tend to zero as β0 tends to zero.

• The smaller T is, the larger the bias.

• When T is large, the right-hand-side variables become asymptotically uncorrelated with the within-transformed error; the bias tends to zero as T tends to infinity.

• Note that we have assumed that Yi,0 = Σ_{s=0}^∞ β0^s εi,−s, so the DGP for Yi,0 is the same as that for any other Yi,t, t > 0. Sometimes we assume instead that Yi,0 is a fixed constant; in that case, the exact expression for the asymptotic bias will be different. (A small simulation illustrating the bias follows.)
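The magnitude of the Nickell bias is easy to verify by simulation: with β0 = 0.5 and T = 5, the within estimator is biased downward by roughly 0.3 even for very large N. The sketch below is purely illustrative (simulated data, arbitrary parameter values).

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, beta0, burn = 2000, 5, 0.5, 50

alpha = rng.normal(size=N)
Y = np.zeros((N, T + burn + 1))
Y[:, 0] = alpha / (1 - beta0)                      # start near the stationary mean
for t in range(1, T + burn + 1):
    Y[:, t] = beta0 * Y[:, t - 1] + alpha + rng.normal(size=N)
Y = Y[:, burn:]                                    # keep Y_{i,0},...,Y_{i,T}

y, ylag = Y[:, 1:], Y[:, :-1]
yd = y - y.mean(axis=1, keepdims=True)             # within transformation
xd = ylag - ylag.mean(axis=1, keepdims=True)
beta_fe = (xd * yd).sum() / (xd ** 2).sum()
print(beta_fe - beta0)                             # about -0.33, not 0, despite large N
```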

Exercise 10 Calculate the asymptotic bias of the FD estimator. Compare it with that of the

FE estimator.
4.3 FD+IV Estimator (Anderson and Hsiao)

Consider the first-differenced model:

    ∆Yi,t = ∆Xi,t β0 + ∆ui,t,  t = 2, . . . , T.    (4.29)

First differencing removes αi, but it creates an endogeneity problem: under sequential exogeneity, ∆ui,t is generally correlated with ∆Xi,t (since ui,t−1 may be correlated with Xi,t). Under the sequential exogeneity assumption

    E(Xi,s ui,t) = 0 for all s = 1, 2, . . . , t,    (4.30)

we have

    E ∆Xi,s' ∆ui,t = 0 for all s = 2, . . . , t − 1.    (4.31)

So, at time t, we can use ∆Xi,t−1 as a potential instrument for ∆Xi,t (Anderson and Hsiao (1982)). The AH estimator is given by

    β̂_AH = [ Σ_{i=1}^N Σ_{t=3}^T ∆Xi,t−1' ∆Xi,t ]^{-1} [ Σ_{i=1}^N Σ_{t=3}^T ∆Xi,t−1' ∆Yi,t ].

The implicit moment condition behind the above AH estimator is

    E Σ_{t=3}^T ∆Xi,t−1' ∆ui,t = 0,

which follows from sequential exogeneity.
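A minimal sketch of the AH estimator, using lagged differences as instruments exactly as in the formula above, is given below. The data-generating process is hypothetical: X is predetermined (it responds to last period's shock), so strict exogeneity fails and plain FD-OLS is biased, while the AH moment condition still holds.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, beta0 = 5000, 6, 1.0

alpha = rng.normal(size=N)
X = np.zeros((N, T + 1))
u = rng.normal(size=(N, T + 1))
Y = np.zeros((N, T + 1))
for t in range(1, T + 1):
    # X is predetermined: it reacts to last period's shock, breaking strict exogeneity
    X[:, t] = 0.5 * X[:, t - 1] + 0.5 * alpha + 0.5 * u[:, t - 1] + rng.normal(size=N)
    Y[:, t] = X[:, t] * beta0 + alpha + u[:, t]

dY, dX = np.diff(Y, axis=1), np.diff(X, axis=1)
# AH: instrument dX_t with dX_{t-1}, pooled over t = 3,...,T
z = dX[:, :-1].ravel()                 # lagged differences (instruments)
x = dX[:, 1:].ravel()                  # current differences (endogenous regressor)
y = dY[:, 1:].ravel()
beta_ah = (z @ y) / (z @ x)
beta_fd = (x @ y) / (x @ x)            # plain FD-OLS for comparison
print(beta_ah, beta_fd)                # AH near 1.0; FD-OLS visibly biased
```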

Different Choices:

• Estimate ∆Yi,t = ∆Xi,t β0 + ∆ui,t by pooled 2SLS using ∆Xi,t−1 as instruments. When T = 3, this reduces to a cross-sectional 2SLS: Xi,2 − Xi,1 is used as the instrument for Xi,3 − Xi,2.

• Rather than use lagged ∆Xi,t as instruments, we can use lagged levels Xi,t as instruments. For example, choosing (Xi,t−1 , Xi,t−2) as instruments is at least as efficient as the procedure that uses ∆Xi,t−1 as instruments. The former also gives k overidentifying restrictions that can be used to test sequential exogeneity. It has been found that the estimator resulting from instrumenting using differences has a singularity point and very large variances over a significant range of parameter values. Instrumenting using levels does not lead to the singularity problem and results in much smaller variances, and so is preferable.

The Problem of Weak Instruments

• When T = 2, we have

    ∆Yi,2 = ∆Xi,2 β0 + ∆ui,2

and we use Xi,1 as the instrument for ∆Xi,2. In this case, β0 may be poorly identified, as the correlation between Xi,1 and ∆Xi,2 may be small.

• Even when T is large, we may still face the weak instrument problem.
Example 11 Let

    Yi,t = ρ0 Yi,t−1 + αi + ui,t,    (4.32)

then the simplest IV estimators are

    β̂_IV,1 = [ Σ_{i=1}^N Σ_{t=3}^T (Yi,t − Yi,t−1)(Yi,t−2 − Yi,t−3) ] / [ Σ_{i=1}^N Σ_{t=3}^T (Yi,t−1 − Yi,t−2)(Yi,t−2 − Yi,t−3) ],    (4.33)

which uses the lagged difference as the instrument, and

    β̂_IV,2 = [ Σ_{i=1}^N Σ_{t=3}^T (Yi,t − Yi,t−1) Yi,t−2 ] / [ Σ_{i=1}^N Σ_{t=3}^T (Yi,t−1 − Yi,t−2) Yi,t−2 ],    (4.34)

which uses the lagged level as the instrument. Consider the case where αi = 0 and ρ0 = 1, so that

    Yi,t = Yi,t−1 + ui,t and hence    (4.35)
    ∆Yi,t = ui,t.    (4.36)

First differencing yields:

    ∆Yi,t = ρ0 ∆Yi,t−1 + ∆ui,t.    (4.37)

Since ∆Yi,t−s = ui,t−s is uncorrelated with ∆Yi,t−1 for s = 2, . . . , t − 2, ∆Yi,t−s is not a relevant instrument for ∆Yi,t−1 in this unit-root case.

4.4 Panel GMM estimator (Arellano and Bond)


4.4.1 The GMM Estimator: Definition


The Anderson-Hsiao instrumental variables estimator may be consistent, but it is not e¢cient

because it does not take into account all the available moment restrictions. Arellano and Bond

(1991) propose a more e¢cient estimator which uses additional moment restrictions.

Consider

∆Yi = ∆Xi β 0 + ∆ui . (4.38)



At t = 2, we have

we save Yi,2 − Yi,1 = (Xi,2 − Xi,1 ) β 0 + ui,2 − ui,1 . (4.39)

asq

We can convincingly argue that Xi,1 is a valid instrument for (Xi,2 − Xi,1 ), since it is likely

to be correlated with (Xi,2 − Xi,1 ), and is not correlated ui,2 − ui,1 . At t = 3, we have


cart Yi,3 − Yi,2 = (Xi,3 − Xi,2 ) β 0 + ui,3 − ui,2 . (4.40)


Here both Xi,1 and Xi,2 are valid instruments: both are likely to be correlated with (Xi,3 − Xi,2 ) , Turgut

and neither is correlated with ui,3 − ui,2 . Proceeding in this manner, we can see that at t, the conditions

o
valid instrument set is Xi,t−1 where


o
Xi,t−1 = (Xi,1 , Xi,2 , . . . , Xi,t−1 ) (4.41)

is a row vector. Define

Z_i = diag( X^o_{i,1}, X^o_{i,2}, . . . , X^o_{i,T−1} ),   (4.42)

a block-diagonal matrix with the row vector X^o_{i,t−1} in block row t − 1, which is a (T − 1) × k(1 + 2 + · · · + (T − 1)) matrix, that is, a (T − 1) × k(T(T − 1)/2) matrix. Then

E Z′_i ∆u_i = 0 because

Z′_i ∆u_i = ( (X^o_{i,1})′ ∆u_{i,2},  (X^o_{i,2})′ ∆u_{i,3},  . . . ,  (X^o_{i,T−1})′ ∆u_{i,T} )′   (4.43)

and X_{i,t} is sequentially exogenous.

The moment condition E Z′_i ∆u_i = 0 is not enough to identify β_0. We need the following rank condition:

Assumption GMM: Rank(E Z′_i ∆X_i) = k, where k is the number of regressors after first differencing.

Note that ∆X_i is a (T − 1) × k matrix and Z′_i ∆X_i is a kT(T − 1)/2 × k matrix. To be more explicit, we can write Z′_i ∆X_i as

Z′_i ∆X_i = diag( (X^o_{i,1})′, (X^o_{i,2})′, . . . , (X^o_{i,T−1})′ ) ( ∆X_{i,2}; ∆X_{i,3}; . . . ; ∆X_{i,T} )
          = ( (X^o_{i,1})′ ∆X_{i,2};  (X^o_{i,2})′ ∆X_{i,3};  . . . ;  (X^o_{i,T−1})′ ∆X_{i,T} ),   (4.44)

where the semicolons denote stacking of the rows of ∆X_i and of the resulting blocks.

Under the assumption E Z′_i ∆u_i = 0 and Rank(E Z′_i ∆X_i) = k, β_0 is the unique vector that solves

E Z′_i (∆Y_i − ∆X_i β) = 0.   (4.45)

To estimate β_0, we may attempt to solve the sample analogue:

(1/N) Σ_{i=1}^N Z′_i (∆Y_i − ∆X_i β) = 0.   (4.46)

In general, we have k(T(T − 1)/2) > k (the only exception is T = 2), so the above equation will not have a solution. Instead, we choose β̂ to make the vector as "small" as possible. Let




WN be a k(T (T − 1)/2) × k(T (T − 1)/2) symmetric and positive semi-definite matrix (with

enough rank), a GMM estimator of β 0 is


β̂_GMM = arg min_β  [ Σ_{i=1}^N Z′_i (∆Y_i − ∆X_i β) ]′ W_N [ Σ_{i=1}^N Z′_i (∆Y_i − ∆X_i β) ].   (4.47)

The solution is

β̂_GMM = [ ( (1/N) Σ_{i=1}^N ∆X′_i Z_i ) W_N ( (1/N) Σ_{i=1}^N Z′_i ∆X_i ) ]^{−1}   (4.48)
         × ( (1/N) Σ_{i=1}^N ∆X′_i Z_i ) W_N ( (1/N) Σ_{i=1}^N Z′_i ∆Y_i ).   (4.49)


Let

Z = ( Z′_1, Z′_2, . . . , Z′_N )′,   ∆X = ( ∆X′_1, ∆X′_2, . . . , ∆X′_N )′,   ∆Y = ( ∆Y′_1, ∆Y′_2, . . . , ∆Y′_N )′.   (4.50)


Then

β̂_GMM = ( ∆X′ Z W_N Z′ ∆X )^{−1} ∆X′ Z W_N Z′ ∆Y.   (4.51)

We assume that W_N →_p W_∞, a positive definite matrix.
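To fix ideas, the following MATLAB sketch implements (4.42) and (4.48)–(4.49) for a scalar regressor (k = 1). It assumes, purely for illustration, that Y and X are N × T data matrices and that a T(T−1)/2 × T(T−1)/2 weighting matrix WN has already been chosen.

    % Build Z_i as in (4.42) and compute the GMM estimator (4.48)-(4.49), k = 1.
    [N, T] = size(Y);
    L = T*(T-1)/2;                                 % number of moment conditions
    SZX = zeros(L, 1); SZY = zeros(L, 1);
    for i = 1:N
        dX = diff(X(i,:))';  dY = diff(Y(i,:))';   % (T-1) x 1: rows t = 2,...,T
        Zi = zeros(T-1, L);  pos = 0;
        for t = 2:T
            Zi(t-1, pos+1:pos+t-1) = X(i, 1:t-1);  % instruments X_{i,1},...,X_{i,t-1}
            pos = pos + t - 1;
        end
        SZX = SZX + Zi' * dX / N;                  % (1/N) sum Z_i' Delta X_i
        SZY = SZY + Zi' * dY / N;                  % (1/N) sum Z_i' Delta Y_i
    end
    beta_gmm = (SZX' * WN * SZX) \ (SZX' * WN * SZY);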



4.4.2 The GMM Estimator: Asymptotics

Under the orthogonality and rank conditions, we can show that β̂_GMM is consistent. Note

β̂_GMM − β_0 = [ ( (1/N) Σ_{i=1}^N ∆X′_i Z_i ) W_N ( (1/N) Σ_{i=1}^N Z′_i ∆X_i ) ]^{−1}
              × ( (1/N) Σ_{i=1}^N ∆X′_i Z_i ) W_N ( (1/N) Σ_{i=1}^N Z′_i ∆u_i ).   (4.52)

But

(1/N) Σ_{i=1}^N Z′_i ∆X_i →_p E Z′_i ∆X_i = C   (4.53)

with rank(C) = k, and

(1/N) Σ_{i=1}^N Z′_i ∆u_i →_p E Z′_i ∆u_i = 0.   (4.54)


Hence

β̂_GMM − β_0 →_p ( C′ W_∞ C )^{−1} C′ W_∞ · 0 = 0,   (4.55)

since rank(C′ W_∞ C) = k.

We can also show that β̂_GMM is asymptotically normal:

√N ( β̂_GMM − β_0 ) ⇒ N(0, V_β),   (4.56)

where

V_β = ( C′ W_∞ C )^{−1} C′ W_∞ Λ W_∞ C ( C′ W_∞ C )^{−1},   (4.57)

and '⇒' signifies convergence in distribution. This follows easily from the fact that

(1/√N) Σ_{i=1}^N Z′_i ∆u_i ⇒ N(0, Λ),   (4.58)

provided that some moment conditions and cross-sectional independence hold.

4.4.3 Selecting the Weighting Matrix

Let

W_N = ( (1/N) Σ_{i=1}^N Z′_i Z_i )^{−1} = ( Z′Z / N )^{−1}.   (4.59)

Then

β̂_GMM = ( ∆X′ Z W_N Z′ ∆X )^{−1} ∆X′ Z W_N Z′ ∆Y
       = [ ∆X′ Z ( Z′Z )^{−1} Z′ ∆X ]^{−1} ∆X′ Z ( Z′Z )^{−1} Z′ ∆Y.   (4.60)

This is the so-called pooled 2SLS estimator.

Let

W_N = ( (1/N) Σ_{i=1}^N Z′_i G Z_i )^{−1},

where G is the (T − 1) × (T − 1) matrix

G = [  2  −1   0  · · ·   0    0
      −1   2  −1  · · ·   0    0
       0  −1   2  · · ·   0    0
       ⋮            ⋱          ⋮
       0   0   0  · · ·   2   −1
       0   0   0  · · ·  −1    2 ],   (4.61)

then we get the pooled 3SLS estimator.
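In code, G is just a tridiagonal matrix; a short MATLAB construction (with T assumed to be defined) is:

    Tm1 = T - 1;                                                   % dimension of G
    G = 2*eye(Tm1) - diag(ones(Tm1-1,1), 1) - diag(ones(Tm1-1,1), -1);

Note that G equals Var(∆u_i) when u_{i,t} is iid with unit variance, which is consistent with the sufficient condition for 3SLS optimality given below.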

Let W_N = Λ̂^{−1} for a consistent estimator Λ̂ of Λ; then √N ( β̂_GMM − β_0 ) is asymptotically normal with variance ( C′ Λ^{−1} C )^{−1}. Λ^{−1} is the optimal weighting matrix in the sense that

( C′ Λ^{−1} C )^{−1} ≤ ( C′ W_∞ C )^{−1} C′ W_∞ Λ W_∞ C ( C′ W_∞ C )^{−1}.   (4.62)


Here for two matrices A and B, A ≤ B signifies the negative semi-definiteness of the difference A − B.

A Feasible GMM Procedure

• Let β̂ be an initial consistent estimator of β_0, for example, β̂ = β̂_2SLS.

• Define ∆ũ_i = ∆Y_i − ∆X_i β̂_2SLS.

• Construct the consistent estimator Λ̂ = N^{−1} Σ_{i=1}^N Z′_i ∆ũ_i ∆ũ′_i Z_i and choose W_N = Λ̂^{−1}.

• Use W_N to construct the GMM estimator (a minimal computational sketch follows).
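A MATLAB sketch of this feasible procedure, continuing the earlier sketch and assuming (hypothetically) that cell arrays Zc{i}, dXc{i}, dYc{i} store Z_i, ∆X_i, ∆Y_i and that beta1 is a preliminary consistent estimate (e.g., the pooled 2SLS estimate):

    % Two-step (feasible) GMM with the optimal weight W_N = Lambda-hat^{-1}.
    Lambda = zeros(L, L);
    for i = 1:N
        du = dYc{i} - dXc{i} * beta1;        % residuals Delta u_i (tilde)
        gi = Zc{i}' * du;                    % Z_i' Delta u_i
        Lambda = Lambda + gi * gi' / N;      % Lambda-hat = N^{-1} sum Z_i' du du' Z_i
    end
    WN2 = inv(Lambda);                       % optimal weighting matrix
    beta2 = (SZX' * WN2 * SZX) \ (SZX' * WN2 * SZY);   % two-step GMM estimate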

W_N is a consistent estimator of Λ^{−1} under general conditions. It is easy to see that the consistency and asymptotic normality of β̂_GMM remain valid with W_N = Λ̂^{−1}.

The 3SLS estimator is asymptotically equivalent to the feasible GMM estimator if

E( Z′_i ∆u_i ∆u′_i Z_i ) = E( Z′_i G Z_i ).   (4.63)

A sufficient condition is

E( ∆u_i ∆u′_i | Z_i ) = G.

4.4.4 Inference Based on the Optimal GMM Estimator

Test the null H0 : Rβ 0 = r

• Wald test

• LR-like test with WN = Λ̂−1 :

Q_r = (1/N) [ Σ_{i=1}^N Z′_i (∆Y_i − ∆X_i β̂_r) ]′ W_N [ Σ_{i=1}^N Z′_i (∆Y_i − ∆X_i β̂_r) ]   (4.64)

Q_ur = (1/N) [ Σ_{i=1}^N Z′_i (∆Y_i − ∆X_i β̂_ur) ]′ W_N [ Σ_{i=1}^N Z′_i (∆Y_i − ∆X_i β̂_ur) ]   (4.65)

Then Q_r − Q_ur ⇒ χ²_q, where q is the number of independent restrictions.

• Testing overidentification: Q_ur ⇒ χ²_{kT(T−1)/2 − k} (a short computational sketch follows).
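Continuing the sketch above (same assumed objects), the overidentification statistic with W_N = Λ̂^{−1} can be computed as follows; the degrees of freedom equal kT(T−1)/2 − k, here with k = 1.

    % Overidentification (J) test evaluated at the two-step estimate beta2.
    gsum = zeros(L, 1);
    for i = 1:N
        gsum = gsum + Zc{i}' * (dYc{i} - dXc{i} * beta2);
    end
    J  = (gsum' * WN2 * gsum) / N;           % = Q_ur in (4.65)
    df = L - 1;                              % k*T*(T-1)/2 - k with k = 1
    % compare J with the chi-square(df) critical value, e.g. chi2inv(0.95, df)
    % (chi2inv requires the Statistics Toolbox).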

Question: What if

Z_i = diag( X_{i,1}, X_{i,2}, . . . , X_{i,T−1} )?   (4.66)


4.5 Models with Other Types of Indep. Variables

In the previous sections, we maintained the assumption that X_{i,t} are sequentially exogenous and are correlated with α_i. Now suppose that there are other types of independent variables, denoted W_{i,t}, in the regression. We discuss a few cases here.

4.5.1 Strictly Exogenous Variables that Are Correlated with αi

Consider

Y_{i,t} = X_{i,t} β_0 + W_{i,t} δ_0 + ε_{i,t},  i = 1, . . . , N,  t = 1, . . . , T,
ε_{i,t} = α_i + u_{i,t}   (4.67)

where X_{i,t} is sequentially exogenous while W_{i,t} is strictly exogenous. Then the instrument matrix can be expanded into

Z_i = diag( (X^o_{i,1}, W^o_{i,T}),  (X^o_{i,2}, W^o_{i,T}),  . . . ,  (X^o_{i,T−1}, W^o_{i,T}) )   (4.68)

where W^o_{i,T} = (W_{i,1}, W_{i,2}, . . . , W_{i,T}). Now, Z′_i ∆u_i becomes

Z′_i ∆u_i = ( (X^o_{i,1})′ ∆u_{i,2},  (W^o_{i,T})′ ∆u_{i,2},  (X^o_{i,2})′ ∆u_{i,3},  (W^o_{i,T})′ ∆u_{i,3},  . . . ,  (X^o_{i,T−1})′ ∆u_{i,T},  (W^o_{i,T})′ ∆u_{i,T} )′.   (4.69)

4.5.2 SeqEx Variables that Are Uncorrelated with αi

We consider the case that W_{i,t} is sequentially exogenous and cov(W_{i,t}, α_i) = 0. In this case, observations on W_{i,t} up to and including t = s are valid instruments for the levels equation at t = s.

To combine the moment conditions for both the first-differenced equations and the levels equations, we stack the equations in differences and levels. Let

ε^+_i = ( u_{i,2} − u_{i,1}, . . . , u_{i,T} − u_{i,T−1}, ε_{i,1}, . . . , ε_{i,T} )′,
Y^+_i = ( Y_{i,2} − Y_{i,1}, . . . , Y_{i,T} − Y_{i,T−1}, Y_{i,1}, . . . , Y_{i,T} )′,
X^+_i = the matrix whose first T − 1 rows are (∆X_{i,t}, ∆W_{i,t}), t = 2, . . . , T, and whose last T rows are (X_{i,t}, W_{i,t}), t = 1, . . . , T.   (4.70)

In view of the first-differenced and level equations, we have

Y^+_i = X^+_i β_0 + ε^+_i.

Denote

Z^+_i = diag( X^o_{i,1}, X^o_{i,2}, . . . , X^o_{i,T−1}, W^o_{i,1}, W^o_{i,2}, . . . , W^o_{i,T} ),   (4.71)

a block-diagonal matrix whose first T − 1 blocks correspond to the differenced equations and whose last T blocks correspond to the level equations;

then E (Z^+_i)′ ε^+_i = 0, as

(Z^+_i)′ ε^+_i = ( (X^o_{i,1})′ ∆u_{i,2},  (X^o_{i,2})′ ∆u_{i,3},  . . . ,  (X^o_{i,T−1})′ ∆u_{i,T},  (W^o_{i,1})′ ε_{i,1},  (W^o_{i,2})′ ε_{i,2},  . . . ,  (W^o_{i,T})′ ε_{i,T} )′.   (4.72)

So we can use the GMM approach as before.

4.6 NUMBER OF MOMENT RESTRICTIONS 70

4.5.3 Strictly Exogenous Variables that Are Uncorrelated with αi

We now consider the case that W_{i,t} is strictly exogenous and cov(W_i, α_i) = 0. In this case, the observations on W_{i,t} for all periods become valid instruments in the level equations. Using the notation ε^+_i, Y^+_i and X^+_i as before and defining

Z^+_i = diag( X^o_{i,1}, X^o_{i,2}, . . . , X^o_{i,T−1}, W^o_{i,T}, W^o_{i,T}, . . . , W^o_{i,T} ),   (4.73)

where the last T diagonal blocks (one for each level equation) all equal the full history W^o_{i,T}, we have E (Z^+_i)′ ε^+_i = 0. Again, we can use the GMM approach as before.

To sum up, we have considered the following cases:

             relation to α_i           exogeneity w.r.t. u_{i,s}    Levels/Differences
W_{i,t}      cov(W_{i,t}, α_i) ≠ 0      sequential exogeneity        Differences
W_{i,t}      cov(W_{i,t}, α_i) ≠ 0      strict exogeneity            Differences
W_{i,t}      cov(W_{i,t}, α_i) = 0      sequential exogeneity        Levels
W_{i,t}      cov(W_{i,t}, α_i) = 0      strict exogeneity            Levels

4.6 Number of Moment Restrictions

There is a large literature on weak instruments, although there is no consensus on the definition

of weak instruments. All researchers seem to agree that when the instruments are weakly

correlated with the regressors, the problem of weak instruments is present. In this case, the

GMM estimator may not be consistent. More recently, many papers find that using too

many overidentifying restrictions leads to poor finite sample properties. In practice, it may be

better to use a couple of lags (say 3) rather than lags all the way back to t = 1. A rigorous study of the weak

instruments problem is beyond the scope of this course.
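In the notation of the earlier sketch, restricting the instrument set is just a matter of dropping columns of Z_i. The MATLAB fragment below (illustrative only; maxlag = 3 is an arbitrary choice; i indexes a given unit and X is the assumed N × T regressor matrix) keeps at most the three most recent lagged levels per differenced equation.

    % Build Z_i with at most `maxlag` lagged levels per period (scalar regressor).
    maxlag = 3;
    cols = 0;
    for t = 2:T, cols = cols + min(t-1, maxlag); end
    Zi = zeros(T-1, cols);  pos = 0;
    for t = 2:T
        lags = max(1, t-maxlag):(t-1);       % most recent valid lags only
        Zi(t-1, pos+1:pos+numel(lags)) = X(i, lags);
        pos = pos + numel(lags);
    end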

4.7 Initialization and Maximum Likelihood Estimator

The GMM estimator in the previous sections is in general consistent and asymptotically normal regardless of how the process is initialized. The only exception is that the instruments become weak for certain initializations. The GMM estimator is robust at the cost of ignoring information in the first observation. In the time series context, whether the first observation is used in the estimation does not matter for robustness and asymptotic efficiency. With short panels, the situation is fundamentally different. As shown by Blundell and Bond (1998) and Hahn (1999), imposing restrictions on the initial condition can greatly improve the efficiency of GMM over certain parts of the parameter space. In this section, we will not discuss how to incorporate the information in the first observation in the GMM framework. Instead, we discuss the problem of initialization in the MLE framework.

Consider the standard dynamic linear panel data model

Y_{i,t} = µ + Y_{i,t−1} β + X_{i,t} γ + ε_{i,t},  i = 1, · · · , N,  t = 1, · · · , T_i,
ε_{i,t} = α_i + u_{i,t}   (4.74)

where α_i and u_{i,t} are normally distributed and

α_i ∼ iid N(0, σ²_α)  and  u_{i,t} ∼ iid N(0, σ²_u),
α_i ⊥ X_{i,t},  u_{i,t} ⊥ X_{i,s}  for all t and s.

To use information in the first observation Yi,0 , we need to specify how Yi,0 is generated.

We assume that

Y_{i,0} = δ_0 + δ_1 α_i + v_i

where v_i ∼ iid N(0, σ²_v) and is independent of α_i and {u_{i,t}}_{t=1}^T. Some special cases of this

specification are:

a. δ 1 = 0 : Yi,0 is random but uncorrelated with αi

b. δ 1 = 0 and σ 2v = 0 : Yi0 is a fixed constant.

c. δ 0 = µ/(1 − β), δ 1 = 1/(1 − β) and σ 2v = σ 2u /(1 − β 2 ) : when Xi,t = 0, Yi,0 follows the

stationary and unconditional distribution of the process.

We will not impose any of the restrictions above.

The likelihood function for the observations {Y_{i,0}, Y_{i,1}, . . . , Y_{i,T}}_{i=1}^N (conditional on X_i, i = 1, 2, . . . , N) is:

L = Π_{i=1}^N f(Y_{i,0}, Y_{i,1}, . . . , Y_{i,T} | X_i) = Π_{i=1}^N f(Y_{i,1}, . . . , Y_{i,T} | Y_{i,0}, X_i) Π_{i=1}^N f(Y_{i,0} | X_i)   (4.75)

with

Π_{i=1}^N f(Y_{i,0} | X_i) = (2π)^{−N/2} (σ²_0)^{−N/2} exp( − Σ_{i=1}^N (Y_{i,0} − δ_0)² / (2σ²_0) )   (4.76)

where

σ²_0 = δ²_1 σ²_α + σ²_v.

Next, conditional on Y_{i,0} and X_i,

α_i ∼ N( φ(Y_{i,0} − δ_0), σ²_α − φ² σ²_0 )  where  φ = δ_1 σ²_α / σ²_0.


Therefore, conditional on Y_{i,0}, ε_i = (ε_{i,1}, ε_{i,2}, . . . , ε_{i,T})′ has mean φ(Y_{i,0} − δ_0) and variance Ω, where

Ω = ( σ²_α − φ² σ²_0 ) J_T + σ²_u I_T := σ²_{α|0} J_T + σ²_u I_T.

So Π_{i=1}^N f(ε_{i,1}, . . . , ε_{i,T} | Y_{i,0}, X_i) is

(2π)^{−NT/2} |Ω|^{−N/2} exp( − (1/2) Σ_{i=1}^N [ε_i − φ(Y_{i,0} − δ_0)]′ Ω^{−1} [ε_i − φ(Y_{i,0} − δ_0)] ).

Using

Π_{i=1}^N f(Y_{i,1}, . . . , Y_{i,T} | Y_{i,0}, X_i) = Π_{i=1}^N f(ε_{i,1}, . . . , ε_{i,T} | Y_{i,0}, X_i)

and combining (4.75) and (4.76) gives

L(σ²_0, σ²_u, σ²_α, δ_0, µ, β, γ, φ)
= (2π)^{−N/2} (σ²_0)^{−N/2} exp( − Σ_{i=1}^N (Y_{i,0} − δ_0)² / (2σ²_0) )
  × (2π)^{−NT/2} |Ω|^{−N/2} exp( − (1/2) Σ_{i=1}^N e′_i Ω^{−1} e_i )

where

e_i = Y_i − µ − β Y_{i,−1} − X_i γ − φ(Y_{i,0} − δ_0)
    = Y_i − µ̃ − β Y_{i,−1} − X_i γ − φ Y_{i,0}  with  µ̃ = µ + φ δ_0.

We can reparametrize L and write it as

L(σ 20 , σ 2u , σ 2α|0 , δ 0 , µ, β, γ, φ) := L(σ 20 , σ 2u , σ 2α , δ 0 , µ, β, γ, φ).

Remark 12 If δ_1 = 0, then φ = 0. In this case, the ML estimators of µ, β, γ are the random effects estimator (or the GLS estimator) if the quasi-demeaning uses the MLE of θ. Therefore the random effects estimator is consistent when Y_{i0} is a fixed constant or Y_{i0} is random but uncorrelated with α_i.

Remark 13 The random effects estimator is inconsistent when φ ≠ 0. The consistency of the random effects estimator thus depends crucially on the initialization of the process.

Remark 14 The conditional likelihood function is defined by

L_C( σ²_u, σ²_{α|0}, µ̃, β, γ, φ )
= Π_{i=1}^N f(Y_{i,1}, . . . , Y_{i,T} | Y_{i,0}, X_i)
= (2π)^{−NT/2} |Ω|^{−N/2} exp( − (1/2) Σ_{i=1}^N e′_i Ω^{−1} e_i ).


We can obtain the conditional MLE by maximizing the conditional likelihood. If there is no restriction on the DGP for Y_{i0}, then the conditional MLE is asymptotically as efficient as the MLE. Otherwise, the conditional MLE may be less efficient.

Remark 15 Note that

Ω^{−1} = [ T σ²_{α|0} P_T + σ²_u (P_T + Q_T) ]^{−1}
       = ( σ²_u + T σ²_{α|0} )^{−1} P_T + σ^{−2}_u Q_T
       := σ^{−2}_1 P_T + σ^{−2}_u Q_T

and

|Ω| = σ²_1 ( σ²_u )^{T−1},

we have

e′_i Ω^{−1} e_i = σ^{−2}_1 e′_i P_T e_i + σ^{−2}_u e′_i Q_T e_i.

The likelihood function can be regarded as based on the following two equations:

e_{i,t} − ē_{i,·} = ( Y_{i,t} − Ȳ_{i,·} ) − β ( Y_{i,t−1} − Ȳ_{i,−1} ) − ( X_{i,t} − X̄_{i,·} ) γ,
ē_{i,·} = Ȳ_i − µ̃ − β Ȳ_{i,−1} − X̄_{i,·} γ − φ Y_{i0}.

It is easy to show that

log L_C( σ²_u, σ²_{α|0}, µ̃, β, γ, φ )
= const − (N/2) log σ²_1 − (N(T − 1)/2) log σ²_u − (1/(2σ²_u)) Σ_{i=1}^N Σ_{t=1}^T ( e_{i,t} − ē_{i,·} )² − (T/(2σ²_1)) Σ_{i=1}^N ( ē_{i,·} )²
= const − (N/2) [ log σ²_1 + (1/σ²_1) (T/N) Σ_{i=1}^N ( ē_{i,·} )² ] − (N(T − 1)/2) [ log σ²_u + (1/σ²_u) (1/(N(T − 1))) Σ_{i=1}^N Σ_{t=1}^T ( e_{i,t} − ē_{i,·} )² ].

Remark 16 If the model is correctly specified, the MLE will be asymptotically more efficient than the GMM estimator based on the first-differenced equation. To improve the asymptotic efficiency of the GMM estimator, Ahn and Schmidt (1995), Arellano and Bover (1995) and Blundell and Bond (1998) proposed an additional set of moment conditions. See Ahn and Schmidt (1999) for a survey on the GMM approach applied to the dynamic panel context.

4.8 Problem Set

1. Write a Matlab program to investigate the finite sample bias, standard error and root mean squared error of different estimators for a simple dynamic panel data model. Specifically, for each value of N, T and ρ

(i) Generate αi ∼ iidN (0, 1) and ei ∼ iidN (0, 1).

(ii) Generate Yi0 according to Yi0 = 0.5αi + ei .


(iii) Generate Yi,t according to

Yi,t = α + ρYi,t−1 + αi + ui,t for t = 1, 2, · · · , T (4.77)

where α = 0, ui,t ∼ iidN (0, 1) across i and t and ui,t is independent of αi .

(iv) Given the simulated panel data Y_{i,t}, i = 1, 2, · · · , N and t = 1, · · · , T, estimate the dynamic panel data model (4.77) using the pooling OLS estimator, fixed effects estimator, first differenced estimator and Anderson-Hsiao estimator (with Y_{i,t−2} as the instrument).

(v) Repeat (i)-(iv) 1000 times and calculate the finite sample bias, standard error (se)

and root mean squared error (rmse) of each estimator. Let ρ̂(r) be the estimate for the r-

th replication, then the finite sample bias, standard error and root mean squared error are

computed as follows

bias(ρ̂) = (1/1000) Σ_{r=1}^{1000} ρ̂^{(r)} − ρ,

se(ρ̂) = { (1/1000) Σ_{r=1}^{1000} ( ρ̂^{(r)} − (1/1000) Σ_{s=1}^{1000} ρ̂^{(s)} )² }^{1/2},

rmse(ρ̂) = { [bias(ρ̂)]² + [se(ρ̂)]² }^{1/2}.

(a) Let N = 100, T = 6. Graph the bias, se and rmse of each estimator as functions of ρ

for ρ = 0, 0.1, 0.2, · · · , 1. For ρ = 0.7, graph the histogram of each estimator and compare it

with a normal density.

(b) Repeat (a) with N = 100, T = 3 and N = 100, T = 9;

(c) Describe your findings and try to explain it.

2. Download the data file cigar.xls from the course web site. The file contains cigarette

consumption for 46 states over the years 1963-1992. We are interested in estimating a dynamic

demand model for cigarettes using Matlab or STATA. The model is given below

ln Ci,t = β 1 ln Ci ,t−1 +β 2 ln Pi,t + β 3 ln Yi,t + β 4 ln P ni,t + "i,t (4.78)

where

"i,t = αi + λt + ui,t . (4.79)

C_{i,t} is real per capita packs of cigarettes sold in state i in year t. P_{i,t} is the average retail price

of a pack of cigarettes measured in real terms. Yi,t is per capita disposable income. P ni,t is

the minimum real price of cigarettes in any neighboring state.

(a) Estimate the model using the OLS estimator with no αi and λt . In this case, the model

becomes

ln Ci,t = α + β 1 ln Ci ,t−1 +β 2 ln Pi,t + β 3 ln Yi,t + β 4 ln P ni,t + ui,t . (4.80)

(Do not forget to report the standard errors or t-statistics. Report only the estimates for

the β 0 s. This note also applies to the questions below.)


(b) Estimate the model using the OLS estimator with no α_i, i.e. ε_{i,t} = λ_t + u_{i,t}. (Construct the year dummies, say year62, year63, year64, . . ., year92 and include year64, . . ., year92 as additional regressors in (4.80).)

(c) Estimate the model using the within estimator with no λ_t, i.e. ε_{i,t} = α_i + u_{i,t}. Assume u_{i,t} ∼ iid(0, σ²_u) over i and t and is independent of the regressor. Is the within estimate of β_1 consistent?

(d) Estimate the model using the within estimator, i.e. ε_{i,t} = α_i + λ_t + u_{i,t}. Test for the macro effects, i.e. test the null of λ_1 = λ_2 = · · · = λ_T = 0.

(e) Estimate the model (ε_{i,t} = α_i + λ_t + u_{i,t}) using the Anderson-Hsiao estimator (Please use the model in its first difference and employ ln C_{i,t−2}, the second lag of log consumption, as the only instrument). Is ln C_{i,t−2} a valid instrument when u_{i,t} is autocorrelated, say u_{i,t} ∼ MA(1)?

(f) Compare the estimate of β 1 in (b) with that in (e). Which one is larger? Can we intuit

what is going on here?

(g) Estimate the model (ε_{i,t} = α_i + λ_t + u_{i,t}) using the Arellano-Bond estimator, including

(i) the first-step AB estimator using all possible moments.

(ii) the first-step AB estimator using ln C_{i,t−2}, ln C_{i,t−3} and ln C_{i,t−4} as instruments.

(iii) the first-step AB estimator using only ln C_{i,t−2} as instruments.

The moment conditions used in g(iii) are different from those in (e). Can you explain the difference?

Note: please use W = ( (1/N) Σ_{i=1}^N Z′_i G Z_i )^{−1} as the first-step weighting matrix, where Z_i contains instruments and

G = [  2  −1   0  · · ·   0    0
      −1   2  −1  · · ·   0    0
       0  −1   2  · · ·   0    0
       ⋮            ⋱          ⋮
       0   0   0  · · ·   2   −1
       0   0   0  · · ·  −1    2 ]   (4.81)

(iv) Can you perform a two-step GMM estimation based on the specification in g(i)?

Why or why not? Please give a proof.

Bibliography

[1] Ahn, S. C., and Schmidt, P. (1995). "Efficient estimation of models for dynamic panel data," Journal of Econometrics, 68, 5-28.

[2] Ahn, S. C., and Schmidt, P. (1999). “Estimation of linear panel data models using GMM,”

Generalized Method of Moments Estimation, edited by L. Mátyás.

[3] Anderson, T.W. and Cheng Hsiao (1981). “Estimation of dynamic models with error

components,” Journal of the American Statistical Association, 589-606.

[4] Anderson, T.W., and Cheng Hsiao (1982). “Formulation and estimation of dynamic mod-

els using panel data,” Journal of Econometrics, 18, 47-82.

[5] Arellano, M., and Bond, S. (1991). “Some tests of specification for panel data: Monte

Carlo evidence and an application to employment equations,” Review of Economic Stud-

ies, 58, 277-297.

[6] Arellano, M., and Bover, O. (1995). “Another look at the instrumental variable estimation

of error-components models,” Journal of Econometrics, 68, 29-52.

[7] Blundell, R., and Bond, S. (1998). “Initial conditions and moment restrictions in dynamic

panel data models,” Journal of Econometrics, 87, 115-143.

[8] Hahn, Jin (1999). "How informative is the initial condition in the dynamic panel data models with fixed effects," Journal of Econometrics, 93, 309-326.

[9] Nickell, S. (1981). "Biases in Dynamic Models with Fixed Effects," Econometrica, 49, 1417-1426.

[10] Roodman, D. (2009). "A note on the theme of too many instruments," Oxford Bulletin of Economics and Statistics, 71(1), 135-158.



Chapter 5

Extremum Estimators


We define and investigate the properties of a very general class of estimators called extremum

estimators. As we will see, many popular estimators that you have encountered before are

part of this class, e.g. (nonlinear) Least Squares, Maximum Likelihood, Generalized Method

of Moments estimator.


5.1 Definitions


Denote by Θ ⊂ R^d the parameter space of interest. Extremum estimators (EE) {θ̂_n : n ≥ 1} are defined to be random elements of Θ that approximately minimize a stochastic criterion function Q_n(θ). That is, θ̂_n is defined to satisfy

Assumption EE: θ̂_n ∈ Θ and Q_n(θ̂_n) ≤ inf_{θ∈Θ} Q_n(θ) + o_p(1).

Remark 17 A random quantity ξ_n is of smaller stochastic order than b_n, written as ξ_n = o_p(b_n), if

plim_{n→∞} (ξ_n / b_n) = 0.

Equivalently, for any δ > 0,

lim_{n→∞} P( |ξ_n / b_n| > δ ) = 0.


Remark 18 Note that for any nonnegative sequence s_n, we have lim s_n = 0 iff lim sup_{n→∞} s_n = 0. First, if lim s_n = 0, then lim sup_{n→∞} s_n = lim_{n→∞} s_n = 0. Second, note that 0 ≤ s_n ≤ sup_{m≥n} s_m, so lim sup_{n→∞} s_n = 0 implies that lim s_n = 0.

Remark 19 When we do not know whether lim s_n exists or not a priori, we may work with lim sup_{n→∞} s_n, which always exists if s_n is bounded. In our setting, s_n will be a probability, which by definition is bounded by 1.


Remark 20 A random quantity ξ_n is stochastically bounded, written as ξ_n = O_p(1), if for any ε > 0, there exists M_ε < ∞ such that P( |ξ_n| ≤ M_ε ) > 1 − ε for n sufficiently large.


Here are some examples:


(1) Maximum Likelihood (ML) Estimator: Suppose the data {W_i : i = 1, ..., n} are iid. Suppose we specify the density to be f(w, θ) (with respect to some measure µ, in most cases the Lebesgue measure). Let l(w, θ) := log f(w, θ). The (quasi) ML estimator θ̂_n minimizes (at least up to o_p(1))

Q_n(θ) := − (1/n) Σ_{i=1}^n l(W_i, θ)

over θ ∈ Θ.

(2) Least Squares (LS) Estimator for Nonlinear Regression: Suppose the data {W_i = (Y_i, X′_i)′ : i = 1, ..., n} are iid. We specify a possibly nonlinear function for E(Y_i | X_i). That is,

E(Y_i | X_i) = g(X_i, θ)

for some function g(x, θ). The LS estimator θ̂_n minimizes (at least up to o_p(1))

Q_n(θ) := (1/n) Σ_{i=1}^n (Y_i − g(X_i, θ))² / 2

over θ ∈ Θ. (The scale factor 1/2 is used because it is notationally convenient for the asymptotic normality result given below. It has of course no effect on the estimator.)
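For concreteness, here is a minimal MATLAB sketch of this estimator. It uses an exponential regression function purely as an illustrative (assumed) choice of g(x, θ) and assumes column vectors y and x are available.

    % Nonlinear LS as an extremum estimator: minimize Qn(theta) numerically.
    g  = @(x, th) th(1) * exp(th(2) * x);        % hypothetical g(x, theta)
    Qn = @(th) mean((y - g(x, th)).^2) / 2;      % the criterion Qn(theta)
    theta0   = [1; 0.1];                         % starting value (arbitrary)
    thetahat = fminsearch(Qn, theta0);           % approximate minimizer of Qn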


(3) Generalized Method of Moments (GMM) Estimator: Suppose the data {W_i : i = 1, ..., n} are iid, and we have the moment conditions

E g(W_i, θ_0) = 0   (5.1)

where g(w, θ) ∈ R^k (k ≥ d) is a known function. Let A_n be a k × k random (i.e. depending on the data) weight matrix. Then, the GMM estimator θ̂_n minimizes (at least up to o_p(1))

Q_n(θ) := || A_n (1/n) Σ_{i=1}^n g(W_i, θ) ||² / 2

over θ ∈ Θ, where || · || denotes the Euclidean norm on R^k.

(4) Minimum Distance (MD) Estimator: Let π̂_n be an unrestricted estimator of a k-vector parameter π_0. Suppose π_0 is known to be a function of a d-vector parameter θ_0 where d ≤ k:

π_0 = g(θ_0).

Let A_n be a k × k random weight matrix. Then, the MD estimator θ̂_n minimizes

Q_n(θ) = || A_n ( π̂_n − g(θ) ) ||² / 2

over θ ∈ Θ.

As an example, consider the simple linear panel data model

Y_{it} = α + X_{it} β + ε_{it},  i = 1, ..., N,  t = 1, ..., T_i,
ε_{it} = α_i + u_{it}   (5.2)



where α_i may be correlated with X_{it}. Chamberlain's approach (1982, 1984) is to replace α_i with its linear projection onto {X_{it}}. Assume that α_i and {X_{it}} have finite second moments. The projection can always be written as

α_i = λ_0 + X_{i1} λ_1 + ... + X_{iT} λ_T + v_i.   (5.3)

Plugging this into the original model, we have, with r_{it} = v_i + u_{it},

Y_{it} = (α + λ_0) + X_{i1} λ_1 + ... + X_{it} (β + λ_t) + ... + X_{iT} λ_T + r_{it}
       := π_{t0} + X_{i1} π_{t1} + X_{i2} π_{t2} + ... + X_{iT} π_{tT} + r_{it}.   (5.4)

Now π_t = (π_{t0}, π_{t1}, ..., π_{tT})′ can be estimated by a cross-sectional regression using observations for time t. Let θ collect the parameters α, β, λ_0, λ_1, ..., λ_T; then π = Hθ for some H, and θ can be estimated by minimum distance.
gofer
(5) Two-step (TS) Estimator: Suppose that the criterion function C_n(θ; τ_0) depends on both θ and τ_0. The infeasible extremum estimator is defined to be

θ̃_n ∈ Θ and C_n(θ̃_n; τ_0) ≤ inf_{θ∈Θ} C_n(θ; τ_0) + o_p(1).

To obtain a feasible version, we assume that τ̂_n is a preliminary consistent estimator of τ_0. We can then define a two-step extremum estimator:

θ̂_n ∈ Θ and Q_n(θ̂_n) ≤ inf_{θ∈Θ} Q_n(θ) + o_p(1)  for  Q_n(θ) = C_n(θ; τ̂_n).
As an example, consider the nonlinear model

Y_i = g(X_i, θ_0) + U_i,

with E(U²_i | X_i) = σ²(X_i, τ_0) almost surely. Suppose τ̂_n is a consistent estimator of τ_0; then the (feasible) WLS estimator (up to o_p(1))

θ̂_WLS = arg min_{θ∈Θ} Q_n(θ),

Q_n(θ) = C_n(θ; τ̂_n) = (1/(2n)) Σ_{i=1}^n (Y_i − g(X_i, θ))² / σ²(X_i, τ̂_n),

is a two-step estimator.

As another example, suppose the data {W_i, i = 1, 2, ..., n} are iid, and we have the moment conditions E g(W_i, θ_0, τ_0) = 0. Let

G_n(θ, τ) = (1/n) Σ_{i=1}^n g(W_i, θ, τ) ∈ R^k

and τ̂_n be a preliminary consistent estimator of τ_0. Then

θ̂_n = arg min_{θ∈Θ} Q_n(θ),  Q_n(θ) = || A_n G_n(θ, τ̂_n) ||² / 2,

for some matrix A_n, is a two-step estimator.

There is also a TS version of the MD estimator. In this case G_n(θ, τ) = π̂(τ) − g(θ, τ), which is a random k-vector that should be close to 0 when θ = θ_0 and τ = τ_0, and n is large. Let A_n be a k × k random weight matrix. Then, the TS MD estimator θ̂_n minimizes

Q_n(θ) = || A_n G_n(θ, τ̂_n) ||² / 2

over θ ∈ Θ.



5.2 Consistency
5.2.1 Consistency Theorem

Consistency of extremum estimators is implied by the following assumptions:

Assumption ULLN: sup_{θ∈Θ} |Q_n(θ) − Q(θ)| →_p 0 for some nonstochastic function Q : Θ → R.

Assumption ID: ∀ε > 0, there exists δ > 0 such that inf_{θ∈Θ\B(θ_0,ε)} Q(θ) − Q(θ_0) ≥ δ, where by B(θ_0, ε) we denote the open ball of radius ε centered at θ_0, i.e. B(θ_0, ε) := {θ : ||θ − θ_0|| < ε}.
Before we discuss these assumptions in detail, we prove the following theorem.
Theorem 21 Assume that EE, ULLN, and ID hold. Then θ̂_n →_p θ_0.

Proof: Let ε > 0. By ID, there exists δ > 0 such that whenever θ ∈ Θ\B(θ_0, ε) we have

Q(θ) − Q(θ_0) ≥ δ.   (5.5)

Thus

P( ||θ̂_n − θ_0|| > ε ) = P( θ̂_n ∈ Θ\B(θ_0, ε) )
  ≤ P( Q(θ̂_n) − Q(θ_0) ≥ δ )
  = P( Q(θ̂_n) − Q_n(θ̂_n) + Q_n(θ̂_n) − Q(θ_0) ≥ δ )
  ≤ P( Q(θ̂_n) − Q_n(θ̂_n) + Q_n(θ_0) + o_p(1) − Q(θ_0) ≥ δ )
  ≤ P( 2 sup_{θ∈Θ} |Q_n(θ) − Q(θ)| + o_p(1) ≥ δ ) = P( o_p(1) ≥ δ ) → 0.

The first inequality follows by (5.5), the third inequality by EE, and the last equality by ULLN. So, lim_{n→∞} P( ||θ̂_n − θ_0|| > ε ) = 0 for any ε > 0. That is, θ̂_n →_p θ_0. ∎

We now discuss the Assumptions.
Assumption ULLN. Note that the assumption is stronger than convergence in probability Q_n(θ) →_p Q(θ) for each θ ∈ Θ. As an analogue, we can think of pointwise convergence of nonrandom functions versus uniform convergence.
Definition (Uniform Convergence): The sequence of functions {f_n(t)} converges uniformly to a function f(t) if for every ε > 0, there is an N such that n ≥ N implies |f_n(t) − f(t)| < ε for all t ∈ R.

Obviously every uniformly convergent sequence is pointwise convergent. The difference between pointwise and uniform convergence is this: If f_n(t) converges pointwise to f(t), then for every ε > 0 and for every t ∈ R, there is an integer N depending on ε and t such that |f_n(t) − f(t)| < ε holds if n ≥ N. If f_n(t) converges uniformly to f(t), it is possible for each ε > 0 to find one integer N that will do for all t ∈ R.

Figure 5.1: Graphs of f_n(t) = t^n (shown for n = 2, 4, 20) and the limit function f(t).

As an example, consider the functions f_n : [0, 1] → R, f_n(t) = t^n, which converge pointwise to f : [0, 1] → R, where f(t) = 0 for t ≠ 1 and f(t) = 1 otherwise. However, the convergence is not uniform! Note that |f_n(t) − f(t)| = t^n for t ∈ [0, 1). To ensure |f_n(t) − f(t)| < ε, we need n ≥ log(ε)/log(t). The lower bound for n depends on t and diverges to ∞ as t → 1−.
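A short MATLAB illustration (the grid choice is arbitrary): at each fixed t < 1 the value t^n goes to 0, yet the supremum over t stays essentially at 1 for every n, so the convergence cannot be uniform.

    % Pointwise but not uniform convergence of fn(t) = t^n on [0,1).
    t = linspace(0, 0.9999, 1e5);
    for n = [2 4 20 200]
        fprintf('n = %4d: fn(0.5) = %.4f, sup over grid = %.4f\n', n, 0.5^n, max(t.^n));
    end
    % fn(0.5) shrinks to 0 while the grid supremum stays close to 1
    % (the true sup over [0,1) equals 1 for every n).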
What are the functions Q in our examples? By the law of large numbers, the candidate
functions are given by the expected value of the criterion function.
(1) ML Estimator: Q(θ) = −E l(W_i, θ) = −E log f(W_i, θ), where E denotes expectation under the true distribution generating the data.
(2) NLS Estimator: Q(θ) = E(Y_i − g(X_i, θ))²/2.
(3) GMM Estimator: Q(θ) = || A E g(W_i, θ) ||²/2, where we assume A_n →_p A.
(4) MD Estimator: Q(θ) = || A (π_0 − g(θ)) ||²/2, where A_n →_p A and π̂_n →_p π_0.
(5) TS GMM/MD Estimator: Q(θ) = || A G(θ, τ_0) ||²/2, where A_n →_p A, τ̂_n →_p τ_0 and G_n(θ, τ) →_p G(θ, τ).
(6) TS WLS Estimator: Q(θ) = E[ (Y_i − g(X_i, θ))² σ^{−2}(X_i, τ_0) ]/2.

Note that for the TS estimator, Q(θ) is evaluated at τ = τ_0.

5.2.2 Primitive sufficient conditions for uniform convergence


Obviously, a necessary condition for Assumption ULLN is pointwise weak convergence:
Assumption LLN: ∀θ ∈ Θ, Q_n(θ) →_p Q(θ).
To establish Assumption LLN, we can use the following weak law of large numbers (WLLN) that can be found in any basic probability textbook.

Lemma 22 Let {W_i : i = 1, ..., n} be a sequence of iid random variables. Let {m(w, θ) : θ ∈ Θ} be a class of R^s-valued functions for which E||m(W_i, θ)|| < ∞ ∀θ ∈ Θ. Then, (1/n) Σ_{i=1}^n m(W_i, θ) →_p E m(W_i, θ) as n → ∞, ∀θ ∈ Θ.

Using the lemma, we see that Assumption LLN holds in the following examples: (1) ML estimator, provided the data {W_i : i = 1, ..., n} are iid and E|log f(W_i, θ)| < ∞ ∀θ ∈ Θ, (2) LS estimator, provided {(Y_i, X_i) : i = 1, ..., n} are iid and E(Y_i − g(X_i, θ))² < ∞ ∀θ ∈ Θ, and (3) GMM estimator, provided A_n →_p A, {W_i : i = 1, ..., n} are iid and E||g(W_i, θ)|| < ∞, ∀θ ∈ Θ.
Assumption LLN holds for the MD estimator provided that A_n →_p A and π̂_n →_p π_0.
Assumption LLN holds for the TS WLS estimator if

sup_{τ∈B(τ_0,ε)} | (1/n) Σ_{i=1}^n (Y_i − g(X_i, θ))² / σ²(X_i, τ) − E[ (Y_i − g(X_i, θ))² / σ²(X_i, τ) ] | →_p 0

for some ε and any θ ∈ Θ, and E[ (Y_i − g(X_i, θ))² σ^{−2}(X_i, τ) ]/2 is continuous at τ_0 for any θ ∈ Θ.
Assumption LLN holds for the TS GMM/MD estimator provided that A_n →_p A, τ̂_n →_p τ_0,

sup_{τ∈B(τ_0,ε)} | G_n(θ, τ) − G(θ, τ) | →_p 0

for some ε and any θ ∈ Θ, and G(θ, τ) is continuous at τ_0 for any θ ∈ Θ. The latter follows because τ̂_n ∈ B(τ_0, ε) with probability approaching one, and

| G_n(θ, τ̂_n) − G(θ, τ_0) | = | G_n(θ, τ̂_n) − G(θ, τ̂_n) + G(θ, τ̂_n) − G(θ, τ_0) |
  ≤ sup_{τ∈B(τ_0,ε)} | G_n(θ, τ) − G(θ, τ) | + | G(θ, τ̂_n) − G(θ, τ_0) | →_p 0.

Next we consider how one can verify Assumption ULLN. Suppose we have a sequence of random variables indexed by θ ∈ Θ, {H_n(θ) : n ≥ 1}, that converges in probability to zero ∀θ ∈ Θ. In our setting, we can take H_n(θ) := Q_n(θ) − Q(θ). What additional conditions are sufficient to obtain uniform convergence to zero, sup_{θ∈Θ} |H_n(θ)| →_p 0? The following results can be applied to verify Assumption ULLN. We need the following definition:

Definition: {H_n(θ) : n ≥ 1} is stochastically equicontinuous (SE) on Θ if ∀ε > 0 and η > 0, ∃δ := δ(ε, η) > 0 (which may depend on ε, η but not n) such that

lim sup_{n→∞} P( sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |H_n(θ′) − H_n(θ)| > ε ) < η.
The "equi" in "equicontinuous" emphasizes the fact that δ does not depend on n. A more precise term is stochastic uniform equicontinuity: "uniformity" refers to the fact that δ does not depend on θ. However, it is common in the literature to use "stochastic equicontinuity". Sometimes, we also use "asymptotic equicontinuity".
Note: by definition,

lim sup_{n→∞} f_n = lim_{n→∞} sup_{m≥n} f_m.
Define

w_n(δ) := sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |H_n(θ′) − H_n(θ)|,

which is the so-called modulus of continuity of H_n. Then SE holds if ∀ε > 0 and η > 0, ∃δ > 0 such that

lim sup_{n→∞} P( w_n(δ) > ε ) < η.

This is equivalent to saying that w_n(δ_n) = o_p(1) for any δ_n → 0. See Andrews (1994).
Note: In mathematical analysis, a family of functions is equicontinuous if all the functions are continuous and they have equal variation over a given neighborhood, in the precise sense described below.
The family F is equicontinuous at a point x_0 ∈ X if for every ε > 0, there exists a δ = δ(x_0) > 0 such that d(f(x_0), f(x)) < ε for all f ∈ F and all x such that d(x_0, x) < δ. Note that δ = δ(x_0) should work for all f ∈ F. The family is (pointwise) equicontinuous if it is equicontinuous at each point of X.
The concept of equicontinuity can be applied to a sequence of functions: the sequence of functions {f_n(x), n = 1, 2, ...} is equicontinuous at a point x_0 if for every ε > 0, there exists a δ = δ(x_0) > 0 such that d(f_n(x_0), f_n(x)) < ε for all n and all x such that d(x_0, x) < δ. Here δ may depend on x_0 but not on n. When there is a δ that applies to all x ∈ X, then the sequence of functions is uniformly equicontinuous.
Assumption LLN: H_n(θ) →_p 0, ∀θ ∈ Θ.
Assumption SE: {H_n(θ) : n ≥ 1} is SE on Θ.

Theorem 23 (a) If Θ is compact and Assumptions LLN and SE hold, then sup_{θ∈Θ} |H_n(θ)| →_p 0. (b) If sup_{θ∈Θ} |H_n(θ)| →_p 0, then SE holds.
Proof: (a) Let ∪_{θ∈Θ} {B(θ, δ)} be an open cover of Θ, where δ is given in the definition of stochastic equicontinuity. Since Θ is compact, there exists a finite subcover such that Θ ⊆ ∪_{j=1,...,J} {B(θ_j, δ)}. Then for any ε and η > 0, we have

lim sup_{n→∞} P( sup_{θ∈Θ} |H_n(θ)| > 2ε )
  ≤ lim sup_{n→∞} P( max_{j≤J} sup_{θ∈B(θ_j,δ)} ( |H_n(θ) − H_n(θ_j)| + |H_n(θ_j)| ) > 2ε )
  ≤ lim sup_{n→∞} P( max_{j≤J} sup_{θ∈B(θ_j,δ)} |H_n(θ) − H_n(θ_j)| > ε ) + lim sup_{n→∞} P( max_{j≤J} |H_n(θ_j)| > ε )
  ≤ Σ_{j=1}^J lim sup_{n→∞} P( sup_{θ′∈B(θ_j,δ)} |H_n(θ′) − H_n(θ_j)| > ε ) + Σ_{j=1}^J lim sup_{n→∞} P( |H_n(θ_j)| > ε )
  ≤ 2Jη,

where in the last inequality we use Assumption SE to deal with the first summand and Assumption LLN to deal with the second summand.
(b) We have

P( sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |H_n(θ′) − H_n(θ)| > ε ) ≤ P( 2 sup_{θ∈Θ} |H_n(θ)| > ε ) → 0

using the triangle inequality and sup_{θ∈Θ} |H_n(θ)| →_p 0. ∎

Remark 24 If Θ is finite, then sup_{θ∈Θ} |H_n(θ)| →_p 0 follows automatically from |H_n(θ)| →_p 0 for each θ ∈ Θ. A common example of an infinite set that exhibits properties similar to finite sets is the compact set. Combining some continuity (here SE) and compactness, we can extend pointwise convergence, a local result, to uniform convergence, a global result.

We now obtain a ULLN by applying the previous theorem. We look at the case where

H_n(θ) := (1/n) Σ_{i=1}^n [ m(W_i, θ) − E m(W_i, θ) ].

In examples (1) ML and (2) LS, we can think of Q_n(θ) as n^{−1} Σ_{i=1}^n m(W_i, θ) and of Q(θ) as E m(W_i, θ). Let W denote the support of W_i.

Theorem 25 (a) Assume
(i) {W_i : i = 1, ..., n} are identically distributed,
(ii) m(w, θ) is continuous in θ on Θ ∀w ∈ W,
(iii) E sup_{θ∈Θ} |m(W_i, θ)| < ∞,
(iv) Θ is compact.
Then, SE holds and E m(W_i, θ) is continuous in θ at θ_0.
(b) If, in addition,
(v) {W_i : i = 1, ..., n} are independently distributed,
then Assumption LLN holds and, by Theorem 23(a), ULLN holds.
Proof: (a) The continuity result holds by the dominated convergence theorem with the dominating function given by sup_{θ∈Θ} |m(W_i, θ)|. To show SE, define

Y_{iδ} := sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |m(W_i, θ′) − m(W_i, θ)|.

We claim that

E Y_{iδ} → 0 as δ → 0.

To see this, note that Y_{iδ} → 0 as δ → 0 a.s. (this follows by (ii) and (iv), which imply uniform continuity). Furthermore, Y_{iδ} ≤ 2 sup_{θ∈Θ} |m(W_i, θ)| ∀δ > 0 and E sup_{θ∈Θ} |m(W_i, θ)| < ∞ by (iii). The claim thus follows by the dominated convergence theorem.

Now

sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |H_n(θ′) − H_n(θ)|
  = sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} | (1/n) Σ_{i=1}^n [ m(W_i, θ) − m(W_i, θ′) ] − (1/n) Σ_{i=1}^n E[ m(W_i, θ) − m(W_i, θ′) ] |
  ≤ sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} { (1/n) Σ_{i=1}^n | m(W_i, θ) − m(W_i, θ′) | + E | m(W_i, θ) − m(W_i, θ′) | }
  ≤ (1/n) Σ_{i=1}^n sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} | m(W_i, θ) − m(W_i, θ′) | + E sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} | m(W_i, θ) − m(W_i, θ′) |
  = (1/n) Σ_{i=1}^n Y_{iδ} + E Y_{iδ} = (1/n) Σ_{i=1}^n [ Y_{iδ} + E Y_{iδ} ].

Let ε > 0 and η > 0; then

lim sup_{n→∞} P( sup_{θ∈Θ} sup_{θ′∈B(θ,δ)} |H_n(θ′) − H_n(θ)| > ε )
  ≤ lim sup_{n→∞} P( (1/n) Σ_{i=1}^n [ Y_{iδ} + E Y_{iδ} ] > ε )
  ≤ lim sup_{n→∞} E{ (1/n) Σ_{i=1}^n [ Y_{iδ} + E Y_{iδ} ] } / ε
  = 2 E Y_{iδ} / ε ≤ η

by choosing δ = δ(ε, η) small enough. Here the first inequality follows by the triangle inequality, the second holds by Markov's inequality and the last equality holds by (i). Note that we can pick δ := δ(ε, η), which can depend on ε and η but not n, to make 2 E Y_{iδ} / ε < η for any ε > 0 and η > 0.
(b) Assumption LLN holds as in Lemma 22. ∎

5.2.3 Primitive sufficient conditions for identification

Next we discuss Assumption ID. Clearly, a necessary condition for Assumption ID is that θ_0 uniquely minimizes Q(θ) over Θ. What vectors uniquely minimize Q(θ) in our examples?
(1) ML Estimator:
Case (i), the correctly specified case, i.e. the true distribution g(w) is part of the parametric family {f(w, θ) : θ ∈ Θ}. In other words, there is θ*_0 such that g(w) = f(w, θ*_0). In this case, θ_0 = θ*_0.


Proof:

Q(θ*_0) − Q(θ) = −E log f(W_i, θ*_0) + E log f(W_i, θ)
  = E log( f(W_i, θ) / f(W_i, θ*_0) )
  ≤ log E( f(W_i, θ) / f(W_i, θ*_0) )
  = log ∫ [ f(w, θ) / f(w, θ*_0) ] f(w, θ*_0) dµ(w)
  = log ∫ f(w, θ) dµ(w) = log 1 = 0,

where the inequality holds by Jensen's inequality since log(·) is a concave function. The inequality is strict, implying that θ*_0 uniquely minimizes Q(θ) over Θ, if and only if P( f(W_i, θ) ≠ f(W_i, θ*_0) ) > 0 for all θ ∈ Θ different from θ*_0.
Case (ii), the true distribution g(w) is not part of the parametric family {f(w, θ) : θ ∈ Θ}. By

KLIC(g, f(·, θ)) := E_g log g(W_i) − E_g log f(W_i, θ)

we denote the Kullback-Leibler Information Criterion between g and f(·, θ), where E_g denotes the expectation when W_i has density g. Note that

KLIC(g, f(·, θ)) = E_g log g(W_i) + Q(θ).

Thus, the ML estimator under misspecification (often called the quasi-ML estimator) converges in probability to the parameter value θ_0 that uniquely minimizes the KLIC between the true density g and the densities in the parametric family {f(w, θ) : θ ∈ Θ}, provided such a unique value exists.
More generally, the Kullback-Leibler Information Criterion is a non-symmetric measure of the difference between two probability distributions P_1 and P_2. If P_1 is absolutely continuous with respect to P_2, then

KLIC(P_1, P_2) = ∫ log( dP_1/dP_2 ) ( dP_1/dP_2 ) dP_2.

If both P_1 and P_2 are absolutely continuous with respect to some measure µ, then

KLIC(P_1, P_2) = E_{P_1} [ log( dP_1/dµ ) − log( dP_2/dµ ) ].

As an example, the KLIC between P_1 := N(λ_1, Σ_1) and P_2 := N(λ_2, Σ_2) on R^k is

KLIC(P_1, P_2)
  = E_{P_1} [ log( dP_1/dµ ) − log( dP_2/dµ ) ]
  = −(1/2) E_{P_1} [ log( det(Σ_1)/det(Σ_2) ) + (X − λ_1)′ Σ_1^{−1} (X − λ_1) − (X − λ_2)′ Σ_2^{−1} (X − λ_2) ]
  = −(1/2) log( det(Σ_1)/det(Σ_2) ) − (1/2) [ k − E_{P_1} (X − λ_1 + λ_1 − λ_2)′ Σ_2^{−1} (X − λ_1 + λ_1 − λ_2) ]
  = −(1/2) log( det(Σ_1)/det(Σ_2) ) − (1/2) [ k − tr( Σ_1 Σ_2^{−1} ) − (λ_1 − λ_2)′ Σ_2^{−1} (λ_1 − λ_2) ]
  = (1/2) (λ_1 − λ_2)′ Σ_2^{−1} (λ_1 − λ_2) + (1/2) [ tr( Σ_1 Σ_2^{−1} ) − k ] − (1/2) log( det(Σ_1)/det(Σ_2) ).

When Σ_1 = Σ_2 = Σ, KLIC(P_1, P_2) = (1/2) (λ_1 − λ_2)′ Σ^{−1} (λ_1 − λ_2). When λ_1 = λ_2,

KLIC(P_1, P_2) = (1/2) [ tr( Σ_1 Σ_2^{−1} ) − k ] − (1/2) [ log det(Σ_1) − log det(Σ_2) ].
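A small MATLAB sketch of the closed-form expression above (illustrative only; the example inputs are arbitrary):

    % KLIC between P1 = N(lam1, Sig1) and P2 = N(lam2, Sig2), as derived above.
    klic = @(lam1, Sig1, lam2, Sig2) ...
        0.5 * ( (lam1-lam2)' * (Sig2 \ (lam1-lam2)) ...
              + trace(Sig2 \ Sig1) - numel(lam1) ...
              - log(det(Sig1)/det(Sig2)) );
    klic([0;0], eye(2), [1;0], 2*eye(2))   % a simple numerical example
    klic([0;0], eye(2), [0;0], eye(2))     % identical distributions: returns 0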
What is the KLIC between a probability distribution P_1 with mean λ_1 and variance Σ_1 and P_2 := N(λ_2, Σ_2), where P_1 may not be normal? We can find this by:

E_{P_1} [ log( dP_1/dµ ) − log( dP_2/dµ ) ]
  = E_{P_1} log( dP_1/dµ ) + (1/2) E_{P_1} [ log det(Σ_2) + (X − λ_2)′ Σ_2^{−1} (X − λ_2) ]
  = E_{P_1} log( dP_1/dµ ) + C(Σ_1, Σ_2) + (1/2) (λ_1 − λ_2)′ Σ_2^{−1} (λ_1 − λ_2),

where C(Σ_1, Σ_2) does not depend on λ_2. Suppose P_1 is the true distribution, and we use P_2 to perform the ML estimation. Then according to our general theory, the MLE of the mean is consistent for the true mean, even if the normal distribution is mis-specified.
More generally, consider the linear model:

Y = X β_0 + U

where U may not be normal and may not be independent of X but satisfies E(U|X) = 0. The true pdf of (Y, X) can be written as g_{Y,X}(y, x) = g_{U|X=x}(y − xβ_0) g_X(x). Suppose we specify U to be N(0, σ²) and independent of X. Under this (mis)specification, the pdf of (Y, X) is f_{Y,X}(y, x) = (2πσ²)^{−1/2} exp( −(y − xβ)²/(2σ²) ) g_X(x). Now

KLIC(g_{Y,X}, f_{Y,X}) = E_{g_{Y,X}} log g_{Y,X}(Y, X) − E_{g_{Y,X}} log f_{Y,X}(Y, X)
  = E_{g_{Y,X}} log g_{Y,X}(Y, X) − E_{g_{Y,X}} log g_X(X) + (1/2) log 2πσ² + E_{g_{Y,X}} (Y − Xβ)²/(2σ²)

and

E_{g_{Y,X}} (Y − Xβ)²/(2σ²) = E_X E_{Y|X} (Y − Xβ)²/(2σ²) = E_X E_{Y|X} [ Y − E(Y|X) + E(Y|X) − Xβ ]²/(2σ²)
  = E_X var_{Y|X}(Y)/(2σ²) + E_X [ X(β_0 − β) ]²/(2σ²).

If KLIC(g_{Y,X}, f_{Y,X}) has a unique minimizer (β_o, σ²_o), then

β_o = β_0 and σ²_o = E_X var_{Y|X}(Y) = E_X var_{U|X}(U).

That is, under the misspecification, β̂_MLE is consistent for β_0, and σ̂²_MLE converges to the mean of the conditional variance of U.
(2) LS Estimator:
Case (i), the correctly specified case, i.e., there is a θ*_0 ∈ Θ such that E(Y_i|X_i) = g(X_i, θ*_0) almost surely. In this case, θ_0 = θ*_0.

2( Q(θ*_0) − Q(θ) ) = E(U_i)² − E( Y_i − g(X_i, θ) + g(X_i, θ*_0) − g(X_i, θ*_0) )²
  = E(U_i)² − E( U_i + g(X_i, θ*_0) − g(X_i, θ) )²
  = −E( g(X_i, θ*_0) − g(X_i, θ) )² − 2E[ ( g(X_i, θ*_0) − g(X_i, θ) ) U_i ]
  = −E( g(X_i, θ*_0) − g(X_i, θ) )² ≤ 0.

Note that E(U_i|X_i) = 0 implies E[ ( g(X_i, θ*_0) − g(X_i, θ) ) U_i ] = 0. The last inequality is strict, and so θ*_0 uniquely minimizes Q(θ), if P( g(X_i, θ) ≠ g(X_i, θ*_0) ) > 0 for all θ ∈ Θ different from θ*_0.
Case (ii), the model is not correctly specified. Define h(x) = E(Yi |Xi = x). Then,

Q(θ) = E(Yi − h(Xi ) + h(Xi ) − g(Xi , θ))2 /2


= E(Yi − h(Xi ))2 /2 + E(h(Xi ) − g(Xi , θ))2 /2

and θ0 uniquely minimizes Q(θ) over Θ if it uniquely minimizes E(h(Xi ) − g(Xi , θ))2 over Θ.
In other words, the LS estimator in this case converges to the point θ0 that gives the best
mean squared approximation in the family {g(·, θ) : θ 2 Θ} to the conditional mean of Yi
given Xi (provided the best approximation is unique).

(3) GMM Estimator: If A is nonsingular and there exists a unique vector θ∗0 2 Θ such
that Eg(Wi , θ∗0 ) = 0, then θ0 = θ∗0 , which uniquely minimizes Q(θ) over Θ.

(4) MD Estimator: If A is nonsingular and there exists a unique vector θ*_0 ∈ Θ such that π_0 = g(θ*_0), then θ*_0 uniquely minimizes Q(θ) over Θ. Suppose the restrictions are misspecified and there is no value of θ ∈ Θ such that π_0 = g(θ). Then θ_0 is the value that uniquely minimizes Q(θ) = ||A(π_0 − g(θ))||²/2, if such a value exists.

(5) TS/GMM and TS/MD Estimator: If A is nonsingular and there exists a unique
vector θ∗0 2 Θ such that G(θ∗0 , τ 0 ) = 0, then θ0 = θ∗0 is the value that uniquely minimizes Q(θ)
over Θ.

(6) TS NLS Estimator: Similar to the NLS estimator. Details are omitted here.

We now make Assumption ID1, which is more primitive than and implies Assumption ID.
Assumption ID1: (i) Θ is compact. (ii) Q is continuous. (iii) θ_0 uniquely minimizes Q(θ) over θ ∈ Θ.
ID1(i) is typically assumed. Continuity of Q holds if Q_n(θ) is continuous and E sup_{θ∈Θ} |Q_n(θ)| < ∞.
Show as an exercise that Assumption ID1 implies ID (proof by contradiction), but that no two of the three conditions in ID1 alone are enough to imply ID. The figures below illustrate that all three conditions are needed.

Θ is not compact

Q (θ) is not continuous



Q (θ) does not have a unique minimizer

5.3 Asymptotic Normality of Extremum Estimators


After having discussed "consistency" properties of extremum estimators, we now provide conditions under which extremum estimators are asymptotically normally distributed. The section is structured as follows. We specify assumptions and then show that they are sufficient for asymptotic normality. Then we discuss these assumptions for five particular extremum estimators, namely ML, Nonlinear Least Squares, GMM, MD and TS estimators.

5.3.1 Assumptions and the Theorem


As before, denote by Θ ⊂ R^d the parameter space. By C^i(A) we denote the space of functions that are i-times continuously differentiable on the set A.
For asymptotic normality we make the following two assumptions. The first one concerns the stochastic criterion function (CF) Q_n(θ), the second one the estimator itself.
Assumption CF: (i) θ_0 is in the interior of Θ.
(ii) Q_n(θ) ∈ C²(Θ_0) for some neighborhood Θ_0 ⊂ Θ of θ_0 (with probability one).
(iii) √n (∂/∂θ) Q_n(θ_0) →_d N(0, Ω_0).
(iv) sup_{θ∈Θ_0} || (∂²/∂θ∂θ′) Q_n(θ) − B(θ) || →_p 0 for some non-stochastic d × d matrix-valued function B(θ) that is continuous at θ_0 and for which B_0 := B(θ_0) is nonsingular.


Assumption EE2: (i) θ̂_n →_p θ_0.
(ii) (∂/∂θ) Q_n(θ̂_n) = o_p(n^{−1/2}).
Assumption EE2(i) assumes that we have already established consistency of θ̂_n, perhaps by using the results of the previous section. Assumption EE2(ii) requires that the first-order conditions for minimizing the criterion function Q_n(θ) hold approximately. This assumption allows one some leeway in computing the estimator, because it may be difficult and/or costly to find a value θ̂_n that exactly satisfies the first-order conditions.

Theorem 26 Under Assumptions CF and EE2, √n( θ̂_n − θ_0 ) →_d N(0, B_0^{−1} Ω_0 B_0^{−1}).

For the proof the following lemma, which can be viewed as a generalization of Slutsky’s
theorem, is helpful.

Lemma 27 Suppose (i) β̂_n →_p β_0 ∈ R^s, (ii) sup_{β∈B(β_0,ε)} |L_n(β) − L(β)| →_p 0 for some ε > 0, and (iii) the non-stochastic function L(β) is continuous at β_0. Then, L_n(β̂_n) →_p L(β_0).
0

Proof (lemma):

|L_n(β̂_n) − L(β_0)| ≤ |L_n(β̂_n) − L(β̂_n)| + |L(β̂_n) − L(β_0)|
  ≤ sup_{β∈B(β_0,ε)} |L_n(β) − L(β)| + |L(β̂_n) − L(β_0)| →_p 0,

where the first inequality holds by the triangle inequality. The second inequality holds with probability approaching 1 (wp → 1), because β̂_n ∈ B(β_0, ε) wp → 1. By assumption (ii) the first summand converges to zero in probability, and by assumptions (i), (iii) and Slutsky's theorem the second does so too. ∎

Proof (theorem): Using CF(i) and (ii) and EE2(ii), element-by-element mean value expansions of (∂/∂θ) Q_n(θ̂_n) about θ_0 yield

o_p(n^{−1/2}) = (∂/∂θ) Q_n(θ̂_n) = (∂/∂θ) Q_n(θ_0) + (∂²/∂θ∂θ′) Q_n(θ*_n) ( θ̂_n − θ_0 ),   (5.6)

where the mean value θ*_n lies on the segment joining θ̂_n and θ_0 (and hence satisfies θ*_n →_p θ_0 by EE2(i)) and may differ across the rows of (∂²/∂θ∂θ′) Q_n(θ*_n). Applying the lemma (using θ*_n →_p θ_0 and CF(iv)) we get

(∂²/∂θ∂θ′) Q_n(θ*_n) = B_0 + o_p(1).   (5.7)
Multiplying (5.6) by √n and substituting (5.7) gives

o_p(1) = √n (∂/∂θ) Q_n(θ_0) + ( B_0 + o_p(1) ) √n ( θ̂_n − θ_0 ).

Using the nonsingularity of B_0 (CF(iv)) and rearranging yields

√n ( θ̂_n − θ_0 ) = −( B_0 + o_p(1) )^{−1} √n (∂/∂θ) Q_n(θ_0) + o_p(1)
  →_d N(0, B_0^{−1} Ω_0 B_0^{−1}),

using CF(iii) for the last step. ∎

We now provide sufficient conditions for Assumption CF and discuss the form of the covariance matrix B_0^{−1} Ω_0 B_0^{−1} for each of the examples introduced in the previous section.

5.3.2 ML Estimator
Recall Q_n(θ) := −(1/n) Σ_{i=1}^n log f(W_i, θ). Thus we have

(∂/∂θ) Q_n(θ) = −(1/n) Σ_{i=1}^n (∂/∂θ) log f(W_i, θ),
(∂²/∂θ∂θ′) Q_n(θ) = −(1/n) Σ_{i=1}^n (∂²/∂θ∂θ′) log f(W_i, θ),
Ω_0 = E [ (∂/∂θ) log f(W_i, θ_0) (∂/∂θ′) log f(W_i, θ_0) ],
B(θ) = −E (∂²/∂θ∂θ′) log f(W_i, θ).

Assumption CF(ii) holds if f(w, θ) ∈ C²(Θ_0) on some neighborhood Θ_0 ⊂ Θ of θ_0 for all w in the support W of W_i.

CF(iii) holds by the central limit theorem (CLT) for iid random vectors with finite second moment provided

E (∂/∂θ) Q_n(θ_0) = 0 and E || (∂/∂θ) log f(W_i, θ_0) ||² < ∞.   (5.8)

The former condition in (5.8) holds by the first-order conditions for minimization of Q(θ) over Θ, assuming that Q is differentiable at θ_0, provided θ_0 is an interior point of Θ_0. That is,

0 = (∂/∂θ) Q(θ_0) = −(∂/∂θ) E log f(W_i, θ_0) = −E (∂/∂θ) log f(W_i, θ_0) = E (∂/∂θ) Q_n(θ_0),

provided the interchange of the order of E and ∂/∂θ is justified. Sufficient conditions for this are that 1) log f(w, θ) ∈ C¹(Θ_0) on some neighborhood Θ_0 ⊂ Θ of θ_0 for all w ∈ W, and 2) E sup_{θ∈Θ_0} || (∂/∂θ) log f(W_i, θ) || < ∞. These conditions also imply that Q is differentiable at θ_0. Note that 0 = (∂/∂θ) Q(θ_0) holds by definition of θ_0 (as the value that minimizes Q(θ), recall Assumption ID) whether or not the model is correctly specified.
The second condition in (5.8) is equivalent to requiring the information matrix at θ_0 to be well defined, since the latter equals

I_0 = E [ (∂/∂θ) log f(W_i, θ_0) (∂/∂θ′) log f(W_i, θ_0) ].
Assumption CF(iv) states that { (∂²/∂θ∂θ′) log f(W_i, θ) : i ≥ 1 } satisfies a uniform WLLN over θ ∈ Θ_0, with continuous limit B(θ) that is invertible at θ_0. By Theorem 25 of the previous section, the uniform WLLN holds and B(θ) is continuous at θ_0 if (∂²/∂θ∂θ′) log f(w, θ) is continuous in θ on Θ_0 for all w ∈ W,

E sup_{θ∈Θ_0} || (∂²/∂θ∂θ′) log f(W_i, θ) || < ∞,

and Θ_0 is compact.
The information matrix equality I_0 = −E (∂²/∂θ∂θ′) log f(W_i, θ_0) holds if the parametric model is correctly specified and one can switch the order of differentiation and integration in the definition of B_0 = B(θ_0). More specifically, under the correct specification,

E [ (∂/∂θ_0) log f(W, θ_0) ] = ∫ [ (∂/∂θ_0) f(w, θ_0) / f(w, θ_0) ] f(w, θ_0) dw
  = ∫ (∂/∂θ_0) f(w, θ_0) dw = (∂/∂θ_0) ( ∫ f(w, θ_0) dw ) = ∂1/∂θ_0 = 0.   (5.9)

That is,

∫ [ (∂/∂θ_0) log f(w, θ_0) ] f(w, θ_0) dw = 0.

Taking another derivative on both sides, we have

∫ [ (∂/∂θ_0) log f(w, θ_0) ] [ (∂/∂θ′_0) f(w, θ_0) ] dw + ∫ [ (∂²/∂θ_0∂θ′_0) log f(w, θ_0) ] f(w, θ_0) dw = 0,

i.e., L M
@ @ @2
E log f (w, θ0 ) 0 f (w, θ0 ) = −E log f (w, θ0 ). (5.10)
@θ0 @θ0 @θ0 @θ00
The information matrix equality is

B0 = Ω0 .
p b
Hence, in this case, the asymptotic covariance matrix of n(θn − θ0 ) simplifies to B0−1 , the
inverse of the information matrix.
We call (5.9) and (5.10) the Bartlett identities of order one and two. Taking higher order
derivatives will give us more identities.
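As a numerical aside (not part of the formal development), the information matrix equality $B_0 = \Omega_0$ can be checked by simulation for any correctly specified parametric model. The Python sketch below uses an exponential model; the model choice, sample size, and seed are assumptions of the illustration only. It estimates $\Omega_0$ by the average outer product of scores and $B_0$ by the average negative Hessian of $\log f$, both evaluated at the true parameter.

```python
import numpy as np

# Correctly specified exponential model: f(w, theta) = theta * exp(-theta * w).
# Score: d/dtheta log f = 1/theta - w;  Hessian: d2/dtheta2 log f = -1/theta^2.
rng = np.random.default_rng(0)
theta0 = 2.0
n = 200_000
w = rng.exponential(scale=1.0 / theta0, size=n)

score = 1.0 / theta0 - w                        # score evaluated at theta0
Omega0_hat = np.mean(score ** 2)                # average outer product of scores
B0_hat = np.mean(np.full(n, 1.0 / theta0**2))   # average of minus the Hessian

print(Omega0_hat, B0_hat)  # both approximate 1/theta0^2 = 0.25, i.e. B0 = Omega0
```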

5.3.3 NLS Estimator


1 Pn
Recall Qn (θ) := n i=1 (Yi − g(Xi , θ))2 /2. Thus we have
n
@ 1X @
Qn (θ) = − (Yi − g(Xi , θ)) g(Xi , θ),
@θ n @θ
i=1
n U V
@2 1X @ @ @2
Qn (θ) = g(Xi , θ) 0 g(Xi , θ) − (Yi − g(Xi , θ)) g(Xi , θ) ,
@θ@θ0 n
i=1
@θ @θ @θ@θ0
@ @
Ω0 = EUi2 g(Xi , θ0 ) 0 g(Xi , θ0 ), where Ui = Yi − g(Xi , θ0 ), and
@θ @θ
@ @ @2
B(θ) = E g(Xi , θ) 0 g(Xi , θ) − E(Yi − g(Xi , θ)) g(Xi , θ).
@θ @θ @θ@θ0
Assumption CF(ii) holds if g(x, ·) 2 C 2 (Θ0 ) for all x in the support X of the random variable
Xi , where Θ0 ⊂ Θ is some neighborhood
P of θ0 .
@
Recall n1/2 @θ Qn (θ0 ) = −n−1/2 ni=1 Ui @θ
@
g(Xi , θ0 ). Therefore assumption CF(iii) holds by
the CLT provided that
@ @
E Qn (θ0 ) = 0 and E||Ui g(Xi , θ0 )||2 < 1.
@θ @θ
The first condition holds because, by definition of θ0 , θ0 minimizes Q(θ) over Θ. As in the ML
example, since θ0 is assumed to be an interior point of Θ, the first order conditions for the
minimization of Q(θ) give

@ @
0 = Q(θ0 ) = E(Yi − g(Xi , θ0 ))2 /2
@θ @θ
@ @
= E (Yi − g(Xi , θ0 ))2 /2 = E Qn (θ0 ),
@θ @θ
provided the interchange of the E and @/@θ operators in the third equality is justified. (Suf-
ficient conditions for that are that g(x, ·) 2 C 1 (Θ0 ) for all x 2 X and E supθ2Θ0 ||(Yi −
@ @
g(Xi , θ)) @θ g(Xi , θ)|| < 1.) As in the ML example, E @θ Qn (θ0 ) = 0 holds by definition of θ0
whether or not the model is correctly specified.

Assumption CF(iv) holds by Theorem 25 provided g(x, ·) 2 C 2 (Θ0 ) for all x 2 X (as
assumed above) and

@ @ @2
E sup || g(Xi , θ) 0 g(Xi , θ) − (Yi − g(Xi , θ)) g(Xi , θ)|| < 1,
θ2Θ0 @θ @θ @θ@θ0

Θ0 is compact and B0 is nonsingular.


p
Under certain additional assumptions the asymptotic covariance B0−1 Ω0 B0−1 of n(b θn −θ0 )
simplifies. If the model is correctly specified (i.e. E(Ui |Xi ) = 0 a.s.), then we have
@ @
B0 = E g(Xi , θ0 ) 0 g(Xi , θ0 ) and
@θ @θ
@ @
Ω0 = Eσ 2 (Xi ) g(Xi , θ0 ) 0 g(Xi , θ0 ), where
@θ @θ
σ 2 (Xi ) : = E(Ui2 |Xi ).

If in addition, the errors Ui are homoskedastic, i.e. σ 2 (Xi ) ≡ σ 2 a.s. for some σ 2 > 0 (σ 2 (Xi )
does not depend on the realization of Xi ), then Ω0 = σ 2 B0 and thus B0−1 Ω0 B0−1 = σ 2 B0−1 .
The above holds only when the model is correctly specified. In a misspecified model, the
@
definition of θ0 only ensures that EUi @θ g(Xi , θ0 ) = 0. This does not imply that E (Ui |Xi ) = 0
a.s.

5.3.4 GMM
We consider only the case P with correctly specified moment conditions here.
Recall Qn (θ) := ||An n ni=1 g(Wi , θ)||2 /2 for θ 2 Θ ⊂ Rd for some function g that maps
1

into Rk . Thus we have


n n
@ 1X @ 0 0 1X
Qn (θ) = [ 0 g(Wi , θ)] An An [ g(Wi , θ)],
@θ n @θ n
i=1 i=1
n n
@2 1X @ 1X @
[ Qn (θ)]`,j = [ g(Wi , θ)0 ]A0n An [ g(Wi , θ)] (5.11)
@θ@θ0 n @θ`
i=1
n @θj
i=1
Xn n
1 @2 1X
+[ g(Wi , θ)0 ]A0n An [ g(Wi , θ)],
n @θ` @θj n
i=1 i=1

for `, j = 1, ..., d,

Ω0 = Γ00 A0 AV0 A0 AΓ0 , where


V0 : = Eg(Wi , θ0 )g(Wi , θ0 )0 ,
@ p
Γ0 : = E 0 g(Wi , θ0 ), and An ! A.

@ @
[B(θ)]`,j = E g(Wi , θ)0 A0 AE g(Wi , θ)
@θ` @θj
@2
+E g(Wi , θ)0 A0 AEg(Wi , θ),
@θ` @θj

for `, j = 1, ..., d, and


B0 = Γ00 A0 AΓ0 .
Assumption CF(ii) holds if g(w, ·) 2 C 2 (Θ0 ) for all w in the support W of Wi .
P
Assumption CF(iii) holds by the CLT applied to n−1/2 ni=1 g(Wi , θ0 ) because
P p
Eg(Wi , θ0 ) = 0, by the WLLN applied to n1 ni=1 @θ@ 0 g(Wi , θ0 ), and by the assumption An ! A.
The CLT and WLLN hold under the primitive assumptions
@
E||g(Wi , θ0 )||2 < 1 and E|| g(Wi , θ0 )|| < 1.
@θ0
2
Assumption CF(iv) holds if { @θ@` @θj g(Wi , θ) : i ≥ 1}, for `, j = 1, ..., d, { @θ@ 0 g(Wi , θ) : i ≥ 1},
and {g(Wi , θ) : i ≥ 1} satisfy uniform WLLNs over Θ0 , Γ0 is full rank, A is nonsingular, and
2
E @θ@` @θj g(Wi , θ), for `, j = 1, ...d, E @θ@ 0 g(Wi , θ), and Eg(Wi , θ) are continuous at θ0 . Again
by applying Theorem 25, primitive assumptions for CF(iv) are that g(w, ·) 2 C 2 (Θ0 ) for all
w 2 W (as assumed above) and

E sup ||g(Wi , θ)|| < 1,


θ2Θ0
@
E sup || g(Wi , θ)|| < 1,
θ2Θ0 @θ
@2
E sup || g(Wi , θ)|| < 1,
θ2Θ0 @θ` @θj

for all `, j = 1, ...d, and Θ0 is compact.


Again, in specific cases the asymptotic covariance matrix $B_0^{-1}\Omega_0 B_0^{-1}$ of $\sqrt{n}(\hat\theta_n - \theta_0)$ simplifies. Assume $k = d$ (the "just identified" case, where the number of moment conditions equals the dimension of $\theta_0$, as opposed to the "over identified" case, where $k > d$). Then $\Gamma_0$ and $A$ are nonsingular matrices, so $B_0^{-1} = \Gamma_0^{-1}A^{-1}(A')^{-1}(\Gamma_0')^{-1}$, and after many cancellations $B_0^{-1}\Omega_0 B_0^{-1}$ simplifies to $\Gamma_0^{-1}V_0(\Gamma_0')^{-1}$.
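The "many cancellations" are easy to verify numerically. The sketch below is only an illustration: the matrices $\Gamma_0$, $V_0$ and the weight $C = A'A$ are arbitrary assumptions, not estimates from any model, and the point is simply that the general sandwich (5.13) coincides with $\Gamma_0^{-1}V_0(\Gamma_0')^{-1}$ when $k = d$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
Gamma0 = rng.normal(size=(d, d))               # square Jacobian: k = d, exactly identified
V0 = rng.normal(size=(d, d)); V0 = V0 @ V0.T   # an arbitrary psd "moment variance"
A = rng.normal(size=(d, d)); C = A.T @ A       # an arbitrary weight matrix C = A'A

# General sandwich (5.13) with Sigma0 = V0
bread = np.linalg.inv(Gamma0.T @ C @ Gamma0)
sandwich = bread @ Gamma0.T @ C @ V0 @ C @ Gamma0 @ bread

# Just-identified simplification: Gamma0^{-1} V0 (Gamma0')^{-1}
Ginv = np.linalg.inv(Gamma0)
simplified = Ginv @ V0 @ Ginv.T

print(np.allclose(sandwich, simplified))       # True: the weight matrix cancels when k = d
```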

5.3.5 MD Estimator
Again we consider only the case with correct specification in some details here.
Recall that Qn (θ) = kAn (π̂ n − g(θ))k2 /2. We have
H I0
@ @
Qn (θ) = − g(θ) A0n An (π̂ n − g(θ))
@θ @θ0
L 2 M H I0
@ @ @ @2
Qn (θ) = − g(θ) A0n An g(θ) − g(θ)0 A0n An (π̂ n − g(θ))
@θ@θ `,j @θ ` @θ j @θ ` @θ j

for `, j = 1, ..., d. We assume that


p d p
n(π̂ n − π 0 ) ! N (0, V0 ) and An ! A. (5.12)

Given this, we have


@
Ω0 = Γ00 A0 AV0 A0 AΓ0 where Γ0 = g(θ)|θ=θ0
@θ0
and H I0
@ @ @2
[B(θ)]`j = − g(θ) A0 A g(θ) − g(θ)0 A0 A(π 0 − g(θ))
@θ` @θj @θ` @θj
for `, j = 1, ..., d.
Assumption CF(ii) holds if g(θ) 2 C 2 (Θ0 ) for some neighborhood Θ0 of θ0 and Θ0 ⊂ Θ.
CF(iii) holds by (5.12) provided that the restriction π 0 = g(θ0 ) holds. Note that π 0 = g (θ0 )
implies that B0 simplifies, since the second summand of [B(θ)]`j equals zero:

B0 = Γ00 A0 AΓ0 .

CF(iv) holds under the assumptions given above, provided that Γ0 and A are full rank.

5.3.6 Two Step GMM or MD Estimator


We consider the case that
EGn (θ, τ ) = 0
if and only if θ = θ0 and τ = τ 0 .
Recall that
Qn (θ) = kAn Gn (θ, τ̂ n )k2 /2.
We have
H I0
@ @
Qn (θ) = Gn (θ, τ̂ n ) A0n An Gn (θ, τ̂ n )
@θ @θ0
L 2 M H I0
@ @ @
Qn (θ) = Gn (θ, τ̂ n ) A0n An Gn (θ, τ̂ n )
@θ@θ `,j @θ` @θj
H I0
@2
+ Gn (θ, τ̂ n ) A0n An Gn (θ, τ̂ n )
@θ` @θj
For brevity, we do not give su¢cient conditions for Assumption CF in this example. We
show what B0 and Ω0 equal in the example by making intermediate, rather than primitive
assumptions. We assume the following:
0 1 0 1 0 0 11
p Gn (θ0 , τ 0 ) d @ Z1 V 10 V 20
n@ A! A s N @0, @ AA
τ̂ n − τ 0 Z2 V200 V
30

@ p @ p
Gn (θ0 , τ̂ n ) ! G(θ0 , τ 0 ) = Γ0 , An ! A
@θ0 @θ0
@ p @
0
Gn (θ0 , τ ∗n ) ! G(θ0 , τ 0 ) = Λ0
@τ @τ 0
W 2 W
W @ W p
sup WW @θ@θ Qn (θ) − B(θ)W ! 0
W
θ2Θ0

for any τ ∗n = τ 0 + op (1) where


H I0
@ @
[B(θ)]`,j = E Gn (θ, τ 0 ) A0n An E Gn (θ, τ 0 )
@θ` @θj
H I0
@2
+ E Gn (θ, τ 0 ) A0n An EGn (θ, τ 0 )
@θ` @θj

and
B(θ0 ) = Γ00 A0 AΓ0 .
p
To find the asymptotic distribution of N (0, Ω0 ) of n@Qn (θ)/@θ, as required for CF (ii), we
p
carry out element-by-element mean value expansions of n@Qn (θ)/@θ about τ 0 and use the
above assumptions:
H I0
p @ @ p
n Qn (θ0 ) = 0 Gn (θ 0 , τ̂ n ) A0n An nGn (θ0 , τ̂ n )
@θ @θ
L M
0 0 p @ ∗ p
= (Γ0 + op (1)) An An nGn (θ0 , τ 0 ) + 0 Gn (θ0 , τ̂ n ) n (τ̂ n − τ 0 )

0 1
# $p G (θ , τ )
. n 0 0
= (Γ0 + op (1))0 A0n An Ik .. Λ0 + op (1) n@ A
τ̂ n − τ 0
! d Γ00 A0 A (Z1 + Λ0 Z2 ) s N (0, Ω0 )

where * +
Ω0 = Γ00 A0 A V10 + Λ0 V20
0
+ V20 Λ00 + Λ0 V30 Λ00 A0 AΓ0 .
Note that if Λ0 = 0, then Ω0 simplifies to an expression that is the same as one would get
p
if τ 0 replaced τ̂ n in Qn (θ). In this case, the asymptotic distribution of n(θ̂n − θ0 ) is the same
whether τ 0 is known or estimated. In general, however, Λ0 6= 0 and the estimator of τ 0 by τ̂ n
a§ects the limit distribution of θ̂n .

Example 28 2SLS with exactly identified model: Y = Xθ + u, Z is the instrument so that

X = Zτ 0 + eX .

For simplicity, here we have assumed that there is no intercept in the model. Let τ̂ n be the
first stage OLS estimator. For the 2SLS, we have An = 1,
n
1X
Gn (θ, τ̂ n ) = [Yi − (Zi τ̂ n ) θ] (Zi τ̂ n ) ,
n
i=1
G(θ, τ ) = E [Y − (Zτ ) θ] (Zτ ) ,

and
@ * +
G(θ0 , τ 0 ) = E [Y − (Zτ 0 ) θ0 ] Z − EZ 2 τ 0 θ0 .

So in general, * +
Λ0 = − EZ 2 τ 0 θ0 6= 0.

We have
$$
\begin{aligned}
\sqrt{n}\,G_n(\theta_0,\hat\tau_n) &= \frac{1}{\sqrt{n}}\sum_{i=1}^n [Y_i-(Z_i\tau_0)\theta_0]Z_i\tau_0 - \frac{\tau_0}{n}\sum_{i=1}^n Z_i^2\,\theta_0\,\sqrt{n}(\hat\tau_n-\tau_0) + o_p(1)\\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n [Y_i-(Z_i\tau_0)\theta_0]Z_i\tau_0 - \frac{\tau_0}{n}\sum_{i=1}^n Z_i^2\,\theta_0\left(\frac{1}{n}\sum_{i=1}^n Z_i^2\right)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n Z_ie_{i,X} + o_p(1)\\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n [Y_i-(Z_i\tau_0)\theta_0]Z_i\tau_0 - \tau_0\theta_0\frac{1}{\sqrt{n}}\sum_{i=1}^n Z_ie_{i,X} + o_p(1)\\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n \tau_0Z_i\big\{[Y_i-(Z_i\tau_0)\theta_0] - \theta_0e_{i,X}\big\} + o_p(1)\\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n \tau_0Z_i\{Y_i - (Z_i\tau_0 + e_{i,X})\theta_0\} + o_p(1)\\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^n \tau_0Z_i\{Y_i - X_i\theta_0\} + o_p(1).
\end{aligned}
$$
Except for a multiplicative scalar, which has no effect on the asymptotic variance of $\hat\theta_{IV}$, this is exactly the moment condition behind the IV estimator. The correction leads to the expected moment condition. If we were to ignore the estimation error in $\hat\tau_n$, we would approximate the distribution of $\sqrt{n}G_n(\theta_0,\hat\tau_n)$ by that of
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \tau_0Z_i[Y_i-(Z_i\tau_0)\theta_0].$$
There is clearly a discrepancy between the above and $\frac{1}{\sqrt{n}}\sum_{i=1}^n \tau_0Z_i\{Y_i - X_i\theta_0\}$. Ignoring the estimation error in $\hat\tau_n$ therefore leads to an inconsistent estimator of the asymptotic variance of the 2SLS estimator. Note that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \tau_0Z_i\{Y_i - X_i\theta_0\} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \tau_0Z_iu_i, \qquad
\frac{1}{\sqrt{n}}\sum_{i=1}^n \tau_0Z_i[Y_i-(Z_i\tau_0)\theta_0] = \frac{1}{\sqrt{n}}\sum_{i=1}^n \tau_0Z_i[\theta_0e_{i,X}+u_i].$$
The ratio of the asymptotic variances based on the above two expressions is
$$\rho = \frac{\mathrm{var}(u)}{\mathrm{var}(\theta_0e_X+u)}.$$
The ratio is 1 if $\theta_0 = 0$; in this case, the correction is asymptotically ignorable.
Suppose $u = \kappa e_X + v$ for some $v$ such that $\mathrm{cov}(e_X,v)=0$. Then
$$\rho = \frac{\mathrm{var}(u)}{\mathrm{var}(\theta_0e_X+\kappa e_X+v)} = \frac{\kappa^2\mathrm{var}(e_X)+\mathrm{var}(v)}{(\theta_0+\kappa)^2\mathrm{var}(e_X)+\mathrm{var}(v)},$$
and $\rho > 1$ if and only if $\kappa^2 > (\theta_0+\kappa)^2$.
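The variance ratio can be illustrated by a small Monte Carlo. The sketch below uses an assumed design (one instrument, normal errors, and specific values of $\theta_0$, $\tau_0$, $\kappa$); it is not part of the argument above, only a check that the Monte Carlo variance of $\sqrt{n}(\hat\theta_{2SLS}-\theta_0)$ matches the "corrected" variance based on $\tau_0Z_iu_i$ rather than the naive one based on $\tau_0Z_i(\theta_0e_{i,X}+u_i)$.

```python
import numpy as np

# Assumed design: Y = X*theta0 + u,  X = Z*tau0 + eX,  u = kappa*eX + v,
# with Z, eX, v iid N(0,1) and tau0 = 1, so E[ZX] = 1 and E[Z^2] = 1.
rng = np.random.default_rng(2)
theta0, tau0, kappa = 1.0, 1.0, 0.8
n, reps = 2_000, 2_000
stats = np.empty(reps)
for r in range(reps):
    Z, eX, v = rng.normal(size=(3, n))
    u = kappa * eX + v
    X = tau0 * Z + eX
    Y = theta0 * X + u
    theta_hat = (Z @ Y) / (Z @ X)        # 2SLS = IV in the exactly identified case
    stats[r] = np.sqrt(n) * (theta_hat - theta0)

var_u = kappa**2 + 1.0                   # var(u): the corrected moment variance
var_naive = (theta0 + kappa)**2 + 1.0    # var(theta0*eX + u): ignores the first stage
print(np.var(stats))                     # about 1.64, matching var_u (since E[ZX] = 1)
print(var_u, var_naive)                  # 1.64 vs 4.24: the naive variance is off
```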



5.3.7 Two Step WLS Estimator

Recall that
$$Q_n(\theta) = \frac{1}{2n}\sum_{i=1}^n \frac{(Y_i - g(X_i,\theta))^2}{\sigma^2(X_i,\hat\tau_n)}.$$
We have
$$\frac{\partial}{\partial\theta}Q_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \frac{Y_i - g(X_i,\theta)}{\sigma^2(X_i,\hat\tau_n)}\,\frac{\partial g(X_i,\theta)}{\partial\theta},$$
and so
$$\sqrt{n}\,\frac{\partial}{\partial\theta}Q_n(\theta_0) = -\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{Y_i - g(X_i,\theta_0)}{\sigma^2(X_i,\hat\tau_n)}\,\frac{\partial g(X_i,\theta_0)}{\partial\theta}
= -\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{Y_i - g(X_i,\theta_0)}{\sigma^2(X_i,\tau_0)}\,\frac{\partial g(X_i,\theta_0)}{\partial\theta} + \Lambda_n\sqrt{n}(\hat\tau_n-\tau_0) + o_p(1),$$
where
$$\Lambda_n = \frac{1}{n}\sum_{i=1}^n \frac{Y_i - g(X_i,\theta_0)}{[\sigma^2(X_i,\tau_0)]^2}\,\frac{\partial g(X_i,\theta_0)}{\partial\theta}\,\frac{\partial\sigma^2(X_i,\tau_0)}{\partial\tau'}.$$
Under correct specification, we have
$$\Lambda_n \to_p E\left[\frac{Y_i - g(X_i,\theta_0)}{[\sigma^2(X_i,\tau_0)]^2}\,\frac{\partial g(X_i,\theta_0)}{\partial\theta}\,\frac{\partial\sigma^2(X_i,\tau_0)}{\partial\tau'}\right]
= E\left[\frac{E\{Y_i - g(X_i,\theta_0)\mid X_i\}}{[\sigma^2(X_i,\tau_0)]^2}\,\frac{\partial g(X_i,\theta_0)}{\partial\theta}\,\frac{\partial\sigma^2(X_i,\tau_0)}{\partial\tau'}\right] = 0.$$
If, in addition, $\sqrt{n}(\hat\tau_n-\tau_0) = O_p(1)$, then
$$\sqrt{n}\,\frac{\partial}{\partial\theta}Q_n(\theta_0) = -\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{Y_i - g(X_i,\theta_0)}{\sigma^2(X_i,\tau_0)}\,\frac{\partial g(X_i,\theta_0)}{\partial\theta} + o_p(1).$$
This is an example where the asymptotic distribution of $\sqrt{n}(\hat\theta_n-\theta_0)$ is the same regardless of whether $\tau_0$ is known or estimated. However, in finite samples, the estimation error in $\hat\tau_n$ may still matter. It may pay to employ the two-step correction anyway.

5.4 Covariance Matrix Estimation


In this section we consider estimation of the asymptotic covariance matrix B0−1 Ω0 B0−1 of the
extremum estimator b θn .
By Lemma 27, Assumptions EE2 and CF(iv) imply that
2
b −1 := ( @ Qn (b
B
p
θn ))−1 ! B0−1 .
n 0
@θ@θ

Therefore, it remains to find a consistent estimator of Ω0 . The general principle employed


is that of forming estimators by replacing expectations with sample averages and unknown
parameters with consistent estimators of them. Then, Lemma 27 can be used to establish
b n . We consider each of the five examples ML, NLS,
consistency of the resulting estimator Ω
GMM,MD and TS in turn.

(1) ML Estimator: Let


X @2 n
bn : = 1
B log f (Wi , b
θn ) and
n @θ@θ0
i=1
n
1X @ @
bn
Ω : = log f (Wi , b
θn ) 0 log f (Wi , b
θn ).
n @θ @θ
i=1
p
We obtain Ω b n ! Ω0 by verifying conditions (i)-(iii) of Lemma 27. Condition (i) holds by
consistency of b θn . Condition (ii) and (iii) hold by the uniform WLLN of Theorem 25, provided
@
@θ log f (w, θ) is continuous in θ on Θ0 8w 2 W (as was assumed in the previous section) and
@
E supθ2Θ0 || @θ log f (Wi , θ)||2 < 1, where Θ0 is a compact neighborhood of θ0 .
If the model is correctly specified, then B0 = Ω0 and the covariance matrix B0−1 Ω0 B0−1
can be estimated by B b −1 Ω
b nB b −1 , B
b −1 , or Ω
b −1 . Note that Ω
b n only requires calculation of the
n n n n
b
first derivative of f (w, θ), whereas Bn requires calculation of the second derivatives.

(2) NLS Estimator: Let


n U V
b 1X @ b @ b b @2 b
Bn : = g(Xi , θn ) 0 g(Xi , θn ) − (Yi − g(Xi , θn )) g(Xi , θn ) and
n @θ
i=1
@θ @θ@θ0
n
1X @ @
bn
Ω : = (Yi − g(Xi , b
θn ))2 g(Xi , b
θn ) 0 g(Xi , b
θn ).
n @θ @θ
i=1
p
We obtain Ω b n ! Ω0 by verifying conditions (i)-(iii) of Lemma 27. Condition (i) holds by
consistency of b
θn . Condition (ii) and (iii) hold by Theorem 25, provided g(x, ·) 2 C 1 (Θ0 ) 8x 2
@
X (as was assumed in the previous section) and E supθ2Θ0 ||(Yi − g(Xi , θ)) @θ g(Xi , θ)||2 < 1.
If the regression model is correctly specified, E(Yi |Xi ) = g(Xi , θ0 ) a.s., then B0 simplifies
and Bbn can be simplified correspondingly. Let
X @n
en := 1
B g(Xi , b
@
θn ) 0 g(Xi , b
θn ).
n @θ @θ
i=1
p
en ! B0 when CF(iv) holds. So, a consistent covariance matrix
In the correctly specified case, B
estimator for a correctly specified regression model is given by B b nB
en−1 Ω e −1 . Note that this
n
estimator allows for conditional heteroskedasticity of the errors — i.e., it is a “heteroskedasticity
consistent” covariance matrix estimator.
If the model is correctly specified and the errors are conditionally homoskedastic, then
Ω0 = σ 2 B0 and B e −1 Ω
b nB
e −1 can be replaced by the simpler estimator
n n
n
1X
e −1 , where σ
b B
σ n
2
b2 := (Yi − g(Xi , b
θn ))2 .
n
i=1

Note that for a linear regression model, $\hat\sigma^2\widetilde B_n^{-1} = \hat\sigma^2\big(\tfrac{1}{n}\sum_{i=1}^n X_i'X_i\big)^{-1}$.
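In the linear special case these estimators reduce to the familiar OLS formulas. The sketch below is an illustration on simulated data (the design and the heteroskedastic error form are assumptions of the sketch): it computes both the heteroskedasticity-consistent sandwich $\widetilde B_n^{-1}\hat\Omega_n\widetilde B_n^{-1}$ and the homoskedastic simplification $\hat\sigma^2\widetilde B_n^{-1}$.

```python
import numpy as np

# Linear regression special case of the NLS covariance formulas (simulated illustration).
rng = np.random.default_rng(3)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta0 = np.array([1.0, 2.0])
u = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))      # conditionally heteroskedastic errors
Y = X @ theta0 + u

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)         # NLS = OLS in the linear case
resid = Y - X @ theta_hat

B_tilde = (X.T @ X) / n                               # (1/n) sum of g_theta g_theta'
Omega_hat = (X * resid[:, None]**2).T @ X / n         # (1/n) sum of u_i^2 X_i' X_i
robust = np.linalg.inv(B_tilde) @ Omega_hat @ np.linalg.inv(B_tilde) / n
homosk = resid.var() * np.linalg.inv(B_tilde) / n     # sigma^2_hat * B_tilde^{-1} / n

print(np.sqrt(np.diag(robust)))   # heteroskedasticity-consistent standard errors
print(np.sqrt(np.diag(homosk)))   # misleading here because the errors are heteroskedastic
```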

(3) GMM Estimator: Let


Xn n
bn : = [ 1 @ b 0 0 1X @ b
B 0 g(Wi , θ n )] An An [ 0 g(Wi , θ n )] and
n @θ n @θ
i=1 i=1
Xn n n
bn : = [ 1 @ b 0 0 1X b b 0 0 1X @ b
Ω 0 g(W i , θ n )] A n A n [ g(W i , θ n )g(W i , θ n ) ]A n A n [ 0 g(Wi , θ n )].
n @θ n n @θ
i=1 i=1 i=1

Note that the definition of Bbn does not include the second summand of @ 2 0 Qn (b θn ) in (5.11).
@θ@θ
The reason is that the second summand converges in probability to zero because Eg(Wi , θ0 ) =
0 and, hence, can be omitted.
Each component of Ω b n has been shown in the previous section to converge in probability to
P
the corresponding component of Ω0 . The only exception is the component n1 ni=1 g(Wi , b θn )g(Wi , b
θn )0 .
The latter converges in probability to Eg(Wi , θ0 )g(Wi , θ0 )0 by Lemma 27 and Theorem 25 pro-
p
vided b
θn ! θ0 , g(w, ·) 2 C 0 (Θ0 ) 8w 2 W and E supθ2Θ0 ||g(Wi , θ)||2 < 1.

(4) MD Estimator: Let


H I0
@ b @
bn =
B 0 g(θ n ) A0n An 0 g(b
θn )
@θ @θ
and H I0
@ b @
bn =
Ω g(θ n ) A0n An Vbn A0n An 0 g(b
θn )
@θ0 @θ
p
where Vbn is some consistent estimator of V0 , the asymptotic variance of n(π̂ n − π 0 ). Note
that the definition of Bbn does not include the second summand of @ 2 Qn (b θ) because the
@θ@θ
latter converges to zero in probability given that π 0 = g(θ0 ).

(5) TS GMM/MD Estimator: Let


H I0
b @
Bn = Gn (θ̂n , τ̂ n ) A0n An Gn (θ̂n , τ̂ n )
@θ0

H I0 # $
@ 0 0 0
Ω̂n = G (θ̂
n n n, τ̂ ) A A
n n V̂ 1n + Λ̂ V̂
n 2n + V̂ Λ̂
2n n + Λ̂ V Λ
n 3n n
@θ0
@
×A0n An 0 Gn (θ̂n , τ̂ n )

where
@
Λ̂n = Gn (θ̂n , τ̂ n )
@τ 0
and V̂jn are some consistent estimator of Vj0 for j = 1, 2, 3. If Λ0 is zero, as occurs in some
cases, such as feasible GLS estimation, then one can take Λ̂n = 0 and the estimators V̂2n and
V̂3n are not required.

5.5 Optimal Weight Matrix

The asymptotic covariance matrix of the GMM, MD and TS estimators is of the form
$$(\Gamma_0'C\Gamma_0)^{-1}\Gamma_0'C\Sigma_0C\Gamma_0(\Gamma_0'C\Gamma_0)^{-1}, \qquad (5.13)$$
where $C := A'A$ and $\Sigma_0$ is a symmetric positive semidefinite matrix that depends on the estimator. We will show that the optimal choice of weight matrix $A_n$ is one for which $A'A = \Sigma_0^{-1}$, where $A_n \to_p A$.
When this last condition holds, the asymptotic covariance in (5.13) simplifies to $(\Gamma_0'\Sigma_0^{-1}\Gamma_0)^{-1}$. This choice minimizes the asymptotic covariance matrix of $\hat\theta_n$ because, as we now show,
$$(\Gamma_0'C\Gamma_0)^{-1}\Gamma_0'C\Sigma_0C\Gamma_0(\Gamma_0'C\Gamma_0)^{-1} - (\Gamma_0'\Sigma_0^{-1}\Gamma_0)^{-1} \ge 0 \qquad (5.14)$$
always holds, where "$\ge$" denotes "is psd." Note that for two invertible matrices $F$ and $G$, $F^{-1}-G^{-1}\ge 0$ if and only if $G-F\ge 0$. Thus, (5.14) holds if and only if
$$\Gamma_0'\Sigma_0^{-1}\Gamma_0 - \Gamma_0'C\Gamma_0(\Gamma_0'C\Sigma_0C\Gamma_0)^{-1}\Gamma_0'C\Gamma_0 \ge 0. \qquad (5.15)$$
Defining $H := \Gamma_0'\Sigma_0^{-1/2}$ and $P := I_k - \Sigma_0^{1/2}C\Gamma_0(\Gamma_0'C\Sigma_0C\Gamma_0)^{-1}\Gamma_0'C\Sigma_0^{1/2}$, the left hand side of (5.15) equals
$$\Gamma_0'\Sigma_0^{-1/2}P\Sigma_0^{-1/2}\Gamma_0 = HPH' = HP(HP)' \ge 0,$$
where the second equality uses the fact that $P$ is a projection matrix (i.e., $P$ is symmetric and idempotent, $P^2 = P$). The final inequality follows because a matrix of the form $HP(HP)'$ is necessarily psd: $z'HP(HP)'z = \|PH'z\|^2 \ge 0$ for all $z\in\mathbb{R}^d$.
In sum, the optimal weight matrix for the GMM, MD and TS estimators depends on the asymptotic covariance matrix of $n^{1/2}\frac{\partial}{\partial\theta}Q_n(\theta_0)$, which is $\Omega_0 = \Gamma_0'C\Sigma_0C\Gamma_0$. The optimal weight matrix $A_n$ is such that $A_n'A_n \to_p \Sigma_0^{-1}$. For the GMM and MD estimators, $\Sigma_0 = V_0$ and the optimal weight matrix $A_n$ satisfies
$$A_n'A_n \to_p A'A = V_0^{-1}.$$
For the TS estimator, the optimal weight matrix $A_n$ satisfies
$$A_n'A_n \to_p A'A = \left(V_{10} + \Lambda_0V_{20}' + V_{20}\Lambda_0' + \Lambda_0V_{30}\Lambda_0'\right)^{-1}.$$
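In practice the optimal weight is implemented by two-step GMM: a first step with an arbitrary weight matrix, then a second step with $A_n'A_n = \hat V_n^{-1}$ built from first-step moment contributions. The sketch below is an illustration for a linear IV model; the moment $g(W,\theta) = Z(Y - X\theta)$, the two instruments, and the simulated design are assumptions of the sketch.

```python
import numpy as np

# Two-step (optimal-weight) GMM for a linear IV model: k = 2 instruments, d = 1 parameter.
rng = np.random.default_rng(4)
n = 5_000
Z = rng.normal(size=(n, 2))
eX, u = rng.normal(size=(2, n))
X = Z @ np.array([1.0, 0.5]) + eX + 0.7 * u      # endogenous regressor (correlated with u)
Y = 1.5 * X + u

def solve(W):
    # gbar(theta) = m0 - Gamma*theta; minimizing gbar' W gbar gives a closed form here
    Gamma = Z.T @ X / n
    m0 = Z.T @ Y / n
    return (Gamma @ W @ m0) / (Gamma @ W @ Gamma)

theta1 = solve(np.eye(2))                         # step 1: arbitrary (identity) weight
g_i = Z * (Y - X * theta1)[:, None]               # n x k moment contributions at theta1
V_hat = g_i.T @ g_i / n                           # estimate of V0
theta2 = solve(np.linalg.inv(V_hat))              # step 2: optimal weight A'A = V_hat^{-1}
print(theta1, theta2)                             # both near 1.5; step 2 is efficient
```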

5.6 Non-differentiable Objective Function


Suppose we have an iid sample $\{W_i, i=1,2,\ldots,n\}$ and we are interested in $\beta_0\in\mathbb{R}^d$. The moment condition
$$Eg(W,\beta) = 0 \in\mathbb{R}^k \quad\text{for some } k\ge d$$
holds if and only if $\beta = \beta_0$. We will assume that although $Eg(W,\beta)$ is a continuous function of its arguments, $g(\cdot,\cdot)$ need not be.

As an example of a non-differentiable objective function, consider the model
$$Y_i = X_i\beta_0 + \varepsilon_i$$
where
$$P(\varepsilon_i < 0 | X_i = x) = \alpha \in (0,1)$$
for almost all $x$. That is, conditional on $X_i$, the $100\alpha\%$ quantile of $Y_i$ is $X_i\beta_0$. We could write $X_i\beta_0$ as $X_i\beta_{\alpha 0}$ if we want to emphasize that it is the $100\alpha\%$ conditional quantile. For this problem, we can take
$$g(W_i,\beta) = [1\{Y_i < X_i\beta\} - \alpha]X_i'.$$
To estimate $\beta_0$, we solve $\arg\min_\beta Q_n(\beta)$ where
$$Q_n(\beta) = \|E_ng(W,\beta)\|^2$$
and $E_n$ is the expectation operator with respect to the empirical distribution. The limiting function is $Q(\beta) = \|Eg(W,\beta)\|^2$.
Often in the literature (e.g., Pollard (1984)), the following notation is used: $Pg(W,\beta) := Eg(W,\beta)$, $\mathbb{P}_ng(W,\beta) := E_ng(W,\beta)$, and
$$\sqrt{n}(\mathbb{P}_n - P)g(W,\beta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \big[g(W_i,\beta) - Eg(W_i,\beta)\big],$$
which is an empirical process indexed by $g$ (or by $\beta$ in our setting).

5.6.1 Consistency
Theorem 29 Assume
(i) (Definition of $\beta_n$)
$$Q_n(\beta_n) \le \inf_{\beta\in\mathcal{B}}Q_n(\beta) + o_p(1).$$
(ii) (Identification) For every $\varepsilon > 0$, there exists $\delta(\varepsilon) > 0$ such that
$$\inf_{\|\beta-\beta_0\|>\varepsilon}Q(\beta) \ge \delta(\varepsilon) > 0.$$
(iii) (ULLN)
$$\sup_{\beta\in\mathcal{B}}\|(\mathbb{P}_n - P)g(W,\beta)\| = o_p(1).$$
Then
$$\beta_n - \beta_0 = o_p(1).$$

5.6.2 Asymptotic Normality


Theorem 30 Assume $Pg(W,\beta_0) = 0$, $\beta_0\in\mathrm{int}(\mathcal{B})$, and
(i) (Definition of $\beta_n$)
$$Q_n(\beta_n) \le \inf_{\beta\in\mathcal{B}}Q_n(\beta) + o_p(1/\sqrt{n}).$$
(ii) $D(\beta) = \partial[Pg(W,\beta)]/\partial\beta' \in \mathbb{R}^{k\times d}$ exists in a neighborhood of $\beta_0$, is of full column rank, and is continuous at $\beta = \beta_0$.
(iii) For all $\delta_n$ such that $\delta_n = o_p(1)$,
$$\sup_{\|\beta-\beta_0\|\le\delta_n}\big\|\sqrt{n}(\mathbb{P}_n - P)[g(W,\beta) - g(W,\beta_0)]\big\| = o_p(1).$$
(iv) For some finite matrix $V$,
$$\sqrt{n}\,\mathbb{P}_ng(W,\beta_0) \to_d N(0,V).$$
Then
$$\sqrt{n}(\beta_n - \beta_0) \to_d N(0,\Omega)$$
where
$$\Omega = (D'D)^{-1}(D'VD)(D'D)^{-1} \quad\text{and}\quad D = D(\beta_0).$$

Remark on condition (iii). For each given $g$, we can apply a CLT to show that
$$\sqrt{n}(\mathbb{P}_n - P)g \to_d N\big(0, Pg^2 - (Pg)^2\big).$$
A uniform version of the above CLT (the uniform CLT) gives conditions under which the convergence is locally uniform in $g$, in the sense that a small perturbation in $g$ leads to only a small change in $\sqrt{n}(\mathbb{P}_n - P)g$. We do not need a precise statement of the uniform CLT, which says that the empirical process $\sqrt{n}(\mathbb{P}_n - P)g$ converges in distribution to a continuous Gaussian process. What we need here is the local perturbation property of $\sqrt{n}(\mathbb{P}_n - P)g$ given in the lemma below (Pakes and Pollard (1989, Lemma 2.16)).

Lemma 31 Let $\mathcal{G}$ be a Euclidean class of functions with envelope $G$ for which $PG^2 < \infty$. For each $\eta > 0$ and $\varepsilon > 0$, there exists a $\delta > 0$ such that
$$\limsup_{n\to\infty} P\left(\sup_{[\delta]}\big\|\sqrt{n}(\mathbb{P}_n - P)g_1 - \sqrt{n}(\mathbb{P}_n - P)g_2\big\| > \eta\right) < \varepsilon,$$
where $[\delta]$ denotes the set of pairs $(g_1,g_2)$ in $\mathcal{G}$ with $P(g_1-g_2)^2 \le \delta^2$.

Let $\mathcal{G} = \{g(\cdot,\beta): \|\beta-\beta_0\|\le\delta_n\}$. Then the above lemma can be used to verify condition (iii), as the class of pairs $[\delta]$ eventually contains all the pairs $g(\cdot,\beta_1)$ and $g(\cdot,\beta_2)$ for which $\|\beta_1-\beta_2\|\le\delta_n$. For the definition of "Euclidean class", please refer to Pakes and Pollard (1989), Pollard (1984) or Andrews (1999).
Proof of the Theorem.

p
We begin with the n-consistency step. The fact that Pg(W, β) is di§erentiable in β at
β = β 0 and D = D (β 0 ) is of full column rank implies that

kP [g(W, β n ) − g(W, β 0 )]k ≥ C kβ n − β 0 k

for some constant C. This is true because

P [g(W, β n ) − g(W, β 0 )] = D̃n (β n − β 0 )

for D̃n = D(β̃ n ) and β̃ n = β 0 + op (1) and so


W# $−1 W2
W 0 W
k(β n − β 0 )k2 = W D̃
W n n D̃ D̃n
0
P [g(W, β n ) − g(W, β )]
0 W
W
U L# $−1 MV2 W W2
0 W 0 W
≤ λ max D̃n D̃n WD̃n P [g(W, β n ) − g(W, β 0 )]W
h # $i−2 # $
≤ λ min D̃n0 D̃n λmax D̃n D̃n0 kP [g(W, β n ) − g(W, β 0 )]k2
= C kP [g(W, β n ) − g(W, β 0 )]k2
h # $i−2 # $ # $
for C = λ min D̃n0 D̃n λmax D̃n D̃n0 < 1 as λ min D̃n0 D̃n is bounded away from zero.
p
It su¢ces to show that P [g(W, β n ) − g(W, β 0 )] = Op (1/ n) . But

kP [g(W, β n ) − g(W, β 0 )]k ≤ kPg(W, β n ) − Pn g(W, β n )k


+ kPn g(W, β n ) − Pg(W, β 0 )k
: = I1 + I2
p
First term: I1 = Op (1/ n)
Using the triangle inequality, we have

kPg(W, β n ) − Pn g(W, β n )k = k(Pn − P) g(W, β n )k


≤ k(Pn − P) g(W, β n ) − g(W, β 0 )k + (Pn − P) g(W, β 0 ) (5.16)
* p +
= op 1/ n + Pn g(W, β 0 )
* p +
= Op 1/ n (5.17)

from Assumptions (iii) and (iv).


p
Second term: I2 = Op (1/ n)

* p +
I2 = kPn g(W, β n )k ≤ inf kPn g(W, β)k + op 1/ n
β2B
* p +
≤ kPn g(W, β 0 )k + op 1/ n
* p +
= Op 1/ n .

Combining the above analyses yields:


* p +
kβ n − β 0 k = Op 1/ n .

Define
Ln (β) = D (β − β 0 ) + Pn g(W, β 0 ).
We can now proceed as in the case with a smooth objective function. Define
U V
Mn
Bn = β : kβ − β 0 k ≤ p
n
p
for some Mn ! 1 arbitrarily slowly. Given the n consistency of β̂ n , we have β̂ n 2 Bn with
probability approaching one. So we can focus on β 2 Bn . Now uniformly over any β 2 Bn , we
have
Pn g(W, β) = [Pn − P] g(W, β) − [Pn − P] g(W, β 0 )
+Pg(W, β) + [Pn − P] g(W, β 0 )
* p +
= Pg(W, β) + Pn g(W, β 0 ) + op 1/ n by (iii)
# $ * p +
= D β̃ (β − β 0 ) + Pn g(W, β 0 ) + op 1/ n by (ii)
* p + p
= D (β 0 ) (β − β 0 ) + Pn g(W, β 0 ) + op 1/ n by n-consistency
That is, uniformly over any β 2 Bn ,
* p +
kPn g(W, β) − Ln (β)k = op 1/ n .
This implies * p +
β n = β ∗n + op 1/ n
where * p +
β ∗n = arg min kLn (β)k + op 1/ n .
As a result
p p
n (β n − β 0 ) = n (β ∗n − β 0 ) + op (1)
* +−1 0 p
= D0 D D nPn g(W, β 0 ) + op (1) !d N (0, Ω)
by assumption (iv).

5.6.3 Quantile Regression


In this example, we have
$$Pg(W,\beta) = EX_i'[1\{Y_i < X_i\beta\} - \alpha]
= E\big(X_i'\,E[1\{Y_i < X_i\beta\} - \alpha \mid X_i]\big)
= E\big(X_i'\,E[1\{\varepsilon_i < X_i(\beta-\beta_0)\} - \alpha \mid X_i]\big)
= EX_i'\big[F_{\varepsilon_i|X_i}(X_i(\beta-\beta_0)) - \alpha\big],$$
and so
$$D(\beta) = \partial[Pg(W,\beta)]/\partial\beta' = \frac{\partial}{\partial\beta'}EX_i'\big[F_{\varepsilon_i|X_i}(X_i(\beta-\beta_0)) - \alpha\big] = EX_i'X_i\,f_{\varepsilon_i|X_i}(X_i(\beta-\beta_0)).$$
Consequently,
$$D = EX_i'X_i\,f_{\varepsilon_i|X_i}(0).$$
In addition,
$$\sqrt{n}\,\mathbb{P}_ng(W,\beta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n [1\{Y_i < X_i\beta_0\} - \alpha]X_i' \to_d N(0,V)$$
for $V = \alpha(1-\alpha)EX_i'X_i$. Therefore,
$$\Omega = D^{-1}VD^{-1} = \alpha(1-\alpha)\,\big[EX_i'X_i f_{\varepsilon_i|X_i}(0)\big]^{-1}\big[EX_i'X_i\big]\big[EX_i'X_i f_{\varepsilon_i|X_i}(0)\big]^{-1}.$$
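The matrix $D$ involves the conditional density $f_{\varepsilon_i|X_i}(0)$, which in practice is replaced by a kernel estimate evaluated at the residuals. The sketch below is only an illustration under assumptions of my own (a simulated design, a crude iteratively reweighted least squares fit of the check loss, a Gaussian kernel, and a rule-of-thumb bandwidth); it computes the sandwich $\hat\Omega = \hat D^{-1}\hat V\hat D^{-1}$ for median regression.

```python
import numpy as np

# Median regression (alpha = 0.5) and its sandwich variance Omega = D^{-1} V D^{-1}.
rng = np.random.default_rng(5)
n, alpha = 5_000, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([1.0, -0.5])
Y = X @ beta0 + rng.standard_t(df=5, size=n)          # median-zero errors

# Crude IRLS approximation to the check-loss minimizer (illustrative only)
beta = np.linalg.solve(X.T @ X, X.T @ Y)
for _ in range(200):
    r = Y - X @ beta
    w = (alpha * (r > 0) + (1 - alpha) * (r <= 0)) / np.maximum(np.abs(r), 1e-6)
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ Y)

resid = Y - X @ beta
h = 1.06 * resid.std() * n ** (-1 / 5)                # rule-of-thumb bandwidth (assumption)
kern = np.exp(-0.5 * (resid / h) ** 2) / (np.sqrt(2 * np.pi) * h)
D = (X * kern[:, None]).T @ X / n                     # estimate of E X'X f_{eps|X}(0)
V = alpha * (1 - alpha) * X.T @ X / n
Omega = np.linalg.inv(D) @ V @ np.linalg.inv(D)
print(beta, np.sqrt(np.diag(Omega / n)))              # estimates and standard errors
```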

For further reading on the extremum estimators, consult some Handbook of Econometrics
chapters and advanced textbooks in econometrics.
Bibliography

[1] Amemiya, T. (1985): Advanced Econometrics, Ch4. Harvard University Press.

[2] Andrews, D.W.K. (1999): Empirical Process Methods in Econometrics In Handbook of


Econometrics. Vol. 4. Edited by R. F. Engle and D. McFadden. Amsterdam, The Nether-
lands: Elsevier Science, pp. 2247-2294.

[3] Pakes, A. and D. Pollard (1989): Simulation and the Asymptotics of Optimization Esti-
mators, Econometrica, Vol. 57(5), pp. 1027-1057.

[4] Pollard, D. (1984): Convergence of Stochastic Processes. Springer. Available online from
his webpage at www.stat.yale.edu

[5] Newey, W. K., and D. L. McFadden (1999): “Large Sample Estimation and Hypothesis
Testing.” In Handbook of Econometrics. Vol. 4. Edited by R. F. Engle and D. McFadden.
Amsterdam, The Netherlands: Elsevier Science, pp. 2113-2245.

[6] Wooldridge, J. (2010): Econometric Analysis of Cross Section and Panel Data, Chapters
12-14. The MIT Press.

Chapter 6

Binary Choice Models

Many economic variables are observed as the result of individuals’ choices between a limited
number of alternatives. In this chapter, we shall assume that only two alternatives are avail-
able, e.g., purchase/not purchase a car, apply for/not apply for a job, obtain/not obtain a
loan, travel to work by own car/public transport. These are examples of genuine qualitative
choices. Since there are two alternatives, we call it a binary choice. We represent the outcome
of the choice by a binary variable.

6.1 Linear Probability Model


6.1.1 Introduction and estimation
In the setting where Yi takes only two values (say zero and one), we have

E(Yi |Xi ) := g(Xi β) = Pr(Yi = 1|Xi ) (6.1)

where we have assumed that Pr(Yi = 1|Xi ) depends on Xi through the linear index Xi β.
Suppose we use a linear regression model to explain $Y_i$:
$$Y_i = X_i\beta + \varepsilon_i. \qquad (6.2)$$
Because $Y_i$ can take only two values, the error term $\varepsilon_i$, for a given value of $X_i$, can also take only two values:
$$\varepsilon_i = \begin{cases} 1 - X_i\beta & \text{if } Y_i = 1,\\ -X_i\beta & \text{if } Y_i = 0. \end{cases} \qquad (6.3)$$
In fact, $Y_i$ is a Bernoulli random variable. Its conditional p.d.f. is given by
$$f(y_i; x_i,\beta) = \big(\Pr(Y_i=1|x_i)\big)^{y_i}\big(\Pr(Y_i=0|x_i)\big)^{1-y_i}. \qquad (6.4)$$
The variance of a Bernoulli random variable is
$$\mathrm{Var}(Y_i|X_i) = \Pr(Y_i=1|X_i)\times\big(1-\Pr(Y_i=1|X_i)\big) = \mathrm{Var}(\varepsilon_i|X_i). \qquad (6.5)$$
But $\Pr(Y_i=1|X_i) = X_i\beta$, so
$$\mathrm{Var}(\varepsilon_i|X_i) = X_i\beta(1-X_i\beta). \qquad (6.6)$$
Clearly heteroskedastic! Given (6.2) and (6.6), the OLS estimator of $\beta$ is unbiased and consistent, but because of the heteroskedasticity we have to use the robust variance estimator and the robust t-statistic to make inferences.
Since we know the form of the heteroskedasticity, we can use WLS to obtain more efficient estimates by regressing $Y_i/\hat\sigma_i$ on $X_i/\hat\sigma_i$, where
$$\hat\sigma_i^2 = X_i\hat\beta_{OLS}(1 - X_i\hat\beta_{OLS}). \qquad (6.7)$$
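The sketch below illustrates both steps on simulated data; the design (a uniform regressor with true probabilities $0.5 + 0.3x$) and the trimming of fitted probabilities away from 0 and 1 are my own assumptions, not part of the notes.

```python
import numpy as np
import statsmodels.api as sm

# Linear probability model: OLS with robust standard errors, then the WLS step in (6.7).
rng = np.random.default_rng(6)
n = 5_000
x = rng.uniform(-1, 1, size=n)
X = sm.add_constant(x)
y = rng.binomial(1, 0.5 + 0.3 * x)            # a correctly specified LPM

ols = sm.OLS(y, X).fit(cov_type="HC1")        # robust t-statistics, as required by (6.6)
phat = np.clip(ols.fittedvalues, 0.01, 0.99)  # keep the estimated variances positive
wls = sm.WLS(y, X, weights=1.0 / (phat * (1 - phat))).fit()
print(ols.params, ols.bse)                    # OLS estimates and robust SEs
print(wls.params, wls.bse)                    # WLS estimates: more efficient under (6.6)
```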

6.1.2 Pros and Cons


• Simple to calculate.

• Constant marginal e§ects:

β j = @ Pr(Yi = 1|Xi )/@Xij . (6.8)

• The model provides ‘reasonable’ estimates of the partial e§ects near the center of the
distribution of X. This is not rigorously established.
b may be outside of the interval [0, 1]. The partial e§ect may not be reliable for
• But Xi β
extreme values of X.

6.2 Support Vector Machine


6.2.1 Basic Idea
In the computer science and machine learning literature, we often view the supervised learning
problem with a binary outcome as a binary classification problem. To illustrate the basic
idea, consider the case with two covariates X1 and X2 . Figure 6.1 presents the scatter plot of
(X1i ,X2i ) with the color indicating the outcome of Yi .
We want to find a decision boundary or separating hyperplanes so that when w1 X1 +
w2 X2 + b > 0, we predict that Y = 1 and otherwise, we predict Y = −1. In the machine
learning literature, the two outcomes are typically denoted by 1 and −1 rather than 1 and 0.
It is obvious that the change of labels does not matter. Support Vector Machine (SVM) is a
non-probabilistic algorithm to find a best decision boundary according to some criterion.
In the above example, many hyperplanes can separate the two classes of data. SVM
attempts to find two parallel hyperplanes that separate the two classes of data so that the
distance between them is as large as possible. The distance is referred to as the margin of the
classifier. See Figure 6.2 for an illustration.
We proceed to find the margin for a given linear classifier. Writing the classifier boundary
as
wT x + b = 0,

Figure 6.1: Separating the training examples

the plus and minus planes can be represented as
$$w^Tx + b = C \quad\text{and}\quad w^Tx + b = -C$$
for some positive constant $C$. Given that the decision rule depends only on the sign of $w^Tx + b$, it remains the same if the underlying equation is rescaled by $C$ (i.e., $\mathrm{sign}(w^Tx+b) = \mathrm{sign}((w^Tx+b)/C)$). So without loss of generality, we can represent the classifier boundary, the plus plane, and the minus plane as
$$w^Tx + b = 0, \quad w^Tx + b = 1, \quad\text{and}\quad w^Tx + b = -1,$$
respectively.
Claim: The distance from $X_i$ to the classifier boundary is
$$r = \frac{Y_i(w^TX_i + b)}{\|w\|}.$$
Proof. The distance is given by $\|X^0 - X\|$, where $X^0$ is the point on the boundary closest to $X$, so that
$$X - X^0 = r\frac{w}{\|w\|}, \quad\text{or}\quad X^0 = X - r\frac{w}{\|w\|}.$$
But $w^TX^0 + b = 0$, that is, $w^TX - r\frac{w^Tw}{\|w\|} + b = 0$. Solving for $r$ yields
$$r = \frac{w^TX + b}{\|w\|}.$$
This works out for the case of a positive training example at $X$ in Figure 6.3 (i.e., the case with $Y = 1$). More generally, we have
$$r = \frac{Y_i(w^TX_i + b)}{\|w\|}.$$

Figure 6.2: The Margins: Wide and Narrow

The margin is given by $\rho = 2/\|w\|$. The SVM solves
$$\max_{w,b}\ \frac{2}{\|w\|} \quad\text{s.t.}\quad Y_i(w^TX_i + b) \ge 1 \text{ for all } i = 1,\ldots,n,$$
or equivalently
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad\text{s.t.}\quad Y_i(w^TX_i + b) \ge 1 \text{ for all } i = 1,\ldots,n.$$
This is a quadratic optimization problem subject to linear constraints.
At the solution, there will be a few points on the plus and minus planes. These points are called the support vectors. The number of support vectors can be much smaller than the size of the training set.
The Lagrangian for the optimization problem is
$$L(w,b,\lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \lambda_i\big[Y_i(w^TX_i + b) - 1\big].$$
From the first order condition with respect to $w$, we get
$$w = \sum_{i=1}^n \lambda_iY_iX_i.$$
From the first order condition with respect to $b$, we get
$$\sum_{i=1}^n \lambda_iY_i = 0.$$

Figure 6.3: Compute the margin

Using this, we have
$$L(w,b,\lambda) = \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_jY_iY_jX_i^TX_j
- \sum_{i=1}^n \lambda_i\Big[Y_i\Big(\big(\textstyle\sum_{j=1}^n\lambda_jY_jX_j\big)^TX_i + b\Big) - 1\Big]
= \sum_{i=1}^n\lambda_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_jY_iY_jX_i^TX_j.$$
The dual problem is
$$\max_{\lambda}\ W(\lambda) = \sum_{i=1}^n\lambda_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_jY_iY_jX_i^TX_j
\quad\text{s.t.}\quad \lambda_i \ge 0 \text{ for } i = 1,\ldots,n, \quad \sum_{i=1}^n \lambda_iY_i = 0.$$
The support vectors are given by those $X_i$'s whose $\lambda_i$ are positive. In fact,
$$w^Tx + b = \sum_{i=1}^n \lambda_iY_i\langle X_i, x\rangle + b.$$
Thus, we need to find only the inner products between $x$ and the support vectors in order to make our prediction. More general inner products can be used, leading to the so-called "kernel trick" in the machine learning literature.

Figure 6.4: SVM: Soft Margins

6.2.2 Extensions: Soft Margin Classification.


If the training data are not linearly separable, slack variables $\xi_i$ can be added to allow misclassification of noisy examples. See Figure 6.4 for examples.
We still try to minimize the training set errors and to place the hyperplanes "far" from each class (large margin):
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i
\quad\text{s.t.}\quad Y_i(w^TX_i + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \text{ for all } i = 1,\ldots,n.$$
The parameter $C$ is a regularization parameter that controls overfitting.
The dual problem becomes
$$\max_{\lambda}\ W(\lambda) = \sum_{i=1}^n\lambda_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_jY_iY_jX_i^TX_j
\quad\text{s.t.}\quad C \ge \lambda_i \ge 0 \text{ for } i = 1,\ldots,n, \quad \sum_{i=1}^n \lambda_iY_i = 0.$$
The problem is almost the same as before. The only change is that there is now an upper bound on the Lagrange multipliers.
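This primal/dual problem is what standard software solves. The sketch below is an illustration with scikit-learn; the simulated two-Gaussian design and the chosen value of $C$ are assumptions of the sketch. It fits a linear soft-margin SVM and reads off the hyperplane $(w, b)$, the support vectors, and the margin $2/\|w\|$.

```python
import numpy as np
from sklearn.svm import SVC

# Linear soft-margin SVM on simulated two-class data (illustrative design).
rng = np.random.default_rng(7)
n = 200
X = np.vstack([rng.normal(loc=[-1.5, -1.5], size=(n, 2)),
               rng.normal(loc=[+1.5, +1.5], size=(n, 2))])
y = np.hstack([-np.ones(n), np.ones(n)])      # labels in {-1, +1}, as in the text

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C is the soft-margin penalty above
w, b = clf.coef_[0], clf.intercept_[0]        # the separating hyperplane w'x + b = 0
print(w, b)
print("number of support vectors:", clf.support_vectors_.shape[0])
print("margin 2/||w||:", 2 / np.linalg.norm(w))
```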

6.3 Probit and Logit


6.3.1 Introduction
To overcome the problems with the linear model, there exists a class of binary choice models designed to model the 'choice' between two discrete alternatives. Essentially these models describe the probability that $Y_i = 1$ directly. Typically we choose a cumulative distribution function $F(X_i\beta)$ for $g(X_i\beta)$, since cdf's by nature are restricted to lie between zero and one. We let
$$\Pr(Y_i = 1|X_i) = F(X_i\beta). \qquad (6.9)$$
Common choices are:
$$F(w) = \Phi(w) = \int_{-\infty}^{w}\phi(z)\,dz \quad\text{(standard normal distribution)} \ \Rightarrow\ \text{Probit Model}, \qquad (6.10)$$
$$F(w) = L(w) = \frac{\exp(w)}{1+\exp(w)} \quad\text{(logistic distribution)} \ \Rightarrow\ \text{Logit Model}. \qquad (6.11)$$
For the latter, we have
$$f(w) = F'(w) = \frac{\exp(w)}{(1+\exp(w))^2}. \qquad (6.12)$$
The probit model, which uses the normal distribution, may be justified by appealing to a central limit theorem, while the logit model can be justified by the fact that it is similar to a normal distribution but has a much simpler form. The difference between the logistic and normal distributions is that the logistic has slightly heavier tails. The standard normal has mean zero and variance 1, while the logistic has mean zero and variance equal to $\pi^2/3$.
Often the binary choice model is derived from underlying behavioral assumptions: a woman will choose to work for pay if the utility she derives from working is larger than the utility from not working for pay. This leads to a latent variable representation of the model.
Assuming a linear additive relationship, we obtain the utility difference, denoted by $Y_i^*$:
$$Y_i^* = X_i\beta + \varepsilon_i, \qquad (6.13)$$
where $\varepsilon_i/\sigma_\varepsilon$ has the CDF $F(\cdot)$. Because the utility difference $Y_i^*$ is unobserved, it is referred to as a latent variable. We assume that an individual chooses to work if the utility difference exceeds a certain threshold level, which can be set equal to zero without loss of generality (assuming our model contains an intercept):
$$Y_i = 1\{Y_i^* > 0\}. \qquad (6.14)$$
Consequently, we have
$$\Pr(Y_i = 1|X_i) = \Pr(Y_i^* > 0|X_i) = \Pr(\varepsilon_i > -X_i\beta|X_i) = 1 - F(-X_i\beta/\sigma_\varepsilon) = F(X_i\beta/\sigma_\varepsilon) \qquad (6.15)$$
provided that the distribution $F$ is symmetric, where $F(\cdot)$ is the c.d.f. of $\varepsilon_i/\sigma_\varepsilon$.

Figure 6.5: Logit vs Probit functions

In limited dependent variable models we typically lack identification for some unknown
parameter(s). This is due to the fact that we observe a limited set of the latent variable:
Yi = τ (Yi∗ ) for a noninvertible map τ .
First, in the binary choice example, σ " is not identified. Observing Yi , we only know
whether Yi∗ exceeds the threshold or not, there is no way to find the scale of Yi∗ . In the
sequel, therefore, we will set σ " = 1.
Second, setting the threshold for Y ∗ at 0 is likewise innocent if the model contains a
constant term. (In general, unless there is some compelling reason, binary choice models
should not be estimated without constant terms.)
Remark 32 Marginal E§ect
@ Pr(Yi = 1|Xi )
= f (Xi β)β j . (6.16)
@Xij
Note that if we do not set σ " = 1, then

@ Pr(Yi = 1|Xi ) @F ( Xσi"β ) Xi β β j


= = f( ) = f (Xi β σ )β j,σ
@Xij @Xij σ" σ"
for β σ = β/σ " . So, for marginal e§ects, it is innocuous to set σ " = 1.
Remark 33 The relative marginal e§ect is a constant:
@ Pr(Yi = 1|Xi )/@Xij βj
= . (6.17)
@ Pr(Yi = 1|Xi )/@Xih βh
This holds for any linear index model.

Remark 34 For discrete Xk , the marginal e§ect is

F (β 0 + Xi1 β 1 + Xi2 β 2 + ... + (Ck + 1) β k )


−F (β 0 + Xi1 β 1 + Xi2 β 2 + ... + Ck β k ) . (6.18)

The marginal e§ect depends on the point (Xi1 , Xi2 , ..., Ck ) at which we evaluate the marginal
e§ect.

Remark 35 Apart from the sign of the coe¢cients, the coe¢cients in these binary choice
models are not easily interpretable. Except maybe in the logit model, where one can consider
the β’s to represent the marginal e§ect of Xik on the log of the odds: log(“odds”) = Xi β,
where
Pr(Yi = 1|Xi )
“odds” = (6.19)
1 − Pr(Yi = 1|Xi )

6.3.2 Estimation
If we assume that $F(\cdot)$ is known, then the optimal parametric estimator for this problem is the MLE, which maximizes the log-likelihood
$$\sum_{i=1}^n \big[Y_i\log F(X_i\beta) + (1-Y_i)\log(1-F(X_i\beta))\big]. \qquad (6.20)$$
The score is
$$s_i(\beta) = \frac{Y_i}{F(X_i\beta)}f(X_i\beta)X_i' - \frac{1-Y_i}{1-F(X_i\beta)}f(X_i\beta)X_i'
= \left[\frac{Y_i}{F(X_i\beta)} - \frac{1-Y_i}{1-F(X_i\beta)}\right]f(X_i\beta)X_i'
= \frac{Y_i - F(X_i\beta)}{F(X_i\beta)(1-F(X_i\beta))}\,f(X_i\beta)X_i'. \qquad (6.21)$$
So,
$$\frac{\partial s_i(\beta)}{\partial\beta'} = -\frac{f^2(X_i\beta)}{F(X_i\beta)(1-F(X_i\beta))}X_i'X_i
- \frac{[Y_i - F(X_i\beta)][1-2F(X_i\beta)]}{[F(X_i\beta)(1-F(X_i\beta))]^2}f^2(X_i\beta)X_i'X_i
+ \frac{Y_i - F(X_i\beta)}{F(X_i\beta)(1-F(X_i\beta))}f'(X_i\beta)X_i'X_i$$
$$= -\frac{Y_i - 2F(X_i\beta)Y_i + F^2(X_i\beta)}{[F(X_i\beta)(1-F(X_i\beta))]^2}f^2(X_i\beta)X_i'X_i
+ \frac{Y_i - F(X_i\beta)}{F(X_i\beta)(1-F(X_i\beta))}f'(X_i\beta)X_i'X_i,$$
and the (normalized) expected Hessian is
$$E\,\frac{\partial s_i(\beta)}{\partial\beta'} = -E\left[\frac{f^2(X_i\beta)\,X_i'X_i}{F(X_i\beta)(1-F(X_i\beta))}\right], \qquad (6.22)$$

which is negative semidefinite.


Computational notice: in the probit setting, the probabilities are one-dimensional integrals, whereas in the logit setting the probabilities have simple expressions like
$$\frac{\exp(X_i\beta)}{1+\exp(X_i\beta)}, \qquad (6.23)$$
and the first order condition is
$$\sum_{i=1}^n \big(Y_i - F(X_i\beta)\big)X_i' = 0. \qquad (6.24)$$
If $X_i$ contains a constant, then $\sum_{i=1}^n Y_i/n = \sum_{i=1}^n F(X_i\hat\beta)/n$: the empirical frequency of $Y_i = 1$ equals the average predicted probability of $Y_i = 1$.
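The sketch below is a simulated illustration (the design and seed are my own assumptions): it fits a logit by maximum likelihood with statsmodels and checks the implication of the first order condition just stated.

```python
import numpy as np
import statsmodels.api as sm

# Logit MLE and the FOC implication (6.24): with a constant in X, mean(Y) equals
# the mean of the fitted probabilities.  Simulated illustration.
rng = np.random.default_rng(8)
n = 5_000
x = rng.normal(size=n)
X = sm.add_constant(x)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * x))))

logit = sm.Logit(y, X).fit(disp=0)
print(logit.params)                       # MLE of (beta_0, beta_1)
print(y.mean(), logit.predict(X).mean())  # equal (up to solver tolerance) for the logit
```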

Problem 36 What’s going to happen if Yi = 0 or 1 for all i?

The approximate variance of $\hat\beta_{MLE}$ is
$$\left(\sum_{i=1}^n \frac{f^2(X_i\beta)\,X_i'X_i}{F(X_i\beta)(1-F(X_i\beta))}\right)^{-1}, \qquad (6.25)$$
which can be written in the familiar form $(X'\Lambda X)^{-1}$ with
$$\Lambda = \mathrm{diag}\Big(f^2(X_i\beta)/\big[F(X_i\beta)(1-F(X_i\beta))\big]\Big) \quad\text{and}\quad X' = (X_1', X_2', \ldots, X_n'). \qquad (6.26)$$

For testing about the coe¢cients, the full menu of procedures is available (LR, LM and
Wald):

• The model P (Y = 1|X, Z) = F (Xβ + Zγ).

• The null H0 : γ = 0.

• The Tests: Wald test, LR test, 2(Lur − Lr ) and LM

6.3.3 Report the Results for Probit and Logit


1. Percentage of correct prediction
2. Weighted average of correct prediction when Yi = 1 and Yi = 0
3. Pseudo R2 = 1 − Lur /L0 (McFadden) where L = ln L.
The most basic way to describe how successful the model is at fitting the data is to report
the value of ln L at β̂. Since the hypothesis that all other slopes in the model are zero is also
interesting, L0 := ln L computed with only a constant term included should also be reported.
Comparing L0 to Lur gives us an idea of how much the likelihood improves on adding the
explanatory variables.

4. Pseudo R2 = 1 − SSRur /SSR0 (McFadden), where


N h
X i2
SSRur = b )
Yi − g(Xi β (6.27)
ur
i=1

b β
5. Partial e§ects: ∆P (Y = 1|X) ≈ f (X β) b ∆Xj for small Xj .
j
b = f (0) = 0.4; for logit, f (0) = 0.25. The
6. Compare logit and probit: for probit f (X β)
logit estimates can be expected to be larger by a factor 0.4/0.25 = 1.6. To compare with
LPM, logit estimates should be divided by 4 while probit estimates should be divided by 2.5.
7. Variance of the partial e§ect.

Exercise 37 Derive the asymptotic distributions of the predicted probability and


marginal e§ects.

6.4 Probit with Endogenous Covariates


We now study what can be done to account for endogeneity in probit models.

6.4.1 IV and Control Function Approach


We first review instrument variable regressions and control function/variable approach in a
linear causal model:

Y1∗ = Xβ + Y2 α + U
Y2 = Xγ + Zδ + V (6.28)

where (X, Z) is exogeneous and Y2 is endogeneous. The first equation is the structural equa-
tion, and the second equation is the reduced-form equation (not structural). Y2 is endogenous
because U and V are correlated.
Recall the 2SLS regression:
(i) First stage Regression: Ŷ2 = X γ̂ + Z δ̂ with estimated residual V̂ = Y2 − Ŷ2 ;
(ii) Second stage Regression: Y1∗ = Xβ + Ŷ2 α + error.
The essence behind the 2SLS is to decompose Y2 into two parts: one part (i.e., Xγ + Zδ)
is exogeneous and the other part (i.e., V ) is endogeneous. We include the exogenous part as
the regressor in the second stage regression while excluding the endogeneous part from the
regression. E§ectively, we move the endogeneous part to the error term.
Symmetrically, we can decompose U into two parts: U = ρV + e where by definition V is
correlated with Y2 and e is not. We include e in the error term while excluding V from the
error term. That is, we add V as an additional regressor, leading to

Y1∗ = Xβ + Y2 α + ρV + e.

If V is observable, we can regress Y1∗ on X, Y2 and V by OLS. Since X, Y2 and V are not
correlated with e, the OLS estimator of (α, β, ρ) is consistent. When V is not observable

in practice, we can replace it by $\hat V$ from the first stage regression. This, combined with $Y_2 = \hat Y_2 + \hat V$, gives rise to the model
$$Y_1^* = X\beta + Y_2\alpha + \rho\hat V + \big[e + \rho(V - \hat V)\big] = X\beta + \hat Y_2\alpha + (\alpha+\rho)\hat V + \big[e + \rho(V - \hat V)\big],$$
where $e + \rho(V - \hat V)$ is the new error term. However, by construction, $\hat V$ is orthogonal to $X$ and $\hat Y_2$. So regressing $Y_1^*$ on $X$ and $\hat Y_2$ yields the same result as regressing $Y_1^*$ on $X$, $\hat Y_2$ and $\hat V$. The former regression is the familiar IV approach. In terms of estimating $\alpha$ and $\beta$, regressing $Y_1^*$ on $X$, $\hat Y_2$ and $\hat V$ produces the same estimates as regressing $Y_1^*$ on $X$, $Y_2$ and $\hat V$. The latter regression is the so-called control function approach. The IV approach and the control function approach produce numerically identical results. Loosely speaking,
$$\text{reg } Y_1^* \text{ on } (X,\hat Y_2) \;=\; \text{reg } Y_1^* \text{ on } (X,\hat Y_2,\hat V) \;=\; \text{reg } Y_1^* \text{ on } (X,Y_2,\hat V).$$
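This numerical equivalence is easy to verify. The sketch below runs the 2SLS second stage and the control-function regression on a simulated linear model (the design and coefficient values are assumptions of the illustration) and confirms that the estimates of $(\beta, \alpha)$ coincide.

```python
import numpy as np

# 2SLS ("reg Y1 on X, Y2_hat") versus the control function ("reg Y1 on X, Y2, V_hat").
rng = np.random.default_rng(9)
n = 5_000
X = rng.normal(size=n); Z = rng.normal(size=n)
V = rng.normal(size=n); U = 0.8 * V + rng.normal(size=n)   # endogeneity via corr(U, V)
Y2 = 0.5 * X + 1.0 * Z + V
Y1 = 1.0 * X + 2.0 * Y2 + U

def ols(y, regs):
    R = np.column_stack(regs)
    return np.linalg.solve(R.T @ R, R.T @ y)

ones = np.ones(n)
gam = ols(Y2, [ones, X, Z])                    # first-stage regression
Y2_hat = np.column_stack([ones, X, Z]) @ gam
V_hat = Y2 - Y2_hat

b_2sls = ols(Y1, [ones, X, Y2_hat])            # IV / 2SLS second stage
b_cf = ols(Y1, [ones, X, Y2, V_hat])           # control function regression
print(b_2sls[[1, 2]])                          # (beta, alpha) from 2SLS
print(b_cf[[1, 2]])                            # identical (beta, alpha) from CF
```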

More generally, consider a causal model with endogeneity: Y1∗ = Xβ + Y2 α + U where Y2 is


the causal variable of interest. A control variable W is a variable that satisfies the conditional
mean independence assumption:

E(U |X, Y2 , W ) = E(U |X, W ).

If in addition we assume E(U |X, W ) = ρX X + ρw W, then

E (Y1∗ |X, Y2 , W ) = X (ρX + β) + Y2 α + ρw W.

So the OLS estimator of α obtained by regressing Y1∗ on X, Y2 , W is consistent. It is important


to point out that the OLS estimator of the slope coe¢cients for X and W may not have any
causal interpretation. Their signs may not be meaningful. One should not check their signs in
order to judge whether the coe¢cient estimators are compatible with some economic theory
or common sense.
What if we have a more general nonlinear and nonseparable structural/causal model

Y1∗ = G (X, Y2 , U )

where Y2 is correlated with U ? In this case, we hope to find a control variable W such that
the full conditional independence (CI) holds:

U ? Y2 |(X, W ).

We then compute

E (Y1∗ |X = x, Y2 = y2 , W = w)
| {z }
identif iable
= E [G (x, y2 , U ) |X = x, Y2 = y2 , W = w)]
= E [G (x, y2 , U ) |X = x, W = w)] (by conditional independence)
Z
= G (x, y2 , u) fU |X,W (u|x, w) du,
U

and so
@E (Y1∗ |X = x, Y2 = y2 , W = w)
ρ(x, y2 , w) : =
@y2
Z
@
= G (x, y2 , u) fU |X,W (u|x, w) du
@y2 U
Z
@
= G (x, y2 , u) fU |X,W (u|x, w) du
U @y2
R
provided that the interchange @y@ 2 and U is justified.
Note that @y@ 2 G (x, y2 , u) has a causal interpretation. It is exactly the ceteris paribus causal
e§ect that we defined before. When computing such an e§ect, we hold x and u constant while
making a small change to y2 .
Such an e§ect depends on the unobserved causal factor U. In general, we should not expect
to recover @y@ 2 G (x, y2 , u) at each given value of (x, y2 , u) , that is, the ceteris paribus causal
e§ects can not be identified at the individual level. Di§erent individuals have di§erent ceteris
paribus causal e§ects. Instead of focusing on individual treatment e§ects, we consider an
averaged version of it. ρ(x, y2 , w) is an averaged treatment e§ect where the average is taken
over the conditional distribution of the unobservable U given (X, W ) = (x, w) .
If we would like to average over the conditional distribution of U given X = x, we can
integrate out w using the conditional pdf fW |X (w|x) :
Z
ρ̃ (x, y2 ) = ρ(x, y2 , w)fW |X (w|x) dw
W
Z LZ M
@
= G (x, y2 , u) fU |X,W (u|x, w) fW |X (w|x) dw du
@y
ZU 2 W
@
= G (x, y2 , u) fU |X (u|x) du.
U @y2

If X and U are independent, then


Z
@ @
ρ̃ (x, y2 ) = G (x, y2 , u) fU (u) du = EG (x, y2 , U ) .
U @y2 @y2

6.4.2 Probit with Endogeneity


Consider:

Y1∗ = Xβ + Y2 α + U
Y2 = Xγ + Zδ + V
Y1 = 1 {Y1∗ > 0} (6.29)

where (U, V ) are bivariate normal distributions and are independent of X and Z, and U is
possibly correlated with V. As an example, Y1 is a binary variable indicating a woman’s labor
force participation, Y2 is her years of education, and X contains her parental education.
Assume that var(U ) = 1 and denote var(V ) = σ 2 . Write

U = θV + e (6.30)

where
$$\theta = \frac{\mathrm{cov}(V,U)}{\mathrm{var}(V)} = \frac{\mathrm{cov}(V,U)}{\sqrt{\mathrm{var}(V)}\sqrt{\mathrm{var}(U)}}\cdot\frac{1}{\sqrt{\mathrm{var}(V)}} := \frac{\rho}{\sigma}, \qquad (6.31)$$
for $\rho = \mathrm{corr}(V,U)$, and
$$e \sim N(0, 1-\theta^2\sigma^2) = N(0, 1-\rho^2). \qquad (6.32)$$
Then, we have
$$Y_1^* = X\beta + Y_2\alpha + \theta V + e, \qquad (6.33)$$
and
$$P(Y_1 = 1|X, Y_2, V) = \Phi\left(\frac{X\beta + Y_2\alpha + \theta V}{\sqrt{1-\rho^2}}\right). \qquad (6.34)$$
But we do not know $V$ and have to estimate it.

A Two-step Control Function Approach

(a) Run the OLS regression $Y_2 = X\gamma + Z\delta + V$ to get $\hat V$ and $\hat\sigma^2$;
(b) Run the probit regression using $\hat V$ in place of $V$ to get estimators of
$$\beta_\rho := \frac{\beta}{\sqrt{1-\rho^2}}, \quad \alpha_\rho := \frac{\alpha}{\sqrt{1-\rho^2}}, \quad \theta_\rho := \frac{\theta}{\sqrt{1-\rho^2}}. \qquad (6.35)$$
Ordinary probit standard errors calculated from the second step are inconsistent because the estimated residuals are treated as if they were observations of the true first-stage errors. To get consistent standard errors, we need to take into account the additional uncertainty that results from using $(\hat\gamma,\hat\delta)$ rather than the true values $\gamma$ and $\delta$.
Note that once $\beta_\rho$, $\alpha_\rho$, $\theta_\rho$, and $\sigma^2$ are estimated, we can back out the estimates for $\alpha$, $\beta$ and $\rho$. More specifically, note that
$$\theta_\rho = \frac{\rho}{\sqrt{1-\rho^2}}\frac{1}{\sigma},$$
which implies that
$$\rho = \frac{\theta_\rho\sigma}{\sqrt{1+\theta_\rho^2\sigma^2}}.$$
So we can back out $\rho$ if we know $\theta_\rho$ and $\sigma^2$. Once we know $\rho$, we can back out $\alpha$ and $\beta$ from $\alpha_\rho$ and $\beta_\rho$:
$$\alpha = \alpha_\rho\sqrt{1-\rho^2} = \frac{\alpha_\rho}{\sqrt{1+\theta_\rho^2\sigma^2}}, \qquad
\beta = \beta_\rho\sqrt{1-\rho^2} = \frac{\beta_\rho}{\sqrt{1+\theta_\rho^2\sigma^2}}.$$
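The sketch below implements steps (a)-(b) and the back-out formulas on simulated data. The design (values of $\rho$, $\sigma$, $\alpha$, $\beta$) is an assumption of the illustration, and, as noted above, the second-step standard errors reported by the probit routine are not corrected for the first stage.

```python
import numpy as np
import statsmodels.api as sm

# Two-step control-function probit for model (6.29); simulated illustration.
rng = np.random.default_rng(10)
n = 10_000
x = rng.normal(size=n); z = rng.normal(size=n)
sigma, rho = 1.5, 0.6
V = sigma * rng.normal(size=n)
U = (rho / sigma) * V + np.sqrt(1 - rho**2) * rng.normal(size=n)   # var(U) = 1
Y2 = 0.5 * x + 1.0 * z + V
Y1 = (0.5 * x + 1.0 * Y2 + U > 0).astype(int)                      # beta = 0.5, alpha = 1

# Step (a): first-stage OLS, residuals V_hat and sigma_hat^2
W1 = sm.add_constant(np.column_stack([x, z]))
V_hat = sm.OLS(Y2, W1).fit().resid
sigma2_hat = V_hat.var()

# Step (b): probit of Y1 on (1, x, Y2, V_hat) gives beta_rho, alpha_rho, theta_rho
W2 = sm.add_constant(np.column_stack([x, Y2, V_hat]))
probit = sm.Probit(Y1, W2).fit(disp=0)
b_rho, a_rho, t_rho = probit.params[1], probit.params[2], probit.params[3]

# Back out rho, alpha, beta using the formulas above
scale = np.sqrt(1 + t_rho**2 * sigma2_hat)
rho_hat = t_rho * np.sqrt(sigma2_hat) / scale
alpha_hat, beta_hat = a_rho / scale, b_rho / scale
print(rho_hat, alpha_hat, beta_hat)            # near 0.6, 1.0, 0.5
```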

6.4.3 Average Partial Effects under Endogeneity


First Approach
Our structural model is Y1 = 1 {Xβ + Y2 α + U > 0} . The model is structural in the following
sense: for a given U, it produces potential outcomes Y1 (x, y2 ) for every possible value of
(x, y2 ). We imagine that each individual in the population has a value of U and hence a value
of Y1 (x, y2 ) for each possible value of (x, y2 ). This is the way we think about structural models
in economics. For example, a demand schedule gives the conjectural and potential demand
Q(p) for every possible price p. For each individual in the sample we only observe her actual
value (Xi , Y2i ) that occurs in the sample so that Y1i = Y1i (Xi , Y2i , Ui ) .
For a give value (X, Y2 ) = (xo , y2o ), we have

Y1 (xo , y2o , U ) = 1 {xo β + y2o α + U > 0} .

For the subpopulation whose (X, Y2 ) is equal to (xo , y2o ), we ask what could be the average
response had their U follows a certain distribution, say N (0, 1), the marginal distribution of
U in the population. The answer is
Z
EY1 (xo , y2o , U ) = Φ (xo β + y2o α) = 1 {xo β + y2o α + u > 0} φU (u) du,

where u is the dummy for integration, EY1 (xo , y2o , U ) is the expectation of Y1 when we set
(X, Y2 ) to be (xo , y2o ) while letting U follow the distribution N (0, 1) instead of its conditional
distribution given that (X, Y2 ) = (xo , y2o ) . EY1 (xo , y2o , U ) has a causal/structural interpreta-
tion, as it is the average value of Y1 when (X, Y2 ) is set to be any value (xo , y2o ) while keeping
U the same in an average sense (i.e., the distribution of U does not change with the settings
of (X, Y2 )). For another pair of value (x∗o , y2o ∗ ) , we would do exactly the same calculation:

Z
EY1 (x∗o , y2o

, U ) = Φ (x∗o β + y2o

α) = 1 {x∗o β + y2o ∗
α + u > 0} φU (u) du.

Let U s N (0, 1), then

EY1 (xo , y2o , U ) = E1 {xo β + y2o α + U > 0} .

When U s N (0, 1) but the conditional distribution of U given (X, Y2 ) = (xo , y2o ) is not
N (0, 1), in general

EY1 (xo , y2o , U ) 6= E [Y1 (X, Y2 , U ) | (X, Y2 ) = (xo , y2o )] .

While EY1 (xo , y2o , U ) can be regarded as a “structural” expectation, E [Y1 (X, Y2 , U ) | (X, Y2 ) = (xo , y2o )]
is the statistical expectation.
Assuming that the unobserved factor U has the marginal distribution N (0, 1), the partial
e§ect can be computed by
@EY1 (xo , y2o , U )
AP E1 = = αφ (xo β + y2o α) ,
@y2o
which can be estimated by # $
\
AP E1 = α̂φ xo β̂ + y2o α̂ .

To understand AP E1 , we consider an individual with (X, Y2 ) = (xo , y2o ). Suppose we change


her Y2 by a small amount while keeping her U value the same as before. Then

Y1 (xo , y2o + ∆) − Y1 (xo , y2o ) = 1 {xo β + (y2o + ∆) α + U > 0} − 1 {xo β + y2o α + U > 0} .

The e§ect will be di§erent for di§erent individuals because their U ’s are di§erent. For some
individuals, Y1 will not change. For others, Y1 will change. We ask: what is the average change
in the population? To address this question, we take an average of the above di§erence, leading
to
E [Y1 (xo , y2o + ∆) − Y1 (xo , y2o )] = Φ (xo β + (y2o + ∆) α) − Φ (xo β + y2o α) .
Scaling this by ∆ and letting ∆ ! 0, we obtain AP E1 given above. For more discussions
along this line, see section 2.2.5 in Wooldridge (2010).
If we are not interested in a particular pair (xo , y2o ) , we can let (xo , y2o ) vary over the
sample and obtain
1X # $
n
^
AP E1 = α̂ φ Xi β̂ + Y2i α̂ .
n
i=1
The distinction between the “structural” expectation and statistical expectation is not spe-
cific to probit models or more general nonlinear and nonseparable models. Such a distinction
applies to linear models as well. Consider the linear causal model with endogeneity

Y = Wθ + U

where X and U are correlated. Then the “structural” expectation is

EY (wo , U ) = wo θ + EU = wo θ,

and the statistical expectation is

E [Y (W, U ) |W = wo ] = wo θ + E (U |W = wo ) .

If E (U |W = wo ) = wo π, then E [Y (W, U ) |W = wo ] = wo (θ + π). So the “structural” expecta-


tion and the statistical expectation have di§erent slope coe¢cients. While the slope coe¢cient
in the “structural” expectation has a structural interpretation, that in the statistical expec-
tation does not. Clearly
@EY (wo , U ) @wo θ
AP E1 = = =θ
@wo @wo
\
which is a constant. Also AP ^
E1 = AP E1 = θ̂ in this case.

Second Approach
An alternative method of computing the APE is to look at
Z
E(Y1 (xo , y2o , U ) |V = vo ) = 1 {xo β + y2o α + u > 0} φU |V (u|vo ) du

where φU |V (u|vo ) is the conditional density of U given V = vo . The reason to condition on V is


that (X, Y2 ) is independent of U conditional on V. Conditioning on a given V = vo means that

we look at the subpopulation whose V value is equal to vo . For this subpopulation, (X, Y2 )
and U are independent. So

E(Y1 (xo , y2o , U ) |V = vo ) = E [Y1 (X, Y2 , U ) | (X, Y2 , V ) = (xo , y2o , vo )]


= E [Y1 | (X, Y2 , V ) = (xo , y2o , vo )]

The right hand side is the usual conditional expectation. We have connected a structural ob-
ject, i.e., E(Y1 (xo , y2o , U ) |V = vo ) with a statistical object, i.e., E [Y1 | (X, Y2 , V ) = (xo , y2o , vo )] .
Define the Average Structural Function as

ASF (xo , y2o , vo ) = E [Y1 | (X, Y2 , V ) = (xo , y2o , vo )]


* +
= Φ xo β ρ + y2o αρ + θρ vo ,

where
β α θ
βρ = p , αρ = p and θρ = p . (6.36)
1 − ρ2 1 − ρ2 1 − ρ2
Let fV (·) be the marginal pdf of V. The second way to compute the average partial e§ect
with respect to Y2 at (X, Y2 ) = (xo , y2o ) is
Z L M
@ASF (xo , y2o , vo )
AP E2 = fV (vo ) dvo (6.37)
V @y2
@
= EV [ASF (xo , y2o , V )] (6.38)
@y2
@ ; * +<
= EV Φ xo β ρ + y2o αρ + θρ V , (6.39)
@y2
where EV is the expectation operator with respect to the marginal distribution p of V.
Under the normality assumption, i.e., V s N (0, σ 2 ), we can find ẽ s N (0, 1 − ρ2 ) and
is independent of V such that
; * +<
EV Φ xo β ρ + y2o αρ + θρ V
= EV Eẽ 1 {xo β + y2o α + θV + ẽ > 0}
= E1 {xo β + y2o α + θV + ẽ > 0}
n o
= E1 xo β + y2o α + Ũ > 0
= Φ (xo β + y2o α) .

So
AP E2 = αφ (xo β + y2o α) .
It is now clear that AP E1 = AP E2 . For AP E1 , the average is taken over the marginal
distribution of U . For AP E2 , the average is taken over the conditional distribution of U given
V followed by averaging over the marginal distribution of V.

Based on the equivalence of the two definitions, we can find a di§erent way to compute
the APE. Given that V s N (0, σ 2 ), we have
@ ; * +<
AP E2 = EV Φ xo β ρ + y2o αρ + θρ V
@y2
* +
= EV φ xo β ρ + y2o αρ + θρ V αρ
0 1
αρ xo β ρ + y2o αρ
= q φ@q A.
2 2 2 2
(1 + σ θρ ) (1 + σ θρ )

This can be proved as follows:


* +
EV φ xo β ρ + y2o αρ + θρ V αρ
@ * +
= EV Φ xo β ρ + y2o αρ + θρ V
@y2o
@
= P (ξ < xo β ρ + y2o αρ + θρ V ) for ξ ? V , ξ s N (0, 1), V s N (0, σ 2 )
@y2o
@
= P (ξ − θρ V < xo β ρ + y2o αρ ) for ξ − θρ V s N (0, 1 + θ2ρ σ 2 )
@y2o
0 1 0 1
@ xo β ρ + y2o αρ α ρ x o β ρ + y 2o α ρ
= Φ@ q A= q φ@q A.
@y2o 1+θ σ 2 2 2
(1 + σ θ )2
(1 + σ 2 θ2 )
ρ ρ ρ

Therefore, we can estimate the APE wrt to Y2 at (X, Y2 ) = (xo , y2o ) by


0 1
α̂ρ xo β̂ ρ + y2o α̂ρ
\
AP E2 = q φ@q A. (6.40)
2 2 2 2
(1 + σ̂ θ̂ρ ) (1 + σ̂ θ̂ρ )

\
Obviously AP E1 = AP\ \
E2 . While AP E1 uses the estimates of the deep structural parameters
\
α and β, AP E2 uses the estimates of the parameters αρ , β ρ , θ2ρ and σ 2 which are (arguably)
more of the reduced-form nature.
^
Like AP E1 , we could also compute

1X # $
n
^
AP E2 = φ xo β̂ ρ + y2o α̂ρ + θ̂ρ V̂i α̂ρ . (6.41)
n
i=1

The asymptotic variance of the APE is difficult to compute directly, but we can employ standard bootstrap methods to compute it.
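A minimal self-contained sketch of this bootstrap is given below. Everything in it is an assumption made for the illustration: the simulated design, the evaluation point $(x_o, y_{2o}) = (0, 0)$, and $B = 200$ bootstrap replications. Each bootstrap draw resamples individuals, re-runs the two-step estimator, and re-evaluates formula (6.40).

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Nonparametric bootstrap for the APE in (6.40); simulated illustration.
rng = np.random.default_rng(11)
n = 4_000
x = rng.normal(size=n); z = rng.normal(size=n)
V = 1.5 * rng.normal(size=n)
U = 0.4 * V + np.sqrt(1 - 0.6**2) * rng.normal(size=n)   # rho = 0.6, sigma = 1.5
Y2 = 0.5 * x + z + V
Y1 = (0.5 * x + Y2 + U > 0).astype(int)
data = np.column_stack([Y1, Y2, x, z])

def ape_hat(d, x0=0.0, y20=0.0):
    y1, y2, xx, zz = d[:, 0], d[:, 1], d[:, 2], d[:, 3]
    W1 = sm.add_constant(np.column_stack([xx, zz]))
    vhat = sm.OLS(y2, W1).fit().resid                     # first stage
    s2 = vhat.var()
    W2 = sm.add_constant(np.column_stack([xx, y2, vhat]))
    p = sm.Probit(y1, W2).fit(disp=0).params              # second stage
    b_rho, a_rho, t_rho = p[1], p[2], p[3]
    scale = np.sqrt(1 + t_rho**2 * s2)
    idx = p[0] + x0 * b_rho + y20 * a_rho                 # index at the evaluation point
    return (a_rho / scale) * norm.pdf(idx / scale)        # formula (6.40)

ape = ape_hat(data)
boot = np.empty(200)
for b in range(200):
    draw = data[rng.integers(0, n, size=n)]               # resample individuals
    boot[b] = ape_hat(draw)
print(ape, boot.std())                                    # APE estimate and bootstrap SE
```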

6.4.4 Can We Use a Two-step IV Approach? No.


Suppose we try to estimate the probit model using the typical 2SLS. We write

Y1 = 1 {Xβ + (Xγ + Zδ) α + (U + αV ) > 0}


!
Xβ + (Xγ + Zδ) α
Pr(Y1 = 1| (X, Z)) = Φ p .
1 + α2 σ 2 + 2ρασ

If we replace γ and δ by the first stage LS estimators γ̂ and δ̂ and perform the probit MLE:
# +
$ n
X # $ h # $i
α̂+
ρ , β̂ ρ = arg max Y1i log Φ Xi β + + + +
ρ + Ŷ2i αρ + (1 − Y1i ) log 1 − Φ Xi β ρ + Ŷ2i αρ
i=1

where
β
β+
ρ = p ,
1 + α σ 2 + 2ρασ
2
α
α+
ρ = p .
1 + α2 σ 2 + 2ρασ
+
Then using the general theory of 2 step MLE, we can show that α̂+ p + p +
ρ ! αρ and β̂ ρ ! β ρ .
Note that σ can be consistently estimated from the first stage regression. The problem is that
we do not know ρ and the 2-step IV does not provide a consistent estimator. So we can not
recover α and β. As a result, we cannot estimate the APE.

6.4.5 Conditional Likelihood Approach


The likelihood function conditional on $X$ and $Z$ is
$$f(Y_1,Y_2|X,Z) = f(Y_1|Y_2,X,Z)\,f(Y_2|X,Z) = f(Y_1|Y_2,X,Z)\,\frac{1}{\sigma}\phi\left(\frac{Y_2 - X\gamma - Z\delta}{\sigma}\right). \qquad (6.42)$$
To find $f(Y_1,Y_2|X,Z)$, we need to figure out $\Pr(Y_1 = 1|Y_2,X,Z)$:
$$\Pr(Y_1 = 1|Y_2,X,Z) = \Pr(X\beta + Y_2\alpha + U > 0|Y_2,X,Z)
= \Phi\left(\frac{X\beta + Y_2\alpha + \theta(Y_2 - X\gamma - Z\delta)}{\sqrt{1-\rho^2}}\right)
= \Phi\left(\frac{X\beta + Y_2\alpha + \rho\sigma^{-1}(Y_2 - X\gamma - Z\delta)}{\sqrt{1-\rho^2}}\right) := \Phi(q). \qquad (6.43)$$
Therefore, the conditional likelihood is
$$\{\Phi(q)\}^{Y_1}\big(1-\Phi(q)\big)^{1-Y_1}\,\frac{1}{\sigma}\phi\left(\frac{Y_2 - X\gamma - Z\delta}{\sigma}\right). \qquad (6.44)$$
We can then maximize the sum of the log-likelihoods with respect to $\alpha$, $\beta$, $\gamma$, $\delta$, $\rho$, and $\sigma^2$, leading to the conditional MLE (CMLE).

Remark 38 1. The conditional MLE is more efficient than the two-step procedure but computationally more demanding.
2. Testing $H_0: \rho = 0$ is straightforward: use either a t test or an LR test.
3. It is easy to abuse the two-step procedure.

When Y1∗ is observable, we have

f (Y1∗ , Y2 |X, Z)
= f (Y1∗ |Y2 , X, Z)f (Y2 |X, Z)
H ∗ I H I
1 Y1 − Xβ − Y2 α − θ (Y2 − Xγ − Zδ) 1 Y2 − Xγ − Zδ
= φ φ . (6.45)
σe σe σV σV
The CMLE based on the above pdf is the Limited Information Maximum Likelihood (LIML)
estimator. When both Y1∗ , Y2 are observable, we can treat them symmetrically and obtain
a di§erent representation of f (Y1∗ , Y2 |X, Z). If the error terms are normal, then Y1∗ , Y2 |X, Z
follows a bivariate normal distribution.

6.5 Panel Logit and Probit Models


We assume cross sectional independence throughout this section.

6.5.1 Pooled Probit and Logit


Suppose the model is
Yit = 1 {Yit∗ > 0} and Yit∗ = Xit β + "it
where Xit can contain a variety of factors, including time dummies, time constant or lagged
dependent variables, and conditional on Xit , "it follows the standard normal or logistic distri-
bution.
Under the above model and assumption, we have

P (Yit = 1|Xit ) = G(Xit β) (6.46)

where G is a known function (normal or logistic CDF). Note that the conditioning set is the
contemporaneous Xit not Xi .

Partial Likelihood Estimation


The partial log-likelihood is
N X
X T N X
X T
log f (Yit |Xit , β) = Yit log G(Xit β) + (1 − Yit ) log(1 − G (Xit β)). (6.47)
i=1 t=1 i=1 t=1
Note: we do not assume that Π_{t=1}^T f(Yit|Xit) is the conditional likelihood of the vector Yi = (Yi1, Yi2, ..., YiT) given some set of conditioning variables. For the behavioral model Yit = 1{εit > −Xit β}, εit may be serially correlated for each i. We could assume that εi = (εi1, ..., εiT)' has a multivariate normal distribution with variance matrix Σε and construct the joint probability density of Yi given Xi. But this is very complicated, and estimation is very computationally intensive. In addition, εit may be correlated with past and future values of Xit. For example, let Yit indicate whether a person was arrested for a crime in year t and let Xit measure the amount of time the person has spent in prison prior to the current year. An arrest this year, Yit = 1, certainly has an effect on the expected future values of Xit.

To understand the partial log-likelihood, let’s assume that we only observe


{Yiτ , Xiτ } for a specific τ and pretend that we do not have the observations for other periods.
In other words, we only have cross sectional observations. In this case, the log-likelihood for
{Y1τ , Y2,τ , ..., YN,τ } conditional on {X1τ , X2,τ , ..., XN,τ } is
Σ_{i=1}^N [ Yiτ log G(Xiτ β) + (1 − Yiτ) log(1 − G(Xiτ β)) ].   (6.48)

It follows from the usual argument that

β̂τ = arg max_β Σ_{i=1}^N [ Yiτ log G(Xiτ β) + (1 − Yiτ) log(1 − G(Xiτ β)) ]   (6.49)

is consistent and asymptotically normal. The partial likelihood is just a way of combining the β̂τ's. Of course, we could take a simple or weighted average of the β̂τ's to get our final estimator. But it is typical in the literature to pool the objective functions and define β̂ as the maximizer of the partial log-likelihood in (6.47). In fact, the so-defined β̂ is a weighted average of the β̂τ's, with weights depending on the asymptotic variance of β̂τ.
We now show that the partial MLE is asymptotically normal. We usually proceed as follows: the FOC is

Σ_{i=1}^N Σ_{t=1}^T s_it(β̂) = 0   (6.50)

where s_it(β) = ∇β log f(Yit|Xit, β). A Taylor expansion of the above FOC gives

0 = Σ_{i=1}^N Σ_{t=1}^T s_it(β0) − Σ_{i=1}^N Σ_{t=1}^T H_it(β̃)(β̂ − β0)   (6.51)

where H_it(β) = −∇_{β'}∇β log f(Yit|Xit, β). Therefore

√N(β̂ − β0) = [ (1/N) Σ_{i=1}^N Σ_{t=1}^T H_it(β̃) ]⁻¹ [ (1/√N) Σ_{i=1}^N Σ_{t=1}^T s_it(β0) ].   (6.52)

Now, under mild regularity conditions, we have the weak convergence result:

(1/√N) Σ_{i=1}^N Σ_{t=1}^T s_it(β0) ⇒ N(0, B),   (6.53)

where

B = lim_{N→∞} var( (1/√N) Σ_{i=1}^N Σ_{t=1}^T s_it(β0) )   (6.54)

and

(1/N) Σ_{i=1}^N Σ_{t=1}^T H_it(β̃) →p E Σ_{t=1}^T H_it(β0) := A.   (6.55)

So

√N(β̂ − β0) ⇒ N(0, A⁻¹ B A⁻¹).   (6.56)

In the above, '⇒' signifies convergence in distribution.


It remains to estimate A and B. It is easy to see that A can be estimated consistently by

Â = (1/N) Σ_{i=1}^N Σ_{t=1}^T H_it(β̂)   (6.57)

or

Â = (1/N) Σ_{i=1}^N Σ_{t=1}^T s_it(β̂) s_it(β̂)'.   (6.58)

Due to cross-sectional independence, var( (1/√N) Σ_{i=1}^N Σ_{t=1}^T s_it(β0) ) is

(1/N) Σ_{i=1}^N E[ ( Σ_{t=1}^T s_it(β0) ) ( Σ_{t=1}^T s_it(β0) )' ].   (6.59)

So B can be estimated by

(1/N) Σ_{i=1}^N ( Σ_{t=1}^T s_it(β̂) ) ( Σ_{t=1}^T s_it(β̂) )'
 = (1/N) Σ_{i=1}^N Σ_{t=1}^T s_it(β̂) s_it(β̂)' + (1/N) Σ_{i=1}^N Σ_{t≠τ} s_it(β̂) s_iτ(β̂)',   (6.60)

where the second term in the above expression accounts for possible serial correlation in the score.

For the probit model, a simple, general estimator of the asymptotic variance is

( Σ_{i=1}^N Σ_{t=1}^T A_it(β̂) )⁻¹ ( Σ_{i=1}^N s_i(β̂) s_i(β̂)' ) ( Σ_{i=1}^N Σ_{t=1}^T A_it(β̂) )⁻¹   (6.61)

where

A_it(β̂) = φ²(Xit β̂) Xit' Xit / { Φ(Xit β̂) [1 − Φ(Xit β̂)] },   (6.62)

and

s_i(β̂) = Σ_{t=1}^T s_it(β̂) = Σ_{t=1}^T φ(Xit β̂) Xit' [Yit − Φ(Xit β̂)] / { Φ(Xit β̂) [1 − Φ(Xit β̂)] }.   (6.63)
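In practice, the sandwich form in (6.61) is what cluster-robust (by i) standard errors deliver for the pooled probit. A minimal Stata sketch, with hypothetical variable names y, x1, x2 and panel identifier id:

* Pooled (partial-likelihood) probit. Clustering by i makes the standard errors
* robust to arbitrary serial correlation in the score, as in (6.60)-(6.61).
probit y x1 x2, vce(cluster id)

* Average partial effects based on the pooled coefficients.
margins, dydx(x1 x2)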

Dynamically Complete Models


Definition: ft(Yt|Xt, β) is dynamically complete if

ft(Yt|Xt, β) = ft(Yt|Xt, Yt−1, Xt−1, Yt−2, ..., Y1, X1, β).   (6.64)

Under this condition,

E s_it(β0) s_iτ(β0)' = 0 for t ≠ τ.   (6.65)

This is because

E(s_it(β0)|Xit) = 0   (6.66)

and

E{ s_it(β0) | Xit, Yit−1, Xit−1, Yit−2, ..., Yi1, Xi1 } = 0.   (6.67)

Now, for τ < t,

E s_it(β0) s_iτ(β0)'
 = E{ E[ s_it(β0) s_iτ(β0)' | Xit, Yit−1, Xit−1, Yit−2, ..., Yi1, Xi1 ] }
 = E{ E[ s_it(β0) | Xit, Yit−1, Xit−1, Yit−2, ..., Yi1, Xi1 ] s_iτ(β0)' }
 = 0,   (6.68)

because s_iτ(β0) depends only on Xiτ and Yiτ.

6.5.2 Unobserved-effect Probit Model under Strict Exogeneity

Basic Set-up
The basic model is

Yit = 1{Yit* > 0},   Yit* = Xit β + αi + Uit,   (6.69)

where

Ui ~ N(0, I_T) conditional on Xi and αi.   (6.70)

Implicitly, we assume that the distribution of Ui is independent of Xi, conditional on αi. In this sense, Xi is strictly exogenous. The strict exogeneity rules out lagged dependent variables, as well as explanatory variables whose future movements depend on current or past values of Ui.
For this model, the probability density of Yi1, Yi2, ..., YiT conditional on Xi and αi is

f(Yi|Xi, αi, β) = Π_{t=1}^T f(Yit|Xi, αi, β)   (6.71)
 = Π_{t=1}^T f(Yit|Xit, αi; β) = Π_{t=1}^T Φ(Xit β + αi)^{Yit} (1 − Φ(Xit β + αi))^{1−Yit}.   (6.72)

The fixed effects probit treats αi as parameters to be estimated. Unfortunately, in addition to being computationally difficult, estimation of αi along with β introduces an incidental parameter problem. In the present case, the fixed effects estimator of β is inconsistent for a fixed T.
To sum up, probit does not allow the fixed effects treatment at all. A random effects model is feasible but has been difficult because of multidimensional integration. To restore the consistency of the MLE of β, we need to integrate the random effects α out. For MLE we must assume a particular distribution for α, say

g_{αi|Xi}(α) = (1/σα) φ(α/σα)   (6.73)

depending on the parameter σα. Note that the above distribution is the distribution conditional on Xi. This distributional assumption implies that αi is independent of Xi. Given the above assumptions, we have

f(Yi|Xi) = ∫ f(Yi|Xi, αi) g_{αi|Xi}(αi) dαi = ∫_{−∞}^{∞} Π_{t=1}^T f(Yit|Xi, αi) g_{αi|Xi}(αi) dαi.

We can maximize the following conditional log-likelihood function with respect to both β and σα:

Σ_{i=1}^N log f(Yi|Xi) = Σ_{i=1}^N log ∫_{−∞}^{∞} Π_{t=1}^T f(Yit|Xit, α; β) (1/σα) φ(α/σα) dα.   (6.74)

Since β and σα can be consistently estimated, we can estimate the partial effect at α = 0 and the APE, viz.

∂P(Yit = 1|Xit)/∂Xit,j = [βj/√(1 + σα²)] φ( Xit β/√(1 + σα²) ).   (6.75)
The integral in the log-likelihood function can be approximated using M-point Gauss-Hermite quadrature:

∫_{−∞}^{∞} e^{−X²} g(X) dX ≈ Σ_{m=1}^M w*_m g(a*_m),

where the w*_m denote the quadrature weights and the a*_m denote the quadrature nodes. The log-likelihood function is then calculated using

L = Σ_{i=1}^N log[ Σ_{m=1}^M w*_m (1/√π) Π_{t=1}^T f(Yit|Xit, √2 σα a*_m; β) ].

When T is small, Π_{t=1}^T f(Yit|Xit, α; β) can be well approximated by a polynomial in α. In this case, the Gauss-Hermite quadrature provides a good approximation to the integral. Some simulations show that M = 50 quadrature points is a safe upper bound. When T is large, the Gauss-Hermite approximation can be very poor. The quality of the approximation also depends on the value of σα: the larger σα is, the poorer the approximation.
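The random-effects probit MLE in (6.74), computed with the Gauss-Hermite approximation just described, is what Stata's xtprobit command implements. A minimal sketch with hypothetical variable names:

* Random-effects probit; intpoints() sets the number of quadrature points M.
xtset id year
xtprobit y x1 x2, re intpoints(24)

* quadchk refits the model with different numbers of quadrature points;
* unstable coefficients signal that the quadrature approximation is poor
* (e.g., large T or large sigma_alpha, as noted above).
quadchk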

Some Extensions
Assumptions (6.70) and (6.73) are very strong, and it is possible to relax them.

Relaxing Temporal Independence of Uit   Consider

Yit = 1{Yit* > 0},   Yit* = Xit β + αi + Uit,   (6.76)

where, conditional on Xit, αi ~ N(0, σα²) and Uit ~ N(0, 1), and αi and Uit are independent. However, Uit may be correlated with Uis for t ≠ s. This assumption may be more reasonable and weaker than the assumption in the previous section, which assumes that, conditional on Xi, αi and Ui are independent, αi ~ N(0, σα²), and Ui ~ N(0, I_T).
For this model, we have

P(Yit = 1|Xit) = P(Uit + αi > −Xit β | Xit) = Φ( Xit β/√(1 + σα²) ).   (6.77)

Therefore, as in the previous section, we can estimate β/√(1 + σα²) from a pooled probit of Yit on Xit. If αi is truly present or Uit is autocorrelated, then Yit will not be independent across t. Robust inference is needed to account for the serial correlation, as discussed in the previous section.

Relaxing Independence of αi from Xi   To allow (flexible) correlation between Xi and αi, we may follow Chamberlain (1980), but we now need the true regression function and a distributional assumption on the error term of the α equation. Consider the model

Yit = 1{Yit* > 0},   Yit* = Xit β + αi + Uit,   (6.78)

where, conditional on (Xi, αi), Uit ~ iid N(0, 1) across t, and

αi | Xi ~ N(ψ + X̄i,· ξ, σa²)   (6.79)

or

αi = ψ + X̄i,· ξ + ai   (6.80)

with ai ~ N(0, σa²) and independent of Xi (here ψ denotes the intercept of the α equation). As in the linear model, we cannot estimate the effect of time-invariant variables, because they are indistinguishable from the effect of X̄i,· ξ.
The latent structure representation becomes Yit = 1{Yit* > 0} with

Yit* = Xit β + ψ + X̄i,· ξ + ai + Uit,   (6.81)

where, conditional on (Xi, ai), Uit ~ iid N(0, 1). The parameters β, ψ, ξ and σa can be estimated as before, i.e., by maximizing (6.74) with Xit properly redefined. More specifically, the log-likelihood function is

Σ_{i=1}^N log f(Yi|Xi)
 = Σ_{i=1}^N log ∫_{−∞}^{∞} [ Π_{t=1}^T Φ(X̃it θ + ai)^{Yit} (1 − Φ(X̃it θ + ai))^{1−Yit} ] (1/σa) φ(ai/σa) dai,   (6.82)

where X̃it = (1, Xit, X̄i,·) and θ' = (ψ, β', ξ').
If, conditional on (Xi, ai), Uit ~ N(0, 1) but Uit may be correlated with Uis, we can still estimate the scaled version of θ. In this case, we have

P(Yit = 1|Xi) = Φ( (Xit β + ψ + X̄i,· ξ)/√(1 + σa²) ) := Φ(X̃it θa).   (6.83)

The partial likelihood function Π_{t=1}^T Π_{i=1}^N P(Yit|Xi) can be regarded as derived from different waves, with each wave corresponding to one time period. For example, in wave/period t, the observations are (Yit, Xi), i = 1, ..., N. It is important to point out that

Π_{t=1}^T [ Π_{i=1}^N P(Yit|Xi) ]

is not the likelihood function of observing {Yi}_{i=1}^N conditional on {Xi}_{i=1}^N. In the presence of cross-sectional independence, the latter likelihood function should be

Π_{i=1}^N P(Yi|Xi) = Π_{i=1}^N ∫_{−∞}^{∞} P(Yi|Xi, a) (1/σa) φ(a/σa) da
 = Π_{i=1}^N ∫_{−∞}^{∞} [ Π_{t=1}^T P(Yit|Yi,t−1, ..., Yi1, Xi, ai) ] (1/σa) φ(ai/σa) dai.

If, in addition, Uit ~ iid N(0, 1) across t conditional on (Xi, ai), then the above likelihood function becomes

Π_{i=1}^N ∫_{−∞}^{∞} [ Π_{t=1}^T P(Yit|Xi, ai) ] (1/σa) φ(ai/σa) dai.

The average (structural) response probability is

Eα P(Yit = 1|Xit = xo, αi) = Eα Φ(xo β + αi)
 = E[ Ea Φ(xo β + ψ + X̄i,· ξ + ai) ] = E Φ( (xo β + ψ + X̄i,· ξ)/√(1 + σa²) ),   (6.84)

where Eα (Ea) is the expectation with respect to the marginal distribution of αi (ai). The corresponding APE is

∂Eα P(Yit = 1|Xit = xo, αi)/∂xo,j = [βj/√(1 + σa²)] E φ( (xo β + ψ + X̄i,· ξ)/√(1 + σa²) ).

The average (structural) response probability (over the whole population) can be estimated by

(1/N) Σ_{i=1}^N Φ( xo β̂a + ψ̂a + X̄i,· ξ̂a ).   (6.85)

The corresponding estimator of the APE is then

(1/N) Σ_{i=1}^N β̂a,j φ( xo β̂a + ψ̂a + X̄i,· ξ̂a ).
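In practice, this correlated-random-effects (Chamberlain/Mundlak) probit and its APEs can be computed from a pooled probit that includes the time averages; a hedged Stata sketch with hypothetical variable names (time-varying regressors x1 and x2, panel identifier id):

* Time averages implement the Chamberlain/Mundlak device.
bysort id: egen x1bar = mean(x1)
bysort id: egen x2bar = mean(x2)

* Pooled probit delivers the scaled coefficients theta_a in (6.83);
* cluster-robust SEs account for serial dependence within i.
probit y x1 x2 x1bar x2bar, vce(cluster id)

* Average partial effects, the sample analogue of the derivative of (6.85).
margins, dydx(x1 x2)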

Note that the average (structural) response probability is NOT

Eα P(Yit = 1|Xit = xo) = Eα[ Φ(xo β + αi) | Xit = xo ].

The latter is

Eα P(Yit = 1|Xit = xo) = E[ Φ( (xo β + ψ + X̄i,· ξ)/√(1 + σa²) ) | Xit = xo ] ≠ E Φ( (xo β + ψ + X̄i,· ξ)/√(1 + σa²) ).   (6.86)

The naive APE based on the above would be

(∂/∂xo,j) E[ Φ( (xo β + ψ + X̄i,· ξ)/√(1 + σa²) ) | Xit = xo ]
 = (∂/∂xo,j) ∫ Φ( (xo β + ψ + x̄ξ)/√(1 + σa²) ) f_{X̄i,·|Xit}(x̄|xo) dx̄
 = ∫ [ (∂/∂xo,j) Φ( (xo β + ψ + x̄ξ)/√(1 + σa²) ) ] f_{X̄i,·|Xit}(x̄|xo) dx̄   [has a structural interpretation]
   + ∫ Φ( (xo β + ψ + x̄ξ)/√(1 + σa²) ) [ ∂f_{X̄i,·|Xit}(x̄|xo)/∂xo,j ] dx̄   [has no structural interpretation in general]
 = APE(f_{X̄i,·|Xit}(x̄|xo)) + Bias,

where

APE(f_{X̄i,·|Xit}(x̄|xo)) = ∫ [ (∂/∂xo,j) Φ( (xo β + ψ + x̄ξ)/√(1 + σa²) ) ] f_{X̄i,·|Xit}(x̄|xo) dx̄

has a structural interpretation, and

Bias = ∫ Φ( (xo β + ψ + x̄ξ)/√(1 + σa²) ) [ ∂ log f_{X̄i,·|Xit}(x̄|xo)/∂xo,j ] f_{X̄i,·|Xit}(x̄|xo) dx̄
 = E[ Φ( (Xit β + ψ + X̄i,· ξ)/√(1 + σa²) ) ∂ log f_{X̄i,·|Xit}(X̄i,·|Xit)/∂Xit,j | Xit = xo ]
 = cov( Φ( (Xit β + ψ + X̄i,· ξ)/√(1 + σa²) ), ∂ log f_{X̄i,·|Xit}(X̄i,·|Xit)/∂Xit,j | Xit = xo )

is the structural or causal discrepancy.

6.5.3 Unobserved-effect Logit Model under Strict Exogeneity (Conditional Logit)
We consider the same model as in the previous section:

Yit = 1 {Yit∗ > 0} ,


Yit∗ = Xit β + αi + Uit , (6.87)

where, conditional on (Xi, αi), Uit is iid logistic across i and t.
The problem with the logit model is that integrating P(Yit = 1|Xi, αi) = Λ(Xit β + αi) with respect to the normal density (or another popular continuous density) yields no simple functional form. However, fixed effects logit is possible. The idea is to find the joint distribution of Yi conditional on Xi, αi and ni = Σ_{t=1}^T Yit. It turns out that this conditional joint density does not depend on αi, so that it is also the distribution of Yi given Xi and ni. The idea is in essence that ni is a sufficient statistic for αi: given {Xit}_{t=1}^T, the likelihood of {Yit}_{t=1}^T does not depend on αi when conditioned on ni. This is the same as in the linear case, but in the current situation the conditional likelihood is more complicated.
First, consider the T = 2 case, where ni takes values in {0, 1, 2}. Intuitively, the conditional distribution of (Yi1, Yi2) given ni cannot be informative for β when ni = 0 or 2, as these values completely determine the outcome on Yi. However, for ni = 1,

P(Yi1 = 0, Yi2 = 1|Xi, α, ni = 1)
 = P(Yi1 = 0, Yi2 = 1|Xi, α) / P(ni = 1|Xi, α)
 = P(Yi1 = 0, Yi2 = 1|Xi, α) / [ P(Yi1 = 0, Yi2 = 1|Xi, α) + P(Yi1 = 1, Yi2 = 0|Xi, α) ]
 = Λ(Xi2 β + αi)(1 − Λ(Xi1 β + αi)) / [ (1 − Λ(Xi1 β + αi))Λ(Xi2 β + αi) + Λ(Xi1 β + αi)(1 − Λ(Xi2 β + αi)) ]
 = exp(Xi2 β + αi) / [ exp(Xi2 β + αi) + exp(Xi1 β + αi) ]
 = exp((Xi2 − Xi1)β) / [ 1 + exp((Xi2 − Xi1)β) ]
 = Λ((Xi2 − Xi1)β).

Similarly,

P(Yi1 = 1, Yi2 = 0|Xi, α, ni = 1) = 1 − Λ((Xi2 − Xi1)β).   (6.88)

The conditional likelihood for observation i is

1{ni = 1} { Wi log Λ((Xi2 − Xi1)β) + (1 − Wi) log[1 − Λ((Xi2 − Xi1)β)] },   (6.89)

where

Wi = 1{Yi1 = 0, Yi2 = 1}.   (6.90)

The above likelihood approach is equivalent to a standard cross-sectional logit of Wi on Xi2 − Xi1 using the observations for which ni = 1.
To generalize the result from T = 2 to a more general T, we derive an alternative representation of P(Yi1 = 0, Yi2 = 1|Xi, αi, ni = 1):

P(Yi1 = yi1, Yi2 = yi2|Xi = xi, αi, ni = 1) = exp[yi1(xi1 β + αi) + yi2(xi2 β + αi)] / Σ' exp[ai1(xi1 β + αi) + ai2(xi2 β + αi)],

where the sum Σ' is over all (ai1, ai2) with ai1 + ai2 = 1. Hence

P(Yi1 = 0, Yi2 = 1|Xi, αi, ni = 1) = exp(Xi2 β) / Σ' exp[ai1(Xi1 β) + ai2(Xi2 β)].

Similarly,

P(Yi1 = 1, Yi2 = 0|Xi, αi, ni = 1) = exp(Xi1 β) / Σ' exp[ai1(Xi1 β) + ai2(Xi2 β)].
For a general T, the log-likelihood is more complicated, but it is tractable. First,

P(Yi1 = yi1, ..., YiT = yiT|Xi, αi, ni = n)
 = P(Yi1 = yi1, ..., YiT = yiT|Xi, αi) / P(ni = n|Xi, αi)
 = Π_{t=1}^T P(Yit = yit|Xi, αi) / Σ_{ai∈Ri} P(Yi1 = ai1, ..., YiT = aiT|Xi, αi)
 = exp( Σ_{t=1}^T yit Xit β ) / Σ_{ai∈Ri} exp( Σ_{t=1}^T ait Xit β ),   (6.91)

where

Ri = { ai ∈ R^T : ait ∈ {0, 1}, Σ_{t=1}^T ait = ni }.   (6.92)

The log-likelihood summed over i can be used to obtain a √N-asymptotically normal estimator of β, and all inference follows from conditional MLE theory.

Remark 39 The log-odds ratio depends on α, which is not known.

Remark 40 We cannot estimate the average partial effect because we do not know the distribution of αi. Even worse, the mean of αi may be nonzero. Alternatively, we can include a constant in the regression so that the mean of αi is zero. But in this case, the constant is "differenced" out and cannot be estimated from the conditional logit.

Remark 41 The consistency relies on the assumption that Uit is iid logistic across i and t
conditional on (Xi , αi ) .

Remark 42 Consider the special case with ni = 1 for all i. Then

P(Yi1 = yi1, ..., YiT = yiT|Xi, αi, ni = 1) = exp( Σ_{t=1}^T yit Xit β ) / Σ_{s=1}^T exp(Xis β).   (6.93)

Define Wi = t if yit = 1; then

P(Wi = t|Xi, αi, ni = 1) = exp(Xit β) / Σ_{s=1}^T exp(Xis β).   (6.94)

Formally, this is the same as the conditional logit model in the next chapter, where we model unordered multinomial responses.
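The conditional (fixed-effects) logit estimator described above is implemented in Stata's clogit command; a minimal sketch with hypothetical variable names:

* Fixed-effects (conditional) logit; group() defines the conditioning unit i.
* Individuals with n_i = 0 or n_i = T drop out automatically because they carry
* no information about beta (see the T = 2 discussion above).
clogit y x1 x2, group(id)

Only β is estimated; as noted in Remark 40, average partial effects are not available because the distribution of αi is left unrestricted.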

6.5.4 Dynamic Unobserved Effect Model


The model:

P(Yit = 1|Yi,t−1, ..., Yi,0, Zi, αi) = G(Zit δ + ρYi,t−1 + αi),   (6.95)

where we have assumed that Zit is strictly exogenous. Alternatively, we write

Yit = 1{Zit δ + ρYi,t−1 + αi + Uit ≥ 0},

where Uit ~ iid G(·) with G'(·) symmetric. In addition, Uit is independent of {Yi,t−1, ..., Yi,0, Zi, αi}.
The joint density is

f(Yi,1, Yi,2, ..., Yi,T|Yi,0, Zi, αi; β) = Π_{t=1}^T P(Yi,t|Yi,t−1, ..., Yi,0, Zi, αi; β)   (6.96)
 = Π_{t=1}^T G(Zi,t δ + ρYi,t−1 + αi)^{Yi,t} (1 − G(Zi,t δ + ρYi,t−1 + αi))^{1−Yi,t}.   (6.97)

With fixed-T asymptotics, this density will not deliver a consistent estimator of β, due to the incidental parameter problem. To avoid the incidental parameter problem, we again make distributional assumptions on the αi's and integrate them out:

f(Yi,1, Yi,2, ..., Yi,T|Yi,0, Zi; θ) = ∫_{−∞}^{∞} f(Yi,1, Yi,2, ..., Yi,T|Yi,0, Zi, α; β) h(α|Yi0, Zi; γ) dα.   (6.98)

When G = Φ, it is convenient to assume that αi = ψ + Yi0 ξ0 + Zi ξ + ai, where ai ~ N(0, σa²) and is independent of (Yi0, Zi). In this case, we have

Yi,t = 1{ψ + Zi,t δ + ρYi,t−1 + Yi,0 ξ0 + Zi ξ + ai + eit > 0},   (6.99)

where eit ~ iid N(0, 1) and is independent of the other variables. Therefore, the log-likelihood of Yi,1, Yi,2, ..., Yi,T given (Yi,0, Zi), summed over i, is

Σ_{i=1}^N log ∫_{−∞}^{∞} Π_{t=1}^T f(Yi,t|Yi,0, Zi, a; β) (1/σa) φ(a/σa) da,   (6.100)

where

f(Yi,t|Yi,0, Zi, a; β) = Φ(Xi,t β + a)^{Yi,t} (1 − Φ(Xi,t β + a))^{1−Yi,t}   (6.101)

and Xi,t = (1, Zi,t, Yi,t−1, Yi,0, Zi).
For more details, such as how to initialize the process differently, see Ch. 7.4 of Hsiao (2003).
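A hedged Stata sketch of this dynamic random-effects probit (hypothetical variable names; the notes condition on the full vector Zi, while the sketch uses its time average, a common simplification):

* Construct the lag, the initial condition, and the time average of z.
xtset id year
gen y_lag = L.y
bysort id (year): gen y0 = y[1]
bysort id: egen zbar = mean(z)

* Random-effects probit with the initial condition and zbar as controls,
* as in (6.99); i.year adds the time effects lambda_t.
xtprobit y z y_lag y0 zbar i.year, re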

6.6 Problem Set


1. Extract data in file gdp.csv from the course webpage. This data file contains n = 1000
IID observations (Yi , Xi ) that I generated from the binary probability model:
Y = 1 with probability Ψ(Xθ) and Y = 0 with probability 1 − Ψ(Xθ),   (6.102)

where Ψ(Xθ) is some parametric model of the conditional probability of the binary variable Y given X, i.e., Ψ(Xθ) = P{Y = 1|X, θ}.
(a) Using the artificially generated data, compute maximum likelihood estimates of the
parameters (θ0 , θ1 , θ2 ) of the logit and probit specifications where Xθ is given by: X 0 θ =
θ0 + θ1 X + θ2 X 2 .

(b) Is it possible to consistently estimate θ by doing nonlinear least squares estimation of


the nonlinear regression formulation of the binary probability model

Y = Ψ(Xθ) + η (6.103)

instead of doing maximum likelihood? If so, provide a proof of the consistency of the NLLS
estimator. If not, provide a counterexample showing that the NLLS estimator is inconsistent.

(c) Estimate both the probit and logit specifications by nonlinear least squares as suggested
in part (b). How do the parameter estimates and standard errors compare to the maximum
likelihood estimates computed in part (a)?

(d) Is there any problem of heteroscedasticity in the nonlinear regression formulation of the
problem in part (c)? If so, derive the form of the heteroscedasticity and, using the estimated
“first stage” parameters from part (b) above, compute second stage “feasible generalized least
squares” (FGLS) estimates of θ. More specifically, the FGLS estimator is defined to be
θ̂_FGLS = arg min_{θ∈Θ} Σ_{i=1}^N [Yi − Ψ(Xi θ)]² / { [1 − Ψ(Xi θ̂)] Ψ(Xi θ̂) },   (6.104)

where θ̂ is an initial consistent estimator of θ.

(e) Are the FGLS estimates of θ consistent and asymptotically normally distributed (as-
suming the model is correctly specified)? If so, derive the asymptotic distribution of the FGLS
estimator, and if not provide a counter example showing that the FGLS estimator is inconsis-
tent or not asymptotically normally distributed. If you conclude that the FGLS estimator is
asymptotically normally distributed, is it as e¢cient as the maximum likelihood estimator of
θ? Explain your reasoning for full credit.

(f) A friend suggests estimating θ by


θ̂_A = arg min_{θ∈Θ} Σ_{i=1}^N [Yi − Φ(Xi θ)]² / { [1 − Φ(Xi θ)] Φ(Xi θ) },   (6.105)

and claims that θ̂A is asymptotically equivalent to θ̂F GLS ? Do you agree? Explain. (What is
the limit of θ̂A ?)

(g) Under some conditions, the FGLS estimator is asymptotically equivalent to θ̃_FGLS, which satisfies the first order condition

Σ_{i=1}^N [Yi − Φ(Xi θ̃_FGLS)] φ(Xi θ̃_FGLS) Xi / { [1 − Φ(Xi θ̂)] Φ(Xi θ̂) } = 0.   (6.106)

Another friend suggests estimating θ by θ̂_B, which satisfies

Σ_{i=1}^N [Yi − Φ(Xi θ̂_B)] φ(Xi θ̂_B) Xi / { [1 − Φ(Xi θ̂_B)] Φ(Xi θ̂_B) } = 0,   (6.107)

and claims that θ̂_B is asymptotically equivalent to θ̂_FGLS. Do you agree? Explain.

2. Download the data employ.xls from the class webpage. The data file contains em-
ployment information for 1881 young men over the years 1981-1987. Restrict your attention
to black men.
(a) Use pooled probit to estimate the model

P (employit = 1|employit−1 ) = Φ (β 0 + β 1 employit−1 ) .

What assumption is needed to ensure that the usual standard errors and test statistics from pooled probit are asymptotically valid? Can you compute standard errors that are robust to the serial correlation?

(b) Estimate P (employt = 1|employt−1 = 1) and P (employt = 1|employt−1 = 0). Explain


how would you obtain standard errors of these estimates.

(c) Add a full list of year dummies to the analysis in part (a), and estimate the probabilities
in part (b) for 1987. Are there important di§erences with estimates in part (b)?
(d) Now estimate a dynamic unobserved effects model in which

employ_it = 1{β0 + β1 employ_{i,t−1} + β2 employ_{i,1981} + λt + ai + eit > 0},

where ai ~ iid N(0, σa²) across i, eit ~ iid N(0, 1) across i and t, and {ai}_{i=1}^N is independent of {eit, i = 1, ..., N; t = 1, ..., Ti}.

(e) The partial e§ect for a given employ1981 is defined to be

P (employt = 1|employt−1 = 1, employ1981 ) − P (employt = 1|employt−1 = 0, employ1981 ).

Compute the average partial e§ect for year 1987 by averaging the partial e§ect over employ1981 .
Chapter 7

Multinomial Response Models

So far we have talked about 0/1 decisions. What if there are more response categories?
An important distinction is between ordered categorical data, where the response categories
possess a natural ordering (e.g. low income, mid-level income or high income, bond rating
A, B, C,... ), and unordered categorical data, where the response categories are mere labels
totally devoid of structure (e.g. traveling by bus, by train or by car). Different models are used in the two cases.

7.1 Probabilistic Choice Model for Unordered Response


Probabilistic choice models are based on utility maximization. We assume that the utility of individual i from alternative j is given by

y*_ij = xij β + aij,   j = 0, 1, 2, ..., J,   (7.1)

where xij is the vector of values of attributes of the j-th choice as perceived by the i-th individual and aij is a (0, σ²) random variable which captures factors unobservable to the researcher. Each individual chooses the option that maximizes his utility. Let yi denote the choice of individual i; then

yi = arg max_j {y*_i0, y*_i1, ..., y*_iJ}.   (7.2)

As an example, consider a person who can take a car, a bus or a subway to work. The researcher observes the time and cost that the person would incur under each mode. However, the researcher realizes that there are factors other than time and cost that affect the person's utility and hence his choice. The researcher specifies

yic = Tic β 1 + Mic β 2 + aic (7.3)

yib = Tib β 1 + Mib β 2 + aib (7.4)

yis = Tis β 1 + Mis β 2 + ais (7.5)

where Tic and Mic are the time and cost (in money) that the person incurs traveling to work
by car, Tib and Mib , Tis and Mis are defined analogously for bus and subway.


The probability that the person chooses bus instead of car and subway is the probability
that
β 1 Tib + β 2 Mib + aib > β 1 Tic + β 2 Mic + aic
and
β 1 Tib + β 2 Mib + aib > β 1 Tis + β 2 Mis + ais
Remark 43 Can we include a constant in the utility specification so that

y*_ij = α + xij β + aij,   j = 0, 1, 2, ..., J?   (7.6)

The answer is no. The presence of α changes all the utilities (y*_i0, y*_i1, ..., y*_iJ) by the same amount. The ranking of the utilities, and thus the individual's choice, does not depend on α. Therefore, the intercept α is not identified.
Remark 44 Can we include an alternative-specific constant in the utility specification so that

y*_ij = αj + xij β + aij,   j = 0, 1, 2, ..., J?   (7.7)

The answer is yes. But we cannot identify all the αj's and have to normalize one α, say α0, to be zero. The alternative with zero α is called the base category.
Remark 45 Since

arg max_j {y*_ij} = arg max_j {y*_ij − y*_i0} = arg max_j {(xij − xi0)β + aij − ai0},   (7.8)-(7.9)

we cannot include a variable in xij if it is constant across the different alternatives. For example, we cannot include age in y*_ij by simply letting

y*_ic = Tic β1 + Mic β2 + Agei β3 + aic,   (7.10)
y*_ib = Tib β1 + Mib β2 + Agei β3 + aib,   (7.11)
y*_is = Tis β1 + Mis β2 + Agei β3 + ais,   (7.12)

because in this specification age does not affect one's decision. If we believe that age actually plays a role, then we have to allow the coefficient associated with age to change with the alternative. In general, we can assume

y*_ij = xij β + zi γj + aij,   j = 0, 1, 2, ..., J,   (7.13)

where zi is the vector of individual-specific variables. In the above specification, we cannot identify all the γ's and need to normalize one γ, say γ0, to be zero. It is recommended to normalize the γ for the base category to be zero.
Remark 46 A variant of the utility specification is to allow β to depend on individual-specific characteristics. For example,

β_k^(i) = βk + wi θk + σk u_k^(i),

where β_k^(i) can be regarded as individual i's marginal utility for the k-th covariate. This specification is used widely in the empirical IO literature, which considers aggregation over consumers' choices to estimate demand parameters using market-level data. We will not discuss this extension in this class.

Remark 47 Another variant of the utility specification is to allow aij to be heteroskedastic.


In this case, var(aij ) = σ 2j .

7.2 Conditional and Multinomial Logit Models


7.2.1 The model
Assume that aij are independently distributed with CDF

F (a) = exp (− exp(−a)) , (7.14)

the type I extreme value distribution, then

f (a) = F 0 (a) = exp (−a − exp(−a)) . (7.15)

The mean of the type I extreme value distribution is

∫_{−∞}^{∞} a exp(−a − exp(−a)) da ≈ 0.57722 (Euler's constant)

and the variance is

∫_{−∞}^{∞} (a − 0.57722)² exp(−a − exp(−a)) da ≈ 1.6449 (= π²/6).

[Figure: solid line, PDF of the type I extreme value distribution; dotted line, PDF of the normal distribution with the same mean and variance.]
In this case, we can show that

P(yi = j|xi, zi) = exp(vij) / Σ_{h=0}^J exp(vih),   where vij = xij β + zi γj   (7.16)

and xi = (xi0, ..., xiJ).

Proof.

P(yi = j|xi, zi) = P(y*_ij > y*_{i,−j}) = P(vij + aij > vik + aik for all k ≠ j).   (7.17)
So P(yi = j|xi, zi) is

P(aij + vij − vik > aik for all k ≠ j)
 = ∫_R Π_{k≠j} F(aij + vij − vik) f(aij) daij
 = ∫_R Π_{k≠j} exp(−exp(−aij − (vij − vik))) exp{−aij − exp(−aij)} daij
 = ∫_R exp( −Σ_{k≠j} exp(−ξ − (vij − vik)) ) exp[−ξ − exp(−ξ)] dξ   (7.18)
 = ∫_R exp( −exp(−ξ) Σ_{k≠j} exp(vik − vij) ) exp[−ξ − exp(−ξ)] dξ
 = −∫_{−∞}^{∞} exp(−exp(−ξ)η) exp[−exp(−ξ)] d exp(−ξ),

where η = Σ_{k≠j} exp(vik − vij). Let λ = exp(−ξ); then the above probability becomes

∫_0^{∞} exp(−λη) exp(−λ) dλ = ∫_0^{∞} exp(−λ(η + 1)) dλ = −[1/(η + 1)] exp(−λ(η + 1)) |_0^{∞} = 1/(η + 1).   (7.19)

Therefore

P(yi = j|xi, zi) = 1/[1 + Σ_{k≠j} exp(vik − vij)] = exp(vij) / Σ_{h=0}^J exp(vih),   (7.20)

which completes the proof.

Remark 48 When vij = xij β, we have

P(yi = j|xi) = exp(xij β) / Σ_{h=0}^J exp(xih β),   j = 0, 1, ..., J.   (7.21)

The above probabilities constitute what is usually called the conditional logit model. See McFadden (1973).

Remark 49 When vij = zi γj, we have

P(yi = j|zi) = exp(zi γj) / Σ_{h=0}^J exp(zi γh),   j = 0, 1, ..., J.   (7.22)

The above probabilities constitute what is usually called the multinomial logit model.
Remark 50 The difference between the conditional logit and multinomial logit models:

• In the MNL model, the conditioning variables do not change across alternatives: for each i, zi contains variables specific to the individual but not to the alternatives. The model is appropriate for problems where characteristics of the alternatives are not important. Multinomial logit models help us answer the question "how do individuals' characteristics affect their choice probabilities." Multinomial logit is a generalization of the binary logit. What if instead we are interested in how the characteristics of the categories affect individuals' likelihood of choosing them?

• The CL model is intended specifically for problems where the individual choices are made, at least in part, based on the observable attributes of each alternative. Conditional logit models help us answer the question "how do the characteristics of the categories affect individuals' choice probabilities."

• Define a set of dummies to indicate the alternatives:

D0j = 1 if j = 0 and 0 otherwise,   D1j = 1 if j = 1 and 0 otherwise,   .....

Then

γj = D0j γ0 + D1j γ1 + ... + DJj γJ

and

zi γj = (zi × D0j) γ0 + (zi × D1j) γ1 + ... + (zi × DJj) γJ.

Let xij = (zi × D0j, zi × D1j, ..., zi × DJj) and γ = (γ0', γ1', ..., γJ')'; then

P(yi = j|zi) = exp(zi γj) / Σ_{h=0}^J exp(zi γh) = exp(xij γ) / Σ_{h=0}^J exp(xih γ) := P(yi = j|xi).

Therefore the CL model formally contains the MNL model as a special case. So, without loss of generality, we can focus on the CL model hereafter.
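In Stata (a hedged sketch with hypothetical variable names), the MNL model corresponds to mlogit, which uses individual-specific covariates only, while the CL model with alternative-specific attributes corresponds to asclogit; the dummy-interaction construction above is what lets the latter nest the former.

* Multinomial logit: one row per individual; covariates vary over i only.
mlogit choice income age, baseoutcome(1)

* Conditional logit: data in long form, one row per (individual, alternative),
* with attributes such as cost and time varying across alternatives.
asclogit chosen cost time, case(id) alternatives(alt) casevars(income) basealternative(1)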

7.2.2 Estimation of the CL Model


Given the probabilities P(yi = j|xi), we can estimate the logit model by MLE. The log-likelihood function is

ln L = Σ_{i=1}^n Σ_{j=0}^J 1{yi = j} ln P(yi = j|xi) = Σ_{i=1}^n Σ_{j=0}^J 1{yi = j} ln [ exp(vij) / Σ_{k=0}^J exp(vik) ].
The first order condition is

∂ ln L/∂β = Σ_{i=1}^n Σ_{j=0}^J 1{yi = j} (1/pij) [ (exp(vij)/Σ_{k=0}^J exp(vik)) ∂vij/∂β − (exp(vij)/(Σ_{k=0}^J exp(vik))²) Σ_{k=0}^J exp(vik) ∂vik/∂β ]
 = Σ_{i=1}^n Σ_{j=0}^J 1{yi = j} ( ∂vij/∂β − Σ_{k=0}^J pik ∂vik/∂β ).

When ∂vij/∂β = x'_ij, we have

Sn(β) = Σ_{i=1}^n Σ_{j=0}^J 1{yi = j} (xij − x̄i)' = 0,

where x̄i = Σ_{k=0}^J pik xik is the weighted average of the xik's. The FOC can be rewritten as

Σ_{j=0}^J Σ_{i=1}^n 1{yi = j} xij = Σ_{j=0}^J Σ_{i=1}^n pij xij.

If an alternative-specific constant is included in vij, then one of the elements of xij is Dhj, the dummy for the h-th choice. In this case,

Σ_{j=0}^J Σ_{i=1}^n 1{yi = j} Dhj = Σ_{j=0}^J Σ_{i=1}^n pij Dhj,

that is,

(1/n) Σ_{i=1}^n 1{yi = h} = (1/n) Σ_{i=1}^n pih.

So the market share is equal to the average predicted probability.


Note that the FOC can be rewritten as

Sn(β) = Σ_{i=1}^n Σ_{j=0}^J [yij − pij] x'_ij = 0,   (7.23)

where

yij := 1{yi = j}.

So the MLE is a moment-based estimator with the moment conditions being

E( Σ_{j=0}^J [yij − pij] x'_ij ) = 0.   (7.24)

In fact, we know that E(yij − pij|xi) = 0, which implies that E(yij − pij)x'_ij = 0 for all j = 0, ..., J. This is a set of overidentifying moment conditions. The moment conditions in (7.24) can be regarded as the optimal linear combination of E(yij − pij)x'_ij = 0 for j = 0, ..., J.
The variance of the score conditional on {xi}_{i=1}^n is

Var(Sn(β)|{xi}_{i=1}^n) = Σ_{i=1}^n Σ_{j=0}^J Σ_{k=0}^J E(yij − pij)(yik − pik) (xij − x̄i)'(xik − x̄i).

By definition,

E(yij − pij)(yik − pik) = −pij pik + pij 1{j = k}.¹

It follows that

Var(Sn(β)|{xi}_{i=1}^n) = Σ_{i=1}^n Σ_{j=0}^J pij (xij − x̄i)'(xij − x̄i).

This is because

Σ_{j=0}^J pij (xij − x̄i)' = 0 for any i.

If the model is correctly specified, Var(Sn(β)|{xi}_{i=1}^n) is the negative expected Hessian matrix. Using the general MLE theory, we deduce that the distribution of β̂ − β can be approximated by

N( 0, [ Σ_{i=1}^n Σ_{j=0}^J p̂ij (xij − x̄i)'(xij − x̄i) ]⁻¹ ).

If we use a version of the BHHH algorithm to obtain the MLE, the iterative step is given by

β^(k+1) = β^(k) + [ Σ_{i=1}^n Σ_{j=0}^J p^(k)_ij (xij − x̄^(k)_i)'(xij − x̄^(k)_i) ]⁻¹ Σ_{i=1}^n Σ_{j=0}^J (xij − x̄^(k)_i)' (yij − p^(k)_ij),

where

p^(k)_ij = pij(β^(k)) and x̄^(k)_i = Σ_{j=0}^J p^(k)_ij xij.

¹For those who are not familiar with the multinomial distribution, we can derive the equation from first principles. Note that for k ≠ j,

(yij, yik) = (1, 1) with probability zero; (0, 0) with probability 1 − pik − pij; (1, 0) with probability pij; (0, 1) with probability pik.

So

E(yij − pij)(yik − pik) = pij pik (1 − pik − pij) − (1 − pij) pik pij − pij (1 − pik) pik = −pij pik,

as desired.

7.2.3 The limitation of the model


Note that
P (yi = j|xi ) /P (yi = h|xi ) = exp [(xij − xih ) β] (7.25)
so the relative probabilities for any two alternatives depend only on the attributes of those
two alternatives. This is called the independence from irrelevant alternatives.
Under IIA, it is possible to consistently estimate model parameters by using only a subset
of alternatives in the decision making process. An example is provided in Train (2002) wherein
amongst 100 alternatives, a researcher may choose to estimate parameters using only 10 alter-
natives for each sampled person. Since, under the IIA property, relative probabilities within
this subset are unaffected by the attributes or existence of alternatives outside the subset, the consistency of the estimator is not affected by excluding a majority of the alternatives.
The IIA assumption is not plausible in many applications. Consider for example the choice
between a blue bus and a car. The IIA assumption implies that, if a new alternative, say a
red bus, is introduced, all of the existing probabilities are reduced by the same proportion,
irrespective of the new choice’s degree of similarity to any of the existing ones.
Suppose that a person is indifferent between car and bus, that is,

Pr(C|C, B) = Pr(B|C, B) = 0.5.   (7.26)

Since the person is indifferent between car and bus, it would be reasonable to assume that

Pr(C|C, B, R) = 0.5.   (7.27)

In this case, however, the logit link implies that

Pr(C|C, B, R) Pr(C|C, B)
= =1 (7.28)
Pr(B|C, B, R) Pr(B|C, B)

and
Pr(C|C, B, R) Pr(C|C, R)
= =1 (7.29)
Pr(R|C, B, R) Pr(R|C, R)
so
1
Pr(C|C, B, R) = Pr(B|C, B, R) = Pr(R|C, B, R) = (7.30)
3
which is less than 0.5.
The IIA problem arises because we assume that the aij are independent across j. If two alternatives are close substitutes, we expect their random utilities to be correlated. We can test whether some alternatives are potentially correlated by using a typical Hausman test (Hausman and McFadden, 1984). Under the null H0 of IIA, one can estimate all β parameters consistently and efficiently using the full data set. Denote this estimator by β̂^f. A useful way to obtain a robust estimator is to drop some potentially correlated alternatives. For example, suppose we suspect that alternatives J − 1 and J are correlated with each other and with the rest of the alternatives. We can drop these two alternatives and all the individuals who choose them, leading to a restricted data set with only J − 1 alternatives (j now runs from j = 0 to j = J − 2) and fewer than n observations. Based on the restricted choice set and data set, we can still estimate all of the parameters (if the model is a CL model). The choice probabilities with only J − 1 alternatives (instead of J + 1 alternatives) are:

P(yi = j|xi, zi) = exp(vij) / Σ_{h=0}^{J−2} exp(vih),   for j = 0, ..., J − 2.

Under IIA, the estimator β̂^r from the restricted choice set is consistent. When IIA is violated, β̂^r is still consistent but β̂^f is not. This provides the usual basis for a Hausman test. In the absence of some natural grouping of the alternatives, the choice of the subset to leave out is arbitrary and, hence, so is the test.
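A hedged Stata sketch of this Hausman-McFadden test (hypothetical variable names; data in long form with case identifier id and alternative identifier alt):

* Full choice set.
asclogit chosen cost time, case(id) alternatives(alt)
estimates store full

* Flag cases that picked alternative 5, then drop that alternative and those
* cases, and re-estimate on the restricted choice set.
bysort id: egen picked5 = max((alt == 5) * chosen)
asclogit chosen cost time if alt != 5 & picked5 == 0, case(id) alternatives(alt)
estimates store restricted

* Hausman contrast: a significant difference is evidence against IIA.
hausman restricted full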

7.3 Multinomial Probit Model


The multinomial probit model allows for a full correlation structure, with ai ~ N(0, Σ), and requires J-dimensional numerical integration. One has to impose normalization and identification restrictions on the (J + 1)(J + 2)/2 free elements of the (J + 1) × (J + 1) matrix Σ.
Consider the case J = 2. The choice of the first alternative, P[yi = 0|xi], corresponds to the joint occurrence of

η01 := ai0 − ai1 > −(xi0 − xi1)β   (7.31)

and

η02 := ai0 − ai2 > −(xi0 − xi2)β.   (7.32)

One can then derive the variance-covariance matrix of the joint normal pdf of η01 and η02, the 2 × 2 matrix Σ̃, from the original σ elements. Finally,

P(yi = 0|xi) = ∫_{−(xi0−xi1)β}^{∞} ∫_{−(xi0−xi2)β}^{∞} (1/(2π)) |Σ̃|^{−1/2} exp( −(1/2)(η01, η02) Σ̃^{−1} (η01, η02)' ) dη02 dη01.   (7.33)

Note: the original Σ has 6 free elements, but Σ̃ can have only three free elements. In addition, not all three free elements of Σ̃ can be identified; we need to impose one restriction on Σ̃. One way to achieve identification is to set ai0 = 0, so that σ00 = σ01 = σ02 = 0, and normalize σ11 = 1.
Alternatively, the independence assumption of CL can be relaxed using the generalized
extreme value (GEV) models. The GEV distribution generalizes the independent univariate
extreme value cdfs to allow for the correlation of ai across choices:

F (ai0, ai1 , ai2 , ..., aiJ ) = exp[−G(exp(−ai0 ), ..., exp(−aiJ ))] (7.34)

for some function G. The GEV approach has been widely used in the context of the nested
logit model. See Train (2003).

7.4 Nested Logit Model


Example 51 Choice of transportation mode:

Example 52 Choice of house: choose the neighborhood and select a specific house within a
chosen neighborhood. Choose to travel by plane, then choose among the airlines.

In the presence of a nested structure, we assume that the utility from house j in neighborhood k looks as follows:

Vkj = xkj β + zk α + akj,   (7.35)

where zk are characteristics of neighborhoods and xkj are house-specific characteristics. To facilitate estimation when the number of choices is very large but the decision problem has a tree structure, we use pkj = pk p(j|k), where it turns out that p(j|k) involves only β but not α. Under the assumption that akj has an iid type I extreme value distribution, we have

p(j|k) = exp(xkj β + zk α) / Σ_{h=1}^{Nk} exp(xkh β + zk α) = exp(xkj β) / Σ_{h=1}^{Nk} exp(xkh β),   (7.36)

where Nk is the number of houses in neighborhood k. Similarly,

pk = Σ_{j=1}^{Nk} exp(xkj β + zk α) / Σ_{m=1}^{C} Σ_{j=1}^{Nm} exp(xmj β + zm α)   (7.37)
   = exp(Ik + zk α) / Σ_{m=1}^{C} exp(Im + zm α),   (7.38)

where Ik = log Σ_{h=1}^{Nk} exp(xkh β) is the so-called inclusive value (the total contribution of the houses in a neighborhood). It is often referred to as the logsum. The expression for pk = pkj / p(j|k) may be derived from p(j|k) and pkj, where

pkj = exp(xkj β + zk α) / Σ_{m=1}^{C} Σ_{j=1}^{Nm} exp(xmj β + zm α),   (7.39)

which is obvious if we think of each individual as having Σ_{m=1}^{C} Nm options.
One can therefore first estimate β off the choice within neighborhoods (based on p(j|k)) and then use the β̂ to impute Îk and estimate α by maximizing a likelihood consisting of the pk. This sequential estimation provides consistent estimators and can be applied in all problems in which the number of choices is very large but the decision process has a tree structure.

The extension of this model to cases involving several branches of a tree is obvious. See
Maddala (Ch.3).
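A hedged sketch of this sequential (two-step) estimator in Stata, with hypothetical variable names (long-format data: one row per household-neighborhood-house, household id hhid, neighborhood id nbhd, chosen = 1 for the selected house, nbhd_chosen = 1 for every house in the chosen neighborhood, house attributes x1 x2, neighborhood attributes z1 z2):

* Step 1: estimate beta from the within-neighborhood choice p(j|k),
* using each household's chosen neighborhood only.
clogit chosen x1 x2 if nbhd_chosen == 1, group(hhid)

* Step 2: impute the inclusive value I_k = log sum_h exp(x_kh*beta_hat)
* for every neighborhood, then estimate alpha from the neighborhood choice p_k.
predict xb, xb
bysort hhid nbhd: egen Ik = total(exp(xb))
replace Ik = ln(Ik)
collapse (max) nbhd_chosen (mean) Ik z1 z2, by(hhid nbhd)
clogit nbhd_chosen Ik z1 z2, group(hhid)

Leaving the coefficient on Ik free amounts to estimating a common inclusive-value parameter λ as in the variance-component model below; under (7.37)-(7.38) it should be close to one. Stata's nlogit command estimates the λ-version of the model by full maximum likelihood in one step.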
Like the multinomial/conditional logit model, the above nested logit model suffers from the IIA property. When two alternatives are in the same nest, the IIA property holds, since the ratio of probabilities is independent of the existence of other alternatives:

p(j1|k)/p(j2|k) = exp(xkj1 β) / exp(xkj2 β).

However, if two alternatives are placed in different nests, then the IIA property no longer holds. One way to avoid the IIA problem within the same nest is to assume a variance-component structure for the random utility:

akj = ϵk + λk ϵkj   (7.40)

for some λk ∈ [0, 1], where ϵkj ~ type I extreme value and ϵk ~ C(λk), where C(λ) is a distribution defined below.
defined below.

Definition 53 The C(λ) distribution is defined to be the unique distribution for which v and
e are independent, v ∼ C(λ), and e ∼ type I extreme value, implies that v + λe ∼ type I
extreme value. See Cardell (1997).

In the variance component specification, ϵk is a common component for all houses in


neighborhood k and ϵkj is a random/unobservable component for house j in neighborhood k. The {λk} are parameters to be estimated; λk measures the degree of independence in the unobserved utility among the alternatives in the same neighborhood.
independence and less correlation. The correlation between akj1 and akj2 goes to zero as λ
approaches one and goes to one as λ approaches zero. This can be easily seen by noting that
the variance of a C(λ) random variable is proportional to (1 − λ2 ).
Given the variance component structure, the marginal and conditional probabilities can be written as

pk = exp(λk Ik + zk α) / Σ_{m=1}^{C} exp(λm Im + zm α),   (7.41)
p(j|k) = exp(xkj β/λk) / Σ_{h=1}^{Nk} exp(xkh β/λk),   (7.42)

where

Ik = ln Σ_{h=1}^{Nk} exp(xkh β/λk).   (7.43)

Now

p(j1|k)/p(j2|k) = exp(xkj1 β/λk) / exp(xkj2 β/λk),

which depends not only on xkj1 and xkj2 but also on λk, which captures the relative magnitude of the common component. If another alternative is introduced into the choice set, then λk will be different, leading to a different value of the ratio p(j1|k)/p(j2|k).
In addition,

pkj = [ exp(λk Ik + zk α) / Σ_{m=1}^{C} exp(λm Im + zm α) ] × [ exp(xkj β/λk) / Σ_{h=1}^{Nk} exp(xkh β/λk) ].

The above probabilities can be derived by using the following lemma:

Lemma 54 If εj is iid type I extreme value, j = 0, 1, ..., J, then for any constants {kj},
tJ = max_{0≤ℓ≤J}(kℓ + εℓ) − log Σ_{ℓ=0}^J exp(kℓ) is type I extreme value.

Proof. It suffices to prove the case with J = 1 (why? because if we can go from J = 0 to J = 1, then we can use this result to go from J = 1 to J = 2, and so on). We have, using the CDF of the type I extreme value distribution:

Pr(t1 < t)
 = Pr( max_{0≤ℓ≤1}(kℓ + εℓ) − log Σ_{ℓ=0}^1 exp(kℓ) < t )
 = Pr( max_{0≤ℓ≤1}(kℓ + εℓ) < t + log Σ_{ℓ=0}^1 exp(kℓ) )
 = Pr( ε0 < t − k0 + log Σ_{ℓ=0}^1 exp(kℓ) ) Pr( ε1 < t − k1 + log Σ_{ℓ=0}^1 exp(kℓ) )
 = exp( −exp{−[t − k0 + log Σ_{ℓ=0}^1 exp(kℓ)]} ) × exp( −exp{−[t − k1 + log Σ_{ℓ=0}^1 exp(kℓ)]} )
 = exp( −exp{−[t − k0]} / Σ_{ℓ=0}^1 exp(kℓ) ) exp( −exp{−[t − k1]} / Σ_{ℓ=0}^1 exp(kℓ) )
 = exp( −exp{−[t − k0]} / Σ_{ℓ=0}^1 exp(kℓ) − exp{−[t − k1]} / Σ_{ℓ=0}^1 exp(kℓ) )
 = exp( −exp(−t) [ exp(k0)/Σ_{ℓ=0}^1 exp(kℓ) + exp(k1)/Σ_{ℓ=0}^1 exp(kℓ) ] )
 = exp( −exp(−t) ).

So tJ indeed follows the type I extreme value distribution.
Remark 55 If kj = Xj, j = 0, ..., J, are random variables, then tJ = max_{0≤ℓ≤J}(Xℓ + εℓ) − log Σ_{ℓ=0}^J exp(Xℓ) is type I extreme value conditional on {Xj}. The conditional distribution does not depend on the conditioning variables. Hence tJ is independent of {Xj : j = 0, ..., J}.
Remark 56 This lemma can also be used to compute the social surplus defined by

U(Wi) = E[ max_{j=0,1,...,J}(vij + aij) | Wi ],

where Wi contains covariates, e.g., xij and zi. Assume that aij is iid type I extreme value; then max_{j=0,1,...,J}(vij + aij) − log Σ_{j=0}^J exp(vij) is type I extreme value. Note that the type I extreme value distribution has mean 0.57722, so

U(Wi) = 0.57722 + log Σ_{j=0}^J exp(vij).
With the social surplus function, we can place a dollar value on the effect of changing one or more of the determinants of choices, such as the price and time cost of travel in transportation mode choice.

We now come back to the nested logit model. Note that if we choose the k-th neighborhood, the utility derived from the houses in this neighborhood is

Vk = max_j {xkj β + λk ϵkj} + zk α + ϵk
  = λk max_j {xkj β/λk + ϵkj} + zk α + ϵk
  = λk [ max_j {xkj β/λk + ϵkj} − ln Σ_{h=1}^{Nk} exp(xkh β/λk) ] + λk ln Σ_{h=1}^{Nk} exp(xkh β/λk) + zk α + ϵk   (7.44)
  = λk Ik + zk α + ηk,

where

ηk = ϵk + λk [ max_j {xkj β/λk + ϵkj} − ln Σ_{h=1}^{Nk} exp(xkh β/λk) ]

is a type I extreme value random variable and ηk is independent of Ik and zk. It now follows from (7.44) that equation (7.41) holds.

7.5 Ordered Probit and Logit Model


Example 57 Ratings, opinion surveys, attained education level: '0' < '1' < '2' but '1' − '0' ≠ '2' − '1'.

Let y be an ordered response taking on the values {0, 1, 2, ..., J} for some known integer J. The ordered probit model can be derived from a latent variable model. Assume that a latent variable y* is defined by

y* = xβ + e,   e|x ~ N(0, 1).   (7.45)

Let c1 < c2 < ... < cJ be unknown cut points and define

y = 0 if y* ≤ c1;   y = 1 if c1 < y* ≤ c2;   ...;   y = J if y* > cJ.   (7.46)

Given the standard normal assumption, we can compute each response probability:

P(y = 0|x) = P(y* ≤ c1) = P(xβ + e ≤ c1) = Φ(c1 − xβ),
P(y = 1|x) = P(c1 < y* ≤ c2) = P(c1 < xβ + e ≤ c2) = Φ(c2 − xβ) − Φ(c1 − xβ),
...
P(y = J|x) = P(y* > cJ) = P(xβ + e > cJ) = 1 − Φ(cJ − xβ).

The parameters c and β can be estimated by maximum likelihood. For each i, the log-likelihood is

li(c, β) = 1{yi = 0} log Φ(c1 − xi β) + 1{yi = 1} log[Φ(c2 − xi β) − Φ(c1 − xi β)]
          + ... + 1{yi = J} log[1 − Φ(cJ − xi β)].   (7.47)

Note that P(y ≤ j|x) = Φ(c_{j+1} − xβ) for j = 0, ..., J − 1, which is a system of J probit equations. Optimal GMM in this system should be asymptotically equivalent to ordered probit MLE.
Other distribution functions can be used in place of Φ. Replacing Φ with the logistic CDF, Λ, gives the ordered logit model.
The focus of interest is ∂P(y = 1|x)/∂xj and the ci's. Interpreting the coefficients based on their sign alone is not straightforward in the ordered response model. See the textbook by Wooldridge (2010).
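A minimal Stata sketch (hypothetical variable names):

* Ordered probit; the cut points c_1,...,c_J are reported as /cut1, /cut2, ...
oprobit rating x1 x2

* Partial effects on a particular response probability, e.g. P(y = 1|x).
margins, dydx(x1 x2) predict(outcome(1))

* Ordered logit replaces the normal CDF with the logistic CDF.
ologit rating x1 x2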

7.6 Problems
Download the file ca_heating.xls from course web site. The file contains data on choice
of heating system in California houses. The observations consist of single-family houses in
California that were newly built and had central air-conditioning. The choice is among heating
systems. Five types of systems are considered to have been possible:
(1) gas central,
(2) gas room,
(3) electric central,
(4) electric room,
(5) heat pump.
There are 900 observations with the following variables:
idcase gives the observation number (1-900)
depvar identifies the chosen alternative (1-5)
ic1 is the installation cost for a gas central system
ic2 is the installation cost for a gas room system
ic3 is the installation cost for an electric central system
ic4 is the installation cost for an electric room system
ic5 is the installation cost for a heat pump
oc1 is the annual operating cost for a gas central system
oc2 is the annual operating cost for a gas room system
oc3 is the annual operating cost for an electric central system
oc4 is the annual operating cost for an electric room system
oc5 is the annual operating cost for a heat pump
income is the annual income of the household
agehed is the age of the household head
rooms is the number of rooms in the house

ncostl identifies whether the house is in the northern coastal region


scostl identifies whether the house is in the southern coastal region
mountn identifies whether the house is in the mountain region
valley identifies whether the house is in the central valley region
Note that the attributes of the alternatives, namely, installation cost and operating cost,
take a di§erent value for each alternative. Therefore, there are 5 installation costs (one for
each of the 5 systems) and 5 operating costs. These costs were calculated as the amount the
system would cost if it were installed in the house, given the characteristics of the house (such
as size), the price of gas and electricity in the house location, and the weather conditions in
the area (which determine the necessary capacity of the system and the amount it will be
run.)
(a) Estimate the following multinomial logit model

P(yi = j|x) = exp(x^(1)_ij β1 + x^(2)_ij β2) / Σ_{h=1}^5 exp(x^(1)_ih β1 + x^(2)_ih β2)   (7.48)

for j = 1, 2, 3, 4, 5, where x^(1)_ij is the installation cost of the j-th heating system for household i and x^(2)_ij is the annual operating cost of the j-th heating system for household i. Do the estimated coefficients (β̂1, β̂2) have the expected signs? Are both coefficients significantly different from zero?

(b) How closely do the average probabilities match the shares of customers choosing each
alternative?

(c) The ratio of coefficients usually provides economically meaningful information. The willingness to pay (wtp) through higher installation cost for a one-dollar reduction in operating costs is the ratio of the operating cost coefficient to the installation cost coefficient. What is
the estimated wtp from this model? Is it reasonable in magnitude?

(d) We can use the estimated wtp to obtain an estimate of the discount rate that is implied
by the model of choice of operating system. The present value of the future operating costs is
the discounted sum of operating costs over the life of the system:
PV = Σ_{t=1}^L OC/(1 + r)^t,   (7.49)

where r is the discount rate and L is the life of the system. As L rises, the PV approaches (1/r)OC. Therefore, for a system with a sufficiently long life (which we will assume these systems have), a one-dollar reduction in OC reduces the present value of future operating costs by (1/r). This means that if the person choosing the system were incurring the installation costs and the operating costs over the life of the system, and rationally traded off the two at a discount rate of r, the decision maker's wtp for operating cost reductions would be (1/r). Given this, what value of r is implied by the estimated wtp that you calculated in part (c)? Is this reasonable?
(e) Adding alternative-specific constants to the model, it becomes

P(yi = j|x) = exp(x^(1)_ij β1 + x^(2)_ij β2 + γj) / Σ_{h=1}^5 exp(x^(1)_ih β1 + x^(2)_ih β2 + γh).   (7.50)

Is γ = (γ1, γ2, γ3, γ4, γ5) identifiable? Explain. Estimate the above model. (Hint: (i) If you think γ is not identified, impose some restriction(s) to make it identifiable. (ii) If you get different estimates for different starting points, choose the one that delivers the largest likelihood function.) How well do the estimated probabilities match the shares of customers choosing each alternative?

(f) Calculate the wtp and discount rate r that is implied by the estimates. Are these
reasonable?

(g) If you want to include income in the model, what does P(yi = j|x) look like? Are all the parameters identifiable? Explain. (Note: no need to estimate this model).

(h) The California Energy Commission (CEC) is considering whether to offer rebates on heat pumps. The CEC wants to predict the effect of the rebates on the heating system choices
of customers in California. The rebates will be set at 10% of the installation cost. The new
installation cost for heat pumps will therefore be: set nic5 = .90 ∗ ic5 . Using the estimated
coe¢cients from the model in part (e), calculate new probabilities and predicted shares using
nic5 instead of ic5. How much do the rebates raise the share of houses with heat pumps?

(i) Suppose a new technology is developed that provides more efficient central heating. The
new technology costs $200 more than the central electric system that we have specified as our
alternative 3. However, it saves 25% of the electricity, such that its operating costs are 75% of
the operating costs of our alternative 3. We want to predict the potential market penetration
of this technology. Note that there are now six alternatives: the original five alternatives
plus this new one. Calculate the probability and predict the market share (i.e., the average
probability) for all six alternatives, using the model that is estimated on the original five
alternatives. (Note: (i) Be sure to use the original installation cost for heat pumps, rather
than the reduced cost in part (h). (ii) For the new technology, assume all else are the same
as alternative 3). What is the predicted market share for the new technology? From which of
the original five systems does the new technology draw the most customers?
Bibliography

[1] Cardell, N Scott (1997). “Variance Components Structures for the Extreme-Value and
Logistic Distributions with Application to Models of Heterogeneity.” Econometric Theory,
13(2): 185-213.

[2] Hausman, Jerry and Daniel McFadden (1984). “Specification Tests for the Multinomial
Logit Model,” Econometrica, 52: 1219-1240.

[3] Train, Kenneth (2002). Discrete Choice Methods with Simulation, Cambridge University
Press.

[4] Maddala, G.S., (1987). Limited Dependent and Qualitative Variables in Econometrics.
Cambridge University Press.

Chapter 8

Truncation, Censoring, and Sample


Selection

8.1 Truncated Regression Model


Example 58 (Truncation based on Wealth): We are interested in estimating the effect of worker eligibility in a particular pension plan on family wealth:

wealth = β0 + β1 plan + β2 educ + β3 age + β4 income + u.   (8.1)

However, we can only sample people with a net wealth greater than $10,000, so the sample is selected on the basis of wealth (people with net wealth less than $10,000 may be hard to reach).

8.1.1 The Model


Suppose that {Yi*, Xi} is iid and

Yi* = Xi β + εi,   εi|Xi ~ N(0, σ²).   (8.2)

We only observe the Yi* satisfying Yi* > c, where c is a known constant, i.e.,

(Yi, Xi) = (Yi*, Xi) if Yi* > c, and there is no observation if Yi* ≤ c.   (8.3)

Here we consider only truncation from below. The extension to truncation from above is straightforward.

8.1.2 Moments of Truncated Normal Variables


If a continuous random variable Y has density function fY(y), the truncated variable has density

f_{Y|Y>c}(y|Y > c) = fY(y) / ∫_c^∞ fY(u) du.   (8.4)

To see this, we note that

P(Y < y|Y > c) = P(c < Y < y)/P(Y > c) = [FY(y) − FY(c)]/P(Y > c),

and so

f_{Y|Y>c}(y|Y > c) = (d/dy) P(Y < y|Y > c) = fY(y) / ∫_c^∞ fY(u) du.

Figure 8.1: Truncated PDF


To derive E(Y|X, Y > c), we need the following fact: if Z ~ N(0, 1), then

E(Z|Z > c) = ∫_c^∞ Z φ(Z)/(1 − Φ(c)) dZ = (1/√(2π)) ∫_c^∞ exp(−Z²/2)/(1 − Φ(c)) d(Z²/2)
 = −(1/√(2π)) exp(−Z²/2)/(1 − Φ(c)) |_c^∞ = φ(c)/(1 − Φ(c)) := λ(c),   (8.5)

where

λ(c) = φ(c)/(1 − Φ(c))

is the so-called inverse Mills ratio.¹ Obviously λ(c) > c. Therefore, if u ~ N(0, σ²),

E(u|u > c) = σφ(c/σ)/(1 − Φ(c/σ)) = σλ(c/σ).   (8.6)
Now we calculate var(Z|Z > c). Note that

E(Z²|Z > c) = ∫_c^∞ Z² φ(Z)/(1 − Φ(c)) dZ = (1/√(2π)) ∫_c^∞ Z² exp(−Z²/2)/(1 − Φ(c)) dZ
 = −(1/√(2π)) [1/(1 − Φ(c))] ∫_c^∞ Z d[exp(−Z²/2)]
 = (1/√(2π)) [1/(1 − Φ(c))] c exp(−c²/2) + (1/√(2π)) [1/(1 − Φ(c))] ∫_c^∞ exp(−Z²/2) dZ
 = cφ(c)/(1 − Φ(c)) + 1 = 1 + cλ(c).   (8.7)

¹For the standard normal, the Mills ratio is defined to be [1 − Φ(c)]/φ(c).
[Figure 8.2: Mean E(Z|Z > c) and variance Var(Z|Z > c) of the truncated standard normal distribution, plotted against the truncation point c.]

So

var(Z|Z > c) = cφ(c)/(1 − Φ(c)) + 1 − [φ(c)/(1 − Φ(c))]²
 = 1 − [φ(c)/(1 − Φ(c))] [φ(c)/(1 − Φ(c)) − c]
 = 1 − λ(c)[λ(c) − c] ≤ 1.

For the standard normal, truncation reduces the variance!

Lemma 59 If Z ~ N(0, 1), then

E(Z|Z > c) = λ(c),   (8.8)
var(Z|Z > c) = 1 − λ(c)[λ(c) − c],   (8.9)

where λ(c) = φ(c)/(1 − Φ(c)).

Figure 8.2 graphs the mean and variance of the truncated standard normal distribution against the truncation point c.
Therefore, if u ~ N(0, σ²), then

E(u²|u > c) = σ² E(u²/σ² | u/σ > c/σ) = σ² [1 + (c/σ)λ(c/σ)],

and

var(u|u > c) = σ² var(u/σ | u/σ > c/σ) = σ² {1 − λ(c/σ)[λ(c/σ) − c/σ]}.   (8.10)
From the above analyses, we have

E(Yi|Xi, Yi > c) = Xi β + E(εi|Xi β + εi > c, Xi) = Xi β + E(εi|εi > c − Xi β, Xi)
 = Xi β + σλ( (c − Xi β)/σ ),   (8.11)

E[(Yi − Xi β)²|Xi, Yi > c] = E(εi²|εi > c − Xi β, Xi) = σ² [ 1 + ((c − Xi β)/σ) λ( (c − Xi β)/σ ) ],   (8.12)

and

var(Yi|Xi, Yi > c) = var(εi|εi > c − Xi β, Xi)
 = σ² { 1 − λ( (c − Xi β)/σ ) [ λ( (c − Xi β)/σ ) − (c − Xi β)/σ ] }.   (8.13)

It then follows that

Yi = Xi β + σλ( (c − Xi β)/σ ) + ei,

where E(ei|Xi, Yi > c) = 0 and ei is heteroscedastic by construction.
The OLS estimator of β when the nonlinear term σλ( (c − Xi β)/σ ) is omitted converges in probability to

cov_c(Xi, Yi)/cov_c(Xi, Xi) = β + cov_c( Xi, σλ( (c − Xi β)/σ ) )/cov_c(Xi, Xi),

where cov_c(·, ·) is the conditional covariance, conditioning on {Yi > c}. The direction of the asymptotic bias depends on the sign of β. Typically, the OLS slope estimator in truncated samples will suffer from attenuation bias, or bias towards zero. This can be seen from the partial effect calculation below:

∂E(Yi|Xi, Yi > c)/∂Xi
 = β + σ (∂/∂Xi) [ φ( (c − Xi β)/σ ) / ( 1 − Φ( (c − Xi β)/σ ) ) ]
 = β + σ [ − φ²( (c − Xi β)/σ )(β/σ) / ( 1 − Φ( (c − Xi β)/σ ) )² + φ( (c − Xi β)/σ ) ( (c − Xi β)/σ ) (β/σ) / ( 1 − Φ( (c − Xi β)/σ ) ) ]
 = β [ 1 − λ²( (c − Xi β)/σ ) + λ( (c − Xi β)/σ ) (c − Xi β)/σ ]
 = β { 1 − λ( (c − Xi β)/σ ) [ λ( (c − Xi β)/σ ) − (c − Xi β)/σ ] } ≤ β,

where the inequality follows from 1 − λ(c)[λ(c) − c] = var(Z|Z > c) ∈ [0, 1]. The above calculation is indicative only. The OLS estimator may not converge to ∂E(Yi|Xi, Yi > c)/∂Xi (see a question in PS1).

Figure 8.3: Inconsistency of the OLS estimator (truncation point: y = 0)

In passing, we note that sample selection bias does not arise if selection is based on the
regressors rather than on the dependent variable. Note that the unbiasedness of the OLS
estimator relies on
E(εi|Xi, Si = 1) = 0,
where Si = 1 indicates that the observation is selected, e.g., Si = 1{Yi > C} or Si = 1{Xi > C}.
When Si = 1{Xi > C}, we have
E(εi|Xi, Si = 1) = E(εi|Xi) = 0,
provided that E(εi|Xi) = 0 in the population.

8.1.3 Maximum Likelihood Estimation


The density is
f(Yi|Xi, Yi > c) = σ⁻¹ φ((Yi − Xiβ)/σ) / [1 − Φ((c − Xiβ)/σ)].  (8.14)

The log-likelihood function is

log L(θ) = Σ_{i=1}^n { −(1/2) log(2π) − (1/2) log(σ²) − (1/2) [(Yi − Xiβ)/σ]²
           − log[1 − Φ((c − Xiβ)/σ)] }.  (8.15)

The first-order conditions for β and σ² are

Σ_{i=1}^n [Yi − Xiβ − σλ((c − Xiβ)/σ)] Xi = 0,

Σ_{i=1}^n { (Yi − Xiβ)²/σ² − [1 + ((c − Xiβ)/σ) λ((c − Xiβ)/σ)] } = 0.

The conditions E[∂ log L(θ)/∂θ] = 0 are compatible with (8.11) and (8.12).
The usual asymptotics for MLE applies.
Stata command: truncreg y x, ll(#) for truncation from below (ul(#) for truncation from above). See http://www.stata.com/manuals13/rtruncreg.pdf

Figure 8.4 presents finite sample distributions of the OLS estimator and MLE for a certain
DGP. For details of the simulation, see the Stata program.
Figure 8.4: Finite sample distribution of the OLS and MLE in the presence of truncation.

clear
capture postclose tempid
postfile tempid beta_hat_ols beta_hat_mle using mydata_truncated.dta, replace

set seed 1
forvalues i = 1(1)1000 {
    drop _all
    quietly set obs 200

    /************* DGP ********************/
    gen e = rnormal()
    gen x = rnormal()^2
    gen y = x + e
    qui drop if y < -1          /* truncation from below at -1 */

    /* OLS on the truncated sample */
    qui reg y x
    scalar beta_hat_ols = _b[x]

    /* MLE that accounts for the truncation */
    qui truncreg y x, ll(-1)
    scalar beta_hat_mle = _b[x]

    post tempid (beta_hat_ols) (beta_hat_mle)
}
postclose tempid
use mydata_truncated.dta, clear
sum
hist beta_hat_ols, normal saving(ols)
hist beta_hat_mle, normal saving(mle)
graph combine ols.gph mle.gph

8.2 Tobit and Censored Regressions


The Tobit model, proposed by James Tobin (1958), is designed to describe the relationship
between a non-negative dependent variable and an independent variable (or vector). The
population model is
Y* = Xβ + u,  u|X ~ N(0, σ²),  (8.16)

Y = Y* if Y* > 0, and Y = 0 if Y* ≤ 0,  (8.17)

where Y*, Y, and u ∈ ℝ. For example, Y* is the willingness to donate and Y is the donation.
We have an iid sample {Xi, Yi} from this population model. Here Yi has a limited support in
the population.
A closely related model is the censored regression model, under which the population model
is
Y* = Xβ + u,  u|X ~ N(0, σ²).  (8.18)
We have an iid sample from this population but we observe only {Xi, Yi}, where as before

Yi = Yi* if Yi* > 0, and Yi is missing if Yi* ≤ 0.  (8.19)

Here Yi* is the variable of interest and its support is not limited. However, Yi has a limited
support because the data on Yi* are missing when Yi* ≤ 0. Sometimes we fill in the missing
value by zero, leading to

Yi = Yi* if Yi* > 0, and Yi = 0 if Yi* ≤ 0.  (8.20)

However, it may not make much sense in some applications to impute the missing value as
zero. We will make clear in what follows when a result relies on the specification that
Yi = 0 when Yi* ≤ 0.
The difference between the truncated regression model and the censored regression model
is that in the former case Xi is not observable when Yi* ≤ 0, while in the latter case Xi is
always observable regardless of whether Yi* > 0 or not.
Statistically, the Tobit model and the censored regression model (with imputed zeros) are the
same. However, their interpretations are different. In the censored regression model, we are
interested in the marginal effect of Xi on Yi* (not Yi); the marginal effect is β. In the Tobit
model, we are interested in the marginal effect of Xi on Yi (not Yi*); this effect is not equal
to β, as we demonstrate below.
In both the Tobit and censored regressions, the threshold value may be different from zero.
More generally,
Yi* = Xiβ + ui,  ui|Xi ~ N(0, σ²),  (8.21)
and we have data on Xi and Yi, where

Yi = Yi* if Yi* > c, and Yi = c if Yi* ≤ c,  (8.22)

where c is a known constant.

This model is often written in the form

Yi = max(c, Xiβ + ui),  ui|Xi ~ N(0, σ²).  (8.23)

Note that this model is different from

Yi = max(0, Xiβ + ui),  ui|Xi ~ N(0, σ²).  (8.24)

Many researchers mistakenly think these two models are equivalent. A model equivalent to
(8.23) is

Yi − c = max(0, Xiβ − c + ui) = max(0, Xiβ̃ + ui),  ui|Xi ~ N(0, σ²).

Since c is assumed to be known, we can define Ỹi = Yi − c. Then

Ỹi = max(0, Xiβ̃ + ui),  ui|Xi ~ N(0, σ²),

which is now in the same form as (8.24). If c is not known, then we can estimate it by
ĉ = min(Yi). Carson and Sun (2007) show that ĉ converges to c at the rate n⁻¹, which is
faster than the parametric rate of 1/√n, so the estimation uncertainty can be ignored in
making inference on β. We will use (8.24) in the subsequent sections, but keep in mind that
when c is not zero, we have to transform the dependent variable first.

Figure 8.5: Censored PDF: Censoring Point c

8.2.1 Derivation of Expected Values


The calculations in this subsection are based on the assumption that Yi = 0 when Yi* ≤ 0.
It is easy to see that

E(Y|X) = Pr(Y = 0|X) × 0 + Pr(Y > 0|X) E(Y|X, Y > 0)
       = Pr(Y > 0|X) E(Y|X, Y > 0),  (8.25)

and
Pr(Y > 0|X) = Pr(Xβ + u > 0) = Φ(Xβ/σ).  (8.26)
Hence, in order to find E(Y|X), we only need to compute E(Y|X, Y > 0):

E(Y|X, Y > 0) = E(Xβ + u|X, Y > 0)
              = Xβ + E(u|u > −Xβ) = Xβ + σφ(Xβ/σ)/Φ(Xβ/σ)  (8.27)
              = Xβ + σλ(−Xβ/σ).  (8.28)

As a consequence,
E(Y|X) = Φ(Xβ/σ) Xβ + σφ(Xβ/σ),  (8.29)
and

∂E(Y|X)/∂Xj = Φ(Xβ/σ) βj + φ(Xβ/σ) (βj/σ) Xβ − Xβ φ(Xβ/σ) (βj/σ)
            = Φ(Xβ/σ) βj = Pr(Y > 0|X) βj.  (8.30)

Hence
|∂E(Y|X)/∂Xj| ≤ |βj|.
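In Stata, the quantities in (8.29) and (8.30) can be recovered after the Tobit MLE discussed below; a minimal sketch (assuming a dependent variable y censored at zero and a single regressor x, both placeholder names) is:

tobit y x, ll(0)
* E(Y|X) = Phi(Xb/sigma)*Xb + sigma*phi(Xb/sigma), i.e., E[max(0, Xb + u)]
margins, predict(ystar(0,.))
* partial effect dE(Y|X)/dx = Phi(Xb/sigma)*b, averaged over the sample
margins, dydx(x) predict(ystar(0,.))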

8.2.2 Inconsistency of the OLS Estimator


From equation (8.27), we have

Yi = Xiβ + σλ(−Xiβ/σ) + ei  (8.31)

with
E(ei|Xi, Yi > 0) = 0.  (8.32)
This implies that if we run the OLS regression of Yi on Xi using the sample for which Yi > 0,
we effectively omit the variable λ. Due to the omitted variable bias, the OLS estimator is
inconsistent. This is effectively a truncated regression with an omitted variable.
Even if we use all the data, the OLS estimator is still inconsistent because

E(Y|X) = Φ(Xβ/σ) Xβ + σφ(Xβ/σ).  (8.33)

The above holds if Yi = 0 when Yi* ≤ 0.

Figure 8.6: Inconsistency of the OLS estimator

8.2.3 Estimation and Inference with Censored Tobit


Let {Xi, Yi} be a random sample following the censored Tobit model:

Yi* = Xiβ + ui,  ui|Xi ~ N(0, σ²),  Yi = max(0, Yi*).  (8.34)

Define
I(Yi) = 1{Yi > 0} = 1{Xiβ + ui > 0}.
Clearly, Y has a mixed distribution, a distribution that is partly discrete and partly continuous.
Part of the distribution is concentrated at the point Y = 0 and the rest of the distribution
continuously spreads over ℝ₊. For any measurable set A, we have

P(Y ∈ A|X = x) = ∫_{A∩ℝ₊} f_{Y*|X}(y|x) dy + Σ_{y∈A∩{0}} P(Y = y|X = x).

The density of Yi given Xi is then

f(Yi|Xi) = [(1/σ) φ((Yi − Xiβ)/σ)]^{I(Yi)} [1 − Φ(Xiβ/σ)]^{1−I(Yi)}.  (8.35)

The dominating measure behind this pdf is the sum of the counting measure and the Lebesgue
measure. Note that the above pdf does not depend on the value of Yi when I(Yi) = 0, i.e.,
when Yi* ≤ 0.
Let θ = (β′, σ²)′. Then the log-likelihood is
l(θ) = Σ_{i=1}^n li(θ),

where

li(θ) = [1 − I(Yi)] log[1 − Φ(Xiβ/σ)] + I(Yi) log[(1/σ) φ((Yi − Xiβ)/σ)]  (8.36)
      = [1 − I(Yi)] log[1 − Φ(Xiβ/σ)] − I(Yi) [(Yi − Xiβ)²/(2σ²) + log(σ²)/2] + const.

The FOC with respect to β is

Σ_{i=1}^n ∂li(θ)/∂β = 0,

where
∂li(θ)/∂β = −[1 − I(Yi)] (1/σ)φ(Xiβ/σ) Xi / [1 − Φ(Xiβ/σ)] + I(Yi) (Yi − Xiβ) Xi / σ².

By definition, E[∂li(θ)/∂β] = 0. In fact,

E{ −[1 − I(Yi)] (1/σ)φ(Xiβ/σ)/[1 − Φ(Xiβ/σ)] + I(Yi)(Yi − Xiβ)/σ² | Xi }
= −[1 − P(Yi > 0|Xi)] (1/σ)φ(Xiβ/σ)/[1 − Φ(Xiβ/σ)] + E[I(Yi)(Yi − Xiβ)/σ² | Xi]
= −(1/σ)φ(Xiβ/σ) + E[(Yi − Xiβ)/σ² | Xi, Yi > 0] P(Yi > 0|Xi)
= −(1/σ)φ(Xiβ/σ) + (1/σ²) [σλ(−Xiβ/σ)] Φ(Xiβ/σ) = 0.

So in principle we have moment conditions of the form

E{ [1 − I(Yi)] (1/σ)φ(Xiβ/σ) h(Xi)/[1 − Φ(Xiβ/σ)] − I(Yi)(Yi − Xiβ) h(Xi)/σ² } = 0

for any measurable function h(·), but the theory of MLE suggests that there is no additional
information in large samples beyond the moment conditions obtained by letting h(Xi) = Xi.
l(θ) has a single maximum, but two-step procedures have been devised by Heckman and
Amemiya. The two-step procedure starts with a probit of 1{Yi > 0} on Xi, which delivers a
consistent estimator of β/σ. In the second step, we bring in the continuous information and
consider
E(Y|X) = Φ(Xβ/σ) Xβ + σφ(Xβ/σ).  (8.37)
Use the first-step estimate of β/σ to predict Φi = Φ(Xi·(β/σ)-hat) and φi = φ(Xi·(β/σ)-hat), and estimate

Yi = (Φi Xi)β + σφi + ei  (8.38)

for a new set of β and σ.

In the above two-step procedure, we have assumed that Y = 0 when Y* is not positive. If
this does not hold, we can use the subsample with Yi > 0. Then

E(Y|X, Y > 0) = Xβ + σλ(−Xβ/σ)  (8.39)

and another two-step estimator can be designed.

The two-step estimator can be used as a starting point in the ML estimation.
Testing is easily carried out in a standard MLE framework: all of the standard tests such
as the t-test, the LR test, and the Wald test can be readily used.
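A minimal Stata sketch of the two-step procedure described above (assuming a dependent variable y censored at zero and a single regressor x; the variable names are placeholders) is:

gen byte d = (y > 0)
probit d x                          // step 1: estimates (intercept, slope)/sigma
predict double g_hat, xb
gen double Phi_hat = normal(g_hat)
gen double phi_hat = normalden(g_hat)
gen double PhiX    = Phi_hat*x
* step 2: regression (8.38); the coefficients on Phi_hat, PhiX and phi_hat estimate
* the intercept, the slope and sigma, respectively
reg y Phi_hat PhiX phi_hat, noconstant
tobit y x, ll(0)                    // full MLE, for which the above can supply starting values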

8.2.4 Mostly Harmless Econometrics


Are fancier econometric techniques typically unnecessary and even dangerous? In their book
titled "Mostly Harmless Econometrics," Angrist and Pischke (2009) give a somewhat positive
answer to this question. They start with Probit and Tobit models with a single binary
covariate. For example, consider the Tobit model

Yi* = α + Diβ + ui,  ui|Di ~ N(0, σ²),  (8.40)

where Di is a binary variable and

Yi = Yi* if Yi* > 0, and Yi = 0 if Yi* ≤ 0.  (8.41)

The argument in this subsection relies on the assumption that Yi = 0 when Yi* ≤ 0.

According to the calculations in the general case, we have

E(Y|D) = Φ((α + Dβ)/σ) (α + Dβ) + σφ((α + Dβ)/σ).  (8.42)

So the average causal effect of D on Y is

E(Y|D = 1) − E(Y|D = 0)
= [Φ((α + β)/σ)(α + β) + σφ((α + β)/σ)] − [Φ(α/σ)α + σφ(α/σ)].

We can run the Tobit regression and then plug the parameter estimates into the right-hand
side of the above equation to obtain an estimator of E(Y|D = 1) − E(Y|D = 0). Angrist and
Pischke point out that this is unnecessary. They propose to run a simple OLS regression:

Yi = δ + γDi + errori  (8.43)

and claim that the OLS estimator of γ will provide a legitimate estimator of E(Y|D = 1) −
E(Y|D = 0). To verify their claim, we note that E(Y|D) can be rewritten as

E(Y|D) = δ + γD,

where
δ = Φ(α/σ)α + σφ(α/σ),
γ = [Φ((α + β)/σ)(α + β) + σφ((α + β)/σ)] − [Φ(α/σ)α + σφ(α/σ)].
So indeed the OLS estimator γ̂_OLS based on the regression in (8.43) is consistent for the
average causal effect E(Y|D = 1) − E(Y|D = 0).
In fact, consistency of γ̂_OLS for E(Y|D = 1) − E(Y|D = 0) does not rely on any
distributional assumption on u. To see this, note that we can always write

E(Y|D) = f(D) = f(0) + [f(1) − f(0)]D  (8.44)

for some function f(·). That is,

Y = f(0) + [f(1) − f(0)]D + error

with E(error|D) = 0. Given this, we know that the OLS estimator of the slope coefficient is
consistent for f(1) − f(0) = E(Y|D = 1) − E(Y|D = 0). This result reminds us of binary
choice models with only dummy covariates: there we showed that the linear probability model,
the probit model, and the logit model give exactly the same predicted probabilities.
The equation that drives all these results is (8.44), which can be rewritten as

E(Y|D) = E(Y|D = 0) + [E(Y|D = 1) − E(Y|D = 0)]D.

For a binary variable D, the above equation holds for any type of dependent variable Y, be it
continuous or discrete. There is no misspecification problem; that is, the linear specification
is always correct.

The question is: what if we have continuous covariates in the Tobit model? Can we ignore
the "LDVness" (Limited Dependent Variable nature) of Y and just run OLS regardless of the type of
the dependent variable we have? Angrist and Pischke again provide a somewhat positive answer
to this question. They argue that the OLS estimator still provides a good approximation to
some average causal effect of interest. As they admit, this is not a theorem; they only provide
some empirical evidence that the linear OLS estimator is close to the average causal effect
based on nonlinear models. For more discussion, see Angrist and Pischke (2009, Sec. 3.4.2).
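For the single-binary-covariate case above, the point can be checked directly in a one-line Stata sketch (placeholder names y and d, where d is the binary treatment): the OLS slope is just the difference in group means of y, which consistently estimates E(Y|D = 1) − E(Y|D = 0) without invoking the Tobit structure.

reg y i.d, robust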

8.3 Tobit Models With Endogeneity


Suppose the model is

Y1 = max(0, Xβ + Y2α + u),
Y2 = Xγ + Zδ + v,  (8.45)

where (u, v) are zero-mean normally distributed, independent of (X, Z). For identification, we
need the usual rank condition δ ≠ 0, and E[(X, Z)′(X, Z)] is assumed to have full rank, as
always.
Under the normality assumption, we have

u = θv + e,  (8.46)

where
θ = σuv/σv²,  σuv = cov(u, v), and σv² = var(v),  (8.47)
and e ~ N(0, σe²) is independent of (Z, v). Obviously, σe² = σu² − σuv²/σv². Plugging u = θv + e
into
Y1 = max(0, Xβ + Y2α + u)  (8.48)
gives
Y1 = max(0, Xβ + Y2α + θv + e).  (8.49)
The Smith-Blundell procedure:
(1) Run OLS of Y2 on X, Z and get the residual v̂ = Y2 − Xγ̂ − Zδ̂.
(2) Estimate a standard Tobit of Y1 on X, Y2 and v̂ to get consistent estimates of β, α, θ
and σe². (Stata: tobit y1 x y2 v_hat, ll(0).)
The usual t-statistic on v̂ provides a simple test of the null H0: θ = 0, which says that Y2
is exogenous. Note that when we compute the asymptotic variance, we need to account for
the fact that this is a two-step procedure.
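A minimal Stata sketch of the Smith-Blundell procedure (variable names y1, y2, x, z are placeholders) is:

reg y2 x z                       // first stage: reduced form for y2
predict double v_hat, resid      // control function
tobit y1 x y2 v_hat, ll(0)       // second stage; the t-test on v_hat tests H0: theta = 0
* The reported standard errors ignore the first-step estimation error; bootstrapping
* both steps is one simple remedy. Stata's ivtobit command estimates the same model
* (with a continuous endogenous regressor) by full MLE.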
A full MLE approach avoids the two-step estimation problem:

f (Y1 , Y2 |X, Z) = f (Y1 |Y2 , X, Z)f (Y2 |X, Z). (8.50)


* +
The density f (Y2 |X, Z) is normal N Xγ + Zδ, σ 2v , and Y1 given (Y2 , X, Z) follows a Tobit
with mean
Xβ + Y2 α + θ (Y2 − Xγ − Zδ) (8.51)
8.4 SAMPLE SELECTION WITH PROBIT SELECTION 173

and variance σe² = σu² − σuv²/σv². So f(Y1, Y2|X, Z) is

{1 − Φ([Xβ + Y2α + θ(Y2 − Xγ − Zδ)]/σe)}^{1−I(Y1)} × (2πσv²)^{−1/2} exp(−(Y2 − Xγ − Zδ)²/(2σv²))
× {(2πσe²)^{−1/2} exp(−[Y1 − Xβ − Y2α − θ(Y2 − Xγ − Zδ)]²/(2σe²))}^{I(Y1)}.

Once the MLE has been obtained, we can easily test the null hypothesis of exogeneity of
Y2 using the t-statistic for θ̂.

Exercise 60 What if Y2 = 1 {Xγ + Zδ + v > 0}? Hint: we can consider 4 cases:

[f (Y1i = 0, Y2i = 0|Xi , Zi )]{Y1i =0,Y2i =0} ×


[f (Y1i = 0, Y2i = 1|Xi , Zi )]{Y1i =0,Y2i =1} ×
[f (Y1i > 0, Y2i = 0|Xi , Zi )]{Y1i >0,Y2i =0} ×
[f (Y1i > 0, Y2i = 1|Xi , Zi )]{Y1i >0,Y2i =1} .

Exercise 61 What is the APE of Y2 on Y1? We compute

(∂/∂y2) E[Y1(xo, y2o, U)] = (∂/∂y2) E[max(0, xoβ + y2oα + U)],

where

E[max(0, xoβ + y2oα + U)]
= Φ((xoβ + y2oα)/σu) [xoβ + y2oα + σuλ(−(xoβ + y2oα)/σu)]
= Φ((xoβ + y2oα)/σu) [xoβ + y2oα] + σuφ((xoβ + y2oα)/σu).

So, with some calculations, we have

(∂/∂y2) E[Y1(xo, y2o, U)] = Φ((xoβ + y2oα)/σu) α.

This is not equal to

(∂/∂y2) E[Y1(X, Y2, U)|X = xo, Y2 = y2o].

8.4 Sample Selection with Probit Selection


The topic of sample selection or incidental truncation has been the focus of an enormous
volume of empirical and theoretical literature. It involves features of both the truncated and
censored models.

8.4.1 Heckit Two-step Estimator


Example 62 (Wage Offer Function) Consider estimating a wage offer equation for people
of working age. By definition, this equation is supposed to represent ALL people of working
age. But we can only observe the wage offer for working people. We thus effectively select the
sample on this basis.

Interest lies in estimating E(Wio|Xi), where Wio is the hourly wage offer for a randomly
drawn individual i. If Wio were observed for everyone of working age, we would proceed in a
standard regression framework. However, a potential sample selection problem arises because
Wio is observed only for people who work.
Suppose we want to estimate a wage offer function. The true model is the Mincer-type
wage offer equation:
log Wio = Xi1β1 + ui1,  (8.52)
where Wio is the wage and Xi1 is a vector of human capital attributes (e.g., work experience and
education), with β1 the associated vector of coefficients. This model has been examined
on many datasets, and it is one of the most widely used models in empirical economics. Due
to the sample selection problem, the assumption of the classical regression model, namely
E[ui1|Xi1, worker] = 0, is unlikely to hold. This is because a person who chooses to work
may be particularly diligent or have other characteristics that make him more desirable as a
worker, and E[ui1|Xi1, worker] may well be positive.
We now model the decision to work by a simple rule. We assume that everyone of working
age has a reservation wage Wir. The person chooses to work only if

Wio ≥ Wir.  (8.53)

We parametrize the reservation wage as

Wir = exp(Xi2β2 + aiγ2 + ui2),  (8.54)

where Xi2 contains variables that determine the marginal utility of leisure and income, and ai
is the non-wage income of person i. We assume that (ui1, ui2) is independent of (Xi1, Xi2, ai).
Person i decides to work if

Xi1β1 + ui1 > Xi2β2 + aiγ2 + ui2,  (8.55)

or
Xiδ + vi > 0,  Xi = (Xi1, Xi2, ai),  vi = ui1 − ui2.  (8.56)

Remark 63 If Wir were observed and exogenous, and Xi1 were always available, then we would
be in the censored regression framework.

Remark 64 If Wir were observed and exogenous, and Xi1 were available only when Wio is
available, then we would be in the truncated regression framework.

Remark 65 Since Wir is not observed, we need a new framework.

Let Y1 = log Wo and let Y2 be the binary labor force participation indicator. Then

Y1 = X1β1 + u  (8.57)

and
Y2 = 1{Xδ + v > 0}.  (8.58)
We discuss the estimation of the model under the following set of assumptions.
Assumption A:

(a) (X, Y2) is always observed; Y1 is observed only when Y2 = 1;

(b) (u, v) is independent of X with zero mean;

(c) v ~ N(0, 1); and

(d) E(u|v) = vγ1.

The above model is the general Heckman selection model. Amemiya (1985) calls it
the Type II Tobit model. Wooldridge (2009) calls it the probit selection model. When
X = X1, δ = β1, and v = u, the model reduces to the standard Tobit model.
Note that

E(Y1|X, v) = E(X1β1 + u|X, v) = X1β1 + E(u|X, v) = X1β1 + vγ1.  (8.59)

The above holds because

E(u|X = x, v) = ∫ u f_{u|X,v}(u|x, v) du = ∫ u f_{u,X,v}(u, x, v)/f_{X,v}(x, v) du
             = ∫ u f_{u,v}(u, v) fX(x) / [f_v(v) fX(x)] du = ∫ u f_{u|v}(u|v) du = E(u|v) = vγ1

for almost all x, using the independence of (u, v) and X. So, when γ1 = 0, we have

E(Y1|X, v) = X1β1.  (8.60)

Because Y2 is a function of (X, v), we obtain

E(Y1|X, Y2 = 1) = X1β1.  (8.61)

In other words, if γ1 = 0, there is no sample selection problem, and β1 can be consistently
estimated by OLS using the selected sample.
What if γ1 ≠ 0? We hope to calculate E(Y1|X, Y2 = 1). Since E(u|v) = vγ1, we can write

u = vγ1 + η with E(η|v) = 0.  (8.62)

Therefore,
E(Y1|X, Y2 = 1) = E(X1β1 + u|X, Y2 = 1)
               = X1β1 + E(u|X, Xδ + v > 0)
               = X1β1 + E(vγ1 + η|X, Xδ + v > 0)
               = X1β1 + E(v|X, v > −Xδ) γ1
               = X1β1 + λ(−Xδ) γ1.  (8.63)
The above equation makes it clear that an OLS regression of Y1 on X1 using the selected
sample omits the term λ(−Xδ)γ1 and generally leads to an inconsistent estimator of β1.
Following Heckman (1979), we can consistently estimate β1 and γ1 using the selected
sample by regressing Yi1 on Xi1 and λ(−Xiδ). The problem is that δ is unknown. Fortunately, δ
can be consistently estimated by a probit based on Yi2. This two-step procedure is sometimes
called Heckit.
To estimate the asymptotic variance of β̂1, we have to make a correction for the fact that
we are not using λ(−Xδ) but only λ(−Xδ̂), so that the error term contains the following:

[λ(−Xδ) − λ(−Xδ̂)] γ1 ≈ −[∂λ(Z)/∂Z] X(δ − δ̂) γ1,  (8.64)

evaluated at Z = −Xδ. More specifically, the second-step regression is
Y1i = X1iβ1 + λ(−Xiδ̂)γ1 + ei,
where
ei = ui − E(ui|vi > −Xiδ) + γ1[λ(−Xiδ) − λ(−Xiδ̂)].
The second term in ei needs to be taken into consideration when computing the standard
errors of the two-step estimator β̂1.
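A minimal Stata sketch of Heckit (assumed variable names: y1 = log wage, observed when work = 1; x1 = wage-equation regressors; xsel = selection-equation regressors, which should include at least one variable excluded from x1):

probit work xsel                                     // step 1: estimate delta
predict double xd_hat, xb
gen double imr = normalden(xd_hat)/normal(xd_hat)    // lambda(-X*delta) = phi(Xd)/Phi(Xd)
reg y1 x1 imr if work == 1                           // step 2: OLS on the selected sample
heckman y1 x1, select(work = xsel) twostep           // built-in version with corrected SEs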

8.4.2 Partial Maximum Likelihood Estimation


To get a more efficient estimator, we use MLE. Assume (u, v) is bivariate normal with mean
zero and variance-covariance matrix
( σu²  σuv )
( σvu  σv² ).
We normalize σv² = 1. The pdf of (Y1, Y2) can be written as

f(Y1, Y2|X) = f(Y1|Y2, X) f(Y2|X).

Obviously,
f(Y2|X) = [1 − Φ(Xδ)]^{1−Y2} [Φ(Xδ)]^{Y2}.
Now we proceed to find f(Y1|Y2, X). Since we only observe Y1 when Y2 = 1, we only use
the density f(Y1|Y2, X) for Y2 = 1. By Bayes' rule, we have
f(Y1|Y2 = 1, X) = f(Y2 = 1|Y1, X) f(Y1|X) / f(Y2 = 1|X).

We need to compute each term in the above expression. First,

f(Y1|X) = (1/σu) φ((Y1 − X1β1)/σu).

Second, note that

v = (σuv/σu²) u + ξ,  (8.65)

where
E(ξ|u) = 0,  var(ξ|u) = 1 − σuv² σu⁻².

Therefore, conditional on u, v is normal with mean σuv σu⁻² u and variance 1 − σuv² σu⁻². The
probability of Y2 = 1 conditional on (Y1, X) can then be written as

Φ( [Xδ + σuv σu⁻² (Y1 − X1β1)] / √(1 − σuv² σu⁻²) ).

In view of the above analysis, the contribution to the partial likelihood function for a given
observation (Y1, Y2, X) is

L(Y1, Y2|X) = { (1/σu) φ((Y1 − X1β1)/σu) Φ( [Xδ + σuv σu⁻² (Y1 − X1β1)] / √(1 − σuv² σu⁻²) ) }^{Y2}
              × [1 − Φ(Xδ)]^{1−Y2}.

Note that L(Y1, Y2|X) is not f(Y1, Y2|X), the conditional likelihood function. When viewed
as a function of the parameters, we call L(Y1, Y2|X) the partial likelihood function.
Let θ be a vector that collects all the parameters. The partial MLE is defined by

θ̂ = arg max Π_{i=1}^n L(Y1i, Y2i|Xi) = arg max Σ_{i=1}^n log L(Y1i, Y2i|Xi).

How do we show that θ̂ is consistent? Note that

log L(Y1, Y2|X) = 1{Y2 = 1} log f(Y1|Y2 = 1, X) + log f(Y2|X).

When the model is correctly specified, the true parameter θ0 maximizes E log f(Y2|X). For all
(Y2, X), the true parameter θ0 also maximizes E log f(Y1|Y2, X); in particular, it maximizes
E log f(Y1|Y2 = 1, X). Therefore

θ0 = arg max E log L(Y1, Y2|X).

For identification, we have to assume or verify that θ0 is the unique maximizer of E log L(Y1, Y2|X).
Consistency of θ̂ then follows if we can show that the ULLN

plim_{n→∞} sup_{θ∈Θ} || (1/n) Σ_{i=1}^n log L(Y1i, Y2i|Xi) − E log L(Y1, Y2|X) || = 0

holds.
How do we compute the standard error of θ̂? The score function can be written as

si(θ) = 1{Y2i = 1} ∂ log f(Y1i|Y2i = 1, Xi)/∂θ + ∂ log f(Y2i|Xi)/∂θ
      := 1{Y2i = 1} s1i(θ) + s2i(θ).

Therefore

E si(θ)si(θ)′ = E 1{Y2i = 1} s1i(θ)s1i(θ)′ + E s2i(θ)s2i(θ)′
               + E 1{Y2i = 1} s1i(θ)s2i(θ)′ + E 1{Y2i = 1} s2i(θ)s1i(θ)′.

Since
E[s1i(θ0)|Y2i, Xi] = E[ ∂ log f(Y1i|Y2i, Xi)/∂θ |_{θ=θ0} | Y2i, Xi ] = 0

for any given Y2i and Xi, we have E[s1i(θ0)|Y2i = 1, Xi] = 0. As a consequence,

E si(θ0)si(θ0)′ = E 1{Y2i = 1} s1i(θ0)s1i(θ0)′ + E s2i(θ0)s2i(θ0)′
                = −E 1{Y2i = 1} H1i(θ0) − E H2i(θ0) = −E Hi(θ0),

where
H1i(θ0) = ∂² log f(Y1i|Y2i = 1, Xi)/∂θ∂θ′,  H2i(θ0) = ∂² log f(Y2i|Xi)/∂θ∂θ′.
Here we have used the fact that

E[s1i(θ0)s1i(θ0)′|Y2i, Xi] = −E[H1i(θ0)|Y2i, Xi] for any Y2i and Xi,

and thus

E[1{Y2i = 1} s1i(θ0)s1i(θ0)′] = −E[1{Y2i = 1} H1i(θ0)].

So an unconditional information equality holds, which means we can estimate the asymptotic
variance of √n(θ̂ − θ0) by (−E Hi(θ0))⁻¹. The latter can be consistently estimated by

−[ (1/n) Σ_{i=1}^n ( 1{Y2i = 1} H1i(θ̂) + H2i(θ̂) ) ]⁻¹

under some regularity conditions.


Stata command: heckman

8.5 Sample Selection with Tobit Selection


The model:

Y1 = X1 β 1 + u
Y2 = max(0, Xδ + v) (8.66)

A familiar example occurs when Y1 is the log of the hourly wage offered and Y2 is hours of
labor supply. The model is sometimes called the Type III Tobit model.
Assumption B

(a) (X, Y2 ) is always observed, but Y1 is observed only when Y2 > 0

(b) (u, v) is independent of X

(c) v ~ N(0, σv²)

(d) E (u|v) = vγ 1

8.5.1 Two-Step Procedure


To estimate the model, we derive

E(Y1|X, Y2 > 0) = X1β1 + E(u|v) = X1β1 + vγ1.  (8.67)

From this, we naturally propose the following two-step procedure:

(a) Estimate Y2 = max(0, Xδ + v) by standard Tobit using the whole sample. For Yi2 > 0,
define
v̂i = Yi2 − Xiδ̂.  (8.68)

(b) Using the observations for which Yi2 > 0, estimate β1 and γ1 by the OLS regression of

Yi1 on Xi1 and v̂i.  (8.69)

This regression produces consistent and √N-asymptotically normal estimators of β1 and γ1
under Assumption B.
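A minimal Stata sketch of this two-step procedure (assumed names: y2 = hours, censored at zero; y1 = log wage, observed when y2 > 0; x1 and xsel are the regressors in the two equations):

tobit y2 xsel, ll(0)               // step (a): Tobit for the selection variable
predict double xd_hat, xb
gen double v_hat = y2 - xd_hat if y2 > 0
reg y1 x1 v_hat if y2 > 0          // step (b): OLS with the control function v_hat
* The standard errors should account for the estimated v_hat (e.g., bootstrap both steps).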

8.5.2 Partial MLE


For the partial MLE, we assume that (u, v) is jointly normal. Then the contribution to the partial
likelihood function from observation (Y1, Y2, X) is

f(Y2|X) × f(Y1|X, Y2 > 0)
= [(1/σv) φ((Y2 − Xδ)/σv)]^{I(Y2)} [1 − Φ(Xδ/σv)]^{1−I(Y2)}
  × { (σu² − σuv²σv⁻²)^{−1/2} φ( [Y1 − X1β1 − σuvσv⁻²(Y2 − Xδ)] / √(σu² − σuv²σv⁻²) ) }^{I(Y2)}.

To make a comparison with the partial likelihood function for the Type II Tobit model, the
above partial likelihood can be rewritten as

[1 − Φ(Xδ/σv)]^{1−I(Y2)} [(1/σu) φ((Y1 − X1β1)/σu)]^{I(Y2)}
  × { (σv² − σuv²σu⁻²)^{−1/2} φ( [Y2 − Xδ − σuvσu⁻²(Y1 − X1β1)] / √(σv² − σuv²σu⁻²) ) }^{I(Y2)}.

On the basis of the first form of the partial likelihood function, we can use the same
argument as in the previous section to show that the partial MLE is consistent and asymptot-
ically normal. Its asymptotic variance can be consistently estimated by the (negative) inverse
Hessian matrix.

8.6 Problem Set


8.6.1 Problems
1. Consider the following two-equation simultaneous model:

Y1 = Y2*α + Xβ + u,  (8.70)
Y2* = Y1γ + Zδ + v,  (8.71)

where Y1 and Y2* are dependent variables and X and Z are scalar exogenous explanatory variables.
Suppose Y1, X, and Z are always observable, but Y2* is only partially observable in the sense
that we observe only Y2:

Y2 = 1 if Y2* > 0, and Y2 = 0 if Y2* ≤ 0.  (8.72)

Assume that u ~ N(0, σu²), v ~ N(0, σv²), and u is independent of v.

(i) Given some independent observations {Xi, Zi, Y1i, Y2i}, i = 1, ..., n, how would you estimate this
model? Design an estimation strategy that is as efficient as you can.
(ii) How does your answer change if we observe only Y2:

Y2 = Y2* if Y2* > 0, and Y2 = 0 if Y2* ≤ 0?  (8.73)

2. Consider the following modified Tobit model:

Yi* = exp(Xiβ + εi),  (8.74)

where εi is iid N(0, σ²) and independent of {Xi}, i = 1, ..., n. We observe Yi:

Yi = Yi* if Yi* ≥ γ*, and Yi = 0 if Yi* < γ*.  (8.75)

What is the log-likelihood function based on the observations {Yi, i = 1, 2, ..., n}? Is γ*
identified? What is the mean of Yi given X = Xi? What is the median of Yi given X = Xi?

8.6.2 Answers
1. From the two-equation system, we can solve for Y1:

Y1 = (Xβ + Zαδ)/(1 − αγ) + e,

where e = (u + vα)/(1 − αγ) with

σe² = var(e) = (σu² + α²σv²)/(1 − αγ)²  and  cov(e, v) = σv²α/(1 − αγ).

We will estimate the model by MLE. First, the density of Y1i conditional on Xi and Zi is

f(Y1i|Xi, Zi) = (1/σe) φ([Y1i − (Xiβ + Ziαδ)/(1 − αγ)]/σe).

Second, we can write

v = [cov(e, v)/var(e)] e + w = [σv²α(1 − αγ)/(σu² + α²σv²)] e + w,

where
σw² = var(w) = σv² − (σv⁴α²)/(σu² + α²σv²) = σu²σv²/(α²σv² + σu²).

So conditional on ei, vi is a normal random variable with mean [σv²α(1 − αγ)/(σu² + α²σv²)] ei
and variance σu²σv²/(α²σv² + σu²). As a result, the probability of Y2i = 1 conditional on Y1i, Xi
and Zi is

f(Y2i = 1|Y1i, Xi, Zi) = Φ( (1/σw){ Y1iγ + Ziδ + [σv²α(1 − αγ)/(σu² + α²σv²)] [Y1i − (Xiβ + Ziαδ)/(1 − αγ)] } ).

Therefore, the likelihood function is

Π_{i=1}^n f(Y1i|Xi, Zi) f(Y2i|Y1i, Xi, Zi)
= Π_{i=1}^n (1/σe) φ([Y1i − (Xiβ + Ziαδ)/(1 − αγ)]/σe)
  × { Φ( (1/σw)[ Y1iγ + Ziδ + (σv²α(1 − αγ)/(σu² + α²σv²)) (Y1i − (Xiβ + Ziαδ)/(1 − αγ)) ] ) }^{Y2i}
  × { 1 − Φ( (1/σw)[ Y1iγ + Ziδ + (σv²α(1 − αγ)/(σu² + α²σv²)) (Y1i − (Xiβ + Ziαδ)/(1 − αγ)) ] ) }^{1−Y2i}.

Note that we cannot identify all the model parameters and have to normalize σv² = 1.
Maximizing the above likelihood function with respect to the remaining parameters leads to
consistent estimators of these parameters.
Next we consider the case when we observe only Y2:

Y2 = Y2* if Y2* > 0, and Y2 = 0 if Y2* ≤ 0.  (8.76)

Using the same steps, we can show that the likelihood function is

Π_{i=1}^n f(Y1i|Xi, Zi) f(Y2i|Y1i, Xi, Zi)
= Π_{i=1}^n (1/σe) φ([Y1i − (Xiβ + Ziαδ)/(1 − αγ)]/σe)
  × { (1/σw) φ( (1/σw)[ Y2i − Y1iγ − Ziδ − (σv²α(1 − αγ)/(σu² + α²σv²)) (Y1i − (Xiβ + Ziαδ)/(1 − αγ)) ] ) }^{I(Y2i)}
  × { 1 − Φ( (1/σw)[ Y1iγ + Ziδ + (σv²α(1 − αγ)/(σu² + α²σv²)) (Y1i − (Xiβ + Ziαδ)/(1 − αγ)) ] ) }^{1−I(Y2i)},

where I(Y2i) = 1{Y2i > 0}. The MLE of the model parameters can be obtained by maximizing
the above likelihood function.
2. (i) First, the probability of Yi = 0 is

P(Yi = 0|Xi) = P(Yi* < γ*|Xi) = P[exp(Xiβ + εi) < γ*|Xi]
             = P[εi < ln(γ*) − Xiβ|Xi] = Φ((ln(γ*) − Xiβ)/σ).

Second, we find the distribution of Yi* given that Yi* ≥ γ*. To this end, we compute

Pr(Yi* ≤ Y|Yi* ≥ γ*) = Pr(γ* ≤ Yi* ≤ Y)/P(Yi* ≥ γ*) = Pr(ln γ* ≤ ln Yi* ≤ ln Y)/P(Yi* ≥ γ*)
= Pr(ln γ* − Xiβ ≤ εi ≤ ln Y − Xiβ)/P(Yi* ≥ γ*)
= [Φ((ln Y − Xiβ)/σ) − Φ((ln γ* − Xiβ)/σ)]/P(Yi* ≥ γ*).

So the probability of observing a nonzero Yi that is less than or equal to Y is

Pr(Yi ≤ Y, Yi ≠ 0) = Pr(Yi* ≤ Y|Yi* ≥ γ*) × Pr(Yi* ≥ γ*)
                   = Φ((ln Y − Xiβ)/σ) − Φ((ln γ* − Xiβ)/σ).

The pdf of a nonzero observation is therefore
(1/(σY)) φ((ln Y − Xiβ)/σ).
Combining the above, the log-likelihood function is

log Π_{i=1}^n [Φ((ln(γ*) − Xiβ)/σ)]^{1{Yi=0}} [(1/(σYi)) φ((ln Yi − Xiβ)/σ)]^{1{Yi≠0}}
= Σ_{i=1}^n 1{Yi = 0} log Φ((ln(γ*) − Xiβ)/σ) + Σ_{i=1}^n 1{Yi ≠ 0} log[(1/σ) φ((ln Yi − Xiβ)/σ)]
  − Σ_{i=1}^n 1{Yi ≠ 0} log(Yi).

Note that the last term is the Jacobian term.

(ii) Yes, γ* can be identified. We can consistently estimate it by min{Yi : Yi > 0}.
(iii) For the latent variable,

E(Yi*|Xi) = E[exp(Xiβ + εi)|Xi] = exp(Xiβ) E(exp εi) = exp(Xiβ + σ²/2).

Since the observed Yi equals Yi* 1{Yi* ≥ γ*}, its mean carries an additional truncation factor:

E(Yi|Xi) = E[exp(Xiβ + εi) 1{εi ≥ ln(γ*) − Xiβ}|Xi] = exp(Xiβ + σ²/2) Φ((Xiβ + σ² − ln(γ*))/σ).

(iv) If ln(γ*) − Xiβ ≥ 0, then

P(Yi = 0|Xi) = Φ((ln(γ*) − Xiβ)/σ) ≥ 1/2,

so the median is 0. If ln(γ*) − Xiβ < 0, then the median of Yi is the value Y ≥ γ* such that

P(Yi* ≥ Y|Xi) = 0.5,

that is, P(εi ≥ ln(Y) − Xiβ|Xi) = 0.5, or Y = exp(Xiβ). To sum up,

median(Yi|Xi) = 0 if Xiβ ≤ ln(γ*), and median(Yi|Xi) = exp(Xiβ) otherwise.
Bibliography

[1] Angrist, J. D. and Pischke, J.-S. (2009), Mostly Harmless Econometrics, Princeton University Press.

[2] Amemiya, T. (1985), Advanced Econometrics, Harvard University Press, Cambridge, MA.

[3] Carson, R. T. and Sun, Y. (2007), "The Tobit model with a non-zero threshold," The Econometrics Journal, Vol. 10, No. 3, pp. 488-502.

[4] Heckman, J. (1979), "Sample Selection Bias as a Specification Error," Econometrica, Vol. 47, No. 1, pp. 153-162.

[5] Tobin, J. (1958), "Estimation of relationships for limited dependent variables," Econometrica, Vol. 26, No. 1, pp. 24-36. doi:10.2307/1907382.

[6] Wooldridge, J. (2009), Econometric Analysis of Cross Section and Panel Data, MIT Press.

Chapter 9

Causal Inference

In the microeconometrics textbooks of both Wooldridge (2010) and Angrist and Pischke
(2009), the very first page describes the estimation of causal effects as the principal goal of
empirical microeconomists. According to Angrist and Pischke, "In the beginning, we should
ask, What is the causal relationship of interest? Although purely descriptive research has an
important role to play, we believe that the most interesting research in social science is about
questions of cause and effect, such as the effect of class size on children's test scores. . .
." Similarly, the first sentences in the Wooldridge textbook are, "The goal of most empirical
studies in economics and other social sciences is to determine whether a change in one variable,
say w, causes a change in another variable, say y. For example, does having another year of
education cause an increase in monthly salary? Does reducing class size cause an improvement
in student performance? Does lowering the business property tax rate cause an increase in
city economic activity?"

9.1 The Framework of Potential Outcomes1


9.1.1 Introduction
We have a cross section of observations indexed by i. To begin with, we focus on a binary
treatment, so Ti ∈ {0, 1}. Associated with each treatment, there is a potential outcome:

Yi : T → Y,

i.e., Yi is a function from treatments to potential outcomes.

For individual i, define
Yi(0) = potential outcome under treatment 0 ("control"),
Yi(1) = potential outcome under treatment 1 ("treatment").
Yi(0) and Yi(1) are potential outcomes in that they represent the outcomes for individual i
had they received treatment 0 and treatment 1, respectively.
Important implicit assumption:
Stable Unit Treatment Value Assumption (Neyman, 1923; Rubin, 1980): the potential
outcome for unit i does not depend on the treatments received by other units.

¹This chapter is less polished than the other chapters, so use it at your own risk.

We observe Yi(Ti), but we are interested in the causal effect for individual i:

Yi(1) − Yi(0).

The fundamental problem is that we cannot observe both Yi(1) and Yi(0). For individuals
in the treatment group, we observe only their Yi(1). For individuals in the control group, we
observe only their Yi(0). For example, we may have the following data:

Individual   Treated or not   Y(1)    Y(0)    Causal effect
1            T1 = 1           Y1(1)   ?       Y1(1) − Y1(0)
2            T2 = 0           ?       Y2(0)   Y2(1) − Y2(0)
3            T3 = 0           ?       Y3(0)   Y3(1) − Y3(0)
...          ...              ...     ...     ...
N            TN = 1           YN(1)   ?       YN(1) − YN(0)

The causal effect for each individual is not fully observable. For this reason, we often focus
on the average treatment effect:

ATE := E[Yi(1) − Yi(0)] = E(Yi(1)) − E(Yi(0)).

Observed outcome:
Yi := (1 − Ti)Yi(0) + TiYi(1),
so we only observe one of the two potential outcomes. The other potential outcome is
counterfactual.
Consider a concrete example (unobserved potential outcomes are in brackets):

i    Ti    Yi(0)    Yi(1)    Yi(1) − Yi(0)    Yi
1    0     5        [3]      −2               5
2    1     [2]      5        3                5
3    1     [3]      4        1                4
4    0     3        [2]      −1               3
5    1     [1]      2        1                2

Not all potential outcomes are observed, and neither is the ATE. Here the ATE is

(−2 + 3 + 1 − 1 + 1)/5 = 2/5.

A naive estimator of the ATE is

(5 + 4 + 2)/3 − (5 + 3)/2 = −1/3,

which is clearly biased.


How is the potential outcomes framework related to the familiar linear structural/causal
model framework? Consider the model Yi = α + Tiβ + ϵi, where ϵi denotes all other causal
factors. This implies that

Yi(0) := α + ϵi,
Yi(1) := α + β + ϵi.

So the linear structural/causal model is a special case of the general potential outcomes model,
with the restriction that
Yi(1) − Yi(0) = β
for all i.
We call this the constant treatment effect assumption.
In general, the potential outcomes model allows for heterogeneous treatment effects across
individuals, as Yi(1) − Yi(0) differs across i. This can also be seen by noting that

Yi := (1 − Ti)Yi(0) + TiYi(1)
   = EYi(0) + Ti(Yi(1) − Yi(0)) + [Yi(0) − EYi(0)]
   = α + Tiβi + ϵi,

where the first equation is the definition of Yi in the potential outcomes framework, and

α = EYi(0);  βi = Yi(1) − Yi(0);  ϵi = Yi(0) − EYi(0).

So the potential outcomes framework is equivalent to a linear structural/causal model of the
form
Yi = α + Tiβi + ϵi
with the coefficient βi differing across individuals. In addition to being empirically
relevant, treatment heterogeneity has been important in the development of econometric
thought.
The above can be mapped into a causal model. Let

x̃ = t;

we model the response of y by

y = c(x̃, ẍ) = c(t, ẍ).

Let X̃i = Ti. In terms of the settings of X̃i and Ẍi, we have

Yi = c(Ti, Ẍi).

Define

Yi(1) = c(1, Ẍi) := c1(Ẍi),
Yi(0) = c(0, Ẍi) := c0(Ẍi).

Then Yi(0) is the value of Yi when Ti is set to 0 while keeping Ẍi constant. Similarly, Yi(1)
is the value of Yi when Ti is set to 1 while keeping Ẍi constant. Given the above definitions,
we have

Yi = c(Ti, Ẍi) = Ti·c1(Ẍi) + (1 − Ti)c0(Ẍi)
   = Ec0(Ẍi) + Ti·[c1(Ẍi) − c0(Ẍi)] + c0(Ẍi) − Ec0(Ẍi)
   = α + Tiβi + ϵi,

where

α = Ec0(Ẍi),
βi = c1(Ẍi) − c0(Ẍi) = Yi(1) − Yi(0),
ϵi = c0(Ẍi) − Ec0(Ẍi) = Yi(0) − EYi(0).

So there is an equivalence between the causal modeling framework and the potential outcomes
framework.
However, since we only observe one of the two potential outcomes for any individual, there
is a limit on how much we can learn about the distribution of individual-level treatment effects.
However, since we only observe one of the two potential outcomes for any individual, there
is a limit on how much we can learn about the distribution of individual-level treatment e§ects.
Now suppose we run the regression based on the predictive model:

Yi = α* + Tiβ* + ui,

where E(ui) = 0 and E(uiTi) = 0 by the definition of α* and β*:

α* = EYi − (ETi)β*,
β* = cov(Yi, Ti)/var(Ti) = [E(YiTi) − E(Yi)(ETi)]/var(Ti)
   = [E(YiTi|Ti = 1)P(Ti = 1) + E(YiTi|Ti = 0)P(Ti = 0) − E(Yi)(ETi)]/var(Ti)
   = [E(Yi|Ti = 1)(ETi) − E(Yi)(ETi)]/[(ETi)(1 − ETi)] = [E(Yi|Ti = 1) − E(Yi)]/(1 − ETi)
   = [E(Yi|Ti = 1) − E(Yi|Ti = 1)ETi − E(Yi|Ti = 0)(1 − ETi)]/(1 − ETi)
   = E(Yi|Ti = 1) − E(Yi|Ti = 0).

Exercise: Let E(ui) = 0 and E(uiTi) = 0, where Ti is a binary random variable. Show that
E(ui|Ti = 1) = 0 and E(ui|Ti = 0) = 0.
The question is: is β* = β? In general, no.
Consider

E[Yi|Ti = 1] = E[(1 − Ti)Yi(0) + TiYi(1)|Ti = 1] = E[Yi(1)|Ti = 1] ≠ E[Yi(1)] in general.

To see the inequality, we note that

• E[Yi(1)] is the mean of the potential outcomes under treatment 1 for all individuals
in the population.

• E[Yi(1)|Ti = 1] is the mean of the potential outcomes under treatment 1 for those who
actually received treatment 1.

• So E[Yi(1)] is the mean for the population while E[Yi(1)|Ti = 1] is the mean for a
subpopulation.

• We expect that E[Yi(1)|Ti = 1] ≠ E[Yi(1)] in general. Mathematically, the unconditional
mean may be different from the conditional mean.

Similarly,
E[Yi|Ti = 0] = E[Yi(0)|Ti = 0] ≠ E[Yi(0)] in general.
Hence β* ≠ β in general.

Figure: the population means E[Yi(1)] and E[Yi(0)] versus the observed group means E[Yi|Ti = 1] and E[Yi|Ti = 0]

But β ∗ = β if

E[Yi (1)|Ti = 1] = E[Yi (1)] and


E[Yi (0)|Ti = 0] = E[Yi (0)]

Example: Suppose that for all i, Yi(1) = Yi(0), and Ti indicates college attendance. Then
ATE = 0: there is no causal effect of college. However, if Ti is positively correlated with
Yi(1) = Yi(0) (more motivated students go to college), then Ti is positively correlated with
Yi(0) − EYi(0), which is the error term in the representation Yi := EYi(0) + Ti × 0 +
[Yi(0) − EYi(0)]. We would then have

E[Yi(1)|Ti = 1] > E[Yi(1)],
E[Yi(0)|Ti = 0] < E[Yi(0)].

Hence, β* > 0.

Another way to look at the problem is to focus on the linear causal model representation:

Yi = α + Tiβi + ϵi
   = α + Ti(Eβi) + ϵi + Ti[βi − (Eβi)]
   = α + Ti×ATE + ei,

where

ei = [Yi(0) − EYi(0)] + Ti[Yi(1) − Yi(0) − E[Yi(1) − Yi(0)]]
   = [Yi(0) − EYi(0)] + Ti{[Yi(1) − EYi(1)] − [Yi(0) − EYi(0)]}.

The above representation has taken on the look and feel of a regression model. ei is
akin to an error term, even though it represents both heterogeneity of the baseline no-treatment
potential outcome, i.e., Yi(0) − EYi(0), and heterogeneity of the treatment effects,
i.e., Yi(1) − Yi(0) − E[Yi(1) − Yi(0)], and even though it includes within it the observed
variable Ti. The above representation is in fact quite different from the traditional bivariate
regression in the sense that it is not only more finely articulated but also tied to a particular
formulation of causal effects that are allowed to vary across individuals.
Whether OLS is consistent depends on whether Ti is endogenous, i.e., whether

E(ϵi + Ti[βi − (Eβi)]|Ti = 1) = 0 and E(ϵi + Ti[βi − (Eβi)]|Ti = 0) = 0.

These two equations are equivalent to

E(Yi(1) − EYi(1)|Ti = 1) = 0 and E(Yi(0) − EYi(0)|Ti = 0) = 0,

or

E(Yi(1)|Ti = 1) = E(Yi(1)) and E(Yi(0)|Ti = 0) = E(Yi(0)).

If the above two equations hold, then OLS is consistent for the ATE. This is exactly the same
conclusion we obtained by comparing β* with β directly.
There are two ways that Ti could be endogenous. First, cov(Ti, Yi(0) − EYi(0)) may not be
zero. That is, there is a correlation between treatment membership and the (net) baseline
difference in the hypothetical no-treatment state. Second,

cov(Ti, [Yi(1) − EYi(1)] − [Yi(0) − EYi(0)])

may not be zero. That is, there is a correlation between the net treatment effect difference
and treatment membership.

9.1.2 Objects of Interest


• Average Treatment Effect:

ATE := E[Yi(1) − Yi(0)].

Note that in general

ATE ≠ E(Yi|Ti = 1) − E(Yi|Ti = 0).

See the graph above.

• Average Effect on the Treated (ATT):

ATT = E[Yi(1)|Ti = 1] − E[Yi(0)|Ti = 1] = E[Yi(1) − Yi(0)|Ti = 1].

The quantities in the definition of the ATT compare

E[Yi(1)|Ti = 1] vs. E[Yi(0)|Ti = 1].

In some cases, we are interested in some policy or treatment which we will make available
to individuals, but we will not force them to take the treatment. If individuals who
are likely to benefit from the treatment are the ones who end up taking it, then we could
have ATT > ATE.

We should be a little careful in using this definition. If information about the efficacy of
the treatment becomes widely known, individuals may change their behavior over time. So
the ATT is arguably somewhat more sensitive to problems of "external generalizability."
We can decompose the apparent effect E(Yi|Ti = 1) − E(Yi|Ti = 0) as follows:

E(Yi|Ti = 1) − E(Yi|Ti = 0)
= E[Yi(1)|Ti = 1] − E[Yi(0)|Ti = 0]
= E[Yi(1) − Yi(0)|Ti = 1]   (ATT)
  + E[Yi(0)|Ti = 1] − E[Yi(0)|Ti = 0]   (selection bias).

In the literature, sometimes "TT" is used to stand for "ATT."

• Subgroup Average Effect:

Suppose we also observe some individual characteristics ("covariates" or "attributes") Xi.
We assume that Xi is not affected by the treatment. Define

ATE(x) := E[Yi(1) − Yi(0)|Xi = x].

This is the average effect for individuals with covariate value x. It could be particularly
useful to a social planner who wants to make treatment assignments on the basis of individual
characteristics.

9.2 Randomized Experiments


We have shown that in general β* ≠ β. However, if Ti is randomly assigned, then β* = β.
To see this, note that since Ti is randomly assigned, it is independent of Yi(0) and Yi(1). Note
that while Ti ⊥ (Yi(0), Yi(1)), Yi is not independent of Ti if EYi(1) ≠ EYi(0). Intuitively, Ti
determines whether Yi takes the value Yi(0) or Yi(1), so Ti and Yi are NOT independent.
Mathematically,

cov(Yi, Ti) = EYiTi − EYiETi = E{[(1 − Ti)Yi(0) + TiYi(1)]Ti} − EYiETi
            = ETiYi(1) − EYiETi = (ETi)[EYi(1)] − EYiETi
            = (ETi){EYi(1) − E[(1 − Ti)Yi(0) + TiYi(1)]}
            = (ETi)E{(1 − Ti)[Yi(1) − Yi(0)]}
            = (ETi)E(1 − Ti)E[Yi(1) − Yi(0)] ≠ 0.

When Ti ⊥ (Yi(0), Yi(1)),

E[Yi(1)|Ti = 1] = E[Yi(1)],
E[Yi(0)|Ti = 0] = E[Yi(0)].

It then follows that

β̂_OLS → β* = E[Yi(1)|Ti = 1] − E[Yi(0)|Ti = 0] = E[Yi(1)] − E[Yi(0)] = β.

So randomized experiments permit consistent estimation of the ATE by OLS, without strong
distributional or functional form assumptions. We say the ATE is nonparametrically identified.
An alternative definition of identification (roughly equivalent): the ATE (or other object of
interest) is identified if we can recover it from the distribution of observables, that is, from
the distribution of (Ti, Yi).
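In practice, with a randomized Ti, the ATE is estimated by the difference in sample means, i.e., the OLS slope in a regression of the outcome on the treatment dummy. A one-line Stata sketch (placeholder names y and t):

reg y i.t, robust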

Example 66 Lalonde (1986); National Supported Work Demonstration (NSW): randomized
evaluation of a job training program. Difference in means between treated and controls:
females $851, males $886. Regression methods were also used to control for observed background
characteristics (this can improve the precision of the estimates), with similar results.
"Nonexperimental estimates": do not use the experimental controls. Instead, construct a control
group from various national surveys (PSID, CPS).
Use regression methods and selection correction methods to try to control for differences and
obtain estimates of treatment effects.
The nonexperimental methods did poorly; in many cases they were not close to the experimental results.

Example 67 Vitamin C. Cameron and Pauling (1976): gave vitamin C to 100 patients believed
to be terminally ill from cancer.
A comparison group was constructed by matched sampling: for each treated patient, select 10
patients from historical records with the same type of cancer and other characteristics (age, gender).
Patients receiving vitamin C lived about 4 times longer than controls, a highly significant difference.
Later, a careful randomized experiment was conducted at the Mayo Clinic, with patients randomly
assigned to receive vitamin C or a placebo.
There was NO evidence that vitamin C prolonged survival.

Example 68 The RAND Health Insurance Experiment (RAND HIE) was an experimental
study of health care costs, utilization and outcomes in the United States, which assigned people
randomly to different kinds of plans and followed their behavior from 1974 to 1982. As a
result, it provided stronger evidence than studies that examine people afterwards who were not
randomly assigned.
People assigned to different HIE plans appear to be similar.

People assigned to more generous plans used substantially more health care.

Did the extra care make us healthier? Not really.

9.3 Strongly Ignorable Treatment Assignment


Next we turn to quasi-experimental studies and observational studies, where treatment as-
signment may not be under our control.

9.3.1 Unconfoundedness Assumption


Unconfoundedness:

Ti ⊥ (Yi(1), Yi(0)) | Xi.

This says that Ti is independent of the potential outcomes conditional on Xi. Unconfoundedness
implies that

Pr(Ti = 1|Xi = x, Yi(1), Yi(0)) = Pr(Ti = 1|Xi = x).

For unconfoundedness to hold, typically Xi contains pre-treatment variables that determine
the treatment selection and affect the outcome of interest.
How is the assumption related to the conditional independence assumption we use in
structural causal modeling? Consider the model

Yi = c(Ti, Ẍi),

where Ẍi are unobserved causes. The conditional independence assumption in the above model
entails
Ti ⊥ Ẍi | Xi
for some Xi. But Yi(1) = c1(Ẍi) and Yi(0) = c0(Ẍi), so the above assumption implies that
Ti ⊥ (Yi(1), Yi(0)) | Xi.
When Xi contains pre-treatment variables that determine the treatment selection and
affect the outcome of interest, we have the following DAG: Xi has arrows into both Ti and Ẍi,
and Ti and Ẍi both have arrows into Yi:

Xi → Ti,  Xi → Ẍi,  Ti → Yi,  Ẍi → Yi.

Conditioning on Xi is a type of "back door" conditioning.

Sometimes unconfoundedness is called the "selection on observables" assumption. This
term comes from the following model for Ti:

Ti = 1{φ0 + Xiφ1 + Ui > 0},

where Xi represents all observed variables that determine treatment selection and Ui represents
all unobserved variables. When Ui is completely random in that it is independent of Yi,

then selection depends only on the observables systematically, and Ui can be regarded as a
"randomizer." The left panel of the graph below illustrates "selection on observables":
there are no back-door paths from T to Y other than the one that is blocked by X. The
term U represents completely random and idiosyncratic determinants of treatment selection.
In contrast, the right panel illustrates "selection on unobservables": the term U, like the
elements in X, is not completely random; it is correlated with Y. There are now back-door
paths from T to Y other than those via X. Conditioning on X does not block the back-door
path T ← U → Y. In this case, we have to make strong assumptions about the links between
unobservables and observables; for example, we may try to find a randomized instrumental
variable which does not affect the unobservables, leading to the instrumental variables approach.

Left panel (selection on observables): Xi → Ti, Xi → Ẍi, Ti → Yi, Ẍi → Yi, Ui → Ti.
Right panel (selection on unobservables): the same arrows, plus Ui → Yi.

So basically, in a nonexperimental setting including both conditional randomization and
observational studies, we have the first two cases:

• Selection on observables (other terms for this include: unconfoundedness, conditional
independence, exogeneity, missing at random).
We adjust for differences in a fixed set of covariates to remove bias. Good references
include: Imbens (2015 JHR), Imbens and Wooldridge (2009 JEL), Imbens and Wooldridge
(2007 NBER summer school).

• Selection on unobservables.
We find instrumental variables to implement the IV approach.

Example 69 Suppose the treatment effect is constant:

Yi(1) − Yi(0) = β.

Then we can write
Yi = α + Tiβ + ϵi,
where α := E[Yi(0)] and ϵi := Yi(0) − E[Yi(0)]. Then unconfoundedness is equivalent to the
assumption that Ti is independent of ϵi conditional on Xi.

9.3.2 Overlap Assumption


We also assume:
Overlap: for all x in the support of X,

0 < Pr(Ti = 1|Xi = x) < 1.

This says that for each possible value of Xi, we have a positive probability of observing
Ti = 1 and of observing Ti = 0. Note that

f_{X|T}(x|1) = Pr(Ti = 1|Xi = x) fX(x) / Pr(Ti = 1),
f_{X|T}(x|0) = Pr(Ti = 0|Xi = x) fX(x) / Pr(Ti = 0),

and so under the overlap assumption

f_{X|T}(x|1)/f_{X|T}(x|0) = [Pr(Ti = 1|Xi = x)/Pr(Ti = 0|Xi = x)] × [Pr(Ti = 0)/Pr(Ti = 1)] ∈ (0, ∞).

Partial overlap: if for very large values of x everyone is treated and for very small values of x
everyone is untreated, then ATE(x) is not identified for very large or very small x. However,
ATE(x) for x in the middle range is identified.

No overlap: ATE(x) is not identified for any x.

When the unconfoundedness and overlap assumptions hold, we say that the treatment
assignment is strongly ignorable.

9.3.3 Example: job training program


Consider a job training program.
Ti : indicator for receiving training
Yi : employment indicator
Xi : binary variable equal to 1 if person i went to college, and equal to 0 if they only have
a high school diploma.
Unconfoundedness implies:

P r(Ti = 1|Xi = 1, Yi (1), Yi (0)) = pc

P r(Ti = 1|Xi = 0, Yi (1), Yi (0)) = phs


This says that for the subgroup with Xi = 1, we essentially have a randomized experiment
with probability pc of getting the treatment, and probability 1−pc of not getting the treatment.
Likewise, for the subgroup with Xi = 0, we have a randomized experiment with probability
phs of getting the treatment, and probability 1 − phs of not getting the treatment.
The overlap assumption implies that 0 < pc < 1 and 0 < phs < 1, so that for each of the
two groups, we do not observe all treated or all controls. Note that in general pc ≠ phs, so that
the treated and control groups could differ in their ratios of high school to college graduates.
So just comparing the treated and untreated groups (as we did in the randomized treatment case)
will not be appropriate.

9.4 Identification under Strong Ignorability


For now, we focus on ATE(x) and ATE. (A similar argument works for the ATT.)
Recall that

ATE(x) = E[Yi(1) − Yi(0)|Xi = x] = E[Yi(1)|Xi = x] − E[Yi(0)|Xi = x].

Note that

E[Yi(1)|Xi = x] = E[Yi(1)|Ti = 1, Xi = x] = E[Yi|Ti = 1, Xi = x],

where the first equality follows from the unconfoundedness assumption. Since Pr(Ti = 1|Xi =
x) > 0 by assumption, we can consistently estimate E[Yi|Ti = 1, Xi = x], and therefore we
can identify E[Yi(1)|Xi = x].
Likewise,

E[Yi(0)|Xi = x] = E[Yi(0)|Ti = 0, Xi = x] = E[Yi|Ti = 0, Xi = x],

so we can estimate E[Yi(0)|Xi = x] as well. Thus, we can identify ATE(x). For this, we need
to estimate E[Yi|Ti, Xi] for all values of Ti and Xi.
Notice also that
ATE = E[ATE(Xi)] = ∫ ATE(x) dFX(x),

where FX(·) is the (marginal) distribution of Xi.

If we have an estimate of ATE(x), then we can average over the distribution of Xi to get
an estimate of the ATE. So we have shown that the ATE is identified under the unconfoundedness
assumption.
Note: we have not made parametric assumptions about the joint distribution of (Y, T, X).
In that sense this is a nonparametric identification result. On the other hand, the unconfoundedness
assumption is very strong and rules out many interesting treatment selection
processes.

Remark 70 Relationship with the SCM. In the SCM, we have

dG(t, x) = ∫ c(t, ẍ) g(ẍ|x, t) dẍ = E(Y|X = x, T = t)
         = ∫ c(t, ẍ) g(ẍ|x) dẍ = E[c(t, Ẍ)|X = x].

Note that E(Y|X = x, T = t), and hence dG(t, x), can be estimated from the data. The discrete
analogue of the "partial derivative" is

dG(1, x) − dG(0, x) = E[c(1, Ẍ) − c(0, Ẍ)|X = x],

and therefore

E[dG(1, X) − dG(0, X)]
= E{E[c(1, Ẍ) − c(0, Ẍ)|X]} = E[c(1, Ẍ) − c(0, Ẍ)]
= E[Y(1) − Y(0)] = ATE.

Basically, we estimate the difference of the conditional means E(Y|X = x, T = 1) − E(Y|X =
x, T = 0), which is fundamentally a statistical object. However, under the conditional independence
assumption, we can attach a causal interpretation to this statistical object. For this
reason, we call the conditional independence assumption the identification assumption. This
assumption is not directly testable.

We note in passing that the ATT is identified via

ATT = ∫ ATE(x) dF_{X|T=1}(x),

where F_{X|T=1}(x) is the conditional distribution of Xi given Ti = 1.

Note that we do not really need Ti ⊥ (Yi(1), Yi(0)) | Xi. It suffices to have

E[Yi(1)|Xi = x] = E[Yi(1)|Ti = 1, Xi = x],
E[Yi(0)|Xi = x] = E[Yi(0)|Ti = 0, Xi = x].

Mathematically, the above are weaker than Ti ⊥ (Yi(1), Yi(0)) | Xi, but it is hard to see the
practical advantage of the weaker conditional mean independence assumption. Recall that
conditional mean independence is sufficient for a linear causal model. Here, when Ti is a
binary variable, we have a linear or quasi-linear model by construction. That is why it suffices
to have conditional mean independence.
For the ATT, we only need to assume that

E[Yi(0)|Xi = x] = E[Yi(0)|Ti = 0, Xi = x],

because Yi(1) is not missing for the treated individuals; we use the above to impute their
missing Yi(0). To identify the ATT, the overlap condition can be relaxed to

0 < Pr(Ti = 1|Xi = x) < 1

for all x in the support of Xi conditional on Ti = 1. Similarly, for the ATC, we only need to
assume
E[Yi(1)|Xi = x] = E[Yi(1)|Ti = 1, Xi = x],
and the overlap assumption can be similarly weakened.
Let

m1(x) = E[Yi|Ti = 1, Xi = x]
m0(x) = E[Yi|Ti = 0, Xi = x]

and let m̂1(x) and m̂0(x) be consistent estimators of m1(x) and m0(x), respectively. Then
we can estimate ATE, ATT and ATC by

ÂTE = (1/n) Σ_{i=1}^n [m̂1(Xi) − m̂0(Xi)],

ÂTT = Σ_{i=1}^n [Yi − m̂0(Xi)] 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1},

ÂTC = Σ_{i=1}^n [m̂1(Xi) − Yi] 1{Ti = 0} / Σ_{i=1}^n 1{Ti = 0}.

Depending on whether we use a parametric or a nonparametric method to estimate
m1(x) and m0(x), we obtain a parametric or a nonparametric estimator of
ATE, ATT and ATC, as in the sketch below.
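As a concrete illustration, the following Python sketch (with hypothetical array and function
names) computes the three plug-in estimators; it assumes fitted values m̂1(Xi) and m̂0(Xi) have
already been produced by whichever parametric or nonparametric method is used.

import numpy as np

def plug_in_effects(y, t, m1_hat, m0_hat):
    # Plug-in estimators of ATE, ATT and ATC given fitted values
    # m1_hat[i] for E[Y|T=1, X=X_i] and m0_hat[i] for E[Y|T=0, X=X_i].
    treated = (t == 1)
    ate = np.mean(m1_hat - m0_hat)
    att = np.mean(y[treated] - m0_hat[treated])    # impute Y(0) for the treated
    atc = np.mean(m1_hat[~treated] - y[~treated])  # impute Y(1) for the controls
    return ate, att, atc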

9.5 Parametric Methods under Strong Ignorability


9.5.1 Parametric Method: Discrete Cases
When X is a discrete variable, we can estimate m1 (x) and m0 (x) by
m̂1(x) = Σ_i Ti 1(Xi = x) Yi / Σ_i Ti 1(Xi = x),

m̂0(x) = Σ_i (1 − Ti) 1(Xi = x) Yi / Σ_i (1 − Ti) 1(Xi = x).

As a result, ATE(x) can be estimated by

ÂTE(x) = Σ_i Ti 1(Xi = x) Yi / Σ_i Ti 1(Xi = x) − Σ_i (1 − Ti) 1(Xi = x) Yi / Σ_i (1 − Ti) 1(Xi = x).

We are simply taking the treatment and control averages for the subsample with Xi equal to
a particular value. Then, taking the sample analog of the equation ATE = E[ATE(X)], we have

ÂTE = (1/n) Σ_{i=1}^n ÂTE(Xi).

A nice feature of this estimator is that we avoid making strong assumptions on the form
of E[Yi|Ti, Xi]. However, if Xi takes on many values, there will be relatively few observations
with any particular value of Xi, leading to high variance for ÂTE(x) and ÂTE.
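As an illustration, here is a minimal Python sketch (function name hypothetical) of the
cell-by-cell estimator for a discrete X; it assumes that both treated and control observations
are present in every cell, i.e., the overlap condition holds in the sample.

import numpy as np

def ate_discrete_x(y, t, x):
    # Cell-by-cell estimator of ATE for discrete X: within each cell the
    # treated-minus-control mean difference estimates ATE(x); cells are then
    # weighted by their empirical frequency P_n(X = x).
    ate_hat = 0.0
    for v in np.unique(x):
        cell = (x == v)
        m1 = y[cell & (t == 1)].mean()
        m0 = y[cell & (t == 0)].mean()
        ate_hat += (m1 - m0) * cell.mean()
    return ate_hat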

9.5.2 Parametric Method: Continuous Cases


If Xi has a continuous distribution, then the estimator in the previous section cannot be used.
Under continuity, the probability of observing more than one sample unit with the same value
of Xi is zero. In practice, we will have very few observations with the same value of Xi.
One possibility is to make some assumptions about the conditional regression function
E(Y |T, X). For example, we may assume that

E(Yi |Ti , Xi ) = a + Ti b + Xi c + (Ti · Xi ) d.

This implies that

E(Yi |Ti = 1, Xi ) = a + b + Xi (c + d) ,
E(Yi |Ti = 0, Xi ) = a + Xi c.

So the regression line for the Ti = 0 subgroup could have a different slope and intercept from
the regression line for the Ti = 1 subgroup. We could include transformations of Xi (such as
powers of Xi) as well, and get a fairly general regression specification.
We could then estimate this regression function by OLS, and then estimate ATE(x) by

ÂTE(x) = b̂_OLS + x d̂_OLS.

Then the estimates of the treatment effects would be

ÂTE = (1/n) Σ_{i=1}^n (b̂_OLS + Xi d̂_OLS) = b̂_OLS + X̄ d̂_OLS,

ÂTT = Σ_{i=1}^n (b̂_OLS + Xi d̂_OLS) Ti / Σ_{i=1}^n Ti = b̂_OLS + X̄|treatment d̂_OLS,

ÂTC = Σ_{i=1}^n (b̂_OLS + Xi d̂_OLS)(1 − Ti) / Σ_{i=1}^n (1 − Ti) = b̂_OLS + X̄|control d̂_OLS,

where X̄|treatment is the average of Xi for the treated and X̄|control is the average of Xi for the
untreated.
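A minimal Python sketch of this regression adjustment (scalar covariate, hypothetical function
name) is given below; it fits the interacted regression by least squares and reads off the three
estimates.

import numpy as np

def ols_adjustment(y, t, x):
    # Fit E(Y|T, X) = a + T*b + X*c + (T*X)*d by OLS and report
    # ATE = b + mean(X)*d, ATT = b + mean(X|treated)*d, ATC = b + mean(X|control)*d.
    design = np.column_stack([np.ones(len(y)), t, x, t * x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    a, b, c, d = coef
    ate = b + x.mean() * d
    att = b + x[t == 1].mean() * d
    atc = b + x[t == 0].mean() * d
    return ate, att, atc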

We can get an even nicer expression for ÂTE if we run the regression with X̃i := Xi − X̄
in place of Xi. Then we will have
ÂTE = b̂_{OLS,alt}
where b̂_{OLS,alt} is the OLS estimator of b in the regression

Ŷi = â_{OLS,alt} + Ti b̂_{OLS,alt} + X̃i ĉ_{OLS,alt} + (Ti · X̃i) d̂_{OLS,alt}.

Notice that the unconfoundedness assumption provides a link between conventional re-
gression parameters and the causal parameter. Thus, it is possible to interpret regression
parameters as causal, but only under somewhat strong assumptions about the selection into
treatment and control groups.
The OLS estimators of a, b, c, d can be obtained by running two separate regressions:
(\widehat{a+b}, \widehat{c+d}) = arg min Σ_{i: Ti=1} [Yi − (a + b) − Xi (c + d)]²,

(â, ĉ) = arg min Σ_{i: Ti=0} [Yi − a − Xi c]².

So

ÂTT = [\widehat{a+b} + X̄|treatment \widehat{c+d}] − [â + X̄|treatment ĉ]
    = (\widehat{a+b} − â) + X̄|treatment [\widehat{c+d} − ĉ]
    = Ȳ|treatment − X̄|treatment \widehat{c+d} − Ȳ|control + X̄|control ĉ
      + X̄|treatment [\widehat{c+d} − ĉ]
    = Ȳ|treatment − {Ȳ|control + (X̄|treatment − X̄|control) ĉ}.

Similarly, we have
ÂTC = [\widehat{a+b} + X̄|control \widehat{c+d}] − [â + X̄|control ĉ]
    = Ȳ|treatment − X̄|treatment \widehat{c+d} − Ȳ|control + X̄|control ĉ
      + X̄|control [\widehat{c+d} − ĉ]
    = [Ȳ|treatment − (X̄|treatment − X̄|control) \widehat{c+d}] − Ȳ|control.

These two expressions give us a clear idea of what the parametric assumption entails. As an
example, consider the ATT. In this case, the estimate is the difference between Ȳ|treatment and
the adjusted average of the control group, Ȳ|adj_control, where

Ȳ|adj_control = Ȳ|control + (X̄|treatment − X̄|control) ĉ.

The magnitude of the adjustment depends on the extent to which the two groups are balanced
in terms of the covariate averages. Under a completely randomized experiment, we have
E X̄|treatment = E X̄|control = E X̄ if Ti does not cause Xi. Then

ÂTT ≈ Ȳ|treatment − Ȳ|control.

Ȳ |treatment − Ȳ |control is exactly the same as the estimate of the ATE in the context of com-
pletely randomized experiments.

Remark 71 The method can also be viewed from an imputation perspective. Suppose for
individual i0, Ti0 = 1, so that we observe Yi0(1) but Yi0(0) is missing. We want to impute
the missing value Yi0(0). The above procedure amounts to first running the regression
(â, ĉ) = arg min Σ_{j: Tj=0} [Yj − a − Xj c]² based on the subsample with Tj = 0 and then
imputing the missing value by
Ŷi0(0) = â + Xi0 ĉ.
Similarly, the missing value of Yj0(1) for an individual j0 with Tj0 = 0 is imputed by

Ŷj0(1) = \widehat{a+b} + \widehat{c+d} Xj0.

With the imputed values, we can then proceed as if there were no missing values. This is
equivalent to the OLS regression adjustment above.

Remark 72 The linear parametric assumption implies that the adjustment is linear. If the
true adjustment is nonlinear, the linear specification will not lead to a large bias as long as
X̄|treatment is approximately equal to X̄|control. Otherwise, the bias due to the misspecification
of the functional form can be large. Consider ÂTT as an example, in which case we essentially
use the control cases to impute the missing values of Yi(0) for the treated. If X̄|treatment is very
different from X̄|control, we rely on extrapolation to make the imputation. Linear extrapolation
may not be reliable when the averages of the covariates across the two groups are not close to
each other.

Remark 73 Bias from the global linear regression is a function of (i) the covariate distribu-
tions and (ii) nonlinearity. One way to assess potential bias is to calculate the normalized
difference:

normdiff = (X̄|treatment − X̄|control) / √[(S²|treatment + S²|control)/2]

where S²|treatment and S²|control are the sample variances of Xi for the two groups. Imbens and
Wooldridge (2009) suggest a rule of thumb:
If normdiff > 0.25, then don't use global linear regression.
Note that normdiff is not the t-statistic for testing equal means. With really large samples,
small values of X̄|treatment − X̄|control can have a large t-statistic. But inference does not get
harder as the sample size increases.
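The normalized difference is easy to compute; a short Python sketch for a scalar covariate
(hypothetical function name) is:

import numpy as np

def normalized_difference(x, t):
    # Imbens-Wooldridge normalized difference in covariate means.
    # This is a scale-free balance diagnostic, not a t-statistic.
    x1, x0 = x[t == 1], x[t == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled_sd

# Rule of thumb: be wary of global linear regression adjustment when
# abs(normalized_difference(x, t)) exceeds roughly 0.25.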

Remark 74 Suppose that we have a completely randomized experiment so that Ti ⊥ (Yi(1), Yi(0)),
but we are not aware of this. Instead, we proceed to estimate the treatment effect under the
assumption that
Ti ⊥ (Yi(1), Yi(0))|Xi.
If Ti causes Xi so that E X̄|treatment ≠ E X̄|control, and if Yi causes Xi, then using X to adjust
the treatment and control averages would introduce a bias. This is a case of having a bad
control. For the ATT estimation, the bias is reflected in (X̄|treatment − X̄|control) ĉ. This can
be regarded as a selection bias, as it depends on the difference of the averages of Xi across the
two subgroups.

[Diagram omitted: Ti → Xi, with Yi also causing Xi. Bad control.]

[Diagram omitted: Ti → Xi, with Xi on the causal path to Yi. Unnecessary control: "controlling"
for Xi leads to a different interpretation of the estimated causal effect. It is now the causal
effect that is not mediated via Xi.]

Remark 75 Recall that

Yi = α + Ti × ATE + ei

where
ei = [Yi(0) − EYi(0)] + Ti {[Yi(1) − EYi(1)] − [Yi(0) − EYi(0)]}.
So
E(Yi|Ti, Xi) = α + Ti × ATE + E(ei|Ti, Xi),
where

E(ei|Ti, Xi) = E{[Yi(0) − EYi(0)]|Ti, Xi} + Ti E[{[Yi(1) − EYi(1)] − [Yi(0) − EYi(0)]}|Ti, Xi]
             = E{[Yi(0) − EYi(0)]|Xi} + Ti E[{[Yi(1) − EYi(1)] − [Yi(0) − EYi(0)]}|Xi].

If
E{Yi(0)|Xi} = δ0 + Xi δ1,  E{Yi(1)|Xi} = γ0 + Xi γ1,
then
E{[Yi(0) − EYi(0)]|Xi} = (Xi − EXi) δ1,
and
E{[Yi(1) − EYi(1)]|Xi} = (Xi − EXi) γ1,

which imply that

E(Yi|Ti, Xi) = α + Ti × ATE + E(ei|Ti, Xi)
             = α + Ti × ATE + (Xi − EXi) δ1 + Ti × (Xi − EXi)(γ1 − δ1).

9.6 Nonparametric Methods under Strong Ignorability


9.6.1 Nonparametric Regression for ATE
We can use the data to estimate m1(x) and m0(x) by m̂1(x) and m̂0(x) using any of the
nonparametric methods.
For example, we can take the observations with Ti = 1, and regress Yi on 1, Xi, Xi², ..., Xi^J
to obtain an estimate m̂1(x), and likewise with the Ti = 0 observations to get m̂0(x). Then

ÂTE = (1/n) Σ_{i=1}^n [m̂1(Xi) − m̂0(Xi)].

Note that we are taking each observed value of Xi, evaluating m̂1 and m̂0 at each such
value, and then averaging.
An advantage of the series method is that inference can be made by pretending that we have
two (pseudo) parametric regressions.
As a second example, we can use the kernel smoothing method to estimate m1(x) and
m0(x) by

m̂0(x) = Σ_{j: Tj=0} Yj Kh(Xj − x) / Σ_{j: Tj=0} Kh(Xj − x),

m̂1(x) = Σ_{j: Tj=1} Yj Kh(Xj − x) / Σ_{j: Tj=1} Kh(Xj − x),

where
Kh(u) = (1/h) K(u/h)
and K(·) is a kernel function and h is the bandwidth.
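For concreteness, here is a minimal Python sketch of the kernel-based plug-in ATE estimator
for a scalar X, using a Gaussian kernel and a user-chosen bandwidth (all names are hypothetical):

import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kernel_ate(y, t, x, h):
    # Nadaraya-Watson estimates of m1 and m0 evaluated at every observed X_i,
    # then averaged to give the plug-in estimator of ATE.
    def m_hat(x0, group):
        w = gaussian_kernel((x[group] - x0) / h)
        return np.sum(w * y[group]) / np.sum(w)
    treated, control = (t == 1), (t == 0)
    m1 = np.array([m_hat(xi, treated) for xi in x])
    m0 = np.array([m_hat(xi, control) for xi in x])
    return np.mean(m1 - m0)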

9.6.2 Nonparametric Regression for ATT and ATC


Recall

ATT = E(Yi(1) − Yi(0)|Ti = 1)
    = E(Yi(1)|Ti = 1) − E(Yi(0)|Ti = 1)
    = E(Yi|Ti = 1) − E(Yi(0)|Ti = 1).

The first term E(Yi|Ti = 1) can be estimated simply by taking the sample average of the
treated observations:

Ê(Yi|Ti = 1) = Σ_{i=1}^n Yi 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1} = Σ_{i=1}^n Yi Ti / Σ_{i=1}^n Ti.

The problem is how to estimate the second, "counterfactual" term. For an individual with
Ti = 1, Yi(0) is the counterfactual outcome, which by definition is not observable. Here we
have to use the unconfoundedness assumption:

E(Yi(0)|Ti = 1) = E[ E(Yi(0)|Ti = 1, Xi) | Ti = 1 ]   by LIE
               = E[ E(Yi(0)|Ti = 0, Xi) | Ti = 1 ]   by unconfoundedness
               = E[ E(Yi|Ti = 0, Xi) | Ti = 1 ]      by the definition of Yi.

We can estimate E(Yi|Ti = 0, Xi) by m̂0(Xi) and estimate E(Yi(0)|Ti = 1) by

Ê(Yi(0)|Ti = 1) = Σ_{i=1}^n m̂0(Xi) 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1}.

This leads to the following estimator of ATT:

ÂTT = Σ_{i=1}^n [Yi − m̂0(Xi)] 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1}.

By the same argument, we can estimate ATC by

ÂTC = Σ_{i=1}^n [m̂1(Xi) − Yi] 1{Ti = 0} / Σ_{i=1}^n 1{Ti = 0}.

As in the case of ATE, we can use any of the nonparametric methods to estimate m0(x) and
m1(x). For example, suppose we use 1-nearest-neighbor matching for m̂0. Basically, what we
are doing is taking each treated outcome and finding a control unit that has the closest value
of X. We subtract the matched control outcome from the treated outcome, and average.
This is a type of matching estimator, which we will discuss in some detail later.

9.7 Propensity Score


We maintain the unconfoundedness assumption:

Ti ⊥ (Yi(1), Yi(0))|Xi

and the overlap assumption: for all x in the support of X,

0 < Pr(Ti = 1|Xi = x) < 1.

These assumptions may be weakened when ATT or ATC is the object of interest. Under the
above two assumptions, we look at the propensity score approach to estimating the ATE and
ATT. Similar ideas apply to ATC estimation.

9.7.1 A Key Result and Its Implications


Definition 76 The propensity score function is defined as

p(x) = P (T = 1|X = x)

or equivalently
p(x) = E(T |X = x).

A basic result in the propensity score framework is

Theorem 77 If T ⊥ (Y(1), Y(0))|X and p(x) ∈ (0, 1) for all x in the support of X, then
T ⊥ (Y(1), Y(0))|p(X).

Proof. We will show that

P[T = 1|Y(1), Y(0), p(X)] = P[T = 1|p(X)] = p(X).

This implies that T is independent of (Y(0), Y(1)) conditional on p(X). First,

P[T = 1|Y(1), Y(0), p(X)] = E{ E[T|Y(1), Y(0), p(X), X] | Y(1), Y(0), p(X) }
                          = E{ E[T|Y(1), Y(0), X] | Y(1), Y(0), p(X) }
                          = E{ E[T|X] | Y(1), Y(0), p(X) }
                          = E{ p(X) | Y(1), Y(0), p(X) } = p(X).

Second,

P[T = 1|p(X)] = E{ E[T|p(X), X] | p(X) }
              = E{ E[T|X] | p(X) }
              = E{ p(X) | p(X) } = p(X).

Hence
P[T = 1|Y(1), Y(0), p(X)] = P[T = 1|p(X)]
as desired.
In the SCM framework, we have

Theorem 78 If T ⊥ Ẍ|X and p(x) ∈ (0, 1) for all x in the support of X, then T ⊥ Ẍ|p(X).

The proof is the same but with Y (1), Y (0) replaced by Ẍ.

[Diagram omitted: Xi determines the propensity score P(Xi); Xi affects both Ti and Ẍi; Ti, Ẍi
and Ui determine Yi.]
The above results imply that we can replace the covariate X with the scalar p(X) in our
previous regression-based approaches.
For the parametric case, recall that
Yi = α + Ti × ATE + ei
where
ei = [Yi(0) − EYi(0)] + Ti {[Yi(1) − EYi(1)] − [Yi(0) − EYi(0)]}.
So
E(Yi|Ti, P(Xi)) = α + Ti × ATE + E(ei|Ti, P(Xi)),
where

E(ei|Ti, P(Xi)) = E{[Yi(0) − EYi(0)]|Ti, P(Xi)}
                  + Ti E[{[Yi(1) − EYi(1)] − [Yi(0) − EYi(0)]}|Ti, P(Xi)]
                = E{[Yi(0) − EYi(0)]|P(Xi)} + Ti E[{[Yi(1) − EYi(1)] − [Yi(0) − EYi(0)]}|P(Xi)].

If
E{Yi(0)|P(Xi)} = δ0 + P(Xi) δ1,  E{Yi(1)|P(Xi)} = γ0 + P(Xi) γ1,
then
E{[Yi(0) − EYi(0)]|P(Xi)} = (P(Xi) − EP(Xi)) δ1,
and
E{[Yi(1) − EYi(1)]|P(Xi)} = (P(Xi) − EP(Xi)) γ1,
which imply that

E(Yi|Ti, P(Xi)) = α + Ti × ATE + E(ei|Ti, P(Xi))
                = α + Ti × ATE + (P(Xi) − EP(Xi)) δ1 + Ti × (P(Xi) − EP(Xi))(γ1 − δ1).

So we can estimate the ATE by regressing Yi on a constant, Ti, the centered propensity score,
and their interaction. In the presence of a constant treatment effect, we have
E(Yi|Ti, P(Xi)) = α + Ti × ATE + (P(Xi) − EP(Xi)) δ1,

and we can estimate the ATE by using P (Xi ) as the control variable. This is entirely analogous
to the control variable approach under the conditional mean independence assumption.
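A minimal Python sketch of this regression, assuming an estimated propensity score is already
available (the function name is hypothetical):

import numpy as np

def ate_pscore_regression(y, t, pscore):
    # Regress Y on a constant, T, the centered propensity score, and their
    # interaction; the coefficient on T is the ATE estimate under the linear
    # specification above.
    p_c = pscore - pscore.mean()
    design = np.column_stack([np.ones(len(y)), t, p_c, t * p_c])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]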
For the nonparametric approach, letting X̃ = p(X), we can estimate ATE by

ÂTE = (1/n) Σ_{i=1}^n [ŝ1(X̃i) − ŝ0(X̃i)]

where ŝ1(x) is a nonparametric estimator of s1(x) = E(Yi|Ti = 1, X̃i = x) and ŝ0(x) is a
nonparametric estimator of s0(x) = E(Yi|Ti = 0, X̃i = x). Similarly, we can estimate ATT by

ÂTT = Σ_{i=1}^n [Yi − ŝ0(X̃i)] 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1}.

9.7.2 Estimation of the Propensity Score


In practice, p(Xi) is rarely known, so we must estimate it. We can use the logit, probit, or
even the linear probability method to estimate p(x). We should usually avoid using X's that
could be affected by the treatment. Depending on the assumptions we are willing to make, we
can estimate p(x) parametrically or (semi-)nonparametrically. In the parametric method, we
assume that p(x) takes some parametric functional form, for example

p(x) = Φ(xδ).

In the (semi-)nonparametric method, we relax the above assumption slightly and may take

p(x) = Φ(δ0 + xδ1 + x²δ2 + ... + x^J δJ).

How should X enter the probit or logit model? If X is multivariate and we include higher-order
terms and their interactions, then the number of terms can explode. Imbens and Rubin
(2015, textbook) propose using stepwise regression, a sensible way to limit the number of X's:

1. Start with covariates that are expected a priori to matter. This requires some subject
matter knowledge.

2. Add covariates one by one, each time adding the covariate with the largest test statistic
(e.g., likelihood ratio). If the test statistic exceeds a pre-specified threshold, include the
corresponding covariate. Iterate this procedure until no remaining test statistic exceeds
the threshold.

3. Do this for the first-order terms and then the second-order terms, with potentially
different thresholds. We probably don't want to use fourth-order terms, and maybe
not even third-order terms.

We could also use the Lasso or other shrinkage methods. For the Lasso, we choose the model
that minimizes

−Σ_{i=1}^n log Li + λ Σ_{j=1}^J |δj|

for some tuning parameter λ > 0.
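In practice the penalized logit is available in standard software; the sketch below uses
scikit-learn's L1-penalized logistic regression as one convenient implementation, and assumes
the design matrix X already contains the polynomial and interaction terms one wants to
consider.

import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_pscore_lasso(X, t, lam=0.1):
    # L1-penalized logit for p(x) = Pr(T = 1 | X = x); the penalty shrinks
    # coefficients on unimportant terms toward zero.  Note that sklearn's C
    # is the inverse of the penalty weight lambda.
    model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / lam)
    model.fit(X, t)
    return model.predict_proba(X)[:, 1]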


With the estimated propensity score, we may have to trim the data so that there is common
support in the propensity score distribution between the two groups. This is related to the
overlap assumption
0 < Pr(Ti = 1|Xi = x) < 1.
Without sufficient overlap, for some treated individuals, we may not be able to find a control
with a similar propensity score. In practice, we often have to use ad hoc trimming rules. With
trimming, the ATE has to be redefined, as it is no longer the average treatment effect over the
entire population.

It is often argued in the literature that the propensity score approach reduces the dimen-
sionality of the problem. This seems to be true on the surface. In the ATT estimation, if we
use the single-nearest-neighbor estimator ŝ0, we essentially match a treated individual with an
untreated individual with the closest propensity score. Originally we had to match on a
multidimensional vector X, and now we only have to match on a scalar. We have
seemingly achieved dimension reduction. However, the high-dimension problem is still there,
as the propensity score still depends on the multidimensional vector X. So the dimension
reduction has not really been achieved; it is just hidden a little deeper. Nevertheless, it is still
insightful to know that it is sufficient to use the propensity score under the unconfoundedness
assumption.

9.7.3 Matching
Note that ATT can be estimated by

ÂTT = Σ_{i=1}^n [Yi − m̂0(Xi)] 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1}            (9.1)

or

ÂTT = Σ_{i=1}^n [Yi − ŝ0(X̃i)] 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1}.

As we discussed before, when m̂0 and ŝ0 are estimated by the kNN method with k = 1, the
above estimators are matching estimators that entail matching a treated individual with an
untreated individual with the closest value of X or the closest propensity score, respectively.
Matching involves finding direct comparisons, that is, matches, for each individual.
For the kNN matching based on the values of the X variable, matching involves a distance
measure and the number of points k we choose to match. For instance, we can use the
Euclidean distance:

||Xi − Xj|| = √( Σ_ℓ (Xiℓ − Xjℓ)² ).

In practice, we need to standardize the covariates somehow (e.g., using the inverse of the
variances) in order to avoid the situation that some covariates dominate other covariates.
Selecting the value of k represents a subtle tradeoff. Matching more control cases to each
treatment case results in a lower asymptotic variance of the treatment effect estimator but also
tends to increase the bias, because the probability of making poorer matches increases with
the number of matches.
A danger of selecting a fixed number of matches, such as 5, is that it may lead to some
poor matches for some treatment cases. A version of nearest-neighbor matching, known as
"radius" or "caliper" matching, is designed to remedy this drawback by restricting the matches
to a chosen maximum distance. A hybrid approach is to keep the single nearest match and
throw away the kNN matches whose distances are larger than a given maximum distance.
For the kernel matching based on the X variable, all control cases are used as counterfactuals
for each treatment case, but each control case is weighted based on its distance from the
treatment case. Note that

m̂0(Xi) = Σ_{j: Tj=0} Yj Kh(Xj − Xi) / Σ_{j: Tj=0} Kh(Xj − Xi)
        = Σ_{j=1}^n Yj Kh(Xj − Xi) 1{Tj = 0} / Σ_{j=1}^n Kh(Xj − Xi) 1{Tj = 0}
        = Σ_{j=1}^n Yj Wji 1{Tj = 0}

where

Wji = Kh(Xj − Xi) 1{Tj = 0} / Σ_{j=1}^n Kh(Xj − Xi) 1{Tj = 0}.

Plugging this into (9.1) yields

ÂTT = Σ_{i=1}^n [Yi − Σ_{j=1}^n Yj Wji 1{Tj = 0}] 1{Ti = 1} / Σ_{i=1}^n 1{Ti = 1}.

The weight attached to the jth control case for the ith treatment case is Wji.
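The following Python sketch implements 1-nearest-neighbor matching with replacement (and
an optional caliper) on a scalar matching variable, which can be a covariate or an estimated
propensity score; the function name is hypothetical.

import numpy as np

def att_nn_matching(y, t, x, caliper=None):
    # ATT by 1-NN matching: for each treated unit, find the control with the
    # closest value of x; optionally drop treated units whose nearest control
    # lies outside the caliper.
    treated = np.where(t == 1)[0]
    controls = np.where(t == 0)[0]
    diffs = []
    for i in treated:
        j = controls[np.argmin(np.abs(x[controls] - x[i]))]
        if caliper is not None and np.abs(x[j] - x[i]) > caliper:
            continue  # no acceptable match within the caliper
        diffs.append(y[i] - y[j])
    return np.mean(diffs)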

Example 79 A simple example



 i   T (training)   X (educ)   Y (income)   Match    Y(1)   Y(0)   Y(1) − Y(0)
 1        0            12          60        {8}       70     60       10
 2        0            13          80        {6}       80     80        0
 3        0            15          90        {5}      100     90       10
 4        0            20         150        {5}      160    150       10
 5        1            15         160        {3}      160     90       70
 6        1            13          80        {2}       80     80        0
 7        1            14          90       {2, 3}     90     85        5
 8        1            12          70        {1}       70     60       10
For the 1NN matching, ATE is estimated to be

(10 + 0 + 10 + 10 + 70 + 0 + 5 + 10)/8 = 115/8

and ATT is estimated to be

(70 + 0 + 5 + 10)/4 = 85/4.
The above procedure applies to matching based on the propensity score. In this case,
there is also interval matching, also called blocking, subclassification or stratification.
It involves sorting the treatment and control cases into segments of a unidimensional distance
metric, usually the estimated propensity score. For example, the intervals can be defined by
the cutpoints on the distance metric that subdivide the treatment cases into a chosen number
of equal-sized subgroups. Interval matching is really another nonparametric regression, called
"partitioned regression".
More specifically, the ATE for the jth interval or block Bj is estimated by

ÂTE_j = Σ_{i∈Bj} Yi Ti / Σ_{i∈Bj} Ti − Σ_{i∈Bj} Yi (1 − Ti) / Σ_{i∈Bj} (1 − Ti),

and the ATE is estimated by

ÂTE = Σ_{j=1}^J ÂTE_j (nj/n),

where nj is the number of individuals in interval j.
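A simple version of the blocking estimator can be sketched as follows (hypothetical names;
for simplicity the blocks are quantile groups of the estimated propensity score over the whole
sample, and blocks without both treated and control units are skipped):

import numpy as np

def ate_blocking(y, t, pscore, n_blocks=5):
    # Subclassification on the estimated propensity score: within each block,
    # take the treated-minus-control mean difference and weight it by the
    # block's share of the sample.
    edges = np.quantile(pscore, np.linspace(0, 1, n_blocks + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ate_hat = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        block = (pscore > lo) & (pscore <= hi)
        if t[block].sum() == 0 or (1 - t[block]).sum() == 0:
            continue  # no overlap in this block
        diff = y[block & (t == 1)].mean() - y[block & (t == 0)].mean()
        ate_hat += diff * block.mean()
    return ate_hat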


The underlying goal of all matching estimators is to transform the data in such a way that
they can be analyzed as if they were balanced with respect to the treatment effect of interest.
In the case of ATT estimation, we say the data are balanced if

Pn(Xi|Ti = 1) = Pn(Xi|Ti = 0)

where Pn(Xi|Ti = 1) is the empirical distribution of X for the treatment group and Pn(Xi|Ti = 0)
is the empirical distribution of X for the matched dataset, i.e., the set of control cases that
are matched to any of the treatment cases. The recent matching literature has developed different
algorithms to achieve "balance" as much as possible.
At the minimum, we should check whether the propensity scores are balanced across the
treatment group and the matched control group.

9.7.4 Inverse Probability Weighting


An alternative use of the propensity score is inverse probability weighting. Consider the
following expectation:

E[ T Y / p(X) ] = E[ T Y(1) / p(X) ] = E{ E[ T Y(1)/p(X) | X ] }
               = E{ E[ T/p(X) | X ] E(Y(1)|X) }
               = E{ E(Y(1)|X) } = E[Y(1)].

Similarly,

E[ (1 − T) Y / (1 − p(X)) ] = E[Y(0)].

Therefore

ATE = E[ T Y / p(X) ] − E[ (1 − T) Y / (1 − p(X)) ].

This suggests the following estimator for ATE, assuming the propensity score function is
known:

ÃTE = (1/n) Σ_{i=1}^n [ Ti Yi / p(Xi) − (1 − Ti) Yi / (1 − p(Xi)) ]
    = (1/n) Σ_{i: Ti=1} λi Yi − (1/n) Σ_{i: Ti=0} λi Yi

where

λi = Ti/p(Xi) + (1 − Ti)/(1 − p(Xi)) =  1/p(Xi)         if Ti = 1,
                                        1/(1 − p(Xi))   if Ti = 0.

ÃTE is the difference between two weighted averages.

Remark 80 Let us compare the IPW estimator with the regression-based estimator:

ÂTE = (1/n) Σ_{i=1}^n [m̂1(Xi) − m̂0(Xi)],

where
m̂1(Xi) = Ê[Yi|Ti = 1, Xi],  m̂0(Xi) = Ê[Yi|Ti = 0, Xi]
and Ê(·|·) stands for an estimator of the conditional mean E(·|·). Note that

E[Yi|Ti = 1, Xi] = E[Yi Ti|Xi] / E[Ti|Xi]  and  E[Yi|Ti = 0, Xi] = E[Yi (1 − Ti)|Xi] / E[1 − Ti|Xi].

If we estimate E[Yi|Ti = 1, Xi] by Ê[Yi Ti|Xi]/Ê[Ti|Xi] and E[Yi|Ti = 0, Xi] by
Ê[Yi (1 − Ti)|Xi]/Ê[1 − Ti|Xi], then

ÂTE = (1/n) Σ_{i=1}^n [ Ê[Ti Yi|Xi]/p̂(Xi) − Ê[(1 − Ti) Yi|Xi]/(1 − p̂(Xi)) ].

This estimator is close to ÃTE, which is

(1/n) Σ_{i=1}^n [ Ti Yi/p̂(Xi) − (1 − Ti) Yi/(1 − p̂(Xi)) ]

when the estimated propensity score is plugged in. The difference appears to be whether
Ê[Ti Yi|Xi] or Ti Yi is used.
Inverse probability weighting was proposed by Horvitz and Thompson (1952) in a some-
what different setting. One problem in practice with this estimator is that the weights do not
necessarily add up to 1. We could modify the estimator by normalizing the weights so that they
add up to one:

ÃTE_renorm = Σ_{i=1}^n [Ti Yi/p(Xi)] / Σ_{i=1}^n [Ti/p(Xi)] − Σ_{i=1}^n [(1 − Ti) Yi/(1 − p(Xi))] / Σ_{i=1}^n [(1 − Ti)/(1 − p(Xi))].
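A Python sketch of the IPW estimator and its renormalized (weights-sum-to-one) version,
taking the propensity score as given (function name hypothetical):

import numpy as np

def ate_ipw(y, t, pscore, normalize=True):
    # Inverse-probability-weighting estimator of ATE; with normalize=True the
    # weights within each arm are rescaled to sum to one.
    w1 = t / pscore
    w0 = (1 - t) / (1 - pscore)
    if normalize:
        return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
    return np.mean(w1 * y) - np.mean(w0 * y)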

Example 81 To provide some intuition behind inverse probability weighting, consider the
following hypothetical data set with a binary X and binary treatment T:

 i       1    2    3    4    5    6    7    8    9    10
 Xi      0    0    0    0    0    1    1    1    1    1
 Ti      0    1    1    1    1    0    0    1    1    1
 Yi      3    4    5    1    2    6    0    3    9    7
 p(Xi)  4/5  4/5  4/5  4/5  4/5  3/5  3/5  3/5  3/5  3/5

In order to avoid small-sample problems, we can think of each i as representing a large number
of individuals, say 10,000 individuals.
We have

ÃTE = (1/10) Σ_{i=1}^{10} [ Ti Yi / p(Xi) − (1 − Ti) Yi / (1 − p(Xi)) ]
    = (1/10) [ 4/(4/5) + 5/(4/5) + 1/(4/5) + 2/(4/5) + 3/(3/5) + 9/(3/5) + 7/(3/5)
               − 3/(1 − 4/5) − 6/(1 − 3/5) − 0/(1 − 3/5) ] = 5/3.

Note that ÃTE can be rewritten as

ÃTE = (1/10) [ ( 4/(4/5) + 5/(4/5) + 1/(4/5) + 2/(4/5) − 3/(1 − 4/5) )
               + ( 3/(3/5) + 9/(3/5) + 7/(3/5) − 6/(1 − 3/5) − 0/(1 − 3/5) ) ] = 5/3.

Consider as an example the subsample with X = 0:

 i    1   2   3   4   5
 Xi   0   0   0   0   0
 Ti   0   1   1   1   1
 Yi   3   4   5   1   2

For i = 1, we observe Yi(0) = 3. We ask: what would have happened to individuals 2, 3, 4, 5 had
they chosen T = 0? Because of the randomization within the subsample, we expect each of Yi(0) for
i = 2, 3, 4, 5 to be 3. So individual 1 can be regarded as representing all five individuals in the
counterfactual world where everyone chooses the control state. Similarly, each of the individuals
i = 2, 3, 4, 5 can be regarded as representing 1.25 individuals in the counterfactual world where
everyone chooses the treatment state. We have effectively generated a pseudo-subsample with
10 individuals:

 i    1   1′   1′′   1′′′   1′′′′   2′+3′+4′+5′       2   3   4   5
 Ti   0   0    0     0      0       1                 1   1   1   1
 Yi   3   3    3     3      3       (4+5+1+2)/4       4   5   1   2

ATE(0) is then given by

ATE(0) = [ (4 + 5 + 1 + 2)/4 + 4 + 5 + 1 + 2 ] / 5 − 3 = 0.
Another way to estimate ATE is to take an average of ATE(x) for x = 0 and 1, where

ATE(0) = (4 + 5 + 1 + 2)/4 − 3 = 0,
ATE(1) = (3 + 9 + 7)/3 − (6 + 0)/2 = 10/3,

leading to

ÂTE = (1/2) × 0 + (1/2) × (10/3) = 5/3.

Clearly the two estimators are the same. This is not a coincidence! We can prove this rigor-
ously.

In practice, we do not know the propensity score and have to use an estimated version
p̂(Xi), leading to

ÃTE_renorm = (1/n) Σ_{i: Ti=1} λ̂i Yi − (1/n) Σ_{i: Ti=0} λ̂i Yi

where

λ̂i =  n [p̂(Xi)]⁻¹ / Σ_{j: Tj=1} [p̂(Xj)]⁻¹            if Ti = 1,
      n [1 − p̂(Xi)]⁻¹ / Σ_{j: Tj=0} [1 − p̂(Xj)]⁻¹      if Ti = 0.

In a completely randomized experiment, p̂(Xi) is a constant, so λ̂i = (n/n1)^Ti (n/n0)^(1−Ti), and
ÃTE_renorm is then equal to Ȳ|treatment − Ȳ|control.
Now consider the ATT. We can construct a weighting estimator for this quantity as well,
building upon our approach for the ATE. First, note that

ATT = ∫ ATE(x) fX(x|T = 1) dx
    = ∫ [m1(x) − m0(x)] fX(x|T = 1) dx
    = ∫ [m1(x) − m0(x)] fX(x) P(T = 1|X = x) / [∫ fX(v) P(T = 1|X = v) dv] dx
    = ∫ [m1(x) − m0(x)] fX(x) p(x) / [∫ fX(v) p(v) dv] dx
    = E{[m1(X) − m0(X)] p(X)} / E p(X).

By the LIE, we have

ATT = E{[Y(1) − Y(0)] p(X)} / E p(X)
    = E{ [ Y T/p(X) − Y(1 − T)/(1 − p(X)) ] p(X) } / E p(X).

So we can estimate ATT by

ÃTT = { (1/n) Σ_{i=1}^n [ Yi Ti/p(Xi) − Yi(1 − Ti)/(1 − p(Xi)) ] p(Xi) } / { (1/n) Σ_{i=1}^n p(Xi) }.

9.7.5 Mixed Estimators


The current best practice is to combine different methods.
1. Blocking + regression: we use regression locally on blocks with similar covariate
distributions for treatment and control units (for example, blocks defined by the estimated
propensity score). This method affords more flexibility if we allow the regressions to differ
across blocks. For this method, we have to select the number of blocks. Most researchers
use 5 blocks, but the number of blocks can be made data dependent. See Imbens and Rubin
(2015, textbook, Ch. 13.5).
2. Regression + matching: one-to-one (or one-to-many) covariate matching
with replacement plus regression adjustments within matched pairs.

9.8 Doubly Robust Estimator


Counterfactuals:
Each individual has two potential outcomes, Yi(1) and Yi(0):
Yi(0): the outcome the subject would have if s/he received no treatment;
Yi(1): the outcome the subject would have if s/he received treatment.
The probability distribution of Y(0) represents how the outcomes in the population would
turn out if everyone received no treatment, with mean EY(0). Similarly, the probability dis-
tribution of Y(1) represents how the outcomes in the population would turn out if everyone
received treatment, with mean EY(1). So the ATE is

ATE = E[Y(1) − Y(0)] = EY(1) − EY(0).

The problem is that we do not observe both Yi(1) and Yi(0). Instead, we only observe

Yi = Ti Yi(1) + (1 − Ti) Yi(0).

Under the unconfoundedness assumption

Ti ⊥ (Yi(1), Yi(0))|Xi,

there are two ways to estimate the ATE.

1. Regression-based approach. Note that

ATE = E[ATE(X)] = E{E(Y|T = 1, X) − E(Y|T = 0, X)}.

The regression-based approach assumes a parametric model for the outcome regression
E(Y|T, X). To estimate the ATE, we fit the model and average the resulting estimate of

E(Y|T = 1, X) − E(Y|T = 0, X).

For example, we may assume

E(Yi|Ti, Xi) = a + Ti b + Xi c + (Ti · Xi) d



and estimate the ATE by

ÂTE_R = b̂_OLS + X̄ d̂_OLS

where b̂_OLS and d̂_OLS are the OLS estimators of b and d obtained by regressing Y on T,
X and T · X.
ÂTE_R is a consistent estimator of the ATE only if the model for E(Yi|Ti, Xi) is correctly
specified.

2. Propensity score approach (or inverse probability weighting approach):

ÂTE_IPW = (1/n) Σ_{i=1}^n Ti Yi / p(Xi, θ̂) − (1/n) Σ_{i=1}^n (1 − Ti) Yi / (1 − p(Xi, θ̂)).

ÂTE_IPW is a consistent estimator of the ATE only if p(Xi, θ) is correctly specified (and
hence p(Xi, θ̂) is a consistent estimator of the true propensity score).

It is possible to combine the two estimators so that the new estimator is less subject to
model misspecification. Consider

ÂTE_DR = (1/n) Σ_{i=1}^n [ m1(Xi, α̂1) + Ti (Yi − m1(Xi, α̂1)) / p(Xi, θ̂) ]
       − (1/n) Σ_{i=1}^n [ m0(Xi, α̂0) + (1 − Ti)(Yi − m0(Xi, α̂0)) / (1 − p(Xi, θ̂)) ]

where m0(X, α0) and m1(X, α1) are the postulated models for the true regressions E(Y|T =
0, X) and E(Y|T = 1, X) (fitted by OLS). ÂTE_DR may be viewed as taking the regression
estimator and "augmenting" it with some adjustment.
We can rewrite ÂTE_DR as

ÂTE_DR = (1/n) Σ_{i=1}^n [ Ti Yi / p(Xi, θ̂) − (Ti − p(Xi, θ̂)) / p(Xi, θ̂) · m1(Xi, α̂1) ]
       − (1/n) Σ_{i=1}^n [ (1 − Ti) Yi / (1 − p(Xi, θ̂)) + (Ti − p(Xi, θ̂)) / (1 − p(Xi, θ̂)) · m0(Xi, α̂0) ],

so ÂTE_DR may also be viewed as taking the IPW estimator and "augmenting" it with some
adjustment.
Let

μ̂1,DR = (1/n) Σ_{i=1}^n [ Ti Yi / p(Xi, θ̂) − (Ti − p(Xi, θ̂)) / p(Xi, θ̂) · m1(Xi, α̂1) ],

μ̂0,DR = (1/n) Σ_{i=1}^n [ (1 − Ti) Yi / (1 − p(Xi, θ̂)) + (Ti − p(Xi, θ̂)) / (1 − p(Xi, θ̂)) · m0(Xi, α̂0) ].

What is the population object behind μ̂1,DR? It should be

μ1,DR = E[ m1(Xi, α1) + Ti (Yi − m1(Xi, α1)) / p(Xi, θ) ]
      = E[ m1(Xi, α1) + Ti (Yi(1) − m1(Xi, α1)) / p(Xi, θ) ]
      = E[ Yi(1) + (Ti − p(Xi, θ)) (Yi(1) − m1(Xi, α1)) / p(Xi, θ) ]
      = EYi(1) + E[ (Ti − p(Xi, θ)) (Yi(1) − m1(Xi, α1)) / p(Xi, θ) ].

That is, μ̂1,DR estimates μ1,DR for any specification of p(Xi, θ) and m1(Xi, α1), which may not
equal the true propensity score function or the true conditional mean function.
For μ1,DR to equal EYi(1), we need the second term to be zero. Consider the following two
scenarios.

1. p(Xi, θ) is correctly specified while m1(Xi, α1) is not.

In this case,

E[ (Ti − p(Xi, θ)) / p(Xi, θ) · (Yi(1) − m1(Xi, α1)) ]
= E{ E[ (Ti − p(Xi, θ)) / p(Xi, θ) · (Yi(1) − m1(Xi, α1)) | Yi(1), Xi ] }
= E{ E[ (Ti − p(Xi, θ)) / p(Xi, θ) | Yi(1), Xi ] · (Yi(1) − m1(Xi, α1)) }
= 0.

So as long as p(Xi, θ) is correctly specified, even if m1(Xi, α1) is not, μ̂1,DR is a consistent
estimator of EYi(1). Similarly, μ̂0,DR is a consistent estimator of EYi(0). Therefore,
μ̂1,DR − μ̂0,DR is a consistent estimator of the ATE.

2. m1(Xi, α1) is correctly specified while p(Xi, θ) is not.

In this case,

E[ (Ti − p(Xi, θ)) / p(Xi, θ) · (Yi(1) − m1(Xi, α1)) ]
= E{ E[ (Ti − p(Xi, θ)) / p(Xi, θ) · (Yi(1) − m1(Xi, α1)) | Ti, Xi ] }
= E{ (Ti − p(Xi, θ)) / p(Xi, θ) · E[ Yi(1) − m1(Xi, α1) | Ti, Xi ] }
= E{ (Ti − p(Xi, θ)) / p(Xi, θ) · [ E(Yi(1)|Xi) − m1(Xi, α1) ] }
= 0,

where the last line follows because m1(Xi, α1) = E(Yi|Ti = 1, Xi) = E(Yi(1)|Ti =
1, Xi) = E(Yi(1)|Xi).

So, as long as m1(Xi, α1) is correctly specified, even if p(Xi, θ) is not, μ̂1,DR is a consistent
estimator of EYi(1). Similarly, μ̂0,DR is a consistent estimator of EYi(0). Therefore,
μ̂1,DR − μ̂0,DR is a consistent estimator of the ATE.

To sum up, μ̂1,DR − μ̂0,DR is a consistent estimator of the ATE if either the propensity score
function or the conditional mean function is correctly specified. This property of μ̂1,DR − μ̂0,DR
is referred to as double robustness. Double robustness offers some protection against
model misspecification.
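A minimal Python sketch of the doubly robust (augmented IPW) estimator of the ATE, taking
the fitted outcome regressions and propensity score as inputs (names hypothetical):

import numpy as np

def ate_doubly_robust(y, t, pscore, m1_hat, m0_hat):
    # Augmented IPW estimator: consistent if either the propensity score model
    # or the outcome regressions are correctly specified.
    mu1 = np.mean(m1_hat + t * (y - m1_hat) / pscore)
    mu0 = np.mean(m0_hat + (1 - t) * (y - m0_hat) / (1 - pscore))
    return mu1 - mu0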
For the ATT estimation, we may use

ÂTT_DR = Σ_{i=1}^n Yi Ti / Σ_{i=1}^n Ti
       − { Σ_{i=1}^n Ti m0(Xi, α̂0) / Σ_{i=1}^n Ti
           + (1 / Σ_{i=1}^n Ti) Σ_{i=1}^n (1 − Ti) [Yi − m0(Xi, α̂0)] p(Xi, θ̂) / (1 − p(Xi, θ̂)) }.

It suffices to show that the second term in the above difference converges to E(Yi(0)|Ti = 1).
The limit should be

A = E[Ti m0(Xi, α0)] / ETi + (1/ETi) E{ (1 − Ti) [Yi − m0(Xi, α0)] p(Xi, θ) / (1 − p(Xi, θ)) },

where

(1/ETi) E{ (1 − Ti) [Yi − m0(Xi, α0)] p(Xi, θ) / (1 − p(Xi, θ)) }
= (1/ETi) E{ (1 − Ti) [Yi(0) − m0(Xi, α0)] p(Xi, θ) / (1 − p(Xi, θ)) }
= (1/ETi) E{ (1 − Ti) [Yi(0) − m0(Xi, α0)] [p(Xi, θ) − Ti] / (1 − p(Xi, θ)) }
= (1/ETi) E{ (1 − Ti) [Yi(0) − m0(Xi, α0)] [p(Xi, θ) − 1 − (Ti − 1)] / (1 − p(Xi, θ)) }
= (1/ETi) { −E[(1 − Ti)(Yi(0) − m0(Xi, α0))] + E[ (1 − Ti)(Yi(0) − m0(Xi, α0)) / (1 − p(Xi, θ)) ] }
= (1/ETi) { −E[(1 − Ti)(Yi(0) − m0(Xi, α0))]
            + E[ {1 − p(Xi, θ) − [Ti − p(Xi, θ)]} (Yi(0) − m0(Xi, α0)) / (1 − p(Xi, θ)) ] }
= E{Ti [Yi(0) − m0(Xi, α0)]} / ETi − (1/ETi) E{ [Ti − p(Xi, θ)] (Yi(0) − m0(Xi, α0)) / (1 − p(Xi, θ)) }.

So

A = E[Ti Yi(0)] / ETi − (1/ETi) E{ [Ti − p(Xi, θ)] (Yi(0) − m0(Xi, α0)) / (1 − p(Xi, θ)) }
  = E[Yi(0)|Ti = 1] − (1/ETi) E{ [Ti − p(Xi, θ)] (Yi(0) − m0(Xi, α0)) / (1 − p(Xi, θ)) }.

For the same reason as above, the second term in A is zero as long as either p(Xi, θ) or
m0(Xi, α0) is correctly specified. In this case, A = E[Yi(0)|Ti = 1], and ÂTT_DR converges to
ATT.
Can you design a doubly robust estimator for ATC?
