Simple Estimators for Binary Choice Models with Endogenous Regressors

(1)

Simple Estimators for Binary Choice Models With

Endogenous Regressors

Yingying Dong and Arthur Lewbel

University of California Irvine and Boston College

Revised February 2012

Abstract

This paper provides simple estimators for binary choice models with endogenous or mis-measured regressors. Unlike control function methods, which are generally only valid when endogenous regressors are continuous, the estimators proposed here can be used with limited, censored, continuous, or discrete endogenous regressors, and they also allow for latent errors having heteroskedasticity of unknown form, including random coef cients. The variants of special regressor based estimators we provide are numerically trivial to implement. We illus-trate these methods with an empirical application estimating migration probabilities within the US.

JEL codes: C25, C26.

Keywords: Binary choice, Binomial response, Endogeneity, Measurement error, Heteroskedas-ticity, Discrete endogenous regressor, Censored regressor, Random coef cients, Identi cation, Latent variable model.

We would like to thank Jeff Wooldridge for helpful comments. Corresponding Author: Arthur Lewbel, Dept of Economics, Boston College, 140 Commonwealth Ave., Chestnut Hill, MA, 02467, USA. (617)-552-3678, [email protected], http://www2.bc.edu/~lewbel/

(2)

1 Introduction

This paper describes numerically very simple estimators that can be used to estimate binary choice (bi-nomial response) models when some regressors are endogenous or mismeasured, and when latent errors can be heteroskedastic and correlated with regressors. These estimators have some signi cant advantages relative to leading alternatives such as maximum likelihood, control functions, and instrumental variables linear probability models. The model and associated estimators also allow for latent errors having het-eroskedasticity of unknown form, including random coef cients on most of the regressors.

Consider a binary choice model D _D I X0 _C_" _{0 , where} _D _{is an observed dummy variable that}

equals zero or one, X is a vector of observed regressors, is a vector of coef cients to be estimated,

" is an unobserved error, and I is the indicator function that equals one if its argument is true and zero otherwise. The special case of a probit model has" _s N.0;1/, while for logit"has a logistic distribution. The initial goal is to estimate , but ultimately we are interested in choice probabilities and the marginal effects of X, looking at how the probability that D equals one changes whenX changes.

Suppose also that some elements ofX are endogenous or mismeasured, and so may be correlated with

". In addition, the latent error term" may be heteroskedastic (e.g., some regressors could have random coef cients) and has an unknown distribution. LetZbe a vector of instrumental variables that are uncorre-lated with". There are three common methods for estimating such models: maximum likelihood, control functions, and linear probability models. We now brie y summarize each, noting that each method has some serious drawbacks, and we then discuss the relative advantage of this paper's alternative approach based on Lewbel's (2000) special regressor estimator. A more complete comparison of these estimators, including the special regressor method, is provided in Lewbel, Dong, and Yang (2012).

One method for estimation is maximum likelihood. Maximum likelihood estimation requires a com-plete parametric speci cation of how each endogenous regressor depends on Zand on errors. Letedenote the set of errors in the required equations describing how each endogenous regressor depends on Z. In addition to parameterizing these equations, maximum likelihood also requires a complete parametric spec-i catspec-ion of the jospec-int dspec-istrspec-ibutspec-ion of e and" conditional upon Z. One drawback of maximum likelihood

(3)

is the dif culty in correctly specifying all this information. A second problem is that the resulting joint likelihood function associated with binary choice and endogenous regressors will often have numerical dif culties associated with estimating nuisance parameters such as the covariances betweeneand".

A second type of estimator for binary choice with endogenous regressors uses control functions. These use methodology that can be traced back at least to Heckman (1976) and Heckman and Robb (1985), and for binary choice with endogenous regressors can range in complexity from the simple ivprobit command in Stata (for a model like that of Rivers and Vuong (1988) and Blundell and Smith (1989)), to Blundell and Powell's (2003, 2004) estimators with multiple nonparametric components. Control function estimators are typically consistent onlywhen the endogenous regressors are continuously distributed (because one cannot otherwise estimate the latent errore), and so should not be used when the endogenous regressors are discrete or limited. Also, like maximum likelihood, control function estimators require models of the endogenous regressors as functions of Z andeto be correctly speci ed. In addition, control functions do not permit many types of heteroskedasticity, and can suffer from numerical problems similar to those of maximum likelihood.

A third approach to dealing with endogenous regressors is to estimate an instrumental variables linear probability model, that is, linearly regress D on X using two stage least squares with instruments Z. However, despite its simplicity and popularity, this linear probability model does not nest standard logit or probit models as special cases, is generally inconsistent with economic theory for binary choice, and can easily generate silly results such as tted choice probabilites that are negative or greater than one. Additional drawbacks of the linear probability model are documented in Lewbel, Dong, and Yang (2012). One reason for the popularity of the linear probability model, despite its serious aws, is that for true linear regressions, two stage least squares has many desirable properties. In linear regression, two stage least squares does not require a correct speci cation, or indeed any speci cation, of models for the endogenous regressors. One might interpret the rst stage of two stage least squares as a model of the endogenous regressors, but unlike maximum likelihood or control function based estimators, linear two stage least squares does not require the errors in the rst stage regressions to satisfy any of the properties

(4)

of a correctly speci ed model. Linear two stage least squares only requires that the instruments Z be correlated with regressors and uncorrelated with errors. Linear two stage least squares also allows for general forms of heteroskedasticity. Special regressor estimators possesses these desirable properties of linear two stage least squares, but without the drawbacks of the linear probability model.

This paper provides a simpli ed version of Lewbel's (2000) special regressor estimator. It overcomes all of the above listed drawbacks of linear probability models, control functions, and maximum likelihood. The special regressor based estimator consistently estimates , nests logit and probit as special cases, al-lows for general and unknown forms of heteroskedasticity (including, e.g., random coef cients), does not require correctly speci ed models of the endogenous regressors, does not require endogenous regressors to be continuously distributed (e.g., permitting censored or discrete endogenous regressors), and does not suffer from computational convergence dif culties because it does not require numerical searches.

The price to be paid for these advantages is that the special regressor estimator requires one exogenous regressor to be conditionally independent of ", appear additively to" in the model, and be conditionally

continuously distributed with a large support (though, as we discuss later, the support does not need to be as large as the rst papers in this literature suggest). Call this special regressor V. Only one special regressor is required, no matter how many endogenous regressors appear in the model.

LetSdenote the vector consisting of all the instruments and all the regressors other thanV. A dif culty in implementing Lewbel's (2000) special regressor estimator is that it requires an estimate of the density of V, conditional upon S. In this paper we propose simple semiparametric speci cations of this density, thereby yielding special regressor estimators that are numerically very easy and practical to implement.

Using a sample of individuals in the labor force, we empirically illustrate the special regressor method by applying our estimator to a model of migration. Speci cally, we model the probability of moving from one state to another within the United States. Our special regressor V is an individual's age, which is clearly exogenous and continuous. The model contains both a discrete (binary) and a continuous endoge-nous regressor, namely, home ownership and family income.

(5)

likeli-hood would require fully specifying a joint model of migration, home ownership, and income. Control function methods would also require modeling these variables, and will not in general be feasible here because homeownership is discrete. In contrast to the dif culty of maximum likelihood and the incon-sistency of control function and linear probability based estimates, we show that our simpli ed special regressor based estimator is numerically trivial to implement and provides reasonable estimates for this model.

1.1 Normalization of the Binary Choice Model

LetV be some conveniently chosen exogenous regressor that is known to have a positive coef cient, and now let X be the vector of all the other regressors in the model. We now write the binary choice model as

D_D I.X0 _C_V _C_" ₀_/ ₍₁₎

where the variance of"is some unknown constant 2_", and is a vector of coef cients to be estimated. Models like probit often normalize the variance of the error " to be one, but it is observationally equivalent to instead normalize the positive coef cient of a regressor to be one. Estimation of choice probabilities (propensity scores) and of marginal effects are unaffected by this choice of normalization. For special regressor estimators, equation (1) is more convenient than normalizing the variance of"to one. The economics of some applications provide a natural scaling , e.g., ifD is the decision of a consumer to purchase a good andV is the negative logged price of the good, then having demand curves be downward sloping determines the sign of the coef cient ofV, and in this scalingX0 _C_"_{is the log of the consumer's}

reservation price (that is, their willingness to pay) for the good.

If unknown a priori, the sign of the coef cient of V can be determined as the sign of the estimated average derivativeE[@E.D_jV;X/=@V], or weighted average derivative such as Powell, Stock, and Stoker (1989). The sign of this estimator converges faster than rate rootn, so a rst stage estimation of the sign won't affect the later distribution theory. Even simpler is to just graph the nonparametric regression of D

(6)

1.2 The Special Regressor Method - Literature Review

The special regressor method is characterized by three assumptions. First, it requires additivity between the special regressor V and the model error"(or some function of"). In standard binary choice models

where the latent variable, X0 _C_V_C_"_{, is linear in regressors and an error term, all regressors in the model}

satisfy this assumption. Second, it requires the special regressorV to be conditionally independent of the model error", conditioning on other covariates. If the distribution of" is independent of the exogenous

regressors (e.g., is homoskedastic), then any exogenous regressor will satisfy this assumption. Third, the special regressor needs to be continuously distributed with a large support, though this last condition can sometimes be relaxed (see Magnac and Maurin 2007, 2008).

The special regressor method has been employed in a wide variety of limited dependent variable models including binary, ordered, and multinomial choice as well as censored regression, selection and treatment models (Lewbel 1998, 2000, 2007a), truncated regression models (Khan and Lewbel 2007), binary panel models with xed effects (Honore and Lewbel 2002, Ai and Gan 2010), dynamic choice models (Heckman and Navarro 2007, Abbring and Heckman 2007), contingent valuation models (Lew-bel, Linton, and McFadden 2008), market equilibrium models of multinomial discrete choice (Berry and Haile 2009a, 2009b), games with incomplete information (Lewbel and Tang (2011), and a variety of models with (partly) nonseparable errors (Lewbel 2007b, Matzkin 2007, Briesch, Chintagunta, and Matzkin 2009). Additional empirical applications of special regressor methods include Anton, Fernandez Sainz, and Rodriguez-Poo (2002), Cogneau and Maurin (2002), Goux and Maurin (2005), Stewart (2005), Avelino (2006), Pistolesi (2006), Lewbel and Schennach (2007), and Tiwari, Mohnen, Palm, and van der Loeff (2007). Earlier results that can be reinterpreted as special cases of special regressor based identi ca-tion methods include Matzkin (1992, 1994) and Lewbel (1997). Vytlacil and Yildiz (2007) describe their estimator as a control function, but their identi cation of the endogenous regressor coef cient essentially treats the remainder of the latent index as a special regressor. Recent econometric theory involving spe-cial regressor models includes Jacho-Chávez (2009), Khan and Tamer (2010), and Khan and Nekipelov (2010a, 2010b). The methods we propose in this paper to simplify special regressor estimation in binary

(7)

choice models could be readily applied to many of the other applications of the method cited here, such as ordered choice and selection models.

To illustrate how a special regressor works to identify limited dependent variable models, consider the simple modelD _D I. CVC" 0/where"has an unknown mean zero distribution andV is independent

of ". Let F "./and f "./ denote the probability distribution function and the probability density function (respectively) of ", and suppose this distribution has support given by the interval [L;U] .

Suppose we wish to estimate the constant . In this model,E.D jV/DPr. " V/D F ".V/, so by estimating the conditional mean of DgivenV, we estimate the distribution of ", evaluated at

V. Once we know this distribution, we can calculate its mean, which is .

In particular, by the de nition of an expectation, _D E. "/ D RLUV f ".V/dV D

RU

L V @F ".V/ =@V dV D R_LUV @E.D j V/ =@V dV, which shows one way in which could be recovered from an estimate of E.D j V/. Note that this construction requires V to take on

all values in the interval [L;U], since we need to evaluate E.D jV/for all those values of V. This is

the sense in which special regressor estimation requires a large support. However, suppose that V has a smaller support, say the interval [`;u] whereL ` < 0 andU u >0. Then we will only be able to

es-timate_e_D R_`uV @E.Dj V/ =@V dV, which is in general not equal to . But suppose the following

equality between upper and lower tails of f "holds: R_L`V f ".V/dV DR_uUV f ".V/dV. Then

Deand the special regressor method estimator still works even though the support of V is not large

enough relative to the support of _C". This is tail symmetry, which is described in more detail in Magnac

and Maurin (2007). Even when tail symmetry does not hold exactly, the size of bias term _ewill equal the magnitude of the difference between these two integrals, and so the the bias resulting from applying special regressor methods when the support of V is too small will generally be small if the density of"

either has thin tails or is close to symmetric in the tails.

In the more general model D _D I.X0 _C_V _C_" ₀_/_{with instruments}_Z_{, the conditional expectation}

E.D jV;X;Z/will equal the conditional distribution ofX0 C"conditioning on X and Z, evaluated at

(8)

directly estimating that avoids the step of estimating E.D jV;X;Z/, however, this shortcut requires

the conditional density ofV given X and Z. This is the step that we simplify in this paper.

2 Special Regressor Binary Choice With Endogenous Regressors

Assume thatD _D I.X0 _C_V_C_" ₀_/_{and that some or all of the elements of} _X _{are endogenous. Assume}

Z has the standard properties of instruments in linear regression models, i.e., thatE Z X0 _{has rank equal}

to the number of elements of X, and that E.Z"/ _D 0. As usual, Z would include all elements of X that are exogenous, including a constant. The special regressorV is not be included in Z.

Unlike linear models, having E Z X0 _{full rank and} _E_._Z_"/ _D _{0 is not suf cient to identify} _{in the}

binary choice model. But by adding assumptions regarding the special regressor V, Theorem 1 in the Appendix shows how to construct a variableT having the property that T _D X0 _C_e_" _and _E_._Z_e_"/ _D _0.

We will then be able to identify and estimate by a linear two stage least squares regression of T on X

using instruments Z, i.e.,eZ _D Z0_{E Z Z}0 1_{E Z X}0 _and _D _E e_{Z X}0 1_E e_{Z T} _.

De ne S to be the union of all the elements of X and Z, soS is the vector of all the instruments and all of the regressors except for the special regressor V. The additional information that will be required regardingV is a semiparametric model of the formV _D g.U;S/whereU is an error term.

2.1 Simplest Estimator

To make estimation based on Theorem 1 in the appendix simple, a convenient parametric model is chosen here forg. This is given in Corollary 1. We then propose some generalizations that impose less restrictive assumptions on the special regressor while still being numerically simple to implement.

COROLLARY 1: Assume D _D I.X0 _CV _C _" 0_/, E_.Z_"/ _D 0, E_.V_/ _D 0, V _D S0b_CU,

E.U/ _D 0, U _? .S; "/, and U _s f .U/, where f.U/ denotes a mean zero density function having

supportsupp.U/that containssupp. S0b X0 _"/. De neT by

(9)

ThenT _D X0 _C_e_"_where _E_._Z_e_"/ _D_0.

Corollary 1 assumes that the function g is linear in covariates S and an error U. By Theorem 1, other regular parametric models for g could be assumed instead. This particular model is chosen for its simplicity. It can be readily checked that the model in Corollary 1 satis es the assumptions of Theorem 1. The probability density function f could be parametrically or nonparametrically estimated. Any dis-tribution having support equal to the whole real line will automatically satisfy the required support as-sumption in Corollary 1. As noted earlier, this large support asas-sumption could alternatively be replaced by the"tail symmetry assumptions of Magnac and Maurin (2007).

Other than the assumed model for the special regressor V, nothing is required for estimation using Corollary 1 other than what would be needed for a linear two stage least squares regression, speci cally, that E.Z X0_/_{have full rank and} _E_._Z_"/ _D_0.

Based on Corollary 1 we have the following simple estimator. Assume we have data observations Di,

Xi, Zi, andVi. Recall that Si is the vector consisting of all the elements of Xi and Zi. Also note that Xi and Zi should include a constant term.

ESTIMATOR 1

Step 1. Demean Vi. Let bb be the estimated coef cients of S in an ordinary least squares linear regression ofV on S. For each observationi, construct dataUbi D Vi S_i0bb, which are the residuals from this regression.

Step 2. For each i let bfi be a nonparametric density estimator given later by equations (4) or (5). Alternatively, estimate a parametric f usingUbi. For example, if f is normal or otherwise parameterized by its variance as f U _j 2 , then let_b2 _DPn_i_D₁Ub_i2=nand for each observationi de ne bfi D f Ubi jb2 ,

where f is a standard normal (or other) density function.

Step 3. For each observationi construct databTi de ned as bTi D[Di I.Vi 0/]=bfi.

Step 4. Letbbe the estimated coef cients of X in an ordinary linear two stage least squares regression of Tbon X, using instruments Z. It may be necessary to discard outliers in this step. Givenb, choice

(10)

probabilities and marginal effects can be obtained as in Lewbel, Dong, and Yang (2012).

Details regarding nonparametric estimators for the density bfi are for convenience defered until later. Estimator 1 differs from Lewbel (2000) mainly in that it assumes a parametric or semiparametric model for V, while Lewbel (2000) uses a nonparametric conditional density estimator for V. However, Lewbel (2000) is not strictly more general than Estimator A, since Theorem 1 and Corollary 1 allowV to depend on all of the elements of X, including the endogenous regressors, while Lewbel (2000) assumes the conditional density ofV does not depend on endogenous regressors.

This estimator is numerically trivial, requiring no numerical searches and no estimation steps more complicated than linear regressions. It is therefore also fast and easy to obtain standard errors, test sta-tistics, or con dence intervals by an ordinary bootstrap, drawing observations Di, Xi, Zi, and Vi with replacement.

In Estimator 1, nothing constrains the regression of V on Sin the rst step to be linear. For example, if necessary this rst step regression could include squared and cross terms of S. The later steps of the estimator are estimator are unchanged by this generalization.

Also, Estimator 1 can be easily modi ed to allow for more general parametric speci cations of f. In particular, suppose f is any regular continuous density function parameterized by a vector , which we may denote as f .U j /. Then in step 2 we could estimatebby maximizingPni_D1ln f Ubi j , and then

let bfi D f Ubi jb for each observationi. This step is then just a maximum likelihood estimator for .

2.2 Allowing for Heteroskedasticity in

V

All the estimators in this paper allow the model errors " to be heteroskedastic, e.g., X having random coef cients does not violate the assumptions of Theorem 1 or Corollary 1. More generally, the model errors" can have second and higher moments that depend on the regressors in arbitrary, unknown ways. However, the model in the previous section assumes that only the mean of the special regressor V is related to the other covariates S. In this section we provide a more general model forV that allows higher moments ofV to depend onS, yielding a slightly more complicated, but still numerically trivial, estimator.

(11)

In a small abuse of notation, let S2 _{denote the vector that consists of all the elements of} _S_{, and the}

squares and cross products of all the elements ofS.

COROLLARY 2: Assume D _D I.X0 _C _V _C _" ₀_/_, _E_._Z_"/ _D _0, _E_._V_/ _D _0, _U _? _._S_{; "/}_,

V _D S0_b_C _S20_c 1=2_U_, _E_._U_/_D_0,_v_ar_._U_/_D_{1, and}_U _s _f _._U_/_where _f_._U_/_{is a density function that}

has mean zero, variance one, and support supp.U/ that contains supph. S0_b _X0 _"/ _S20_c 1=2i_.

De neT by

T _D[D I.V 0/] S20_c 1=2 ₌_f _._U_/ ₍₃₎

ThenT _D X0 _C_e_"_where _E_._Z_e_"/ _D_0.

Corollary 2 introduces a multiplicative heteroskedastic term, so the errors in theV regression are now

S20_c 1=2_U _{instead of just}_U_{. Corollary 2 follows from Theorem 1 with} _g_._U_;_S_/ _D _S0_b_C_._S20_c_/1=2_U_,

which puts the termS20_c_{into equation (3). As with Corollary 1, the large support assumption for}_U _could

alternatively be replaced by the"tail symmetry assumptions of Magnac and Maurin (2007). The estimator corresponding to Corollary 2 is an immediate generalization of Estimator 1, as follows.

ESTIMATOR 2

Step 1. Demean Vi. Let bb be the estimated coef cients of S in an ordinary least squares linear regression of V on S. For each observationi, construct data Wbi D Vi S_i0bb, which are the residuals of this regression.

Step 2. Let_bc be the estimated coef cients of S2in an ordinary least squares linear regression of Wb2

onS2. For each observationi, construct dataUbi D Si20bc

1=2 _b

Wi.

Step 3. For each i let bfi be a nonparametric density estimator given later by equation (4) or (5), alternatively, de ne bfi D f Ubi where f is a normal or any other distribution that has mean zero and variance one.

Step 4. For each observationi construct databTi de ned as bTi D[Di I.Vi 0/]

h

S20

i bc

1=2i

=bfi.

(12)

In Estimator 2, step 2 comes fromW _D.S20_c_/1=2_U _with_U _? _S_{, so}_{E W}2 _j _S _D _S20_{cE U}2 _D _S20_c_.

The step 2 regression of Wb2_on _S2_{is the same as the regression that would be used for applying White's}

(1980) test for heteroskedasticity in the step 1 regression ofV onS. An easy way to test if Estimator 2 is required instead of Estimator 1 is to perform White's test for heteroskedasticity on the step 1 regression ofV onS. The presence of heteroskedasticity in this regression would then call for Estimator 2.

As with Estimator 1, there is nothing that constrains the functionsS0_b_and_S20_c_{to be linear. One could}

if necessary specify these as higher order polynomial regressions, or for example one could estimateeS20_c

or S20_c 2_{in place of}_S20_c_{everywhere and thereby ensure that the variance estimate is always positive.}

2.3 Convergence Rates and Increasing Ef ciency

By Theorem 1, virtually everywhere in this paper that the term I.V 0/appears, it can be replaced with

M.V/, which is any mean zero distribution function (on the support of V) chosen by the econometrician. In particular, choosing M to be a simple differentiable function likeM.V/_D I .V 1/min.V _C1;2/

(corresponding to a uniform distribution on -1 to 1) can simplify the calculation of limiting distributions and possibly improve the performance of the estimators.

To obtain standard error estimates without bootstrapping, and possibly to improve ef ciency, the pa-rameters in Estimators 1 and 2 can be estimated simultaneously instead of sequentially using GMM. Speci cally, assuming f is parameterized by its variance, the steps comprising Estimator 1 correspond to the following moment conditions:

E S V S0_b _D_0, _Eh _V _S0_b 2 2i _D_0, _E " Z D I.V 0/ f V S0b_j 2 X 0 !# D0

These moments correspond respectively to the regression model ofV, the estimator of the variance ofU, and the transformed instrumental variables special regressor estimator.

If the density of f is parameterized more generally as f .U _j /for some parameter vector , then the moment Eh V S0_b 2 2i _D _{0 could be replaced by} _E _@_ln _{f V} _S0_b_; _=@ _D _{0, which is the}

(13)

The moments corresponding to Estimator 2 are

E S V S0_b _D_0, _Eh_S2 _V _S0_b 2 _S20_c i_D_0, _{E Z} D I.V 0/

f .V S0b_/ S

20_c 1=2 _X0 _D₀

As before, if f is parameterized by parameters in addition to its mean and variance, then one could add the score functions for estimating to this set of moments.

Applying ordinary two step GMM to either of these sets of moments provides estimates of the desired parameters along with nuisance parametersband 2(orbandcfor Estimator 2) that ef ciently combine these estimation steps in the usual way for GMM, and also delivers asymptotic standard errors (possibly replacingI.V 0/withM.V/as above). Alternatively, given the simplicity of the estimators, one could easily obtain standard errors, con dence intervals, or test statistics via bootstrapping.

We do not provide formal limiting distribution theory assumptions here, since the estimator is just GMM. However, a potential concern is that the de nition ofT involves dividing by a density. This could result in T having in nite variance, violating standard GMM limiting distribution theory. As shown by Khan and Tamer (2010), this generally leads to slower than rootn convergence rates, unless the tails of

U are very thick, or the distribution of " is bounded, or Magnac and Maurin (2007) type tail symmetry conditions hold. If these conditions do not hold, then it could be necessary to apply the thick tailed GMM asymptotics of Hill and Renault (2010), or the asymptotics for irregularly identi ed models as described in Khan and Tamer (2010), and Khan and Nekipelov (2010a, 2010b).

A practical implication of this construction of T is that one should watch out for outliers in the nal step regression of Tbon X. In particular, in some contexts it may be desirable to trim the data (that is, remove observationsi whereTbi is extremely large in magnitude) before running the last step regression.

Another implication is that the larger the variance (or other measures of spread such as interquartile range) ofU or V, the better the estimator is likely to perform in practice. This should be borne in mind when choosing V. Lewbel (2000) found that special regressor estimation tended to perform well when the variance ofV was as big or bigger than the variance of X0 _C_"_.

(14)

2.4 More General Special Regresor Models and Estimators

Estimators 1 and 2 require the density ofU, which could be either parametric or nonparametrically esti-mated. One possible nonparametric estimator of f is the standard one dimensional nonparametric kernel density estimator. This consists of replacing step 2 in estimator 1 or step 3 in estimator 2 with

bfi D _nh1 n X j_D1 K Ubi Ubj h ! fori _D1; :::;n (4)

where the kernel function K is a symmetric density function like a standard normal density, and h is a bandwidth. Even with this nonparametric component, bcan be root n consistent and asymptotically normal, based on well known sets of regularity conditions, such as Newey and McFadden (1994), for two-step semiparametric estimation. The estimator forbwill still not require any numerical searches (except possibly a one dimension search for the choice of bandwidth h), so bootstrapping would be entirely practical for estimating con dence intervals, tests, or standard errors, based on, e.g., Chen, Linton, and Van Keilegom (2003) or Escanciano, Jacho-Chávez, and Lewbel (2010).

Instead of choosing a kernel and bandwidth, one could also use the sorted data density estimator of Lewbel and Schennach (2007), which is speci cally designed for use in estimating averages weighted by the inverse of a density, as is the case here. Givennobservations ofUbi, sort these observations from lowest to highest. For each observationUbi, letUb_iC be the value ofUbthat, in the sorted data, comes immediately afterUbi (after removing any ties) and similarly letUb_i be the value that comes immediately beforeUbi. Then the estimator is

bfi D _b 2=n

UC

i Ubi

fori _D1; :::;n (5)

Equation (5) is not a consistent estimator of fi (its probability limit is random, not constant), but given regularity, inverse density weighted averages of the form 1_nPn_i_D₁5i=bfi converge at rate rootn, and our estimators entail averages of this form, e.g., Estimator 1 has 5i D Zi.Di I.Vi 0//. Asymptotic variance formulas are provided in Lewbel and Schennach (2007) and (in more generality) Jacho-Chávez (2009).

(15)

estimator for bfi. It follows from general results in Magnac and Maurin (2004) and Jacho-Chávez (2009) that estimation of will generally be more ef cient using a nonparametric estimator of f than by using the true density, analogous to the more well known result of Hirano, Imbens and Ridder (2003) that weighting by a nonparametrically estimated propensity score is more ef cient than weighting by the true propensity score in treatment effect estimation.

The models presented so far assume thatV depends on covariates only through its location and scale. To allow for more general dependence ofV on covariates, one could replace the assumption thatU _? S; "

in Theorem 1 withU _? S; " j R where R is one or more functions of covariates S. Corollaries 1 and

2 will then still hold replacing the unconditional density f .U/ with the conditional density f .U j R/.

For example, R might equal S0_b_{, or} _S0_b _and _S20_c_{, or} _R _{could equal one or more principal components}

of S. To implement these extensions, one would need to replace bfi D bf .Ui/ in the estimators with

fi D bf Ui j Rbi . For example, we could let bRi D S0bband then de ne bf as a standard kernel estimator of

the conditional density ofU given R.

Finally, it may sometimes be possible to increase ef ciency, or increase the relative support of the special regressor by combining some exogenous covariates to construct a V. For example, suppose the model is D _D I.X0 _C _V₁_C_V₂ _C_" ₀_/_{, where both} _V₁ _and _V₁_C_V₂ _{satisfy the special regressor}

assumptions, i.e., V1 is a special regressor and V2 (which could be discrete or otherwise have limited

support) is exogenous and independent of ". Then we can write down all the moments associated with

estimator 1 or estimator 2 in the previous section treating V as V1 and including V2 in X0 . These

moments will identify and . We can also write down all the moments associated with estimator 1 or 2 based on de ningV asV1CV2 . Then, to increase estimation ef ciency, GMM could be applied to both

sets of moments (those corresponding to either de nition ofV) simultaneously.

Applying GMM just to the set of conditions de ningV asV1CV2 will not work, because they will fail

to identify . However, ifV _DV1CV2 has a suf ciently large support butV DV1possibly does not, then

one could rst obtain an_bby estimating E.D j V1CV2 ;X;Z/using a conditional linear index model

(16)

weighted average derivative estimation of _D E @E.D jV1;V2;X;Z/ =@V2 = @E.D j V1;V2;X;Z/ =@V1

if both V1and V2 are continuously distributed. Then, given a consistentbby one of these methods, we

could constructVb_DV1CV2band useVbin place ofV in this paper's estimators.

3 Empirical Illustration

In this section we illustrate our simple estimators (coded in Stata, available upon request) with an empirical application. Let Di be the probability of an individuali migrates from one state to another in the United States. Let age be the special regressor Vi, because it is exogenously determined, and human capital theory suggests it should appear linearly (or at least monotonically) in a threshold crossing model of the utility of migration. This is because workers migrate in part to maximize their expected lifetime income, and by construction the gains in expected lifetime earnings from any permanent change in wages decline linearly with age. Figure 1 provides strong empirical evidence for this relationship, showing a tted kernel regression of Di on age in our data, using a quartic kernel and bandwidth chosen by cross validation. We also depict the same nonparametric regression cutting the bandwidth in half, to verify that this near linearity is not an artifact of possible oversmoothing. Others have reported similar empirical evidence (See, e.g., Dong 2010 and the references therein) in accordance with the above human capital motivation for migration.

Pre-migration income and home ownership greatly affect the decision of whether to move or not. Both are endogenous regressors in our binary choice model. Maximum likelihood would require an elaborate dynamic speci cation and an extensive amount of current and past information about individuals to completely model their homeownership decision and the determination of their wages and other income jointly with their migration decisions. See, e.g., Kennen and Walker (2011) for an example of a dynamic structural income based model of migration. Control function methods are also not appropriate for this application, because home ownership is discrete, and control functions are generally inconsistent when used with discrete endogenous regressors (see, e.g., Dong and Lewbel 2012).

(17)

20 25 30 35 40 45 50 55 60 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 Age M igr ation pr oba bility

Figure 1 Nonparametric age profile of migration probabilities

20 25 30 35 40 45 50 55 600.1 0.12 0.14 0.16 0.18 0.2 0.22 0.24 cross-validated optimal bandwidth

half the optimal bandwidth

Figure 1: Nonparametric age pro le of migration probabilities

completed education and who were not retired at the time of their interview. This is intended to largely exclude people who are moving to retirement locations. The nal sample has 4,689 observations, consist-ing of 807 migrants and 3882 nonmigrants. We let D = 1 if an individual changes his state of residence during 1991 - 1993, and 0 otherwise. We de ne the special regressor V to be the negative of age, minus its mean (ensuring it has a positive coef cient and mean zero)

Our endogenous regressors are log(income), de ned as the logarithm of family income averaged over 1989 and 1990, and homeowner, a dummy indicating whether one owns a home in 1990. The remaining regressors comprising X, which we take as exogenous, are education (in years), number of children, and dummy indicators for white, disabled, and married. Our instruments Z consist of the exogenous regressors, along with government de ned bene ts received in 1989 and 1990, i.e., the value of food stamps and other welfare bene ts such as Aid to Families with Dependent Children (AFDC), and state median residential property tax rates, computed from the 1990 U.S. Census of Population and Housing and matched to the original PSID data. Government bene ts have been used by others as instruments for household income in wage and labor supply equations, based on their being determined by government formulas rather than by unobserved attributes like ability or drive. Similarly, property tax rates affect

(18)

homeownership costs and hence the decision of whether to own or rent, while being exogenously set by government rules.

Note that although age is exogenously determined, that does not guarantee that age satis es the re-quired special regressor exogeneity assumptions, because age could affect the endogenous regressors in ways that cause a violation of conditional independence (we'd like to thank Jeffrey Wooldridge for point-ing this out). This concern may be partially mitigated by our inclusion of the endogenous regressors in S, and hence in the model forV, since our estimator (unlike the original Lewbel 2000 version) only requires

U, notV, to satisfy conditional independence.

The special regressor formally requires V to have the same or larger support than X0 _C_"_{. As}

discussed earlier, in practice nite sample biases will tend to be small when measures of the empirical spread ofV (standard deviation or interquantile ranges) are comparable to, or larger than, those of X0b_{. In}

our application, the standard deviation of X0b_(usingb_{from estimator 1) is either 16.3 or 12.4 depending}

on the choice of estimator for bf (kernel or sorted density, respectively). These are at least comparable in magnitude to the standard deviation ofV, which is 9.0, though ideally one would wantV to have a larger spread. Moreover, much of this difference in spread is due to a small fraction of outliers in X0b_{. Quantile}

measures of spread are similar, e.g., the difference between the 5th and 95th quantile of V is 30.0, while that of X0b_{is 44.50 or 36.6.}

Table 2 presents the estimated marginal effects of covariates by our two estimators. For comparison, results from standard probit and ivprobit are also presented. Estimates ofUbfrom both of our estimators were somewhat skewed and kurtotic. Normality is rejected by the Jarque-Bera tests. We therefore used nonparametric density estimates for bf.

Columns 1 and 2 of Table 2 are based on Estimator 1, (which assumes U is homoskedastic), using (a) an ordinary kernel density estimator and (b) the sorted data estimator, respectively. The kernel density estimator is given by equation (4). We used a standard Epanechnikov kernel functionK (though the results are not sensitive to the choice of kernel function) with bandwidth parameterh given by Silverman's rule. The sorted data estimator is given by equation (5).

(19)

Columns 3 and 4 of Table 2 are from Estimator 2, using (a) kernel and (b) sorted data density estima-tors, respectively. White's (1980) test on the regression of V on S shows signi cant heteroskedasticity, indicating the more general Estimator 2 is necessary in this case. Recall that S2_{in the heteroskedasticity}

termS20_c_{was de ned to be the vector of all elements of}_S_{and all of their squares and cross products. The}

total number of terms in S2_{is rather high, so for parsimony we only included the squares and cross terms}

of the most relevant regressors of S in the construction of S20_c _{(equivalent to setting the coef cients of}

other elements ofS2_{equal to zero). Note that all of this discussion regarding heteroskedasticity refers only}

to the equation for the special regressor V; all our estimators permit the model error" to have variance

and higher moments that depend onSin arbitrary ways.

Table 2: The estimated migration equation - marginal effects Dependent variable: migration (0/1)

Estimator 1-(a) Estimator 1-(b) Estimator 2-(a) Estimator 2-(b) ivprobit probit age 0.003 (0.001) 0.004 (0.001) 0.002 (0.001) 0.002 (0.0007) -0.0008 (0.001) 0.002 (0.0007) log(income) -0.013 (0.013 ) -0.012 (0.015) -0.026 (0.014) -0.037 (0.014) 0.065 (0.035) -0.009 (0.007) homeowner -0.055 (0.031) -0.050 (0.033) -0.043 (0.030) -0.026 (0.033) -0.330 (0.058) -0.086 (0.013) white 0.017 (0.012) 0.003 (0.012) -0.004 (0.007) -0.003 (0.006) 0.006 (0.014) -0.010 (0.012) disabled -0.165 (0.073) -0.134 (0.066) -0.187 (0.041) -0.205 (0.041) 0.018 (0.040) -0.012 (0.033) education 0.005 (0.002) 0.006 (0.003) 0.003 (0.001) 0.004 (0.001) 0.0002 (0.003) 0.0004 (0.002) married -0.004 (0.011) 0.018 (0.018) 0.050 (0.015) 0.046 (0.015) 0.020 (0.025) -0.006 (0.017) # of children 0.018 (0.006) 0.019 (0.007) -0.006 (0.003) -0.006 (0.003) 0.013 (0.005) 0.010 (0.005)

Note: Bootstrapped standard errors are in the parentheses; *signi cant at the 10% level; **signi cant at the 5% level; ***signi cant at the 1% level.

For comparison with Estimators 1 and 2, Column 5 of Table 2 uses the ivprobit estimator from Stata. Let e1 and e2 respectively denote the errors in linear regressions of log(income) and the homeowner

dummy on the instruments Z. The ivprobit estimator assumes that e1, e2, and the latent binary choice

(20)

this assumption cannot hold for a discrete endogenous variable like our homeowner dummy, because the errors e2 in a linear probability model (which is what the linear regression of homeownership on Z is)

cannot be homoskedastic, and are generally nonnormal. As a result, ivprobit estimates, like other control function estimators, are inconsistent when used with discrete endogenous regressors. In contrast, our proposed Estimators 1 and 2 do not make any assumptions regarding properties of the errors e1 ande2.

The ivprobit estimator also does not allow" to be heteroskedastic. Finally, column 6 in Table 2 reports

ordinary probit estimates, which ignores any regressor endogeneity and possible heteroskedasticity in",

and is provided here as a baseline benchmark.

Our estimators normalize the coef cient of V to be one, so marginal effects are also reported in table 2, using formulas given in Lewbel, Dong, and Yang (2012). We report marginal effects because they have more direct economic relevance than , and because they are directly comparable across speci cations, including probit. The estimated marginal effect of negative ageV is small but statistically signi cant, and is similar across all speci cations except ivprobit. Unlike the other speci cations, ivprobit gives V the wrong sign, inconsistent with the human capital argument that potential wage gains from moving become smaller as one ages.

Log income has a marginally signi cant coef cient in the heteroskedasticity corrected models, that is, Estimator 2. One would expect income to have a signi cant effect on migration. The relatively large standard errors on this variable may be due to weakness in the government de ned bene ts instrument, which for many people is zero. Unlike all the other estimators, the ivprobit estimates have a counterintu-itive poscounterintu-itive and statistically signi cant sign for log income. Probit and Estimator 1 give negative income effects, though small in magnitude compared to Estimator 2.

The endogenous homeowner dummy has a negative sign in all the estimators, consistent with the fact that xed costs of moving are higher if one is a homeowner. The estimated magnitude of this effect is largest for ordinary probit, smallest for ivprobit, and roughly halfway between these two extremes in this paper's estimators. Intuitively, people who buy a home should be those who do not want to move, so homeownership should be negatively correlated with unobserved preference for migration. Ordinary

(21)

pro-bit fails to account for this endogeneity of homeownership on migration and so yields an overestimate of the negative impact of homeownership on migration probabilities, while ivprobit is inconsistent when en-dogenous regressors are discrete, which may be causing ivprobit to overcompensate for this endogeneity. Finally, being disabled signi cantly reduces the probability of migration in all the models except for ivprobit, and education has a small positive effect on migration across the board.

4 Conclusions

Commonly used methods to deal with heteroskedasticity and endogenous regressors in binary choice models are linear probability models, control functions, and maximum likelihood. Each of these types of estimators have some drawbacks. Unlike these other estimators (each of which only has some of the following attractive features), the special regressor based estimators we provide here possess all of the following attributes: They provide consistent estimates of the model coef cients , they nest logit and probit as special cases, they allow for general and unknown forms of heteroskedasticity (including, e.g., random coef cients), they do not require correctly speci ed models of the endogenous regressors, they do not require endogenous regressors to be continuously distributed, and they do not require numerical searches. What special regressor estimators do require are ordinary instruments, and just one exogenous regressor (no matter how many regressors are endogenous) to be conditionally independent of the latent error" and be conditionally continuously distributed with a large support.

In this paper, we provide some variants of the special regressor model that are numerically almost as trivial to implement as linear probability models. We apply our estimators to estimating migration probabilities in the presence of both discrete and continuous endogenous regressors, and illustrate how our special regressor estimators can be implemented in practice. We compare our estimators with the standard probit and ivprobit in this empirical application.

Special regressor methods can be applied in a variety of settings in addition to binary choice. The same models for V and f that are proposed here could be used to simplify these other applications as well.

(22)

5 References

Abbring, J. H. and J. J. Heckman, (2007) "Econometric Evaluation of Social Programs, Part III: Dis-tributional Treatment Effects, Dynamic Treatment Effects, Dynamic Discrete Choice, and General Equi-librium Policy Evaluation," in: J.J. Heckman & E.E. Leamer (ed.), Handbook of Econometrics, edition 1, volume 6B, chapter 72 Elsevier.

Ai, C. and L. Gan, (2010) "An alternative root- consistent estimator for panel data binary choice models" Journal of Econometrics, 157, 93-100

Avelino, R. R. G. (2006), "Estimation of Dynamic Discrete Choice Models with Flexible Correlation in the Unobservables with an Application to Migration within Brazil," unpublished manuscript, University of Chicago.

Ait-Sahalia, Y., P. J. Bickel, and T. M. Stoker (2001), "Goodness-of- t tests for kernel regression with an application to option implied volatilities," Journal of Econometrics, 105, 363-412.

Altonji, J. G. and R. L. Matzkin (2005), "Cross Section and Panel Data Estimators for Nonseparable Models with Endogenous Regressors," Econometrica, 73, 1053-1102.

Anton, A. A., A. Fernandez Sainz, and J. Rodriguez-Poo, (2002), "Semiparametric Estimation of a Duration Model," Oxford Bulletin of Economics and Statistics, 63, 517-533.

Berry, S. T., and P. A. Haile (2009a), "Identi cation in Differentiated Products Markets Using Market Level Data," Unpublished Manuscript.

Berry, S. T., and P. A. Haile (2009b), "Nonparametric Identi cation of Multinomial Choice Demand Models with Heterogeneous Consumers," Unpublished Manuscript.

Blundell R. and J. L. Powell (2003), "Endogeneity in Nonparametric and Semiparametric Regres-sion Models," in Dewatripont, M., L.P. Hansen, and S.J. Turnovsky, eds., Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress, Vol. II (Cambridge University Press).

Blundell, R. W. and J. L. Powell, (2004), "Endogeneity in Semiparametric Binary Response Models," Review of Economic Studies, 71, 655-679.

Blundell, R. W., and Smith, R. J. (1989), "Estimation in a Class of Simultaneous Equation Limited Dependent Variable Models", Review of Economic Studies, 56, 37-58.

(23)

Briesch, R., P. Chintagunta, and R.L. Matzkin (2009) "Nonparametric Discrete Choice Models with Unobserved Heterogeneity," Journal of Business and Economic Statistics, forthcoming.

Chesher, A. (2009), "Excess heterogeneity, endogeneity and index restrictions," Journal of Economet-rics, 152, 37-45.

Chesher, A. (2010), "Instrumental Variable Models for Discrete Outcomes," Econometrica, 78, 575-601.

Cogneau, D. and E. Maurin (2002), "Parental Income and School Attendance in a Low-Income Coun-try: A Semiparametric Analysis," Unpublished Manuscript.

Dong, Y. (2010), "Endogenous Regressor Binary Choice Models without Instruments, with an Appli-cation to Migration," forthcoming, Economics Letters

Escanciano, J. C., D. Jacho-Chávez, and A. Lewbel (2010), "Identi cation and Estimation of Semi-parametric Two Step Models" Boston College Working Paper wp756.

Goux, D. and E. Maurin (2005), "The effect of overcrowded housing on children's performance at school, Journal of Public Economics, 89, 797-819.

Greene, W. H. (2008), Econometric Analysis, 6th edition, Prentice Hall.

Heckman, J. J., (1976) “Simultaneous Equation Models with both Continuous and Discrete Endoge-nous Variables With and Without Structural Shift in the Equations,” in Steven Goldfeld and Richard Quandt (Eds.), Studies in Nonlinear Estimation, Ballinger.

Heckman, J. J., and R. Robb, (1985) “Alternative Methods for Estimating the Impact of Interven-tions,” in James J. Heckman and Burton Singer (Eds.), Longitudinal Analysis of Labor Market Data, Cambridge:Cambridge University Press.

Heckman, J. J. (1990), "Varieties of selection bias, American Economic Review 80, 313–318.

Heckman, J. J. and Navarro, S. (2007), "Dynamic discrete choice and dynamic treatment effects," Journal of Econometrics, 136, 341-396.

Hill, J. B. and E. Renault (2010), "Generalized Method of Moments with Tail Trimming," unpublished manuscript.

Hirano, K., G. W. Imbens and G. Ridder, (2003), "Ef cient Estimation of Average Treatment Effects Using the Estimated Propensity Score," Econometrica, 71, 1161-1189.

(24)

Hoderlein, S. (2009) "Endogenous semiparametric binary choice models with heteroscedasticity," CeMMAP working papers CWP34/09.

Hong H. and E. Tamer (2003), "Endogenous binary choice model with median restrictions," Eco-nomics Letters 80, 219–225.

Horowitz, J. L. (1992), "A Smoothed Maximum Score Estimator for the Binary Response Model," Econometrica, 60, 505-532.

Honore, B. and A. Lewbel, (2002) "Semiparametric Binary Choice Panel Data Models Without Strictly Exogenous Regressors," Econometrica, 70, 2053-2063.

Ichimura, H., and S. Lee (2006): "Characterization of the Asymptotic Distribution of Semiparametric M-estimators," CeMMAP working papers, CWP15/06.

Imbens, G. W. and Newey, W. K. (2009), "Identi cation and Estimation of Triangular Simultaneous Equations Models Without Additivity," Econometrica, 77, 1481–1512.

Jacho-Chávez, D. T., (2009), "Ef ciency Bounds For Semiparametric Estimation Of Inverse Conditional-Density-Weighted Functions," Econometric Theory, 25, 847-855.

Kennen, J. and J. R. Walker (2011), "The Effect of Income on Individual Migration Decisions," Econo-metrica, 79, 211-251.

Khan, S. and A. Lewbel (2007) "Weighted and Two Stage Least Squares Estimation of Semiparametric Truncated Regression Models," Econometric Theory, 23, 309-347.

Khan, S. and E. Tamer (2010), "Irregular Identi cation, Support Conditions, and Inverse Weight Esti-mation," Econometrica, 78, 2021–2042.

Khan, S. and D. Nekipelov (2010a), "Semiparametric Ef ciency in Irregularly Identi ed Models," unpublished working paper.

Khan, S. and D. Nekipelov (2010b), "Information Bounds for Discrete Triangular Systems," unpub-lished working paper.

Lewbel, A. (1997), "Semiparametric Estimation of Location and Other Discrete Choice Moments," Econometric Theory, 13, 32-51.

Lewbel, A. (1998), "Semiparametric Latent Variable Model Estimation With Endogenous or Mismea-sured Regressors," Econometrica, 66, 105–121.

(25)

Lewbel, A. (2000), "Semiparametric Qualitative Response Model Estimation With Unknown Het-eroscedasticity or Instrumental Variables," Journal of Econometrics, 97, 145-177.

Lewbel, A. (2007a), "Endogenous Selection or Treatment Model Estimation," Journal of Economet-rics, 141, 777-806.

Lewbel, A. (2007b), "Modeling Heterogeneity," in Advances in Economics and Econometrics: Theory and Applications, Ninth World Congress (Econometric Society Monographs), Richard Blundell, Whitney K. Newey, and Torsten Persson, editors, Cambridge: Cambridge University Press, Vol. III, Chapter 5, 111-121.

Lewbel, A. (2007c), "Coherence and Completeness of Structural Models Containing a Dummy En-dogenous Variable," International Economic Review, 48, 1379-1392.

Lewbel, A., Dong, Y., and T. Yang (2012), "Why and How to Avoid the Linear Probability Model, and a Simple Alternative," unpublished manuscript, Boston College.

Lewbel, A. and S. Schennach (2007), "A Simple Ordered Data Estimator for Inverse Density Weighted Functions," Journal of Econometrics, 186, 189-211.

Lewbel, A. and X. Tang (2011), "Identi cation and Estimation of Games with Incomplete Information using Excluded Regressors," unpublished manuscript.

Lewbel, A., O. Linton, and D. McFadden (2008), "Estimating Features of a Distribution From Bino-mial Data," Unpublished manuscript.

Magnac, T. and E. Maurin (2007), "Identi cation and Information in Monotone Binary Models," Jour-nal of Econometrics, 139, 76-104.

Magnac, T. and E. Maurin (2008), "Partial Identi cation in Monotone Binary Models: Discrete Re-gressors and Interval Data, Review of Economic Studies, 75, 835-864.

Manski, C. F. (1975), "Maximum Score Estimation of the Stochastic Utility Model of Choice", Journal of Econometrics, 3, 205-228.

Manski, C. F. (1985), "Semiparametric analysis of discrete response: Asymptotic properties of the maximum score estimator," Journal of Econometrics, 27, 313-333.

Manski, C. F. (1988), "Identi cation of Binary Response Models," Journal of the American Statistical Association, 83, 729-738.

(26)

Manski, C. F. (2007), "Partial Identi cation of Counterfactual Choice Probabilities," International Economic Review, 48, 1393–1410.

Matzkin, R.L. (1992), "Nonparametric and Distribution-Free Estimation of the Binary Threshold Crossing and The Binary Choice Models," Econometrica, 60, 239-270.

Matzkin, R.L. (1994) "Restrictions of Economic Theory in Nonparametric Methods," in Handbook of Econometrics, Vol. IV, R.F. Engel and D.L. McFadden, eds, Amsterdam: Elsevier, Ch. 42, 2524-2554.

Matzkin, R. (2007), "Heterogeneous Choice," in Advances in Economics and Econometrics: Theory and Applications, Ninth World Congress (Econometric Society Monographs), Richard Blundell, Whitney K. Newey, and Torsten Persson, editors, Cambridge: Cambridge University Press, Vol. III, Chapter 4, 75-110.

Pistolesi, N. (2006), "The performance at school of young Americans, with individual and family endowments," unpublished manuscript.

Powell, J. L., J. H. Stock, and T. M. Stoker, (1989) "Semiparametric Estimation of Index Coef cients," Econometrica, 57, 1403-1430.

Rivers, D., and Q. H. Vuong (1988), "Limited information estimators and exogeneity tests for simul-taneous probit models," Journal of Econometrics 39, 347–66.

Shaikh, A. and E. Vytlacil (2008), "Endogenous binary choice models with median restrictions: A comment," Economics Letters, 23-28.

Stewart, M. B. (2005), "A comparison of semiparametric estimators for the ordered response model," Computational Statistics and Data Analysis, 49, 555-573.

Tiwari, A. K., P. Mohnen, F. C. Palm, S. S. van der Loeff, (2007), "Financial Constraint and R&D Investment: Evidence from CIS," United Nations University, Maastricht Economic and social Research and training centre on Innovation and Technology (UNU-MERIT) Working Paper 011.

Vytlacil, E. and N. Yildiz (2007), "Dummy Endogenous Variables in Weakly Separable Models," Econometrica, 75, 757-779.

White, H. (1980) "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica, 48, 817-838.

(27)

press.

6 Appendix

LEMMA 1: Assume the distribution of V given S is continuous. Then there exists a function g and a random variableU such that V _D g.U;S/whereU _? S.

PROOF OF LEMMA 1: De neU _D FV_jS.V j S/where FV_jS is the conditional distribution functionV given S. De ne g to be the inverse of the function FVjS, so g is de ned by V D g FVjS.V j S/ ;S .

Then by constructionV _Dg.U;S/whereU ? S. In this constructionU will have a uniform distribution

but one could more generally letU _D ef FV_jS.V j S/ for any strictly monotonic ef to giveU some other

distribution, such as a normal.

Lemma 1 is not new, e.g., it is used in Vytlacil and Yildiz (2007) and Matzkin (2007). It is useful here because Theorem 1 below assumes existence ofgandU withU independent ofS, and the lemma shows that this assumption is made without loss of generality. The variableU can be interpreted as the error term in a modelgfor the variableV. Later we will propose some simple functional forms forg.

Theorem 1 below generalizes Lewbel (2000), showing how to construct a variableT having the prop-erty thatE.Z T/_D E Z X0 _{, so a linear two stage least squares regression of}_T _on_X _{using instruments}

Z yields the desired coef cients . Note this also proves point identi cation of .

THEOREM 1: Assume D _D I.X0 _C_V _C_" ₀_/_, _E_._Z_"/_D_{0, and}_V _D _g_._U_;_S_/_{. Assume}_supp_._X0 _C

"/ supp. V/, E.V/D 0,g is differentiable and strictly monotonically increasing in its rst element,

U _? .S; "/, and U is continuously distributed. Let f.U/ be the probability density function of U. Let M.V/ be any mean zero distribution function on supp.V/ that equals zero and one strictly inside

supp.V/. De neT by T _D D M.V/ f.U/ @g.U;S/ @U (6) ThenT _D X0 _C_e_"_where _E_._Z_e_"/ _D_0.

(28)

taking M.V/D I.V 0/. By the de nition of conditional expectation E.T _j S; "/_D Z supp.UjS;"/ I.D Cg.U;S/ 0/ I.g.U;S/ 0/ f.U/ @g.U;S/ @U f.U j S; "/dU D Z supp.U_jR/ I.D _Cg.U;S/ 0/ I.g.U;S/ 0/ @g.U;S/ @U dU D Z supp.VjR/ I.D _CV 0/ I.V 0/ dV

where the second equality follows from U _? S; " which means f.U/ D f.U j S; "/, and the third

equality uses a change of variables fromU toV. If D 0 then

E.T _j S; "/ _D Z supp.VjR/ I. D V 0/dV _D Z 0 D 1dV D D and if D 0 then E.T _j S; "/_D Z supp.VjR/ I.0 V D /dV _D Z D 0 1dV D D

This proves that E.T _j S; "/_D X0 _C_"_{. De ning}_e_" _D_T _X0 _{we have}

E.Z_e"/ _D E[Z.T X0 _/_]_D _E_[_E_._Z_._T _X0 _/_j _S_{; "/}_]

D E[Z.E.T j S; "/ X0 /]D E.Z"/ D0:

To show the theorem holds for other choices ofM.V/, replaceD M.V/in equation (6) with [D I.V 0/]_C

[I.V 0/ M.V/]. ThenE.T _j S; "/equals the sum of the term given above andR_supp_._V_j_R_/[I.V 0/ M.V/]dV. Applying an integration by parts to this term gives

[I.V 0/ M.V/]Vjsupp.VjR/

Z

supp.V_jR/

@M.V/ @V V dV

The rst term here is zero because M.V/is distribution function that equals zero and one strictly inside the support of V, and the second term is zero because M.V/ is a mean zero distribution function. So

E.T j S; "/is unchanged by replacing I.V 0/with M.V/.

One way in which Theorem 1 generalizes previous results is that Lewbel (2000) used I.V 0/ in

place of M.V/. and we will usually let M.V/ _D I.V 0/. The usefulness of this extension, rst

proposed by Lewbel and Tang (2012), is that taking M.V/ to be a differentiable function can simplify

(29)

Recall thatSconsists of all the elements ofXand Z. As long asV givenSis continuously distributed, the assumption that V _D g.U;S/ with U independent of S holds without loss of generality. This is

because, as shown by Lemma 1, it is always possible to construct a function g and an error termU that satis es this independence assumption. Differentiability of g and continuity of the U distribution both correspond to smoothness of the distribution function ofV. Having E.V/_D0 is not really necessary, but

it simpli esT. Setting the median of V to zero would have the same effect. In practice one could simply recenter V (by demeaning or subtracting off the median) before using it in the model to make this hold. Note that X and Z will generally include a constant term, so recentering V will have no impact on the model.

Having E.Z"/ _D 0 and rank of E Z X0 _{equal to the number of elements of} _{are just the minimal}

conditions that would be required for two stage least squares estimation of a linear model with endogenous regressors, so we maintain those minimal conditions in our nonlinear binary choice model.

The requirement thatU be independent of"is nothing more than an exogeneity assumption regarding

the special regressorV. It says that after one has conditioned on other covariates, the remaining variation inV is unrelated to the binary choice model error".

Finally, the condition regarding the support of V is that the range of possible values of X0 _C_" _lies

in the range of possible values of V, which implies that it is possible forV to be small enough or large enough to drive Dto zero or one. In the case where the support of X0 _C_"_{is not bounded, this becomes}

an identi cation at in nity argument as in Heckman (1990), though it should be noted that consistent estimation of any moment, even a mean, requires observing data over their entire support, and Khan and Tamer (2010) point out that similar requirements apply to standard average treatment effect estimators.

The required support assumption is not in general testable prior to estimation, because it depends on . After estimation ofbone can check whether the values of X0bin the data lie in the range of observed

values of V, but even then, the true supports of the regressors and the support of the latent " are not

known in general. So one may worry about the support condition holding in empirical applications, to which there are a few responses.

First, in theory the support condition is easily satis ed, since e.g., it holds if V contains an additive component like an error term that is normal, t-distributed, or has any other full real line support

(30)

distribu-tion.

Second, as described earlier, the large support assumption can be relaxed and replaced with a tail sym-metry assumption. See Magnac and Maurin (2007) for details. The construction of T and the conclusion of Theorem 1 is unchanged when the support ofV is not as large as that of X0 _C_"_{, provided that this tail}

symmetry condition holds. Moreover, even if tail symmetry does not hold exactly, as described earlier the asymptotic bias in estimation resulting from a violation of the large support assumption will generally be small if the tails of the distribution of"are either thin or close to symmetric.

Third, Lewbel (2000, 2007a) shows that for special regressor based estimators, the nite sample bias in estimation ofbalso tend to be small when the variance or interquantile ranges of V are comparable to or larger than the variance or interquantile ranges of X0 _C_"_{. This makes intuitive sense, since in real}

data what matters is not the hypothetical extreme values that might possibly be seen, but rather the spread of values actually observed in the majority of the sample. Thus in practice one may check measures of the relative spread of the distributions ofV versus X0b_{to get a sense of whether the observed variation in}_V