Estimating Interdependence Across Space, Time and Outcomes in Binary Choice Models Using Pseudo Maximum Likelihood Estimators

(1)

Universität Basel Peter Merian-Weg 6 4052 Basel, Switzerland

Corresponding Author:

Prof. Dr. Aya Kachi Tel: +41 (0) 61 207 28 41 March 2018

Estimating Interdependence Across Space, Time

and Outcomes in Binary Choice Models Using

Pseudo Maximum Likelihood Estimators

WWZ Working Paper 2018/11 Julian Wucherpfennig, Aya Kachi, Nils-Christian Bormann, Philipp Hunziker

A publication of the Center of Business and Economics (WWZ), University of Basel.

(2)

Estimating Interdependence Across Space, Time and

Outcomes in Binary Choice Models Using Pseudo

Maximum Likelihood Estimators

Julian Wucherpfennig1

,

Aya Kachi2

,

Nils-Christian Bormann3

, and

Philipp Hunziker∗4 1

Hertie School of Governance 2

Faculty of Business and Economics, University of Basel 3_{Department of Politics, University of Exeter}

4_{College of Computer and Information Science & Department of Political Science,} Northeastern University

March 30, 2018

*PLEASE DO NOT CITE WITHOUT PERMISSION OF THE AUTHORS*

Binary outcome models are frequently used in Political Science. However, such models have proven particularly difficult in dealing with interdependent data structures, including spatial autocorrelation, temporal autocorrelation, as well as simultaneity aris-ing from endogenous binary regressors. In each of these cases, the primary source of the estimation challenge is the fact that jointly determined error terms in the reduced-form specification are analytically intractable due to a high-dimensional integral. To deal with this problem, simulation approaches have been proposed, but these are compu-tationally intensive and impractical for datasets with thousands of observations. As a way forward, in this paper we demonstrate how to reduce the computational bur-der significantly by (i) introducing analytically tractable pseudo maximum likelihood estimators for latent binary choice models that exhibit interdependence across space, time and/or outcomes, and by (ii) proposing an implementation strategy that increases computational efficiency considerably. Monte-Carlo experiments demonstrate that our estimators perform similarly to existing alternatives in terms of error, but require only a fraction of the computational cost.

∗_{We would like to thank Frederick Boehmke, Lars-Erik Cederman, Robert Franzese Jr., Dennis Quinn,}

Andrea Ruggeri, Emily Schilling, Martin Steinwand, Oliver Westerwinter, Michael Ward and Christopher Zorn for their useful comments.

(3)

1 Introduction

War vs. peace, vote “yes” vs. “no”—binary data are ubiquitous in political science. Thus, it is little surprising that binary choice models—e.g. probit and logit models—are amongst the most common statistical tools for empirical analyses. However, as is true for any other type of data, political scientists are becoming increasingly aware that observational data generated through social and political processes exhibits various forms of interdependence.

Take the simple example of international treaty ratification and compliance. A bur-geoning literature that makes the case that countries are more likely to ratify international treaties, such as the Kyoto protocol to limit greenhouse gas emissions or the Ottawa conven-tion on the ban of landmines, if they expect other countries to do the same. This pattern is often explained by international norms and the logic of public-goods provision (e.g., Nord-haus 1999; Simmons and Elkins 2004). Accordingly, one country’s decision to ratify is (at least in part) dependent on decisions of other countries, and vice-versa. Following a large literature, we refer to this form of interdependence as “spatial interdependence”.

Because such treaties are frequently the result of long-lasting negotiations, a related question pertains to the timing of ratification. Indeed, countries with a greater proclivity are generally expected to ratify sooner (such as Germany which is among the environmental frontrunners, Weidner and Mez (2008)), while an abstaining countries (such as the U.S. in both examples mentioned above) is likely to continue this behavior from one year to the next. Thus, we refer to this type of dependence as “temporal dependence”.

Sceptics of international treaties have argued that in many instances, ratification does not necessarily occur because countries are willing to accept the constraint on the particular forms of behavior that is implied by the treaty (e.g., refrain from using landmines). Instead, countries could ratify if they were to comply even in the absence of the treaty. As such, treaties merely screen, rather than constrain (see e.g., Von Stein 2005). In this case, the causal arrow runs from compliance to ratification, rather than the other way around. Put differently, ratification and compliance are interdependent outcomes, making it difficult to

(4)

evaluate the actual effectiveness of treaties. In this paper, we refer to this pattern as “outcome interdependence”.

All three forms of (inter-)dependence outlined above— across space, time and outcomes— suggest that individual observations violate the basic assumption of conventional statistical tools, including i.i.d. error terms for regression based estimators, or SUTVA for matching techniques. This violation poses a significant challenge when it comes to valid causal in-ference. While existing methods to tackle the challenge are relatively well developed for continuous data, there is a clear gap when it comes to binary data. As we show below, the main problem stems from a mathematically intractable error term distribution which prevents researchers from deriving closed form solutions for appropriate likelihood functions. While some simulation-based techniques exist, these are generally extremely computationally burdensome, even for small datasets, and thus often impracticable for applied researchers. Moreover, different forms of (inter-)dependence have generally been treated in isolation, even though typical political science data is likely to be affected by multiple forms at the same time.

Therefore the primary goals of our paper is to develop a general toolkit for applied researchers studying binary data featuring (inter-)dependencies. We do so with an eye on re-ducing computational burdens significantly in order to accommodate large datasets of several thousand observations by developing a series of pseudo maximum likelihood estimators. Our analytical point of departure is a pseudo maximum likelihood estimator for binary spatial models due to Smirnov (2010), for which the remaining computational burden amounts to inverting an N-dimensional matrix we refer to as the “interdependence multiplier.” We then show that this estimation framework lends itself suitable to accommodate other types of in-terdependence as well, including temporal and outcome inin-terdependence across simultaneous equations. Finally, we further reduce the estimation costs by proposing an implementation strategy that avoids direct matrix inversion, and instead relies on a combination of itera-tive gradient procedures and approximations that yield an estimation algorithm with almost

(5)

linear complexity in N .

The paper proceeds as follows. First, we provide a technical description of the math-ematical problem associated with the three sources of interdependence in binary dependent variables. Then we introduce a pseudo maximum likelihood estimator (PMLE) for spatial models according to Smirnov (2010). Next we derive equivalent pseudo maximum likeli-hood estimators for accommodating dependence across time, and interdependence between outcomes. We then discuss our implementation strategy, followed by an evaluation of our estimators via Monte Carlo simulation. We conclude with a plan for further research.

2 The challenge: The intractable reduced-form error specification

In this section we set out to demonstrate two things. First, we use the binary spatial model to show why fitting models for binary spatial data is challenging. At its core, the problem is that the likelihood function for the binary spatial model involves an analytically intractable integral. Second, we show that this result generalizes to other sources of (inter-)dependence, namely across time and between outcomes. In fact, it turns out that spatial, temporal, and outcome (inter-)dependence all give rise to the same reduced form specification for the latent outcome vector. This result is a double edged sword. On the one hand, it implies that fitting models on binary data featuring any of these types of (inter-)dependencies will be challenging. On the other hand, it suggests that solving the estimation challenge for one type of model will be useful for addressing the other types of (inter-)dependencies as well.

2.1

Spatial interdependence

Suppose one is interested in the following model

y_i∗ = ρ N X j=1 wijyj∗+ xiβ + ui (1) yi =        1 if y∗_i > 0 0 otherwise (2)

(6)

whereas y_i∗ is a continuous latent outcome variable, wij is a spatial lag between unit i and j indicating how closely the two units are connected in a given space (e.g, geographical proximity, membership in the same organizations etc.), xi is a 1 × k vector of covariates with parameter vector β, and ui is a zero-mean iid error term with fixed variance. We call this specification the binary spatial model. Note that in this specification, spatial dependence occurs on the level of the latent (i.e. not observed) outcome y_i∗. This specification follows Franzese et al. (2016), and it is suitable for cases where actors of our interest can observe or know more or less what others’ latent characteristics are, and not only their revealed binary actions.

It is useful to write the latent equation in matrix notation, yielding

y∗_{(N ×1)} = ρWy∗+ Xβ + u, (3) with WN ×N =          0 w12 · · · w1N w21 . .. . .. ... .. . . .. . .. wN −1,N wN 1 · · · wN,N −1 0          . (4)

W is commonly referred to as the spatial weights matrix. Throughout the paper we assume that W is row-standardized. Doing so ensures that the spatial process defined in (3) is stationary as long as |ρ| < 1 (Kelejian and Prucha 2010). Given (3) we can derive the reduced form as

y∗ = (I − ρW)−1Xβ + (I − ρW)−1u = (I − ρW)−1Xβ + v,

(5)

where vector v contains the reduced-form error terms with non-spherical covariance matrix structure due to the multiplier (I − ρW)−1.

(7)

likelihood function assumes the following general form: L(ρ, β|X, y∗) = h N Y i=1 P (yi = 1) iyih_YN i=1 P (yi = 0) i(1−yi) =h N Y i=1 P (yi = 1) iyihYN i=1 1 − P (yi = 1) i(1−yi) , (6)

The main component of the likelihood function – the marginal probability that i takes 1 given X and the parameters – is,

P (y = 1) = P (y_i∗ > 0) = Ph(I − ρW)−1Xβi i+ vi > 0 = Pvi > − h (I − ρW)−1Xβi i = 1 − Pvi ≤ − h (I − ρW)−1Xβi i = 1 − FVi −h(I − ρW)−1Xβi i . (7)

where [·]i indicates the i’th element of vector [·]. FVi(·) is the marginal cdf of random variable

Vi (the reduced form error term for unit i). Therefore, expression FVi

−h(I − ρW)−1Xβi i

is the marginal cdf of Vi evaluated at − h

(I − ρW)−1Xβi i

. In theory, one can derive the marginal cdf of Vi based on the definition of a marginal cdf as follows

FVi(−[(I − ρW) −1 Xβ]i) = Z ∞ −∞ · · · Z ∞ −∞ Z −[(I−ρW)−1Xβ]i −∞ Z ∞ −∞ · · · Z ∞ −∞ fV(s1, · · · , si, · · · , sN)ds1· · · dsi· · · dsN, (8) where fV(s1, · · · , sN) is the joint pdf of the reduced-form error. The estimation challenge arises because evaluating FVi is generally analytically intractable as long as ρ 6= 0. As a

(8)

2.2

Temporal dependence

Consider the following time-series model

y∗_t = Xtβ + γyt−1∗ + ut (9) yt=        1 if y_t∗ > 0 0 otherwise (10)

where the latent outcome exhibits a first-order temporal autoregressive process, governed by the correlation parameter γ with |γ| < 1.1 _{We refer to this model as the binary temporal} autoregressive model. For a discussion of this model in a political science context, see Beck et al. (2001). Note that we can rewrite the model in matrix notation as follows

y∗_{(T ×1)} = Xβ + γTy∗+ u, (11) where T is defined as T =             0 0 0 · · · 0 1 0 0 · · · 0 0 1 0 · · · 0 .. . ... ... . .. ... 0 0 0 · · · 0             (12)

It is evident that this model is mathematically almost equivalent to the binary spatial model from the past section, the sole difference being that now we impose a weights matrix where the first minor diagonal (all the 1’s) maps y∗_t−1 to y_t∗. The reduced form of the autoregressive model is given by

y∗_{(T ×1)} = (I − γT)−1Xβ + (I − γT)−1u. (13)

Given this result, it is clear that this model gives rise to the same difficulties in ML estimation as the binary spatial model.

(9)

2.3

Dependence across outcomes

Finally, consider the following simultaneous equation model with two binary-choice processes,

y_i1∗ = xi1β1+ λyi2∗ + ui1 (14) y_i2∗ = xi2β2+ λyi1∗ + ui2 (15) yil =        1 if y_il∗ > 0 0 otherwise for l = 1, 2, (16)

with interdependence parameters |λ| < 1. We can write this model in the now-familiar matrix form y∗_{(2N ×1)} = Ay∗ + Xβ + u, (17) whereas y∗ = [y∗₁₁, . . . , y_{N 1}∗ , y∗₁₂, . . . , y∗_{N 2}]0, β = [β1, β2]0, and X =    X1 O O X2   . (18)

The weights matrix is now given as

A(2N ×2N )=                 0 0 0 λ 0 0 0 . .. 0 0 ... 0 0 0 0 0 0 λ λ 0 0 0 0 0 0 . .. 0 0 ... 0 0 0 λ 0 0 0                 =          A1 A2 “1st region” “2nd region” A3 A4 “3rd region” “4th region”          . (19)

The reduced form follows as

(10)

Again, it is clear that the structure of this specification is essentially the equal to the one of the binary spatial model, and lead to the same estimation challenge.

2.4

Combining dependencies

So far, we have established that spatial, temporal, and outcome (inter-)dependence all give rise to the same reduced form specification for the latent outcome vector y∗, and are thus all impossible to fit via direct ML estimation. However, the similarity in functional form also means that it is very straightforward to combine different dependency structures, yielding models exhibiting multiple types of dependencies among observations. In the following we give three examples of hybrid models that are potentially useful in applied research.

The first hybrid model we consider is the binary spatio-temporal autoregressive model (STAR), which combines the binary spatial model with the temporal autoregressive binary model, yielding a panel setup (see e.g. Franzese et al. 2016). The binary STAR model is given by

y∗_{(N T ×1)}= Qy∗+ Xβ + u, (21)

where y∗ = [y_.1∗, . . . , y_.T∗ ]0 and y_.t∗ = [y_1t∗, . . . , y∗_{N t}]0. Hence, the cross-sectional y_.t∗ vectors are stacked “on top of each other”. The X matrix is constructed analogously. The weights matrix Q is given by QN T ×N T = ρ             W 0 0 · · · 0 0 W 0 · · · 0 0 0 W · · · 0 .. . ... ... . .. ... 0 0 0 · · · W             + γ             0 0 0 · · · 0 IN 0 0 · · · 0 0 IN 0 · · · 0 .. . ... ... . .. ... 0 0 0 · · · 0             , (22)

where W is the N × N spatial weights matrix and IN is the N × N identity matrix.

(11)

where two related binary choice processes are repeated over time. This model is given by

y∗_{(2N T ×1)}= Qy∗+ Xβ + u, (23)

where y∗ = [y_.11∗ , y_.21∗ . . . , y_.1T∗ , y_.2T∗ ]0 and y∗_.lt= [y∗_1lt, . . . , y∗_{N lT}]0. Hence, the cross-sectional latent outcomes are first stacked by choice type, and then by time period. Here, the weights matrix is block diagonal:

Q2N T ×2N T =          A 0 · · · 0 0 A · · · 0 .. . ... . .. ... 0 0 · · · A          , (24)

with A as defined in the previous section.

Finally, we consider the binary simultaneous equation spatial model, where two related binary outcomes are each spatially interdependent. Here we have

y∗_{(2N ×1)} = Qy∗+ Xβ + u, (25)

with y∗ defined as in the previous section. The weights matrix now assumes the form

Q2N ×2N =                 ρ1W λ 0 0 0 . .. 0 0 0 λ λ 0 0 ρ2W 0 . .. 0 0 0 λ                 (26)

(12)

3 A pseudo maximum likelihood estimator for (inter-)dependent

binary outcomes

Having established that direct ML estimation is infeasible for binary models featuring (inter-)dependencies, it is clear that we require an alternative approach. One option is simulation. Franzese et al. (2016) and Calabrese and Elkink (2014b) provide extensive reviews of the spa-tial probit literature, and useful comparisons of several simulation-based estimation methods like recursive-importance-sampling (RIS) and Bayesian MCMC approaches (see also Cal-abrese and Elkink (2014a) for cases with asymmetric link functions accommodating rare events). Similarly, Beck et al. (2001) discuss a Bayesian estimation strategy for the binary temporal autoregressive model. The main drawback of simulation-based approaches is that they are computationally intensive, and often require tedious hyperparameter tuning (es-pecially in the MCMC case). For this reason, this section introduces a pseudo maximum likelihood (PML) method as a feasible way to overcome technical challenges associated with estimating binary (inter-)dependence models. The method was originally proposed for binary spatial models by (Smirnov 2010), but we show in this section that it is applicable for all three types of (inter)-dependence.

3.1

PMLE for the binary spatial model

Recall the reduced form for the binary spatial model is given by

y∗ = (I − ρW)−1Xβ + (I − ρW)−1u = (I − ρW)−1Xβ + v.

(27)

Denote the spatial multiplier by Z,

(13)

and, by D, an N × N matrix that contains diagonal elements of Z. All off-diagonal elements of D are zero. The spatial multiplier indicates the degree of local and global spillovers of an exogenous shock that unit i receives (Anselin 2003); in other words, zij = ∂yi∗

∂uj, where

zij is the ijth element of Z. The diagonal matrix D indicates “private effects,” borrowing Smirnov’s (2010) term, of exogenous shocks on the individual latent outcomes. The relative effect captured by D is “private” in that it indicates the magnitude of the effect that unit i receives from an exogenous shock that occurred to unit i itself; in other words, di =

∂y∗_i ∂ui.

On the other hand, the off-diagonal elements of Z, i.e. Z − D, represent “aggregate spatial effects” of an exogenous shock. Note that all diagonal elements of Z − D are zero. One could interpret it as an aggregate spillover effects that unit i receives from an exogenous shock through all the other units.

The reduced form can now be re-written as

y∗_{(N ×1)} = ZXβ + (Z − D)u | {z } “Social effects” + Du |{z} “Private effects” , (29)

or, for each unit i,

y∗_i =X j βzijxj + X j [Z − D]ijuj + diui. (30)

We can now rewrite the probability of unit i seeing a positive outcome as

P (yi = 1) = P (y_i∗ ≥ 0) = P (X j βzijxj+ X j [Z − D]ijuj + diui ≥ 0) = P uit ≤ P jβzijxj |di| + P j[Z − D]ijuj |di| . (31)

The above expression holds regardless of the sign of di, a diagonal element of the spatial multiplier, as long as the distribution for ui is symmetric.

(14)

above expression: P

j[Z − D]ijuj. Because E(ui) = 0 by assumption, we can write

E " X j [Z − D]ijtuj # = 0. (32)

This implies that the effects of exogenous shocks that unit i receives via the spatial multi-plier component zij, i 6= j are not systematic, and have no systematic effect on the choice probability P (yi = 1). Smirnov’s (2010) key proposal is to approximate

P

j[Z − D]ijuj in (31) by its expectation, i.e., zero. This step massively simplifies the likelihood function. To see why, note that P (yit = 1) can now be written as follows:

P (yi = 1) = P (y_i∗ ≥ 0) = P ui ≤ P jβzijxj |di| = Fu P jβzijxj |di| , (33)

where Fu(.) is the cdf of the univariate distribution of ui, which is typically the standard normal (yielding a Probit model) or a standard logistic (yielding a Logit model).

With this approximation, we can write the pseudo likelihood in closed form. If ui follows the standard logistic distribution, for instance, we have

P L(ρ, β|X, y∗) = " _N Y P (yi = 1) #yi" _N Y (1 − P (yi = 1)) #(1−yi) ∝ " _N Y exp (( P jβzijxj)/|di|) 1 + exp ((P jβzijxj)/|di|) #yi × " _N Y 1 1 + exp ((P jβzijxj)/|di|) #(1−yi) . (34)

(15)

3.2

PMLE for the temporal autoregressive model

Recall the reduced form for the binary temporal autoregressive model, given by

y∗_{(T ×1)} = (I − γT)−1Xβ + (I − γT)−1u. (35)

Next, let

Z(T ×T ) = (I − γT)−1, (36)

denote the dependency multiplier. Applying the logic of the previous section, we can decom-pose the reduced-form error term into two parts

y∗ = ZXβ + Zu = ZXβ + (Z − D)u | {z } distributed + Du |{z} contemporaneous . (37)

The distributed effect captures the effect of exogenous shocks that occurred in the past and were carried over to the outcome of time t. These are distributed because this term focuses on the effect that is carried across multiple time periods (“neighbors” in time). On the other hand, the contemporaneous effects capture the effect of an exogenous shock that occurred in the current time period on the current outcome. Note that due to the lower-diagonal structure of T, D = I, and thus di = 1. Again substituting (Z − D)u with its expectation, we arrive at the following expression for the probability of a positive outcome:

P r(yt= 1) = P r(y∗_t > 0)

= P r (ut < [ZXβ]t) ,

(16)

and the pseudo likelihood function, for instance with a logit link function, is given by L(γ, β|X, y) = " _T Y P (yt= 1) #yt" _T Y (1 − P (yt= 1)) #(1−yt) ∝ " _T Y 1 − 1 1 + exp (−[ZXβ]t) #yt × " _T Y 1 1 + exp (−[ZXβ]t) #(1−yt) . (39)

3.3

PMLE for the simultaneous outcome model

As established earlier, the reduced form for the binary simultaneous outcome model is given by

y∗ = (I − A)−1Xβ + (I − A)−1u (40)

Applying the same decomposition to the reduced form error as in the previous sections yields

y∗ = (I − A)−1Xβ + (I − A)−1u = ZXβ + (Z − D)u | {z } plural effects + Du |{z} singular effects . (41)

The singular effect captures the effect of exogenous shocks on outcome type l directly. The plural effect captures exogenous shocks that apply to outcome type ¬l and spill over to outcome type l. Given the error decomposition, the pseudo likelihood estimator may be derived in the exact same way as in the previous two sections. The same applies for any of the discussed hybrid models.

4 Speeding up computation

In the previous section, we have derived pseudo likelihood functions for binary (inter-)dependence models that can be evaluated directly, thus permitting a pseudo maximum likelihood (PML) strategy that does not require simulation. However, naive implementations of the proposed

(17)

PML estimator may still be prohibitively costly to run. To see why, let us assume that we attempt to fit a model with the following reduced form

y∗_{N ×N} = (I − Q)−1Xβ + (I − Q)−1u, (42)

which yields a pseudo likelihood function consisting of N terms of the following form

P (yi = 1) = P (yi∗ ≥ 0) = Fu [ZXβ]i |di| , (43)

with Z = (I − Q)−1 and di = [Z]ii. Perhaps the most straightforward implementation of expression (43) is to invert I−Q directly using a decomposition-based solver, then multiplying Z with Xβ, and dividing by diag(Z). However, this strategy is typically very slow, as most decomposition-based solvers operate with almost cubic complexity in N .

We propose two (mutually compatible) strategies for avoiding the full inversion of I − Q. The first is trivial, but is worth spelling out nonetheless. If Q is block diagonal, which will be the case in any panel model without a temporal autoregressive component (e.g. the binary simultaneous equation panel model introduced earlier), then its inverse is the block diagonal matrix of block-wise inverses. More formally,

if Q =          B1 0 · · · 0 0 B2 . .. ... .. . . .. ... 0 0 · · · 0 BT          , then B−1 =          Q−1₁ 0 · · · 0 0 B−1₂ . .. ... .. . . .. ... 0 0 · · · 0 B−1_T          .

The second strategy is useful whenever Q can be decomposed as Q = αM, where α is a scalar. Note that this is the case for all three models proposed earlier (spatial, temporal, simultaneous equation), as well as any panel model based on any of these. In this strategy, we completely avoid inverting I − Q, and instead compute ZXβ and D = diag(Z) separately.

(18)

We compute µ = ZXβ by solving the linear system (I − Q)µ = Xβ for µ. This we can do without inversion by using gradient-based iterative procedures. A key advantage of doing so is that iterative procedures are especially efficient if Q is sparse, which is usually the case. We propose using the conjugate gradient method if Q is symmetric, and the slightly slower biconjugate gradient stabilized method otherwise (see Saad 2003). Both these methods have time complexity that is linear in the number of non-zero elements in Q, and are thus far more efficient than directly inverting I − Q.

To obtain D, we make use of the fact that Z can be written as a Neumann series (LeSage and Pace 2009, ch. 2)

Z = I + ∞ X

l=1

(αM)l. (44)

Thus, an approximation for D may be obtained via

D ≈ eD = diag(I) + L X

l=1

diag αlMl . (45)

Note that we can precompute the series {M, M2_{, . . . , M}L_{} prior to optimization. Thus, the} time complexity of evaluating eD during optimization is linear in N .

5 Evaluation and comparison of estimation strategies

Next, we evaluate the performance of our proposed pseudo maximum likelihood (PML) esti-mator for a spatial probit model and a temporal autoregressive probit model. In both cases, we compare the performance of our estimator to that of some well-recognized alternatives.2

5.1

Monte Carlo simulation for the spatial probit model

In this section, we present simulation results for a spatial probit data generating process

(DGP) as specified in (3), with ui ∼ N (0, 1). Throughout the experiments we set the

offset β0 to 0.5 and β1 to 1. The covariate vector x is drawn from the standard normal. The spatial autocorrelation parameter, ρs, is set to 0 (no spatial autocorrelation), 0.25 and

(19)

0.5, respectively in across experiments. The spatial weights matrix W is captures queen-neighborhoods on a square lattice and is row-normalized.

We compare our estimator (PMLE) with three alternatives: (1) A simple GLM probit model where the observed spatial lag Wy is added as a covariate. (2) The Bayesian spatial probit model proposed by LeSage (2000) and implemented by Wilhelm and Godinho de Matos (2013).3 _{(3) The GMM-based spatial probit model by Klier and McMillen (2008).}

We repeat the experiments for sample sizes of N = {256, 1024, 4096}. For each exper-iment, we record the root mean squared error (RMSE) associated with all three parameters for each model. We also keep track of estimation times, since one of our goals objectives in this paper is to reduce the computational burden associated with existing methods. We repeat each experiment 50 times.

Table 1 reports the RMSE estimates across all experiments. It is evident that our method, the Bayesian model, and the GMM method perform almost identically in terms of RMSE. As expected, the “naive” GLM model tends to incur relatively large errors. Perhaps more interestingly, Figure 1 summarizes the average estimation time for each experiment and model. It is on this metric where the benefits of using the PMLE become evident: While estimation times for the GMM and Bayesian models approach prohibitively high values for high N , the PMLE model is estimated almost instantly. In fact, we had to abort the estimation loop for the Bayesian models for N = 4096 because completing the experiments would have taken more than a day.

5.2

Monte Carlo simulation for the temporal autoregressive probit model

Next, we provide simulation results for the temporal autoregressive probit model, as defined in Section 2.2, with ui ∼ N (0, 1). We set up the experiments equivalently to those for the spatial probit model in the previous section. We compare out estimator (PMLE) with two alternatives: (1) A simple GLM probit model that does not account for temporal autocor-relation. (2) A GLM probit model where the observed outcome is included as an additional

(20)

● ● ● ● ● ● ● ● ● ● ● ●● N = 256 N = 1024 N = 4096

GLM Bayes GMM PMLE GLM Bayes GMM PMLE GLM Bayes GMM PMLE

0 50 100 150 200 0 5 10 0 1 2 3 4 5

Estimation time in sec.

Figure 1: Estimation times (in seconds) for the spatial probit experiments.

Table 2 summarizes the RMSE estimates for the experiments. We find that our

estimator exhibits the least error by far – not only for the temporal autocorrelation parameter (γ), but also for β1. Figure 2 summarizes estimation times. Again, we find that our estimator performs extremely well, typically converging to a solution in under a second, even for N = 4096.

6 Conclusion

In this paper, we suggested pseudo maximum likelihood estimators (PMLE) as simpler solutions—particularly from applied the researcher’s perspective—to the estimation prob-lems that arise from various sources of interdependence in binary data. The primary source of the estimation challenge is the fact that the jointly determined error terms in the reduced-form specification are analytically intractable due to an N-dimensional integral; hence we cannot obtain an analytical form of the likelihood with respect to the structural-form errors,

(21)

● ● ● ● ● ● ● N = 256 N = 1024 N = 4096

GLM w/o lag GLM /w lag PMLE GLM w/o lag GLM /w lag PMLE GLM w/o lag GLM /w lag PMLE

0.00 0.25 0.50 0.75 0.0 0.2 0.4 0.6 0.0 0.1 0.2 0.3 0.4 0.5

Estimation time in sec.

Figure 2: Estimation times (in seconds) for the temporal probit experiments.

for which we know the distributional characteristics. We discussed the problem as generally as possible, considering three sources of interdependence in data: (1) spatial interdepen-dence, (2) temporal autocorrelation, and (3) simultaneity arising from endogenous binary regressors in simultaneous equations. To deal with this problem, the previous literature has proposed simulation approaches, but these are extremely computationally intensive and generally impractical for standard datasets of medium to large size.

As a way forward, we demonstrated how to reduce computational burdens significantly by (i) introducing analytically tractable pseudo maximum likelihood estimators (PMLE) for binary choice models that exhibit (inter-)dependence across space, time and/or outcomes, and by (ii) proposing an implementation strategy that increases computational efficiency considerably. Our first-cut Monte Carlo experiments demonstrate that (a) omitting inter-dependence induces bias in binary choice models, (b) our estimators are generally able to recover the parameters of the DGP, and (c) our estimators require only a fraction of the

(22)

computational cost of simulation-based methods.

That said, there are still at least two major tasks that need to be completed in future iterations of this paper:

• Monte Carlo analyses for the simultaneous equation model and the hybrid models • An a pplication along the lines sketched in the introduction. This project was originally

motivated by a set of complex theory in ethnic conflict; however we would like to illustrate the methods with more self-explanatory topics that incorporates all sources of interdependence.

References

Anselin, L. (2003). Spatial Externalities, Spatial Multipliers, and Spatial Econometrics. Interna-tional Regional Science Review, 26(2):153–166.

Beck, N., Epstein, D., Jackman, S., and O’Halloran, S. (2001). Alternative models of dynamics in binary time-series-cross-section models: The example of state failure.

Calabrese, R. and Elkink, J. A. (2014a). Estimating Binary Spatial Autoregressive Models for Rare Events. Working paper.

Calabrese, R. and Elkink, J. A. (2014b). Estimators of binary spatial autoregressive models: A Monte Carlo study. Journal of Regional Science, 54(4):664–687.

Franzese, R. J., Hays, J. C., and Cook, S. J. (2016). Spatial-and spatiotemporal-autoregressive pro-bit models of interdependent binary outcomes. Political Science Research and Methods, 4(1):151– 173.

Kelejian, H. H. and Prucha, I. R. (2010). Specification and estimation of spatial autoregressive mod-els with autoregressive and heteroskedastic disturbances. Journal of Econometrics, 157(1):53–67. Klier, T. and McMillen, D. P. (2008). Clustering of auto supplier plants in the united states: generalized method of moments spatial logit for large samples. Journal of Business & Economic Statistics, 26(4):460–471.

LeSage, J. and Pace, R. K. (2009). Introduction to spatial econometrics. Chapman and Hall/CRC. LeSage, J. P. (2000). Bayesian estimation of limited dependent variable spatial autoregressive

models. Geographical Analysis, 32(1):19–35.

Nordhaus, W. D. (1999). Global Public Goods and the Problem of Global Warming. The Institut d’Economie Industrielle (IDEI), Toulouse, France.

(23)

Saad, Y. (2003). Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia.

Simmons, B. A. and Elkins, Z. (2004). The globalization of liberalization: Policy diffusion in the international political economy. American Political Science Review, 98(1):171–189.

Smirnov, O. A. (2010). Modeling spatial discrete choice. Regional Science and Urban Economics, 40(5):292–298.

Von Stein, J. (2005). Do treaties constrain or screen? selection bias and treaty compliance. American Political Science Review, 99(04):611–622.

Weidner, H. and Mez, L. (2008). German climate change policy a success story with some flaws. The Journal of Environment and Development, 17(4):356–378.

Wilhelm, S. and Godinho de Matos, M. (2013). Estimating Spatial Probit Models in R. The R Journal, 5(1):130–143.

Notes

1_{The following results generalize trivially to higher-order processes.}

2_{At the moment, we do not yet have simulation results for the simultaneous outcome}

model; they are forthcoming.

(24)

N = 256 N = 1024 N = 4096 T rue v alue GLM Ba y es GMM PMLE GLM Ba y es GMM PMLE GLM Ba y es GMM PMLE β0 = -0.5 0.2693 0.1554 0.1315 0.1290 0.1572 0.0685 0.0788 0.0795 0.0441 0.0477 0.0447 β1 = 1 0.1202 0.1318 0. 1210 0.1238 0.0404 0.0421 0.0416 0.0414 0.0369 0.0364 0.0366 ρ = 0 0.8527 0.1836 0.2015 0.1871 0.4050 0.0805 0.1350 0.1365 0.0783 0.0976 0.0923 β0 = -0.5 0.5940 0.1116 0.0842 0.0739 0.4423 0.0266 0.0350 0.0420 0.3934 0.0493 0.0537 β1 = 1 0.1482 0.1800 0.1527 0.1374 0.0524 0.0522 0.0555 0.0533 0.0257 0.0240 0.0265 ρ = 0.25 1. 1605 0.1718 0.1591 0.1340 0.6211 0.0629 0.0643 0.0703 0.5086 0.0690 0.0774 β0 = -0.5 0.8076 0.2703 0.1631 0.2010 0.9802 0.0637 0.1196 0.0946 0.9618 0.0398 0.0366 β1 = 1 0.1683 0.1563 0.1639 0.1695 0.0583 0.0647 0.0797 0.0764 0.0470 0.0306 0.0477 ρ = 0.5 0. 7214 0.2213 0.1153 0.1609 1.5294 0.0588 0.1258 0.0953 1.4694 0.0542 0.0541 T able 1: RMSE estimates for the spatial mo del. Best-p erforming estimator in b old.

(25)

N = 256 N = 1024 N = 4096 T rue v alue GLM w/o lag GLM /w lag PMLE GLM w/o lag GLM /w lag P MLE GLM w/o lag GLM /w lag PMLE β0 = -0.5 0.1450 0.4995 0.1907 0.0483 0.4368 0.0392 0.0222 0.5192 0.0220 β1 = 1 0.1571 1.5724 0.1577 0.0563 1.4917 0.0570 0.0358 1.5189 0.0362 γ = 0 1.0782 0.1034 1.0320 0.0437 1.0212 0.0203 β0 = -0.5 0.1651 0.9845 0.0692 0.1047 1.0295 0.0539 0.1229 1.0556 0.0214 β1 = 1 0.1316 1.8186 0.1118 0.1003 1.7958 0.0780 0.0497 1.8252 0.0324 γ = 0.25 0.6667 0.1340 0.6826 0.0637 0.7408 0.0215 β0 = -0.5 0.4427 1.6373 0.1025 0.2670 1.6569 0.0965 0.2915 1.6714 0.0639 β1 = 1 0.2506 2.2781 0.2223 0.2234 2.1772 0.1432 0.2240 2.2037 0.1384 γ = 0.5 0.3933 0.0521 0.4064 0.0432 0.3953 0.0122 T able 2: RMSE estimates for the temp oral mo del. Best-p e rforming estimator in b old.