4.1 Quantitative Methods
4.1.2 Statistical Modelling
This section introduces two statistical approaches to test three hypotheses, which shed light on the role of patron states, patron-parent state competition and temporal patterns of state building in de facto states. First, this thesis develops linear and logistic state and institution building models that measure the impact of patron states on state and institution building in de facto states, while controlling for temporal dependence and a set of domestic, structural and international variables. Even though the dependent variable dfsinst represents a discrete count variable, I decided to run linear regressions rather than a Poisson regression model, as the variable shares more characteristics with a continuous variable than with a traditional count variable (see appendix E for the justification). The recoded dependent variables dfsbuildmod and dfsbuildstrong will be included in logistic regression models and survival models. Second, survival model techniques uncover temporal patterns of state and institution building in de facto states and the extent to which patrons shape these developments. I used the statistics package Stata (Version 15.0) for the statistical modelling and analysis of the data set.
4.1.2.1 Statistical Models for Time Series Cross-Sectional Panel Data
Unlike conventional panel data, which tends to consist of many units (𝑖) observed over few time points (𝑡), time series cross-sectional (TSCS) panel data usually comprises large 𝑡 and small or medium 𝑖. In a basic ordinary least squares (OLS) estimation for TSCS panel data, the outcome and the covariates are therefore indexed both by time and unit (see equation below).
\[ Y_{it} = \alpha + \beta X_{it} + \varepsilon_{it} \]
Statistical models for time series cross-sectional panel data enable researchers to approach questions that have both spatial and temporal dimensions by covering multiple cross-sectional units across various time periods. Particularly for the research goals set out in this thesis, TSCS panel data presents an appropriate resource, as the proposed research questions and state building theories cover temporal variations across a fixed set of units. Yet, statistical models for TSCS panel data require additional statistical considerations related to temporal and spatial dependence, various forms of heterogeneity and panel heteroscedasticity. The standard OLS assumption that the error terms are independent, for instance, is violated in the case of TSCS panel data for a variety of reasons including likely time dependence and potential spatial dependence. These considerations ultimately inform the suitability of the model choice and the appropriate means to account for temporal dependence.
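As an illustration of the basic pooled estimation described above, the following minimal sketch stacks all unit-period observations and fits a single intercept and slope. It is not the estimation run for this thesis (which was carried out in Stata); the toy data, the function name pooled_ols and all variable names are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical toy panel: 3 units (i) observed over 4 periods (t).
rng = np.random.default_rng(0)
n_units, n_periods = 3, 4
unit = np.repeat(np.arange(n_units), n_periods)    # i index of each row
period = np.tile(np.arange(n_periods), n_units)    # t index of each row
x = rng.normal(size=n_units * n_periods)           # covariate X_it
y = 1.0 + 2.0 * x                                  # outcome Y_it (noise-free toy data)


def pooled_ols(y, x):
    """Return (alpha, beta) from OLS on the stacked unit-period rows.

    One intercept and one slope are shared by every unit and period,
    which is exactly the homogeneity assumption of the pooled model.
    """
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1]


alpha_hat, beta_hat = pooled_ols(y, x)
```

Because the sketch treats all rows as independent draws, it ignores the temporal and spatial dependence discussed above; it only shows how the unit and time indices enter the estimation.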
The above OLS equation assumes constant intercepts and covariate effects across all observations. As this homogeneity assumption is unlikely to hold across a set of entities as diverse as de facto states, it is necessary to account for potential intercept and effect heterogeneity. The degree of heterogeneity informs the choice between fixed and random effects models for the state and institution building models of this thesis. Therefore, I consulted a set of descriptive statistics to uncover possible intercept and slope heterogeneity across units or time for the dependent, independent and control variables of the statistical models (see appendix F). The results indicate some heterogeneity at the unit level. For the number of governance institutions, for instance, several de facto states have low mean values and no change in the number of state institutions, whereas others show more institutional development. At the same time, the standard deviations are relatively similar, which indicates slightly less heterogeneity. As there are no regions with particularly high or low state institution counts, regional dummy variables appear unnecessary.
Beyond the dependent variables, I also tested for heterogeneity across the independent and control variables of the study and found variation across space, across time, and across both space and time. The variables typeonset, dias and dfspriorind do not vary over time and would ultimately be dropped in a fixed effects model. A set of other variables (i.e. patronspanke) show little variation over time. Due to the theoretical significance of these variables for state building and the research questions of this thesis, I decided to keep these variables in the model despite their limited variation across time. This is in line with Beck (2008), who stresses the importance of time-varying independent variables for most studies that work with binary time series cross-sectional (BTSCS) panel data. The heterogeneity of variables and intercepts as well as the limited variability of some independent variables, in turn, informed the model choice and the inclusion of a set of variables that vary more across time, such as relparentstrength and tsincedfsinstchg.
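The distinction between variables that vary over time and those that do not can be checked by splitting each variable's variation into a within-unit and a between-unit component. The sketch below is one possible way to do this; the function name within_between_sd and the toy variables are hypothetical and do not reproduce the descriptive statistics reported in the appendix.

```python
import numpy as np


def within_between_sd(values, unit):
    """Split a variable's variation into within-unit and between-unit parts.

    Returns (within_sd, between_sd): the standard deviation of deviations
    from each unit's own mean, and the standard deviation of the unit means.
    """
    values = np.asarray(values, dtype=float)
    unit = np.asarray(unit)
    unit_ids, row_to_unit = np.unique(unit, return_inverse=True)
    unit_means = np.array([values[unit == u].mean() for u in unit_ids])
    deviations = values - unit_means[row_to_unit]
    return deviations.std(), unit_means.std()


# Hypothetical panel: 3 units observed over 4 periods each.
unit = np.repeat(["A", "B", "C"], 4)
time_invariant = np.repeat([0.0, 1.0, 1.0], 4)    # constant within each unit
time_varying = np.tile([1.0, 2.0, 3.0, 4.0], 3)   # same path in every unit

invariant_within, invariant_between = within_between_sd(time_invariant, unit)
varying_within, varying_between = within_between_sd(time_varying, unit)
```

A within-unit standard deviation of zero flags a variable that does not change over time inside any unit, which is the property that matters for the choice between fixed and random effects.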
Fixed effects models, on the one hand, consider unit-specific effects, where the unit effects are fixed and time effects are constrained to zero (see Stimson 1985 who refers to this as the least-squares dummy variables method). Fixed effects assume that each unit has unique, time-constant characteristics that the covariates are unable to capture. Fixed effects models are suitable for data sets with unit heterogeneity, because they account for within-unit variation and for correlations between the sources of the heterogeneity and the independent variables. However, fixed effects do not estimate the effects of covariates that do not change over time within cases (Beck & Katz 2001). Random effects, on the other hand, assume that unit-specific effects are not correlated with the independent variables. Importantly for the variables in the state and institution building models, random effects models have the advantage that they can estimate the effects of time-invariant regressors.
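The contrast between the two specifications can be written out in the notation of the OLS equation. The following is a textbook-style sketch rather than the exact specification estimated in this thesis; the symbols \alpha_i (unit-specific intercept), Z_i (a time-invariant covariate), \gamma (its coefficient) and u_i (a random unit effect) are notation introduced here for illustration.

```latex
% Fixed effects: one intercept per unit
Y_{it} = \alpha_i + \beta X_{it} + \varepsilon_{it}

% Within transformation: subtracting unit means removes \alpha_i
Y_{it} - \bar{Y}_i = \beta \,(X_{it} - \bar{X}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)

% A time-invariant covariate is removed as well, so \gamma is not identified
Z_i - \bar{Z}_i = 0

% Random effects: \gamma is estimable, at the cost of an orthogonality assumption
Y_{it} = \alpha + \beta X_{it} + \gamma Z_i + u_i + \varepsilon_{it},
\qquad \operatorname{Cov}(u_i, X_{it}) = 0, \quad \operatorname{Cov}(u_i, Z_i) = 0
```

The third line shows why a time-invariant regressor drops out of a fixed effects model, and the last line states the assumption that the random effects model needs in exchange for retaining it.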
In light of the descriptive statistics results, a fixed effects model would not be the most appropriate model for this study, as it cannot take into account the independent variables that do not vary over time. Additionally, the fixed effects model would overemphasise those cases in the data set in which the independent variables show limited variation (patronspanke, typeonset, dfspriorind). Thus, even though the Hausman test suggests that the fixed effects model may be preferred, I decided to pursue random effects models, based on the descriptive statistics and because the Hausman test cannot address covariates that are non-time-varying (the results of the Hausman test can be found in appendix G).
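For reference, the Hausman statistic in its standard textbook form compares the coefficient vectors of the two models. The notation below is assumed for illustration and is not taken from appendix G.

```latex
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})'
    \left[ \widehat{\operatorname{Var}}(\hat{\beta}_{FE})
         - \widehat{\operatorname{Var}}(\hat{\beta}_{RE}) \right]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE})
\;\sim\; \chi^2_k \quad \text{under } H_0
```

Here k is the number of coefficients that both models can estimate. Coefficients on time-invariant covariates are absent from \hat{\beta}_{FE} and therefore do not enter the comparison, which is why the test is silent on those variables.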
The choice of the random effects model also addresses potential endogeneity in the regression models of this thesis. Endogeneity refers to the correlation of an independent variable with the error term in the regression analysis, which would result in biased coefficients. In that sense, endogenous regressors are those that are influenced by factors left outside the model, such as omitted variables. TSCS panel data reduces the endogeneity problem somewhat, because unlike non-time-varying data sets, TSCS panel data captures correlated individual effects across time. If the assumption of random effects models that the unit-specific effects are not correlated with the independent variables holds true, this significantly reduces the endogeneity concerns in the model. Still, it is not possible to argue that no variables were omitted, because in random effects models, omitted variable bias may affect time-varying effects. Therefore, process tracing in the case studies of this thesis may identify further omitted variables.
4.1.2.2 Survival Modelling
Survival models are also known as event history, duration or hazard models and estimate the time until a given event occurs. The event in question is conditional on the time until the event takes place (Box-Steffensmeier & Bradford 2004; Cleves et al. 2016; Mills 2011). For this thesis, survival models enable a deeper engagement with potential temporal patterns of state building in de facto states and the extent to which patron states shift these dynamics across time.
From the outset, survival models necessitate the modification of the original data set in such a way that the event of attaining a moderate or a high degree of state building represents the final observation for each unit. In other words, all observations that take place after the state building event (dfsbuildmod or dfsbuildstrong) are dropped. Unfortunately, the discarded observations reduce the number of observations to an extent that makes parametric and Cox models statistically infeasible. Furthermore, it is necessary to declare the data as survival data and choose the appropriate temporal variable. Rather than comparing de facto states across years, I decided to pursue an ahistoric approach that measures state building using the number of months an entity survived (duration). While this approach does not consider the potential influence of historical contexts on state building, it enables a better comparison across de facto states based on the number of months that these entities survived.
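The data modification described above can be sketched as follows. This is an illustrative reconstruction and not the Stata procedure used for this thesis; the function name truncate_at_first_event, the dictionary keys and the toy rows are hypothetical.

```python
def truncate_at_first_event(rows, unit_key="unit", time_key="duration",
                            event_key="event"):
    """Keep each unit's rows up to and including its first event row.

    rows is a list of dicts, one per unit-period observation. Rows that
    follow a unit's first event (event == 1) are discarded, so the event
    becomes that unit's final observation.
    """
    kept = []
    finished = set()  # units whose event has already been observed
    for row in sorted(rows, key=lambda r: (r[unit_key], r[time_key])):
        if row[unit_key] in finished:
            continue
        kept.append(row)
        if row[event_key] == 1:
            finished.add(row[unit_key])
    return kept


# Hypothetical example: unit "A" reaches the event in month 2,
# unit "B" never does and is therefore kept in full (censored).
rows = [
    {"unit": "A", "duration": 1, "event": 0},
    {"unit": "A", "duration": 2, "event": 1},
    {"unit": "A", "duration": 3, "event": 1},
    {"unit": "B", "duration": 1, "event": 0},
    {"unit": "B", "duration": 2, "event": 0},
]
survival_rows = truncate_at_first_event(rows)
```

In the toy data, the third row of unit "A" is dropped, which mirrors how post-event observations reduce the size of the survival data set.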
The modified data sets produce insightful results about temporal patterns of state building by utilising a useful feature of survival analysis in the form of the Kaplan-Meier estimator, which estimates the probability that an event has not yet taken place by a given point in time. In other words, it is a non-parametric estimator of the survival function at time 𝑡. This estimator can be used to understand the likelihood (or risk) of moderate or high state building taking place in de facto states and at what point in time this step is likely to happen.
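To make the estimator concrete, the sketch below computes the Kaplan-Meier survival curve from scratch: at each distinct event time, the running survival probability is multiplied by one minus the share of units still at risk that experience the event. It is a minimal illustration under assumed inputs, not the Stata routine used for this thesis; the function name kaplan_meier and the toy durations are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] for each distinct time at which an event occurs.

    durations: time (e.g. months) until the event or until censoring.
    events:    1 if the event was observed at that time, 0 if censored.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect every observation that ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Units with an event or censored at t leave the risk set afterwards.
        n_at_risk -= n_leaving
    return curve


# Hypothetical durations in months; 0 marks a censored unit.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

For these toy inputs the estimated survival probabilities are approximately 0.8 after month 2, 0.6 after month 3 and 0.3 after month 5; the censored units contribute to the risk set until they leave it but never lower the curve themselves.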