Chapter 3: Methods
3.3 Analytical Strategy
3.3.2 Model Building and Testing
Final models include both time-variant and time-invariant factors. Major time-variant factors include labor force status, work characteristics (e.g., hours worked per week and working alone or with others), self-rated health, marital status, and total household wealth, among others. Major time-invariant factors include gender, race, Hispanic ethnicity, and education, among others. For both aims, univariate statistics were examined for all variables (see Appendix B), as well as bivariate
associations between the outcome, treatment, and predictor variables (see Appendix D). I also tested whether the specific assumptions of each model were reasonably met; these tests are described throughout this section. Model fit was determined by F tests. In all cases, the alpha level for indicating a significant relationship is 0.05. Results shown in Tables 2 and 3, as well as in Tables 9 and 10 in Appendix F, list exponentiated coefficients: relative risk ratios (RRR) for multinomial logistic regression, odds ratios (OR) for ordered logistic regression, and incidence rate ratios (IRR) for negative binomial regression. To account for serial correlation, cluster-robust standard errors are listed in all multivariate tables. Finally, results will be described in terms of the direction and significance of the documented
relationships to aid theory development. For this dissertation, a discussion of the marginal effects and predicted probabilities from my final models will be avoided, given ongoing questions about
the limitations of using Rubin’s combination rules with predicted probabilities in multiply imputed datasets and the risk of invalid results (StataCorp, 2017).
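The link between a raw model coefficient and the exponentiated values reported in the tables can be sketched as follows. This is a minimal illustration rather than the estimation code used for this dissertation; the function name `exponentiate` and the coefficient values are hypothetical and do not come from any table.

```python
import math


def exponentiate(coef: float) -> float:
    """Convert a log-scale coefficient into a ratio (RRR, OR, or IRR)."""
    return math.exp(coef)


# Hypothetical coefficients, not taken from any table in this dissertation.
for b in (-0.5, 0.0, 0.5):
    ratio = exponentiate(b)
    if ratio > 1:
        direction = "positive association"
    elif ratio < 1:
        direction = "negative association"
    else:
        direction = "no association"
    print(f"coef={b:+.2f}  ratio={ratio:.3f}  ({direction})")
```

A ratio above 1 corresponds to a positive coefficient and a ratio below 1 to a negative coefficient, which is why the direction of a relationship can be read directly from the exponentiated values.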
To answer Question 1, multinomial logistic regression is employed, as the outcome variable has three distinct categories: self-employment, wage-and-salary work, and not working. Before the analysis, I assessed the univariate distributions of all variables to determine whether data transformations were necessary. I then transformed individual earnings, total household wealth, and total household income from all sources (minus individual earnings) because of their high skewness, as discussed in the next paragraph. After the models were estimated, parameter estimates and their significance levels indicated whether the hypotheses were supported.
As Question 2 has four outcome variables, different methods are necessary. For both financial well-being variables, I transformed the variables using the inverse hyperbolic sine function (IHS) before conducting regression analysis. This can be expressed as:
Equation 1. Inverse Hyperbolic Sine: $\mathrm{IHS}(x) = \operatorname{arcsinh}(x) = \ln\left(x + \sqrt{x^{2} + 1}\right)$
For the individual earnings outcome variable, the IHS transformation can account for the nontrivial number of respondents who report zero earnings in some years (unlike a log transformation, where the log of zero is undefined), as well as the positive skewness of the data. For the total household wealth outcome variable, the IHS transformation accounts for the large number of respondents who report negative household wealth (as defined by assets minus debts), as well as the positive skewness of the data. The IHS transformation, which was first proposed by Johnson (1949), can handle extreme values in dependent variables, including negative and zero values, performing better than the more commonly used tactic of taking the log of values after adding a constant (Burbidge, Magee, & Robb, 1988; Friedline, Masa, & Chowa, 2015). As
a form of sensitivity analysis, all final models for total household wealth were run with and without housing assets included; as the results were largely similar, I chose to keep housing assets as a part of this variable and present those findings in Chapter 4.
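The behavior of the IHS transformation at zero and at negative values can be illustrated with a short sketch. The dollar amounts below are hypothetical and the function names are introduced only for illustration; this is not the code used to prepare the analytic dataset.

```python
import math


def ihs(x: float) -> float:
    """Inverse hyperbolic sine, using the numerically stable library routine."""
    return math.asinh(x)


def ihs_formula(x: float) -> float:
    """Equation 1 written out directly: ln(x + sqrt(x^2 + 1))."""
    return math.log(x + math.sqrt(x * x + 1.0))


# Hypothetical values: negative net wealth, zero earnings, and positive amounts.
values = [-25_000.0, 0.0, 1_200.0, 50_000.0]
for v in values:
    print(f"{v:>10.0f} -> {ihs(v):9.4f}")
```

The transformation is defined at zero, is symmetric around zero (so negative wealth maps to a negative value of the same size as the corresponding positive amount), and for large positive values it is approximately ln(2x), which is why it behaves much like a log transformation in the upper tail.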
Regarding personal health outcomes, self-rated health was measured using a four-item ordinal variable. As such, ordered logistic regression was used, which accounts for the rank order of the data while not assuming equal differences between the possible values (Kennedy, 2008). A key assumption of ordered logistic regression is that the coefficients are equal in a series of
cumulative logit models in which the response variable is recoded into a series of binary variables (Williams, 2016). In other words, the coefficients should have the same relationship with the outcome variable, no matter how it is dichotomized (e.g., fair/poor health compared to good health or better, or good health or worse compared to very good health or better). To test this assumption, I used the Brant test of coefficients (Brant, 1990), which rejected the null hypothesis of equal coefficients for the entire model. This significant result was expected given the large sample size in my study and the high number of covariates in my final model.
Following the guidance set forth by Williams (2016), I carefully considered the direction of the coefficients and their magnitudes, and ultimately determined that the spirit of this assumption was met, making partial proportional odds models unnecessary.
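The recoding that underlies the proportional odds assumption can be sketched as follows. The example assumes a four-category outcome coded 1 (lowest) to 4 (highest); that coding, the function name `cumulative_splits`, and the sample responses are all hypothetical.

```python
from typing import Dict, List


def cumulative_splits(y: List[int], n_categories: int) -> Dict[int, List[int]]:
    """For each cut point k, code 1 if the response is above k and 0 otherwise."""
    return {k: [1 if v > k else 0 for v in y] for k in range(1, n_categories)}


# Hypothetical responses on an assumed 1-to-4 ordinal scale.
health = [1, 2, 2, 3, 4, 4]
splits = cumulative_splits(health, 4)
for cut, binary in splits.items():
    print(f"y > {cut}: {binary}")
```

A four-category outcome yields three binary splits. Under the proportional odds assumption, a binary logit fit to each split would produce the same slope coefficients, differing only in the intercept; the Brant test compares the coefficients across these splits.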
Depressive symptoms, which were measured using a modified CESD scale with answers ranging from 0 to 8, required the use of negative binomial regression due to the overdispersed nature of the data. Poisson regression was not appropriate because it rests on the strong assumption that the variance equals the mean, and the variance of total depressive symptoms (at t=1: V=2.85) exceeded the mean (t=1: M=1.09). Violating this assumption can suggest the existence of more statistically significant explanatory variables than might actually exist (Kennedy, 2008).
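The informal overdispersion check described above amounts to comparing the variance with the mean. The sketch below applies it to the two summary values reported at t=1 and to a small set of hypothetical counts; the function name `dispersion_ratio` and the counts are illustrative assumptions, and the check concerns the marginal distribution only, not the variance conditional on covariates.

```python
from statistics import mean, variance  # variance() is the sample variance


def dispersion_ratio(counts):
    """Variance-to-mean ratio; values well above 1 suggest overdispersion."""
    return variance(counts) / mean(counts)


# Summary values reported in the text for depressive symptoms at t=1.
reported_ratio = 2.85 / 1.09
print(f"reported variance/mean ratio: {reported_ratio:.2f}")

# Hypothetical symptom counts on the 0-8 scale, for illustration only.
hypothetical_counts = [0, 0, 0, 1, 1, 2, 3, 8]
print(f"hypothetical ratio: {dispersion_ratio(hypothetical_counts):.2f}")
```

With the reported values, the variance is roughly 2.6 times the mean, whereas a Poisson distribution implies a ratio of 1.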
Question 2 incorporates two additional estimation procedures. First, all models include a form of propensity score analysis, called inverse probability of treatment weighting, to help correct for selection into self-employment and wage-and-salary work by including a time-invariant factor for self-employment (“treatment”) or wage-and-salary work (“control”). This procedure will be described in detail in the next section. Additionally, lagged dependent variables (LDVs) from the prior wave are included to prevent the biasing of coefficients that can result from serial
correlation that is not controlled for using sandwich estimators. After including LDVs, however, the estimated coefficients for the explanatory variables can be reduced to values below their true magnitudes (Angrist & Pischke, 2009; Keele & Kelly, 2006). This may also reduce the magnitude of the estimated treatment effect in Question 2. Given the consequences of not including LDVs (serial correlation of errors that leads to an overestimation of the magnitude of explanatory variables), I decided to keep them in my models, with the understanding that the estimated coefficients for the explanatory and treatment variables are likely more conservative, and the estimated coefficients for the LDVs are likely higher, than in reality.
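Constructing a lagged dependent variable from panel data can be sketched as follows. The sketch assumes that waves are numbered with consecutive integers and that each row holds one respondent-wave observation; the respondent identifiers, outcome values, and function name are hypothetical, and this is not the data-preparation code used for the dissertation.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, int, float]  # (respondent_id, wave, outcome)
LaggedRow = Tuple[str, int, float, Optional[float]]


def add_lagged_outcome(rows: List[Row]) -> List[LaggedRow]:
    """Attach each respondent's outcome from the immediately preceding wave."""
    lookup: Dict[Tuple[str, int], float] = {(pid, wave): y for pid, wave, y in rows}
    # lookup.get returns None when the respondent has no record in the prior wave.
    return [(pid, wave, y, lookup.get((pid, wave - 1))) for pid, wave, y in sorted(rows)]


# Hypothetical panel: respondent B was not observed in wave 2.
panel = [("A", 1, 2.0), ("A", 2, 3.0), ("A", 3, 1.0), ("B", 1, 0.0), ("B", 3, 4.0)]
lagged = add_lagged_outcome(panel)
for row in lagged:
    print(row)
```

Observations without a prior-wave value, such as a respondent's first wave or a wave following a gap, have no lag and would drop out of any model that includes the LDV.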