CHAPTER 4 A GROUPED VARIABLE SELECTION ANALYSIS OF CHILD-
4.4 iCAP: Empirical Evaluation
In this section, I apply the grouped selection procedure of the iCAP to evaluate childhood malnutrition risk factors for young children in Ghana. I compare the performance of the iCAP with OLS, the LASSO and the grouped LASSO using the prediction error on an independent test set. The key idea behind using a grouped selection procedure to choose malnutrition risk factors is the following. Risk factors appear in two possible forms. First, they appear as a categorical variable with two or more categories, which enters the model linearly through a set of indicator variables. Such a variable naturally forms a group, and we are typically interested in selecting the entire group or factor rather than its individual components, e.g. source of water (piped at home, piped outside home, river, dam, lake, well, etc.). Second, they can appear as a continuous variable, which typically exerts a nonlinear rather than linear effect on the outcome. To capture the nonlinear effect, the variable is usually written as a nonparametric component, expressed as an additive combination of basis expansions. In this way, continuous variables also naturally form groups and should be selected as such. Both the iCAP and the grouped LASSO select these pre-structured variables in groups, using L∞ and L2, respectively, as the within-group norms.
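The difference between the two within-group norms can be illustrated with a minimal sketch. It assumes unweighted groups (grouped LASSO implementations often add group-size weights, which are omitted here) and a hypothetical coefficient vector; the function names and the numbers are illustrative only and are not taken from the analysis in this chapter.

```python
import math

def icap_penalty(beta, groups):
    """Sum over groups of the L-infinity norm of each group's coefficients."""
    return sum(max(abs(beta[i]) for i in idx) for idx in groups)

def group_lasso_penalty(beta, groups):
    """Sum over groups of the (unweighted) L2 norm of each group's coefficients."""
    return sum(math.sqrt(sum(beta[i] ** 2 for i in idx)) for idx in groups)

# Hypothetical example: group 0 holds three dummy coefficients of one factor,
# group 1 holds four spline coefficients that have all been set to zero.
beta = [3.0, -4.0, 0.0, 0.0, 0.0, 0.0, 0.0]
groups = [[0, 1, 2], [3, 4, 5, 6]]

print(icap_penalty(beta, groups))         # 4.0 (largest |coefficient| in group 0)
print(group_lasso_penalty(beta, groups))  # 5.0 (sqrt(3^2 + 4^2) for group 0)
```

In both cases a group contributes nothing to the penalty only when all of its coefficients are zero, which is what drives selection at the level of the whole factor.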
4.4.1 Data
The Demographic and Health Surveys (DHS) are nationally representative cross-sectional household surveys containing data on fertility, family planning, maternal and child health, child survival, HIV/AIDS, malaria, nutrition and anthropometric measures for children. They also record several demographic characteristics of the child and parents. I use the 2008 DHS, with data for 2000 children between ages 0 and 5. I focus on child health, maternal health, maternal and paternal demographic measures, and other plausible environmental risk factors. Candidate risk factors include child age, gender, mother's BMI, age and education, father's education, birth order of the child, household wealth and assets, and other socioeconomic and environmental effects. Child age, mother's BMI, mother's education, father's education, months of breastfeeding and mother's age are included as additive nonparametric effects: the child health literature shows that continuous covariates have nonlinear effects on health outcomes, so I model them with B-spline expansions. The remaining variables are categorical and enter with linear effects on the health outcome.
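To make the grouping of categorical variables concrete, the following sketch dummy-codes each factor against a baseline level and records which design-matrix columns belong to which factor. The helper names, the choice of baseline levels and the five toy observations are assumptions made for illustration; they are not drawn from the DHS data.

```python
def dummy_code(values, baseline):
    """Return one 0/1 column per non-baseline level, plus the level names."""
    levels = sorted(set(values) - {baseline})
    columns = [[1 if v == lev else 0 for v in values] for lev in levels]
    return columns, levels

def build_groups(factors):
    """factors maps a name to (values, baseline).
    Returns all dummy columns and, for each factor, its column indices."""
    columns, groups = [], {}
    for name, (values, baseline) in factors.items():
        cols, _ = dummy_code(values, baseline)
        groups[name] = list(range(len(columns), len(columns) + len(cols)))
        columns.extend(cols)
    return columns, groups

# Hypothetical toy data for five children.
factors = {
    "water": (["piped_home", "river", "well", "piped_home", "well"], "piped_home"),
    "anemia": (["mild", "not_anemic", "severe", "moderate", "not_anemic"], "not_anemic"),
}
columns, groups = build_groups(factors)
print(groups)  # {'water': [0, 1], 'anemia': [2, 3, 4]}
```

The index lists in `groups` are what a grouped penalty would consume: a factor is kept or dropped by keeping or zeroing all of its columns together.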
4.4.2 Empirical specification
Following the analysis in Koenker (2010), I use child height (in cm) as the response variable. The empirical specification for my model is:
\[
\text{Height}_i = x_i'\beta + f_1(\text{cage}_i) + f_2(\text{mfeed}_i) + f_3(\text{mbmi}_i) + f_4(\text{mage}_i) + f_5(\text{medu}_i) + f_6(\text{edupartner}_i) + \varepsilon_i, \qquad (4.17)
\]
where the linear effects (categorical variables) are included in the first term, and each $f_j$ is an additive combination of basis functions, such as $f_j(x) = \sum_k \beta_{jk} \phi_k(x)$, $1 \le j \le p$, where I specify $\phi_k$ as a B-spline basis. I thus have 6 groups for the nonlinear effects. The types of potential risk factors include child and parental health, demographic characteristics, health facility availability, environmental circumstances, and other household factors like wealth and assets. Most categorical variables are multinomial, e.g. anemia: severe, moderate, mild and not anemic. Some variables have 10 or more classes. In other cases, the categorical variables are dummies. Table 4.1 summarizes some key multi-class variables. Continuous variables used in candidate models are summarized in Table 4.2. Some of these variables are known to be significant predictors of child health, especially maternal traits like health and education. For instance, low maternal body mass index (BMI) has been shown to be a risk factor for low birth weight and childhood stunting (Bhalotra and Rawlings, 2010). Table 4.3 shows the degree to which important factors like wealth, maternal BMI, maternal education and cooking fuel play a role for a health indicator like child size at birth. For instance, a higher proportion of poorer children are smaller. Similarly, a higher proportion of smaller children belong to the group of mothers with a lower BMI, or with no education. Households using biofuels like wood and straw also have a higher proportion of smaller children.
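The basis expansion of a continuous covariate can be sketched with the Cox-de Boor recursion for B-splines. The degree, the knot placement (child age in months with interior knots at 20 and 40) and the function name are assumptions chosen for illustration; the text above does not state which knots or degree were used.

```python
def bspline_basis(x, knots, degree):
    """Evaluate every B-spline basis function of the given degree at x,
    using the Cox-de Boor recursion. Returns len(knots) - degree - 1 values."""
    # Degree 0: indicator functions of the half-open knot intervals.
    B = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
         for i in range(len(knots) - 1)]
    # Close the last non-empty interval on the right so x == knots[-1] is covered.
    if x == knots[-1]:
        for i in range(len(knots) - 2, -1, -1):
            if knots[i] < knots[i + 1]:
                B[i] = 1.0
                break
    # Raise the degree one step at a time, skipping zero-width knot spans.
    for d in range(1, degree + 1):
        new = []
        for i in range(len(knots) - d - 1):
            left = right = 0.0
            if knots[i + d] > knots[i]:
                left = (x - knots[i]) / (knots[i + d] - knots[i]) * B[i]
            if knots[i + d + 1] > knots[i + 1]:
                right = ((knots[i + d + 1] - x)
                         / (knots[i + d + 1] - knots[i + 1]) * B[i + 1])
            new.append(left + right)
        B = new
    return B

# Hypothetical cubic basis for child age in months on [0, 60].
knots = [0, 0, 0, 0, 20, 40, 60, 60, 60, 60]
row = bspline_basis(24.0, knots, degree=3)
print(len(row))  # 6 basis functions, i.e. one group of 6 coefficients
```

Each child contributes one such row per continuous covariate, and the coefficients multiplying that row form the group that is selected or dropped as a whole.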
4.4.3 Results
The iCAP selects the following variables: child age, child gender, months of breastfeeding, mother's BMI, mother's age, birth order, mother's religion, occupation of mother's partner, household wealth, state, cooking fuel used by the household, household ownership of assets like a refrigerator, television and car, whether the mother had access to prenatal care providers like nurses and auxiliary midwives, and whether the child was vaccinated against diphtheria, hepatitis and influenza. These chosen variables cover a range of risk factors related to the child, the parents, the wealth and assets of the household, and other environmental factors that can affect child health, like cooking fuel and prenatal services.
Finally, I also select risk factors from the same candidate variables using least squares, the LASSO and the grouped LASSO, and then compute the prediction error of each method on an independent test set consisting of 25% of the original sample. These errors are summarized in Table 4.4. The iCAP has the lowest prediction error among the selection methods. This exercise demonstrates the potential benefits of implementing the iCAP for large household survey models from which we wish to extract maximum information at a lower cost.
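The hold-out comparison can be sketched as follows. The text does not specify the error measure or how the split was drawn, so the mean squared error, the random split with a fixed seed, the function names and the toy predictors below are all assumptions for illustration rather than the procedure behind Table 4.4.

```python
import random

def train_test_split(n, test_fraction=0.25, seed=0):
    """Randomly partition indices 0..n-1 into (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def holdout_errors(predictors, X, y, test_idx):
    """Test-set mean squared error for each named prediction function."""
    y_test = [y[i] for i in test_idx]
    return {name: mean_squared_error(y_test, [f(X[i]) for i in test_idx])
            for name, f in predictors.items()}

# Hypothetical toy data and two stand-in "fitted models".
X = [[float(i)] for i in range(8)]
y = [2.0 * i for i in range(8)]
train_idx, test_idx = train_test_split(len(y))
predictors = {
    "exact": lambda row: 2.0 * row[0],
    "biased": lambda row: 2.0 * row[0] + 1.0,
}
errors = holdout_errors(predictors, X, y, test_idx)
print(errors)  # {'exact': 0.0, 'biased': 1.0}
```

In an application, each entry of `predictors` would be a model fitted on the training indices only, so that the test-set errors are comparable across OLS, the LASSO, the grouped LASSO and the iCAP.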