3.4 Performance on Synthetic Data
3.4.2 Comparison with the Generalized Random Forest
The generalized random forest (GRF) approach (Athey et al. 2019) can also be used to estimate heterogeneous treatment effects using IVs. The GRF method can be thought of as a “generalist” that estimates a broad range of parameters using moment conditions, whereas the IV tree can be thought of as a “specialist” that is tailored for detecting heterogeneous treatment effects in observational studies. As such, the two approaches are complementary in the big data analytics toolkit. In this section, we conduct simulation studies to compare the accuracy and interpretability of these two approaches for analyzing heterogeneous treatment effects in observational data with unobservable features that affect both the outcome variable and the treatment assignment.
To allow comparison with the GRF method, we make some modifications to the IV tree described earlier. First, because directly comparing a tree to a forest is not fair, we grow an equal number of IV trees (which we refer to as an “IV Forest (IVF)”) to compare with the GRF.³ Second, following much of the existing literature on random forests (e.g., Breiman 2001), we use over-fitted instead of well-pruned trees to construct the IVF. Finally, we use the same local centering approach as in Athey et al. (2019, p. 21). That is, we first regress out the effect of the features on all the outcomes and then construct a forest using centered outcomes instead of original outcomes.
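The local centering step can be illustrated with a minimal sketch. It assumes that "all the outcomes" means the outcome Y, the treatment T, and the instrument Z, and it uses ordinary least squares as a simple stand-in for whatever regression is actually used for centering. The function name `center_on_features` and the toy data-generating process are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def center_on_features(X, *columns):
    """Residualize each column on the features X (OLS with an intercept).

    OLS is an assumed stand-in for the centering regression; it is not
    claimed to be the estimator used in the study.
    """
    design = np.column_stack([np.ones(len(X)), X])
    centered = []
    for col in columns:
        coef, *_ = np.linalg.lstsq(design, col, rcond=None)
        centered.append(col - design @ coef)
    return centered

# Toy data (hypothetical): binary features, binary instrument and treatment.
rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.binomial(1, 0.5, size=(n, p)).astype(float)
Z = rng.binomial(1, 0.5, size=n).astype(float)
T = (Z + rng.normal(size=n) > 0.5).astype(float)
Y = X @ np.ones(p) + T + rng.normal(size=n)

# The forest would then be grown on the centered variables instead of (Y, T, Z).
Y_c, T_c, Z_c = center_on_features(X, Y, T, Z)
```

By construction, the centered variables are orthogonal to the features, so splits are less likely to be driven by direct effects of the features on the outcome.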
3.4.2.1 Synthetic Data Construction
We consider four designs to compare the performance of the two approaches. The first two designs are the same as those described in Section 3.4.1.1. The third and fourth designs are the same as those described in Athey et al. (2019). The details of each design are presented in Table 3.3. For each of the four designs, we consider instances with the number of features equal to 5, 10 and 20, with the sample size ranging from 1,000 to 5,000. Both the IVF and GRF are constructed using 100 trees and all features are used for splitting.⁴ We compute mean-squared errors and split frequencies based on a testing sample of size 5,000 and aggregate results based on 10 runs of the simulations.
³ For a fair comparison, our forest is constructed using methods consistent with those in Athey et al. (2019). However, to formally extend a tree method to a random forest, more parameters need to be carefully evaluated and fine-tuned, which we leave for future research.
⁴ As a default and in the simulation of Athey et al. (2019), the GRF considers all features for splitting when the number of features is less than or equal to 20 (see https://github.com/swager/grf for more details). Its performance remains almost the same or becomes worse when we restrict the number of features (e.g., to one-third or the square root of the total number of features) for splitting.
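The evaluation protocol (fit on a training sample, score on a testing sample of 5,000, average over 10 runs) can be sketched as follows for Design 2, whose true treatment effect is x3. Several parts are assumptions made only for illustration: the instrument Z, the treatment rule, and the noise distributions are not specified in this section, and `fit_placeholder` is a hypothetical subgroup Wald estimator that is told the relevant feature in advance. It stands in for a fitted model and is neither the IVF nor the GRF.

```python
import numpy as np

def simulate_design2(n, p, rng):
    """Y = x1 + x2 + x3 + x3*T + eps + xi with x_k ~ Bern(0.5).

    The instrument, treatment rule, and noise laws below are assumptions.
    """
    X = rng.binomial(1, 0.5, size=(n, p))
    Z = rng.binomial(1, 0.5, size=n)       # assumed binary instrument
    xi = rng.normal(size=n)                # unobserved confounder
    eps = rng.normal(size=n)               # idiosyncratic noise
    T = (2 * Z + xi > 1).astype(float)     # assumed treatment rule
    Y = X[:, :3].sum(axis=1) + X[:, 2] * T + eps + xi
    tau = X[:, 2].astype(float)            # true treatment effect
    return X, Z, T, Y, tau

def wald(Y, T, Z):
    """Wald IV estimate: cov(Y, Z) / cov(T, Z)."""
    return np.cov(Y, Z)[0, 1] / np.cov(T, Z)[0, 1]

def fit_placeholder(X, Z, T, Y, tau=None):
    """Hypothetical stand-in: one Wald estimate per value of x3."""
    effects = {v: wald(Y[X[:, 2] == v], T[X[:, 2] == v], Z[X[:, 2] == v])
               for v in (0, 1)}
    return lambda X_new: np.where(X_new[:, 2] == 1, effects[1], effects[0])

def average_test_mse(n_train, p, n_test=5000, n_runs=10, seed=0):
    """Mean over runs of the test-sample MSE of the estimated effects."""
    rng = np.random.default_rng(seed)
    mses = []
    for _ in range(n_runs):
        predict = fit_placeholder(*simulate_design2(n_train, p, rng))
        X_test, _, _, _, tau_test = simulate_design2(n_test, p, rng)
        mses.append(np.mean((predict(X_test) - tau_test) ** 2))
    return float(np.mean(mses))

mse_by_n = {n: average_test_mse(n, p=5) for n in (1000, 3000, 5000)}
```

Replacing `fit_placeholder` with a routine that fits an actual forest would leave the rest of the protocol unchanged.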
3.4.2.2 Accuracy Comparison
Table 3.3 summarizes mean-squared errors of the IVF and GRF. We see that the IVF has smaller mean-squared errors in all scenarios except Designs 3 and 4 with five features and a sample size of 1,000. The relative gap between the two approaches remains the same or increases as the sample size increases. When the sample size equals 5,000, the relative gap ranges from 33% to 93%, depending on the scenario. The primary reason the IVF generates more accurate estimates of treatment effects is that it uses the exact loss function for tree splitting, whereas the GRF uses a gradient-based approximation.
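The relative gap can be illustrated with the rounded entries of Table 3.3 at a sample size of 5,000. The formula is not stated in the text; the sketch below assumes the gap is the reduction in mean-squared error relative to the GRF, (GRF − IVF) / GRF. Because the table entries are rounded to two decimals, the values computed here only approximate the reported range.

```python
# (design, number of features) -> (IVF MSE, GRF MSE) at sample size 5,000,
# copied from Table 3.3.
mse_5000 = {
    ("Design 1", 5): (0.06, 0.12), ("Design 2", 5): (0.00, 0.06),
    ("Design 3", 5): (0.11, 0.19), ("Design 4", 5): (0.13, 0.21),
    ("Design 1", 10): (0.26, 0.42), ("Design 2", 10): (0.02, 0.12),
    ("Design 3", 10): (0.13, 0.22), ("Design 4", 10): (0.11, 0.26),
    ("Design 1", 20): (1.48, 2.19), ("Design 2", 20): (0.21, 0.33),
    ("Design 3", 20): (0.15, 0.26), ("Design 4", 20): (0.12, 0.32),
}

def relative_gap(ivf, grf):
    """Assumed definition: share of the GRF's MSE removed by the IVF."""
    return (grf - ivf) / grf

gaps = {key: relative_gap(ivf, grf) for key, (ivf, grf) in mse_5000.items()}
smallest = min(gaps, key=gaps.get)   # scenario with the smallest gap
largest = max(gaps, key=gaps.get)    # scenario with the largest gap
```

Under this assumed definition, the smallest gap occurs in Design 1 with 20 features and the largest in Design 2 with 5 features, where the rounded IVF error of 0.00 overstates the gap.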
Table 3.3: Mean-Squared Errors of IV Forest (IVF) and Generalized Random Forest (GRF)
                      Sample size, by design
                      Design 1               Design 2               Design 3               Design 4
#Features  Approach   1,000  3,000  5,000    1,000  3,000  5,000    1,000  3,000  5,000    1,000  3,000  5,000
5          IVF        0.18   0.10   0.06     0.02   0.01   0.00     0.43   0.15   0.11     0.49   0.21   0.13
           GRF        0.35   0.18   0.12     0.18   0.13   0.06     0.39   0.20   0.19     0.44   0.27   0.21
10         IVF        0.74   0.37   0.26     0.15   0.02   0.02     0.46   0.19   0.13     0.35   0.16   0.11
           GRF        1.28   0.64   0.42     0.28   0.17   0.12     0.50   0.31   0.22     0.55   0.35   0.26
20         IVF        2.85   1.90   1.48     0.58   0.28   0.21     0.48   0.18   0.15     0.39   0.19   0.12
           GRF        3.68   2.58   2.19     0.77   0.51   0.33     0.62   0.31   0.26     0.66   0.40   0.32
Note: Designs 1 and 2 have the form Yi=3m Design 4. All forests have 100 trees, and results are aggregated over 10 runs of the simulations.
3.4.2.3 Interpretability Comparison
In addition to estimating treatment effects, we also care about subject groupings. For example, in medical applications, knowing which patients respond similarly to a treatment and which do not can provide clues to the underlying mechanism and thereby guide research into improved treatment alternatives. To evaluate the interpretability of the trees generated by the IVF and GRF approaches, we compare the frequencies of splitting on relevant and irrelevant features at each split depth, defined as the number of edges between the splitting node and the root node. A higher proportion of splits on relevant features implies greater interpretability. A shallower tree with a smaller number of subgroups is also easier to interpret.
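The split-frequency count can be sketched as a traversal over every tree in a forest. The `Node` class is a hypothetical minimal tree representation (a leaf has no splitting feature), not the data structure of any particular forest package, and features are indexed from zero, so index 2 corresponds to x3.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None    # index of the splitting feature; None for a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def split_frequencies(trees, relevant):
    """Count splits by (depth, 'relevant' or 'irrelevant') across all trees.

    Depth is the number of edges from the splitting node to the root.
    """
    counts = Counter()
    for root in trees:
        stack = [(root, 0)]
        while stack:
            node, depth = stack.pop()
            if node is None or node.feature is None:
                continue                     # leaves contribute no split
            kind = "relevant" if node.feature in relevant else "irrelevant"
            counts[(depth, kind)] += 1
            stack.append((node.left, depth + 1))
            stack.append((node.right, depth + 1))
    return counts

# Two toy trees; only feature index 2 (x3) is treated as relevant.
leaf = Node()
tree_a = Node(2, Node(0, leaf, leaf), leaf)   # root splits on x3, then on x1
tree_b = Node(1, leaf, Node(2, leaf, leaf))   # root splits on x2, then on x3
counts = split_frequencies([tree_a, tree_b], relevant={2})
```

Summing such counts over all trees and simulation runs gives a table with one cell per depth and feature type.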
Table 3.4: Split Frequencies of IV Forest (IVF) and Generalized Random Forest (GRF) in Design 2
                               Sample size, by split depth
                               Depth 0                Depth 1                Depth 2                Depth 3
Split on             Approach  1,000  3,000  5,000    1,000  3,000  5,000    1,000  3,000  5,000    1,000  3,000  5,000
Relevant Feature     IVF       979    1,000  1,000    10     0      0        2      0      0        0      0      0
                     GRF       554    650    811      371    361    272      421    336    117      9      439    101
Irrelevant Features  IVF       21     0      0        756    687    685      231    372    314      0      91     74
                     GRF       446    350    189      1,579  1,601  1,677    2,659  3,325  3,484    66     5,413  5,733
Note: Design 2 has the form $Y_i = \sum_{k=1}^{3} x_{ik} + x_{i3} T_i + \varepsilon_i + \xi_i$, where $x_{ik} \sim \mathrm{Bern}(0.5)$. All forests have 100 trees, and results are aggregated over 10 runs of the simulations.
Table 3.4 summarizes split frequencies at the first four depths for Design 2, where x3 is the only relevant feature and x1, x2 and x4 are irrelevant features. At depth zero, each tree has exactly one split, so the total split frequency is 1,000 (100 trees × 10 runs of the simulations) for both approaches. We see that 979–1,000 trees in an IVF (compared with 554–811 trees in a GRF) split on x3 at depth zero. At all depths, the GRF splits more on irrelevant features than does the IVF. Finally, most trees in the IVF have a depth of four or less, whereas trees in the GRF are much deeper than those shown in Table 3.4. The combination of deeper trees and more splits at each depth leads to more subgroups with smaller sizes in the GRF.
A main reason the GRF splits on irrelevant features is that it estimates parameters from moment conditions that require both the instrument and other exogenous features to be orthogonal to the error term. As a result, its splitting criterion is determined by both features that affect treatment effects and features that directly affect outcomes. In contrast, the objective of the IVF is to ensure the accuracy of treatment effect estimation. Its splitting criterion is determined (almost) exclusively by features that affect treatment effects.
Finally, an important reason the GRF has deep trees with small subgroups is that it takes into account only the mean of estimated treatment effects during splitting. It partitions observations into two child subgroups as long as the two subgroups have different average treatment effects. In contrast, the IVF considers both the mean and the variance of estimated treatment effects (see the first and second terms of Equation (3.8)) for tree splitting. It therefore balances the tradeoff between relevance (smaller subgroup size) and statistical reliability (less estimation noise).