Missing Incorporated in Attributes within BART

Implementing BARTm is straightforward. Recall from Section 1.4.1 that the prior on splitting rules is discrete uniform on the possible splitting attributes and discrete uniform on the possible splitting values. To account for Lines 1 and 2 in the MIA procedure (Algorithm 3), the splitting attribute xj and split value are proposed as in BART, but now we additionally propose a direction (left or right with equal probability) in which records are sent when they have missing values in xj. A possible splitting rule would therefore be “xij < c, and send to the left if xij is missing.” To account for Line 3 in the algorithm, splitting on missingness itself, we create dummy vectors of length n for each of the pM attributes with missingness, denoted M1, ..., MpM, which take the value 1 when the entry is missing and 0 when the entry is present. We then augment the original training matrix with these dummies and use the augmented training matrix, X′train := [Xtrain, M1, ..., MpM], as the training data in the BARTm algorithm. Once again, the prior on the splitting rules is the same as in the original BART, but now with the additional consideration that the direction of missingness is equally likely to be left or right conditional on the splitting attribute and value.
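
The augmentation and the direction-aware splitting rule described above can be sketched as follows. This is a minimal illustration in plain Python, not the BARTm implementation: the helper names (`is_missing`, `augment_with_missingness_dummies`, `goes_left`) and the toy matrix are assumptions made for the example, and missing entries are assumed to be encoded as `None` or NaN.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing (an encoding assumed for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def augment_with_missingness_dummies(X):
    """Append one 0/1 dummy column per attribute that has any missing entry.

    Returns the augmented rows and the indices of the attributes that
    received a dummy (the p_M attributes with missingness).
    """
    p = len(X[0])
    missing_cols = [j for j in range(p) if any(is_missing(row[j]) for row in X)]
    X_aug = [
        list(row) + [1 if is_missing(row[j]) else 0 for j in missing_cols]
        for row in X
    ]
    return X_aug, missing_cols

def goes_left(row, j, c, missing_left):
    """Rule 'x_j < c', sending records with missing x_j in the proposed direction."""
    if is_missing(row[j]):
        return missing_left
    return row[j] < c

# Toy training matrix: attributes 1 and 2 contain missing entries.
X_train = [
    [1.0, 5.0, 0.3],
    [2.0, None, 0.1],
    [3.0, 7.0, None],
    [4.0, None, 0.9],
]
X_aug, missing_cols = augment_with_missingness_dummies(X_train)

# Split on attribute 1 at c = 6.0 with missing values sent left.
left_rows = [row for row in X_aug if goes_left(row, 1, 6.0, missing_left=True)]
```

Splitting on missingness itself then needs no special handling: a rule such as `goes_left(row, 3, 0.5, missing_left=True)` on the first dummy column separates present entries (dummy 0) from missing ones (dummy 1).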

We expect BARTm to exhibit better predictive performance than MIA in classical decision trees for two reasons. First, BARTm’s sum-of-trees model offers much greater fitting flexibility than a single tree. Second, due to the greedy nature of decision trees, once a split is chosen, the direction in which missingness is sent cannot be reversed. BARTm, by contrast, can alter its trees by pruning and regrowing nodes or changing splitting rules. These proposed modifications to the trees are accepted or rejected stochastically using the Metropolis-Hastings machinery, depending on how strongly the proposed move increases the model’s posterior value.
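
The stochastic accept/reject step can be illustrated with a generic Metropolis-Hastings acceptance rule. This sketch is not BART's actual sampler: the function name `mh_accept` and its arguments are hypothetical, and the proposal-density terms for the specific tree moves (growing, pruning, changing a rule) are left as inputs because they are not given in this section.

```python
import math
import random

def mh_accept(log_post_current, log_post_proposed,
              log_q_forward=0.0, log_q_reverse=0.0, rng=random):
    """Generic Metropolis-Hastings acceptance decision on the log scale.

    log_post_*: unnormalized log posterior of the current and proposed trees.
    log_q_forward / log_q_reverse: log proposal densities for the move and
    its reverse (both 0.0 corresponds to a symmetric proposal).
    """
    log_ratio = (log_post_proposed - log_post_current) + (log_q_reverse - log_q_forward)
    if log_ratio >= 0.0:
        return True  # moves that do not lower the (corrected) posterior are always accepted
    # Otherwise accept with probability exp(log_ratio) < 1.
    return rng.random() < math.exp(log_ratio)

# A proposal whose posterior is a quarter of the current one is accepted
# roughly a quarter of the time under a symmetric proposal.
rng = random.Random(0)
n_trials = 10000
n_accepted = sum(mh_accept(0.0, math.log(0.25), rng=rng) for _ in range(n_trials))
acceptance_rate = n_accepted / n_trials
```

Because worse moves are still accepted with positive probability, a direction for missing values chosen early in the chain can later be revised, which a greedy tree cannot do.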

We hypothesize that BARTm’s stochastic search for splitting rules allows observations with missingness to be grouped with observations having similar response values. Due to the Metropolis-Hastings step, the algorithm will attempt to move towards splitting rules and corresponding groupings that increase the overall model likelihood P(Y | X, M). In essence, BARTm is “feeling around” predictor space for a location where the missing data increases the overall marginal likelihood. For selection models, since splitting rules can depend on any covariate (including the covariate with missing data), it should be possible to generate successful groupings for the missing data under both MAR and NMAR mechanisms.

We describe simple examples of rules that increase overall model likelihood. Suppose there are two covariates X1 and X2 and we are fitting a BARTm model with one tree. In a simple MAR example, imagine a mechanism where X2 is increasingly likely to go missing for large values of X1. The model can partition this data in two steps to increase overall likelihood: (1) a split on a large value of X1 and then (2) a split on M2. As a simple NMAR example, suppose a mechanism where X2 is more likely to be missing for large values of X2. BARTm can select splits of the form “x2 > c or x2 is missing” with c large. Here, the missing data is appropriately kept with larger values of X2 and overall likelihood should be increased.
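
To make the two mechanisms concrete, the following sketch simulates them. Everything specific here is an assumption for illustration: the standard normal covariates, the logistic form of the missingness probability, its coefficients, and the helper names `simulate` and `missing_rate_by_split`. The true value of X2 is retained only so that the mechanism can be inspected.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n, mechanism, rng):
    """Draw (x1, x2, m2) triples, where m2 = 1 marks x2 as missing.

    Under "MAR" the missingness of x2 is driven by x1; under "NMAR" it is
    driven by x2 itself. Larger driver values make missingness more likely.
    """
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        driver = x1 if mechanism == "MAR" else x2
        p_missing = logistic(3.0 * (driver - 1.0))  # hypothetical mechanism
        m2 = 1 if rng.random() < p_missing else 0
        rows.append((x1, x2, m2))
    return rows

def missing_rate_by_split(rows, driver_index, c):
    """Fraction of records with x2 missing on each side of the cut c."""
    below = [r[2] for r in rows if r[driver_index] <= c]
    above = [r[2] for r in rows if r[driver_index] > c]
    return sum(below) / len(below), sum(above) / len(above)

rng = random.Random(1)
mar_rows = simulate(5000, "MAR", rng)
nmar_rows = simulate(5000, "NMAR", rng)

# MAR: cutting on a large value of x1 isolates most of the missing x2.
mar_low, mar_high = missing_rate_by_split(mar_rows, 0, 1.0)
# NMAR: the missing x2 belong with the large (unobserved) values of x2.
nmar_low, nmar_high = missing_rate_by_split(nmar_rows, 1, 1.0)
```

In both cases the missing rate is much higher above the cut than below it, which is why a split on a large value of X1 followed by a split on M2 (MAR), or a rule that keeps missing values with large X2 (NMAR), groups the missing records sensibly.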

When missingness does not depend on any other covariates, it should be more difficult to find appropriate ways to partition the missing data, and we hypothesize that BARTm will be least effective for selection models with MCAR missing data mechanisms. We attribute this to the regularization prior on the depths of the trees coupled with the fact that all missing data must move to the same daughter node. In short, the trees do not extend deeply enough to create sufficiently complex partitioning schemes to handle the MCAR mechanism.

We also hypothesize that BARTm has potential to perform well on pattern-mixture models due to the partitioning nature of the regression tree. BARTm can partition the data based on different patterns of missingness by using missingness as a valid split value. Then, underneath these splits, different submodels for the different patterns can be constructed. More concretely, consider a simple saturated pattern-mixture model where the model is fA(X1) if X2 is missing and fB(X1) if X2 is present. The model can split immediately on M2 and attempt to fit fA(X1) in a tree below the left node and fB(X1) in a tree below the right node.
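
The saturated pattern-mixture example can be sketched in a few lines. Note the substitution: instead of growing trees below each node, this illustration fits a least-squares line in each branch as a simple stand-in for the submodels. The functions `f_a` and `f_b`, the missingness rate, and the noise level are all hypothetical choices made for the example.

```python
import random

def f_a(x1):
    """Hypothetical response surface when X2 is missing."""
    return 2.0 + 3.0 * x1

def f_b(x1):
    """Hypothetical response surface when X2 is present."""
    return -1.0 - 2.0 * x1

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(0)
data = []
for _ in range(500):
    x1 = rng.gauss(0.0, 1.0)
    m2 = 1 if rng.random() < 0.4 else 0           # pattern indicator for X2
    y = (f_a(x1) if m2 == 1 else f_b(x1)) + rng.gauss(0.0, 0.1)
    data.append((x1, m2, y))

# Split immediately on M2, then fit a separate submodel in each branch.
left = [(x1, y) for x1, m2, y in data if m2 == 1]   # X2 missing
right = [(x1, y) for x1, m2, y in data if m2 == 0]  # X2 present
intercept_a, slope_a = fit_line(*zip(*left))
intercept_b, slope_b = fit_line(*zip(*right))
```

A single split on the missingness dummy is enough to separate the two patterns, after which each branch only has to recover its own function of X1.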

In light of the above examples, it should be noted that the MIA steps within the Bayesian framework can also conceptually be viewed as combining pattern-mixture models with imputation. Conditional on a splitting rule, non-missing values of a covariate are transformed into an indicator that takes the value 1 if the splitting rule condition is satisfied and 0 otherwise. Here, MIA rule 1 effectively imputes 1 for the missing covariate and, analogously, MIA rule 2 effectively imputes 0 for the missing covariate.

Another motivation for adapting MIA to BART arises from computational concerns. BART is a computationally intensive algorithm, but its runtime increases negligibly in the number of covariates (see Chipman et al., 2010, Section 6). Hence, BARTm leaves the computational time virtually unchanged with the addition of the pM new missingness dummy covariates. Another possible strategy would be to develop an iterative imputation procedure using BART similar to that in Stekhoven and Bühlmann (2012) or a model averaging procedure using a multiple imputation framework, but we believe these approaches would be substantially more computationally intensive.

In document Extensions and Applications of Ensemble-of-trees Methods in Machine Learning (Page 128-131)