In this section, we briefly review decision trees and the BART model. We refer the reader to Chipman et al. (2010) for further details about the model. Our notation closely follows theirs.
4.2.1 Problem setup
We assume that the training data consist of $N$ i.i.d. samples $X = \{x_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^D$, along with corresponding labels $Y = \{y_n\}_{n=1}^{N}$, where $y_n \in \mathbb{R}$. We focus only on the regression task in this chapter, although the PG sampler can also be used for classification by combining our ideas with the work of Chipman et al. (2010) and Zhang and Härdle (2010).
4.2.2 Regression trees
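We refer to Section 2.2 and Figure 2.1 for a review of decision trees and our notation. A decision tree used for regression is referred to as a regression tree. In a regression tree, each leaf node $j \in \mathrm{leaves}(\mathsf{T})$ is associated with a real-valued parameter $\mu_j \in \mathbb{R}$. Let $\boldsymbol{\mu} = \{\mu_j\}_{j \in \mathrm{leaves}(\mathsf{T})}$ denote the collection of all parameters. Given a tree $\mathcal{T}$ and a data point $x$, let $\mathrm{leaf}(x)$ be the unique leaf node $j \in \mathrm{leaves}(\mathsf{T})$ such that $x \in B_j$, and let $g(\cdot\,; \mathcal{T}, \boldsymbol{\mu})$ be the response function associated with $\mathcal{T}$ and $\boldsymbol{\mu}$, given by
\[ g(x; \mathcal{T}, \boldsymbol{\mu}) := \mu_{\mathrm{leaf}(x)}. \tag{4.1} \]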
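The response function in (4.1) amounts to routing a point down the tree and returning the parameter stored at the leaf it reaches. The following is a minimal sketch of that lookup, not the implementation used in this work. The `Node` class, its field names, and the convention that points with `x[dim] <= loc` go to the left child are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical tree node: internal nodes carry a split, leaves carry mu."""
    dim: Optional[int] = None      # split dimension (internal nodes only)
    loc: Optional[float] = None    # split location (internal nodes only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    mu: float = 0.0                # leaf parameter mu_j (leaves only)

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def g(x: Sequence[float], root: Node) -> float:
    """Return mu_{leaf(x)}: descend from the root until a leaf is reached."""
    node = root
    while not node.is_leaf():
        # Assumed convention: x[dim] <= loc goes left, otherwise right.
        node = node.left if x[node.dim] <= node.loc else node.right
    return node.mu


# A small tree with three leaves, used only to exercise g.
example_tree = Node(
    dim=0, loc=0.5,
    left=Node(mu=-1.0),
    right=Node(dim=1, loc=2.0, left=Node(mu=0.3), right=Node(mu=1.2)),
)
```

With this tree, `g([0.2, 5.0], example_tree)` reaches the left leaf, while points with a first coordinate above 0.5 are routed by their second coordinate.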
4.2.3 Likelihood specification for BART
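BART is a sum-of-trees model, i.e., BART assumes that the label $y$ for an input $x$ is generated by an additive combination of $M$ regression trees. More precisely,
\[ y = \sum_{m=1}^{M} g(x; \mathcal{T}_m, \boldsymbol{\mu}_m) + e, \tag{4.2} \]
where $e \sim \mathcal{N}(0, \sigma^2)$ is an independent Gaussian noise term with zero mean and variance $\sigma^2$. Hence, the likelihood for a training instance is
\[ \ell(y \mid \{\mathcal{T}_m, \boldsymbol{\mu}_m\}_{m=1}^{M}, \sigma^2, x) = \mathcal{N}\Big(y \,\Big|\, \sum_{m=1}^{M} g(x; \mathcal{T}_m, \boldsymbol{\mu}_m),\, \sigma^2\Big), \]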
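and, since the samples are i.i.d., the likelihood for the entire training dataset is
\[ \ell(Y \mid \{\mathcal{T}_m, \boldsymbol{\mu}_m\}_{m=1}^{M}, \sigma^2, X) = \prod_{n=1}^{N} \ell(y_n \mid \{\mathcal{T}_m, \boldsymbol{\mu}_m\}_{m=1}^{M}, \sigma^2, x_n). \]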
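To make the sum-of-trees likelihood concrete, the sketch below evaluates the Gaussian log-likelihood of a dataset under (4.2), taking the dataset likelihood to be the product of the per-instance likelihoods. Each tree is represented by a plain Python callable standing in for $g(\cdot\,; \mathcal{T}_m, \boldsymbol{\mu}_m)$. That representation and the function names are assumptions for illustration only.

```python
import math
from typing import Callable, Sequence

# Hypothetical stand-in for the response function g(.; T_m, mu_m) of one tree.
TreeFn = Callable[[Sequence[float]], float]


def sum_of_trees(x: Sequence[float], trees: Sequence[TreeFn]) -> float:
    """Mean of y given x under (4.2): the sum of the M tree responses."""
    return sum(tree(x) for tree in trees)


def log_likelihood(X, Y, trees: Sequence[TreeFn], sigma2: float) -> float:
    """Sum over n of log N(y_n | sum_m g(x_n; T_m, mu_m), sigma2)."""
    n = len(Y)
    sq_err = sum((y - sum_of_trees(x, trees)) ** 2 for x, y in zip(X, Y))
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - 0.5 * sq_err / sigma2


# Two toy "trees": a stump on the first coordinate and a constant tree.
toy_trees = [lambda x: 1.0 if x[0] <= 0.0 else 2.0, lambda x: 0.5]
toy_X = [[-1.0], [1.0]]
toy_Y = [1.5, 2.5]   # exactly the sum-of-trees predictions, so residuals are 0
```

Because the toy labels equal the predictions, `log_likelihood(toy_X, toy_Y, toy_trees, 1.0)` reduces to the normalizing term alone; perturbing a label lowers the value.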
4.2.4 Prior specification for BART
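The parameters of the BART model are the noise variance $\sigma^2$ and the regression trees $(\mathcal{T}_m, \boldsymbol{\mu}_m)$ for $m = 1, \dots, M$. The conditional independencies in the prior are captured by the factorization
\[ p(\{\mathcal{T}_m, \boldsymbol{\mu}_m\}_{m=1}^{M}, \sigma^2 \mid X) = p(\sigma^2) \prod_{m=1}^{M} p(\boldsymbol{\mu}_m \mid \mathcal{T}_m)\, p(\mathcal{T}_m \mid X). \]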
The prior over decision trees $p(\mathcal{T}_m = \{\mathsf{T}_m, \boldsymbol{\delta}_m, \boldsymbol{\xi}_m\} \mid X)$ can be described by the following generative process (Chipman et al., 2010; Lakshminarayanan et al., 2013): Starting with a tree comprising only a root node, the tree is grown by deciding once for every node $j$ whether to 1) stop and make $j$ a leaf, or 2) split, making $j$ an internal node and adding $j0$ and $j1$ as children. The same stop/split decision is made for the children, and their children, and so on. Let $\rho_j$ be a binary indicator variable for the event that $j$ is split. Then every node $j$ is split independently with probability
\[ p(\rho_j = 1) = \frac{\alpha_s}{(1 + \mathrm{depth}(j))^{\beta_s}}\, \mathbb{1}[\text{valid split exists below } j \text{ in } X], \tag{4.3} \]
where the indicator $\mathbb{1}[\dots]$ forces the probability to be zero when every possible split of $j$ is invalid, i.e., one of the children nodes contains no training data.² Informally, the hyperparameters $\alpha_s \in (0, 1)$ and $\beta_s \in [0, \infty)$ control the depth and number of nodes in the tree. Higher values of $\alpha_s$ lead to deeper trees while higher values of $\beta_s$ lead to shallower trees.
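The effect of the two hyperparameters in (4.3) is easy to see numerically. The sketch below evaluates the split probability as a function of depth. The function name and the boolean flag standing in for the valid-split indicator are assumptions for illustration, and the default arguments are the values of $\alpha_s$ and $\beta_s$ recommended by Chipman et al. (2010).

```python
def split_probability(depth: int, alpha_s: float = 0.95, beta_s: float = 2.0,
                      valid_split_exists: bool = True) -> float:
    """Probability (4.3) that a node at the given depth is split.

    valid_split_exists plays the role of the indicator in (4.3): if no split
    of the node leaves training data in both children, the node is not split.
    """
    if not valid_split_exists:
        return 0.0
    return alpha_s / (1.0 + depth) ** beta_s


# Split probabilities for the root (depth 0) and the next three levels.
probs_by_depth = [split_probability(d) for d in range(4)]
```

With these defaults the root is split with probability 0.95, but a node at depth 1 is split with probability only 0.95 / 4 = 0.2375, which illustrates how strongly the prior favours shallow trees. Setting `beta_s = 0` removes the dependence on depth entirely.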
In the event that a node $j$ is split, the dimension $\delta_j$ and location $\xi_j$ of the split are assumed to be drawn independently from a uniform distribution over the set of all valid splits of $j$. The decision tree prior is thus
\[ p(\mathcal{T} \mid X) = \prod_{j \in \mathsf{T} \setminus \mathrm{leaves}(\mathsf{T})} p(\rho_j = 1)\, \mathcal{U}(\delta_j)\, \mathcal{U}(\xi_j \mid \delta_j) \prod_{j \in \mathrm{leaves}(\mathsf{T})} p(\rho_j = 0), \tag{4.4} \]
where $\mathcal{U}(\cdot)$ denotes the probability mass function of the uniform distribution over dimensions that contain at least one valid split, and $\mathcal{U}(\cdot \mid \delta_j)$ denotes the probability density function of the uniform distribution over valid split locations along dimension $\delta_j$ in block $B_j$.
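Given a decision tree $\mathcal{T}$, the parameters associated with its leaves are independent and identically distributed normal random variables, and so
\[ p(\boldsymbol{\mu} \mid \mathcal{T}) = \prod_{j \in \mathrm{leaves}(\mathsf{T})} \mathcal{N}(\mu_j \mid m_\mu, \sigma_\mu^2). \tag{4.5} \]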
² Note that $p(\rho_j = 1)$ depends on $X$ and the split dimensions and locations at the ancestors of $j$ in
The mean $m_\mu$ and variance $\sigma_\mu^2$ hyperparameters are set indirectly: Chipman et al. (2010) shift and rescale the labels $Y$ such that $y_{\min} = -0.5$ and $y_{\max} = 0.5$, and set $m_\mu = 0$ and $\sigma_\mu = 0.5 / (k\sqrt{M})$, where $k > 0$ is a hyperparameter. This adjustment has the effect of keeping individual node parameters $\mu_j$ small; the higher the values of $k$ and $M$, the greater the shrinkage towards the mean $m_\mu$.
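The label rescaling and the resulting leaf prior scale can be sketched in a few lines. The function names are assumptions for illustration, and the rescaling assumes that the labels are not all equal. The default arguments are the values of $k$ and $M$ recommended by Chipman et al. (2010).

```python
import math
from typing import List, Sequence


def rescale_labels(Y: Sequence[float]) -> List[float]:
    """Shift and rescale labels so that min(Y) = -0.5 and max(Y) = 0.5."""
    y_min, y_max = min(Y), max(Y)
    # Assumes y_max > y_min; constant labels would cause a division by zero.
    return [(y - y_min) / (y_max - y_min) - 0.5 for y in Y]


def leaf_prior_std(k: float = 2.0, M: int = 200) -> float:
    """Prior standard deviation sigma_mu = 0.5 / (k * sqrt(M)) of a leaf."""
    return 0.5 / (k * math.sqrt(M))


rescaled = rescale_labels([3.0, 7.0, 11.0])   # -> [-0.5, 0.0, 0.5]
sigma_mu = leaf_prior_std()                   # about 0.0177 for k=2, M=200
```

Under this choice, the sum of $M$ independent leaf parameters has prior standard deviation $\sqrt{M}\,\sigma_\mu = 0.5/k$, so the rescaled label range $[-0.5, 0.5]$ spans $k$ prior standard deviations on either side of $m_\mu = 0$.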
The prior $p(\sigma^2)$ over the noise variance is an inverse gamma distribution. The hyperparameters $\nu$ and $q$ indirectly control the shape and rate of the inverse gamma prior over $\sigma^2$. Chipman et al. (2010) compute an overestimate $\hat{\sigma}^2$ of the noise variance, e.g., using the least-squares variance or the unconditional variance of $Y$, and, for a given shape parameter $\nu$, set the rate such that $P(\sigma \le \hat{\sigma}) = q$, i.e., the $q$th quantile of the prior over $\sigma$ is located at $\hat{\sigma}$.
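The calibration of the rate can be carried out numerically. The sketch below is one way to do so using only the standard library, not the procedure used in this work. It assumes the inverse gamma density is proportional to $(\sigma^2)^{-a-1} \exp(-b/\sigma^2)$ with shape $a$ and rate $b$, and it assumes $a = \nu/2$, following the $\sigma^2 \sim \nu\lambda/\chi^2_\nu$ convention of Chipman et al. (2010); neither detail is spelled out in this section. All function names are hypothetical.

```python
import math


def reg_lower_gamma(a: float, x: float) -> float:
    """Regularized lower incomplete gamma function P(a, x) by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total):
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def inv_gamma_cdf(s2: float, shape: float, rate: float) -> float:
    """P(sigma^2 <= s2) when sigma^2 ~ InvGamma(shape, rate)."""
    return 1.0 - reg_lower_gamma(shape, rate / s2)


def calibrate_rate(sigma_hat: float, nu: float = 3.0, q: float = 0.9) -> float:
    """Find the rate b such that P(sigma <= sigma_hat) = q."""
    shape = nu / 2.0          # assumption: shape = nu / 2
    target = 1.0 - q          # need P(shape, b / sigma_hat^2) = 1 - q
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(shape, hi) < target:   # grow the bracket if needed
        hi *= 2.0
    for _ in range(200):                         # bisection on t = b / sigma_hat^2
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(shape, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) * sigma_hat ** 2


rate = calibrate_rate(sigma_hat=2.0)             # defaults: nu = 3, q = 0.9
check = inv_gamma_cdf(2.0 ** 2, 1.5, rate)       # should recover q
```

Since $P(\sigma \le \hat{\sigma}) = P(\sigma^2 \le \hat{\sigma}^2)$, the condition only involves the ratio $b / \hat{\sigma}^2$, so the calibrated rate scales with $\hat{\sigma}^2$.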
Chipman et al. (2010) recommend the default values $\nu = 3$, $q = 0.9$, $k = 2$, $M = 200$, $\alpha_s = 0.95$ and $\beta_s = 2.0$. Unless otherwise specified, we use this default hyperparameter setting in our experiments.
In Section 3.2.3, we presented a sequential generative process for the tree prior $p(\mathcal{T} \mid X)$, where a tree $\mathcal{T}$ is generated by starting from an empty tree $\mathcal{T}_{(0)}$ and sampling a sequence $\mathcal{T}_{(1)}, \mathcal{T}_{(2)}, \dots$ of partial trees.³ We will leverage this sequential representation for our PG sampler. We refer to Section 3.2.3 for the details and Figure 3.1 for a cartoon of the sequential generative process. In Section 3.2.3, we discussed a more general version where more than one node may be expanded in an iteration. Based on the experimental results comparing different expansion strategies in Section 3.4.1, we restrict our attention here to node-wise expansion: one node is expanded per iteration and the nodes are expanded in a breadth-wise fashion.
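Algorithm 4.1 Bayesian backfitting MCMC for posterior inference in BART
1: Inputs: Training data $(X, Y)$, BART hyperparameters $(\nu, q, k, M, \alpha_s, \beta_s)$
2: Initialization: For all $m$, set $\mathcal{T}_m^{(0)} = \{\mathsf{T}_m^{(0)} = \{\}, \boldsymbol{\xi}_m^{(0)} = \boldsymbol{\delta}_m^{(0)} = \emptyset\}$ and sample $\boldsymbol{\mu}_m^{(0)}$
3: for $i = 1 : \text{max\_iter}$ do
4:     Sample $\sigma^{2(i)} \mid \mathcal{T}_{1:M}^{(i-1)}, \boldsymbol{\mu}_{1:M}^{(i-1)}$    ▷ sample from inverse gamma distribution
5:     for $m = 1 : M$ do
6:         Compute residual $R_m^{(i)}$    ▷ using (4.7)
7:         Sample $\mathcal{T}_m^{(i)} \mid R_m^{(i)}, \sigma^{2(i)}, \mathcal{T}_m^{(i-1)}$    ▷ using CGM, GrowPrune or PG
8:         Sample $\boldsymbol{\mu}_m^{(i)} \mid R_m^{(i)}, \sigma^{2(i)}, \mathcal{T}_m^{(i)}$    ▷ sample from Gaussian distribution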