Model and notation - Decision trees and forests: a probabilistic perspective

In this section, we briefly review decision trees and the BART model. We refer the reader to Chipman et al. (2010) for further details about the model. Our notation closely follows theirs.

4.2.1 Problem setup

We assume that the training data consist of $N$ i.i.d. samples $X = \{x_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^D$, along with corresponding labels $Y = \{y_n\}_{n=1}^{N}$, where $y_n \in \mathbb{R}$. We focus only on the regression task in this chapter, although the PG sampler can also be used for classification by combining our ideas with the work of Chipman et al. (2010) and Zhang and Härdle (2010).

4.2.2 Regression trees

We refer to Section 2.2 and Figure 2.1 for a review of decision trees and our notation. A decision tree used for regression is referred to as a regression tree. In a regression tree, each leaf node $j \in \mathrm{leaves}(\mathsf{T})$ is associated with a real-valued parameter $\mu_j \in \mathbb{R}$. Let $\mu = \{\mu_j\}_{j \in \mathrm{leaves}(\mathsf{T})}$ denote the collection of all parameters. Given a tree $\mathcal{T}$ and a data point $x$, let $\mathrm{leaf}(x)$ be the unique leaf node $j \in \mathrm{leaves}(\mathsf{T})$ such that $x \in B_j$, and let $g(\cdot\,; \mathcal{T}, \mu)$ be the response function associated with $\mathcal{T}$ and $\mu$, given by

$$g(x; \mathcal{T}, \mu) := \mu_{\mathrm{leaf}(x)}. \tag{4.1}$$
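As a minimal sketch of the response function in (4.1), the following Python snippet routes a point to its leaf and returns the leaf parameter. The `Node` class, its field names, and the convention that points with `x[dim] <= loc` go to the left child are illustrative assumptions, not the notation of Section 2.2.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical tree node: internal nodes carry a split, leaves carry mu."""
    dim: Optional[int] = None       # split dimension (internal nodes only)
    loc: Optional[float] = None     # split location (internal nodes only)
    left: Optional["Node"] = None   # child for x[dim] <= loc (assumed convention)
    right: Optional["Node"] = None  # child for x[dim] > loc
    mu: float = 0.0                 # leaf parameter mu_j (used at leaves only)

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def leaf(root: Node, x: Sequence[float]) -> Node:
    """Return the unique leaf whose block contains x."""
    node = root
    while not node.is_leaf():
        node = node.left if x[node.dim] <= node.loc else node.right
    return node


def g(x: Sequence[float], root: Node) -> float:
    """Response function g(x; T, mu) = mu_leaf(x), as in (4.1)."""
    return leaf(root, x).mu


# A depth-1 tree that splits dimension 0 at 0.5.
tree = Node(dim=0, loc=0.5, left=Node(mu=-0.2), right=Node(mu=0.3))
```

With this tree, `g([0.1, 9.0], tree)` returns the left leaf parameter and `g([0.7, 0.0], tree)` the right one; the second coordinate is ignored because no node splits on it.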

4.2.3 Likelihood specification for BART

BART is a sum-of-trees model, i.e., BART assumes that the label $y$ for an input $x$ is generated by an additive combination of $M$ regression trees. More precisely,

$$y = \sum_{m=1}^{M} g(x; \mathcal{T}_m, \mu_m) + e, \tag{4.2}$$

where $e \sim \mathcal{N}(0, \sigma^2)$ is an independent Gaussian noise term with zero mean and variance $\sigma^2$. Hence, the likelihood for a training instance is

$$\ell(y \mid \{\mathcal{T}_m, \mu_m\}_{m=1}^{M}, \sigma^2, x) = \mathcal{N}\Big(y \,\Big|\, \sum_{m=1}^{M} g(x; \mathcal{T}_m, \mu_m),\ \sigma^2\Big),$$

and the likelihood for the entire training dataset is `(Y_|{Tm,µm}Mm=1, σ2,X) =

4.2.4 Prior specification for BART

The parameters of the BART model are the noise variance $\sigma^2$ and the regression trees $(\mathcal{T}_m, \mu_m)$ for $m = 1, \dots, M$. The conditional independencies in the prior are captured by the factorization

$$p(\{\mathcal{T}_m, \mu_m\}_{m=1}^{M}, \sigma^2 \mid X) = p(\sigma^2) \prod_{m=1}^{M} p(\mu_m \mid \mathcal{T}_m)\, p(\mathcal{T}_m \mid X).$$

The prior over decision trees $p(\mathcal{T}_m = \{\mathsf{T}_m, \delta_m, \xi_m\} \mid X)$ can be described by the following generative process (Chipman et al., 2010; Lakshminarayanan et al., 2013): Starting with a tree comprised only of a root node, the tree is grown by deciding once for every node $j$ whether to 1) stop and make $j$ a leaf, or 2) split, making $j$ an internal node, and add $j0$ and $j1$ as children. The same stop/split decision is made for the children, and their children, and so on. Let $\rho_j$ be a binary indicator variable for the event that $j$ is split. Then every node $j$ is split independently with probability

$$p(\rho_j = 1) = \frac{\alpha_s}{(1 + \mathrm{depth}(j))^{\beta_s}}\, \mathbb{1}[\text{valid split exists below } j \text{ in } X], \tag{4.3}$$

where the indicator $\mathbb{1}[\dots]$ forces the probability to be zero when every possible split of $j$ is invalid, i.e., one of the children nodes contains no training data.$^2$ Informally, the hyperparameters $\alpha_s \in (0, 1)$ and $\beta_s \in [0, \infty)$ control the depth and number of nodes in the tree. Higher values of $\alpha_s$ lead to deeper trees while higher values of $\beta_s$ lead to shallower trees.
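To make the effect of $\alpha_s$ and $\beta_s$ concrete, the split probability in (4.3) can be evaluated directly. The sketch below assumes the root has depth 0 and passes the indicator in as a boolean; the function name is hypothetical, and the defaults are the values $\alpha_s = 0.95$, $\beta_s = 2$ recommended by Chipman et al. (2010).

```python
def split_probability(depth: int, alpha_s: float = 0.95, beta_s: float = 2.0,
                      valid_split_exists: bool = True) -> float:
    """p(rho_j = 1) from (4.3) for a node at the given depth (root depth assumed 0)."""
    if not valid_split_exists:
        return 0.0  # the indicator forces the probability to zero
    return alpha_s / (1.0 + depth) ** beta_s


# Split probabilities at depths 0 to 3 under the default hyperparameters.
probs = [split_probability(d) for d in range(4)]
```

Under the defaults, the split probability falls from 0.95 at the root to 0.2375 at depth 1, about 0.106 at depth 2, and about 0.059 at depth 3, which is why the prior favours shallow trees.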

In the event that a node $j$ is split, the dimension $\delta_j$ and location $\xi_j$ of the split are assumed to be drawn independently from a uniform distribution over the set of all valid splits of $j$. The decision tree prior is thus

$$p(\mathcal{T} \mid X) = \prod_{j \in \mathsf{T} \setminus \mathrm{leaves}(\mathsf{T})} p(\rho_j = 1)\, \mathcal{U}(\delta_j)\, \mathcal{U}(\xi_j \mid \delta_j) \prod_{j \in \mathrm{leaves}(\mathsf{T})} p(\rho_j = 0), \tag{4.4}$$

where $\mathcal{U}(\cdot)$ denotes the probability mass function of the uniform distribution over dimensions that contain at least one valid split, and $\mathcal{U}(\cdot \mid \delta_j)$ denotes the probability density function of the uniform distribution over valid split locations along dimension $\delta_j$ in block $B_j$.
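The generative process behind (4.4) can be sketched as a short sampler. The version below is an illustration under stated assumptions rather than the implementation used in the experiments: nodes are named by binary strings (root `""`, children `j + "0"` and `j + "1"`), points with `x[d] <= loc` go to child `j0`, and the valid locations along a dimension are taken to be the interval between the smallest and largest data values in the block.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def valid_dims(X: Sequence[Sequence[float]], idx: List[int]) -> List[int]:
    """Dimensions along which the points in idx take at least two distinct values."""
    return [d for d in range(len(X[0])) if len({X[n][d] for n in idx}) > 1]


def sample_tree(X, alpha_s: float = 0.95, beta_s: float = 2.0,
                rng: Optional[random.Random] = None) -> Dict[str, Tuple]:
    """Draw a tree from the prior; returns node id -> ("leaf",) or ("split", dim, loc)."""
    rng = rng or random.Random()
    tree: Dict[str, Tuple] = {}
    queue = [("", list(range(len(X))))]  # (node id, indices of data in its block)
    while queue:
        j, idx = queue.pop(0)            # breadth-first expansion
        dims = valid_dims(X, idx)
        # Split probability (4.3); depth(j) is the length of the node id.
        p_split = alpha_s / (1.0 + len(j)) ** beta_s if dims else 0.0
        if rng.random() >= p_split:
            tree[j] = ("leaf",)
            continue
        d = rng.choice(dims)             # uniform over dimensions with a valid split
        lo = min(X[n][d] for n in idx)
        hi = max(X[n][d] for n in idx)
        loc = rng.uniform(lo, hi)        # uniform over valid locations
        while not lo <= loc < hi:        # guard against rounding onto the endpoint
            loc = rng.uniform(lo, hi)
        tree[j] = ("split", d, loc)
        queue.append((j + "0", [n for n in idx if X[n][d] <= loc]))
        queue.append((j + "1", [n for n in idx if X[n][d] > loc]))
    return tree
```

The sampler always terminates: every split leaves both children non-empty with strictly fewer points, and a block whose points all coincide has no valid split and becomes a leaf.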

Given a decision tree $\mathcal{T}$, the parameters associated with its leaves are independent and identically distributed normal random variables, and so

$$p(\mu \mid \mathcal{T}) = \prod_{j \in \mathrm{leaves}(\mathsf{T})} \mathcal{N}(\mu_j \mid m_\mu, \sigma_\mu^2). \tag{4.5}$$

$^2$ Note that $p(\rho_j = 1)$ depends on $X$ and the split dimensions and locations at the ancestors of $j$ in

The mean $m_\mu$ and variance $\sigma_\mu^2$ hyperparameters are set indirectly: Chipman et al. (2010) shift and rescale the labels $Y$ such that $y_{\min} = -0.5$ and $y_{\max} = 0.5$, and set $m_\mu = 0$ and $\sigma_\mu = 0.5 / (k \sqrt{M})$, where $k > 0$ is a hyperparameter. This adjustment has the effect of keeping individual node parameters $\mu_j$ small; the higher the values of $k$ and $M$, the greater the shrinkage towards the mean $m_\mu$.
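The rescaling and the resulting leaf prior scale are simple to compute. The sketch below uses hypothetical helper names and assumes the labels are not all equal. It also shows why the scale has this form: a sum of $M$ independent $\mathcal{N}(0, \sigma_\mu^2)$ leaf parameters has standard deviation $\sqrt{M}\,\sigma_\mu = 0.5/k$, so the rescaled label range $[-0.5, 0.5]$ lies within $k$ prior standard deviations of zero.

```python
import math
from typing import List, Sequence


def rescale_labels(Y: Sequence[float]) -> List[float]:
    """Shift and rescale labels so that min(Y) -> -0.5 and max(Y) -> 0.5."""
    y_min, y_max = min(Y), max(Y)
    return [(y - y_min) / (y_max - y_min) - 0.5 for y in Y]


def leaf_prior_std(k: float = 2.0, M: int = 200) -> float:
    """sigma_mu = 0.5 / (k * sqrt(M)), the prior scale of a single leaf parameter."""
    return 0.5 / (k * math.sqrt(M))


sigma_mu = leaf_prior_std()              # about 0.0177 for k = 2, M = 200
sum_std = math.sqrt(200) * sigma_mu      # prior std of the sum of M leaf values
```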

The prior $p(\sigma^2)$ over the noise variance is an inverse gamma distribution. The hyperparameters $\nu$ and $q$ indirectly control the shape and rate of the inverse gamma prior over $\sigma^2$. Chipman et al. (2010) compute an overestimate $\hat{\sigma}^2$ of the noise variance, e.g., using the least-squares variance or the unconditional variance of $Y$, and, for a given shape parameter $\nu$, set the rate such that $P(\sigma \le \hat{\sigma}) = q$, i.e., the $q$th quantile of the prior over $\sigma$ is located at $\hat{\sigma}$.
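The quantile condition determines the rate in closed form. The derivation below is a sketch that assumes the scaled inverse chi-square parameterization of Chipman et al. (2010), $\sigma^2 = \nu\lambda / \chi^2_\nu$, which corresponds to an inverse gamma distribution with shape $\nu/2$ and rate $\nu\lambda/2$. The symbol $\lambda$ and the notation $F_{\chi^2_\nu}$ for the chi-square CDF with $\nu$ degrees of freedom are introduced here for illustration.

```latex
% Assumed parameterization: sigma^2 = nu * lambda / chi^2_nu,
% i.e. inverse gamma with shape nu/2 and rate nu*lambda/2.
\begin{align*}
q &= P(\sigma \le \hat{\sigma})
   = P\!\left(\frac{\nu\lambda}{\chi^2_\nu} \le \hat{\sigma}^2\right)
   = P\!\left(\chi^2_\nu \ge \frac{\nu\lambda}{\hat{\sigma}^2}\right)
   = 1 - F_{\chi^2_\nu}\!\left(\frac{\nu\lambda}{\hat{\sigma}^2}\right), \\
\lambda &= \frac{\hat{\sigma}^2}{\nu}\, F^{-1}_{\chi^2_\nu}(1 - q).
\end{align*}
```

A larger $q$ pushes more prior mass below the overestimate $\hat{\sigma}$ and therefore yields a smaller $\lambda$.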

Chipman et al. (2010) recommend the default values $\nu = 3$, $q = 0.9$, $k = 2$, $M = 200$, $\alpha_s = 0.95$ and $\beta_s = 2.0$. Unless otherwise specified, we use this default hyperparameter setting in our experiments.

In Section 3.2.3, we presented a sequential generative process for the tree prior $p(\mathcal{T} \mid X)$, where a tree $\mathcal{T}$ is generated by starting from an empty tree $\mathcal{T}_{(0)}$ and sampling a sequence $\mathcal{T}_{(1)}, \mathcal{T}_{(2)}, \dots$ of partial trees.$^3$ We will leverage this sequential representation for our PG sampler. We refer to Section 3.2.3 for the details and Figure 3.1 for a cartoon of the sequential generative process. In Section 3.2.3, we discussed a more general version where more than one node may be expanded in an iteration. Based on the experimental results comparing different expansion strategies in Section 3.4.1, we restrict our attention here to node-wise expansion: one node is expanded per iteration and the nodes are expanded in a breadth-wise fashion.

Algorithm 4.1 Bayesian backfitting MCMC for posterior inference in BART

1: Inputs: Training data $(X, Y)$, BART hyperparameters $(\nu, q, k, M, \alpha_s, \beta_s)$

2: Initialization: For all $m$, set $\mathcal{T}_m^{(0)} = \{\mathsf{T}_m^{(0)} = \{\}, \xi_m^{(0)} = \delta_m^{(0)} = \emptyset\}$ and sample $\mu_m^{(0)}$

3: for $i = 1 : \texttt{max\_iter}$ do

4:     Sample $\sigma^{2(i)} \mid \mathcal{T}_{1:M}^{(i-1)}, \mu_{1:M}^{(i-1)}$    ▷ sample from inverse gamma distribution

5:     for $m = 1 : M$ do

6:         Compute residual $R_m^{(i)}$    ▷ using (4.7)

7:         Sample $\mathcal{T}_m^{(i)} \mid R_m^{(i)}, \sigma^{2(i)}, \mathcal{T}_m^{(i-1)}$    ▷ using CGM, GrowPrune or PG

8:         Sample $\mu_m^{(i)} \mid R_m^{(i)}, \sigma^{2(i)}, \mathcal{T}_m^{(i)}$    ▷ sample from Gaussian distribution
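The loop structure of Algorithm 4.1 can be illustrated with a deliberately simplified sketch. It makes several assumptions that are not part of the algorithm above: the tree structures are fixed stumps, so the tree-sampling step 7 is omitted; the residual is taken to be the labels minus the predictions of all other trees; and the inverse gamma prior on $\sigma^2$ uses placeholder shape `a` and rate `b` rather than the calibrated values. The function name `backfit` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def backfit(X: Sequence[Sequence[float]], Y: Sequence[float],
            splits: List[Tuple[int, float]], n_iter: int = 200, k: float = 2.0,
            a: float = 1.5, b: float = 0.01, seed: int = 0):
    """Bayesian backfitting over M fixed stumps; splits[m] = (dim, loc)."""
    rng = random.Random(seed)
    M, N = len(splits), len(Y)
    sigma_mu = 0.5 / (k * M ** 0.5)              # leaf prior scale, with m_mu = 0
    # side[m][n] is the leaf (0 or 1) of stump m that contains x_n.
    side = [[int(x[d] > loc) for x in X] for d, loc in splits]
    # Step 2: initialise the leaf parameters from the prior.
    mu = [[rng.gauss(0.0, sigma_mu), rng.gauss(0.0, sigma_mu)] for _ in range(M)]
    fit = [sum(mu[m][side[m][n]] for m in range(M)) for n in range(N)]
    sigma2 = 1.0
    for _ in range(n_iter):
        # Step 4: sigma^2 | trees, mu is inverse gamma by conjugacy.
        sse = sum((y - f) ** 2 for y, f in zip(Y, fit))
        sigma2 = 1.0 / rng.gammavariate(a + N / 2.0, 1.0 / (b + sse / 2.0))
        for m in range(M):
            # Step 6: residual with the contribution of tree m removed (assumed form).
            others = [fit[n] - mu[m][side[m][n]] for n in range(N)]
            R = [Y[n] - others[n] for n in range(N)]
            # Step 7 (sampling the tree structure) is omitted in this sketch.
            # Step 8: mu_j | R, sigma^2 is Gaussian by conjugacy.
            for j in (0, 1):
                r_j = [R[n] for n in range(N) if side[m][n] == j]
                prec = 1.0 / sigma_mu ** 2 + len(r_j) / sigma2
                mean = (sum(r_j) / sigma2) / prec
                mu[m][j] = rng.gauss(mean, prec ** -0.5)
            fit = [others[n] + mu[m][side[m][n]] for n in range(N)]
    return mu, sigma2, fit


# Toy step-function data on [0, 1], with four stumps that all split at 0.5.
X = [[i / 19.0] for i in range(20)]
Y = [-0.4 if x[0] <= 0.5 else 0.4 for x in X]
mu, sigma2, fit = backfit(X, Y, splits=[(0, 0.5)] * 4)
```

On this toy problem the final sum-of-trees fit should sit close to the two plateau values, with each stump contributing only a fraction of the total because of the shrinkage induced by $\sigma_\mu$.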
