

7.3 Learning Poisson Dependency Networks

7.3.4 PRTs, Model Compression, and Dependency Recovery

So far, we have assumed one of two natural candidates for the initial mean model λ0i(x\i), namely a constant or the empirical mean of Xi. Neither model is likely to be very precise, which in turn may require many iterations of optimization, even when using adaptive step sizes via multiplicative updates. This results in a large number of regression trees and a tendency to overfit.

[Figure 7.4: tree diagram. Root split: burglaries < 1145 (all 195 examples, mean 286). One branch is a leaf with mean 1820 (19 examples, 10%). The other branch (mean 120, 176 examples, 90%) splits on larcenies < 1168 into leaves with means 57 (129 examples, 66%) and 295 (47 examples, 24%).]

Figure 7.4: Example of a Poisson regression tree for the crime “auto theft” from the Communities and Crime dataset used in our experimental evaluation. The root of the tree, drawn as a rectangle, represents the first splitting criterion; here, it splits the dataset according to the number of burglaries. Leaves are drawn as ellipses and show the mean of their partition in the first row of the label. Additionally, the color shade of a leaf node indicates the relative size of the mean (the darker, the higher).

Therefore, we propose another initial mean model that can potentially serve as a head start and comes with additional benefits as we will see. We are interested in a compact regression model for each node and since we are dealing with count data, Poisson Regression Trees (PRTs) [33] naturally lend themselves as a starting point. An example of a PRT is shown in Figure 7.4. To initialize our PDNs, we learn one PRT for each variable Xi, where we train treei(Xi|X\i) and evaluate λ0i(x\i) = treei(x\i).
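To make the evaluation λ0i(x\i) = treei(x\i) concrete, the following minimal sketch encodes the tree of Figure 7.4 as nested tuples and walks it for one example. The tuple layout and the names `FIG_7_4_TREE` and `predict_mean` are illustrative assumptions, not part of rpart. The sketch also assumes that the "yes" branch of each split leads to the smaller mean, which matches the node sizes in the figure but is not stated explicitly there.

```python
# Inner node: (feature, threshold, subtree_if_less, subtree_otherwise).
# Leaf: the mean count of its partition.
# Thresholds and leaf means are read off Figure 7.4; the assignment of
# leaves to the yes/no branches is an assumption (see the text above).
FIG_7_4_TREE = (
    "burglaries", 1145,
    ("larcenies", 1168, 57.0, 295.0),
    1820.0,
)


def predict_mean(tree, x):
    """Walk from the root to a leaf and return that leaf's mean."""
    while isinstance(tree, tuple):
        feature, threshold, if_less, otherwise = tree
        tree = if_less if x[feature] < threshold else otherwise
    return tree


# Initial mean for a hypothetical community with 300 burglaries and 2000 larcenies.
lambda0 = predict_mean(FIG_7_4_TREE, {"burglaries": 300, "larcenies": 2000})
print(lambda0)  # 295.0
```

In a PDN, one such tree would be kept per count variable, each taking the remaining variables as input.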

More precisely, a PRT partitions the training examples in the space of the predictor variables X\i in order to best fit the response variable Xi. It is a binary tree whose leaves hold the mean λi of their partition; all other nodes represent a splitting criterion on a variable Xj ∈ X\i. We use the PRT implementation of rpart [222], where the splitting criterion is given by the likelihood ratio test for two Poisson groups, $D_{\text{parent}} - (D_{\text{left son}} + D_{\text{right son}})$, with the deviance given by

$$D = \sum \left[ x_i \ln\left(\frac{x_i}{\bar{\lambda}}\right) - (x_i - \bar{\lambda}) \right],$$

where $\bar{\lambda}$ is the sample mean of the count variable Xi and the sum runs over the training examples in the node. This splitting is applied recursively until each subgroup reaches a minimum size or no further improvement is possible. To avoid overfitting, the depth of a tree is typically limited a priori; alternatively, post-pruning can be used after the tree has been learned. In the rpart implementation, pre-pruning is controlled by a complexity parameter cp. This parameter, which we consider part of the hyper-parameter space of our algorithm (see Line 2 in Algorithm 9), controls the size of the initial tree. In a secondary step, the tree is pruned using cross validation after the initial tree has been learned. The height of the final tree is the one with the lowest cross-validation error. This learning approach generalizes well in our experiments, as shown below in Section 7.5.

Input: a factor graph G
Output: a list of generated samples S

S ← [ ]
s ← {0}^n                              /* initialization */
for k ← 0 to K do
    for Xi ∈ V do
        s[i] ← Xi ∼ P(Xi | nb(Xi))     /* sample new state of Xi */
    end
    if k > burn-in then
        S.append(s)                    /* append new sample to list */
    end
end
return S

Algorithm 10: Pseudo-code for the ordered Gibbs sampler.
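The ordered Gibbs sampler of Algorithm 10 can be sketched in runnable form as follows. This is a minimal sketch under stated assumptions: each conditional P(Xi | nb(Xi)) is taken to be Poisson with a mean computed from the current state, which fits the PDN setting but is not prescribed by the pseudo-code. The names `sample_poisson`, `ordered_gibbs`, and `conditional_means` are hypothetical, as are the two toy conditionals. Poisson variates are drawn with Knuth's multiplication method because the Python standard library has no Poisson sampler.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's method (suitable for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def ordered_gibbs(conditional_means, num_sweeps, burn_in, seed=0):
    """Resample every variable in a fixed order; keep the states after burn-in."""
    rng = random.Random(seed)
    state = [0] * len(conditional_means)       # s <- {0}^n
    samples = []
    for k in range(num_sweeps + 1):            # k = 0, ..., K
        for i, mean_fn in enumerate(conditional_means):
            # Assumption: Xi | rest is Poisson with mean mean_fn(state).
            state[i] = sample_poisson(mean_fn(state), rng)
        if k > burn_in:
            samples.append(list(state))        # store a copy, not a reference
    return samples


# Hypothetical two-variable model: X0 has a constant mean,
# and the mean of X1 grows with the current value of X0.
conditional_means = [
    lambda s: 2.0,
    lambda s: 1.0 + 0.5 * min(s[0], 4),
]
samples = ordered_gibbs(conditional_means, num_sweeps=200, burn_in=50)
print(len(samples))  # 150
```

Note that the state is copied before it is appended. Appending the mutable state itself would leave every stored sample pointing at the same list.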

A potential head start is not the sole advantage of PRTs. We can also use PRTs for model compression [29], which is, next to Laplace estimates, an additional way to avoid overfitting. Here, we collapse the trained additive or multiplicative model into a single model. To do so, we evaluate it on the training set and learn a single PRT per count variable based on this evaluation. That is, the compressed PDN model consists only of a set of local PRTs that were learned from the optimized GTB model.
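The compression step can be illustrated with a deliberately small stand-in: evaluate a trained model on the training inputs, then fit a single tree to those predicted means using the Poisson deviance reduction as the split criterion. In this sketch the "tree" is only a depth-one stump, `boosted_mean` is a hypothetical multiplicative model, and the data are made up. The text itself uses rpart to learn full PRTs.

```python
import math


def poisson_deviance(values):
    """D = sum[x ln(x / mean) - (x - mean)], with 0 ln 0 taken as 0."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    return sum(
        (x * math.log(x / mean) if x > 0 else 0.0) - (x - mean) for x in values
    )


def fit_stump(rows, targets):
    """Best single split: (gain, feature_index, threshold, left_mean, right_mean)."""
    parent = poisson_deviance(targets)
    best = None
    for j in range(len(rows[0])):
        # Skip the smallest value so that both sides of the split are non-empty.
        for threshold in sorted({row[j] for row in rows})[1:]:
            left = [t for row, t in zip(rows, targets) if row[j] < threshold]
            right = [t for row, t in zip(rows, targets) if row[j] >= threshold]
            gain = parent - (poisson_deviance(left) + poisson_deviance(right))
            if best is None or gain > best[0]:
                best = (gain, j, threshold,
                        sum(left) / len(left), sum(right) / len(right))
    return best


def boosted_mean(row):
    """Hypothetical trained multiplicative model: initial mean times two factors."""
    mean = 2.0
    mean *= 3.0 if row[0] >= 5 else 1.0
    mean *= 1.2 if row[1] >= 2 else 1.0
    return mean


# Made-up training inputs; the targets are the model's own predictions.
rows = [(1, 0), (2, 3), (6, 1), (7, 4), (3, 2), (8, 0)]
targets = [boosted_mean(row) for row in rows]
gain, feature, threshold, left_mean, right_mean = fit_stump(rows, targets)
print(feature, threshold)  # 0 6
```

The stump recovers the dominant factor of the hypothetical model (feature 0) and drops the weaker one, which is the kind of sparsity that compression is meant to produce.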

Moreover, interactions among count variables are directly conveyed by the structure of the compressed PDN and, as a result, can be understood and interpreted more easily in qualitative terms. More precisely, since after compression there is only one local tree model per count variable, we simply look at the count variables used in the inner split nodes of the trees; they indicate the relevant features of the PDN. Note that one tree can use a variable Xk multiple times with different splitting criteria, which indicates that the variable is important in the PDN. Also, count variables closer to the root of a tree are more important than variables further down the tree. This is captured, e.g., by Breiman et al.’s notion of relative importance [28], which measures how important the value of xv is for predicting the value of xu:

$$I^2(u \mid v; \lambda_u) = \sum_{l} i_l^2 \cdot \delta_{v(l)=u}, \tag{7.12}$$

where l iterates over the levels of the Poisson tree for λu of Xu, the value $i_l^2$ is the maximal estimated improvement over a constant fit over the entire region of the current node v(l), and δ is an indicator function selecting all splits involving xv. To summarize, in contrast to other Poisson graphical models, compressed PDNs return local models that are likely to be sparse and therefore easier to interpret.
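As a rough illustration of the relative importance measure, the following sketch sums the improvement values of all inner split nodes of one tree, grouped by the variable each node splits on. It rests on assumptions: the sum is taken over all split nodes rather than over levels, the dictionary layout and the name `relative_importance` are hypothetical, and the improvement values are made up for the example.

```python
from collections import defaultdict


def relative_importance(tree):
    """Sum the squared improvements of all split nodes, grouped by split variable."""
    totals = defaultdict(float)
    stack = [tree]
    while stack:
        node = stack.pop()
        if node is None:                       # leaves carry no split
            continue
        totals[node["var"]] += node["improvement_sq"]
        stack.extend((node["left"], node["right"]))
    return dict(totals)


# Hypothetical compressed tree for one count variable: the root splits on
# burglaries, its left child on larcenies, and burglaries is reused further down.
tree = {
    "var": "burglaries", "improvement_sq": 40.0,
    "left": {
        "var": "larcenies", "improvement_sq": 12.5,
        "left": None,
        "right": {"var": "burglaries", "improvement_sq": 2.5,
                  "left": None, "right": None},
    },
    "right": None,
}
importance = relative_importance(tree)
print(importance)  # {'burglaries': 42.5, 'larcenies': 12.5}
```

A variable that is reused in several splits accumulates importance, and variables that never appear in a split receive none, which is how the sparse dependency structure can be read off the compressed trees.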