3. Sum-Product Networks
5.4. Very Large Mixture of Spanning Trees for Density Estimation in Layered Distributions
In this section we present a preliminary experiment on the use of SPGMs for approximating an intractable graphical model G. We restrict the discussion to a quantitative evaluation of whether such an approach works in practice in a small academic setting, and leave a full-fledged discussion to future work.
We start with the observation that mixtures of spanning trees have been used extensively to approximate intractable graphs (see e.g. Meila and Jordan [2000], Bach and Jordan [2001], Pletscher et al. [2009]). In these models, the approximation quality typically increases with the number of components in the mixture; hence the ability of SPGMs to model very large mixtures of trees suggests that SPGMs might be apt for this approach (as already seen in Section 5.2).
The procedure that we employ has two steps. First, we find a mixture of spanning trees over G in which many parts are shared between the trees, so that the mixture can be implemented efficiently with an SPGM S. Second, we learn the parameters of S by maximizing the Log Likelihood of a set of samples drawn from G. This is possible when G is a directed graphical model, for which samples can be obtained efficiently with ancestral sampling even if inference is infeasible (Pearl [2000]).
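Ancestral sampling, as used in the second step, can be sketched as follows. This is a minimal illustration on a toy binary model, not the implementation used in the experiments; the function name `ancestral_sample`, the dictionary-based encoding of parents and conditional probability tables, and the numeric values are all assumptions made for the example.

```python
import random

def ancestral_sample(layers, parents, cpt, rng):
    """Draw one joint sample by visiting variables in layer (topological) order.

    layers  : list of lists of variable names; earlier layers come first
    parents : dict name -> tuple of parent names (all sampled earlier)
    cpt     : dict name -> dict mapping a tuple of parent values to P(var = 1)
    rng     : a random.Random instance
    """
    sample = {}
    for layer in layers:
        for var in layer:
            # Parents are already sampled, so the conditional is a table lookup.
            pa_values = tuple(sample[p] for p in parents[var])
            p_one = cpt[var][pa_values]
            sample[var] = 1 if rng.random() < p_one else 0
    return sample

# Hypothetical two-layer model: A and B in the first layer, C in the second.
layers = [["A", "B"], ["C"]]
parents = {"A": (), "B": (), "C": ("A", "B")}
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.9},
}
rng = random.Random(0)
samples = [ancestral_sample(layers, parents, cpt, rng) for _ in range(1000)]
```

Each sample costs one table lookup per variable, so drawing samples stays cheap even when computing marginals of the same model is infeasible.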
However, finding a set of spanning trees with shared parts that can be represented efficiently as an SPGM is not a simple problem. We leave a full-fledged discussion of this approach to future work, and in this preliminary application we consider a class of models for which such a mixture can be found easily.
Layered Distributions. A large mixture of spanning trees can be obtained with a simple heuristic in the case of layered distributions. We define a layered distribution as a directed GM composed of successive layers of variables, where variables in one layer connect only to variables in the next one. Variable Xkl denotes the k-th variable at layer l (Fig. 5.4.1a). This class of distributions is relevant in applications, since it includes Factorial Hidden Markov Models, Multiscale Quadtrees (Wainwright and Jordan [2008]) and deep belief networks (Hinton and Osindero [2006]). The cost of inference in layered distributions is exponential in the layer size in the worst case, so exact inference is intractable. However, samples can be obtained efficiently with ancestral sampling.
A spanning tree can be taken from a layered distribution by allowing a single “active variable” to have children at each layer (Fig. 5.4.1b). It is easy to see that if two trees T1 and T2 taken in this way differ only by the choice of one active variable, then their structure is largely shared: the parts of the trees corresponding to the same active variables are identical.
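The amount of sharing between two such trees can be checked with a small sketch. The encoding below is hypothetical: a variable is a pair (layer, index), every variable at layer l is assumed to be connected to every variable at layer l + 1 in the original model, and `tree_edges` keeps only the edges leaving the active variable of each layer.

```python
def tree_edges(active, K):
    """Edges kept when active[l] is the only variable at layer l with children.

    active : list with the index of the active variable for each parent layer
    K      : number of variables per layer (assumed equal for all layers)
    """
    edges = set()
    for l, a in enumerate(active):
        for k in range(K):
            edges.add(((l, a), (l + 1, k)))
    return edges

K = 6
t1 = tree_edges([0, 2, 5, 1], K)  # active choices for parent layers 0..3
t2 = tree_edges([0, 2, 3, 1], K)  # differs from t1 only at layer 2
shared = t1 & t2
print(len(t1), len(shared))  # prints: 24 18
```

Under these assumptions, the two structures differ only in the edges leaving layer 2; all other edges coincide.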
The mixture of many spanning trees with this structure can be modeled compactly with the SPGM shown in Fig. 5.4.2: notice that any subtree in this model corresponds
to a tree in the form of Fig. 5.4.1, right, hence the SPGM encodes the mixture of all such trees (Proposition 4.1.7).
In addition, we can allow more than one variable to be active at the same time, i.e. allow more than one variable in the same layer to have children (Fig. 5.4.1c). This can be done by creating a clique that merges the states of all active variables into a single node: e.g., if two variables A and B are active at a certain layer, then we create a node associated with a variable {A, B} taking values in ∆(A) × ∆(B), which merges the individual variables (Fig. 5.4.1d).
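The state space of a merged variable is the Cartesian product of the individual domains, as the following sketch illustrates. The helper `merge_domains` and the example domains are assumptions made for illustration only.

```python
from itertools import product

def merge_domains(*domains):
    """Return the states of the merged variable: the Cartesian product of the domains."""
    return list(product(*domains))

dom_A = [0, 1]        # hypothetical domain of A
dom_B = [0, 1, 2]     # hypothetical domain of B
dom_AB = merge_domains(dom_A, dom_B)

# Map each joint state (a, b) to a single index, so that {A, B} can be
# handled like an ordinary discrete variable with len(dom_AB) states.
index_of = {state: i for i, state in enumerate(dom_AB)}
print(len(dom_AB), index_of[(1, 2)])  # prints: 6 5
```

Since the merged domain grows as the product of the individual domain sizes (for example, four binary variables give 16 joint states), the number of simultaneously active variables has to remain small.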
The resulting SPGM efficiently encodes a very large mixture of trees with shared parts. Let the model contain L layers, and let there be C choices of active variables at each layer. Then it is immediate to see that the number of subtrees, each of which corresponds to a spanning tree, grows as C^L due to the combinatorial number of choices of active variables at each layer. However, due to Proposition 4.1.1, inference in the SPGM has just O(LC^2) cost in memory and time. This exponential reduction in inference cost is made possible by exploiting the fact that many parts of the trees are shared.
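The gap between an exponential number of mixture components and a cost that is linear in L and quadratic in C is the usual effect of reusing partial results across layers. The sketch below is a simplified stand-in, not the SPGM of Fig. 5.4.2: it assumes that each component is a sequence of L choices, one per layer, and that the weight of a component factorizes into terms that depend only on consecutive choices. The names `w0`, `W` and `v` are hypothetical.

```python
import itertools
import math
import random

def brute_force(w0, W, v):
    """Sum over all C**L choice sequences explicitly (exponential in L)."""
    L, C = len(v), len(w0)
    total = 0.0
    for seq in itertools.product(range(C), repeat=L):
        term = w0[seq[0]] * v[0][seq[0]]
        for l in range(1, L):
            term *= W[l - 1][seq[l - 1]][seq[l]] * v[l][seq[l]]
        total += term
    return total

def layered(w0, W, v):
    """Same sum with one backward pass: C*C work for each of the L - 1 layer transitions."""
    L, C = len(v), len(w0)
    msg = list(v[L - 1])
    for l in range(L - 2, -1, -1):
        # msg[d] summarizes every continuation that starts with choice d at layer l + 1.
        msg = [v[l][c] * sum(W[l][c][d] * msg[d] for d in range(C))
               for c in range(C)]
    return sum(w0[c] * msg[c] for c in range(C))

rng = random.Random(0)
L, C = 5, 4
w0 = [rng.random() for _ in range(C)]
W = [[[rng.random() for _ in range(C)] for _ in range(C)] for _ in range(L - 1)]
v = [[rng.random() for _ in range(C)] for _ in range(L)]
assert math.isclose(brute_force(w0, W, v), layered(w0, W, v), rel_tol=1e-9)
```

Here the explicit sum visits 4^5 = 1024 sequences, while the layered pass performs 4 × 16 multiply-adds plus a final sum over 4 terms, and both return the same value up to rounding.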
Figure 5.4.1. - A mixture of spanning trees with shared subparts obtained from a layered directed GM. (a) A subsection of a layered directed GM. (b) A subsection of a spanning tree of (a). (c) Allowing two active variables per layer. (d) The model in (c) represented as a tree.
Figure 5.4.2. - First two layers of an SPGM encoding a mixture of spanning trees in a layered model with K variables per layer. Xkl denotes the k-th variable at layer l.
Empirical Evaluation. We tested the SPGM on a layered mixture model with 10 layers, each containing 6 binary variables. As described above, we created an SPGM encoding a mixture of spanning trees over this model. Its parameters are learned by drawing a set of samples from the layered distribution, dividing them into a training set and a test set, and maximizing the training Log Likelihood via EM (Section 3.3.2).
We report Log Likelihood results obtained by SPGMs and several well-established methods for density estimation in Table 5.4.1. We test SPGM models with different numbers of choices of active variables per layer (i.e. the number of sum node children), which result in an increasingly large number of subtrees in the resulting mixture model. Choosing from 1 to 8 possible active variables per layer, the mixture size ranges from 1 to 8^10. We also test different numbers of active variables at each layer (1, 2 and 4). We first compared against methods based on trees, namely the optimal spanning tree (Chow and Liu [1968], see Section 2.3.4) and mixtures of trees trained with EM (Meila and Jordan [2000], see Section 2.4). We report separate results depending on the number of trees in the mixture. Then, we compared against two state-of-the-art density estimation methods for SPNs: ACMNs (Section 3.4.1) and ID-SPN (Section 3.4.2).
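For reference, the mixture sizes implied by the growth rate C^L with L = 10 layers can be tabulated directly. The snippet is plain arithmetic under that formula; the variable names are chosen for the example.

```python
L = 10  # number of layers in the experiment
sizes = {C: C ** L for C in range(1, 9)}  # C choices of active variables per layer
for C, n in sizes.items():
    print(f"C = {C}: {n} trees")
```

With C = 8 this gives 8^10 = 1,073,741,824 mixture components, i.e. roughly 10^9 trees.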
From the quantitative results it is evident that SPGMs outperform all competing methods by a wide margin in terms of test set LL. In particular, they do not seem to suffer from the overfitting problem that plagues mixtures of trees even for moderately large mixture sizes. We hypothesize that this is due to the strong regularization imposed by sharing the structure, and hence the parameters, between the trees in the mixture. In addition, very large mixtures with up to 8^10 tree components can be modeled tractably: learning time was about 5 minutes in a non-optimized MATLAB implementation. The results of this preliminary experiment suggest that using SPGMs to approximate an intractable graph with known structure is a promising research direction, to be explored in future work.