3.4 Learning
3.4.1 Structure Learning
Learning the structure of the model is a challenging task. As in [SGP00], we adopt the approximation of limiting the number of dependencies among parts (i.e., the fan-in of each node in the graph) to a fixed value K = 2. In contrast to Song, however, we drop the decomposability requirement and allow for a more general structure of dependencies.
We begin by introducing a scoring method for a model, known as the minimum description length (MDL), the Bayesian information criterion (BIC), or the Schwarz information criterion. The MDL/BIC principle is widely used in statistics as a model selection tool and offers interesting asymptotic properties (see [Sch78], [LB94], [BJ02]). The idea behind MDL is to take into consideration both the “goodness of fit” of a model to the data and the complexity of the model itself. As we have already observed, a fully connected graphical model would be the most accurate description of the training set, yet the least useful, since a search for the optimal labelling would be computationally infeasible. Additionally, by Occam’s razor, if the goodness of fit were the same, a more complex model might not generalize as well as a simpler one.
To quantify this trade-off, the MDL/BIC principle suggests scoring a model from an information-theoretic point of view. An arc in the graph indicates a dependence between two vertices. If we need to estimate the value of the dependent variable, then knowing the value of its parents provides us (on average) with information; that is, we have less uncertainty about the child and thus need fewer bits to convey its value. The stronger the child-parents dependence, the fewer bits are needed.
The average amount of additional information on the child, provided by observing the parents, is exactly what the mutual information I(i, πi) of a family represents. An alternative interpretation is that the mutual information reflects the likelihood that the data satisfies the dependency relationship.
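For jointly Gaussian variables this quantity has a closed form, which the following short derivation sketches. The symbols x_i and x_{π_i} (the child and the stacked parent vectors) and h(·) (differential entropy in bits) are notation introduced here for illustration; the block notation for Σ matches the one used in the family score.

$$
\begin{aligned}
I(i, \pi_i) &= h(x_i) + h(x_{\pi_i}) - h(x_i, x_{\pi_i}), \\
h(x) &= \tfrac{1}{2} \log_2 \left( (2\pi e)^{n} \, |\Sigma_{x,x}| \right) \quad \text{for an } n\text{-dimensional Gaussian } x, \\
I(i, \pi_i) &= \tfrac{1}{2} \log_2 \frac{|\Sigma_{i,i}| \, |\Sigma_{\pi_i,\pi_i}|}{|\Sigma_{i\cup\pi_i,\, i\cup\pi_i}|}.
\end{aligned}
$$

The (2πe) factors cancel because the dimensions of the child and of the parents add up to the dimension of the joint vector. The ratio is never smaller than one, so the mutual information is non-negative and vanishes exactly when the child is uncorrelated with its parents.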
On the other hand, representing this relationship incurs a cost. Imagine if we had to transmit the model over a channel: the higher the number of connections, the larger the number of bits required for its transmission.
The MDL/BIC principle combines these two quantities into a single score. In the case of Gaussian models, it is easy to see that the mutual information is given by a ratio of determinants, while the cost of representing the model is proportional to the number of non-zero entries in the inverse of the covariance matrix. For N i.i.d. data, and a family (i, πi), we have
$$\mathrm{BIC}(i, \pi_i) = \frac{N}{2} \log_2 \frac{|\Sigma_{i\cup\pi_i,\, i\cup\pi_i}|}{|\Sigma_{\pi_i,\pi_i}| \, |\Sigma_{i,i}|} + \frac{d_{\pi_i} d}{2} \log_2 N, \qquad d_{\pi_i} \triangleq |\pi_i| \, d,$$
while the total score of the graph G is
$$\mathrm{BIC}(G) = \sum_{i=1}^{M} \mathrm{BIC}(i, \pi_i). \tag{3.43}$$
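As a rough illustration of how the family and graph scores could be evaluated from data, the sketch below estimates the covariance from samples and applies the two formulas term by term. The function names `family_bic` and `graph_bic`, the column layout of `data` (variable k occupying d consecutive columns), and the dictionary representation of the parent sets are assumptions made for this example, not part of the method described above. With the terms written in this order, the first term is minus N times the mutual information, so a lower score corresponds to a shorter description.

```python
import numpy as np


def family_bic(data, child, parents, d=1):
    """Score of one family (child, parents) in bits; lower means shorter description.

    data    : array of shape (N, M * d); variable k is assumed to occupy
              columns k*d .. (k+1)*d - 1 (a layout chosen for this sketch).
    child   : index of the child variable.
    parents : iterable of parent indices (may be empty).
    """
    parents = list(parents)
    n_samples = data.shape[0]
    cov = np.atleast_2d(np.cov(data, rowvar=False))

    def log2_det(variables):
        # log2 of the determinant of the covariance block for these variables.
        idx = [v * d + k for v in variables for k in range(d)]
        if not idx:
            return 0.0  # determinant of an empty block is 1
        _, logabsdet = np.linalg.slogdet(cov[np.ix_(idx, idx)])
        return logabsdet / np.log(2.0)

    # Data term: (N/2) * log2(|joint| / (|parents| * |child|)) = -N * I(i, pi_i).
    fit = 0.5 * n_samples * (
        log2_det([child] + parents) - log2_det(parents) - log2_det([child])
    )
    # Model term: (d_pi * d / 2) * log2(N), with d_pi = |pi_i| * d.
    penalty = 0.5 * (len(parents) * d) * d * np.log2(n_samples)
    return fit + penalty


def graph_bic(data, parent_sets, d=1):
    """Total score of a graph given as {child: parents}, summed over families."""
    return sum(family_bic(data, i, pa, d) for i, pa in parent_sets.items())
```

A family with no parents scores zero under this convention, so an arc is worth keeping only when the bits saved through the mutual information exceed the cost of encoding it.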
Ideally, we would like to examine every possible graphical model that can be constructed over the M variables in our problem, and score each one using the metric (3.43). For each graph we could evaluate the encoding length of the data and that of the model description, searching for the one that minimizes their sum. However, this method is clearly impractical, since the number of graphs over the M variables grows exponentially. Moreover, the problem of finding the optimal graph has been shown to be NP-hard [Chi96].
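To give a sense of the growth, the sketch below counts labelled directed acyclic graphs on n nodes using Robinson's recurrence. This is a standard counting result brought in for illustration; it does not appear in the text above, and it ignores the fan-in limit K = 2, which shrinks the count without making exhaustive search practical. The name `count_dags` is chosen for this example.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming arcs.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 6):
    print(n, count_dags(n))  # 1, 3, 25, 543, 29281
```

Already at five variables there are 29281 candidate graphs, and each additional variable multiplies the count by a rapidly increasing factor.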
A number of heuristics could be applied to the problem. We choose the solution of Giudici et al. [GC03], which is based on the Markov chain Monte Carlo (MCMC) method. Sampling methods provide an excellent tool for hard optimization problems and their empirical performance is well documented.
The algorithm is initialized by arbitrarily choosing a feasible graph. At every iteration a move is proposed at random by choosing among three possibilities:
• Addition: a new arc is added between randomly chosen variables, as long as structural constraints such as maximum fan-in and the absence of cycles are maintained.
• Deletion: an existing arc is removed from the graph.
• Reversal: the direction of an existing arc is switched, subject to the same constraints imposed on additions.
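The three moves above can be sketched as follows. The graph representation (a dictionary mapping each node to the set of its parents), the names `has_path`, `can_add` and `propose_move`, and the choice to return `None` for an infeasible proposal are assumptions made for this example rather than details taken from [GC03].

```python
import random

MAX_FAN_IN = 2  # the fixed fan-in limit K


def has_path(parents, src, dst):
    """True if a directed path src -> ... -> dst exists (parents[v] = set of parents of v)."""
    stack, seen = [dst], set()
    while stack:  # walk backwards from dst along parent links
        v = stack.pop()
        if v == src:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False


def can_add(parents, u, v):
    """Arc u -> v is admissible if it is new, respects the fan-in limit, and closes no cycle."""
    return (
        u != v
        and u not in parents[v]
        and len(parents[v]) < MAX_FAN_IN
        and not has_path(parents, v, u)
    )


def propose_move(parents, rng=random):
    """Return a modified copy of the graph, or None if the random proposal is infeasible."""
    new = {v: set(pa) for v, pa in parents.items()}
    nodes = list(new)
    arcs = [(u, v) for v in nodes for u in sorted(new[v])]
    move = rng.choice(["add", "delete", "reverse"])
    if move == "add":
        u, v = rng.sample(nodes, 2)
        if can_add(new, u, v):
            new[v].add(u)
            return new
    elif arcs:
        u, v = rng.choice(arcs)
        new[v].discard(u)  # deletion; also the first half of a reversal
        if move == "delete":
            return new
        if can_add(new, v, u):  # reversal is checked against the addition constraints
            new[u].add(v)
            return new
    return None
```

Deletions never violate the constraints, so only additions and reversals need the fan-in and acyclicity checks.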
The graph obtained after the proposed move is evaluated according to equation (3.43) and its score is compared to that of the existing graph. Acceptance of the move is guaranteed only if the score improves. If, on the other hand, the newly obtained graph fares worse than the current one, the move is accepted with a probability that shrinks as the deterioration in score grows. A key aspect in avoiding local minima is that MCMC methods accept (with appropriate reluctance) moves that worsen the value of the functional being optimized. Although this seems counter-intuitive, it creates an escape from local basins of attraction, allowing the exploration of larger portions of the solution space. An additional step we take to mitigate the problem of local minima is to randomly restart the algorithm several times and retain the best-performing graph of all runs.
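A minimal sketch of the acceptance rule and of the random restarts is given below. The text does not state the exact acceptance probability, so the Metropolis-style form exp(-Δ/T), the `temperature` parameter, and the names `accept`, `search` and `search_with_restarts` are assumptions made for illustration. The score is treated as a description length, so lower is better, and `propose` and `score` are supplied by the caller.

```python
import math
import random


def accept(delta, temperature=1.0, rng=random):
    """Decide on a move; delta = new_score - old_score, lower scores are better."""
    if delta < 0:
        return True  # improvements are always accepted
    # Assumed Metropolis-style rule: the larger the worsening, the less likely.
    return rng.random() < math.exp(-delta / temperature)


def search(initial, propose, score, n_iter, rng):
    """One run: random proposals, stochastic acceptance, best state remembered."""
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    for _ in range(n_iter):
        candidate = propose(current, rng)
        if candidate is None:  # infeasible proposal, keep the current state
            continue
        candidate_score = score(candidate)
        if accept(candidate_score - current_score, rng=rng):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


def search_with_restarts(make_initial, propose, score,
                         n_iter=1000, n_restarts=5, seed=0):
    """Restart from several random initial states and keep the best result."""
    rng = random.Random(seed)
    runs = [
        search(make_initial(rng), propose, score, n_iter, rng)
        for _ in range(n_restarts)
    ]
    return min(runs, key=lambda run: run[1])
```

Tracking the best state separately from the current one matters here, because the chain is allowed to move away from a good graph after visiting it.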