Online learning and inference - Graphical Models: Modeling, Optimization, and Hilbert Space Emb

We discuss in this section how to use the training data to learn the model, i.e. the distribution of weights and bias. Bear in mind that the graphical models in Figures 3.1, 3.2, 3.3 and 3.4 each correspond to one particular training example. So we need to make two decisions:

1. Given a training example and its corresponding graph, how to infer the posterior of the model?

2. How is the set of training data used as a whole, i.e. how are the graphs of different training examples connected?

Our answer is: expectation propagation (EP, Minka, 2001) for the first question and Gaussian density filtering (Maybeck, 1982) for the second. Below are the details.

3.2.1 A Bayesian view of learning

Assume we have $n$ feature/label pairs $\{(x_i, y_i)\}_{i=1}^n$ drawn iid from some underlying distribution. Suppose we have a prior distribution $p_0(w)$ on the weight vector $w$, as well as a likelihood model $p(x_i, y_i \mid w)$. In Bayesian learning, we are interested in the posterior distribution of $w$:

$$p\big(w \mid \{(x_i, y_i)\}_{i=1}^n\big) = \frac{p_0(w) \prod_i p(x_i, y_i \mid w)}{\int p_0(w) \prod_i p(x_i, y_i \mid w)\, dw}.$$

The integral in the denominator can be computationally intractable, hence various approximation algorithms have been developed (see Section 1.5). Due to the large amount of data in many real-life applications, we resort to one of the cheapest approximations: assumed density filtering (ADF). The idea is simple: in each iteration, visit only one data point $(x_i, y_i)$, use its likelihood to compute the posterior of the weight, and then use this posterior as the prior for the next iteration. Since each step only deals with one likelihood factor, the posterior inference can be performed efficiently. Algorithm 6 sketches ADF.

Algorithm 6: Gaussian density filtering.

Input: A set of feature/label pairs for training $\{(x_i, y_i)\}_{i=1}^n$.
Output: Approximate posterior of the model.

1 Initialize: Specify a prior of the model $p_0(w)$.
2 for $i = 1$ to $n$ do
3     Construct the likelihood $p(x_i, y_i \mid w)$ using the training example $(x_i, y_i)$.
4     Set the prior of the model to $p_{i-1}(w)$.
5     Find a Gaussian distribution $p_i(w)$ which approximates the posterior distribution $p(w \mid x_i, y_i) \propto p_{i-1}(w)\, p(x_i, y_i \mid w)$. Different inference algorithms differ in the sense of approximation.
6 return $p_n(w)$
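To make the filtering recursion explicit, the following worked equations may help. The shorthand $D_i = \{(x_j, y_j)\}_{j \le i}$ and the symbol $\hat{p}_i$ are introduced here for illustration only, and the KL projection in the last line is one common choice of "sense of approximation" (it amounts to matching the mean and covariance), not necessarily the only one intended.

```latex
% Exact Bayesian updating with iid data: yesterday's posterior is today's prior.
p(w \mid D_i)
  = \frac{p(w \mid D_{i-1})\, p(x_i, y_i \mid w)}
         {\int p(w' \mid D_{i-1})\, p(x_i, y_i \mid w')\, dw'},
  \qquad p(w \mid D_0) = p_0(w).

% ADF replaces the exact posterior p(w | D_{i-1}) by the Gaussian p_{i-1}(w):
\hat{p}_i(w) \propto p_{i-1}(w)\, p(x_i, y_i \mid w),

% and projects the result back onto the Gaussian family:
p_i = \operatorname*{arg\,min}_{q \,\in\, \text{Gaussians}}
      \mathrm{KL}\big(\hat{p}_i \,\|\, q\big).
```

Only one likelihood factor enters each step, which is why the per-example cost stays low.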

In our case the priors of all weights are set to zero-mean Gaussians, and their variances will be discussed in the experiment section. The likelihood is modeled by the factor graph in Figure 3.1. If we keep the posterior approximated by Gaussians, then ADF can also be called Gaussian density filtering (GDF). Now the only problem is how to compute the posterior in step 5 of Algorithm 6.
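The loop of Algorithm 6 can be sketched in a few lines once step 5 has a closed form. The sketch below is not the model of Figure 3.1: it swaps in a much simpler binary probit likelihood $\Phi(y\, w^\top x / \beta)$ with labels $y \in \{-1, +1\}$ and a fully factorized Gaussian over the weights, for which moment matching is available analytically. The names `gdf`, `gdf_step`, `prior_var` and `beta` are hypothetical.

```python
import math


def _pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def _cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def gdf_step(mean, var, x, y, beta=1.0):
    """Step 5 for an ASSUMED probit likelihood Phi(y * w.x / beta).

    mean, var: per-weight means and variances of the current Gaussian prior.
    Returns the moment-matched per-weight means and variances.
    """
    s2 = beta * beta + sum(xj * xj * vj for xj, vj in zip(x, var))
    s = math.sqrt(s2)
    t = y * sum(xj * mj for xj, mj in zip(x, mean)) / s
    v = _pdf(t) / _cdf(t)   # correction to the mean
    w = v * (v + t)         # correction to the variance, always in (0, 1)
    new_mean = [mj + y * xj * vj / s * v for mj, vj, xj in zip(mean, var, x)]
    new_var = [vj * (1.0 - xj * xj * vj / s2 * w) for vj, xj in zip(var, x)]
    return new_mean, new_var


def gdf(data, dim, prior_var=1.0, beta=1.0):
    """Visit each example once; the posterior of one step is the next prior."""
    mean = [0.0] * dim          # zero-mean Gaussian prior on every weight
    var = [prior_var] * dim
    for x, y in data:
        mean, var = gdf_step(mean, var, x, y, beta)
    return mean, var


mean, var = gdf([([1.0, 0.0], 1), ([0.0, 1.0], -1)], dim=2)
```

With these two toy examples each weight is touched once: the means move to about 0.564 and -0.564 in the direction of the labels, and both variances shrink below the prior variance of 1.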

3.2.2 Inference on the graph of a given training example with EP

Given a training example $(x, y)$, the discussion in Section 3.1.1 has shown that the posterior $p(w \mid x, y)$ can be derived by querying the marginal distribution of $w$ in Figure 3.1. This marginal can be computed by EP, which was introduced in Section 1.5. In a nutshell, EP is similar to loopy belief propagation, but further approximates the messages as well as possible. To this end, it approximates the marginals of the factors via Gaussians which match the first and second order moments. Strictly speaking, EP finds a Gaussian approximation of the true posterior. Since our model uses the same set of factors as in TrueSkill™, we refer interested readers to Table 1 in (Herbrich et al., 2007) for a summary of the message formulae, and we provide a detailed derivation in Appendix B.
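As an illustration of moment matching, consider an indicator factor that forces a scalar variable above a threshold, the kind of non-Gaussian factor found in TrueSkill-style graphs. The sketch below computes the mean and variance of a Gaussian truncated to $x > \varepsilon$ and then forms the outgoing message by dividing the matched Gaussian by the incoming one in (precision, precision-mean) form. The function names are hypothetical, and the incoming message is assumed to have positive precision.

```python
import math


def _pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def _cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def truncate_above(mu, sigma2, eps):
    """Mean and variance of N(mu, sigma2) restricted to x > eps."""
    sigma = math.sqrt(sigma2)
    t = (mu - eps) / sigma
    v = _pdf(t) / _cdf(t)
    w = v * (v + t)
    return mu + sigma * v, sigma2 * (1.0 - w)


def indicator_message(in_prec, in_prec_mean, eps):
    """Message from the indicator factor to its variable.

    The incoming message is given as (precision, precision-mean) and is
    ASSUMED to have in_prec > 0. Dividing two Gaussians subtracts their
    natural parameters, so the result is returned in the same form.
    """
    mu, sigma2 = in_prec_mean / in_prec, 1.0 / in_prec
    new_mu, new_sigma2 = truncate_above(mu, sigma2, eps)
    new_prec = 1.0 / new_sigma2
    return new_prec - in_prec, new_prec * new_mu - in_prec_mean


# A standard normal truncated at zero is the half-normal distribution.
half_normal_mean, half_normal_var = truncate_above(0.0, 1.0, 0.0)
```

For the half-normal case the matched moments are $\sqrt{2/\pi}$ and $1 - 2/\pi$, which gives a quick sanity check of the two correction terms.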

One important implementation consideration of EP is the message passing schedule (Herbrich et al., 2007). None of the graphical models in Figures 3.1 to 3.4 contains a loop. However, they all have a non-Gaussian factor, $\delta(\cdot > \varepsilon)$, which necessitates passing messages iteratively. Since the non-Gaussian factors only involve factors $\{\alpha_c, \beta_c\}$ and variables $\{d_c\}$ and $b$ (see Figure 3.4), we only need to run EP iteratively over $b$ and $\{\alpha_c, d_c, \beta_c\}_c$. This significantly reduces the cost of each EP iteration from $O(DC)$ (for all weights) to $O(C)$. In practice, since we only send messages from factors to variables, we just repeatedly do:

$\alpha_1 \to d_1, \ldots, \alpha_5 \to d_5$; $\beta_1 \to d_1, \ldots, \beta_5 \to d_5$; $\alpha_1 \to b, \ldots, \alpha_5 \to b$.

The termination criterion is that the relative difference of messages between two iterations falls below a given tolerance value for all messages. All messages are initialized to zero precision and zero precision-mean.

Figure 3.5: A dynamic graphical model with factors between the models of two adjacent examples.
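The schedule and stopping rule described above can be sketched as a generic fixed-point loop. Everything here is illustrative: `run_schedule` and `relative_difference` are hypothetical names, the precise definition of "relative difference" is an assumption, the example uses two classes instead of five, and the update functions are toy contractions standing in for the real EP message formulae.

```python
def relative_difference(old, new, tiny=1e-12):
    """Largest relative change between two (precision, precision-mean) pairs.

    ASSUMED definition: |new - old| divided by the larger magnitude.
    """
    return max(abs(n - o) / max(abs(o), abs(n), tiny) for o, n in zip(old, new))


def run_schedule(schedule, messages, tol=1e-6, max_sweeps=100):
    """Repeat the schedule until no message changes by more than tol.

    schedule: list of (name, update) pairs, visited in order; update(messages)
              returns the new (precision, precision-mean) pair for that name.
    messages: dict from message name to its current pair, updated in place.
    Returns the number of sweeps performed.
    """
    for sweep in range(1, max_sweeps + 1):
        previous = dict(messages)
        for name, update in schedule:
            messages[name] = update(messages)
        if all(relative_difference(previous[k], messages[k]) <= tol
               for k in messages):
            return sweep
    return max_sweeps


# Message names in schedule order: alpha -> d, then beta -> d, then alpha -> b.
names = ["alpha1->d1", "alpha2->d2",
         "beta1->d1", "beta2->d2",
         "alpha1->b", "alpha2->b"]

# All messages start at zero precision and zero precision-mean.
messages = {name: (0.0, 0.0) for name in names}


def toy_update(name):
    """Placeholder update with fixed point (2.0, 4.0); NOT an EP formula."""
    return lambda msgs: (0.5 * msgs[name][0] + 1.0, 0.5 * msgs[name][1] + 2.0)


schedule = [(name, toy_update(name)) for name in names]
sweeps = run_schedule(schedule, messages)
```

Checking every message, rather than a single summary statistic, mirrors the requirement that the tolerance hold for all messages before the loop stops.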

3.2.3 Dynamic models

So far we have not taken into account the need for different models for different parts of the dataset, i.e. temporal/spatial variations. This simplification may be unrealistic in many applications. For example, the categorization rule of Reuters news wire may change over the years, so our model needs to evolve through time accordingly. GDF also depends on the random order of training examples, and the model information propagates only in the forward direction of the data stream. If the data can be stored, then we may add dynamic factors between the models of adjacent news articles to allow smooth temporal variations. See Figure 3.5 for the dynamic graphical model and see (Dangauthier et al., 2008) for how TrueSkill™ can be extended to a dynamic scenario. In this case, EP needs to be performed back and forth over the whole dataset. Although theoretically appealing, this is very expensive in both time and space, and hence we stick to GDF in this work.

Our model also admits straightforward active learning, where in each iteration one picks the most informative training example, labels it, and trains on it. This can be useful when labeling is expensive. In this chapter, we focus on applications where a large amount of labeled data is available, so the computational bottleneck shifts to finding the most informative training example. This usually requires applying the current model to the whole training set, which is expensive. Since each update is cheap in our model, we would rather spend that time on performing more updates.

From now on, we will refer to our algorithm as Bayesian online multi-label classification (BOMC).

3.3 Generalization for multi-variate performance measure
