
2.4 Literature Review

2.4.6 Summary

We have presented a framework for Bayesian information theoretic active learning called BALD. This framework directly exploits the rearrangement of parameter entropies to predictive entropies. There are many approaches to active learning, but BALD can be advantageous for a number of reasons:

• It is inductive, so it does not make any assumptions about a future decision task, loss or test set.

• With many models, the utility function is smooth and so BALD may be applied to both continuous sampling and pool-based active learning.

• The utility function is often, but not always, submodular and hence greedy maximization is near-optimal.

The above advantages are specific to the classical information gain objective function in Equation (2.8). The potential advantages of the rearrangement that BALD exploits (2.10) and the extension in Equation (2.15) are:

• The required computations are not inherently tied to a particular model or inference method.

• If the output space is ‘simpler’ than the parameter space, as is often the case, then the required entropies are more straightforward to compute.

• The number of (approximate) posterior updates is reduced from one per possible datapoint to one per observed datapoint.

• One can focus upon learning particular variables in the model.

However, computation of the utility in Equation (2.10) may still be non-trivial. Whether BALD is computationally useful depends on the particular task and model. Whether BALD is practically useful is a matter of empirical performance. In the following chapters we apply this framework in machine learning and scientific domains to yield efficient algorithms with strong practical performance.
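To make the output-space computation concrete, the following is a minimal sketch for binary labels. It assumes that Equation (2.10) has the usual form of predictive entropy minus expected conditional entropy, and that posterior samples of the parameters are available; the function names and the toy probabilities are hypothetical and serve only as illustration.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bald_scores(probs):
    """Monte Carlo estimate of the information gain per candidate.

    probs[s, n] = p(y = 1 | x_n, theta_s) for S posterior samples theta_s
    and N candidate inputs (assumed to be supplied by some inference method).
    Returns (score, predictive_entropy), each of shape (N,).
    """
    probs = np.asarray(probs, dtype=float)
    # Entropy of the posterior predictive: H[y | x, D].
    predictive_entropy = binary_entropy(probs.mean(axis=0))
    # Expected entropy of the likelihood: E_theta H[y | x, theta].
    expected_entropy = binary_entropy(probs).mean(axis=0)
    return predictive_entropy - expected_entropy, predictive_entropy

# Toy example (made-up numbers):
# candidate 0: every posterior sample says the label is a coin flip;
# candidate 1: the samples disagree, each one confidently.
probs = np.array([[0.5, 0.99],
                  [0.5, 0.01],
                  [0.5, 0.99],
                  [0.5, 0.01]])
score, predictive_entropy = bald_scores(probs)
```

In this toy example both candidates have the same predictive entropy, but only the second receives a non-zero score, because only there do the posterior samples disagree. No posterior update per candidate label is needed: the same set of samples is reused for every candidate.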

Chapter 3

Active Gaussian Processes

Gaussian processes (GPs) are a powerful, Bayesian non-parametric model for classification and regression. They have been extended to a number of other domains such as optimization [Osborne et al., 2009], quadrature [Ghahramani & Rasmussen, 2002], dimensionality reduction [Lawrence, 2004] and preference learning [Chu & Ghahramani, 2005b]. Using information-theoretic active learning with GPs appears to be challenging because their parameter space is infinite dimensional. However, with BALD (Section 2.3) we can calculate posterior information gains accurately without having to compute entropies of infinite dimensional objects.

In this chapter we first provide a brief introduction to GPs; for full details see Rasmussen & Williams [2005]. In Section 3.2 we demonstrate how BALD may be applied to Gaussian process classification (GPC). In vanilla Gaussian process regression (GPR) the observation noise is constant over the input domain. This is not true in GPC, which makes active learning more difficult. Other active learning algorithms that work with predictions, such as maximum entropy sampling (MES), confound posterior uncertainty with inherent noise. BALD provides a principled and intuitive balance between these sources of uncertainty in GPC. Furthermore, unlike in GPR, inference in GPC is intractable and so GPC requires more expensive inference routines. Therefore, the reduction in the number of posterior updates for N candidates and l possible labels from O(Nl) to O(1) when using BALD (see Section 2.3.2) is important with GPC.

Like most models, Gaussian processes have additional parameters, known as hyperparameters. Obtaining good performance with GPs requires appropriate hyperparameter management. Typically, these are optimized using type-II maximum likelihood [Rasmussen & Williams, 2005], but this method can perform poorly and integration over the hyperparameters yields better predictions [Garnett et al., 2010]. This problem is particularly important in active learning, which usually works in the low-data regime (labels are expensive). Hence, ignoring uncertainty estimates may cause extreme overfitting of the GP. In particular, in GPR maximizing information gain has been found to yield poor performance after the first couple of samples when the hyperparameters are fixed [Seeger et al., 2003; Seo et al., 2000]. In Section 3.3 we address this problem by combining BALD for ‘focused’ learning of particular variables of interest with a new algorithm for approximate hyperparameter marginalization, the Marginal GP [Garnett et al., 2013]. Using these techniques we provide a complete pipeline for active GPR with unknown hyperparameters.

3.1 Primer on Gaussian Processes

Informally, Gaussian processes provide a distribution over a broad class of functions. The probabilistic model underlying GPR and GPC is

\begin{align}
\text{prior:} \quad & p(f) = \mathcal{GP}(\mu(\cdot), k(\cdot, \cdot)), \tag{3.1} \\
\text{regression likelihood:} \quad & p(y \mid x, f) = \mathcal{N}(y; f(x), \sigma^2), \tag{3.2} \\
\text{classification likelihood:} \quad & p(y \mid x, f) = \mathrm{Bernoulli}(\Phi(f(x))). \tag{3.3}
\end{align}

The latent parameter for this model, $f$, is a function $\mathcal{X} \to \mathbb{R}$. A Gaussian process prior on this function is fully specified by a mean function $\mu : \mathcal{X} \to \mathbb{R}$ and a covariance function or kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Under the GP prior, the marginal of $f$ evaluated at any finite set of points $\{x_1, \ldots, x_n\}$ follows a multivariate Gaussian distribution with mean $m$, whose components are $m_i = \mu(x_i)$, and covariance matrix $\Sigma$, where $\Sigma_{ij} = k(x_i, x_j)$.
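As an illustration of this finite-dimensional marginal, the sketch below builds the mean vector and covariance matrix for a set of inputs and draws functions from the prior. The squared-exponential kernel, the zero mean function, the jitter term and all function names are assumptions made for the example; the text does not fix a particular kernel.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on 1-D inputs (an assumed choice of k)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_prior_marginal(x, mean_fn, kernel_fn):
    """Mean vector m_i = mu(x_i) and covariance Sigma_ij = k(x_i, x_j)."""
    return mean_fn(x), kernel_fn(x, x)

x = np.linspace(-3.0, 3.0, 25)
m, Sigma = gp_prior_marginal(x, np.zeros_like, rbf_kernel)

# Draw three functions from the prior, evaluated at x, via a Cholesky factor.
# The small diagonal jitter is a numerical safeguard, not part of the model.
rng = np.random.default_rng(0)
chol = np.linalg.cholesky(Sigma + 1e-6 * np.eye(len(x)))
samples = m + (chol @ rng.standard_normal((len(x), 3))).T  # shape (3, 25)
```

Each row of `samples` is one draw of the function values at the 25 inputs; a smaller lengthscale would give more rapidly varying draws.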

For regression, the output variable $y$ is modelled directly using $f$ plus additive Gaussian noise. For classification we consider the probit likelihood. Here, given the value of $f$, $y$ takes a Bernoulli distribution with parameter $\Phi(f(x))$, where $\Phi(\cdot)$ is the standard Gaussian c.d.f. (probit function). As an alternative one can use a logistic likelihood, but in practice there is little difference in performance [Rasmussen & Williams, 2005].

In GPR, the Gaussian process prior (3.1) is conjugate to the Gaussian likelihood (3.2), so inference is tractable. Exponential family likelihoods, such as the Gaussian, have conjugate priors. These priors yield tractable posterior distributions that are in the same family as the prior; see Bishop [2006], Chapter 2, for details. However, the GP prior is not conjugate to the classification likelihoods. Exact inference is therefore intractable: given some observations $\mathcal{D}$, the posterior over $f$ is non-Gaussian. There are a number of approximate inference methods. The most common of these, namely expectation propagation (EP) [Minka, 2001a], the Laplace approximation [Kass & Raftery, 1995], assumed density filtering [Ito & Xiong, 2000] and sparse methods [Naish-Guzman & Holden, 2007], all approximate the posterior by a Gaussian. Throughout we will assume that such a Gaussian approximation is provided, though the active learning algorithm does not care which. We will denote the use of such approximate inference by $\overset{1}{\approx}$.
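For the conjugate regression case described above, the Gaussian posterior over the latent function is available in closed form. The following is a minimal sketch using the standard GP regression equations, assuming a zero prior mean and a squared-exponential kernel; the function names and the toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on 1-D inputs (an assumed choice of k)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gpr_posterior(x_train, y_train, x_test, noise_var, kernel_fn):
    """Posterior mean and covariance of f at x_test for a zero-mean GP prior
    and Gaussian likelihood N(y; f(x), noise_var)."""
    K = kernel_fn(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = kernel_fn(x_train, x_test)    # cross-covariance, train vs test
    K_ss = kernel_fn(x_test, x_test)    # prior covariance at test inputs
    chol = np.linalg.cholesky(K)
    # alpha = K^{-1} y, computed with two triangular solves.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, K_s)
    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov

# Toy data: three noisy-free evaluations of sin(x), used only for illustration.
x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
x_test = np.array([0.0, 10.0])          # one input near the data, one far away
mean, cov = gpr_posterior(x_train, y_train, x_test, 0.01, rbf_kernel)
```

Near the data the posterior variance shrinks well below the prior variance, whereas far from the data the posterior reverts to the prior. For GPC no such closed form exists, which is why a Gaussian approximation is assumed to be supplied.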

After performing inference, given a Gaussian (exact or approximate) posterior, the predictive distribution at a new point $x_\star$ is computed by integrating over the latent function,

\begin{equation}
p(y \mid x_\star, \mathcal{D}) = \int p(y \mid f_{x_\star})\, p(f_{x_\star} \mid \mathcal{D})\, \mathrm{d}f_{x_\star}, \tag{3.4}
\end{equation}

where $f_x = f(x)$. With both the regression and classification likelihoods given in Equations (3.2) and (3.3) respectively, this computation can be performed analytically. We now show how BALD may be used for active GPC.
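The analytic forms of this integral are not written out above, so the sketch below states them as standard Gaussian identities: for a posterior marginal N(f; mu, var), the regression predictive is N(y; mu, var + sigma^2) and the probit predictive is Phi(mu / sqrt(1 + var)). The code checks the probit identity against Gauss-Hermite quadrature. The function names and the values of `mu` and `var` are made up for illustration.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard Gaussian c.d.f. Phi(z), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def regression_predictive(mu, var, noise_var):
    """Mean and variance of p(y | x*, D) under the Gaussian likelihood."""
    return mu, var + noise_var

def probit_predictive(mu, var):
    """Closed form of p(y = 1 | x*, D) = int Phi(f) N(f; mu, var) df."""
    return float(std_normal_cdf(mu / math.sqrt(1.0 + var)))

def probit_predictive_quadrature(mu, var, degree=64):
    """The same integral by Gauss-Hermite quadrature, as a numerical check."""
    nodes, weights = np.polynomial.hermite.hermgauss(degree)
    f = mu + math.sqrt(2.0 * var) * nodes   # change of variables for exp(-t^2)
    return float(np.sum(weights * std_normal_cdf(f)) / math.sqrt(math.pi))

mu, var = 0.7, 1.0                          # hypothetical posterior marginal
closed_form = probit_predictive(mu, var)
numeric = probit_predictive_quadrature(mu, var)
```

The two values should agree to within quadrature error. With a logistic likelihood no such closed form is available, which is one practical reason to prefer the probit likelihood here.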