Network-based Sparse Bayesian Classification

In this section we present a novel network-based sparse Bayesian classifier that uses information on feature dependencies to improve both the prediction accuracy of the model and its capacity to identify the features that are relevant for classification. Consider a supervised learning task in which $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ is a set of training instances with features $\mathbf{x}_i \in \mathbb{R}^{d+1}$ and class labels $y_i \in \{-1, 1\}$. The zeroth component of every $\mathbf{x}_i$ is constant and equal to 1. The objective is to build the linear classifier $\mathbf{w} = (w_0, \ldots, w_d)^{\mathrm{T}}$ that optimally separates instances of different classes. Following Herbrich et al. (2001), we assume the existence of a "true" parameter vector $\mathbf{w}_{\text{true}} \in \mathbb{R}^{d+1}$ that has been used to label the data according to the rule $y_i = \operatorname{sign}(\mathbf{w}_{\text{true}}^{\mathrm{T}} \mathbf{x}_i)$. Since, in general, the data need not be linearly separable, we consider the possibility that each class label $y_i$ has been flipped with probability $\varepsilon$. Given these assumptions, the likelihood for $\mathbf{w}$ given $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^{\mathrm{T}}$, $\mathbf{y} = (y_1, \ldots, y_n)^{\mathrm{T}}$ and $\varepsilon$ is

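$$\mathcal{P}(\mathbf{y} \mid \mathbf{w}, \varepsilon, \mathbf{X}) = \prod_{i=1}^{n} \mathcal{P}(y_i \mid \mathbf{w}, \varepsilon, \mathbf{x}_i) = \prod_{i=1}^{n} \left[ \varepsilon \left( 1 - \Theta(y_i \mathbf{w}^{\mathrm{T}} \mathbf{x}_i) \right) + (1 - \varepsilon) \, \Theta(y_i \mathbf{w}^{\mathrm{T}} \mathbf{x}_i) \right], \tag{5.6}$$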

where $\Theta$ is the Heaviside step function. Note that (5.6) is robust to outliers because it depends only on the number of errors that $\mathbf{w}$ makes on the training set and not on the size of these errors. The problems that we are interested in are characterized by a high-dimensional feature space and a small number of training instances ($d \gg n$). This is an under-determined scenario in which many different parameter vectors fit the data equally well. To break this symmetry, we introduce a prior distribution for $\mathbf{w}$ that captures our expectation that some particular values of the parameters are more likely than others. Specifically, we assume that only a small subset of the components of $\mathbf{x}_i$ are actually relevant for predicting the class label $y_i$. Thus, $\mathbf{w}_{\text{true}}$ is assumed to be a sparse vector with only a few non-zero components. To incorporate this expectation, we follow George and McCulloch (1997) and introduce a vector of binary latent variables $\mathbf{z} = (z_0, \ldots, z_d)^{\mathrm{T}} \in \{-1, 1\}^{d+1}$, where $z_i = 1$ if the $i$-th component of $\mathbf{w}_{\text{true}}$ is different from zero and $z_i = -1$ otherwise. Assuming that $\mathbf{z}$ is known, the spike and slab prior density for $\mathbf{w}$ is

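$$\mathcal{P}(\mathbf{w} \mid \mathbf{z}) = \prod_{i=0}^{d} \left[ \frac{z_i + 1}{2} \, \mathcal{N}(w_i \mid 0, \sigma_i^2) + \frac{1 - z_i}{2} \, \delta(w_i) \right], \tag{5.7}$$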

where $\mathcal{N}(x \mid \mu, \sigma^2)$ is a Gaussian density with mean $\mu$ and variance $\sigma^2$ (the slab), $\delta(w_i)$ is a Dirac delta function (the spike), which corresponds to a point probability mass for $w_i$ at zero, $\sigma_1^2, \ldots, \sigma_d^2$ are equal to 1 and $\sigma_0^2$ is equal to 100. The variance $\sigma_0^2$ is much larger than $\sigma_1^2, \ldots, \sigma_d^2$ to guarantee that the prior for the bias parameter $w_0$ is not informative. To complete the specification of the prior for $\mathbf{w}$ we assume that a network that encodes the dependencies between features is known. This network is an undirected graph $G = (V, E)$ whose vertices $V = \{0, \ldots, d\}$ correspond to instance features and whose edges, $E$, link features that are expected to be either both excluded from or both included in the classification model. Given $G$, the prior probability for $\mathbf{z}$ is represented by a Markov random field (MRF) model (Bishop, 2006; Wei and Li, 2007). This is the main reason for choosing the latent variables $z_i$ to take values in $\{-1, 1\}$, which is the standard notation used in Markov random fields. The corresponding prior probability for $\mathbf{z}$ is

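$$\mathcal{P}(\mathbf{z} \mid G, \alpha, \beta) = \frac{1}{Z} \exp\left( 10 z_0 + \alpha \sum_{i=1}^{d} z_i \right) \exp\left( \beta \sum_{\{j,k\} \in E} z_j z_k \right), \tag{5.8}$$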

where $Z$ is a normalization constant, $\alpha \in \mathbb{R}$ controls the level of sparsity in the model, $\beta \geq 0$ determines the correlation between $z_j$ and $z_k$ when features $j$ and $k$ are linked in $G$, and the constant 10 reflects our expectation that $z_0 = 1$ is much more likely than $z_0 = -1$, so that the prior for $w_0$ does not favor solutions for which this bias coefficient is zero. Following the prescription given by Hernández-Lobato and Hernández-Lobato (2008), the prior for $\varepsilon$ is

$$\mathcal{P}(\varepsilon) = \text{Beta}(\varepsilon \mid a_0, b_0) = \frac{1}{B(a_0, b_0)} \, \varepsilon^{a_0 - 1} (1 - \varepsilon)^{b_0 - 1}, \tag{5.9}$$

where $B(a_0, b_0)$ is the beta function evaluated at $a_0$ and $b_0$. The results obtained are not very sensitive to the values of these hyper-parameters, provided that they are consistent with the assumption that most of the training data are correctly labeled. Specifically, the choice made in the experiments presented in Section 5.5, $a_0 = 1$ and $b_0 = 9$, gives a prior mean of $a_0 / (a_0 + b_0) = 0.1$ for $\varepsilon$, which is equivalent to assuming that one out of ten data instances is mislabeled. This prior expresses moderate confidence that the labels of the training data are correct.
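The three model components defined so far, (5.6), (5.7) and (5.8), can be written down directly in a few lines of NumPy. The following is only a minimal sketch under stated assumptions: the function names (`likelihood`, `sample_w_given_z`, `log_mrf_unnormalized`, `mrf_table`) are hypothetical and not taken from the original work, the convention $\Theta(0) = 0$ is an assumption, and the normalization constant $Z$ is obtained by enumerating all $2^{d+1}$ configurations of $\mathbf{z}$, which is feasible only for very small $d$.

```python
import itertools
import numpy as np

def likelihood(w, eps, X, y):
    """Eq. (5.6): a label agrees with sign(w^T x) w.p. 1 - eps, is flipped w.p. eps."""
    # Heaviside step of y_i * w^T x_i; Theta(0) = 0 is an assumed convention.
    correct = (y * (X @ w)) > 0
    return float(np.prod(np.where(correct, 1.0 - eps, eps)))

def sample_w_given_z(z, rng, slab_var):
    """Eq. (5.7): w_i ~ N(0, slab_var[i]) if z_i = +1, and w_i = 0 if z_i = -1."""
    slab = rng.normal(0.0, np.sqrt(slab_var))
    return np.where(z == 1, slab, 0.0)

def log_mrf_unnormalized(z, edges, alpha, beta):
    """Logarithm of Eq. (5.8) without the constant -log Z."""
    pair_term = sum(z[j] * z[k] for j, k in edges)
    return 10.0 * z[0] + alpha * z[1:].sum() + beta * pair_term

def mrf_table(d, edges, alpha, beta):
    """Normalized prior over all 2^(d+1) states of z (tiny d only)."""
    states = np.array(list(itertools.product([-1, 1], repeat=d + 1)))
    logp = np.array([log_mrf_unnormalized(z, edges, alpha, beta) for z in states])
    p = np.exp(logp - logp.max())          # subtract the max for numerical stability
    return states, p / p.sum()

# Illustrative usage on a toy graph with d = 3 features and one edge {1, 2}.
rng = np.random.default_rng(0)
states, prior = mrf_table(d=3, edges=[(1, 2)], alpha=-1.0, beta=1.0)
z = states[rng.choice(len(states), p=prior)]
w = sample_w_given_z(z, rng, slab_var=np.array([100.0, 1.0, 1.0, 1.0]))
```

Because the likelihood depends on $\mathbf{w}$ only through the number of training errors, `likelihood` reduces to a product of factors $1 - \varepsilon$ and $\varepsilon$, one per instance. The enumeration in `mrf_table` is used here purely for illustration; it does not scale to the high-dimensional problems this section targets.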

Once the specification of the network-based sparse Bayesian classifier (NBSBC) is made, Bayes’ theorem can be used to compute the posterior distribution of the model parameters w and ε given the training data X and y. Assuming that the network G and the model hyperparameters α and β are known, the posterior is given by

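$$\mathcal{P}(\mathbf{w}, \varepsilon \mid \mathbf{y}, \mathbf{X}, G, \alpha, \beta) = \frac{\sum_{\mathbf{z}} \mathcal{P}(\mathbf{y} \mid \mathbf{w}, \varepsilon, \mathbf{X}) \, \mathcal{P}(\mathbf{w} \mid \mathbf{z}) \, \mathcal{P}(\mathbf{z} \mid G, \alpha, \beta) \, \mathcal{P}(\varepsilon)}{\mathcal{P}(\mathbf{y} \mid \mathbf{X}, G, \alpha, \beta)}. \tag{5.10}$$

The denominator in (5.10) is a normalization constant that is known as the model evidence. This constant can be used for model selection (Bishop, 2006; MacKay, 1992). Given an unlabeled test instance $\mathbf{x}_{\text{test}}$, the predictive distribution for the corresponding class label $y_{\text{test}}$ is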

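$$\mathcal{P}(y_{\text{test}} \mid \mathbf{x}_{\text{test}}, \mathbf{y}, \mathbf{X}, G, \alpha, \beta) = \int\!\!\int \mathcal{P}(y_{\text{test}} \mid \mathbf{w}, \varepsilon, \mathbf{x}_{\text{test}}) \, \mathcal{P}(\mathbf{w}, \varepsilon \mid \mathbf{y}, \mathbf{X}, G, \alpha, \beta) \, d\mathbf{w} \, d\varepsilon. \tag{5.11}$$

An advantage of this approach is that the relevance of the features can be quantified by the posterior of $\mathbf{z}$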

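$$\mathcal{P}(\mathbf{z} \mid \mathbf{y}, \mathbf{X}, G, \alpha, \beta) = \frac{\int\!\!\int \mathcal{P}(\mathbf{y} \mid \mathbf{w}, \varepsilon, \mathbf{X}) \, \mathcal{P}(\mathbf{w} \mid \mathbf{z}) \, \mathcal{P}(\mathbf{z} \mid G, \alpha, \beta) \, \mathcal{P}(\varepsilon) \, d\mathbf{w} \, d\varepsilon}{\mathcal{P}(\mathbf{y} \mid \mathbf{X}, G, \alpha, \beta)}. \tag{5.12}$$

Specifically, the relevance of the $i$-th feature is a number between 0 and 1 given by the marginal probability of the event $z_i = 1$ under the posterior (5.12). Finally, this Bayesian framework also makes it possible to estimate the level of noise in the class labels as the average of $\varepsilon$ over its posterior distribution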

$$\bar{\varepsilon} = \int\!\!\int \varepsilon \, \mathcal{P}(\mathbf{w}, \varepsilon \mid \mathbf{y}, \mathbf{X}, G, \alpha, \beta) \, d\mathbf{w} \, d\varepsilon. \tag{5.13}$$

This quantity also provides an estimate of the generalization error of NBSBC. Unfortunately, the sums and integrals in (5.10), (5.11), (5.12) and (5.13) are in most cases too costly to be practicable. For this reason, to implement the NBSBC model, it is necessary to resort to approximate methods for Bayesian inference. In this chapter, the joint distribution $\mathcal{P}(\mathbf{w}, \varepsilon, \mathbf{z}, \mathbf{y} \mid \mathbf{X}, G, \alpha, \beta)$ is approximated by a simpler unnormalized distribution $\mathcal{Q}(\mathbf{w}, \varepsilon, \mathbf{z})$ that belongs to the exponential family. The distribution $\mathcal{Q}$ is computed using the expectation propagation (EP) algorithm (Bishop, 2006; Minka, 2001). Once $\mathcal{Q}$ is known, the previous sums and integrals can be computed in a straightforward manner.
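To make the quantities in (5.12) and (5.13) concrete, they can be approximated on a toy problem without EP. The sketch below is not the EP procedure used in this chapter: it swaps in plain importance sampling, drawing $(\mathbf{z}, \mathbf{w}, \varepsilon)$ from the priors (5.8), (5.7) and (5.9) and weighting each draw by the likelihood (5.6). The function name `prior_importance_sampling` and its default arguments are hypothetical, $\Theta(0) = 0$ is an assumed convention, and the exact enumeration of $\mathbf{z}$ restricts the sketch to very small $d$.

```python
import itertools
import numpy as np

def prior_importance_sampling(X, y, edges, alpha, beta, a0=1.0, b0=9.0,
                              n_samples=20000, seed=0):
    """Estimate P(z_i = 1 | data) and the posterior mean of eps by sampling
    from the prior and weighting each sample by the likelihood (5.6)."""
    rng = np.random.default_rng(seed)
    n, n_coef = X.shape                      # n_coef = d + 1 (bias included)
    slab_var = np.ones(n_coef)
    slab_var[0] = 100.0                      # sigma_0^2 = 100, all others 1

    # Exact MRF prior (5.8) by enumerating all 2^(d+1) states of z.
    states = np.array(list(itertools.product([-1, 1], repeat=n_coef)))
    pair = np.zeros(len(states))
    for j, k in edges:
        pair += states[:, j] * states[:, k]
    logp = 10.0 * states[:, 0] + alpha * states[:, 1:].sum(axis=1) + beta * pair
    p = np.exp(logp - logp.max())
    p /= p.sum()

    # Draw z ~ (5.8), w | z ~ (5.7) and eps ~ (5.9).
    z = states[rng.choice(len(states), size=n_samples, p=p)]
    slab = rng.normal(0.0, np.sqrt(slab_var), size=z.shape)
    w = np.where(z == 1, slab, 0.0)
    eps = np.clip(rng.beta(a0, b0, size=n_samples), 1e-12, 1.0 - 1e-12)

    # Log of (5.6): it depends only on the number of correctly classified instances.
    n_ok = ((y[None, :] * (w @ X.T)) > 0).sum(axis=1)
    log_lik = n_ok * np.log1p(-eps) + (n - n_ok) * np.log(eps)
    weights = np.exp(log_lik - log_lik.max())
    weights /= weights.sum()

    relevance = weights @ (z == 1).astype(float)   # marginals of (5.12)
    eps_mean = float(weights @ eps)                # estimate of (5.13)
    return relevance, eps_mean

# Toy data: a bias column plus two features; only feature 1 determines the label.
rng = np.random.default_rng(1)
x = rng.normal(size=(40, 2))
X = np.column_stack([np.ones(40), x])
y = np.where(x[:, 0] > 0, 1, -1)
relevance, eps_mean = prior_importance_sampling(X, y, edges=[(1, 2)],
                                                alpha=0.0, beta=0.5)
```

Sampling from the prior wastes most draws when $d$ is large or the data are informative, so the estimates become unreliable well before the enumeration of $\mathbf{z}$ becomes infeasible. This is the practical motivation for replacing the exact sums and integrals with a tractable approximation.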
