In this section we present a novel network-based sparse Bayesian classifier that effectively makes use of information on feature dependencies to improve the prediction accuracy of the model and the capacity to identify features that are relevant for classification. Consider a supervised learning task in which
$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ is a set of training instances with features $\mathbf{x}_i \in \mathbb{R}^{d+1}$ and class labels $y_i \in \{-1, 1\}$. The zeroth component of every $\mathbf{x}_i$ is constant and equal to 1. The objective is to build the linear classifier $\mathbf{w} = (w_0, \ldots, w_d)^{\mathrm{T}}$ that optimally separates instances of different classes. Following Herbrich et al. (2001), we assume the existence of a "true" parameter vector $\mathbf{w}_{\text{true}} \in \mathbb{R}^{d+1}$ that has been used to label the data according to the rule $y_i = \text{sign}(\mathbf{w}_{\text{true}}^{\mathrm{T}} \mathbf{x}_i)$. Since, in the general case, the data need not be linearly separable, we consider the possibility that some of the class labels $y_i$ have been flipped with probability $\epsilon$. Given these assumptions, the likelihood for $\mathbf{w}$ given $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^{\mathrm{T}}$, $\mathbf{y} = (y_1, \ldots, y_n)^{\mathrm{T}}$ and $\epsilon$ is
$$\mathcal{P}(\mathbf{y} \mid \mathbf{w}, \epsilon, \mathbf{X}) = \prod_{i=1}^{n} \mathcal{P}(y_i \mid \mathbf{w}, \epsilon, \mathbf{x}_i) = \prod_{i=1}^{n} \left[ \epsilon \left( 1 - \Theta(y_i \mathbf{w}^{\mathrm{T}} \mathbf{x}_i) \right) + (1 - \epsilon)\, \Theta(y_i \mathbf{w}^{\mathrm{T}} \mathbf{x}_i) \right], \tag{5.6}$$
where $\Theta$ is the Heaviside step function. Note that (5.6) is robust to outliers because it depends only on the number of errors of $\mathbf{w}$ on the training set and not on the actual size of these errors. The problems that we are interested in are characterized by a high-dimensional feature space and a small number of training instances ($d \gg n$). This is an under-determined scenario in which many different parameter vectors fit the data equally well. To break this symmetry, we introduce a prior distribution for $\mathbf{w}$ that captures our expectation that some particular values of the parameters are more likely than others. Specifically, we assume that only a small subset of the components of $\mathbf{x}_i$ are actually relevant for predicting the class label $y_i$. Thus, $\mathbf{w}_{\text{true}}$ is assumed
to be a sparse vector with only a few non-zero components. To incorporate this expectation, we follow George and McCulloch (1997) and introduce a vector of binary latent variables $\mathbf{z} = (z_0, \ldots, z_d)^{\mathrm{T}} \in \{-1, 1\}^{d+1}$, where $z_i = 1$ if the $i$-th component of $\mathbf{w}_{\text{true}}$ is different from zero and $z_i = -1$ otherwise. Assuming that $\mathbf{z}$ is known, the spike and slab prior density for $\mathbf{w}$ is
$$\mathcal{P}(\mathbf{w} \mid \mathbf{z}) = \prod_{i=0}^{d} \left[ \frac{z_i + 1}{2}\, \mathcal{N}(w_i \mid 0, \sigma_i^2) + \frac{1 - z_i}{2}\, \delta(w_i) \right], \tag{5.7}$$
where $\mathcal{N}(x \mid \mu, \sigma^2)$ is a Gaussian density with mean $\mu$ and variance $\sigma^2$ (the slab), $\delta(w_i)$ is a Dirac delta function (the spike), which corresponds to a point probability mass for $w_i$ at zero, $\sigma_1^2, \ldots, \sigma_d^2$ are equal to 1 and $\sigma_0^2$ is equal to 100. $\sigma_0^2$ is much larger than $\sigma_1^2, \ldots, \sigma_d^2$ to guarantee that the prior for the bias parameter $w_0$ is not informative. To complete the specification of the
prior for $\mathbf{w}$, we assume that a network that encodes the dependencies between features is known. This network is an undirected graph $G = (V, E)$ whose vertices $V = \{0, \ldots, d\}$ correspond to instance features and whose edges, $E$, link features that are expected to be either both excluded from or both included in the classification model. Given $G$, the prior probability for $\mathbf{z}$ is represented by a Markov random field (MRF) model (Bishop, 2006; Wei and Li, 2007). This is the main reason for choosing the latent variables $z_i$ to take values in $\{-1, 1\}$, which is the standard notation used
in Markov random fields. The corresponding prior probability for z is
$$\mathcal{P}(\mathbf{z} \mid G, \alpha, \beta) = \frac{1}{Z} \exp\left( 10 z_0 + \alpha \sum_{i=1}^{d} z_i \right) \exp\left( \beta \sum_{\{j,k\} \in E} z_j z_k \right), \tag{5.8}$$
where $Z$ is a normalization constant, $\alpha \in \mathbb{R}$ controls the level of sparsity in the model, $\beta \geq 0$ determines the correlation between $z_j$ and $z_k$ when features $j$ and $k$ are linked in $G$, and the constant 10 reflects our expectation that $z_0 = 1$ is much more likely than $z_0 = -1$, so that the prior for $w_0$ does not favor solutions for which this bias coefficient is zero. Following the prescription given by Hernández-Lobato and Hernández-Lobato (2008), the prior for $\epsilon$ is
$$\mathcal{P}(\epsilon) = \text{Beta}(\epsilon \mid a_0, b_0) = \frac{1}{B(a_0, b_0)}\, \epsilon^{a_0 - 1} (1 - \epsilon)^{b_0 - 1}, \tag{5.9}$$
where $B(a_0, b_0)$ is the beta function evaluated at $a_0$ and $b_0$. The results obtained are not very sensitive to the values of these hyper-parameters, provided that they are consistent with the assumption that most of the training data are correctly labeled. Specifically, the choice made in the experiments presented in Section 5.5, $a_0 = 1$ and $b_0 = 9$, gives a prior mean of $a_0 / (a_0 + b_0) = 0.1$ for $\epsilon$, which is equivalent to assuming that, on average, one out of ten data instances is mislabeled. This prior expresses a moderate level of confidence that the labels of the training data are correct.
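To make the model specification concrete, the four ingredients (5.6) to (5.9) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function names, the plain-list data representation and the convention $\Theta(0) = 0$ are choices made here for illustration. Because the Dirac delta in (5.7) has no finite density value, the sketch draws $\mathbf{w}$ from the spike and slab prior instead of evaluating it, and the MRF prior (5.8) is evaluated only up to the constant $-\log Z$.

```python
import math
import random


def heaviside(a):
    # Assumed convention (not fixed by the text): Theta(a) = 1 if a > 0, else 0.
    return 1.0 if a > 0 else 0.0


def log_likelihood(w, eps, X, y):
    """Log of (5.6) for 0 < eps < 1: a misclassified instance
    contributes eps, a correctly classified one contributes 1 - eps."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))
        theta = heaviside(margin)
        total += math.log(eps * (1.0 - theta) + (1.0 - eps) * theta)
    return total


def sample_w_given_z(z, rng, slab_var=1.0, bias_var=100.0):
    """Draw w from the spike and slab prior (5.7) for a fixed z."""
    w = []
    for i, z_i in enumerate(z):
        var = bias_var if i == 0 else slab_var
        # z_i = 1 selects the Gaussian slab, z_i = -1 the point mass at zero.
        w.append(rng.gauss(0.0, math.sqrt(var)) if z_i == 1 else 0.0)
    return w


def log_mrf_unnormalized(z, edges, alpha, beta):
    """Log of (5.8) without the normalization term -log Z."""
    field = 10.0 * z[0] + alpha * sum(z[1:])
    coupling = beta * sum(z[j] * z[k] for j, k in edges)
    return field + coupling


def log_beta_prior(eps, a0=1.0, b0=9.0):
    """Log of the Beta density (5.9), for 0 < eps < 1."""
    log_norm = math.lgamma(a0) + math.lgamma(b0) - math.lgamma(a0 + b0)
    return ((a0 - 1.0) * math.log(eps)
            + (b0 - 1.0) * math.log(1.0 - eps)
            - log_norm)


if __name__ == "__main__":
    rng = random.Random(0)
    z = [1, 1, -1, -1]            # hypothetical relevance pattern, d = 3
    edges = [(1, 2), (2, 3)]      # hypothetical feature network
    w = sample_w_given_z(z, rng)  # sampled, not scored (see lead-in)
    X = [[1.0, 0.5, -1.2, 0.3], [1.0, -0.7, 0.4, 1.1]]  # toy data
    y = [1, -1]
    eps = 0.1
    print(log_likelihood(w, eps, X, y))
    print(log_mrf_unnormalized(z, edges, alpha=-1.0, beta=0.5))
    print(log_beta_prior(eps))
```

In this sketch the log-likelihood reduces to $m \log \epsilon + (n - m) \log(1 - \epsilon)$, where $m$ is the number of training errors of $\mathbf{w}$, which makes explicit why (5.6) is insensitive to the size of individual errors.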
Once the specification of the network-based sparse Bayesian classifier (NBSBC) is made, Bayes’ theorem can be used to compute the posterior distribution of the model parameters w and ε given the training data X and y. Assuming that the network G and the model hyperparameters α and β are known, the posterior is given by
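$$\mathcal{P}(\mathbf{w}, \epsilon \mid \mathbf{y}, \mathbf{X}, G, \alpha, \beta) = \frac{\sum_{\mathbf{z}} \mathcal{P}(\mathbf{y} \mid \mathbf{w}, \epsilon, \mathbf{X})\, \mathcal{P}(\mathbf{w} \mid \mathbf{z})\, \mathcal{P}(\mathbf{z} \mid G, \alpha, \beta)\, \mathcal{P}(\epsilon)}{\mathcal{P}(\mathbf{y} \mid \mathbf{X}, G, \alpha, \beta)}. \tag{5.10}$$
The denominator in (5.10) is a normalization constant that is known as the model evidence. This constant can be used for model selection (Bishop, 2006; MacKay, 1992). Given an unlabeled test instance $\mathbf{x}_{\text{test}}$, the predictive distribution for the corresponding class label $y_{\text{test}}$ is
$$\mathcal{P}(y_{\text{test}} \mid \mathbf{x}_{\text{test}}, \mathbf{y}, \mathbf{X}, G, \alpha, \beta) = \int\!\!\int \mathcal{P}(y_{\text{test}} \mid \mathbf{w}, \epsilon, \mathbf{x}_{\text{test}})\, \mathcal{P}(\mathbf{w}, \epsilon \mid \mathbf{y}, \mathbf{X}, G, \alpha, \beta)\, d\mathbf{w}\, d\epsilon. \tag{5.11}$$
An advantage of this approach is that the relevance of the features can be quantified by the posterior of $\mathbf{z}$
$$\mathcal{P}(\mathbf{z} \mid \mathbf{y}, \mathbf{X}, G, \alpha, \beta) = \frac{\int\!\!\int \mathcal{P}(\mathbf{y} \mid \mathbf{w}, \epsilon, \mathbf{X})\, \mathcal{P}(\mathbf{w} \mid \mathbf{z})\, \mathcal{P}(\mathbf{z} \mid G, \alpha, \beta)\, \mathcal{P}(\epsilon)\, d\mathbf{w}\, d\epsilon}{\mathcal{P}(\mathbf{y} \mid \mathbf{X}, G, \alpha, \beta)}. \tag{5.12}$$
Specifically, the relevance of the $i$-th feature is a number between 0 and 1 given by the marginal probability of the event $z_i = 1$ under the posterior (5.12). Finally, this Bayesian framework also makes it possible to estimate the level of noise in the class labels as the average of $\epsilon$ over its posterior distribution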
$$\bar{\epsilon} = \int\!\!\int \epsilon\, \mathcal{P}(\mathbf{w}, \epsilon \mid \mathbf{y}, \mathbf{X}, G, \alpha, \beta)\, d\mathbf{w}\, d\epsilon. \tag{5.13}$$
This quantity also provides an estimate of the generalization error of NBSBC. Unfortunately, the sums and integrals in (5.10), (5.11), (5.12) and (5.13) are in most cases too costly to be practicable. For this reason, to implement the NBSBC model, it is necessary to resort to approximate methods for Bayesian inference. In this chapter, an approximation of the joint distribution