Class-conditional Probabilities - Aspects of generative and discriminative classifiers

4.1.1 Class-conditional Probabilities

For binary classification, where $K = 2$, based on Bayes' Theorem, the Bayes discriminant criterion (i.e., $\hat{y}(\mathbf{x}) = \arg\max_y p(y \mid \mathbf{x})$) of the generative classifiers for classifying $\mathbf{x}$ into the group $y = 1$ can be written as $p(\mathbf{x}, y = 1) \geq p(\mathbf{x}, y = 2)$, or equivalently $p(y = 1)\,p(\mathbf{x} \mid y = 1) \geq p(y = 2)\,p(\mathbf{x} \mid y = 2)$. In addition, specific generative classifiers, such as linear normal-based discriminant analysis with a common diagonal covariance matrix (denoted by LDA-$\Lambda$) and the naïve Bayes classifier, assume that the $h$ features are conditionally independent given the group label $y$, i.e., $p(\mathbf{x} \mid y) = \prod_{i=1}^{h} p(x_i \mid y)$.
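As a minimal sketch of this criterion in the naïve Bayes case, the following Python snippet compares the two joint probabilities in log space. It assumes binary (Bernoulli) features; the function names and all numerical values are hypothetical and chosen only for illustration.

```python
import math

def log_joint(x, prior, feature_probs):
    """log p(x, y) = log p(y) + sum_i log p(x_i | y) for binary features.

    feature_probs[i] is the assumed probability p(x_i = 1 | y).
    """
    total = math.log(prior)
    for xi, pi in zip(x, feature_probs):
        total += math.log(pi if xi == 1 else 1.0 - pi)
    return total

def classify(x, priors, probs):
    """Assign x to group 1 if p(x, y=1) >= p(x, y=2), otherwise to group 2."""
    score1 = log_joint(x, priors[1], probs[1])
    score2 = log_joint(x, priors[2], probs[2])
    return 1 if score1 >= score2 else 2

# Hypothetical parameters for h = 3 features (illustration only).
priors = {1: 0.5, 2: 0.5}
probs = {1: [0.9, 0.8, 0.1], 2: [0.2, 0.3, 0.7]}

print(classify([1, 1, 0], priors, probs))  # 1
print(classify([0, 0, 1], priors, probs))  # 2
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when $h$ is large.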

In the normalised hybrid and the unnormalised hybrid algorithms proposed by Raina et al. (2003), the feature vector $\mathbf{x}$ is divided into $R$ partial feature vectors $\mathbf{x}^1, \ldots, \mathbf{x}^R$, because they suggest different levels of importance for different partitions, or partial feature vectors; for example, $\mathbf{x}^1$ may represent the message subject of an email while $\mathbf{x}^2$ represents the message body. As with Raina et al. (2003), we focus on $R = 2$, such that $\mathbf{x} = ((\mathbf{x}^1)^T, (\mathbf{x}^2)^T)^T$, $\mathbf{x}^1 = (x_1, \ldots, x_{h_1})^T$, $\mathbf{x}^2 = (x_{h_1+1}, \ldots, x_h)^T$ and $h_2 = h - h_1$, and assume that the discriminant criterion of the generative classifiers can be rewritten as

$$p(y = 1)\,p(\mathbf{x}^1 \mid y = 1)\,p(\mathbf{x}^2 \mid y = 1) \geq p(y = 2)\,p(\mathbf{x}^1 \mid y = 2)\,p(\mathbf{x}^2 \mid y = 2) .$$

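Thus, given $p(\mathbf{x}, y) \neq 0$, the corresponding discriminant function $\lambda_G(\mathbf{x}) = \log \frac{p(y = 1 \mid \mathbf{x})}{p(y = 2 \mid \mathbf{x})}$ can be expressed in terms of likelihood ratios as

$$\lambda_G(\mathbf{x}) = \log \frac{p(y = 1)}{p(y = 2)} + \log \frac{p(\mathbf{x}^1 \mid y = 1)}{p(\mathbf{x}^1 \mid y = 2)} + \log \frac{p(\mathbf{x}^2 \mid y = 1)}{p(\mathbf{x}^2 \mid y = 2)} .$$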
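The intermediate steps behind this expression can be sketched as follows. The first two lines use only Bayes' Theorem; the last line relies on the assumed factorised form of the discriminant criterion given above, so it is a consequence of that assumption rather than a general identity.

```latex
\begin{align*}
\lambda_G(\mathbf{x})
  &= \log\frac{p(y=1\mid\mathbf{x})}{p(y=2\mid\mathbf{x})}
   = \log\frac{p(y=1)\,p(\mathbf{x}\mid y=1)/p(\mathbf{x})}
              {p(y=2)\,p(\mathbf{x}\mid y=2)/p(\mathbf{x})} \\
  &= \log\frac{p(y=1)}{p(y=2)} + \log\frac{p(\mathbf{x}\mid y=1)}{p(\mathbf{x}\mid y=2)} \\
  &= \log\frac{p(y=1)}{p(y=2)}
   + \log\frac{p(\mathbf{x}^1\mid y=1)}{p(\mathbf{x}^1\mid y=2)}
   + \log\frac{p(\mathbf{x}^2\mid y=1)}{p(\mathbf{x}^2\mid y=2)} .
\end{align*}
```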

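Such a representation can be obtained by assuming the generative DGP

$$p(\mathbf{x} \mid y) = w(\mathbf{x}^1, \mathbf{x}^2)\,p(\mathbf{x}^1 \mid y)\,p(\mathbf{x}^2 \mid y) ,$$

where $w(\mathbf{x}^1, \mathbf{x}^2)$ can be regarded as a normalisation factor. However, if, for all $y$, $p(\mathbf{x}^1 \mid y)$ and $p(\mathbf{x}^2 \mid y)$ are proper marginal distributions derived from $p(\mathbf{x} \mid y)$ (i.e., $p(\mathbf{x}^1 \mid y) = \sum_{\mathbf{x}^2} p(\mathbf{x} \mid y)$, $p(\mathbf{x}^2 \mid y) = \sum_{\mathbf{x}^1} p(\mathbf{x} \mid y)$ and $\sum_{\mathbf{x}} p(\mathbf{x} \mid y) = \sum_{\mathbf{x}^1} p(\mathbf{x}^1 \mid y) = \sum_{\mathbf{x}^2} p(\mathbf{x}^2 \mid y) = 1$), then $w(\mathbf{x}^1, \mathbf{x}^2) \equiv 1$, given that there exists some $\mathbf{x}$ such that $p(\mathbf{x} \mid y = 1) \neq p(\mathbf{x} \mid y = 2)$. In other words, it leads to assuming conditional independence between the partial feature vectors $\mathbf{x}^1 \mid y$ and $\mathbf{x}^2 \mid y$ such that $p(\mathbf{x} \mid y) = p(\mathbf{x}^1 \mid y)\,p(\mathbf{x}^2 \mid y)$. In addition, to some extent, for a simple implementation in practice, Raina et al. (2003) further assume that $p(\mathbf{x}^1 \mid y) = \prod_{j=1}^{h_1} p(x_j \mid y)$ and $p(\mathbf{x}^2 \mid y) = \prod_{j=h_1+1}^{h} p(x_j \mid y)$; these imply the conditional independence of the elements within $\mathbf{x}^1$ and $\mathbf{x}^2$ given $y$, respectively.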

Raina et al. (2003) introduce two additional parameters $\theta_1$ and $\theta_2$ into the discriminant criterion, leading to different weights for different partial feature vectors in the discrimination. Two ways of weighting are proposed by Raina et al. (2003): one corresponds to assigning $\mathbf{x}$ to the group $y = 1$ if

$$p(y = 1)\,p(\mathbf{x}^1 \mid y = 1)^{\theta_1/h_1}\,p(\mathbf{x}^2 \mid y = 1)^{\theta_2/h_2} \geq p(y = 2)\,p(\mathbf{x}^1 \mid y = 2)^{\theta_1/h_1}\,p(\mathbf{x}^2 \mid y = 2)^{\theta_2/h_2} ,$$

which is the criterion (denoted by Criterion-H) corresponding to the normalised hybrid algorithm; the other gives

$$p(y = 1)\,p(\mathbf{x}^1 \mid y = 1)^{\theta_1}\,p(\mathbf{x}^2 \mid y = 1)^{\theta_2} \geq p(y = 2)\,p(\mathbf{x}^1 \mid y = 2)^{\theta_1}\,p(\mathbf{x}^2 \mid y = 2)^{\theta_2} ,$$

which is the criterion corresponding to the unnormalised hybrid algorithm. Without loss of generality, in this chapter we focus on the normalised hybrid algorithm.
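The difference between the two weightings can be illustrated with a small Python sketch that evaluates both criteria in log space. The function names, the email-style interpretation (a short subject of $h_1 = 5$ features and a long body of $h_2 = 100$ features) and all probabilities are hypothetical assumptions for illustration, not values from Raina et al. (2003).

```python
import math

def log_ratio(p_group1, p_group2):
    """log of p(. | y=1) / p(. | y=2), computed as a difference of logs."""
    return math.log(p_group1) - math.log(p_group2)

def hybrid_score(prior, lik1, lik2, theta, h=None):
    """Log-scale margin of the weighted criterion; assign to group 1 if >= 0.

    prior, lik1, lik2: pairs (value for y=1, value for y=2).
    theta: (theta1, theta2).
    h: (h1, h2) for the normalised criterion, None for the unnormalised one.
    """
    w1, w2 = theta
    if h is not None:
        w1, w2 = theta[0] / h[0], theta[1] / h[1]
    return log_ratio(*prior) + w1 * log_ratio(*lik1) + w2 * log_ratio(*lik2)

def classify(score):
    return 1 if score >= 0 else 2

# Hypothetical values: the subject mildly favours group 1,
# the much longer body favours group 2.
prior = (0.5, 0.5)
subject = (0.02, 0.01)     # p(x^1 | y=1), p(x^1 | y=2)
body = (1e-30, 1e-28)      # p(x^2 | y=1), p(x^2 | y=2)
theta = (1.0, 1.0)

unnormalised = hybrid_score(prior, subject, body, theta)
normalised = hybrid_score(prior, subject, body, theta, h=(5, 100))
print(classify(unnormalised), classify(normalised))  # 2 1
```

In this toy setting the long partial feature vector dominates the unnormalised criterion, whereas dividing each weight by the length of its partial feature vector lets the short one influence the decision. Setting $\theta = (h_1, h_2)^T$ in the normalised criterion gives unit exponents and so recovers the unweighted generative criterion.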

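Let us write $\theta = (\theta_1, \theta_2)^T$. Then the hybrid algorithm can be derived from

$$p_\theta(\mathbf{x} \mid y) = w_\theta(\mathbf{x}^1, \mathbf{x}^2)\,p(\mathbf{x}^1 \mid y)^{\theta_1/h_1}\,p(\mathbf{x}^2 \mid y)^{\theta_2/h_2} \quad \text{and} \quad p_\theta(\mathbf{x}, y) = p(y)\,p_\theta(\mathbf{x} \mid y) ,$$

where $w_\theta(\mathbf{x}^1, \mathbf{x}^2)$ is independent of the group $y$, so that it cancels out of Criterion-H, but it does not necessarily factorise further as $w_\theta(\mathbf{x}^1, \mathbf{x}^2) = w^1_\theta(\mathbf{x}^1)\,w^2_\theta(\mathbf{x}^2)$. However, in order to maintain $p_\theta(\mathbf{x} \mid y)$ as a proper probability distribution (so that Criterion-H is derived from a proper probabilistic model), with the marginal distributions $p(\mathbf{x}^1 \mid y) = \sum_{\mathbf{x}^2} p_\theta(\mathbf{x} \mid y)$ and $p(\mathbf{x}^2 \mid y) = \sum_{\mathbf{x}^1} p_\theta(\mathbf{x} \mid y)$, it is required that, for all $y$,

$$\sum_{\mathbf{x}^2} w_\theta(\mathbf{x}^1, \mathbf{x}^2)\,p(\mathbf{x}^2 \mid y)^{\theta_2/h_2} = p(\mathbf{x}^1 \mid y)^{1 - \theta_1/h_1} , \qquad \sum_{\mathbf{x}^1} w_\theta(\mathbf{x}^1, \mathbf{x}^2)\,p(\mathbf{x}^1 \mid y)^{\theta_1/h_1} = p(\mathbf{x}^2 \mid y)^{1 - \theta_2/h_2} .$$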
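The first of these two requirements can be obtained in one step from the marginalisation condition; the sketch below assumes $p(\mathbf{x}^1 \mid y) \neq 0$ so that the division is valid, and the second requirement follows in the same way with the roles of $\mathbf{x}^1$ and $\mathbf{x}^2$ exchanged.

```latex
\begin{align*}
p(\mathbf{x}^1\mid y)
  &= \sum_{\mathbf{x}^2} p_\theta(\mathbf{x}\mid y)
   = p(\mathbf{x}^1\mid y)^{\theta_1/h_1}
     \sum_{\mathbf{x}^2} w_\theta(\mathbf{x}^1,\mathbf{x}^2)\,
     p(\mathbf{x}^2\mid y)^{\theta_2/h_2} \\
\Longrightarrow\quad
\sum_{\mathbf{x}^2} w_\theta(\mathbf{x}^1,\mathbf{x}^2)\,
     p(\mathbf{x}^2\mid y)^{\theta_2/h_2}
  &= p(\mathbf{x}^1\mid y)^{1-\theta_1/h_1} .
\end{align*}
```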

In some cases, it might be difficult to validate the existence of such a $w_\theta(\mathbf{x}^1, \mathbf{x}^2)$, e.g., when $\theta_1/h_1 = 1$ while $\theta_2/h_2 \neq 1$ or vice versa, as the sums, in terms of $\mathbf{x}$, on the left-hand sides of the above equations have to become independent of $y$. In other cases, further assumptions might be needed to guarantee the existence. We illustrate this by assuming that $w_\theta(\mathbf{x}^1, \mathbf{x}^2)$ can be further factorised as $w_\theta(\mathbf{x}^1, \mathbf{x}^2) = w^1_\theta(\mathbf{x}^1)\,w^2_\theta(\mathbf{x}^2)$; in other words, we assume conditional independence between $\mathbf{x}^1 \mid y$ and $\mathbf{x}^2 \mid y$. It follows that

$$p_\theta(\mathbf{x} \mid y) = w^1_\theta(\mathbf{x}^1)\,p(\mathbf{x}^1 \mid y)^{\theta_1/h_1}\,w^2_\theta(\mathbf{x}^2)\,p(\mathbf{x}^2 \mid y)^{\theta_2/h_2} ,$$

which also leads to Criterion-H. One option for $w_\theta(\mathbf{x}^1, \mathbf{x}^2)$ is, for all $y$,

$$w^1_\theta(\mathbf{x}^1) = q(y)\,p(\mathbf{x}^1 \mid y)^{1 - \theta_1/h_1} , \qquad w^2_\theta(\mathbf{x}^2) = \frac{1}{q(y)}\,p(\mathbf{x}^2 \mid y)^{1 - \theta_2/h_2} ,$$

where $q(y)$ is a non-zero function used to cancel out terms in $y$ within $p(\mathbf{x}^1 \mid y)^{1 - \theta_1/h_1}$ and $p(\mathbf{x}^2 \mid y)^{1 - \theta_2/h_2}$. If such a $w_\theta(\mathbf{x}^1, \mathbf{x}^2)$ cannot be found, Criterion-H is not a Bayes discriminant criterion derived from a proper probabilistic model; nevertheless, in practice it can still be used as a criterion for discrimination, although in this case the hybrid algorithm is no longer a true Bayes classifier and, under a 0-1 loss function, it cannot provide a minimum Bayes error.

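Under Criterion-H, we classify $\mathbf{x}$ into $y = 1$ if $p_\theta(\mathbf{x}, y = 1) \geq p_\theta(\mathbf{x}, y = 2)$. Given $p_\theta(\mathbf{x}, y) \neq 0$, the discriminant function $\lambda_H(\mathbf{x})$ of the hybrid algorithm can be expressed in terms of weighted likelihood ratios as

$$\lambda_H(\mathbf{x}) = \log \frac{p(y = 1)}{p(y = 2)} + \frac{\theta_1}{h_1} \log \frac{p(\mathbf{x}^1 \mid y = 1)}{p(\mathbf{x}^1 \mid y = 2)} + \frac{\theta_2}{h_2} \log \frac{p(\mathbf{x}^2 \mid y = 1)}{p(\mathbf{x}^2 \mid y = 2)} .$$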

Therefore, $\lambda_H(\mathbf{x})$ can be viewed as a "weighted" version of the discriminant function $\lambda_G(\mathbf{x})$ of the generative classifier; however, as mentioned above, in theory the hybrid algorithm should satisfy some conditions about the marginal distributions in order to make the underlying model probabilistically valid. In addition, as with $\lambda_G(\mathbf{x})$, most parameters in $\lambda_H(\mathbf{x})$, such as those for $p(\mathbf{x}^1 \mid y)$ and $p(\mathbf{x}^2 \mid y)$, are learnt by using a generative approach; only a few parameters, such as the two weights $\theta_1$ and $\theta_2$, are then learnt by using a discriminative approach based on the learning results (about $p(\mathbf{x}^1 \mid y)$ and $p(\mathbf{x}^2 \mid y)$) from the generative approach. Therefore, the hybrid algorithm can be regarded as a generative classifier since it assumes the DGP $p(\mathbf{x} \mid y)$ and thus $p(\mathbf{x}, y)$.
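One way to picture the discriminative step is sketched below in Python. Because the hybrid discriminant function is the log posterior odds under the hybrid model, the posterior probability of group 1 is a logistic function of it, so the two weights can be fitted by maximising the conditional log-likelihood with the generatively estimated log-likelihood ratios held fixed. This is an illustrative assumption rather than the exact procedure of Raina et al. (2003): the plain gradient ascent, the step size, the function names and the toy data are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cond_log_lik(data, theta, offset):
    """Conditional log-likelihood sum_n log p_theta(y_n | x_n).

    Each item of data is (f1, f2, y), where f_k stands for the normalised
    log-likelihood ratio (1 / h_k) * log p(x^k | y=1) / p(x^k | y=2)
    taken from the generative fit; offset is log p(y=1) / p(y=2).
    """
    total = 0.0
    for f1, f2, y in data:
        p1 = sigmoid(offset + theta[0] * f1 + theta[1] * f2)
        total += math.log(p1 if y == 1 else 1.0 - p1)
    return total

def fit_theta(data, offset=0.0, lr=0.1, steps=200):
    """Gradient ascent on the mean conditional log-likelihood in theta."""
    theta = [0.0, 0.0]
    n = len(data)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for f1, f2, y in data:
            p1 = sigmoid(offset + theta[0] * f1 + theta[1] * f2)
            err = (1.0 if y == 1 else 0.0) - p1
            grad[0] += err * f1
            grad[1] += err * f2
        theta = [theta[0] + lr * grad[0] / n, theta[1] + lr * grad[1] / n]
    return theta

# Hypothetical normalised log-likelihood ratios and group labels.
data = [(2.0, 0.5, 1), (1.5, -0.5, 1), (1.0, 1.0, 1),
        (-1.0, 0.5, 2), (-2.0, -1.0, 2), (-1.5, 1.0, 2)]

theta_hat = fit_theta(data)
print(theta_hat)
print(cond_log_lik(data, [0.0, 0.0], 0.0), cond_log_lik(data, theta_hat, 0.0))
```

Only the two weights are updated here; the parameters of the class-conditional distributions enter solely through the fixed inputs, which mirrors the division of labour described above.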

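With the assumption of conditional independence between $\mathbf{x}^1 \mid y$ and $\mathbf{x}^2 \mid y$, it follows that the two class-conditional probabilities, $p(\mathbf{x} \mid y)$ and $p_\theta(\mathbf{x} \mid y)$, are related by

$$p_\theta(\mathbf{x} \mid y) = p(\mathbf{x} \mid y)\,w_\theta(\mathbf{x}^1, \mathbf{x}^2)\,p(\mathbf{x}^1 \mid y)^{\theta_1/h_1 - 1}\,p(\mathbf{x}^2 \mid y)^{\theta_2/h_2 - 1} .$$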
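The relation can be checked in one line, as sketched here under the stated conditional independence $p(\mathbf{x} \mid y) = p(\mathbf{x}^1 \mid y)\,p(\mathbf{x}^2 \mid y)$ and assuming both partial class-conditional probabilities are non-zero:

```latex
\begin{align*}
p_\theta(\mathbf{x}\mid y)
  &= w_\theta(\mathbf{x}^1,\mathbf{x}^2)\,
     p(\mathbf{x}^1\mid y)^{\theta_1/h_1}\,p(\mathbf{x}^2\mid y)^{\theta_2/h_2} \\
  &= \underbrace{p(\mathbf{x}^1\mid y)\,p(\mathbf{x}^2\mid y)}_{=\,p(\mathbf{x}\mid y)}\,
     w_\theta(\mathbf{x}^1,\mathbf{x}^2)\,
     p(\mathbf{x}^1\mid y)^{\theta_1/h_1-1}\,p(\mathbf{x}^2\mid y)^{\theta_2/h_2-1} .
\end{align*}
```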

This indicates that, in practice, the hybrid algorithm assumes a scaled DGP $p_\theta(\mathbf{x} \mid y)$ which scales the generative DGP $p(\mathbf{x} \mid y)$ by a function not only of the group label $y$ but also of the
