
Examples: Unlabeled Data Degrading Performance with Discrete and Continuous Variables

SEMI-SUPERVISED LEARNING

5. Asymptotic Properties of Maximum Likelihood Estimation with Labeled and Unlabeled Data

5.3 Examples: Unlabeled Data Degrading Performance with Discrete and Continuous Variables

The previous discussion alluded to the possibility that $e(\theta_u^*) > e(\theta_l^*)$ when the model is incorrect. To the skeptical reader who may still think that this will not occur in practice, or that numerical instabilities are to blame, we show analytically how this occurs, using several examples of obvious practical significance.

Example 4.5 Consider the following (fictitious) classification problem. We are interested in predicting a baby's gender (G = Boy or Girl) at the 20th week of pregnancy based on two attributes: whether the mother craved chocolate in the first trimester (Ch = Yes or No), and whether the mother's weight gain was more or less than 15 lbs (W = More or Less). Suppose that the true underlying joint distribution p(G, Ch, W) can be represented with the following graph: G → Ch → W (i.e., W ⊥ G | Ch), and that the conditional probabilities are specified as: p(G = Boy) = 0.5, p(Ch = No | G = Boy) = 0.1, p(Ch = No | G = Girl) = 0.8, p(W = Less | Ch = No) = 0.7, p(W = Less | Ch = Yes) = 0.2. With these probabilities we compute the a-posteriori probability of G (which depends only on Ch):

p(G|Ch)      Girl    Boy     Prediction
No           0.89    0.11    Girl
Yes          0.18    0.82    Boy

The Bayes error rate for this problem can be easily computed and found to be 15%. Suppose that we incorrectly assume the following (Naive Bayes) relationship between the variables: Ch ← G → W; that is, we incorrectly assume that weight gain is independent of chocolate craving given the gender. Suppose also that we are given the values of p(Ch|G) and we wish to estimate p(G) and p(W|G) from data. We use Eq. (4.9) to get both the estimates with infinite labeled data (λ = 1) and the estimates with infinite unlabeled data (λ = 0). For the labeled case, $\hat{p}(G)$ is exactly 0.5. The estimate $\hat{p}(W|G)$ is p(W|G) computed from the true distribution: $\hat{p}$(W = Less | G = Girl) = 0.6, $\hat{p}$(W = Less | G = Boy) = 0.25, leading to the a-posteriori probability of G:

$\hat{p}$(G|Ch, W)   Girl    Boy     Prediction
No, Less     0.95    0.05    Girl
No, More     0.81    0.19    Girl
Yes, Less    0.35    0.65    Boy
Yes, More    0.11    0.89    Boy
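The labeled-data limit above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: it builds the true joint distribution from the stated conditional probabilities, computes the Bayes error, and then forms the Naive Bayes posterior from the limiting estimates. All identifiers (`true_joint`, `less_given_g`, `nb_posterior_girl`, and so on) are hypothetical names chosen for illustration.

```python
# True model of Example 4.5: G -> Ch -> W (values taken from the text).
P_G = {"Girl": 0.5, "Boy": 0.5}
P_NO_GIVEN_G = {"Girl": 0.8, "Boy": 0.1}      # p(Ch = No | G)
P_LESS_GIVEN_CH = {"No": 0.7, "Yes": 0.2}     # p(W = Less | Ch)
CELLS = [(ch, w) for ch in ("No", "Yes") for w in ("Less", "More")]


def p_ch(ch, g):
    return P_NO_GIVEN_G[g] if ch == "No" else 1.0 - P_NO_GIVEN_G[g]


def p_w(w, ch):
    return P_LESS_GIVEN_CH[ch] if w == "Less" else 1.0 - P_LESS_GIVEN_CH[ch]


def true_joint(g, ch, w):
    return P_G[g] * p_ch(ch, g) * p_w(w, ch)


# Bayes error: in every cell the optimal rule loses the smaller joint mass.
bayes_error = sum(min(true_joint(g, ch, w) for g in P_G) for ch, w in CELLS)

# Labeled limit: p_hat(W = Less | G) is the true conditional with Ch summed out.
less_given_g = {
    g: sum(p_ch(ch, g) * P_LESS_GIVEN_CH[ch] for ch in ("No", "Yes")) for g in P_G
}


def nb_posterior_girl(ch, w):
    """Naive Bayes posterior of Girl, proportional to p(G) p(Ch|G) p_hat(W|G)."""
    score = {}
    for g in P_G:
        pw = less_given_g[g] if w == "Less" else 1.0 - less_given_g[g]
        score[g] = P_G[g] * p_ch(ch, g) * pw
    return score["Girl"] / (score["Girl"] + score["Boy"])


predictions = {c: "Girl" if nb_posterior_girl(*c) > 0.5 else "Boy" for c in CELLS}
other = {"Girl": "Boy", "Boy": "Girl"}
# Error of the Naive Bayes rule, measured under the true distribution.
nb_error = sum(true_joint(other[predictions[c]], *c) for c in CELLS)

for c in CELLS:
    print(c, round(nb_posterior_girl(*c), 2), predictions[c])
print("Bayes error:", round(bayes_error, 2), "Naive Bayes error:", round(nb_error, 2))
```

Because the Naive Bayes predictions coincide with the Bayes-optimal ones in every cell, the two error rates printed at the end are equal.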

We see that although there is a non-zero bias between the estimated distribution and the true distribution, the prediction remains unchanged. Thus, there is no increase in classification error compared to the Bayes error rate. The solution for the unlabeled case involves solving a system of three equations with three variables (using the marginal p(Ch, W) from the true distribution), yielding the following estimates: $\hat{p}$(G = Boy) = 0.5, $\hat{p}$(W = Less | G = Girl) = 0.78, $\hat{p}$(W = Less | G = Boy) = 0.07, with a-posteriori probability:

$\hat{p}$(G|Ch, W)   Girl    Boy     Prediction
No, Less     0.99    0.01    Girl
No, More     0.55    0.45    Girl
Yes, Less    0.71    0.29    Girl
Yes, More    0.05    0.95    Boy

Here we see that the prediction is changed from the optimal one in the case of Ch = Yes, W = Less; instead of predicting G = Boy, we predict Girl. We also see that the bias is further increased, compared to the labeled case. We can easily find the expected error rate to be 22%, an increase of 7% in error.
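The unlabeled-data limit can be sketched in the same way. The code below assumes, as the text describes, that the limiting estimates are those for which the Naive Bayes model reproduces the true marginal p(Ch, W); with p(Ch|G) fixed this gives one equation for $\hat{p}(G)$ and a 2×2 linear system for the two values of $\hat{p}$(W = Less | G), solved here by Cramer's rule. The identifiers (`q`, `less_girl`, `less_boy`, `predict`) are hypothetical.

```python
P_NO_GIVEN_G = {"Girl": 0.8, "Boy": 0.1}      # p(Ch = No | G), assumed known
P_LESS_GIVEN_CH = {"No": 0.7, "Yes": 0.2}     # true p(W = Less | Ch)
P_GIRL_TRUE = 0.5
CELLS = [(ch, w) for ch in ("No", "Yes") for w in ("Less", "More")]


def p_ch(ch, g):
    return P_NO_GIVEN_G[g] if ch == "No" else 1.0 - P_NO_GIVEN_G[g]


def true_joint(g, ch, w):
    pg = P_GIRL_TRUE if g == "Girl" else 1.0 - P_GIRL_TRUE
    pw = P_LESS_GIVEN_CH[ch] if w == "Less" else 1.0 - P_LESS_GIVEN_CH[ch]
    return pg * p_ch(ch, g) * pw


def true_marginal(ch, w):
    return true_joint("Girl", ch, w) + true_joint("Boy", ch, w)


# Equation 1: match p(Ch = No) to obtain q = p_hat(G = Girl).
p_no = true_marginal("No", "Less") + true_marginal("No", "More")
q = (p_no - P_NO_GIVEN_G["Boy"]) / (P_NO_GIVEN_G["Girl"] - P_NO_GIVEN_G["Boy"])

# Equations 2 and 3: match p(Ch, W = Less) for Ch = No and Ch = Yes.
a11, a12 = q * p_ch("No", "Girl"), (1.0 - q) * p_ch("No", "Boy")
a21, a22 = q * p_ch("Yes", "Girl"), (1.0 - q) * p_ch("Yes", "Boy")
b1, b2 = true_marginal("No", "Less"), true_marginal("Yes", "Less")
det = a11 * a22 - a12 * a21
less_girl = (b1 * a22 - a12 * b2) / det   # p_hat(W = Less | G = Girl)
less_boy = (a11 * b2 - a21 * b1) / det    # p_hat(W = Less | G = Boy)


def predict(ch, w):
    """Naive Bayes prediction using the unlabeled-limit estimates."""
    girl = q * p_ch(ch, "Girl") * (less_girl if w == "Less" else 1.0 - less_girl)
    boy = (1.0 - q) * p_ch(ch, "Boy") * (less_boy if w == "Less" else 1.0 - less_boy)
    return "Girl" if girl > boy else "Boy"


predictions = {c: predict(*c) for c in CELLS}
other = {"Girl": "Boy", "Boy": "Girl"}
# Error of the resulting rule, measured under the true distribution.
error = sum(true_joint(other[predictions[c]], *c) for c in CELLS)

print(round(q, 2), round(less_girl, 2), round(less_boy, 2))
print(predictions, round(error, 2))
```

The only cell whose prediction departs from the Bayes rule is (Yes, Less), and the true probability mass misclassified there accounts for the whole increase in error.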

For the second example, we will assume that bivariate Gaussian samples (X, Y) are observed. The only modeling error is an ignored dependency between observables. This type of modeling error is quite common in practice and has been studied in the context of supervised learning [Ahmed and Lachenbruch, 1977; McLachlan, 1992]. It is often argued that ignoring some dependencies can be a positive decision, as we may see a reduction in the number of parameters to be estimated and a reduction in the variance of estimates [Friedman, 1997].

Example 4.6 Consider real-valued observations (X, Y) taken from two classes $c'$ and $c''$. We know that X and Y are Gaussian variables, and we know their means and variances given the class C. The mean of (X, Y) is (0, 3/2) conditional on $\{C = c'\}$, and (3/2, 0) conditional on $\{C = c''\}$. The variances of X and of Y conditional on C are equal to 1. We do not know, and have to estimate, the mixing factor $\eta = p(C = c')$. The data is sampled from a distribution with mixing factor equal to 3/5.

We want to obtain a Naive-Bayes classifier that can approximate p(C|X, Y); Naive-Bayes classifiers are based on the assumption that X and Y are independent given C. Suppose that X and Y are independent conditional on $\{C = c'\}$ but that X and Y are dependent conditional on $\{C = c''\}$. This dependency is manifested by a correlation $\rho = E[(X - E[X])(Y - E[Y]) \mid C = c''] = 4/5$. If we knew the value of $\rho$, we would obtain an optimal classification boundary on the plane X × Y. This optimal classification boundary is shown in Figure 4.2, and is defined by the function

$$y = \frac{40x - 87 + \sqrt{5265 - 2160x + 576x^2 + 576\log(100/81)}}{32}.$$

Under the incorrect assumption that $\rho = 0$, the classification boundary is then linear:

$$y = x + \frac{2\log\big((1-\hat{\eta})/\hat{\eta}\big)}{3},$$

and consequently it is a decreasing function of $\hat{\eta}$. With labeled data we can easily obtain $\hat{\eta}$ (a sequence of Bernoulli trials); then $\eta_l^* = 3/5$ and the classification boundary is given by $y = x - 0.27031$.
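The linear boundary is fully determined by its intercept, so the labeled-data line can be reproduced directly from the formula above. A minimal sketch, with `nb_intercept` a hypothetical helper name:

```python
import math


def nb_intercept(eta_hat):
    """Intercept gamma of the Naive Bayes boundary y = x + gamma for a given mixing factor."""
    return 2.0 * math.log((1.0 - eta_hat) / eta_hat) / 3.0


gamma_labeled = nb_intercept(3 / 5)
print(round(gamma_labeled, 5))

# The intercept decreases as eta_hat grows, so a smaller estimate of eta
# moves the line upward.
print(nb_intercept(0.5), nb_intercept(0.7))
```

At $\hat{\eta} = 1/2$ the intercept is zero and the boundary is the diagonal $y = x$, as expected from the symmetry of the two class means.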

Note that the (linear) boundary obtained with labeled data is not the best possible linear boundary. We can in fact find the best possible linear boundary of the form $y = x + \gamma$. For any $\gamma$, the classification error $e(\gamma)$ is

$$e(\gamma) = \frac{3}{5}\int_{-\infty}^{\infty}\int_{-\infty}^{x+\gamma} N\!\left(\begin{bmatrix}0\\ 3/2\end{bmatrix}, \mathrm{diag}[1,1]\right) dy\,dx + \frac{2}{5}\int_{-\infty}^{\infty}\int_{x+\gamma}^{\infty} N\!\left(\begin{bmatrix}3/2\\ 0\end{bmatrix}, \begin{bmatrix}1 & 4/5\\ 4/5 & 1\end{bmatrix}\right) dy\,dx.$$

By interchanging differentiation with respect to $\gamma$ with integration, it is possible to obtain $de(\gamma)/d\gamma$ in closed form. The second derivative $d^2e(\gamma)/d\gamma^2$ is positive when $\gamma \in [-3/2, 3/2]$; consequently there is a single minimum that can be found by solving $de(\gamma)/d\gamma = 0$. We find the minimizing $\gamma$ to be

$$\gamma = \frac{-9 + 2\sqrt{45/4 + \log(400/81)}}{4} \approx -0.45786.$$

The line $y = x - 0.45786$ is the best linear boundary for this problem. If we consider the set of lines of the form $y = x + \gamma$, we see that the farther we go from the best line, the larger the classification error. Figure 4.2 shows the linear boundary obtained with labeled data and the best possible linear boundary. The boundary from labeled data is "above" the best linear boundary.
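The double integrals for $e(\gamma)$ can be evaluated without numerical integration. The sketch below relies on a reduction that is not spelled out in the text: a point lies below the line $y = x + \gamma$ exactly when $D = Y - X < \gamma$, and $D$ is Gaussian within each class, with mean 3/2 and variance 2 under $c'$ and mean −3/2 and variance $2 - 2\rho$ under $c''$. The names `std_normal_cdf` and `line_error` are hypothetical.

```python
import math


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def line_error(gamma, eta=3 / 5, rho=4 / 5):
    """Error of the rule: predict c' when Y - X > gamma, otherwise c''."""
    sd_first = math.sqrt(2.0)                # D | c'  ~ N(+3/2, 2)
    sd_second = math.sqrt(2.0 - 2.0 * rho)   # D | c'' ~ N(-3/2, 2 - 2*rho)
    miss_first = std_normal_cdf((gamma - 1.5) / sd_first)          # c' falls below
    miss_second = 1.0 - std_normal_cdf((gamma + 1.5) / sd_second)  # c'' falls above
    return eta * miss_first + (1.0 - eta) * miss_second


# Closed-form minimizer quoted in the text.
gamma_best = (-9.0 + 2.0 * math.sqrt(45.0 / 4.0 + math.log(400.0 / 81.0))) / 4.0
# Intercept of the boundary obtained with labeled data (eta = 3/5).
gamma_labeled = 2.0 * math.log((1.0 - 3 / 5) / (3 / 5)) / 3.0

print(round(gamma_best, 5), round(line_error(gamma_best), 5))
print(round(gamma_labeled, 5), round(line_error(gamma_labeled), 5))
```

Evaluating `line_error` at intercepts on either side of `gamma_best` shows the error growing with the distance from the best line, which is the monotonicity the argument uses.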

Now consider the computation of $\eta_u^*$, the asymptotic estimate with unlabeled data:

$$\eta_u^* = \arg\max_{\eta\in[0,1]} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \log\Big(\eta\, N([0,3/2]^T, \mathrm{diag}[1,1]) + (1-\eta)\, N([3/2,0]^T, \mathrm{diag}[1,1])\Big) \cdot \left(\frac{3}{5} N([0,3/2]^T, \mathrm{diag}[1,1]) + \frac{2}{5} N\!\left(\begin{bmatrix}3/2\\ 0\end{bmatrix}, \begin{bmatrix}1 & 4/5\\ 4/5 & 1\end{bmatrix}\right)\right) dy\,dx.$$

The second derivative of this double integral is always negative (as can be seen by interchanging differentiation with integration), so the function is concave and there is a single maximum. We can search for the zero of the derivative of the double integral with respect to $\eta$. We obtain this value numerically, $\eta_u^* \approx 0.54495$. Using this estimate, the linear boundary from unlabeled data is $y = x - 0.12019$. This line is "above" the linear boundary from labeled data and, given the previous discussion, leads to a larger classification error than the boundary from labeled data. We have: $e(\gamma) = 0.06975$ for the best linear boundary; $e(\theta_l^*) = 0.07356$; $e(\theta_u^*) = 0.08141$. The boundary obtained from unlabeled data is also shown in Figure 4.2.
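One way to carry out the numerical search for $\eta_u^*$ is sketched below. It is not the authors' procedure, and it uses a simplification that is an assumption of this sketch: the ratio of the two model densities is $\exp(\tfrac{3}{2}(y - x))$, so the derivative of the double integral with respect to $\eta$ depends on the data only through $D = Y - X$ and reduces to a one-dimensional expectation under the true distribution of $D$. That expectation is evaluated with the trapezoid rule and its zero is found by bisection. The names `score` and `eta_unlabeled` are hypothetical.

```python
import math


def normal_pdf(d, mean, var):
    return math.exp(-((d - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def true_density(d, eta=3 / 5, rho=4 / 5):
    """Density of D = Y - X under the true mixture."""
    return eta * normal_pdf(d, 1.5, 2.0) + (1.0 - eta) * normal_pdf(d, -1.5, 2.0 - 2.0 * rho)


def score(eta_hat, lo=-12.0, hi=12.0, n=6000):
    """Derivative of the expected model log-likelihood with respect to eta_hat."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        d = lo + i * h
        r = math.exp(1.5 * d)                       # model density ratio N1 / N2
        g = (r - 1.0) / (eta_hat * r + 1.0 - eta_hat)
        weight = 0.5 if i in (0, n) else 1.0        # trapezoid end-point weights
        total += weight * g * true_density(d)
    return total * h


def eta_unlabeled(tol=1e-8):
    """Bisection on the score, which is decreasing because the objective is concave."""
    lo, hi = 0.01, 0.99
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


eta_u = eta_unlabeled()
gamma_u = 2.0 * math.log((1.0 - eta_u) / eta_u) / 3.0
print(round(eta_u, 5), round(gamma_u, 5))
```

The integration limits and grid size are illustrative choices; the integrand is smooth and decays quickly, so the result is not sensitive to them.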

This example suggests the following situation. Suppose we collect a large number $N_l$ of labeled samples from p(C, X), with $\eta = 3/5$ and $\rho = 4/5$. The labeled estimates form a sequence of Bernoulli trials with probability 3/5, so the estimates quickly approach $\eta_l^*$ (the variance of $\hat{\eta}$ decreases as $6/(25 N_l)$). If we add a very large amount of unlabeled data to our data, $\hat{\eta}$ approaches $\eta_u^*$ and the classification error increases.

By changing the "true" mixing factor and the correlation $\rho$, we can create different situations. For example, if $\eta = 3/5$ and $\rho = -4/5$, the best linear boundary is $y = x - 0.37199$, the boundary from labeled data is $y = x - 0.27031$, and the boundary from unlabeled data is $y = x - 0.34532$; the latter boundary is "between" the other two, so additional unlabeled data lead to improvement in classification performance. As another example, if $\eta = 3/5$ and $\rho = -1/5$, the best linear boundary is $y = x - 0.29044$, the boundary from labeled data is $y = x - 0.27031$, and the boundary from unlabeled data is $y = x - 0.29371$. The best linear boundary is "between" the other two. We in fact attain the best possible linear boundary by mixing labeled and unlabeled data with $\lambda = 0.08075$.
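The best linear boundary for other values of $\rho$ can be found with a short root search. The sketch below assumes the same reduction to $D = Y - X$ (variance 2 under $c'$, variance $2 - 2\rho$ under $c''$) and locates the zero of $de(\gamma)/d\gamma$ on $[-3/2, 3/2]$ by bisection, where the derivative is increasing. Under that reduction the quoted best intercepts are reproduced with $\rho = -4/5$ and $\rho = -1/5$. The names `error_slope` and `best_intercept` are hypothetical.

```python
import math


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def error_slope(gamma, eta, rho):
    """de/dgamma for the boundary y = x + gamma."""
    sd_first = math.sqrt(2.0)
    sd_second = math.sqrt(2.0 - 2.0 * rho)
    return (eta * std_normal_pdf((gamma - 1.5) / sd_first) / sd_first
            - (1.0 - eta) * std_normal_pdf((gamma + 1.5) / sd_second) / sd_second)


def best_intercept(eta, rho, lo=-1.5, hi=1.5, tol=1e-10):
    """Bisection for the zero of the slope; the error is convex on [-3/2, 3/2]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if error_slope(mid, eta, rho) < 0.0:
            lo = mid      # error still decreasing: the minimum lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


gamma_labeled = 2.0 * math.log((1.0 - 3 / 5) / (3 / 5)) / 3.0
for rho in (4 / 5, -4 / 5, -1 / 5):
    print(rho, round(best_intercept(3 / 5, rho), 5), round(gamma_labeled, 5))
```

Comparing each best intercept with the labeled and unlabeled intercepts quoted in the text shows which of the three orderings applies in each scenario.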

5.4 Generating Examples: Performance Degradation with