Logit transformation function for predicted class probabilities


In the previous section, Figure 5.1 showed a discrepancy between the predicted and true class, which arises because semi-supervised learning based on the basic EM algorithm tends to produce uncalibrated predicted probabilistic labels. Thus, the semi-supervised naïve Bayes classifier is more likely to perform worse rather than better. To overcome this problem, we recalibrate the predicted labels of the unlabelled patterns between the E-step and the M-step in each iteration of the EM algorithm, in order to obtain the correct mean value. The basic idea is to shift all the predicted probability labels of the unlabelled patterns to a level approximately equal to the class distribution, whilst constraining them to lie within the range 0 to 1.

The inverse sigmoid function is used to transform the predicted class probabilities of the unlabelled patterns in advance, in order to shift the probability values. The inverse sigmoid function gives the log-odds ratio, $r_{ic}$, for binary classification, so in this work we consider only binary classification tasks, where $y_i \in \{-1, +1\}$ is an indicator variable such that $y_i = +1$ if the $i$th pattern is drawn from the positive class and $y_i = -1$ if it is drawn from the negative class.

Let the proportion of patterns in the positive class be denoted by $\theta$, and let $q_{ic+}$ be the predicted probabilistic label of the $i$th unlabelled pattern belonging to the positive class. We believe that there is a bias between $\theta$ and the mean $q_{ic+}$ value. Correcting this bias by recalibrating the $q_{ic}$ values might improve classification performance. The whole calibration process is given below:

The predicted class probabilities of the unlabelled patterns belonging to the positive class, $q_{ic+}$ where $i = l+1, \ldots, l+u$, are transformed into another real-valued domain, $r_{ic+}$ where $i = l+1, \ldots, l+u$, using the inverse sigmoid function:

\[ r_{ic+} = -\log \frac{1 - q_{ic+}}{q_{ic+}}. \]

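In this case, $r_{ic+}$ takes a real value in the range $[-\infty, +\infty]$ for all $i = l+1, \ldots, l+u$, such that

\[ r_{ic+} \in \begin{cases} (0, +\infty], & \text{if } q_{ic+} > 0.5, \\ \{0\}, & \text{if } q_{ic+} = 0.5, \\ [-\infty, 0), & \text{if } q_{ic+} < 0.5. \end{cases} \tag{5.2} \]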

We then normalise the real values, $r_{ic+}$, by dividing each value by the total number of unlabelled data, $u$:

\[ r_{ic+} = \frac{r_{ic+}}{u}. \]

We apply the sigmoid function to convert the average value of the $r_{ic+}$, corresponding to the probability values $q_{ic+}$, back into a probability, and then plot it against the true proportion of the positive class to see whether this transformation gives approximately the correct mean value:

\[ \tilde{q}_{ic+} = \frac{1}{1 + \exp(-\bar{r}_{ic+})}, \quad \text{where } \bar{r}_{ic+} = \sum_{i=l+1}^{l+u} r_{ic+}, \]

\[ \tilde{q}_{ic-} = 1 - \tilde{q}_{ic+}. \]
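To make the quantity concrete, the sketch below computes the sigmoid of the mean log-odds and compares it with the plain arithmetic mean of the same probabilities. The probability values are made up for illustration, and the helper names (`logit`, `sigmoid`, `mean_logodds_probability`) are hypothetical.

```python
import math

def logit(q):
    # log-odds of a probability strictly between 0 and 1
    return -math.log((1.0 - q) / q)

def sigmoid(r):
    return 1.0 / (1.0 + math.exp(-r))

def mean_logodds_probability(q_pos):
    """Sigmoid of the sum of the normalised log-odds r / u."""
    u = len(q_pos)
    r_bar = sum(logit(q) / u for q in q_pos)
    return sigmoid(r_bar)

# Made-up, overconfident predictions for four unlabelled patterns.
q_pos = [0.999, 0.999, 0.999, 0.001]

arithmetic_mean = sum(q_pos) / len(q_pos)
q_tilde = mean_logodds_probability(q_pos)
print(arithmetic_mean, q_tilde)
```

For these values the sigmoid of the mean log-odds lies closer to 1 than the arithmetic mean of the probabilities, so the two summaries need not agree when the individual predictions are extreme.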

Before the next iteration of the estimation of the model parameters, the sigmoid function is applied to convert the log-odds values $r_{ic}$ back into probability values $q_{ic}$, which guarantees $0 \le q_{ic} \le 1$:

\[ q_{ic+} = \frac{1}{1 + \exp(-r_{ic+})}, \qquad q_{ic-} = 1 - q_{ic+}. \]
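Substituting the definition of the log-odds into the sigmoid shows why this back-conversion recovers the original label. The derivation assumes that the sigmoid is applied to the un-normalised log-odds, which is our reading of the procedure, since the same symbol is used before and after the division by $u$.

```latex
\[
\frac{1}{1 + \exp(-r_{ic+})}
  = \frac{1}{1 + \exp\!\left(\log \frac{1 - q_{ic+}}{q_{ic+}}\right)}
  = \frac{1}{1 + \frac{1 - q_{ic+}}{q_{ic+}}}
  = \frac{q_{ic+}}{q_{ic+} + (1 - q_{ic+})}
  = q_{ic+}.
\]
```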

The whole procedure for the semi-supervised naïve Bayes (SSNB) classifier with the inverse sigmoid function transformation, used to predict the probabilities of class membership for the unlabelled patterns, is given in Algorithm 5.3. The error rate of the SSNB classifier with the inverse sigmoid function transformation is the same as that of the normal SSNB classifier, because the log-odds ratio, $r_{ic}$, is converted back to $q_{ic}$. However, the mean value of the predicted class log-odds ratio of the unlabelled data is also converted by the sigmoid function to give $\tilde{q}_{ic}$, which corresponds to the mean value of the predicted class probabilities.

Algorithm 5 Log-odds ratio transformation for uncalibrated predicted class probabilities for the EM-based semi-supervised naïve Bayes classifier

• Input: $X = \{(x_1, y_1), \ldots, (x_l, y_l), x_{l+1}, \ldots, x_{l+u}\}$

• Set: $t = 0$

• Initialise: $\hat{\theta}^{(0)} = \arg\max_{\theta} P(X_l, Y_l \mid \theta) P(\theta)$

• Loop while the classifier parameters improve, as measured by the change in $l(\theta \mid X_l, Y_l, X_u)$:

    • (E-step): Use the current classifier, $\hat{\theta}^{(t)}$, to find $q_{ik} = P(y_i = k \mid x_u; \theta)$ as shown in Equation 2.14, then compute

        $r_{ik+} = -\log \frac{1 - q_{ik+}}{q_{ik+}}$,

        $r_{ik+} = \frac{r_{ik+}}{u}$,

        $\tilde{q}_{ik+} = \frac{1}{1 + \exp(-\bar{r}_{ic+})}$ (see Figure 5.2), where $\bar{r}_{ic+} = \sum_{i=l+1}^{l+u} r_{ic+}$,

        $q_{ik+} = \frac{1}{1 + \exp(-r_{ik+})}$,

        $q_{ik-} = 1 - q_{ik+}$.

    • (M-step): Re-estimate the classifier, $\hat{\theta}^{(t+1)}$, using $\theta^{(t+1)} = \arg\max_{\theta} P(X_l, Y_l, X_u \mid \theta^{(t)}) P(\theta^{(t)})$

    • Set: $t = t + 1$

• Output: A classifier, $\hat{\theta}^{(t)}$, that takes unlabelled data and predicts a class label.
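The extra work inserted between the E-step and the M-step can be sketched as a single post-processing function. This is an illustration under stated assumptions rather than the implementation used in the experiments: the name `recalibration_step`, the clipping constant `eps`, and the choice to apply the back-conversion to the un-normalised log-odds are all ours. The naïve Bayes E-step and M-step themselves are not shown.

```python
import math

def recalibration_step(q_pos, eps=1e-12):
    """Post-process E-step outputs for the unlabelled patterns.

    q_pos: predicted positive-class probabilities from the E-step.
    Returns (labels, q_tilde): the (q+, q-) pairs passed to the M-step
    and the diagnostic value plotted against the class proportion.
    eps is an assumed safeguard against probabilities of exactly 0 or 1.
    """
    u = len(q_pos)
    clipped = [min(max(q, eps), 1.0 - eps) for q in q_pos]

    # Inverse sigmoid: log-odds of every unlabelled pattern.
    r = [-math.log((1.0 - q) / q) for q in clipped]

    # Normalise by u, sum, and map the result back to a probability.
    r_bar = sum(r_i / u for r_i in r)
    q_tilde = 1.0 / (1.0 + math.exp(-r_bar))

    # Convert the (un-normalised) log-odds back to probabilistic labels.
    q_back = [1.0 / (1.0 + math.exp(-r_i)) for r_i in r]
    labels = [(q, 1.0 - q) for q in q_back]
    return labels, q_tilde

# Made-up E-step outputs for three unlabelled patterns.
labels, q_tilde = recalibration_step([0.8, 0.3, 0.95])
print(labels, q_tilde)
```

Under these assumptions the labels handed to the M-step equal the E-step outputs up to rounding, which is consistent with the error rate being unchanged; only `q_tilde` carries new information.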

Figure 5.2 plots $\tilde{q}_{ic+}$ against the proportion of positive examples, $\theta$ (the green line), for 100 replications of the Mushroom benchmark dataset. The inverse sigmoid function transformation fits badly, effectively pushing the converted mean value of the $r_{ic+}$ to 0 or 1, as can be seen in Figure 5.2.

The positive patterns are more tightly clustered, with less variability, which makes the classifier overconfident when positive patterns are added. However, in Figure 5.1 the average predicted probability of the positive class for the unlabelled patterns appears to differ only slightly from the proportion of positive examples, $\theta$. In reality, the predicted probability of the positive class for an unlabelled pattern is almost always close to 0 or 1, as can be seen in Figure 5.2, but once averaged the difference appears small.

[Figure 5.2: panel "mushroom"; vertical axis: $q_{ic}$ (positive), from 0 to 1; horizontal axis: $10^0$ to $10^3$]

Fig. 5.2 The average value of predicted probabilities for the positive class, for the unlabelled data for 100 replications of the Mushroom benchmark dataset, using Platt scaling [63] for rank transformation

The log-odds ratio transformation does not fix the discrepancy between the predicted and actual class. In addition, these results suggest that the discrepancy is not the only problem: overconfidence is also an issue. One way to visualise the overconfidence problem is to plot a histogram of the predicted class probabilities. Figure 5.3 shows the percentage histogram of the predicted class probabilities, $q_{ic+}$, for 100 replications of the UCI Mushroom benchmark dataset. It should be noted that the classifier output is overly confident, being almost always close to 0 or close to 1.
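A percentage histogram of this kind can be computed with a few lines of Python. The sketch below uses made-up probabilities, not the Mushroom results, and the function name `percentage_histogram` and the choice of ten equal-width bins are assumptions for illustration.

```python
def percentage_histogram(q_pos, n_bins=10):
    """Percentage of predictions falling in each equal-width bin on [0, 1]."""
    counts = [0] * n_bins
    for q in q_pos:
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(q * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(q_pos)
    return [100.0 * c / total for c in counts]

# Made-up, overconfident predictions: most mass sits in the two end bins.
q_pos = [0.001, 0.002, 0.01, 0.5, 0.97, 0.99, 0.995, 0.999]
hist = percentage_histogram(q_pos)
print(hist)
```

An overconfident classifier shows up as tall bars in the first and last bins with little mass in between, whereas a well-spread set of predictions would fill the middle bins as well.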

In document Empirical Evaluation of Semi-Supervised Naïve Bayes for Active Learning (Page 165-169)