
The contribution of this chapter is an empirical evaluation of NB and SSNB on binary and multi-class classification problems with continuous and discrete features. We wish to address the question of whether using unlabelled data will improve classification accuracy. The answer will clearly depend on our choice of classifier and semi-supervised learning scheme. We evaluated a naïve Bayes classifier used in conjunction with an Expectation-Maximization algorithm that iteratively uses NB to predict the labels of the unlabelled instances. We found that using the unlabelled data made the classifier significantly less accurate. To understand why this may be so, we assessed the performance of NB and SSNB on synthetic data for which the NB assumption of independent features was valid. We found that SSNB was significantly more accurate on these data. We conclude that if a classifier is not suitable for a data set, then using unlabelled data in a self-training scheme is likely to make it worse. This implies that effort should go into finding a classifier suitable for a problem before using unlabelled data to self-train.

Differential weighting of labelled and unlabelled examples

As demonstrated by the benchmarking in Chapter 3, the use of semi-supervised learning with the naïve Bayes classifier generally degrades rather than improves classification performance. Experiments show that one reason for this degradation is that the assumption of independence between input features is often invalid in practice. Another likely reason is the imbalance in the data: the number of labelled patterns is usually much smaller than the number of unlabelled patterns. Thus, it is possible that a large amount of uninformative unlabelled data is swamping the more reliable information in the labelled data.

In this chapter, we down-weight the unlabelled data to test whether reducing its influence improves the performance of the naïve Bayes classifier. Specifically, we investigate a hyper-parameter, λ, that down-weights the contribution of the unlabelled data, together with several model selection methods for tuning λ. A preliminary study shows that, as expected, down-weighting the influence of the unlabelled data improves on the baseline classifier somewhat. However, this improvement is obtained when λ is tuned to maximise test set performance, which is a biased protocol. When an unbiased model selection procedure was investigated instead, down-weighting was less successful.

Other model selection procedures, such as k-fold cross-validation and leave-one-out cross-validation, may give an unreliable indication for selecting the hyper-parameter λ. The k-fold cross-validation procedure needs a large number of labelled patterns to obtain a reliable result, and leave-one-out cross-validation has high variance. Thus, a different value of λ might be selected if the experiment were repeated with a different sample of the dataset.

Therefore, we used a new method that is intermediate between leave-one-out cross-validation and k-fold cross-validation (Section 4.5). Again, the unlabelled data does not improve classification performance, because it is difficult to tune the value of λ.

4.1 Down-weighting of the unlabelled data

The standard EM-based semi-supervised NB algorithm works well for estimating the model parameters from unlabelled patterns when the data conform to the assumptions of the model [59]. However, the assumption of independence is generally invalid in practice, so the EM algorithm may degrade rather than improve classification performance [70]. As described in Chapter 3, a common scenario in semi-supervised learning is that the majority of the data is unlabelled, yet the unlabelled data still participates in estimating the model parameters in the M-step of the EM algorithm. Thus, it is possible that a large amount of unlabelled data may swamp the more reliable information in the labelled data.

In order to reduce the influence of the unlabelled data, we investigate the inclusion of a hyper-parameter, λ, to down-weight the contribution of the unlabelled data in the M-step of the EM algorithm; the resulting algorithm is denoted by EM-λ.
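As a minimal sketch of how such a weight could enter the M-step, the following assumes Gaussian class-conditional densities for continuous features and Laplace-smoothed class priors. It is an illustration under those assumptions, not the implementation used in the experiments; the names `m_step`, `e_step`, `em_lambda`, `predict`, `lam` and `var_floor` are hypothetical.

```python
import numpy as np

def m_step(X, R, w, var_floor=1e-3):
    """Weighted M-step. R: (n, K) class responsibilities, w: (n,) pattern weights."""
    W = R * w[:, None]                          # weighted soft counts
    Nk = W.sum(axis=0)                          # effective count per class
    priors = (Nk + 1.0) / (Nk.sum() + len(Nk))  # Laplace-smoothed priors (assumption)
    means = (W.T @ X) / (Nk[:, None] + 1e-12)
    second = (W.T @ X**2) / (Nk[:, None] + 1e-12)
    variances = np.maximum(second - means**2, var_floor)
    return priors, means, variances

def e_step(X, priors, means, variances):
    """Posterior p(c | x) under independent Gaussian features (the NB assumption)."""
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + (X[:, None, :] - means[None, :, :])**2
                      / variances[None, :, :]).sum(axis=2)
    log_post = np.log(priors)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def em_lambda(X_lab, y_lab, X_unl, n_classes, lam=0.1, n_iter=20):
    """EM in which each unlabelled pattern counts lam times a labelled one."""
    R_lab = np.eye(n_classes)[y_lab]            # labelled responsibilities stay fixed
    params = m_step(X_lab, R_lab, np.ones(len(X_lab)))  # initialise from labelled data
    X_all = np.vstack([X_lab, X_unl])
    w = np.concatenate([np.ones(len(X_lab)), np.full(len(X_unl), lam)])
    for _ in range(n_iter):
        R_unl = e_step(X_unl, *params)          # E-step on unlabelled patterns only
        params = m_step(X_all, np.vstack([R_lab, R_unl]), w)
    return params

def predict(X, params):
    return e_step(X, *params).argmax(axis=1)
```

In this sketch, `lam=0` discards the unlabelled data and recovers the supervised classifier, while `lam=1` gives every unlabelled pattern the same weight as a labelled one, as in the unweighted EM procedure.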

Nigam et al. [59] show that down-weighting the influence of the unlabelled data in this way can improve the performance of the naïve Bayes classifier on the WebKB dataset. The experiments in this chapter use a large number of benchmark datasets from the UCI repository to test whether down-weighting the contribution of the unlabelled data can improve the performance of the semi-supervised naïve Bayes classifier (SSNB-λ), especially in cases with a small number of labelled examples. In addition, running the experiments raised a further research question: how to choose the value of the weighting factor λ.

The contributions of this chapter are summarised as follows:

• The main contribution is the finding that down-weighting the influence of the unlabelled data does not generally improve the classifier. In fact, our experiments show that for the majority of the benchmark datasets it is preferable not to use the unlabelled data.

• Tuning the value of λ on the test set can improve the performance of the NB classifier, but it is a biased protocol that gives over-optimistic estimates of performance. Therefore, it is better to investigate other model selection methods for tuning the hyper-parameter λ.

• The results obtained with other model selection methods suggest that none of the methods we evaluate for choosing the value of λ gives a significant improvement over the naïve Bayes classifier.
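The difference between tuning λ on the test set and tuning it without touching the test set can be sketched as follows. This is a hedged illustration only: it uses plain k-fold cross-validation over the labelled patterns, not the specific procedure of Section 4.5, together with a compact Gaussian stand-in for the λ-weighted EM classifier. All names (`fit_em_lambda`, `select_lambda_cv`, `lambdas`, `n_folds`) and the Gaussian and smoothing choices are assumptions.

```python
import numpy as np

def _m_step(X, R, w):
    # Weighted naive Bayes estimates; w down-weights unlabelled patterns.
    W = R * w[:, None]
    Nk = W.sum(axis=0)
    priors = (Nk + 1.0) / (Nk.sum() + len(Nk))      # Laplace smoothing (assumption)
    mu = (W.T @ X) / (Nk[:, None] + 1e-12)
    var = np.maximum((W.T @ X**2) / (Nk[:, None] + 1e-12) - mu**2, 1e-3)
    return priors, mu, var

def _posterior(X, priors, mu, var):
    ll = -0.5 * (np.log(2 * np.pi * var)[None]
                 + (X[:, None, :] - mu[None])**2 / var[None]).sum(axis=2)
    lp = np.log(priors)[None, :] + ll
    lp -= lp.max(axis=1, keepdims=True)
    p = np.exp(lp)
    return p / p.sum(axis=1, keepdims=True)

def fit_em_lambda(X_lab, y_lab, X_unl, K, lam, n_iter=10):
    R_lab = np.eye(K)[y_lab]
    params = _m_step(X_lab, R_lab, np.ones(len(X_lab)))
    X_all = np.vstack([X_lab, X_unl])
    w = np.concatenate([np.ones(len(X_lab)), np.full(len(X_unl), lam)])
    for _ in range(n_iter):
        R_all = np.vstack([R_lab, _posterior(X_unl, *params)])
        params = _m_step(X_all, R_all, w)
    return params

def select_lambda_cv(X_lab, y_lab, X_unl, K, lambdas, n_folds=5, seed=0):
    """Pick lambda by k-fold CV accuracy on labelled data; the test set is never used."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X_lab)), n_folds)
    scores = []
    for lam in lambdas:
        correct = 0
        for held in folds:
            train = np.setdiff1d(np.arange(len(X_lab)), held)
            params = fit_em_lambda(X_lab[train], y_lab[train], X_unl, K, lam)
            pred = _posterior(X_lab[held], *params).argmax(axis=1)
            correct += int((pred == y_lab[held]).sum())
        scores.append(correct / len(X_lab))
    return lambdas[int(np.argmax(scores))], scores
```

With only a handful of labelled patterns, each fold in this sketch holds very few examples, so the cross-validation score of each candidate λ is noisy and the selected value can change from one sample of the data to another.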