5 Discussion and conclusion

We are not experts in the field of machine learning and therefore cannot provide an advanced analysis of why the classifiers behave the way they do. This also makes us more susceptible to mistakes, and we may not have tuned the algorithms and classifiers ideally. The time constraint of this degree thesis has not allowed us to fully explore the options available during the GSSL-process. Such options could, for example, have been to test other GSSL-classifiers such as MAD [27] or to improve the selection method for our vocabularies.

The time constraint also affected the maximum number of iterations the LP-algorithm was allowed to run. A lower limit enables more trial and error, while a higher limit allows a better classification result for BOW, since some classifiers did not have enough iterations to finish.

Accuracy and F1-score

The classification measurement calculates the percentage of correctly classified documents. A successful score should be higher than what random label assignment achieves. Since we have four categories, there is a 25% chance of guessing the correct label when labels are assigned randomly. It is therefore essential that the classification-algorithm scores higher than 25% for it to be useful.

Figure 7 shows that our classification-algorithms have a better accuracy than 25%, which makes the predictions better than randomly assigned labels. There are however results from our classification where we might as well have assigned the labels randomly. Figure 6 shows that the RBF-kernel has a score of around 25% due to reaching the iteration limit when using BOW. In these cases it is better to look at the error rate, which we want to be as low as possible. A classification-score of 100% might look good on paper but is not desirable. Such a score could tell us that something is not working properly or that the classifier has been overfitted to the training data.

Apart from achieving a higher classification score than random label assignment, it is important that the classifier is precise. We also want the classification-algorithms to be sensitive, which is measured by recall. The F1-score is the harmonic mean of precision and recall and shows whether the classifier is both precise and sensitive. Each category has an equal effect on the F1-score, and an F1-score significantly lower than the accuracy-score shows that the classifier is biased towards labeling documents with a certain category. The F1-score is, however, not strictly bounded by the accuracy-score and can in some cases exceed it.
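The relation between the accuracy-score and the macro-averaged F1-score can be illustrated with a small hand-written example. The sketch below is not the thesis implementation (which relies on scikit-learn); the functions `accuracy` and `macro_f1` and the toy labels are assumptions made purely for illustration. It compares a classifier that assigns every document to one category with a classifier whose errors are spread evenly over the four categories.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-category F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: four categories with five documents each.
y_true = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5

# A fully biased classifier: every document is labeled as category 0.
y_biased = [0] * 20

# An unbiased classifier: one error per category, spread evenly.
y_even = [0, 0, 0, 0, 1] + [1, 1, 1, 1, 2] + [2, 2, 2, 2, 3] + [3, 3, 3, 3, 0]

for name, y_pred in (("biased", y_biased), ("even", y_even)):
    print(name, round(accuracy(y_true, y_pred), 3), round(macro_f1(y_true, y_pred), 3))
```

The biased classifier reaches the chance level of 25% accuracy but a much lower F1-score, because three of the four categories contribute an F1-score of zero. For the evenly spread errors, the accuracy-score and the F1-score coincide.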

Instability

Empty training documents have a major effect on the stability of the result when using GSSL without a vocabulary. The result varies because randomly selected labeled training documents can be empty after the preprocessing, and empty documents are not suitable for training the classifier. The result of MNB does not vary as much as that of the GSSL-algorithms for the same number of empty training documents. This is most likely because MNB only uses the labeled training documents when training and thereby excludes many of the empty documents. This is not the case for the GSSL-algorithms, which use all the training documents.
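How a document becomes empty, and how an empty document can end up among the randomly selected labeled documents, can be sketched as follows. The stop-word list, the example documents and the helper names `preprocess` and `is_empty` are hypothetical and much simpler than the actual preprocessing. Filtering out empty documents before the random selection is shown only as one conceivable mitigation, not as something that was done for the results in this thesis.

```python
import random

# Hypothetical stop-word list and documents, for illustration only.
STOP_WORDS = {"the", "a", "is", "it", "thanks", "in", "to"}


def preprocess(text):
    """Lower-case the text, strip punctuation and drop stop-words."""
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    return " ".join(w for w in tokens if w and w not in STOP_WORDS)


def is_empty(text):
    """A document is empty when no token survives the preprocessing."""
    return not text.strip()


raw_docs = [
    "The engine is loud.",
    "Thanks!",
    "It is in the garage.",
    "",
    "Hockey playoffs start tonight.",
    "Is it?",
]
cleaned = [preprocess(doc) for doc in raw_docs]
empty_ids = [i for i, doc in enumerate(cleaned) if is_empty(doc)]
print("empty after preprocessing:", empty_ids)

# Random selection of labeled documents may pick empty ones.
rng = random.Random(0)
labeled = rng.sample(range(len(cleaned)), 3)
n_empty_labeled = sum(is_empty(cleaned[i]) for i in labeled)
print("empty documents among the labeled ones:", n_empty_labeled)

# Possible mitigation: only draw labeled documents from the non-empty ones.
candidates = [i for i in range(len(cleaned)) if i not in empty_ids]
labeled_filtered = rng.sample(candidates, 3)
```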

Vocabulary

Vocabulary features: The vocabulary constructed from the preprocessing reduced the number of features, or unique words, in all texts from 2793 to 27. This number varies when using the MNB-classifier, which is an SL-classifier and therefore only uses the labeled training documents when training, whereas SSL-classifiers use both labeled and unlabeled training documents. Without a vocabulary, MNB usually had around 2200 features. The number of features in the runtime vocabulary depends on the randomly selected training documents and varies between 10 and 40. However, it is unrealistic that a runtime vocabulary would consist of only ten features, since that would mean that the four categories share the same ten most frequently used words. Instead, the number of features for a runtime vocabulary was the same as for the standard vocabulary.
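The range of 10 to 40 features follows if the vocabulary is the union of the most frequent words of each category. The sketch below assumes exactly that construction; the function `build_vocabulary`, its parameter `top_n` and the toy documents are hypothetical names and data, not the implementation used in the thesis. With four categories the vocabulary size lies between `top_n` (all categories share their top words) and `4 * top_n` (no overlap at all).

```python
from collections import Counter


def build_vocabulary(docs_by_category, top_n):
    """Union of the top_n most frequent words of each category."""
    vocabulary = set()
    for docs in docs_by_category.values():
        counts = Counter(word for doc in docs for word in doc.split())
        vocabulary.update(word for word, _ in counts.most_common(top_n))
    return vocabulary


# Toy categories whose most frequent words do not overlap.
disjoint = {
    "rec.autos": ["car car car engine engine wheel"],
    "rec.motorcycles": ["bike bike bike helmet helmet road"],
    "rec.sport.baseball": ["pitcher pitcher pitcher inning inning bat"],
    "rec.sport.hockey": ["puck puck puck goalie goalie ice"],
}

# Toy categories that all share the same two most frequent words.
shared = {
    "rec.autos": ["game game game team team car"],
    "rec.motorcycles": ["game game game team team bike"],
    "rec.sport.baseball": ["game game game team team bat"],
    "rec.sport.hockey": ["game game game team team puck"],
}

print(len(build_vocabulary(disjoint, top_n=2)))  # upper bound: 4 * top_n
print(len(build_vocabulary(shared, top_n=2)))    # lower bound: top_n
```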

The classification tests show that the classifier makes more stable predictions with a vocabulary because we remove noisy words and get a lower feature count per document. Since the feature count is lower and unnecessary words which could “confuse” the classifier are removed, the classification measurements differ less between lower and higher numbers of labeled training documents per category. This makes vocabularies well suited for algorithms using a low number of labels, since without a vocabulary there may not be enough features to differentiate the categories.

A vocabulary makes it easier to predict categories but the downside is that it will produce a lower maximum score for the classifier. This is due to the fact that using fewer features causes some features which would have differentiated the categories to no longer be included in the classification. For example, the categories car and

useful in the classification. A runtime vocabulary provides more accurate results since the above issue is eliminated. We saw an increase of 2-5 percentage points in classification score using a runtime vocabulary compared to the standard vocabulary. The result of using a runtime vocabulary could also become worse than with the standard vocabulary. Such a result is obtained when the randomly selected labeled training documents give a misleading representation of the other training documents. This means that the words in the vocabulary rarely occur in the other training documents, and MNB with a vocabulary will then perform worse.

Any reduction in the number of unique features would however lower the result of MNB. MNB has an accuracy of 0.81 at 100 labeled training documents per category using TF-IDF without a vocabulary. When using data which has not gone through data cleaning or preprocessing for the same scenario, the accuracy improves to 0.90. This shows that the result of MNB improves even when using features which many other classifiers perceive as unsuitable for classification.

GSSL using vocabulary: The classification result for LS using KNN- and RBF-kernels decreases when using a vocabulary because of the lower number of distinguishing features. Another reason is that the vocabulary is not optimal for distinguishing between the categories, since it is constructed from the most frequent words in each category and nothing stops these words from occurring in another category. A vocabulary with words that only occur in one category would therefore improve the result. Such a vocabulary could be constructed manually by a human selecting the features. However, this would only be possible for the standard vocabulary and not for the runtime vocabulary, since the latter is built from the randomly generated categories. The feature selection for the runtime vocabulary could instead be performed by a custom algorithm, which would be very time-consuming.
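A custom selection that only keeps words occurring in a single category could, as a minimal sketch, look like the following. The function `exclusive_vocabulary` and the toy documents are hypothetical and were not part of the implementation. Every category is compared against the words of all other categories, which hints at why such a selection becomes expensive on the full data set.

```python
from collections import Counter


def exclusive_vocabulary(docs_by_category, top_n):
    """Per category, the top_n most frequent words found in no other category."""
    counts = {
        category: Counter(word for doc in docs for word in doc.split())
        for category, docs in docs_by_category.items()
    }
    vocabulary = {}
    for category, category_counts in counts.items():
        # Collect every word that appears in any other category.
        other_words = set()
        for other, other_counts in counts.items():
            if other != category:
                other_words.update(other_counts)
        exclusive = [
            word for word, _ in category_counts.most_common()
            if word not in other_words
        ]
        vocabulary[category] = exclusive[:top_n]
    return vocabulary


# Toy data: "game" is frequent in hockey but also occurs in autos.
toy_docs = {
    "rec.autos": ["car car engine game"],
    "rec.sport.hockey": ["puck puck game game goalie"],
}
selected = exclusive_vocabulary(toy_docs, top_n=2)
print(selected)
```

In this toy example "game" is dropped from both categories although it is one of the most frequent hockey words, which is the behaviour a purely frequency-based vocabulary lacks.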

The classification result increases for LP, in contrast to LS, when using a vocabulary. An example is LP RBF, which shows a significant difference of 15 percentage points at 100 labeled training documents depending on whether it uses a vocabulary or not. LP without a vocabulary has a steeper classification line towards 100 labeled training documents than LP with a vocabulary. This leads us to believe that if the number of labels per category increased further, LP without a vocabulary would give a better result than LP with a vocabulary.

allowed iterations, whereas TF-IDF produces a result without reaching the iteration limit. The RBF-kernel using BOW is not even close to producing a result at

A higher number of neighbors increases the number of iterations required. Higher gamma values when using the RBF-kernel also increase the number of iterations required to train the classifier properly.
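The interplay between gamma, the iteration limit and convergence can be made concrete with a small hand-written label propagation on six points along a line. This is a simplified sketch in the spirit of [13] and not the scikit-learn implementation used for the results: the functions `rbf_affinity` and `label_propagation`, the tolerance and the toy points are assumptions, and labeled points are clamped to their original label after every step, which may differ from the library's behaviour.

```python
import math


def rbf_affinity(points, gamma):
    """Fully connected graph with weights exp(-gamma * squared distance)."""
    return [[math.exp(-gamma * (a - b) ** 2) for b in points] for a in points]


def label_propagation(weights, labels, n_classes, max_iter, tol=1e-6):
    """Propagate labels until convergence or until max_iter is reached.

    labels holds a class index for labeled points and -1 for unlabeled ones.
    Returns (predicted labels, iterations used, converged flag).
    """
    n = len(weights)
    # Row-normalise the weights into transition probabilities.
    transition = [[w / sum(row) for w in row] for row in weights]
    # One-hot rows for labeled points, all-zero rows for unlabeled points.
    clamp = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
             for i in range(n)]
    dist = [row[:] for row in clamp]
    converged = False
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new = [[sum(transition[i][j] * dist[j][c] for j in range(n))
                for c in range(n_classes)] for i in range(n)]
        # Labeled points keep their original label in this sketch.
        for i in range(n):
            if labels[i] != -1:
                new[i] = clamp[i][:]
        change = max(abs(new[i][c] - dist[i][c])
                     for i in range(n) for c in range(n_classes))
        dist = new
        if change < tol:
            converged = True
            break
    predicted = [max(range(n_classes), key=lambda c: dist[i][c])
                 for i in range(n)]
    return predicted, iterations, converged


# Toy data: one labeled point at each end, four unlabeled points in between.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
labels = [0, -1, -1, -1, -1, 1]

pred, n_iter, ok = label_propagation(rbf_affinity(points, 1.0), labels, 2, max_iter=1000)
_, n_short, ok_short = label_propagation(rbf_affinity(points, 1.0), labels, 2, max_iter=2)
_, n_high, ok_high = label_propagation(rbf_affinity(points, 2.0), labels, 2, max_iter=1000)
print(pred, n_iter, ok)
print(n_short, ok_short)
print(n_high, ok_high)
```

With a limit of two iterations the run stops before the label distributions have settled, which corresponds to a classifier that did not have enough iterations to finish. Doubling gamma weakens the connections between neighboring points relative to each point's connection to itself, so more iterations are needed before the tolerance is met.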

MNB baseline: It is difficult to achieve a better result than the one produced by the MNB baseline. This is because SL-classifiers, such as the MNB-classifier, do not have the same potential for training errors as GSSL-classifiers. When using SL-classifiers the labels of the training data are unchangeable. Classification using LP does allow the already labeled training data to change labels. This

possible for the classifiers to be overfitted. The randomly selected labeled training documents should however counteract overfitting.

non-mathematicians. The other research methodologies being less adequate makes design and creation the best option.

Research question

We have two research questions that have been answered throughout the thesis.

Our first research question “How to explain graph-based semi-supervised learning for non-mathematicians?” is answered in the implementation-section (chapter 3) where we provide a detailed explanation of the process to construct GSSL-classification. This is followed by a more in-depth explanation of GSSL through the LP-algorithm (chapter 4.1) and by listing the programming code in Appendix B.

Our second research question “What kind of preprocessing is most effective regarding the quality of results of semi-supervised learning?” is answered in the classification comparison-section (chapter 4.2) with graphs over the results for different preprocessing techniques. Parts of this chapter also discuss different preprocessing techniques to complement the result in chapter 4.2.

Future Research

Future research could include what kind of effects different parameters used for classification and preprocessing have on the classification result. The selection process for parameters could also be further researched. This would result in a deeper understanding of the impact of the parameters which would improve the reader’s understanding of the classifiers and preprocessing.

Another research area is the classification of documents that are corrupt or incorrect. For example, some of the documents in the 20 Newsgroups data set have spelling mistakes. Researching how corrupt and incorrect data affects the classification would provide knowledge on the cases where it is important to fix these mistakes, or whether these mistakes could even be helpful for the classification.

It is also possible to further increase the number of classifiers by including GSSL-classifiers from other libraries. One such classifier is MAD, which is implemented in the Junto library [28].

References

[1] A. Subramanya and P. P. Talukdar, “Graph-based Semi Supervised Learning,” in Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2014, ISBN: 9781627052016.

[2] T. Iliou, C. Anagnostopoulos, M. Nerantzaki and G. Anastassopoulos, “A Novel Machine Learning Data Preprocessing Method for Enhancing Classification Algorithms Performance,” In Proc. EANN '15 Proceedings of the 16th International Conference on Engineering Applications of Neural Networks ‘09, 2015, Article No. 11.

[3] S. B. Kotsiantis, D. Kanellopoulos and P. E. Pintelas, “Data Preprocessing for Supervised Learning,” International Journal of Computer Science, vol. 1, no. 1, pp. 111-117, 2006.

[4] I. H. Witten, E. Frank, M. A. Hall and C. J. Palestro, Data Mining: Practical Machine Learning Tools and Techniques, Burlington, MA: Elsevier, 2016.

[5] C. Yin, J. Xiang, H. Zhang, J. Wang, Z. Yin and J. Kim, “A New SVM Method for Short Text Classification Based on Semi-Supervised Learning,” In Proc. 4th International Conference on Advanced Information Technology and Sensor Application ‘08, 2015, pp. 100-103.

[6] M. K. Dalal and M. A. Zaveri, “Automatic Text Classification of sports blog data,” In Proc. Computing, Communications and Applications Conference ‘01, 2012, pp. 219-222.

[7] N. Widmann and S. Verberne, “Graph-based Semi-supervised Learning for Text Classification,” in Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval - ICTIR ’17, Amsterdam, The Netherlands, 2017, pp. 59–66.

[8] S. Raschka, Python Machine Learning. UK: Packt Publishing Ltd., 2015.

[9] J. Han, M. Kamber and J. Pei, Data mining: Concepts and techniques. Waltham, MA: Elsevier, 2012.

[10] A. Casari and A. Zheng, “Chapter 4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf”, in Feature Engineering for Machine Learning, CA: O'Reilly Media, Inc., 2018, pp. 61-76.

[11] “Semi-Supervised” [Online]. Available: https://scikit-learn.org/stable/modules/label_propagation.html [Accessed: March. 4, 2019].

[12] K. Ozaki, M. Shimbo, M. Komachi and Y. Matsumoto, “Using the mutual k-nearest neighbor graphs for semi-supervised classification of natural language data,” In Proc. Proceedings of the Fifteenth Conference on Computational Natural Language Learning ‘06, 2011, pp. 154-162.

[13] X. Zhu and Z. Ghahramani, “Learning from Labeled and Unlabeled Data with Label Propagation,” 2002.

[14] C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.

[15] H. Yang, S. Zhu, I. King and M. R. Lyu, “Can irrelevant data help semi-supervised learning, why and how?,” In Proc. Proceedings of the 20th ACM Conference on Information and Knowledge Management ‘10, 2011, pp. 937-946.

[16] A. Subramanya and J. Bilmes, “Soft-Supervised Learning for Text Classification” in Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing ‘10, 2018, pp. 1090-1099.

[17] S. H. Srinivasan, “Features for unsupervised document classification” In Proc. COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20, 2002, pp. 1-7.

[18] R. Pimplikar, D. Garg, D. Bharani, G. Parija, “Learning to Propagate Rare Labels”, in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management ‘11, 2014, pp. 201-201.

[19] T. D. Bui, S. Ravi, and V. Ramavajjala, “Neural Graph Learning: Training Neural Networks Using Graphs”. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining ‘02, 2018, pp. 64-71.

[20] X. Zhu, “Semi-Supervised Learning with Graphs,” Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA, 2005.

[21] F. Dong, Y. Guo, C. Li, G. Xu and F. Wei, “ClassifyDroid: Large scale Android applications classification using semi-supervised Multinomial Naive Bayes”. In Proc. 4th International Conference on Cloud Computing and Intelligence Systems (CCIS) ‘08, 2016, pp. 77-81.

[22] B. Hardin and U. Kanewala, “Using Semi-Supervised Learning for Predicting Metamorphic Relations,” In Proceeding MET ‘18 Proceedings of the 3rd International Workshop in Metamorphic Testing ‘05, 2018, pp. 14-17.

[23] B. J. Oates, Researching Information Systems and Computing. London, UK: SAGE Publications Ltd, 2005.

[24] N. Widmann, “Graph-based semi-supervised learning of semantic text clusters”, M. S. thesis, Radboud University, Netherlands, 2017.

[25] J. Davis and M. Goadrich, “The relationship between precision-recall and ROC curves”. In Proceeding ICML ‘06 Proceedings of the 23rd international conference on Machine Learning, 2006, pp. 233-240.

[26] “Feature extraction” [Online]. Available: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction [Accessed: May. 16, 2019].

[27] P. P. Talukdar and F. Pereira, “Experiments in graph-based semi-supervised learning methods for class-instance acquisition”, ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1473-1481, July 2010.

[28] “The Junto Label Propagation Toolkit” [Online]. Available: https://github.com/parthatalukdar/junto [Accessed: May. 23, 2019].

[29] E. W. Weisstein, “Multinomial Distribution” [Online]. Available: http://mathworld.wolfram.com/MultinomialDistribution.html [Accessed: May. 24, 2019].

Appendix A

Each score in Appendix A displays the average of ten results, where each of the ten results uses a different set of randomly selected training documents. These randomly selected documents should keep their labels going into the classification algorithm with Label Propagation. Naive Bayes should only use the randomly selected documents for training. This thesis uses the categories “rec.autos”, “rec.motorcycles”, “rec.sport.baseball” and “rec.sport.hockey”.
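The selection and averaging procedure described above can be sketched as follows. The helpers `mask_labels` and `average_score` and the toy targets are hypothetical and not taken from the implementation. The only library convention relied upon is that scikit-learn's semi-supervised classifiers treat the label -1 as "unlabeled". The scoring function in the example is a stand-in; in a real run it would train a classifier on the masked targets and return its accuracy or F1-score.

```python
import random


def mask_labels(targets, labels_per_category, seed):
    """Keep labels_per_category labels per category; mark the rest as -1."""
    rng = random.Random(seed)
    by_category = {}
    for index, target in enumerate(targets):
        by_category.setdefault(target, []).append(index)
    keep = set()
    for indices in by_category.values():
        keep.update(rng.sample(indices, labels_per_category))
    return [t if i in keep else -1 for i, t in enumerate(targets)]


def average_score(run_once, n_runs=10):
    """Average a score over n_runs runs, each with its own random seed."""
    return sum(run_once(seed) for seed in range(n_runs)) / n_runs


# Toy targets: four categories with twenty documents each.
targets = [0] * 20 + [1] * 20 + [2] * 20 + [3] * 20
masked = mask_labels(targets, labels_per_category=10, seed=0)


def labeled_fraction(seed):
    """Stand-in for a real scoring run: the share of documents left labeled."""
    run = mask_labels(targets, labels_per_category=10, seed=seed)
    return sum(t != -1 for t in run) / len(run)


print(average_score(labeled_fraction))
```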

Table A1: Average classification f1 score 100 labels Feature Extraction

Table A2: Average classification f1 score 100 labels Feature

Table A3: Average classification accuracy 100 labels Feature Extraction

Table A4: Average classification accuracy 100 labels Feature

Table A5: Average classification f1 score 50 labels Feature Extraction

Table A6: Average classification f1 score 50 labels Feature

Table A7: Average classification accuracy 50 labels Feature Extraction

Table A8: Average classification accuracy 50 labels Feature

Table A9: Average classification f1 score 25 labels Feature Extraction

Table A10: Average classification f1 score 25 labels Feature

Table A11: Average classification accuracy 25 labels Feature Extraction

Table A12: Average classification accuracy 25 labels Feature

Table A13: Average classification f1 score 10 labels Feature Extraction

Table A14: Average classification f1 score 10 labels Feature

Table A15: Average classification accuracy 10 labels Feature Extraction

Table A16: Average classification accuracy 10 labels Feature

Appendix B

Appendix B contains some important code snippets used in the implementation process of our artefact.

from sklearn import datasets
# Load a bundled data set and print it to confirm that scikit-learn is installed.
digits = datasets.load_digits()
print(digits.data)

Figure B1: Verification of the installation

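from sklearn.datasets import fetch_20newsgroups
categories = ['rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey']
# Load the training set without headers, footers and quoted replies.
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories)

Figure B2: Data cleaning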

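import string
import nltk
# lemmatizer and the helper get_wordnet_pos are defined elsewhere in the implementation.
# Lemmatize every token of document i and drop punctuation tokens.
newsgroups_train.data[i] = " ".join(
    lemmatizer.lemmatize(w, get_wordnet_pos(w))
    for w in nltk.word_tokenize(newsgroups_train.data[i])
    if w not in string.punctuation)

Figure B3: Lemmatization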

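import re
# Replace the stop-word `word`, when followed by whitespace, with a single space in document i.
newsgroups_train.data[i] = re.sub(r'\b' + re.escape(word) + r'\s', ' ', newsgroups_train.data[i])

Figure B4: Remove stop-word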

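from sklearn.feature_extraction.text import CountVectorizer
# Ignore words found in more than 50% of the documents or in fewer than 10 documents.
vectorizer = CountVectorizer(max_df=0.5, min_df=10)

Figure B5: Remove rare and frequent features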

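from sklearn.naive_bayes import MultinomialNB
# toarray() yields a plain ndarray; recent scikit-learn versions reject the np.matrix returned by todense().
clf = MultinomialNB().fit(vectors.toarray(), dataset.train['target'])
# Vectorize the test documents with the fitted vectorizer and predict their categories.
test_vec = vectorizer.transform(dataset.test['data'])
pred = clf.predict(test_vec.toarray())

Figure B6: Multinomial naive Bayes classifier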

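from sklearn.semi_supervised import LabelSpreading
# Label spreading on a k-nearest-neighbor graph with 10 neighbors.
clf = LabelSpreading(kernel='knn', n_neighbors=10).fit(vectors.toarray(), dataset.train['target'])
test_vec = vectorizer.transform(dataset.test['data'])
pred = clf.predict(test_vec.toarray())

Figure B7: Label Propagation KNN kernel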

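from sklearn.semi_supervised import LabelPropagation
# Label propagation on a fully connected graph with RBF weights (gamma=5).
clf = LabelPropagation(kernel='rbf', gamma=5).fit(vectors.toarray(), dataset.train['target'])
test_vec = vectorizer.transform(dataset.test['data'])
pred = clf.predict(test_vec.toarray())

Figure B8: Label Propagation RBF kernel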

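# Accuracy on the test documents.
clf.score(test_vec.toarray(), dataset.test['target'])

Figure B9: Calculate accuracy

from sklearn import metrics
# Macro-averaged F1-score: every category carries the same weight.
metrics.f1_score(dataset.test['target'], pred, average='macro')

Figure B10: Calculate F1-score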
