
5 Discussion and conclusion  

We are not experts in the field of machine learning and can therefore not provide an advanced analysis of why the classifiers work the way they do. This also makes us more susceptible to mistakes, and we may not have tuned the algorithms and classifiers ideally. The time constraint for this degree thesis has not allowed us to fully explore the options available during the GSSL process. Such options could, for example, have been to test other GSSL-classifiers such as MAD [27] or to improve the selection method for our vocabularies.

The time constraint also affects the maximum number of iterations the LP-algorithm was allowed to run. A lower number enables more trial and error, while a higher number allows us to achieve a better classification result for BOW, since some classifiers did not have enough iterations to finish.
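Whether a run stopped at the iteration limit can be checked directly. The following is a minimal sketch, assuming scikit-learn's LabelPropagation with its max_iter parameter and n_iter_ attribute. The synthetic data, the gamma value, the two limits and the helper hit_iteration_limit are illustrative assumptions, not the settings used in this thesis.

```python
import warnings

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.exceptions import ConvergenceWarning
from sklearn.semi_supervised import LabelPropagation

# Synthetic stand-in for the document vectors (assumption, not thesis data).
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

# Keep 5 labels per class and mark all other samples as unlabeled (-1).
rng = np.random.RandomState(0)
y_train = np.full_like(y, -1)
for c in np.unique(y):
    idx = rng.choice(np.where(y == c)[0], size=5, replace=False)
    y_train[idx] = y[idx]


def hit_iteration_limit(max_iter):
    """Fit LP and report whether it stopped at the iteration limit."""
    clf = LabelPropagation(kernel="rbf", gamma=0.5, max_iter=max_iter)
    with warnings.catch_warnings():
        # The warning is silenced because the limit is checked explicitly below.
        warnings.simplefilter("ignore", ConvergenceWarning)
        clf.fit(X, y_train)
    return clf.n_iter_ >= max_iter, clf.n_iter_


for limit in (2, 1000):
    stopped_at_limit, n_iter = hit_iteration_limit(limit)
    print(f"max_iter={limit}: ran {n_iter} iterations, hit limit: {stopped_at_limit}")
```

A run that reports hitting the limit has not converged, so its score says more about the limit than about the kernel or the feature extraction.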

 

Accuracy and F1-score 

The classification measurement calculates the percentage of correctly classified documents. A successful score should be higher than that of randomly selecting the labels for the documents. Since we have four categories, there is a 25% chance of guessing the correct label by simply assigning documents their label randomly. It is therefore essential that the classification algorithm has a higher score than 25% for it to be useful.

Figure 7 shows that our classification algorithms have a better accuracy than 25%, which makes the predictions better than just randomly assigning labels. There are however results from our classification where we could simply have assigned the labels randomly. Figure 6 shows that the RBF-kernel has a score of around 25% due to reaching the iteration limit when using BOW. In these cases it is better to look at the error rate, where we want as low an error rate as possible. A classification score of 100% might look good on paper but is not desirable. Such a score could tell us that something is not working properly or that the classifier has been overtrained on the data.

Besides achieving a higher classification score than randomly selecting labels, it is important that the classifier is precise. We also want the classification algorithms to be sensitive, which is measured by recall. The F1-score is the harmonic mean of precision and recall, which shows whether the classifier is both precise and sensitive. Each category has an equal effect on the F1-score, and a significantly lower F1-score than accuracy score shows that the classifier is biased towards labeling documents with a certain category. Also, with categories of similar size, the F1-score is typically at or below the accuracy score, although this is not strictly guaranteed.
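The relation between accuracy and the macro-averaged F1-score can be illustrated with a small example. This is a sketch with hypothetical labels for four balanced categories, using scikit-learn's accuracy_score and f1_score. The two prediction lists and the helper scores are made up for illustration and are not results from this thesis.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth: four categories (0..3) with five documents each.
y_true = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5

# Errors spread evenly: each category loses one document to another category.
y_even = [0, 0, 0, 0, 1,
          1, 1, 1, 1, 2,
          2, 2, 2, 2, 3,
          3, 3, 3, 3, 0]

# Same number of errors, but all of them are category-1 documents labeled as 0.
y_biased = [0] * 5 + [0, 0, 0, 0, 1] + [2] * 5 + [3] * 5


def scores(y_pred):
    """Return (accuracy, macro F1) for a list of predicted labels."""
    return (accuracy_score(y_true, y_pred),
            f1_score(y_true, y_pred, average="macro"))


acc_even, f1_even = scores(y_even)
acc_biased, f1_biased = scores(y_biased)
print(f"even errors:   accuracy={acc_even:.3f}, macro F1={f1_even:.3f}")
print(f"biased errors: accuracy={acc_biased:.3f}, macro F1={f1_biased:.3f}")
```

Both classifiers make four mistakes, so their accuracy is identical. Only the biased one gets a macro F1-score below its accuracy, because the category it neglects has a low recall.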

 

 

Instability 

Empty training documents have a major effect on the stability of the result when using GSSL without a vocabulary. The result varies because randomly selected labeled training documents can be empty after the preprocessing, and such documents are poor training data for the classifier. The result of MNB does not vary as much as that of the GSSL-algorithms for the same number of empty training documents. This is most likely because MNB only uses the labeled training documents when training and thereby excludes many of the empty documents. This is not the case for GSSL-algorithms, which use all the training documents.
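Empty documents can be detected after feature extraction by looking for rows without any features. The sketch below assumes scikit-learn's CountVectorizer with its built-in English stop-word list. The four documents are hypothetical and only illustrate the check.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents after data cleaning; two contain no usable words.
docs = [
    "the engine needs new oil",
    "",                       # nothing left after removing headers and quotes
    "the and of",             # only stop words
    "great hockey game last night",
]

vectorizer = CountVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(docs)

# A document is empty when its row in the document-term matrix sums to zero.
row_sums = np.asarray(vectors.sum(axis=1)).ravel()
empty_mask = row_sums == 0

print("empty documents:", np.where(empty_mask)[0].tolist())
print("share of empty documents:", empty_mask.mean())
```

Such a mask could be used to count the empty documents among the randomly selected labeled documents of a run, or to exclude them before the labels are selected.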

 

Vocabulary 

Vocabulary features: The vocabulary constructed from the preprocessing reduced the number of features, or unique words, in all texts from 2793 to 27. This number varies when using the MNB-classifier, which is an SL-classifier and therefore only uses the labeled training documents when training. SSL-classifiers use both labeled and unlabeled training documents. Without a vocabulary, MNB usually had around 2200 features. The number of features using the runtime vocabulary depends on the randomly selected training documents and varies between 10 and 40 features. However, it is unrealistic that a runtime vocabulary would consist of ten features, since it would mean that the four different categories would have the same ten most frequently used words. Instead, the number of features for a runtime vocabulary was the same as for the standard vocabulary.

The classification tests show that the classifier makes more stable predictions with a vocabulary, because we remove noisy words and get a lower feature count per document. Since the feature count is lower and unnecessary words which could “confuse” the classifier are removed, the classification measurements differ less between lower and higher numbers of labeled training documents per category. This makes vocabularies well suited for algorithms using a low number of labels, since without a vocabulary there may not be enough features to differentiate the categories.

A vocabulary makes it easier to predict categories, but the downside is that it lowers the maximum score the classifier can reach. With fewer features, some features which would have differentiated the categories are no longer included in the classification. For example, the categories car and [...] useful in the classification. A runtime vocabulary provides more accurate results since the above issue is eliminated. We saw a 2-5% increase in classification score using a runtime vocabulary compared to the standard vocabulary. A runtime vocabulary could also give a worse result than the standard vocabulary. This happens when the randomly selected labeled training documents are a misleading representation of the other training documents. This means that the words in the vocabulary rarely occur in the [...] use of MNB with a vocabulary will perform worse.

Any reduction in the number of unique features would however worsen the result of MNB. MNB has an accuracy of 0.81 at 100 labeled training documents per category using TF-IDF without a vocabulary. When using data which has not gone through data cleaning or preprocessing for the same scenario, the accuracy improves to 0.90. This shows that the result of MNB improves even when using features which many other classifiers perceive as poor for classification.

 

GSSL using vocabulary: The classification result for LS using KNN- and RBF-kernels decreases when using a vocabulary because of the lower number of distinguishing features. Another reason is that the vocabulary is not optimal for distinguishing between the categories, since it is constructed from the most frequent words in each category and nothing stops these words from occurring in another category. A vocabulary with words that only occur in one category would therefore improve the result. Such a vocabulary could be constructed manually by a human selecting the features. However, this would only be possible for the standard vocabulary and not for the runtime vocabulary, since the latter is built from the randomly generated categories. The feature selection for the runtime vocabulary could instead be done by a custom algorithm, which would be very time-consuming.
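A vocabulary built from the most frequent words of each category can be sketched as follows. The helper build_vocabulary, its top_k parameter and the toy documents are hypothetical and only illustrate the idea. The sketch assumes that scikit-learn's CountVectorizer is given the resulting word list through its vocabulary parameter.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer


def build_vocabulary(docs, labels, top_k=10):
    """Union of the top_k most frequent words of each category (hypothetical helper)."""
    analyzer = CountVectorizer(stop_words="english").build_analyzer()
    counts = {}
    for doc, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(analyzer(doc))
    vocabulary = set()
    for counter in counts.values():
        vocabulary.update(word for word, _ in counter.most_common(top_k))
    return sorted(vocabulary)


# Toy labeled documents (assumption): 0 = cars, 1 = hockey.
docs = [
    "engine oil engine brakes game",
    "engine brakes tires",
    "hockey game goal hockey",
    "hockey game puck",
]
labels = [0, 0, 1, 1]

vocab = build_vocabulary(docs, labels, top_k=2)
vectorizer = CountVectorizer(vocabulary=vocab)
vectors = vectorizer.fit_transform(docs)

print(vocab)
print(vectors.toarray())
```

In this toy data the word "game" enters the vocabulary through the hockey category but also occurs in a car document, which is the kind of overlap described above. With four categories and ten words per category, the union holds between 10 and 40 words depending on how many words the categories share.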

The classification result increases for LP, in contrast to LS, when using a vocabulary. An example is LP RBF, which shows a significant difference of 15 percentage points for 100 labeled training documents depending on whether it uses a vocabulary or not. LP without a vocabulary has a steeper classification line towards 100 labeled training documents than LP with a vocabulary. This leads us to believe that if the number of labels per category increased further, LP without a vocabulary should give a better result than with a vocabulary.

[...] allowed iterations, whereas TF-IDF produces a result without reaching the iteration limit. The RBF-kernel using BOW is not even close to producing a result at [...]

A higher number of neighbors increases the number of iterations required. Higher gamma values when using the RBF-kernel also increase the number of iterations [...] number of iterations required to train the classifier properly.

 

MNB baseline: It is difficult to achieve a better result than the one produced by the MNB baseline. This is because SL-classifiers, such as the MNB-classifier, do not have the same potential for training errors as GSSL-classifiers. When using SL-classifiers, the labels of the training data are unchangeable. Classification using LP does allow the already labeled training data to change labels. This [...]

[...] possible for the classifiers to be overfitted. The randomly selected labeled training documents should however counteract overfitting. [...] non-mathematicians. The other research methodologies being less adequate makes design and creation the best option.

 

Research question 

We have two research questions that have been answered throughout the thesis. Our first research question, “How to explain graph-based semi-supervised learning for non-mathematicians?”, is answered in the implementation section (chapter 3), where we provide a detailed explanation of the process of constructing GSSL-classification. This is followed by a more in-depth explanation of GSSL through the LP-algorithm (chapter 4.1) and by listing the programming code in Appendix B.

 

Our second research question, “What kind of preprocessing is most effective regarding the quality of results of semi-supervised learning?”, is answered in the classification comparison section (chapter 4.2) with graphs of the results for different preprocessing techniques. Parts of this chapter also discuss different preprocessing techniques to complement the result in chapter 4.2.

 

Future Research 

Future research could investigate what effects the different parameters used for classification and preprocessing have on the classification result. The selection process for the parameters could also be researched further. This would give a deeper understanding of the impact of the parameters, which would improve the reader’s understanding of the classifiers and the preprocessing.

Another research area is the classification of documents that are corrupt or incorrect. For example, some of the documents in 20 Newsgroups have spelling mistakes. Researching how corrupt and incorrect data affects the classification would provide knowledge on when it is important to fix these mistakes, or whether the mistakes could even be helpful for the classification.

It is also possible to extend the comparison with GSSL-classifiers from other libraries. One such classifier is MAD, which is implemented in the Junto library [28].

   

 

References 

[1] A. Subramanya and P. P. Talukdar, “Graph-based Semi Supervised Learning,” in Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2014, ISBN: 9781627052016.

[2] T. Iliou, C. Anagnostopoulos, M. Nerantzaki and G. Anastassopoulos, “A Novel Machine Learning Data Preprocessing Method for Enhancing Classification Algorithms Performance,” in Proc. EANN '15 Proceedings of the 16th International Conference on Engineering Applications of Neural Networks ‘09, 2015, Article No. 11.

[3] S. B. Kotsiantis, D. Kanellopoulos and P. E. Pintelas, “Data Preprocessing for Supervised Learning,” International Journal of Computer Science, vol. 1, no. 1, pp. 111-117, 2006.

[4] I. H. Witten, E. Frank, M. A. Hall and C. J. Pal, Data Mining: Practical Machine Learning Tools and Techniques. Burlington, MA: Elsevier, 2016.

[5] C. Yin, J. Xiang, H. Zhang, J. Wang, Z. Yin and J. Kim, “A New SVM Method for Short Text Classification Based on Semi-Supervised Learning,” in Proc. 4th International Conference on Advanced Information Technology and Sensor Application ‘08, 2015, pp. 100-103.

[6] M. K. Dalal and M. A. Zaveri, “Automatic Text Classification of sports blog data,” in Proc. Computing, Communications and Applications Conference ‘01, 2012, pp. 219-222.

[7] N. Widmann and S. Verberne, “Graph-based Semi-supervised Learning for Text Classification,” in Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval - ICTIR ’17, Amsterdam, The Netherlands, 2017, pp. 59–66.

[8] S. Raschka, Python Machine Learning. UK: Packt Publishing Ltd., 2015.

[9] J. Han, M. Kamber and J. Pei, Data mining: Concepts and techniques. Waltham, MA: Elsevier, 2012.

[10] A. Casari and A. Zheng, “Chapter 4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf,” in Feature Engineering for Machine Learning, CA: O'Reilly Media, Inc., 2018, pp. 61-76.

[11] “Semi-Supervised” [Online]. Available: https://scikit-learn.org/stable/modules/label_propagation.html [Accessed: March. 4, 2019].

[12] K. Ozaki, M. Shimbo, M. Komachi and Y. Matsumoto, “Using the mutual k-nearest neighbor graphs for semi-supervised classification of natural language data,” in Proc. Proceedings of the Fifteenth Conference on Computational Natural Language Learning ‘06, 2011, pp. 154-162.

[13] X. Zhu and Z. Ghahramani, “Learning from Labeled and Unlabeled Data with Label Propagation,” 2002.

[14] C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.

[15] H. Yang, S. Zhu, I. King and M. R. Lyu, “Can irrelevant data help semi-supervised learning, why and how?,” in Proc. Proceedings of the 20th ACM Conference on Information and Knowledge Management ‘10, 2011, pp. 937-946.

[16] A. Subramanya and J. Bilmes, “Soft-Supervised Learning for Text Classification,” in Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing ‘10, 2018, pp. 1090-1099.

[17] S. H Srinivasan, “Features for unsupervised document classification,” in Proc. COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20, 2002, pp. 1-7.

[18] R. Pimplikar, D. Garg, D. Bharani, G. Parija, “Learning to Propagate Rare Labels,” in Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management ‘11, 2014, pp. 201-201.

[19] T. D. Bui, S. Ravi, and V. Ramavajjala, “Neural Graph Learning: Training Neural Networks Using Graphs,” in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining ‘02, 2018, pp. 64-71.

[20] X. Zhu, “Semi-Supervised Learning with Graphs,” Ph.D thesis, Carnegie Mellon University, Pittsburgh, PA, 2005.

[21] F. Dong, Y. Guo, C. Li, G. Xu and F. Wei, “ClassifyDroid: Large scale Android applications classification using semi-supervised Multinomial Naive Bayes,” in Proc. 4th International Conference on Cloud Computing and Intelligence Systems (CCIS) ‘08, 2016, pp. 77-81.

[22] B. Hardin and U. Kanewala, “Using Semi-Supervised Learning for Predicting Metamorphic Relations,” in Proceeding MET ‘18 Proceedings of the 3rd International Workshop in Metamorphic Testing ‘05, 2018, pp. 14-17.

[23] B. J. Oates, Researching Information Systems and Computing. London, UK: SAGE Publications Ltd, 2005.

[24] N. Widmann, “Graph-based semi-supervised learning of semantic text clusters,” M. S. thesis, Radboud University, Netherlands, 2017.

[25] J. Davis and M. Goadrich, “The relationship between precision-recall and ROC curves,” in Proceeding ICML ‘06 Proceedings of the 23rd international conference on Machine Learning, 2006, pp. 233-240.

[26] “Feature extraction” [Online]. Available: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction [Accessed: May. 16, 2019].

[27] P. P. Talukdar and F. Pereira, “Experiments in graph-based semi-supervised learning methods for class-instance acquisition,” ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1473-1481, July 2010.

[28] “The Junto Label Propagation Toolkit” [Online]. Available: https://github.com/parthatalukdar/junto [Accessed: May. 23, 2019].

[29] E. W. Weisstein, “Multinomial Distribution.” [Online]. Available: http://mathworld.wolfram.com/MultinomialDistribution.html [Accessed: May. 24, 2019].

 

Appendix A  

Each score in Appendix A is the average of ten results, where each of the ten results uses a different set of randomly selected training documents. These randomly selected documents should keep their labels going into the classification algorithm with Label Propagation. Naive Bayes should only use the randomly selected documents for training. This thesis uses the categories “rec.autos”, “rec.motorcycles”, “rec.sport.baseball” and “rec.sport.hockey”.

    

Table A1: Average classification F1-score, 100 labels

Table A2: Average classification F1-score, 100 labels

Table A3: Average classification accuracy, 100 labels

Table A4: Average classification accuracy, 100 labels

Table A5: Average classification F1-score, 50 labels

Table A6: Average classification F1-score, 50 labels

Table A7: Average classification accuracy, 50 labels

Table A8: Average classification accuracy, 50 labels

Table A9: Average classification F1-score, 25 labels

Table A10: Average classification F1-score, 25 labels

Table A11: Average classification accuracy, 25 labels

Table A12: Average classification accuracy, 25 labels

Table A13: Average classification F1-score, 10 labels

Table A14: Average classification F1-score, 10 labels

Table A15: Average classification accuracy, 10 labels

Table A16: Average classification accuracy, 10 labels

 

Appendix B 

Appendix B contains some important code snippets used in the implementation process of our artefact.

 

from sklearn import datasets

# Load a bundled dataset and print it to verify the installation
digits = datasets.load_digits()
print(digits.data)

Figure B1: Verification of the installation   

 

from sklearn.datasets import fetch_20newsgroups

# categories is the list of newsgroup names to load
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

Figure B2: Data cleaning

   

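import string
import nltk

# Lemmatize every token of document i and skip punctuation tokens;
# lemmatizer and get_wordnet_pos are defined elsewhere in the artefact
newsgroups_train.data[i] = " ".join(
    lemmatizer.lemmatize(w, get_wordnet_pos(w))
    for w in nltk.word_tokenize(newsgroups_train.data[i])
    if w not in string.punctuation)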

Figure B3: Lemmatization 

import re

# Remove one stop word (word) from document i
newsgroups_train.data[i] = re.sub(r'\b' + re.escape(word) + r'\s', ' ', newsgroups_train.data[i])

Figure B4: Remove stop-word 

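from sklearn.feature_extraction.text import CountVectorizer

# Ignore words that occur in more than 50% of the documents or in fewer than 10 documents
CountVectorizer(max_df=0.5, min_df=10)

Figure B5: Remove rare and frequent features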

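from sklearn.naive_bayes import MultinomialNB

# toarray() returns a plain NumPy array, which recent scikit-learn versions require
clf = MultinomialNB().fit(vectors.toarray(), dataset.train['target'])
test_vec = vectorizer.transform(dataset.test['data'])
pred = clf.predict(test_vec.toarray())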

Figure B6: Multinomial naive Bayes classifier 

 

from sklearn.semi_supervised import LabelSpreading

# The parameter is spelled n_neighbors in scikit-learn
clf = LabelSpreading(kernel='knn', n_neighbors=10).fit(vectors.toarray(), dataset.train['target'])
test_vec = vectorizer.transform(dataset.test['data'])
pred = clf.predict(test_vec.toarray())

Figure B7: Label Propagation KNN kernel 

from sklearn.semi_supervised import LabelPropagation

clf = LabelPropagation(kernel='rbf', gamma=5).fit(vectors.toarray(), dataset.train['target'])
test_vec = vectorizer.transform(dataset.test['data'])
pred = clf.predict(test_vec.toarray())

Figure B8: Label Propagation RBF kernel  

 

# Mean accuracy of the fitted classifier on the test documents
clf.score(test_vec.toarray(), dataset.test['target'])

Figure B9: Calculate accuracy

   

from sklearn import metrics

# Macro averaging gives every category the same weight
metrics.f1_score(dataset.test['target'], pred, average='macro')

Figure B10: Calculate F1-score

     
