Automatic subject classification of textual documents using limited or no training data


Department of Electronic and Computer Engineering

Automatic Subject Classification of Textual Documents Using Limited or No Training Data

Author: Arash Joorabchi
Supervised by: Dr. Abdulhussain E. Mahdi

Submitted for the degree of Doctor of Philosophy
Submitted to the University of Limerick, November 2010

Abstract

With the explosive growth in the number of electronic documents available on the Internet, intranets, and digital libraries, there is a greater need than ever for automatic systems capable of indexing and organising such large volumes of data. Automatic Text Classification (ATC) has become one of the principal means of enhancing the performance of information retrieval systems and organising digital libraries and other textual collections. Within this context, the use of Machine Learning (ML) algorithms has been the dominant approach to ATC since the 1990s. However, one of the major obstacles to deploying ML-based ATC systems in practical real-world applications is the lack, or insufficient quality or quantity, of labelled datasets for training the ML algorithms.

The aim of this work is to address this problem by investigating two lines of research: (a) the development of new bootstrapping methods which automate the process of creating the labelled corpora required for training ML-based ATC systems; and (b) the development of a new breed of ATC algorithms which are unsupervised, and therefore do not require any training data. To achieve this aim, the project has mainly focused on utilising two knowledge sources whose potential application in ATC has yet to be fully explored: conventional library organisation resources (e.g., library classification schemes, thesauri, and online public access catalogues), and the linkage among documents in the form of citation and reference networks.

In relation to bootstrapping methods for ML-based ATC systems, our investigation has resulted in the development of two new methods. These methods greatly reduce the human involvement in the process of building training datasets by utilising documents and textual content that are abundantly available on the Internet as training samples.

The other major contribution of this work is the development and evaluation of a new unsupervised ATC method which is capable of classifying a wide range of documents with high accuracy according to a library classification scheme without requiring any training data. This method, named Bibliography-Based ATC (BB-ATC), is based on the hypothesis that the citations and references in a document can be used as primary sources of information to determine the subject of the document with high accuracy. The proposed BB-ATC method automatically mines the citation and reference networks among documents and uses the classification metadata of manually classified documents to predict the subject/class of unlabelled documents.

Finally, our further investigation into the application of citation networks to topical indexing of documents has resulted in the development of a new unsupervised keyword/keyphrase extraction method for scientific documents, which is based on the same underlying hypothesis as BB-ATC. The developed keyphrase extraction method does not require any training data and yields an accuracy similar to that obtained by human indexers and by state-of-the-art ML-based keyphrase extraction methods, whose accuracy is highly dependent on the quality and quantity of the manually labelled training data.

Declaration

This thesis is presented in fulfilment of the requirements for the degree of Doctor of Philosophy. It is entirely my own work and has not been submitted to any other university or higher education institution, or for any other academic award in this university. Where use has been made of the work of other people, it has been fully acknowledged and fully referenced.

Signature of Author: ................................ November 18, 2010
Arash Joorabchi

Signature of Supervisor: ................................ November 18, 2010
Hussain Mahdi

Acknowledgements

I wish to express my sincere thanks to my supervisor, Dr. Hussain Mahdi, who gave me the opportunity to undertake this PhD research programme at the Department of Electronic and Computer Engineering, University of Limerick. This would not have been possible without his support and help. Throughout the course of this work, he has also supported me with his valuable guidance, encouragement, and patience.

I would also like to take this opportunity to thank my parents for making it all possible.

Nomenclature

List of acronyms

ACM-CCS: ACM Computing Classification System
ADI: Average Degree of Importance
ATC: Automatic Text Classification/Categorisation
ASCED: Australian Standard Classification of Education
BOW: Bag Of Words
BB-ATC: Bibliography-Based ATC
BL: British Library
CKE: Citation-based Keyphrase Extraction
CRF: Conditional Random Field
CW: Combined Weight
DDC: Dewey Decimal Classification
DPD: Definitive Programme Document
FP: False Positive
FN: False Negative
FO: First Occurrence
GBS: Google Book Search
GF: Global Frequency
GWC: Google Word Cloud
HEA: Higher Education Authority
HMM: Hidden Markov Model
IDF: Inverse Document Frequency
IE: Information Extractor
IG: Information Gain
ISCED: International Standard Classification of Education
JACS: Joint Academic Coding System
JAPE: Java Annotations Pattern Engine
JUNG: Java Universal Network/Graph Framework
KE: Knowledge Engineering
k-NN: k-Nearest Neighbour
LCSH: Library of Congress Subject Headings
LF: Local Frequency
LOC: Library of Congress
ML: Machine Learning
MI: Mutual Information
MNB: Multinomial Naive Bayes
MeSH: Medical Subject Heading
MSS: Module Syllabus Segmenter
MVB: Multi-Variate Bernoulli
NEE: Named Entity Extractor
NB: Naive Bayes
NC: Number of Characters
NLF: Normalised Local Frequency
NW: Number of Words
LCC: Library of Congress Classification
OCLC: Online Computer Library Centre
OAI: Open Access Initiative
OPAC: Online Public Access Catalogue
PDS: Programme Document Segmenter
Pr: Precision
Re: Recall
RF: Reference Frequency
SVM: Support Vector Machine
TCB: Training Corpus Builder
TDB: Training Dataset Builder
TF: Term Frequency
TFIDF: Term Frequency Inverse Document Frequency
TP: True Positive
TWCNB: Transformed Weight-normalized Complement Naive Bayes
ULF: Normalized Local Frequency

Table of Contents

Abstract
Declaration
Acknowledgements
Nomenclature
Table of Contents
List of Figures
List of Tables

CHAPTER 1. Introduction
  1.1. Overview and Rationale
  1.2. Aims and Objectives
  1.3. Organisation of Report

CHAPTER 2. Automatic Text Classification: A Review
  2.1. Introduction
  2.2. ATC Applications
  2.3. Machine Learning Methods for ATC
    2.3.1. Naive Bayes (NB) Classifier
    2.3.2. Support Vector Machine (SVM)
    2.3.3. k-Nearest Neighbour (k-NN)
  2.4. Document Representation
    2.4.1. The Problem of High Dimensionality
      2.4.1.1. Stop Words Removal
      2.4.1.2. Stemming and Lemmatization
      2.4.1.3. Feature Selection Methods
  2.5. Performance Measures
  2.6. Benchmark Datasets
    2.6.1. 20 Newsgroups Dataset
    2.6.2. Reuters-21578 Dataset
    2.6.3. OHSUMED Collection
    2.6.4. Four Universities Dataset

CHAPTER 3. Developing New Bootstrapping Methods for ML-based ATC
  3.1. Bootstrapping ML-based ATC Systems
  3.2. Bootstrapping ATC Systems Utilising Library Resources
    3.2.1. ATC System Components
      3.2.1.1. Universal Classification Scheme
      3.2.1.2. Training Corpus
      3.2.1.3. Classification Algorithm
    3.2.2. Prototype ATC System
      3.2.2.1. Training Corpus Builder (TCB)
      3.2.2.2. Training Dataset Builder (TDB)
      3.2.2.3. Classifier
    3.2.3. Evaluation and Experimental Results
  3.3. Bootstrapping ATC Systems Utilising Search Engines
    3.3.1. Hot Folder Application
    3.3.2. Pre-processing Unit
    3.3.3. Information Extractor (IE)
      3.3.3.1. Programme Document Segmenter (PDS)
      3.3.3.2. Module Syllabus Segmenter (MSS)
      3.3.3.3. Named Entity Extractor (NEE)
    3.3.4. Classifier
    3.3.5. Web-based Bootstrapping Method
    3.3.6. Post-processing Unit
    3.3.7. Evaluation and Experimental Results
  3.4. Summary

CHAPTER 4. A New Bibliography-Based ATC Method
  4.1. Introduction
  4.2. Outline of Proposed BB-ATC Method
  4.3. System Implementation and Functionality
  4.4. System Evaluation and Experimental Results
  4.5. Discussion of Results
  4.6. Summary

CHAPTER 5. Enhanced BB-ATC Method
  5.1. Introduction
  5.2. Enhanced BB-ATC Method Implementation and Functionality
  5.3. Evaluation & Experimental Results
  5.4. Discussion of Results
  5.5. Summary

CHAPTER 6. A Citation-based Approach to Automatic Topical Indexing of Scientific Literature
  6.1. Introduction
  6.2. Rationale and Related Work
  6.3. Citation-based Keyphrase Extraction
    6.3.1. Reference Extraction
    6.3.2. Data Mining
  6.4. Term Weighting and Selection
  6.5. System Evaluation & Experimental Results
  6.6. Discussion of Results
  6.7. Summary

CHAPTER 7. Conclusions
  7.1. Summary of Thesis
  7.2. Contributions
  7.3. Future Work

References

APPENDIX A. Publications
APPENDIX B. Sample Results for Using BB-ATC for Classification of Syllabus Documents
APPENDIX C. Sample Results for Using the Enhanced BB-ATC for Classification of Scientific Documents
APPENDIX D. Citation-based Keyphrase Extraction Results

List of Figures

Figure 2.1 Linear separation using perceptron
Figure 2.2 The best separating hyperplane
Figure 2.3 Separating hyperplane's margin
Figure 2.4 Kernel trick to make the data linearly separable
Figure 2.5 k-Nearest Neighbours
Figure 3.1 Overview of the prototype ATC system
Figure 3.2 Overview of the developed syllabus repository system
Figure 3.3 Building word vectors for the fields of the hierarchy
Figure 4.1 Overview of the prototype BB-ATC system
Figure 5.1 Overview of the prototype ATC system for the CiteSeer digital library
Figure 5.2 Data mining process
Figure 5.3 Inferring unit's visualised output for the sample document
Figure 5.4 Performance measures for the set of documents with no extracted references
Figure 5.5 Performance measures for the set of documents with 4 references
Figure 5.6 Performance measures for the set of documents with 8 references
Figure 5.7 Performance measures for the set of documents with 16 references
Figure 5.8 Performance measures for the set of documents with 32 references
Figure 5.9 The effect of number of documents' references on the performance measures
Figure 5.10 Percentage of documents in each level and corresponding performance scores
Figure 5.11 A sample snippet of level 5 classes in DDC
Figure 5.12 Sample key terms cloud from Google Books
Figure 6.1 A sample GWC from GBS database
Figure 6.2 Data mining process
Figure 6.3 The correlation between the number of extracted references and GWCs per document and the inter-consistency results
Figure 6.4 The GWC extracted for the sample document 12049 using the GBS direct querying approach

List of Tables

Table 2.1 List of topics in 20 Newsgroups dataset
Table 3.1 Baseball class in the DDC scheme
Table 3.2 20-Newsgroup to DDC mapping
Table 3.3 TWCNB classification accuracy results
Table 3.4 Performance results of the IE component
Table 3.5 Summary of issues affecting the performance of IE component
Table 4.1 Micro-averaged classification results of BB-ATC
Table 4.2 Sample syllabus classification results
Table 5.1 Sample document's metadata
Table 5.2 Data mining results for the sample document
Table 5.3 Micro-averaged performance measures for each document set in the test corpus
Table 5.4 Distribution of test documents among the DDC levels and corresponding performance measures achieved in each level
Table 6.1 Performance of the CKE algorithm compared to human indexers and competitive methods
Table 6.2 The inter-consistency of the CKE algorithm with each human team compared to that of Maui
Table 6.3 The inter-consistency of the CKE algorithm with human indexers compared to that of Maui on a per document basis
Table 6.4 Comparison of the top five keyphrases assigned to three sample documents from the wiki-20 collection by the human indexers, Maui, and our algorithm

CHAPTER 1. Introduction

1.1 Overview and Rationale

Automatic Text Classification or Categorization (ATC) is the automatic assignment of natural language text documents to one or more predefined subject categories/classes according to their contents. ATC is a subfield of Information Retrieval (IR). Knowing the subject of documents can greatly increase the performance of IR systems in terms of both accuracy and speed. In addition, classification of text documents allows us to move beyond conventional keyword-based search methods, which yield a large volume of results that users have to filter manually. Furthermore, classification of documents according to a standard classification scheme enables users to browse the collection by subject.

Due to the sheer volume of e-documents and their rapid growth, and given the above-mentioned benefits of document classification, the development of ATC systems and their applications has attracted researchers from the machine learning, information retrieval, and data management communities. Examples of document classification applications include classification of webpages, sorting of e-mails and news articles, and learning user reading interests, to name a few.

Using Machine Learning (ML) algorithms has been the dominant approach to ATC since the 1990s. ML-based ATC systems have demonstrated considerable success in a variety of text classification applications such as spam filtering [1], cataloguing news and journal articles [2], and classification of webpages [3]. However, these systems require a large volume of pre-classified documents in each subject category/class to train the ML algorithm and, in the majority of cases, such training data is not available. In these situations human cataloguers are usually called upon to create a labelled training corpus. However, this process is expensive, time consuming, and impractical in many cases.
For example, consider a medium-sized classification scheme with 1000 classes, where a modest 100 labelled training documents are required in each class. The cataloguers here are required to label 1000 × 100 = 100,000 documents. In addition, the classification scheme may change and the training data initially used may become out of date over time (i.e., the concept drift phenomenon). These issues may require the expensive manual labelling process to be repeated periodically to keep the training dataset up to date and accurate. In this thesis we propose a number of new methods to address the lack of training data for ATC.

1.2 Aims and Objectives

The aim of this work is to alleviate the problem of training data shortage for the development of efficient and robust ATC systems by pursuing two lines of research:

i. Development of new bootstrapping methods to automate the process of building labelled corpora for training ML-based ATC algorithms.
ii. Development of new unsupervised classification algorithms which do not require any training data.

In order to realise this aim, the project mainly focuses on utilising two knowledge sources whose potential application in ATC has not been fully explored before:

a) The conventional library organisation resources such as library classification schemes, thesauri, and online public access catalogues.
b) Linkage among documents in the form of citation and reference networks.

Based on the above, the following objectives have been devised for this project:

− Investigating the utilisation of conventional library organisation resources for ATC: library classification schemes and thesauri have been used for classification and indexing of library holdings for over a century, and there are millions of books, journals, and other library holdings which are classified and indexed according to these traditional knowledge organisation resources. We believe that the enormous body of classification metadata generated in libraries around the world can be utilised for developing new bootstrapping methods for ML-based ATC systems and for producing a new breed of unsupervised ATC methods.
− Investigating the utilisation of citation and reference networks for ATC: a considerable number of e-documents have some form of linkage to other documents. For example, it is common practice in scientific documents to cite other articles, papers, and books. It is also common practice for documented law cases to refer to other cases, patents to refer to other patents, and webpages to link to other webpages. Utilising the networked characteristic of webpages for their automatic classification has been the subject of many studies (for a review, see [4]). There are also a number of recent works which study the utilisation of patent citation networks for automatic classification of patent documents (for example, see [5, 6]). In this project we investigate the utilisation of citation networks among scientific documents for automatic classification and keyphrase indexing of scientific literature (e.g., articles, papers, reports, theses, and dissertations), which, to the best of our knowledge, has not been studied before.

1.3 Organisation of Report

Chapter 2 provides an introduction to the field of ATC. This includes ATC applications, a review of common ML algorithms used for ATC and their most recent extensions, text representation and term weighting methods, performance measures, and benchmarking datasets.

Chapter 3 describes the development of two new bootstrapping methods for ML-based ATC algorithms. In the first part of this chapter, we introduce a new bootstrapping method that leverages the Dewey Decimal Classification (DDC) scheme and Online Public Access Catalogues (OPACs) to automatically build training corpora for ML-based ATC systems. In the second part of this chapter, we describe the development of an automatic digital repository for teaching and learning materials, which includes a complete ML-based ATC system that uses a new web-based bootstrapping method.

In Chapter 4, we propose a novel unsupervised document classification method named Bibliography-Based Automatic Text Classification (BB-ATC), which does not require any training data.
This method is based on leveraging the enormous collection of classification metadata found in the online public access catalogues of conventional libraries to automatically classify a wide range of documents by mining and analysing their citation/reference networks. In this chapter, we discuss the implementation details of the proposed BB-ATC method and report its classification accuracy on a test dataset of two hundred syllabus documents.

In Chapter 5, we describe the development of an enhanced version of the BB-ATC method, designed for automatic classification of scientific literature according to the DDC scheme. We report the performance of this method on a large test dataset comprising one thousand research documents collected from a well-known digital library of scientific literature.

In Chapter 6, we discuss the results of our investigation into applying the underlying principle of the BB-ATC method (i.e., mining and utilising citation/reference networks among documents) to automatic keyphrase extraction from scientific documents. In this chapter, we discuss the development, and present the assessment results, of a new unsupervised keyphrase extraction method based on the citation networks among documents. The proposed method automatically mines and analyses the citation network of a given document and infers a list of high-likelihood keyphrases for the document based on the keyphrases assigned to the documents which either cite or are cited by the document to be indexed.

In Chapter 7, we provide a summary of the thesis, discuss its main contributions, and propose some potential future work.
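The citation-based inference outlined above for Chapter 6 can be illustrated with a minimal sketch. It assumes that the keyphrases of neighbouring documents (those that cite, or are cited by, the target document) are already available as plain lists, and it ranks candidates by the number of neighbours that carry them. The function name `rank_keyphrases`, the sample data, and the simple vote count are illustrative assumptions only; the term weighting and selection actually used by the method are described in Chapter 6 and are not reproduced here.

```python
from collections import Counter


def rank_keyphrases(neighbour_keyphrases, top_n=5):
    """Rank candidate keyphrases by how many neighbouring documents carry them.

    neighbour_keyphrases: a list of keyphrase lists, one per document that
    cites, or is cited by, the target document (hypothetical input format).
    """
    votes = Counter()
    for phrases in neighbour_keyphrases:
        # Count each phrase once per neighbour so one document cannot dominate.
        votes.update({p.strip().lower() for p in phrases})
    # Sort by descending vote count, then alphabetically for a stable order.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical keyphrases attached to three neighbouring documents.
neighbours = [
    ["text classification", "machine learning"],
    ["Text Classification", "digital libraries"],
    ["text classification", "machine learning", "citation analysis"],
]
print(rank_keyphrases(neighbours, top_n=2))
# → [('text classification', 3), ('machine learning', 2)]
```

A plain vote count is the simplest possible scoring rule; it is used here only to make the idea of inferring keyphrases from a citation neighbourhood concrete.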

CHAPTER 2. Automatic Text Classification: A Review

2.1 Introduction

In this chapter we provide an overview of the field of Automatic Text Classification/Categorisation (ATC) and discuss different aspects of ATC systems such as applications, Machine Learning (ML) based methods, text modelling methods, feature reduction and selection techniques, performance measures, and benchmarking datasets.

ATC is the automatic assignment of natural language text documents to one or more predefined categories or classes according to their contents. It can be divided into three main categories:

i. Binary classification: where there are only two classes/categories, and an incoming unlabelled text belongs to only one of these classes. A common example of this type is spam filtering, in which the textual content of an e-mail is classified as either spam or ham.
ii. Multi-class classification: where there are more than two categories, but an incoming text can belong to only one of them (also known as single-label classification).
iii. Multi-label classification: where there are more than two categories, and an incoming text can belong to multiple categories.

Depending on the application, an ATC system may be required to perform either hard or soft classification. Hard classification provides a truth value in {True, False} which indicates membership or non-membership of a given document in a given category. Hard classification is mostly adopted in fully autonomous classifiers, whose results do not need to be confirmed by human cataloguers as the final stage of the process. Soft classification, in contrast, provides a real number between 0.0 and 1.0 which indicates the degree of confidence of the system in the membership of a given document in a given category. This is useful for interactive ATC systems in which the final class(es) of documents are decided by human cataloguers using the suggestions made by the automatic classification system.

The history of ATC can be traced back to as early as the 1960s [7]. However, over the past two decades, due to the explosive growth in the number of electronic documents available on the Internet, intranets, and digital libraries, and therefore the growing need for automatic systems capable of managing such large volumes of data, ATC has attracted a large number of researchers working in areas such as data mining, information retrieval, data management, machine learning, natural language processing, and digital libraries.

Until the late 1980s, Knowledge Engineering (KE) was the most popular approach to the realisation of ATC systems. In general, this approach involved manually defining a set of rules to encode human expert knowledge on how to classify documents under the given categories. The most famous example of this approach is the Construe system [8]. In the 1990s, with the advances in machine learning and the emergence of computers with enough processing power to analyse thousands of documents in a reasonably short time, ML algorithms were widely adopted in ATC and superseded the rule-based methods. In the ML-based approach, the algorithms take over the role that knowledge engineers play in rule-based systems and automatically learn the distinguishing characteristics of each class by analysing a set of pre-classified sample documents in that class. The advantages of this approach are:

− The engineering effort focuses on the construction of an automatic builder of classifiers (learners) instead of the construction of individual classifiers. Therefore, if the set of categories is updated or if the system is ported to a different domain, all that is needed is a different set of manually classified documents to learn from.
− ML-based systems require domain expertise for labelling, as opposed to knowledge engineering expertise in the case of rule-based systems.
− Sometimes the pre-classified documents are already available, which greatly reduces the human effort required for developing an ML-based ATC system. For example, consider the development of an ATC system for automatic classification of computer science related research documents according to the ACM Computing Classification System (ACM-CCS) [9]. In this case,

(19) there are already a large number of research papers in the ACM digital library (http://portal.acm.org/dl.cfm), which are manually classified according to the ACM-CCS and can be used for training an ML-based classifier.
2.2 ATC Applications
ATC has a wide range of applications. The following gives descriptions of the most common ATC applications:
− Document organization: includes automatic subject indexing of newspaper articles, journal and conference papers, advertisements, and catalogues, to name a few.
− Organising digital libraries: large-scale digital libraries contain thousands of items and therefore require deploying flexible querying and information retrieval techniques that allow users to easily find the items they are looking for. In order to provide highly precise search results, we need to go beyond the traditional keyword-based search techniques, which yield a large volume of indiscriminate search results without regard to the content. Classification of materials in a digital library based on a pre-defined scheme improves the accuracy of information retrieval significantly and allows users to browse the collection by subject [10].
− Classification of webpages: classifying the textual content of webpages is essential to tasks such as focused crawling, semi-automated development of web directories, and analysis of the topical structure of the web. Webpage classification can also help improve the quality of web search [3].
− Spam filtering: in its simple form, spam filtering can be recast as a text categorization task where the classes to be predicted are spam and legitimate. A variety of supervised machine-learning algorithms have been successfully applied to the task of e-mail filtering [1].
− Automatic identification of text genre: genre characterizes text differently than the usual subject content. Text genre or the style of text is viewed as characterizing the purpose for which the text has been written. Examples of

(20) genre are: research article, novel, poem, news article, editorial, homepage, advertisement, manual, court decision, etc. As text-based applications have become more diverse and the amount of information has increased, different aspects of text, such as genre, can prove useful for various purposes [11].
− Web filtering: web filtering uses screening of web requests and analysis of the contents of the received webpages to block undesired content. It can be seen as a case of single-label categorization, i.e., classification of incoming documents into two disjoint categories, the relevant and the irrelevant [12]. Web filtering has the following two major applications:
i. Protection against inappropriate content: the Internet has become an important source of information. However, it is also host to a variety of inappropriate material, especially for children. Web filtering can be used to block access to webpages that are against a defined policy.
ii. Prevention of misuse of network resources: here the main aim is to prevent misuse of the resources of an organisation. A common problem in many organisations that provide Internet access for employees is that the network connection could be used for applications such as chat, network games, and downloading streaming video and audio content. Such misuses decrease productivity, impose unnecessary load on the entire network and, hence, impede legitimate activities.
2.3 Machine Learning Methods for ATC
Over the last two decades a wide range of machine learning algorithms have been utilised for ATC. However, the most popular ML algorithms used for ATC to date are Naïve Bayes, Support Vector Machines, and k-Nearest Neighbour. This section provides a review of these three algorithms and their application to ATC.
2.3.1 Naive Bayes (NB) Classifier
The underlying theorem for the Naïve Bayes text classification algorithm is Bayes' rule, which shows the relation between a conditional probability and its reverse form,

(21) for example, the probability of a hypothesis given some observed pieces of evidence and the probability of that evidence given the hypothesis. It is expressed as:
$$P(A_i \mid B_j) = \frac{P(A_i)\,P(B_j \mid A_i)}{P(B_j)} \tag{2.1}$$
where,
− $P(A_i)$ is the prior probability or marginal probability of a given event $A_i$. It is "prior" in the sense that it does not take into account any information about $B_j$.
− $P(A_i \mid B_j)$ is the conditional probability of $A_i$, given $B_j$. It is also called the posterior probability because it is derived from or depends upon the specified value of $B_j$.
− $P(B_j \mid A_i)$ is the conditional probability of $B_j$ given $A_i$. It is also called the likelihood.
− $P(B_j)$ is the prior or marginal probability of $B_j$, and acts as a normalizing constant.
Bayes' rule in this form gives a mathematical representation of how the conditional probability of event $A_i$ given $B_j$ is related to the converse conditional probability of $B_j$ given $A_i$. When applied to text classification, Equation (2.1) can be rewritten as:
$$P(\mathrm{Class}_i \mid \mathrm{Document}_j) = \frac{P(\mathrm{Class}_i)\,P(\mathrm{Document}_j \mid \mathrm{Class}_i)}{P(\mathrm{Document}_j)} \tag{2.2}$$
such that the rule is used to calculate the probability of each pre-defined $\mathrm{Class}_i$ given $\mathrm{Document}_j$, and the class with the highest probability is allocated to $\mathrm{Document}_j$. In Equation (2.2), $P(\mathrm{Document}_j)$ is a constant divisor, common to every calculation, and can therefore be safely removed from the equation. In this model each document is represented as a vector of words (Euclidean vector) in an n-dimensional space, where each dimension corresponds to a distinct word and the distance along that dimension is the number of times that word occurs in the document. In a common supervised setting, a set of manually classified training documents is used to parameterise the class prior probabilities, $P(\mathrm{Class}_i)$, and the class-conditioned (word) probabilities, $P(\mathrm{Document}_j \mid \mathrm{Class}_i)$. The conditional probability of a given word, $w_k$, in a given class, $\mathrm{Class}_i$, is estimated by:

(22)
$$P(w_k \mid \mathrm{Class}_i) = \frac{n_k + 1}{n + |\mathrm{Vocabulary}|} \tag{2.3}$$
where $n_k$ is the number of times the word $w_k$ occurs in the training documents which belong to $\mathrm{Class}_i$, $n$ is the total number of words in the training documents which belong to $\mathrm{Class}_i$, and Vocabulary is the set of all distinct words which occur in all training documents. Each estimate is primed with a count of one to avoid probabilities of zero (Laplace smoothing). The class prior probability, $P(\mathrm{Class}_i)$, can be estimated by dividing the number of documents belonging to $\mathrm{Class}_i$ by the total number of training documents. It follows that if a document, $\mathrm{Document}_j$, is to be classified, the most likely class, $C_{NB}$, for that document would be determined as:
$$C_{NB} = \arg\max_{i \in S} \left[ P(\mathrm{Class}_i) \prod_{k=1}^{|\mathrm{Document}_j|} P(w_k \mid \mathrm{Class}_i)^{f_{w_k}} \right] \tag{2.4}$$
where $S$ is the set of all possible target classes and $f_{w_k}$ is the frequency of word $k$ in the test document. Multiplying a large number of probabilities, which by definition have values between 0 and 1, can result in floating-point underflow. To avoid this problem, we perform all computations by summing logarithms of the probabilities rather than multiplying them:
$$C_{NB} = \arg\max_{i \in S} \left[ \log P(\mathrm{Class}_i) + \sum_{k=1}^{|\mathrm{Document}_j|} f_{w_k} \log P(w_k \mid \mathrm{Class}_i) \right] \tag{2.5}$$
Finally, the class with the highest probability is chosen as the target class for the document under processing.
There are two different methods used for modelling the textual data for the NB algorithm. The more popular model, called Multinomial Naive Bayes (MNB), as explained by Mitchell in [13], uses a so-called Bag Of Words (BOW) for representing the text. Using the BOW representation, a text such as a document is transformed to an unordered set of unique words and their associated frequency counts found in the text. In order to create a dataset for training the NB algorithm using this model, all the documents that belong to the same class are concatenated to create a single large

(23) document and the BOW representation of this concatenated document is used for training the algorithm. There also exists a less popular approach to modelling data for the NB algorithm, called the Multi-Variate Bernoulli (MVB) model. In this model the frequencies of words are not taken into account and the value of each feature (word) is either 0 or 1 (binary model). McCallum et al. [14] compared the performance of the MVB model with the Multinomial model and concluded that the Multinomial model is more effective in the majority of cases. Many people who use the MNB attribute its superior performance to the fact that its document representation model captures word frequency information in documents, whereas the MVB does not. However, a recent study by Schneider [15] argues that the superiority of MNB over MVB is not due to its ability to capture word frequencies and should instead be attributed to the way the two models treat negative evidence, i.e. evidence from words that do not occur in a document.
The Naive Bayes algorithm has become the punching bag of classifiers, and it has been placed last or near last in many studies that have compared the accuracy of different ML-based ATC algorithms [2]. Still, it is frequently used in ATC systems because it is fast and easy to implement. The biggest problem associated with the NB algorithm is its “naïve” assumption that the words are independent of each other and that their sequence is irrelevant. There are a number of studies (see for example [16]) which discuss the reasons for the surprisingly good performance of NB despite its systematic problems such as the word independence assumption. A recent study by Wolf et al. [17], which investigates the nature of information humans use to classify documents, shows that human subjects are able to achieve similar classification accuracy with or without syntactic information across a range of passage sizes.
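As a minimal sketch of the multinomial model described by Equations (2.3) to (2.5), the following Python code estimates the class priors and the Laplace-smoothed word probabilities from a small labelled corpus and classifies a new document by summing log-probabilities. The function names (`train_mnb`, `classify_mnb`), the tokenised input format, and the decision to skip words that never occurred in training are assumptions made for this illustration; they are not taken from the systems reviewed here.

```python
import math
from collections import Counter


def train_mnb(labelled_docs):
    """Estimate the parameters of a multinomial Naive Bayes model.

    labelled_docs: list of (class_label, list_of_words) pairs.
    Returns (priors, word_counts, vocabulary).
    """
    class_doc_counts = Counter()
    word_counts = {}      # class -> Counter of word frequencies (n_k per word)
    vocabulary = set()    # all distinct words in the training documents
    for label, words in labelled_docs:
        class_doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    total_docs = sum(class_doc_counts.values())
    # P(Class_i) = documents in Class_i / all training documents
    priors = {c: n / total_docs for c, n in class_doc_counts.items()}
    return priors, word_counts, vocabulary


def classify_mnb(words, priors, word_counts, vocabulary):
    """Return the class maximising the log form of the decision rule (2.5)."""
    frequencies = Counter(words)  # f_wk for each word of the test document
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        n = sum(word_counts[c].values())  # total words in Class_i
        score = math.log(prior)
        for w, f in frequencies.items():
            if w not in vocabulary:
                continue  # assumption: words unseen in training are ignored
            # Equation (2.3): Laplace-smoothed conditional probability
            p = (word_counts[c][w] + 1) / (n + len(vocabulary))
            score += f * math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Working in log space keeps the scores within floating-point range even for long documents, which is the motivation given for Equation (2.5).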
Furthermore, it should be noted that the other ML-based ATC algorithms which aim to model text more accurately and take into account syntactic information tend to be slower and more complex. A recent study by Hand [18] argues that the superiority of more complex ML-based classification algorithms is not substantial enough to justify their usage in many cases, and that the superior performance such algorithms exhibit in experimental settings does not always translate to better performance in real-world applications. These arguments can be seen as the main reasons for the wide usage of the NB algorithm in many real-world applications, despite the existence of more complex and accurate algorithms reported in the literature. Many modified versions of the NB

(24) algorithm have been reported in the literature and they all report some improvement over the original version, for example see [19-21]. Among all these improved versions, the modifications suggested by Rennie et al. in [19] are the most credited and confirmed [20]. In an independent experiment, we implemented the modifications suggested by Rennie et al. and achieved very similar results. They propose simple, heuristic solutions to some of the problems with NB classifiers, addressing both systemic issues and problems that arise because text is not actually generated according to a Multinomial model. These simple corrections result in a fast algorithm that is competitive with state-of-the-art ML-based ATC algorithms such as SVM.
2.3.2 Support Vector Machine (SVM)
SVM [21] is widely known as the state-of-the-art algorithm for automatic text classification [22]. It was first adopted for the problem of text classification by Joachims in 1998 [2]. SVM is a vector space-based machine learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise). In order to explain the underlying concept of SVM we first discuss the perceptron, which is also a linear classifier and has some principal similarities to SVM. The perceptron was first proposed by Frank Rosenblatt in 1957. It is a linear classifier that maps its input $\mathbf{x}$, a real-valued vector, to an output value $f(\mathbf{x})$, which is a single binary value:
$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\ 0 & \text{else} \end{cases} \tag{2.6}$$
where $\mathbf{w}$ is a vector of real-valued weights, $\mathbf{w} \cdot \mathbf{x}$ is the dot product of the weight and input vectors, and $b$ is the bias, a constant term that does not depend on any input value. During the training process, given a linearly separable training set $S$:
$$S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell)\}, \quad \mathbf{x}_i \in \mathbb{R}^n, \quad y_i \in \{-1, +1\} \tag{2.7}$$
(the update rule in (2.8) relies on the class labels being encoded as −1 and +1, with the output 0 of (2.6) corresponding to the label −1),
and a learning rate $\eta \in \mathbb{R}^{+}$, the weight vector and bias are updated after each input according to the algorithm in (2.8):

(25)
Initiation:
  $\mathbf{w}_0 = \mathbf{0}$; $b_0 = 0$; $k = 0$; $R = \max_{1 \le i \le \ell} \lVert \mathbf{x}_i \rVert$
Iteration:
  Repeat
    for $i = 1$ to $\ell$
      if $y_i (\mathbf{w}_k \cdot \mathbf{x}_i + b_k) \le 0$ then
        $\mathbf{w}_{k+1} = \mathbf{w}_k + \eta y_i \mathbf{x}_i$
        $b_{k+1} = b_k + \eta y_i R^2$
        $k = k + 1$
      end if
    end for
  Until no mistakes made within the for loop
Output:
  number of mistakes, $k$; weight vector, $\mathbf{w}_k$, and bias, $b_k$, for the separating hyperplane    (2.8)

This procedure is guaranteed to converge provided that there exists a linear hyperplane that correctly classifies the training data. If the data is not linearly separable, then the loop will continue indefinitely and the procedure will never converge. Therefore, linear perceptrons can only be adopted when the data is guaranteed to be linearly separable. Examples of linearly separable and inseparable data are shown in Figures 2.1(a-b), respectively. As shown in Figure 2.1(c), the perceptron could result in any of an infinite number of linear hyperplanes which all correctly separate the data. Intuitively, however, the best hyperplane is the one which has the biggest margins from the data points, as shown in Figure 2.1(d). SVM addresses both of these problems. It can be deployed for classifying datasets which are linearly inseparable, and it also finds the optimum $\mathbf{w}$ and $b$ such that the separating hyperplane has the greatest distance from the data points and therefore has the highest potential to correctly classify unseen data. Here, we first discuss how SVM finds the best separating hyperplane with the biggest margin and then describe the SVM approach for solving non-linear classification problems using the soft margin technique and the kernel trick.
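The training procedure in (2.8) can be sketched in Python as follows. This is an illustrative implementation only: it assumes that the class labels are encoded as −1 and +1 (which the test y_i(w·x_i + b) ≤ 0 requires), and it adds a hypothetical `max_epochs` guard, not part of (2.8), so that the loop terminates when the data is not linearly separable.

```python
def train_perceptron(samples, eta=1.0, max_epochs=1000):
    """Primal perceptron training.

    samples: list of (x, y) pairs, x a list of floats, y in {-1, +1}.
    Returns (w, b, k): weight vector, bias, and number of mistakes made.
    """
    n = len(samples[0][0])
    w = [0.0] * n   # w_0 = 0
    b = 0.0         # b_0 = 0
    k = 0           # number of mistakes (updates) so far
    # R is the largest Euclidean norm among the training inputs
    R = max(sum(v * v for v in x) ** 0.5 for x, _ in samples)

    for _ in range(max_epochs):  # assumption: bounded stand-in for "Repeat ... Until"
        mistakes = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # sample is misclassified or on the boundary
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y * R * R
                k += 1
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: converged
            return w, b, k
    raise RuntimeError("no separating hyperplane found within max_epochs")
```

The hyperplane returned is simply the first one that separates the training data; nothing in the procedure prefers a wide margin, which is the limitation SVM addresses.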

(26) Figure 2.1 Linear separation using perceptron: (a) Linearly separable. (b) Linearly inseparable. (c) Infinite number of linearly separating hyperplanes. (d) The best separating hyperplane.
As discussed above and illustrated in Figure 2.2, the best separating hyperplane is the one which has the biggest margin from the data points and therefore has the minimum generalisation error. The separating hyperplane is the generalisation of a line to n dimensions. The closest points to the separating hyperplane are called support vectors: they are vectors, since we are in an n-dimensional space, and they support (i.e., determine) where the separating hyperplane should be. The hyperplanes that pass through the support vectors on each side of the separating hyperplane are called canonical hyperplanes.

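(27) Figure 2.2 The best separating hyperplane (showing the margin, the canonical hyperplanes, the support vectors, and the separating hyperplane).
In an arbitrary-dimensional space a separating hyperplane can be written as:
$$\mathbf{w} \cdot \mathbf{x} + b = 0 \tag{2.9}$$
where $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the input vector, and $b$ is the bias. Thus we will consider a decision function of the form:
$$D(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b) \tag{2.10}$$
The decision function depends only on the sign of $\mathbf{w} \cdot \mathbf{x} + b$ and not on its magnitude. In other words, the decision function is left invariant if we scale $\mathbf{w}$ and $b$ by any positive quantity. Therefore, we can implicitly fix a scale by fixing the canonical hyperplanes:
$$\mathbf{w} \cdot \mathbf{x} + b = -1, \qquad \mathbf{w} \cdot \mathbf{x} + b = 1 \tag{2.11}$$
Let $\mathbf{x}_1$ be a support vector on the canonical hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 1$ and $\mathbf{x}_2$ a support vector on the canonical hyperplane $\mathbf{w} \cdot \mathbf{x} + b = -1$. By subtracting the equation of one of the canonical hyperplanes from the other we get:
$$(\mathbf{w} \cdot \mathbf{x}_1 + b) - (\mathbf{w} \cdot \mathbf{x}_2 + b) = 1 - (-1) \;\Longrightarrow\; \mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = 2 \tag{2.12}$$
for the two support vectors on each side of the hyperplane.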

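(28) The margin will be given by the projection of the vector $(\mathbf{x}_1 - \mathbf{x}_2)$ onto the unit normal vector to the separating hyperplane, i.e., $\mathbf{w} / \lVert \mathbf{w} \rVert$, as illustrated in Figure 2.3:
$$\text{Margin} = \frac{\mathbf{w} \cdot (\mathbf{x}_1 - \mathbf{x}_2)}{\lVert \mathbf{w} \rVert} \tag{2.13}$$
Figure 2.3 Separating hyperplane’s margin (showing the regions $\mathbf{w} \cdot \mathbf{x} + b < -1$ and $\mathbf{w} \cdot \mathbf{x} + b > 1$, the hyperplanes $\mathbf{w} \cdot \mathbf{x} + b = -1$, $\mathbf{w} \cdot \mathbf{x} + b = 0$ and $\mathbf{w} \cdot \mathbf{x} + b = 1$, the vector $(\mathbf{x}_1 - \mathbf{x}_2)$, and the unit normal $\mathbf{w} / \lVert \mathbf{w} \rVert$).
Dividing both sides of the equation derived in (2.12) by $\lVert \mathbf{w} \rVert$ gives the margin in terms of $\mathbf{w}$ alone:
$$\frac{\mathbf{w}}{\lVert \mathbf{w} \rVert} \cdot (\mathbf{x}_1 - \mathbf{x}_2) = \frac{2}{\lVert \mathbf{w} \rVert} = \frac{1}{0.5\,\lVert \mathbf{w} \rVert} \tag{2.14}$$
Our goal is to maximise the margin, which is equivalent to minimising $\lVert \mathbf{w} \rVert$. Maximisation of the margin is thus equivalent to minimisation of the function:
$$\varphi(\mathbf{w}) = 0.5\,(\mathbf{w} \cdot \mathbf{w}) \quad \text{subject to the constraint: } y_i \left[ (\mathbf{w} \cdot \mathbf{x}_i) + b \right] \ge 1 \tag{2.15}$$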

(29) The constraint makes sure that as the margin gets bigger the separating hyperplane still separates all the data points correctly. This is a constrained optimisation problem which is solved using quadratic programming techniques. In situations where a training set is not linearly separable, the standard approach is to allow the decision margin to make a few mistakes (some points, such as outliers or noisy examples, are inside or on the wrong side of the margin). Even in situations where the data is linearly separable we might prefer a solution that better separates the bulk of the data by making the margin bigger while ignoring a few noisy data points. We then pay a cost for each misclassified example, which depends on how far it is from meeting the margin requirement given by the constraint in (2.15). To implement this, we introduce slack variables $\xi_i$. A nonzero value for $\xi_i$ allows $\mathbf{x}_i$ to not meet the margin requirement at a cost proportional to the value of $\xi_i$. The formulation of the SVM optimization problem with slack variables is:
$$\varphi(\mathbf{w}) = 0.5\,(\mathbf{w} \cdot \mathbf{w}) + C \sum_{i=1}^{\ell} \xi_i \quad \text{subject to the constraints: } y_i \left[ (\mathbf{w} \cdot \mathbf{x}_i) + b \right] \ge 1 - \xi_i, \;\; \xi_i \ge 0 \tag{2.16}$$
The optimization problem is then trading off how wide it can make the margin versus how many points have to be moved around to allow this margin. The margin can be less than 1 for a point $\mathbf{x}_i$ by setting $\xi_i > 0$, but then one pays a penalty of $C \xi_i$ in the minimization for having done that. $\sum_{i=1}^{\ell} \xi_i$ gives an upper bound on the number of training errors. Soft-margin SVM minimizes training error traded off against margin. The parameter $C$ is a regularization term, which provides a way to control overfitting. As $C$ becomes large, it is unattractive to not respect the data at the cost of reducing the geometric margin; when it is small, it is easy to account for some data points with the use of slack variables and to have a wide margin placed so that it models the bulk of the data more generically.
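To make the role of the slack variables concrete, the sketch below evaluates the soft-margin objective of (2.16) for a given candidate hyperplane. It is not an SVM solver: it only assumes that, for fixed w and b, the smallest slack satisfying the constraints is ξ_i = max(0, 1 − y_i(w·x_i + b)), and the function name `soft_margin_objective` is hypothetical.

```python
def soft_margin_objective(w, b, samples, C):
    """Evaluate 0.5 * (w . w) + C * sum(xi_i) for a fixed hyperplane (w, b).

    samples: list of (x, y) pairs with y in {-1, +1}.
    Returns (objective, slacks), where slacks[i] is the smallest xi_i >= 0
    satisfying y_i * (w . x_i + b) >= 1 - xi_i.
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    # A point that meets the margin requirement needs no slack (xi_i = 0);
    # otherwise the slack is the amount by which it falls short.
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in samples]
    objective = 0.5 * dot(w, w) + C * sum(slacks)
    return objective, slacks
```

With a large C the penalty term dominates, so a solver is pushed towards hyperplanes with little total slack even if the margin is narrow; with a small C a wider margin with more slack becomes cheaper.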
SVM also uses the so-called kernel trick to classify datasets which are not linearly separable in their original form. The general idea is to map the original feature space to some higher dimensional feature space where the training set is separable. Of course, we would want to do so in ways that preserve relevant dimensions of relatedness between data points, so that the resultant classifier should still generalise

(30) well. For example, the data points shown in Figure 2.4(a) are not linearly separable in the original 2-dimensional feature space. Yet if we map the data to a 3-dimensional space using:
$$\phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x_1, x_2) \mapsto \phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2), \tag{2.17}$$
then we can linearly separate the mapped data, using a hyperplane in $\mathbb{R}^3$ of the form $\mathbf{w} \cdot \phi(\mathbf{x}) + b = 0$.
Figure 2.4 Kernel trick to make the data linearly separable: (a) Linearly inseparable data points in 2-dimensional space. (b) Linearly separable after mapping into 3-dimensional space.
2.3.3 k-Nearest Neighbour (k-NN)
The k-NN algorithm is amongst the simplest of all machine learning algorithms. In this method, an object is classified by a majority vote of its neighbours, with the object being assigned to the class most common amongst its k nearest neighbours. Figure 2.5 illustrates this method in the case of binary (two-class) classification. k is a positive integer, typically small. If k = 1, then the object is simply assigned to the class of its nearest neighbour. In binary classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. It can be useful to weight the contributions of the neighbours, so that the nearer neighbours contribute more to the vote than the more distant ones. The neighbours are taken from a set of objects for which the correct classification is known. This can be thought of as the training set for the

(31) algorithm, though no explicit training step is required and hence it is called a “lazy learner”. In the case of text documents, the objects are represented by word vectors in an n-dimensional feature space. Measuring the cosine similarity between a given unlabelled document and the labelled documents in the training dataset is the common method used for finding the nearest neighbours. The similarity is equal to the cosine of the angle between the two vectors, so it decreases as the angle grows, and is computed as:
$$\mathrm{Similarity}(\mathbf{a}, \mathbf{b}) = \cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert\,\lVert \mathbf{b} \rVert} \tag{2.18}$$
where $\mathbf{a}$ and $\mathbf{b}$ are the two vectors whose similarity is being measured.
Figure 2.5 k-Nearest Neighbours (an unlabelled object "?" among positive and negative examples, with the neighbourhoods for k = 3 and k = 9 marked).
2.4 Document Representation
Text documents, as they are, cannot be interpreted by classification algorithms, and an indexing procedure, which maps the text into a succinct representation of its content, needs to be invoked. The choice of a representation for text depends on what one

(32) regards as the meaningful textual units (the problem of lexical semantics) and the meaningful natural language rules for the combination of these units (the problem of compositional semantics). In true information retrieval style, each document is usually represented by an n-dimensional vector of weighted terms that occur in the document. Differences among the various approaches to representing textual content are accounted for by:
i. Different ways to understand what a term is;
ii. Different ways to weight terms.
A typical choice for the first issue is to identify each word in a document as being a term. This is often referred to as the Bag-Of-Words (BOW) approach to document representation. The BOW model in its simplest form converts a natural language text document to an n-dimensional vector in which each unique word in the document is represented as a dimension, and the frequency of that word in the document is the weight/magnitude of that dimension. The BOW model completely disregards the sequence of the words in the document. There are more complex models proposed in the literature that aim to represent text documents more accurately. These models try to find more sophisticated and semantically rich units in text. Such units can be divided into three groups:
a) Statistically motivated phrases: phrases that are not grammatically such, but are composed of a set/sequence of words that occur contiguously with high frequency in the dataset [23].
b) Grammatically motivated phrases: phrases that are grammatically such according to a grammar of the language [24].
c) Ontology motivated phrases: phrases that correspond to one or more concepts in a formal ontology. In this model, which utilises the new advances in the area of text semantic analysis, the document is represented by a vector of concepts/topics which are directly related to its content [25].
Although these more sophisticated units, which can be composed of 1 to n single words and are therefore called n-grams, are semantically richer than single words, they do not always result in more accurate classification systems. They can even have an adverse effect on the prediction performance of the classification systems. This is due to the fact that n-grams are generally statistically much less significant

(33) than single words. In [17], Wolf et al. showed that human subjects are able to achieve similar classification accuracy with or without syntactic information. In their experiments they used the simple BOW model to represent texts and asked their human subjects to classify these representations. The results of this study suggest that the BOW model, which does not contain any syntactic/sequential information, still contains enough information for achieving high accuracy classification performance.
As for the second issue in text representation, i.e. the weight assigned to each term, the simplest method is Term Frequency (TF), in which the weight of each term is equal to the number of times that it appears in the text. However, the most popular weighting mechanism is the Term Frequency Inverse Document Frequency (TFIDF), which is based on two intuitions:
− The more often a term appears in a document, the more representative it is of the document’s content.
− The more documents a term appears in, the less discriminating it is.
TFIDF is high for terms that have a high frequency in a small subset of documents in the dataset, but rarely appear in the other documents. The TFIDF weight for a given term, $t_i$, in a given document, $d_j$, in a given dataset is computed as:
$$\mathrm{TFIDF}(t_{i,j}) = \mathrm{TF}(t_{i,j}) \times \log_2 \frac{N}{n} \tag{2.19}$$
where $\mathrm{TF}(t_{i,j})$ is the frequency of term $t_i$ in document $d_j$, $N$ is the number of documents in the dataset, and $n$ is the number of documents where term $t_i$ occurs at least once. $\log_2 \frac{N}{n}$ is the inverse document frequency measure.
2.4.1 The Problem of High Dimensionality
Text documents naturally contain a large number of unique words, and therefore their vector representation has a very high dimensionality, which many learning algorithms do not deal well with. Also, not all of these dimensions contribute positively to the classification process. Therefore, it is desirable to include in the model only the dimensions/terms which are discriminating.
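The TFIDF weighting of Equation (2.19) can be illustrated with the short Python sketch below, which builds a sparse weighted BOW vector for each document of a small tokenised collection. The function name `tfidf_vectors` and the representation of a vector as a dictionary from term to weight are choices made for this example only.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Compute TFIDF-weighted bag-of-words vectors.

    documents: list of token lists, one per document.
    Returns one dict per document mapping each term to TF * log2(N / n).
    """
    N = len(documents)
    # n for each term: number of documents in which it occurs at least once
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))

    vectors = []
    for tokens in documents:
        term_frequency = Counter(tokens)  # raw counts, i.e. the plain TF weights
        vectors.append({
            term: tf * math.log2(N / document_frequency[term])
            for term, tf in term_frequency.items()
        })
    return vectors
```

A term that occurs in every document receives a weight of zero, since log2(N/N) = 0, which matches the intuition that such a term has no discriminating value.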
In this section we review the

(34) different methods and techniques used for reducing the dimensionality of the text vectors.
2.4.1.1 Stop Words Removal
Stop words are extremely common words (e.g., the, of, is) which appear in the majority of documents and do not have any discriminating value. Removing such words from the text representation reduces the number of words in the document with no negative effect. There is no definitive list of stop words which all natural language processing tools incorporate (e.g., see [26]).
2.4.1.2 Stemming and Lemmatization
For grammatical reasons, documents use different forms of a word as needed. For example the verb ‘organize’ may be used as it is, or as ‘organizes’, ‘organizing’, etc. Additionally, there are families of derivationally related words with similar meanings, such as ‘democracy’, ‘democratic’, and ‘democratization’. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
am, are, is ⇒ be
car, cars, car’s, cars’ ⇒ car
The boy’s cars are different colours ⇒ The boy car be differ colour
However, the two techniques differ in their flavour. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the word ‘saw’, stemming might return just ‘s’, whereas lemmatization would attempt to return either ‘see’ or ‘saw’, depending on whether the use of the term was as a verb or as a noun.
The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component, and a number of such components exist as commercial and open source

(35) software components. The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter’s algorithm [27].
2.4.1.3 Feature Selection Methods
The aim of feature selection is to reduce the number of terms in the documents by filtering out the terms that have no or very limited effect on the outcome of the classification task. The simplest method for feature selection is to keep the words that occur in the highest number of documents in the training dataset. More sophisticated term score functions try to measure the contribution of a specific term to the outcome of the classification task [28]. Some of these common methods include:
• Information Gain (IG): it measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document from the training dataset. Let $\{c_i\}_{i=1}^{m}$ denote the set of categories in the dataset. The information gain for a given term $t$ is computed as:
$$IG(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(t) \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t) + P(\bar{t}) \sum_{i=1}^{m} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t}) \tag{2.20}$$
where,
− $P(c_i)$ is the probability of category $i$,
− $P(c_i \mid t)$ is the probability of category $i$ if $t$ is in the document; i.e., which proportion of the documents where $t$ occurs belong to the category $i$,
− $P(c_i \mid \bar{t})$ is the probability of category $i$ if $t$ is not in the document; i.e., which proportion of the documents where $t$ does not occur belong to the category $i$,
− $P(t)$ is the probability of term $t$ appearing in a given document in the dataset.

(36) − $P(\bar{t})$ is the probability of term $t$ not appearing in a given document in the dataset.
After computing the information gain for all the unique words in the training corpus, the terms whose information gain is below a pre-determined threshold are removed from the feature space. The computation includes the estimation of the conditional probabilities of a category given a term, and the entropy computations in the definition.
• Mutual information (MI): if one considers the two-way contingency table of a given term $t$ and a given category $c$, where $A$ is the number of times $t$ and $c$ co-occur, $B$ is the number of times $t$ occurs without $c$, $C$ is the number of times $c$ occurs without $t$, and $N$ is the total number of documents, then the mutual information criterion between $t$ and $c$ is computed as:
$$MI(t, c) = \log \frac{P(t \wedge c)}{P(t)\,P(c)}, \tag{2.21}$$
and is estimated using:
$$MI(t, c) \approx \log \frac{A N}{(A + C)(A + B)} \tag{2.22}$$
$MI(t, c)$ has a natural value of zero if $t$ and $c$ are independent. To measure the goodness of a term (i.e., its contribution to the classification task) in a global feature selection, the category-specific scores of a term are combined in two alternative ways:
$$MI_{average}(t) = \sum_{i=1}^{m} P(c_i)\,MI(t, c_i) \tag{2.23}$$
$$MI_{maximum}(t) = \max_{i=1}^{m} \{ MI(t, c_i) \} \tag{2.24}$$
where $m$ is the number of categories in the training dataset. A weakness of mutual information is that the score is strongly influenced by the marginal probabilities of terms. Rare terms will have a higher score than common

(37) terms. The scores, therefore, are not comparable across terms of widely differing frequency.
• Chi-square ($\chi^2$): the $\chi^2$ statistic measures the correlation between a given term $t$ and a given category $c$. It can be compared to the $\chi^2$ distribution with one degree of freedom to judge extremeness. Using the two-way contingency table of a term $t$ and a category $c$, where $A$ is the number of times $t$ and $c$ co-occur, $B$ is the number of times $t$ occurs without $c$, $C$ is the number of times $c$ occurs without $t$, $D$ is the number of times neither $c$ nor $t$ occurs, and $N$ is the total number of documents, the $\chi^2$ of term $t$ in relation to category $c$ is computed as:
$$\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \tag{2.25}$$
The $\chi^2$ statistic has a natural value of zero if $t$ and $c$ are independent. To measure the goodness of a term in a global feature selection, the $\chi^2$ statistic between each unique term in the training corpus and all the categories is measured, and then the category-specific scores for the term $t$ are combined using one of the following functions:
$$\chi^2_{average}(t) = \sum_{i=1}^{m} P(c_i)\,\chi^2(t, c_i) \tag{2.26}$$
$$\chi^2_{maximum}(t) = \max_{i=1}^{m} \{ \chi^2(t, c_i) \} \tag{2.27}$$
where $m$ is the number of categories in the training dataset. A major difference between $\chi^2$ and MI is that $\chi^2$ is a normalized value, and hence $\chi^2$ values are comparable across terms for the same category.
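The two contingency-table scores above can be sketched in a few lines of Python. The sketch follows the estimate in (2.22) and the statistic in (2.25); the function names are hypothetical, and the natural logarithm is assumed for MI since the base of the logarithm is not fixed in the text.

```python
import math


def mutual_information(A, B, C, N):
    """Estimate MI(t, c) as in (2.22).

    A: documents containing t and belonging to c
    B: documents containing t but not in c
    C: documents in c that do not contain t
    N: total number of documents
    Requires A > 0, otherwise the logarithm is undefined.
    """
    return math.log(A * N / ((A + C) * (A + B)))


def chi_square(A, B, C, D):
    """Compute the chi-square statistic of (2.25).

    D is the number of documents that neither contain t nor belong to c,
    so the total number of documents is N = A + B + C + D.
    """
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
```

For a term that is distributed independently of the category (for instance A = B = C = D) both scores are zero, in line with the "natural value of zero" noted for each measure.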

2.5 Performance Measures

Obtaining an accurate performance measure of ATC algorithms is important for comparing different algorithms and datasets. It is also necessary for tuning the algorithms to yield optimum results. Performance measures are built upon the two concepts of Precision (Pr) and Recall (Re). Precision is the probability that a document predicted to be in category $c_i$ truly belongs to this category. Recall is the probability that a document belonging to $c_i$ is classified into this category. When a single performance measure is desired, the harmonic mean of precision and recall, F1, is quoted. Accordingly, with respect to a given class $c_i$:

$$\mathrm{Re}(c_i) = \frac{\text{Number of correctly assigned class labels}}{\text{Total possible correct}} = \frac{TP_i}{TP_i + FN_i} \qquad (2.28)$$

$$\mathrm{Pr}(c_i) = \frac{\text{Number of correctly assigned class labels}}{\text{Total assigned}} = \frac{TP_i}{TP_i + FP_i} \qquad (2.29)$$

$$F1(c_i) = \frac{2\,\mathrm{Pr}(c_i)\,\mathrm{Re}(c_i)}{\mathrm{Pr}(c_i) + \mathrm{Re}(c_i)} \qquad (2.30)$$

where Re, Pr, and F1 are computed in terms of the labels TP (True Positive), FP (False Positive), and FN (False Negative), which evaluate the validity of a given class label i assigned to a given document j, such that:
− $TP_i$: refers to the cases when both the classifier and the human cataloguer agree on assigning class label i to document j;
− $FP_i$: refers to the cases when the classifier has mistakenly (as judged by a human cataloguer) assigned class label i to document j;
− $FN_i$: refers to the cases when the classifier has failed (as judged by a human cataloguer) to assign a correct class label i to document j.

Many algorithms have adjustable parameters upon which precision and recall will depend, such as a confidence threshold. When this is the case, a single performance measure can be obtained by varying the parameter(s) and finding the precision-recall breakeven point, which is the (interpolated) value of precision obtained by varying the parameter until precision becomes equal or very close to recall. Both F1 and the breakeven point measures are relevant to single-class classification problems, i.e., when an example may be either positive or negative. For text classification, except for the very special case of binary classification, precision and recall values for the various classes should be combined to obtain an accurate measure of the ATC algorithms used. There are two possible approaches for this: macro-averaging and micro-averaging. Macro-averaged performance scores are computed by first calculating the scores per category (i.e., $\mathrm{Re}(c_i)$, $\mathrm{Pr}(c_i)$, $F1(c_i)$) and then averaging these per-category scores to compute the global means. In micro-averaging, by contrast, performance scores are computed globally over all categories. There is an important distinction between the two. Micro-averaging gives equal weight to every document, and is therefore considered a per-document average (more precisely, an average over all the document-category pairs). In contrast, macro-averaging gives equal weight to every category, regardless of its population, and is therefore a per-category average [29].

2.6 Benchmark Datasets

The core of any text classification experiment is measuring the performance of the developed method on a standard dataset and comparing the results to those of other methods. There are a number of de facto standard datasets that are widely used in evaluating text classification algorithms. In this section we review the most common of these datasets.

2.6.1 20 Newsgroups Dataset

The 20 Newsgroups dataset [30] is a collection of approximately 20,000 newsgroup documents, partitioned (almost) evenly across 20 different newsgroups.
To the best of our knowledge, it was originally collected by Ken Lang, for his work entitled “Newsweeder: Learning to filter net news” [31], though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular dataset for experiments in text classification and text clustering. The documents in this dataset are organized into 20 newsgroups, each corresponding to a different topic.

Some of the newsgroups are very closely related to each other (e.g., comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are highly unrelated (e.g., misc.forsale and soc.religion.christian). Table 2.1 shows the list of topics in the 20 Newsgroups dataset, partitioned according to their subject matter.

Table 2.1 List of topics in 20 Newsgroups dataset (one subject group per row).
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
sci.crypt, sci.electronics, sci.med, sci.space
misc.forsale
talk.politics.misc, talk.politics.guns, talk.politics.mideast
talk.religion.misc, alt.atheism, soc.religion.christian

2.6.2 Reuters-21578 Dataset

This is another popular test dataset for text classification tasks. It contains 21,578 Reuters news documents published in 1987. They were labelled manually by Reuters personnel. The total number of categories in this dataset is 672, but many of them occur very rarely. Some documents belong to many different categories, others to only one, and some have no category. Over the past decade, there have been many efforts to clean up this dataset and improve it for use in scientific research. The present format is divided into 22 files of up to 1,000 documents each, delimited by SGML tags [32].

2.6.3 OHSUMED Collection

The OHSUMED dataset contains 348,566 texts and was used for the filtering track in the 2000 Text Retrieval Conference (TREC-9) [33]. The texts are records of medical articles containing fields for the author, the title, the source, the publication type, a number of human-assigned relevant key terms (topics), and, in about two thirds of the cases, the abstract. Yiming [29] measured the accuracy of a number of popular ATC algorithms on well-known text classification datasets. One of the datasets examined in that study was OHSUMED, and it was pointed out that this dataset is much harder to classify than other well-known datasets such as Reuters-21578.

2.6.4 Four Universities Dataset

This dataset contains webpages collected from the computer science departments of various universities in January 1997 by the World Wide Knowledge Base (WebKb) project of the CMU text learning group [34]. The 8,282 webpages were manually classified into seven classes: ‘student’, ‘faculty’, ‘staff’, ‘department’, ‘course’, ‘project’, and ‘other’. For each class, the dataset contains webpages from four universities (Cornell, Texas, Washington, and Wisconsin) and 4,120 miscellaneous pages from other universities. The files are organized into a directory structure, with one directory for each class. Each of these seven directories contains five subdirectories, one for each of the four universities and one for the miscellaneous pages. These subdirectories in turn contain the webpages.
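As a worked recap of the performance measures of Section 2.5, which are the measures applied to benchmark datasets such as those described above, the following Python sketch computes per-category precision, recall, and F1 (Eqs. 2.28 to 2.30) and their macro- and micro-averages. The function names, the input layout (one (TP, FP, FN) triple per category), and the convention of returning 0.0 when a denominator is zero are illustrative assumptions, not part of the original text.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (Pr, Re, F1) for one category from its TP, FP, and FN counts.

    Assumption: a score with a zero denominator is reported as 0.0.
    """
    pr = tp / (tp + fp) if (tp + fp) else 0.0
    re = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pr * re / (pr + re) if (pr + re) else 0.0
    return pr, re, f1


def macro_micro(counts):
    """Return (macro, micro) triples of (Pr, Re, F1).

    counts is a list with one (TP, FP, FN) tuple per category.
    Macro: average the per-category scores (equal weight per category).
    Micro: pool the counts over all categories, then compute the scores once.
    """
    per_category = [precision_recall_f1(*c) for c in counts]
    m = len(per_category)
    macro = tuple(sum(scores[k] for scores in per_category) / m for k in range(3))
    pooled_tp = sum(c[0] for c in counts)
    pooled_fp = sum(c[1] for c in counts)
    pooled_fn = sum(c[2] for c in counts)
    micro = precision_recall_f1(pooled_tp, pooled_fp, pooled_fn)
    return macro, micro


if __name__ == "__main__":
    # Hypothetical counts: one large, well-classified category and one small,
    # poorly classified category.
    macro, micro = macro_micro([(90, 10, 10), (1, 1, 9)])
    print("macro (Pr, Re, F1):", macro)
    print("micro (Pr, Re, F1):", micro)
```

With these hypothetical counts the micro-averaged scores stay close to those of the large category, while the macro-averaged scores are pulled down by the small one, which reflects the per-document versus per-category distinction drawn in Section 2.5.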

CHAPTER 3. Developing New Bootstrapping Methods for ML-based ATC

3.1 Bootstrapping ML-based ATC Systems

One of the practical difficulties with the widespread use of supervised machine learning methods for ATC is that they require a large number of training instances in order to construct an accurate classifier. For example, Joachims [35] measured the accuracy of a Naïve Bayes (NB) classifier on a dataset of 20,000 Usenet articles from twenty different newsgroups (≈1,000 articles in each group). The study reported that the NB classifier achieved its highest accuracy of 89.6% when trained with 13,400 documents (670 documents per class). The accuracy of the classifier dropped to 66% when only 670 documents (about 33 documents per class) were used for training. As this and other similar reported experiments show, increasing the size of the training corpus improves the accuracy of the classifier substantially. However, manual classification of documents for the purpose of training a classifier is a tedious and expensive job and would be deemed infeasible or financially nonviable in many cases. Motivated by this problem, researchers have over the past decade attempted to develop classifiers using semi-supervised and unsupervised training methods, which rely on a limited number of training documents in the former case and on no training documents in the latter. Such classifiers mainly use so-called bootstrapping methods to automatically create a complete training corpus from scratch or to enrich an existing training corpus by adding new training instances. The bootstrapping methods deploy different techniques to utilise, as training instances, a large volume of unlabelled documents, which are easily accessible from sources such as the web. There are a considerable number of bootstrapping methods reported in the literature. Xiaojin [36] provides a review of these methods. Here we briefly review some of these methods.

McCallum and Nigam [37] used a small set of keywords per class, a class hierarchy, and a large quantity of easily obtained unlabelled documents to bootstrap an ATC system. The keywords are used to assign approximate labels to the unlabelled documents by term matching. These preliminary labels become the starting point for a bootstrapping process that trains an NB classifier using expectation maximization and hierarchical shrinkage. Nigam et al. [38] proposed a bootstrapping method which first trains a classifier using a limited number of available labelled documents, and probabilistically labels a large volume of unlabelled documents. It then trains a new classifier using the labels for all the documents. Gliozzo et al. [39] proposed a generalised bootstrapping method in which categories are described by relevant seed features. Their method introduces two unsupervised steps that improve the initial categorization step of the bootstrapping scheme: (i) the use of the latent semantic space to obtain a generalized similarity measure between instances and features, and (ii) a Gaussian Mixture algorithm to obtain uniform classification probabilities for unlabelled examples. Keswani et al. [40] proposed a bootstrapping method based on clustering unlabelled data with labelled data using probabilistic and fuzzy approaches.

Following this trend, we investigate here the use of new resources and methods for bootstrapping ML-based ATC systems using no, or a very limited number of, manually labelled training documents. These resources include:
a) Conventional library resources such as online public access catalogues and classification schemes, e.g., the Dewey Decimal Classification and Library of Congress Classification schemes.
b) Web search engines.

The rest of this chapter describes the development of two new bootstrapping methods which aim to eliminate the need for manually labelled training data in ATC systems.
Section 3.2 introduces a novel bootstrapping method based on utilising physical library resources, describes the development of a prototype classifier based on the proposed method, and presents the experimental results. Section 3.3 describes the development of a complete automatic digital repository system, one of whose two main components is an ML-based text classifier. The classifier component of the system is trained using a new web-based bootstrapping method developed in-house,
