The Use of Keywords and Phrases in Text Classification

As already noted, the most common representation for text classification is the bag-of-words representation, which has been used widely in previous text classification research [57, 43]. In this representation, single keywords are selected from the dataset and used to represent the documents in it. A keyword is a highly discriminative word, i.e. one that can be used to distinguish between classes, selected from the collection of words found in the documents of a dataset. Using keywords as features is fairly straightforward: the norm is to apply a feature selection technique to select a subset of the word collection as keywords, and these keywords are then used to represent the documents. Although keywords are fairly effective in text classification, a lot of research has been directed at developing richer representations than the bag-of-words. This has resulted in the bag-of-phrases representation. The use of phrases is motivated by the potential benefit of preserving semantic information that is present in phrases but not in single keywords. Various methods may be adopted to identify and extract phrases for the bag-of-phrases representation. These methods tend to fall into two categories: linguistic phrase extraction [53, 54, 28] and statistical phrase extraction [34, 67, 11, 20]. The former is based on syntactic patterns while the latter is based on statistical patterns.
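To make the keyword selection step described above concrete, the following is a minimal sketch of one possible feature selection technique. The text does not prescribe a particular technique here; information gain is assumed purely for illustration, and the names entropy, information_gain and select_keywords are hypothetical. Each document is assumed to be a set of word tokens with one class label.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, word):
    """Information gain of the binary feature 'word occurs in the document'.

    docs   -- list of sets of tokens (assumed representation)
    labels -- list of class labels, one per document
    """
    n = len(docs)
    with_word = Counter(l for d, l in zip(docs, labels) if word in d)
    without_word = Counter(l for d, l in zip(docs, labels) if word not in d)
    h_class = entropy(list(Counter(labels).values()))
    h_conditional = 0.0
    for part in (with_word, without_word):
        size = sum(part.values())
        if size:  # skip an empty partition
            h_conditional += (size / n) * entropy(list(part.values()))
    return h_class - h_conditional


def select_keywords(docs, labels, k):
    """Return the k words with the highest information gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda w: information_gain(docs, labels, w),
                    reverse=True)
    return ranked[:k]
```

A word that occurs in every document of one class and in no document of any other class receives the maximum score, which is the sense in which a keyword is "highly discriminative".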

Previous work has reported on the use of phrases in text classification, albeit with mixed results. While some researchers reported better results with phrases, others found that phrases produced only marginal or no improvement over single keywords.

One of the earliest reports on research using phrases for the text representation in text classification is that of Lewis [54]. He studied the effects of using syntactic phrases in text classification and found that noun phrases (in a Naive Bayes classifier) were less effective than individual words. The reason given was that not all phrases were good content indicators, and that this affected the results when those phrases were used alongside better content indicators. Dumais et al. [28] extracted syntactic phrases in the form of factoids (for example, “Salomon Brothers International”), multi-word dictionary entries (for example, “New York”) and noun phrases (for example, “first quarter”) and reported no improvement in classification when using Naive Bayes and SVM classifiers. Based on the examples of phrases given, one could argue that factoids could be too “unique” (possibly infrequent), could overfit the training data, and thus are not beneficial in this context. Multi-word dictionary entries and noun phrases could be similar in that they are proper two-word phrases which can be either common or unique. While common words could result in a classifier being over-generalized, unique words could cause the classifier to overfit the training data. In either case, the effectiveness of classification could be affected.

Fürnkranz [34] used the Apriori algorithm to generate n-gram features based on word sequences of length n and used RIPPER as the learning algorithm. He concluded that although including n-grams (up to 2-grams) in the text representation gave a slight improvement, word sequences of n > 3 were not useful and may decrease classification effectiveness. In addition, Fürnkranz et al. [35] investigated the use of linguistic phrases with both a Naive Bayes classifier (RAINBOW) and a rule-based classifier (RIPPER). Phrases were extracted using AUTOSLOG-TS, which used syntactic heuristics to create linguistic patterns. RAINBOW performed better with phrases than with words, while RIPPER performed worse. The experimental results showed that linguistic features could improve the precision of text classification, but at the expense of coverage. Although the two pieces of research cannot be compared directly because of their different experimental setups and datasets, it can be concluded that, where RIPPER was used, the statistical phrases extracted in [34] could bring about a slight improvement over the linguistic features used in [35].
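The level-wise idea behind Apriori-style n-gram generation mentioned above can be sketched as follows. This is a minimal illustration and not the procedure used in [34]: the function name frequent_ngrams, the use of raw occurrence counts and the min_count threshold are all assumptions made for the example. An n-gram is counted only if both of its (n-1)-gram sub-sequences were frequent at the previous level.

```python
from collections import Counter


def frequent_ngrams(docs, max_n, min_count):
    """Level-wise extraction of frequent word n-grams.

    docs      -- list of token lists (assumed representation)
    max_n     -- longest n-gram length to consider
    min_count -- minimum number of occurrences for an n-gram to be kept
    Returns a dict mapping n to the set of frequent n-grams (as tuples).
    """
    frequent = {}
    previous = set()
    for n in range(1, max_n + 1):
        counts = Counter()
        for tokens in docs:
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # Apriori-style pruning: both (n-1)-gram sub-sequences
                # must have been frequent at the previous level.
                if n == 1 or (gram[:-1] in previous and gram[1:] in previous):
                    counts[gram] += 1
        previous = {g for g, c in counts.items() if c >= min_count}
        if not previous:
            break  # no frequent n-grams, so no longer ones can qualify
        frequent[n] = previous
    return frequent
```

The pruning step is what keeps the number of candidate sequences manageable as n grows, since most long word sequences contain at least one infrequent sub-sequence.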

Mladenić and Grobelnik [67] enriched their document representation by including n-grams of length up to five (5-grams) and used a Naive Bayes classifier for learning to classify the Yahoo text hierarchy. Their experiments showed that using word sequences of length up to three (3-grams), instead of using only single words, improved the classifier performance while longer sequences did not offer any benefits. This demonstrated that using statistical phrases could benefit text classification.

Scott and Matwin [80] extracted noun phrases and keyphrases and used them with RIPPER for text classification. Noun phrases were extracted using the Noun Phrase Extractor (NoPE), which comprised a part-of-speech tagger and a regular expression algorithm to group tagged words into noun phrases. Keyphrases were extracted using a separate algorithm called the Extractor [86], which mimicked the choice a human would make when selecting keyphrases. The use of noun phrases was found to be only slightly better than the use of keywords, while the use of keyphrases was found to be slightly worse. In general, the authors reported no significant benefit from using phrases and concluded that more complex natural language processing methods were needed to identify them. One could argue that in this case the authors attempted to extract very “high level” phrases, in that the methods they used extracted phrases that a human would choose to represent a class, i.e. phrases that a human thought were semantically related to a particular class. While those phrases could be very good for text classification when a human performed the classification, they could be statistically insignificant when a machine performed the classification.

Bakus and Kamel [5] extracted phrases using a statistical word association based grammar and reported a slight improvement over the bag-of-words representation using a Naive Bayes classifier. Although the extracted phrases were found to be good classification discriminators, classification performance was not significantly better than when using keyword features. The authors attributed this to two factors: (i) the number of extracted phrases was significantly smaller than the total number of extracted words and (ii) many phrases corresponded to the same keyword feature. The authors suggested that these two factors lessened the impact of phrases on the effectiveness of the classification.

Kongovi et al. [49] found instances where using phrases was more effective than using single words. They reasoned that this was because word pairs (two adjoining words) could provide some semantic value, as well as filter out words occurring frequently in isolation that are not discriminative. They defined a phrase as “two adjoining words in the text with zero word distance, eliminating all the stop words in between”. Extracting phrases in this manner allowed patterns of co-occurring words to be extracted, and statistical information concerning these words helped identify phrases that were good discriminators. Again, statistical phrases were found to be beneficial as the text representation for text classification.

Tan et al. [84] extracted bigrams from the Reuters and Yahoo! Science datasets. Bigrams were extracted such that they contained at least one keyword; the keywords were selected based on a document frequency ranking, and only highly ranked words, deemed more significant than lower ranking ones, were considered. The extracted bigrams were then further filtered using TF-IDF and Information Gain ranking. Tan et al. used two classifiers, Naive Bayes and maximum entropy, and reported better classification results when bigrams were included in the representation. The authors suggested that the improvement was due to a number of factors: (i) bigrams were used in addition to single words and not in place of them; (ii) the number of bigrams selected was equivalent to 2% of the number of keywords and (iii) information gain was used in addition to document frequency and term frequency to choose bigrams, resulting in the bigrams being good discriminators.

Li et al. [59] reported that the use of phrases benefited classification when classifying texts about closely related topics in the same domain. They used an n-gram word extractor to extract frequent phrases and used them for classifying research paper abstracts using various classifiers. Experiments showed that the use of phrases was better than the use of keywords as the text representation for the classification of their dataset. This was understandable because topics in the same domain may share a lot of technical terms. Therefore, phrases could serve as good discriminators to differentiate between classes. As given in the example by the authors: “text mining” and “data mining” were good phrases to differentiate between the classes “text mining” and “data mining”, while “text”, “data” and “mining” alone were common words shared by the two classes and thus less discriminating when used by themselves.
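The two word-pair definitions described above, namely adjoining words once stop words are eliminated (Kongovi et al. [49]) and bigrams that contain at least one keyword (Tan et al. [84]), can be sketched as follows. This is an illustrative reading of those definitions rather than either group's implementation: the function names adjoining_pairs and keyword_bigrams, the token-list input and the stop-word and keyword sets are assumptions, and the later TF-IDF and Information Gain filtering is not shown.

```python
def adjoining_pairs(tokens, stop_words):
    """Pairs of words that are adjacent once stop words are removed."""
    content = [t for t in tokens if t not in stop_words]
    return list(zip(content, content[1:]))


def keyword_bigrams(tokens, keywords):
    """Bigrams of consecutive tokens in which at least one token is a keyword."""
    return [(a, b) for a, b in zip(tokens, tokens[1:])
            if a in keywords or b in keywords]
```

With the stop words “the” and “of”, the token sequence “the mining of text data” gives the pairs (mining, text) and (text, data). A pair such as (text, mining) keeps apart the two classes in the example of Li et al., whereas the single word “mining” does not.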

Chang and Poon [14] investigated the use of phrases for email classification using a Naive Bayes classifier and two k-nearest neighbour classifiers and found that using phrases of size two for the text representation gave the best classification results. Phrases were extracted using the Shingling algorithm [8], in which contiguous sequences of words were extracted in an overlapping manner. The authors named the phrases w-shingles, where w was the length of the phrase. An example given by the authors was “a rose is a rose is a rose”, from which the 4-shingles extracted comprised {(a rose is a), (rose is a rose), (is a rose is), (a rose is a), (rose is a rose)}. They also experimented with removing “stop-shingles”, i.e. shingles containing only stop words. The authors commented that the removal of stop-shingles had only a very marginal effect on the classification of their dataset.
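The extraction of w-shingles described above amounts to sliding a window of w words over the text one word at a time. The following is a minimal sketch under that reading; the function names shingles and remove_stop_shingles and the stop-word set are assumptions for illustration and are not taken from [14] or [8].

```python
def shingles(tokens, w):
    """All overlapping contiguous word sequences of length w, in order."""
    return [tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)]


def remove_stop_shingles(shingle_list, stop_words):
    """Drop shingles that consist only of stop words."""
    return [s for s in shingle_list
            if not all(token in stop_words for token in s)]


# The example sentence from the text, split on whitespace.
example = "a rose is a rose is a rose".split()
four_shingles = shingles(example, 4)
```

Applied to the example sentence with w = 4, this yields the five shingles listed above, three of which are distinct. A text shorter than w words yields no shingles.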

In general, the literature has reported various outcomes from the use of phrases in text classification. While some results are promising, others show very small or no improvements in classification effectiveness. It is clear that direct comparison cannot be made between the different bodies of research work because of the different experimental setups that were used. Both linguistic and statistical phrase extraction have been experimented with, and previous work has reported more favourable results for statistical phrase extraction. With respect to the work conducted in this thesis, three different kinds of statistical phrases are extracted, as reported in Sub-section 4.2.3 in Chapter 4.
