
CHAPTER 2 Automatic Text Classification: A Review

3.4 Phrase based Representation

Phrases are defined as sequences of words that appear in a text. They carry more specific meaning than single words and are less ambiguous. For example, classification is a general term that acquires a more specific meaning once its object or type is stated, as in text classification, image classification, query classification, and question classification, or single-label and multi-label classification.

Chapter 3: Text Representation

Two documents that mention the same phrase, such as “question classification”, are more similar than two documents that share only the word “classification”. In addition, phrases help to reduce the ambiguity of individual words.

In general, two different types of phrases have been proposed and investigated as features in the text representation model for the ATC system:

• Syntactic phrases. This type of phrase is extracted using lexical and syntactic rules that identify terminology in the text, such as noun phrases and verb phrases; these linguistic units are then used as features in the representation model.

• Statistical phrases. This type of phrase is a sequence of N words occurring consecutively in the text. It ignores the semantics of the text, and phrases are built after stop words are removed. Phrases are either words grouped on the basis of co-occurrence or word sequences extracted from documents by traditional string matching. Many studies refer to statistical phrases as N-grams, where N is the number of words in the phrase.
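As an illustration of the statistical phrase definition above, the following minimal sketch extracts word N-grams after stop-word removal. The function name `statistical_phrases`, the whitespace tokenizer, and the tiny stop-word list are assumptions made for this example and are not taken from any of the cited studies.

```python
# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def statistical_phrases(text, n, stop_words=STOP_WORDS):
    """Return every sequence of n consecutive words left after stop-word removal."""
    # Lowercase and split on whitespace; a real tokenizer would also handle punctuation.
    tokens = [t for t in text.lower().split() if t not in stop_words]
    # Slide a window of n words over the remaining tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


doc = "the classification of the question is a text classification task"
bigrams = statistical_phrases(doc, 2)
trigrams = statistical_phrases(doc, 3)
```

Because the stop words are removed before the window is applied, a phrase such as "classification question" can join words that were not adjacent in the original sentence, which is one way in which this kind of phrase ignores the semantics of the text.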

Both syntactic and statistical phrases have been studied in ATC and IR applications. The first work to study the use of syntactic phrases in ATC was done by Lewis [74, 75]. In this study, using phrases as features with NB as the classification algorithm yielded significantly lower performance than the standard BOW model. In Lewis’ work, only syntactic phrases were used to represent the text, whereas in most other works phrases are added to the BOW model. Another study [76] implemented a practical method to extract syntactic phrases from documents. Two strategies were used to represent documents with the extracted syntactic phrases: the general representation and the sub-topic representation. In the general representation, only short syntactic phrases that point out general concepts in documents are selected, whereas in the sub-topic representation only long syntactic phrases that refer to more specific concepts are selected. A Feature Selection (FS) technique, Information Gain (IG), was used to remove phrases with low information value, and the SVM classification algorithm was used to classify the Reuters-21578 documents. The results demonstrated that representing documents with syntactic phrases using the sub-topic representation outperforms the general representation. Furthermore, the representation using BOW without linguistic pre-processing outperforms any representation using syntactic phrases with any other strategy [77]. This is consistent with other works such as [27], which reports the same result on Reuters-21578.
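The IG-based feature selection mentioned above can be sketched as follows. This is the standard textbook definition of Information Gain for a binary "phrase present / phrase absent" feature; the exact formulation used in the cited study is not given here, so the code should be read as an assumption. The names `entropy` and `information_gain` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, phrase):
    """IG of a phrase: class entropy minus class entropy conditioned on its presence.

    docs is a list of sets of phrases, labels the matching list of class labels.
    """
    present = [y for d, y in zip(docs, labels) if phrase in d]
    absent = [y for d, y in zip(docs, labels) if phrase not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Toy data (invented for illustration): each document is a set of extracted phrases.
docs = [
    {"interest rate", "central bank"},
    {"interest rate"},
    {"crude oil", "central bank"},
    {"crude oil"},
]
labels = ["money", "money", "oil", "oil"]

ig_rate = information_gain(docs, labels, "interest rate")
ig_bank = information_gain(docs, labels, "central bank")
```

In this toy collection "interest rate" separates the two classes perfectly and receives the maximum gain of one bit, while "central bank" occurs equally often in both classes and receives a gain of zero, so a selection step with any positive threshold would remove it.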

One reason for this poor result is that phrases are unlikely to affect the effectiveness of the classification system unless they occur frequently enough in the document collection. Another reason is that not every syntactic phrase denotes an interesting concept, and distinguishing the interesting phrases is difficult [78]. Therefore, a number of researchers have used statistical phrases in ATC systems in an attempt to improve on the low performance of syntactic phrases.

Statistical phrases have a number of advantages over syntactic ones: they can be extracted with more robust and less computationally demanding algorithms. In addition, the effect of unrelated syntactic phrases can be removed, and worthless phrases tend to be filtered out from useful ones [76]. For instance, the study by Mladenic [79] extracted statistical phrases of up to five words using an extraction algorithm that relies on document frequency as a statistical filter, with NB as the classification algorithm on a collection of web pages. The study reports that statistical phrases of up to four words gave significant benefits over the BOW, while phrases of five words did not provide additional benefit. The authors of [80] used an algorithm similar to the one in [79] to extract statistical phrases of up to five words. The Reuters-21578 dataset was used in the experiment, and statistical phrases of length 2 showed a significant improvement in performance. Another finding is that classification performance decreases as the length of the statistical phrases increases. The author used a dataset of Usenet newsgroup articles and reported that statistical phrases of length 3 are of some use, whereas the negative contribution of longer statistical phrases was confirmed. The authors of [76] defined a statistical phrase as a stemmed and alphabetically ordered sequence of N words. Statistical phrases of different lengths were extracted and used to represent documents, and different FS techniques were used to remove phrases with low information value. The Rocchio algorithm was used to classify the Reuters-21578 dataset, and the experimental results showed that using statistical phrases as features to represent documents in the TC task did not yield better performance than the BOW. Other researchers, such as [81], combined words and statistical phrases for text representation to classify text documents. Compared with using the BOW representation alone, using both BOW and phrases for text representation did not show a significant improvement in classification accuracy.
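Two of the ideas discussed above, filtering statistical phrases by document frequency and treating a phrase as an alphabetically ordered sequence of words, can be illustrated with the short sketch below. It is not the algorithm of any cited study, whose details are not given here: the function names, the `min_df` threshold, and the `order_words` switch are hypothetical, and stemming is omitted for brevity.

```python
from collections import Counter


def ngrams(tokens, n):
    """All sequences of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_phrases(token_docs, max_len=3, min_df=2, order_words=False):
    """Keep phrases of 2..max_len words that occur in at least min_df documents.

    If order_words is True, the words of each phrase are sorted alphabetically,
    so that "text classification" and "classification text" become one feature.
    """
    df = Counter()
    for tokens in token_docs:
        seen = set()
        for n in range(2, max_len + 1):
            for gram in ngrams(tokens, n):
                if order_words:
                    gram = tuple(sorted(gram))
                seen.add(gram)
        df.update(seen)  # each phrase is counted at most once per document
    return {" ".join(gram) for gram, count in df.items() if count >= min_df}


# Toy tokenized documents (invented for illustration, stop words already removed).
token_docs = [
    ["text", "classification", "task"],
    ["text", "classification", "system"],
    ["classification", "text", "corpus"],
]

plain = frequent_phrases(token_docs)
ordered = frequent_phrases(token_docs, min_df=3, order_words=True)
```

In this toy collection every three-word phrase occurs in only one document and is discarded by the filter, which mirrors the observation that longer phrases are often too rare in a collection to contribute to classification.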

Unfortunately, neither statistical nor syntactic phrases, used as representation features for text, provide a satisfactory solution to the problems of the BOW model. For this reason, a new type of feature, the concept, has been introduced in text mining to represent text.