3.2 Text Representation
3.3.1 Hierarchical Categorization
In earlier works, categories were considered as a flat set, with no relationships between them. In the last decade, to better address real use cases and to achieve higher effectiveness and efficiency, increasing attention has been given to hierarchical classification of documents: categories are arranged in a usually rooted, tree-like taxonomy, where each category covers a slightly more specific topic than its parent category and may have any number of subcategories dealing in turn with even more specific topics. A hierarchical taxonomy can be expressed formally as a partially ordered set $\langle C, \prec \rangle$, where $C$ is the set of categories and $\prec \subset C \times C$ is the asymmetric, transitive is-a relationship. In practice, $c_d \prec c_a$ means that $c_d$ represents a specific topic within the wider discussion area represented by $c_a$: in this case, $c_a$ is said to be an ancestor of $c_d$, which is in turn a descendant of $c_a$. In this formalization, the (direct) parent of a non-root category $c_d$ is the only category $c_p$ satisfying $c_d \prec c_p \wedge \nexists \gamma \in C : c_d \prec \gamma \prec c_p$, while the children of a category $c_p$ are those categories whose parent is $c_p$.
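The definitions of ancestor, parent and children can be made concrete with a minimal sketch. The toy taxonomy and all names below (`PARENT`, `precedes`, and so on) are hypothetical and serve only as illustration; the `parent` function deliberately recovers the direct parent from the is-a relation alone, mirroring the formal condition that no category lies strictly between a category and its parent.

```python
from typing import Dict, Optional, Set

# Hypothetical toy taxonomy (an assumption for illustration):
# each category maps to its direct parent, None for the root.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "science": "root",
    "sports": "root",
    "physics": "science",
    "biology": "science",
    "genetics": "biology",
}


def ancestors(c: str) -> Set[str]:
    """All categories a with c ≺ a, found by following parent links upward."""
    result: Set[str] = set()
    p = PARENT[c]
    while p is not None:
        result.add(p)
        p = PARENT[p]
    return result


def precedes(cd: str, ca: str) -> bool:
    """True when cd ≺ ca, i.e. ca is an ancestor of cd."""
    return ca in ancestors(cd)


def parent(cd: str) -> Optional[str]:
    """Direct parent derived from ≺ only: the ancestor cp such that
    no category g satisfies cd ≺ g ≺ cp. Returns None for the root."""
    for cp in ancestors(cd):
        if not any(precedes(cd, g) and precedes(g, cp) for g in PARENT):
            return cp
    return None


def children(cp: str) -> Set[str]:
    """Categories whose direct parent is cp."""
    return {c for c in PARENT if parent(c) == cp}
```

With this toy data, `parent("genetics")` evaluates to `"biology"` and `children("science")` to the set containing `"physics"` and `"biology"`; the relation is asymmetric, so `precedes("root", "genetics")` is false.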
A hierarchical taxonomy of categories is often useful to better organize documents, allowing users to find specific ones by starting from general discussion areas and progressively narrowing down the domain to the topic of interest. A typical example of this organization is web directories, where large numbers of websites are organized in a fine-grained taxonomy of categories that the user can browse, from the home page presenting the top-level categories to the sub-pages of specific topics, which list general related websites and possibly even more specific sub-categories where other websites are distributed.
(Silla and Freitas, 2011) give a general review of hierarchical TC literature, while (Qi and Davison, 2009) focus on web page classification, which is a common application.
The hierarchical setting of categories has some variants: (Sun and Lim, 2001) distinguish between taxonomies organized specifically as trees or more generally as directed acyclic graphs (where nodes may have multiple parents) and between allowing or denying documents to be classified in intermediate (non-leaf) nodes of the taxonomy. The same work discerns two major approaches to hierarchical text classification. In the big-bang approach a global feature selection is performed and a single classifier is used on the whole hierarchy, while in the more common top-down approach a different classifier is used for each node: the computational effort to build classifiers is therefore higher, but the most suitable features can be selected for each node.
To classify a document with a top-down approach, an iterative process is generally used starting from the root of the hierarchy and progressively descending to more specific categories of the tree to find the potentially correct one. Typical issues in this process are correctly classifying very specific documents at high levels of the hierarchy and, if documents can be classified in intermediate nodes, choosing at each level whether to stop at the current node or keep descending the tree.
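The descent described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited work: the toy hierarchy `CHILDREN`, the keyword sets standing in for trained per-node classifiers, the `local_score` function and the `threshold` value are all hypothetical. At each node the best-scoring child is chosen, and the descent stops at the current node when no child scores high enough, which corresponds to allowing classification in intermediate nodes.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy hierarchy: each node lists its children (leaves list none).
CHILDREN: Dict[str, List[str]] = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": [],
    "physics": [],
    "biology": [],
}

# Stand-in for trained per-node classifiers (an assumption for illustration):
# one keyword set per category.
KEYWORDS: Dict[str, Set[str]] = {
    "science": {"experiment", "theory", "cell", "quantum"},
    "sports": {"match", "team", "goal"},
    "physics": {"quantum", "particle"},
    "biology": {"cell", "gene"},
}


def local_score(category: str, tokens: List[str]) -> float:
    """Fraction of document tokens that match the category keywords."""
    if not tokens:
        return 0.0
    return sum(t in KEYWORDS[category] for t in tokens) / len(tokens)


def classify_top_down(
    tokens: List[str],
    threshold: float = 0.2,
    score: Callable[[str, List[str]], float] = local_score,
) -> List[str]:
    """Return the path from the root to the category chosen for the document."""
    node = "root"
    path = [node]
    while CHILDREN[node]:
        # Local decision: only the children of the current node compete.
        best = max(CHILDREN[node], key=lambda c: score(c, tokens))
        if score(best, tokens) < threshold:
            break  # no child is convincing: stop at the current node
        path.append(best)
        node = best
    return path
```

A document rejected at a high level never reaches its correct descendant, since categories outside the chosen branch are not evaluated again: with this toy data, a document whose tokens match no top-level keyword stays at `"root"` regardless of how well it would match a leaf.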
Some earlier experiments were performed on flat collections like Reuters-21578, using categories as leaves in a small hierarchy built ad hoc. For instance, (D’Alessio et al., 2000) use a local classifier for each node of the hierarchy, with two feature sets created from positive and negative example documents.
Among the first experiments on real hierarchical collections, (Dumais and Chen, 2000) create one SVM classifier for each node in a two-level hierarchy of summaries of web documents, using feature sets created with documents and categories of the same node, and assign multiple leaf nodes to the test documents. (Liu et al., 2005) also use multiple SVMs, one trained for each category. (Cai and Hofmann, 2004) leverage knowledge of category relationships in the whole SVM classification architecture. In (Cesa-Bianchi et al., 2006b,a), the authors propose improvements to the classification process, like an incremental classifier (Cesa-Bianchi et al., 2006b) and a refined evaluation scheme (Cesa-Bianchi et al., 2006a). (Xue et al., 2008b) propose a strategy based on pruning the original hierarchy, which first computes the similarity between the test document and all other documents, and then classifies the document in a pruned hierarchy. (Sun et al., 2004) tested three methods to limit the blocking problem in top-down approaches, that is, the problem of documents wrongly rejected by the classifiers at higher levels of the taxonomy. (Bennett and Nguyen, 2009) propose an expert refinements technique that uses cross-validation in the training phase to obtain a better
estimation of the true probabilities of the predicted categories. (Ceci and Malerba, 2007) make a comparison between using a flat classifier and a local hierarchy classifier created in each parent node, using both SVM and Naïve Bayes approaches. (Ruiz and Srinivasan, 2002) propose a variant of the Hierarchical Mixture of Experts model, making use of two types of neural networks for intermediate nodes and for leaves.
The literature also offers various reference works on the use of global classifiers. (Cai and Hofmann, 2003) investigate the use of concept-based document representations to supplement word- or phrase-based features. Weak hypotheses are created based on term and concept features, which are combined using AdaBoost. (Vens et al., 2008) transform the classification output into a vector with boolean components corresponding to the possible categories. They also use a distance-based metric to measure the similarity of training examples in the classification tree.
(Li et al., 2012c) propose an active learning method to significantly reduce the number of training documents needed: this is based on iteratively picking a limited number of informative unlabeled documents from a pool of available data and getting a “yes” or “no” answer for each category from a specialized oracle (human expert).
While the basic bag-of-words approach considers words only in their lexical form, many works are based on leveraging their semantic content to improve the accuracy of text classification. Some works use statistical techniques like Latent Semantic Indexing (LSI; or Analysis, LSA) to capture the underlying correlations between words from training documents. (Hull, 1994) applied LSI in conjunction with the Rocchio method for classification. (Yang, 1995) used Singular Value Decomposition to reduce noise in latent semantic structures. (Zelikovitz and Hirsh, 2001) test the benefits of leveraging additional unlabeled documents when applying LSI. Newer techniques like Probabilistic Latent Semantic Analysis (Hofmann, 1999) and Latent Dirichlet Allocation (Blei et al., 2003) model the presence of different topics in the documents, each represented by different words.
While these statistical techniques learn approximate semantic knowledge from the training documents themselves, another possibility is to use external knowledge bases. Generally, an external knowledge base may be any kind of resource and may be used in different ways. Structured or semi-structured resources are often used to obtain accurate semantic information, making it possible to spot relationships between words which might not be learned from the documents themselves. A widely used resource is WordNet, a lexical database for the English language which represents concepts as sets of synonyms and records the mutual semantic relationships between them (Miller, 1995): we return to it later when describing how it is used in our method.
Many works make use of semantic knowledge by substituting or enriching the representations of documents with concepts expressed by terms. For example, (Scott and Matwin, 1998) use WordNet synsets (concepts) as features, weighting them also by respective hypernyms found in the text. (Gabrilovich and Markovitch, 2005) tested the introduction of additional features, extracting hundreds of thousands of concepts from domain-specific and common-sense world knowledge sources, like DMoz (Gabrilovich and Markovitch, 2005) and Wikipedia (Gabrilovich and Markovitch, 2007). In a similar way, (Tao et al., 2012) perform unsupervised classification by extracting categories from generalized world knowledge. (Peng and Choi, 2005) generate further features from relationships between WordNet synsets and also create representations for categories, which are tested for similarity with those for documents: our method resembles this approach in these respects.
There are also other approaches based on the creation of structured representations for each category, in addition to those for single documents. In the basic case, each topic may be represented with a set of manually picked keywords. The method presented by (McCallum, 1999) assigns a few keywords to each category in a small hierarchy as a preliminary step to train a Naïve Bayes classifier from a set of initially unlabeled documents. (Barak et al., 2009) build topic representations by using the name of each category and finding terms semantically related to it; documents are then classified in a flat context by comparing them to categories by cosine similarity, as also done by (Peng and Choi, 2005).
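The idea of comparing documents to category representations by cosine similarity can be sketched in a few lines. This is only an illustration under stated assumptions and does not reproduce the method of any cited work: the categories and their term lists in `CATEGORY_TERMS` are hypothetical, hand-picked keywords, and both documents and categories are represented as plain term-frequency vectors.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical category representations (an assumption for illustration):
# each category is described by a handful of related terms.
CATEGORY_TERMS: Dict[str, List[str]] = {
    "sports": ["match", "team", "goal", "player"],
    "finance": ["market", "stock", "bank", "price"],
}


def classify(tokens: List[str]) -> str:
    """Assign the document to the category with the most similar representation."""
    doc = Counter(tokens)
    profiles = {c: Counter(terms) for c, terms in CATEGORY_TERMS.items()}
    return max(profiles, key=lambda c: cosine(doc, profiles[c]))
```

For instance, `classify("the team scored a late goal in the match".split())` selects `"sports"`, since the document shares three terms with that profile and none with the other. In a realistic setting the term lists would be expanded with semantically related terms and weighted, rather than fixed by hand.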