The description above is very general and covers a wide range of work on text classification. Individual works usually give slightly narrower definitions of the problem they address; these definitions are nonetheless similar, and some can be seen as generalizations or specializations of others. The most common variants of the general problem are described here.
3.2.1 Binary, single-label and multi-label classification
Besides how many and which categories can be globally applied to documents, one must also define how many of them can be applied to each single document.
In the general classification of objects (documents or otherwise), the most basic task is binary classification, where each object must be assigned to exactly one of two possible classes. It is common to denote one class by an identifier, say c, and the other as its complement c̄: from this point of view there is a single class to which each object either belongs or does not. This is the case for the spam filtering example cited above, as well as for other similar filtering tasks. Problems of this type are well suited to machine learning models that are natively oriented to binary classification, such as support vector machines (§3.5.2).
While binary classification considers only two possible classes, multi-class classification entails an arbitrarily sized set of classes: in this case, two possibilities are common. In single-label classification, each document
is assigned to one and only one possible category: in this case, considering classification by topics, we impose that each document must deal with exactly one of these topics, which may or may not be reasonable depending on the specific context.
With multi-label classification, instead, no constraints are set on the labeling of each document: given a set C of possible categories, each document can be assigned an arbitrary subset of them, which is usually allowed to be empty. While some machine learning algorithms have been specifically proposed for multi-label classification problems (often as general-purpose adaptations of standard ones), most algorithms can only handle the binary and single-label cases. The most common solution is to treat a multi-label problem with |C| possible categories as |C| distinct binary classification problems, each determining which documents should be labeled with one of the categories. In this approach, the potential dependencies between categories are not considered, meaning that the probability that a document belongs to a certain category does not depend on its associations with the other categories. Although assuming independence between categories may not be theoretically correct, it is usually an acceptable approximation in practice that eases the classification process.
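The decomposition of a multi-label problem into |C| binary problems can be sketched as follows. This is only a minimal illustration: the function names, the toy training set and the keyword-based binary learner are assumptions made for the example, and any real binary classifier (for instance a support vector machine) could take the learner's place.

```python
from typing import Callable, Dict, List, Set, Tuple

LabeledDoc = Tuple[str, Set[str]]          # (text, set of assigned categories)
BinaryClassifier = Callable[[str], bool]   # text -> belongs to the category?


def to_binary_problems(
    labeled_docs: List[LabeledDoc], categories: List[str]
) -> Dict[str, List[Tuple[str, bool]]]:
    """Split one multi-label data set into |C| binary data sets."""
    return {
        c: [(doc, c in labels) for doc, labels in labeled_docs]
        for c in categories
    }


def train_keyword_classifier(examples: List[Tuple[str, bool]]) -> BinaryClassifier:
    """Toy learner (assumption): keep words seen only in positive documents."""
    positive_words: Set[str] = set()
    negative_words: Set[str] = set()
    for doc, is_member in examples:
        target = positive_words if is_member else negative_words
        target.update(doc.lower().split())
    keywords = positive_words - negative_words
    return lambda doc: any(word in keywords for word in doc.lower().split())


def predict_labels(doc: str, classifiers: Dict[str, BinaryClassifier]) -> Set[str]:
    """Each binary decision is taken independently of the others."""
    return {c for c, classify in classifiers.items() if classify(doc)}


categories = ["computer", "music", "sports"]
training = [
    ("new laptop processor released", {"computer"}),
    ("guitar concert tonight", {"music"}),
    ("soccer match report", {"sports"}),
    ("soccer video game for laptop", {"computer", "sports"}),
    ("weather forecast for tomorrow", set()),
]
problems = to_binary_problems(training, categories)
classifiers = {c: train_keyword_classifier(ex) for c, ex in problems.items()}
```

Because every category has its own classifier, the predicted label set may contain several categories or none, which matches the multi-label setting; the independence assumption shows up in the fact that no classifier ever sees the output of another.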
3.2.2 Hierarchical classification
In the most general case, the categories of the set C are independent of each other: the assignment (or absence thereof) of a certain category to a document does not depend on which of the other categories are assigned to that document. This happens when the scopes of different categories do not significantly overlap. Consider for example the classification of documents under three categories corresponding to the topics computer, music and sports: while some documents may be related to more than one of these categories at the same time (e.g. a document about a soccer video game might concern both computer and sports), it is generally not possible to make assumptions about the likelihood that a document belongs to a category from knowing whether it belongs to any other one.
However, topic categories may be inter-related by means of is-a relationships, meaning that some known topics are more specific branches of other topics that are also represented by categories. For example, considering a set of known categories including sports, baseball and hockey, documents talking about baseball necessarily also deal with sports, and the same holds for hockey. Given a set of topics presenting these relationships, a hierarchical taxonomy of these topics can be created, where some of them are recursively broken down into more specific branches.
Hierarchical text classification generally refers to classifying text documents under hierarchically organized taxonomies of categories. Each such taxonomy is generally structured as a single-rooted tree, where each node, apart from the root node, represents a category having a distinct, single node as its parent. Each category may have any number of child categories, representing more specific topics: categories with no child nodes are leaves of the tree. As a generalization, a taxonomy may also be represented by a directed acyclic graph (DAG), which in practice allows multiple roots and nodes with multiple parents, but in the following, unless otherwise stated, only single-rooted trees are considered.
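A single-rooted tree of this kind can be sketched with a simple parent map. The category names, the artificial root called "top" and the helper functions below are assumptions chosen for illustration, not a prescribed representation.

```python
from typing import Dict, List, Optional, Set

# Hypothetical taxonomy: each category maps to its parent; the root maps to None.
PARENT: Dict[str, Optional[str]] = {
    "top": None,
    "computer": "top",
    "sports": "top",
    "baseball": "sports",
    "hockey": "sports",
}


def children(parent_of: Dict[str, Optional[str]], category: str) -> List[str]:
    """Categories whose parent is the given category."""
    return sorted(c for c, p in parent_of.items() if p == category)


def leaves(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Categories that are nobody's parent."""
    used_as_parent = set(parent_of.values())
    return sorted(c for c in parent_of if c not in used_as_parent)


def ancestors(parent_of: Dict[str, Optional[str]], category: str) -> List[str]:
    """Walk up the tree from a category to the root (nearest ancestor first)."""
    path: List[str] = []
    node = parent_of[category]
    while node is not None:
        path.append(node)
        node = parent_of[node]
    return path


def expand_labels(parent_of: Dict[str, Optional[str]], labels: Set[str]) -> Set[str]:
    """Add every non-root ancestor implied by the is-a relationships."""
    expanded = set(labels)
    for category in labels:
        expanded.update(
            a for a in ancestors(parent_of, category) if parent_of[a] is not None
        )
    return expanded
```

With this map, a document labeled only with baseball is expanded to baseball and sports, reflecting the fact that a document about baseball also deals with sports. A DAG-shaped taxonomy would need a set of parents per category instead of a single one.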
A hierarchical taxonomy can be expressed formally as a partially ordered set $\langle C, \prec \rangle$, where $C$ is the set of categories and $\prec \subset C \times C$ is the asymmetric, transitive is-a relationship. In practice, $c_d \prec c_a$ means that $c_d$ represents a specific topic within the wider discussion area represented by $c_a$: in this case, $c_a$ is said to be an ancestor of $c_d$, which is in turn a descendant of $c_a$. In this formalization, the (direct) parent of a non-root category $c_d$ is the only category $c_p$ satisfying $c_d \prec c_p \wedge \nexists \gamma \in C : c_d \prec \gamma \prec c_p$, while the children of a category $c_p$ are those categories whose parent is $c_p$.
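The formal definition of parent can be checked directly on a small relation. In the sketch below the is-a relationship is stored as an explicit, transitively closed set of (descendant, ancestor) pairs; the category names, the root called "top" and the function names are assumptions for illustration only.

```python
from typing import List, Optional, Set, Tuple

Relation = Set[Tuple[str, str]]  # (c_d, c_a) means c_d ≺ c_a

CATEGORIES = ["top", "computer", "sports", "baseball", "hockey"]
IS_A: Relation = {
    ("computer", "top"), ("sports", "top"),
    ("baseball", "sports"), ("hockey", "sports"),
    ("baseball", "top"), ("hockey", "top"),  # implied by transitivity
}


def is_strict_order(rel: Relation) -> bool:
    """Check that the relation is asymmetric and transitive."""
    asymmetric = all((b, a) not in rel for a, b in rel)
    transitive = all(
        (a, d) in rel for a, b in rel for c, d in rel if b == c
    )
    return asymmetric and transitive


def parent(rel: Relation, categories: List[str], c_d: str) -> Optional[str]:
    """The c_p with c_d ≺ c_p and no γ such that c_d ≺ γ ≺ c_p."""
    for c_p in categories:
        if (c_d, c_p) not in rel:
            continue
        has_intermediate = any(
            (c_d, g) in rel and (g, c_p) in rel for g in categories
        )
        if not has_intermediate:
            return c_p
    return None  # c_d is a root


def children(rel: Relation, categories: List[str], c_p: str) -> List[str]:
    """Categories whose parent is c_p."""
    return sorted(c for c in categories if parent(rel, categories, c) == c_p)
```

For a tree-shaped taxonomy the loop in `parent` finds exactly one candidate for every non-root category; in a DAG several candidates could satisfy the condition, and the function would return only the first one it meets.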
A hierarchical taxonomy of categories is often useful for organizing documents better, allowing users to find specific ones by starting from general discussion areas and progressively narrowing the domain down to the topic of interest. Web directories are a typical example of this organization: large numbers of websites are organized in a fine-grained taxonomy of categories that the user can browse, from the home page presenting the top-level categories to the sub-pages of specific topics, which list generally related websites and possibly even more specific sub-categories among which other websites are distributed.
The largest and best-known web directory is the Open Directory Project, also known as DMOZ¹: it is maintained collaboratively by users and contains millions of web links organized into a dynamic hierarchical taxonomy of about one million categories. Figure 3.1 shows a small part of the top of this taxonomy, which in some branches goes down for ten levels
1