5.2 Background

5.2.3 Types of Hierarchical Classification Frameworks

Whereas the previous section describes subtly different definitions of the problem, this section describes three broad categories of framework used to solve hierarchical classification problems. This work focuses on the local classifier approach; however, all three are discussed for completeness.

5.2.3.1 Flat Classifiers

The simplest approach to hierarchical classification is to reduce the problem to one of flat classification. By selecting a set of classes with mutually exclusive definitions (no parent/child relationships between them), any multi-class classifier (or indeed any binary classifier combined through a scheme such as one-vs-rest voting) can be used to solve the problem. While this simplifies matters, it imposes significant limitations. Firstly, any time a different set of classes is of interest, the classifiers must be retrained. Consider a classifier that identifies species within a scientific taxonomy; a researcher interested in insects requires a different classifier from a researcher interested in bird species. Performing hierarchical classification on a carefully designed semantic hierarchy instead allows a user to “roll up” the classes that are not of interest and explore the areas of the tree that are relevant to a particular usage. Some risk of retraining remains in hierarchical classification, as solutions are only flexible within a single semantic framework (e.g. a hierarchy based on the visual appearance of organisms cannot be reused for one based on genetic similarity).
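The “roll up” idea can be illustrated with a minimal sketch. The toy taxonomy, the `PARENT` mapping and the `roll_up` helper below are hypothetical names introduced purely for illustration; they are not taken from this work.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "organism": None,
    "animal": "organism",
    "insect": "animal",
    "bird": "animal",
    "bee": "insect",
    "ant": "insect",
    "parrot": "bird",
    "owl": "bird",
}


def ancestors(node):
    """Return the chain from `node` up to the root, with `node` first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def roll_up(label, keep):
    """Map a label to the nearest node in `keep` on its path to the root."""
    for node in ancestors(label):
        if node in keep:
            return node
    raise ValueError(f"{label!r} has no ancestor in {sorted(keep)}")


# An entomologist keeps insect species distinct and collapses all birds.
insect_view = {"bee", "ant", "bird"}
predictions = ["bee", "parrot", "owl", "ant"]
rolled = [roll_up(p, insect_view) for p in predictions]
print(rolled)  # ['bee', 'bird', 'bird', 'ant']
```

The same set of predictions can be viewed at a different granularity simply by changing `keep`, without retraining anything, provided the new view lives in the same hierarchy.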

The second disadvantage of simplifying the problem to the flat case is that any semantic information embedded in the hierarchy is lost and cannot be leveraged by the classification algorithm to improve performance. Reference [15] discusses the difference between semantic and feature-space similarity (in their case, visual similarity for images), and the knowledge embedded in the semantic hierarchy that may improve classifier performance.

5.2.3.2 Global Classifiers

If the class hierarchy is to be preserved, it is necessary to create an algorithm that provides a hierarchy as an output, rather than a binary or categorical value. A global classifier (also described in the literature as the “big bang” approach) seeks to do this via the creation of a specialised classification algorithm. Structured Support Vector Machines [90] are an example of this approach. In the computer vision literature, this is often referred to as structured output prediction. The output of the classifier is a binary vector of classes, which in the most general sense can be used to solve any of the hierarchical classification problem types in Table 5.1. The misclassification cost function to be minimised is typically the number of differing bits between the true and predicted output vectors (the Hamming distance, which for binary vectors equals the Manhattan distance, or L1-norm of the difference vector). In hierarchical classification terms, when each class is encoded by activating its own node and all of its ancestors in a tree, this is identical to the path length in the hierarchy between the true and predicted class. This neatly captures the sense that mistakes between distant objects in the hierarchy (such as poodles and parrots) should be penalised more harshly than mistakes between nearby objects in the tree (such as poodles and spaniels). Under the structured output classification interpretation, the class hierarchy becomes simply a method of defining the desired misclassification cost between any two classes. In this sense, structured output classification (when constrained to predict according to a hierarchy) can be generalised to cost-sensitive classification, in which a flat multi-class problem with a single categorical output has a different cost of misclassification between each pair of classes.
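The correspondence between differing bits and path length can be checked on a small example. The sketch below assumes a tree-shaped hierarchy and an encoding in which a class activates its own bit and the bits of all its ancestors; the hierarchy and all function names are hypothetical.

```python
# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "animal": None,
    "dog": "animal",
    "bird": "animal",
    "poodle": "dog",
    "spaniel": "dog",
    "parrot": "bird",
}
NODES = sorted(PARENT)  # fixed bit order for the output vector


def chain(node):
    """Nodes from `node` up to the root, with `node` first."""
    out = []
    while node is not None:
        out.append(node)
        node = PARENT[node]
    return out


def encode(node):
    """Binary vector with a 1 for the node and each of its ancestors."""
    active = set(chain(node))
    return [1 if n in active else 0 for n in NODES]


def hamming(a, b):
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def path_length(u, v):
    """Number of edges between u and v via their lowest common ancestor."""
    cu, cv = chain(u), chain(v)
    for steps_up, node in enumerate(cu):
        if node in cv:
            return steps_up + cv.index(node)
    raise ValueError("nodes are not in the same tree")


near = hamming(encode("poodle"), encode("spaniel"))  # 2 differing bits
far = hamming(encode("poodle"), encode("parrot"))    # 4 differing bits
```

In this toy tree the poodle/spaniel error costs 2 and the poodle/parrot error costs 4, matching `path_length` in both cases.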

This perspective highlights the fact that there is no theoretical reason for using the path length (or number of differing bits) as a cost function. Many papers cite the intuition that the misclassification penalty between something like a chimpanzee and a gorilla should be lower than that between a chimpanzee and a slug (e.g. [15]); however, this does not mean that the cost must be precisely the path length between the two classes. The structure of the hierarchy is itself somewhat arbitrary in many cases. In a scientific taxonomy, the nodes can be arranged by morphology, or by genus and species, or by visual similarity, or even by geographic presence. In essence, the hierarchy is an intuitive way for the human expert to define and visualise the misclassification cost matrix between a large set of classes. Taking this more generally with structured output prediction, it would be entirely possible for an expert to assign some level of similarity between two vectors of classes using another metric, or even in a completely manual way (such as using survey results on perceived similarity). Other distance penalties can also be used to achieve various effects, e.g. using the Euclidean distance (the square root of the sum of squared differences, which for binary vectors equates to the square root of the path length between true and predicted nodes) to de-emphasise the penalty induced by very long path lengths.
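To make the cost-matrix view concrete, the following sketch builds two misclassification cost matrices from the same hypothetical hierarchy: one using the path length directly and one using its square root (the Euclidean distance between the binary encodings). The taxonomy and helper names are illustrative assumptions only.

```python
import math

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "animal": None,
    "primate": "animal",
    "mollusc": "animal",
    "chimpanzee": "primate",
    "gorilla": "primate",
    "slug": "mollusc",
}
LEAVES = ["chimpanzee", "gorilla", "slug"]


def chain(node):
    """Nodes from `node` up to the root, with `node` first."""
    out = []
    while node is not None:
        out.append(node)
        node = PARENT[node]
    return out


def path_length(u, v):
    """Number of edges between u and v via their lowest common ancestor."""
    cu, cv = chain(u), chain(v)
    for steps_up, node in enumerate(cu):
        if node in cv:
            return steps_up + cv.index(node)
    raise ValueError("nodes are not in the same tree")


def cost_matrix(penalty):
    """costs[true][predicted] = penalty(path length between the classes)."""
    return {
        t: {p: penalty(path_length(t, p)) for p in LEAVES}
        for t in LEAVES
    }


linear_costs = cost_matrix(float)      # path length (L1 on the encodings)
root_costs = cost_matrix(math.sqrt)    # Euclidean distance on the encodings

# How much worse is chimpanzee -> slug than chimpanzee -> gorilla?
linear_ratio = linear_costs["chimpanzee"]["slug"] / linear_costs["chimpanzee"]["gorilla"]
root_ratio = root_costs["chimpanzee"]["slug"] / root_costs["chimpanzee"]["gorilla"]
```

Here `linear_ratio` is 2.0 while `root_ratio` is about 1.41, so the square-root penalty narrows the gap between near and distant mistakes. Any other function, or a hand-specified matrix, could be substituted for `penalty` in the same way.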

5.2.3.3 Local Classifiers

A global classifier uses a specialised algorithm to predict either a binary vector as output, or a categorical output with specified misclassification costs. By contrast, a local classifier breaks the problem down into a set of binary classification problems, and then aggregates the output. While other variations exist, the most popular is to train a single binary classifier to recognise each class in the hierarchy, usually referred to as the local classifier per node approach. In the example of a scientific taxonomy, one classifier might be trained to distinguish biological from physical objects, another might distinguish plants from animals, and another might recognise a particular species of animal.
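A minimal sketch of how training targets for a local classifier per node setup might be derived is given below. It assumes one common choice, in which an example is positive for a node if its true class is that node or one of its descendants and negative otherwise; other policies for choosing negatives exist. The hierarchy and the `binary_targets` helper are hypothetical.

```python
# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "object": None,
    "biological": "object",
    "physical": "object",
    "plant": "biological",
    "animal": "biological",
    "fern": "plant",
    "poodle": "animal",
    "parrot": "animal",
    "rock": "physical",
}


def ancestors_inclusive(node):
    """Set containing `node` and every ancestor up to the root."""
    out = set()
    while node is not None:
        out.add(node)
        node = PARENT[node]
    return out


def binary_targets(labels):
    """Build one 0/1 target list per non-root node.

    An example is positive for a node when its true class is that node
    or any descendant of it (an assumed labelling policy).
    """
    targets = {}
    for node, parent in PARENT.items():
        if parent is None:
            continue  # the root is trivially positive for every example
        targets[node] = [
            1 if node in ancestors_inclusive(y) else 0 for y in labels
        ]
    return targets


labels = ["poodle", "fern", "rock", "parrot"]
targets = binary_targets(labels)
# targets["biological"] -> [1, 1, 0, 1]
# targets["animal"]     -> [1, 0, 0, 1]
# targets["poodle"]     -> [1, 0, 0, 0]
```

Each entry of `targets` can then be paired with the feature matrix to train an independent binary classifier for that node.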

A clear disadvantage of breaking the problem down in this way is that it prevents the local training algorithm from optimising directly for a misclassification cost matrix (e.g. path length difference), as each binary classifier is trained independently. The outputs of the local classifiers must then be combined in some way such that the hierarchical classification definition (that only unbroken chains of classifiers, from the predicted node up through its ancestors, are activated) is adhered to. The advantages of the approach are described in detail in Section 5.2.4 below.
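One simple way to combine independent per-node outputs into a consistent prediction is a greedy top-down traversal, sketched below. This is only one of several possible aggregation schemes and is not necessarily the one adopted in this work; the hierarchy, the scores and the 0.5 threshold are hypothetical.

```python
# Hypothetical toy hierarchy: parent -> list of children.
CHILDREN = {
    "object": ["biological", "physical"],
    "biological": ["plant", "animal"],
    "animal": ["poodle", "parrot"],
}


def top_down_predict(scores, root="object", threshold=0.5):
    """Follow the best-scoring child at each level until none is confident.

    `scores` maps node name -> output of that node's binary classifier.
    The returned path is always an unbroken chain starting at the root.
    """
    path = [root]
    node = root
    while node in CHILDREN:
        best = max(CHILDREN[node], key=lambda c: scores.get(c, 0.0))
        if scores.get(best, 0.0) < threshold:
            break  # no child is confident enough; stop at this level
        path.append(best)
        node = best
    return path


# Consistent scores: descend until the species-level classifiers are unsure.
consistent = top_down_predict({
    "biological": 0.9, "physical": 0.2,
    "plant": 0.1, "animal": 0.8,
    "poodle": 0.3, "parrot": 0.4,
})
# consistent == ["object", "biological", "animal"]

# Inconsistent scores: a confident "poodle" with unconfident ancestors
# is ignored, because activating it would break the chain.
inconsistent = top_down_predict({
    "biological": 0.2, "physical": 0.3, "poodle": 0.95,
})
# inconsistent == ["object"]
```

The second call shows the trade-off of this scheme: a strong signal deep in the tree is discarded when an ancestor classifier disagrees, so errors made near the root propagate downwards.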