The thesis is supported by the following publications:
[Hu et al. (2009)] Hu, R., Delany, S.J., Mac Namee, B.: Sampling with confidence: Using k-NN confidence measures in active learning. In: Proceedings of the UKDS Workshop at the 8th International Conference on Case-based Reasoning (ICCBR 09). (2009) 181-192
[Hu et al. (2010a)] Hu, R., Delany, S.J., Mac Namee, B.: EGAL: exploration guided active learning for TCBR. In: Case-Based Reasoning Research and Development, Volume 6176 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg. (2010) 156-170
ing the frontier of uncertainty space. At the AISTATS 2010 Workshop on Active Learning and Experimental Design (May 16, 2010; Sardinia, Italy).
[Hu et al. (2010c)] Hu, R., Mac Namee, B., Delany, S.J.: Off to a good start: Using clustering to select the initial training set in active learning. In: Proceedings of the Twenty-Third International Florida Artificial Intelligence Research Society Conference (FLAIRS 2010), AAAI. (2010) 26-31 (Best Student Paper Award)
Additional work related to the contribution is as follows:
[Zhang et al. (2008)] Zhang, Q., Hu, R., Mac Namee, B., Delany, S.J.: Back to the future: Knowledge light case base cookery. In Schaaf, M., ed.: Proceedings of Workshop on Computer Cooking Contest, ECCBR’08. (2008) 239-248 (Champion of the 1st Computer Cooking Contest)
[Hu et al. (2008)] Hu, R., Mac Namee, B., Delany, S.J.: Sweetening the dataset: Using active learning to label unlabelled datasets. In Proceedings of the 19th Irish Conference on Artificial Intelligence and Cognitive Science. (2008) 53-62
[Lindstrom et al. (2010)] Lindstrom, P., Hu, R., Delany, S.J., Mac Namee, B.: SVM based active learning with exploration. At the AISTATS 2010 Workshop on Active Learning and Experimental Design (May 16, 2010; Sardinia, Italy).
[Mac Namee et al. (2010)] Mac Namee, B., Hu, R., Delany, S.J.: Inside the selection box: Visualising active learning selection strategies. At the NIPS 2010 Workshop on Challenges of Data Visualization (December 11, 2010; Whistler, BC).
In summary, the contributions of this work, the corresponding chapters of this thesis, and the associated publications are shown in Table 1.1.
Table 1.1: Contributions, corresponding chapters and publications.
Contribution                               Chapter/Section   Publication
Active learning literature review          Chapter 3         -
Confidence-based selection strategy        Chapter 5         Hu et al. (2009)
Exploration guided selection strategy      Chapter 6         Hu et al. (2010a)
Hybrid selection strategy                  Section 9.2       Hu et al. (2010b)
Use of visualisation in active learning    Section 6.4       Mac Namee et al. (2010)
New method to seed active learning         Chapter 7         Hu et al. (2010c)
Understanding of the reusability problem   Chapter 8         -
Chapter 2
Supervised Learning and Text Classification
Machine learning is an inherently multidisciplinary field that draws on concepts and results from many areas, including probability and statistics, artificial intelligence, philosophy, psychology, information theory and cognitive science. This chapter introduces notation and provides a brief introduction to the formalisms most pertinent to the work presented in this thesis. More specifically, Section 2.1 sketches the basic principles and concepts of supervised learning, with a focus on classification learning. Section 2.2 describes text classification. Section 2.3 provides a brief overview of common approaches to text classification. Finally, methods for evaluating the performance of text classification systems are presented.
2.1 Basic Concepts
In this section, a formal definition of the elements involved in supervised machine learning is given. We have revised some of the definitions and adapted them to the purpose of this thesis, with the aim of clarifying the classification task and setting a base terminology for further discussion. The notation and definitions presented in this section are therefore a consolidation of the formalisms most commonly used in the supervised learning literature.
The goal of supervised learning is to find a function g : X → Y which maps an example x ∈ X to its output value y ∈ Y, as shown in Definition 1.
Definition 1. (Example, Output) An example x ∈ X represents an input object in the data. X is the set of all possible examples in the input space, where X = {x_1, ..., x_i, ..., x_N}. The output y ∈ Y represents an output value in the output space, and Y is the set of all possible values. ♣
Two types of learning problems are commonly distinguished according to the output values Y: regression learning, where Y = ℝ, and classification learning, where Y = C (C is a set of classes and C = {c_1, ..., c_j, ..., c_M}). Classification is the process of outputting a value that assigns an example to a class. The focus of this thesis is on classification learning. The output values y ∈ Y in classification learning are called classes or labels. When applied to text classification tasks, an example is a text document, such as a recipe document. The classes could be the types of recipes, for example Y = {starter, main course, dessert}.
When applying a machine learning approach to a specific problem, selecting an appropriate function is a key step (Mitchell, 1997). In classification learning this function is called a model or a classifier, as defined in Definition 2. A binary classifier assigns a positive (+1) or negative (−1) class to each example; a multi-class classifier assigns one class from a set of classes C (|C| > 2) to an example; and a multi-label classifier assigns a subset of the set of classes to an example, that is, it may assign more than one class to an example.
Definition 2. (Classifier) A classifier is a function Ψ : X → C that maps examples to their assigned classes. For example, for a binary classifier C = {−1, +1}, so Ψ : X → {−1, +1}. ♣
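The three kinds of classifier distinguished above can be sketched in Python as follows. This is only a minimal illustration, not a method used in this thesis: the keyword rules, the function names, and the use of raw text strings as examples are all assumptions made for the sketch, and the class set reuses the recipe example.

```python
from typing import FrozenSet

# Hypothetical class set C for the recipe example (|C| > 2).
CLASSES = ("starter", "main course", "dessert")


def binary_classifier(x: str) -> int:
    """Psi: X -> {-1, +1}. Toy rule: +1 if the recipe mentions sugar."""
    return +1 if "sugar" in x.lower() else -1


def multi_class_classifier(x: str) -> str:
    """Psi: X -> C. Returns exactly one class from CLASSES."""
    text = x.lower()
    if "sugar" in text:
        return "dessert"
    if "soup" in text:
        return "starter"
    return "main course"


def multi_label_classifier(x: str) -> FrozenSet[str]:
    """Maps an example to a subset of C, so several classes may be assigned."""
    text = x.lower()
    labels = set()
    if "sugar" in text:
        labels.add("dessert")
    if "soup" in text:
        labels.add("starter")
    if "beef" in text:
        labels.add("main course")
    return frozenset(labels)
```

The only difference between the three functions is the shape of the output: a value in {−1, +1}, a single element of C, or a subset of C. In practice the hand-written rules would be replaced by a function learned from labelled examples.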
Usually, an example is characterised by a feature vector, that is, a vector of real values with one dimension for every feature in the feature space: x⃗ ∈ ℝ^k and x⃗ = (f_1(x), ..., f_k(x)). A feature is any item that a classifier can treat as a characteristic of an example. It is important to identify which items will serve as features for an example. More details about feature representations of text documents can be found in Section 2.2.1.
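The mapping from an example to its feature vector can be sketched as below. The feature functions here (two word counts and a document length) are hypothetical choices made purely for illustration, as are the names count_word, FEATURES and to_feature_vector; the representations actually used for text are those described in Section 2.2.1.

```python
from typing import Callable, List, Sequence

FeatureFn = Callable[[str], float]


def count_word(word: str) -> FeatureFn:
    """Build a feature function f_i that counts occurrences of `word`."""
    def f(x: str) -> float:
        return float(x.lower().split().count(word))
    return f


def document_length(x: str) -> float:
    """Feature function returning the number of whitespace-separated tokens."""
    return float(len(x.split()))


# Hypothetical feature space with k = 3 features.
FEATURES: Sequence[FeatureFn] = (
    count_word("sugar"),
    count_word("soup"),
    document_length,
)


def to_feature_vector(x: str, features: Sequence[FeatureFn] = FEATURES) -> List[float]:
    """Return (f_1(x), ..., f_k(x)) as a list of k real values."""
    return [f(x) for f in features]
```

For instance, to_feature_vector("Sugar sugar soup") evaluates each feature function on the document in turn and yields a vector with one entry per feature, here of dimension k = 3.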