
Key Findings

Having reviewed the literature on previous work in detail, we summarise our key findings in this section.

Previous work has mostly used active learning to build high-performance classification systems from a limited number of labelled examples. However, active learning can also be very helpful in labelling datasets. We are particularly interested in using it to create large collections of labelled examples from unlabelled collections.

Uncertainty sampling is one of the most commonly used selection strategies. Typically, it selects the most informative examples based on the direct outputs of classifiers. Instead of using the direct output of the k-NN classifier, it would be better to use a confidence-based selection strategy, which applies k-NN based confidence measures to each prediction and chooses the examples with the least confidence for labelling.

Most selection strategies use exploitation techniques, and other approaches balance exploitation with exploration. Little work has been done on purely exploration-based selection strategies. It is therefore worthwhile to investigate an exploration-only selection strategy, which does not need any classifier.

Active learning is an interactive process that requires interaction with the oracle. However, it can be hard for human experts to gain a deep insight into the selection strategy. Visualisation techniques can be very useful for visualising the active learning process, yet little work has demonstrated how to use them to aid understanding of this process, which is what we intend to do.

The active learning process needs a small labelled set as a seed. In most active learning applications, the initial training examples are selected at random. Previous work has shown that better performance can be achieved by selecting the initial training set using clustering techniques. However, the clustering techniques used are non-deterministic, which might not be as good as advanced deterministic clustering techniques. Better and more reliable performance could be achieved by using deterministic clustering algorithms to construct the initial training set.

More broadly, there is a reusability problem in using active learning for labelling. It would be useful to compare the reusability of popular active learning methods for text classification, and to identify the best classifier to use as the selector in the active learning selection strategy and the best classifier to use as the consumer.

3.10 Conclusions

In this chapter, we reviewed previous research on active learning. In contrast to passive learning, active learning is a machine learning technique that selects examples for manual labelling in an informed way instead of picking them at random. Numerous applications have demonstrated its usefulness. The three main forms of active learning are (i) active learning with membership queries, (ii) stream-based active learning, and (iii) pool-based active learning. Among them, pool-based active learning has been widely used in text classification and is the form considered in this thesis.

Active learning is typically used to build high-performance classifiers, whereas we focus on using it to generate labelled datasets. We discussed related work on three major problems in applying active learning: how to select an initial training set to seed the active learning process; how to identify the most informative examples for which to query true labels from the oracle; and when to stop the active learning process.

Selection strategies determine how the most informative examples are selected. We categorised selection strategies into three groups: exploitation-based selection strategies, exploration-based selection strategies, and selection strategies that balance exploration and exploitation. Exploitation-based selection strategies concentrate on the examples closest to the classification boundary and thus can refine the classification model efficiently. However, it can be difficult to estimate the classification boundary at the early learning stage, when there are very few initial training examples. Exploration-based selection strategies favour dense or diverse examples so that they can explore the entire feature space. Selection strategies that balance exploration and exploitation select examples by considering multiple criteria, such as uncertainty, density and diversity, and have shown superior performance over strategies that consider exploitation or exploration alone. Learning curves are often used to evaluate the performance of active learning systems.

This chapter also discussed existing research on the use of visualisation techniques in active learning and on reusability problems in active learning. Visualisation techniques can be used to help human experts in labelling and to help users understand the complex active learning process. To build a labelled dataset that can be used to train different types of classifiers, reusability should be considered in active learning.

Other fields related to active learning research, including semi-supervised methods, ensembles, cost-sensitive active learning, and methods for multi-class and multi-label problems, were discussed before we presented our key findings and the gaps we intend to fill at the end of this chapter.

The next chapter presents the design of our active learning based labelling system. A preliminary validation of the framework on a recipe dataset shows how this system helps in generating labelled datasets. The chapter also describes the experimental methodology, including the datasets and evaluation measures used in this work.

Chapter 4

System Design

This chapter elaborates on the design of our Active Learning based Labelling system (ALL) (Hu et al., 2008), which uses ideas from pool-based active learning to investigate active learning approaches to labelling large collections of textual examples. The dual goals of the system are the creation of high-quality labelled datasets and the minimisation of manual labelling effort. The chapter starts with a description of the design of the system framework. It continues with a discussion of the datasets used in later experiments and of the experimental methodology, including the evaluation measures.

4.1 Framework Design of ALL

As discussed in Section 3.1, active learning has been widely used to create classification systems in the absence of large numbers of labelled examples. However, active learning can also be used to create large collections of labelled examples from unlabelled collections. We feel that before a classification system can be valuable, advances need to be made on the example labelling problem. Therefore, we focus our efforts on building high-quality labelled training datasets, which we regard as a key sub-problem of text classification systems. These collections can then be used for disparate purposes beyond classification. ALL is used to generate labelled textual datasets.

The core mechanism of the ALL system is based on pool-based active learning, as discussed in Section 3.1.1. A flow diagram of ALL is shown in Figure 4.1. The system starts with a large pool of unlabelled examples, from which a small number of examples are selected and manually labelled by an oracle to form the training set. This initial training set can be selected by random sampling from the unlabelled pool or by more advanced methods, which are described in Chapter 7. Given an initial training set, a selection strategy can be built with or without the use of classifiers. The selection strategy assigns a ranking score, as an informativeness measure, to each of the unlabelled examples in the pool. The most informative example is then selected and presented to the oracle for labelling. After the example is labelled, it is added to the labelled set and the pool is re-ranked. In this framework, the batch size is set to one, so at each active learning iteration only one example from the unlabelled pool is selected for labelling and its label is applied. The process of selecting examples from the pool and re-ranking the pool continues until a label budget is exhausted.
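The loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the ALL implementation: the function name `run_labelling_loop`, its parameters, and the toy `nearest_labelled_distance` score are all hypothetical, and we assume here that the label budget counts only the queries made after the seed set has been labelled.

```python
from typing import Callable, Dict, Hashable, List

Label = str


def run_labelling_loop(
    pool: List[Hashable],
    seed: List[Hashable],
    oracle: Callable[[Hashable], Label],
    score: Callable[[Hashable, Dict[Hashable, Label]], float],
    budget: int,
) -> Dict[Hashable, Label]:
    """Pool-based loop with batch size one (illustrative sketch only)."""
    # The oracle labels the seed examples to form the initial training set.
    labelled = {x: oracle(x) for x in seed}
    unlabelled = [x for x in pool if x not in labelled]
    queries = 0  # assumption: the budget excludes the seed labels
    while unlabelled and queries < budget:
        # Re-rank the whole pool against the current labelled set and
        # take the single highest-scoring (most informative) example.
        best = max(unlabelled, key=lambda x: score(x, labelled))
        labelled[best] = oracle(best)
        unlabelled.remove(best)
        queries += 1
    return labelled


# Toy usage: examples are integers and the oracle thresholds them at 5.
def toy_oracle(x: int) -> Label:
    return "pos" if x >= 5 else "neg"


# Hypothetical classifier-free score: distance to the nearest labelled example.
def nearest_labelled_distance(x: int, labelled: Dict[int, Label]) -> float:
    return min(abs(x - y) for y in labelled)


result = run_labelling_loop(
    pool=[1, 4, 5, 6, 9],
    seed=[1],
    oracle=toy_oracle,
    score=nearest_labelled_distance,
    budget=2,
)
```

Because the score function receives the current labelled set on every call, the same loop accommodates strategies that train a classifier at each iteration as well as strategies that use no classifier at all.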

In order to monitor the performance of the proposed system, and to compare it with other approaches, after each labelling a classifier is built from the current labelled set, L, and classifications are made for every example remaining in the unlabelled pool, U. These classifications are compared with the actual labels in each dataset and the accuracy of this labelling is recorded. Accuracy is calculated as Accuracy = C/N, where N is the size of the union of the labelled and unlabelled sets and C is the number of correctly labelled examples. Both manually and automatically labelled examples are included in this calculation to measure labelling accuracy over the entire collection, and to ensure that the measure remains stable as the process continues.
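As a small worked illustration of this measure, the following Python sketch computes Accuracy = C/N over the whole collection. The function name `labelling_accuracy` and the dictionary-based representation are hypothetical choices made for this example, and the sketch assumes that the manually labelled set and the automatically labelled pool are disjoint.

```python
from typing import Dict, Hashable

Label = str


def labelling_accuracy(
    manual: Dict[Hashable, Label],
    predicted: Dict[Hashable, Label],
    truth: Dict[Hashable, Label],
) -> float:
    """Accuracy = C / N over the labelled set L and the unlabelled pool U.

    manual    -- labels supplied by the oracle for the examples in L
    predicted -- classifier labels for the examples remaining in U
    truth     -- actual labels for every example in the dataset
    Assumes that `manual` and `predicted` share no keys.
    """
    n = len(manual) + len(predicted)  # N = |L union U|
    # C counts correct labels from both sources.
    correct = sum(1 for x, y in manual.items() if truth[x] == y)
    correct += sum(1 for x, y in predicted.items() if truth[x] == y)
    return correct / n


# Toy usage: two oracle labels, two classifier labels, one of them wrong.
acc = labelling_accuracy(
    manual={"a": "pos", "b": "neg"},
    predicted={"c": "pos", "d": "pos"},
    truth={"a": "pos", "b": "neg", "c": "pos", "d": "neg"},
)
```

In the toy example, three of the four labels match the actual labels, so the accuracy is 3/4. Since N stays fixed while examples move from U to L, the denominator does not change from one iteration to the next.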