Inspired by the comparative analysis presented in the last chapter, we design a novel stacked sub-word tagging model for joint word segmentation and POS tagging. We define a sub-word structure which maximizes the agreement of multiple segmentations provided by heterogeneous segmenters. We show that this sub-word structure can exploit the complementary strengths of different systems designed with different views. Moreover, POS tagging can be efficiently and effectively resolved over sub-word sequences. Exploiting diversity among different systems plays a central role in the success of our new model. By observing two essential characteristics of heterogeneous annotation data, we propose to use our new model to explore the diversity between different labeled corpora. A new sub-word tagging model together with corpus conversion is implemented and evaluated. Experiments show that our approach is superior to the existing approaches reported in the literature.
Chapter 4
Harvesting String Knowledge for Word Segmentation
This chapter investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution for incorporating features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words which appear more than once inside a document. The novel features yield relative error reductions of 13.8% in F-score and 15.4% in the recall of OOV words. Our work can be viewed as a good example of leveraging feature induction to bridge the gap between supervised language processing and unsupervised language acquisition.
This chapter is joint work with Jia Xu, originally published in [Sun and Xu, 2011].
4.1 Background
4.1.1 The Problem: Combining Supervised and Unsupervised NLP
Machine learning has become an indispensable tool for NLP researchers. Highly developed supervised training techniques have led to state-of-the-art performance for many NLP tasks. Unfortunately, given the limited availability of labeled data, and the non-trivial cost of human annotation, progress on supervised learning often yields diminishing returns. Unsupervised learning, on the other hand, is not bound by the same data resource limits. While labeled data is expensive to obtain, unlabeled data is essentially free in comparison. It exists simply as raw text from sources such as the Internet. The amount of unlabeled linguistic data available to us is much larger and growing much faster than the amount of labeled data. However, unsupervised learning is significantly harder than supervised learning and, although intriguing, has not been able to produce consistently successful results for most NLP tasks.
It is becoming increasingly important to leverage both types of data resources, labeled and unlabeled, to achieve the best performance in challenging NLP problems. Many semi-supervised learning methods, e.g. transductive SVMs and graph-based methods, were originally developed for binary classification problems. NLP problems often pose new challenges to these techniques, involving more complex structure that can violate many of the underlying assumptions. On the other hand, a number of easy-to-implement methods have been proposed, e.g. self-training and co-training, but their effectiveness on NLP tasks is not always clear. For example, bootstrapping methods typically assume a very small amount of labeled data and have not always been shown to improve state-of-the-art performance when a large amount of labeled data is available, as in POS tagging [Clark et al., 2003].
We believe that it is important to explicitly investigate why and how auxiliary unlabeled data can truly improve NLP tasks. The following aspects motivate us to search for a robust semi-supervised solution that can help high-resource tasks.
• Flexibility: We favor solutions that are easy to apply to problems with different structures (e.g. word sequences, syntactic trees or forests, N-best lists).
• Linguistic knowledge: We favor the idea of exploiting NLP-specific background knowledge to aid semi-supervised learning.
• Scalability: NLP data sets are often large, even for non-English tasks. We favor methods that can be applied to large-scale data sets (both labeled and unlabeled).
• Effectiveness: We still expect gains even when high-performance supervised systems can be built. For example, we hope that semi-supervised learning can improve a supervised system that is already more than 95% accurate.
4.1.2 The Method: Feature Induction
In this chapter, we focus on a general framework for semi-supervised NLP, i.e. feature induction. Feature induction is a simple yet effective semi-supervised learning method for NLP. The basic strategy for taking advantage of unlabeled data is to derive informative features from large-scale unlabeled data and use them in discriminative supervised models. This “feature-engineering” approach has been successfully applied to named entity recognition (NER) [Lin and Wu, 2009; Miller et al., 2004], dependency parsing [Koo et al., 2008], and query classification [Lin and Wu, 2009]. Miller et al. [2004] and Koo et al. [2008] demonstrated the effectiveness of using word clusters as features in discriminative learning. Following their ideas, Turian et al. [2010] compared different word clustering algorithms and evaluated their impact on both NER and text chunking. Moreover, Lin and Wu [2009] presented a simple and scalable algorithm for clustering tens of millions of phrases and used the resulting phrase clusters as features to enhance two applications: NER and query classification. Their experimental results show that phrase-based clusters offer significant improvements for NLP applications.
One of the advantages of the feature induction approach is that the learning algorithm is decoupled from the process of generating features. In other words, the construction of unlabeled data features is separated from training. This decoupling gives us the flexibility to use any algorithm to create different linguistic features that might be useful. Linguistic knowledge can explicitly motivate the design of good features based on unlabeled data. Moreover, models trained with features from unlabeled data are more compact and easier to interpret than more complex learning techniques, such as transductive SVMs. Feature induction increases the complexity of the original discriminative model only by adding new features, which normally form a very small set. This property makes the method efficient and scalable to most discriminative NLP systems. Finally, when good and task-related linguistic features are derived, they can reasonably be expected to provide useful clues for disambiguation.
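The decoupling described above can be illustrated with a minimal sketch. It is not the feature set used in this chapter: the function names, the bucket thresholds, and the choice of character bigram counts as the induced statistic are all assumptions made for illustration. The point is only that the statistics are collected from raw text in a separate step, and the supervised feature extractor then consults them as a lookup table.

```python
from collections import Counter


def ngram_counts(unlabeled_texts, max_n=2):
    """Step 1 (no labels needed): count character n-grams in raw text."""
    counts = Counter()
    for text in unlabeled_texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def bucket(count):
    """Map a raw count to a coarse bucket so that few new features are added.

    The thresholds are arbitrary and chosen only for this sketch.
    """
    if count == 0:
        return "zero"
    if count < 3:
        return "low"
    return "high"


def base_features(chars, i):
    """Ordinary supervised features: the current character and its neighbors."""
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [f"c0={chars[i]}", f"c-1={prev_c}", f"c+1={next_c}"]


def induced_features(chars, i, counts):
    """Step 2: features induced from unlabeled data (bucketed bigram counts)."""
    feats = []
    if i > 0:
        feats.append("freq(c-1c0)=" + bucket(counts[chars[i - 1] + chars[i]]))
    if i + 1 < len(chars):
        feats.append("freq(c0c+1)=" + bucket(counts[chars[i] + chars[i + 1]]))
    return feats


def extract(chars, i, counts):
    """Feature vector handed to any discriminative learner for position i."""
    return base_features(chars, i) + induced_features(chars, i, counts)
```

For instance, after `counts = ngram_counts(["abcab", "abd"])`, a call to `extract("abc", 1, counts)` returns the three base features followed by two induced ones. Replacing `ngram_counts` with another source of statistics changes neither the learner nor the training procedure, which is the flexibility the decoupling provides.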