2.4 Statistical Natural Language Processing
2.4.1 Overview
In natural language processing (NLP), the goal is to build a computational system that maps raw input strings of text to output linguistic structures that encode the meaning of the text [5]. For example, as illustrated in Fig. 2.10, given the input text
Stanford awarded the highly acclaimed Mr Smith with an award,
the goal might be to label words with their parts of speech (POS), e.g. nouns (NN*) and verbs (VB*), or to extract the dependency relationships between words, shown as arrows from head words to their dependents, or to extract the semantic components of the sentence (who does what to whom).
Most modern approaches model this as a machine learning problem in which we try to predict which label y ∈ Y (e.g. a POS tag) each input x ∈ X (e.g. a word) belongs to. To do this, one defines a model that can learn how inputs relate to labels. Generally speaking, jointly modelling inputs and labels is referred to as generative modelling: such models can map both inputs to labels and labels to inputs, which allows them to generate or “hallucinate” new data, hence the name. Discriminative models, on the other hand, learn a conditional mapping from inputs to labels that assigns a real-valued score or probability to each label y that a given input x might belong to; that is, they discriminate between the different labels, hence the name. Given such a trained model and a new input x, the prediction problem reduces to solving the “decoding” or “argmax” problem of finding the best label ŷ for x, e.g. ŷ = arg max_y Pθ(y | x) for a probabilistic model.
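The decoding step can be illustrated with a minimal sketch. It assumes, purely for illustration, that a trained discriminative model has already produced one real-valued score per label and that these scores are normalised with a softmax; the function names and the example scores are hypothetical and not taken from any particular system.

```python
import math


def softmax(scores):
    """Turn real-valued label scores into a probability distribution."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def decode(scores):
    """Solve the argmax problem: return the best label and P(y | x)."""
    probs = softmax(scores)
    y_hat = max(probs, key=probs.get)
    return y_hat, probs


# Hypothetical scores a model might assign to POS tags for one input word.
example_scores = {"NN": 2.0, "VB": 0.5, "JJ": -1.0}
y_hat, probs = decode(example_scores)
print(y_hat)  # NN
```

Because the softmax is monotonic in the scores, taking the argmax over raw scores would select the same label; the normalisation only matters when calibrated probabilities are required.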
x ∈ X      the       cat       sits      on        the       mat
r(x)       r(x_1)    r(x_2)    r(x_3)    r(x_4)    r(x_5)    r(x_6)
y ∈ Y      B-NP      I-NP      B-VP      O         B-NP      I-NP
NE Tags    [the cat]_NP        [sits]_VP [on]_O    [the mat]_NP
Table 2.1: Example NLP syntactic chunking task for the sentence “the cat sits on the mat”. X represents the words in the input space, Y represents labels in the output space. r(x) is a feature representation for the input text x and the bottom row represents the output named entity tags in a more standard form.
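The relationship between the per-word BIO labels and the bracketed bottom row of Table 2.1 can be made explicit with a short sketch. The helper names below are hypothetical, and the handling of a stray I- tag (treated as the start of a new chunk) is an assumption rather than part of any standard.

```python
def bio_to_chunks(words, tags):
    """Group words into (label, [words]) chunks according to their BIO tags."""
    chunks = []
    for word, tag in zip(words, tags):
        if tag == "O":
            # Words outside any constituent form their own single-word chunk.
            chunks.append(("O", [word]))
        elif tag.startswith("B-") or not chunks or chunks[-1][0] != tag[2:]:
            # B-XX starts a new chunk; a stray I-XX is treated the same way.
            chunks.append((tag[2:], [word]))
        else:
            # I-XX continues the chunk of type XX that is currently open.
            chunks[-1][1].append(word)
    return chunks


def format_chunks(chunks):
    """Render chunks in the bracketed form used in the bottom row of the table."""
    return " ".join(f"[{' '.join(ws)}]_{label}" for label, ws in chunks)


words = "the cat sits on the mat".split()
tags = ["B-NP", "I-NP", "B-VP", "O", "B-NP", "I-NP"]
print(format_chunks(bio_to_chunks(words, tags)))
# [the cat]_NP [sits]_VP [on]_O [the mat]_NP
```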
In this dissertation we focus only on discriminative models in which Pθ(y | x) is parameterised in terms of a feature representation r(x),² which captures the salient, discriminative aspects of the input domain X with respect to the labels y. In this class of models, most of the novelty in new approaches relates to how the feature functions r(x) are defined.
As a concrete example, consider the task of syntactic chunking, also called “shallow parsing” [31]: given an input string, e.g.
“the cat sits on the mat”,
the chunking problem consists of labelling segments of a sentence with syntactic constituents such as noun or verb phrases (NPs or VPs). Each word is assigned exactly one tag, often encoded using the BIO encoding.³ We represent the input text as a sequence of words x_i ∈ x, and each word’s corresponding label is represented by y_i ∈ y (see Table 2.1). Given a feature-generating function r(x_i) and a set of labelled training pairs (x_i, y_i) ∈ X × Y, the task then reduces to learning a suitable mapping sθ : r(X) → Y by tuning the model parameters θ until the model predicts the training labels accurately.
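As an illustration of what “tuning the model parameters θ” can look like, the following is a minimal sketch that uses a multiclass perceptron as a stand-in learner. The perceptron, the two indicator features and all function names are assumptions chosen for brevity; they are not the models developed in this dissertation.

```python
from collections import defaultdict


def r(words, i):
    """Hypothetical feature function: sparse binary indicators for word i."""
    prev = words[i - 1] if i > 0 else "<s>"
    return [f"word={words[i]}", f"prev={prev}"]


def score(theta, feats, label):
    """Linear score of a label: sum of the weights of the active features."""
    return sum(theta[(f, label)] for f in feats)


def predict(theta, feats, labels):
    """Return the highest-scoring label (the argmax over the label set)."""
    return max(labels, key=lambda y: score(theta, feats, y))


def train(data, labels, epochs=10):
    """Perceptron updates: adjust theta until the training labels are predicted."""
    theta = defaultdict(float)
    for _ in range(epochs):
        mistakes = 0
        for words, tags in data:
            for i, gold in enumerate(tags):
                feats = r(words, i)
                guess = predict(theta, feats, labels)
                if guess != gold:
                    mistakes += 1
                    for f in feats:
                        theta[(f, gold)] += 1.0   # reward the correct label
                        theta[(f, guess)] -= 1.0  # penalise the wrong guess
        if mistakes == 0:  # every training label is already predicted correctly
            break
    return theta


words = "the cat sits on the mat".split()
tags = ["B-NP", "I-NP", "B-VP", "O", "B-NP", "I-NP"]
labels = sorted(set(tags))
theta = train([(words, tags)], labels)
predicted = [predict(theta, r(words, i), labels) for i in range(len(words))]
print(predicted == tags)  # True
```

Fitting a single training sentence says nothing about generalisation; the sketch only shows the loop of predicting, comparing against the training label, and updating θ.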
The task of designing good feature-generating functions r(x) is referred to as feature engineering. Typical feature functions for this task produce an output vector whose individual components are set to 0 or 1 depending on the presence of certain binary indicators, for example “does the word contain a capital letter” or “is the word contained in a gazetteer”. This approach works well, but it is labour-intensive, and it is generally hard to come up with features that generalise well across many different tasks; such feature functions are therefore often useful for only one specific task. The problem is compounded when the goal is to design features that capture important aspects of the task across different languages, i.e. cross-lingual prediction.
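A hand-engineered feature function of this kind might look as follows. This is a minimal sketch: the first two indicators are the ones mentioned in the text, whereas the gazetteer contents and the remaining two indicators (suffix and digit tests) are hypothetical additions for illustration.

```python
# Hypothetical gazetteer; in practice this would be a large curated list.
GAZETTEER = {"stanford", "smith"}


def r(word):
    """Map a word to a vector of 0/1 indicator features."""
    return [
        1 if any(c.isupper() for c in word) else 0,  # contains a capital letter
        1 if word.lower() in GAZETTEER else 0,       # contained in the gazetteer
        1 if word.endswith("ed") else 0,             # assumed suffix indicator
        1 if any(c.isdigit() for c in word) else 0,  # assumed digit indicator
    ]


print(r("Stanford"))  # [1, 1, 0, 0]
print(r("awarded"))   # [0, 0, 1, 0]
```

Each indicator encodes task- and language-specific knowledge (capitalisation conventions, an English suffix, an English gazetteer), which is why such features tend not to transfer across tasks or languages.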
In this dissertation, our goal is to automatically and efficiently learn distributed representations of natural language that can generalise across a variety of NLP tasks and languages. There is a rich history of distributional methods that can be used for learning useful representations, and in the next sections we will discuss the most well-known methods, namely Brown clustering and latent semantic analysis. In the rest of this dissertation, however, we approach this problem by building upon advances in the field of neural language models, which are discussed in detail in Chapter 3.
² This is usually denoted as φ(x), but for consistency in this dissertation we refer to the feature representation of object x as r(x) to emphasise that it is a function, or simply r_w to refer to the result of applying r to x.