5.4 Experiments with Simple NPs
5.4.4 Supervised Models
Supervised models typically outperform unsupervised models for most NLP tasks. For NP bracketing, the small quantity of gold-standard data has meant that few supervised models have been implemented, and those that have been implemented performed poorly. With our new, significantly larger data set covering the Penn Treebank, we have built the first large-scale supervised NP bracketer.
We use the MegaM Maximum Entropy classifier (Daumé III, 2004), which, as we described in Section 2.6, allows diverse and overlapping features to be incorporated in a principled manner. We also discretise non-binary features using Hawker’s (2007) implementation of Fayyad and Irani’s (1993) supervised entropy-based discretisation algorithm.
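To make the discretisation step concrete, the following is a minimal sketch of recursive entropy-based splitting with the MDL stopping criterion, written from the general description of Fayyad and Irani’s method. It is not Hawker’s implementation; the function names and the toy data are our own assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mdlp_cut_points(values, labels):
    """Return sorted cut points for one numeric feature.

    Illustrative sketch only: recursively pick the cut that minimises
    class entropy, and keep it if it passes the MDL criterion.
    """
    pairs = sorted(zip(values, labels))
    cuts = []
    _split(pairs, cuts)
    return sorted(cuts)


def _split(pairs, cuts):
    n = len(pairs)
    labels = [y for _, y in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a cut is only possible between distinct values
        left, right = labels[:i], labels[i:]
        cond = (i / n) * entropy(left) + ((n - i) / n) * entropy(right)
        if best is None or cond < best[0]:
            best = (cond, i)
    if best is None:
        return
    cond, i = best
    left, right = labels[:i], labels[i:]
    gain = base - cond
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * base - k1 * entropy(left) - k2 * entropy(right)
    )
    # MDL criterion: accept the cut only if the gain exceeds this threshold.
    if gain > (math.log2(n - 1) + delta) / n:
        cuts.append((pairs[i - 1][0] + pairs[i][0]) / 2)
        _split(pairs[:i], cuts)
        _split(pairs[i:], cuts)


# Toy example (hypothetical scores): low values bracket left, high values right.
scores = [1, 2, 3, 4, 10, 11, 12, 13]
gold = ["L"] * 4 + ["R"] * 4
print(mdlp_cut_points(scores, gold))
```

Each resulting interval can then be treated as a binary feature, which is the form the classifier receives.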
The data set is split into training, development and test sets, with 4451, 559 and 559 NPs respectively. Our initial features use counts from Google, Web 1T and the snippets. We use the adjacency and dependency models with counts from Google and Web 1T, and all three association measures. The n-gram variations in Table 5.8 for the three count sources are also used, but only with the raw count, because the counts are often too small for the other measures to be effective. For each of these, there is a feature for the left and right association measure score, as well as a binary feature representing the left or right vote. If the left and right measures are equal, then neither vote feature is active. This first supervised model has 947 features in total.
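The score and vote features described above can be sketched as follows. This is only an illustration under our own assumptions: the function name, the feature naming scheme and the example scores are hypothetical, and the real-valued scores would still pass through discretisation before reaching the classifier.

```python
def association_features(name, left_score, right_score):
    """Build the features for one count source / model / measure combination.

    Returns the two score features plus at most one binary vote feature.
    When the scores tie, neither vote feature is active.
    """
    feats = {
        f"{name}:left_score": left_score,
        f"{name}:right_score": right_score,
    }
    if left_score > right_score:
        feats[f"{name}:vote=left"] = 1.0
    elif right_score > left_score:
        feats[f"{name}:vote=right"] = 1.0
    return feats


# Hypothetical scores for a single NP.
print(association_features("web1t:adjacency:raw", 1200.0, 85.0))
print(association_features("google:dependency:raw", 40.0, 40.0))
```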
The results on our Penn Treebank development set are shown in Table 5.10, compared to an unsupervised adjacency model, and the unsupervised voting system from Section 5.4.3. As we described there, calling the latter model unsupervised is a misnomer, as the set of voters needs to be optimised on training data. With the larger Penn Treebank corpus available, we can now “train” this unsupervised voting model on the training set, rather than on the test set itself. This avoids over-estimating its performance figures.
The supervised model outperforms the unsupervised voting model by 0.6%, even though both models base their decisions on the same information. This improvement comes from the supervised model’s ability to weight the individual contributions of all the unsupervised counts from Google and the Web 1T corpus.
88 Chapter 5: Noun Phrase Bracketing
Model                              F-score (%)
Unsupervised, Web 1T adjacency     82.5
Unsupervised, voting               89.6
Supervised model                   90.2
Table 5.10: Comparing unsupervised approaches to a supervised model
We can also test on Lauer’s data set using the supervised model trained on Penn Treebank data. The result is an 82.4% accuracy figure, which is higher than our unsupervised dependency model and Lauer’s. However, it is much lower than Nakov and Hearst’s (2005a) best result and our own voting scheme. This suggests that the voting schemes, by training on their own test data, have over-estimated their performance by about 9%.
Additional Features
One of the main advantages of using a Maximum Entropy classifier is that we can easily incorporate a wide range of features in the model. We now add lexical features for all unigrams, bigrams and the trigram within the NP. All of these features are labelled with the position of the n-gram within the NP.
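As a rough sketch of these lexical features, the function below enumerates every n-gram in the NP and labels it with its order and starting position. The feature string format is an assumption made for illustration, not the format used in our experiments.

```python
def lexical_ngram_features(np_words):
    """Return position-labelled unigram, bigram and trigram features for an NP."""
    feats = []
    length = len(np_words)
    for n in range(1, length + 1):
        for start in range(length - n + 1):
            gram = "_".join(np_words[start:start + n])
            # The label records the n-gram order and where it starts in the NP.
            feats.append(f"{n}gram@{start}={gram}")
    return feats


for feat in lexical_ngram_features(["lung", "cancer", "deaths"]):
    print(feat)
```

For a three-word NP this gives three unigrams, two bigrams and one trigram.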
Since we are bracketing NPs in situ, rather than stand-alone NPs as performed by Lauer, the context around the NP can be exploited as well. To do this we added bag-of-word features for all words in the surrounding sentence, as well as specific features for a two-word window around the NP. For the context sentence, there are features for words before the NP, after the NP, and either before or after the NP. As an example, when bracketing lung cancer deaths, in the sentence:
The number of lung cancer deaths has grown recently.
the context sentence bag-of-words features would be The, number, of, has, grown, recently. And the context window features would be:
word−2 = number word−1 = of
word+1 = has word+2 = grown
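The context features for this example can be reproduced with a short sketch. The function name, the feature string format and the tokenisation (whitespace split, final full stop dropped) are assumptions made for illustration.

```python
def context_features(tokens, np_start, np_end, window=2):
    """Return context features for the NP occupying tokens[np_start:np_end].

    Produces bag-of-words features for words before the NP, after the NP,
    and on either side, plus position-specific window features.
    """
    before = tokens[:np_start]
    after = tokens[np_end:]
    feats = set()
    feats.update(f"before={w}" for w in before)
    feats.update(f"after={w}" for w in after)
    feats.update(f"context={w}" for w in before + after)
    for offset in range(1, window + 1):
        left = np_start - offset
        right = np_end + offset - 1
        if left >= 0:
            feats.add(f"word-{offset}={tokens[left]}")
        if right < len(tokens):
            feats.add(f"word+{offset}={tokens[right]}")
    return feats


sentence = "The number of lung cancer deaths has grown recently".split()
for feat in sorted(context_features(sentence, 3, 6)):
    print(feat)
```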
We have access to gold-standard POS and NER tags, from the Penn Treebank and the BBN Pronoun Coreference and Entity Type Corpus (Weischedel and Brunstein, 2005) respectively. These are used by adding generalised features for every n-gram and context window feature, replacing the words with their POS and NER tags. POS tags are included even though all the words in the NP are nouns for these simple NP experiments, as they may be proper and/or plural. We use the coarse-grained NER tags, of which there are 28 (plus O), including the B- and I- prefixes. Using the lung cancer deaths example again, its POS tag trigram would be NN NN NNS and its NER tag trigram would be O B-DISEASE O.
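A minimal sketch of the generalised n-gram features follows, using the tag sequences of the example NP. The function name and feature string format are hypothetical; only the tag values come from the example in the text.

```python
def tag_ngram_features(layers):
    """Return position-labelled tag n-gram features for each tag layer.

    `layers` maps a tag type (e.g. "POS") to one tag per word of the NP.
    """
    feats = []
    for kind, tags in layers.items():
        for n in range(1, len(tags) + 1):
            for start in range(len(tags) - n + 1):
                gram = " ".join(tags[start:start + n])
                feats.append(f"{kind}:{n}gram@{start}={gram}")
    return feats


# Tags for "lung cancer deaths", as given in the text.
example = {"POS": ["NN", "NN", "NNS"], "NER": ["O", "B-DISEASE", "O"]}
for feat in tag_ngram_features(example):
    print(feat)
```

The same replacement of words by tags would apply to the context window features.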
Finally, we incorporate semantic information from WordNet (Fellbaum, 1998). For each sense of each word in the NP, we extract a semantic feature for its synset, and also the synset of each of its hypernyms up to the WordNet root. These features are marked with how far up the tree from the original synset the hypernym is, but there is also an unordered bag-of-hypernyms for all senses.
All of these semantic features are applied to each word in the NP, including a label describing whether it is the first, second or third word. For the word cancer there are five synsets to which it belongs: malignant neoplastic disease, Crab, Cancer, Cancer the Crab, and genus Cancer, all of which are included. The first level of hypernyms for these synsets is malignant tumor, person, and arthropod genus (two of the senses have no hypernyms), and this continues up the tree. The bag-of-hypernyms would include all of the synsets we have listed and many more.
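The sketch below illustrates the depth-labelled and bag-of-hypernyms features for cancer. To stay self-contained it does not query WordNet: the two dictionaries are a toy stand-in holding only the synsets and first-level hypernyms named above, with the pairing of synset to hypernym assumed from their order in the text, and the chains truncated after one level. All names and the feature format are hypothetical.

```python
# Toy stand-in for WordNet (assumption): only the entries named in the text.
TOY_SENSES = {
    "cancer": [
        "malignant neoplastic disease",
        "Crab",
        "Cancer",
        "Cancer the Crab",
        "genus Cancer",
    ],
}
TOY_HYPERNYMS = {
    "malignant neoplastic disease": ["malignant tumor"],
    "Crab": ["person"],
    "genus Cancer": ["arthropod genus"],
}


def semantic_features(word, position, senses=TOY_SENSES, hypernyms=TOY_HYPERNYMS):
    """Return depth-labelled and unordered hypernym features for one NP word."""
    feats = set()
    for synset in senses.get(word, []):
        frontier, depth, seen = [synset], 0, set()
        while frontier:
            next_frontier = []
            for node in frontier:
                if node in seen:
                    continue
                seen.add(node)
                # Depth 0 is the word's own synset; depth 1 its hypernym; etc.
                feats.add(f"w{position}:hyp{depth}={node}")
                feats.add(f"w{position}:hypbag={node}")
                next_frontier.extend(hypernyms.get(node, []))
            frontier = next_frontier
            depth += 1
    return feats


# "cancer" is the second word of "lung cancer deaths".
for feat in sorted(semantic_features("cancer", 2)):
    print(feat)
```

With the real hierarchy the loop would continue to the root, so each sense contributes many more features than in this truncated toy.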
These additional feature types increase the number of features in the Maximum Entropy classifier to 86,116, compared to the 947 we had previously. This number is still small compared to some other tasks, because our data set is comparatively small, being made up of only 4,263 unique tokens. Almost all of the models converge after 50 training iterations, the one exception being the model that uses only unsupervised features, which takes about 200.
Results
Table 5.11 shows the results for a model using only the additional features, and also once the “unsupervised” features used in Table 5.10 are included. The additional features do not perform as well as the unsupervised ones, but once they are combined a further performance increase of 3.5% is attained.
Table 5.11 also presents a subtractive analysis of all feature groups. The Google and snippets features do not appear to contribute at all, probably because they overlap significantly with each other and the Web 1T searches. Of the supervised features, the context window and NER are most important but all make a positive contribution, except for the semantic features. Our best performance of 93.8% F-score is obtained by removing this group.
Features                             F-score (%)
Unsupervised, voting                 89.6
Additional features                  89.5
Additional + unsupervised features   93.0
  −Google                            93.0
  −Snippets                          93.0
  −Web 1T corpus                     92.1
  −Lexical                           92.3
  −POS                               92.5
  −NER                               92.1
  −Context sentence                  92.7
  −Context window                    92.0
  −Semantic                          93.8
Table 5.11: Subtractive analysis of simple NP features on development set
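The subtractive analysis in Table 5.11 can be read as the drop in F-score when each feature group is removed from the combined model. The short sketch below recomputes those differences from the table values; the variable and function names are our own.

```python
# F-scores from Table 5.11 (development set).
FULL_MODEL = 93.0
WITHOUT_GROUP = {
    "Google": 93.0,
    "Snippets": 93.0,
    "Web 1T corpus": 92.1,
    "Lexical": 92.3,
    "POS": 92.5,
    "NER": 92.1,
    "Context sentence": 92.7,
    "Context window": 92.0,
    "Semantic": 93.8,
}


def contributions(full, ablated):
    """Return the F-score lost when each group is removed (negative = group hurts)."""
    return {group: round(full - score, 1) for group, score in ablated.items()}


deltas = contributions(FULL_MODEL, WITHOUT_GROUP)
for group, delta in sorted(deltas.items(), key=lambda item: -item[1]):
    print(f"{group:<18}{delta:+.1f}")
```

A positive value means the group helps the combined model; the semantic group is the only one with a negative value, which is why removing it gives the best result.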
Model                              F-score (%)
Unsupervised, Web 1T adjacency     77.6
Unsupervised, voting               86.8
Best supervised model              93.4
Table 5.12: Test set results for the supervised model
Finally, results on the test set are shown in Table 5.12. The supervised model has improved over the unsupervised baseline by 6.6%. This larger increase, compared to the development set, shows that the voting method’s performance is quite variable, while the Maximum Entropy model remains consistent.