On coordination disambiguation in a generative parsing model, with memory based techniques for parameter estimation


LEABHARLANN CHOLAISTE NA TRIONOIDE, BAILE ATHA CLIATH

TRINITY COLLEGE LIBRARY DUBLIN

Ollscoil Átha Cliath

The University of Dublin



On Coordination Disambiguation in a Generative Parsing Model, with Memory-Based Techniques for Parameter Estimation

by

Deirdre Hogan

Dissertation

Presented to the University of Dublin, Trinity College in fulfillment of the requirements for the Degree of Doctor of Philosophy


May 2007


Declaration

I, the undersigned, declare that this work has not previously been submitted as an exercise for a degree at this, or any other University, and that unless otherwise stated, is my own work.

Deirdre Hogan


Permission to Lend and/or Copy

I, the undersigned, agree that Trinity College Library may lend or copy this thesis upon request.

Deirdre Hogan


Acknowledgments

I would like first to thank my supervisors, Padraig Cunningham and Saturnino Luz, for their feedback and good advice over the years and for always pointing me in the right direction when things went awry.

Much thanks also to Jennifer Foster for her encouragement, for her help structuring this thesis and for the valuable and in-depth feedback she gave on the individual thesis chapters.

Thanks also to Carl Vogel for his support and encouragement, for always giving me the sense he had confidence in my ability and for organising the computational linguistics reading group. My gratitude also extends to those attending the reading group sessions, especially for their comments and thoughts on coordination disambiguation.

I would also like to thank Alexey Tsymbal and Joachim Wagner for their useful feedback on parts of this work. Special thanks too to Libby and Suzie for their support and much valued friendship throughout my PhD and before.

I gratefully acknowledge the financial support provided by the TCD Broad Curriculum Fellowship initiative.

University of Dublin, Trinity College
May 2007


On Coordination Disambiguation in a Generative Parsing Model, with Memory-Based Techniques for Parameter Estimation

Publication No.

Deirdre Hogan, Ph.D.

University of Dublin, Trinity College, 2007


This thesis is concerned with improving existing generative history-based probability models' treatment of noun phrase coordination and the development of memory-based techniques for model parameter estimation.

Lexicalised generative history-based parsing models have proven to be highly successful at robust and accurate parsing. Although such models already achieve impressive overall accuracy results there is nevertheless potential for improvement, particularly in areas of difficulty for such parsers, such as coordination disambiguation and, more generally, parameter estimation from sparse data.

Though coordination has long been known as an area of difficulty for natural language parsing, coordination ambiguity is nevertheless a little studied area. Our aim is to increase understanding of coordination ambiguity in generative history-based parsing models. We seek to find ways of improving the model's handling of noun phrase coordination without removing coordination from the parsing framework. As well as reducing noise in the data, we look at modelling two main sources of information for disambiguation: symmetry in conjunct structure, and the likelihood of one lexical head being conjoined with another. The latter step involves extending the modelling of coordinate heads to include those found in base noun phrases and improving parameter estimation by incorporating data from the BNC and using a word graph and a measure of word similarity to decrease data sparsity. We also alter the head-finding rules for base noun phrases so that the lexical item chosen to head the entire phrase more closely resembles the head chosen for other types of coordinate noun phrase.

A difficulty in improving probabilistic generative models is how to incorporate into the probability model features that will capture information in the data important for disambiguation decisions within the limitations of feature selection in history-based models. In addition, adding new features to the model increases the risk of the sparse data problem and smoothing techniques which can overcome the sparse data problem [...] memory-based techniques for parameter estimation and demonstrate that they are effective for parameter estimation in a lexicalised generative parsing model, allowing for flexible feature selection, good smoothing of data, and can achieve state-of-the-art results for [...]


Contents

Acknowledgments
Abstract
List of Tables
List of Figures

Chapter 1 Introduction
  1.1 Thesis Outline
    1.1.1 Coordination Disambiguation
  1.2 Generative Lexicalised History-based Parsing models
    1.2.1 Generative
    1.2.2 Lexicalised
    1.2.3 History-Based
    1.2.4 Markovised
  1.3 Coordination in the Baseline Model
  1.4 Parameter Estimation
    1.4.1 Linear Interpolation and Witten-Bell Estimation
    1.4.2 k-Nearest Neighbour Parameter Estimation
    1.4.3 Similarity for Smoothing
  1.5 Chapter by Chapter Guide to this Thesis

Chapter 2 Previous Work
  2.1 Introduction
    2.2.1 [Magerman and Marcus, 1991, Magerman and Weir, 1992, Black et al., 1992]
    2.2.2 [Jelinek et al., 1994, Magerman, 1994, 1995]
    2.2.3 [Collins, 1996, 1997, 1999]
    2.2.4 [Charniak, 1996b, 1997, 2000]
    2.2.5 [Ratnaparkhi, 1997, 1998a]
    2.2.6 Henderson [2003]
    2.2.7 Investigations into the Importance of Lexical Statistics
  2.3 Ranking Algorithms
    2.3.1 n-best lists
  2.4 Memory-based Learning and Natural Language Processing
    2.4.1 Advantages of Local Learning for Natural Language Learning Tasks
    2.4.2 Memory-Based Parsing
  2.5 Similarity for Smoothing
    2.5.1 k-NN for Smoothing
    2.5.2 Cooccurrence Smoothing
  2.6 Previous Work on Coordination Ambiguity Resolution

Chapter 3 Memory-Based Parameter Estimation
  3.1 Introduction
  3.2 Motivation
  3.3 The Baseline Model
  3.4 The Memory-Based Model
    3.4.1 Constraint Features for Training Set Restriction
    3.4.2 Smoothing
    3.4.3 Lexical Statistics
  3.5 Experiments
    3.5.1 Experimental Set up
    3.5.2 Experimental Details
  3.6 Results
  3.7 Computational Costs
  3.9 Conclusion

Chapter 4 Conjoined Lexical Head Nouns
  4.1 Introduction
  4.2 Measures of Word Similarity
    4.2.1 Similarity based on Coordination Cooccurrences
    4.2.2 WordNet-Based Similarity Measures
    4.2.3 Empirical Evaluation of Similarity Measures
    4.2.4 Discussion
  4.3 Modelling Coordinate Head Words
    4.3.1 Extending $P_{coordWord}$ to Coordinate NPBs
    4.3.2 Estimating the $P_{coordWord}$ Parameter Class from a Coordination Word Graph
  4.4 Relation to Previous Work
  4.5 Summary

Chapter 5 Parallelism Across Conjuncts
  5.1 Introduction
  5.2 Empirical Measurements of Parallelism
    5.2.1 Methodology
    5.2.2 Results
  5.3 Modelling Symmetry in Conjuncts
  5.5 Summary

Chapter 6 Noun Phrase Coordination Error Analysis
  6.1 Introduction
  6.2 Bracketing Guidelines for the Penn Treebank and Inconsistencies in WSJ Coordinate Noun Phrase Annotation
  6.3 NPB Head-Finding Rules
    6.3.1 Modifying the NPB Head-Finding Rules

Chapter 7 Experimental Evaluation - Coordination
  7.1 Introduction
  7.2 Experimental Evaluation
  7.3 Experimental Details and Results
    7.3.1 Eliminating Noisy Data
    7.3.2 Modelling Symmetry in Conjunct Structure
    7.3.3 NPB Head-Finding Rule and New Features for NPBs
    7.3.4 Modelling Conjoined Head Nouns
    7.3.5 Results
  7.4 Discussion
  7.5 Summary

Chapter 8 Conclusions and Future Work
  8.1 Summary
    8.1.1 Memory-Based Parameter Estimation
    8.1.2 Noun Phrase Coordination Disambiguation
  8.2 Future Work
    8.2.1 Feature Weighting
    8.2.2 Constraint Features
    8.2.3 Smoothing
    8.2.4 Word Graph and Similarity Function
    8.2.5 Modelling Dependencies Across CCs
    8.2.6 Cleaning Noisy Data


List of Tables

3.1 The parameter class for generating $C_h$, the non-terminal label of the head child node. $C_p$ is the parent non-terminal label, $w_p$ and $t_p$ its head word and part-of-speech respectively, and $t_{gp}$ is the POS tag of the grandparent node.

3.2 The parameter classes for the generation of modifier nodes. The notation is that used throughout the thesis. $dir$ is a flag which indicates whether the modifier being generated is to the left or the right of the head child. $dist$ is the distance metric used in the Collins parser, and $t_{i-2}$ are the POS tags for the previous two generated nodes. $C_{gp}$ is the grandparent non-terminal label.

3.3 The parameter classes used only when $C_p$ = NPB. The notation is that used throughout the thesis. In addition, $C_{ggp}$ and $C_{gggp}$ are the great- and great-great-grandparent non-terminal labels respectively. $C_{i-2}, w_{i-2}$ and $C_{i-3}, w_{i-3}$ are the non-terminal labels and head words of the second and third previously generated nodes.

4.1 Summary of the 9 different word similarity measures to be evaluated empirically on WSJ cooccurrence data.

4.2 Summary statistics for 9 different word similarity measures (plus one random measure). $n_{coord}$ and $n_{nonCoord}$ are the sample sizes for the coordinate and non-coordinate noun pairs samples, respectively; $\bar{x}_{coord}$, $SD_{coord}$ and $\bar{x}_{nonCoord}$, $SD_{nonCoord}$ are the sample means and standard deviations for the two sets. The 95% CI column shows the 95% confidence interval for the difference between the two sample means. The p-value is for a Welch two sample two-sided t-test.

5.1 Nodes aligned at level 1 for the trees in Figure 5.2.

5.2 Contingency table for the head child non-terminal label TO at conjunct depth 1.

5.3 Percentage Match (%M) of head event labels $C_h$ in right-of-head conjuncts with the corresponding label in the head conjunct, grouped by depth. Percentage match for head conjunct nodes collected in both a left-to-right (L-R) traversal and head-first (H-F) traversal are shown.

5.4 Percentage Match (%M) of $C_i$ and $t_i$ labels of dependent events in right-of-head conjuncts with the head conjunct, grouped by depth. Percentage match for head conjunct nodes collected in both a left-to-right (L-R) traversal and head-first (H-F) traversal are shown. The total number of dependent events ($|DepEvents|$) in post-CC conjuncts for each level is displayed.


List of Figures

1.1 Tree 1. The correct noun phrase parse. Tree 2. The incorrect parse for the noun phrase.

1.2 Tree 1. The correct noun phrase parse. Tree 2. The incorrect parse for the noun phrase.

1.3 The basic form of a coordinated phrase. coord refers to the coordination flag.

2.1 Parser and Reranking F-score Results Comparison on Section 23 of the WSJ.

4.1 Tree 1. The correct noun phrase parse. Tree 2. The incorrect parse for the noun phrase.

4.2 Graph of coordinations extracted from the BNC.

5.1 Example of symmetry in conjunct structure in a lexicalised subtree.

5.2 Trees that contain conjuncts with non-isomorphic structure.

5.3 Prior and posterior (positive adaption) probabilities for head child non-terminal labels at conjunct depth 1.

5.4 Prior and posterior (positive adaption) probabilities for head child non-terminal labels at conjunct depth 5.

5.5 Prior and posterior (positive adaption) probabilities for modifier POS labels at conjunct depth 1.


Chapter 1 Introduction

1.1 Thesis Outline

Lexicalised generative history-based parsing models have proven to be highly successful at robust and accurate parsing [Collins, 1999, Charniak, 2000]. Developed from relatively simple Probabilistic Context Free Grammar (PCFG) models [Booth and Thompson, 1973], they are now highly complex models which weaken the independence assumptions of PCFGs by using information from previously generated parse structure to help predict the remaining structure of the parse tree. Although these models already achieve impressive overall accuracy results there is nevertheless potential for improvement. This is particularly true in areas of difficulty for such parsers, such as, for example, coordination disambiguation, prepositional phrase (PP) attachment, or, more generally, the estimation of parameters from sparse data.

Probabilistic parsing of natural language can be broken down into three main components: defining a probability model, estimating the parameters of the model, and efficiently searching for the most probable parse from the space of all possible parses for the sentence. The work in this thesis is concerned with improving parameter estimation in a generative model by using memory-based techniques as well as improving the model's handling of coordination disambiguation and so lies within the first two areas.


[...] for disambiguation decisions within the limitations of feature selection in history-based models. In addition, adding new features to the model increases the sparse data problem, one of the core difficulties in empirical NLP. Increasing the number of conditioning features when predicting future structure can improve accuracy as the model has more information on which to base its prediction. However, increasing the number of conditioning features increases the number of parameters in the model, spreading the data over more specific events, and often there is simply not enough training data to be able to accurately estimate the probabilities of events. This is especially true when dealing with features that involve individual words, features which nevertheless are important as individual words tend to have good discriminating power. The method of estimating the local distributions, therefore, plays a very important role in building a good model.

In this thesis we examine the use of memory-based techniques for parameter estimation. Specifically, we use k-nearest neighbour (k-NN) for smoothing model parameters. Essentially, this technique involves basing the estimation of a particular parameter, or query instance, on the distribution of the class variable (or future) over the set of k instances from the training data that are most similar to the query instance. k is typically very large in probability estimation, compared to when k-NN is used for classification. Instances selected are weighted according to their similarity to the query instance, so that instances from memory that are more similar to the query instance will be given more weight in the prediction of the class value.
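The idea of similarity-weighted k-NN probability estimation can be illustrated with a minimal sketch. This is not the estimator developed in the thesis: the overlap similarity, the feature tuples, the toy memory and the names `overlap_similarity` and `knn_estimate` are all assumptions made for illustration only.

```python
from collections import defaultdict


def overlap_similarity(a, b):
    """Hypothetical similarity: fraction of feature positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def knn_estimate(query, memory, k, similarity=overlap_similarity):
    """Estimate P(future | query) from the k most similar stored instances.

    memory is a list of (features, future) pairs. Each selected neighbour
    contributes a weight equal to its similarity to the query instance.
    """
    scored = [(similarity(query, feats), future) for feats, future in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    weights = defaultdict(float)
    for sim, future in scored[:k]:
        weights[future] += sim
    total = sum(weights.values())
    if total == 0:
        return {}
    return {future: w / total for future, w in weights.items()}


# Toy memory of (conditioning features, future) pairs, invented for this sketch.
memory = [
    (("NP", "NN", "dog"), "NN"),
    (("NP", "NN", "cat"), "NN"),
    (("NP", "NNS", "dogs"), "NNS"),
    (("VP", "VB", "run"), "VB"),
]
dist = knn_estimate(("NP", "NN", "bird"), memory, k=3)
```

In this toy run the unseen word "bird" still receives a usable distribution, because the estimate borrows weight from stored instances that share its other features; that is the smoothing effect the paragraph above describes.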

Memory-based learning is distinguished from other machine learning algorithms in that it delays generalising beyond the training data until it must classify, or assign a probability to, each new query instance. This sort of lazy learning avoids committing to a single global approximation at training time but instead implicitly represents the target function by a combination of many local approximations, which take into consideration the query instance when deciding how to generalise. The specific advantages of memory-based learning - the ability to model complex target functions by a collection of local approximations and the fact that memory-based learning does not abstract away from low frequency data - suggest, considering the irregularities and small sets of exceptions in natural language, that memory-based learning algorithms should lend themselves well to natural language learning [Daelemans et al., 1999a, Daelemans and van den Bosch, 2005].


[...] generative parsing model. k-NN is a very simple but effective method, allows for flexible feature selection and achieves state-of-the-art performance in accuracy.

We carry out our experiments within the framework of generative parse reranking. We begin by describing a generative probabilistic model for parsing, based on Model 1 of Collins [1999], which re-estimates the probability of each parse generated by an initial base parser (Bikel [2004a]'s implementation of the Collins parser) using memory-based techniques to estimate local probabilities. We achieve an F-score of 89.4% for sentences < 40 words on section 23 of the Penn Wall Street Journal (WSJ) Treebank [Bies et al., 1995], which represents a significant increase over our baseline parser and the Collins parser. Although the model effectively reranks the top-n parses output from the base parser, insofar as it is generative the approach is more similar to a second pass of a generative parser than to the discriminative reranking approaches of [Collins, 2000, Collins and Duffy, 2002, Shen et al., 2003, Henderson, 2004, Charniak and Johnson, 2005, Koo and Collins, 2005].

Discriminative approaches to parse reranking have recently become popular, motivated to a large extent by the flexibility of discriminative techniques in terms of feature selection compared to history-based models. Discriminative reranking approaches can choose features which incorporate arbitrary aspects of the whole parse tree structure, whereas in history-based models the choice of conditioning features when predicting parse structure is limited to structure that has already been determined in the derivation of the tree. Although discriminative reranking tends to improve on the performance of generative models, there remain relatively small differences in accuracy between generative and discriminative models when tested on the Penn Wall Street Journal Treebank, despite the more restricted choice of features possible in history-based models.


[...] history-based reranking models also have an advantage in that they can be applied to the full output of the base parser and not just the n-best list to which discriminative rerankers are limited. This is because, unlike discriminative reranking approaches, history-based models can take advantage of a packed representation of trees and can use dynamic programming to search for the most probable tree according to the model.

The remainder of the thesis involves taking the memory-based model as the baseline model and working on improving the area in which the model performs worst: coordination disambiguation.

1.1.1 Coordination Disambiguation

As an example of the coordination disambiguation task, take the phrase "people of all ages and all classes". The coordinating conjunction (CC) "and" and the noun phrase "all classes" could attach to the noun phrase "all ages", as illustrated in Tree 1, Figure 1.1. Alternatively, "all classes" could be incorrectly conjoined to the noun phrase "people of all ages" as in Tree 2, Figure 1.1. This problem of whether to attach low (Tree 1) or attach high (Tree 2) is a common source of error in coordinate noun phrase disambiguation.
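The two attachments can be made concrete with a small sketch. The bracketings below are a simplified reading of the description above, not a reproduction of Figure 1.1: the node labels, the omission of POS tags and the helper names `tree_yield` and `conjuncts` are assumptions for illustration.

```python
# Trees are (label, child, ...) tuples; leaves are plain strings.
# Low attachment: "all classes" is conjoined with "all ages".
low = ("NP",
       ("NP", "people"),
       ("PP", "of",
        ("NP",
         ("NP", "all", "ages"),
         ("CC", "and"),
         ("NP", "all", "classes"))))

# High attachment: "all classes" is conjoined with "people of all ages".
high = ("NP",
        ("NP",
         ("NP", "people"),
         ("PP", "of", ("NP", "all", "ages"))),
        ("CC", "and"),
        ("NP", "all", "classes"))


def tree_yield(tree):
    """Return the sequence of leaf words of a tree."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(tree_yield(child))
    return words


def conjuncts(tree):
    """Return the yields of the sisters of the first CC node found, if any."""
    if isinstance(tree, str):
        return None
    children = tree[1:]
    if any(not isinstance(c, str) and c[0] == "CC" for c in children):
        return [" ".join(tree_yield(c)) for c in children
                if isinstance(c, str) or c[0] != "CC"]
    for child in children:
        found = conjuncts(child)
        if found is not None:
            return found
    return None
```

Both trees have the same yield, so the word string alone cannot decide between them; they differ only in which noun phrases are treated as conjuncts.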

Another common source of disambiguation error is illustrated with the alternative bracketing of the slightly modified phrase "people of all ages and classes" shown in Figure 1.2. Here, the problem is whether "all" modifies both "ages" and "classes" as in Tree 1, Figure 1.2, or whether "all" modifies "ages" but not "classes" as in Tree 2, Figure 1.2.

Although there has been a substantial body of work on other areas of difficulty for parsers, such as PP-attachment, coordination ambiguity is a relatively little studied area. One reason for this could be that dependencies involving PP-attachment tend to occur much more often than coordination constructions. Thus, improving PP-attachment has perhaps greater potential to improve overall parser performance. The correct bracketing of coordination constructions, however, remains one of the most difficult problems for natural language parsers and parsers often perform worse at coordination disambiguation than PP-attachment. In the Collins parser and our emulation of his parsing model, dependencies involving coordination achieve by far the worst performance of all dependencies.


Figure 1.1: Tree 1. The correct noun phrase parse. Tree 2. The incorrect parse for the noun phrase.


Figure 1.2: Tree 1. The correct noun phrase parse. Tree 2. The incorrect parse for the noun phrase.

[...] phrase (NP) coordination accounts for over 50% of coordination dependency error in our baseline model we focus primarily on NP coordination.

We examine some of the types of error made in noun phrase coordination, showing how Penn Treebank data for NP coordination is particularly noisy and how inconsistencies in the Penn Treebank WSJ annotation of coordinate NPs negatively affect parser performance. We also show how the different head-finding rules for noun phrases and non-recursive noun phrases (base NPs) affect disambiguation, suggesting slightly modified head-finding rules for base NPs.


[...] concentrate on modelling the likelihood of two nouns conjoining, designing a new parameter class for use in both coordinate noun phrases and coordinate base noun phrases. In the estimation of this parameter class, data from the unlabelled British National Corpus (BNC) [Burnard, 1995] are used in addition to WSJ data. We use a word graph to store the training data and explore variations of k-nearest neighbour which incorporate our measure of word similarity in the estimation of parameters in order to reduce data sparseness.

There is often a considerable bias toward symmetry in the syntactic structure of two conjuncts and most previous work on coordination disambiguation has attempted to take advantage of this. We give empirical measurements of the extent to which parallelism in the syntactic structure of conjuncts exists and then design new parameter classes for the generative model which attempt to capture the parallelism effect and thus allow the model to learn a bias toward symmetry in conjuncts.

The various changes to the baseline model in the handling of coordination result in a rise in NP coordination dependency F-score from 69.9% to 73.9%, which represents a relative reduction in F-score error of 13%.
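The relative error reduction can be checked from the two F-scores just quoted, under the assumption that F-score error is taken to be $100 - F$:

```latex
\[
\frac{(100 - 69.9) - (100 - 73.9)}{100 - 69.9}
  = \frac{4.0}{30.1}
  \approx 0.133
\]
```

That is, roughly 13% of the baseline error is removed.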

We now summarise the contributions made in this thesis:

1. Parameter Estimation - Combating Data Sparseness

• We demonstrate that memory-based models, based on varieties of the k-nearest neighbour algorithm, are effective for parameter estimation in a lexicalised generative parsing model, allowing for flexible feature selection and good smoothing of data, and can achieve state-of-the-art results for accuracy.

• We introduce a novel technique for the estimation of certain types of bilexical statistics, which makes use of both labelled and unlabelled data and incorporates, for the first time, a measure of word similarity into a generative lexicalised parsing model.

2. Coordinate Noun Phrase Disambiguation

• We investigate some of the causes for the errors in coordinate noun phrase disambiguation, showing that the data used for such parsers are particularly noisy with regard to NP coordination. We also demonstrate how head-finding rules can negatively affect disambiguation.

• We give an empirical analysis of noun phrase coordination in the data - focusing on two salient characteristics of noun phrase coordination: symmetry in conjunct structure and word similarity for coordinate head nouns.

• Based on the study of training data and parser errors we develop techniques for improving the model's ability to disambiguate coordinate structure, including altering the parameterisation of the model and improving parameter estimation.

An early version of the work on memory-based parameter estimation for generative parsing was published in [Hogan, 2005]. Hogan [2007a] describes some of the work on coordinate noun phrase disambiguation reported in this thesis and Hogan [2007b] reports on the empirical measurements of lexical similarity in noun phrase conjuncts presented in Chapter 4.

The remainder of this chapter is organised as follows: In Section 1.2 we outline the generative history-based parsing model adopted in this thesis, introducing the notation that will be subsequently used throughout the thesis. Then, in Section 1.3, we give a brief overview of how coordination is handled in the Collins parsing model - our baseline model. In Section 1.4 we outline the parameter estimation techniques used in this thesis: linear interpolation and the Witten-Bell estimation of the baseline model and k-nearest neighbour methods. Finally, Section 1.5 gives a chapter by chapter guide to the rest of the thesis.

1.2 Generative Lexicalised History-based Parsing models

1.2.1 Generative


language and T the set of parse trees. Each tree in T has a member of S as its yield (i.e. its sequence of leaf nodes).

Generative probability models define a joint probability distribution, $P(t, s)$, over the space of all possible sentence/parse tree pairs, which satisfies the constraint:

\[ \sum_{t \in T,\, s \in S} P(t, s) = 1 \tag{1.1} \]

As probabilities are for the entire language, it is possible to find the overall probability of a sentence:

\[ P(s) = \sum_{t \in T} P(t, s) \tag{1.2} \]

Generative parsing models estimate $P(t \mid s)$ indirectly by making the observation that maximising $P(t, s)$ is equivalent to maximising $P(t \mid s)$. The most likely parse tree, $\hat{t}$, is given by:

\[ \hat{t} = \arg\max_{t \in T} P(t \mid s) = \arg\max_{t \in T} \frac{P(t, s)}{P(s)} = \arg\max_{t \in T} P(t, s) \tag{1.3} \]

(In (1.3) $P(s)$ is constant, so maximising $P(t, s)/P(s)$ is equivalent to maximising $P(t, s)$.) The joint probability $P(t, s)$ is simply $P(t)$ where the yield of $t$ is equal to $s$, and 0 otherwise. Thus, from the space of all candidate parses for a particular sentence, generative parsers choose the parse tree that maximises the probability $P(t)$.

The probability of a tree is calculated as the product of the probabilities of all the rewrite rules from which the tree is derived. In a PCFG, for a tree derived by $n$ applications of context-free rewrite rules $LHS_i \rightarrow RHS_i$, $1 \le i \le n$:

\[ P(t) = \prod_{i=1}^{n} P(RHS_i \mid LHS_i) \tag{1.4} \]

In PCFGs the context-free rewrite rules are so called because they are independent of surrounding context in the tree - that is, the probability of a rule expansion is independent of where the rule occurs in the tree. The probability of a rewrite rule is estimated using relative frequency estimates:

(29)

P { R H S i \ L H S i ) = count{LHSj ^ R H S j )

count{LH Si) (1.5)

where count{LHSi —> R H S i ) and count{LHSi) return the frequency of L H S i R H S i and LHSi in the corpus respectively. This is the maximum likelihood estimate (MLE) (see [ColHns, 1999, p. 40] for proof on why, in this case, the relative frequency estimate is the maximum likelihood estimate, and also for a fuller description of the generative parsing model than is given here).
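The relative frequency estimate in (1.5) and the tree probability in (1.4) can be illustrated with a short Python sketch. The toy rule list and the function names `estimate_rule_probs` and `tree_prob` are assumptions made for illustration only; they are not taken from the thesis or from any parser implementation.

```python
from collections import Counter
from math import prod


def estimate_rule_probs(rules):
    """Relative-frequency (MLE) estimate of P(RHS | LHS), as in (1.5).

    `rules` is a list of (lhs, rhs) pairs read off a treebank,
    where rhs is a tuple of symbols.
    """
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {
        (lhs, rhs): c / lhs_counts[lhs]
        for (lhs, rhs), c in rule_counts.items()
    }


def tree_prob(derivation, rule_probs):
    """P(t) as the product of rule probabilities, as in (1.4).

    A rule never seen in training gets probability 0 under the
    unsmoothed MLE, so the whole tree gets probability 0.
    """
    return prod(rule_probs.get(rule, 0.0) for rule in derivation)


# Hypothetical toy data: rule applications observed in two small trees.
observed = [
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("VP", ("VB", "NP")),
    ("NP", ("NN",)),
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("VP", ("VB",)),
]
probs = estimate_rule_probs(observed)

# A tree built from S -> NP VP, NP -> NN, VP -> VB.
p = tree_prob(
    [("S", ("NP", "VP")), ("NP", ("NN",)), ("VP", ("VB",))], probs
)
# A tree containing a rule that was never observed.
p_unseen = tree_prob([("S", ("VP",))], probs)
```

The second call shows the weakness of the unsmoothed estimate: a single unseen rule drives the probability of the whole tree to zero.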

1.2.2 Lexicalised

Lexicalisation tends to improve parser accuracy because it allows the parser to use crucial information about the words in the sentence when disambiguating the syntactic structures of that sentence (see, for example, early work on lexical statistics for resolving syntactic ambiguity in [Hindle and Rooth, 1991]).

A PCFG can be lexicalised by associating a word, $w$, and also a part-of-speech (POS) tag, $t$, with each non-terminal in the tree. The key idea is that each constituent has a 'head' which is its most important lexical item.

An unlexicalised PCFG rewrite rule can be written as:

$$C_p \rightarrow C_{ln} \ldots C_{l1}\, C_h\, C_{r1} \ldots C_{rm} \qquad (1.6)$$

where, on the left hand side of the rule, $C_p$ is the parent constituent label. The right hand side of the rule consists of the children of $C_p$: a sequence of $n$ constituents to the left of the head child constituent (left modifiers), followed by the head constituent, $C_h$, followed by the $m$ constituents to the right of the head constituent (right modifiers).

The lexicalised version of (1.6) is:

$$C_p(w_p, t_p) \rightarrow C_{ln}(w_{ln}, t_{ln}) \ldots C_{l1}(w_{l1}, t_{l1})\, C_h(w_p, t_p)\, C_{r1}(w_{r1}, t_{r1}) \ldots C_{rm}(w_{rm}, t_{rm}) \qquad (1.7)$$

where each constituent is associated with its head word, $w_i$, and head word POS tag, $t_i$. Note that the head word and POS tag of $C_h$, the head child, are inherited from the parent constituent.

The introduction of lexicalisation vastly increases the number of rules in the grammar and makes direct estimation of constituent expansion rules unfeasible because of sparse data problems. Using the chain rule of probabilities, the probability of a rule is decomposed into the product of more tractable probabilities, and independence assumptions are made to reduce the number of parameters in the model.

1.2.3 History-Based

In order to incorporate richer context in the probability model, in an attempt to overcome the structural weaknesses inherent in the independence assumptions of probabilistic context-free grammars, history-based models [Black et al., 1992] were developed. In history-based models the probability of a derivation $D$ of a parse tree is the product of the probabilities of each step in the derivation of the tree. For example, for PCFGs, each step, or decision, in the derivation of the tree is an application of a rewrite rule. Unlike PCFGs, however, in history-based models the probability of a step $d_i$ in the construction of a tree is conditioned on potentially all structure that has already been determined in the derivation of the tree. For a tree derived by a sequence of $n$ decisions:

$$P(D) = \prod_{i=1}^{n} P(d_i \mid d_1, \ldots, d_{i-1}) \qquad (1.8)$$

The sequence of previous decisions $d_1, \ldots, d_{i-1}$ is referred to as the history of $d_i$. It is not practical to condition on the entire history as this would lead to a vast number of parameters. Instead a history mapping function, written $\Phi$ here, maps the history to a finite set of history contexts, so that:

$$P(D) = \prod_{i=1}^{n} P(d_i \mid \Phi(d_1, \ldots, d_{i-1})) \qquad (1.9)$$

PCFGs are a special case of history-based model, where the history of a rule expansion is taken simply to be the non-terminal label of the node being expanded.


1998]):

$$P(t) = \sum_{D} P(D) \qquad (1.10)$$

1.2.4 Markovised

We use a generative model for parsing following the lexicalised history-based model of Collins [1999] where the grammar rules are Markovised.

Take the lexicalised rule in (1.7). In a Markov grammar, instead of generating the right-hand-side in one step, the generation process is broken down into three main steps: first the head child $C_h$ is generated, then $C_{l1}(w_{l1}, t_{l1})$ through to $C_{ln}(w_{ln}, t_{ln})$, then $C_{r1}(w_{r1}, t_{r1})$ through to $C_{rm}(w_{rm}, t_{rm})$. At each step, the probability of generating a particular child node can be conditioned on the children which have already been generated. In a first order Markov grammar a modifier node is conditioned on the previously generated node (as well as the parent node). In an $m$-th order Markov grammar the node is conditioned on the $m$ previously generated siblings (and parent node). The model also generates two special +STOP+ nonterminals as the leftmost $(l_{n+1})$ and rightmost $(r_{m+1})$ children of every parent. In a Markovised grammar the generation of the +STOP+ nonterminals is necessary if the model is to sum to 1, due to the fact that constituents have a variable number of children. See [Collins, 1999, p. 46] for a discussion on the importance of generating the +STOP+ symbols.

An advantage of using a Markov grammar is that breaking down the generation of the child nodes of a constituent into a series of steps helps combat data sparseness because it makes it possible to generate rules which have not occurred in the training data.

The term vertical markovisation is sometimes used when information from previously generated ancestor nodes is used as part of the local history in a parameter class.
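The way a Markov grammar assigns non-zero probability to unseen rules can be sketched as follows. This is a deliberately simplified, unlexicalised, zeroth-order illustration (each modifier is conditioned only on the parent label, the head label and the side), not the parameterisation of the Collins model; the function names and toy rules are hypothetical.

```python
from collections import Counter

STOP = "+STOP+"


def train(rules):
    """Collect counts from (parent, left_mods, head, right_mods) tuples.

    left_mods and right_mods are listed from the head outwards.
    A +STOP+ event is counted at the end of each side.
    """
    head_c, parent_c = Counter(), Counter()
    mod_c, ctx_c = Counter(), Counter()
    for parent, left, head, right in rules:
        parent_c[parent] += 1
        head_c[(parent, head)] += 1
        for side, mods in (("L", left), ("R", right)):
            for m in list(mods) + [STOP]:
                mod_c[(parent, head, side, m)] += 1
                ctx_c[(parent, head, side)] += 1
    return head_c, parent_c, mod_c, ctx_c


def rule_prob(rule, model):
    """Probability of one rule: head, then left mods, then right mods."""
    head_c, parent_c, mod_c, ctx_c = model
    parent, left, head, right = rule
    if parent_c[parent] == 0:
        return 0.0
    p = head_c[(parent, head)] / parent_c[parent]
    for side, mods in (("L", left), ("R", right)):
        ctx = ctx_c[(parent, head, side)]
        if ctx == 0:
            return 0.0
        for m in list(mods) + [STOP]:
            p *= mod_c[(parent, head, side, m)] / ctx
    return p


# Hypothetical training rules for VP with head child VB.
model = train([
    ("VP", (), "VB", ("NP",)),
    ("VP", (), "VB", ("NP", "PP")),
    ("VP", ("ADVP",), "VB", ()),
])

# VP -> ADVP VB NP never occurs as a whole rule in the training data,
# but every individual generation step has been observed.
unseen = ("VP", ("ADVP",), "VB", ("NP",))
p_unseen = rule_prob(unseen, model)
```

With these counts the unseen rule receives probability 1 × (1/4 × 3/4) × (2/6 × 3/6) = 1/32, whereas a plain relative frequency estimate over whole rules would give it zero.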

1.3 Coordination in the Baseline Model

[Figure 1.3 tree: parent $C_p$ with children $C_h$, $C_{r1}$ ($t_{r1}$ = CC), $C_{r2}$ (coord = 1)]

Figure 1.3: The basic form of a coordinated phrase. coord refers to the coordination flag.

In the Collins parsing model each node in a parse tree is annotated with a coordination flag, set to true if the node is conjoined to the head node of the phrase, and false otherwise. The head node of a coordinate phrase always precedes the coordinating conjunction, and the coordinating conjunction followed by the second conjunct always occurs to the right of the head conjunct. A coordinating conjunction node, followed by a conjunct, are generated together, unlike other constituents.

Take the tree fragment in Figure 1.3 where the POS tag $t_{r1}$ of the node following the head node is a coordinating conjunction. In such a case the node following the CC node will have its coordination flag (coord) set to true. The CC node will not be generated as with other modifier nodes. Instead, node $C_{r2}$ is generated after the head node $C_h$. Then the CC node is generated via a special CC parameter class, conditioned on the two conjuncts $C_h$ and $C_{r2}$.

Coordination is handled differently for base noun phrases. A base, or non-recursive, noun phrase (NPB) as defined in [Collins, 1999], is a noun phrase which does not directly dominate another noun phrase, unless that noun phrase is possessive. For nodes in base noun phrases all coordination flags are set to false and modifier nodes are generated in the usual fashion with no special treatment of CC nodes. The reason for the different handling of coordination in base NPs is not stated in Collins' thesis. However, NPBs are treated differently to other constituent types in several ways. For all nodes, with the exception of NPBs, a modifier node to the left or right of the head node is always conditioned on the head node. In contrast, for base noun phrases the modifier node is conditioned on the previously generated node. As discussed in [Bikel, 2004b], the previously generated node in NPBs is treated as a head node for the purpose of conditioning and it is as a consequence of this that coordinate NPBs are not handled like other coordinate phrases.

is discussed in detail in §4.3.1, where we propose an alternative way of handling coordination in NPBs.

1.4 Parameter Estimation

One of the core difficulties in empirical NLP is the sparse data problem: often there is not enough data collected to enable accurate estimation of the probabilities of low-frequency events. This is particularly true when collecting data on events which include individual words. Due to the sparseness of NLP data, the method of estimating the local distributions plays a very important role in building a good model.

Data sparseness makes the maximum likelihood estimate for lexicalised rule probabilities unreliable, especially if we are to include more features from the history. With maximum likelihood estimation there will be a very large number of cases of rules which are given a zero probability, when in fact they should really have some non-zero probability. In this sense we can say that maximum likelihood estimation causes overfitting: all the probability mass is distributed over the cases we have already seen, with no probability mass left for a completely new case. Clearly there is a need to generalise, and this is what smoothing does in effect. This is a crucial step in natural language parsing where the aggregate probability of the unseen or low probability events can be significant.

As maximum likelihood estimation is known to be unreliable for low or zero counts, a variety of smoothing techniques has been developed to improve estimates. Chen and Goodman [1996] present a useful survey of smoothing techniques for language modelling as well as a comprehensive comparison of several techniques. Toutanova et al. [2003] also compare different estimation techniques, including a memory-based technique, in a HPSG parsing model. We focus here on the estimation techniques used in this thesis: that of the baseline Collins parser (a type of linear interpolation using Witten-Bell smoothing) and then memory-based techniques.

1.4.1 Linear Interpolation and Witten-Bell Estimation

are som etim es referred to as backoff levels. T he idea is th a t when th ere is insufficient d a ta to estim ate th e more specific model, th en th e more general m odel m ight provide useful inform ation.

In linear interpolation estim ates for the probability of a class y (the future), given feature vector Xi (the history), where Xi is the history a t backoff level i, are interpo lated as follows:

where n is th e num ber of backoff levels, 0 < < 1 and Xi +\ is a feature vector less specific th a n Xi (i.e. w ith fewer features). T h a t is, th e sm oothed model is de fined recursively as a linear interpolation of the MLE of th e m ore specific m odel and th e sm oothed estim ate of the less specific model. The recursion ends by taking the sm oothed estim ate of th e most general level of backoff to be the m axim um likelihood estim ate (alternatively, the uniform distribution could be taken as th e final sm oothed m odel).

One simple bu t effective m ethod for calculating th e A values, which does not require extensive training, is th e m ethod used in [Collins, 1999], which was ad ap ted from [Bikel et al., 1997] and the sm oothing technique of [W itten and Bell, 1991].

Axi is defined in term s of count(Xt), which is th e num ber of tim es context Xi occurs in the corpus:

where C is a constant which can be optim ised using held-out d a ta . D{ Xi ) is the diversity of th e history Xi, th a t is th e num ber of distinct outcom es th a t have been seen w ith context X^ in the training sample. We can in terp re t these calculations intuitively as follows: w ith probability Xx^ we should use the higher order m odel and w ith probability 1 — \ x^, th e lower order model. If th e particular context Xi has a high frequency of occurrence then a high value for Ax^ is suitable because th e higher-order distrib u tio n will be reliable. If th e context has occurred very infrequently th en a low value for Xxi is appropriate. If th e context is highly diverse th en we have less tru s t in the higher-order m odel and more in th e lower-order one. This technique is som etimes

P i n t e r p i y l ^ i ) ^ X i ^ M L E i y l ^ i ) + (1 ~ ^Xi)Pinterp{y\Xi-\-l) P in te rp iy \ ) = PML EP{ y \ ^ n )

(1.1 1)

(1.1 2)

0 if count{Xi) = 0

i' if counti Xi) > 0

(35)

referred to as W itten-B ell interpolation.
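The recursion in (1.11) with the weights in (1.12) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the baseline parser's code: the class name `WittenBellInterpolator`, the representation of backoff contexts as tuples, the toy observations and the choice C = 1 are all hypothetical.

```python
from collections import Counter, defaultdict


class WittenBellInterpolator:
    def __init__(self, C=5.0):
        self.C = C                          # constant C in (1.12)
        self.ctx_counts = Counter()         # count(X_i)
        self.joint_counts = Counter()       # count(y, X_i)
        self.outcomes = defaultdict(set)    # distinct outcomes, for D(X_i)

    def observe(self, y, contexts):
        """Record one event. `contexts` lists the backoff contexts
        X_1..X_n for this event, most specific first."""
        for level, ctx in enumerate(contexts):
            key = (level, ctx)
            self.ctx_counts[key] += 1
            self.joint_counts[(key, y)] += 1
            self.outcomes[key].add(y)

    def _mle(self, y, key):
        n = self.ctx_counts[key]
        return self.joint_counts[(key, y)] / n if n else 0.0

    def _lam(self, key):
        n = self.ctx_counts[key]
        if n == 0:
            return 0.0
        return n / (n + self.C * len(self.outcomes[key]))

    def prob(self, y, contexts):
        """Smoothed estimate, computed from the most general level up."""
        last = len(contexts) - 1
        p = self._mle(y, (last, contexts[last]))
        for level in range(last - 1, -1, -1):
            key = (level, contexts[level])
            lam = self._lam(key)
            p = lam * self._mle(y, key) + (1 - lam) * p
        return p


# Hypothetical events: outcome y given (head word, tag) backing off to (tag,).
wb = WittenBellInterpolator(C=1.0)
wb.observe("NP", [("saw", "VB"), ("VB",)])
wb.observe("NP", [("saw", "VB"), ("VB",)])
wb.observe("PP", [("ran", "VB"), ("VB",)])
wb.observe("NP", [("ate", "VB"), ("VB",)])

# "PP" was never seen with ("saw", "VB"), but backoff gives it some mass.
p_seen_ctx = wb.prob("PP", [("saw", "VB"), ("VB",)])
# An unseen specific context has lambda = 0, so the estimate is pure backoff.
p_new_ctx = wb.prob("PP", [("slept", "VB"), ("VB",)])
```

In the first query the specific context has count 2 and diversity 1, so its weight is 2/3 and the estimate is (2/3)(0) + (1/3)(1/4) = 1/12; in the second the weight is 0 and the estimate is the backoff MLE of 1/4.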

1.4.2 k-Nearest Neighbour Parameter Estimation

The probability of a class $y$, given feature vector $X$, can be estimated using the $k$-NN algorithm as follows:

$$P(y \mid X) = \frac{\sum_{X' \in N_k(X)} w(\Delta(X, X'))\, \delta(y, y')}{\sum_{X' \in N_k(X)} w(\Delta(X, X'))} \qquad (1.13)$$

where $\Delta(X, X')$ is the distance function between feature vectors. $\delta(y, y') = 1$ iff $y = y'$, otherwise 0, with $y'$ the class of neighbour $X'$. $w(\Delta(X, X'))$ is the weight of neighbour $X'$ of $X$ where the weight is a function of the distance. $N_k(X)$ is the set of $k$-nearest neighbours of $X$.

For categorical variables the distance function often used is the overlap metric which simply counts the number of mismatching feature values between instances $X$ and $X'$:

$$\Delta(X, X') = \sum_{i=1}^{n} d(x_i, x'_i) \qquad (1.14)$$

where $d(x_i, x'_i) = 0$ iff $x_i = x'_i$, else 1. $\Delta(X, X')$ is the distance between instances $X$ and $X'$, represented by $n$ features, and $d$ is the distance per feature. In effect, the weighting function $w(\Delta(X, X'))$ turns the distance into a measure of nearness, or similarity. A popular weighting function is the inverse distance function:

$$w(\Delta(X, X')) = \frac{1}{(\Delta(X, X') + 1)^m} \qquad (1.15)$$

for some constant $m$.
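Equations (1.13) to (1.15) can be combined in a few lines of Python. The sketch below assumes that the neighbourhood is simply the k nearest training instances, with ties broken by training order; memory-based learning toolkits may define k differently (for example over distinct distances). The function names and toy instances are hypothetical.

```python
def overlap_distance(x, x_prime):
    """Overlap metric (1.14): number of mismatching feature values."""
    return sum(a != b for a, b in zip(x, x_prime))


def inverse_distance_weight(dist, m=1):
    """Inverse distance weighting (1.15)."""
    return 1.0 / (dist + 1) ** m


def knn_prob(y, x, training, k=3, m=1):
    """Estimate P(y | x) as in (1.13).

    `training` is a list of (feature_tuple, label) pairs. The k
    nearest instances are used; sorted() is stable, so ties keep
    their training order (an assumption of this sketch).
    """
    neighbours = sorted(
        training, key=lambda inst: overlap_distance(x, inst[0])
    )[:k]
    weighted = [
        (inverse_distance_weight(overlap_distance(x, feats), m), label)
        for feats, label in neighbours
    ]
    total = sum(w for w, _ in weighted)
    if total == 0:
        return 0.0
    return sum(w for w, label in weighted if label == y) / total


# Hypothetical training instances: (parent, tag, word) -> next outcome.
training = [
    (("NP", "NN", "dog"), "CC"),
    (("NP", "NN", "cat"), "CC"),
    (("NP", "NNS", "dogs"), "STOP"),
    (("VP", "VB", "run"), "STOP"),
]
query = ("NP", "NN", "dog")
p_cc = knn_prob("CC", query, training, k=3, m=1)
p_stop = knn_prob("STOP", query, training, k=3, m=1)
```

Here the three nearest neighbours lie at distances 0, 1 and 2, giving weights 1, 1/2 and 1/3, so the estimate for CC is (1 + 1/2) / (1 + 1/2 + 1/3) = 9/11 and for STOP it is 2/11.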

1.4.3 Similarity for Smoothing

In this thesis, as well as using $k$-NN for parameter estimation as in (1.13), we use a variation which calculates the similarity function directly, rather than calculating the distance and then converting this to a similarity function:

$$P(y \mid X) = \frac{\sum_{X' \in N(X)} sim(X, X')\, \delta(y, y')}{\sum_{X' \in N(X)} sim(X, X')} \qquad (1.16)$$

where $sim(X, X')$ is a similarity score between instances $X$ and $X'$.

If we group history samples together so that $n_j$ is the history type of $X_j$ (that is, $n_j$ refers to those history feature vectors which have the same values for each feature as $X_j$), and let $count(n_j)$ be the number of history samples of type $n_j$ in the data set, we can rewrite (1.16) as:

$$P(y \mid n_j) = \frac{\sum_{n_x \in N(n_j)} sim(n_j, n_x)\, count(y, n_x)}{\sum_{n_x \in N(n_j)} sim(n_j, n_x)\, count(n_x)} \qquad (1.17)$$

where $count(y, n_x)$ is the number of times future $y$ occurs with history type $n_x$, $sim(n_j, n_x)$ is a similarity score between types $n_j$ and $n_x$, and $N(n_j)$ is the set of types in the neighbourhood of $n_j$. This is the form of our bilexical estimate in §4.3.2 where we use a measure of similarity between words for smoothing.
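The type-level estimate in (1.17) can be sketched as follows. Everything concrete here is an assumption for illustration: the function name `similarity_estimate`, the hand-written similarity table, the toy counts, and the choice to include the type itself in its own neighbourhood. None of it reproduces the similarity measure or data of §4.3.2.

```python
def similarity_estimate(y, n_j, neighbourhood, sim, count_joint, count_type):
    """Similarity-weighted estimate of P(y | n_j), as in (1.17).

    neighbourhood: iterable of history types n_x in N(n_j)
    sim:           function (n_j, n_x) -> similarity score
    count_joint:   dict mapping (y, n_x) -> count(y, n_x)
    count_type:    dict mapping n_x -> count(n_x)
    """
    num = sum(
        sim(n_j, n_x) * count_joint.get((y, n_x), 0)
        for n_x in neighbourhood
    )
    den = sum(
        sim(n_j, n_x) * count_type.get(n_x, 0)
        for n_x in neighbourhood
    )
    return num / den if den else 0.0


# Hypothetical similarity scores between the word "cat" and its neighbours.
SIM = {("cat", "cat"): 1.0, ("cat", "dog"): 0.5, ("cat", "car"): 0.1}


def toy_sim(a, b):
    return SIM.get((a, b), 0.0)


# Hypothetical counts: how often each history type occurs, and how often
# it occurs with the future "pet".
count_type = {"cat": 2, "dog": 4, "car": 10}
count_joint = {("pet", "cat"): 1, ("pet", "dog"): 3}

p_pet = similarity_estimate(
    "pet", "cat", ["cat", "dog", "car"], toy_sim, count_joint, count_type
)
```

The numerator is 1.0·1 + 0.5·3 + 0.1·0 = 2.5 and the denominator is 1.0·2 + 0.5·4 + 0.1·10 = 5.0, giving 0.5. The sparse evidence for "cat" alone (1 out of 2) is thus pooled with evidence from similar types in proportion to their similarity.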

1.5 Chapter by Chapter Guide to this Thesis

The work in this thesis began by replicating the state-of-the-art parser of [Collins, 1999] Model 1 and then altering this baseline model so that it used memory-based learning for parameter estimation. We then focused our attention on coordinate noun phrase disambiguation as this was the worst performing area of the parser. Our experiments on noun phrase coordination disambiguation began with an analysis of the errors produced by the memory-based model, leading us to look at inconsistencies in treebank annotation as a source of error. Noticing also a marked tendency toward parallelism across conjuncts, we then explored this area by first measuring empirically the extent of symmetry across conjuncts and, based on positive evidence of the same, we experimented with incorporating a bias toward symmetry in conjunct structure into the probability model. Our analysis of errors also led us to experiment with new head-finding rules for base noun phrases. We noticed too on inspection that many conjoined nouns appeared to be semantically similar and this motivated us to carry out experiments with different measures of similarity between conjoined nouns on the training set. In our final set of experiments we focused on modelling the likelihood of two nouns conjoining and reducing the sparsity for this parameter class, developing a similarity-based parameter estimation technique.

we do not always present the work in the thesis in chronological order of experiments carried out. The remainder of the thesis is arranged as follows.

Chapter 2 begins with an overview of history-based approaches to statistical natural language parsing, followed by a brief look at recent approaches to discriminative reranking. This is followed by a summary of memory-based techniques in natural language processing that are most relevant to the work in this thesis. We also look at previous attempts to use similarity measures for smoothing bilexical probability estimates. Finally, we discuss previous approaches to coordination disambiguation.

Chapter 3 presents our generative parsing model, with k-nearest neighbour parameter estimation. We describe a technique based on constraint features to reduce the size of the training set for each parameter class which helps both with accuracy and speed. We also show how we combine k-nearest neighbour with linear interpolation for bilexical statistics and present results which achieve state-of-the-art accuracy for generative models.

Chapter 4 begins our focus on coordinate noun phrase disambiguation and is divided into two main parts. The first introduces our distributional word similarity measure and compares it with several existing measures of word similarity, testing whether the various measures can detect similarity between the head nouns in coordinate noun phrases. The second part of this chapter concentrates on modelling the likelihood of two nouns conjoining, designing a new parameter class for use in both coordinate noun phrases and coordinate base noun phrases. In the estimation of this parameter, data from the unlabelled British National Corpus are used in addition to WSJ data. We use a word graph to store the training data and incorporate our word similarity measure in the estimation of the parameter in order to reduce data sparseness.

Chapter 5 begins with empirical measurements of the extent to which parallelism in the syntactic structure of conjuncts exists. We then design new parameter classes for the generative model which attempt to capture the parallelism effect and thus allow the model to learn a bias toward symmetry in conjuncts.


base NPs.

Chapter 7 shows how we evaluate the experiments on coordination disambiguation and gives the details of the experiments carried out. We outline the effects of each different experiment and discuss the results.

Chapter 2

Previous Work

2.1 Introduction

In this chapter we summarise previous work most related to the work in this thesis. First, in Section 2.2, we trace the development of history-based parsing and the current state-of-the-art parsers, upon which our baseline model is based. Unless otherwise stated parser accuracy is reported on section 23 of the Penn WSJ treebank for sentences < 100 words. In Section 2.3 we give a brief outline of recent work in discriminative reranking and n-best parsing. Section 2.4 moves on to memory-based learning of natural language. We summarise previous work on why memory-based learning is suited to natural language learning tasks and briefly outline previous work on parsing that comes under the broad category of memory-based learning. In Section 2.5 we turn to similarity for smoothing, first presenting previous work on smoothing with the memory-based learning algorithm k-NN, and then looking at somewhat related work in nearest neighbour cooccurrence smoothing. Finally, in Section 2.6, we review previous work

2.2 Developments in History-Based Statistical Parsing

2.2.1 [Magerman and Marcus, 1991, Magerman and Weir, 1992, Black et al., 1992]

Some early work in overcoming the structural weakness inherent in the independence assumptions of the PCFG was that of [Magerman and Marcus, 1991, Magerman and Weir, 1992]. The Picky parser, and its predecessor Pearl, differed from previous work on probabilistic parsing in that a hand-crafted context-free grammar was modelled with context-sensitive conditional probabilities trained from a corpus. In the probabilistic model the probability of each parse tree $T$ given a sentence $S$ was defined as:

$$P(T \mid S) = \prod_{A \in T} P(A \rightarrow \alpha \mid C \rightarrow \beta A \gamma,\ a_0, a_1, a_2) \qquad (2.1)$$

where $A$ is the non-terminal being expanded, $C$ is the non-terminal node which immediately dominates $A$, $a_1$ is the part-of-speech of the left-most word of constituent $A$, and $a_0$ and $a_2$ are the POS tags of the words to the left and right of $a_1$, respectively.

Black et al. [1992] were the first to develop the concept of the history-based model, which is distinguished from the context-free model in that for each constituent structure the conditioning was extended to look at potentially all previously built structure, rather than just the non-terminal being expanded as in PCFGs. As outlined in Section 1.2.3, in history-based models history is interpreted as any element of the parse tree which has already been determined and can include previous words, non-terminal labels, constituent structure, and any other linguistic information which is generated as part of the parse structure. In Black et al. [1992]'s generative model each constituent in the parse tree was associated with the following probability:

$$P(Syn, Sem, R, H_1, H_2 \mid Syn_p, Sem_p, R_p, I_{pc}, H_{1p}, H_{2p}) \qquad (2.2)$$

where $Syn$ and $Sem$ are syntactic and semantic labels associated with the constituent, $R$ is the constituent's re-write rule, and $H_1$ and $H_2$ are two lexical heads associated with the constituent. These are conditioned on the syntactic and semantic labels, re-write

of $R_p$. This probability is decomposed into the product of five probabilities, of which all, bar one, are estimated using deleted interpolation. The remaining component probability is estimated using decision trees. The introduction of lexical information is noteworthy as most subsequent high-performing, broad coverage parsers use some degree of lexicalisation. Words were not represented as individual tokens but rather as bit strings via the clustering algorithm of [Brown et al., 1990].

2.2.2 [Jelinek et al., 1994, Magerman, 1994, 1995]

The parsing model developed by Jelinek et al. [1994] and extended in Magerman [1994] framed the natural language parsing task as one of treebank recognition. Unlike previous parsing models, which depended on carefully hand-crafted grammars, the model is presented with a treebank from which to learn and, given a sentence to parse, the task is to recognise the parse tree for the sentence that would be given it by a treebank annotator. The parsing model is a history-based conditional model. Unlike other history-based models, where a tree is associated with just one unique derivation, multiple derivations are possible and the probability of a tree is the sum of the probabilities for the various derivations of the tree.

Each decision made when building a particular parse derivation is conditional on decisions previously made within a certain window around the current node. Nodes in a parse tree are associated with various features and a parse tree is constructed by generating values for features of the tree nodes, bottom-up, one at a time, according to the distributions assigned by statistical models. The features for terminal nodes are the head word, head tag, and extension, where the extension feature connects the nodes in the tree and encodes the tree's shape. Internal nodes have the additional feature of the non-terminal label.

to derivations for which the local context at the various decisions was inconclusive or misleading. In the SPATTER parser of [Magerman, 1994] there was also a conjunction model in order to help predict the scope of conjunctions. Each node in the tree was associated with an additional boolean coordination flag, set to true for a particular constituent when the constituent is part of a conjoined phrase. As in [Black et al., 1992] words are represented as bit strings. The version of SPATTER described in [Magerman, 1994] is trained and tested on the IBM Computer Manuals domain. Magerman [1995] gives results of a version of the SPATTER parser, which does not include the derivation model, trained and tested on the WSJ corpus.

2.2.3 [Collins, 1996, 1997, 1999]

Collins [1996] presents a conditional parsing model where parse trees are lexicalised and represented as a set of head-modifier dependency relationships and a set of base noun phrases. Parameters are estimated using relative frequencies and a variation of the deleted interpolation method for smoothing described in [Jelinek, 1990]. Though a much simpler model than [Magerman, 1995], Collins' dependency model achieved a higher accuracy of 85.3%/85.7% labelled precision and recall on section 23. Mathematical shortcomings in the model, as well as some limitations due to parse representation, led to the improved generative model of [Collins, 1997], with some extra refinements reported in [Collins, 1999]. Collins [1997, 1999] presents three history-based generative models. The parsing model explored in this thesis is derived from Collins' Model 1. All three models are generative, lexicalised parsing models with first-order Markov grammar generation of nodes. Nodes in the parse tree are annotated with a coordination and punctuation flag, in addition to head word and head word part-of-speech information. For a more detailed description of the handling of coordination in the Collins generative model see §1.3. Model 2 adds a suffix 'C' to all non-terminals which are complements. In addition, a new parameter class for the generation of subcategorisation frames is introduced. Before the generation of a modifier non-terminal, its subcategorisation frame is generated, which is then used as a conditioning feature for the generation of the non-terminal label, head-word and so on. Finally, Model 3 integrates a probabilistic treatment of traces and Wh-movement into the parsing model,