• No results found

Corpus Based Grammar Specialization

N/A
N/A
Protected

Academic year: 2020

Share "Corpus Based Grammar Specialization"

Copied!
6
0
0

Loading.... (view fulltext now)

Full text

(1)

In:

Proceedings of CoNLL-2000 and LLL-2000,

pages 7-12, Lisbon, Portugal, 2000.

Corpus-Based Grammar Specialization

N i c o l a C a n c e d d a a n d C h r i s t e r S a m u e l s s o n Xerox Research C e n t r e E u r o p e

6, c h e m i n de M a u p e r t u i s 38240, Meylan, F r a n c e

{Nicola. Cancedda, Christer. Samuelsson}@xrce. xerox, com

A b s t r a c t

Broad-coverage grammars tend to be highly am- biguous. W h e n such grammars are used in a restricted domain, it may be desirable to spe- cialize them, in effect trading some coverage for a reduction in ambiguity. G r a m m a r specializa- tion is here given a novel formulation as an opti- mization problem, in which the search is guided by a global measure combining coverage, ambi- guity and g r a m m a r size. The method, applica- ble to any unification g r a m m a r with a phrase- structure backbone, is shown to be effective in specializing a broad-coverage LFG for French. 1 I n t r o d u c t i o n

Expressive g r a m m a r formalisms allow g r a m m a r developers to capture complex linguistic gener- alizations concisely and elegantly, thus greatly facilitating g r a m m a r development and main- tenance. Broad-coverage grammars, however, tend to overgenerate considerably, thus allowing large amounts of spurious ambiguity. If the ben- efits resulting from more concise grammatical descriptions are to outweigh the costs of spuri- ous ambiguity, the latter must be brought down. We here investigate a corpus-based compilation technique that reduces overgeneration and spu- rious ambiguity without jeopardizing coverage or burdening the g r a m m a r developer.

The current work extends previous work on corpus-based g r a m m a r specialization, which applies variants of explanation-based learning (EBL) to grammars of natural languages. The earliest work (Rayner, 1988; Samuelsson and Rayner, 1991) builds a specialized g r a m m a r by

Subsequent work (Rayner and Carter, 1996; Samuelsson, 1994) views the problem as that of cutting up each tree in a treebank of cor- rect parse trees into subtrees, after which the rule combinations corresponding to the subtrees determine the rules of the specialized gram- mar. This approach reports experimental re- sults, using the SRI Core Language Engine, (Alshawi, 1992), in the ATIS domain, of more t h a n a 3-fold speedup at a cost of 5% in gram- matical coverage, the latter which is compen- sated by an increase in parsing accuracy. Later work (Samuelsson, 1994; Sima'an, 1999) at- tempts to automatically determine appropriate tree-cutting criteria, the former using local mea- sures, the latter using global ones.

The current work reverts to the view of EBL as chunking g r a m m a r rules. It extends the latter work by formulating grammar special- ization as a global optimization problem over the space of all possible specialized grammars with an objective function based on the cover- age, ambiguity and size of the resulting gram- mar. The m e t h o d was evaluated on the LFG g r a m m a r for French developed within the PAR- GRAM project (Butt et al., 1999), but it is applicable to any unification g r a m m a r with a phrase-structure backbone where the reference treebank contains all possible analyses for each training example, along with an indication of which one is the correct one.

(2)

parses; it only removes existing ones.

T h e rest of this paper is organized as follows: Section 2 describes the initial candidate gram- mar and the operators used to generate new candidate g r a m m a r s from any given one. T h e function to be maximized is introduced and mo- tivated in Section 3. T h e folded treebank repre- sentation is described in Section 4, while Sec- tion 5 presents the experimental results. 2 U n f o l d i n g a n d S p e c i a l i z a t i o n

T h e initial g r a m m a r is the g r a m m a r underly- ing the subset of correct parses in the training set. This is in itself a specialization of the gram- mar which was used to parse the treebank, since some rules m a y not show up in any correct parse in the training set; e x p e r i m e n t a l results for this first-order specialization are r e p o r t e d in (Can- cedda and Samuelsson, 2000). This g r a m m a r is further specialized by inhibiting rule combi- nations t h a t show up in incorrect parses much more often t h a n in correct parses.

In more detail, we considered downward un- folding of g r a m m a r rules (see F i g . l ) 3 A gram- m a r rule is unfolded downwards on one of the symbols in its right-hand side if it is replaced by a set of rules, each corresponding to the ex- pansion of the chosen symbol by means of an- other g r a m m a r rule. More formally, let G = (E, EN, S, R) be a context-free g r a m m a r , and let r , r ' C R, k E .M + such t h a t rhs(r) = aAfl, lal = k - 1, lhs(r') = A, rhs(r') = V. T h e rule adjunction of r I in t h e k th position of r is defined as a new rule RA(r, k, r ~) = r ' , such that:

lhs(r") = lhs(r) rhs(r") = aVfl

For unification grammars, we instead require lhs(r') U rhs(r)(k)

lhs(r 1') = O(lhs(r)) rhs(r") = O(oLTfl )

where rhs(r)(k) is the k t h symbol of rhs(r), where X t3 Y indicates t h a t X and Y unify, and where 0 is the most general unifier of lhs(r ~) and rhs(r)(k).

T h e downward rule unfolding of rule r on its k th position is t h e n defined as:

DRU(r, k) =

1The converse operation, upward unfolding, was not used in the current experiments.

{r'[3r"[r' = RA(r, k, r")]} if ¢ 0

= {r} otherwise

It is easy to see t h a t if all r I E DRU(r, k) are retained t h e n the new g r a m m a r has exactly the same coverage as the old one. Once the rule has been unfolded, however, the g r a m m a r can be specialized. This involves inhibiting some rule combinations by simply removing the cor- responding newly created rules. Any subset X C_ D R U ( r , k ) is called a downward special- ization of rule r on the k th element of its rhs.

Given a g r a m m a r , all possible (downward) unfoldings of its rules are considered and, for each unfolding, the specialization leading to the best increase in t h e objective function is deter- mined. T h e set of all such best specializations defines the set of c a n d i d a t e successor grammars. In the experiments, a simple hill-climbing algo- r i t h m was adopted. O t h e r iterative-refinement schemes, such as simulated annealing, could eas- ily be implemented.

3 T h e O b j e c t i v e F u n c t i o n

Previous research approached the task of de- termining which rule combinations to allow ei- ther by a process of m a n u a l trial and error or by statistical measures based on a collection of positive examples only: if the original g r a m m a r produces more t h a n a single parse of a sentence, only the "correct" parse was stored in the tree- bank. However, we here also have access to all incorrect parses assigned by the original gram- mar. This in t u r n m e a n s t h a t we do not need to estimate ambiguity t h r o u g h some correlated statistical indicator, since we can measure it di- rectly simply by checking which parse trees are licensed by every new c a n d i d a t e g r a m m a r G. T h e r e are m a n y possible ways of combining the counts of correct a n d incorrect parses in a suit- able objective function. For t h e sake of sire- plicity we opted for a linear combination. How- ever, simply maximizing correct parses and min- imizing incorrect ones would most likely lead to overfitting. In fact, a g r a m m a r with one large flat rule for each correct parse in t h e treebank would achieve a very high score during training, b u t most likely p e r f o r m poorly on unseen data. A way to avoid overfitting consists in penalizing large g r a m m a r s by introducing an appropriate t e r m in t h e linear combination. T h e objective

(3)

A • C

A-oc A_oc

B "-'~" D A "-'P" E F C A ---~" E F C

B---P- E F A ~

B ~ G S p e c i a l i z a t i o n

D o w n w a r d unfolding o f A -> B C o n " B "

A

C

j

B ---~ A D B " ' ~ " B C C " ' D " E B C B ' ' ' ~ A C " " - ~ E B C

C ' ' ' ~ E A Specialization

Upward unfolding

o f A -> B C

Figure 1: Schematic examples of upward and downward unfolding of rules. function Score was thus formulated as follows:

Scorea = Acorr Corra -- Aine InCa - ~size S i z e a

where Corr and Inc are the n u m b e r of correct and incorrect parses allowed by the grammar, and Size is the size of the g r a m m a r measured as the total n u m b e r of s y m b o l occurrences in the right-hand sides of its rules. Acorr and Ainc are weights controlling the pruning aggressiveness: their ratio Acorr/Ainc intuitively corresponds to the n u m b e r of incorrect trees a specialization must disallow for each disallowed correct tree, if the specialization is to lead to an improve- ment over the current grammar. The lower this ratio is, the more aggressive the pruning is. The relative value of ;~size with respect to the other As also controls the d e p t h to which the search is conducted: most specializations result in an in- crease in g r a m m a r size, which tends to be more and more significant as the n u m b e r and the size of rules grows; a larger Asize thus has the effect of stopping the search earlier. Note that only two of the three weights are independent. 4 T r e e b a n k R e p r e s e n t a t i o n

A folded treebank is a representation of a set of parse trees which allows an immediate as- sessment of the effects of inhibiting specific rule combinations. It is based on the idea of "fold- ing" each tree onto a representation of the gram- mar itself. Any p h r a s e - s t r u c t u r e g r a m m a r can be represented as a c o n c a t e n a t i o n / o r graph - - a directed b i p a r t i t e multigraph with an or-node for each s y m b o l and a concatenation-node for each rule in the grammar. The present de- scription covers context-free grammars, b u t the scheme can easily be e x t e n d e d to any unifica- tion g r a m m a r with a context-free b a c k b o n e b y

• ~/~ C EN × R s.t. ( A , r / E r/~ i f f A = lhs(r) • r~R : R × Af + ~ E s.t. r l R ( r , i) = X iff

rhs(r) = f i X % ]/3[ = i - 1

Figure 2 shows the correspondence between a simple g r a m m a r fragment and its concatena- t i o n / o r graph.

Each tree can b e represented by folding it onto the c o n c a t e n a t i o n / o r graph representing the g r a m m a r it was derived with, or, in other words, by a p p r o p r i a t e l y annotating the graph itself. If N is the set of nodes of a parse tree o b t a i n e d using g r a m m a r G, the corresponding folded tree is a partial function f

f : N x N - - - ~ R x A f +

such t h a t f ( n , n ' ) = (r, k) implies that node n was e x p a n d e d using rule r, and that node n' is its k th daughter in the tree (Fig.3). In the fol- lowing, we will use the inverse image of (r, k) un- der f , which we denote ¢(r, k) : f - l ( ( r , k)) :

{ ( n , n ' ) l f ( n , n ' ) : ( r , k ) } C g × N . This can in t u r n b e seen as a partial function

¢ : R × Jkf+ ~ 2 N×N

Disallowing the expansion of the k th element in the right-hand side of rule r by means of rule

r' (assuming symbols match, i.e.,

(r/R(r, k), r') E

rl2 ) results in suppressing a tree where:

3n, n', n" E N, k' E .hf +

[<n,

e ¢(r, k) A <n', n"> • ¢(r', k')]

[image:3.596.68.532.76.169.2]
(4)

A

nO

nl ~ B

n2 •

c ~ n4

f n3

c On5

• n7 n8

f e

¢ ( r l , 1) = {<no,n1), <rt4,rt5>} ¢ ( r l , 2) = {<no, n2), (n4, n6)} ¢(r2, 1) = 0

¢(r2, 2) = 0

¢(r3, 1) = {(n2, n3)}

¢ ( r 3 , 2) = (<n2, n4>}

¢(r4,

I)

= {<n6,

nT>}

¢ ( r 4 , 2) -~ {(n6, n8) }

A

r 1 --% ~ r2

O ^ ' ~ " O

C x \

e . ~ ~;, . e

r ~ ~o"

~ ; ! ' ~ r 4

Figure 3: A tree and its folded representation. the whole tree. T h e worst-case complexity is

still linear in the size of the tree, b u t in prac- tice, the n u m b e r of nodes e x p a n d e d using any given rule is m u c h smaller t h a n the total num- ber of nodes.

W h e n e v e r a specialization is performed, all folded trees t h a t are no longer licensed are removed; the c o n c a t e n a t i o n / o r g r a p h for the g r a m m a r is u p d a t e d to reflect t h e i n t r o d u c t i o n a n d the elimination of rules; a n d the annota- tions on the affected edges are a p p r o p r i a t e l y re- combined a n d distributed. If the p e r f o r m e d spe- cialization is X C_ DRU(r, k) ~ {r}, 3 t h e n the c o n c a t e n a t i o n / o r g r a p h is u p d a t e d as follows

= nux\{~)

~r = ~r U {(lhs(r),~)l~ E X } \ { ( l h s ( r ) , r ) }

~ ( r " , i) =

{ VR(r",i),

r" ¢ X

~]R(r,i), r u E X , i < k

= VR(r',i -- k + 1), r" = a A ( r , k, r') E X,

k < i < k + m - 1 ,

~R(r, i -- m + 1), r" = RA(r, k, r') E X,

i > k + m - 1 ,

where m = arity(r') = Irhs(r')l is the n u m b e r of right-hand-side symbols of rule r ~. For each tree t h a t is not eliminated as a consequence of

3If X = DRU(r, k) = {r}, then no update is needed.

the specialization we have

~(~",~)

=

¢(r",i),

if r" ¢~ X , r" ¢ r'

¢(~",/) k

(<n',n">13n[<n,n'> e

¢(~, k)]),

if r" = r ~

{(n,

n">13n', n", k'[<n, n'> ~ ¢(~, k)

A ( n ' , n " ) E ¢ ( r ' , k ' ) A ( n , n ' " ) E ¢(r,i)]} if r" = R A ( r , k , r ~) E X , i < k

{(n, n")13n'[(n, n') E ¢(r, k)A (n', n") E ¢(r', i -- k + 1)]}

i f r " = R A ( r , k , r ~) E X , k < i < k + m - 1 ,

{ ( n , n " ) 1 3 n ' , n " , k ' [ ( n , n ' ) E ¢ ( r , k ) A

<n', n"> e ¢(~', k')A

(n, n'") E ¢(r, i - m + 1)]}

if r" = R A ( r , k, r ~) E X , i > k + m - 1,

where again m = arity(r'). T h e s e u p d a t e s can be i m p l e m e n t e d efficiently, requiring neither a traversal of the tree nor of the g r a m m a r . 5 E x p e r i m e n t a l R e s u l t s

We specialized a broad-coverage L F G g r a m m a r for French on a corpus of technical d o c u m e n - t a t i o n using the m e t h o d described above. T h e t r e e b a n k consisted of 960 sentences which were all k n o w n to be covered by the original gram- mar. For each sentence, all t h e trees r e t u r n e d by

[image:4.596.52.544.86.290.2]
(5)

R - - { r l , r2, r3, r 4 } s . t .

rl: A - ~ c B r2: A --+ e A ra: B - + f A r4: B - a r E

~2 = {(A, r l ) , (A, r2}, {B, r3), (B, r4)}

C

fiR: (rl, 1) ~ c (rl, 2) ~ B

(r2,1) -+ e

(r2, 2) ~ A (r3, 1) ~ f (r3, 2) ~ A

(r4,1) I

@4, 2) --~ e

A

r l

r 2

® ®

/

e

/

r 3

\ / r 4

f

Figure 2: A tiny g r a m m a r and the correspond- ing c o n c a t e n a t i o n / o r graph.

the original g r a m m a r were available, together with a manually assigned indication of which was the correct one. T h e environment used, the Xerox Linguistic E n v i r o n m e n t (Kaplan and Maxwell, 1996) implements a version of opti- mality t h e o r y where parses are assigned "opti- mality marks" based on a n u m b e r of criteria, and are ranked according to these marks. T h e set of parses with the best marks are called the

optimal parses

for a sentence. T h e correct parse was also an optimal parse for 913 out of 960 sen- tences. Given this, the specialization was aimed at reducing the n u m b e r of optimal parses per sentence.

We ran a series of ten-fold cross-validation ex- periments; the results are s u m m a r i z e d in the ta- ble in Fig.4. T h e first line contains values for the original g r a m m a r . T h e second line contains measures for the first-order pruning grammar, i.e., the g r a m m a r with all and only those rules actually used in correct parses in the training set, with no combination inhibited. Lines 3 and 4 list results for fully specialized grammars. Re- sults in the t h i r d line were obtained with a value for

~corr

equal to 15 times the value of Ainc in the objective function: in other words, during training we were willing to lose a correct parse only if at least 15 incorrect parses were canceled as well. Results in the f o u r t h line were obtained when this ratio was reduced to 10. T h e average n u m b e r of parses per sentence is reported in the first column, whereas the second lists the av- erage n u m b e r of optimal parses.

Coverage was

measured as t h e fraction of sentences which still receive the correct parse with the specialized g r a m m a r . To assess the t r a d e off between cover- age and ambiguity reduction, we c o m p u t e d the F-score 4 considering only optimal parses when computing precision. This measure should not be confused with the F-score on labelled brack- eting r e p o r t e d for m a n y stochastic parsers; here precision and recall concern perfect matching of whole trees. Recall is t h e same as coverage: the ratio between the n u m b e r of correct parses pro- duced by the specialized g r a m m a r and the to- tal n u m b e r of correct parses (equalling the total n u m b e r of sentences in the test set). Precision is the ratio between the n u m b e r of correct parses p r o d u c e d by the specialized g r a m m a r and the total n u m b e r of parses p r o d u c e d by the same g r a m m a r . T h e f o u r t h column lists values for the F-score w h e n equal weight is given to precision and recall. Intuitively, however, in m a n y cases missing the correct parse is more of a problem t h a n r e t u r n i n g spurious parses, so we also com- p u t e d the F-score with a m u c h larger emphasis on recall, i.e., with a = 0.1. T h e corresponding values are listed in the last column.

T h e average n u m b e r of parses per sentence, b o t h optimal and non-optimal, decreases signif- icantly as more and more aggressive specializa- tion.is carried out, and consequently, more cov- erage is lost. T h e most aggressive form of spe-

[image:5.596.64.267.149.489.2]
(6)

Avg.p/s Avg. o.p./s. Coverage (%) F(a-- 0.5) F(c~-- 0.1)

orig. 1941 4.69 100 35.15 73.05

f.o.pruning 184 3.38 89 40.64 71.89

Acorr=lh)~inc 82 2.23 86 53.25 76.58

~corr----lO)kin c 63 2.03 82.5 54.46 74.80

Figure 4: Results of the cialization gives the highest F-score for c~ = 0.5, whereas somewhat more conservative parame- ter settings lead to a better F-score when re- call is valued more. A speedup of a factor 4 is achieved already by first-order pruning and remains approximately the same after further specialization.

6 Conclusions

Broad-coverage grammars tend to be highly am- biguous, which may constitute a serious prob- lem when using them for natural-language pro- cessing. Corpus-independent compilation tech- niques, although useful for increasing efficiency, do little in terms of reducing ambiguity.

In this paper we proposed a corpus-based technique for specializing a grammar on a do- main for which a treebank exists containing all trees returned for each sentence. This tech- nique, which builds extensively on previous work on explanation-based learning for NLP, consists in casting the problem as an optimiza- tion problem in the space of all possible spe- cializations of the original grammar. As initial candidate grammar, the first-order pruning of the original grammar is considered. Candidate successor grammars are obtained through the downward rule unfolding and specialization op- erator, that has the desirable property of never causing previously unseen parses to become available for sentences in the training set. Can- didate grammars are then assessed according to an objecting function combining grammar am- biguity and coverage, adapted to avoid overfit- ting. In order to ensure efficient computability of the objective function, the treebank is pre- viously folded onto the grammar itself. Exper- imental results using a broad-coverage lexical- functional grammar of French show that the technique allows effectively trading coverage for ambiguity reduction. Moreover, the parameters of the objective function can be used to control the trade off.

specialization experiments. A c k n o w l e d g e m e n t s

We would like to thank the members of the MLTT group at the Xerox Research Centre Eu- rope in Grenoble, France, and the three anony- mous reviewers for valuable discussions and comments. This research was funded by the Eu- ropean T M R network Learning Computational Grammars.

R e f e r e n c e s

Hiyan Alshawi, editor. 1992. The Core Language Engine. MIT Press.

M. Butt, T.H. King, M.E. Nifio, and F. Segond. 1999. A Grammar Writer's Cookbook. CSLI Pub- lications, Stanford, CA.

Nicola Cancedda and Christer Samuelsson. 2000. Experiments with corpus-based lfg specialization. In Proceedings o/ the NAACL-ANLP 2000 Con- ference, Seattle, WA.

Ronald Kaplan and John T. Maxwell. 1996. LFG grammar writer's workbench. Technical report, Xerox PARC. Available on-line as ftp://ftp.parc.xerox.com/pub/lfg/lfgmanual.ps. Manny Rayner and David Carter. 1996. Fast pars-

ing using pruning and grammar specialization. In Proceedings of the ACL-96, Santa Cruz, CA. Manny Rayner. 1988. Applying explanation-based

generalization to natural-language processing. In Proceedings o/ the International Con/erence

on Fifth Generation Computer Systems, Tokyo, Japan.

Christer Samuelsson and Manny Rayner. 1991. Quantitative evaluation of explanation-based learning as an optimization tool for a large-scale natural language system. In Proceedings o/the IJCAI-91, Sydney, Australia.

Christer Samuelsson. 1994. Grammar specialization through entropy thresholds. In Proceedings of the ACL-94, Las Cruces, New Mexico. Available as cmp-lg/9405022.

Khalil Sima'an. 1999. Learning Efficient Dis- ambiguation. Ph.D. thesis, Institute for Logic, Language and Computation, Amsterdam, The Netherlands.

[image:6.596.69.500.76.145.2]

Figure

Figure 1: Schematic examples of upward and downward unfolding of rules.
Figure 3: A tree and its folded representation.
Figure 2: A tiny grammar and the correspond- ing concatenation/or graph.
Figure 4: Results of the specialization experiments.

References

Related documents

This entails sending measurement data in a clear presentation directly to a printer or else generating a PDF document. Calipri laser device measure various parameters of

• Upgrade on a 4 to 8 night Royal Caribbean sailing for two, excluding Oasis of the Seas® and Allure of the Seas®, June - August (interior stateroom to minimum available

In the present study, we measured calcification rates during a daily cycle of a 12 · h:12 · h light:dark periods under constant conditions of light, and we performed

Later, Losonczi [ 9 ] proved the local stability of the additive equation for more general cases and applied the result to the proof of stability of the Hosszú’s functional

A/P = Anteroposterior; CAS(K) = Clinical Assessment Study of the Knee; CPG = Chronic Pain Grade; CSQ = Coping Strategies Questionnaire; GP = General Practi- tioner; HAD =

FIG 2 Neighbor-joining phylogenetic tree of ComC amino acid sequences of streptococcal species of the mitis group.. Two clusters of ComC based on

Abstract : - Maximum spatial-frequency diversity can be achieve by combining bit-interleaved coded modulation (BICM), orthogonal space-time block coding (OSTBC) and

Such model search techniques are used in this dissertation for learning how to better share learned structure; (c) Joint training of multiple models for a single task trains