Parse Fitting and Prose Fixing: Getting a Hold on Ill Formedness

(1)

Parse Fitting and Prose Fixing:

G e t t i n g a Hold on I I I - f o r m e d n e s s 1

K. J e n s e n , G. E. H e i d o r n , L. A . M i l l e r , a n d Y . R a v i n

C o m p u t e r S c i e n c e s D e p a r t m e n t I B M T h o m a s J. W a t s o n R e s e a r c h C e n t e r

P.O. Box 218

Y o r k t o w n H e i g h t s , N e w Y o r k 10598

Processing syntactically ill-formed language is an important mission o f the EPISTLE

system, l l l - f o r m e d input is treated by this system in various ways. Misspellings are

highlighted by a standard spelling checker; syntactic errors are detected and corrections are suggested; and stylistic infelicities are called to the user's attention.

Central to the EPISTLE processing strategy is its technique o f

fitted

parsing. W h e n the

rules o f a conventional syntactic grammar are unable to produce a parse for an input string,

this technique can be used to produce a reasonable

approximate

parse that can serve as

input to the remaining stages o f processing.

This paper first describes the fitting process and gives examples o f ill-formed language situations where it is called into play. W e then show how a fitted parse allows EPISTLE to carry on its text-critiquing mission where conventional grammars would fail either because

o f input problems or because of limitations in the grammars themselves. S o m e inherent

difficulties o f the fitting technique are also discussed. In addition, we explore how style

critiquing relates to the handling of ill-formed input, and how a fitted parse can be used in style checking.

Introduction

In its current form, the EPISTLE s y s t e m addresses the p r o b l e m s of g r a m m a r and style checking of texts written in ordinary English (letters, reports, and manuals, as o p p o s e d to novels, plays, and p o e m s ) . It is this goal that involves us so intimately with the processing

of ill-formed language. G r a m m a r checking deals with

such errors as d i s a g r e e m e n t in n u m b e r b e t w e e n subject and verb; style checking calls a t t e n t i o n to such infelicities as sentences that are too w o r d y or too c o m - plex. A standard spelling c h e c k e r is also included.

Our g r a m m a r is written in NLP ( H e i d o r n 1972), an a u g m e n t e d phrase structure language which is c u r r e n t - ly i m p l e m e n t e d in LISP/370. At this time the EPISTLE g r a m m a r uses syntactic, but not semantic, information. Access to an on-line s t a n d a r d d i c t i o n a r y with a b o u t 130,000 entries, including p a r t - o f - s p e e c h and s o m e o t h e r s y n t a c t i c i n f o r m a t i o n (such as t r a n s i t i v i t y of

1 The work described here is a continuation of work first presented at the Conference on Applied Natural Language Process- ing in Santa Monica, California (Jensen and Heidorn 1983).

verbs), m a k e s the s y s t e m ' s v o c a b u l a r y essentially un- limited. We test and i m p r o v e the g r a m m a r b y regular- ly running it on a data base of 2254 sentences f r o m

411 actual business letters. M o s t of these sentences

are rather complicated; the longest contains 63 words, and the a v e r a g e length is 19.2 words.

Since the subset of English r e p r e s e n t e d in business d o c u m e n t s is very large, we need a very c o m p r e h e n - sive g r a m m a r and a robust parser. We take a heuristic a p p r o a c h and consider that a natural language p a r s e r can be divided into three parts:

(a) a set of rules, called the c o r e g r a m m a r , that p r e - cisely defines the central, a g r e e d - u p o n g r a m m a t - ical structures of a language;

(b) peripheral p r o c e d u r e s that handle parsing a m b i - guity: w h e n the core g r a m m a r p r o d u c e s m o r e t h a n one parse, these p r o c e d u r e s decide which of the multiple parses is to be preferred;

(c) p e r i p h e r a l p r o c e d u r e s t h a t handle parsing failure: w h e n the core g r a m m a r c a n n o t define an a c c e p t a b l e parse, these p r o c e d u r e s assign s o m e r e a s o n a b l e structure to the input.

Copyright 1984 by the Association for Computational Linguistics. Permission to copy without fee all or part of this material is granted provided that the copies are not made for direct commercial advantage and the Journal reference and this copyright notice are included on the first page. To copy otherwise, or to republish, requires a fee a n d / o r specific permission.

0 3 6 2 - 6 1 3 X / 8 3 / 0 3 0 1 4 7 - 1 4 5 0 3 . 0 0

(2)

In EPISTLE,

(a) the core g r a m m a r consists at p r e s e n t of a set of a b o u t 300 syntax rules;

(b) a m b i g u i t y is r e s o l v e d b y using a m e t r i c t h a t ranks alternative parses ( H e i d o r n 1982); and (c) parse failure is handled b y the fitting p r o c e d u r e

described here.

In using the t e r m s core grammar and periphery, we

are c o n s c i o u s l y e c h o i n g r e c e n t w o r k in g e n e r a t i v e g r a m m a r , but we are applying the t e r m s in a s o m e w h a t d i f f e r e n t way. C o r e g r a m m a r , in c u r r e n t linguistic theory, suggests the notion of a set of v e r y general rules t h a t define universal p r o p e r t i e s of h u m a n language and effectively set limits on the types of g r a m - m a r s a n y p a r t i c u l a r l a n g u a g e m a y have; p e r i p h e r y p h e n o m e n a are those constructions that are peculiar to p a r t i c u l a r l a n g u a g e s and t h a t require rules b e y o n d w h a t the core g r a m m a r will p r o v i d e ( L a s n i k and Freid- in 1981). O u r c u r r e n t w o r k is not c o n c e r n e d with the m e t a - r u l e s of a U n i v e r s a l G r a m m a r . But we h a v e f o u n d that a distinction b e t w e e n core and p e r i p h e r y is useful e v e n within a g r a m m a r of a particular language - in this case, English.

Parsing in EPISTLE

EPISTLE's p a r s e r is w r i t t e n in the NLP p r o g r a m m i n g language, which w o r k s with a u g m e n t e d p h r a s e structure rules and with a t t r i b u t e - v a l u e records, which are m a n i p u l a t e d b y the rules. W h e n NLP is used to parse n a t u r a l l a n g u a g e text, the r e c o r d s d e s c r i b e c o n s t i t u - ents, and the rules put these constituents t o g e t h e r to f o r m e v e r larger c o n s t i t u e n t (or r e c o r d ) structures. R e c o r d s c o n t a i n all the c o m p u t a t i o n a l and linguistic i n f o r m a t i o n associated with words, with larger constituents, and with the parse f o r m a t i o n . At this time our g r a m m a r is s e n t e n c e - b a s e d ; we do not, for instance, create record structures to describe p a r a g r a p h s . D e - tails of the EPISTLE s y s t e m a n d of its core g r a m m a r m a y be f o u n d in H e i d o r n et al. (1982). An earlier o v e r v i e w of the s y s t e m is p r e s e n t e d in Miller et al. (1981).

A close e x a m i n a t i o n of parse trees p r o d u c e d by the core g r a m m a r will o f t e n r e v e a l b r a n c h a t t a c h m e n t s t h a t are not quite right: f o r e x a m p l e , s e m a n t i c a l l y incongruous prepositional p h r a s e a t t a c h m e n t s . In line with our p r a g m a t i c parsing philosophy, our core g r a m - m a r is designed to p r o d u c e unique a p p r o x i m a t e parses. (Recall that we currently have access only to syntactic and m o r p h o l o g i c a l i n f o r m a t i o n a b o u t constituents.) In the cases where semantic or p r a g m a t i c i n f o r m a t i o n is n e e d e d b e f o r e a p r o p e r a t t a c h m e n t can be made, rather t h a n p r o d u c e a c o n f u s i o n of multiple p a r s e s we force the g r a m m a r to try to assign a single parse. This is usually d o n e b y f o r c i n g s o m e a t t a c h m e n t s to be m a d e to the closest, or rightmost, available constituent. This s t r a t e g y only rarely i m p e d e s the t y p e of g r a m m a r - c h e c k i n g a n d s t y l e - c h e c k i n g t h a t we are

working on. A n d we feel that a single parse with a

c o n s i s t e n t a t t a c h m e n t s c h e m e will yield m u c h m o r e easily to later semantic processing t h a n would a large n u m b e r of different structures.

T h e rules of the core g r a m m a r (CG) p r o d u c e a single a p p r o x i m a t e parse for a l m o s t 7 0 % p e r c e n t of input text, and a small n u m b e r of multiple parses for a n o t h e r 16% . T h e CG can always be i m p r o v e d and its c o v e r a g e e x t e n d e d ; w o r k on i m p r o v i n g the EPISTLE CG is continual. But the c o v e r a g e of a core g r a m m a r will n e v e r reach 1 0 0 % . F o r those strings that c a n n o t be fully p a r s e d by rules of the core g r a m m a r we use a heuristic best f i t p r o c e d u r e that p r o d u c e s a r e a s o n a b l e parse structure.

The Fitting Procedure

T h e fitting p r o c e d u r e begins a f t e r the CG rules h a v e b e e n applied in a b o t t o m - u p , parallel fashion, but h a v e failed to p r o d u c e an S n o d e t h a t covers the string. At this point, as a b y - p r o d u c t of b o t t o m - u p parsing, records are available for inspection that describe the various s e g m e n t s of the input string f r o m m a n y p e r s p e c - tives, according to the rules that h a v e b e e n applied. T h e t e r m fitting has to do with selecting and fitting these pieces of the analysis t o g e t h e r in a r e a s o n a b l e fashion.

T h e fitting algorithm, which is itself i m p l e m e n t e d as a set of NLP rules, p r o c e e d s in t w o m a i n stages:

first, a head constituent is c h o s e n ; next, remaining

constituents are fitted in. In o u r c u r r e n t i m p l e m e n t a - tion, candidates f o r the h e a d are tested p r e f e r e n t i a l l y as follows, f r o m m o s t to least desirable:

(a) VPs with tense and subject;

(b) VPs with tense but no subject;

(c) phrases w i t h o u t verbs (e.g., NPs, PPs);

(d) non-finite VPs;

(e) others.

If m o r e t h a n one candidate is f o u n d in a n y c a t e g o r y , the one p r e f e r r e d is the widest (covering m o s t text). If there is a tie f o r widest, the l e f t m o s t of those is preferred. If there is a tie for leftmost, the o n e with the best value for the parse metric is chosen. If there is still a tie (a v e r y unlikely case), an a r b i t r a r y choice is made. ( N o t e t h a t we consider a VP to be a n y seg- m e n t of text that has a v e r b as its h e a d e l e m e n t . )

T h e fitting p r o c e s s is c o m p l e t e if the h e a d constituent covers the entire input string (as would be the case if the string c o n t a i n e d just a n o u n phrase, f o r e x a m p l e , " S a l u t a t i o n s and c o n g r a t u l a t i o n s " ) . If the h e a d c o n - stituent d o e s not c o v e r the entire string, r e m a i n i n g c o n s t i t u e n t s are added on either side, with the following o r d e r of p r e f e r e n c e :

(a) s e g m e n t s other t h a n VP; (b) u n t e n s e d VPs;

(c) tensed VPs.

As with the choice of head, the widest c a n d i d a t e is p r e f e r r e d at each step. T h e fit m o v e s o u t w a r d f r o m

(3)

K. Jensen, G.E. Heidorn, L.A. Miller, and Y. Ravin Parse Fitting a n d P r o s e Fixing

F I T T E D - - - N P . . . N O U N * - - - " E x a m p l e " - - - P U N C . . . . " : "

- - - V P * . . . . N P I . . . D E T . . . A D J * . . .

- - - P U N C . . . .

" Y o u r " I . . . N O U N * - - - " p e r c e n t a g e "

I . . . P P i . . . P R E P . . . o f " I . . . M O N E Y * . . . $ 2 5 0 . 0 0 " . . . . V E R B * - - - " i s "

[image:3.612.42.437.61.191.2] [image:3.612.50.444.624.729.2]

. . . . N P . . . M O N E Y * - - " $ 1 8 7 . 5 0 " i v . , v

Figure 1. An example of a fitted parse tree.

the head, b o t h leftward to the beginning of the string, and rightward to the end, until the entire input string has b e e n fitted into a b e s t a p p r o x i m a t e p a r s e tree. The overall effect of the fitting process is to select the largest chunk of sentence-like material within a text string and c o n s i d e r it to be central, with l e f t - o v e r chunks of text a t t a c h e d in some r e a s o n a b l e m a n n e r .

As a simple example, consider this text string: " E x a m p l e : Your p e r c e n t a g e of $250.00 is $ 1 8 7 . 5 0 . " Because this string has a capitalized first w o r d and a period at its end, it is s u b m i t t e d to the core g r a m m a r

for consideration as a sentence. But it is not a sen-

tence, and so the CG will fail to arrive at a c o m p l e t e d

parse. H o w e v e r , during processing, the CG will have

assigned m a n y structures to its substrings. L o o k i n g for a h e a d c o n s t i t u e n t a m o n g these structures, the fitting p r o c e d u r e will first seek VPs with tense and

subject. Several are present: " $ 2 5 0 . 0 0 is",

" p e r c e n t a g e of $ 2 5 0 . 0 0 is", " $ 2 5 0 . 0 0 is $ 1 8 7 . 5 0 " , and so on. The widest and leftmost of these VP c o n - stituents is the one which covers the string " Y o u r per- centage of $250.00 is $ 1 8 7 . 5 0 " , so it will be chosen as head.

The fitting process then looks for additional constituents to the left, favoring ones other t h a n VP. It finds first the colon, and then the word " E x a m p l e " . In this string the only constituent following the head is the final period, which is duly added. T h e c o m p l e t e fitted parse is shown in Figure 1.

T h e f o r m of parse tree used here shows the t o p - d o w n structure of the string f r o m left to right, with the

terminal nodes being the last item on each line. At

each level of the tree (in a vertical column), the head

e l e m e n t of a c o n s t i t u e n t is m a r k e d with an asterisk. T h e o t h e r e l e m e n t s a b o v e and b e l o w are p r e - a n d p o s t - m o d i f i e r s . T h e highest e l e m e n t of the trees s h o w n here is F I T T E D , r a t h e r t h a n the m o r e usual SENT. (It is i m p o r t a n t to r e m e m b e r that these parse d i a g r a m s are only s h o r t h a n d r e p r e s e n t a t i o n s f o r the NLP record structures, which contain an a b u n d a n c e of i n f o r m a t i o n a b o u t the string processed.)

The tree of Figure 1, which would be lost if we restricted ourselves to the rules of the core g r a m m a r , is n o w available f o r e x a m i n a t i o n , f o r g r a m m a r and style checking, and ultimately f o r s e m a n t i c i n t e r p r e t a - tion. It can take its place in the s t r e a m of continuous text and be a n a l y z e d f o r what it is - a sentence frag- m e n t , i n t e r p r e t a b l e only b y r e f e r e n c e to o t h e r sentences in context.

F u r t h e r E x a m p l e s

T h e fitted parse a p p r o a c h can help to deal with m a n y difficult n a t u r a l language p r o b l e m s , including f r a g - ments, difficult cases of ellipsis, proliferation of rules to handle single p h e n o m e n a , p h e n o m e n a for which no rule seems adequate, and p u n c t u a t i o n horrors. E a c h of these is discussed here with examples.

Fragments. T h e r e are m a n y of these in r u n n i n g text; t h e y are f r e q u e n t l y NPs, as in Figure 2, and include c o m m o n greetings, farewells, and sentiments. (N.B., m o s t of the e x a m p l e s in this p a p e r are t a k e n f r o m the EPISTLE data base.)

D i f f i c u l t c a s e s o f e l l i p s i s . In the s e n t e n c e of Figure 3, what we really have semantically is a c o n - junction of two p r o p o s i t i o n s which, if g e n e r a t e d di- rectly, would read: " S e c o n d l y , the Annual C o m m i s s i o n

F I T T E D I - - - N P * J . . . . N P i . . . A J P . . . A D J * . . . . " G o o d " I I I . . . N O U N * - - - " I u c k "

I I . . . . C O N J * - - - " a n d "

I I . . . . N P J . . . A J P . . . A D J * . . . . " g o o d " I I . . . N O U N * . . . s e l l i n g "

P - - - P U N C . . . . " . "

Figure 2. Fitted noun phrase (fragment).

(4)

F I T T E D . . . . A V P I . . . . A D V * . . . . " S e c o n d l y " I . . . . P U N C . . . , "

. . . . N P I . . . A J P . . . A D J * . . . . " t h e " i . . . N P . . . N O U N * - - - " A n n u a i " I . . . N P . . . N O U N * - - - " C o m m i s s i o n " I . . . N P . . . N O U N * . . . S t a t e m e n t " i . . . N O U N * . . . t o t a l "

. . . . V E R B . . . s h o u l d " . . . . V E R B * - - - " b e "

. . . . N P . . . M O N E Y * . . . . $ 1 4 , 6 8 2 . 6 1 "

- - - P U N C . . . . " , "

- - - A V P . . . A D V * . . . . " n o t " - - - N P . . . M O N E Y * . . . . $ 1 4 , 6 8 2 . 6 7 " - - - P U N C . . . . " . "

Figure 3. Fitted sentence with ellipsis.

F I T T E D I - - - N P . . . N O U N * - - - " B i l i " - - - P U N C . . . . " , "

I . . . . N P . . . P R O N * - - - " I " i . . . . V E R B . . . " ' v e "

i . . . . V E R B . . . . " b e e n " I . . . . V E R B * . . . a s k e d "

i . . . . I N F C L l - - I N F T O - - - " t o " I - - V E R B * - - - " c l a r i f y "

I - - N P I . . . A J P . . . A D J * . . . t h e " i . . . A J P . . . V E R B * - - - " e n c i o s e d " i . . . N O U N * . . . l e t t e r "

- - - P U N C . . . . " . "

Figure 4. Fitted sentence with initial vocative.

F I T T E D I - - - N P I . . . A J P . . . A D J * . . . . " G o o d " I . . . N O U N * - - - " l u c k "

- - - P P I . . . P R E P . . . . " t o "

J . . . N P . . . P R O N * - - - " y o u " i . . . C O N J * - - - " a n d "

I . . . N P . . . P R O N * - - - " y o u r s " - - - C O N J . . . . " , a n d "

- - - V P * . . . . N P . . . P R O N * - - - " I " . . . . V E R B * - - - " w i s h "

. . . . N P . . . P R O N * - - - " y o u "

. . . . N P . . . A J P . . . A D J * . . . . " t h e " . . . A D V . . . " V E R Y "

. . . A D J * . . . . " b e s t " . . . . P P . . . P R E P . . . . " i n "

. . . A J P . . . A D J * . . . . " y o u r " . . . A J P . . . A D J * . . . . " f u t u r e " . . . N O U N * - - - " e f f o r t s " - - - P U N C . . . . " . "

Figure 5. Fitted conjunction of noun phrase with clause.

[image:4.612.38.567.47.717.2] [image:4.612.40.552.389.695.2]

(5)

K. Jensen, G.E. Heidorn, L.A. Miller, and Y. R a v i n P a r s e F i t t i n g and P r o s e Fixing

S t a t e m e n t total should be $ 1 4 , 6 8 2 . 6 1 ; the A n n u a l C o m m i s s i o n S t a t e m e n t t o t a l should not be $14,682.67." D e l e t i o n p r o c e s s e s o p e r a t i n g on the s e c o n d p r o p o s i t i o n are lawful (deletion of identical elements) but massive. It would be unwise to write a core g r a m m a r rule that routinely allowed negativized NPs to follow main clauses, because:

(a) the p r o p e r analysis of this s e n t e n c e would be

obscured: some pieces - namely, the inferred

c o n c e p t s - are missing f r o m the second part of the surface sentence;

(b) the linguistic generalization would be lost: any

two conjoined propositions can u n d e r g o deletion of identical ( r e c o v e r a b l e ) elements.

A fitted parse such as Figure 3 allows us to inspect the main clause for syntactic and stylistic deviances, and at the same time m a k e s clear the b r e a k i n g point b e - tween the two propositions and opens the d o o r for a later semantic processing of the elided elements.

P r o l i f e r a t i o n o f r u l e s t o h a n d l e single phenomena. T h e r e are s o m e English c o n s t r u c t i o n s that, a l t h o u g h t h e y h a v e a fairly simple a n d u n i t a r y form, do not hold a n y t h i n g like a u n i t a r y o r d e r i n g relation within clause boundaries. The vocative is one of these:

(a) B i l l I ' v e b e e n asked to clarify the enclosed letter.

(b) I ' v e b e e n asked, B i l l to clarify the enclosed letter.

(c) I ' v e b e e n asked to clarify the enclosed letter, Bill.

In longer sentences there would be e v e n m o r e possible places to insert the vocative.

Rules could be written that would explicitly allow the p l a c e m e n t of a p r o p e r name, surrounded by c o m - mas, at different positions in the s e n t e n c e - a different

rule f o r each position. But this solution lacks ele-

gance, m a k e s a simple p h e n o m e n o n seem complicated, and always runs the risk of o v e r l o o k i n g yet one m o r e position where some other writer might insert a vocative. The parse fitting p r o c e d u r e provides an alternative that preserves the integrity of the main clause and allows the vocative to be added o n t o the structure, as shown, for example, in Figure 4. O t h e r similar p h e - n o m e n a , such as p a r e n t h e t i c a l expressions, can be handled in this same fashion.

P h e n o m e n a f o r w h i c h n o r u l e s e e m s a d e q u a t e . The sentence " G o o d luck to y o u and yours, and 1 wish you the very best in y o u r future e f f o r t s . " is, on the face of it, a conjunction of a noun phrase (or NP plus PP) with a finite v e r b phrase. Such constructions are not usually considered to be fully grammatical, and a core g r a m m a r that contains a rule describing this c o n - struction ought p r o b a b l y to be called a faulty g r a m - mar. Nevertheless, ordinary English c o r r e s p o n d e n c e a b o u n d s with strings of this sort, and readers have no

difficulty construing them. T h e fitted parse for this

s e n t e n c e in Figure 5 presents the finite clause as its h e a d and adds the remaining constituents in a r e a s o n a - ble fashion. F r o m this structure later semantic p r o c - essing could infer that " G o o d luck to y o u and y o u r s " really m e a n s " I e x p r e s s / s e n d / w i s h g o o d luck to y o u and y o u r s " - a special case of formalized, ritualized ellipsis.

Punctuation horrors. In any large sample of natu-

ral language text, there will be m a n y irregularities of

p u n c t u a t i o n that, although p e r f e c t l y u n d e r s t a n d a b l e to r e a d e r s , c a n c o m p l e t e l y disable an explicit c o m p u t a - tional g r a m m a r . In business text these difficulties are frequent. Some can be caught and c o r r e c t e d b y p u n c - t u a t i o n c h e c k e r s and b a l a n c e r s . But o t h e r s c a n n o t , s o m e t i m e s because, for all their trickiness, they are not

really wrong. Yet f e w g r a m m a r i a n s would care to

dignify, by describing it with rules of the core g r a m - mar, a text string like:

" O p t i o n s : A l - ( T r a n s m i t t e r C l o c k e d by D a t a - set) B 3 - ( w i t h o u t the 605 Recall Unit) C 5 - (with A B C Ring I n d i c a t o r ) D 8 - ( w i t h o u t A u t o A n s w e r ) E 1 0 - ( A u t o Ring Selective)."

Our parse fitting p r o c e d u r e handles this e x a m p l e b y building a string of NPs s e p a r a t e d with p u n c t u a t i o n

marks, as shown in Figure 6. This solution at least

e n a b l e s us to get a handle on the c o n t e n t s of the string.

B e n e f i t s

T h e r e are two main benefits to be gained f r o m using the fitted parse a p p r o a c h . First, it allows for syntactic p r o c e s s i n g - for o u r p u r p o s e s , g r a m m a r a n d style c h e c k i n g - to p r o c e e d in the a b s e n c e of a p e r f e c t

parse. Second, it provides a promising structure to

s u b m i t to later s e m a n t i c p r o c e s s i n g routines. A n d parenthetically, a fitted parse d i a g r a m is a great aid to

g r a m m a r rule debugging. T h e place where the first

b r e a k occurs b e t w e e n the h e a d c o n s t i t u e n t and its pre- or p o s t - m o d i f i e r s usually indicates fairly precisely where the core g r a m m a r failed.

It should be e m p h a s i z e d t h a t a fitting p r o c e d u r e c a n n o t be used as a substitute for explicit rules, and that it in no way lessens the i m p o r t a n c e of the core g r a m m a r . T h e r e is a tight interaction b e t w e e n the t w o c o m p o n e n t s . T h e success of the fitted parse d e p e n d s on the a c c u r a c y and c o m p l e t e n e s s of the core rules; a fit is only as good as its g r a m m a r .

C o r r e c t i n g S y n t a c t i c E r r o r s in a F i t t e d P a r s e

Suppose the text string in Figure 1 had c o n t a i n e d an u n g r a m m a t i c a l i t y , such as d i s a g r e e m e n t in n u m b e r b e t w e e n its subject and its verb. T h e n our troubles would be c o m p o u n d e d . T h e r e would be two r e a s o n s for the CG to reject that string: (a) it is a f r a g m e n t ; and (b) it contains a syntax error.

(6)

F I T T E D - - - N P . . . N O U N * - - - " O p t i o n s " - - - P U N C . . . . " : "

- - - N P . . . N O U N * - - - " A I " - - - P U N C . . . . " - "

- - - P U N C . . . . " ( "

- - - N P I . . . N P . . . N O U N * - - - " T r a n s m i t t e r " I . . . N O U N * . . . C l o c k e d "

- - - P P I . . . P R E P . . . . " b y " I . . . N O U N * . . . D a t a s e t "

- - - P U N C . . . ) " - - - N P . . . N O U N * - - - " B 3 " - - - P U N C . . .

I - - - P P * . . . . P U N C . . . . " ( " I . . . . P R E P . . . w i t h o u t " I . . . . A J P . . . A D J * . . . . " t h e " I . . . . Q U A N T - - - N U M * . . . 6 0 5 "

. . . . N P . . . N O U N * - - - " R e c a l I " . . . . N O U N * - - - " U n i t "

. . . . P U N C . . . ) " - - - N P . . . N O U N * - - - " C 5 "

- - - P U N C . . . . " - "

- - - P P . . . P U N C . . . . " ( " . . . P R E P . . . . " w i t h "

. . . N P . . . N O U N * - - - " A B C " . . . N P . . . N O U N * - - - " R i n g " . . . N O U N * - - - " I n d i c a t o r " . . . P U N C . . . ) " - - - N P . . . N O U N * - - - " D 8 " - - - P U N C . . . . " - "

- - - P P I . . . P U N C . . . ( " I . . . P R E P . . . . " w i t h o u t "

I . . . N P . . . N O U N * - - - " A u t o " I . . . N O U N S - - - " A n s w e r "

I . . . P U N C . . . ) " - - - N P . . . N O U N * - - - " E I 0 "

- - - P U N C . . . . " - "

- - - N P I . . . P U N C . . . . " ( "

I . . . N P . . . N O U N * - - - " A u t o " I . . . N P . . . N O U N * - - - " R i n g "

I . . . N O U N S . . . S e l e c t i v e " I . . . P U N C . . . ) "

- - - P U N C . . . . " . "

Figure 6. Fitted list.

But the CG can r e c o v e r f r o m m a n y syntax errors: it can d i a g n o s e and correct them, producing the parse tree that w o u l d be appropriate if the correction w e r e

made. Figure 7 illustrates this ability. This n u m b e r -

d i s a g r e e m e n t p h e n o m e n o n is fairly c o m m o n in current A m e r i c a n English. The tensed verb s e e m s to w a n t to agree with its closest n o u n neighbor (in this s e n t e n c e , "forms...are") rather than with its subject NP ("a car-

b o n copy...is"). A prescriptive rule still insists that

subject and verb s h o u l d agree in n u m b e r , h o w e v e r ,

and the EPISTLE g r a m m a r p r o v i d e s a c o r r e c t i o n for such cases. N o t e that in the last line o f Figure 7 the

w o r d "are" has b e e n c h a n g e d to "is". ( S e e H e i d o r n

et al. ( 1 9 8 2 ) for a m o r e t h o r o u g h discussion o f the error correction technique.)

A n d n o w the fitting procedure allows us to c o n t i n - ue this w o r k e v e n under wildly ungrammatical c o n d i - tions. Figure 8 is a fitted parse for the string in Figure 1, with a n u m b e r d i s a g r e e m e n t error introduced into the fragment.

[image:6.612.53.419.62.596.2]

(7)

K. Jensen, G.E. Heidorn, L.A. Miller, and Y. Ravin Parse Fitting and Prose Fixing

D E C L - - - N P I . . . D E T . . . A D J * . . . . " A " I . . . N P . . . N O U N * - - - " c a r b o n " i . . . N O U N * - - - " c o p y "

i . . . . P P . . . P R E P . . . . " o f "

. . . D E T . . . A D J * . . . . " t h e " . . . N P I . . . N O U N * . . . W o r k m a n "

I . . . P O S S . . . . " ' s "

. . . N P . . . N O U N * - - - " C o m p e n s a t i o n '' . . . N O U N * - - - " f o r m s "

- - - V E R B . . . . " a r e " - - - V E R B * - - - " e n c l o s e d "

- - - P P I . . . P R E P . . . . " f o r "

J . . . D E T . . . A D J * . . . . " y o u r " J . . . N O U N * - - - " i n f o r m a t i o n "

- - - P U N C . . . . " . "

G R A M M A T I C A L E R R O R : S U B J E C T - V E R B N U M B E R D I S A G R E E M E N T . A c a r b o n c o p y . . . A R E e n c l o s e d f o r y o u r i n f o r m a t i o n . C O N S I D E R :

A c a r b o n c o p y . . . I S e n c l o s e d f o r y o u r i n f o r m a t i o n .

Figure 7. Diagnosis and correction of a syntax error (not a fitted parse).

F I T T E D I - - - N P . . . N O U N * . . . E x a m p l e " - - - P U N C . . . . " : "

- - - V p * . . . . N P I . . . D E T . . . A D J * . . . . " y o u r " I . . . N O U N * . . . p e r c e n t a g e "

i . . . P P i . . . P R E P . . . . " o f " I . . . M O N E Y * - - " $ 2 5 0 . 0 0 " . . . . V E R B , - - - , , a r e ,,

. . . . N P . . . M O N E Y * . . . . $ ] 8 7 . 5 0 " - - - P U N C . . . . " . "

P O S S I B L E G R A M M A T I C A L E R R O R : S U B J E C T - V E R B N U M B E R E x a m p l e : y o u r p e r c e n t a g e . . . A R E $ 1 8 7 . 5 0 . C O N S I D E R :

E x a m p l e : y o u r p e r c e n t a g e . . . I S $ ] 8 7 . 5 0 .

D I S A G R E E M E N T .

Figure 8. Fitted parse containing clause with syntax error.

F I T T E D I - - - P P * I . . . . P R E P . . . . " B e t w e e n "

I I . . . . N P . . . P R O N * . . . y o u " I I . . . . C O N J * . . . a n d "

I I . . . . N P . . . P R O N * - - - " I " I - - - P U N C . . . . " . "

P O S S I B L E G R A M M A T I C A L E R R O R : B E T W E E N y o u a n d I . C O N S I D E R :

B E T W E E N y o u a n d M E .

W R O N G P R O N O U N I N O B J E C T

Figure 9. Case error in prepositional phrase.

P O S I T I O N .

[image:7.612.36.568.46.736.2] [image:7.612.46.450.65.371.2]

(8)

T h a n k s to the flexibility of this a p p r o a c h , it is possible to check g r a m m a r within the smallest imaginable constituents (Figure 9) - and in the largest imaginable (Figure 10).

In s u m m a r y , there are m a n y d i f f e r e n t causes f o r s y n t a c t i c i l l - f o r m e d n e s s in the p r o c e s s i n g of text:

misspellings, u n g r a m m a t i c a l i t i e s , f r a g m e n t s , c r a z y p u n c t u a t i o n , deficits in the p r o c e s s i n g g r a m m a r , etc. T h e t e c h n i q u e s d e s c r i b e d h e r e give us a c h a n c e to r e c o v e r f r o m all such cases of ill-formedness. First we d e v e l o p a core g r a m m a r , which itself is c a p a b l e of detecting spelling mistakes, and of correcting certain

F I T T E D - - - V P * . . . . S U B C L - - C O N J . . . B e f o r e "

- - N P I . . . D E T . . . A D J * . . . . " a n " I . . . N O U N * . . . a p p r o v a l " - - V E R B . . . . " c a n "

- - V E R B . . . . " b e " - - V E R B * - - - " i s s u e d " . . . . N P . . . P R O N * - - - " i t " . . . . V E R B . . . w i l l " . . . . V E R B * - - - " b e "

. . . . A J P i . . . . A D J * . . . . " n e c e s s a r y " I . . . . I N F C L - - I N F T O . . . t o "

- - V E R B * - - - " s u b m i t "

- - N P . . . N P . . . N O U N * " b l u e p r i n t " . . . N O U N * - - - " d r a w i n g s "

- - P P . . . P R E P . . . . " i n "

. . . A J P . . . A D J * . . . . " t r i p l i c a t e " . . . N O U N * - - - " s e t s "

- - P P . . . P R E P . . . . " o n " . . . N O U N * - - - " s h e e t s "

. . . A J P I . . . . A V P . . . A D V * . . . . " n o " i . . . . A D J * . . . s m a l l e r "

I . . . . P P I . . . P R E P . . . t h a n "

J . . . Q U A N T - - - A D J * . . . 1 5 "

I . . . N O U N * - - - " i n c h e s " - - - C O N J . . . . " a n d "

- - - P T P R T C L I V E R B * - - - " d r a w n "

i P P i . . . P R E P . . . t o "

J . . . D E T . . . A D J * . . . . " a " E . . . N O U N * - - - " s c a l e "

I . . . A J P . . . . A V P . . . A D V *

- - - P U N C . . . . " . "

" n o " . . . . A D J * . . . . " s m a l l e r "

. . . . P P [ . . . P R E P . . . . " t h a n " i . . . N O U N * - - - " I / 8 t h "

i . . . P P I . . . P R E P . . . . " o f "

I . . . D E T . . . A D J * . . . . " a n " i . . . N O U N * . . . i n c h "

. . . . P P I . . . P R E P . . . . " t o "

i . . . D E T . . . A D J * . . . . " t h e " i . . . N O U N * . . . f o o t "

P O S S I B L E G R A M M A T I C A L E R R O R : M I S S I N G C O M M A .

B e f o r e a n a p p r o v a l c a n b e i s s u e d i t w i l l b e n e c e s s a r y . . . C O N S I D E R :

B e f o r e a n a p p r o v a l c a n b e i s s u e d , i t w i l l b e n e c e s s a r y . . . A C O M M A I S N E E D E D T O D E F I N E C L A U S E B O U N D A R I E S

Figure 10. Comma error in long complex sentence.

[image:8.612.58.548.140.679.2]

(9)

syntactic mistakes w h e n they occur in otherwise legiti-

m a t e sentences. To this core g r a m m a r we couple a

fitting p r o c e d u r e that p r o d u c e s a r e a s o n a b l e best-guess parse for all other text strings, regardless of w h e t h e r t h e y m e e t the g r a m m a r ' s criteria for s e n t e n c e h o o d . T h e fitted parse t h e n allows us to c h e c k e v e n n o n - s e n t e n c e s f o r those c a t e g o r i e s of s y n t a c t i c errors

that we can correct.

C r i t i q u i n g S t y l i s t i c I I I - F o r m e d n e s s in E P I S T L E

T h e style c o m p o n e n t of EPISTLE uses the s e n t e n c e structures p r o v i d e d by the parser as input for stylistic critiquing. It consists of a set of NLP rules that apply to parse trees of sentences, identify stylistic errors and suggest corrections. T h e r e is a f u n d a m e n t a l difference b e t w e e n style analysis and g r a m m a r analysis, h o w e v e r : the g r a m m a r r u l e - s y s t e m is b a s e d on a set of objective syntactic criteria that d e t e r m i n e w h e t h e r the input is w e l l - f o r m e d or not. The style rules, b y contrast, are b a s e d on relative criteria. T h e y place the input on a c o n t i n u u m of stylistic acceptability so that w h a t is a stylistic error b e c o m e s a m a t t e r of degree.

Types of s t y l i s t i c i l l - f o r m e d n e s s .

(1) P u n c t u a t i o n . Stylistic i l l - f o r m e d n e s s is relative since it d e p e n d s on both the linguistic and the extra-linguistic contexts. Linguistically, a c o m b i n a t i o n of g r a m m a t i c a l f a c t o r s in the s e n t e n c e can m a k e a

sentence m o r e or less ill-formed. F o r example, the

need for a c o m m a in a c o m p o u n d sentence increases with the length of the conjoined clauses. Thus, in " a decision was r e a c h e d and the m e e t i n g e n d e d , " a c o m - ma b e f o r e the " a n d " is optional; but in " a decision which was m o d e r a t e enough to satisfy e v e n m y o b j e c - tions was r e a c h e d , and the m e e t i n g was finally a d j o u r n e d , " a c o m m a is necessary. An e v e n longer s e n t e n c e c o n t a i n i n g several o t h e r c o m m a s m i g h t require a semi-colon b e f o r e the " a n d " .

T o be able to detect the missing c o m m a , the stylistic rule must have access to syntactic i n f o r m a t i o n p r o - vided by the parser. It has to k n o w that the s e n t e n c e is c o m p o u n d . M o r e o v e r , it has to k n o w t h a t e a c h clause c o n t a i n s its o w n s u b j e c t (in this case, " a decision" and " t h e m e e t i n g " ) , since a c o m p o u n d sentence with only one subject does not take a c o m m a . A f t e r all the syntactic conditions are c h e c k e d and met, the rule measures the length of the clauses to deter-

mine h o w badly the c o m m a is needed. The output is

shown in Figure 11.

(2) O t h e r T y p e s of Stylistic I l l - F o r m e d n e s s . T h e style c o m p o n e n t of EPISTLE d e t e c t s o t h e r instances of missing or faulty p u n c t u a t i o n (a c o m m a at the end of a s u b o r d i n a t e clause, no colon b e f o r e a single n o u n - p h r a s e , etc). It also addresses some types of c o m p l i c a t e d g r a m m a t i c a l c o n s t r u c t i o n s t h a t m a y impede the r e a d e r ' s c o m p r e h e n s i o n , such as excessive length, excessive n o u n - m o d i f i c a t i o n (e.g. " e a r l y child- h o o d t h o u g h t d i s o r d e r m i s d i a g n o s i s " ) , a n d excessive

negation (e.g., "neither the professor, n o r his two as- sistants, w h o h a v e b e e n working with him on this p r o - ject, haven't noticed the t h e f t " ) . Some usage violations are signaled, such as "split infinitives" and the usage of " m o s t " i n s t e a d of " a l m o s t " ; and finally, some cosmetic changes are p r o p o s e d w h e n the s y n t a c - tic structure is too u n i f o r m or w h e n there is excessive repetition.

(3) Repetition. R e p e t i t i o n is a n o t h e r instance of the relative nature of stylistic ill-formedness. G e n - erally, repetition of strings is to be avoided; h o w e v e r , s o m e cases of r e p e t i t i o n are m o r e a c c e p t a b l e t h a n others. T h e d e g r e e of a c c e p t a b i l i t y d e p e n d s on the s y n t a c t i c f u n c t i o n of the r e p e a t e d strings. In " t h e m e e t i n g is very very i m p o r t a n t , " the two instances of " v e r y " h a v e the s a m e s y n t a c t i c function: t h e y b o t h intensify the adjective " i m p o r t a n t . " This double r e p e - tition, lexical and syntactic, is considered p o o r style. T h e error correction for this s e n t e n c e can be seen in

Figure 12. By contrast, in "it does not surprise me

that t h a t institution no longer e x i s t s , " the t w o instances of " t h a t " h a v e different syntactic roles - o n e is a c o n j u n c t i o n ; the other, a d e t e r m i n e r . This sentence is stylistically m o r e acceptable. Finally, in " w h a t he does does not c o n c e r n us," the two instances of " d o e s " belong to two different clauses. T h e style rules accept lexical repetition of this kind.

C o r r e c t i n g s t y l i s t i c e r r o r s in a f i t t e d parse.

Because syntactic i n f o r m a t i o n is always available, the style rules can apply to fitted parses, as they do to regular sentences. T h e y not only signal stylistic errors within the fitted parse but also assign different degrees of acceptability to different types of fitted parses. As n o u n phrases are the m o s t c o m m o n l y e n c o u n t e r e d type of f r a g m e n t , the rules a c c e p t n o u n p h r a s e s ( " M y w a r m e s t regards to y o u r s o n " ) but m a r k s u b o r d i n a t e clauses as i n c o m p l e t e ( " B e c a u s e he refused to sign the p a p e r s " ) , as shown in Figures 13 and 14.

E x t r a - l i n g u i s t i c f a c t o r s . The degree of stylistic

i l l - f o r m e d n e s s of a s e n t e n c e d e p e n d s on e x t r a - linguistic factors. T h e use of c o n t r a c t e d v e r b - f o r m s (e.g. " d o n ' t " , "I'11") is quite a c c e p t a b l e in i n f o r m a l writing; it is to be avoided, though, in f o r m a l docu- ments. Style rules should a c c o m m o d a t e different degrees of formality. T h e y should also be sensitive to the stylistic n o r m s o b s e r v e d in d i f f e r e n t d o m a i n s . In a technical m a n u a l , f o r e x a m p l e , a u n i f o r m s e n t e n c e - p a t t e r n is preferred, as it facilitates the r e a d e r ' s c o m - p r e h e n s i o n ; in f r e s h m a n c o m p o s i t i o n s , on the o t h e r hand, a variety of s e n t e n c e - p a t t e r n s is m o r e a p p r o p r i - ate as it breaks the m o n o t o n y . T h e style c o m p o n e n t of EPISTLE will a d d r e s s such extra-linguistic f a c t o r s in addition to the purely linguistic factors. In order to do so, it will p r e s e n t the user with a m e n u of style op- tions. T h e selection of the f o r m a l option will activate the " n o v e r b - c o n t r a c t i o n " rule; the selection of the i n f o r m a l o p t i o n will s u p p r e s s it. Similarly, the

(10)

C M P D - - - V P I . . . N P I . . . D E T . . . A D J * . . . . " A " I . . . N O U N * - - - " d e c i s i o n "

I . . . R E L C L L - - N P . . . P R O N * - - - " w h i c h " i - - V E R B * - - - " w a s "

I - - A J P I . . . . A D J * . . . . " m o d e r a t e " I . . . . A D V . . . " e n o u g h " I . . . . I N F C L I - - I N F T O - - - " t o "

l - - V E R B * - - - " s a t i s f y "

i - - N P l . . . A J P . . . A D V * . . . . " e v e n " I . . . D E T . . . A D J * . . . . " m y I . . . N O U N * - - - " o b j e c t i o n s " . . . V E R B . . . . " w a s "

. . . V E R B * . . . r e a c h e d " - - - C O N J * . . . a n d "

- - - V P l . . . N P I . . . D E T . . . A D J * . . . . " t h e " I I . . . N O U N * - - - " m e e t i n g "

I . . . V E R B . . . . " w a s "

I . . . A V P . . . A D V * . . . . " f i n a l l y " I . . . V E R B * . . . a d j o u r n e d "

- - - P U N C . . . . " . "

S T Y L I S T I C W E A K N E S S : M I S S I N G C O M M A I N C O M P O U N D S E N T E N C E . W H Y N O T H A V E A C O M M A B E F O R E T H E C O N J U N C T I O N ?

. . . w a s r e a c h e d , a n d t h e m e e t i n g w a s f i n a l l y a d j o u r n e d .

Figure 11. Diagnosis of a punctuation problem.

D E C L I - - - N P I . . . D E T . . . A D J * . . . T h e " I I . . . N O U N * - - - " m e e t i n g "

i - - - V E R B * - - - " i s "

i - - - A J P i . . . . A V P . . . A D V * . . . . " v e r y " I I . . . . A V P . . . A D V * . . . v e r y " I I . . . . A D J * . . . i m p o r t a n t " I - - - P U N C . . . . "

S T Y L I S T I C W E A K N E S S : R E P E T I T I O N . W H Y N O T A V O I D R E P E T I T I O N ?

T h e m e e t i n g i s v e r y i m p o r t a n t .

Figure 12. Diagnosis of a repetition problem.

F I T T E D I - - - N P * I . . . . D E T . . . A D J * . . . M y "

i . . . . A J P . . . A D J * . . . . " w a r m e s t " i . . . . N O U N * . . . r e g a r d s "

- - - P P I . . . P R E P . . . . " t o "

E . . . D E T . . . A D J * . . . . " y o u r "

I . . . N O U N * - - - " s o n " - - - P U N C . . . . " . "

Figure 13. Fitted noun phrase (no style problems).

[image:10.612.46.546.55.715.2] [image:10.612.53.555.68.362.2] [image:10.612.42.566.364.676.2]

(11)

K. Jensen, G.E. H e i d o r n , L.A. Miller, and Y. Ravin Parse Fitting and Prose Fixing

F I T T E D - - - C O N J . . . . " B e c a u s e "

- - - V P * I . . . . N P . . . P R O N * - - - " h e " I . . . . V E R B * . . . r e f u s e d " I . . . . I N F C L I - - I N F T O . . . t o "

l - - V E R B * - - - " s i g n "

I - - N P I . . . D E T . . . A D J * . . . t h e " I . . . N O U N * - - - " p a p e r s "

- - - P U N C . . . . " . "

P O S S I B L E S T Y L I S T I C W E A K N E S S : I N C O M P L E T E S E N T E N C E .

[image:11.612.61.494.63.177.2]

W H Y N O T C O M P L E T E T H I S S E N T E N C E B Y A D D I N G A M A I N C L A U S E ?

Figure 14. Fragment (subordinate clause) with diagnosis.

technical-writing option will diagnose excessive syntactic variety, w h e r e a s the c r e a t i v e - w r i t i n g o p t i o n will diagnose m o n o t o n o u s regularity.

Potential Difficulties

B e c a u s e the e r r o r - d e t e c t i o n and fitting p r o c e d u r e s permit all sorts of n o n - s e n t e n c e s to survive, they will inevitably increase the n u m b e r of ambiguities that the s y s t e m produces, and will require that additional e f f o r t be spent to restrict the n u m b e r of possible parses. This is a difficult but by no means impossible task, since all it entails is the addition of m o r e t h o r o u g h constraints on the core g r a m m a r .

As an example, consider the input string " W h a t exactly does that 15 m o n t h s d o . "

T h e intended m e a n i n g of this string could p r o b a b l y be p a r a p h r a s e d as " W h a t exactly does that 1 5 - m o n t h period m e a n ? " But there are two p r o b l e m s with the input. First and most confusingly, the phrase " t h a t 15 m o n t h s , " as given, has a plural head noun ( " m o n t h s " ) and a singular d e t e r m i n e r ( " t h a t " ) . Since the CG c a n n o t u n d e r s t a n d meaning, it has no w a y of telling that the given phrase might be an elided f o r m of " t h a t 1 5 - m o n t h p e r i o d . " It t h e r e f o r e detects and corrects a syntactic error: n u m b e r d i s a g r e e m e n t b e t w e e n p r e m o - difier and noun. Secondly, the input string should end

with a question mark. But in order to diagnose this

error, the g r a m m a r needs to realize t h a t a q u e s t i o n was intended.

W h e n the p r o b l e m s e n t e n c e was s u b m i t t e d to an earlier version of the CG, three parses resulted (Figures 15-17).

T h e parse in Figure 15 would be a p p r o p r i a t e for a s e n t e n c e such as " W h o ( e v e r ) e x a c t l y d o e s t h a t job prevails." H o w e v e r , it is thoroughly unhelpful for the

sentt:nce at hand. It diagnoses two errors, neither of

which really exists.

Figure 16 is close to a c c e p t a b l e for the input string. If the time adverbial NP (AVPNP in the parse tree) were replaced by a subject NP, the syntactic structure would be c o r r e c t for the intended meaning. As things

stand, this parse is only 5 0 % helpful: it c o r r e c t l y diagnoses the missing question mark, but it incorrectly insists that " m o n t h " should be singular in number.

The third parse (Figure 17) gives the desired single error correction, but it does so on the basis of a totally i n a p p r o p r i a t e parse. T h e structure in this figure would fit a question like " W h o exactly suffers (in order) that m a n y people might live?"

T h e c u r r e n t v e r s i o n of the CG blocks the three parses in Figures 15 t h r o u g h 17 on a principled basis. Figure 15 c a n be b l o c k e d b y tightening s o m e c o n - straints on the diagnosis of s u b j e c t - v e r b n u m b e r disa- g r e e m e n t in fitted parses. Figure 16 is b l o c k e d b e - cause of the p r e s e n c e of an adverbial NP where the subject NP ought to be. Figure 17 is b l o c k e d b y stipu- lating t h a t all s u b o r d i n a t e clauses b e g i n n i n g with a " t h a t " c o n j u n c t i o n should have m o d a l or subjunctive predicates.

An a c c e p t a b l e parse (Figure 18) is p r o v i d e d w h e n the n u m b e r a g r e e m e n t r e s t r i c t i o n is r e m o v e d f r o m phrases like " t h a t 15 m o n t h s . " This is a c c o m p l i s h e d by telling the p r o p e r rule to ignore n u m b e r a g r e e m e n t in particular cases that involve a small subset of q u a n - tified English time words. A d m i t t e d l y , the fix would be m o r e pleasing if it were p a r t of a larger scheme for u n d e r s t a n d i n g m e a n i n g and context. But the c o r r e c - tion m o v e s in the right direction, and certainly does not p r e v e n t future processing with a m o r e intelligent semantic c o m p o n e n t . This situation clearly illustrates h o w e r r o r d e t e c t i o n results in the addition of finer constraints on the core g r a m m a r .

Related W o r k

T h e parsing a p p r o a c h closest in spirit to o u r fitting p r o c e d u r e is that described in Slocum (1983, p. 170): the LRC M a c h i n e T r a n s l a t i o n System uses a " s h o r t e s t p a t h " technique to construct a " p h r a s a l analysis" of u n g r a m m a t i c a l input. With this analysis, phrases can be translated separately, e v e n in the a b s e n c e of a total s e n t e n c e parse. Aside f r o m S l o c u m ' s work, m o s t of the r e p o r t s in this field suggest t h a t u n p a r s a b l e or

(12)

D E C L - - - N P I . . . P R O N * - - - " W h a t "

I . . . V P I . . . A V P . . . A D V * . . . e x a c t l y "

I . . . V E R B . . . . " d o e s "

I . . . N P I . . . D E T . . . A D J * . . . . " t h a t "

I . . . Q U A N T - - - A D J * . . . . " 1 5 "

I . . . N O U N * - - - " m o n t h s "

- - - V E R B * - - - " d o "

- - - P U N C . . . . " . "

G R A M M A T I C A L E R R O R : S U B J E C T - V E R B N U M B E R D I S A G R E E M E N T .

W H A T e x a c t l y d o e s t h a t 1 5 m o n t h s D O .

C O N S I D E R :

W H A T e x a c t l y d o e s t h a t 1 5 m o n t h s D O E S .

G R A M M A T I C A L E R R O R : P R E M O D I F I E R - N O U N N U M B E R D I S A G R E E M E N T .

W h a t e x a c t l y d o e s T H A T . . . M O N T H S d o .

C O N S I D E R :

W h a t e x a c t l y d o e s T H A T . . . M O N T H d o .

T H E C O M B I N E D G R A M M A T I C A L C O R R E C T I O N S A R E :

[image:12.612.47.499.58.326.2]

W h a t e x a c t l y d o e s t h a t 1 5 m o n t h d o e s .

Figure 15. Two faulty error diagnoses; inappropriate parse.

ill-formed input should be handled by relaxation

techniques, that is, by relaxing restrictions in the grammar rules in some principled way. This is u n d o u b t e d l y a useful strategy - one which EPISTLE makes use of, in fact, in its rules for d e t e c t i n g grammatical errors ( H e i d o r n et al. 1982). H o w e v e r , it is questionable w h e t h e r such a strategy can ultimately succeed in the face of the overwhelming (for all practical purposes, infinite) variety of ill-formedness with which we are faced when we set out to parse truly unrestricted natu-

ral language input. If all ill-formedness is rule-based

(Weischedel and Sondheimer 1981, p. 3), it can only

be by some very loose definition of the term rule, such

as that which might apply to the fitting algorithm described here.

Thus Weischedel and Black ( 1 9 8 0 ) suggest three t e c h n i q u e s f o r r e s p o n d i n g intelligently to u n p a r s a b l e inputs:

(a) using presuppositions to determine user assump-

tions; this course is not available to a syntactic grammar like EPISTLE's;

(b) using relaxation techniques;

(c) supplying the user with i n f o r m a t i o n a b o u t the point w h e r e the parse is blocked; this would require an interactive environment, which would not be possible for e v e r y type of natural language processing application.

Kwasny and Sondheimer (1981) are strong p r o p o -

nents of relaxation techniques, which they use to handle b o t h cases of clearly u n g r a m m a t i c a l structures, such as co-occurrence violations like s u b j e c t / v e r b disa- greement, and cases of perfectly acceptable but difficult constructions (ellipsis and conjunction).

Weischedel and S o n d h e i m e r ( 1 9 8 2 ) describe an improved ellipsis processor. No longer is ellipsis handled with r e l a x a t i o n techniques, but by predicting transformations of previous parsing paths that would allow for the matching of f r a g m e n t s with plausible

contexts. This plan would be appropriate as a next

step after the fitted parse, but it does not guarantee a parse for all elided inputs.

H a y e s and Mouradian (1981) also use the relaxa-

tion method. T h e y achieve flexibility in their parser

by relaxing consistency constraints (grammatical restric-

tions, like K w a s n y and S o n d h e i m e r ' s c o - o c c u r r e n c e violations) and also by relaxing ordering constraints. H o w e v e r , they are working with a r e s t r i c t e d - d o m a i n semantic system and their a p p r o a c h , as they admit, " d o e s not e m b o d y a solution for flexible parsing of natural language in g e n e r a l " (p. 236).

The work of Wilks is heavily semantic and there- fore quite d i f f e r e n t f r o m EPISTLE, but his general philosophy meshes nicely with the philosophy of t h e fitted parse: " I t is p r o p e r to p r e f e r the normal ... but it would be absurd ... not to accept the abnormal if it is d e s c r i b e d " (Wilks 1975, p. 267).

(13)

D E C L - - - N P . . . P R O N * - - - " W h a t "

- - - A V P . . . A D V * . . . . " e x a c t l y "

- - - V E R B . . . " d o e s "

- - - A V P N P I - - - D E T . . . A D J * . . . . " t h a t "

I - - - Q U A N T - - - A D J * ... 1 5 "

l - - - N O U N * - - - " m o n t h s "

- - - V E R B * . . . . " d o "

- - - P U N C . . . " . "

G R A M M A T I C A L E R R O R : M I S S I N G Q U E S T I O N M A R K .

W h a t e x a c t l y d o e s t h a t 1 5 m o n t h s d o .

C O N S I D E R :

W h a t e x a c t l y d o e s t h a t 1 5 m o n t h s d o ?

G R A M M A T I C A L E R R O R : P R E M O D I F I E R - N O U N N U M B E R D I S A G R E E M E N T .

W h a t e x a c t l y d o e s T H A T . . . M O N T H S d o .

C O N S I D E R :

W h a t e x a c t l y d o e s T H A T . . . M O N T H d o .

T H E C O M B I N E D G R A M M A T I C A L C O R R E C T I O N S A R E :

W h a t e x a c t l y d o e s t h a t 1 5 m o n t h d o ?

Figure l 6 . O n e f a u l t y diagnosis, o n e c o r r e c t ; n e a r - s a t i s f a c t o r y parse.

D E C L I - - - N P . . . P R O N * ... W h a t "

- - - A V P . . . A D V * . . . . " e x a c t l y "

- - - V E R B * - - - " d o e s "

- - - S U B C L I - - C O N J . . . t h a t "

I - - N P I . . . Q U A N T - - - A D J ~ . . . 1 5 "

I I . . . N O U N * - - - ~ ' m o n t h s "

I - - V E R B * - - - " d o "

- - - P U N C . . . . " . "

C O N S I D E R :

Figure 17. Correct error diagnosis but misleading parse.

D E C L I - - - N P . . . P R O N * - - - " W h a t "

- - - A V P . . . A D V * . . . . " e x a c t l y "

- - - V E R B . . . . " d o e s "

- - - N P I . . . D E T . . . A D J * . . . . " t h a t "

I . . . Q U A N T - - - A D J * . . . 1 5 "

I . . . N O U N * ... m o n t h s "

- - - V E R B , - - - , , d o ,,

- - - P U N C . . . . " . "

C O N S I D E R :

Figure 18. Correct error diagnosis; satisfactory parse.

[image:13.612.42.562.49.744.2] [image:13.612.45.555.67.388.2]

(14)

R e f e r e n c e s

Hayes, P.J. and Mouradian, G.V. 1 9 8 1 Flexible Parsing. Am. J. Comp. Ling. 7(4): 232-242.

Heidorn, G.E. 1972 Natural Language Inputs to a Simulation Programming System. Technical Report NPS-55HD72101A. Monterey, California: Naval Postgraduate School.

Heidorn, G.E. 1982 Experience with an Easily Computed Metric for Ranking Alternative Parses. Proc. 20th Annual Meeting o f the ACL. Toronto, Canada: 82-84.

Heidorn, G.E.; Jensen, K.; Miller, L.A.; Byrd, R.J.; and Chodorow, M.S. 1982 The EPISTLE Text-Critiquing System. I B M Sys- tems Journal 21(3): 305-326.

Jensen, K. and Heidorn, G.E. 1983 The Fitted Parse: 100% Parsing Capability in a Syntactic Grammar of English. Proc. Conf. on Applied Natural Language Processing. Santa Monica, California: 93-98.

Kwasny, S.C. and Sondheimer, N.K. 1981 Relaxation Techniques for Parsing Ill-Formed Input. Am. J. Comp. Ling. 7(2): 99- 108.

Lasnik, H. and Freidin, R. 1 9 8 1 Core Grammar, Case Theory, and Markedness. Proc. 1979 G L O W Conf. Pisa, Italy.

Miller, L.A.; Heidorn, G.E; and Jensen, K. 1981 Text-Critiquing with the E P I S T L E System: An Authors's Aid to Better Syn- tax. A F I P S Conf. Proc., Vol. 50. Arlington, Virginia: 649-655. Slocum, Jonathan. 1983 A Status Report on the LRC Machine

Translation System. Proc. Conf. on Applied Natural Language Processing. Santa Monica, California: 166-173.

Weischedel, R.M. and Black, J.E. 1980 Responding Intelligently to Unparsable Inputs. Am. J. Comp. Ling. 6(2): 97-109. Weischedel, R.M. and Sondheimer, N.K. 1981 A Framework for

Processing Ill-Formed Input. Research Report. University of Delaware.

Weischedel, R.M. and Sondheimer, N.K. 1982 An Improved Heuristic for Ellipsis Processing. Proc. 20th Annual Meeting o f the ACL. Toronto, Canada: 85-88.

Wilks, Yorick 1975 An Intelligent Analyzer and Understander of English. Comm. A C M 18(5): 264-274.