THE TICC: PARSING INTERESTING TEXT
David Allport
School of Cognitive Sciences
University of Sussex, Falmer, Brighton BN1 9QN
davida%uk.ac.sussex.cvaxa@cs.ucl.ac.uk
ABSTRACT
This paper gives an overview of the natural language problems addressed in the Traffic Information Collator/Condenser (TICC) project, and describes in some detail the "interesting-corner parser" used in the TICC's Natural Language Summariser. The TICC is designed to take free text input describing local traffic incidents, and automatically output local traffic information broadcasts for motorists in appropriate geographical areas. The "interesting-corner parser" uses both syntactic and semantic information, represented as features in a unification-based grammar, to guide its bi-directional search for significant phrasal groups.
1. INTRODUCTION
The overall goal of the TICC project is to show the potential benefits of automatically broadcasting local traffic information. Our target system, dealing with traffic incidents in the Sussex area, is to be completed by September 1989. The project forms part of the Alvey Mobile Information Systems large-scale Demonstrator.
The Natural Language Summariser component of this system is being developed at Sussex University. Its function is to accept a series of free text messages describing traffic incidents, and to extract from these messages any information that might be relevant for broadcast to other motorists.
The Natural Language Summariser is designed to work in a restricted domain, and only needs to solve a subset of the problems of text understanding. The TICC's output messages are short and very simple assemblies of canned text, posing no significant natural language generation problems. Our main concern is that the messages should be useful to motorists, i.e. that they be reliable indications of the state of the roads at the time they are broadcast.
Programs such as METEO [Chevalier et al. 1978] have demonstrated that in a restricted domain with a restricted sub-language, automatic information broadcasts can be useful. Programs such as FRUMP [De Jong 1979, De Jong 1982] have also demonstrated that expectation-driven analysers can often successfully capture the gist of free text. However, the top-down depth-first confirmation of expectations based on sketchy scripts, ignoring most of the input structure, can lead to serious misinterpretations [Riesbeck 82]. Our concern for accuracy of interpretation has led us to a processing strategy in which the Natural Language Summariser analyses the input text at a far greater level of detail than is given in the output messages, so the system "knows more" about the traffic incidents it is describing than it says in its broadcasts. Our parser uses both syntactic and semantic information to guide its search for phrases in the input that might be directly or indirectly relevant to motorists, and explores alternative possible interpretations bottom-up using an active chart [Earley 1970, Kay 1973].
This is an ongoing research project, and we do not claim to have solved all the problems involved in developing a successful system yet. The current paper considers the particular natural language problems we are addressing and describes the "interesting-corner parser" that has been implemented in the prototype system.
2. THE NATURAL LANGUAGE SUMMARISER'S TASK
2.1 INPUT: Our input data comes from the Sussex Police, who have a computer system for storing the text of incoming traffic messages from a variety of sources (e.g. patrol cars, emergency services, motoring organisations). An example of the style of this text, derived from real input but with names etc. changed, is given in fig. 1.
single incident continues over a number of hours, depending on the severity of the incident, and the TICC can afford to spend up to one minute analysing an average length message. All aspects of the police management of the incident are described, and many of the messages are only indirectly relevant to motorists. For example, if one of the vehicles involved in an accident needs a total lift to remove it from the road, the likely delay time given in the broadcast message may be longer, although the need for the total lift will not itself be mentioned in the broadcast. Much of the input is completely uninteresting for the TICC's purposes, such as details of injuries sustained by people involved, or of which police units are dealing with the incident.
There is a great variety in prose style, from the "normal" to the highly telegraphic, but there is a strong tendency towards the abbreviated. It is a non-trivial task to correctly identify the lexical items in the text. Parts of the input string which are not recognised as entries in the Summariser's lexicon (or regular derivations from entries) may be of four types:
i) Names, numbers etc., which may be recognised as such from the context (e.g. pc humphries requests .... ford cortina reg ABC123).
ii) Other English words not in the lexicon, which cannot reliably be predicted to be proper names (e.g. hovis lorry b/dwn o/s bull's head ph).
iii) Misspellings of items in the lexicon.
iv) Non-standard abbreviations of known words or phrases.
Abbreviations are not always of "canonical" form, and may be derived from complete words in three different ways, as follows:
i) Single morpheme roots: These usually have more than one possible abbreviated form and never include punctuation, e.g. gge, grg or gar for garage. But some words do have canonical abbreviations (e.g. rd for road and st for street (or saint)).
ii) Multi-morpheme roots: These often take only the first letter from the first root morpheme, and then either part or all of the second morpheme. They occasionally include slash punctuation, e.g. cway, c/way for carriageway, mcycle, m/c for motorcycle, o/s for outside (or offside), and ra for roundabout.
iii) Sequences/phrases: Some sequences of words have canonical abbreviations (e.g. bbc and not britbrdcrp). Canonical examples seen in fig. 1 below include rta for road traffic accident and oic for officer in charge.
Non-canonical sequences may have a variety of abbreviations for each of the constituent words, and may or may not have slash or period punctuation, e.g. f/b for fire brigade, eamb or esamb for east (sussex) ambulance, hazchem for hazardous chemicals.
The problem is compounded for the TICC by the fact that the input we receive is all in upper case, hence even the convention of distinguishing proper name abbreviations by upper case is not applicable. In order to cope with these different types of input string, we need not only a "phrasal lexicon" as advocated by Becker [Becker 1975], but also an "abbreviation lexicon".
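As a rough illustration of what an "abbreviation lexicon" lookup might involve, the sketch below maps several variant spellings onto one full form after stripping slash and period punctuation. It is only a sketch: the dictionary layout and the names `ABBREVIATIONS`, `normalise` and `expand` are assumptions made for this example, and the entries are limited to unambiguous abbreviations quoted in the text. Ambiguous forms such as st or o/s would need context and are deliberately left out.

```python
import re

# Hypothetical abbreviation lexicon. The entries come from examples in the
# text; the flat dictionary representation is an assumption of this sketch.
ABBREVIATIONS = {
    "gge": "garage", "grg": "garage", "gar": "garage",
    "rd": "road",
    "cway": "carriageway",
    "mc": "motorcycle",
    "rta": "road traffic accident",
    "oic": "officer in charge",
    "fb": "fire brigade",
    "hazchem": "hazardous chemicals",
}

def normalise(token):
    """Lower-case a token and drop slash and period punctuation."""
    return re.sub(r"[/.]", "", token.lower())

def expand(token):
    """Return the full form of a known abbreviation, else the lower-cased token."""
    return ABBREVIATIONS.get(normalise(token), token.lower())

# The input arrives in upper case, so lookups are case-insensitive.
print(expand("C/WAY"), "|", expand("F/B"), "|", expand("HOVIS"))
```

Unknown tokens such as HOVIS are passed through unchanged, leaving the decision about names and misspellings to later processing.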
Time: 1634 Location: scaynes hill, haywards heath
1634 rta serious near top scaynes hill persons trapped rqst esamb f/b 1/2 mile south jw freshfield rd.
1638 fm pc 123 acc inv reqd poss black oic pc 456
1639 fire and amb en route
1642 req total lift for saloon car rota garage
1654 eamb now away from scene
1655 freshfield bodyshop on way
1657 fm pc 456 req rd closed n and s of hill st crnr
1658 req two traff units to assist re closures
1709 can we inform brighton 1234 tell mr fred smith will be late due to this rta
1715 local authority required loose paving stones
1723 fm pc 234 at st george's hosp. dr in charge having examined mr jones now feels this is not likely to be a black - driver of lorry has arrived, will probably be released after treatment for cuts. car 45 will be free from hosp in about 20 min
Fig. 1
Our aim is to have a unified process for identifying idiomatic or fixed phrases and abbreviated sequences as in iii) above, so that for example as soon as poss, asap and a.s.a.p. are all identified as the same "lexical item". Work on this is, however, at a preliminary stage, and we have not yet found any general solution to the problem.
2.2 SUMMARISATION: Deriving a short broadcast for motorists from a long series of messages such as that in fig. 1 requires two main phases. First, the Natural Language Summariser must build up a picture of what is happening at the scene of the incident. Second, a Tactical Inferencer must decide what motorists should be told regarding the incident.
The Natural Language Summarising process also requires two phases. In the first phase a Message Analyser extracts interesting information from a single message. In the second phase an Event Recogniser puts together the information from a series of messages to build up a description of the incident as a whole, or rather those aspects of the incident relevant to other motorists (see fig. 2 below).
The Message Analyser does not build a complete representation of the syntax and semantics of messages such as those at 1709 and 1723 in fig. 1 above, since they have no bearing on the progress of the traffic incident as far as other motorists are concerned. It just searches for phrases describing "interesting" events. These fall into two classes:
Primary Events: Such as vehicles blocking the road, substances spilling onto the road, all or part of the road being closed, diversions being put into operation, garage coming to remove vehicles from the road, services like fire brigade and county council removing hazards, etc.
The input messages rarely describe these events in full, so the Event Recogniser must infer, for example, that if the local council has been called out to remove debris from the road, then at some time earlier debris must have fallen on the road.
Secondary Events: These include requests that some of the primary events should happen, and people being informed that primary events have happened, are happening or will happen.
We will not have any model of the beliefs of the various agents involved in incident handling. As far as the TICC Natural Language Summariser is concerned, the meaning of someone being informed that a primary event has happened is equivalent to the statement that it has happened. But the Tactical Inferencer will use its model of the typical progress of traffic incidents to predict the significance of the primary events for other motorists. For example, if a vehicle is stated to need a front suspended tow, then the Tactical Inferencer will predict that a certain amount of time will elapse before the vehicle is towed away.
2.3 OUTPUT: Not every message input to the system will produce an update to the Event Recogniser's description of the incident, because the Message Analyser may fail to find a description of an interesting event. But even when the Event Recogniser passes a description of a traffic incident to the Tactical Inferencer, this will not necessarily result in a broadcast. For example, the Event Recogniser may recognise a series of messages as describing a traffic light failure incident. The Tactical Inferencer may decide to broadcast a message about this incident if it has occurred on a busy main road in the rush hour, but not if it has occurred late at night in a small village.
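The traffic light example can be pictured as a small decision rule. The following is a toy sketch only: the `Incident` fields, the `RUSH_HOURS` windows and the `should_broadcast` function are all hypothetical, and the rules actually used by the Tactical Inferencer are not given in this paper.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    kind: str        # e.g. "traffic_light_failure" (hypothetical label)
    road_class: str  # hypothetical values: "main" or "minor"
    hour: int        # hour of day (0-23) at which the incident is active

# Assumed rush-hour windows; the paper does not state actual thresholds.
RUSH_HOURS = set(range(7, 10)) | set(range(16, 19))

def should_broadcast(incident):
    """Toy rule: broadcast a light failure only on a main road in the rush hour."""
    if incident.kind == "traffic_light_failure":
        return incident.road_class == "main" and incident.hour in RUSH_HOURS
    # Other incident types are outside the scope of this sketch.
    return False

busy = Incident("traffic_light_failure", "main", 8)
quiet = Incident("traffic_light_failure", "minor", 23)
print(should_broadcast(busy), should_broadcast(quiet))
```

The point of the sketch is only that the same incident description can lead to a broadcast or to silence, depending on non-linguistic context.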
[Fig. 2. Components shown: Free Text Messages; Message Analyser; Event Recogniser; Tactical Inferencer; Road/Junction Database; Incident Description Database; Incident Database; Messages for ...]
The domain knowledge used in the Tactical Inferencer is non-linguistic, and concerns inferences about the likely time delays for different types of incident, the geographical areas likely to be affected by a given incident, etc. The Transport and Road Research Laboratory, part of the Department of Transport, are assisting us in the development of rules for this part of the system.
There are other components of the TICC system which we do not detail in this paper, such as the graphical interface, via a map of the Sussex area, to a database of current traffic incidents. Although the TICC is designed to send its messages to a dedicated broadcasting system, the actual broadcasting aspect of the project is the responsibility of RACAL research, one of our other Alvey collaborators. In our current prototype system, implemented on a Sun-3 workstation, broadcasts to local geographical areas in Sussex are simulated, and the Tactical Inferencer is extremely simple.
3. INTERESTING-CORNER PARSING
The parser that has been implemented for the Message Analyser searches bidirectionally for all syntactic parses associated with semantically interesting parts of the input. Before describing the search strategy in more detail, we need to clarify what a syntactic parse looks like in our grammar formalism, and how we specify what is semantically interesting.
3.1 THE GRAMMAR FORMALISM: We use a unification-based grammar formalism, with rules that look similar to context-free phrase-structure rules. Both immediate dominance and constituent ordering information are specified by the same rule, rather than by separate statements as in FUG [Kay 1985], LFG [Kaplan & Bresnan 1982] and GPSG [Gazdar et al 1985]. Feature-passing between categories in rules is done explicitly with logical variables, rather than by conventions such as the HFC and FFP in GPSG [Gazdar et al 1985]. Thus the rule format is most similar to that used in DCG's [Pereira & Warren 1980]. Categories in rules are feature/value trees, and at each level the value of a feature may itself be another feature/value tree. Feature values may be given logical names, and occurrences of feature values having the same logical name in a rule must unify.
The feature trees which constitute categories in our grammar may specify both syntactic and semantic features, so that we can write "syntactic" rules which also identify the semantic types of their constituents. For example, if we use the feature sf on categories to specify a tree of semantic features for that category, then the rule:
(1) vp=(sf:VSF) --> v=(sf=(patient:P):VSF), np=(sf:P)
says that a verb phrase may consist of a verb followed by a noun phrase, and that the semantic features on the noun phrase (labelled P) must unify with the semantic features specified as the value of the patient sub-feature of the verb's semantic features, and additionally that the semantic features on the whole verb phrase (labelled VSF) must unify with the (complete tree of) semantic features on the verb.
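The way the logical names P and VSF tie together the three categories in rule (1) can be sketched with a small unifier over nested dictionaries. This is a minimal illustration under our own assumptions, not the implementation used in the TICC: feature trees are modelled as Python dicts, the `Var` class, `walk`, `unify` and `apply_rule_1` are names invented here, and the unifier only checks compatibility and records bindings rather than merging trees.

```python
class Var:
    """A logical name such as P or VSF in rule (1)."""
    def __init__(self, name):
        self.name = name

def walk(term, bindings):
    """Follow bindings until reaching a non-variable or an unbound variable."""
    while isinstance(term, Var) and term.name in bindings:
        term = bindings[term.name]
    return term

def unify(a, b, bindings):
    """Unify two feature trees (dicts, atoms or Vars); return bindings or None."""
    a, b = walk(a, bindings), walk(b, bindings)
    if isinstance(a, Var):
        if isinstance(b, Var) and b.name == a.name:
            return bindings
        return {**bindings, a.name: b}
    if isinstance(b, Var):
        return {**bindings, b.name: a}
    if isinstance(a, dict) and isinstance(b, dict):
        for key in a.keys() & b.keys():  # only shared features can clash
            bindings = unify(a[key], b[key], bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if a == b else None

def apply_rule_1(verb, noun_phrase):
    """Rule (1): return the vp feature tree for verb + noun phrase, or None."""
    b = unify({"sf": {"patient": Var("P")}}, verb, {})
    if b is not None:
        b = unify({"sf": Var("VSF")}, verb, b)
    if b is not None:
        b = unify({"sf": Var("P")}, noun_phrase, b)
    if b is None:
        return None
    return {"cat": "vp", "sf": walk(Var("VSF"), b)}

# Illustrative verb entry with a semantic feature tree (values are examples).
CLOSE = {"cat": "v", "sf": {"event_type": "road_closure",
                            "agent": "service", "patient": "roadlocation"}}
print(apply_rule_1(CLOSE, {"cat": "np", "sf": "roadlocation"}))
print(apply_rule_1(CLOSE, {"cat": "np", "sf": "vehicle"}))
```

In the first call P is bound to roadlocation via the verb, so the noun phrase is accepted and the verb phrase inherits the verb's semantic features; in the second call the noun phrase clashes with P and the rule fails.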
By adding domain-specific semantic feature information to lexical categories, we gain the power of domain-specific semantic grammars, which have been shown to be successful for handling ill-formed input in limited domains [Burton 1976]. But because we use unification by extension as the basic criterion for node admissibility when we test for rules to license local trees, we can also capture generalisations about syntactic categories that are not domain-specific. So for example if we had a verb-phrase rule such as (2) and a lexical entry as in (3):
(2) vp --> v=(tr=trans), np
(3) close v=(tr=(trans),
sf=(event_type=road_closure, agent=service,
patient=roadlocation))
then the verb feature tree specified in (2) would unify with the verb feature tree in (3). Hence close can be treated both as a domain-specific verb and as an instance of the general class of transitive verbs.
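Reading "unification by extension" as a one-way test, in which the lexical category may carry more features than the rule asks for, the match between (2) and (3) can be sketched as follows. The nested-dict encoding and the function name `extends` are assumptions made for illustration.

```python
def extends(pattern, target):
    """True if target contains every feature in pattern (extras are allowed)."""
    if isinstance(pattern, dict):
        return isinstance(target, dict) and all(
            key in target and extends(value, target[key])
            for key, value in pattern.items())
    return pattern == target

# The verb category required by rule (2), and the lexical entry (3) for close.
RULE_2_VERB = {"tr": "trans"}
CLOSE = {"tr": "trans",
         "sf": {"event_type": "road_closure", "agent": "service",
                "patient": "roadlocation"}}

print(extends(RULE_2_VERB, CLOSE))  # the general transitive-verb rule accepts close
print(extends(CLOSE, RULE_2_VERB))  # the reverse direction does not hold
```

The check succeeds only in one direction: the rule's sparse tree is satisfied by the richer lexical tree, but not the other way round.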
Using a feature-based semantic grammar therefore gives us a compact representation of both domain-independent and domain-specific information in a single uniform formalism. Syntactic generalisations are captured by rules such as (2), and domain-specific sub-categorisation information is expressed in feature-trees as in (3), which states that close has the semantic features of a road-closure
lexicon also includes domain-specific meanings for particular lexical items, e.g. black meaning fatal (cp. messages at 1638 and 1723 in fig. 1 above).
3.2 A GRAMMAR FOR THE TICC DOMAIN: Writing a grammar to give adequate coverage of the input that our system must handle is a lengthy task, which will continue over the next two years. However, analysis of a corpus of data from police logs of over one hundred incidents in the Sussex area, and trials with experimental grammars, have led us to adopt a style of grammar which we expect will remain constant as the grammar expands.
We do not attempt to map telegraphic forms onto "fully grammatical" English forms by some variant of constraint relaxation [Kwasny & Sondheimer 1981]. We simply have a grammar with fewer constraints. This is because it is not always easy to decide what is missing from an elliptical sentence, or which constraints should be relaxed. Consider for example the message at 1655 from fig. 1, repeated here:
(4) freshfield bodyshop on way
It is not at all clear what the "full" sentential form of this message ought to be, since it might also have been phrased as one of:
(5.1) freshfield bodyshop is on the way
(5.2) freshfield bodyshop is on its way
(5.3) freshfield bodyshop are on the way
(5.4) freshfield bodyshop are on their way
Each of (5.1)-(5.4) must be allowed to be grammatical (and each might occur in our type of input), since noun phrases naming corporate entities can regularly be regarded as singular or plural (cp. Ford Motors has announced massive profits ... vs. Ford Motors have announced massive profits). But in each case the semantic representation that the Message Analyser must build only needs to represent the fact that the garage called freshfield bodyshop are going somewhere (which the Event Recogniser will expect to be the scene of the incident, in order to remove the damaged vehicle). Since the distinctions between the syntactic forms in these examples are irrelevant for our purposes, it would be a waste of the parser's effort to introduce search and inference problems in the attempt to map the syntax of (4) uniquely into the syntax of one or other of the forms in (5). Indeed it is more appropriate for our purposes to regard on way as a domain-specific idiomatic phrase, equivalent to en route, entre etc. (each of which occurs in similar contexts).
In keeping with this approach to ill-formedness, our grammar contains many categories (i.e. feature-trees) that would not be recognised as syntactic categories in grammars for normal English, e.g. we have special rules for phrases containing predicted unknowns such as names, car registration numbers, etc. Our parser is looking for phrases describing events rather than sentences, and we will not necessarily always assign a structure with a single "S" label spanning all the input message.
As we noted in 3.1 above, the lexical entries for words that suggest interesting events include trees of semantic features that specify expected fillers for various roles in these events. These feature trees provide selectional restrictions useful for guiding the parse, but do not themselves constitute the "semantics" of the lexical entries. The semantics are represented as first-order logical expressions in a separate field of the lexical entry, and representations of the meaning of phrases are built using semantic rules associated with each syntactic rule, as phrases are completed in the bottom-up parse.
3.3 THE SEARCH STRATEGY: Interesting-corner parsing is basically an adaptation of bottom-up chart parsing to allow island-driving through the input string, whilst still parsing each individual rule uni-directionally. This gives a maximally efficient parse for our goal of reliably extracting from the input all and only the information that is relevant to other motorists. This form of expectation-driven parsing differs from that used in earlier script-based systems such as MARGIE [Schank 1975], ELI [Riesbeck 1978] and FRUMP in four ways:
First, the interesting-corner parser uses an active chart to consider bottom-up all interesting interpretations that might be given to an input message, rather than proceeding left to right and filtering out later (right) candidate interpretations on the basis of earlier (left) context.
Third, the expectations themselves are expressed declaratively in feature trees that form part of the lexical categories, which control the search via standard unification with declarative rules, where previous systems used procedural "requests" in the lexicon.
Fourth, our parser builds an explicit syntactic tree for the input, albeit including semantic features, rather than building a semantic representation "directly".
The interesting-corner parser checks the semantic features on every lexical item in the input to see if they are interesting, but this is a far faster operation than testing many times whether a series of lexical items matches the expectations from a top-down script. This does assume that the parser can identify what the lexical items are, which is problematic as we noted in section 2.1 above. But as we shall see, the interesting-corner parser does use predictions about the presence of lexical items with particular features in its search, and hence is in no worse a position than a strictly top-down parser as regards matching expectations to ill-formed lexical items.
3.3.1 UNIDIRECTIONAL ISLAND-DRIVING: Island-driving is useful for text where one needs to start from clearly identifiable (and in our case, semantically interesting) parts of the input and extend the analysis from there to include other parts. But parsing rules bi-directionally is inherently inefficient. Consider, for example, a chart parse of the input string a b given a single rule:
c => a b.
A standard bottom-up left-to-right active chart parse of this input would create three nodes (1 a 2 b 3), two active edges (an empty one at node 1 and one from nodes 1 to 2) and one inactive edge (from node 1 to 3).
But a bi-directional parse, allowing the rule to be indexed at any point, would build a total of 7 active edges (one empty one at each node, and 2 pairs with one constituent found, built in different directions, i.e. 5 distinct edges). It would also build the same inactive edge in two different directions. For a rule with three daughters, a bidirectional parse produces 14 active edges (9 of which are distinct) and again 2 inactive edges.
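The edge counts quoted above can be reproduced by a short enumeration. The counting model here is our own reading of the example, not something stated in the paper: an active edge is taken to be any proper contiguous run of the rule's daughters, an empty run is proposed once at each node, and every non-empty run is built twice, once by rightward and once by leftward extension. The function names are invented for this sketch.

```python
def bidirectional_edge_counts(n_daughters):
    """Return (total, distinct) active edges for one rule over a matching input.

    Assumed counting model: spans (i, j) over nodes 0..n; the full span is
    the inactive edge and is excluded; empty spans are built once and
    non-empty proper spans twice (once per direction of extension).
    """
    total = distinct = 0
    for i in range(n_daughters + 1):
        for j in range(i, n_daughters + 1):
            if (i, j) == (0, n_daughters):
                continue  # complete span: an inactive edge, not an active one
            distinct += 1
            total += 1 if i == j else 2
    return total, distinct

def left_to_right_active_edges(n_daughters):
    """A left-to-right parse keeps one active edge per proper prefix of daughters."""
    return n_daughters

print(bidirectional_edge_counts(2))   # rule c => a b
print(bidirectional_edge_counts(3))   # a rule with three daughters
print(left_to_right_active_edges(2))
```

Under this model the two-daughter rule yields 7 active edges of which 5 are distinct, and the three-daughter rule yields 14 of which 9 are distinct, matching the figures in the text, against only 2 active edges for the left-to-right parse of c => a b.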
This redundancy in structure-building can be removed by incorporating constituents into rules unidirectionally whilst still parsing the text bidirectionally. We do this by indexing each rule on either left-most or right-most daughter, and parsing in a unique direction away from the indexed daughter. In order to preserve completeness in the search, the chart must contain lists of active and inactive edges for each direction of expansion, although the same structure can be shared in the inactive edge-lists for both directions. The fundamental rule of edge-combination must be augmented so that when an inactive edge is added to the chart, it combines with any appropriate active edges at both of its ends. This process might be called "indexed-corner parsing", in that it effectively combines left-corner parsing and right-corner parsing, and the direction of parse at any stage simply depends upon how the individual grammar rules are indexed.
The interesting-corner parser implements an indexed-corner chart parser, with the addition of an agenda control mechanism and an indexing principle for grammar rules.
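A minimal sketch of the indexed-corner idea, without agenda control or feature trees, is given below. It is a simplification under stated assumptions rather than the TICC parser: categories are atomic strings, empty active edges are not represented, the direction of expansion is stored on each active edge instead of in separate edge-lists, and the function name `parse`, the rule format and the toy grammar are all invented for the example.

```python
from collections import deque

def parse(words, lexicon, rules):
    """Return the set of inactive edges (category, start, end) found for words.

    rules holds (lhs, daughters, index) triples. Index "L" means the rule is
    triggered by its left-most daughter and expands rightwards; index "R"
    means it is triggered by its right-most daughter and expands leftwards.
    """
    inactive, active = set(), set()
    agenda = deque(("I", cat, i, i + 1)
                   for i, word in enumerate(words)
                   for cat in lexicon.get(word, ()))

    def propose(lhs, needed, start, end, direction):
        # An edge with nothing left to find is complete, i.e. inactive.
        if needed:
            agenda.append(("A", lhs, needed, start, end, direction))
        else:
            agenda.append(("I", lhs, start, end))

    def combine(act, inact):
        lhs, needed, a_start, a_end, direction = act
        cat, i_start, i_end = inact
        if direction == "R" and a_end == i_start and needed[0] == cat:
            propose(lhs, needed[1:], a_start, i_end, "R")
        elif direction == "L" and a_start == i_end and needed[-1] == cat:
            propose(lhs, needed[:-1], i_start, a_end, "L")

    while agenda:
        kind, *rest = agenda.popleft()
        edge = tuple(rest)
        if kind == "I":
            if edge in inactive:
                continue
            inactive.add(edge)
            cat, start, end = edge
            for lhs, daughters, index in rules:
                if index == "L" and daughters[0] == cat:
                    propose(lhs, daughters[1:], start, end, "R")
                elif index == "R" and daughters[-1] == cat:
                    propose(lhs, daughters[:-1], start, end, "L")
            # Augmented fundamental rule: try active edges at both ends.
            for act in list(active):
                combine(act, edge)
        else:
            if edge in active:
                continue
            active.add(edge)
            for inact in list(inactive):
                combine(edge, inact)
    return inactive

# Toy grammar: vp is indexed on its left daughter, s on its right daughter.
RULES = [("vp", ("v", "np"), "L"), ("s", ("np", "vp"), "R")]
LEXICON = {"police": ["np"], "closed": ["v"], "road": ["np"]}
print(sorted(parse(["police", "closed", "road"], LEXICON, RULES)))
```

Here the verb triggers the vp rule and expands rightwards to take in road, and the completed vp then triggers the s rule, which expands leftwards to take in police, so each rule is parsed in one direction only while the text is covered in both.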
3.3.2 AGENDA CONTROL: The insertion of edges into the agenda is constrained by the value of a "control-feature", which specifies where to look in the feature-trees that constitute our categories in order to find the semantically "interesting" features. In our examples (1) and (3) above, this control-feature is named sf. When a normal bottom-up chart parse begins, all lexical items are tested to see whether they can spawn higher edges. But in the interesting-corner parse, higher edges are only spawned from lexical items that have a control-feature specification which unifies with a pre-defined initial value of the control feature. Thus by assigning (sf=event_type) to be the initial value of the control feature, we ensure that only those edges are entered into the agenda that have semantic feature trees that are extensions of this tree (e.g. the semantic feature tree for close in (3) above). This effectively means that parsing must begin from words that suggest some kind of interesting event. Note that the initial active edges may be proposed from any point in the input string, and their direction of expansion from that point is determined by the indexing on the rules.
there are any, it searches, in the direction of expansion for the current active edge, for any lexical items that have a semantic feature-tree which unifies with the new specification of what is "interesting".
For example, if the rule given in (1) above is indexed on the left daughter, and an active edge is proposed starting from an inactive edge representing the lexical item close defined as in (3) above, then via the logical name P the features on the noun-phrase being sought become instantiated to (sf - roadlocation). The parser then looks rightwards in the input string for any lexical items having semantic feature trees that are extensions of this new tree. If it finds any, it predicts more active edges from there, and so forth.
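The directional search in this example might be sketched as follows. The nested-dictionary encoding of feature trees, the lexicon contents and the function names `extends` and `scan` are all assumptions made for illustration; the real parser proposes active edges rather than returning a list of positions.

```python
def extends(target, pattern):
    """True if target has every branch of pattern (extension test, no variables)."""
    return all(k in target and extends(target[k], v) for k, v in pattern.items())


def scan(tokens, lexicon, start, step, spec):
    """Walk from start in one direction (+1 rightwards, -1 leftwards) and
    return the (position, word) pairs whose feature tree extends spec."""
    found = []
    position = start + step
    while 0 <= position < len(tokens):
        entry = lexicon.get(tokens[position])  # unknown words are skipped
        if entry is not None and extends(entry, spec):
            found.append((position, tokens[position]))
        position += step
    return found


# Hypothetical lexicon and input, invented for this sketch.
LEXICON = {
    "pc": {"sf": {"person": {}}},
    "closing": {"sf": {"event_type": {}}},
    "lane": {"sf": {"roadlocation": {"rtitle": {"lane": {}}}}},
}
tokens = ["pc", "chisholm", "closing", "huntingdon", "lane"]

# After the verb at position 2 has instantiated the expectation
# (sf - roadlocation), the parser looks rightwards from the verb.
expected = {"sf": {"roadlocation": {}}}
rightwards = scan(tokens, LEXICON, start=2, step=+1, spec=expected)
leftwards = scan(tokens, LEXICON, start=2, step=-1, spec=expected)
print(rightwards)  # [(4, 'lane')]
print(leftwards)   # []
```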
Fig. 3 below illustrates the numerical order in which the interesting-corner parser incorporates nodes into the parse tree for a very simple "sentence" (in our grammar we allow sentences with deleted auxiliaries), but with the details of the feature trees omitted for legibility.
Extension unification allows one of the structures to be unified (the target) to be an extension of the other (the pattern), but not vice-versa. This means that it is more restricted than graph unification, and hence can be implemented more efficiently. It is less restricted than term unification, and hence less efficient at parse-time, but it does allow the grammar and lexicon to be far more compact than they would be with term-unification in the absence of a grammar pre-processor. However, using extension unification as the basic operation does also mean that the unification of logical variables in rules is not order-independent, and hence we need an indexing principle to determine the direction in which particular rules should be parsed.
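The asymmetry of extension unification can be shown with a small sketch. It assumes feature trees are represented as nested dictionaries and it omits logical variables entirely; the function name `is_extension` is hypothetical.

```python
def is_extension(target, pattern):
    """Succeed only if target contains every branch of pattern.

    The target may carry extra branches; the pattern may not.
    Logical variables are deliberately left out of this sketch.
    """
    return all(
        key in target and is_extension(target[key], sub)
        for key, sub in pattern.items()
    )


# (sf - roadlocation), encoded as nested dicts (an assumed representation)
pattern = {"sf": {"roadlocation": {}}}

# (sf - (roadlocation - (name - huntingdon, rtitle - lane)))
target = {
    "sf": {
        "roadlocation": {
            "name": {"huntingdon": {}},
            "rtitle": {"lane": {}},
        }
    }
}

print(is_extension(target, pattern))  # True: the target extends the pattern
print(is_extension(pattern, target))  # False: the reverse direction fails
```

Because the test succeeds in one direction only, the order in which a rule's daughters are matched matters, which is the motivation for indexing rules.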
3.3.3 THE INDEXING PRINCIPLE: Our general principle for indexing rules is that we must parse from categories that specify general information (ie. that have small feature-trees) to those that specify particular modifications of that general information (ie. that provide extensions to the smaller trees by unification). This usually means that we parse from syntactic heads to complements, eg indexing sentences on the vp (cf. HPSG [Proudian & Pollard 1985]).
In our example rule (1), we index on the verb, because its expectations specify the general semantic type of the object, and the semantic feature tree of the noun-phrase will specify a sub-type of this general type, and therefore will be an extension of the verb's patient semantic feature tree. In the example shown in fig 3, the semantic tree of the np built at node 4 is:
(sf - (roadlocation - (name - huntingdon, rtitle - lane)))
which unifies by extension with the feature tree (sf - roadlocation), and this, as we saw above, became the expected semantic tree for the noun-phrase when rule (1) unified with the verb in (3).
Finally, rules for categories that have expected unknowns as daughters are always indexed on the known categories, even if these are not the grammatical head (eg we index on the policeman's title for rules handling sgt smith, insp brown etc. and on the known title of a road for cases like huntingdon lane, markworthy avenue etc.).
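One way to picture how the index on a rule fixes the direction of expansion is sketched here. The `Rule` class, the `expansion_plan` function and the three example rules are hypothetical; in particular the category names in the rules are guesses made for illustration and are not taken from the TICC grammar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    mother: str
    daughters: tuple
    index: int  # position of the daughter the rule is indexed on


def expansion_plan(rule):
    """Return the daughters still to be found on each side of the indexed one."""
    # Daughters to the left are listed nearest-first, since the parser
    # moves away from the indexed daughter.
    leftwards = rule.daughters[:rule.index][::-1]
    rightwards = rule.daughters[rule.index + 1:]
    return {"leftwards": leftwards, "rightwards": rightwards}


# Indexed on the head: the verb is found first, the np is sought to its right.
vp_rule = Rule("vp", ("v", "np"), index=0)
# Sentences indexed on the vp: the subject np is sought to the left.
s_rule = Rule("s", ("np", "vp"), index=1)
# Indexed on the known road title, not on the unknown name to its left.
road_rule = Rule("n", ("unknown", "rtitle"), index=1)

for rule in (vp_rule, s_rule, road_rule):
    print(rule.mother, expansion_plan(rule))
```

A rule indexed on a middle daughter would yield non-empty lists on both sides, which corresponds to the bi-directional search the parser performs.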
[Fig. 3: parse tree for a simple sentence beginning "pc chisholm closing", with nodes numbered in the order in which the parser builds them (the np is node 4 and the s is node 10). Node labels recoverable from the figure: s, poltitle, unknown, v1, np, n1, n.]
3.3.4 EXTENSIONS TO THE CURRENT SYSTEM: There are many aspects of the TICC's Natural Language Summarisation not dealt with in this paper, such as the semantic rules used in the Message Analyser and the Event Recogniser. There are also many inadequacies in the current implementation of the Message Analyser, eg in its handling of abbreviations/phrases, and in the handling of input that is "ill-formed" even with respect to our relatively unconstrained grammar.
However, work is currently in progress on these problems, and we believe that the basic mechanisms of interesting-corner parsing are sufficiently powerful to enable us to achieve a practical solution, whilst being sufficiently general to ensure that such a solution will be theoretically interesting.
4. CONCLUSION
The automatic production of traffic broadcasts, given the type of free text we have described in this paper, poses many difficult problems. In many ways our overall approach to these problems follows in a long tradition of semantically driven systems, but the processing style of our Message Analyser is much closer to that used in contemporary syntax-driven systems. We make explicit use of rules in a unification-based grammatical formalism that express both semantic and syntactic information declaratively, and our interesting-corner parser provides a search of the input messages that is both thorough and efficient.
We believe that complete understanding of free text messages is well beyond the state of the art in computational linguistics, but that we can nevertheless develop the TICC's Natural Language Summariser to have sufficient partial understanding to be practically useful.
REFERENCES
Becker, J.D. (1975) "The Phrasal Lexicon", in R. C. Schank and B. L. Nash-Webber (eds.), Proceedings of the Workshop on Theoretical Issues in Natural Language Processing. Cambridge, Mass., Bolt, Beranek and Newman, pp. 70-73.
Burton, R. (1976) "Semantic Grammar: an Engineering Technique for Constructing Natural Language Understanding Systems", Technical Report 3453, Cambridge, Mass., Bolt, Beranek and Newman.
Chevalier, M., Dansereau, J., and Poulin, G. (1978) "TAUM-METEO: Description Du Systeme", Montreal, Groupe TAUM, Universite de Montreal.
DeJong, G.F. (1979) "Skimming Stories in Real Time", Doctoral Thesis, New Haven, Yale University.
DeJong, G.F. (1982) "An Overview of the FRUMP System", in Wendy G. Lehnert and Martin H. Ringle (eds.), Strategies for Natural Language Processing. Hillsdale, Erlbaum, pp. 149-176.
Earley, J. (1970) "An Efficient Context- free Parsing Algorithm", Communications of the ACM. vol. 6, no. 8, pp. 451-455.
Gazdar, G., Klein, E., Pullum, G., and Sag, I. (1985) Generalised Phrase Structure Grammar. Oxford, Blackwell.
Kaplan, R., and Bresnan, J. (1982) "Lexical Functional Grammar: a Formal System for Grammatical Representation", in Joan Bresnan (ed.), The Mental Representation of Grammatical Relations. Cambridge MA, MIT Press, pp. 173-281.
Kay, M. (1973) "The MIND System", in Randall Rustin (ed.), Natural Language Processing. New York, Algorithmics Press, pp. 155-188.
Kay, M. (1985) "Parsing in Functional Unification Grammar", in David R. Dowty, Lauri Karttunen and Arnold M. Zwicky (eds.), Natural Language Parsing. Cambridge, Cambridge University Press, pp. 251-278.
Kwasny, S.C., and Sondheimer, N.K. (1981) "Relaxation Theories for Parsing Ill-formed Input", American Journal of Computational Linguistics. vol. 7, no. 2, pp. 99-108.
Pereira, F.C.N., and Warren, D.H.D. (1980) "Definite Clause Grammars for Language Analysis - a Survey of the Formalism and a Comparison with Augmented Transition Networks", Artificial Intelligence. vol. 13, no. 3, pp. 231-278.
Proudian, D., and Pollard, C.J. (1985) "Parsing Head-driven Phrase Structure Grammar", ACL Proceedings, 23rd Annual Meeting. pp. 167-171.
Riesbeck, C.K. (1978) "An Expectation-driven Production System for Natural Language Understanding", in Donald A. Waterman and Rick Hayes-Roth (eds.), Pattern-directed Inference Systems. New York, Academic Press, pp. 399-414.
Riesbeck, C.K. (1982) "Realistic Language Comprehension", in Wendy G. Lehnert and Martin H. Ringle (eds.), Strategies for Natural Language Processing. Hillsdale, Erlbaum, pp. 37-54.