A TOOL KIT FOR LEXICON BUILDING
Thomas E. Ahlswede
Computer Science Department
Illinois Institute of Technology
Chicago, Illinois 60616, USA
ABSTRACT
This paper describes a set of interactive routines that can be used to create, maintain, and update a computer lexicon. The routines are available to the user as a set of commands resembling a simple operating system. The lexicon produced by this system is based on lexical-semantic relations, but is compatible with a variety of other models of lexicon structure. The lexicon builder is suitable for the generation of moderate-sized vocabularies and has been used to construct a lexicon for a small medical expert system. A future version of the lexicon builder will create a much larger lexicon by parsing definitions from machine-readable dictionaries.
INTRODUCTION
Natural language processing systems need much larger lexicons than those available today. Furthermore, a good computer lexicon with semantic as well as syntactic information is elaborate and hard to construct. We have created a program which enables its user to interactively build and extend a lexicon. The program sets up a user environment similar to a simple interactive operating system; in this environment lexical entries can be produced through a small set of commands, combined with prompts specified by the user for the desired kind of lexicon.
The interactive lexicon builder is being used to help construct entries for a lexicon to be used to parse and generate stroke case reports. Many terms in this medical sublanguage either do not appear in standard dictionaries or are used in the sublanguage with special meanings. The design of the lexicon builder is intended to be general enough to make it useful for others building lexicons for large natural language processing systems involving different sublanguages.
The interactive lexicon builder will be the basis for a fully automatic lexicon builder which uses Sager's Linguistic String Parser (LSP) to parse machine-readable text into a relational network based on a modified version of Werner's MTQ (Modification-Taxonomy-Queueing) schema. Initially this program will be applied to Webster's Seventh Collegiate Dictionary and the Longman Dictionary of Contemporary English, both of which are available in machine-readable form.
LEXICAL-SEMANTIC RELATIONS
The semantic component of the lexicon produced by this system consists principally of a network of lexical-semantic relations. That is, the meaning of a word in the lexicon is indicated as far as possible by its relationships with other words. These relations often have semantic content themselves and thus contribute to the definition of the words they link.
The two most familiar such relations are synonymy and antonymy, but others are interesting and important. For instance, to take an example from the vocabulary of stroke reports, the carotid is a kind of artery and an artery is a kind of blood vessel. This "is a kind of" relation is taxonomy. We express the taxonomic relations of "carotid", "artery" and "blood vessel" with the relational arcs
carotid T artery
artery T blood vessel
Another important relation is that of the part to the whole:
The part-whole relation is more complicated than taxonomy in its properties; some instances of it are transitive and others are not. From this and other criteria, Iris et al. (forthcoming) distinguish four different part-whole relations.
Taxonomy and part-whole are very common relations, by no means restricted to any particular sublanguage. Sublanguages may, however, use relations that are rare or nonexistent in the general language. In the stroke vocabulary, there are many words for pathological conditions involving the failure of some physical or mental function. We have invented a relation NNABLE to express the connection between the condition and the function:
aphasia NNABLE speech
amnesia NNABLE memory
Relations such as T, PART, and NNABLE are especially useful in making inferences. For instance, if we have another relation FUNC, describing the typical function of a body part, we might combine the relational arc
speech FUNC Broca's area
with the arc
aphasia NNABLE speech
to infer that when aphasia is present, the diagnostician should check for the possibility of damage to Broca's area (as well as to any other body part which has speech as a function).
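The inference just described can be sketched in a few lines. This is only an illustration, not the system's own code: the triple representation and the function names `build_index`, `sites_to_check` and `ancestors` are assumptions, and treating taxonomy as transitive is likewise an assumption suggested by the contrast drawn with the part-whole relation.

```python
from collections import defaultdict

# Arcs are (source, relation, target) triples, written as in the text.
ARCS = [
    ("carotid", "T", "artery"),
    ("artery", "T", "blood vessel"),
    ("aphasia", "NNABLE", "speech"),
    ("amnesia", "NNABLE", "memory"),
    ("speech", "FUNC", "Broca's area"),
]

def build_index(arcs):
    """Index target words by (source, relation) for quick lookup."""
    index = defaultdict(list)
    for source, relation, target in arcs:
        index[(source, relation)].append(target)
    return index

def sites_to_check(condition, index):
    """Follow NNABLE, then FUNC: condition -> lost function -> body parts."""
    sites = []
    for function in index.get((condition, "NNABLE"), []):
        sites.extend(index.get((function, "FUNC"), []))
    return sites

def ancestors(word, index):
    """Follow T arcs repeatedly (assumes taxonomy is transitive)."""
    found = []
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for parent in index.get((current, "T"), []):
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found

INDEX = build_index(ARCS)
print(sites_to_check("aphasia", INDEX))   # body parts whose function is speech
print(ancestors("carotid", INDEX))        # everything the carotid is a kind of
```

With only the arcs listed, "amnesia" yields no sites, because no FUNC arc for "memory" has been entered; the quality of such inferences depends directly on how densely the network is connected.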
Figure 1. Part of a relational network
Another kind of relation is the "collocational relation", which governs the combining of words. These are particularly useful for generating idiomatic text. Consider the "typical preposition" relation PREP:
on PREP list
which says that an item may be "on a list" as opposed to "in a list" or "at a list."
Although the lexicon builder is based on a relational model, it can be adapted for use in connection with a variety of models of lexicon structure. A semantic-field approach can be handled by the same mechanism as relations; the lexicon builder also recognizes unary attributes of words, and these attributes can be treated as semantic features if one wishes to build a feature-based lexicon.
APPLICATIONS FOR THE LEXICON BUILDER
This project was motivated partly by theoretical questions of lexicon design and partly by projects which required the use of a lexicon.
For instance, the Michael Reese Hospital Stroke Registry includes a text generation module powered by a relational lexicon (Evens et al., 1984). This application provided a framework of goals within which the interactive lexicon builder was developed. The vocabulary required for the Stroke Registry text generator is of moderate size, about 2000 words and phrases. This is small enough that a lexicon for it can be built interactively.
One can imagine many applications for a large lexicon such as the automatic lexicon builder will construct. Question answering is one of our original areas of interest; a large, densely connected vocabulary will greatly add to the variety of inferences a question answering system can make. Another area is information retrieval, where experiments (Evens et al., forthcoming) have shown that the use of a relational thesaurus leads to improvements in both recall and precision.
language lexicon points toward a time when sublanguages will be obsolete for many of the purposes for which they are now used; but they will still be useful and interesting for a long time to come, and the automatic lexicon builder gives us a new tool for analyzing them.
THE INTERACTIVE LEXICON BUILDER
Commands
The interactive lexicon builder consists of an operating-system-like environment in which the user may invoke the following commands:
HELP displays a set of one-line summaries of the commands, or a paragraph-length description of a specified command. This paragraph describes the command-line arguments, optional or required, for the given command, and briefly explains the function of the command.
ADDENTRY provides a series of prompts to enable the user to create a lexical entry. Some of these prompts are hard coded; others can be set up in advance by the user so that the lexicon can be tailored to the user's needs.
EDIT enables the user to modify an existing entry. It displays the existing contents of the entry item by item, prompting for changes or additions. If the desired entry is not already in the lexicon, EDIT behaves in the same way as ADDENTRY.
DELETE lets the user delete one or more entries. An entry is not physically deleted; it is removed from the directory, and all entries with arcs pointing to it are modified to eliminate those arcs. (This is simple to do, since for every such arc there is an inverse arc pointing to that entry from the deleted one.) On the next PACK operation (see below) the deleted entry will not be preserved in the lexicon.
This command can also be used to delete the defective entries that are occasionally caused by unresolved bugs in the entry-creating routines, or which might arise from other circumstances. A special option with this command searches the directory for a variety of "illegal" conditions such as nonprinting characters, zero-length names, etc.
LIST gives one-line listings of some or all of the entries in the lexicon. The listing for each entry includes the name (the word itself), sense number, part of speech, and the first forty characters of the definition if there is one.
SHOW displays the full contents of one or more entries.
RELATIONS displays a table of the lexical-semantic relations used by the lexicon builder. This table is created by the user in a separate operation.
UNDEF is a special form of EDIT. In creating an entry, the user may create relational arcs from the current word to other words that are not in the lexicon. The system keeps a queue of undefined words. UNDEF invokes EDIT for the word at the head of the queue, thus saving the user the trouble of looking up undefined words.
PACK performs file management on the lexicon, sorting the entries and eliminating space left by deleted ones.
This routine works in two passes. In the first pass, the entries are copied from the existing lexicon file to a new file in lexicographic order and a table is created that maps the entries from their old locations to their new ones. At this stage, a relational arc from one entry to another still points to the other entry's old location. The second pass updates the new lexicon, modifying all relational arcs to point to the correct new locations.
QUIT exits from the lexicon builder environment. Any new entries or changes made during the lexicon building session are incorporated and the directory is updated.
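The two passes of PACK can be illustrated with an in-memory sketch. The representation is an assumption made for the example: the "file" is a dictionary from location to record, a deleted slot is `None`, and each arc is a (relation, target location) pair; the function name `pack` and the inverse relation name `_T` are likewise illustrative.

```python
def pack(old_records):
    """Two-pass compaction sketch.

    old_records maps an old location to a record dict, or to None for a
    slot left by a deleted entry. Each record has 'name', 'sense', and
    'arcs', a list of (relation, target_location) pairs.
    """
    live = [(loc, rec) for loc, rec in old_records.items() if rec is not None]
    live.sort(key=lambda item: (item[1]["name"], item[1]["sense"]))

    # Pass 1: copy entries in lexicographic order and build the table
    # mapping old locations to new ones. Arcs still hold old locations.
    new_records = []
    relocation = {}
    for new_loc, (old_loc, rec) in enumerate(live):
        relocation[old_loc] = new_loc
        new_records.append(
            {"name": rec["name"], "sense": rec["sense"], "arcs": list(rec["arcs"])}
        )

    # Pass 2: rewrite every arc to point at its target's new location.
    # Arcs to slots that no longer exist are dropped as a safeguard.
    for rec in new_records:
        rec["arcs"] = [
            (relation, relocation[target])
            for relation, target in rec["arcs"]
            if target in relocation
        ]
    return new_records

old_file = {
    0: {"name": "carotid", "sense": 1, "arcs": [("T", 2)]},
    1: None,  # space left by a deleted entry
    2: {"name": "artery", "sense": 1, "arcs": [("_T", 0)]},
}
packed = pack(old_file)
```

After packing, "artery" sits at location 0 and "carotid" at location 1, and each arc has been redirected through the relocation table rather than by searching the file.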
Extensions to the commands
All of the commands can be abbreviated; so far they all have distinctive initials and can thus be called with a single keystroke.
Each command may be accompanied by command-line arguments to define its action more precisely. Display commands, such as HELP or SHOW, allow the user to get a printout of the display. Where an entry name is to be specified, the user can get more than one entry by means of "wild cards." For instance, the command "LIST produc=" might yield a list showing entries for "produce", "produced", "produces", "producing", "product", and "production."
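A minimal sketch of both conveniences, abbreviation by unique prefix and wild-card matching of entry names, might look as follows. The function names are hypothetical, and the wild-card character is left as a parameter (defaulting to the `=` shown in the example above) because the text does not spell out the matching rules; matching "any run of characters" is an assumption.

```python
import re

COMMANDS = ["HELP", "ADDENTRY", "EDIT", "DELETE", "LIST",
            "SHOW", "RELATIONS", "UNDEF", "PACK", "QUIT"]

def resolve_command(text):
    """Return the one command that starts with text, or None if ambiguous."""
    matches = [c for c in COMMANDS if c.startswith(text.upper())]
    return matches[0] if len(matches) == 1 else None

def list_entries(names, pattern, wildcard="="):
    """Return the names matching pattern; wildcard stands for any characters."""
    pieces = [re.escape(piece) for piece in pattern.split(wildcard)]
    regex = re.compile(".*".join(pieces))
    return sorted(name for name in names if regex.fullmatch(name))

NAMES = ["produce", "produced", "produces", "producing",
         "product", "production", "profit", "reproduce"]
print(resolve_command("l"))
print(list_entries(NAMES, "produc="))
```

Because every command currently has a distinct initial, a single letter resolves uniquely; the prefix test would start returning `None` for that letter as soon as two commands shared it.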
The design of the user interface took into account both the available facilities and the expected users. The lexicon builder runs on a VAX 11/750, normally accessed with line-editing terminals. This suggests that a single-line command format is most appropriate. Since much of the work with the system is done over 300 baud telephone lines, conciseness is also important. The users have all had some programming experience (though not necessarily very much) so an operating-system-like interface is easy for them to get used to. If the lexicon builder becomes popular, we hope to have the opportunity to develop a more sophisticated interface, perhaps with a combination of features for beginners and more experienced users.
Structure of a lexical entry
A complete lexical entry consists of:
1. The "name" of the entry -- its character-string form.
2. Its sense. We represent senses by simple numbers, not attempting to formally distinguish polysemy and homonymy, or any other degree of semantic difference. The system leaves to the user the problem of distinguishing different senses from extensions of a single sense: that is, where a word has already been entered in some sense, the user must decide whether to modify the entry for that sense or create a new entry for a new sense.
3. Part of speech, or "class." Our classification of parts of speech is basically the traditional classification with some convenient additions, largely drawn from the classification used by Sager in the LSP (Sager, 1981). Most of the additions are to the category of verbs: "verb" to the lexicon builder denotes the stem form, while the third person and past tense are distinguished as "finite verb", and the past and present participles are classified separately.
4. The text of the definition, entered by the user.
At this stage in our work, the definition is not parsed or otherwise analyzed, so its presence is more for purposes of documentation than anything else. In future versions of the lexicon builder, the definition will play an important role in constructing the entry but in the entry itself will be replaced by information derived from its analysis.
5. A list of attributes (or semantic features), each with its value, which may be binary or scalar.
6. A predicate calculus definition. For example, for the most common sense of the verb "promise", the predicate calculus definition is expressed as
promise(x,y,z) = say(x,w,z)
_event(y) => w = will happen(y)
_thing(y) => w = will receive(z,y)
or, in freer form,
(x promises y to z) = (x says w to z)
where w =
(y will happen) if y is an event
(z will receive y) if y is a physical object.
This is entered by the user.
We have been inclined to think of the relational lexicon as a network, since the network representation vividly brings out the interconnected quality which the relational model gives to the lexicon. Predicate calculus is better in other respects; for instance, it expresses the above definition of "promise" much more elegantly than any network notation could. The two methods of representation have traditionally been seen as alternatives rather than as supplementing each other; we believe that predicate calculus has an important supplementary role to play in defining the core vocabulary of the lexicon, although we are not sure yet how to use it.
7. Case structure (for verbs). This is a table describing, for each syntactic slot associated with the verb (subject, direct object, etc.) the semantic case or cases that may be used in that slot ("agent", "experiencer", etc.), whether it is required, optional, or may be expressed elliptically (as with the direct and indirect object in "I promise!" referring to an earlier statement).
Space is reserved in this structure for selection restrictions. A relational model gives us the much more powerful option of indicating through relations such as "permissible subject", "permissible object", etc., not only what words may go with what others, but whether the usage is literal, a conventional figure of speech, fanciful, or whatever. Selection restrictions do, however, have the virtue of conciseness, and they permit us to make generalizations. Relational arcs may then be used to mark exceptions.
We find it convenient to treat morphological derivations such as plural of nouns, tenses and participles of verbs, as relations connecting separate entries. The entry for a regularly derived form such as a noun plural is a minimal one, consisting of name, sense, part of speech, and one relational arc, linking the entry to the stem form. The lexicon builder generates these regular forms automatically. It also distinguishes these "regular" entries from "undefined" entries, which have been entered indirectly as target words of relational arcs and which are on the queue accessed by UNDEF, as well as from "defined" entries.
[Figure 2 diagram: an entry record with the fields name, sense, class, text of definition, attribute list, predicate calculus definition, case structure table, and relation list; the relation list links to nodes for target words such as w2.]
Figure 2. Structure of a lexical entry
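The seven components listed above, together with the three entry statuses, can be pictured as a record type. The sketch below is an assumed in-memory layout, not the on-disk format: the field names, the class label "n. pl.", the relation name `PLURAL_OF`, and the naive "add s" plural rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

@dataclass
class CaseSlot:
    slot: str            # syntactic slot, e.g. "subject"
    cases: List[str]     # semantic cases allowed in that slot
    status: str          # "required", "optional" or "elliptical"

@dataclass
class LexicalEntry:
    name: str                      # 1. character-string form
    sense: int                     # 2. sense number
    word_class: str                # 3. part of speech
    definition: str = ""           # 4. definition text, kept unparsed
    attributes: Dict[str, Union[bool, float]] = field(default_factory=dict)  # 5.
    predicate_calculus: str = ""   # 6. entered by the user
    case_structure: List[CaseSlot] = field(default_factory=list)             # 7.
    # relation name -> list of (target name, target sense)
    relations: Dict[str, List[Tuple[str, int]]] = field(default_factory=dict)
    status: str = "defined"        # "defined", "undefined" or "regular"

def regular_plural(stem: LexicalEntry) -> LexicalEntry:
    """Build the minimal entry for a regular noun plural (naive 's' rule)."""
    return LexicalEntry(
        name=stem.name + "s",
        sense=stem.sense,
        word_class="n. pl.",
        relations={"PLURAL_OF": [(stem.name, stem.sense)]},
        status="regular",
    )

lesion = LexicalEntry("lesion", 1, "n.", definition="(definition text)")
lesions = regular_plural(lesion)
```

The plural entry carries only a name, sense, class and a single arc back to the stem, which mirrors the "minimal" regular entry described in the text.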
File structure of the lexicon
There are four data files associated with the lexicon.
The first is the lexicon proper. The biggest complicating factor in the design of the lexicon is the extremely interconnected nature of the data; a change in one portion of the file may necessitate changes in many other places in the file. Each entry is linked through relational arcs to many other entries; and for every arc pointing from word1 to word2, there must be an inverse arc from word2 to word1. This means that whenever we create a new arc in the course of building or modifying an entry for word1, we must update the entry for word2 so that it will contain the appropriate inverse arc back to word1. Word2's entry has to be updated or created from scratch; we need to structure the lexicon file so that this updating process, which may take place anywhere in the file, can be done with the least possible dislocation.
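The bookkeeping this requires can be sketched as follows: adding an arc also adds its inverse, a target word that is not yet in the lexicon is created as an "undefined" entry and queued for UNDEF, and deleting an entry uses its own arcs to find and remove the inverse arcs elsewhere. The class and method names are illustrative assumptions, as is the inverse relation name `_NNABLE`.

```python
from collections import deque

class Lexicon:
    def __init__(self, inverses):
        self.inverses = inverses     # relation name -> inverse relation name
        self.entries = {}            # word -> {"status": str, "arcs": set}
        self.undefined = deque()     # queue of undefined words, served by UNDEF

    def _get_or_create(self, word, status):
        if word not in self.entries:
            self.entries[word] = {"status": status, "arcs": set()}
            if status == "undefined":
                self.undefined.append(word)
        return self.entries[word]

    def define(self, word):
        """Create the entry, or promote an undefined one to defined."""
        entry = self._get_or_create(word, "defined")
        if entry["status"] == "undefined":
            entry["status"] = "defined"
            self.undefined.remove(word)
        return entry

    def add_arc(self, word1, relation, word2):
        """Add word1 -> word2 and the inverse arc word2 -> word1."""
        self.define(word1)["arcs"].add((relation, word2))
        target = self._get_or_create(word2, "undefined")
        target["arcs"].add((self.inverses[relation], word1))

    def delete(self, word):
        """Remove an entry and every arc that points to it."""
        entry = self.entries.pop(word)
        for relation, other in entry["arcs"]:
            if other in self.entries:
                self.entries[other]["arcs"].discard((self.inverses[relation], word))
        if word in self.undefined:
            self.undefined.remove(word)

    def next_undefined(self):
        return self.undefined[0] if self.undefined else None

INVERSES = {"NNABLE": "_NNABLE", "_NNABLE": "NNABLE"}
lexicon = Lexicon(INVERSES)
lexicon.add_arc("aphasia", "NNABLE", "speech")
```

Deletion never has to scan the whole lexicon: the deleted entry's own arcs name exactly the entries that hold inverse arcs back to it.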
aphasia (1) n.
definition
    a disorder of language due to injury to the brain
attributes
    nonhuman
    collective
predicate calculus
    have(x, aphasia) -- ~able(speak(x))
relations
    TAX [aphasia is a kind of x]
        deficit, disorder, loss, inability
    _TAX [x is a kind of aphasia]
        anomic, global, Gerstmann's, semantic, Wernicke's, Broca's, conduction, transcortical
    SYMPTOM [aphasia is a symptom of x]
        stroke, TIA
    ASSOC [aphasia may be associated with x]
        apraxia
    _CAUSE [x is a cause of aphasia]
        injury, lesion
    NNABLE [aphasia is the inability to do x]
        speech, language
Figure 3. Lexical entry for "aphasia"
or even (eventually) hundreds of relational arcs. "Aphasia", a moderately large entry with 19 arcs, occupies 322 bytes. Like all entries in the current lexicon, it will be subject to updating and will certainly become much larger.
With this range of entry sizes, the choice between fixed-size and variable-size records becomes somewhat painful. Variable-size records would be highly convenient as well as efficient except for the fact that when we add a new entry that is related to existing entries, we must add new arcs to those entries. The existing entries thus no longer fit into their previous space and must be either broken up or moved to a new space. The former option creates problems of identifying the various pieces of the entry; the latter requires that yet more existing entries be modified.
Because of these problems, we have opted for a fixed-size record. Some space is wasted, either in empty space if the record is too large or through proliferation of pointers if the record is too small; but the amount of necessary updating is much less, and the file can be kept in order through frequent use of the PACK command. The choice of record size is conditioned by many factors, system requirements as well as the range of entry sizes. We are currently working on determining the best record size for the MRH application.
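One way to picture the trade-off is a chain of fixed-size records joined by continuation pointers. The layout below is purely an assumption for illustration (a 64-byte record with a 6-byte header holding a next-record pointer and a payload length); the text does not describe the actual record format, and the record size for the real application was still being determined.

```python
import struct

RECORD_SIZE = 64                     # assumed; the text leaves the size open
HEADER = struct.Struct(">iH")        # next-record index (-1 = none), payload length
PAYLOAD_SIZE = RECORD_SIZE - HEADER.size

def write_entry(records, data):
    """Append data as a chain of fixed-size records; return the first index."""
    chunks = [data[i:i + PAYLOAD_SIZE] for i in range(0, len(data), PAYLOAD_SIZE)]
    chunks = chunks or [b""]         # an empty entry still occupies one record
    first = len(records)
    for offset, chunk in enumerate(chunks):
        is_last = offset == len(chunks) - 1
        next_index = -1 if is_last else first + offset + 1
        header = HEADER.pack(next_index, len(chunk))
        records.append(header + chunk.ljust(PAYLOAD_SIZE, b"\0"))
    return first

def read_entry(records, index):
    """Follow the continuation pointers and reassemble the entry's bytes."""
    data = b""
    while index != -1:
        next_index, length = HEADER.unpack_from(records[index])
        data += records[index][HEADER.size:HEADER.size + length]
        index = next_index
    return data

records = []
entry_bytes = bytes(322)             # same size as the "aphasia" entry
start = write_entry(records, entry_bytes)
```

Under these assumed sizes a 322-byte entry needs six records: a larger record would leave more empty space in small entries, while a smaller one would multiply the pointers, which is the tension the paragraph describes.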
So far the user does not have the option of saving or rejecting the results of a lexicon building session, since entries are written to the file as soon as they are created. We are studying ways of providing this option. A brute force way would be to keep the entire lexicon in memory and rewrite it at the end of the session. This is feasible if the host computer is large and the lexicon is small. The 2000-word lexicon for the Michael Reese stroke database takes up about a third of a megabyte, so this approach would work on a mainframe or a large minicomputer such as our VAX 750, but could not readily be ported to a smaller machine; nor could we handle a much larger vocabulary such as we plan to create with the automatic lexicon builder.
The second file is a directory, showing each entry's name, sense, and status (defined, undefined or regular derivative), with a pointer to the appropriate entry in the lexicon proper. The directory entries are linked in lexicographic order. When the lexicon builder is invoked, the entire directory is read into a buffer in memory, and this buffer is updated as entries are created, modified or deleted. At the end of a lexicon building session, the updated directory is written out to disk.
The third (optional) file is a table of attributes, with pointers into the lexicon proper. This can be extended into a feature matrix.
The fourth (also optional) is a table of pre-defined relations. This table includes, for each relation:
(1) its mnemonic name.
(2) its properties. A relation may be reflexive, symmetric or transitive; there may be other properties worth including.
(3) a pointer to the relation's inverse. If x REL y, then we can define some inverse relation _REL such that y _REL x. If REL is reflexive or symmetric, then _REL = REL.
(4) the appropriate parts of speech for the words linked by the relation. For instance, the NNABLE relation links two nouns, while the collocational PREP relation links a preposition to a noun. Taxonomy can link any two words (apart from prepositions, conjunctions, etc.) as long as they are of the same part of speech: nouns to nouns, verbs to verbs, etc.
(5) the text of a prompt. ADDENTRY uses this prompt when querying the user for the occurrence of relational arcs involving this relation. For instance, if we are entering the word "promise" and our application uses the taxonomy relation, we might choose a short prompt, in which case the query for taxonomy might take the form
"promise" T: [user enters word2 here]
or we could use something more explicit:
"promise" is a kind of:
Users familiar with lexical-semantic relations might prefer the shorter mnemonic prompt, whereas other users might prefer a prompt that better expressed the significance of the relation.
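The five items kept for each relation could be held in a small table like the one sketched here. Everything concrete in it is an assumption for illustration: the field names, the inverse names with a leading underscore, marking taxonomy as transitive, and the choice of which prompt style each relation uses.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

# Classes assumed to be excluded from taxonomy ("prepositions, conjunctions, etc.")
FUNCTION_WORD_CLASSES = {"preposition", "conjunction"}

@dataclass(frozen=True)
class RelationSpec:
    name: str                             # (1) mnemonic name
    properties: FrozenSet[str]            # (2) e.g. {"transitive"}
    inverse: str                          # (3) name of the inverse relation
    classes: Optional[Tuple[str, str]]    # (4) (class of word1, class of word2);
                                          #     None means "same class on both sides"
    prompt: str                           # (5) prompt text used by ADDENTRY

RELATIONS = {
    "T": RelationSpec("T", frozenset({"transitive"}), "_T", None,
                      '"{word}" is a kind of: '),
    "NNABLE": RelationSpec("NNABLE", frozenset(), "_NNABLE", ("noun", "noun"),
                           '"{word}" NNABLE: '),
    "PREP": RelationSpec("PREP", frozenset(), "_PREP", ("preposition", "noun"),
                         '"{word}" PREP: '),
}

def query_text(relation, word):
    """Build the prompt ADDENTRY would show for this relation and word."""
    return RELATIONS[relation].prompt.format(word=word)

def arc_allowed(relation, class1, class2):
    """Check the parts of speech of word1 and word2 against the table."""
    spec = RELATIONS[relation]
    if spec.classes is None:
        return class1 == class2 and class1 not in FUNCTION_WORD_CLASSES
    return spec.classes == (class1, class2)

print(query_text("T", "promise"))
```

Keeping the prompt in the table rather than in the code is what lets the same ADDENTRY routine serve both the terse and the explicit style of query.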
THE AUTOMATIC LEXICON BUILDER
Building a very large lexicon
large lexicon that would result from analysis of an entire dictionary, as the work of Amsler and White (1979) or Kelly and Stone (1975) shows. Integrating the lexicon builder with the LSP, and writing preprocessors for dictionary data, will also be big jobs. Fully automatic analysis of dictionary material, then, is a long-range goal.
A major problem in the relational analysis of the dictionary is that of determining what relations to use. Noun and verb definitions rely on taxonomy to a great extent (e.g. Amsler and White, 1979) but there are definitions that do not clearly fit this pattern; furthermore, even in a taxonomic definition, much semantic information is contained in the qualifying or differentiating part of the definition.
Adjective definitions are another problem area. Adjectives are usually defined in terms of nouns or verbs rather than other adjectives, so simple taxonomy does not work neatly. In a sample of about 7,000 definitions from W7, we identified nineteen major relations unique to adjective definitions, and these covered only half of the sample. The remaining definitions were much more varied and would probably require far more than nineteen additional relations. And for each relation, we had to identify words or phrases (the "defining formulas") that signaled the presence of the relation.
The MTQ model
For these reasons as well as theoretical ones, we need a simplifying model of relations, a model that enables us either to avoid the endless identification of new relations or to conduct the identification within an orderly framework. Werner's MTQ schema (Werner, 1978; Werner and Topper, 1976) seems to provide the basis for such a model.
Werner identifies only three relations: modification, taxonomy and queueing. He asserts that all other relations can be expressed as compounds of these relations and of lexical items -- for instance, the PART relation can be expressed, with the help of the lexical item "part", by the relational arcs
Broca's area T part
brain M part
which say in effect that Broca's area is a kind of part, specifically a "brain-part."
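The decomposition can be written out as a small sketch: one function expands a PART statement into the two MTQ arcs shown above, and another recovers the part-whole pair from them. The function names and the triple representation are assumptions made for the example, and the recovery step is only meant for a single isolated pair of arcs.

```python
def part_arcs(part, whole):
    """Expand 'part PART whole' into T and M arcs through the item "part"."""
    return [(part, "T", "part"), (whole, "M", "part")]

def recover_part(arcs):
    """Read a (part, whole) pair back from one T arc and one M arc on "part"."""
    kinds = [source for source, rel, target in arcs if rel == "T" and target == "part"]
    modifiers = [source for source, rel, target in arcs if rel == "M" and target == "part"]
    if len(kinds) == 1 and len(modifiers) == 1:
        return (kinds[0], modifiers[0])
    return None  # more than one candidate: the pairing is not determined

arcs = part_arcs("Broca's area", "brain")
print(arcs)
print(recover_part(arcs))
```

The `None` branch hints at a practical difficulty: once many T and M arcs meet at the same lexical item, the separate arcs no longer say which modifier belongs with which species.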
Werner's concept of modification and taxonomy reflects Aristotle's model of the definition as consisting of species, genus and differentiae -- taxonomy links the species to the genus and modification links the differentiae to the genus. A study of definitions in W7 and LDOCE shows that they do indeed follow this pattern, although (as in adjective definitions) the pattern is not always obvious.
The special power of MTQ in the analysis of definitions is that in a definition following the Aristotelian structure, taxonomy and modification can be identified by purely syntactic means. One (or occasionally more than one) word in the definition is modified directly or indirectly by all the other words. The core word is linked to the defined word by taxonomy; all the others are linked to the core word by modification. (Queueing so far does not seem to be important in the analysis of definitions.)
In order to avoid certain ambiguities that arise in a very elaborate network such as that generated from a large dictionary, we have replaced the separate modification and taxonomy arcs with a single, ternary relational arc that keeps the species, genus and differentiating items of any particular definition linked to each other.
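One plausible reading of the ambiguity is sketched below; the paper does not spell out the cases, so this is an assumption. When two definitions share a core word, separate M arcs pool all modifiers of that word and no longer show which modifier belongs to which defined word. The record type, the function names and the second definition ("piston") are invented for illustration.

```python
from typing import NamedTuple, Tuple


class TernaryArc(NamedTuple):
    species: str                   # the defined word
    genus: str                     # the core word of its definition
    differentiae: Tuple[str, ...]  # words that modify the core word


def pooled_modifiers(arcs, genus):
    """What separate M arcs would show: every modifier of `genus`, unattributed."""
    pooled = []
    for arc in arcs:
        if arc.genus == genus:
            pooled.extend(arc.differentiae)
    return pooled


def modifiers_for(arcs, species, genus):
    """What the ternary arc preserves: the modifiers of one definition only."""
    for arc in arcs:
        if arc.species == species and arc.genus == genus:
            return list(arc.differentiae)
    return []


NETWORK = [
    TernaryArc("Broca's area", "part", ("brain",)),
    TernaryArc("piston", "part", ("engine",)),  # invented second definition
]
```

Here `pooled_modifiers(NETWORK, "part")` returns both "brain" and "engine", while `modifiers_for(NETWORK, "piston", "part")` returns only "engine".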
The problem of identifying "higher level" relations such as PART and NNABLE in an MTQ network still remains. At this point it seems to be similar to the problem of identifying higher level relations from defining formulas.
Another pleasant discovery is that the Linguistic String Parser, which we have used successfully for some years, is exceptionally well suited for this strategy, since it is geared toward an analysis of sentences and phrases in terms of "centers" or "cores" with their modifying "adjuncts", which is exactly the kind of analysis we need to do.
Design of the automatic lexicon builder
The automatic lexicon builder will contain at least the following subsystems:
1. The standard data structure for the lexical entry, as described for the interactive lexicon builder, with slight changes to adjust to the use of MTQ.
word ("word1") we are currently investigating.) Incorporating the ternary MTQ model, we would have two relation lists: a T list and an M list. The T list would be a linked list of words connected to word1 by the T relation; its structure would be identical to the present relation list except that its nodes would be lexical entry pointers instead of relations. Each of these lexical entry pointers would, like the relation nodes in the existing implementation, point to a linked list of word2s. The word2s in the T list would be connected to the T words by an inverse-modification relation ('M) and the word2s in the M list would be connected to the M words by inverse taxonomy ('T).
2. Preprocessors to convert pre-existing data to the standard form. The preprocessor need not be intelligent; its job is to identify and decode part-of-speech and other such information, separating this from the definition proper. Part of the preprocessing phase is to generate a "dictionary" for the LSP. This dictionary need only contain part-of-speech information for all the words that will be used in definitions; other information such as part-of-speech subclass and selection restrictions is helpful but not necessary. Sager and her associates (1980) have created programs to do this.
3. Batch and interactive input modules. The batch input reads a data file in standard form, perhaps optionally noting where further information would be especially desirable. The interactive input is preserved from the interactive version of the system and allows the user to "improve" on dictionary data as well as to observe the results of the dictionary parse.
4. Definition analyzer. In this module, the LSP will parse the definition to produce a parse tree, which will then be converted into an MTQ network to be linked into the overall lexical network.
5. Entry generator. This module, like the preprocessor, can be tailored to the user's needs.
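The way subsystems 1, 2 and 4 might fit together can be sketched as follows. This is a sketch under stated assumptions, not the planned implementation: the tab-separated record format, all class and function names, and the sample definition are hypothetical; Python lists stand in for the linked lists; and the parse tree is written by hand in place of real parser output.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(eq=False)  # compare by identity; the network contains cycles
class RelationNode:
    """A T-list or M-list node: an entry pointer plus its list of word2s."""
    entry: "LexicalEntry"
    word2s: List["LexicalEntry"] = field(default_factory=list)


@dataclass(eq=False)
class LexicalEntry:
    word: str
    pos: str = ""
    t_list: List[RelationNode] = field(default_factory=list)
    m_list: List[RelationNode] = field(default_factory=list)


@dataclass
class ParseNode:
    """Stand-in for parser output: a center word and its adjunct subtrees."""
    center: str
    adjuncts: List["ParseNode"] = field(default_factory=list)


def preprocess(line: str) -> Tuple[str, str, str]:
    """Subsystem 2: split an assumed 'headword<TAB>pos<TAB>definition' record."""
    headword, pos, definition = line.rstrip("\n").split("\t", 2)
    return headword.strip(), pos.strip(), definition.strip()


def words_under(node: ParseNode) -> List[str]:
    """All words of a subtree: the center first, then each adjunct in order."""
    found = [node.center]
    for adjunct in node.adjuncts:
        found.extend(words_under(adjunct))
    return found


def analyze(tree: ParseNode) -> Tuple[str, List[str]]:
    """Subsystem 4: the root center is the genus; all other words are differentiae."""
    differentiae: List[str] = []
    for adjunct in tree.adjuncts:
        differentiae.extend(words_under(adjunct))
    return tree.center, differentiae


class Lexicon:
    def __init__(self) -> None:
        self.entries: Dict[str, LexicalEntry] = {}

    def entry(self, word: str) -> LexicalEntry:
        """Return the entry for `word`, creating an empty one if needed."""
        return self.entries.setdefault(word, LexicalEntry(word))

    def link(self, species: str, pos: str, genus: str,
             differentiae: List[str]) -> None:
        """Subsystem 1: record one definition in the T and M lists."""
        sp, ge = self.entry(species), self.entry(genus)
        sp.pos = pos
        diffs = [self.entry(d) for d in differentiae]
        # species T genus; its word2s are the modifiers of the genus ('M).
        sp.t_list.append(RelationNode(ge, diffs))
        # differentia M genus; its word2 is the species ('T).
        for d in diffs:
            d.m_list.append(RelationNode(ge, [sp]))


lexicon = Lexicon()
# Hypothetical input record; the definition text is invented.
headword, pos, _definition = preprocess("Broca's area\tn\ta part of the brain\n")
# Hand-built tree for that definition, in place of real parser output.
genus, differentiae = analyze(ParseNode("part", [ParseNode("brain")]))
lexicon.link(headword, pos, genus, differentiae)
```

After the final call, the entry for "Broca's area" has "part" on its T list with "brain" as word2, and the entry for "brain" has "part" on its M list with "Broca's area" as word2, so each definition stays recoverable from either end.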
SUMMARY
A program has been written that enables a user interested in creating a lexicon for natural language processing to generate lexical entries interactively and link them automatically to other lexical entries through lexical-semantic relations. The program provides a small set of commands that allow the user to create, modify, delete, and display lexical entries, among other operations.
The immediate motivation for the program was to produce a relational lexicon for text generation of clinical reports by a diagnostic expert system. It is now being used for that purpose. It can equally well be used in any other sublanguage environment; in addition, it is intended to be compatible, as far as possible, with models of lexicon structure other than the relational model on which it is based.
The interactive lexicon builder is further intended as the starting point for a fully automatic lexicon building program which will create a large, general purpose relational lexicon from machine readable dictionary text, using a slightly modified form of Werner's Modification-Taxonomy-Queueing relational model.
REFERENCES
Ahlswede, Thomas E., and Evens, Martha W., 1983. "Generating a Relational Lexicon from a Machine-Readable Dictionary." Proceedings of the Conference on Artificial Intelligence, Oakland University, Rochester, Michigan.
Ahlswede, Thomas E., and Evens, Martha W., 1984. "A Lexicon for a Medical Expert System." Presented at the Workshop on Relational Models, Coling '84, Stanford University, Palo Alto, California.
Ahlswede, Thomas E., in press. "A Linguistic String Grammar of Adjective Definitions." In S. Williams, ed. Humans and Machines: The Interface Through Language, Ablex.
Amsler, Robert A., and White, John S., 1979. Development of a Computational Methodology for Deriving Natural Language Semantic Structures via Analysis of Machine Readable Dictionaries. Linguistics Research Center, University of Texas.
Evens, Martha W., Ahlswede, Thomas E., Hill, Howard, and Li, Ping-Yang, 1984. "Generating Case Reports from the Michael Reese Stroke Database." Proc. 1984 Conference on Intelligent Systems and Machines, Oakland University, Rochester, Michigan, April.
Evens, Martha W., Vandendorpe, James, and Wang, Yih-Chen, in press. "Lexical-Semantic Relations in Information Retrieval." In S. Williams, ed. Humans and Machines: The Interface Through Language, Ablex.
Iris, Madelyn, Litowitz, Bonnie, and Evens, Martha W., unpublished. "The Part-Whole Relation in the Lexicon: an Investigation of Semantic Primitives."
Kelly, Edward F., and Stone, Philip J., 1975. Computer Recognition of English Word Senses. North-Holland, Amsterdam.
Sager, Naomi, 1981. Natural Language Information Processing. Addison-Wesley, New York.
Sager, Naomi, Hirschman, Lynette, White, Carolyn, Foster, Carol, Wolff, Susanne, Grad, Robert, and Fitzpatrick, Eileen, 1980. Research into Methods for Automatic Classification and Fact Retrieval in Science Subfields. String Reports No. 13, New York University.
Werner, Oswald, 1978. "The Synthetic Informant Model: the Simulation of Large Lexical/Semantic Fields." In M. Loflin and J. Silverberg, eds., Discourse and Difference in Cognitive Anthropology. Mouton, The Hague.
Werner, Oswald, and Topper, Martin D., 1976. "On the Theoretical Unity of Ethnoscience Lexicography and Ethnoscience Ethnographies." In C. Rameh, ed., Semantics, Theory and Application, Proc.
Georgetown University Round Table on