UTILIZING DOMAIN-SPECIFIC INFORMATION FOR PROCESSING COMPACT TE~T
E l a i n e Marsh
L l n g u l s t l c S t r i n g P r o j e c t Nee York U n i v e r s i t y
Nee York, Nee York
ABSTRACT
This paper I d e n t i f i e s the types of sentence fragments found In the t e x t of two domains: medical records and Navy equipment s t a t u s messages. The fragment types are r e l a t e d t o f u l l sentence forms on the basis of the elements which were r e g u l a r l y d e l e t e d . A breakdown of the fragment types and t h e i r d i s t r i b u t i o n s In the two domains Is presented. An approach t o r e c o n s t r u c t i n g t h e semantic c l a s s of d e l e t e d elements In the medical records Is proposed which is based on the semantic p a t t e r n s recognized In the domain.
I. INTRODUCTION
A l a r g e amount of natural language Input, whether t o text processing or questlon-answerlng systems, c o n s l s t s of shortened sentence forms, sentence nfragmentsn. Sentence fragments are found
in informal technical communlcatlons, messages, headlines, and In t e l e g r a p h i c c a m u n l c a t l o n s . Occurrences are c h a r a c t e r i z e d by t h e l r brev l t y and Informational nature. In a l l of these, i f people are not r e s t r i c t e d t o using complete, grammatical sentences, as they are In formal w r i t i n g s i t u a t i o n s , they tend t o leave OUt the p a r t s of the sentence which they b e l l e v e the reader w i l l be a b l e t o r e c o n s t r u c t . Thls is e s p e c i a l l y t r u e i f the w r i t e r deals wlth a s p e c i a l i z e d s u b j e c t matter where the f a c t s are to be used by o t h e r s in the same f i e l d .
Several approaches t o such h i l l - f o r m e d , , natural language Input have been f o l l o w e d . The LIFER system [ H e n d r l x , 1977; Hendrlx, e t a i . , 1978] and the I~_ANES system [ W a l t z , 1978] both account f o r fragments In procedural terms; they Co not r e q u i r e the user t o enumerate the types of fragments which w i l l be accepted. The L i n g u i s t i c S t r l n g P r o j e c t has c h a r a c t e r l z e d the r e g u l a r l y o c c u r r i n g ungrammatical c o n s t r u c t i o n s and made them p e r t of the parsing grammar [Anderson, e t e l . , 1975; Hlrschman and Sager, 1982]. K w a s n y and Sondhe!mer (10R1) have used e r r o r - h a n d l i n g procedures t o r e l a t e the I l l - f o r m e d input of sentence fragments t o w e l l - f o r m e d s t r u c t u r e s . While these approaches d i f f e r in the way they determine the s t r u c t u r e of the fragments and the deleted m a t e r i a l , f o r the most p e r t they r e l y h e a v i l y , a t some p o i n t , on the r e c o g n i t i o n of semantic word-classes. The purpose of t h i s paper Is to describe the s y n t a c t i c c h a r a c t e r i s t i c s of sentence fragments and t o l l l u s t r a t e how the
cooCcurrence p a t t e r n s of the semantic word-classes of a domain can be u t i l i z e d as a powerful t o o l f o r processing a body of compact t e x t , I . e . t e x t t h a t c o n t a i n s a l a r g e percentage of sentence fragments,
I I . IDENTIFICATION OF FRAGMENT TYPES The Nee York U n l v e r s l ~ y L i n g u i s t i c S t r i n g P r o j e c t has developed a computer program t o a n a l y z e ccmpact text In special Ized s u b j e c t areas using a general p a r s i n g program and an Engl Ish grammar augmented by procedures s p e c l f l c t o the subJect areas. In r e c e n t years the system has been t a i l o r e d f o r computer a n a l y s i s of f r e e - t e x t medical r e c o r d s , which are c h a r a c t e r i z e d by numerous sentence fragments. In the c o m p u t e r - a n a l y s i s and processing of the medical records, r e l a t l v e l y few types of sentence fragments s u f f l c e d t o d e s c r i b e t h e shortened forlas, a l though such fragments ccmprfsed f u l l y 49% of the natural language i n p u t CMarsh and Sager, 1982]. Fragment types can be r e l a t e d t o f u l l forms on t h e basis of the elements which are r e g u l a r l y delirfed. Elements d e l e t e d f r ~ n t h e fragments are fr~a one or more of the s y n t a c t i c p o s l t l o n s : s u b j e c t , tense, verb, obJect. The s i x fragment types I d e n t l f l e d in the s e t of medical records are shown In Table 1 as types i - V l .
A f e a t u r e of fragment types t h a t Is not I m e d l a t e l y obvious ts the f a c t t h a t they are a l r e a d y known In the ful I grammar as p a r t s of ful l e t c o n s t r u c t i o n s . The fragment types r e f l e c t d e l e t i o n s found in s y n t a c t i c a l l y d i s t i n g u i s h e d p o s i t i o n s w l t h i n f u l l sentences, as I l l u s t r a t e d in Table 2. For e~ample, In normal English, a sentence t h a t c o n t a i n s tense and the verb be can occur as the o b j e c t of verbs l i k e f i n d ( e . g . She found t h a t
the s e n t ~ c e was ~ ) . In the same
environment, as obJect of f i n d , a reduced sentence can occur [n which the tense and verb be have been o m i t t e d , as In fragment type I ( e . g . She found the sentence ~ l l l J ; ~ ) . In the same manner, o t h e r reduced forms r e f l e c t e d in fragment types a l s o r e p r e s e n t c o n s t r u c t i o n s g e n e r a l l y found as ~arts of r e g u l a r English sentences.
The f a c t t h a t the fragment types can be r e l a t e d t o f u l l English forms makes I t p o s s i b l e to
v Iee thee as Instances of reduced
!i ! !
i
~
o
. ~
~
~
~
-
. ~o~
~
~
~
~
w
I-- Z > -
( D u
3 ~
W e ' ~
W
f J - r "
" ~ 0
i ,
~E
- C E
m ~
c n u J > - ) . . I - - z
n . u .
w ..J I - -
~ ° ~ ~
° °
~
~ E
. -
o _
~~
~ ~~
o ~~
- - O
- =
~
..J
= [
e e
¢
~
d
Zd
O
Z
W 111
I'-- Z
* ( j
,..J I=.
, , . I m
w
m
+ Z
N
~4L~
~ . , , (
+ +
Z Z
~4
d
Z12. O " Z
+ + +
2 : Z Z
U.I
e, e~
el*"
n
z o .
I,..- .,.J I,.-
. g ~
-g
m w ~
m
L4,4
->
~-_
>
Z - - :="
TABLE 2. DELETION FORMS IN NORMAL ENGLISH
I. DELETED TENSE, VERB BE A. N + PASSIVE PRED B. N + PROGRESSIVE PRED C. N + ADJECTIVE PRED O. N + P N
E. N + Q F. N + N
THE KING HAD HIM B~FAI:)~.
WE 0eSERVED EILL . T ~ TO HIMSELF. SHE FOUNO THE S~NT~CE ~ .
THEY FOUND HiS J ~ OF J J ~ . JOHN THOUGHT HIM Z~ OR ~ . THEY CONSIDERED HER THEIR SAVIOUR.
I I . DELETED SUBJECT, TENSE, VE]~ BE [VERBAL PREDICATE] A. PASSIVE PREDICATE
B. PROGRESSIVE PREDICATE
THE MAN, ~ WITH HIS WORK, WENT HOME. MARY LEFT WHISTt I~K~ ~ HAPPY ~NE.
I I I . DELETED SUBJECT, TENSE, VERB BE A. ADJECTIVE PREDICATE
B. PN PRED I CATE
GRACIOUS AS ~LEB, SHE WELCOMED HER GUESTS. THE GUARD, IN GREAT /~./~M, CALLED THE POLICE.
IV. DELETED SUBJECT, TENSE, VERB BE
NO.IN PHRASE THE CHILD, 6 ~ U M ~ Y DANCER, TWISTED HER ANKLE.
V. Ol~ ETED SUBJECT, TENSE, VERB BF
INFINITIVAL PREDICATE THEY TOOK THE TRAIN TO AVOID THE TRAFFIC.
and, a t the same time, p r o v i d e s a framework f o r I d e n t i f y i n g t h e i r semantic content by r e l a t i n g t h m t o the corresponding f u l l forms.
The number of fragment types t h a t occur In compact t e x t of d i f f e r e n t t e c h n i c a l domains appears t o be r e l a t l v e l y l i m i t e d . When t h e fragment types found In medical records were compared w l t h those seen In a smell sample of N a v y equipment s t a t u s messages, f i v e of the s l x types found in the medlcal records were a l s o found In the Navy messages. Only one a d d i t i o n a l fragment type was r e q u i r e d t o cover the N a v y messages. This t y p e appears In Table I as type V l l , in which two
subjects have been deleted (Reauest advise
f o r Dick ~Q.).
While the number of fragment types Is
r e l a t i v e l y constant, the d i s t r i b u t i o n of fragment types v a r i e s according t o the domain of the t e x t . Table 3 s h o w s d i s t r i b u t i o n s f o r each of the fragment types I d e n t i f i e d in Table 1. For e~ample,
In Table 3, w h i l e fragment t y p e IV, from which s u b j e c t , tense, and verb have been d e l e t e d , is most frequent In medical records, I t is a much less frequent t y p e In the Navy messages. On the o t h e r hand, type VI, from whlch a s u b j e c t has been d e l e t e d , Is r e l a t i v e l y Infrequent In medical
records, but much more frequent in Navy messages.
In a d d i t i o n , the d i f f e r e n t sections of the input d i f f e r with respect t o the r a t i o of fragments
they c o n t a i n . For e~unple, the d i f f e r e n t s e c t i o n s of the medical records t h a t were analyzed ( e . g . HISTORY, EXAM, LAB-DATA, IMPRESSION, COURSE IN HOSPITAL) were d i s t i n g u i s h e d by differences in the d i s t r i b u t i o n of the fragment t y p e s . The EXAM paragraph of the medical t e x t s , In which the p h y s i c i a n describes the r e s u l t s of the p a t i e n t ' s p h y s i c a l e K ~ l n a t l o n , contained a relatively l a r g e number of fragments of t y p e I11, e s p e c i a l l y adjective phrases. The COURSE IN HOSPITAL
paragraph contained a larger number of complete
sentences than the o t h e r paragraphs.
TABLE ] . DISTRIBUTION OF FRAGMENT TYPES
T Y P E MEDiCAl NAVY
I. 22% 36%
if. I% 6%
iii. 12% 11% IV. 61% 15%
v. I% 0%
v l . 2$ 28%
[image:3.612.342.528.537.704.2]I I I . RECONSTRUCTION OF DELETIONS
The d e l e t i o n s which r e l a t e f r a g m e n t t y p e s t o t h e i r f u l l sentence forms f a l l I n t o two main c l a s s e s : ( I ) those found v i r t u a l l y In a l l t e x t s and I l l ) t h o s e s p e c l f l c t o t h e domain of t h e t e x t .
J u s t as t h e fragment t y p e s can be viewed as I n c o m p l e t e r e a l i z a t i o n s o f syntac-Nc S-V-O s t r u c t u r e s , t h e semantic p a t t e r n s In sentence fragments can be c o n s i d e r e d I n c o m p l e t e r e a l l z a t l o n s o f t h e semantic S-V-O p a t t e r n s . In g e n e r a l terms, t h e s t r u c t u r e of I n f o r m a t i o n In t e c h n i c a l domains can be s p e c i f i e d by a s e t of semantlc c l a s s e s , t h e words and phrases which belong t o t h e s e c l a s s e s , and by a s p e c l f l c a t l o n o f t h e pal'~erns t h e s e c l a s s e s e n t e r in'to, l . e . t h e s y n t a c t i c r e l a t i o n s h i p s among t h e members o f +he c l a s s e s [Grlshmen, e t e l . , 1982; Sager, 1978]. In +he case of t h e medical sublenguage processed by t h e L l n g u l s t l c StTlng P r o j e c t , t h e medical subclasses were d e r l v e d through t e c h n i q u e s of d i s t r i b u t i o n a l a n a l y s i s [Hlrschmen and Sager, 1982]. Semantlc S-V-O pet-I'erns were then d e r i v e d from t h e c o m b l n a t o r y p r o p e r t i e s of t h e medical c l a s s e s in t h e t e x t [Marsh and Sager, 1982]; +he semantic pat~rerns I d e n t i f i e d In a t e x t a r e s p e c i f i c t o t h e domain of +he t e x t . Whlle they s e r v e t o f o r m u l a t e sublanguage c o n s t r a i n t s which r u l e o u t i n c o r r e c t s y n t a c t i c a n a l y s e s caused by s t r u c t u r a l o r l e x l c a l ambiguity/, t h e s e r e l a t i o n s h i p s among c l a s s e s can a l s o p r o v i d e a means by which d e l e t e d elements in compact t e x t can be r e c o n s t r u c t e d . When a fragment Is r e c o g n i z e d as an I n s t a n c e of a g i v e n semantic p a t t e r n , I t Is +hen p o s s i b l e t o s p e c i f y a s e t of t h e semantic c l a s s e s from which t h e medical sublanguage c l a s s of +he d e l e t e d element can be s e l e c t e d .
On a s u p e r f l c l a l l e v e l , t h e d e l e t i o n s of be In fragment t y p e s I c - f and I l i a - b , f o r example, can be r e c o n s t r u c t e d on p u r e l y syntac~'lc grounds by f l l l l n g In t h e l e x i c a l Item be. However, I t Is a l s o p o s s i b l e t o p r o v i d e f u r t h e r I n f o r m a t i o n and s p e c i f y t h e semantic c l a s s of t h e lex l c a l Item be by r e f e r e n c e t o t h e semantlc S-V-O pat-tern m a n i f e s t e d by t h e o c c u r r i n g s u b j e c t and o b j e c t . For e~emple, In t y p e I f fragment s k i n no ~ r u o t l o n s , s k i n has t h e medical subclass BODYPART, and e r u n t l o n s has +he medlcal subclass SIGN/SYMFrrOM.
The semantic S-V-O pat-tern In which t h e s e c l a s s e s p l a y a p a r t Is=
BODYPART-SHOWVERB-SIGN/SYMPTOM
(as In Skln showed no e r u n t l o n s ) . Be can then be assigned t h e semantic c l a s s SHOWVERB. p r o t e i n ~ , t y p e I t , e n t e r s I n t o t h e semantic pal-~ern:
TEST-~STVERB-TES13~ESULT
and be can be assigned t h e c l a s s TESI~/ERB, which r e l a t e s a TEST s u b j e c t w l t h a TESllRESULT o b j e c t . A s s i g n i n g a semantic c l a s s t o t h e r e c o n s t r u c t e d be maximizes I t s I n f o r m a t i o n a l c o n t e n t .
In a d d i t i o n t o r e c o n s t r u c t i n g a d l s t l n g u l s h e d l e x l c a l Item, l i k e +he v e r b be, along w i t h I t s semantic c l a s s e s , I t Is a l s o p o s s i b l e t o s p e c i f y t h e s e t o f semantic c l a s s e s f o r a d e l e t e d e l e m e n t , even +hough a l e x l c a l Item Is n o t I m m e d i a t e l y r e c o n s t r u c t a b l e . For e~emple, t h e f r a g m e n t To
recelv9 f o l l c ~,J.~o of Type V I , c o n t a i n s a v e r b of t h e PI~/ERB" c l a s s and a MEDICATION-obJect, b u t t h e s u b j e c t has b ~ n d e l e t e d . The o n l y semantic pad-tern which p e r m i t s a v e r b and o b j e c t w l t h t h e s e medical subclasses Is t h e S-V-O p a t t e r n :
PATIENT-PTVERB-MEDICATION
Through r e c o g n { t l o n o f t h e semantic p a t t e r n in which +he o c c u r r i n g elements of t h e f r a g m e n t p l a y a r o l e , t h e semantic c l a s s PATIENT can be s p e c i f i e d f o r +he d e l e t e d s u b j e c t , p ~ t l e n t Is one of t h e d i s t i n g u i s h e d words In t h e domain of n a r r a t i v e medical r e c o r d s which a r e o f t e n n o t e x p l i c i t l y mentloned In t h e t e x t , a l t h o u g h they p l a y a r o l e In t h e sementlc p a t t e r n s .
The S-V-O r e l a t i o n s , o f which t h e f r a g m e n t i~/pes a r e Incomplete r e a l i z a t i o n s , form t h e b a s i s o f a p r o c e d u r e which s p e c i f i e s t h e semantic c l a s s e s o f d e l e t e d elements In f r a g m e n t s . Under t h e b e s t c o n d i t i o n s , t h e s e t o f semantic c l a s s e s f o r t h e d e l e t e d form c o n t a i n s o n l y one e l e m e n t . I t Is a l s o p o s s i b l e , however, f o r t h e s e t t o c o n t a i n more than one semantic c l a s s . For example, t h e t~fpe la f r a g m e n t Pain a l s o noted }n hands ~ knees, when r e g u l a r i z e d t o normal a c t i v e S-V-O word o r d e r as noted o a l n In hands and knees, has a d e l e t e d s u b j e c t . The s e t of p o s s i b l e medical c l a s s e s f o r t h e d e l e t e d s u b j e c t c o n s i s t s of ~PATIENT, FAMILY, OocrrOR}, s i n c e • f r a g m e n t w i t h a v e r b of t h e OBSERVE c l a s s , such as n o t e , and an o b j e c t of t h e SIGN/SYMPTOM c l a s s , such as o a l n , can e n t e r ~rtc
SUBJECT VERB OBJECT
FAMILY OBSERVE SIGN/SYMPTOM
PATIENT OBSERVE SIGN/SYMPTOM
DOCTOR OBSERVE SIGN/SYMPTOM
( M O ~ ~SERV~ F~ER. ) ( p _ ~ OBSFRV~ F~ER.)
(OOCTOR
OBSERVED F~ER,)any of t h e S-V-O p a t t e r n s In F i g u r e I , The c h o i c e of one subclass f o r t h e d e l e t e d element from among elements o f t h e s e t of p o s s i b l e subclasses Is dependent on several f a c t o r s . F i r s t , p r o p e r t i e s of paragraph s t r u c t u r e of t h e t e x t p l a c e r e s t r i c t i o n s on t h e s e l e c t i o n o f semantic c l a s s f o r a d e l e t e d element. The fragment noted o a l n I n h ~ d s and knees would s e l e c t a DOCTOR
subject
I f w r i t t e n In t h e IMPRESSION o r EXAH paragraph of t h e t e x t , b u t , In t h e HISTORY paragraph, a PATIENT o r FAMILYsubJect
could n o t be excluded. A second f a c t o r Is t h e presence of an a n t e c e d e n t h a v i n g one of t h e semantic classes s p e c i f i e d f o r t h e d e l e t e d element. I f a p o s s i b l e a n t e c e d e n t having t h e same sGmsntlc c l a s s can be found, subJect t o r e s t r l c t l o n s on change of t o p i c and d i s c o u r s e s t r u c t u r e , then t h e d e l e t e d element can be f i l l e d In by I t s a n t e c e d e n t , r e s t r i c t i n g t h e sementlc c l a s s o f t h e d e l e t e d element t o t h a t of t h e a n t e c e d e n t . Hoaever, an a n t e c e d e n t search may not always be s u c c e s s f u l , s i n c e t h e a n t e c e d e n t may n o t have been e x p l l c [ t l y mentioned In t h e t e x t . The a n t e c e d e n t may be one o f a c l a s s of d i s t i n g u i s h e d words In t h e sublanguage, such as n a t l e n t and . ~ , which may n o t be p r e v i o u s l y mentioned In t h e body o f t h e t e x t .Thus, semantic p a t t e r n s d e r i v e d from d l s t r l b u t [ o n a l a n a l y s i s p e r m i t t h e s p e c i f i c a t i o n of a s e t of semantic classes f o r d e l e t e d elements In t e x t s c h e r a c t e r l z e d by a l a r g e p r o p o r t i o n o f sentence fragments. This s p e c l f l c a t l o n can f a c i l i t a t e t h e r e c o n s t r u c f f o n of d e l e t e d elements by l i m i t i n g c h o i c e among p o s s i b l e a n t e c e d e n t s .
IV. CONCLUSION
In t h i s paper, seven d e l e t i o n p a t t e r n s found In t e c h n i c a l compact t e x t have been I d e n t i f i e d . The number of fragment types Is r e l a t i v e l y l i m i t e d . F i v e of t h e seven occur In t h e f u l l grammar of English as subparts of f u l l e r s t r u c t u r e s . These s y n t a c t i c fragment types can be vlewed as Incomplete r e a l i z a t i o n s of s y n t a c t i c SUBJ ECT-VERB-ORJECT s t r u c t u r e s ; t h e semantic p a t t e r n s In sentence fragments are found t o be Incomplete r e a l l z a t l o n s of t h e semantic SUBJECT-VER]-OBJECT pal-ferns found In f u l l sentences. Semantic classes can be s p e c l f l e d f o r d e l e t e d elements In sentence fragments based on these semantic p a t t e r n s .
AC~N~/L EDGIqENTS
Thls research was supported In p a r t by National Science Foundatlon g r a n t number IS1-/9-20788 frcm the D i v i s i o n of I n f o r m a t i o n Science and Technology, and in p a r t by N a t i o n a l L i b r a r y of H e d l c l n e g r a n t number 1-RO1-LM03953 awarded by t h e N a t i o n a l I n s t i t u t e of H e a l t h , Oepert~ent of Health and Human S e r v l c e s .
REFERENCES
Anderson, B . , Bross, I . O . J . and N. Sager (1975). Grammatical Compression In Notes and Records. h n ~ I c a n Journal o f ~ L l n o u l s t l c s
Z=4.
Grlshman, R., Hlrschman, L . , and C. Friedman (1982). Natural Language I n t e r f a c e s Using L l m i t e d S e m ~ t l c I n f o r m a t i o n . Proceedlnas o f 9~h I n t e r p a t l o n a l Conference o n ~ L t n a u l ~ t l e ~ (COLING 8 2 ) , Prague, C z e c h o s l o v a k i a .
H e n d r l x , G. (1977). Human E n g i n e e r i n g f o r A p p l i e d Natural Language Processing. Proceedlnas o f 5+h IJCAI, Cambridge, Hass.
H e n d r l x , G., S a c e r d o t l , E., Sagalowlcz, 0 . , and J. Slocum (1978). D e v e l o p i n g a Natural Language
I n t e r f a c e t o Complex Data, ACH TOOS ~ : 2 . HIrschman, t . and N. Sager (1982). A u t o m a t i c
I n f o r m a t i o n Format-ling of a Medical Sublanguage. Sublenauaae¢ ~ t u d l e s o f Lanaueae In R ~ I c + Q d ~ ~ (R. K l t - f r e d g e and J. L e r b e r g e r , a d s . ) . Waiter de G r u y t e r , B e r l i n ,
Kwesny, S.C. and N,K. Sondhelmer (1981).
R e l a x a t i o n Techniques f o r Parsing I l l - f o r m e d I n p u t . ~ Journal of
t l n o u l l t l ~ ~ : 2 .
Marsh, E. and N. Sager (1982). A n a l y s i s and Processing of Compact T e x t . proc~edln~s of
t h e 9~h ~ Conference on
L r n a u l s t l c s (COLING 82), Prague, C z e c h o s l o v a k i a .
Sager, N. (1978). Natural Language I n f o r m a t l o n Formal-ling= The A u t a n a t l c Conversion of Texts t o a S t r u c t u r e d Data Base. In Advances In 17 (N.C. Y o v l t s , e d . ) , Academic Press, Nee Y o r k .