American Journal
of
Computational Linguistics
Microfiche
36 : 68D e p a r t q e n t o f ~ n t h r o p o l o g y
U n i v e r s i t y
of
New
Brunswick
R o s w e l l Park
Memorial
X n s t i
t u t gB u f f a l o , New
York
L i n g u i s t i c S t r i n g P r o j e c t
New
York University
2
W a s h i n g t o n
SqyareV i l l a g e r
28
New
Y o r k , New Yofk
1 0 0 1 2
ABSTRACT
~inguistic
mechanisms of compression are used when making notes within a
context where
the
objects
and
meanings
are
known. Mechanisms of compressidn in
medical records for a collaborative study of breast cancer are described.
Thesyntactic devices were mainly deletion of words having a special status in the
grammar
of
the
whole language
and deletion in particular positions
of
word+
having
a
special sta&us in the sublanguage. The deIeted forms are described
and sublanguage Qord classes defined. A subcorpus of the medical records
wasparsed by
an
existing
computer parsing system; a component covering the dele-
tion-forms was added to the granunar. Modifications to t,he
computer grammar
Introduction
All 1anguages"have mechanisms of compression. Sentences may be embedded
within other sentenaes by means of nominalization and complementation. Various
grammatical transformations involve deletion of certain parts of the sentence.
In medical records, we find entries such as no evidence of metastases, which
may be said to be derived Trea something like There is no evidence of metastases.
Such incomplete sentences are not common in the spoken language of the medical
records (i.e. dictated reports). However when physiciakrs themselves are requirbd
to write- material
for
records, compression mechanisms are qmmonly use&.Although this paper will deal with
a
m i f i c corpus, similar devices wouldI
often be used for compression in other s-ations where there
is
pressure towrite as little as possible, Legal, educational, and scientific recordg where
informal notes are kept w o u m be other examples of this class of sitqations.
The original motivation for this study was to develop effective methods for
storing &e information in a medical record and to be able to retrieve this in-
formation for purposes of research, medical care, or administration. Fsoan pre-
vious research, the feasibility of verbatim input of dictated narrative has been
established, Computerized extraction of the information has been shown to be
feaeible
i~
a test system ACORN (Automated Coding of Report ~arrative). hissystem has been described in detail in a series of previous papers. 1 d t 3
1
I .Dm J. Bross et al. "Information in Natural Languages : A New Approach1'.
Journal of the American Medical ~ssociakkon, Vol. 207, No. 11, 1969, pp. 2080-
2084.
2
I . D . 3 , Bross et al. "Feasibility ~f ~utoxnated Information Systems in the
Usarts NatvlLal Language". American Scientist, Vol. 57, No. 2, 1969, pp. 193-205.
3p.k
Shapim and D.F. Stemole. "ACORN (Automated Coding ot Report Narrative.):An Automated Ratrral-language Question-Answering syst& for surgical Reportsw.
For
a
highly s t r u c t u r e d medical record where the e n t r i e s are s i n g l e words o r very r e s t r i c t e d sentences, t h e f e a s i b i l i t y ' o f computer-assisted e d i t i n g andcoding has a l s o
been
e s t a b l i s h e d . A procedure f o r typingi n
t h e e n t d e s ver-batim
i na
medical record,called 'TICPIS' (Type-In Codingand
Editing
System) 4ha8
been r e p o r t e d e1sew)rere. However,the
t h i t d , intermediate c l a s s of m a t e r i a lcannot be handled by ACORN o r by TICES. Therefore, a l i n g u i s t i c a n a l y s i s of
t h i s type of material has been undertaken with t h e u l t i m a t e o b j e c t i v e of s e t t i n g
up
a
comprehensive eomputer system t h a t can handle almost everything i n t h emedical
records.
I n
theearlier
e f f o x t s t o develop n a t u r a l language technology, t h e work wasf a c i l i t a t e d by t h e f a c t t h a t t h e documents involved were s t r i c t l y f o r the trans-
5
mission of f a c t u a l information. Such documents a r e regarded as important both
by the persons who are f i l l i n g them o u t and by t h e persons who read them. I n
t h i s no-nonsense s i t u a t i o n where t h e record may be c r i t i c a l l y reviewed by t h e
peers of the person who is r e p o r t i n g t h e information, unambiguous and informa-
t i v e transmission of information i s a c r i t i c a l need. Some of the simplicities
i n
t h e p r e s e n t a n a l y s i s may b e ~ e c u l i a r t ows
type of s i t u a t f o n .The e x i s t e n c e of a s u b c u l t u r e with shared t r a i n i n g , o b j e c t i v e s , and exper-
ience may f a c i l i t a t e t h e note-taking p r o c e s s i n somewhat t h e same way t h a t
a
person
taking n o t e s f o r himself can somehow be more concise without ambiguity...
Howeveb, r m a n y o t h e r note-taking s i t u a t i o n s would involve subculture, though
n o t n e c e s s a r i l y a medical one, and t h e f i n d i n g s here might be expected t o have
sdne general a p p l i c a b i l i t y .
4
1.D.J. Bross e t
a l .
"Unobtrusive Biomedical Data-Input Systems".-
Bio-Medical Computing, No. 4, 1973, pp. 219-228.
E
J
Source of Material
The medical&es discussed here are ffom tjhe records of t h e Surgical
Adjutrant Breast P r o j e c t , a nationwide c o l l a b o r a t i v e study involving 36 medical
i n s t i t u t i o n s . The records were f i l l e d o u t by medical and paramedical personnel
a t
t h e p a r t i c i p a t i n g i n s t i t u t i o n s and c e h t r a l i z e d a t Ro$glell ParkNemri&l
I n s t i t u t e i n a S t a t i s t i c a l u n i t under thq direction\of-Dr. Nelson S-lack. A
sample of approximately 50 w a s taken from the 2734 cage h i s t o r i e s of p a t i e n t s
i n
t h e program and is being used i n t h e l b g u i s t i c a n a l y s i s . Each case h i s t o r y o r d i n a r i l y consiats df 3-6 pages of d e t a i l e d information on t h e p a t i e n t ' s i n i -t i a l s t a t u s , treatment, pathology r e p o r t , nledicai problems, and subsequent
f a t e . When
the
s t r u c t u r e d information i n t h e record was excluded, each caseh i s t o r y had between 6 and 26 notated items, each item c o n s i s t i n g of 1 t 6 5 par-
t i a l - s e n t e n c e s . While this material
is
speckalized t ome
purposes of the col- l a b o r a t i v e study, t h i s type of information i q f a i r l y t y p i c a l of what i s foundi n the usual h o s p i t a l record.
The notes were typed v e x b a t h using An IBM Mag Card Communicator s o as t o
obtain simultaneously
a
typed paper document anda
record i n computer-usableform. This device
i s
used i n t h e data-input sgstem of T ~ C E S ; an e x i s t i n g system f o r handling completely s t r u c t u r e d records. I t would presumably be usea i n anyextension of TICES which would handle medical nates. I n e i s ' a n a l y s i s the
com-
p u t e r w a s used t o reorganize the m a t e r i a l i n a fbrmmore convenient f o r manual
a n a l y s i s by t h e l i n g u i s t .
Anderson analyzed the l i n g u i s t i c s t r u c t u r e of t h e e n t r i e s i n a sample of
t h e medical records involving r a d i a t i o n findings, A discussion of t h i s ana-
l y s i s w i l l take up
the
next partof
t h e paper. Sager and a s s o c i a t e s used some of the findings from this study t o develop methods f o r processing t h e s e samedeveloped fok p a r s i n g s c i e n c e a r t i c l e s . This p r o j e c t w i l l be discussed i n t h e
f i n a l part of t h e paper.
L i n g u i s t i c C h a r a c t e r i s t i c s of Medical Notes
Many o f the e n t r i e s on t h e medical records a r e i n t h e form of n o t e s which
a r e n e i t h e r complete sentences nor s i n g l e word e n t r i e s , b u t l i n g u i s t i c strings
of a n
intermediate
type, which we w i l l h e r e a f t e r c a l l fragments, Fragments a r ea compressed t y p of l i n g u i s t i c material r e s u l t i n g from v a r i o u s transformations
which have the e f f e c t of making l i n g u i s t i c s t r i n g s s h o r t e r by reducing or de-
l e t i n g materihl. The
writer
of t h e s e stretches of m a t e r i a l must make h i s en-tries brief, i n o r d e r t o save time and e f f o r t , b u t a l s o make them informative
and unambiguous. For t h i s reason the d e l e t e d material h a s t o be e a s i l y recover-
able, o r i n o t h e r words it must n o t c o n t a i n much information. An a n a l y s i s of
t h e fragments shows t h a t d e l e t i o n is maiinly of a small c l a s s of sentence parts:
(1) t e n s e and the v e r b
-
be (t-
b e ) ; ( 2 ) s u b j e c t , t e n s e and t h e verb-
be; (3) t h esubject; and (4) s u b j e c t , t e n s e , and verb (V) o t h e r t h a n
-
be.A second c h a r a c t e r i s t i c of fragments which makes d e l e t e d m a t e r i a l recover-
a b l e i s that both t h e m e t e d material and the remainders c o n s i s t of words i n
easily defined s u b c l a s s e s , based on both d i s t r i b u t i o n a l and semantic c r i t e r i a .
These subclasses are e a s i l y defined because of t h e n a t u r e of t h e sublanguage;
i n g e n e r a l t h e vocabulary i s l i m i t e d and each word has a l i m i t e d semantic range.
The question on a form khich is being answered can a l s o be used a s a basis f o r
r e t o r i n g d e l e t e d material.
One of the most commonly d e l e t e d items i n the medical records i s t
-
be (1and 2 ) . Tense is perhaps t h e most important information
-
be gives. The d e l e t i o nof t e n s e i n the medical records causes no ambiguity because u s u a l l y the physician
d e s c r i b e s t h e s i t u a t i o n a t the time of f i l l i n g o u t the report, O t h e r w i s e he
Fragment Types
I n Table 1 w e l i s t t h e fragment types, giving an example of each, but not
with a l l occurring word subcl&!3ses. The types w i l l S i r s t be given according t o
what m a t e r i a l is deleted and then w i l l be f u t t h e r subclassed according t o t h e
highest nodes of t h e t r e e s t r u c t u r e of t h e remainder. The material i n brac-
k e t s is the word subclasses which a r e assumed .So have been deldted.
TABU3 1. FRAGMENT TYPES
S t r u d u r s
Material Deleted of Fragment Example
1. t
-
be by N Ven no m e t a s t a t i c l e s i o n s [were] d e t e c t e a [byN-physician physician]
N Adj c h e s t films [were] nonnal
N P N p a t i e n t [was] without cough
1
N t o V this form [is]
t o
be used.
.
.
N Ving wound [ i s ] healing well2. Subject t
be
-
Ven [N-disease was] a s p i r a t e d once
Ads [N-Patient i s ] dead
t o be Ven [N-patient i s ] t o be seen by gynecologist
Ving [N-patient i s ] doing well
3. Subject t V Object
3a. N-physician Subject
3b. N-patient
Subject 3c. N-disease
Subject
(rare 1
4. Subject t
v
4a. N-physician t Object V-diecover
4b. N-physician Object
t V-do
4c. N-patient Object t have
[I] found osteochondritis in,
r i b (5th r i g h t )
[N-patient] had period one week ago
[N-disease] invades s k i n
[N-disease] seems minor
[I V-discovered] no bony metastases
[N-ghysician did] excision of (r )
5th c o s t a l c a r t i l a g e
Word Subclasses
The word subclassbs should have t h r e e c h a r a c t e r i s t i c s : (1) they should
enable d e l e t e d material t o be recovered, (2) they should make it p o s s i b l e t o
6
e x t r a c t and s t o r e informational u n i t s such as those i n ACORN and (3) they should
be defined s o t h a t a l i n g u i s t i c a l l y unsophisticated person can e a s i l y p u t words
i n t o their subclasses.
The word subclasses a t e based on both semantic and d i s t r i b u t i o n a l c r i t e r i a .
To a l a r g e e x t e n t nouns can conveniently be subclassed on a semantic basis and
verbs can be subclassed on a d i s t r i b u t i o n a l b a s i s , according
t a
the subclassesof nouns which they t a k e as s u b j e c t and object. Due t o t h e nature of the sub-
language t h e r e is r e l a t i v e l y l i t t l e overlap (e.g., a given verb is l i k e l y t o
t a k e only one noun subclass as s h j e c t ) compared t o what w e would f i n d i n t h e
language as a whole.
Two impoftant subclasses of h-n nouns used i n t h e medical records are
N-physician and N-patieht. Each has only a few members, b u t i s important because
many verbs c h q a c t e r i s t i c a l l y t a k e it as s u b j e c t o r o b j e c t , and a l s o because
both, but p a r t i c u l a r l y N-physician, are usually deleted. I t i s on t h e basis of
t h e verbs which c h a r a c t e r i s t i c a l l y t a k e them as s u b j e c t o r o b j e c t t h a t they can
u s u a l l y be recovered without ambiguity.
Other noun subclasses concern more d i r e c t l y t h e s u b j e c t matter of t h e r e -
ports, the concrete o b j e c t s with which t h e physician i s dealing. Unlike N-
physician and N-patient, t h e s e classes u s u a l l y have many mmbers and they a r e
seldom deleted. A s w i t h N-physician and N-patient, c e r t a i n uerb subclasses
c h a r ~ c t e r i s t i c a l l y t a k e them as s u b j e c t o r object.
Table 2 g i v e s some of t h e word subclasses with examples of each.
TABLE 2. SOME WORD SUBCLASSES
N - b w a $ t
N-change I
R-dimas ion N-disease
N-exam
N-locatibrl
N-patient
N- physician N-therapy
N-time
V-be-equivalent
V-change
V - d i s c e r
V-patient-object
V-patient-subject
V-physician-subject
V-show
Ad j -bodypart
Adj-changed
Adj-degree
Ad j -discover
Adj-disease q u a l i t y
abdomen, a x i l l a , bone, Br-t, c e r v i x , p e l v i s
change, elevation, enlargement, g a b , i n c r e a s e
pressure,' r a t e , rhythm, s i z e , weight
carcinoma, cough, disease, edema, fibxosis
6iopsy, exam, f i l m , qamogram; scan, x-xay
a r e a , f i e l d , f l o o r , lobe, neck, p a r t , r e g i o n r
she, her, p a t i e n t , lady, woman
doctor, he, him, h i s , I , * M . D . , r a d i o l o g i s t
drug, i n s u l i n , medication, medicine, r a d i a t i o n
d a t e , month, t h e , v i s i t , winter, year
appear, feel, i n d i c a t e , remain, r e p r e s e n t , seem a l t e r , c l e a r , change, enlarge, h e a l , progress
d e t e c t , f i n d , i d e n t i f y , ncyte, observe, see
a h = , give, leave, p l a c e , readmit, s e e , t r a n s f e r , t r q a t
complain, come, m o p e r a t e , e n t e r , f e e l , gain, go, have,
r e f u s e , show, suf f ;r
,
t a k ef e e l , have, place, tel.1, t'kansfer, t r e a t , See
show, demonstrate, i n d i c a t e , r e v e a l , suggest a x i l l a r y , bony, c l a v i c u l a r , lumbar, p e l v i c
elevated, enlarged, healed, s t a b l e , unchanged.
considerable, extensive, i n t e r m i t t e n t , l i t t l e
absent, evident, Fnown, p o s s i b l e , p r e s e n t
a c t i v e , bad, benign, degenerative, firm, hard,
malignant, metastatpc, nodular
adjoining, d i s t a l , d o r s a l , f r o n t a l , l e f t
c l e a r , f r e e , healthy, negative, normal
Computer Parsing of Medical Records
'
To t e s t t h e l i n g u i s t i c a n a l y s i s , a subset of the manually analyzed corpus
of medical records was parsed by computer, using the NYU L i n g u i s t i c S t r i n g Parser. 8
7
I d g r a t e f u l t o Cynthia I n s o l i o and Lynette Hirschman f o r their h e l p i n processing t h e s e data.
w.
S)8
R. Grishman, N. Sager, C. Raze, and B. Bookchin, "The ~ i n g u i s t i c S t r i n g
[image:8.763.65.678.125.830.2]The LSP grqmmar of English is based on the same linguistic principles as the
ACORN grammar. Hence it could also serve to test the feasibility of adding a
note-handling capability to the ACORN-TICES system. The LSY sylr which was
designed for text-processinQ, was adapted to the parsing of medical records by
deleting portions of the grammar which are not required for this type of ma-
terial and adding a section covering sentence fragments. These change$ are
described below, followed by the parsing resultb.
The corpus which was parsed consisted of 12 sections of the Radiation
Findings extracted in their order of appearance from the medical records. These
sections contained 245 sentences or sentence fragments (word sequences ending
in a period). Of these, 37 were complete English sentences and 205 were frag-
ments; 3 were combinations of both types. 21 entries were identical to others
in the corpus, accounting in all for 139 of the sentences ox sentence fragments.
Of the complete sentences, same were quite long, e.g., Reexamination shows some
scarring and thickening over the right apex which is perhaps slightly more evi-
dent than it was before, but nothing is seen that is typical of tumor involvement.
Typical sxorter sentences are Chest films on 10-25-68 and 12-14-68 do not show
any essential changes since- last reports, Liver scan 1-29-69 was normal. Frag-
ments were, as predicted, of the types listed in Table 1, above, though not all
t y M s were represent-ed in the parsed corpus.
Table 3 shows the new definitions or redefinitioqd which were added to the
LSP grammar to cover fragments. These definitions are written in ~ a h s - N a u r Form
(BNF), as ilze all the ca. 180 definitions which comprise the context-free-part
of the LSP English grammar. The BNF definitions are used by the parser to con-
struct a tree representing the structure of the input sentence.
In addition to BNF definitions, the grammar contains restrictions, which
test the sentence trees for grammatical and selectional well-formedness. The
9
TABLE 3. DEFINITIONS ADDED TO THE LSP GRAMMAR
TO COVER SENTENm FRAGMENTS
(SENTENCE) <TExTILET>
COLD-SENTENCE? <MORESENT> (INTRODUCER) <CENTER? <FRAGMENT>
::= <TEXTLET>.
::I <OLD-SEmNCE><MORESENT>.
::= <INTRODUCEI~<CENTER)<ENDMARK>.
: : = NULL/<TEXTLET>.
c : = NULL.
::=
<ASSERTION>/:FRAGMENT>/<IMPERATIVE>.
: := <SA>
C<SOBJBESHOW/<ASTG)(SA>/<NSTG><SA~/
<VENPASS~/<NSTG) ~ ~ A S S E R T I O N ~ / ~ S O B J B E S H O W ~
1
.
::= <SUBJECT><B$-OR-SHOW><OBJBE><SA>.
: :=- +--+/NULL.
::= +.+/+,+/+;+/+-4.
starting, or root, definition of the gramnqr is SENTENCE, so this is
tha
firstdefinition seen in Table 3. In the case of medical records, the unit may be.
longer than one sentence, but we have retained the root-word SENTENCE and de-
fined SENTENCE in this case to be a TEXTLET (definition 2 ) , ,which.consists of a
sentence (called OLD-SENTENCE, definition 3) optionally followed by more sdn-
tences (MORESENT, definition 4). The definition of OLD-SENTENCE has the same
three elements (INTRODUCER, CENTER, ENDMARK) that the definition of SENTENCE
does in the LSP grammar; however, in this case, tho INTRODUCER (definiqion 5) is
NULL; the CENTER (definition 6) contains an option FRAGMENT in addition to the
options ASSERTION and IMPERATIVE defined in the English grammar; (other~options
of
CENTER,
e.g. QUESTION, have been deleted); and the ENDMARK (definition 10)contains unconventional punctuation, such as dashes and cornma, in addition to
the period and semicolon. Since our main interest here is in FRAGMENT (defini-
tion 7), we will elaborate onlhis definition.
R. Grishman, "The Restrictton Language for Computer Grammars of TFatural Language'
[image:10.763.72.625.158.442.2]I n d e f i n i n g FRAGMENT, we have used parts of t h e grammar which were defined
independently of t h e fragment problem. That t h i s i s p o s s i b l e is i n i t s e l f a
p a r t i a l v e r i f i c a t i a n of t h e conclusion from manual a n a l y s i s t h a t only l i m i t e d ,
grammatically s p e c i f i a b l e , deletion-forms occur i n t h e fragments seen i n n o t e s
and records. For example, t h e dropping of t h e verb (type 1 of Table 1) can
occur i n normal English when a sentence containing t h e verb
-
be occurs as t h eo b j e c t of a verb l i k e
-
f i n d , e.g. We found t h e c h e s t c l e a r t o pekcussion anda u s c u l t a t i o n . itn the U P grammar t h e r e i s an o b j e c t s t r i n g defined f o r such oc-
currences; it is
calleg
SOBJBE (Subject-
+
Object of-
be).
This same s t r i n g can then be made an option of CENTER t o analyze fragments having the same £ o m e,g.Cheet c l e a r to.percussion and a u s c u l t a t i o n .
I n detail, the d e f i n i t i o n of FRAGMENT begins with t h e element S A (Sentence
-
Adjunct). The d e f i n i t i o n ' of SA (not shown h e r e ) contaihs 16 o p t i o n s covering a l l
-
types of sentence modifiers* I n t h i s material t h e most f r e q u e n t SA is a t h e
expzession, u s u a l l y a d a t e [ c a l l e d PDATE, f o r o p t i o n a l P r e p o s i t i o n
+
date) o rt h i s examination, t h i s v i s i t . Following SA i n t h e d e f i n i t i o n FRAGMENT are t h e
o p t i o n s proper, naming d e f i n i t i o n s a l r e a d y i n t h e LSP grammar. The f i r s t option
SOBJBESHOW (Subject
-
+
O b j ect of-
be o r-
show), corresponds t o t h e second and t h i r ds t r u c t u r e s of type 1 and a l s o occurrences l i k e Chest f i l m no change, which i s an
expansion of SOBJBE, discussed above. This o p t i o n covers d e l e t i o n s of t h e two
most common verbs i n this material,
-
be and-
show. The plade of-
be o r-
show (defi-n i t i o n 8) i n a fragment i s e i t h e r empty or i s f i l l e d by a dash.
The mkond and f o u r t h o p t i o n s , ASTG and VENPASS, i n FRAGMENT correspond t o
s t r u c t u r e s of type 2 i n Table 1 (e.g., Negative_, f e l t t o be a benign l e s i o n ) ,
where t h e s u b j e c t , t e n s e and verb
-
be have been dropped. I n t h e LSP grammar, ASTG(Adjective
-
-
s t r i n d is an option of OBJBE, and VEWASS (V-en-
p a s s i v e s t r i n g ) i ss t r i n d , is an o b j e c t of show e . g . , Mild degenerative chanqes (£$om, X-rays show
-
-'
mild*degenerative changes). I t ale0 covers occurrences of t h e f i r s t s t r u c t u r e of
type 1 (e.g. No X-rays taken) where f o r r e g u l a r i t y with more complete e n t r i e s t h e
p a s s i v e Verb (taken)
i s
seen as a r i g h t a d j u n c t of t h e noun. The l a s t o p t i o n , c o n s i s t i n g of NSTG followed by e i t h e r ASSERTION o r SOBJBESHOW, covers such oc-currences as PA and l a t e r a l c h e s t 1l,-5-71 reexamination shows some s c a r r i n g and
thickening over t h e r i g h t apex. where a noun phrase (PA a n d ' l a t e r a l c h e s t 11-5-
71) precedes an a s s e r t i o n about t h a t ngun phrase.
-
Space permits only a few remarks about t h e s e d e f i n i t i o n s . I t was h e l p f u l
t o o r d e r t h e options s o t h a t t h e longer o p t i o n s precede t h e s h o r t e r ones, s i n c e
some of the s h o r t e r o p t i o n s (e.g., NSTG) can have t h e &form a s t h e f i r s t
element of t h e longer ones. This is n o t r e q u i r e d i n p a r s i n g t e x t s , s i n c e i n f u l l
s e n t h c e s t h e r e is u s u a l l y no o t h e r way of f i t t i n g i n t h e remainder of t h e sen-
tence. Also, i n t e x t sentences, many nouns r e q u i r e a preceding determiner, s o
t h a t compound nouns are n o t s p l i t i n t o s e p a r a t e noun phrases. I n t h i s m a t e r i a l ,
determiners are r a r e l y emplbyed, s o t h i s c o n s t r a i n t cannot be applied.
T h i ~ , combined
wim
verb d e l e t i o n s and t h e use of commas both i n t h e t e x t and assentence s e p a m t o r s , makes fof a g r e a t d e a l of s y n t a c t i c ambiguity. However, as
the next s e c t i o n shows, it was p o s s i b l e t o o b t a i n t h e intended p a r s e as t h e
firs* o u t p u t i n most cases. T h i s was accomplished without using t h e s u b c l a s s e s
s p e c i a l t o the medical m a t e r i a l , which are used i n a subsequent s t a g e of pro-
cessing preparatory t o information r e t r i e v a l .
Parsing Results
Parsing output is i n t h e form of a tree, i l l u s t r a t e d f o r a t y p i c d l frag-
ment i n Fig. 1. (Only t h e nodes mentioned above are shown, p l u s LN/RN =
-
l e f t / r i g h t modifiers of Noun,) The f u l l power of the p a r s e r is better i l l u s t r a t e d by-
-
Fig. 1
Parse t r e e f o r FRAGMENT = 5-2-67 chest--no chanqe s j n c e 2-7-67
FRAGMENT.
A summary of t h e p a r s i n g r e s u l t s is given i n Table 4. Of t h e t o t a l 245 sen-
tences, a c o r r e c t f i r a t parse was obtained f o r 171 o r 69.8%, and a f i r s t p a r s e
adequate f o r f u r t h e r processing t o o b t a i n an "information format1' i n 213 cases,
o r 86.9%. The l a t t e r statement brings us t o t h e important question of how these
parses are to be used.
F u l l sentence
Fragment
F u l l S
+
Fragment TOTAL1st p a r s e c o r r e c t
TABU3 4. PARSING RESULTS
Number of Sentences
1st parse OK f o r format
2nd o r 3rd p a r s e OK f o r format
No p a r s e ox parse9 i - 3 n o t OK f o r format
TOTAL
Average time f o r 1st par$e 5.158 seconds
Percentage
15.1
83.7
1 . 2
100.0
The
aim
i n processing n a t u r a l language notes and records is t o a r r i v e a tforms f o r t h e data which are s u i t a b l e f o r computerized information r e t r i e v a l .
The d a t a s t r u c t u r e s must not change t h e meaning. This i s why s y n t a c t i c methods
are knpo%ant. Parsing with an English grammar provides t h e gross s t r u c t u r e of
[image:13.763.62.686.452.760.2]analysis more refined,) In each specialized technical area, more specific struc-
ture is p ~ s ~ i b i e , making use of the restricted word usage characteristic of the
disqourse in the given subject area. K)
A second stage of processing of this type is now being applied to the parsed
corpus of medical records and will be reported in a subsequent paper.
A
con-venkent test of the adeqyacy of the parsing outputs is therefore whether they can
serve as input to this second stage of processing (called forhatting). It can
be seen
in
Table 4 that a number of "wrong" parses were still adequate as inputto the formatting; the segmentation of the sentence into parts was correct even
if the parts were assigned an incorrect syntactic status, e.g., object instead
of adjunct. Only when the first parse was not adequate for formatting was the
sentence rerun to obtain alternative analyses.
The parsing times are a rough indication of the efficiency of the parsing
but two points should be kept in mind. (1) The present LSP system is not a pro-
duction model, but a research tool, with all that implies. (2) A bignificant
fraction of the input sentences were "no data" types, e,g., None this visit.
These word sequences were so limited linguistically that a literal formula could
serve to reaogniae them. The experimental use of such a formula cut down parsing
times on the no-data entries from about 1.817 to0.030. However, this formula was
not used in the parsing summarized in Table 4,
-
-
This investigation was supported by Public Health service Research Grant
number CA-11531 from the National Cancer Institute.
losee Ref. 5 and A. Sager, Syntactic Formatting of Scientific Information,