NATURAL LANGUAGE TEXT SEGMENTATION TECHNIQUES APPLIED TO THE
AUTOMATIC COMPILATION OF PRINTED SUBJECT INDEXES AND FOR ONLINE DATABASE ACCESS G. Vladu=z
Institute for Scientific Information
3501 Market S~reet, Philadelphia, Pennsylvania 19104 USA
ABSTRACT
The nature of the problem and earlier approaches to the automatic compilation of printed subject indexes are reviewed and illustrated. A simple method is described for the de~ection of semantically self-contained word phrase segments in title-like texts. The method is based on a predetermined list of acceptable types of nominative syntactic patterns which can be recognized using a small domain-indepen- dent dictionary. The transformation of the de~ected word phrases into subject index records is described. The records are used for ~he compilation of Key Word Phrase subJec= indexes (K~PSI). The me~hod has been successfully tested for the fully automatic
production of KWPSI-type indexes to titles of scientific publications. The usage of KWPSI-type display forma~s for the~enhanced online access to databases is also discussed.
i. The problem o~f automatic compilation of subject indexes
Printed subject indexes (SI), such as
back-of-the-book indexes and indexes to periodicals and abstracts journals remain important as the most common tools for information retrieval. Tradition- ally SI are compiled from subject descriptions produced for this purpose by human indexers. Such subject descriptions are usually nominalized sentences in which the word order is chosen to emphasize as theme one of the objects participating in the description; the corresponding word or word phrase is placed at the beginning of the nominative construction. Furthermore, the nominalized sentence is rendered in a specially transformed
('articulated') way involving the separated by commas display of component word phrases together with the dominating prepositions; e.g. the sentence 'In lemon juice lead (is) determined by atomic absorption spectrometry' becomes 'LEMON JUICE, lead
determination in, by atomic absorption spectroscopy.' Such rendering enhances the speedy understanding of the descriptions when browsing the index. At the same time it creates for the subject ~escription a llneary ordered sequence of focuses which can be used for the hierarchical multilevel grouping
of related sets of descriptions. The main focus (theme) serves for the grouping of descriptions under a corresponding subject heading, the secondary focuses make possible the further subdivision of such group by subheadings. This is illustrated on the SI fragment to "Chemical Abstracts" shown on figure [.
. ~ a a t P 1YfO2ts tmmm ~
lemtno ~ t d s of. blSStl3a. 1 1 2 6 4 ~ t'omGm, oL b e v e t q ~ 0 d e n u f ~ l t ~ n m ~ L t t l o n to,
T X%0b
~ m ~ . o~ F~mealnello. l t l & % l b
~ ~ , ~ ~ . . . .
m i ~ -t v k m m . ~ 7440*
$ m l y . ?~SlSt
~ s k n ~ N u t t ~ t m tl)mlts ¢~. IfsO4lOe I, e i ~ r * e m a l m ~ ~ u l l e o
[image:1.612.367.570.282.564.2]• ~ ~ . ~ . . - , ~ ~ ~. ~ o,. I,,mn~
| t & O/. ~ I r q d m t l o a m u ~ ( ~ 1 1 ~ 5 1 t
u m v ~ . VlOtlLtOm Ot r ~ m ~ m v l tn reiauom
m, 1707~
e d h ~ i m ~ot. e ~ | y rlmms II, p t f ~ 1 2 6 s
¢leamn| ¢omompm. (or. P ."O7&~k~
¢ a ~ u n p o ~ U e op~&cal/or t m p m v ~ abral~on
• arid t mmis~*~. P ~i9~1~
ICTV J I¢"11 yClCJ~/I II~ll~¢rvl4 t~.~ln vl p~erg*l~h,.Ion e poiynwr hydras,it fo~. P t32291u
cJlonlnl f l m4km~ ~ n s (or. l ~ 3 s
Ck, q n l n 4 s, dns ~or, p~Jtd~4~ in. P 120913k
tvurd. ~ h t a o t t ~ lel ~llnl..l fo~. P ~2~1~
hvdm~hlh¢ illl. entlblOtlf telltale [ ~ I0~%37 O
h v O m p m h , m ~ v m e m u P IO330~v Figure 1
Fragment of a subject index of traditional type to "Chemical Abstracts," compiled from subject descrip- tions by human indexers. A text processing problem, studied in connection with the compilation of such si of traditional type, was the automatic transformation of subject descriptions for selecting their different possible themes and focuses (Armitage, 1967). An experimental procedure, not yet implemented, takes as input pre-edited subject descriptions (Cohen, 1976).
Since the g e n e r a u i o n of subjec= descrip=ions by human indexers is a very expensive procedure P. Luhn
f o r a l l t h e i r l n d e x a b l e words, These words a r e a l p h a b e t i z e d and d i s p l a y e d on t h e p r i n t e d page i n t h e
central position of a column; their contextual
f r a g m e n t s a r e s o r t e d a c c o r d i n g t o t h e r i g h t - h a n d s i d e
COntextS of the index words. Such listings, called Key-Word-ln-Context (ENIC) indexes, have been
produced and successfully marketed since 1960
" q u i c k - a ~ d - d i r t y " $ I , d e s p i t e t h e i r 'mechanical'
appearance which makes them difficult enough t o r e a d
and browse. A fragment of KNIC index to "Biological
A b s t r a c t s , " f e a t u r i n g t i t l e s e n r i c h e d by a d d i t i o n ~ l
key words is shown in figure 2.
~ o ~ c
o~/41~ ~ t a w n ~ w e ~
F i g u r e 2.
Fragment of a K e y - N o r d - i n - C o n t e x t (KNIt) t y p e
subject index to "~iological A b s t r a c t s , "
automatically compiled from titles of biological
publicaclons,
The b l a n k sparser e p l a c e the r e p e a t e d o c c u r r e n c e s o f ~he key word a p p e a r i n g above.
Another mechanically compiled SI substitute still in current use is based on a similar idea and simply groups together all the titles concaining a same indexable word. Such Key-Word-out-of-Context (~WOC) indexes display the full texts of titles under a common beading. Figure 3 shows a KWOC sample
g e n e r a t e d from title-Like subject d e s c r i p t i o n s ac the
I n s t i t u t e f o r S c i e n t i f i c Information, The a p p e a r a n c e
of KWOC indexes is more steep,abe but their bro~elng is much hindered by the lack of articulation of the Lengthy subject descriptions (titles). Without proper articulation, the recognition of the context immediately relevant t o the index word becomes too slow.
In 1966 the Institute for Scientific Information
( I S I ) introduced a different type of automatically
compiled subject index called PERMUTERM Subject Index (PSI) OGarfieid, 19760, which at present is the main
type of SI to the Science Cltation Index and o t h e r
similar ISI publications. Two different negative dictionaries a r e used for producing ~his SI: a so called "full stop list" of words excluded from becoming headings as well as from being used as
subordinate index entries, and a "semi- stop List" of
~ords of little informative value, which a r e noC
allowed as headings but are used as index entries along with words ~ound neither in the full-stop uor
in ~he semi-stop Lists. ~n the PSI every word Co-occuring wi~h the heading word in some
[image:2.612.351.561.85.288.2] [image:2.612.94.339.96.665.2] [image:2.612.374.527.406.594.2]H Y O R O ~ E
Figure 3.
IMMUNOPRECI~TATION
~ ~e '~weamoge~a.l,,e,J~l v a ~
¢
Fragment of a KNOC index compiled from relatively long sub3ect descriptions° The words priuted in l o w e r c ~ s letters a r e "stop words," no~ used as inde~
h e a d i n g s .
• ubJect description ( t l C l e ) becomes an entry line
subordlu~ted to t h i s heading, The format of ~he PSI
is illustrated in figure 4.
AOE~OCAmCl ADENOMAS
• - ~ , ~
4
Figure 4.
Fragment of a PERMUTERM subject index to the Science Citation Index (Institute for Scientific Information), automatically compiled from ~ i t ± e s .
The index lines are words co-occucing in ci~les w k ~ the heading ,#ord. The arrows indicate the first occurrence under the given heading of a pointer to given article.
P S I has the unique ability to make possible the easy retrieval of all titles containing any given pair of informative words. This ability is similar to the ability of computerized online search systems to retrieve titles by any boolean combination of search terms. The corresponding PSI ability is available to PSI users who have been instructed about the principles used for compiling it. The naive user is more likely to utilize it as a browsing tool. When doing so, he may be inclined to perceive the subordinate word entries as being the immediate context of the headings. Used as a browsing tool, P S I may deliver relatively high percentage of false drops because of the lack of contextual information. Another shortcoming of the P S I is its relatively high cost due to its significant size which is propor- tional to the square of the average length of titles. The large number of entries subordinated to headings which are words of relatively
high
frequency makesthe exhaustive scanning of entries under such headings a time consumLing procedure.
An important advantage of all the above computer generated indexes over their manually compiled counterparts is the speed and essentially lower cost at which they are made available.
All the above compilation procedures are based exclusively on the most trivial facts concerning the syntaxis and semantics of natural languages. They make use of the fact that texts are built of words, of the existence of words having purely syntactic functions and of the existence of lexlcal units of very little informative value. A common disadvantage versus the SI of traditional type is that the above procedures fall to provide articulated contexts which would be short enough and structurally simple enough to be easily 8rasped in the course of browsing.
Certainly this problem can be solved by any systen which can perform the full syntactic analysis of titles or similar kinds of subject descriptions. From the syntactic tree of the title a brief articulated context can be produced for any given word of a title by detecting a subtree of suitable size which includes the given word. However, in the majority of cases the practical conditions of application of index compilation procedures are excluding the usage of full scale syntactic analysis, based on dictionaries containing the required
morphological, syntactic and semantic information for all the lexical units of the processed input. For instance, ISI is processing annually for its mutli- disciplinary publications around 700,000 titles ranging in their subject orientation from science and technology to arts and humanities. The effort needed for the creation and maintenance of dictionaries covering several hundred thousands entries with a high ratio of appearance of new words would be excessive. Therefore, the automatic compilation of SI is practially feasible only on the basis of quite simplistic procedures based on "negative" dic- tionaries involving approximative methods of analysis which yield good results in ~he majority of cases, but are robust enough not to break down even in difficult cases.
At one end of the range of problems involving natural language processsing are such as question answering which require a high degree of analytic
sophistication and are based on a significant amount of domain dependent information formated in bulky lexicons. Such procedures appear to be applicable to texts dealing with rather narrow fields of knowledge in the same way as the high levels of iu-depth human expertise are usually limited to specific domains. On the other end of the spectrum are simple problems requiring much less domain dependent information and relatively low levels of "intelligence" (oefined as the ability to discuss comprehensive texts from gibberish); the corresponding procedures are usually applicable to wide categories of texts. For reasons explained above, we consider the problems of
automatic compilation of subject indexes as belongin~ to this low end of the spectrum.
2. The automatic compilation o f Key-Word- Phrase Sub~ect Indexes
In this framework we developed an automatic procedure for the compilaclon of a SI based on :he detection and usage of word phrases. The earlier stages of development of this Key-Word-Phrase subject index (KW?SI) have been reported elsewhere (Vladutz 19795. The procedure starts by detecting certain types of syntactically self-contained segments of the input text; such segments are expected to be
semantically self-contained in view of the assumed well-formedness of the input. The segment detection procedure is based on a relatively short list of acceptable syntactic patterns, formulated in terms of markers attributable by a simple dictionary look-up. The markers are essentially ~he same as used in (Klein 1963) in the early days of machine translation for automatic grammatical coding of English words. All the words not found in an exlusion dictionary of
"~ 1,500 words are assigned the two markers ADJ and NOUN. All the acceptable syntactic patterns are characterized in the frameworks of a generative gr=--,~r constructed for title-type texts. Sucn texts are described as sequences of segments of acceptable syntactic patterns separated by arbitrary filler segments whose syntactic pattern is different from the acceptable ones. The analysis procedure leading to the detection of acceptable segments was
formulated as a reversal of the generative grammar and is performed by a right =o left scanning. .~ew acceptable syntactic patterns can easily be
incorporated into the generative grammar. It is envisaged to use in the future existing programs for automatically generating analysis programs from any specific variant of the grammar.
The present list of acceptable syntactic patterns includes such patterns where noun
phrases are concatenated by the preposition 'OF' anu the conjunctions 'AND', 'OR', 'AND/OR', as well as constructions of the type 'NPI, NP2, ... AND NPi'. Since no prepositions other than 'OF' and no conjunctions other than 'AND', 'OR', 'AND/OR' can occur in the acceptable segments the occurrences of other prepositions and conjunctions are used for initial delimitation of acceptable segments, but the detection procedure is not limited to such usage. In particular, a past participle or a group
an initial delimiter. The segmentation detection is illustrated for three titles in figure 5.
A) SPREADING OF VIRUS INFECTION among WILD BIRDS And MONKEYS during INFLUENZA EPIDEMIC C a u s • d by VICTORIA(3~75
V~L~Wr
OF,,,A(-~N2}
virus
B) EXERCISE I u d u c • d CHANGES in LEFT- VENTR/CULAR Function in Patients ~rlth MITRALE-VALVE PROLAPSE
C) DIFFERENTIATION OF MLC-INDUCED KILLER And Suppresser TTCELS by SENSITIVITY to FTRILAMINE
F i g u r e 5.
The detection of acceptable segments is shown for 3 titles. The words with all lowercase letters are prepositions and conjunctions used as initial deli~Liters. The words with only initial capital letters are "seml-stop" words, excluded from being
u s e d a s i n d e x h e a d i n g s ; t h e u n d e r s c o r e d by d o t t e d
lines "seml-stops" are past participles which become dellminters only when followed by initial dellmi~ars. The resulting multl-word phrases are underscored ~wice unlike the resultlng single word phrases which
a r e u n d e r s c o r e d o n c e .
The first part of the system's dictionary con- Junctions, prepositions, articles, auxiliary verbs and pronouns. Th/s part is completely domain independent. A second p a r t of the dictionary consists of nouns, adjectives, verbs, present a n d
past participles, all of them of little informative value and, therefore, called "seml-stop" words. Such words will not be allowtd later to become SI
headings. The semi-stop p a r ~ of the dictionary is somewhat domain-dependent and has to be atuned for different broad fields of knowledge such as science and technology, social sciences or arts and
humanities.
The second logical step in the SI compilation involves the transformation of acceptable segments into index records consisting of an informative word (not found in the system's dictionary) displayed as heading llne and of an index llne providing some relevant context for the headlng word. Each multi- word segment generates as many index records as many informative words it contains. The ri~ht-hand side of the segment following the heading word is placed at the beginning of the index line to serve as its i~nediate context and is followed through a senLicolon by the segment's left-hand side. When both sides are non-empty, an articulation of the index line is so achieved. In the case of a single word segment an "expansion" procedure is performed during index record generation. It starts by placing at the beginning of the index llne a fragment of the title consisting of the filler portion following the heading word and of the next acceptable segment, if any; this initial portion of the index line is followed by a semicolon after which follows the preceding acceptable segment, followed finally by the filler portion separating in the title this preceding segment from the heading word. The index record generation is illustrated in figure 6.
139
final "enrichment" phase of the index record generatlon involves the additional display (in parenthesis) of the unused segments of the processed title.
*SPREADING
of VIRUS INFECTION *VIRUS
INFECTION; SPREADING of * INFECTION
SPREADING OF VIRUS * *WILD
BIRDS a n d MONKEYS *BIRDS
and MONKEYS; WILD * *MONKEYS
WILD BIRDS and * *INFLUENZA
EPIDEMIC *EPIDEMIC
INFLUENZA * *SENSITIVITY
to FYR/LAMINE; DIFFERENTIATION of
MLC-INDUCED KILLER SUPRESSOR T-CELLS by * *PYRILAMINE
SENSITIVITY ~o *
Figure 6.
The transformation of Key-Word-Phrases into subJec~ headlngs and subject entries is
illmstrated for the first two seFjnents of the title A, Figure 6. The last two examples snow how single word segments (from Title C) are expanded to incluoe the preceding and following them segemencs.
As a result of this stage the informacioual value of the finally generated index record is almost
equivalent to the information content of ~ne initial full title. The entire process ultimately boils down to the the reshuffling of some component segments of the initial title. The e n r i c ~ e n t stage of index record generation is illustrated on figure 7.
The index records are alphabetized firstly by heading words and secondly by index lines with the exclusion from alphabetiza~ion of
prepositions and conjunctions if they occur aC the beginning of index lines. During the
photocomposition
different parts of the index [image:4.612.88.563.94.728.2]*SPREADING
of VIRUS INFECTION (WILD BIRDS and MONKEYS; INFLUENZA EPIDEMIC; VICTORIA(#)75 VAKIANT of A(K3N2) VIRUS)
*VIRUS
INFECTION; SPREADING OF * (WILD BIRDS a n d
MONKEYS; INFLUENZA EPIDEMIC; VICTORIA(3)75 VARIANT of A(H3N2) VIRUS)
*BIRDS
and MONKEYS; WILD * (SPREADING of VIRUS INFECTIONS; INFLUENZA EPIDEMIC, EPIDEMIC; VICTORIA(3)75 VARIANT of A(H3N2) VIRUS) *INFLUENZA
EPIDEMIC (SPREADING of VIRUS INFECTIONS; VICTORIA(3)75 VARIANT of A(H3N2) VIRUS) *PYRILAMINE
[image:5.612.74.327.71.606.2] [image:5.612.352.570.86.657.2]SENSITIVITY t o * (DIFFERENTIATION of MLC- INDUCED KILLER and SUPPRESSOR T-CELS)
Figure 7
The enrichment of the subject entries by the display (in parenthesis) of the unused by them segments of the same title, illustrated for some of the entries of Figure 6.
immediately relevan~ coutext of the head word is displayed in bold face in order to facilitate its rapid grasping when browsing. Details of the appearance and s~ructure of KWPSI are exemplified in figure 8 on a sample compiled for titles of publica- tions dealing w i t h librarianship and information science. The general appearance of KWPSI is close enough to the appearance of SI of traditional type.
For purposes of transportability the KWPSI system is programmed in ANSA COBOL. It includes two modules: the index generation module and the sorting and reformatting module. On an IBM 370 system index records are generated for titles of scholarly papers at a speed of ~ 70,000 titles/hour. The resulting total size of the index is of the same order as the size of KWOC indexes and compares favorably with the size of the PSI index.
The analysis of ~he rates and ~aCure of failures of ~he segment detection algorithm shows that
in 96% of cases the generated segments are fully acceptable as valuable index entries. In 2% of cases some important information is lost as a result of the elimination of prepositions, as in case of expressions o f 'wood to wood' type. The rest of failures results in somehow awkward segments which are not completely semantically self-contained. Even in such cases the index entries retain some
informative value. Around half of the failures can be eliminated by additions to the system's
dictionary, especially by the inclusion of more verbs and past participles. Not counted aS failures are the 5% of cases when the leng=h of the detected segments is excessive; such segments can include the whole title.
The extent of tuning required for =he
application of the system in a new area of knowledge depends mainly upon ~he extent of figure 8.
*INDEXING
•
K S N I E ~ C M L ) . . . IC-80-2.SS7
C O N V [ ~ N C ~ m d COMP&IIgRJI~ ~ T 1 k ~ ~ M.L-
U m O ~ - I I O O R - C H A M I U E R u ~ M E I ~ . A L
SUlIJ[CT * . . . IC-IiO-2-SQ4*
¢I~Vl/E/~IACZ.. ANA,%VSIS Of SUIJECT • (USE
CENTRAI.~TO $UIJECT W O E X ~ ~ R F ¢ ~ . OF. t W ~ . S O C ~ . S ~ E ~ S . 4 C 4 O E M ~ , C E N t R e . CoM~rTEE,OF.T~.CPSU) . . . K:.110-2.SII3
w ~JR~eWT 4 WAn~Wt$J' s f n v a ~ l . . ~J~TK:UI.ATT0
SUBJECT " ¢COMPAMSON.I ~ ~ ~ O ~ ~ X ~
L~/~mv STtOES) . . . I C - 1 0 - 2 . S l l t OAVAIIalur * . . . I C - I I O - 2 . S I O m'DOCUM£/W~" SUIRCT " flM4LJ~I;',O~ ~ '
k, M TNEMA TICAI. MOOEL$) . . . IC-S0-2-$66
e t F,A~ r S ¢ ~ N n ~ c ~ ' I t ~ O t C 4 L S (INOEX c,4 rALOGUE)
IC.eG2-SB2 £ J r ~ q ~ ' M ~ , : I~POeT o~ * (BOOK ,~O~'X~NG. tOOL
.qE,4~NG. RJr'$u.~'O.~o~qO. 44TAC~r. text') ~ C . S O - 2 - S I L FAILMRE; STUDY O~ * t ' T N E , ~ I ~ D AUrOA44 I?C
/NOEXlNG) . . . IC-80-2.S67
MAW IdiAN-PA~IFI¢ * . . . LC-80.2-582
KWOC aml PRI[CIS * (ARTICULATED $U&IECT INOEXING
CUR#ENT AW.¢REN~$,$ . ~ Y I C £ $ . C O M ~ A ~ $ O ~ LIBRARY $17JD~E$) . . . IC.80.2-Sl ~ LANGUAGES PA/'7~WE~t I~DUNO,~NT • (N4ruRAL
LANGUAG£) . . . IC;.80-2-S60
~ $ . ' FOUNOATION of • ( ~ I A . ~ q C A T / O N / = R ~ . ~ $ ~ )
IC.S0.2.~67 LAW F N F O R ~ I ~ ' N T , ~ d CRR~WAL JUSlIC~ tNP~flM4TION
(NAnONAI. CRtMINAI. JU$1"/C~ Tt'IE&4UltUS, DL;CIt~TORS) . . . IC-80-2.SSO
MACNmE.AIO[D * (EN~UNtTI~. ~UTOI44 ;1C ~VO~XtNG o l
3RD KIND) . . . I C - I 1 0 - 2 - S 6 7
~ . R~FERIENC[ STRING " ... IC.80-2-$8 ]
PAleR ~ and SYSTEM e~ • (APP[IC~T~IV o l
COMPUTERS. INI7tOOUOIVG. ;'W~.,~IU~Jb, 8UI.L£TIN)
IC-80-2-S61
P I ' R ~ . US~ of C[NTRALI~q[D SUBJECT * (LIgRAIW,OF-
THE,$OCIA£.SCIENC£$~ICADEM~ CENrRAL.COMMII'7"EE- OF. T~,~.CPSU: Alva. }'$/$ ot ,~JgJECT tNO~X/NG CONVERGENCE) . . . IC-80-2-563
m,e PEle~OtC4L UTEIt,4rt, M ~ e ¢ . 4 ~ ' ~ O . 4 M £ ~ C 4 * t m ~ G R A P ~ r . . . ~C.80.2,S82
e f I~WINITO OOCUMFNY~" C~81NATION of CLASS~F~CATIO~
an4 SUEJECT " . . . ~C-80.2,S6S
~ £ $ $ . " QUM~TITATIVE 0 ( $ C R I P T I O N of " . . . IC.80-2,S65
w R~$~AiI~N (APPLIC, ATIO~V of INFORfvMTION SCIENCE RECOGNITIONS. SPECIAL SCIENCES. #VFO~M/lrloN SYSTEMS. LARGE.C.4PAC/rY $TO~/IGE) . . . . IC.80-2.SS9
4~d RErR~tEVAIL; iNTELLIGENT * (MAN-MACHINE
P/IRTNER$t'fll "1) . . . IC-80.2.S6S
ROLE of AUTOMATI~ * (O~RArlONAL ON&INE RETRfEVAL
S ~S f f M $ ) . . . ~C-80-21S 54
I T t l K * (M~CROCOMPUTER.G~NEIM ITD GR~IP~C
D/SPLAYS. /liD) . . . IC.80-2-S80 SURVEY M A U S T R A L I A N * ( O ~ R V / I T ; O N $ ) . , IC-80.2.584
$ ~ r . ~ M : COMPUTF.R-ASSlST[0 DYNAMIC * (CAOtS) IC.60.2-S67
s r ~ ' M ; THESAURUS ehd * (~RO~LEMS of UNIVEI~S/I"Y
O~G/INI!'/IT~ON) . . . IC-80.2.57S
TERMINOLOGY POSTIN~ F~R~IrJ~. DO(: ~TI~EVAL II~4~ *
(PIIERA~C/'IV ~nd KWOC) . . . IC,80-2-S48
THEOR~'Ti~4L FOUNOAnON~" ~%ACTICE of * IC-80-2.565
THE~;AURU$-ILASEO AUTOMATtC * (~FUDV o f / N ~ X I N G
[All. USE) . . . eC-80.2.S67
TNESAUIIUS-IIASED OOCUMENT " (RECOGNIrlON o f
MUL T/COM~ONENT TERMS) ... IC.80"~.$76
F/O~: PRIOeLEMS Of COORO(NATE * (NETWOR~ O/
AuTOMA rED 5rl C E N T , S) . . . ~C-80.2 58A
V ~ R S U S RWOC AUTOMATED * (P~R~O~MANCE C(:~I~ARtSOIV) . . . IC.80"2 568
I~CAmULA~Y:FREE " . . . IC,80,2 S66
~ S T - ~ O O O : T I ~ N O S ,n * . . . IC 80-~.553
~f JRO RINO: AUTOMATIC * (/tAUICI~'YNE./llOED t N ~ AING ENCOUNrE~P) . . . IC-80" 7 567
~NOUSTRIAI.
ENrF~II~tS~S rUSE o~ P/l rENT IN~O~,~ r/ON e,.~ /NrERNAYI~NA~ PARENT CLA$SIFICATK:~I). . IC-80.2 572
INFORMA 7 ~ N rPt£~AURfJ$ (CENTRAL ,AMEIP, C~ Br~ DOM~N~/IN.R~PUB.~C) . . . ~C-80.2 sso
Figure 8
A photocomposed k'WPSI sample showing details of i~s structure and appearance.
I n d e x " includes besides ci~les of arclcles, also cicles of book reviews, as well as descriptions of musical performances and musical records, compiled accordin G co special rules. Many contain mulciword names of works of arC, cakes in quotation marks,
which have ~o be handled as single words. A KWPSZ
sample for ar~s and humaniCies is shown on figure 9 •
Two more KWPSI samples are given on figure 4 ~
(science and C e c t ~ o l o g y ciCles) and figure 1 1
(&eoscience research f=on~ names).
DRAMA
" ~ ~ . . .
~ ~:::::::::::::::;~,~---~.~,
• " . : : : : : : : : : : : : : : .
~ # e , l ~ ¢ m ~ l w m e m l l u m ~ • ~ u a l l m l n ~
DRAWING
• . . . . : ~ : ~
, , ' ~ " ~ F . . . ' : ~ - - " . . ~ . ~ . ~ " ~ ' ~ '
~ : : : : : : : : : ~ " ~
~-,~__-...."~....,....~...::::~,~:
• ,, , , ~ ... , m , . . . ~,~"-- ,,~'=~'P,.im ~ . ~ ~ ' "
~ m i m m m i ~ * * n ~ I t m t ~ i m . . . ~ u ~ m l c
m e ~ i i , ~ " i i - - i ~ c j ~ , ~ m ; . . . m m m . , " .. . . , ~ ~ ~ 1 ~ 1 ° m m m l - . . . m .
. . . x p ~ m ~
. . . m , m ~ m m m ~ i . . . . . ~ n s ~ s
• . ~ . . ,
i i ~ p ~ m r ~ f w ~ q ~ n l ~ ~ ~ t r
a * A l . , z ~ r
o e a ~ i ~ n ~ r ~
t n a p ~ .~ a ~ L ~ a m m m • a e o , ~ o v ~
~ l q I ~ ~ ~ ~ M J ~ m P g~a~ma m d M ~
,,' ~ I U ~ ~ l l A m ~ l a n l - . . . u m a q l l l • 'Oallr~lol, IR . O m p "
, . a l w i m . *
~ 1 , ~ . . . ,,..sP.,I J ,
• . . ... ;!!!!!!!!!
[image:6.612.346.550.127.377.2] [image:6.612.101.311.177.675.2]' : :::::::::?:!!!i!!!iii!! :
Figure
L,,i}~i samples for arts and humanities titles.
e I O R D £ R U N E "CANCER
. . . - - " L ~ - j . - ~
• . . . w , . . . ,
I m
, % ' ~ ' ~ , , , ~ , , ' ~ . . . ~ ~ ,
. ' ~ , ~ u ~ . . . ~ . . . . , . . . , ~ , , ~
~ ( ~ a m m a O g * i U L • ~ J q ~ , r ~ w s m e m n a d a r m i w ~ I ~ m ~ ~ m r s ~ 4 ~ , e
~ - % : ~ , ~
--.~E~2~,
q ~ 4 a e ~
• w
Figure I0
K~PSI sample for clclea covered by the mulci- dlsciplinaz7 SCI (science and cechnology).
3 . P o s s i b l e u s a g e o_ff a u t o m a t i c a l l y d e t e c t e d v o r d p h r a s e s f o r e n h a n c e d o n l i n e a c c e s s co
( L a c a b e s e s
The common method of online access to commercially available textual databases of both bibliographic and full cexc type is through boolean queries formulaCsd in tern~s of single words. AuCo~Icically detectable word phrases of the cype used in the KWP$1 sysCem could be used in ~L~ree different rays for improving online access.
One extreme way would involve the creation of word phrases of the above cype a= the input stage for every informative word of the Input. In response to a slnsle word query a sequence of screens would be shown displaying the image of whaC in a printed KWPSl would be the KWPSI section under ~he Given word ta~eL~ as headin G . After browsing online s o m e p a r ~ o~ t n i ~ up-to-dace online 51 the user could choose to limi~ further browsin G by respondln~ wits an additional search term, mos~ likely chosen from s o m e of t~le already examined index entries. As a result ~he system would reply by ellmlna~ing from ~ne displayed output the encrlas aoc concalnin 8 che given word an~ the user would conclnue co browse che so ~rimmed display. Several such i~eratlons could be performe~
ATMOSPHERE
c..* * z s~ ii ~*ll
I I i - ~ R I I I I I e# I F I I . ~ i .~I$ii ~ I I ~ R ~ l l ~ m ! • , . . . ii.iii i
,iI~.~ l ~ l i ' l t lll¢14&IMlm1,1 1 ~ I l .I l i ~
' ~ , ~ , ~ ' ~ " . . _ . _ ~"~"*~u~ . . ' " m ' .'2 ~'~"
m l ~ . l l l m llll,mu¢ c l m u l ~ . i m m l l ~ j ,.,~ ¢l'l~Iz , , m m , ¢ ~ I m ~ a c l M . l W l , m l ~ n .,I ~ -
[image:7.612.91.311.100.422.2]ll.Olll
| 1 . ~
| H * Z ~
llllUl" r'~C~l fPWl,lll i ~ ~ FI~I~ . ll-OllW
ATMOSPHERIC
K o o ~ r ~ l . . . n L . l l g l
i L . o ; * |
a m m u m m m r ! e,*¢~ o , , a o ~ , , ~ c m o , a ~ ' m ~ ,,m l l , o p t . i
. , . , i l u ~ l e , l , a l m , , m ~ m ~ l i l m l m l l .
I i-o¢III
o,1~111~ I ~ / l ~ l l C 7/o~ e/#1~4 G G / N I ) . . . ll.llll
~ ~ l l l t ~ - . . . . e 1 . 1 1 1 @ ~ 1 , ~ 1 ! ~ i & ~ l - l l , i l l l
. . . I l l |
- . , , . . •
I14)I01 /lllfll I II11~I I I# I . . . II.IIII
,,,u*ll i , c l l l . ~ , ~ l l . m ~ . - . . . I I ~101
Figure Ii
KW'PSI sample for names of geoscience research fronts.
until t h e user would b e left with the display of a SI
to relevant items of the database. This SI would be
then printed together with the full list of relevant
items. It is ~hought that such kind of interaction
could be more user friendly than the currently used boolean mode.
Another way of using the KWPSI technique in an online environment would be to use the KWPSI format for the output of the results of a retrieval performed in a traditional boolean
way. The query word which achieved the most
strong trimming effect would be used as heading.
A third way would involve the compression of a KWFSI section under a given heading before it is
displayed in response to a word. One could e.g.
retain only such noun phrases containing the given word which occur at least k times in the database. An example of such list for the "Arts and Humanities"
database is given in figure 12. By displaying such
lists of words closely co- occurring with a given one ~he system would perform thesaurus-type functions.
The implementation of all such
possibilities would be rather difficult for any existin 8 system in view of the effort required
to reprocess past input. Instead, after the input of
a query word the corresponding full text records could be called a not processed online in core for
generating KWPSl-type index records. In this case
all the above functions could be still performed.
AFRICAN * HORROR(S) *
AMERICAN * iNDUSTRY
ARCHIVES LITERARY *
AVANT-GARDE * MAKER(S)
BLACK * OBSCENE *
BRAZILIAN * POLISH *
COMPANY PRODDCTION
CRITICISM PROPAGANDA
DIRECTOR TECHNIQUES
DOCUMENTARY * THEORY
FESTIVAL(S) TV *
HISTORY WESTERN *
Figure 12
Lie= of word phrases containing the word 'FILM(S)' and appearing at least twice in titles covered during a three months period by
[SI's ARTS and HUMANITIES CITATION INDEA. Such
automatically compiled lists are suggested as search aides for the online access to databases.
Another possibility which we are considering is to place the K~PSI-type processing capabilities into a microcomputer which is being used to mediate online
searches in remote databases. All the text records
containin8 a given (not too frequent) word coulu be initially tapped from the database into the micro-
computer. Following that the microcomputer could
perform all the above functions in an offline mode,
REFERENCES
Armltage, J.E., Lynch, M.F. "Articulation
in the Generation of Subject Indexes by
Computer." Journal of Chemical Documentation:
7:170-8, 1967.
Cohen, S.M., Dayton, D.L., Salvador, R. "Experimental Algorithmic Generation of
Articulated Index Entries from Natural Language Phrases a= Chemical Abstracts Service."
Journal of C h e m i g a l l n f o r m a t i o n and Computer Sciences: 1976 May, 16(2): 93-99.
Garfield, E. "The Permuterm Subject
Index: An Autobiographic Review." Journal of
the Amer_ica_n Society fqr Information Science: 27(5/6): 288-291, 1976.
Klein S., Simmons R.F. "A Computational
Approach to grammatical coding of English
words." Journal of ACM: 10(3):334-347, 1963.
Luhn P. "Keyword-in-Con=ext Index for Technical
Literature." Report RC 127. New York: IBM Corp.,
Advanced System Development Division, 1959.
Vladutz G., Garfield E. "k-WPSI - An
Algorithmically derived Key Word/Phrase Subject
Index." Proceedings of the ASIS; 42nd Annual