Natural Language Text Segmentation Techniques Applied to the Automatic Compilation of Printed Subject Indexes and for Online Database Access

(1)

NATURAL LANGUAGE TEXT SEGMENTATION TECHNIQUES APPLIED TO THE

AUTOMATIC COMPILATION OF PRINTED SUBJECT INDEXES AND FOR ONLINE DATABASE ACCESS G. Vladu=z

Institute for Scientific Information

3501 Market S~reet, Philadelphia, Pennsylvania 19104 USA

ABSTRACT

The nature of the problem and earlier approaches to the automatic compilation of printed subject indexes are reviewed and illustrated. A simple method is described for the de~ection of semantically self-contained word phrase segments in title-like texts. The method is based on a predetermined list of acceptable types of nominative syntactic patterns which can be recognized using a small domain-independent dictionary. The transformation of the de~ected word phrases into subject index records is described. The records are used for ~he compilation of Key Word Phrase subJec= indexes (K~PSI). The me~hod has been successfully tested for the fully automatic

production of KWPSI-type indexes to titles of scientific publications. The usage of KWPSI-type display forma~s for the~enhanced online access to databases is also discussed.

i. The problem o~f automatic compilation of subject indexes

Printed subject indexes (SI), such as

back-of-the-book indexes and indexes to periodicals and abstracts journals remain important as the most common tools for information retrieval. Tradition- ally SI are compiled from subject descriptions produced for this purpose by human indexers. Such subject descriptions are usually nominalized sentences in which the word order is chosen to emphasize as theme one of the objects participating in the description; the corresponding word or word phrase is placed at the beginning of the nominative construction. Furthermore, the nominalized sentence is rendered in a specially transformed

('articulated') way involving the separated by commas display of component word phrases together with the dominating prepositions; e.g. the sentence 'In lemon juice lead (is) determined by atomic absorption spectrometry' becomes 'LEMON JUICE, lead

determination in, by atomic absorption spectroscopy.' Such rendering enhances the speedy understanding of the descriptions when browsing the index. At the same time it creates for the subject ~escription a llneary ordered sequence of focuses which can be used for the hierarchical multilevel grouping

of related sets of descriptions. The main focus (theme) serves for the grouping of descriptions under a corresponding subject heading, the secondary focuses make possible the further subdivision of such group by subheadings. This is illustrated on the SI fragment to "Chemical Abstracts" shown on figure [.

. ~ a a t P 1YfO2ts tmmm ~

lemtno ~ t d s of. blSStl3a. 1 1 2 6 4 ~ t'omGm, oL b e v e t q ~ 0 d e n u f ~ l t ~ n m ~ L t t l o n to,

T X%0b

~ m ~ . o~ F~mealnello. l t l & % l b

~ ~ , ~ ~ . . . .

m i ~ -t v k m m . ~ 7440*

$ m l y . ?~SlSt

~ s k n ~ N u t t ~ t m tl)mlts ¢~. IfsO4lOe I, e i ~ r * e m a l m ~ ~ u l l e o

[image:1.612.367.570.282.564.2]

• ~ ~ . ~ . . - , ~ ~ ~. ~ o,. I,,mn~

| t & O/. ~ I r q d m t l o a m u ~ ( ~ 1 1 ~ 5 1 t

u m v ~ . VlOtlLtOm Ot r ~ m ~ m v l tn reiauom

m, 1707~

e d h ~ i m ~ot. e ~ | y rlmms II, p t f ~ 1 2 6 s

¢leamn| ¢omompm. (or. P ."O7&~k~

¢ a ~ u n p o ~ U e op~&cal/or t m p m v ~ abral~on

• arid t mmis~*~. P ~i9~1~

ICTV J I¢"11 yClCJ~/I II~ll~¢rvl4 t~.~ln vl p~erg*l~h,.Ion e poiynwr hydras,it fo~. P t32291u

cJlonlnl f l m4km~ ~ n s (or. l ~ 3 s

Ck, q n l n 4 s, dns ~or, p~Jtd~4~ in. P 120913k

tvurd. ~ h t a o t t ~ lel ~llnl..l fo~. P ~2~1~

hvdm~hlh¢ illl. entlblOtlf telltale [ ~ I0~%37 O

h v O m p m h , m ~ v m e m u P IO330~v Figure 1

Fragment of a subject index of traditional type to "Chemical Abstracts," compiled from subject descriptions by human indexers. A text processing problem, studied in connection with the compilation of such si of traditional type, was the automatic transformation of subject descriptions for selecting their different possible themes and focuses (Armitage, 1967). An experimental procedure, not yet implemented, takes as input pre-edited subject descriptions (Cohen, 1976).

Since the g e n e r a u i o n of subjec= descrip=ions by human indexers is a very expensive procedure P. Luhn

(2)

f o r a l l t h e i r l n d e x a b l e words, These words a r e a l p h a b e t i z e d and d i s p l a y e d on t h e p r i n t e d page i n t h e

central position of a column; their contextual

f r a g m e n t s a r e s o r t e d a c c o r d i n g t o t h e r i g h t - h a n d s i d e

COntextS of the index words. Such listings, called Key-Word-ln-Context (ENIC) indexes, have been

produced and successfully marketed since 1960

" q u i c k - a ~ d - d i r t y " $ I , d e s p i t e t h e i r 'mechanical'

appearance which makes them difficult enough t o r e a d

and browse. A fragment of KNIC index to "Biological

A b s t r a c t s , " f e a t u r i n g t i t l e s e n r i c h e d by a d d i t i o n ~ l

key words is shown in figure 2.

~ o ~ c

o~/41~ ~ t a w n ~ w e ~

F i g u r e 2.

Fragment of a K e y - N o r d - i n - C o n t e x t (KNIt) t y p e

subject index to "~iological A b s t r a c t s , "

automatically compiled from titles of biological

publicaclons,

The b l a n k sparse

r e p l a c e the r e p e a t e d o c c u r r e n c e s o f ~he key word a p p e a r i n g above.

Another mechanically compiled SI substitute still in current use is based on a similar idea and simply groups together all the titles concaining a same indexable word. Such Key-Word-out-of-Context (~WOC) indexes display the full texts of titles under a common beading. Figure 3 shows a KWOC sample

g e n e r a t e d from title-Like subject d e s c r i p t i o n s ac the

I n s t i t u t e f o r S c i e n t i f i c Information, The a p p e a r a n c e

of KWOC indexes is more steep,abe but their bro~elng is much hindered by the lack of articulation of the Lengthy subject descriptions (titles). Without proper articulation, the recognition of the context immediately relevant t o the index word becomes too slow.

In 1966 the Institute for Scientific Information

( I S I ) introduced a different type of automatically

compiled subject index called PERMUTERM Subject Index (PSI) OGarfieid, 19760, which at present is the main

type of SI to the Science Cltation Index and o t h e r

similar ISI publications. Two different negative dictionaries a r e used for producing ~his SI: a so called "full stop list" of words excluded from becoming headings as well as from being used as

subordinate index entries, and a "semi- stop List" of

~ords of little informative value, which a r e noC

allowed as headings but are used as index entries along with words ~ound neither in the full-stop uor

in ~he semi-stop Lists. ~n the PSI every word Co-occuring wi~h the heading word in some

[image:2.612.351.561.85.288.2] [image:2.612.94.339.96.665.2] [image:2.612.374.527.406.594.2]

H Y O R O ~ E

Figure 3.

IMMUNOPRECI~TATION

~ ~e '~weamoge~a.l,,e,J~l v a ~

¢

Fragment of a KNOC index compiled from relatively long sub3ect descriptions° The words priuted in l o w e r c ~ s letters a r e "stop words," no~ used as inde~

h e a d i n g s .

• ubJect description ( t l C l e ) becomes an entry line

subordlu~ted to t h i s heading, The format of ~he PSI

is illustrated in figure 4.

AOE~OCAmCl ADENOMAS

• - ~ , ~

4

Figure 4.

Fragment of a PERMUTERM subject index to the Science Citation Index (Institute for Scientific Information), automatically compiled from ~ i t ± e s .

The index lines are words co-occucing in ci~les w k ~ the heading ,#ord. The arrows indicate the first occurrence under the given heading of a pointer to given article.

(3)

P S I has the unique ability to make possible the easy retrieval of all titles containing any given pair of informative words. This ability is similar to the ability of computerized online search systems to retrieve titles by any boolean combination of search terms. The corresponding PSI ability is available to PSI users who have been instructed about the principles used for compiling it. The naive user is more likely to utilize it as a browsing tool. When doing so, he may be inclined to perceive the subordinate word entries as being the immediate context of the headings. Used as a browsing tool, P S I may deliver relatively high percentage of false drops because of the lack of contextual information. Another shortcoming of the P S I is its relatively high cost due to its significant size which is propor- tional to the square of the average length of titles. The large number of entries subordinated to headings which are words of relatively

high

frequency makes

the exhaustive scanning of entries under such headings a time consumLing procedure.

An important advantage of all the above computer generated indexes over their manually compiled counterparts is the speed and essentially lower cost at which they are made available.

All the above compilation procedures are based exclusively on the most trivial facts concerning the syntaxis and semantics of natural languages. They make use of the fact that texts are built of words, of the existence of words having purely syntactic functions and of the existence of lexlcal units of very little informative value. A common disadvantage versus the SI of traditional type is that the above procedures fall to provide articulated contexts which would be short enough and structurally simple enough to be easily 8rasped in the course of browsing.

Certainly this problem can be solved by any systen which can perform the full syntactic analysis of titles or similar kinds of subject descriptions. From the syntactic tree of the title a brief articulated context can be produced for any given word of a title by detecting a subtree of suitable size which includes the given word. However, in the majority of cases the practical conditions of application of index compilation procedures are excluding the usage of full scale syntactic analysis, based on dictionaries containing the required

morphological, syntactic and semantic information for all the lexical units of the processed input. For instance, ISI is processing annually for its mutli- disciplinary publications around 700,000 titles ranging in their subject orientation from science and technology to arts and humanities. The effort needed for the creation and maintenance of dictionaries covering several hundred thousands entries with a high ratio of appearance of new words would be excessive. Therefore, the automatic compilation of SI is practially feasible only on the basis of quite simplistic procedures based on "negative" dictionaries involving approximative methods of analysis which yield good results in ~he majority of cases, but are robust enough not to break down even in difficult cases.

At one end of the range of problems involving natural language processsing are such as question answering which require a high degree of analytic

sophistication and are based on a significant amount of domain dependent information formated in bulky lexicons. Such procedures appear to be applicable to texts dealing with rather narrow fields of knowledge in the same way as the high levels of iu-depth human expertise are usually limited to specific domains. On the other end of the spectrum are simple problems requiring much less domain dependent information and relatively low levels of "intelligence" (oefined as the ability to discuss comprehensive texts from gibberish); the corresponding procedures are usually applicable to wide categories of texts. For reasons explained above, we consider the problems of

automatic compilation of subject indexes as belongin~ to this low end of the spectrum.

2. The automatic compilation o f Key-Word- Phrase Sub~ect Indexes

In this framework we developed an automatic procedure for the compilaclon of a SI based on :he detection and usage of word phrases. The earlier stages of development of this Key-Word-Phrase subject index (KW?SI) have been reported elsewhere (Vladutz 19795. The procedure starts by detecting certain types of syntactically self-contained segments of the input text; such segments are expected to be

semantically self-contained in view of the assumed well-formedness of the input. The segment detection procedure is based on a relatively short list of acceptable syntactic patterns, formulated in terms of markers attributable by a simple dictionary look-up. The markers are essentially ~he same as used in (Klein 1963) in the early days of machine translation for automatic grammatical coding of English words. All the words not found in an exlusion dictionary of

"~ 1,500 words are assigned the two markers ADJ and NOUN. All the acceptable syntactic patterns are characterized in the frameworks of a generative gr=--,~r constructed for title-type texts. Sucn texts are described as sequences of segments of acceptable syntactic patterns separated by arbitrary filler segments whose syntactic pattern is different from the acceptable ones. The analysis procedure leading to the detection of acceptable segments was

formulated as a reversal of the generative grammar and is performed by a right =o left scanning. .~ew acceptable syntactic patterns can easily be

incorporated into the generative grammar. It is envisaged to use in the future existing programs for automatically generating analysis programs from any specific variant of the grammar.

The present list of acceptable syntactic patterns includes such patterns where noun

phrases are concatenated by the preposition 'OF' anu the conjunctions 'AND', 'OR', 'AND/OR', as well as constructions of the type 'NPI, NP2, ... AND NPi'. Since no prepositions other than 'OF' and no conjunctions other than 'AND', 'OR', 'AND/OR' can occur in the acceptable segments the occurrences of other prepositions and conjunctions are used for initial delimitation of acceptable segments, but the detection procedure is not limited to such usage. In particular, a past participle or a group

(4)

an initial delimiter. The segmentation detection is illustrated for three titles in figure 5.

A) SPREADING OF VIRUS INFECTION among WILD BIRDS And MONKEYS during INFLUENZA EPIDEMIC C a u s • d by VICTORIA(3~75

V~L~Wr

OF,,,A(-~N2}

virus

B) EXERCISE I u d u c • d CHANGES in LEFT- VENTR/CULAR Function in Patients ~rlth MITRALE-VALVE PROLAPSE

C) DIFFERENTIATION OF MLC-INDUCED KILLER And Suppresser TTCELS by SENSITIVITY to FTRILAMINE

F i g u r e 5.

The detection of acceptable segments is shown for 3 titles. The words with all lowercase letters are prepositions and conjunctions used as initial deli~Liters. The words with only initial capital letters are "seml-stop" words, excluded from being

u s e d a s i n d e x h e a d i n g s ; t h e u n d e r s c o r e d by d o t t e d

lines "seml-stops" are past participles which become dellminters only when followed by initial dellmi~ars. The resulting multl-word phrases are underscored ~wice unlike the resultlng single word phrases which

a r e u n d e r s c o r e d o n c e .

The first part of the system's dictionary con- Junctions, prepositions, articles, auxiliary verbs and pronouns. Th/s part is completely domain independent. A second p a r t of the dictionary consists of nouns, adjectives, verbs, present a n d

past participles, all of them of little informative value and, therefore, called "seml-stop" words. Such words will not be allowtd later to become SI

headings. The semi-stop p a r ~ of the dictionary is somewhat domain-dependent and has to be atuned for different broad fields of knowledge such as science and technology, social sciences or arts and

humanities.

The second logical step in the SI compilation involves the transformation of acceptable segments into index records consisting of an informative word (not found in the system's dictionary) displayed as heading llne and of an index llne providing some relevant context for the headlng word. Each multi- word segment generates as many index records as many informative words it contains. The ri~ht-hand side of the segment following the heading word is placed at the beginning of the index line to serve as its i~nediate context and is followed through a senLicolon by the segment's left-hand side. When both sides are non-empty, an articulation of the index line is so achieved. In the case of a single word segment an "expansion" procedure is performed during index record generation. It starts by placing at the beginning of the index llne a fragment of the title consisting of the filler portion following the heading word and of the next acceptable segment, if any; this initial portion of the index line is followed by a semicolon after which follows the preceding acceptable segment, followed finally by the filler portion separating in the title this preceding segment from the heading word. The index record generation is illustrated in figure 6.

139

final "enrichment" phase of the index record generatlon involves the additional display (in parenthesis) of the unused segments of the processed title.

*SPREADING

of VIRUS INFECTION *VIRUS

INFECTION; SPREADING of * INFECTION

SPREADING OF VIRUS * *WILD

BIRDS a n d MONKEYS *BIRDS

and MONKEYS; WILD * *MONKEYS

WILD BIRDS and * *INFLUENZA

EPIDEMIC *EPIDEMIC

INFLUENZA * *SENSITIVITY

to FYR/LAMINE; DIFFERENTIATION of

MLC-INDUCED KILLER SUPRESSOR T-CELLS by * *PYRILAMINE

SENSITIVITY ~o *

Figure 6.

The transformation of Key-Word-Phrases into subJec~ headlngs and subject entries is

illmstrated for the first two seFjnents of the title A, Figure 6. The last two examples snow how single word segments (from Title C) are expanded to incluoe the preceding and following them segemencs.

As a result of this stage the informacioual value of the finally generated index record is almost

equivalent to the information content of ~ne initial full title. The entire process ultimately boils down to the the reshuffling of some component segments of the initial title. The e n r i c ~ e n t stage of index record generation is illustrated on figure 7.

The index records are alphabetized firstly by heading words and secondly by index lines with the exclusion from alphabetiza~ion of

prepositions and conjunctions if they occur aC the beginning of index lines. During the

photocomposition

different parts of the index

[image:4.612.88.563.94.728.2]

(5)

*SPREADING

of VIRUS INFECTION (WILD BIRDS and MONKEYS; INFLUENZA EPIDEMIC; VICTORIA(#)75 VAKIANT of A(K3N2) VIRUS)

*VIRUS

INFECTION; SPREADING OF * (WILD BIRDS a n d

MONKEYS; INFLUENZA EPIDEMIC; VICTORIA(3)75 VARIANT of A(H3N2) VIRUS)

*BIRDS

and MONKEYS; WILD * (SPREADING of VIRUS INFECTIONS; INFLUENZA EPIDEMIC, EPIDEMIC; VICTORIA(3)75 VARIANT of A(H3N2) VIRUS) *INFLUENZA

EPIDEMIC (SPREADING of VIRUS INFECTIONS; VICTORIA(3)75 VARIANT of A(H3N2) VIRUS) *PYRILAMINE

[image:5.612.74.327.71.606.2] [image:5.612.352.570.86.657.2]

SENSITIVITY t o * (DIFFERENTIATION of MLC- INDUCED KILLER and SUPPRESSOR T-CELS)

Figure 7

The enrichment of the subject entries by the display (in parenthesis) of the unused by them segments of the same title, illustrated for some of the entries of Figure 6.

immediately relevan~ coutext of the head word is displayed in bold face in order to facilitate its rapid grasping when browsing. Details of the appearance and s~ructure of KWPSI are exemplified in figure 8 on a sample compiled for titles of publications dealing w i t h librarianship and information science. The general appearance of KWPSI is close enough to the appearance of SI of traditional type.

For purposes of transportability the KWPSI system is programmed in ANSA COBOL. It includes two modules: the index generation module and the sorting and reformatting module. On an IBM 370 system index records are generated for titles of scholarly papers at a speed of ~ 70,000 titles/hour. The resulting total size of the index is of the same order as the size of KWOC indexes and compares favorably with the size of the PSI index.

The analysis of ~he rates and ~aCure of failures of ~he segment detection algorithm shows that

in 96% of cases the generated segments are fully acceptable as valuable index entries. In 2% of cases some important information is lost as a result of the elimination of prepositions, as in case of expressions o f 'wood to wood' type. The rest of failures results in somehow awkward segments which are not completely semantically self-contained. Even in such cases the index entries retain some

informative value. Around half of the failures can be eliminated by additions to the system's

dictionary, especially by the inclusion of more verbs and past participles. Not counted aS failures are the 5% of cases when the leng=h of the detected segments is excessive; such segments can include the whole title.

The extent of tuning required for =he

application of the system in a new area of knowledge depends mainly upon ~he extent of figure 8.

*INDEXING

•

K S N I E ~ C M L ) . . . IC-80-2.SS7

C O N V [ ~ N C ~ m d COMP&IIgRJI~ ~ T 1 k ~ ~ M.L-

U m O ~ - I I O O R - C H A M I U E R u ~ M E I ~ . A L

SUlIJ[CT * . . . IC-IiO-2-SQ4*

¢I~Vl/E/~IACZ.. ANA,%VSIS Of SUIJECT • (USE

CENTRAI.~TO $UIJECT W O E X ~ ~ R F ¢ ~ . OF. t W ~ . S O C ~ . S ~ E ~ S . 4 C 4 O E M ~ , C E N t R e . CoM~rTEE,OF.T~.CPSU) . . . K:.110-2.SII3

w ~JR~eWT 4 WAn~Wt$J' s f n v a ~ l . . ~J~TK:UI.ATT0

SUBJECT " ¢COMPAMSON.I ~ ~ ~ O ~ ~ X ~

L~/~mv STtOES) . . . I C - 1 0 - 2 . S l l t OAVAIIalur * . . . I C - I I O - 2 . S I O m'DOCUM£/W~" SUIRCT " flM4LJ~I;',O~ ~ '

k, M TNEMA TICAI. MOOEL$) . . . IC-S0-2-$66

e t F,A~ r S ¢ ~ N n ~ c ~ ' I t ~ O t C 4 L S (INOEX c,4 rALOGUE)

IC.eG2-SB2 £ J r ~ q ~ ' M ~ , : I~POeT o~ * (BOOK ,~O~'X~NG. tOOL

.qE,4~NG. RJr'$u.~'O.~o~qO. 44TAC~r. text') ~ C . S O - 2 - S I L FAILMRE; STUDY O~ * t ' T N E , ~ I ~ D AUrOA44 I?C

/NOEXlNG) . . . IC-80-2.S67

MAW IdiAN-PA~IFI¢ * . . . LC-80.2-582

KWOC aml PRI[CIS * (ARTICULATED $U&IECT INOEXING

CUR#ENT AW.¢REN~$,$ . ~ Y I C £ $ . C O M ~ A ~ $ O ~ LIBRARY $17JD~E$) . . . IC.80.2-Sl ~ LANGUAGES PA/'7~WE~t I~DUNO,~NT • (N4ruRAL

LANGUAG£) . . . IC;.80-2-S60

~ $ . ' FOUNOATION of • ( ~ I A . ~ q C A T / O N / = R ~ . ~ $ ~ )

IC.S0.2.~67 LAW F N F O R ~ I ~ ' N T , ~ d CRR~WAL JUSlIC~ tNP~flM4TION

(NAnONAI. CRtMINAI. JU$1"/C~ Tt'IE&4UltUS, DL;CIt~TORS) . . . IC-80-2.SSO

MACNmE.AIO[D * (EN~UNtTI~. ~UTOI44 ;1C ~VO~XtNG o l

3RD KIND) . . . I C - I 1 0 - 2 - S 6 7

~ . R~FERIENC[ STRING " ... IC.80-2-$8 ]

PAleR ~ and SYSTEM e~ • (APP[IC~T~IV o l

COMPUTERS. INI7tOOUOIVG. ;'W~.,~IU~Jb, 8UI.L£TIN)

IC-80-2-S61

P I ' R ~ . US~ of C[NTRALI~q[D SUBJECT * (LIgRAIW,OF-

THE,$OCIA£.SCIENC£$~ICADEM~ CENrRAL.COMMII'7"EE- OF. T~,~.CPSU: Alva. }'$/$ ot ,~JgJECT tNO~X/NG CONVERGENCE) . . . IC-80-2-563

m,e PEle~OtC4L UTEIt,4rt, M ~ e ¢ . 4 ~ ' ~ O . 4 M £ ~ C 4 * t m ~ G R A P ~ r . . . ~C.80.2,S82

e f I~WINITO OOCUMFNY~" C~81NATION of CLASS~F~CATIO~

an4 SUEJECT " . . . ~C-80.2,S6S

~ £ $ $ . " QUM~TITATIVE 0 ( $ C R I P T I O N of " . . . IC.80-2,S65

w R~$~AiI~N (APPLIC, ATIO~V of INFORfvMTION SCIENCE RECOGNITIONS. SPECIAL SCIENCES. #VFO~M/lrloN SYSTEMS. LARGE.C.4PAC/rY $TO~/IGE) . . . . IC.80-2.SS9

4~d RErR~tEVAIL; iNTELLIGENT * (MAN-MACHINE

P/IRTNER$t'fll "1) . . . IC-80.2.S6S

ROLE of AUTOMATI~ * (O~RArlONAL ON&INE RETRfEVAL

S ~S f f M $ ) . . . ~C-80-21S 54

I T t l K * (M~CROCOMPUTER.G~NEIM ITD GR~IP~C

D/SPLAYS. /liD) . . . IC.80-2-S80 SURVEY M A U S T R A L I A N * ( O ~ R V / I T ; O N $ ) . , IC-80.2.584

$ ~ r . ~ M : COMPUTF.R-ASSlST[0 DYNAMIC * (CAOtS) IC.60.2-S67

s r ~ ' M ; THESAURUS ehd * (~RO~LEMS of UNIVEI~S/I"Y

O~G/INI!'/IT~ON) . . . IC-80.2.57S

TERMINOLOGY POSTIN~ F~R~IrJ~. DO(: ~TI~EVAL II~4~ *

(PIIERA~C/'IV ~nd KWOC) . . . IC,80-2-S48

THEOR~'Ti~4L FOUNOAnON~" ~%ACTICE of * IC-80-2.565

THE~;AURU$-ILASEO AUTOMATtC * (~FUDV o f / N ~ X I N G

[All. USE) . . . eC-80.2.S67

TNESAUIIUS-IIASED OOCUMENT " (RECOGNIrlON o f

MUL T/COM~ONENT TERMS) ... IC.80"~.$76

F/O~: PRIOeLEMS Of COORO(NATE * (NETWOR~ O/

AuTOMA rED 5rl C E N T , S) . . . ~C-80.2 58A

V ~ R S U S RWOC AUTOMATED * (P~R~O~MANCE C(:~I~ARtSOIV) . . . IC.80"2 568

I~CAmULA~Y:FREE " . . . IC,80,2 S66

~ S T - ~ O O O : T I ~ N O S ,n * . . . IC 80-~.553

~f JRO RINO: AUTOMATIC * (/tAUICI~'YNE./llOED t N ~ AING ENCOUNrE~P) . . . IC-80" 7 567

~NOUSTRIAI.

ENrF~II~tS~S rUSE o~ P/l rENT IN~O~,~ r/ON e,.~ /NrERNAYI~NA~ PARENT CLA$SIFICATK:~I). . IC-80.2 572

INFORMA 7 ~ N rPt£~AURfJ$ (CENTRAL ,AMEIP, C~ Br~ DOM~N~/IN.R~PUB.~C) . . . ~C-80.2 sso

Figure 8

A photocomposed k'WPSI sample showing details of i~s structure and appearance.

(6)

I n d e x " includes besides ci~les of arclcles, also cicles of book reviews, as well as descriptions of musical performances and musical records, compiled accordin G co special rules. Many contain mulciword names of works of arC, cakes in quotation marks,

which have ~o be handled as single words. A KWPSZ

sample for ar~s and humaniCies is shown on figure 9 •

Two more KWPSI samples are given on figure 4 ~

(science and C e c t ~ o l o g y ciCles) and figure 1 1

(&eoscience research f=on~ names).

DRAMA

" ~ ~ . . .

~ ~:::::::::::::::;~,~---~.~,

• " . : : : : : : : : : : : : : : .

~ # e , l ~ ¢ m ~ l w m e m l l u m ~ • ~ u a l l m l n ~

DRAWING

• . . . . : ~ : ~

, , ' ~ " ~ F . . . ' : ~ - - " . . ~ . ~ . ~ " ~ ' ~ '

~ : : : : : : : : : ~ " ~

~-,~__-...."~....,....~...::::~,~:

• ,, , , ~ ... , m , . . . ~,~"-- ,,~'=~'P,.im ~ . ~ ~ ' "

~ m i m m m i ~ * * n ~ I t m t ~ i m . . . ~ u ~ m l c

m e ~ i i , ~ " i i - - i ~ c j ~ , ~ m ; . . . m m m . , " .. . . , ~ ~ ~ 1 ~ 1 ° m m m l - . . . m .

. . . x p ~ m ~

. . . m , m ~ m m m ~ i . . . . . ~ n s ~ s

• . ~ . . ,

i i ~ p ~ m r ~ f w ~ q ~ n l ~ ~ ~ t r

a * A l . , z ~ r

o e a ~ i ~ n ~ r ~

t n a p ~ .~ a ~ L ~ a m m m • a e o , ~ o v ~

~ l q I ~ ~ ~ ~ M J ~ m P g~a~ma m d M ~

,,' ~ I U ~ ~ l l A m ~ l a n l - . . . u m a q l l l • 'Oallr~lol, IR . O m p "

, . a l w i m . *

~ 1 , ~ . . . ,,..sP.,I J ,

• . . ... ;!!!!!!!!!

[image:6.612.346.550.127.377.2] [image:6.612.101.311.177.675.2]

' : :::::::::?:!!!i!!!iii!! :

Figure

L,,i}~i samples for arts and humanities titles.

e I O R D £ R U N E "CANCER

. . . - - " L ~ - j . - ~

• . . . w , . . . ,

I m

, % ' ~ ' ~ , , , ~ , , ' ~ . . . ~ ~ ,

. ' ~ , ~ u ~ . . . ~ . . . . , . . . , ~ , , ~

~ ( ~ a m m a O g * i U L • ~ J q ~ , r ~ w s m e m n a d a r m i w ~ I ~ m ~ ~ m r s ~ 4 ~ , e

~ - % : ~ , ~

--.~E~2~,

q ~ 4 a e ~

• w

Figure I0

K~PSI sample for clclea covered by the mulci- dlsciplinaz7 SCI (science and cechnology).

3 . P o s s i b l e u s a g e o_ff a u t o m a t i c a l l y d e t e c t e d v o r d p h r a s e s f o r e n h a n c e d o n l i n e a c c e s s co

( L a c a b e s e s

The common method of online access to commercially available textual databases of both bibliographic and full cexc type is through boolean queries formulaCsd in tern~s of single words. AuCo~Icically detectable word phrases of the cype used in the KWP$1 sysCem could be used in ~L~ree different rays for improving online access.

One extreme way would involve the creation of word phrases of the above cype a= the input stage for every informative word of the Input. In response to a slnsle word query a sequence of screens would be shown displaying the image of whaC in a printed KWPSl would be the KWPSI section under ~he Given word ta~eL~ as headin G . After browsing online s o m e p a r ~ o~ t n i ~ up-to-dace online 51 the user could choose to limi~ further browsin G by respondln~ wits an additional search term, mos~ likely chosen from s o m e of t~le already examined index entries. As a result ~he system would reply by ellmlna~ing from ~ne displayed output the encrlas aoc concalnin 8 che given word an~ the user would conclnue co browse che so ~rimmed display. Several such i~eratlons could be performe~

(7)

ATMOSPHERE

c..* * z s~ ii ~*ll

I I i - ~ R I I I I I e# I F I I . ~ i .~I$ii ~ I I ~ R ~ l l ~ m ! • , . . . ii.iii i

,iI~.~ l ~ l i ' l t lll¢14&IMlm1,1 1 ~ I l .I l i ~

' ~ , ~ , ~ ' ~ " . . _ . _ ~"~"*~u~ . . ' " m ' .'2 ~'~"

m l ~ . l l l m llll,mu¢ c l m u l ~ . i m m l l ~ j ,.,~ ¢l'l~Iz , , m m , ¢ ~ I m ~ a c l M . l W l , m l ~ n .,I ~ -

[image:7.612.91.311.100.422.2]

ll.Olll

| 1 . ~

| H * Z ~

llllUl" r'~C~l fPWl,lll i ~ ~ FI~I~ . ll-OllW

ATMOSPHERIC

K o o ~ r ~ l . . . n L . l l g l

i L . o ; * |

a m m u m m m r ! e,*¢~ o , , a o ~ , , ~ c m o , a ~ ' m ~ ,,m l l , o p t . i

. , . , i l u ~ l e , l , a l m , , m ~ m ~ l i l m l m l l .

I i-o¢III

o,1~111~ I ~ / l ~ l l C 7/o~ e/#1~4 G G / N I ) . . . ll.llll

~ ~ l l l t ~ - . . . . e 1 . 1 1 1 @ ~ 1 , ~ 1 ! ~ i & ~ l - l l , i l l l

. . . I l l |

- . , , . . •

I14)I01 /lllfll I II11~I I I# I . . . II.IIII

,,,u*ll i , c l l l . ~ , ~ l l . m ~ . - . . . I I ~101

Figure Ii

KW'PSI sample for names of geoscience research fronts.

until t h e user would b e left with the display of a SI

to relevant items of the database. This SI would be

then printed together with the full list of relevant

items. It is ~hought that such kind of interaction

could be more user friendly than the currently used boolean mode.

Another way of using the KWPSI technique in an online environment would be to use the KWPSI format for the output of the results of a retrieval performed in a traditional boolean

way. The query word which achieved the most

strong trimming effect would be used as heading.

A third way would involve the compression of a KWFSI section under a given heading before it is

displayed in response to a word. One could e.g.

retain only such noun phrases containing the given word which occur at least k times in the database. An example of such list for the "Arts and Humanities"

database is given in figure 12. By displaying such

lists of words closely co- occurring with a given one ~he system would perform thesaurus-type functions.

The implementation of all such

possibilities would be rather difficult for any existin 8 system in view of the effort required

to reprocess past input. Instead, after the input of

a query word the corresponding full text records could be called a not processed online in core for

generating KWPSl-type index records. In this case

all the above functions could be still performed.

AFRICAN * HORROR(S) *

AMERICAN * iNDUSTRY

ARCHIVES LITERARY *

AVANT-GARDE * MAKER(S)

BLACK * OBSCENE *

BRAZILIAN * POLISH *

COMPANY PRODDCTION

CRITICISM PROPAGANDA

DIRECTOR TECHNIQUES

DOCUMENTARY * THEORY

FESTIVAL(S) TV *

HISTORY WESTERN *

Figure 12

Lie= of word phrases containing the word 'FILM(S)' and appearing at least twice in titles covered during a three months period by

[SI's ARTS and HUMANITIES CITATION INDEA. Such

automatically compiled lists are suggested as search aides for the online access to databases.

Another possibility which we are considering is to place the K~PSI-type processing capabilities into a microcomputer which is being used to mediate online

searches in remote databases. All the text records

containin8 a given (not too frequent) word coulu be initially tapped from the database into the micro-

computer. Following that the microcomputer could

perform all the above functions in an offline mode,

REFERENCES

Armltage, J.E., Lynch, M.F. "Articulation

in the Generation of Subject Indexes by

Computer." Journal of Chemical Documentation:

7:170-8, 1967.

Cohen, S.M., Dayton, D.L., Salvador, R. "Experimental Algorithmic Generation of

Articulated Index Entries from Natural Language Phrases a= Chemical Abstracts Service."

Journal of C h e m i g a l l n f o r m a t i o n and Computer Sciences: 1976 May, 16(2): 93-99.