• No results found

Automatic Indexing and Generating of Content Graphs from Unrestricted Text

N/A
N/A
Protected

Academic year: 2020

Share "Automatic Indexing and Generating of Content Graphs from Unrestricted Text"

Copied!
17
0
0

Loading.... (view fulltext now)

Full text

(1)

Automatic Indexing and Generating of

Content Graphs from Unrestricted Text^

1

Introduction

F o r q u ite s o m e tim e , I h a v e b e e n e x p lo r in g th e su rfa ce sign als o f la n g u a g e , a n d t r y in g t o p u t th e m t o as m u ch u se as p o s s ib le , p rim a rily in m o r p h o lo g y -b a s e d p a r t -o f-s p e e c h a ssig n m e n t (K ä llg r e n 1 9 8 4 a ,b ,c , 1 985) a n d p a tte r n -b a s e d s y n ta c ­ t ic a n a ly sis (K ä llg r e n 1 9 8 7 ). T h is k in d o f la rg e -s ca le , p r o b a b ilis tic p a rsin g o n th e b a sis o f m o r p h o lo g ic a l a n d s y n t a c t ic p a tte r n s h as la te ly c o m e t o u se in several p r o je c t s . S o m e m o d e ls th a t h a v e b e e n d o c u m e n te d axe th e U C R E L p a rser in E n g la n d fo r th e B r o w n a n d L O B c o r p o r a (G a r s id e & L eech 1 9 8 7 ), th e V O L - S U N G A p a rse r in th e U S A fo r th e B r o w n c o r p u s (D e R o s e 1 9 8 8 ), K e n C h u r c h ’ s s t o c h a s t ic m o d e ls (C h u r c h 1 9 8 8 ), as w ell as o th e r w o rk in th e U S (B la c k 1 9 8 8 ), b u t I a m su re w o r k a lo n g th e se lin es is g o in g o n in sev era l p la ces.

T h e im p e tu s b e h in d th e w o r k s ju s t m e n tio n e d is m a in ly a n eed fo r a n a ly zin g la r g e a m o u n ts o f u n r e s tr ic te d te x t in a w a y th a t is n o t t o o r e s o u rce -d e m a n d in g , e ith e r o n tim e o r o n c o m p u t in g p o w e r . A s a s e c o n d a r y g o a l, I h a v e seen th e needs o f la r g e -s c a le in fo r m a tio n retriev a l. K e e p in g m y o r ig in a l su rfa ce -o rie n ta tio n , I h a v e g o n e fu r th e r fr o m th e a n a ly sis in to p a r ts -o f-s p e e c h a n d co n s titu e n ts an d s ta r te d t o lo o k a t th e e x t r a c t io n a n d re p re s e n ta tio n o f s o m e k in d o f ‘ c o n te n t’ fr o m th e s u rfa ce o f te x ts , w ith o u t a n y k in d o f k n o w le d g e b a se s u p p o r t. T h is m ig h t se e m q u ite im p o s s ib le . M a n y o f th e o th e r p a p e r s in th is v o lu m e d ea l w ith h o w u n a v a ila b le in fe re n ce s a re t o th e c o m p r e h e n s io n o f t e x t, an d o f c o u r s e th e y a re rig h t. I f th e a im is t o b u ild a c o m p u te r iz e d s y s te m th a t w ill in a n y w ay s im u la te la n g u a g e u n d e r s ta n d in g , it is n e ce s s a ry t o h a v e a la rge k n o w le d g e base a n d m e ch a n is m s fo r m a k in g in feren ces fr o m it, b u t th ere a re a ls o a p p lic a tio n s w h e r e th e h u m a n k n o w le d g e a n d in fe re n cin g c a p a c itie s c a n b e u sed in ste a d . M y a p p r o a c h in th e e x p e r im e n t t o b e r e p o r te d h ere has b e e n t o let e a ch o n e d o

* My sincere thanks to Boris Prochaska and Sten-Erik Bergner, formerly at Ericsson Tele­ com, who wrote the original version of the graph drawing program that I have used, and to Howard Gayle who set me in contact with them. Sune Magnberg has done a great job in trans­ ferring the program to a PC environment and has also written the part of the whole system that checks for collocational pairs. I also wish to thank Benny Brodda, Kari Fraurud, and Sune Magnberg for valuable comments on earlier drafts o f this article.

(2)

Gunnel Källgren: Automatic Indexing

61

w h a t h e /s h e /i t is g o o d a t; let th e c o m p u t e r s to r e , s o r t, a n d fin d fa c t s , a n d let th e h u m a n b e in g d o th e in fe re n cin g .

T h is is a c tu a lly th e w a y it n o r m a lly is in th e field o f in fo r m a tio n re triev a l, w h ere it is th e h u m a n u ser th a t (in te ra ctiv e ly , a t b e s t) d e c id e s w h e th e r s h e /h e has g o t th e d esired m a te ria l. T h e p r o b le m is th en t o p r o d u c e an o p t im a l b a sis fo r th a t d e cisio n . I h a v e s o fa r b e e n te s tin g a u to m a t ic in d e x in g o f te x ts , i. e. t o find cen tra l c o n c e p t s in te x ts a u to m a t ic a lly (K ä llg r e n 1 9 8 4 c ). T h is is n o t all th a t diflBcult, b u t n o t a ll th a t e ffe c tiv e e ith er. In o r d e r t o b e c o v e r in g , th e in d e x lists w ill ea sily g r o w t o o lo n g ; i f s h o r te n e d , im p o r ta n t d e s c r ip to r s m a y d is a p p e a r . T h is is an in s ta n c e o f w h a t is k n ow n as th e c o n flic t b e tw e e n re ca ll a n d p r e c is io n . Still it is cle a r th a t s im p le w o r d lists c a n b e q u ite in fo r m a tiv e , th o u g h a b it b o rin g .

I h a v e a ls o g o n e t o th e o th e r e x tr e m e a n d trie d t o a c tu a lly g e n e r a te c o h e r e n t a b s tr a c ts a u to m a t ic a lly (K ä llg r e n 1 9 8 8 ). T h is w o u ld c e r ta in ly b e d e sira b le , a n d th o u g h it is fa r aw ay, I d o n o t th in k it is im p o s s ib le .

W h a t I w ill r e p o r t o n h ere is s o m e t h in g in b e tw e e n . It is a w a y o f sh o w in g cen tra l c o n c e p ts a n d th eir in te rre la tio n s in w h a t I h a v e c a lle d ‘ c o n te n t g r a p h s ’ . It is th en u p t o th e u ser t o in te rp re t th e r e la tio n s a n d m a k e th e in fe re n ce s th a t are n eed ed in o r d e r t o g e t a p ic t u r e o f th e c o n te n t o f th e c o n c e r n e d te x t.

In p rin cip le , it m a y b e w r o n g t o ta lk a b o u t c o n te n t g r a p h s . T h e g ra p h s p ictu re ‘ w h a t a te x t is a b o u t ’ , ra th e r th a n its c o n te n t, b u t t o c a ll th e m ‘ a b o u tn e s s g r a p h s ’ is ju s t t o o clu m sy . W h a t th e se g ra p h s a c tu a lly d o is t o g iv e a h in t a b o u t th e co n te n t o f a g iv en t e x t , n o t a fu ll a n d tru e re p re s e n ta tio n o f it.

G iv en th ese lim ita tio n s , th e g ra p h s still h a v e th e ir ju s tific a t io n . T h e d e riv a ­ tio n o f th e m fr o m te x ts is a n in te re stin g ta sk , fo r a set o f rea son s: T h e p r o c e s s ca n b e fu lly a u to m a tiz e d . It ca n b e ru n o n u n r e s tr ic te d t e x t w ith o u t m a n u a l p re ­ p ro ce s s in g . T h e o u tp u t c a n o ft e n b e str ik in g ly a c c u r a te . It a ls o seem s t o h a v e s o m e in terestin g p s y c h o lin g u is tic im p lic a tio n s .

2

Surface-Oriented Indexing and Information

Retrieval

(3)

in d e x in g s y s te m s m a y p e r fo r m q u ite as w ell as m a n u a l in d e x in g , a n d a lso th a t s im p le s u r fa c e -b a s e d p r o c e d u r e s c a n b e as g o o d as o r b e tt e r th a n m o r e refin ed m e t h o d s (ib id . p . 1 0 2 ).

T h e o r ig in a l in s p ir a tio n fo r th is w o r k c o m e s fr o m P h illip s (1 9 8 5 ) w h ich relates t o s o m e v e r y e a r ly w o r k w ith in c o m p u t a t io n a l lin g u istics (e .g . S in cla ir et al. 1 9 7 0 ). M a n y o f th e id e a s s u g g e s te d a t th a t tim e w o u ld d e se rv e a ren ew ed in terest to d a y , w h en c o m p u t a t io n a l p o w e r as w ell as lin g u istic k n o w le d g e h as in crea sed (t h e fo r m e r c o n s id e r a b ly m o r e th a n th e la tte r , h o w e v e r). A g o o d o ld id e a th a t tu rn s u p e v e r y n o w a n d th e n is th e c o n c e p t o f c o llo c a t io n . C o llo c a t io n s a re w ord s th a t a p p e a r to g e th e r c o n s id e r a b ly m o r e o ft e n th a n w o u ld b e e x p e c t e d o n p u rely s ta tis tic a l g r o u n d s . T h e y c a n e ith e r b e im m e d ia te ly a d ja c e n t o r a p p e a r w ith in a lim ite d d is ta n c e fr o m ea ch o th e r . T h is d is ta n c e , in term s o f n u m b e r o f w ord s, 6an b e c a lle d a sp a n . ‘ C o llo c a t io n ’ a n d ‘ s p a n ’ a re b a sic c o n c e p t s in m y m e th o d fo r g e n e r a tin g g ra p h s.

T o sea rch fo r c o llo c a t io n s c a n b e a w a y o f fin d in g th e id io m s o f a la n g u a g e, b o t h th o s e th a t a re e n tire ly fix e d , lik e ‘ red t a p e ’ , a n d th o s e th a t c o n ta in slots, like ‘ p u ll s o m e o n e ’s le g ’ . It c a n a ls o b e a w a y o f fin d in g rela tion s b e tw een w o rd s. In th e k in d o f c o n te n t a n a ly sis th a t is ca r r ie d o u t in th e s o c ia l scie n ce s, c o o c c u r ­ re n ce s b e tw e e n p r e d e s ig n e d p a irs o r sets o f w o r d s h a v e so m e tim e s b e e n in v esti­ g a te d . M y tr e a tm e n t o f th e c o llo c a t io n s is re la te d t o b o th uses; t o th e fo r m e r in r e g a r d in g a ll w o r d s in a te x t as lia b le t o e n ter c o llo c a t iv e rela tion s, t o th e la tte r in a ssig n in g s o m e k in d o f s e m a n tic lo a d t o th e rela tion s. T h is a m o u n ts t o sa y in g th a t th e fa c t th a t tw o w o r d s c o o c c u r s u s p ic io u s ly o fte n ca rries s o m e m e a n in g in itself.

T h e r e is, h o w e v e r, o n e lim ita tio n o n w h a t w o rd s ca n fo r m c o llo c a t io n s in m y s y s te m . T o a v o id u n in te re s tin g c o llo c a t io n s , su ch as a r tic le p lu s n o u n e t c ., I o n ly ta k e c o n te n t w o r d s in to re g a r d , n o t fo r m w o r d s . T h e d is t in c tio n b e tw e e n co n te n t w o r d s a n d fo r m w o r d s is o f c o u r s e n o t to t a lly cle a r (w h ic h lin g u is tic d is tin c tio n s a r e ? ), b u t c le a r e n o u g h t o b e o p e r a tio n a liz a b le . T h e r e a re s o m e ra re in sta n ces o f h o m o g r a p h y , as w h en ‘ o u t ’ ca n b e a n o u n in c o n n e c tio n w ith b a se b a ll, an d s o m e a d v e r b s c a n b e felt t o b e ‘ c o n te n t h e a v y ’ . D is r e g a r d in g th is, fo r m w o rd s c a n b e g iv e n as lists o f w o r d s fr o m th e c lo s e d c a te g o r ie s ; p r o n o u n s , p re p o s itio n s , a d v e r b s , a u x ilia r y v e r b s , a r tic le s , p a r tic le s , a n d c o n ju n c tio n s . R e m o v in g su ch w o r d s fr o m ru n n in g t e x t , o r p la c in g th e m o n s o -c a lle d ‘ s to p lis ts ’ , is a m u ch u sed p r a c tic e in a u t o m a t ic in d e x in g , a n d it is e s tim a te d th a t a b o u t 25 0 c o m m o n w o r d s c o v e r 4 0 -5 0 p e r c e n t o f an a v e ra g e E n g lish te x t (S a lto n k . M c G ill p. 7 1 ). L e m m a t iz a tio n a n d ra n k o r d e r in g , as d e s c r ib e d in step s 2 a n d 3 in th e a lg o r ith m b e lo w , a re a ls o w e ll-e s ta b lis h e d te ch n iq u e s.

3

The Algorithm

(4)

Gunnel Källgren; Automatic Indexing

63

(1) The ALGORITHM:

1. Eliminate all form words.

2. Lemmatize the remaining words, i. e. disregard differences at the end o f words that either belong to the sets o f derivational and inflectional endings or are admissible combinations o f those.

3. Rank the lemmas in order o f frequency.

4. Decide on a lowest frequency o f lemmas and exclude the lemmas below that level. The level is dependent on the length o f the text and the degree o f recall wanted. The lemmas above the frequency threshold form the set o f INDEX words.

5. Decide on a SPAN length. The length o f the span is not dependent on text length or wanted recall, but might be language dependent.

6. Find all instances where two words from the index set appear within the same span. These are the COLLO CATION AL PAIRS.

7. Find all pairs that are identical, disregarding order, as the pairs in themselves are unordered.

8. Rank the collocational pairs in order o f frequency.

9. Decide on a lowest frequency o f collocational pairs, based on the same principles as for the lemma frequency. Pick out all pairs above that frequency.

10. Construct ADJACEN CY LISTS, i. e., for each lemma, list all other lemmas with which it forms a pair.

11. Use the adjacency lists as input to the GRAPH-drawing program.

A lte r n a tiv e v e rsio n w ith g ra p h s d ra w n b y h a n d :

10' Try to find optimal orderings o f the pairs, look for central concepts that occur in many pairs.

11' Draw the GRAPH .

4

Implementation

T h e sy ste m has b e e n te s te d fo r S w ed ish , a n d th e p r o g r a m s fo r r e m o v in g fo r m w o rd s a n d le m m a tiz in g c o n te n t w o r d s s o fa r o n ly e x is t in S w ed ish v ersion s. T h e y a re a set o f L isp p r o c e d u r e s ru n n in g o n P C s . F o r th e p u r p o s e o f d e m o n s tr a tio n , I w ill h ow ev er u se an E n g lish te x t w h ere th e le m m a tiz in g h as b e e n d o n e m a n u a lly . T h e fu ll E n g lish te x t is g iv e n in A p p e n d ix A a n d all e x a m p le s b e lo w a re tak en fr o m th a t t e x t. A p p e n d ix B c o n ta in s th ree s h o r te r S w ed ish te x ts a n d A p p e n d ix C th eir r e s p e ctiv e g ra p h s. T h e s e h a v e b e e n p r o d u c e d in a w h o lly a u to m a t ic a l w a y as sp e cifie d in th e a lg o r ith m .

(5)

triviaJ p r o c e s s , b u t c a n in th is c o n n e c t io n b e d o n e in a sim p lified m a n n er. T h e t e x t , d e v o id o f its fo r m w o r d s , is tr e a te d as a w o r d list a n d s o rte d a lp h a b e t­ ica lly . W o r d s th a t sta rt in a sim ila r w a y a re c o m p a r e d as t o th eir en d in g s. I f tw o w o r d s axe id e n tic a l a ll th e w ay, th e y c le a r ly b e lo n g to g e th e r . I f th e p a rts w h e r e th e y d iffer b e lo n g t o th e p r e -d e fin e d set o f en d in g s, th e y a re a lso reg a rd ed as b e lo n g in g to g e th e r . T h is m a t c h in g c a n b e d o n e in m o r e o r less s o p h istica te d w a y s. E ith e r th e p a irs o f m a t c h e d e n d in g s m u st sign a l th e sa m e p a r t-o f-s p e e c h a n d b e m o r p h o lo g ic a lly c o n n e c te d , as w h en b erry a n d b erries fo r m a le m m a

b e r r w ith m a t c h in g e n d in g s y — ie s . O th e rw is e , a n y th in g ‘ e n d in g lik e ’ w ill d o ,

as w h e n f a v o r i t e , fa v o r e d a n d fa v o ra b le a re m a tch e d . A c tu a lly , I th in k th e la tte r a lte r n a tiv e , le m m a tiz in g a c r o s s p a r t-o f-s p e e c h b o u n d a rie s , s h o u ld b e p referred , as w e a re p r im a r ily lo o k in g fo r s e m a n tic r e la tio n s, regard less o f h o w th e y a re e x ­ p re sse d . In th is w ay, th e tr u n c a te d stem s (se e b e lo w ) th a t rep resen t e

2

w:h lem m a c o m e t o refer t o a c o n c e p t m o r e th a n ju s t a w o r d . T h is k in d o f le m m a tiz a tio n h a s b e e n c a lle d ‘ r o o t le m m a tiz a tio n ’ a n d a lin g u istica lly s o p h is tic a te d w a y o f d o in g it is d e s c r ib e d in F je ld v ig -G o ld e n (1 9 8 4 ).

S e m a n tic a lly e r r o n e o u s le m m a tiz a tio n c a u o f co u r s e n o t b e a v o id e d , as w h en

la te , w h ic h in th e t e x t is u sed in th e sen se o f d eceased , is le m m a tiz e d w ith a

te m p o r a l la te ly . T h is is h o w e v e r n o t su ch a b ig p r o b le m as o n e m ig h t s u s p e c t, as s u ch in fe lic itio u s p a irin g s ra r e ly reeich a fr e q u e n c y w h ere th e y w ill in flu en ce th e o u t c o m e o f th e e n tire p r o c e s s . T o s o lv e th e p r o b le m , o n th e o th e r h a n d , w o u ld d e m a u d a v e r y la r g e a p p a r a tu s b a s e d o n n o t o n ly se m a n tic b u t a lso p r a g m a tic k n o w le d g e .

(6)

Gunnel Källgren; Automatic Indexing

65

(2) I

ndex words

. (L

emmas with frequency

> = 2.)

a cre hair o r c h a r d

b e rr I p a st

b ro w n in clu d p la n t

c a p tiv a t N o r t h J s la n d p r o d u c c h a n c e J im -M a c lo u g h lin seed

C h in e se k iw i sh ip

c o m m e r c ia l k iw ifru it s o ld

c o u n t r la te su ccess

d e v e lo p le m o n ta st

ea r like T e -P u k e

eg g m a rk et th o u s a jid

fa v o r m e v in e

five m illio n a ire w h ite

flesh m o s t w ild

& uit n ew w o r ld

green N ew J Z ea la n d y e a r

N e x t, a sp a n len g th h as t o b e s e ttle d . T h is d o e s n o t, h o w e v e r, seem t o b e c o n ­ n e cte d t o re ca ll a n d p r e c is io n in th e sa m e w a y as th e fr e q u e n c y lim its. R a th e r , th ere seem s t o b e an o p t im a l sp a n le n g th . In c r e a s in g o r d e c r e a s in g th e sp a n len g th in re la tio n t o th e o p t im a l le n g th w ill in c r e a s e /d e c r e a s e re ca ll in th e w a y th a t w o u ld b e e x p e c te d , w h ile b o t h in cre a se a n d d e c r e a s e o f s p a n le n g th , in ter­ es tin g ly e n o u g h , seem t o re d u ce p r e c is io n . A n in c r e a s e in sp a n le n g th w ill g iv e m o r e o f a c c id e n ta l a n d th e r e b y u n in te re s tin g c o llo c a t io n s a n d a ls o a h ig h e r rel­ a tiv e fre q u e n cy o f su ch u n in te re stin g c o llo c a t io n s a m o n g a ll c o llo c a t io n s a b o v e th e c r itic a l th re s h o ld th a t is t o b e set in s te p 9 o f th e a lg o r ith m . A t th e sa m e tim e, in cre a se d sp a n le n g th seem s t o g iv e su r p r is in g ly fe w n ew ‘ h its ’ , w h ile th e o ld h its ru n a risk o f b e in g o u tn u m b e r e d b y th e n ew a c c id e n ta l c o llo c a t io n s . A d e cre a se in spcin le n g th w ill re m o v e m a n y w a n te d c o llo c a t io n s , w h ile th e re la tiv e p r o p o r tio n o f h its a m o n g th e rem a in in g c o llo c a t io n s w ill n o t in crea se. A n y v a ri­ a tio n o f th e sp a n le n g th th u s seem s t o g iv e a r e d u c e d p r o p o r t io n o f s e m a n tic a lly sig n ifica n t c o llo c a t io n s . T h is is, h ow ev er, o n ly s u b je c t iv e im p re ssio n s fr o m sm a ll- sca le tests w ith v a r y in g sp a n len g th . S im ila r resu lts h a v e b e e n re a ch e d b y o th e r s (S in cla ir e t al. 1970, re ferred in P h illip s 1 9 8 5 ), a n d h a v e led t o e s ta b lis h in g a sp a n len g th o f fo u r o r t h o g r a p h ic w o rd s as o p tim a l.

T h is is a p o in t th a t w o u ld d eserv e a m o r e th o r o u g h in v e s tig a tio n . It p r o b a b ly has s o m e th in g t o d o w ith th e n o rm a l size o f c o m m o n c o n s tr u c tio n s : m o d ifie r a n d n ou n w ill a lm o s t a lw a y s a p p e a r w ith in less th a n fo u r w o r d s d is ta n c e , as w ill m o s t ly s u b je c t — v e r b a n d v e r b — o b je c t , w h ile e. g . m o r e p e rip h e ra l a d v e rb ia ls w ill n o t o c c u r th a t c lo s e t o th e n ex u s p a rt o f th e s en ten ce.

(7)

a n y a re fo u n d , th e re su ltin g p a irs a re s to r e d a n d th e sea rch g o e s o n . In (3 ) a c la u s e fr o m th e te x t is g iv e n w ith all in d e x w o rd s ca p ita liz e d .

( 3 ) T H O U S A N D S OF A C R E S ARE N E Wl y P L A N T E D EACH Y E A R IN A DOZEN OR MORE C O U N T R I E S , . . .

H ere, th o u sa n d c o llo c a t e s w ith a cre a n d n e w , b u t n o t w ith p la n t. A c r e c o l­ lo c a t e s w ith n e w a n d p la n t, n e w w ith p la n t a n d y e a r , emd p la n t w ith yea r.

C o u n tr h a s n o c o llo c a t io n s in th is in s ta n ce . T h e in tern a l o r d e r o f th e c o llo c a ­

tio n a l p a irs is o f n o im p o r t a n c e , s o th e ste m s w ith in e a c h p a ir ^lre s to re d in a lp h a b e tic a l o r d e r . T h e p a irs a re th en s o r te d a lp h a b e tic a lly a n d th e fre q u e n cy o f e a c h c o llo c a t io n a l p a ir is c a lc u la te d .

T h e n e x t s te p is a g a in t o d e c id e o n a low est freq u en cy , th is tim e o f c o llo c a ­ tio n a l p a irs. T h is d e c is io n g o v e r n s w h ich p a irs, a n d c o n s e q u e n tly w h ich lem m a s, a r e t o b e r e g a r d e d as re p r e s e n ta tiv e o f th e c o n te n t o f th e t e x t. A s th is h as su ch g r e a t im p a c t o n th e o u t p u t , it m a y w ell b e th a t it sh o u ld b e p o s s ib le t o va ry th e fr e q u e n c y th r e s h o ld fo r c o llo c a t io n a l p a irs in tera ctiv ely , in o r d e r t o fa cilita te c lo s e r in s p e c tio n o f in te re s tin g fin d in g s. A w a y o f m a k in g exp^lnsions o f th e sets o f le m m a s a n d re la tio n s w ill b e d e s c r ib e d b e lo w .

In ( 4 ) , a ll c o llo c a t io n a l p a irs w ith a fr e q u e n c y eq u a l t o o r a b o v e 2 in th e s a m p le t e x t axe g iv e n in a lp h a b e tic a l o r d e r .

(4) C

ollocational pairs with frequencies

.

b e r r — k iw ifru it 2 c o m m e r c ia l — k iw ifru it 2 d e v e lo p — k iw ifru it 2 flesh — g reen 2 I — k iw ifru it 3 I — N e w -Z e a la n d 2

T h e id e a is th a t th is c a n g iv e a m o r e o r less a c c u r a te p ic tu r e o f c o n c e p ts a n d re la tio n s th a t a re c e n tr a l t o th e t e x t , a t lea st in th e sen se th a t th e y sh o w a h ig h fre q u e n cy . M o s tly , th is is su flicien t t o p r o v id e a h in t a b o u t w h a t th e te x t is a b o u t . In s o m e ca se s th e r e m a y h o w e v e r b e a n eed fo r en la rg in g th e b a sis o f th e r e p r e s e n ta tio n . T h is c a n b e d o n e b y s e ttin g a lo w e r m in im a l fr e q u e n c y level fo r le m m a s o r c o llo c a t io n a l p a irs o r b o t h , b u t th is m ea n s r e d o in g p a r ts o f th e p r o c e s s in g . A b e t t e r w a y c a n b e t o u se a set o f e x p a n s io n o p e r a tio n s as d efin ed b e lo w .

W it h o u t c h a n g in g th e g iv e n fr e q u e n c y levels fo r lem m a s a n d c o llo c a t io n a l p a irs, w e c a n d e r iv e th e f o llo w in g sets o f c o n c e p t s a n d rela tio n s b e tw e e n c o n c e p ts :

(5) E

xpansions
(8)

Gunnel Källgren: Automatic Indexing

67

F i r s t e x p a n s i o n : a ll c o llo c a t io n s b e tw e e n p r im a r y c o n c e p ts .

S e c o n d e x p a n s i o n : all o th e r le m m a s c o llo c a t in g w ith p r im a r y c o n c e p t s . T h is g iv es th e set o f s e c o n d a r y c o n c e p ts .

T h i r d e x p a n s i o n : a ll c o llo c a t io n s b e tw e e n s e c o n d a r y c o n c e p ts .

F o u r t h e x p a n s i o n : a ll le m m a s c o llo c a t in g w ith s e c o n d a r y c o n c e p ts .

E t c .

T h e s e c o n d a n d fo u r th (g e n e ra lly : a ll e v e n ) e x p a n s io n s a re ‘ o p e n in g ’ e x p a n ­ sion s, as th e y b r in g in n ew c o n c e p ts . T h e first a n d th ir d (^lnd all o d d ) e x p a n s io n s are ‘ c lo s in g ’ , as th e y e sta b lish re la tio n s b e tw e e n e x is tin g c o n c e p t s a n d m a k e th e c o r r e s p o n d in g g ra p h m o r e c lo s e ly k n it.

In (6 ) b e lo w , w e see fo r ea ch o f th e p r im a r y c o n c e p t s th e c o llo c a t io n s it en ters: a ) w ith o th e r p rim a ry c o n c e p ts a n d w ith a fr e q u e n c y a b o v e th e m in im a l level; b ) w ith o th e r p rim a ry c o n c e p t s b u t w ith a fr e q u e n c y b e lo w th e m in im a l level (first e x p a n s io n ); c ) all c o llo c a t io n s b e tw e e n p r im a r y c o n c e p t s a n d o th e r le m m a s fr o m th e set o f in d e x w o r d s (s e c o n d e x p a n s io n ). T o c o n s tr u c t all in te rre la tio n s b etw een all th o s e ite m s w o u ld in its tu rn g iv e th e th ir d e x p a n s io n .

(9)

( 6 ) Co l l o c a t i o n a l p a i r s w i t h f r e q u e n c i e s: a) p r i m a r y c o n c e p t s, b) FIRST EXPANSION, C ) SECOND EXPANSION

Primary concepts Possible expansions

kiwifruit:

a) 3 kiwifruit I b) 1 kiwifruit New.Zealand

2 kiwifruit berr c) 1 kiwifruit captivat 2 kiwifruit commercial 1 kiwifruit chance 2 kiwifruit develop 1 kiwifruit includ

1 kiwifruit Jim.Macloughlin 1 kiwifruit millionaire 1 kiwifruit adjacency 1 kiwifruit plant 1 kiwifruit sold 1 kiwifruit tast 1 kiwifruit vine 1 kiwifruit year berr(y):

a) 2 berr kiwifruit c) 1 berr brown

1 berr like 1 berr tast 1 berr wild commercial:

a) 2 commercial kiwifruit b) 1 commercial New .Zealand c) 1 commercial orchard develop:

a) 2 develop kiwifruit b) 1 develop New.Zealand

c) 1 develop taste 1 develop vine flesh:

a) 2 flesh green b) 1 flesh I

1 flesh kiwifruit c) 1 flesh tast green:

a) 2 green flesh b) 1 green I

1 green kiwifruit c) 1 green fruit

1 green seed I:

a) 3 I kiwifruit c) 1 I I

2 I New.Zealand 1 I vine

New .Zealand:

a) 2 New.Zealand I c) 1 New.Zealand captivat 1 New.Zealand lemon 1 New.Zealand NorthJsland 1 New.Zealand produc 1 New.Zealand tast

(10)

Gunnel Källgren: Automatic Indexing

69

in feren ce m a k in g . T o p r o c e e d t o th is, a set o f a d ja c e n c y lists is c o n s t r u c t e d o n th e basis o f (4 ) . T h e a d ja c e n c y lists fo r m th e in p u t t o th e g r a p h -d r a w in g p r o g r a m , w h ere eaeh le m m a w ill c o r r e s p o n d t o a n o d e in th e g r a p h . In a n a d ja c e n c y list, eaeh le m m a , i. e. ea ch n o d e , is g iv e n a list o f all its im m e d ia te ly a d ja c e n t n o d e s . T h is w ay, ea ch c o llo c a t io n a l p a ir w ill b e re p resen ted tw ic e , c o r r e s p o n d in g t o th e tw o p o s s ib le d ir e c tio n s o f th e a r c b e tw e e n th e n o d e s. T h e g ra p h s re su ltin g fr o m th is sy ste m a re h o w e v e r u n d ir e c te d . It w o u ld b e p o s s ib le t o h a v e w e ig h te d a rcs in th e g r a p h , c o r r e s p o n d in g t o th e fre q u e n cie s o f c o llo c a t io n a l p a irs, b u t th is has n o t b een im p le m e n te d in th e p resen t s y s te m . T h e a d ja c e n c y lists d e r iv e d fr o m (4 ) are sh ow n in (7 ) .

(7) A

djacency lists

k iw ifr u it(I, b e rr, c o m m e r c ia l, d e v e lo p ) b e r r (k iw ifr u it)

c o m m e r c ia l(k iw ifr u it) d e v e lo p (k iw ifr u it) fle sh (g reen ) g reen (flesh )

I(k iw ifru it, N e w -Z e a la n d ) N e w -Z e a la n d (I)

5

Graph-Drawing

T h e la st ste p in th e a lg o r ith m is th e d ra w in g o f a g ra p h . A u t o m a t ic d r a w in g o f g ra p h s b y m ea n s o f a c o m p u t e r is a d e m a n d in g ta sk , e s p e c ia lly i f th e w o r k , as in th e presen t ca se, is t o b e d o n e o n a P C . W e h a v e, h o w e v e r, b e e n a b le t o fin d a s a tis fa c to r y so lu tio n .

T h e p r o g r a m c o n sists o f tw o m a in p a rts. T h e first o n e fin d s th e a rea s a n d su b a re a s th a t to g e th e r b u ild u p th e g r a p h . It tries t o a v o id c r o s s in g a rcs, b u t if th a t is n o t p o s s ib le , th e p ro g ra m fin d s th e b e s t p la c e s t o a d d ‘ p s e u d o -n o d e s ’ , i. e. crossin g s. Its o u t p u t is a ‘ r o a d d e s c r ip t io n ’ o f th e g r a p h . T h e s e c o n d p a rt o f th e p ro g ra m p e r fo r m s th e c o m p u ta tio n a lly h e a v y ta sk o f a c tu a lly d r a w in g th e g ra p h , la y in g it o u t n ic e ly o n th e screen o r in a file th a t c a n b e s to r e d o r p rin te d o u t o n p a p er.

T h e first p a rt o f th e p r o g r a m w as o r ig in a lly w ritte n b y B o r is P r o c h a s k a as a p a rt o f h is e x a m in a tio n a t th e R o y a l In s titu te o f T e c h n o lo g y in S to c k h o lm (P r o c h a s k a 1 9 8 8 ), a n d th e s e c o n d p a rt w as w r itte n b y S te n -E r ik B e r g n e r , w h o w as B o r is ’ s u p e rv is o r d u r in g his e x a m in a tio n j o b at E r ic s s o n T e le c o m . T h e ir v ersion o f th e p r o g r a m is w r itte n in P S L -L is p a n d h a d t o b e r e w ritte n in G C - L isp , a su b set o f C o m m o n L isp , fo r u se o n P C s . T h is n o n -tr iv ia l j o b h as b e e n u n d erta k en b y S u ne M a g n b e r g , w h o s e p r o g r a m m in g sk ills, ea rlier k n o w le d g e o f g ra p h th eory , a n d g e n era l c o m b in a t io n o f in v en tiv en ess a n d p a tie n c e , m a d e th e j o b o f tr a n s p o r tin g th e ‘ p o r t a b le ’ L isp p o s s ib le .

(11)

t o h a v e s t r o n g c o n n e c t io n s t o w h a t th e te x t is a b o u t . T h e g ra p h th a t, b y m ean s o f th e d e s c r ib e d s y s te m , c a n b e a u to m a t ic a lly d e riv e d fr o m th e te x t in A p p e n d ix A is s h o w n in ( 8 ) , w h ile (9 ) is th e first e x p a n s io n o f th a t g ra p h (c f. (5 ) ).

(12)

Gunnel Källgren: Automatic Indexing

71

6

Concluding Remarks

T h e s e g ra p h s s u p p o r t in p r in c ip le th e s a m e in fe re n ce s as d id th e lists o f p a irs, b u t in a n e a te r w ay. T h e k in d o f r e la tio n s th a t a re sig n a lled b y th e a rcs varies c o n s id e r a b ly a n d is left t o th e h u m a n u ser t o gu ess— a t th e risk o f m a k in g m istak es. A v e r y n a tu ra l q u e s tio n t o a sk is w h e th e r all th is a p p a r a tu s g iv es a n y th in g m o r e th a n w o u ld a s im p le list o f h ig h -fr e q u e n c y w o r d s . M y im p r e s s io n is th a t it d o e s . B e lo w is a list o f th e 9 m o s t fre q u e n t le m m a tiz e d c o n te n t w o r d s in th e te x t, all lem m a s w ith a fr e q u e n c y o f 4 o r h ig h er. T h e list s h o u ld b e c o m p a r e d t o th e w o rd s o f th e a d ja c e n c y lists, (7 ) , a n d t o th e fu ll te x t in A p p e n d ix A .

(10) C

ontent words from

K

iwi

-

t e x t

,

lemmatized and sorted ac

­

cording TO FREQUENCY 13 k iw ifru it

10 N ew _ZeaJand 5 b e rr

5 sh ip 4 C h in ese 4 v in e 4 I 4 lem on 4 m ark et

K iw ifr u it, N e w .Z e a la n d , b e r r y / ie s , a n d / a re a ls o re p resen ted a m o n g th e

eigh t lem m a s p ick ed o u t b y th e g r a p h -c o n s tr u c tin g a lg o r ith m a n d th e g r a p h clea rly sh ow s th eir cen tra lity . T h e g r a p h a ls o sh o w s c o m m e r c ia l a n d d e v e lo p m e n t as h ig h ly ce n tra l, w h ile th e d e s c r ip t io n s g r e e n a n d fl e s h a re sh o w n t o b e s o m e ­ w h a t less cen tra l. T h e p u re fr e q u e n c y s ta tis tic s , h o w e v e r, h as it th a t sh ip / p in g ,

C h in e s e , a n d le m o n a re q u ite as im p o r ta n t as m a r k e t a n d v in e . B u t th e a r tic le

(in A p p e n d ix A ) is c e r ta in ly n o t a b o u t th e s h ip p in g o f C h in e s e le m o n s , it is a s u b je c tiv e ly c o lo r e d b o a s tin g a b o u t th e c o m m e r c ia l su c c e s s o f k iw ifru it a n d all th a t th is h a s m e a n t t o N e w Z e a la n d , in te rsp e rse d w ith ly r ic b u r s ts a b o u t th e lo o k a n d ta s te o f th e b erry . T h e r e is n o d o u b t th a t th is is m o r e c le a r ly s ig n a lle d b y th e g ra p h th a n b y th e fr e q u e n c y list, a lth o u g h b o t h r e p r e s e n ta tio n s n eed a g o o d d ea l o f h u m a n in fe re n ce m a k in g t o b e a d d e d .

T h e resu lts h a v e n o t y e t b e e n in d e p e n d e n tly e v a lu a te d , b u t th e m e t h o d h as b een a p p lie d t o sev era l S w ed ish te x ts . T h r e e s h o r t S w ed ish te x ts a re s h o w n in A p p e n d ix B a n d th eir c o r r e s p o n d in g g ra p h s in A p p e n d ix C . O n e v e r y in te re s tin g fin d in g is th a t th e m e t h o d seem s t o b e u tte r ly im p o s s ib le o n lite r a r y te x ts , b u t o k e y o n o th e rs . W h y th is is s o is s o m e th in g th a t h as t o b e in v e s tig a te d m o r e closely. It m u st a ls o b e in v e s tig a te d fo r w h ich te x t ty p e s th e m e t h o d is b e s t s u ite d a n d u n d er w h a t cir c u m s ta n c e s it ru n s a risk o f b e in g se r io u s ly m isle a d in g .

(13)

b e e n p r e v io u s ly d e r iv e d fr o m th e te x ts in th e d a ta b a s e t o b e sea rch ed . I f th e se a rch q u e s tio n w a s in n a tu ra l la n g u a g e , th e p re se n ce o f in terrela tion s b etw een k e y w o r d s c a n a ls o b e c h e c k e d . A m ea su re fo r w h en a g r a p h is ‘ s a tisfa cto rily s im ila r ’ t o th e in fo r m a tio n d e riv e d fr o m th e search q u e s tio n m u st b e d efin ed. N e x t , o n e s e le c te d g r a p h a t a tim e w ill b e s h o w n o n th e screen a n d th e u ser ca n c h o o s e i f s h e /h e w a n ts t o h a v e th e fu ll t e x t. In d o u b tfu l c a se s it m a y b e p o ssib le t o g e t o n e o r m o r e o f th e e x p a n s io n s in o r d e r t o g et a b r o a d e r ba sis fo r d ecision s. T h e sea rch c a n a ls o b e c a r r ie d o u t in su ch a w a y th a t g r a p h s th a t are ju d g e d as relev a n t c a n b e u sed fo r d e r iv in g n ew , c o n jo in e d g ra p h s.

I f th e s e id e a s c a n b e d e v e lo p e d t o w o r k w ell, th e p r a c tic a l usefu ln ess o f th e c o n te n t g r a p h s is cle a r, b u t a m o n g th e m o s t th rillin g q u e s tio n s a re w h y th e m e t h o d w o r k s w h en it w o rk s, a n d w h y it d o e s n ’t w ork w h en it d o e s n ’t. T h is is as y e t fa r fr o m clea r.

References

Black, E., 1988. Grammar Development for Speech Recognition. Proc. from ELS Con­ ference on Computational Linguistics, IBM Norway.

Church, K. W ., 1988. A Stochastic Parts Program and Noun Phrase Parser for Unre­ stricted Text. ACL Second Conference on Applied Natural Language Processing, Austin, Texas.

DeRose, S. J., 1988. Grammatical Category Disambiguation by Statistical Optimiza­ tion. Computational Linguistics, 14(l);31-39.

Fjeldvig, T. Si Golden, A., 1984. Automatisk rotlemmatisering — et lingvistisk hjelpe­ middel for tekstsöking. CompLex no. 9/84. Universitetsforlaget, Oslo.

Garside, R. Si Leech, F., 1987. The UCREL probabilistic parsing system. In Garside, R. et al. The Computational Analysis o f English. Longman, London.

Källgren, G., 1984a. HP-systemet som genväg vid syntaktisk märkning av texter. In Svenskans beskrivning I

4

, p. 39-45. Lunds universitet.

Källgren, G., 1984b. HP — A Heuristic Finite State Parser Based on Morphology. In SågvaU-Hein, Anna (ed.) De nordiska datalingvistikdagama 1983, p. 155-162. Uppsala universitet.

Källgren, G. 1984c. Automatisk excerpering av substantiv ur löpande text. Ett möjligt hjälpmedel vid automatisk indexering? IRI-rapport 1984:1. Institutet för Rätts- informatik, Stockholms universitet.

Källgren, G. 1985. A Pattern Matching Parser. In Togeby, Ole (ed.) Papers from the Eighth Scandinavian Conference o f Linguistics. Copenhagen University.

Källgren, G., 1987. W hat G ood is Syntactic Information in the Lexicon o f a Syntac­ tic Parser? In Nordiske Datalingvistikdage 1987, Lambda, 7. Copenhagen, Han­ delshøjskolen.

Källgren, G. 1988. Automatic Abstracting o f Content in Text, Nordic Journal o f Lin­ guistics, Vol. 11(1-2): 89-110.

Phillips, M., 1985. Aspects o f Text Structure. North-HoUand, Amsterdam.

(14)

Gunnel Källgren: Automatic Indexing

73

Salton, G. & McGill, M. J., 1983. Introduction to M odem Information Retrieval. McGraw-Hill, New York.

Sinclair, J. McH. et al., 1970. English Lexical Studies: Report to OSTI on Project C /L P /0 8 , Dept o f English, University o f Birmingham.

A

The Captivating Kiwifruit

Thirty years ago, growing up in New Zealand, I often sliced into a brown berry that looked like a duck’s egg in a bristly hair skirt. Repulsive? Not really, for I knew a secret: The berry’s odd ap]>earance disguised an equally exotic interior, a sunburst o f neat white streaks radiating from a creamcolored core, past tiny black seeds and into shimmering green flesh (above). Sweet-tart in taste, it seemed a succulent blend o f strawberry, banana, melon, and pineapple flavors. Delicious! I loved the kiwifruit.

I stiU do, and today this peculiar product o f a woody vine is captivating palates outside New Zealand at an extraordinary pace. In 1986 more than a billion kiwifruit, once called Chinese gooseberries, were tucked into trays and shipped to at least 30 nations. Thousands o f acres axe newly planted each yecir in a dozen or more countries, including the United States, France, Japan, and Italy, the leading producers after New Zealand.

This universal success has uniquely New Zealand roots. The kiwifruit’s conversion to a commercial crop occurred in New Zealand, and its name— coined in the 1950s as a marketing tactic— conjures up both that likable country and its whimsical, flightless native bird, renowned for oversize eggs and hairlike brown feathers. Moreover, exports of the fuzzy, four-ounce berry are increasingly important to New Zealand’s economy and the creator o f more millionaires than anything else in my homeleind’s history.

The only fruit with such bright green flesh, the kiwifruit is one o f just a handful of food plants domesticated within the past thousand years. Originating in the Yangtze Valley, it has long been a favorite o f the Chinese, glorified in poetry as early as the eight century. Chinese peasants stiU gather the wild fruit for sale in rural markets.

The transformation o f a small, hard, and wild Chinese berry into fleshier, tastier ki­ wifruit began about 1904, when a traveler returned from a China visit with seeds for Alexander Allison, a nurseryman on New Zealand’s North Island. In the following three decades he and other gardeners developed superior kiwifruit vines through careful se­ lection, pruning, and grafting. Most o f these early fcinciers were as much interested in the vine’s showy white blossoms and attractive fan-shaped leaves as in its berries.

Kiwifruit farming got its commercial start in the 1930s, most successfully at Te Puke on the North Island’s east coast. The late James MacLoughlin became the father of the modern kiwifruit— and ultimately a millionaire— by chance.

After he lost his job as a shipping clerk during the Great Depression, Jim’s wife’s aunt invited them to stay on her lemon orchard at Te Puke. “Later the bottom fell out of the lemon market,” he told me, “ but a neighbor sold the kiwifruit from a single plant for five pounds (then worth about $20 U.S.). To me that was a lot o f money, so I risked putting in half an acre o f them.”

Luckily for MacLoughlin, the warm, wet climate and volcanic soil at Te Puke favored his vines. Neighbors soon launched their own commercial orchards, which further expanded during World War II when GIs stationed in New Zealand developed a taste for kiwifruit.

(15)

kiwifruit,” Jim MacLoughlin explained. “A dock strike delayed the ship five weeks and the lemons arrived rotten, but the kiwifruit were in perfect shape." They sold well, and New Zealanders suddenly realized that they’d opened a world market.

B

Swedish Texts

T ex t 1

Rudi och Renate hyr en liten stuga ovanför sjön, fast de h2ir visst aldrig råd att betala den. Där finns ett rum och kök.

När Malin och jag kommer dit, sätter vi oss på golvet och jag tar av mig skorna också, jag vill vara som hon. Rudi spelar Mozart på en grammofon, som han lånat hem “på prov” . Det är alldeles lör dyrt att köpa egna grammofoner.

Solen skiner rakt in i köket. Rudi visar sina bilder, Malin röker pipa och ler så gott när hon ser tavlorna och Rudi pratax så mycket att jag slipper.

(Göran Tunström: Prästungen)

T ex t 2

M e tro p o le n O slo får en ny profil

Inte på 100 år har så många och omfattande byggprojekt påbörjats i Oslo. När de är klaxa kommer den norska huvudstaden att få en ny profil och nya möjligheter. Under tiden lider Osloborna.

Nordens högsta hotell, en kongresshall med plats för 10 000 åskådare och Europas längsta gågata under tak är några av de projekt som redan är i full gång.

Det har skett en snabb utveckling de senaste åren. Oslo blir alltmer en metropol. Vad gäller nattliv och restauranger kan staden konkurrera med både Stockholm och Köpenhamn. Den sista sammanräkningen visade 90 nattklubbar och kafeer som höll öppet mellan två och fyra på natten.

Aker Brygge med sin kombination av butiker, restauranger, teater och kontor i läckra omgivningar vid hamnen har blivit något som Osloborna stolt visar upp för tillresande. En bärande tanke har varit att öppna staden mot fjorden igen. Biltrafiken ska läggas så mycket som möjligt i tunnlar.

(DN 1987-12-05)

T ex t 31

O m b a tterier o ch b a tterib yte

Låt inte ett kvartsur som stannat bli liggande. Batteriet kan börja läcka och skada din klocka.

Vågax man då byta batteri själv?

(16)

Gunnel Källgren: Automatic Indexing

75

Det är också viktigt att du får rätt sorts batteri och inte ett som är avsett för fotoar­ tiklar eller hörapparater. Då gäller inte garantin som de flesta tillverkare av urbatterier ger.

På de flesta klockor måste boetten öppnas vid batteribyte. Då fordras specialverktyg och stor försiktighet för att inte elektroniken ska ta skada. Och det är viktigt att boetten sluter ordentligt tätt efter batteribytet.

Det är ett axbete du ska överlåta till en fackman.

C

Conceptual Graphs of the Swedish Texts

T ext 1

(17)

T e x ts

References

Related documents

A proportional mortality study for the period 1958-1967 comparing Devonport dockyard workers with other Plymouth males showed only a slight (but not statistically

Early Years Provision in Schools : Fran Butler 14.01.15 Page 8 related to a nursery class in an infant or primary school; there is also an emerging interest from secondary schools

For the surface circulation, pressure in the Gulf of Alaska is significantly higher for all days leading up to and including the day of the extreme and in general higher pressure

The Language type component encompasses six items(see Table 6) that largely evaluate the view of the raters on the authenticity of the textbook, whether the language of the

D4566 Test Methods for Electrical Performance Properties of Insulations and Jackets for Telecommunications Wire and Cable D5537 Heat Release, Flame Spread and Mass Loss Testing

CPM also comprises a series of analytic applications, such as BP&F, financial- consolidation, and financial-reporting solutions, which provide the functionality to support

AR provides the means for intuitive information presentation which enhances our perception of the real world, placing virtual objects and

There could, potentially, be interest in supporting certain aspects of the preparatory work in Namibia (e.g. the Launch of the Assembly) and this needs to be explored with