A u t o m a t i c
T e x t S u m m a r i z a t i o n
B a s e d o n
t h e G l o b a l D o c u m e n t
A n n o t a t i o n
Katashi Nagao
S o n y C o m p u t e r S c i e n c e L a b o r a t o r y I n c . 3 - 1 4 - 1 3 H i g a s h i - g o t a n d a , S h i n a g a w a - k u ,
T o k y o 1 4 1 - 0 0 2 2 , J a p a n n a g a o ~ c s l . s o n y . c o . j p
KSiti Hasida
E l e c t r o t e c h n i c a l L a b o r a t o r y1 - 1 - 4 U m e z o n o , T u k u b a , I b a r a k i 3 0 5 - 8 5 6 8 , J a p a n
h a s i d a @ e t l . g o . j p
A b s t r a c t
The GDA (Global Document Annotation) project proposes a tag set which allows machines to auto- matically infer the underlying semantic/pragmatic structure of documents. Its objectives are to pro- mote development and spread of N L P / A I applica- tions to render GDA-tagged documents versatile and intelligent contents, which should nmtivate W W W (World Wide Web) users to tag their documents as part of content authoring. This paper discusses au- tomatic text summarization based on GDA. Its main features are a domain/style-free algorithm and per- sonalization on summarization which reflects read- ers' interests and preferences. In order to calcu- late the importance score of a text element, the algorithm uses spreading activation on an intra- document network which connects text elements via thematic, rhetorical, mid coreferential relations. The proposed method is flexible enough to dynamically generate summaries of various sizes. A summary browser supporting personalization is reported as well.
1 I n t r o d u c t i o n
The W W W has opened up an era in which an un- restricted number of people publish their messages electronically through their online documents. How- ever, it is still very hard to automatically process contents of those documents. T h e reasons include the following:
1. H T M L (HyperText Markup Language) tags mainly specify the physical layout of docu- ments. T h e y address very few content-related annotations.
2. Hypertext links cannot very nmch help readers recognize the content of a document.
3. The W W W authors tend to be less careful about wording and readability than in tradi- tional printed media. Currently there is no sys- tematic means for quality control in the W W W .
Although H T M L is a flexible tool that allows you to freely write and read messages on the W W W , it is neither very convenient to readers nor suitable for automatic processing of contents.
We have been developing an integrated platform for document authoring, publishing~ and reuse by combining natural language and W W W technolo- gies. As the first step of our project, we defined a new tag set and developed tools for editing tagged texts and browsing these texts. T h e browser has the functionality of summarization and content-based retrieval of tagged documents.
This paper focuses on summarization based on this system. The main features of our summariza- tion method are a domain/style-free algorithm and personalization to reflect readers" interests and pref- erences. This method naturally outperforms the tra- ditional summarization methods, which just pick out sentences highly scored on the basis of superficial clues such as word count, and so on.
2 G l o b a l D o c u m e n t A n n o t a t i o n
GDA (Global Document Annotation) is a chal- lenging project to make W W W texts machine- understandable on the basis of a new tag set, and to develop content-based presentation, retrieval. question-answering, summarization, and translation systems with much higher quality than before. GDA thus proposes an integrated global platform for elec- tronic content authoring, presentation, and reuse.
The GDA tag set is based on XML (Extensible Markup Language), and designed as compatible as possible with HTML, TEI, EAGLES, and so forth. An example of a GDA-tagged sentence is as follows:
<su><np sem=timeO>time</np> <vp><v s e m = f l y l > f l i e s < / v >
<adp><ad sem=likeO>like</ad> <np>an <n sem=arrowO>arrow</n></np>
</adp></vp>. </su>
<su> means sentential unit.
noun phrase, verb, verb phrase, adnoun or a d v e r b (including preposition and postposition), and ad- nonfinal or adverbial phrase, respectively 1.
T h e G D A initiative aims at having m a n y W W W authors a n n o t a t e their on-line documents with this c o m m o n tag set so t h a t machines can automatically recognize the underlying semantic and p r a g m a t i c structures of those d o c u m e n t s much nmre easily t h a n by analyzing traditional H T M L files. A huge a m o u n t of a n n o t a t e d d a t a is expected to emerge, which should serve not just as tagged linguistic cor- p o r a but also as a worldwide, self-extending knowl- edge base, mainly consisting of examples showing how our knowledge is manifested.
G D A has three main steps:
1. Propose an X M L tag set which allows machines to automatically infer the underlying structure of documents.
2. P r o n m t e development and spread of N L P / A I applications to turn tagged texts to versatile and intelligent contents.
3. Motivate thereby the authors of W W W files to a n n o t a t e their d o c u m e n t s using those tags. 2.1 T h e m a n t i c / R h e t o r i c a l R e l a t i o n s
T h e t e l a t t r i b u t e encodes a relationship in which the current element stands with respect to the ele- ment t h a t it semantically depends on. Its value is called a relational term. A relational t e r m denotes a binary relation, which m a y be a thematic role such as agent, patient, recipient, etc., or a rhetorical rela- tion such as cause, concession, etc. Thus we conflate t h e m a t i c roles and rhetorical relations here, because the distinction between t h e m is often vague. For in- stance, c o n c e s s i o n m a y be b o t h intrasentential and intersentential relation.
Here is an example of a r e 1 attribute:
<su ctyp=fd><name rel=agt>Tom</name> <vp>came</vp>. </su>
c t y p = f d means t h a t the first element <name r e l = a g t > T o m < / n a m e > depends on the second element <vp>came</vp>. r e l = a g t means t h a t T o m has the agent role with respect to the event denoted by c a m e .
r e 1 is an open-class a t t r i b u t e , potentially encom- passing all the binary relations lexicalized in nat- ural languages. An exhaustive listing of t h e m a t i c roles and rhetorical relations a p p e a r s impossible, as widely recognized. We are not yet sure a b o u t how 1A more detailed description of the GDA tag set can be found at http ://~w. etl. go. jp/etl/nl/GDA/tagset, html.
m a n y t h e m a t i c roles and rhetorical relations are suf- ficient for engineering applications. However. the a p p r o p r i a t e granulal~ty of classification will be de- termined by the current level of technology.
2.2 A n a p h o r a a n d C o r e f e r e n c e
Each element m a y have an identifier as the value of the i d attribute. Anaphoric expression should have the aria a t t r i b u t e with its a n t e c e d e n t ' s i d value. An example follows:
<name id=l>John</name> beats <adp ana=l>his</adp> dog.
A non-anaphoric coreference is m a r k e d by the c r f attribute, whose usage is the same as the a n a at- tl~bute.
When the coreference is at the level of t y p e (kind. sort, etc.) which the referents of the antecedent and the a n a p h o r are tokens of, we use the c o t y p a t t r i b u t e as below:
You bought <np id=ll>a car</np>. I bought <np cotyp=ll>one</np>, too.
A zero a n a p h o r a is encoded by using the appro- priate relational t e r m as an a t t r i b u t e name with the referent's id value. Zero anaphors of compulsory el- ements, which describe the internal structure of the events represented by the verbs of adjectives are re- quired to be resolved. Zero a n a p h o r s of optional ele- ments such as with reason and means roles m a y not. Here is an e x a m p l e of a zero a n a p h o r a concerning an optional t h e m a t i c role b e n (for beneficiary):
Tom visited <name id=lll>Mary</name>. He <v ben=111>brought</v> a present.
3 T e x t S u m m a r i z a t i o n
As an example of a basic application of GDA, we have developed an a u t o m a t i c text s u m m a r i z a t i o n system. S u m m a r i z a t i o n generally requires deep se- mantic processing and a lot of background knowl- edge. However, nmst previous works use several su- perficial clues and heuristics on specific styles or con- figurations of d o c u m e n t s to summarize.
For example, clues for determining the i m p o r t a n c e of a sentence include (1) sentence length, (2) key- word count, (3) tense, (4) sentence type (such as fact, conjecture and assertion), (5) rhetorical rela- tion (such as reason and example), and (6) position of sentence in the whole text. Most of these are ex- tracted by a shallow processing of the text. Such a c o m p u t a t i o n is rather robust.
a c c o r d i n g to t h e score, a n d s i m p l y p u t t h e s e l e c t e d s e n t e n c e s t o g e t h e r in o r d e r of t h e i r o c c u r r e n c e s in t h e o r i g i n a l d o c u m e n t . In a sense, t h e s e s y s t e m s a r e successful e n o u g h to be p r a c t i c a l , a n d a r e b a s e d on r e l i a b l e t e c h n o l o g i e s . However, t h e q u a l i t y of s u m - m a r i z a t i o n c a n n o t b e i m p r o v e d b e y o n d this b a s i c level w i t h o u t a n y d e e p c o n t e n t - b a s e d p r o c e s s i n g .
W e p r o p o s e a new s u m m a r i z a t i o n m e t h o d b a s e d on G D A . T h i s m e t h o d e m p l o y s a s p r e a d i n g a c t i v a - tion t e c h n i q u e ( H a s i d a et al., 1987) to c a l c u l a t e t h e i m p o r t a n c e v a l u e s of e l e m e n t s in t h e t e x t . Since t h e m e t h o d d o e s n o t e m p l o y a n y h e u r i s t i c s d e p e n d e n t on t h e d o m a i n a n d s t y l e of d o c u m e n t s , i t is a p p l i c a b l e to a n y G D A - t a g g e d d o c u m e n t s . T h e m e t h o d also can t r i m s e n t e n c e s in t h e s u m m a r y b e c a u s e i m p o r - t a n c e scores a r e a s s i g n e d to e l e m e n t s s m a l l e r t h a n sentences.
A G D A - t a g g e d d o c u m e n t n a t u r a l l y defines a n i n t r a - d o c u m e n t n e t w o r k in which n o d e s corre- s p o n d to e l e m e n t s a n d links r e p r e s e n t t h e s e m a n - tic r e l a t i o n s m e n t i o n e d in t h e p r e v i o u s section. T h i s n e t w o r k c o n s i s t s of s e n t e n c e t r e e s ( s y n t a c t i c h e a d - d a u g h t e r h i e r a r c h i e s of s u b s e n t e n t i a l e l e m e n t s such as w o r d s o r p h r a s e s ) , c o r e f e r e n c e / e m a p h o r a links, d o c u m e n t / s u b d i v i s i o n / p a r a g r a p h n o d e s , a n d r h e t o r i c a l r e l a t i o n links.
F i g u r e 1 shows a g r a p h i c a l r e p r e s e n t a t i o n of t h e i n t r a - d o c u m e n t n e t w o r k .
document
subdivision ~ / ~ v
/l \
paragraph /¢J% U U U U U • * * * (optional) / ~ _
sentence ~ / \ ~ ~ / ~ . . . . n . . . . t
s u b s e n t e n t i a l ( ~ l l ' ~ l l ( ~ 3 ~ (~3 ~ . . . . link
segment j~% "~ ~ /~ -~ . . . . ref .. . .
....
link
F i g u r e 1: I n t r a - D o c u m e n t N e t w o r k
T h e s u m m a l i z a t i o n a l g o r i t h m is t h e following:
1. S p r e a d i n g a c t i v a t i o n is p e r f o r m e d in s u c h a w a y t h a t t w o e l e m e n t s h a v e t h e s a m e a c t i v a - t i o n v a l u e if t h e y a r e c o r e f e r e n t o r One of t h e m is t h e s y n t a c t i c h e a d of t h e o t h e r .
2. T h e u n m a r k e d e l e m e n t w i t h t h e h i g h e s t a c t i v a - t i o n v a l u e is m a r k e d for i n c l u s i o n in t h e s u m - m a r y .
3. W h e n a n e l e m e n t is m a r k e d , o t h e r e l e m e n t s l i s t e d b e l o w a r e r e c u r s i v e l y m a r k e d ms well, u n t i l no m o r e e l e m e n t m a y b e m a r k e d .
• its h e a d • its a n t e c e d e n t
• its c o m p u l s o r y o r a priori i m p o r t a n t
d a u g h t e r s , t h e v a l u e s of w h o s e r e l a t i o n a l a t t r i b u t e s a r e a g t . p a t . o b j . p o s , c n t , c a u , end, sbra, a n d so forth.
• t h e a n t e c e d e n t of a zero a n a p h o r in it w i t h s o m e of t h e a b o v e v a l u e s for t h e r e l a t i o n a l a t t r i b u t e
4. All m a r k e d e l e m e n t s in t h e i n t r a - d o c m n e n t net- w o r k a r e g e n e r a t e d p r e s e r v i n g t h e o r d e r of t h e i r p o s i t i o n s in t h e o r i g i n a l d o c u m e n t .
5. I f a size of t h e s u n n n a r y r e a c h e s t h e user- specified v a l u e , t h e n t e r n f i n a t e ; o t h e r w i s e go b a c k to S t e p 2.
T h e following a r t i c l e of t h e W a l l S t r e e t J o u r n a l w a s u s e d for t e s t i n g t h i s a l g o r i t h m .
During its centennial year. The Wall Street Journal will report events of the past century t h a t stand as milestones of American busi- ness history. T H R E E C O M P U T E R S T H A T C H A N G E D the face of personal computing
were launched in 1977. T h a t year the Ap-
ple II. Commodore Pet and ' r a n d y TRS came to market. The computers were crude by to- day's stmldards. Apple II owners, for exam- ple. had to use their television sets as screens and stored d a t a on audiocassettes. But A p p l e II was a m a j o r advance from A p p l e I, which was built in a garage by Stephen Wozniak and Steven Jobs for hobbyists such as the Home- brew Computer Club. In addition, the Ap-
ple II was an affordable $1,298. Crude as
[image:3.612.80.283.389.525.2]market. Today. PC shipments annually total some $38.3 billion world-wide.
Here is a short, computer-generated s u m m a r y of this sample article:
T H R E E C O M P U T E R S T H A T
C H A N G E D the face of personal computing were launched. Crude as they were, these early PCs triggered explosive p r o d u c t de- velopment. C u r r e n t P C s are more t h a n 50 times faster and have m e m o r y capacity 500 times greater t h a n their counterparts.
T h e proposed m e t h o d is flexible enough to dy- nmnically generate summaries of various sizes. If a longer s u m m a r y is needed, the user can change the window size of the s u m m a r y browser, as described in Section 3.1. Then. the sumnlary changes its size to fit into the new window. An example of a longer s u m m a r y follows:
T H R E E C O M P U T E R S T H A T
C H A N G E D the face of personal c o m p u t - ing were launched. T h e Apple II, Com- nlodore P e t and T a n d y T R S came to mar- ket. T h e c o m p u t e r s were crude. Apple II owners had to use their television sets and stored d a t a on audiocassettes. The Ap- ple II was an affordable $1.298. Crude as they were, these early P C s triggered explo- sive product development. T h e new P C s had keyboards and could store a b o u t two pages of d a t a in their memories. C u r r e n t P C s are more t h a n 50 times faster and have memo~T capacity 500 times greater t h a n their counterparts. There were m a n y pi- oneer P C contributors. William G a t e s and Paul Allen developed an early language- housekeeper system, and Gates became an industry billionaire after IBM a d a p t e d one of these versions. IBM didn't offer its first PC.
An observation obtained from this experiment is t h a t tags for coreferences and thematic and rhetori- cal relations are almost enough to make a s u m m a r y . In particular, coreferences and rhetorical relations help s u m m a r i z a t i o n very much.
G D A tags allow us to apply more sophisticated n a t u r a l language processing technologies to come up with b e t t e r summaries. It is straightforward to in- c o r p o r a t e sentence generation technologies to para- phrase parts of the document, rather t h a n just se- lecting or pruning them. Annotations on a n a p h o r a can be exploited to produce context-dependent para-
phrases. Also the s u m m a r y could be itemized to fit in a slide presentation.
3 . 1 S u m m a r y B r o w s e r
We developed a s u m m a r y browser using a Java- capable W W W browser. Figure 2 shows an example screen of the s u m m a r y browser.
1,
....
~!i
During its centennial year The Wall Street Journal will report events ol the past century that stand its milestones of American business history. THREE COMRJTERS THAT CHANGED the ! face of personal computing were launched in | 977. That year the Apple II, Commodore Pet and Tandy TRS came to market. The computers were crude by today's standards. Apple U owners, for ~¢ample, had to use their television sets as scfeens and stored data on i audiocasset t es. But Apple II was a rllajof advance horn Apple I, which was built in a garage by t Stephan Wozniak and Stevan Jobs for hobbyists such as the Homebrew Computer Club+ In
addition, the Apple n was an affordable $ 1 , 2 9 8 . Crude as they were, these early I~:s trl "ggered e~plo~ve product development in desktop models for the home and office_ B/g
mainlrame co~nput ers for business had been around for yeats. But the ~ 1977 PCs-- unlike eadier built-from-kit types such as the Altair, Sol and IMSAI - had keyboards and could store
about t w o pages of data in their memories. Current PCs are more than 50 times faster and t have memory capacity SO0 times greater than their 1977 counteq~acts. There were many
pioneer PC contributors. W~lliam Gates and Paul Allen in 197S devdoged an early language-housek e e p ~ system for PCS, and Gates became an industry billionaire six years alter IBM adapted one of these versions in 1981. Alan F. Sbugart, currently chairman ol' Seagate Technology, led the team that developed the disk drives for PCs. Dennis Hayes and Dale Heatheriagton, t w o Atlanta engineers, were co-devolopef~ of the internal moderns that allow PCs to share data via the telephone. IBM, the w o d d leader in computers, didn't offer its
f~s'lr PC lunta Al/nll~t 1 qR1 =¢ m ~ m nthtl¢ r n m n l n i ~ ~ntmt=~l th~ mlr~at Tnd=u P~ ... ... ... ... . . . ~, THREE" COMPUTERS THAT CHANGED the face of personal computing were launched. Crude as i they were, these early PCs tnggered e~plosive product development. Current PCs aee mote
! than 50 times taster and have memory capacity SO0 times greater than their counterparts.
I
Figure 2: S u m m a r y Browser It has the following functionalities:
1. A screen is divided into three parts (frames). One frame provides a user input form through which you can select d o c u m e n t s and type key- words. T h e other frames are for displaying the original d o c u m e n t and its s u m m a r y .
2. T h e frame for the s u m m a r y text is resizable by sliding the b o u n d a r y with the original doc- u m e n t frame. T h e size of the s u m m a r y frame influences the size of the s u m m a r y itself. T h u s you can see the s u m m a r y in a preferred size and change the size in an easy and intuitive way. 3. T h e f r a m e for the original d o c u m e n t is mouse
sensitive. You can select any element of text in this frame. This function is used for the cus- tomization of the s u m m a r y , as described later. 4. H T M L tags are also handled by the browser.
So, images are viewed and hyperlinks are nian- aged b o t h in the s u m m a r y . If a hyperlink is clicked in the original d o c u m e n t frame, the linked d o c u m e n t a p p e a r s on the same frame. T h e hyperlinks are kept in the summary.
4
P e r s o n a l i z a t i o n
[image:4.612.322.560.133.350.2]cording to the interests or preferences of its reader. Let us refer to the a d a p t a t i o n of the summariza- tion process to a particular user as personalization.
GDA-based s u m m a r i z a t i o n can be easily personal- ized because our m e t h o d is flexible enough to bias a s u m m a r y toward the user's concerns. You can se- lect any elements in the original d o c u m e n t during summarization, to interactively provide information concerning y o u r personal interests.
We have been developing the following techniques for personalized summarization:
• Keyword-based customization
The user can input any words of interest. The system relates those words with those in the d o c u m e n t using cooccurrence statistics ac- quired from a corpus and a dictionary such as WordNet (Miller, 1995). T h e related words in the d o c u m e n t are assigned numeric values t h a t reflect closeness to the input words. These val- ues are used in spreading activation for calcu- lating i m p o r t a n c e scores.
• Interactive custonfization by selecting any ele- ments from a d o c u m e n t
T h e user can m a r k any words, phrases, and sen- tences to be included in the s u m m a r y . T h e sum- m a t t browser allows the user to select those el- ements by pointing devices such as mouse and stylus pen. T h e user can easily select elements by clicking on them. T h e click count corre- sponds to the level of elements. T h a t is, the first click means the word, the second the next larger element containing it, and so on. T h e se- lected elements will have higher activation val- ues in spreading activation.
• Learning user interests by observation of W W W browsing
T h e s u m m m i z a t i o n system can customize the s u m m a r y according to the user without any ex- plicit user inputs. We implemented a learning mechanism for user personalization. T h e mech- anism uses a weighted feature vector. T h e fea- ture corresponds to the category or topic of doc- uments. T h e category is defined according to a W W W directory such as Yahoo. T h e topic is detected using the s u m m a r i z a t i o n technique. Learning is roughly divided into d a t a acquisi- tion and model nmdification. T h e user's behav- ioral d a t a is acquired by detecting her informa- tion access on the W W W . This d a t a includes the time and duration of t h a t information ac- cess and features related to t h a t information. The first step of model modification is to esti- m a t e the degree of relevance between the input
feature vector assigned to the information ac- cessed by the user and the model of the user's interests acquired fl'om previous data. T h e sec- ond step is to adjust the weights of features in the user model.
5 Concluding
R e m a r k s
We have discussed the G D A project, which aims at supporting versatile and intelligent contents. Our focus in the present p a p e r is one of its applications to a u t o m a t i c text summarization. We are evaluating our summarization m e t h o d using online Japanese ar- ticles with G D A tags. We are also extending text summarization to t h a t of hypertext. For example, a s m n m a r y of a h y p e r t e x t d o c u m e n t will include re- cursively embedding linked d o c u m e n t s in summary, which should be useful for encyclopedic entries, too. Future work includes construction of a large-scale G D A corpus a n d system evaluation by open exper- imentation. G D A tools including a tagging editor and a browser will soon be publicly available on the W W W . Our m a i n current concern is interactive and intelligent presentation, as an extension of text sum- marization. This m a y turn out to be a killer appli- cation of GDA. because it does not just presuppose rather small a m o u n t of tagged d o c u m e n t but also makes the effect of tagging i m m e d i a t e l y visible to the author. We hope t h a t our project revolutionize global and intercultural communications.
R e f e r e n c e s
K6iti Hasida, Syun Ishizaki, and Hitoshi Isahara. 1987. A connectionist approach to the generation of abstracts. In Gerard K e m p e n , editor. Natural Language Generation: New Results in Artificial Intelligence, Psychology, and Linguistics, pages 149-156. Martinus Nijhoff.
E d u a r d Hovy and Chin Yew Lin. 1997. A u t o m a t e d text summaxization in S U M M A R I S T . In Proceed- ings o/ A CL Workshop on Intelligent Scalable Text Summarization.
George Miller. 1995. WordNet: A lexical database
for English. Communications of the ACM,
38(11):39-41.
Hideo W a t a n a b e . 1996. A m e t h o d for abstract- ing newspaper articles by using surface clues. In
Proceedings o/ the Sixteenth International Con- ference on Computational Linguistics (COLING-