105 San
AUTOMATIC TEXT SUMMARIZATION USING
FEATURE-BASED FUZZY EXTRACTION
2.0",
Key
Ladda Suanrnali1, Naomie Salim2, Mohammed Salem Binwahlan3 Ivan
rsity, IFaculty of Science and Technology
Suan Dusit Rajabhat University
295 Rajasrima Rd, Dusit, Bangkok, Thailand 10300 ~ce",
2.3Faculty of Computer Science and Infonnation System Universiti Teknologi Malaysia
81310 Skudai, Johor
E-mail: [email protected]@[email protected]
Abstract: Automatic text summarization is to compress the original text into a shorter version and help the user to quickly understand large volumes of infonnation. This paper focuses on the automatic text summarization by sentence extraction with important features based on fuzzy logic. In our experiment, we used 6 test documents in DUC2002 data set. Each document is prepared by preprocessing process: sentence segmentation, tokenization, renlUving Stop Word and stemming Word. Then, we use 8 important features and calculate their score for each sentence. We propose a method using fuzzy logic for sentence extraction and compare our result with the baseline summarizer and Microsoft Word 2007 summarizers. The results show that the highest average precision, recall, and F-mean for the summaries are conducted from fuzzy method.
Keyword: Automatic text summarization, Fuzzy logic, Sentence extraction
1. INTRODUCTION
Automatic text summarization is the summary of the source text by machine to present the most important infonnation in a shorter version of the original text while still keeping its main semantic content and helps the user to quickly understand large volumes of information. Text summarization addresses both the problem of selecting the most important portions of text and the problem of generating coherent summaries. This process is significantly different from that
Jilid 20, Bit. 2 (Disember 2008) Jurnal Teknologi Maklumat
105
AUTOMATIC TEXT SUMMARIZATION USING
FEATURE-BASED FUZZY EXTRACTION
Ladda Suanmali1,Naomie Salim2,Mohammed Salem Binwahlan3 'Faculty of Science and Technology
Suan Dusit Rajabhat University
295 Rajasrima Rd, Dusit, Bangkok, Thailand 10300 2,3Faculty of Computer Science and Information System
Universiti Teknologi Malaysia 81310 Skudai, lohor
E-mail: [email protected]@[email protected]
Abstract: Automatic text summarization is to compress the original text into a shorter version and help the user to quickly understand large volumes of information. This paper focuses on the automatic text summarization by sentence extraction with important features based on fuzzy logic. In our experiment, we used 6test documents in DUC2002 data set. Each document is prepared by preprocessing process: sentence segmentation, tokenization, remuving Stop Word and stemming Word. Then, we use 8 important features and calculate their score for each sentence. We propose a method using fuzzy logic for sentence extraction and compare our result with the baseline summarizer and Microsoft Word 2007 summarizers. The results show that the highest average precision, recall, and F-mean for the summaries are conducted from fuzzy method.
Keyword: Automatic text summarization, Fuzzy logic, Sentence extraction
. 1.INTRODUCTION
Automatic text summarization is the summary of the source text by machine to present the most important information in a shorter version of the original text while still keeping its main semantic content and helps the user to quickly understand large volumes of information. Text summarization addresses both the problem of selecting the most important portions of text and the problem of generating coherent summaries. This process is significantly different from that
106
-
---",.--of human based text summarization since human can capture and relate deep meanings and themes of text documents while automation of such a skill is very difficult to implement. A number of researchers have proposed techniques for automatic text summarization which can be classified into two categories: extraction and abstraction. Extraction summary is a selection of sentences or phrases from the original text with the highest score and put it together to a new shorter text without changing the source text. Abstraction summary method uses linguistic methods to examine and interpret the text. Most of the current automated text summarization system use extraction method to produce summary. Automatic text summarization works best on well-structured documents, such as news, reports, articles and scientific papers.
In this paper, we propose text summarization based on fuzzy logic aided method to extract important sentences as a summary of document. The rest of this paper is organized as follows. Section 2 presents the summarization approach. Section 3 and 4 describes our proposed, followed by experimental design, experimental results and evaluation. Finally, we conclude and suggest future work that can be carried out in Section 5.
2. SUMMARIZAnON APPROACH
Automatic text summarization dates back to the Fifties, when Luhn created the first summarization system [1] in 1958. Rath et al. [2] in 1961 proposed empirical evidences for difficulties inherent in the notion of ideal summary. Both studies used thematic features such as term frequency, thus they characterized by surface-level approaches. In the early 1960s, new approaches called entity-level approaches appeared; the first approach of this kind used syntactic analysis [3]. The location features were used in [4], where key phrases are used dealt with three additional components: pragmatic words (cue words, i.e., words would have positive or negative effect on the respective sentence weight like significant, key idea, or hardly); title and heading words; and structural indicators (sentence location, where the sentences appearing in initial or final of text unit are more significant to include in the summary.
In this paper, we propose important sentence extraction used fuzzy rules and a set for selecting sentences based on their features. Fuzzy set proposed by Zadeh [10] is a mathematical tool for dealing with uncertainty, vagueness and ambiguity. Its application in text representation for information retrieval was first proposed by Buell [11], in which a document can be represented as a fuzzy set of terms. Miyamoto [12] investigated applications of fuzzy set theory in information retrieval and cluster analysis. Witte and Bergler [13] presented a fuzzy-theory
Jilid 20, Bil. 2 (Disember 2008) Jumal Teknologi Maklumat
106
of human based text summarization since human can capture and relate deep meanings and themes of text documents while automation of such a skill is very difficult to implement. A number of researchers have proposed techniques for automatic text summarization which can be classified into two categories: extraction and abstraction. Extraction summary is a selection of sentences or phrases from the original text with the highest score and put it together to a new shorter text without changing the source text. Abstraction summary method uses linguistic methods to examine and interpret the text. Most of the current automated text summarization system use extraction method to produce summary. Automatic text summarization works best on well-structured documents, such as news, reports, articles and scientific papers.
In this paper, we propose text summarization based on fuzzy logic aided method to extract important sentences as a summary of document. The rest of this paper is organized as follows. Section 2 presents the summarization approach. Section 3 and 4 describes our proposed, followed by experimental design, experimental results and evaluation. Finally, we conclude and suggest future work that can be carried out in Section 5.
2. SUMMARIZAnON APPROACH
Automatic text summarization dates back to the Fifties, when Luhn created the first summarization system [1] in 1958. Rath et al. [2] in 1961 proposed empirical evidences for difficulties inherent in the notion of ideal summary. Both studies used thematic features such as term frequency, thus they characterized by surface-level approaches. In the early 1960s, new approaches called entity-level approaches appeared; the first approach of this kind used syntactic analysis [3]. The location features were used in [4], where key phrases are used dealt with three additional components: pragmatic words (cue words, i.e., words would have positive or negative effect on the respective sentence weight like significant, key idea, or hardly); title and heading words; and structural indicators (sentence location, where the sentences appearing in initial or final of text unit are more significant to include in the summary.
In this paper, we propose important sentence extraction used fuzzy rules and a set for selecting sentences based on their features. Fuzzy set proposed by Zadeh [10] is a mathematical tool for dealing with uncertainty, vagueness and ambiguity. Its application in text representation for information retrieval was first proposed by Buell [11], in which a document can be represented as a fuzzy set of terms. Miyamoto [12] investigated applications of fuzzy set theory in information retrieval and cluster analysis. Witte and Bergler [13] presented a fuzzy-theory
107
I
, based approach to co-reference resolution and its application to text summarization. Automatic determination of co-reference between noun phrases is fraught with uncertainty. Kiani and Akbarzadeh [15] proposed technique for summarizing text using combination of Genetic Algorithm (GA) and Genetic Programming (GP) to optimize rule sets and membership function ; offuzzy systems.
The feature extraction techniques are used to locate the important sentences in the text. •For instance, Luhn looked at the frequency of word distributions as frequent words should indicate the most important concepts of the document. Some of features are used in this research such as sentence length. Some sentences are short or some sentences are long. What is .' clear is that some of the attributes have more importance and some have less and so they should have balance weight in computations and we use fuzzy logic to solve this problem by defining the membership functions for each feature.
3. EXPERIMENT 3.1 Data Set
We used 6 documents from DUC2002. Each document consists of 16 to 56 sentences with an , average of 31 sentences. The DUC2002 collection provided [10]. Each document in DUC2002 collection is supplied with a set of human-generation summaries provided by two different experts. While each expert was asked to generate summaries of different length, we use only loo-word variants. DUC2002 for automatic single-document summarization create a generic
Currently, input document are of plain text format. In this paper, we use Microsoft Visual C# 2008 for preprocessing data. There are four main activities performed in this stage: Sentence Segmentation, Tokenization, Removing Stop Word, and Stemming Word. Sentence ,segmentation is boundary detection and separating source text into sentence. Tokenization is ,separating the input document into individual words. Next, Removing Stop Words, stop words are the words which appear frequently in document but provide less meaning in identifying the important content of the document such as 'a', 'an', 'the', etc.. The last step for preprocessing is ,Stemming word; Stemming word is the process ofremoving prefixes and suffixes of each word.
:Jilid 20, Bil. 2 (Disember 2008) Jurnal Teknologi Maklumat 107 based approach to co-reference resolution and its application to text summarization. Automatic determination of co-reference between noun phrases is fraught with uncertainty. Kiani and Akbarzadeh [15] proposed technique for summarizing text using combination of Genetic Algorithm (GA) and Genetic Programming (GP) to optimize rule sets and membership function offuzzysystems.
The feature extraction techniques are used to locate the important sentences in the text. For instance, Luhn looked at the frequency of word distributions as frequent words should indicate the most important concepts of the document. Some of features are used in this research such as sentence length. Some sentences are short or some sentences are long. What is clear is that some of the attributes have more importance and some have less and so they should have balance weight in computations and we use fuzzy logic to solve this problem by defining the membership functions for each feature.
3. EXPERIMENT 3.1 Data Set
We used 6 documents from DUC2002. Each document consists of 16 to 56 sentences with an average of 31 sentences. The DUC2002 collection provided [10]. Each document in DUC2002 collection is supplied with a set of human-generation summaries provided by two different .experts. While each expert was asked to generate summaries of different length, we use only loo-word variants. DUC2002 for automatic single-document summarization create a generic loo-word summary.
3.2 Preprocessing
~Currently, input document are of plain text format. In this paper, we use Microsoft Visual C#
!
i
2008 for preprocessing data. There are four main activities performed in this stage: Sentence,
tSegmentation, Tokenization, Removing Stop Word, and Stemming Word. Sentence
i
segmentation is boundary detection and separating source text into sentence. Tokenization is .separating the input document into individual words. Next, Removing Stop Words, stop words are the words which appear frequently in document but provide less meaning in identifying the :important content of the document such as 'a', 'an', 'the', etc.. The last step for preprocessing is Stemming word; Stemming word is the process of removing prefixes and suffixes of each word.108
3.3 Features in Text Summarization
In order to use a statistical method it is necessary to represent the sentences as vectors of features. These features are attributes that attempt to represent the data used for the task. We concentrate our presentation in eight features for each sentence. Each feature is given a value between '0' and 'I'. Therefore, we can extract the appropriate number of sentences according to compression rate. There are eight features as follows:
(I) Title feature: The number of title word in sentence, words in sentence that also occur in title gives high score [6]. This is detennined by counting the number of matches between the content words in a sentence and the words in the title. We calculate the score for this feature which is the ratio of the number of words in sentence that occur in the title over the number of word in title.
Score (SJ = No. Title word in S1 (I)
No. Word in Title
(2) Sentence length: The number of word in sentence, this feature is useful to filtering out short sentences such as datelines and author names commonly found in news articles. The short sentences are not expected to belong to the summary [5]. We use nonnalized length of the sentence, which is the ratio of the number of words occurring in the sentence over the number of words occurring in the longest sentence of the document.
Score (SJ = No. Word occurring in S1 (2)
No. Word occurring in longest sentence
(3) Term weight: Calculating the average of the TF-ISF (Tennfrequency, Inverse sentence frequency). The frequency of tenn occurrences within a document has often been used for calculating the importance of sentence [7].
Score (SJ = Sum ofTF-ISF in St (3)
Max(Sum ofTF-ISF)
(4) Sentence position: Whether it is the first and last Sentence in the paragraph, sentence position in text gives the importance of the sentences. This feature can involve several items
Jilid 20, Bit. 2 (Disember 2008) Jurnal Teknologi Maklumat
such as the position of and last sentence highe other sentence.
Score (SJ = 1
fi
OJ
(5) Sentence to sentel similarity between s an score of this feature f( sentence similarity of Sl
Score (SJ = Sun
Max(
(6) Proper noun: The (proper noun). Usually is most probably incl calculated as the ratio c
Score (SJ = No
j
(7) Thematic word: because terms that OCCI
thematic words indicat frequent content word the ratio of the number
Score (SJ = Iv
(8) Numerical data: numerical data is impc score for this feature i~ the sentence length Jilid 20, Bil. 2 (Disembe~ 108
3.3 Features in Text Summarization
In order to use a statistical method it is necessary to represent the sentences as vectors of features. These features are attributes that attempt to represent the data used for the task. We concentrate our presentation in eight features for each sentence. Each feature is given a value between '0' and '1'. Therefore, we can extract the appropriate number of sentences according to compression rate. There are eight features as follows:
(I)Title feature: The number of title word in sentence, words in sentence that also occur in title gives high score [6]. This is determined by counting the number of matches between the content words in a sentence and the words in the title. We calculate the score for this feature which is the ratio of the number of words in sentence that occur in the title over the number of word in title.
Score (SJ =No. Title word inS1
No. Word in Title
(I)
(2) Sentence length: The number of word in sentence, this feature is useful to filtering out short sentences such as datelines and author names commonly found in news articles. The short sentences are not expected to belong to the summary [5]. We use normalized length of the sentence, which is the ratio of the number of words occurring in the sentence over the number of words occurring in the longest sentence of the document.
Score (SJ = No. Word occurring inSI
No. Word occurring in longest sentence
(2)
(3) Term weight: Calculating the average of the TF-ISF (Term frequency, Inverse sentence frequency). The frequency of term occurrences within a document has often been used for calculating the importance of sentence [7].
Score (SJ = Sum ofTF-ISF inSf
Max(Sum ofTF-ISF)
(3)
(4) Sentence position: Whether it is the first and last sentence in the paragraph, sentence position in text gives the importance of the sentences. This feature can involve several items
e
e g n )f"
109such as the position of a sentence in the document, section, paragraph, etc., [14] proposed first and last sentence highest ranking. The score for this feature: 1 for first and last sentence, 0 for other sentence.
Score (SJ
=
1for First and Last sentence,Dfor other sentences (4)
(5) Sentence. to sentence similarity: Similarity between sentences, for each sentence s, the similarity between s and each other sentence is computed by the cosine similarity measure. The score of this feature for a sentence s is obtained by computing the ratio of the summary of sentence similarity of sentence s with each other sentence over the maximum of summary
Score (SJ = Sum ofSentemce Similarity in S;
Max(Sum ofSentence Similarity)
(5)
(6) Proper noun: The number of proper noun in sentence, sentence inclusion of name entity (proper noun). Usually the sentence that contains more proper nouns is an important one and it is most probably included in the document summary [17]. The score for this feature is calculated as the ratio of the number of proper nouns in sentence over the sentence length.
e
r Score (SJ = No. Proper nouns in SI
Length (SJ
(6)
r
(7)
Score (SJ = No. Thematic word in St_
Length (SJ
(7) Thematic word: The number of thematic word in sentence, this feature is important because terms that occur frequently in a document are probably related to topic. The number of thematic words indicates the words with maximum possible relativity. We used the top 10 most frequent content word for consideration as thematic. The score for this feature is calculated as the ratio of the number of thematic words in sentence over the sentence length
(8) Numerical data: The number of numerical data in sentence, sentence that contains numerical data is important and it is most probably included in the document summary [16].The score for this feature is calculated as the ratio of the number of numerical data in sentence over the sentence length
Jilid 20, Bil. 2 (Disember 2008) Jurnal Teknologi Maklumat
109 such as the position of a sentence in the document, section, paragraph, etc., [14] proposed first and last sentence highest ranking. The score for this feature: 1 for first and last sentence, 0 for other sentence.
Score (SJ
=
1for First and Last sentence,Dfor other sentences (4)
(5) Sentence. to sentence similarity: Similarity between sentences, for each sentence s, the similarity between s and each other sentence is computed by the cosine similarity measure. The score of this feature for a sentence s is obtained by computing the ratio of the summary of sentence similarity of sentenceswith each other sentence over the maximum of summary
Score (SJ
=
Sum ofSentemce Similarity inS;Max(Sum ofSentence Similarity)
(5)
(6) Proper noun: The number of proper noun in sentence, sentence inclusion of name entity (proper noun). Usually the sentence that contains more proper nouns is an important one and it is most probably included in the document summary [17]. The score for this feature is calculated as the ratio of the number of proper nouns in sentence over the sentence length.
Score (SJ = No. Proper nouns inS!
Length (SJ
(6)
(7) Thematic word: The number of thematic word in sentence, this feature is important because terms that occur frequently in a document are probably related to topic. The number of thematic words indicates the words with maximum possible relativity. We used the top 10 most frequent content word for consideration as thematic. The score for this feature is calculated as the ratio of the number of thematic words in sentence over the sentence length
Score (SJ = No. Thematic word in St_
Length (SJ
(7)
(8) Numerical data: The number of numerical data in sentence, sentence that contains numerical data is important and it is most probably included in the document summary [1 6].The score for this feature is calculated as the ratio of the number of numerical data in sentence over the sentence length
llO
Score (SJ
=
No. Numerical data in St_ (8)Length (SJ
3.4 Text Summarization based on Fuzzy Logic
In order to implement text summarization based on fuzzy logic, we use MATLAB since it is possible to simulate fuzzy logic in this software. First, the features extracted in previous section are used as input to the fuzzy inference system. We used Bell membership functions. The generalized Bell membership function depends on three parameters a, b, and c as given by (9)
1
f
(Xi a , b,c)=
IK-'J:b-(9)
l + - j
/I •
where the parameter b is usually positive. The parameter c and a, locate the center and width of the curve.
For instance, membership function of sentence to sentence similarity is show in Figure 1.
IeSSSupilarnv ave.~
I
-
,
IFigure I. Membership function of sentence to sentence similarity
Afterword, we use fuzzy logic to summarize the document. A value from zero to one is obtained for each sentence in the output based on sentence characteristics and the available rules in the knowledge base. The obtained value in the output determines the degree of the importance of the sentence in the final summary.
0
-The input mem1
pIe, membership fill HighSimilarity}. L
ted from these rule,
IF (NoWordIll very much) an is highSimilar many) and (Nl ALUAnON AND the ROUGE, a se ion toolkit [8] th of ROUGE, whic y, at a confidence with human asses1
Jilid 20, Bil. 2 (Disember 2008) Jumal Teknologi Mak1umat
110
Score(SJ
=
No. Numerical data inSt_Length(SJ
3.4 Text Summarization based on Fuzzy Logic
(8)
In order to implement text summarization based on fuzzy logic, we use MATLAB since it is possible to simulate fuzzy logic in this software. First, the features extracted in previous section are used as input to the fuzzy inference system. We used Bell membership functions. The generalized Bell membership function depends on three parametersa,b, and c as given by(9)
1
f
(XjQ,b,c)=
.bI
X-'j'
1 + -/I(9)
where the parameterbis usually positive. The parametercanda. locate the center and width of the curve.
For instance, membership function of sentence to sentence similarity is show in Figure 1.
!essSi/hila,ny avel'.lge
.·f.-
r-~~\
(.J\~~~'
.' ,
Figure 1.Membership function of sentence to sentence similarity
Afterword, we use fuzzy logic to summarize the document. A value from zero to one is obtained for each sentence in the output based on sentence characteristics and the available rules in the knowledge base. The obtained value in the output determines the degree of the importance of the sentence in the final summary.
III
II
1
impo art
I:' 01 02 (13 O~ u.s (I'; 0: !H; 09
cdp..t Y8riable "OuIpd"
Figure 2. Membership function of Output
The input membership function for each feature is divided into three membership functions which are composed of insignificant values, average and significant values. For example, membership functions for title feature: SetenenceSimilarity {LessSimilarity, Average, and HighSimilarity}. Likewise, the output membership function is divided into three membership functions: Output {Unimportant, Average, and Important}. The most important
f
part in this procedure is the definition of fuzzy IF-THEN rules. The important sentences are extracted from these rules according to our features criteria. For example our rules are showed as follow.
IF (No WordIn Title is many) and (SentenceLength is long) and (TermFreq is very much) and (SentencePosition is first-last position) and (SentenceSimilarity is highSimilarity) and (NoProperNoun is many) and (No Thematic Word is many) and (NumbericalData is many) THEN (Sentence is important)
Figure 3. Sample ofIF-THEN Rules
! 4. EVALUATION AND RESULT
We use the ROUGE, a set of metrics called Recall-Oriented Understudy for Gisting Evaluation, evaluation toolkit [8] that has become standards of automatic evaluation of summaries. It
compares the summaries generated by the program with the human-generated (gold standard) summaries. For comparison, it uses n-gram statistics. Our evaluation was done using n-gram setting of ROUGE, which was found to have the highest correlation with human judgments, namely, at a confidence level of 95%. It is claimed that ROUGE-I consistently correlates highly with human assessments and has high recall and precision significance test with manual
Jilid 20, Bil. 2 (Disember 2008) Jumal Teknologi Maklumat
III 1 itnpO art average
I
c,====::=::::::::=:::L:::",~...i-'
-=:'
:..:::-;-~---,---I
c' 1)1 02 03 O~ u.s 0'3 0: IH; 09 rdp..tY8riable "Outplj"Figure 2. Membership function of Output
The input membership function for each feature is divided into three membership functions which are composed of insignificant values, average and significant values. For example, membership functions for title feature: SetenenceSimilarity {LessSimilarity, Average, and HighSimilarity}. Likewise, the output membership function is divided into three membership functions: Output {Unimportant, Average, and Important}. The most important part in this procedure is the definition of fuzzy IF-THEN rules. The important sentences are extracted from these rules according to our features criteria. For example our rules are showed as follow.
IF (No WordIn Title is many) and (SentenceLength is long) and (TermFreq is very much) and (SentencePosition is first-last position) and (SentenceSimilarity is highSimilarity) and (NoProperNoun is many) and (No Thematic Word is many) and (NumbericalData is many) THEN (Sentence is important)
Figure 3. Sample of IF-THEN Rules
! 4. EVALUATION AND RESULT
We use the ROUGE, a set of metrics called Recall-Oriented Understudy for Gisting Evaluation, evaluation toolkit [8] that has become standards of automatic evaluation of summaries. It
compares the summaries generated by the program with the human-generated (gold standard) summaries. For comparison, it uses n-gram statistics. Our evaluation was done using n-gram setting of ROUGE, which was found to have the highest correlation with human judgments, namely, at a confidence level of 95%. It is claimed that ROUGE-I consistently correlates highly with human assessments and has high recall and precision significance test with manual
•
112
r -evaluation results. So we choose ROUGE-I as the measurement of our experiment results. In
o.n
O.G( the table I, we compare fuzzy summarizer with baseline summarizer form DUC2002 data set
0.5( 0.4(
and Microsoft Word 2007 Summarizer.
0.3( 0.2( O.l( 0.0(
Table I. The result of comparing Fuzzy Summarizer and other Summarizers using Document set 0061
Fu Sumarizer Baseline MS-Word Summarizer
Document
p R F P R F P R F
AP880911-0016 0.59223 0.60396 0.59804 0.41748 0.40952 0.41346 0.55556 0.42857 0.483 Figure 5. Rec AP880912-0095 0.45484 0.48001 0.47607 0.43636 0.41379 0.42478 0.49231 0.44545 0.41 AP880912-0137 0.48039 0.47573 0.47805 0.46602 0.47059 0.46829 0.47525 0.47525 0.470 AP880915-0003 0.49038 0.48571 0.48803 0.44330 0.40952 0.42574 0.48571 0.48113 0.483
I
0.70 AP880916-0060 0.50816 0.46714 0.48095 0.32642 0.32642 0.32222 0.31148 0.33929 0.324 0.60 , 0.50 WSJ880912-0064 0.49524 0.51485 0.50485 0.49515 0.50495 0.50000 0.44231 0.42593 0.433 ' 0.40 Avera e 0.50354 0.50457 0.50433 0.43079 0.42247 0.42575 0.46044 0.43260 0.435 0.30 \ 1 0.20 0.10The results are show in Table I. Baseline reaches an average precision of 0.43079, 0.00 average recal1 of 0.42247 and average F-mean of 0.42575; while Microsoft Word 2007
summarizer reaches an average precision 0.46044, recal1 of 0.43260 and F-mean of 0.43555. The fuzzy summarizer achieves an average precision of 0.50354, recal1 of 0.50457 and F-mean of 0.50433. Figure 6. F-m 0.70000 0.60000 0.50000 The results cl 0.40000
0.30000 _ .•. - Fuzzy better than baseline s,
0.20000
... Baseline
0.10000 performance of the fu
0.00000 - W o r d
and f-mean results. II ~ p,'? ~ ,s,'" !SJ &>~
..rS' ~ ~...
?
J9 .
.,p
shows that the judge5<;:,0,... <;:,0, <;:,0, ~o, <;:,o,~ <;:,0,....
"-~ '0 '0 '0 '0 '0
~v <:j,'O <:j,~ <:j,'O <:j,'O ....~ 0.59223, 0.60396, at
~ ~ ~ ~ ~~7
provides strong evide Figure 4. Precision result under difference summarizer using Document Set 0061
1
Jilid 20, Bil. 2 (Disember 2008) Jumal Teknologi Maklumat Jilid 20, Bil. 2 (Disem~
112
evaluation results. So we choose ROUGE-I as the measurement of our experiment results. In the table I, we compare fuzzy summarizer with baseline summarizer fonn DUC2002 data set and Microsoft Word 2007 Summarizer.
Table I. The result of comparing Fuzzy Summarizer and other Summarizers using Document set 0061
Document Fu~zvSumarizer Baseline MS-Word Summarizer
p R F P R F P R F AP880911-0016 0.59223 0.60396 0.59804 0.41748 0.40952 0.41346 0.55556 0.42857 0.483 AP880912-0095 0.45484 0.48001 0.47607 0.43636 0.41379 0.42478 0.49231 0.44545 0.416E AP880912-0137 0.48039 0.47573 0.47805 0.46602 0.47059 0.46829 0.47525 0.47525 0.470 AP880915-0003 0.49038 0.48571 0.48803 0.44330 0.40952 0.42574 0.48571 0.48113 0.483 AP880916-0060 0.50816 0.46714 0.48095 0.32642 0.32642 0.32222 0.31148 0.33929 0.324 WSJ880912-0064 0.49524 0.51485 0.50485 0.49515 0.50495 0.50000 0.44231 0.42593 0.433 Average 0.50354 0.50457 0.50433 0.43079 0.42247 0.42575 0.46044 0.43260 0.435
The results are show in Table I. Baseline reaches an average precision of 0.43079, average recall of 0.42247 and average F-mean of 0.42575; while Microsoft Word 2007 summarizer reaches an average precision 0.46044, recall of 0.43260 and F-mean of 0.43555. The fuzzy summarizer achieves an average precision of 0.50354, recall of 0.50457 and F-mean of 0.50433. I I 0.70000
i
0.60000 I'"
I
0.50000~
. ·r·:..···· ... ..'i
0.40000 !l... ... ...I
0.30000 _ .•. - FuzzyI
0.200000.10000 ... BaselineI
0.00000 -WordI
~ p,"> ~#'"
IS> ~I> ..<S>~ ~'"
{>~,;'::J
I
<::10,-" <::10, <::10, <::10, <::Io,~ <;)0,-" ~ '0 ~ '0 '0 ~ I 9,'0 9.'0 9.'0 9,'0 9.'0 b-'O I,.,.,.,.
,.~I
!Figure 4. Precision result under difference summarizer using Document Set 0061
..
113 r---
---I
0.70000 --- ---.- ---JI
0.60000 . . .I
Im~g:~
Ii
0.20000 0.10000 ... Baseline 0.00000 > - - - - . - W o r dFigure 5. Recall result under difference summarizer using Document Set 0061 .
0.70000 0.60000 ,
-g:~~~~~
-~-~~:?--0.20000 Fuzzy 0.10000 -- Baseline 0.00000 --WordFigure 6. F-mean result under difference summarizer using Document Set 0061
The results clearly show that fuzzy summarizer approach under consideration perform better than baseline summarizer and Microsoft Word 2007 summarizer. We further compare the performance of the fuzzy summarizer and other summarizer by examining their precision, recall and f-mean results. In this case, the best precision, recall and f-mean from Figure 4, 5, and 6 shows that the judges from fuzzy summarizer are the highest score. The score are as followed: 0.59223, 0.60396, and 0.59804. The significant performance improvement over fuzzy logic provides strong evidence of its feasibility in text summarization applications
Jilid 20, Bil. 2 (Disember 2008) lurnal Teknologi Maklumat
... Baseline - . • . - FUllY r -"- - --"--""
---I
0.70000 -- ----I
0.60000 . . .Ig;~~···~···
0.20000 0.10000 -." 0.00000 "----JI
Ii
113 - W o r dFigure 5. Recall result under difference summarizer using Document Set 0061 .
0.70000 0.60000 . ,
-~:~~ggg -~~V-"
0.20000 0.10000 0.00000 Fuzzy -- Basclirlc ---.-- WordFigure 6. F-mean result under difference summarizer using Document Set 0061
The results clearly show that fuzzy summarizer approach under consideration perform better than baseline summarizer and Microsoft Word 2007 summarizer. We further compare the performance of the fuzzy summarizer and other summarizer by examining their precision, recall and f-mean results. In this case, the best precision, recall and f-mean from Figure4, 5, and6 shows that the judges from fuzzy summarizer are the highest score. The score are as followed: 0.59223, 0.60396, and 0.59804. The significant performance improvement over fuzzy logic provides strong evidence of its feasibility in text summarization applications
.,
114
5. CONCLUSION AND FUTURE WORK [10] L. Zadeh., "Fuz
[11] D. Buell., "An In this paper, we propose automatic text summarization for important sentence extraction with systems," FUZZ) important features based on fuzzy logic; title feature, sentence length, term weight, sentence [12] S. Miyamoto., position, sentence to sentence similarity, proper noun, thematic word and numerical data. We Academic Publi choose 6 documents from DUC2002 data set and compare our summarizer with the baseline [13] R. Witte and summarizer and Microsoft Word 2007 summarizers. The results show that the judge gave a Proceedings of better average precision, recall and f-mean to summaries produced by fuzzy method. Our Applications to method is intent to be used for single document summarization as well as multi documents UniversitA Ca' I summarization. We conclude that we need to extend the proposed method for multi document [14] Louisa Ferrier., summarization and combine fuzzy logic and other learning methods in a large data set. Artificial Intellij [15] Arman Kiani a:
REFERENCES Fuzzy GA-GP;
Systems, Shera 983.2006.
'/' .
[1] H. P. Luhn., "The Automatic Creation of Literature Abstracts," IBM Journal of Research
and Development, vol. 2, pp.159-165. 1958. [16] c.Y. Lin., "Tra
international cc [2] G. 1. Rath, A. Resnick, and T. R. Savage., "The formation of abstracts by the selection of
Missouri, Unitel [3]
sentences," American Documentation, vol. 12, pp.139- 143.1961. Inderjeet Mani and Mark T. Maybury, editors., "Advances summarization," MIT Press. 1999.
in automatic text [17]
1. Kupiec. , 1. Proceedings of Development in [4] H. P. Edmundson., "New methods in automatic extracting," Journal of the Association
for Computing Machinery 16 (2). pp.264-285.l969.
[5] S. D. Afantenos, V. Karkaletsis and P. Stamatopoulos., "Summarization from Medical Documents: A Survey," Artificial Intelligence in Medicine, vol. 33, pp.157-177. 2005. [6] G. Salton, C. Buckley., "Term-weighting approaches in automatic text retrieval,"
Information Processing and Management 24, 1988.513-523. Reprinted in: Sparck-Jones, K.; Willet, P. (eds.) Readings in I.Retrieval. Morgan Kaufmann. pp.323-328.1997. [7] G. Salton., "Automatic Text Processing: The Transformation, Analysis, and Retrieval of
Information by Computer," Addison-Wesley Publishing CoJnpany. 1989.
[8] C.Y. Lin., "ROUGE: A Package for Automatic Evaluation of Summaries," In Proceedings of Workshop on Text Summarization of ACL, .Spain. 2004.
[9] DUC. Document understanding conference 2002 (2002), http://www-nlpir.nist.gov/projects/duc
Jilid 20, Bil. 2 (Disember 2008)
.. JJI. .
ntr
lumal Teknologi Maklumat Jilid 20, Bil. 2 (Disembl 1145. CONCLUSION AND FUTURE WORK
In this paper, we propose automatic text summarization for important sentence extraction with important features based on fuzzy logic; title feature, sentence length, term weight, sentence position, sentence to sentence similarity, proper noun, thematic word and numerical data. We choose 6 documents from DUC2002 data set and compare our summarizer with the baseline summarizer and Microsoft Word 2007 summarizers. The results show that the judge gave a better average precision, recall and f-mean to summaries produced by fuzzy method. Our method is intent to be used for single document summarization as well as multi documents summarization. We conclude that we need to extend the proposed method for multi document summarization and combine fuzzy logic and other learning methods in a large data set.
REFERENCES
[l] H.P. Luhn., "The Automatic Creation of Literature Abstracts," IBM Journal of Research and Development, vol. 2, pp.159-165. 1958.
[2] G.1. Rath, A. Resnick, and T. R. Savage., "The formation of abstracts by the selection of sentences," American Documentation, vol. 12, pp.139- 143.1961.
[3] Inderjeet Mani and Mark T. Maybury, editors., "Advances in automatic text summarization," MIT Press. 1999.
[4] H. P. Edmundson., "New methods in automatic extracting," Journal of the Association for Computing Machinery 16 (2). pp.264-285.1969.
[5] S. D. Afantenos, V. Karkaletsis and P. Stamatopoulos., "Summarization from Medical Documents: A Survey," Artificial Intelligence in Medicine, vol. 33, pp.157-177. 2005. [6] G. Salton, C. Buckley., "Term-weighting approacbes in automatic text retrieval,"
Information Processing and Management 24, 1988.513-523. Reprinted in: Sparck-Jones, K.;Willet, P. (eds.) Readings in LRetrieval. Morgan Kaufmann. pp.323-328.1997. [7] G. Salton., "Automatic Text Processing: The Transformation, Analysis, and Retrieval of
Information by Computer," Addison-Wesley PublisbingCompany. 1989.
[8] c.Y. Lin., "ROUGE: A Package for Automatic Evaluation of Summaries," In Proceedings of Workshop on Text Summarization of ACL, .Spain. 2004.
[9] DUC. Document understanding conference 2002 (2002), http://www-nlpir.nist.gov/projects/duc
115
[10] L. Zadeh., "Fuzzy sets. Information Control," vol. 8, pp.338-353.I965.
[11] D. Buell., "An analysis of some fuzzy subsets application to information retrieval systems," Fuzzy Sets and Systems, vol. 7, no. I, pp.35-42.I982.
[12] S. Miyamoto., "Fuzzy Sets in Information Retrieval and Cluster Analysis," Kluwer Academic Publishers, 1990.
[13] R. Witte and S. Bergler., "Fuzzy coreference resolution for summarization," In Proceedings of 2003 International Symposium on Reference Resolution and Its Applications to Question Answering and Summarization (ARQAS). Venice, Italy: Universita Ca' Foscari. pp.43-50. 2003. http://rene-witte.net.
[14] Louisa Ferrier., "A Maximum Entropy Approach to Text Summarization," School of Artificial Intelligence, Division of Informatics, University of Edinburgh, 200 I.
[15] Arman Kiani and M.R. Akbarzadeh., "Automatic Text Summarization Using: Hybrid Fuzzy GA-GP," In Proceedings of 2006 IEEE International Conference on Fuzzy Systems, Sheraton Vancouver Wall Center Hotel, Vancouver, BC, Canada. pp.977-983.2006.
[16] C.Y. Lin., "Training a selection function for extraction," In Proceedings of the eighth international conference on Information and knowledge management, Kansas City, Missouri, United States. pp.55-62. 1999.
[17] J. Kupiec. , J. Pedersen, and F. Chen., "A Trainable Document Summarizer," In Proceedings of the Eighteenth Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), Seattle, WA, pp.68-73.1995.
Jilid 20, Bi!. 2 (Disember 2008) Jurnal Teknologi Maklumat
115
[10] L. Zadeh., "Fuzzy sets. Information Control," vol. 8, pp.338-353.1965.
[11] D. Buell., "An analysis of some fuzzy subsets application to information retrieval systems," Fuzzy Sets and Systems, vol. 7, no. 1, pp.35--42.1982.
[12] S. Miyamoto., "Fuzzy Sets in Information Retrieval and Cluster Analysis," Kluwer Academic Publishers, 1990.
[13] R. Witte and S. Bergler., "Fuzzy coreference resolution for summarization," In Proceedings of 2003 International Symposium on Reference Resolution and Its Applications to Question Answering and Summarization (ARQAS). Venice, Italy: Universita Ca' Foscari. pp.43-50. 2003. http://rene-witte.net.
[14] Louisa Ferrier., "A Maximum Entropy Approach to Text Summarization," School of Artificial Intelligence, Division of Informatics, University of Edinburgh, 2001.
[15] Arman Kiani and M.R. Akbarzadeh., "Automatic Text Summarization Using: Hybrid Fuzzy GA-GP," In Proceedings of 2006 IEEE International Conference on Fuzzy Systems, Sheraton Vancouver Wall Center Hotel, Vancouver, BC, Canada. pp.977-983.2006.
[16] c.Y. Lin., "Training a selection function for extraction," In Proceedings of the eighth international conference on Information and knowledge management, Kansas City, Missouri, United States. pp.55-62. 1999.
[17] 1. Kupiec. , J. Pedersen, and F. Chen., "A Trainable Document Summarizer," In Proceedings of the Eighteenth Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), Seattle, WA, pp.68-73.1995.