Domain Independent Framework for Automatic Text Summarization

(1)

Procedia Computer Science 48 ( 2015 ) 722 – 727

Peer-review under responsibility of scientific committee of International Conference on Computer, Communication and Convergence (ICCC 2015) doi: 10.1016/j.procs.2015.04.207

ScienceDirect

International Conference on Intelligent Computing, Communication & Convergence

(ICCC-2014)

Conference Organized by Interscience Institute of Management and Technology,

Bhubaneswar, Odisha, India

Domain Independent Framework for Automatic Text

Summarization

Yogesh Kumar Meena

a,

*, Dinesh Gopalani

b

a,b_{Malaviya National Institute of Technology, JLN Marg, Jaipur, 302017, India}\

Abstract

Due to the exponential growth of documents on internet, users want all the relevant data at one place without any hassle. This led to the growth of Automatic Text Summarization. For this purpose a number of methods have been proposed by researchers but no method is able to work on all domains of text documents. Some methods which work for News domain may fail in Medical domain to give efficient results. In this paper we proposed a domain independent framework for Automatic Text Summarization. The Process first categorises the source text and then it applies the respective category’s optimal set of rules or weights or method. The major advantage of framework is that it can be applicable for both extractive and abstractive text summarization.

Selection and peer-review under responsibility of scientific committee of Missouri University of Science and Technology.

Keywords: Abstractive; Clustering; Statistical; Extractive; Summarization; Term Frequency;

* Corresponding author. Tel.: +919461306647; fax: +01412529174 E-mail address: [email protected]

Peer-review under responsibility of scientific committee of International Conference on Computer, Communication and Convergence (ICCC 2015)

(2)

1. Introduction

With the increasing amount of textual information over the internet it becomes difficult for the users to find the desired information quickly. They have to look up whole the document to get a glimpse of actual theme. Automatic text summarization solves the problem by generating summaries that could be utilized as a condensed replica of a document or a set of documents. Therefore Automatic Text Summarization can be defined as the process of condensing the source text document or set of text documents while retaining main information contents using a automatic machine. Automatic text summarization could be classified mainly as extractive automatic text summation and abstractive automatic text summarization. The generic process of automatic text summarization is shown figure 1. Generic Summarization condenses overall information content available in source text. In Extractive automatic text summarization a subset of sentences from the original input set of sentences is selected as summary. In abstractive automatic text summarization important topics in the textual unit are identified and new sentences are formed. Various Taxonomies are given from different aspects for text summarization by researchers. Spark Jones 1998 [1] Hovy & Lin 1998 [2] and Mani & Maybury 1999 [3] distinguished text summarization process from input, purpose and output factors. Research in the area of text summarization started from 1950's and till now no system is available that can generate summaries as like professionals or humans (Gold Summary).

Fig. 1. Generic Automatic Text Summarization Process

Other then generic summarization, update summaries, query focused summaries, sentimental summaries etc. can also be extracted for the text, but it depends on the purpose for which summarization needs to be performed. Statistical methods for extractive text summarization do not use much domain knowledge but instead they work on the available information content on the source document. There are efforts made in abstractive automatic text summarization as well, but due the requirement of large domain knowledge and that too for specific domains is not preferred. Another reason is that information fusion that is the most difficult part of abstractive text summarization. Multi document summarization where for many documents a single summary needs to be generated is also important now a day due to different sources of the same topic of information. There is not any method available that could cater to all kind of domains. Particular method that works on News data may not work on Medical data efficiently. Rest of the paper is organized as in section 2 we discuss related work in the area of automatic text summarization. In section 3 we discuss proposed domain independent framework for automatic text summarization method. In section 4 we conclude the paper.

(3)

2. Related Work

In this section we will review work done so far in the area of automatic text summarization. Research work started in this area started in 1950s but still we are lacking with efficient methods that can generate summaries like humans or professionals. Now a days with large volume of text content available online, much more efforts are required to perform text summarization. At the start Luhn [4] in 1958 proposed the method that utilized term frequency to score the sentences. Top scoring sentences were selected as the summary sentences. Later Baxendale [5] in 1969 proposed sentence location as a scoring criterion along with term frequency to score the sentences. Edmundsun [6] in 1969 proposed two more features for sentence scoring. Title similarity and Cue word feature also included for sentence scoring along with earlier two features. He proposed combinations of all four and analyzed the results for different combinations. Rush et al [7] in 1971 used sentence rejection methodology using different set of rules. Later it was used in 1975 for generation of chemical abstracts [15]. Later in 1997 Hovy and Lin [8] experimented that position method cannot produce efficient summaries for all domains. In 2001 MEAD [9] was developed that used features such as TF/IDF, Sentence Location, Cue Words and Longest Common Subsequences for sentence scoring. Nobata et al [14] in 2001 used sentence location, sentence length, TF/IDF values of words, headline similarity and query to score significant sentences. They compared their method with TF-based and Lead-based methods. Varma et al [12] in 2005 used additional features such as length of words, parts of speech tag, familiarity of the word, named entity tag, occurrence as heading or subheading and font style to score the sentences. Fattah and Ren [11] in 2009 proposed trainable models genetic algorithm, mathematical regression, feed forward neural networks, probabilistic neural network and Gaussian mixture model for text summarization using different features. Prasad & Kulakarni [16] in 2010 used word similarity among paragraph, word similarity among sentence, iterative query score, format based score, numerical data, cue words, term frequency, thematic feature and tile similarity as features to score the sentences. The applied some evolutionary approaches as well for summary production. Abuobieda et al [10] in 2012 used title feature, sentence length, sentence position, numerical data and thematic words as features for scoring sentences. They used genetic algorithms to final design of the feature space. Rafael et al [17] in 2012 deeply analyzed all the sentence scoring features using ROUGUE [29] evaluation matrices. Mendoza et al [13] in 2014 used sentence position, title similarity, sentence length, cohesion and coverage as the features of the objective function. They used optimization techniques along with evolutionary algorithms for final summary generation. All above researchers scored sentences on the basis of some features. Most of the researchers gave equal weight to the all features included for sentence scoring. Rafael et al [18] in 2014 tried the combinations of different word, sentence and graph level algorithms for sentence scoring. In 2014 Meena and Gopalani [19] reviewed 22 features from available sources and analyzed the results on different combinations of features. For Abstractive summarization Barzilay and McKeown [21] in 1999 proposed a Tree based technique that is using Dependency based representation model DSYNT tree. They used theme intersection algorithm for content selection. For summary generation they used FUF/SURGE language Generator. Harabagiu and Lacatusu [23] in 2002 used Template based method. They represented text into Template/Frame having slots and fillers. They selected contents using Extraction rules /patterns and Information Extraction based Algorithm for summary generation. Lee and Jian [24] in 2005 used Ontology based method and represented the text using Fuzzy ontology. For content selection they used Classifier and for summary generation they used News agent. Barzilay and McKeown [22] in 2005 used Tree based method that utilized dependency tree for text representation. They used local alignment across pair of parsed sentences for content selection and Algorithm for reusing and altering phrases from input sentences. Tanaka and Kinoshita [25] in 2009 Lead and Body phrase method and represented text using Lead, body and supplement structure. They used revision candidates for content selection and insertion and substitution operations on phrases for summary generation. Greenbacker [27] in 2011 used Semantic model method for summarization process. They used information density (ID) metric for content selection and synchronous tree for summary generation. Genest and Lapalme [20] in 2011 used INIT (information Item) based method for representation. They ranked generated sentences based on document frequency and used Simple NLG for summary generation. Genest and Lapalme [26] in 2012 used rule based method that represented text using categories and aspects. Extraction rules were used for content selection and generation patterns for summary generation. Moawad and Aref [28] in 2012 used semantic graph based method and represented text using rich semantic graph. They used weights of concepts for content selection and reduced semantic graph and domain ontology for summary generation.

(4)

3. Proposed Domain Independent Framework

In our proposed approach we provide a framework to overcome domain dependency of text summarization. This framework utilized the advantage of text categorization to solve the problem of domain dependency. As shown in the figure 1 the process starts with training of our system to categorize the document correctly. Document Corpus is used for this purpose. In this framework text classification process is applied first, and then according to the specific label, summarization procedure is applied. The procedure is as follows, also as shown in figure 2:

Fig. 2 Framework for Domain Independent Automatic Text Summarization

Step-1: A Corpus of documents is used for training purpose.

Step-2: All documents in Corpus are preprocessed using sentence segmentation, stop word removal, stemming etc.

Step-3: Features required for classification purpose are extracted like keywords, TF/IDF etc.

Step-4: A Machine Learning Algorithm is used in this step. A label is used to in learning procedure from already classified documents. There are a number of classification algorithms available like Decision Tree, Naïve Bays, and K- Nearest Neighbor etc. Here after training a model classifier is build.

(5)

Step-5: Once training is over, a document for which summary needs to be generated will be given as input text document. This document is preprocessed and its features similar to step-3 are extracted.

Step-6: Document is classified to a label using the modal classifier build in step-4.

Step-7: As per the label of document respective process/model of summarization is selected.

Step-8: Respective summarization process is applied with its processing like feature extraction, sentence ranking, sentence selection etc.

Step-9: Finally summary is generated.

This framework relies on base of text categorization. Various sample categories of text are given in Table 1 as reference. A document is basically classified in one of these categories. After classification respective labels rules are applied on text document. While considering a specific methods for a particular label, optimal are applied. In this we can use both abstractive and extractive methods as we are using specific method in relation with its category. In preprocessing step sentences are segmented using appropriate methods. In general ‘.’ And ‘?’ symbols are used as sentence end marker. Stopwords are removed as they do not convey much more information related to the actual topics of the text summarization. Stemming is performed using any standard method like Porter’s Stemming Algorithms. The summarization procedure is similar to as given in figure 1.

Table 1: Sample Text Categories

S.No. Main Cetogory Sub-Categories

1 Agriculture Animal Feed, Spices, Poulltry Farming, Forestry 2 Astronomy Moons, Planet, Stars, Nebulae, Gaaxies

3 Business Electronic Commerce, Industrial Automation Business, Social Business, Marketing, Healthcare 4 Computer Hardware, Software, CIS, Mutimedia

5 Economics Markets, Supply & Demand, Industrial Organization, International Trade, Product & Efficiancy 6 Environmment Ecology, Physics, Chemistry, Soil Science, Geography

7 Bilogy Cell Theory, Evolution, Genetics

8 Chemistry Atom, Element, Compund, Ions & Salts, Acidity

9 Sport Swimming, Water Polo, Tennis, Baseball, Football, Cricket 10 Tourisom Healthy Tourism, Cultural Tourisom, Nature Toursom, Accomodation 11 News Crime, Accident, Update, Event

12 Movies Action, Romance, Hstoric, Animation 13 Legal Crime, Revenue

For Example if the document is classified as News, following procedure can be used to get the optimized performance as this method works well specifically on news documents. This approach specifically optimized for news data, consists of three steps. In preprocessing step the data is preprocessed using the method that we have already discussed in this paper. In preprocessing we discarded the sentences whose length was less than 4 words or greater than 40 words. Sentence filtering has been applied in three levels. One feature is used to filter some sentences at each level. Filtering features are TF-ISF first then Named Entities then Proper Nouns. At last the first and last sentence along with sentences retrieved by filtering process can be considered as generated summary. 4. Conclusion & Future Work

This paper suggests that a single method cannot work efficiently on all kind of domains. Therefore the proposed framework can solve this problem by categorizing the input document. The framework proposed is not dependent on any specific domain but it uses knowledge available in corpus to select the appropriate method for summarization. Our framework requires sufficient information about the categories of input set of documents. Specific methods either extractive or abstractive, which work well individually, will definitely work well once correct category will be identified. Our framework depends on the accuracy of classifier method used. Proposed framework may be extended with subcategories of documents. In future this framework will be implemented and tested on standard datasets such as DUC, TAC etc.

(6)

References

1. K. S. Jones. Automatic summarising: Factors and directions. In Advances in Automatic Text Summarization, pages 1-12. MIT Press, 1998.

2. E. Hovy and C.-Y. Lin. Automated text summarization and the summarist system. In Proceedings of a Workshop on Held at Baltimore, Association for Computational Linguistics., Maryland: October 13-15, 1998, TIPSTER '98, pages 197-214, Stroudsburg, PA, USA, 1998.

3. I. Mani and M. T. Maybury. Advances in Automatic Text Summarization. MIT Press, Cambridge, MA, USA, 1999.

4. H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159-165, Apr. 1958. 5. P. B. Baxendale. Machine-made index for technical literature: An experiment. IBM J. Res. Dev., 2(4):354-361, Oct. 1958. 6. H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264-285, Apr. 1969.

7. J. E. Rush, R. Salvador, and A. Zamora. Automatic abstracting and indexing. ii. Production of indicative abstracts by application of contextual inference and syntactic coherence criteria. Journal of the American Society for Information Science, 22(4):260-274, 1971. 8. Chin-Yew Lin and Eduard Hovy. 1997. Identifying topics by position. In Proceedings of the fifth conference on Applied natural

language processing (ANLC '97). Association for Computational Linguistics, Stroudsburg, PA, USA, 283-290. 1997

9. Radev. D., Blair-Goldensohn S. And Zhang Z. (2001). Experiments in single and multi-document summarization using MEAD. In First Document Understanding Conference, New Orleans LA, 2001.

10. A. Abuobieda, N. Salim, A. Albaham, A. Osman, and Y. Kumar. Text summarization features selection method using pseudo genetic-based model. In International Conference on Information Retrieval Knowledge Management (CAMP), 2012, pages 193-197, March 2012.

11. M. A. Fattah and F. Ren. Ga, mr, _nn, pnn and gmm based models for automatic text summarization. Computer Speech and Language, 23(1):126-144, 2009.

12. J. J, Pingali, and V. Varma. Sentence extraction based on single document summarization. In Workshop on Document Summarization, 2005.

13. M. Mendoza, S. Bonilla, C. Noguera, C. Cobos, and E. Leon. Extractive single-document summarization based on genetic operators and guided local search. Expert Syst. Appl., 41(9):4158-4169, July 2014.

14. C. Nobata, S. Sekine, M. Murata, K. Uchimoto, M. Utiyama, and H. Isahara. Sentence extraction system assembling multiple evidence. In Proceedings of the Second NTCIR Workshop Meeting, pages 5-213, 2001.

15. J. J. Pollock and A. Zamora. Automatic Abstracting Research at Chemical Abstracts Service. Chemical Information and Computer Sciences, 15(4):226-232, Nov. 1975.

16. P. R. Shardanand and U. Kulkarni. Implementation and evaluation of evolutionary connectionist approaches to automated text summarization. In Journal of Computer Science, pages 1366-1376, Feb 2010.

17. Rafael Ferreira, Luciano de Souza Cabral, Rafael Dueire Lins, Gabriel Pereira e Silva, Fred Freitas, George D.C. Cavalcanti, Rinaldo Lima, Steven J. Simske, Luciano Favaro, Assessing sentence scoring techniques for extractive text summarization, Expert Systems with Applications, Volume 40, Issue 14, 15 October 2013, Pages 5755-5764, ISSN 0957-4174, 2012.

18. Rafael Ferreira., Freitas F., De Souza Cabral L, Dueire Lins R., Lima R., Franca G., Simske S.J. and L. Favaro. "A Context Based Text Summarization System," 2014 11th IAPR International Workshop on Document Analysis Systems (DAS), vol., no., pp.66,70, 7-10 April 2014.

19. Yogesh Kumar Meena and Dinesh Gopalani, Analysis of Sentence Scoring Methods for Extractive Automatic Text Summarization, International Conference on Information and Communication Technology for Competitive Strategies (ICTCS-2014), November 14 - 16 2014, Udaipur, Rajasthan, India , ACM 978-1-4503-3216-3/14/11, 2014.

20. P.E. Genest and G. Lapalme, "Framework for abstractive summarization using text- to-text generation," in Proceedings of the Workshop on Monolingual Text-To-Text Generation, pp. 64-73, 2011.

21. R. Barzilay , et al. , "Information fusion in the context of multi-document summarization," in Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics , pp. 550- 557, 1999.

22. R. Barzilay and K. R. McKeown, "Sentence fusion for multidocument news summarization," Computational Linguistics, vol. 31, pp. 297-328, 2005.

23. S. M. Harabagiu and F. Lacatusu, "Generating single and multi-document summaries with GISTEXTER," Proceedings of the workshop on automatic summarization, pp. 30-38, 2002.

24. C.-S. Lee, et al. , "A fuzzy ontology and its application to news summarization," IEEE Transactions on Systems, Man, and Cybernetics, Cybernetics, vol. 35, pp. 859-880, 2005.

25. H. Tanaka , et al. , "Syntax-driven sentence revision for broadcast news summarization," in Proceedings of the 2009 Workshop on Language Generation and Summarisation , pp. 39-47, 2009.

26. P.-E. Genest and G. Lapalme, "Fully abstractive approach to guided summarization," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Volume 2 , pp. 354-358, 2012.

27. C. F. Greenbacker, "Towards a framework for abstractive summarization of multimodal documents," ACL HLT 2011, p. 75, 2011. 28. I. F. Moawad and M. Aref, "Semantic graph reduction approach for abstractive Text Summarization," in Computer Engineering &

Systems (ICCES), 2012 Seventh International Conference on , pp. 132-138, 2012

29. Lin, Chin-Yew. 2004. ROUGE: a Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26, 2004.