Volume 5, Issue 4, 2018
Available online at www.ijiere.com
International Journal of Innovative and Emerging
Research in Engineering
e-ISSN: 2394 – 3343 p-ISSN: 2394 – 5494
Survey Paper on Text Summarization Methods
Bhoomika Batra (a), Shilpa Sethi (b) and Dr. Ashutosh Dixit (c)
(a) Research Scholar, Department of Computer Engineering, YMCAUST, Faridabad, Haryana
(b) Assistant Professor, Department of Computer Engineering, YMCAUST, Faridabad, Haryana
(c) Associate Professor, Department of Computer Engineering, YMCAUST, Faridabad, Haryana
ABSTRACT:
Text summarization systems are the need of the hour in a world of text that is growing rapidly with the advent of the internet. Text summarization has become one of the most popular research areas among research scholars in the last few years. These systems summarize a given text and quickly provide readers with a summary that includes all of its main points, so that readers do not need to read the entire text. This paper provides a comprehensive survey of text summarization, including a detailed taxonomy of text summarization methods. Finally, a comparative analysis of prevalent text summarization methods is provided.
Keywords: Text Summarization Methods, Extractive vs Abstractive Summarization, Multi-Document vs Single-document Summarization, Informative vs Indicative Summarization, Multi-lingual vs Mono-lingual Summarization
I. INTRODUCTION
The world of text is ubiquitous. A great deal of vital information exists in text format in most data networks, especially the internet, and it is not feasible for users to read all of the information available. With the immense increase in text information available over the internet, the need for text summarization systems has also increased, as they help in locating the most essential content of a text in a very short time. Text summarization systems have various applications. For example, a search engine can use a summarizer to give users summarized information about webpages, or a summarizer can be used as a tool that gives users a summarized view of a document so that they can decide whether or not to read the full document. In addition, text summarization systems can be used by newsgroups to merge the most vital information from different documents that discuss the same topic. They can also be used to summarize letters and other important documents used in an organization. Text summarization systems are commonly used where less information needs to be transferred; for instance, people who check email on their phones prefer to use less data while connected to the internet.
In this paper, a survey of text summarization methods and their comparison has been conducted. The paper is structured as follows: Section 2 consists of the definition of text summarization, the classification of text summarization methods on different bases and a detailed discussion of these methods. Section 3 compares the text summarization methods. Section 4 concludes this study.
II. TEXT SUMMARIZATION DEFINITION AND ITS CLASSIFICATION
Text summarization [1] is defined as a process that reduces the size of the original text while retaining its main content, and writes a summary of it in concise form.
Text summarization can be classified into subtypes (as shown in Figure 1) on the basis of various factors [2, 3], some of which are:
• Approaches
• Type of details
• Content
• Limitation of input text
• Number of input documents
• Language
The details of these types are described in subsequent sections.
A. Approaches
On the basis of the approach used, text summarization [4] is of two types:
• Extractive summarization
• Abstractive summarization
Extractive Summarization works by choosing a subset of important words, phrases or sentences that already exist in the original document in order to form the summary. This method is usually easy to implement because it is not based on the semantic relations between sentences, and it is generally more successful. Extractive summaries generally use the most important information as the first sentence of the summary [5]. The summaries generated using this technique are also generally longer than average, and they suffer from the drawbacks of inconsistency, lack of balance and lack of cohesion. In this study, text summarization will be done using this technique.
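As a minimal sketch of the extractive idea described above, the following Python fragment scores each sentence by the average frequency of its words and keeps the top-scoring sentences in their original order. Word-frequency scoring, the small stop-word list and all function names are assumptions made for illustration; they are not taken from the systems surveyed here.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in",
              "it", "that", "this", "for", "on", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def extractive_summary(text, num_sentences=2):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = split_sentences(text)
    word_freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(word_freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)
```

Because the output is assembled only from sentences that already exist in the input, the sketch also illustrates why such summaries can lack cohesion: nothing links the selected sentences to each other.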
Abstractive Summarization works by generating sentences from a semantic representation of the data and then producing a summary using natural language processing (NLP) methods. It involves interpreting the original information and restating it in a shorter version. Abstractive summaries may include words that do not appear in the original text [6]. They are difficult to generate, as they require deep knowledge of NLP tasks, but they are more concise and accurate than extractive summaries. These summaries are required in cases where people's opinions are very diverse. Summaries generated using this technique have a lower compression ratio and less redundant data, but the cost of generating them is usually high.
Figure 1. Taxonomy of Text Summarization
B. Based on Type of Details
On this basis text summarization is classified into two types -
• Informative summarization
• Indicative summarization
Informative Summarization is used as an alternative to the original text. It generates summaries that give the user brief information about the original text, which helps the user decide whether the text is worth reading. The length of these summaries is about 20 to 30 percent of the original text. In some cases, however, these summaries lead to the mistaken impression that the text is not worth reading, because they do not contain detailed information about it. These summaries are easier to produce.
Indicative Summarization is applied for quickly viewing a long text. It is used when the user needs to know the main idea of the original text. Summaries generated using this technique are generally short, about 5 to 10 percent of the original text, and they help users decide whether or not to read the full document. Such a summary does not contain the actual data but contains metadata, which can include the scope of the document, the methodology used in it, its purpose, etc. For example, before purchasing a book or novel, a buyer in most cases first reads the summary given on the front and back covers and only later continues with the content.
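The length ranges quoted above (20 to 30 percent for informative summaries, 5 to 10 percent for indicative summaries) can be turned into a simple length budget. The sketch below is only an illustration: it assumes that length is measured in sentences, which the text does not specify, and the names LENGTH_RANGES_PERCENT and sentence_budget are hypothetical.

```python
# Length ranges from the text, as integer percentages of the original length.
LENGTH_RANGES_PERCENT = {
    "informative": (20, 30),
    "indicative": (5, 10),
}


def sentence_budget(num_sentences, kind):
    """Return (min, max) sentence counts for a summary of the given kind."""
    low, high = LENGTH_RANGES_PERCENT[kind]
    # Integer arithmetic avoids floating-point rounding surprises.
    min_sentences = max(1, (num_sentences * low) // 100)       # round down
    max_sentences = max(1, -(-(num_sentences * high) // 100))  # round up
    return min_sentences, max_sentences
```

For a 100-sentence document this gives a budget of 20 to 30 sentences for an informative summary and 5 to 10 sentences for an indicative one; very short documents are always allowed at least one sentence.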
C. Based on Content
On this basis text summarization is classified into two types -
• Generic summarization
• Query-based summarization
Generic Summarization is a kind of summarization [7] in which summaries are generated for any kind of user and, in addition, do not rely on the theme of the text. Summaries are generated from the author’s perspective and are not user-specific. As the summaries can be used by any kind of user, all of the information is given the same importance. No prior knowledge of the text is available while generating the summaries. Also, there is no need to keep track of the different interests of different users while generating these summaries.
[Figure 1 (taxonomy diagram): text summarization based on Approaches (Extractive, Abstractive); Type of Details (Informative, Indicative); Content (Generic, Query-based); Limitation of Input Text (Genre Specific, Domain Independent); Number of Documents (Single, Multi); Language (Mono-lingual, Multi-lingual)]
Query-based Summarization is a kind of question-answer summarization, in the sense that the user has general information about a particular interest and asks for specific information about it. In this technique the summaries generated are the result of user queries. These summaries provide a user-specific view and rely on the type of query given by the user, which gives an idea of the user’s interests. These summaries can be used only by users whose interests are related to them, and they are therefore difficult to generate, as there is a need to track the different interests of different users.
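One simple way to make a summary depend on a user query is to rank sentences by how many words they share with the query. The following sketch assumes this word-overlap scoring purely for illustration; it is not the method of any particular system, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def query_based_summary(sentences, query, num_sentences=1):
    """Return the sentences sharing the most words with the query, in original order."""
    query_terms = tokenize(query)
    scored = [(len(tokenize(s) & query_terms), i) for i, s in enumerate(sentences)]
    # Keep only sentences that match at least one query term.
    matching = [(score, i) for score, i in scored if score > 0]
    # Highest overlap first; ties are broken by position in the document.
    best = sorted(matching, key=lambda pair: (-pair[0], pair[1]))[:num_sentences]
    return [sentences[i] for i in sorted(i for _, i in best)]
```

With the same input sentences, different queries produce different summaries, which is the user-specific behaviour described above; a generic summarizer would ignore the query argument entirely.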
D. Based on Limitation of Input Text
On this basis text summarization is classified into two types -
• Genre specific summarization
• Domain independent summarization
Genre Specific Summarization is used by systems that accept only a special kind of input [8]. For example, the input can be in the form of manuals, newspaper articles, stories, etc. It solves the problem of summarizing heterogeneous documents. However, only a few real-life systems use this technique.
Domain Independent Summarization uses systems that can accept different kinds of text. This technique generates summaries that do not depend on the domain. These summaries can be used by any kind of user and do not depend on the type of input received. However, they are difficult to generate, because different criteria are required for pre-processing the different kinds of input received each time. Most real-life systems are domain independent.
E. Based on Number of Input Documents
On this basis text summarization is classified into two types -
• Single-document summarization
• Multi-document summarization
Single-document Summarization accepts only one document at a time as input. These summaries are generally easier to generate, as only a single document needs to be summarized, and they have less overhead. Almost all systems using the single-document summarization technique generate summaries using the monolithic structure of the document. For example, a single-document summary can be written by taking the first sentence of each paragraph and arranging these sentences in their original sequence. The main drawback of this technique is that summaries of related topics cannot be generated using it.
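The example given above, taking the first sentence of each paragraph and keeping the original sequence, can be sketched in a few lines of Python. The sketch assumes that paragraphs are separated by blank lines and that a sentence ends at '.', '!' or '?' followed by whitespace; the function name is hypothetical.

```python
import re


def lead_sentence_summary(document):
    """Join the first sentence of every paragraph, keeping paragraph order."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    leads = []
    for paragraph in paragraphs:
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        first = re.split(r"(?<=[.!?])\s+", paragraph, maxsplit=1)[0]
        leads.append(first)
    return " ".join(leads)
```

The approach relies entirely on the structure of one document, which is why it does not carry over directly to a collection of documents with differing structures.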
Multi-document Summarization accepts several documents at a time as input. These summaries are generally difficult to generate, as multiple documents need to be summarized. Systems using the multi-document summarization technique usually do not depend upon the structure of the documents, because the structure of the different documents used for summarization is not as readily available as in the case of a single document. A multi-document summarization system is efficient if it organizes the information around the most important aspects, so that variant views can be represented easily and, as a result, users get a good overview of the topic whose documents are summarized.
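When several documents on the same topic are combined, the same fact is often stated more than once. The sketch below shows one possible way to merge sentences from several documents while skipping near-duplicates. The use of Jaccard similarity over word sets, the threshold of 0.6 and the function names are all assumptions chosen for illustration and are not prescribed by the methods surveyed here.

```python
import re


def word_set(sentence):
    """Lowercase word tokens of a sentence as a set."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_documents(documents, threshold=0.6):
    """Collect sentences from all documents, skipping near-duplicates."""
    kept = []
    for sentences in documents:      # each document is a list of sentences
        for sentence in sentences:
            words = word_set(sentence)
            # Keep the sentence only if it is not too similar to one already kept.
            if all(jaccard(words, word_set(k)) < threshold for k in kept):
                kept.append(sentence)
    return kept
```

The merged list can then be passed to any single-document selection step; the pairwise comparison is one concrete source of the additional cost associated with multi-document summarization.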
F. Based on Language
On this basis text summarization is classified into two types -
• Mono-lingual summarization
• Multi-lingual summarization
Mono-lingual Summarization is used by systems that accept only documents in a specific language and generate output in that language only. Summaries generated using this technique have less overhead and are easier to implement, as only one language needs to be processed. However, they are used in very few areas, as nowadays most companies and organizations are multinational and need to handle different languages. In most cases these summaries require translation into the language required by the user, which is itself a tiresome process and makes systems using this technique less efficient.
III. COMPARISON BETWEEN TEXT SUMMARIZATION METHODS
Comparison between various text summarization methods is provided in Table 1.
Table 1. Comparison Table of Text Summarization Methods
| Summarization Method | Classified on the Basis of | Main Idea | Advantages | Disadvantages |
|---|---|---|---|---|
| Extractive Summarization | Approaches | This technique works by choosing a subset of important words, phrases or sentences that already exist in the original document in order to form the summary. | 1. Summaries can be easily generated using this technique. 2. These summaries are also more successful. | 1. Summaries generated using this technique suffer from inconsistency, lack of balance and lack of cohesion. 2. These summaries are also lengthy. |
| Abstractive Summarization | Approaches | This technique works by generating sentences from a semantic representation of the data and then producing a summary using natural language processing (NLP) methods. | 1. This technique generates summaries that have a lower compression ratio. 2. These summaries are not as lengthy as extractive summaries. 3. These summaries are semantically related to each other. 4. These summaries also have less redundant data. | 1. Summaries using this technique are difficult to compute, involve high cost and require in-depth knowledge of NLP tasks. 2. The cost involved in generating these summaries is also high. |
| Informative Summarization | Type of Details | It gives the user brief information about the original document, which helps the user decide whether or not the document is worth reading. | 1. These summaries are easier to produce. | 1. Summaries generated using this technique do not have detailed information, which sometimes leads to the mistaken impression that the document is not worth reading. 2. These summaries cannot be used for quick categorization of products. |
| Indicative Summarization | Type of Details | It provides summaries that help users decide whether or not they want to read the full document. | 1. Summaries generated using this technique in most cases motivate users to read the document in detail. 2. These summaries are not lengthy, usually 5 to 10 percent of the original text. 3. These summaries can also be used for quick categorization of texts. | 1. Summaries are difficult to produce using this technique. |
| Generic Summarization | Content | It is a kind of summarization in which the summary is made for any kind of user and, in addition, does not rely on the theme of the document. | 1. Summaries generated using this technique can be used by any kind of user. 2. These summaries can be easily generated, as there is no need to keep information about the different interests of different users. | |
| Query-based Summarization | Content | It is a kind of question-answer summarization. In this technique the summary is the result of a query. | 1. Summaries using this technique give an idea of the user’s interests. 2. These summaries can also be used for searching for information about a particular topic. | 1. Summaries generated using this technique are difficult to compute, as tracking of the different interests of different users is required. 2. Only specific users whose interests are related to these summaries can use them. |
| Genre Specific Summarization | Limitation of Input Text | This summarization uses systems that accept only a special kind of input. | 1. Summaries generated using this technique solve the problem of summarizing heterogeneous documents. 2. Systems using this technique are good for users who want to view summaries of only a particular genre. | 1. This technique accepts input only in the form of some specific templates to generate summaries. 2. Only a few available systems use this technique to generate summaries. |
| Domain Independent Summarization | Limitation of Input Text | This kind of summarization technique uses systems that can accept different kinds of text. | 1. Summaries generated using this technique can be used in any domain by any user. 2. Systems using this technique do not depend on the type of input received. 3. Most of the systems available use this technique to generate summaries. | 1. Summaries generated using this technique are difficult to produce, because a different technique is required for pre-processing the different types of input received each time. |
| Single-document Summarization | Number of Input Documents | This kind of summarization accepts only one document at a time as input. | 1. Summaries generated using this technique have less overhead. 2. These summaries can be easily generated, as only one document needs to be summarized. | 1. Summaries of related topics cannot be generated using this technique. |
| Multi-document Summarization | Number of Input Documents | This kind of summarization accepts several documents at a time as input. | 1. This technique can combine the summaries generated from different documents on related topics into a single document. 2. These summaries are also more efficient. | 1. Summaries generated using this technique are difficult to produce because, in addition to summarizing more than one document, the user also needs to check whether these documents have related topics, which adds to the cost of summary generation. |
| Mono-lingual Summarization | Language | This kind of summarization technique uses systems that accept only documents in a specific language and generate output in that language only. | 1. Summaries generated using this technique have less overhead. 2. These summaries can be easily generated, as only one language needs to be processed, and hence systems using this technique are easier to implement. | 1. Summaries generated using this technique are used in very few areas, as nowadays most companies and organizations are multinational and need to handle different languages. 2. In most cases these summaries require translation into the language required by the user, which is itself a tiresome process and makes systems using this technique less efficient. |
| Multi-lingual Summarization | Language | This kind of summarization technique is used by systems that accept documents in different languages and generate output in the languages accepted. | 1. Systems using this technique have a wide variety of applications in different areas, as they can handle multiple languages. 2. These systems are also more efficient and have less overhead, as they can be used by different users in different languages without the need to translate the document into their own language. | 1. Summaries generated using this technique are difficult to produce, as documents in different languages need to be handled in the same system. |
Each text summarization method has its own advantages and disadvantages. The selection of a text summarization method depends upon the needs and situation of the users or organizations using it.
IV. CONCLUSION
The growth of the internet has led to a rapid increase in the text information available online. As a lot of information in the form of text is available on the internet, the need for summarizing text has also increased tremendously. This creates a need for strong and efficient text summarizers that summarize text information accurately, so that users do not need to go through the entire information and waste time. Most systems use either extractive or abstractive text summarization in combination with multi-document and multi-lingual text summarization methods. This paper has discussed various text summarization methods along with their advantages and disadvantages.
REFERENCES
[1] Alaa F. Alsaqer, Sreela Sasi, “Movie review summarization and sentiment analysis using rapidminer”, International Conference on Networks & Advances in Computational Technologies, July 2017
[2] Nikita Munot, Sharvari S. Govilkar, “Comparative study of text summarization methods”, International Journal of Computer Applications (0975 – 8887), Vol. 102, Issue 12, Sept 2014
[3] Nidhika Yadav, Niladri Chatterjee, “Text Summarization using Sentiment Analysis for DUC Data”, International Conference on Information Technology, 2016
[4] Vishal Gupta, Gurpreet Singh Lehal, “A survey of text summarization extractive techniques”, Journal of Emerging Technologies in Web Intelligence, Vol. 2, Issue, Aug 2010
[5] D. Gaikwad and C. Mahender, “A review paper on text summarization”, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 3, March 2016
[6] http://www4.ncsu.edu/~slrace/genericsummarizationtalk.pdf
[7] https://rare-technologies.com/text-summarization-in-python-extractive-vs-abstractive-techniques-revisited/
[8] http://explainwell.org/index.php/table-of-contents-synthesize-text/types-of-summaries/