Volume 5, Issue 4, 2018

Available online at www.ijiere.com

International Journal of Innovative and Emerging Research in Engineering

e-ISSN: 2394 – 3343 p-ISSN: 2394 – 5494

Survey Paper on Text Summarization Methods

Bhoomika Batra (a), Shilpa Sethi (b) and Dr. Ashutosh Dixit (c)

(a) Research Scholar, Department of Computer Engineering, YMCAUST, Faridabad, Haryana
(b) Assistant Professor, Department of Computer Engineering, YMCAUST, Faridabad, Haryana
(c) Associate Professor, Department of Computer Engineering, YMCAUST, Faridabad, Haryana

ABSTRACT:

Text summarization systems are the need of the hour, as the volume of text is growing rapidly with the advent of the internet. Text summarization has become one of the most popular research areas in the last few years. These systems summarize a given text and quickly provide readers with a summary that includes all its main points, so that readers do not need to read the entire text. This paper provides a comprehensive survey of text summarization, including a detailed taxonomy of text summarization methods. Finally, a comparative analysis of prevalent text summarization methods is provided.

Keywords: Text Summarization Methods, Extractive vs Abstractive Summarization, Multi-Document vs Single-document Summarization, Informative vs Indicative Summarization, Multi-lingual vs Mono-lingual Summarization

I. INTRODUCTION

Text is ubiquitous. A great deal of vital information exists in text format in most data networks, especially the internet. It is not feasible for users to read all the information available. With the immense increase in text information available over the internet, the need for text summarization systems has also increased, as they help in locating the most essential content of a text in a very short time. Text summarization systems have various applications. For example, they can be used in a search engine to give users summarized information about webpages, or as a tool that gives users a summarized view of a document so that they can decide whether or not to read it in full. Text summarization systems can also be used by newsgroups to merge the most vital information from different documents that discuss the same topic, and to summarize letters and other important documents used in an organization. They are especially common where less information needs to be transferred; for instance, people who check email on their phones prefer to use less data while connected to the internet.

In this paper, a survey of text summarization methods and their comparison is presented. The paper is structured as follows: Section II defines text summarization, classifies text summarization methods on different bases and discusses these methods in detail. Section III compares the text summarization methods. Section IV concludes the study.

II. TEXT SUMMARIZATION DEFINITION AND ITS CLASSIFICATION

Text summarization [1] is defined as the process of reducing the size of the original text while retaining its main content and presenting it in a concise form.

Text summarization can be classified into subtypes (as shown in Figure 1) on the basis of various factors [2, 3], some of which are -

Approaches

Type of details

Content

Limitation of input text

Number of input documents

Language
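The six classification factors listed above can be pictured as independent axes, each with two values. The following is a minimal sketch of that idea as a data structure; the class and field names (such as `SummarizerProfile`) are hypothetical and chosen only for illustration, they do not come from any library or from the surveyed systems.

```python
from dataclasses import dataclass
from enum import Enum


class Approach(Enum):
    EXTRACTIVE = "extractive"
    ABSTRACTIVE = "abstractive"


class DetailType(Enum):
    INFORMATIVE = "informative"
    INDICATIVE = "indicative"


class Content(Enum):
    GENERIC = "generic"
    QUERY_BASED = "query-based"


class InputLimitation(Enum):
    GENRE_SPECIFIC = "genre specific"
    DOMAIN_INDEPENDENT = "domain independent"


class DocumentCount(Enum):
    SINGLE = "single-document"
    MULTI = "multi-document"


class Language(Enum):
    MONO_LINGUAL = "mono-lingual"
    MULTI_LINGUAL = "multi-lingual"


@dataclass(frozen=True)
class SummarizerProfile:
    """One choice per classification axis (hypothetical name, for illustration)."""
    approach: Approach
    detail_type: DetailType
    content: Content
    input_limitation: InputLimitation
    document_count: DocumentCount
    language: Language

    def describe(self):
        # Join the chosen value on every axis into one readable label.
        return ", ".join(axis.value for axis in (
            self.approach, self.detail_type, self.content,
            self.input_limitation, self.document_count, self.language))
```

A given system then corresponds to one value on each axis, for example an extractive, indicative, generic, domain independent, single-document, mono-lingual summarizer.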

The details of these types are described in the subsequent sections.

A. Approaches

On the basis of the approach used, text summarization [4] is classified into two types -

Extractive summarization

Abstractive summarization

Extractive Summarization works by choosing a subset of existing, important words, phrases or sentences from the original document to form the summary. This method is usually easy to implement because it does not rely on the semantic relations between sentences, and it tends to be more successful. Extractive summaries generally treat the most important information as the first sentence of the summary [5]. Summaries generated using this technique are generally longer than average, and they suffer from inconsistency, lack of balance and lack of cohesion. In this study, text summarization will be done using this technique.
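The idea of selecting existing sentences can be illustrated with a small sketch. It assumes word-frequency scoring, which is only one possible selection criterion and is not prescribed by the surveyed methods; the stop-word list, the naive sentence splitter and the function names are likewise assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "it", "on", "for"}


def split_sentences(text):
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lower-case word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def extractive_summary(text, num_sentences=2):
    """Pick the highest-scoring sentences and return them in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because every output sentence is copied verbatim from the input, no semantic analysis is needed, which matches the ease of implementation noted above; the same property explains why such summaries can lack cohesion.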

Abstractive Summarization works by generating sentences from a semantic representation of the data and then producing a summary using natural language processing (NLP) methods. It involves interpreting the original information in a shorter form. Abstractive summaries may include words that do not appear in the original text [6]. They are difficult to generate because they require deep knowledge of NLP tasks, but they are more concise and accurate than extractive summaries. These summaries are required in cases where people's opinions are very diverse. Summaries generated using this technique have a lower compression ratio and less redundant data, but the cost of generating them is usually high.

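Figure 1. Taxonomy of Text Summarization
[Figure: text summarization classified by Approaches (Extractive, Abstractive); Type of Details (Informative, Indicative); Content (Generic, Query-based); Limitation of Input Text (Genre Specific, Domain Independent); Number of Documents (Single, Multi); Language (Mono-lingual, Multi-lingual).]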

B. Based on Type of Details

On this basis text summarization is classified into two types -

Informative summarization

Indicative summarization

Informative Summarization is used as an alternative to the original text. It generates summaries that give the user brief information about the original text. The length of these summaries is nearly 20 to 30 percent of the original text. In some cases, however, these summaries lead to the misconception that the text is not worth reading, because they do not contain detailed information about it. These summaries are easier to produce.

Indicative Summarization is applied for quickly viewing a long text. It is used when the user needs to know the main idea of the original text. Summaries generated using this technique are generally small, 5 to 10 percent of the original text, and they help users decide whether they want to read the full document or not. Such a summary does not contain the actual data but contains metadata, which can include the scope of the document, the methodology used in it, its purpose, etc. For example, before purchasing a book or novel, a buyer in most cases first reads the summary given on the front and back covers and only later continues with the content.
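The length figures quoted for the two types translate directly into a word budget. The sketch below is simple arithmetic on the percentages stated above; the function name and the use of a word count as the unit of length are assumptions made for illustration.

```python
def summary_length_range(original_words, kind):
    """Return (min_words, max_words) for a summary of the given kind."""
    # Ratios taken from the text: informative about 20-30%, indicative about 5-10%.
    ratios = {"informative": (0.20, 0.30), "indicative": (0.05, 0.10)}
    low, high = ratios[kind]
    return round(original_words * low), round(original_words * high)
```

For a 2000-word document this gives roughly 400 to 600 words for an informative summary and 100 to 200 words for an indicative one.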

C. Based on Content

On this basis text summarization is classified into two types -

Generic summarization

Query-based summarization

Generic Summarization is a kind of summarization [7] in which summaries are generated for any kind of user and do not rely on the theme of the text. Summaries are generated from the author's perspective and are not user-specific. As the summaries can be used by any kind of user, all of the information is given the same importance. No prior knowledge of the text is available while generating the summaries. Also, there is no need to keep track of the different interests of different users while generating these summaries.

Query-based Summarization is a kind of question-answer summarization, in the sense that the user has general information about a particular interest and asks for specific information about it. In this technique, the summaries generated are the result of user queries. These summaries provide a user-specific view and rely on the type of query given by the user, which gives an idea of the user's interests. These summaries can be used only by users whose interests are related to them, and they are therefore difficult to generate, as the different interests of different users need to be tracked.
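A query-driven selection can be sketched as follows. The sketch assumes the simplest possible relevance measure, the number of words a sentence shares with the query; this measure and the function names are illustrative assumptions, not a method described in the surveyed literature.

```python
import re


def tokens(text):
    """Lower-case word set of a piece of text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def query_based_summary(sentences, query, num_sentences=1):
    """Return the sentences sharing the most words with the query, in original order."""
    query_words = tokens(query)
    scored = [(len(tokens(s) & query_words), i) for i, s in enumerate(sentences)]
    # Keep only sentences with at least one query word, best overlap first.
    relevant = sorted((item for item in scored if item[0] > 0),
                      key=lambda item: (-item[0], item[1]))
    chosen = sorted(i for _, i in relevant[:num_sentences])
    return [sentences[i] for i in chosen]
```

The same document yields different summaries for different queries, which is what makes the result user-specific, and a query with no matching words yields an empty summary.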

D. Based on Limitation of Input Text

On this basis text summarization is classified into two types -

Genre specific summarization

Domain independent summarization

Genre Specific Summarization is used by systems which accept only a special kind of input [8]. For example, the input can be in the form of manuals, newspaper articles, stories etc. It solves the problem of summarizing heterogeneous documents. However, only a few real-life systems use this technique.

Domain Independent Summarization is used by systems that can accept different kinds of text. This technique generates summaries which do not depend on the domain. These summaries can be used by any kind of user and do not depend on the type of input received. However, they are difficult to generate because different criteria are required for pre-processing the different kinds of input received each time. Most real-life systems are domain independent.

E. Based on Number of Input Documents

On this basis text summarization is classified into two types -

Single-document summarization

Multi-document summarization

Single-document Summarization accepts only one document at a time as input. These summaries are generally easier to generate, as only a single document needs to be summarized, and they have less overhead. Almost all systems using the single-document summarization technique generate summaries using the monolithic structure of the document. For example, to write a single-document summary, take the first sentence of each paragraph and arrange these sentences together in their original sequence. The main drawback of this technique is that summaries of related topics cannot be generated using it.
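The first-sentence heuristic described above can be written down in a few lines. This is a minimal sketch that assumes paragraphs are separated by blank lines and that a sentence ends at '.', '!' or '?' followed by whitespace; the function name is hypothetical.

```python
import re


def lead_sentence_summary(document):
    """Join the first sentence of each paragraph, keeping the original order."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    first_sentences = []
    for paragraph in paragraphs:
        # Naive split: the first sentence ends at the first '.', '!' or '?'
        # that is followed by whitespace.
        first = re.split(r"(?<=[.!?])\s+", paragraph, maxsplit=1)[0]
        first_sentences.append(first)
    return " ".join(first_sentences)
```

The heuristic depends entirely on the paragraph structure of one document, which is why it does not carry over to a set of documents with different layouts.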

Multi-document Summarization accepts several documents at a time as input. These summaries are generally difficult to generate, as multiple documents need to be summarized. Systems using the multi-document summarization technique usually do not depend upon the structure of the documents, because the structure of the different documents used for summarization is not readily available as it is in the case of a single document. A multi-document summarization system is efficient if it organizes the information around the most important aspects, so that varying views can be represented easily and users get a good overview of the topic whose documents are summarized.
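One practical concern when several documents on the same topic are combined, which the text does not spell out, is that they tend to repeat one another. The sketch below assumes that each document has already been reduced to a list of candidate sentences and that near-duplicates are detected with Jaccard similarity over word sets; the threshold value and all function names are assumptions made for illustration.

```python
import re


def word_set(sentence):
    """Lower-case word set of a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_without_redundancy(documents, threshold=0.5):
    """Collect sentences from all documents, skipping near-duplicates of ones already kept."""
    kept = []
    for sentences in documents:
        for sentence in sentences:
            words = word_set(sentence)
            # Keep the sentence only if it is sufficiently different from every kept one.
            if all(jaccard(words, word_set(k)) < threshold for k in kept):
                kept.append(sentence)
    return kept
```

Sentences that survive from different documents can express differing views on the same topic, while repeated statements are collapsed into one.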

F. Based on Language

On this basis text summarization is classified into two types -

Mono-lingual summarization

Multi-lingual summarization

Mono-lingual Summarization is used by systems which accept only documents in a specific language and generate output in that language only. Summaries generated using this technique have less overhead, and such systems are easier to implement, as only one language needs to be processed. However, they are used in very few areas, as nowadays most companies and organizations are multinational and need to handle different languages. In most cases these summaries require translation into the language required by the user, which is itself a tiresome process and makes systems using this technique less efficient.

III. COMPARISON BETWEEN TEXT SUMMARIZATION METHODS

Comparison between various text summarization methods is provided in Table 1.

Table 1. Comparison Table of Text Summarization Methods

Summarization Method: Extractive Summarization
Classified on the Basis of: Approaches
Main Idea: This technique works by choosing a subset of existing and important words, phrases or sentences from the original document in order to form the summary.
Advantages:
1. Summaries can be easily generated using this technique.
2. These summaries are also more successful.
Disadvantages:
1. Summaries generated using this technique suffer from inconsistency, lack of balance and lack of cohesion.
2. These summaries are also lengthy.

Summarization Method: Abstractive Summarization
Classified on the Basis of: Approaches
Main Idea: This technique works by generating sentences from a semantic representation of the data and then producing a summary using natural language processing (NLP) methods.
Advantages:
1. This technique generates summaries which have a lower compression ratio.
2. These summaries are not as lengthy as extractive summaries.
3. These summaries are semantically related to each other.
4. These summaries have less redundant data.
Disadvantages:
1. Summaries using this technique are difficult to compute and require in-depth knowledge of NLP tasks.
2. The cost involved in generating these summaries is also high.

Summarization Method: Informative Summarization
Classified on the Basis of: Type of Details
Main Idea: It gives the user brief information about the original document, which helps the user decide whether that document is worth reading or not.
Advantages:
1. These summaries are easier to produce.
Disadvantages:
1. Summaries generated using this technique do not have detailed information, which sometimes leads to the misconception that the document is not worth reading.
2. These summaries cannot be used for quick categorization of products.

Summarization Method: Indicative Summarization
Classified on the Basis of: Type of Details
Main Idea: It provides summaries which help users decide whether they want to read the full document or not.
Advantages:
1. Summaries generated using this technique in most cases motivate users to read the document in detail.
2. These summaries are not lengthy, usually 5 to 10 percent of the original text.
3. These summaries can be used for quick categorization of texts.
Disadvantages:
1. Summaries are difficult to produce using this technique.

Summarization Method: Generic Summarization
Classified on the Basis of: Content
Main Idea: It is a kind of summarization in which the summary is made for any kind of user and does not rely on the theme of the document.
Advantages:
1. Summaries generated using this technique can be used by any kind of user.
2. These summaries can be easily generated, as there is no need to keep information about the different interests of different users.
Disadvantages: -

Summarization Method: Query-based Summarization
Classified on the Basis of: Content
Main Idea: It is a kind of question-answer summarization. In this technique the summary is the result of a query.
Advantages:
1. Summaries using this technique give an idea of the user's interests.
2. These summaries can be used for searching information about a particular topic.
Disadvantages:
1. Summaries generated using this technique are difficult to compute, as the different interests of different users need to be tracked.
2. Only specific users whose interests are related to these summaries can use them.

Summarization Method: Genre Specific Summarization
Classified on the Basis of: Limitation of Input Text
Main Idea: This summarization uses systems which accept only a special kind of input.
Advantages:
1. Summaries generated using this technique solve the problem of summarizing heterogeneous documents.
2. Systems using this technique are good for users who want to view summaries of only a particular genre.
Disadvantages:
1. This technique accepts input only in the form of some specific templates to generate summaries.
2. Only a few systems are available which use this technique to generate summaries.

Summarization Method: Domain Independent Summarization
Classified on the Basis of: Limitation of Input Text
Main Idea: This kind of summarization technique uses systems that can accept different kinds of text.
Advantages:
1. Summaries generated using this technique can be used in any domain by any user.
2. Systems using this technique are not dependent on the type of input received.
3. Most of the systems available use this technique to generate summaries.
Disadvantages:
1. Summaries generated using this technique are difficult to produce, because a different technique is required for pre-processing each type of input received.

Summarization Method: Single-document Summarization
Classified on the Basis of: Number of Input Documents
Main Idea: This kind of summarization accepts only one document at a time as input.
Advantages:
1. Summaries generated using this technique have less overhead.
2. These summaries can be easily generated, as only one document needs to be summarized.
Disadvantages:
1. Summaries of related topics cannot be generated using this technique.

Summarization Method: Multi-document Summarization
Classified on the Basis of: Number of Input Documents
Main Idea: This kind of summarization accepts several documents at a time as input.
Advantages:
1. This technique can combine the summaries generated from different documents on related topics into a single document.
2. These summaries are also more efficient.
Disadvantages:
1. Summaries generated using this technique are difficult to produce because, in addition to summarizing more than one document, the user also needs to check whether these documents cover related topics, which adds to the cost of summary generation.

Summarization Method: Mono-lingual Summarization
Classified on the Basis of: Language
Main Idea: This kind of summarization technique uses systems which accept only documents in a specific language and generate output in that language only.
Advantages:
1. Summaries generated using this technique have less overhead.
2. These summaries can be easily generated, as only one language needs to be processed, and hence systems using this technique are easier to implement.
Disadvantages:
1. Summaries generated using this technique are used in very few areas, as nowadays most companies and organizations are multinational and need to handle different languages.
2. In most cases these summaries require translation into the language required by the user, which is itself a tiresome process and makes systems using this technique less efficient.

Summarization Method: Multi-lingual Summarization
Classified on the Basis of: Language
Main Idea: This kind of summarization technique is used by systems which accept documents in different languages and generate output in the different languages accepted.
Advantages:
1. Systems using this technique have a wide variety of applications in different areas, as they can handle multiple languages.
2. These systems are more efficient and have less overhead, as they can be used by different users in different languages without the need to translate the document into their own language.
Disadvantages:
1. Summaries generated using this technique are difficult to produce, as documents in different languages need to be handled in the same system.

Each text summarization method has its own advantages and disadvantages. The selection of a text summarization method depends upon the needs and situation of the users or organizations using it.

IV. CONCLUSION

The growth of the internet has led to a rapid increase in the text information available online, and the need for summarizing text has increased tremendously as a result. This calls for strong and efficient text summarizers that summarize text information accurately, so that users do not need to go through the entire information and waste time. Most systems use either extractive or abstractive text summarization in combination with multi-document and multi-lingual text summarization methods. This paper has discussed various text summarization methods along with their advantages and disadvantages.

REFERENCES

[1] Alaa F. Alsaqer, Sreela Sasi, “Movie review summarization and sentiment analysis using rapidminer”, International Conference on Networks & Advances in Computational Technologies, July 2017

[2] Nikita Munot, Sharvari S. Govilkar, “Comparative study of text summarization methods”, International Journal of Computer Applications (0975 – 8887), Vol. 102, Issue 12, Sept 2014

[3] Nidhika Yadav, Niladri Chatterjee, “Text Summarization using Sentiment Analysis for DUC Data”, International Conference on Information Technology, 2016

[4] Vishal Gupta, Gurpreet Singh Lehal, “A survey of text summarization extractive techniques”, Journal of Emerging Technologies in Web Intelligence, Vol. 2, Issue, Aug 2010

[5] D. Gaikwad and C. Mahender, “A review paper on text summarization”, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 3, March 2016

[6] http://www4.ncsu.edu/~slrace/genericsummarizationtalk.pdf

[7] https://rare-technologies.com/text-summarization-in-python-extractive-vs-abstractive-techniques-revisited/

[8] http://explainwell.org/index.php/table-of-contents-synthesize-text/types-of-summaries/
