Document Summarization: A Review

Sonali Gupta

Assistant Professor, Computer Engineering Department

YMCA University of Science & Technology Faridabad, India

Himanshi Chopra

Student, M.Tech, Computer Engineering Department

YMCA University of Science & Technology Faridabad, India

Abstract: Automatic summarization is a process in which a text is passed through a module that produces a short, informative form of the input while retaining its original idea. Techniques designed for single-document summarization can sometimes be extended to multi-document summarization, but not always, as doing so may introduce redundancy or make the summary less informative. Researchers have designed various techniques to summarize document(s) efficiently without losing the originality of the document(s). In this paper, these techniques are categorised on the basis of single-document summarization and multi-document summarization.

Keywords: Automatic summarization, single-document summarization, multi-document summarization, extractive and abstractive techniques, generic and query-based techniques.

I. INTRODUCTION

A large amount of unstructured data is present on the web and in other data sources, but to handle it a user needs a module that extracts the significant data from what has been collected. Different summarizer modules exist for web-based data, so a user should choose the appropriate one according to the summary required.

A summary can be created from a single text or from multiple texts, and summaries no longer than half of the original text are generally preferred. Large volumes of data are generated from different sources, and this data may be textual, multimedia, etc. Automatic summarization is therefore important for dealing with this bulk of data. Such summaries are significant for media, stock markets, note making, weather forecasting, synopsis writing, etc. Managing and processing a large amount of data is a difficult task in today's world. Summarization systems may not give exact summaries but can give approximate ones, and different systems may generate different summaries of the same text. These generated summaries may also differ from human-generated summaries. Sometimes a general approximate summary needs to be generated, while at other times a summary pertaining to some input text (a query) is needed. The summarization technique should be chosen according to the type of summary required.

On the basis of the input document(s), summarization can be classified as:

1) Single-document summarization

2) Multi-document summarization

Single-document summarization refers to only one document when creating a summary, while multi-document summarization considers several documents. These documents may or may not be related to each other, and more processing is required when they are unrelated.

Clustering techniques can sometimes be used when dealing with unrelated documents. Other, more complex techniques may also be used. The main theme of the documents should be retained in the summary when multi-document summarization techniques are applied (this is easier to achieve in single-document summarization).

It can therefore be concluded that multi-document summarization is more complex than single-document summarization.

Various issues, such as redundancy and the significance of sentences, have been observed in the summarization of document(s).

Redundancy in summarization: Sometimes the same data is repeated in the source document(s). For efficient summaries, this overlapping data needs to be included in the summary only once.

The redundancy problem can be illustrated using textcompactor [1]. In Fig. 1, the data to be summarized has been entered. The summary limit, i.e. 50%, is then set, after which the entered data is summarized (Fig. 2). As the example clearly shows, redundancy is present in the summary.

International Journal of Advanced Engineering Science and Technological Research

Fig.1 textcompactor summarizer[1]

Fig.2 Summary generated using textcompactor summarizer[1]

II. AUTOMATIC SUMMARIZATION

Some summaries contain enough information to judge the significance of the entire document(s); these are indicative summaries.

Summaries that contain the important data of the source fall in the category of informative summaries, whereas evaluative summaries describe the views of different authors on a given topic. On the basis of preserving the originality of the document(s), summarization can be classified into:

a) Extractive summarization: the summary consists of important sentences selected verbatim from the source.

b) Abstractive summarization: sentences are decomposed to create the summary of the text.

Summarization can also be categorized by the input(s) provided:

a) Generic summarization: It considers only the source document(s) for making summaries.

b) Query-based summarization: It considers both the source document(s) and the query given by the user for making summaries.

Two query-based summaries are unlikely to be the same, because each summary is centred on a different theme. Summaries can also be generated manually.

Various methods can be used for summary evaluation. They are designed to compare the efficiency and performance of the summaries generated by various systems. These are:

a) Automatic

b) Manual

c) Extrinsic

d) Intrinsic

Automatic methods may include evaluation by ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [2]. ROUGE is an official technique for DUC 2004, TIPSTER. Manual methods measure performance by having humans evaluate the summaries; with human evaluation, results may differ between evaluators. Extrinsic methods evaluate a summary by describing how other application tasks are affected by it, whereas intrinsic methods evaluate by analysing the summary itself against a set of norms.
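As an illustration of automatic evaluation, the following is a minimal sketch of an n-gram recall measure in the spirit of ROUGE-N. It is not the official ROUGE package [2]; the function names, the whitespace tokenization and the lower-casing are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate.

    Assumption: tokens are obtained by lower-casing and splitting on
    whitespace; a real evaluation would also handle stemming, stop words
    and multiple reference summaries.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: an n-gram is matched at most as often as it
    # occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

For the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", five of the six reference unigrams and three of the five reference bigrams are matched.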

III. RELATED WORK

Various techniques for text summarization have been designed to make the summarization process efficient. Many applications use single-document summarization, whereas some use multi-document summarization. Some of the techniques reviewed are as follows:

Single-Document Summarization

Y. Surendranadha Reddy and A.P. Siva Kumar proposed a summarization technique in [3] that uses neighborhood knowledge for producing summaries. It uses statistical sentence evaluation measures and is called the Nearest Neighbor Search technique. It is a single-document, entity/surface-level summarization technique that performs abstractive summarization.

Fig.3 various steps

Fig. 3 shows the steps followed in this technique. It takes a single document as input, and neighbor documents, i.e. documents topically related to the input document, are used for making the summary. The summary is generated using various similarity functions, a global affinity graph, term frequency evaluation and score evaluation. A penalty is applied to highly overlapping sentences to reduce redundant data.

Term Frequency/Inverse Term Frequency based sentence extraction is employed. Further, it may be extended to multi-document summarization.

Intrinsic evaluation is carried out.

Niladri Chatterjee and Shiwali Mohan designed a summarization algorithm in [4] that uses semantic similarity between sentences so that redundant sentences can be removed from the text. The cosine function is used to calculate semantic similarity, but before the similarity scores are calculated the sentences are mapped using Random Indexing. Random Indexing reduces the size of the word space formed (the Word Space model was expensive in terms of space), which provides an efficient way to calculate the semantic similarity between words, sentences and documents. The problem of high dimensionality may be tackled by Random Indexing, Latent Semantic Analysis, manual grammar method, etc.

The Word Space Model is subjected to Random Indexing. The Random Indexing algorithm is described in Fig. 4:

Fig.4 Basic Algorithm

After employing Random Indexing and calculating various similarities, a graph-based ranking algorithm is employed to extract the required summary. The graph-based ranking algorithm used here is the PageRank algorithm. Finally, a generic, extractive, informative summary is generated by exploiting semantic similarity. As future work, the authors planned to use Random Indexing with the word space model to resolve ambiguities more efficiently, to smooth out abruptness in the summaries by constructing Steiner trees of the graphs, and to construct a more specific technique for measuring the efficiency of the scheme. This approach provides better results than commercially available summarizers such as Copernic. The approach is entity-level based, and intrinsic methods are used for its evaluation.
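The pipeline described above (Random Indexing, cosine similarity, graph-based ranking) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the algorithm of [4]: the dimensionality, the number of non-zero entries, the context window, the clipping of negative similarities and all function names are choices made for this example.

```python
import math
import random
from collections import defaultdict

DIM = 64  # dimensionality of the reduced word space (assumed value)


def index_vector(word, dim=DIM, nonzero=4):
    """Sparse random index vector with a few +1/-1 entries, fixed per word."""
    rng = random.Random(word)  # seeding by the word keeps the vector reproducible
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((1.0, -1.0))
    return vec


def context_vectors(sentences, window=2, dim=DIM):
    """Add the index vectors of neighbouring words to each word's context vector."""
    ctx = defaultdict(lambda: [0.0] * dim)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    iv = index_vector(tokens[j], dim)
                    ctx[word] = [a + b for a, b in zip(ctx[word], iv)]
    return ctx


def sentence_vector(tokens, ctx, dim=DIM):
    """Represent a sentence as the sum of its words' context vectors."""
    vec = [0.0] * dim
    for word in tokens:
        vec = [a + b for a, b in zip(vec, ctx[word])]
    return vec


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over a non-negative weight matrix."""
    n = len(weights)
    out = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n + damping * sum(
                      scores[j] * weights[j][i] / out[j]
                      for j in range(n) if out[j] > 0)
                  for i in range(n)]
    return scores


def rank_sentences(sentences):
    """Return sentence indices ordered from most to least central."""
    ctx = context_vectors(sentences)
    vecs = [sentence_vector(s, ctx) for s in sentences]
    n = len(vecs)
    # Negative similarities are clipped to zero (assumption) so that the
    # matrix can serve as edge weights.
    weights = [[max(0.0, cosine(vecs[i], vecs[j])) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    scores = pagerank(weights)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Each sentence is passed as a list of tokens; taking the first few indices returned by `rank_sentences` gives an extractive summary of the chosen length.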

Maria Soledad Pera and Yiu-Kai D. Ng in [5] presented a summarization approach that relies only on word similarity to generate quality summaries. They used CorSum, an extractive, single-document summarization approach. First, word correlation factors are computed. These factors are then used to identify the significant sentences to include in the summary. A Naïve Bayes classifier is used for classifying text: it can be implemented effectively, is scalable, has been used in a wide variety of domains, and here classifies the summaries generated by CorSum.

The Naïve Bayes classifier gives high accuracy while assuming mutual independence of attributes. Word occurrence frequency may be used to calculate the probability of a document being assigned to a category. This probability may be smoothed using the Laplace approach, also known as add-one smoothing.
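A minimal sketch of a word-frequency Naïve Bayes classifier with Laplace (add-one) smoothing is given below. It illustrates the general technique rather than the classifier used in [5]; the function names and the toy training data in the usage note are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect per-class document counts, word counts and the vocabulary.

    labeled_docs is an iterable of (tokens, label) pairs.
    """
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify_nb(tokens, model):
    """Return the class with the highest smoothed log-probability."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in class_docs.items():
        logp = math.log(n_docs / total_docs)  # class prior
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokens:
            logp += math.log((word_counts[label][word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

With two "finance" documents and one "weather" document as training data, a summary containing "market" and "shares" is assigned to "finance", while one containing only "storm" is assigned to "weather". Because of the smoothing, a word never seen in a class lowers the score of that class instead of reducing it to zero.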

Shilpi Malhotra and Ashutosh Dixit in [6] presented a query-based text summarization technique for news articles: it works on data fetched from the web and summarizes it with respect to the query entered by the user. First, the user's query is passed to the corpus builder, which fetches the headlines of the relevant documents. The user chooses one of the headlines, and a summary is produced for the document to which that headline belongs. The architecture consists of two components, a corpus builder and a summarizer. The similarity between a headline (h) and the query (q) is calculated as:

$$\mathrm{Sim}(q, h) = \frac{|KW(q) \cap KW(h)|}{|KW(q) \cup KW(h)|}$$

KW(q) denotes the keywords in the query and KW(h) the keywords in the headline. The score for each document is calculated using frequency measures. The technique produces efficient summaries for web-based data. Because it is an extractive technique, long sentences are not properly decomposed. The technique also cannot handle documents that are unrelated to each other, i.e. related documents are a necessity for applying it.
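The query-headline similarity can be read as the ratio of shared keywords to all keywords of the two texts. A minimal sketch of this reading follows; the keyword extraction (lower-casing, whitespace splitting and a small stop-word list) is an assumption made for this example and is not taken from [6].

```python
# Small illustrative stop-word list (assumption, not from the paper).
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "on", "for", "and", "to"})


def keywords(text, stopwords=STOPWORDS):
    """Return the set of lower-cased words that are not stop words."""
    return {word for word in text.lower().split() if word not in stopwords}


def sim(query, headline):
    """Shared keywords divided by all keywords of query and headline."""
    kq, kh = keywords(query), keywords(headline)
    union = kq | kh
    return len(kq & kh) / len(union) if union else 0.0
```

For the query "india election results" and the headline "Election results announced in India", three of the four distinct keywords are shared, giving a similarity of 0.75.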

Multi-Document Summarization

Ramakrishna Varadarajan and Vagelis Hristidis in [7] described a query-specific multi-document summarization method which uses the minimum spanning tree concept to find summaries of document(s). A set of documents is taken and a document graph is built for each document (d), in which each node corresponds to a text fragment. A semantic approach is followed in which a delimiter is chosen to create the text fragments. After the graph nodes are created, the association degree between each pair of nodes is computed; this is the score (weight) of the edge between them. Only edges whose weight is greater than or equal to a threshold value are considered.

Given the document graph G and the query Q as input, a summary T (a subtree of G) is assigned a score Score(T) by combining the scores of the nodes and edges belonging to T. T should correspond to a minimum spanning tree of the document graph G.

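$$\mathrm{Score}(T) = a \sum_{e \in T} \frac{1}{\mathrm{EScore}(e)} + b \cdot \frac{1}{\sum_{v \in T} \mathrm{NScore}(v)}$$

where a and b are constants, e ranges over the edges of T and v over its nodes. Moreover, algorithms such as the Multi-Result Enumeration algorithm and the Multi-Result Expanding Search algorithm are used in this method for efficient summary computation.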

Ulukbek Attokurov and Ulug Bayazit proposed a multi-document summarization technique in [8] using the BFOS and HAC algorithms. They preferred the BFOS algorithm because the optimal trees it generates yield the best trade-off between semantic distortion and rate. Summary generation follows the system described in Fig. 5:

Fig.5 Basic summary generation flow

In the preprocessing step of this technique, the source documents are represented in vector space and the sentences that occur in more than one document are extracted. The sentences of the document set are then represented by a sentence × term matrix with n columns and m rows, where n is the number of sentences and m is the number of terms in the feature set. The matrix elements are TF-IDF weights, which reflect the significance of each element:

$$\text{TF-IDF} = \mathrm{TF} \cdot \log(N / \mathrm{DF})$$

where TF is the term frequency, DF is the document frequency and N is the number of sentences. Redundancy is then detected using the Hierarchical Agglomerative Clustering (HAC) algorithm: similar clusters are merged to form a new cluster, and the algorithm runs until a single cluster remains. The tree built in this way is the HAC tree. The detected redundancy is then eliminated by applying the BFOS algorithm to the HAC tree. The distortion contribution of each cluster node is:

$$D = \sum_{s \in \text{cluster}} d(\mathit{rs}, s)$$

where d is the distance between the representative sentence (rs) and a sentence (s) in the cluster. The node with the minimum lambda value is identified and pruning is done at that node. Finally, a summary is created by selecting a threshold based on rate.

Techniques of this type perform better with additional systems (LSI).
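The TF-IDF weighting and the cluster distortion described for this technique can be sketched as follows. This is a simplified illustration, not the implementation of [8]: the Euclidean distance, the choice of the representative sentence as the cluster member closest to all others, and the function names are assumptions. The matrix is also stored with one row per sentence, i.e. as the transpose of the layout described in the text.

```python
import math
from collections import Counter


def tfidf_matrix(sentences):
    """Return (vocabulary, rows) with TF * log(N / DF) weights.

    sentences is a list of token lists; N is the number of sentences and
    DF is the number of sentences containing a term.
    """
    n = len(sentences)
    vocab = sorted({word for tokens in sentences for word in tokens})
    df = Counter(word for tokens in sentences for word in set(tokens))
    rows = []
    for tokens in sentences:
        tf = Counter(tokens)
        rows.append([tf[word] * math.log(n / df[word]) for word in vocab])
    return vocab, rows


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster_distortion(vectors, distance=euclidean):
    """Sum of distances from the representative sentence to every member.

    Assumption: the representative is the member with the smallest total
    distance to the other members of the cluster.
    """
    rep = min(vectors, key=lambda r: sum(distance(r, s) for s in vectors))
    return sum(distance(rep, s) for s in vectors)
```

A cluster of near-duplicate sentences has a small distortion, so replacing it by its representative sentence loses little information; this is the kind of trade-off the pruning step evaluates.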

Anjali R. Deshpande and Lobo L.M.R.J. in [9] worked on a technique for multi-document summarization using clustering. This technique produces extractive summaries. A set of documents and a query are taken as input to the summarizer. A list of maps is maintained in which each term from the document collection is stored, together with its number of occurrences and its synonyms. The WordNet dictionary is used to obtain the synonyms.

Query strengthening is done by preprocessing the query and appending the synonyms of the query terms to the query itself. This results in better summaries. Term frequency and inverse document frequency are used to score the sentences. Cosine similarity is then used to generate appropriate document clusters, and the sentences within every document cluster are clustered in turn. Summaries are generated according to the calculated scores: the best-scored sentences are included in the summary.

The cluster size decides the number of sentences to be selected. Sentence clusters are sorted in reverse order of group (cluster) score before being added to the summary. Because of the clustering, this technique reduces redundancy very efficiently.
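Query strengthening and the cosine similarity used for clustering can be sketched as follows. The synonym table below is a hypothetical stand-in for WordNet, and the raw term counts stand in for the TF-IDF scores; neither is taken from [9].

```python
import math
from collections import Counter

# Hypothetical toy synonym table; the reviewed technique uses WordNet.
SYNONYMS = {"car": ["automobile"], "fast": ["quick", "rapid"]}


def strengthen_query(query_tokens, synonyms=SYNONYMS):
    """Append the synonyms of each query term to the query itself."""
    expanded = list(query_tokens)
    for term in query_tokens:
        for synonym in synonyms.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (raw term counts)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0
```

The query "fast car" has no word in common with the sentence "a quick automobile", so their similarity is zero; after strengthening, the query also contains "quick" and "automobile" and the similarity becomes positive.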

Marina Litvak and Natalia Vanetik in [10] derived a summarization technique based on tensor decomposition. It is an extractive, multi-document summarization method that places the significant sentences in the summary.

Suffix Tree Clustering may be used for clustering. Documents are given as input and preprocessed. The sentences in the documents are represented by real-valued vectors of tf-idf weights. Preprocessing may be done to reduce the dimensionality of the tensor representation. To avoid redundancy in the extracted sentences, clustering is performed on all the sentences of the document set.

Fig.6 tensor is made

Referring to Fig. 6, the document set is first converted into a tensor. A 3-order tensor is used, in which the first dimension represents terms, the second represents clusters of sentences and the third represents documents. Each tensor entry is defined using tf-idf weights. A Tucker decomposition is then performed; the technique used is Higher-Order Singular Value Decomposition (HOSVD), which yields orthogonal singular vectors in each dimension. The tensor is approximated by the product of a smaller core tensor and rank-one tensors. Finally, the summary may be generated by extracting centroid sentences from the clusters, taking into account the higher and lower rankings. Cosine similarity is used to calculate the distance between sentences. This unsupervised technique is based on a greedy approach.
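For reference, the general form of a Tucker decomposition of a 3-order tensor can be written as below. The symbols are introduced here for illustration and are not taken from [10]: X is the terms × clusters × documents tensor, G is the core tensor, and U^(1), U^(2), U^(3) are the factor matrices, which in HOSVD have orthonormal columns obtained from the left singular vectors of the corresponding mode-n unfoldings of X.

```latex
\mathcal{X} \in \mathbb{R}^{I \times J \times K}, \qquad
\mathcal{X} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)}

% Core tensor obtained by projecting X onto the factor matrices
\mathcal{G} = \mathcal{X} \times_1 U^{(1)\top} \times_2 U^{(2)\top} \times_3 U^{(3)\top}

% Equivalent element-wise form: a weighted sum of rank-one tensors
x_{ijk} \approx \sum_{p} \sum_{q} \sum_{r} g_{pqr}\, u^{(1)}_{ip}\, u^{(2)}_{jq}\, u^{(3)}_{kr}
```

Here ×_n denotes the mode-n product of a tensor with a matrix. Keeping only the leading columns of each factor matrix gives the smaller core tensor mentioned above.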

IV. COMPARISON

In comparing the techniques reviewed above, various categorizing attributes have been considered. TABLE I compares the single-document summarization techniques and TABLE II compares the multi-document summarization techniques.

TABLE I

COMPARISON OF SINGLE-DOCUMENT SUMMARIZATION TECHNIQUES

| Researcher(s) | Generic / Query-Specific | Extractive / Abstractive | Techniques | Strength(s) | Weakness(es) |
|---|---|---|---|---|---|
| Y.S. Reddy, A.P.S. Kumar 2012 [3] | Generic | Abstractive | Nearest-Neighbour Search Technique | More effective summaries | Not retaining sentences as in original document |
| N. Chatterjee, S. Mohan 2007 [4] | Generic | Extractive | Random Indexing | Better results than commercial summarizers | Some abruptness in summaries |
| Yiu-Kai D. Ng, M.S. Pera 2009 [5] | Generic | Extractive | CorSum, Naïve Bayes Classifier | High quality summaries | No extractors and selectors for better result |
| S. Malhotra, A. Dixit 2013 [6] | Query-based | Extractive | Corpus-Builder and Summarizer | Efficient for web-based data | 1) No proper decomposition of long sentences 2) May not handle unrelated sentences |

TABLE II

COMPARISON OF MULTI-DOCUMENT SUMMARIZATION TECHNIQUES

| Researcher(s) | Generic / Query-Specific | Extractive / Abstractive | Techniques | Strength(s) | Weakness(es) |
|---|---|---|---|---|---|
| R. Varadarajan, V. Hristidis 2006 [7] | Query-based | Extractive | Minimum spanning tree concept | Good linking between text fragments, leading to better summaries | Not using hyperlinks |
| U. Attokurov, U. Bayazit 2014 [8] | Generic | Extractive | Distortion-Rate Ratio Technique | Performs better with additional systems | Pruning dependent on distortion measure |
| A.R. Deshpande, Lobo L.M.R.J. 2013 [9] | Query-based | Extractive | Clustering technique | Reduced redundancy | Not multi-lingual |
| M. Litvak, N. Vanetik 2014 [10] | Generic | Extractive | Tensor-Decomposition Technique | Multi-lingual | Not extended to all IR domains |

V. CONCLUSIONS

A summary should describe the source document(s) completely and efficiently. It has been observed that single-document summarization techniques such as [6] deal with less data, so they may be used by fewer applications, whereas multi-document techniques can be used in a large number of applications. It has also been observed that some single-document techniques can be extended to multi-document techniques, but in many cases this introduces redundancy. It has been observed in [7] and [9] that the efficiency of summarization can also be increased by query-specific techniques, because they give user-specific summaries. In most cases, the efficiency of the generated summary depends entirely on the technique being used. This paper has therefore reviewed various techniques designed by researchers to generate summaries.

REFERENCES

[1] www.textcompactor.com

[2] Chin-Yew Lin, “ROUGE: A Package for Automatic Evaluation of Summaries”, ACL, Barcelona, Spain, 2004.

[3] Y. Surendranadha Reddy and A.P. Siva Kumar, Volume 2, Issue 7, International Journal of Advanced Research in Computer Science and Software Engineering, 2012.

[4] Niladri Chatterjee, Shiwali Mohan, “Extraction-Based Single-Document Summarization Using Random Indexing,” 19th IEEE International Conference on Tools with Artificial Intelligence, 2007.

[5] Maria Soledad Pera, Yiu-Kai Ng, “Classifying Sentence-Based Summaries of Web Documents,” 21st IEEE International Conference on Tools with Artificial Intelligence, 2009.

[6] Shilpi Malhotra, Ashutosh Dixit, International Journal of Computer Applications, Volume 75– No.17, 2013.

[7] Ramakrishna Varadarajan and Vagelis Hristidis, “A System for Query-Specific Document Summarization”, CIKM, November 5–11, Arlington, Virginia, USA, 2006.

[8] Ulukbek Attokurov, Ulug Bayazit, “Multi-Document Summarization Using Distortion-Rate Ratio”, Student Research Workshop, pages 64–70, Baltimore, Maryland USA, 2014.

[9] Anjali R. Deshpande, Lobo L. M. R. J., “Text Summarization using Clustering Technique”, International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 8, 2013.

[10] Marina Litvak and Natalia Vanetik, “Multi-document Summarization using Tensor Decomposition”, 2014.