Mining topical relevant patterns for multi-document summarization

(1)

This is the author’s version of a work that was submitted/accepted for

pub-lication in the following source:

Wu, Yutong

,

Gao, Yang

,

Li, Yuefeng

,

Xu, Yue

, & Chen, Meihua

(2015)

Mining topical relevant patterns for multi-document summarization. In

2015 IEEE/WIC/ACM International Conference on Web Intelligence and

Intelligent Agent Technology (WI-IAT)

, IEEE, Singapore, pp. 114-117.

This file was downloaded from:

http://eprints.qut.edu.au/94109/

c

Copyright 2015 IEEE

Notice:

Changes introduced as a result of publishing processes such as

copy-editing and formatting may not be reflected in this document. For a

definitive version of this work, please refer to the published source:

(2)

Mining Topical Relevant Patterns for Multi-document

Summarization

Yutong Wu, Yang Gao, Yuefeng Li, Yue Xu Faculty of Science and Engineering

Queensland University of Technology, Brisbane, Australia

Email: [email protected], {y21.gao, y2.li,

yue.xu}@qut.edu.au

Meihua Chen

Hubei University of Technology, Wuhan, China

Abstract—Multi-document summarization addressing the problem of information overload has been widely utilized in the various real-world applications. Most of existing approaches adopt term-based representation for documents which limit the performance of multi-document summarization systems. In this paper, we proposed a novel pattern-based topic model (PBTMSum) for the task of the multi-document summarization. PBTMSum combining pattern mining techniques with LDA topic modelling could generate discriminative and semantic rich representations for topics and documents so that the most representative and non-redundant sentences can be selected to form a succinct and informative summary. Extensive experiments are conducted on the data of document understanding conference (DUC) 2007. The results prove the effectiveness and efficiency of our proposed approach.

Keywords—multi-document summarization; pattern mining; topic model

I. INTRODUCTION

Nowadays, we are facing the problem of information overload. The ever-increasing huge amount of textual documents needs to be analyzed and summarized efficiently and effectively. Thus, the demand for extensive study in the automatic multi-document summarization is growing rapidly as well. Multi-document summarization summarizes information from multiple documents which share similar topics. It can help users quickly sift through large text data collections, catch the most relative or important information. Multi-document summarization has been applied in a wide range of domains, from the traditional such as newswire or scientific articles summarization, to novel domains such as literary text, patents, or blog post summarization and twitters analysis.

Multi-document summarization methods can be either extractive or abstractive. In this study, we focus on extractive summarization methods. While abstractive methods attempt to build an internal semantic representation based on the understanding of the main concepts in a document, then use natural language generation techniques to create a summary which retells those concepts in fewer words and clear natural language, the extractive approaches avoid any efforts of deep understanding of natural language and produce summary by selecting existing most important sentences in the original text, so they are conceptually simple and more practicable. Even though there are numerous extraction-based approaches having been proposed for document summarization to date, extractive methods can only produce much less accurate summaries compared to human made summaries. Most of them adopt

term-based representation for documents, which are ineffective for capturing correlations among lexical items and building an internal semantic representation. This is the bottleneck in extraction-based multi-document summarization researches.

In recent years, Bayesian topic models, which exhibit an impressive sophistication and representation power, have become one of the most popular extraction-based summarization approaches. Bayesian topic models use structured probabilistic topic models to represent document content [1]-[4]. Bayesian topic model based approaches taking account of semantic associations outperform other statistic-based approaches which only count on shallow features of documents. Different from other approaches which treat the multi-documents to be summarized as one long text without document boundaries, the topic model approaches have an explicit representation for every individual document, so that they can capture information that is missed in most of the other approaches. Although successful, focusing little on redundancy and coherence issues, topic representation approaches usually ignore the contextual information of words. Celikyilmas and Hakkani-Tur [5] proposed two-tiered topic model (TTM) to discover hierarchical latent structure of multi-documents via characterizing general and specific words separately into high-level topics (concepts) and low-high-level topics. TTM can extract topically coherent sentences based on a discovery of hierarchical topics and their correlations. However, TTM did not consider the word dependency. Most recently, Yang, Wen, Kinshuk, Chen and Sutinen [6] proposed a Bayesian hierarchical topic model that can capture both the hierarchies and the word dependencies over latent topics. In this approach, the contextual information is represented as a probabilistic distribution by training a large data corpus before summarizing documents, so that the summarization systems have context feature embedded systemically. Even though it still has limitations on performance and sentence coherence problem, this proposed contextual topic model has shown how a novel topic model can be adapted to solve the problem of multi-document summarization.

Pattern mining is one of most important technique in data mining area, which can discover interesting relations between data examples. Pattern mining techniques have been widely adopted to solve text mining problems for many years [7]-[9]

since 1993 when it was first introduced by Agrawal, Imieliński

and Swami [10]. Since the patterns usually carry more semantic meaning than words, it has been proved that pattern mining is adept at discovering hidden correlations that

(3)

frequently occur in the analyzed data. On the other hand, there are very few attempts [11] to adapt well-established pattern mining techniques for automatic document summarization. The usage of pattern mining should be investigated further to prove that the potential power of patterns discovered could help to break through the bottleneck of automatic multi-document summarization.

In this paper, we propose a novel multi-document summarization approach, which combines pattern mining techniques with topic modelling to automatically generate discriminative and semantic rich representations for topics and documents. Considering a pattern-based topic model may have a potential in building up a proper topic representation of the text document, we can assume that exploiting pattern-based topic model in document summarization would outperform purely term-based approaches. In order to adapt pattern-based topic model for use in extractive multi-document summarization, we firstly attempt to adapt pattern-based topic model to represent each sentence in the document to make it suitable for extractive summary which should be composed of the most representative and non-redundant sentences to capture the major topics in the documents set. Then, with the respect to the discovered correlations and semantic meaning of sentences, we score every sentence in documents set and select the most informative sentence based on maximum coverage and least redundancy to form the summary.

The proposed research work will provide two main contributions to the field of automatic multi-document summarization: (i) building a novel pattern-based topic model to represent documents with the most representative and discriminative patterns; such model can carry more semantic meanings, correlation and context information; (ii) creating a set of methods to measure the importance and information coverage of every sentences, with respect to the proposed model, i.e. generated pattern-based topic representations.

Some preliminary experiments on summarization benchmark data sets DUC 2007 have been conducted to demonstrate the effectiveness of our proposed approach. The results have shown that the proposed method performs better than average performance of the extraction-based multi-document summarization systems.

II. PATTERN-BASED TOPIC MODEL FOR MULTI-DOCUMENT SUMMARIZATION

A. Overview

Inspired by an information filtering model, the Maximum matched Pattern-based Topic Model (MPBTM) [12], the proposed model namely PBTMSum attempts to exploit pattern-based topic model to automatically generate discriminative and semantic rich representations for multi-document summarization. PBTMSum combines the pattern mining techniques with Latent Dirichlet Allocation (LDA) topic modelling to capture not only the context information of the words and sentences but also the hidden correlations among them. For extraction-based multi-document summarization task, PBTMSum firstly generates pattern-enhanced topic representation for sentences and given document set, then score the saliency of every sentence based on sentence’s

pattern-enhanced topic representation and select the most important sentences based on maximum coverage and least redundancy to form the final summary. The architecture of our approach is shown in Fig.1.

B. Pattern-enhanced Topic Representation for Sentences and Documents

In this phase, the input of multiple documents i.e. a document set sharing similar topics will be represented as a mixture of topics, and each topic has an explicit pattern-enhanced representation for every individual sentence. Such detailed representation makes it easier to convey the similarities and differences among the different sentences and documents, so that better summarizers can be developed. Moreover, pattern-enhanced representations are more semantically meaningful and more accurate than word-based representations, and can reveal the association between words [12].

1) LDA topic modelling and topic-sentence representation

First step of pattern-based topic modelling is to construct a transactional table from the LDA model results of the document collection [12]. The rows of such table are sentences instead of documents in MPBTM; the columns are a set of terms which are assigned to topic by LDA. So each row representing a sentence contains the words which are in

sentence and assigned to topic by LDA. This

topic-sentence transactional table for topic can be denoted as Γ.

2) Pattern discovery and pattern-enhanced topic-sentence representation

In proposed approach, we use frequent closed patterns [13] generated from each sentence-based transactional data table to represent identified topic for given document set. In current stage, we choose frequent closed pattern other than frequent pattern because frequent closed pattern are considered more

Fig 1 Pattern-based Topic model for Multi-document Summarization (PBTMSum)

Multiple Documents

Pattern-based Topic Modelling Intermediate Topic Representation Latent Dirichlet Allocation

Topic Modelling Topic-sentence Representation Topic-sentence-based Pattern Discovery Pattern-sentence Topic Interpretation Sentence Scoring Sentence Selection Multi-document Summary Redundancy Removing Greedy Algorithm Pattern-based Topic Significance

(4)

effective and efficient to represent topics than frequent patterns. We adopt pattern mining techniques here to discover succinct informative patterns which frequently occur in sentences other than in document to make it particularly suitable for the extraction-based summarization. Usually a minimum support threshold should be given to drive such pattern mining and it should be little bit greater than in document-based pattern mining. After pattern discovery, the PBTMSum builds up a pattern-enhanced topic representation

for each identified topic. For example, topic Zcan be

represented by a set of all frequent closed patterns generated

from Γ, denoted as XZ x , x , x , … , x , where m is

the total number of patterns in XZ. XZ called topical relevant patterns is a set of most representative patterns for topic Z.

C. Sentence Scoring

After we generate an intermediate pattern-enhanced topic representation for each identified topic, we measure the saliency of every sentence for its associated topic by exploiting

topic significance [12] to sentence instead of document. Let s

be a sentence, Zbe an identified topic of a document set, PA

be a set of matched patterns for topic Z in sentences, k

1, … , n , and , … , , be the corresponding supports of the matched patterns, and then the topic significance of Z to s is defined as:

TSig Z , s ∑ PA w ∑ PA , (1)

where m is the scale of pattern specificity [12] (we set m =

0.8), and wis the weight of the topic Z.

D. Sentence Selection and Reduandancy Removing

In multi-document summarization task, sentence selection should base on maximal information coverage and least redundancy. To guarantee information coverage, we pool all sentences from every set of top-n ranking sentences for each identified topic. Then to reduce redundancy, we adopt a greedy algorithm applying diversity penalty imposition [14] [15] for sentence selection. The first sentence in the final summary was the one having the highest score in sentences pool. We then add one by one sentence with the highest diversity-penalty-imposed score in remaining sentences using the equation as it was formulated in SRRank [15]. The penalty weight parameter δ here is set to 6 since our experiments gave the best result with it for our PBTMSum model.

III. EXPERIMENT AND EVALUATION

A. Dataset

We evaluated the proposed model on the DUC 2007 dataset, which is open benchmark dataset from Document

Understanding Conference1_{(DUC) for automatic}

summarization evaluation. DUC2007 dataset consist of 45 topics, and each topic contains a topic description and a document set with 25 news articles. The task is to create a summary of no more than 250 words for each document set. DUC2007 provides 4 human written reference summaries of each document set for evaluation.

1_{http://www-nlpir.nist.gov/projects/duc/index.html}

B. Evaluation Metric

We use ROUGE2_{(Recall-Oriented Understudy of Gisting}

Evaluation) toolkit [16] to evaluate the performance of our summarization system. ROUGE has been adopted by DUC as the official evaluation metric for automatic document summarization. ROUGE metrics automatically determine the quality of a summary by counting the number of overlapping units such as the n-gram, word sequences and word pairs between the candidate summary and an ‘ideal’ summary or a set of ‘ideal’ summaries created by humans. In the experiment, we report the ROUGE-1, ROUGE-2 and ROUGE-SU4. ROUGE-1 based on unigram evaluates how well the testing summary is consistent with human judgments [17], the ROUGE-2 measures the overlap of bigrams and ROUGE-SU4 examines the overlap of skip-bigrams with a maximum skip distance of four. Skip-bigrams measures the overlap of any pair words in their sentence order allowing arbitrary gaps between them. In general, the higher the ROUGE scores, more similar to ‘ideal’ summaries, so better summaries.

C. Baseline

In the experiments, we compare our proposed model with two state-of-the-art summarization approaches: Lead [18]: takes the leading sentences one by one (up to 250 words) in the chronologically ordered documents for each document set; CLASSY04 (Clustering, Linguistics, And Statistics for Summarization Yield) [19]: based on a 5-state Hidden Markov Model using one feature, that is the number of signature terms computed based on given clusters in each sentence for sentence selection, and a pivoted QR algorithm to generate a multi-document summary. Lead and CLASSY04 were as simple baseline summarizer and high-performance generic baseline summarizer respectively in DUC2007 main multi-document summarization task.

D. Experimental Results

In the experiment, we consider both topic significance and topic coverage into the selection of summary sentences. For DUC2007 dataset, we choose top 3 topics to be covered in summaries and top 12 ranking sentence for each topic to be candidate for final summaries. We compare the PBTMSum with two baseline models mentioned above using the all 45 document set in DUC2007. The Table I shows the comparison results on DUC2007. In addition, DUC2007 average and best scores (reported in DUC 2007 by the National Institute of Standards and Technology) are listed here for a clear comparison.

Seen from the table, our proposed method (PBTMSum) performed much better than the baseline Lead model and CLASSY04 model, which demonstrates that our proposed methods are effective and efficient for general multi-document summarization. Though the result in current stage does not outperform the best system in DUC 2007 (it should be noted that this best system uses a supervised/manual process for sentence reduction and our approach is purely unsupervised), we believe there is a big room for the improvement in

(5)

exploiting pattern-based topic model for document summarization.

TABLE I. EXPERIMENTAL RESULTS OF ROUGEEVALUATION ON DUC 2007DATA

System

Recall

ROUGE-1 ROUGE-2 ROUGE-SU4

Lead 0.31245 0.06047 0.10508

CLASSY04 0.40541 0.09364 0.14628

DUC Average 0.39728 0.09486 0.14747

DUC Best 0.44502 0.12450 0.17709

PBTMSum 0.42011 0.10727 0.16176

IV. CONCLUSION AND FUTURE WORKS

In this paper we explored well-established pattern mining techniques combining with LDA topic modelling in the context of multi-document summarization area. We proposed a novel approach called PBTMSum which exploits pattern-based topic model to derive discriminative and semantic rich representation of the documents set that captures the topics and hidden correlation among text data. Evaluation results on DUC2007 dataset verify the effectiveness and efficiency of proposed multi-document summarization method, PBTMSum.

In future work, we will perform the following investigations:

(1) We will attempt to adopt our model in query-based

multi-document summarization task, which need to produce a most typical summary reflecting the condensed information closely related to the initial given query or user profile. The success of pattern-based topic modelling in information retrieval has shown the significant but untapped potential for exploiting the pattern-based topic modelling in query-based multi-document summarization.

(2) We will investigate the influences of different

optimizing methods in LDA topics modelling phase.

(3) We will explore the usage of other novel type of

pattern to represent sentences and documents for multi-document summarization task.

(4) We will optimize the algorithm for redundancy

removing to enhance the performance of the PBTMSum model.

(5) We will take co-reference and sentence ordering into

consideration to develop more accurate solution for multi-document summarization with pattern-based topic modelling.

REFERENCES

[1] A. Haghighi and L. Vanderwende, “Exploring content models for multi-document summarization.” In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009, pp. 362-370.

[2] D. Wang, S. Zhu, T. Li, and Y. Gong, “Multi-document summarization using sentence-based topic models.” In Proceedings of the ACL-IJCNLP

2009 Conference Short Papers. Association for Computational Linguistics, 2009, pp. 297-300.

[3] A. Celikyilmaz and D. Hakkani-Tur, “A hybrid hierarchical model for multi-document summarization.” In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 815-824.

[4] H. Daumé III and D. Marcu, “Bayesian query-focused summarization.” In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006, pp. 305-312.

[5] A. Celikyilmaz and D. Hakkani-Tür, “Discovery of topically coherent sentences for extractive summarization.” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 2011, pp. 491-499.

[6] G. Yang, D. Wen, Kinshuk, N.-S. Chen, and E. Sutinen, “A novel contextual topic model for multi-document summarization.” Expert Systems with Applications, vol. 42, no. 3, pp. 1340-1352, 2015. [7] S.T. Wu, Y. Li and Y. Xu, “Deploying Approaches for Pattern

Refinement in Text Mining,” In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), Hong Kong, 2006, pp. 1157-1161.

[8] Y. Li, A. Algarni, and N. Zhong, “Mining Positive and Negative Patterns for Relevance Feature Discovery,” In Proceedings of 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2010), Washington, USA, 2010, pp. 753-762.

[9] Y. Li, A. Algarni, M. Albathan, Y. Shen, and M. Arif Bijaksana, “Relevance Feature Discovery for Text Mining.” IEEE Trans. Knowl. Data Eng, vo. 27, no. 6, pp. 1656-1669, 2015.

[10] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases.” In ACM SIGMOD Record , vol. 22, no. 2, pp. 207-216, June 1993.

[11] E. Baralis, L. Cagliero, S. Jabeen, and A. Fiori, “Multi-document summarization exploiting frequent itemsets,” In Proceedings of the 27th Annual ACM Symposium on Applied Computing. ACM, 2012, pp. 782-786.

[12] Y. Gao, Y. Xu, and Y. Li, “Pattern-based Topics for Document Modelling in Information Filtering.” IEEE Transactions on Knowledge & Data Engineering, vol. 27, no. 6, pp. 1629-1642, 2015.

[13] J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pattern mining: current status and future directions.” Data Mining and Knowledge Discovery, vol. 15, no. 1, pp. 55-86, 2007.

[14] X. Wan, “Using only cross-document relations for both generic and topic-focused multi-document summarizations,” Information Retrieval,

vol. 11, pp.25-49, 2008

[15] S. Yan and X. Wan, “SRRank: leveraging semantic roles for extractive multi-document summarization.” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 2048-2058, 2014. [16] C. Y. Lin, “Rouge: A package for automatic evaluation of summaries.”

In Proceedings of the ACL-04 Workshop on Text Summarization Branches Out. 2004, pp. 74-81.

[17] C. Y. Lin and E. Hovy, “Automatic evaluation of summaries using n-gram co-occurrence statistics.” In Proceedings of the North American Chapter of the Association for Computational Linguistics on Human Language Technology Conference, Association for Computational Linguistics, 2003, pp. 71-78.

[18] M. Wasson, “Using leading text for news summaries: Evaluation results and implications for commercial summarization applications.” In

Proceedings of the 17th_{international conference on Computational}

Linguistics-Volume 2, Association for Computational Linguistics, 1998, pp. 1364-1368.

[19] J. M. Conroy, J. D. Schlesinger, J. Goldstein, and D. P. O’leary, "Left-brain/right-brain multi-document summarization." In Proceedings of the Document Understanding Conference (DUC 2004). 2004.