
Thesis Contribution

In this thesis, we make the following contributions:

Word importance estimation: We introduce a rich set of novel features that indicate the importance of words in the input documents. Apart from frequency- and location-based features, we include part-of-speech tags, named entity tags, reinforcement information from machine summaries, semantic categories, and the intrinsic importance of words; many of these features are not widely used in prior work on this task. To evaluate the estimates of word importance, we introduce the task of identifying words that appear in human summaries (summary keywords). Based on the proposed features, we present a logistic regression model to find summary keywords, which achieves better performance than prior methods of estimating word importance. We also introduce a novel method to evaluate summary keyword identification, inspired by the Pyramid method. We present a thorough study of the effects of the features.

Comparing summarization systems: We study how different word weighting methods affect final summary quality. We present a greedy extractive summarizer, whose modularity and transparency make it easy to compare different weighting methods. We show that the best method of identifying summary keywords is also the best one for summarization, and that it achieves performance comparable to the state of the art for generic summarization of news. We also discuss whether real-valued or binary weights should be used in summarization systems. We show that binary weights work better if the weights are estimated by unsupervised methods, while real-valued weights work better if the weights are learned by our proposed supervised model. We present a new repository that includes summaries from six state-of-the-art and six competitive baseline systems. Our repository addresses the problem that different systems were evaluated on different datasets, based on different ROUGE metrics. It also makes it feasible to compare summaries from recent systems, and future researchers can use the summaries in our repository for comparison. Our experiments show that summaries from automatic systems have very low overlap in terms of content.
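To illustrate how a greedy extractive summarizer can keep the word weighting method separate from sentence selection, we give a minimal sketch below. It is not the summarizer evaluated in this thesis: the scoring function (mean weight of the words in a sentence), the redundancy check, and the names `greedy_summarize`, `weights`, `limit`, and `max_overlap` are all assumptions made for illustration. The `weights` dictionary may hold either binary or real-valued weights.

```python
def greedy_summarize(sentences, weights, limit=100, max_overlap=0.5):
    """Greedily pick high-scoring sentences until the word limit is reached.

    sentences:   list of sentence strings (hypothetical input format)
    weights:     dict mapping a lowercased word to its weight (binary or real-valued)
    limit:       maximum summary length in whitespace-separated words
    max_overlap: assumed redundancy threshold; a sentence is skipped if more than
                 this fraction of its distinct words is already in the summary
    """
    def tokens(sentence):
        return sentence.lower().split()

    def score(sentence):
        toks = tokens(sentence)
        if not toks:
            return 0.0
        # Assumed scoring: mean word weight, with unknown words weighted 0.
        return sum(weights.get(t, 0.0) for t in toks) / len(toks)

    summary, used_words, length = [], set(), 0
    for sentence in sorted(sentences, key=score, reverse=True):
        toks = tokens(sentence)
        if not toks or length + len(toks) > limit:
            continue  # empty sentence, or it would exceed the length limit
        distinct = set(toks)
        if summary and len(distinct & used_words) / len(distinct) > max_overlap:
            continue  # too redundant with what has already been selected
        summary.append(sentence)
        used_words |= distinct
        length += len(toks)
    return summary
```

Because the weights are passed in as a plain dictionary, swapping one weighting method for another does not require any change to the selection loop.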

Extracting and applying global knowledge: We study the use of global knowledge, an aspect ignored in many summarization systems developed in recent years. We propose two methods of mining global knowledge: (1) using dictionaries that group words into categories, and (2) computing the intrinsic importance of words by analyzing the summary-article pairs from a large corpus. We find that certain words or categories tend to be globally important or unimportant. Moreover, we test the effects in three tasks related to content selection: summary keyword identification, summarization, and summary combination, where we observe a small improvement on all three tasks. In addition, we show that the intrinsic importance of words is very helpful in identifying words that have low frequency in the input.

System combination: We present a new framework of system combination for multi-document summarization (SumCombine). We first generate candidate summaries by combining whole sentences from different systems. We show that summary combination is very promising, based on an analysis of the best choice among the candidates. To select among these candidates, we employ a supervised model that predicts the informativeness of the entire summary. Our model relies on a rich set of novel features that capture content importance from different perspectives, based on various sources. Ablation experiments verify the efficacy of our features. SumCombine generates better summaries than the best summarizer when combining short summaries and achieves performance comparable to the state of the art on multiple DUC/TAC datasets. We also discuss why our model fails when combining longer summaries.

We investigate factors that affect the success of system combination. Our main study focuses on properties of the basic systems (macro-level). We show that it is critical to combine systems that have similar performance: if the basic systems perform similarly, even very simple methods outperform the basic systems; if some basic systems are much inferior to others, then methods based on consensus between summaries cannot achieve good performance. We also show that for combination, it is easier to improve over low-performing basic systems than high-performing ones. This implies that a combination method that achieves a large improvement by combining low-performing systems might not be very effective. Moreover, we show that our model proposed in Chapter 6 is the most effective combination method, while selecting the summary with the smallest input-summary Jensen-Shannon divergence is a strong baseline.
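The Jensen-Shannon baseline mentioned above can be sketched as follows. This is a minimal illustration rather than the exact configuration used in our experiments: it assumes unsmoothed unigram distributions over lowercased, whitespace-separated tokens, and the function names `word_distribution`, `js_divergence`, and `select_by_js` are hypothetical.

```python
import math
from collections import Counter


def word_distribution(text):
    """Unigram probability distribution over lowercased whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two word distributions."""
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution M = (P + Q) / 2
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


def select_by_js(input_text, candidates):
    """Return the candidate summary with the smallest divergence from the input."""
    p = word_distribution(input_text)
    return min(candidates, key=lambda s: js_divergence(p, word_distribution(s)))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared words), so lower values indicate a summary whose word distribution is closer to that of the input.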

We have also conducted a preliminary study based on the data and the basic systems in Chapter 6. This study focuses on properties of the basic summaries (micro-level). We observe a significant relation between diversity and mean quality of the basic summaries, which might be helpful to predict the difficulty of summarizing an input (Section 7.1.2). Surprisingly, we find that diversity is not a factor that affects the success of system combination on our data. Future research may investigate whether these findings hold in general.

Chapter 2

Data and Evaluation

We describe the data used in the thesis in Section 2.1 and the methods for evaluating content selection quality in Section 2.2.

2.1 News Data from DUC and TAC

We focus on multi-document summarization, which produces a summary of a set of related documents on a given topic. We mainly focus on generic summarization, where the task is to produce a summary based on the input documents alone; the topic is assumed not to be given during summarization. We also investigate topic-based summarization. For this task, a system is given a set of documents as well as a topic statement, and is expected to provide a summary that addresses the statement.¹

We perform our analyses on data from the multi-document summarization task of the Document Understanding Conference (DUC) between 2001 and 2007 (Over et al., 2007) and from the Text Analysis Conference (TAC) in 2008 and 2009. These conferences are organized by the National Institute of Standards and Technology (NIST). The tasks in DUC 2001–2004 are generic summarization of newswire articles, while the tasks in DUC 2005–2007 and TAC 2008 and 2009 are topic-based summarization of news. We use the DUC 2003 and 2004 datasets in Chapters 3, 4, and 5 and all nine datasets in Chapter 6. The DUC 2001–2004 and TAC 2008 and 2009 datasets are used in Chapter 7.

¹ Topic-based summarization can be regarded as a variation of query-focused summarization (Nenkova and McKeown, 2011), which generates a summary that answers a query. In this thesis, we use these two terms interchangeably.

Year                            2001            2002        2003  2004
Number of input document sets   30              59          30    50
Number of documents per set     6–16            5–15        10    10
Number of human summaries       3               2           4     4
Summary length (words)          50/100/200/400  50/100/200  100   100

Year                            2005   2006  2007  2008  2009
Number of input document sets   50     50    45    48    44
Number of documents per set     25–50  25    25    10    10
Number of human summaries       4–9    4     4     4     4
Summary length (words)          250    250   250   100   100

Table 2.1: Description of the generic (top) and topic-based (bottom) multi-document summarization datasets from the DUC 2001–2007 and TAC 2008–2009 workshops.

Each summarization problem was created by an expert who collected a group of related newswire articles on the same event. For topic-based summarization, the expert who collected the documents also created the topic statement. The automatic summarization systems then generate summaries up to a certain number of words.² Summaries over the length limit are automatically truncated.³ To facilitate evaluation, NIST assessors create summaries that are about the same length as the machine summaries. In Section 2.2, we show how the quality of machine summaries is evaluated using the manually generated summaries.
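The truncation step can be sketched as below. This is a minimal illustration that assumes whitespace tokenization; the function name `truncate_to_words` and its default limit are hypothetical and are not taken from the official evaluation scripts.

```python
def truncate_to_words(summary, limit=100):
    """Keep at most `limit` whitespace-separated words of a summary."""
    tokens = summary.split()
    # Rejoining with single spaces also normalizes any irregular whitespace.
    return " ".join(tokens[:limit])
```

Applying the same word limit to every system's output keeps summary length constant across systems, so that differences in automatic scores are not caused by differences in length.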

² An exception is the DUC 2004 evaluation, where summaries of up to 665 bytes (around 100 words) were required. Under a byte limit, summaries from different systems are truncated to different numbers of words for evaluation. This is problematic, since variation in length affects automatic evaluation results. Therefore, later work (as well as ours) mostly truncates the summaries to 100 words when evaluating on this data.

Topic: The 1998 NBA lockout

Human summary

In a dispute over a new collective bargaining agreement the National Basketball Association owners declared a lockout on July 1, 1998. They wanted to discard a clause in the old agreement allowing teams to pay their own free agents whatever they wanted, substituting a hard salary cap. The players wanted to keep earning as much as possible. On Oct. 5 all 114 preseason games were cancelled. The players then proposed a 50% tax on salaries above $18 million that the owners rejected. On Oct. 13 the NBA cancelled the first two weeks of the regular season. By Oct. 21 the entire season seemed in jeopardy in the interests of the best paid.

Machine summary

The decision to cancel 99 games between Nov. 3 and Nov. 16 came after the players association proposed the implementation of a tax system instead of a hard salary cap, a proposal the owners said they would respond to by Friday. In a critical ruling for the North American National Basketball Association and the players union, arbitrator John Feerick decides Monday whether more than 200 players with guaranteed contracts should be paid during the lockout. Last year, the players received about $1 billion dollars in salaries and benefits and we have made proposals that are guaranteed to increase that number by 20 percent over the next four years, Granik said in a prepared statement.

Topic: White supremacists

Narrative/Topic statement: Describe the widespread activities of the white supremacists and the efforts of those opposed to them to prevent violence.

Human summary

White supremacists often travel cross-country staging protests. An Arkansas-based group protested in Boston. A Virginia-based group marched in Toledo, Ohio. Neo-Nazis urged followers to travel to Crawford, Texas to protest the Iraq war. One neo-Nazi leader plotted to kill a federal judge. White supremacists spread their word through books and internet postings, often quoting King and other civil rights leaders to advance their own agendas. To avoid violent clashes, community leaders have pleaded for calm, staged peace rallies and delayed announcing protest routes. A national watchdog group monitors warehouse activities of online retailer Aryan Wear in Dallas-Fort Worth, Texas.

Machine summary

White supremacists clashed with an angry crowd outside Faneuil Hall, where Holocaust survivors and their families were commemorating the liberation of Nazi concentration camps. A white Republican lawmaker who contends he was excluded from a Black legislative group solely because of his race said, in September 2005, that the group is even more racist than the Ku Klux Klan. We were protesting black racial violence against white people in that neighborhood, said White. Navarre said the riots escalated because members of the National Socialist Movement took their protest to the neighborhood, which is predominantly black, instead of a neutral place.

Table 2.2: Human and machine summaries for a generic and a topic-based summarization problem. Sample input documents for these two problems are provided in Appendix A.

Table 2.1 provides the basic statistics of our datasets: the number of input document sets (inputs), the number of documents per input, the number of human summaries per input, and the length limit of the output. We also provide examples of human and machine summaries for a generic as well as a topic-based summarization problem in Table 2.2. It is easy to tell from the examples that the human summaries have better content and linguistic quality.

Note that the DUC 2007 and TAC 2008 and 2009 shared tasks all include a main task and an update summarization task. The data we describe are from the main task. The update task requires summarizers to produce summaries under the assumption that the abstractor has already read a set of earlier articles. Note also that we do not use the TAC 2010 and 2011 data, because they were created for guided summarization: a task where the summarizer should produce summaries that include all aspects specified in the guidance of the summarization problem.

Our methods are evaluated on newswire articles. These methods may not be appropriate if the documents to be summarized are from other domains (e.g., medical records, legal text, meeting transcripts). Indeed, documents from different domains have different structures and properties. For example, for scientific articles, abstracts and conclusions often summarize the contribution of the paper. Systems that summarize a scientific article may also utilize the snippets that cite this article in other papers (Qazvinian and Radev, 2008; Mohammad et al., 2009; Xu et al., 2015). Systems that summarize articles in medical domains often utilize large-scale knowledge resources (e.g., the Unified Medical Language System (UMLS) (Bodenreider, 2004)) to identify medical terms in the input documents. Such domain-specific information helps identify content that should be included in the summary.