5.4 Estimating Intrinsic Importance of Words

5.4.1 Deriving the Global Indicators

Below we introduce two methods of computing global indicators, where the second method is an upgraded version of the first. The indicators are computed from 160,001 summary-article pairs in the New York Times corpus, covering 2004 to 2007. The idea is to measure how the probability of each word changes between the original article and its summary. A similar approach was used by Woodsend and Lapata (2012) to identify words that are likely to be avoided in summaries. However, their analysis is based on a very small corpus (i.e., one TAC dataset). Gillick and Dunietz (2014) use the New York Times corpus to generate a labeled corpus for entity salience identification.

Method 1

We build two unigram language models (LMs): one from the original articles (LMG) and one from the summaries (LMS). We use the SRI Language Modeling Toolkit (SRILM) (Stolcke, 2002) with Ney’s absolute discounting (Ney et al., 1994), subtracting a constant of 0.75.⁴ Since SRILM uses white space as the word separator, we tokenize all files with Stanford CoreNLP (Manning et al., 2014) before building the LMs. The probabilities of word w in LMS and LMG are denoted PS(w) and PG(w), respectively.
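As a rough illustration of the smoothing step, the sketch below estimates a unigram model with absolute discounting in plain Python. It is not SRILM. The function name `unigram_absolute_discounting`, the uniform redistribution of the discounted mass over the vocabulary, and the assumption that the input is already tokenized are all illustrative choices, and SRILM may handle the reserved mass differently.

```python
from collections import Counter


def unigram_absolute_discounting(tokens, vocabulary, discount=0.75):
    """Return {word: probability} for every word in `vocabulary`.

    Assumption: the mass removed by discounting is spread uniformly over
    the whole vocabulary, so unseen words receive a small non-zero
    probability. This is a simplification, not SRILM's exact behaviour.
    """
    # Count only in-vocabulary tokens; every counted word has count >= 1.
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no in-vocabulary tokens to estimate from")

    # Each seen word type gives up `discount` counts.
    reserved_mass = discount * len(counts) / total
    uniform_share = reserved_mass / len(vocabulary)

    return {
        w: max(counts[w] - discount, 0.0) / total + uniform_share
        for w in vocabulary
    }
```

With a discount below 1, every seen word keeps a positive discounted count, so the returned probabilities sum to one.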

We compute the intrinsic importance (global indicators) of w from PS(w) and PG(w). Because the corpus is large, we only consider words that appear in at least one article and at least one summary, which yields a total of 128,381 words. We introduce five global indicators. Score1(w) is the probability of w in the human summaries; intuitively, words that appear more often in summaries are likely to be important. Score2(w) and Score3(w) are, respectively, the difference and the ratio between PS(w) and PG(w). Score4(w) uses a formula that resembles Kullback-Leibler (KL) divergence; it is based on the hypothesis that summary-biased words tend to have a higher PS(w) and a larger difference or ratio between PS(w) and PG(w). Similarly, Score5(w) characterizes the unimportant (input-biased) words. We regard Score4(w) and Score5(w) as our main metrics.

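\begin{align}
\mathrm{Score}_1(w) &= P_S(w) \tag{5.1} \\
\mathrm{Score}_2(w) &= P_S(w) - P_G(w) \tag{5.2} \\
\mathrm{Score}_3(w) &= \frac{P_S(w)}{P_G(w)} \tag{5.3} \\
\mathrm{Score}_4(w) &= P_S(w) \cdot \ln \frac{P_S(w)}{P_G(w)} \tag{5.4} \\
\mathrm{Score}_5(w) &= P_G(w) \cdot \ln \frac{P_G(w)}{P_S(w)} \tag{5.5}
\end{align}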
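The five indicators can be computed directly from the two probability tables. The following is a minimal sketch, assuming PS and PG are available as Python dictionaries mapping words to probabilities; the names `global_indicators` and `top_words` are hypothetical helpers, not part of the original pipeline.

```python
import math


def global_indicators(p_s, p_g):
    """Compute Score1..Score5 for every word present in both models.

    p_s: {word: probability in the summary LM}
    p_g: {word: probability in the article LM}
    """
    scores = {}
    for w in p_s.keys() & p_g.keys():
        ps, pg = p_s[w], p_g[w]
        if ps <= 0 or pg <= 0:
            continue  # ratio and logarithm are undefined for zero probability
        scores[w] = {
            "score1": ps,                        # Eq. 5.1
            "score2": ps - pg,                   # Eq. 5.2
            "score3": ps / pg,                   # Eq. 5.3
            "score4": ps * math.log(ps / pg),    # Eq. 5.4
            "score5": pg * math.log(pg / ps),    # Eq. 5.5
        }
    return scores


def top_words(scores, key, k=30):
    """Return the k words with the highest value of the given indicator."""
    return sorted(scores, key=lambda w: scores[w][key], reverse=True)[:k]
```

Calling `top_words(scores, "score4")` then gives a ranked list of summary-biased words, and `top_words(scores, "score5")` a ranked list of input-biased words.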

Table 5.4 shows the top words and the top content words ranked by the five global indicators. Here we briefly discuss the content words. Words that tend to be used in summaries, characterized by high Score4(w), include locations (e.g., York, NJ, Iraq), people’s names (e.g., Bush, John), abbreviations (e.g., pres, corp, dept) and verbs of conflict (e.g., contends, dies). Some of the top-ranked words, such as Iraq, Bush and John, are related to major events that happened between 2004 and 2007. Words that tend not to be used in human summaries, characterized by high Score5(w), include courtesy titles (e.g., Mr, Ms, Jr.), relative time references (e.g., yesterday, p.m., Tuesday) and verbs that people use to express opinions (e.g., asked, told, added). The words with high probability in summaries (Score1(w)) overlap to some extent with those ranked high by Score4(w), but also include a number of frequent words that appear often in both the summaries and the original articles (e.g., State, million, American, percent). The words ranked high by PS(w)−PG(w) (Score2(w)) resemble those ranked high by Score4(w). The words ranked high by PS(w)/PG(w) (Score3(w)) include many abbreviations and uncommon words.

In summary, the global indicators, especially Score1(w), Score2(w), Score4(w) and Score5(w), seem to correlate well with our intuitions.

Method 2

Although Method 1 successfully identifies intrinsically important words, closer scrutiny of Table 5.4 reveals two problems. First, many words are ranked high because of journalistic conventions (e.g., reviews, op-ed, correction, photo(s)). For example, the summary of a correction article includes the word “correction”, and the summary of an article accompanied by photos contains the word “photo(s)”. These words do not describe the main topics of an article. Second, abstractors use abbreviations heavily in summaries: “pres” for “president”, “sen” for “senator”, “min” for “minister”, etc. As a result, many abbreviations are ranked undesirably high. Method 2 tackles these problems.

Let n denote the number of summary-article pairs. Let Ti (1 ≤ i ≤ n) denote the articles and let Si (1 ≤ i ≤ n) denote the summaries. Method 2 includes the following steps:

Step 1: We manually build a dictionary that pairs words with their abbreviations (e.g., “president” with “pres”). This dictionary is included in Appendix B. We replace every occurrence of an abbreviation in Si and Ti with the original word. This helps to alleviate the second problem.

Step 2: For each Si, we form a new summary Si′ by filtering out the words that never appear in the corresponding original article Ti. This helps to tackle the first problem. Moreover, this step suits extractive summarization (which is our case), because an extractive summarizer cannot select words that are absent from the original article.

Metric                   Rank    Words
PS(w)                    1-8     of, to, and, in, m, that, on, for
                         9-16    ’s, is, by, photo, new, with, at, from
                         17-24   are, as, says, has, photos, s, who, will
                         25-30   article, his, york, be, not, have
PS(w)−PG(w)              1-8     m, of, photo, says, new, on, photos, s
                         9-16    in, by, article, to, column, york, letter
PS(w)/PG(w)              1-8     atty, pres, fda, region/long, aclu, irs, guantanamo, faa
                         9-16    nj, nfc, nc, dept, chairman-chief, region/new, dist
PS(w)·ln(PS(w)/PG(w))    1-8     m, photo, photos, pres, says, article, column
                         9-16    of, reviews, letter, new, on, by, york, in
                         17-24   in, l, sen, ny, discusses, drawing, to, op-ed, holds
                         25-30   correction, bush, editorial, and, j, will
PG(w)·ln(PG(w)/PS(w))    1-8     the, a, mr., said, an, i, n’t, he
                         9-16    you, was, we, ms, it, had, this, but
                         17-24   she, ’re, my, yesterday, here, like, they, were
                         25-30   me, ’ve, there, do, ’m, so

Metric                   Rank    Words
PS(w)                    1-8     photo, photos, article, york, column, letter, bush
                         9-16    state, reviews, million, american, pres, percent, iraq, years
                         17-24   people, government, year, john, company, correction, national
                         25-30   federal, officials, drawing, billion, public, world, administration
PS(w)−PG(w)              1-8     photo, photos, article, column, york, letter, reviews, pres
                         9-15    bush, city, state, correction, drawing, op-ed, iraq
PS(w)/PG(w)              1-8     atty, pres, fda, region/long, aclu, irs, guantanamo, faa
                         9-15    nj, nfc, nc, dept, chairman-chief, region/new, dist
PS(w)·ln(PS(w)/PG(w))    1-8     photo, photos, pres, article, column, reviews, letter, york
                         9-16    sen, ny, discusses, drawing, op-ed, holds, correction, bush
                         17-24   editorial, dept, city, nj, min, map, corp, graph
                         25-30   contends, iraq, john, dies, sec, state
PG(w)·ln(PG(w)/PS(w))    1-8     mr, ms, yesterday, p.m., lot, tuesday, ca, thursday
                         9-16    wednesday, friday, told, monday, time, added, thing, sunday
                         17-24   things, asked, good, night, saturday, nyt, back, senator
                         25-30   wanted, kind, jr., mrs, bit, looked

Table 5.4: Top words derived by five global importance estimation methods (Method 1). The top table includes all words; the bottom table includes content words only. All words are lowercased.

Metric                   Rank    Words
PS(w)                    1-8     of, to, and, in, that, on, ’s
                         9-16    is, by, new, with, at, as, from, are
                         17-24   has, who, will, his, be, not, have, it
                         25-30   york, he, about, the, was, its
PS(w)−PG(w)              1-8     of, to, in, and, new, on, by, for
                         9-15    york, ’s, is, will, bush, that, president
PS(w)/PG(w)              1-7     perval, bacteriophages, juppe, melby, raveche, inderfurth, tikshoret
                         8-14    friedman-simring, lavoung, aclu, korondi, mckiver, gronim, meini
PS(w)·ln(PS(w)/PG(w))    1-8     of, in, to, new, and, on, by, m.
                         9-16    york, for, bush, article, president, will, city, is
                         17-24   ’s, are, state, iraq, that, million, from, john
                         25-30   editorial, billion, at, federal, percent, has
PG(w)·ln(PG(w)/PS(w))    1-8     the, a, mr., said, an, i, we, n’t
                         9-16    you, ms., he, it, was, had, this, she
                         17-24   my, but, ’re, yesterday, here, there, me, like
                         25-30   they, ’ve, do, our, were, ’m

Metric                   Rank    Words
PS(w)                    1-8     york, president, city, bush, state, million, percent, american
                         9-16    years, iraq, company, government, article, people, year, john
                         17-24   federal, national, billion, officials, public, administration, world
                         25-30   united, court, group, house, police, war, school
PS(w)−PG(w)              1-8     york, bush, president, city, state, million, iraq
                         9-15    john, percent, american, billion, government, federal, administration
PS(w)/PG(w)              1-7     perval, bacteriophages, juppe, melby, raveche, inderfurth, tikshoret
                         8-14    friedman-simring, lavoung, aclu, korondi, mckiver, gronim, meini
PS(w)·ln(PS(w)/PG(w))    1-9     york, bush, article, president, city, state, iraq, million, john
                         10-16   editorial, billion, federal, percent, american, government
                         17-22   administration, jersey, michael, national, column, court, senator
                         23-30   company, op-ed, gov., security, department, minister, directed, police
PG(w)·ln(PG(w)/PS(w))    1-8     mr., ms., yesterday, sept., ca, lot, tuesday, thursday
                         9-16    n.y., wednesday, friday, told, monday, thing, added, things
                         17-24   time, nyt, asked, good, night, p.m., mrs., sunday
                         25-30   saturday, wanted, back, thought, looked, wo

Table 5.5: Top words derived by five global importance estimation methods (Method 2). The top table includes all words; the bottom table includes content words only. All words are lowercased.

Step 3: This step builds the language models. We first build a language model (LM) from all texts Ti, using the same approach as in Method 1. We then build an LM from all Si′, using the words from Ti as the full vocabulary list.⁵ The vocabulary list includes 587,976 words, more than four times as many as the word list in Method 1. Unlike Method 1, Method 2 therefore also considers words that appear in Ti but not in Si′.

Step 4: The remaining steps are the same as in Method 1: we calculate Score1(w), ..., Score5(w) to quantify the intrinsic word importance.
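The preprocessing in Steps 1 and 2, together with the vocabulary used in Step 3, could be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the function names are hypothetical, texts are assumed to be lowercased token lists, abbreviations are assumed to be single tokens, and the dictionary passed in stands in for the one in Appendix B.

```python
def expand_abbreviations(tokens, abbreviations):
    """Step 1: replace each abbreviation with its original word."""
    return [abbreviations.get(t, t) for t in tokens]


def filter_summary(summary_tokens, article_tokens):
    """Step 2: keep only summary words that occur in the paired article."""
    article_vocab = set(article_tokens)
    return [t for t in summary_tokens if t in article_vocab]


def preprocess_pairs(pairs, abbreviations):
    """Apply Steps 1 and 2 to (summary, article) token-list pairs.

    Returns the expanded articles, the filtered summaries, and the
    article vocabulary that Step 3 uses as the full vocabulary list
    for the summary language model.
    """
    articles, filtered_summaries = [], []
    for summary_tokens, article_tokens in pairs:
        article = expand_abbreviations(article_tokens, abbreviations)
        summary = expand_abbreviations(summary_tokens, abbreviations)
        articles.append(article)
        filtered_summaries.append(filter_summary(summary, article))
    vocabulary = {t for article in articles for t in article}
    return articles, filtered_summaries, vocabulary
```

Expanding abbreviations before filtering matters: a summary token such as “pres” would otherwise be dropped in Step 2 whenever the article only uses “president”.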

Table 5.5 lists the top words ranked by Score1(w), ..., Score5(w) using Method 2. It contains less noise than Table 5.4. For instance, among the words newly promoted to the top 30 by Score4(w), we see “government”, “administration”, “senator”, “minister” and “police”, which are often used to characterize the main events of a news article.