5.4 Estimating Intrinsic Importance of Words
5.4.1 Deriving the Global Indicators
Below we introduce two methods of computing global indicators, where the second method is an upgraded version of the first. The indicators are computed from 160,001 summary-article pairs from 2004 to 2007 in the New York Times corpus. The idea is to compute the change in the probability of each word between the summary and the original article. A similar approach was used by Woodsend and Lapata (2012) to identify words that are likely to be avoided in summaries. However, their analysis is based on a very small corpus (i.e., one TAC dataset). Gillick and Dunietz (2014) use the New York Times corpus to generate a labeled corpus for entity salience identification.
Method 1
We build two unigram language models (LMs): one from the original articles (LMG), the other from the summaries (LMS). We use the SRI Language Modeling Toolkit (SRILM) (Stolcke, 2002) with Ney’s absolute discounting (Ney et al., 1994) and 0.75 as the constant to subtract.4 Since SRILM uses white space as the word separator, we tokenize all files with Stanford CoreNLP (Manning et al., 2014) before building the LMs. The probabilities of word w under LMS and LMG are denoted as PS(w) and PG(w), respectively.
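To make the role of the discounting constant concrete, the following is a minimal sketch of absolute discounting for a unigram model. It is an illustration only and is not meant to reproduce SRILM's internals: the function name `unigram_absolute_discount`, the optional `vocab` argument, and the choice to spread the freed probability mass uniformly over unseen vocabulary words (or to renormalize when there are none) are all assumptions made for this example.

```python
from collections import Counter


def unigram_absolute_discount(tokens, vocab=None, discount=0.75):
    """Unigram probabilities with absolute discounting (simplified sketch).

    Assumption: mass freed by discounting is shared uniformly among
    vocabulary words with zero count; if there are none, the discounted
    probabilities are renormalized. SRILM may handle this differently.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot estimate a language model from no tokens")

    # The vocabulary always contains every observed word.
    full_vocab = set(counts) if vocab is None else set(vocab) | set(counts)
    unseen = [w for w in full_vocab if w not in counts]

    # Subtract the constant from every observed count (counts are >= 1).
    probs = {w: (c - discount) / total for w, c in counts.items()}

    # Total mass removed: one `discount` per observed word type.
    reserved = discount * len(counts) / total
    if unseen:
        share = reserved / len(unseen)
        for w in unseen:
            probs[w] = share
    else:
        norm = sum(probs.values())
        probs = {w: p / norm for w, p in probs.items()}
    return probs
```

With a vocabulary larger than the set of observed words, every vocabulary word receives a nonzero probability, which is what allows ratios such as PS(w)/PG(w) to be computed for words that are rare in one of the two collections.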
We compute the intrinsic importance (global indicators) of w from PS(w) and PG(w). Because the corpus is large, we only consider words that appear in at least one article and at least one summary. This results in a total of 128,381 words. We introduce five global indicators. Score1(w) is the probability of w in the human summaries. Intuitively, words that appear more often in summaries are likely to be important. Score2(w) and Score3(w) compute the difference and the ratio between PS(w) and PG(w), respectively. Moreover, we compute Score4(w), whose formula resembles a single term of the Kullback-Leibler (KL) divergence. This is based on the hypothesis that summary-biased words tend to have a higher PS(w) and a larger difference or ratio between PS(w) and PG(w). Similarly, we compute Score5(w) to characterize the unimportant (input-biased) words. Score4(w) and Score5(w) are regarded as our main metrics.
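\begin{align}
\mathrm{Score}_1(w) &= P_S(w) \tag{5.1} \\
\mathrm{Score}_2(w) &= P_S(w) - P_G(w) \tag{5.2} \\
\mathrm{Score}_3(w) &= \frac{P_S(w)}{P_G(w)} \tag{5.3} \\
\mathrm{Score}_4(w) &= P_S(w) \cdot \ln \frac{P_S(w)}{P_G(w)} \tag{5.4} \\
\mathrm{Score}_5(w) &= P_G(w) \cdot \ln \frac{P_G(w)}{P_S(w)} \tag{5.5}
\end{align}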
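As a minimal sketch, Equations 5.1 to 5.5 can be computed as follows, assuming the two language models are available as plain dictionaries mapping words to probabilities. The function names `global_indicators` and `top_words` and the dictionary keys are hypothetical, chosen only for this illustration.

```python
import math


def global_indicators(p_s, p_g):
    """Compute Score1..Score5 for every word present in both models.

    p_s and p_g map each word to PS(w) and PG(w). Words missing from
    either model, or with a non-positive probability, are skipped so
    that the ratio and the logarithm are always defined.
    """
    scores = {}
    for w in p_s.keys() & p_g.keys():
        ps, pg = p_s[w], p_g[w]
        if ps <= 0 or pg <= 0:
            continue
        scores[w] = {
            "score1": ps,                       # Eq. 5.1
            "score2": ps - pg,                  # Eq. 5.2
            "score3": ps / pg,                  # Eq. 5.3
            "score4": ps * math.log(ps / pg),   # Eq. 5.4
            "score5": pg * math.log(pg / ps),   # Eq. 5.5
        }
    return scores


def top_words(scores, key, k=30):
    """Return the k words with the highest value of the given score."""
    return sorted(scores, key=lambda w: scores[w][key], reverse=True)[:k]
```

For example, with made-up toy probabilities `p_s = {"iraq": 0.02, "mr": 0.001}` and `p_g = {"iraq": 0.005, "mr": 0.01}`, the word "iraq" has a positive Score4 and a negative Score5, while "mr" shows the opposite pattern. Note that Score4 and Score5 weight the log-ratio by the probability of the word, so a very rare word with an extreme ratio does not dominate the ranking the way it does under Score3.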
Table 5.4 shows the top words and top content words ranked by the five global indicators. Here we briefly discuss the content words. Words that tend to be used in summaries, characterized by high Score4(w), include locations (e.g., York, NJ, Iraq), people’s names (e.g., Bush, John), abbreviations (e.g., pres, corp, dept) and verbs of conflict (e.g., contends, dies). Some of the top-ranked words, such as Iraq, Bush and John, are related to major events that happened between 2004 and 2007. Words that tend not to be used in human summaries, characterized by high Score5(w), include courtesy titles (e.g., Mr, Ms, Jr.), relative time references (e.g., yesterday, p.m., Tuesday) and verbs that people use to express opinions (e.g., asked, told, added). The words with high probability in summaries (Score1(w)) overlap to some extent with those ranked high by Score4(w), but also include a number of frequent words that appear often in both the summaries and the original articles (e.g., State, million, American, percent). The words ranked high by PS(w)−PG(w) (Score2(w)) resemble those ranked high by Score4(w). The words ranked high by PS(w)/PG(w) (Score3(w)) include many abbreviations and uncommon words.
In summary, the global indicators—especially Score1(w), Score2(w), Score4(w)
and Score5(w)—seem to correlate well with our intuitions.
Method 2
Even though Method 1 successfully identifies intrinsically important words, a closer look at Table 5.4 reveals two problems. First, many words are ranked high because of journalistic conventions (e.g., reviews, op-ed, correction, photo(s)). For example, the summary of a correction article includes the word “correction”, and the summary of an article accompanied by photos contains the word “photo(s)”. These words do not describe the main topics of an article. Second, abstractors use abbreviations frequently in summaries: “pres” for “president”, “sen” for “senator”, “min” for “minister”, etc. As a result, many abbreviations are ranked undesirably high. Method 2 tackles these problems.
Let n denote the number of summary-article pairs. Let Ti (1 ≤ i ≤ n) denote the articles and let Si (1 ≤ i ≤ n) denote the summaries. Method 2 includes the following steps:
Step 1: We manually build a dictionary of words and their abbreviations (e.g., “pres” for “president”). This dictionary is included in Appendix B. We replace each occurrence of an abbreviation in Si and Ti with the original word. This helps to alleviate the second problem.
Step 2: For each Si, we form a new summary Si′ by filtering out the words that never appear in the corresponding original article (Ti). This helps to tackle the first problem. Moreover, this step suits extractive summarization (which is our case), because words that are not in the original article cannot be selected by an extractive summarizer.
Metric                  Rank    Words
PS(w)                   1-8     of, to, and, in, m, that, on, for
                        9-16    ’s, is, by, photo, new, with, at, from
                        17-24   are, as, says, has, photos, s, who, will
                        25-30   article, his, york, be, not, have
PS(w)−PG(w)             1-8     m, of, photo, says, new, on, photos, s
                        9-16    in, by, article, to, column, york, letter
PS(w)/PG(w)             1-8     atty, pres, fda, region/long, aclu, irs, guantanamo, faa
                        9-16    nj, nfc, nc, dept, chairman-chief, region/new, dist
PS(w)·ln(PS(w)/PG(w))   1-8     m, photo, photos, pres, says, article, column
                        9-16    of, reviews, letter, new, on, by, york, in
                        17-24   in, l, sen, ny, discusses, drawing, to, op-ed, holds
                        25-30   correction, bush, editorial, and, j, will
PG(w)·ln(PG(w)/PS(w))   1-8     the, a, mr., said, an, i, n’t, he
                        9-16    you, was, we, ms, it, had, this, but
                        17-24   she, ’re, my, yesterday, here, like, they, were
                        25-30   me, ’ve, there, do, ’m, so

Metric                  Rank    Words
PS(w)                   1-8     photo, photos, article, york, column, letter, bush
                        9-16    state, reviews, million, american, pres, percent, iraq, years
                        17-24   people, government, year, john, company, correction, national
                        25-30   federal, officials, drawing, billion, public, world, administration
PS(w)−PG(w)             1-8     photo, photos, article, column, york, letter, reviews, pres
                        9-15    bush, city, state, correction, drawing, op-ed, iraq
PS(w)/PG(w)             1-8     atty, pres, fda, region/long, aclu, irs, guantanamo, faa
                        9-15    nj, nfc, nc, dept, chairman-chief, region/new, dist
PS(w)·ln(PS(w)/PG(w))   1-8     photo, photos, pres, article, column, reviews, letter, york
                        9-16    sen, ny, discusses, drawing, op-ed, holds, correction, bush
                        17-24   editorial, dept, city, nj, min, map, corp, graph
                        25-30   contends, iraq, john, dies, sec, state
PG(w)·ln(PG(w)/PS(w))   1-8     mr, ms, yesterday, p.m., lot, tuesday, ca, thursday
                        9-16    wednesday, friday, told, monday, time, added, thing, sunday
                        17-24   things, asked, good, night, saturday, nyt, back, senator
                        25-30   wanted, kind, jr., mrs, bit, looked
Table 5.4: Top words derived by five global importance estimation methods (Method 1). The top table includes all words, the bottom table includes content words only. All words are lowercased.
Metric                  Rank    Words
PS(w)                   1-8     of, to, and, in, that, on, ’s
                        9-16    is, by, new, with, at, as, from, are
                        17-24   has, who, will, his, be, not, have, it
                        25-30   york, he, about, the, was, its
PS(w)−PG(w)             1-8     of, to, in, and, new, on, by, for
                        9-15    york, ’s, is, will, bush, that, president
PS(w)/PG(w)             1-7     perval, bacteriophages, juppe, melby, raveche, inderfurth, tikshoret
                        8-14    friedman-simring, lavoung, aclu, korondi, mckiver, gronim, meini
PS(w)·ln(PS(w)/PG(w))   1-8     of, in, to, new, and, on, by, m.
                        9-16    york, for, bush, article, president, will, city, is
                        17-24   ’s, are, state, iraq, that, million, from, john
                        25-30   editorial, billion, at, federal, percent, has
PG(w)·ln(PG(w)/PS(w))   1-8     the, a, mr., said, an, i, we, n’t
                        9-16    you, ms., he, it, was, had, this, she
                        17-24   my, but, ’re, yesterday, here, there, me, like
                        25-30   they, ’ve, do, our, were, ’m

Metric                  Rank    Words
PS(w)                   1-8     york, president, city, bush, state, million, percent, american
                        9-16    years, iraq, company, government, article, people, year, john
                        17-24   federal, national, billion, officials, public, administration, world
                        25-30   united, court, group, house, police, war, school
PS(w)−PG(w)             1-8     york, bush, president, city, state, million, iraq
                        9-15    john, percent, american, billion, government, federal, administration
PS(w)/PG(w)             1-7     perval, bacteriophages, juppe, melby, raveche, inderfurth, tikshoret
                        8-14    friedman-simring, lavoung, aclu, korondi, mckiver, gronim, meini
PS(w)·ln(PS(w)/PG(w))   1-9     york, bush, article, president, city, state, iraq, million, john
                        10-16   editorial, billion, federal, percent, american, government
                        17-22   administration, jersey, michael, national, column, court, senator
                        23-30   company, op-ed, gov., security, department, minister, directed, police
PG(w)·ln(PG(w)/PS(w))   1-8     mr., ms., yesterday, sept., ca, lot, tuesday, thursday
                        9-16    n.y., wednesday, friday, told, monday, thing, added, things
                        17-24   time, nyt, asked, good, night, p.m., mrs., sunday
                        25-30   saturday, wanted, back, thought, looked, wo
Table 5.5: Top words derived by five global importance estimation methods (Method 2). The top table includes all words, the bottom table includes content words only. All words are lowercased.
Step 3: This step builds the language models. We first build a language model (LM) from all texts Ti, using the same approach as in Method 1. We then build an LM from all Si′, where the words from Ti are used as the full vocabulary list.5 The vocabulary list includes 587,976 words, which is more than four times as large as the word list in Method 1. Unlike in Method 1, words that appear in Ti but not in Si′ are also considered in Method 2.
Step 4: The remaining steps are the same as in Method 1: we calculate Score1(w), ..., Score5(w) to quantify the intrinsic word importance.
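Steps 1 and 2 of Method 2 can be sketched as below. This is a hedged illustration rather than the actual preprocessing code: it assumes both texts are already tokenized and lowercased, that the abbreviation dictionary maps single tokens to single tokens, and that membership in the article is tested after abbreviations are expanded. The function names and the two-entry toy dictionary (taken from the examples in the text, not from Appendix B) are hypothetical.

```python
def expand_abbreviations(tokens, abbrev_map):
    """Step 1: replace each abbreviation with its original word."""
    return [abbrev_map.get(tok, tok) for tok in tokens]


def filter_summary(summary_tokens, article_tokens):
    """Step 2: keep only summary words that occur in the article."""
    article_vocab = set(article_tokens)
    return [tok for tok in summary_tokens if tok in article_vocab]


def preprocess_pair(summary_tokens, article_tokens, abbrev_map):
    """Apply Step 1 to both texts, then Step 2 to the summary.

    Returns the filtered summary (Si') and the expanded article (Ti).
    """
    summary = expand_abbreviations(summary_tokens, abbrev_map)
    article = expand_abbreviations(article_tokens, abbrev_map)
    return filter_summary(summary, article), article


# Toy example; the mapping below is illustrative only.
abbrev_map = {"pres": "president", "sen": "senator"}
summary = ["photo", "of", "pres", "bush"]
article = ["president", "bush", "spoke", "of", "iraq"]
filtered_summary, expanded_article = preprocess_pair(summary, article, abbrev_map)
```

In this toy pair, "pres" is expanded to "president" and then kept because the article contains it, whereas "photo" is dropped because it never appears in the article. This mirrors how the two steps address the abbreviation problem and the journalistic-convention problem, respectively.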
Table 5.5 lists the top words ranked by Score1(w), ..., Score5(w) using Method 2. Examination shows that it contains less noise than Table 5.4. For instance, among the words newly promoted to the top 30 by Score4(w), we see “government”, “administration”, “senator”, “minister” and “police”, which are often used to characterize the main events of a news article.