Mining Global Knowledge from Dictionaries

We hypothesize that words in certain categories are likely to be globally important or unimportant. In fact, many summarizers already rely on this hypothesis for content selection. Systems that conduct rule-based sentence compression use part-of-speech (POS) tags to identify the words to be removed (Dunlavy et al., 2003; Conroy et al., 2006b; Wang et al., 2013). Event-based summarizers rely on named entity (NE) recognition (Filatova and Hatzivassiloglou, 2004; Li et al., 2006) to identify important events. Moreover, our experiments in Section 3.5.2 have shown that words with certain POS or NE tags are likely or unlikely to appear in human summaries.

In this section, we use the MPQA and LIWC dictionaries to build features for our word importance estimation model (Chapter 3). For each word, a feature has value 1 if the word belongs to the category that the feature corresponds to, and 0 otherwise. To test the predictive power of these features, we perform a proportion test and a Wilcoxon rank-sum (WRS) test (a.k.a. the Mann-Whitney U test) on the words that are used and not used in human summaries. Since the features are binary, we regard the proportion test as our main metric.² Our experiments help us understand which categories might be associated with the inclusion or exclusion of a word, independent of a particular input.
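The two tests can be sketched as follows for a single binary feature. This is a minimal illustration, not the implementation used in the experiments. It assumes that the two groups are the words used and not used in human summaries, that k1 and k2 count the words with feature value 1 in each group, and that both tests use a normal approximation. The function names and the toy counts are hypothetical.

```python
import math


def two_proportion_p(k1, n1, k2, n2):
    """Two-sided p-value of the pooled two-proportion z-test.

    k1 of n1 words in group 1 have the feature, k2 of n2 in group 2.
    Assumes the pooled proportion is strictly between 0 and 1.
    """
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (k1 / n1 - k2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def rank_sum_p_binary(k1, n1, k2, n2):
    """Two-sided Wilcoxon rank-sum p-value for 0/1 data.

    Uses the normal approximation with the standard tie correction
    and no continuity correction. Assumes both values 0 and 1 occur.
    """
    n = n1 + n2
    ones = k1 + k2
    zeros = n - ones
    # All zeros are tied, and so are all ones, so each gets a midrank.
    rank_zero = (zeros + 1) / 2
    rank_one = zeros + (ones + 1) / 2
    w = k1 * rank_one + (n1 - k1) * rank_zero  # rank sum of group 1
    mean_w = n1 * (n + 1) / 2
    tie_term = (zeros**3 - zeros + ones**3 - ones) / (n * (n - 1))
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term)
    z = (w - mean_w) / math.sqrt(var_w)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 30 of 200 summary words and 90 of 1800
# non-summary words belong to some category.
p_prop = two_proportion_p(30, 200, 90, 1800)
p_wrs = rank_sum_p_binary(30, 200, 90, 1800)
print(p_prop, p_wrs)
```

For 0/1 data the two z statistics differ only by a factor of sqrt((n - 1) / n), where n is the total number of words, so the two p-values stay close for large samples.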

Using dictionaries to build features gives us two advantages. First, in contrast to the sparse unigram features, we have a dense representation of the feature space. Consider a word that appears in the test data but not in the training data. If this word belongs to a dictionary category, we can infer its properties from words of the same category in the training data. Unigram features, by contrast, give no signal in this case. Second, the feature space is not determined by a specific input. This is a more advantageous representation, according to Yang and Nenkova (2014).
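The first advantage can be illustrated with a small sketch. The category dictionary below is a hypothetical toy subset built from sample words in Table 5.3, not the real LIWC dictionary, and the function names are assumptions made for this example.

```python
# Toy category dictionary (hypothetical subset, for illustration only).
CATEGORY_WORDS = {
    "death": {"killed", "killing", "died", "death"},
    "money": {"business", "economic", "bill", "economy"},
}
CATEGORIES = sorted(CATEGORY_WORDS)  # fixed feature order


def dictionary_features(word):
    """Binary vector: 1 if the word belongs to the category, else 0."""
    return [1 if word in CATEGORY_WORDS[c] else 0 for c in CATEGORIES]


def unigram_features(word, vocabulary):
    """One-hot vector over the training vocabulary (all 0 if unseen)."""
    return [1 if word == v else 0 for v in vocabulary]


train_vocab = ["killed", "business", "said"]
unseen_word = "died"  # appears in the test data only

# The dictionary feature still fires for the unseen word ...
print(dictionary_features(unseen_word))
# ... while every unigram feature is 0.
print(unigram_features(unseen_word, train_vocab))
```

Because "died" shares the death category with "killed", whatever the model learned about that category during training carries over to the unseen word.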

2 We observe that these two tests generate similar p-values and the relative rank between features

| Features | Sample words | prop | WRS | +/- | rf |
|---|---|---|---|---|---|
| strong subj & negative | fear, concerned, trouble, worst | 1e-4 | 1e-4 | - | 8.5% |
| strong subj & neutral | air, opinion, felt, feel, view | 0.057 | 0.045 | - | 9.3% |
| strong subj & positive | great, hope, true, kind, agree | 0.009 | 0.007 | - | 7.3% |
| weak subj & negative | force, close, lost, hard, war, crisis | 0.147 | 0.135 | + | 12.3% |
| weak subj & neutral | move, major, pressure, high, show | 0.14 | 0.124 | - | 9.3% |
| weak subj & positive | clear, good, minister, deal, leading | 0.973 | 0.939 | - | 10.3% |

Table 5.1: The MPQA features and their p-values by proportion test (prop) and Wilcoxon rank-sum test (WRS). + indicates that the feature is more frequent in the summaries, - that it is more frequent in the input. Bold indicates statistically significant (p < 0.05). rf indicates the percentage of words with this feature tag in the input that appear in human summaries. The mean rf for all words is 10.9%.

5.3.1 Multi-Perspective Question Answering (MPQA)

In the MPQA lexicon (Wiebe and Cardie, 2005), each word is labeled with its subjectivity and polarity. There are two subjectivity categories (strongly subjective, weakly subjective) and three polarity categories (positive, neutral, negative). Words that are subjective in most contexts are labeled as strongly subjective (e.g., abase, abash, abysmal), while words that are subjective only in some contexts are labeled as weakly subjective (e.g., abandon, ability, accept). Objective words are not included in this lexicon. The polarity label indicates whether a word evokes negative, neutral or positive emotions.

For each word, we construct six features; each feature corresponds to one combination of subjectivity and polarity. Table 5.1 shows the p-values from the significance tests and sample words for each category. rf is defined as the percentage of words with this feature tag in the input that appear in human summaries (for the formal definition of rf, see Section 3.5.2). The experiment shows that words with strong subjectivity, whether positive, neutral or negative, are less likely to be used in summaries. Most strikingly, the p-value for strongly subjective negative words is very low: about 10^-4.
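The construction of the six features and of rf can be sketched as follows. The lexicon here is a hypothetical in-memory stand-in that holds one sample word per category from Table 5.1, and the rf function assumes the percentage is taken over distinct word types of a single input; the formal definition in Section 3.5.2 may differ. All names are illustrative.

```python
from itertools import product

SUBJECTIVITIES = ("strong", "weak")
POLARITIES = ("negative", "neutral", "positive")
# Six (subjectivity, polarity) combinations, one binary feature each.
FEATURES = list(product(SUBJECTIVITIES, POLARITIES))

# Hypothetical stand-in for the MPQA lexicon (sample words only).
LEXICON = {
    "fear": ("strong", "negative"),
    "feel": ("strong", "neutral"),
    "hope": ("strong", "positive"),
    "war": ("weak", "negative"),
    "major": ("weak", "neutral"),
    "good": ("weak", "positive"),
}


def mpqa_features(word):
    """Six binary features; all 0 for words outside the lexicon."""
    label = LEXICON.get(word)
    return [1 if label == feature else 0 for feature in FEATURES]


def rf(feature, input_words, summary_words):
    """Percentage of input word types with `feature` used in summaries.

    Assumption: computed over distinct word types of one input.
    Returns None if no input word carries the feature.
    """
    index = FEATURES.index(feature)
    tagged = [w for w in set(input_words) if mpqa_features(w)[index]]
    if not tagged:
        return None
    used = sum(1 for w in tagged if w in summary_words)
    return 100.0 * used / len(tagged)


toy_input = ["fear", "feel", "war", "major", "said"]
toy_summary = {"war"}
print(mpqa_features("fear"))
print(rf(("weak", "negative"), toy_input, toy_summary))
```

Objective words such as "said" are not in the lexicon, so all six of their features are 0.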

There are two possible explanations for why words with strong subjectivity are unlikely to be used in summaries: (1) the main topic of an input tends not to be too subjective, or (2) the abstractors tend not to be too subjective while writing summaries. To investigate which explanation is more plausible, we look into some examples. First, “feel” (strong subjectivity and neutral) never appears in human summaries but appears 26 times in 11 inputs. We observe that almost all occurrences of this word are within quotations (see Table 5.2). Therefore, the exclusion of “feel” is due to the fact that quotations are unlikely to be included in summaries.³ Second, “fear” (strong subjectivity and negative) never appears in human summaries but appears 26 times in 15 inputs. We observe that this word is mostly used when people express their feelings. This is consistent with the observation that verbs describing personal actions are unlikely to appear in summaries (Nye and Nenkova, 2015). For both examples, the first explanation is more plausible, i.e., the strongly subjective words are not used because they are generally not related to the central topic of the input.

“I feel guilty”, he said, close to tears as he stood near the flowers and candles at the fire site.

The beatification may also owe its speed to the pope’s fear that a successor may be less sensitive to the East European Church’s struggle against communism, to which he has devoted much of his life.

Table 5.2: Examples of input sentences that include words with strong subjectivities.

5.3.2 Linguistic Inquiry and Word Count (LIWC)

The LIWC application (Tausczik and Pennebaker, 2007) was originally designed to conduct text analysis for psychological research. Representative applications of LIWC include suicide risk assessment (Matykiewicz et al., 2009), deception detection (Newman et al., 2003), sentiment analysis (Tumasjan et al., 2010), and schizophrenia identification (Hong et al., 2012; Hong et al., 2015b). At the center of LIWC is a dictionary that assigns tags to words based on their lexical or semantic properties, which makes the dictionary appropriate for our analysis. One drawback of the LIWC dictionary is its low coverage: it includes only 4500 words or word stems. In future work, it would be interesting to conduct a similar analysis with dictionaries that have higher coverage.

3 Among the words that only appear in quotations of an input, 3.2% of them are used in human

| Features | Sample words | prop | WRS | +/- | rf |
|---|---|---|---|---|---|
| death | killed, killing, war, died, death | 1.5e-13 | 6.5e-14 | + | 20.4% |
| anger | killed, killing, victims, hit, attack | 3e-9 | 2.1e-9 | + | 21.7% |
| achieve | president, leader, leaders, control | 1.9e-5 | 1.5e-5 | + | 13.3% |
| negative emotion | killed, killing, lost, pressure, victim | 0.005 | 0.004 | + | 14.3% |
| money | business, economic, bill, economy | 0.016 | 0.013 | + | 11.4% |
| inclusive words | including, included, close, open | 0.023 | 0.016 | + | 13.7% |
| space | international, country, national, world | 0.050 | 0.045 | + | 12.0% |
| perceptual process | spokesman, speaking, hand, press | 3e-6 | 2.2e-6 | - | 4.4% |
| insight | statement, question, found, believed | 5e-6 | 4.1e-6 | - | 5.7% |
| hear | spokesman, speaking, heard, spoke | 5.4e-5 | 3.2e-5 | - | 1% |
| tentative | maybe, question, hope, appears | 0.001 | 7e-4 | - | 4.0% |
| cognitive process | make, news, statement, including | 0.002 | 2e-3 | - | 8.8% |
| present tense | make, give, carry, turn | 0.004 | 0.003 | - | 5.4% |
| body | head, face, hand, feet, hands | 0.005 | 0.004 | - | 5.1% |
| friend | friends, neighboring, neighbors, fellow | 0.026 | 0.015 | - | 0.7% |
| function words | part, main, half, back | 0.023 | 0.019 | - | 7.5% |
| positive emotions | great, good, important, support | 0.044 | 0.039 | - | 8.4% |

Table 5.3: Significant LIWC features and their p-values by proportion test (prop) and Wilcoxon rank-sum test (WRS). +/- indicates more in the summary/input. rf indicates the percentage of words with this feature tag in the input that are included in human summaries. The mean rf for all words is 10.9%.

Table 5.3 shows the p-values from the Wilcoxon rank-sum (WRS) test and the proportion test. Among all 64 features that correspond to LIWC categories, 16 are significant by the proportion test and 17 by the WRS test.

Categories that appear at a higher rate in human summaries (i.e., summary-biased categories) include death, anger, achievements, negative emotions, money, inclusive words and space. Of these, the first two are extremely significant. Indeed, words related to death (e.g., killed, killing, war, died) and anger (e.g., killed, killing, attack, victims) are the main focus of many news articles. The statistical significance of negative emotions is related to a finding in prior work that general sentences (which are likely to be summary sentences) include a greater number of polarity words (Louis and Nenkova, 2011a; Louis and Nenkova, 2011b).

“Killed” and “killing” are the most frequent words in the categories death, anger and negative emotions. Therefore, it is possible that these categories are significant only because these two words appear so often. To test whether this is the case, we remove the samples that correspond to “killed” and “killing” from our data and retest the statistical significance of these features. The new p-values for death and anger are still very significant (smaller than 0.002 by the proportion test). However, the new p-value for negative emotion is 0.118. The WRS test gives very similar results.
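This retest can be sketched as a filtering step followed by the same proportion test. The sketch assumes that each sample is a (word, used_in_summary) pair and that the test is the pooled two-proportion z-test. The counts are hypothetical toy data chosen to show the effect, not the data behind the reported p-values, and the function names are illustrative.

```python
import math


def proportion_p(k1, n1, k2, n2):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return math.erfc(abs(k1 / n1 - k2 / n2) / se / math.sqrt(2))


def category_p(samples, category, excluded=frozenset()):
    """p-value for a category after dropping samples of excluded words."""
    kept = [(w, used) for w, used in samples if w not in excluded]
    # Feature values (True/False) within each of the two groups.
    in_summary = [w in category for w, used in kept if used]
    not_in_summary = [w in category for w, used in kept if not used]
    return proportion_p(sum(in_summary), len(in_summary),
                        sum(not_in_summary), len(not_in_summary))


# Toy data: "killed" dominates the category and is summary-biased.
samples = (
    [("killed", True)] * 40 + [("killed", False)] * 60
    + [("lost", True)] * 10 + [("lost", False)] * 90
    + [("said", True)] * 50 + [("said", False)] * 750
)
category = {"killed", "lost"}

p_before = category_p(samples, category)
p_after = category_p(samples, category, excluded={"killed"})
print(p_before, p_after)
```

In this toy setting the category is no longer significant at the 0.05 level once the dominant word is removed, which mirrors the pattern reported for negative emotion.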

A greater number of features correspond to categories that are unlikely to be used in human summaries (i.e., input-biased categories). Highly significant categories include perceptual process (spokesman, speaking, hand), insight (statement, question) and friend (friends, neighboring, fellow). LIWC also groups words by their part-of-speech tags; here we observe fewer occurrences of present tense verbs and function words in human summaries.

In document Content Selection in Multi-Document Summarization (Page 133-138)