Current Cross-Document Coreference Approaches

2 Cross-Document Coreference for Video Retrieval

2.2 Current Cross-Document Coreference Approaches

To investigate the current state of the art in cross-document coreference, we review three recent approaches that employ the most popular techniques and are used by cross-document summarisation systems built for merging multiple news articles about the same story in DUC (2004). Previous cross-document coreference algorithms cover limited sets of events in specific domains, such as football, terrorist or election events. We show by example how the most recent algorithms perform on plot summary and audio description, two texts which may convey the same story plot but have different functions and differ significantly in the language used and in the amount and kinds of information they contain. The n-gram algorithm, the event-centric algorithm and the Boosting algorithm are intended for cross-document coreference between texts of the same type. Using part-of-speech (POS) taggers, they score important words by their frequency, match n-grams or common open class words, match triples (i.e. verbs with another word in the role of object or subject) and compute the semantic distance between words. Most existing approaches for cross-document coreference have been tested on news articles and between the same types of texts. The present study proposes cross-document coreference between different types of texts, such as plot summary and audio description, covering an unlimited set of events, such as film events, by matching combinations of functional roles, such as subject and object, and by considering the temporal aspect of the event to estimate the number of expected matches.

2.2.1 N-gram Algorithms

“An n-gram grammar is a representation of an N-th order Markov language model in which the probability of occurrence of a symbol is conditioned upon the prior occurrence of N-1 other symbols” (Brown et al, 2001). N-gram matching algorithms can match single words (‘uni-grams’) or two or more consecutive words (‘bi-grams’, ‘tri-grams’ etc.), including grammar tags such as noun or adjective. N-gram algorithms are used by the LAKE system (D’Avanzo et al, 2004), which extracts and scores key-phrases in a corpus of news articles, and by the News Story Gisting system (Doran et al, 2004), which generates short news story summaries by predicting which words should appear in the resultant summary. The extraction of candidate phrases is based on n-grams: uni- or bi-grams for entity extraction (common noun, proper noun, adjective + noun etc.) and tri- or four-grams for events or situations (noun + verb + noun + adverb etc.). N-grams are scored with statistical methods involving the document term frequency (TF), the inverse document frequency (IDF, the log of the number of all documents divided by the number of documents containing the term), and the ‘first occurrence’ or ‘word distance’, which is the relative position of a candidate word with respect to the beginning of the document. The TF, the IDF and the word distance are computed to score important n-grams (Figure 2-7).

1. Match or fuzzy match n-grams
2. Compute TF
3. Compute IDF
4. Compute word distance

Figure 2-7: The n-gram algorithm as it appears in (D’Avanzo et al, 2004; Doran et al, 2004)
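The scoring steps listed in Figure 2-7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the LAKE or News Story Gisting implementation: the function names `ngrams` and `score_ngram` are hypothetical, matching is exact rather than fuzzy, and the toy token lists are lower-cased by hand. Note that with the logarithmic definition of IDF an n-gram occurring in every document scores 0, whereas the raw ratio of documents would be 1.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def score_ngram(gram, doc_tokens, corpus):
    """Return (TF, IDF, word distance) for one n-gram in one document.

    corpus is a list of token lists and is assumed to include doc_tokens.
    """
    n = len(gram)
    doc_grams = ngrams(doc_tokens, n)
    counts = Counter(doc_grams)
    # TF: occurrences of the n-gram relative to all n-grams in the document.
    tf = counts[gram] / len(doc_grams) if doc_grams else 0.0
    # IDF: log of (all documents / documents containing the n-gram).
    containing = sum(1 for doc in corpus if gram in ngrams(doc, n))
    idf = math.log(len(corpus) / containing) if containing else 0.0
    # Word distance: relative position of the first occurrence (0 = start).
    distance = doc_grams.index(gram) / len(doc_grams) if counts[gram] else None
    return tf, idf, distance


# Toy data (hand-made, lower-cased); not the thesis corpus.
ps = "vianne has been helping josephine".split()
ad = "vianne helps armande to her feet".split()
print(score_ngram(("vianne",), ps, [ps, ad]))
print(score_ngram(("josephine",), ps, [ps, ad]))
```

A low word distance marks an n-gram as occurring near the start of the document, which is the property the word distance feature rewards.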

Now let us consider how well this works on plot summary and audio description. The selection of relevant phrases in the plot summary considers syntactic patterns describing an entity (uni- or bi-grams) or a concise event or situation (tri- or four-grams), using the Penn Treebank tagset. The plot summary (PS) sentence would then be tagged as follows:

Vianne [NP] has [VBZ] been [VBN] helping [VBG] Josephine [NP] out [IN] of [IN] her [PP] abusive [JJ] marriage [NN]

where ‘NP’ = proper noun, singular; ‘VBZ’ = verb, 3rd person singular present; ‘VBN’ = verb, past participle; ‘VBG’ = verb, gerund or present participle; ‘IN’ = preposition or subordinating conjunction; ‘PP’ = personal pronoun; ‘JJ’ = adjective; and ‘NN’ = noun, singular or mass.

Here we manually simulated the n-gram algorithm. As the important keywords are spread through the sentence, applying bi-grams, tri-grams or four-grams does not detect any matches in the audio description (AD). Thus, we spot uni-grams: Vianne and Josephine, which describe entities, and helping, which describes an event (Table 2-2).
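The effect described above, where longer n-grams find nothing while uni-grams still match, can be reproduced on a toy pair of sentences. The sketch below is an illustration only: `ngram_set` and `shared_ngrams` are hypothetical helper names, the sentences are lemmatised and lower-cased by hand, and exact matching stands in for the fuzzy matching of the original systems.

```python
def ngram_set(tokens, n):
    """Return the set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(tokens_a, tokens_b, n):
    """Return the n-grams that occur in both token lists."""
    return ngram_set(tokens_a, n) & ngram_set(tokens_b, n)


# Hand-lemmatised toy sentences (an assumption made for this sketch).
ps = "vianne have be help josephine out of her abusive marriage".split()
ad = "vianne help armande to her foot".split()

for n in (1, 2, 3):
    print(n, sorted(shared_ngrams(ps, ad, n)))
```

Because the shared keywords are separated by different intervening words in each text, only the uni-gram pass returns anything for this pair.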

| N-gram | PS n-gram TF | AD n-gram TF | n-gram TF in the corpus of one PS and one AD | n-gram IDF |
|---|---|---|---|---|
| Vianne [NP] | 2/114 (1.7%) | 75/5,916 (1.2%) | 77/6,030 (1.2%) | 1 |
| Josephine [NP] | 1/114 (0.8%) | 35/5,916 (0.5%) | 36/6,030 (0.5%) | 1 |
| Help [VB-] | 1/114 (0.8%) | 4/5,916 (0.06%) | 5/6,030 (0.08%) | 1 |

Table 2-2: TF and IDF of the n-grams matched between plot summary and audio description for one event in the film ‘Chocolat’ (the n-grams were lemmatised)

The binary classifier decides whether a sentence is relevant or not based on two features: TF, showing whether the term is frequent in the specific document, and IDF, showing whether or not the term occurs in all texts of the corpus. For example, the frequency (TF) of the uni-gram Vianne in the corpus of one plot summary and one audio description is 77/6,030 (or 1.2%), see Table 2-2. The inverse document frequency (IDF) of the n-grams Vianne, Josephine and helping is 1, as each n-gram occurs in both documents of a corpus of one plot summary and one audio description, i.e. two divided by two equals one. The TF of all n-grams in this corpus is very low, ranging from 0.08% for the lemmatised n-gram help to 1.2% for the n-gram Vianne, which marks them as unimportant. However, these statistics are not useful when matching a plot summary against the corresponding audio description spans, as the important words to be matched are the proper nouns or head nouns detected in the plot summary. The word helping was fuzzy matched four times. The principle of word distance treats words appearing close to the beginning of a document as important, since they tend to introduce the theme of the article. Candidate matches of Vianne, Josephine and helping were therefore selected in the audio description but were erroneous, as the correct matches appear in later parts of the audio description and not at the beginning of the document.

N-grams can be restrictive for the pair of texts plot summary and audio description, as keywords do not necessarily occur in the same order and longer n-grams would be needed, which are hard to match or even fuzzy match. The TF is not important in our case, as we take the nouns and verbs in the plot summary as candidate keywords, and these may be infrequent. The IDF is not important either, as information is matched across two texts only. These statistics matter when detecting candidate keywords in more than two texts, where the selection is not driven by one of the texts as it is by the plot summary here. The uni-gram Vianne was found 75 times in the audio description, but only four of these refer to the event in which the character participated. The uni-gram Josephine was found 35 times in the audio description, and on only two occasions was it included in the same event. The word help was found four times and all matches are incorrect. Candidate keywords need to be combined with the occurrence of other keywords to find the correct match. News Story Gisting also includes lexical cohesion, based on word repetition and synonymy relations according to WordNet. In our case it has not been applied, as no synonyms are detected between plot summary and audio description.


2.2.2 Event-Centric Algorithm

The ‘event-centric’ algorithm is part of the MSR-NLP system (Vanderwende et al, 2004) and proposes a verb-centred method which tags verbs and nouns, as well as time and syntactic relations (object, subject). The extracted fragments are divided into events and entities. First, the event expressed by a verb is combined with another word assigned the role of either logical subject or logical object; this combination is called a ‘triple’. Using the PageRank algorithm, the triples which have received more than one vote are detected and marked as important (Figure 2-8).

1. Tag verbs and nouns, time, syntactic relations (subject, object)
2. Produce triples in the form of LFNodei, rel, LFNodej
3. Score important triples

Figure 2-8: The event-centric algorithm (Vanderwende et al, 2004)

Here, we manually simulated the event-centric algorithm and applied it to the pair of texts plot summary and audio description. We produce triples of the form -LFNodei, rel, LFNodej- from the logical form of the sentences, meaning that tokens are assigned a functional role of subject or object, called ‘Tsubj’ (‘typical subject’) or ‘Tobj’ (‘typical object’). For example, spider is the ‘typical’ subject in the sentence Peter is bitten by a spider, while Peter is the grammatical subject. We detect the typical subjects or objects of the verbs. We then score important triples in the plot summary, i.e. triples containing a word that appears in the same functional role in more than one phrase. For example, Vianne is the typical subject in two sentences: Vianne, Tsubj, help and Vianne, Tsubj, opens.

Two votes are cast on Vianne, which is the ‘Tsubj’ in both events, Vianne opens a chocolaterie and Vianne has been helping Josephine

Figure 2-9: The event-centric algorithm applied to plot summary data, casting two votes for the node Vianne
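The vote casting illustrated in Figure 2-9 can be approximated with simple counting. The sketch below is a deliberate simplification: it counts how often a node appears in the same functional role instead of running PageRank, the triples are written by hand rather than produced by a parser, and the names `cast_votes` and `important_triples` are hypothetical.

```python
from collections import Counter

# Hand-written triples in the form (node_i, relation, node_j), following the
# notation used in the text; a real system would derive them from logical forms.
ps_triples = [
    ("Vianne", "Tsubj", "open"),
    ("Vianne", "Tsubj", "help"),
    ("Josephine", "Tobj", "help"),
]


def cast_votes(triples):
    """Count how often each (node, role) pair occurs across the triples."""
    return Counter((node, rel) for node, rel, _ in triples)


def important_triples(triples):
    """Keep the triples whose (node, role) pair received more than one vote."""
    votes = cast_votes(triples)
    return [t for t in triples if votes[(t[0], t[1])] > 1]


print(cast_votes(ps_triples))
print(important_triples(ps_triples))
```

With these three triples, only the two triples with Vianne as typical subject pass the threshold, which mirrors the two votes shown in the figure.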

We then match the important plot summary triples to the corresponding audio description triples. The plot summary triple Vianne, Tsubj, help is matched against the same triple in the audio description sentence Vianne helps Armande to her feet. However, this is a spurious match, as the correct matches should involve Vianne helping a different character, Josephine.

| Plot summary triple | Audio description triple/s |
|---|---|
| Vianne [PROPER NOUN], Tsubj, helping [VERB] | Vianne [PROPER NOUN], Tsubj, helping [VERB] |
| Vianne has been helping Josephine | Vianne helps Armande to her feet |

Table 2-3: Plot summary and audio description matched triples for one event from the film ‘Chocolat’

Using triples can be restrictive for our data, as we are more interested in who participates in the event than in the word referring to the event itself. Extending triples to take into account all the event participants, or matching words in the roles of subject and object without matching the verb, would perhaps be a more suitable approach for the pair of texts plot summary and audio description.
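The alternative suggested above, matching the words in subject and object roles while ignoring the verb, might look as follows. This is only a sketch of the idea under simple assumptions: events are hand-built dictionaries, the role labels reuse ‘Tsubj’ and ‘Tobj’ from the text, and `participants` and `same_participants` are hypothetical names.

```python
def participants(event):
    """Return the set of (role, name) pairs of an event, ignoring its verb."""
    return set(event["roles"].items())


def same_participants(ps_event, ad_event):
    """True when both events have identical participants in identical roles."""
    return participants(ps_event) == participants(ad_event)


# Hand-built events for illustration.
ps_event = {"verb": "help", "roles": {"Tsubj": "Vianne", "Tobj": "Josephine"}}
ad_help = {"verb": "help", "roles": {"Tsubj": "Vianne", "Tobj": "Armande"}}
ad_teach = {"verb": "teach", "roles": {"Tsubj": "Vianne", "Tobj": "Josephine"}}

print(same_participants(ps_event, ad_help))
print(same_participants(ps_event, ad_teach))
```

Under this comparison the utterance with the same verb but a different object is rejected, while the utterance with a different verb but the same two participants is accepted.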

2.2.3 The Boosting Algorithm

The Boosting algorithm (Zhang et al, 2003) merges information from multiple news texts about the same stories, producing “a strong hypothesis combining ‘weak’ or ‘base’ hypotheses”. Taking advantage of lexical, syntactic and semantic features, the algorithm performs a binary classification, deciding whether or not two fragments are related by matching common lexical and syntactic features and computing the semantic distance. No stemming or stop-word deletion is applied, and only three lexical features are of interest to the authors of the algorithm:

1. The number of tokens in sentence 1
2. The number of tokens in sentence 2
3. The number of tokens in common

Figure 2-10: The lexical features analysis in the Boosting algorithm (Zhang et al, 2003)

We have manually simulated the Boosting algorithm for one example; Table 2-4 shows the application of the three lexical features to the pair of texts plot summary and audio description, analysing one event from the film ‘Chocolat’. As the number of tokens in the audio description sentence must not diverge much from the number of tokens in the plot summary clause, we have collected the audio description sentences whose token count is within +/- six tokens of the plot summary sentence and which have at least three tokens in common with it. Four tokens in common is the highest number of matches in this data set.
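The three lexical features and the selection thresholds described above can be sketched as follows. The tokeniser, the function names and the default parameters are assumptions made for this illustration and are not taken from Zhang et al (2003); token counts depend on the tokeniser and on whether inflected forms such as helps and helping are treated as the same token, so they need not agree with counts made by hand.

```python
import re


def tokenize(sentence):
    """Lowercase and keep alphabetic runs; no stemming or stop-word removal."""
    return re.findall(r"[a-z]+", sentence.lower())


def lexical_features(s1, s2):
    """Return (tokens in s1, tokens in s2, distinct tokens in common)."""
    t1, t2 = tokenize(s1), tokenize(s2)
    return len(t1), len(t2), len(set(t1) & set(t2))


def is_candidate(s1, s2, max_length_gap=6, min_common=3):
    """Apply the length and overlap thresholds described in the text."""
    n1, n2, common = lexical_features(s1, s2)
    return abs(n1 - n2) <= max_length_gap and common >= min_common


ps = "Vianne has been helping Josephine out of her abusive marriage"
ad = "Vianne teaches Josephine the art of chocolate making."
print(lexical_features(ps, ad), is_candidate(ps, ad))
```

Because no stop-word list is applied, a closed class word such as of contributes to the overlap in exactly the same way as a character name.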

| PS sentence | AD sentence | Number of tokens in the PS sentence | Number of tokens in the AD sentence | Number of tokens in common |
|---|---|---|---|---|
| Vianne has been helping Josephine out of her abusive marriage | 01:46:54 At the shop Josephine ladles out liquid chocolate - Vianne spreads it over a slab of white marble. | 10 | 16 | 4 |
| | 01:48:19 Vianne teaches Josephine the art of chocolate making. | 10 | 7 | 3 |
| | 02:20:40 Vianne helps Armande to her feet. | 10 | 6 | 3 |

Table 2-4: The three lexical features for one event from the plot summary and audio description for the film ‘Chocolat’

Three audio description utterances have been retrieved, and only the second match refers to the event help (Vianne teaches Josephine). In the other two sentences, closed class words such as of or out are matched, as no stop-word list was used to prevent the matching of closed class words.

The syntactic feature analysis shows which words may be more important than others, namely those having a Part-of-Speech (POS) tag x such as noun, proper noun, verb, adjective or adverb. Three syntactic features are of interest to the authors of the algorithm (Figure 2-11).

1. Number of tokens having POS x in the PS sentence
2. Number of tokens having POS x in the AD sentence
3. Number of common tokens having POS x

Figure 2-11: The syntactic features analysis in the Boosting algorithm
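The three syntactic features of Figure 2-11 can be computed per POS tag and then summed over the open class tags. The sketch below assumes hand-tagged, lemmatised input and a coarse tag inventory (PN, NN, VB, JJ, RB) chosen for illustration; the function names are hypothetical and no tagger is invoked.

```python
# Coarse open class tags used for this sketch (an assumption, not the tagset
# of the original algorithm).
OPEN_CLASS = ("PN", "NN", "VB", "JJ", "RB")


def pos_features(tagged1, tagged2, pos):
    """Return (count in sentence 1, count in sentence 2, common lemmas) for one tag."""
    a = [lemma for lemma, tag in tagged1 if tag == pos]
    b = [lemma for lemma, tag in tagged2 if tag == pos]
    return len(a), len(b), len(set(a) & set(b))


def open_class_totals(tagged1, tagged2):
    """Sum the three features over all open class tags."""
    rows = [pos_features(tagged1, tagged2, pos) for pos in OPEN_CLASS]
    return tuple(sum(column) for column in zip(*rows))


# Hand-tagged open class words of one PS clause and one AD utterance.
ps = [("Vianne", "PN"), ("help", "VB"), ("Josephine", "PN"),
      ("abusive", "JJ"), ("marriage", "NN")]
ad = [("Vianne", "PN"), ("help", "VB"), ("Armande", "PN"), ("foot", "NN")]

print(pos_features(ps, ad, "PN"))
print(open_class_totals(ps, ad))
```

The totals show two open class words in common for this pair even though the second participant differs, which is how a shared verb and a single shared name can produce a misleading match.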

The same example can be analysed as shown in Table 2-5. Four audio description utterances containing common words with open class POS tags have been detected, and again only one is correct. The verb help has been erroneously matched, as the participants of the event are different: in the plot summary the participants are Vianne and Josephine, whereas in the audio description they are Vianne and Armande.

| PS sentence | AD sentence | Tokens having POS x in the PS | Tokens having POS x in the AD | Common tokens having POS x |
|---|---|---|---|---|
| Vianne [PN] has been helping [VBG] Josephine [PN] out of her abusive [JJ] marriage [NN] | 01:46:54 At the shop Josephine [PN] ladles out liquid chocolate - Vianne [PN] spreads it over a slab of white marble. | 5 | 9 | 2 |
| | 01:48:19 Vianne [PN] teaches Josephine [PN] the art of chocolate making. | 5 | 6 | 2 |
| | 02:12:01 Copper pans gleam in the kitchen as Vianne [PN] and Josephine [PN] prepare the food for the party. | 5 | 9 | 2 |
| | 02:20:40 Vianne [PN] helps [VBZ] Armande to her feet. | 5 | 4 | 2 |

Table 2-5: The three syntactic features for one event from the plot summary and audio description for the film ‘Chocolat’

In the semantic features analysis, the authors take advantage of the syntactic analysis first and then compute the semantic distance between pairs of words to find out if the top level head nouns and verbs present any synonyms, estimating if two words are semantically related or not, using the WordNet thesaurus. The same example is shown in Table 2-6.

1. Find the top level NP and VP in sentence 1 and sentence 2
2. Find the head tokens of both NP and VP
3. Align the heads correspondingly (e.g. NP VS NP...)
4. For each head pair compute the semantic distance

Figure 2-12: The semantic features analysis in the Boosting algorithm
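Steps 3 and 4 of Figure 2-12 can be illustrated with a toy stand-in for the thesaurus lookup. Everything in this sketch is an assumption made for illustration: `TOY_SYNSETS` is a small hand-made table and not WordNet, the distance values (0 for identical lemmas, 1 for listed synonyms, `None` for unrelated words) are a simplification, and the heads are supplied by hand rather than extracted by a parser.

```python
# Hypothetical stand-in for a thesaurus: groups of lemmas treated as synonyms.
TOY_SYNSETS = [{"help", "assist", "aid"}, {"teach", "instruct"}]


def semantic_distance(a, b):
    """Return 0 for identical lemmas, 1 for listed synonyms, None if unrelated."""
    if a == b:
        return 0
    if any(a in group and b in group for group in TOY_SYNSETS):
        return 1
    return None


def compare_heads(ps_heads, ad_heads):
    """Align heads by phrase type (NP with NP, VP with VP) and score each pair."""
    return {kind: semantic_distance(ps_heads[kind], ad_heads[kind])
            for kind in ps_heads if kind in ad_heads}


ps_heads = {"NP": "Vianne", "VP": "help"}
print(compare_heads(ps_heads, {"NP": "Vianne", "VP": "help"}))
print(compare_heads(ps_heads, {"NP": "Vianne", "VP": "teach"}))
```

Only one NP head and one VP head are aligned per sentence here, so a second participant in object position plays no part in the score.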

In the case of matching plot summary and audio description, it is more important to match all the participants of the event, i.e. both NPs Vianne and Josephine, taking into account the transitivity, rather than one participant with the verb. The semantic analysis points at the match with both heads having the least semantic distance. Here, the audio description utterance Vianne helps Armande to her feet is matched against the plot summary clause Vianne has been helping Josephine, which does not refer to the same event, as it includes different participants. It is very rare to match the same verb, and computing the semantic distance between words is not of importance in this data set, as no synonyms were found, which usually is the case in the specific pair of texts as we will later show in Chapter 4.


| PS sentence | AD sentences retrieved by syntactic features | Aligned heads | Semantic distance |
|---|---|---|---|
| Vianne [NP head] has been helping [VP head] Josephine out of her abusive marriage | 01:46:54 Vianne [NP head] spreads [VP head] it over a slab of white marble. | PS NP head = Vianne; PS VP head = help; AD NP head = Vianne; AD VP head = spread | PS NP head Vianne vs AD NP head Vianne: distance = 0; PS VP head help vs AD VP head spread: not related |
| | 01:48:19 Vianne [NP head] teaches [VP head] Josephine the art of chocolate making. | PS NP head = Vianne; PS VP head = help; AD NP head = Vianne; AD VP head = teach | PS NP head Vianne vs AD NP head Vianne: distance = 0; PS VP head help vs AD VP head teach: not related |
| | 02:12:01 Vianne [NP head] and Josephine [NP head] prepare [VP head] the food for the party. | PS NP head = Vianne; PS NP head = Josephine; PS VP head = help; AD NP head = Vianne; AD VP head = prepare | PS NP head Vianne vs AD NP head Vianne: distance = 0; PS VP head help vs AD VP head prepare |
| | 02:20:40 Vianne [NP head] helps [VP head] Armande to her feet. | PS NP head = Vianne; PS VP head = help; AD NP head = Vianne; AD VP head = help | PS NP head Vianne vs AD NP head Vianne: distance = 0; PS VP head help vs AD VP head help: distance = 0 |

Table 2-6: The semantic features for one event from the plot summary and audio description for the film ‘Chocolat’

2.2.4 Discussion

Recent cross-document coreference approaches are applied between the same types of text and involve newswire texts. We have reviewed three different algorithms, all intended for matching texts of the same type. The n-gram algorithm matches consecutive words and scores important words by computing the term frequency in the document and in the corpus of documents, as well as the word distance from the beginning of the document, since words mentioned close to the beginning are scored as significant. Matching n-grams is a popular technique used by different kinds of text retrieval systems, such as LAKE and News Story Gisting. It

In document Cross-document coreference between different types of collateral texts for films. (Page 46-57)