2 Cross-Document Coreference for Video Retrieval
2.2 Current Cross-Document Coreference Approaches
To investigate the current state of the art in cross-document coreference, we review three recent approaches that employ the most popular techniques and are used by cross-document summarisation systems built to summarise news stories by merging multiple articles in DUC (2004). Previous cross-document coreference algorithms concern limited sets of events in specific domains, such as football, terrorist or election events. We will show by example how the most recent algorithms work on texts such as plot summary and audio description, which may convey the same story plot while using different amounts and kinds of information. Plot summary and audio description serve different functions and differ significantly in the language used, as well as in the amount and kinds of information they contain. The n-gram algorithm, the event-centric algorithm and the Boosting algorithm are intended for cross-document coreference between texts of the same type. Using part-of-speech (POS) taggers, they apply techniques such as scoring important words by their frequency, matching n-grams or common open class words, matching triples (i.e. verbs with another word in the role of object or subject) and computing the semantic distance between words. Most existing approaches to cross-document coreference have been tested on news articles and between texts of the same type. The present study proposes cross-document coreference between different types of texts, such as plot summary and audio description, covering an unlimited set of events, such as film events, by matching combinations of functional roles, such as subject and object, and by considering the temporal aspect of an event to determine the number of expected matches.
2.2.1 N-gram Algorithms
“An n-gram grammar is a representation of an N-th order Markov language model in which the probability of occurrence of a symbol is conditioned upon the prior occurrence of N-1 other symbols” (Brown et al, 2001). N-gram matching algorithms can match single words (‘uni-grams’) or two or more consecutive words (‘bi-grams’, ‘tri-grams’ etc.), including grammar tags such as noun or adjective. N-gram algorithms are used by the LAKE system (D’Avanzo et al, 2004), which extracts and scores key-phrases in a corpus of news articles, and by the News Story Gisting system (Doran et al, 2004), which generates short news story summaries by predicting which words should appear in the resultant summary. The extraction of candidate phrases is based on n-grams: uni-grams or bi-grams for entity extraction (common noun, proper noun, adjective + noun etc.) and tri-grams or four-grams for events or situations (noun + verb + noun + adverb etc.). N-grams are scored according to statistical methods, involving
the document term frequency (TF), the inverse document frequency (IDF, the log of the number of all documents divided by the number of documents containing the term), as well as the ‘first occurrence’ or ‘word distance’, which is the relative position of a candidate word with respect to the beginning of the document. The TF, the IDF and the word distance are computed to score important n-grams (Figure 2-7).
1. Match or fuzzy match n-grams
2. Compute TF
3. Compute IDF
4. Compute word distance
Figure 2-7: The n-gram algorithm as it appears in (D’Avanzo et al, 2004; Doran et al, 2004)
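The steps of Figure 2-7 can be sketched in a few lines of Python. This is a minimal illustration and not the LAKE or News Story Gisting implementation: the function names `ngrams` and `score_ngrams` are hypothetical, matching is exact rather than fuzzy, and the two lower-cased toy sentences stand in for whole documents.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def score_ngrams(documents, doc_index, n=1):
    """Score the n-grams of one document with TF, IDF and first occurrence."""
    grams = ngrams(documents[doc_index], n)
    counts = Counter(grams)
    # Pre-compute the set of n-grams of every document for the IDF step.
    gram_sets = [set(ngrams(doc, n)) for doc in documents]
    scores = {}
    for gram, count in counts.items():
        containing = sum(1 for grams_in_doc in gram_sets if gram in grams_in_doc)
        scores[gram] = {
            "tf": count / len(grams),
            "idf": math.log(len(documents) / containing),
            # Relative position of the first occurrence (0.0 = document start).
            "first_occurrence": grams.index(gram) / len(grams),
        }
    return scores


# Toy corpus (assumption): one sentence per "document".
plot_summary = "vianne has been helping josephine out of her abusive marriage".split()
audio_description = "vianne helps armande to her feet".split()
scores = score_ngrams([plot_summary, audio_description], doc_index=0)
print(scores[("vianne",)])
print(scores[("josephine",)])
```

In this toy corpus a uni-gram that occurs in both documents receives an IDF of zero, whereas one that occurs only in the plot summary receives log 2.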
Now, let us consider how well this works on plot summary and audio description. The selection of relevant phrases in the plot summary considers syntactic patterns describing an entity (uni/bi-grams) or a concise event or situation (tri/four-grams), using the Penn Treebank tagset. The plot summary (PS) sentence would then be tagged as follows:
Vianne [NP] has [VBZ] been [VBN] helping [VBG] Josephine [NP] out [IN] of [IN] her [PP] abusive [JJ] marriage [NN]
where ‘NP’ = proper noun, singular; ‘VBZ’ = verb, 3rd person singular present; ‘VBN’ = verb, past participle; ‘VBG’ = verb, gerund or present participle; ‘IN’ = preposition or subordinating conjunction; ‘PP’ = personal pronoun; ‘JJ’ = adjective; and ‘NN’ = noun, singular or mass.
Here we manually simulated the n-gram algorithm. As the important keywords are spread through the sentence, applying bi-grams, tri-grams or four-grams does not detect any matches in the audio description. Thus, we spot uni-grams such as Vianne and Josephine, which describe entities, and the uni-gram helping, which describes an event (Table 2-2).
N-gram            PS n-gram TF      AD n-gram TF         n-gram TF in the corpus     n-gram IDF
                                                         of one PS and one AD
Vianne [NP]       2/114 - 1.7%      75/5,916 - 1.2%      77/6,030 - 1.2%             1
Josephine [NP]    1/114 - 0.8%      35/5,916 - 0.5%      36/6,030 - 0.5%             1
Help [VB-]        1/114 - 0.8%      4/5,916 - 0.06%      5/6,030 - 0.08%             1
Table 2-2: TF and IDF for plot summary and audio description matched n-grams for one event in the film ‘Chocolat’ (the n-grams were lemmatised)
The binary classifier decides whether a sentence is relevant or not based on two features: TF, showing whether the term is frequent in the specific document, and IDF, showing whether the term occurs in all texts of the corpus or not. For example, the frequency (TF) of the uni-gram Vianne in the corpus of one plot summary and one audio description is 77/6,030 (or 1.2%), see Table 2-2. The inverse document frequency (IDF) of the n-grams Vianne, Josephine and helping is reported as 1, as the n-grams occur in both documents of a corpus of one plot summary and one audio description, i.e. two divided by two equals one; the logarithm of this ratio is zero, so the IDF cannot distinguish between these terms. The TF of all n-grams in this corpus is very low, ranging from 0.08% for the lemmatised n-gram help to 1.2% for the n-gram Vianne, which the algorithm would take to show that they are not important. However, these statistics are not useful when matching a plot summary against the corresponding audio description spans, as the important words to be matched are the proper nouns or head nouns detected in the plot summary. The word helping was fuzzy matched four times. The principle of word distance tends to consider words appearing close to the beginning of a document as important, since they introduce the theme of the article. Candidate matches of Vianne, Josephine and helping were selected in the audio description but were erroneously matched, as the correct matches appear in later parts of the audio description and not at the beginning of the document.
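The arithmetic behind these figures can be checked with a short Python sketch. The counts are copied from Table 2-2; the helper names `corpus_tf` and `document_ratio` are hypothetical, and the printed percentages may differ in the last digit from the table, depending on how the original values were rounded.

```python
import math

# (occurrences, total tokens) per document, copied from Table 2-2.
counts = {
    "Vianne": {"PS": (2, 114), "AD": (75, 5916)},
    "Josephine": {"PS": (1, 114), "AD": (35, 5916)},
    "help": {"PS": (1, 114), "AD": (4, 5916)},
}


def corpus_tf(term):
    """Term frequency over the plot summary and audio description pooled together."""
    (ps_n, ps_total), (ad_n, ad_total) = counts[term]["PS"], counts[term]["AD"]
    return (ps_n + ad_n) / (ps_total + ad_total)


def document_ratio(n_documents, n_containing):
    """Number of documents divided by the number of documents containing the term."""
    return n_documents / n_containing


for term in counts:
    # Every term occurs in both documents of the two-document corpus.
    ratio = document_ratio(2, 2)
    print(f"{term}: corpus TF = {corpus_tf(term):.2%}, "
          f"document ratio = {ratio:.0f}, log of ratio = {math.log(ratio):.0f}")
```

Because every candidate term occurs in both texts, the document ratio is identical for all of them, which is why the IDF carries no information in a corpus of two documents.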
N-grams can be restrictive for the pair of texts plot summary and audio description, as keywords do not necessarily occur in the same order and longer n-grams would be needed, which are hard to match or even fuzzy match. The TF is not important in our case, as we consider the nouns and verbs in the plot summary as the candidate keywords, and these may be infrequent. The IDF is not important in matching plot summary and audio description, as information is matched in two texts only. These statistics are important for detecting candidate keywords in more than two texts, when the selection is not based on one of the texts as it is in the case of plot summary. The uni-gram Vianne was found 75 times in the audio description, but only four of these occurrences refer to the event in which the character participated. The uni-gram Josephine was found 35 times in the audio description and only on two occasions was it included in the same event. The word help was found four times and all matches are incorrect. The candidate keywords need to be combined with the occurrence of other keywords to find the correct match. Lexical cohesion is included in News Story Gisting, based on word repetition and synonymy relations according to WordNet. In our case it has not been applied, as synonyms are not detected in plot summary and audio description.
2.2.2 Event-Centric Algorithm
The ‘event-centric’ algorithm is part of the MSR-NLP system (Vanderwende et al, 2004) and proposes a verb-centred method that tags verbs and nouns, as well as time and syntactic relations (object, subject). The extracted fragments are divided into events and entities. First, the event expressed by a verb is taken into account in combination with another word assigned the role of either logical subject or object; this combination is called a ‘triple’. Using the PageRank algorithm, the triples which have received more than one vote are detected and marked as important (Figure 2-8).
1. Tag verbs and nouns, time, syntactic relations (subject, object)
2. Produce triples in the form of LFNodei, rel, LFNodej
3. Score important triples
Figure 2-8: The event-centric algorithm (Vanderwende et al, 2004)
Here, we manually simulated the event-centric algorithm and applied it to the pair of texts plot summary and audio description. We produce triples, which take the form ‘LFNodei, rel, LFNodej’ in the logical form of sentences, meaning that the tokens are assigned a functional role of subject or object, called ‘Tsubj’, i.e. ‘typical subject’, or ‘Tobj’, i.e. ‘typical object’. For example, spider is the ‘typical’ subject in the sentence Peter is bitten by a spider, while Peter is the grammatical subject. Here, we detect the typical subjects or objects of the verbs. We then score important triples in the plot summary, meaning triples that include a word appearing in a functional role in more than one phrase. For example, Vianne is the typical subject in two sentences: Vianne, Tsubj, help and Vianne, Tsubj, opens.
Two votes are cast on Vianne, which is the ‘Tsubj’ in both events, Vianne opens a chocolaterie and Vianne has been helping Josephine.
Figure 2-9: The event-centric algorithm applied to plot summary data, casting two votes for the node Vianne
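The voting step can be illustrated with a small Python sketch. Note that it replaces the PageRank scoring of the original system with plain vote counting, which is enough to reproduce the two votes for Vianne. The two ‘Tobj’ triples and the function names are assumptions added for illustration.

```python
from collections import Counter

# Triples in the form (node_i, relation, node_j). The two Tsubj triples come
# from the plot summary example; the two Tobj triples are assumed here.
triples = [
    ("Vianne", "Tsubj", "open"),
    ("Vianne", "Tsubj", "help"),
    ("chocolaterie", "Tobj", "open"),
    ("Josephine", "Tobj", "help"),
]


def count_votes(triples):
    """Cast one vote on the argument node of every triple."""
    return Counter(node for node, _relation, _verb in triples)


def important_triples(triples, min_votes=2):
    """Keep the triples whose argument node received at least min_votes votes."""
    votes = count_votes(triples)
    return [triple for triple in triples if votes[triple[0]] >= min_votes]


votes = count_votes(triples)
important = important_triples(triples)
print(votes["Vianne"])
print(important)
```

Only the two triples headed by Vianne pass the threshold, so these are the triples that would be carried forward to the matching step.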
We then match the important plot summary triples to the corresponding audio description triples. The plot summary triple Vianne, Tsubj, help is matched against the same triple in the audio description sentence Vianne helps Armande to her feet. However, this is a spurious match, as the correct matches should involve Vianne helping a different character, Josephine.
Plot summary triple                             Audio description triple/s
Vianne [PROPER NOUN], Tsubj, helping [VERB]     Vianne [PROPER NOUN], Tsubj, helping [VERB]
in: Vianne has been helping Josephine           in: Vianne helps Armande to her feet
Table 2-3: Plot summary and audio description matched triples for one event from the film ‘Chocolat’
Using triples can be restrictive in our data, as we are more interested in who is participating in the event than in the word referring to the event itself. Extending triples to take into account all the event participants, or matching words in the roles of subject and object without matching the verb, would perhaps be a more suitable approach for the pair of texts plot summary and audio description.
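The difference between the two strategies can be sketched as follows. This is only an illustration of the idea suggested above, not an implemented system: the role labels assigned to the audio description sentences are hand annotations assumed for the example, and the function names are hypothetical.

```python
# Hand-annotated events; the role labels for the audio description
# sentences are assumptions made for this illustration.
ps_event = {"verb": "help", "roles": {"Tsubj": "Vianne", "Tobj": "Josephine"}}
ad_events = [
    {"time": "01:48:19", "verb": "teach",
     "roles": {"Tsubj": "Vianne", "Tobj": "Josephine"}},
    {"time": "02:20:40", "verb": "help",
     "roles": {"Tsubj": "Vianne", "Tobj": "Armande"}},
]


def match_by_triple(ps, candidates):
    """Triple matching: the typical subject and the verb must both agree."""
    return [c for c in candidates
            if c["verb"] == ps["verb"]
            and c["roles"].get("Tsubj") == ps["roles"].get("Tsubj")]


def match_by_participants(ps, candidates):
    """Role matching: every participant must agree; the verb is ignored."""
    return [c for c in candidates if c["roles"] == ps["roles"]]


triple_matches = match_by_triple(ps_event, ad_events)
participant_matches = match_by_participants(ps_event, ad_events)
print([m["time"] for m in triple_matches])
print([m["time"] for m in participant_matches])
```

Triple matching selects the sentence with Armande because the verb agrees, while participant matching selects the sentence in which both Vianne and Josephine take part.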
2.2.3 The Boosting Algorithm
The Boosting algorithm (Zhang et al, 2003) merges information from multiple news texts about the same stories, producing “a strong hypothesis combining ‘weak’ or ‘base’ hypotheses”. Taking advantage of lexical, syntactic and semantic features, the algorithm performs a binary classification, deciding whether two fragments are related or not by matching common lexical and syntactic features and computing the semantic distance. Without applying any stemming or stop-word deletion, only three lexical features are of interest to the authors of the algorithm:
1. The number of tokens in sentence 1
2. The number of tokens in sentence 2
3. The number of tokens in common
Figure 2-10: The lexical features analysis in the Boosting algorithm (Zhang et al, 2003)
We have manually simulated the Boosting algorithm for one example; Table 2-4 shows the application of the three lexical feature rules to the pair of texts plot summary and audio description, analysing one event from the film ‘Chocolat’. As the number of tokens in the audio description sentence must not diverge greatly from the number of tokens included in
the plot summary clause, we have collected the audio description sentences whose token count is within +/- six tokens of that of the plot summary sentence and which have at least three tokens in common with it. Four tokens in common is the highest number of matches in this data set.
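The three lexical features and the selection rule just described can be sketched in Python. This is a simplified illustration rather than the implementation of Zhang et al: the tokeniser is an assumption, tokens are matched exactly (so helps and helping do not count as common), and the resulting counts therefore need not agree with those reported in Table 2-4.

```python
import re


def tokenize(sentence):
    """Lower-cased word tokens; no stemming and no stop-word removal."""
    return re.findall(r"[a-z]+", sentence.lower())


def lexical_features(sentence_1, sentence_2):
    """Return (tokens in sentence 1, tokens in sentence 2, tokens in common)."""
    tokens_1, tokens_2 = tokenize(sentence_1), tokenize(sentence_2)
    return len(tokens_1), len(tokens_2), len(set(tokens_1) & set(tokens_2))


def is_candidate(sentence_1, sentence_2, max_length_gap=6, min_common=3):
    """Apply the +/- six tokens and at-least-three-common-tokens rule."""
    n_1, n_2, n_common = lexical_features(sentence_1, sentence_2)
    return abs(n_1 - n_2) <= max_length_gap and n_common >= min_common


ps = "Vianne has been helping Josephine out of her abusive marriage"
ad = "Vianne teaches Josephine the art of chocolate making."
print(lexical_features(ps, ad))
print(is_candidate(ps, ad))
```

Since closed class words are not filtered out, a token such as of contributes to the common count in the same way as a proper noun does.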
PS sentence: Vianne has been helping Josephine out of her abusive marriage

AD sentence                                                Tokens in the    Tokens in the    Tokens in
                                                           PS sentence      AD sentence      common
01:46:54 At the shop Josephine ladles out liquid           10               16               4
  chocolate - Vianne spreads it over a slab of
  white marble.
01:48:19 Vianne teaches Josephine the art of               10               7                3
  chocolate making.
02:20:40 Vianne helps Armande to her feet.                 10               6                3
Table 2-4: The three lexical features for one event from the plot summary and audio description for the film ‘Chocolat’
Three audio description utterances have been retrieved and only the second match refers to the event help (Vianne teaches Josephine). In the other two sentences, closed class words such as of or out are matched, as no stop word list was used to prevent the matching of closed class words.
The syntactic feature analysis shows which words may be more important than others, namely those having a part-of-speech (POS) tag x such as noun, proper noun, verb, adjective or adverb. Three syntactic features are of interest to the authors of the algorithm (Figure 2-11).
1. Number of tokens having POS x in the PS sentence
2. Number of tokens having POS x in the AD sentence
3. Number of common tokens having POS x
Figure 2-11: The syntactic features analysis in the Boosting algorithm
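The three syntactic features of Figure 2-11 can be computed for a single POS tag as in the sketch below. The sentences are tagged by hand: the plot summary tags follow Table 2-5, while the placeholder tag ‘X’ and the tags of the audio description tokens are assumptions made for this example, as is the function name `syntactic_features`.

```python
# Hand-tagged sentences. Plot summary tags follow Table 2-5; the placeholder
# tag "X" and the audio description tags are assumptions for this example.
ps_tagged = [("Vianne", "PN"), ("has", "X"), ("been", "X"), ("helping", "VBG"),
             ("Josephine", "PN"), ("out", "X"), ("of", "X"), ("her", "X"),
             ("abusive", "JJ"), ("marriage", "NN")]
ad_tagged = [("Vianne", "PN"), ("teaches", "VBZ"), ("Josephine", "PN"),
             ("the", "X"), ("art", "NN"), ("of", "X"), ("chocolate", "NN"),
             ("making", "NN")]


def syntactic_features(tagged_1, tagged_2, pos):
    """Return the three counts of Figure 2-11 for one POS tag."""
    words_1 = [word.lower() for word, tag in tagged_1 if tag == pos]
    words_2 = [word.lower() for word, tag in tagged_2 if tag == pos]
    return len(words_1), len(words_2), len(set(words_1) & set(words_2))


proper_noun_features = syntactic_features(ps_tagged, ad_tagged, "PN")
noun_features = syntactic_features(ps_tagged, ad_tagged, "NN")
print(proper_noun_features)
print(noun_features)
```

Restricting the comparison to open class tags removes matches on words such as of, which the purely lexical features could not exclude.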
The same example can be analysed as shown in Table 2-5. Four audio description utterances that share words tagged with open class POS tags have been detected, and again only one is correct. The verb help has been erroneously matched, as the participants of the event are different: in the plot summary the participants are Vianne and Josephine, and in the audio description the participants are Vianne and Armande.
PS sentence: Vianne [PN] has been helping [VBG] Josephine [PN] out of her abusive [JJ] marriage [NN]

AD sentence                                                Tokens having    Tokens having    Common tokens
                                                           POS x in the PS  POS x in the AD  having POS x
01:46:54 At the shop Josephine [PN] ladles out liquid      5                9                2
  chocolate - Vianne [PN] spreads it over a slab of
  white marble.
01:48:19 Vianne [PN] teaches Josephine [PN] the art        5                6                2
  of chocolate making.
02:12:01 Copper pans gleam in the kitchen as Vianne        5                9                2
  [PN] and Josephine [PN] prepare the food for the
  party.
02:20:40 Vianne [PN] helps [VBZ] Armande to her feet.      5                4                2
Table 2-5: The three syntactic features for one event from the plot summary and audio description for the film ‘Chocolat’
In the semantic features analysis, the authors take advantage of the syntactic analysis first and then compute the semantic distance between pairs of words to find out if the top level head nouns and verbs present any synonyms, estimating if two words are semantically related or not, using the WordNet thesaurus. The same example is shown in Table 2-6.
1. Find the top level NP and VP in sentence 1 and sentence 2
2. Find the head tokens of both NP and VP
3. Align the heads correspondingly (e.g. NP VS NP...)
4. For each head pair compute the semantic distance
Figure 2-12: The semantic features analysis in the Boosting algorithm
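Steps 3 and 4 of Figure 2-12 can be sketched as follows, assuming the NP and VP heads have already been identified by hand. The original algorithm consults WordNet; here a tiny hand-made synonym list (`TOY_SYNSETS`, a hypothetical stand-in) is used instead so that the example stays self-contained, and the distance values 0, 1 and `None` are likewise a simplification.

```python
# Hypothetical stand-in for a WordNet lookup: each set lists lemmas that
# this toy example treats as synonyms.
TOY_SYNSETS = [{"help", "assist", "aid"}]


def semantic_distance(lemma_1, lemma_2):
    """0 for identical lemmas, 1 for toy synonyms, None for unrelated lemmas."""
    if lemma_1 == lemma_2:
        return 0
    if any(lemma_1 in synset and lemma_2 in synset for synset in TOY_SYNSETS):
        return 1
    return None


def compare_heads(heads_1, heads_2):
    """Align NP with NP and VP with VP, then score each aligned pair."""
    return {phrase: semantic_distance(heads_1[phrase], heads_2[phrase])
            for phrase in ("NP", "VP")}


ps_heads = {"NP": "Vianne", "VP": "help"}
ad_spread = {"NP": "Vianne", "VP": "spread"}   # 01:46:54
ad_help = {"NP": "Vianne", "VP": "help"}       # 02:20:40
print(compare_heads(ps_heads, ad_spread))
print(compare_heads(ps_heads, ad_help))
```

Under this scheme the sentence sharing both the subject head and the verb head obtains the smallest distance, regardless of who the other participant in the event is.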
In the case of matching plot summary and audio description, it is more important to match all the participants of the event, i.e. both NPs Vianne and Josephine, taking into account the transitivity, rather than one participant with the verb. The semantic analysis points at the match with both heads having the least semantic distance. Here, the audio description utterance Vianne helps Armande to her feet is matched against the plot summary clause Vianne has been helping Josephine, which does not refer to the same event, including different participants. It is very rare to match the same verb, and computing the semantic distance between words is not of importance in this data set, as no synonyms were found, which usually is the case in the specific pair of texts as we will later show in Chapter 4.
PS sentence: Vianne [NP head] has been helping [VP head] Josephine out of her abusive marriage

AD sentence retrieved by syntactic features: 01:46:54 Vianne [NP head] spreads [VP head] it over a slab of white marble.
  Aligned heads: PS NP head = Vianne; PS VP head = help; AD NP head = Vianne; AD VP head = spread
  Semantic distance: PS NP head Vianne VS AD NP head Vianne: distance = 0
                     PS VP head help VS AD VP head spread: not related

AD sentence retrieved by syntactic features: 01:48:19 Vianne [NP head] teaches [VP head] Josephine the art of chocolate making.
  Aligned heads: PS NP head = Vianne; PS VP head = help; AD NP head = Vianne; AD VP head = teach
  Semantic distance: PS NP head Vianne VS AD NP head Vianne: distance = 0
                     PS VP head help VS AD VP head teach: not related

AD sentence retrieved by syntactic features: 02:12:01 Vianne [NP head] and Josephine [NP head] prepare [VP head] the food for the party.
  Aligned heads: PS NP head = Vianne; PS NP head = Josephine; PS VP head = help; AD NP head = Vianne; AD VP head = prepare
  Semantic distance: PS NP head Vianne VS AD NP head Vianne: distance = 0
                     PS VP head help VS AD VP head prepare

AD sentence retrieved by syntactic features: 02:20:40 Vianne [NP head] helps [VP head] Armande to her feet.
  Aligned heads: PS NP head = Vianne; PS VP head = help; AD NP head = Vianne; AD VP head = help
  Semantic distance: PS NP head Vianne VS AD NP head Vianne: distance = 0
                     PS VP head help VS AD VP head help: distance = 0
Table 2-6: The semantic features for one event from the plot summary and audio description for the film ‘Chocolat’
2.2.4 Discussion
Recent cross-document coreference approaches are applied between texts of the same type and involve newswire texts. We have reviewed three different algorithms, intended for matching texts of the same type. The n-gram algorithm matches consecutive words and scores important words by computing the term frequency in the document and in the corpus of documents, as well as the word distance from the beginning of the document, since words mentioned close to the beginning are scored as significant. Matching n-grams is a popular technique used by different kinds of text retrieval systems, such as LAKE and News Story Gisting. It