Character Identification - Modeling Narrative Discourse

The first challenge is to “chunk” (identify and delimit) mentions of characters from the

text. It is not enough to compile a list of names; we must annotate the text to identify each mention of each character, so we can have a better idea about when each character

is speaking. This is part of the Named Entity Recognition (NER) task, as previously articulated by the ACE program [Doddington et al., 2004] and elsewhere.

Character identification in novels is made complicated by the fact that there are myriad ways in which authors refer to characters. A named entity (e.g., Ebenezer Scrooge) is one

way. However, there can be aliases and variations (Mr. Scrooge). Authors also use pronouns and descriptive nominals (he, she, the old man) to refer to given characters known by proper

names. Nondescript characters, by definition, are only given nominals (the porter, the clerk). To identify all the characters in the text, then, we not only have to search across proper

nouns, nominals and pronouns, but link together those mentions which co-refer to the same entity.

We attempted to apply two recent named entity recognition and coreference systems to the QSA corpus: ACEJet [Grishman et al., 2005] and Reconcile [Stoyanov et al., 2010], both of which have been applied to the ACE task. However, both were designed more for news prose than for literary fiction, which presented substantial challenges to their adoption

for this task: Neither system was able to process texts at the novella length or longer, given our available resources, as this is several orders of magnitude longer than a typical news

article; also, both tools were aggressive in their merging of mentions into entities, which led to unacceptable precision losses. For instance, gender distinctions as implied by titles

were sometimes ignored, such that Mr. Weston and Mrs. Weston were merged into a single entity in the case of Emma. We believe the complexity of literary prose, compared to the

simpler syntactic structures found in newswire, calls for a new approach that is built on a distinct set of assumptions.

To maintain high precision, we developed a custom pipeline for literary fiction that breaks the character identification task into three stages: first, to chunk named

entities, pronouns and nominals; second, to perform coreference at a high precision, but only between named entities; and third, to roll the more difficult coreference task—pronoun

resolution—into the larger quoted speech attribution task, as we do not require a general solution to this problem to address the task at hand (finding the named entity or nominal

responsible for each instance of quoted speech).

Pronouns are easy to detect, as there is a closed set of words for which we can search

(as we are limiting our investigation to English). Named entities and nominals are more difficult. Fortunately, chunking named entities is a task that carries over cleanly from other

discourse types, such as news, so we were able to leverage publicly available tools for the task. We processed each novel with the Stanford Named Entity Recognizer [Finkel et al., 2005], which applies Conditional Random Field (CRF) sequence models to the vector of words found in a document. The system was trained on the data set for the shared named entity

recognition task published at the Seventh Conference on Computational Natural Language Learning (CoNLL),[4] a collection of news articles from the Reuters Corpus.[5] Each named entity is classified as either a Person, Location or Organization. In adapting the Stanford NER finder to the QSA and LSN corpora, we found that the tool sometimes classified two identical mentions in two different classes, as they appeared in different contexts (one context might suggest that Darcy is a person, where another appears to refer to him as a place). In these cases, we assigned the class that took a plurality of “votes” across all identical mentions. We then removed all Location entities from our list of characters.

[4] Available at http://www.cnts.ua.ac.be/conll2003/ner/#CN03
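The voting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this work: the function names, the `(text, class)` input format and the class labels `PERSON` and `LOCATION` are invented for the example, and the tie-breaking rule (first class seen wins) is an arbitrary choice that the text does not specify.

```python
from collections import Counter, defaultdict


def resolve_entity_classes(mentions):
    """Give every identical mention string the class that wins a plurality vote.

    `mentions` is a list of (surface_text, ner_class) pairs, one per tagged
    mention (a hypothetical input format chosen for this sketch).
    """
    votes = defaultdict(Counter)
    for text, ner_class in mentions:
        votes[text][ner_class] += 1
    # most_common(1) yields the top-voted class; ties fall to the class seen
    # first, which is an assumption of this sketch.
    return {text: counts.most_common(1)[0][0] for text, counts in votes.items()}


def character_candidates(mentions):
    """Drop every entity whose resolved class is a location."""
    resolved = resolve_entity_classes(mentions)
    return {text: cls for text, cls in resolved.items() if cls != "LOCATION"}


# Toy tagger output: "Darcy" is tagged as a place in one context.
tagged = [
    ("Darcy", "PERSON"),
    ("Darcy", "LOCATION"),
    ("Darcy", "PERSON"),
    ("Pemberley", "LOCATION"),
    ("Pemberley", "LOCATION"),
]
candidates = character_candidates(tagged)
```

Voting over all identical mentions, rather than trusting each context separately, keeps a single stray classification from splitting one character into a person and a place.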

Using the development segment of the QSA corpus as a testbed, we developed a custom heuristic for extracting nominals. This method scans the pre-processed text against a

regular expression that searches each line for two close (but not necessarily adjacent) tokens: a determiner and a head noun. We compiled lists of determiners and head nouns using a

subset of the development corpus—determiners included the normal a and the, as well as possessives that include head nouns (e.g., her father, Isabella’s husband) and both ordinal

and cardinal numbers (two women). The text that falls between the determiner and the head noun is assumed to be a modifier, although we manually tuned the regular expressions

to separate legitimate modifiers from noise. A modifier can either be a single word, or two words separated by a comma.
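A minimal sketch of such a pattern follows, using Python's `re` module. The word lists here are tiny stand-ins invented for illustration (the lists described above were far larger), and the exact expression is an assumption rather than the one that was manually tuned for the QSA corpus. It accepts a determiner (or a capitalized possessive), an optional modifier of one word or two comma-separated words, and a head noun.

```python
import re

# Illustrative stand-ins for the compiled determiner and head noun lists.
DETERMINERS = ["a", "an", "the", "his", "her", "their", "two", "three"]
HEAD_NOUNS = ["man", "woman", "women", "porter", "clerk", "nephew",
              "father", "husband", "lady", "stranger"]


def build_nominal_pattern(determiners, head_nouns):
    det_words = "|".join(re.escape(d) for d in determiners)
    # A determiner is a listed word (any case) or a possessive name such as
    # "Isabella's"; the possessive branch stays case-sensitive.
    det = rf"(?:(?i:{det_words})|[A-Z][a-z]+['’]s)"
    # Optional modifier: one word, or two words separated by a comma.
    modifier = r"(?:[a-z]+(?:, [a-z]+)? )?"
    head = "|".join(re.escape(h) for h in head_nouns)
    return re.compile(rf"\b(?P<det>{det}) (?P<mod>{modifier})(?P<head>{head})\b")


NOMINAL_PATTERN = build_nominal_pattern(DETERMINERS, HEAD_NOUNS)


def extract_nominals(text, pattern=NOMINAL_PATTERN):
    """Return (determiner, modifier, head noun) triples found in `text`."""
    return [(m.group("det"), m.group("mod").strip(), m.group("head"))
            for m in pattern.finditer(text)]


found = extract_nominals("But the old man looked at Isabella's husband and the porter.")
```

Restricting the modifier slot to at most two words is what keeps the determiner and head noun "close"; a looser gap would admit more noise, which is the trade-off the manual tuning addressed.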

We compiled a list of valid head nouns by adapting a subset of the taxonomy of English words offered by WordNet [Fellbaum, 1998]. Specifically, we chose subtrees that could potentially describe an animate agent, including organisms (the stranger), imaginary beings and spiritual beings. This required some filtering, as the WordNet “organism” hierarchy includes many words not typically used as nouns. For instance, heavy is typically an adjective, but it can refer to “an actor who plays villainous roles.” We trained a rule-based classifier [Cohen, 1995] on a subset of the development segment of the QSA corpus (based on our own annotations) to filter out such undesirable nouns and decrease the noise generated by the

nominal chunker. Features included counts of WordNet senses for the word as an adjective, a noun and a verb, as well as the noun senses’ “sense numbers.” In the latter case, WordNet

assigns a number to each sense to rank its prevalence in the various corpora which have been tagged against the lexicon. Words whose noun senses appeared frequently in WordNet’s

semantic concordance texts, especially relative to their non-noun senses, were allowed to be head nouns. The list numbered some 20,000 nouns, from aardvark to Zulu, including a fair

number of compound nouns. Table 2.3 shows excerpts from Dickens, Flaubert, Chekhov and Twain (clockwise from top left), including names and nominals in bold that our system identified as character mentions outside of quoted speech.

Dickens:

“A merry Christmas, uncle! God save you!” cried a cheerful voice. It was the voice of Scrooge’s nephew, who came upon him so quickly that this was the first intimation he had of his approach.

“Bah!” said Scrooge, “Humbug!”

He had so heated himself with rapid walking in the fog and frost, this nephew of Scrooge’s, that he was all in a glow; his face was ruddy and handsome; his eyes sparkled, and his breath smoked again.

“Christmas a humbug, uncle!” said Scrooge’s nephew. “You don’t mean that, I am sure?”

Flaubert:

“And,” said Madame Bovary, taking her watch from her belt, “take this; you can pay yourself out of it.”

But the tradesman cried out that she was wrong; they knew one another; did he doubt her? What childishness!

She insisted, however, on his taking at least the chain, and Lheureux had already put it in his pocket and was going, when she called him back.

“You will leave everything at your place. As to the cloak”—she seemed to be reflecting—“do not bring it either; you can give me the maker’s address, and tell him to have it ready for me.”

Twain:

“Well, I do, too– LIVE ones. But I mean dead ones, to swing round your head with a string.”

“No, I don’t care for rats much, anyway. What I like is chewing-gum.”

“Oh, I should say so! I wish I had some now.”

“Do you? I’ve got some. I’ll let you chew it awhile, but you must give it back to me.”

Chekhov:

He beckoned coaxingly to the Pomeranian, and when the dog came up to him he shook his finger at it. The Pomeranian growled: Gurov shook his finger at it again.

The lady looked at him and at once dropped her eyes.

“He doesn’t bite,” she said, and blushed.

“May I give him a bone?” he asked; and when she nodded he asked courteously, “Have you been long in Yalta?”

Table 2.3: Four samples of output that show the extracted character names and nominals (in bold) and quoted speech fragments (in italics).

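The second stage of the pipeline is coreference among named entities, that is, finding the named entity mentions that refer to the same individual (as opposed to pronoun or nominal anaphora resolution). Our heuristic for this task is based on work we previously published in the domain of scholarly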

monographs about art and architecture [Davis et al., 2003]. This approach finds named entity mentions that are variations of one another, grouping them into clusters that assume

transitivity (in that mentions that are variations of the same entity are assumed to be variations of one another). The clustering process is as follows:

1. For each named entity, we generate variations on the name that we would expect to see in coreferent mentions. Each variation omits certain parts of multi-word names,

respecting titles and first/last name distinctions. For example, Mr. Sherlock Holmes may refer to the same character as Mr. Holmes, Sherlock Holmes, Sherlock and

Holmes. (We found that in this literary genre, feminine titles could not be removed without confusing the women’s names with those of their male relatives.)

2. For each named entity, we compile a list of other named entities that may be coreferent mentions, either because they are identical or because one is an expected variation on

the other.

3. We then match each mention with the most recent of its possible coreferent mentions. In aggregate, this creates a cluster of mentions for each character.
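The three steps above can be sketched as follows. This is a simplified illustration, not the implementation used here: the title lists, the function names and the rule that a name's first and last tokens are its first and last names are all assumptions made for the example.

```python
REMOVABLE_TITLES = {"Mr.", "Dr.", "Sir"}      # assumed list, for illustration
FEMININE_TITLES = {"Mrs.", "Miss", "Ms."}     # kept attached to the name


def name_variations(name):
    """Step 1: expected shorter forms of a multi-word name."""
    tokens = name.split()
    has_title = tokens[0] in REMOVABLE_TITLES | FEMININE_TITLES
    title = tokens[0] if has_title else None
    names = tokens[1:] if has_title else tokens
    variations = set()
    if not names:
        return variations
    if title:
        variations.add(f"{title} {names[-1]}")        # e.g. "Mr. Holmes"
    if title is None or title in REMOVABLE_TITLES:
        variations.add(" ".join(names))               # e.g. "Sherlock Holmes"
        if len(names) > 1:
            variations.update({names[0], names[-1]})  # "Sherlock", "Holmes"
    variations.discard(name)
    return variations


def may_corefer(a, b):
    """Step 2: identical, or one is an expected variation of the other."""
    return a == b or a in name_variations(b) or b in name_variations(a)


def cluster_mentions(mentions):
    """Step 3: link each mention to its most recent compatible antecedent.

    `mentions` is in document order; returns one cluster id per mention.
    """
    cluster_ids = []
    next_id = 0
    for i, mention in enumerate(mentions):
        for j in range(i - 1, -1, -1):          # scan backwards: most recent first
            if may_corefer(mention, mentions[j]):
                cluster_ids.append(cluster_ids[j])
                break
        else:
            cluster_ids.append(next_id)
            next_id += 1
    return cluster_ids


mentions = ["Mr. Sherlock Holmes", "Mrs. Weston", "Holmes",
            "Mr. Weston", "Sherlock", "Weston"]
clusters = cluster_mentions(mentions)
```

In this toy run, Sherlock and Holmes are not variations of one another, yet both join the cluster of Mr. Sherlock Holmes, which is the transitivity assumption at work. Mrs. Weston stays apart from Mr. Weston because her title is never stripped.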

Though we also group together identical nominals as referring to the same entity, we do not attempt to find coreference between nominals, pronouns and named entities. That is,

we do not perform anaphora resolution as a discrete task. Instead, we roll this ambiguity into the input for our next larger task, quoted speech attribution. When faced with such

input, as we will see, the QSA solver chooses the most likely speaker from among several nearby mentions.

In order to make quoted speech attribution easier, we also pre-process the texts with an automatic tagger that assigns a gender (male, female, plural, or unknown) to as many

named entities as possible. We do this first by finding mentions with gendered titles (e.g., Mr.), gendered head words (nephew) and first names as given in a gendered name dictionary

(Emma). Then, we assume that each named entity has one gender that is shared transitively by all of its mentions. Mr. Scrooge, for instance, assigns a “male” tag to itself and forwards

this tag to Scrooge, which assigns it further to Ebenezer Scrooge, which assigns it finally to Ebenezer (though redundantly, as the gender dictionary knows this name to be male). All mentions start out with the “unknown” tag; if two mentions for the same entity are

tagged with opposing genders by this approach, we take a vote among all the mentions with assigned genders, and apply the gender with the plurality of votes to each mention. This

assumes that all mentions refer to an entity with a consistent gender.
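A rough sketch of this tagging-and-voting scheme is given below. The lexicons are three tiny invented dictionaries standing in for the title list, head-word list and gendered name dictionary, the function names are hypothetical, and the sketch omits the “plural” tag; it is meant only to make the propagation logic concrete.

```python
from collections import Counter

# Tiny illustrative lexicons; real resources would be much larger.
GENDERED_TITLES = {"Mr.": "male", "Mrs.": "female", "Miss": "female"}
GENDERED_HEADS = {"nephew": "male", "niece": "female"}
FIRST_NAMES = {"Ebenezer": "male", "Emma": "female"}


def mention_gender(mention):
    """Gender implied by a single mention on its own, else 'unknown'."""
    for token in mention.split():
        for lexicon in (GENDERED_TITLES, GENDERED_HEADS, FIRST_NAMES):
            if token in lexicon:
                return lexicon[token]
    return "unknown"


def tag_entity(mentions):
    """Share one gender across all mentions of an entity by plurality vote."""
    votes = Counter(g for g in map(mention_gender, mentions) if g != "unknown")
    if not votes:
        return {m: "unknown" for m in mentions}
    winner = votes.most_common(1)[0][0]   # ties fall to the first gender seen
    return {m: winner for m in mentions}


scrooge = ["Mr. Scrooge", "Scrooge", "Ebenezer Scrooge", "Ebenezer"]
scrooge_tags = tag_entity(scrooge)

# An invented conflicting cluster: two "male" votes outweigh one "female" vote.
conflict_tags = tag_entity(["Mr. Smith", "Mr. Smith", "Emma Smith"])
```

Here the bare mention Scrooge carries no gender cue of its own and receives “male” only because it shares an entity with Mr. Scrooge and Ebenezer.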

Although we did not conduct a formal evaluation of this component in isolation, its output was used to give the annotators of the QSA corpus “candidate” speakers for each quote. As we mentioned earlier, only in 3% of quotes was there a recall issue in which the speaker

was not extracted by our tool and presented as a candidate. Meanwhile, the precision of the named entity coreference aspect is incorporated into the forthcoming evaluation of the [...]

[...] other genres and determine how well they perform in various contexts. Two areas to address are language independence (i.e., these steps do not work on French or German texts) and

more wide-ranging, automatic coreference. One particular irony of the limited-coreference approach we have described is that characters who change names, or are called two different names at different times, are taken to be separate individuals (including, in the LSN corpus, a Dr. Jekyll on one page and a Mr. Hyde on the next).
