Toward Completeness in Concept Extraction and Classification

(1)

Toward Completeness in Concept Extraction and Classification

Eduard Hovy and Zornitsa Kozareva USC Information Sciences Institute

4676 Admiralty Way Marina del Rey, CA 90292

[email protected], [email protected]

Ellen Riloff School of Computing

University of Utah Salt Lake City, UT 84112

[email protected]

Abstract

Many algorithms extract terms from text to-gether with some kind of taxonomic clas-sification (is-a) link. However, the general approaches used today, and specifically the methods of evaluating results, exhibit serious shortcomings. Harvesting without focusing on a specific conceptual area may deliver large numbers of terms, but they are scattered over an immense concept space, making Recall judgments impossible. Regarding Precision, simply judging the correctness of terms and their individual classification links may pro-vide high scores, but this doesn’t help with the eventual assembly of terms into a single coher-ent taxonomy. Furthermore, since there is no correct and complete gold standard to measure against, most work invents some ad hoc evalu-ation measure. We present an algorithm that is more precise and complete than previous ones for identifying from web text just those con-cepts ‘below’ a given seed term. Comparing the results to WordNet, we find that the algo-rithm misses terms, but also that it learns many new terms not in WordNet, and that it clas-sifies them in ways acceptable to humans but different from WordNet.

1 Collecting Information with Care

Over the past few years, many algorithms have been published on automatically harvesting terms and their conceptual types from the web and/or other large corpora (Etzioni et al., 2005; Pasca, 2007; Banko et al., 2007; Yi and Niblack, 2005; Snow et al., 2005). But several basic problems limit the even-tual utility of the results.

First, there is no standard collection of facts against which results can be measured. As we show

in this paper, WordNet (Fellbaum, 1998), the most obvious contender because of its size and popularity, is deficient in various ways: it is neither complete nor is its taxonomic structure inarguably perfect. As a result, alternative ad hoc measures are invented that are not comparable. Second, simply harvesting facts about an entity without regard to its actual sub-sequent organization inflates Recall and Precision evaluation scores: while it is correct that a jaguar is a animal, mammal, toy, sports-team, car-make, and operating-system, this information doesn’t help to create a taxonomy that, for example, places mam-mal and animam-mal closer to one another than to some of the others. ((Snow et al., 2005) is an exception to this.) As a result, this work may give a mislead-ing sense of progress. Third, entities are of differ-ent formal types, and their taxonomic treatmdiffer-ent is consequently different: some are at the level of in-stances (e.g., Michelangelo was a painter) and some at the level of concepts (e.g., a painter is a human).

The goal of our research is to learn terms for en-tities (objects) and their taxonomic organization si-multaneously, from text. Our method is to use a single surface-level pattern with several open posi-tions. Filling them in different ways harvests differ-ent kinds of information, and/or confirms this infor-mation. We evaluate in two ways: against WordNet, since that is a commonly available and popular source, and also by asking humans to judge the re-sults since WordNet is neither complete nor exhaus-tively taxonomized.

(2)

An-imals or People) both at high precision compared to WordNet and including additional correct ones as well. We would like to also report on Recall rela-tive to WordNet, but given the problems described in Section 4, this turns out to be much harder than would seem.

First, we need to define some basic terminology:

term: An English word (for our current purposes, a

noun or a proper name).

seed term: A word we use to initiate the algorithm. concept: An item in the classification taxonomy we

are building. A concept may correspond to several terms (singular form, plural form, the term’s syn-onyms, etc.).

root concept: A concept at a fairly general (high)

level in the taxonomy, to which many others are eventually learned to be subtypes/instances of.

basic-level concept: A concept at the ‘basic level’,

corresponding approximately to the Basic Level cat-egories defined in Prototype Theory in Psychology (Rosch, 1978). For our purposes, a concept corre-sponding to the (proto)typical level of generality of its type; that is, a dog, not a mammal or a dachshund; a singer, not a human or an opera diva.

instance: An item in the classification taxonomy

that is more specific than a concept; only one exam-ple of the instance exists in ‘the real world’ at any time. For example, Michelangelo is an instance, as well as Mazda Miata with license plate 3HCY687, while Mazda Miata is not.

classification link: We use a single relation, that,

depending on its arguments, is either is a type of (when both arguments are concepts), or is an in-stance of or is an example of (when the first argu-ment is an instance/example of the second).

Section 2 describes our method for harvesting; Section 3 discusses related work; and Section 4 de-scribes the experiments and the results.

2 Term and Relation Extraction using the Doubly-Anchored Pattern

Our goal is to develop a technique that automatically ‘fills in’ the concept space in the taxonomy below any root concept, by harvesting terms through re-peated web queries. We perform this in two alter-nating stages.

Stage 1: Basic-level/Instance concept collec-tion: We use the Doubly-Anchored Pattern DAP

de-veloped in (Kozareva et al., 2008):

DAP: [SeedTerm1] such as [SeedTerm2] and<X>

which learns a list of basic-level concepts or in-stances (depending on whether SeedTerm2

ex-presses a basic-level concept or an instance).1 DAP

is very reliable because it is instantiated with ex-amples at both ‘ends’ of the space to be filled (the higher-level (root) concept SeedTerm1 and a basic-level term or instance (SeedTerm2)), which mutu-ally disambiguate each other. For example, “pres-idents” for SeedTerm1 can refer to the leader of a country, corporation, or university, and “Ford” for SeedTerm2 can refer to a car company, an automo-bile pioneer, or a U.S. president. But when the two terms co-occur in a text that matches the pattern

“Presidents such as Ford and <X>”, the text will

almost certainly refer to country presidents.

The first stage involves a series of repeated re-placements of SeedTerm2 by newly-learned terms in order to generate even more seed terms. That is, each new basic-level concept or instance is rotated into the pattern (becoming a new SeedTerm2) in a bootstrapping cycle that Kozareva et al. called reck-less bootstrapping. This procedure is implemented as exhaustive breadth-first search, and iterates until no new terms are harvested. The harvested terms are incorporated in a Hyponym Pattern Linkage Graph

(HPLG)G = (V, E), where each vertexv ∈ V is

a candidate term and each edge (u, v) ∈ E

indi-cates that term vwas generated by termu. A term

u is ranked by Out-Degree(u) =

P

∀(u,v)∈Ew(u,v)

|V|−1 ,

which represents the weighted sum ofu’s outgoing

edges normalized by the total number of other nodes in the graph. Intuitively, a term ranks highly if it is frequently discovering many different terms dur-ing the reckless bootstrappdur-ing cycle. This method is very productive, harvesting a constant stream of new terms for basic-level concepts or instances when the taxonomy below the initial root concept SeedTerm1 is extensive (such as for Animals or People).

1_{Strictly speaking, our lowest-level concepts can be}

(3)

Stage 2: Intermediate level concept collection:

Going beyond (Kozareva et al., 2008), we next apply the Doubly-Anchored Pattern in the ‘backward’

di-rection (DAP−1), for any two seed terms

represent-ing basic-level concepts or instances:

DAP−1:<X>such as [SeedTerm1] and [SeedTerm2]

which harvests a set of concepts, most of them inter-mediate between the basic level or instance and the initial higher-level seed.

This second stage (DAP−1) has not yet been

de-scribed in the literature. It proceeds analogously. For pairs of basic-level concepts or instances be-low the root concept that were found during the first

stage, we instantiate DAP−1 and issue a new web

query. For example, if the term “cats” was harvested

by DAP in “Animals such as dogs and<X>”, then

the pair< dogs, cats >forms the new Web query

“<X>such as dogs and cats”. We extract up to 2

consecutive nouns from the<X>position.

This procedure yields a large number of discov-ered concepts, but they cannot all be used for fur-ther bootstrapping. In addition to practical limita-tions (such as limits on web querying), many of them are too general–more general than the initial root concept–and could derail the bootstrapping process by introducing terms that stray every further away from the initial root concept. We therefore rank the harvested terms based on the likelihood that they will be productive if they are expanded in the next cycle. Ranking is based on two criteria: (1) the con-cept should be prolific (i.e., produce many lower-level concepts) in order to keep the bootstrapping process energized, and (2) the concept should be subordinate to the root concept, so that the process stays within the targeted part of the search space.

To perform ranking, we incorporate both the har-vested concepts and the basic-level/instance pairs into a Hypernym Relation Graph (HRG), which we

define as a bipartite graphHRG= (V, E)with two

types of vertices. One set of vertices represents the

concepts (the category vertices (Vc), and a second

set of vertices represents the basic-level/instance pairs that produced the concepts (the member pair

vertices (V_mp)). We create an edge e(u, v) ∈ E

between u ∈ Vc and v ∈ Vmp when the

con-cept represented by u was harvested by the

basic-level/instance pair represented byv, with the weight

of the edge defined as the number of times that the lower pair found the concept on the web.

We use the Hypernym Relation Graph to rank the intermediate concepts based on each node’s In-Degree, which is the sum of the weights on the

node’s incoming edges. Formally, In-Degree(u) =

P

∀(u,v)∈Ew(u, v). Intuitively, a concept will be ranked highly if it was harvested by many different combinations of basic-level/instance terms.

However, this scoring function does not deter-mine whether a concept is more or less general than the initial root concept. For example, when har-vesting animal categories, the system may learn the word “species”, which is a very common term asso-ciated with animals, but also applies to non-animals such as plants. To prevent the inclusion of over-general terms and constrain the search to remain ‘below’ the root concept, we apply a Concept Posi-tioning Test (CPT): We issue the following two web queries:

(a) Concept such as RootConcept and<X> (b) RootConcept such as Concept and<X>

If (b) returns more web hits than (a), then the con-cept passes the test, otherwise it fails. The first (most highly ranked) concept that passes CPT becomes the new seed concept for the next bootstrapping cycle. In principle, we could use all the concepts that pass

the CPT for bootstrapping2. However, for practical

reasons (primarily limitations on web querying), we run the algorithm for 10 iterations.

3 Related Work

Many algorithms have been developed to automat-ically acquire semantic class members using a va-riety of techniques, including co-occurrence statis-tics (Riloff and Shepherd, 1997; Roark and Char-niak, 1998), syntactic dependencies (Pantel and Ravichandran, 2004), and lexico-syntactic patterns (Riloff and Jones, 1999; Fleischman and Hovy, 2002; Thelen and Riloff, 2002).

The work most closely related to ours is that of (Hearst, 1992) who introduced the idea of apply-ing hyponym patterns to text, which explicitly iden-tify a hyponym relation between two terms (e.g.,

2

(4)

“such authors as <X>”). In recent years, sev-eral researchers have followed up on this idea using the web as a corpus. (Pasca, 2004) applies lexico-syntactic hyponym patterns to the Web and use the contexts around them for learning. KnowItAll (Et-zioni et al., 2005) applies the hyponym patterns to extract instances from the Web and ranks them by relevance using mutual information. (Kozareva et al., 2008) introduced a bootstrapping scheme using the doubly-anchored pattern (DAP) that is guided through graph ranking. This approach reported a significant improvement from 5% to 18% over ap-proaches using singly-anchored patterns like those of (Pasca, 2004) and (Etzioni et al., 2005).

(Snow et al., 2005) describe a dependency path based approach that generates a large number of weak hypernym patterns using pairs of noun phrases present in WordNet. They build a classifier using the different hypernym patterns and find among the highest precision patterns those of (Hearst, 1992). Snow et al. report performance of 85% precision at 10% recall and 25% precision at 30% recall for 5300 hand-tagged noun phrase pairs. (McNamee et al., 2008) use the technique of (Snow et al., 2005) to harvest the hypernyms of the proper names. The average precision on 75 automatically detected cat-egories is 53%. The discovered hypernyms were intergrated in a Question Answering system which showed an improvement of 9% when evaluated on a TREC Question Answering data set.

Recently, (Ritter et al., 2009) reported hypernym learning using (Hearst, 1992) patterns and manually tagged common and proper nouns. All hypernym candidates matching the pattern are acquired, and the candidate terms are ranked by mutual informa-tion. However, they evaluate the performance of their hypernym algorithm by considering only the top 5 hypernyms given a basic-level concept or in-stance. They report 100% precision at 18% recall, and 66% precision at 72% recall, considering only the top-5 list. Necessarily, using all the results re-turned will result in lower precision scores. In con-trast to their approach, our aim is to first acquire au-tomatically with minimal supervision the basic-level concepts for given root concept. Thus, we almost entirely eliminate the need for humans to provide hyponym seeds. Second, we evaluate the perfor-mance of our approach not by measuring the

top-ranked 5 hypernyms given a basic-level concept, but considering all harvested hypernyms of the concept. Unlike (Etzioni et al., 2005), (Pasca, 2007) and (Snow et al., 2005), we learn both instances and con-cepts simultaneously.

Some researchers have also worked on reorga-nizing, augmenting, or extending semantic concepts that already exist in manually built resources such as WordNet (Widdows and Dorow, 2002; Snow et al., 2005) or Wikipedia (Ponzetto and Strube, 2007). Work in automated ontology construction has cre-ated lexical hierarchies (Caraballo, 1999; Cimiano and Volker, 2005; Mann, 2002), and learned seman-tic relations such as meronymy (Berland and Char-niak, 1999; Girju et al., 2003).

4 Evaluation

The root concepts discussed in this paper are An-imals and People, because they head large taxo-nomic structures that are well-represented in Word-Net. Throughout these experiments, we used as the initial SeedTerm2 lions for Animals and Madonna for People (by specifically choosing a proper name for People we force harvesting down to the level of individual instances). To collect data, we submitted the DAP patterns as web queries to Google, retrieved the top 1000 web snippets per query, and kept only the unique ones. In total, we collected 1.1 GB of snippets for Animals and 1.5 GB for People. The algorithm was allowed to run for 10 iterations.

The algorithm learns a staggering variety of terms that is much more diverse than we had antici-pated. In addition to many basic-level concepts or instances, such as dog and Madonna respectively, and many intermediate concepts, such as mammals, pets, and predators, it also harvested categories that clearly seemed useful, such as laboratory animals, forest dwellers, and endangered species. Many other harvested terms were more difficult to judge, includ-ing bait, allergens, seafood, vectors, protein, and pests. While these terms have an obvious relation-ship to Animals, we have to determine whether they are legitimate and valuable subconcepts of Animals. A second issue involves relative terms that are hard to define in an absolute sense, such as native animals and large mammals.

(5)

• Precision: What is the correctness of the har-vested concepts? (How many of them are sim-ply wrong, given the root concept?)

• Recall: What is the coverage of the harvested

concepts? (How many are missing, below a given root concept?)

• How correct is the taxonomic structure

learned?

Given the number and variety of terms obtained, we initially decided that an automatic evaluation against existing resources (such as WordNet or something similar) would be inadequate because they do not contain many of our harvested terms, even though many of these terms are clearly sensi-ble and potentially valuasensi-ble. Indeed, the whole point of our work is to learn concepts and taxonomies that go above and beyond what is currently available.

However, it is necessary to compare with

something, and it is important not to skirt the issue by conducting evaluations that measure subsets of results, or that perhaps may mislead. We therefore decided to compare our results against WordNet and to have human annotators judge as many results as we could afford (to obtain a measure of Precision and the legitimate extensions beyond WordNet).

Unfortunately, it proved impossible to measure Recall against WordNet, because this requires as-certaining the number of synsets in WordNet be-tween the root and its basic-level categories. This requires human judgment, which we could not

af-ford. We plan to address this question in future

work. Also, assessing the correctness of the learned taxonomy structure requires the manual assessment of each classification link proposed by the system that is not already in WordNet, a task also beyond our budget to complete in full. Some results—for just basic-level terms and intermediate concepts, but not among intermediate-level concepts–are shown in Section 4.3.

We provide Precision scores using the following measures, where terms refers to the harvested terms:

P rW N= _##_{terms harvested by system}terms found in W ordNet

P rH=#terms judged correct by human_#_{terms harvested by system}

NotInW N= #terms judged correct by human but not in W ordNet

We conducted three sets of experiments.

Ex-periment 1 evaluates the results of using DAP to

learn basic-level concepts for Animals and instances for People. Experiment 2 evaluates the results of

using DAP−1 to harvest intermediate concepts

be-tween each root concept and its basic-level concepts or instances. Experiment 3 evaluates the taxonomy structure that is produced via the links between the instances and intermediate concepts.

4.1 Experiment 1: Basic-Level Concepts and Instances

In this section we discuss the results of harvest-ing the basic-level Animal concepts and People in-stances. The bootstrapping algorithm ranks the har-vested terms by their Out-Degree score and

consid-ers as correct only those with Out-Degree > 0. In

ten iterations, the bootstrapping algorithm produced

913Animal basic-level concepts and1,344People

instances that passed this Out-Degree criterion.

4.1.1 Human Evaluation

The harvested terms were labeled by human judges as either correct or incorrect with respect to the root concept. Table 1 shows the Precision of the

top-ranked N terms, with N shown in increments

of 100. Overall, the Animal terms yielded 71%

(649/913) Precision and the People terms yielded 95% Precision (1,271/1,344). Figure 1 shows that higher-ranked Animal terms are more accurate than lower-ranked terms, which indicates that the scor-ing function did its job. For People terms, accuracy was very high throughout the ranked list. Overall, these results show that the bootstrapping algorithm generates a large number of correct instances of high quality.

4.1.2 WordNet Evaluation

Table 1 shows a comparison of the harvested terms against the terms present in WordNet. Note that the Precision measured against WordNet

(P rW N) for People is dramatically different from

the Precision based on human judgments (P rH).

This can be explained by looking at the NotInWN

(6)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

100 200 300 400 500 600 700 800 900

Precision

Rank

Animal Basic-level Concepts

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

200 400 600 800 1000 1200

Precision

Rank

[image:6.595.73.273.68.449.2]

People Instances

Figure 1: Ranked Basic-Concepts and Instances.

and986correct People instances are not present in

WordNet (primarily, for people, because WordNet contains relatively few proper names). These results show that there is substantial room for improvement in WordNet’s coverage of these categories. For Ani-mals, the precision measured against WordNet is ac-tually higher than the precision measured by human judges, which may indicate that the judges failed to recognize some correct animal terms.

P rW N P rH NotInW N

Animal .79 .71 48

People .23 .95 986

Table 1: Instance Evaluation.

4.1.3 Evaluation against Prior Work

To assess how well our algorithm compares with previous semantic class learning methods, we com-pared our results to those of (Kozareva et al., 2008). Our work was inspired by that approach–in fact, we use that previous algorithm as the first step of our bootstrapping process. The novelty of our approach is the insertion of an additional bootstrapping stage that iteratively learns new intermediate concepts

us-ing DAP−1 and the Concept Positioning Test,

fol-lowed by the subsequent use of the newly learned intermediate concepts in DAP to expand the search space beyond the original root concept. This leads to the discovery of additional basic-level terms or in-stances, which are then recycled in turn to discover new intermediate concepts, and so on.

Consequently, we can compare the results pro-duced by the first iteration of our algorithm (be-fore intermediate concepts are learned) to those of (Kozareva et al., 2008) for the Animal and People categories, and then compare again after 10 boot-strapping iterations of intermediate concept learn-ing. Figure 2 shows the number of harvested con-cepts for Animals and People after each bootstrap-ping iteration. Bootstrapbootstrap-ping with intermediate con-cepts produces nearly 5 times as many basic-level concepts and instances than (Kozareva et al., 2008) obtain, while maintaining similar levels of precision. The intermediate concepts help so much because they steer the learning process into new (yet still cor-rect) regions of the search space after each iteration. For instance, in the first iteration, the pattern “ani-mals such as lions and *” harvests about 350 basic-level concepts, but only animals that are mentioned in conjunction with lions are learned. Of these, an-imals typically quite different from lions, such as grass-eating kudu, are often not discovered.

[image:6.595.105.263.672.705.2]

However, in the second iteration, the intermediate concept Herbivore is chosen for expansion. The pat-tern “herbivore such as antelope and *” discovers many additional animals, including kudu, that co-occur with antelope but do not co-co-occur with lions.

(7)

0 500 1000 1500 2000 2500 3000 3500

1 2 3 4 5 6 7 8 9 10

#Items Learned

Iterations

Animal Intermediate Concepts Animal Basic-level Concepts

0 500 1000 1500 2000 2500 3000 3500 4000

1 2 3 4 5 6 7 8 9 10

#Items Learned

Iterations

[image:7.595.88.287.68.447.2]

People Intermediate Concepts People Instances

Figure 2: Learning Curves.

evaluation of the intermediate concept terms.

4.2 Experiment 2: Intermediate Concepts

In this section we discuss the results of harvesting the intermediate-level concepts. Given the variety of the harvested results, manual judgment of correct-ness required an in-depth human annotation study. We also compare our harvested results against the concept terms in WordNet.

4.2.1 Human Evaluation

We hired 4 annotators (undergraduates at a dif-ferent institution) to judge the correctness of the in-termediate concepts. We created detailed annota-tion guidelines that define 14 annotaannota-tion labels for each of the Animal and People classes, as shown in Table 3. The labels are clustered into 4 major

PEOPLE IConcept Instances

Dictators: Adolf Hitler, Joseph Stalin, Benito Mussolini, Lenin, Fidel Castro, Idi Amin, Slobodan Milosevic, Hugo Chavez, Mao Zedong, Saddam Hussein

Celebrities: Madonna, Paris Hilton, Angelina Jolie, Britney , Spears, Tom Cruise, Cameron Diaz, Bono, Oprah Winfrey, Jennifer Aniston, Kate Moss

Writers: William Shakespeare, James Joyce, Charles Dickens, Leo Tolstoy, Goethe, Ralph Waldo Emerson, Daniel Defoe, Jane Austen, Ernest Hemingway, Franz Kafka

ANIMAL IConcept Basic-level Terms

Crustacean: shrimp, crabs, prawns, lobsters, crayfish, mysids, decapods, marron, ostracods, yabbies

Primates: baboons, monkeys, chimpanzees, apes, marmosets, chimps, orangutans, gibbons, tamarins, bonobos

[image:7.595.332.519.68.218.2]

Mammal: mice, whales, seals, dolphins, rats, deer, rabbits, dogs, elephants, squirrels

Table 2: Learned People and Animals Terms.

types: Correct, Borderline, BasicConcept, and Not-Concept. The details of our annotation guidelines, the reasons for the intermediate labels, and the anno-tation study can be found in (Kozareva et al., 2009).

ANIMAL

TYPE LABEL EXAMPLES

Correct GeneticAnimal reptile,mammal

BehavioralByFeeding predator, grazer

BehaviorByHabitat saltwater mammal

BehaviorSocialIndiv herding animal

BehaviorSocialGroup herd, pack

MorphologicalType cloven-hoofed animal

RoleOrFunction pet, parasite

Borderline NonRealAnimal dragons

EvaluativeTerm varmint, fox

OtherAnimal critter, fossil

BasicConcept BasicAnimal dog, hummingbird

NotConcept GeneralTerm model, catalyst

NotAnimal topic, favorite

GarbageTerm brates, mals

PEOPLE

TYPE LABEL EXAMPLES

Correct GeneticPerson Caucasian, Saxon

NonTransientEventRole stutterer, gourmand

TransientEventRole passenger, visitor

PersonState dwarf, schizophrenic

FamilyRelation aunt, mother

SocialRole fugitive, hero

NationOrTribe Bulgarian, Zulu

ReligiousAffiliation Catholic, atheist

Borderline NonRealPerson biblical figures

OtherPerson colleagues, couples

BasicConcept BasicPerson child, woman

RealPerson Barack Obama

NotConcept GeneralTerm image, figure

NotPerson books, events

Table 3: Intermediate Concept Annotation Labels

We measured pairwise inter-annotator agreement across the four labels using the Fleiss kappa (Fleiss,

1971). The κ scores ranged from 0.61–0.71 for

Animals (average κ=0.66) and from 0.51–0.70 for

People (average κ=0.60). These agreement scores

seemed good enough to warrant using these human judgments to estimate the accuracy of the algorithm.

The bootstrapping algorithm harvested3,549

An-imal and4,094People intermediate concepts in ten

[image:7.595.335.519.324.557.2]

(8)

we chose a random sample of intermediate concepts with frequency over 1, which was given to four hu-man judges for annotation. Table 4 summarizes the

labels assigned by the four annotators (A₁ – A₄).

The top portion of Table 4 shows the results for all

the intermediate concepts (437 Animal terms and

296People terms), and the bottom portion shows the

results only for the concepts that passed the Concept

Positioning Test (187Animal terms and139People

terms). Accuracy is computed in two ways: Acc1 is the percent of intermediate concepts labeled as Cor-rect; Acc2 is the percent of intermediate concepts labeled as either Correct or Borderline.

Without the CPT, accuracies range from 53–66%

for Animals and 75–85% for People. After

ap-plying the CPT, the accuracies increase to 71–84% for animals and 82–94% for people. These results confirm that the Concept Positioning Test is effec-tive at removing many of the undesirable terms. Overall, these results demonstrate that our algorithm produced many high-quality intermediate concepts, with good precision.

Figure 3 shows accuracy curves based on the rankings of the intermediate concepts (based on In-Degree scores). The CPT clearly improves accu-racy even among the most highly ranked concepts. For example, the Acc1 curves for animals show that nearly 90% of the top 100 intermediate concepts were correct after applying the CPT, whereas only 70% of the top 100 intermediate concepts were cor-rect before. However, the CPT also eliminates many desirable terms. For People, the accuracies are still relatively high even without the CPT, and a much larger set of intermediate concepts is learned.

Animals People

A1 A2 A3 A4 A1 A2 A3 A4

Correct 246 243 251 230 239 231 225 221 Borderline 42 26 22 29 12 10 6 4 BasicConcept 2 8 9 2 6 2 9 10 NotConcept 147 160 155 176 39 53 56 61 Acc1 .56 .56 .57 .53 .81 .78 .76 .75 Acc2 .66 .62 .62 .59 .85 .81 .78 .76

Animals after CPT People after CPT

A1 A2 A3 A4 A1 A2 A3 A4

[image:8.595.327.523.76.464.2]

Correct 146 133 144 141 126 126 114 116 Borderline 11 15 9 13 6 2 2 0 BasicConcept 2 8 9 2 0 1 7 7 NotConcept 28 31 25 31 7 10 16 16 Acc1 .78 .71 .77 .75 .91 .91 .82 .83 Acc2 .84 .79 .82 .82 .95 .92 .83 .83

Table 4: Human Intermediate Concept Evaluation.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

50 100 150 200 250 300 350 400

Precision

Rank Animal Intermediate Concepts

noCPTC

noCPTCB

withCPTC

withCPTCB

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

50 100 150 200 250 300

Precision

Rank People Intermediate Concepts

noCPT_C noCPT_CB withCPT_C withCPT_CB

Figure 3: Intermediate Concept Precision at Rank N.

4.2.2 WordNet Evaluation

[image:8.595.76.296.566.686.2]

(9)

P rW N P rH NotInWN Animal .20 (88/437) .57 (248/437) 204

People .51 (152/296) .85 (251/296) 108

Table 5: WordNet Intermediate Concept Evaluation.

4.3 Experiment 3: Taxonomic Links

In this section we evaluate the classification (taxon-omy) that is learned by evaluating the links between the intermediate concepts and the basic-level con-cept/instance terms. That is, when our algorithm claims that isa(X,Y), how often is X truly a subcon-cept of Y? For example, isa(goat, herbivore) would be correct, but isa(goat, bird) would not. Again, since WordNet does not contain all the harvested concepts, we conduct both a manual evaluation and a comparison against WordNet.

4.3.1 Manual and WordNet Evaluations

Creating and evaluating the full taxonomic struc-ture between the root and the basic-level or instance terms is future work. Here we evaluate simply the accuracy of the taxonomic links between basic-level concepts/instances and intermediate concepts as har-vested, but not between intermediate concepts. For each pair, we extracted all harvested links and deter-mined whether the same links appear in WordNet. The links were also given to human judges. Table 6 shows the results.

ISA P r_{W N} P r_H NotInWN

[image:9.595.85.280.70.104.2]

Animal .47(912/1940) .88 (1716/1940) 804 People .23 (318/908) .94 (857/908) 539

Table 6: WordNet Taxonomic Evaluation.

The results show that WordNet lacks nearly half of the taxonomic relations that were generated by the algorithm: 804 Animal and 539 People links.

5 Conclusion

We describe a novel extension to the DAP approach for discovering basic-level concepts or instances and their superconcepts given an initial root concept. By appropriate filling of different positions in DAP, the algorithm alternates between ‘downward’ and ‘up-ward’ learning. A key resulting benefit is that each new intermediate-level term acquired restarts har-vesting in a new region of the concept space, which allows previously unseen concepts to be discovered with each bootstrapping cycle.

We also introduce the Concept Positioning Test, which serves to confirm that a harvested concept falls into the desired part of the search space rela-tive to either a superordinate or subordinate concept in the growing taxonomy, before it is selected for further harvesting using the DAP.

These algorithms can augment other term harvest-ing algorithms recently reported. But in order to compare different algorithms, it is important to com-pare results to a standard. WordNet is our best can-didate at present. But WordNet is incomplete. Our results include a significantly large number of in-stances of People (which WordNet does not claim to cover), a number comparable to the results of (Et-zioni et al., 2005; Pasca, 2007; Ritter et al., 2009). Rather surprisingly, our results also include a large number of basic-level and intermediate concepts for Animals that are not present in WordNet, a category WordNet is actually fairly complete about. These numbers show clearly that it is important to conduct manual evaluation of term harvesting algorithms in addition to comparing to a standard resource.

Acknowledgments

This research was supported in part by grants from the National Science Foundation (NSF grant no. IIS-0429360), and the Department of Homeland Se-curity, ONR Grant numbers N0014-07-1-0152 and N00014-07-1-0149. We are grateful to the anno-tators at the University of Pittsburgh who helped us evaluate this work: Jay Fischer, David Halpern, Amir Hussain, and Taichi Nakatani.

References

M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O.Etzioni. 2007. Open information extraction from the web. In Proceedings of International Joint Conference on Artificial Itelligence, pages 2670–2676. M. Berland and E. Charniak. 1999. Finding Parts in Very Large Corpora. In Proc. of the 37th Annual Meeting of the Association for Computational Linguistics. S. Caraballo. 1999. Automatic Acquisition of a

Hypernym-Labeled Noun Hierarchy from Text. In Proc. of the 37th Annual Meeting of the Association for Computational Linguistics, pages 120–126. P. Cimiano and J. Volker. 2005. Towards large-scale,

(10)

O. Etzioni, M. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. 2005. Unsupervised named-entity extraction from the web: an experimental study. Artificial Intelligence, 165(1):91–134, June.

C. Fellbaum. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). May.

M.B. Fleischman and E.H. Hovy. 2002. Fine grained classification of named entities. In Proceedings of the COLING conference, August.

J.L. Fleiss. 1971. Measuring nominal scale agree-ment among many raters. Psychological Bulletin, 76(5):378–382.

R. Girju, A. Badulescu, and D. Moldovan. 2003. Learn-ing semantic constraints for the automatic discovery of part-whole relations. In HLT-NAACL.

M. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING, pages 539–545. Z. Kozareva, E. Riloff, and E. Hovy. 2008.

Seman-tic class learning from the web with hyponym pat-tern linkage graphs. In Proceedings of ACL-08: HLT, pages 1048–1056. Association for Computational Lin-guistics.

Z. Kozareva, E. Hovy, and E. Riloff. 2009. Learning and evaluating the content and structure of a term taxon-omy. In AAAI-09 Spring Symposium on Learning by Reading and Learning to Read.

G. Mann. 2002. Fine-grained proper noun ontologies for question answering. In COLING-02 on SEMANET, pages 1–7.

P. McNamee, R. Snow, P. Schone, and J. Mayfield. 2008. Learning named entity hyponyms for question answer-ing. In Proceedings of the Third International Joint Conference on Natural Language Processing.

P. Pantel and D. Ravichandran. 2004. Automatically la-beling semantic classes. In HLT-NAACL, pages 321– 328.

M. Pasca. 2004. Acquisition of categorized named en-tities for web search. In Proceedings of CIKM, pages 137–145.

M. Pasca. 2007. Weakly-supervised discovery of named entities using web search queries. In CIKM, pages 683–690.

S. Ponzetto and M. Strube. 2007. Deriving a large scale taxonomy from wikipedia. In Proceedings of the 22nd National COnference on Artificial Intelligence (AAAI-07), pages 1440–1447.

E. Riloff and R. Jones. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence.

E. Riloff and J. Shepherd. 1997. A Corpus-Based Ap-proach for Building Semantic Lexicons. In Proceed-ings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117–124.

A. Ritter, S. Soderland, and O. Etzioni. 2009. What is this, anyway: Automatic hypernym discovery. In Pro-ceedings of AAAI-09 Spring Symposium on Learning by Reading and Learning to Read, pages 88–93. B. Roark and E. Charniak. 1998. Noun-phrase

Co-occurrence Statistics for Semi-automatic Semantic Lexicon Construction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 1110–1116.

E. Rosch, 1978. Principles of Categorization, pages 27– 48.

R. Snow, D. Jurafsky, and A. Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In NIPS.

M. Thelen and E. Riloff. 2002. A Bootstrapping Method for Learning Semantic Lexicons Using Extraction Pat-tern Contexts. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Process-ing, pages 214–221.

D. Widdows and B. Dorow. 2002. A graph model for un-supervised lexical acquisition. In Proceedings of the 19th international conference on Computational lin-guistics, pages 1–7.