We have now motivated use cases that would benefit from a knowledge base of entities and entity-related information. Furthermore, we explained which requirements such a knowledge base must meet and posed our research questions and hypotheses, which will be answered in the following chapters.
The problem we are trying to solve in this thesis is the automatic generation of an accurate knowledge base of entities and entity-centric information. As we will see in the next chapter, existing approaches for building such a knowledge base either rely on user input, extract information from a single website or only a few websites, or are very inaccurate.
The remainder of this thesis is structured as follows. In Chapter 2, we provide definitions, background information about knowledge bases, and explanations of evaluation measures. In Chapter 3, we then introduce an architecture for our Web information extraction system by describing its main components, which will be explained in more detail throughout the thesis. Since our goal is to retrieve information quickly, we begin by describing a retrieval strategy for Web feeds in Chapter 4. In Chapter 5, we review related work on entity extraction from the Web, describe in detail five extraction techniques that are used by our system, and compare them against each other. We will see that the extraction results are still very imprecise, which leads us to Chapter 6, where we explain different approaches for assessing uncertain extractions. The goal in that chapter is to find an algorithm that filters incorrect extractions to improve precision without sacrificing recall. Chapter 7 then details five techniques to extract factual information about entities from the Web. These techniques are essential in building the large knowledge base that we envision. In Chapter 8, we describe further extraction techniques for additional entity-centric information, such as news, opinionated statements, and interactive multimedia objects. As motivated earlier, these extracted objects are valuable for an entity-centric knowledge base. Chapter 9 reviews, develops, and compares question answering approaches. Our goal in that chapter is to find out whether knowledge bases are better suited for natural language question answering than systems that try to find answers on the Web. In Chapter 10, we showcase some examples of practical applications that can be developed using the information from the knowledge base that we build in this thesis, before we conclude with the results and findings in Chapter 11.
Chapter 2
Background
This chapter explains the basic terminology and ideas necessary to understand the following chapters. First, we briefly describe information retrieval and define the most important terms used throughout this thesis. Second, we distinguish between three kinds of sources on the Web, since our algorithms treat them differently later on. In the last section of this chapter, we give an overview of related knowledge bases and information extraction systems. A more detailed review of several of these systems follows later in the pertinent sections.
Most algorithms and approaches explained and used in this thesis belong to the field of information retrieval. Information retrieval is a field of research that is concerned with searching and ranking documents matching a user query. Sources used for answering the query can be structured (such as databases or RDF repositories) or semi-structured (such as Web pages). Information retrieval employs approaches from many disciplines such as statistics, linguistics, information architecture, and information science. We will especially make use of the statistical and linguistic aspects in this thesis.
2.1 Definitions
In this section, we define the terms concept, named entity, attribute, and statement, as they are fundamental to other topics covered in this chapter.
2.1.1 Concept
A concept is a chunk of text that refers to “an abstract or general idea inferred or derived from specific instances” (Stark and Riesenfeld, 1998). A concept is therefore a class of things that can be instantiated with a specific instance, an entity. Other researchers use the term “entity type”, which is a synonym for “concept” in the scope of this thesis; we use both terms interchangeably.
2.1.2 Named Entity
There is no consensus on the definition of a named entity in the research community. Often, only instances of concepts in a certain scenario are considered named entities. The following definition is taken from the named entity recognition task from CoNLL 2002:
“Named entities are phrases that contain the names of persons, organizations, locations, times, and quantities.” (Sang and Meulder, 2003a).
The CoNLL 2002 definition is useful for clarifying the goals of the NER task. However, it unnecessarily limits named entities to the concepts Person, Organization, Location, Time, and Quantity, and it is arguable whether time and quantity are real named entities. The term “named” restricts the task to entities that are rigid designators as defined by Kripke (1981), who says that “A rigid designator designates the same object in all possible worlds in which that object exists and never designates anything else” (LaPorte, 2006). Following this path, however, leads us into a philosophical discussion about what an entity is. Nadeau (2008) emphasizes how difficult the definition of a named entity actually is. He says his definition is “[...] ugly and circular, but [...] practical!”: “The types recognized by NER are any sets of words that intersect with an NER type” (Nadeau, 2008).
The Message Understanding Conference (MUC) definition states that named entities are “proper names, acronyms, and perhaps miscellaneous other unique identifiers” which belong to one of the following types: “Organization: named corporate, governmental, or other organizational entity, Person: named person or family, and Location: name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, et cetera)” (Chinchor, 1997). Again, we see a very vague definition of what named entities are.
Borrega et al. (2007) only consider nouns or noun phrases whose referent is unique and unambiguous to be named entities. Furthermore, they distinguish between Strong Named Entities (SNE) and Weak Named Entities (WNE). SNEs are “formed by a word, a number, a date, or, in some cases, a string of words referring to a single individual entity in the real world”. WNEs are syntactic elements consisting of at least one proper noun. Borrega et al. (2007) provide a collection of guidelines to determine which entities fall into the groups SNE and WNE. They base their guidelines on Spanish tests and admit that their definitions and guidelines always have exceptions.
The organizers of the TREC 2010 entity track also call defining entities on the Web an unsolved problem (Balog et al., 2010b), and Rössler (2007) reiterates that there is no consensus on the definition of named entities. He concludes his research on the definition of named entities stating that there are often no definitions but only guidelines. MUC and Automated Content Extraction (ACE), for instance, have refactored their guidelines for named entity tagging due to the difficulty of this task. The definitions, or rather, the guidelines, should be tailored to the context of the application.
We will therefore define the term “entity” for the scope of this thesis. Some researchers also call an entity that can be extracted from the Web a “Web object” (Nie et al., 2007). We define a named entity as follows: “An entity is a collection of names that refer to exactly one or to multiple identical, real or abstract concept instances. These instances can have several
aliases and one name can refer to different instances. A ‘named entity’ is a reference to an entity using one of the entity’s aliases.”
Figure 2.1 shows the relation between concepts and entities. The figure allows us to explain our definition using some examples.
Figure 2.1: Relationship between Concepts and Entities
Ambiguity We can see that the movie entity Iron Man has another alias named Ironman, which refers to the same entity, that is, both names resolve to the same identifier. Furthermore, we can observe that the name “Iron Man” is ambiguous and might also refer to the Marvel comic character. Names are often ambiguous and need to be disambiguated in the context in which they are mentioned. Names must be rigid designators; the description “the 2008 movie where Robert Downey Jr. plays a comic hero” is therefore not an entity name. Because names are ambiguous, every entity must get a universally unique identifier (UUID) (Leach et al., 2005).
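The definition above can be illustrated with a minimal sketch. The class `Entity`, the function `build_name_index`, and the concept label "Comic Character" are hypothetical names chosen for illustration only; they are not part of the system described in this thesis. The sketch assumes that an entity is stored as a set of aliases plus a randomly generated UUID, and that disambiguation starts from a lookup of all entities sharing a name.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity: a set of aliases that all refer to the same concept instance(s)."""
    concept: str   # concept the entity instantiates, e.g. "Movie"
    aliases: set   # all names that refer to this entity
    # A random UUID keeps entities distinct even when they share a name.
    entity_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def build_name_index(entities):
    """Map every alias to the list of entities it may refer to."""
    index = defaultdict(list)
    for entity in entities:
        for alias in entity.aliases:
            index[alias].append(entity)
    return index


# Hypothetical example data following Figure 2.1.
movie = Entity("Movie", {"Iron Man", "Ironman"})
character = Entity("Comic Character", {"Iron Man"})
index = build_name_index([movie, character])

# "Iron Man" is ambiguous; "Ironman" resolves to the movie only.
for name in ("Iron Man", "Ironman"):
    print(name, "->", [e.concept for e in index[name]])
```

A name that maps to more than one entity in such an index is ambiguous and has to be resolved from the context in which it is mentioned; the index itself only enumerates the candidates.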
Table 2.1 shows four example entities classified along two dimensions.
            Abstract        Concrete
Specific    $1,000,000      Jim Carrey
Generic     Field Hockey    Lumia 800
Table 2.1: Example Classification of Entities According to our Definition
Generic and Specific Jim Carrey, the actor, refers to exactly one real world instance, while the mobile phone Lumia 800 refers to multiple similar real world instances. People are specific in the sense that a concrete entity exists only once in the real world. Products are generic. For example, the mobile phone Lumia 800 exists multiple times in the real world, and these copies are “identical concept instances”. We are not interested in the individual instances, but rather in their common name, since they all share the same attributes, such as display size. Entities of many concepts are generic, such as genes, cars, or movies. More information about specific and generic entities can be found in the work by LingPipe (2007).
Abstract and Concrete While Lumia 800 refers to a collection of real world objects, the sport entity Field Hockey is an abstract instance. Playing hockey makes it real, but until then it is an abstract instance of a concept. It is not a concept according to our definition since there are no instances of Hockey itself, but only instances of hockey games. The same is true for instances of event concepts, such as Concert or Conference. Our definitions also allow us to have temporal and numerical instances, such as $1,000,000, as abstract entities.
2.1.3 Attribute
An attribute is a modifier for a concept. A concept can have multiple attributes and one attribute can belong to multiple concepts.
Attributes have a domain and a range. The domain specifies the concepts they modify and the range specifies the range of values. For example, the attribute display size belongs to the concepts Mobile Phone and Notebook (domain) and can have values between 1 and 25 inches (range). Multivalued attributes are beyond the scope of this thesis.
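As a minimal sketch of the domain and range of an attribute, consider the following data structure. The class `Attribute` and its field names are hypothetical, and the sketch assumes a numeric range with inclusive bounds; ranges of non-numeric attributes would need a different representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """An attribute with a domain (concepts) and a numeric range (values)."""
    name: str
    domain: frozenset   # concepts the attribute modifies
    min_value: float    # lower bound of the range (inclusive, assumed)
    max_value: float    # upper bound of the range (inclusive, assumed)
    unit: str

    def applies_to(self, concept):
        """True if the concept lies in the attribute's domain."""
        return concept in self.domain

    def accepts(self, value):
        """True if the value lies within the attribute's range."""
        return self.min_value <= value <= self.max_value


# The example from the text: display size for mobile phones and notebooks.
display_size = Attribute(
    name="display size",
    domain=frozenset({"Mobile Phone", "Notebook"}),
    min_value=1.0,
    max_value=25.0,
    unit="inch",
)

print(display_size.applies_to("Notebook"))  # concept inside the domain
print(display_size.accepts(40.0))           # value outside the range
```

A check of this kind could be used to reject extracted values that fall outside the plausible range of an attribute, although whether and how the system does so is not stated in this section.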
2.1.4 Statement
Statements are assertions about entities. While attributes modify concepts, statements are assigned to entities. We can classify statements along the two dimensions “truth” and “representation”. Along the “truth” dimension, such assertions can be falsehoods, facts, or opinions. Along the “representation” dimension, statements can be unstructured (natural language) or structured (for example, RDF). Table 2.2 provides examples of statements.
                Falsehood                   Opinion                    Fact
Structured      <earth> <hasForm> <flat>    <earth> <is> <gorgeous>    <earth> <hasForm> <ellipsoid>
Unstructured    The earth is flat.          The earth is gorgeous.     The earth is an ellipsoid.
Table 2.2: Classification of Statements
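The two dimensions of Table 2.2 can be sketched as a small data model. The names `Truth`, `Representation`, and `Statement` are hypothetical. The sketch assumes that a structured statement is held as a (subject, predicate, object) triple and an unstructured one as a plain string; the truth label is assigned by hand here and is not computed.

```python
from dataclasses import dataclass
from enum import Enum


class Truth(Enum):
    FALSEHOOD = "falsehood"
    OPINION = "opinion"
    FACT = "fact"


class Representation(Enum):
    STRUCTURED = "structured"
    UNSTRUCTURED = "unstructured"


@dataclass(frozen=True)
class Statement:
    """An assertion about an entity, labeled along the truth dimension."""
    content: object   # triple (subject, predicate, object) or a sentence
    truth: Truth      # label assigned manually in this sketch

    @property
    def representation(self):
        # Assumption: tuples are structured, everything else is free text.
        if isinstance(self.content, tuple):
            return Representation.STRUCTURED
        return Representation.UNSTRUCTURED


# The six cells of Table 2.2.
statements = [
    Statement(("earth", "hasForm", "flat"), Truth.FALSEHOOD),
    Statement(("earth", "is", "gorgeous"), Truth.OPINION),
    Statement(("earth", "hasForm", "ellipsoid"), Truth.FACT),
    Statement("The earth is flat.", Truth.FALSEHOOD),
    Statement("The earth is gorgeous.", Truth.OPINION),
    Statement("The earth is an ellipsoid.", Truth.FACT),
]

for s in statements:
    print(s.representation.value, s.truth.value, s.content)
```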