2.3 Knowledge Extraction
2.3.3 Data Sources for Knowledge Extraction
Many argue that the amount of available semantic data is key to realizing the visions of the Semantic Web at large scale. The challenge is the knowledge acquisition bottleneck encountered by the traditional “top-down” model of ontology design. To address this issue, automated methods have been developed that draw on a wide range of data sources. In general, these sources can be divided into the following categories:
• Structured;
• Semi-structured;
• Unstructured.
IE methods help obtain structured data from unstructured sources expressed in natural language [17, 54]. This domain is thus inherently related to the field of natural language processing. The main tasks are the identification and extraction of instances of particular types of entities and relations from natural language text, and their transformation into a structured representation such as a database or an ontology [91]. While IR systems focus on retrieving relevant documents from a collection, an IE system strives to extract relevant information (instances of entities and relations) from a document.
Structured Data Sources
The advantage of exposing existing structured data using a shared vocabulary is that the data becomes applicable in a broader semantic context. There is a vast amount of structured data, also called relational data (e.g. stored in relational databases), that can be used to extract entities to be integrated into a knowledge base. The need to make this structured data reusable as semantic data has recently increased, and it has therefore become widely accepted practice to publish existing structured data and expose the published collections as knowledge bases. Consequently, much research has focused on mapping relational data to RDF. However, the solutions are often domain specific, with tools that work only in a particular environment [174]. To address this issue, the World Wide Web Consortium (W3C) initiated the RDB2RDF working group to study and propose standardized languages for mapping relational data into RDF and OWL¹². More specifically, this working group recently published a recommendation for “a direct mapping of relational data to RDF”¹³. The relational data used to illustrate such a direct mapping is given in Table 2.3.
ID   Name              AffiliationState
3    Abraham Lincoln   25
4    Thomas Jefferson  35
5    Barack H. Obama   25
Table 2.1: Table President.

ID   Name
25   Illinois
30   New York
35   Virginia
Table 2.2: Table State.

Table 2.3: Simplified Relational Data Describing Presidents and Their States of Primary Affiliation. The Column AffiliationState of the Table President References the Primary Key Column ID of the Table State.
The table President contains a list of American presidents with three columns: the primary key column ID, the column Name and the column AffiliationState. Since a state of primary affiliation may be common to several presidents, this information is stored in a separate table, and the column AffiliationState references the primary key column ID of the table State. Given these two table definitions, a direct mapping to RDF would typically take the following form:
<President/ID=3> <rdf:type> <President>
<President/ID=4> <rdf:type> <President>
<President/ID=5> <rdf:type> <President>
<President/ID=3> <President#Name> "Abraham Lincoln"
<President/ID=4> <President#Name> "Thomas Jefferson"
<President/ID=5> <President#Name> "Barack H. Obama"
<President/ID=3> <President#AffiliationState> <State/ID=25>
<President/ID=4> <President#AffiliationState> <State/ID=35>
<President/ID=5> <President#AffiliationState> <State/ID=25>
<State/ID=25> <State#Name> "Illinois"
<State/ID=30> <State#Name> "New York"
<State/ID=35> <State#Name> "Virginia"
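The direct mapping illustrated above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the W3C algorithm itself: the function name direct_mapping and its parameters are hypothetical, the two tables are held as in-memory lists of dictionaries rather than read from a database, and the IRIs keep the abbreviated form used in the listing. The sketch emits an rdf:type triple for every row, including the rows of the table State.

```python
def direct_mapping(table_name, rows, primary_key, foreign_keys=None):
    """Yield (subject, predicate, object) strings for the rows of one table.

    foreign_keys maps a column name to the (table, key column) it references.
    All names here are illustrative assumptions, not part of any standard API.
    """
    foreign_keys = foreign_keys or {}
    for row in rows:
        # The row IRI is built from the table name and the primary key value.
        subject = f"<{table_name}/{primary_key}={row[primary_key]}>"
        yield (subject, "<rdf:type>", f"<{table_name}>")
        for column, value in row.items():
            if column == primary_key:
                continue
            predicate = f"<{table_name}#{column}>"
            if column in foreign_keys:
                # A foreign key becomes a link to the referenced row.
                ref_table, ref_key = foreign_keys[column]
                obj = f"<{ref_table}/{ref_key}={value}>"
            else:
                # Any other column becomes a plain literal.
                obj = f'"{value}"'
            yield (subject, predicate, obj)


presidents = [
    {"ID": 3, "Name": "Abraham Lincoln", "AffiliationState": 25},
    {"ID": 4, "Name": "Thomas Jefferson", "AffiliationState": 35},
    {"ID": 5, "Name": "Barack H. Obama", "AffiliationState": 25},
]
states = [
    {"ID": 25, "Name": "Illinois"},
    {"ID": 30, "Name": "New York"},
    {"ID": 35, "Name": "Virginia"},
]

triples = list(
    direct_mapping("President", presidents, "ID",
                   foreign_keys={"AffiliationState": ("State", "ID")})
)
triples += list(direct_mapping("State", states, "ID"))

for s, p, o in triples:
    print(s, p, o)
```

The mapping inspects only table names, column names and key constraints, so it applies to any schema without knowing what the columns mean.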
Transforming relational data into an RDF graph in this way is common, and many databases have been published as linked data in this manner. However, this is merely a syntactic approach, and the semantics of the data are not necessarily reusable in the context of the Semantic Web, which is a prerequisite for typical usage. Simply converting data into a more Semantic Web friendly format such as RDF does not automatically enable semantic-aware services, since the converted RDF data still needs to be reinterpreted as well as transformed.

12 http://www.w3.org/2001/sw/rdb2rdf/ (Last checked May 2013)
13 http://www.w3.org/TR/rdb-direct-mapping/ (Last checked May 2013)
Recently, initiatives such as schema.org, microformats¹⁴ and GoodRelations¹⁵ have boosted the amount of structured markup data on the Web. Several studies have shown that over 30% of all Web pages in existence¹⁶ now contain structured data [145, 151], and this data can be directly extracted as RDF graphs (e.g. using Anything To Triples¹⁷).
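To give an idea of what extracting embedded markup as RDF involves, the following sketch reads schema.org microdata attributes (itemscope, itemtype, itemprop) from an HTML fragment using only the Python standard library. It is not how Anything To Triples works internally and it covers only a single, non-nested item with text-valued properties. The class MicrodataParser, the function microdata_to_triples, the blank node label _:item and the sample page are all assumptions made for illustration, and deriving the property IRIs from the item type is a simplification.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collects the itemprop values of one flat (non-nested) microdata item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = []      # list of (property name, text value)
        self._open_prop = None    # property whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # itemscope is a boolean attribute, so only its presence matters.
        if "itemscope" in attrs and self.item_type is None:
            self.item_type = attrs.get("itemtype")
        if attrs.get("itemprop"):
            self._open_prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._open_prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open_prop is not None:
            value = "".join(self._text).strip()
            self.properties.append((self._open_prop, value))
            self._open_prop = None


def microdata_to_triples(html, subject="_:item"):
    """Return triples for the first microdata item found (simplified)."""
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    if parser.item_type is None:
        return []
    # Simplification: property IRIs are built from the item type's base IRI.
    vocabulary = parser.item_type.rsplit("/", 1)[0] + "/"
    result = [(subject, "<rdf:type>", f"<{parser.item_type}>")]
    for prop, value in parser.properties:
        result.append((subject, f"<{vocabulary}{prop}>", f'"{value}"'))
    return result


# Hypothetical page fragment, written for this example.
page = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Abraham Lincoln</span>
  <span itemprop="jobTitle">President</span>
</div>
"""

markup_triples = microdata_to_triples(page)
for triple in markup_triples:
    print(*triple)
```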
Unstructured Data Sources
Much of human knowledge is expressed in free text stored in natural language documents. The Web has enabled the sharing of digital information with minimal effort, and progress in IE enables the analysis and extraction of useful information from text, thereby turning it into machine-processable knowledge. Using unstructured text as a source from which to derive structured data for a knowledge base is still an open area of active research.
Exploiting unstructured text as a source of relational facts often relies on Natural Language Processing (NLP) techniques. A substantial amount of work in NLP is directed towards the development of parsers. These parsers attempt to capture the meaning of sentences by building a tree or directed graph that reflects various levels of linguistic information, including part-of-speech (POS) tags, phrases, grammatical structures and semantic roles. The structure and annotations produced by a parser are useful for determining relationships between entities in a sentence. In this context, one of the major challenges associated with unstructured data sources is the task of Named Entity Recognition (NER). Traditionally, NER techniques have focused on detecting common entities such as people, organizations or geographic locations. However, the recognition of domain-specific entities poses a particular challenge, because NER tools usually require training examples for the types of entities to be recognized.
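The simplest way to see what NER produces is a dictionary (gazetteer) lookup, sketched below. This is a deliberately simpler technique than the trained NER tools discussed above and is shown only to make the input and output of the task concrete. The gazetteer entries, the type labels and the function name tag_entities are assumptions made for this example.

```python
import re

# Illustrative gazetteer: surface form -> entity type (assumed labels).
GAZETTEER = {
    "Abraham Lincoln": "PERSON",
    "Barack H. Obama": "PERSON",
    "Illinois": "LOCATION",
    "Virginia": "LOCATION",
}


def tag_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, surface form, type) for each gazetteer match.

    Longer names are tried first so that a short entry cannot claim
    characters that already belong to a longer match.
    """
    found = []
    taken = set()  # character offsets already covered by a match
    for name in sorted(gazetteer, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            span = range(match.start(), match.end())
            if taken.isdisjoint(span):
                taken.update(span)
                found.append(
                    (match.start(), match.end(), name, gazetteer[name])
                )
    return sorted(found)


sentence = "Abraham Lincoln practised law in Illinois before becoming president."
entities = tag_entities(sentence)
for start, end, surface, entity_type in entities:
    print(f"{surface} [{entity_type}] at {start}-{end}")
```

A lookup of this kind recognizes only names that are already listed, which is one way to see why domain-specific entities are hard: every new entity type needs either its own dictionary or its own training examples.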
14 http://microformats.org (Last checked January 2013)
15 http://www.heppnetz.de/projects/goodrelations/ (Last checked December 2012)
16 These statistics take into account Web pages up until February 2012
17 http://any23.apache.org/ (Last checked May 2013)
Figure 2.5: An Example of an Infobox. (a) Wikipedia Infobox for Article about Norway; (b) Corresponding Source in Edit Mode.
Semi-structured Data Sources
Semi-structured data is another popular source for extracting semantic information for knowledge bases. It is often characterized as a middle ground: neither raw unstructured data nor typed structured data. Probably the best-known source of this kind is Wikipedia infoboxes, as shown in Figure 2.5.
A number of projects use infoboxes as a source of factual information. For a semi-structured data source such as a Wikipedia infobox, extracting the information can be as straightforward as applying a pattern-matching technique. In fact, prominent projects (e.g. DBpedia, Yago) make use of this method.
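Pattern matching over infobox source can be sketched with two regular expressions. The sketch below is only an illustration of the idea and not the extraction framework of DBpedia or Yago. The wikitext sample is a shortened, hand-written fragment rather than the actual source shown in Figure 2.5, and the function name extract_infobox is hypothetical.

```python
import re

# Hand-written sample in infobox syntax (an assumption, not the real article source).
WIKITEXT = """{{Infobox country
| conventional_long_name = Kingdom of Norway
| capital = [[Oslo]]
| currency = [[Norwegian krone]]
}}"""

# One "| key = value" pair per line.
FIELD = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)
# [[Target]] or [[Target|label]]; keep the visible text only.
LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]")


def extract_infobox(wikitext):
    """Return a dict of attribute -> plain-text value for a flat infobox."""
    facts = {}
    for key, raw_value in FIELD.findall(wikitext):
        facts[key] = LINK.sub(r"\1", raw_value)
    return facts


facts = extract_infobox(WIKITEXT)
for key, value in facts.items():
    print(f"<Norway> <{key}> \"{value}\"")
```

Real infoboxes also contain nested templates, references and multi-line values, which this pattern does not handle.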