Extending Ontologies: Relation Extraction from Text

Ontology creation and development is a time-consuming task that is often carried out manually. The enrichment and automatic extension of ontologies has therefore been a field of intense study in recent decades.

This thesis focuses on methods adopted from graph and network analysis and treats ontologies as graphs. There are, however, other approaches to enriching ontologies, especially semantic ontologies such as WordNet.

In the following, I will evaluate existing methods of ontology enrichment through relation extraction from natural language texts. These approaches do not take the ontology (e.g., WordNet or DBpedia) and its network structure into account, but rather work with freely available texts from different domains. The methods are based on the idea that natural language texts contain and convey knowledge in the form of (syntactic) structures and that language can be parsed and processed so that this information can be extracted. Three related, yet quite different, approaches shall be examined here.

Hearst (1992) proposes a method to extract hyponymy–hypernymy relations from texts only taking into account the surface structure of sentences (i.e., only a shallow analysis based on POS tagging). This method is evaluated in Hearst (1998).

Lüngen and Lobin (2010) introduce a method to transform information given in tables of contents into a Multilayered Semantic Network (MultiNet). Unlike in the first approach, a deep syntactic analysis of the dependency structure is undertaken to identify entities and relations. Furthermore, the text structure and organization are taken into account.

Others, such as McCord et al. (2012), take parsing trees and the deep structure of sentences into account when looking for recognizable patterns to match information given in texts to semantic relations such as those of DBpedia.

Hearst (1998) presents a method she calls lexico-syntactic pattern extraction to find relations between words. The method is meant to support lexicographers (e.g., the developers of WordNet) in their work. The goal is to find constructions “that frequently and reliably indicate the relation of interest” (Hearst, 1992, p. 540). Hearst (1998, p. 134) gives some examples of such constructions in texts:

(12) . . . works by such authors as Herrick, Goldsmith, and Shakespeare.

(13) Bruises, . . . , broken bones, or other injuries

Example 12 shows a typical itemization of names that are subsumed by the preceding noun phrase (NP) authors. The pattern that matches such constructions is

(14) such NP as {NP, }* {, and/or} NP.

The first NP is a hypernym of the following NPs:

• hyponym(Herrick, author),
• hyponym(Goldsmith, author), and
• hyponym(Shakespeare, author).
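As a rough illustration, the pattern in (14) can be approximated with a regular expression. This is only a sketch and not Hearst's implementation: it assumes that every NP is a single word, whereas the method described here relies on POS tagging to find NP boundaries. The names `NP`, `SUCH_AS`, and `extract_such_as` are introduced purely for illustration, and the hypernym is returned in its surface form (authors) because lemmatization is omitted.

```python
import re

# Assumption: an NP is a single word that is not the conjunction "and"/"or".
# A real system would use POS tagging or chunking to delimit noun phrases.
NP = r"(?!and\b|or\b)[A-Za-z]+"

# such NP as {NP, }* {, and/or} NP
SUCH_AS = re.compile(
    rf"such (?P<hyper>{NP}) as (?P<hypos>{NP}(?:, {NP})*(?:,? (?:and|or) {NP})?)"
)


def extract_such_as(sentence):
    """Return (hyponym, hypernym) pairs for every 'such NP as ...' match."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        # Split the enumeration at commas and at the final conjunction.
        hyponyms = re.split(r",? (?:and|or) |, ", match.group("hypos"))
        pairs.extend((hyponym, match.group("hyper")) for hyponym in hyponyms)
    return pairs


if __name__ == "__main__":
    example = "works by such authors as Herrick, Goldsmith, and Shakespeare."
    for hyponym, hypernym in extract_such_as(example):
        print(f"hyponym({hyponym}, {hypernym})")
```

Applied to example 12, the sketch yields one pair per listed name. Multi-word NPs such as broken bones would require a proper NP chunker instead of the single-word approximation.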

The pattern of interest in example 13 is of the form

(15) NP {, NP}* {,} or other NP.

The NPs in the first slots are all hyponyms of the NP in the last position.35 In total, Hearst identifies eight such patterns that unambiguously and reliably indicate a hypernymy/hyponymy relation.

One way of finding such patterns is to derive a list of words that are already connected by the relation in question in WordNet and to extract sentences from large text corpora that contain the two words. Looking at the derived sentences should give a good overview of possible constructions and hence patterns.
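The bootstrapping step described above can be sketched as follows. The seed pairs and the three-sentence corpus are hypothetical stand-ins: in practice the pairs would be read from WordNet's hypernym links and the sentences from a large corpus. The function names are likewise assumptions made for this illustration.

```python
import re

# Hypothetical seed pairs (hyponym, hypernym); in practice these would be
# taken from relations that already exist in WordNet.
SEED_PAIRS = [("dog", "animal"), ("oak", "tree")]

# Hypothetical mini-corpus standing in for a large text collection.
CORPUS = [
    "The oak is a tree that can live for centuries.",
    "She walked the dog before breakfast.",
    "Animals such as the dog were domesticated early.",
]


def contains_word(sentence, word):
    """True if the word (or its simple plural in -s) occurs, ignoring case."""
    pattern = rf"\b{re.escape(word)}s?\b"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def sentences_with_pair(corpus, pairs):
    """Map each seed pair to the sentences that contain both of its words."""
    hits = {}
    for hyponym, hypernym in pairs:
        hits[(hyponym, hypernym)] = [
            sentence
            for sentence in corpus
            if contains_word(sentence, hyponym) and contains_word(sentence, hypernym)
        ]
    return hits


if __name__ == "__main__":
    for pair, sentences in sentences_with_pair(CORPUS, SEED_PAIRS).items():
        print(pair, sentences)
```

The retrieved sentences are only candidates. Deciding which recurring constructions qualify as reliable patterns remains a manual inspection step.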

However, adding newly found relations to WordNet comes with some problems. Since the structure of WordNet is not word based, but rather word sense based (see Chp. 5.2 for a deeper analysis of the WordNet structure), one has to decide which word sense is present in the relation. The word has to be disambiguated with respect to the WordNet senses. This is, as will be explained later, not an easy task. Moreover, if a word form does not yet exist in WordNet, the lexicographer has to decide whether to add it to an existing word sense (called a synonym set, or synset for short) or to create a new synset.

Another problem arises when text genres other than encyclopedic texts are used to find new relations. Especially in newspapers, a genre for which a great amount of corpus data is available, authors often present subjective interpretations. Hearst (1998, p. 17) found that this leads to some noise when extracting relations from newspaper articles.

The weakness of this approach lies in its inability to find patterns for relations other than hyponymy/hypernymy. Nonetheless, the method was used in the creation of WordNet and helped identify hyponymous words.

While Hearst (1998) only mentions the possibility of applying a deeper analysis, Lüngen and Lobin (2010) transfer the idea of finding patterns in language structures that indicate a certain semantic relation, not necessarily a lexical relation, to the tables of contents (TOCs) of academic textbooks. They find that the hierarchical order of text structures, such as sections, paragraphs, and others, in combination with morpho-syntactic information can be used to identify semantic relations between terms in headings.

The number of possible relations identified by different patterns is quite large. The structure of headings, especially in academic textbooks, indicates per se a hierarchical relation that also exists between the individual elements of the TOC. In a first step, a number of syntactic and grammatical analyses are undertaken (e.g., lemmatization and syntactic parsing). Statistical methods can be applied to find technical terms that are specific to a domain (see bold-faced terms in Fig. 13). Relations are then established between these terms, depending on their hierarchical and syntactic order.

35 Although this notion is widely accepted in the field of natural language processing, it is probably rejected by many scientists in other fields (cf. Kripke, 1980). Hyponymy is not used in a strictly linguistic sense here. It is used to refer to a logical categorization of objects in nature. What matters in the field of NLP is that a pattern like this one allows entities, here represented by proper names, to be categorized. It gives meaning to a proper name that is otherwise, to a computer, only a string of characters without any meaning.

Hebborn (2013, p. 204) gives the example shown in Fig. 13, where an exemplary excerpt from a TOC can be seen.

2.3.5 Weitere Kernmerkmale politischer Systeme
2.3.5.1 Die Stellung des Parlaments
2.3.5.2 Die vertikale Gewaltenteilung
2.3.5.3 Verfassungsgerichte

Figure 13: Sample of hierarchically structured text.

A human reader easily understands that the subordinated sections and paragraphs extend or explain the superordinate section. The terms that are related are in bold.36

2.3.5 [Weitere Kern[merkmale]keyword [politischer Systeme]A]NP_Gen
2.3.5.1 Die [Stellung des Parlaments]B
2.3.5.2 Die [vertikale Gewaltenteilung]C
2.3.5.3 [Verfassungsgerichte]D

Figure 14: Sample of hierarchically structured text with marked terms.

In Fig. 14, the important elements are labeled:37 The keyword Merkmal (English: property), which appears in the plural and as part of a compound in the superordinate heading (Further main properties of political systems), is the head of an NP_Gen (of political systems) along with the term A, politischer Systeme (political systems), which indicates the focus of the following headings. The subordinate headings contain the terms B–D. This structure implicitly comprises relations of the kind has(termA, termX) (e.g., has(politisches System, Verfassungsgericht)). The focus term in the superordinate heading is explained or extended in the subordinate headings.

36 The accentuation does not exist in the original TOC but was added by Hebborn.
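A minimal sketch of how such has relations could be read off the section numbering is given below. It assumes that the keyword pattern has already been recognized and that the terms have already been extracted and lemmatized. The terms are entered by hand here; the lemmatized forms of B and C as well as the names `TOC_TERMS`, `parent_of`, and `has_relations` are assumptions made for this illustration and are not taken from Lüngen and Lobin (2010).

```python
# Section number -> extracted term (entered by hand for this sketch).
TOC_TERMS = {
    "2.3.5": "politisches System",           # focus term A (superordinate heading)
    "2.3.5.1": "Stellung des Parlaments",    # term B
    "2.3.5.2": "vertikale Gewaltenteilung",  # term C
    "2.3.5.3": "Verfassungsgericht",         # term D
}


def parent_of(section_number):
    """Return the number of the superordinate section, e.g. 2.3.5.1 -> 2.3.5."""
    head, _, _ = section_number.rpartition(".")
    return head


def has_relations(toc_terms):
    """Relate the focus term of each heading to the terms of its direct subheadings."""
    relations = []
    for number, term in toc_terms.items():
        parent = parent_of(number)
        if parent in toc_terms:
            relations.append(("has", toc_terms[parent], term))
    return relations


if __name__ == "__main__":
    for name, term_a, term_x in has_relations(TOC_TERMS):
        print(f"{name}({term_a}, {term_x})")
```

For the excerpt in Fig. 14, this produces three relations, among them has(politisches System, Verfassungsgericht). Which relation name is assigned would in practice depend on the keyword and the morpho-syntactic pattern found in the superordinate heading.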

This approach is very fruitful when working with general knowledge ontologies, or when creating such an ontology. The approach seems unfitting for WordNet, where a relatively small set of semantic relations exists, and where the focus is not on general knowledge terms. It could, nonetheless, be applied to DBpedia and other similar ontologies to fill existing gaps. Still, because of its relatively free use of relations, one might have to create further rules to match these relations to those defined in the ontology. An example of how to match relations given in texts to relations in an ontology will be given in the following approach.

Based on the syntactic structure of utterances, the isomorphism or parallelism between syntactic and semantic structures can be exploited to assign semantic roles to phrases. For example, in question answering (QA) systems such as the already mentioned IBM Watson, finding information on different entities is needed to understand a question and to find the appropriate answer. In IBM Watson, relation extraction is used at different points in the workflow. Watson uses, besides other knowledge sources, DBpedia.

In contrast to the approach of Hearst, deep dependency parsing is used in Watson. A parser returns a parse tree indicating syntactic and semantic dependencies between constituents. While a purely surface-level analysis depends on situations where the structure of a sentence corresponds to a predefined pattern, parsing allows one to identify entities in nested structures or complex phrases. Multi-word tokens are also easily identified as single vertices in such a dependency tree.

The parser used by IBM is based on a so-called slot grammar (SG) (McCord, 1980). In SG, verbs are assigned basic semantic types indicating their sense. This information is stored within the system’s lexicon module (McCord, 1993, p. 127).

During the work on the Watson Jeopardy! challenge, the semantic type system was extended to include WordNet synset information and to match nouns to related verb frames. This can be used to define semantic relations by choosing a class of verbs or verb senses that is responsible for expressing the relation (McCord et al., 2012, p. 5). Given a verb frame writeVerb, each verb belonging to this frame is expected to express a writing relation. The semantic typing system and the WordNet synsets are used to match different verbs, such as write, compose, and others, as well as constructions such as the composer of, to this frame, which establishes an authorOf relation.
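The mapping from verbs to frames and from frames to relations can be illustrated with two small lookup tables. This is a simplified sketch and not Watson's implementation: the tables are hard-coded here, whereas the system described above derives the groupings from its semantic type system and WordNet synsets. The names `VERB_FRAMES`, `FRAME_TO_RELATION`, and `relation_for` are hypothetical.

```python
# Hypothetical lexicon: verb lemma -> verb frame.
VERB_FRAMES = {
    "write": "writeVerb",
    "compose": "writeVerb",
}

# Hypothetical mapping: verb frame -> semantic relation of the ontology.
FRAME_TO_RELATION = {
    "writeVerb": "authorOf",
}


def relation_for(subject, verb_lemma, obj):
    """Return (relation, subject, object) if the verb belongs to a known frame."""
    frame = VERB_FRAMES.get(verb_lemma)
    relation = FRAME_TO_RELATION.get(frame)
    if relation is None:
        return None  # the verb does not express a relation of interest
    return (relation, subject, obj)


if __name__ == "__main__":
    print(relation_for("he", "write", "The Boy David"))
    print(relation_for("an actress", "play", "the title role"))
```

Verbs outside the frame, such as play in this sketch, yield no relation, which keeps the extraction restricted to the relations defined in the ontology.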

The dependency structure of a sentence38 like

(16) In 1936, he wrote his last play, “The Boy David”; an actress played the title role (McCord et al., 2012, p. 10).

can easily be displayed in a predicate–argument structure such as

writeVerb(“he”, “his last play, ‘The Boy David’”)39

while still keeping in memory the internal structure of the object phrase. The relation can be reduced to

(17) writeVerb(he, The Boy David).

Resolving the pronoun to the correct entity is then the actual problem in this question-answering task. The underlying question to answer is “Who wrote The Boy David?”. Drawing on different ontologies as well as a large number of natural language texts that are processed in the same way as the example question above, the system tries to find the answer. It has to look for a relation like the one in Example 17, where the first slot of the relation, the he, is filled with the actual author name: J. M. Barrie.
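The lookup described above can be pictured as matching the relation from Example 17 against relations that were extracted earlier. The following is a deliberately simple sketch under the assumption that such relations are stored as plain triples; Watson's actual candidate generation and scoring are far more involved. The store `KNOWN_RELATIONS` and the function `fill_subject` are hypothetical.

```python
# Hypothetical store of relations previously extracted from ontologies and texts.
KNOWN_RELATIONS = [
    ("writeVerb", "J. M. Barrie", "The Boy David"),
    ("writeVerb", "J. M. Barrie", "Peter Pan"),
    ("writeVerb", "William Shakespeare", "Hamlet"),
]

PRONOUNS = {"he", "she", "it", "they"}


def fill_subject(relation, known_relations):
    """Return candidate entities for a pronoun in the first slot of a relation."""
    name, subject, obj = relation
    if subject.lower() not in PRONOUNS:
        return [subject]  # nothing to resolve
    return [
        known_subject
        for known_name, known_subject, known_obj in known_relations
        if known_name == name and known_obj == obj
    ]


if __name__ == "__main__":
    question = ("writeVerb", "he", "The Boy David")
    print(fill_subject(question, KNOWN_RELATIONS))
```

With the triples above, the pronoun slot of writeVerb(he, The Boy David) is filled with J. M. Barrie. If several candidates matched, an additional ranking step would be needed.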

These and other approaches to extending sparse information of ontologies or to analyzing relations in natural language texts are based on text corpora, i.e., external sources of information. In the following, approaches to extending networks will be presented that make no use of external knowledge but focus on the network structure itself.
