2.3 NLP Methods Supporting Document Exploration

2.3.3 Information Extraction

To create a concept map from natural language text, concept and relation mentions have to be extracted from the text. In the NLP community, these and other extraction tasks that try to obtain structured data from unstructured text are studied as information extraction (Jurafsky and Martin, 2009, Chapter 22). Traditionally, information extraction has been modeled as named entity recognition followed by relation classification (Doddington et al., 2004). Such techniques yield pairs of entities with relations between them, a structure that is already very similar to the pairs of concepts and their relations needed for a concept map.

However, a limitation of classic information extraction is that relation extraction is modeled as a classification task, requiring a pre-defined list of supported relations and annotated training data for each of them. As Banko et al. (2007) point out, this renders these approaches impractical when applied to a large body of arbitrary text such as pages on the web, for which it is impossible to anticipate all types of relations expressed in them. In our use case, a similar problem arises, as we want to create concept maps from documents before we know their content. The set of relations should thus not be constrained a priori. The idea of an open vocabulary for concepts and relations is a feature of concept maps (Novak and Cañas, 2008) that should be supported by the extraction mechanism.

As a solution, Banko et al. (2007) proposed open information extraction (OIE), a variant of information extraction following the open vocabulary paradigm. OIE systems extract tuples such as (Barack Obama - graduated from - Harvard Law), consisting of two arguments connected by a relation, in which all parts are taken directly from the text. Every extraction has to be asserted by the text, and arguments and relations should be as short as possible. Several extractions can be made from a single sentence.
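As a minimal sketch of this output format, a binary extraction can be represented as a small record of three strings. The class name, the method name and the example sentence below are illustrative assumptions and are not taken from any of the cited systems. The substring test only checks that all parts are taken verbatim from the text; it is a crude proxy and does not verify that the extraction is actually asserted by the sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """A binary OIE tuple: two arguments connected by a relation phrase."""

    arg1: str
    relation: str
    arg2: str

    def is_verbatim_in(self, sentence: str) -> bool:
        """Return True if every part occurs literally in the sentence.

        This only checks the 'taken directly from the text' property;
        it cannot tell whether the sentence asserts the extraction.
        """
        return all(
            part in sentence for part in (self.arg1, self.relation, self.arg2)
        )


# Hypothetical example sentence yielding several extractions.
sentence = "Barack Obama graduated from Harvard Law and was born in Hawaii."
extractions = [
    Extraction("Barack Obama", "graduated from", "Harvard Law"),
    Extraction("Barack Obama", "was born in", "Hawaii"),
]
```

Keeping the three parts as plain strings mirrors the open vocabulary idea: neither the arguments nor the relation are drawn from a fixed inventory.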

Existing OIE systems can be classified along several dimensions:

Learned vs. Rule-based All OIE systems derive their extractions from the syntactic structure of a sentence. A key difference is whether the extraction patterns operating on the syntax are learned from data or hand-engineered. TextRunner (Banko et al., 2007), the first OIE system, uses a binary classifier to detect patterns. WOE (Wu and Weld, 2010) extends this idea by using Wikipedia infoboxes as supervision. Another learning approach is bootstrapping, employed by OLLIE (Mausam et al., 2012) and NestIE (Bhutani et al., 2016), which iteratively extends a set of initial seed patterns. Recent work by Stanovsky et al. (2018) showed that the task can also be formulated as a sequence tagging problem. Despite these successful attempts at learning patterns, a similarly large number of OIE systems have been proposed that use carefully hand-crafted patterns based on linguistic insights. ReVerb (Fader et al., 2011), KrakeN (Akbik and Löser, 2012), ClausIE (Del Corro and Gemulla, 2013), Exemplar (Mesquita et al., 2013), OpenIE4 (Mausam, 2016), PropS (Stanovsky et al., 2016b) and Graphene (Cetto et al., 2018) are examples of this line of work.

Parsing vs. Part-of-Speech Tagging Because Banko et al. (2007) initially motivated OIE as a way to extract relations from the entire web, scalability has been an important design criterion. Early systems, such as TextRunner (Banko et al., 2007), WOEpos (Wu and Weld, 2010) and ReVerb (Fader et al., 2011), rely only on part-of-speech tags to avoid the more costly dependency parsing used in most other systems. These systems thus achieve high processing speeds at the cost of precision. In more recent work, this idea has been largely abandoned, presumably since better parsing algorithms and cheaper computational resources lowered the cost of dependency parsing.
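To illustrate why extraction from part-of-speech tags alone is cheap, the following sketch matches a single verb-centered pattern over a tag sequence with a regular expression. It is a strongly simplified illustration and not the implementation of any of the systems cited above: the tag classes, the pattern and the function names are assumptions, the input is assumed to be already tagged with Penn Treebank tags, and arguments are naively taken to be the noun runs directly adjacent to the relation phrase.

```python
import re
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (word, Penn Treebank POS tag); tagging is assumed done


def tag_class(tag: str) -> str:
    """Collapse a POS tag into a one-letter class used by the pattern below."""
    if tag.startswith("VB"):
        return "V"  # any verb form
    if tag in {"IN", "TO", "RP"}:
        return "P"  # preposition, "to", or particle
    if tag.startswith("NN"):
        return "N"  # any noun
    if tag in {"DT", "JJ", "RB"}:
        return "W"  # other words tolerated inside a relation phrase
    return "O"  # everything else, including punctuation


# Relation phrase: one or more verbs, optionally followed by further
# words that end in a preposition or particle (e.g. "graduated from").
RELATION = re.compile(r"V+(?:[NW]*P)?")


def extract(tokens: List[Token]) -> Iterator[Tuple[str, str, str]]:
    """Yield (arg1, relation, arg2) tuples using POS tags only."""
    classes = "".join(tag_class(tag) for _, tag in tokens)
    words = [word for word, _ in tokens]
    for rel in RELATION.finditer(classes):
        # Nearest noun run ending right before / starting right after.
        left = re.search(r"N+$", classes[: rel.start()])
        right = re.match(r"N+", classes[rel.end():])
        if left and right:
            arg1 = " ".join(words[left.start(): left.end()])
            relation = " ".join(words[rel.start(): rel.end()])
            arg2 = " ".join(words[rel.end(): rel.end() + right.end()])
            yield (arg1, relation, arg2)


tagged = [
    ("Barack", "NNP"), ("Obama", "NNP"), ("graduated", "VBD"),
    ("from", "IN"), ("Harvard", "NNP"), ("Law", "NNP"), (".", "."),
]
print(list(extract(tagged)))
```

The whole procedure is a linear scan over the tag sequence, which is the source of the speed advantage. The sketch also hints at the precision problem: without syntactic structure, the adjacent noun run is not guaranteed to be the true argument of the verb.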

Extraction Format While early work focused only on extractions of verb-mediated relations with two arguments, this was soon deemed too narrow to capture all relevant relations in text. KrakeN (Akbik and Löser, 2012) introduced the idea of n-ary extractions, adopted in most of the following work. OLLIE (Mausam et al., 2012) added context annotations, indicating that an extraction is only valid in a given context. NestIE (Bhutani et al., 2016) achieves the same by nesting extractions. Other extensions include noun-mediated relations (Yahya et al., 2014) and the minimization of argument spans (Angeli et al., 2015).
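The differences between these extraction formats can be made concrete with a small data structure. The sketch below is only an illustration under our own assumptions: the class and field names are hypothetical, the example sentences are invented, and the actual output formats of the cited systems differ in their details.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class NaryExtraction:
    """An extraction with two or more arguments.

    An argument is either a text span or another extraction (nesting).
    The optional context restricts the validity of the extraction.
    """

    relation: str
    arguments: Tuple[Union[str, "NaryExtraction"], ...]
    context: Optional[str] = None

    def __post_init__(self) -> None:
        if len(self.arguments) < 2:
            raise ValueError("an extraction needs at least two arguments")

    @property
    def arity(self) -> int:
        return len(self.arguments)


# Binary extraction, the original format.
binary = NaryExtraction("graduated from", ("Barack Obama", "Harvard Law"))

# N-ary extraction: a third argument carries additional information.
nary = NaryExtraction(
    "graduated from", ("Barack Obama", "Harvard Law", "in 1991")
)

# Context annotation: the extraction only holds within the stated context.
with_context = NaryExtraction(
    "is the center of",
    ("the Earth", "the universe"),
    context="Early astronomers believed",
)

# Nesting: the same restriction expressed as an extraction-valued argument.
nested = NaryExtraction(
    "believed",
    ("Early astronomers",
     NaryExtraction("is the center of", ("the Earth", "the universe"))),
)
```

Both the context field and the nested variant prevent the inner statement from being read as an unconditional fact, which a flat binary tuple could not express.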

Language The large majority of work on OIE targets the English language. ExtrHech (Zhila and Gelbukh, 2013), a rule-based system for Spanish, is a first exception. Later work by Gamallo and Garcia (2015) led to ArgOE, a system that can process five different languages with the same set of rules. Following the same idea, PredPatt (White et al., 2016) develops a rule set working with universal dependencies (Nivre et al., 2016), making the system scale to potentially all of the 71 languages that are currently covered by the treebanks annotated with universal dependencies. Their evaluation covers five languages.

More details on the various systems can be found in a recent survey by Niklaus et al. (2018). An important challenge for OIE has been the evaluation of systems: due to the lack of annotated data, most work relies on post-hoc manual evaluations. Recently, efforts by Stanovsky and Dagan (2016b) and Schneider et al. (2017) introduced the necessary evaluation data and tested a range of published systems. They found ClausIE and OpenIE4 to be the best performing systems according to their benchmarks.
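Once annotated reference extractions are available, automatic scoring becomes possible. The following sketch computes precision and recall under the simplest conceivable matching criterion, exact match after lowercasing. This criterion, the function names and the example data are assumptions made for illustration; the cited benchmarks define their own matching procedures, which are not reproduced here.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (arg1, relation, arg2)


def normalize(triple: Triple) -> Triple:
    """Lowercase and strip each part so trivial differences do not count."""
    arg1, relation, arg2 = triple
    return (arg1.lower().strip(), relation.lower().strip(), arg2.lower().strip())


def precision_recall(
    predicted: Iterable[Triple], gold: Iterable[Triple]
) -> Tuple[float, float]:
    """Score predicted extractions against gold ones by exact match."""
    pred: Set[Triple] = {normalize(t) for t in predicted}
    ref: Set[Triple] = {normalize(t) for t in gold}
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical gold annotations and system output for one sentence.
gold = [
    ("Barack Obama", "graduated from", "Harvard Law"),
    ("Barack Obama", "was born in", "Hawaii"),
]
predicted = [
    ("barack obama", "graduated from", "Harvard Law"),
    ("Barack Obama", "was born", "Hawaii"),
]
print(precision_recall(predicted, gold))
```

The second predicted tuple differs from the reference only in the span of the relation phrase, yet it counts as an error under exact match. This illustrates why the choice of matching criterion matters for OIE, where argument and relation boundaries are not fixed in advance.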

OIE systems, in particular rule-based methods working with dependency parses, are very similar to the methods developed for concept and relation extraction in the concept map mining literature (see Sections 2.3.1.1 and 2.3.1.2). However, we are not aware of any direct applications of existing OIE systems to the task of concept map mining.