
3.2 Knowledge Extraction from Text

3.2.2 Knowledge Extraction Approaches

Extraction from natural language text differs substantially from extraction from metadata or from wrapper induction, because the quality of text varies greatly owing to its unstructured nature. This section reviews existing knowledge extraction approaches.

DIPRE and Snowball

In recent years, various systems have been proposed to discover relationships for specific domains [198]. For instance, Snowball associates companies with the cities where their headquarters are located [7], while DIPRE focuses on books and authors [37]. One of the earliest semi-supervised systems, DIPRE, uses a few pairs of entities for a given type of relationship as initial seeds [37]. It searches the Web for these pairs to extract patterns representing the relationship, and uses the patterns to discover new pairs of entities. These new pairs are fed back into the loop to generate more patterns, which in turn find new pairs of entities. Snowball [7] enhances DIPRE in two directions. First, it adds a verification step in which generated pairs are checked with the MITRE named entity tagger. Second, its patterns are more flexible because they are represented as vectors of weighted terms, which enables the clustering of similar patterns.
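The bootstrapping loop described above can be illustrated with a minimal sketch. It is not the DIPRE implementation: the toy corpus, the `ENTITY` regular expression (an entity is assumed to be a run of capitalized words) and the `bootstrap` function are all hypothetical, and a pattern is reduced to the text found between the two entities.

```python
import re

# Hypothetical toy corpus; DIPRE itself searches Web pages.
CORPUS = [
    "Foundation, written by Isaac Asimov, is a classic.",
    "Dune, written by Frank Herbert, won awards.",
    "The novel Dune by Frank Herbert is set on Arrakis.",
    "Emma by Jane Austen remains popular.",
]

# Assumption: an entity is a run of capitalized words.
ENTITY = r"([A-Z]\w*(?: [A-Z]\w*)*)"


def bootstrap(corpus, seeds, iterations=2):
    """Alternate between learning patterns from pairs and pairs from patterns."""
    pairs, patterns = set(seeds), set()
    for _ in range(iterations):
        # Step 1: the text between a known pair becomes a pattern.
        for sentence in corpus:
            for first, second in pairs:
                found = re.search(
                    re.escape(first) + r"(.{1,30}?)" + re.escape(second), sentence
                )
                if found:
                    patterns.add(found.group(1))
        # Step 2: every pattern is applied to the corpus to find new pairs.
        for sentence in corpus:
            for middle in patterns:
                regex = ENTITY + re.escape(middle) + ENTITY
                for found in re.finditer(regex, sentence):
                    pairs.add((found.group(1), found.group(2)))
    return pairs, patterns


pairs, patterns = bootstrap(CORPUS, {("Foundation", "Isaac Asimov")})
print(sorted(pairs))
print(sorted(patterns))
```

Starting from a single seed, the first iteration learns the pattern ", written by " and finds a second pair; that pair in turn yields the shorter pattern " by ", which finds a third pair. The short pattern is also the kind of overly generic pattern that a verification step is meant to guard against.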

One of the main differences between Snowball and DIPRE is that the former adds a verification step. When new pairs of entities are discovered from generated patterns, they are checked with the MITRE named entity tagger, which helps avoid overly generic patterns. The other difference lies in the flexibility of the patterns. In DIPRE they are hard-coded, while in Snowball the three flexible parts of a pattern (sb, sm, sa) are vectors of weighted terms, which makes it possible to cluster similar patterns. Each cluster of patterns is represented by a centroid pattern. Patterns are also filtered by the number of tuples supporting them. The similarity between two patterns is computed by adding the cosine measures between their corresponding parts. A pattern’s confidence is estimated from the number of positive tuples that the pattern generates. A tuple has high confidence if it is generated by multiple high-confidence patterns.
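The pattern similarity described above (the sum of the cosine measures over the three parts sb, sm and sa) can be sketched as follows. The dictionary layout, the term weights and the function names are assumptions made for illustration; Snowball's actual weighting scheme is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine measure between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty part contributes nothing
    return dot / (norm_u * norm_v)


def pattern_similarity(p, q):
    """Add the cosine measures of the corresponding parts of two patterns."""
    return sum(cosine(p[part], q[part]) for part in ("sb", "sm", "sa"))


# Hypothetical patterns for a company/headquarters relation.
p = {"sb": {"the": 1.0}, "sm": {"based": 1.0, "in": 1.0}, "sa": {}}
q = {"sb": {"the": 1.0}, "sm": {"headquartered": 1.0, "in": 1.0}, "sa": {}}

similarity = pattern_similarity(p, q)
print(round(similarity, 3))
```

Here the two patterns share their left context entirely and half of their middle context, so they would fall into the same cluster under any threshold below roughly 1.5.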

The evaluation relies on an estimated precision computed on a sample of 100 tuples, whereas the experiments generated 80,000 tuples.

OAK

Hasegawa et al. have proposed an unsupervised system for extracting relations [98]. The idea is to identify types of entities in corpora using the OAK system. The basic assumption is that pairs of entities that occur in similar contexts represent the same relation. This relation is discovered through context-based clustering of pairs of entities. The system assumes that a single context is not expected to support multiple relations, so a pair of entities is either not clustered at all or is assigned to its most frequently expressed relation. Consequently, the approach consists of tagging named entities in the corpora, collecting the co-occurring pairs of entities and the contexts in which they are mentioned, measuring the context similarity, creating clusters and finally assigning labels to the clusters of pairs of named entities.
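The last three steps (context similarity, clustering, labeling) can be sketched on toy data. This is only an illustration: the entity pairs and context counts are invented, named entity tagging is assumed to have been done already, and the greedy threshold clustering below is a simplification standing in for whatever clustering procedure the original system uses.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two bag-of-words contexts."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_pairs(contexts, threshold=0.5):
    """Greedy clustering: a pair joins the first cluster whose members are all similar."""
    clusters = []
    for pair, context in contexts.items():
        for cluster in clusters:
            if all(cosine(context, contexts[other]) >= threshold for other in cluster):
                cluster.append(pair)
                break
        else:
            clusters.append([pair])
    return clusters


def label(cluster, contexts):
    """Label a cluster with the most frequent word across its contexts."""
    totals = Counter()
    for pair in cluster:
        totals.update(contexts[pair])
    return totals.most_common(1)[0][0]


# Hypothetical contexts: words seen between the two entities of each pair.
contexts = {
    ("Acme", "Globex"): Counter({"acquired": 2, "bought": 1}),
    ("Initech", "Hooli"): Counter({"acquired": 1}),
    ("Smith", "Acme"): Counter({"president": 2}),
    ("Jones", "Initech"): Counter({"president": 1, "chairman": 1}),
}

clusters = cluster_pairs(contexts)
labels = [label(cluster, contexts) for cluster in clusters]
print(list(zip(labels, clusters)))
```

Each pair ends up in exactly one cluster, which mirrors the limitation that a pair of entities linked by several relations is only assigned to one of them.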

The proposed system does not deal with pairs of entities linked by multiple relationships. In addition, the experiments are restricted to a few types of entities (30 at most).

Espresso

Later work includes Espresso [163], a weakly-supervised relation extraction system that combines generic patterns with a principled reliability measure. This measure estimates the reliability of both patterns and instances, and it drives the filtering algorithm. The system is based on Hearst patterns [100] and learns surface patterns to extract more instances. To accomplish this task, Espresso takes seed instances for a particular relation, which is why the system is called weakly-supervised. The generic patterns used in Espresso have broad coverage, which results in higher numbers of both true positives and false positives. For example, for the relation “part-of”, the pattern X of Y yields both wheel of the car (correct) and house of representatives (incorrect). Therefore, the main objective is to achieve balanced recall and precision through the use of what the authors call reliable patterns. The system is presented as efficient in terms of reliability and precision. However, experiments were performed on smaller datasets and it is not known how the system performs at large scale.
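The behaviour of a generic pattern can be illustrated with a small sketch. The regular expression, the example text and the hand-made gold set are assumptions for illustration only; Espresso's reliability measure is not reproduced here.

```python
import re

# A generic "X of Y" pattern, with an optional article before Y.
GENERIC_PART_OF = re.compile(r"\b(\w+) of (?:the )?(\w+)\b")


def extract_part_of(text):
    """Return every (X, Y) candidate matched by the generic pattern."""
    return [(m.group(1), m.group(2)) for m in GENERIC_PART_OF.finditer(text)]


text = "The wheel of the car was damaged. The house of representatives voted."
candidates = extract_part_of(text)

# Hypothetical gold standard: only the first candidate is a true part-of instance.
gold = {("wheel", "car")}
precision = sum(c in gold for c in candidates) / len(candidates)

print(candidates)
print(precision)
```

The pattern finds the correct instance, so recall is preserved, but half of its output is wrong. This is the trade-off that a reliability-based filter over patterns and instances is intended to address.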

KnowItAll

The KnowItAll system has been designed for large-scale datasets such as the Web [72]. This system is able to annotate its own training examples using a few generic patterns. The patterns are instantiated into domain-specific extraction rules to obtain examples, and each example is assigned a probability of being relevant based on search engine hit counts.

TextRunner brought a further perspective to the field of Open Information Extraction, in which the types of relationships are not predefined [17]. A self-supervised learner is in charge of labeling the training data as positive or negative, and a classifier is then trained. Relations are extracted with the classifier, while a redundancy-based probabilistic model assigns a confidence score to each new example. The system was further developed into the ReVerb framework [75], which improves the precision-recall curve of TextRunner.

NELL

ReadTheWeb/NELL [43] is another project that aims at continuously extracting categories (e.g., the type of an entity) and relationships from web documents, and at improving the extraction step by means of learning techniques. Four components, including a classifier and two learners, are in charge of deriving the facts with a confidence score. According to the content of the online knowledge base, additional iterations yield high confidence scores (almost 100%) even for irrelevant relationships. In addition, NELL is mainly dedicated to the discovery of categories (95% of the discovered facts) rather than relationships between entities.

Prospera

Prospera [153] is a recent knowledge extraction system that reconciles three concerns: precision, recall and performance. It further develops the work done in the SOFIE [181] project. It combines pattern analysis over n-gram item sets, to ensure good recall, with rule-based reasoning, to guarantee acceptable precision. The performance aspect is handled by partitioning and parallelizing the tasks in a MapReduce-based distributed architecture. There are three major phases in Prospera: (i) the pattern gathering phase identifies pairs of entity names together with the surface string appearing between the two names, and additionally performs a context-based mapping of names to entities in the Yago ontology; (ii) the pattern analysis phase generalizes patterns into n-grams to obtain fact candidates; (iii) to obtain better precision, the final reasoning phase applies a MaxSat-based reasoner that takes pre-specified constraints into account. One restriction of this work concerns the pattern, which only covers the middle text between the two entities. This limitation affects the recall, as shown by the example “Lord Of The Rings, which Tolkien has written”.
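The effect of restricting a pattern to the middle text can be shown with a short sketch. The helper `middle_pattern` is hypothetical and uses plain string search in place of Prospera's entity recognition; the first example sentence is invented, the second is the one quoted above.

```python
def middle_pattern(sentence, first, second):
    """Return the surface string between two entity mentions, or None."""
    i = sentence.find(first)
    j = sentence.find(second)
    if i == -1 or j == -1:
        return None
    # Order the two mentions by their position in the sentence.
    if i > j:
        (i, first), (j, second) = (j, second), (i, first)
    return sentence[i + len(first):j].strip()


direct = middle_pattern("Tolkien wrote Lord Of The Rings", "Tolkien", "Lord Of The Rings")
relative = middle_pattern(
    "Lord Of The Rings, which Tolkien has written", "Tolkien", "Lord Of The Rings"
)

print(repr(direct))
print(repr(relative))
```

In the first sentence the middle text carries the relation ("wrote"). In the second, the middle text is only ", which", while the phrase that expresses the relation ("has written") follows the second entity and is therefore never captured.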

GATE (ANNIE)

The GATE project⁷ (General Architecture for Text Engineering) offers a set of tools for processing text [56]. In addition to being an architecture, it provides a development environment and a framework for building systems for human language processing. As a mature and open source project, GATE has attracted a large audience of both users and developers. In particular, it includes an IE system called ANNIE (A Nearly-New Information Extraction System) [134,135] with a range of modules including a tokenizer, a gazetteer, a sentence splitter, a POS tagger, a named entities transducer and a coreference tagger. One of the differentiating features of ANNIE is its support for multilingual processing, through the use of Unicode for text display in the GUI and the acceptance of input in various languages. ANNIE has been reported to achieve precision and recall values of around 90% for the NER task [134].

Probase

Probase [200] is a recent Microsoft Research project that is based on a probabilistic model and focuses on universal text understanding. It views the world in a concept-centric way, and its goal is to construct a fine-grained taxonomy of concepts. Probase only considers the isA relation and therefore targets a hierarchical tree of concepts, like other type hierarchies. The framework is probabilistic in the sense that it assigns a probability score to each pair of concepts to indicate the level of confidence. The system makes extensive use of Hearst patterns [100] and iteratively extracts concepts until no new information can be extracted. Probase additionally includes a merging mechanism whereby it merges similar subconcepts or performs single and multiple sense alignments. Probase uses the Bing corpus of 1.6 billion pages to construct its taxonomy of 2.7 million concepts⁸.
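The combination of Hearst patterns and confidence scores can be illustrated with a minimal sketch. Only the "such as" pattern is covered, the sentences are invented, and the score is a plain relative frequency chosen for illustration; it is a stand-in, not the probabilistic model used by Probase.

```python
import re
from collections import Counter

# One Hearst pattern: "<concept> such as <instance>, <instance> and <instance>".
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or )?)+)")


def extract_isa(sentences):
    """Count (instance, concept) isA pairs matched by the pattern."""
    counts = Counter()
    for sentence in sentences:
        for match in SUCH_AS.finditer(sentence):
            concept = match.group(1)
            for instance in re.split(r", | and | or ", match.group(2)):
                if instance:
                    counts[(instance, concept)] += 1
    return counts


def isa_score(counts, instance, concept):
    """Share of an instance's extractions that point to the given concept."""
    total = sum(n for (inst, _), n in counts.items() if inst == instance)
    return counts[(instance, concept)] / total if total else 0.0


# Hypothetical sentences.
sentences = [
    "companies such as IBM, Nokia and Google",
    "companies such as IBM and Apple",
    "brands such as IBM",
]

counts = extract_isa(sentences)
score = isa_score(counts, "IBM", "companies")
print(counts)
print(round(score, 3))
```

Repeated extractions of the same pair raise its score, which is the intuition behind attaching a confidence level to each isA edge of the taxonomy.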

⁷ http://gate.ac.uk (Last checked April 2013)