3.2 Setting and Evaluation

3.2.1 Setting

Existing distantly supervised relation extraction approaches generally contain the following five components, also displayed in Figure 2.3 above: named entity recognition and classification; automatically labelling sentences with the distant supervision heuristic; preprocessing and feature extraction; training a classifier; and combining and returning results.

The approach described in this thesis adds to these components an additional component for training and testing data retrieval. The aim of the setting is an entity-centric Web search-based distant supervision approach which gives instance-level results for knowledge base population.

Most documented distant supervision approaches perform experiments on corpora such as the New York Times corpus or Wikipedia (see Section 2.4.1). They process each sentence with named entity recognisers. For those sentences that contain at least two entities, they iterate over each relation pair from a background knowledge base to label sentences with relations, which are then used as positive training examples for those relations. Negative training examples for relations are named entity pairs which are not identified as being in any relation in the knowledge base and are sampled randomly from the corpus.
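The corpus-based labelling heuristic described above can be sketched roughly as follows. This is a minimal illustration rather than the implementation of any cited system: the toy knowledge base, the function name `label_sentences`, the `neg_ratio` parameter and the "NONE" label are all assumptions made for the example, and named entity recognition is assumed to have been run already.

```python
import random
from itertools import permutations

# Toy knowledge base mapping (subject, object) pairs to a relation name.
# The entries are illustrative only.
KB = {
    ("Michael Jackson", "Thriller"): "album",
    ("Barack Obama", "Democratic Party"): "party",
}


def label_sentences(sentences, kb, neg_ratio=1.0, seed=0):
    """Label sentences with the distant supervision heuristic.

    sentences: list of (text, entities), where entities is the list of
    named entity strings an NER step found in that sentence.
    Returns (positives, negatives) as lists of
    (text, subject, object, relation) tuples.
    """
    positives, candidates = [], []
    for text, entities in sentences:
        if len(entities) < 2:
            continue  # the heuristic needs at least two entities
        for subj, obj in permutations(entities, 2):
            relation = kb.get((subj, obj))
            if relation is not None:
                positives.append((text, subj, obj, relation))
            elif (obj, subj) not in kb:
                # Pair is not related in either direction in the KB.
                candidates.append((text, subj, obj, "NONE"))
    # Negative examples are sampled randomly from the unrelated pairs.
    rng = random.Random(seed)
    n_neg = min(len(candidates), int(neg_ratio * len(positives)))
    negatives = rng.sample(candidates, n_neg)
    return positives, negatives
```

Note that every entity pair in every sentence is checked against the knowledge base, which is the matching cost discussed in the next paragraph.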

Attempting to match each named entity in the corpus with each named entity in a background knowledge base is a significant computational effort. In reality, entities and relations discussed in documents are on different topics, e.g. some documents are about politicians and their parties, whereas others are about musical artists and their albums. This means only a fraction of documents in a fixed corpus such as the New York Times are relevant to a specific entity in a background knowledge base. Therefore, attempting to match NE tuples with every sentence in every document could lead to many false positives. A solution to this would be to preprocess the documents to find out if they are about the entity in question. However, a static corpus might not even contain the information desired, and significant effort could go into finding a suitable corpus which does.

Consequently, a different setting is proposed in this thesis (see Figure 3.1): instead of processing a static corpus and extracting all relations from it, the setting assumes that there is a user with a particular query, e.g. “What albums did the musical artist Michael Jackson release?”. The queries should contain the type of the subject entity (musical artist), its name (Michael Jackson) and the type of the object entity (album). The query is then used to retrieve sentences from the Web using a search engine. The search engine functions as a preprocessing step to retrieve relevant Web pages. Moreover, this is a dynamic way of retrieving information, in contrast to the static corpus-based way, and could be used in a real-world setting where a user has a specific query and wants to retrieve the answer to it. This has the additional benefit of giving access to large quantities of information, which eliminates the need to search for a suitable corpus that contains the desired information.
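As a rough sketch of this query-driven retrieval step, the three query parts can be kept in a small structure and turned into a search string. The class `EntityQuery`, the search string format, the `search` callable and the naive sentence splitting are all assumptions made for illustration; the text does not prescribe how the query is phrased for the search engine or which search engine is used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class EntityQuery:
    subject_type: str  # e.g. "musical artist"
    subject_name: str  # e.g. "Michael Jackson"
    object_type: str   # e.g. "album"

    def to_search_string(self) -> str:
        # Hypothetical format: quoted subject name plus both type words.
        return f'"{self.subject_name}" {self.subject_type} {self.object_type}'


def retrieve_sentences(
    query: EntityQuery,
    search: Callable[[str], Iterable[str]],
) -> List[str]:
    """Fetch page texts via `search` and split them into sentences.

    `search` stands in for a real search engine client: it takes a query
    string and yields the plain text of each retrieved Web page.
    """
    sentences = []
    for page_text in search(query.to_search_string()):
        # Naive splitting on full stops; a real pipeline would use a
        # proper sentence splitter.
        for part in page_text.split("."):
            part = part.strip()
            if part:
                sentences.append(part)
    return sentences
```

Passing the search engine in as a callable keeps the sketch independent of any particular search API and makes the retrieval step easy to replace with a stub.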

The same setting is used for training and testing. To generate annotated training data, Web pages potentially related to the query are retrieved using a search engine, and all NEs on the Web pages are matched against the named entity in the query (Michael Jackson) and against objects of the relation “album of” with the subject “Michael Jackson”, as already contained in the knowledge base (e.g. Music & Me). Sentences which contain both the subject and the object of the relation are used as positive training data. NE pairs with the subject of the relation, but a different object, i.e. a mention of an entity of the type specified by the relation, but not one referring to an entity known to stand in the given relation to the subject entity, are used as negative training data. Note that surface forms such as “Michael Jackson” can refer to multiple real-life entities, a problem addressed by the task of named entity disambiguation. This issue is left for future work.
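The entity-centric labelling rule in this paragraph can be illustrated with a short sketch. The function name `label_for_subject` and the input format (sentences paired with typed entity mentions) are assumptions made for the example; entity matching is done by exact surface form, which ignores the disambiguation issue noted above.

```python
def label_for_subject(sentences, subject, object_type, known_objects):
    """Split retrieved sentences into positive and negative training data.

    sentences: list of (text, mentions), where mentions is a list of
    (surface_form, entity_type) pairs from an NER step.
    known_objects: objects the knowledge base already lists for the
    subject and the relation in question.
    """
    positives, negatives = [], []
    for text, mentions in sentences:
        names = {name for name, _ in mentions}
        if subject not in names:
            continue  # only sentences mentioning the subject are used
        for name, entity_type in mentions:
            if name == subject or entity_type != object_type:
                continue  # keep only candidates of the expected object type
            if name in known_objects:
                positives.append((text, subject, name))
            else:
                # Right type, but not known to stand in the relation.
                negatives.append((text, subject, name))
    return positives, negatives
```

For the running example, the call would use `subject="Michael Jackson"`, `object_type="album"` and `known_objects={"Music & Me"}`; any other album mentioned alongside the subject would be treated as a negative example.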

3.2.2 Evaluation

In order to measure the performance of a distant supervision approach, a test set is necessary. Information extraction approaches usually rely on gold standard corpora produced in the context of evaluation initiatives such as ACE (Walker et al., 2006) or OntoNotes (Hovy et al., 2006). Another possibility is to use benchmark data provided by evaluation challenges, e.g. TAC KBP (Surdeanu and Ji, 2014).

The problem is that, for relation extraction, not many large manually annotated corpora exist, and it is particularly difficult to find gold standard testing corpora for the same genre as the training data. One approach for solving this is to find a gold standard which contains some of the desired relations, and then to evaluate only those relations which can automatically be mapped to that gold standard (Roller and Stevenson, 2014). Other existing distant supervision approaches use the same method for obtaining test data as they use for obtaining training data, i.e. automatically annotating it with relations from a knowledge base. In that case, part of the knowledge base is used for training, while another part is used for testing (also called “hold-out evaluation”). They then perform a sentence-level or an instance-level evaluation. For sentence-level evaluation (Mintz et al., 2009; Hoffmann et al., 2011; Alfonseca et al., 2012; Ling and Weld, 2012; Liu et al., 2014), testing instances are classified by a model and then ranked by the confidence of each prediction. Top-ranked sentences are then annotated manually. Another possibility is instance-level evaluation (Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011; Alfonseca et al., 2012), for which predictions for the same ⟨s, o⟩ relation tuples are aggregated. A prediction is deemed correct if it is contained in the knowledge base. This evaluation setting is therefore a good measure for knowledge base population performance.
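A minimal sketch of instance-level hold-out evaluation is given below. The aggregation rule used here (keep the relation with the highest single-sentence confidence for each ⟨s, o⟩ tuple) is an assumption chosen for brevity, as are the function names; the cited approaches differ in how they combine sentence-level predictions.

```python
from collections import defaultdict


def aggregate_predictions(sentence_predictions):
    """Combine sentence-level predictions into one relation per (s, o) tuple.

    sentence_predictions: iterable of (subject, object, relation, confidence).
    For each tuple, the relation with the highest confidence seen in any
    sentence is kept (an assumed aggregation rule).
    """
    best = defaultdict(dict)
    for subj, obj, relation, confidence in sentence_predictions:
        scores = best[(subj, obj)]
        scores[relation] = max(confidence, scores.get(relation, 0.0))
    return {pair: max(scores, key=scores.get) for pair, scores in best.items()}


def instance_level_precision(instance_predictions, held_out_kb):
    """Fraction of predicted (s, o) -> relation entries found in the held-out KB."""
    if not instance_predictions:
        return 0.0
    correct = sum(
        1
        for pair, relation in instance_predictions.items()
        if held_out_kb.get(pair) == relation
    )
    return correct / len(instance_predictions)
```

Because correctness is decided by lookup in the held-out part of the knowledge base, true relations that are missing from the knowledge base are counted as errors under this scheme.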

For the Web genre, no corpus of Web pages with manually annotated Freebase relations is available. The closest suitable corpora would be the TAC KBP challenge corpora, which consist of manually annotated Wikipedia corpora. However, Wikipedia is a curated text collection and articles have very similar structures. Therefore, Wikipedia text is not very diverse and not a good representation of the Web genre overall. An example of a large and diverse Web corpus is the ClueWeb corpus¹. However, the only annotations which exist for it are unofficial automatic NE annotations provided by Google researchers².

One of the research aims is therefore to create a new corpus for the Web genre using the entity-centric search-based method proposed in Section 3.2.1, which is made publicly available. Evaluations are performed at the instance level, to measure performance for knowledge

¹ http://lemurproject.org/clueweb12/
² http://lemurproject.org/clueweb12/FACC1/