3.4 Event Completion on Wikipedia Data

3.4.1 An implicit network of Wikipedia

As a basis for our evaluation of events as an entity-centric extraction task that can be backed by an implicit network representation of the data, we extract such a network from Wikipedia. The benefits are comprehensiveness, since few text sources cover as broad a spectrum as Wikipedia, as well as scale, due to the size of the document collection.

Entity types for events: LOAD

With the main task being the identification and representation of events that emerge from the joint occurrence of named entities and temporal expressions, we base our model on the relations between distinct classes of such entities. The most prominent class of involved entities is that of the actors in an event. Generally, these correspond to the named entity type of persons, but we use the more general term to also include non-person actors in possible fictional settings, to which the model is equally applicable. The underlying assumption is that actors are singular individuals. In contrast, we consider groups of people to be organizations, meaning that in this model, an organization may describe a company, a political party, or even a rock band (however disorganized the latter two might seem at times). To represent the geographic component, we include locations, which in the most general interpretation are points or areas in space at which events take place. Finally, the temporal dimension is included through mentions of dates as temporal expressions. Here, we only consider dates of the granularity levels day, month, and year, and do not include time intervals. However, a refinement of the model that represents intervals as sets of temporally ordered, discrete dates is certainly possible.

Since we are interested in modelling and extracting the relationships between these entities, we can use an implicit network that contains the four mentioned types of entities as a basis for information retrieval and event exploration tasks. Because events are composites of entities of different types, we only extract edges between entities of different types. As a result, we obtain an implicit network that contains entities of the types location (Loc), organization (Org), actor (Act), and date (Dat), to which we refer as the LOAD-network. Thus, the set of entities can be regarded as the union of four sets, E = Loc ∪ Org ∪ Act ∪ Dat. A schematic visualization is shown in Figure 3.3.
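The restriction to edges between entities of different types can be illustrated with a minimal sketch. The class and method names below (`EntityType`, `LoadNetwork`, `add_cooccurrence`) are hypothetical and chosen for illustration only; they are not taken from the implementation used in this work, and the entity names in the usage example are placeholders.

```python
from collections import defaultdict
from enum import Enum


class EntityType(Enum):
    """The four entity classes of the LOAD-network."""
    LOC = "Loc"
    ORG = "Org"
    ACT = "Act"
    DAT = "Dat"


class LoadNetwork:
    """Undirected, weighted graph over typed entities (illustrative helper)."""

    def __init__(self):
        # Maps an unordered pair of (type, name) nodes to an aggregated weight.
        self.edges = defaultdict(float)

    def add_cooccurrence(self, a, b, weight=1.0):
        """Add `weight` to the edge a-b; return False if the pair is skipped."""
        # Edges are only kept between entities of different types.
        if a[0] == b[0]:
            return False
        self.edges[frozenset((a, b))] += weight
        return True

    def weight(self, a, b):
        """Aggregated weight of the edge a-b, or 0.0 if there is no edge."""
        return self.edges.get(frozenset((a, b)), 0.0)


# Usage with placeholder entities.
network = LoadNetwork()
actor = (EntityType.ACT, "some actor")
place = (EntityType.LOC, "some location")
other_actor = (EntityType.ACT, "another actor")

network.add_cooccurrence(actor, place)        # kept: Act-Loc
network.add_cooccurrence(actor, other_actor)  # skipped: Act-Act
```

Storing each edge under an unordered pair reflects that the cooccurrence relation is symmetric, so the order in which two entities are looked up does not matter.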

Data selection and annotation

Figure 3.3: Schematic view of the implicit network extracted from Wikipedia articles (pages) for locations, organizations, actors (persons), and dates.

To test the model on a large-scale document collection, we use the entire dump of the English Wikipedia from June 2, 2015, as input. Since implicit networks are designed to represent the information that is contained in unstructured text, we exclude infoboxes, tables, references, and pages of lists, and therefore only use the raw text without any links or pre-existing annotations. For the tokenization, sentence-splitting, POS-tagging, lemmatization, and named entity recognition, we use the Stanford Named Entity Recognizer [61].

We employ the 3-class model trained for CoNLL data to extract persons, organizations, and locations as discussed above. For this first extraction of an implicit network, we only perform entity extraction without entity linking (we later also extract a network of Wikipedia that is constructed with entity linking in Chapter 4.2). As temporal tagger and for the normalization of temporal expressions, we use HeidelTime [188] instead of StanfordNER, since Wikipedia articles largely follow a narrative structure and we therefore need a domain-sensitive temporal tagger that can be adapted to this narrative domain instead of the typical news domain [189]. For the stemming of terms, we use the Snowball stemmer, which is an implementation of the Porter stemming algorithm [149]. We set the cut-off parameter for the distance between entities to c = 5, since this value allows the use of 32-bit single-precision floating point numbers for storing the exponentially diminishing edge weights without the risk of numeric instability, and thus helps in keeping the graph size manageable. We use the hierarchical completeness condition as discussed in Chapter 3.2 for dates, and the inclusion by splitting for the names of persons and locations.
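To make the role of the cut-off parameter concrete, the following sketch aggregates edge weights from the entity mentions of a single document. The weighting function itself is defined outside this section; the sketch assumes that every pair of mentions contributes exp(-δ), where δ is their distance in sentences, and that pairs with δ > c are ignored. The function name `cooccurrence_weights` and the input layout are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

CUTOFF = 5  # c = 5 sentences, as chosen in the text


def cooccurrence_weights(mentions, cutoff=CUTOFF):
    """Aggregate exp(-distance) over mention pairs within `cutoff` sentences.

    `mentions` is a list of (sentence_index, entity_type, name) tuples taken
    from one document. The exponential form of the weight is an assumption
    made for this sketch.
    """
    weights = defaultdict(float)
    for (s1, t1, n1), (s2, t2, n2) in combinations(mentions, 2):
        distance = abs(s1 - s2)
        # Skip pairs of the same entity type and pairs beyond the cut-off.
        if t1 == t2 or distance > cutoff:
            continue
        weights[frozenset(((t1, n1), (t2, n2)))] += math.exp(-distance)
    return dict(weights)


# Usage with placeholder mentions: the organization in sentence 9 is more
# than five sentences away from every other mention and yields no edge.
example = [
    (0, "Act", "actor A"),
    (0, "Loc", "location L"),
    (2, "Dat", "date D"),
    (9, "Org", "organization O"),
]
edge_weights = cooccurrence_weights(example)
```

Under this assumed weighting, the smallest single contribution that can enter an edge is exp(-5), roughly 0.0067, so the cut-off bounds the range of magnitudes that have to be summed and stored.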

In the annotation phase, we find that there are about 4.6M English Wikipedia articles that contain at least one entity mention. The documents can be split into a total of 91.4M sentences, 53.5M of which contain at least one annotation. We find a total of 137.0M instances of entities, which are divided into 27.0M of class date, 44.2M of class location, 44.4M of class person, and 21.3M that are annotated as organizations. After the extraction of cooccurrences and the aggregation of edges, we obtain an implicit network, for which we give the metrics in Table 3.1. We observe that the number of distinct dates is by far the smallest due to the selected granularities year, month, and day. However, the coverage of dates in Wikipedia is above 50% for dates after the middle of the 16th century and perfect for dates after 1800 [185], meaning that each such date is mentioned at least once. The large number of terms can be explained by the presence of technical terms, mismatched names or locations, and numeric data, which we did not model as separate entities (although it is a possible extension of the model). While it would be possible to reduce this number by limiting the terms to a dictionary, this approach might be too restrictive in many applications, especially for a large collection such as Wikipedia.

         Loc     Org     Act     Dat     Ter     Sen     Doc
Loc        0
Org     90.8       0
Act    275.8   105.7       0
Dat     83.0    45.5   127.6       0
Ter    182.8    93.9   316.6    57.3       0
Sen     71.3    20.9    84.4    38.3   412.2       0
Doc        0       0       0       0       0    53.5       0

|V|      2.7     3.4     7.1     0.2     4.9    53.5     4.5

Table 3.1: Number of edges (top) and nodes by entity type (bottom) of the implicit network constructed from the English Wikipedia (in millions).
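For working with the figures of Table 3.1, the lower-triangular edge counts can be stored as nested lists and queried symmetrically. The following sketch uses only values reported in the table and in the text; the names `edge_count`, `load_edge_total`, and `instance_total` are hypothetical.

```python
from itertools import combinations

TYPES = ["Loc", "Org", "Act", "Dat", "Ter", "Sen", "Doc"]

# Lower-triangular rows of Table 3.1 (number of edges, in millions).
ROWS = [
    [0],
    [90.8, 0],
    [275.8, 105.7, 0],
    [83.0, 45.5, 127.6, 0],
    [182.8, 93.9, 316.6, 57.3, 0],
    [71.3, 20.9, 84.4, 38.3, 412.2, 0],
    [0, 0, 0, 0, 0, 53.5, 0],
]

# Entity instances per class as reported in the text (in millions).
INSTANCES = {"Dat": 27.0, "Loc": 44.2, "Act": 44.4, "Org": 21.3}


def edge_count(a, b):
    """Edges (in millions) between node types a and b, in either order."""
    i, j = sorted((TYPES.index(a), TYPES.index(b)), reverse=True)
    return ROWS[i][j]


# Edges among the four LOAD entity types only.
load_types = ["Loc", "Org", "Act", "Dat"]
load_edge_total = sum(edge_count(a, b) for a, b in combinations(load_types, 2))

# Sum of the per-class instance counts.
instance_total = sum(INSTANCES.values())
```

The edges among the four LOAD entity types add up to 728.4 million, and the per-class instance counts add up to 136.9M, which agrees with the reported total of 137.0M up to rounding of the individual figures.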