3.3 Implementation and Practical Considerations

In the following, we describe the technical details that are relevant to the implementation of the implicit network model as described above. We first introduce the algorithmic basis for implementing the network extraction, before highlighting the important steps in preprocessing the documents and discussing the benefits and drawbacks of entity annotation with and without entity linking.

3.3.1 Implicit network extraction algorithm

Based on the model presented in the previous section, the implementation of an implicit network extraction algorithm is fairly straightforward: named entities are extracted from sentences and then connected in a graph structure with weights generated from their distances in the text. When implementing this model, it is important to note that fully connecting all entities and terms in a document produces an extremely dense graph in which each document is represented by a clique, and the combinatorial explosion in the number of edges results in a graph that would be prohibitively large for large document collections. However, the majority of these edges have little impact in practice. While the theory of the model is based on the assumption that entities in the same document share some connection regardless of their distance in the text, long-distance connections are weak and mostly negligible due to the exponential decay in edge weights. Therefore, we include a cut-off parameter c in the algorithm that excludes edges if the distance between the two instances is too large. Similarly, while it is sensible to model cross-sentence relations between entities, terms are less likely to be related to entities outside of their containing sentence, so we limit the edges between terms and entities to those that occur in the same sentence. Based on these considerations, we arrive at the algorithm for constructing an implicit network representation of a document collection (see Algorithm 1). Conceptually, multiple ways of generating such a network are conceivable. In practice, iterating over all sentences in a document in sequence and extracting the local subgraph that is induced by the context window around this sentence is most sensible.

Algorithm 1: Creation of an implicit network based on the output of named entity annotations of a document collection. For the definition of the distance function δ, see Chapter 3.2. The update function adds a new edge if it is not yet contained in E, or adds the value to the existing weight ω if it already is contained in E. The cut-off parameter imposes an upper bound on the distance between cooccurrences.

Input: Documents D, cut-off parameter c

initialize V ← ∅
initialize E ← ∅
initialize ω(v, w) ← 0 for all v, w
for d ∈ D do                                  ▷ iterate over all documents
    S_d ← sentences in d
    E_d ← entities in d
    V ← V ∪ S_d ∪ E_d ∪ {d}                   ▷ add sentence, document and entity nodes
    for s ∈ S_d do                            ▷ iterate over all sentences in the document
        update E with (s, d, 1)               ▷ link the sentence to the document
        E_s ← s ∩ E_d                         ▷ get all entities in the current sentence
        for e ∈ E_s do                        ▷ iterate over all entities in the sentence
            update E with (e, s, 1)           ▷ link the entity to the sentence
        T_s ← s \ E_s                         ▷ get all terms in the current sentence
        V ← V ∪ T_s                           ▷ add term nodes
        for t ∈ T_s do                        ▷ iterate over all terms in the sentence
            update E with (t, s, 1)           ▷ link the term to the sentence
            for e ∈ E_s do                    ▷ iterate over all entities in the sentence
                update E with (e, t, 1)       ▷ link the term to the entity
    for e_1 ∈ E_d do                          ▷ iterate over all entities in the document
        E_d ← E_d \ {e_1}
        for e_2 ∈ E_d do                      ▷ iterate over all other entities in the document
            if δ(e_1, e_2) ≤ c then           ▷ if the distance is within the window
                w ← exp(−δ(e_1, e_2))         ▷ compute the weight contribution
                update E with (e_1, e_2, w)   ▷ update the edge weight
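To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch under stated assumptions, not the reference implementation. The input format (each sentence given as a pair of entity and term lists), the function name build_implicit_network, and the tagged node tuples are hypothetical. Since δ is defined in Chapter 3.2 and not repeated here, the sketch substitutes the absolute difference of sentence indices as a stand-in distance, and it skips pairs of mentions of the same entity.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_implicit_network(documents, c):
    """Sketch of Algorithm 1.

    documents: dict mapping a document id to a list of sentences, where each
    sentence is an (entities, terms) pair of string lists (assumed format).
    c: cut-off on the distance between two entity mentions.
    """
    nodes = set()
    weights = defaultdict(float)

    def update(u, v, w):
        # Undirected edge: create it with weight w or add w to its weight.
        weights[frozenset((u, v))] += w

    for doc_id, sentences in documents.items():
        d = ("doc", doc_id)
        nodes.add(d)
        mentions = []  # (entity node, sentence index) for every mention
        for i, (entities, terms) in enumerate(sentences):
            s = ("sent", doc_id, i)
            nodes.add(s)
            update(s, d, 1.0)                    # sentence -- document
            for e in entities:
                e_node = ("ent", e)
                nodes.add(e_node)
                update(e_node, s, 1.0)           # entity -- sentence
                mentions.append((e_node, i))
            for t in terms:
                t_node = ("term", t)
                nodes.add(t_node)
                update(t_node, s, 1.0)           # term -- sentence
                for e in entities:
                    update(("ent", e), t_node, 1.0)  # term -- entity, same sentence only
        # Entity -- entity edges across sentences, limited by the cut-off.
        for (e1, i1), (e2, i2) in combinations(mentions, 2):
            delta = abs(i1 - i2)  # ASSUMPTION: stand-in for the distance δ
            if e1 != e2 and delta <= c:
                update(e1, e2, math.exp(-delta))
    return nodes, dict(weights)


docs = {"d1": [(["Alice", "Bob"], ["met"]), (["Carol"], ["left"])]}
nodes, weights = build_implicit_network(docs, c=1)
```

For brevity, the sketch enumerates all pairs of mentions in a document and filters by the cut-off afterwards. An implementation aiming for the complexity discussed below would instead only compare mentions inside the sliding context window.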

Asymptotic complexity of the network extraction

The complexity of this extraction procedure is quadratic in the size of the context window that is induced by the cut-off parameter c, but linear in the number of sentences and documents. The same holds for the number of generated edges in the local subgraphs. The extraction process is trivially parallelizable by documents, which results in one graph representation per document; these are then aggregated into one combined graph structure. Aggregation of the individual document graphs can be achieved by sorting the generated edges, which adds a logarithmic factor to the time complexity. Overall, the asymptotic runtime requirements are then in O(c²|S| log(c²|S|)), where |S| is the number of sentences and c is typically a fixed and small parameter such that c² can be regarded as a constant, resulting in a loglinear runtime of O(|S| log |S|) in practice.
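The sort-based aggregation step can be illustrated as follows. This is a sketch under the assumption that each document graph is available as a list of (u, v, w) triples with comparable node identifiers; the function name aggregate_edges is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_edges(per_document_edges):
    """Merge per-document edge lists [(u, v, w), ...] into one edge list."""
    # Order the endpoints so that (u, v) and (v, u) denote the same edge.
    all_edges = [
        (min(u, v), max(u, v), w)
        for edges in per_document_edges
        for u, v, w in edges
    ]
    # Sorting m edges costs O(m log m); this is the logarithmic factor.
    all_edges.sort(key=itemgetter(0, 1))
    # After sorting, duplicates are adjacent and can be summed in one pass.
    return [
        (u, v, sum(w for _, _, w in group))
        for (u, v), group in groupby(all_edges, key=itemgetter(0, 1))
    ]


merged = aggregate_edges([
    [("a", "b", 1.0), ("b", "c", 0.5)],  # edges from a first document
    [("b", "a", 0.5)],                   # edges from a second document
])
```

Because each document contributes on the order of c²|S_d| edges, the combined list has on the order of c²|S| entries, which is where the overall bound comes from.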

3.3.2 Document processing

A number of document-level preprocessing steps are necessary before and during the extraction of the network. The most important practical consideration is that these steps are handled identically during the generation of the network and during subsequent queries, in order to ensure compatibility.

Character encoding

Unless the document collection that is used as input is very meticulously cleaned, special characters need to be removed. Typically, a thorough cleaning during preprocessing is not possible for large collections, since it is impossible to decide which characters are part of names and which are artefacts (consider, as a simple example, title abbreviations that may or may not include periods, or the band name U2). Furthermore, accents that are present in the input texts may be difficult to input at query time (for example, due to querying the network from a mobile device with limited input capabilities, or from a laptop keyboard with a different language layout). Since we are extracting networks from English texts in the following, this is mostly an issue for foreign names, but since the approach is language agnostic, the issue needs to be addressed with the application scenario of the network in mind, especially for languages that make heavier use of accents or non-Latin characters. For our experiments in the following, we use UTF encoding and typically keep accents in the data, but use a lower case representation of all characters.
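A minimal sketch of such a normalization step is shown below, using only the Python standard library. The function name normalize and the optional strip_accents flag are assumptions for illustration; the setup described above corresponds to the default, which keeps accents and lowercases the text.

```python
import unicodedata


def normalize(text, strip_accents=False):
    """Lowercase text; optionally remove accents (hypothetical helper)."""
    # Compose characters first so that visually equal strings compare equal.
    text = unicodedata.normalize("NFC", text).lower()
    if strip_accents:
        # Decompose into base characters plus combining marks, drop the marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text


node_label = normalize("\u00c9mile Zola")                     # accents kept
query_label = normalize("Emile Zola", strip_accents=True)     # accent-free input
```

Whichever variant is chosen, the same function has to be applied to the documents and to the queries. Otherwise an accent-free query such as the one above would not match a node label that retains its accents.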

Term extraction

The extraction of terms is not as straightforward as it initially appears and requires a decision on what constitutes a word. Since terms are defined as the words of a sentence that are left after entities have been removed from the sentence, the removal of entities is key and can be performed in two ways, namely by part-of-speech tagging or by entity deletion. In an ideal scenario, part-of-speech tagging can be used to identify terms since the documents are tagged for parts-of-speech in preparation of the entity extraction step. Using these annotations, it is then possible to identify non-entity terms for the extraction or even filter them by type. However, this approach can be problematic in practice due to errors that occur during part-of-speech tagging, or due to some character sequences that might represent interesting terms not being recognized as words. Furthermore, while entities should not overlap in an ideal scenario, which would enable a structured removal of entities from a sentence, overlaps happen in practice, especially when several different annotation tools are used. Thus, a simpler solution to the extraction of terms is a bitmap for all characters of a sentence that is used to mark the covered text of entities for deletion. Afterwards, marked parts of the sentence are removed and the remainder is considered for term generation. While this prevents the overlap, it also stands to introduce word fragments or non-words as terms. Thus, setting filtering criteria and minimum word lengths for a word to be considered a term may be appropriate. Furthermore, words can be excluded from the list of terms based on frequency, such as frequent stop words, misspelled words, or word fragments that are extremely infrequent.
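The bitmap-based entity deletion can be sketched as follows. The function name extract_terms, the half-open character offsets for entity spans, and the default minimum length are assumptions for illustration. As a design choice of this sketch, covered characters are replaced by spaces instead of being deleted, so that the text on either side of an entity is not glued together into a non-word.

```python
import re


def extract_terms(sentence, entity_spans, min_length=3, stop_words=frozenset()):
    """Return the non-entity terms of a sentence (hypothetical helper).

    entity_spans: (start, end) character offsets, end exclusive; spans may
    overlap, for example when several annotation tools are combined.
    """
    covered = [False] * len(sentence)          # one flag per character
    for start, end in entity_spans:
        for i in range(max(start, 0), min(end, len(sentence))):
            covered[i] = True                  # overlaps are simply marked twice
    remainder = "".join(
        " " if covered[i] else ch for i, ch in enumerate(sentence)
    )
    # Keep runs of letters only, then apply length and stop word filters.
    tokens = re.findall(r"[^\W\d_]+", remainder.lower())
    return [t for t in tokens if len(t) >= min_length and t not in stop_words]


sentence = "Angela Merkel met Barack Obama in Berlin yesterday."
mentions = ["Angela Merkel", "Merkel", "Barack Obama", "Berlin"]
spans = [(sentence.index(m), sentence.index(m) + len(m)) for m in mentions]
terms = extract_terms(sentence, spans)
```

In this example, the spans for "Angela Merkel" and "Merkel" overlap, which poses no problem for the bitmap. Frequency-based filtering of very rare fragments would require collection-level counts and is therefore left out of the sketch.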

Stemming of terms

To reduce the size of the graph and to group related terms into a single node for improved recall in query answering, we recommend the use of a stemmer for term processing. This makes it much easier to match semantically similar words in a query on the resulting graph. Generally speaking, lemmatization would likely be preferable to stemming due to its better performance. However, lemmatization of terms only works properly during the document processing phase in which entire sentences are available, due to the need for part-of-speech tags that we discussed in Chapter 2.1. In most application settings that intend the graph structure to be queried with individual query terms that are not contained in a proper sentence structure, lemmatization would not work well due to missing context and syntactic information. Thus, it would not be possible to properly match query terms to lemmatized nodes in the graph, thereby making queries incompatible with the network.
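The following sketch illustrates why a context-free stemmer keeps queries compatible with the network: the same function can be applied to terms during indexing and to isolated query terms. The function toy_stem is a deliberately crude, hypothetical suffix stripper that stands in for a real stemmer such as the Porter stemmer; it is not suitable for actual use, and all names are assumptions.

```python
def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    word = word.lower()
    for suffix in ("ations", "ation", "ings", "ing", "ed", "es", "s"):
        # Only strip a suffix if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_term_index(terms):
    """Group surface terms under their stem, as done for term nodes."""
    index = {}
    for term in terms:
        index.setdefault(toy_stem(term), set()).add(term)
    return index


def lookup(index, query_term):
    """Stem a single query term with the SAME function used for indexing."""
    return index.get(toy_stem(query_term), set())


index = build_term_index(["elected", "elects", "negotiations"])
matches = lookup(index, "electing")
```

No sentence context or part-of-speech tag is needed at query time, which is exactly the property that lemmatization lacks in this setting.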

Entity annotation and linking

In addition to the basic text preprocessing steps, a final point with strong practical relevance is the dependence of the implicit network model on the annotation of entities in the documents, which deserves attention due to the entity-centric focus of the model. In particular, the user is faced with a choice between only recognizing entities in the text, or also linking them to a gazetteer or knowledge base. On the surface, entity linking is the superior choice since it maps mentions of the same entity to the same node in the graph, even if there are differences in the surface forms of the mentions. Thus, entity linking takes care of disambiguating different or partial spellings of identical entities, which improves the performance of subsequent retrieval tasks due to a more complete representation of entity contexts. In practice, however, entity recognition requires an extensive NLP stack that includes, at the very least, sentence splitting, tokenization, chunking, and entity recognition itself. In cases where the data is annotated manually, this may not be problematic, but in cases of automatic annotation (which is the only viable option for large document collections), these automated steps are bound to incur a cumulative error that propagates and likely compounds throughout the annotation pipeline. Thus, adding entity linking to the pipeline as a final step is likely to increase the overall error. In practice, this typically leads to problems in the recognition of infrequent or emerging entities that are mentioned very sparsely, while it has a more negligible effect on more frequent entities. Thus, the problem can be countered by calibrating the recognition and linking methods to optimize precision at the cost of recall, thereby emphasizing more common entities in the implicit network.
In the following, we use both approaches, but primarily include entity linking for tasks that benefit from uniquely identifiable entities, such as politicians or places in the analysis of news articles in Chapters 5 and 6.
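The effect of the two annotation choices on the node set can be sketched as follows. The gazetteer contents, the identifier ent:obama, and the function name mention_to_node are invented for illustration and do not refer to any real knowledge base. Dropping mentions that cannot be linked is one simple way to favor precision over recall in this sketch.

```python
def mention_to_node(surface_form, gazetteer=None):
    """Map an entity mention to a graph node (hypothetical helper).

    Without a gazetteer, the normalized surface form itself is the node.
    With a gazetteer, known forms map to a shared identifier and unknown
    forms are dropped (returned as None).
    """
    key = " ".join(surface_form.lower().split())
    if gazetteer is None:
        return ("surface", key)
    entity_id = gazetteer.get(key)
    return None if entity_id is None else ("linked", entity_id)


# ASSUMPTION: toy gazetteer with a made-up identifier.
gazetteer = {"barack obama": "ent:obama", "obama": "ent:obama"}
mentions = ["Barack Obama", "Obama", "B. Obama"]

unlinked_nodes = {mention_to_node(m) for m in mentions}
linked_nodes = {mention_to_node(m, gazetteer) for m in mentions} - {None}
```

Recognition alone yields one node per distinct surface form, so the context of the entity is split across nodes. With linking, the known forms collapse into a single node, while the unknown form is lost, which mirrors the trade-off described above for sparsely mentioned entities.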
