2.5 Text Mining and Entity Recognition

2.5.3 Algorithmic principles for concept and entity recognition

In the following, we introduce different algorithms for implementing concept and entity recognition: dictionary-based, rule-based, and alignment-based approaches. Because string matching is one of the most basic functions of text analysis, different implementations of string matching algorithms are also presented, including hashing, automata, and tree-based approaches.

Dictionaries

Dictionaries are the most basic means of identifying concepts in text. They use a fixed list of concepts and identify the occurrences of the corresponding concept labels in text. This can be done very efficiently, for example by constructing finite state automata (Hakenberg et al., 2008). Unfortunately, the simplicity of a fixed dictionary can decrease the quality of the matching, i.e., reduce precision and recall. This can be attributed to missing variations and unknown concept labels. As reported by Hirschman et al. (2002), between 16 and 69% of concepts can be missed. Furthermore, a dictionary approach may need an additional disambiguation step. Depending on the implementation, it has been reported that the share of correctly identified concepts with their corresponding senses can be as low as 2 to 7% (Hirschman et al., 2002).

Rules

Rules make it possible to describe an infinite number of variations. For instance, McDonald (1996) uses rules in three stages. The first stage identifies candidates with a dictionary and lexical hints. In the second stage, the candidates are classified based on their context, using the surrounding words. In the third stage, abbreviation resolution is applied to the identified concepts. Rules can also be applied to handle morphological modifications, as demonstrated by Ananiadou (1994) with a system of layered rules.

For the best results, it is necessary to combine rules with other approaches. Fukuda et al. (1998) achieved good results by first applying a rule-based approach with five rules to find core terms. In a second step, the core terms are connected using further rules and a part-of-speech tagger.

More recently, Hakenberg et al. (2008) use rules to expand their dictionary by generating variations of gene and protein names after splitting them at visual gaps. For example, they identify a visual gap in the gene name “BRAC1” between the “C” and the “1” because of the change from letters to digits. This allows for variations like “BRAC-1” or “BRAC 1”. Additionally, Arabic numerals can be interchanged with Roman numerals, so that “BRAC I” is also accepted as a valid gene name.
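The visual-gap idea can be illustrated with a minimal sketch. This is not the implementation of Hakenberg et al. (2008); the function names, the set of separators, and the small numeral table are assumptions made only for this example.

```python
import re
from itertools import product

# Hypothetical lookup table, limited to a few numerals for illustration.
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}

def split_at_visual_gaps(name):
    """Split a name wherever a run of letters changes to a run of digits or back."""
    return re.findall(r"[A-Za-z]+|\d+", name)

def variants(name, separators=("", "-", " ")):
    """Generate spelling variants by re-joining the parts with different separators."""
    parts = split_at_visual_gaps(name)
    # A numeric part may be kept as is or replaced by its Roman numeral.
    options = [[p, ROMAN[p]] if p in ROMAN else [p] for p in parts]
    result = set()
    for choice in product(*options):
        for sep in separators:
            result.add(sep.join(choice))
    return result

print(sorted(variants("BRAC1")))
```

The sketch also emits the unseparated form "BRACI", which a real system would probably filter out because the Roman numeral is no longer visually distinct from the letters.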

In general, rule-based systems often need to be adapted for each domain or research aspect. This can be done by hand or with (semi-)automated approaches. Caporaso et al. (2007) use a bootstrapping approach to generate regular-expression rules for identifying protein point mutations in scientific literature.

Alignment

An alternative way of handling unknown variations of term labels is to employ alignment algorithms. An alignment is built from four basic operations: match, substitution, deletion, and insertion. In combination with a scoring scheme, the task of an alignment is to find the combination of operations with the best score in the search space. Well-known algorithms for this task are, for instance, the global sequence alignment (Needleman and Wunsch, 1970), the local sequence alignment (Smith and Waterman, 1981), and the Basic Local Alignment Search Tool (BLAST). BLAST provides improved local sequence alignment and relevance scores (Altschul et al., 1990). For text mining, these alignment algorithms can be applied on different levels: on characters or digits (Tsuruoka and Tsujii, 2003), on words (Doms, 2004), or on a translated version of the text. In the translation approach, the text and the concept labels are first translated to nucleotides. In a second step, the BLAST algorithm is used to find the best alignments (Krauthammer et al., 2000).
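How the four operations interact with a scoring scheme can be sketched with a character-level global alignment in the style of Needleman and Wunsch (1970). The scoring values (match +1, substitution -1, insertion or deletion -1) are assumptions chosen for this example and are not taken from any of the cited systems.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            deletion = score[i - 1][j] + gap   # character of a left unaligned
            insertion = score[i][j - 1] + gap  # character of b left unaligned
            score[i][j] = max(diagonal, deletion, insertion)
    return score[-1][-1]

print(global_alignment_score("BRAC1", "BRAC-1"))
```

With these assumed scores, a label and a text fragment that differ only by an inserted hyphen still receive a high score (five matches and one insertion), which is how alignment tolerates variations that are not listed in the dictionary.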

String Matching Algorithms

The basis for all of the above-mentioned algorithms is an efficient matching of strings. For dictionaries, the task is to find direct occurrences. Rules and patterns also rely on the matching of textual fragments. Similarly, alignment works on the position information of tokens.

The annotation task requires matching a list of strings d = d_0, d_1, d_2, ..., d_k against a given text string s. The string s is typically longer than the strings in the list d. The goal is to find all strings d_j that occur in s. Most traditional string search algorithms, such as Knuth-Morris-Pratt (Knuth et al., 1977) or Boyer-Moore (Boyer and Moore, 1977), take one item d_i and match it against s. For matching a dictionary, this results in a matching algorithm whose complexity depends on the number of entries in the list d and on the complexity of each comparison. To address this complexity issue, the algorithms have to be designed in such a way that they check all list items d_i at once. With this design, the lower bound for the complexity is the length of the string s.

There are multiple options for implementing such an algorithm. These algorithms usually create a data structure from the list d to facilitate efficient lookup. The complexity and memory requirements for creating and storing the data structure are the trade-off for the reduced runtime during the actual search. The different possibilities for such a data structure are described in the subsequent paragraphs.

Hashing. The first option is to simply use a hash-code-based method. Given a collision-free hash function, a hashmap can provide a lookup with constant complexity (O(1)). The main contribution to the complexity is the cost of the hash function itself and how often it is evaluated. Given that the hash function uses all characters in the string, the lower bound for one evaluation is the length of the string, |s| = l. For the task of finding all substrings of s that match an entry in the list, this results in a total complexity of O(l^2). A hashmap is not the only option for combining hashing and string search, however. The Rabin-Karp algorithm (Karp and Rabin, 1987) uses a hash function to reduce the number of character comparisons.
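A minimal sketch of the hashmap variant is shown below, using Python's built-in hash-based set. The function name is hypothetical, and limiting the candidate length to the longest dictionary entry is an added shortcut that the text above does not describe.

```python
def find_matches_hashing(text, dictionary):
    """Return (start, entry) for every dictionary entry that occurs in text."""
    entries = set(dictionary)  # hash-based lookup structure
    # Assumption of this sketch: no candidate longer than the longest entry is tried.
    max_len = max((len(e) for e in entries), default=0)
    matches = []
    for start in range(len(text)):
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            candidate = text[start:end]
            # Each lookup hashes the complete candidate substring.
            if candidate in entries:
                matches.append((start, candidate))
    return matches

print(find_matches_hashing("xaabcca", ["aab", "aacb", "cca"]))
```

Every candidate substring is hashed in full before the lookup, even when its first characters already rule out a match. This is the cost that the tree-based structures described below avoid.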

Finite Automata. A different approach is to treat the task as a pattern matching problem. In the simplest case, a given dictionary results in a non-deterministic finite state automaton (NFA). For efficient matching, this automaton has to be transformed into a deterministic finite state automaton (DFA), which can be done with a powerset construction. Unfortunately, this step can generate up to 2^n states from n initial states. This space requirement during the construction may be a limiting factor.
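The powerset construction can be sketched as follows for an NFA without epsilon transitions. The representation of the automaton as a dictionary from (state, symbol) to a set of states, and all function names, are assumptions of this example. Only subsets that are reachable from the start state are created, so the 2^n bound is a worst case.

```python
from collections import deque

def nfa_to_dfa(nfa, start, accepting):
    """Powerset construction; each DFA state is a frozenset of NFA states."""
    alphabet = sorted({symbol for (_, symbol) in nfa})
    start_set = frozenset([start])
    dfa, dfa_accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current & accepting:
            dfa_accepting.add(current)
        for symbol in alphabet:
            # Union of all NFA states reachable from the current subset.
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            if not target:
                continue
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, start_set, dfa_accepting

def dfa_accepts(dfa, start_set, dfa_accepting, text):
    """Run the DFA over text; a missing transition rejects the input."""
    state = start_set
    for ch in text:
        state = dfa.get((state, ch))
        if state is None:
            return False
    return state in dfa_accepting

# Example NFA: accepts every string over {a, b} that ends in "ab".
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa, start_set, dfa_accepting = nfa_to_dfa(nfa, 0, {2})
print(dfa_accepts(dfa, start_set, dfa_accepting, "aab"))
```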

Trie. Trees are an alternative representation of automata. To avoid the limiting step of a powerset construction, the goal is to construct trees with deterministic branching conditions. Trees that are built specifically for string search in dictionaries are called tries; the name derives from the word retrieval. A trie is a character-based prefix tree. In general, tries can be as fast as a hashmap or faster. Instead of calculating a hash key, the trie is traversed. This allows for early termination and a possible speedup compared to a hashmap. Consider the following example: given are a dictionary d with three entries d_0 = aab, d_1 = aacb, d_2 = cca and the query q = accb.

For an illustration of the resulting trie, see Figure 2.8a. In the case of the hashmap, the hash function calculates the hash key using the whole length |q| = 4 and then does the lookup. In contrast, the traversal of the trie is as follows. In the first step, the first character q[0] = a is used to select the next node. In the second step, the second character q[1] = c has no next node, and the lookup in the trie terminates without checking the remaining characters in q.
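The example can be reproduced with a minimal trie sketch. The class and function names are hypothetical, and the lookup additionally reports how many characters of the query were inspected in order to make the early termination visible.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> child node
        self.is_entry = False  # True if a dictionary entry ends at this node

def build_trie(entries):
    root = TrieNode()
    for entry in entries:
        node = root
        for ch in entry:
            node = node.children.setdefault(ch, TrieNode())
        node.is_entry = True
    return root

def lookup(root, query):
    """Return (found, number of query characters inspected)."""
    node = root
    for inspected, ch in enumerate(query, start=1):
        node = node.children.get(ch)
        if node is None:
            return False, inspected  # early termination
    return node.is_entry, len(query)

trie = build_trie(["aab", "aacb", "cca"])
print(lookup(trie, "accb"))
```

For q = accb the lookup stops after the second character, whereas a hash function would have processed all four characters before the failed lookup.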

Figure 2.8: Examples of (a) a trie and (b) a radix tree. Both data structures contain the dictionary d = aab, aacb, cca.

Radix Tree. The radix tree is a trie with the additional constraint that nodes with only one child are merged. As a result, in contrast to the very memory-intensive trie, the radix tree has a smaller memory footprint. However, it also requires a slightly more complex node selection algorithm during the traversal of the tree. For an example of a radix tree, see Figure 2.8b. The radix tree was originally introduced as PATRICIA (Practical Algorithm To Retrieve Information Coded in Alphanumeric) by Morrison (1968) and as crit bit tree by Gwehenberger (1968).
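The merging of single-child nodes and the more involved node selection can be sketched as follows. This is a simplified string-labelled variant for illustration, not the bit-level PATRICIA structure; all names are hypothetical.

```python
class RadixNode:
    def __init__(self, is_entry=False):
        self.edges = {}          # first character -> (edge label, child node)
        self.is_entry = is_entry

def common_prefix_length(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def insert(root, word):
    node = root
    while word:
        first = word[0]
        if first not in node.edges:
            # No edge starts with this character: store the rest as one edge.
            node.edges[first] = (word, RadixNode(is_entry=True))
            return
        label, child = node.edges[first]
        n = common_prefix_length(label, word)
        if n < len(label):
            # Split the edge at the point where label and word diverge.
            middle = RadixNode()
            middle.edges[label[n]] = (label[n:], child)
            node.edges[first] = (label[:n], middle)
            child = middle
        node, word = child, word[n:]
    node.is_entry = True

def contains(root, word):
    node = root
    while word:
        edge = node.edges.get(word[0])
        if edge is None:
            return False
        label, child = edge
        # Node selection compares a whole edge label, not a single character.
        if not word.startswith(label):
            return False
        node, word = child, word[len(label):]
    return node.is_entry

root = RadixNode()
for entry in ["aab", "aacb", "cca"]:
    insert(root, entry)
print(contains(root, "aacb"), contains(root, "accb"))
```

For the example dictionary, the sketch needs five nodes (the root, one inner node for the shared prefix aa, and three leaves), while the trie for the same dictionary needs nine.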

Prefix search, as done by the trie and the radix tree, is a task that also arises in the routing of packets, for instance on the Internet. For efficient routing, it is necessary to identify the best matching prefix in a lookup table of routing addresses. A good review of the different algorithms, including the radix tree, is presented by Waldvogel et al. (2001). Furthermore, the radix tree is also available as a data structure in the Linux kernel and can be used for such tasks (Corbet, 2006).

Suffix Tree. Another specialization for string search with trees is the suffix tree. It is a radix tree containing all suffixes of a string (Weiner, 1973; McCreight, 1976). With this data structure, it is possible to solve many string search problems. The biggest known issue with suffix trees was the construction of the tree itself. This was solved by Ukkonen (1995), whose idea is to construct the suffix tree online instead of adding all suffixes one by one (Giegerich and Kurtz, 1997). For the usage of suffix trees in conjunction with a dictionary or multiple patterns, the suffix tree has been extended to the Generalized Suffix Tree (Chi and Hui, 1992; Bieganski et al., 1994), which can handle multiple source strings in one suffix tree. Multiple-pattern string search using suffix-tree-like data structures has also been proposed by Aho and Corasick (1975) and Commentz-Walter (1979).
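As an illustration of multiple-pattern search, the following sketch follows the general scheme of Aho and Corasick (1975): a trie of all dictionary entries is extended with failure links, so that the text is scanned once for all entries. The list-based representation and the function names are assumptions of this example.

```python
from collections import deque

def build_automaton(patterns):
    goto, fail, output = [{}], [0], [[]]  # state 0 is the root
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    # Breadth-first traversal: failure links of shallower states are final first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the entries that end at the failure state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output

def find_all(text, patterns):
    """Return (start, pattern) for every occurrence of every pattern in text."""
    goto, fail, output = build_automaton(patterns)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches

print(find_all("xaabcca", ["aab", "aacb", "cca"]))
```

Each character of the text is read once, and a mismatch follows a failure link instead of restarting the search, so the scan does not have to be repeated for every dictionary entry.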

2.5.4 Summary

Ontologies contribute to the background knowledge for semantic search. The automatic identification of ontology concepts in text with text mining applies different algorithms to match the concept labels against the text. For this task, the properties of text, with its different types of variations, synonyms, and ambiguity, as well as ontology issues with naming conventions and descriptive labels, are discussed. Furthermore, entities are an additional information extraction target and require adaptation with respect to synonyms, abbreviations, ambiguity, and the large number of entities.

The algorithmic approaches to implementing ontology-based text mining and entity recognition are presented. These include dictionaries, rules, and alignment. An in-depth description of the fundamental task of string matching is given, including a motivation for the advantages of tree-based string matching compared to hashing or automata as an implementation for rules. Later, in Section 3.1.3, several of these string matching algorithms are compared with each other as part of the GoWeb implementation and evaluation.