

2.8 Terminology recognition in the life sciences

2.8.2 Terminology recognition methods

Dictionary-based approaches use a predefined collection of terms from terminological resources to identify occurrences in text. While this can be done very efficiently, e.g. using Finite State Automata [111], the straightforward use of a fixed dictionary reduces the precision and recall of a system (see section 2.9 for a discussion on evaluation measures). The number of neologisms, i.e. newly created terms, is large in the biomedical field, so a dictionary will be incomplete most of the time, and unknown entities result in a high ratio of missed terms. Hirschman et al. [122] report 16 to 69% missed terms in their experiments. Another problem of the simple use of dictionaries is the ambiguity of terms. Hirschman et al. [122] report that only 2 to 7% of terms were correctly identified in their experiments with a simple dictionary approach.
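The behaviour of a plain dictionary lookup can be illustrated with a minimal sketch. It is not the method of any cited system: the token-level trie, the function names and the three example terms are assumptions made for illustration. It shows both the efficient longest-match lookup and the way an unseen spelling variant is missed.

```python
def build_trie(terms):
    """Build a token-level trie; the key None marks the end of a complete term."""
    root = {}
    for term in terms:
        node = root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node[None] = term
    return root


def find_terms(text, trie):
    """Return (start_token, end_token, term) for the longest match at each position."""
    tokens = text.lower().split()
    matches = []
    i = 0
    while i < len(tokens):
        node, j, last = trie, i, None
        # Follow the trie as long as the next token continues some term.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if None in node:
                last = (i, j, node[None])
        if last:
            matches.append(last)
            i = last[1]  # continue after the matched term
        else:
            i += 1
    return matches


# Hypothetical mini-dictionary, for illustration only.
trie = build_trie(["cell cycle", "cell cycle arrest", "apoptosis"])
hits = find_terms("p53 induces cell cycle arrest and apoptosis", trie)
missed = find_terms("p53 induces cell-cycle arrest", trie)
print(hits)    # longest matches found in the first sentence
print(missed)  # the hyphenated variant is not in the dictionary
```

The second call returns no match because “cell-cycle” is tokenized differently from the dictionary entry, which is the kind of miss that the variant-generating and alignment-based methods below try to avoid.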

The number of registered term labels in a dictionary is fixed, while the number of term variations is continuously growing. Rule-based methods are one approach to handling the potentially infinite number of term variants.

Rule-based

McDonald [174] first used contextual information for his “proper name recognition and classification facility”. The system processes text in three stages. First, it identifies candidates using lexical hints and a dictionary. Second, it classifies the candidates using the surrounding words as context. Third, the identified terms can later be re-identified when they are mentioned as abbreviations.

Ananiadou [7] suggests a method which uses layers of rules that reflect morphological modifications. Fukuda et al. [91] describe a method which extracts target material names by performing connection and extension processing around a core term. They achieve good results with this rule-based approach by first finding core terms with five rules. Afterwards, the core terms are connected using further rules and a part-of-speech tagger.

Hakenberg et al. [111] expand their dictionary by generating variations of gene and protein names after splitting them at visual gaps. A visual gap occurs, for example, in the gene name “BRAC1” between the “C” and the “1”, because of the change from letters to digits. The expansion rules allow for variations like “BRAC-1” or “BRAC 1”. At a later stage, Arabic numerals may be interchanged with Roman numerals, so that “BRAC I” is accepted as well.
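The expansion at visual gaps can be sketched as follows. This is a simplified illustration under stated assumptions, not the rule set of Hakenberg et al.: only letter/digit transitions are treated as gaps, the separators are limited to none, hyphen and space, and the Roman numeral table (`ROMAN`) is a small hypothetical lookup.

```python
import re

# Hypothetical lookup table; a real system would need a fuller conversion.
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}


def split_at_visual_gaps(name):
    """Split a name into maximal runs of letters and maximal runs of digits."""
    return re.findall(r"[A-Za-z]+|\d+", name)


def variants(name):
    """Generate spelling variants by re-joining the parts with different separators."""
    parts = split_at_visual_gaps(name)
    romanized = [ROMAN.get(p, p) for p in parts]
    result = set()
    for sep in ("", "-", " "):
        result.add(sep.join(parts))
        # Roman numerals are only emitted with a visible separator,
        # since "BRACI" would no longer show the gap.
        if sep and romanized != parts:
            result.add(sep.join(romanized))
    return result


print(sorted(variants("BRAC1")))
```

For “BRAC1” the sketch yields the original spelling together with “BRAC-1”, “BRAC 1”, “BRAC-I” and “BRAC I”.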

Rule-based approaches need tedious manual tuning for each domain. The resulting systems are therefore domain-specific and need considerable manual adaptation if applied to new domains.

Alignment-based

The problem of low recall due to unknown variations of term labels was targeted by Tsuruoka and Tsujii [267, 268]. The authors allowed for edit operations on the terms, such as substitution, deletion and insertion of characters and digits. Krauthammer et al. [150] applied a sequence alignment approach to the problem of aligning terms to text. They translate the text and the terms into nucleotide codes and use the BLAST algorithm to compute the alignment.
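The edit operations mentioned above correspond to the classic edit (Levenshtein) distance. The following minimal sketch uses uniform costs of 1 for substitution, deletion and insertion; the cited work may weight operations differently, and the threshold in `is_variant` is a hypothetical parameter.

```python
def edit_distance(a, b):
    """Minimum number of single-character substitutions, deletions and insertions
    needed to turn string a into string b (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def is_variant(term, candidate, max_distance=2):
    """Accept a candidate as a variant of a term if few edits are needed.
    max_distance is an illustrative threshold, not a value from the literature."""
    return edit_distance(term, candidate) <= max_distance


print(edit_distance("IL2", "IL-2"))
print(edit_distance("NF-kappaB", "NF kappa B"))
print(is_variant("NF-kappaB", "NF kappa B"))
```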

The current version of GoPubMed uses a word alignment algorithm [67]. The text and the terms are decomposed into token stems. A local sequence alignment algorithm [248] is used to map term tokens to text tokens. Penalty values for gaps, deletions and insertions were determined experimentally. The word alignment is guided by the notion of the Information Value of a word, which is based on the frequency of occurrence of the word in the ontology. The algorithm was tested on 100 manually curated MEDLINE abstracts and achieved good results (89.5% precision and 81.4% recall). The algorithm has quadratic runtime, as is characteristic of approaches based on dynamic programming. Processing a MEDLINE abstract takes about 10 ms upon fresh annotation. In the worst case of 10,000 new articles for a query result, the annotation pipeline alone takes 100 seconds (10,000 × 10 ms). Heuristics and a caching mechanism ensure an average response time of the current system of less than 10 seconds. The system does not disambiguate the meaning of words. Therefore false annotations of short, ambiguous ontology concepts occur in the current GoPubMed.
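A token-level local alignment in the style of Smith-Waterman can be sketched as follows. This is not the GoPubMed implementation: stemming and the Information Value weighting are omitted, and the scores (`match`, `mismatch`, `gap`) are placeholder assumptions rather than the experimentally determined penalties. The nested loop over term and text tokens makes the quadratic runtime visible.

```python
def local_alignment_score(term_tokens, text_tokens, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a term and a text, both given as token lists."""
    rows, cols = len(term_tokens) + 1, len(text_tokens) + 1
    # H[i][j]: best score of an alignment ending at term token i and text token j.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = term_tokens[i - 1] == text_tokens[j - 1]
            diag = H[i - 1][j - 1] + (match if same else mismatch)
            H[i][j] = max(0,                   # start a new local alignment
                          diag,                # align the two tokens
                          H[i - 1][j] + gap,   # skip a term token
                          H[i][j - 1] + gap)   # skip a text token
            best = max(best, H[i][j])
    return best


term = ["cell", "cycle", "arrest"]
exact = local_alignment_score(term, "induces cell cycle arrest in".split())
gapped = local_alignment_score(term, "cell cycle dependent arrest".split())
print(exact, gapped)
```

With these placeholder scores an exact occurrence of the three-token term scores 6, while an occurrence interrupted by one extra text token scores 5, so a threshold on the score can admit such variants.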

Machine-learning methods can be used to address the task of Word Sense Disambiguation [204].

Machine-learning

Each of the limitations of the previously described approaches has been targeted by machine-learning methods. For example, Collier et al. [53] used Hidden Markov Models and specific orthographic features for the extraction of gene and protein names. The features used include: word contains only lower-case letters; word contains caps, e.g. “kappaB”; letters and digits mixed, e.g. “p53”; and word contains Greek letters. Yamamoto et al. [285] use morphological features, e.g. prefixes, lexical features, e.g. part-of-speech tags and stems, and syntactical features, e.g. noun phrases, as well as a dictionary of protein names. The authors report that the use of a dictionary is crucial in protein name tagging. Gaizauskas et al. [92] use machine-learned rules and heuristics for Named Entity Recognition. Hakenberg et al. [111] trained a Support Vector Machine for the disambiguation of gene names.
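Orthographic features of the kind listed above can be computed per token with a few string tests. The sketch below is an illustration only: the feature names are made up, and treating spelled-out names such as “kappa” as Greek letters (via the small `GREEK_NAMES` subset) is an assumption, not necessarily the definition used by Collier et al.

```python
# Illustrative subset of spelled-out Greek letter names (assumption).
GREEK_NAMES = ("alpha", "beta", "gamma", "delta", "kappa", "lambda", "sigma")


def orthographic_features(token):
    """Map a token to a dictionary of boolean orthographic features."""
    lower = token.lower()
    return {
        "all_lower": token.isalpha() and token.islower(),
        "has_caps": any(c.isupper() for c in token),
        "letters_and_digits": (any(c.isalpha() for c in token)
                               and any(c.isdigit() for c in token)),
        # Spelled-out names as substrings, or characters from the Greek Unicode block.
        "has_greek": (any(name in lower for name in GREEK_NAMES)
                      or any("\u0370" <= c <= "\u03ff" for c in token)),
    }


for tok in ("kappaB", "p53", "protein"):
    print(tok, orthographic_features(tok))
```

The substring test is deliberately crude (it would also fire on “alphabet”); a tagger would feed such features into the learning method rather than use them as rules.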

This section describes machine-learning techniques frequently used for biomedical text mining. All machine-learning systems have in common that a training component is involved; most rely on manually produced training data, and correct tokenization is essential for their performance.

Hidden Markov Models. Hidden Markov Models (HMMs) were first used for part-of-speech tagging by Church [51]. Bikel et al. [24] suggest using HMMs for Named Entity Recognition and achieve performance similar to rule-based approaches without the need to create rules manually. The idea is to learn the probabilistic model of an automaton which produces labels while consuming the text tokens.

The advantage is that HMMs work on sequences, which is exactly how texts are represented. However, there is a drawback to using Hidden Markov Models on texts: HMMs can only consume atomic observations. The model consumes one token after another, but no additional hints such as POS tags, previously identified entities or other metadata can be utilized. Other machine-learning approaches can use complex feature vectors.
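How an HMM assigns labels while consuming one atomic token at a time can be illustrated with standard Viterbi decoding over a toy model. All probabilities, the two labels (`O`, `GENE`) and the fallback probability `unk` for unseen tokens are invented for this sketch; in a real system they would be estimated from annotated training data.

```python
def viterbi(obs, states, start_p, trans_p, emit_p, unk=1e-6):
    """Most probable state sequence for the observed tokens."""
    # V[t][s]: probability of the best path that ends in state s at position t.
    V = [{s: start_p[s] * emit_p[s].get(obs[0], unk) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, V[t - 1][r] * trans_p[r][s]) for r in states),
                          key=lambda pair: pair[1])
            V[t][s] = p * emit_p[s].get(obs[t], unk)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy parameters, invented for illustration.
states = ["O", "GENE"]
start_p = {"O": 0.8, "GENE": 0.2}
trans_p = {"O": {"O": 0.8, "GENE": 0.2}, "GENE": {"O": 0.6, "GENE": 0.4}}
emit_p = {"O": {"the": 0.4, "protein": 0.2, "binds": 0.3, "p53": 0.01},
          "GENE": {"p53": 0.6, "protein": 0.2}}

labels = viterbi(["the", "p53", "protein", "binds"], states, start_p, trans_p, emit_p)
print(labels)
```

Each observation is a single token looked up in the emission table, which is exactly the restriction described above: there is no place to attach a POS tag or another feature to the token.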

Support Vector Machines. Support Vector Machines classify objects represented as feature vectors [271]. The idea is to compute the optimal classifier for a set of training objects. The training objects are represented as vectors. The optimal classifier is a hyperplane between the vectors of the different classes. In the two-dimensional case the hyperplane is a line between the vectors (points in the two-dimensional case). Figure 2.13a shows this simple two-dimensional case. The hyperplane is determined by the vectors closest to the area between the two classes, also called Support Vectors. This means that not all training vectors are needed for the definition of the optimal classifier. Therefore the feature vectors can be very high-dimensional and sparse. Ultimately, only the support vectors determine the classifier.

Figure 2.13b shows a case where no linear classifier can be found. In this case the vector space of the feature vectors is transformed into a higher-dimensional space in which a linear classifier can be found. The computation of the classifier in the higher dimension is done implicitly. This is possible because of the properties of the kernel function; see Burges [40] for a good tutorial on Support Vector Machines.

Figure 2.13: (a) linear case, the hyperplane is a line between the 2D-vectors (b) non-linear case, a Kernel function is needed
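The implicit computation in the higher-dimensional space can be made concrete with a small example. The sketch assumes a homogeneous polynomial kernel of degree 2 on 2D vectors, chosen only because its feature map is easy to write down; the function names and sample points are illustrative. It checks that the kernel value equals the dot product of the explicitly mapped vectors, and that points inside and outside the unit circle, which no line separates in 2D, are separated by a plane after the mapping.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel on 2D vectors: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """Explicit 3D feature map with phi(x) . phi(y) == poly_kernel(x, y)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, y), dot(phi(x), phi(y)))

# Points inside vs. outside the unit circle (illustrative data).
inner = [(0.5, 0.0), (0.0, -0.5), (0.3, 0.3)]
outer = [(2.0, 0.0), (0.0, 2.0), (-1.5, 1.5)]


def plane_side(p):
    """Linear decision function z1 + z3 - 1 in the mapped space."""
    z = phi(p)
    return z[0] + z[2] - 1.0


print([plane_side(p) < 0 for p in inner], [plane_side(p) > 0 for p in outer])
```

An SVM with this kernel never needs to call `phi`; it only evaluates `poly_kernel` on pairs of training vectors.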

The computation can be expensive for large training sets and the classification of a standard SVM is always binary. The following methods allow for the classification into more than two classes.

Maximum Entropy (ME). Berger et al. [20] suggest a method for statistical modeling based on maximum entropy. The idea of the ME approach is to compute the probability of each known class for a given feature vector while making minimal assumptions about the data. The authors introduce the idea with an example: Suppose we see from the training data that words from an input stream are classified into five classes (A, B, C, D and E). If we do not know the frequency of any class assignment, the model has to assume that each class is assigned with the same probability, p(A) = p(B) = p(C) = p(D) = p(E) = 20%. This model makes the most uniform assumption about the observed data. Suppose we know that in 30% of the cases the input vector is classified into class A or B. The constraints for our model are then the following:

p(A) + p(B) + p(C) + p(D) + p(E) = 100%
p(A) + p(B) = 30%

Among the possible solutions of these equations, p(A) = 15%, p(B) = 15%, p(C) = p(D) = p(E) = 70/3% ≈ 23.3% is the most uniform model and makes the fewest assumptions about the data. Suppose we observe further that in 50% of the cases the correct class is A or C. The constraints for our model are then the following:

p(A) + p(B) + p(C) + p(D) + p(E) = 100%
p(A) + p(B) = 30%
p(A) + p(C) = 50%

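Here the model maximizing the entropy is no longer obvious. A simple assignment that satisfies all constraints is p(A) = 20%, p(B) = 10%, p(C) = 30%, p(D) = 20%, p(E) = 20%. The exact maximum-entropy solution lies close to it, at approximately p(A) = 18.6%, p(B) = 11.4%, p(C) = 31.4%, p(D) = p(E) = 19.3%.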

The authors describe the goal intuitively: model all that is known and assume nothing about that which is unknown. In other words, given a collection of facts, choose a model consistent with all the facts, but otherwise as uniform as possible.
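The second constraint set of the example can be checked numerically. The sketch below rests on two observations: the constraints leave p(A) as the only free parameter once p(D) = p(E) is imposed (D and E appear symmetrically, so equal values maximize the entropy), and the entropy is concave in that parameter, so a ternary search suffices. The function names and the search routine are illustrative choices, not the estimation procedure of Berger et al.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def model(a):
    """Distribution fixed by p(A) = a under p(A)+p(B) = 0.3, p(A)+p(C) = 0.5,
    total probability 1 and p(D) = p(E)."""
    b = 0.3 - a
    c = 0.5 - a
    d = e = (1.0 - a - b - c) / 2
    return [a, b, c, d, e]


def argmax_entropy(lo=0.0, hi=0.3, iters=200):
    """Ternary search for the p(A) that maximizes the (concave) entropy."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if entropy(model(m1)) < entropy(model(m2)):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


a_star = argmax_entropy()
print([round(x, 3) for x in model(a_star)])
print(entropy(model(a_star)), entropy([0.2, 0.1, 0.3, 0.2, 0.2]))
```

The search places the maximum near p(A) ≈ 0.186, with an entropy slightly above that of the round values 20%, 10%, 30%, 20%, 20%.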

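The example illustrates that the output of an ME model is a probability for each of multiple outcomes. ME models can handle large feature vectors and large amounts of training data. The model parameters are estimated iteratively (Berger et al. [20] use an iterative scaling procedure); when an ME model is applied to sequence labeling, the most probable label sequence can then be found using the Viterbi algorithm [273].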

Maximum Entropy models have been used for part-of-speech tagging [218], sentence segmentation [226], Named Entity Recognition [220], noun phrase chunking [20] and other tasks.