The Exact Matching Approach


4.2 The Exact Matching Approach

The exact matching approach (Figure 4.1a) performs an exact text search for synonyms from a dictionary against text and thus directly evaluates the applied synonym dictionary; postfilters can be applied to increase precision. It has been developed together with Joannis Apostolakis and Daniel Güttler (Fundel et al., 2005a).

4.2.1 Match Detection

Synonyms as defined in the synonym dictionary are searched in texts by exact text matching. The search is case-insensitive only if the synonym contains numbers or if the synonym length is above a certain threshold (here: 5 characters); otherwise it is case-sensitive. When several synonyms of different length can be matched at a certain text position, only the longest match is reported.
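The matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names `is_case_insensitive` and `find_matches` are hypothetical, the word-boundary requirement is an added assumption (the text only states that matching is exact), and "longest match" is resolved per start position only.

```python
import re

def is_case_insensitive(synonym, threshold=5):
    """Case is ignored only for synonyms with digits or more than `threshold` characters."""
    return any(ch.isdigit() for ch in synonym) or len(synonym) > threshold

def find_matches(text, synonyms, threshold=5):
    """Return (start, end, synonym) triples, keeping the longest match per start position."""
    best = {}  # start offset -> (end offset, synonym)
    for syn in synonyms:
        flags = re.IGNORECASE if is_case_insensitive(syn, threshold) else 0
        # Assumption: a match must not be embedded in a longer alphanumeric token.
        pattern = re.compile(r"(?<!\w)" + re.escape(syn) + r"(?!\w)", flags)
        for m in pattern.finditer(text):
            start, end = m.span()
            if start not in best or end > best[start][0]:
                best[start] = (end, syn)
    return [(s, e, syn) for s, (e, syn) in sorted(best.items())]

text = "The SMG gene and smg tissue; protein kinase C alpha binds Protein Kinase C."
hits = find_matches(text, ["SMG", "protein kinase C", "protein kinase C alpha"])
```

In this example the short synonym "SMG" is matched case-sensitively, so the lower-case "smg" is ignored, while the two longer synonyms are matched regardless of case and the longer one wins where both start at the same position.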

4.2.2 Rule-Based Postfilter

A rule-based postfilter has been set up in order to implement basic context sensitivity. It checks occurrences of synonyms for nearby occurrence of modifiers (e. g. cells, domains, cell type, DNA binding site) which indicate that a text fragment does not refer to a gene or protein.
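A modifier check of this kind might look as follows. The sketch assumes that "nearby" means within the first few tokens after the match; the window size, the function name `has_nearby_modifier`, and the restriction to the four modifiers quoted in the text are illustrative assumptions.

```python
import re

# Only the modifiers named in the text; the full list is not given there.
MODIFIERS = ("cells", "domains", "cell type", "DNA binding site")

def has_nearby_modifier(text, match_end, window=2, modifiers=MODIFIERS):
    """True if a modifier phrase starts within `window` tokens after the match."""
    following = re.findall(r"\w+", text[match_end:].lower())
    for modifier in modifiers:
        mod_tokens = modifier.lower().split()
        for offset in range(window):
            if following[offset:offset + len(mod_tokens)] == mod_tokens:
                return True
    return False

# A match followed by "cells" would be discarded, a plain mention would be kept.
discard = has_nearby_modifier("CD4 cells were isolated", match_end=3)
keep = not has_nearby_modifier("CD4 binds to MHC class II", match_end=3)
```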

Short synonyms in parentheses often overlap with definitions of abbreviations differing from the assumed protein as in the following examples: “. . . mapped by fluorescence in situ hybridization (FISH) . . . ”, “. . . developing mouse submandibular gland (SMG) . . . ”. Fish and SMG are valid mouse gene names, but in the text, these terms do not refer to genes.

The meaning of such occurrences is clarified by checking the words ahead of parentheses corresponding to the letters of the synonym. If no significant overlap of these words with the alternative names of the assumed protein is found, the match is discarded. For example, “small nuclear ribonucleoprotein polypeptide G”. Both alternative names have no overlap with the respective text fragments and the matches are therefore removed from the result set.
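The parenthesis check can be illustrated with a small sketch. It assumes one preceding word per letter of the synonym and treats an overlap of at least half of these words as "significant"; both choices, as well as the function names, are hypothetical, since the text does not quantify the overlap.

```python
import re

def words_before_parenthesis(text, paren_start, n_words):
    """Return the lower-cased n_words words directly preceding the opening parenthesis."""
    words = re.findall(r"[A-Za-z]+", text[:paren_start])
    return [w.lower() for w in words[-n_words:]]

def keep_parenthesised_match(text, paren_start, synonym, alternative_names, min_overlap=0.5):
    """Keep the match only if the preceding words overlap with an alternative name."""
    context = set(words_before_parenthesis(text, paren_start, len(synonym)))
    if not context:
        return True  # nothing to compare against, so do not discard
    for name in alternative_names:
        name_words = set(re.findall(r"[A-Za-z]+", name.lower()))
        if len(context & name_words) / len(context) >= min_overlap:
            return True
    return False

fragment = "developing mouse submandibular gland (SMG) was analysed"
kept = keep_parenthesised_match(
    fragment, fragment.index("("), "SMG",
    ["small nuclear ribonucleoprotein polypeptide G"],
)
```

Here the words before the parenthesis ("mouse", "submandibular", "gland") share nothing with the alternative name, so the match is discarded.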

4.2.3 SVM-Based Postfilter

Fly synonyms show a significant overlap with common English words, body parts and phenotype descriptions (see Section 3.3.2) and therefore require context-dependent analysis. A postfilter based on support vector machines (SVM) (Chang and Lin, 2001) has been set up for context-dependent pruning of matches.

First, the curated fly synonym dictionary is matched against MEDLINE abstracts. Occurrences of multi-word synonyms are always accepted. Occurrences of single-word synonyms are subjected to the SVM and classified as true or false hits. The SVM uses the following features:

• surface clues (i. e. orthographic properties of the matched synonym): synonym length; whether it contains non-characters, numbers, Greek numbers, capitals, lower-case letters, numbers and letters; whether it consists entirely of capitals, lower-case letters; whether it has a capital after a non-capital; whether the first letter is upper case followed by only lower case letters

• part-of-speech tags (Brill, 1992) of the matched synonym and directly adjacent words

• prefix and suffix of the synonym (the first and last 2 and 3 letters)

• all substrings of length 3 of the synonym

The feature value for the synonym length corresponds to the number of characters of the synonym. All other features are encoded as binary values (e. g. one feature is defined for each possible substring of length three; for all substrings that appear in the considered synonym the corresponding feature value is set to one and all other substring feature values are set to zero).
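The encoding of the synonym-internal features can be sketched as a sparse dictionary in which every absent key stands for a feature value of zero. The feature names, the helper `encode_synonym`, and the selection of surface clues shown here are illustrative assumptions; part-of-speech features are omitted because they require a tagger.

```python
def encode_synonym(syn):
    """Sparse feature dict: 'length' is numeric, all other features are binary."""
    feats = {
        "length": len(syn),
        "has_digit": int(any(c.isdigit() for c in syn)),
        "has_upper": int(any(c.isupper() for c in syn)),
        "has_lower": int(any(c.islower() for c in syn)),
        "all_upper": int(bool(syn) and all(c.isupper() for c in syn)),
        "all_lower": int(bool(syn) and all(c.islower() for c in syn)),
        # a capital directly after a character that is not a capital
        "capital_after_noncapital": int(
            any(not a.isupper() and b.isupper() for a, b in zip(syn, syn[1:]))
        ),
        # first letter upper case, followed only by lower-case letters
        "init_cap": int(
            len(syn) > 1 and syn[0].isupper() and all(c.islower() for c in syn[1:])
        ),
    }
    # prefixes and suffixes of length 2 and 3
    for k in (2, 3):
        if len(syn) >= k:
            feats[f"prefix{k}={syn[:k]}"] = 1
            feats[f"suffix{k}={syn[-k:]}"] = 1
    # one binary feature per substring of length 3 that occurs in the synonym
    for i in range(len(syn) - 2):
        feats[f"tri={syn[i:i + 3]}"] = 1
    return feats

features = encode_synonym("white")
```

Substrings that do not occur in the synonym are simply missing from the dictionary, which corresponds to a feature value of zero in a dense vector.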

Furthermore, scores are used that indicate how often a word is found close to a correct synonym match. Six categories of words are used: nearest verbs, nearest nouns, and words adjacent to a synonym match; occurrences before and after a match are considered separately.

Scores for nouns and verbs have been determined from the 5 000 abstracts of the fly training set of the BioCreAtIvE challenge (Section 4.5.1). Each sentence that contains a synonym was analyzed, and the closest verb and noun (Brill, 1992) before and after the synonym match were extracted. The correct occurrences were used as positive samples and the false occurrences (false positives) as negative samples. For these verbs and nouns a score has been calculated as described below.

A second set of scores is based on a search of mouse synonyms against approximately 700 000 MEDLINE abstracts. In this data set, words appearing adjacent to synonym occurrences are extracted irrespective of their grammatical class. Since no standard of truth is available for this data set, every match is assumed to be a positive sample and the adjacent words are extracted. All sentences in which no synonym has been matched are considered as negative samples and all words from these sentences are used for estimating the background word frequency.

The motivation for using scores obtained from searching fly and mouse synonyms against two different sets of abstracts is to exploit more information than given in the annotated training data.

Scores are calculated as:

\[
\mathrm{Score}_{w,i} = \frac{\dfrac{\mathrm{Occ}^{w}_{i+}}{\mathrm{tot}_{i+}}}{\dfrac{\mathrm{Occ}^{w}_{i+}}{\mathrm{tot}_{i+}} + \dfrac{\mathrm{Occ}^{w}_{-}}{\mathrm{tot}_{-}}}
\]

where

$w$: word (token consisting solely of letters, length ≥ 2; for the BioCreAtIvE fly set only noun or verb, for the large MEDLINE mouse set of any word class)

$i \in \{\text{before}, \text{after}\}$: relative position of word $w$ with respect to the synonym match

$\mathrm{Occ}^{w}_{i+}$: number of occurrences of word $w$ at position $i$ in positive samples

$\mathrm{Occ}^{w}_{-}$: number of occurrences of word $w$ in negative samples

$\mathrm{tot}_{i+}$: total number of words found at position $i$ in positive samples

$\mathrm{tot}_{-}$: total number of words found in negative samples
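The score is the relative frequency of a word at position i in positive samples, normalized by the sum of its relative frequencies in positive and negative samples. A minimal sketch of this computation for one position, assuming the words have already been extracted into two lists (the function names and the toy word lists are made up for illustration):

```python
from collections import Counter

def is_valid_word(token):
    """Tokens consisting solely of letters, with length >= 2."""
    return token.isalpha() and len(token) >= 2

def word_scores(positive_words, negative_words):
    """Score per word for one position i, following the definition above."""
    pos = Counter(w for w in positive_words if is_valid_word(w))
    neg = Counter(w for w in negative_words if is_valid_word(w))
    tot_pos = sum(pos.values())
    tot_neg = sum(neg.values())
    scores = {}
    for word in set(pos) | set(neg):
        freq_pos = pos[word] / tot_pos if tot_pos else 0.0
        freq_neg = neg[word] / tot_neg if tot_neg else 0.0
        # every word in the union has a non-zero frequency on at least one side
        scores[word] = freq_pos / (freq_pos + freq_neg)
    return scores

# Toy data: verbs found before correct matches vs. words from negative samples.
scores = word_scores(
    ["encodes", "encodes", "binds", "is"],
    ["is", "is", "was", "encodes"],
)
```

A word seen only in positive samples receives a score of 1, a word seen only in negative samples a score of 0, and a word that is equally frequent in both receives 0.5.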

These scores are used as SVM feature values: the directly adjacent words and the closest verbs and nouns before and after a synonym match are extracted. For each category, the score of the word is used as the value of the corresponding feature; the value is zero if no score is defined for the word.
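The lookup of the six context features, with zero as the default, could be written as follows. The category names and the dictionary layout are hypothetical; the sketch only shows how undefined scores fall back to zero.

```python
# Six categories: adjacent word, nearest verb, nearest noun; each before and after the match.
CATEGORIES = (
    "adjacent_before", "adjacent_after",
    "verb_before", "verb_after",
    "noun_before", "noun_after",
)

def context_features(context_words, score_tables):
    """context_words: category -> extracted word (or missing);
    score_tables: category -> {word: score}. Returns six feature values."""
    return [
        score_tables.get(category, {}).get(context_words.get(category), 0.0)
        for category in CATEGORIES
    ]

# Toy score tables and context words for illustration only.
tables = {"verb_before": {"encodes": 0.8}, "adjacent_after": {"gene": 0.9}}
words = {"verb_before": "encodes", "adjacent_after": "gene", "noun_after": "wing"}
vector = context_features(words, tables)
```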

The SVM uses a linear kernel. The training data for the SVM has been compiled as follows: The fly synonym dictionary has been searched against the BioCreAtIvE training data for fly (Section 4.5.1) by exact matching. 10 000 occurrences of single-word synonyms have been compared against the annotation provided by the BioCreAtIvE organizers. Occurrences that are correct according to the BioCreAtIvE annotation are used as positive training samples, and occurrences that are not supported by the BioCreAtIvE annotation are used as negative training samples.

For the prediction, the curated synonym dictionary is matched against the abstracts of the test set. Occurrences of multi-word synonyms are accepted directly. Every match of a single-word synonym is classified by the SVM as positive or negative. A single-word synonym is only accepted for an abstract if at least one match of this synonym in the abstract is classified as positive. All occurrences of multi-word synonyms and the accepted single-word synonyms are reported as the final result.
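The acceptance rule at the abstract level can be sketched as below. The classifier is passed in as a callable so that the sketch does not depend on a particular SVM library; `accepted_synonyms`, the `(synonym, context)` pair layout, and the toy classifier are assumptions made for illustration.

```python
def accepted_synonyms(matches, classify):
    """matches: iterable of (synonym, context) pairs found in one abstract;
    classify: callable (synonym, context) -> bool, standing in for the SVM."""
    accepted = set()
    for synonym, context in matches:
        if len(synonym.split()) > 1:
            # multi-word synonyms are accepted directly
            accepted.add(synonym)
        elif synonym not in accepted and classify(synonym, context):
            # one positive classification suffices for the whole abstract
            accepted.add(synonym)
    return accepted

# Toy stand-in for the trained classifier: positive if "gene" occurs in the context.
def toy_classifier(synonym, context):
    return "gene" in context

abstract_matches = [
    ("white", "the white gene is required"),
    ("white", "white eyes were observed"),
    ("wing", "wing morphology was normal"),
    ("fragile X mental retardation 1", "any context"),
]
result = accepted_synonyms(abstract_matches, toy_classifier)
```

In this toy abstract, "white" is accepted because one of its two occurrences is classified as positive, "wing" is rejected, and the multi-word synonym is accepted without classification.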

In document Fundel, Katrin (2007): Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 74-76)