Part I Text Mining
3.4 Hierarchical Synonym Dictionaries
Publications often contain information on gene or protein groups such as Matrix Metalloproteases (MMPs), Collagen, or Bone Morphogenic Protein (BMP), without mentioning a specific member of the group. These gene or protein groups can be structural families, functionally similar proteins, a complex consisting of various proteins, or the proteins implicated in a common regulatory or signaling event or pathway. Texts might contain information concerning all members of the respective group, or concerning individual members that are specified in a way that is not recognized by a named entity identification approach. Public databases contain some information on higher-level groupings of genes and proteins, such as annotation with Gene Ontology terms or references to pathways and complexes. For example, Koike and Takagi (2004) generated a family name dictionary based on the InterPro (Mulder et al., 2007) family hierarchy and manual construction of the remaining hierarchy based on sequence similarities. Yet the databases generally do not contain information on gene groups that is as comprehensive as text mining requires. Extracting group terms directly from gene and protein synonym dictionaries makes it possible to derive group terms that are appropriate for text mining, at various levels of specificity and irrespective of the type of group.
In the following, an approach for generating hierarchical synonym dictionaries and a corresponding benchmark are presented. The approach has been developed together with, and implemented by, Caroline Friedel and Cornelia Donner as part of their bachelor theses (Friedel, 2003; Donner, 2003).
3.4.1 Generation of Hierarchical Synonym Dictionaries
Hierarchical synonym dictionaries expand on standard synonym dictionaries by additionally containing objects and synonyms for gene and protein families and groups. The lowest level of the hierarchy consists of the standard gene and protein synonym dictionaries as described in Section 3.2. The higher levels contain name groups with different levels of generalization. A mapping links the group objects to their respective constituents.
The principal steps of the heuristics applied for generating hierarchical synonym dictionaries from standard synonym dictionaries are described in the following:
(1) Extraction of group terms: For every synonym containing an object specifier, i.e., ending with one of the following expressions
• a number
• a number followed by a letter (A-F)
• a Roman numeral
• a letter (A-F) preceded by a space or hyphen and followed by any number of digits,
the substring pruned for the object specifier is extracted as group term. Initially, every
for the respective group object. The pair of original object identifier and group identifier is added to the mapping. For each original gene or protein object, all generated group terms
are gathered as a set of alternative group terms.
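As an illustration of step (1), the following minimal Python sketch prunes an object specifier from the end of a synonym and collects the resulting group terms. The regular expression, the function names extract_group_term and build_group_mapping, the restriction of Roman numerals to the letters I, V and X after a space or hyphen, and the stripping of trailing spaces and hyphens from the group term are assumptions made for this example; they are not taken from the original implementation.

```python
import re
from collections import defaultdict

# Assumed pattern for the four kinds of object specifier listed in step (1).
_SPECIFIER = re.compile(
    r"(?:"
    r"\d+[A-F]?"           # a number, optionally followed by a letter A-F
    r"|(?<=[ -])[IVX]+"    # a Roman numeral (assumption: I, V, X after space/hyphen)
    r"|(?<=[ -])[A-F]\d*"  # a letter A-F after a space or hyphen, then digits
    r")$"
)


def extract_group_term(synonym):
    """Return the synonym without its trailing object specifier, or None."""
    match = _SPECIFIER.search(synonym)
    if match is None:
        return None
    # Assumption: separators left at the end of the stem are not part of the term.
    term = synonym[:match.start()].rstrip(" -")
    return term or None


def build_group_mapping(dictionary):
    """dictionary maps an object identifier to its synonyms."""
    members = defaultdict(set)       # group term -> identifiers of original objects
    alternatives = defaultdict(set)  # object identifier -> alternative group terms
    for obj_id, synonyms in dictionary.items():
        for synonym in synonyms:
            term = extract_group_term(synonym)
            if term is not None:
                members[term].add(obj_id)
                alternatives[obj_id].add(term)
    return dict(members), dict(alternatives)


# Hypothetical toy dictionary; identifiers and synonyms are for illustration only.
toy = {
    "P1": ["MMP-9", "matrix metalloproteinase 9"],
    "P2": ["MMP2", "collagenase IV"],
}
members, alternatives = build_group_mapping(toy)
print(sorted(members))
```

In this toy example both objects contribute the group term MMP, so the mapping links the group MMP to P1 and P2, while the remaining terms form the alternative group terms of the individual objects.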
(2) Curation consists of filtering and merging steps. Group terms below a threshold length (e.g. three characters) are removed. Terms that are too general as group designations, such as kinase or protein, are filtered. Group objects are then merged with the aim of pooling group terms that refer to the same set of genes/proteins. First, group objects are merged if the respective group terms were derived from the same original objects. The terms of a combined group object are then checked for their alternative group terms. Alternative group terms that are shared by at least 40% of the members of the combined group are added to this group object. Next, group objects are merged if they share at least 70% of their synonyms (i.e. group terms) or of their respective original object identifiers. The shortest group term is used as identifier for the combined group object. Group objects are curated similarly to single objects (Section 3.2.2), yet regular expressions for synonym pruning are not applied as these would remove many group synonyms.
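The filtering and merging of step (2) can be sketched as follows. This is a simplified illustration, not the original implementation: the text does not state how the 70% overlap is normalized, so the sketch assumes it is measured relative to the larger of the two sets; it also assumes that terms of exactly the threshold length are kept, and it omits the 40% rule for alternative group terms. All names (filter_terms, shared_fraction, merge_groups) and the toy data are hypothetical.

```python
MIN_LENGTH = 3                       # example threshold length from the text
TOO_GENERAL = {"kinase", "protein"}  # examples of overly general terms from the text
MERGE_THRESHOLD = 0.7                # 70% criterion from the text


def filter_terms(members):
    """Drop group terms that are too short or too general."""
    return {
        term: set(ids)
        for term, ids in members.items()
        if len(term) >= MIN_LENGTH and term.lower() not in TOO_GENERAL
    }


def shared_fraction(a, b):
    """Assumed reading of 'share at least 70%': overlap relative to the larger set."""
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def merge_groups(members):
    """members maps a group term to the identifiers of its original objects."""
    # First merge: terms derived from exactly the same original objects.
    by_members = {}
    for term, ids in members.items():
        by_members.setdefault(frozenset(ids), set()).add(term)
    groups = [{"terms": terms, "members": set(ids)} for ids, terms in by_members.items()]

    # Second merge: repeat until no pair of groups meets the 70% criterion.
    changed = True
    while changed:
        changed = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                a, b = groups[i], groups[j]
                if (shared_fraction(a["terms"], b["terms"]) >= MERGE_THRESHOLD
                        or shared_fraction(a["members"], b["members"]) >= MERGE_THRESHOLD):
                    a["terms"] |= b["terms"]
                    a["members"] |= b["members"]
                    del groups[j]
                    changed = True
                    break
            if changed:
                break

    # The shortest group term identifies the combined group (ties: alphabetical).
    for group in groups:
        group["id"] = min(group["terms"], key=lambda t: (len(t), t))
    return groups


# Hypothetical toy input: group term -> identifiers of original objects.
toy = {
    "MMP": {"P1", "P2", "P3"},
    "matrix metalloproteinase": {"P1", "P2", "P3"},
    "gelatinase": {"P1", "P2", "P3", "P4"},
    "BMP": {"P7", "P8"},
    "kinase": {"P1", "P7"},
    "IL": {"P5", "P6"},
}
groups = merge_groups(filter_terms(toy))
print(sorted(group["id"] for group in groups))
```

Here kinase and IL are filtered, MMP and matrix metalloproteinase are merged because they stem from the same objects, and gelatinase joins them because three of its four members are shared. A different normalization of the overlap would change which groups are merged.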
(3) Ambiguities between groups are reduced with the aim of assigning the ambiguous terms to the most specific group objects. An ambiguous synonym that is used as group identifier for one of the respective groups is assigned only to this group. An ambiguous synonym of limited length (here: 7 characters) that occurs as suffix of another synonym assigned to one of the respective group objects is assigned only to this group (e.g. Bcl is assigned to the group of Apoptosis regulator Bcl). Furthermore, an ambiguous term ending with a number is removed from a group object if the group object contains further terms that are identical to the ambiguous term except for the final number. Again, group objects are merged if they share 70% of their terms. Ambiguous synonyms ending with a letter (A-F) or a number followed by a letter are assigned only to the group objects of the lower hierarchy level; other ambiguous synonyms are assigned only to the group object of the higher hierarchy level.
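The first two rules of step (3) can be illustrated with a short sketch. It covers only the identifier rule and the suffix rule; the remaining rules are left out. The function name resolve_ambiguous, the reading of "limited length" as at most 7 characters, the requirement that the suffix rule applies only if exactly one group qualifies, and the second toy group are assumptions for illustration.

```python
def resolve_ambiguous(term, groups, max_len=7):
    """Return the identifiers of the groups that keep `term`.

    groups maps a group identifier to the set of synonyms of that group.
    """
    holders = [gid for gid, synonyms in groups.items() if term in synonyms]
    if len(holders) < 2:
        return holders  # the term is not ambiguous between groups

    # Rule 1: the term is itself the identifier of one of the groups.
    if term in holders:
        return [term]

    # Rule 2: a short term that is a suffix of another synonym of a group.
    if len(term) <= max_len:
        hits = [
            gid for gid in holders
            if any(s != term and s.endswith(term) for s in groups[gid])
        ]
        if len(hits) == 1:  # assumption: apply the rule only if it is decisive
            return hits

    return holders  # still ambiguous; further rules would apply here


# Toy data: the first group follows the example in the text,
# the second group is hypothetical.
groups = {
    "Apoptosis regulator Bcl": {"Apoptosis regulator Bcl", "Bcl"},
    "B-cell lymphoma": {"B-cell lymphoma", "Bcl"},
}
print(resolve_ambiguous("Bcl", groups))
```

In the toy data, Bcl is a synonym of both groups but a suffix only of Apoptosis regulator Bcl, so it is kept for that group alone.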
(4) Ambiguities between groups and single objects are reduced by removing synonyms that are ambiguous between a single object and a group object from the single object, provided that the single object is mapped to the group object.
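Step (4) amounts to a set difference restricted to mapped members, as the following minimal sketch shows. The function name and the dictionary-of-sets representation are assumptions for illustration; the identifiers in the toy data are hypothetical.

```python
def remove_group_ambiguities(single_objects, group_objects, mapping):
    """Remove group synonyms from the single objects mapped to that group.

    single_objects: object identifier -> set of synonyms
    group_objects:  group identifier  -> set of synonyms
    mapping:        group identifier  -> identifiers of the member objects
    """
    cleaned = {obj_id: set(syns) for obj_id, syns in single_objects.items()}
    for group_id, group_synonyms in group_objects.items():
        for obj_id in mapping.get(group_id, ()):
            if obj_id in cleaned:
                cleaned[obj_id] -= group_synonyms
    return cleaned


# Hypothetical toy data: P1 is a member of the group MMP, P9 is not.
single_objects = {"P1": {"MMP-9", "MMP"}, "P9": {"MMP", "other name"}}
group_objects = {"MMP": {"MMP", "matrix metalloproteinase"}}
mapping = {"MMP": {"P1"}}
cleaned = remove_group_ambiguities(single_objects, group_objects, mapping)
print(cleaned["P1"], cleaned["P9"])
```

The synonym MMP is removed from P1, which is mapped to the group, but is kept for P9, which is not a member of the group.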
This procedure finally returns a hierarchical synonym dictionary that contains the original gene and protein objects as well as the newly generated group objects, and a mapping between group objects and their corresponding original gene or protein objects.
3.4.2 Evaluation
A Benchmark for Gene/Protein and Hierarchical Synonym Dictionaries
A hierarchical synonym dictionary has been generated by application of the above proce-
Prot (Bairoch et al., 2005). ProMiner (Hanisch et al., 2003, 2005; see Section 4.3) was applied for named entity identification with the hierarchical synonym dictionary.
200 MEDLINE abstracts have been annotated manually with 746 occurrences of 486 distinct gene objects and 219 occurrences of 121 distinct group objects. Two thirds of the abstracts have been selected randomly and one third has been selected specifically for relevance of gene groups. The latter were abstracts in which ProMiner identified at least five distinct objects and for which the results with the standard synonym dictionary and the hierarchical synonym dictionary differed.
The evaluation of hierarchical synonym dictionaries requires specific criteria. If a synonym occurrence is ambiguous between a single object and a group object, either match is accepted; i.e., the occurrence is counted once (56 cases). If the text mentions a specific member of a group and the automatic search returns the specific member as well as the group, the match of the group is ignored (96 cases). If a specific member of a group is not matched due to unusual or complicated grammatical constructs, the match of the group term is also accepted (15 cases). All other matches of group objects are evaluated without special treatment (123 cases).
Evaluation Results
The evaluation results (Table 3.6) show that the expansion for gene and protein groups
increases recall at similar precision, which leads to a remarkable increase in F-measure.
Synonym dictionary        TP    FN    FP    Recall   Precision   F-measure
standard dictionary       656   211   107   0.76     0.86        0.81
hierarchical dictionary   781    86   121   0.90     0.87        0.88
Table 3.6: Results of evaluation of standard gene and protein name dictionary and the hierarchical synonym dictionary expanded for gene and protein groups on the manually annotated benchmark set (TP: true positives, FN: false negatives, FP: false positives).
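The figures in Table 3.6 can be related to the raw counts with a few lines of Python. The sketch assumes the usual definitions of recall and precision and the balanced F-measure (harmonic mean of the two); the function name scores is hypothetical.

```python
def scores(tp, fn, fp):
    """Return recall, precision and balanced F-measure from raw counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# Counts taken from Table 3.6.
for name, (tp, fn, fp) in {
    "standard dictionary": (656, 211, 107),
    "hierarchical dictionary": (781, 86, 121),
}.items():
    recall, precision, f_measure = scores(tp, fn, fp)
    print(f"{name}: recall={recall:.3f} precision={precision:.3f} F={f_measure:.3f}")
```

With these counts the sketch reproduces the tabulated recall and precision. For the standard dictionary the unrounded F-measure is about 0.805; the tabulated value of 0.81 results when F is computed from the rounded recall and precision, which may explain the small difference.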
The detailed analysis of the results on the benchmark set indicates similar categories of errors for the group synonym dictionaries as for the standard synonym dictionaries (see also Chapter 4). The most important category of false negative errors of the standard synonym dictionary (Table 3.7) corresponds to group terms; usage of the hierarchical synonym dictionary clearly increases recall.
Some of the false negative errors are caused by deficiencies of the matching strategy applied by ProMiner such as missing enumeration resolution. Others could be corrected by different parameter settings, improved ProMiner term lists, or improved synonym dictionaries. With the hierarchical synonym dictionary, some group terms are not found due to ambiguous synonyms (3 cases) or parentheses in synonym occurrences (2 cases). Furthermore, some standard objects are not found anymore (5 cases) as the respective synonyms are similar to group synonyms.
Most of the false positive matches (Table 3.8) are caused by correct matching of terms that
Type of error          Occurrences   Example
gene/protein groups    123           growth hormone
missing synonym        29            PKCepsilon, NRAS
ambiguous synonym      19            SMN1
enumerations           19            ERK1/2
unclear                10            Ang II, CCK A receptor
parentheses            9             Bcl-x(L), cyclooxygenase (COX)-1
standard English       2             Bad
Table 3.7: False negative results of the evaluation of the standard gene and protein synonym dictionary on the manually annotated benchmark set.
Type of error                    Occurrences   Example
semantic ambiguity               63            PtK1 vs. PtK1 tissue cells
match within longer expression   16            GSH-S-transferase, c-Jun NH2-terminal kinase
unspecific synonym               14            motor protein, serine/threonine kinase
insertions                       6             transcription factor-1 vs. transcription factor IDX-1
permutation                      5             alpha 2-M vs. alpha M beta 2,
                                               IL-2 receptor vs. IL-1 receptor II
token-class weighting            2             diabetes-associated peptide vs. diabetes associated
Table 3.8: False positive results of the evaluation of gene and protein synonym dictionaries on the manually annotated benchmark set. Synonyms of gene or protein objects are marked in italics.
(e.g. if a group term achieves a better match score than a standard term) but introduces others (e.g. if a group synonym is unspecific and overlaps with a non-gene term). Most approaches for named entity identification return at each position the longest string that can be identified. Biological objects are often described by nested terms, and gene and protein names frequently form part of a longer biological expression (e.g. cell type or mutation). In these cases, context analysis is required to resolve the correct meaning and thus improve performance. Unspecific synonyms are undesired for individual gene and protein objects, yet interesting for gene and protein groups. Thus, the curation of synonym dictionaries could be improved to better distinguish between specific gene and protein objects and various levels of generality of group terms. In rare cases, permutations of synonym constituents, insertions, and token-class-based weighting of the synonym constituents also caused errors.
In summary, the addition of group terms to the synonym dictionary leads to a substantial improvement of named entity identification (F-measure of 0.88 compared to 0.81 on the benchmark set). The proposed approach for group term and object generation is based on the application of a set of heuristic rules. The automatic expansion of standard synonym dictionaries is thus straightforward. The evaluation showed that the performance can be increased further. The generation rules can easily be expanded, for example by handling terms whose subtype descriptors are letters other than A-F or Roman numerals. Finally, group synonyms could be derived from corresponding protein family or class databases such as InterPro (Mulder et al., 2007).