Ogren et al. (2004) analysed the compositional structure of GO terms. The authors found that many GO terms contain other GO terms and that many are derived from one another. For example, the term membrane [GO:0016020] has inner membrane [GO:0019866] as a direct sub-concept. This knowledge can be used to automatically generate new candidate terms following the observed patterns. We analysed whether these super-string relations observed in GO can be verified in text.
Evaluation setup
By analysing the GO we identified 3,129 out of 20,223 terms whose children contain the term as a sub-string. For 1,189 of these terms (6% of all GO terms), both the term and at least one such child were found in PubMed abstracts (see also Table 5.3). For each parent term, based on at most 5,000 texts containing it, we identified the words which precede the term and ranked them by their frequency of occurrence. This led to a list of newly identified candidate terms for possible inclusion in the ontology. In the following we show the generated candidate child terms for the example GO terms “Death”, “vacuole”, and “GTPase activator activity”. Valid prefixes and prefixes matching existing child terms contained in the GO are indicated in the column “in GO”. Many of the predicted terms are children of the known parent term.
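The ranking step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `rank_prefix_words`, the tokenisation by a simple regular expression, and the toy sentences standing in for PubMed abstracts are all assumptions.

```python
import re
from collections import Counter

def rank_prefix_words(parent, texts, max_texts=5000):
    """Count the word directly preceding `parent` and rank by frequency."""
    # One word (letters, digits, hyphens), whitespace, then the parent term.
    pattern = re.compile(
        r"\b([\w-]+)\s+" + re.escape(parent) + r"\b", re.IGNORECASE
    )
    counts = Counter()
    for text in texts[:max_texts]:  # at most 5,000 texts per parent term
        for match in pattern.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts.most_common()

# Toy sentences invented for illustration; they are not PubMed data.
abstracts = [
    "Autophagic vacuole formation was observed; each autophagic vacuole fused with a lysosome.",
    "The contractile vacuole expels water.",
    "A large vacuole and an autophagic vacuole were seen.",
]
ranking = rank_prefix_words("vacuole", abstracts)
print(ranking)
```

Each ranked word, prepended to the parent term, yields one candidate child term (for example "autophagic" gives "autophagic vacuole").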
Terms in GO                                                              20223
Terms found in abstracts                                                 14905
Terms having children containing themselves                               3129
  ... parent found in text                                                2692
  ... parent and one child found in text                                  2239
  ... parent and one child found in text; parent substring of child       1189
  ... parent and all children found in text                               1781
Terms having children                                                     7451
  ... parent found in abstracts                                           5964
  ... parent and one child found in abstracts                             5185
  ... parent and all children found in abstracts                          3757
Table 5.3. Statistics on the appearance of Gene Ontology terms in PubMed abstracts, with and without their known child terms. 74% of all 20,223 GO terms (as of December 2005) can be found in PubMed abstracts; 29% of terms can be found and have children; 26% of terms can be found in text and at least one child term can be found; 19% of terms can be found and all children can be found in text. Hence, for 26% of terms it is theoretically possible to infer a parent-child relationship on the basis of PubMed, which is the upper bound for the method described in Section 5.4 (Co-occurrence analysis – Algorithm by Heymann et al.). For 6% of terms the parent is a substring of the child and both are contained in text, which is the upper bound for the method described in Section 5.3 (Pattern-based relation extraction – Superstring prediction).
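The percentages quoted in the caption follow directly from the counts in Table 5.3. The short check below recomputes them; the dictionary keys are shorthand labels chosen here for illustration.

```python
TOTAL_GO_TERMS = 20223

# Counts taken from Table 5.3.
counts = {
    "found in abstracts": 14905,
    "found and have children": 5964,
    "found with at least one child": 5185,
    "found with all children": 3757,
    "parent substring of child, both found": 1189,
}

# Share of all GO terms, rounded to whole percent as in the caption.
shares = {label: round(100 * n / TOTAL_GO_TERMS) for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```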
Example: GO:0005096 ‘GTPase activator activity’
GTPases are molecular switches. A GTPase activator stimulates the hydrolysis of GTP by a GTPase. GTPase activator activity has the children ‘ARF’, ‘Rab’, ‘Rac’, ‘Ral’, ‘Ran’, ‘Rap’, ‘Ras’, ‘Rho’ and ‘Sar GTPase activator activity’. Five of the children can be found automatically.
pos.  candidate        count  in GO
1     ras              133    child
2     rho              106    child
3     small            100    similar term
4     intrinsic        88
5     gap              37     synonym
6     p21ras           34
7     family           29
8     arf              23     child
9     triphosphatase   19     similar term
10    rac              17     child
11    p21              16
12    rab              12     child
Example: GO:0016265 ‘Death’
This term has the children ‘aging’, ‘tissue death’ and ‘cell death’. Of these three terms, the superstring prediction method is only capable of finding ‘tissue death’ and ‘cell death’. While ‘cell death’ is ranked first, ‘tissue death’ is not found within the first 50 predicted terms. Nevertheless, careful inspection of the result list shows that many terms are from the medical domain rather than molecular biology. Terms like ‘cardiac death’, ‘neuronal death’, ‘infant death’, ‘fetal death’, ‘brain death’ and ‘neonatal death’ make perfect sense for a medical ontology. Predicted prefix words like ‘sudden’, ‘early’ and ‘late’ can easily be filtered using knowledge about their frequency of occurrence in the English language.
pos.  candidate   count  in GO
1     cell        60678  child
2     sudden      11521
3     cardiac     7179   suggested child
4     neuronal    5326   suggested child
5     infant      3925
6     fetal       3636   suggested child
7     brain       3468   suggested child
8     early       2658
9     late        2079
10    neonatal    2038   suggested child
...
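The filtering of general-language prefixes mentioned for the ‘Death’ example could look roughly like the sketch below. It simplifies the idea to a fixed set of common words: `COMMON_ENGLISH` is a hypothetical stand-in for a real general-English frequency list, which the text does not specify. The candidate counts are taken from the table above.

```python
# Hypothetical set of high-frequency general-English words. In practice this
# would be derived from a general-language frequency list (an assumption).
COMMON_ENGLISH = {"sudden", "early", "late", "large", "small"}

def filter_common(ranked, common=COMMON_ENGLISH):
    """Drop candidate prefixes that are common general-language words."""
    return [(word, count) for word, count in ranked if word not in common]

# Top 10 candidates for 'Death', as listed in the table above.
death_top10 = [
    ("cell", 60678), ("sudden", 11521), ("cardiac", 7179),
    ("neuronal", 5326), ("infant", 3925), ("fetal", 3636),
    ("brain", 3468), ("early", 2658), ("late", 2079), ("neonatal", 2038),
]
kept = filter_common(death_top10)
print([word for word, _ in kept])
```

A frequency threshold on a reference corpus would be the more faithful variant; the fixed set is used here only to keep the sketch self-contained.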
Example: GO:0005773 ‘vacuole’
A vacuole is defined as a closed structure, found only in eukaryotic cells, that is completely surrounded by unit membrane and contains liquid material. The term has the children ‘autophagic’, ‘contractile’, ‘lytic’, ‘parasitophorous’ and ‘storage vacuole’. All are found in the first 50 predicted terms.
pos.  candidate          count  in GO
1     autophagic         1219   child
2     cytoplasmic        1048   suggested child
3     parasitophorous    933    child
4     large              684
5     food               496
6     contractile        387    child
7     phagocytic         383    suggested child
8     rimmed             383
9     lipid              378    suggested child
10    intracellular      303    suggested child
11    intracytoplasmic   295    suggested child
12    digestive          265    descendant
13    endocytic          260    suggested child
14    small              247
15    membrane-bound     240    suggested child
...
20    storage            175    child
...
44    lytic              36     child
...
Results: Superstring prediction
For the experiment, only those terms were considered where at least one child and its parent term were contained in text and where the parent term was literally contained in the child term. The analysis was performed separately for the two cases, where either
(a) the child term ends with the parent term (1,062 of 1,189 cases), or
(b) the child term starts with the parent term (127 of 1,189 cases).
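The distinction between the two cases can be expressed as a simple string test. The sketch below is illustrative only: the function name `containment_case` and the requirement of a separating space are assumptions, and the pair used for case (b) is invented rather than taken from GO.

```python
def containment_case(parent, child):
    """Return 'a' if child ends with parent, 'b' if it starts with it, else None."""
    p, c = parent.lower(), child.lower()
    if c.endswith(" " + p):    # case (a): "<prefix> <parent>"
        return "a"
    if c.startswith(p + " "):  # case (b): "<parent> <suffix>"
        return "b"
    return None                # parent is not a leading or trailing part of child

print(containment_case("membrane", "inner membrane"))        # case (a)
print(containment_case("vacuole", "storage vacuole"))        # case (a)
print(containment_case("cell growth", "cell growth factor")) # case (b), invented pair
print(containment_case("death", "aging"))                    # no containment
```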
Per parent term, a maximum of 5,000 PubMed abstracts has been analysed and term occurrences have been counted. The terms preceding or subsuming the parent term have been ranked by frequency of occurrence. The hypothesis that parent terms are contained in their child terms as proper sub-strings holds for more than 15% of GO terms (3,129 of 20,223). Ogren et al. (2004) reported that A ⊂ B, given that A is a parent of B, in 25.5% of the cases (4,197 of 16,451 GO terms). Although the composition of terms is a pattern in the Gene Ontology, the experiments did not investigate to what extent string inclusion can be found in other domains. [Requirement 1: Domain independence] The method is domain independent, as no domain-specific information is required. [Requirement 2: Performance] The method is simple and fast and can be easily integrated in interactive learning tools. The OBO-Edit Ontology Generation Plug-in as well as the Protégé Plug-in provide a regular expression filter functionality which allows finding candidates according to experiment (a) with the query “<child> <parent>$” and candidates according to experiment (b) with the query “<parent> <child>$”.
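To illustrate the kind of regular expression filter meant here, the following sketch applies comparable patterns with Python's `re` module. It does not reproduce the plug-ins' actual implementation or query syntax: it assumes that the placeholders in the queries stand for one arbitrary candidate word and the literal parent term, and the function name `filter_candidates` is hypothetical.

```python
import re

def filter_candidates(phrases, parent, case="a"):
    """Keep phrases matching the pattern for experiment (a) or (b)."""
    word = r"[\w-]+"           # one candidate word, hyphens allowed
    p = re.escape(parent)      # the parent term, taken literally
    if case == "a":
        pattern = rf"{word} {p}$"    # candidate word followed by parent at the end
    else:
        pattern = rf"^{p} {word}$"   # parent followed by exactly one candidate word
    rx = re.compile(pattern, re.IGNORECASE)
    return [phrase for phrase in phrases if rx.search(phrase)]

phrases = ["autophagic vacuole", "vacuole membrane", "vacuole",
           "large contractile vacuole"]
ends_with = filter_candidates(phrases, "vacuole", case="a")
starts_with = filter_candidates(phrases, "vacuole", case="b")
print(ends_with, starts_with)
```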
(a) Child term ends with parent term
                             top 5    top 10
children found (of 1062)     276      334
recall                       26.0%    31.5%
precision                    6.9%     4.1%
maximal precision            26.4%    13.2%

(b) Child term starts with parent term
                             top 5    top 10
children found (of 127)      35       43
recall                       27.6%    33.9%
precision                    0.9%     0.5%
maximal precision            3.2%     1.6%
Table 5.4. Precision and recall observed for the top 5 and top 10 ranked potential child terms for the cases where the child term (a) ends with and (b) starts with the parent term.
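The recall values in Table 5.4 are the number of children found divided by the number of children considered, which can be checked directly. The sketch below also shows one plausible per-term reading of precision and recall at rank k, applied to the ‘vacuole’ example; how precision was aggregated over all parent terms is not spelled out in the text, so the function `precision_recall_at_k` is an assumption.

```python
# Recall in Table 5.4: children found / children considered, in percent.
recall_a = [round(100 * found / 1062, 1) for found in (276, 334)]
recall_b = [round(100 * found / 127, 1) for found in (35, 43)]
print(recall_a, recall_b)

def precision_recall_at_k(ranked, known_children, k):
    """Per-term precision and recall for the top-k ranked candidates."""
    top = ranked[:k]
    hits = sum(1 for candidate in top if candidate in known_children)
    return hits / len(top), hits / len(known_children)

# Top 10 candidates and known children of 'vacuole', from the example table.
vacuole_ranked = ["autophagic", "cytoplasmic", "parasitophorous", "large",
                  "food", "contractile", "phagocytic", "rimmed", "lipid",
                  "intracellular"]
vacuole_children = {"autophagic", "contractile", "lytic",
                    "parasitophorous", "storage"}
print(precision_recall_at_k(vacuole_ranked, vacuole_children, 5))
print(precision_recall_at_k(vacuole_ranked, vacuole_children, 10))
```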
The results of the analysis in terms of precision and recall are shown in Table 5.4. [Requirement 3: Precision] The simple experiment shows on average a very low precision of less than 10% for experiment (a) and even lower than 1% for experiment (b). The overall precision is expected to be higher when noun phrase chunking is included as a filter that allows only valid noun phrases as child terms. In addition, the true precision can be expected to be higher, as valid candidate terms which are not part of the test resource (in this case the Gene Ontology) are regarded as false predictions. [Requirement 4: Coverage] With respect to requirement 4, Table 5.4 shows a recall of 26% for experiment (a) and 27.6%
for experiment (b). [Requirement 5: Transparency] The method is transparent in a way that all terms extracted from [...] explicit evidence, and hence no transparency for the assignment of subsumption relationships.