4.3 The GO Slims
4.3.1 GO Slims Have Imprecisely Defined Scope
Slims are created by users according to their own needs and, for that very reason, do not necessarily satisfy the needs of a broader community. The existing slims lack definitions that would precisely describe their content. Slims were introduced early in the life of the Gene Ontology (Adams et al. [10]) and have since remained ad hoc customized views. For research purposes, it is not advisable to guess a slim’s scope and goals from its name alone; e.g., one should not make assumptions about the yeast-specificity of the Yeast slim. Unfortunately, the documentation of GO slims is not very helpful in this respect.
GO slims have no explicit, precise criteria for the inclusion of terms. Specifically, it is not clear what it means that slims give a broad overview of the ontology content without the detail of the specific fine-grained terms; the notion of ‘fine-grainedness’ of GO terms is rather intuitive and imprecise. As illustrated in Fig. 4.2, there are terms excluded from the Generic slim despite their ancestors and successors being included in this slim. Whatever intuitive understanding of the specificity of GO terms one may have, it seems to be contradicted by this example.
[GO:0008150 biological process]
    GO:0009987 cellular process
        [GO:0007154 cell communication]
            [GO:0007267 cell-cell signalling]
            [GO:0008037 cell recognition]
                GO:0009988 cell-cell recognition
        GO:0050875 cellular physiological process
            [GO:0007049 cell cycle]
    [GO:0007582 physiological process]
        GO:0050875 cellular physiological process
            [GO:0007049 cell cycle]
Figure 4.2: A partial view of the biological process branch of the Gene Ontology. Indentation reflects subsumption of terms. Terms included in the Generic slim are in square brackets. The term cellular process is excluded from the slim, although both the term biological process (its ancestor) and the term cell recognition (its successor) are included in it.
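The kind of inconsistency shown in Fig. 4.2 can be detected mechanically. The following is a minimal sketch, not a description of any existing GO tool: the is_a links and the slim membership are read off Fig. 4.2, and the names PARENTS, GENERIC_SLIM, ancestors, and slim_gaps are introduced here for illustration only. A term is reported as a gap when it is absent from the slim although at least one of its ancestors and at least one of its successors are present.

```python
# Toy fragment of the ontology, read off Fig. 4.2 (child -> list of is_a parents).
PARENTS = {
    "GO:0008150": [],                            # biological process
    "GO:0009987": ["GO:0008150"],                # cellular process
    "GO:0007582": ["GO:0008150"],                # physiological process
    "GO:0007154": ["GO:0009987"],                # cell communication
    "GO:0007267": ["GO:0007154"],                # cell-cell signalling
    "GO:0008037": ["GO:0007154"],                # cell recognition
    "GO:0009988": ["GO:0008037"],                # cell-cell recognition
    "GO:0050875": ["GO:0009987", "GO:0007582"],  # cellular physiological process
    "GO:0007049": ["GO:0050875"],                # cell cycle
}

# Terms shown in square brackets in Fig. 4.2.
GENERIC_SLIM = {
    "GO:0008150", "GO:0007154", "GO:0007267",
    "GO:0008037", "GO:0007049", "GO:0007582",
}


def ancestors(term, parents):
    """Return the set of all proper ancestors of a term."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def slim_gaps(parents, slim):
    """Terms outside the slim that have both an ancestor and a successor in it."""
    anc = {term: ancestors(term, parents) for term in parents}
    gaps = set()
    for term in parents:
        if term in slim:
            continue
        has_slim_ancestor = bool(anc[term] & slim)
        has_slim_successor = any(term in anc[member] for member in slim)
        if has_slim_ancestor and has_slim_successor:
            gaps.add(term)
    return gaps


print(sorted(slim_gaps(PARENTS, GENERIC_SLIM)))
```

On this fragment the check reports cellular process, the term discussed in the caption, and also cellular physiological process, which sits between included ancestors and the included term cell cycle.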
A simple approach to estimating a term’s specificity (or its inverse, generality) is to use the count of ancestors of the term. The terms biological process and physiological process (see Fig. 4.2) would have, following this line, equal generality; the term cell-cell signalling would be more specific than the term cellular physiological process, though the former is not actually a successor of the latter.11 An alternative approach, implemented in the Gene Ontology Partition Database (Alterovitz et al. [12]), is to employ information-theoretic calculations based on the count of annotations associated with GO terms. However, in general, such a count of still-incomplete annotations seems to reflect the interest and activity of particular research communities rather than any sort of term generality. This approach will not necessarily be reliable until all genes and gene products in all organisms have been fully annotated.
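The structural estimate, together with the minimal, maximal, and average variants mentioned in footnote 11, can be sketched as follows. This is an illustrative sketch under stated assumptions: specificity is taken to be the length of a path of is_a links to the root, the links are read off Fig. 4.2, and the term HYP:0000001 is a purely hypothetical addition (it does not exist in the Gene Ontology) included only to show what happens when paths to the root differ in length.

```python
from statistics import mean

# Toy fragment read off Fig. 4.2 (child -> list of is_a parents).
PARENTS = {
    "GO:0008150": [],                            # biological process
    "GO:0009987": ["GO:0008150"],                # cellular process
    "GO:0007582": ["GO:0008150"],                # physiological process
    "GO:0007154": ["GO:0009987"],                # cell communication
    "GO:0007267": ["GO:0007154"],                # cell-cell signalling
    "GO:0050875": ["GO:0009987", "GO:0007582"],  # cellular physiological process
    "GO:0007049": ["GO:0050875"],                # cell cycle
    # Hypothetical term, NOT part of GO: one short and one long route to the root.
    "HYP:0000001": ["GO:0008150", "GO:0007049"],
}


def root_path_lengths(term, parents):
    """Lengths, in is_a links, of every path from the term up to the root."""
    if not parents[term]:
        return [0]
    return [
        length + 1
        for parent in parents[term]
        for length in root_path_lengths(parent, parents)
    ]


def depth_measures(term, parents):
    """Minimal, maximal, and average distance from the root."""
    lengths = root_path_lengths(term, parents)
    return {"min": min(lengths), "max": max(lengths), "mean": mean(lengths)}


for term in ("GO:0007267", "GO:0050875", "HYP:0000001"):
    print(term, depth_measures(term, PARENTS))
```

Here cell-cell signalling lies three links from the root and cellular physiological process two, so the former comes out as more specific even though neither term subsumes the other. For the hypothetical term the three variants disagree, and nothing in the ontology itself indicates which of them to prefer.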
These two approaches are based on the static structure of the GO and on its so-called ‘information content’, respectively. In either case, the result may not correspond well to the specificity of terms as it would be understood by domain experts. Computational assessment of the specificity of GO terms is a recurring topic of discussion on the GO-friends12 mailing list.
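To make the dependence on annotation activity concrete, the sketch below uses a common information-content formulation, IC(t) = log2(N / n(t)), where n(t) is the number of annotations to t or to any of its successors and N is the count at the root. Whether the Gene Ontology Partition Database uses exactly this formula is an assumption; the annotation counts are made-up numbers, not real GO data, and all function names are introduced here for illustration.

```python
import math

# Toy fragment read off Fig. 4.2 (child -> list of is_a parents).
PARENTS = {
    "GO:0008150": [],                            # biological process
    "GO:0009987": ["GO:0008150"],                # cellular process
    "GO:0007582": ["GO:0008150"],                # physiological process
    "GO:0007154": ["GO:0009987"],                # cell communication
    "GO:0007267": ["GO:0007154"],                # cell-cell signalling
    "GO:0008037": ["GO:0007154"],                # cell recognition
    "GO:0009988": ["GO:0008037"],                # cell-cell recognition
    "GO:0050875": ["GO:0009987", "GO:0007582"],  # cellular physiological process
    "GO:0007049": ["GO:0050875"],                # cell cycle
}

# Hypothetical direct annotation counts (invented for this example).
DIRECT_ANNOTATIONS = {
    "GO:0007049": 60,
    "GO:0007267": 30,
    "GO:0008037": 8,
    "GO:0009988": 2,
}


def ancestors(term, parents):
    """Return the set of all proper ancestors of a term."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagated_counts(parents, direct):
    """Count each annotation once for its term and once for every ancestor."""
    totals = {term: 0 for term in parents}
    for term, count in direct.items():
        for target in {term} | ancestors(term, parents):
            totals[target] += count
    return totals


def information_content(parents, direct):
    """IC(t) = log2(N / n(t)); terms without any annotation are skipped."""
    totals = propagated_counts(parents, direct)
    root_total = max(totals.values())  # the root subsumes every annotated term
    return {t: math.log2(root_total / n) for t, n in totals.items() if n > 0}


before = information_content(PARENTS, DIRECT_ANNOTATIONS)
# Simulate one community adding 100 cell cycle annotations; the ontology is unchanged.
after = information_content(PARENTS, {**DIRECT_ANNOTATIONS, "GO:0007049": 160})
print(round(before["GO:0007154"], 2), round(after["GO:0007154"], 2))
```

In this toy setting the information content of cell communication rises by one bit solely because another branch received more annotations, which is the effect the preceding paragraphs caution against.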
4.3.2 ‘Species-Specificity’ Has Imprecise Meaning
Until recently, many terms in the Gene Ontology were said to be species-specific, and were marked as such by a ‘sensu ...’ inclusion in the name and an ‘as in, but not restricted to, ...’ inclusion in the definition (where the ellipses stand for a taxon name and a taxon description, respectively). As an example, Fig. 4.3 shows the term proteasome complex (sensu Eukaryota), encoded in the OBO file format. In an effort to clarify the intentions, ‘sensu ...’ was replaced by ‘sensu ... research community’, but this change was later reversed; currently, ‘sensu’ terms are being modified so that their names reflect the actual differentiating criteria rather than the use of a term by a particular research community. For example, vacuolar lumen (sensu Magnoliophyta) has been replaced with lumen of vacuole with cell cycle-independent morphology, etc.
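Because the ‘sensu’ convention is purely textual, terms that follow it can be located with simple pattern matching. The sketch below is a minimal illustration rather than a full OBO parser: it handles only a single [Term] stanza of tag-value lines like the one in Fig. 4.3, and the names parse_stanza, sensu_taxon, SENSU_IN_NAME, and SENSU_IN_DEF are introduced here for illustration.

```python
import re
from collections import defaultdict

# The stanza shown in Fig. 4.3.
STANZA = '''
[Term]
id: GO:0000502
name: proteasome complex (sensu Eukaryota)
namespace: cellular_component
def: "A large multisubunit complex which catalyzes protein degradation. This complex consists of the barrel shaped proteasome core complex and the regulatory particle that caps the proteasome core complex. As in, but not restricted to, the eukaryotes (Eukaryota, ncbi_taxonomy_id:2759)." [GOC:rb]
synonym: "26S proteasome" NARROW []
is_a: GO:0043234 ! protein complex
is_a: GO:0044424 ! intracellular part
'''

SENSU_IN_NAME = re.compile(r"\(sensu ([^)]+)\)")
SENSU_IN_DEF = re.compile(r"As in, but not restricted to,")


def parse_stanza(text):
    """Collect the 'tag: value' lines of one stanza into a dict of lists."""
    tags = defaultdict(list)
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("["):  # skip blanks and the [Term] header
            continue
        tag, _, value = line.partition(": ")
        tags[tag].append(value)
    return dict(tags)


def sensu_taxon(term):
    """Return the taxon named in a '(sensu ...)' suffix, or None if absent."""
    match = SENSU_IN_NAME.search(term["name"][0])
    return match.group(1) if match else None


term = parse_stanza(STANZA)
print(term["id"][0], sensu_taxon(term), bool(SENSU_IN_DEF.search(term["def"][0])))
```

Such a check finds only the marker, not its meaning: it cannot tell whether a term was intended to be restricted to the named taxon, which is precisely the ambiguity at issue.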
11 This simple approach is somewhat complicated by the fact that the Gene Ontology allows for multiple inheritance, i.e., a term may have more than one path to the root of the ontology, and the paths may be of different lengths. Ad hoc solutions to this problem include considering the maximal, minimal, or average distance from the root as a measure of specificity.
[Term]
id: GO:0000502
name: proteasome complex (sensu Eukaryota)
namespace: cellular_component
def: "A large multisubunit complex which catalyzes protein degradation. This complex consists of the barrel shaped proteasome core complex and the regulatory particle that caps the proteasome core complex. As in, but not restricted to, the eukaryotes (Eukaryota, ncbi_taxonomy_id:2759)." [GOC:rb]
synonym: "26S proteasome" NARROW []
is_a: GO:0043234 ! protein complex
is_a: GO:0044424 ! intracellular part
Figure 4.3: The OBO-format entry for the term proteasome complex (sensu Eukaryota).
‘Species-specificity’ is also invoked in the description of slims. For example, the Prokaryotic GO subset is “a prokaryote-specific subset of GO terms, [which] contains only terms that are applicable to prokaryotes”.13 The subset may thus include terms that are applicable not only to prokaryotes (the definition does not state that it contains only terms that are applicable only to prokaryotes); it may also exclude some terms that are applicable to prokaryotes (the definition does not state that it contains all terms that are applicable to prokaryotes). The Plant slim contains the term biological process, which clearly is not plant-specific (plants are not the only organisms in which biological processes take place). It also contains the term extracellular matrix (sensu Metazoa), whose name suggests specificity to animals; it should rather include a term such as extracellular matrix (sensu Viridiplantae).14
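The readings that the quoted definition leaves open can be told apart with elementary set operations. In the sketch below the term labels and the applicability sets are abstract placeholders invented for illustration; they make no claim about which real GO terms apply to which organisms.

```python
# Placeholder data (hypothetical): which terms apply to which group of organisms.
applicable_to_prokaryotes = {"term_a", "term_b"}
applicable_to_other_taxa = {"term_a", "term_c"}
subset = {"term_a"}  # a candidate 'prokaryotic subset'


def contains_only_applicable(subset, applicable):
    """What the definition states: every member is applicable to the taxon."""
    return subset <= applicable


def contains_all_applicable(subset, applicable):
    """Not stated: every applicable term is a member."""
    return applicable <= subset


def contains_only_exclusive(subset, applicable, elsewhere):
    """Not stated: every member is applicable to the taxon and to no other."""
    return subset <= (applicable - elsewhere)


print(contains_only_applicable(subset, applicable_to_prokaryotes))
print(contains_all_applicable(subset, applicable_to_prokaryotes))
print(contains_only_exclusive(subset, applicable_to_prokaryotes, applicable_to_other_taxa))
```

The candidate subset satisfies the definition as worded while failing both stronger readings, so ‘prokaryote-specific’ in the intuitive sense does not follow from the definition.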