6.3.5 Definition size vs term depth
A final point that needs to be considered in relation to group definition size is whether there is any relationship between the number of terms in a definition and the depth of these terms. A strong correlation (positive or negative) between definition size and term depth would suggest bias in the data, with deep terms occurring primarily in either small definitions (negative correlation) or in large definitions (positive correlation).
            Number of GO terms
ST    FT    All groups  Meaningful groups
0.28  0.17  3067        2975
0.28  0.25  3067        2953
0.40  0.17  2976        2779
0.40  0.25  2976        2737
0.28  0.26  3067        2982
0.28  0.41  3067        2982
0.40  0.26  2976        2791
0.40  0.41  2976        2781
0.93  0.42  3067        1326
0.93  0.58  3067        1083
0.95  0.42  3058        1221
0.95  0.58  3058        990
0.93  0.66  3067        1363
0.93  0.88  3067        1044
0.95  0.66  3058        1263
0.95  0.88  3058        947
Table 6.12: Number of GO terms used at least once in the group definitions for all groups and for meaningful groups. The total number of GO terms in the annotation of the Eisen dataset is 3101. The total number of distinct GO terms in group definitions is less than 3101 because there are always a few GO terms whose similarity with themselves is less than the ST for that grouping.
ST    r
0.28  0.21
0.40  0.28
0.93  0.06
0.95  0.03
Table 6.13: Correlation coefficients (r) for definition size vs. average depth of GO terms in the definition, for each semantic threshold. The coefficients were calculated using Pearson correlation and R’s cor() function. Correlation coefficients are independent of FT as the group definitions only depend on ST.
Table 6.13 shows the correlation coefficients (calculated as before in R, using Pearson’s correlation) for definition size and average depth of the terms in the definition. The average depth was calculated using the maximum depth for each term (longest path between the term and the ontology root) in the definition. As a reminder, each term in the GO can be related to the root via multiple paths, which may traverse different numbers of nodes. Unless otherwise stated, any reference to “distance from root” of a GO term is to the maximum distance. For all the terms annotated to the Eisen dataset, the difference between minimum and maximum distance to the root ranges from 0 (about 39% of terms) to 11 (one case). The average difference lies at 1.98.
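The calculation described above can be illustrated with a minimal Python sketch. It is not the code used for Table 6.13, which was produced with R's cor() function. The toy ontology, the term names and the example definitions are hypothetical, and Pearson's r is written out by hand so that the sketch has no dependencies.

```python
from functools import lru_cache
from math import sqrt

# Hypothetical toy ontology (NOT real GO data): term -> direct parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["A"],
    "C": ["root", "B"],  # reachable via a short path and a long path
}

@lru_cache(maxsize=None)
def max_depth(term):
    """Longest path (number of edges) from the term to the root."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)

@lru_cache(maxsize=None)
def min_depth(term):
    """Shortest path (number of edges) from the term to the root."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + min(min_depth(p) for p in parents)

def average_depth(definition):
    """Mean of the maximum depths of the terms in a group definition."""
    return sum(max_depth(t) for t in definition) / len(definition)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical group definitions (lists of terms).
definitions = [["A"], ["A", "B"], ["B", "C", "A"], ["C"]]
sizes = [len(d) for d in definitions]
depths = [average_depth(d) for d in definitions]
r = pearson(sizes, depths)
```

In this toy ontology, term C has a minimum distance of 1 and a maximum distance of 3 from the root, which is the kind of difference the paragraph above quantifies for the Eisen annotations.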
Based on the coefficients found for each ST (all r ≤ 0.28), there is no strong correlation between definition size and term depth and therefore no evidence of bias in the way the GO terms are grouped into definitions. In conjunction with the correlation coefficients found for group size vs. definition size (Table 6.10), the conclusion is that the underlying structure of the GO does not directly influence the semantic similarity between GO terms, the resulting functional similarity between gene products and the groups which are created using the two types of measures.
6.4 Summary
In this chapter, a number of overall trends of grouping results for different thresholds were considered. One recurring feature that stands out is the effect of Schlicker’s semantic thresholds on the grouping results. The high thresholds and small difference between minimum and maximum ST led to a much larger number of groups than those obtained using Resnik’s measure, much smaller average group sizes, a much greater proportion of groups of insufficient size and very small group definitions.
All of these elements bring into question the suitability of these semantic thresholds. As briefly discussed in Section 6.1, lowering the minimum ST for Schlicker was considered but initial tests showed little promise of improvement.
Although Schlicker’s measure objectively addresses a drawback in Resnik’s measure, it performed less well than the older measure in the present context. While using a different true-positive/true-negative dataset might have generated a different and better set of thresholds, this was not feasible within the scope of this project, as it would have required considerable time and very high levels of expert knowledge of all areas of molecular biology covered by the GO. The analysis of grouping results from here will therefore be limited to Resnik’s approach.
While this chapter focussed on the groups generated by the FuSiGroups algorithm from a very high-level perspective, the next two chapters will focus on the actual content and definitions of some of the groups. In addition to the full Eisen dataset, subsets of the data will be analysed in detail.
Chapter 7
The complete Eisen dataset
Until now, the results generated by the FuSiGroups algorithm have only been analysed in very general terms such as the number of groups obtained for a given set of thresholds, group sizes and number of genes grouped. In this and the next chapter, the definition and content of groups will be analysed in order to determine whether the FuSiGroups algorithm does indeed meet its target functionality of grouping together gene products based on meaningful functional relationships, providing an objective view of such complex biological data containing valuable novel insights. Chapter 7 focusses on the full Eisen dataset, while Chapter 8 will provide detailed investigations into several smaller, less noisy datasets to address a number of specific questions.
At the end of Chapter 6, it was concluded that the semantic thresholds determined for Schlicker’s approach are less suitable for use with FuSiGroups than those determined for Resnik. For this reason, all analysis in this chapter uses groups based on Resnik’s semantic similarity approach.
In addition, in order to avoid repetition, it would be helpful to select only one combination of the ST and FT parameters on which to perform a more detailed analysis of groupings. It was shown previously that for Resnik, the BMA functional similarity approach performs better than the MAX functional similarity approach (Section 4.1, Figure 4.4). Considering this finding, the grouping results for the BMA approach will be used in this analysis. The main analysis will be performed on the grouping results using minimum ST and FT since these correspond to the highest accuracies in their respective datasets. Comparisons with results for maximum ST and FT will be made as necessary.
Unless stated otherwise, all groups have been created using the parameters listed in Table 7.1.
Variable               Value
Semantic similarity    Resnik
Functional similarity  BMA
Annotations            all annotations
Ancestor selection     MICA
Semantic threshold     0.28
Functional threshold   0.17
Table 7.1: FuSiGroups parameters for groups analysed in Chapter 7.
7.1 Largest groups and most common aspects
There are two angles from which a closer analysis of grouping results could be started, namely the largest groups or the most common functional aspects represented by the groups. In a smaller dataset, the most common functional aspects should be the most sensible starting point, as they are most likely to reveal immediate information about the groups. In a dataset the size of the Eisen dataset, on the other hand, this does not necessarily hold true, as the most common functional aspects may be too generic to contain any useful information.
In Chapter 6, it was established that the grouping result for the parameters in Table 7.1 consists of 481 groups, 397 of which contain at least 4 gene products (Table 6.2), and that the largest group contains 177 gene products (Table 6.3). Figure 7.1 shows the distribution of group sizes for meaningful groups¹. As they are not going to be considered in the analysis, the smaller group sizes are not included in the histogram in order to keep the histogram’s Y axis readable. The frequencies for groups of size 1, 2 and 3 are 35, 31 and 18, respectively.
Although the study of the distribution of group sizes would seem to be more appropriate in the previous chapter, this information was not considered until now as it was felt inappropriate and overly repetitive to perform this analysis for all sets of thresholds. It is included here in order to provide context for the largest groups, such as what fraction of the full set of groups they represent and how their sizes compare to the majority of the groups. From the histogram, it is clear that the majority of groups (almost 85% of meaningful groups) have sizes in the interval [4, 80], although there are a couple of spikes in the number of groups at size 90 and 103, as well as a scattering of groups of sizes greater than 105 gene products.
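The bookkeeping behind these figures can be sketched in a few lines of Python. This is only an illustration, not part of FuSiGroups: the function names and the toy list of group sizes are hypothetical, while the totals used in the consistency check (481 groups; 35, 31 and 18 groups of size 1, 2 and 3) are the values quoted in the text.

```python
from collections import Counter

MIN_MEANINGFUL_SIZE = 4  # "meaningful groups" contain at least 4 genes

def split_by_size(group_sizes, minimum=MIN_MEANINGFUL_SIZE):
    """Separate group sizes into too-small and meaningful groups."""
    small = [s for s in group_sizes if s < minimum]
    meaningful = [s for s in group_sizes if s >= minimum]
    return small, meaningful

def fraction_in_interval(sizes, low, high):
    """Share of groups whose size lies in the closed interval [low, high]."""
    return sum(low <= s <= high for s in sizes) / len(sizes)

# Consistency check using only numbers quoted in the text.
total_groups = 481
small_group_frequencies = {1: 35, 2: 31, 3: 18}
meaningful_count = total_groups - sum(small_group_frequencies.values())

# Hypothetical toy sizes, used only to exercise the helper functions.
toy_sizes = [1, 2, 3, 4, 4, 10, 80, 90, 177]
small, meaningful = split_by_size(toy_sizes)
histogram = Counter(meaningful)  # size -> number of groups of that size
share = fraction_in_interval(meaningful, 4, 80)
```

The consistency check reproduces the 397 meaningful groups reported in Table 6.2.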
¹ As a reminder, “meaningful groups” have previously been defined as groups which contain at least 4 genes.
[Figure 7.1: histogram; x-axis: group sizes (0 to 180), y-axis: frequency (0 to 26)]
Figure 7.1: Distribution of group sizes for ST28-FT17, for meaningful groups. Groups of smaller size are not included in order to reduce the size of the histogram’s Y axis. The frequencies for groups of size 1, 2 and 3 are 35, 31 and 18, respectively.
Name                                                                  Ontology  No. of groups  Max. size  Avg. size  Min. size
biopolymer modification (GO:0043412)                                  BP        16             58         46.94      34
catabolic process (GO:0009056)                                        BP        15             62         61.00      59
organic acid metabolic process (GO:0006082)                           BP        11             72         62.55      48
cellular localization (GO:0051641)                                    BP        11             104        97.09      64
endomembrane system (GO:0012505)                                      CC        8              80         73.63      69
nucleobase, nucleoside and nucleotide metabolic process (GO:0055086)  BP        7              38         35.29      34
cell cycle (GO:0007049)                                               BP        7              66         63.29      62
DNA metabolic process (GO:0006259)                                    BP        7              89         73.57      55
mitochondrial part (GO:0044429)                                       CC        7              88         75.43      66
response to stress (GO:0006950)                                       BP        6              90         89.67      88
nitrogen compound metabolic process (GO:0006807)                      BP        6              72         71.67      71
carbohydrate metabolic process (GO:0005975)                           BP        6              39         24.17      20
reproduction (GO:0000003)                                             BP        6              49         37.83      25
translation (GO:0006412)                                              BP        6              167        165.67     165
macromolecular complex subunit organization (GO:0043933)              BP        6              65         60.00      43
cytoskeleton (GO:0005856)                                             CC        6              55         53.50      52
negative regulation of biological process (GO:0048519)                BP        5              68         63.60      50
lipid metabolic process (GO:0006629)                                  BP        5              57         50.40      35
Table 7.2: Most common group names for ST28-FT17. Group names occurring a minimum of 5 times are shown, representing a total of 29.31% of all groups (or 35.52% of meaningful groups).
Tables 7.2 and 7.3 show the most common group names, representing the most common functional aspects, and the largest groups, respectively. As a reminder, a group’s name is the lowest common ancestor of all the GO terms in the group definition, where lowest refers to the maximum distance of the term from the root. If more than one ancestor term has the same maximum distance from the root, the first term in the list of equally deep ancestors is used. A comparison of the two tables shows that the only overlap between them is for the group name “translation (GO:0006412)”. All groups with this name are found among the largest groups.
There are no molecular function groups in Table 7.2. This is not surprising as the functions of a set of proteins are likely to be more diverse than the set of processes these proteins are involved in, i.e. a number of different molecular functions make up a single biological process. A set of proteins that are part of the same biological process may therefore be subdivided into several distinctly named groups based on their functions because these functions differ enough for the overall functional similarity to be below the FT. The related BP- and CC-based groups, however, have the same or very similar names, as the proteins are all part of the same process and act in the same cell part.
The explanation for the fact that only three out of the 18 most used group names are cellular component groups is slightly different. Although gene products grouped together because they are functionally similar are highly likely to be found in the same location, there are far fewer CC GO terms than BP terms, both in the number of distinct GO terms and in terms of annotations (see Table 6.5). It therefore follows that the majority of commonly used group names are of type BP rather than CC.
Overall, names in Table 7.2 reflect general cellular processes, such as metabolism (e.g. organic acid metabolic process, nitrogen compound metabolic process etc.) and cell cycle. This is unsurprising considering the nature of the Eisen dataset. The genes in the dataset were selected based on the availability of functional annotations in 1998 [Eisen et al., 1998], not on the basis of any biological properties, and they therefore cover all aspects of the yeast genome. The experimental conditions on which Eisen et al. based their cluster analysis highlight genes involved in the affected processes but the full dataset is entirely unfiltered.
This effect is also observable in Table 7.3, among the largest groups obtained for ST28-FT17. The largest groups cover broad aspects of cell function, such as transcription and translation, and high-level locations such as cytosol and ribosome. All of these concepts cover a large number of genes, thus resulting in the largest groups.
This confirms the earlier assertion that considering the most common group names or largest groups may not be a useful approach for analysing a large dataset. In a smaller dataset, in which the genes may be related to a given theme, e.g. a common pathway or set of pathways, this approach may reveal useful information. In a dataset like the Eisen dataset, on the other hand, a more targeted approach, i.e. an analysis approach with a specific gene or function set in mind, would be more appropriate. This is not necessarily a drawback when approaching the data to address discrete biological hypotheses. Eisen et al. clustered genes from a set of gene expression studies involving diauxic shift, cell cycle, sporulation and temperature shock. Suitable starting points for the analysis of this dataset could therefore be genes of interest in one of these processes or functional aspects of these processes. This option will be addressed below.
An interesting observation is that many of the groups in Table 7.3 have the same name. There are for example four groups with the name “transcription, DNA-dependent (GO:0006351)” (marked with ∗ in Table 7.3), including the two largest groups. Between them, they contain 209 distinct genes, of which 88 (42%) are found in all groups. A total of 32 genes (15%) are unique to one of the groups, while a further 74 genes are found in three out of the four groups. The groups’ definitions also have some overlap, although it is not quite as pronounced. Of 61 distinct GO terms, 14 are present in all definitions, while only 3 terms are unique to one definition.
This level of overlap is even more pronounced in the six groups with the name “translation” (marked with ∓), which have 168 distinct genes between them, 165 of which are found in all six groups. Of the three genes not found in all groups, one is in two groups and two are unique to one group. From a group definition point of view, the situation is slightly different. Only one of 46 distinct GO terms is common to all six group definitions, namely GO:0006412 (translation), and 8 terms are unique to one of the definitions.
This trend of high levels of overlap can be observed in any set of groups with the same group name. More broadly, most of the group names in Table 7.3 fall into two categories, namely transcription-related groups (14 groups) and translation-related groups (13 groups). Three groups (1474, 1478 and 1070) do not fit into either of these categories. All groups in either of the two categories have a considerable overlap in their gene content. Although no gene in the transcription-related groups is present in all 14 groups, 20 genes out of a total of 364 distinct genes are present in 13 groups and 154 genes (42%) are present in at least 8 groups. Only 75 genes are present in just one group.
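Overlap figures of this kind amount to counting, for every gene, the number of groups it belongs to. The following Python sketch shows one way to do this. The group identifiers and gene names are hypothetical toy data, not groups from Table 7.3.

```python
from collections import Counter

# Hypothetical toy groups (NOT taken from the Eisen results).
groups = {
    "g1": {"a", "b", "c", "d"},
    "g2": {"a", "b", "c", "e"},
    "g3": {"a", "b", "f"},
}

def membership_counts(groups):
    """Map each gene to the number of groups that contain it."""
    return Counter(gene for members in groups.values() for gene in members)

counts = membership_counts(groups)

distinct_genes = len(counts)
# Genes present in every group.
in_all_groups = sorted(g for g, n in counts.items() if n == len(groups))
# Genes present in exactly one group.
unique_genes = sorted(g for g, n in counts.items() if n == 1)
# Number of genes found in exactly k groups, for each k.
genes_per_group_count = Counter(counts.values())
# Share of distinct genes found in at least two groups.
share_in_two_or_more = sum(n >= 2 for n in counts.values()) / distinct_genes
```

The same counting applies to group definitions if the sets hold GO terms instead of genes.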
The overlap is even stronger for the translation-related groups. Here, 26 genes out of a total of 288 occur in all 13 groups and 164 genes (57%) are found in 7 or more groups, while only 10 genes are unique to a single group. In the translation category, there are two sub-categories with even stronger overlap:
Group ID  Group name                                         Ontology  Group size
1193 ∗    transcription, DNA-dependent (GO:0006351)          BP        177
1350 ∗    transcription, DNA-dependent (GO:0006351)          BP        177
1367      ribosome (GO:0005840)                              CC        175
1196      structural molecule activity (GO:0005198)          MF        172
1220      regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process (GO:0019219)  BP  169
1357 ±    nucleoplasm (GO:0005654)                           CC        169
1365      regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process (GO:0019219)  BP  168
1391      transcription regulator activity (GO:0030528)      MF        168
1042 ∓    translation (GO:0006412)                           BP        167
1036 ∓    translation (GO:0006412)                           BP        166
1041 ∓    translation (GO:0006412)                           BP        166
1095 ∓    translation (GO:0006412)                           BP        165
1373 ∓    translation (GO:0006412)                           BP        165
1375 ∓    translation (GO:0006412)                           BP        165
1083 ∗    transcription, DNA-dependent (GO:0006351)          BP        161
1460 †    DNA binding (GO:0003677)                           MF        150
1463 †    DNA binding (GO:0003677)                           MF        150
1014 ±    nucleoplasm (GO:0005654)                           CC        149
1073 ±    nucleoplasm (GO:0005654)                           CC        144
1474      transporter activity (GO:0005215)                  MF        143
1232 ±    nucleoplasm (GO:0005654)                           CC        138
1371      cytosol (GO:0005829)                               CC        129
1478 ‡    protein binding (GO:0005515)                       MF        123
1070 ‡    protein binding (GO:0005515)                       MF        121
1085 ∗    transcription, DNA-dependent (GO:0006351)          BP        121
1069 ⋄    RNA processing (GO:0006396)                        BP        118
1142 ⋄    RNA processing (GO:0006396)                        BP        118
1348 ⋄    RNA processing (GO:0006396)                        BP        118
1156      ribonucleoprotein complex biogenesis (GO:0022613)  BP        116
1290      chromosome (GO:0005694)                            CC        116
Table 7.3: Largest groups for ST28-FT17. A cut-off of s ≥ 116 was chosen as there is a clearly visible gap in Figure 7.1 between this and the next-lowest group size. Groups with the same name are marked with a symbol for easier identification.