
6.3.5 Definition size vs term depth

A final point that needs to be considered in relation to group definition size is whether there is any relationship between the number of terms in a definition and the depth of these terms. A strong correlation (positive or negative) between definition size and term depth would suggest bias in the data, with deep terms occurring primarily in either small definitions (negative correlation) or in large definitions (positive correlation).

| ST | FT | Number of GO terms (all groups) | Number of GO terms (meaningful groups) |
| 0.28 | 0.17 | 3067 | 2975 |
| 0.28 | 0.25 | 3067 | 2953 |
| 0.40 | 0.17 | 2976 | 2779 |
| 0.40 | 0.25 | 2976 | 2737 |
| 0.28 | 0.26 | 3067 | 2982 |
| 0.28 | 0.41 | 3067 | 2982 |
| 0.40 | 0.26 | 2976 | 2791 |
| 0.40 | 0.41 | 2976 | 2781 |
| 0.93 | 0.42 | 3067 | 1326 |
| 0.93 | 0.58 | 3067 | 1083 |
| 0.95 | 0.42 | 3058 | 1221 |
| 0.95 | 0.58 | 3058 | 990 |
| 0.93 | 0.66 | 3067 | 1363 |
| 0.93 | 0.88 | 3067 | 1044 |
| 0.95 | 0.66 | 3058 | 1263 |
| 0.95 | 0.88 | 3058 | 947 |

Table 6.12: Number of GO terms used at least once in the group definitions for all groups and for meaningful groups. The total number of GO terms in the annotation of the Eisen dataset is 3101. The total number of distinct GO terms in group definitions is less than 3101 because there are always a few GO terms whose similarity with themselves is less than the ST for that grouping.

| ST | r |
| 0.28 | 0.21 |
| 0.40 | 0.28 |
| 0.93 | 0.06 |
| 0.95 | 0.03 |

Table 6.13: Correlation coefficients (r) for definition size vs. average depth of GO terms in the definition, for each semantic threshold. The coefficients were calculated using Pearson correlation and R’s cor() function. Correlation coefficients are independent of FT as the group definitions only depend on ST.

Table 6.13 shows the correlation coefficients (calculated as before in R, using Pearson’s correlation) for definition size and average depth of the terms in the definition. The average depth was calculated using the maximum depth for each term (longest path between the term and the ontology root) in the definition. As a reminder, each term in the GO can be related to the root via multiple paths, which may traverse different numbers of nodes. Unless otherwise stated, any reference to “distance from root” of a GO term is to the maximum distance. For all the terms annotated to the Eisen dataset, the difference between minimum and maximum distance to the root ranges from 0 (about 39% of terms) to 11 (one case). The average difference lies at 1.98.
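The distinction between minimum and maximum distance to the root can be illustrated with a small sketch. The toy ontology below is hypothetical and is not taken from the GO; distance is counted in edges, which is an assumption, as is the helper name average_max_depth.

```python
from functools import lru_cache

# Hypothetical toy ontology (child -> parents), a DAG with a single root.
# Term "D" can reach the root via root-A-C-D (3 edges) or root-B-D (2 edges).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["C", "B"],
}

@lru_cache(maxsize=None)
def max_depth(term):
    """Longest path (in edges) between the term and the root."""
    parents = PARENTS[term]
    if not parents:
        return 0
    return 1 + max(max_depth(p) for p in parents)

@lru_cache(maxsize=None)
def min_depth(term):
    """Shortest path (in edges) between the term and the root."""
    parents = PARENTS[term]
    if not parents:
        return 0
    return 1 + min(min_depth(p) for p in parents)

def average_max_depth(definition):
    """Average of the maximum depths of all terms in a group definition."""
    return sum(max_depth(t) for t in definition) / len(definition)
```

For term "D" in this toy ontology, the difference between maximum and minimum distance is 1, while for "A", "B" and "C" it is 0.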

Based on the coefficients found for each ST, there is no obvious correlation between definition size and term depth and therefore no bias in the way the GO terms are grouped into definitions. In conjunction with the correlation coefficients found for group size vs. definition size (Table 6.10), the conclusion is that the underlying structure of the GO does not directly influence the semantic similarity between GO terms, the resulting functional similarity between gene products and the groups which are created using the two types of measures.
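The coefficients discussed above are Pearson correlations between definition size and average term depth. As a minimal sketch of that calculation, the function below computes Pearson's r from two equally long lists; this is the same statistic that R's cor() returns by default. The function name and the input values in the usage note are illustrative assumptions, not data from the Eisen dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

With xs holding the number of terms per definition and ys the corresponding average maximum depths, a value of r close to 0 indicates that deep terms are not concentrated in either small or large definitions.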

6.4 Summary

In this chapter, a number of overall trends of grouping results for different thresholds were considered. One recurring feature that stands out is the effect of Schlicker’s semantic thresholds on the grouping results. The high thresholds and small difference between minimum and maximum ST led to a much larger number of groups than those obtained using Resnik’s measure, much smaller average group sizes, a much greater proportion of groups of insufficient size and very small group definitions.


All of these elements bring into question the suitability of these semantic thresholds. As briefly discussed in Section 6.1, lowering the minimum ST for Schlicker was considered but initial tests showed little promise of improvement.

Although Schlicker’s measure objectively addresses a drawback in Resnik’s measure, it performed less well than the older measure in the present context. While using a different true-positive/true-negative dataset might have generated a different and better set of thresholds, this was not feasible within the scope of this project as it would have required a lot of time and very high levels of expert knowledge of all areas of molecular biology covered by the GO. The analysis of grouping results from here will therefore be limited to Resnik’s approach.

While this chapter focussed on the groups generated by the FuSiGroups algorithm from a very high-level perspective, the next two chapters will focus on the actual content and definitions of some of the groups. In addition to the full Eisen dataset, subsets of the data will be analysed in detail.

Chapter 7 The complete Eisen dataset

Until now, the results generated by the FuSiGroups algorithm have only been analysed in very general terms such as the number of groups obtained for a given set of thresholds, group sizes and number of genes grouped. In this and the next chapter, the definition and content of groups will be analysed in order to determine whether the FuSiGroups algorithm does indeed meet its target functionality of grouping together gene products based on meaningful functional relationships, providing an objective view of such complex biological data containing valuable novel insights. Chapter 7 focusses on the full Eisen dataset, while Chapter 8 will provide detailed investigations into several smaller, less noisy datasets to address a number of specific questions.

At the end of Chapter 6, it was concluded that the semantic thresholds determined for Schlicker’s approach are less suitable for use with FuSiGroups than those determined for Resnik. For this reason, all analysis in this chapter uses groups based on Resnik’s semantic similarity approach.

In addition, in order to avoid repetition, it would be helpful to select only one combination of the ST and FT parameters on which to perform a more detailed analysis of groupings. It was shown previously that for Resnik, the BMA functional similarity approach performs better than the MAX functional similarity approach (Section 4.1, Figure 4.4). Considering this finding, the grouping results for the BMA approach will be used in this analysis. The main analysis will be performed on the grouping results using minimum ST and FT since these correspond to the highest accuracies in their respective datasets. Comparisons with results for maximum ST and FT will be made as necessary.
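The difference between the MAX and BMA functional similarity approaches can be sketched as follows. This is a common formulation and only an assumption about the variant used here: MAX takes the single best term-to-term similarity, while BMA (best-match average) averages the best match of every term in both directions. The matrix layout (rows for the terms of one gene product, columns for the terms of the other) and the function names are illustrative.

```python
def max_similarity(sim):
    """MAX: highest semantic similarity between any pair of terms."""
    return max(max(row) for row in sim)

def bma_similarity(sim):
    """BMA: mean of the average row-wise and column-wise best matches."""
    row_best = [max(row) for row in sim]        # best match for each term of product 1
    col_best = [max(col) for col in zip(*sim)]  # best match for each term of product 2
    row_avg = sum(row_best) / len(row_best)
    col_avg = sum(col_best) / len(col_best)
    return (row_avg + col_avg) / 2

# Hypothetical similarity matrix for two gene products with two terms each.
example = [
    [0.9, 0.1],
    [0.2, 0.3],
]
```

For the example matrix, MAX is driven entirely by the single well-matched pair of terms, whereas BMA is pulled down by the poorly matched second term of each gene product.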

Unless stated otherwise, all groups have been created using the parameters listed in Table 7.1.

| Variable | Value |
| Semantic similarity | Resnik |
| Functional similarity | BMA |
| Annotations | all annotations |
| Ancestor selection | MICA |
| Semantic threshold | 0.28 |
| Functional threshold | 0.17 |

Table 7.1: FuSiGroups parameters for groups analysed in Chapter 7.

7.1 Largest groups and most common aspects

There are two angles from which a closer analysis of grouping results could be started, namely the largest groups or the most common functional aspects represented by the groups. In a smaller dataset, the most common functional aspects should be the most sensible starting point, as they are most likely to reveal immediate information about the groups. In a dataset the size of the Eisen dataset on the other hand, this does not necessarily hold true, as the most common functional aspects may be too generic to contain any useful information.

In Chapter 6, it was established that the grouping result for the parameters in Table 7.1 consists of 481 groups, 397 of which contain at least 4 gene products (Table 6.2) and that the largest group contains 177 gene products (Table 6.3). Figure 7.1 shows the distribution of group sizes for meaningful groups¹. As they are not going to be considered in the analysis, the smaller group sizes are not included in the histogram in order to keep the histogram’s Y axis readable. The frequencies for groups of size 1, 2 and 3 are 35, 31 and 18, respectively.

Although the study of the distribution of group sizes would seem to be more appropriate in the previous chapter, this information was not considered until now as it was felt inappropriate and overly repetitive to perform this analysis for all sets of thresholds. It is included here in order to provide context for the largest groups, such as what fraction of the full set of groups they represent and how their sizes compare to the majority of the groups. From the histogram, it is clear that the majority of groups (almost 85% of meaningful groups) have sizes in the interval [4, 80], although there are a couple of spikes in the number of groups at size 90 and 103, as well as a scattering of groups of sizes greater than 105 gene products.
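The kind of summary read off the histogram above can be sketched in a few lines. The threshold of 4 gene products for meaningful groups comes from the text; the function names are hypothetical and any list of sizes passed in is the caller's own data.

```python
from collections import Counter

MIN_MEANINGFUL_SIZE = 4  # groups with fewer gene products are not "meaningful"

def size_distribution(group_sizes):
    """Frequency of each group size, i.e. the data behind the histogram."""
    return Counter(group_sizes)

def fraction_in_interval(group_sizes, low, high):
    """Fraction of meaningful groups whose size lies in [low, high]."""
    meaningful = [s for s in group_sizes if s >= MIN_MEANINGFUL_SIZE]
    if not meaningful:
        raise ValueError("no meaningful groups in input")
    inside = sum(1 for s in meaningful if low <= s <= high)
    return inside / len(meaningful)
```

Calling fraction_in_interval(sizes, 4, 80) on the full list of group sizes would give the proportion of meaningful groups in the interval [4, 80].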

¹ As a reminder, “meaningful groups” have previously been defined as groups which contain at least 4 genes.

[Figure 7.1: histogram; x-axis: group sizes (0 to 180), y-axis: frequency (0 to 26)]

Figure 7.1: Distribution of group sizes for ST28-FT17, for meaningful groups. Groups of smaller size are not included in order to reduce the size of the histogram’s Y axis. The frequencies for groups of size 1, 2 and 3 are 35, 31 and 18, respectively.

| Name | Ontology | No. of groups | Group size: maximum | Group size: average | Group size: minimum |
| biopolymer modification (GO:0043412) | BP | 16 | 58 | 46.94 | 34 |
| catabolic process (GO:0009056) | BP | 15 | 62 | 61.00 | 59 |
| organic acid metabolic process (GO:0006082) | BP | 11 | 72 | 62.55 | 48 |
| cellular localization (GO:0051641) | BP | 11 | 104 | 97.09 | 64 |
| endomembrane system (GO:0012505) | CC | 8 | 80 | 73.63 | 69 |
| nucleobase, nucleoside and nucleotide metabolic process (GO:0055086) | BP | 7 | 38 | 35.29 | 34 |
| cell cycle (GO:0007049) | BP | 7 | 66 | 63.29 | 62 |
| DNA metabolic process (GO:0006259) | BP | 7 | 89 | 73.57 | 55 |
| mitochondrial part (GO:0044429) | CC | 7 | 88 | 75.43 | 66 |
| response to stress (GO:0006950) | BP | 6 | 90 | 89.67 | 88 |
| nitrogen compound metabolic process (GO:0006807) | BP | 6 | 72 | 71.67 | 71 |
| carbohydrate metabolic process (GO:0005975) | BP | 6 | 39 | 24.17 | 20 |
| reproduction (GO:0000003) | BP | 6 | 49 | 37.83 | 25 |
| translation (GO:0006412) | BP | 6 | 167 | 165.67 | 165 |
| macromolecular complex subunit organization (GO:0043933) | BP | 6 | 65 | 60.00 | 43 |
| cytoskeleton (GO:0005856) | CC | 6 | 55 | 53.50 | 52 |
| negative regulation of biological process (GO:0048519) | BP | 5 | 68 | 63.60 | 50 |
| lipid metabolic process (GO:0006629) | BP | 5 | 57 | 50.40 | 35 |

Table 7.2: Most common group names for ST28-FT17. Group names occurring a minimum of 5 times are shown, representing a total of 29.31% of all groups (or 35.52% of meaningful groups).

Tables 7.2 and 7.3 show the most common group names, representing the most common functional aspects, and the largest groups, respectively. As a reminder, a group’s name is the lowest common ancestor of all the GO terms in the group definition, where lowest refers to the maximum distance of the term from the root. If more than one ancestor term has the same maximum distance from the root, the first term in the list of equally deep ancestors is used. A comparison of the two tables shows that the only overlap between them is for the group name “translation (GO:0006412)”. All groups with this name are found among the largest groups.
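The naming rule described above can be sketched as follows. The sketch assumes that the common ancestors of a group definition are already available as an ordered list and that the maximum depth of each term is known; how FuSiGroups orders that list is not stated here, so the input order is taken as given. The term identifiers in the usage note are placeholders.

```python
def group_name(common_ancestors, max_depth):
    """Pick the deepest common ancestor; ties go to the first one listed.

    common_ancestors: ordered list of terms that are ancestors of every
                      GO term in the group definition (assumed input).
    max_depth:        mapping from term to its maximum distance from the root.
    """
    if not common_ancestors:
        raise ValueError("a group definition needs at least one common ancestor")
    deepest = max(max_depth[term] for term in common_ancestors)
    # Scan in list order so the first of several equally deep terms is used.
    for term in common_ancestors:
        if max_depth[term] == deepest:
            return term
```

For example, with common ancestors ["T1", "T2", "T3"] at maximum depths 1, 3 and 3, the function returns "T2", because it is the first of the two deepest terms.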

There are no molecular function groups in Table 7.2. This is not surprising as the functions of a set of proteins are likely to be more diverse than the set of processes these proteins are involved in, i.e. a number of different molecular functions make up a single biological process. A set of proteins that are part of the same biological process may therefore be subdivided into several distinctly named groups based on their functions because these functions differ enough for the overall functional similarity to be below the FT. The related BP- and CC-based groups however have the same or very similar names as the proteins are all part of the same process and act in the same cell part.

The explanation for the fact that only three out of the 18 most used group names are cellular component groups is slightly different. Although gene products grouped together because they are functionally similar are highly likely to be found in the same location, there are far fewer CC GO terms than BP terms, both in the number of distinct GO terms and in terms of annotations (see Table 6.5). It therefore follows that the majority of commonly used group names are of type BP rather than CC.

Overall, names in Table 7.2 reflect general cellular processes, such as metabolism (e.g. organic acid metabolic process, nitrogen compound metabolic process) and cell cycle. This is unsurprising considering the nature of the Eisen dataset. The genes in the dataset were selected based on the availability of functional annotations in 1998 [Eisen et al., 1998], not on the basis of any biological properties, and they therefore cover all aspects of the yeast genome. The experimental conditions on which Eisen et al. based their cluster analysis highlight genes involved in the affected processes but the full dataset is entirely unfiltered.

This effect is also observable in Table 7.3, among the largest groups obtained for ST28-FT17. The largest groups cover broad aspects of cell function, such as transcription and translation, and high-level locations such as cytosol and ribosome. All of these concepts cover a large number of genes, thus resulting in the largest groups.

This confirms the earlier assertion that considering the most common group names or largest groups may not be a useful approach for analysing a large dataset. In a smaller dataset, in which the genes may be related to a given theme, e.g. a common pathway or set of pathways, this approach may reveal useful information. In a dataset like the Eisen dataset on the other hand, a more targeted approach, i.e. an analysis approach with a specific gene or function set in mind, would be more appropriate. This is not necessarily a drawback when approaching the data to address discrete biological hypotheses. Eisen et al. clustered genes from a set of gene expression studies involving diauxic shift, cell cycle, sporulation and temperature shock. Suitable starting points for the analysis of this dataset could therefore be genes of interest in one of these processes or functional aspects of these processes. This option will be addressed below.

An interesting observation is that many of the groups in Table 7.3 have the same name. There are, for example, four groups with the name “transcription, DNA-dependent (GO:0006351)” (marked with ∗ in Table 7.3), including the two largest groups. Between them, they contain 209 distinct genes, of which 88 (42%) are found in all four groups. Thirty-two genes (15%) are unique to one of the groups, while a further 74 genes are found in three out of the four groups. The groups’ definitions also have some overlap, although it is not quite as pronounced: 14 out of 61 distinct GO terms are present in all definitions, while only 3 terms are unique to one definition.

This level of overlap is even more pronounced in the six groups with the name “translation” (marked with ∓), which have 168 distinct genes between them, 165 of which are contained in all six groups. Of the three genes not found in all groups, one is in two groups and two are unique to one group. From a group definition point of view, the situation is slightly different. Only one of 46 distinct GO terms is common to all six group definitions, namely GO:0006412 (translation), and 8 terms are unique to one of the definitions.

This trend of high levels of overlap can be observed in any set of groups with the same group name. More broadly, most of the group names in Table 7.3 fall into two categories, namely transcription-related groups (14 groups) and translation-related groups (13 groups). Three groups (1474, 1478 and 1070) do not fit into either of these categories. All groups in either of the two categories have a considerable overlap in their gene content. Although no gene in the transcription-related groups is present in all 14 groups, 20 genes out of a total of 364 distinct genes are present in 13 groups and 154 genes (42%) are present in at least 8 groups. Only 75 genes are present in just one group.
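Overlap figures of the kind quoted above (genes present in all groups, genes unique to one group, genes present in at least k groups) can be derived from a simple membership count. The sketch below assumes each group is available as a set of gene identifiers; the function names and the toy groups in the usage note are hypothetical.

```python
from collections import Counter

def membership_counts(groups):
    """Map each gene to the number of groups that contain it.

    groups: mapping from group ID to a set of gene identifiers.
    """
    counts = Counter()
    for genes in groups.values():
        counts.update(set(genes))  # each group counts a gene at most once
    return counts

def overlap_summary(groups, at_least=None):
    """Summarise gene overlap across a set of groups."""
    counts = membership_counts(groups)
    n_groups = len(groups)
    summary = {
        "distinct": len(counts),
        "in_all": sum(1 for c in counts.values() if c == n_groups),
        "unique": sum(1 for c in counts.values() if c == 1),
    }
    if at_least is not None:
        summary["at_least"] = sum(1 for c in counts.values() if c >= at_least)
    return summary
```

For three toy groups {"a", "b", "c"}, {"a", "b", "d"} and {"a", "e"}, the summary reports 5 distinct genes, 1 gene present in all groups and 3 genes unique to a single group. The same counting applies to GO terms in group definitions.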

The overlap is even stronger for the translation-related groups. Here, 26 genes out of a total of 288 occur in all 13 groups, while 164 genes (57%) are found in 7 or more groups, whereas there are only 10 genes that are unique to a single group. In the translation category, there are two sub-categories with even stronger overlap:

| Group ID | Group name | Ontology | Group size |
| 1193 ∗ | transcription, DNA-dependent (GO:0006351) | BP | 177 |
| 1350 ∗ | transcription, DNA-dependent (GO:0006351) | BP | 177 |
| 1367 | ribosome (GO:0005840) | CC | 175 |
| 1196 | structural molecule activity (GO:0005198) | MF | 172 |
| 1220 | regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process (GO:0019219) | BP | 169 |
| 1357 ± | nucleoplasm (GO:0005654) | CC | 169 |
| 1365 | regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process (GO:0019219) | BP | 168 |
| 1391 | transcription regulator activity (GO:0030528) | MF | 168 |
| 1042 ∓ | translation (GO:0006412) | BP | 167 |
| 1036 ∓ | translation (GO:0006412) | BP | 166 |
| 1041 ∓ | translation (GO:0006412) | BP | 166 |
| 1095 ∓ | translation (GO:0006412) | BP | 165 |
| 1373 ∓ | translation (GO:0006412) | BP | 165 |
| 1375 ∓ | translation (GO:0006412) | BP | 165 |
| 1083 ∗ | transcription, DNA-dependent (GO:0006351) | BP | 161 |
| 1460 † | DNA binding (GO:0003677) | MF | 150 |
| 1463 † | DNA binding (GO:0003677) | MF | 150 |
| 1014 ± | nucleoplasm (GO:0005654) | CC | 149 |
| 1073 ± | nucleoplasm (GO:0005654) | CC | 144 |
| 1474 | transporter activity (GO:0005215) | MF | 143 |
| 1232 ± | nucleoplasm (GO:0005654) | CC | 138 |
| 1371 | cytosol (GO:0005829) | CC | 129 |
| 1478 ‡ | protein binding (GO:0005515) | MF | 123 |
| 1070 ‡ | protein binding (GO:0005515) | MF | 121 |
| 1085 ∗ | transcription, DNA-dependent (GO:0006351) | BP | 121 |
| 1069 ⋄ | RNA processing (GO:0006396) | BP | 118 |
| 1142 ⋄ | RNA processing (GO:0006396) | BP | 118 |
| 1348 ⋄ | RNA processing (GO:0006396) | BP | 118 |
| 1156 | ribonucleoprotein complex biogenesis (GO:0022613) | BP | 116 |
| 1290 | chromosome (GO:0005694) | CC | 116 |

Table 7.3: Largest groups for ST28-FT17. A cut-off of s ≥ 116 was chosen as there is a clearly visible gap in Figure 7.1 between this and the next-lowest group size. Groups with the same name are marked with a symbol for easier identification.

In document Investigating “Gene Ontology”- based semantic similarity in the context of functional genomics (Page 145-156)