
Protein structures have been organised into evolutionary families by resources such as CATH (Dawson et al., 2017) and SCOP (Andreeva et al., 2014). Both of these resources classify proteins at the domain level. Protein domains are independently folding functional and evolutionary units of a protein sequence. They are generally observed as local, compact units of structure, with a hydrophobic interior and a hydrophilic exterior, forming a globular-like state that cannot be further subdivided. CATH and SCOP identify structural domain boundaries within protein structures and classify domains into protein families based on their evolutionary origin. However, the two resources differ in how proteins are chopped into domains and how the domains are classified into families; these differences are discussed further below.

CHAPTER 1. INTRODUCTION 47

families. Resources employing these techniques usually identify conserved sequence patterns. The most widely used resource is Pfam (Finn et al., 2016). The newest release of Pfam (release 31) comprises 16,712 protein families, which cover more than 75% of the sequences in UniProt.

1.6.1 CATH

CATH (Orengo et al., 1997) was established by Orengo and Thornton in the mid-1990s. The first release of CATH in 1997 featured a total of 8,000 structural domains classified into 1,000 superfamilies. The newest release of CATH (CATH version 4.1) classifies 308,999 structural domains into 2,737 superfamilies.

CATH is derived using a semi-automated approach followed by expert manual curation. Domain boundaries are inferred in new protein structures by searching for both sequence and structural similarity to domains that have already been classified in CATH. Sequence similarity is obtained by scanning the new chains against a library of CATH superfamily HMMs. Structural similarity is computed using CATHEDRAL (Redfern et al., 2007), a streamlined version of SSAP (Taylor and Orengo, 1989) that is around 1,000 times faster. Both of these similarities are then used to assess whether the new chain is homologous to any domains in CATH. For cases with low similarity, the domain boundaries are identified manually. Manual assignment of domain boundaries is guided by three ab initio domain identification methods: PUU (Holm and Sander, 1994), DETECTIVE (Swindells, 1995) and DOMAK (Siddiqui and Barton, 1995).

The CATH superfamily code is denoted by four numbers, one for each level in the CATH classification (e.g. 3.20.20.120). At the top of the hierarchy is the class level, where structural domains are classified based on their secondary structure content (i.e. the proportion of residues adopting α-helical or β-sheet conformations). Class 1 domains are mostly α-helical, Class 2 domains are mostly β-sheet, and Class 3 domains have a significant amount of both α-helix and β-sheet. Class 4 domains have very little secondary structure content.
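As an illustration of the four-number code, the following minimal sketch splits a superfamily code into its levels and looks up the class description given above. The names `CathCode`, `parse_cath_code` and `describe_class` are hypothetical and are not part of any CATH software; the level names follow the Class, Architecture, Topology and Homologous superfamily hierarchy.

```python
from typing import NamedTuple

# Class descriptions paraphrased from the text above.
CATH_CLASSES = {
    1: "mostly alpha-helical",
    2: "mostly beta-sheet",
    3: "mixed alpha-helical and beta-sheet",
    4: "little secondary structure",
}


class CathCode(NamedTuple):
    """One integer per level of the CATH hierarchy (hypothetical container)."""
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Split a code such as '3.20.20.120' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def describe_class(code: CathCode) -> str:
    """Return the class description for a parsed code."""
    return CATH_CLASSES.get(code.cath_class, "unknown class")


parsed = parse_cath_code("3.20.20.120")
print(parsed, "->", describe_class(parsed))
```

Keeping the levels as separate integers makes it straightforward to group domains at any depth of the hierarchy, for example by comparing only the first three fields to test whether two domains share a topology.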


The second level of the hierarchy is the architecture level, given by the global arrangement of secondary structures in 3D space. This is followed by the topology level, where domains with similar folds (taking into account the 3D arrangement, orientations and connections between the secondary structures) are grouped together. The fourth level is the homologous superfamily level, where domains are deemed homologous if they meet two of the following three criteria:

• Significant sequence similarity (e.g. a significant E-value)

• Significant structural similarity (e.g. SSAP score above 80 with at least 60% overlapping residues)

• Evidence of functional similarity (e.g. conserved catalytic or binding-site residues and cofactors)
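The two-out-of-three rule can be sketched as below. This is only an illustration of the decision logic, not the CATH curation pipeline: the function names are hypothetical, the structural thresholds are the example values quoted above, and sequence and functional evidence are passed in as booleans because the text gives no numeric cut-offs for them.

```python
def structure_is_similar(ssap_score: float, overlap_fraction: float) -> bool:
    """Example structural criterion from the text:
    SSAP score above 80 with at least 60% overlapping residues."""
    return ssap_score > 80 and overlap_fraction >= 0.60


def is_homologous(sequence_similar: bool,
                  structure_similar: bool,
                  function_similar: bool) -> bool:
    """Domains are deemed homologous when at least two criteria hold."""
    return sum([sequence_similar, structure_similar, function_similar]) >= 2


# Weak sequence signal, but strong structural and functional evidence.
print(is_homologous(
    sequence_similar=False,
    structure_similar=structure_is_similar(ssap_score=85.0, overlap_fraction=0.7),
    function_similar=True,
))
```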

1.6.1.1 Gene3D

For each CATH superfamily, HMMs are built for non-redundant representatives (at 95% sequence identity). Based on these HMMs, the CATH-Gene3D resource (Lam et al., 2016) predicts domain superfamily assignments in UniProt sequences (UniProt, 2017) and ENSEMBL sequences (Aken et al., 2016). These annotations provide structural and functional insights into the proteins. The current release of Gene3D covers a total of 54 million sequences belonging to 2,737 CATH superfamilies, from 19,000 genomes.

In addition to the domain assignments, Gene3D also provides protein-protein interaction information, drug information, catalytic site information and mutation information for human proteins. For regions that are not covered by structural domains, Gene3D also predicts whether these regions are intrinsically disordered using IUPRED (Dosztanyi et al., 2005). A total of 60,000 3D structural models were built for over 70% of the structurally uncharacterised sequences in the human and fly genomes using the in-house modelling platform FunMod, described in Chapter 2.


1.6.1.2 Sub-classification of homologous superfamilies into functional families (FunFams)

Once the sequence relatives are assigned to a specific superfamily in CATH, functional families are identified by performing agglomerative clustering of domain relatives with GEMMA (Lee et al., 2010). Relatives are first annotated with Gene Ontology (GO) annotations and then clustered at 90% sequence identity (S90). S90 clusters without GO annotations are removed. In addition, sequence fragments (less than 80% of the average sequence length) are removed from the clusters. For each S90 cluster, a sequence profile is built from its MSA. S90 clusters are then compared using the profile-profile comparison software COMPASS (Sadreyev and Grishin, 2003). At each round, clusters that match a specific threshold are merged and profiles are generated for the new clusters. The process continues until a single cluster remains, producing a hierarchical clustering tree from the leaf nodes to the root.
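The clustering loop described above can be sketched in a few lines. This is a toy illustration and not the GEMMA implementation: average pairwise identity stands in for the profile comparison, only the single best-scoring pair is merged per round (rather than every pair passing a threshold), and all function names are hypothetical.

```python
from itertools import combinations


def remove_fragments(seqs, min_fraction=0.8):
    """Drop sequences shorter than 80% of the average sequence length."""
    mean_len = sum(len(s) for s in seqs) / len(seqs)
    return [s for s in seqs if len(s) >= min_fraction * mean_len]


def identity(a, b):
    """Toy similarity: fraction of matching positions over the shorter sequence."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def cluster_similarity(c1, c2):
    """Average pairwise identity; a stand-in for profile-profile comparison."""
    scores = [identity(a, b) for a in c1 for b in c2]
    return sum(scores) / len(scores)


def agglomerate(seed_clusters):
    """Merge the most similar pair of clusters until one cluster remains.

    Returns the list of merges in order, i.e. the tree from leaves to root.
    """
    clusters = [tuple(c) for c in seed_clusters]
    merges = []
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


seeds = [["AAAA"], ["AAAT"], ["GGGG"]]  # three toy starting clusters
for left, right in agglomerate(seeds):
    print(left, "+", right)
```

Recording the merges rather than only the final cluster preserves the tree, which is what a later step needs in order to decide where to cut it.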

Determining an optimal cut of the hierarchical clustering tree of sequence relatives within a given superfamily is key to producing functionally pure clusters. Initially, GO annotations were used to ensure the functional coherence of each functional family (i.e. clusters were only merged if they contained coherent GO terms) (Rentzsch and Orengo, 2013). However, due to the paucity of GO terms and the annotation biases in GO (Schnoes et al., 2013), a newer approach was developed that exploits sequence patterns and is unaffected by the limitations of GO.

The new FunFHMMer protocol determines the optimal cut of the tree using conserved positions and specificity-determining positions in the cluster alignments (Das et al., 2015). Highly conserved positions are generally important for the stability, folding or function of the protein domain. Specificity-determining positions are positions that are conserved within, and unique to, a particular cluster whose members share a specific function; they are usually involved in functional divergence from other clusters (Abhiman and Sonnhammer, 2005; Rausell et al., 2010).
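The distinction between the two kinds of position can be illustrated with a deliberately strict toy example. It is not the FunFHMMer scoring scheme: a column is called conserved only if every sequence in both clusters carries the same residue, and specificity-determining only if each cluster is invariant but the two clusters differ. The function names and the example alignments are hypothetical.

```python
def column_residues(alignment, col):
    """Set of residues observed in one alignment column."""
    return {seq[col] for seq in alignment}


def classify_positions(cluster_a, cluster_b):
    """Return (conserved, specificity_determining) column indices.

    Both clusters must be aligned to the same length. Columns that
    contain a gap ('-') in either cluster are skipped.
    """
    length = len(cluster_a[0])
    if any(len(seq) != length for seq in cluster_a + cluster_b):
        raise ValueError("all sequences must have the same aligned length")

    conserved, specificity = [], []
    for col in range(length):
        res_a = column_residues(cluster_a, col)
        res_b = column_residues(cluster_b, col)
        if "-" in res_a | res_b:
            continue
        if len(res_a) == 1 and len(res_b) == 1:
            if res_a == res_b:
                conserved.append(col)    # same residue in every sequence
            else:
                specificity.append(col)  # invariant within, different between
    return conserved, specificity


cluster_a = ["MKDA", "MKDG"]
cluster_b = ["MREA", "MREC"]
print(classify_positions(cluster_a, cluster_b))
```

In this example column 0 is conserved across both clusters, columns 1 and 2 separate the clusters, and column 3 is variable. If merging two clusters would turn many cluster-specific columns into variable ones, that suggests the merged cluster mixes different functions.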


by validating against experimentally determined Enzyme Commission (EC) (Webb, 1992) and SFLD (Akiva et al., 2014) annotations and also by checking whether known functional sites coincide with highly conserved residues in the MSAs of FunFams (Das et al., 2015). Functional predictions based on FunFams were ranked amongst the top five methods for the "Molecular Function" category and the "Biological Process" category in the 2nd CAFA International Function Prediction experiment (Jiang et al., 2016).

1.6.2 SCOP

The Structural Classification Of Proteins (SCOP) database (Murzin et al., 1995), like CATH, classifies structural domains hierarchically into three different levels. SCOP is largely based on expert curation, resulting in a very high-quality dataset. The highest level of classification in SCOP is the Class level, which is based on secondary structure content. The structural domains are further classified into fold groups, which are similar to the T-level in CATH. The third level of SCOP is the homologous superfamily level (comparable to the H-level in CATH), which comprises domains that either share at least 30% sequence identity or have very similar structures and functions. SCOP uses tools such as the structural comparison tool DALI (Holm and Sander, 1993), Pfam (Finn et al., 2016) and BLAST (Camacho et al., 2009) to check for homology between domains (Andreeva et al., 2014).

Recently, a new prototype of SCOP, SCOP2 (Andreeva et al., 2014), has been introduced to organise the structural domains. With the expansion of structural data in the PDB, the SCOP team recognised that evolutionary relationships can be more complex than first thought, leading to a new design to capture these relationships. Rather than using a hierarchy, SCOP2 uses a directed acyclic graph to represent the structural and evolutionary relationships of the domains. Structural and evolutionary relationships are now split into two separate categories, allowing homologous proteins to be classified into different folds and structural classes while keeping them in the same evolutionary family and superfamily (Andreeva et al., 2014).
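A small sketch shows why a directed acyclic graph suits this design: a domain can have one parent through an evolutionary edge and a different parent through a structural edge. The class below is a hypothetical illustration and does not reflect the actual SCOP2 data model; the node names and relationship labels are invented for the example.

```python
from collections import defaultdict


class ClassificationGraph:
    """Toy DAG in which each edge carries a relationship type."""

    def __init__(self):
        self._parents = defaultdict(list)  # node -> [(relationship, parent)]

    def add_edge(self, child, parent, relationship):
        """Link child to parent, refusing edges that would create a cycle."""
        if child == parent or child in self.ancestors(parent):
            raise ValueError("edge would create a cycle")
        self._parents[child].append((relationship, parent))

    def ancestors(self, node, relationship=None):
        """All nodes reachable upwards, optionally via one relationship type."""
        found, stack = set(), [node]
        while stack:
            current = stack.pop()
            for rel, parent in self._parents.get(current, []):
                if relationship is not None and rel != relationship:
                    continue
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found


graph = ClassificationGraph()
# Two homologous domains: same superfamily, different folds.
graph.add_edge("domain_1", "superfamily_X", "evolutionary")
graph.add_edge("domain_2", "superfamily_X", "evolutionary")
graph.add_edge("domain_1", "fold_A", "structural")
graph.add_edge("domain_2", "fold_B", "structural")

print(graph.ancestors("domain_1", "evolutionary"))
print(graph.ancestors("domain_1", "structural"))
print(graph.ancestors("domain_2", "structural"))
```

A strict tree would force both domains under a single fold before they could share a superfamily, whereas typed edges allow the two views to diverge.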

SCOPe (Structural Classification of Proteins-extended) (Fox et al., 2013) is a database that extends the SCOP v1.75 database (the newest stable release of SCOP). To keep up with structural growth in the PDB, SCOPe uses automated methods to classify new structures using the original SCOP-based hierarchy. Similar to Gene3D, the SUPERFAMILY (Wilson et al., 2009) resource provides domain superfamily and family assignments for protein sequences based on SCOP.