1.7 The prediction of protein function
2.1.1 Identifying functional site residues
There are a number of manually curated resources providing information on functionally important residues in proteins, for example the Catalytic Site Atlas (CSA) and the NCBI Inferred Biomolecular Interactions Server (IBIS). These resources report experimentally determined key functional residues involved in catalytic reactions and interaction binding interfaces, respectively. However, not all known protein sequences have experimentally determined functional residue annotations. A number of algorithms have therefore been developed to predict functional residues.
Detecting functional residues within a superfamily is a difficult task. Many approaches aim to identify conserved residue sites in a multiple sequence alignment (MSA) on the assumption that functionally important sites are likely to be conserved. However, the MSA used to detect these residues must contain relatives that are functionally similar. Some superfamilies contain a large number of relatives that are structurally and functionally diverse. In these cases, subclassification into functional subfamilies can be performed to produce good-quality sequence alignments.
CHAPTER 2. IDENTIFICATION AND CHARACTERISATION OF
FUNCTIONAL DIVERSITY IN PROTEIN DOMAIN SUPERFAMILIES 53
However, the majority (∼96%) of superfamilies in the CATH database are small and, since their members often carry out the same general biological function, the amino acid residues responsible for this function are expected to be conserved throughout evolution, i.e. they are conserved across all sequences in a superfamily alignment. For example, the members of the serine protease superfamily are well known for the preservation of their catalytic triad in the active site, where three residue positions are completely conserved across all members (Drenth et al., 1972; Kraut, 1977).
If one can identify the conserved residue positions in a protein superfamily sequence alignment, then it is likely that these positions play a structural and/or functional role in the protein. Positions with a structural role may, for example, be important in maintaining the stability of the protein or in folding. As already mentioned, positions with a functional role may be catalytic residues, or may be involved in interface interactions such as ligand binding, protein-protein interactions, and protein-DNA binding.
2.1.1.1 Identifying conserved residues
Many different types of algorithms have been developed to study the evolution of homologous protein sequences and identify conserved sites, ranging from simple majority fraction (Wu and Kabat, 1970), to entropy or mutual information-based algorithms (Shannon and Weaver, 1949; Wu and Kabat, 1970; Mihalek et al., 2004), and statistical estimations of residue mutability with methods such as Rate4Site (Pupko et al., 2002). By scoring the conservation of each alignment position, these methods can be used to identify residues that are preserved throughout evolution, for example residues in enzyme active sites.
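The two simplest scores mentioned above, majority fraction and Shannon entropy, can be sketched per alignment column as follows. This is a minimal illustration rather than any published implementation: the function name `column_scores`, the toy alignment, the decision to ignore gap characters, and the normalisation of entropy by log2(20) are all assumptions made for the example.

```python
import math
from collections import Counter


def column_scores(alignment):
    """Return a (majority_fraction, entropy_conservation) pair per column.

    `alignment` is a list of equal-length aligned sequences. Gaps ('-') are
    ignored when counting residues (an assumption of this sketch).
    """
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for col in range(n_cols):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            # A column made up entirely of gaps carries no information.
            scores.append((0.0, 0.0))
            continue
        counts = Counter(residues)
        total = len(residues)
        # Majority fraction: frequency of the most common residue.
        majority = counts.most_common(1)[0][1] / total
        # Shannon entropy in bits, rescaled so that 1 means fully conserved.
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        conservation = 1.0 - entropy / math.log2(20)
        scores.append((majority, conservation))
    return scores


# Hypothetical four-sequence alignment used only for illustration.
toy_msa = ["HDSG", "HDSA", "HDSG", "HESV"]
scores = column_scores(toy_msa)
for position, (majority, conservation) in enumerate(scores, start=1):
    print(position, round(majority, 2), round(conservation, 2))
```

Both scores treat every substitution as equally severe, which is the limitation that the property-based and matrix-based methods described next try to address.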
Methods based on physicochemical properties Livingstone and Barton (1993) developed the program Analysis of Multiply Aligned Sequences (AMAS), which calculates a conservation score based on the physicochemical properties observed at each sequence position of an MSA. At each MSA position in a pre-defined subfamily, the general physicochemical properties of each amino acid are defined according to Taylor (1986) and Zvelebil et al. (1987). Two different methods are subsequently applied. The first (similar to Zvelebil et al. (1987)) scores a property at an alignment position as ‘positively’ conserved if all of the amino acids share it and ‘negatively’ conserved if all of the amino acids lack it. The second method within AMAS only counts the properties that all of the amino acids at a position share, i.e. positive conservation (Livingstone and Barton, 1993).
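Property-based scoring of a single alignment column can be sketched as below. The sketch assumes that a property contributes to the score when every residue in the column has it (positive conservation) and, optionally, when every residue lacks it (negative conservation). The `PROPERTIES` table is a hypothetical, deliberately simplified assignment and not the published Taylor (1986) or Zvelebil et al. (1987) sets; the function name and the `count_negative` switch are likewise illustrative and are not taken from AMAS itself.

```python
# Hypothetical, simplified property table (NOT the published Taylor/Zvelebil
# assignments): each property maps to the set of residues that possess it.
PROPERTIES = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYHDEKR"),
    "small": set("AGSTCPDNV"),
    "charged": set("DEKRH"),
    "aromatic": set("FWYH"),
}


def property_conservation(column, count_negative=True):
    """Count the properties that are conserved in one alignment column.

    A property is positively conserved if every residue has it, and negatively
    conserved if no residue has it. With count_negative=False only positively
    conserved properties are counted. Gaps ('-') are skipped.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0
    score = 0
    for members in PROPERTIES.values():
        has_property = [r in members for r in residues]
        if all(has_property):
            score += 1  # shared by all residues
        elif count_negative and not any(has_property):
            score += 1  # absent from all residues
    return score


# Usage on two toy columns: one acidic, one aliphatic.
print(property_conservation("DDE"), property_conservation("DDE", count_negative=False))
print(property_conservation("ILV"), property_conservation("ILV", count_negative=False))
```

Counting absent properties rewards columns whose residues are alike in what they are not, so the two variants can rank the same columns differently.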
Methods based on entropy Scorecons is another approach, developed by Valdar (2002) to quantify the conservation of each residue position in a protein sequence alignment. Each position is assigned an evolutionary conservation score between 0 and 1, where 0 indicates no conservation at that position and 1 indicates that the residue at that position is completely conserved. Amino acid diversity at each position is calculated using amino acid similarity information from a Dayhoff-like mutation data matrix (Jones et al., 1992). The overall score sums the contributions from each individual sequence, and sequences are weighted inversely with their redundancy in the alignment (Valdar, 2002). This method was reviewed along with 13 other methods in Manning et al. (2008) and was reported to rank consistently in third place.
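The idea of summing similarity contributions while down-weighting redundant sequences can be illustrated with a weighted sum-of-pairs score. This is a sketch under stated assumptions and not the Scorecons implementation: `toy_similarity` is a hypothetical identity-based stand-in for a normalised Dayhoff-like matrix, and `sequence_weights` uses the mean fraction of mismatched positions against all other sequences as one simple way to give redundant sequences less weight.

```python
from itertools import combinations


def toy_similarity(a, b):
    """Hypothetical stand-in for a normalised substitution matrix (0 to 1)."""
    return 1.0 if a == b else 0.0


def sequence_weights(alignment):
    """Weight each sequence by its mean mismatch fraction to all the others,
    so that near-duplicate sequences receive low weights."""
    weights = []
    for i, seq_i in enumerate(alignment):
        distances = [
            sum(a != b for a, b in zip(seq_i, seq_j)) / len(seq_i)
            for j, seq_j in enumerate(alignment)
            if j != i
        ]
        weights.append(sum(distances) / len(distances))
    return weights


def weighted_sum_of_pairs(alignment, col, weights, similarity=toy_similarity):
    """Conservation of column `col` in [0, 1]: weighted mean pairwise similarity."""
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        pair_weight = weights[i] * weights[j]
        numerator += pair_weight * similarity(alignment[i][col], alignment[j][col])
        denominator += pair_weight
    # If every weight is zero (all sequences identical) no score is defined;
    # this sketch returns 0.0 in that case.
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical alignment in which the first two sequences are duplicates.
toy_alignment = ["ACDE", "ACDE", "ACEG", "AGFK"]
weights = sequence_weights(toy_alignment)
column_conservation = [
    weighted_sum_of_pairs(toy_alignment, col, weights)
    for col in range(len(toy_alignment[0]))
]
print([round(w, 3) for w in weights])
print([round(s, 3) for s in column_conservation])
```

In the toy alignment the only agreement in the last column comes from the duplicated pair, so the weighted score falls below the unweighted value of 1/6. Replacing `toy_similarity` with a real substitution matrix rescaled to the range 0 to 1 would also let conservative substitutions contribute partial credit.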
Phylogenetic tree-based approaches In addition to Lichtarge et al. (1996), a number of other groups also use a phylogenetic tree-based approach. ConSurf (Glaser et al., 2003; Ashkenazy et al., 2010; Celniker et al., 2013) calculates the evolutionary rate of each position in an MSA to determine which positions are highly variable throughout evolution (i.e. not conserved), and which evolve slowly and are therefore highly conserved. The evolutionary rate is calculated through an empirical Bayesian method or a maximum likelihood method, and the phylogenetic tree is used to display the evolutionary relationships between MSA sequences. Conservation scores are mapped onto the 3D structure of a family member to display any clusters of highly conserved residues on the protein surface, which are inferred to be functionally important residues. The MINER algorithm (La and Livesay, 2005) uses evolutionary information from an MSA to calculate the phylogenetic similarity between a local (sliding window) region of the MSA and the whole alignment. From this, phylogenetic motifs (PMs) are identified, which are shown to structurally cluster around key functional residues.