1.1 Fundamentals of protein structures
1.1.3 Amino acid interactions and sequence evolution
Alterations in the amino acid sequence, arising from random mutations in the protein-coding DNA, will alter the specific amino acid interactions around the sites of change. As long as an alteration is not detrimental to the function of the protein, and thus to the fitness of the organism, the organism, and with it the gene encoding the altered protein, will persist. Over time multiple changes in the sequence will occur, and this can lead to variations in the amino acid sequence of the same protein found in different species and, to a lesser extent, in individuals of the same species. Furthermore, gene duplication within a species allows one copy of a duplicated gene to accumulate changes rapidly in its coding region; mutations that are detrimental to protein function will likely not be detrimental to organism survival as long as one fully functioning copy of the protein is maintained. Such accumulation of substitutions can lead to proteins with novel function. Thus comparison of proteins with a recent common evolutionary ancestor will indicate positions where the amino acid sequences differ, indicating that amino acid substitutions must have taken place in the history of one or more modern sequences compared with their common ancestral sequence.
Sequence identity is used to classify proteins into families, and to deduce common ancestry.
Homology is defined as the presence of similar properties or characteristics between two or more species that are a result of common ancestry. There are two types of evolutionary relatedness that apply to homology: orthology and paralogy. Orthology refers to sequences that are related through a speciation event, while paralogy refers to sequences related through a gene duplication event. Though it is feasible to differentiate between orthologs and paralogs, it is not necessary in the context of this thesis, and the term homology will be used to refer to the evolutionary relatedness of protein sequences.
The definition used by the SCOP database is that protein sequences with at least 30% identity with respect to a reference sequence are classified as belonging to the same family, with exceptions made for sequences that score lower but are known to have structural and functional similarities [18, 19]. In their 1996 paper on the differences between protein structures as a function of sequence identity, Chothia and Lesk reported that sequences with a sequence identity of 40% or more would have similar structures and functions [20]. The discrepancy between the two values of sequence identity has to do with the distinction between structural similarity and functional similarity of proteins. There is a general concept of a “twilight region” at 30%–40% sequence identity, spanning these two reported values, within which a cut-off for protein relatedness lies.
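The thresholds above can be made concrete with a short sketch. It assumes two sequences that have already been aligned to equal length, and it takes percent identity as the number of identical residues divided by the number of aligned, non-gap positions. That denominator is one of several conventions in use and is an assumption here, not necessarily the definition used in this thesis. The function names and the band labels are hypothetical, chosen only for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned positions where neither sequence has a gap.

    Assumption: the denominator is the count of gap-free aligned columns.
    Other conventions (e.g. shorter sequence length) give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def identity_band(identity: float, lower: float = 30.0, upper: float = 40.0) -> str:
    """Place an identity value relative to the 30%-40% band described in the text."""
    if identity >= upper:
        return "above twilight region"
    if identity >= lower:
        return "twilight region"
    return "below twilight region"


if __name__ == "__main__":
    # Toy aligned sequences, invented purely for illustration.
    seq_a = "ACD-EF"
    seq_b = "ACDKEY"
    pid = percent_identity(seq_a, seq_b)
    print(pid, identity_band(pid))
```

In practice the value depends strongly on how the alignment was produced, so a sketch like this is only meaningful once an alignment method has been fixed.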
The variations that can occur between sequences of the same family, in the form of residue substitutions at specific locations in the sequence, are constrained by pressures arising from a variety of sources. An important step in elucidating the way in which protein structures evolve is the identification and characterisation of those pressures and their origins [21]. A study of the substitution behaviour of amino acids in homologous proteins, using hidden Markov models, has shown that the solvation state and the secondary structure environment significantly affect the propensity for substitutions to occur [21]. The solvation state refers to an amino acid residue’s interaction with the solvent environment surrounding the protein. There are several ways in which this can be determined, as discussed later in this chapter, in section 1.3.
The hidden Markov model based study of substitution behaviour [21] did not consider pairwise substitutions within the protein sequence or structure. The localised replacement or substitution of an amino acid at a given sequence position will alter the physical interactions around the substitution site; this is illustrated in Figure 1.1, shown in the next section. As such, a substitution of an amino acid at one position may allow one or more residues at other sites in the structure to undergo a substitution that would otherwise have been deleterious, but which may now provide functional or structural benefit or may compensate for minor instabilities arising from the original substitution. Coordinated changes in amino acid substitution patterns are clearly seen when comparing protein homologues [22], although the exact details of the mechanism of these coordinated changes are not completely clear. For example, suppose two sites form a Lys-Asp salt bridge. If Lys is replaced by Asp, then the original Asp will need to be replaced by either Arg or Lys to maintain the salt bridge. Asp-Asp would be a repulsive interaction and would most likely be disruptive, at the very least locally, if not to the entire structure and function of the protein. This correlated substitution behaviour is most commonly referred to in the literature as correlated mutations, though it will be referred to as co-substitution in this thesis, as this term more accurately describes the process.
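The salt-bridge example can be sketched in a few lines of code. This is a deliberately simplified illustration, not the co-substitution analysis developed in this thesis: it assigns each residue only a formal side-chain charge (Asp and Glu negative, Lys and Arg positive, everything else, including His, treated as neutral, which is an assumption) and ignores geometry, solvation and local context. The names `CHARGE`, `pair_interaction` and `compensating_substitutions` are hypothetical.

```python
# Formal side-chain charges under a simplifying assumption:
# His and all other residues are treated as neutral.
CHARGE = {"D": -1, "E": -1, "K": +1, "R": +1}

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pair_interaction(res_i: str, res_j: str) -> str:
    """Classify a residue pair by the sign of the product of formal charges."""
    product = CHARGE.get(res_i, 0) * CHARGE.get(res_j, 0)
    if product < 0:
        return "attractive"  # opposite charges, e.g. Lys-Asp
    if product > 0:
        return "repulsive"   # like charges, e.g. Asp-Asp
    return "neutral"         # at least one uncharged residue


def compensating_substitutions(new_res_i: str) -> list[str]:
    """Residues at the partner site that would restore an attractive pair."""
    return [r for r in AMINO_ACIDS if pair_interaction(new_res_i, r) == "attractive"]


if __name__ == "__main__":
    print(pair_interaction("K", "D"))       # original Lys-Asp pair
    print(pair_interaction("D", "D"))       # after Lys -> Asp at the first site
    print(compensating_substitutions("D"))  # partners that restore the salt bridge
```

Under these assumptions, replacing Lys with Asp turns the pair from attractive to repulsive, and only Lys or Arg at the partner site restores an attractive pair, matching the example in the text.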
The next two sections are reviews. Firstly, a review of co-evolution/correlated-mutation/co-substitution analysis methods in the literature is given. This is followed by a review of methods for determining the solvation state of residues. As mentioned earlier, the context of substitutions in amino acid sequences has an effect on the propensity for the substitution to occur. For this reason amino acid context is explicitly considered in the co-substitution analysis developed in this thesis and requires some introduction.