Coevolution analysis of synthetic alignment reveals gap effects


4.4.1 Coevolution analysis of synthetic alignment reveals gap effects

does not vary within the protein family. Positions 3, 7, 8, and 9 are randomly assorting. A common assumption is that the conservation of position 2 implies it is somehow important to the protein family, since the sequence meanderings of evolution would otherwise have changed its identity. Conversely, variation at a position is taken to imply that the position is comparatively unimportant; otherwise some evolutionary pressure would have constrained the residues allowed there.

at a position. Assume that the gambler is conservative and only wagers when confident of a correct answer. If the wager were on the identity of the amino acid at position 2 in Figure 4.1, the bettor would certainly choose cysteine. In contrast, the bettor would likely choose not to wager on the identity of the amino acids at positions 3 or 7-9. In information-theoretic terms, the uncertainty when placing a bet can be measured using entropy (H), a measure of the uncertainty about the value of a random variable. Position 3 has high entropy and position 2 has low entropy.
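The entropy contrast between a conserved and a variable position can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `column_entropy` and the two example columns are hypothetical, since the alignment in Figure 4.1 is not reproduced here, and entropy is measured in bits.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy, in bits, of the symbols in one alignment column."""
    counts = Counter(column)
    n = len(column)
    # Sum p * log2(1/p) over the observed symbols, with p = c / n.
    return sum((c / n) * log2(n / c) for c in counts.values())


# Hypothetical columns, one symbol per sequence (not taken from Figure 4.1).
conserved = "CCCCCC"  # analogous to position 2: always cysteine
variable = "ADEKLS"   # a position where every sequence differs

h_conserved = column_entropy(conserved)  # no uncertainty: 0 bits
h_variable = column_entropy(variable)    # maximal for six sequences: log2(6) bits
```

A bettor facing the conserved column has nothing to be uncertain about, while the variable column carries the largest entropy six sequences can produce.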

Now consider positions 1 and 10. Here the residues can take on three possible states in each position. Without further information the bettor would be uncertain which to choose. However, what if the game is changed slightly to give the bettor one additional piece of information: the identity and location of another residue in the same protein sequence? If the bettor is gambling on the identity of position 1, neither position 2 nor position 3 provides any additional information for placing the bet. However, positions 1 and 10 are coupled: a residue at position 1 is always paired with a particular residue at position 10. So given the identity of the residue at either position, the bettor is now certain about the residue at the other. These two positions exhibit covariation. In the same way that the lack of variation in position 2 implies the important conservation of residue identity, the covariation between positions 1 and 10 indicates the potential for an important conserved interaction between the two positions.

In Information Theoretic terms, the dependency between two positions can be calculated using Mutual Information (MI). Mutual Information can be formally defined as:

\[ MI_{i,j} = H_i + H_j - H_{i,j} \tag{4.1} \]

where $H_i$ is the entropy of position $i$ and $H_{i,j}$ is the joint entropy of positions $i$ and $j$. Since high entropy translates into low certainty, equation 4.1 shows that MI can be understood intuitively as the reduction of uncertainty of the identity of one position when given the identity of another.
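Equation 4.1 can be computed directly from column entropies. The sketch below assumes entropies in bits and uses hypothetical columns that mimic the coupled positions 1 and 10 and the conserved position 2; the function names are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy, in bits, of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum((c / n) * log2(n / c) for c in counts.values())


def mutual_information(col_i, col_j):
    """MI of two alignment columns: H_i + H_j - H_ij (equation 4.1)."""
    joint = list(zip(col_i, col_j))  # one (residue_i, residue_j) pair per sequence
    return entropy(col_i) + entropy(col_j) - entropy(joint)


# Hypothetical columns, one symbol per sequence.
pos1 = "AAKKRR"   # three states
pos10 = "DDEESS"  # three states, each always paired with one state of pos1
pos2 = "CCCCCC"   # fully conserved

mi_coupled = mutual_information(pos1, pos10)   # equals H of pos1: log2(3) bits
mi_conserved = mutual_information(pos1, pos2)  # 0: pos2 tells the bettor nothing
```

For the perfectly coupled pair the joint entropy equals the entropy of either column alone, so knowing one position removes all uncertainty about the other.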

MI by itself is not generally used to estimate covariation because this measure is very sensitive to the entropy of the columns [18, 10], and because of the confounding effect of the intrinsic phylogenetic relationships between positions in a multiple sequence alignment [8]. One commonly used metric is MIp, which is defined as:

\[ MIp_{i,j} = MI_{i,j} - \frac{\overline{MI}_{i,\cdot} \times \overline{MI}_{j,\cdot}}{\overline{MI}} \tag{4.2} \]
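A minimal sketch of the correction in equation 4.2 follows. It rests on an assumption about notation: the two terms in the numerator are read as the mean MI of column i (respectively j) with every other column, and the denominator as the mean MI over all column pairs. The function name `mip_scores` and the toy MI matrix are hypothetical.

```python
def mip_scores(mi):
    """Apply the correction of equation 4.2 to a symmetric MI matrix.

    `mi` is a square list of lists with at least two columns;
    diagonal entries are ignored.
    """
    n = len(mi)
    # Assumed meaning of the per-column terms: mean MI with all other columns.
    col_mean = [sum(mi[i][j] for j in range(n) if j != i) / (n - 1)
                for i in range(n)]
    # Assumed meaning of the denominator: mean MI over all distinct pairs.
    pair_count = n * (n - 1) / 2
    overall = sum(mi[i][j] for i in range(n) for j in range(i + 1, n)) / pair_count

    mip = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # If every MI value is zero there is nothing to correct.
            if i != j and overall > 0:
                mip[i][j] = mi[i][j] - col_mean[i] * col_mean[j] / overall
    return mip


# Toy MI matrix (made-up values): columns 0 and 1 share high MI,
# column 2 has only weak background MI with both.
toy_mi = [
    [0.0, 1.0, 0.2],
    [1.0, 0.0, 0.2],
    [0.2, 0.2, 0.0],
]
toy_mip = mip_scores(toy_mi)
```

In this toy case the background term removes the weak signal shared with column 2, pushing those scores below zero, while the pair (0, 1) keeps a positive score.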

This measure subtracts the mean evolutionary relationships between positions, and others have developed similar measures based on regression [13]. The "heatmap" drawn below the sample alignment corresponds to the MIp score between positions. For example, the aforementioned covariation between columns 1 and 10 is illustrated by the red square in the heatmap, which corresponds to a normalized MIp score greater than 2.5.

As mentioned above, MIp-based coevolution methods have frequently been modified to analyze gapped positions in an alignment by interpreting every gap character as the 21st character in the alignment alphabet [31, 25, 14]. Essentially, these groups treat gapped positions identically to non-gapped positions. The motivation for this change seems laudable if it is assumed that gaps contain information. However, as shown below, the gambling game analogy intuitively demonstrates why MIp will produce statistical artifacts rather than useful information if gap characters are included as the 21st character. This demonstration shows why the proper treatment of gaps is critical to avoiding erroneous interpretations of covariation measures.

Returning to the game, now consider columns 4 through 6 in Figure 4.1A. The residues in these columns are absolutely conserved, as in column 2, and the bettor would not require any information other than this knowledge if asked to choose the identity of the residue. Note that the MIp between these columns is 0: there is no additional information that can be gained about a position by giving information about a second position, since there is no uncertainty for these positions (H = 0).

In Figure 4.1B, a deletion event has been introduced at the conserved positions, 4 through 6 from panel A. That there is a gap at these positions does not provide new knowledge regarding the structural or functional relationships between the ungapped residues. This is because the gap character formally represents a residue in one sequence matching nothing in the other [24]. However, if the gap character is considered to be an additional symbol, then entropy is added to the positions and information is created. When MIp is calculated, we observe a dramatic increase in the covariation values that is local to the deletion event itself. An investigator might therefore conclude that these positions coevolve strongly. This inference would be wrong because the only information we have added is that, at these positions, nothing in one sequence corresponds to nothing in another. As discussed later, the information added to the gap position is dependent on the method used to align the sequences and on the assumptions used by that method.

Figure 4.2: Pairwise covariation scores shown as heatmaps compared to the percentage of gaps at each position from four different alignment methods. Each node in the heatmap represents a Zp covariation score ranging from below 0.5 (yellow) to above 2.5 (red). The histogram below shows the percentage of gaps at each position corresponding to the heatmap above.

In Figure 4.1C, a deletion event has occurred at the randomly assorting positions, 7 through 9. Again, there is a dramatic increase in covariation that is local to the gapped positions. Note, however, that the residues that are not in gaps are still assorting completely randomly: knowing the identity of a residue at one position continues to provide no information about the identity of a residue at a different position.
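The artifact described for panels B and C can be reproduced with raw MI on a toy alignment. The sketch below is illustrative only: the columns are hypothetical rather than taken from Figure 4.1, and skipping sequences that carry a gap in either column (the `skip_gaps` option) is just one assumed way of not counting the gap as a 21st symbol, not necessarily the treatment advocated here.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy, in bits, of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum((c / n) * log2(n / c) for c in counts.values())


def mutual_information(col_i, col_j, skip_gaps=False):
    """MI of two columns; optionally ignore sequences gapped in either column."""
    pairs = list(zip(col_i, col_j))
    if skip_gaps:
        pairs = [(a, b) for a, b in pairs if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    return entropy(xs) + entropy(ys) - entropy(pairs)


# Panel B analogue: two conserved columns, deleted together in half the sequences.
cons_a = "GGGG----"
cons_b = "PPPP----"
# Panel C analogue: ungapped residues occur in every combination (independent),
# with the same shared deletion.
rand_a = "ADAD----"
rand_b = "AADD----"

b_with_gap_symbol = mutual_information(cons_a, cons_b)           # 1 bit, created by the gap
b_gaps_skipped = mutual_information(cons_a, cons_b, skip_gaps=True)  # 0 bits
c_with_gap_symbol = mutual_information(rand_a, rand_b)           # 1 bit, created by the gap
c_gaps_skipped = mutual_information(rand_a, rand_b, skip_gaps=True)  # 0 bits
```

In both cases the full bit of MI comes from the shared gap pattern alone; the ungapped residues contribute nothing, which matches the argument that the apparent covariation is local to the deletion.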

invariably result in the gain of information. This results in local covariation being increased dramatically. However, the gap placement rules used by the multiple sequence alignment programs in widespread use differ: some are arbitrary and some are not. It is thus possible that gap placement by at least some of these automated algorithms does not create information and, as a result, does not increase local covariation scores.
