
Chapter 3 QUANTIFYING SEQUENCE ALIGNMENTS

3.1 Evolution and Measuring Evolution

In terms of sequence evolution, whether the sequence is DNA, RNA, or protein, measurement usually relies on the primary sequence information. In general, it focuses on four features: sequence similarity, homology, divergence, and convergence. These features involve the sequences, their structures, and their functions.

The simplest of the four is measuring the similarity between sequences. For any given set of sequences, a statistical measure of sequence similarity can be derived from the number of nucleotides the sequences have in common. However, this measurement is of limited use because similarity can arise by chance. For example, finding high similarity between a non-functional sequence segment and a functional motif of another sequence may be of no practical use.
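As a minimal sketch of the similarity measure described above, the following Python function computes the fraction of positions at which two sequences carry the same symbol. It assumes the sequences are already aligned to equal length and contain no gaps; the function name `fraction_identical` and the toy sequences are illustrative only.

```python
def fraction_identical(seq_a: str, seq_b: str) -> float:
    """Return the fraction of aligned positions where both sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Count positions carrying the same symbol in both sequences.
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Toy example (hypothetical sequences): 6 of the 8 positions match.
similarity = fraction_identical("ACGTACGT", "ACGTTCGA")
print(similarity)
```

A score like this says nothing about whether the matching positions are functionally meaningful, which is the limitation noted above.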

Observing divergence and convergence among groups of nucleotides, segments of sequences, or whole sequences often leads to the discovery of important information, such as the key elements of a sequence, its structures, or its functions. Identifying and measuring divergence and convergence between sequences are therefore fundamental to sequence analysis. Divergence and convergence are commonly derived from the probability that a nucleotide, or amino acid residue, diverges or converges into another.

Lastly, homology measurement is a technique for identifying the distance between the sequences and one of their ancestors; this is the ultimate goal of measuring evolution. Most often, homology is measured from available biological information together with information derived from the other three measuring techniques. For example, Calmodulin (CalM: CALcium MODULated proteIN), a calcium-binding protein expressed in all eukaryotic cells that mediates inflammation, metabolism, short-term and long-term memory, nerve growth, and immune response, among other processes, is well known to be approximately 148 amino acids long with four EF-hand motifs, each of which binds a Ca2+ ion. When a comparison of Calmodulin against other sequences yields identical, or highly similar, segments matching the four EF-hand motifs, it gives greater confidence that these sequences are homologous. Figure 3.1 depicts the 3D structure of a human CalM wrapping around its binding domain in the plasma membrane.

Figure 3.1. A human Calmodulin wraps around its binding domain in the plasma membrane. Ca2+ ions are depicted as circles.

Next, we will look into how evolution can be measured.

3.1.1 Jukes and Cantor’s Model

In 1969, Jukes and Cantor [55] developed a method for modeling the probability that a nucleotide (or amino acid residue) is mutated to, or substituted by, another. The method assumes that the rate of substitution is the same for all four nucleotides (and similarly for amino acids), as shown in Figure 3.2; consequently, unrelated DNA/RNA sequences should be about 25% identical by chance.

Figure 3.2. The possible paths by which a nucleotide can mutate into another, each with the same probability α, in Jukes and Cantor’s model.

The main goal is to estimate the actual number of mutations that occurred on the way from an ancestor sequence to two sequences of fixed length n that differ at k positions, i.e., that show k observed substitutions. In other words, the goal is to find the number of steps that made these two sequences differ, or the distance between them. For example, consider the following two sequences:

a: A A C C A C A C A A C C A C T A A A G A A C C T A C A

b: A A G T A C A G T A C T G C T G A T G G A G C A A C T

   0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 1 0 0 1

In this representation, each 1 marks a position where the two sequences differ. There are 12 such substitutions between the two sequences of length 27, a fraction of about 44 percent (12 is also the distance between these two sequences when only substitutions are counted, i.e., their Hamming distance). Assuming each nucleotide is equally likely to change into any other nucleotide, the true number of mutations is very likely greater than 12, since, for example, a C could change to T and then to G. The Jukes and Cantor method estimates that the actual number of mutations is about 18 (d ≈ 0.67 substitutions per position, or about 67 percent), rather than the observed 44 percent.
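The mismatch row and the count of 12 can be checked with a short Python sketch. The variable names (`a`, `b`, `mask`, `k`, `p`) are chosen here for illustration; the sequences are the two shown above with the spaces removed.

```python
# The two example sequences from the text, written without spaces.
a = "AACCACACAACCACTAAAGAACCTACA"
b = "AAGTACAGTACTGCTGATGGAGCAACT"

# 1 where the aligned symbols differ, 0 where they agree.
mask = [int(x != y) for x, y in zip(a, b)]

k = sum(mask)      # number of observed substitutions
p = k / len(a)     # fractional dissimilarity

print(" ".join(map(str, mask)))
print(k, len(a), round(p, 2))
```

Here `k` is the observed number of differing positions, not the estimated number of mutations, which the Jukes and Cantor correction addresses.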

The Jukes and Cantor equation is defined as:

$$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right) \tag{3.1}$$

where d is the distance between the two sequences, and p is the fractional dissimilarity, defined as the ratio of the number of mismatched positions to the length of the sequences.
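As a worked check of Equation 3.1 on the example above (12 mismatches over 27 positions), the following Python sketch evaluates the distance. The function name `jukes_cantor_distance` is illustrative; the guard on p reflects the fact that the logarithm is undefined once p reaches 3/4, the dissimilarity expected between unrelated sequences under this model.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor distance d = -(3/4) * ln(1 - (4/3) * p)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


n = 27          # sequence length in the example
p = 12 / n      # observed fractional dissimilarity
d = jukes_cantor_distance(p)

# Distance per position, and the implied number of mutations over n positions.
print(round(d, 3), round(d * n, 1))
```

Multiplying d by the sequence length gives the estimated number of mutations, which comes out near 18 for this pair, compared with the 12 differences actually observed.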

The terms substitution and mutation referenced in this chapter are used interchangeably, and they both convey the same meaning of a residue symbol being replaced by another symbol.

3.1.2 Measuring Relatedness

The most basic task in reconstructing an evolutionary process is to calculate the evolutionary distances between the species involved. Initially, the species most confidently judged to be homologous are grouped together. These groups are then treated as new taxonomic units, and the grouping process is repeated until all taxonomic units are joined. The fundamental step in this process is measuring the evolutionary distance, or scoring the relatedness between species. In sequence analysis, the scoring is often based on the similarity, divergence, and convergence between the sequences. At the residue level, for either nucleotides or amino acids, the available biological information permits scientists to rank any pair of matching residues. However, the functional units and structures of sequences often consist of more than one residue and may depend on each other, which makes the ranking very difficult because the units and the dependencies between them are still being determined. The most popular approach to this problem is to accumulate the residue similarity, divergence, and convergence scores and then normalize them according to some weighting scheme. The weighting scheme should therefore be refined by experts for each instance of measurement to fit the data and the expected or desired results. Any information revealed by a sequence alignment should be fed back to further refine the alignment.
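The grouping process described above can be sketched as a simple agglomerative loop. The text does not specify how the distance between two groups is computed, so this sketch assumes average linkage (the mean of all pairwise member distances, as in UPGMA-style clustering); the function name `group_taxa`, the bracketed labels, and the toy distances are all hypothetical.

```python
from itertools import combinations


def group_taxa(names, dist):
    """Repeatedly join the two closest taxonomic units; return the join order.

    `dist` maps frozenset({x, y}) to the distance between species x and y.
    Assumption: group-to-group distance is the average over member pairs.
    """
    clusters = {name: [name] for name in names}  # label -> member species
    joins = []

    def average_distance(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the pair of current units with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        joins.append((c1, c2, average_distance(c1, c2)))
        # The joined pair becomes a new taxonomic unit.
        clusters[f"({c1},{c2})"] = clusters.pop(c1) + clusters.pop(c2)
    return joins


# Toy distances between three hypothetical species.
toy = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.4,
    frozenset(("B", "C")): 0.5,
}
for left, right, distance in group_taxa(["A", "B", "C"], toy):
    print(left, right, round(distance, 2))
```

In practice the pairwise distances would come from a scoring scheme such as the weighted scores discussed above, and a different linkage rule may suit the data better.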