
Chapter 5 MULTIPLE SEQUENCES ALIGNMENT ALGORITHMS

5.4 Genetic Algorithms

Alignment methods in this group use the genetic algorithm (GA) [37, 46] to solve the multiple sequence alignment problem. The GA mimics natural selection, in which the fittest individuals have a better chance of surviving and therefore produce more offspring. There are many implementations, such as GA [17], GA-DP [128], SAGA [84], RBT-GA [111], GA-ACO [61], and those in [14, 67, 78].

Figure 5.12. Superpath transformation: (a) detachment and (b) x-cut

5.4.1 SAGA: sequence alignment by genetic algorithm

SAGA [84] optimizes a multiple sequence alignment by applying a genetic algorithm (GA) [37, 46]. The algorithm starts with an initial population of random alignments of the sequences. These initial alignments represent the first generation, G0, of the alignments. In generation n, n > 0, SAGA ranks each member of the population using the sum-of-pairs method as its objective function and applies the GA to select pairs of parents from generation Gn−1 to create new offspring for the nth generation. A pair of parents with a good fitness score, based on the objective function, is allowed to have more offspring. The offspring are created by any combination of crossover, gap insertion, and block shuffling. Crossover allows blocks of the parents to be swapped, as in Figure 5.14; gap insertion creates a child by inserting gaps into a parent alignment; and block shuffling allows a block of gaps to be shifted to its left or right. The population size is kept constant by replacing population members with new offspring. No duplicate members are allowed in the population. The algorithm terminates when no new offspring with a better fitness score can be found by the (n + k)th generation, where n and k are parameters that the user defines before running the algorithm. This algorithm is depicted in Figure 5.13.
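The loop described above can be illustrated with a minimal sketch. It is not the SAGA implementation: the scoring values, the row-wise crossover, the single gap-shifting mutation, and the names `sum_of_pairs`, `random_alignment`, `shift_gap`, `evolve`, and `patience` are all assumptions made for illustration. In particular, SAGA's own crossover swaps blocks of the parents, whereas this sketch simply takes each row from one parent or the other, and `patience` stands in for the user-defined stopping parameter.

```python
import random
from itertools import combinations

GAP = "-"

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length rows."""
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == GAP and b == GAP:
                continue              # gap/gap pairs are ignored in this sketch
            if a == GAP or b == GAP:
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score

def random_alignment(seqs, extra, rng):
    """Pad every sequence with randomly placed gaps to a common width."""
    width = max(len(s) for s in seqs) + extra
    rows = []
    for s in seqs:
        chars = list(s)
        for _ in range(width - len(s)):
            chars.insert(rng.randrange(len(chars) + 1), GAP)
        rows.append("".join(chars))
    return tuple(rows)

def shift_gap(alignment, rng):
    """Mutation: move one gap of one row to a new position (width unchanged)."""
    rows = list(alignment)
    i = rng.randrange(len(rows))
    chars = list(rows[i])
    gaps = [p for p, c in enumerate(chars) if c == GAP]
    if gaps:
        chars.pop(rng.choice(gaps))
        chars.insert(rng.randrange(len(chars) + 1), GAP)
        rows[i] = "".join(chars)
    return tuple(rows)

def evolve(seqs, pop_size=20, extra=2, patience=30, seed=0):
    """Stop after `patience` consecutive offspring that do not beat the best score."""
    rng = random.Random(seed)
    # A set keeps the population free of duplicates.
    population = {random_alignment(seqs, extra, rng) for _ in range(pop_size)}
    best = max(population, key=sum_of_pairs)
    stale = 0
    while stale < patience:
        ranked = sorted(population, key=sum_of_pairs, reverse=True)
        # Rank-based weights: fitter members are more likely to become parents.
        weights = [len(ranked) - r for r in range(len(ranked))]
        mom, dad = rng.choices(ranked, weights=weights, k=2)
        # Simplified crossover: each row comes from one of the two parents.
        child = tuple(rng.choice(pair) for pair in zip(mom, dad))
        if rng.random() < 0.5:
            child = shift_gap(child, rng)
        if child not in population:
            population.remove(ranked[-1])   # replace the worst member
            population.add(child)
        if sum_of_pairs(child) > sum_of_pairs(best):
            best, stale = child, 0
        else:
            stale += 1
    return best
```

Calling `evolve(["ACGT", "AGT", "ACT"])` returns a tuple of gapped rows of equal width. Removing the gaps from each row recovers the input sequences, which is the invariant every operator in the sketch preserves.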

Figure 5.13. SAGA alignment scheme. At any generation Gi, the parents Pi are crossbred or mutated by a random operation X to generate a new set of children Pi+1

Figure 5.14. Illustration of SAGA, where the two parents are crossbred. The dotted boxes represent the consistent alignment between the two parents

Similarly, GA and RBT-GA (Rubber Band Techniques with GA) use a genetic algorithm as their engine. The term ‘rubber band’, used in the RBT-GA method, refers to a path from location (0,0) to location (m,n) in a back-tracking matrix similar to the one created in dynamic programming (DP). The optimal rubber band is exactly the same as the DP back-tracking path. Instead of finding the optimal solution via DP, the rubber band technique iteratively selects pairs of residues from the pair of sequences being aligned as anchor points (poles) and tries to find the best-scoring path between the anchors.
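The path view of an alignment can be made concrete with a short sketch. It shows only the representation, not the RBT-GA search: `path_through_poles` builds one arbitrary monotone path through the given poles (diagonal steps first, then straight steps), and `path_to_alignment` turns any such path into a gapped pairwise alignment. Both function names, and the convention that a pole is a matrix cell (i, j) the path must visit, are assumptions made for illustration.

```python
def path_through_poles(m, n, poles):
    """Build one monotone path (0, 0) -> poles -> (m, n) in the matrix."""
    path = [(0, 0)]
    for ti, tj in list(poles) + [(m, n)]:
        i, j = path[-1]
        if ti < i or tj < j:
            raise ValueError("poles must be non-decreasing in both coordinates")
        while i < ti and j < tj:      # diagonal steps pair two residues
            i, j = i + 1, j + 1
            path.append((i, j))
        while i < ti:                 # leftover residues of the first sequence
            i += 1
            path.append((i, j))
        while j < tj:                 # leftover residues of the second sequence
            j += 1
            path.append((i, j))
    return path

def path_to_alignment(a, b, path):
    """Read a path of (i, j) cells as a gapped alignment of a and b."""
    row_a, row_b = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):            # residue aligned with residue
            row_a.append(a[i0])
            row_b.append(b[j0])
        elif step == (1, 0):          # residue of a aligned with a gap
            row_a.append(a[i0])
            row_b.append("-")
        elif step == (0, 1):          # residue of b aligned with a gap
            row_a.append("-")
            row_b.append(b[j0])
        else:
            raise ValueError("path must move one cell right, down, or diagonally")
    return "".join(row_a), "".join(row_b)

path = path_through_poles(4, 3, [(2, 1)])
print(path_to_alignment("ACGT", "AGT", path))   # ('ACGT', 'A-GT')
```

Moving a pole changes the path and therefore the alignment, which is the degree of freedom a rubber-band style search would explore when looking for a better-scoring path.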

Genetic algorithm with ant colony optimization (GA-ACO) combines the GA technique with ant colony optimization (ACO) [26] to avoid becoming trapped in local optima. ACO is an iterative heuristic algorithm that simulates the behavior of ants. When an ant moves, it secretes pheromone on its path for other ants to follow. The amount of pheromone secreted is proportional to the quality or quantity of the food found. The pheromone decays over time. Ants follow the paths with the highest pheromone intensity. Over time, only frequently used paths remain. When ACO is applied to GA, k ants are assigned to random columns of the parent alignments in GA and traverse across the sequences. After x iterations, the remaining paths are aligned, preserved, and passed to future offspring generations.
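The pheromone mechanism can be sketched generically as follows. This is not the GA-ACO procedure itself: it models only decay and quality-proportional deposit on a fixed set of candidate paths. The function `run_colony`, its parameters, and the rule that an ant picks a path with probability proportional to its pheromone are assumptions made for illustration.

```python
import random

def run_colony(quality, n_ants=10, n_iters=50, decay=0.1, seed=0):
    """Return the pheromone level of each path after n_iters iterations.

    quality maps a path label to a positive 'goodness' value.
    """
    rng = random.Random(seed)
    pheromone = {path: 1.0 for path in quality}
    for _ in range(n_iters):
        paths = list(pheromone)
        # Each ant picks a path with probability proportional to its pheromone.
        chosen = rng.choices(paths, weights=[pheromone[p] for p in paths], k=n_ants)
        for p in pheromone:
            pheromone[p] *= 1.0 - decay       # pheromone decays over time
        for p in chosen:
            pheromone[p] += quality[p]        # deposit proportional to quality
    return pheromone

levels = run_colony({"good": 1.0, "poor": 0.1})
print(max(levels, key=levels.get))
```

Because better paths receive larger deposits and attract more ants in the next iteration, their pheromone grows while that of rarely chosen paths evaporates, which mirrors the statement that only frequently used paths remain.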

5.4.2 GA and Self-organizing Neural Networks

GA-SNN [66] uses self-organizing neural networks and a genetic algorithm to align a set of sequences. A self-organizing neural network is composed of two layers, an input layer and an output layer. Each node (neuron) in the input layer is connected to every node in the output layer with a certain weight. The nodes in the output layer are interconnected with their neighbors. Figure 5.15 depicts a neural network used in this method. The neural networks are used to identify conserved regions in the sequences, allowing the genetic algorithm to select offspring that have these motifs aligned. The neural networks identify the sequence motifs as follows: (i) generate a list of words for each input sequence using a sliding window of length 3; (ii) feed the words into the input nodes of the neural networks; and (iii) classify the words that emerge from the third sub-network as motifs and give them more weight.
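Step (i) above, the sliding-window word list, can be sketched in a few lines. The function name `words` and the parameter `k` are illustrative only.

```python
def words(seq, k=3):
    """List every overlapping word of length k in seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(words("ACGTA"))   # ['ACG', 'CGT', 'GTA']
```

A sequence of length L yields L − 2 words of length 3, and a sequence shorter than the window yields an empty list.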

The decision of which words are allowed to emerge from the third sub-network is based on the weight of the pattern. When the words are fed into the neural networks, the input nodes calculate the distances between the words and classify them into groups at the top level. Each word in these groups is passed down to the next level for further classification. The distance between two words is defined as the average distance between their overlapping sub-words of length 2 (each word is split into two overlapping sub-words for comparison). The words are then classified and grouped in the same way as at the top-level sub-network. A pattern that arrives at the bottom of the neural networks is aligned pairwise with the pattern stored at that node. The pattern with the highest pairwise sum-of-pairs alignment score at the node is kept as the winner, or motif.
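The word distance described above can be sketched under two explicit assumptions, neither of which is stated in the text: the sub-words are compared position by position (first with first, second with second), and the distance between two sub-words is the number of mismatching characters. The names `split_word`, `mismatches`, and `word_distance` are illustrative.

```python
def split_word(word):
    """Split a word into its overlapping sub-words of length 2."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def mismatches(u, v):
    """Assumed sub-word distance: number of positions where u and v differ."""
    return sum(x != y for x, y in zip(u, v))

def word_distance(w1, w2):
    """Average distance between the position-wise paired sub-words."""
    pairs = list(zip(split_word(w1), split_word(w2)))
    return sum(mismatches(u, v) for u, v in pairs) / len(pairs)

print(word_distance("ACG", "ACT"))   # 0.5
```

For "ACG" and "ACT" the sub-word pairs are (AC, AC) and (CG, CT), with distances 0 and 1, so the average is 0.5. A substitution-matrix-based distance could replace `mismatches` without changing the rest of the sketch.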

Figure 5.15. Neural networks

5.4.3 FAlign

FALIGN [16] combines progressive and iterative refinement algorithms for MSA. This method requires users to identify and define the motif regions in the sequences. The sequences are split into segments at the motif boundaries across all sequences, and each segment is aligned progressively. The aligned segments are then reassembled to generate an alignment. FALIGN uses the BLOSUM62 scoring matrix and the sum-of-pairs score. The last step of the algorithm randomly and iteratively shifts gaps and residues in the non-motif regions to improve the alignment score.
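The splitting step can be sketched as follows, assuming each user-defined motif is given as a half-open (start, end) interval and every sequence carries the same number of motifs. The functions `split_at_motifs` and `group_segments` are hypothetical names; the progressive alignment of each group and the final refinement are not shown.

```python
def split_at_motifs(seq, motifs):
    """Split seq at motif boundaries.

    motifs is a sorted list of non-overlapping (start, end) half-open intervals.
    The result alternates non-motif and motif segments.
    """
    segments, pos = [], 0
    for start, end in motifs:
        segments.append(seq[pos:start])   # non-motif stretch before the motif
        segments.append(seq[start:end])   # the motif itself
        pos = end
    segments.append(seq[pos:])            # non-motif stretch after the last motif
    return segments

def group_segments(seqs, motifs_per_seq):
    """Group the k-th segment of every sequence so each group can be aligned on its own."""
    split = [split_at_motifs(s, m) for s, m in zip(seqs, motifs_per_seq)]
    return list(zip(*split))

groups = group_segments(["AAGGGTT", "AGGGTTT"], [[(2, 5)], [(1, 4)]])
print(groups)   # [('AA', 'A'), ('GGG', 'GGG'), ('TT', 'TTT')]
```

Concatenating the segments of one sequence in order reproduces that sequence, so aligned groups can be joined column block by column block to reassemble the full alignment.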