Top PDF Faster and efficient algorithm for sequence alignment

Faster and efficient algorithm for sequence alignment

In bioinformatics, a sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity. If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels. The goal of this paper is to explore computational approaches that perform sequence alignment quickly and optimally. Two techniques have been studied: global alignment and local alignment. In this paper, I have used each alignment technique separately. Each technique follows an algorithm (the Needleman-Wunsch algorithm for global alignment and the Smith-Waterman algorithm for local alignment) that generates the corresponding optimal alignment. Multiple DNA
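
As an illustration of the global alignment technique named in the excerpt, here is a minimal sketch of Needleman-Wunsch scoring. It is not the paper's implementation: the function name and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for the example, and only the optimal score is computed, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    # Before any character of a is consumed, b[:j] can only be aligned to gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    # Global alignment: the score is read from the last cell.
    return prev[-1]
```

For example, `needleman_wunsch_score("ACGT", "AGT")` aligns three matching characters around one gap. Keeping only two rows gives the score in O(min-row) memory, but recovering the alignment itself would need the full matrix or a divide-and-conquer scheme.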

Oculus: faster sequence alignment by streaming read compression

Since they change the seed sequence used in alignment, the vast majority of the differences (inaccuracies) produced were for reads for which Oculus reversed either the orientation (88% of single-end differences) or the order (67% of paired-end differences). Mostly these were previously unaligned reads that aligned and vice versa, but in some cases an unambiguously mapped read actually changed alignment positions (single-end, 0.09% of differences; paired-end, 10.15% of differences). Though initially surprising, this can be explained by mismatches in seed sequences. Bowtie is less permissive of mismatches in the seed than at the end of a read, under the assumption that read quality tends to be better toward the 5' end. Of two closely homologous regions of the genome, one may count as the best hit in the forward orientation, and the other in the reverse orientation. For example:

Improvisation of Global Pairwise Sequence Alignment Algorithm Using Dynamic Programming

There are some alternative methods that have greatly reduced the time and space requirements of the dynamic programming method for sequence alignment. Some shortcut methods have also been developed to speed up the alignment process. Such methods, known as word or k-tuple methods, are used by the FASTA and BLAST algorithms. FASTA, developed by W. Pearson and D. Lipman (1988), performs a database scan for sequence similarity in a short time. FASTA breaks a sequence down into short words a few characters long, and these words are then organized into a table indicating where they occur in the sequence. If one or more words are present in both sequences, then the sequences are considered similar in those regions. Pearson (1990, 1996) continued to improve the FASTA method for similarity searches in sequence databases. BLAST, developed by S. Altschul et al. (1990), has been considered faster than the FASTA algorithm. Like FASTA, BLAST prepares a table of short sequence words for each sequence, but it also determines which of these words are most significant and are good indicators of similarity between two sequences, and it confines the search to these words (and related ones). This confinement speeds up the alignment process. Recent improvements in BLAST include PSI-BLAST and GAPPED-BLAST, which is threefold faster than the original BLAST.
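
The word-table idea described in the excerpt can be sketched as follows. This only illustrates the k-tuple lookup step, not FASTA or BLAST themselves; the function names `word_table` and `shared_words` and the default word length `k=3` are assumptions made for the example.

```python
from collections import defaultdict


def word_table(seq, k):
    """Map every length-k word in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def shared_words(query, target, k=3):
    """Return words present in both sequences with their start positions.

    The result maps word -> (positions in query, positions in target).
    """
    q = word_table(query, k)
    t = word_table(target, k)
    return {w: (q[w], t[w]) for w in q if w in t}
```

Calling `shared_words("ACGTAC", "TTACGT")` reports the words `ACG`, `CGT` and `TAC` together with where each occurs. A real search tool would then extend and score these hits; that step is omitted here.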

Parallelizing and Analyzing the Behavior of Sequence Alignment Algorithm on a Cluster of Workstations for Large Datasets

place to cut sequence t, generating subsequences t_1 and t_2, in such a way that the alignment problem can be solved in a split and merge recursive manner. Usually, one given biological sequence is compared against thousands or even millions of sequences that compose genetic data banks. One of the most important gene repositories is the one that is part of a collaboration that involves GenBank at the National Center for Biotechnology Information (NCBI), the EMBL at the European Molecular Biology Laboratory and DDBJ at the DNA Data Bank of Japan. These organizations exchange data daily and a new release is generated every two months. By now, there are millions of entries composed of billions of nucleotides. In this scenario, the use of exact methods such as NW and SW is prohibitive. For this reason, faster heuristic methods are proposed which do not guarantee that the best alignment will be produced. Usually, these heuristic methods are evaluated using the concepts of sensitivity and sensibility. Sensitivity is the ability to recognize as many significant alignments as possible, including distantly related sequences. Sensibility is the ability to narrow the search in order to discard false positives [23]. Typically, there is a tradeoff between sensitivity and sensibility. Usually, heuristic methods use scoring matrices to calculate the mismatch penalty between two different proteins. In Figure 1, we assigned a single value to every mismatch (-1 in the example) regardless of the parts involved. This works well with nucleotides but not with proteins. For instance, some mismatches are more likely to occur than others and can indicate evolutionary aspects [21]. For this reason, the alignment methods for proteins use score matrices which associate distinct values with distinct mismatches and reflect the likelihood of a certain change.

Sequence embedding for fast construction of guide trees for multiple sequence alignment

The embedding process entails the construction of vectors representing biological sequences in such a way that the distances between those vectors approximate the dissimilarities between the sequences themselves. These vector distances are orders of magnitude faster to calculate than sequence distances, and this allows us to rapidly generate a distance matrix δ(F(x), F(y)) from a set of embedded sequences. For very large numbers of sequences, perhaps numbering in the hundreds of thousands, such distance matrices can become unmanageable, due to sheer size. In these cases, the sequence vectors can be clustered using an alternative clustering method such as k-means. For this paper, our main aim is to be able to rapidly generate guide trees which can be used to make multiple alignments of the input sequences. Here, this is done by applying the UPGMA clustering algorithm to the embedded distance matrix. We then try to measure the success of the overall procedure by (i) tree comparison and (ii) comparing the multiple sequence alignments that are generated using guide trees from embedded distance matrices with those generated from full sequence distance matrices. This comparison is achieved using standard multiple alignment benchmarking procedures. Attempts at directly comparing the distance matrices using standard matrix comparison methods, such as Stress [26], proved inconclusive, and results are not shown here.

Analysing Multiple DNA Sequence Alignment Algorithms Smith Waterman Algorithm and Parallel Smith Waterman Algorithm

Abstract: Multiple DNA sequence alignment is one of the important research topics of bioinformatics. Rapid and automated sequence analysis facilitates everything from functional classification and structural determination of proteins to studies of genetic expression and evolution. The ultimate choice among sequence search algorithms is Smith Waterman. However, because of the computationally demanding nature of this method, special-purpose hardware alternatives such as Parallel Smith Waterman have been developed. These keep the essence of Smith Waterman while computing faster. In this paper, we present the efficiency of Parallel Smith Waterman over the Smith Waterman algorithm.
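
For orientation, a minimal sequential sketch of Smith Waterman local alignment scoring is given below. It is not the parallel variant the abstract discusses, and the function name and scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    best = 0
    # Row 0 is all zeros: a local alignment may start anywhere.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]  # column 0 is also zero
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets the alignment restart instead of going negative.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    # Local alignment: the score is the maximum over all cells.
    return best
```

Each cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That independence is what parallel implementations generally exploit.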

A Noble Approach on Bioinformatics: Smart Sequence Alignment Algorithm applying DNA Replication (SSAADR)

The previously discussed algorithms in this paper work by filling a matrix of size m x n (m is the length of the first sequence, n is the length of the second sequence). The newly implemented Smart Sequence Alignment Algorithm applying DNA Replication (SSAADR) performs both global and local alignment using the concepts of the Needleman-Wunsch (NW) and Smith-Waterman (SW) algorithms. In SSAADR, the concept of NW is used for the trace-back and that of SW for the main iteration. The first row and first column of matrix M(i, j) are initialized with zeros. Trace-back starts from the last cell, M(m, n), and ends at the first cell, M(0, 0). As Smith-Waterman performs faster than Needleman-Wunsch, the concept of SW is used for filling the matrix. To make the proposed algorithm faster, the theory of DNA replication has been introduced here. When the alternative sequences of both original DNA sequences (i.e. sequences found after putting A in place of T and C in place of G and vice versa) are used instead of the original ones, the execution time of SSAADR decreases immensely. This conclusion came to light after experimenting with Original-Original, Original-Alternative, Alternative-Original and Alternative-Alternative pairs of DNA sequences; the optimal execution time was found for the Alternative-Alternative pair. The whole proposed algorithm is presented here in four segments.
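
The "alternative sequence" substitution described in the excerpt (A for T, C for G, and vice versa) can be sketched in a few lines. This shows only the substitution step, not the SSAADR algorithm; the names `COMPLEMENT` and `alternative_sequence` are hypothetical, and uppercase DNA letters are assumed.

```python
# Translation table swapping A<->T and C<->G (uppercase DNA letters assumed).
COMPLEMENT = str.maketrans("ATCG", "TAGC")


def alternative_sequence(seq):
    """Return the sequence with A/T and C/G exchanged, position by position."""
    return seq.translate(COMPLEMENT)


def match_pattern(a, b):
    """Return, per position, whether the two sequences carry the same letter."""
    return [x == y for x, y in zip(a, b)]
```

Because the substitution is a one-to-one mapping on the four letters, two positions match in the alternative sequences exactly when they match in the originals. For instance, `match_pattern("ATCG", "ATGG")` equals `match_pattern(alternative_sequence("ATCG"), alternative_sequence("ATGG"))`.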

An Improved Needleman-Wunsch Algorithm for Pairwise Sequence Alignment of Protein-Albumin

An improvement of the Needleman-Wunsch algorithm (INWA) has been applied to pairwise sequence alignment of human protein-albumin. The main idea of the proposed method is to skip unused data by leaving unused areas of the matrix blank, in order to obtain the least computational time and to reduce space complexity. INWA fills the cells of the matrix only partially. The algorithm can be applied to both kinds of sequence alignment, whether or not the input sequences have the same length. Furthermore, the space and time complexity of INWA is O(N), which is better than the O(MN) of the original Needleman-Wunsch algorithm. Also, the running time of INWA is 25.9% faster than that of the original Needleman-Wunsch algorithm.

A Design of a Hybrid System for DNA Sequence Alignment

This paper describes a parallel algorithm with its required architecture, and a complementary sequential algorithm, for solving the sequence alignment problem on DNA (deoxyribonucleic acid) molecules. The parallel algorithm is considered much faster than the sequential algorithms used to perform sequence alignment. The initialization operation is done by activating a number of processing elements, each of which compares the two sequences simultaneously and weights the comparison between each nucleotide of the first and the second DNA sequences. Then, the sequence matching operation is performed, also simultaneously, on the same processing elements. Both the initialization operation and the sequence matching operation are done in only two clock cycles. The proposed sequence matching operation is considered a new approach that highlights the subsequences matched between the two given DNA sequences. This operation provides a good indication for linking the matched subsequences of the two DNA sequences depending on a threshold K, which indicates the number of highest-scoring subsequences that can be linked. A parallel architecture on which the algorithms are performed is also presented. A simple sequential algorithm is then presented to obtain the final alignment between the two DNA sequences. This hybrid system is considered a step towards a complete parallel processing architecture for solving computationally intensive applications on DNA.

Bayesian coestimation of phylogeny and sequence alignment

Statistical modelling and MCMC approaches have a long history in population genetic analysis. In particular, coalescent approaches to genealogical inference have been very successful, both in maximum likelihood [19,20] and Bayesian MCMC frameworks [21,22]. The MCMC approach is especially promising, as it allows the analysis of large data sets, as well as nontrivial model extensions; see e.g. [23]. Since divergence times in population genetics are small, alignment is generally straightforward, and genealogical inference from a fixed alignment is well understood [20,24-26]. However, these approaches have difficulty dealing with indels when sequences are hard to align. Indel events are generally treated as missing data [27], which renders them phylogenetically uninformative. This is unfortunate, as indel events can be highly informative of the phylogeny because of their relative rarity compared to substitution events. Statistical models of alignment and phylogeny often refer to missing data. Not all of these can be integrated out analytically (e.g. tree topology), and these are dealt with using Monte Carlo methods. The efficiency of such approaches depends to a great extent on the choice of missing data. In previous approaches to statistical alignment, the sampled missing data were either unobserved sequences at internal nodes [28], or both internal sequences and alignments between nodes [13], or dealt exclusively with pairwise alignments [29,30]. In all cases the underlying tree was fixed. In [31] we published an efficient algorithm for computing the likelihood of a multiple sequence alignment under the TKF91 model, given a fixed underlying tree. The method analytically sums out all missing data (pertaining to the evolutionary history that generated the alignment), eliminating the need for any data augmentation of the tree. This methodology is referred to in the MCMC literature as

DNA Sequence alignment using programme by algorithm

Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem. These include slow but formally correct methods like dynamic programming. These also include efficient, heuristic algorithms or probabilistic methods designed for large-scale database search, that do not guarantee to find best matches (Nocedal and Wright, 2006).

DNA Multiple Sequence Alignment by a Hidden Markov Model and Fuzzy Levenshtein Distance based Genetic Algorithm

Zhang et al. [9] have proposed a novel method of population initialization and of crossover. Chang et al. [17] have successfully combined fuzzy arithmetic with GA to arrive at better alignments. Lai et al. [3] have suggested new genetic operators that direct the GA towards better solutions. Nguyen et al. [12] present a hybrid scheme in which they convert the MSA problem to the problem of finding the shortest path in a weighted directed acyclic k-dimensional graph (where k is the number of sequences). In 2012, Hjelmqvist [10] published the idea of a fast and memory-efficient Levenshtein algorithm to compute the edit distance between strings, such as DNA sequences. This paper presents a genetic algorithm, based on a Hidden Markov Model and Fuzzy Levenshtein Distance, to align multiple DNA sequences.
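
To make the notion of a memory-efficient edit distance concrete, here is a minimal sketch of the standard two-row Levenshtein computation. It is not claimed to be the implementation cited as [10], nor the fuzzy variant used in the paper; the function name and unit costs are assumptions for illustration.

```python
def levenshtein(s, t):
    """Return the edit distance between s and t using two rows of memory.

    Insertions, deletions and substitutions each cost 1 (an assumption).
    """
    # Iterate over the longer string so the stored rows have the shorter length.
    if len(s) < len(t):
        s, t = t, s
    # prev[j] is the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

Only the previous and current rows are kept, so memory grows with the length of the shorter string rather than with the product of both lengths.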

Robust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome Sequence Comparison

Protein structural similarity can be used to infer evolutionary relationships between proteins or to classify protein structures into more generalized groups; therefore a good protein structure alignment algorithm is very helpful for protein biologists. However, a good alignment algorithm itself is insufficient for the effective discovery of structural relationships between tens of thousands of proteins. It is hard to imagine that one could manually examine the structural similarity between 100,000 × 100,000 pairs of protein chains; the structural relationship between proteins has to be discovered automatically. The field of protein structure query aims to find similar structures in the protein data bank according to a given query structure. Because of the requirements of both a fast and stable filter and a fast and accurate structure alignment tool, this area has posed an even greater research challenge than protein structure alignment. In Chapter 4 an efficient protein structure query algorithm and tool [71] is developed to find similar protein structures in the protein data bank for any given structure. With the combination of a series of fast and stable filters and a structure alignment algorithm particularly optimized for the query purpose, some exciting results have been observed when compared with SSM, which we believe is among the best structure query web sites. An algorithm comparison model is also proposed to compare our protein structure query tool with others.

A Much Faster Algorithm for Finding a Maximum Clique

We define a clique as a complete subgraph in which all pairs of vertices are adjacent to each other. Algorithms for finding a maximum clique (e.g., [18]) in a given graph have received much attention especially recently, since they have many applications. There has been much theoretical and experimental work on this problem [3, 20]. In particular, while finding a maximum clique is a typical NP-hard problem, considerable progress has been made for solving this problem in practice. Furthermore, much faster algorithms are required in order to solve many practical problems. Along this line, Tomita et al. developed a series of branch-and-bound algorithms MCQ [16], MCR [17] and MCS [18] among others that run fast in practice. It was shown that MCS is relatively fast for many instances tested.

Heuristics for multiobjective multiple sequence alignment

A multiobjective formulation of sequence alignment provides the practitioner with a set of alignments that represents the trade-off between decreasing the number of gaps and increasing similarity. In bioinformatics, this formulation and corresponding algorithms already exist for pairwise sequence (DNA/protein) alignment [7–10]. Abbasi et al. [7] present dynamic programming algorithms to compute the optimal set of alignments by treating the number of indels/gaps and the scores for (mis)matches/substitutions as separate objectives. They also apply this method to analyze the construction of phylogenetic trees. Taneda [11] describes a heuristic approach for pairwise RNA sequence alignment that incorporates RNA structure information to approximate a set of optimal alignments. Schnattinger et al. [12] extend the work of Taneda by computing the optimal set. They treat the sequence alignment and the consensus structure calculation as separate objectives and solve both problems simultaneously with a dynamic programming approach. An extensive review of other problems in bioinformatics that are formulated as multiobjective optimization problems is given in Handl et al. [13].

DECIPHERING THE SEQUENCE ALIGNMENT BY NEEDLEMAN-WUNSCH ALGORITHM ON TO REDUCE COMPUTATIONAL TIME VIA HIGH PERFORMANCE COMPUTING

In the present era, a biological data explosion has occurred, and the accumulation of biological information has greatly accelerated. The reasons for this explosion are the revolutionary recombinant DNA technology used for DNA sequencing and the more recent genome sequencing projects. As a result, it is easier to obtain the DNA sequence of the gene corresponding to an RNA or protein than it is to determine its structure or function experimentally. Because of this, the size of sequence databases (e.g. GenBank, maintained by NCBI, USA) is, to date, larger than the size of structure databases (e.g. PDB, maintained by RCSB, USA). This provides a strong motivation for developing computational approaches that can deduce biological information from sequence alone. With the advent of modern computers and information technology, biological data have not only been stored on computers in the form of databases but also processed using computational techniques to obtain useful results and the connections among them.

Universal Multiplication Equation for Efficient and Faster Multiplication

proposed. This is called universal because of its wide application to all types of numbers. It works without any assumptions, whatever the numbers are. It evolved from continuous study of numbers and of how the answers were generated when different types of numbers underwent multiplication. Using the same equation, it has been proved why zero multiplied by any number is zero and why a negative multiplied by a negative is positive. This equation can further be used very efficiently for fast mental calculations and calculations in competitive exams. It can also be used in math coprocessors in computers. Algorithms can be developed using this equation for faster multiplication in multipliers (FPGAs), reducing processing time and power consumption and hence increasing efficiency.

Fractal MapReduce decomposition of sequence alignment

The Universal Sequence Map (USM) procedure expands the Chaos Game Representation (CGR) approach to “alignment-free” analysis of sequences of any alphabet. Not only is the sequence comparison procedure described here performed without recourse to dynamic programming alignment, but multiple layers of nested map-reduce distribution provide maximally parallelized workflows to find the length of the similar segment shared by any two sequence units. If this basic alignment operation can be streamlined by the USM procedure into the scalable and distributed processing form described here, the expectation is that other sequence analysis operations can be similarly decomposed, including more advanced types of alignment procedures. This may be particularly significant given the large amount of sequence information now being generated by NextGen methodologies. The proposed MapReduce decomposition was implemented in “the language of the web”, JavaScript (ECMAScript), both out of convenience and in arguable anticipation of the native use of web browsers for distributed computing.

Better, Faster, Stronger Sequence Tagging Constituent Parsers

Sequence tagging models for constituent parsing are faster, but less accurate than other types of parsers. In this work, we address the following weaknesses of such constituent parsers: (a) high error rates around closing brackets of long constituents, (b) large label sets, leading to sparsity, and (c) error propagation arising from greedy decoding. To effectively close brackets, we train a model that learns to switch between tagging schemes. To reduce sparsity, we decompose the label set and use multi-task learning to jointly learn to predict sublabels. Finally, we mitigate issues from greedy decoding through auxiliary losses and sentence-level fine-tuning with policy gradient. Combining these techniques, we clearly surpass the performance of sequence tagging constituent parsers on the English and Chinese Penn Treebanks, and reduce their parsing time even further. On the SPMRL datasets, we observe even greater improvements across the board, including a new state of the art on Basque, Hebrew, Polish and Swedish.

ProfileGrids: a sequence alignment visualization paradigm that avoids the limitations of Sequence Logos

The interactive JProfileGrid program viewer has features designed for the biologist user [30,31]. In software version 2.0, a new “overview” mode allows the visualization of the entire MSA within one window as either a ProfileGrid or as stacked sequences. Individual ProfileGrid cells can be selected to extract sequence subsets of interest during a visualization dissection. Sorting the residue rows by physical/chemical properties such as flexibility, helix propensity, hydropathy, and volume allows qualitative structural analyses to be performed. The detailed ProfileGrid window with the symbol counts has a new second pane to view different parts of the MSA at the same time. The “highlight” feature can identify residues that occur more or less often than a user-defined threshold of residue frequency. Large alignments can be separated into subsets of interest by using metadata filtering once JProfileGrid imports simple sequence annotations from flat file spreadsheet databases. The interactive features of the JProfileGrid program can be more easily appreciated by a live walk-through, as demonstrated by my 2013 BioVis Data Contest movie submission [34].