Parametric Sequence Alignment

4.7 Notes and Comments

Parametric issues in optimization, especially linear programming, have been studied since the 1950s. Parametric linear programming, where the coefficients of the objective function are variable, was initially formulated by Gass and Saaty [24]. In the terminology of this chapter, Gass and Saaty presented a simplex-based algorithm for one-dimensional sensitivity analysis of parametric linear programs. The method can be used for construction and search as well. The combinatorial complexity of parametric linear programming was studied by Murty [39], who showed that there exists a parametric linear program with n variables and 2n constraints that has 2ⁿ basic feasible solutions, each of which is the unique optimal solution for some suitably chosen value of the parameter. Parametric versions of many other optimization problems have been studied, and bounds have been established for several of them. A sampling of the parametric optimization problems considered in the literature includes network flows [23], stable marriage [30], matroid optimization [12], and scheduling [32].

Many of the approaches discussed in this chapter are specializations of more general techniques to sequence comparison. The geometric definitions and results of Section 4.2.1 are adapted from Agarwal and Sharir’s text [2]. Megiddo’s parametric search technique originally appeared in [36] as a method for solving optimization problems with rational objective functions. An improvement based on simulating parallel algorithms instead of sequential ones is also due to Megiddo [37]. The application of Megiddo’s method to sensitivity analysis was first investigated by Gusfield [26, 27]. Ray shooting is an important problem in computer graphics, where it is used to detect and remove hidden surfaces and to compute the intersection of polyhedra; Agarwal and Matoušek describe these and other geometric applications of parametric search in [1] (see also Salowe’s survey [48]). Newton’s zero-finding algorithm and the gradient descent method are classical algorithms that can be traced back to Newton and Cauchy, respectively. Radzik [47] describes the application of Newton’s method to fractional combinatorial optimization problems. The gradient descent method for optimization is well known and discussed in many textbooks [44, 40].
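To make the idea of Newton's method for fractional optimization concrete, here is a minimal sketch. It assumes the feasible set is given as an explicit list of (a, b) pairs with b > 0 and maximizes a/b. The function name `newton_max_ratio` and the explicit list are illustrative assumptions; in a combinatorial setting the `max` call would be replaced by an optimization algorithm over an implicitly defined set. It is not a transcription of the algorithm in [47].

```python
from fractions import Fraction

def newton_max_ratio(candidates, delta=Fraction(0)):
    """Maximize a/b over (a, b) pairs with b > 0 by Newton's method.

    h(delta) = max over candidates of (a - delta*b) is convex, piecewise
    linear and strictly decreasing in delta; its unique zero is the best ratio.
    """
    while True:
        # Oracle call: best candidate for the current linear objective.
        a, b = max(candidates, key=lambda p: p[0] - delta * p[1])
        if a - delta * b == 0:
            return delta  # h(delta) = 0, so delta is the optimal ratio
        # Newton step: move to the zero of the line that attains h(delta).
        delta = Fraction(a, b)

# Hypothetical input: four candidates with ratios 3/2, 5/4, 1 and 7/5.
best = newton_max_ratio([(3, 2), (5, 4), (1, 1), (7, 5)])
```

Exact rational arithmetic is used so that the termination test `h(delta) == 0` is reliable; with floating-point values a tolerance would be needed instead.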

Polyak [43] was among the first to study the subgradient method’s theoretical aspects. Held and Karp [33] were the first to apply the method to mathematical programming problems.

Parametric sequence comparison was first considered by Fitch and Smith [22], who studied the effect of varying the gap penalty on the optimum alignment of two sample sequences. By careful analysis, they showed that there are 7 and 11 different optimal alignments (optimality regions) for their sample pair when end gaps are weighted and unweighted, respectively.

Waterman et al. [52] proposed a systematic way of finding the optimality regions. Vingron and Waterman [51] studied the implications of parameter choice through a series of case studies. Independently of Waterman et al.’s work, Gusfield et al. [29] formally defined parametric alignment and gave the first bounds on the number of regions. Among their results is the O(n^(2/3)) bound on the number of optimality regions for global alignment with zero gap penalty presented in 4.3.1. Fernández-Baca et al. [18] prove that this bound is tight when the alphabet size is unbounded; in fact, it is the only combinatorial complexity bound for parametric sequence comparison known to be exact. The best known lower bound when the alphabet size is bounded is Ω(√n) [18]. The properties of parametric alignment problems with feature-based scoring schemes were first investigated by Fernández-Baca et al. [19], who obtained combinatorial bounds for several problems, such as multiple sequence alignment and phylogeny construction, by observing that they all have a similar integer parametric nature. Tighter bounds (Theorem 4.2) are due to Pachter and Sturmfels [42] and Fernández-Baca and Venkatachalam [21].

Gusfield [26] attributes the one-parameter construction algorithm of 4.5.1 to Eisner and Severance [16]. The two-parameter construction algorithm presented in 4.5.2 is due to Fernández-Baca and Srinivasan [20]. Zimmer and Lengauer [53] used this algorithm in their parametric sequence alignment software. The techniques for reconstructing multi-dimensional convex geometric objects through probing, upon which Theorem 4.7 is based, were developed by Dobkin et al. [13, 14], who extended the work on two-dimensional probing by Cole and Yap [10].
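As an illustration of one-parameter construction in the spirit of Eisner and Severance, the following sketch finds the breakpoints of a convex piecewise-linear function f(λ) = max over solutions of (slope·λ + intercept) using only evaluations at fixed parameter values. The names `envelope_breakpoints` and `oracle` are hypothetical. For alignment, the oracle would be a fixed-parameter alignment algorithm that returns the linear score function of an optimal alignment; here it is simulated with an explicit list of lines. This is a sketch under those assumptions, not the exact procedure of 4.5.1.

```python
from fractions import Fraction

def envelope_breakpoints(oracle, lo, hi):
    """Breakpoints in [lo, hi] of f(lam) = max over solutions of slope*lam + intercept.

    oracle(lam) must return the (slope, intercept) pair of some solution
    that is optimal at lam.
    """
    def value(line, lam):
        return line[0] * lam + line[1]

    found = []

    def explore(left, right):
        # left is optimal at the left end of the interval, right at the right end.
        if left[0] == right[0]:
            return  # same linear piece: no breakpoint in between
        # Parameter value at which the two lines cross.
        lam = Fraction(right[1] - left[1], left[0] - right[0])
        mid = oracle(lam)
        if value(mid, lam) == value(left, lam):
            found.append(lam)  # no line beats the pair here: a true breakpoint
        else:
            explore(left, mid)   # a third line wins at lam; split around it
            explore(mid, right)

    explore(oracle(lo), oracle(hi))
    return found

# Hypothetical stand-in for an alignment oracle: an explicit set of lines.
lines = [(0, 4), (1, 2), (2, -1), (1, 0)]
oracle = lambda lam: max(lines, key=lambda ln: ln[0] * lam + ln[1])
breaks = envelope_breakpoints(oracle, 0, 10)
```

Each oracle call either confirms a breakpoint or discovers a new linear piece, so the number of evaluations grows with the number of pieces rather than with the number of candidate solutions.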

Gusfield and Stelling’s publicly available XPARAL system [31] implements the ray-shooting approach (using Newton’s method) for two-parameter sensitivity analysis described in 4.4.6 and applies it to construct the maximization diagram for two-parameter alignment problems under a wide variety of scoring functions. While, in principle, ray shooting can be used for sensitivity analysis (and, hence, construction) for any number of parameters, there appears to be no reference to this in the literature. One way to achieve this generalization is to use the probing idea mentioned above; the probes here are ray-shooting queries, instead of evaluations. Each probe returns a supporting hyperplane of the region being generated. The results of [13, 14] imply a number of queries proportional to the number of vertices and facets of the region.

Pachter and Sturmfels [42, 41] describe an implementation of the lifting algorithm for the construction problem mentioned at the beginning of 4.5. Their software relies on Gawrilow and Joswig’s polymake tool [25].

XPARAL solves the inverse alignment problem using the gradient descent method described in 4.4.4. Sun et al. [50] give efficient algorithms for inverse sequence alignment with and without gaps that exploit the properties of feature-based scoring. Other inverse parametric optimization problems are studied by Eppstein [17].

For general background on hidden Markov models, a good starting point is Rabiner’s survey article [45]. HMMs were first used for sequence alignment by Borodovsky et al. [6, 7, 8]. Durbin et al. [15] and Baldi and Brunak [4] give good introductions to the application of HMMs to sequence alignment and bioinformatics in general. Pachter and Sturmfels [42, 41] build a mathematical theory of statistical models for biological applications and show connections between parametric analysis and statistical models.

References

[1] P. Agarwal and J. Matoušek. Ray shooting and parametric search. SIAM Journal on Computing, 22:794–806, 1993.

[2] P. K. Agarwal and M. Sharir. Davenport-Schinzel Sequences and their Geometric Applications. Cambridge University Press, Cambridge–New York–Melbourne, 1995.

[3] R. Agarwala and D. Fernández-Baca. Weighted multidimensional search and its application to convex optimization. SIAM Journal on Computing, 25:83–99, 1996.

[4] P. Baldi and S. Brunak. Bioinformatics: The machine learning approach. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2nd edition, 2001.

[5] C.B. Barber, D.P. Dobkin, and H. Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22(4):469–483, 1996.

[6] M. Borodovsky, Yu. Sprizhitsky, E. Golovanov, and A. Alexandrov. Statistical patterns in primary structures of functional regions in the E. coli genome: I. Oligonucleotide frequencies analysis. Molecular Biology, 20:826–833, 1986.

[7] M. Borodovsky, Yu. Sprizhitsky, E. Golovanov, and A. Alexandrov. Statistical patterns in primary structures of functional regions in the E. coli genome: II. Non-homogeneous Markov models. Molecular Biology, 20:833–840, 1986.

[8] M. Borodovsky, Yu. Sprizhitsky, E. Golovanov, and A. Alexandrov. Statistical patterns in primary structures of functional regions in the E. coli genome: III. Computer recognition of coding regions. Molecular Biology, 20:1145–1150, 1986.

[9] E. Cohen and N. Megiddo. Maximizing concave functions in fixed dimension. In P. M. Pardalos, editor, Complexity in Numerical Optimization, pages 74–87. World Scientific, Singapore, 1993.

[10] R. Cole and C. Yap. Shape from probing. Journal of Algorithms, 8:19–38, 1987.

[11] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, Berlin, 2nd edition, 2000.

[12] T. Dey. Improved bounds on planar k-sets and related problems. Discrete and Computational Geometry, 19(3):373–382, 1998.

[13] D. Dobkin, H. Edelsbrunner, and C.K. Yap. Probing convex polytopes. In Proceedings of the 18th Annual ACM Symposium on Theory of Computing, pages 424–432, 1986.

[14] D. Dobkin, H. Edelsbrunner, and C.K. Yap. Probing convex polytopes. In Cox and Wilfong, editors, Autonomous Robot Vehicles, pages 328–341. Springer-Verlag, 1990.

[15] R. Durbin, S.R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

[16] M.J. Eisner and D.G. Severance. Mathematical techniques for efficient record segmentation in large shared databases. Journal of the Association for Computing Machinery, 23:619–635, 1976.

[17] D. Eppstein. Setting parameters by example. SIAM Journal on Computing, 32(3):643–653, 2003.

[18] D. Fernández-Baca, T. Seppäläinen, and G. Slutzki. Bounds for parametric sequence comparison. Discrete Applied Mathematics, 118:181–198, 2002.

[19] D. Fernández-Baca, T. Seppäläinen, and G. Slutzki. Parametric multiple sequence alignment and phylogeny construction. Journal of Discrete Algorithms, 2:271–287, 2004. Special issue on Combinatorial Pattern Matching, R. Giancarlo and D. Sankoff, eds.

[20] D. Fernández-Baca and S. Srinivasan. Constructing the minimization diagram of a two-parameter problem. Operations Research Letters, 10:87–93, 1991.

[21] D. Fernández-Baca and B. Venkatachalam. Parametric analysis, duality, and lattice polytopes. Unpublished manuscript, May 2004.

[22] W.M. Fitch and T.F. Smith. Optimal sequence alignments. Proceedings of the National Academy of Sciences USA, 80:1382–1386, 1983.

[23] G. Gallo, M.D. Grigoriadis, and R.E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18:30–55, 1989.

[24] S.I. Gass and T. Saaty. The computational algorithm for the parametric objective function. Naval Research Logistics Quarterly, 2:39–45, 1955.

[25] E. Gawrilow and M. Joswig. polymake: an approach to modular software design in computational geometry. In Proceedings of the 17th Annual Symposium on Computational Geometry, pages 222–231. ACM Press, 2001.

[26] D. Gusfield. Sensitivity analysis for combinatorial optimization. Technical Report UCB/ERL M80/22, University of California, Berkeley, May 1980.

[27] D. Gusfield. Parametric combinatorial computing and a problem in program module allocation. Journal of the Association for Computing Machinery, 30(3):551–563, 1983.

[28] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge–New York–Melbourne, 1997.

[29] D. Gusfield, K. Balasubramanian, and D. Naor. Parametric optimization of sequence alignment. Algorithmica, 12:312–326, 1994.

[30] D. Gusfield and R.W. Irving. Parametric stable marriage and minimum cuts. Information Processing Letters, 30:255–259, 1989.

[31] D. Gusfield and P. Stelling. Parametric and inverse-parametric sequence alignment with XPARAL. In Russell F. Doolittle, editor, Computer methods for macromolecular sequence analysis, volume 266 of Methods in Enzymology, pages 481–494. Academic Press, 1996.

[32] N.G. Hall and M.E. Posner. Sensitivity analysis for scheduling problems. Journal of Scheduling, 7(1):49–83, 2004.

[33] M. Held and R.M. Karp. The traveling salesman problem and minimum spanning trees: part II. Mathematical Programming, 1:6–25, 1971.

[34] H. Edelsbrunner. Algorithms in Combinatorial Geometry. Springer-Verlag, Heidelberg, 1987.

[35] P. McMullen. The maximum number of faces of a convex polytope. Mathematika, 17:179–184, 1970.

[36] N. Megiddo. Combinatorial optimization with rational objective functions. Math. Oper. Res., 4:414–424, 1979.

[37] N. Megiddo. Applying parallel computation algorithms in the design of serial algorithms. Journal of the Association for Computing Machinery, 30(4):852–865, 1983.

[38] D. Mount. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Press, Cold Spring Harbor, New York, 2001.

[39] K. Murty. Computational complexity of parametric linear programming. Math. Programming, 19:213–219, 1980.

[40] G.L. Nemhauser and L.A. Wolsey. Integer and Combinatorial Optimization. Wiley-Interscience Series in Discrete Mathematics and Optimization. John Wiley & Sons, 1988.

[41] L. Pachter and B. Sturmfels. Parametric inference for biological sequence analysis. Proceedings of the National Academy of Sciences USA, 101(46):16138–16143, 2004.

[42] L. Pachter and B. Sturmfels. Tropical geometry of statistical models. Proceedings of the National Academy of Sciences USA, 101(46):16132–16137, 2004.

[43] B.T. Polyak. A general method for solving extremal problems. Soviet Math. Dokl., 8:593–597, 1967.

[44] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge (UK) and New York, 2nd edition, 1992.

[45] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[46] T. Radzik. Algorithms for some linear and fractional combinatorial optimization problems. Department of Computer Science, Stanford University, Stanford, CA 94305, August 1992.

[47] T. Radzik. Newton’s method for fractional combinatorial optimization. In 33rd Annual Symposium on Foundations of Computer Science, pages 659–669, Pittsburgh, PA, October 1992. IEEE.

[48] J. Salowe. Parametric search. In J.E. Goodman and J. O’Rourke, editors, Handbook of Discrete and Computational Geometry, chapter 37, pages 683–696. CRC Press LLC, Boca Raton, FL, 1997.

[49] R. Seidel. Convex hull computations. In J.E. Goodman and J. O’Rourke, editors, Handbook of Discrete and Computational Geometry, chapter 19, pages 361–376. CRC Press LLC, Boca Raton, FL, 1997.

[50] F. Sun, D. Fernández-Baca, and W. Yu. Inverse parametric sequence alignment. Journal of Algorithms, 53(1):36–54, 2004.

[51] M. Vingron and M.S. Waterman. Sequence alignment and penalty choice: Review of concepts, case studies, and implications. J. Mol. Biol., 235:1–12, 1994.

[52] M.S. Waterman, M. Eggert, and E. Lander. Parametric sequence comparisons. Proceedings of the National Academy of Sciences USA, 89:6090–6093, 1992.

[53] R. Zimmer and Th. Lengauer. Fast and numerically stable parametric alignment of biosequences. In Proceedings of RECOMB 97, Santa Fe, NM, pages 344–353. ACM Press, 1997.

In document Handbook of Computational Molecular Biology (Page 134-140)