Future Work - Quantifying Regional-Scale Water Storage Using Models and Observations: Applicati

Chapter 7 Conclusion

7.2 Future Work

As QuasiNovo continues to be developed, several improvements and extensions are anticipated. From a software engineering perspective, much of this work serves as a proof of concept, and a capable programmer could find several areas of the software that would benefit from optimization. For example, much of the SNN is written in the scripting language Ruby. While Ruby is an excellent language for rapid prototyping, it can be orders of magnitude slower than a compiled language, so this component of the software will be refactored accordingly to reduce runtime. The scoring and reranking components of the software are written in C++; however, we expect that the memory footprint during candidate generation can be reduced dramatically, which would also reduce runtime because a smaller set of candidates would need to be scored and ranked.

The majority of the planned developments for the scientific aspects of the research concern the creation and analysis of AAU distributions. While we have already conducted several studies of different AAU distributions, it is important to continue compiling and investigating AAU distributions at different taxonomic levels and of varying composition. AAU distributions that vary according to GC content, protein family, and proteotypic propensity will be explored, and various information-theoretic measures of the distributions will be studied.
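As an illustration of the kind of information-theoretic measures mentioned above, the following minimal sketch computes a residue frequency distribution from a set of sequences, its Shannon entropy, and the Kullback-Leibler divergence between two such distributions. It assumes that an AAU distribution is the relative frequency of each of the 20 standard amino acids; the function names, the pseudocount, and the toy sequences are hypothetical and are not taken from QuasiNovo.

```ruby
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'.chars.freeze

# Relative frequency of each standard residue across the sequences.
# The pseudocount (an assumption of this sketch) keeps every
# probability strictly positive.
def aau_distribution(sequences, pseudocount = 1.0)
  counts = {}
  AMINO_ACIDS.each { |aa| counts[aa] = pseudocount }
  sequences.each do |seq|
    seq.each_char { |aa| counts[aa] += 1.0 if counts.key?(aa) }
  end
  total = counts.values.sum
  counts.transform_values { |c| c / total }
end

# Shannon entropy in bits.
def shannon_entropy(dist)
  bits = dist.values.sum { |p| p.zero? ? 0.0 : p * Math.log2(p) }
  -1.0 * bits
end

# Kullback-Leibler divergence D(p || q) in bits.
def kl_divergence(p, q)
  p.sum { |aa, pa| pa.zero? ? 0.0 : pa * Math.log2(pa / q.fetch(aa)) }
end

uniform = aau_distribution([])            # pseudocounts only: 1/20 each
skewed  = aau_distribution(['GGGGGGGG'])  # toy, glycine-rich input
puts shannon_entropy(uniform)             # equals log2(20)
puts shannon_entropy(skewed)
puts kl_divergence(skewed, uniform)
```

A uniform distribution over 20 residues has the maximum possible entropy, log2(20) bits, so lower values indicate a more biased composition. Distributions compiled for different taxa or protein families could be compared pairwise with the divergence function in the same way.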


This research was supported by NSF award 0959427 and a grant from the Sloan Foundation Indoor Air Program. Some of the experiments were run on an SGI Altix 4700 system with 128 computing cores and 256 GB of shared memory funded by NSF award 0708391.

Appendix A

Additional Figures and Listings

[Figure A.1 image not reproduced. Axis labels: G A S P V T C I L N D Q K E M H F R Y W on one axis and G A S P V T C I L N D Q E M H F Y W on the other.]

Figure A.1: Pair-wise cleavage probability for b-/y-ions from peptides that have no internal K/R and end in K/R, i.e., peptides matching the sequence motif regular expression /^[^KR]*[KR]$/. Black indicates a probability of zero, and white indicates a probability of one.

[Figure A.2 image not reproduced. Axis labels: G A S P V T C I L N D Q K E M F Y W on one axis and G A S P V T C I L N D Q E M F Y W on the other.]

Figure A.2: Pair-wise cleavage probability for b-/y-ions from peptides that have no internal K/R/H, at least one internal P, and end in K, i.e., peptides matching the sequence motif regular expression /^[^HKR]*P[^HKR]*[K]$/. Black indicates a probability of zero, and white indicates a probability of one.

[Figure A.3 image not reproduced. Axis labels: G A S P V T C I L N D Q E M F R Y W on one axis and G A S P V T C I L N D Q E M F Y W on the other.]

Figure A.3: Pair-wise cleavage probability for b-/y-ions from peptides that have no internal K/R/H, at least one internal P, and end in R, i.e., peptides matching the sequence motif regular expression /^[^HKR]*P[^HKR]*[R]$/. Black indicates a probability of zero, and white indicates a probability of one.

[Figure A.4 image not reproduced. Axis labels: G A S V T C I L N D Q K E M F R Y W on one axis and G A S V T C I L N D Q E M F Y W on the other.]

Figure A.4: Pair-wise cleavage probability for b-/y-ions from peptides that have no internal K/R/H/P and end in K/R, i.e., peptides matching the sequence motif regular expression /^[^PHKR]*[KR]$/. Black indicates a probability of zero, and white indicates a probability of one.
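The four sequence motifs in the captions of Figures A.1 to A.4 can be applied directly as Ruby regular expressions. The sketch below is only an illustration: the constant `MOTIFS` and the helper `matching_figures` are hypothetical names, and the example peptides are invented. Note that the motif of Figure A.1 is more general than the other three, so a peptide may match more than one figure.

```ruby
# Motifs as given in the figure captions, keyed by figure number.
MOTIFS = {
  'A.1' => /^[^KR]*[KR]$/,          # no internal K/R, ends in K/R
  'A.2' => /^[^HKR]*P[^HKR]*[K]$/,  # no internal K/R/H, has P, ends in K
  'A.3' => /^[^HKR]*P[^HKR]*[R]$/,  # no internal K/R/H, has P, ends in R
  'A.4' => /^[^PHKR]*[KR]$/         # no internal K/R/H/P, ends in K/R
}.freeze

# Returns the figure numbers whose motif the peptide matches.
def matching_figures(peptide)
  MOTIFS.select { |_, motif| peptide.match?(motif) }.keys
end

p matching_figures('PEPTIDEK')  # => ["A.1", "A.2"]
p matching_figures('GASVTR')    # => ["A.1", "A.4"]
p matching_figures('AKLR')      # => [] (internal K)
```

In Ruby, `^` and `$` anchor to line boundaries rather than string boundaries. For single-line peptide strings this makes no difference, but `\A` and `\z` would be the stricter choice if input might contain newlines.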

[Figure A.5 plot not reproduced. Axis label: tag mass collisions (rounded to 1/10 Da). Tick marks: 0.1, 1, 10, 100 on one axis; 100 to 350 in steps of 50 on the other.]

Figure A.5: Unique tag masses up to pairs (single missing peak in the b-/y-ion ladder) that collide within 0.1 Da.

[Figure A.6 plot not reproduced. Axis label: tag mass collisions (rounded to 1/10 Da). Tick marks: 0.1, 1, 10, 100 on one axis; 100 to 550 in steps of 50 on the other.]

Figure A.6: Unique tag masses up to triplets (two sequential missing peaks in the b-/y-ion ladder) that collide within 0.1 Da.
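The collisions summarized in Figures A.5 and A.6 can be reproduced in spirit with a short enumeration. The sketch below is a simplified illustration, not the procedure used to generate the figures: it assumes standard monoisotopic residue masses for unmodified amino acids, merges the isobaric residues I and L into a single entry, and approximates "within 0.1 Da" by rounding each tag mass to the nearest 0.1 Da. The names `RESIDUE_MASS` and `tag_mass_bins` are hypothetical.

```ruby
# Monoisotopic residue masses in Da (I is merged with L in this sketch).
RESIDUE_MASS = {
  'G' => 57.02146,  'A' => 71.03711,  'S' => 87.03203,  'P' => 97.05276,
  'V' => 99.06841,  'T' => 101.04768, 'C' => 103.00919, 'L' => 113.08406,
  'N' => 114.04293, 'D' => 115.02694, 'Q' => 128.05858, 'K' => 128.09496,
  'E' => 129.04259, 'M' => 131.04049, 'H' => 137.05891, 'F' => 147.06841,
  'R' => 156.10111, 'Y' => 163.06333, 'W' => 186.07931
}.freeze

# Groups every residue composition of 1..max_len residues by its mass
# rounded to 0.1 Da. Keys are integers in units of 0.1 Da.
def tag_mass_bins(max_len = 2)
  bins = Hash.new { |hash, key| hash[key] = [] }
  residues = RESIDUE_MASS.keys
  (1..max_len).each do |len|
    residues.repeated_combination(len) do |combo|
      mass = combo.sum { |aa| RESIDUE_MASS[aa] }
      bins[(mass * 10).round] << combo.sort.join
    end
  end
  bins
end

# Bins holding more than one composition are collisions.
collisions = tag_mass_bins(2).select { |_, tags| tags.size > 1 }
collisions.sort.each do |key, tags|
  puts format('%.1f Da: %s', key / 10.0, tags.join(', '))
end
```

For example, the pair GG and the single residue N both fall in the 114.0 Da bin, and the pair AG shares the 128.1 Da bin with Q and K. Calling `tag_mass_bins(3)` extends the enumeration to triplets, corresponding to two sequential missing peaks.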

def longest_common_subsequence_in_place(p1, p2, tol = 0.5,
                                        isobaric_equivalence = false)
  return 0 if p1.length == 0 || p2.length == 0

  # num[i][j] holds the score for the prefixes p1[0..i] and p2[0..j].
  num = Array.new(p1.length) { Array.new(p2.length) }
  p1.compute_parent_mass
  p2.compute_parent_mass
  p1_mass_N = p1.n_offset
  p1_mass_C = p1.mass
  # Cache the total mass of p2 now: gsub below may return a plain
  # String that no longer responds to mass.
  p2_total_mass = p2.mass
  if isobaric_equivalence
    # Treat the isobaric residues I and L as the same amino acid.
    p1 = p1.gsub(/[I]/, 'L')
    p2 = p2.gsub(/[I]/, 'L')
  end
  (0...p1.length).each do |i|
    p1_mass_N += AA2MASS[p1[i..i]] # mass of amino acid
    p1_mass_C -= AA2MASS[p1[i..i]]
    p2_mass_N = 0.0
    p2_mass_C = p2_total_mass
    (0...p2.length).each do |j|
      p2_mass_N += AA2MASS[p2[j..j]]
      p2_mass_C -= AA2MASS[p2[j..j]]
      # Residues match only if they are identical and their N-terminal
      # or C-terminal masses agree within the tolerance.
      if p1[i..i] == p2[j..j] &&
         ((p1_mass_N - p2_mass_N).abs <= tol ||
          (p1_mass_C - p2_mass_C).abs <= tol)
        if i == 0 || j == 0
          num[i][j] = 1
        else
          num[i][j] = 1 + num[i - 1][j - 1]
        end
      elsif i == 0 && j == 0
        num[i][j] = 0
      elsif i == 0 # first row
        num[i][j] = [0, num[i][j - 1]].max
      elsif j == 0 # first column
        num[i][j] = [0, num[i - 1][j]].max
      else
        num[i][j] = [num[i - 1][j], num[i][j - 1]].max
      end
    end
  end
  num[p1.length - 1][p2.length - 1]
end
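The listing above depends on peptide objects and an AA2MASS table that are defined elsewhere in the software. As a self-contained sketch of the same idea, the version below works on plain strings. It is an approximation under stated assumptions: the mass table `MASS` is a hypothetical, abbreviated table of monoisotopic residue masses, the N-terminal offset is taken to be zero, and C-terminal masses are computed from residue masses only. The names `prefix_masses` and `mass_aware_lcs` are illustrative.

```ruby
# Hypothetical, abbreviated residue mass table (monoisotopic, in Da).
MASS = {
  'G' => 57.02146, 'A' => 71.03711, 'S' => 87.03203, 'P' => 97.05276,
  'V' => 99.06841, 'T' => 101.04768, 'L' => 113.08406, 'N' => 114.04293,
  'D' => 115.02694, 'K' => 128.09496, 'E' => 129.04259
}.freeze

# Cumulative N-terminal (prefix) mass after each residue.
def prefix_masses(seq)
  total = 0.0
  seq.chars.map { |aa| total += MASS.fetch(aa) }
end

# Longest common subsequence length in which two identical residues may
# only be matched if their prefix masses or their suffix masses agree
# within tol.
def mass_aware_lcs(a, b, tol = 0.5)
  return 0 if a.empty? || b.empty?

  pa = prefix_masses(a)
  pb = prefix_masses(b)
  # An extra row and column of zeros removes the boundary special cases.
  dp = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  a.each_char.with_index(1) do |x, i|
    b.each_char.with_index(1) do |y, j|
      n_ok = (pa[i - 1] - pb[j - 1]).abs <= tol
      c_ok = ((pa[-1] - pa[i - 1]) - (pb[-1] - pb[j - 1])).abs <= tol
      dp[i][j] = if x == y && (n_ok || c_ok)
                   dp[i - 1][j - 1] + 1
                 else
                   [dp[i - 1][j], dp[i][j - 1]].max
                 end
    end
  end
  dp[a.length][b.length]
end

puts mass_aware_lcs('GAVL', 'AGVL')          # => 2
puts mass_aware_lcs('GAVL', 'AGVL', 1000.0)  # => 3
```

With the default tolerance, the swapped G and A cannot be aligned because neither their prefix nor their suffix masses agree, so only V and L count. With a very large tolerance the mass constraint is effectively disabled and the result is the ordinary longest common subsequence length.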
