Statistical Inference for Big Data Problems in Molecular Biophysics


Arvind Ramanathan^1, Andrej Savol^2,4, Virginia Burger^2,4, Shannon Quinn^2,4, Pratul K. Agarwal^3, Chakra Chennubhotla^4

^1 Computational Data Analytics Group, Computational Science and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830

^2 Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology

^3 Computational Biology Institute, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830

^4 Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260

Abstract

We highlight the role of statistical inference techniques in extracting biological insights from long time-scale molecular simulation data. Technological and algorithmic improvements in computation have brought molecular simulations to the forefront of techniques applied to investigating the basis of living systems. These longer, increasingly complex simulations are reaching petabyte scales. While these simulations promise important insights into the mechanisms of biomolecular function, teasing out the information that provides insight into atomistic-scale behavior has become a challenge in its own right. Mining this data for important and biologically relevant patterns is critical to automating therapeutic intervention discovery, improving protein design, and fundamentally understanding the mechanistic basis of cellular homeostasis.

1 Molecular Biophysics

Biological macromolecules such as proteins, deoxyribonucleic and ribonucleic acids (DNA/RNA), carbohydrates and lipids play diverse roles in regulating cellular functions, and are thus essential to sustaining life. To investigate and understand the mechanistic basis of biomolecular function, biophysicists have, over the last 30 years, taken advantage of advances in computing power to run increasingly detailed simulations of these biomolecules. In particular, much attention has been paid to the simulation of proteins, which are often considered the workhorses of a cell; they are long polymers of amino-acid residues that fold into three-dimensional structures to perform their function. Biological functions are controlled by the dynamical interactions between various biomolecules and can occur at multiple time-scales, from femtoseconds up to microseconds, milliseconds, seconds and beyond, spanning more than 15 orders of magnitude.

In this paper, we focus on fully-atomistic simulations of proteins/biomolecules in solution as they best represent the cellular environment. Molecular dynamics (MD) simulations provide insights into the dependence of biological function on interactions at multiple length and time scales. MD simulations are governed by a potential energy function that includes both bonded and non-bonded interaction terms. The negative gradient of this energy function (the force field) gives the force acting on each atom in the molecule. At each time step, Newton's laws of motion are integrated to generate a trajectory. A time step on the order of a femtosecond (10^-15 s) is necessary to capture the smallest vibrations of interest and to ensure numerical stability when integrating the equations of motion as simulations progress. However, biologically interesting events (related to biomolecular function) typically occur at microsecond (10^-6 s) and longer time scales. With improvements in sampling techniques and available hardware resources, MD simulations have successfully crossed the millisecond (10^-3 s) time-scale barrier [1] and continue to provide novel insights into the functioning of biomolecular systems.
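The integration step described above can be illustrated with a minimal sketch. The text does not name an integrator, so the velocity Verlet scheme used here is an assumption (it is a common choice in MD codes), and the one-dimensional harmonic "bond", the function name `velocity_verlet`, and all parameter values are hypothetical, chosen only to keep the example small.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations for one particle in 1D.

    force(x) must return the negative gradient of the potential energy.
    Returns the list of visited positions (the trajectory) and the final velocity.
    """
    trajectory = [x]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append(x)
    return trajectory, v


# Toy harmonic "bond" (an assumption): U(x) = 0.5 * k * x**2, so F(x) = -k * x.
k, mass = 1.0, 1.0
traj, v_end = velocity_verlet(x=1.0, v=0.0, force=lambda x: -k * x,
                              mass=mass, dt=0.01, n_steps=1000)

# Total energy should be nearly conserved when the time step is small
# relative to the fastest vibration in the system.
energy_start = 0.5 * k * 1.0 ** 2
energy_end = 0.5 * k * traj[-1] ** 2 + 0.5 * mass * v_end ** 2
```

Increasing `dt` toward the oscillation period makes the energy drift and eventually diverge, which is the numerical-stability constraint that forces femtosecond-scale steps in atomistic simulations.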

Assuming that a typical protein has O(1000) atoms, representing a protein in full atomistic detail in Cartesian space (x, y, z) requires O(3000) single-precision numbers. Even if one stores the output from MD simulations at 100 ps (10^-10 s) intervals, assuming regular access to millisecond-long simulations in the near future, typical datasets would have somewhere between 10^12 and 10^15 conformations, which could easily occupy several terabytes of storage. Indeed, datasets of this order are being made available online by several research groups [1, 2, 3], and one can certainly expect more of these datasets in the near future. Such large-scale (and potentially large-volume) datasets from molecular simulations pose a significant big-data challenge.

The purpose of MD simulations is to reveal the mechanistic basis of protein function. Traditionally, biophysicists have relied on the availability of order parameters (e.g., experimental observables such as dihedral angles, distance constraints between different parts of the protein/biomolecule, or other thermodynamical measurements) as a means of analyzing these simulations. These 'knowledge-based parameters' are often difficult to obtain a priori and can be considered a biological filter applied to the large dataset, from which only a small number of functionally relevant dimensions are drawn. Given the experts' knowledge, these parameters were sufficient to analyze shorter time-scale simulations. However, with the growth in MD data sets (reaching petabyte scale), there is a need to develop automated tools that can discover potential order parameters as well as reveal novel (hitherto unknown) features of this complex energy landscape. Machine learning and statistical inference techniques offer new avenues for elucidating novel relationships in the conformational landscape. Hence, our goal in this paper is to bridge machine learning with molecular biophysics in the hope of discovering new biology.

2 Computational Challenges

How can statistical inference tools help sift through petascale simulation data to reveal the organizing principles of conformational landscapes? From a biologist's viewpoint, the primary goal of performing simulations is to reveal mechanistic insights into the function of a biomolecular system. To this end, inference tools must (1) enable statistical insights into the time-dependent structural changes that the biomolecule undergoes in the course of a simulation; (2) exploit these structural dependencies to build a biophysically/biochemically relevant low-dimensional representation of the simulation data; (3) use the low-dimensional representations to generate kinetically and energetically coherent conformational sub-states; and finally (4) draw causal relations from time-dependent structural changes within each conformational sub-state to implicate functionally relevant residues. We discuss some of these issues in more detail below.

Building statistical insights into time-dependent structural changes that a protein undergoes in the course of a simulation

Naturally occurring ensembles, such as images and sounds, have been shown to possess scale-invariant statistics. Such invariances, however, may be hard to find in molecular simulation data, because protein data is more likely to resemble an object-specific ensemble, such as a dataset of face images, which is known to exhibit multi-scale behavior but not in a convenient scale-invariant form. The physical basis for this observation is that thermal fluctuations allow the molecule to cross energy barriers and tumble to places far removed from the starting configuration (Figure 1). Moreover, recent experimental evidence suggests that these rare, low-probability excursions from the mean conformation may have a significant bearing on biological function, including protein folding, enzyme catalysis and molecular recognition [4, 5, 6]. Thus, the algorithmic task of sufficiently describing these rare excursions and their properties, and of building rich representations from the simulation data, remains difficult.

McCammon's group published an early result on the long-tailed behavior of positional fluctuation data from picosecond-long simulation trajectories [9]. Recently, we documented this long-tail property in the positional data of ubiquitin and adenylate kinase simulations using microsecond time-scale trajectories [8, 10]. We observed that the long tails give rise to non-orthogonal, or anharmonic, couplings between the various moving parts of a biomolecule.


Figure 1: Biophysical insights gained from a statistical characterization of the simulation data. (A) Time-dependent structural changes are anharmonic, as shown by the long tails in the positional fluctuations from 0.5 µs long simulation data of a protein: ubiquitin. Log-histograms of the positional deviations from the mean conformation for (i) backbone carbon alpha atoms (red, kurtosis κ = 6.3), (ii) all atoms of the protein (blue, kurtosis κ = 8.2) and (iii) the best-fitting Gaussian to the carbon alpha data (dotted, kurtosis κ = 3). (B) Non-orthogonal or anharmonic coupling in the positional deviations of carbon alpha atoms of residues 31 and 45. PCA (black arrows) imposes orthogonality and misses the intrinsic orientation of this data, but a variant of JADE [7] (red arrows) that deconvolves fourth-order dependencies successfully captures the intrinsic anharmonic directions. (C) An emergent behavior from using higher-order statistics is the ability to discover energetically coherent clusters of conformations. (D) Biological insights derived from (C): coupled motions in different regions of the protein ubiquitin are important to substrate recognition (2D3G and 2G45). Figure modified from [8].

While the long tails hint at the use of independent component analysis algorithms, just like the natural datasets of images and sounds, we also observed that respecting the non-orthogonal correlations intrinsic to the data is the key to discovering energetically coherent clusters and building low-dimensional biophysically relevant projections.
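The long-tail diagnostic referred to above, kurtosis compared with the Gaussian value of 3, can be sketched as follows. This is an illustration on synthetic samples, not a reproduction of the ubiquitin analysis: the Laplace-distributed samples merely stand in for long-tailed positional deviations, and the function name `kurtosis` and the sample sizes are assumptions.

```python
import random


def kurtosis(samples):
    """Non-excess kurtosis: fourth central moment divided by squared variance.

    A Gaussian gives 3; values above 3 indicate heavier (longer) tails.
    """
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / (m2 ** 2)


rng = random.Random(0)

# Harmonic-like fluctuations: Gaussian samples.
gaussian = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Long-tailed stand-in: the difference of two exponential draws is
# Laplace-distributed, whose theoretical kurtosis is 6.
long_tailed = [rng.expovariate(1.0) - rng.expovariate(1.0)
               for _ in range(100_000)]

k_gauss = kurtosis(gaussian)
k_tail = kurtosis(long_tailed)
```

Applied per atom or per coordinate to deviations from the mean conformation, a value well above 3 would flag the kind of anharmonic fluctuation that second-order methods such as PCA do not model.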

Learning biophysically/biochemically relevant low-dimensional representations

MD simulations represent inherently high-dimensional datasets; however, the complex interactions between various parts of the protein imply that only a small number of dimensions may be relevant for its function. Dimensionality reduction methods are concerned with describing the complexities of molecular motions with far fewer variables such that important structural shifts can be visualized and interpreted. Conceptually, such approaches map each 3N-dimensional snapshot (for an N-atom biomolecular system) to a datapoint on a lower-dimensional manifold, but the nature and complexity of such a mapping are nontrivial, and no current method can claim optimality. Designing this embedding, or set of reaction coordinates, such that biophysically relevant transitions are observed has actually become more challenging because longer simulations necessarily access more structural diversity and rare transitions. A truly useful mapping must thus filter out thermal fluctuations, be sensitive to the non-Gaussian/anharmonic character of conformational shifts, and capture/describe rare (i.e. outlier) transitions.

Existing methods in the trajectory analysis literature [11] primarily identify basis vectors (in the 3N-dimensional conformational space) that align with high-variance directions. Principal component analysis (PCA) methods, and other adaptations from factor analysis [12], require strong assumptions about the nature of correlated motions within the studied biomolecules. We have shown that two such assumptions, namely linearity and orthogonality, are not valid for MD simulations [8, 10]. These findings are consistent with the mechanistic interpretation of intramolecular forces, where bonded and non-bonded interactions promote inherently anharmonic coupling between different parts of the biomolecule. We further emphasize two important drawbacks of existing dimensionality reduction methods. Firstly, current reaction coordinate formulations are not protein specific, meaning that neither the inherent properties of molecular interactions nor the restrictions on protein structure are considered. Secondly, existing methods scale poorly, requiring matrix operations that become intractable for the systems of greatest biological interest.
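The orthogonality assumption discussed above can be made concrete with a small synthetic sketch. Two long-tailed sources are mixed along directions that are 60 degrees apart (an assumed toy setup, not data from the cited studies), and PCA is computed in closed form for the resulting 2D points. The helper names `pca_2d`, `laplace` and `dot` are hypothetical.

```python
import math
import random


def pca_2d(points):
    """Principal axes of 2D points from the closed-form eigenvectors
    of the 2x2 covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2.0 * b, a - c)   # rotation diagonalizing the covariance
    pc1 = (math.cos(theta), math.sin(theta))   # highest-variance direction
    pc2 = (-math.sin(theta), math.cos(theta))  # orthogonal to pc1 by construction
    return pc1, pc2


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


rng = random.Random(1)


def laplace():
    # Difference of two exponential draws: a long-tailed (Laplace) sample.
    return rng.expovariate(1.0) - rng.expovariate(1.0)


# True coupling directions, deliberately non-orthogonal (60 degrees apart).
d1 = (1.0, 0.0)
d2 = (0.5, math.sqrt(3.0) / 2.0)

points = []
for _ in range(20000):
    s1, s2 = laplace(), laplace()
    points.append((s1 * d1[0] + s2 * d2[0], s1 * d1[1] + s2 * d2[1]))

pc1, pc2 = pca_2d(points)
```

In this setup the first principal axis falls roughly midway between `d1` and `d2`, so neither true direction is recovered; methods that use fourth-order statistics, such as the JADE variant cited in Figure 1, are designed to relax exactly this orthogonality constraint.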

Large-scale atomistic simulations became possible primarily through hardware speedups. Useful interpretation of molecular simulations, on the other hand, requires a physics-centric approach to analyzing collective atomic motions. Our experience in highlighting potential weaknesses of existing methods and suggesting alternatives will be critical in expanding the set of tools researchers can apply to molecular systems.

Clustering simulation data with biological and thermodynamical relevance

A second class of analysis techniques concerns grouping conformations from the simulations that share important structural or kinetic features. Clustering conformations presents a considerable challenge in that the data has a natural ordering, namely the temporal relationship between successive frames, that many other datasets lack. Clustering approaches that respect both the temporal associations among snapshots and their structural relationships can provide insight into how protein motions facilitate function, or dysfunction [2, 13]. Specifically, clustering enables the determination of a small number of parameters (referred to as order parameters) that are highly relevant for estimating rate kinetics and thermodynamics, both of which are directly measurable via experiments. Although many clustering techniques are regularly applied to MD simulations, the challenge is to determine which techniques can provide adequate insights into the biological underpinnings of function and relate them back seamlessly to experimental observables such as rate kinetics and thermodynamics.
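One simple way to see how temporal ordering connects cluster assignments to kinetics is to count transitions between clusters in consecutive frames, in the spirit of the Markov state models cited in [2]. The sketch below is a minimal illustration under assumed inputs: the label sequence is made up, and the function name `transition_matrix` and the `lag` parameter are hypothetical rather than taken from any cited package.

```python
def transition_matrix(labels, n_states, lag=1):
    """Row-normalized transition probabilities between cluster labels.

    labels: cluster index of each frame, in simulation time order.
    lag:    number of frames between the 'from' and 'to' observations (>= 1).
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(labels[:-lag], labels[lag:]):
        counts[i][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for states that are never left stay all-zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical cluster assignments for ten consecutive frames.
labels = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
T = transition_matrix(labels, n_states=2)
# T[0][1] estimates the per-frame probability of leaving sub-state 0 for 1.
```

Shuffling the frames would leave the cluster populations unchanged but destroy these transition estimates, which is why clustering methods that ignore time order cannot by themselves recover rate information.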

3 Solution Space and Outlook

We have outlined the current state-of-the-art in terms of developing statistical inference approaches for big-data problems in molecular biophysics. However, with the astronomical increase in the size of datasets from molecular simulations, it is clear that there is a need to develop integrated machine learning toolkits that provide for:

• handling access to and analysis of big simulation data in a distributed fashion, for example using Hadoop,

• online or near real-time analytics of simulations as they are progressing, to facilitate anomaly detection and the identification of large-scale biophysically relevant motion signatures [14, 15],

• developing toolkits that are particularly suited to exploit heterogeneous computing resources such as graphics processing units (GPUs) for molecular biophysics applications [16].
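As a rough illustration of the online analytics item in the list above, the sketch below keeps one-pass running statistics of a scalar observable and flags frames that deviate strongly from the history seen so far. It uses Welford's update as a generic stand-in and is not the method of [14] or [15]; the class `RunningStats`, the function `flag_anomalies`, the thresholds, and the synthetic stream are all assumptions.

```python
class RunningStats:
    """Welford's one-pass mean and variance, updated one frame at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0


def flag_anomalies(stream, z_threshold=4.0, warmup=10):
    """Return indices of frames far from the running mean of earlier frames."""
    stats, flagged = RunningStats(), []
    for frame, x in enumerate(stream):
        # Test the new frame against history only, then fold it in.
        if stats.n >= warmup and stats.std() > 0.0:
            if abs(x - stats.mean) > z_threshold * stats.std():
                flagged.append(frame)
        stats.update(x)
    return flagged


# Synthetic observable: small periodic fluctuations with one large excursion.
stream = [1.0 + 0.01 * ((i * 7) % 5 - 2) for i in range(50)]
stream[30] = 2.0
flagged = flag_anomalies(stream)
```

Because each frame is processed once and only three numbers are stored per observable, this kind of update can run alongside a simulation without retaining the trajectory in memory.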

The availability of packages such as HiMach [17], MDAnalysis [18] and HOST4MD [10] will certainly facilitate the development of a computing infrastructure required to tackle the big data challenges from molecular biophysics.


References

[1] David E. Shaw, Ron O. Dror, John K. Salmon, J. P. Grossman, Kenneth M. Mackenzie, Joseph A. Bank, Cliff Young, Martin M. Deneroff, Brannon Batson, Kevin J. Bowers, Edmond Chow, Michael P. Eastwood, Douglas J. Ierardi, John L. Klepeis, Jeffrey S. Kuskin, Richard H. Larson, Kresten Lindorff-Larsen, Paul Maragakis, Mark A. Moraes, Stefano Piana, Yibing Shan, and Brian Towles. Millisecond-scale molecular dynamics simulations on Anton. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 39:1–39:11, New York, NY, USA, 2009. ACM.

[2] Gregory R. Bowman, Kyle A. Beauchamp, George Boxer, and Vijay S. Pande. Progress and challenges in the automated construction of Markov state models for full protein systems. J. Chem. Phys., 131(12):124101, 2009.

[3] Marc W. van der Kamp, R. Dustin Schaeffer, Amanda L. Jonsson, Alexander D. Scouras, Andrew M. Simms, Rudesh D. Toofanny, Noah C. Benson, Peter C. Anderson, Eric D. Merkley, Steven Rysavy, Dennis Bromley, David A.C. Beck, and Valerie Daggett. Dynameomics: A comprehensive database of protein dynamics. Structure, 18(4):423–435, 2010.

[4] James S. Fraser, Michael W. Clarkson, Sheena C. Degnan, Renske Erion, Dorothee Kern, and Tom Alber. Hidden alternative structures of proline isomerase essential for catalysis. Nature, 462(7273):669–673, 12 2009.

[5] Dmitry M. Korzhnev, Tomasz L. Religa, and Lewis E. Kay. Transiently populated intermediate functions as a branching point of the ff domain folding pathway. Proceedings of the National Academy of Sciences, 2012.

[6] Denis Bucher, Barry J. Grant, Phineus R. Markwick, and J. Andrew McCammon. Accessing a hidden conformation of the maltose binding protein using accelerated molecular dynamics. PLoS Comput Biol, 7(4):e1002034, 04 2011.

[7] Jean-Francois Cardoso. High-order contrasts for independent component analysis. Neural Comput., 11(1):157–192, 1999.

[8] A. Ramanathan, A. J. Savol, C. J. Langmead, P. K. Agarwal, and C. S. Chennubhotla. Discovering conformational sub-states relevant to protein function. PLoS ONE, 6(1):e15827, 01 2011.

[9] S. H. Northrup, M. R. Pearl, J. D. Morgan, J. A. McCammon, and M. Karplus. Molecular dynamics of ferrocytochrome c: magnitude and anisotropy of atomic displacements. J. Mol. Biol., 153:1087–1111, 1981.

[10] A. Ramanathan, A. J. Savol, P. K. Agarwal, and C. S. Chennubhotla. Event detection and sub-state discovery from biomolecular simulations using higher-order statistics: Application to enzyme adenylate kinase. Proteins: Struct., Funct., and Bioinform., 80(11):2536–2551, 2012.

[11] Jianyin Shao, Stephen W. Tanner, Nephi Thompson, and Thomas E. Cheatham. Clustering molecular dynamics trajectories: 1. Characterizing the performance of different clustering algorithms. Journal of Chemical Theory and Computation, 3(6):2312–2334, 2007.

[12] Oliver F. Lange and Helmut Grubmüller. Full correlation analysis of conformational protein dynamics. Proteins: Structure, Function, and Bioinformatics, 70(4):1294–1312, 2008.

[13] V.M. Burger, A. Ramanathan, A. J. Savol, C.B. Stanley, P. K. Agarwal, and C. S. Chennubhotla. Quasi-anharmonic analysis reveals intermediate states in the nuclear co-activator receptor binding domain ensemble. Pacific Symposium on Biocomputing, 17:70–81, 2012.

[14] Willy Wriggers, Kate A. Stafford, Yibing Shan, Stefano Piana, Paul Maragakis, Kresten Lindorff-Larsen, Patrick J. Miller, Justin Gullingsrud, Charles A. Rendleman, Michael P. Eastwood, Ron O. Dror, and David E. Shaw. Automated event detection and activity monitoring in long molecular dynamics simulations. Journal of Chemical Theory and Computation, 5(10):2595–2605, 2009.

[15] A. Ramanathan, J.-O. Yoo, and C. J. Langmead. On-the-fly identification of conformational substates from molecular dynamics simulations. J. Chem. Theory Comput., 7(3):778–789, 2011.


[16] D. Brandt. Investigation of GPGPU for use in Processing of EEG in Real-time. PhD thesis, Kate Gleason College of Engineering, 2010.

[17] Tiankai Tu, Charles A. Rendleman, David W. Borhani, Ron O. Dror, Justin Gullingsrud, Morten Ø. Jensen, John L. Klepeis, Paul Maragakis, Patrick Miller, Kate A. Stafford, and David E. Shaw. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC '08, pages 56:1–56:12, Piscataway, NJ, USA, 2008. IEEE Press.

[18] Naveen Michaud-Agrawal, Elizabeth J. Denning, Thomas B. Woolf, and Oliver Beckstein. Mdanalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of