Statistical Inference for Big Data Problems in Molecular Biophysics

Arvind Ramanathan¹, Andrej Savol²,⁴, Virginia Burger²,⁴, Shannon Quinn²,⁴, Pratul K. Agarwal³, Chakra Chennubhotla⁴

¹ Computational Data Analytics Group, Computer Science and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830

² Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology

³ Computational Biology Institute, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830

⁴ Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260

Abstract

We highlight the role of statistical inference techniques in extracting biological insights from long time-scale molecular simulation data. Technological and algorithmic improvements in computation have brought molecular simulations to the forefront of techniques for investigating the basis of living systems. While these longer and increasingly complex simulations, which now reach petabyte scales, promise a detailed view into microscopic behavior, teasing out the important information has become a challenge in its own right. Mining this data for important patterns is critical to automating therapeutic intervention discovery, improving protein design, and fundamentally understanding the mechanistic basis of cellular homeostasis.

1 Molecular Biophysics

Over the last 30 years, biophysicists have taken advantage of advances in computing power to run increasingly detailed simulations of biomolecules in order to investigate the mechanistic basis of their function. The structure, dynamics and function of biological macromolecules such as proteins, deoxyribonucleic and ribonucleic acids (DNA/RNA), carbohydrates and lipids control cellular function, and thus life. Proteins, the workhorses of the cell, are long polymers of amino-acid residues that fold into three-dimensional structures to perform their function. The biological function controlled by the dynamical interactions between various biomolecules can occur at multiple time scales, from femtoseconds up to microseconds, milliseconds, seconds and beyond, spanning more than 15 orders of magnitude. Molecular dynamics (MD) simulations provide insights into the dependence of biological function on interactions at multiple length and time scales.

In this paper, we focus on using fully atomistic simulations of proteins/biomolecules in solution, as they best represent the cellular environment. MD simulations are governed by a potential energy function that includes both bonded and non-bonded interaction terms. The negative gradient of this energy function (the force field) gives the force on every atom in the molecule. At each time step, Newton’s laws of motion are integrated to generate a trajectory. A time step on the order of a femtosecond (10⁻¹⁵ s) is necessary for capturing the smallest vibrations of interest, whereas biologically interesting events typically occur at microsecond (10⁻⁶ s) and longer time scales. With improvements in sampling techniques and available hardware resources, MD simulations have successfully crossed the millisecond (10⁻³ s) time-scale barrier [1] and have provided novel insights into the
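The integration step described in this paragraph can be illustrated with a minimal sketch. The velocity Verlet scheme, the one-dimensional harmonic potential, and all function names below are assumptions chosen for illustration; the text does not specify an integrator or a force field, and production MD codes use far richer potentials.

```python
import math

def harmonic_force(x, k=1.0):
    """Force as the negative gradient of the toy potential U(x) = 0.5 * k * x**2."""
    return -k * x

def velocity_verlet(x0, v0, dt, n_steps, mass=1.0, force=harmonic_force):
    """Integrate Newton's equations of motion; returns a list of (x, v) pairs."""
    x, v = x0, v0
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory

def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy oscillator."""
    return 0.5 * mass * v * v + 0.5 * k * x * x

# Hypothetical parameters in reduced units, not physical femtoseconds.
traj = velocity_verlet(x0=1.0, v0=0.0, dt=0.01, n_steps=1000)
e_start = total_energy(*traj[0])
e_end = total_energy(*traj[-1])
```

The time step must resolve the fastest motion in the system, which is why atomistic simulations are tied to femtosecond steps even when the events of interest are many orders of magnitude slower.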

Assuming that a typical protein has O(1000) atoms, representing a protein in full atomistic detail in Cartesian space (x, y, z) requires at least 3 × O(1000) single-precision numbers. Even if one stores the output from molecular dynamics (MD) simulation at 100 ps (10⁻¹⁰ s) intervals, assuming regular access to millisecond-long simulations in the near future, typical datasets would have somewhere between 10¹² and 10¹⁵ conformations, which could easily occupy several terabytes of storage.
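The per-frame cost behind this estimate can be worked through with a short calculation. This is a sketch under stated assumptions: exactly 1000 atoms, 4-byte single-precision values, one 1 ms trajectory saved every 100 ps, and no compression or solvent coordinates. The function name is hypothetical.

```python
def trajectory_storage_bytes(n_atoms, sim_length_s, save_interval_s, bytes_per_value=4):
    """Return (number of saved frames, raw coordinate storage in bytes)."""
    n_frames = round(sim_length_s / save_interval_s)
    bytes_per_frame = 3 * n_atoms * bytes_per_value  # x, y, z per atom
    return n_frames, n_frames * bytes_per_frame

# Assumed values: 1000 atoms, 1 ms of simulated time, one frame every 100 ps.
n_frames, total_bytes = trajectory_storage_bytes(
    n_atoms=1000, sim_length_s=1e-3, save_interval_s=100e-12
)
total_gigabytes = total_bytes / 1e9
```

Under these assumptions each frame costs 12,000 bytes, so a single trajectory of 10⁷ frames needs about 120 GB. The total scales linearly with atom count, number of saved frames, and number of trajectories, which is how aggregate datasets grow far beyond this single-trajectory figure.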

Indeed, datasets of this order are being made available online by several research groups, and one can certainly expect more of these datasets in the near future. Such large-scale (and potentially large-volume) datasets from molecular simulations pose a significant big-data challenge.

The purpose of MD simulations is to reveal the mechanistic basis of protein function. Traditionally, biophysicists have relied on the availability of order parameters (e.g., experimental observables such as dihedral angles, distance constraints between different parts of the protein/biomolecule, or other thermodynamic measurements) as a means of analyzing these simulations. These ‘knowledge-based parameters’ are often difficult to obtain a priori and can be considered a biological filter applied to the large dataset, from which only a small number of functionally relevant dimensions are drawn. Given the experts’ knowledge, these parameters were sufficient to analyze shorter time-scale simulations. However, with the growth in MD datasets (reaching petabyte scale), there is a need to develop automated tools that can discover potential order parameters as well as reveal novel (hitherto unknown) features of the complex energy landscape. Machine learning and statistical inference techniques offer new avenues for elucidating novel relationships in the conformational landscape. Hence, our goal here is to bridge machine learning with molecular biophysics in the hope of discovering new biology.

2 Computational Challenges

We ask how statistical inference tools can help sift through petascale simulation data to reveal the organizing principles of conformational landscapes. The challenges include: (1) building statistical insights into the time-dependent structural changes that the protein undergoes in the course of a simulation; (2) exploiting these structural dependencies to build a biophysically/biochemically relevant low-dimensional representation of the simulation data; (3) using the low-dimensional representations to generate kinetically and energetically coherent conformational sub-states; and finally (4) drawing causality relations from time-dependent structural changes within each conformational sub-state to implicate functionally relevant residues. We discuss the first three issues in more detail below.

Building statistical insights into time-dependent structural changes that a protein undergoes in the course of a simulation

Naturally occurring ensembles, such as images, sounds and videos, have been shown to possess scale-invariant statistics. However, such invariances may be hard to find in molecular simulation data, because protein data is more likely to resemble an object-specific ensemble, such as a dataset of face images, which are known to exhibit multi-scale behavior but not in a convenient scale-invariant form. Thermal fluctuations allow the molecule to cross energy barriers and tumble to places far removed from the starting configuration (Fig. 1). Moreover, recent experimental evidence suggests that these rare, low-probability excursions from the mean conformation may have a significant bearing on biological function, including protein folding, enzyme catalysis and molecular recognition. Sufficiently describing these rare excursions and their properties algorithmically, and building rich representations from the simulation data, remains a hard problem. McCammon’s group published an early result pointing out the long-tailed behavior of positional fluctuation data from picosecond-long simulation trajectories [4]. More recently, we documented this long-tail property in the positional data of ubiquitin and adenylate kinase simulations using microsecond-long simulation trajectories [3, 5]. We observed that the long tails give rise to non-orthogonal couplings between the various portions of the biomolecule. While the long tails hint at the use of independent component analysis algorithms, as with natural datasets of images and sounds, we also observed that respecting the non-orthogonal correlations intrinsic to the data is key to discovering energetically coherent clusters and building low-dimensional, biophysically relevant projections.
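One simple way to quantify the long tails discussed here is the kurtosis, the fourth standardized moment, which equals 3 for a Gaussian. The sketch below uses synthetic samples as a stand-in for positional deviations (an assumption; no simulation data is used), with a Laplace distribution playing the role of a long-tailed signal.

```python
import numpy as np

def kurtosis(samples):
    """Fourth standardized moment; equals 3 for a Gaussian distribution."""
    x = np.asarray(samples, dtype=float)
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2

rng = np.random.default_rng(0)
gaussian = rng.normal(size=200_000)      # harmonic-like reference
long_tailed = rng.laplace(size=200_000)  # synthetic stand-in for anharmonic deviations

k_gauss = kurtosis(gaussian)
k_tail = kurtosis(long_tailed)
```

A kurtosis well above 3 indicates that large excursions from the mean occur more often than a Gaussian model would predict, which is the sense in which fluctuations are called anharmonic here.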

Figure 1: Biophysical insights gained from a statistical characterization of the simulation data. (A) Time-dependent structural changes are anharmonic, as shown by the long tails in the positional fluctuations from 0.5 µs long simulation data of the protein ubiquitin. Log-histograms of the positional deviations from the mean conformation for (i) backbone carbon alpha atoms (red, kurtosis κ = 6.3), (ii) all atoms of the protein (blue, kurtosis κ = 8.2) and (iii) the best-fitting Gaussian to the carbon alpha data (dotted, kurtosis κ = 3). (B) Non-orthogonal or anharmonic coupling in the positional deviations of carbon alpha atoms of residues 31 and 45. PCA (black arrows) imposes orthogonality and misses the intrinsic orientation of this data, but a variant of JADE [2] (red arrows) that deconvolves fourth-order dependencies successfully captures the intrinsic anharmonic directions. (C) An emergent behavior from using higher-order statistics is the ability to discover energetically coherent clusters of conformations. (D) Biological insights derived from (C) implicating motions of different regions in the protein ubiquitin as being important for recognizing diverse substrates (2D3G and 2G45). Figure modified from [3].

Learning biophysically/biochemically relevant low-dimensional representations

For a molecular system with N atoms and t simulation frames, a full conformational description requires 3Nt variables, that is, the x, y, z Cartesian coordinates of every atom in every frame. Dimensionality reduction methods are concerned with describing the complexities of molecular motions with far fewer variables such that important structural shifts can be visualized and interpreted. Conceptually, such approaches map each 3N-dimensional snapshot to a data point on a lower-dimensional manifold, but the nature and complexity of such a mapping are nontrivial, and no current method can claim optimality. Designing this embedding, or reaction coordinates, such that biophysically relevant transitions are observed has actually become more challenging because longer simulations necessarily access more structural diversity and rare transitions. A truly useful mapping must thus (1) filter out thermal fluctuations, (2) be sensitive to the non-Gaussian/anharmonic character of conformational shifts, and (3) capture rare (i.e., outlier) transitions.

Existing methods primarily identify basis vectors (in 3N-dimensional conformational space) that align with high-variance directions. So-called principal component analysis (PCA) methods, adapted from the factor analysis statistics literature, require strong assumptions about the nature of correlated motions within the studied biomolecules. We have shown that two such assumptions, linearity and orthogonality, are not valid for MD simulations. These findings are consistent with the mechanistic interpretation of intramolecular forces, where bonded and non-bonded interactions promote inherently anharmonic coupling between different parts of the biomolecule. We further emphasize two important drawbacks of existing dimensionality reduction methods. First, current reaction coordinate formulations are not protein specific, meaning that neither the inherent properties of molecular interactions nor the restrictions on protein structure are considered. Second, existing methods scale poorly, requiring matrix operations that become intractable for systems of biological interest. Substantial atomistic simulations have become available primarily through hardware speedups. Useful interpretation of molecular simulations, on the other hand, requires a physics-centric approach to analyzing collective atomic motions. Our experience in highlighting potential weaknesses of existing methods and suggesting alternatives will be critical in expanding the set of tools researchers can apply to molecular systems.
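To make the PCA baseline concrete, the following minimal sketch projects a (t, 3N) array of snapshots onto its highest-variance directions. The data is random and purely illustrative, and the function name `pca` is a hypothetical helper rather than part of any package named in this paper.

```python
import numpy as np

def pca(frames, n_components):
    """PCA of a (t, 3N) coordinate array.

    Returns (components, variances, projections): the top basis vectors as rows,
    the variance along each, and the low-dimensional coordinates of every frame.
    """
    X = np.asarray(frames, dtype=float)
    centered = X - X.mean(axis=0)                    # deviations from the mean conformation
    cov = centered.T @ centered / (X.shape[0] - 1)   # dense 3N x 3N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T
    return components, eigvals[order], centered @ components.T

rng = np.random.default_rng(1)
n_frames, n_atoms = 500, 4                           # toy sizes, assumed for illustration
frames = rng.normal(size=(n_frames, 3 * n_atoms))
frames[:, 0] *= 5.0                                  # give one coordinate a dominant variance
components, variances, projections = pca(frames, n_components=2)
```

Two properties relevant to the discussion are visible in the sketch: the recovered basis vectors are orthogonal by construction, whatever the intrinsic couplings in the data, and the dense covariance matrix has (3N)² entries, so memory grows quadratically and a full eigendecomposition roughly cubically with system size.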

Clustering simulation data with biological and thermodynamical relevance

A second class of analysis techniques concerns grouping simulation frames that share important structural or kinetic features. Clustering conformer snapshots presents a considerable challenge in that the data has a natural ordering, namely the temporal relationships between successive frames, that many datasets lack. Clustering approaches that appreciate the temporal associations among snapshots as well as their structural relationships can provide insight into how protein motions facilitate function, or dysfunction. Specifically, clustering enables the determination of a small number of parameters (referred to as order parameters) that are highly relevant for estimating rate kinetics and thermodynamics, which are directly measurable via experiments. Although many clustering techniques are regularly applied to MD simulations, the challenge is to determine which techniques can provide adequate insights into the biological underpinnings of function and relate them back in a seamless way to experimental observables such as rate kinetics and thermodynamics.
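One way the temporal ordering of frames can be connected to kinetics is by counting transitions between cluster labels along the trajectory. The sketch below is an illustrative technique in the spirit of Markov state models, not the method used in this paper; the label sequence and function name are hypothetical.

```python
def transition_matrix(labels, n_states, lag=1):
    """Row-normalized matrix of transitions between cluster labels.

    labels: cluster index of each frame, in time order.
    lag: number of frames between the start and end of a counted transition (>= 1).
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(labels[:-lag], labels[lag:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States never visited as a starting point keep an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical cluster assignments for ten consecutive frames.
labels = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
T = transition_matrix(labels, n_states=2)
```

Each row estimates the probability of moving from one cluster to another within the lag time, a quantity that can in principle be compared with measured rates. Shuffling the frames would leave a purely structural clustering unchanged but would destroy this matrix, which is why the temporal ordering matters.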

3 Solution Space and Outlook

We have outlined the current state-of-the-art in terms of developing statistical inference approaches for big-data problems in molecular biophysics. However, with the astronomical increase in the size of datasets from molecular simulations, it is clear that there is a need to develop integrated machine learning toolkits that provide for:

• handling access to and analysis of big simulation data in a distributed fashion such as using Hadoop

• online or near real-time analytics of simulations as they are progressing, to facilitate anomaly detection and large-scale biophysically relevant motion signatures [6, 7]

• developing toolkits that are particularly suited to exploit heterogeneous computing resources such as graphics processing units (GPUs) for molecular biophysics applications [8]

The availability of packages such as HiMach [9], MDAnalysis [10] and HOST4MD [5] will certainly facilitate the development of a computing infrastructure required to tackle the big data challenges from molecular biophysics.
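The first two requirements in the list above share a common building block: statistics that can be updated one frame at a time and merged across workers without revisiting the raw data. The following is a minimal sketch using Welford's one-pass update and the standard pairwise merge formula; the class name and sample values are hypothetical, and no Hadoop or other framework API is used.

```python
class RunningStats:
    """Count, mean, and sum of squared deviations for a stream of scalars."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        """Welford's one-pass update for a single new observation."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Combine two partial summaries, e.g. computed on different workers."""
        merged = RunningStats()
        merged.n = self.n + other.n
        if merged.n == 0:
            return merged
        delta = other.mean - self.mean
        merged.mean = self.mean + delta * other.n / merged.n
        merged.m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / merged.n
        return merged

    @property
    def variance(self):
        """Sample variance; zero until at least two observations are seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Hypothetical per-frame observable split across two workers.
values = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
part_a, part_b = RunningStats(), RunningStats()
for v in values[:3]:
    part_a.update(v)
for v in values[3:]:
    part_b.update(v)
combined = part_a.merge(part_b)
```

Because `merge` is associative, partial summaries can be reduced in any order, which fits a map-reduce style of distributed analysis, while `update` alone supports monitoring a simulation as it runs.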

References

[1] David E. Shaw, Ron O. Dror, John K. Salmon, J. P. Grossman, Kenneth M. Mackenzie, Joseph A. Bank, Cliff Young, Martin M. Deneroff, Brannon Batson, Kevin J. Bowers, Edmond Chow, Michael P. Eastwood, Douglas J. Ierardi, John L. Klepeis, Jeffrey S. Kuskin, Richard H. Larson, Kresten Lindorff-Larsen, Paul Maragakis, Mark A. Moraes, Stefano Piana, Yibing Shan, and Brian Towles. Millisecond-scale molecular dynamics simulations on Anton. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 39:1–39:11, New York, NY, USA, 2009. ACM.

[2] Jean-Francois Cardoso. High-order contrasts for independent component analysis. Neural Comput., 11(1):157–192, 1999.

[3] A. Ramanathan, A. J. Savol, C. J. Langmead, P. K. Agarwal, and C. S. Chennubhotla. Discovering conformational sub-states relevant to protein function. PLoS ONE, 6(1):e15827, 01 2011.

[4] S. H. Northrup, M. R. Pearl, J. D. Morgan, J. A. McCammon, and M. Karplus. Molecular dynamics of ferrocytochrome c: magnitude and anisotropy of atomic displacements. J. Mol. Biol., 153:1087–1111, 1981.

[5] A. Ramanathan, A. J. Savol, P. K. Agarwal, and C. S. Chennubhotla. Event detection and sub-state discovery from biomolecular simulations using higher-order statistics: Application to enzyme adenylate kinase. Proteins: Struct., Funct., and Bioinform., 80(11):2536–2551, 2012.

[6] Willy Wriggers, Kate A. Stafford, Yibing Shan, Stefano Piana, Paul Maragakis, Kresten Lindorff-Larsen, Patrick J. Miller, Justin Gullingsrud, Charles A. Rendleman, Michael P. Eastwood, Ron O. Dror, and David E. Shaw. Automated event detection and activity monitoring in long molecular dynamics simulations. Journal of Chemical Theory and Computation, 5(10):2595–2605, 2009.

[7] A. Ramanathan, J.-O. Yoo, and C. J. Langmead. On-the-fly identification of conformational substates from molecular dynamics simulations. J. Chem. Theory Comput., 7(3):778–789, 2011.

[8] D. Brandt. Investigation of GPGPU for use in Processing of EEG in Real-time. PhD thesis, Kate Gleason College of Engineering, 2010.

[9] Tiankai Tu, Charles A. Rendleman, David W. Borhani, Ron O. Dror, Justin Gullingsrud, Morten Ø. Jensen, John L. Klepeis, Paul Maragakis, Patrick Miller, Kate A. Stafford, and David E. Shaw. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC ’08, pages 56:1–56:12, Piscataway, NJ, USA, 2008. IEEE Press.

[10] Naveen Michaud-Agrawal, Elizabeth J. Denning, Thomas B. Woolf, and Oliver Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of