Data pre-processing and model selection

6.7 General discussion

6.7.5 Data pre-processing and model selection

As shown in chapter 3, NPS is based on pre-processed FRET data, i.e. average FRET efficiencies and anisotropies, which depend on the distance between the average donor and acceptor positions and on the angle between the average transition dipole moment orientations. So far it has been left open how the pre-processing manages to describe the “raw” FRET data “sufficiently well” by a yet unknown number of macromolecule states, each with an unknown average FRET efficiency and anisotropy.

In several recent publications, the FRET efficiency time traces (McKinney et al., 2006) or the underlying donor and acceptor fluorescence signals (Liu et al., 2010b; Bronson et al., 2009) were analyzed. These authors addressed the questions of how to model the macromolecule states and how many states to assign, and they tried to quantify the term “sufficiently well” used above. Hidden Markov modeling (HMM) was used in these studies to characterize the average FRET efficiency of each state and to extract the transition rates between the states. The data were analyzed with maximum-likelihood estimation (McKinney et al., 2006; Liu et al., 2010b) or with Bayesian data analysis (Bronson et al., 2009).

Other popular methods to extract the average FRET efficiency of different states without modeling the underlying dynamics explicitly are Gaussian fits of FRET efficiency histograms, photon distribution analysis (PDA), and proximity ratio histograms (PRH) (see section 6.7.1 for references).

Typically, the number of states is obtained by some heuristic rule based on the goodness of the fit. Bayesian data analysis offers a more principled alternative, namely selecting models according to their probability, as was done by Bronson et al. (2009). They calculated the probability of a model with a specific number of states given the experimental data, which is proportional to the evidence obtained in the pre-processing analysis.

However, seen from the point of view of Bayesian parameter estimation, all pre-processing methods listed above assign a flat prior in the parameters that describe a FRET state, in particular the FRET efficiency $E$. This choice seems rather arbitrary, since one could just as well work with a flat prior in another parametrization, e.g. $\ln(E)$ instead of $E$. A flat prior in $E$, however, represents a different state of information than a flat prior in $\ln(E)$, which is inconsistent when absolutely no information about the FRET efficiency of a state is available.
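The inconsistency can be made explicit with a change of variables. The short derivation below is an added illustration, not part of the cited pre-processing methods; the auxiliary symbol $u = \ln E$ and the lower bound $E_{\min}$ are assumptions introduced only for this sketch.

```latex
% Flat prior in E on (0, 1]:
p(E \mid I) = 1, \qquad 0 < E \le 1 .
% Change of variables u = \ln E, i.e. E = e^{u} with u \in (-\infty, 0]:
p(u \mid I) = p(E \mid I)\,\left|\frac{\mathrm{d}E}{\mathrm{d}u}\right| = e^{u} .
% Conversely, a prior that is flat in u on [\ln E_{\min}, 0] corresponds to
p(E \mid I) = \frac{1}{E\,\ln(1/E_{\min})}, \qquad E_{\min} \le E \le 1 .
```

A prior that is flat in $E$ is thus exponentially weighted in $\ln E$, and a prior that is flat in $\ln E$ favors small efficiencies as $1/E$. The two assignments cannot both express complete ignorance.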

As shown in section 2.4.3, a natural parametrization can be obtained from transformation invariances, and a maximum entropy prior in this parametrization truly expresses the experimenter’s ignorance. This was done for NPS (section 3.2.1) and resulted in flat priors in the fluorophore positions and average transition dipole moment orientations.

Conversely, this means that the prior in the parameters that describe a FRET state, i.e. the FRET efficiencies, will not be flat in the pre-processing analysis. One can therefore reason that a non-flat prior could change the probabilities of competing models with different numbers of states. This, in turn, would have implications for the biochemical interpretation of the experiments.

As a consequence, one should couple NPS with a pre-processing method to infer the number of states of a macromolecule. This can be done by modifying the likelihood of the NPS analysis with the results of any pre-processing method that utilizes a flat prior in FRET efficiency and anisotropy, as will be shown in the following.

Coupling of NPS and FRET data preprocessing methods

In the following, it is assumed that the macromolecule and the attached fluorophores are described by a model, $M$, with $K_M$ states in total.⁸ A particular macromolecule state $k$ is quantified by the positions $x_{i;k}$ and the average transition dipole orientations $\Omega_{i;k}$ of each fluorophore $i$, as well as by the potentially ambiguous signs of the average axial depolarizations, $s_{i;k}$. In the following, these structural parameters of NPS will be abbreviated by $q^{\mathrm{NPS}}_{i;k} := (x_{i;k}, \Omega_{i;k}, s_{i;k})$, and $\{q^{\mathrm{NPS}}_{i;k}\}$ will denote the structural parameters of all fluorophores in all possible macromolecule states. The observables measured for the fluorophore pairs $ij$, for example the FRET efficiency, are available as time series, i.e. as FRET efficiency time traces, and will be denoted by $O_{ij}(t)$. The ensemble of the time traces of all measured molecules and FRET pairs $ij$ will be denoted by $\{O_{ij}(t)\}$.

Now, in order to infer the structural parameters of the model $M$, one must compute the posterior probability density $p(\{q^{\mathrm{NPS}}_{i;k}\} \mid \{O_{ij}(t)\}, M, I)$, which is done with Bayes’ theorem,

$$
p(\{q^{\mathrm{NPS}}_{i;k}\} \mid \{O_{ij}(t)\}, M, I) = \frac{1}{Z_M}\; p(\{O_{ij}(t)\} \mid \{q^{\mathrm{NPS}}_{i;k}\}, M, I)\; p(\{q^{\mathrm{NPS}}_{i;k}\} \mid M, I), \tag{6.7.1}
$$

where $Z_M$ is the evidence of the model $M$, $p(\{O_{ij}(t)\} \mid \{q^{\mathrm{NPS}}_{i;k}\}, M, I)$ is the likelihood of the time series data, and $p(\{q^{\mathrm{NPS}}_{i;k}\} \mid M, I)$ is the prior. Note that the likelihood depends on the “raw data”, and not on pre-processed observables, in contrast to equation (3.2.5), and that the prior $p(\{q^{\mathrm{NPS}}_{i;k}\} \mid M, I)$ can be split into a contribution of the depolarization signs, $s_{i;k}$, and a contribution of the fluorophore positions, $x_{i;k}$, and average transition dipole moment orientations, $\Omega_{i;k}$,

$$
p(\{q^{\mathrm{NPS}}_{i;k}\} \mid M, I) = \frac{1}{\prod_{l;k} S_{l;k}}\; p(\{x_{i;k}, \Omega_{i;k}\} \mid M, I). \tag{6.7.2}
$$

Here, $\prod_{l;k} S_{l;k}$ is the number of combinations of possible average axial depolarization signs (similar to equation (3.2.5)), and $p(\{x_{i;k}, \Omega_{i;k}\} \mid M, I)$ is the regular NPS prior (section 3.2.3).

The most probable macromolecule model can be found by maximizing the probability of the model $M$ given the experimental data $\{O_{ij}(t)\}$. This probability can be related to the evidence $Z_M$ of the model, again by using Bayes’ theorem,

$$
p(M \mid \{O_{ij}(t)\}, I) \propto p(\{O_{ij}(t)\} \mid M, I)\; p(M \mid I) = Z_M\; p(M \mid I), \tag{6.7.3}
$$

where $p(M \mid I)$ denotes the prior probability of the model $M$ with $K_M$ states.
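Equation (6.7.3) can be turned into normalized model probabilities once the evidences of all candidate models are available. The following Python sketch is an added illustration; the function name `model_posteriors` and the log-evidence values are hypothetical and are not taken from the thesis or its software.

```python
import math

def model_posteriors(log_evidences, log_priors=None):
    """Normalized p(M | data) from log Z_M and log p(M | I), cf. eq. (6.7.3)."""
    if log_priors is None:
        # Assumption: all candidate models are equally probable a priori.
        log_priors = [0.0] * len(log_evidences)
    log_w = [lz + lp for lz, lp in zip(log_evidences, log_priors)]
    shift = max(log_w)  # subtract the maximum to avoid underflow in exp()
    weights = [math.exp(w - shift) for w in log_w]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up log-evidences for models with 1, 2 and 3 states.
probs = model_posteriors([-1050.0, -1002.0, -1004.0])
```

Working with logarithms is a design choice that keeps the computation stable, since evidences of time-trace data are typically far too small to be represented directly as floating-point numbers.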

As shown in the appendix, one can rewrite the likelihood $p(\{O_{ij}(t)\} \mid \{q^{\mathrm{NPS}}_{i;k}\}, M, I)$ in equation (6.7.1) into an expression that contains the posterior of a completely independent data pre-processing analysis, e.g. HMM, PDA, PRH, or simple fits of FRET efficiency histograms. When the applied method is not based on Bayesian parameter estimation, the analogue of a posterior density must be found (Sivia, 2006, chapter 3.5). The pre-processing must compute the estimates of the “true” values of the observables, $\{\tilde{O}_{ij;k}\}$, that characterize the states $k = 1 \ldots K_M$, and it must utilize a flat prior in $\{\tilde{O}_{ij;k}\}$. The coupling between NPS and the pre-processing method turns out to be

$$
p(\{O_{ij}(t)\} \mid \{q^{\mathrm{NPS}}_{i;k}\}, M, I) = \frac{Z_M^{\mathrm{PP}}}{\pi_M^{\mathrm{PP}}}\; p(\{\tilde{O}_{ij;k}\} \mid \{O_{ij}(t)\}, M, I)\,\Big|_{\tilde{O}_{ij;k} = O_{ij;k}(q^{\mathrm{NPS}}_{i;k},\, q^{\mathrm{NPS}}_{j;k})}, \tag{6.7.4}
$$

where $Z_M^{\mathrm{PP}}$ is the evidence of the pre-processing inference, $\pi_M^{\mathrm{PP}}$ denotes the value of the flat prior in the “true” observables $\{\tilde{O}_{ij;k}\}$, i.e. $\pi_M^{\mathrm{PP}} = p(\{\tilde{O}_{ij;k}\} \mid M, I)$, and $p(\{\tilde{O}_{ij;k}\} \mid \{O_{ij}(t)\}, M, I)$ is the posterior of the pre-processing analysis. The observables $O_{ij;k}(q^{\mathrm{NPS}}_{i;k}, q^{\mathrm{NPS}}_{j;k})$ expected from NPS (i.e. $E_{ij;k}$ or $A_{ij;k}$) are functions of the positions, average transition dipole moment orientations and average axial depolarization signs of the fluorophores $i$ and $j$. These expected observables are substituted for the “true” observables $\tilde{O}_{ij;k}$ in the pre-processing posterior.

As a consequence of equations (6.7.1) and (6.7.4), the task of identifying a fixed number of states is left to the pre-processing algorithm, whereas the computation of possible fluorophore positions and average transition dipole moment orientations is done by NPS. The posterior obtained by the pre-processing thus plays the role of the likelihood in the NPS calculation.

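The probability of the model $M$ can then be computed as follows:

$$
p(M \mid \{O_{ij}(t)\}, I) \propto \frac{Z_M^{\mathrm{PP}}}{\pi_M^{\mathrm{PP}}}\; Z_M^{\mathrm{NPS}}\; K_M! \cdot p(M \mid I). \tag{6.7.5}
$$

$Z_M^{\mathrm{NPS}}$ is the evidence of the NPS problem, calculated with the posterior of the data pre-processing, $p(\{\tilde{O}_{ij;k}\} \mid \{O_{ij}(t)\}, M, I)\,\big|_{\tilde{O}_{ij;k} = O_{ij;k}}$, substituted for $\prod_{ij \in \mathcal{M}} \mathcal{L}^{(s_i s_j)}_{ij}(x_i, \Omega_i, x_j, \Omega_j)$ in equation (3.2.5). The factor $K_M!$ accounts for the multimodality of $p(\{\tilde{O}_{ij;k}\} \mid \{O_{ij}(t)\}, M, I)\,\big|_{\tilde{O}_{ij;k} = O_{ij;k}(q^{\mathrm{NPS}}_{i;k},\, q^{\mathrm{NPS}}_{j;k})}$, as this function is invariant under permutations of $\sigma_{ij;k}$ for different conformations $k$.⁹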
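To make the bookkeeping of equation (6.7.5) concrete, the following Python sketch evaluates the unnormalized logarithm of the model probability. It is an added illustration that assumes equation (6.7.5) is read as the product of $Z_M^{\mathrm{PP}} / \pi_M^{\mathrm{PP}}$, $Z_M^{\mathrm{NPS}}$, $K_M!$ and $p(M \mid I)$; the function name and all numbers are hypothetical.

```python
import math

def log_model_weight(log_z_pp, log_pi_pp, log_z_nps, n_states, log_prior=0.0):
    """Unnormalized log p(M | data), assuming the product form of eq. (6.7.5).

    log_z_pp  : log evidence of the pre-processing inference
    log_pi_pp : log of the (constant) flat pre-processing prior
    log_z_nps : log evidence of the NPS calculation
    n_states  : number of macromolecule states K_M
    log_prior : log p(M | I); 0.0 means no preference between models
    """
    log_permutations = math.lgamma(n_states + 1)  # log(K_M!)
    return log_z_pp - log_pi_pp + log_z_nps + log_permutations + log_prior

# Made-up inputs for a two-state and a three-state model.
w2 = log_model_weight(log_z_pp=-1002.0, log_pi_pp=0.0, log_z_nps=-12.0, n_states=2)
w3 = log_model_weight(log_z_pp=-1001.0, log_pi_pp=0.0, log_z_nps=-16.0, n_states=3)
log_odds_2_vs_3 = w2 - w3
```

With these made-up numbers the three-state model has the larger pre-processing evidence, yet the NPS evidence shifts the odds towards the two-state model, which is the kind of effect the coupling is meant to capture.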

It might be legitimate to approximate the posterior of the pre-processing analysis by a multivariate normal distribution (see section 2.4.4) that is characterized by $\{O_{ij;k}\}$, the best estimates of the “true” observables, and a corresponding covariance matrix, $\Delta^2 O_{ij;k,lm;n}$, describing the correlations between $\tilde{O}_{ij;k}$ and $\tilde{O}_{lm;n}$. When the covariance matrix is diagonal, the standard deviation $\Delta O_{ij;k}$ of each estimated observable $O_{ij;k}$ can be obtained directly by taking the square root of the corresponding diagonal element, i.e. $\Delta O_{ij;k} = \sqrt{\Delta^2 O_{ij;k,ij;k}}$. The pre-processed data $O_{ij;k} \pm \Delta O_{ij;k}$ can be used in the NPS calculation as already described in section 3.2.2. Note that the coupling of NPS and a data pre-processing method was used in all analyses presented in this thesis, since fits of FRET efficiency histograms were used to extract the center FRET efficiencies and their standard deviations. When the covariance matrix is not diagonal, the likelihood in equation (3.2.5) must be modified in order to support correlated FRET efficiencies and anisotropies.
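For the diagonal case, the pre-processing posterior evaluated at the NPS-predicted observables reduces to a product of independent normal densities. The Python sketch below is an added illustration with strong simplifying assumptions: it uses a single fixed Förster distance for all pairs, whereas in NPS the Förster distance depends on the dipole orientations, and all coordinates, efficiencies and uncertainties are made up.

```python
import math

def fret_efficiency(distance, forster_distance):
    """Foerster relation E = 1 / (1 + (r / R0)^6)."""
    return 1.0 / (1.0 + (distance / forster_distance) ** 6)

def gaussian_log_likelihood(predicted, measured, sigma):
    """Sum of independent normal log-densities (diagonal covariance matrix)."""
    total = 0.0
    for e_pred, e_meas, s in zip(predicted, measured, sigma):
        total += -0.5 * ((e_pred - e_meas) / s) ** 2
        total -= math.log(s * math.sqrt(2.0 * math.pi))
    return total

# Hypothetical geometry: one antenna and three satellites, coordinates in Angstrom.
antenna = (0.0, 0.0, 0.0)
satellites = [(50.0, 0.0, 0.0), (0.0, 60.0, 0.0), (0.0, 0.0, 70.0)]
r0 = 60.0  # assumed orientation-independent Foerster distance in Angstrom

predicted = [fret_efficiency(math.dist(antenna, s), r0) for s in satellites]
measured = [0.75, 0.50, 0.28]  # made-up histogram centers
sigma = [0.03, 0.03, 0.03]     # made-up standard deviations
log_l = gaussian_log_likelihood(predicted, measured, sigma)
```

Evaluating `log_l` for many trial antenna positions and combining it with the prior would give an unnormalized posterior over positions. For a non-diagonal covariance matrix the sum over pairs would have to be replaced by the full quadratic form of the multivariate normal density.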

⁹ NPS in its present form requires a unique assignment of observables to conformations and thus cannot

Discussion

As shown above, NPS can be coupled to arbitrary FRET efficiency and anisotropy pre-processing methods, which was proven formally in the appendix. In short, the posterior probability distribution computed in a pre-processing analysis like PDA, PRH or HMM plays the role of the likelihood in the NPS calculation. The evidence obtained by NPS modifies the posterior probability of a model, which, in turn, can have consequences for Bayesian model selection.

However, the question arises what kind of prior one should choose in the NPS analysis, since one would usually expect rather small changes in the macromolecule structure, e.g. movements of domains. Such small changes would correlate the a priori expected positions and average transition dipole moment orientations of the same fluorophore in different states of the macromolecule. It might therefore be necessary to introduce a different parametrization to describe the positions of the same fluorophores in different states of the macromolecule. Such a parametrization could, for instance, model possible conformational changes as rotations about “hinges” in the protein, similar to mechanistic models of bent DNA (Woźniak et al., 2008).
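A hinge parametrization of this kind could, for example, describe the position of a fluorophore in a second state by a single rotation angle applied to its position in the first state. The Python sketch below is an added illustration of that idea using Rodrigues' rotation formula; the function name, the hinge location and the axis are hypothetical and do not correspond to any model implemented in NPS.

```python
import math

def rotate_about_hinge(point, hinge_origin, hinge_axis, angle):
    """Rotate `point` by `angle` (radians) about the line that passes through
    `hinge_origin` along `hinge_axis` (Rodrigues' rotation formula)."""
    norm = math.sqrt(sum(a * a for a in hinge_axis))
    k = [a / norm for a in hinge_axis]                 # unit vector along the hinge
    v = [p - o for p, o in zip(point, hinge_origin)]   # point relative to the hinge
    k_dot_v = sum(ki * vi for ki, vi in zip(k, v))
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    c, s = math.cos(angle), math.sin(angle)
    rotated = [
        vi * c + cvi * s + ki * k_dot_v * (1.0 - c)
        for vi, cvi, ki in zip(v, k_cross_v, k)
    ]
    return tuple(r + o for r, o in zip(rotated, hinge_origin))

# Hypothetical example: a fluorophore 10 Angstrom from a hinge that runs along z.
state_1 = (11.0, 1.0, 0.0)
state_2 = rotate_about_hinge(state_1, (1.0, 1.0, 0.0), (0.0, 0.0, 1.0), math.radians(30.0))
```

In such a parametrization the prior would be assigned to the hinge angle rather than to the position in each state independently, so that the positions in different states stay correlated by construction.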

Finally, it still has to be verified whether NPS is useful for determining the number of states of a macromolecule. It is possible that the influence of the evidence calculated with NPS is overwhelmed by the evidence obtained by the pre-processing method. This could happen, for example, when dynamic information is analyzed with an HMM method, since the transitions between different states might be very informative themselves.

After this concluding section, the results of the thesis will be summarized in the following chapter.

7 Summary and outlook

This thesis focused on the data analysis of fluorescence resonance energy transfer (FRET) experiments carried out to infer structural information of biological macromolecules. The purpose of such experiments is to determine the yet unknown position of one or more fluorophores, called antennas, based on FRET measurements between the antennas and the satellite fluorophores, which are attached to known positions on the macromolecule determined by high-resolution structure methods. Alternatively, the relative position and orientation of several rigid components constituting a macromolecular complex can be computed, which is referred to as FRET-assisted docking. In this case, the structure of all components must be known, and the fluorophores must be attached to known positions on the components.

As it is easier to resolve structural heterogeneities in single-molecule experiments than with standard ensemble-based methods like X-ray crystallography or NMR spectroscopy, single-molecule FRET-based localization techniques can be used whenever high-resolution ensemble methods fail due to inhomogeneity effects.

However, since FRET is caused by dipolar coupling of the electronic systems of the fluorophores, inevitable orientation effects complicate the straightforward interpretation of FRET efficiencies in terms of distances, and therefore also hamper the localization of fluorophores based on a simple trilateration scheme. In order to account for these effects, Bayesian data analysis has been applied to the challenging task of FRET-based localization of fluorophores attached to biological macromolecules, as well as to FRET-assisted docking of several macromolecules.

This was accomplished by developing the Nano-Positioning System (NPS), a model to describe FRET between pairs of several partially oriented fluorophores attached to a macromolecule (chapter 3). In the model, the fluorophores are fixed in their position and exhibit constrained, fast, and axially symmetric orientation fluctuations. The FRET efficiency depends on the distance of the fluorophores and on the length scale of energy transfer, the so-called Förster distance. The latter can be calculated from the relative orientation of the average transition dipole moments of the fluorophores and from experimentally accessible observables, among them fluorescence anisotropies of the fluorophores, which quantify the magnitude of the orientation effects.
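For orientation, the dependence described here can be written down in its standard textbook form for two static point dipoles. This is an added reminder, not the NPS expression itself: the NPS model of chapter 3 works with average transition dipole moments and axial depolarizations, which are not reproduced here. The symbols $r$, $R_0$, $\kappa^2$, $\hat{\mu}_D$, $\hat{\mu}_A$ and $\hat{r}$ follow common usage and are assumptions of this sketch.

```latex
% FRET efficiency as a function of the donor-acceptor distance r
E = \frac{1}{1 + \left( r / R_0 \right)^{6}},
\qquad
R_0^{6} \propto \kappa^{2},
% orientation factor for static unit transition dipoles \hat{\mu}_D, \hat{\mu}_A
% and the unit vector \hat{r} connecting donor and acceptor
\qquad
\kappa^{2} = \left[ \hat{\mu}_D \cdot \hat{\mu}_A
  - 3\,(\hat{\mu}_D \cdot \hat{r})\,(\hat{\mu}_A \cdot \hat{r}) \right]^{2} .
```

The orientation factor ranges from 0 to 4 and takes the value 2/3 when both dipoles reorient isotropically and fast. Any restriction of the orientational freedom makes $R_0$, and hence the distance inferred from $E$, uncertain.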

The challenge in the data analysis was to infer the antenna positions even though none of the average transition dipole moments was known. However, even a lack of information can be represented mathematically in Bayesian data analysis, and this state of knowledge is encoded by assigning an adequate maximum entropy prior. The prior also contains information about possible fluorophore positions, calculated by taking into account the constraints imposed by the known macromolecule structure. The prior is then combined with the likelihood, i.e. the probability that a specific configuration of fluorophores generates the measured data, to yield the posterior probability density, which is the result of the inference. Thus, it is possible to combine structural information obtained by high-resolution ensemble methods like X-ray crystallography or NMR spectroscopy with FRET data measured in single-molecule experiments.

The results of NPS can be readily displayed as marginal posterior probability densities, which reflect the information about the position of a fluorophore. The location, form and fuzziness of the densities simultaneously convey the position of the fluorophore and the uncertainty of localization. The densities can also be shown as a set of credible volumes, which are three-dimensional credible intervals and the Bayesian equivalent of confidence intervals. By displaying them along with a high-resolution structure, it is possible to perceive the available information in an intuitive way.
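One common way to obtain such a credible volume from a density sampled on a grid of equally sized voxels is to find the density threshold above which the enclosed probability reaches the desired level. The Python sketch below is an added illustration of this highest-density construction; it is not taken from the NPS software, and the function name and the example values are hypothetical.

```python
def credible_threshold(density, level):
    """Return the smallest density value that still belongs to the
    highest-density region containing at least `level` of the total mass.

    `density` is a flat sequence of non-negative voxel values on a grid of
    equally sized voxels; it does not have to be normalized.
    """
    values = sorted((v for v in density if v > 0.0), reverse=True)
    if not values:
        raise ValueError("density contains no positive values")
    total = sum(values)
    cumulative = 0.0
    for v in values:
        cumulative += v
        if cumulative >= level * total:
            return v
    return values[-1]

# Made-up voxel values, e.g. sample counts per voxel.
voxels = [5, 40, 30, 15, 10]
t68 = credible_threshold(voxels, 0.68)
inside = [v >= t68 for v in voxels]  # mask of voxels inside the 68 % volume
```

All voxels with a value at or above the returned threshold form the credible volume. Drawing the isosurface at that threshold in a molecular viewer then shows the region that contains the chosen fraction of the posterior probability.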

The major advantage of NPS over related analysis methods is the possibility to account for orientation effects, which are known to be a source of uncertainty in FRET-based localization experiments but have often been ignored or argued away in the literature so far. NPS also accounts for other uncertainties like FRET measurement errors as well as the positions of fluorophores known only to a limited extent due to the attachment via carbon chain linkers.

The NPS model was realized in two versions that differ in their parametrization. The first is called the position - Förster distance NPS model and uses the fluorophore positions and the Förster distances as model parameters (section 3.1). This model is capable of inferring the unknown position of one antenna fluorophore by analyzing FRET efficiency measurements between the antenna and an arbitrary number of satellite fluorophores. In fact, the position - Förster distance NPS model is a special case and an approximation of the more general second version, the position - orientation NPS model, which uses the fluorophore positions and the average transition dipole moment orientations as model parameters (section 3.2). The position - orientation NPS model is able to infer the unknown positions and average transition dipole orientations of an arbitrary number of satellites and antennas. Moreover, the position - orientation NPS model was applied to the docking of several macromolecules by additionally modeling the position and orientation of the docked macromolecule (section 3.3). The approach is also capable of processing an arbitrary number of FRET efficiency measurements and potentially also FRET anisotropy data, which carry information about the angle between the average transition dipole moment orientations of the fluorophores constituting a FRET pair.

Inference calculations of both the position - Förster distance NPS model and the position - orientation NPS model can be carried out in separate custom-written Matlab software packages (section 5.2). The software packages contain graphical user interfaces and can be operated by users with only basic knowledge of NPS. Marginal position densities of fluorophores as well as the densities of arbitrary positions in the reference frame of a docked macromolecule can be easily exported to common electron density map formats, which can be loaded into molecular viewing software (subsection 5.3.5).

The Nano-Positioning System was applied to study the eukaryotic RNA Polymerase II (Pol II) elongation complex. Though many structural details are already known from high-resolution X-ray crystallography studies, a large part of the nascent RNA, the nontemplate DNA and the upstream part of the template DNA could not be observed in the crystal structure (chapter 4).

NPS was successfully tested by inferring the position of a fluorophore attached to the 3’-end of the RNA (section 6.2.1). From the crystal structure, this position was already known to be located close to the active site of Pol II, and NPS was able to reproduce it.

In the first project, the position of a fluorophore attached to the 5’-end of a nascent, 29 nucleotides long RNA was inferred from FRET efficiency data acquired both in the presence and absence of the transcription factor IIB (TFIIB) (section 6.2.3). Interestingly, it could be shown that the RNA can be displaced by TFIIB from a position close to the dock

In document Muschielok, Adam (2010): Development and application of a quantitative analysis method for fluorescence resonance energy transfer localization experiments. Dissertation, LMU München: Fakultät für Chemie und Pharmazie (Page 160-200)