Uncertainty in integrative structural modeling ScienceDirect

(1)

Uncertainty in integrative structural modeling Dina Schneidman-Duhovny

¹

, Riccardo Pellarin

¹

and Andrej Sali

^1,2

Integrativestructuralmodelingusesmultipletypesofinput informationandproceedsinfourstages:(i)gathering information,(ii)designingmodelrepresentationandconverting informationintoascoringfunction,(iii)samplinggood-scoring models,and(iv)analyzingmodelsandinformation.Inthefirst stage,uncertaintyoriginatesfromdatathataresparse,noisy, ambiguous,orderivedfromheterogeneoussamples.Inthe secondstage,uncertaintycanoriginatefromarepresentation thatistoocoarsefortheavailableinformationorascoring functionthatdoesnotaccuratelycapturetheinformation.Inthe thirdstage,themajorsourceofuncertaintyisinsufficient sampling.Inthefourthstage,clustering,cross-validation,and othermethodsareusedtoestimatetheprecisionandaccuracy ofthemodelsandinformation.

Addresses

1DepartmentofBioengineeringandTherapeuticSciences,Universityof California,SanFrancisco,CA94158,USA

2DepartmentofPharmaceuticalChemistry,andCaliforniaInstitutefor QuantitativeBiosciences(QB3),UniversityofCalifornia,SanFrancisco, CA94158,USA

Correspondingauthors:Schneidman-Duhovny,Dina([email protected]), Sali,Andrej([email protected])

CurrentOpinioninStructuralBiology2014,28:96–104 ThisreviewcomesfromathemedissueonBiophysicaland molecularbiologicalmethods

EditedbyDavidMillarandJillTrewhella

http://dx.doi.org/10.1016/j.sbi.2014.08.001 0959-440X/#2014ElsevierLtd.Allrightreserved.

Introduction

Tounderstandandmodulatebiologicalprocesses,weneed theirspatiotemporalmodels.Thesemodels canbecom- putedbasedoninputinformationaboutthestructureand dynamics of the system of interest, including physical theories, statistical inference from databases of known sequences and structures, as well as a large variety of experimentalmethods.Astructuralmodelofamolecule isdefinedbytherelativepositionsandorientationsofits components(e.g.atoms,pseudo-atoms,residues,second- arystructureelements,domains,andsubunits).Allstruc- tural characterization approaches correspond to finding modelsthatbest fitinputinformation,as can bejudged

byascoringfunction;whenthescoringfunctionincludes experimentaldata,itquantifiesthedifferencebetweenthe observed data and the data computed from the model.

Therefore,structuralcharacterizationcanbedescribedasa four-stage process: (i) gathering input information, (ii) designing model representation and converting informationintoascoringfunction,(iii)samplinggood-scoring models,and(iv)analyzingmodelsandinformation(Box1 and Figure 1). For example, in X-ray crystallography a modelconsistsofatomicpositions,andthescoringfunction assesses the agreements (i) between the computed and observedstructurefactorsviatheRfreeparameter[1]aswell as(ii)betweenthemodelgeometryandtheidealgeometry implied by a molecular mechanics force field via the potentialenergyofthemodel.

Touseamodelwell,weneedtoassessitsaccuracy(stage ivabove).Assessmentstandardsandcorrespondingtools havealreadybeendevelopedforX-raycrystallography[2]

and NuclearMagnetic Resonance(NMR) spectroscopy

Box1 Glossary

Inputdata—experimentaldatausedtocomputeamodel.

Inputinformation—experimentaldataandanyadditionalinfor- mation.

Datasparseness—ameasureoftheamountofdatarelativetothe numberofdegreesoffreedominthemodel.

Dataerror—thedifferencebetweenthemeasureddataanditstrue value,whichcanbecomputedgivenaforwardmodelandthetrue structure;dataerrorcanberandomand/orsystematic,affectingthe precisionandtheaccuracyofthemeasureddata.

Dataambiguity—adatapointisambiguouswhenitcannotbe assignedtothespecificcomponentsofthemodel.

Dataincoherence—adatasetisincoherentwhenitisderivedfrom acompositionallyorconfigurationallyheterogeneoussample.

Single-statemodel—amodelthatspecifiesasinglestructural stateandvalueforanyotherparameter.

Multi-statemodel—amodelthatspecifiestwoormoreco-existing structuralstatesandvaluesforanyotherparameter.

Ensembleofstructuralmodels—asetofstructuralmodelseach oneofwhichisconsistentwiththedata.

Ensembleprecision—variabilityamongstructuralmodelsinthe ensemble.

Errororaccuracyofastructuralmodel—thedifferencebetween thestructuralmodelandthetruestructure(s).

Representationresolution—adescriptorofthedetailinthe representationofthestructuralmodel(e.g.atomicmodelsconsistof atoms).

(2)

[3],whiletheyarestill evolvingfor electronmicroscopy (EM) [4], SmallAngle X-ray Scattering[5,6],and com- parativemodeling[7].Standardvalidationofthecrystal- lographic and NMR entries in the Protein Data Bank (PDB)[8]includesassessinggeometricalfeaturessuchas stereochemistry and packing, fit of the model to the experimental data, andthequality of thedataitself. In the EM field, Fourier Shell Correlation (FSC) is com- monlyusedtoestimatemapresolution[4,9,10].Recently, new validation methods for EM maps were suggested, includingtiltpairanalysis[11],gold-standardFSCcurves [4], high-resolution noise substitution [12,13], and ResLogplots[14].InSAXS datavalidation, thex-free

criterionwasrecentlyproposed[15],inspiredbyRfreein crystallography. Protein aggregation can be revealed in the Guinier plot, inter-particle interference can be detected by measuring SAXS profiles at multiple con- centrations,andconformationalheterogeneityistosome degreereflectedintheKratkyorPorod-Debyeplots[16].

Estimating the accuracy of comparative models is still challenging,butmethodsbasedonavarietyofcriteriado exist[7,17,18].

Nosingleexperimentalmethodisguaranteedtoproduce asatisfactorystructureforagivensystem.Nevertheless, structure determination can often benefit from an

Figure1

Single-state model Multi-state model

Model validation: convergence, cross validation, bootstrapping, permutation tests

Sparseness Error Ambiguity Incoherence

Noise model Representation

commensurate with information

Heterogeneity model Ambiguity

model

Restraints satisfied?

Single cluster yes

Multiple clusters no

more information is needed

problems with data, representation, data interpretation, sampling Gathering information

Designing model representation and converting information into a scoring function

Sampling good-scoring models

Analyzing models and information

… … te model

Heterogeneity model o a scoring function

ermutation tests

more information is needed

n, s with data, representation erpretation, sampling

…

Current Opinion in Structural Biology

Uncertaintyinintegrativestructuremodeling.Thefour-stageschemeofintegrativestructuremodelingisusedtodescribehowtoapproachuncertainty inthedataandthemodels.Thecollectedinformationisconvertedintoascoringfunctionthataccountsfordataerror,ambiguity,andincoherence.The modelrepresentationshouldreflectdatasparseness.Aftersampling,ifgood-scoringmodelssatisfytherestraints,theyarefurtherevaluatedby structuralclusteringanddatavalidationtests.

(3)

integrative (hybrid) approach, where information from multiple experimental datasets is used to compute all structural modelsthat are consistent with the available data[19–22].DatafromX-raycrystallography,EM,NMR spectroscopy, SAXS, cross-linkingcombined with mass spectrometry (MS), Fo¨rster resonance energy transfer (FRET) spectroscopy, double electron–electron resonance (DEER), and hydrogen–deuterium exchange (HDX)isfrequentlyusedinintegrativestructuredeter- mination (Table 1). Sometimes integrative models are assessed based on clustering of models, modeling with simulateddata,observationofnon-randompatternsinthe models,andmodelingwithsubsetsofdata[20].However, asetofstandardsforvalidatingintegrativemodelshasnot yetbeendeveloped [22].

Itisessentialforappropriateuseofastructuralmodelto estimateerrorsin themodelas wellas thedata usedto computeit.Modelerrorisdefinedas thedifference between the model and true structure. It originates from severaldifferentsources.First,inputdatacanbesparse, noisy, ambiguous, or incoherent (Glossary). Second, the systemrepresentationcanbetoocoarse,resultinginsome inputinformationbeingignored.Third,thescoringfunc- tionmay notaccuratelycapturetheinputinformationor theinputinformation isinsufficient to identifythe true structure.Fourth,samplingmaynotfindthetruestructure

due to many degrees of freedomused to represent the system. Because the true structure is unknown in real applications,modelerrorisalsounknown.However,the lowerboundonthemodelerrorcanoftenbeestimatedas theprecisionofthesetofmodelsconsistentwiththeinput information.Here,wedescribetheoriginsofuncertaintyin each stage of integrative modeling,and suggesthow to quantifyandminimizeit.

Stage1: gatheringinformation

Spatialinformationaboutagivensystemcanincludedata fromexperiments such as those listed above, statistical propensitiessuchasatomicstatisticalpotentialsextracted fromknownproteinstructures,andphysicallaws,suchas interatomic interactions approximated by a molecular mechanics force field. This information is used to representthe system as well as to sample and rank its possibleconfigurations.Therearefoursourcesofuncer- taintyin theinformation,as follows.

Data sparseness. The data sparseness measures the amountofinformationinthedatarelativetothenumber of degrees of freedom in the model; the amount of informationin thedatadependsonthenumberofdata points and their precision as well as their interdepen- dence.Datasparsenessaffectstheprecisionofthemodel [23].Forexample,foraprotein–proteincomplexmapped

Table1

Someoftherecentstructuressolvedbyanintegrativeapproach.

Structure Experimentalinformation Method

SaccharomycescerevisiaeINO80[56] Cryo-EMmap(17A˚ resolution),212intra-proteinand 116inter-proteincross-links

Manualmodeling inChimera PolycombRepressiveComplex2[57] NegativestainEMmap(21A˚ resolution)and60

intra-proteinandinter-proteincross-links

Manualmodeling inChimera 39Slargesubunitoftheporcine

mitochondrialribosome[58]

Cryo-EMmap(4.9A˚ resolution)and70inter-protein cross-links

COOT,O,PHENIX Schizosaccharomycespombe26S

holocomplex[52]

Cryo-EMmap(8.4A˚ resolution)and35cross-links fromS.pombeand36crosslinksfromS.cerevisiae

IMP SaccharomycescerevisiaeRNApolymeraseII

transcriptionpre-initiationcomplex[54]

Cryo-EMmap(16A˚ resolution),157intra-proteinand 109inter-proteincross-links

Exhaustiveenumeration Saccharomycescerevisiae40SeIF1eIF3

translationinitiationcomplex[59]

965cross-links,including126uniqueeIF3-eIF3and 40SeIF1-eIF3crosslinks,negativestainEMmap (28A˚ resolution),crystallographicstructuresof40S complex,eIF3domains

IMP

SalmonellatyphimuriumTypeIIIsecretion systemneedle[53]

Solid-stateNMR,cryo-EM(19.5A˚ resolution) Rosetta Methanemonooxygenasehydroxylase

(MMOH),toluene/o-xylenemonooxygenase hydroxylase(ToMOH),andurease[60]

CompositionandstoichiometryfromnativeMS, collisioncrosssectionfromionmobility–MSand cross-links

IMP

ESCRT-Icomplex[61] SAXS,doubleelectron–electrontransfer(DEER),and FRET

EROS

Hsp90substraterecognition[46] 31cross-linksandNMRspectroscopy IMP

SplicingfactorU2AF65[62] Paramagneticrelaxationenhancement(PRE),residue dipolarcouplings(RDCs),andSAXS

ASTEROIDS

HIV-1capsidprotein[63] RDCsandSAXS Xplor-NIH

500-kilobase(kb)domainofhuman chromosome16[64,65]

ChromosomeConformationCaptureCarbonCopy (5C)experimentsandexcludedvolume

IMP Humangenomearchitecture[66] Tetheredchromosomeconformationcapture(TCC)

andpopulation-basedmodeling

IMP

(4)

byasinglecross-link,ifeachproteinisrepresentedbya single sphere,thedata sparsenessis 1 datapointper1 degree of freedom; if the proteins are represented by theirrigidatomicstructures,thedatasparsenessis1data point per 6 degrees of freedom (3 rotations and 3 translations). In X-raycrystallography,thedata sparseness can be quantified by the number of reflections divided by the number of atoms in the unit cell. In NMRspectroscopy,thedatasparsenessisusuallyquan- tifiedbythenumberofNOE restraintsperresidue. In SAXS,thedatasparsenessofaSAXSprofileisdefined using the Nyquist-Shannon sampling theorem: given the maximumdimension(dmax),thesampling theorem determines that thenumber ofunique, evenly distrib- uted observations for a maximum scattering vector (qmax) is given by (dmax qmax)/p. The problem with sparsedata isthatthereare morefree parametersthan observations, which maylead toanover-interpretation ofthe data(over-fitting).

Data error. Errorof thedata isthesum of random and systematic measurement errors. The magnitude of random error can best be assessed by multiple repeated measurements;systematicerrorsforagiventypeofdata can be estimated by benchmarks relying on known structures. For example, in X-ray crystallography the random error is caused by the noise in the X-ray flux, detector and electronics, whilethesystematic errorcan resultfromradiationdamage,conformationalheterogen- eity of thesample, and crystalpacking defects[24]. In SAXS, the random error sources are similar to those in crystallography,whilethesystematicerrorcanresultfrom sampleaggregation andradiationdamage[5].

Data ambiguity. Data ambiguity is the uncertainty in assigningdatapointstospecificcomponentsofthesys- tem. Forexample,it isgenerallynot possibletoassign which of the three methyl protons gave rise to an observedNOEsignal[25].Anotherexampleistheambi- guityofassigningacross-linktoaspecificinstanceofa proteinwhenthecomplexcontainsmultipleinstancesof it. By contrast, diffraction and scattering data are a function of all components of a system and thus not ambiguous.

Data incoherence. Data incoherenceis aresultof com- positionalorconformationalheterogeneityofoneormore samplesusedto generateoneor moredatasetsfor modeling;forexample,asystemmayexistasamixtureoftwo statesinanNMRsolutionexperimentoritmayexistin different states in X-ray (crystal) and SAXS (solution) experiments. As a result, the measured data will be a mixture ofcontributionsfromeach state.The abilityto disentangledifferentstatesdependsontheprecisionand accuracyofthedata;forexample,conformationaldiffer- ences smaller than the precision of the data may be difficultto detect.

Stages 2and 3:converting input information into system representation,scoring function, and sampling

Input informationaboutthestructureof thesystemcan beused(i)toselectthesetofvariablesthatrepresentthe system(systemrepresentation),(ii)torankthedifferent configurations(scoringfunction), (iii)to searchfor good scoringsolutions(sampling),and(iv)modelvalidation.It is often most computationally efficient, although not always possible, to encode information into the representation;in contrastit isgenerally moststraightforward, butleastefficient,toencodeinformationintothescoring function.Forexample,inprotein–proteindocking,max- imizationofshapecomplementaritycanbeencodedinto a scoring function that is then optimized by a generic optimization method. Alternatively, maximization of shape complementarity can also be encoded more effi- ciently through a representation consisting of shape descriptors,suchassurfacecurvature,resultingin faster sampling by generating only configurations of subunits withcomplementaryshape descriptors[26].

Representation

The representation of a system is defined by all the variables that need to be determined based on input information, including the assignment of the system components to geometric objects, such as points and spheres. A simple example is Cartesian coordinates for pointscorrespondingtotheindividualatoms.Morecom- plex representations can assign a component to other geometric primitives (e.g. spheres, ellipsoids, and 3D Gaussian density functions) and include additional degrees of freedom, such as the number of states in the system and their weights. For instance, in a high- resolutionrepresentation,aspherecanrepresentasingle atom, while in a coarse-grained representation it may correspondto aresidue.Coarse-grainingcanbeusedto encode the uncertainty arising from both static and dynamic variability. Moreover, in a ‘rigid body’, the relative positions of the primitives (e.g. atoms in a domain) can be constrained, for example based on a crystallographic structure. In most applications, the representation is determined before any other compu- tationsandisnotchanged.Theresolutionoftherepres- entation should be commensurate with the input information. In some cases, it isbeneficial to represent different parts of astructure with different representations or a part may bedescribed with several different inter-linked representations simultaneously (i.e. multi- scalerepresentation);insuchacase,informationcanbe applied to restrain the model by using the most con- venientrepresentation [27].

When defining the representation, we usually have to balance between the requirements of scoring and sampling. Weneed arepresentation that is sufficiently detailedforaccuratelyassessingamatchbetweenamodel

(5)

and the input information. For example, when using chemical cross-linking information, we need to choose betweenrepresentingthecross-linkerexplicitlywithall ofitsatoms[28]orimplicitlyasafunctionofthedistance between the cross-linked residues. To minimize data sparseness, we also need a representation that is sufficiently coarse, given the invariablylimited information content of the data. Finally, the representation should also be sufficiently coarse to allow for exhaustive samplingofgoodscoringmodelsinafeasibletimeframe.

Although it would be best to be able to compute an optimal representation based on theinput information, thisisnotyetpossible.

Scoring

Most generally, the scoring function ranks alternative models based on the evidence provided by the input information. The scoring function should take into account the uncertainty in the input information, including sparseness, error, ambiguity, and incoherence.

Forexample,ascoringfunctioncouldevaluatewhether or not agiven model fits thedata withinits error bars.

Ambiguity in the data assignment should also be accounted for bythe scoring function. For example, to addresstheambiguityinmethylprotonassignmentforan observedNOEsignal,thesignalisoftenassignedtothe center of mass of thethree methyl protons [25]. For a complexwithmultiplecopiesofthesameprotein,across- link can be assigned to the copy of the protein that satisfies it best [20]. When a sample is heterogeneous (i.e.dataareincoherent),ascoringfunctionshouldrank instances of a model, each one of which consists of multiple structures (multi-state model). For example, proteinheterogeneityin acrystalcanbemodeledusing snapshots of molecular dynamics simulations [29].

Protein dynamics in solution, as measured by SAXS and NMR spectroscopy, can be modeled by fitting multiple weighted conformations to the data [30–34].

InEM singleparticle reconstruction, heterogeneitycan beaddressedbymulti-modelreconstructionusingmulti- stageclustering [35].

The most objective ranking of models is in principle achievedbyaBayesianscoringfunction[36].TheBaye- sianapproachestimatestheprobabilityofamodel,given information available about the system, including both priorknowledgeand newlyacquiredexperimentaldata.

When modeling heterogeneous systems, model M includes a set of N modeled structures X={Xi}, their population fractions in the sample {wi}, and potentially additionalparameters(e.g.theunknowndataerrors).The posteriorprobabilityp(MjD,I)ofmodelMgivendataD andpriorknowledgeIis

pðMjD;IÞ/pðDjM;IÞ pðMjIÞ

wherethelikelihoodfunctionp(MjD,I)istheprobabilityof observingdataDgivenMandI;andthepriorp(MjD)is

the probability of model M given I. The likelihood functionisbasedontheforwardmodelf(X)thatpredicts thedatapointthatwouldhavebeen observedforstruc- ture(s)Xintheabsenceofexperimentalerror,andanoise model that specifies the distribution of the deviation betweentheexperimentallyobservedandpredicteddata points. The Bayesian scoring function is defined as S(M)=log[p(D|M,I)p(M|I)]which ranksthemodels thesameastheposteriorprobability.Themostprobable modelsare found by selectingthe best scoring models sampledfromtheposteriordistribution.

The Bayesian scoring function can account for most sources of uncertainty in data without over-fitting. It was successfully adopted for NMR spectroscopy data [36,37], and recently cryo-EM density maps [38,39].

Bayesian structuredeterminationbased onsparseNOE measurements produces more accurate structures and betterestimatesof precision thanstandard NMRstruc- turedeterminationmethods[36,37,40].Insingleparticle EM reconstruction, the Bayesian approach results in density maps with higher resolution than those from standard reconstruction methods using the same input datasets[38,39];moreover,high-resolutionmapscanbe obtainedfromonly afewthousandof particles [41–43].

Recently, the BioEM method for Bayesian analysis of individualEMimagesthatcandealwithconformational heterogeneitywasdeveloped[44].Bayesianscoringfunc- tionshavealsobeendevelopedforcysteinecross-linking [45], chemical cross-linking [46], FRET spectroscopy [47],and atomicstatisticalpotentials [48].

TheBayesianapproachismoreobjectivethantraditional scoringfunctionsinanumberofrespects:(i)inferenceof unknownquantities,suchasdataerrorandstateweights, (ii) combination of different types of information, (iii) inferenceof multiple structures, (iv)estimate of model precision,and(v)‘marginalization’ofparametersthatare difficulttodetermine.Themaindisadvantageisthatthe modelismoreelaborate(cf.noisemodelandpriors)anda more exhaustive sampling of structural and parameter spaceisrequired.

Sampling

Avariety ofoptimizationmethods(e.g.conjugategradi- ents),samplingalgorithms(e.g.MonteCarlo),andeven exhaustive enumeration (e.g. Fast Fourier Transform) canbeusedto findmodels consistentwith inputinformation.Themajorsourceofuncertainty inthis stageis insufficient sampling due to the ruggedness and high dimensionality of the scoring function landscape that needstobesampled.Asaresult,itisalmostnevercertain thatthebestscoringmodelsweresampled.Forstochastic sampling,such as theMonte Carlo algorithm,thethor- oughnessof samplingcanbeindicated byshowingthat new independent runs (e.g. usingrandom starting con- figurationsanddifferentrandomnumbergeneratorseeds)

(6)

do not result in significantly different good-scoring solutions (‘convergence test’) [49]. Passing such a test is a necessarybut notsufficient condition for thorough sampling;apositiveoutcomeofthetestmaybemislead- ingif,forexample,thelandscapecontainsonlyanarrow, and thus difficult to find, pathway to the pronounced minimum correspondingto thenativestate.

Formulti-statemodels,thesamplingisoftenperformed intwosteps,duetotherelativelyhighnumberofdegrees of freedominvolved.First,alargesetof possiblesingle configurations is sampled. Second, thesets of configur- ationsinamulti-statemodelareenumerated,forexample byusingageneticalgorithm[31,32],amaximumentropy approach [33],oradeterministicmethod[34].

Stage 4:analyzing modelsand information Input information and output models are analyzed in ordertoestimatemodelprecisionandaccuracy,todetect inconsistentinformationandmissinginformation,aswell astosuggestmostinformativefutureexperiments.There are three possible outcomesof modeling,based on the number of clusters of modelsand consistency between the models and information. The following discussion appliestosingle-statemodels,butsimilarconsiderations canalsobeextendedtomulti-statemodels.First,ifonlya single model(or aclusterof similarmodels)satisfiesall restraintsandthusallinputinformation,thereisprobably sufficientinformationfordeterminingthestructure(with theprecisioncorrespondingtothevariabilitywithinthe cluster). Second, if two or more different models are consistentwith therestraints,theinformationisinsuffi- cient to define the single state or there are multiple significantly populatedstates. If thenumberof distinct models is small, thestructural differencesbetween the models may suggest additional experiments to narrow downthepossible solutions.Third,ifnomodelsatisfies allinputinformation,theinformationoritsinterpretation intermsoftherestraintsareincorrect,therepresentation needs to include additional degrees of freedom, and/or sampling needs to be improved (regardless of the outcome oftheconvergencetestabove).

If multiplestructural statesare indicated,care must be takenthat thescoring functionexplicitly allowsfor this possibility [31–34,45,46]. When a mixture of states is modeled, thenumberofstatesneedsto bedetermined.

Frequently, Occam’s razor suggests that the smallest number of states sufficient to explain the input information withinsome thresholdistheoptimal choice. An example of this approach is the ‘minimal ensemble’

method in molecular modeling based on SAXS data [32]. However, sometimes Occam’s razor is not applicable. For example, even though a SAXS profile of an intrinsically disordered protein may be matched byasumofprofilesfortheminimalensemblestructures, thesystemislikelytoexistinalargeensembleofmany

widely differentstates; suchcasesareindicatedbysim- ilaritybetweendistributionsofstructuralproperties,such as the radius of gyration, of the minimal and large ensembles [31].

Onceweobtainamodel(singleormulti-state)thatsatisfies theinputdata,wecananalyzeittoestimateprecisionand accuracy. It is impossible to know with certainty the accuracyof theproposed structurewithout knowingthe realnativestructure.However,accuracycanbeestimated based on rules derived from benchmark studies that involvemodelingofknownstructures.Forexample,there isastrongcorrelationoftheaccuracyofanX-raystructure withthe resolution oftheX-raydataset andRfree [1].In additiontosuchbroadrules,fivetypesofanalysisthatare indicativeofmodelprecisionandaccuracyinspecificcases havebeenproposed[20],asfollows(thefirstthreetestsare examplesofstatisticalresampling[50]).

Estimating model precision based on variability in the ensembleofgood-scoringmodels.Themodelensemble isanalyzedintermsoftheprecisionofitsfeatures,suchas the protein positions and contacts [49,51–53]; the pre- cisionis definedbythevariability in theensembleand likely provides the lower bound on its accuracy. Of particular interest are the features that are present in most configurations in theensemble and have a single maximum in their probability distribution. The spread aroundthemaximumdescribeshowpreciselythefeature was determined by the input information. A more thorough test is performed by estimation of structural variability in multiple random subsets of theensemble [52,54].

Self-consistencyoftheexperimentaldata.Inconsistencies intheexperimentaldataoritsinterpretationareindicated by an ensemble of models containing only frustrated structures that do not satisfy the input data, although such an outcome can also arise from the failure of sampling. If there is amodel that satisfies all data, the probability ofsuchamodeloccurring bychancecanbe indicated by statistical significance tests; if this prob- abilityislow,themodelislikelytobecorrect.Inthese tests, the labels on the data points are randomized or permuted, followed by re-computing the model; for example, one can assign cross-links to random residue pairs[55].

Validatingmodelsbyusingrandomsubsetsofexperimen- tal data.The structurecanbedirectlyvalidated against experimentaldatathatwasnotincludedinthestructure calculation[52].Thiscriterionissimilartothecrystal- lographicRfreeparameterandcanbeusedtoassessboth themodelaccuracyandtheinputdata[1].Alternatively, modeling can be repeated with random subsets of the data. Common statistical techniques for this validation include crossvalidation andbootstrapping[50].

(7)

Reproducibilityofthemodelwithsimulateddata.Inthis approach,anativestructureisassumed,therestraintsto betestedaresimulatedfromthisstructure,thestructure isthenreconstructedbasedonlyontheserestraints,and finally the reconstruction is compared to the original assumed structure [49]. Using such simulations, the dependenceof model accuracy onthe amount,quality, and type of information canbemapped for futurepre- dictionofaccuracy.

Patternsunlikely to occurby chance.Unlikely patterns emergingfrommappingindependentandunuseddataon the structure also increase our confidence in a model, similarlytovalidationbyinformationnotusedinmodel- ing.Forexample,themodelofthenuclearporecomplex (NPC)revealedanunexpected16-foldpseudo-symmetry in the arrangement of fold types of the constituent proteins, in addition to the known 8-fold symmetry [49].The 16-foldpseudo-symmetryvalidatesthemodel becausethefoldtypeswerenotusedinmodelingandthe 16-foldpseudosymmetryis unlikely toarise bychance (while it can be reasonably explained by gene dupli- cationsintheevolutionoftheNPC).

Conclusions

Integrative structure determination needs de factostan- dardsandtoolsforassessingtheinputdataandresulting models, followingin thefootstepsof X-raycrystallography and NMR spectroscopy with established structure validationcriteria.

Acknowledgments

WeacknowledgesupportfromNIHR01GM083960andNIHU54 GM103511(A.S.).WeacknowledgesupportfromSwissNationalScience Foundation(SNSF)grantsPA00P3_139727andPBZHP3-133388(toR.P.).

References andrecommendedreading

Papersofparticularinterest,publishedwithintheperiodofreview, havebeenhighlightedas:

ofspecialinterest

ofoutstandinginterest

1. Bru¨ngerAT:FreeRvalue:anovelstatisticalquantityfor assessingtheaccuracyofcrystalstructures.Nature1992, 355:472-475.

2. ReadRJ,AdamsPD,ArendallWB,BrungerAT,EmsleyP, JoostenRP,KleywegtGJ,KrissinelEB,Lu¨ttekeT,OtwinowskiZ etal.:Anewgenerationofcrystallographicvalidationtoolsfor theproteindatabank.Structure2011,19:1395-1412.

3. MontelioneGT,NilgesM,BaxA,Gu¨ntertP,HerrmannT, RichardsonJS,SchwietersCD,VrankenWF,VuisterGW, WishartDSetal.:RecommendationsofthewwPDBNMR ValidationTaskForce.Structure2013,21:1563-1570.

4. HendersonR,SaliA,BakerML,CarragherB,DevkotaB, DowningKH,EgelmanEH,FengZ,FrankJ,GrigorieffNetal.:

Outcomeofthefirstelectronmicroscopyvalidationtaskforce meeting.Structure2012:205-214.

5. JacquesDA,GussJM,SvergunDI,TrewhellaJ:Publication guidelinesforstructuralmodellingofsmall-anglescattering datafrombiomoleculesinsolution.ActaCrystallogrDBiol Crystallogr2012,68:620-626.

6. TrewhellaJ,HendricksonWA,KleywegtGJ,SaliA,SatoM, SchwedeT,SvergunDI,TainerJA,WestbrookJ,BermanHM:

ReportofthewwPDBSmall-AngleScatteringTaskForce:data requirementsforbiomolecularmodelingandthePDB.

Structure2013,21:875-881.

7. SchwedeT,SaliA,HonigB,LevittM,BermanHM,JonesD, BrennerSE,BurleySK,DasR,DokholyanNVetal.:Outcomeofa workshoponapplicationsofproteinmodelsinbiomedical research.Structure2009:151-159.

8. BermanHM,WestbrookJ,FengZ,GillilandG,BhatTN,WeissigH, ShindyalovIN,BournePE:TheProteinDataBank.NuclAcids Res2000,28:235-242.

9. SaxtonWO,BaumeisterW:Thecorrelationaveragingofa regularlyarrangedbacterialcellenvelopeprotein.JMicrosc 1982,127:127-138.

10. HarauzG,vanHeelM:Exactfiltersforgeneralgeometrythree dimensionalreconstruction.ProcIEEEComputerVisionPattern Recogn1986,78:146-156.

11. HendersonR,ChenS,ChenJZ,GrigorieffN,PassmoreLA, CiccarelliL,RubinsteinJL,CrowtherRA,StewartPL, RosenthalPB:Tilt-pairanalysisofimagesfromarangeof differentspecimensinsingle-particleelectron

cryomicroscopy.JMolBiol2011,413:1028-1046.

12. ScheresSHW,ChenS:Preventionofoverfittingincryo-EM structuredetermination.NatMethods2012,9:853-854.

13. ChenS,McMullanG,FaruqiAR,MurshudovGN,ShortJM, ScheresSHW,HendersonR:High-resolutionnoisesubstitution tomeasureoverfittingandvalidateresolutionin3Dstructure determinationbysingleparticleelectroncryomicroscopy.

Ultramicroscopy2013,135:24-35.

14.

StaggSM,NobleAJ,SpilmanM,ChapmanMS:ResLogplotsas anempiricalmetricofthequalityofcryo-EMreconstructions.

JStructBiol2014,185:418-426.

A method for assessingthe accuracyof cryo-EMreconstructions is presented.Aplotofinverseresolutionversusthelogarithmofthenumber ofsingleparticles(a‘ResLog’plot)providesmetricsforthereliabilityof thereconstructionandtheoverallqualityofthedatasetandprocessing.

15.

RamboRP,TainerJA:Accurateassessmentofmass,models andresolutionbysmall-anglescattering.Nature2013,496:477- 481.

AstatisticalmethodbasedontheNyquist–Shannonsamplingandthe noisy-channelcodingtheoremsisusedforevaluatingstructuralmodels againstSASdata.

16. RamboRP,TainerJA:Characterizingflexibleandintrinsically unstructuredbiologicalmacromoleculesbySASusingthe Porod-Debyelaw.Biopolymers2011,95:559-571.

17. HaasJ,RothS,ArnoldK,KieferF,SchmidtT,BordoliL, SchwedeT:TheProteinModelPortal—acomprehensive resourceforproteinstructureandmodelinformation.

Database(Oxford)2013,2013bat031.

18. KryshtafovychA,BarbatoA,FidelisK,MonastyrskyyB, SchwedeT,TramontanoA:Assessmentoftheassessment:

evaluationofthemodelqualityestimatesinCASP10.Proteins 2014,82(Suppl2):112-126.

19. SaliA,GlaeserR,EarnestT,BaumeisterW:Fromwordsto literatureinstructuralproteomics.Nature2003,422:216-225.

20. AlberF,DokudovskayaS,VeenhoffLM,ZhangW,KipperJ, DevosD,SupraptoA,Karni-SchmidtO,WilliamsR,ChaitBTetal.:

Determiningthearchitecturesofmacromolecularassemblies.

Nature2007,450:683-694.

21. RusselD,LaskerK,WebbB,Vela´zquez-MurielJ,TjioeE, Schneidman-DuhovnyD,PetersonB,SaliA:Puttingthepieces together:integrativemodelingplatformsoftwareforstructure determinationofmacromolecularassemblies.PLoSBiol2012, 10:e1001244.

22. WardAB,SaliA,WilsonIA:Biochemistryintegrativestructural biology.Science2013,339:913-915.

23. HabeckM:Statisticalmechanicsanalysisofsparsedata.J StructBiol2011,173:541-548.