Data source for this dissertation - PROTEIN NMR DATA OVERVIEW

CHAPTER 3. PROTEIN NMR DATA OVERVIEW

3.1 Data source for this dissertation

All of the statistics utilized for this dissertation are from he Re-referenced Protein Chemical shift Database (RefDB)71_{. RefDB is a secondary database derived from a} primary database, the Biological Magnetic Resonance Bank (BMRB) 26_{. As mentioned in} Chapter 2, chemical shifts are relative measurements, which are prone to inconsistencies. Moreover, any inconsistencies in the chemical shift measurements could distort the subtle but rich source of structural and dynamic information present in these measurements. Since 1991, BMRB has served as an archive for interpreted and raw NMR experimental datasets that allow a biomolecular NMR researcher to systematically assemble, compare, and interpret a variety of NMR measurements, especially chemical shifts, across the database. However, these entries are deposited by researchers across the NMR community that utilize a wide range of NMR experiments and data analysis procedures. As a result, roughly 40% of entries in the BMRB contain referencing inconsistencies26,84_{. These data} quality issues limit easy reuse of the BMRB, especially global analyses across the BMRB, which became the impetus driving the original development of the RefDB.

The RefDB contains a subset of the BMRB entries that are carefully and properly re-referenced according to the IUPAC/IUB convention71_{. The re-referencing procedure} involves using X-ray or NMR coordinate data to estimate protein 1_H,13_{C and}15_{N chemical} shifts via SHIFTX then compare estimated results with the observed shifts in BMRB via SHIFTCOR85_{. RefDB provides a standard chemical shift resource for the protein NMR} community; however, even this resource for protein NMR chemical shifts has some limitations, as is pointed out in Chapter 5.

3.1.1 Normal distribution

The Normal, or Gaussian, distribution is the most important and most widely used distribution. This bell-shape curved distribution can be well approximated in all manner of data that appear to be distributed normally: human height, IQ scores, grades, productions, and chemical shift values are non-exception. Relative frequency distribution can be obtained by suitably normalized frequency distribution, aka. histogram as showing in Figure 3.1. The one-dimensional Normal distribution is determined by just two parameters: mean μ and standard deviation σ of the data. The very definition of Normal distribution 𝑁𝑁(𝜇𝜇, 𝜎𝜎) is: 𝑝𝑝(𝑥𝑥) =_{𝜎𝜎√2𝜋𝜋}1 exp −(𝑥𝑥−𝜇𝜇)_2𝜎𝜎₂2

Figure 3.1 Normal distribution plot and histogram.

The mean, 𝜇𝜇, controls the location of the peak of the distribution, and 𝜎𝜎 controls the dispersion of the distribution and the larger the value 𝜎𝜎 is, the “fatter” the distribution appears. In the scope of the dissertation, the data I used in the BaMORC method are from alpha and beta carbons (Figure 3.2). Thus, a bivariate or 2-D Normal distribution is most appropriate, and a third parameter besides mean and standard deviation is introduced as follows.

The theory of correlation between two variants is an important concept in the mathematical statistics utilized in this dissertation. The bivariate Normal distribution

is an extension of the familiar univariate Normal distribution. Similarly, the probability density function is 𝑝𝑝(𝑥𝑥1, 𝑥𝑥2) =_{2𝜋𝜋𝜎𝜎} 1 1𝜎𝜎2�1−𝜌𝜌2exp �− 𝑧𝑧 2(1−𝜌𝜌2₎� Where 𝑧𝑧 =(𝑥𝑥1−𝜇𝜇1)2 𝜎𝜎₁2 +(𝑥𝑥2−𝜇𝜇2 )2 𝜎𝜎₂2 −2𝜌𝜌(𝑥𝑥1−𝜇𝜇_𝜎𝜎₁1_𝜎𝜎)(𝑥𝑥₂ 2−𝜇𝜇2) and 𝜌𝜌 = 𝑐𝑐𝑐𝑐𝑐𝑐(𝑥𝑥1, 𝑥𝑥2) = 𝐶𝐶𝐶𝐶𝐶𝐶1,2 𝜎𝜎1𝜎𝜎2 is

the correlation of 𝑥𝑥1 and 𝑥𝑥2 and 𝐶𝐶𝑐𝑐𝑣𝑣1,2 is the covariance. And the covariance is the third parameter for a bivariate distribution, which provides a measurement of strength of the correlation between two random variables. In this context, it is the correlation between chemical shifts of alpha and beta carbons from the same amino acid residue.

Figure 3.2 Bivariate Normal distribution. Top: 2-D plot of bivariate normal distribution; bottom: 3-D plot of bivariate Normal distribution.

For uncorrelated variates 𝐶𝐶𝑐𝑐𝑣𝑣(𝑥𝑥1, 𝑥𝑥2) = 0; however, if the variables are correlated in some manner, the covariance will be nonzero: if 𝐶𝐶𝑐𝑐𝑣𝑣(𝑥𝑥1, 𝑥𝑥2) > 0, then 𝑥𝑥2 tends to increase as 𝑥𝑥1 increases, and visa verse. Conventionally, covariance is included into the covariance matrix along with the variance (squared standard deviation) Σ = � 𝜎𝜎12 𝑐𝑐𝑐𝑐𝑣𝑣

The importance of the covariance is due to the correlation between the chemical shifts of the alpha and beta carbons.

3.1.2 The Central Limit Theorem

The name of “the Central Limit Theorem” has many implications; however, the theorem, that most commonly referred to by this name is the following:

The Central Limit Theorem (CLT)

Let 𝑥𝑥1, 𝑥𝑥2, … , 𝑥𝑥𝑛𝑛 be a sequence of random variables that are identically and independently distributed, with mean 𝜇𝜇 = 0 and variance 𝜎𝜎2_{. Let 𝑆𝑆}

𝑛𝑛 =_√𝑛𝑛1 (𝑥𝑥1+ 𝑥𝑥2+ ⋯ + 𝑥𝑥𝑛𝑛). Then the distribution of the normalized sum 𝑆𝑆𝑛𝑛 approaches the Normal distribution of 𝑁𝑁(0, 𝜎𝜎2_{), as 𝑛𝑛 → ∞ .}

Figure 3.3 A random variable with uniform distribution over [-1,1] added to itself repeatedly. After only four summations, the resulting distribution is very close to a

Normal distribution.

Although this seems to be very counterintuitive, the CLT simply states the summing distribution of 𝑋𝑋 will obtain a Normal distribution in the limit, where 𝑋𝑋 can be any distribution with mean of 0 and variance 𝜎𝜎2_{. Figure 3.3 shows a random variable with} uniform distribution over [-1,1] added to itself repeatedly. After only four summations, the resulting distribution is very close to a Normal distribution. The alternative version of the CLT states, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with mean 𝜇𝜇 and finite variance 𝜎𝜎2_{, will be approximately}

normally distributed with sample mean 𝜇𝜇 and sample variance _√𝑛𝑛𝜎𝜎2, regardless of the underlying distribution. And similarly, the CLT applies to bivariate Normal distributions in the same manner.

3.1.3 The Chi-squared distributions.

The following theorem clarifies the relationship between the Normal distribution and the Chi-squared distribution. And the density estimation algorithm from the BaMORC reference correction method uses the chi-squared distribution with two degrees of freedom. Theorem. If X is normally distributed with mean 𝜇𝜇 and variance 𝜎𝜎2 _{> 0, then: 𝑉𝑉 =} �𝑋𝑋−𝜇𝜇_𝜎𝜎 �2is distributed as a chi-squared random variable (𝜒𝜒2_{) with one degree of freedom.}

And the bivariate version of the theorem is: (𝒙𝒙 − 𝝁𝝁)′_𝚺𝚺−𝟏𝟏_{(𝒙𝒙 − 𝝁𝝁)~𝜒𝜒} 22

Proof is showing that the probability density function of the random variable 𝑉𝑉 is the same probability density function of a chi-squared random variable with 1 degree of freedom and the bivariate version can be established in same manner and omitted here, thus we only need to show

𝑔𝑔(𝑣𝑣) = 1 Γ �12� 1 2𝑣𝑣 1 2−1_𝑒𝑒−𝐶𝐶2

And the cumulative distribution 𝐺𝐺(𝑣𝑣) = 𝑃𝑃(𝑉𝑉 ≤ 𝑣𝑣) = 𝑃𝑃(𝑍𝑍2 _{≤ 𝑣𝑣) = 1, where 𝑍𝑍} follows the standard Normal distribution 𝑁𝑁(0, 𝜎𝜎2_{) and 𝑉𝑉 = 𝑍𝑍}2_{. A proof is out the scope} of this dissertation92. The chi-squared distribution with two degree of freedom is showing

Figure 3.4 Chi-squared density distribution with k=1, 2, 3, 4, 5 degrees of freedom (Figure adapted from WikiMedia.).

In document Automatic \u3csup\u3e13\u3c/sup\u3eC Chemical Shift Reference Correction of Protein NMR Spectral Data Using Data Mining and Bayesian Statistical Modeling (Page 36-41)