CHAPTER 3. PROTEIN NMR DATA OVERVIEW
3.1 Data source for this dissertation
All of the statistics utilized for this dissertation are from he Re-referenced Protein Chemical shift Database (RefDB)71. RefDB is a secondary database derived from a primary database, the Biological Magnetic Resonance Bank (BMRB) 26. As mentioned in Chapter 2, chemical shifts are relative measurements, which are prone to inconsistencies. Moreover, any inconsistencies in the chemical shift measurements could distort the subtle but rich source of structural and dynamic information present in these measurements. Since 1991, BMRB has served as an archive for interpreted and raw NMR experimental datasets that allow a biomolecular NMR researcher to systematically assemble, compare, and interpret a variety of NMR measurements, especially chemical shifts, across the database. However, these entries are deposited by researchers across the NMR community that utilize a wide range of NMR experiments and data analysis procedures. As a result, roughly 40% of entries in the BMRB contain referencing inconsistencies26,84. These data quality issues limit easy reuse of the BMRB, especially global analyses across the BMRB, which became the impetus driving the original development of the RefDB.
The RefDB contains a subset of the BMRB entries that are carefully and properly re-referenced according to the IUPAC/IUB convention71. The re-referencing procedure involves using X-ray or NMR coordinate data to estimate protein 1H, 13C and 15N chemical shifts via SHIFTX then compare estimated results with the observed shifts in BMRB via SHIFTCOR85. RefDB provides a standard chemical shift resource for the protein NMR community; however, even this resource for protein NMR chemical shifts has some limitations, as is pointed out in Chapter 5.
22
3.1.1 Normal distribution
The Normal, or Gaussian, distribution is the most important and most widely used distribution. This bell-shape curved distribution can be well approximated in all manner of data that appear to be distributed normally: human height, IQ scores, grades, productions, and chemical shift values are non-exception. Relative frequency distribution can be obtained by suitably normalized frequency distribution, aka. histogram as showing in Figure 3.1. The one-dimensional Normal distribution is determined by just two parameters: mean ΞΌ and standard deviation Ο of the data. The very definition of Normal distribution ππ(ππ, ππ) is: ππ(π₯π₯) =ππβ2ππ1 exp β(π₯π₯βππ)2ππ22
Figure 3.1 Normal distribution plot and histogram.
The mean, ππ, controls the location of the peak of the distribution, and ππ controls the dispersion of the distribution and the larger the value ππ is, the βfatterβ the distribution appears. In the scope of the dissertation, the data I used in the BaMORC method are from alpha and beta carbons (Figure 3.2). Thus, a bivariate or 2-D Normal distribution is most appropriate, and a third parameter besides mean and standard deviation is introduced as follows.
The theory of correlation between two variants is an important concept in the mathematical statistics utilized in this dissertation. The bivariate Normal distribution
23
is an extension of the familiar univariate Normal distribution. Similarly, the probability density function is ππ(π₯π₯1, π₯π₯2) =2ππππ 1 1ππ2οΏ½1βππ2exp οΏ½β π§π§ 2(1βππ2)οΏ½ Where π§π§ =(π₯π₯1βππ1)2 ππ12 +(π₯π₯2βππ2 )2 ππ22 β2ππ(π₯π₯1βππππ11ππ)(π₯π₯2 2βππ2) and ππ = ππππππ(π₯π₯1, π₯π₯2) = πΆπΆπΆπΆπΆπΆ1,2 ππ1ππ2 is
the correlation of π₯π₯1 and π₯π₯2 and πΆπΆπππ£π£1,2 is the covariance. And the covariance is the third parameter for a bivariate distribution, which provides a measurement of strength of the correlation between two random variables. In this context, it is the correlation between chemical shifts of alpha and beta carbons from the same amino acid residue.
Figure 3.2 Bivariate Normal distribution. Top: 2-D plot of bivariate normal distribution; bottom: 3-D plot of bivariate Normal distribution.
For uncorrelated variates πΆπΆπππ£π£(π₯π₯1, π₯π₯2) = 0; however, if the variables are correlated in some manner, the covariance will be nonzero: if πΆπΆπππ£π£(π₯π₯1, π₯π₯2) > 0, then π₯π₯2 tends to increase as π₯π₯1 increases, and visa verse. Conventionally, covariance is included into the covariance matrix along with the variance (squared standard deviation) Ξ£ = οΏ½ ππ12 πππππ£π£
24
The importance of the covariance is due to the correlation between the chemical shifts of the alpha and beta carbons.
3.1.2 The Central Limit Theorem
The name of βthe Central Limit Theoremβ has many implications; however, the theorem, that most commonly referred to by this name is the following:
The Central Limit Theorem (CLT)
Let π₯π₯1, π₯π₯2, β¦ , π₯π₯ππ be a sequence of random variables that are identically and independently distributed, with mean ππ = 0 and variance ππ2. Let ππ
ππ =βππ1 (π₯π₯1+ π₯π₯2+ β― + π₯π₯ππ). Then the distribution of the normalized sum ππππ approaches the Normal distribution of ππ(0, ππ2), as ππ β β .
Figure 3.3 A random variable with uniform distribution over [-1,1] added to itself repeatedly. After only four summations, the resulting distribution is very close to a
Normal distribution.
Although this seems to be very counterintuitive, the CLT simply states the summing distribution of ππ will obtain a Normal distribution in the limit, where ππ can be any distribution with mean of 0 and variance ππ2. Figure 3.3 shows a random variable with uniform distribution over [-1,1] added to itself repeatedly. After only four summations, the resulting distribution is very close to a Normal distribution. The alternative version of the CLT states, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with mean ππ and finite variance ππ2, will be approximately
25
normally distributed with sample mean ππ and sample variance βππππ2, regardless of the underlying distribution. And similarly, the CLT applies to bivariate Normal distributions in the same manner.
3.1.3 The Chi-squared distributions.
The following theorem clarifies the relationship between the Normal distribution and the Chi-squared distribution. And the density estimation algorithm from the BaMORC reference correction method uses the chi-squared distribution with two degrees of freedom. Theorem. If X is normally distributed with mean ππ and variance ππ2 > 0, then: ππ = οΏ½ππβππππ οΏ½2is distributed as a chi-squared random variable (ππ2) with one degree of freedom.
And the bivariate version of the theorem is: (ππ β ππ)β²πΊπΊβππ(ππ β ππ)~ππ 22
Proof is showing that the probability density function of the random variable ππ is the same probability density function of a chi-squared random variable with 1 degree of freedom and the bivariate version can be established in same manner and omitted here, thus we only need to show
ππ(π£π£) = 1 Ξ οΏ½12οΏ½ 1 2π£π£ 1 2β1ππβπΆπΆ2
And the cumulative distribution πΊπΊ(π£π£) = ππ(ππ β€ π£π£) = ππ(ππ2 β€ π£π£) = 1, where ππ follows the standard Normal distribution ππ(0, ππ2) and ππ = ππ2. A proof is out the scope of this dissertation92. The chi-squared distribution with two degree of freedom is showing
26
Figure 3.4 Chi-squared density distribution with k=1, 2, 3, 4, 5 degrees of freedom (Figure adapted from WikiMedia.).