• No results found

CHAPTER 3. PROTEIN NMR DATA OVERVIEW

3.1 Data source for this dissertation

All of the statistics utilized for this dissertation are from he Re-referenced Protein Chemical shift Database (RefDB)71. RefDB is a secondary database derived from a primary database, the Biological Magnetic Resonance Bank (BMRB) 26. As mentioned in Chapter 2, chemical shifts are relative measurements, which are prone to inconsistencies. Moreover, any inconsistencies in the chemical shift measurements could distort the subtle but rich source of structural and dynamic information present in these measurements. Since 1991, BMRB has served as an archive for interpreted and raw NMR experimental datasets that allow a biomolecular NMR researcher to systematically assemble, compare, and interpret a variety of NMR measurements, especially chemical shifts, across the database. However, these entries are deposited by researchers across the NMR community that utilize a wide range of NMR experiments and data analysis procedures. As a result, roughly 40% of entries in the BMRB contain referencing inconsistencies26,84. These data quality issues limit easy reuse of the BMRB, especially global analyses across the BMRB, which became the impetus driving the original development of the RefDB.

The RefDB contains a subset of the BMRB entries that are carefully and properly re-referenced according to the IUPAC/IUB convention71. The re-referencing procedure involves using X-ray or NMR coordinate data to estimate protein 1H, 13C and 15N chemical shifts via SHIFTX then compare estimated results with the observed shifts in BMRB via SHIFTCOR85. RefDB provides a standard chemical shift resource for the protein NMR community; however, even this resource for protein NMR chemical shifts has some limitations, as is pointed out in Chapter 5.

22

3.1.1 Normal distribution

The Normal, or Gaussian, distribution is the most important and most widely used distribution. This bell-shape curved distribution can be well approximated in all manner of data that appear to be distributed normally: human height, IQ scores, grades, productions, and chemical shift values are non-exception. Relative frequency distribution can be obtained by suitably normalized frequency distribution, aka. histogram as showing in Figure 3.1. The one-dimensional Normal distribution is determined by just two parameters: mean ΞΌ and standard deviation Οƒ of the data. The very definition of Normal distribution 𝑁𝑁(πœ‡πœ‡, 𝜎𝜎) is: 𝑝𝑝(π‘₯π‘₯) =𝜎𝜎√2πœ‹πœ‹1 exp βˆ’(π‘₯π‘₯βˆ’πœ‡πœ‡)2𝜎𝜎22

Figure 3.1 Normal distribution plot and histogram.

The mean, πœ‡πœ‡, controls the location of the peak of the distribution, and 𝜎𝜎 controls the dispersion of the distribution and the larger the value 𝜎𝜎 is, the β€œfatter” the distribution appears. In the scope of the dissertation, the data I used in the BaMORC method are from alpha and beta carbons (Figure 3.2). Thus, a bivariate or 2-D Normal distribution is most appropriate, and a third parameter besides mean and standard deviation is introduced as follows.

The theory of correlation between two variants is an important concept in the mathematical statistics utilized in this dissertation. The bivariate Normal distribution

23

is an extension of the familiar univariate Normal distribution. Similarly, the probability density function is 𝑝𝑝(π‘₯π‘₯1, π‘₯π‘₯2) =2πœ‹πœ‹πœŽπœŽ 1 1𝜎𝜎2οΏ½1βˆ’πœŒπœŒ2exp οΏ½βˆ’ 𝑧𝑧 2(1βˆ’πœŒπœŒ2)οΏ½ Where 𝑧𝑧 =(π‘₯π‘₯1βˆ’πœ‡πœ‡1)2 𝜎𝜎12 +(π‘₯π‘₯2βˆ’πœ‡πœ‡2 )2 𝜎𝜎22 βˆ’2𝜌𝜌(π‘₯π‘₯1βˆ’πœ‡πœ‡πœŽπœŽ11𝜎𝜎)(π‘₯π‘₯2 2βˆ’πœ‡πœ‡2) and 𝜌𝜌 = 𝑐𝑐𝑐𝑐𝑐𝑐(π‘₯π‘₯1, π‘₯π‘₯2) = 𝐢𝐢𝐢𝐢𝐢𝐢1,2 𝜎𝜎1𝜎𝜎2 is

the correlation of π‘₯π‘₯1 and π‘₯π‘₯2 and 𝐢𝐢𝑐𝑐𝑣𝑣1,2 is the covariance. And the covariance is the third parameter for a bivariate distribution, which provides a measurement of strength of the correlation between two random variables. In this context, it is the correlation between chemical shifts of alpha and beta carbons from the same amino acid residue.

Figure 3.2 Bivariate Normal distribution. Top: 2-D plot of bivariate normal distribution; bottom: 3-D plot of bivariate Normal distribution.

For uncorrelated variates 𝐢𝐢𝑐𝑐𝑣𝑣(π‘₯π‘₯1, π‘₯π‘₯2) = 0; however, if the variables are correlated in some manner, the covariance will be nonzero: if 𝐢𝐢𝑐𝑐𝑣𝑣(π‘₯π‘₯1, π‘₯π‘₯2) > 0, then π‘₯π‘₯2 tends to increase as π‘₯π‘₯1 increases, and visa verse. Conventionally, covariance is included into the covariance matrix along with the variance (squared standard deviation) Ξ£ = οΏ½ 𝜎𝜎12 𝑐𝑐𝑐𝑐𝑣𝑣

24

The importance of the covariance is due to the correlation between the chemical shifts of the alpha and beta carbons.

3.1.2 The Central Limit Theorem

The name of β€œthe Central Limit Theorem” has many implications; however, the theorem, that most commonly referred to by this name is the following:

The Central Limit Theorem (CLT)

Let π‘₯π‘₯1, π‘₯π‘₯2, … , π‘₯π‘₯𝑛𝑛 be a sequence of random variables that are identically and independently distributed, with mean πœ‡πœ‡ = 0 and variance 𝜎𝜎2. Let 𝑆𝑆

𝑛𝑛 =βˆšπ‘›π‘›1 (π‘₯π‘₯1+ π‘₯π‘₯2+ β‹― + π‘₯π‘₯𝑛𝑛). Then the distribution of the normalized sum 𝑆𝑆𝑛𝑛 approaches the Normal distribution of 𝑁𝑁(0, 𝜎𝜎2), as 𝑛𝑛 β†’ ∞ .

Figure 3.3 A random variable with uniform distribution over [-1,1] added to itself repeatedly. After only four summations, the resulting distribution is very close to a

Normal distribution.

Although this seems to be very counterintuitive, the CLT simply states the summing distribution of 𝑋𝑋 will obtain a Normal distribution in the limit, where 𝑋𝑋 can be any distribution with mean of 0 and variance 𝜎𝜎2. Figure 3.3 shows a random variable with uniform distribution over [-1,1] added to itself repeatedly. After only four summations, the resulting distribution is very close to a Normal distribution. The alternative version of the CLT states, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with mean πœ‡πœ‡ and finite variance 𝜎𝜎2, will be approximately

25

normally distributed with sample mean πœ‡πœ‡ and sample variance βˆšπ‘›π‘›πœŽπœŽ2, regardless of the underlying distribution. And similarly, the CLT applies to bivariate Normal distributions in the same manner.

3.1.3 The Chi-squared distributions.

The following theorem clarifies the relationship between the Normal distribution and the Chi-squared distribution. And the density estimation algorithm from the BaMORC reference correction method uses the chi-squared distribution with two degrees of freedom. Theorem. If X is normally distributed with mean πœ‡πœ‡ and variance 𝜎𝜎2 > 0, then: 𝑉𝑉 = οΏ½π‘‹π‘‹βˆ’πœ‡πœ‡πœŽπœŽ οΏ½2is distributed as a chi-squared random variable (πœ’πœ’2) with one degree of freedom.

And the bivariate version of the theorem is: (𝒙𝒙 βˆ’ 𝝁𝝁)β€²πšΊπšΊβˆ’πŸπŸ(𝒙𝒙 βˆ’ 𝝁𝝁)~πœ’πœ’ 22

Proof is showing that the probability density function of the random variable 𝑉𝑉 is the same probability density function of a chi-squared random variable with 1 degree of freedom and the bivariate version can be established in same manner and omitted here, thus we only need to show

𝑔𝑔(𝑣𝑣) = 1 Ξ“ οΏ½12οΏ½ 1 2𝑣𝑣 1 2βˆ’1π‘’π‘’βˆ’πΆπΆ2

And the cumulative distribution 𝐺𝐺(𝑣𝑣) = 𝑃𝑃(𝑉𝑉 ≀ 𝑣𝑣) = 𝑃𝑃(𝑍𝑍2 ≀ 𝑣𝑣) = 1, where 𝑍𝑍 follows the standard Normal distribution 𝑁𝑁(0, 𝜎𝜎2) and 𝑉𝑉 = 𝑍𝑍2. A proof is out the scope of this dissertation92. The chi-squared distribution with two degree of freedom is showing

26

Figure 3.4 Chi-squared density distribution with k=1, 2, 3, 4, 5 degrees of freedom (Figure adapted from WikiMedia.).

Related documents