STATISTICAL ANALYSIS OF EXPERIMENTAL DATA

BIOCHEMISTRY LABORATORY

E. STATISTICAL ANALYSIS OF EXPERIMENTAL DATA

The purpose of most biochemistry laboratory exercises is to observe and measure characteristics of a biomolecule or a biological process. The characteristic is often quantitative: a single number or a group of numbers. These measured quantities may be the molecular weight of a protein, the pH of a buffer solution, the absorbance of a colored solution, the rate of an enzyme-catalyzed reaction, the concentration of a protein in solution, or the radioactivity associated with a molecule. If you measure a quantitative characteristic many times under identical conditions, a slightly different result will most likely be obtained each time.

For example, if a radioactive sample is counted twice under identical experimental and instrumental conditions, the second measurement immediately following the first, the probability is very low that the numbers of counts will be identical. If the absorbance of a solution is determined several times at a specific wavelength, the value of each measurement will surely vary from the others. If an assay for cholesterol is performed several times on a blood serum sample from the same individual, the values will probably be close, but not all will be the same (see Study Problem 1.13). Which measurements, if any, are correct?

Before this question can be answered, you must understand the source and treatment of numerical variations in experimental measurements.

Defining Statistical Analysis

An error in an experimental measurement is defined as a deviation of an observed value from the true value. There are two types of errors, determinate and indeterminate. Determinate errors are those that can be controlled by the experimenter and are associated with malfunctioning equipment, improperly designed experiments, and variations in experimental conditions. These are sometimes called human errors because they can be corrected or at least partially alleviated by careful design and performance of the experiment. Indeterminate errors are those that are random and cannot be controlled by the experimenter.

Specific examples of indeterminate errors are variations in radioactive counting and small differences in the successive measurements of glucose in a serum sample.

Two statistical terms involving error analysis that are often used and misused are accuracy and precision. Precision refers to the extent of agreement among repeated measurements of an experimental value. Accuracy is defined as the difference between the experimental value and the true value for the quantity. Because the true value is seldom known, accuracy is better defined as the difference between the experimental value and the accepted true value. Several experimental measurements may be precise (that is, in close agreement with each other) without being accurate.

If an infinite number of identical, quantitative measurements could be made on a biosystem, this series of numerical values would constitute a statistical population. The average of all of these numbers would be the true value of the measurement. It is obviously not possible to achieve this in practice.

The alternative is to obtain a relatively small sample of data, which is a subset of the infinite population data. The significance and precision of these data are then determined by statistical analysis.

Most quantitative biological measurements can be made in duplicate, triplicate, or even quadruplicate, but it would be impractical and probably a waste of time and materials to make numerous determinations of the same measurement. Rather, when you perform an experimental measurement in the laboratory, you will collect a small sample of data from the population of infinite values for that measurement. To illustrate, imagine that an infinite number of experimental measurements of the pH of a buffer solution are made, and the results are written on slips of paper and placed in a container. It is not feasible to calculate an average value of the pH from all of these numbers, but it is possible to draw five slips of paper, record these numbers, and calculate an average pH. By doing this, you have collected a sample of data. By proper statistical manipulation of this small sample, it is possible to determine whether it is representative of the total population and the amount of confidence you should have in these numbers.
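The slips-of-paper picture can be imitated with a short simulation. The sketch below is only an illustration: the population mean (pH 7.40) and spread (0.05 pH unit) are assumed values, not data from this section, and the normal distribution is used as a convenient stand-in for the "container" of slips.

```python
import random
import statistics

# Hypothetical population parameters, assumed for illustration only.
TRUE_PH = 7.40   # "true value" of the buffer pH
SPREAD = 0.05    # spread of individual readings around the true value

def draw_sample(size, rng):
    """Draw `size` simulated pH readings, rounded to 0.01 pH unit."""
    return [round(rng.gauss(TRUE_PH, SPREAD), 2) for _ in range(size)]

rng = random.Random(1)        # fixed seed so the run is repeatable
sample = draw_sample(5, rng)  # the five slips of paper
sample_mean = statistics.mean(sample)

print("five slips: ", sample)
print("sample mean:", round(sample_mean, 3))
```

Changing the seed draws a different set of five slips and gives a slightly different sample mean, which is the behavior the paragraph describes.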

The Mean, Sample Deviation, and Standard Deviation

Radioactive decay with emission of particles is a random process. It is impossible to predict with certainty when a radioactive event will occur. Therefore, a series of measurements made on a radioactive sample will result in a series of different count rates, but they will be centered around an average or mean value of counts per minute. Table 1.1 contains such a series of count rates obtained with a scintillation counter on a single radioactive sample. A similar table could be prepared for other biochemical measurements, including the rate of an enzyme-catalyzed reaction or the protein concentration of a solution as determined by the Bradford method.

TABLE 1.1 The Observed Counts and Sample Deviation from a Typical Radioactive Sample

Counts per Minute    Sample Deviation ($x_i - \bar{x}$)
1243                 +21
1250                 +28
1201                 -21
1226                 +4
1220                 -2
1195                 -27
1206                 -16
1239                 +17
1220                 -2
1219                 -3

Mean = 1222

The arithmetic average, or mean, of the numbers is calculated by totaling all the experimental values observed for a sample (the counting rates, the velocity of the reaction, or protein concentration) and dividing the total by the number of times the measurement was made. The mean is defined by Equation 1.1.

$$\bar{x} = \frac{\sum x_i}{n} \qquad \text{(Eq. 1.1)}$$

where

$\bar{x}$ = arithmetic average or mean

$x_i$ = the value for an individual measurement

$n$ = the total number of experimental determinations

$\sum$ = sum of all the values

The mean counting rate for the data in Table 1.1 is 1222. If the same radioactive sample were again counted for a series of ten observations, that series of counts would most likely be different from those listed in the table, and a different mean would be obtained. If we were able to make an infinite number of counts on the radioactive sample, then a true mean could be calculated. The true mean would be the actual amount of radioactivity in the sample. Although it would be desirable, it is not possible experimentally to measure the true mean. Therefore, it is necessary to use the average of the counts as an approximation of the true mean and to use statistical analysis to evaluate the precision of the measurements (that is, to assess the agreement among the repeated measurements).

FIGURE 1.11 The normal distribution curve. (Vertical axis: frequency of occurrence of a measurement. Horizontal axis: marked at $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, and $\bar{x} + 2s$. The region within $\bar{x} \pm s$ is labeled 68.3% and the region within $\bar{x} \pm 2s$ is labeled 95.5%.)
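As a quick check of Equation 1.1, the mean of the ten counting rates in Table 1.1 can be computed directly. This is a minimal sketch; the variable names are illustrative and not taken from the text.

```python
# Counting rates (counts per minute) from Table 1.1.
counts = [1243, 1250, 1201, 1226, 1220, 1195, 1206, 1239, 1220, 1219]

n = len(counts)        # total number of determinations
total = sum(counts)    # sum of all the values
x_bar = total / n      # Equation 1.1

print(n, total, x_bar)  # 10 12219 1221.9
print(round(x_bar))     # 1222, the mean quoted for Table 1.1
```

The unrounded mean is 1221.9, which the table reports as 1222.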

Because it is not usually practical to observe and record a measurement many times as in Table 1.1, what is needed is a way to determine the reliability of an observed measurement. This may be stated in the form of a question: how close is the result to the true value? One approach to this analysis is to calculate the sample deviation, which is defined as the difference between the value for an observation, $x_i$, and the mean value, $\bar{x}$ (Equation 1.2). The sample deviations are also listed for each count in Table 1.1.

$$\text{Sample deviation} = x_i - \bar{x} \qquad \text{(Eq. 1.2)}$$

A more useful statistical term for error analysis is standard deviation, a measure of the spread of the observed values. Standard deviation, $s$, for a sample of data consisting of $n$ observations may be estimated by Equation 1.3.

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \qquad \text{(Eq. 1.3)}$$

It is a useful indicator of the probable error of a measurement. Standard deviation is often transformed to standard deviation of the mean, or standard error, $s_m$. This is defined by Equation 1.4, where $n$ is the number of measurements.

$$s_m = \frac{s}{\sqrt{n}} \qquad \text{(Eq. 1.4)}$$

It should be clear from this equation that as the number of experimental observations becomes larger, $s_m$ becomes smaller; that is, the precision of the measurement is improved.

Standard deviation may also be illustrated in graphical form (see Figure 1.11). The shape of the curve in Figure 1.11 is closely approximated by the Gaussian distribution or normal distribution curve. This mathematical treatment is based on the fact that a plot of relative frequency of a given event yields a dispersion of values centered about the mean, $\bar{x}$. The value of $\bar{x}$ is measured at the maximum height of the curve. The normal distribution curve shown in Figure 1.11 defines the spread or dispersion of the data. The probability that an observation will fall under the curve is unity, or 100%. By using an equation derived by Gauss, it can be calculated that for a single set of sample data, 68.3% of the observed values will occur within the interval $\bar{x} \pm s$, 95.5% of the observed values within $\bar{x} \pm 2s$, and 99.7% of the observed values within $\bar{x} \pm 3s$. Stated in other terms, there is a 68.3% chance that a single observation will be in the interval $\bar{x} \pm s$.
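Equations 1.2 through 1.4 can be applied to the counts in Table 1.1 as follows. This is a sketch under the definitions given above (sample deviation as the observation minus the mean, $n - 1$ in the denominator of the standard deviation). The variable names are illustrative, and the counting of observations inside the one- and two-standard-deviation intervals is added only to relate the numbers to the normal curve.

```python
import math

# Counting rates (counts per minute) from Table 1.1.
counts = [1243, 1250, 1201, 1226, 1220, 1195, 1206, 1239, 1220, 1219]

n = len(counts)
x_bar = sum(counts) / n                            # Equation 1.1

deviations = [x - x_bar for x in counts]           # Equation 1.2
s = math.sqrt(sum(d * d for d in deviations) / (n - 1))  # Equation 1.3
s_m = s / math.sqrt(n)                             # Equation 1.4

# How many of the ten observations fall inside each interval?
within_1s = sum(1 for d in deviations if abs(d) <= s)
within_2s = sum(1 for d in deviations if abs(d) <= 2 * s)

print(f"mean = {x_bar:.1f}, s = {s:.1f}, s_m = {s_m:.1f}")
print(f"within 1s: {within_1s} of {n}; within 2s: {within_2s} of {n}")
```

For these data the standard deviation is roughly 18 counts per minute and the standard error roughly 6. With only ten observations, the fractions inside each interval match the 68.3% and 95.5% figures only approximately.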

For many experiments, a single measurement is made, so a mean value, $\bar{x}$, is not known. In these cases, error is expressed in terms of $s$, but is defined as the percentage proportional error (Equation 1.5).

>> (Eq. 1.5)

The parameter k is a proportional constant between the percentage proportional error and the standard deviation. The percent proportional error may be defined within several probability ranges. Standard error refers to a confidence level of 68.3%; that is, there is a 68.3% chance that a single measurement will not exceed the percentage proportional error. For standard error, k = … . Ninety-five hundredths error means there is a 95% chance that a single measurement will not exceed the percentage proportional error. The constant k then becomes 1.45.

The previous discussion of standard deviation and related statistical analysis placed emphasis on estimating the reliability or precision of experimentally observed values. However, standard deviation does not give specific information about how close an experimental mean is to the true mean. Statistical analysis may be used to estimate, within a given probability, a range within which the true value might fall. The range or confidence interval is defined by the experimental mean and the standard deviation. This simple statistical operation provides the means to determine quantitatively how close the experimentally determined mean is to the true mean.

Confidence limits ($L_1$ and $L_2$) are created for the sample mean as shown in Equations 1.6 and 1.7.

$$L_1 = \bar{x} + t\,s_m \qquad \text{(Eq. 1.6)}$$

$$L_2 = \bar{x} - t\,s_m \qquad \text{(Eq. 1.7)}$$

where

$t$ = a statistical parameter that defines a distribution between a sample mean and a true mean

The parameter t is calculated by integrating the distribution between percent confidence limits. Values of t are tabulated for various confidence limits (Table 1.2). Each column in the table refers to a desired confidence level (0.05 for 95%, 0.02 for 98%, and 0.01 for 99% confidence). The table also includes the term degrees of freedom, which is represented by n - 1, the number of experimental observations minus 1. The values of $\bar{x}$ and $s_m$ are calculated as previously described in Equations 1.1 and 1.4.

TABLE 1.2 Values of t for Analysis of Statistical Confidence Limits

        Probability of Larger Value of t, Sign Ignored
d.f.    0.05      0.02      0.01        d.f.    0.05     0.02     0.01
1       12.706    31.821    63.657      14      2.145    2.624    2.977
2       4.303     6.965     9.925       15      2.131    2.602    2.947
3       3.182     4.541     5.841       16      2.120    2.583    2.921
4       2.776     3.747     4.604       17      2.110    2.567    2.898
5       2.571     3.365     4.032       18      2.101    2.552    2.878
6       2.447     3.143     3.707       19      2.093    2.539    2.861
7       2.365     2.998     3.499       20      2.086    2.528    2.845
8       2.306     2.896     3.355       21      2.080    2.518    2.831
9       2.262     2.821     3.250       22      2.074    2.508    2.819
10      2.228     2.764     3.169       23      2.069    2.500    2.807
11      2.201     2.718     3.106       24      2.064    2.492    2.797
12      2.179     2.681     3.055       25      2.060    2.485    2.787
13      2.160     2.650     3.012

Spreadsheet Statistics

It is common practice today to use computer spreadsheet programs for statistical analysis of biochemical data. A spreadsheet provides a means to collect and enter data in the form of numbers and text. Perhaps the most versatile and easy-to-use spreadsheet software is Microsoft Excel, although more specialized statistical software programs including SPSS and SyStat are also very useful (see Chapter 2 and Appendix I). Using Excel to estimate statistical terms for experimental data is relatively straightforward. Launching the Microsoft Excel program on your computer brings up the Excel spreadsheet, which consists of rows (number headings) and columns (letter headings).

More detailed instructions for the statistical applications of Excel are found at www.microsoft.com.

Statistical Analysis in Practice

The equations for statistical analysis that have been introduced in this section are of little value if you have no understanding of their practical use, meaning, and limitations. A set of experimental data will first be presented, and then several statistical parameters will be calculated using the equations. This example will serve as a summary of the statistical formulas and will also illustrate their application.
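To illustrate how a value of t from Table 1.2 is combined with the mean and the standard error, the sketch below computes 95% confidence limits for the counts in Table 1.1. It assumes the usual form of the limits, the mean plus or minus t times the standard error, and copies only three entries from the 0.05 column of the table. The function name `confidence_limits` and the dictionary `T_95` are hypothetical names introduced for this example.

```python
import math
from statistics import mean, stdev

# Excerpt of the 0.05 column (95% confidence) of Table 1.2,
# keyed by degrees of freedom (n - 1).
T_95 = {4: 2.776, 9: 2.262, 14: 2.145}

def confidence_limits(values, t_table=T_95):
    """Return (lower, upper) limits: mean -/+ t * standard error."""
    n = len(values)
    x_bar = mean(values)                  # Equation 1.1
    s_m = stdev(values) / math.sqrt(n)    # Equations 1.3 and 1.4
    t = t_table[n - 1]                    # degrees of freedom = n - 1
    return x_bar - t * s_m, x_bar + t * s_m

# Counting rates (counts per minute) from Table 1.1.
counts = [1243, 1250, 1201, 1226, 1220, 1195, 1206, 1239, 1220, 1219]
lower, upper = confidence_limits(counts)
print(f"95% confidence limits: {lower:.1f} to {upper:.1f} cpm")
```

With ten observations there are nine degrees of freedom, so t = 2.262, and the limits come out at roughly 13 counts per minute on either side of the mean. A sample size whose degrees of freedom are not in the excerpt raises a `KeyError`; extending `T_95` with more rows from Table 1.2 removes that limitation.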

STUDY EXERCISE 1.3

Statistical Analysis of Data

Ten identical protein samples were analyzed by the Bradford method for protein analysis. The following values for protein concentration were obtained.

Observation Number    Protein Concentration (mg/mL), $x_i$
1                     1.02
2                     0.98
3                     0.99
4                     1.01
5                     1.03
6                     0.97
7                     1.00
8                     0.98
9                     1.03
10                    1.01

Sample mean

$$\bar{x} = \frac{\sum x_i}{n} = \frac{10.02}{10} = 1.00 \text{ mg/mL}$$

Sample deviation

$$\text{Sample deviation} = x_i - \bar{x}$$

Observation    $x_i - \bar{x}$
1              +0.02
2              -0.02
3              -0.01
4              +0.01
5              +0.03
6              -0.03
7              0.00
8              -0.02
9              +0.03
10             +0.01

Calculation of the sample deviation for each measurement gives an indication of the precision of the determinations.
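The sample mean and sample deviations of this exercise can be reproduced with a few lines of code. This is only a sketch: it rounds the mean to two decimal places, as the worked values do, before subtracting, and the variable names are illustrative.

```python
# Protein concentrations (mg/mL) from Study Exercise 1.3.
protein = [1.02, 0.98, 0.99, 1.01, 1.03, 0.97, 1.00, 0.98, 1.03, 1.01]

n = len(protein)
x_bar = round(sum(protein) / n, 2)   # Equation 1.1, reported to 0.01 mg/mL

# Equation 1.2 for each observation, rounded to 0.01 mg/mL.
deviations = [round(x - x_bar, 2) for x in protein]

print(f"mean = {x_bar:.2f} mg/mL")
for number, d in enumerate(deviations, start=1):
    print(f"{number:2d}  {d:+.2f}")
```

The largest deviations are 0.03 mg/mL in either direction, about 3% of the mean, which gives a first impression of the precision before any standard deviation is calculated.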

(Continued)
