Statistics for Science
7.1 Analysing Replicate Data
7.1.1 Introduction
Experimental measurements are often made to discover the ‘true’ value of some parameter, e.g. the pH of a solution. However, all experimental measurements are subject to uncertainty, which means that the final value will only be a ‘best estimate’ for the unknown ‘true’ value. It is therefore common to make repeated (replicate) measurements to counteract the effects of random experimental uncertainties.
In this context, statistics can be used to perform two key functions:
1. Provide a description of the experimental ‘raw’ data that has been recorded, often presenting this in the form of a visual ‘picture’ of the distribution of the values.
2. Calculate, using certain statistical assumptions, a ‘best estimate’ of the ‘true’ value being measured.
We will use the experimental data in Example 7.1 to introduce the use of a box and whisker plot to describe the raw data, and then we will use a case study to investigate the calculation of a 95 % confidence interval to give the best estimate of the ‘true’ value being measured.
7.1.2 Ranked data – box and whisker plots
Example 7.1
The following data set has nine replicate experimental measurements of an unknown ‘true’ value, µ.
Data 2.3 11.3 3.8 4.5 4.2 8.1 6.3 3.7 3.3
Produce a visual representation of the values in this data set.
The worked answer is given in the following text.
Ranked data has the data values sorted into ascending (or descending) order and assigned a rank position.
The nine data values in Example 7.1 can be sorted into ascending order and assigned rank positions from 1 to 9.
Data: 2.3 3.3 3.7 3.8 4.2 4.5 6.3 8.1 11.3
Rank: 1 2 3 4 5 6 7 8 9
Ranking is used extensively in non-parametric statistics (Chapter 12), where only the ‘order’ of the data values is important and not the actual (parametric) values.
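The ranking step for Example 7.1 can be sketched in a few lines of Python. This is only an illustration: the variable names are our own, and the sketch assumes there are no tied values (true for this data set), since the text does not say how ties should be ranked.

```python
# Data from Example 7.1, in the order recorded.
data = [2.3, 11.3, 3.8, 4.5, 4.2, 8.1, 6.3, 3.7, 3.3]

# Sort into ascending order, then assign rank positions 1, 2, 3, ...
ranked = sorted(data)
ranks = {value: position for position, value in enumerate(ranked, start=1)}

for value in ranked:
    print(f"rank {ranks[value]}: {value}")
```

The value 4.2 receives rank 5, which is the middle position of the nine ranked values.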
The location of a set of n data values is described by the following:
• Median is the middle value in a set of ranked data values and gives the location of the data. The median is the value with the rank 0.50 × (n + 1).
• Lower quartile, Q1, is the value one-quarter of the way from the lowest to the highest value. The lower quartile is the value with the rank 0.25 × (n + 1).
• Upper quartile, Q3, is the value three-quarters of the way from the lowest to the highest value. The upper quartile is the value with the rank 0.75 × (n + 1).
In the above data, the number of data values, n = 9. Thus:
• Median value has the rank = 0.5 × (9 + 1) = 5
Median value with rank 5 = 4.2
• Lower quartile value has the rank = 0.25 × (9 + 1) = 2.5
Lower quartile value with rank 2.5 is halfway between 3.3 and 3.7 = 3.5
• Upper quartile value has the rank = 0.75 × (9 + 1) = 7.5
Upper quartile value with rank 7.5 is halfway between 6.3 and 8.1 = 7.2
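The rank rule used above can be written as a short Python sketch. The function names value_at_rank and quartiles are our own, and the sketch assumes at least three data values so that the ranks 0.25 × (n + 1) and 0.75 × (n + 1) both fall between 1 and n. A fractional rank is handled by interpolating linearly between the two neighbouring ranked values, which reproduces the ‘halfway between’ step used in the text.

```python
def value_at_rank(ranked, rank):
    """Value at a 1-based rank; a fractional rank is interpolated linearly."""
    lower = int(rank)              # whole-number part of the rank
    fraction = rank - lower        # e.g. 0.5 for rank 2.5
    if fraction == 0:
        return ranked[lower - 1]
    below, above = ranked[lower - 1], ranked[lower]
    return below + fraction * (above - below)


def quartiles(data):
    """Return (Q1, median, Q3) using the ranks p * (n + 1)."""
    ranked = sorted(data)
    n = len(ranked)
    if n < 3:
        raise ValueError("this sketch assumes at least three data values")
    return tuple(value_at_rank(ranked, p * (n + 1)) for p in (0.25, 0.50, 0.75))


data = [2.3, 11.3, 3.8, 4.5, 4.2, 8.1, 6.3, 3.7, 3.3]
q1, median, q3 = quartiles(data)
print(f"Q1 = {q1:.1f}, median = {median:.1f}, Q3 = {q3:.1f}")
```

Other software may use a different quartile convention and so give slightly different values. Python's own statistics.quantiles(data, n=4, method="exclusive") is documented as using the same (n + 1) convention and can serve as a cross-check.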
The spread of non-parametric data is described by the following:
• Interquartile range, IQR, is the difference in value between the upper quartile and lower quartile:
IQR = Q3 − Q1
• Total range is the difference in value between the lowest value and the highest value.
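As a brief illustration, the two measures of spread can be evaluated for the data in Example 7.1, taking the quartile values Q1 = 3.5 and Q3 = 7.2 worked out in the text. The variable names are our own.

```python
# Ranked data from Example 7.1 and the quartiles found in the text.
ranked = [2.3, 3.3, 3.7, 3.8, 4.2, 4.5, 6.3, 8.1, 11.3]
q1, q3 = 3.5, 7.2

iqr = q3 - q1                            # interquartile range, IQR = Q3 - Q1
total_range = max(ranked) - min(ranked)  # highest value minus lowest value

print(f"IQR = {iqr:.1f}")
print(f"Total range = {total_range:.1f}")
```

The total range (9.0) is much larger than the IQR (3.7) because it is set entirely by the two most extreme values, 2.3 and 11.3.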
Q7.1
For each of the sets below, calculate the median, lower quartile, upper quartile and IQR:
(i) 5.0, 8.2, 7.9, 6.6, 7.6, 5.7, 3.2, 7.5, 5.9
(ii) 11, 8, 13, 9, 21, 24, 12, 22, 29, 43
(iii) 45, 67, 23, 78, 67, 56, 98, 23, 49
A box and whisker plot (often just called a boxplot) is a very useful way of visualizing raw experimental data.
Figure 7.1 shows the box and whisker plot for the data in Example 7.1, drawn against the data value axis.
Figure 7.1 Box and whisker plot of the data in Example 7.1 (using Minitab). Horizontal axis: Data, from 2 to 12.
The ‘middle’ line drawn inside the box shows the position of the median value. The ends of the ‘box’ give the positions of the upper and lower quartiles. The ends of the ‘whiskers’ give the maximum and minimum values in the data.
The fact that the median is not at the centre of its ‘box’ shows that the data is not symmetrical.
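For readers without a statistics package, the five values that define the plot (minimum, Q1, median, Q3, maximum) can be drawn as a rough one-line text picture. The sketch below is our own illustration, not Minitab output. The function name ascii_boxplot, the 2 to 12 axis and the 50-character width are all assumptions chosen to resemble Figure 7.1.

```python
def ascii_boxplot(minimum, q1, median, q3, maximum,
                  axis_min=2.0, axis_max=12.0, width=50):
    """Draw a one-line text box plot on an axis from axis_min to axis_max."""
    def column(x):
        # Map a data value to a character column between 0 and width.
        return round((x - axis_min) / (axis_max - axis_min) * width)

    line = [" "] * (width + 1)
    for i in range(column(minimum), column(maximum) + 1):
        line[i] = "-"              # whiskers span minimum to maximum
    for i in range(column(q1), column(q3) + 1):
        line[i] = "="              # box spans the two quartiles
    line[column(median)] = "M"     # median marker inside the box
    return "".join(line)


# Five-number summary for Example 7.1, taken from the text.
plot = ascii_boxplot(2.3, 3.5, 4.2, 7.2, 11.3)
print(plot)

left_half = 4.2 - 3.5    # distance from Q1 to the median
right_half = 7.2 - 4.2   # distance from the median to Q3
print(f"Q1 to median = {left_half:.1f}, median to Q3 = {right_half:.1f}")
```

The median marker sits much closer to the left-hand end of the box than to the right-hand end, which is the asymmetry described above.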
Q7.2
Represent all of the number sets in Q7.1 as box and whisker plots.
Versions of Excel before Excel 2016 have no built-in chart type for box and whisker plots. However, statistics packages such as Minitab can produce these diagrams easily.
7.1.3 Confidence interval for an unknown value
If we are trying to measure some experimental property with a true value of µ, then we are more likely to record a value, x, close to µ, and less likely to record a value a long way away from µ. The graph in Figure 7.2 shows the likelihood distribution for getting an experimental result, x, when measuring a true blood–alcohol level of µ = 80 mg of alcohol in each 100 mL of blood, i.e. 80 mg per 100 mL.
Figure 7.2 Distribution of experimental results, x, about the true value of 80 (x axis from 74 to 86). The results are nearly all between 76 and 84, with only a few outside this range.
The mathematical name for this ‘bell-shaped’ distribution is the normal distribution (see 8.1.3). Most random experimental errors can be considered to have a frequency spread of results that follow a normal distribution.
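The statement that nearly all results lie between 76 and 84 can be checked with a short Python sketch. The text does not give the standard deviation of the distribution in Figure 7.2, so the value of 2 mg per 100 mL used here is our own assumption, chosen so that 76 and 84 lie two standard deviations either side of the true value.

```python
from statistics import NormalDist

TRUE_VALUE = 80.0   # true blood-alcohol level, mg per 100 mL (from the text)
ASSUMED_SD = 2.0    # assumption: standard deviation is not stated in the text

distribution = NormalDist(mu=TRUE_VALUE, sigma=ASSUMED_SD)

# Probability that a single measurement falls between 76 and 84.
fraction_inside = distribution.cdf(84) - distribution.cdf(76)
print(f"Fraction between 76 and 84: {fraction_inside:.3f}")
```

Under this assumption, a little over 95 % of single measurements would fall between 76 and 84.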
Example 7.2
Twenty analysts each make five replicate measurements on a blood sample which has a true value of 80 mg per 100 mL. The individual recorded values are distributed at random with probabilities given by the curve in Figure 7.2.
How can their results be presented and interpreted?
The worked answer is given in the following text.
For Example 7.2, a random selection of values by computer has been used to simulate the realistic spread of 20 possible data sets, each with five experimental data values. The results are described by the 20 box and whisker plots in Figure 7.3.
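A simulation of this kind might be sketched as follows. This is not the procedure used to produce Figure 7.3. The standard deviation of 2 mg per 100 mL and the random seed are our own assumptions, so the individual values will differ from those of the 20 analysts in the text.

```python
import random
from statistics import median

TRUE_VALUE = 80.0   # true value from Example 7.2
ASSUMED_SD = 2.0    # assumption: not stated in the text
N_ANALYSTS = 20
N_REPLICATES = 5

rng = random.Random(2024)   # fixed seed so the sketch is repeatable

# One list of five simulated measurements per analyst, rounded to 1 d.p.
data_sets = [
    [round(rng.gauss(TRUE_VALUE, ASSUMED_SD), 1) for _ in range(N_REPLICATES)]
    for _ in range(N_ANALYSTS)
]

for number, values in enumerate(data_sets, start=1):
    print(f"Set {number:2d}: {values}  median = {median(values):.1f}")
```

Each run with a different seed gives a different collection of 20 sets, in the same way that 20 real analysts would each record different values.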
Figure 7.3 Box and whisker plots for 20 data sets (using Minitab). Vertical axis: Data, from 74 to 86; horizontal axis: data sets 1 to 20.
It is important to realize that, in a real experimental measurement, none of the 20 analysts would know either the true value (µ = 80) being measured or the uncertainty in the measurements. Individually, they must rely on just their five measurements to estimate both the true value and measurement uncertainty.
For example:
• Set 14 has data values that happen to be grouped closely together and all on the high side of the true value, and the analyst would tend to believe that his or her results are quite precise and close to a value of about 81.
• Set 5 has a wide spread of values and the analyst would not claim the same precision as analyst 14.
• Set 15 shows results that are skewed to low values (the median value is close to the lower end of the range) and set 1 is skewed to high values, whereas set 13 is fairly symmetrical.
• Set 18 only just includes the true value of 80.
The data for these six sets are given in Table 7.1.
Table 7.1. Data for analyst sets 1, 5, 13, 14, 15 and 18.
Set   Data                           Mean   95 % confidence interval
                                            Minimum   Maximum
1     80.2  79.6  80.7  80.4  76.4   79.5   77.3      81.6
5     81.7  83.4  78.9  75.3  79.0   79.7   75.8      83.5
13    78.0  79.0  81.7  80.7  84.1   80.7   77.7      83.7
14    80.8  82.1  80.5  80.6  81.3   81.1   80.2      81.9
15    78.1  81.7  78.3  78.2  78.0   78.9   76.9      80.8
18    78.6  79.1  76.4  80.0  77.9   78.4   76.7      80.1
Based on their own five measurements, each analyst is required to give a best estimate of the unknown true value in the form of a confidence statement. For example, analyst 5 would present his or her results as follows:
On the basis of my five data values, I am 95 % confident that the true value, µ, lies between 75.8 and 83.5.
This range of values, called the 95 % confidence interval, is a symmetrical range centred on the mean (average) value of the set of sample data – see 8.2.4. The minimum and maximum values of the confidence interval in Table 7.1 have been calculated using theory that will be developed in Chapter 8.
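Although the theory is left to Chapter 8, the calculation can be previewed with a short Python sketch. The sketch assumes the usual Student t interval, mean ± t × s/√n, where s is the sample standard deviation and t = 2.776 is the tabulated two-tailed 95 % value for n − 1 = 4 degrees of freedom. That this is the method behind Table 7.1 is our assumption, although it does reproduce the tabulated limits for set 5. The function name is our own.

```python
from math import sqrt
from statistics import mean, stdev

T_95_DF4 = 2.776   # tabulated t value, 95 % two-tailed, 4 degrees of freedom


def confidence_interval_95(values, t_value=T_95_DF4):
    """Return (minimum, maximum) of mean +/- t * s / sqrt(n).

    The default t value is only valid for n = 5 measurements.
    """
    n = len(values)
    centre = mean(values)
    half_width = t_value * stdev(values) / sqrt(n)   # stdev uses n - 1
    return centre - half_width, centre + half_width


set_5 = [81.7, 83.4, 78.9, 75.3, 79.0]   # analyst 5, Table 7.1
low, high = confidence_interval_95(set_5)
print(f"95 % confidence interval: {low:.1f} to {high:.1f}")
```

For set 5 this gives 75.8 to 83.5, in agreement with the statement made by analyst 5. For a different number of measurements the t value must be replaced by the one for n − 1 degrees of freedom.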
The 95 % confidence intervals are calculated for each data set and recorded in Figure 7.4 as ‘error bars’ on either side of the sample mean values.
The statement that the ‘true value, µ, lies within the 95 % confidence interval’ has a 5 % chance of being wrong. Hence we could expect that 5 % of claims (i.e. 1 in 20) will indeed prove to be wrong. The simulated data agrees with this probability, in that it can be seen that just one confidence interval (from set 14) does not include the true value (µ = 80).
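The ‘1 in 20’ expectation can be explored by repeating the simulation many more times than 20. The sketch below makes assumptions that are not in the text: a standard deviation of 2 mg per 100 mL, a t-based interval with t = 2.776 for five measurements, and an arbitrary random seed.

```python
import random
from math import sqrt
from statistics import mean, stdev

TRUE_VALUE = 80.0
ASSUMED_SD = 2.0    # assumption: not stated in the text
T_95_DF4 = 2.776    # tabulated t value, 95 % two-tailed, 4 degrees of freedom
N = 5               # measurements per analyst


def interval_contains_truth(rng):
    """Simulate one analyst and report whether the 95 % CI includes the truth."""
    values = [rng.gauss(TRUE_VALUE, ASSUMED_SD) for _ in range(N)]
    half_width = T_95_DF4 * stdev(values) / sqrt(N)
    return abs(mean(values) - TRUE_VALUE) <= half_width


rng = random.Random(1)   # fixed seed so the sketch is repeatable
trials = 10_000
hits = sum(interval_contains_truth(rng) for _ in range(trials))
coverage = hits / trials
print(f"Intervals containing the true value: {coverage:.1%}")
```

With many simulated analysts the proportion of intervals that include the true value should settle close to 95 %. In any one group of 20 the number of ‘wrong’ intervals may well be 0, 1, 2 or occasionally more.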
Example 7.3
Which of the 20 analysts, whose results are given in Figure 7.4, would answer Yes to the following question:
On the basis of your results, are you 95 % confident that the true blood–alcohol level is
We can see in Figure 7.4 that the confidence intervals for sets 1, 4, 6, 7, 8, 9, 14, 15 and 18 do not cross the 82 mg per 100 mL grid line.
As the ‘82’ value is outside their separate confidence intervals for the true blood–alcohol level, each of these analysts would answer Yes to the above question.
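The same test can be applied to the six analysts whose confidence limits are listed in Table 7.1. The sketch below simply checks whether 82 lies outside each tabulated interval. The variable names are our own, and the other 14 analysts are omitted because their limits are only shown graphically.

```python
# 95 % confidence limits (minimum, maximum) from Table 7.1.
intervals = {
    1: (77.3, 81.6),
    5: (75.8, 83.5),
    13: (77.7, 83.7),
    14: (80.2, 81.9),
    15: (76.9, 80.8),
    18: (76.7, 80.1),
}

LIMIT = 82.0   # mg per 100 mL, the grid line referred to in the text

# An analyst answers Yes when 82 lies outside his or her interval.
answer_yes = [
    analyst for analyst, (minimum, maximum) in intervals.items()
    if not (minimum <= LIMIT <= maximum)
]
print("Analysts answering Yes:", answer_yes)
```

Of these six, analysts 1, 14, 15 and 18 would answer Yes, whereas the intervals for analysts 5 and 13 extend beyond 82.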
Figure 7.4 The 95 % confidence intervals for 20 data sets (using Minitab). Vertical axis: Data, from 74 to 86; horizontal axis: data sets 1 to 20.
Q7.3
Case study data has been presented using both box and whisker plots in Figure 7.3 and confidence intervals in Figure 7.4.
Explain the fundamental difference between the information described by the two forms of presentation.
The difference between the two types of presentation can be illustrated by the effect of increasing the sample size of data values. The presentations in Figure 7.5 show a sample size n = 5 (set 1 from the data), sample size n = 10 (set 1 plus set 2), sample size n = 20 (sets 1 to 4), sample size n = 40 (sets 1 to 8) and sample size n = 80 (sets 1 to 16).
The interquartile ranges of the box and whisker plots (Figure 7.5a) approach a constant range which is representative of the underlying uncertainty in the data being measured, and the ends of the whiskers extend to the most extreme data values in the set.
However, the confidence interval (Figure 7.5b) is a measure of confidence in locating the true value being measured, and as the sample size increases, the increased information means that it is possible to be more precise about the true value – the confidence interval becomes narrower. This effect, which is due to the central limit theorem, is explained in 8.2.3.
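The narrowing can be illustrated numerically. The sketch below assumes a t-based interval with half-width t × s/√n, a fixed sample standard deviation of s = 2 mg per 100 mL, and tabulated two-tailed 95 % t values for n − 1 degrees of freedom. None of these figures is taken from the text. In practice s would also vary a little from sample to sample.

```python
from math import sqrt

ASSUMED_SD = 2.0   # assumption: sample standard deviation held fixed

# Tabulated two-tailed 95 % t values for n - 1 degrees of freedom.
T_95 = {5: 2.776, 10: 2.262, 20: 2.093, 40: 2.023, 80: 1.990}

# Half-width of the confidence interval for each sample size.
half_widths = {n: t * ASSUMED_SD / sqrt(n) for n, t in T_95.items()}

for n, half_width in half_widths.items():
    print(f"n = {n:2d}: mean +/- {half_width:.2f}")
```

The half-width falls steadily as n increases, mainly through the 1/√n factor. The spread of the individual values, and hence the interquartile range of the boxplot, does not shrink in this way.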
Figure 7.5 Effect of increasing sample size (n = 5, 10, 20, 40 and 80) on (a) boxplots and (b) 95 % confidence intervals (using Minitab).