
Central value

In document Statistical Data Analysis Explained (Page 75-80)

Statistical Distribution Measures


When studying the data distribution graphically, it is apparent that the distributions of the variables can look quite different. Instead of inspecting countless plots, it is often preferable to characterise each data distribution by a small number of parameters that can be listed in a table. Which parameters are required?

First the central value of the distribution needs to be identified, together with a measure of the spread (variation) of the data. Furthermore, the quartiles and different percentiles of the distribution may be of interest (i.e. above what value fall the uppermost two, five, or ten per cent of the data).

When working with “ideal” data, two further measures are often provided in statistical tabulations: skewness and kurtosis. Skewness is a measure of the symmetry of the data distribution; kurtosis provides an expression of the curvature, i.e. whether the density trace appears flat or steep. Such “summary values” are frequently used to compare data from different investigations. For real data there is often the problem that the presence of outliers and/or multimodal distributions will bias these measures. Both can be easily recognised from the graphics described in the previous chapter.

4.1 Central value

What is the most appropriate estimator for the central value of a data distribution? What is actually the central value of a distribution? It could be the “centre of gravity”, it could be the most likely value, it could be the most frequent value, it could be the value that divides the samples into two equal halves. Accordingly there exist several different statistical measures of the central value (location) of a data distribution.

4.1.1 The arithmetic mean

The most frequently used measure is probably the arithmetic mean. Throughout this book we will use the term MEAN for the arithmetic mean. All values are summed up and divided by the number of values. If $x_1, x_2, \ldots, x_n$ denote the values of the data, with n indicating the number of samples, the simple formula for calculating the arithmetic mean is:

$$\text{MEAN} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

Statistical Data Analysis Explained: Applied Environmental Statistics with R. C. Reimann, P. Filzmoser, R. G. Garrett, R. Dutter © 2008 John Wiley & Sons, Ltd. ISBN: 978-0-470-98581-6

4.1.2 The geometric mean

The geometric mean G is often used with advantage for right-skewed (e.g., lognormal) distributions. G is calculated by taking the n-th root (n = number of samples) of the product of all the data values. It requires that all values are positive; negative values or zeros in the data set are not permitted. When dealing with applied geochemical data this is the usual case: negative concentrations of a chemical element cannot exist, and the concentration is always conceptually greater than zero, even if it is so low that it cannot be measured. An exception may be measurements of organic pollutants that do not occur in nature, where the concentration could actually be zero. The formula is:

$$G = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} \quad\text{or}\quad G = \sqrt[n]{\prod_{i=1}^{n} x_i}.$$

This form of calculation should be avoided, because the limited arithmetic precision of computers can introduce rounding errors, and the product of many values can overflow or underflow. A preferable method is to use logarithms, then:

$$\text{mean} = \frac{1}{n}\sum_{i=1}^{n} \log(x_i) \quad\text{and}\quad G = 10^{\text{mean}} \quad\text{or}\quad G = e^{\text{mean}},$$

depending on whether logarithms to the base 10 or natural logarithms were used.
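The difference between the two calculation routes can be illustrated with a short Python sketch. The function names and the artificial data set (2000 identical small values) are assumptions made for illustration only; they do not come from the book, which works in R.

```python
import math

def geometric_mean_product(values):
    """Direct formula: n-th root of the product of all values."""
    product = 1.0
    for x in values:
        product *= x
    return product ** (1.0 / len(values))

def geometric_mean_log(values):
    """Mean of the natural logarithms, transformed back with exp()."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean requires strictly positive values")
    mean_log = sum(math.log(x) for x in values) / len(values)
    return math.exp(mean_log)

# Hypothetical data: 2000 small concentrations, all equal to 0.001.
tiny = [1e-3] * 2000

direct = geometric_mean_product(tiny)  # the running product underflows to 0.0
via_logs = geometric_mean_log(tiny)    # stays close to the true value 0.001
```

In this constructed case the direct product is far smaller than the smallest representable floating-point number, so the direct formula returns zero, whereas the logarithmic route recovers the expected value.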

4.1.3 The mode

The MODE is the value with the highest probability of occurrence. There is no simple formula to estimate the mode; however, the mode is often estimated from a histogram or density trace – the MODE being the value where the histogram or density function shows a maximum.
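One possible way to read the MODE off a histogram is sketched below in Python. The helper `histogram_mode`, the choice of 20 equal-width bins and the simulated sample are all assumptions for illustration; the estimate depends noticeably on the number of bins, which is one reason why no simple formula exists.

```python
import random

def histogram_mode(values, n_bins=20):
    """Midpoint of the most populated of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in values:
        # The maximum value is assigned to the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    fullest = counts.index(max(counts))
    return lo + (fullest + 0.5) * width

# Hypothetical sample: 5000 values from a normal distribution centred at 10.
rng = random.Random(1)
sample = [rng.gauss(10.0, 2.0) for _ in range(5000)]
mode_estimate = histogram_mode(sample)
```

The returned value is only as precise as the bin width, so it should be read as a rough location of the peak rather than an exact estimate.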

4.1.4 The median

The MEDIAN divides the data distribution into two equal halves. The data are sorted from the lowest to the highest value, and the central value of the ordered data is the MEDIAN. In the case that n is an even number, there exist two central values and the average of these two values is taken as the MEDIAN. This may best be demonstrated by a simple example:

2.3 2.7 1.9 2.1 1.8 2.4 2.0 5.9.

The data are then sorted:

1.8 1.9 2.0 2.1 2.3 2.4 2.7 5.9.

The two central values are 2.1 and 2.3, and the MEDIAN is (2.1 + 2.3)/2 = 2.2.

For comparison, the MEAN of these eight values is 2.64, the geometric mean, G, is 2.44, and estimating a MODE does not make much sense with so few data. Looking at the differences between the MEAN, G, and MEDIAN demonstrates that it may be difficult to define a meaningful central value. These large differences between the estimates of the central value are caused by the one high value (5.9) that clearly deviates from the majority of data.
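The three values quoted for the example can be reproduced with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the variable names are illustrative.

```python
import math
import statistics

data = [2.3, 2.7, 1.9, 2.1, 1.8, 2.4, 2.0, 5.9]

mean = statistics.mean(data)
median = statistics.median(data)

# Geometric mean via logarithms (base 10 here; any base gives the same G).
mean_log10 = sum(math.log10(x) for x in data) / len(data)
g = 10 ** mean_log10

print(round(mean, 2), round(g, 2), round(median, 2))
```

Rounded to two decimals, the results agree with the MEAN of 2.64, the G of 2.44 and the MEDIAN of 2.2 given in the text.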

When computing the MEAN, each value enters the calculation with the same weight. The high value of 5.9 thus has a strong influence on the MEAN. The logarithm of the geometric mean G is the arithmetic mean of the log-transformed data. Log-transforming (base 10) the above data:

0.26 0.28 0.30 0.32 0.36 0.38 0.43 0.77;

the MEAN of these log-transformed values is 0.39. When this value is transformed back ($10^{0.39}$), the value of the geometric mean G, 2.44, is obtained. Thus the same problem observed for the MEAN remains: G is attracted by the (still) extreme value of 0.77, though to a lesser extent due to the nature of the logarithmic transformation. Using the above example, the difference between the MEDIAN and maximum is 3.7 units, but in logarithmic units the difference is only 0.43. Clearly, when calculating the arithmetic mean, the maximum value will have less leverage when using logarithmically transformed data.

Because the MEDIAN is solely based on the sequence of the data values, it is not affected by extreme values; the extreme value could even be much higher without any influence on the MEDIAN. The MEDIAN would not even change when not only the largest value but also the next two lower values were much higher (or the lowest values much lower). For large n the MEDIAN is resistant to up to 50 per cent of the data values being extreme. Thus the MEDIAN may be the best measure of the central value if dealing with data containing extreme values.
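The resistance of the MEDIAN can be checked on the example data. In the following Python sketch the three largest values are multiplied by 1000; this factor is an arbitrary assumption chosen only to make the effect obvious.

```python
import statistics

original = [2.3, 2.7, 1.9, 2.1, 1.8, 2.4, 2.0, 5.9]
# Same data, with the three largest values (2.4, 2.7, 5.9) inflated 1000-fold.
inflated = [2.3, 2700.0, 1.9, 2.1, 1.8, 2400.0, 2.0, 5900.0]

median_before = statistics.median(original)
median_after = statistics.median(inflated)
mean_before = statistics.mean(original)
mean_after = statistics.mean(inflated)
```

The two central values of the sorted data (2.1 and 2.3) are untouched by the change, so the MEDIAN stays the same, while the MEAN increases by several orders of magnitude.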

4.1.5 Trimmed mean and other robust measures of the central value

To this point four methods of obtaining a central value have been discussed. However, others exist, such as calculating a trimmed mean. Here a selected proportion of extreme data values are excluded before calculating the mean. Thus the five per cent trimmed mean would be the mean of the data between the 5th and 95th percentiles, i.e., the top and bottom five per cent of the data have been discarded from the calculation. Because the exact proportion that should be trimmed is unknown, the procedure introduces an amount of subjectivity into computing the mean. Graphical inspection of the data distribution, e.g., in the CP-plot, can be very helpful in deciding on the trimming percentage. There are other robust estimators (i.e. estimators that are less influenced by data outliers and deviations from the model of a normal data distribution) of the central value, based on the M-estimator (Huber, 1964). Data points that are far away from the centre of the distribution are down-weighted by these robust estimators. The M-estimator is less robust than the MEDIAN. However, because the MEDIAN uses less information it is less precise in estimating the central value of the underlying distribution.
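A trimmed mean can be sketched in Python as follows. The function `trimmed_mean` is a hypothetical helper, and rounding the number of discarded values down to a whole number is an assumption; other implementations may handle fractional counts differently.

```python
import math

def trimmed_mean(values, proportion):
    """Mean after discarding `proportion` of the data from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = math.floor(proportion * len(ordered))  # values cut from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [2.3, 2.7, 1.9, 2.1, 1.8, 2.4, 2.0, 5.9]
untrimmed = trimmed_mean(data, 0.0)    # ordinary MEAN
trimmed = trimmed_mean(data, 0.125)    # drops one value per tail: 1.8 and 5.9
```

With only eight values, a five per cent trim would remove nothing under this convention, which is why the sketch uses 12.5 per cent. Once the high value 5.9 is discarded, the result moves from about 2.64 to about 2.23, close to the MEDIAN of 2.2.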

4.1.6 Influence of the shape of the data distribution

It has been demonstrated how the different measures of the central value behave when the data contain one (or several) extreme value(s). How does the shape of the distribution influence the central value? Figure 4.1 shows, starting with a normal distribution, how the different measures of the central value depend on the shape of the data distribution.

Figure 4.1 Graphical presentation of a selection of different statistical data distributions (normal, lognormal, chi-square, Student t, exponential and multimodal) and the location of the four different measures of the central value discussed in the text. Panels: normal distribution; lognormal distribution; Student t distribution, df = 5; chi-square distribution, df = 5; exponential distribution; multimodal distribution. Each panel marks the positions of MEAN, G, MODE and MEDIAN.

In the case of a normal distribution, all four measures will, theoretically, take the same value (Figure 4.1), though there may be slight differences depending on the actual data. For a normal distribution the MEAN is the best (most precise) central value.

For a lognormal distribution important differences occur: the arithmetic mean is strongly influenced by high values (Figure 4.1). The MEDIAN and G are (theoretically) equal. The reason is that the logarithm of the lognormal distribution is a normal distribution. Thus the geometric mean G of the lognormal distribution corresponds to the MEAN of the normal distribution. For a normal distribution the MEAN is the best measure of the central value. Thus G can be considered as the best measure of the central value of a lognormal distribution. The position of the MEDIAN will not be influenced by a logarithmic transformation. Thus we get the same value for G and MEDIAN. The MODE is lower than all other measures and identifies the peak of the lognormal distribution. Still, the MODE may not be the best measure of the central value because far fewer data occur below the mode than above. The MEDIAN still separates the data into two equal halves and is, together with the geometric mean G, a good measure of the central value for a lognormal distribution. The most precise measure is again the MEAN, calculated for the log-transformed data and then back-transformed to the original data scale.
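The behaviour described for the lognormal distribution can be checked by simulation. The following Python sketch assumes a lognormal distribution whose logarithm has mean 0 and standard deviation 1, a sample size of 100 000 and a fixed seed; all of these are illustrative choices, not values from the book. The closed-form expressions in the last lines are the standard theoretical values for a lognormal distribution.

```python
import math
import random
import statistics

rng = random.Random(42)
mu, sigma = 0.0, 1.0  # mean and standard deviation of the log-transformed data
sample = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]

mean = statistics.fmean(sample)
median = statistics.median(sample)
g = math.exp(statistics.fmean(math.log(x) for x in sample))

# Theoretical values for a lognormal distribution with these parameters.
theory_median = math.exp(mu)              # also the theoretical geometric mean
theory_mean = math.exp(mu + sigma**2 / 2)
theory_mode = math.exp(mu - sigma**2)
```

In such a run G and the MEDIAN should both lie close to 1, the MEAN close to 1.65, and the theoretical MODE (about 0.37) below all of them.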

The Student t distribution is a symmetrical distribution with a different shape than the normal distribution (Student (W.S. Gosset), 1908). The likelihood that it contains values that are far away from the centre of the distribution is much higher; it has “heavy tails”. The “heaviness” of the tails depends on a property named the degrees of freedom (df) (see, e.g., Abramowitz and Stegun, 1965). A small value of df results in very heavy tails. With ever increasing df the shape of the normal distribution is approached. Because the distribution is symmetrical, in theory all four measures of the central value are again equal (Figure 4.1). However, in practice the Student t distribution describes data with many extreme values, and thus the measures can strongly deviate from each other. Of the measures discussed here, the MEDIAN is the only reliable measure of the central value because it is resistant to a high proportion of extreme values.

The chi-square distribution is a right-skewed distribution (see, e.g., Abramowitz and Stegun, 1965). As for the t distribution, the parameter df (degrees of freedom) determines the shape. With increasing df the distribution becomes more and more symmetric and will in the end approximate the normal distribution. All four measures of the central value will usually differ (MODE < G < MEDIAN < MEAN; see Figure 4.1). The MEAN is influenced by the high values (right-skewed distribution). The geometric mean G will be influenced by the low values because the log-transformed values of a chi-square distribution are left skewed. The MODE identifies the peak, but in a right-skewed distribution fewer values will occur below the MODE than above. The MEDIAN is thus the best measure of the central value.

The exponential distribution is again right skewed, and the same sequence of the measures of the central value as for the chi-square distribution is observed. The reasons are the same. Here the MODE is zero and uninformative. Again the MEDIAN is the best measure of the central value.
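The ordering of the measures for the two right-skewed distributions can be illustrated by simulation. The Python sketch below is built on assumptions: the helper `central_values`, the sample size, the seed and the exponential rate of 1 are illustrative choices. The chi-square sample is generated as a gamma distribution with shape df/2 and scale 2, and the MODE is taken from theory (df − 2 for df ≥ 2) rather than estimated.

```python
import math
import random
import statistics

def central_values(sample):
    """Return (G, MEDIAN, MEAN) for a sample of strictly positive values."""
    if min(sample) <= 0:
        raise ValueError("geometric mean requires strictly positive values")
    g = math.exp(statistics.fmean(math.log(x) for x in sample))
    return g, statistics.median(sample), statistics.fmean(sample)

rng = random.Random(7)
n = 50_000
df = 5

# Chi-square with df degrees of freedom equals gamma(shape=df/2, scale=2).
chi2 = [rng.gammavariate(df / 2, 2.0) for _ in range(n)]
# Exponential distribution with rate 1.
expo = [rng.expovariate(1.0) for _ in range(n)]

chi2_g, chi2_median, chi2_mean = central_values(chi2)
expo_g, expo_median, expo_mean = central_values(expo)
chi2_mode = df - 2  # theoretical mode of a chi-square distribution, df >= 2
```

For both samples the sequence G < MEDIAN < MEAN should appear, with the theoretical MODE of the chi-square distribution (3 for df = 5) below G and the MODE of the exponential distribution at zero.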

A multimodal distribution has more than one peak. All four measures of the central value will generally provide different locations. In the case shown in the example plot (Figure 4.1), it is very difficult to decide which measure is the best indication of the central value because three data populations are clearly present where the distribution functions are superimposed on one another. It would be best to separate the three distributions and estimate the central value of each distribution individually. This is impractical because of the extent of the area of overlap, so one central value for the whole distribution has to be estimated, even knowing that the central value will not be meaningful for any of the three underlying distributions. The MODE will be meaningful as a central value for one population. MEAN and G can both be strongly influenced by outliers or skewness. The MEDIAN still divides the resulting total distribution into two equal halves and may thus again be the best measure of the central value. Note that it is impossible to transform this distribution to approach symmetry.

In conclusion, it can be stated that the MEDIAN is the most suitable measure of the central value when dealing with distributions with different shapes (i.e. when working with real data). It is also preferable because it is robust against a sizeable proportion of extreme values.
