UNIVARIATE DATA ANALYSIS

INTRODUCTION

There are several ways in which to summarize a univariate (single attribute) distribution. Quite often we will simply compute the mean and the variance, or plot its histogram. However, these statistics are very sensitive to extreme values (outliers) and do not provide any spatial information, which is the heart of a geostatistical study. In this section, we will describe a number of different methods that can be used to analyse data for a single variable.

SUMMARY STATISTICS

The summary statistics represented by a histogram can be grouped into three categories:

• measures of location,

• measures of spread, and

• measures of shape.

Measures of Location

Measures of location provide information about where the various parts of the data distribution lie, and are represented by the following:

• Minimum: Smallest value.

• Maximum: Largest value.

• Median: Midpoint of all observed data values, when arranged in ascending order. Half the values are above the median, and half are below. This statistic represents the 50th percentile of the cumulative frequency histogram and is not generally affected by an occasional erratic data point.

• Mode: The most frequently occurring value in the data set. This value falls within the tallest bar on the histogram.

• Quartiles: In the same way that the median splits the data into halves, the quartiles split the data into quarters. Quartiles represent the 25th, 50th and 75th percentiles on the cumulative frequency histogram.

• Mean: The arithmetic average of all data values. (This statistic is quite sensitive to extreme high or low values. A single erratic value or outlier can significantly bias the mean.) We use the following formula to determine the mean of a Population:

Mean = $\mu = \frac{1}{N} \sum_{i=1}^{N} Z_i$

where:

$\mu$ = population mean

N = number of observations (population size)

$\sum Z_i$ = sum of individual observations
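The location measures listed above can be sketched in a few lines of Python using only the standard library. The porosity values are invented for illustration, and the "inclusive" quartile method is just one of several common conventions, so quartiles from other software may differ slightly.

```python
import statistics

# Hypothetical porosity readings (%), invented for illustration only.
porosity = [12.1, 13.4, 14.0, 14.0, 15.2, 15.8, 16.3, 17.5, 29.0]

minimum = min(porosity)
maximum = max(porosity)
median = statistics.median(porosity)   # 50th percentile
mode = statistics.mode(porosity)       # most frequently occurring value

# Quartiles: the three cut points at the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(porosity, n=4, method="inclusive")

# Population mean: sum of the individual observations divided by N.
mean = sum(porosity) / len(porosity)

print(f"min={minimum}, max={maximum}, median={median}, mode={mode}")
print(f"quartiles: {q1}, {q2}, {q3}")
print(f"mean={mean:.2f}")
```

In this made-up data set the single high reading pulls the mean above the median, which is the sensitivity to extreme values noted in the list above.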

We can determine the mean of a Sample in a similar manner. The formula below for the sample mean is comparable to the formula above, except that the population notation has been replaced with that for samples.

Mean = $\bar{Z} = \frac{1}{n} \sum_{i=1}^{n} Z_i$

where:

$\bar{Z}$ = sample mean

n = number of observations (sample size)

$\sum Z_i$ = sum of individual observations

Measures of Spread

Measures of spread describe the variability of the data values, and are represented by the following:

• Variance: Average squared difference of the observed values from the mean. Because the variance involves squared differences, this statistic is very sensitive to abnormally high/low values.

Variance = $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (Z_i - \mu)^2$

Kachigan (1986) notes that the above formula is only appropriate for defining the variance of a population of observations. If this same formula were applied to a sample for the purpose of estimating the variance of the parent population from which the sample was drawn, it would tend to underestimate the population variance. The underestimation arises because the variance of each sample is calculated using the sample mean ($\bar{Z}$), rather than the population mean ($\mu$). If repeated samples were drawn from the population and the variance calculated from each, the resulting average of these variances would be lower than the true value of the population variance (assuming we were able to measure every single member of the population).

We can avoid this bias by taking the sum of squared deviations and dividing that sum by the number of observations less one (n - 1). Thus, the sample estimate of population variance is obtained using the following formula:

Variance = $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Z_i - \bar{Z})^2$
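The bias described above can be checked with a small simulation. This is only a sketch under stated assumptions: the "population" is a hypothetical set of 10,000 normally distributed values, and the sample size and number of repeated samples are arbitrary choices. On average, the divide-by-n formula should come out near (n - 1)/n of the population variance, while the divide-by-(n - 1) formula should come out close to it.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 values drawn from a normal distribution.
population = [random.gauss(15.0, 3.0) for _ in range(10_000)]
true_variance = statistics.pvariance(population)

n = 5           # size of each sample (arbitrary)
trials = 5_000  # number of repeated samples (arbitrary)

biased, unbiased = [], []
for _ in range(trials):
    sample = random.sample(population, n)
    # Both calls measure deviations from the sample mean.
    biased.append(statistics.pvariance(sample))   # divides by n
    unbiased.append(statistics.variance(sample))  # divides by n - 1

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)

print(f"population variance:         {true_variance:.2f}")
print(f"average variance, divisor n: {mean_biased:.2f}")
print(f"average variance, n - 1:     {mean_unbiased:.2f}")
```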

• Standard Deviation: Square root of the variance.

Standard Deviation = $\sigma = \sqrt{\sigma^2}$

This measure shows the extent to which the data are spread about the mean: a small standard deviation indicates that the data are clustered near the mean. For example, if we had a mean equal to 10 and a standard deviation of 1.3, then we could predict that most of our data would fall somewhere between (10 - 1.3) and (10 + 1.3), that is, between 8.7 and 11.3 (about two-thirds of the values, if the data are approximately normally distributed). The standard deviation is often used instead of the variance because its units are the same as the units of the attribute being described.

• Interquartile Range: Difference between the upper (75th percentile) and the lower (25th percentile) quartile. Because this measure does not use the mean as the center of the distribution, it is less sensitive to abnormally high/low values.
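To see how differently the spread measures above react to one erratic value, the following sketch computes the variance, standard deviation and interquartile range for a small data set, then repeats the calculation after replacing one value with an abnormally high one. The numbers are invented, the helper name `spread` is hypothetical, and the quartiles use the "inclusive" convention of Python's `statistics.quantiles`.

```python
import statistics

def spread(values):
    """Return (variance, standard deviation, interquartile range)."""
    variance = statistics.pvariance(values)   # population variance (divisor N)
    std_dev = variance ** 0.5                 # square root of the variance
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return variance, std_dev, q3 - q1

# Hypothetical porosity values (%), invented for illustration only.
clean = [13.0, 14.0, 14.5, 15.0, 15.0, 15.5, 16.0, 17.0, 15.0]
with_outlier = clean[:-1] + [45.0]   # last value replaced by an erratic one

var_clean, sd_clean, iqr_clean = spread(clean)
var_out, sd_out, iqr_out = spread(with_outlier)

print(f"clean:        variance={var_clean:.2f}  sd={sd_clean:.2f}  IQR={iqr_clean:.2f}")
print(f"with outlier: variance={var_out:.2f}  sd={sd_out:.2f}  IQR={iqr_out:.2f}")
```

With these values the standard deviation grows several-fold while the interquartile range changes only modestly.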

Figures 1a and 1b illustrate histograms of porosity with a mean of about 15%, but different variances.

Figure 1a

Outliers or “Spurious” Data

Another statistic to consider is the Z-score, a summary statistic expressed in terms of standard deviations. Data values that “appear” anomalous because the absolute values of their Z-scores are greater than a specified cutoff are termed outliers. The typical cutoff is 2.5 standard deviations from the mean. The Z-score is the ratio of the data value minus the sample mean to the sample standard deviation.

Z-score = $(Z_i - \bar{Z}) / s$

This statistic serves as a caution, signifying either bad data, or a true local anomaly, which must be taken into account in the final analysis.

Note: The Z-score transform does not change the shape of the histogram. The transform re-scales the histogram to a mean equal to 0 and a variance equal to 1. If the histogram is skewed before being transformed, it retains the same shape after the transform. The X-axis is now in terms of ± standard deviation units about the mean of zero.
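As a rough illustration of the Z-score screening described above, the sketch below standardizes a small, invented porosity data set and flags values whose absolute Z-score exceeds the 2.5 cutoff. The function name `z_scores` and the data are assumptions made for the example; the sample standard deviation (divisor n - 1) is used.

```python
import statistics

def z_scores(values):
    """Standardize values: (value - sample mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    std_dev = statistics.stdev(values)   # sample standard deviation (n - 1)
    return [(v - mean) / std_dev for v in values]

# Hypothetical porosity values (%) with one suspicious reading.
data = [14.0, 15.0, 16.0, 15.0, 14.0, 16.0, 15.0, 15.0,
        14.0, 16.0, 15.0, 15.0, 14.0, 16.0, 40.0]

CUTOFF = 2.5  # typical cutoff, in standard deviations from the mean

scores = z_scores(data)
outliers = [v for v, z in zip(data, scores) if abs(z) > CUTOFF]

print("flagged as possible outliers:", outliers)
# The transformed values have mean 0 and variance 1 (up to rounding).
print(f"mean of Z-scores:     {statistics.fmean(scores):.6f}")
print(f"variance of Z-scores: {statistics.variance(scores):.6f}")
```

A flagged value is only a candidate for closer inspection: it may be bad data or a genuine local anomaly.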

Measures of Shape

Measures of shape describe the appearance of the histogram and are represented by the following:

• Coefficient of Skewness: Average cubed difference between the data values and the mean, divided by the cube of the standard deviation. This measure is very sensitive to abnormally high/low values:

$CS = \frac{\frac{1}{n} \sum_{i=1}^{n} (Z_i - \mu)^3}{\sigma^3}$

where:

$\mu$ is the mean

$\sigma$ is the standard deviation

n is the number of data values

The coefficient of skewness allows us to quantify the symmetry of the data distribution, and tells us when a few exceptional values (possibly outliers?) exert a disproportionate effect upon the mean.

• positive: long tail of high values (median < mean)

• negative: long tail of low values (median > mean)

• zero: a symmetrical distribution

Figures 2a, 2b, and 2c illustrate histograms with negative, symmetrical and positive skewness.

Figure 2a
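The sign conventions above can be checked numerically. The following is a minimal sketch that assumes the coefficient of skewness is the average cubed deviation from the mean divided by the cube of the standard deviation, with the standard deviation computed using divisor n. The function name and the three small data sets are invented for illustration.

```python
import statistics

def coefficient_of_skewness(values):
    """Average cubed deviation from the mean, divided by the cubed std. dev."""
    n = len(values)
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)   # standard deviation with divisor n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / std_dev ** 3

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
positive = [1.0, 1.0, 2.0, 2.0, 3.0, 10.0]   # long tail of high values
negative = [-v for v in positive]            # mirror image: long tail of low values

for name, values in [("symmetric", symmetric),
                     ("positive", positive),
                     ("negative", negative)]:
    cs = coefficient_of_skewness(values)
    print(f"{name:9s} CS={cs:+.3f}  "
          f"median={statistics.median(values):.2f}  "
          f"mean={statistics.fmean(values):.2f}")
```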

• Coefficient of Variation: Often used as an alternative to skewness as a measure of asymmetry for positively skewed distributions with a minimum at zero. It is defined as the ratio of the standard deviation to the mean. A value of CV > 1 probably indicates the presence of some high erratic values (outliers).

CV = $\sigma / \mu$

where:

$\sigma$ is the standard deviation

$\mu$ is the mean
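A short sketch of the coefficient of variation follows, assuming the standard deviation is computed with divisor n. The function name and both data sets are hypothetical; the second set contains one very high value so that CV exceeds 1.

```python
import statistics

def coefficient_of_variation(values):
    """Ratio of the standard deviation to the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean

# Invented, non-negative data sets for illustration only.
regular = [8.0, 9.0, 10.0, 11.0, 12.0]
erratic = [0.1, 0.2, 0.3, 0.5, 1.0, 50.0]   # one high erratic value

cv_regular = coefficient_of_variation(regular)
cv_erratic = coefficient_of_variation(erratic)

print(f"CV (regular data): {cv_regular:.2f}")
print(f"CV (erratic data): {cv_erratic:.2f}")
```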

SUMMARY OF UNIVARIATE STATISTICAL MEASURES AND DISPLAYS

Advantages

• Easy to calculate.

• Provides information in a very condensed form.

• Can be used as parameters of a distribution model (e.g., normal distribution defined by sample mean and variance).

Limitations

• Summary statistics are too condensed, and do not carry enough information about the shape of the distribution.

• Certain statistics are sensitive to abnormally high/low values that properly belong to the data set (e.g., $\mu$, $\sigma^2$, CS).

• Offers only a limited description, especially if our real interest is in a multivariate data set (attributes are correlated).

BIVARIATE STATISTICAL MEASURES AND DISPLAYS
