Background - Quality. Guidance for Data Quality Assessment. Practical Methods for Data Analysis

Outliers are measurements that are extremely large or small relative to the rest of the data and, therefore, are suspected of misrepresenting the population from which they were collected. Outliers may result from transcription errors, data-coding errors, or measurement system problems such as instrument breakdown. However, outliers may also represent true extreme values of a distribution (for instance, hot spots) and indicate more variability in the population than was expected. Not removing true outliers and removing false outliers both lead to a distortion of estimates of population parameters.

Statistical outlier tests give the analyst probabilistic evidence that an extreme value (potential outlier) does not "fit" with the distribution of the remainder of the data and is therefore a statistical outlier. These tests should only be used to identify data points that require further investigation. The tests alone cannot determine whether a statistical outlier should be discarded or corrected within a data set; this decision should be based on judgmental or scientific grounds.

EPA QA/G-9 Final

QA00 Version 4 - 25 July 2000

Box 4-13: Directions for the Wald-Wolfowitz Runs Test

Consider a sequence of two values and let n denote the number of observations of one value and m denote the number of observations of the other value. Note that it is customary for n < m (i.e., n denotes the value that occurs the fewest times). This test is used to test the null hypothesis that the sequence is random against the alternative hypothesis that the data in the sequence are correlated or may come from different populations.

STEP 1: List the data in the order collected and identify which will be the ‘n’ values, and which will be the ‘m’ values.

STEP 2: Bracket the sequences within the series. A sequence is a group of consecutive identical values. For example, consider the data AAABAABBBBBBABB. The following are the sequences in the data:

{AAA} {B} {AA} {BBBBBB} {A} {BB}

In the example above, the smallest sequence has one data value and the largest sequence has 6.

STEP 3: Count the number of sequences for the ‘n’ values and call it T. For the example sequence, the ‘n’ values are ‘A’ since there are 6 A’s and 9 B’s, and T = 3: {AAA}, {AA}, and {A}.

STEP 4: If T is less than the critical value from Table A-12 of Appendix A for the specified significance level α, then reject the null hypothesis that the sequence is random in favor of the alternative that the data are correlated amongst themselves or possibly came from different distributions. Otherwise, conclude the sequence is random. In the example above, 3 < 6 (where 6 is the critical value from Table A-12 using n = 6, m = 9, and α = 0.01), so the null hypothesis that the sequence is random is rejected.
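The run-counting steps of Box 4-13 can be sketched in a few lines of Python. This is a minimal illustration, not part of the guidance: the function names are hypothetical, the example sequence is built by concatenating the bracketed sequences listed in Step 2, and the critical value is passed in by the caller because Table A-12 is not reproduced here.

```python
from itertools import groupby


def bracket_runs(sequence):
    """STEP 2: split a sequence into runs of identical consecutive values."""
    return [(value, len(list(group))) for value, group in groupby(sequence)]


def runs_statistic(sequence):
    """STEPS 1 and 3: return (n, m, T) for a sequence of exactly two values.

    n is the count of the less frequent value, m the count of the other
    value, and T the number of runs formed by the less frequent value.
    """
    values = sorted(set(sequence))
    if len(values) != 2:
        raise ValueError("sequence must contain exactly two distinct values")
    counts = {v: sum(1 for x in sequence if x == v) for v in values}
    rare, common = sorted(values, key=lambda v: counts[v])
    t = sum(1 for value, _ in bracket_runs(sequence) if value == rare)
    return counts[rare], counts[common], t


def reject_randomness(t, critical_value):
    """STEP 4: reject the null hypothesis of randomness when T < critical value."""
    return t < critical_value


# Bracketed sequences from Step 2, concatenated in order.
example = "AAA" + "B" + "AA" + "BBBBBB" + "A" + "BB"
n, m, t = runs_statistic(example)
print(bracket_runs(example))
print(f"n={n}, m={m}, T={t}")  # n=6, m=9, T=3
# 6 is the critical value quoted in the box for n=6, m=9, alpha=0.01.
print(reject_randomness(t, critical_value=6))  # True
```

The critical value is deliberately left as an input so that the sketch does not stand in for the table lookup.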

There are 5 steps involved in treating extreme values or outliers:

1. Identify extreme values that may be potential outliers;

2. Apply statistical test;

3. Scientifically review statistical outliers and decide on their disposition;

4. Conduct data analyses with and without statistical outliers; and

5. Document the entire process.

Potential outliers may be identified through the graphical representations of Chapter 2 (step 1 above). Graphs such as the box and whisker plot, ranked data plot, normal probability plot, and time plot can all be used to identify observations that are much larger or smaller than the rest of the data. If potential outliers are identified, the next step is to apply one of the statistical tests described in the following sections. Section 4.4.2 provides recommendations on selecting a statistical test for outliers.
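As a rough numerical counterpart to reading a box and whisker plot, the screening in step 1 can be sketched as follows. The 1.5 × IQR fence is a common box-plot convention assumed here for illustration; it is not taken from this section, and the data values and function names are hypothetical. Values flagged this way are only candidates for the statistical tests, not confirmed outliers.

```python
from statistics import quantiles


def box_plot_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_potential_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the fences."""
    lower, upper = box_plot_fences(values, k)
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical measurements, in the order collected.
data = [4.1, 4.3, 4.0, 4.2, 4.4, 4.1, 9.8, 4.3]
flagged = flag_potential_outliers(data)
print(flagged)  # [(6, 9.8)]
```

Keeping the index alongside the value makes it easier to trace a flagged point back to the original record during the scientific review.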

Box 4-14: An Example of the Wald-Wolfowitz Runs Test

This is a set of monitoring data from the main discharge station at a chemical manufacturing plant. The permit states that the discharge should have a pH of 7.0 and should never be less than 5.0. The plant manager has therefore decided to use a pH of 6.0 to indicate potential problems. In a four-week period the following values were recorded:

6.5 6.6 6.4 6.2 5.9 5.8 5.9 6.2 6.2 6.3 6.6 6.6 6.7 6.4 6.2 6.3 6.2 5.8 5.9 5.8 6.1 5.9 6.0 6.2 6.3 6.2

STEP 1: Since the plant manager has decided that a pH of 6.0 will indicate trouble, the data have been replaced with a binary indicator. If the value is greater than 6.0, the value is replaced by a 1; otherwise the value is replaced by a 0. So the data are now:

1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1

As there are 8 values of ‘0’ and 18 values of ‘1’, n = 8 and m = 18.

STEP 2: The bracketed sequence is: {1 1 1 1} {0 0 0} {1 1 1 1 1 1 1 1 1 1} {0 0 0} {1} {0 0} {1 1 1}

STEP 3: T = 3: {0 0 0}, {0 0 0}, and {0 0}

STEP 4: Since 3 < 9 (where 9 is the critical value from Table A-12 using α = 0.05), the null hypothesis that the sequence is random is rejected.
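The arithmetic in Box 4-14 can be reproduced with a short script. This is a sketch under stated assumptions: the pH values are typed in exactly as listed in the box, the variable names are illustrative, and the critical value of 9 is simply the figure quoted in Step 4 rather than something looked up by the code.

```python
from itertools import groupby

# pH readings as listed in Box 4-14, in the order recorded.
ph = [6.5, 6.6, 6.4, 6.2, 5.9, 5.8, 5.9, 6.2, 6.2, 6.3, 6.6, 6.6, 6.7,
      6.4, 6.2, 6.3, 6.2, 5.8, 5.9, 5.8, 6.1, 5.9, 6.0, 6.2, 6.3, 6.2]

THRESHOLD = 6.0  # action level chosen by the plant manager

# STEP 1: 1 if the reading is greater than the threshold, otherwise 0.
indicator = [1 if value > THRESHOLD else 0 for value in ph]

# STEP 2: bracket the runs of identical consecutive indicator values.
run_list = [(value, len(list(group))) for value, group in groupby(indicator)]

# STEP 3: T is the number of runs of the less frequent value (here 0).
n = indicator.count(0)
m = indicator.count(1)
t = sum(1 for value, _ in run_list if value == 0)

# STEP 4: compare T with the critical value quoted in the box.
critical_value = 9
print("indicator:", indicator)
print("runs:", run_list)
print("n =", n, "m =", m, "T =", t)
print("reject randomness:", t < critical_value)
```

Because the counts are derived directly from the listed readings, the script doubles as a check on the hand tally.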

If a data point is found to be an outlier, the analyst may either: 1) correct the data point; 2) discard the data point from analysis; or 3) use the data point in all analyses. This decision should be based on scientific reasoning in addition to the results of the statistical test. For instance, data points containing transcription errors should be corrected, whereas data points collected while an instrument was malfunctioning may be discarded. One should never discard an outlier based solely on a statistical test. Instead, the decision to discard an outlier should be based on some scientific or quality assurance basis. Discarding an outlier from a data set should be done with extreme caution, particularly for environmental data sets, which often contain legitimate extreme values. If an outlier is discarded from the data set, all statistical analysis of the data should be applied to both the full and truncated data set so that the effect of discarding observations may be assessed. If scientific reasoning does not explain the outlier, it should not be discarded from the data set.
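Running the same analysis on the full and truncated data sets can be sketched as below. The data, the choice of summary statistics, and the function names are all hypothetical; in practice the comparison would use whatever analyses the study actually calls for.

```python
from statistics import mean, stdev


def summarize(values):
    """Basic summary statistics for one version of the data set."""
    return {"n": len(values), "mean": mean(values), "stdev": stdev(values)}


def compare_with_and_without(values, suspect_indices):
    """Summarize the full data set and the data set with suspects removed."""
    suspects = set(suspect_indices)
    truncated = [v for i, v in enumerate(values) if i not in suspects]
    return summarize(values), summarize(truncated)


# Hypothetical measurements; index 6 is the statistical outlier under review.
data = [4.1, 4.3, 4.0, 4.2, 4.4, 4.1, 9.8, 4.3]
full, truncated = compare_with_and_without(data, suspect_indices=[6])

for label, stats in (("full", full), ("truncated", truncated)):
    print(f"{label:>9}: n={stats['n']} "
          f"mean={stats['mean']:.3f} stdev={stats['stdev']:.3f}")
```

Reporting both rows side by side shows how much the estimates depend on the disputed observation, which is the effect the guidance asks the analyst to assess.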

If any data points are found to be statistical outliers through the use of a statistical test, this information will need to be documented along with the analysis of the data set, regardless of whether any data points are discarded. If no data points are discarded, document the identification of any "statistical" outliers by documenting the statistical test performed and the possible scientific reasons investigated. If any data points are discarded, document each data point, the statistical test performed, the scientific reason for discarding each data point, and the effect on the analysis of deleting the data points. This information is critical for effective peer review.
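One way to keep this documentation consistent is a small record structure that refuses to mark a point as discarded without the required justification. The sketch below is purely illustrative: the class, its field names, and the example entries are hypothetical and are not prescribed by the guidance.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class OutlierRecord:
    """Documentation for one statistical outlier (hypothetical layout)."""
    value: float
    test_performed: str
    reasons_investigated: list = field(default_factory=list)
    discarded: bool = False
    reason_for_discarding: str = ""
    effect_on_analysis: str = ""

    def validate(self):
        """Check that the documentation required for this disposition exists."""
        if not self.test_performed:
            raise ValueError("the statistical test performed must be documented")
        if self.discarded and not self.reason_for_discarding:
            raise ValueError("a discarded point needs a scientific reason")
        if self.discarded and not self.effect_on_analysis:
            raise ValueError("a discarded point needs its effect on the analysis")
        return True


# Hypothetical entry for a point that was retained after review.
retained = OutlierRecord(
    value=9.8,
    test_performed="(name of the statistical test applied)",
    reasons_investigated=["transcription error", "instrument malfunction"],
)
retained.validate()
print(asdict(retained))
```

Serializing the records with `asdict` gives a plain dictionary that can be stored alongside the analysis for peer review.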


In document Quality. Guidance for Data Quality Assessment. Practical Methods for Data Analysis EPA QA/G-9 QA00 UPDATE. EPA/600/R-96/084 July, 2000 (Page 136-139)