Exploratory Data Analysis

Academic year: 2021

Exploratory data analysis is often the first step in a statistical analysis, for it helps in understanding the main features of the particular sample that an analyst is using. Intelligent descriptions or summaries of the data may sometimes be sufficient to fulfill the purposes for which the data were gathered. Effective summaries can also point to "bad" data or unexpected aspects that might go unnoticed if data are blindly crunched by computers. Further, exploratory data analysis suggests possible probability models for the data and helps in understanding the population features that a good model ought to be able to reproduce.

Here we shall briefly discuss ways of summarizing three features of the distribution of a batch of data $Z = (Z_1, \ldots, Z_n)$: its center, its spread and its shape. All three concepts are deliberately kept vague. For further references see Hoaglin et al. (1983) and Mosteller and Tukey (1977). The term batch is used to emphasize the fact that at this stage no commitment to a statistical model is being made.

1.1 MEASURES OF CENTER

A very popular measure of center (or location) is the (arithmetic) mean

$$\bar{Z} = n^{-1} \sum_{i=1}^{n} Z_i = n^{-1} \iota^\top Z,$$

where $\iota$ denotes an $n$-vector of ones. Notice that the mean need not coincide with any of the observations in the batch.

The use of the mean is partly justified by its linearity property, that is, if $X$ and $Y$ are batches of data of equal size and $Z$ is such that $Z_i = \mu + \alpha X_i + \beta Y_i$, then

$$\bar{Z} = \mu + \alpha \bar{X} + \beta \bar{Y}.$$

Notice however that if $Z_i = g(X_i, Y_i)$, where $g$ is an arbitrary function, then it is generally not true that $\bar{Z} = g(\bar{X}, \bar{Y})$.

The mean is very sensitive to adding and dropping observations. In particular, it is very sensitive to even a single outlier, that is, an arbitrarily large or small data point. To see this, let $\bar{Z}_{n-1}$ denote the mean of a batch of $n-1$ observations. The mean of the batch of $n$ observations obtained by adding the value $z$ to the initial batch is equal to

$$\bar{Z}_n = n^{-1} \Big( \sum_{i=1}^{n-1} Z_i + z \Big) = \Big( 1 - \frac{1}{n} \Big) \bar{Z}_{n-1} + \frac{1}{n}\, z,$$

that is, $\bar{Z}_n$ is a weighted average of $\bar{Z}_{n-1}$ and $z$. For any fixed $n$,

$$|\bar{Z}_n - \bar{Z}_{n-1}| = \frac{|z - \bar{Z}_{n-1}|}{n} \to \infty$$

as $|z| \to \infty$. Since a single outlier is enough to take $\bar{Z}_n$ arbitrarily far away from $\bar{Z}_{n-1}$, we say that the mean is not a robust measure of center. On the other hand, for any fixed finite $z$,

$$\bar{Z}_n - \bar{Z}_{n-1} = \frac{z - \bar{Z}_{n-1}}{n} \to 0$$

as $n \to \infty$, and so the effect of a single outlier vanishes as the size of the batch gets arbitrarily large.

The normalized difference

$$SC(z) = n(\bar{Z}_n - \bar{Z}_{n-1}) = z - \bar{Z}_{n-1},$$

viewed as a function of $z$, is called the sensitivity curve of the mean. The fact that this function is unbounded simply reflects the lack of robustness of the mean.
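As a small numerical check of the sensitivity curve, the sketch below computes $n(\bar{Z}_n - \bar{Z}_{n-1})$ directly and compares it with $z - \bar{Z}_{n-1}$. The function name `mean_sensitivity` and the batch values are illustrative assumptions, not part of the original notes.

```python
from statistics import fmean

def mean_sensitivity(batch, z):
    """Sensitivity curve of the mean: n * (mean with z added - mean without z)."""
    n = len(batch) + 1                      # size of the enlarged batch
    return n * (fmean(batch + [z]) - fmean(batch))

batch = [2.0, 4.0, 6.0, 8.0]                # hypothetical batch with mean 5.0
for z in (5.0, 50.0, 500.0):
    # The last two columns should agree up to floating-point rounding.
    print(z, mean_sensitivity(batch, z), z - fmean(batch))
```

The printed values grow without bound as `z` grows, which is the lack of robustness described above.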

A closely related concept is J.W. Tukey's empirical influence function. Let $\bar{Z}_n$ denote the mean of a batch of data of size $n$, and let $\bar{Z}_{(i)}$ denote the mean of the batch of size $n-1$ obtained by deleting the $i$-th data point $Z_i$. It is easy to verify that

$$\bar{Z}_n - \bar{Z}_{(i)} = \frac{Z_i - \bar{Z}_n}{n-1}, \qquad i = 1, \ldots, n.$$

The empirical influence function of the mean is an $n$-vector with $i$-th element equal to $\bar{Z}_n - \bar{Z}_{(i)}$. An influential observation is one for which the difference $|\bar{Z}_n - \bar{Z}_{(i)}|$ is large or, equivalently, the residual $Z_i - \bar{Z}_n$ is large.
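The leave-one-out identity can be checked numerically. The following is a minimal sketch, assuming a small made-up batch; `empirical_influence` is a hypothetical helper name.

```python
from statistics import fmean

def empirical_influence(batch):
    """Return the n-vector with i-th element mean(batch) - mean(batch without Z_i)."""
    full = fmean(batch)
    return [full - fmean(batch[:i] + batch[i + 1:]) for i in range(len(batch))]

batch = [1.0, 2.0, 3.0, 4.0, 40.0]          # illustrative data with one large value
full = fmean(batch)
influence = empirical_influence(batch)
# Closed form from the text: (Z_i - mean) / (n - 1)
shortcut = [(z - full) / (len(batch) - 1) for z in batch]
print(influence)
print(shortcut)
```

The two printed vectors should coincide up to rounding, and the largest entry in absolute value belongs to the observation with the largest residual.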

To robustify the mean, let us sort the data in ascending order. The ordered data $Z_{(1)}, Z_{(2)}, \ldots, Z_{(n)}$, where $Z_{(1)} \le Z_{(2)} \le \cdots \le Z_{(n)}$, are called the set of order statistics of the batch. A reasonable measure of center is the (symmetric) $\alpha$-trimmed mean, defined as

$$\bar{Z}_\alpha = \frac{Z_{([n\alpha]+1)} + \cdots + Z_{(n-[n\alpha])}}{n - 2[n\alpha]}, \qquad 0 \le \alpha < .5,$$

where $[n\alpha]$ denotes the greatest integer less than or equal to $n\alpha$. Thus $\bar{Z}_\alpha$ is obtained by dropping the $[n\alpha]$ largest and $[n\alpha]$ smallest data points and then taking the average of the rest. The mean is the extreme case corresponding to $\alpha = 0$.
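The definition of the trimmed mean translates almost literally into code. This is a sketch under the assumption that $n\alpha$ is computed in floating point without special handling of rounding; the function name and the data are illustrative.

```python
import math

def trimmed_mean(batch, alpha):
    """Symmetric alpha-trimmed mean: drop the [n*alpha] smallest and largest points."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 0.5")
    n = len(batch)
    k = math.floor(n * alpha)               # [n*alpha], the number trimmed on each side
    kept = sorted(batch)[k:n - k]           # order statistics Z_(k+1), ..., Z_(n-k)
    return sum(kept) / (n - 2 * k)

batch = [3, 1, 4, 1, 5, 9, 2, 6, 5, 100]    # hypothetical batch with one outlier
print(trimmed_mean(batch, 0.0))             # ordinary mean, pulled up by the outlier
print(trimmed_mean(batch, 0.1))             # drops one point on each side
```

With `alpha = 0` the function reduces to the ordinary mean, as stated in the text.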

To compare the robustness properties of the mean and an $\alpha$-trimmed mean, we introduce the concept of breakdown point. Let $T(Z)$ be a measure of center for a batch $Z$ of size $n$, and let $T(Z^*)$ be the same measure for a new batch $Z^*$ obtained by replacing any $m \le n$ of the original data points by arbitrary values. Let

$$b(m; T, Z) = \sup_{Z^*} |T(Z^*) - T(Z)|,$$

where the supremum is taken over all possible $Z^*$. If $b(m; T, Z)$ is infinite, this means that $m$ outliers can have an arbitrarily large effect on $T$, which may be expressed by saying that $T$ "breaks down". Therefore, the breakdown point of $T$ is defined by

$$\epsilon(T, Z) = \min \Big\{ \frac{m}{n} : b(m; T, Z) = \infty \Big\}.$$

In other words, the breakdown point is the smallest fraction of contamination that can cause $T(Z^*)$ to take on values arbitrarily far from $T(Z)$.

It is straightforward to verify that the breakdown point of the mean is equal to $1/n$, whereas the breakdown point of the $\alpha$-trimmed mean is equal to $([n\alpha]+1)/n$.
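The breakdown points quoted above can be illustrated, though not proved, by contaminating a batch with a very large value. The sketch below examines only one contamination pattern rather than the supremum over all $Z^*$; the helper names `contaminate` and `shift` and the value `1e12` are assumptions made for illustration.

```python
import math

def trimmed_mean(batch, alpha):
    """Symmetric alpha-trimmed mean (alpha = 0 gives the ordinary mean)."""
    n = len(batch)
    k = math.floor(n * alpha)
    return sum(sorted(batch)[k:n - k]) / (n - 2 * k)

def contaminate(batch, m, value):
    """Replace the last m entries of the batch by the same arbitrary value."""
    return batch[:len(batch) - m] + [value] * m

def shift(batch, alpha, m, value=1e12):
    """|T(Z*) - T(Z)| for this one particular contaminated batch Z*."""
    return abs(trimmed_mean(contaminate(batch, m, value), alpha)
               - trimmed_mean(batch, alpha))

batch = [float(i) for i in range(1, 11)]    # n = 10, values 1, ..., 10
print(shift(batch, 0.0, 1))                 # mean: one bad point is enough
print(shift(batch, 0.2, 2))                 # [n*alpha] = 2 bad points are trimmed away
print(shift(batch, 0.2, 3))                 # [n*alpha] + 1 = 3 bad points break it
```

For $n = 10$ and $\alpha = .2$ the formula gives a breakdown point of $3/10$, which matches the behaviour of the last two printed values.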

The median may be viewed as the extreme case of an $\alpha$-trimmed mean corresponding to $\alpha \to .5$. When the number $n$ of data points in $Z$ is odd, the median $\tilde{Z}$ is unique and is equal to $Z_{((n+1)/2)}$. When $n$ is even, a median is any point in the interval $[Z_{(n/2)}, Z_{(n/2+1)}]$. This lack of uniqueness is conventionally resolved by defining

$$\tilde{Z} = \begin{cases} Z_{((n+1)/2)}, & \text{if $n$ is odd,} \\ .5\,[Z_{(n/2)} + Z_{(n/2+1)}], & \text{if $n$ is even.} \end{cases}$$

Notice that if $n$ is odd, the median exactly coincides with one of the observations. If $n$ is even, the median is the average of two adjacent order statistics. It is easy to verify that if $g$ is any increasing function and $X$ is such that $X_i = g(Z_i)$, then $\tilde{X} = g(\tilde{Z})$ when $n$ is odd; when $n$ is even, $g(\tilde{Z})$ is still a median of $X$, although in general it need not coincide with the conventional midpoint.
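The conventional definition of the median, and its behaviour under an increasing transformation, can be sketched as follows. The function name `conventional_median`, the choice $g = \exp$ and the data are illustrative assumptions.

```python
import math

def conventional_median(batch):
    """Median with the usual midpoint convention for even n."""
    s = sorted(batch)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # Z_((n+1)/2), shifted to 0-based indexing
    return 0.5 * (s[n // 2 - 1] + s[n // 2])  # .5 * [Z_(n/2) + Z_(n/2+1)]

odd = [0.5, 2.0, 1.0]
even = [1.0, 2.0, 3.0, 4.0]

# Odd n: the median of the transformed batch is the transformed median.
print(conventional_median([math.exp(z) for z in odd]), math.exp(conventional_median(odd)))
# Even n: the midpoint convention breaks exact equality for a nonlinear g.
print(conventional_median([math.exp(z) for z in even]), math.exp(conventional_median(even)))
```

In the even case the two numbers differ, but $g(\tilde{Z})$ still lies between the two middle order statistics of the transformed batch, so it remains a median in the wider sense.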

The breakdown point of the median is equal to $1/2$ if $n$ is even, and is equal to $(1 + n^{-1})/2$ if $n$ is odd. With little loss of generality, let $\tilde{Z}_{n-1}$ be the median of a batch of size $n-1$, where $n-1 = 2k$ is even. Thus, $\tilde{Z}_{n-1} = .5\,[Z_{(k)} + Z_{(k+1)}]$. The median of the batch of size $n$ obtained by adding the value $z$ to the previous batch is equal to

$$\tilde{Z}_n = \begin{cases} Z_{(k)}, & \text{if } z < Z_{(k)}, \\ z, & \text{if } Z_{(k)} \le z \le Z_{(k+1)}, \\ Z_{(k+1)}, & \text{if } z > Z_{(k+1)}. \end{cases}$$

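To compare the sensitivity curves of the mean and the median, consider the case when $\bar{Z}_{n-1} = \tilde{Z}_{n-1}$. Then

$$SC(z; \bar{Z}) = z - \bar{Z}_{n-1},$$

while

$$SC(z; \tilde{Z}) = \begin{cases} n(Z_{(k)} - \bar{Z}_{n-1}), & \text{if } z < Z_{(k)}, \\ n(z - \bar{Z}_{n-1}), & \text{if } Z_{(k)} \le z \le Z_{(k+1)}, \\ n(Z_{(k+1)} - \bar{Z}_{n-1}), & \text{if } z > Z_{(k+1)}. \end{cases}$$

Thus, unlike that of the mean, the sensitivity curve of the median is bounded in $z$.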
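The two sensitivity curves can be compared numerically on a batch whose mean and median coincide. This is a sketch with made-up data; the generic helper `sensitivity` is a hypothetical name.

```python
from statistics import fmean, median

def sensitivity(stat, batch, z):
    """n * (stat of the batch with z added - stat of the original batch)."""
    n = len(batch) + 1
    return n * (stat(batch + [z]) - stat(batch))

batch = [1.0, 2.0, 3.0, 4.0]                # n - 1 = 2k = 4, mean = median = 2.5
for z in (-100.0, 2.7, 100.0):
    # Columns: z, sensitivity curve of the mean, sensitivity curve of the median
    print(z, sensitivity(fmean, batch, z), sensitivity(median, batch, z))
```

Here $Z_{(k)} = 2$ and $Z_{(k+1)} = 3$, so the curve of the median stays between $5(2 - 2.5) = -2.5$ and $5(3 - 2.5) = 2.5$, while the curve of the mean keeps growing with $|z|$.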

Instead of choosing a single measure of center, it is often more informative to compute and compare several measures. For example, comparing the mean and the median gives an indication of the presence of skewness in the data (skewness is another vague concept!). If the data are symmetric, then the mean and the median coincide. If the data are skewed to the right (a long upper tail), then the mean is typically greater than the median. If the data are skewed to the left (a long lower tail), then the median is typically greater than the mean.

1.2 MEASURES OF SPREAD

Two measures of spread (or scale) based on order statistics are the range

$$\text{range} = \max_i \{Z_i\} - \min_i \{Z_i\} = Z_{(n)} - Z_{(1)}$$

and the interquartile range

$$\text{IQR} = \text{upper quartile} - \text{lower quartile},$$

where the upper quartile is the median of the data greater than or equal to the median, and the lower quartile is the median of the data smaller than or equal to the median.
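The quartile definition used here (medians of the two halves, each including points equal to the median) is only one of several conventions in use, so library routines may return slightly different values. A minimal sketch of this particular definition, with a hypothetical helper `quartiles` and made-up data:

```python
from statistics import median

def quartiles(batch):
    """Lower quartile, median and upper quartile under the definition in the text."""
    med = median(batch)
    lower = median([z for z in batch if z <= med])   # median of the data <= median
    upper = median([z for z in batch if z >= med])   # median of the data >= median
    return lower, med, upper

batch = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
lower, med, upper = quartiles(batch)
data_range = max(batch) - min(batch)        # Z_(n) - Z_(1)
iqr = upper - lower
print(data_range, lower, med, upper, iqr)
```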

Two other common measures of spread are the mean squared deviation from the mean

$$\hat{\sigma}^2 = n^{-1} \sum_{i=1}^{n} (Z_i - \bar{Z})^2,$$

or its square root $\hat{\sigma}$, called the standard deviation, and the mean absolute deviation from the mean

$$\tilde{\sigma} = n^{-1} \sum_{i=1}^{n} |Z_i - \bar{Z}|.$$

The first is just the mean of the squared deviations $(Z_i - \bar{Z})^2$, while the second is the mean of the absolute deviations $|Z_i - \bar{Z}|$. Because of their mean-like nature, neither measure is robust.

It is easily seen that

$$\hat{\sigma}^2 = n^{-1} \sum_{i=1}^{n} Z_i^2 - \bar{Z}^2.$$

Further, if $X$ is such that $X_i = a + bZ_i$, $b \ne 0$, then

$$\hat{\sigma}_X = |b|\, \hat{\sigma}_Z, \qquad \tilde{\sigma}_X = |b|\, \tilde{\sigma}_Z.$$

A highly robust estimate of spread is the median absolute deviation from the median

$$\text{MAD} = \operatorname{Med}_i |Z_i - \tilde{Z}|.$$
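The three spread measures can be compared on a batch with and without an outlier. The sketch below uses divisor $n$, as in the formulas above; the function names and the data are illustrative assumptions.

```python
from statistics import fmean, median

def mean_sq_dev(batch):
    """Mean squared deviation from the mean (divisor n)."""
    m = fmean(batch)
    return fmean([(z - m) ** 2 for z in batch])

def mean_abs_dev(batch):
    """Mean absolute deviation from the mean."""
    m = fmean(batch)
    return fmean([abs(z - m) for z in batch])

def mad(batch):
    """Median absolute deviation from the median."""
    med = median(batch)
    return median([abs(z - med) for z in batch])

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
dirty = [1.0, 2.0, 3.0, 4.0, 500.0]         # same batch with one point replaced
for batch in (clean, dirty):
    print(mean_sq_dev(batch) ** 0.5, mean_abs_dev(batch), mad(batch))

# Shortcut formula: mean of squares minus squared mean
print(fmean([z * z for z in clean]) - fmean(clean) ** 2)
```

The standard deviation and the mean absolute deviation change drastically between the two batches, while the MAD is the same for both.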

1.3 MEASURES OF SHAPE

One measure of center and one measure of spread are often all one needs to concisely summarize the data. Just a pair of summary statistics, however, does not provide an accurate description of the data, in the sense that arbitrarily different batches of data may result in exactly the same description.

J.W. Tukey suggested the use of a box-plot, a graphical procedure that combines a measure of location (the median), a measure of spread (the IQR), shows the presence of possible outliers, and gives some indication about the shape of the distribution of the data in terms of their symmetry or skewness.

Construction of a box-plot proceeds as follows:

1. Horizontal lines are drawn at the median and at the upper and lower quartiles, and the quartile lines are joined by vertical lines to produce the box.

2. A vertical line is drawn up from the upper quartile to the most extreme data point that is within a distance of $1.5 \times \text{IQR}$ from the upper quartile. A similarly defined vertical line is drawn down from the lower quartile. Short horizontal lines are added to mark the ends of these vertical lines.

3. Each data point beyond the ends of the vertical lines is marked with an asterisk or a dot.

Symmetry or asymmetry is revealed by the location of the median relative to the upper and lower quartiles.
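The quantities needed to draw a box-plot by the three steps above can be computed as in the following sketch. It assumes the quartile definition given earlier in these notes (plotting libraries often use other quartile conventions), and `boxplot_stats` and the data are hypothetical.

```python
from statistics import median

def boxplot_stats(batch):
    """Median, quartiles, whisker ends and flagged points for a box-plot."""
    med = median(batch)
    q1 = median([z for z in batch if z <= med])
    q3 = median([z for z in batch if z >= med])
    iqr = q3 - q1
    low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [z for z in batch if low_limit <= z <= high_limit]
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),         # most extreme point within 1.5 * IQR of q1
        "whisker_high": max(inside),        # most extreme point within 1.5 * IQR of q3
        "outliers": sorted(z for z in batch if z < low_limit or z > high_limit),
    }

batch = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
print(boxplot_stats(batch))
```

In this example the median (5.5) sits midway between the quartiles (3 and 8), and the single point 30 lies beyond the upper whisker and would be drawn as an individual mark.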

If a large batch of data is available, one can study its shape in more detail. The main tool is the empirical distribution function (edf) $F_n(z)$, defined as the fraction of data points less than or equal to $z$.

Let $1\{A\}$ denote the indicator function of the event $A$, that is,

$$1\{A\} = \begin{cases} 1, & \text{if $A$ occurs,} \\ 0, & \text{otherwise.} \end{cases}$$

Then we can simply write

$$F_n(z) = n^{-1} \sum_{i=1}^{n} 1\{Z_i \le z\}.$$

Notice that $F_n$ is a non-decreasing step function, bounded between 0 and 1, with jumps of height $1/n$ at each distinct point $Z_i$. If a data value is repeated $m$ times, the jump is equal to $m/n$. The edf summarizes all the information contained in a batch of data, except the order in which the observations enter the batch. Notice that in some cases, such as time series, time order may be important.

There exists a simple relationship between the edf and the set of order statistics. By the definition of order statistic, the number of data points less than or equal to $Z_{(i)}$ is equal to $i$ (when there are no ties). Thus

$$F_n(Z_{(i)}) = \frac{i}{n},$$

and we say that $Z_{(i)}$ is the $i/n$-quantile of the empirical distribution of $Z$.
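The edf and its relation to the order statistics can be sketched in a few lines. The factory function `edf` and both batches are illustrative assumptions.

```python
def edf(batch):
    """Return F_n, the fraction of data points less than or equal to z."""
    n = len(batch)
    def F(z):
        return sum(1 for x in batch if x <= z) / n
    return F

# With a repeated value, the jump at that value is m/n (here 2/4 at z = 2).
F = edf([3.0, 1.0, 2.0, 2.0])
print(F(0.5), F(1.0), F(2.0), F(3.0))

# With distinct values, F_n(Z_(i)) = i/n at the i-th order statistic.
distinct = [4.0, 1.0, 3.0, 2.0]
F2 = edf(distinct)
print([F2(z) for z in sorted(distinct)])
```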

Sometimes it is useful to compare two edf's by means of Q-Q plots. In a Q-Q plot, the quantiles of one batch of data are plotted against those of another.

To interpret a Q-Q plot, the following result is useful. If $X$ and $Z$ are batches of data such that $X_i = a + bZ_i$, $0 < b < \infty$, then $X_{(i)} = a + bZ_{(i)}$. This implies that a Q-Q plot of $X$ against $Z$ is a straight line with slope equal to $b$ and intercept equal to $a$.
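The straight-line property of a Q-Q plot under a positive linear transformation can be verified directly on the order statistics. The values of `a`, `b` and the batch below are made up for illustration.

```python
z = [5.0, 1.0, 4.0, 2.0, 3.0]               # hypothetical batch Z
a, b = 10.0, 2.0                            # intercept and positive slope
x = [a + b * zi for zi in z]                # X_i = a + b * Z_i

# Points of the Q-Q plot: i-th order statistic of Z against i-th order statistic of X.
qq_points = list(zip(sorted(z), sorted(x)))
for z_q, x_q in qq_points:
    print(z_q, x_q, a + b * z_q)            # the last two columns coincide
```

Because $b > 0$ preserves the ordering, sorting the two batches separately pairs each $Z_{(i)}$ with its own transform, so every point lies on the line with intercept $a$ and slope $b$.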

Instead of working with the edf, it is sometimes convenient to work with an equivalent representation, namely the empirical survival function

$$S_n(z) = n^{-1} \sum_{i=1}^{n} 1\{Z_i > z\} = 1 - F_n(z).$$

This is just the fraction of data points greater than $z$. Clearly, $S_n(Z_{(i)}) = (n-i)/n$.

The empirical survival function is often used in the case of data on time until failure or death, such as individual life-times or unemployment duration data.

An alternative way of displaying the shape of a batch of data is by means of a histogram. To construct a histogram, partition the range of the data into intervals or bins of a certain (possibly unequal) bin width. A histogram is then obtained by plotting the fraction of observations in each bin divided by the bin width. Thus, if the bin width is constant and equal to $\delta$, the height $h_n(z; \delta)$ of a histogram at a point $z$ is equal to the number of data points in the bin containing $z$ divided by $n\delta$. Hence $\delta h_n(z; \delta)$ is just the relative frequency of data in the bin containing $z$.

If there are $m$ bins of equal size, then

$$\delta = \frac{Z_{(n)} - Z_{(1)}}{m}.$$

Viewed as a function, $h_n(\cdot\,; \delta)$ is non-negative and integrates to one, that is, $h_n(\cdot\,; \delta) \ge 0$ and $\int h_n(z; \delta)\, dz = 1$.
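The histogram heights for $m$ equal-width bins can be computed as in the sketch below. It assumes left-closed bins with the largest data point assigned to the last bin, and a batch that is not constant; the function name and the data are illustrative.

```python
def histogram_heights(batch, m):
    """Bin width delta and heights h_n for m equal-width bins over the data range."""
    lo, hi = min(batch), max(batch)
    delta = (hi - lo) / m                   # (Z_(n) - Z_(1)) / m, requires hi > lo
    counts = [0] * m
    for z in batch:
        j = min(int((z - lo) / delta), m - 1)   # the maximum goes into the last bin
        counts[j] += 1
    n = len(batch)
    return delta, [c / (n * delta) for c in counts]

batch = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
delta, heights = histogram_heights(batch, 4)
print(delta, heights)
print(sum(h * delta for h in heights))      # total area under the histogram
```

Each height times the bin width is the relative frequency of its bin, so the areas add up to one.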

A crucial problem in constructing a histogram is the choice of the number of bins. Too many bins (or, equivalently if the bin width is constant, too small a bin width) make a histogram look too ragged, while too few bins (too large a bin width) make it look oversmoothed.

If data are not uniformly distributed, it may be useful to let the bin width vary with the local density of the data. In this case, wider bins will be chosen where the data are sparser, and narrower bins where the data are denser.

REFERENCES

Hoaglin D.C., Mosteller F. and Tukey J.W. (1983) Understanding Robust and Exploratory Data Analysis, Wiley, New York.

Mosteller F. and Tukey J.W. (1977) Data Analysis and Regression: A Second Course in Statistics, Addison-Wesley, Reading, MA.

Tukey J. (1977) Exploratory Data Analysis, Addison-Wesley, Reading, MA.
