P1.2 Centers and spreads: Presentation - Cluster Analysis of Data Sizes and Multipl variability

Further summarization of the data leads to presenting a feature with just two numbers, one expressing the distribution’s location, its “central” or other impor- tant point, and the other representing the distribution’s dispersion, the spread. We review some most popular characteristics for both, the center, Table 1.1, and the spread, Table 1.2.

Worked example 1.1. Mean

For set X={1,1,5,3,4,1,2}, mean is c=(1+1+5+3+4+1+2)/7=17/7=2.42857…, or rounded up to two decimals, c=2.43.

This is as close an approximation to the numbers as one can get, which is good. A less satisfactory property is that the mean is not stable against outliers. For example, if X in Worked example 1.1 is supplemented with value 23, the mean becomes c=(17+23)/8=5, a

much greater number. This is why it is a good idea to remove some observations on both extremes of the data range, both the minimum and maximum, before computing the mean, which is utilized in the concept of trimmed mean in statistics.

Table 1.1. A review of location or central point concepts

# Name Explanation Comments

1 Mean The feature’s arithmetic

average 0. Minimizes the summary error squared 1. Estimates the distribution’s

expected value

2. Sensitive to outliers and distribution’s shape

2 Median The middle of the sorted list of feature values

1. Minimizes the summary absolute error

2. Estimates the distribution’s expected value

3. Not-sensitive to outliers 4. Sensitive to distribution’s shape 3 Mid-range Middle of the range 1. Minimizes the maximum abso-

lute error

2. Estimates the distribution’s expected value

3. Very sensitive to outliers 4. Not sensitive to distribution’s shape

4 P-quantile A value dividing the en- tire entity set in proportion P/(1-P) of feature values so that those with higher values constitute P proportion (upper P- quantile) or 1-P proportion (bottom P-quantile)

1. Not-sensitive to outliers 2. Sensitive to distribution’s shape

5 Mode A maximum of the histo- gram

1. Depends on the bin size 2. Can be multiple

Worked example 1.2. Median

To compute the median of the set from the previous example, X={1, 1, 5, 3, 4, 1, 2}, it must be sorted first: 1, 1, 1, 2, 3, 4, 5. The median is defined as the element in the middle, which is 2. This is rather far away from the mean, 2.43, which evidences that the distribution is biased towards the left end, the smaller entities. With the outlier 23 added, the sorted set becomes 1, 1, 1, 2, 3, 4, 5, 23, thus leading to two elements in the middle, 2 and 3. The median in this case is the average of the two, (2+3)/2=2.5, which is by far lesser change than the mean of the extended set, 5.

The more symmetric a distribution, the closer its mean and median to each other. Sepal width of Iris data set (Table 0.3) has mean=3.05 and median=3, quite close values. In contrast, in Market town data (Table 0.4), Population resident’s median, 5258, is predictably much less than the mean, 7351.4. The mean of a power law distribution is always biased towards the great values achieved by the few outliers; this is why it is a good idea to use the median as its central value. The median is very stable against outliers: the values on the extremes just do not affect the middle of the sorted set if added uniformly to both sides.

The midrange corresponds to the mean of a flat distribution, in which all bins are equally likely. In contrast to the mean and median, the midrange depends only on the range, not on the distribution. It is obviously highly sensitive to outliers, that is, changes of the maximum and/or minimum values of the sample.

The concept of p-quantile is an extension of the concept of median, which is a 50% quantile.

Worked example 1.3. P-quantile (percentile)

Take p=10% and determine the upper 10% quantile of Population resident feature. This should be 5th_{value in its descending order, that is, 18966. Why is the 5}th_{value? Because}

10% of the total number of entities, 45, is 4.5; therefore, the 5-th value leaves out p=10% of the largest towns in the sample. Similarly, the lower 10% quantile of the feature is 5th_value

in its ascending order, 2230. Worked example 1.4. Mode

According to the histograms in the bottom of Figure 1.1, it is the very first bin which is modal in the Population resident distribution. In the 5-bin setting, it takes one fifth of the feature range, 23801-2040=21761, that is, 4352. In the 10-bin setting, it is one tenth of the feature range, that is, 2176. In the latter case, the modal bin is interval [2040, 4216], and the modal bin is as twice wider, [2040, 6392], in the former case.

Each of the characteristics of spread In Table 1.2 parallels, to an extent, a location characteristic under the same number.

These measures intend to give an estimate of the extent of error in the corre- sponding centrality index. The standard deviation is the average quadratic error of the mean. Its use is related to the least-squares approach that currently prevails in data analysis and can be justified by good properties of the solutions, within the data analysis perspective, and properties of the normal distribution, within the probabilistic perspective. These paradigms are explained later in section 1.2.F.

# Name Explanation Comments 1 Standard

deviation The quadratic average de-viation from the mean 1. Minimized by the mean 2. Estimates the square root of the variance

2 Absolute deviation

The average absolute deviation from the median

Minimized by the median 3 Half-range The maximum deviation

from the midrange

Minimized by the mid-range

The absolute deviation expresses the average absolute deviation from the median. Usually, it is calculated regarding the mean, as the average error in representing the feature values by the mean. However, it is more related to the median, because it is the median that minimizes it.

The half-range expresses the maximum deviation from the mid-range; so they should be used on par, as it is done customarily by the research community in- volved in building classifying rules.

In document Cluster Analysis of Data Sizes and Multipl variability (Page 53-56)