Descriptive
Statistics
sample are referred to as statistics and are signified by Latin letters such as x̄ and s. Sometimes computation formulas for a parameter and the corresponding statistic are the same, as in the population and sample mean. However, sometimes they differ: the most famous example is that of the population and sample variance and standard deviation. Somewhat confusingly, because most statistical practice is concerned with inferential statistics, sometimes statistical formulas properly meant for samples are applied to populations (when the parameter formula should be used instead). When the formulas differ, both will be provided in this chapter.
Measures of Central Tendency
Measures of central tendency, also known as measures of location, are typically among the first statistics computed for the continuous variables in a new data set. The main purpose of computing measures of central tendency is to give you an idea of what is a typical or common value for a given variable. The three most common measures of central tendency are the arithmetic mean, median, and mode.
The Mean
The arithmetic mean, or simply the mean, is more commonly known as the average of a set of values. It is appropriate for interval and ratio data, and can also be used for dichotomous variables that are coded as 0 or 1. For continuous data, for instance measures of height or scores on an IQ test, the mean is simply calculated by adding up all the values and dividing by the number of values. The mean of a population is denoted by the Greek letter mu (μ), while the mean of a sample is typically denoted by a bar over the variable symbol: for instance, the mean of x would be designated x̄ and pronounced “x-bar.” The bar notation is sometimes adapted for the names of variables also: for instance, some authors denote “the mean of the variable age” by writing age with a bar over it, which would be pronounced “age-bar.”
For instance, if we have the following values of the variable x:
100, 115, 93, 102, 97
we calculate the mean by adding them up and dividing by 5 (the number of values):
x̄ = (100 + 115 + 93 + 102 + 97)/5 = 507/5 = 101.4
Statisticians often use a convention called summation notation, introduced in Chapter 1, which defines a statistic by expressing how it is calculated. The computation of the mean is the same whether the numbers are considered to represent a population or a sample: the only difference is the symbol for the mean itself. The mean of a data set, as expressed in summation notation, is:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
where x̄ is the mean of x, n is the number of cases, and xᵢ is a particular value of x. The Greek letter sigma (Σ) means summation (adding together), and the figures above and below the sigma define the range over which the operation should be performed. In this case the notation says to sum all the values of x from 1 to n. The symbol i designates the position in the data set, so x₁ is the first value in the data set, x₂ the second value, and xₙ the last value in the data set. The summation symbol means to add together or sum the values of x from the first (x₁) to the last (xₙ). The mean is therefore calculated by summing all the data in the data set, then dividing by the number of cases in the data set, which is the same thing as multiplying by 1/n.
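The summation formula translates almost directly into code. The following is a minimal Python sketch, not a definitive implementation: the function name `mean` and the variable names are illustrative choices, and the loop is written out explicitly only to mirror the index i running from 1 to n.

```python
def mean(values):
    """Arithmetic mean: sum x_1 through x_n, then multiply by 1/n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    total = 0.0
    for x_i in values:  # one pass over positions i = 1, ..., n
        total += x_i
    return total / n    # dividing by n is the same as multiplying by 1/n


# The five values of x used in the worked example above.
x = [100, 115, 93, 102, 97]
x_bar = mean(x)
print(x_bar)  # 101.4
```

In practice the built-in `sum(x) / len(x)` or `statistics.mean(x)` gives the same result; the explicit loop is only there to make the correspondence with the notation visible.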
The mean is an intuitively easy measure of central tendency to understand. If the numbers represented weights on a beam, the mean would be the point where the beam would balance perfectly. However, the mean is not an appropriate summary measure for every data set because it is sensitive to extreme values, also known as outliers (discussed further below), and may also be misleading for skewed (nonsymmetrical) data. For instance, if the last value in the data set were 297 instead of 97, the mean would be:
x̄ = (100 + 115 + 93 + 102 + 297)/5 = 707/5 = 141.4
This is not a typical value for this data: 80% of the data (the first four values) are below the mean, which is distorted by the presence of one extremely high value. A good practical example of when the mean is misleading as a measure of central tendency is household income data in the United States. A few very rich households make the mean household income a larger value than is truly representative of the average or typical household.
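The effect of a single extreme value can be checked numerically. The sketch below uses Python's standard `statistics.mean` on the two five-value data sets from the text; the variable names are illustrative only.

```python
from statistics import mean

original = [100, 115, 93, 102, 97]
with_outlier = [100, 115, 93, 102, 297]  # last value changed from 97 to 297

mean_original = mean(original)      # 507 / 5
mean_outlier = mean(with_outlier)   # 707 / 5

# Share of values in the second data set that lie below its own mean.
below = [v for v in with_outlier if v < mean_outlier]
share_below = len(below) / len(with_outlier)

print(mean_original, mean_outlier, share_below)
```

Changing one value out of five moves the mean by 40 points, and four of the five values (80%) end up below it.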
The mean can also be calculated using data from a frequency table, i.e., a table displaying data values and how often each occurs. Consider the following simple example in Table 4-1.
Table 4-1. Simple frequency table
Value    Frequency
1        7
2        5
3        12
4        2
To find the mean of these numbers, treat the frequency column as a weighting variable, i.e., multiply each value by its frequency. The mean is then calculated as:
$$\bar{x} = \frac{(1 \times 7) + (2 \times 5) + (3 \times 12) + (4 \times 2)}{7 + 5 + 12 + 2} = \frac{61}{26} \approx 2.35$$
This is the same result you would reach by adding together each individual score (1+1+1+1+...) and dividing by 26.
The mean for grouped data, in which data has been tabulated by range, is calculated in a similar manner. One additional step is necessary: the midpoint of each range must be calculated, and for the purposes of the calculation it is assumed that all data points in that range have the midpoint as their value. A mean calculated in this way is called a grouped mean. A grouped mean is not as precise as the mean calculated from the original data points, but it is often your only option if the original values are not available. Consider the following tiny grouped data set in Table 4-2.
The mean is calculated by multiplying the midpoint of each interval by its frequency, and dividing by the total frequency:
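Both the frequency-table mean and the grouped mean are weighted means, so one small helper can sketch both calculations in Python. The function name `weighted_mean` is an illustrative choice. The first example uses the values and frequencies of Table 4-1; the intervals and counts in the second example are hypothetical placeholders chosen for easy arithmetic and are not the contents of Table 4-2.

```python
def weighted_mean(values, frequencies):
    """Mean of a data set in which values[i] occurs frequencies[i] times."""
    if len(values) != len(frequencies):
        raise ValueError("values and frequencies must have the same length")
    total_frequency = sum(frequencies)
    if total_frequency == 0:
        raise ValueError("total frequency must be positive")
    weighted_sum = sum(v * f for v, f in zip(values, frequencies))
    return weighted_sum / total_frequency


# Frequency table from Table 4-1: values 1-4 with their frequencies.
table_values = [1, 2, 3, 4]
table_freqs = [7, 5, 12, 2]
freq_mean = weighted_mean(table_values, table_freqs)  # 61 / 26

# HYPOTHETICAL grouped data (not Table 4-2): (lower, upper) bounds and counts.
intervals = [(0, 10), (10, 20), (20, 30)]
counts = [3, 5, 2]
# Every point in a range is assumed to sit at the midpoint of that range.
midpoints = [(lower + upper) / 2 for lower, upper in intervals]
grouped_mean = weighted_mean(midpoints, counts)  # (15 + 75 + 50) / 10

print(round(freq_mean, 2), grouped_mean)
```

The only difference between the two cases is the extra step of replacing each range by its midpoint before weighting.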
One way to lessen the influence of outliers is by calculating a trimmed mean. As the name implies, a trimmed mean is calculated by trimming or discarding a certain percentage of the extreme values in a distribution, and calculating the mean of the remaining values. In the second distribution above, the trimmed mean (defined by discarding the highest and lowest values) would be:
x̄ = (100 + 115 + 102)/3 = 317/3 = 105.7
This is much closer to the typical values in the distribution than 141.4, the value of the mean of all the values. In a data set with many values, a percentage such as 10 percent or 20 percent of the highest and lowest values may be eliminated before calculating the trimmed mean.
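A trimmed mean can be sketched in a few lines of Python. The function below is one possible formulation rather than a standard definition: it assumes that the given proportion is removed from each end of the sorted data, and that the number of values dropped per end is rounded down. The name `trimmed_mean` is illustrative.

```python
def trimmed_mean(values, proportion):
    """Mean after discarding `proportion` of the values from EACH end.

    Assumption: the count dropped per end is rounded down, so small
    data sets may lose fewer values than the nominal proportion.
    """
    if not values:
        raise ValueError("the trimmed mean of an empty data set is undefined")
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be at least 0 and less than 0.5")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Second distribution from the text; 20% per end drops 93 and 297.
data = [100, 115, 93, 102, 297]
result = trimmed_mean(data, 0.2)
print(round(result, 1))  # 105.7
```

Libraries differ in how they interpret the trimming percentage (per end or in total), so it is worth checking the convention before comparing results.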
The mean can also be calculated for dichotomous data using 0–1 coding, in which case the mean is equivalent to the proportion of values coded 1. For instance, if we have 10 subjects, 6 males and 4 females, coded 1 for male and 0 for female, computing the mean will give us the proportion of males in the group:
x̄ = (1+1+1+1+1+1+0+0+0+0)/10 = 6/10 = 0.6 or 60% males
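The same calculation is easy to reproduce in Python. In this small sketch the list name `coded` and the 1 = male, 0 = female coding follow the example above; nothing else is assumed.

```python
# Ten subjects: 6 males coded 1, 4 females coded 0.
coded = [1] * 6 + [0] * 4

# With 0-1 coding, the sum counts the 1s, so the mean is their proportion.
proportion_male = sum(coded) / len(coded)
print(proportion_male)  # 0.6
```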