‘Lies, damned lies and statistics’, attributed to Benjamin Disraeli by Mark Twain.
Teaching and learning objectives:
1. To understand what is meant by descriptive statistics.
2. To learn the separate functions of the mean, mode and median, range, interquartile range, variance and standard deviation.
3. To learn how to calculate these using MS Excel.
Introduction
Descriptive statistics resolve complexity by summarising and compressing data to identify their essential characteristics, giving the observer a brief but relatively accurate impression. You already do this in everyday life when you describe, say, the total number of your fellow students, their range of ages, their average age, the male/female split and the ethnic composition.
Consider this example. In reply to a question from your parents, you tell them that your university class consists of about 20 students, half are women and six come from the Indian sub-continent. You add that most students are about the same age.
(Twenty years old but there are three older students.) The oldest is 48. This creates a reasonable picture in their minds. You use this rough description or numeric shorthand to avoid having to describe each student individually.
However, more accurate statistics may be required – say, by the registrar’s staff – giving the precise average age. The ages of the class are (in ascending order):
18,18,18,18,19,19,19,19,19,19,19,19,19,19,19,19,20,20,20,36,45,48.
184 Research Methods in Politics
There are actually 22 students. To calculate the average, you must, of course, add all the ages together and divide the sum by the number of students. In this case, the total ages are 489. There are 22 students. So the average is 489/22 = 22.23 years.
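The same arithmetic can be checked in a few lines of code. The chapter itself works in MS Excel; the Python below is only an illustrative sketch, and the variable names are assumptions made for this example.

```python
# Ages of the 22 students, as listed in the text (ascending order).
ages = [18, 18, 18, 18,
        19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19,
        20, 20, 20,
        36, 45, 48]

total = sum(ages)        # sum of all the ages
count = len(ages)        # number of students
average = total / count  # arithmetic mean

print(total, count, round(average, 2))  # 489 22 22.23
```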
So what? Well, in this example, none of the students is aged 22, so the average is misleading. The reason for this apparent discrepancy is that the class contains three much older students. This is more obvious when the students and their ages are individually plotted on a type of graph or chart termed a scattergram:
In this simple example, the average has been distorted by the inclusion of three much older students. But suppose that one of the 18-year-old students were to ‘drop out’ and an 85-year-old were admitted. In this case, the mean would increase to 25.3 and the graph would change to:
Figure 13.1a Scattergram of students and their ages (axes: individual students, ages)
Figure 13.1b Scattergram of students and their ages (revised class) (axes: individual students, ages)
Calculating and Interpreting Descriptive Statistics 185
In this second case, the calculation of the average has been grossly distorted by the replacement of a more typical student by a very much older one. This is called the outlier effect: the tendency of the highest or lowest terms to skew (distort) your everyday, ‘average’ mathematical description of complex information. (Incidentally, some quantitative researchers argue that qualitative researchers rely on outliers; the exceptional case is more attention-grabbing than more typical information.)
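The size of the outlier effect in the revised class can be sketched in code. This is an illustration in Python rather than the chapter's MS Excel; the helper name `mean` and the variable names are assumptions made for the example.

```python
# Ages of the initial class of 22 students, as listed in the text.
ages = [18, 18, 18, 18,
        19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19,
        20, 20, 20,
        36, 45, 48]

def mean(values):
    """Arithmetic mean: the sum of the terms divided by their number."""
    return sum(values) / len(values)

# Revised class: one 18-year-old drops out and an 85-year-old is admitted.
revised = ages.copy()
revised.remove(18)
revised.append(85)

original_mean = mean(ages)
revised_mean = mean(revised)

print(round(original_mean, 2))  # 22.23
print(round(revised_mean, 1))   # 25.3
```

Changing a single term out of 22 moves the mean by about three years, which is why one extreme value can dominate the description.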
To overcome this weakness, statisticians use a set of simple statistical measures to better describe, say, a student class. These are the arithmetic mean, median, mode, range, variance and standard deviation. The mean, median and mode are termed measures of central tendency; the range, variance and standard deviation are measures of dispersion. As the terms imply, they describe numerically where the ‘centre’ of the data lies and the extent to which the individual terms cluster around it. The class is termed the population: the total group of people or events being described or under study. The size of the population is represented by the Roman capital N. The individual measurements of data from the population are called terms. A group of terms which measure the same characteristic at the same moment in time is called a series.
The arithmetic mean is what is generally called the average, i.e. the sum of the terms divided by the number of terms. In statistics, the terms are called X and the number of terms N. ‘Sum of’ is represented by the Greek symbol $\Sigma$ (pronounced sigma). The arithmetic mean is called X bar and represented nowadays by the symbol $\bar{X}$. So $\bar{X} = \frac{\sum X}{N}$
The median is the middle term when the series is ranked (normally in ascending order), e.g. 4,10,2,8,6 becomes ranked into 2,4,6,8,10. The median is the (N+1)/2th term. So, the series 2,4,6,8,10 has five terms. Therefore the median is the (5+1)/2th term = 3rd term. The third term in the series is 6. If the series were reduced to four terms – 2,4,6,8 – then the median would be the (4+1)/2th term = 2.5th term. In this case, the median is calculated as lying midway between the second term (4) and the third term (6). So the median is 5.
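The (N+1)/2 rule translates directly into code. The function below is a minimal Python sketch, not the chapter's Excel method; the name `median_by_rank` is hypothetical, and it assumes a non-empty series of numbers.

```python
def median_by_rank(values):
    """Median found with the (N + 1) / 2 rank rule described in the text."""
    if not values:
        raise ValueError("the series must contain at least one term")
    ranked = sorted(values)      # rank the series in ascending order
    n = len(ranked)
    position = (n + 1) / 2       # 1-based rank; ends in .5 when N is even
    lower = int(position)        # whole-number part of the rank
    if position == lower:        # odd N: a single middle term
        return ranked[lower - 1]
    # even N: midway between the two neighbouring terms
    return (ranked[lower - 1] + ranked[lower]) / 2

print(median_by_rank([4, 10, 2, 8, 6]))  # 6
print(median_by_rank([2, 4, 6, 8]))      # 5.0
```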
The mode is the most frequent term in the series, e.g. in the series 1,3,6,5,1,2 the mode is 1. The mode is rarely used.
Returning to the example of the initial class of 22 students, whilst the arithmetic mean is 22.23, the median is the (22+1)/2th term, i.e. the 11.5th term or half-way between the 11th and 12th terms. In this case, the 11th and 12th terms are both 19 so the median is 19. The mode (the most common term) is also 19.
So, in this example, the median and the mode provide better descriptions than the mean (of 22.23). In the revised class, the median and mode remain 19 and the effect of the (85-year-old) outlier is effectively discounted. In cases like these, where most terms lie below the mean, the data – or distribution – is termed positively skewed. In many universities, the final degree classification uses the median exam mark rather than the average mark. In this way, exceptionally good or bad exam marks are discounted.
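The comparison between the two classes can be reproduced with Python's standard `statistics` module. This is a sketch offered as an alternative to the chapter's MS Excel; the variable names are assumptions made for the example.

```python
import statistics

# Initial class, as listed in the text.
ages = [18, 18, 18, 18,
        19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19,
        20, 20, 20,
        36, 45, 48]

# Revised class: drop one 18-year-old, admit an 85-year-old.
revised = ages[1:] + [85]

for label, data in (("initial", ages), ("revised", revised)):
    print(label,
          round(statistics.mean(data), 2),  # pulled upwards by the outliers
          statistics.median(data),          # unchanged by the 85-year-old
          statistics.mode(data))            # unchanged by the 85-year-old
# initial 22.23 19.0 19
# revised 25.27 19.0 19
```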
Another statistical measure of the data is the range – the difference between the highest and lowest term. However, as noted earlier, the range can be distorted by exceptionally high or low outliers, e.g. the very mature student. In the revised class, the range would be 67 (i.e. 85 − 18). So statisticians developed the interquartile range. This is the difference between the first and third quartiles when the terms of the series are placed in ranked order. So in a series of 99 terms, the interquartile range is the difference between the 25th and 75th terms. The first and third quartiles are identified in a similar way to the median by calculating
$\frac{N + 1}{4}$ and $\frac{3(N + 1)}{4}$
For example, in our initial class of 22 students ranked by age, the interquartile range is the difference between the ages of the 5.75th and 17.25th students, i.e. 20 − 19 = 1.
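The quartile positions can be computed as sketched below in Python. The text does not say how to handle a fractional rank such as 5.75; this sketch assumes linear interpolation between the two neighbouring terms, by analogy with the median. The function names are hypothetical, and the series is assumed to have at least three terms.

```python
def value_at_rank(ranked, position):
    """Value at a 1-based, possibly fractional, rank in a ranked series.

    Assumption: fractional ranks are interpolated linearly between neighbours.
    """
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ranked[lower - 1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])

def interquartile_range(values):
    """Return (first quartile, third quartile, interquartile range)."""
    if len(values) < 3:
        raise ValueError("this sketch assumes at least three terms")
    ranked = sorted(values)
    n = len(ranked)
    q1 = value_at_rank(ranked, (n + 1) / 4)      # (N + 1) / 4
    q3 = value_at_rank(ranked, 3 * (n + 1) / 4)  # 3(N + 1) / 4
    return q1, q3, q3 - q1

ages = [18, 18, 18, 18,
        19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19,
        20, 20, 20,
        36, 45, 48]

print(interquartile_range(ages))  # (19.0, 20.0, 1.0)
```

Replacing an 18-year-old with the 85-year-old leaves this result at 1, whereas the range jumps to 67.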
The greatest use of median measurements probably lies in representing unequal distributions, especially of resources. Income is a prime example, where a small number of people may have vast wealth and, at the other end of the scale, a large number virtually nothing. In these cases, the inequality is shown by calculating and comparing the incomes at the 10th and 90th ‘percentiles’. So, in a fair society where incomes are equal, the 10th and 90th percentiles will be the same whereas, in less equal countries, the ratio of the 90th to the 10th percentile may be as high as 100. A more sophisticated descriptive statistic for measuring unequal distribution of income and wealth is provided by the Gini coefficient. This is the ratio of the area between the line of equal distribution and the Lorenz curve (of cumulative incomes) to the whole area under the line of equal distribution. The coefficient varies between less than 0.25 (Greenland) and 0.6 (Namibia). No coefficients are calculable for Sub-Saharan Africa. The Gini coefficient for the UK is 0.35–0.39.
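The area-ratio definition of the Gini coefficient can be illustrated with a short Python sketch. It approximates the area under the Lorenz curve with trapezoids, one per person. The function name `gini` and both income lists are hypothetical and chosen only to show the two extremes; none of the figures are real data.

```python
def gini(incomes):
    """Gini coefficient from individual incomes, via the Lorenz curve."""
    ranked = sorted(incomes)          # poorest first
    n = len(ranked)
    total = sum(ranked)
    if n == 0 or total <= 0:
        raise ValueError("need at least one income and a positive total")

    cumulative_share = 0.0
    area_under_lorenz = 0.0
    for income in ranked:
        previous_share = cumulative_share
        cumulative_share += income / total
        # Trapezoid of width 1/n between successive points on the curve.
        area_under_lorenz += (previous_share + cumulative_share) / (2 * n)

    area_under_equality = 0.5         # triangle under the line of equality
    return (area_under_equality - area_under_lorenz) / area_under_equality

equal_incomes = [20, 20, 20, 20]      # hypothetical: everyone earns the same
unequal_incomes = [0, 0, 0, 100]      # hypothetical: one person earns everything

print(gini(equal_incomes))    # 0.0
print(gini(unequal_incomes))  # 0.75
```

With only four people, the most unequal case gives 0.75 rather than 1; the discrete coefficient approaches 1 as the population grows.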
In the UK, the official (upper) level of poverty is calculated as 60% of median household income. ‘Deep poverty’ is calculated as 40% of the median household income. The definitions reflect a rejection of the historic concept of poverty being calculable as an absolute level in favour of relative measures.
Lorenz curve for a typical
Figure 13.2 Charts of income distribution (from www.statistics.gov.uk)
However, the median itself can be misleading. Take, for example, data obtained from applying the Likert scale to the question:
Q. To what extent do you agree with the following statement:
The Anglo-American invasion of Iraq has actually increased the threat of terrorism in both countries. Do you strongly agree, agree, neither agree nor disagree, disagree, or strongly disagree?
You code the possible answers 1–5 (where 1 is strongly agree, etc.). Let us say that you ask three samples of nine people from three different age groups (in practice, you would use much larger samples). The results (ranked in descending order) are:
Young people         5  5  5  4  3  2  1  1  1
Middle-aged people   4  4  3  3  3  2  2  1  1
Older people         4  3  3  3  3  3  3  3  2
In each of these three groups, the median is 3 (neither agree nor disagree) although the intensity of agreement or disagreement is significantly higher among the sample of younger people. To overcome this problem, mathematicians developed the variance.
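A quick check in Python confirms that the median cannot tell the three groups apart. The count of extreme answers (1 or 5) is not a measure used in the text; it is added here only as a rough, assumed indicator of intensity, and the variable names are hypothetical.

```python
import statistics

# Likert scores for each sample, as given in the text.
responses = {
    "Young people":       [5, 5, 5, 4, 3, 2, 1, 1, 1],
    "Middle-aged people": [4, 4, 3, 3, 3, 2, 2, 1, 1],
    "Older people":       [4, 3, 3, 3, 3, 3, 3, 3, 2],
}

medians = {}
extremes = {}
for group, scores in responses.items():
    medians[group] = statistics.median(scores)
    # Illustrative only: how many answers were 'strongly' one way or the other.
    extremes[group] = sum(1 for score in scores if score in (1, 5))
    print(group, medians[group], extremes[group])
# Young people 3 6
# Middle-aged people 3 2
# Older people 3 0
```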
The variance
The variance is a descriptive statistic which measures the degree of concentration or dispersal of the terms around the mean. This is found by calculating the difference between each term and the mean, i.e. $X_i - \bar{X}$. In some cases, the difference will be +, in others, –. (Indeed, if added together they should cancel each other out, leaving an answer of 0.) So, to overcome this difficulty, the differences are squared – thus eliminating the minus quantities. The variance is then found by adding all these squared differences and dividing by the number of terms. The formula for the variance is:
$$\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$$
where $X_i$ is the $i$th term and $N$ is the number of terms in the series.
This sounds more complicated than it is. Consider the example above of the Likert scores for the sample of nine younger people:
Young people, $X$                        5      5      5      4      3      2      1      1      1
Arithmetic mean, $\bar{X}$               3      3      3      3      3      3      3      3      3
Difference between term and mean       (5−3)  (5−3)  (5−3)  (4−3)  (3−3)  (2−3)  (1−3)  (1−3)  (1−3)
$(X - \bar{X})$ =                       +2     +2     +2     +1      0     −1     −2     −2     −2
Squared difference, $(X - \bar{X})^2$ =  4      4      4      1      0      1      4      4      4
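The rows of this table can be reproduced step by step in code. The Python below is a sketch rather than the chapter's MS Excel procedure; it follows the formula above, dividing by the number of terms N, and uses `statistics.pvariance` from the standard library as a cross-check. The variable names are assumptions made for the example.

```python
import statistics

young = [5, 5, 5, 4, 3, 2, 1, 1, 1]   # Likert scores, X

n = len(young)                         # number of terms, N
mean = sum(young) / n                  # arithmetic mean, X bar

differences = [x - mean for x in young]   # X - X bar
squared = [d ** 2 for d in differences]   # (X - X bar) squared

variance = sum(squared) / n            # sum of squared differences / N

print(mean)                # 3.0
print(sum(differences))    # 0.0, the + and - differences cancel out
print(sum(squared))        # 26.0
print(round(variance, 2))  # 2.89
```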