
The arithmetic mean (or average), $\bar{x}$, lies at the centre of a set of numeric data values. It is found by adding up all the data values and then dividing the total by the sample size, as shown in the following formula:

$$\bar{x} = \frac{\text{Sum of all the observations}}{\text{Number of observations}} = \frac{\sum x_i}{n} \tag{3.1}$$

Where:

$\bar{x}$ = the sample arithmetic mean (average)

$n$ = the number of data values in the sample

$x_i$ = the value of the $i$th data value of random variable $x$

$\sum x_i$ = the sum of the $n$ data values, i.e. $x_1 + x_2 + x_3 + x_4 + \dots + x_n$

Example 3.1 Financial Advisors’ Training Study

The number of seminar training days attended last year by 20 financial advisors is shown in Table 3.1. What is the average number of training days attended by these financial advisors? (See Excel file C.3.1 – financial training.)

Table 3.1 Financial advisors’ training days (n = 20)

16 20 13 19 24 22 18 18 15 20

21 21 18 20 18 20 15 20 18 20

Solution

To find the average, sum the number of days for all 20 financial advisors ($\sum x_i = 376$) and divide this total by the number of financial advisors ($n = 20$).

$$\bar{x} = \frac{376}{20} = 18.8 \text{ days}$$

On average, each financial advisor attended 18.8 days of seminar training last year.
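The calculation in Example 3.1 can be checked with a short Python sketch. The function name `arithmetic_mean` and the variable names are illustrative choices, not part of the original example; the data values are those of Table 3.1.

```python
# Training days for the 20 financial advisors in Table 3.1.
training_days = [
    16, 20, 13, 19, 24, 22, 18, 18, 15, 20,
    21, 21, 18, 20, 18, 20, 15, 20, 18, 20,
]


def arithmetic_mean(values):
    """Formula 3.1: sum of the observations divided by their number."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


total = sum(training_days)   # sum of x_i
n = len(training_days)       # sample size
mean_days = arithmetic_mean(training_days)

print(f"sum = {total}, n = {n}, mean = {mean_days:.1f} days")
# sum = 376, n = 20, mean = 18.8 days
```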

The arithmetic mean has the following two advantages.

It uses all the data values in its calculation.

It is an unbiased statistic (meaning that, on average, it represents the true population mean).

These two properties make it the most widely used measure of central location.

The arithmetic mean, however, has two drawbacks.

It is not appropriate for categorical (i.e. nominal or ordinal-scaled) data.

For example, it is not meaningful to refer to the ‘average’ colour of cars, the ‘average’ preferred brand or the ‘average’ gender. The arithmetic mean can only be applied to numeric (i.e. interval and ratio-scaled) data.

It is distorted by outliers. An outlier is an extreme value in a data set.

For example, the mean of 3, 4, 6 and 7 is 5. However, the mean of 3, 4, 6 and 39 is 13, which is not representative of the majority of the data values.
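The effect of the outlier can be reproduced with a minimal sketch. The helper `arithmetic_mean` and the variable names are assumptions made for illustration; the two data sets are the ones quoted above.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


typical = [3, 4, 6, 7]
with_outlier = [3, 4, 6, 39]   # 39 is the extreme value

mean_typical = arithmetic_mean(typical)            # 20 / 4
mean_with_outlier = arithmetic_mean(with_outlier)  # 52 / 4

# Values that lie below the distorted mean.
values_below = [v for v in with_outlier if v < mean_with_outlier]

print(mean_typical, mean_with_outlier, values_below)
# 5.0 13.0 [3, 4, 6]
```

Three of the four values lie below the mean of 13, which illustrates why the mean is not representative here.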

These two drawbacks require that other measures of central location be considered. Two alternative central location measures are the median and the mode.

Median

The median (Me) is the middle number of an ordered set of data. It divides an ordered set of data values into two equal halves (i.e. 50% of the data values lie below the median and 50% lie above it).

Follow these steps to calculate the median for ungrouped (raw) numeric data:

Arrange the n data values in ascending order.

Find the median by first identifying the middle position in the data set as follows:

– If $n$ is odd, the median value lies in the $\left(\frac{n + 1}{2}\right)$th position in the data set.

– If $n$ is even, the median value is found by identifying the $\left(\frac{n}{2}\right)$th position and then averaging the data value in this position with the next consecutive data value.

Example 3.2(a) Monthly Car Sales (n = 9 months)

Find the median number of cars sold per month by a dealer over the past nine months based on the following monthly sales figures:

27 38 12 34 42 40 24 40 23

Solution

Order the data set:

12 23 24 27 34 38 40 40 42

Since $n = 9$ (i.e. $n$ is odd), the median position is the $\frac{9 + 1}{2}$ = 5th position. The median value therefore lies in the 5th data position. Thus the median monthly car sales is 34 cars. This means that there were four months when car sales were below 34 cars per month and four months when car sales were above 34 cars per month (not necessarily consecutive months).

Example 3.2(b) Monthly Car Sales (n = 10 months)

Find the median number of cars sold per month by a dealer over the past ten months based on the following monthly sales figures:

27 38 12 34 42 40 24 40 23 18

Solution

Order the data set:

12 18 23 24 27 34 38 40 40 42

Since $n = 10$ (i.e. $n$ is even), the data value in the $\frac{10}{2}$ = 5th position is 27.

Average the 5th and 6th position values (i.e. $\frac{27 + 34}{2} = 30.5$) to give the median value.

Thus the median monthly car sales is 30.5 cars. This means that there were five months when car sales were below 30.5 cars per month and five months when car sales were above 30.5 cars per month.
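The two-step procedure for ungrouped data, applied to Examples 3.2(a) and 3.2(b), can be sketched in Python as follows. The function name `median_ungrouped` is an illustrative assumption; positions are 1-based in the text, so the code subtracts 1 when indexing the ordered list.

```python
def median_ungrouped(values):
    """Median of raw numeric data using the ordered-position method."""
    ordered = sorted(values)          # step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2       # odd n: the ((n + 1) / 2)th position
        return ordered[position - 1]
    position = n // 2                 # even n: the (n / 2)th position ...
    # ... averaged with the next consecutive data value
    return (ordered[position - 1] + ordered[position]) / 2


sales_9_months = [27, 38, 12, 34, 42, 40, 24, 40, 23]
sales_10_months = [27, 38, 12, 34, 42, 40, 24, 40, 23, 18]

print(median_ungrouped(sales_9_months))    # 34
print(median_ungrouped(sales_10_months))   # 30.5
```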

To calculate the median for grouped numeric data

Use these methods when the data is already summarised into a numeric frequency distribution (or ogive).

Graphical approach

Using the ‘less than’ ogive graph (i.e. cumulative frequency polygon), the median value is found by reading off the data value on the x-axis that is associated with the 50% cumulative frequency located on the y-axis.

Arithmetic approach

– Based on the sample size, $n$, calculate $\frac{n}{2}$ to find the median position.

– Using the cumulative frequency counts of the ‘less than’ ogive summary table, find the median interval (i.e. the interval that contains the median position, the $\left(\frac{n}{2}\right)$th data value).

– The median value can be approximated using the midpoint of the median interval, or calculated using the following formula to give a more representative median value:

$$M_e = O_{me} + \frac{c\left[\frac{n}{2} - f(<)\right]}{f_{me}} \tag{3.2}$$

Where:

$O_{me}$ = lower limit of the median interval

$c$ = class width

$n$ = sample size (number of observations)

$f_{me}$ = frequency count of the median interval

$f(<)$ = cumulative frequency count of all intervals before the median interval

The formula takes into account ‘how far’ into the median interval the median value lies.

Example 3.3 Courier Delivery Times Study

A courier company recorded 30 delivery times (in minutes) to deliver parcels to their clients from its depot. The data is summarised in the numeric frequency distribution and ogive as shown in Table 3.2.

Table 3.2 Numeric frequency distribution and ogive for courier delivery times (minutes)

Time (minutes) Frequency Cumulative frequency

10 – < 20 3 3

20 – < 30 5 8

30 – < 40 9 17

40 – < 50 7 24

50 – < 60 6 30

Total 30

Find the median delivery time of parcels to clients by this courier company.

Solution

Since $n = 30$, the median delivery time will lie in the $\frac{30}{2}$ = 15th ordered data position.

Using the ogive (of cumulative counts) in Table 3.2, the 15th data value falls in the 30 – < 40 minutes interval. This identifies the median interval. An approximate median delivery time for parcels is therefore 35 minutes (the interval midpoint). However, a more representative median value can be found by using Formula 3.2, where:

$O_{me}$ = 30 minutes, $c$ = 10 minutes, $f_{me}$ = 9 deliveries, $f(<)$ = 8 deliveries

$$M_e = 30 + \frac{10\left(\frac{30}{2} - 8\right)}{9} = 30 + 7.78 = 37.78 \text{ minutes}$$

Thus the median parcel delivery time is 37.78 minutes. This means that half the deliveries occurred within 37.78 minutes while the other half took longer than 37.78 minutes.
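The arithmetic approach of Formula 3.2 can be sketched in Python using the Table 3.2 frequencies. The function name `grouped_median` and its parameters are hypothetical, and the sketch assumes that all intervals share the same class width, as they do in this example.

```python
def grouped_median(lower_limits, width, frequencies):
    """Approximate the median of grouped data with Formula 3.2.

    Assumes equal class widths and intervals listed in ascending order.
    """
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the median of an empty distribution is undefined")
    median_position = n / 2
    cumulative_before = 0             # f(<): count before the current interval
    for lower, frequency in zip(lower_limits, frequencies):
        if cumulative_before + frequency >= median_position:
            # This is the median interval; interpolate within it.
            return lower + width * (median_position - cumulative_before) / frequency
        cumulative_before += frequency
    raise ValueError("no median interval found")


# Table 3.2: courier delivery times (minutes).
lower_limits = [10, 20, 30, 40, 50]
frequencies = [3, 5, 9, 7, 6]

median_time = grouped_median(lower_limits, width=10, frequencies=frequencies)
print(round(median_time, 2))   # 37.78
```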

The median has one major advantage over the mean – it is not affected by outliers. It is therefore a more representative measure of central location than the mean when significant outliers occur in a set of data.

A drawback of the median, however, is that it cannot be calculated for categorical data.

It makes no sense, for example, to refer to a ‘middle’ brand of fuel. Thus the median, like the mean, can only be applied to numeric data.

Mode

The mode (Mo) is defined as the most frequently occurring value in a set of data. It can be calculated both for categorical data and numeric data.

The following are illustrative statements that refer to the mode as the central location measure:

Colgate is the brand of toothpaste most preferred by households.

The most common family size is four.

The supermarket frequented most often in Kimberley is Checkers.

The majority of machine breakdowns last between 25 and 30 minutes.

Follow these steps to calculate the mode:

For small samples of ungrouped data, rank the data from lowest to highest, and identify, by inspection, the data value that occurs most frequently.

For large samples of discrete or categorical (nominal and ordinal-scaled) data:

– construct a categorical frequency table (see Chapter 2)

– identify the modal value or modal category that occurs most frequently.

For large samples of continuous, numeric (ratio-scaled) data:

– calculate a numeric frequency distribution (see Chapter 2)

– identify the modal interval as the interval with the highest frequency count

– use either the midpoint of the modal interval as an approximate modal value or apply the following formula to calculate a more representative modal value.

$$M_o = O_{mo} + \frac{c(f_m - f_{m-1})}{2f_m - f_{m-1} - f_{m+1}} \tag{3.3}$$

Where:

$O_{mo}$ = lower limit of the modal interval

$c$ = width of the modal interval

$f_m$ = frequency of the modal interval

$f_{m-1}$ = frequency of the interval preceding the modal interval

$f_{m+1}$ = frequency of the interval following the modal interval

The modal formula weights (‘pulls’) the modal value from the midpoint position towards the adjacent interval with the higher frequency count. If the interval to the left of the modal interval has a higher frequency count than the interval to the right of the modal interval, then the modal value is pulled down below the midpoint value, and vice versa.

Example 3.4 Courier Delivery Times Study

Refer to Example 3.3 for the problem description and Table 3.3 for the sample data of 30 delivery times that have been summarised into a numeric frequency distribution.

Find the modal delivery time of parcels to clients from the courier service’s depot (i.e. what is the most common courier delivery time?).

Table 3.3 Numeric frequency distribution of courier delivery times (minutes)

Time intervals Frequency count

10 – < 20 3

20 – < 30 5

30 – < 40 9

40 – < 50 7

50 – < 60 6

Total 30

Solution

From the numeric frequency distribution, the modal interval (interval with the highest frequency) is 30 – < 40 minutes. The midpoint of 35 minutes can be used as an approximate modal courier delivery time.

To calculate a more representative modal value, apply the modal Formula 3.3 with:

$O_{mo}$ = 30 minutes, $c$ = 10 minutes, $f_m$ = 9 deliveries, $f_{m-1}$ = 5 deliveries, $f_{m+1}$ = 7 deliveries

$$M_o = 30 + \frac{10(9 - 5)}{2(9) - 5 - 7} = 30 + 6.67 = 36.67 \text{ minutes}$$

Thus the most common courier delivery time from depot to customers is 36.67 minutes.

The mode has several advantages:

It is a valid measure of central location for all data types (i.e. categorical and numeric).

If the data type is categorical, the mode defines the most frequently occurring category.

If the data type is numeric, the mode is the most frequently occurring data value (or the midpoint value of a modal interval, if the numeric data has been grouped into intervals).

The mode is not influenced by outliers, as it represents the most frequently occurring data value (or response category).

The mode also has one main disadvantage:

It is a representative measure of central location only if the histogram of the numeric random variable is unimodal (i.e. has one peak only). If the shape is bi-modal (i.e. has two peaks), two possible modes exist, in which case there is no single representative mode.
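For ungrouped or categorical data, a frequency count makes it easy to see whether a single mode exists. The sketch below uses `collections.Counter` from the Python standard library. The function name `modes` is an illustrative choice, and the `preferred_brand` responses are invented for illustration; only the training-days data comes from Table 3.1.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency count."""
    if not values:
        raise ValueError("the mode of an empty data set is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Table 3.1: training days for 20 financial advisors.
training_days = [
    16, 20, 13, 19, 24, 22, 18, 18, 15, 20,
    21, 21, 18, 20, 18, 20, 15, 20, 18, 20,
]

# Hypothetical categorical responses, invented for illustration only.
preferred_brand = ["A", "B", "A", "C", "B"]

print(modes(training_days))     # [20]
print(modes(preferred_brand))   # ['A', 'B']
```

A result with one entry indicates a single mode (20 days occurs six times in Table 3.1). A result with two or more entries signals that no single representative mode exists.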