Further
Mathematics
Unit 3 Core:
Data Analysis
Chapter 1 & 2:
Displaying & Describing Data Distributions
Summarising Numerical Data
Name:_____________
Examples
Exercise 1A
Organising and displaying categorical data
Using a frequency table
The sex of 11 children is given below where F = female and M = male:
F M M F F M F F F M M
Complete the frequency table to display the information.
Using a bar chart to display categorical data
NOT HISTOGRAMS !!!
Used to display Categorical Data.
Bars are of equal width and are separated by small, equal spaces. They can be arranged either horizontally or vertically.
Construct a bar chart for the set of data on the sex of the 11 children.
Frequency table
Sex    | Count | Percentage (to 1 d.p.)
Male   |       |
Female |       |
TOTAL  |       |
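Once the table has been completed by hand, the counts and percentages could be checked with a short Python sketch such as the one below. The names `sexes` and `frequency_table` are illustrative only and are not part of the worksheet.

```python
from collections import Counter

# Data from the example: F = female, M = male
sexes = "F M M F F M F F F M M".split()

def frequency_table(values):
    """Return {category: (count, percentage to 1 d.p.)} for a list of categories."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}

table = frequency_table(sexes)
for category, (count, percent) in sorted(table.items()):
    print(f"{category}: count = {count}, percentage = {percent}%")
print("TOTAL:", len(sexes))
```

The same counts give the bar heights for the bar chart and the segment lengths for the segmented bar chart.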
Using a stacked or segmented bar chart to display categorical data
A segmented (divided) bar chart is a single bar used to represent all the data studied.
The information is usually presented as percentages, so the total bar length represents 100% of the data.
Construct a segmented bar chart for the set of data on the sex of 11 children given on the previous page.
Reporting on categorical data
The mode is the statistic that describes the most dominant (most frequently occurring) group in the distribution.
Example:
The table shows the frequency distribution of school type for a group of students.
(a) Complete the table by filling in the missing values.
(b) Which is the modal category? __________________________________
Answering Statistical Questions: Categorical Variables
Briefly summarise the context in which the data were collected including the number of individuals involved in the study.
If there is a clear modal category, ensure that it is mentioned.
Include frequencies or percentages in the report. Percentages are preferred.
If there are a lot of categories, it is not necessary to mention every category, but the modal category should always be mentioned.
Example
______ schools were involved in the study. The schools were classified as being, ‘Catholic’,_________,
__________ or _____________. The majority of the schools, _____ %, were found to be ___________.
Of the remaining schools, _____% were found to be __________, while __________% were found to be _________.
Frequency table
School type | Count | Percentage
Catholic    | 4     | 20
Government  | 11    |
Independent | 5     |
TOTAL       |       |
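The missing percentages and the modal category can be checked from the counts alone. The sketch below assumes the three counts shown in the table are the complete data set; the variable names are illustrative.

```python
# Counts taken from the school-type table
school_counts = {"Catholic": 4, "Government": 11, "Independent": 5}

total = sum(school_counts.values())

# Percentage = count / total x 100
percentages = {school: 100 * n / total for school, n in school_counts.items()}

# The modal category is the one with the largest count
modal_category = max(school_counts, key=school_counts.get)

print("Total:", total)
for school, pct in percentages.items():
    print(f"{school}: {pct:.0f}%")
print("Modal category:", modal_category)
```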
Organising and displaying numerical data: The Histogram
Using a histogram to display numerical data
In a histogram,
The frequency is shown on the vertical axis
The values of the variable are plotted on the horizontal axis
For continuous numerical data, each bar corresponds to an interval
For discrete numerical data, the intervals start and end half-way between values
Empty classes or missing discrete values have bars of zero height
Example
Weight (in kg) Frequency
48 3
49 5
50 7
51 0
52 1
Number of pets Frequency
0 4
1 0
2 8
3 3
4 1
Constructing a histogram using CAS
Example 1:
The following are the marks obtained by a group of 27 students in a short Mathematics test:
16 11 4 25 15 7 14 13 14 12 15 13 16 14
15 12 17 18 22 18 15 13 17 18 22 23 18
Display the data using a histogram.
Enter the data into a data list named marks
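The CAS builds the histogram by counting how many marks fall in each interval. A rough Python sketch of that counting step is given below. The interval width of 5 starting at 0 is an assumption made for illustration (the CAS may choose different settings), and `bin_counts` is a hypothetical helper name.

```python
marks = [16, 11, 4, 25, 15, 7, 14, 13, 14, 12, 15, 13, 16, 14,
         15, 12, 17, 18, 22, 18, 15, 13, 17, 18, 22, 23, 18]

def bin_counts(data, start, width):
    """Count the values in the intervals [start, start + width), [start + width, ...), etc."""
    counts = {}
    for x in data:
        lower = start + width * ((x - start) // width)  # lower edge of x's interval
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

counts = bin_counts(marks, start=0, width=5)

# Text version of the histogram: one '#' per student
for lower, n in counts.items():
    print(f"{lower:>2} to <{lower + 5:<2} | {'#' * n}")
```

Changing `width` shows how the choice of interval affects the apparent shape of the histogram.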
Describing a histogram
The description is focused on three points:
Shape and outliers
Centre
Spread
[Figures: Histogram A, Histogram B, Histogram C; annotations: Bimodal, Outliers]
The distribution of a numerical variable can be described in terms of:
shape: symmetric or skewed (positive or negative)
outliers: values that appear to stand out
centre: the midpoint of the distribution (median)
spread: one measure is the range of values covered (range = largest value – smallest value).
Example 1:
The histogram shows the distribution of waiting times (in minutes) at a doctor’s surgery. Complete the report below describing the distribution of waiting times in terms of shape and any outliers that may be present, centre and spread.
The distribution is ___________________________________ with a possible outlier
in the interval ____________________________ minutes.
The median of the distribution lies in the interval __________________________ minutes.
The range of the distribution is __________________________________ minutes.
[Figure: histogram of waiting times (minutes)]
Exercise 1C
Using log scale to display data
Consider the following numbers:
$10^0$, $10^1$, $10^2$, $10^3$, $10^4$, $10^{-1}$, $10^{-2}$, $10^{-3}$, $10^{-4}$
log(1) log(10) log(100) log(1000) log(10000)
log(0.1) log(0.01) log(0.001) log(0.0001)
A log scale replaces each value with its power of 10, so data that range from very small decimals to very large numbers can be displayed on a single usable scale.
Example
The set of numbers 0.01, 0.1, 1, 10, 100, 1000, 10 000, 100 000, 1 000 000 ranges from 0.01 to 1 million.
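Taking the base-10 log of each of these numbers shows why the log scale is easier to plot. The sketch below uses Python's `math.log10`; the rounding only tidies floating-point output.

```python
import math

values = [0.01, 0.1, 1, 10, 100, 1000, 10_000, 100_000, 1_000_000]

# log10 of a power of 10 is just the power
logs = [round(math.log10(v), 10) for v in values]

for v, lg in zip(values, logs):
    print(f"log({v}) = {lg:g}")
```

The original values span eight orders of magnitude, while their logs are evenly spaced from -2 to 6.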
Dot plot Using log Scale
Example
The histogram below displays the body weights (in kg) of a number of animal species. Because the animals represented in this dataset have weights ranging from around 1 kg to 90 tonnes (a dinosaur), most of the data are bunched up at one end of the scale and much detail is missing.
The distribution of weights is highly positively skewed, with an outlier.
Using log scale, their weights are much more evenly spread along the scale. The distribution is now approximately symmetric, with no outliers, and the histogram is considerably more informative.
We can now see that the percentage of animals with weights between 10 and 100 kg is similar to the percentage of animals with weights between 100 and 1000 kg.
Example
The histogram shows the distribution of the weights of 27 animal species plotted on a log scale.
a) What body weight (in kg) is represented by the number 4 on the log scale?
b) How many of these animals have body weights more than 10 000 kg?
c) The weight of a cat is 3.3 kg. Use your calculator to determine the log of its weight correct to two significant figures.
d) Determine the weight (in kg) whose log weight is 3.4 (the elephant). Write your answer correct to the nearest whole number.
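Parts (a), (c) and (d) all move between a weight and its log. A minimal sketch of the two conversions, assuming base-10 logs as used throughout this section (the function names are illustrative):

```python
import math

def weight_from_log(log_weight):
    """Undo the log scale: a log value of k corresponds to 10**k kg."""
    return 10 ** log_weight

def log_of_weight(weight_kg):
    """Convert a weight in kg to its position on the log scale."""
    return math.log10(weight_kg)

print(weight_from_log(4))               # part (a)
print(f"{log_of_weight(3.3):.2g}")      # part (c), two significant figures
print(round(weight_from_log(3.4)))      # part (d), nearest whole number
```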
Constructing a histogram with a log scale using CAS
Example
The weights of 27 animal species (in kg) are recorded below.
1.4 470 36 28 1.0 12 000 2600 190 520 10 3.3 530 210 62 6700 9400 6.8 35 0.12 0.023 2.5 56 100 52 87 000 0.12 190
Construct a histogram to display the distribution:
a) of the body weights of these 27 animals and describe its shape
b) of the log body weights of these animals and describe its shape.
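As a cross-check on the CAS histogram for part (b), the log weights can be computed and grouped in Python. The sketch assumes intervals of width 1 on the log scale (so each bar covers one power of 10); that choice, and the variable names, are illustrative only.

```python
import math
from collections import Counter

weights_kg = [1.4, 470, 36, 28, 1.0, 12_000, 2600, 190, 520, 10, 3.3, 530,
              210, 62, 6700, 9400, 6.8, 35, 0.12, 0.023, 2.5, 56, 100, 52,
              87_000, 0.12, 190]

log_weights = [math.log10(w) for w in weights_kg]

# Group the log weights into intervals [-2, -1), [-1, 0), ..., [4, 5)
log_bins = Counter(math.floor(lw) for lw in log_weights)

for lower in sorted(log_bins):
    print(f"{lower:>2} to <{lower + 1:<2} | {'#' * log_bins[lower]}")
```

On the raw scale nearly all 27 values would share the first bar; on the log scale they are spread over seven intervals.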
Log Scale
Exercise 1D Review Exercise
Chapter 2: Summarising Numerical Data
SUMMARY STATISTICS
Summary statistics are numbers used to describe the overall essential features of a distribution.
Two essential features are:
The typical value (The centre)
_____________________________________________________________
_____________________________________________________________
_____________________________________________________________
The spread
_____________________________________________________________
_____________________________________________________________
_____________________________________________________________
Example:
Given below is an ordered set of 10 daily maximum temperatures (in degrees Celsius) recorded in November:
18 18 19 21 24 26 26 27 29 33
Find the following summary statistics.
(a) Range _______________________________________________________________
(b) Median _______________________________________________________________
(c) Mean _______________________________________________________________
_______________________________________________________________
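The three summary statistics for the temperature data can be checked with Python's standard `statistics` module. This is only a verification sketch; the working should still be shown by hand.

```python
import statistics

temps = [18, 18, 19, 21, 24, 26, 26, 27, 29, 33]

data_range = max(temps) - min(temps)   # largest value minus smallest value
median = statistics.median(temps)      # n is even, so the mean of the two middle values
mean = statistics.mean(temps)          # sum of the values divided by n

print("Range:", data_range)
print("Median:", median)
print("Mean:", mean)
```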
Dot Plot
simplest way to display small sets of numerical data
suitable for displaying discrete numerical data
consists of a number line with each data point marked by a dot
When several data points have the same value, the points are stacked on top of each other.
Construct a dot plot for the data above.
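A dot plot is simple enough to sketch in plain text. The helper below (`dot_plot_rows`, a hypothetical name) stacks one `o` per data point above a number line and assumes whole-number data.

```python
from collections import Counter

temps = [18, 18, 19, 21, 24, 26, 26, 27, 29, 33]

def dot_plot_rows(data):
    """Return the text rows of a dot plot, top row first, with the number line last."""
    counts = Counter(data)
    axis = range(min(data), max(data) + 1)
    rows = []
    # Build the stack from the tallest level down to level 1
    for level in range(max(counts.values()), 0, -1):
        rows.append("".join(" o " if counts[v] >= level else "   " for v in axis))
    rows.append("".join(f"{v:^3}" for v in axis))
    return rows

print("\n".join(dot_plot_rows(temps)))
```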
Stem Plots
Discrete and continuous data
Used for displaying small to medium-sized data sets
Data separated into two parts: leading digits (stem) and last digit (leaf)
Example
Key:
Stem plots with split stems
Used to identify hidden features in the data
The above stem plots are formed using the same data set. The last stem plot reveals that the data is negatively skewed with an outlier. Note: the outlier is not apparent in the original plot.
Exercise 2A
Median, Range and Interquartile Range (IQR)
IQR is defined as the spread of the middle 50% of data values
Example
Locate the median for the following data sets
Median:________________ Median:________________
Range and Interquartile Range (IQR)
The range gives us an idea about the spread of a set of data.
Range = Maximum value – Minimum value
The IQR is more reliable than the range as it is not affected by extreme values or outliers.
Range: Range:
IQR: IQR:
Example: Dot Plots
a)
b) Determine the range and IQR
Example: Stem Plots
The stem plot shows the average life expectancy (in years) of people living in 23 countries.
The key is such that 5|2 means 52 years.
Life expectancy
5 | 2
5 | 7
6 | 1 4
6 | 6 6 7 9
7 | 1 2 2 3 3 4 4 4 4
7 | 5 5 6 6 7 7
8 | 0
(a) Find the median
(b) Find the range and IQR
The IQR “versus” the range as measures of spread
The range
gives an indication of the absolute spread
is affected by the presence of outliers
The IQR
gives the spread of the middle 50% of observations
is generally unaffected by the presence of outliers since the upper and lower 25% of observations are “discarded”
Exercise 2B
The 5-number summary and the boxplot
Example
Order each of the following sets of data, locate the minimum value, the lower quartile, the median, the upper quartile and the maximum value.
Display the distribution using a box plot.
(a) 2 9 1 8 3 5 3 8 1
(b) 10 1 3 4 8 6 10 1 2 9
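The 5-number summary can be checked with a short sketch. It assumes the median-of-halves method for the quartiles (when n is odd, the median itself is left out of both halves); other software may use a different quartile rule and give slightly different values. `five_number_summary` is an illustrative name.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    with the overall median excluded from both halves when n is odd.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return (ordered[0],
            statistics.median(lower_half),
            statistics.median(ordered),
            statistics.median(upper_half),
            ordered[-1])

print(five_number_summary([2, 9, 1, 8, 3, 5, 3, 8, 1]))       # data set (a)
print(five_number_summary([10, 1, 3, 4, 8, 6, 10, 1, 2, 9]))  # data set (b)
```

The five values returned are exactly the five marks needed to draw the box plot.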
Shapes of histograms, stemplots and boxplots
Refer to previous stem plot example (Life Expectancy)
The stem plot shows the average life expectancy (in years) of people living in 23 countries. The key is such that 5|2 means 52 years.
Life expectancy
5 | 2
5 | 7
6 | 1 4
6 | 6 6 7 9
7 | 1 2 2 3 3 4 4 4 4
7 | 5 5 6 6 7 7
8 | 0
(a) Find the 5-number summary.
(b) Display the distribution using a box plot.
(c) Describe the distribution of life expectancy in terms of shape, centre and spread.
_________________________________________________________________________________
_________________________________________________________________________________
OUTLIERS IN A DISTRIBUTION
Example:
University participation rates (%) in 21 countries are listed below:
3 7 8 9 12 13 15 17 18 20 21
22 25 26 26 26 27 30 36 37 55
(a) Find the 5-number summary and complete the following table.
Minimum Q1 Median Q3 Maximum
(b) Calculate the interquartile range.
_________________________________________________________________________________
(c) Calculate the lower fence.
_________________________________________________________________________________
(d) If the summary statistics found in part (a) were used to construct a box plot, then the university participation rate of 55% would be shown as an outlier.
Explain why this is so. Show an appropriate calculation to support your answer.
_____ ___________________________________________________________________________
____ ____________________________________________________________________________
________________________________________________________________________________
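The fence calculation for this example can be sketched as follows. It assumes the usual 1.5 × IQR rule (lower fence = Q1 – 1.5 × IQR, upper fence = Q3 + 1.5 × IQR) and the median-of-halves method for the quartiles; the names are illustrative.

```python
import statistics

rates = [3, 7, 8, 9, 12, 13, 15, 17, 18, 20, 21,
         22, 25, 26, 26, 26, 27, 30, 36, 37, 55]

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the data."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])

q1, q3 = quartiles(rates)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside the fences is plotted as an outlier on the box plot
outliers = [x for x in rates if x < lower_fence or x > upper_fence]

print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
print("Lower fence =", lower_fence, " Upper fence =", upper_fence)
print("Outliers:", outliers)
```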
THE 5-NUMBER SUMMARY AND THE BOX PLOT: USING CAS
Example 1:
The reaction times (in millisec) of 18 people are listed below:
38 36 35 35 43 46 42 64 40 48 35 34 40 44 30 25 39 31
Find the 5-number summary using the CAS calculator.
Construct a box plot using the CAS calculator.
The 5-number summary values from the box plot.
Interpreting boxplots
(h) Describe the distribution represented by the boxplot in terms of shape, centre and spread.
(g) Describe the distribution represented by the boxplot in terms of shape, centre and spread.
______________________________________________________________________________________________________
______________________________________________________________________________________________________
______________________________________________________________________________________________________
______________________________________________________________________________________________________
Exercise 2C, 2D, 2E
The Mean
Calculating the mean
The mean of a set of data is given by:
Mean = (sum of all data values) / (total number of data values)
$$\bar{x} = \frac{\sum x}{n}$$
Example 1
The following are test scores obtained by a group of 12 students.
1 1 2 2 2 3 3 4 4 5 6 7
(a) Calculate the mean and find the median of the distribution.
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
(b) The maximum score was incorrectly recorded. It was 27 and not 7. How does this affect
(i) the median?
____________________________________________________________________
(ii) the mean?
____________________________________________________________________
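The effect of the recording error on each measure of centre can be seen by computing both statistics before and after the correction. A minimal sketch using the standard `statistics` module:

```python
import statistics

scores = [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 7]
corrected = scores[:-1] + [27]   # the maximum should have been 27, not 7

for label, data in [("recorded", scores), ("corrected", corrected)]:
    print(f"{label}: mean = {statistics.mean(data):.2f}, "
          f"median = {statistics.median(data)}")
```

Only one value changes, yet the mean moves while the middle of the ordered list stays where it was.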
Relationship between the mean and the median
The median lies at the mid-point of the distribution
The mean is the balance point of the distribution
The median “vs” the mean as appropriate measures of centre
The median is relatively unaffected by the presence of extreme values.
If the distribution is:
symmetric with no outliers, either the mean or median can be used to indicate the centre of the distribution
clearly skewed and/or if there are outliers, the median is a more appropriate measure of centre since the median is relatively resistant to extreme values
Example
The number of bedrooms in houses on a street: 2, 1, 3, 4, 2, 3, 2, 2, 3, 3, 20
Mean:
Median:
The mean becomes a less reliable measure of the centre of a set of data when the set of data is skewed or contains an outlier. The median is a better measure of the centre of a set of data in these situations.
Standard Deviation
Defining the standard deviation
The interquartile range (IQR) measures the spread around the median
The standard deviation measures the spread around the mean.
The formula for the standard deviation, s, is
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
where n is the number of data values and $\bar{x}$ is the mean.
NOTE!!!!!!
The standard deviation is found using the CAS calculator
Finding the standard deviation using CAS
Example:
The data lists the pulse rates of 26 adult females.
65 73 74 81 59 64 76 83 95 70 73 79 64 77 80 82 77 87 66 89 68 78 91 93 69 75
Find the standard deviation for this distribution.
Exercise 2F-1 and 2F-2
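To see what the CAS is doing, the standard deviation formula (the square root of the sum of squared deviations from the mean, divided by n – 1) can be written out directly for the pulse rate data. The sketch below also compares the result with `statistics.stdev`, which uses the same n – 1 divisor; `sample_sd` is an illustrative name.

```python
import math
import statistics

pulse_rates = [65, 73, 74, 81, 59, 64, 76, 83, 95, 70, 73, 79, 64,
               77, 80, 82, 77, 87, 66, 89, 68, 78, 91, 93, 69, 75]

def sample_sd(data):
    """Standard deviation with the n - 1 divisor, written out step by step."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    return math.sqrt(sum(squared_deviations) / (n - 1))

s = sample_sd(pulse_rates)
print(f"By the formula:       s = {s:.2f}")
print(f"statistics.stdev:     s = {statistics.stdev(pulse_rates):.2f}")
```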
Normal Distribution
The set of data in the histogram opposite is approximately symmetric with a bell shape.
Sets of data like this, such as birth weights and people’s heights, are said to follow a normal distribution.
Normal distributions are centred on the mean value, $\bar{x}$. The 68-95-99.7% rule can also be used for normal distributions:
1. About 68% of data lie within 1 standard deviation either side of the mean
2. About 95% of data lie within 2 standard deviations either side of the mean
3. About 99.7% of data lie within 3 standard deviations either side of the mean
Example 1:
The distribution of blood pressure readings for executives is known to be symmetric with a mean blood pressure of 134 and a standard deviation of 20.
From this information, it can be concluded that:
(a) About 68% of the executives have blood pressures between _________ and_________
(b) About 95% of the executives have blood pressures between ___________ and_________
(c) About 99.7% of the executives have blood pressures between __________and_________
(d) About 16% of the executives have blood pressures above _____________
(e) About 2.5% of the executives have blood pressures below ____________
(f) About 0.15% of the executives have blood pressures below ___________
(g) About 50% of the executives have blood pressures above _____________
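Every part of this example uses a boundary of the form mean ± k standard deviations. A small sketch of those boundaries, with the matching tail percentage found by halving what lies outside each interval (the function name `interval` is illustrative):

```python
def interval(mean, sd, k):
    """Return (lower, upper) for the interval within k standard deviations of the mean."""
    return mean - k * sd, mean + k * sd

mean_bp, sd_bp = 134, 20

for k, percent in [(1, 68), (2, 95), (3, 99.7)]:
    low, high = interval(mean_bp, sd_bp, k)
    tail = (100 - percent) / 2   # by symmetry, half of the remainder lies in each tail
    print(f"About {percent}% between {low} and {high}; "
          f"about {tail:.2f}% below {low} and {tail:.2f}% above {high}")
```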
Example 2:
The distribution of resting pulse rates of 20-year-old men is approximately symmetric, with a mean of 66 beats/min and a standard deviation of 4 beats/min.
(a) What percentage of 20-year-old men have pulse rates of:
(i) less than 66 beats/min? _________________
(ii) more than 70 beats/min? _________________
(iii) between 62 and 70 beats/min? _________________
(iv) less than 62 beats/min? _________________
(v) between 58 and 74 beats/min? _________________
(vi) less than 70 beats/min? _________________
(b) In a sample of 2 000 men, how many are expected to have pulse rates between 54 and 78 beats/min?
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
Example 3:
The number of matches in a box is not always the same. When a sample of boxes was studied it was found that the number of matches in a box approximated a normal distribution with a mean of 50 matches and a standard deviation of 2.
In a sample of 200 boxes how many would be expected to have more than 48 matches?
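Expected numbers follow from multiplying the relevant percentage by the sample size. The sketch below applies this to Example 2(b) and Example 3, assuming the 68-95-99.7% rule applies to both distributions; the helper names are hypothetical.

```python
# Percentage of values within k standard deviations of the mean (68-95-99.7% rule)
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

def percent_above(k_below_mean):
    """Percentage of values above (mean - k * sd), for k = 1, 2 or 3."""
    lower_tail = (100 - WITHIN[k_below_mean]) / 2
    return 100 - lower_tail

def expected_count(percent, sample_size):
    """Expected number of individuals, rounded to the nearest whole number."""
    return round(percent / 100 * sample_size)

# Example 2(b): 54 to 78 beats/min is 66 - 3 x 4 to 66 + 3 x 4
men = expected_count(WITHIN[3], 2000)

# Example 3: 48 matches is 50 - 1 x 2, so we need the percentage above mean - 1 sd
boxes = expected_count(percent_above(1), 200)

print("Men with pulse rates between 54 and 78:", men)
print("Boxes with more than 48 matches:", boxes)
```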
An estimation of the standard deviation
For normal distributions, Standard deviation ≈ Range / 6
Example:
For all other distributions, the standard deviation can be very roughly estimated by
Standard deviation ≈ Range / 4
The standard score ( z-score)
Indicates the position of a certain score in relation to the mean.
The process of finding a standard score is called standardisation.
Non-standardised data are often referred to as raw scores or just scores.
Calculating z-scores
Normally distributed scores are transformed into a new set of units that show the number of standard deviations each data value lies from the mean.
This transformation process is called standardising and these transformed data values are called standard values or z-scores.
The rule for calculating the z-score is:
standard score = (data value – mean) / standard deviation
$$z = \frac{x - \bar{x}}{s}$$
Example:
The test scores of an IQ test are normally distributed with a mean of 100 and a standard deviation of 15.
Convert the following IQ test scores (x) to standard IQ test scores (z).
IQ score (x) Standard IQ score (z)
100
115
70
120
90
Standard scores can be zero, positive or negative:
A positive z-score indicates the data value lies above the mean
A zero z-score indicates the data value is equal to the mean
A negative z-score indicates the data value lies below the mean
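The z-score rule (data value minus mean, divided by standard deviation) is easy to check for the IQ table. A minimal sketch, with `z_score` as an illustrative function name and the results rounded to two decimal places:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

iq_scores = [100, 115, 70, 120, 90]
z_values = [round(z_score(x, mean=100, sd=15), 2) for x in iq_scores]

for x, z in zip(iq_scores, z_values):
    position = "above" if z > 0 else "below" if z < 0 else "equal to"
    print(f"IQ {x}: z = {z} ({position} the mean)")
```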
Using z-scores
Example 1:
The heights of a group of young women are found to be normally distributed with a mean of 160 cm and a standard deviation of 7 cm.
Determine and interpret the z-score of a woman who is:
(a) 166 cm tall, giving the answer correct to 1 decimal place
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
(b) 148 cm tall, giving the answer correct to 1 decimal place
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
(c) 160 cm tall, giving the answer correct to 1 decimal place
___________________________________________________________________________
___________________________________________________________________________
___________________________________________________________________________
Example 2:
A student obtained a mark of 75 in her psychology exam and a mark of 70 in her mathematics exam.
Subject     | Student’s mark | Mean | Standard deviation
Psychology  | 75             | 65   | 10
Mathematics | 70             | 60   | 5
(a) Convert her psychology and mathematics marks to z-scores based on the information given above.
_________________________________________________________________________________
_________________________________________________________________________________
(b) Compare her performance in the two exams.
_________________________________________________________________________________
_________________________________________________________________________________
Exercise 2G, 2H, Worksheet and Chapter Review
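Because z-scores measure each mark relative to its own subject's mean and standard deviation, they allow marks from different exams to be compared. A sketch using the figures from the psychology and mathematics table (the names are illustrative):

```python
def z_score(x, mean, sd):
    """Standardise a raw mark using the subject's mean and standard deviation."""
    return (x - mean) / sd

# (student's mark, mean, standard deviation) for each subject
results = {"Psychology": (75, 65, 10), "Mathematics": (70, 60, 5)}

z_by_subject = {subject: z_score(*values) for subject, values in results.items()}
best_subject = max(z_by_subject, key=z_by_subject.get)

for subject, z in z_by_subject.items():
    print(f"{subject}: z = {z}")
print("Stronger performance relative to the group:", best_subject)
```

The higher raw mark does not automatically give the higher z-score, which is the point of standardising before comparing.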