Further Mathematics Unit 3 Core: Data Analysis

Academic year: 2021

Chapter 1 & 2: Displaying & Describing Data Distributions; Summarising Numerical Data

Name:_____________

Examples

Exercise 1A

Organising and displaying categorical data

Using a frequency table

The sex of 11 children is given below where F = female and M = male:

F M M F F M F F F M M

Complete the frequency table to display the information.
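The counts and percentages can be cross-checked with a short Python sketch. Python is not part of the original worksheet (which uses a CAS calculator), and the variable names are illustrative only.

```python
from collections import Counter

# Sex of the 11 children, as listed above (F = female, M = male)
sexes = "F M M F F M F F F M M".split()

counts = Counter(sexes)
total = sum(counts.values())

# Build (category, count, percentage to 1 d.p.) rows for the frequency table
rows = [(label, counts[label], round(100 * counts[label] / total, 1))
        for label in ("M", "F")]

for label, count, percent in rows:
    print(f"{label}: count = {count}, percentage = {percent}%")
print(f"Total: {total}")
```

The percentages should add to 100% (allowing for rounding), which is a quick check that no value was missed.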

Using a bar chart to display categorical data

• A bar chart is NOT a histogram.

• Used to display categorical data.

• Bars are of equal width and are separated by small, equal spaces; they can be arranged either horizontally or vertically.

Construct a bar chart for the set of data on the sex of the 11 children.

Sex        Frequency
           Count      Percentage (to 1 d.p.)
Male
Female
TOTAL

Using a stacked or segmented bar chart to display categorical data

• A segmented (divided) bar chart is a single bar that is used to represent all the data studied.

• The information is usually presented as percentages, so the total bar length represents 100% of the data.

Construct a segmented bar chart for the set of data on the sex of 11 children given on the previous page.

Reporting on categorical data

The mode is the statistic that describes the most dominant (most frequently occurring) group in the distribution.

Example:

The table shows the frequency distribution of school type for a group of students.

(a) Complete the table by filling in the missing values.

(b) Which is the modal category? __________________________________

Answering Statistical Questions: Categorical Variables

Briefly summarise the context in which the data were collected including the number of individuals involved in the study.

If there is a clear modal category, ensure that it is mentioned.

Include frequencies or percentages in the report. Percentages are preferred.

If there are a lot of categories, it is not necessary to mention every category, but the modal category should always be mentioned.

Example

______ schools were involved in the study. The schools were classified as being ‘Catholic’, _________, __________ or _____________. The majority of the schools, _____%, were found to be ___________.

Of the remaining schools, _____% were found to be __________, while __________% were found to be _________.

School type    Frequency
               Count      Percentage
Catholic       4          20
Government     11
Independent    5
TOTAL
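A minimal Python sketch for checking the percentages and the modal category, using only the counts shown in the table. The dictionary and variable names are illustrative assumptions, not part of the worksheet.

```python
# Counts taken from the school-type table above
counts = {"Catholic": 4, "Government": 11, "Independent": 5}

total = sum(counts.values())

# Percentage for each category, rounded to 1 decimal place
percentages = {school: round(100 * n / total, 1) for school, n in counts.items()}

# The modal category is the one with the largest count
modal_category = max(counts, key=counts.get)

for school, n in counts.items():
    print(f"{school}: {n} ({percentages[school]}%)")
print(f"Total: {total}, modal category: {modal_category}")
```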

Organising and displaying numerical data: The Histogram

Using a histogram to display numerical data

In a histogram:

• The frequency is shown on the vertical axis.

• The values of the variable are plotted on the horizontal axis.

• For continuous numerical data, each bar corresponds to an interval.

• For discrete numerical data, the intervals start and end half-way between values.

• Empty classes or missing discrete values have bars of zero height.

Example

Weight (in kg) Frequency

48 3

49 5

50 7

51 0

52 1

Number of pets Frequency

0 4

1 0

2 8

3 3

4 1

Constructing a histogram using CAS

Example 1:

The following are the marks obtained by a group of 27 students in a short Mathematics test:

16 11 4 25 15 7 14 13 14 12 15 13 16 14

15 12 17 18 22 18 15 13 17 18 22 23 18

Display the data using a histogram.

Enter the data into a data list named marks
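For readers without a CAS calculator, the grouping step behind a histogram can be sketched in Python. The interval width of 5 is an assumption chosen for illustration; the worksheet does not specify one.

```python
from collections import Counter

# Marks of the 27 students, as listed above
marks = [16, 11, 4, 25, 15, 7, 14, 13, 14, 12, 15, 13, 16, 14,
         15, 12, 17, 18, 22, 18, 15, 13, 17, 18, 22, 23, 18]

BIN_WIDTH = 5  # assumed interval width, not given in the worksheet

# Map each mark to the lower edge of its interval, e.g. 17 -> 15 (interval 15 to <20)
bin_counts = Counter((m // BIN_WIDTH) * BIN_WIDTH for m in marks)

# Print a text histogram; an empty interval would get a bar of zero height
for lower in range(0, max(marks) + 1, BIN_WIDTH):
    upper = lower + BIN_WIDTH
    freq = bin_counts.get(lower, 0)
    print(f"{lower:>2} to <{upper:<2} | {'*' * freq} ({freq})")
```

A different interval width changes the appearance of the histogram, so it is worth trying more than one.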

Describing a histogram

The description is focused on three points:

• Shape and outliers

• Centre

• Spread

[Figure: Histogram A, Histogram B and Histogram C, with the labels "Bimodal" and "Outliers"]

The distribution of a numerical variable can be described in terms of:

• shape: symmetric or skewed (positive or negative)

• outliers: values that appear to stand out

• centre: the midpoint of the distribution (median)

• spread: one measure is the range of values covered (range = largest value – smallest value).

Example 1:

The histogram shows the distribution of waiting times (in minutes) at a doctor’s surgery. Complete the report below describing the distribution of waiting times in terms of shape and any outliers that may be present, centre and spread.

The distribution is ___________________________________ with a possible outlier

in the interval ____________________________ minutes.

The median of the distribution lies in the interval __________________________ minutes.

The range of the distribution is __________________________________ minutes.

[Figure: histogram of waiting times (minutes)]

Exercise 1C

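Using a log scale to display data

Consider the following numbers:

$10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}$

Their logarithms (base 10) are:

$\log(1), \log(10), \log(100), \log(1000), \log(10\,000)$

$\log(0.1), \log(0.01), \log(0.001), \log(0.0001)$

Taking logs replaces numbers that range over many orders of magnitude (including very small decimals) with values that fit on a usable scale.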
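The effect of taking base-10 logs of powers of ten can be checked with a small Python sketch (an illustration only; the worksheet itself uses a calculator).

```python
import math

# Powers of ten from 10^-4 up to 10^4
numbers = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]

# The base-10 log of 10^k is k; rounding guards against tiny floating-point error
logs = [round(math.log10(x), 6) for x in numbers]

for x, log_x in zip(numbers, logs):
    print(f"log({x}) = {log_x:g}")
```

Each step of 1 on the log scale corresponds to multiplying the original value by 10.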

Example

The set of numbers 0.01, 0.1, 1, 10, 100, 1000, 10 000, 100 000, 1 000 000 ranges from 0.01 to 1 million.

[Figure: dot plot of these numbers; dot plot using a log scale]

Example

The histogram below displays the body weights (in kg) of a number of animal species. Because the animals represented in this dataset have weights ranging from around 1 kg to 90 tonnes (a dinosaur), most of the data are bunched up at one end of the scale and much detail is missing.

The distribution of weights is highly positively skewed, with an outlier.

Using a log scale, their weights are much more evenly spread along the scale. The distribution is now approximately symmetric, with no outliers, and the histogram is considerably more informative.

We can now see that the percentage of animals with weights between 10 and 100 kg is similar to the percentage of animals with weights between 100 and 1000 kg.

Example

The histogram shows the distribution of the weights of 27 animal species plotted on a log scale.

a) What body weight (in kg) is represented by the number 4 on the log scale?

b) How many of these animals have body weights more than 10 000 kg?

c) The weight of a cat is 3.3 kg. Use your calculator to determine the log of its weight correct to two significant figures.

d) Determine the weight (in kg) whose log weight is 3.4 (the elephant). Write your answer correct to the nearest whole number.
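Parts (a), (c) and (d) convert between a weight and its base-10 log. A hedged Python sketch of those conversions follows; the variable names are illustrative, and part (b) is omitted because it needs the histogram itself.

```python
import math

# (a) The number 4 on a base-10 log scale corresponds to 10^4 kg
weight_at_4 = 10 ** 4

# (c) Log of the cat's weight (3.3 kg)
log_cat = math.log10(3.3)

# (d) Weight whose log is 3.4: undo the log by raising 10 to that power
weight_from_log = 10 ** 3.4

print(weight_at_4)                # 10000
print(f"{log_cat:.2f}")           # 0.52
print(round(weight_from_log))     # 2512
```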

Constructing a histogram with a log scale using CAS

Example

The weights of 27 animal species (in kg) are recorded below.

1.4 470 36 28 1.0 12 000 2600 190 520 10 3.3 530 210 62 6700 9400 6.8 35 0.12 0.023 2.5 56 100 52 87 000 0.12 190

Construct a histogram to display the distribution:

a) of the body weights of these 27 animals and describe its shape

b) of the log body weights of these animals and describe its shape.
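As an alternative to the CAS steps, the log transformation and grouping for part (b) can be sketched in Python. The choice of intervals of width 1 on the log scale is an assumption made for illustration.

```python
import math
from collections import Counter

# Body weights (kg) of the 27 animal species listed above
weights = [1.4, 470, 36, 28, 1.0, 12000, 2600, 190, 520, 10, 3.3, 530, 210, 62,
           6700, 9400, 6.8, 35, 0.12, 0.023, 2.5, 56, 100, 52, 87000, 0.12, 190]

log_weights = [math.log10(w) for w in weights]

# Group the log weights into intervals of width 1 (assumed):
# floor(log) = 1 means a weight from 10 kg up to, but not including, 100 kg
bins = Counter(math.floor(lw) for lw in log_weights)

for k in range(min(bins), max(bins) + 1):
    freq = bins.get(k, 0)
    print(f"{k:>2} to <{k + 1:<2} | {'*' * freq} ({freq})")
```

On the log scale the counts rise and fall around the middle intervals, whereas the raw weights are bunched at the low end with a few very large values.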

Log Scale

Exercise 1D Review Exercise

Chapter 2: Summarising Numerical Data

SUMMARY STATISTICS

• Summary statistics are numbers used to describe the overall essential features of a distribution.

Two essential features are:

• The typical value (the centre)

_____________________________________________________________

_____________________________________________________________

_____________________________________________________________

• The spread

_____________________________________________________________

_____________________________________________________________

_____________________________________________________________

Example:

Given below is an ordered set of 10 daily maximum temperatures (in degrees Celsius) recorded in November:

18 18 19 21 24 26 26 27 29 33

Find the following summary statistics.

(a) Range _______________________________________________________________

(b) Median _______________________________________________________________

(c) Mean _______________________________________________________________

_______________________________________________________________
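The three summary statistics can be checked with Python's standard `statistics` module. This is a sketch for verification only; the variable names are illustrative.

```python
import statistics

# Ordered daily maximum temperatures (degrees Celsius) from the example above
temps = [18, 18, 19, 21, 24, 26, 26, 27, 29, 33]

data_range = max(temps) - min(temps)   # largest value minus smallest value
median = statistics.median(temps)      # average of the two middle values when n is even
mean = statistics.mean(temps)          # sum of the values divided by the number of values

print(f"range = {data_range}, median = {median}, mean = {mean}")
```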

Dot Plot

• The simplest way to display small sets of numerical data

• Suitable for displaying discrete numerical data

• Consists of a number line with each data point marked by a dot

• When several data points have the same value, the points are stacked on top of each other.

Construct a dot plot for the data above.

Stem Plots

• Used for discrete and continuous data

• Used for displaying small to medium-sized data sets

• Data are separated into two parts: leading digits (stem) and last digit (leaf)

Example

[Figure: stem plots with key]

Stem plots with split stems

• Used to identify hidden features in the data

The above stem plots are formed using the same data set. The last stem plot reveals that the data are negatively skewed with an outlier. Note: the outlier is not apparent in the original plot.

Exercise 2A

Median, Range and Interquartile Range (IQR)

The IQR is defined as the spread of the middle 50% of data values.

Example

Locate the median for the following data sets

Median:________________ Median:________________

Range and Interquartile Range (IQR)

The range gives us an idea about the spread of a set of data.

Range = Maximum value – Minimum value

The IQR is more reliable than the range as it is not affected by extreme values or outliers.

Range: Range:

IQR: IQR:

Example: Dot Plots

a)

b) Determine the range and IQR

Example: Stem Plots

The stem plot shows the average life expectancy (in years) of people living in 23 countries.

The key is such that 5|2 means 52 years.

Life expectancy

5 | 2
5 | 7
6 | 1 4
6 | 6 6 7 9
7 | 1 2 2 3 3 4 4 4 4
7 | 5 5 6 6 7 7
8 | 0

(a) Find the median

(b) Find the range and IQR

The IQR “versus” the range as measures of spread

The range

• gives an indication of the absolute spread

• is affected by the presence of outliers

The IQR

• gives the spread of the middle 50% of observations

• is generally unaffected by the presence of outliers since the upper and lower 25% of observations are “discarded”

Exercise 2B

The 5-number summary and the boxplot

Example

Order each of the following sets of data, locate the minimum value, the lower quartile, the median, the upper quartile and the maximum value.

Display the distribution using a box plot.

(a) 2 9 1 8 3 5 3 8 1

(b) 10 1 3 4 8 6 10 1 2 9
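A hedged Python sketch for checking a 5-number summary. It assumes the common school convention that the quartiles are the medians of the lower and upper halves of the ordered data, leaving out the middle value when the number of values is odd; calculators and software may use slightly different quartile rules. The function name is hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    with the middle value left out when the number of values is odd.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

set_a = [2, 9, 1, 8, 3, 5, 3, 8, 1]
set_b = [10, 1, 3, 4, 8, 6, 10, 1, 2, 9]

print(five_number_summary(set_a))
print(five_number_summary(set_b))
```

The five values returned are exactly the positions marked on a box plot: the two whisker ends, the two box edges and the line inside the box.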

Shapes of histograms, stemplots and boxplots

Refer to previous stem plot example (Life Expectancy)

The stem plot shows the average life expectancy (in years) of people living in 23 countries. The key is such that 5|2 means 52 years.

Life expectancy

5 | 2
5 | 7
6 | 1 4
6 | 6 6 7 9
7 | 1 2 2 3 3 4 4 4 4
7 | 5 5 6 6 7 7
8 | 0

(a) Find the 5-number summary.

(b) Display the distribution using a box plot.

(c) Describe the distribution of life expectancy in terms of shape, centre and spread.

_________________________________________________________________________________

_________________________________________________________________________________

OUTLIERS IN A DISTRIBUTION

Example

Example:

University participation rates (%) in 21 countries are listed below:

3 7 8 9 12 13 15 17 18 20 21

22 25 26 26 26 27 30 36 37 55

(a) Find the 5-number summary and complete the following table.

Minimum Q1 Median Q3 Maximum

(b) Calculate the interquartile range.

_________________________________________________________________________________

(c) Calculate the lower fence.

_________________________________________________________________________________

(d) If the summary statistics found in part (a) were used to construct a box plot, then the university participation rate of 55% would be shown as an outlier.

Explain why this is so. Show an appropriate calculation to support your answer.

_____ ___________________________________________________________________________

____ ____________________________________________________________________________

________________________________________________________________________________
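The fence calculation can be sketched in Python. This assumes the standard 1.5 × IQR rule for fences (lower fence = Q1 − 1.5 × IQR, upper fence = Q3 + 1.5 × IQR) and quartiles taken as the medians of the lower and upper halves; neither convention is spelled out in the surviving text of this worksheet.

```python
import statistics

# University participation rates (%) in 21 countries, from the example above
rates = [3, 7, 8, 9, 12, 13, 15, 17, 18, 20, 21,
         22, 25, 26, 26, 26, 27, 30, 36, 37, 55]

ordered = sorted(rates)
n = len(ordered)
half = n // 2

# Assumption: quartiles are medians of the halves (middle value left out when n is odd)
q1 = statistics.median(ordered[:half])
q3 = statistics.median(ordered[half + 1:] if n % 2 else ordered[half:])
iqr = q3 - q1

# Assumption: fences use the standard 1.5 x IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"lower fence = {lower_fence}, upper fence = {upper_fence}")
print(f"outliers: {outliers}")
```

Any value beyond a fence is plotted as a separate point on the box plot rather than being included in a whisker.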

THE 5-NUMBER SUMMARY AND THE BOX PLOT: USING CAS

Example 1:

The reaction times (in millisec) of 18 people are listed below:

38 36 35 35 43 46 42 64 40 48 35 34 40 44 30 25 39 31

Find the 5-number summary using the CAS calculator.

Construct a box plot using the CAS calculator.

The 5-number summary values from the box plot.

Interpreting boxplots

Describe the distribution represented by the boxplot in terms of shape, centre and spread.

______________________________________________________________________________________________________

______________________________________________________________________________________________________

______________________________________________________________________________________________________

______________________________________________________________________________________________________

Exercise 2C, 2D, 2E

The Mean

Calculating the mean

The mean of a set of data is given by:

Mean = (sum of all data values) / (total number of data values)

$$\bar{x} = \frac{\sum x}{n}$$

Example 1

The following are test scores obtained by a group of 12 students.

1 1 2 2 2 3 3 4 4 5 6 7

(a) Calculate the mean and find the median of the distribution.

___________________________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

(b) The maximum score was incorrectly recorded. It was 27 and not 7. How does this affect

(i) the median?

____________________________________________________________________

(ii) the mean?

____________________________________________________________________
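The effect of the recording error on each measure of centre can be explored with a short Python sketch (illustrative variable names; not part of the original worksheet).

```python
import statistics

# Test scores as first recorded
scores = [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 7]

# Same scores with the maximum corrected from 7 to 27
corrected = scores[:-1] + [27]

mean_before = statistics.mean(scores)
median_before = statistics.median(scores)
mean_after = statistics.mean(corrected)
median_after = statistics.median(corrected)

print(f"before: mean = {mean_before:.2f}, median = {median_before}")
print(f"after:  mean = {mean_after:.2f}, median = {median_after}")
```

Only the largest value changes, so the two middle values (and hence the median) stay where they are, while the sum used for the mean increases by 20.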

Relationship between the mean and the median

• The median lies at the mid-point of the distribution.

• The mean is the balance point of the distribution.

The median “vs” the mean as appropriate measures of centre

The median is relatively unaffected by the presence of extreme values.

If the distribution is:

• symmetric with no outliers, either the mean or median can be used to indicate the centre of the distribution

• clearly skewed and/or if there are outliers, the median is a more appropriate measure of centre since the median is relatively resistant to extreme values

Example

The number of bedrooms in houses on a street: 2, 1, 3, 4, 2, 3, 2, 2, 3, 3, 20

Mean:

Median:

The mean becomes a less reliable measure of the centre of a set of data when the set of data is skewed or contains an outlier. The median is a better measure of the centre of a set of data in these situations.

Standard Deviation

Defining the standard deviation

The interquartile range (IQR) measures the spread around the median.

The standard deviation measures the spread around the mean.

The formula for the standard deviation, s, is

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

where n is the number of data values and $\bar{x}$ is the mean.

NOTE: The standard deviation is found using the CAS calculator.

Finding the standard deviation using CAS

Example:

The data lists the pulse rates of 26 adult females.

65 73 74 81 59 64 76 83 95 70 73 79 64 77 80 82 77 87 66 89 68 78 91 93 69 75

Find the standard deviation for this distribution.

Exercise 2F-1 and 2F-2
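As a cross-check on the calculator result, the standard deviation of the pulse rates can be computed in Python, both directly from the n − 1 formula and with the standard library. This is a sketch for verification only; the variable names are illustrative.

```python
import math
import statistics

# Pulse rates of the 26 adult females listed in the example
pulse_rates = [65, 73, 74, 81, 59, 64, 76, 83, 95, 70, 73, 79, 64,
               77, 80, 82, 77, 87, 66, 89, 68, 78, 91, 93, 69, 75]

n = len(pulse_rates)
mean = sum(pulse_rates) / n

# Sample standard deviation: sum of squared deviations from the mean, divided by n - 1
s_manual = math.sqrt(sum((x - mean) ** 2 for x in pulse_rates) / (n - 1))

# statistics.stdev uses the same n - 1 (sample) formula
s_library = statistics.stdev(pulse_rates)

print(f"n = {n}, mean = {mean:.1f}, s = {s_manual:.1f}")
```

Note that `statistics.pstdev` divides by n instead of n − 1, so it would give a slightly smaller value than the formula used here.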

Normal Distribution

The set of data in the histogram opposite is approximately symmetric with a bell shape.

Data sets like this, such as birth weights and people’s heights, are said to follow a normal distribution.

Normal distributions are centred on the mean value, $\bar{x}$. The 68–95–99.7% rule can be used for normal distributions:

1. About 68% of data lie within 1 standard deviation either side of the mean
2. About 95% of data lie within 2 standard deviations either side of the mean
3. About 99.7% of data lie within 3 standard deviations either side of the mean

Example 1:

The distribution of blood pressure readings for executives is known to be symmetric with a mean blood pressure of 134 and a standard deviation of 20.

From this information, it can be concluded that:

(a) About 68% of the executives have blood pressures between _________ and_________

(b) About 95% of the executives have blood pressures between ___________ and_________

(c) About 99.7% of the executives have blood pressures between __________and_________

(d) About 16% of the executives have blood pressures above _____________

(e) About 2.5% of the executives have blood pressures below ____________

(f) About 0.15% of the executives have blood pressures below ___________

(g) About 50% of the executives have blood pressures above _____________
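The intervals in this example follow directly from the 68–95–99.7% rule. A minimal Python sketch, assuming the distribution is approximately normal so that the rule applies (names are illustrative):

```python
mean, sd = 134, 20   # blood pressure example above

# Intervals covering about 68%, 95% and 99.7% of values: mean ± 1, 2 and 3 standard deviations
intervals = {pct: (mean - k * sd, mean + k * sd)
             for k, pct in ((1, 68), (2, 95), (3, 99.7))}

# By symmetry, the percentage in each tail is half of what lies outside the interval
tails = {pct: round((100 - pct) / 2, 2) for pct in intervals}

for pct, (low, high) in intervals.items():
    print(f"about {pct}% between {low} and {high}; about {tails[pct]}% in each tail")
```

The tail percentages (16%, 2.5% and 0.15%) are the ones needed for the "above" and "below" parts of the question.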

Example 2:

The distribution of resting pulse rates of 20-year-old men is approximately symmetric, with a mean of 66 beats/min and a standard deviation of 4 beats/min.

(a) What percentage of 20-year-old men have pulse rates of:

(i) less than 66 beats/min? _________________

(ii) more than 70 beats/min? _________________

(iii) between 62 and 70 beats/min? _________________

(iv) less than 62 beats/min? _________________

(v) between 58 and 74 beats/min? _________________

(vi) less than 70 beats/min? _________________

(b) In a sample of 2 000 men, how many are expected to have pulse rates between 54 and 78 beats/min?

________________________________________________________________________________

________________________________________________________________________________

________________________________________________________________________________

________________________________________________________________________________
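Part (b) can be sketched in Python as follows, assuming the 68–95–99.7% rule applies to these pulse rates. The lookup table `WITHIN` and the other names are illustrative.

```python
mean, sd = 66, 4
sample_size = 2000

# Approximate percentage of values within k standard deviations of the mean
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

# 54 and 78 are the same number of standard deviations below and above the mean
k_below = (mean - 54) / sd
k_above = (78 - mean) / sd

percent = WITHIN[int(k_above)]
expected = round(sample_size * percent / 100)

print(f"{k_below:g} sd below to {k_above:g} sd above: about {percent}% -> {expected} men")
```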

Example 3:

The number of matches in a box is not always the same. When a sample of boxes was studied it was found that the number of matches in a box approximated a normal distribution with a mean of 50 matches and a standard deviation of 2.

In a sample of 200 boxes how many would be expected to have more than 48 matches?
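A one-sided question like this combines the 68–95–99.7% rule with symmetry. A hedged Python sketch of the reasoning (variable names are illustrative):

```python
mean, sd, boxes = 50, 2, 200

# 48 matches is one standard deviation below the mean
z = (48 - mean) / sd

# About 68% lie within 1 sd, so about 34% lie between 48 and the mean;
# a further 50% lie above the mean
percent_above_48 = 68 / 2 + 50

expected_boxes = boxes * percent_above_48 / 100

print(f"z = {z:g}, about {percent_above_48:g}% -> {expected_boxes:g} boxes")
```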

An estimation of the standard deviation

• For normal distributions:

$$\text{standard deviation} \approx \frac{\text{range}}{6}$$

• For all other distributions, the standard deviation can be very roughly estimated by:

$$\text{standard deviation} \approx \frac{\text{range}}{4}$$

Example:

The standard score ( z-score)

• Indicates the position of a certain score in relation to the mean.

• The process of finding a standard score is called standardisation.

• Non-standardised data are often referred to as raw scores or just scores.

Calculating z-scores

Normally distributed scores are transformed into a new set of units that show the number of standard deviations each data value lies from the mean.

This transformation process is called standardising and these transformed data values are called standard values or z-scores.

The rule for calculating the z-score is:

standard score = (data value – mean) / standard deviation

$$z = \frac{x - \bar{x}}{s}$$

Example:

The test scores of an IQ test are normally distributed with a mean of 100 and a standard deviation of 15.

Convert the following IQ test scores (x) to standard IQ test scores (z).

IQ score (x)    Standard IQ score (z)
100
115
70
120
90
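The standardising rule is easy to check with a short Python sketch. The function name `z_score` is hypothetical, and rounding to 2 decimal places is an assumption since the example does not state a required accuracy.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

iq_scores = [100, 115, 70, 120, 90]

# IQ example above: mean 100, standard deviation 15
z_values = [round(z_score(x, 100, 15), 2) for x in iq_scores]

for x, z in zip(iq_scores, z_values):
    print(f"x = {x:>3}  ->  z = {z:+.2f}")
```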

Standard scores can be zero, positive or negative:

• A positive z-score indicates the data value lies above the mean.

• A zero z-score indicates the data value is equal to the mean.

• A negative z-score indicates the data value lies below the mean.

Using z-scores

Example 1:

The heights of a group of young women are found to be normally distributed with a mean of 160 cm and a standard deviation of 7 cm.

Determine and interpret the z-score of a woman who is:

(a) 166 cm tall, giving the answer correct to 1 decimal place

___________________________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

(b) 148 cm tall, giving the answer correct to 1 decimal place

___________________________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

(c) 160 cm tall, giving the answer correct to 1 decimal place

___________________________________________________________________________

___________________________________________________________________________

___________________________________________________________________________

Example 2:

A student obtained a mark of 75 in her psychology exam and a mark of 70 in her mathematics exam.

(a) Convert her psychology and mathematics marks to z-scores based on the information given in the table below.

_________________________________________________________________________________

_________________________________________________________________________________

(b) Compare her performance in the two exams.

_________________________________________________________________________________

_________________________________________________________________________________

Subject        Student’s mark    Mean    Standard deviation
Psychology     75                65      10
Mathematics    70                60      5

Exercise 2G, 2H, Worksheet and Chapter Review
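Comparing marks from different subjects is a typical use of z-scores. A minimal Python sketch using the values in the table (the dictionary layout and names are illustrative assumptions):

```python
# (student's mark, class mean, class standard deviation) from the table
subjects = {
    "Psychology": (75, 65, 10),
    "Mathematics": (70, 60, 5),
}

# z = (mark - mean) / standard deviation
z_scores = {name: (mark - mean) / sd for name, (mark, mean, sd) in subjects.items()}

# The subject with the larger z-score is the stronger result relative to the class
best = max(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print(f"{name}: z = {z:g}")
print(f"Stronger relative performance: {best}")
```

Although the raw psychology mark is higher, the comparison depends on how far each mark sits above its own class mean, measured in standard deviations.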
