STATISTICS
MATHS
ASSESSMENT
Review types of data, collecting data, sorting data, measures of central tendency, measures of spread and displaying data.
Interpreting results from two sets of data (i.e. back-to-back stem-and-leaf displays, histograms, double column graphs, or box-and-whisker plots).
Find the range, interquartile range and standard deviation as measures of spread of data sets
- Find the mean and standard deviation of a set of data using digital technologies – calculators
- Compare and describe the spread of sets of data with the same mean but different standard deviations
Bivariate Data: recognises the difference between dependent and independent variables. Describes the strength and direction of the relationship between two variables displayed in a scatter plot, e.g. Strong positive relationships, weak negative relationships with justifications.
Uses lines of best fit to predict what might happen between known data values (interpolation) and predict what might happen beyond known data values (extrapolation).
Know the six stages of setting up a statistical investigation. Identify reasons why data in a display may be misrepresented.
EXPECTATIONS
Use measures of central tendency (mean, mode, median) and the range to analyse data that is displayed in a frequency table, stem-and-leaf plot or dot plot.
Use terms ‘skewed’ or ‘symmetrical’ when describing the shape of a distribution.
Compare two sets of data and draw conclusions by finding the mean, mode and/ or median, and the range of both sets.
Construct a cumulative frequency table, histogram and polygon (ogive) for ungrouped data.
Use cumulative frequency to find the median. Group data into class intervals.
Construct a cumulative frequency table and histogram for grouped data.
Find the mean and modal class of grouped data.
Determine the upper and lower quartiles for a set of scores.
Construct a box-and-whisker plot using the five-point summary.
Use a calculator to find the standard deviation of a set of scores.
Use the mean and standard deviation to compare two sets of data.
Compare the relative merits of measures of spread (range, interquartile range and standard deviation).
STATISTICS TERMINOLOGY
BIVARIATE DATA
BOX PLOT (BOX-AND-WHISKERS PLOT)
- a diagram obtained from the five number summary
- the box shows the middle 50% of scores (the interquartile range)
- the whiskers show the extent of the bottom and top quartiles as well as the range
CENSUS
- a survey of a whole population
CUMULATIVE FREQUENCY
- the number of scores less than or equal to a particular outcome
- e.g. for the data 3,6,5,3,5,5,4,3,3,6 the cumulative frequency of 5 is 8 (there are 8 scores of 5 or less)
CUMULATIVE FREQUENCY HISTOGRAM (AND POLYGON)
- these show the outcomes and their cumulative frequencies
DATA
- the pieces of information (or ‘scores’) to be examined
- categorical: data that uses non-numerical categories
- ordered data involves a ranking, e.g. exam grades, garment sizes
- distinct data has no order, e.g. colours, types of cars
- numerical: data that uses numbers to show ‘how much’
- continuous data can have any numerical value within a range, e.g. height
- discrete data is restricted to certain numerical values, e.g. number of pets
DOT PLOT
- a type of graph that uses one axis and a number of dots above the axis
EXTRAPOLATION
- predicting data beyond the range of values given
FIVE NUMBER SUMMARY
- a set of numbers consisting of the minimum score, the three quartiles and the maximum score
FREQUENCY
- the number of times an outcome occurs in the data
- e.g. for the data 3,6,5,3,5,5,4,3,3,6 the outcome 5 has a frequency of 3
FREQUENCY DISTRIBUTION TABLE
- a table that shows all the possible outcomes and their frequencies (it is usually extended by adding other columns such as the cumulative frequency)
FREQUENCY HISTOGRAM
- a type of column graph showing the outcomes and their frequencies.
FREQUENCY POLYGON
- a type of line graph showing outcomes and their frequencies
- to complete the polygon, the outcomes immediately above and below the actual outcomes are used (the height of these columns is zero)
GROUPED DATA
- data that is organised into groups or classes
- class intervals: the groups into which the data is organised, each with its own size, e.g. 1-5 (5 scores); 11-20 (10 scores)
- class centre: the middle outcome of a class, e.g. the class 1-5 has a class centre of 3
INTERPOLATION
- estimating data that lie within the domain of the values given
INTERQUARTILE RANGE
- the range of the middle 50% of scores
- the difference between the median of the upper half of scores and the median of the lower half of scores
- IQR = Q3 – Q1
LINE OF BEST FIT
- a line that ‘best fits’ the data on a scatter plot
MEAN
- the number obtained by ‘evening out’ all the scores until they are equal
- e.g. if the scores 3,6,5,3,5,5,4,3,3,6 were ‘evened out’, the number obtained would be 4.3
- to obtain the mean, we divide the sum of the scores by the total number of scores
MEDIAN
- the middle score for an odd number of scores or the mean of the middle two scores for an even number of scores
- the median class is the class of grouped data that contains the median
MODE (MODAL CLASS)
- the outcome or class that contains the most scores
OGIVE
- this is another name for the cumulative frequency polygon
OUTCOME
- a possible value of the data
OUTLIER
- a score that is separated from the main body of scores
QUARTILES
- the points that divide the scores up into quarters
- the second quartile, Q2, divides the scores into halves (Q2 = median)
- the first quartile, Q1, is the median of the lower half of scores
- the third quartile, Q3, is the median of the upper half of scores
RANGE
- the difference between the highest and lowest scores
SAMPLE
- a part (usually a small part) of a large population
- random sample: a sample taken so that each member of the population has the same chance of being included
- systematic sample: a sample selected according to some ordering scheme, e.g. every tenth member
- stratified sample: a sample taken proportionally from each subgroup in a population
SCATTER PLOT
- a graph that uses points on a number plane to show the relationship between two variables.
SHAPE (OF A DISTRIBUTION)
- a set of scores can be symmetrical or skewed
SOURCES OF DATA
- primary: the data has been collected by yourself
- secondary: the data has come from an external source, e.g. newspapers, internet
STANDARD DEVIATION
- a measure of spread that can be thought of as the average distance of scores from the mean
- the larger the standard deviation, the larger the spread
STATISTICS
- the collection, organisation and interpretation of numerical data
STEM-AND-LEAF PLOT
- a graph that shows the spread of scores without losing the identity of the data
- ordered stem-and-leaf plot: the leaves are placed in order
- back-to-back stem-and-leaf plot: this can be used to compare two sets of scores, one set on each side
VARIABLE
- something that can be observed, measured or counted to provide data
STATISTICS
TYPES OF DATA
The data we collect is made up of variables. These are pieces of information like a quantity or a characteristic that can be observed or measured. They may change either over time or between individual observations. The main types of data are:
CATEGORICAL – VARIABLES ARE CATEGORIES
- ordered | e.g. exam grades, garment sizes
- distinct | e.g. types of cars, eye colour
NUMERICAL – VARIABLES ARE NUMBERS
- continuous | e.g. height of a person, distance thrown
- discrete | e.g. number of pets
COLLECTING DATA
There are three main ways of collecting data:
CENSUS
- a whole population is surveyed, e.g. every student in the school is questioned
SAMPLE
- a selected group of a population is surveyed, e.g. a small number in each class is questioned
OBSERVATION
- numerical facts are collected and tabulated, e.g. sports data, weather, sales figures, etc.
A sample is usually random to limit the chances of bias occurring. However, it may be systematic if the members of the sample are chosen according to a rule, such as every 10th member of a population. If a population is composed of various sub-groups, the sample could be stratified to ensure a proportionate representation of each group in the sample.
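The three kinds of sample can be sketched in a few lines of Python. This is only an illustration: the population of 200 numbered students, the sample size of 20 and the year-group sizes are all made up, not taken from these notes.

```python
import random

# Hypothetical population: 200 students numbered 1 to 200.
population = list(range(1, 201))

# Random sample: every member has the same chance of being chosen.
rng = random.Random(1)  # fixed seed so the example is repeatable
random_sample = rng.sample(population, 20)

# Systematic sample: every 10th member, starting with the 10th.
systematic_sample = population[9::10]

# Stratified sample: each sub-group contributes in proportion to its size.
# The year-group sizes below are invented for illustration.
strata = {"Year 7": 80, "Year 8": 70, "Year 9": 50}
sample_size = 20
total = sum(strata.values())
stratified_counts = {group: round(sample_size * size / total)
                     for group, size in strata.items()}

print(sorted(random_sample))
print(systematic_sample)
print(stratified_counts)
```

With these made-up sizes the proportions work out to whole numbers; in general the rounded counts may need a small adjustment so that they add up to the intended sample size.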
Primary source data is collected first hand by observation or survey. Secondary source data is obtained from an external source such as a newspaper, website or another person’s research.
SORTING DATA
A large amount of data needs to be tabulated (organised into a table) so that it can be analysed. A common form of table is the frequency
distribution table.
DISCRETE DATA
OUTCOME (x)   TALLY       FREQUENCY (f)   f × x   CUMULATIVE FREQUENCY
1             |||         3               3       3
2             ||||        4               8       7
3             |||||||     7               21      14
4             |||||||||   9               36      23
5             |||||       5               25      28
6             ||          2               12      30
TOTAL                     30              105
GROUPED DATA
Used to cluster discrete data into groups or to divide continuous data into adjoining groups.
CLASS     CLASS CENTRE (x)   TALLY       FREQUENCY (f)   f × x   CUMULATIVE FREQUENCY
1-<5      3                  |||         3               9       3
5-<9      7                  ||||        4               28      7
9-<13     11                 |||||||     7               77      14
13-<17    15                 |||||||||   9               135     23
17-<21    19                 |||||       5               95      28
21-<25    23                 ||          2               46      30
TOTAL                                    30              390
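As a check on the discrete data table earlier in this section, the f × x and cumulative frequency columns can be rebuilt from the outcomes and their frequencies. The short Python sketch below does this; the variable names are chosen only for this example.

```python
# Frequencies from the discrete data table: outcome x -> frequency f.
frequencies = {1: 3, 2: 4, 3: 7, 4: 9, 5: 5, 6: 2}

rows = []
running_total = 0
for x, f in sorted(frequencies.items()):
    running_total += f                      # cumulative frequency so far
    rows.append((x, f, f * x, running_total))

total_f = sum(f for _, f, _, _ in rows)     # sum of the frequency column
total_fx = sum(fx for _, _, fx, _ in rows)  # sum of the f × x column
mean = total_fx / total_f

print("x", "f", "fx", "cf", sep="\t")
for row in rows:
    print(*row, sep="\t")
print("TOTAL", total_f, total_fx, sep="\t")
print("mean =", mean)
```

The totals agree with the table (30 scores, f × x total of 105), which gives a mean of 105 ÷ 30 = 3.5.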
After the data has been sorted, certain key numbers can be determined. Some measure how the data clusters around the ‘centre’. These are called measures of central tendency (or measures of location). Others measure how the data spreads from the centre. These are called measures of spread.
MEASURES OF CENTRAL TENDENCY
The mean, median and mode are called measures of location because they give an indication of a central value (or average) around which a set of scores tends to cluster.
MODE
- the score or outcome that occurs most often, i.e. the one with the highest frequency
- e.g. 3,4,3,4,6,7,5,7,3,5,2 – the mode of this set of data is 3
MEAN
- the sum of the scores divided by the number of scores, i.e. the usual definition of ‘average’.
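- for raw data, mean = (sum of scores) ÷ (number of scores), i.e. $\bar{x} = \frac{\sum x}{n}$
- for tabulated data, mean = (sum of the f × x column) ÷ (sum of the frequency column), i.e. $\bar{x} = \frac{\sum fx}{\sum f}$
MEDIAN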
- the ‘middle’ score when the scores are arranged in order from lowest to highest
- for an odd number of scores, the median is the middle score
- for an even number of scores, the median is the average of the two middle scores
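The three measures of central tendency can be checked with Python's built-in statistics module. The sketch below uses the set of scores from the mode example above.

```python
from statistics import mean, median, mode

scores = [3, 4, 3, 4, 6, 7, 5, 7, 3, 5, 2]

print("ordered scores:", sorted(scores))
print("mode   =", mode(scores))            # the most frequent score
print("median =", median(scores))          # middle score of the 11 ordered scores
print("mean   =", round(mean(scores), 2))  # sum of scores ÷ number of scores
```

There are 11 scores, so the median is the 6th score in order, and the mean is 49 ÷ 11.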
MEASURES OF SPREAD
RANGE
- the highest score minus the lowest score; for grouped data, unless the original scores are known, only the maximum possible range can be determined, by using the class groupings
- RANGE = highest score – lowest score
QUARTILES
- the median, being the middle score, divides a set of data into two equal parts. Quartiles divide the data into four equal parts.
- the first quartile is often referred to as the lower quartile and is represented by the symbol Q1. It is the value below which 25% of the scores lie.
- the second quartile is the middle value; it is also the median. It is the value that separates the lower 50% of scores from the upper 50% of scores.
- the third quartile is often called the upper quartile and is represented by the symbol Q3. It is the value above which 25% of the scores lie.
INTERQUARTILE RANGE
- the interquartile range is the difference between the upper quartile and the lower quartile.
- INTERQUARTILE RANGE = upper quartile – lower quartile = Q3 – Q1
- the interquartile range covers the middle 50% of scores and so ignores very low or very high scores (outliers)
- the interquartile range is not meaningful for a small set of scores
- associated with these measures of spread is the ‘five number summary’ of a set of data, which is defined as: minimum score, first quartile Q1, median Q2, third quartile Q3, maximum score.
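A five number summary and interquartile range can be worked out as sketched below in Python. The function follows the definition used in these notes (Q1 and Q3 are the medians of the lower and upper halves). Leaving the middle score out of both halves when the number of scores is odd is an assumption made here; calculators and software may use slightly different quartile rules.

```python
from statistics import median

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the halves."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # lower half of the scores
    upper = ordered[half + n % 2:]   # upper half (skips the middle score if n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

scores = [3, 6, 5, 3, 5, 5, 4, 3, 3, 6]
minimum, q1, q2, q3, maximum = five_number_summary(scores)
iqr = q3 - q1            # interquartile range = Q3 – Q1
data_range = maximum - minimum

print("five number summary:", (minimum, q1, q2, q3, maximum))
print("interquartile range:", iqr)
print("range:", data_range)
```

For these ten scores the ordered list is 3,3,3,3,4,5,5,5,6,6, so the lower half is 3,3,3,3,4 and the upper half is 5,5,5,6,6.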
DISPLAYING DATA
DOT PLOT
- a simple display where each score is represented by a dot
- the mode is easy to identify as the highest column of dots
- the highest and lowest scores determine the range
- a clear impression of the spread of the scores is given
- any outliers are easily identified
[Figure: diagram labelled lower quartile Q1, median Q2, upper quartile Q3 and interquartile range (50%), with 25% marked at each end]
FREQUENCY HISTOGRAM AND POLYGON
- the frequency of each score is represented by a column in a histogram and a dot in a polygon
- these dots coincide with the centre of each column
- the dots are joined to form the polygon, which is completed by joining it to the horizontal axis at each end
- the mode is identified by the highest column
- a clear impression of the spread of the scores is given
- any outliers are easily identified
- for grouped data, the classes can be represented on the horizontal axis by the class centres
CUMULATIVE FREQUENCY HISTOGRAM AND POLYGON
- graphing the cumulative frequency results in columns of increasing height, the last column representing the total frequency
- the polygon is formed by joining the corners of adjoining columns
- the polygon is useful for indicating the median and quartiles
[Figure: cumulative frequency histogram and polygon; axis labels: mark, number of scores, cumulative frequency]
BOX PLOT
- this is drawn using the five number summary for a set of data
- it gives an impression of the spread of the data and also whether it is symmetrical or skewed from its centre
- this will be indicated by the box being nearer to one end than the other
- if there are more low scores, the skew is said to be positive; more high scores would mean the data is negatively skewed
STEM-AND-LEAF PLOT
- a stem-and-leaf plot resembles a histogram (on its side) in which the data is grouped
- the individual scores can still be identified
- the data may be unordered or ordered
- two sets of data can be compared using a back-to-back stem-and-leaf plot
- the range and mode are easily identified
- when the scores are ordered, the median and quartiles can be determined by counting
[Figure: back-to-back stem-and-leaf plot, and a box plot labelled minimum value, lower quartile, median, upper quartile, maximum value]
FEATURES OF A DISPLAY OF DATA
OUTLIERS
- an outlier is a value that is clearly separated from the main body of the data
CLUSTERS
- a cluster is a group of scores that are ‘bunched’ (close together)
SYMMETRY AND SKEWNESS
- the general shape of a display or distribution refers to whether it is symmetrical or skewed (lopsided)
- the following histogram and stem-and-leaf plot are both symmetrical in shape: the scores are spread evenly on either side of the centre
[Figure: histogram (frequency axis) and stem-and-leaf plot, with an outlier and a cluster labelled]
- if a distribution is not symmetrical, it is skewed
- in a skewed distribution, most of the data are clustered at one ‘end’ of the distribution and taper off towards the ‘tail’ at the other end
COMPARING THE MEAN, MEDIAN AND MODE
- when the mean, median and mode are found for a set of data, it is necessary to decide which measure is most appropriate
- the mean is usually the most appropriate measure of location as it takes every score into account
- if there are any outliers in the set of data, then the mean may be affected by these extreme scores and will not accurately represent all of the scores
- in that case the median is a better measure as it is not affected by outliers
[Figure: distributions showing positive skew (+) and negative skew (-), with a stem-and-leaf plot]
- the mode is useful when the most common score is important, or when the data is categorical (such as hair colour or make of car). When dealing with categorical data, it is not possible to have a mean or median
CUMULATIVE FREQUENCY TABLES AND GRAPHS
- the cumulative frequency is a progressive total of the frequencies
- the cumulative frequency for a particular score is the sum of the frequencies for that score and for all scores less than it
- a cumulative frequency histogram and polygon (also called the ogive) can be drawn using the ‘score’ and the ‘cumulative frequency’ columns
- the cumulative frequency can be used to find the median
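One way to find the median from cumulative frequencies is sketched below in Python, using the outcomes and frequencies of the discrete data table in SORTING DATA. The helper name score_at is made up for this example: it returns the score sitting in a given position by finding the first outcome whose cumulative frequency reaches that position.

```python
from itertools import accumulate

outcomes = [1, 2, 3, 4, 5, 6]
frequencies = [3, 4, 7, 9, 5, 2]
cumulative = list(accumulate(frequencies))  # progressive total of the frequencies
n = cumulative[-1]                          # total number of scores

def score_at(position):
    """Return the score in the given position (1 = lowest score)."""
    for outcome, cf in zip(outcomes, cumulative):
        if position <= cf:
            return outcome
    raise ValueError("position is larger than the number of scores")

if n % 2 == 1:
    median = score_at((n + 1) // 2)                          # single middle score
else:
    median = (score_at(n // 2) + score_at(n // 2 + 1)) / 2   # mean of the two middle scores

print("cumulative frequencies:", cumulative)
print("median =", median)
```

There are 30 scores, so the median is the mean of the 15th and 16th scores. The cumulative frequency first reaches 15 at the outcome 4, so both middle scores are 4.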
GROUPED DATA
- when measuring the masses of students, there could be a large number of different masses
- constructing frequency tables and graphs of the individual scores would not provide useful information for this data and so, to overcome this problem, the scores are grouped together in class intervals
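The mean and modal class of grouped data can be estimated as in the Python sketch below. It uses the class intervals and frequencies of the grouped data table in SORTING DATA, and assumes that each class centre is the midpoint of its interval (so the class 1-<5 has a centre of 3).

```python
# Class intervals (lower bound, upper bound) and their frequencies.
classes = [(1, 5), (5, 9), (9, 13), (13, 17), (17, 21), (21, 25)]
frequencies = [3, 4, 7, 9, 5, 2]

# Class centre = midpoint of the interval (an assumption of this sketch).
centres = [(low + high) / 2 for low, high in classes]

total_f = sum(frequencies)
total_fx = sum(f * x for f, x in zip(frequencies, centres))
grouped_mean = total_fx / total_f

# Modal class = the class with the highest frequency.
modal_class = classes[frequencies.index(max(frequencies))]

print("class centres:", centres)
print("sum of f × x =", total_fx)
print("mean =", grouped_mean)
print("modal class =", modal_class)
```

The f × x total of 390 matches the total in the grouped data table. Because every score in a class is treated as if it were at the class centre, the result is an estimate of the mean rather than the exact mean of the original scores.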
MEASURES OF SPREAD
- the mode, median and mean are measures of central tendency as they give an indication of a central value
- the range is a measure of spread
- measures of spread indicate how much a set of data is spread out
QUARTILES
- the median, being the middle score, divides a set of data into two equal parts
- quartiles are the values that divide the set of data into four equal parts
INTERQUARTILE RANGE
- the interquartile range is the difference between the upper quartile and the lower quartile
- the interquartile range takes into account the middle 50% of scores and ignores very high or very low scores (outliers)
BOX-AND-WHISKER PLOTS
- the lower extreme (lowest score), lower quartile, median, upper quartile and upper extreme (highest score) together make the five number summary
- these points can be shown on a box-and-whisker plot
STANDARD DEVIATION
- the interquartile range measures the spread of the scores about the median
- the standard deviation of a set of scores is a measure of the spread of the scores about the mean
- if the standard deviation of a set of scores is small, there will be little spread of the scores about the mean
- the lower the standard deviation, the more consistent the data
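Two sets of data with the same mean can have very different standard deviations. The Python sketch below compares two made-up sets of scores (they are not from these notes). It uses pstdev, the population standard deviation; this is an assumption, since a calculator may also offer the sample standard deviation, which divides by n – 1 instead of n.

```python
from statistics import mean, pstdev

# Two hypothetical sets of scores with the same mean.
set_a = [4, 5, 5, 5, 6]
set_b = [1, 3, 5, 7, 9]

for name, scores in (("A", set_a), ("B", set_b)):
    print(name, "mean =", mean(scores),
          "standard deviation =", round(pstdev(scores), 2))
```

Both sets have a mean of 5, but the scores in set B lie much further from the mean, so set B has the larger standard deviation and set A is the more consistent set.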
THE NORMAL DISTRIBUTION
- if a frequency distribution of a population (such as heights of all Australian women) is normal, it can be represented by a bell-shaped curve called the normal curve or normal distribution curve
- it is symmetrical about the mean and is unimodal
- approximately 68% of the population will lie within one standard deviation of the mean
- approximately 95% of the population will lie within two standard deviations of the mean
- approximately 99.7% of the population will lie within three standard deviations of the mean
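The three percentages above are rounded values. As a rough check, they can be recovered from Python's statistics.NormalDist, as sketched below; the function name share_within is made up for this example.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, "standard deviation(s):", round(100 * share_within(k), 1), "%")
```

The same proportions apply to any normal distribution, whatever its mean and standard deviation, because they depend only on the number of standard deviations from the mean.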
COMPARING THE RANGE, INTERQUARTILE RANGE AND STANDARD DEVIATION
- the standard deviation is usually the most appropriate measure of spread because it takes into account all of the values in the data set
- the range is easiest to calculate, but its value only depends upon two scores, the highest and lowest score
- if there are any outliers in the set of data, then the standard deviation and range will be affected by these extreme scores and will then give an exaggerated representation of the spread
- when there are outliers, the interquartile range is the best measure because it concentrates only on the middle 50% of scores and so is not affected by them
STANDARD DEVIATION 2
- the standard deviation is a measure of how far the scores are spread about the mean. It can be thought of as the average distance of the scores from the mean
- the smaller the standard deviation, the less the scores are spread, i.e. the closer they are to the mean
- the larger the standard deviation, the more the scores are spread, i.e. the further they are from the mean
INTERPOLATION
- an interpolation is a prediction between given data points
- the process of estimating data within the domain of the values given
- this is valid when a definite relationship exists between the two variables
EXTRAPOLATION
- an extrapolation is a prediction beyond given data points
- the process of predicting data beyond the values given
- often unreliable and can lead to false results as there is no guarantee that an observed pattern will continue beyond the data presented
BIVARIATE DATA
- data collected on two different variables that may or may not be related
- used to analyse the relationship between two variables
- DEPENDENT VARIABLE – the variable that is measured
- INDEPENDENT VARIABLE – the variable that is changed
SCATTER PLOTS
MODERATE POSITIVE RELATIONSHIP
- looking at a positive scatter plot gives the general impression that as one variable increases, so does the other
- this is said to be a positive relationship though it may not be exact
- if there was an exact relationship between two variables the points would lie along a straight line
MODERATE NEGATIVE RELATIONSHIP
- looking at a negative scatter plot gives the general impression that as one variable increases, the other decreases
- this is said to be a negative relationship between the two variables
- if a line were drawn through the points, it would have a negative slope
WEAK RELATIONSHIP
- if the scatter plot seems totally random, it would suggest there is no direct relationship between the two variables
NO CHANGE
- the scatter plot shows a linear pattern, but because the ‘line’ of points is horizontal, it would suggest that one variable has no bearing on the other, thus there is no relationship linking the variables
LINE OF BEST FIT
- for scatter plots that appear to show a relationship between the two variables, a line can be drawn that runs through the ‘middle’ of the plotted points
- the gradient of the line can be calculated by using two convenient points through which the line passes using the gradient formula (rise over run)
- the range, mode and median can be determined for each of the variables from a scatter plot by observation and counting
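The gradient calculation and the use of a line of best fit for prediction can be sketched in Python as follows. The two points on the line, (2, 5) and (8, 17), and the range of x values in the data (1 to 10) are hypothetical, chosen only to illustrate the method of reading two convenient points off a line drawn by eye.

```python
def line_through(p1, p2):
    """Return (gradient, y-intercept) of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    gradient = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - gradient * x1
    return gradient, intercept

# Two convenient points read off a hypothetical line of best fit.
gradient, intercept = line_through((2, 5), (8, 17))

def predict(x):
    """Predict y from x using the line of best fit."""
    return gradient * x + intercept

# Hypothetical smallest and largest x values in the collected data.
x_min, x_max = 1, 10

def kind_of_prediction(x):
    """Interpolation if x lies within the known data, otherwise extrapolation."""
    return "interpolation" if x_min <= x <= x_max else "extrapolation"

for x in (5, 12):
    print(x, predict(x), kind_of_prediction(x))
```

A prediction at x = 5 lies between known data values and is an interpolation; a prediction at x = 12 lies beyond them and is an extrapolation, so it should be treated with more caution.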
THE 6 STAGES OF DATA ANALYSIS
POSING QUESTIONS
- the first stage is to pinpoint the final information that will be needed in order to be able to draw a conclusion
- this involves coming up with questions that, if answered, would lead to meaningful information that would allow us to draw a conclusion and to make recommendations
COLLECTING DATA
- once we have posed questions, we need to collect data to answer them
- before we do the actual collecting, we have to decide on how we will collect the data, the type of data we will collect and the sources from which we can collect them
- the sources can be either primary or secondary
- it is important that the data to be collected are from reliable sources and not from some obscure website or outdated book, otherwise the data may not be accurate
- some reliable sources of note are government organisations such as the Australian Bureau of Statistics and the Bureau of Meteorology, which have strict data collection methodologies in place to ensure the accuracy and reliability of their data
ORGANISING DATA
- in the third stage, we arrange the data we have collected into a form that gives structure and order to the data
- a common way of accomplishing this is to use a table, e.g. a frequency table
- how this data will be organised will vary as a function of the nature of the statistical investigation
SUMMARISING AND DISPLAYING DATA
- once we have organised the data, we need to present the data in a form that will be easy to read, understand and analyse
- most often this will be accomplished by using a graph such as a column graph, bar graph, dot plot or line chart
- the particular type of graph to be used will depend on the purpose of the investigation
- besides displaying the data in a graph, it may also be beneficial to summarise the data using statistical quantities such as the mean, median, mode and range
ANALYSING DATA AND DRAWING CONCLUSIONS
- after we have finished summarising and displaying the data, it is time to examine and interpret the data, to decide what it means and ultimately to draw conclusions from it
- this may involve identifying the trends and patterns from the graph, and identifying how those trends and patterns change over time or across categories (such as across different populations). From these trends, we can then draw conclusions and possibly make predictions about future outcomes
WRITING A REPORT
- once we have finished analysing the data, it is time to put everything together in a written report
- the report should address the background and aim of the statistical inquiry and the questions it sought to answer, detail the data collection method (including sources and types of data), involve a thorough discussion of the findings, list and explain the reasoning behind the conclusions, and, if appropriate, include
FINANCIAL
PRINCIPAL – the original amount of money invested (or lent) for the purpose of earning interest
SIMPLE INTEREST – interest paid only on the original sum of money (principal) invested and not any interest earned by that sum
COMPOUND INTEREST – interest paid on the sum (principal) invested as well as any accumulated interest
INTEREST
- the payment made for the use of money invested (or borrowed)
- financial institutions (such as banks and credit unions) reward investors by paying them interest on their savings or investments
- conversely, when borrowing money, the borrower pays interest to the financial institution on that loan
- the original amount of money invested or borrowed is called the principal
SIMPLE INTEREST
- the interest paid on the original principal
- the interest calculated on the original investment amount or the amount borrowed
- the same interest is paid for each time period – also known as flat rate interest
- I = PRT
- I as the interest
- P as the principal
- R as the interest rate per period, expressed as a decimal
- T as the number of periods
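The formula I = PRT can be tried out with a short Python sketch. The figures used ($2000 invested at 5% p.a. for 3 years) are made up for illustration.

```python
def simple_interest(principal, rate, periods):
    """I = PRT, with the rate per period written as a decimal."""
    return principal * rate * periods

# Hypothetical example: $2000 at 5% p.a. (R = 0.05) for 3 years.
interest = simple_interest(2000, 0.05, 3)
print("simple interest =", round(interest, 2))
```

The same interest ($100) is earned in each of the three years, giving $300 in total.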
COMPOUND INTEREST
- simple interest is calculated only on the original amount (the principal) invested or borrowed and so the interest for each period remains the same
- for compound interest, the interest earned after one period is added to the principal so that, next time, the interest is
calculated on a larger principal
- this means more interest because we are also earning interest on the interest we have already earned
- the interest earned during one time period will then earn interest in the next time period
CALCULATION OF COMPOUND INTEREST
FIRST STEP OF CALCULATION
- $A = P(1 + R)^n$
- A as the total amount of the investment
- P as the principal
- R as the interest rate per period, expressed as a decimal
- n as the number of periods
SECOND STEP OF CALCULATION
- compound interest = final amount – principal
- I = A – P
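The two steps of the compound interest calculation can be sketched in Python as follows. The figures ($2000 invested at 5% p.a., compounded yearly for 3 years) are made up for illustration.

```python
def compound_amount(principal, rate, periods):
    """First step: A = P(1 + R)^n, with the rate per period as a decimal."""
    return principal * (1 + rate) ** periods

# Hypothetical example: $2000 at 5% p.a. (R = 0.05) for 3 years (n = 3).
principal = 2000
amount = compound_amount(principal, 0.05, 3)

# Second step: compound interest = final amount – principal.
interest = amount - principal

print("final amount      =", round(amount, 2))
print("compound interest =", round(interest, 2))
```

The compound interest of $315.25 is more than the $300 that simple interest would earn on the same figures, because interest is also earned on the interest already added.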
TERMS
- p.a. = per annum / yearly
- six-monthly / twice a year (divided by 2) = every 6 months
- quarterly (divided by 4) = every 3 months
- monthly (divided by 12)
- weekly (divided by 52)
- daily (divided by 365)
COMPARISON
- compounding six-monthly: R is divided by 2 and n is multiplied by 2
- compounding quarterly: R is divided by 4 and n is multiplied by 4
- compounding monthly: R is divided by 12 and n is multiplied by 12
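The adjustment for different compounding periods can be sketched in Python as follows: the yearly rate is divided by the number of periods in a year, and the number of years is multiplied by it. The figures ($1000 at 6% p.a. for 2 years) and the names used are made up for this example.

```python
# Number of compounding periods in one year.
PERIODS_PER_YEAR = {"yearly": 1, "six-monthly": 2, "quarterly": 4, "monthly": 12}

def compound_amount(principal, annual_rate, years, frequency):
    """A = P(1 + R)^n with R and n adjusted for the compounding period."""
    m = PERIODS_PER_YEAR[frequency]
    rate_per_period = annual_rate / m   # R is divided by m
    periods = years * m                 # n is multiplied by m
    return principal * (1 + rate_per_period) ** periods

# Hypothetical example: $1000 at 6% p.a. for 2 years.
for frequency in PERIODS_PER_YEAR:
    amount = compound_amount(1000, 0.06, 2, frequency)
    print(frequency, round(amount, 2))
```

For the same yearly rate, the more often interest is compounded, the larger the final amount, although the difference between the options is small.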