STP 420
INTRODUCTION TO APPLIED STATISTICS NOTES
PART 1 - DATA CHAPTER 1
LOOKING AT DATA - DISTRIBUTIONS
Individuals – objects described by a set of data (people, animals, things) - all the data for one individual make up a case
Variable – any characteristic of an individual (may take different values for different individuals).
Categorical variable – places an individual into one of several groups/categories.
Quantitative variable – takes numerical values for which arithmetic operations (adding/averaging) make sense.
Distribution – tells us what values a variable takes and how often these values are taken.
1.1 Displaying Distributions with Graphs
Exploratory data analysis – use statistical tools (graphs and numerical summaries) and ideas to help examine data and describe their main features
- examine each variable and the relationships among variables
- construct graphs and add numerical summaries
Graphs for categorical variables
Bar graph - the order of the bars is not important
Pie chart - must have all parts that make up the whole
Measuring the speed of light: Newcomb's experiment
Measurement – depends on the instrument used to make the measurement - consider whether the measurement is appropriate for the purpose
Variation – differences in measurements may be due to many factors
Distribution - the pattern of variation of a variable
The distribution of a quantitative variable records its numerical values and how often each value occurs
Stemplot – gives quick picture of a distribution while including the actual numerical values in the graph
1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. A stem may have 1, 2, or more digits, but a leaf has only one digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
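The three steps above can be sketched in Python. This is a minimal illustration, assuming the observations are non-negative whole numbers; the function name `stemplot` and the data are made up for the example.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot for non-negative whole numbers."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)      # stem: all but the last digit; leaf: last digit
        leaves[stem].append(leaf)
    rows = []
    # Stems run from smallest (top) to largest; empty stems are kept.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}")
    return rows

data = [12, 15, 21, 24, 24, 30, 47]     # hypothetical observations
for row in stemplot(data):
    print(row)
```

Each printed row shows a stem, a vertical bar, and the leaves for that stem, so the original values can still be read off the plot.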
Back-to-back stemplot – uses one column of stems and two sets of leaves, one on either side of the stems; helps to compare two data sets.
The number of stems can be doubled by splitting each stem in two: one with leaves 0 to 4 and the other with leaves 5 to 9.
It is a good idea to round numbers to only a few digits before making a stemplot (some accuracy in the measurements is lost).
Examining a distribution
1. In any graph of data, look for the overall pattern and for striking deviations from that pattern.
2. The overall pattern of a distribution can be described by its shape, center, and spread.
3. An important kind of deviation is an outlier, an individual value that falls outside the overall pattern.
Mode(s) – observation(s) that occurs most often - shown by the major peak(s) in the graph
Unimodal – distribution with one major peak
Symmetric distribution – values smaller and larger than its midpoint are mirror images of each other
Skewed to the right – right tail (larger values) longer than left tail (smaller values)
Skewed to the left – left tail (smaller values) longer than right tail (larger values)
Histogram – breaks the range of values of a variable into intervals (of equal width) and displays only the count (frequency) or percent (relative frequency) of the observations that fall into each interval
Frequency table – table showing the intervals with their respective frequencies/relative frequencies
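A frequency table of this kind can be built with a short Python sketch. The function name `frequency_table`, the interval choice, and the data are assumptions made for illustration; each interval includes its left endpoint and excludes its right endpoint.

```python
def frequency_table(values, start, width, n_intervals):
    """Count observations in equal-width intervals [lo, hi)."""
    counts = [0] * n_intervals
    for v in values:
        k = int((v - start) // width)       # index of the interval holding v
        if 0 <= k < n_intervals:
            counts[k] += 1
    n = len(values)
    table = []
    for k, count in enumerate(counts):
        lo = start + k * width
        # (lower limit, upper limit, frequency, relative frequency in percent)
        table.append((lo, lo + width, count, 100 * count / n))
    return table

data = [3, 7, 8, 12, 13, 14, 18, 21, 22, 29]    # hypothetical observations
table = frequency_table(data, start=0, width=10, n_intervals=3)
for lo, hi, count, percent in table:
    print(f"{lo:>3} to <{hi:<3} {count:>3} {percent:6.1f}%")
```

A histogram draws one bar per row of this table, with the bar height equal to the frequency or the relative frequency.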
Roundoff error – may sometimes be significant
Looking at data - a histogram can help to show the shape, center, and spread of a distribution, including any outliers
Time plots – plotting the measurements in the order that they are observed (over time).
Time series – measurements of a variable taken at regular intervals over time - examples: economic/social data
Seasonal variation – a pattern in a time series that repeats itself at known regular intervals of time
Trend – persistent long-term rise or fall
Monthly consumer price index for some product
Index number – nationwide average price (less variable than the price at any one store that may from time to time offer special prices)
Seasonally adjusted – the expected seasonal variation has been removed from the data; helps to avoid misinterpretation, especially over short periods of time.
Decomposing time series
Statistical software programs can help to examine a time series by decomposing the data into systematic patterns, such as trends and seasonal variation, and the residuals that remain after we remove these patterns.
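The idea of a decomposition can be sketched in Python. This is only an illustration under simplifying assumptions: the trend is taken to be a least-squares straight line and the seasonal effect is the average leftover at each position in the cycle. Statistical packages generally use more refined methods. The function name `decompose` and the quarterly data are made up.

```python
def decompose(series, period):
    """Split a series into straight-line trend + seasonal part + residuals."""
    n = len(series)
    t = list(range(n))
    t_bar = sum(t) / n
    y_bar = sum(series) / n
    # Least-squares slope of the series against time.
    slope = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, series))
             / sum((ti - t_bar) ** 2 for ti in t))
    trend = [y_bar + slope * (ti - t_bar) for ti in t]
    detrended = [y - tr for y, tr in zip(series, trend)]
    # Seasonal effect: mean detrended value at each position in the cycle.
    seasonal_means = []
    for k in range(period):
        same_season = detrended[k::period]
        seasonal_means.append(sum(same_season) / len(same_season))
    seasonal = [seasonal_means[i % period] for i in range(n)]
    residual = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, residual

# Hypothetical quarterly data: a rising level with a repeating pattern.
sales = [10, 14, 9, 12, 12, 16, 11, 14, 14, 18, 13, 16]
trend, seasonal, residual = decompose(sales, period=4)
```

By construction, adding the three parts back together reproduces the original series, so the residuals show what the trend and seasonal pattern leave unexplained.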
1.2 Describing Distributions with numbers
Measures of center
1. Mean: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n} \sum x_i$
2. Median = M
The median is the midpoint of the distribution, the number such that half the observations are smaller and the other half are larger.
To find the median:
1. Arrange the observations in increasing order.
2. If the number of observations n is odd, the median is the center observation at the position (n+1)/2 in the ordered list.
3. If the number of observations n is even, the median is the mean of the two center observations in the ordered list. Its location is again (n+1)/2, which now falls halfway between the two center positions.
The mean is sensitive to extreme observations, whereas the median is largely unaffected by them; hence the median is called a resistant measure and the mean is not resistant.
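A short Python sketch of the two measures of center, following the median steps above, shows the difference in resistance. The data values are made up for illustration.

```python
def mean(xs):
    """Arithmetic mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

def median(xs):
    """Midpoint of the ordered observations."""
    ordered = sorted(xs)                # step 1: arrange in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # step 2: n odd, take the center observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2    # step 3: n even

data = [2, 3, 5, 7, 8]                  # hypothetical observations
with_outlier = [2, 3, 5, 7, 80]         # same data with one extreme value

print(mean(data), median(data))
print(mean(with_outlier), median(with_outlier))
```

Replacing 8 by 80 moves the mean from 5 to 19.4, while the median stays at 5.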
Measuring spread: Quartiles
Quartiles divide the distribution into 4 equal parts
To calculate the quartiles:
1. Arrange the observations in increasing order and find the median M (the same as $Q_2$, the second quartile). 50% of the observations are to its left.
2. The first quartile ($Q_1$) is the median of the observations to the left of the median. 25% of the observations are to its left.
3. The third quartile ($Q_3$) is the median of the observations to the right of the median. 75% of the observations are to its left.
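The quartile steps above can be sketched in Python. One assumption is made explicit here: when n is odd, the overall median itself is left out of both halves. Software packages use several slightly different quartile conventions, so their results may differ a little from this hand method. The data are made up.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(xs):
    """Return (Q1, M, Q3) using the median-of-halves method."""
    ordered = sorted(xs)
    n = len(ordered)
    lower = ordered[: n // 2]           # observations to the left of the median
    upper = ordered[(n + 1) // 2 :]     # observations to the right of the median
    return median(lower), median(ordered), median(upper)

data = [1, 3, 4, 6, 7, 9, 10, 12, 15]   # hypothetical observations
q1, m, q3 = quartiles(data)
print(q1, m, q3)
```

For these nine values the halves are 1, 3, 4, 6 and 9, 10, 12, 15, so the quartiles are 3.5 and 11 and the median is 7.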
Percentiles divide the distribution into 100 equal parts:
25th percentile = $Q_1$
50th percentile = $Q_2$ = M
75th percentile = $Q_3$
Range is the highest score minus the lowest score.
Interquartile range is the third quartile minus the first quartile.
IQR = $Q_3 - Q_1$
An observation is a suspected outlier if it falls more than 1.5 × IQR above $Q_3$ or below $Q_1$.
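The 1.5 × IQR rule can be checked with a few lines of Python. In this sketch the quartiles are passed in as already known; the data and the quartile values 3.5 and 11 are a made-up example, with the quartiles worked out by hand using the median-of-halves method.

```python
def suspected_outliers(xs, q1, q3):
    """Return the observations flagged by the 1.5 x IQR rule."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr          # anything below this is suspect
    high_fence = q3 + 1.5 * iqr         # anything above this is suspect
    return [x for x in xs if x < low_fence or x > high_fence]

data = [1, 3, 4, 6, 7, 9, 10, 12, 40]   # hypothetical observations
flagged = suspected_outliers(data, q1=3.5, q3=11.0)
print(flagged)
```

Here IQR = 7.5, so the cutoffs are 3.5 - 11.25 = -7.75 and 11 + 11.25 = 22.25, and only the value 40 is flagged.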
The five-number summary consists of
Minimum, $Q_1$, M = $Q_2$, $Q_3$, Maximum
in the given order.
Boxplot – graph of the five number summary with suspected outliers plotted individually - useful in comparing distributions
1. A central box spans the quartiles.
2. A line in the box marks the median.
3. Observations more than 1.5 × IQR above $Q_3$ or below $Q_1$ are plotted individually as suspected outliers.
4. Lines extend from the box out to the smallest and largest observations that are not suspected outliers.
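The numbers a boxplot is drawn from can be computed with a small Python sketch. It assumes the median-of-halves method for the quartiles; the function name `boxplot_parts` and the data are made up for illustration, and no plotting library is used.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def boxplot_parts(xs):
    """Five-number summary, whisker ends, and suspected outliers."""
    ordered = sorted(xs)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    m = median(ordered)
    q3 = median(ordered[(n + 1) // 2 :])
    step = 1.5 * (q3 - q1)
    inside = [x for x in ordered if q1 - step <= x <= q3 + step]
    outliers = [x for x in ordered if x < q1 - step or x > q3 + step]
    return {
        "five_number": (ordered[0], q1, m, q3, ordered[-1]),
        "whiskers": (inside[0], inside[-1]),    # ends of the lines from the box
        "outliers": outliers,                   # plotted as individual points
    }

parts = boxplot_parts([1, 3, 4, 6, 7, 9, 10, 12, 40])   # hypothetical data
print(parts)
```

In this example the box runs from 3.5 to 11 with the median line at 7, the lines extend to 1 and 12, and 40 is plotted on its own.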
The variance $s^2$ of a set of observations is the average of the squares of the deviations of the observations from their mean.
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
Hence, the standard deviation is
$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
$x_1$ to $x_n$ are the observations and n - 1 is the degrees of freedom.
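As a check on the definitions of the variance and standard deviation, here is a minimal Python sketch that divides the sum of squared deviations by n - 1. The data are made up; the standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and are called only for comparison.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical observations, mean 5
s2 = sample_variance(data)              # squared deviations sum to 32, so s2 = 32/7
s = math.sqrt(s2)

print(s2, statistics.variance(data))
print(s, statistics.stdev(data))
```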
Properties
1. s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
2. s = 0 only when there is no spread, that is, when all observations have the same value. Otherwise s > 0, and s becomes larger as the observations become more spread out about the mean.
3. s, like the mean, is not resistant. A few outliers can make s very large.
A linear transformation changes the original variable x into a new variable $x_{new} = a + bx$ (the equation of a straight line).
- The constant a shifts all the values of x upward or downward by a units.
- The positive constant b changes the size of the unit of measurement.
- Linear transformations do not change the shape of a distribution.
Effect of a linear transformation
To see the effects of a linear transformation on measures of center and spread, apply these rules:
1. Multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (interquartile range and standard deviation) by b.
2. Adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles and other percentiles but does not change measures of spread.
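The two rules can be verified numerically. The sketch below uses the Celsius-to-Fahrenheit conversion (a = 32, b = 1.8) as an assumed example of a linear transformation; the temperatures are made up.

```python
import statistics

def summary(xs):
    """Return (mean, median, standard deviation) of the observations."""
    return statistics.mean(xs), statistics.median(xs), statistics.stdev(xs)

a, b = 32, 1.8                              # x_new = a + b * x
celsius = [10, 15, 20, 25, 40]              # hypothetical temperatures
fahrenheit = [a + b * x for x in celsius]

mean_c, median_c, sd_c = summary(celsius)
mean_f, median_f, sd_f = summary(fahrenheit)

# Center is multiplied by b and shifted by a; spread is only multiplied by b.
print(mean_f, a + b * mean_c)
print(median_f, a + b * median_c)
print(sd_f, b * sd_c)
```

Each printed pair should agree up to floating-point rounding: adding a moves the mean and median but leaves the standard deviation alone.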
1.3 The normal distributions
Strategy for exploring data
1. Always plot data (stemplot or histogram)
2. Look for the overall pattern and striking deviations (outliers).
3. Calculate a numerical summary to describe center and spread.
4. Draw a smooth curve approximately through the tops of the bars in the histogram.
A density curve is a curve that
1. is always on or above the horizontal axis
2. has area exactly 1 underneath it
It describes the overall pattern of a distribution.
The area under the curve and above any range of values is the relative frequency of all observations that fall in that range.
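To make the link between area and relative frequency concrete, here is a small Python sketch. The density f(x) = 2x on the interval from 0 to 1 is a hypothetical example chosen because its areas are easy to check by hand; areas are approximated by summing thin rectangles.

```python
def density(x):
    """Hypothetical density curve: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0

def area(lo, hi, steps=1000):
    """Approximate the area under the density between lo and hi."""
    h = (hi - lo) / steps
    # Midpoint rule: rectangle heights are taken at the middle of each strip.
    return sum(density(lo + (k + 0.5) * h) for k in range(steps)) * h

total = area(0, 1)            # the whole curve: area 1
upper_half = area(0.5, 1)     # relative frequency of values between 0.5 and 1
print(total, upper_half)
```

The total area is 1, as required of a density curve, and the area above the range 0.5 to 1 is 0.75, so 75% of the observations fall in that range.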
Measuring center and spread for density curves
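If the density curve is symmetric, the mean and median are the same x value, at the center of symmetry; if the curve also has a single peak, the mode (the x value at the highest peak) is there as well.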
Median and mean of a density curve
1. The median has an area of 0.5 on each side.
2. The mean is the balance point.
3. If skewed to the right, the measures are in the order mode, median and mean (the mean is pulled to the right)
If skewed to the left, the measures are in the order mean, median and mode (the mean is pulled to the left)
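The ordering for a left-skewed curve can be checked numerically. This sketch assumes a hypothetical density f(x) = 2x on the interval from 0 to 1, which has its peak at the right end and a long tail to the left; the helper name `integrate` is made up.

```python
def density(x):
    """Hypothetical left-skewed density: f(x) = 2x on [0, 1]."""
    return 2 * x

def integrate(g, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the area under g between lo and hi."""
    h = (hi - lo) / steps
    return sum(g(lo + (k + 0.5) * h) for k in range(steps)) * h

# Mean: the balance point, the integral of x times the density.
mean = integrate(lambda x: x * density(x), 0, 1)

# Median: bisect for the point that has area 0.5 to its left.
lo, hi = 0.0, 1.0
for _ in range(50):
    mid = (lo + hi) / 2
    if integrate(density, 0, mid) < 0.5:
        lo = mid
    else:
        hi = mid
median = (lo + hi) / 2

mode = 1.0      # f(x) = 2x is highest at the right end of the interval
print(mean, median, mode)
```

The mean is about 0.667 and the median about 0.707, so the measures fall in the order mean, median, mode, with the mean pulled toward the long left tail.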
The mean of a population (idealized distribution) is µ
The standard deviation of a population (idealized distribution) is σ
The normal curve has equation:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$$
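The normal curve with mean µ and standard deviation σ can be evaluated directly in Python. This is a minimal sketch; the function name `normal_pdf` and the example values µ = 100, σ = 15 are assumptions, and the standard library class `statistics.NormalDist` is used only as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal density curve at x."""
    z = (x - mu) / sigma                    # distance from the mean in units of sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 100.0, 15.0                     # hypothetical mean and standard deviation

# The curve is symmetric about mu and agrees with the library value.
print(normal_pdf(85, mu, sigma), normal_pdf(115, mu, sigma))
print(normal_pdf(110, mu, sigma), NormalDist(mu, sigma).pdf(110))

# Area under the curve within 6 sigma of the mean, by the midpoint rule.
steps = 10_000
h = 12 * sigma / steps
total_area = sum(normal_pdf(mu - 6 * sigma + (k + 0.5) * h, mu, sigma)
                 for k in range(steps)) * h
print(total_area)
```

The computed area is very close to 1, consistent with the normal curve being a density curve.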