STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES



PART 1 - DATA

CHAPTER 1

LOOKING AT DATA - DISTRIBUTIONS

Individuals – the objects described by a set of data (people, animals, things). All the data for one individual make up a case.

Variable – any characteristic of an individual (may take different values for different individuals).

Categorical variable – places an individual into one of several groups/categories.

Quantitative variable – takes numerical values for which arithmetic operations (adding/averaging) make sense.

Distribution – tells us what values a variable takes and how often these values are taken.
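As a small illustration of what a distribution records, the sketch below tallies the values of a categorical variable and how often each occurs. The data (`eye_colors`) and the function name `distribution` are made up for this example.

```python
from collections import Counter

def distribution(values):
    """Return {value: (count, percent)} for the observed values."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, 100 * c / n) for v, c in counts.items()}

# Hypothetical data: eye colour recorded for 8 individuals
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]
dist = distribution(eye_colors)

# Print each category with its count (frequency) and percent (relative frequency)
for value, (count, percent) in sorted(dist.items()):
    print(f"{value:6s} count={count} percent={percent:.1f}")
```

The counts are what a bar graph would display; the percents are what a pie chart would display.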

1.1 Displaying Distributions with Graphs

Exploratory data analysis – uses statistical tools (graphs and numerical summaries) and ideas to examine data and describe their main features.

- examine each variable and the relationships among variables
- construct graphs and add numerical summaries

Graphs for categorical variables

Bar graph – the order of the bars is not important.

Pie chart – must include all the parts that make up the whole.


Measuring the speed of light: Newcomb's experiment

Measurement – depends on the instrument used to make the measurement; consider whether the measurement is appropriate for the purpose.

Variation – differences among measurements may be due to many factors.

Distribution – the pattern of variation of a variable.

The distribution of a quantitative variable records its numerical values and how often each value occurs

Stemplot – gives quick picture of a distribution while including the actual numerical values in the graph

1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.

2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.

3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

Back-to-back stemplot – uses one column of stems and two sets of leaves, one on either side of the stems; it helps in comparing two data sets.

The number of stems can be doubled by splitting each stem in two: one with leaves 0 to 4 and the other with leaves 5 to 9.

It is a good idea to round numbers to only a few digits before making a stemplot (at the cost of some accuracy in the measurements).
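The stemplot steps can be sketched in a few lines of Python. This is only an illustration, assuming non-negative whole-number data; the function name `stemplot` and the data are hypothetical.

```python
from collections import defaultdict

def stemplot(data):
    """Return the lines of a stemplot for non-negative integers."""
    leaves = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)   # stem: all but the last digit; leaf: last digit
        leaves[stem].append(leaf)    # sorted input keeps leaves in increasing order
    lines = []
    # List every stem from smallest to largest, even those with no leaves
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")
    return lines

scores = [23, 25, 31, 31, 38, 40, 47, 62]   # hypothetical data
plot = stemplot(scores)
print("\n".join(plot))
```

Keeping the empty stem 5 in the display preserves the gap in the data, so the shape of the distribution is not distorted.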

Examining a distribution

1. In any graph of data, look for the overall pattern and for striking deviations from that pattern.

2. Describe the overall pattern of a distribution by its shape, center, and spread.

3. An outlier is an important kind of deviation: an individual value that falls outside the overall pattern.

Mode(s) – the observation(s) that occur most often, shown by the major peak(s) in the graph.

Unimodal – a distribution with one major peak.

Symmetric distribution – values smaller and larger than its midpoint are mirror images of each other

Skewed to the right – the right tail (larger values) is longer than the left tail (smaller values).

Skewed to the left – the left tail (smaller values) is longer than the right tail (larger values).

Histogram – breaks the range of values of a variable into intervals (of equal width) and displays only the count (frequency) or percent (relative frequency) of the observations that fall into each interval

Frequency table – table showing the intervals with their respective frequencies/relative frequencies
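A frequency table for equal-width intervals can be built as in the following sketch. The interval convention (each interval includes its left endpoint but not its right), the function name, and the data are assumptions made for the example.

```python
def frequency_table(data, start, width, n_classes):
    """Count observations in equal-width intervals [lo, hi)."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)   # index of the interval containing x
        if 0 <= k < n_classes:          # values outside the table are ignored
            counts[k] += 1
    n = len(data)
    return [(start + k * width, start + (k + 1) * width, c, 100 * c / n)
            for k, c in enumerate(counts)]

data = [12, 15, 21, 22, 28, 33, 35, 36, 39, 47]   # hypothetical data
table = frequency_table(data, start=10, width=10, n_classes=4)

# A text histogram: one star per observation in each interval
for lo, hi, count, percent in table:
    print(f"{lo}-{hi}: {'*' * count} count={count} percent={percent:.0f}%")
```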

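Roundoff error – may sometimes be significant; for example, rounded percents in a frequency table need not add to exactly 100%.

Looking at data – a histogram helps to show the shape, center, and spread of a distribution, and any outliers.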

Time plots – plotting the measurements in the order that they are observed (over time).

Time series – measurements of a variable taken at regular intervals over time - examples: economic/social data

Seasonal variation – a pattern in a time series that repeats itself at known regular intervals of time

Trend – persistent long-term rise or fall

Monthly consumer price index for some product


Index number – nationwide average price (less variable than the price at any one store that may from time to time offer special prices)

Seasonally adjusted – helps to avoid misinterpretation especially for short periods of time.

Decomposing time series

Statistical software programs can help to examine a time series by decomposing the data into systematic patterns, such as trends and seasonal variation, and the residuals that remain after we remove these patterns.
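To give a rough idea of what such a decomposition involves, here is a crude additive sketch: the trend is estimated by a centered moving average over one full cycle, the seasonal effect by averaging the detrended values for each position in the cycle, and the residual is what is left. This is not the method of any particular software package; the function `decompose`, the even period, and the artificial quarterly series are all assumptions for illustration.

```python
def decompose(y, period=4):
    """Additive decomposition y = trend + seasonal + residual (even period)."""
    half = period // 2
    n = len(y)
    trend = [None] * n                       # undefined near the two ends
    for t in range(half, n - half):
        window = y[t - half:t + half + 1]
        # Centered moving average: the two end points get half weight
        total = sum(window) - 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    # Seasonal effect: average detrended value for each position in the cycle
    seasonal = []
    for k in range(period):
        d = [y[t] - trend[t] for t in range(k, n, period) if trend[t] is not None]
        seasonal.append(sum(d) / len(d))
    residual = [None if trend[t] is None else y[t] - trend[t] - seasonal[t % period]
                for t in range(n)]
    return trend, seasonal, residual

# Artificial quarterly series: linear trend plus a repeating seasonal pattern
pattern = [3, -1, -2, 0]
y = [10 + 2 * t + pattern[t % 4] for t in range(12)]
trend, seasonal, residual = decompose(y, period=4)
print(seasonal)
```

Because the artificial series has no noise, the sketch recovers the trend and seasonal pattern and leaves residuals of essentially zero; real data would leave nonzero residuals.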

1.2 Describing Distributions with Numbers

Measures of center

1. Mean:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum x_i$$

2. Median = M

The median is the midpoint of the distribution, the number such that half the observations are smaller and the other half are larger.

To find the median:

1. Arrange the observations in increasing order.

2. If the number of observations n is odd, the median is the center observation at the position (n+1)/2 in the ordered list.

3. If the number of observations n is even, the median is the mean of the two center observations in the ordered list. The location (n+1)/2 from step 2 then falls halfway between these two observations.

The mean is affected by extreme observations whereas the median is affected little, if at all; hence the median is called a resistant measure of center and the mean is not resistant.
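The difference in resistance is easy to see numerically. The following sketch uses Python's standard `statistics` module on made-up salary data, with one extreme observation added.

```python
from statistics import mean, median

salaries = [30, 32, 35, 38, 40]      # hypothetical data, in thousands of dollars
with_outlier = salaries + [400]      # the same data plus one extreme observation

# Without the outlier the mean and median agree
print(mean(salaries), median(salaries))

# With the outlier the mean jumps, but the median barely moves
print(mean(with_outlier), median(with_outlier))
```

With n = 6 the median location (n+1)/2 = 3.5 falls between the third and fourth ordered observations, so the median is the mean of 35 and 38.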

Measuring spread: Quartiles

Quartiles divide the distribution into 4 equal parts


To calculate the quartiles:

1. Arrange the observations in increasing order and find the median (the same as Q2, the second quartile). 50% of the observations are to its left.

2. The first quartile (Q1) is the median of the observations on the left of the median. 25% of the observations are to its left.

3. The third quartile (Q3) is the median of the observations on the right of the median. 75% of the observations are to its left.
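The quartile steps can be sketched as follows. The sketch assumes that, when n is odd, the median observation itself is left out of both halves; the function name `quartiles` and the data are hypothetical. Software often uses other conventions (Python's own `statistics.quantiles`, for instance), so its answers may differ slightly.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]          # observations to the left of the median
    upper = xs[(n + 1) // 2 :]    # observations to the right of the median
    return median(lower), median(xs), median(upper)

odd_data = [7, 1, 3, 9, 5, 11, 13]           # n = 7
even_data = [1, 3, 5, 7, 9, 11, 13, 15]      # n = 8
print(quartiles(odd_data))
print(quartiles(even_data))
```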

Percentiles divide the distribution into 100 equal parts:

- 25th percentile = Q1
- 50th percentile = Q2 = M
- 75th percentile = Q3

The range is the largest observation minus the smallest observation.

The interquartile range is the third quartile minus the first quartile:

IQR = Q3 – Q1

An observation is a suspected outlier if it falls more than 1.5 × IQR above Q3 or below Q1.

The five-number summary consists of

Minimum, Q1, M = Q2, Q3, Maximum

in the given order.

Boxplot – a graph of the five-number summary with suspected outliers plotted individually; useful for comparing distributions.

1. A central box spans the quartiles Q1 and Q3.

2. A line in the box marks the median.

3. Observations more than 1.5 × IQR above Q3 or below Q1 are plotted as individual outliers.

4. Lines extend from the box out to the smallest and largest observations that are not suspected outliers.
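The numbers needed to draw such a boxplot can be computed as in this sketch: the five-number summary, the suspected outliers under the 1.5 × IQR rule, and the ends of the two lines extending from the box. The function name `boxplot_stats`, the median-of-halves quartile convention, and the data are assumptions for illustration.

```python
from statistics import median

def boxplot_stats(data):
    """Five-number summary, whisker ends, and suspected outliers."""
    xs = sorted(data)
    n = len(xs)
    q1 = median(xs[: n // 2])            # median of the lower half
    m = median(xs)
    q3 = median(xs[(n + 1) // 2 :])      # median of the upper half
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "five_number": (xs[0], q1, m, q3, xs[-1]),
        "whiskers": (inside[0], inside[-1]),   # extreme non-outlier observations
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }

data = [2, 4, 5, 6, 7, 8, 9, 30]   # hypothetical data
stats = boxplot_stats(data)
print(stats)
```

Here IQR = 8.5 – 4.5 = 4, so anything above 8.5 + 6 = 14.5 or below 4.5 – 6 = –1.5 is flagged; only 30 qualifies, and the upper line stops at 9.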


The variance s² of a set of observations is the average of the squares of the deviations of the observations from their mean.

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$

Hence, the standard deviation is

$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$

where $x_1$ to $x_n$ are the observations and n – 1 is the degrees of freedom.
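A direct translation of the variance and standard deviation formulas, checked against Python's standard `statistics` module, might look like this. The function name `sample_variance` and the data are made up for the example.

```python
from math import sqrt
from statistics import stdev, variance

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical data; mean is 5
s2 = sample_variance(data)          # squared deviations sum to 32, so s2 = 32/7
s = sqrt(s2)

# The library functions use the same n - 1 divisor
print(s2, variance(data))
print(s, stdev(data))
```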

Properties

1. s measures spread about the mean and should be used only when the mean is chosen as the measure of center.

2. s = 0 only when there is no spread, that is, when all observations have the same value. Otherwise s > 0, and s grows as the observations become more spread out about the mean.

3. s, like the mean, is not resistant. A few outliers can make s very large.

A linear transformation changes the original variable x into a new variable

x_new = a + bx (the equation of a straight line).

- The constant a shifts all the values of x by a units (upward if a is positive, downward if a is negative).
- The positive constant b changes the size of the unit of measurement.
- Linear transformations do not change the shape of a distribution.

Effect of a linear transformation

To see the effects of a linear transformation on measures of center and spread, apply these rules:

1. Multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (interquartile range and standard deviation) by b.

2. Adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles and other percentiles but does not change measures of spread.
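Both rules can be checked numerically. The sketch below converts made-up Celsius temperatures to Fahrenheit, a linear transformation with a = 32 and b = 1.8, and compares the summaries before and after.

```python
from statistics import mean, median, stdev

celsius = [10.0, 15.0, 20.0, 25.0, 40.0]      # hypothetical temperatures
a, b = 32.0, 1.8                              # x_new = a + b*x
fahrenheit = [a + b * x for x in celsius]

# Measures of center are multiplied by b and then shifted by a
print(mean(fahrenheit), a + b * mean(celsius))
print(median(fahrenheit), a + b * median(celsius))

# Measures of spread are multiplied by b; adding a has no effect on them
print(stdev(fahrenheit), b * stdev(celsius))
```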

1.3 The Normal Distributions

Strategy for exploring data

1. Always plot the data (stemplot or histogram).

2. Look for the overall pattern and striking deviations (outliers).

3. Calculate a numerical summary to describe center and spread.

4. Draw a smooth curve approximately through the tops of the bars in the histogram.

A density curve is a curve that

1. is always on or above the horizontal axis

2. has area exactly 1 underneath it

It describes the overall pattern of a distribution.

The area under the curve and above any range of values is the relative frequency of all observations that fall in that range.

Measuring center and spread for density curves

If the density curve is symmetric with a single peak, the mean, median, and mode all lie at the same x value, under the highest point of the curve.

Median and mean of a density curve

1. The median is the point with area 0.5 on each side.

2. The mean is the balance point of the curve.

3. If the curve is skewed to the right, the measures fall in the order mode, median, mean (the mean is pulled to the right). If it is skewed to the left, they fall in the order mean, median, mode (the mean is pulled to the left).

The mean of a population (idealized distribution) is µ


The standard deviation of a population (idealized distribution) is σ

The normal curve has equation:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$$
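As a check on the formula, the following sketch evaluates the normal curve directly, compares it with `statistics.NormalDist.pdf` from the Python standard library, and approximates the total area under the curve with a simple midpoint sum. The function name `normal_density` and the example values (mean 100, standard deviation 15) are assumptions.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

# Compare with the standard library for a hypothetical N(100, 15) curve
print(normal_density(110, 100, 15), NormalDist(100, 15).pdf(110))

# Midpoint sum of the standard normal curve from -8 to 8: area close to 1
step = 0.01
area = sum(normal_density(-8 + (k + 0.5) * step, 0, 1) * step for k in range(1600))
print(area)
```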

The 68-95-99.7 rule

In the normal distribution with mean µ and standard deviation σ:

1. Approximately 68% of the observations fall within σ of the mean µ.

2. Approximately 95% of the observations fall within 2σ of the mean µ.

3. Approximately 99.7% of the observations fall within 3σ of the mean µ.
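The three percentages can be reproduced from the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library; the variable names are arbitrary.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal distribution N(0, 1)

# Area within k standard deviations of the mean, for k = 1, 2, 3
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} standard deviation(s): {100 * area:.2f}%")
```

The exact areas are slightly different from 68, 95, and 99.7 percent, which is why the rule is stated as an approximation.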

Standardizing observations

If x is an observation from a distribution that has mean µ and standard deviation σ, the standardized value of x is

$$z = \frac{x - \mu}{\sigma}$$

called a z-score.

Standard normal distribution - N(0, 1): mean 0 and standard deviation 1

If the variable X has any normal distribution N(µ, σ) with mean µ and standard deviation σ, then the standardized variable

$$Z = \frac{X - \mu}{\sigma}$$

has a standard normal distribution.

The standard normal table gives the area under the curve to the left of the z-score value.

This is often interpreted as a probability.

It is important that all X variables are standardized in order to use the standard normal tables to compute probabilities.
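In software, the table lookup is replaced by the cumulative distribution function. The sketch below standardizes an observation and finds the area to its left, using `statistics.NormalDist` from the Python standard library; the N(100, 15) distribution and the observation 130 are made-up values.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardized value of x."""
    return (x - mu) / sigma

mu, sigma = 100, 15                 # hypothetical N(100, 15) distribution
z = z_score(130, mu, sigma)         # 130 is two standard deviations above the mean

# Area to the left of z under the standard normal curve (the table entry)
p_left = NormalDist().cdf(z)
print(z, round(p_left, 4))

# Same probability computed without standardizing first
print(NormalDist(mu, sigma).cdf(130))
```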

Normal quantile plot

- a very sensitive way to assess normality, but not easily done by hand
- computer software allows us to construct an accurate plot quickly

If the points on a normal quantile plot lie close to a straight line, the plot indicates that the data are normal. Systematic deviations from a straight line indicate a nonnormal distribution. Outliers appear as points that are far away from the overall pattern of the plot.

To construct the normal quantile plot

1. Arrange the observed data values from smallest to largest. Record what percentile of the data each value occupies. E.g., for 20 observations, the first is at the 5% point, the next is at the 10% point, and so on.

2. Find the z-score for each of the percentiles. E.g., z = -1.645 is the 5% point of the standard normal distribution.

3. Plot each data point x against the corresponding z. If the data distribution is close to standard normal, the plotted points will lie close to the 45° line.
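The points of a normal quantile plot can be computed as in the sketch below, which uses `statistics.NormalDist.inv_cdf` from the Python standard library. One deliberate departure from the steps as written: the sketch places the i-th ordered value at the (i – 0.5)/n point rather than i/n, an assumption made here so that the largest value is not assigned the 100% point, whose z-score is infinite. The function name and data are hypothetical, and no plotting library is used.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return (z, x) pairs for a normal quantile plot."""
    xs = sorted(data)
    n = len(xs)
    std = NormalDist()   # standard normal N(0, 1)
    # Assumption: percentile of the i-th ordered value is (i - 0.5)/n
    zs = [std.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(zs, xs))

points = normal_quantile_points([5, 1, 3])   # hypothetical data
for z, x in points:
    print(f"z = {z:6.3f}   x = {x}")

# The 5% point of the standard normal distribution
print(round(NormalDist().inv_cdf(0.05), 3))
```

Plotting x against z for these pairs gives the normal quantile plot; points close to a straight line suggest the data are close to normal.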
