STAT355 - Probability & Statistics

(1)

Instructor: Kofi Placid Adragni

Fall 2011

(2)

Chap 1 - Overview and Descriptive Statistics

1.1 Populations, Samples, and Processes

1.2 Pictorial and Tabular Methods in Descriptive Statistics

1.3 Measures of Location

1.4 Measures of Variability

(3)

1.1 Populations, Samples, ...

I The discipline of statistics provides methods for organizing and summarizing data and for drawing conclusions based on information contained in the data.

I An investigation will typically focus on a well-defined collection of objects constituting a population of interest.

I When desired information is available for all objects in the population, we have what is called a census.

I Often, a census is impractical or infeasible. A sample (a subset of the population) is selected in some prescribed manner.

(4)

1.1 Populations, Samples, ...

I A variable is any characteristic whose value may change from one object to another in the population.

I Data result from making observations either on a single variable or simultaneously on two or more variables.

I A univariate data set consists of observations on a single variable.

I We have bivariate data when observations are made on each of two variables.

(5)

1.1 Populations, Samples, ...

I Descriptive statistics help summarize and describe important features of the data.

I Some are graphical in nature: histograms, boxplots, scatter plots, ...

I Others involve calculation of numerical summary measures, such as means, standard deviations, and correlation coefficients.

I Software: R, S-Plus, Minitab, SAS, SPSS, JMP, Stata, ...

(6)

1.1 Populations, Samples, ...

Scope of Modern Statistics

I molecular biology (analysis of microarray data, SNPs,...)

I ecology (describing quantitatively how individuals in various animal and plant populations are spatially distributed)

I materials engineering (studying properties of various treatments to retard corrosion)

I marketing (developing market surveys and strategies for marketing new products)

I public health (identifying sources of diseases and ways to treat them)

I civil engineering (assessing the effects of stress on structural elements and the impacts of traffic flows on communities)

I ...

Meanwhile, statisticians continue to develop new models for describing randomness and uncertainty, and new methodology for analyzing data.

(7)

1.1 Populations, Samples, ...

Data Collecting

I Statistics deals not only with the organization and analysis of data once it has been collected but also with the development of techniques for collecting the data.

I Data not properly collected may be useless and misleading.

I An appropriate sampling scheme must be used.

I simple random sample

I stratified random sample

I ...

(8)

1.2 Pictorial and Tabular Methods in Descriptive Statistics

Visual representation of data

I Stem-and-Leaf Displays

I Dotplots

I Histograms

I Boxplots

I Frequency tables

I Pie charts

I Bar graphs

I Scatter plots

I ...

(9)

1.2 Pictorial and Tabular Methods in Descriptive Statistics

Stem-and-Leaf Displays Example:

data

0.0 -0.2 -1.1 -0.6 -2.3 0.5 -0.3 1.5 1.0 1.0 0.6 -1.1 -0.9 -1.2 0.3 1.4 0.4 -1.1 0.0 1.1 2.0 -0.2 0.3 -0.2 0.7 0.1 -0.8 0.3 0.3 0.4

-2 | 3
-1 |
-1 | 2111
-0 | 986
-0 | 3222
 0 | 001333344
 0 | 567
 1 | 0014
 1 | 5
 2 | 0

(10)

1.2 Pictorial and Tabular Methods in Descriptive Statistics

Histograms

(11)

Wednesday, Sept 7

To cover...

1.3 Measures of Location

1.4 Measures of Variability

(12)

1.3 Measures of Location

Some measures of location are:

I Mean

I Median

I Quartiles

I Percentiles

I Trimmed Means

Data:

Let X be the variable of interest.

I x₁, x₂, ..., x_n are observations of X;

I n is the number of observations, or sample size, or number of samples.

(13)

1.3 Measures of Location

Data example:

Caustic stress corrosion cracking of iron and steel has been studied because of failures around rivets in steel boilers and failures of steam rotors.

Let X be the crack length (µm) as a result of constant load stress corrosion tests on smooth bar tensile specimens for a fixed length of time.

x₁ = 16.1; x₂ = 9.6; x₃ = 24.9; x₄ = 20.4; x₅ = 12.7
x₆ = 21.2; x₇ = 30.2; x₈ = 25.8; x₉ = 18.5; x₁₀ = 10.3
x₁₁ = 25.3; x₁₂ = 14.0; x₁₃ = 27.1; x₁₄ = 45.0; x₁₅ = 23.3
x₁₆ = 24.2; x₁₇ = 14.6; x₁₈ = 8.9; x₁₉ = 32.4; x₂₀ = 11.8; x₂₁ = 28.5

The sample size is n = 21.

(14)

Mean

The mean or the arithmetic average of the set is the most familiar and useful measure of the center.

Let x₁, x₂, ..., x_n be a given set of numbers. The sample mean is denoted by ¯x. If the set is y₁, ..., y_n, the sample mean is ¯y.

Definition

The sample mean ¯x of observations x₁, x₂, ..., x_n, is given by

\[
\bar{x} = \frac{1}{n}\,(x_1 + x_2 + \cdots + x_n) = \frac{\sum_{i=1}^{n} x_i}{n} \tag{1}
\]

Example:

x₁ = 16.1; x₂ = 9.6; x₃ = 24.9; x₄ = 20.4; x₅ = 12.7

The mean is ¯x = (16.1 + 9.6 + 24.9 + 20.4 + 12.7)/5 = 83.7/5 ≈ 16.7
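The definition in (1) can be sketched in a few lines of Python. This is only an illustration: the names `crack_lengths` and `sample_mean` are made up here, and the data are the 21 crack lengths listed in the data example.

```python
# Crack lengths (µm) from the data example; the variable name is illustrative.
crack_lengths = [
    16.1, 9.6, 24.9, 20.4, 12.7,
    21.2, 30.2, 25.8, 18.5, 10.3,
    25.3, 14.0, 27.1, 45.0, 23.3,
    24.2, 14.6, 8.9, 32.4, 11.8, 28.5,
]


def sample_mean(values):
    """Return the arithmetic average: the sum of the observations divided by n."""
    return sum(values) / len(values)


# Mean of the first five observations (the worked example) and of all 21.
first_five_mean = sample_mean(crack_lengths[:5])
full_mean = sample_mean(crack_lengths)

print(round(first_five_mean, 2))
print(round(full_mean, 2))
```

The first value matches the worked example up to rounding; the second uses the whole sample of n = 21.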

(15)

Mean...

Sample mean of x1, x2, ..., xn: ¯x

The population mean is often denoted by µ.

Let N be the total number of observations in the population.

The population mean can be obtained as

µ = (sum of the N population values)/N. (2)

I There is more to this population mean! A general definition for µ that applies to both finite and (conceptually) infinite populations will be visited later.

I Just as ¯x is an interesting and important measure of sample location, µ is an interesting and important (often the most important) characteristic of a population.

(16)

Median

The sample median is the middle value once the observations are ordered from smallest to largest.

Notation: Denote the observations by x₁, ..., x_n. The sample median is represented by ˜x.

Definition

The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included so that every sample observation appears in the ordered list). Then,

\[
\tilde{x} =
\begin{cases}
\left(\dfrac{n+1}{2}\right)^{\text{th}} \text{ ordered value} & \text{if } n \text{ is odd,} \\[2ex]
\text{average of the } \left(\dfrac{n}{2}\right)^{\text{th}} \text{ and } \left(\dfrac{n}{2}+1\right)^{\text{th}} \text{ ordered values} & \text{if } n \text{ is even.}
\end{cases}
\tag{3}
\]

(17)

Median

Example: A sample of n = 12 recordings of Beethoven's Symphony #9 yielded the following durations (min), listed in increasing order:

62.3 62.8 63.6 65.2 65.7 66.4 67.4 68.4 68.8 70.8 75.7 79.0

I The sample median is the average of the n/2 = 6th and (n/2 + 1) = 7th values from the ordered list:

˜x = (66.4 + 67.4)/2 = 66.9

I Note: If the largest observation, 79.0, were excluded from the sample, the resulting sample median for the n = 11 remaining observations would be the single middle value 66.4 (the [n + 1]/2 = 6th ordered value, i.e. the 6th value in from either end of the ordered list).

I The sample mean is ¯x = 68.01, a bit more than a full minute larger than the median.
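The odd/even rule for the sample median can be written out directly. The following is a minimal sketch, assuming the twelve durations from the example; the function name `sample_median` is chosen here for illustration.

```python
def sample_median(values):
    """Middle ordered value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # 0-based index mid is the ((n + 1)/2)th ordered value.
        return ordered[mid]
    # 0-based indices mid - 1 and mid are the (n/2)th and (n/2 + 1)th ordered values.
    return (ordered[mid - 1] + ordered[mid]) / 2


durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

median_all = sample_median(durations)               # n = 12 (even)
median_without_max = sample_median(durations[:-1])  # n = 11 (odd), 79.0 dropped
mean_all = sum(durations) / len(durations)

print(round(median_all, 2), median_without_max, round(mean_all, 2))
```

Dropping the largest recording changes n from even to odd, so the second call exercises the other branch of the definition.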

(18)

Median

Remarks and Notation:

I The population median is denoted by ˜µ

I The sample median is very insensitive to outliers.

I If the median salary for a sample of engineers were ˜x = 66,416, we might use this as a basis for concluding that the median salary for all engineers exceeds 60,000.

I The population mean µ and median ˜µ will not generally be identical.

I When this is the case, in making inferences we must first decide which of the two population characteristics is of greater interest and then proceed accordingly.

(19)

Quartiles, Percentiles, and Trimmed Means

Quartiles divide the data set (sample or population) into four equal parts.

I Observations above the third quartile Q3 constitute the upper quarter of the data set.

I The second quartile Q2 is the median.

I The first quartile Q1 separates the lower quarter from the upper three-quarters.

Example: Beethoven's Symphony #9 data: Q1 = 64.80; Q2 = ˜x = 66.90; Q3 = 69.30
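The slides do not say which quartile convention produced these values. As a hedged sketch, the linear-interpolation rule available in Python's standard library as `statistics.quantiles(..., method="inclusive")` (Python 3.8+) reproduces them for the Beethoven data; other conventions can give slightly different quartiles.

```python
import statistics

durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

# n=4 asks for the three cut points that split the data into four parts.
# method="inclusive" interpolates between ordered values, treating the
# sample minimum and maximum as the 0th and 100th percentiles.
q1, q2, q3 = statistics.quantiles(durations, n=4, method="inclusive")

print(round(q1, 2), round(q2, 2), round(q3, 2))
```

Choosing "inclusive" here is an assumption made to match the numbers on the slide, not a statement about how they were originally computed.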

(20)

Quartiles, Percentiles, and Trimmed Means

I A data set (sample or population) can be even more finely divided using percentiles; the 99th percentile separates the highest 1% from the bottom 99%, and so on.

I The mean is quite sensitive to a single outlier, whereas the median is largely unaffected by outliers.

I A trimmed mean is a compromise between ¯x and ˜x with respect to robustness to outliers.

A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what remains.

(21)

Mean, Median, Quartiles, Percentiles, and Trimmed Means

Example: The production of Bidri is a traditional craft of India.

Bidri wares (bowls, vessels,...) are cast from an alloy containing primarily zinc along with some copper. The following observations are on copper content (%) for a sample of Bidri artifacts:

2.0 2.4 2.5 2.6 2.6 2.7 2.7 2.8 3.0 3.1 3.2 3.3 3.3 3.4 3.4 3.6 3.6 3.6 3.6 3.7 4.4 4.6 4.7 4.8 5.3 10.1

I Stem-and-Leaf display

 2 | 04566778
 3 | 012334466667
 4 | 4678
 5 | 3
 6 |
 7 |
 8 |
 9 |
10 | 1
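A display like this one can be generated with a short routine. The sketch below assumes nonnegative data recorded to one decimal place, with the integer part as the stem and the tenths digit as the leaf; the function name `stem_and_leaf` is illustrative, and negative values (as in the earlier example) would need extra handling.

```python
copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]


def stem_and_leaf(values):
    """Return display lines: integer part as stem, tenths digit as leaf."""
    leaves_by_stem = {}
    for value in sorted(values):
        tenths = round(value * 10)        # e.g. 3.4 -> 34
        stem, leaf = divmod(tenths, 10)   # 34 -> stem 3, leaf 4
        leaves_by_stem.setdefault(stem, []).append(str(leaf))
    lines = []
    # Walk every stem between the smallest and largest so gaps stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(leaves_by_stem.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


display = stem_and_leaf(copper)
print("\n".join(display))
```

Printing the empty stems 6 through 9 is what makes the isolated value 10.1 stand out.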

(22)

Mean, Median, Quartiles, Percentiles, and Trimmed Means

I A prominent feature is the single outlier at the upper end.

I The sample mean and median are 3.65 and 3.35, respectively. A trimmed mean with a trimming percentage of 100(2/26) = 7.7% results from eliminating the two smallest and two largest observations; this gives ¯x_{tr(7.7)} = 3.42.

I Trimming here eliminates the large outlier and so pulls the trimmed mean toward the median.

I A trimmed mean with a moderate trimming percentage (between 5% and 25%) will yield a measure of center that is neither as sensitive to outliers as is the mean nor as insensitive as the median.

I If the desired trimming percentage is 100α% and nα is not an integer, the trimmed mean must be calculated by interpolation.
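The Bidri numbers can be checked with a small sketch. It assumes the simple case in which a whole number k of observations is dropped from each end (here k = 2); the interpolation needed when nα is not an integer is not implemented, and the name `trimmed_mean` is illustrative.

```python
import statistics

copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]


def trimmed_mean(values, k):
    """Drop the k smallest and k largest observations, then average the rest."""
    ordered = sorted(values)
    if not 0 <= 2 * k < len(ordered):
        raise ValueError("k must leave at least one observation")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


k = 2
trimming_percentage = 100 * k / len(copper)   # 100(2/26)
mean_value = statistics.mean(copper)
median_value = statistics.median(copper)
trimmed_value = trimmed_mean(copper, k)

print(round(trimming_percentage, 1), round(mean_value, 2),
      round(median_value, 2), round(trimmed_value, 2))
```

With k = 0 the function reduces to the ordinary mean, which makes the trimmed mean's position between the mean and the median easy to explore.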

(23)

Categorical Data and Sample Proportions

Example: If a survey of individuals who own digital cameras is undertaken to study brand preference, then each individual in the sample would identify the brand of camera that he or she owned, from which we could count the number owning Canon, Sony, Kodak, and so on.

I When the data are categorical, a frequency distribution or relative frequency distribution provides an effective tabular summary of the data.

I Consider sampling a dichotomous population, one that consists of only two categories (such as voted or did not vote in the last election, does or does not own a digital camera, etc.).

I If we let x denote the number in the sample falling in category 1, then the number in category 2 is n − x. The relative frequency or sample proportion in category 1 is x/n and the sample proportion in category 2 is 1 − x/n.

(24)

Categorical Data and Sample Proportions

I Let's denote a response that falls in category 1 by a 1 and a response that falls in category 2 by a 0. A sample of size n = 10 might then yield the responses 1, 1, 0, 1, 1, 1, 0, 0, 1, 1. The sample mean for this numerical sample is (since the number of 1s is x = 7)

\begin{align}
\frac{x}{n} &= \frac{x_1 + x_2 + \cdots + x_n}{n} \tag{4} \\
&= \frac{1 + 1 + 0 + 1 + 1 + 1 + 0 + 0 + 1 + 1}{10} = \frac{7}{10}. \tag{5}
\end{align}

I The sample proportion of observations in the category is the sample mean of the sequence of 1s and 0s. Thus a sample mean can be used to summarize the results of a categorical sample.

I Analogous to the sample proportion x/n of individuals or objects falling in a particular category, let p represent the proportion of those in the entire population falling in the category.
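As a small illustration of the 0/1 coding, the sketch below uses the ten responses from the slide and shows that the sample proportion in category 1 is the sample mean of the coded values. Variable names are illustrative.

```python
from collections import Counter

# 1 = response in category 1, 0 = response in category 2 (data from the slide).
responses = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

n = len(responses)
x = sum(responses)                  # number of responses in category 1
sample_proportion = x / n           # x/n
sample_mean = sum(responses) / n    # mean of the 0/1 sequence

# Relative frequency distribution over both categories.
counts = Counter(responses)
relative_frequency = {category: count / n for category, count in counts.items()}

print(x, sample_proportion, sample_mean, relative_frequency)
```

The same `Counter` approach extends to more than two categories, where each category gets its own sample proportion.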

(25)

Categorical Data and Sample Proportions

I As with x/n, p is a quantity between 0 and 1, and while x/n is a sample characteristic, p is a characteristic of the population.

I The relationship between the two parallels the relationship between ¯x and µ and between ˜x and ˜µ. In particular, we will subsequently use x/n to make inferences about p.

I Example: If a sample of 100 car owners reveals that 22 have owned their car for at least 5 years, then we might use 22/100 = 0.22 as a point estimate of the proportion of all owners who have owned their car for at least 5 years.

I With k categories (k > 2), we can use the k sample proportions to answer questions about the population proportions p₁, ..., p_k.

(26)

1.4 Measures of Variability

Some measures of variability

I Range (max − min)

I Interquartile Range (Q3 − Q1)

I Variance or standard deviation

Different samples or populations may have identical measures of center yet differ from one another in other important ways.

Example:

(27)

Sample Variance

Definition

The sample variance of x₁, x₂, ..., x_n, denoted by s², is given by

\[
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{6}
\]

The sample standard deviation, denoted by s, is the (positive) square root of the variance:

\[
s = \sqrt{s^2} \tag{7}
\]

Remarks:

I s² and s are both nonnegative.

I The unit for s is the same as the unit for each of the x_i's.

(28)

Sample Variance

Computing remark:

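\begin{align}
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
 &= \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - n(\bar{x})^2 \right] \tag{8} \\
 &= \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right]. \tag{9}
\end{align}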

Example: Find the variance and standard deviation of

154 142 137 133 122 126 135 135 108 120 127 134 122

I Step 1: Form and find Σ_{i=1}^{n} x_i² = 154² + 142² + ... + 122² = 222581

I Step 2: With n = 13 and Σ_{i=1}^{n} x_i = 1695 (so ¯x = 1695/13 ≈ 130.4), calculate n(¯x)² = 1695²/13 ≈ 221001.9

I Step 3: The variance is s² = (222581 − 221001.9)/12 ≈ 131.6. The standard deviation is s = √131.6 ≈ 11.5.
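The three steps can be checked numerically. The sketch below computes s² both from the definition (6) and from the shortcut form (9), and compares the result with `statistics.variance` from the Python standard library; variable names are illustrative.

```python
import math
import statistics

data = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]

n = len(data)
total = sum(data)                             # sum of the x_i
sum_of_squares = sum(x * x for x in data)     # sum of the x_i squared (Step 1)
mean = total / n

# Definition (6): squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Shortcut (9): avoids computing each deviation (Steps 2 and 3).
s2_shortcut = (sum_of_squares - total ** 2 / n) / (n - 1)

s = math.sqrt(s2_shortcut)

print(total, sum_of_squares, round(s2_shortcut, 1), round(s, 1))
print(math.isclose(s2_definition, statistics.variance(data)))
```

One practical note: the shortcut is sensitive to rounding. Squaring a mean that has already been rounded to 130.4 changes the result noticeably, which is why the sketch uses the unrounded total.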

(29)

Mean and Variance

Proposition

Let x₁, x₂, ..., x_n be a sample and c be any nonzero constant.

I If y_i = x_i + c for i = 1, ..., n, then ¯y = ¯x + c and s_y² = s_x².

I If y_i = c x_i for i = 1, ..., n, then ¯y = c ¯x, s_y² = c² s_x², and s_y = |c| s_x.

where s_x² and s_y² are the sample variances of the x's and the y's, respectively.
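A numerical check is not a proof, but it can make the proposition concrete. The sketch below assumes the first five crack lengths as the sample and an arbitrarily chosen negative constant c, so that the absolute value in s_y = |c| s_x actually matters.

```python
import math
import statistics

x = [16.1, 9.6, 24.9, 20.4, 12.7]   # first five crack lengths from the slides
c = -2.5                            # arbitrary nonzero constant (an assumption)

shifted = [xi + c for xi in x]      # y_i = x_i + c
scaled = [c * xi for xi in x]       # y_i = c * x_i

mean_x = statistics.mean(x)
var_x = statistics.variance(x)      # sample variance, divisor n - 1
sd_x = statistics.stdev(x)

# Shifting moves the mean by c and leaves the variance unchanged.
shift_ok = (
    math.isclose(statistics.mean(shifted), mean_x + c)
    and math.isclose(statistics.variance(shifted), var_x)
)

# Scaling multiplies the mean by c, the variance by c^2, and the sd by |c|.
scale_ok = (
    math.isclose(statistics.mean(scaled), c * mean_x)
    and math.isclose(statistics.variance(scaled), c ** 2 * var_x)
    and math.isclose(statistics.stdev(scaled), abs(c) * sd_x)
)

print(shift_ok, scale_ok)
```

Comparisons use `math.isclose` rather than `==` because floating-point arithmetic introduces tiny rounding differences.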

(30)

Five-Number Summaries and Boxplots

I With x₁, x₂, ..., x_n, the five-number summary is given by

(minimum, first quartile Q1, median, third quartile Q3, maximum)

also written as

(smallest x_i, lower fourth, median, upper fourth, largest x_i)

I A boxplot (aka box-and-whisker plot) is a way of graphically depicting groups of numerical data through their five-number summaries.

Remark: A boxplot may also indicate which observations, if any, might be considered outliers.

Using the Bidri data set, we have
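A minimal sketch of the five-number summary for the Bidri copper data follows. It rests on two assumptions that the slides do not spell out: the lower and upper fourths are taken as the medians of the lower and upper halves of the ordered data (the median itself is included in both halves when n is odd), and an observation is flagged as a possible outlier when it lies more than 1.5 times the fourth spread beyond the nearer fourth, a common boxplot convention.

```python
import statistics

copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]


def five_number_summary(values):
    """Return (smallest, lower fourth, median, upper fourth, largest)."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2   # size of each half; includes the median when n is odd
    lower_fourth = statistics.median(ordered[:half])
    upper_fourth = statistics.median(ordered[n - half:])
    return (ordered[0], lower_fourth, statistics.median(ordered),
            upper_fourth, ordered[-1])


summary = five_number_summary(copper)
smallest, lower_fourth, median, upper_fourth, largest = summary

# Assumed outlier rule: beyond 1.5 fourth spreads from the nearer fourth.
fourth_spread = upper_fourth - lower_fourth
low_fence = lower_fourth - 1.5 * fourth_spread
high_fence = upper_fourth + 1.5 * fourth_spread
outliers = [v for v in copper if v < low_fence or v > high_fence]

print(summary)
print(outliers)
```

Other quartile conventions would shift the fourths slightly, so the exact box edges on a software-drawn boxplot may differ from these values.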

(31)

Five-Number Summaries and Boxplots

(32)

Comparative Boxplots

Suppose we have two data sets:

x: 8.87 4.98 11.23 21.03 10.33 -4.03 9.70 7.67 11.64 1.73 1.78 4.83 -1.63 13.52 4.12 5.69 13.91 7.56 17.15 8.84 13.08 8.18 10.28 4.67 16.54 12.18 2.97 9.35 10.70 10.91

y: 8.94 5.61 6.44 15.38 6.60 5.81 9.33 10.93 7.69 4.98 15.16 7.87 6.04 4.74 4.81 8.68 5.12 8.93 18.89 8.33 4.10 11.77 8.37 6.50 3.90 11.98 8.02 5.89 6.35 8.43

(33)

Exercise 78

Consider a sample x₁, ..., x_n and suppose that the values of ¯x, s_x², and s_x have been calculated.

a. Let y_i = x_i − ¯x for i = 1, ..., n. How do the values s_y² and s_y for the y_i's compare to s_x² and s_x? Explain or justify.

b. Let z_i = (x_i − ¯x)/s_x for i = 1, ..., n. What are s_z² and s_z, the variance and standard deviation for the z_i's?