Applied Statistics for Engineers and Scientists: Basic Data Analysis




Man V. M. Nguyen

Faculty of Computer Science & Engineering HCMUT



This lecture presents selective topics of Statistics and Probability I from basic concepts to practical applications for undergraduates in

* Statistics and Applied Mathematics,

* Computer Science major, and

* Biological Sciences.

Intended for a joint program with Portland Univ. at HCMUS, HCM City, Vietnam.


Introduction to Statistics I


The aims of the course


Randomness and uncertainty are phenomena that engineering students face both in daily life and in professional environments.

The course aims to provide students in

* Business Administration and Econometrics,

* Computing and Biological Sciences

with fundamental methodologies, together with the major formalizations and techniques, of Probability and Statistics.

This foundation should help you understand and efficiently resolve theoretical and practical problems that are random by nature.


What is Statistics?


Statistics is usually defined as a branch of Applied Mathematics, which is in turn a discipline of modern mathematics. In practice, modern mathematics is one of the principal tools of statistics. Therefore, to understand statistics, one must have some knowledge of modern mathematics.

Brief description of the course. We introduce basic statistical concepts and terminology that are fundamental to the use of statistics in experimental work.


Why Statistics?


Look at the following real-life situations.

1. A recent newspaper article concluded that smoking marijuana or cigarettes at least three times a week resulted in lower grades in college.

* How do you think the researchers came to this conclusion?

* Do you believe it?


Why Statistics?


2. It is obvious to most people that, on average, men are taller than women, and yet there are some women who are taller than some men. Therefore, if you wanted to prove that men were taller, you would need to measure many people of each sex.

Here is a theory: On average, men have lower resting pulse rates than women do.

* How could you go about trying to prove or disprove that?

* Would it be sufficient to measure the pulse rates of one member of each sex? Two members of each sex?

* What information about men's and women's pulse rates would help you decide how many people to measure?


Why Statistics?


3. Suppose you were to learn that the large state university in a particular state graduated more students who eventually went on to become millionaires than any of the small liberal arts colleges in the state.

* Would that be a fair comparison?

* How should the numbers be presented in order to make it a fair comparison?


Brief description of the course


We will learn:

* the role of statistics in engineering and scientific experimentation,

* the two major subdivisions of Statistics:
  - Descriptive statistics and
  - Inferential statistics,

* in Inferential statistics, the distinction between samples and populations, and how to reach decisions based on key statistics extracted from samples,

* how to relate sample statistics to population parameters, and

* how to characterize deterministic and empirical models.


Course structure


In Part 0: Warming-up Review, we will survey some basic ideas of modern mathematics such as

- the concept of functions,

- equations,

- operation of summations, etc.

before we venture into the statistical discussion.

Part I covers methods of Descriptive Statistics.

Part II discusses Probability Concepts and Distributions, and in Part III we touch on one of the most important topics of Statistics, Statistical Estimation.


A definition of Statistics


Statistics is the science of problem-solving in the presence of variability. It has two main branches:

* Descriptive statistics is concerned with summarizing and describing numerically a body of data.

* More importantly, Inferential statistics is the process of reaching generalizations about the whole (called the population) by examining a portion or many portions (called samples).

Why Statistics in Science and Technology?

Scientific investigations are important not only in the academic laboratories of research universities but also in the engineering laboratories of industrial manufacturers.


Statistics in engineering and scientific



Statistical methods are applied to an enormous diversity of problems in fields such as:

* Agriculture (which varieties grow best?)

* Genetics, Biology (selecting new varieties, species)

* Economics (how are living standards changing?)

* Market Research (comparison of advertising campaigns)

* Education (what is the best way to teach small children?)

* Environmental Studies (do strong electric or magnetic fields induce higher cancer rates?)


Statistics in Quality engineering


A key motivation. Quality and productivity are characteristic goals of industrial and service processes, which are expected to result in goods and services that are highly sought by consumers and that yield profits for the firms that supply them.

Urgent demands from Industry and Services.

* It is no longer satisfactory just to monitor on-line industrial processes and to ensure that products are within desired specification limits.

* Competition demands that a better product be produced within the limits of economic realities.

* Better products are initiated in academic and industrial research laboratories, and made feasible in pilot and new-product research studies.

All of these activities require experimentation, data collection, and sound analysis of the data.


Explicit material of this course


After studying the course and working through its exercises, in Business Administration contexts, or more broadly in Econometrics, in the Software Industry, or in Biology-related sciences such as Pharmaceutics and Biomedicine, you should be able to:

1. Know introductory concepts and methods of descriptive and inferential statistics.

2. Understand and practically employ methods including:

a) grouping of data,

b) measures of central tendency and dispersion,

c) probability concepts and distributions,

d) sampling and statistical estimation,

e) Statistical hypothesis testing, and

f) basic Linear Regression Models.

The last two topics will be discussed in the next lecture note, Statistics-I-Slides-part-3and4.pdf.


Part 0: Warming up - a mathematical review


Set Theory


Concept of Set. A set is a collection of objects s, called elements:

S = {s : s satisfies a property P(s)}

E.g., P(s) = "s is a national soccer team taking part in matches in Germany in July 2010".

* If S is a set and x is a member (element) of S, we write x ∈ S. Otherwise we write x ∉ S.

* The set with elements x1, ..., xn is denoted {x1, ..., xn}.

* The empty set, with no elements, is denoted {} or ∅.

* A set with one element is called a singleton, e.g., {a} is a singleton.


Various Number Sets- the naturals and integers


Notation. We denote the natural numbers {0, 1, ...} by N, the integers by Z, the rational numbers by Q, and the real numbers by R.

Elucidation.

a/ The numbers 0, 1, 2, 3, and so on are called natural numbers N. If we add or multiply any two natural numbers, the result is always a natural number. However, if we subtract or divide two natural numbers, the result is not always a natural number.

b/ To overcome the limitation of subtraction, we extend the natural number system to the system of integers. We do so by including, together with all the natural numbers, all of their negatives. Thus, we can represent the system of integers Z in the form: ..., -3, -2, -1, 0, 1, 2, 3, ...


The rationals Q


c/ ... we still cannot always divide any two integers. For example, 8/(-2) = -4 is an integer, but 8/3 is not an integer. To overcome this problem, we extend the system of integers to the system of rational numbers.

We define a number as rational if it can be expressed as a ratio of two integers with a nonzero denominator. Thus, all four basic arithmetic operations (addition, subtraction, multiplication and division, except division by zero) are possible in the rational number system Q.


The irrationals and the reals R


There also exist some numbers in everyday use that are not rational numbers; that is, they cannot be expressed as a ratio of two integers. For example, √2, √3, π, etc. are not rational numbers; such numbers are called irrational numbers.

d/ The term real number is used to describe a number that is either rational or irrational.

To give a complete definition of real numbers R would involve the introduction of a number of new ideas, and we shall not do this task now. However, it is a good idea to think about a real number in terms of decimals.


Set equality and subsets


* Two sets A, B are equal, denoted A = B, if they have the same elements.

* Sets can be described by properties that the elements satisfy. If P is a property, then the expression {x | P} denotes the set of all x that satisfy P. E.g., the set of odd natural numbers can be represented by the following equal sets:

{x | x = 2k + 1 for some k ∈ N} = {1, 3, 5, ...}.

Subsets. The set A is a subset of B, denoted A ⊆ B, if every element of A is an element of B, i.e. [for all s, if s ∈ A then s ∈ B].

The power set of a set S, denoted power(S) or P(S), is the set of all subsets of S.


Set operations and the algebra of sets


A few basic operations on sets are:

* S ∩ R := {x : x ∈ S and x ∈ R} (intersection)

* S ∪ R := {x : x ∈ S or x ∈ R} (union)

* S \ R := {x : x ∈ S and x ∉ R} (difference)
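As an illustration, the three operations can be tried with Python's built-in set type. The sets S and R below are made-up examples, not taken from the lecture.

```python
# Hypothetical example sets (not from the lecture).
S = {1, 2, 3, 4}
R = {3, 4, 5}

intersection = S & R   # elements that are in both S and R
union = S | R          # elements that are in S or in R
difference = S - R     # elements that are in S but not in R

print("S ∩ R =", intersection)
print("S ∪ R =", union)
print("S \\ R =", difference)
```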

Quiz. Determine the set P(X) ∩ P(Y) if you know X = {a, b, 1} and Y = {u, a, b}, where P(S) is the set consisting of all subsets of S.




The idea of a function is one of the most fundamental concepts in modern mathematics. A function expresses the hypothesis that one quantity depends on (or is determined by) another quantity. For example:

(i) bone mass depends on the age of the subject; (ii) height depends on race, etc.

If a function f assigns a value y in the range to a certain x in the domain, then we write y = f(x), where, in Modern Statistics, x is called the "independent" variable and y the "dependent" variable (although this terminology is sometimes controversial). In formal mathematics, we refer to the set of possible values of x as the domain, and the set of possible values of y as the range.




A linear function is usually of the form: y = a + bx [1]

where a is called the intercept (the value of y when x = 0) and b is called the slope, which represents the change in y per one-unit change in x.

In a two-dimensional space Oxy, for any two given points (x1, y1) and (x2, y2) with x1 ≠ x2, the slope can be determined by the relation:

$$ b = \frac{\text{change in } y}{\text{change in } x} = \frac{y_2 - y_1}{x_2 - x_1} $$


However, for a series of n > 2 points (x, y), we can extend this formula into a system of n simultaneous equations and estimate a and b by the Method of Least Squares, which is readily available in most statistical software.

MANY LINEAR FUNCTIONS.

If there are two lines, say y = a1 + b1 x and y = a2 + b2 x, then we can make a number of observations:

(a) The two lines are parallel if and only if their slopes are equal, i.e. b1 = b2.

(b) On the other hand, if b1 ≠ b2, then the lines are not parallel.
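The slope formula for two points and the least-squares estimate for more than two points can be sketched as follows. This is a minimal illustration, not the lecture's own procedure: the function names and the four sample points are assumptions, and the closed-form expressions used are the standard least-squares solution for a single x variable.

```python
def slope_through(p1, p2):
    """Slope b of the line through two points with different x values."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)


def least_squares_line(xs, ys):
    """Estimate intercept a and slope b of y = a + b*x by least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical points that lie exactly on y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

b_two_points = slope_through((1, 3), (3, 7))
a, b = least_squares_line(xs, ys)
print(b_two_points, a, b)
```

When the points lie exactly on a line, as here, both approaches recover the same slope; with noisy data only the least-squares estimate uses all n points.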



”independent” variables


y = a + bx [1]

Equation [1] could be expanded further to include more than one x variable. For instance, if bone mineral density (BMD) depends strongly on age (denoted AGE) and weight (denoted WEIGHT), we may write this statement as:

BMD = a + b AGE + c WEIGHT

where a, b and c are estimated constants. Thus, for every value of AGE and WEIGHT, a BMD could be estimated.

We will examine this function in the context of regression analysis later in this series.




We often come across situations where the functional relationship between

the dependent variable (y ) and the independent variable (x ) is not linear, but a curved one.

One of the popular functions is the quadratic function, which is of the form:

y = f(x) = ax^2 + bx + c [2]

where a, b, c are constants and a ≠ 0.




Of course, learning mathematics cannot be complete without being able to communicate in its language. Here are some of the commonly used symbols in mathematics with which you are required to be conversant:

SYMBOL   MEANING                          NOTE
------   ------------------------------   --------------
∈        Belongs to                       relation
∉        Does not belong to               relation
⇒        Implies; it follows that         logic operator
⇐        Is implied by                    logic operator
⇔        Equivalent to; if and only if    logic operator
R        Real numbers                     set notation
∀        For every                        quantifier
∃        There exists                     quantifier


Part I: Descriptive statistics


* Numerical measures of location (e.g. central tendency)

* Measures of Dispersion (Variability)


Part I: Descriptive statistics


Practical motivation I. Cash flow management (CFM) is a critical activity of a firm named S in HCMC.

S uses independent representatives to sell its products to department stores and gift shops across the whole city.

A main component of CFM is the analysis and control of accounts receivable:


* to measure the average age (time duration) and value of outstanding invoices, and

* to make meaningful decisions based on those statistics (i.e. numerical measures).


Concrete Data and Demand


a) A recent summary of accounts receivable shows the following descriptive statistics:

Mean: 40 days
Median: 35 days
Mode: 31 days

Furthermore,

b) Critical and concrete demands for S's success are:

* the average age of outstanding invoices should not exceed 45 days, and

* the dollar value of invoices more than 60 days old should not exceed 5% of the total value of all accounts receivable.

Question: how should you use the statistical summary in a) to answer management's concern about whether b) is satisfied?


Numerical measures of location:

Central tendency


We employ three basic measures to describe central tendency of data:

1/ Mean 2/ Median 3/ Mode


Central tendency- Mean


Mean, or the average value. The sample mean $\bar{x}$ describes the central tendency of a sample of size n:

$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} $$

* If the number of elements (observations/items) of the entire population is N, the population mean is

$$ \mu = \frac{\sum_{i=1}^{N} x_i}{N} $$

Are there other ways to measure central tendency? Yes!


Central tendency- Median


The median is the value in the middle when the data x1, ..., xn of size n are sorted in ascending order (smallest to largest).

- If n is odd, the median is the middle value.

- If n is even, the median is the average of the two middle values.

Example 0. Find the mean and median of two data sets, representing monthly salaries of IT engineers in the US:

x = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325], and

x* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000].

Then the sample mean of data x is

$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{35280}{12} = 2940. $$


Central tendency- Median


Since n = 12 is even, the middle two values are 2890 and 2920; the median of data x, denoted Med(x), is the average of these values:

$$ \mathrm{Med}(x) = \frac{2890 + 2920}{2} = 2905. $$

Remark: Whenever a data set contains extreme values, the median is often preferred to the mean as a measure of central location. The sample data x* contains an extreme value, USD 10000, and its sample mean is

$$ \bar{x}^* = \frac{\sum_{i=1}^{n} x_i^*}{n} \approx 3496 \gg 2940 = \text{the mean of data } x. $$

But the median is unchanged, reflecting the central tendency better:

$$ \mathrm{Med}(x) = \mathrm{Med}(x^*) = \frac{2890 + 2920}{2} = 2905. $$
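The effect of the extreme value on the mean, and the stability of the median, can be checked with a short sketch. It uses Python's standard `statistics` module on the two salary data sets of Example 0; the variable names are illustrative choices.

```python
from statistics import mean, median

# Salary data of Example 0.
x = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]
# Same data with the last value replaced by the extreme value 10000.
x_star = x[:-1] + [10000]

mean_x, median_x = mean(x), median(x)
mean_x_star, median_x_star = mean(x_star), median(x_star)

print("x : mean =", mean_x, " median =", median_x)
print("x*: mean =", mean_x_star, " median =", median_x_star)
```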


Central tendency- Mode


Frequency distributions. In any sample data x of size n, the number of observations n_A of a particular value A is its absolute frequency. The relative frequency of A is n_A / n.


* A histogram is a bar graph of a frequency distribution.

* The mode of sample data x is the value that occurs with the greatest frequency.

Example 1. A computing student, An, received the following grades in the subjects of his first semester in 2007:


Mode– Example


Grade   Absolute frequency   Relative frequency
5       1                    0.1
6       4                    0.4
7       2                    0.2
8       1                    0.1
9       1                    0.1
10      1                    0.1

Size of data: n = 10; relative frequency = n_A / n

Table: Frequency distributions of An’s grades

Hence, the mode of our grade data x is Mode = 6, its absolute frequency is 4, its relative frequency is 0.4.
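The frequency table and the mode can be reproduced with a small sketch. The list `grades` below simply expands the table of An's grades (one entry per subject); `collections.Counter` is one convenient way to count, not necessarily the method intended in the lecture.

```python
from collections import Counter

# An's ten grades, expanded from the frequency table.
grades = [5, 6, 6, 6, 6, 7, 7, 8, 9, 10]
n = len(grades)

absolute = Counter(grades)                                  # n_A for each grade A
relative = {g: count / n for g, count in absolute.items()}  # n_A / n

# The mode is the grade with the greatest absolute frequency.
mode, mode_count = absolute.most_common(1)[0]

for g in sorted(absolute):
    print(g, absolute[g], relative[g])
print("mode =", mode, "with absolute frequency", mode_count)
```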


Mode– Bimodal and Multimodal


If the data have exactly two modes, we say the data are bimodal; if they have more than two modes, the data are multimodal. In practice, only unimodal or bimodal data are of interest, since the mode is an important measure of central tendency (location) for qualitative data.

Soft drink     Absolute frequency
Coke classic   19
Diet Coke      8
Twister        5
Pepsi          13
Sprite         5

Here A ranges over the soft drinks, with absolute frequencies n_A = 19, 8, ..., and the size of the data is

$$ n = \sum_{A} n_A = 50. $$


Numerical measures of location:

Spreading tendency


We employ basic measures to describe spreading tendency of data:

a/ Percentiles and




A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.

Given sample data x, formally we have

Definition

The pth percentile is a value m such that at least p percent of the observations are less than or equal to m, and at least (100 − p) percent of the observations are greater than or equal to m.




Example 2. Universities frequently report admission test scores in terms of percentiles. Suppose an applicant K obtains a raw score m = 54 (on a scale of 100) on an admission test. Can we judge his chance of passing the exam in comparison with the other applicants? Yes, if we know which percentile the value m corresponds to in the set of all applicant scores!

If the value m = 54 corresponds to, say, the 70th percentile (of all applicants' scores), we know that

* approximately 70% of applicants scored lower than applicant K, and

* approximately 30% of applicants scored higher than applicant K.


Percentiles- Mathematical formula


Our concern now is: given p%, find the value m by locating its position (index) in the observed sample data x of size n.

Calculating the pth percentile, in 3 steps:

1. Arrange the data x in ascending order to obtain the sorted sample data y.

2. Compute an index $i = \frac{p}{100}\, n$.

3. Locate m from i:

* If i is not an integer, round up to the ceiling ⌈i⌉ =: j (the smallest integer greater than i). Then m = y[j].

* If i is an integer, then m = (y[i] + y[i+1]) / 2, as in the 50th-percentile computation of Example 3.


Percentiles- Examples


Example 3. Let us determine the 85th percentile for the salary data given in Example 0.

x = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]

- Arrange x to get y = x (since the data are already sorted).

- Compute the index $i = \frac{p}{100}\, n = \frac{85}{100} \cdot 12 = 10.2 \notin \mathbb{N}$.

- Round it up to the ceiling ⌈i⌉ = 11 (the smallest integer greater than 10.2). Then m = y[11] = 3130.

* The 50th percentile for the same data is computed similarly: $i = \frac{p}{100}\, n = \frac{50}{100} \cdot 12 = 6 \in \mathbb{N}$, so m = (y[6] + y[7])/2 = (2890 + 2920)/2 = 2905.
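The percentile procedure can be sketched in code as below. The function name `percentile` is an illustrative choice, the restriction 0 < p < 100 is an added assumption to keep the indices valid, and the integer-index branch follows the averaging used for the 50th percentile in Example 3. Note that Python lists are 0-based, so y[j] in the lecture's 1-based notation is `y[j - 1]` in the code.

```python
import math


def percentile(data, p):
    """pth percentile (0 < p < 100) following the three-step rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    y = sorted(data)               # step 1: sort in ascending order
    i = p * len(y) / 100           # step 2: index i = (p/100) * n
    if i != int(i):                # step 3a: i is not an integer
        j = math.ceil(i)
        return y[j - 1]            # y[j] in 1-based notation
    i = int(i)                     # step 3b: i is an integer
    return (y[i - 1] + y[i]) / 2   # average of y[i] and y[i+1], 1-based


x = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]
p85 = percentile(x, 85)
p50 = percentile(x, 50)
print(p85, p50)
```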


Quartiles= 25%


It is often desirable to divide data into four parts, each containing approximately one-fourth, or 25%, of the observations.

The division points are called the quartiles, and are defined as:

Q1 = first quartile, or 25th percentile

Q2 = second quartile, or 50th percentile (also the median)

Q3 = third quartile, or 75th percentile

Table: Three Quartiles

E.g., our salary data sample, given in Example 0, is divided into four parts (Q1 and Q3 are computed as we did for the 85th and 50th percentiles):

[2710, 2755, 2850, | 2880, 2880, 2890, | 2920, 2940, 2950, | 3050, 3130, 3325]


Measures of Dispersion (also called Variability)


Practical motivation II.

You are a purchasing agent of Maximart in HCMC. You regularly place orders with two distinct suppliers of fine and luxury ceramics, say

* Minh Long ceramic, denoted M, and

* another, foreign brand, denoted F.

After several months of operation, you find that the mean number of days required to fill orders is µ = 10.3 days for both suppliers. Your concerns are:

* Do the two suppliers M and F demonstrate the same degree of reliability in terms of making deliveries on schedule?

* Which supplier would you prefer?


Measures of Dispersion


Working days   Supplier M   Supplier F
7              0            2
8              0            1
9              1            0
10             5            3
11             4            1
12             0            1
13             0            1
14             0            0
15             0            1


Measures of Dispersion


In total, Minh Long ceramic needed a sum of delivery days of

1·9 + 5·10 + 4·11 = 9 + 50 + 44 = 103 days,

and the foreign brand F needed

2·7 + 1·8 + 3·10 + 1·(11 + 12 + 13 + 15) = 14 + 8 + 30 + 51 = 103 days.

With 10 orders each, µM = µF = 10.3 days. But note that:

* the 7- or 8-day deliveries of the foreign brand F may be viewed favorably, whereas

* its slow 13- to 15-day deliveries could be disastrous

in terms of keeping your business running smoothly (workforce busy, big sales during peak season).


Measures of Dispersion


We understand

dispersion = how far the extreme data values are from the mean!

Although Minh Long ceramic and the foreign brand F have the same mean µM = µF = 10.3,

F has dispersion 15 − 10.3 = 4.7 > 1.3 = |9 − 10.3| = Minh Long's dispersion.

In that sense, the foreign brand F has larger dispersion, and so is less reliable (than the Minh Long firm) in terms of making deliveries on schedule!


Measures of Dispersion


Basic measures of dispersion (variability) are:

* Range

* Interquartile Range


Measures of Dispersion- Range


Range = the largest value minus the smallest. That is, the range of the data x = [x1, ..., xn] is Max(x) − Min(x).

For the salary data given in Example 0

x = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]

the range of the data is 3325 − 2710 = 615. For the extreme one

x* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000]

the range of the data is now 10000 − 2710 = 7290!


Measures of Dispersion- Interquartile Range


The Interquartile Range (IQR) is the range of the middle 50% of the data:

Interquartile Range = Q3− Q1.

This indicator overcomes the dependency on extreme values. E.g., from

[2710, 2755, 2850, | 2880, 2880, 2890, | 2920, 2940, 2950, | 3050, 3130, 3325]

with Q1 = 2865, Q2 = 2905 and Q3 = 3000 at the three division points,

the Interquartile Range of the data is Q3 − Q1 = 3000 − 2865 = 135.
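To see how the range reacts to an extreme value while the interquartile range does not, here is a small sketch on the salary data x and x* of Example 0. The helper `quartile_points` is a hypothetical name; it assumes the sample size is divisible by 4, so that each quartile is the average of the two observations on either side of a division point, as in the example above.

```python
def quartile_points(data):
    """Q1 and Q3 when n is divisible by 4 (an assumption of this sketch)."""
    y = sorted(data)
    n = len(y)
    if n % 4 != 0:
        raise ValueError("this sketch only handles n divisible by 4")
    k = n // 4
    q1 = (y[k - 1] + y[k]) / 2          # between the 1st and 2nd parts
    q3 = (y[3 * k - 1] + y[3 * k]) / 2  # between the 3rd and 4th parts
    return q1, q3


x = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]
x_star = x[:-1] + [10000]

range_x = max(x) - min(x)
range_x_star = max(x_star) - min(x_star)

q1, q3 = quartile_points(x)
iqr_x = q3 - q1
q1_star, q3_star = quartile_points(x_star)
iqr_x_star = q3_star - q1_star

print("range:", range_x, range_x_star)
print("IQR  :", iqr_x, iqr_x_star)
```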


Measures of Variability


Variance and Standard deviation. The variance of a data set is a measure of variability that utilizes all the data.

* The sample variance of data x of size n:

$$ \mathrm{Var}(x) = \sigma_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}, $$

* the sample standard deviation: $\sigma_x = \sqrt{\mathrm{Var}(x)}$,

* the population variance of a population of size N, with mean µ:

$$ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}. $$




From Practical motivation II:

Working days   Supplier M   Supplier F
7              0            2
8              0            1
9              1            0
10             5            3
11             4            1
12             0            1
13             0            1
14             0            0
15             0            1


Variance to judge Reliability


Supplier M provides data of length n = 10,

xM = [9, 10, 10, 10, 10, 10, 11, 11, 11, 11],

with sample variance

$$ \mathrm{Var}(M) = \frac{\sum_{i=1}^{n} x_i^2 - n\mu^2}{n - 1} = \frac{(9^2 + 5 \cdot 10^2 + 4 \cdot 11^2) - 10 \cdot 10.3^2}{10 - 1} = \frac{4.1}{9} \approx 0.46. $$

Supplier F similarly provides data of the same length,

xF = [7, 7, 8, 10, 10, 10, 11, 12, 13, 15],

with sample variance

$$ \mathrm{Var}(F) = \frac{(2 \cdot 7^2 + 8^2 + 3 \cdot 10^2 + 11^2 + 12^2 + 13^2 + 15^2) - 10 \cdot 10.3^2}{9} = \frac{60.1}{9} \approx 6.68. $$

So Supplier F is less reliable than Supplier M.


Standard deviation– Coefficient of Variation


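* The sample standard deviation is σx:

$$ \sigma_M = \sqrt{\mathrm{Var}(M)} = \sqrt{4.1/9} \approx 0.67; \qquad \sigma_F = \sqrt{\mathrm{Var}(F)} = \sqrt{60.1/9} \approx 2.58 $$

* The Coefficient of Variation V measures relative dispersion, i.e. it compares how large the standard deviation is relative to the mean:

$$ V = \left( \frac{\sigma}{\mu} \times 100 \right)\% \ \text{for populations, and} \quad V = \left( \frac{\sigma_x}{\mu_x} \times 100 \right)\% \ \text{for samples } x. $$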


Coefficient of Variation


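In Practical motivation II, the coefficient of variation of Supplier M is

$$ V_M = \left( \frac{\sigma_M}{\mu_M} \times 100 \right)\% = \frac{0.67}{10.3} \times 100\% \approx 6.5\%, $$

and the coefficient of variation of Supplier F is

$$ V_F = \left( \frac{\sigma_F}{\mu_F} \times 100 \right)\% = \frac{2.58}{10.3} \times 100\% \approx 25\%. $$

Hence Supplier F is less reliable than Supplier M: its relative dispersion is almost 4 times larger.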
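The supplier comparison can be re-run with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n − 1 denominator of the sample formulas. The helper name `coefficient_of_variation` is an illustrative choice; small differences from the hand-rounded figures are to be expected because the code does not round intermediate values.

```python
from statistics import mean, stdev, variance

# Delivery times (working days) of the ten orders per supplier.
x_m = [9, 10, 10, 10, 10, 10, 11, 11, 11, 11]
x_f = [7, 7, 8, 10, 10, 10, 11, 12, 13, 15]


def coefficient_of_variation(data):
    """Sample standard deviation relative to the mean, in percent."""
    return stdev(data) / mean(data) * 100


var_m, var_f = variance(x_m), variance(x_f)
cv_m, cv_f = coefficient_of_variation(x_m), coefficient_of_variation(x_f)

print("M: mean", mean(x_m), "variance", round(var_m, 3), "CV %", round(cv_m, 2))
print("F: mean", mean(x_f), "variance", round(var_f, 3), "CV %", round(cv_f, 2))
print("ratio of CVs:", round(cv_f / cv_m, 2))
```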


Measures of Association Between Two Variables



We now consider the relationship between variables. The two most important descriptive measures for this task are Covariance, which measures the co-movement of two separate distributions, and Correlation.

Let us start by looking at Practical motivation III. [Sales trend] A manager of a sound equipment store in Hanoi wants to determine the relationship between

* the number x of weekend television commercials shown, and

* the sales y at his store during the following week.

Sample data of size n = 10, recorded over 10 weeks, are shown in Table 1.


Association Between Two Variables


Week   Number of commercials (x)   Sales volume y (× $100)
1      2                           50
2      5                           57
3      1                           41
4      3                           54
5      4                           54
6      1                           38
7      5                           63
8      3                           48
9      4                           59
10     2                           46


Association Between Two Variables


Covariance is the first descriptive measure of association between two variables X, Y.

For sample data of size n with the observations

(x, y) = {(x1, y1), ..., (xn, yn)}

the sample covariance is defined as

$$ s_{xy} = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}. $$

In our example we have $\bar{x}$ = 30/10 = 3 and $\bar{y}$ = 510/10 = 51, and the sample covariance is s_xy = 99/9 = 11.

For the entire population of size N, the population covariance is

$$ \sigma_{xy} = \frac{\sum_{i} (x_i - \mu_x)(y_i - \mu_y)}{N}. $$


A positive covariance indicates that X and Y move together in relation to their means. A negative covariance indicates that they move in opposite directions.
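The sample covariance of the commercials and sales data can be verified with a direct implementation of the definition (sum of products of deviations from the means, divided by n − 1). The function and variable names are illustrative.

```python
# Data of Practical motivation III: commercials shown and sales (x $100).
commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


s_xy = sample_covariance(commercials, sales)
print("sample covariance:", s_xy)
```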


Association Between Two Variables


Remark that

(xi − x̄)(yi − ȳ) > 0 ⇔ the point (xi, yi) lies in quadrant I or III,

(xi − x̄)(yi − ȳ) < 0 ⇔ the point (xi, yi) lies in quadrant II or IV,

where the quadrants are taken relative to axes through the point of means (x̄, ȳ).

As a result,

1. sxy > 0 indicates a positive linear association (relationship) between x and y;

2. sxy ≈ 0: x and y are not linearly associated;

3. sxy < 0: x and y are negatively linearly associated.

In our example, sxy = 99/9 = 11 indicates a positive linear relationship between the number x of television commercials shown and the sales y at the sound equipment store.

But the value of the covariance depends on the measurement units for x and y. Is there another, more precise measure of this relationship?


Association Between 2 Variables- Correlation


Correlation coefficient, the second descriptive measure:

$$ r_{xy} = \frac{s_{xy}}{s_x s_y}, $$

where s_x and s_y are the sample standard deviations of x and y.
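As a sketch, and assuming the usual definition of the sample (Pearson) correlation coefficient as the sample covariance divided by the product of the two sample standard deviations, the value for the commercials and sales data can be computed as follows. The names used are illustrative.

```python
from statistics import mean, stdev

commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]


def sample_correlation(xs, ys):
    """Sample covariance divided by the product of sample std deviations."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return s_xy / (stdev(xs) * stdev(ys))


r_xy = sample_correlation(commercials, sales)
print("correlation coefficient:", round(r_xy, 2))
```

Unlike the covariance, this quantity has no units and always lies between −1 and 1, which is what makes it comparable across data sets.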




Part II: Probability Concepts and Distributions

What is probability?

Experiments. An experiment E is a specific trial or activity (of scientists, or of human beings in general) whose outcomes possess randomness. Simple examples are:

* Coin throwing: throw a coin; the random outcomes are head (H) or tail (T).

* Temperature measurement: observe the temperatures at noon in HCMC on 10 days of Summer 2007; the random outcomes are recorded in a list.


Probability distributions– Three basic concepts


1. Sample space S: the set of all possible outcomes. Ex. 1: Coin throwing → S = {H, T}.

2. Events: an event is a subset A of the sample space S: A ⊂ S. Usually we collect all events into a set, called the event set

Q := {A : A ⊂ S and A is an event}.

When an experiment E is performed and an outcome a is observed, we say that event A has occurred if a ∈ A.

3. Probability distribution (probability function): a map P from Q to the interval [0, 1]:

P : Q → [0, 1], A ∈ Q ⇒ P(A) = Prob(A).


Axioms of Probability Theory

(A. Kolmogorov, 1933).


A1. Probabilities are nonnegative: 0 ≤ P(A) ≤ 1, where P(A) := Prob(A).

A2. The sample space S has probability 1, that is, P(S) = 1.

A3. For disjoint events A, B, i.e. A ∩ B = ∅:

P(A ∪ B) = P(A or B) = P(A) + P(B),

in which

* A, B ⊂ S are events,


Axioms of Probability Theory


More generally, we have

P(A1 ∪ A2 ∪ ... ∪ Am) = P(A1) + P(A2) + ... + P(Am)

for m mutually disjoint events, i.e. Ai ∩ Aj = ∅ when 1 ≤ i ≠ j ≤ m.

The so-called countable additivity of probabilities is a generalization: for mutually disjoint events A1, A2, ...,

$$ P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i). $$


Assign probabilities to events


Possible ways to assign probabilities to events:

a) Frequency interpretation: probability is based on history (data obtained or observed). For any event A ⊂ S, its probability is the relative frequency

$$ P(A) = \mathrm{Prob}(A) = \sum_{s \in A} P(s). $$

Example 2: Suppose the temperatures in the Temperature measurement experiment above are the list

[34, 29, 28, 32, 31, 32, 30, 31, 30, 33] (in degrees Celsius),

and define the event A = temperatures higher than 30°C. The sample space S is the above list, and if we suppose that the chance of seeing any temperature in S is the same, then

$$ P(A) = \sum_{s \in A} P(s) = \frac{6}{10}. $$
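The relative frequency in Example 2 can be counted directly. The sketch below treats each of the ten recorded temperatures as equally likely, as the example assumes, and uses `fractions.Fraction` to keep the result exact; the variable names are illustrative.

```python
from fractions import Fraction

# Noon temperatures (degrees Celsius) from Example 2.
temperatures = [34, 29, 28, 32, 31, 32, 30, 31, 30, 33]

# Event A: the temperature is higher than 30 degrees.
event_a = [t for t in temperatures if t > 30]

# Each observation has probability 1/10, so P(A) = (count in A) / 10.
p_a = Fraction(len(event_a), len(temperatures))
print("P(A) =", p_a, "=", float(p_a))
```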


Assign probabilities to events


b) Classical interpretation: compute the probability in question from other known probabilities using basic formulas, based on the assumption that all outcomes have equal probability. It applies when the sample space S is finite, |S| = n < ∞; then for any event A ⊂ S, its probability is the fraction found by counting methods:

$$ P(A) = \mathrm{Prob}(A) = \frac{|A|}{|S|}. $$

Example 3: In Coin throwing, S = {H, T}, P(H) = P({H}) = 1/2 = P(T).

c) Subjective interpretation: use a model to hypothesize about a phenomenon possessing randomness.

Example 4: P(survival after a serious surgery) is estimated by the doctor.


Probability of a single event


Computing Rule. For a finite sample space S = {s1, s2, ..., sn}, define pi = P(si); then

$$ p_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{n} p_i = 1. $$

Fact

If all outcomes have equal probabilities, then

$$ P(A) = \mathrm{Prob}(A) = \frac{n_A}{n}, \quad \text{where } n_A = |A|. $$


On a single toss of a die, we get exactly one of six possible outcomes 1, 2, 3, 4, 5 or 6; the sample space is S = {1, 2, 3, 4, 5, 6}, and pi = P(i) = 1/6 for all i = 1, ..., 6.


Multiple events– Rule of addition


* What are mutually exclusive and non-mutually exclusive events? Two events A and B are mutually exclusive if A ∩ B = ∅, i.e. the occurrence of A precludes the occurrence of B. Then

P(A and B) = P(A ∩ B) = 0.

* For mutually exclusive events:

P(A ∪ B) = P(A or B) = P(A) + P(B).

* How about non-mutually exclusive events, i.e. events with A ∩ B ≠ ∅?

Multiple events– Rule of addition


* For non-mutually exclusive events, P(A and B) = P(A ∩ B) ≠ 0:

P(A ∪ B) = P(A or B) = P(A) + P(B) − P(A and B).


If the die is fair, then when tossing it, pi = P(i) = 1/6. The probability of the event Z = 'getting 2 or 3 or 4' is

$$ P(Z) = P(2 \text{ or } 3 \text{ or } 4) = \sum_{s \in Z} P(s) = 3/6. $$
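Both forms of the rule of addition can be checked by counting outcomes of a fair die. In this sketch the helper `prob` and the events A and B are illustrative choices; only the event Z comes from the example above.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # sample space of a single toss of a fair die


def prob(event):
    """Classical probability |A| / |S| for equally likely outcomes."""
    return Fraction(len(event & S), len(S))


Z = {2, 3, 4}
p_z = prob(Z)  # mutually exclusive outcomes: P(2) + P(3) + P(4)

# Hypothetical overlapping events: A = 'even number', B = 'at least 4'.
A = {2, 4, 6}
B = {4, 5, 6}
p_union = prob(A | B)
p_by_rule = prob(A) + prob(B) - prob(A & B)

print("P(Z) =", p_z)
print("P(A or B) =", p_union, "and by the addition rule:", p_by_rule)
```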




Conditional probability


Given that event B “happened”, what is the probability that event A also happened?

Brainstorming thought: narrow down the sample space to the part where B has occurred (that is, compare A ∩ B with B).

The formula: the conditional probability of event A given event B, for P(B) > 0, is

$$ P(A \mid B) = \frac{P(AB)}{P(B)} = \frac{P(A \cap B)}{P(B)}. \qquad (0.1) $$

As a result, the joint probability of two events A and B is

$$ P(AB) = P(A \cap B) = P(A \mid B) \cdot P(B). $$


Bayes’ Theorem


Note also that

P(B) · P(A | B) = P(A) · P(B | A) [since LHS = P(AB) = P(BA) = RHS]


Hence, for any pair of events A, B with P(B) > 0, we always have

$$ P(A \mid B) = \frac{P(A) \cdot P(B \mid A)}{P(B)}. $$
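A small numerical check of the conditional probability formula and of Bayes' theorem, using a fair die. The events A ('even number') and B ('at least 4') and the helper names are assumptions made for this illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # fair die


def prob(event):
    """Classical probability |A| / |S|."""
    return Fraction(len(event & S), len(S))


def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b), defined only when P(b) > 0."""
    return prob(a & b) / prob(b)


A = {2, 4, 6}   # hypothetical event: even number
B = {4, 5, 6}   # hypothetical event: at least 4

p_a_given_b = cond_prob(A, B)
# Bayes' theorem: P(A | B) = P(A) * P(B | A) / P(B).
p_bayes = prob(A) * cond_prob(B, A) / prob(B)

print("P(A | B) =", p_a_given_b, "and via Bayes:", p_bayes)
```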


What are dependent events?


Events A and B are dependent if the occurrence of one is connected in some way to the occurrence of the other.

Then the joint probability of A and B is

P(AB) = P(A) · P(B | A) = P(B | A) · P(A)

or also

P(AB) = P(BA) = P(A | B) · P(B) (0.3)

(since P(BA) = P(B) · P(A | B) = P(A | B) · P(B)).


What are independent events?


Events A and B are independent if the occurrence of A is not connected in any way to the occurrence of B. Then

P(A | B) = P(A) and P(B | A) = P(B) (0.4)

Rule of multiplication. The joint probability of two independent events A and B is

P(AB) = P(A | B) · P(B) by Equation 0.3, so due to Eq. 0.4

P(AB) = P(A) · P(B).
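Whether two events satisfy the rule of multiplication can be tested by counting. The sketch below uses a fair die with three hypothetical events: A and B turn out to be independent, while A and C do not. The helper names are illustrative.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # fair die


def prob(event):
    """Classical probability |A| / |S|."""
    return Fraction(len(event & S), len(S))


def are_independent(a, b):
    """Check the multiplication rule P(ab) = P(a) * P(b)."""
    return prob(a & b) == prob(a) * prob(b)


A = {2, 4, 6}      # even number
B = {1, 2, 3, 4}   # at most 4
C = {4, 5, 6}      # at least 4

print("A, B independent?", are_independent(A, B))
print("A, C independent?", are_independent(A, C))
```

Here P(A ∩ B) = 2/6 equals P(A) · P(B) = (1/2)(2/3), whereas P(A ∩ C) = 2/6 differs from P(A) · P(C) = 1/4.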




Denote events or outcomes with capital letters A, B, C, and so on. If A is one outcome, all other possible outcomes are part of 'A complement' = A^c.

P(A) is the probability that the event or outcome A occurs.

Rule 0: For any event A, 0 ≤ P(A) ≤ 1.

Rule 1: P(A) + P(A^c) = 1, or P(A^c) = 1 − P(A).

Rule 2: If events A and B are mutually exclusive, then P(A or B) = P(A) + P(B).

Rule 3: If events A and B are independent, then P(AB) = P(A) · P(B).

Rule 4: If the ways in which an event B can occur are a subset of those for event A, then P(B) ≤ P(A).


Part II: Probability distributions


A brief introduction to random variables.

Definition

A random variable X is a function from a sample space S to the reals R. For any b ∈ R, the preimage

$$A := X^{-1}(b) = \{w : X(w) = b\} \subset S$$

is an event, and we define

$$\mathrm{Prob}\{X = b\} := \mathrm{Prob}(A) = \sum_{w \in A} \mathrm{Prob}(w).$$

For a finite sample space S with equally likely outcomes, this gives

$$\mathrm{Prob}\{X = b\} = \mathrm{Prob}(A) = \frac{|A|}{|S|}.$$


What is a Probability Distribution?


The probability distribution of a random variable describes how probabilities are distributed over the (range) values of the random variable.

For a discrete random variable X, its probability distribution is the probability function

f(x) = Prob{X = x},

which provides the probability that the r.v. X takes a particular value x ∈ Range(X). We must have

$$f(x) \ge 0 \quad \text{and} \quad \sum_{x \in \mathrm{Range}(X)} f(x) = 1.$$


Discrete Probability Distribution


Example (Die tossing)

On a single toss of a die, we get only one of six possible outcomes 1,2,3,4,5 or 6; then the sample space S = {1, 2, 3, 4, 5, 6}.

Define the random variable X : S → R+ to be the identity function Id, that is X(i) = Id(i) = i, i ∈ S.

The probability distribution associated with X (for a fair die) is the probability function

$$f(i) = \mathrm{Prob}\{X = i\} = \mathrm{Prob}\{X^{-1}(i)\} = \frac{|X^{-1}(i)|}{|S|} = \frac{1}{6}, \quad i = 1, \dots, 6.$$


Practical motivation


Why study Probability Distributions?

Citibank in HCMC makes available financial services, including checking and saving accounts, loans, mortgages, insurance and investment services.

These complicated activities have been done through a Citibanking system consisting of many modules, like

- ATMs, or more advanced,

- the Card Banking Centers (CBCs).


Motivation - Card Banking Centers


What services are available at CBCs, and how do they operate?

Each CBC operates as a waiting line system with randomly arriving customers seeking service at one of the ATMs. CBC capacity studies are used

- to analyze customer waiting line and

- to determine whether additional ATMs are needed.

Data collected by Citibank showed that the random customer arrivals followed a probability distribution known as the Poisson distribution.

Using the Poisson distribution, Citibank can compute probabilities for the number of customers arriving at a CBC during any time period and make decisions concerning the number of ATMs needed.
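To make the arrival computation concrete, here is a minimal sketch of the Poisson probability function P(N = k) = e^(−λ) λ^k / k!, where λ is the mean number of arrivals per time period. The slides do not give Citibank's arrival rate, so the value λ = 2 below is purely hypothetical.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(N = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0  # hypothetical mean number of arrivals per period (not from the slides)

p_none = poisson_pmf(0, lam)                               # no customer arrives
p_at_most_3 = sum(poisson_pmf(k, lam) for k in range(4))   # 0, 1, 2 or 3 arrivals
p_more_than_3 = 1.0 - p_at_most_3                          # more than 3 arrivals

print(round(p_none, 4), round(p_at_most_3, 4), round(p_more_than_3, 4))
```

A capacity study of the kind described would look at tail probabilities such as `p_more_than_3` to judge how often arrivals exceed what the available ATMs can serve.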


Part IIA: Useful Discrete probability distributions



Discrete probability distributions, such as the one used by Citibank, are the topic of this section.

A discrete random variable X is one whose range is finite or countably infinite. The discrete probability distribution f(x) must fulfill:

$$f(x) \ge 0 \quad \text{and} \quad \sum_{x \in \mathrm{Range}(X)} f(x) = 1.$$

Besides the Poisson distribution, two other useful discrete distributions are the Bernoulli and the binomial distributions.


Useful Discrete probability distributions


1/ Bernoulli Distribution B(p). This distribution describes a random variable X that can take only two possible values, i.e. Range(X) = {0, 1}.

The distribution is described by a probability function

p(1) = P(X = 1) = p, p(0) = P(X = 0) = 1 − p, for some p ∈ [0, 1].

It is easy to check that E(X) = p, Var(X) = p(1 − p).

Notice that we used the concepts of expectation E(X) and variance Var(X).


Expectation and Variance- the Discrete Case


Expectation. The expectation operator defines the expected value (or average behavior) of a random variable X as

$$E(X) = \sum_{x \in \mathrm{Range}(X)} P(X = x) \cdot x,$$

where P(X = x) = P(X^{-1}(x)) and X^{-1}(x) = {w : X(w) = x} ⊂ S.

Since the r.v. X : S → R is an assignment of values to the points in the sample space S, you could equivalently think of

$$E(X) = \sum_{w \in S} P(w) \cdot X(w).$$

Variance of a random variable X is

$$\mathrm{Var}(X) = E\big[(X - E(X))^2\big] = \sum_{x \in \mathrm{Range}(X)} P(X = x) \cdot (x - E(X))^2.$$
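The two definitions, E(X) = Σ P(X = x) · x and Var(X) = E[(X − E(X))²], can be checked numerically. The sketch below applies them to the fair die from the earlier example and to a Bernoulli variable; the value p = 3/10 is a hypothetical choice, and the function names are illustrative.

```python
from fractions import Fraction

def expectation(pmf):
    """E(X) = sum of P(X = x) * x, for a pmf given as {value: probability}."""
    return sum(p * x for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E(X))^2]."""
    mu = expectation(pmf)
    return sum(p * (x - mu) ** 2 for x, p in pmf.items())

# Fair die: each face 1..6 has probability 1/6.
die = {i: Fraction(1, 6) for i in range(1, 7)}
die_mean, die_var = expectation(die), variance(die)

# Bernoulli B(p) with a hypothetical p = 3/10.
p = Fraction(3, 10)
bernoulli = {1: p, 0: 1 - p}
bern_mean, bern_var = expectation(bernoulli), variance(bernoulli)

print(die_mean, die_var)    # 7/2 35/12
print(bern_mean, bern_var)  # 3/10 21/100
```

The Bernoulli results agree with E(X) = p and Var(X) = p(1 − p).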


Useful Discrete probability distributions


2/ Binomial distribution B(n, p). This distribution describes a random variable X that is the number of successes in n independent Bernoulli trials with probability of success p.

In other words, X is a sum of n independent Bernoulli r.v.

Therefore, X takes values in {0, 1, ..., n} and the distribution is given by a probability function

$$p(k) = P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n-k}.$$

It is easy to check that E(X) = np, Var(X) = np(1 − p).
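The claims E(X) = np and Var(X) = np(1 − p) can be verified numerically from the probability function. This is a small sketch with hypothetical parameters n = 10 and p = 0.3.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # hypothetical parameters, chosen only for illustration

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)                                          # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))            # should be n * p
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # should be n * p * (1 - p)

print(round(total, 6), round(mean, 6), round(var, 6))
```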


Binomial process- a well-known example


Let H and T be the two outcomes of an experiment such as coin throwing, with sample space S_Coin = {H, T}, and in general, the occurrence likelihoods P(H) = P({H}) = p; P(T) = 1 − p.

Assume that we perform n trials, called Bernoulli Trials, of the experiment and each trial is independent of the others.

For example, the event H on the first trial is independent of the event H on the second trial, and both events have probability p. The sample space S now can be represented by

S = {x1 x2 ··· xn | xi ∈ S_Coin}.

Since the trials are independent, we assign probabilities to the points in S by

$$P(x_1 x_2 \cdots x_n) = p^{k} (1 - p)^{n-k}, \quad \text{where } k \text{ is the number of } H \text{ among } x_1, \dots, x_n.$$


Binomial process- example


Question: What is the probability of exactly k successes in n trials of a binomial experiment where P(success) = p and P(failure) = 1 − p?

Let X be the sum of n independent Bernoulli r.v.; then X takes values in {0, 1, ..., n}.

The event of interest is therefore {X = k}, with k ∈ {0, 1, ..., n}, which means exactly k successes in n trials. By combinatorial reasoning (there are C(n, k) sequences with exactly k successes, each having probability p^k (1 − p)^(n−k)),


the binomial distribution X ∼ Bin(n, p) is given by the probability function

$$p(k) = P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n-k}.$$


Binomial process- example


Example/Quiz. Two fair dice are tossed. If the total is 7, we win $100; if the total is 2 or 12, we lose $100; otherwise we lose $10. What is the expected value of the game?

Reminder: if V : S → R is an assignment of values to the points in sample space S, then

$$E(V) = \sum_{w \in S} P(w) \cdot V(w).$$
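One way to work the quiz is to apply the reminder formula directly, summing P(w) · V(w) over the 36 equally likely outcomes of two fair dice. The sketch below does this with exact fractions; the function name `payoff` is illustrative.

```python
from fractions import Fraction
from itertools import product

def payoff(total: int) -> int:
    """Value V of the game, in dollars, for a given total of the two dice."""
    if total == 7:
        return 100
    if total in (2, 12):
        return -100
    return -10

# Sample space: 36 equally likely ordered pairs (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))
p_each = Fraction(1, len(outcomes))

# E(V) = sum over all outcomes w of P(w) * V(w).
expected_value = sum(p_each * payoff(a + b) for a, b in outcomes)

print(expected_value)  # 10/3
```

Under these rules the game is worth about $3.33 per play to the player: 6 outcomes win $100, 2 lose $100, and the remaining 28 lose $10.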


Part IIB: Continuous probability distributions


A random variable X is a continuous random variable iff its range Range(X) is a continuous set (such as R or an interval of R).

Continuous distributions. A continuous probability distribution refers to the range Range(X) of all possible values that a continuous random variable X can assume, together with the associated probabilities P(X ≤ t).

The probability distribution of a continuous random variable X is described by its probability density function (pdf), or simply probability function, denoted f_X(t).

Key continuous probability distributions include:
- the normal distribution and
- the exponential distribution.


Continuous probability distribution- determination


The distribution function or cumulative distribution function (cdf) of X is the function defined by

FX(t) = P(X ≤ t), −∞ < t < ∞

Let X be a r.v. with cdf F_X(t). We say that X is a continuous random variable iff its range Range(X) contains an interval (either finite or infinite) of real numbers.

The cdf F_X(t) must have a derivative

$$\frac{dF_X(t)}{dt} =: f(t).$$

This function is defined almost everywhere and is piecewise continuous.


Probability density and Cumulative function


Thus, if X is a continuous r.v., then P(X = t) = 0.

The probability function

$$f(t) = \frac{dF_X(t)}{dt}$$

- is called the probability density function of X, and
- is given by a smooth curve C such that
  * the total area (probability) under the curve is 1.


Continuous probability distributions- Properties


However, the probability that a continuous random variable X assumes any value within a given interval, say [a, b], is measured by the area under the curve C within that interval. In other words, the probability of the event “a ≤ X ≤ b” is:

$$\mathrm{Prob}(a \le X \le b) = \mathrm{Prob}(a < X < b) = \int_a^b f(x)\,dx.$$

The mean µ of a continuous probability distribution with pdf f(x) is given by

$$\mu = E(X) = \int_{x \in \mathrm{Range}(X)} x\, f(x)\,dx,$$

and the variance by

$$\mathrm{Var}\,X = \sigma^2 = \int_{x \in \mathrm{Range}(X)} (x - \mu)^2 f(x)\,dx.$$


Important continuous probability distributions


Two key ones: the normal distribution and the exponential distribution.

- Normal distribution: found to be useful in numerous areas such as medical science, petroleum engineering, environmental, biological and ecological sciences ...

- Exponential distribution: found to be useful in numerous other areas such as mass manufacturing, mechanical and electronic engineering ...


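(a) Normal distribution- the first continuous one


If X is a normal random variable, the normal distribution is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty,\ \mu \in \mathbb{R},\ \sigma^2 > 0. \tag{0.5}$$

We write X ∼ N(µ, σ²), where

- f(x) = height of the normal curve,
- e = constant ≈ 2.718,
- π = constant ≈ 3.1416,
- µ is the mean, and
- σ is the standard deviation.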


(b) Exponential distribution- the second continuous one


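The exponential distribution is

$$f(x) = f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0, \tag{0.6}$$

where λ > 0 is a constant. The mean and the variance are

$$\mu = \frac{1}{\lambda}; \qquad \sigma^2 = \mu^2 = \frac{1}{\lambda^2}.$$

The exponential cumulative distribution function (cdf) is

$$F(a) = \mathrm{Prob}(X \le a) = \int_0^a f(t)\,dt = \int_0^a \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda a}, \quad a \ge 0.$$

Reminder: the relation between pdf f and cdf F is

$$f(t) = \frac{dF_X(t)}{dt} \iff F_X(a) = \int_{-\infty}^{a} f(t)\,dt.$$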


Practical uses of Exponential distribution


The exponential distribution is widely used in the field of Reliability Engineering, for example as a model of the time to failure (TTF) of a component or system. In that case,

- the parameter λ is called the failure rate of the system, and
- the distribution’s mean µ = 1/λ is called the mean time to failure (MTTF).


An electronic component in an airborne radar system has a useful life X described by an exponential distribution with failure rate 10^(−4)/h, that is λ = 10^(−4). Compute the MTTF for this component.


The mean time to failure for this component is its expected life, which is the mean µ = 1/λ = 10^4 = 10000 h.
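The same numbers can be reproduced in a few lines using the exponential cdf F(a) = 1 − e^(−λa). In this sketch the failure rate comes from the example above, while the 1000 h operating time is a hypothetical value added only to show the cdf in use.

```python
import math

lam = 1e-4        # failure rate per hour, from the radar example
mttf = 1 / lam    # mean time to failure, mu = 1 / lambda

def exp_cdf(a: float, lam: float) -> float:
    """F(a) = P(X <= a) = 1 - exp(-lam * a) for a >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-lam * a) if a >= 0 else 0.0

t = 1000.0                                  # hypothetical operating time in hours
p_fail_by_t = exp_cdf(t, lam)               # P(component fails within t hours)
p_survive_mttf = 1.0 - exp_cdf(mttf, lam)   # P(X > MTTF) = exp(-1)

print(mttf, round(p_fail_by_t, 4), round(p_survive_mttf, 4))
```

Note that the probability of surviving beyond the MTTF is e^(−1), roughly 37%, not 50%: the mean of an exponential distribution is larger than its median.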


Part IIB: Normal distribution- Properties


The normal curve (of the probability function f(x)) is
- bell-shaped,

- symmetrical about the mean, and

- asymptotic: when we move further away from the mean in both directions, the normal curve approaches the horizontal axis.

A quantitative description of these properties is given by the three most useful cases:

- The area of A1 = {x : |x − µ| ≤ σ} takes 68.26% of the whole area (probability 1).
- The area of A2 = {x : |x − µ| ≤ 2σ} takes 95.44% of the whole area.
- The area of A3 = {x : |x − µ| ≤ 3σ} takes about 99.7% of the whole area.
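These areas can be checked without tables, since for a normal variable P(|X − µ| ≤ kσ) = erf(k/√2), where erf is the error function available in Python's standard `math` module. A minimal sketch:

```python
import math

def prob_within(k: float) -> float:
    """P(|X - mu| <= k * sigma) for X ~ N(mu, sigma^2), via the error function."""
    return math.erf(k / math.sqrt(2))

areas = {k: prob_within(k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} sigma: {100 * area:.2f}%")
```

The computed values (about 68.27%, 95.45% and 99.73%) differ from the slide's figures only in the last digit, which is the kind of gap expected when areas are read from a four-decimal table.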


Normal distribution- Computation


Observation: The cdf of a normal random variable X ∼ N(µ, σ²), given by

$$F(a) = \mathrm{Prob}(X \le a) = \int_{-\infty}^{a} f(x)\,dx, \quad \text{where } f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}},$$

cannot be expressed in closed form in terms of elementary functions. To compute probabilities we use the z-transformation

$$z = \frac{x - \mu}{\sigma},$$

so that z ∼ N(0, 1). In practice we can use standard normal tables to look up the probabilities concerned.


Normal distribution- Computation 1


Key facts (see Table 6.1 at 233, Buz Stat. 1):

x = µ ⇐⇒ z = 0;
x = µ + σ ⇐⇒ z = 1;
x = µ + kσ ⇐⇒ z = k;
x = µ + 1.96σ ⇐⇒ z = 1.96;
x = µ + 1.645σ ⇐⇒ z = 1.645;
x = µ + 2.576σ ⇐⇒ z = 2.576.


Normal distribution- Computation 2


For instance, take a normal random variable X ∼ N(µ, σ²) with µ = 10, σ = 2. Since z = (x − µ)/σ, we have P(10 ≤ X ≤ 14) = P(0 ≤ z ≤ 2). Table 6.1 at 233, Buz Stat. 1 gives the probability 0.4772 for z = 2. So P(10 ≤ X ≤ 14) = 0.4772.
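As a cross-check on the table lookup, the probability P(10 ≤ X ≤ 14) for µ = 10, σ = 2 can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), both directly and after the z-transformation. This is only a sketch of the same calculation, not a replacement for the table method.

```python
from statistics import NormalDist

X = NormalDist(mu=10, sigma=2)

# Direct computation: P(10 <= X <= 14) = F(14) - F(10).
p_direct = X.cdf(14) - X.cdf(10)

# Same probability after the z-transformation z = (x - mu) / sigma.
z_lo = (10 - X.mean) / X.stdev   # 0.0
z_hi = (14 - X.mean) / X.stdev   # 2.0
Z = NormalDist()                 # standard normal N(0, 1)
p_standard = Z.cdf(z_hi) - Z.cdf(z_lo)

print(round(p_direct, 4), round(p_standard, 4))
```

Both routes give the same value, which rounds to the tabulated 0.4772.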


Part III: Statistical Estimation


- Interval Estimation– Population Mean- σ known case
- Basics of the Sampling Distribution of the sample mean x̄

Interval Estimation means

- an interval estimate of a (population) parameter p,

- where two statistics L, R, say, bound the possible values of p with a given probability.


What is Interval Estimation?


An interval estimate of a (population) parameter p: the interval between two statistics L, R, say, that includes the true value of the parameter with some probability.

L| − − − − − − − − − − − − − − − p − − − − − − − − − −|R

E.g., formally, an interval estimator of the mean parameter p := µ (an important parameter, most used in Statistics) consists of three components: two statistics L, R and the confidence coefficient or level 1 − α.

The three components and the parameter concerned must satisfy:

P{L ≤ µ ≤ R} = 1 − α = β (0.7)


Part IIIA: Interval Estimation– Example


P{L ≤ µ ≤ R} = 1 − α = β (0.8)

The interval [L, R] = {L ≤ µ ≤ R} is called a 100(1 − α)% = 100β% confidence interval for the unknown µ.

E.g., if µ is the mean of the most productive age of human beings, α = 0.1 ⇒ 1 − α = β = 0.9, L = 35, R = 45,

then the interval [35, 45] = {35 ≤ µ ≤ 45} is called a 100(1 − α)% = 100β% = 90% confidence interval for the population mean µ.


Why do we study Interval Estimation?


In Statistics, a point estimate of a population parameter is a sample statistic used to estimate that population parameter. But a point estimator cannot be expected to provide the exact value of the population parameter.

Instead, an interval estimate is often computed by adding a margin of error to, and subtracting it from, the point estimate:

An interval estimate = Point estimate ± Margin of error

For example, an interval estimate of the mean µ is [L, R] = µ̂ ± f(α), where f(α) is the margin of error.


Why do we study Interval Estimation?


[L, R] = µ̂ ± margin of error f(α)

means P{L ≤ µ ≤ R} = 1 − α = confidence level.

Mathematically, an interval estimate refers to a range of values together with the probability, called the confidence level, that the interval includes the unknown population parameter:

L| − − − − − − − − − − − − − − µ̂ − − − µ − − − − − − − − − −|R

L = µ̂ − f(α), R = µ̂ + f(α), and the probability that µ ∈ [L, R] is 1 − α.

Here f(α) is the radius, measuring how wide the interval bounding µ is.


Part IIIA: Population Mean- σ known case


We first consider estimation of the population mean in the case where the population variance σ² is known.

Specific practical application in business. Consider the monthly customer service survey conducted by a start-up, biological-applications-oriented firm in HCMC.

Key facility: the firm provides a website for accepting customer orders and providing follow-up services over the Internet.


Aim of statistical inference


The firm’s quality assurance team uses a customer service survey to measure the satisfaction of customers with its website and online customer service.

Statistically, how? The team sends a questionnaire each month to a random sample of customers who placed an order or requested service during previous months.

Key Aim of statistical inference. To draw conclusions or make decisions about a population based on a random sample selected from the population.


What are components/questions of the survey?



We rate the satisfaction of customers by asking questions:

1. How easy is placing orders?
2. How timely is delivery?
3. How accurate is order filling?
4. How efficient is the technical advice?

Summarizing the data, how? Compute an overall satisfaction score x from 0 to 100. In the most recent month,

- a sample of n = 100 customers was surveyed,
- a sample mean x̄ = 82 of customer satisfaction was then computed.

* We will assume that random samples are used in the analysis.


What are Random Samples?



Random samples. A sample x1, x2, ..., xn of size n is random iff the observations {xi} are independently and identically distributed (i.i.d.). This concept is applicable to both finite and infinite populations, where sampling is performed with replacement.

Sampling without replacement. In sampling without replacement from a finite population of N items, we say that a sample of n items {xi} is a random sample iff each of the C(N, n) = N! / (n! (N − n)!) possible samples has an equal probability of being chosen.
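To see the definition for sampling without replacement at work on a small case, the sketch below lists every possible sample of size n = 2 from a hypothetical population of N = 5 items, confirms that there are C(5, 2) = 10 of them, and draws one simple random sample. The population labels and the seed are illustrative assumptions.

```python
import math
import random
from fractions import Fraction
from itertools import combinations

population = ["a", "b", "c", "d", "e"]  # hypothetical population, N = 5
n = 2                                   # sample size

# Every possible sample of size n without replacement (order ignored).
all_samples = list(combinations(population, n))
num_samples = math.comb(len(population), n)   # C(N, n) = 10

# Under simple random sampling each sample has the same probability.
p_each = Fraction(1, len(all_samples))

# Draw one simple random sample; random.sample draws without replacement.
rng = random.Random(0)                  # fixed seed so the draw is repeatable
sample = rng.sample(population, n)

print(num_samples, p_each, sample)
```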


A bit of Sampling for Statistical Inference


Sampling from a finite population.

A simple random sample of size n (from a finite population of size N) is a sample selected such that

each possible sample of size n has the same probability of being selected.

Sampling without replacement, from a finite population of N items, is the sampling procedure used most often; and when we refer to simple random sampling, we assume that the sampling is without replacement.

In our instance, N = the number of all customers of the firm, and n = 100 (the number of customers to whom we sent the questionnaire in the last month).


The core equality of Interval Estimation


Recall the key proposed equality for estimating a parameter of interest:


An Interval Estimate = Point estimate ± Margin of error,
−→ An Interval Estimate of µ = x̄ ± some error.

Here, the sample mean x̄ provides a point estimate of the population mean µ (of satisfaction scores) for the population of all customers.

From the surveys over many months (all sampling months), we consistently found an estimate of approximately 20 for the standard deviation, i.e. σ = 20.


Case of σ known- Point estimator usage

Key observation- assumption: Moreover, the historical data (from the survey for all sampling months) show that the population of satisfaction scores is normally distributed, with a standard deviation σ = 20. Hence σ is known.

By the proposed equality

Interval Estimate of µ = x̄ ± some error margin,

how could we determine

- the Margin of Error, and as a result

- the Interval Estimate of the population mean µ by its Point estimate x̄?

The case of σ unknown is far more complicated, and will be discussed in Part IV: Hypothesis Testing.


Main Proposition 1



Given the standard deviation σ_x̄ of the sample mean (obtained from the known population standard deviation σ, or estimated by Eqn. 0.9), and provided that the population is normal or that the random sample has size at least 30 (see Fact 11), we can find the 95% confidence interval for the unknown population mean from

P(x̄ − 1.96 σ_x̄ < µ < x̄ + 1.96 σ_x̄) = 0.95

or, equivalently,

P(|µ − x̄| < 1.96 σ_x̄) = 0.95

In our instance, σ_x̄ = 2 (why? because σ_x̄ = σ/√n = 20/√100 = 2).


Main Proposition 1- the general case


With the Margin of Error

$$e = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}},$$

the general form of an Interval Estimate of a population mean µ with known standard deviation σ, with the confidence coefficient/level 1 − α, is

$$\left[\bar{x} - z_{\alpha/2} \cdot \sigma_{\bar{x}},\ \bar{x} + z_{\alpha/2} \cdot \sigma_{\bar{x}}\right] \ni \mu$$

or

$$P\left(|\mu - \bar{x}| < z_{\alpha/2} \cdot \sigma_{\bar{x}}\right) = 1 - \alpha,$$

where σ_x̄ = σ/√n and z_{α/2} is the z value providing an area α/2 in the upper tail of the standard normal probability distribution.
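The general interval x̄ ± z_{α/2} · σ/√n can be sketched as a small function and applied to the survey figures quoted earlier (x̄ = 82, σ = 20, n = 100). The function name `z_interval` is illustrative; the z value is obtained from `statistics.NormalDist().inv_cdf` in the Python standard library rather than from a table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar: float, sigma: float, n: int, conf: float = 0.95):
    """Interval estimate of the mean when sigma is known: xbar +/- z * sigma / sqrt(n)."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z value with area alpha/2 in the upper tail
    margin = z * sigma / sqrt(n)              # margin of error e
    return xbar - margin, xbar + margin

# Survey example: sample mean 82, sigma = 20, n = 100 customers.
lo95, hi95 = z_interval(82, 20, 100, conf=0.95)
lo90, hi90 = z_interval(82, 20, 100, conf=0.90)

print(f"95%: [{lo95:.2f}, {hi95:.2f}]")   # 95%: [78.08, 85.92]
print(f"90%: [{lo90:.2f}, {hi90:.2f}]")   # 90%: [78.71, 85.29]
```

Lowering the confidence level from 95% to 90% shrinks z_{α/2} from about 1.96 to about 1.645, so the interval narrows.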


Proof of Main Proposition 1


1/ Basics of the Sampling Distribution of the sample mean x̄
a) is the sample mean x̄ a random variable?

b) find the standard deviation σ_x̄ of the sample mean

2/ Central Limit Theorem- Formal form


Part IIIB: Basics of the Sampling Distribution of the sample mean x̄


Given a data set S, for chosen simple random samples A, B, ..., we can compute the sample mean x̄_A of A, the sample mean x̄_B of B, ...

Considering the process of selecting a simple random sample as an experiment, the sample mean x̄ is the numerical description of the outcome of the experiment. Thus,

the sample mean x̄ is a random variable, and therefore x̄ has a mean µ_x̄, a standard deviation σ_x̄ and a probability distribution.


Since the various possible values of x̄ are results of distinct simple random samples, the probability distribution of x̄ is called the Sampling Distribution of x̄.
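A tiny enumeration makes the idea of a sampling distribution tangible. The sketch below uses a hypothetical population {1, 2, 3, 4, 5} and lists every i.i.d. sample of size n = 2 (sampling with replacement, order kept), then tabulates the distribution of the sample mean. The population and the sample size are assumptions chosen to keep the enumeration small.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5]   # hypothetical population values
n = 2                          # sample size

# All equally likely i.i.d. samples of size n (with replacement, ordered).
samples = list(product(population, repeat=n))

# Sampling distribution of the sample mean: value -> probability.
counts = Counter(Fraction(sum(s), n) for s in samples)
dist = {m: Fraction(c, len(samples)) for m, c in counts.items()}

# Population mean and variance.
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

# Mean and variance of the sample mean, from its sampling distribution.
mean_xbar = sum(m * p for m, p in dist.items())
var_xbar = sum((m - mean_xbar) ** 2 * p for m, p in dist.items())

print(mean_xbar, var_xbar, sigma2 / n)   # 3 1 1
```

In this example the mean of x̄ equals the population mean, and its variance equals σ²/n, so its standard deviation is σ/√n.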


Sampling Distribution of the sample mean x̄


The standard deviation σ_x̄ is found by the formula

$$\sigma_u^2 = \frac{\sum_{i=1}^{n} (u_i - \bar{u})^2}{n - 1}, \tag{0.9}$$

where the random variable u := x̄.


For infinite population and if σ, the standard deviation of the population is known,

n = the sample size, N = the population size,

then a formula for the standard deviation of x follows: σx =

σ √