Basic Statistics

(1)

Business Analytics

(2)

Agenda

 Introduction  Data

(3)

3. Basic Statistics

I. Probability

II. Random variables

III. Probability distribution

IV. The Central Limit Theorem

V. Sampling and statistical inference

VI. Confidence intervals

(4)

3.a. Probability

 Probability is a numerical way of describing how likely something is to happen.

 One of the fundamental methods of calculating probability is by using set theory.

 A set is defined as a collection of objects and each individual object is called an element of that set.

• Example: in the credit card data, the distinct numbers of credit cards owned form a set: # Cards = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

• The numbers on a die form a set: Die = {1, 2, 3, 4, 5, 6}

 The sample space (S ) is the set of all possible outcomes that might be observed for an event/experiment.

 If all elements of the sample space are equally likely, then we can define the probability of event A as:

• P(A) = (# elements in A)/(# elements in sample space)

• e.g. P(# Cards = 1) = (# of customers having 1 card)/(Total number of customers) = 100/1000 = 0.10 = 10%

• e.g. Probability of rolling an even number on a die:

Sample space (S) = {1, 2, 3, 4, 5, 6}

Event (A) = {2, 4, 6}

P(A) = 3/6 = 0.5 = 50%
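The counting rule P(A) = |A|/|S| can be sketched in a few lines of Python. This is a minimal illustration, assuming equally likely outcomes; the helper name `probability` is hypothetical and not part of the course material.

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(A) = |A| / |S|, valid only when all outcomes in S are equally likely."""
    event = set(event) & set(sample_space)  # ignore anything outside S
    return Fraction(len(event), len(sample_space))

sample_space = {1, 2, 3, 4, 5, 6}                 # faces of a die
even = {x for x in sample_space if x % 2 == 0}    # event A = {2, 4, 6}

p_even = probability(even, sample_space)
print(p_even, float(p_even))
```

Exact fractions are used so that the result can be compared directly with the hand calculation 3/6.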

 Why is it important from an analytics perspective?

• What we do: analyze historical data to find patterns, under the assumption that the past is a reflection of the future.

(5)

3.a. Probability- Other topics

 Set operations

• Union (A ∪ B)

• Intersection (A ∩ B)

 Venn diagrams

• Basic operations on Venn diagrams

 Basic probability axioms

1. P(S) = 1

2. P(A) >= 0 for all A ⊆ S

3. P(A ∪ B) = P(A) + P(B) – P(A ∩ B)

 Conditional probability

• P(A|B) = P(A ∩ B)/P(B)

 Bayes theorem
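The set operations and rules on this slide can be checked numerically on a small example. The sketch below assumes a fair die; the events A and B are hypothetical choices made only for illustration, and Bayes theorem is written in its usual form P(A|B) = P(B|A) P(A)/P(B).

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space: one roll of a fair die
A = {2, 4, 6}            # event: even number
B = {4, 5, 6}            # event: number greater than 3 (illustrative choice)

def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & S), len(S))

p_union = P(A | B)                        # set union
p_addition = P(A) + P(B) - P(A & B)       # addition rule, using the intersection

p_a_given_b = P(A & B) / P(B)             # conditional probability P(A|B)
p_b_given_a = P(A & B) / P(A)             # conditional probability P(B|A)
p_bayes = p_b_given_a * P(A) / P(B)       # Bayes theorem gives P(A|B) again

print(p_union, p_addition, p_a_given_b, p_bayes)
```

Python's `|` and `&` operators on sets correspond to union and intersection, which keeps the code close to the notation on the slide.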

(6)

3.b. Random variables

I. Definition

II. Types of Random Variables 1. Discrete

2. Continuous

III. Distribution and Probability Density functions of Random Variables

IV. Expected value (or Mean) of Random Variables

V. Variance of Random Variables

(7)

3.b. Random variables- Definition



A random variable is a function (or rule) that maps each outcome in a sample space to a real number.

 So, if w is an element of the sample space S (i.e. w is one of the possible outcomes of the experiment concerned) and the number x is associated with this outcome, then X(w) = x .

 Convention:

• Denote random variable by capital letter “X”

• Denote the outcome or possible values by small letter “x” i.e. X(w) = x

[Figure: the random variable X maps the outcomes w₁, w₂, w₃, ... of the sample space to the real numbers x₁, x₂, x₃, ..., with X(w) = x]

(8)

3.b. Random variables- Definition

Example:



Suppose there are 8 balls in a bag. The random variable X is the weight, in kg, of a ball

selected at random. Balls 1, 2 and 3 weigh 0.1kg, balls 4 and 5 weigh 0.15kg and balls 6, 7

and 8 weigh 0.2kg. Using the notation above, write down this information.

Solution:



X(b1) = 0.10 kg, X(b2) = 0.10 kg, X(b3) = 0.10 kg,

X(b4) = 0.15 kg, X(b5) = 0.15 kg,

X(b6) = 0.20 kg, X(b7) = 0.20 kg, X(b8) = 0.20 kg

[Figure: balls b₁ to b₈ mapped by the random variable X (weight) to the values 0.10, 0.15 and 0.20, with X(b_i) = x]

(9)

3.b. Types of Random variables

 There are two types of random variables:

1. Discrete Random Variables

2. Continuous Random Variables

(10)

3.b. Discrete Random variables

Definition:

 The set of all possible values of the outcome (x) is discrete (countable)

• e.g. Outcome of rolling a die = {1, 2, 3, 4, 5, 6}

• Or # credit cards owned by an individual = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

Probabilities:

 Probabilities are defined on events (subsets of the sample space S).

 So what is meant by “P(X = x)”?

• Suppose the sample space consists of eight outcomes {s₁, s₂, s₃, s₄, s₅, s₆, s₇, s₈}

• Let the outcomes in

– E1 = {s₁, s₂, s₃} be associated with the number x1

– E2 = {s₄, s₅} be associated with the number x2

– E3 = {s₆, s₇, s₈} be associated with the number x3

• P(X = x1) means P(E1)

• P(X = x2) means P(E2)

• P(X = x3) means P(E3)

(11)

3.b. Discrete Random variables



Probability functions

• The function f_X(x) = P(X = x), for each x in the range of X, is the probability function (PF) of X

• It specifies how the total probability of 1 is divided up amongst the possible values of X

• Thus, it gives the probability distribution of X.

• Also known as the probability mass function (PMF); in these slides it is also loosely called a “probability distribution function” (pdf)

 Following are the requirements for a function to qualify as the probability function of a discrete random variable:

• f_X(x) >= 0 for all x within the range of X

• ∑ f_X(x) = 1, summing over all x in the range of X

 Cumulative distribution functions

• Gives the probability that X assumes a value that does not exceed x.

• Denoted as F_X(x) = P(X <= x), where the maximum value of F_X(x) is 1

(12)

3.b. Discrete Random variables- Probability

Example:



Suppose there are 8 balls in a bag. The random variable X is the weight, in kg, of a ball

selected at random. Balls 1, 2 and 3 weigh 0.1kg, balls 4 and 5 weigh 0.15kg and balls 6, 7

and 8 weigh 0.2kg. Write down the different probability distribution functions.

Solution:

 f_X(0.10) = P(X=0.10) = probability the ball b1 or b2 or b3 is selected out of 8 balls = 3/8

 f_X(0.15) = P(X=0.15) = probability the ball b4 or b5 is selected out of 8 balls = 2/8

 f_X(0.20) = P(X=0.20) = probability the ball b6 or b7 or b8 is selected out of 8 balls = 3/8

 F_X(0.10) = P(X <= 0.10) = P(X=0.10) = 3/8

 F_X(0.15) = P(X <= 0.15) = P(X=0.10) + P(X=0.15) = 3/8 + 2/8 = 5/8

 F_X(0.20) = P(X <= 0.20) = P(X=0.10) + P(X=0.15) + P(X=0.20) = 3/8 + 2/8 + 3/8 = 8/8 = 1

[Figure: balls b₁ to b₈ mapped by X to the values x1 = 0.10, x2 = 0.15 and x3 = 0.20, with X(b_i) = x]
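The probability function and cumulative distribution function of the ball example can be built directly from the list of weights. This is a small sketch; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction

# Weight in kg of each of the 8 balls b1..b8, as given in the example
weights = [0.10, 0.10, 0.10, 0.15, 0.15, 0.20, 0.20, 0.20]

n = len(weights)
counts = Counter(weights)

# Probability function f_X(x) = P(X = x), with x in increasing order
pmf = {x: Fraction(c, n) for x, c in sorted(counts.items())}

# Cumulative distribution function F_X(x) = P(X <= x) as a running sum
cdf = {}
running = Fraction(0)
for x, p in pmf.items():
    running += p
    cdf[x] = running

for x in pmf:
    print(x, pmf[x], cdf[x])
```

Note that `Fraction` reduces 2/8 to 1/4 when printing; the value is the same as on the slide.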

(13)

3.b. Continuous Random variables

Definition:

 The set of possible values taken by a continuous random variable falls in an interval (or a collection of intervals) on the real line:

• e.g. Salary of a set of individuals

• Mathematical examples: {x: x > 0} or {x: −∞ < x < ∞} or {x: 0 < x < 1}

Probability Density Function

 First define the range or interval over which the probability has to be determined, say (a, b).

 The associated probability is written P(a < X < b) or P(a ≤ X ≤ b).

 It equals the area under the curve of the probability density function (PDF) from a to b.

 So probabilities can be evaluated by integrating the PDF f_X(x); this relationship defines the PDF.

 Mathematically

• P(a < X < b) = ∫_a^b f_X(x) dx

 The conditions for a function to serve as a PDF are

• f_X(x) ≥ 0 for −∞ < x < ∞

• ∫_{−∞}^{∞} f_X(x) dx = 1

(14)

3.b. Continuous Random variables

Cumulative distribution function:

 The cumulative distribution function (CDF) is defined to be the function: • F_X(x) = P(X ≤ x)

 For a continuous random variable, F_X(x) is a continuous, non-decreasing function, defined for all real values of x.

(15)

3.b. Random variables- Expected values

Definition:

 Expected values are numerical summaries of important characteristics of the distributions of random variables.

 The expected value of a random variable X is denoted E[X]

 Important expected values are

• Mean

• Variance and standard deviation

 Mean:

• E[X] is a measure of central location

• For the discrete case, calculated as E[X] = ∑ x_i * P_i, i.e. E[X] = ∑ x * f_X(x)

• For the continuous case, calculated as E[X] = ∫_{−∞}^{∞} x * f_X(x) dx

• Usually denoted by μ

 Variance:

• Var[X] = E[(X – E[X])^2]

(16)

3.b. Random variables- Expected values

Example:



Suppose there are 8 balls in a bag. The random variable X is the weight, in kg, of a ball

selected at random. Balls 1, 2 and 3 weigh 0.1kg, balls 4 and 5 weigh 0.15kg and balls 6, 7

and 8 weigh 0.2kg. Find mean and variance of weight.

Solution:

 f_X(0.10) = P(X=0.10) = 3/8

 f_X(0.15) = P(X=0.15) = 2/8

 f_X(0.20) = P(X=0.20) = 3/8

E[X] = ∑ P_i * x_i = 3/8 * 0.10 + 2/8 * 0.15 + 3/8 * 0.20 = 1.2/8 = 0.15 kg

E[X^2] = ∑ P_i * x_i^2 = 3/8 * 0.10^2 + 2/8 * 0.15^2 + 3/8 * 0.20^2 = 0.024375 kg^2

Var[X] = E[X^2] – (E[X])^2 = 0.024375 – 0.0225 = 0.001875 kg^2

[Figure: balls b₁ to b₈ mapped by X to the values x1 = 0.10, x2 = 0.15 and x3 = 0.20, with X(b_i) = x]
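The mean and variance of the ball weights can be reproduced from the probability function. The sketch below uses the shortcut Var[X] = E[X^2] – (E[X])^2; names are illustrative.

```python
# Probability function of the ball weights (kg) from the example
pmf = {0.10: 3 / 8, 0.15: 2 / 8, 0.20: 3 / 8}

# E[X] = sum of x * f_X(x)
mean = sum(x * p for x, p in pmf.items())

# E[X^2] = sum of x^2 * f_X(x)
second_moment = sum(x ** 2 * p for x, p in pmf.items())

# Var[X] = E[X^2] - (E[X])^2
variance = second_moment - mean ** 2

print(round(mean, 6), round(second_moment, 6), round(variance, 6))
```

The results are rounded for display because floating-point arithmetic introduces tiny errors in the last digits.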

(17)

3.c. Discrete Probability distributions

I. Define and describe Discrete Probability distributions

1) Uniform

2) Bernoulli

3) Binomial

4) Poisson

(18)

3.c. Discrete PDF- Uniform distribution

 Sample space S = {1, 2, 3,…,k} .

 Probability measure:

• equal assignment (1/k) to all outcomes i.e. all outcomes are equally likely.

 Random variable X defined by X(i) = i , (i = 1, 2, 3,…,k) .

 Distribution: P(X = x) = 1/k where x = (1, 2, 3, 4,….,k)

 Expected values:

• Mean, μ = (k + 1)/2

• Variance, σ^2 = (k^2 – 1)/12

 Example: Assigning equal probability of default to a portfolio of credit card holders.

[Figure: bar chart of a discrete uniform distribution on the values 1 to 10, each with probability 0.1]

(19)

3.c. Discrete PDF- Bernoulli distribution

 A Bernoulli trial is an experiment which has only two possible outcomes – s (“success”) and f (“failure”).

 “success” and “failure” are mere labels and should not be taken literally. Instead we could have “yes” and “no” OR “true” and “false”

 Sample space S = {s,f} .

 Probability measure:

• P({s}) = p, P({f}) = 1 – p, where 0 < p < 1

 Random variable X defined by X(s) = 1, X(f) = 0.

 Distribution: P(X = x) = p^x * (1 – p)^(1 – x), x = 0, 1; 0 < p < 1

 Expected values:

• Mean, μ = p

• Variance, σ^2 = p (1 – p)

 Examples:

• Tossing of a coin. “Head” corresponds to “success” and “Tail” corresponds to “failure”.

• Defaulting on a home loan. “Default” corresponds to “success” and “Non-default” corresponds to “failure”.

• Auto insurance policy. “No claim” corresponds to “success” and “Claim” corresponds to “failure”.

(20)

3.c. Discrete PDF- Bernoulli distribution

 Bernoulli distribution with probability of success (p) = 0.25.

[Figure: bar chart of the Bernoulli distribution with p = 0.25, showing P(X = 1) = 0.25 for s and P(X = 0) = 0.75 for f]

(21)

3.c. Discrete PDF- Binomial distribution

Assumptions

 Each trial has only 2 possible outcomes, success and failure.

 The trials are identical and their number is fixed (usually denoted by n)

 The probability of success p is constant (0 <= p <= 1).

 All trials are independent

Example

 Number of borrowers that may default during a time period

• If we know the total number of borrowers and a constant probability of default (PD) for all borrowers, assuming default independence

 Number of claims on insurance policy from total policy holders

Binomial Distribution – One Parameter Distr. (p, with n fixed)

The probability of getting exactly k successes in n trials is given by the probability mass function:

P(X = k) = n_C_k * p^k * (1 – p)^(n – k), for k = 0, 1, 2, ..., n, where n_C_k = n!/{(n-k)! * k!}

Mean (X) = n * p

Variance (X) = n * p * q, where q = 1 – p

Discrete (Counting) Distribution – Useful for Modeling Frequency of Losses

(22)

3.c. Case study for binomial distribution

 Each operational loss in a business line (BL) is assumed, independently, to be insured with probability 60%. (The 60% is derived from historical data as number of insured losses / total number of losses in the BL over the last 36 months.) This implies that the annual number of insured losses is a sum of Bernoulli trial results and follows a binomial distribution. If 20 losses occur in the BL during a particular year, what is the probability that the Bank has insurance in 10 or fewer cases?

 Answer: Refer sheet: “Ex-Binomial”
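As an alternative to the spreadsheet, the case study can be sketched in Python using the binomial probability mass function with n = 20 and p = 0.6. The function names below are hypothetical helpers, not taken from the referenced sheet.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = nCk * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_cdf(k, n, p):
    """P(X <= k), summing the probability mass function from 0 to k."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

n_losses = 20      # losses in the year
p_insured = 0.60   # probability that a single loss is insured

p_at_most_10 = binomial_cdf(10, n_losses, p_insured)
print(round(p_at_most_10, 4))
```

Under these assumptions the probability comes out at roughly one in four. `math.comb` requires Python 3.8 or later.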

(23)

3.c. Discrete PDF- Poisson distribution

Expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event

Assumptions

 Constant mean (number of events in a pre-specified time interval)

 The interval length between two consecutive events follows an exponential distribution (λ)

 Sum of independent Poisson variables is also Poisson

 λ for a 12M period may be taken as 4 x λ for a 3M period

Example

 For instance, the occurrence of operational risk losses or credit defaults during a time period, if individual events are independent

Poisson Distribution – One Parameter Distr.

Expected number of occurrences in the interval = λ. The probability that there are exactly k occurrences is equal to

P(X = k) = λ^k * e^(–λ) / k!

• k is the number of occurrences of an event and is a non-negative integer, k = 0, 1, 2, ...

• k! is the factorial of k

• e is the base of the natural logarithm (e ≈ 2.718)

• λ (a positive real number) is equal to the expected number of occurrences during the interval

Mean (X) = Variance (X) = λ

Discrete (Counting) Distribution – Most Popular for Modeling Frequency of Losses

(24)

3.c. Case study for Poisson distribution

Annual mean “Damage to Physical Assets” frequency in Agency Services is 5.9 events p.a.

 Find the probability of recording 0, 1, 2, 3, 4…..20 losses over next 12M.

 The Bank actually records 10 such events over the next 12M. The management feels that this is a 1-in-100-years scenario. Verify this hypothesis.
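The case study can be explored with a short Python sketch, assuming the annual loss count follows a Poisson distribution with λ = 5.9. One reasonable reading of "1 out of 100 years" is that the probability of 10 or more events in a year should be about 1%; that interpretation is an assumption of this sketch.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lam^k * e^(-lam) / k!"""
    return lam ** k * exp(-lam) / factorial(k)

lam = 5.9  # annual mean frequency from the case study

# Probabilities of recording 0, 1, ..., 20 losses over the next 12 months
pmf = [poisson_pmf(k, lam) for k in range(21)]
for k, p in enumerate(pmf):
    print(k, round(p, 4))

# Probability of 10 or more events in a year
p_at_least_10 = 1 - sum(pmf[:10])
print("P(X >= 10) =", round(p_at_least_10, 4))
print("Consistent with a 1-in-100-years event:", p_at_least_10 <= 0.01)
```

With these inputs the tail probability is several percent, well above 1%, so the 1-in-100-years view does not appear to be supported by the Poisson model.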

(25)

3.c. Discrete PDF- Negative- Binomial distribution

Discrete probability distribution of the number of failures (r) in a sequence of Bernoulli trials before a specified (non-random) number k of success occurs

Special generalized case of the Poisson distribution

 Intensity rate (λ) is no longer taken to be constant (Assumed to follow a Gamma Distribution)

Two-parameter distribution

 Provides additional flexibility in fitting data

 Parameter uncertainty may be high with few data points (typical of scenarios where there are only 3-6 annual frequency data points)

Advantages

 Allows modelling of the frequency dependence due to the assumption that occurrence of operational losses may be affected by some external factor

Discrete (Counting) Distribution – Popular for Modeling Frequency of Losses

Variance > Mean, useful if variance of operational loss frequency is greater than mean

Negative Binomial Distr. – 2 Param Distr.

The probability of getting exactly r failures before the k-th success is given by the probability mass function:

P(X = r) = (r+k–1)_C_r * p^k * (1 – p)^r, for r = 0, 1, 2, ..., where n_C_r = n!/{(n-r)! * r!}

(26)

3.c. Continuous Probability distributions

I. Define and describe Continuous Probability distributions

1) Uniform

2) Normal

3) Lognormal

4) Exponential

5) Gamma

6) Chi-square

7) t-distribution

8) F-distribution

(27)

3.c. Continuous PDF- Uniform distribution

 Assigns equal probability to all values between its minimum and maximum values.

 Random variable X takes a value between two numbers a and b (say).

 Probability density function: f_X(x) = 1/(b – a), a < x < b

 Denoted as X ~ U(a, b)

 Expected values:

• Mean, μ = (a + b)/2

• Variance, σ^2 = (b – a)^2/12

 Example: Assigning equal probability of default to a portfolio of credit card holders.

(28)

3.c. Continuous PDF- Gamma distribution

 The Gamma family of distributions is positively skewed and is described by two parameters, “α” and “λ” (say).

 It is bounded at zero and can take various shapes depending on the values of the parameters.

 Random variable X takes a positive value.

 Probability density function: f_X(x) = (λ^α * x^(α–1) * e^(–λx))/Γ(α), x > 0

 Denoted as X ~ Gamma(α, λ)

 Expected values:

• Mean, μ = α/λ

• Variance, σ^2 = α/λ^2

 Special cases:

• Exponential distribution when α = 1: f_X(x) = λ e^(–λx), x > 0

• Chi-square distribution with v degrees of freedom when α = v/2 (v any positive integer) and λ = 1/2

 Example:

• Used to predict claim amounts in auto insurance.

(29)

3.c. Continuous PDF- Gamma distribution

 Plotting PDFs for different Gamma distributions using MS Excel.

X     Ga(2, 3)   Ga(1, 4)   Ga(20, 0.5)
1     7.96%      19.47%     0.00%
2     11.41%     15.16%     0.00%
3     12.26%     11.81%     0.00%
4     11.72%     9.20%      0.08%
5     10.49%     7.16%      0.75%
6     9.02%      5.58%      3.23%
7     7.54%      4.34%      8.17%
8     6.18%      3.38%      13.98%
9     4.98%      2.63%      17.73%
10    3.96%      2.05%      17.77%
11    3.12%      1.60%      14.71%
12    2.44%      1.24%      10.40%
13    1.90%      0.97%      6.44%
14    1.46%      0.75%      3.56%
15    1.12%      0.59%      1.79%
16    0.86%      0.46%      0.82%
17    0.65%      0.36%      0.35%
18    0.50%      0.28%      0.14%
19    0.37%      0.22%      0.05%
20    0.28%      0.17%      0.02%

Note: the tabulated values correspond to a shape/scale parameterization, so Ga(2, 3) here means shape α = 2 and scale 3, i.e. λ = 1/3.

[Figure: Gamma distribution PDFs for Ga(2, 3), Ga(1, 4) and Ga(20, 0.5); x-axis: random variable (X), y-axis: probability]

(30)

3.c. Continuous PDF- Normal distribution

 A symmetrical distribution with a bell-shaped PDF curve.

 Widely used to model naturally occurring variables, e.g. height, weight, exam scores etc.

 Has two parameters: mean (μ) and variance (σ^2).

 Random variable X can take any real value.

 Probability density function: f_X(x) = (1/(σ √(2π))) exp[–(1/2) {(x – μ)/σ}^2]

 Denoted as X ~ N(μ, σ^2)

 It provides good approximations to various other distributions (Central Limit Theorem)

 If X ~ N(μ, σ^2), the transformation z = (x – μ)/σ has the N(0, 1) distribution.

 The probability is then found by looking it up in the standard normal probability table for the N(0, 1) distribution.

(31)

3.c. Continuous PDF- Normal distribution

 Plotting PDFs for different Normal distributions using MS Excel.

X     N(0,1)    N(0,1.6)
-5    0.0%      0.2%
-4    0.0%      1.1%
-3    0.4%      4.3%
-2    5.4%      11.4%
-1    24.2%     20.5%
0     39.9%     24.9%
1     24.2%     20.5%
2     5.4%      11.4%
3     0.4%      4.3%
4     0.0%      1.1%
5     0.0%      0.2%

Note: the tabulated values for N(0,1.6) correspond to a standard deviation of 1.6 (not a variance of 1.6).

[Figure: PDFs of N(0,1) and N(0,1.6) for x from -5 to 5]

(32)

3.c. Continuous PDF- Normal distribution

Problem:

 If X ~ N(25, 36), by making use of the standard normal probability distribution table, find:

(i) P(X < 28)

(ii) P(X > 30)

(iii) P(X < 20)

Solution:

(i) P(X < 28) = P(Z < (28-25)/sqrt(36)) = P(Z < 3/6) = 0.69146

(ii) P(X > 30) = P(Z > (30-25)/6) = P(Z > 0.833) = 1 − P(Z < 0.833) = 1 − 0.79758 = 0.20242

(iii) P(X < 20) = P(Z < (20-25)/6) = P(Z < −0.833) = 1 − P(Z < 0.833) = 1 − 0.79758 = 0.20242
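The table lookups can be cross-checked in Python. The sketch below builds the normal CDF from the error function in the standard library; `normal_cdf` is a hypothetical helper name. Small differences from the slide in the fourth decimal place are expected, because the table is read at z = 0.833 rather than at the exact value 5/6.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu = 25
sigma = sqrt(36)   # N(25, 36) is specified by its variance, so sigma = 6

p_i = normal_cdf(28, mu, sigma)        # (i)   P(X < 28)
p_ii = 1 - normal_cdf(30, mu, sigma)   # (ii)  P(X > 30)
p_iii = normal_cdf(20, mu, sigma)      # (iii) P(X < 20)

print(round(p_i, 5), round(p_ii, 5), round(p_iii, 5))
```

Parts (ii) and (iii) agree because 30 and 20 are equally far from the mean and the normal distribution is symmetric.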

(33)

3.c. Continuous PDF- Lognormal distribution

 A positively skewed distribution.

 If a random variable X has a lognormal distribution, then Y = log(X) is normally distributed.

 Random variable X is bounded at zero and is used to model variables taking positive values.

 Defined by two parameters μ and σ^2 and denoted as X ~ logN(μ, σ^2)

 Probability density function: f_X(x) = (1/(x σ √(2π))) exp[–(1/2) {(log x – μ)/σ}^2], x > 0

 Expected values:

• Mean, E[X] = exp(μ + (1/2) σ^2)

• Variance, var(X) = exp(2μ + σ^2) (exp(σ^2) – 1)

 Example:

• Used to predict claim amounts in auto insurance.

• Used to predict loss amounts in bank loan defaults.

(34)

3.c. Continuous PDF- Lognormal distribution

 Plotting the PDF of the logN(0,1) distribution using MS Excel.

[Figure: PDF of logN(0,1) for x from 0 to 20]

(35)

3.d. The Central Limit Theorem

Introduction:

 It is perhaps one of the most important results in statistics

 It provides the basis for large-sample inference about a population mean when the population distribution is unknown.

 It also provides the basis for large-sample inference about a population proportion, for example, in opinion polls and surveys.

Definition:

 If X₁, X₂, …, X_n is a sequence of independent, identically distributed (iid) random variables with finite mean μ and finite (non-zero) variance σ^2, then the distribution of (<X> – μ)/(σ/√n) approaches the standard normal distribution, N(0,1), as n → ∞

 μ is the mean of the population from which X₁, X₂, …, X_n have been drawn.

 <X> is the sample mean, calculated as <X> = (1/n) ∑_{i=1}^{n} X_i

 For large n, (<X> – μ)/(σ/√n) and (∑ X_i – nμ)/√(n σ^2) approximately have the N(0, 1) distribution

 OR

• <X> ~ N(μ, σ^2/n), approximately

(36)

3.d. The Central Limit Theorem

Example:



It is assumed that the number of claims arriving at an insurance company per working day has

a mean of 40 and a standard deviation of 12. A survey was conducted over 50 working days.

Find the probability that the sample mean number of claims arriving per working day was less

than 35.

Solution:

We have μ = 40, σ = 12, n = 50.

The central limit theorem states that, approximately, <X> ~ N(40, 12^2/50).

We want P(<X> < 35):

P(<X> < 35) = P(Z < (35 – 40)/√(12^2/50))

= P(Z < –2.946) = 1 – P(Z < 2.946)

= 1 – 0.9984 = 0.0016
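The same calculation can be sketched in Python, assuming (as the solution does) that n = 50 is large enough for the normal approximation to the sample mean to hold. Variable names are illustrative.

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """P(Z <= z) for Z ~ N(0, 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 40, 12, 50

standard_error = sigma / sqrt(n)     # standard deviation of the sample mean
z = (35 - mu) / standard_error       # standardized value of 35
p = standard_normal_cdf(z)           # P(sample mean < 35)

print(round(standard_error, 4), round(z, 3), round(p, 4))
```

The key step is dividing σ by √n: the sample mean varies much less than a single day's claim count.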

(37)

3.e. Sampling and Statistical Inference

I. Introduction

II. Random samples

III. Sample Mean

IV. Sample variance

V. The t- result

(38)

3.e. Sampling and Statistical Inference

Introduction:

 When a sample is taken from a population the sample information can be used to infer certain things about the population.

 For example, a population quantity could be its mean or variance.

 If we were to keep taking samples from the same population and calculating the mean and variance for each of the samples, we would find that the mean and variance results form distributions as well.

 The distributions of the sample mean and sample variance are called sampling distributions.

Need for Sampling:

 The physical impossibility of checking all items in the population.

 The cost of studying all the items in a population.

 The sample results are usually adequate.

 Contacting the whole population would often be time-consuming.

(39)

3.e. Sampling and Statistical Inference- Normal distribution

The sample mean:

 Mean, <X> = (1/n) ∑ X_i

 Distribution:

• (<X> – μ)/(σ/√n) ~ N(0,1)

• <X> ~ N(μ, σ^2/n)

• μ is the population mean about which we are trying to draw inference

The sample variance:

 Sample variance, S^2 = ∑(X_i – <X>)^2/(n – 1)

 Distribution:

• (n – 1) S^2/σ^2 ~ χ^2_{n-1}

• σ^2 is the population variance about which we are trying to draw inference

 Point to be noted: the distribution of the sample mean is symmetrical (normal), whereas that of the sample variance is positively skewed (chi-square) for small n but roughly symmetrical for large n.

 Expected value of S^2:

• E[(n – 1) S^2/σ^2] = E[χ^2_{n-1}] = n – 1 (the mean and variance of χ^2_k are k and 2k, respectively)

• Hence E[S^2] = σ^2, i.e. the sample variance is an unbiased estimator of the population variance

(40)

3.e. Sampling and Statistical Inference- Normal distribution

The t result:

 Distribution:

• (<X> – μ)/(σ/√n) ~ N(0,1) is used to draw inference about μ when the population variance σ^2 is known.

• But for a population, σ^2 is usually not known.

• We combine (<X> – μ)/(σ/√n) ~ N(0,1) and (n – 1) S^2/σ^2 ~ χ^2_{n-1} to solve this problem.

• (<X> – μ)/(S/√n) = N(0,1)/√(χ^2_{n-1}/(n – 1)) ~ t_{n-1}

• This uses the result N(0,1)/√(χ^2_k/k) ~ t_k, where the normal and chi-square variables are independent

Example:

 State the distribution of (<X> – 100)/(S/√5) for a random sample of 5 values taken from a N(100, σ^2) population. What is the probability that this quantity will exceed 1.533?

Solution:

 Distribution: (<X> – 100)/(S/√5) ~ t_4

 Probability: from t tables, P(t_4 > 1.533) = 0.10
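The tail probability in this example is normally read from t tables. As a cross-check, the sketch below integrates the Student t density numerically with Simpson's rule, using only the standard library. The helper names are hypothetical, and the density formula used is the standard one for the t distribution rather than something stated on the slide.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: 0.5 minus the Simpson's-rule integral over [0, t]."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

p = t_upper_tail(1.533, 4)   # sample of 5 values gives 4 degrees of freedom
print(round(p, 4))
```

The symmetry of the t distribution is what allows the upper tail to be written as 0.5 minus the area between 0 and t.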

(41)

3.e. Sampling and Statistical Inference- Normal distribution

The F result:

 If independent random samples of size n1 and n2 respectively are taken from normal populations with variances σ_1^2 and σ_2^2, then

• (S_1^2/σ_1^2) / (S_2^2/σ_2^2) ~ F_{n1-1, n2-1}

(42)

3.e. The F-test

Example: William Waugh is examining the earnings of two different industries. He suspects that the earnings of the chemical industry are more divergent than those of the petroleum industry. To confirm, he took a sample of 35 chemical manufacturers and a sample of 45 petroleum companies. He measured the sample standard deviation of earnings across the chemical industry to be $3.50 and that of the petroleum industry to be $3.00. Determine whether the earnings of the chemical industry have a greater standard deviation than those of the petroleum industry.

(43)

Solution

1. State the hypothesis:

H₀: σ_1^2 ≤ σ_2^2 versus H_a: σ_1^2 > σ_2^2,

where σ_1^2 = variance of earnings for the chemical industry and σ_2^2 = variance of earnings for the petroleum industry

2. Select the appropriate test statistic: F = s_1^2/s_2^2

3. Specify the level of significance: take it as 5% here

4. State the decision rule regarding the hypothesis: reject H₀ if F > 1.74

5. Collect the sample and calculate the sample statistics:

Using the information provided, the F-statistic can be computed as:

F = s_1^2/s_2^2 = $3.50^2/$3.00^2 = 12.25/9.00 = 1.3611 < 1.74 (hence there is not sufficient evidence to reject H₀)
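The arithmetic of the F-test can be sketched as follows. The critical value 1.74 is taken from the decision rule on the slide rather than computed, since the standard library has no F distribution; variable names are illustrative.

```python
# Sample standard deviations of earnings from the example
s_chemical = 3.50    # 35 chemical manufacturers
s_petroleum = 3.00   # 45 petroleum companies

# Test statistic: ratio of sample variances, larger suspected variance on top
f_statistic = s_chemical ** 2 / s_petroleum ** 2

# Critical value at the 5% level, as quoted in the decision rule
f_critical = 1.74

reject_h0 = f_statistic > f_critical
print(round(f_statistic, 4), reject_h0)
```

Squaring the standard deviations matters here: the ratio of variances (12.25/9.00) is what follows the F distribution, not the ratio of standard deviations.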

(44)

3.f. Point Estimate & Confidence Intervals

 Point estimates: These are the single (sample) values used to estimate population parameters

 Confidence interval: It is a range of values in which the population parameter is expected to lie

 A confidence interval takes the following form when N ≥ 30

• CI = m ± Z*s_x

True for a population distribution where

m is the mean of the population

s_x is the standard deviation of the population

• For a sample mean,

Point estimate ± (reliability factor * standard error)

CI = <x> ± Z*(S_x/√n)

where <x> is the mean of the sample

(45)

3g. Hypothesis Testing

 A statistical hypothesis test is a method of making statistical decisions from and about experimental data.

 Null-hypothesis testing answers the question:

• “How well do the findings fit the possibility that chance factors alone might be responsible?”

(46)

3g. Key steps in Hypothesis Testing

 Null Hypothesis (H₀): The hypothesis that the researcher wants to reject

 Alternate Hypothesis(H_a): The hypothesis which is concluded if there is sufficient evidence to reject null hypothesis

 Test Statistic

 Rejection/Critical Region

(47)

3g. Launching a niche course for MBA students?

 Sam, a brand manager for a leading financial training center, wants to introduce a new niche finance course for MBA students. He met some industry stalwarts and found that, with the skills acquired by attending such a course, the students would be able to land a good job.

 He meets a random sample of 100 students and discovers the following characteristics of the market

• Mean household income is $20,000

• Interest level in students = high

• Current knowledge of students for the niche concepts = low

 Sam strongly believes the course would be adequately profitable if the students have the buying power for it. They would be able to afford the course only if the mean household income is greater than $19,000.

 Would you advise Sam to introduce the course?

• What should be the hypothesis?

o Hint: What is the point at which the decision changes (19,000 or 20,000)?

o What about the alternate hypothesis?

• What other information do you need to ensure that the right decision is arrived at?

o Hint: confidence intervals/ significance levels?

o Hint: Is there any other factor apart from mean, which is important? How do I move from population parameters to standard errors?

• What is the risk still remaining, when you take this decision?

o Hint: Type-I/II errors?

(48)

3g. Criterion for Decision Making

• To reach a final decision, Sam has to make a general inference (about the population) from the sample data.

• Criterion: Mean income across all households in the market area under consideration.

– If the mean population household income is greater than $19,000, then Sam should introduce the course in the new market.

• Sam’s decision making is equivalent to either accepting or rejecting the hypothesis:

– The population mean household income in the new market area is greater than $19,000

• The term one-tailed signifies that all z-values that would cause Sam to reject H₀ are in just one tail of the sampling distribution

– m -> Population Mean

– H₀: m ≤ $19,000

– H_a: m > $19,000

(49)

3g. Identifying the Critical Sample Mean Value – Sampling Distribution

• Sample mean values greater than $19,000, that is, x-values on the right-hand side of the sampling distribution centered on μ = $19,000, suggest that H₀ may be false.

• More important, the farther to the right x is, the stronger is the evidence against H₀

[Figure: sampling distribution centered on $19,000 with the critical value X_c marked in the right tail; reject H₀ if the sample mean exceeds X_c]

(50)

3g. Computing the Criterion Value

• Standard deviation for the sample of 100 households is $4,000. The standard error of the mean (s_x) is given by:

s_x = s/√n = $4,000/√100 = $400

• The critical mean household income x_c is found through the following two steps:

– Determine the critical z-value, z_c. For α = 0.05, z_c = 1.645.

– Substitute the values of z_c, s_x and m (under the assumption that H₀ is "just" true) to obtain the critical value x_c:

– x_c = m + z_c * s_x = $19,000 + 1.645 * $400 = $19,658.

– In this case, the observed sample statistic ($20,000) is greater than the critical value ($19,658), so the null hypothesis is rejected.

Decision Rule

If the sample mean household income is greater than $19,658, reject the null hypothesis and introduce the new course
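The decision rule can be sketched in Python using the figures from the case (s = $4,000, n = 100, hypothesized mean $19,000, z_c = 1.645). Variable names are illustrative.

```python
from math import sqrt

sample_sd = 4000          # standard deviation of the sample of households
n = 100                   # sample size
hypothesized_mean = 19000 # value of m when H0 is "just" true
z_critical = 1.645        # one-tailed critical z-value for alpha = 0.05

standard_error = sample_sd / sqrt(n)                             # s_x
critical_mean = hypothesized_mean + z_critical * standard_error  # x_c

sample_mean = 20000
reject_h0 = sample_mean > critical_mean

print(standard_error, round(critical_mean), reject_h0)
```

Expressing the rule in income units (x_c) rather than z-units makes it easy to explain to a non-technical decision maker.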

(51)

3g. Test Statistic

 The value of the test statistic is simply the z-value corresponding to x = $20,000:

Z = (x – m)/s_x = ($20,000 – $19,000)/$400 = 2.5

 Here, s_x is the standard error

[Figure: sampling distribution centered on μ = $19,000 (Z = 0); the critical value X_c = $19,658 (Z_c = 1.645, α = 0.05) separates the "Do not reject H₀" region from the "Reject H₀" region, and the observed x = $20,000 (Z = 2.5) lies in the rejection region]

• There is a significant difference between the hypothesized population parameter and the observed sample statistic =>

• Mean income > 19,000 =>

• Launch the course

(52)

3g. Errors in Estimation

 Please note: You are inferring for a population, based only on a sample

• This is no proof that your decision is correct

• It’s just a hypothesis

– There is still a chance that your inference is wrong

– How do I quantify the prob. of error in inference?

 Type I and Type II Errors:

– Type I error occurs if the null hypothesis is rejected when it is true

– Type II error occurs if the null hypothesis is not rejected when it is false

 Significance Level:

– α -> Significance level: the upper-bound probability of a Type I error

– 1 – α -> Confidence level: the complement of the significance level

– The power of a test is the probability of correctly rejecting the null.

Inference \ Actual    H0 is True                                   H0 is False
H0 is True            Correct decision (confidence level = 1-α)    Type-II error (P(Type-II error) = β)
H0 is False           Type-I error (significance level = α)        Correct decision (power = 1-β)

(53)

3g. P - Value – Actual Significance Level

 The p-value is the smallest level of significance at which the null hypothesis can be rejected.

 P-value

• The probability of obtaining an observed value of x (from the sample) as high as $20,000 or more when the actual population mean (m) is only $19,000: P(Z ≥ 2.5) = 0.00621

• Calculated probability of rejecting the null hypothesis (H₀) when that hypothesis (H₀) is true (Type I error)

 The actual significance level of 0.00621 in this case means that the odds are about 62 out of 10,000 that a sample mean income of $20,000 or more would have occurred entirely due to chance (when the population mean income is $19,000)

[Figure: sampling distribution centered on μ = $19,000 (Z = 0) showing the "Do not reject H₀" and "Reject H₀" regions for α = 0.05, with the p-value of 0.00621 in the right tail]
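The p-value quoted on this slide can be reproduced from the test statistic. The sketch below assumes the figures of the case study (sample mean $20,000, hypothesized mean $19,000, standard error $400) and a one-tailed test.

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """P(Z <= z) for Z ~ N(0, 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

sample_mean = 20000
hypothesized_mean = 19000
standard_error = 400

z = (sample_mean - hypothesized_mean) / standard_error   # test statistic
p_value = 1 - standard_normal_cdf(z)                     # upper-tail probability

alpha = 0.05
print(z, round(p_value, 5), p_value < alpha)
```

Comparing the p-value with α gives the same decision as comparing the sample mean with the critical value.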

(54)

3g. Some variations in the Z-Test

 What if Sam surveyed the market and found that the student behavior is estimated to be:

־ They would find the training too expensive if their household income is < US$19,000 and hence would not have the buying power for the course?

־ They would perceive the training to be of inferior quality if their household income is > US$19,000 and hence would not buy the training?

־ How would the decision criteria change? What should be the testing strategy?

 Hint: From the question wording infer: two-tailed testing

־ Appropriately modify the significance value and other parameters

־ Use the Z-test

 Appropriate change in the decision-making and testing process:

־ Students will not attend the course if:

• The household income is > $19,000 and the students perceive the course to be inferior

• The household income is < $19,000

־ This becomes a two-tailed test wherein the student will join the course only when the household income lies within particular bounds, i.e. the household income should be neither very high nor very low

(55)

• Now the test is modified to a two-tailed test, which signifies that the z-values that would cause Sam to reject H₀ are in both tails of the sampling distribution

־ m -> Population Mean

־ H₀: m = $19,000

־ H_a: m ≠ $19,000

• Since we are checking for a significant difference at both ends, it is a two-tailed test

• The lower boundary = m – Z_{α/2} * s_x = $19,000 – 1.96 * $400 = $18,216

• The upper boundary = m + Z_{α/2} * s_x = $19,000 + 1.96 * $400 = $19,784

• Conclusion: If the mean household income lies between $18,216 and $19,784, H₀ is not rejected and the students are expected to attend the course, at 95% confidence

[Figure: sampling distribution centered on μ = $19,000 (Z = 0) with a central "Do not reject H₀" region and a "Reject H₀" region of area 0.025 in each tail]
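The two-tailed boundaries can be sketched in Python, assuming the usual critical value z = 1.96 for 2.5% in each tail and the standard error of $400 from the earlier slides. The sample mean of $20,000 is reused from the case study to illustrate the decision.

```python
hypothesized_mean = 19000
standard_error = 400
z_two_tailed = 1.96      # critical value leaving 0.025 in each tail

lower = hypothesized_mean - z_two_tailed * standard_error
upper = hypothesized_mean + z_two_tailed * standard_error

sample_mean = 20000
reject_h0 = not (lower <= sample_mean <= upper)

print(round(lower), round(upper), reject_h0)
```

Under these assumptions the observed sample mean falls outside the interval, so H₀: m = $19,000 would be rejected in the two-tailed setting as well.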

(56)

Thank you!

EduPristine

702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA

www.edupristine.com