
INTRODUCTION TO MEDICAL STATISTICS:

THEORETICAL PROBABILITY DISTRIBUTIONS

Mirjana Kujundžić Tiljak

• EMPIRICAL FREQUENCY DISTRIBUTION

– observed data

• THEORETICAL PROBABILITY DISTRIBUTION

– described by mathematical models


27.06.2006 THEORETICAL PROBABILITY DISTRIBUTIONS


• when an empirical distribution approximates a particular probability distribution, theoretical knowledge of that distribution can be used to answer questions about the data

• this requires the evaluation of probabilities


PROBABILITY (P)

• measures uncertainty

• measures the chance of a given event occurring

0 ≤ P ≤ 1

– P = 0 → the event cannot occur
– P = 1 → the event must occur
– Q = 1 − P → probability of the complementary event (the event not occurring)


• Various approaches to calculating probability:

– Subjective – personal degree of belief that the event will occur (e.g. that the world will come to an end in the year 2050)

– Frequentist – the proportion of times the event would occur if the experiment were repeated a large number of times (e.g. the proportion of times we would get a "head" if we tossed a fair coin many times)

– A priori – requires knowledge of the theoretical model, the probability distribution, which describes the probabilities of all possible outcomes of the "experiment" (e.g. genetic theory allows us to describe the probability distribution for eye color in a baby born to a blue-eyed woman and a brown-eyed man by initially specifying all possible genotypes of eye color in the baby and their probabilities)
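The frequentist idea can be illustrated with a small simulation. This is only a sketch: the fair coin, the function name `head_proportion` and the fixed seed are assumptions made for the illustration, not part of the lecture.

```python
import random

def head_proportion(n_tosses, seed=0):
    """Proportion of heads in n_tosses simulated tosses of a fair coin."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The observed proportion settles near 0.5 as the number of tosses grows.
for n in (10, 100, 10_000, 100_000):
    print(n, head_proportion(n))
```

With few tosses the proportion can be far from 0.5; with many tosses it is expected to lie close to it.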

PROBABILITY (P)

• The addition rule: if two events (A and B) are mutually exclusive → the probability that either one or the other occurs (A or B) is equal to the sum of their probabilities

Prob (A or B) = Prob (A) + Prob (B)

• The multiplication rule: if two events (A and B) are independent → the probability that both events occur (A and B) is equal to the product of their probabilities

Prob (A and B) = Prob (A) × Prob (B)
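A minimal sketch of both rules, assuming a fair six-sided die as the experiment (the die and the variable names are illustrative choices, not taken from the lecture):

```python
from fractions import Fraction

# One roll of a fair six-sided die: each face has probability 1/6.
p_face = Fraction(1, 6)

# Addition rule: "roll a 1" and "roll a 2" are mutually exclusive.
p_one_or_two = p_face + p_face

# Multiplication rule: two separate rolls are independent.
p_six_then_six = p_face * p_face

# Complementary event: Q = 1 - P.
p_not_six = 1 - p_face

print(p_one_or_two, p_six_then_six, p_not_six)
```

Exact fractions are used so that the sums and products are not blurred by floating-point rounding.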


RANDOM VARIABLES

• random variable – a quantity that can take any one of a set of mutually exclusive values with a given probability

• discrete (discontinuous) random variable = numerical values are integers

• E.g. number of children in a family: 0, 1, 2, 3, …, k

• continuous random variable = numerical values are real numbers

• E.g. body weight 72.35 kg, blood glucose level 7.2 mmol/l


PROBABILITY DISTRIBUTION

• Probability distribution – shows the probabilities of all possible values of the random variable

– a theoretical distribution that is expressed mathematically
– has a mean and variance that are analogous to those of an empirical distribution

• parameters – summary measures (e.g. mean, variance) characterizing that distribution → are estimated in the sample by the relevant statistics

• depending on whether the random variable is discrete or continuous → the probability distribution can be either discrete or continuous


• DISCRETE (Binomial, Poisson)

– the probability corresponding to every possible value of the random variable can be derived

– the sum of all such probabilities is 1

PROBABILITY DISTRIBUTIONS

• CONTINUOUS (Normal, Chi-squared, t, F)

– the probability that the random variable, x, takes values in a certain range can be derived

– if the horizontal axis represents the values of x → the curve given by the equation of the distribution can be drawn (= probability density function)

– total area under the curve = 1 → represents the probability of all possible events

• the probability that x lies between two limits is equal to the area under the curve between these values
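The link between probability and area can be checked numerically. The sketch below assumes a Normal variable with a hypothetical mean of 5.0 and standard deviation of 1.0, and uses `statistics.NormalDist` from the Python standard library; the helper `area_between` is an illustrative name.

```python
from statistics import NormalDist

def area_between(dist, lower, upper, steps=10_000):
    """Trapezoidal estimate of the area under dist.pdf between two limits."""
    width = (upper - lower) / steps
    total = 0.5 * (dist.pdf(lower) + dist.pdf(upper))
    total += sum(dist.pdf(lower + i * width) for i in range(1, steps))
    return total * width

# Hypothetical example: mean 5.0, standard deviation 1.0.
dist = NormalDist(mu=5.0, sigma=1.0)
numeric = area_between(dist, 4.0, 6.0)      # area under the curve
exact = dist.cdf(6.0) - dist.cdf(4.0)       # same probability from the CDF
print(round(numeric, 4), round(exact, 4))
```

The two numbers should agree closely, since the cumulative distribution function is the area under the density curve up to a given value.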


PROBABILITY DISTRIBUTIONS

• Probability that x lies between two limits?


THE NORMAL (GAUSSIAN) DISTRIBUTION

• one of the most important distributions in statistics

• named after the German mathematician C.F. Gauss

• many biological measurements approximately follow the Normal distribution

• it is used in many analytical models

THE NORMAL (GAUSSIAN) DISTRIBUTION

• Probability density function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
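As a sketch, the standard Normal density, with mean µ and standard deviation σ, can be written directly in Python. The function name `normal_pdf` is an illustrative choice.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density f(x) for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Height of the curve at its peak (x = mu) for mu = 0, sigma = 1.
print(normal_pdf(0.0, 0.0, 1.0))
```

The value at x = µ is 1/(σ√2π), the maximum of the curve.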


THE NORMAL (GAUSSIAN) DISTRIBUTION

Completely described by two parameters:

- mean (µ)
- variance (σ²)

X ~ N(µ, σ²)


• normal distribution curve:

– area under the curve = 1
– bell-shaped (unimodal)
– symmetrical about its mean
– absolute maximum at x = µ
– shifted to the right if the mean is increased and to the left if the mean is decreased (assuming constant variance)
– flattened as the variance is increased but becomes more peaked as the variance is decreased (for a fixed mean)

THE NORMAL (GAUSSIAN) DISTRIBUTION

• the mean, median and mode of a Normal distribution are equal

• the probability (P) that a normally distributed random variable, x, with mean µ and standard deviation σ, lies between:

(µ − σ) and (µ + σ) = 0.68
(µ − 1.96σ) and (µ + 1.96σ) = 0.95
(µ − 2.58σ) and (µ + 2.58σ) = 0.99

→ these intervals may be used to define reference intervals
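These three probabilities can be reproduced with the error function from the Python standard library. The sketch relies on the standard identity P(|x − µ| < kσ) = erf(k/√2) for a Normal variable; the function name `prob_within` is an illustrative choice.

```python
import math

def prob_within(k):
    """P(mu - k*sigma < x < mu + k*sigma) for any Normal distribution."""
    return math.erf(k / math.sqrt(2.0))

# The multipliers quoted above; results round to 0.68, 0.95 and 0.99.
for k in (1.0, 1.96, 2.58):
    print(k, round(prob_within(k), 2))
```

The result does not depend on µ or σ, which is why the same multipliers apply to every Normal distribution.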

THE NORMAL (GAUSSIAN) DISTRIBUTION

• changing µ, constant σ: [figure]

• changing σ, constant µ: [figure]

THE STANDARD NORMAL DISTRIBUTION

• transformation of an original value (x_i) to a Standardized Normal Deviate (SND) (z_i):

$$z_i = \frac{x_i - \mu}{\sigma}$$

= a random variable that has a Standard Normal distribution

• sample:

$$z_i = \frac{x_i - \bar{x}}{s}$$

X₁ → Z₁
X₂ → Z₂
X₃ → Z₃
…
X_n → Z_n

x̄, s → z̄ = 0, s_z = 1

Z ~ N(0, 1)
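A short sketch of the standardization applied to a sample. The weights are made-up illustrative values, not data from the lecture; `statistics.stdev` is the sample standard deviation (n − 1 in the denominator).

```python
from statistics import mean, stdev

# Hypothetical sample of body weights in kg (illustrative values only).
weights = [61.0, 72.35, 68.2, 80.1, 75.4, 66.9]

x_bar = mean(weights)
s = stdev(weights)  # sample standard deviation

# Standardized Normal Deviate for each observation.
z = [(x - x_bar) / s for x in weights]

# Up to floating-point rounding, the z values have mean 0 and SD 1.
print(mean(z), stdev(z))
```

Standardizing changes the location and scale only; it does not make skewed data Normal.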


THE t-DISTRIBUTION

• W.S. Gosset (pseudonym "Student")

• the parameter that characterizes the t-distribution = the degrees of freedom

• similar in shape to the Normal distribution, but more spread out with longer tails; as the degrees of freedom increase, its shape approaches Normality

• useful for calculating confidence intervals for, and testing hypotheses about, one or two means


THE CHI-SQUARE (χ²) DISTRIBUTION

• a right-skewed distribution taking positive values

• characterized by its degrees of freedom

• its shape depends on the degrees of freedom – it becomes more symmetrical and approaches Normality as they increase

• useful for analysing categorical data


THE F-DISTRIBUTION

• skewed to the right

• defined by a ratio – the distribution of a ratio of two estimated variances calculated from Normal data approximates the F-distribution

• characterized by the degrees of freedom of the numerator and the denominator of the ratio

• useful for comparing two variances, and more than two means using the analysis of variance

THE LOGNORMAL DISTRIBUTION

• the probability distribution of a random variable whose log (to base 10 or e) follows the Normal distribution

• highly skewed to the right

• if the logs of raw data that are skewed to the right give an empirical distribution that is nearly Normal → the data approximate the Lognormal distribution
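The effect of the log transformation can be sketched with simulated data. The sample below is generated with `random.lognormvariate` from the Python standard library; the parameters (0 and 1 on the log scale), the sample size and the seed are arbitrary assumptions for the illustration.

```python
import math
import random
from statistics import mean, median

rng = random.Random(1)  # fixed seed for a reproducible illustration

# Simulated right-skewed data whose natural log is Normal(0, 1).
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(20_000)]
logged = [math.log(x) for x in raw]

# In right-skewed data the mean is pulled above the median.
print("raw:   ", round(mean(raw), 2), round(median(raw), 2))
# After taking logs the mean and median nearly coincide.
print("logged:", round(mean(logged), 2), round(median(logged), 2))
```

Comparing the mean with the median is only a rough check of symmetry; a histogram of the logged values gives a fuller picture.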


THE BINOMIAL DISTRIBUTION

• theoretical distribution for a discrete random variable

• definition: Jacob Bernoulli, 1700

• two outcomes: "success" and "failure"

• n events

– E.g. n = 100 unrelated women undergoing IVF


• Two parameters describe the Binomial distribution:

n = number of individuals in the sample (or repetitions of a trial)

π = the true probability of success for each individual (or in each trial)

X ~ B(n, π)

THE BINOMIAL DISTRIBUTION

• Mean = nπ (the value of the random variable that we expect if we look at n individuals, or repeat the trial n times)

• Variance = nπ(1 − π)

• small n

– the distribution is skewed to the right if π < 0.5
– the distribution is skewed to the left if π > 0.5
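The mean, variance and direction of skew can be checked from the Binomial probabilities themselves. This sketch uses the standard Binomial formula P(X = k) = C(n, k) π^k (1 − π)^(n − k); the function names and the choice n = 10 are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """P(X = k) for X ~ B(n, pi)."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)

def moments(n, pi):
    """Mean, variance and third central moment computed from the pmf."""
    probs = [binomial_pmf(k, n, pi) for k in range(n + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    third = sum((k - mean) ** 3 * p for k, p in enumerate(probs))
    return mean, variance, third

n = 10  # hypothetical small sample size
for pi in (0.2, 0.5, 0.8):
    mean, variance, third = moments(n, pi)
    # A positive third central moment indicates a longer right tail,
    # a negative one a longer left tail.
    print(pi, round(mean, 4), round(variance, 4), round(third, 4))
```

The printed mean and variance can be compared with nπ and nπ(1 − π) for each value of π.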


THE BINOMIAL DISTRIBUTION

• the distribution becomes more symmetrical as the sample size increases and approximates the Normal distribution if both nπ and n(1 − π) are greater than 5

• the properties of the Binomial distribution can be used when making inferences about proportions

• the Normal approximation to the Binomial distribution is often used when analyzing proportions
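A rough sketch of how close the Normal approximation can be, assuming hypothetical values n = 100 and π = 0.3. The Normal curve uses mean nπ and variance nπ(1 − π); the 0.5 added to the limit is the usual continuity correction, which is an addition not mentioned in the lecture.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_cdf(k, n, pi):
    """P(X <= k) for X ~ B(n, pi), summed directly from the pmf."""
    return sum(comb(n, j) * pi**j * (1 - pi) ** (n - j) for j in range(k + 1))

n, pi = 100, 0.3  # hypothetical values
approx_dist = NormalDist(mu=n * pi, sigma=sqrt(n * pi * (1 - pi)))

exact = binomial_cdf(30, n, pi)
# Continuity correction: a discrete count of 30 covers up to 30.5.
approx = approx_dist.cdf(30 + 0.5)
print(round(exact, 3), round(approx, 3))
```

For small n or π close to 0 or 1 the two values can differ noticeably, so the exact Binomial probabilities are preferable there.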

THE BINOMIAL DISTRIBUTION

Example: gene recombination

Chromosomal locus: 2 alleles: A and a

p = probability of A

q = 1 − p = probability of a

conception → outcome space: {AA, Aa, aa}

P(AA) = P(A) × P(A) = p²
P(aa) = P(a) × P(a) = q²
P(Aa) = P(A) × P(a) = pq
P(aA) = P(a) × P(A) = qp
P(Aa or aA) = pq + qp = 2pq

p² + 2pq + q² = (p + q)² = 1² = 1

THE BINOMIAL DISTRIBUTION

– Example – probability of genotypes:

• frequency of gene A = 0.33
• frequency of gene a = 0.67

(p + q)² = (0.33 + 0.67)² = 0.33² + 2 × 0.33 × 0.67 + 0.67²

P(AA) = 0.33² = 0.1089
P(Aa) = 0.33 × 0.67 = 0.2211
P(aA) = 0.67 × 0.33 = 0.2211
P(aa) = 0.67² = 0.4489
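The genotype probabilities in this example can be recomputed in a few lines. The allele frequencies are those given above; the dictionary layout is just one convenient way to hold the results.

```python
p = 0.33   # frequency of allele A (value from the example)
q = 1 - p  # frequency of allele a

genotypes = {
    "AA": p * p,
    "Aa": 2 * p * q,  # Aa and aA combined
    "aa": q * q,
}
for name, prob in genotypes.items():
    print(name, round(prob, 4))
print("total", round(sum(genotypes.values()), 4))
```

The heterozygote probability is the sum of the two orderings, 0.2211 + 0.2211 = 0.4422.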

THE BINOMIAL DISTRIBUTION

Graphical presentation – probabilities of different genotypes

[Figure: bar chart of the probability P (vertical axis, 0 to 0.5) for the genotypes AA, Aa and aa]


– Example – death as an outcome following the Binomial distribution:

• case fatality (lethality) of a disease = 0.30 (30/100)
• survival probability = 0.70
• n = 5
• binomial expansion: (0.30 + 0.70)⁵, with p = 0.30 and q = 0.70

Number of deaths    Binomial term    Probability
5 (everybody)       p⁵               0.00243
4                   5p⁴q             0.02835
3                   10p³q²           0.13230
2                   10p²q³           0.30870
1                   5pq⁴             0.36015
0 (nobody)          q⁵               0.16807
Total                                1.00000
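The table can be regenerated from the Binomial formula. This sketch uses the values of the example (p = 0.30, q = 0.70, n = 5); `math.comb` supplies the coefficients 1, 5, 10, 10, 5, 1.

```python
from math import comb

p, q, n = 0.30, 0.70, 5  # case fatality, survival probability, patients

# Probability of exactly k deaths among n patients, from k = 5 down to 0.
table = {k: comb(n, k) * p**k * q ** (n - k) for k in range(n, -1, -1)}

for deaths, prob in table.items():
    print(deaths, f"{prob:.5f}")
print("Total", f"{sum(table.values()):.5f}")
```

The most likely single outcome is one death (probability 0.36015), although the expected number of deaths is np = 1.5.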

THE POISSON DISTRIBUTION

• Poisson (beginning of the 19th century)

• the Poisson random variable = the count of the number of events that occur independently and randomly in time or space at some average rate, µ (it takes the value 0 and all positive integers)

– example: the number of hospital admissions per day typically follows the Poisson distribution → the Poisson distribution can be used to calculate the probability of a certain number of admissions on any particular day


THE POISSON DISTRIBUTION

• Mean (average rate, µ) = the parameter that describes the Poisson distribution

• The mean equals the variance in the Poisson distribution

• Unimodal curve, right-skewed if the mean is small, but becomes more symmetrical as the mean increases, when it approximates a Normal distribution
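A minimal sketch of Poisson probabilities for the hospital admissions example. The formula P(X = k) = e^(−µ) µ^k / k! is the standard Poisson formula and is not written out in the lecture; the rate of 3 admissions per day is a hypothetical value.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """P(X = k) for a Poisson random variable with mean mu."""
    return exp(-mu) * mu**k / factorial(k)

mu = 3.0  # hypothetical average of 3 admissions per day
probs = [poisson_pmf(k, mu) for k in range(60)]  # tail beyond 59 is negligible

mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))

print(round(poisson_pmf(0, mu), 4))  # probability of a day with no admissions
print(round(mean, 6), round(variance, 6))
```

The computed mean and variance should both come out equal to µ, in line with the property stated above.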