Goodness of Fit. Proportional Model. Probability Models & Frequency Data


Probability Models & Frequency Data

Outline: Goodness of Fit; Proportional Model; Chi-square Statistic (Example, R, Distribution, Assumptions); Poisson Distribution (Example, R)

Goodness of Fit

Goodness of fit tests are used to compare any observed frequency distribution against an expected frequency distribution.

We previously worked through specialized examples of this for a simple probability model (the 50:50 expected right- vs. left-handed toad example) and for the binomial distribution (sperm genes on the X chromosome of mice).

The binomial test we did is a specialized form for categorical variables with only two outcomes. Here we will introduce a more generalized form.


Proportional Model

The proportional model is one of the simplest probability models. The frequency of occurrence of events is proportional to the number of opportunities (e.g., X chromosome example).

What would we do, however, if we had multiple proportions? A more generalized form of this test is the chi-square (χ²) goodness-of-fit test.

Example 8.1: Under the proportional model, one would expect babies born in the U.S. to be born in equal proportions across the days of the week (i.e., 1/7, or about 14.29%, per day). Is this true? Shown is a random sample of 350 births from across the U.S. during 1999.

[Table: number of births on each day of the week, Sunday through Saturday: 33, 41, 63, 63, 47, 56, 47 (total 350).]

Goodness-of-Fit Test

The χ² goodness-of-fit test uses the chi-square statistic (based upon the chi-square distribution) to compare frequency data to a model stated by the null hypothesis.

Continuing with our example:

H₀: The probability of birth is the same every day of the week.
H_A: The probability of birth is not the same every day of the week.

Again, H₀ and H_A are statements about the population from which the sample is obtained.

To proceed, we need to determine the expected frequencies under the null model. Examining the calendar for 1999, we see that the days of the week did not all occur equally often: each occurred 52 times except Friday, which occurred 53 times, so we need to adjust for this.

The calculation of the expected frequencies is straightforward. For a day that occurred 52 times in the year:

Expected = 350 ⋅ (52/365) = 49.863

NB: the expected frequencies must sum to the total observed (350).

Once you have a full set of observed and expected frequencies, you can determine a chi-square statistic and its associated probability.

Chi-square Statistic

The chi-square statistic measures the discrepancy between observed and expected frequencies (make sure to always use the absolute frequencies [counts], not relative frequencies [proportions]).

The chi-square statistic is calculated as:

$$\chi^2 = \sum_i \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i}$$

For example, the contribution of the first category (33 observed births) is:

$$\frac{(33 - 49.863)^2}{49.863} = 5.70$$

The χ² statistic is additive across all categories, so:

χ² = 5.70 + 1.58 + 3.46 + 3.46 + 0.16 + 0.53 + 0.16 = 15.05

We now have a calculated test statistic and, as usual, need to compare it to a table value at a particular number of degrees of freedom to make our decision. In other words, is 15.05 large enough to reject the null hypothesis?

df = (number of categories) -1 = 7-1 = 6

From Statistical Table A in your text, we see that at df = 6 and α = 0.05, the critical value for χ² is 12.59. Because 15.05 > 12.59, we reject the null hypothesis and conclude that births are not equally probable on all days of the week.
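The table lookup can also be reproduced in R. This is a minimal sketch, assuming a significance level of 0.05; `qchisq` is the base R quantile function for the chi-square distribution, and the variable names are illustrative.

```r
# Degrees of freedom: number of categories minus 1
df_gof <- 7 - 1

# Critical value: the 95th percentile of the chi-square distribution
crit_val <- qchisq(0.95, df = df_gof)

# Test statistic calculated by hand in the notes
chi_sq <- 15.05

round(crit_val, 2)
chi_sq > crit_val   # TRUE means the statistic falls in the rejection region
```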


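This type of problem can most easily be solved using a table format:

Day         Observed   Expected   (O − E)²/E
Sunday         33       49.863       5.70
Monday         41       49.863       1.58
Tuesday        63       49.863       3.46
Wednesday      63       49.863       3.46
Thursday       47       49.863       0.16
Friday         56       50.822       0.53
Saturday       47       49.863       0.16
Total         350      350.000      15.05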

Assuming equal probabilities, this can be done very easily in R using chisq.test:

> births<-c(33,41,63,63,47,56,47)
> chisq.test(births)

        Chi-squared test for given probabilities

data:  births
X-squared = 15.24, df = 6, p-value = 0.01847

How can we do this with the unequal probabilities that we have? This is a bit more complicated, but still straightforward:

> obsbirths<-births
> days<-c(52,52,52,52,52,53,52)
> expbirths<-350*(days/365)
> expbirths
[1] 49.86301 49.86301 49.86301 49.86301 49.86301 50.82192
[7] 49.86301
> chi<-sum((obsbirths-expbirths)^2/expbirths)
> chi
[1] 15.05676
> ?pchisq
> pchisq(chi,df=6)
[1] 0.9801802
> pchisq(chi,df=6,lower.tail=FALSE)
[1] 0.01981982

What's going on here?
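One way to read the two pchisq results: by default pchisq returns the lower-tail probability (the area to the left of the statistic), whereas the P-value of the test is the upper-tail area. The sketch below checks this and also shows what is probably a shorter route, passing the unequal probabilities to chisq.test through its `p` argument. The variable names `fit`, `p_upper` and `p_lower` are illustrative.

```r
births <- c(33, 41, 63, 63, 47, 56, 47)
days   <- c(52, 52, 52, 52, 52, 53, 52)

# p must sum to 1; days / sum(days) does, since the weekday counts total 365
fit <- chisq.test(births, p = days / sum(days))

fit$statistic   # chi-square statistic
fit$parameter   # degrees of freedom
fit$p.value     # upper-tail probability (the P-value)

# Upper tail = 1 - lower tail
chi     <- unname(fit$statistic)
p_upper <- pchisq(chi, df = 6, lower.tail = FALSE)
p_lower <- pchisq(chi, df = 6)
all.equal(p_upper, 1 - p_lower)
```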


Chi-square Distribution

The chi-square distribution is a theoretical probability distribution (analogous to the normal, binomial, Poisson, etc.).

Note that the distribution is not symmetrical; it is skewed to the right, most strongly at low df.

When df = 1, the density curve is asymptotic to both axes.

If χ² is a random variable with a chi-square distribution:

χ² is a positive real number
The density function depends only on n (df)
The expected value of χ² = n
The variance of χ² = 2n
The graph of f(χ²) is not symmetrical
The graph of f(χ²) approaches symmetry as n → ∞
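These properties can be checked informally by simulation. The sketch below assumes df = 6 and an arbitrary random seed; `rchisq` is the base R generator for chi-square random values, and the sample mean and variance should land close to n and 2n.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n_df  <- 6                        # degrees of freedom (n in the list above)
draws <- rchisq(100000, df = n_df)

mean(draws)       # should be close to n_df
var(draws)        # should be close to 2 * n_df
min(draws) > 0    # every simulated value is positive
```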


We can explore the properties of the chi-square distribution through the use of R functions and graphics:

> par(mfrow=c(2,2),mar=c(3,4,3,3))
> layout.show(4)
> # each panel plots the density at one fixed value (1, 5, 10 or 15) for df = 1 to 30
> plot(dchisq(1,df=1:30))
> plot(dchisq(5,df=1:30))
> plot(dchisq(10,df=1:30))
> plot(dchisq(15,df=1:30))
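A complementary view is to plot the density curve itself, with the chi-square value on the x-axis, for a few values of df. This is a sketch only: the df values, the axis limits and the object names are illustrative choices, not taken from the slides.

```r
x      <- seq(0.05, 30, by = 0.05)   # chi-square values for the x-axis
df_set <- c(1, 3, 6, 15)             # illustrative choices of df

# One column of densities per df value
dens <- sapply(df_set, function(k) dchisq(x, df = k))

matplot(x, dens, type = "l", lty = 1, col = 1:4, ylim = c(0, 0.5),
        xlab = "chi-square value", ylab = "density")
legend("topright", legend = paste("df =", df_set), lty = 1, col = 1:4)
```

The curves should show the strong right skew at low df and the gradual approach to symmetry as df increases.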

Chi-square Assumptions

The sampling distribution of the chi-square statistic only approximately follows the chi-square distribution (but pretty closely).

Two assumptions apply:

1) None of the categories should have an expected frequency less than one.
2) No more than 25% of the categories should have expected frequencies less than five.
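The two rules of thumb can be checked mechanically before running a test. In the sketch below, `check_expected` is a hypothetical helper written for this illustration (it is not an R built-in), using the thresholds exactly as stated above.

```r
# Returns TRUE when both rules of thumb are satisfied
check_expected <- function(expected) {
  none_below_one   <- all(expected >= 1)
  share_below_five <- mean(expected < 5)   # proportion of categories below 5
  none_below_one && share_below_five <= 0.25
}

# Births example: every expected count is near 50, so the check passes
expected_births <- 350 * c(52, 52, 52, 52, 52, 53, 52) / 365
check_expected(expected_births)
```

When the check fails, a common remedy is to combine adjacent categories until the expected frequencies are large enough.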


Goodness-of-Fit Test

Two Proportions

The chi-square goodness-of-fit test is very general and can be used in a variety of situations.

It can also be used when there are only two proportions, as a replacement for the binomial test, but at a cost: it is much less powerful in this situation. So, use the binomial test whenever appropriate.
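To see how the two tests can differ, both can be run on the same two-category data. This is a minimal sketch: the counts are hypothetical (they are not data from these notes), and the null hypothesis is assumed to be a 50:50 split.

```r
# Hypothetical counts for two categories (not data from these notes)
counts <- c(14, 4)

# Exact binomial test of a 50:50 split
exact <- binom.test(counts[1], sum(counts), p = 0.5)

# Chi-square goodness-of-fit test of the same null hypothesis
chisq_fit <- chisq.test(counts, p = c(0.5, 0.5))

exact$p.value
chisq_fit$p.value
```

The binomial P-value is exact, while the chi-square P-value relies on an approximation, so the two generally do not agree when the sample is small.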


Poisson Distribution

The Poisson distribution describes the number of successes in blocks of time or space when successes happen independently of each other and occur with equal probability at every point in time or space.

The Poisson distribution is often useful in biological studies because it is a starting point for evaluating whether an observed pattern is random.

If the null model is rejected, the distribution may be either clumped or dispersed.


A clumped distribution arises when the presence of one success increases the probability of success for adjacent observations (e.g., occurrences of a contagious disease).

A dispersed distribution is the opposite: the presence of one success decreases the probability of success for adjacent observations (e.g., animals with well-defended territories).


The Poisson distribution is constructed using the probability of X successes occurring in any given block of time or space:

$$\Pr[X \text{ successes}] = \frac{e^{-\mu}\,\mu^{X}}{X!}$$

where μ is the mean number of independent successes per block of time or space (expressed as a count) and e is the base of the natural logarithm.
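The formula can be written out directly in R and compared with the built-in `dpois` function. In this sketch, `poisson_prob` is a hypothetical helper defined only for illustration, and the mean of 2 is an arbitrary choice.

```r
# Poisson probability written out from the formula above
poisson_prob <- function(x, mu) {
  exp(-mu) * mu^x / factorial(x)
}

mu <- 2        # illustrative mean number of successes per block
x  <- 0:10     # numbers of successes to evaluate

by_hand  <- poisson_prob(x, mu)
built_in <- dpois(x, lambda = mu)

all.equal(by_hand, built_in)   # the two calculations should agree
```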


Example

Example 8.6 assesses the fossil record. It asks: do extinctions occur randomly through the fossil record, or are there periods when extinction rates are unusually high (mass extinctions) compared to background rates? Fossil marine invertebrates are an ideal group for testing this question because they preserve well. The data are the number of recorded extinctions in 76 contiguous blocks of time.


The hypotheses are:

H₀: The number of extinctions per time interval has a Poisson distribution.
H_A: The number of extinctions per time interval does not have a Poisson distribution.

We need to begin by estimating μ, the mean number of extinctions per time interval. As usual, μ can be estimated by the sample mean, x̄ (= 4.21, n = 76).

We then follow the same protocol as before and generate expected values to compare with our observed values, so we return to the formula for the Poisson distribution.

For example, for 3 extinctions:

$$\Pr[3 \text{ extinctions}] = \frac{e^{-4.21}\,(4.21)^{3}}{3!} = 0.1846$$

Expected[3 extinctions] = 76 × 0.1846 = 14.03


We now have a calculated chi-square test statistic and need to determine the degrees of freedom. In general, df is the number of categories minus 1. However, when parameters are estimated from the data, we must also subtract the number of parameters estimated (here one, μ). So, df = 8 − 1 − 1 = 6.

The critical value for χ² at α = 0.05 and df = 6 is 12.59. Thus, we reject the null hypothesis and conclude that extinctions are non-random.

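> extinctions<-c(0,13,15,16,7,10,4,2,1,2,6)
> ?dpois
> dpois(extinctions, 4.21)
 [1] 1.484637e-02 3.111768e-04 2.626347e-05 6.910575e-06
 [5] 6.905011e-02 7.156129e-03 1.943289e-01 1.315693e-01
 [9] 6.250321e-02 1.315693e-01 1.148102e-01

Note that dpois evaluates the Poisson probability at each value in its first argument, so this call returns Pr[0], Pr[13], Pr[15], and so on for the values stored in extinctions.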
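If the extinctions vector holds the observed frequencies for 0, 1, ..., 9 and 10 or more extinctions (an assumption, though the 11 values do sum to 76), then the expected frequencies come from evaluating the Poisson probabilities at the numbers of extinctions rather than at the frequencies. A minimal sketch, with illustrative object names:

```r
n_blocks <- 76      # number of time blocks
mu_hat   <- 4.21    # estimated mean extinctions per block

# Probabilities for 0, 1, ..., 9 extinctions, plus everything above 9
p_0_to_9  <- dpois(0:9, lambda = mu_hat)
p_10_plus <- ppois(9, lambda = mu_hat, lower.tail = FALSE)
probs     <- c(p_0_to_9, p_10_plus)
names(probs) <- c(0:9, "10+")

expected <- n_blocks * probs
round(expected, 2)
sum(expected)   # equals n_blocks, because the probabilities sum to 1
```

The entry for 3 extinctions should match the hand calculation of 14.03. Some categories have small expected frequencies, which is presumably why categories are combined before testing.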


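> extinctions2<-c(13,15,16,7,10,4,2,9)
> chisq.test(extinctions2)

        Chi-squared test for given probabilities

data:  extinctions2
X-squared = 18.7368, df = 7, p-value = 0.009053

Note that chisq.test with no p argument uses equal expected frequencies (76/8 = 9.5 per category) and df = 8 − 1 = 7, so this output does not use the Poisson expected frequencies or the df = 6 derived above.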
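A sketch of the test against the Poisson expected frequencies follows. It assumes the eight values in extinctions2 correspond to the categories "0 or 1", 2, 3, 4, 5, 6, 7 and "8 or more" extinctions. That grouping is inferred from the counts, not stated in the notes. The statistic is computed by hand so that the degrees of freedom can be reduced for the estimated mean.

```r
observed <- c(13, 15, 16, 7, 10, 4, 2, 9)   # extinctions2 from the notes
mu_hat   <- 4.21                            # mean estimated from the data

# Assumed grouping: "0 or 1", 2, 3, 4, 5, 6, 7, and "8 or more"
probs <- c(ppois(1, mu_hat),
           dpois(2:7, mu_hat),
           ppois(7, mu_hat, lower.tail = FALSE))

expected <- sum(observed) * probs
chi_sq   <- sum((observed - expected)^2 / expected)

# One extra df is lost because mu was estimated from the data
df_pois <- length(observed) - 1 - 1
p_value <- pchisq(chi_sq, df = df_pois, lower.tail = FALSE)

chi_sq
df_pois
p_value
```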


We can explore the properties of the Poisson distribution through the use of R functions and graphics:

> par(mfrow=c(2,2),mar=c(3,4,3,3))
> layout.show(4)
> plot(dpois(1:25,1))
> plot(dpois(1:25,2))
> plot(dpois(1:25,4.21))   # our example (mean = 4.21)
> plot(dpois(1:25,10))