3. Analysis of Qualitative Data


Inferential Stats, CEC at RUPP

Poch Bunnak, Ph.D.

Content

1. Hypothesis tests about a population proportion: Binomial test

2. Chi-square test for goodness of fit

3. Chi-square test for independence

4. Notes on Measures of associations

1. Hypothesis tests about a population proportion: One-sample binomial test

2010 Poch Bunnak ₃

1.1. Situations:

– To test whether a sample proportion differs from a test value, such as the value for a given population or a past result.

– Examples:

• In 2010, girls accounted for 25% of all students in all universities in Phnom Penh. The Rector wishes to know if the % of girls at RUPP is significantly higher or lower than the overall proportion.

• A president got 55% of the votes. After one year in office, he wanted to know if the number of his supporters had increased or decreased.

• A company sold 2500 red and 2300 blue toys. Do the data provide evidence for a significant color preference?

1.2. Test statistics:

– Large sample: z-test; small sample: binomial test

– The sample counts as large when n > 5/min(p, 1-p)
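The sample-size rule above can be written as a one-line check. This is a minimal sketch: the function name `use_z_test` is hypothetical, and the strict inequality simply follows the rule as stated on the slide.

```python
def use_z_test(n: int, p: float) -> bool:
    """Return True when n > 5 / min(p, 1 - p), i.e. the sample is large enough for the z-test."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return n > 5 / min(p, 1 - p)


# Class example used later in these notes: n = 50 students, test proportion 0.25.
# The threshold is 5 / 0.25 = 20.
print(use_z_test(50, 0.25))
print(use_z_test(15, 0.25))
```

When the check fails, the rule points to the binomial test instead.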


1.2.a Test statistics: z test

– Assumptions: random sample, binary var, normal distribution of p (for any sample size, the closer p is to 0 or 1, the more skewed the distribution of p)

– H₀: p = p_h → H_a: p ≠ p_h, or H_a: p > p_h, or H_a: p < p_h

– Two ways:

• Z = (p – p_h)/S.E.(p), or Z = sqrt(chi-square) → then find the p value → make a decision

• S.E.(p) = sqrt(p_h*(1 – p_h)/n), the standard error evaluated at p_h under H₀

• Run
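The z statistic above can be computed by hand in a few lines. The sketch below applies it to the toy-color example from section 1.1, assuming that "no color preference" means p_h = 0.5 and that the standard error is evaluated at p_h; the function name is hypothetical.

```python
import math


def one_sample_proportion_z(successes: int, n: int, p_h: float):
    """z-test of H0: p = p_h, with the standard error evaluated at p_h."""
    p = successes / n
    se = math.sqrt(p_h * (1 - p_h) / n)
    z = (p - p_h) / se
    # Two-tailed p value from the standard normal distribution:
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed


# 2500 red and 2300 blue toys sold; test the share of red against 0.5 (assumed).
z, p_value = one_sample_proportion_z(2500, 2500 + 2300, 0.5)
print(f"z = {z:.3f}, two-tailed p = {p_value:.4f}")
```

For a one-tailed H_a in the direction of the observed difference, the two-tailed p value would be halved.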


1.3. Example

– Women today are getting more educated and working outside the home for cash. They are likely to marry later and have fewer children than before. Is this claim true? Use CDHS 2005 data to test if the mean age at marriage and the mean number of children in 2005 are different from those in 2000.


1.2.b Test statistics: Binomial test

– Based on binomial distribution (prob distribution of two outcomes only, binary var)

– Assumptions: binary var., normality [n*p > 10 & n*(1-p) > 10]

Example

– 15 girls and 35 boys were enrolled in one class. Is the gender composition of this class different from the admission quota of 25% girls?

– SPSS data: create two vars (gender with 1=girls and 2=boys, and n with 15 for girls and 35 for boys), then weight by n.

– SPSS analysis: Analyze → Nonparametric test → Binomial → Move gender into the Test Variable List box → Enter 0.25 in the Test Proportion box → OK


Binomial Test

cat        Category   N    Observed Prop.   Test Prop.   Asymp. Sig. (1-tailed)
Group 1    1.00       15   .30              .25          .252a
Group 2    2.00       35   .70
Total                 50   1.00

a. Based on Z Approximation.

• Interpretation

– Girls made up 30% of the class, 5 percentage points above the quota. However, the difference is not statistically significant based on the binomial test (n = 50, p_(1-tailed) = 0.252)
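The same comparison can be checked outside SPSS with an exact upper-tail binomial probability. This is only a sketch with hypothetical helper names; because the SPSS figure above is based on the Z approximation, the exact value computed here may not agree with it to the last digit.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


# 15 girls out of 50 students, tested against the 25% quota.
n_girls, n_total, quota = 15, 50, 0.25
p_one_tailed = binom_upper_tail(n_girls, n_total, quota)
print(f"observed proportion = {n_girls / n_total:.2f}")
print(f"exact one-tailed p  = {p_one_tailed:.3f}")
```

A p value above 0.05 leads to the same decision as the interpretation above: H₀ is not rejected.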

• Practice:

– Redo the test with 150 girls and 350 boys. What do you see?

– A survey of 200 voters showed that 120 voted for A and 80 voted for B. Is there enough evidence to predict the winner?


• Other features of binomial test with SPSS

– Note that the binomial test is always a one-tailed test

– SPSS does not calculate the CI of the difference. You can do this using formulas

– You can use a cut point to split the data if you do not want to recode (values <= the cut point form group 1)

– Three options for calculating p values:

• Asymptotic distribution (z approximation)

• Exact test (based on actual data w/o prob sampling calculation): when the normal approximation is not met. You should use this test if your sample is small or p is small

• Monte Carlo: when the sample size is too large for the exact test to be computed
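The idea behind the Monte Carlo option can be illustrated with a small simulation: draw many samples under H₀ and count how often the simulated count is at least as large as the observed one. This is a sketch of the general idea only, not SPSS's actual algorithm; the function name, the number of replications, and the seed are all assumptions.

```python
import random


def monte_carlo_upper_p(observed: int, n: int, p_h: float,
                        reps: int = 5000, seed: int = 1) -> float:
    """Estimate P(X >= observed) under H0: p = p_h by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(reps):
        # One simulated sample: count the successes among n binary draws.
        simulated = sum(rng.random() < p_h for _ in range(n))
        if simulated >= observed:
            hits += 1
    return hits / reps


# Class example: 15 girls out of 50, test proportion 0.25.
p_mc = monte_carlo_upper_p(15, 50, 0.25)
print(f"Monte Carlo one-tailed p = {p_mc:.3f}")
```

More replications give a more stable estimate at the cost of run time.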

2. Hypothesis tests about a population's proportions: Chi-square test for goodness of fit

2.1. Situations:

– To compare a sample's freq distribution of a categorical var with expected frequencies (all categories contain the same proportion of values) or with user-specified proportions of values

– Examples:

• Do the three candidates differ significantly in the number of supporters?

• Is there any evidence showing that the departments have different numbers of first-year students?

• In 2000, 10% were extremely poor, 20% were just poor, 40% were just above the poverty line, 25% were rich, and 5% were very rich. Is there any change in the distribution of living standards 10 years later?

2.2. Test statistics:


2.3. Chi-square test assumptions

– Nonparametric test → no distribution shape assumption

– Categorical data; data from a random sample

– The χ² test is valid only if the expected freq (f_(e)) is at least 5 for every category, or no more than 20% of the categories have f_(e) < 5

– H₀: f_(o) = f_(e) (all categories); H_a: f_(o) ≠ f_(e) (at least 1 category)
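The expected-frequency condition above can be checked before running the test. The sketch below implements the "no more than 20% of categories with f_(e) < 5" version of the rule; the function names are hypothetical, and the sample sizes 60 and 40 used with the poverty proportions are made-up values for illustration only.

```python
def expected_counts(n: int, proportions: list) -> list:
    """Expected frequency f_e = n * proportion for each category."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return [n * p for p in proportions]


def chi_square_assumption_ok(expected: list, min_count: float = 5.0,
                             max_share: float = 0.20) -> bool:
    """True if every f_e > 0 and at most 20% of categories have f_e < 5."""
    if any(e <= 0 for e in expected):
        return False
    n_small = sum(e < min_count for e in expected)
    return n_small / len(expected) <= max_share


# Equal proportions across 3 departments with n = 290 (example in these notes).
print(chi_square_assumption_ok(expected_counts(290, [1 / 3] * 3)))

# User-specified proportions from the living-standard example.
poverty_props = [0.10, 0.20, 0.40, 0.25, 0.05]
print(chi_square_assumption_ok(expected_counts(60, poverty_props)))  # hypothetical n
print(chi_square_assumption_ok(expected_counts(40, poverty_props)))  # hypothetical n
```

When the check fails, merging sparse categories or using an exact test are the usual alternatives.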

2.3. Example

– The distribution of foundation-year students by department is: 100 in English, 80 in math, and 110 in computer science. Is the difference statistically significant at 99% CL? H₀: the distribution of students is equal across the departments

– Enter data in SPSS (dept and n vars) and weight by n

– SPSS: Analyze → Nonparametric → Chi-square → Put the dept var in the Test Variable List (be sure “all cat equal” is ticked) → OK


2.4. Result and interpretation

– Notes:

• f(e) = n of cases / n of cat

• residual = f(o) – f(e)

• χ² = Sum[(f(o) – f(e))² / f(e)]

• df = n of cat – 1

• Compare the obtained χ² with the critical χ², or use asymp sig to make a decision about the test

– Interpretation:

The test is not stat sig (χ² = 4.8, df = 2, p = 0.089), meaning that H₀ is not rejected. Thus, there is no evidence supporting the H_a that the distribution of students differs across departments.
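The figures in this result can be reproduced from the formulas in the notes. The sketch below uses the department counts from the example (100, 80, 110) with equal expected frequencies; the helper names are hypothetical, and the p value comes from a closed-form series for the chi-square upper tail rather than from SPSS.

```python
import math


def chi_square_gof(observed, expected=None):
    """Return (chi-square, df); expected defaults to equal frequencies."""
    n, k = sum(observed), len(observed)
    if expected is None:
        expected = [n / k] * k  # f_e = n of cases / n of categories
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, k - 1


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(chi-square with df degrees of freedom >= x)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{k=1}^{(df-1)/2} (x/2)^(k-1/2) / Gamma(k+1/2)
    total = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


chi2, df = chi_square_gof([100, 80, 110])
p_value = chi2_sf(chi2, df)
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p_value:.3f}")
print("significant at 99% CL" if p_value < 0.01 else "not significant at 99% CL")
```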


2.5. Other notes for Chi-square test with SPSS

– If you have many categories but wish to analyze only some of them, you can specify the range of values to be analyzed

– You can test the freq distribution against user-defined expected values (values are entered in the order of the category value codes, and at least one of them must be different; otherwise the test uses equal f_(e), the same as before)

– Three options for calculating p values:

• Asymptotic distribution (chi-square approximation)

• Exact test: when the freq assumptions are not met (small n → f_(e) too small and too many categories with f_(e) < 5)

• Monte Carlo: when the sample size is too large

3. Hypothesis tests about different population proportions: Chi-square test for independence

3.1. Situations

• You want to find if two categorical vars from a single population are associated

– 2 vars are associated if they are dependent on one another (change in one var → change in the other var)

• Examples

– Is the proportion of female students in two departments (English and Computer) the same? [Gender and dept vars; if the % of females in both depts is the same → no association b/w gender and field of study]

– Does the number of supporters of a president change 1 year after his election? [2 vars: support (yes-no) and time (when elected and 1 year later). If the % of supporters is the same → no association]

• If both vars are binary → 2x2 table

• In general, 2 categorical variables → rxc table


3.3. Chi-square test assumptions

– Nonparametric test → no distribution shape assumption

– Categorical or nominal data; data from a random sample

– The χ² test is valid only if f_(e) > 0 for every category and no more than 20% of the categories have f_(e) < 5

– H₀: There is no association b/w var 1 and var 2

• No association = independence = (% in urban ~ % in rural) = (f_(o) = f_(e))

3.4. Example 1: 2x2 table

– The mean age at mar is 19.4 yrs. Is there any sig difference in the proportion of mar at age below the average b/w urban and rural areas? H₀: no association b/w age at mar and residential location

– SPSS CDHS 2005 data

• Recode v511 into binary var “v511_d” with 1 = below 19.4 and 2 = 19.4 or above

• Analyze → Descriptive → Crosstabs → V025 as column and v511_d as row


• Result


• Interpretation

– In total, 59% of women married at an age below the mean age at mar. This proportion is higher in rural areas than in urban areas (59.9% versus 55.9%, respectively). A χ² test was performed to see if the two vars are independent, and the result showed that the two vars are independent at 95% CL (chi-square = 3.4, 2-tailed p = 0.065). Although the proportion of marriage at an age below the mean is higher in rural than in urban areas, the difference is not statistically significant.

– Note that if H_a is one-tailed, then p = 0.065/2 = 0.033 → sig at 95% → the two vars are dependent!
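The p values quoted above can be recovered from the reported chi-square using the 2x2 relationship Z = sqrt(chi-square). This is a sketch that starts from the rounded value 3.4, so the result agrees with the slide only to rounding; halving the p value is meaningful only when the observed difference lies in the direction stated by H_a.

```python
import math

chi2 = 3.4  # chi-square reported for the 2x2 table (rounded on the slide)

# For df = 1, the chi-square upper tail equals the two-tailed normal tail of z = sqrt(chi2).
z = math.sqrt(chi2)
p_two_tailed = math.erfc(z / math.sqrt(2))
p_one_tailed = p_two_tailed / 2

print(f"z = {z:.3f}")
print(f"two-tailed p = {p_two_tailed:.3f}")
print(f"one-tailed p = {p_one_tailed:.3f}")
```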


• Important!

The chi-square test does not tell us how strong the association is. To know this, we need to request measures of association:

– Contingency coefficient: C = sqrt(χ²/(χ² + n)), χ²-based, value: 0 → ~1 (approaches 1 only if the vars have many categories)

– Phi: adjusted for n, φ = sqrt(χ²/n), for 2x2 tables only, value: 0 → 1

– Cramer's V: adj for n, V = sqrt(χ²/(n*min(r-1, c-1))), rxc tables, value: 0 → 1

– Lambda and Uncertainty Coefficient are PRE measures, value: 0 → 1; interpretation: improvement in predicting one var given the knowledge of the other var.
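The χ²-based measures can be computed directly from a chi-square value using their standard definitions, C = sqrt(χ²/(χ² + n)), φ = sqrt(χ²/n) and V = sqrt(χ²/(n*min(r-1, c-1))). In this sketch the chi-square of 3.4 comes from Example 1, but the sample size n = 1000 is a made-up placeholder because the notes do not report n; the function names are also hypothetical.

```python
import math


def contingency_coefficient(chi2: float, n: int) -> float:
    """C = sqrt(chi2 / (chi2 + n))."""
    return math.sqrt(chi2 / (chi2 + n))


def phi(chi2: float, n: int) -> float:
    """Phi = sqrt(chi2 / n); intended for 2x2 tables."""
    return math.sqrt(chi2 / n)


def cramers_v(chi2: float, n: int, r: int, c: int) -> float:
    """V = sqrt(chi2 / (n * min(r - 1, c - 1))) for an r x c table."""
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))


chi2 = 3.4  # chi-square from Example 1 (2x2 table)
n = 1000    # HYPOTHETICAL sample size, for illustration only

print(f"C   = {contingency_coefficient(chi2, n):.4f}")
print(f"phi = {phi(chi2, n):.4f}")
print(f"V   = {cramers_v(chi2, n, 2, 2):.4f}")  # equals phi for a 2x2 table
```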

3.4. Example 2: rxc table

– Find out if the level of education of husbands and wives are related ( v106 and v701). Is this true for both urban and rural areas? (Use appropriate tests)

– The table below summarizes the data on religious preference and attitudes toward abortion in one country. Is respondents' attitude related to their religious preference?

Attitude toward abortion by religious preference

Attitude   Liberal protestant   Conservative protestant   Catholic   None   Total
Favor      103                  182                       80         16     381
Oppose     187                  238                       286        74     785
Total      290                  420                       366        90     1166
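One way to work through this table outside SPSS is to compute the expected counts from the row and column totals and sum the χ² contributions. The sketch below does that for the counts above; the helper names are hypothetical, and Cramer's V is added as one possible measure of strength, which the exercise itself does not require.

```python
import math


def chi_square_independence(table):
    """Return (chi-square, df, n) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # f_e under independence
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, n


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(chi-square with df degrees of freedom >= x)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    total = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


# Rows: Favor, Oppose. Columns: liberal protestant, conservative protestant, Catholic, none.
table = [[103, 182, 80, 16],
         [187, 238, 286, 74]]

chi2, df, n = chi_square_independence(table)
p_value = chi2_sf(chi2, df)
v = math.sqrt(chi2 / (n * min(len(table) - 1, len(table[0]) - 1)))
print(f"chi-square = {chi2:.2f}, df = {df}, n = {n}")
print(f"p = {p_value:.2e}, Cramer's V = {v:.3f}")
```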