3.
Analysis of Qualitative
Data
Inferential Stats, CEC at RUPP
Poch Bunnak, Ph.D.
Content
1. Hypothesis tests about a population
proportion: Binomial test
2. Chi-square test for goodness of fit
3. Chi-square test for independence
4. Notes on Measures of associations
1.
Hypothesis tests about
a population proportion:
One-sample binomial test
2010 Poch Bunnak 3
1.1. Situations:
– To test whether a sample proportion differs from a test value, such as the value for a given population or a past result.
– Examples:
• In 2010, girls accounted for 25% of all students in all universities in Phnom Penh. The Rector wishes to know whether the % of girls at RUPP is significantly higher or lower than the overall proportion.
• A president got 55% of the votes. After one year in office, he wants to know whether the number of his supporters has increased or decreased.
• A company sold 2500 red and 2300 blue toys. Do the data provide evidence for a significant color preference?
1.2. Test statistics:
– Large sample: z test; small sample: binomial test
– The sample counts as large when n > 5/min(p, 1-p)
1.2.a Test statistics: z test
– Random sample, binary var, normal distribution of p (the closer p is to 0 or 1 for any sample size, the more skewed the distribution of p)
– H0: p = ph → Ha: p ≠ ph or Ha: p > ph or Ha: p < ph
– Two ways:
• Z = (p – ph)/S.E.(p) or Z = sqrt(chi-square) → then find the p value → make decision
• S.E.(p) = sqrt((ph*(1-ph))/n), evaluated under H0 (using ph here is what makes Z squared equal to the chi-square statistic)
• Run
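The z test above can be sketched in a few lines of Python, using the toy example from 1.1 (2500 red and 2300 blue toys, H0: p = 0.5). This is only an illustrative sketch: the function name is made up here, and the standard error is evaluated at the hypothesized proportion ph, which is an assumption of the sketch.

```python
import math

def one_sample_proportion_z(successes, n, p_h):
    """Two-sided z test of H0: p = p_h for a single sample proportion."""
    p_hat = successes / n
    # Assumption of this sketch: S.E. is evaluated at the hypothesized proportion
    se = math.sqrt(p_h * (1 - p_h) / n)
    z = (p_hat - p_h) / se
    # Two-sided p value under the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_hat, z, p_value

# Toy example from the slides: 2500 red, 2300 blue, H0: p(red) = 0.5
n = 2500 + 2300
p_hat, z, p_value = one_sample_proportion_z(2500, n, 0.5)

# Large-sample rule from 1.2: n > 5 / min(p, 1 - p)
large_sample = n > 5 / min(0.5, 1 - 0.5)

print(f"p = {p_hat:.4f}, z = {z:.2f}, two-sided p value = {p_value:.4f}")
print("large sample:", large_sample)
```

If the p value is below the chosen significance level, H0 of no color preference is rejected.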
1.3. Example
– Women today are getting more educated and working outside the home for cash. They are likely to marry later and have fewer children than before. Is this claim true? Use CDHS 2005 data to test if the mean age at marriage and the mean number of children in 2005 are different from those in 2000.
1.2.b Test statistics: Binomial test
– Based on binomial distribution (prob distribution of two outcomes only, binary var)
– Assumptions: Binary var., normality [n*p>10 & n*(1-p)>10]
Example
– 15 girls and 35 boys were enrolled in one class. Is this class gender-different from the gender admission quota of 25% girls?
– SPSS data: create two vars (gender with 1=girls and 2=boys and n with 15 for girls and 35 for boys), weight by n.
– SPSS analysis: Analyze → Nonparametric test → Binomial → Move gender into the Test Var List box → Enter 0.25 in the Test Proportion box → OK
Binomial Test
cat      Category  N   Observed Prop.  Test Prop.  Asymp. Sig. (1-tailed)
Group 1  1.00      15  .30             .25         .252a
Group 2  2.00      35  .70
Total              50  1.00
a. Based on Z Approximation.
• Interpretation
– Girls made up 30% of the class, 5 percentage points above the quota. However, the difference is not statistically significant based on the binomial test (n = 50, p(1-tailed) = 0.252).
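For comparison with the SPSS output, the one-tailed probability can also be computed directly from the binomial distribution. The sketch below is not what SPSS runs internally (SPSS labels its figure as a z approximation); it simply sums the exact binomial probabilities of seeing 15 or more girls out of 50 when the true proportion is 0.25. The function name is hypothetical.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Class example: 15 girls out of 50, quota (test proportion) = 0.25
p_one_tailed = binom_upper_tail(15, 50, 0.25)
not_significant = p_one_tailed > 0.05

print(f"exact one-tailed p value = {p_one_tailed:.3f}")
print("not significant at 95% CL:", not_significant)
```

The same function can be reused for the practice questions by changing the counts and the test proportion.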
• Practice:
– Redo the test with 150 girls and 350 boys. What do you see?
– A survey of 200 voters showed that 120 voted for A and 80 voted for B. Is there enough evidence to predict the winner?
• Other features of binomial test with SPSS
– Note that the binomial test is always a one-tailed test
– SPSS does not calculate the CI of the difference. You can do this using formulas
– You can use a cut point to split the data if you do not want to recode (values <= the cut point go into group 1)
– Three options for calculating p values:
• Asymptotic distribution (z approximation)
• Exact test (based on actual data w/o prob sampling calculation): for when the conditions for the normal approximation are not met. You should use this test if your sample is small or p is small
• Monte Carlo: when the sample size is too large for the exact calculation
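Since SPSS does not report the CI, one common formula-based option is the normal-approximation (Wald) interval for the sample proportion, shifted by the test proportion to give an interval for the difference. This is a sketch under that assumption, not the only possible formula, and the function name is made up here. It uses the class example (15 girls out of 50, quota 0.25).

```python
import math

def wald_ci(successes, n, z_crit=1.96):
    """Normal-approximation 95% CI for a sample proportion (z_crit = 1.96)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_crit * se, p_hat + z_crit * se

low, high = wald_ci(15, 50)

# CI of the difference between the sample proportion and the test proportion
test_prop = 0.25
diff_low, diff_high = low - test_prop, high - test_prop

print(f"95% CI for p: ({low:.3f}, {high:.3f})")
print(f"95% CI for p - test proportion: ({diff_low:.3f}, {diff_high:.3f})")
```

If the interval for the difference contains 0, the result agrees with a non-significant two-tailed test.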
2.
Hypothesis tests about
a population’s proportions:
Chi-square test
for goodness of fit
2.1. Situations:
– To compare a sample’s freq distribution of a categ var with expected frequencies (all categories contain the same proportion of values) or with user-specified proportions of values
– Examples:
• Do the three candidates differ significantly in the number of supporters?
• Is there any evidence showing that departments have different numbers of first-year students?
• In 2000, 10% were extremely poor, 20% were just poor, 40% were just above the poverty line, 25% were rich, and 5% were very rich. Is there any change in the distribution of living standards 10 years later?
2.2. Test statistics:
2.3. Chi-square test assumptions
– Nonparametric test → no distribution shape assumption
– Categorical data; data from a random sample
– The χ² test is valid only if the expected freq (f(e)) is at least 5 for every category, or no more than 20% of the categories have f(e) < 5
– H0: f(o) = f(e) (all categories); Ha: f(o) ≠ f(e) (at least 1 category)
2.3. Example
– The distribution of foundation-year students by department is: 100 in English, 80 in math, and 110 in computer science. Is the difference statistically significant at 99% CL? H0: the distribution of students is equal across the departments
– Enter data in SPSS (dept and n vars) and weight by n
– SPSS: Analyze → Nonparametric → Chi-square → Put Dept var in Test Var List (be sure “all cat equal” is ticked) → OK
2.4. Result and interpretation
– Notes:
• f(e) = n of cases/n of cat
• residual = f(o) – f(e)
• χ² = Sum[(f(o) – f(e))²/f(e)]
• df = n of cat – 1
• Compare the obtained χ² with the critical χ² or use asymp sig to make a decision about the test
– Interpretation:
The test is not stat sig (χ² = 4.8, df=2, p=0.089), meaning that H0 is not rejected. Thus, there is no evidence supporting the Ha that the distribution of students differs across departments.
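The notes above can be checked by hand with a short Python sketch using the department counts from the example (100, 80, 110). The p value line relies on a closed form that holds only for df = 2; for other df a chi-square table or a statistics package would be needed. Variable names are illustrative.

```python
import math

# Observed counts f(o) from the example
observed = {"English": 100, "Math": 80, "Computer science": 110}

n = sum(observed.values())
expected = n / len(observed)  # f(e) = n of cases / n of cat (all cat equal)

# Chi-square = Sum[(f(o) - f(e))^2 / f(e)]
chi2 = sum((fo - expected) ** 2 / expected for fo in observed.values())
df = len(observed) - 1

# Upper-tail probability; exp(-x/2) is valid for df = 2 only
p_value = math.exp(-chi2 / 2)
reject_h0_at_99 = p_value < 0.01

print(f"f(e) = {expected:.2f}, chi-square = {chi2:.2f}, df = {df}, p = {p_value:.3f}")
print("reject H0 at 99% CL:", reject_h0_at_99)
```

The result should agree with the rounded values reported in the interpretation.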
2.5. Other notes for Chi-square test with SPSS
– If you have many categories but wish to analyze only some of them, you can specify the range of values to be analyzed
– You can test the freq distribution against user-defined expected values (entered in the order of the cat value codes; at least one of them must be different, otherwise the test is the same as the equal f(e) case before)
– Three options for calculating p values:
• Asymptotic distribution (chi-square approximation)
• Exact test: when the freq value assumptions are not met (small n → f(e) too small and too many categories with f(e) < 5)
• Monte Carlo: when the sample size is too large for the exact calculation
3.
Hypothesis tests about
different population proportions:
Chi-square test
3.1. Situations
• You want to find if two categorical vars from a single population are associated
– 2 vars are associated if they are dependent on one another (change in one var → change in the other var)
• Examples
– Is the proportion of female students in two departments (English and Computer) the same? [Gender and dept vars; if the % of females in both depts is the same → no association b/w gender and field of study]
– Does the number of supporters of a president change 1 year after his being elected? [2 vars: support (yes-no) and time (when elected and 1 year later). If the % of supporters is the same → no association]
• If both vars are binary → 2x2 table;
• In general, 2 categorical variables → rxc table
3.3. Chi-square test assumptions
– Nonparametric test → no distribution shape assumption
– Categorical or nominal data; data from a random sample
– The χ² test is valid only if f(e) > 0 and no more than 20% of the categories have f(e) < 5
– H0: There is no association b/w var 1 and var 2
• No association = independence = (% in urban ~ % in rural) = (f(o) = f(e))
3.4. Example 1: 2x2 table
– The mean age at mar is 19.4 yrs. Is there any sig difference in the proportion of mar at age below the average b/w urban and rural areas? H0: no association b/w age at mar and residential location
– SPSS CDHS 2005 data
• Recode v511 into binary var “v511_d” with 1 = below 19.4 and 2 = 19.4 or above
• Analyze → Descriptive → Crosstabs → v025 as column and v511_d as row
• Result
• Interpretation
– In total, 59% of women married at an age below the mean age at mar. This proportion is higher in rural areas than in urban areas (59.9% versus 55.9%, respectively). A χ² test was performed to see if the two vars are independent, and the result showed no significant association between the two vars at 95% CL (chi-square = 3.4, 2-tailed p = 0.065). Although the proportion of marriage at an age below the mean is higher in rural than in urban areas, the difference is not statistically significant.
– Note that if Ha is one-tailed, then p = 0.065/2 = 0.033 → sig at 95% → the two vars are dependent!
• Important!
The chi-square test does not tell us how strong the association is. To know this, we need to request measures of association:
– Contingency coefficient: C = sqrt(χ²/(χ² + n)), χ²-based, value: 0 → ~1 (approaches 1 only if there are many categories of vars)
– Phi: adjusted for n, φ = sqrt(χ²/n), for 2x2 tables only, value: 0 → 1
– Cramer’s V: adj for n, V = sqrt(χ²/(n*min(r-1, c-1))), rxc tables, value: 0 → 1
– Lambda and Uncertainty Coefficients are χ²-based, value: 0 → 1; PRE interpretation: improvement in predicting one var given the knowledge of the other var.
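The χ²-based measures can be computed by hand once the chi-square value and n are known. The sketch below uses the standard formulas for C, phi and Cramer's V. The chi-square value 3.4 comes from Example 1, but the sample size n = 1000 is a hypothetical figure chosen only for illustration, since the slides do not report n. Function names are made up here.

```python
import math

def contingency_c(chi2, n):
    """Contingency coefficient: sqrt(chi2 / (chi2 + n))."""
    return math.sqrt(chi2 / (chi2 + n))

def phi(chi2, n):
    """Phi, for 2x2 tables only: sqrt(chi2 / n)."""
    return math.sqrt(chi2 / n)

def cramers_v(chi2, n, r, c):
    """Cramer's V for an r x c table: sqrt(chi2 / (n * min(r-1, c-1)))."""
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

chi2 = 3.4   # chi-square reported for the 2x2 example
n = 1000     # HYPOTHETICAL sample size, for illustration only

print(f"C   = {contingency_c(chi2, n):.4f}")
print(f"phi = {phi(chi2, n):.4f}")
print(f"V   = {cramers_v(chi2, n, 2, 2):.4f}")  # equals phi for a 2x2 table
```

For a 2x2 table, min(r-1, c-1) = 1, so Cramer's V and phi coincide.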
3.4. Example 2: rxc table
– Find out if the levels of education of husbands and wives are related (v106 and v701). Is this true for both urban and rural areas? (Use appropriate tests)
– The table below summarizes the data on religious preference and attitudes toward abortion in one country. Is respondents’ attitude related to their religious preference?
Attitude toward abortion by religious preference
Attitude  Liberal protestant  Conservative protestant  Catholic  None  Total
Favor     103                 182                      80        16    381
Oppose    187                 238                      286       74    785
Total     290                 420                      366       90    1166
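One way to approach the abortion-attitude table by hand is sketched below: expected frequencies are built from the row and column totals, the chi-square statistic is summed over all cells, and the result is compared with the tabulated critical value for df = 3 at 95% CL (7.815). This is an illustrative sketch, not the SPSS Crosstabs procedure; variable names are made up here.

```python
# Observed counts f(o): rows = attitude, columns = religious preference
observed = [
    [103, 182, 80, 16],   # Favor
    [187, 238, 286, 74],  # Oppose
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# f(e) for each cell = row total * column total / n
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

chi2 = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumption check: smallest expected frequency should be at least 5
min_expected = min(min(row) for row in expected)

CRITICAL_95_DF3 = 7.815  # chi-square critical value, df = 3, alpha = 0.05
significant = chi2 > CRITICAL_95_DF3

print(f"n = {n}, df = {df}, chi-square = {chi2:.2f}, min f(e) = {min_expected:.2f}")
print("significant at 95% CL:", significant)
```

If the statistic exceeds the critical value, H0 of no association between attitude and religious preference is rejected.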