Correspondence analysis
and Related Methods – Part 2
1. What is multiple correspondence analysis (MCA)?
2. Why is MCA so useful as a method of visualizing
questionnaire data?
3. How is MCA implemented in XLSTAT?
• “Classical” or “simple” CA analyses the relationships
between two variables, although the method is extended
to analyse different forms of tabular data, for example the
product–attribute data shown previously, as well as
ratings, preferences, on an individual or aggregate level.
• Multiple CA analyses several categorical variables where
we are interested in all the relationships within the set of
variables, not between one set and another
• The best way to understand the difference is to see the
different data format for the MCA program in XLSTAT:
these are individual-level responses to several questions.
[Data screenshot: individual-level responses to four questions concerning working women, followed by demographic categories]
Source: Family & Changing Gender Roles Survey, ISSP (1994)
Between-set versus within-set
• “between-set” means that there are two sets of variables and we are interested in the relationships between them, e.g. between demographics and the question responses
• “within-set” means that there is one set of variables and we are interested in the relationships amongst them, e.g. amongst the question responses. This is the multiple correspondence analysis (MCA) case
• Questions: Should a woman work full-time (W), work part-time (w) or stay at home (H)? With missing data (?) this gives 4 response categories. The four situations are: (Q1) before she has children; (Q2) when she has a pre-school child; (Q3) when her children are still at school; (Q4) when all children have left home.
Between-set example: Simple CA
Q3: Should a woman with a child at school work full-time, part-time or stay at
home?
                  work       work       stay at   DK/unsure/
                  full-time  part-time  home      missing
COUNTRY           W          w          H         ?          Total
AUS               256        1156       176       191        1779
DW                101        1394       581       248        2324
DE                278        691        62        66         1097
GB                161        646        70        107        984
NIRL              126        394        75        52         647
USA               482        686        107       172        1447
A                 84         632        202       59         977
H                 285        736        447       32         1500
I                 171        670        167       10         1018
IRL               223        424        209       82         938
NL                539        1205       143       81         1968
N                 487        1242       205       153        2087
S                 295        833        39        105        1272
CZ                228        585        198       13         1024
SLO               341        428        222       41         1032
PL                431        425        589       152        1597
BG                270        427        335       94         1126
RUS               175        1154       550       119        1998
NZ                120        754        72        101        1047
CDN               566        497        108       269        1440
IL                468        664        92        63         1287
J                 203        671        313       120        1307
E                 738        1012       514       230        2494
RP                243        448        484       25         1200
Total             7271       17774      5960      2585       33590
Average profile   0.216      0.529      0.177     0.077      1
Source: Family & Changing Gender Roles Survey, ISSP (1994)
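The quantities used in a simple CA of such a table (masses, average profile, total inertia and principal inertias) can be sketched as follows. This is only an illustrative sketch, assuming NumPy is available; it uses just four rows of the table (AUS, DW, DE, GB), so its numbers will not match the full 24-country analysis, and the function name `simple_ca` is made up for this example.

```python
import numpy as np

# Four rows copied from the table above (AUS, DW, DE, GB); columns W, w, H, ?.
N = np.array([
    [256, 1156, 176, 191],
    [101, 1394, 581, 248],
    [278,  691,  62,  66],
    [161,  646,  70, 107],
], dtype=float)

def simple_ca(N):
    """Return row masses, column masses (average profile) and principal inertias."""
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses = average row profile
    # Standardized residuals: (p_ij - r_i c_j) / sqrt(r_i c_j)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    sv = np.linalg.svd(S, compute_uv=False)
    return r, c, sv ** 2                 # squared singular values = principal inertias

r, c, principal_inertias = simple_ca(N)
total_inertia = principal_inertias.sum()  # equals chi-square statistic / grand total
print("average profile:", np.round(c, 3))
print("total inertia:", round(float(total_inertia), 4))
print("percentages:", np.round(100 * principal_inertias / total_inertia, 1))
```

The percentages printed on the last line play the same role as the percentages attached to the axes of a CA map.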
Simple CA
Should a woman with a child at school work full-time, part-time or stay at home?
[Figure: CA map of the 24 countries and the response categories W, w, H and ?. Axis 1: 0.0737 (50.6%); axis 2: 0.0532 (36.5%); 87.1% of inertia explained.]
Simple CA of multiway tables
Should a woman with a child at school work full-time, part-time or stay at home?
                  work       work       stay at   DK/unsure/
                  full-time  part-time  home      missing
COUNTRY           W          w          H         ?          Total
AUSm              117        596        114       82         909
AUSf              138        559        60        109        866
DWm               43         675        357       123        1198
DWf               58         719        224       125        1126
. . .
RPm               347        445        294       111        1197
RPf               390        566        218       118        1292
Total             7271       17774      5960      2585       33590
Average profile   0.216      0.529      0.177     0.077      1
• Each country is split by gender: 24×2 country-gender groups. We say the variables country and gender are interactively coded.
• The average profile stays the same, so the definitions of the centre and of geometric distance remain identical to the previous map. All that has been done is to split each country point into two profiles.
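The effect of interactive coding on the margins and on the inertia can be checked numerically. The sketch below is an illustration only: it uses the four split rows for AUS and DW from the table above, merges them back by country, and compares the two tables. The helper `inertia` and the variable names are made up for this example.

```python
def inertia(table):
    """Total inertia = chi-square statistic divided by the grand total."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return sum(
        (x - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(table, row_tot)
        for x, ct in zip(row, col_tot)
    ) / n

# Country-gender rows copied from the table above; columns W, w, H, ?.
split = {
    ("AUS", "m"): [117, 596, 114, 82],
    ("AUS", "f"): [138, 559, 60, 109],
    ("DW", "m"): [43, 675, 357, 123],
    ("DW", "f"): [58, 719, 224, 125],
}

# Merge the gender rows back into one row per country.
merged = {}
for (country, _gender), counts in split.items():
    previous = merged.get(country, [0] * len(counts))
    merged[country] = [a + b for a, b in zip(previous, counts)]

split_table = list(split.values())
merged_table = list(merged.values())

# Column margins (and hence the average profile) are unchanged by the split.
col_totals_split = [sum(col) for col in zip(*split_table)]
col_totals_merged = [sum(col) for col in zip(*merged_table)]

inertia_split = inertia(split_table)
inertia_merged = inertia(merged_table)
share_due_to_gender = (inertia_split - inertia_merged) / inertia_split
print(col_totals_split == col_totals_merged, round(100 * share_due_to_gender, 1))
```

Splitting rows can only add inertia, so the difference between the two totals can be read as the part of the inertia attributable to gender differences within countries.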
Simple CA of multiway tables
Should a woman with a child at school work full-time, part-time or stay at home?
[Figure: CA map of the 48 country-gender groups (AUSm, AUSf, ..., Em, Ef) and the response categories W, w, H and ?. Axis 1: 0.0797 (51.5%); axis 2: 0.0546 (35.3%); 86.8% of inertia explained.]
• Ireland (IRL) has the largest M–F difference
• Bulgaria (BG) is the only country with a reverse M–F difference
• Inertia before: 0.1456
• Inertia with M–F split: 0.1546
• 5.8% due to M–F
Simple CA of multiway tables
Should a woman with a child at school work full-time, part-time or stay at home?
[Figure: CA map of the country-gender-age groups and the response categories W, w, H and ?, with labelled points including CDNf<25, Hm>66, PLm>66, NZm>66, DEm<25 and PLm26-35. Axis 1: 0.1301 (54.3%); axis 2: 0.0791 (33.0%); 87.3% of inertia explained.]
• Interactive coding of country (24), gender (2) and age (6), giving 288 combinations
• Points tend to lie in a curved pattern (called an arch or horseshoe)
• Points that lie inside the arch are polarized, e.g. PLm26-35: 32% W, 22% w, 32% H, but NZm>66: 7% W, 73% w, 15% H. Average: 22% W, 53% w, 18% H
Stacked tables
Should a woman with a child at school work full-time, part-time or stay at home?
• Each variable is separately cross-tabulated with the question, and the tables are then stacked one on top of another. Rows: Country (24), Gender (2), Age (6), Education (7), Marital status (5), Social class (8). Columns: W, w, H, ?
• Since the column margins of each table are identical (and the same as in the interactively coded tables before), the basic geometry remains the same. Only the detail is sacrificed here: all the information is collapsed into “main effects”.
• Inertia of the stacked table is the average of the inertias of its subtables
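The averaging property of stacked tables can be verified on a toy example. The sketch below uses ten hypothetical respondents (the data, `crosstab` and `inertia` are all invented for illustration) and assumes, as in the slides, that every subtable is built from the same respondents, so that all subtables share the same grand total and column margins.

```python
from collections import Counter

def inertia(table):
    """Total inertia = chi-square statistic divided by the grand total."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return sum(
        (x - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(table, row_tot)
        for x, ct in zip(row, col_tot)
    ) / n

def crosstab(rows, cols, row_levels, col_levels):
    """Contingency table of two categorical variables."""
    counts = Counter(zip(rows, cols))
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]

# Hypothetical individual-level data (not from the survey).
gender = ["m", "f", "f", "m", "f", "m", "f", "m", "f", "f"]
age = ["A1", "A2", "A3", "A1", "A2", "A3", "A1", "A2", "A3", "A1"]
answer = ["W", "w", "H", "w", "w", "H", "W", "w", "w", "H"]
levels = ["W", "w", "H"]

by_gender = crosstab(gender, answer, ["m", "f"], levels)
by_age = crosstab(age, answer, ["A1", "A2", "A3"], levels)
stacked = by_gender + by_age          # stack the two tables row-wise

average_of_subtables = (inertia(by_gender) + inertia(by_age)) / 2
print(round(inertia(stacked), 6), round(average_of_subtables, 6))
```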
Stacked tables
• Tables can be stacked row-wise and column-wise, adding additional questions as columns
• Row blocks: Country (24), Gender (2), Age (6), Education (7), Marital status (5), Social class (8)
• Column blocks (W, w, H, ? for each question): Should a (married) woman work full-time, part-time or stay at home ... before having children; ... with a preschool child; ... with a child at school; ... when her children have left home?
• 24 contingency tables in a 6×4 pattern; tables in the same row block share their row margins and tables in the same column block share their column margins.
• Inertia of the stacked table is the average of the inertias of its subtables
Stacked tables
Women in the workplace and 6 demographic variables
[Figure: CA map of the stacked table, showing the question categories (1W ... 4?) together with the categories of country, gender (M, F), age (A1–A6), marital status (ma, wi, di, se, si), education (E1–E7) and social class (S0–S6, S*). Axis 1: 0.0188 (49.1%); axis 2: 0.0084 (21.9%); 71.0% of inertia explained.]
• Relationships between each demographic variable and each question are displayed jointly
• Relationships within questions and relationships within demographics are not displayed explicitly
• Join categories of an ordinal variable, for example age, to see trends
Multiple correspondence analysis (MCA)
Women in the workplace – 4 questions
West & East German samples only
• N rows, Q questions, q-th question has J_q categories, total number of categories is J (N = 3415, Q = 4, J_q = 4 for all q, J = 16)
• One definition of MCA is that it is the CA of the indicator matrix
• Response data is recoded as dummy variables

Original data    Indicator matrix
Questions        Qu. 1      Qu. 2      Qu. 3      Qu. 4
1 2 3 4          W w H ?    W w H ?    W w H ?    W w H ?
1 3 2 2          1 0 0 0    0 0 1 0    0 1 0 0    0 1 0 0
2 3 3 2          0 1 0 0    0 0 1 0    0 0 1 0    0 1 0 0
4 3 3 2          0 0 0 1    0 0 1 0    0 0 1 0    0 1 0 0
4 4 4 4          0 0 0 1    0 0 0 1    0 0 0 1    0 0 0 1
4 4 4 4          0 0 0 1    0 0 0 1    0 0 0 1    0 0 0 1
1 3 2 1          1 0 0 0    0 0 1 0    0 1 0 0    1 0 0 0
. . . and so on for 3415 rows
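The recoding of responses into dummy variables can be sketched in a few lines. The six rows below are the ones shown in the table above; the function name `indicator_row` and the column labels are assumptions made for this illustration, and XLSTAT performs this recoding internally.

```python
# First six respondents from the table above: answers (1-4) to Qu. 1 to Qu. 4.
responses = [
    (1, 3, 2, 2),
    (2, 3, 3, 2),
    (4, 3, 3, 2),
    (4, 4, 4, 4),
    (4, 4, 4, 4),
    (1, 3, 2, 1),
]
LABELS = "WwH?"  # category codes 1, 2, 3, 4

def indicator_row(answers, n_categories=4):
    """Recode one respondent's answers as a row of 0/1 dummy variables."""
    row = []
    for answer in answers:
        dummies = [0] * n_categories
        dummies[answer - 1] = 1       # exactly one 1 per question
        row.extend(dummies)
    return row

Z = [indicator_row(r) for r in responses]
column_names = [f"{q}{label}" for q in range(1, 5) for label in LABELS]

print(" ".join(column_names))
for row in Z:
    print("  ".join(str(v) for v in row))
```

Each row of Z sums to Q = 4, since every respondent falls in exactly one category per question.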
MCA: XLSTAT initial output
Total inertia: 3
Eigenvalues and percentages of inertia:
                        F1       F2       F3       F4       F5      ...  F12
Eigenvalue              0.692    0.513    0.365    0.307    0.218
Inertia (%)             23.061   17.108   12.156   10.248   7.254
Cumulative %            23.061   40.169   52.325   62.573   69.827
Adjusted Inertia        0.347    0.123    0.023    0.006
Adjusted Inertia (%)    66.152   23.482   4.456    1.118
Cumulative %            66.152   89.634   94.090   95.208
Total inertia in MCA of indicator matrix Z:
$$\frac{J - Q}{Q} = \frac{16 - 4}{4} = 3$$
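That the total inertia of an indicator matrix depends only on J and Q, not on the actual responses, can be checked on a small example. The sketch below uses seven hypothetical respondents and three questions with 2, 3 and 4 categories (all invented for illustration), and assumes every category is chosen at least once.

```python
def inertia(table):
    """Total inertia = chi-square statistic divided by the grand total."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return sum(
        (x - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(table, row_tot)
        for x, ct in zip(row, col_tot)
    ) / n

# Hypothetical answers (0-based category codes) to Q = 3 questions.
answers = [(0, 0, 0), (0, 1, 1), (1, 2, 2), (1, 0, 3), (0, 2, 3), (1, 1, 0), (0, 0, 2)]
n_categories = (2, 3, 4)

def indicator_row(row):
    """Dummy-code one respondent across all questions."""
    out = []
    for answer, k in zip(row, n_categories):
        out.extend(1 if answer == level else 0 for level in range(k))
    return out

Z = [indicator_row(row) for row in answers]
Q, J = len(n_categories), sum(n_categories)

total_inertia = inertia(Z)
expected = (J - Q) / Q                # formula from the slide
print(round(total_inertia, 6), expected)
```

With J = 16 and Q = 4, as in the women-in-the-workplace data, the same formula gives the total inertia of 3 reported by XLSTAT.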
Multiple correspondence analysis (MCA)
Women in the workplace – 4 questions
• If Z (N × J) is the indicator matrix, then the Burt matrix B (J × J) is B = ZᵀZ
• An alternative definition of MCA is that it is the CA of the Burt matrix
• The Burt matrix is the stacked matrix of all two-way contingency tables, including each variable with itself

Burt matrix
     1W   1w   1H   1?   2W   2w   2H   2?   3W   3w   3H   3?   4W   4w   4H   4?
1W 2500    0    0    0  172 1107 1130   91  355 1709  345   91 1766  537   40  157
1w    0  476    0    0    7  129  335    5   16  261  181   18  128  293   17   38
1H    0    0   79    0    1    6   72    0    1   17   61    0   14   21   38    6
1?    0    0    0  360    1   57  108  194    7   96   55  202   51   45    2  262
2W  172    7    1    1  181    0    0    0  127   48    4    2  165   15    0    1
2w 1107  129    6   57    0 1299    0    0  219  997   61   22  972  239   13   75
2H 1130  335   72  108    0    0 1645    0   24  988  573   60  760  615   84  186
2?   91    5    0  194    0    0    0  290    9   50    4  227   62   27    0  201
3W  355   16    1    7  127  219   24    9  379    0    0    0  360   14    1    4
3w 1709  261   17   96   48  997  988   50    0 2083    0    0 1348  566   23  146
3H  345  181   61   55    4   61  573    4    0    0  642    0  202  286   73   81
3?   91   18    0  202    2   22   60  227    0    0    0  311   49   30    0  232
4W 1766  128   14   51  165  972  760   62  360 1348  202   49 1959    0    0    0
4w  537  293   21   45   15  239  615   27   14  566  286   30    0  896    0    0
4H   40   17   38    2    0   13   84    0    1   23   73    0    0    0   97    0
4?  157   38    6  262    1   75  186  201    4  146   81  232    0    0    0  463
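The construction B = ZᵀZ can be illustrated on the six respondents listed earlier in the indicator matrix slide. This is a minimal sketch with plain Python lists; the helper names are made up for the example, and the resulting counts are of course far smaller than those of the full 3415-row Burt matrix.

```python
# Six respondents from the indicator matrix slide: answers (1-4) to Qu. 1 to Qu. 4.
responses = [
    (1, 3, 2, 2),
    (2, 3, 3, 2),
    (4, 3, 3, 2),
    (4, 4, 4, 4),
    (4, 4, 4, 4),
    (1, 3, 2, 1),
]

def indicator_row(answers, n_categories=4):
    """Recode one respondent's answers as a row of 0/1 dummy variables."""
    row = []
    for answer in answers:
        dummies = [0] * n_categories
        dummies[answer - 1] = 1
        row.extend(dummies)
    return row

Z = [indicator_row(r) for r in responses]
J = len(Z[0])

# Burt matrix B = Z^T Z: entry (j, k) counts respondents in both category j and category k.
B = [[sum(row[j] * row[k] for row in Z) for k in range(J)] for j in range(J)]

labels = [f"{q}{c}" for q in range(1, 5) for c in "WwH?"]
for label, row in zip(labels, B):
    print(label, row)
```

The diagonal 4×4 blocks are diagonal matrices holding the category frequencies, and the off-diagonal blocks are the two-way cross-tabulations between pairs of questions.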
MCA (Burt matrix version)
Women in the workplace – 4 questions
[Figure: MCA map (Burt matrix version) of the 16 response categories 1W ... 4?. Axis 1: 0.479 (41.9%); axis 2: 0.263 (23.0%); 64.9% of inertia explained (only 40.2% if the indicator matrix is analysed).]
• Relationships amongst (within) the set of questions are displayed jointly
• Missing value categories have strong association
• Results are the same as for the indicator matrix; only the principal inertias change.
Multiple correspondence analysis (MCA)
Women in the workplace – 4 questions
• Total inertia of the Burt matrix is the average of the inertias of its submatrices = 1.143
• Since the diagonal inertias are so high, they inflate the average, hence the low percentages
(Burt matrix as shown above)

Burt matrix – inertias of each subtable
          Qu. 1   Qu. 2   Qu. 3   Qu. 4
Qu. 1     3.000   0.363   0.424   0.644
Qu. 2     0.363   3.000   0.892   0.345
Qu. 3     0.424   0.892   3.000   0.480
Qu. 4     0.644   0.345   0.480   3.000
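The total of 1.143 can be reproduced directly from the sixteen subtable inertias. The short sketch below only does this arithmetic; the variable names are chosen for the example.

```python
Q = 4

# Inertias of the Q x Q subtables of the Burt matrix, copied from the table above.
subtable_inertias = [
    [3.000, 0.363, 0.424, 0.644],
    [0.363, 3.000, 0.892, 0.345],
    [0.424, 0.892, 3.000, 0.480],
    [0.644, 0.345, 0.480, 3.000],
]

# Total inertia of the Burt matrix: average over all Q*Q subtables.
total_inertia_B = sum(map(sum, subtable_inertias)) / Q ** 2

# Average over the off-diagonal subtables only (the part of real interest).
off_diagonal = [
    subtable_inertias[i][j] for i in range(Q) for j in range(Q) if i != j
]
average_off_diagonal = sum(off_diagonal) / len(off_diagonal)

print(round(total_inertia_B, 4), round(average_off_diagonal, 4))
```

The four diagonal values of 3.000 contribute 12 of the total 18.296, which is why the average is so much larger than any of the off-diagonal inertias.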
• Percentage of variance explained is actually much higher: in MCA the overall inertia is inflated by the diagonal tables in the Burt matrix. The percentage is actually about 90%.
Adjustment of principal inertias (eigenvalues)
We can rescale an existing MCA solution in order to best fit the off-diagonal tables. All we need is the total inertia of the Burt matrix, inertia(B), and the principal inertias $\lambda_k^2$ of the Burt matrix in the solution space.
If we have computed the solution on the indicator matrix Z (as in the MCA module of XLSTAT), the eigenvalues calculated are $\lambda_k$, so the squares of all the principal inertias of Z need to be summed in order to get inertia(B). If you have analysed the Burt matrix B, inertia(B) is the total inertia.
Here are the steps to rescale the solution:
1. Calculate the average off-diagonal inertia:
$$\text{average off-diagonal inertia} = \frac{Q}{Q-1}\left(\text{inertia}(B) - \frac{J-Q}{Q^2}\right)$$
2. Calculate the adjusted principal inertias:
$$\text{adjusted principal inertias} = \left(\frac{Q}{Q-1}\right)^2\left(\lambda_k - \frac{1}{Q}\right)^2 \quad \text{only for } \lambda_k > \frac{1}{Q}$$
3. Calculate adjusted percentages of inertia:
$$\text{adjusted percentages of inertia} = \frac{\text{adjusted principal inertias}}{\text{average off-diagonal inertia}}$$
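The three rescaling steps can be applied to the XLSTAT output shown earlier. The sketch below takes the rounded indicator-matrix eigenvalues F1 to F5 and inertia(B) = 1.143 from the slides, and assumes the adjustment has the usual form (Q/(Q-1))² (λ_k - 1/Q)² for λ_k > 1/Q; because the inputs are rounded to three decimals, the results agree with the reported adjusted values only approximately.

```python
Q, J = 4, 16

# Eigenvalues of the indicator matrix (F1..F5) as reported in the XLSTAT output.
indicator_eigenvalues = [0.692, 0.513, 0.365, 0.307, 0.218]
inertia_B = 1.143                     # total inertia of the Burt matrix (from the slides)

# Step 1: average off-diagonal inertia.
average_off_diagonal = Q / (Q - 1) * (inertia_B - (J - Q) / Q ** 2)

# Step 2: adjusted principal inertias, only for eigenvalues above 1/Q.
adjusted = [
    (Q / (Q - 1)) ** 2 * (lam - 1 / Q) ** 2
    for lam in indicator_eigenvalues
    if lam > 1 / Q
]

# Step 3: adjusted percentages of inertia.
percentages = [100 * a / average_off_diagonal for a in adjusted]

for a, p in zip(adjusted, percentages):
    print(f"{a:.3f}  {p:.1f}%")
```

The fifth eigenvalue (0.218) is below 1/Q = 0.25 and is therefore dropped, which is consistent with only four adjusted inertias appearing in the output table.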
MCA (adjusted)
Women in the workplace – 4 questions
[Figure: MCA map of the 16 response categories 1W ... 4? with adjusted principal inertias. Axis 1: 0.347 (66.2%); axis 2: 0.123 (23.5%); 89.7% of inertia explained.]
MCA (Burt matrix version)
Women in the workplace – 4 questions
[Figure: MCA map (Burt matrix version) of the 16 response categories 1W ... 4?, repeated for comparison. Axis 1: 0.479 (41.9%); axis 2: 0.263 (23.0%); 64.9% of inertia explained.]
MCA
Women in the workplace – supplementary demographic groups
[Figure: MCA map with supplementary points for the demographic groups: sample (DW, DE), gender (M, F), age (A1–A6), education (E1–E6, E*) and marital status (ma, wi, di, se, si). Both axes run from -0.5 to 0.5.]
Related topics
1. Subset correspondence analysis
• restricting analysis to a subset of categories (e.g. all
substantive responses excluding missing categories, or
missing categories by themselves, or “middle” categories)
2. Square asymmetric tables
• mobility tables, brand-switching, migration...
3. Recoding of data before applying CA
• ratings, preferences, paired comparisons, continuous-scale
data (ratio and interval)
4. Stability and inference
• concentration ellipses, convex hulls, permutation tests
5. Canonical correspondence analysis (CCA)
• CA with explanatory variables (combination of dimension
reduction and regression)
Subset correspondence analysis
For example, we can analyse the women working data while ignoring the missing values. This is NOT just a CA of the table without the missing value columns: the masses and metric of the complete matrix are maintained.
In XLSTAT’s MCA program you are given a menu for selecting which categories you want to retain or omit.
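The idea of keeping the masses and metric of the complete matrix can be sketched on a simple two-way table. This is an assumption-laden illustration, not the XLSTAT procedure: it uses four rows of the country table (AUS, DW, DE, GB) rather than an indicator or Burt matrix, assumes NumPy is available, and the function name `subset_inertias` is invented here.

```python
import numpy as np

# Four rows of the country table (AUS, DW, DE, GB); columns W, w, H, ?.
N = np.array([
    [256, 1156, 176, 191],
    [101, 1394, 581, 248],
    [278,  691,  62,  66],
    [161,  646,  70, 107],
], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)   # masses of the COMPLETE table
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals

def subset_inertias(S, cols):
    """Principal inertias of the subset analysis on the chosen columns."""
    return np.linalg.svd(S[:, cols], compute_uv=False) ** 2

substantive = subset_inertias(S, [0, 1, 2])   # W, w, H
missing = subset_inertias(S, [3])             # the "?" column on its own
total = (S ** 2).sum()

print(round(float(substantive.sum()), 4), round(float(missing.sum()), 4),
      round(float(total), 4))
```

Because the residuals are computed once from the complete table, the inertia of the retained categories and the inertia of the omitted ones add up to the total inertia, which would not hold if the missing column were simply deleted before the analysis.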
Subset correspondence analysis
[Figure: subset MCA map of the 12 substantive response categories 1W ... 4H, with the missing categories omitted. Axis 1: 0.1240 (70.0%); axis 2: 0.0241 (13.5%).]
Canonical correspondence analysis (CCA)
This has the same objective as CA but restricts the CA solution to be (linearly) related to external predictor variables. For example, we want to find the best low-dimensional view of the responses which is related to age (either age group or the original age variable).
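One common way to impose such a restriction is to project the standardized residuals of CA onto the space spanned by the (weighted, centred) predictors before taking the SVD. The sketch below follows that approach under stated assumptions: the table `N` and the `age` values are hypothetical, NumPy is assumed to be available, and `cca_inertias` is a name invented for this example rather than any package's API.

```python
import numpy as np

def cca_inertias(N, X):
    """Principal inertias of a CA constrained to be linear in the columns of X."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    Xc = X - r @ X                        # centre predictors using row masses as weights
    Xw = np.sqrt(r)[:, None] * Xc         # weight rows by square roots of the masses
    projector = Xw @ np.linalg.pinv(Xw)   # orthogonal projector onto the predictor space
    constrained = np.linalg.svd(projector @ S, compute_uv=False) ** 2
    return constrained, (S ** 2).sum()

# Hypothetical table: six groups of respondents by response category (W, w, H, ?).
N = np.array([
    [30, 50, 10, 10],
    [25, 55, 12,  8],
    [20, 50, 22,  8],
    [15, 45, 30, 10],
    [10, 40, 40, 10],
    [ 8, 30, 50, 12],
], dtype=float)
age = np.array([[22.0], [30.0], [40.0], [50.0], [60.0], [70.0]])  # hypothetical ages

constrained, total = cca_inertias(N, age)
print("share of inertia related to age:", round(float(constrained.sum() / total), 3))
```

With a single continuous predictor the constrained space is one-dimensional, so only the first constrained principal inertia is non-zero; coding age as several group dummies would allow more constrained dimensions.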
Canonical correspondence analysis (restricted to age group differences)
[Figure: CCA map of the response categories Q1-1 ... Q4-4 and the age groups agegp-1 ... agegp-6. Axis 1: 0.685 (63.5%); axis 2: 0.465 (18.4%).]