Multivariate exploratory analysis of qualitative data

Exploratory data analysis

3.4 Multivariate exploratory analysis of qualitative data

So far we have used covariance and correlation as our main measures of statistical relationships between quantitative variables. With ordinal qualitative variables, it is possible to extend the notion of covariance and correlation to the ranks of the observations. The correlation between the variable ranks is known as the Spearman correlation coefﬁcient. Table 3.7 shows how to express the ranks of two ordinal qualitative variables that describe the quality and the menu of four different restaurants. The Spearman correlation of the data in Table 3.7 is zero, therefore the ranks of the two variables are not correlated.

Table 3.7 Ranking of ordinal variables.

VariableA VariableB Ranks of variableA Ranks of variableB High Simple 3 1

Medium Intermediate 2 2 Medium Elaborated 2 3 Low Simple 1 1

More generally, transforming the levels of the ordinal qualitative variables into the corresponding ranks allows most of the analysis applicable to quantitative data to be extended to the ordinal qualitative case. This can also include principal component analysis (Section 3.5). However, if the data matrix contains qualitative data at the nominal level (not binary, otherwise they could be con- sidered quantitative, as in Section 2.3), the notion of covariance and correlation cannot be used. The rest of this section considers summary measures for the intensity of the relationships between qualitative variables of any kind. These measures are known as association indexes. These indexes can sometimes be applied to discrete quantitative variables, but with a loss of explanatory power.

In the examination of qualitative variables, a fundamental part is played by the frequencies for the levels of the variables. Therefore we begin with the contingency table introduced in Section 2.4. Unlike Section 2.4, qualitative data are often available directly in the form of a contingency table, without need- ing to access the original data matrix. To emphasise this difference, we now introduce a slightly different notation which we shall use throughout. Given a qualitative characterXwhich assumes the levelsX1, . . . , XI, collected in a pop-

ulation (or sample) ofn units, the absolute frequency of levelXi (i=1, . . . , I) is the number of times the variable X is observed having value X_i. Denote this absolute frequency by n_i. Table 3.8 presents a theoretical two-way contingency table to introduce the notation used in this Section. In Table 3.8 nij

Table 3.8 Theoretical two-way contingency table.

Y X Y1 . . . Yj . . . YJ Total X1 n11. . . n1j . . . n1J n1+ ... ... ... ... ... Xi ni1 . . . nij . . . niJ ni+ ... ... ... ... ... XI nI1. . . nIj . . . nIJ nI+ Total n₊1. . . n+j . . . n+J n

indicates the frequency associated with the pair of levels(X_i, Y_j), i=1,2, . . . , I; j=1,2, . . . , J, of the variablesXandY. Thenij are also called cell frequencies. • ni+ =Jj=1nij is the marginal frequency of the ith row of the table; it

represents the total number of observations which assume theith level ofX (i =1,2, . . . , I).

• n+j=Ii=1nij is the marginal frequency of the jth column of the table;

it denotes the total number of observations which assume the jth level of Y (j =1,2, . . . , J ).

For the frequencies in the table, we can write the following marginalisation relationship: I i=1 ni+= J j=1 n+j = I i=1 J j=1 nij =n

From ann×p data matrix, it is possible to constructp(p−1)/2 two-way contingency tables, corresponding to all possible qualitative variable pairs. However, it is usually reasonable to limit ourselves to obtaining only those that correspond to interesting ‘intersections’ between the variables, those for which the joint distribution may be important and a useful complement to the univariate frequency distribution.

3.4.1 Independence and association

To develop descriptive indexes of the relationship between qualitative variables, we need the concept of statistical independence. Two variables, X and Y, are said to be independent, with reference to n observations, if they adhere to the following conditions: ni1 n+1 = · · · = niJ n+J = ni+ n ∀i=1,2, . . . , I or, equivalently, n1j n1+ = · · · = nIj nI+ = n+j n ∀j=1,2, . . . , J

If this occurs it means that, with reference to the ﬁrst equation, the (bivariate) analysis of the variables does not give any additional information aboutXbeyond the univariate analysis of the variableXitself, and similarly forY in the second equation. It will be said in this case thatY and X are statistically independent. From the deﬁnition, notice that statistical independence is a symmetric concept in the two variables; in other words, ifXis independent of Y, thenY is independent of X. The previous conditions can be equivalently, and more conveniently, expressed as a function of the marginal frequenciesn_i₊ andn₊_j. ThenXandY

are independent if

nij = ni+_nn+j ∀i=1,2, . . . , I; ∀j=1,2, . . . , J In terms of relative frequencies, this is equivalent to

pXY(xi, yj)=pX(xi)pY(yj)

for everyiand for everyj. When working with real data, the statistical independence condition is almost never satisﬁed exactly. Consequently, observed data will often show some degree of interdependence between the variables.

The statistical independence notion applies to qualitative and quantitative variables. A measure of interdependence operates differently for qualitative variables than for quantitative variables. For quantitative variables, it is possible to calculate summary measures (called correlation measures) that work both on the levels and the frequencies. For qualitative variables, the summary measures (called association measures) can use only the frequencies, because the levels are not metric. For quantitative variables, an important relationship holds between statistical independence and the absence of correlation. If two variables, X and Y, are statistically independent, then cov(X,Y)=0 and r(X, Y )=0. The converse is not necessarily true, in the sense that two variables can be such thatr(x, y)=0, even though they are not independent. In other words, the absence of correlation does not imply statistical independence. An exception occurs when the variables X and Y are jointly distributed according to a normal multivariate distribution (Section 5.1). Then the two concepts are equivalent. The greater difﬁculty of using association measures compared with correlation measures lies in the fact that there are so many indexes available in the statistical literature. Here we examine three different classes: distance measures, dependency measures and model-based measures.

3.4.2 Distance measures

Independence between two variables,XandY, holds when nij = ni+_nn+j ∀i=1,2, . . . , I; ∀j=1,2, . . . , J

for all joint frequencies of the contingency table. A ﬁrst approach to the summary of an association can therefore be based on calculating a ‘global’ measure of disagreement between the frequencies actually observed (nij) and those expected

in the hypothesis of independence between the two variables (n_i₊n₊_j/n). The original statistic proposed by Karl Pearson is the most widely used measure for verifying the hypothesis of independence betweenXandY. In the general case, it is deﬁned by X2 ₌ I i=1 J j=1 (nij −n∗ij) n∗ ij 2

where

n∗ ij =

ni+n+j

n i=1,2, . . . , I;j =1,2, . . . , J

Note that X2 =0 if the variables X and Y are independent. In that case the factors in the numerator are all zero. The statistic X2 can be written in the equivalent form X2₌_n  I i=1 J j=1 n2 ij ni+n+j −1  

which emphasises the dependence of the statistic on the number of observations, n. This reveals a serious inconvenience – the value ofX2_{is an increasing function}

of the sample sizen.

To overcome such inconvenience, some alternative measures have been proposed, all functions of the previous statistic. Here is one of them:

φ2₌ X2 n = I i=1 J j=1 n2 ij ni+n+j −1

This index is usually called the mean contingency, and the square root of φ2 is called the phi coefﬁcient. For 2×2 contingency tables, representing binary variables, φ2 is normalised as it takes values between 0 and 1, and it can be shown that

φ2₌ Cov

2_{(X, Y )}

Var(X)Var(Y )

Therefore in the case of 2×2 tables,φ2_{is equivalent to the squared linear corre-}

lation coefﬁcient. For contingency tables bigger than 2×2,φ2_{is not normalised.}

To obtain a normalised index, useful for comparison, use a different modiﬁcation ofX2 _{called the Cramer index. Following an approach quite common in descrip-}

tive statistic, the Cramer index is obtained by dividing the φ2 statistic by the maximum value it can assume, for the structure of the given contingency table. Since such maximum is the minimum between the valuesI−1 andJ−1, with I andJ respectively the number of rows and columns of the contingency table, the Cramer index is equal to

V2₌ X2

nmin[(I−1), (J −1)]

It can be shown that 0≤V2_≤_{1 for any}_I_×_J _{contingency table, and}_V2₌_{0 if}

and only ifXand Y are independent. On the other hand,V2 =1 for maximum dependency between the two variables. Then three situations can be distinguished, referring without loss of generality to Table 3.8:

a) There is maximum dependency ofY on Xwhen in every row of the table there is only one non-zero frequency. This happens if to each level of X

there, corresponds one and only one level ofY. This condition occurs when V2₌_{1 and}_I _≥_J_.

b) There is maximum dependency ofX on Y if in every column of the table there is only one non-null frequency. This means that to each level of Y there corresponds one and only one level ofX. This condition occurs when V2₌_{1 and}_J _≥_I_.

c) If the two previous conditions are simultaneously satisﬁed, i.e. if I =J whenV2₌_{1, the two variables are maximally interdependent.}

We have referred to the case of two-way contingency tables, involving two variables, with an arbitrary number of levels. However, the measures presented here can easily be applied to multiway tables, extending the number of summands in the deﬁnition of X2 to account for all table cells.

The association indexes based on the Pearson statistic X2 _{measure the dis-}

tance of the relationship between X andY from the situation of independence. They refer to a generic notion of association, in the sense that they measure exclusively the distance from the independence situation, without giving information on the nature of that distance. On the other hand, these indexes are rather general, as they can be applied in the same fashion to all kinds of contingency table. Furthermore, as we shall see in Section 5.4, the statistic X2has an asymp- totic probabilistic (theoretical) distribution, so it can also be used to assess an inferential threshold to evaluate inductively whether the examined variables are signiﬁcantly dependent. Table 3.9 shows an example of calculating twoX2_-based

measures. Several more applications are given in the second half of the book.

3.4.3 Dependency measures

The association measures seen so far are all functions of theX2 statistics, so they are hardly interpretable in most real applications. This important aspect has been underlined by Goodman and Kruskal (1979), who have proposed an alternative

Table 3.9 Comparison of association measures.

Variable X2 V2 _U Y|X Sales variation 235.0549 0.2096 0.0759 Real estates 116.7520 0.1477 0.0514 Age of company 107.1921 0.1415 0.0420 Region of activity 99.8815 0.1366 0.0376 Number of employees 68.3589 0.1130 0.0335 Sector of activity 41.3668 0.0879 0.0187 Sales 23.3355 0.0660 0.0122 Revenues 21.8297 0.0639 0.0123 Age of owner 6.9214 0.0360 0.0032 Legal nature 4.7813 0.0299 0.0034 Leadership persistence 4.742 0.0298 0.0021 Type of activity 0.5013 −0.0097 0.0002

approach to measuring the association in a contingency table. The set-up followed by Goodman and Kruskal is based on defining indexes for the specific context under investigation. In other words, the indexes are characterised by an operational meaning that defines the nature of the dependency between the available variables. Suppose that, in a two-way contingency table,Y is the ‘dependent’ variable and X is the ‘explanatory’ variable. It may be interesting to evaluate if, for a generic observation, knowledge of theXlevel is able to reduce uncertainty about the corresponding category ofY. The degree of uncertainty in the level of a qualitative character is usually expressed using a heterogeneity index (Section 3.1).

Letδ(Y ) indicate a heterogeneity measure for the marginal distribution ofY, indicated by the vector of marginal relative frequencies, {f₊1, f+2, . . . , f+J}. Similarly, let δ(Y|i) be the same measure calculated on the conditional distribution of Y to the ith row of the variable X of the contingency table,{f_1|_i, f_2|_i, . . . , f_J_|_i} see Section 2.4. An association index based on the ‘proportional reduction in the heterogeneity’, or error proportional reduction index (EPR), can then be calculated as follows (Agresti, 1990):

EPR= δ(Y )−M[δ(Y|X)] δ(Y )

where M[δ(Y|X)] indicates the mean heterogeneity calculated with respect to the distribution ofX, namely

M[δ(Y|X)]= i

fi+δ(Y|i)

where

fi+=ni+/n (i=1,2, . . . , I)

This index measures the proportion of heterogeneity ofY (calculated throughδ) that can be ‘explained’ by the relationship with X. Remarkably, its structure is analogous to that of the squared linear correlation coefﬁcient (Section 4.3.3). By choosingδappropriately, different association measures can be obtained. Usually the choice is between the Gini index and the entropy index. Using the Gini index in the EPR expression, we obtain the so-called concentration coefﬁcient,τ_Y_|_X:

τY|X= f2 ij/fi+− f2 +j 1− j f2 +j

Using the entropy index in the ERP expression, we obtain the so-called uncertainty coefﬁcient,U_Y_|_X: UY|X = − i j fijlog(fij/fi+·f+j) j f+jlogf+j

In the case of null frequencies it is conventional to set log 0=0. Bothτ_Y_|_X and UY|X take values in the interval [0,1]. In particular, we can show that

τY|X=UY|X if and only if the variables are independent

τY|X=UY|X=1 if and only if Y has maximum dependence on X The indexes have a simple operational interpretation regarding specific aspects of the dependence link between the variables. In particular, both τ_Y_|_X and U_Y_|_X represent alternative quantifications of the reduction of the Y heterogeneity that can be explained through the dependence of Y on X. From this viewpoint they are rather specific in comparison to the distance measures of association. On the other hand, they are less general than the distance measures. Their application requires us to identify a causal link from one variable (explanatory) to the other (dependent), whereas the X2_{-based indexes are asymmetric. Furthermore, they}

cannot easily be extended to contingency tables with more than two ways, for obtaining an inferential threshold.

Table 3.9 is an actual comparison between the distance measures and the uncertainty coefﬁcient U_Y_|_X. It is based on data collected for a credit-scoring problem (Chapter 11). The objective of the analysis is to explore which of the explanatory variables described in the table (all qualitative or discrete quantitative) are most associated with the binary response variable. The response variable describes whether or not each of the 7134 business enterprises examined is credit- worthy. Note that the distance measures (X2and Cramer’sV2) are more variable and seem generally to indicate a higher degree of association. This is due to the more generic type of associations that they detect. The uncertainty coefﬁcient is more informative. For instance, it can be said that the variable ‘sales variation’ reduces the degree of uncertainty on the reliability by 7.6%. On the other hand, X2 _{can be easily compared to the inferential threshold that can be associated with}

it. For instance, with a given level of significance, only the first eight variables are significantly associated with creditworthiness.

3.4.4 Model-based measures

We can examine those association measures that do not depend on the marginal distributions of the variables. None of the previous measures satisfy this require- ment. We now consider a class of easily interpretable indexes that do not depend on the marginal distributions. These measures are based on probabilistic models and therefore allow an inferential treatment (Sections 5.4 and 5.5). For ease of notation, we shall assume a probabilistic model in which cell relative frequencies are replaced by cell probabilities. The cell probabilities can be interpreted as relative frequencies when the sample size tends to inﬁnity, therefore they have the same properties as the relative frequencies.

Consider a 2×2 contingency table, relative to the variablesXandY, respectively associated with the rows (X=0,1) and columns (Y =0,1) of the table. Let π11, π00, π10 and π01 indicate the probabilities that one observation is classiﬁed

in one of the four cells of the table. The odds ratio is a measure of association that constitutes a fundamental parameter in the statistical models for the analysis of qualitative data. Let π1|1 andπ0|1 indicate the conditional probabilities of

having a 1 (a success) and a 0 (a failure) in row 1; letπ1|0 andπ0|0 be, the same

probabilities for row 0. The odds of success for row 1 are deﬁned by odds1=

π1|1

π0|1 =

P (Y =1|X=1) P (Y =0|X=1) The odds of success for row 0 are deﬁned by

odds0= π 1|0

π0|0 =

P (Y =1|X=0) P (Y =0|X=0)

The odds are always a non-negative quantity, with a value greater than 1 when a success (level 1) is more probable than a failure (level 0), that is when P (Y =1|X=1) > P (Y =0|X=1). For example, if the odds equal 4, this means that a success is four times more probable than a failure. In other words, it is expected to observe four successes for every failure. Instead odds=1/4=0.25 means that a failure is four times more probable than a success; it is therefore expected to observe a success for every four failures.

The ratio between the two previous odds is the odds ratio: θ= odds1

odds0 =

π1|1/π0|1

π1|0/π0|0

From the deﬁnition of the odds, and using the deﬁnition of joint probability, it can easily be shown that:

θ = π11π00

π10π01

This expression shows that the odds ratio is a cross product ratio, the product of probabilities on the main diagonal to the product of the probabilities on the secondary diagonal of a contingency table. In the actual computation of the odds ratio, the probabilities will be replaced with the observed frequencies, leading to the following expression:

θij = n 11n00

n10n01

Here are three properties of the odds ratio:

• The odds ratio can be equal to any non-negative number; that is, it can take values in the interval [0,+∞).

• WhenXandY are independentπ1|1=π1|0, so that odds1 =odds0andθ=1.

On the other hand, depending on whether the odds ratio is greater or less than

In document Applied Data Mining Statistical Methods for Business and Industry Giudici P (2003) pdf (Page 65-75)