
3. Entropy coefficients

A third approach to studying relationships between variables comes from information theory and is based on the concept of entropy. For an event with probability $p$, the amount of uncertainty is defined as $-\ln(p)$. Given a distribution, it is then possible to define the average amount of uncertainty across all events; this average uncertainty is called entropy. In mathematical terms, for a distribution with probabilities $z_1, \dots, z_n$ the formula for entropy is

$$-\sum_{i=1}^{n} z_i \ln(z_i).$$


It is clear that any entropy is a non-negative number, with the value zero being achieved if each probability is either 0 or 1 (a term with $z_i = 0$ is taken to contribute zero); in this case there is no uncertainty. The maximal value for entropy is, in fact, $\ln(n)$ and is achieved when all probabilities $z_i$ are the same (see for instance [14], section 3.5).
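As a rough illustration of these two extremes, the following minimal sketch computes the entropy of three made-up distributions. The function name `entropy` and the example probabilities are assumptions chosen for illustration, not taken from the text.

```python
import math

def entropy(probs):
    """Average uncertainty: the sum of -z * ln(z), with zero-probability terms skipped."""
    return sum(-z * math.log(z) for z in probs if z > 0)

certain = entropy([0.0, 1.0, 0.0])       # one event is certain: no uncertainty
uniform = entropy([0.25] * 4)            # four equally likely events
skewed = entropy([0.7, 0.1, 0.1, 0.1])   # an intermediate case

print("certain:", certain)
print("uniform:", uniform, "ln(4):", math.log(4))
print("skewed: ", skewed)
```

The uniform case should agree with ln(4) up to floating-point error, and the skewed case should fall strictly between the two extremes.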

Therefore, we can define entropies for the row and column variables:

$$E(P) = -\sum_{i=1}^{k} p_i \ln(p_i), \qquad E(Q) = -\sum_{j=1}^{l} q_j \ln(q_j).$$

This is the ‘prior’ amount of uncertainty, without any additional information. One natural thing to do is then to measure how much reduction in uncertainty of one variable we get when the other variable is introduced. Suppose, for example, that we initially had the ‘row’ entropy E(P) and now the ‘column’ information will be used to reduce it. For any cell $(i, j)$, we now have to use the conditional probability $r_{i,j}/q_j$ from column $j$, as in the Goodman and Kruskal approach. The uncertainty for this cell is then $-\ln(r_{i,j}/q_j)$ and the ‘posterior’ entropy is computed as the average uncertainty:

$$E(P|Q) = -\sum_{i,j} r_{i,j} \ln(r_{i,j}/q_j).$$

This number is still non-negative because $r_{i,j}/q_j \le 1$. Now the coefficient used to measure association between variables can be defined:

$$\theta_1 = \frac{E(P) - E(P|Q)}{E(P)}.$$

The interpretation of this coefficient is straightforward – it shows the reduction in the amount of uncertainty for ‘rows’ when ‘column’ information is introduced.

Similarly, if the ‘row’ information is used to predict column values, the corresponding coefficient can be defined in the same way:

$$E(Q|P) = -\sum_{i,j} r_{i,j} \ln(r_{i,j}/p_i) \quad\text{and}\quad \theta_2 = \frac{E(Q) - E(Q|P)}{E(Q)}.$$
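The two coefficients can be computed directly from a table of counts. The sketch below is one possible way to do it; the function names and the small tables of counts are hypothetical, chosen only to illustrate the definitions.

```python
import math

def entropy(probs):
    """Sum of -z * ln(z), skipping zero-probability terms."""
    return sum(-z * math.log(z) for z in probs if z > 0)

def entropy_coefficients(counts):
    """Return (theta1, theta2) for a contingency table given as rows of counts."""
    n = sum(sum(row) for row in counts)
    r = [[c / n for c in row] for row in counts]   # cell proportions r_ij
    p = [sum(row) for row in r]                    # row marginals p_i
    q = [sum(col) for col in zip(*r)]              # column marginals q_j
    cells = [(i, j) for i in range(len(p)) for j in range(len(q)) if r[i][j] > 0]
    # 'Posterior' entropies: average of -ln(conditional probability) over cells.
    e_p_given_q = sum(-r[i][j] * math.log(r[i][j] / q[j]) for i, j in cells)
    e_q_given_p = sum(-r[i][j] * math.log(r[i][j] / p[i]) for i, j in cells)
    theta1 = (entropy(p) - e_p_given_q) / entropy(p)
    theta2 = (entropy(q) - e_q_given_p) / entropy(q)
    return theta1, theta2

table = [[30, 10, 5],
         [10, 25, 20]]            # hypothetical counts
theta1, theta2 = entropy_coefficients(table)
print("theta1:", theta1, "theta2:", theta2)

# A table in which each column determines the row, and vice versa.
perfect1, perfect2 = entropy_coefficients([[5, 0], [0, 7]])
print("perfect association:", perfect1, perfect2)
```

For the second table both coefficients should come out as 1, since knowing one variable removes all uncertainty about the other.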

It is also possible to express these coefficients in terms of the joint entropy that corresponds to the cell distribution and is defined by

$$E(P, Q) = -\sum_{i=1}^{k} \sum_{j=1}^{l} r_{i,j} \ln(r_{i,j}).$$

We then have (see Appendix G for details)

E(P|Q) = E(P, Q) − E(Q) and E(Q|P) = E(P, Q) − E(P). (5.2)

These formulae imply, in particular, that E(P, Q) ≥ E(P) and E(P, Q) ≥ E(Q), so that the cell distribution has a greater (or equal) average amount of uncertainty than each of the marginal distributions.
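The full argument is in Appendix G; the following is only a brief sketch of why the first identity in (5.2) should hold. It assumes nothing beyond the definitions above and the fact that the cell proportions in column $j$ sum to $q_j$.

```latex
\begin{aligned}
E(P|Q) &= -\sum_{i,j} r_{i,j} \ln\frac{r_{i,j}}{q_j} \\
       &= -\sum_{i,j} r_{i,j} \ln(r_{i,j}) + \sum_{j} \Bigl(\sum_{i} r_{i,j}\Bigr) \ln(q_j) \\
       &= E(P, Q) + \sum_{j} q_j \ln(q_j) \\
       &= E(P, Q) - E(Q).
\end{aligned}
```

The second identity follows in the same way with the roles of rows and columns exchanged.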

Hence, the coefficients can be expressed as

$$\theta_1 = \frac{E(P) - (E(P, Q) - E(Q))}{E(P)} = \frac{E(P) + E(Q) - E(P, Q)}{E(P)}$$

and

$$\theta_2 = \frac{E(Q) - (E(P, Q) - E(P))}{E(Q)} = \frac{E(P) + E(Q) - E(P, Q)}{E(Q)}.$$

Now we are ready to consider maximum and minimum values of the coefficients. The maximum value, for instance, for $\theta_1$ will be when E(P|Q) achieves its own minimum value, which is zero. Hence, the maximum value for $\theta_1$ (and similarly for $\theta_2$) is 1. For $\theta_2$, for example, it will happen when

$$E(Q|P) = -\sum_{i,j} r_{i,j} \ln(r_{i,j}/p_i) = 0.$$

This can be only if $r_{i,j} \ln(r_{i,j}/p_i) = 0$ for any $i, j$, so that any cell proportion should be either zero or equal to the row total $p_i$ (for $\theta_1$ the corresponding condition involves the column totals $q_j$). This result is the same as for the corresponding Goodman and Kruskal measure.

To obtain the minimum value of $\theta_1$ and $\theta_2$, we need to know the maximum value of E(P, Q). It is proved in Appendix G that E(P, Q) ≤ E(P) + E(Q) (so that the average amount of uncertainty for the cell distribution cannot exceed the sum of average uncertainty amounts for the marginal distributions), with the equality being achieved if and only if the variables are independent. Consequently, the minimum value for both coefficients is zero and is achieved only when the variables are totally independent, as with Cramer’s coefficient.
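The inequality can be checked numerically on small examples. In the sketch below, the marginal distributions and the ‘dependent’ table are made up for illustration (both tables share the same marginals), and the helper names are assumptions.

```python
import math

def entropy(probs):
    """Sum of -z * ln(z), skipping zero-probability terms."""
    return sum(-z * math.log(z) for z in probs if z > 0)

def entropy_gap(r):
    """Return E(P) + E(Q) - E(P, Q) for a table of cell proportions r."""
    e_p = entropy([sum(row) for row in r])
    e_q = entropy([sum(col) for col in zip(*r)])
    e_pq = entropy([cell for row in r for cell in row])
    return e_p + e_q - e_pq

p = [0.2, 0.5, 0.3]     # hypothetical row marginals
q = [0.6, 0.4]          # hypothetical column marginals

# Independent table: every cell is the product of its marginals.
independent = [[pi * qj for qj in q] for pi in p]
# A dependent table with the same marginals.
dependent = [[0.20, 0.00],
             [0.25, 0.25],
             [0.15, 0.15]]

gap_independent = entropy_gap(independent)
gap_dependent = entropy_gap(dependent)
print("independent:", gap_independent)
print("dependent:  ", gap_dependent)
```

The gap is the common numerator of both coefficients, so a gap of (numerically) zero for the independent table corresponds to both coefficients being zero.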

Finally, a symmetric coefficient can be obtained if we take an average of $\theta_1$ and $\theta_2$. It could be a ‘simple’ arithmetic average $(\theta_1 + \theta_2)/2$, or the average could use different weights for $\theta_1$ and $\theta_2$ depending on what is more appropriate for a particular situation. It is sometimes recommended to use the following formula for a symmetric coefficient:

$$\frac{E(P) + E(Q) - E(P, Q)}{E(P) + E(Q)}.$$

However, the problem with this coefficient is that it does not have a ‘canonical’ maximum value in the sense that it is not clear for which tables the maximum value is achieved. Indeed, the ‘absolute’ maximum value would obviously be when E(P, Q) = 0, which means that $r_{i,j} \ln(r_{i,j}) = 0$ for any pair $(i, j)$. Therefore, any cell probability would have to be either 0 or 1, which is impossible if we assume that all marginal probabilities are non-zero. Hence, the maximum value must be less than 1 and there is no clear definition of tables for which it will be achieved. Notice also that this coefficient is not an average of $\theta_1$ and $\theta_2$; it is smaller than both of them.
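A small numerical comparison may make the last two remarks more concrete. The sketch below uses two hypothetical tables of cell proportions; the function name `coefficients` and the label ‘pooled’ for the symmetric formula are assumptions introduced here.

```python
import math

def entropy(probs):
    """Sum of -z * ln(z), skipping zero-probability terms."""
    return sum(-z * math.log(z) for z in probs if z > 0)

def coefficients(r):
    """Return (theta1, theta2, pooled) for a table of cell proportions r."""
    e_p = entropy([sum(row) for row in r])
    e_q = entropy([sum(col) for col in zip(*r)])
    e_pq = entropy([cell for row in r for cell in row])
    shared = e_p + e_q - e_pq              # common numerator of all three
    return shared / e_p, shared / e_q, shared / (e_p + e_q)

partial = [[0.30, 0.10, 0.05],
           [0.10, 0.25, 0.20]]             # hypothetical, moderately associated
perfect = [[0.5, 0.0],
           [0.0, 0.5]]                     # each variable determines the other

t1, t2, pooled = coefficients(partial)
pt1, pt2, ppooled = coefficients(perfect)
print("partial:", t1, t2, pooled)
print("perfect:", pt1, pt2, ppooled)
```

In the second table both $\theta_1$ and $\theta_2$ equal 1, yet the symmetric formula gives only 0.5, because the denominator counts the shared uncertainty twice.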

5.1.4 Correspondence analysis

This is a potentially useful multivariate technique that analyses the ‘geometric structure’ of the relationship between rows and columns in a contingency table in which two categorical variables are cross-tabulated.


A simple direct measure of the relationship between a row category and a column category is the degree to which the intersection, the proportion of the sample with both the row and the column attribute, is greater or less than the product of the column and row marginal proportions. This is often expressed by indexing the actual size of the intersection cell on the product of the two marginal proportions. However, correspondence analysis takes into account information about how the row category interacts with the other column categories and how the column category interacts with the other row categories, and how these interact with each other. These other interactions may reinforce the effect seen in the intersection cell or weaken it.

Correspondence analysis uses aggregated data, in the form of a two-way contingency table, not the individual data items. This works because we are dealing with categorical data where for each cell the row attribute is either present or absent for each sample member, as is the column attribute. The size of the intersection cell and the two marginal totals thus allow a full distribution of the incidence of all four possible combinations of row and column attribute presence or absence to be determined.

From the contingency table aggregate figures a new set of unnamed variables (one less than the lower of the numbers of rows and columns) is postulated such that the first $k$ of these variables will ‘explain’ the original variables in the best possible way, for any $k$. The first variable postulated explains the greatest amount of the variation and subsequent variables explain decreasing amounts. Each variable category becomes a point on this set of dimensions that can be graphed.
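For readers who want a feel for the computation, here is a minimal sketch of one standard formulation of correspondence analysis, based on a singular value decomposition of the standardized residuals. It is not necessarily the procedure used to produce the results in this book. It assumes NumPy is available, and the function name and the table of counts are hypothetical.

```python
import numpy as np

def correspondence_analysis(counts):
    """Return (row_coords, col_coords, inertia) for a two-way table of counts."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                       # cell proportions
    r = P.sum(axis=1)                     # row marginal proportions
    c = P.sum(axis=0)                     # column marginal proportions
    expected = np.outer(r, c)             # proportions expected under independence
    S = (P - expected) / np.sqrt(expected)    # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1                  # number of postulated dimensions
    row_coords = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]
    col_coords = (Vt.T[:, :k] * sv[:k]) / np.sqrt(c)[:, None]
    inertia = sv[:k] ** 2                 # variation 'explained' by each dimension
    return row_coords, col_coords, inertia

table = [[20, 15, 10, 5],
         [10, 20, 15, 10],
         [5, 10, 20, 25]]                 # hypothetical 3 x 4 table
rows, cols, inertia = correspondence_analysis(table)
print("share of variation per dimension:", inertia / inertia.sum())
print("row points:\n", rows)
print("column points:\n", cols)
```

With three rows and four columns there are two dimensions, ordered by decreasing share of the variation. Plotting the first two columns of `rows` and `cols` on common axes gives the kind of map discussed below.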

It is beyond the scope of this book to discuss the mathematics of correspondence analysis (which requires a certain level of mathematical knowledge). Instead, this technique will be illustrated with an example. For technical details and theory see Greenacre [22, 23] and Hoffman and Franke [29].

Correspondence analysis provides a graphic summary in the form of plots that show the relationships between categories of the original variables. The plot shows a selected pair of the postulated variables. In practice the first two are normally used as they generally exhibit far more explanatory power than the remainder. The position of each row category and each column category is plotted on the x and y axes of the graph.

The interpretation of the plot is outwardly simple. The origin represents the ‘centre of gravity’ of the data. Categories plotted further away from the origin are individually more distinctive than those close to it. A category at the origin appears to have no net interaction with other categories.

Items plotted away from the centre in the same direction are taken to show some affinity. Items away from the centre in opposite directions have negative affinities. Items away from the centre in directions at right-angles are taken to have no affinity. As many items, from row and column categories, may be plotted in the single two-dimensional space some distortion is inevitable. Correspondence analysis purports only to give the best possible two-dimensional representation of a more complex multi-dimensional pattern.

Table 5.1. Input for correspondence analysis

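Columns: bank where respondent has a transaction account. For each party, the first row is the column percentage and the second row is the index.

| Federal vote, first preference | Total | ANZ | Bank of Melbourne | Commonwealth | National Australia | St. George | Westpac Bank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (unweighted) | 40172 | 5963 | 2105 | 16397 | 6859 | 3866 | 5457 |
| (population ’000) | 11940 | 1742 | 689 | 4993 | 2130 | 1165 | 1578 |
| Australian Labor Party | 40.0% | 35.7% | 39.4% | 44.1% | 34.6% | 40.0% | 37.7% |
| (index) | 100 | 89 | 98 | 110 | 86 | 100 | 94 |
| Liberal | 31.3% | 35.6% | 39.7% | 28.0% | 36.1% | 32.2% | 33.3% |
| (index) | 100 | 114 | 127 | 90 | 115 | 103 | 106 |
| Australian Democrats | 5.0% | 5.1% | 4.7% | 4.7% | 4.6% | 6.2% | 4.9% |
| (index) | 100 | 100 | 93 | 93 | 91 | 123 | 98 |
| Independent/Other | 4.5% | 4.9% | 3.8% | 4.1% | 4.6% | 4.8% | 4.5% |
| (index) | 100 | 107 | 85 | 89 | 101 | 106 | 99 |
| National Party | 3.3% | 3.7% | 1.6% | 2.8% | 5.2% | 2.1% | 4.3% |
| (index) | 100 | 113 | 50 | 85 | 159 | 64 | 132 |
| Greens | 3.1% | 3.0% | 3.0% | 3.2% | 2.8% | 2.9% | 2.4% |
| (index) | 100 | 97 | 97 | 102 | 89 | 93 | 77 |
| One Nation | 2.9% | 3.0% | 0.8% | 2.5% | 3.2% | 3.0% | 2.8% |
| (index) | 100 | 103 | 27 | 86 | 112 | 102 | 98 |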

Roy Morgan Single Source Australia, October 1999–September 2000. Filter: electors with a bank account.

The example chosen is taken from a real survey with a large sample. It illustrates the transition from contingency table to correspondence analysis. Table 5.1 shows the voting intention of registered electors who bank with each of Australia’s major banks. The figures shown are percentaged on the column totals, and the index figure is a simple measure of how the percentage in any column relates to the row-total percentage. Notice that it is not a requirement of correspondence analysis that the data be ‘complete’.
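The index figures can be approximately reproduced from the percentages. The sketch below assumes that the index is 100 times the column percentage divided by the ‘Total’ percentage for the same row; this reading is consistent with the table but is not stated explicitly in the text. It uses the Australian Labor Party row of Table 5.1, and the variable names are illustrative.

```python
# Australian Labor Party row of Table 5.1 (column percentages).
total_pct = 40.0
bank_pct = {
    "ANZ": 35.7,
    "Bank of Melbourne": 39.4,
    "Commonwealth": 44.1,
    "National Australia": 34.6,
    "St. George": 40.0,
    "Westpac": 37.7,
}
# Index figures as printed in Table 5.1, for comparison.
published_index = {
    "ANZ": 89,
    "Bank of Melbourne": 98,
    "Commonwealth": 110,
    "National Australia": 86,
    "St. George": 100,
    "Westpac": 94,
}

# Assumed definition: index = 100 * column percentage / total percentage.
computed_index = {bank: 100 * pct / total_pct for bank, pct in bank_pct.items()}

for bank, value in computed_index.items():
    print(f"{bank:20s} computed {value:6.1f}   published {published_index[bank]}")
```

Because the percentages in the table are themselves rounded, the recomputed figures can differ from the published ones by a fraction of a point (or slightly more in rows with small percentages).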

Table 5.1 is expressed graphically in correspondence analysis as shown in Figure 5.1.

The associations make immediate sense to anyone familiar with the Australian political scene. The traditionally working-class Labor Party is associated with the traditionally working-class Commonwealth Bank: the stereotype works. The Bank of Melbourne has a negative relationship with One Nation. This is not surprising. Most Bank of Melbourne customers live in Melbourne where support for One Nation is minimal. Other relationships can be explained in part by reference to geography. The National Party’s power base is Queensland, which is also National Australia Bank’s strongest State. One Nation, also Queensland based, has affinities with National Australia Bank and with Westpac, both of which are relatively strong in that State.

In document Statistics for Real-Life Sample Surveys (Page 196-200)