• No results found

Chapter 5: A Theoretical Overview of GDA Methods

5.4 The CA Biplot

5.4.1 CA Biplot Variants

The various types of CA biplots are discussed in great detail in Chapter 7 of Understanding Biplots (Gower et al., 2011:289). Provided below is an explanation of the several variants of the independence model that can be approximated in the CA biplot.

5.4.1.1 Pearson’s chi-squared:

By making use of the weighting the independence model, the contributions to the Pearson’s chi- squared statistic can be approximated. Thus following from equation 5-17 the model which must be minimised is:

β€–π‘Ήβˆ’Β½(𝑿 βˆ’π‘ΉπŸπŸβ€²π‘ͺ 𝑛 ) π‘ͺ

βˆ’Β½βˆ’ 𝑿̂‖2, (5-26)

where 𝑿̂ is the π‘˜-dimensional approximation of π‘Ήβˆ’Β½(𝑿 βˆ’π‘ΉπŸπŸβ€²π‘ͺ 𝑛 ) π‘ͺ

βˆ’Β½ (Gower et al., 2011:291). Note that the SVD of 𝑿̂, the π‘Ÿ-dimensional approximation of 𝑿, is written as π‘Όπ‘±πšΊπ‘½β€².The π‘Ÿ-dimensional approximation of the πœ’2 contributions requires plotting of the first π‘Ÿ columns of π‘ΌπšΊΒ½ and π‘½πšΊΒ½. Such biplots provide a display from which it is possible to identify the elements of 𝑿̂ which contribute most to the total inertia as well as which element diverge from the assumption of independence.

5.4.1.2 Approximating the deviations from independence:

This method seeks to approximate the deviations from independence, 𝑿 βˆ’ 𝑬, which are weighted by the inverse square roots of the row and column totals. This translates to the minimization of:

β€–π‘Ήβˆ’Β½{(𝑿 βˆ’π‘ΉπŸπŸβ€²π‘ͺ

𝑛 ) βˆ’ 𝑿̂} π‘ͺβˆ’Β½β€–

2

(5-27)

(Gower et al., 2011:291). Equation 5-27 is minimized by π‘Ήβˆ’Β½π‘ΏΜ‚π‘ͺβˆ’Β½= π‘ΌπšΊπ‘±π‘½β€². Solving for the π‘Ÿ- dimensional approximation of 𝑿 gives 𝑿̂ = π‘ΉΒ½π‘ΌπšΊπ‘±π‘½β€²π‘ͺΒ½. By plotting the first π‘Ÿ columns of π‘ΉΒ½π‘ΌπšΊΒ½ and π‘ͺΒ½π•πšΊΒ½ a biplot which represents the deviations away from independence is achieved.

5.4.1.3 Approximation to the contingency ratio:

The contingency ratio, as defined in Correspondence Analysis in Practice (Greenacre, 2007:101) is the ratio of observed proportions of a contingency table to the expected proportions. A CA biplot which approximates the contingency ratios is defined as the minimization of:

β€–π‘Ήβˆ’Β½{π‘Ήβˆ’1(𝑿 βˆ’π‘ΉπŸπŸβ€²π‘ͺ

𝑛 ) π‘ͺβˆ’1βˆ’ 𝑿̂} π‘ͺβˆ’Β½β€–

2 (5-28)

(Gower et al., 2011:292). In equation 5-28 the matrix 𝑿̂ approximates π‘Ήβˆ’1(𝑿 βˆ’π‘ΉπŸπŸβ€²π‘ͺ 𝑛 ) π‘ͺ

βˆ’1= π‘Ήβˆ’Β½π‘ΌπšΊπ‘½β€²π‘ͺβˆ’Β½. Note that a contingency ratio of unity is desirable, and the closer the value is to unity the closer the data are to independence (Greenacre, 2007:101). By plotting the first π‘Ÿ columns of π‘Ήβˆ’Β½π‘ΌπšΊΒ½ and π‘ͺβˆ’Β½π‘½πšΊΒ½ a biplot which depicts the deviation from the unity contingency ratio is obtained. Note that the biplot does not provide a direct deviation from unity but rather the deviation divided by 𝑛. This characteristic of the biplot has no material influence on the biplot but must be accounted for when calibrating the axes.

5.4.1.4 Approximation to the Chi-squared distance:

The independence assumption is evaluated by the πœ’2 statistic. For any two-way table, the assumption can be tested via the πœ’2 distance between rows or the πœ’2 distance between the columns of the table. Therefore, there are two possible solutions for when approximating the Chi-squared distances in a contingency table.

The chi-squared distances between the 𝑖th and 𝑖′th rows of 𝑿 are defined as: 𝑑𝑖𝑖2β€² = ( π’™π’Š π‘₯𝑖.βˆ’ π’™π’Šβ€² π‘₯𝑖′.) β€² π‘ͺβˆ’1(π’™π’Š π‘₯𝑖.βˆ’ π’™π’Šβ€² π‘₯𝑖′.) (5-29)

(Gower et al., 2011:293). Equation 5-29 is of the form of weighted Euclidean distances between the rows of 𝑿. The chi-squared distances are then calculated as the normal Euclidean distances between the rows of π‘Ήβˆ’1𝑿π‘ͺβˆ’Β½. This can also be expressed as π‘Ήβˆ’1𝑿π‘ͺβˆ’Β½βˆ’πŸπŸβ€²π‘ͺΒ½

𝑛 , where πŸπŸβ€²π‘ͺΒ½

𝑛 is the translation term. The translation term has no material effect on the chi-squared measures between rows of 𝑿 but translates the points about the centroid in the biplot. The translated chi-squared distances can then be written as:

π‘Ήβˆ’1(𝑿 βˆ’π‘ΉπŸπŸβ€²π‘ͺ

𝑛 ) π‘ͺβˆ’Β½= π‘Ήβˆ’1(𝑿 βˆ’ 𝑬)π‘ͺβˆ’Β½

(5-30)

(Gower et al., 2011:294). The biplot of the chi-squared distances between the rows of matrix 𝑿 is then based upon the minimization of:

‖𝑹½{π‘Ήβˆ’1(𝑿 βˆ’π‘ΉπŸπŸβ€²π‘ͺ 𝑛 ) π‘ͺ

βˆ’Β½βˆ’ 𝑿̂}β€–2, (5-31)

β€–π‘ΌπšΊπ‘½β€²βˆ’ 𝑹½𝑿̂‖2. (5-32) The approximation, 𝑿̂, is then obtained from the inner product of π‘Ήβˆ’Β½π‘ΌπšΊπ‘½β€² and the π‘Ÿ-dimensional biplot is obtained by plotting the first π‘Ÿ columns of π‘Ήβˆ’Β½π‘ΌπšΊ for the row points and 𝑽 for the column points (Gower et al., 2011:294). Note that 𝑽 is an orthogonal matrix and has no effect on the distances between the row coordinates.

Similar arguments follow for the chi-squared distances between the columns of 𝑿. The chi-squared distances between the 𝑗th and 𝑗′th columns of 𝑿 are defined as:

𝑑𝑗𝑗′2 = (𝒙𝑗 π‘₯.π‘—βˆ’ 𝒙𝑗′ π‘₯.𝑗′) β€² π‘Ήβˆ’1(𝒙𝑗 π‘₯.π‘—βˆ’ 𝒙𝑗′ π‘₯.𝑗′). (5-33)

This leads to a biplot which attempts to minimise: β€–{π‘Ήβˆ’Β½(𝑿 βˆ’π‘ΉπŸπŸβ€²π‘ͺ

𝑛 ) π‘ͺβˆ’1βˆ’ 𝑿̂} π‘ͺΒ½β€– 2

, (5-34)

which simplifies to:

β€–π‘ΌπšΊπ‘½β€²βˆ’ 𝑿̂π‘ͺΒ½β€–2. (5-35)

The approximation, 𝑿̂, is then obtained from the inner product of π‘ΌπšΊπ‘½β€²π‘ͺβˆ’Β½ and the π‘Ÿ-dimensional biplot is created by plotting the first π‘Ÿ columns of π‘ͺβˆ’Β½π‘½πšΊ for the row points and 𝑼 for the column points. Similarly, 𝑼 is also an orthogonal matrix which has no effect on the distances between the points representing the columns of 𝑿̂.

It is common practice to plot both the row and column chi-squared distances in one single plot, although the relationships between these two measures has no meaningful interpretation. This plot is simply achieved by plotting π‘Ήβˆ’Β½π‘ΌπšΊ and π‘ͺβˆ’Β½π‘½πšΊ simultaneously as two sets of points.

5.4.1.5 Canonical Correlation Approximation:

Canonical correlation is associated with the optimal CA solution. The optimal solution can be viewed as a function of the correlation between the rows and columns of the two-way table. When the CA solution is aimed to maximize the correlation, then the name of the corresponding correlation for the optimal solution is known as canonical correlation. In this variant of the CA biplot there will be a search for the optimal row and column scores such that the correlation between rows and columns is maximised. Due to its lengthy derivation, the explanation of said method is not provided for. Rather, the reader is referred to Chapter 7 of Understanding Biplots for a detailed explanation (Gower et al., 2011:296).

5.4.1.6 Approximating the row profiles:

The final variant of the CA biplot is centred upon the row profiles, π‘Ήβˆ’1𝑿, of the two-way table. Similarly, the approximation of the column profiles can be achieved by simply replacing the input matrix with the transpose of 𝑿. In this case the suggested biplot model is:

‖𝑹½{π‘Ήβˆ’1(𝑿 βˆ’π‘ΉπŸπŸβ€²π‘ͺ

𝑛 ) βˆ’ 𝑿̂} π‘ͺ

βˆ’Β½β€–2, (5-36)

and the approximation 𝑿̂ takes the form of π‘Ήβˆ’Β½π‘ΌπšΊπ‘½β€²π‘ͺΒ½ (Gower et al., 2011:298). The accompanying biplot is obtained by plotting the first π‘Ÿ columns of π‘Ήβˆ’Β½π‘ΌπšΊ for the rows and π‘ͺ½𝑽 for the columns. The distances expressed in this biplot are chi-squared distances, and the points for the columns provide axes for the row profile points. Here the row profiles are not the pure row profiles but rather the deviations from the marginal row profile, 𝟏

β€²π‘ͺ 𝑛 .

Related documents