Principal component analysis - Statistical methods

4. Data and methods

4.1 Statistical methods

4.1.4 Principal component analysis

When dealing with a large number of variables, a principal component analysis can gauge the redundancy of a dataset and summarize several variables into a smaller number of components. The method starts with a correlation matrix between all the variables and consists in identifying which ones are strongly correlated with each other, in order to create a new set of variables (components) that replaces them. The purpose of the method is to make a complex dataset easier to comprehend by compressing it into fewer components than there were initial variables.

This principle is particularly efficient when variables strongly correlate. Let us imagine a questionnaire which asks respondents to agree or disagree with several statements by giving a score ranging from 1 (strongly disagree) to 5 (strongly agree). The following five statements are presented:

1. You prefer to stay at home over going on a night out.
2. You can easily talk to people that you don’t know.
3. You like raclette.
4. You are embarrassed to sing karaoke with your friends.
5. You like fondue.

It is very apparent that there are two sets of questions that are related. On the one hand, statements (1), (2) and (4) relate to social settings and introvert/extrovert behaviours, and we can assume that respondents' answers to these three questions will be closely related. On the other hand, the other two relate to liking a certain type of cheese-based meal, and the assumption is that someone who likes raclette will also be likely to like fondue. A principal component analysis can investigate whether responses to (1), (2) and (4) tend to be correlated together, while (3) and (5) are also correlated together, but that the two sets do not necessarily correlate with each other. The principal component analysis can show that (1), (2) and (4) are well summarized by one component, while (3) and (5) can be summarized by another. Researchers can name and interpret the components as they like.

Let us now look at this situation from a more practical point of view and consider actual numbers. Fifteen imaginary participants responded to the five statements and their imaginary results are reported in Table 13. A principal component analysis provides a lot of output to analyse and only the outputs used in chapter 5 are going to be discussed here. The first and most basic element is the initial correlation matrix (Table 14), which is itself not an output of the principal component analysis but rather its starting point. A correlation matrix is symmetrical, since the correlation between variable Q1 and variable Q2 is the same as the correlation between variable Q2 and variable Q1. This is why only one half of the matrix is filled with the correlation values, which makes it easier to read, as appears from Table 14. Furthermore, the correlation of a variable with itself is one, as can be seen in the diagonal of the matrix. In absolute value, correlations range from zero (no correlation) to one (perfect correlation), and the sign indicates whether the relationship is positive or negative. For example, Q1 and Q2 have a strong negative correlation (-0.827), which means that people who agree with statement (1) tend to disagree with statement (2). The strongest correlations in Table 14 are in bold and one can already observe that Q1, Q2 and Q4 are strongly correlated together, while Q3 and Q5 are also strongly correlated together.

Participant ID    Q1   Q2   Q3   Q4   Q5
Participant_1      1    5    1    2    2
Participant_2      5    2    2    5    2
Participant_3      1    5    1    2    1
Participant_4      1    4    2    1    1
Participant_5      3    1    5    4    5
Participant_6      4    3    4    3    5
Participant_7      4    1    5    5    4
Participant_8      4    2    4    4    4
Participant_9      5    2    5    4    4
Participant_10     5    1    3    5    3
Participant_11     1    4    1    2    1
Participant_12     4    1    2    5    1
Participant_13     1    4    5    3    5
Participant_14     2    5    5    2    5
Participant_15     4    1    2    4    2

Table 13. Participant responses to each statement, with scores ranging from 1 to 5.
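As an illustration of how the starting point of the analysis is obtained, the following minimal sketch computes the correlation matrix from the responses in Table 13. It assumes Python with NumPy, which is not the software used in the original analysis; the variable names `responses` and `corr` are chosen here for illustration only.

```python
import numpy as np

# Responses from Table 13: one row per participant, columns Q1 to Q5.
responses = np.array([
    [1, 5, 1, 2, 2],
    [5, 2, 2, 5, 2],
    [1, 5, 1, 2, 1],
    [1, 4, 2, 1, 1],
    [3, 1, 5, 4, 5],
    [4, 3, 4, 3, 5],
    [4, 1, 5, 5, 4],
    [4, 2, 4, 4, 4],
    [5, 2, 5, 4, 4],
    [5, 1, 3, 5, 3],
    [1, 4, 1, 2, 1],
    [4, 1, 2, 5, 1],
    [1, 4, 5, 3, 5],
    [2, 5, 5, 2, 5],
    [4, 1, 2, 4, 2],
])

# rowvar=False tells NumPy that each column (not each row) is a variable.
corr = np.corrcoef(responses, rowvar=False)

# Print the full symmetric matrix rounded to three decimals.
print(np.round(corr, 3))
```

The printed matrix is the full symmetric version of the half-filled matrix discussed in the text, so each off-diagonal value appears twice.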

       Q1       Q2       Q3       Q4       Q5
Q1     1        -0.827   0.317    0.866    0.237
Q2              1        -0.280   -0.891   -0.133
Q3                       1        0.296    0.924
Q4                                1        0.192
Q5                                         1

Table 14. Correlation matrix of the dataset from Table 13. Strongest correlations in bold.

We may now move to the actual results of the principal component analysis. The first output is presented in Table 15 and shows the percentage of variance explained by number of components. What this output shows is how well the dataset can be summarized by a certain number of components. Here, 60.895% of the information (variance) in the dataset is still retained if, instead of all five variables, only one component is used. Similarly, if two components are used, then 93.163% of the information in the dataset is retained. In short, using two components gives almost the same information as using all five variables. This is a form of data compression. The rest of the table shows the percentages for three, four and five components. It stops at five components since this is equal to the number of variables at the start.

In a principal component analysis, one needs to decide how many components to retain. There are several ways to make this decision, and statistical software such as SPSS offers a variety of options. The default option is to retain only the components whose eigenvalues are greater than one. Eigenvalues are obtained by a process called spectral decomposition, which operates on the correlation matrix. This process will not be explained in detail here, as it is a rather complex algorithm and is outside the scope of this discussion. What matters is that in the present case, this criterion leads to the selection of two components. Deciding how many components to keep can also be done intuitively in the present case, since adding a third component only adds 3.602% of information, which is much lower than the previous values.
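The selection step can be sketched numerically as follows. This is only an illustration under stated assumptions: it uses NumPy's `eigvalsh` as a stand-in for whatever routine the statistical software applies internally, and it starts from the correlations of Table 14, which are rounded to three decimals, so the results may differ slightly from Table 15 in the last digit. The names `explained`, `cumulative` and `n_retained` are introduced here for illustration.

```python
import numpy as np

# Correlation matrix from Table 14, filled in symmetrically.
corr = np.array([
    [ 1.000, -0.827,  0.317,  0.866,  0.237],
    [-0.827,  1.000, -0.280, -0.891, -0.133],
    [ 0.317, -0.280,  1.000,  0.296,  0.924],
    [ 0.866, -0.891,  0.296,  1.000,  0.192],
    [ 0.237, -0.133,  0.924,  0.192,  1.000],
])

# eigvalsh handles symmetric matrices and returns eigenvalues in
# ascending order, so they are reversed to start with the largest.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Each eigenvalue divided by their sum (the number of variables)
# gives the share of variance explained by that component.
explained = 100 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)

# Retention rule described in the text: keep eigenvalues greater than one.
n_retained = int(np.sum(eigenvalues > 1.0))

for k, (ev, pct, cum) in enumerate(zip(eigenvalues, explained, cumulative), 1):
    print(f"component {k}: eigenvalue={ev:.3f}  %={pct:.3f}  cumulative={cum:.3f}")
print("components retained:", n_retained)
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue greater than one means that the component accounts for more variance than a single original variable would.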

In chapter 5, two components will generally be retained because this makes it possible to give a clear visual representation of the dataset, as illustrated in Figure 2. This representation is based on Table 16, which is a component matrix (also called a saturation matrix). It shows how the components relate to the initial variables and its values are read like correlations (from zero to one in absolute value, with a positive or negative sign). For instance, Table 16 shows that the first component mostly represents Q1, Q2 and Q4 (values in bold), while the second mostly represents Q3 and Q5 (values in bold). This is how one can decide to name component 1 “social profile” and component 2 “cheese affinity”. Communalities indicate whether a given variable is well-represented by the two components together. In the current situation, all variables are well-represented, as the smallest communality is 0.886.

Component    Initial eigenvalues
             Total    % of variance explained    % cumulated
1            3.045    60.895                     60.895
2            1.613    32.268                     93.163
3            0.180    3.602                      96.765
4            0.105    2.099                      98.864
5            0.057    1.136                      100

Table 15. Percentage of variance explained by number of components.

Regarding the interpretation of the visual representation in Figure 2, what is of interest are the angles that a pair of variables forms with the origin (center) of the graph. Right angles (90°) tend to correspond to correlations close to zero. Small angles (<45°) correspond to positive correlations, while wide angles (>135°) correspond to negative correlations. Variables Q3 and Q5 form a very small angle and are therefore strongly positively correlated. The same can be said about variables Q1 and Q4. On the other hand, Q2 forms a very large angle with Q1 and Q4, which means that it is strongly negatively correlated with them.
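The reading of angles described above can be made concrete with a small sketch. It assumes that each variable is plotted at the coordinates given by its two values in the component matrix (Table 16), and it uses Python with NumPy; the helper `angle_between` and the dictionary `loadings` are hypothetical names introduced here for illustration.

```python
import numpy as np

# Coordinates of each variable on components 1 and 2, copied from Table 16.
loadings = {
    "Q1": (0.893, -0.297),
    "Q2": (-0.874, 0.382),
    "Q3": (0.628, 0.755),
    "Q4": (0.900, -0.349),
    "Q5": (0.529, 0.830),
}


def angle_between(a, b):
    """Angle in degrees at the origin between the points a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against values marginally outside [-1, 1] due to rounding.
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))


for first, second in [("Q3", "Q5"), ("Q1", "Q4"), ("Q1", "Q2"), ("Q2", "Q4")]:
    angle = angle_between(loadings[first], loadings[second])
    print(f"{first}-{second}: {angle:.1f} degrees")
```

The cosine of the angle behaves like a correlation: it is close to 1 for small angles, close to 0 near a right angle, and close to -1 for wide angles. This matches the correlations in Table 14 only approximately, since two components do not retain all the information in the dataset.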

Figure 2. Component plot, using two components

      Component 1    Component 2    Communalities
Q1    0.893          -0.297         0.886
Q2    -0.874         0.382          0.910
Q3    0.628          0.755          0.963
Q4    0.900          -0.349         0.931
Q5    0.529          0.830          0.968

Table 16. Component matrix and communalities. The left columns show how strongly each variable relates to each component. The communalities column shows the proportion of variance of each variable that is accounted for by the two retained components.
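The link between the component matrix and the communalities can be sketched as follows. The sketch assumes an unrotated solution in which each column of the component matrix is an eigenvector of the correlation matrix scaled by the square root of its eigenvalue, computed here with NumPy's `eigh` from the rounded values of Table 14. The sign of an eigenvector is arbitrary, so a whole column may come out with the opposite sign to Table 16; the communalities are unaffected by this.

```python
import numpy as np

# Correlation matrix from Table 14, filled in symmetrically.
corr = np.array([
    [ 1.000, -0.827,  0.317,  0.866,  0.237],
    [-0.827,  1.000, -0.280, -0.891, -0.133],
    [ 0.317, -0.280,  1.000,  0.296,  0.924],
    [ 0.866, -0.891,  0.296,  1.000,  0.192],
    [ 0.237, -0.133,  0.924,  0.192,  1.000],
])

# eigh returns eigenvalues in ascending order with matching eigenvector
# columns, so both are reordered to start with the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep two components and scale each eigenvector by sqrt(eigenvalue).
n_components = 2
loadings = eigenvectors[:, :n_components] * np.sqrt(eigenvalues[:n_components])

# Communality of a variable: sum of its squared values across the
# retained components.
communalities = (loadings ** 2).sum(axis=1)

for name, row, h2 in zip(["Q1", "Q2", "Q3", "Q4", "Q5"], loadings, communalities):
    print(f"{name}: {row[0]: .3f} {row[1]: .3f}  communality={h2:.3f}")
```

For example, the communality of Q1 in Table 16 can be checked by hand: 0.893² + (-0.297)² is approximately 0.886.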

While the situation presented here is rather clear, real datasets tend to have components which are less easily interpretable. The main interest of principal component analysis for the datasets presented in chapter 5 is to determine to what extent the variables correlate. It is important to know, for example, if collocate and colligate diversities correlate with frequency, and if so, to what extent. Indeed, there is a difference between weak and strong correlations. The percentage of variance explained by a number of components (Table 15) is a good way to evaluate redundancy in a dataset, as it can show how many components are required to reach a similar amount of information.

A further note is that researchers often apply principal component analysis in order to then use the components themselves as substitutes for the variables. This is not the preferred solution in the subsequent studies because components, as shown above, may summarize several variables together, but not always in a clear-cut way. For instance, Q3 plays a role in both components one and two at the same time, which makes its individual contribution hard to isolate. Therefore, principal component analysis is primarily used in the subsequent studies to better understand the relationships between the variables themselves, instead of using the components as substitutes for the variables.

In document: Measurements of grammaticalization: developing a quantitative index for the study of grammatical change (pp. 95-100)