
A PCA biplot of a data set whose samples are structured into a number of predefined groups can unfortunately provide only limited information about the group structure underlying that data set. There are two reasons for this: (1) the group membership of the samples plays no role in the construction of the PCA biplot and (2) Pythagorean distance does not take the correlations between the measured variables into account. However, different plotting characters and/or different colours can be used to represent samples belonging to different groups in order to highlight potential differences between the groups. To visualise the central location of each group, the group means can be interpolated onto the PCA biplot (see Section 2.7.1.1). Superimposing an α-bag for each group may also add some information about the group structure underlying the data set, since it allows visual appraisal of the amount of overlap and/or separation amongst the groups. When the number of samples in a particular group is too small for the construction of an α-bag, a convex hull can be constructed for that group instead.
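The convex hull mentioned above can be sketched in a few lines. The following is a minimal illustration only, written in Python rather than with the 'UBbipl' package, and it uses Andrew's monotone chain algorithm as one of several possible ways of computing a hull. The function names and the seven two-dimensional points are hypothetical and are not taken from the Ocotea data set.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it does not make a strict left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


# Hypothetical biplot coordinates of a small group of seven samples.
small_group = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0),
               (1.0, 1.0), (1.0, 0.5), (0.5, 1.5)]
hull = convex_hull(small_group)
print(hull)
```

Only the four corner points are retained as hull vertices; the three interior samples do not influence the shape of the hull.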

Table 2.4: The standard deviations of the measured variables of the Ocotea data set.

VesD     VesL     FibL      RayH     RayW    NumVes
24.53    89.48    214.05    68.39    7.13    5.23

[Figure 2.8, panels (a) and (b): PCA biplots with calibrated axes VesD, VesL, FibL, RayH, RayW and NumVes; samples numbered 1 to 37; legend: O. bullata, O. porosa, O. kenyensis.]

Figure 2.8: (a) The two-dimensional predictive PCA biplot of the Ocotea data set with 95% bags constructed for O. bullata and O. porosa and a convex hull constructed for O. kenyensis; (b) The two-dimensional predictive PCA biplot of the Ocotea data set with 50% bags constructed for O. bullata and O. porosa and a convex hull constructed for O. kenyensis.

The foregoing concepts will now be illustrated by means of the Ocotea data set, which is available in the R package ‘UBbipl’. The Ocotea data set contains information on samples from three species of the Lauraceae family, namely Ocotea bullata (O. bullata), Ocotea kenyensis (O. kenyensis) and Ocotea porosa (O. porosa). The species O. bullata and O. kenyensis are indigenous to South Africa. The species O. porosa, on the other hand, is an imported wood used as a substitute for O. bullata in the manufacturing of high-quality furniture. The data set contains 20 samples belonging to O. bullata, seven samples belonging to O. kenyensis and ten samples belonging to O. porosa. It provides measurements of the samples on six variables, namely tangential vessel diameter in µm (VesD), vessel element length in µm (VesL), fibre length in µm (FibL), ray height in µm (RayH), ray width in µm (RayW) and number of vessels per square mm (NumVes). The standard deviations of the six measured variables are provided in Table 2.4. Since the measured variables have greatly differing standard deviations, the two-dimensional predictive PCA biplots of the Ocotea data set provided in Figure 2.8(a) and Figure 2.8(b) were constructed from the standardised measurements of the Ocotea data set.
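The need for standardisation can be made concrete with the standard deviations reported in Table 2.4. The sketch below, written in Python purely for illustration, computes the share that each variable contributes to the sum of the six variances. The reasoning that a PCA of unstandardised measurements would be dominated by the variables with the largest variances is an interpretation added here, not a result reported in the text.

```python
# Standard deviations as reported in Table 2.4.
sd = {"VesD": 24.53, "VesL": 89.48, "FibL": 214.05,
      "RayH": 68.39, "RayW": 7.13, "NumVes": 5.23}

# Variance of each variable and its share of the summed variances.
variances = {name: s ** 2 for name, s in sd.items()}
total = sum(variances.values())
share = {name: v / total for name, v in variances.items()}

for name, value in sorted(share.items(), key=lambda item: -item[1]):
    print(f"{name:7s} {value:8.4f}")

# After standardisation every variable has unit variance, so each of the
# six variables contributes the same share, 1/6, to the total.
standardised_share = 1 / len(sd)
```

FibL alone accounts for roughly three quarters of the summed variances, whereas RayW and NumVes together contribute well under one percent. This is the imbalance that standardisation removes.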

To allow for visual appraisal of the extent to which the three groups overlap, 95% bags were superimposed onto the biplot in Figure 2.8(a) for the O. bullata and O. porosa species, while a convex hull was superimposed for the O. kenyensis species (due to the small number of samples belonging to this species). Figure 2.8(b) contains the same PCA biplot as Figure 2.8(a), except that 50% bags instead of 95% bags are superimposed for the O. bullata and O. porosa species. Different plotting characters as well as different colours were used to represent the samples (and centroids) of the three species to ease visualisation of the differences between them: solid black squares for the samples belonging to O. bullata, solid red triangles for those belonging to O. kenyensis and solid green circles for those belonging to O. porosa. To visualise the central locations of the three species, the centroid of the samples corresponding to each species was interpolated onto the PCA biplot. The centroid of the samples belonging to O. bullata is indicated with an open black square, that of O. kenyensis with an open red triangle and that of O. porosa with an open green circle.
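Interpolating a group centroid can be sketched as follows, assuming that interpolation onto the PCA biplot takes the linear form z = xV, where x is a centred (and here standardised) sample and V holds the first two eigenvectors as columns. Because this map is linear, interpolating the group mean gives the same point as averaging the interpolated samples of that group. The matrix V and the three samples below are hypothetical values chosen for easy arithmetic; they are not derived from the Ocotea data set.

```python
def interpolate(x, V):
    """Map a p-vector x to two-dimensional display coordinates z = x V (V is p x 2)."""
    return tuple(sum(x[j] * V[j][k] for j in range(len(x))) for k in range(2))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


# Hypothetical p = 3 example: the two columns of V are orthonormal.
V = [(1.0, 0.0), (0.0, 0.6), (0.0, 0.8)]
group = [(1.0, 2.0, 0.0), (3.0, 0.0, 1.0), (2.0, 1.0, 2.0)]

# Route 1: compute the group mean first, then interpolate it.
centroid_then_map = interpolate(mean_vector(group), V)
# Route 2: interpolate every sample, then average the display coordinates.
map_then_centroid = mean_vector([interpolate(x, V) for x in group])

print(centroid_then_map, map_then_centroid)
```

Both routes give the same display point up to floating-point rounding, so the plotted centroid is also the centre of gravity of the plotted samples of that group.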

The 95% bags of O. bullata and O. porosa overlap substantially, indicating that these two species are likely to be very similar with respect to at least some of the measured variables. The fact that even the 50% bags of these two species overlap provides further evidence of their similarity. To determine with respect to which variables these two species are similar and with respect to which they differ, their overlap with respect to the individual biplot axes needs to be investigated. It is very important to note that the mere overlap of two groups with respect to a biplot axis is not sufficient evidence to conclude that the two groups are similar with respect to the corresponding variable. The overlap of two groups only suggests a possible similarity between them. In addition to the extent of overlap between the groups with respect to the biplot axis, the ability of the biplot axis to reproduce the true measurements of the samples on that variable needs to be considered. A measure of the predictive ability of PCA biplot axes will be studied in Section 3.4.1.

Both the 95% bags and the 50% bags of O. bullata and O. porosa overlap on each of the individual biplot axes. This indicates that the O. bullata and O. porosa species are probably quite similar with respect to each of the six measured variables. In addition to the overlap between groups with respect to the individual biplot axes, the overlap between groups with respect to pairs (or sets) of biplot axes should be considered. One possible difference between the O. bullata and O. porosa species that is suggested by the 50% bags in Figure 2.8(b) is that samples belonging to O. porosa that have measurements on NumVes and FibL similar to those of samples belonging to O. bullata tend to have greater measurements on RayW, VesL, VesD and RayH than the samples belonging to O. bullata.

There is no overlap between the O. bullata and O. porosa 95% bags and the convex hull of the O. kenyensis species. This indicates that the O. kenyensis species is probably very different from the O. bullata and O. porosa species with respect to at least some of the measured variables. This does not, however, imply that the O. kenyensis species differs from the O. bullata and O. porosa species with respect to all six measured variables. Projection of the two 95% bags and the convex hull onto the biplot axis representing the variable RayW shows a great extent of overlap of the three species with respect to this variable. This is also true for the 50% bags of the O. bullata and O. porosa species and the convex hull of the O. kenyensis species in Figure 2.8(b). This indicates that the three species are likely to be very similar with respect to the variable RayW. On the other hand, not even the 95% bags of O. bullata and O. porosa overlap with the convex hull of O. kenyensis with respect to the biplot axes representing the variables FibL and NumVes, suggesting that the O. bullata and O. porosa species probably differ from the O. kenyensis species to a great extent with respect to these two variables. The biplot also seems to suggest that samples belonging to O. kenyensis that have measurements on RayW, VesL, VesD and RayH similar to those of samples belonging to O. bullata and O. porosa tend to have greater measurements on FibL and smaller measurements on NumVes than the samples belonging to O. bullata and O. porosa. However, as explained before, no conclusions can be made with certainty prior to considering the predictive abilities of the two biplot axes.
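The projection of bags and hulls onto a biplot axis, as used in the preceding paragraphs, can be sketched as follows. The example assumes that a biplot axis is a straight line through the origin of the display and that values are read off by orthogonal projection onto it. The function names, the two rectangular groups and the two axis directions are hypothetical and serve only to show that two groups can be completely separated in the plane while still overlapping on one axis.

```python
import math


def project_onto_axis(points, direction):
    """Signed positions of the orthogonal projections of 2-D points on an axis through the origin."""
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    return [x * ux + y * uy for x, y in points]


def projected_interval(points, direction):
    """Smallest interval on the axis that contains all projected points."""
    positions = project_onto_axis(points, direction)
    return min(positions), max(positions)


def interval_overlap(a, b):
    """Length of the overlap of two closed intervals (0.0 if they are disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


# Hypothetical vertices of the bag or hull of two groups in the display plane.
group_a = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
group_b = [(1.0, 3.0), (3.0, 3.0), (3.0, 4.0), (1.0, 4.0)]

# Hypothetical directions of two biplot axes.
axis_1 = (1.0, 0.0)
axis_2 = (0.0, 1.0)

overlap_1 = interval_overlap(projected_interval(group_a, axis_1),
                             projected_interval(group_b, axis_1))
overlap_2 = interval_overlap(projected_interval(group_a, axis_2),
                             projected_interval(group_b, axis_2))
print(overlap_1, overlap_2)
```

The two groups do not intersect in the plane, yet their projections share an interval on the first axis and are disjoint on the second. This mirrors the situation described above, where the groups overlap with respect to one variable and are separated with respect to others. As the text stresses, such overlap is only suggestive until the predictive ability of the axis has been taken into account.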

It should once again be emphasised that the PCA biplot is not designed to represent the underlying group structure of a data set. If visualisation of the group structure of a data set is the main interest of the investigator, then the data set should rather be represented by means of a CVA biplot which is designed specifically for this purpose.