CLUSTER ANALYSIS


2.6. CLUSTER ANALYSIS

DeLacy, 1994).

Many of these approaches aim to separate the influences of main effects and G x E interaction. It is well known that ignoring G x E interaction invalidates comparisons of averages or main effects. Finding groups of genotypes or environments so that the G x E matrix can be subdivided into submatrices that exhibit no G x E interaction will allow comparisons based on means (Lin, 1982). In a similar vein, Ivory et al. (1991) used bar graphs to compare genotype means for each group of environments, which has the advantage of displaying the best environmental grouping to which each genotype is specifically adapted. Rather than use a comparison of simple means, Lin et al. (1986) suggested using multiple range tests within groups found by clustering with a distance based on G x E interaction.

Yau (1991) criticized the approach of Lin (1982) for clustering genotypes on the basis of G x E interaction similarity because "the clustering of widely adapted with non-adapted lines is not acceptable to most plant breeders". The flaw in this statement is that it assumes that all genotypes grouped together are interchangeable, but this is not the case when the means are then compared. Lin (1982) advocated the comparison of genotypes within clusters on the basis of their mean performance, and in this instance, non-adapted genotypes would be exposed as inferior.
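The two-stage idea discussed above (cluster on interaction similarity, then compare means within clusters) can be illustrated with a minimal sketch. The distance used here, the mean squared difference between rows of the interaction residual table, is an assumption chosen for illustration and is not claimed to be the exact index of Lin (1982). The function names, the naive average-linkage routine, and the yields are all hypothetical.

```python
import numpy as np

def interaction_residuals(y):
    """Remove genotype and environment main effects from a complete G x E table."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean(axis=1, keepdims=True)
              - y.mean(axis=0, keepdims=True) + y.mean())

def interaction_distance(y):
    """Mean squared difference between genotype rows of the residual table."""
    r = interaction_residuals(y)
    diff = r[:, None, :] - r[None, :, :]
    return (diff ** 2).mean(axis=2)

def average_linkage(d, n_clusters):
    """Naive agglomerative clustering (average linkage) on a distance matrix."""
    clusters = [[i] for i in range(d.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = d[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical yields: rows are genotypes, columns are environments.
yields = np.array([
    [5.0, 6.0, 7.0],   # genotype 0: improves across environments
    [4.0, 5.0, 6.0],   # genotype 1: same pattern as genotype 0, lower mean
    [6.0, 5.0, 4.0],   # genotype 2: opposite pattern
    [7.0, 6.0, 5.0],   # genotype 3: same pattern as genotype 2, higher mean
])

groups = average_linkage(interaction_distance(yields), n_clusters=2)
for g in groups:
    # Within a cluster there is no interaction, so genotype means are comparable.
    print("cluster", g, "genotype means", yields[g].mean(axis=1))
```

In this toy table, genotypes 0 and 1 share a cluster although their means differ, and the within-cluster comparison of means then separates them, which is the point made in reply to Yau (1991).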

Ivory et al. (1991) used a 6 x 19 G x E matrix, while the matrix of Cooper and DeLacy (1994) was 15 x 10 in size. These authors used cluster analysis to classify the environments, but only in the case of Ivory et al. (1991) does this seem logical. It seems to have proved more effective to reduce the number of comparisons by reducing the longer dimension of the G x E matrix, as did Mungomery et al. (1974), who clustered the 58 genotypes in their scenario. Subsequent use of ordination to show how the clusters differ in a two-dimensional display has been named 'pattern analysis'.

Cooper and DeLacy (1994) reduced their 15 x 10 G x E matrix to a 5 x 5 matrix and subsequently plotted the means on an interaction plot. Their problem scenario could probably have been managed without clustering both genotypes and environments, but Byth et al. (1976) needed to group both genotypes and environments to reduce their 49 x 63 G x E matrix. They clustered genotypes and environments separately to get a 10 x 10 matrix, which they further reduced when plotting results on an interaction plot. These authors noted that greater benefit could be gained by combining the initial classification of genotypes and environments. This was later achieved by Corsten and Denis (1990) and then Baril et al. (1994).

As discussed, the performance of genotypes has commonly been investigated (after cluster analyses) using graphical techniques. Lin et al. (1986), when writing in support of cluster analysis, stated: "The advantage of a non-parametric approach is that a cultivar's response characteristics can be assessed qualitatively, without the need for a mathematical characterization." Byth et al. (1976) were able to define a model for their two-way classification that included terms for main effects, genotype grouping, environment grouping, differences for individual genotypes from genotype group effects, differences for individual environments from environment group effects, and the four interaction terms that were then available. Parameterization of the outcome of clustering is addressed further in Section 5.5.

Distance measures

All cluster analyses use a distance measure when forming clusters. The particular measure used differs according to the need of the researcher. For example, Abou-El-Fittouh et al. (1969) grouped environments in the U.S. cotton belt on the basis of the correlation coefficient, and Lin (1982) used a distance measure that allowed the cluster analysis to be compared to a two-way ANOVA.

Corsten and Denis (1990), in their simultaneous clustering of genotypes and environments, adjusted observed inter-genotype and inter-environment distances by the number of comparisons being made. They did, however, assume the matrix to be complete.

Ouyang et al. (1995) clustered locations using an incomplete G x E matrix. As with Corsten and Denis (1990), they used an average squared difference to form their distance matrix. Some inter-location distances were not available, and Ouyang et al. (1995) estimated these unobserved distances using the maximum of the observed distances. This could be done in situations like their data, but is not appropriate when the data are structured differently. The Ouyang et al. (1995) approach is considered further in Chapter 4 when the need to estimate unobserved distances in the principal data is discussed.
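The handling of unobserved distances described above can be sketched as follows. This is an illustration of the idea only, assuming raw yields, `np.nan` for untested genotype-location cells, and hypothetical function names; it is not a reproduction of the Ouyang et al. (1995) procedure in detail.

```python
import numpy as np

def location_distances(y):
    """Average squared difference between locations over shared genotypes.

    y has genotypes in rows and locations in columns, with np.nan marking
    untested cells. Pairs of locations with no genotype in common get np.nan.
    """
    y = np.asarray(y, dtype=float)
    n_loc = y.shape[1]
    d = np.full((n_loc, n_loc), np.nan)
    for j in range(n_loc):
        for k in range(n_loc):
            shared = ~np.isnan(y[:, j]) & ~np.isnan(y[:, k])
            if shared.any():
                d[j, k] = np.mean((y[shared, j] - y[shared, k]) ** 2)
    return d

def fill_with_max(d):
    """Replace unobserved distances with the largest observed distance."""
    filled = d.copy()
    filled[np.isnan(filled)] = np.nanmax(d)
    return filled

# Hypothetical incomplete table: locations 0 and 2 share no genotype.
yields = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 3.0, np.nan],
    [np.nan, 4.0, 6.0],
    [np.nan, 5.0, 5.0],
])

d = location_distances(yields)
filled = fill_with_max(d)
print(d)
print(filled)
```

Filling with the maximum treats every unobserved pair as maximally dissimilar, so such pairs are the last to be merged. That is a strong assumption when many pairs are unobserved, which is one way to read the caution in the text about differently structured data.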

Standardization and scaling

Standardization of variables is commonly performed in cluster analyses to remove any excessive weighting caused by differences in the variance of variables (Manly, 1994). Although the problem is less noticeable in G x E research, many transformations have been employed to ensure that clusters are formed on the desired biological basis.

DeLacy et al. (1990) compared four transformations that could be used within environments to scale the yields of genotypes. These four transformations were:

1. Centring within an environment by subtracting the environment mean. Ivory et al. (1991) used this transformation, later termed 'coding' by Cooper et al. (1993).

2. Standardizing the yields within an environment. Squared Euclidean distance measured using these data would result in a measure of the phenotypic correlation between environments (Cooper et al., 1993).

3. Ranking the yields within each environment, and then subtracting the average rank so that results are comparable across environments.

4. Scaling the ranks of within-environment yields by the within-environment standard deviation of the ranks, which is necessary when different numbers of genotypes are used in each environment across years.
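The four transformations listed above can be sketched for a complete table with genotypes in rows and environments in columns. The function names and the yields are hypothetical, ties in the ranks are assumed absent, and the sample standard deviation (`ddof=1`) is an assumed choice. The final lines check numerically that, under that choice, the squared Euclidean distance between two standardized environments equals $2(n-1)(1-r)$, where $r$ is their correlation and $n$ the number of genotypes.

```python
import numpy as np

def centre(y):
    """1. Subtract the environment (column) mean."""
    return y - y.mean(axis=0)

def standardize(y):
    """2. Centre, then divide by the within-environment standard deviation."""
    return centre(y) / y.std(axis=0, ddof=1)

def ranks(y):
    """Rank yields within each environment (1 = lowest); assumes no ties."""
    return y.argsort(axis=0).argsort(axis=0) + 1.0

def centred_ranks(y):
    """3. Ranks minus the average rank within each environment."""
    r = ranks(y)
    return r - r.mean(axis=0)

def standardized_ranks(y):
    """4. Centred ranks divided by the within-environment SD of the ranks."""
    r = ranks(y)
    return (r - r.mean(axis=0)) / r.std(axis=0, ddof=1)

# Hypothetical yields: four genotypes in two environments.
yields = np.array([
    [4.0, 2.0],
    [5.0, 3.5],
    [7.0, 3.0],
    [8.0, 6.0],
])

z = standardize(yields)
n = yields.shape[0]
d2 = np.sum((z[:, 0] - z[:, 1]) ** 2)
r = np.corrcoef(yields[:, 0], yields[:, 1])[0, 1]
print(d2, 2 * (n - 1) * (1 - r))  # the two values agree
```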

Cooper et al. (1993) discussed a further rank-based transformation that omits the centring of ranks within environments before scaling by the standard deviation of ranks. The centring of ranks is, however, not important when data are complete.

All of these transformations are aimed at determining which environments order genotype performances in a similar way. If environments are to be classified on these grounds, then the transformations most advocated would be standardization of yields (Fox and Rosielle, 1982; Cooper et al., 1993) or within-environment standardized ranks (Cooper et al., 1993). These suggestions amount to standardization of the observations being clustered, whereas standardizing variables when clustering observations is more common outside G x E research. Yau (1991) clearly stated the need to use within-environment standardization when clustering genotypes.

Yau (1991) also advocated the use of range transformation as an alternative to standardization, but noted its lack of availability in standard statistical packages. This transformation divides the yields of each G x E combination by the within-environment range and results in the adjusted data for every environment having a range of one.
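The range transformation as described here needs only a few lines. The sketch below assumes a complete table with genotypes in rows and environments in columns; the function name and yields are hypothetical.

```python
import numpy as np

def range_transform(y):
    """Divide each yield by the range of its environment (column)."""
    y = np.asarray(y, dtype=float)
    return y / (y.max(axis=0) - y.min(axis=0))

# Hypothetical yields: three genotypes in two environments.
yields = np.array([
    [2.0, 10.0],
    [4.0, 30.0],
    [6.0, 50.0],
])

scaled = range_transform(yields)
# Every environment now has a range of one.
print(scaled.max(axis=0) - scaled.min(axis=0))
```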

Cluster formation strategies

The vast majority of cluster analyses in the G x E literature are based on hierarchical agglomerative cluster formation strategies. There are many methods for forming clusters that fall under this broad umbrella, and little comparison of their worth to G x E analyses has been published. A notable exception to this is the work of Ramey and Rosielle (1983), who suggested a better method for forming genotype clusters than the method proposed in Lin (1982).

Other strategies have been devised for identifying homogeneous groups of observations, and have been reviewed in Anderberg (1973) and Everitt (1993). Lefkovitch (1985) described an approach for clustering genotypes on the basis of similarity of both mean and across-environment variation for each genotype using a conditional clustering method presented by Lefkovitch (1980). McLachlan and Basford (1988) show how mixture models can be employed to determine the memberships of a pre-determined number of clusters. Discussion of the clustering method presented by Moro and Denis (1997) is left until Section 2.8 because it is based on the dominance of one genotype over others, rather than on

38 CHAPTER 2. FUNDAMENTALS OF G x E ANALYSIS

Stopping criteria

Other than manual inspection of dendrograms, little attention has been given to the point at which clustering should be truncated. In some studies, the number of clusters has been chosen before clustering was commenced (Byth et al., 1976) , but two other broad strategies have appeared in the G x E literature.

Ghaderi (1980) suggested that the correct truncation point for clustering could be found using the among-cluster to within-cluster variance ratio of ANOVA, but this was criticized by Lin and Butler (1990) as laborious. The speed and power of modern computers makes this criticism somewhat redundant, so the technique merits further consideration. Corsten and Denis (1990) have also employed a distribution-based test in their simultaneous two-way clustering.
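A variance ratio of the kind referred to above can be computed for any candidate partition. The sketch below is a generic among-cluster to within-cluster mean square ratio, pooled over the columns of the data. Its exact form is an assumption and is not taken from Ghaderi (1980); the function name and data are hypothetical.

```python
import numpy as np

def variance_ratio(x, labels):
    """Among-cluster mean square divided by within-cluster mean square."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    grand = x.mean(axis=0)
    groups = np.unique(labels)
    ss_among = 0.0
    ss_within = 0.0
    for g in groups:
        members = x[labels == g]
        centroid = members.mean(axis=0)
        ss_among += len(members) * np.sum((centroid - grand) ** 2)
        ss_within += np.sum((members - centroid) ** 2)
    df_among = len(groups) - 1
    df_within = len(x) - len(groups)
    return (ss_among / df_among) / (ss_within / df_within)

# Hypothetical data: two obvious groups of two observations each.
x = np.array([
    [1.0, 2.0],
    [1.2, 2.1],
    [5.0, 6.0],
    [5.1, 6.2],
])

good = variance_ratio(x, [0, 0, 1, 1])   # partition matching the structure
poor = variance_ratio(x, [0, 1, 0, 1])   # partition cutting across it
print(good, poor)
```

Evaluating the ratio at each level of a dendrogram and keeping the level where it is largest (or where it first becomes significant) is one way such a rule could be applied.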

A simpler method was given by Baril et al. (1994), who introduced the term 'mean square decreasing method' in their two-way classification. If this strategy were converted to a one-way clustering scenario, clustering would continue while the change in the explained sum of squares exceeds the overall mean square of the entire data set. Baril et al. (1994) used a plot of explained sum of squares versus degrees of freedom to show the value of each step in the clustering process. Clustering was stopped when the line through a pair of consecutive points on this plot was parallel to the line joining the first and last points. This idea works well because a clustering process generally explains the sum of squares faster than it expends degrees of freedom at first, but this benefit diminishes throughout the process.
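One reading of this stopping rule is sketched below. The slope of the line joining the first and last points of the plot is the overall mean square, so the rule amounts to stopping at the first step whose own slope (gain in explained sum of squares per degree of freedom) no longer exceeds it. The function name, the ordering of the points from fewest to most degrees of freedom, and the numbers are assumptions for illustration, not values from Baril et al. (1994).

```python
import numpy as np

def stopping_point(df, ss):
    """Index of the last point reached before a step's slope drops to the chord slope.

    df and ss are cumulative degrees of freedom and explained sums of squares,
    ordered so that both increase along the sequence.
    """
    df = np.asarray(df, dtype=float)
    ss = np.asarray(ss, dtype=float)
    chord = (ss[-1] - ss[0]) / (df[-1] - df[0])   # overall mean square
    step_slopes = np.diff(ss) / np.diff(df)       # mean square of each step
    not_worthwhile = np.nonzero(step_slopes <= chord)[0]
    return int(not_worthwhile[0]) if not_worthwhile.size else len(df) - 1

# Hypothetical sequence with diminishing returns.
df = [0, 1, 2, 3, 4, 5]
ss = [0.0, 50.0, 80.0, 92.0, 97.0, 100.0]

stop = stopping_point(df, ss)
print(stop, df[stop], ss[stop])
```

Here the chord slope is 20 and the step slopes are 50, 30, 12, 5 and 3, so the third step is the first that is not worthwhile and the rule stops at the point with 2 degrees of freedom.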

Application to the principal data

The principal data of this investigation is so large (400 x 123) that some effort to reduce the number of comparisons that need to be made to answer the principal research question should be given priority. The ability of cluster analysis to determine groups of genotypes with similar specific adaptation can assist in meeting the ultimate objective of finding genotypes that suit tropical and subtropical locations. Cluster analysis is therefore potentially useful, despite the current paucity of methodology appropriate for handling incomplete data. An opportunity exists to develop new cluster analysis methodology capable of handling incomplete data.

Clustering of either (or both) genotypes and environments needs to be performed using a distance measure based on similarity of G x E interaction, so that interaction-free subsets of the original data can be found, thereby allowing simple comparisons based on mean performance. The distance measure of Ouyang et al. (1995), while capable of handling incomplete data, does not achieve this aim as it confounds main effect and interaction. Distance measures capable of handling missing data are introduced in Chapter 4.


a viable option for analysing the principal data. Cluster analysis was felt to show the greatest potential of the options presented in this chapter for analysing an incomplete G x E matrix. Many new developments in the use of distance measures for clustering of incomplete data, first presented in Godfrey et al. (2001), are discussed in greater detail in Chapters 4 and 5.

2.7 Stability measures

Stability measures have been employed in G x E analyses to rank genotypes according to different criteria. The various stability parameters have been repeatedly examined with the most notable contribution being that of Lin et al. (1986). In that paper many stability parameters were classified into stability types, and the current examination is guided by that framework.

Lin et al. (1986) determined that a genotype was stable if:

1. It had small among-environment variance.

2. Its response was parallel to the mean of the genotypes in each trial.

3. The residual MS attributable to that genotype in a joint regression was small.
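The three criteria above can each be turned into a per-genotype summary. The sketch below assumes a complete table with genotypes in rows and environments in columns, and uses the among-environment variance for the first criterion, the Wricke ecovalence (sum of squared interaction residuals) for the second, and the residual mean square from regressing each genotype on the environmental index for the third. The function name and yields are hypothetical, and the choice of one representative statistic per criterion is an assumption made for illustration.

```python
import numpy as np

def stability_summaries(y):
    """Return (variance, ecovalence, residual mean square) for each genotype."""
    y = np.asarray(y, dtype=float)
    n_env = y.shape[1]
    row_dev = y - y.mean(axis=1, keepdims=True)

    # Criterion 1: variance of each genotype across environments.
    variance = y.var(axis=1, ddof=1)

    # Criterion 2: squared departures from parallelism with the trial means.
    resid = row_dev - y.mean(axis=0, keepdims=True) + y.mean()
    ecovalence = (resid ** 2).sum(axis=1)

    # Criterion 3: residual mean square from regression on the environmental index.
    index = y.mean(axis=0) - y.mean()
    slope = row_dev @ index / (index @ index)
    fitted = np.outer(slope, index)
    residual_ms = ((row_dev - fitted) ** 2).sum(axis=1) / (n_env - 2)
    return variance, ecovalence, residual_ms

# Hypothetical yields: genotype 0 is static, genotype 1 tracks the trial means.
yields = np.array([
    [3.0, 3.0, 3.0, 3.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 5.0, 9.0, 13.0],
])

variance, ecovalence, residual_ms = stability_summaries(yields)
print(variance)      # genotype 0 has zero variance
print(ecovalence)    # genotype 1 has zero ecovalence
print(residual_ms)   # every genotype is exactly linear in the index here
```

The example shows why the criteria can disagree: the static genotype is the most stable by the first criterion but not by the second, while the genotype that follows the trial means is the reverse.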

Lin and Binns (1988b) presented a fourth type of stability that considers the difference between predictable and unpredictable environment variation and the effect that this has on each genotype.

Freeman (1973) provided the first real summary of the stability parameters, noting the Wricke ecovalence, Shukla's parameter, and those found from the various joint regression models. All of these and other stability parameters were classified into the three types of stability by Lin et al. (1986) and can be found in Table 2.1. The discussion that follows focuses on the type of stability and includes other stability parameters where appropriate. These additional parameters are listed in Table 2.2.

Type 1 stability - Static performance

A genotype with type 1 stability has constant performance across environments and therefore has a low genotype variance. Type 1 stability is synonymous with the notion of static performance (Becker and Leon, 1988), and was discussed in relation to the joint regression model of Section 2.3. A genotype deemed stable under this type of stability does not benefit from the addition of environmental conditions.

This type of stability does not explicitly consider any differences between environmental main effect and G x E interaction. Lin et al. (1986) noted that

"..., the usefulness of type 1 stability depends very much on the range of

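[Table 2.1, with columns "Type" and "Equation". Only the following entries are legible.]

$S_i^2 = \sum_{k=1}^{K} (y_{ik} - \bar{y}_{i\cdot})^2 / (K - 1)$

$CV_i = (S_i / \bar{y}_{i\cdot}) \times 100$

where $SS(GE) = \sum_{i=1}^{I} \sum_{k=1}^{K} (y_{ik} - \bar{y}_{i\cdot} - \bar{y}_{\cdot k} + \bar{y}_{\cdot\cdot})^2$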

In document Dealing with sparsity in genotype x environment analyses : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, Palmerston North, New Zealand (Page 54-59)