Cross validation is one technique that could be employed to gauge the stability of imputed values. 'Stability' is used to mean how robust the imputed values are to the


Figure 6.9: Dendrogram of the genotypes of Onion Data I, clustered using Euclidean distance applied to the sparse data. Clustering was truncated at the level 135.36, forming 26 clusters.

Figure 6.10: Dendrogram of the genotypes of Onion Data II, clustered using Euclidean distance applied to the sparse data. Clustering was truncated at the level 124.04, forming 20 clusters.


Figure 6.11: Dendrogram of the genotypes of Onion Data I, clustered using Euclidean distance. Fully imputed data was created using the nearest cluster method. Clustering was truncated at the level 19.45, forming 14 clusters.

Figure 6.12: Dendrogram of the genotypes of Onion Data II, clustered using Euclidean distance. Fully imputed data was created using the nearest cluster method. Clustering


data sets, compared to the number of observations, it was likely that two-stage imputed values would be unstable because first stage clusters were formed using so few observations. Clustering of genotypes using the sparse data resulted in clusters that had few genotypes in them irrespective of the distance measure used for clustering. It is therefore expected that neither method has a clear stability advantage.

6.6 Further ideas for two-stage imputation

This section highlights future avenues for investigation that have arisen through the development of two-stage imputation.

Two-stage imputation has been presented using genotypes as observations. There is no reason why the transpose of the G x E matrix could not be used instead to impute missing G x E combinations. There is also the possibility that genotype-based imputed values could be averaged with those based on environment similarity. In either case, the observed minimum and maximum yields from each environment would still need to be used in the trimming stage.
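The averaging idea can be sketched as follows. This is only an illustration under stated assumptions: the function name `average_and_trim` and its arguments are hypothetical, both sets of imputed values are assumed to be already available in genotype-by-environment orientation, and "trimming" is assumed to mean clamping an imputed value to the observed range of its environment.

```python
from typing import List, Optional

# Rows are genotypes, columns are environments; None marks a missing G x E cell.
Matrix = List[List[Optional[float]]]


def average_and_trim(observed: Matrix,
                     by_genotype: List[List[float]],
                     by_environment: List[List[float]]) -> List[List[float]]:
    """Fill missing cells with the mean of two imputations, trimmed per environment.

    by_genotype and by_environment are hypothetical inputs: imputed values from
    clustering genotypes and from clustering environments (the transposed
    matrix), both laid out as genotypes x environments.
    """
    n_gen, n_env = len(observed), len(observed[0])
    completed = [[0.0] * n_env for _ in range(n_gen)]
    for k in range(n_env):
        seen = [observed[i][k] for i in range(n_gen) if observed[i][k] is not None]
        if not seen:
            raise ValueError(f"environment {k} has no observed yields to trim against")
        lo, hi = min(seen), max(seen)
        for i in range(n_gen):
            if observed[i][k] is not None:
                completed[i][k] = observed[i][k]  # observed yields are never altered
            else:
                avg = 0.5 * (by_genotype[i][k] + by_environment[i][k])
                # Assumed trimming rule: clamp to the observed range of environment k.
                completed[i][k] = min(max(avg, lo), hi)
    return completed
```

Equal weighting of the two imputations is a choice made here for simplicity; the text does not say how the two sources should be weighted.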

In situations where no data is available for a certain environment within a first stage cluster of genotypes, two-stage imputation looks for the most similar cluster that has data in that environment. There is some potential for imputed values in this instance to be based on dubious similarity, especially when data is extremely sparse. If only first stage clusters were used to find imputed values, many G x E observations in Onion Data I and II would not have been imputed. Imputations based on first stage clustering could be used as the input for a second round of imputation based on a new first stage clustering. This process could be repeated until all missing G x E combinations were imputed, but would be unsuccessful in the event that a single genotype remained alone in a first stage cluster.
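The fallback to the most similar cluster can be illustrated with a small lookup. This is a sketch only: `donor_cluster`, `has_data` and `distance` are hypothetical names, and the between-cluster distance is assumed to have been computed beforehand by whatever measure the first stage clustering used.

```python
from typing import List, Optional


def donor_cluster(cluster: int, env: int,
                  has_data: List[List[bool]],
                  distance: List[List[float]]) -> Optional[int]:
    """Return the cluster whose data would be used for (cluster, env).

    has_data[c][k] is True when first stage cluster c has at least one
    observation in environment k. distance[a][b] is an assumed, precomputed
    dissimilarity between clusters a and b.
    """
    if has_data[cluster][env]:
        return cluster  # the cluster can supply its own data
    candidates = [c for c in range(len(has_data))
                  if c != cluster and has_data[c][env]]
    if not candidates:
        return None  # no cluster has data in this environment
    # Fall back to the most similar cluster that has data in this environment.
    return min(candidates, key=lambda c: distance[cluster][c])


def fallback_cells(has_data: List[List[bool]]) -> int:
    """Count cluster x environment cells that must borrow from another cluster."""
    return sum(not flag for row in has_data for flag in row)
```

Reporting the distance to the chosen donor alongside each imputed value would be one way to flag imputations that rest on weak similarity.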

No cross validation approaches have been investigated at this point. Ideally, some G x E combinations could be omitted from the data and the subsequent imputed values compared with the omitted observations. The correlation between the common G x E combinations of Onion Data I and II in the previous section is an example of what is possible. The sparsity of Onion Data I and II made this task less feasible, because some G x E combinations are crucial for maintaining the ability to link data from trials together. Such an exercise could be used to provide an interval estimate for imputed values rather than the current point estimates. Note that this interval estimation of imputed values could not be undertaken for all missing entries in Onion Data I and II if the minimum data constraints were to remain intact.
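A leave-one-out version of this exercise might look like the sketch below. Several things here are assumptions rather than part of the thesis: the minimum data constraint is simplified to "each genotype and each environment keeps at least `min_obs` observations after a cell is omitted", the function names are hypothetical, and `environment_mean_imputer` is only a placeholder standing in for two-stage imputation.

```python
from typing import Callable, List, Optional, Tuple

# Rows are genotypes, columns are environments; None marks a missing G x E cell.
Matrix = List[List[Optional[float]]]


def eligible_cells(data: Matrix, min_obs: int = 2) -> List[Tuple[int, int]]:
    """Observed cells that can be omitted without breaking the assumed constraint."""
    n_gen, n_env = len(data), len(data[0])
    row_n = [sum(v is not None for v in row) for row in data]
    col_n = [sum(data[i][k] is not None for i in range(n_gen)) for k in range(n_env)]
    return [(i, k) for i in range(n_gen) for k in range(n_env)
            if data[i][k] is not None
            and row_n[i] - 1 >= min_obs and col_n[k] - 1 >= min_obs]


def leave_one_out(data: Matrix,
                  impute: Callable[[Matrix], Matrix],
                  min_obs: int = 2) -> List[Tuple[int, int, float, float]]:
    """Omit each eligible cell in turn; return (i, k, observed, imputed) tuples."""
    results = []
    for i, k in eligible_cells(data, min_obs):
        masked = [row[:] for row in data]
        masked[i][k] = None  # hide one observed G x E combination
        completed = impute(masked)
        results.append((i, k, data[i][k], completed[i][k]))
    return results


def environment_mean_imputer(data: Matrix) -> Matrix:
    """Placeholder imputer (environment means), NOT two-stage imputation."""
    n_gen, n_env = len(data), len(data[0])
    out = [row[:] for row in data]
    for k in range(n_env):
        seen = [data[i][k] for i in range(n_gen) if data[i][k] is not None]
        mean = sum(seen) / len(seen)
        for i in range(n_gen):
            if out[i][k] is None:
                out[i][k] = mean
    return out
```

With very sparse data the list returned by `eligible_cells` may be short or empty, which is the practical form of the limitation noted above.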

Another means of establishing interval estimates for imputed values would be to use the two-stage clustering model given in (5.6) to impute missing values as part of a multiple imputation strategy. The process presented in Section 6.1 can now be enhanced to include the G x E interaction effects, determined by first stage clustering, to give:

1. Perform first stage clustering.

2. Create an explanatory indicator variable $G_{f\{i\}}$ to indicate first stage cluster membership.

3. Fit the model (6.2) to the data. The interaction term uses the term for genotype cluster membership, not the term for genotype main effect.

4. Use the parameter values from this model to determine the expected values of the missing entries.

5. Use the error MS from this model as the variance of the distributions used to impute missing entries.
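Steps 4 and 5 can be sketched in a few lines once the model has been fitted. The sketch assumes that the expected values of the missing entries and the error MS have already been obtained from the fitted model in whatever software was used, and it assumes normal distributions, which the steps above do not specify. The names `draw_imputations` and `summarise` are hypothetical.

```python
import random
import statistics
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (genotype index, environment index)


def draw_imputations(expected: Dict[Cell, float], error_ms: float,
                     m: int = 5, seed: int = 0) -> List[Dict[Cell, float]]:
    """Draw m sets of imputed values for the missing cells.

    expected maps each missing cell to its model-based expected value (step 4);
    error_ms is the error mean square used as the variance (step 5).
    Normality is an assumption made for this sketch.
    """
    if error_ms < 0:
        raise ValueError("error MS must be non-negative")
    rng = random.Random(seed)  # seeded so the draws are reproducible
    sd = error_ms ** 0.5
    return [{cell: rng.gauss(mu, sd) for cell, mu in expected.items()}
            for _ in range(m)]


def summarise(draws: List[Dict[Cell, float]]) -> Dict[Cell, Tuple[float, float]]:
    """Per-cell mean and standard deviation across the m imputed data sets."""
    summary = {}
    for cell in draws[0]:
        values = [d[cell] for d in draws]
        summary[cell] = (statistics.mean(values), statistics.stdev(values))
    return summary
```

The spread of the draws for each cell gives a rough basis for an interval estimate. Note that `summarise` needs at least two imputed data sets, since a standard deviation cannot be computed from one.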

If a first stage cluster has no data in an environment, there will be no data to estimate parameter values for the corresponding $GE_{f\{i\}k}$ terms. These parameters could be estimated using their expected value (zero), but this will limit the usefulness of the approach in choosing the best genotypes to grow in each type of environment.

6.7 Summary

This chapter presented an imputation method that arose as a consequence of the two-stage clustering method developed in Chapter 5. The ability to determine sets of genotypes that perform similarly across environments and subsequently to take advantage of differences in their mean performance allows incomplete G x E matrices to be made complete.

Existing imputation methodology was shown to be of limited use when working with G x E data. Model-based imputation strategies were discounted for use with data as sparse as that arising from the Onion Trials Programme. Simulation testing found that the two-stage imputed values were better than those found using other clustering-based imputation strategies. Testing was done using data sets from the G x E literature which were all complete and differed in size.

Application of two-stage imputation to Onion Data I and II gave consistent results, especially for the imputed values for environments in which greater numbers of genotypes were tested.

Some further ideas for the future development of two-stage imputation were introduced in Section 6.6. Regardless of the imputation method employed, there is a need to link variety selections to the different types of environment within the data. In Chapter 7 these groups of environments will be found so that genotype success in new environments can be predicted.


Chapter 7

In document Dealing with sparsity in genotype x environment analyses : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, Palmerston North, New Zealand (Page 186-191)