Chapter 3: Methods of classification of Echinococcus coproantigen ELISA data
3.1.5 Mixture models
As described above, the concept of the identification of component distributions within a group of biological samples was first introduced in 1894 (Pearson, 1894), in one of the first examples of the application of statistical principles to the analysis of biological data (McLachlan and Peel, 2000c). Mixture models (or similar approaches) have subsequently been frequently applied to the problem of diagnostic test interpretation (Rushforth et al., 1971; Grannis and Lott, 1978; Parker et al., 1990; Gay, 1996; Neuenschwander et al., 2000; Baughman et al., 2006; Vyse et al., 2006; Hardelid et al., 2008), where they have potential use as a method of classification in the absence of a gold standard test. Due to the logistical and practical difficulties associated with the identification of known infected and uninfected dogs in the field, mixture models were investigated here as a potential approach to coproELISA classification.
Finite mixture models (FMMs) are a form of cluster analysis in which a finite number of subpopulations (‘components’) are identified within a population based on the distribution of the data rather than through association with external variables; they can therefore also be described as a type of ‘person-centred’ rather than ‘variable-centred’ analysis tool (Muthén and Muthén, 2000; Jung and Wickrama, 2008). FMMs can also be described as a form of ‘model-based clustering’, which groups individuals based upon explicit assumptions regarding the distributional properties of the components: most commonly, that these follow a Gaussian distribution (as in the Gaussian finite mixture models used in the current report). This gives the model a clear statistical foundation, and may also have some biological basis. The output of an FMM includes estimates of the parameters of the component distributions, along with the prior probability of membership of each component (that is, the ‘relative size’ of each component). From these, the posterior probability of component membership can be estimated for each individual sample and, if desired, each sample can be allocated to the component for which its posterior probability is highest (modal allocation). Although mixture models were first proposed over 100 years ago (Newcomb, 1886; Pearson, 1894), it is only with more recent computational advances, such as the expectation-maximisation algorithm (Dempster et al., 1977; McLachlan and Peel, 2000a) and Markov chain Monte Carlo (MCMC) methods (Hastings, 1970; McLachlan and Peel, 2000b), that reliable model fitting has become practical (Aitkin and Rubin, 1985; McLachlan and Peel, 2000c).
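As an illustration of how posterior membership probabilities and modal allocation follow from the fitted parameters, a minimal sketch is given here. It assumes Gaussian components, and the weights, means and standard deviations are hypothetical values chosen for illustration only (they are not estimates from this study); the function names are likewise illustrative.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_membership(y, weights, means, sds):
    """Posterior probability that observation y belongs to each component."""
    # Prior probability (mixing proportion) multiplied by the component density
    joint = [w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

def modal_allocation(y, weights, means, sds):
    """Index of the component with the highest posterior probability."""
    post = posterior_membership(y, weights, means, sds)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical two-component model (illustrative values only)
weights = [0.7, 0.3]   # prior probabilities of membership
means = [0.1, 0.8]     # component means
sds = [0.05, 0.3]      # component standard deviations

for value in (0.12, 0.30, 0.90):
    post = posterior_membership(value, weights, means, sds)
    print(value, [round(p, 3) for p in post],
          modal_allocation(value, weights, means, sds))
```

Under these assumed parameters, a value close to the lower component mean is allocated to the first component, whereas higher values are allocated to the second.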
The statistical background to mixture models has been reviewed elsewhere (McLachlan and Peel, 2000c), and will be only briefly introduced here. As FMMs are a model-based approach, the statistical model can be defined explicitly. For a simple univariate Gaussian FMM, 𝑌 represents a vector of length 𝑛, relating to a random sample of 𝑛 individuals from a population (𝑌𝑗; 𝑗 = 1: 𝑛). The probability density function of 𝑦𝑗, 𝑓(𝑦𝑗), can be presented as the weighted sum of 𝑔 component densities, each with its own mixing proportion (or weight), 𝜋𝑖; the mixing proportions each lie between zero and one, and together sum to one. Each component has its own normal density, 𝑓𝑖(𝑦𝑗), corresponding to 𝑁(𝜇𝑖, 𝜎𝑖²) (𝑖 = 1: 𝑔). This can be presented as follows:
$$f(y_j) = \sum_{i=1}^{g} \pi_i f_i(y_j)$$
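The mixture density can be evaluated directly from this definition. The following is a minimal sketch, assuming Gaussian components and using hypothetical parameter values (not estimates from this study); it also checks numerically that the resulting density integrates to approximately one.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(y, weights, means, sds):
    """f(y) = sum over components of pi_i * f_i(y)."""
    if any(w < 0.0 or w > 1.0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must lie in [0, 1] and sum to one")
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))

# Hypothetical two-component model (illustrative values only)
weights, means, sds = [0.7, 0.3], [0.1, 0.8], [0.05, 0.3]

# Riemann-sum check over a grid from -2 to 4 in steps of 0.001
step = 0.001
grid = [-2.0 + k * step for k in range(6001)]
area = sum(mixture_pdf(y, weights, means, sds) for y in grid) * step
print(round(area, 4))
```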
As such, in order for the model to be fitted, the number of mixture components, 𝑔, must be specified. This is one of the main difficulties encountered when constructing a mixture model, as in most cases 𝑔 is unknown, leading to a circular problem in which the model must be assessed before it can be created. Possible options for addressing this have been reviewed (Oliveira-Brochado and Martins, 2005), and will not be fully described here. A common approach is to fit models with different numbers of components and to compare these using complexity-penalised information criteria such as Akaike’s Information Criterion (AIC) (Akaike, 1973) or the Bayesian Information Criterion (BIC) (Schwarz, 1978). Traditional hypothesis tests of the effect of adding an extra component to the model, such as the likelihood ratio test, are complicated by the fact that the usual regularity conditions do not hold when comparing models with different numbers of components, so that −2ln(λ) does not follow its usual asymptotic chi-squared distribution (Aitkin and Rubin, 1985). This problem can be circumvented using bootstrapping approaches (McLachlan, 1987; Efron and Tibshirani, 1993). A bootstrap sample is generated from the model fitted to the data under the “null hypothesis” of 𝑔 components, and the likelihood of this sample is estimated under the null hypothesis. This is repeated for the “alternative hypothesis” of (𝑔 + 1) components, and the likelihood ratio of the null and alternative hypotheses (λ) is estimated. From this, −2ln(λ) can be calculated (as is usually used in the likelihood ratio test). This process is then repeated multiple times, allowing the distribution of −2ln(λ) under the null hypothesis to be estimated. Evidence against the null hypothesis is obtained if the likelihood ratio statistic calculated from the observed data is extreme relative to this bootstrap distribution, as is the case with any null hypothesis test (Hope, 1968; Aitkin et al., 1981; McLachlan, 1987). If the estimated p-value is below the significance threshold, this process is then repeated with a null hypothesis of (𝑔 + 1) components and an alternative hypothesis of (𝑔 + 2) components, and continues until there is no evidence against the null hypothesis.
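The two approaches to selecting 𝑔 described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the procedure used in this study: the EM routine is a simple pure-Python implementation with quantile-based starting values and a lower bound on the standard deviations (both choices are assumptions), the data are simulated, the number of free parameters is taken as 3𝑔 − 1 (𝑔 means, 𝑔 variances and 𝑔 − 1 free mixing proportions), and the small number of bootstrap replicates is chosen only to keep the example quick.

```python
import math
import random

def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(data, weights, means, sds):
    total = 0.0
    for y in data:
        dens = sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))
        total += math.log(max(dens, 1e-300))  # guard against numerical underflow
    return total

def fit_em(data, g, n_iter=100, tol=1e-6, min_sd=1e-3):
    """Fit a g-component Gaussian mixture by EM; returns (weights, means, sds, loglik)."""
    n, ordered = len(data), sorted(data)
    mean_all = sum(data) / n
    sd_all = math.sqrt(sum((y - mean_all) ** 2 for y in data) / n)
    weights = [1.0 / g] * g
    means = [ordered[int((i + 0.5) * n / g)] for i in range(g)]  # evenly spaced quantiles
    sds = [max(sd_all / g, min_sd)] * g
    loglik = -math.inf
    for _ in range(n_iter):
        resp = []  # E-step: posterior membership probabilities
        for y in data:
            joint = [w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(joint) or 1e-300
            resp.append([j / total for j in joint])
        for i in range(g):  # M-step: weighted updates of each component
            n_i = max(sum(r[i] for r in resp), 1e-12)
            weights[i] = n_i / n
            means[i] = sum(r[i] * y for r, y in zip(resp, data)) / n_i
            var = sum(r[i] * (y - means[i]) ** 2 for r, y in zip(resp, data)) / n_i
            sds[i] = max(math.sqrt(var), min_sd)
        new_loglik = log_likelihood(data, weights, means, sds)
        converged = new_loglik - loglik < tol
        loglik = new_loglik
        if converged:
            break
    return weights, means, sds, loglik

def information_criteria(data, g):
    """AIC and BIC for a g-component model (lower values are preferred)."""
    loglik = fit_em(data, g)[3]
    k = 3 * g - 1  # g means + g variances + (g - 1) free mixing proportions
    return -2.0 * loglik + 2.0 * k, -2.0 * loglik + k * math.log(len(data))

def lrt_statistic(data, g):
    """-2 ln(lambda) for g (null) versus g + 1 (alternative) components."""
    null_fit, alt_fit = fit_em(data, g), fit_em(data, g + 1)
    return -2.0 * (null_fit[3] - alt_fit[3]), null_fit

def simulate(n, weights, means, sds, rng):
    """Draw n observations from a fitted mixture (parametric bootstrap sample)."""
    sample = []
    for _ in range(n):
        i = rng.choices(range(len(weights)), weights=weights)[0]
        sample.append(rng.gauss(means[i], sds[i]))
    return sample

def bootstrap_lrt(data, g, n_boot=30, seed=1):
    """Bootstrap p-value for the null hypothesis of g components."""
    rng = random.Random(seed)
    observed, (w, m, s, _) = lrt_statistic(data, g)
    replicates = [lrt_statistic(simulate(len(data), w, m, s, rng), g)[0]
                  for _ in range(n_boot)]
    exceed = sum(1 for r in replicates if r >= observed)
    return observed, (exceed + 1) / (n_boot + 1)

# Simulated data from two well-separated components (illustrative values only)
rng = random.Random(0)
data = [rng.gauss(0.1, 0.05) for _ in range(100)] + [rng.gauss(1.0, 0.2) for _ in range(50)]

for g in (1, 2, 3):
    aic, bic = information_criteria(data, g)
    print(g, round(aic, 1), round(bic, 1))

observed, p_value = bootstrap_lrt(data, g=1)
print(round(observed, 1), round(p_value, 3))
```

In a sequential application, `bootstrap_lrt` would be called again with `g=2`, `g=3` and so on, stopping at the first value of 𝑔 for which the p-value is no longer below the chosen significance threshold. In practice, a considerably larger number of bootstrap replicates would usually be used.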
Assuming that the number of components is known, the remaining issue, as alluded to earlier, is fitting of the model. The likelihood of the model with the distributional parameters 𝜃𝑖 = (𝜇𝑖, 𝜎𝑖²) is as follows:
$$L(\theta_1 : \theta_g ; \pi_1 : \pi_g \mid Y) = \prod_{j=1}^{n} \sum_{i=1}^{g} \pi_i f_i(y_j \mid \theta_i)$$
where 𝑦𝑗 indicates an individual observation. Maximising this likelihood in order to estimate the parameters 𝜃𝑖 and 𝜋𝑖 can be facilitated through the use of the expectation-maximisation (EM) algorithm, which was developed by Dempster et al. and has been described elsewhere (Dempster et al., 1977; Jeff Wu, 1983; McLachlan and Peel, 2000a; Fraley and Raftery, 2002). In brief, the EM algorithm in the context of FMMs treats component membership as missing (unobserved) data accompanying the dataset 𝑌, which needs to be taken into account when maximising the likelihood. Each observation in the ‘complete data’, 𝑥𝑗, therefore comprises the individual observation (𝑦𝑗) and its associated unobserved component-membership variable (𝑧𝑗), giving 𝑛 unobserved variables in total. Each 𝑧𝑗 is a vector of length 𝑔 (𝑧𝑗 = (𝑧𝑗1: 𝑧𝑗𝑔)), where 𝑧𝑗𝑖 = 1 if 𝑦𝑗 is in component 𝑖, and 𝑧𝑗𝑖 = 0 otherwise. The algorithm itself is an iterative procedure consisting of an ‘expectation step’, in which the expected value of each 𝑧𝑗𝑖 is calculated from 𝑌 and the current estimates of 𝜃𝑖 and 𝜋𝑖; followed by a ‘maximisation step’, in which these values of 𝑧𝑗𝑖 are held fixed and the log-likelihood is maximised with respect to 𝜃𝑖 and 𝜋𝑖, conditional on 𝑌. These two steps are repeated until the estimates converge.
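The expectation and maximisation steps described above can be sketched as follows for a univariate Gaussian FMM. This is a minimal illustration rather than the implementation used in this study: the function names, the starting values, the lower bound on the standard deviations (`min_sd`) and the simulated data are all assumptions made for the purposes of the example.

```python
import math
import random

def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(data, weights, means, sds):
    """Log of the mixture likelihood: sum over j of ln(sum over i of pi_i f_i(y_j))."""
    total = 0.0
    for y in data:
        dens = sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))
        total += math.log(max(dens, 1e-300))  # guard against numerical underflow
    return total

def e_step(data, weights, means, sds):
    """Expected value of each z_ji given the data and the current parameters."""
    resp = []
    for y in data:
        joint = [w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds)]
        total = sum(joint) or 1e-300
        resp.append([j / total for j in joint])
    return resp

def m_step(data, resp, min_sd=1e-3):
    """Update pi_i, mu_i and sigma_i with the expected z_ji held fixed."""
    n, g = len(data), len(resp[0])
    weights, means, sds = [], [], []
    for i in range(g):
        n_i = max(sum(r[i] for r in resp), 1e-12)  # expected size of component i
        mu = sum(r[i] * y for r, y in zip(resp, data)) / n_i
        var = sum(r[i] * (y - mu) ** 2 for r, y in zip(resp, data)) / n_i
        weights.append(n_i / n)
        means.append(mu)
        sds.append(max(math.sqrt(var), min_sd))
    return weights, means, sds

def run_em(data, weights, means, sds, n_iter=200, tol=1e-8):
    """Alternate E and M steps until the log-likelihood stops increasing."""
    trace = [log_likelihood(data, weights, means, sds)]
    for _ in range(n_iter):
        resp = e_step(data, weights, means, sds)
        weights, means, sds = m_step(data, resp)
        trace.append(log_likelihood(data, weights, means, sds))
        if trace[-1] - trace[-2] < tol:
            break
    return weights, means, sds, trace

# Simulated data from two components (illustrative values only)
rng = random.Random(42)
data = [rng.gauss(0.1, 0.05) for _ in range(300)] + [rng.gauss(1.0, 0.2) for _ in range(100)]

# Hypothetical starting values for the two components
weights, means, sds, trace = run_em(data, [0.5, 0.5], [0.0, 0.5], [0.3, 0.3])
print([round(w, 3) for w in weights])
print([round(m, 3) for m in means])
print([round(s, 3) for s in sds])
print(len(trace) - 1, "iterations")
```

The log-likelihood recorded in `trace` should not decrease from one iteration to the next, which provides a simple check on the implementation. As the likelihood surface of a mixture model may have several local maxima, it is generally advisable to repeat the fitting from a number of different starting values.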