epidemic and genetic data
5.3.1
Modelling within-host dynamics
As discussed in Section 2.2.4, the evolutionary model considered in Chapter 4 is an abstraction of a much more complex biological world. A key extension is to model the within-host diversity of the pathogens, which is not captured by Equation 4.2.
We can consider a Yule process (Yule, 1925), a binary branching process in which each strain gives birth to a new strain at a constant rate λ. Figure 5.1 illustrates a realisation of the Yule process that starts with a single strain and evolves to four strains at time t.
Figure 5.1: An illustration of a Yule process. Starting from one strain, the process goes through 3 birth events at 3 nodes (indicated by the black dots) and evolves to 4 strains (a, b, c, d) at time t.
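The birth mechanism described above can be sketched as a short forward simulation. This is a minimal illustration only, not code from the thesis; the function name `simulate_yule` and its arguments are hypothetical. It relies on the standard fact that, with n strains each splitting at rate λ, the waiting time to the next birth is exponential with rate nλ.

```python
import random


def simulate_yule(birth_rate, t_end, n0=1, rng=None):
    """Simulate a Yule (pure-birth) process on the interval [0, t_end].

    Hypothetical helper for illustration. Returns the sorted list of
    birth-event times, so the number of strains at t_end is
    n0 + len(birth_times).
    """
    if rng is None:
        rng = random.Random()
    n = n0
    t = 0.0
    birth_times = []
    while True:
        # Each of the n strains splits at rate birth_rate, so the time to
        # the next birth is exponential with total rate n * birth_rate.
        t += rng.expovariate(n * birth_rate)
        if t > t_end:
            break
        birth_times.append(t)
        n += 1
    return birth_times


# Example: one realisation started from a single strain.
rng = random.Random(2014)
births = simulate_yule(birth_rate=1.0, t_end=1.5, n0=1, rng=rng)
print("birth times:", births)
print("strains at t_end:", 1 + len(births))
```

A realisation with three birth times corresponds to the situation drawn in Figure 5.1, where one strain has become four by time t.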
Probabilistic distribution of the number of strains $n_t$
Let $t_d = (t_{d_1}, t_{d_2}, \ldots, t_{d_{n_t-1}})$ be the vector of times between the respective nodes (e.g., $n_t = 4$ in Figure 5.1).
It can be shown (Cox and Lewis, 1966; Rannala, 1997) that $n_t$ follows a negative binomial distribution with parameters $n_0$ and $1 - e^{-\lambda t}$. That is,

\[
n_t \mid n_0, \lambda, t \sim NB(n_0, 1 - e^{-\lambda t}) = \binom{n_t - 1}{n_t - n_0} e^{-n_0 \lambda t} (1 - e^{-\lambda t})^{n_t - n_0}, \tag{5.1}
\]

where $n_0$ is the initial number of strains at time zero.
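As a numerical check on Equation 5.1, the probability mass function can be evaluated directly. The sketch below is illustrative only; the function name `yule_count_pmf` is hypothetical. It confirms that the probabilities sum to one and that the mean equals $n_0 e^{\lambda t}$, the mean implied by this negative binomial form.

```python
import math


def yule_count_pmf(n_t, n0, birth_rate, t):
    """Probability of n_t strains at time t given n0 at time zero (Equation 5.1)."""
    if n_t < n0:
        return 0.0  # a pure-birth process never loses strains
    p = math.exp(-birth_rate * t)
    return math.comb(n_t - 1, n_t - n0) * p ** n0 * (1.0 - p) ** (n_t - n0)


# Example values (assumed for illustration): n0 = 2, lambda = 0.5, t = 2.
n0, lam, t = 2, 0.5, 2.0
support = range(n0, 400)  # truncated support; the tail beyond it is negligible here
total = sum(yule_count_pmf(n, n0, lam, t) for n in support)
mean = sum(n * yule_count_pmf(n, n0, lam, t) for n in support)
print("sum of probabilities:", total)
print("mean number of strains:", mean, "vs n0 * exp(lambda * t):", n0 * math.exp(lam * t))
```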
Probabilistic distribution of divergent time $t_{d_i}$
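Another distribution of interest is $t_{d_i} \mid n_0, \lambda, t$, the distribution of the time between the $i$th divergent time and $t$. It can be first shown (Nee, 2001) that

\[
P(t_d \mid n_t, n_0, \lambda, t) = (n_t - n_0)! \left( \frac{\lambda}{1 - e^{-\lambda t}} \right)^{n_t - n_0} e^{-\lambda \sum_{i = n_0 + 1}^{n_t} t_{d_i}}. \tag{5.2}
\]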
Equation 5.2 above corresponds to the probability density function of the order statistics of $n_t - n_0$ independent and identically distributed random variables with truncated exponential distributions which are independent of $n_t$ and $n_0$. As a result (Nee, 2001), we see that Equation 5.2 implies

\[
P(t_{d_i} \mid \lambda, t) = \frac{\lambda e^{-\lambda t_{d_i}}}{1 - e^{-\lambda t}}. \tag{5.3}
\]
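The truncated exponential density in Equation 5.3 can be sampled by inverting its cumulative distribution function, and sorting $n_t - n_0$ such draws gives one realisation of the order statistics described by Equation 5.2. The following is a minimal sketch under those assumptions; the function names are hypothetical and do not come from the thesis.

```python
import math
import random


def truncated_exponential_pdf(x, birth_rate, t):
    """Density of Equation 5.3 on [0, t]; zero outside that interval."""
    if x < 0.0 or x > t:
        return 0.0
    return birth_rate * math.exp(-birth_rate * x) / (1.0 - math.exp(-birth_rate * t))


def sample_truncated_exponential(birth_rate, t, rng):
    """Draw one value from Equation 5.3 by inverse-CDF sampling.

    The CDF is F(x) = (1 - exp(-lambda x)) / (1 - exp(-lambda t)),
    which is solved for x at a uniform draw u.
    """
    u = rng.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-birth_rate * t))) / birth_rate


def sample_divergence_times(n_t, n0, birth_rate, t, rng):
    """Sorted vector of n_t - n0 independent draws (the order statistics)."""
    draws = [sample_truncated_exponential(birth_rate, t, rng) for _ in range(n_t - n0)]
    return sorted(draws)


# Example: three divergence times for the four-strain case of Figure 5.1.
rng = random.Random(5)
print(sample_divergence_times(n_t=4, n0=1, birth_rate=1.0, t=2.0, rng=rng))
```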
Suppose we assume that each strain on an infected premises at time t has an equal probability of being transmitted to a susceptible premises, and that each initial infection involves a single strain. The subsequent growth of strains within each premises can then be modelled by Equation 5.1. A foolproof but possibly computationally unmanageable approach is to jointly impute the transmission tree and the branching patterns of Figure 5.1 within each infective and infected premises. Consider, however, two sequences sampled from two premises: it may be sufficient to know (impute) only their most recent common ancestor (MRCA), which corresponds to either a node or a tip in Figure 5.1, to describe their evolutionary relationship adequately. Although Equation 5.2 and Equation 5.3 allow us to work out the times of the nodes (and hence the times of the MRCA, $t_{MRCA}$), further theoretical developments are needed. For example, given any pair of descended samples (e.g., a and d in Figure 5.1), we need to know the probability that they have a particular MRCA so that we can assign the respective $t_{MRCA}$. However, as far as we are aware, this probability depends on the branching patterns and presents a complicated tree-combination problem for which no solution is available so far (e.g., Mulder (2011); Steel and McKenzie (2001)).
Chapter 5: Conclusion and future work
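To make the MRCA question concrete, the sketch below simulates a Yule tree with explicit parent links and looks up the MRCA of two sampled tips. It is a Monte Carlo illustration only, written under the equal-rate assumption of the Yule process, and is not a substitute for the analytical probability that the discussion above identifies as missing. All names (`simulate_yule_tree`, `mrca`) are hypothetical.

```python
import random


def simulate_yule_tree(birth_rate, t_end, rng):
    """Grow a Yule tree from a single strain (node 0) up to time t_end.

    Returns (parent, split_time, tips): parent maps node id to parent id,
    split_time maps each internal node to the time at which it split, and
    tips lists the lineages alive at t_end.
    """
    parent = {0: None}
    split_time = {}
    alive = [0]
    t = 0.0
    next_id = 1
    while True:
        t += rng.expovariate(len(alive) * birth_rate)
        if t > t_end:
            break
        # Every living lineage is equally likely to be the one that splits.
        node = alive.pop(rng.randrange(len(alive)))
        split_time[node] = t
        for _ in range(2):
            parent[next_id] = node
            alive.append(next_id)
            next_id += 1
    return parent, split_time, alive


def mrca(parent, a, b):
    """Most recent common ancestor of nodes a and b (a tip itself if a == b)."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent[a]
    while b not in ancestors:
        b = parent[b]
    return b


# Example: MRCA and its time for two tips of one simulated tree.
rng = random.Random(11)
parent, split_time, tips = simulate_yule_tree(birth_rate=1.0, t_end=2.0, rng=rng)
if len(tips) >= 2:
    a, b = rng.sample(tips, 2)
    node = mrca(parent, a, b)
    print("tips", a, b, "have MRCA node", node, "which split at time", split_time[node])
```

Repeating the simulation many times and tabulating which node is the MRCA gives an empirical estimate of the pairwise MRCA probabilities for a given λ and t, at the cost of simulating full branching patterns.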
Existing approaches cannot be used directly to build an accurate joint inferential framework. For example, Ypma et al. (2013) used a simple pathogen-effective-size growth model but assumed that it is completely known. Didelot et al. (2014) constructed the phylogeny within hosts independently of the transmission network, as opposed to taking a truly joint approach. The feasibility of extending these approaches, for example by estimating the growth model used by Ypma et al. (2013) jointly with the transmission dynamics, requires further research. In Chapter 4 we used a universal master sequence $G_M$ coupled with a variation process to model the background infection process, and showed that the master sequence and the variation process can be accurately estimated together with the transmission dynamics. It may then be possible to model the within-host diversity in a similar manner, where $G_M$ would now represent the “local” master sequence within the host.
5.3.2
Alternative sampling schemes of genetic data
Lastly, we have considered random sub-sampling of exposures for sequence data in Chapter 4. However, our framework is not restricted to this scheme, and alternative sampling schemes may be considered and investigated. For instance, in superspreading events, where many infections may occur (cluster) within a short period of time, we may consider a more concentrated sampling (e.g., at the peak of incidence) rather than an “even” (random) sampling, as most of the information about the evolutionary process may be contained in the observations made during the period of superspreading.
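The contrast between the two sampling schemes can be sketched as follows. This is an illustration under simple assumptions rather than the scheme used in Chapter 4: the peak of incidence is approximated crudely by the centre of the time bin with the most exposures, and the function names and the `bin_width` parameter are hypothetical.

```python
import random
from collections import Counter


def random_subsample(exposure_times, k, rng):
    """'Even' scheme: choose k exposures uniformly at random."""
    return sorted(rng.sample(exposure_times, k))


def peak_subsample(exposure_times, k, bin_width=1.0):
    """Concentrated scheme: choose the k exposures closest to the incidence peak.

    The peak is taken (as an assumption) to be the centre of the time bin
    that contains the most exposures.
    """
    counts = Counter(int(t // bin_width) for t in exposure_times)
    peak_bin = max(counts, key=counts.get)
    peak_time = (peak_bin + 0.5) * bin_width
    closest = sorted(exposure_times, key=lambda t: abs(t - peak_time))[:k]
    return sorted(closest)


# Example with made-up exposure times that cluster around time 3.
times = [0.2, 1.1, 3.0, 3.1, 3.2, 3.3, 3.4, 5.0, 7.5, 9.0]
rng = random.Random(1)
print("random sampling:", random_subsample(times, 3, rng))
print("peak sampling:  ", peak_subsample(times, 3))
```

Whether the concentrated scheme actually improves inference would need to be assessed within the full framework; the sketch only shows how the two sets of sampled exposures differ.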
A R package for latent residuals test
A.1
A brief description
An R package, EpiResTest (Lau and Pollock, 2014), has been built (beta version) to implement the latent-residual tests developed in Chapter 3, which are specifically designed to measure the goodness-of-fit of different model components of a general spatial SEIR model commonly used in epidemiological and ecological studies. Functions in the package take snapshots of posterior samples of model parameters (e.g., exposure times) as inputs and impute the residuals. They do not compute any summary statistics, such as the posterior p-value used in Chapter 3, so users may analyse the raw distributions of the residuals themselves. The underlying functions are coded in C++, so they should generally be quick.