
5.2 Materials and Methods

5.2.2 Prediction Methods

Before we propose our prediction methods, we introduce some notation. A table of the most important notation used in this chapter can be found in Appendix 5.G. Let s_v = {s_v1, s_v2, ...} denote the vector of screening time intervals for village v, where s_v1 denotes the time between the start of the time horizon and the first screening for this village, s_v2 denotes the time between the first and the second screening, and so on. The time at which the n-th screening is performed is given by S_vn = Σ_{m≤n} s_vm, and the participation fraction in this screening round is denoted by p_vn. Parameter N_v represents the population size of village v. Furthermore, let i_v represent historical information on HAT cases in this village: the numbers of cases detected during past screening rounds. We model the expected prevalence level at time t in village v as a function f_v(·) of s_v, i_v, and some parameters β: f_v(t, s_v, i_v, β). Note that the expected prevalence level is a latent, i.e. unobserved, variable, and that the observed prevalence level, x_v(t), generally deviates from the expected value. We measure prevalence levels f_v(·) and x_v(·) as fractions and represent the difference between the expected and observed prevalence level in village v by the random variable ε_v:

$$x_v(t) = f_v(t, s_v, i_v, \beta) + \varepsilon_v \tag{5.1}$$

Time series models such as discrete time ARMA, ARIMA, or ARIMAX models seem to be the most popular methods for predicting prevalence (or incidence) (see e.g. Abeku et al., 2002; Allard, 1998). These models describe the prevalence level at time t as a linear function of the prevalence levels at time t − 1, t − 2, ... and (optionally) some other variables. Their applicability in our context is, however, limited. Discrete time models require estimates of the prevalence level at each time unit (e.g., each month), whereas information to estimate the HAT prevalence level is available only at moments at which a screening round is performed. The reason is that many HAT patients are not detected by the regular health system, particularly if they are in the first stage of the disease (Hasker et al., 2010).

The class of continuous time models is much more suitable for analyzing data observed at irregularly spaced times. These models assume that the variable of interest, f_v(t, s_v, i_v, β), follows a continuous process, defining its value at each t > 0.

The next subsections propose five continuous time models for predicting HAT prevalence levels. We again note that models describing in detail the causal processes that determine the observed prevalence levels (e.g., by explicitly modelling disease incidence, passive case finding, death and cure) may be most intuitive, but require data that are not available on a village level. Therefore, to safeguard their relevance for practical application, the variables we include are only those that are available on a large scale. This does not imply that our models neglect the causal processes. Instead, they are to some extent accounted for in an implicit way by fitting the models to the observed prevalence levels.

Data that are typically available at village level are the numbers of HAT cases found during screening rounds and the times of these screening rounds. For a given village, the former yield estimates of past prevalence levels, and the latter yield the time intervals between past screening rounds. We hypothesize that the current expected prevalence level at time t is related to past prevalence levels, past screening intervals, and in particular the time since the last screening round, which we denote by δ_v⁻(t) = min_n {t − S_vn | S_vn ≤ t}. Hence, we include (functions of) these variables in our models.
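The two quantities derived from the screening history, the screening times S_vn and the time since the last screening δ_v⁻(t), can be illustrated with a minimal Python sketch. The function names and the example village are hypothetical, and returning None before the first screening is an assumed convention that the text does not specify.

```python
from bisect import bisect_right
from itertools import accumulate


def screening_times(intervals):
    """Cumulative sums S_vn = sum_{m<=n} s_vm of the screening intervals."""
    return list(accumulate(intervals))


def time_since_last_screening(t, times):
    """delta_v^-(t): time elapsed since the most recent screening at or before t.

    Returns None when no screening has taken place yet (assumed convention).
    """
    k = bisect_right(times, t)  # number of screenings with S_vn <= t
    if k == 0:
        return None
    return t - times[k - 1]


# Hypothetical village: intervals of 0.5, 1.5 and 2.0 years between screenings.
s_v = [0.5, 1.5, 2.0]
S_v = screening_times(s_v)                     # [0.5, 2.0, 4.0]
delta = time_since_last_screening(3.25, S_v)   # 3.25 - 2.0 = 1.25
```

Because the screening times are sorted by construction, a binary search is enough to find the last screening at or before t.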

Linear regression models are very widely used in forecasting (see e.g. Franses, 1998). Major advantages of these models are that they are easy to understand, to implement, to fit, and to analyze. Therefore, the first model we introduce is a linear model (model 1), which also serves as a benchmark for our more advanced models. This model describes the expected HAT prevalence in a given village as a function of the time since the last screening and past prevalence levels. Such a linear model is, however, very vulnerable to a typical structure present in active case finding datasets. High past prevalence levels tend to increase the priority of screening a village, causing the time intervals between screening rounds to decrease. As a result, δ_v⁻(t) is a highly "endogenous" variable. More formally, external variables (past prevalence levels) are correlated with both the dependent variable (f_v(t)) and the independent variable (δ_v⁻(t)), which makes it hard to quantify the (causal) relation between them. In response to this, we present four alternative models. Model 2 is a fixed effects model, which adds a dummy variable for each village to the initial model. Model 3 is a (non-linear) exponential growth and decay model which is inspired by the SIS epidemic model. This model is used extensively for modeling epidemics that are characterized by an initial phase in which the number of infected individuals grows exponentially, and a second phase in which this number levels off to a time-invariant carrying capacity. We refer to model 3 as the logistic model with a constant carrying capacity. Finally, model 4 is a less data-dependent version of model 3, and model 5 is a variant of model 3 in which the carrying capacity is allowed to vary over time.

Model 1: Linear Model. Let x_v(S_vn) and x̄_v(S_vn) represent the observed prevalence level exactly at the n-th screening round in village v and the average observed prevalence level in the three years prior to the n-th screening round in village v, respectively. (For screening rounds in the first T < 3 years of our dataset, x̄_v(S_vn) represents the average prevalence level in these T years.) The linear model (LM) describes the expected prevalence level at time t, f_v(S_vn + δ_v⁻(t)), as a function of the time since the last screening round in village v before time t and past observed prevalence levels in that village:

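$$
\begin{aligned}
x_v(S_{vn} + \delta_v^-(t)) &= f_v(S_{vn} + \delta_v^-(t)) + \varepsilon_v \\
&= \beta_1 \bar{x}_v(S_{vn}) + \beta_2 x_v(S_{vn}) + \beta_3 \sqrt{\delta_v^-(t)} + \beta_4 \delta_v^-(t) \\
&\quad + \beta_5 \bar{x}_v(S_{vn}) \sqrt{\delta_v^-(t)} + \beta_6 x_v(S_{vn}) \sqrt{\delta_v^-(t)} \\
&\quad + \beta_7 \bar{x}_v(S_{vn}) \delta_v^-(t) + \beta_8 x_v(S_{vn}) \delta_v^-(t) + \varepsilon_v
\end{aligned}
\tag{5.2}
$$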

We include the term √δ_v⁻(t) in this model to incorporate the non-linear nature that characterizes epidemics: the (increase in the) number of cases typically levels off after an initial phase of fast growth. Furthermore, we include the cross terms based on the hypothesis that the prevalence level increases faster over time for villages which have higher past prevalence levels.
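To make the structure of Equation (5.2) concrete, the sketch below builds its eight regressors for a single observation and evaluates the linear predictor. It is an illustration only: the function names are hypothetical, and the coefficient values are made up rather than estimated from the chapter's dataset.

```python
from math import sqrt


def lm_features(xbar, x, delta):
    """The eight regressors of Equation (5.2), in the order beta_1..beta_8."""
    root = sqrt(delta)
    return [xbar, x, root, delta, xbar * root, x * root, xbar * delta, x * delta]


def lm_predict(beta, xbar, x, delta):
    """Expected prevalence f_v(S_vn + delta) for a given coefficient vector."""
    return sum(b * z for b, z in zip(beta, lm_features(xbar, x, delta)))


# Illustrative inputs only (not estimates from the chapter's dataset).
beta = [0.5, 0.3, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0]
f_hat = lm_predict(beta, xbar=0.01, x=0.02, delta=4.0)
```

Since the model is linear in β, stacking these feature vectors for all observations gives a design matrix that any ordinary least squares routine could fit.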

Model 2: Fixed Effects Model. Datasets about the HAT epidemic, such as the dataset used in this study, closely resemble panel data: both consist of observations at multiple points in time for multiple units (i.e., villages). The causality problem mentioned in the introduction of this section can be regarded as the problem that differences between units explain part of the regressors. A common way to deal with differences between units in panel data is to assume that there is a unit-specific, time-invariant fixed effect that contributes to the dependent variable. This is modeled by adding a constant for each unit to the regression model. Thereby, the model relates deviations from the fixed effect to variations in the explanatory variables. In line with this, we consider the following regression model, which we denote by the fixed effects model (FEM):

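$$
\begin{aligned}
x_v(S_{vn} + \delta_v^-(t)) &= f_v(S_{vn} + \delta_v^-(t)) + \varepsilon_v \\
&= \alpha_v + \beta_1 \bar{x}_v(S_{vn}) + \beta_2 x_v(S_{vn}) + \beta_3 \sqrt{\delta_v^-(t)} + \beta_4 \delta_v^-(t) \\
&\quad + \beta_5 \bar{x}_v(S_{vn}) \sqrt{\delta_v^-(t)} + \beta_6 x_v(S_{vn}) \sqrt{\delta_v^-(t)} \\
&\quad + \beta_7 \bar{x}_v(S_{vn}) \delta_v^-(t) + \beta_8 x_v(S_{vn}) \delta_v^-(t) + \varepsilon_v
\end{aligned}
\tag{5.3}
$$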

Here, α_v represents the fixed effect for village v. Note that the term α_v forms the only difference between the FEM and the linear model.
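One standard way to estimate the slope coefficients of a model with a dummy per unit is the "within" transformation: subtracting each village's own mean from every variable removes α_v and, for ordinary least squares, yields the same slope estimates as including the dummies. The sketch below shows this on toy data with a single regressor for brevity; the function names, the data, and the reduction to one regressor are illustrative assumptions, not the chapter's estimation procedure.

```python
from collections import defaultdict


def demean_by_village(villages, values):
    """Subtract each village's own mean (the 'within' transformation)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, z in zip(villages, values):
        sums[v] += z
        counts[v] += 1
    return [z - sums[v] / counts[v] for v, z in zip(villages, values)]


def within_slope(villages, x, y):
    """OLS slope of y on a single regressor x after removing village means."""
    xd = demean_by_village(villages, x)
    yd = demean_by_village(villages, y)
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)


# Toy data: two villages with different fixed effects but a common slope of 0.002.
villages = ["A", "A", "A", "B", "B"]
delta = [1.0, 2.0, 3.0, 0.5, 1.0]
alpha = {"A": 0.01, "B": 0.05}
prev = [alpha[v] + 0.002 * d for v, d in zip(villages, delta)]
slope = within_slope(villages, delta, prev)  # close to 0.002
```

In this toy example a pooled regression without village effects would be distorted by the fact that the village with the higher baseline is observed at shorter intervals, which mirrors the endogeneity concern raised for the linear model.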

Model 3: Logistic Model with a Constant Carrying Capacity. Literature suggests that the expected fraction of individuals infected with HAT grows exponentially in a first phase and levels off to (or oscillates around and converges to) an equilibrium prevalence level in a second phase (Jusot et al., 1995; Rogers, 1988). We refer to the equilibrium prevalence level as the carrying capacity. A simple epidemic model that incorporates these effects is the SIS model, which describes the evolution of the expected prevalence level in village v by means of the following differential equation:

$$\frac{df_v(t)}{dt} = \kappa \cdot f_v(t) \left( 1 - \frac{f_v(t)}{K_v} \right) \tag{5.4}$$

Here, as in the previous section, f_v(t) denotes the expected fraction of people in population v who are infected at time t, K_v represents the carrying capacity of this village, and κ represents a parameter indicating the speed of convergence to the epidemic equilibrium. When the epidemic is in its initial phase, the growth rate of f_v(t) is approximately κ · f_v(t), whereas this rate goes to 0 when f_v(t) approaches the carrying capacity K_v. We propose to use the explicit solution to this differential equation to model the development of the expected HAT prevalence level in village v. This solution is obtained by multiplying both sides of Equation (5.4) by dt, taking the integral on both sides, and regarding the previous screening (i.e., the one performed at time S_vn) as the beginning of the time horizon:

$$f_v(S_{vn} + \delta_v^-(t)) = \frac{K_v}{1 + A_{vn} \cdot e^{-\kappa \cdot \delta_v^-(t)}} \tag{5.5}$$

A_vn can be interpreted as a parameter determining the expected prevalence level immediately after the n-th screening round in village v: A_vn = K_v / f_v(S_vn⁺) − 1, where S_vn⁺ denotes this moment.
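As a sanity check on the closed form, the sketch below evaluates Equation (5.5) and compares it with a crude forward-Euler integration of Equation (5.4) started from the same post-screening prevalence. The function names, the step count, and the parameter values are illustrative assumptions.

```python
from math import exp


def logistic_prevalence(delta, f_plus, K, kappa):
    """Equation (5.5): expected prevalence delta time units after a screening.

    f_plus is the expected prevalence immediately after the screening, K the
    carrying capacity and kappa the convergence speed.
    """
    A = K / f_plus - 1.0
    return K / (1.0 + A * exp(-kappa * delta))


def euler_prevalence(delta, f_plus, K, kappa, steps=100_000):
    """Crude forward-Euler integration of Equation (5.4), used only as a check."""
    f, h = f_plus, delta / steps
    for _ in range(steps):
        f += h * kappa * f * (1.0 - f / K)
    return f


# Illustrative values only: K = 2%, post-screening prevalence 0.5%, kappa = 0.5.
closed = logistic_prevalence(3.0, f_plus=0.005, K=0.02, kappa=0.5)
numeric = euler_prevalence(3.0, f_plus=0.005, K=0.02, kappa=0.5)
```

The two values should agree closely, and setting delta to 0 in the closed form returns the post-screening prevalence, which is the role A_vn plays in the text.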

We fit the parameters in function (5.5) using the dataset described in the previous section. To keep the model simple, we assume that κ is constant over time and over villages (i.e., this parameter is an intrinsic property of the epidemic). Furthermore, we specify an interval of realistic values for κ based on the following reasoning. Note that κ represents the expected number of new infections per infected person per time unit in a population that is almost completely susceptible: f_v(t) → 0. Now let r represent the yearly removal rate for HAT (the rate of progression to death or cure) and let R_0 denote the basic reproduction number for HAT (the average number of secondary cases induced by one case in an otherwise susceptible population). Since the expected disease duration of this one case is 1/r years, the average rate at which secondary infections are induced by this one case equals R_0 / (1/r) = r · R_0 per year. Under the assumption that this rate is constant over the disease duration, this implies that κ = r · R_0. Since several analyses suggest that the value of R_0 is generally close to 1 in endemic regions (Rock et al., 2015; Funk et al., 2013; Davis et al., 2011), we hypothesize that R_0 ∈ [1.0, 1.5]. Furthermore, 95% confidence intervals for the stage 1 and stage 2 disease duration, as presented by Checchi et al. (2008), suggest that r lies in the interval [0.22, 0.52]. Based on these ranges, whose endpoint products are 0.22 and 0.78, we estimate that κ lies in the interval [0.2, 0.8]. We note that one could try to estimate village-specific or focus-specific ranges for R_0 (and hence for κ) using the next-generation matrix method, as proposed by Diekmann et al. (1990). Yet, as the corresponding data requirements are relatively large, we deem this to be practically unsuitable.
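The range for κ follows from multiplying the endpoints of the two positive intervals, which can be checked in a couple of lines. The function name is hypothetical; the numerical ranges are the ones quoted in the text.

```python
def kappa_interval(r_range, r0_range):
    """Range of kappa = r * R0 when both factors are positive intervals."""
    (r_lo, r_hi), (r0_lo, r0_hi) = r_range, r0_range
    return r_lo * r0_lo, r_hi * r0_hi


# r in [0.22, 0.52] and R0 in [1.0, 1.5], as quoted in the text.
lo, hi = kappa_interval((0.22, 0.52), (1.0, 1.5))
# lo and hi are approximately 0.22 and 0.78, consistent with the rounded [0.2, 0.8].
```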

Furthermore, we hypothesize that K_v is highly related to past prevalence levels and past screening activities in village v. Higher prevalence levels indicate a higher carrying capacity. Moreover, if two villages have had equal prevalence levels but village 1 has been screened more frequently than village 2, this may indicate that the carrying capacity in village 1 is higher. In accordance with these hypotheses, we estimate K_v as:

$$K_v = \beta_1 + \beta_2 \tilde{\mu}_v \tilde{x}_v \tag{5.6}$$

Here, x̃_v and μ̃_v denote the observed average prevalence level in village v during five consecutive years and the average screening frequency (screening rounds per year) in this village during these years, respectively. We take the last five years of data in our estimation sample (as we discuss in the next section, we split up the dataset in an estimation sample and a prediction sample). If only T < 5 years of data are available, we base x̃_v and μ̃_v on these T years. Note that variables x̃_v and μ̃_v are based on historical information and hence that these variables (and hence the estimated carrying capacity) are not affected by current screening frequency decisions. Furthermore, we note that these variables should be based on the same period of five years, since combined information on the screening efforts and resulting prevalence levels provides insight into the epidemic potential in a village.

The estimate of K_v provides one of the two inputs for determining A_vn. The second input is the expected prevalence level in village v at time S_vn⁺ (i.e., immediately after the n-th screening round). Under the assumption that infected individuals who are detected during a screening round are immediately "removed", the only people infected at time S_vn⁺ are those who did not participate in the screening round and those not detected by the diagnostic test. This suggests the following relation between the expected prevalence after and the expected prevalence before the n-th screening round in village v:

$$f_v(S_{vn}^+) = (1 - p_{vn} \cdot s) \, f_v(S_{vn}^-) \tag{5.7}$$

Here, p_vn and s denote the participation fraction in the n-th screening in village v and the sensitivity of the diagnostic test, and S_vn⁻ denotes the moment right before the infected individuals were removed. We assume that s = 0.925 based on the review by Brun et al. (2010), and obtain the following estimator for A_vn:

$$A_{vn} = \frac{K_v}{f_v(S_{vn}^+)} - 1 = \frac{\beta_1 + \beta_2 \tilde{\mu}_v \tilde{x}_v}{(1 - p_{vn} \cdot s) \, f_v(S_{vn}^-)} - 1 \tag{5.8}$$

The model requires an assumption about the (unobserved) expected prevalence level at the beginning of the time horizon (i.e., at 01-01-2004). The only information we have about this level is that it is lower than the expected prevalence level at the first screening round. Furthermore, under the realistic assumption that the epidemic is in its "convex part", the expected prevalence level will have steeply increased between 01-01-2004 and the first screening round, suggesting that there is a substantial difference between the two. Therefore, taking x̃_v as an estimate of the expected prevalence level at the first screening round and lacking further information, we choose to estimate f_v(0) as half the average observed prevalence level during five years: 0.5 x̃_v. Section 5.3.2 discusses the sensitivity of our results with respect to this choice.
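The screening update of Equation (5.7), the resulting A_vn of Equation (5.8), and the starting value f_v(0) = 0.5 x̃_v can be written as three small functions. This is a sketch with hypothetical function names and made-up input values; only s = 0.925 and the factor 0.5 come from the text.

```python
SENSITIVITY = 0.925  # diagnostic test sensitivity s assumed in the text


def prevalence_after_screening(f_before, participation, s=SENSITIVITY):
    """Equation (5.7): detected cases are removed, the rest remain infected."""
    return (1.0 - participation * s) * f_before


def a_vn(K, f_before, participation, s=SENSITIVITY):
    """Equation (5.8): A_vn = K_v / f_v(S_vn^+) - 1."""
    return K / prevalence_after_screening(f_before, participation, s) - 1.0


def initial_prevalence(x_tilde):
    """The chapter's choice f_v(0) = 0.5 * x_tilde at the start of the horizon."""
    return 0.5 * x_tilde


# Illustrative values only: 80% participation, 1% prevalence just before screening.
f_plus = prevalence_after_screening(0.01, 0.8)   # about 0.0026
A = a_vn(K=0.02, f_before=0.01, participation=0.8)
```

With full participation and a perfect test the post-screening prevalence would be zero and A_vn undefined, so the sketch implicitly relies on p_vn · s < 1, which holds for s = 0.925.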

Substituting Equation (5.6) and Equation (5.8) into the definition of the prevalence level (5.5), we obtain the following model, which we refer to as the Logistic Model with a Constant Carrying Capacity (LMCCC), and which we fit by estimating the parameters κ, β_1, and β_2:

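$$
\begin{aligned}
x_v(S_{vn} + \delta_v^-(t)) &= f_v(S_{vn} + \delta_v^-(t)) + \varepsilon_v \\
&= \frac{\beta_1 + \beta_2 \tilde{\mu}_v \tilde{x}_v}{1 + \left( \dfrac{\beta_1 + \beta_2 \tilde{\mu}_v \tilde{x}_v}{(1 - p_{vn} \cdot s) \, f_v(S_{vn}^-)} - 1 \right) \cdot e^{-\kappa \cdot \delta_v^-(t)}} + \varepsilon_v
\end{aligned}
\tag{5.9}
$$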

Note that the expected prevalence level before the n-th screening round, f_v(S_vn⁻), depends on the estimated values of κ, β_1, and β_2. In fact, by substituting its definition, Equation (5.9) can be written in the following form (see Appendix 5.B):

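$$x_v(S_{vn} + \delta_v^-(t)) = \frac{\beta_1 + \beta_2 \tilde{\mu}_v \tilde{x}_v}{1 + \sum_{i \le n} a_{vi} \, e^{-\kappa \cdot \delta_{vi}}} + \varepsilon_v \tag{5.10}$$

Here, a_vi and δ_vi denote some nonnegative constants.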
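Because f_v(S_vn⁻) in Equation (5.9) itself depends on the earlier rounds, evaluating the model amounts to a recursion: grow logistically over each screening interval, then remove the detected cases. The sketch below traces that recursion for one village. The function names and all numerical inputs are illustrative assumptions; in the chapter K would come from Equation (5.6) and the parameters would be estimated from data.

```python
from math import exp


def lmccc_path(intervals, participation, K, kappa, f0, s=0.925):
    """Expected prevalence just before each screening round under Equation (5.9).

    intervals[n] is the time between the previous round (or the start of the
    horizon) and round n; participation[n] is the participation fraction in
    round n. Returns the pre-screening levels and the level after the last round.
    """
    f_start, before = f0, []
    for gap, p in zip(intervals, participation):
        A = K / f_start - 1.0
        f_before = K / (1.0 + A * exp(-kappa * gap))  # logistic growth over the gap
        before.append(f_before)
        f_start = (1.0 - p * s) * f_before            # removal of detected cases
    return before, f_start


def lmccc_predict(delta, f_after_last, K, kappa):
    """Expected prevalence delta time units after the most recent screening."""
    A = K / f_after_last - 1.0
    return K / (1.0 + A * exp(-kappa * delta))


# Illustrative values only; K would come from Equation (5.6) in the chapter.
before, f_after = lmccc_path([1.0, 2.0], [0.8, 0.7], K=0.02, kappa=0.5, f0=0.004)
forecast = lmccc_predict(1.5, f_after, K=0.02, kappa=0.5)
```

Unrolling this loop algebraically is what produces the sum of exponential terms in Equation (5.10).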

Model 4: Restricted Logistic Model with a Constant Carrying Capacity. Estimates of historical prevalence levels x̃_v are available on a very large scale. The national HAT control programs are a main source of these data, as they keep track of all HAT cases found. Alternatively, modeling studies have yielded estimates of the incidence levels in Sub-Saharan Africa at the level of detail of 1 km² (see Simarro et al., 2010, 2012), which can be transformed into corresponding estimates of prevalence levels. Data to measure μ̃_v, the historical screening frequency in village v, are however scarcer in general, as gathering these data brings about a significant administrative burden. It is therefore relevant to investigate the predictive performance of a model that only uses x̃_v to estimate K_v instead of μ̃_v x̃_v. We do so by fitting the following model, which we refer to as the restricted Logistic Model with a Constant Carrying Capacity (rLMCCC):

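$$x_v(S_{vn} + \delta_v^-(t)) = \frac{\beta_1 + \beta_2 \tilde{x}_v}{1 + \left( \dfrac{\beta_1 + \beta_2 \tilde{x}_v}{(1 - p_{vn} \cdot s) \, f_v(S_{vn}^-)} - 1 \right) \cdot e^{-\kappa \cdot \delta_v^-(t)}} + \varepsilon_v \tag{5.11}$$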

Model 5: Logistic Model with a Varying Carrying Capacity. The models presented in the previous subsections implicitly assume that the carrying capacity (i.e., the upper bound on the expected prevalence level) remains the same over time. However, due to possible changes in conditions that affect the disease dynamics (such as vegetation, the number of flies and humans in and around a village, and passive case finding activities), it is realistic to assume that the carrying capacity may vary over time. Furthermore, the epidemic potential may also vary due to variations in susceptibility to or tolerance of HAT infection, as argued by Welburn et al. (2016). Based on this assumption, we propose a Logistic Model with a Varying Carrying Capacity (LMVCC):

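$$x_v(S_{vn} + \delta_v^-(t)) = \frac{\beta_1 + \beta_2 \bar{\mu}_v(S_{vn}) \bar{x}_v(S_{vn})}{1 + \left( \dfrac{\beta_1 + \beta_2 \bar{\mu}_v(S_{vn}) \bar{x}_v(S_{vn})}{(1 - p_{vn} \cdot s) \, f_v(S_{vn}^-)} - 1 \right) \cdot e^{-\kappa \cdot \delta_v^-(t)}} + \varepsilon_v \tag{5.12}$$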

The only change with respect to the logistic model with a constant carrying capacity is that x̃_v and μ̃_v are replaced by x̄_v(S_vn) and μ̄_v(S_vn), the average observed prevalence level in and the screening frequency for village v in the three years prior to the n-th screening round in that village, respectively.
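A possible way to compute the trailing inputs x̄_v(S_vn) and μ̄_v(S_vn) is sketched below. Several details are assumptions that the text does not settle: the window is taken to include round n itself, the prevalence average is an unweighted mean over rounds, the window is shortened to the elapsed time when fewer than three years of data exist, and screening times are assumed to be strictly positive. The function name and data are hypothetical.

```python
def trailing_averages(times, prevalences, n, window=3.0):
    """Average prevalence and screening frequency over the window before round n.

    Assumptions (not stated in the text): rounds m with
    S_vn - window < S_vm <= S_vn are included, so round n itself counts, and
    the prevalence average is an unweighted mean over those rounds.
    """
    t_n = times[n]
    idx = [m for m, t in enumerate(times) if t_n - window < t <= t_n]
    span = min(window, t_n)  # shorter span early in the time horizon
    x_bar = sum(prevalences[m] for m in idx) / len(idx)
    mu_bar = len(idx) / span
    return x_bar, mu_bar


# Hypothetical village: screenings at years 0.5, 2.0 and 4.0 of the horizon.
times = [0.5, 2.0, 4.0]
prevalences = [0.01, 0.02, 0.006]
x_bar, mu_bar = trailing_averages(times, prevalences, n=2)
```

Under these assumptions, the carrying capacity in Equation (5.12) would be recomputed at every screening round from the returned pair, rather than fixed once per village as in Equation (5.6).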

In document Evidence-Based Optimization in Humanitarian Logistics (Page 158-167)