Characteristics of the MTIMBA data - Introduction and objectives

Chapter 1: Introduction and objectives

1.6 Characteristics of the MTIMBA data

1.6.1 Geostatistical data

HDSS data are collected at fixed geographical locations; data of this type are known as geostatistical data. Observations collected at locations close to each other in space are correlated because nearby locations share common exposures and therefore have similar risks. Standard statistical models assume that observations are independent. Therefore, analyzing these data without taking the spatial correlation into account could result in incorrect model estimates (Cressie, 1993; Thomson et al., 1999).

Spatial models take into account the spatial correlation according to the way the geographical information is available. For instance, for geostatistical data, spatial models introduce an extra parameter (random effect) at each location. These parameters are considered latent observations of a spatial process and are modeled via a multivariate distribution that incorporates spatial correlation in its covariance matrix, typically by assuming that the covariance between any pair of locations is a function of the distance between them. The number of parameters increases with the number of locations surveyed. Hence, these models are highly parameterized when a large number of locations is involved (as in the case of the MTIMBA data) and cannot be estimated by the most commonly used maximum likelihood methods. Bayesian computational methods are suitable for fitting highly parameterized models by employing Markov chain Monte Carlo (MCMC) simulation algorithms (Gelfand and Smith, 1990). Diggle et al. (1998) formulated geostatistical models within the Bayesian framework of inference. These models have been applied and further developed for mapping malaria transmission (Kleinschmidt et al., 2000; Diggle et al., 2002; Gemperli, Sogoba, et al., 2006; Gemperli, Vounatsou, et al., 2006; Gosoniu et al., 2006, 2009; Kazembe et al., 2006; Sogoba et al., 2007; Hay et al., 2009; Riedel et al., 2010; Gething et al., 2011) and mortality (Gemperli et al., 2004; Kazembe et al., 2007; Sartorius et al., 2011).
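As an illustration of a covariance that depends only on the distance between locations, the following minimal sketch builds an exponential covariance matrix and draws one realization of the location-specific random effects. The exponential form, the parameter names (sigma2, range_param) and the simulated coordinates are assumptions made for illustration; they are not taken from the MTIMBA analysis.

```python
import numpy as np


def exponential_covariance(coords, sigma2, range_param):
    """Covariance matrix with entries sigma2 * exp(-d_ij / range_param),
    where d_ij is the Euclidean distance between locations i and j."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return sigma2 * np.exp(-dists / range_param)


rng = np.random.default_rng(0)
# Hypothetical coordinates of 50 surveyed locations (arbitrary units).
coords = rng.uniform(0.0, 10.0, size=(50, 2))
cov = exponential_covariance(coords, sigma2=1.5, range_param=2.0)

# One draw of the random effects from a zero-mean multivariate normal;
# there is one latent parameter per location, so the dimension grows
# with the number of locations.
random_effects = rng.multivariate_normal(np.zeros(len(coords)), cov)
```

The matrix has one row and one column per location, which is why its size, and the cost of working with it, grows with the number of locations surveyed.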

For a large number of locations (e.g. over 1000), computations involving the covariance matrix of the spatial process during model fitting are not feasible, because the matrix must be inverted repeatedly and the cost of each inversion grows cubically with the number of locations. Recent developments in geostatistical modeling estimate the spatial process from a subset of locations and use approximations to obtain the random effects at the observed locations (Banerjee et al., 2008). These methods have been used to analyze the MTIMBA Rufiji data in Tanzania (Rumisha et al., 2012) and mortality data from the Agincourt DSS in South Africa (Gosoniu et al., 2012).
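The idea of estimating the process at a subset of locations can be written compactly. The sketch below follows the general form of a predictive-process approximation; the notation (knots s*_j, covariance function C, parameters theta) is introduced here for illustration and is an assumption rather than the notation of the cited papers.

```latex
% Knots s^*_1, \dots, s^*_m with m \ll n, and the process at the knots:
\mathbf{w}^* = \bigl(w(s^*_1), \dots, w(s^*_m)\bigr)^{\top}
  \sim N\bigl(\mathbf{0},\, \mathbf{C}^*(\theta)\bigr),
\qquad
\bigl[\mathbf{C}^*(\theta)\bigr]_{jk} = C(s^*_j, s^*_k; \theta).

% Approximate random effect at any observed location s:
\tilde{w}(s) = \mathbf{c}(s;\theta)^{\top}\, \mathbf{C}^{*}(\theta)^{-1}\, \mathbf{w}^*,
\qquad
\mathbf{c}(s;\theta) = \bigl(C(s, s^*_1;\theta), \dots, C(s, s^*_m;\theta)\bigr)^{\top}.
```

Only the m x m matrix C* has to be inverted, instead of the n x n covariance matrix over all observed locations.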

1.6.2 Spatially misaligned data

Entomological data were collected in randomly selected houses (locations), while mortality outcomes were obtained from all locations within the study area. The locations of the two datasets do not necessarily match, and thus the datasets are spatially misaligned (Banerjee and Gelfand, 2002). In 2003, Gemperli and colleagues analyzed misaligned mortality and malaria survey data extracted from independent databases: the Demographic and Health Surveys (DHS) and the Mapping Malaria Risk in Africa (MARA) database, respectively. They linked the data by developing geostatistical models to predict malaria prevalence at the mortality locations. Subsequently, survival models with errors-in-covariates were fitted to take into account the prediction error of the malaria covariate.
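The linking step, predicting a covariate at locations where it was not measured and keeping track of the prediction error, can be sketched with zero-mean simple kriging. This is a simplified stand-in with known covariance parameters, not the Bayesian geostatistical model of the cited work; the function names, the exponential covariance, the nugget term and the simulated data are all assumptions.

```python
import numpy as np


def exp_cov(a, b, sigma2, range_param):
    """Exponential covariance between two sets of 2-D coordinates."""
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
    return sigma2 * np.exp(-d / range_param)


def krige(obs_coords, obs_values, new_coords, sigma2, range_param, nugget):
    """Zero-mean simple kriging: predictive mean and variance of the
    latent process at new_coords, given observations at obs_coords."""
    k_oo = exp_cov(obs_coords, obs_coords, sigma2, range_param)
    k_oo = k_oo + nugget * np.eye(len(obs_coords))
    k_no = exp_cov(new_coords, obs_coords, sigma2, range_param)
    # k_oo is symmetric, so this equals k_no @ inv(k_oo).
    weights = np.linalg.solve(k_oo, k_no.T).T
    mean = weights @ obs_values
    var = sigma2 - np.einsum("ij,ij->i", weights, k_no)
    return mean, var


rng = np.random.default_rng(1)
# Hypothetical misaligned data: a covariate observed at 30 survey
# locations, to be predicted at 5 different outcome locations.
survey_coords = rng.uniform(0.0, 10.0, size=(30, 2))
survey_values = rng.normal(size=30)
outcome_coords = rng.uniform(0.0, 10.0, size=(5, 2))

pred_mean, pred_var = krige(
    survey_coords, survey_values, outcome_coords,
    sigma2=1.0, range_param=2.0, nugget=0.1,
)
```

In an errors-in-covariates model, a quantity such as pred_var would describe the uncertainty of the predicted covariate, rather than the prediction being treated as if it had been observed exactly.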

1.6.3 Seasonality and temporal data

Malaria transmission is driven by environmental factors such as rainfall and temperature. Therefore, transmission intensity and vector populations fluctuate over time in areas where environmental factors are seasonal. In addition, entomological data were collected biweekly, introducing temporal correlation in the data. Ignoring seasonality and/or temporal correlation when analyzing these data may lead to incorrect model estimates. Studies have adjusted for seasonality within a modeling framework by introducing a binary covariate indicating wet (transmission) and dry (no transmission) seasons (Abeku et al., 2002; Gemperli, Sogoba, et al., 2006). Zhang et al. (2007) and Briët et al. (2008) used seasonal autoregressive integrated moving average (SARIMA) models to assess seasonality and take into account temporal correlation. Furthermore, harmonic functions have been employed to model seasonal trends in time-series data (Stolwijk et al., 1999; Griffin et al., 2010). However, the malaria epidemiology literature on model-based approaches to estimating seasonal trends in non-Gaussian data is sparse. In a Bayesian formulation, temporal correlation can be modeled by introducing random parameters at each time point (e.g. month), modeled via an autoregressive process of a given order. The Deviance Information Criterion (Spiegelhalter et al., 2002) is used to identify the best-fitting order.
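To illustrate how harmonic functions capture a seasonal trend, the sketch below fits one annual sine and cosine pair to a simulated monthly series by least squares. The 12-month period, the simulated coefficients and the Gaussian noise are assumptions chosen for simplicity; for count data the same harmonic columns would instead enter the linear predictor of a non-Gaussian model.

```python
import numpy as np


def harmonic_design(months, period=12.0):
    """Design matrix with an intercept and one sine/cosine pair."""
    angle = 2.0 * np.pi * months / period
    return np.column_stack([np.ones_like(angle), np.sin(angle), np.cos(angle)])


months = np.arange(36, dtype=float)  # three years of monthly time points
rng = np.random.default_rng(2)

# Simulated seasonal series with known (hypothetical) coefficients.
true_coef = np.array([2.0, 1.0, -0.5])
y = harmonic_design(months) @ true_coef + rng.normal(scale=0.1, size=months.size)

# Least-squares estimates of the intercept and harmonic coefficients.
coef, *_ = np.linalg.lstsq(harmonic_design(months), y, rcond=None)

# Amplitude of the seasonal cycle implied by the fitted pair.
amplitude = np.hypot(coef[1], coef[2])
```

The sine and cosine coefficients together determine both the amplitude and the timing of the seasonal peak, which a binary wet/dry indicator cannot represent.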

1.6.4 Zero-inflated entomological data

Entomological data on mosquitoes are usually collected at fixed locations over time. Therefore, besides being correlated in space and time, they are also characterized by a large number of locations with either no mosquitoes or no mosquitoes carrying parasites (sporozoites in the salivary glands). The large number of zero counts, of mosquitoes or of infected mosquitoes, could be due to (i) seasonality: mosquito populations are high in the wet season, which favors their development and survival, whereas the dry season is characterized by unsuitable weather conditions that result in high mosquito mortality; and (ii) interventions targeting mosquito survival, which prevent mosquito development and kill older mosquitoes that are likely to be infectious. Both lead to locations with no mosquitoes or no infected mosquitoes.

The presence of a large number of zero counts of mosquitoes, or of zero proportions infected, results in over-dispersion in the data; this is popularly known as “zero-inflation”. Zero-inflated data contain extra zeros relative to the underlying distribution. Standard models can account only for the frequency of zeros implied by that distribution. They are therefore not appropriate for analyzing zero-inflated data, because they predict fewer zeros than are observed, leading to poor fit. Zero-inflated analogues of the standard models are appropriate for fitting sparse data (Lambert, 1992; Hall, 2000). A zero-inflated model is a mixture model with two components: one arising from a parent distribution and the other corresponding to the excess zeros that cannot be accounted for by that distribution. The zeros from the parent distribution can be assumed to be random, occurring with the frequency determined by the parent distribution. The remaining excess zeros are assumed to be “structural” and may arise from unmeasured predictors of the outcome and/or seasonality. Zero-inflated models have been applied in a number of epidemiological studies (Nobre et al., 2005; Ramis-Prieto et al., 2007; Barnes et al., 2008; Fernandes et al., 2009; Vounatsou et al., 2009; Berrang-Ford et al., 2010; Manh et al., 2011), but their application to malaria transmission data with adjustment for temporal correlation has been limited.
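The two-component mixture can be made concrete with a zero-inflated Poisson distribution. The sketch below is illustrative only: the Poisson parent distribution, the parameter names (rate, zero_prob) and the numerical values are assumptions, and a zero-inflated binomial would be the analogous choice for the proportion of infected mosquitoes.

```python
import math


def poisson_pmf(k, rate):
    """Probability of observing count k under a Poisson(rate) distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def zip_pmf(k, rate, zero_prob):
    """Zero-inflated Poisson: with probability zero_prob the count is a
    structural zero; otherwise it is drawn from Poisson(rate), which can
    itself produce a (random) zero."""
    base = (1.0 - zero_prob) * poisson_pmf(k, rate)
    return zero_prob + base if k == 0 else base


rate, zero_prob = 3.0, 0.4  # hypothetical mean count and excess-zero probability

# Probability of a zero under the parent distribution alone,
# and under the zero-inflated mixture.
p0_poisson = poisson_pmf(0, rate)
p0_zip = zip_pmf(0, rate, zero_prob)
```

With these values the mixture assigns far more probability to zero than the Poisson distribution alone, which is the gap a standard model cannot reproduce.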

In document Bayesian spatio-temporal modelling of the relationship between mortality and malaria transmission in rural western Kenya (Page 36-39)