
Online publication date: 21 October 2010



Publisher Taylor & Francis

Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House,

37-41 Mortimer Street, London W1T 3JH, UK

Communications in Statistics - Theory and Methods

Publication details, including instructions for authors and subscription information:

http://www.informaworld.com/smpp/title~content=t713597238

Multiple Spatio-Temporal Cluster Detection for Case Event Data: An Ordering-Based Approach

C. Demattei^a; L. Cucala^b

^a Medical Information Department, Nimes University Hospital Center, Nimes, France
^b Institute of Mathematics and Modelling of Montpellier, Montpellier, France

Online publication date: 21 October 2010

To cite this Article: Demattei, C. and Cucala, L. (2011) 'Multiple Spatio-Temporal Cluster Detection for Case Event Data: An Ordering-Based Approach', Communications in Statistics - Theory and Methods, 40: 2, 358–372

To link to this Article: DOI: 10.1080/03610920903411200

URL: http://dx.doi.org/10.1080/03610920903411200

Full terms and conditions of use: http://www.informaworld.com/terms-and-conditions-of-access.pdf This article may be used for research, teaching and private study purposes. Any substantial or systematic reproduction, re-distribution, re-selling, loan or sub-licensing, systematic supply or distribution in any form to anyone is expressly forbidden.

The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material.


ISSN: 0361-0926 print/1532-415X online DOI: 10.1080/03610920903411200

Multiple Spatio-Temporal Cluster Detection for Case Event Data: An Ordering-Based Approach

C. DEMATTEI^1 AND L. CUCALA^2

^1 Medical Information Department, Nimes University Hospital Center, Nimes, France
^2 Institute of Mathematics and Modelling of Montpellier, Montpellier, France

This article introduces a spatio-temporal distance which allows the extension of the spatial cluster detection methods of Demattei et al. (2007) and Cucala (2009). A review of these methods is given before we define a spatio-temporal distance. Then this distance is used for detecting spatio-temporal clusters. These ordering-based methods are compared to the scan statistic by a simulation study. The scan procedure is more powerful but it detects fewer true positives due to its lack of flexibility. Those techniques are applied to a seismic data set. This article highlights two advantages of the ordering-based methods: their flexibility and their low computational demand.

Keywords Case event data; Cluster detection; Ordering-based methods; Spatio-temporal distance.

Mathematics Subject Classification 62P99.

1. Introduction

Spatio-temporal cluster detection is a recent development that deserves as much interest as was given first to the detection of clusters in time and then to clusters in space. As in the spatial setting, several methods have been proposed, notably ones derived from the scan statistic.

Several procedures have been developed in order to study the spatio-temporal aggregation of geographical health data. The interaction test proposed by Knox (1964) is an early example of this type of test. It is a global method, since potential clusters cannot be located.

The first approach that allows researchers to locate and detect a spatio-temporal cluster is certainly the spatio-temporal scan statistic introduced by Kulldorff (1998), which is a direct adaptation of the spatial scan statistic. Instead of using circular windows, cylinders are chosen for potential cluster shape. The circular basis represents the spatial area and the height of the cylinder represents the time period.

Received October 1, 2008; Accepted October 13, 2009

Address correspondence to C. Demattei, Departement de l'Information Medicale, Batiment Polyvalent, CHU Caremeau, 30029 Nimes, France; E-mail: christophe.demattei@chu-nimes.fr

Other approaches have also been proposed, such as the method of Iyengar (2004), which uses the scan statistic to search for pyramid-shaped clusters. All those methods have advantages and drawbacks. The scan statistic can be criticized for its lack of flexibility; other methods are more flexible but need more computational resources. Spatio-temporal cluster detection is still a young field in which much remains to be improved, as was the case in the spatial setting. The core of this article is the definition of a spatio-temporal distance which makes it possible to generalize the spatial cluster detection methods for case event data of Demattei et al. (2007) and Cucala (2009).

We first define a general framework for those two methods, which we call “ordering-based methods”. Both approaches are spatial adaptations of a temporal cluster detection technique, and in both cases the generalization is achieved by a common data transformation (Demattei et al., 2007) that makes it possible to order spatial events.

Once this framework is defined, we extend both spatial ordering-based methods to the spatio-temporal setting by introducing a spatio-temporal distance that allows us to order events both in space and time.

A simulation power study is then proposed to compare the ordering-based methods to the spatio-temporal scan statistic, which is the reference method. The scan statistic is widely used to locate and detect spatio-temporal clusters of medical events, as for example in McNally and Colver (2008) and Demattei et al. (2006a). The scan statistic is a very powerful method, but the parametric shape of the potential clusters (usually cylinders) reduces its flexibility and considerably increases the computational time compared to non-parametric methods such as ordering-based procedures. The flexibility of ordering-based methods is illustrated on an “L”-shaped example.

Those three techniques are also applied to a seismic data set. This data set is an Italy Catalogue of Earthquake Events included in the Statistical Seismology Library (SSLib) provided in R format by David Harte and Ray Brownrigg.

Finally, a computing time comparison is proposed between ordering-based and scan methods. A discussion concludes this article.

2. 2D and 3D Spatial Cluster Detection

2.1. The Spatial Scan Statistic

The most popular of the spatial cluster detection methods is the spatial scan statistic (Kulldorff, 1997). It relies on restricting the set of possible clusters to a finite family of subsets of the observation area. Then, the concentration in each of these subsets is assessed through the likelihood ratio test of H0, the hypothesis that cases are distributed as the underlying population, against a specific piecewise-constant density alternative. Originally, the possible clusters family was set to all the circles whose centers were points of a predefined grid. Recently, Kulldorff et al. (2006) extended the method by investigating a wide family of elliptic windows with predetermined shape, angle and center.

A multiple procedure has also been introduced (Zhang et al., 2010); it allows users to test the significance of secondary clusters. Even though no theoretical justification exists, a simulation study shows that the Type I error of this procedure remains close to the nominal level.

Even if these methods were first defined to study aggregated data, they also adapt to case event data. However, when the number of events becomes large, the set of possible clusters increases dramatically, and so does the computation time.
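As a rough illustration of how the concentration inside one candidate window can be scored with case and control events, the sketch below evaluates a log-likelihood ratio under a Bernoulli model. The Bernoulli form, the function names and the argument names are assumptions of this sketch; the text above does not specify which likelihood is used.

```python
import math

def xlogy(k, p):
    """Return k * log(p) with the convention 0 * log(0) = 0."""
    return 0.0 if k == 0 else k * math.log(p)

def bernoulli_llr(c_in, n_in, c_tot, n_tot):
    """Log-likelihood ratio for one window (hypothetical helper).

    c_in  : cases inside the window
    n_in  : cases + controls inside the window
    c_tot : cases in the whole study area
    n_tot : cases + controls in the whole study area
    """
    c_out, n_out = c_tot - c_in, n_tot - n_in
    if n_in == 0 or n_out == 0:
        return 0.0
    p_in, p_out = c_in / n_in, c_out / n_out
    if p_in <= p_out:  # only an excess of cases inside the window is scored
        return 0.0
    p0 = c_tot / n_tot
    log_l1 = (xlogy(c_in, p_in) + xlogy(n_in - c_in, 1 - p_in)
              + xlogy(c_out, p_out) + xlogy(n_out - c_out, 1 - p_out))
    log_l0 = xlogy(c_tot, p0) + xlogy(n_tot - c_tot, 1 - p0)
    return log_l1 - log_l0
```

The scan statistic is then the maximum of such scores over the whole family of windows, which is why the cost grows with the number of candidate windows.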

2.2. Introduction to Ordering-Based Spatial Methods

Spatial methods are often adapted from existing temporal cluster detection techniques, as for regression (Demattei et al., 2007) and spacings (Cucala, 2009) methods. In the temporal setting, the time of occurrence of events can be considered as a natural way to order events between them. In the spatial case, such a natural order does not exist. A data transformation is then needed to order spatial events and be able to adapt temporal techniques to spatial data. Those approaches are called ordering-based spatial methods. The regression and spacings methods belong to this class of spatial cluster detection methods.

The same data transformation is used for both approaches. Let $X_1, \ldots, X_n$ be random variables which denote the spatial coordinates of the occurrence of $n$ events in $A$, a bounded set of $\mathbb{R}^2$ or $\mathbb{R}^3$. The first ordered event $X_{(1)}$ is arbitrarily chosen to be the nearest point from the boundary of $A$, denoted by $\partial A$, using the Euclidean distance, denoted by $d(\cdot, \cdot)$. Thus,

$$D_0 = d(X_{(1)}, \partial A) = \min_{1 \le i \le n} d(X_i, \partial A).$$

Then the second event, whose location is called $X_{(2)}$, is the closest to $X_{(1)}$ among all the events that have not yet been ordered, such that

$$D_1 = d(X_{(2)}, X_{(1)}) = \min_{\substack{1 \le i \le n \\ X_i \ne X_{(1)}}} d(X_i, X_{(1)}).$$

All the events are iteratively ordered in the same way and

$$D_j = d(X_{(j+1)}, X_{(j)}) = \min_{\substack{1 \le i \le n \\ X_i \ne X_{(k)} \; \forall k = 1, \ldots, j}} d(X_i, X_{(j)}), \quad \forall j = 2, \ldots, n-1.$$

The series $D_1, \ldots, D_{n-1}$ is then used by both approaches to detect portions with small successive distances. This is described in the following sections.
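The ordering described above can be sketched in a few lines. This is a minimal illustration, assuming the observation domain is the unit square (so that the distance to the boundary has a closed form); the function names are hypothetical and not taken from the authors' software.

```python
import math

def boundary_distance(p):
    """Distance from p to the boundary of the unit square [0,1] x [0,1]."""
    x, y = p
    return min(x, 1 - x, y, 1 - y)

def order_events(points):
    """Nearest-neighbour ordering; returns (order, distances).

    order[0] is the index of the event closest to the boundary.
    distances[0] is D_0 and distances[j] (j >= 1) is the distance between
    the j-th and (j+1)-th ordered events.
    """
    remaining = list(range(len(points)))
    first = min(remaining, key=lambda i: boundary_distance(points[i]))
    order = [first]
    distances = [boundary_distance(points[first])]
    remaining.remove(first)
    while remaining:
        last = points[order[-1]]
        # Closest event among those not yet ordered
        nxt = min(remaining, key=lambda i: math.dist(points[i], last))
        distances.append(math.dist(points[nxt], last))
        order.append(nxt)
        remaining.remove(nxt)
    return order, distances
```

This naive version costs a quadratic number of distance evaluations in the number of case events, which is negligible for the sample sizes considered in this article.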

2.3. The Regression Method

The spatial regression method (Demattei et al., 2007) is an adaptation of a multiple temporal cluster detection procedure (Molinari et al., 2001).

Let $d_k = d(x_{(k+1)}, x_{(k)})$, the distance between the $k$-th and $(k+1)$-th ordered events, be a realization of $D_k$. This distance has to be weighted both to correct the presence of high distances due to the elimination process of pre-selected points and to adjust for a potential inhomogeneity in the underlying population density. The weighted distance $d_k^w$ is defined as the ratio between the distance $d_k$ and its expectation under $H_0$. Demattei et al. (2007) showed that the expected distance can be written

$$E_{H_0}\left[D_k \mid X_{(1)} = x_{(1)}, \ldots, X_{(k)} = x_{(k)}\right] = \int_0^a \left(1 - \frac{\int_{A_{k-1} \cap S(x_{(k)}, r)} f(x)\,dx}{\int_{A_{k-1}} f(x)\,dx}\right)^{n-k} dr \qquad (1)$$

where $f(x)$ stands for the underlying population density from which the $n$ points are independently sampled, $S(x, r)$ is the sphere with center $x$ and radius $r$, and $A_k = A \setminus \bigcup_{i=1}^{k} S(x_{(i)}, d_i)$ with the convention $A_0 = A$.

The numerical integration of $\int_0^a$ in Eq. (1) is achieved by using the trapezoidal rule. Moreover, the underlying population $Z$, constituted by $N$ individuals $z_i$, $i = 1, \ldots, N$, makes it possible to estimate the density integrals $\int_{A_{k-1}} f(x)\,dx$ and $\int_{A_{k-1} \cap S(x_{(k)}, r)} f(x)\,dx$. Indeed, for any set $B \subset A$, $\int_B f(x)\,dx$ can be approximated by $\#\{i : z_i \in B\}/N$. This integral approximation makes it possible to adjust the computation of $d_k^w$ for an inhomogeneous population. This adjustment is important since, when dealing with rare diseases, a large study area is necessary to examine data for evidence of spatial clustering. Hence, due to a natural inhomogeneity, the density of the population at risk is not constant over the study area.
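To make the approximation concrete, the following sketch evaluates the expectation in Eq. (1) by replacing both density integrals with population counts and integrating over the radius with the trapezoidal rule. It is only an illustration under stated assumptions: the function name, its arguments, the number of integration steps and the choice of the upper limit `a` are hypothetical and left to the user.

```python
import math

def expected_distance(x_k, removed, population, n, k, a, steps=200):
    """Approximate Eq. (1) with population counts and the trapezoidal rule.

    x_k        : location of the k-th ordered event
    removed    : (centre, radius) pairs of the spheres already excluded,
                 i.e. S(x_(i), d_i) for i = 1, ..., k-1
    population : locations z_1, ..., z_N of the underlying population
    n, k       : number of events and current rank
    a          : upper integration limit (an input of this sketch)
    """
    # Population individuals still inside A_{k-1}
    kept = [z for z in population
            if all(math.dist(z, c) > r for c, r in removed)]
    if not kept:
        raise ValueError("no population individual left in A_{k-1}")
    dists = [math.dist(z, x_k) for z in kept]

    def integrand(r):
        # Share of A_{k-1} that falls inside the sphere S(x_k, r)
        inside = sum(1 for d in dists if d <= r)
        return (1 - inside / len(kept)) ** (n - k)

    h = a / steps
    total = 0.5 * (integrand(0.0) + integrand(a))
    total += sum(integrand(i * h) for i in range(1, steps))
    return h * total
```

The weighted distance is then obtained by dividing the observed distance by this value. Note that the common factor 1/N cancels in the ratio of the two counts.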

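Cluster bounds can now be determined from the transformed data $(k, d_k^w)_{k=1,\ldots,n-1}$. For this purpose, we consider the weighted distance regression on the selection order $k$. To determine the presence of $m$ breaks (denoted by $b_1, \ldots, b_m$), the regression function taken into consideration is

$$f(t) = \sum_{j=1}^{m+1} \bar{d}_{[b_{j-1}+1,\, b_j]} \times I_{[b_{j-1}+1,\, b_j]}(t) \qquad (2)$$

with the convention $b_0 = 0$ and $b_{m+1} = n-1$. The notation $\bar{d}_{[b_{j-1}+1,\, b_j]}$ stands for the mean of $d_t^w$ for $t$ in $[b_{j-1}+1, b_j]$.

The minimum percentage of points between two breaks is a parameter which has to be taken into account. Let $\varepsilon \in (0, 1)$ denote this parameter. Then, the set of possible partitions is $\Lambda_\varepsilon = \{(b_1, \ldots, b_m) ;\ \forall i = 1, \ldots, m+1,\ \mathrm{card}\,[b_{i-1}+1, b_i] \ge \varepsilon (n-1)\}$.

Breaks (cluster bounds) are estimated by

$$(\hat{b}_1, \ldots, \hat{b}_m) = \operatorname*{argmin}_{(b_1, \ldots, b_m) \in \Lambda_\varepsilon} \sum_{t=1}^{n-1} \left(d_t^w - f(t)\right)^2 \qquad (3)$$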

and are computed efficiently using a dynamic programming algorithm (Demattei et al., 2006b).
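The least-squares search for breaks can be illustrated with a textbook dynamic programming recursion over segment end points. This sketch is not the SPATCLUS implementation; the function name is hypothetical, and `min_len` (at least 1) plays the role of the minimum number of points between two breaks.

```python
def best_breaks(d, m, min_len):
    """Least-squares positions of m breaks in the sequence d.

    Every segment must contain at least min_len >= 1 points. Returns
    (breaks, sse), where breaks are the 1-based ranks b_1 < ... < b_m of the
    last point of each of the first m segments.
    """
    n = len(d)
    # Prefix sums give the within-segment sum of squares in constant time.
    s1, s2 = [0.0], [0.0]
    for v in d:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(i, j):
        """Residual sum of squares when d[i:j] is fitted by its mean."""
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    inf = float("inf")
    # cost[s][j]: best cost of splitting the first j points into s segments
    cost = [[inf] * (n + 1) for _ in range(m + 2)]
    back = [[0] * (n + 1) for _ in range(m + 2)]
    cost[0][0] = 0.0
    for s in range(1, m + 2):
        for j in range(s * min_len, n + 1):
            for i in range((s - 1) * min_len, j - min_len + 1):
                c = cost[s - 1][i] + sse(i, j)
                if c < cost[s][j]:
                    cost[s][j], back[s][j] = c, i
    if cost[m + 1][n] == inf:
        raise ValueError("no admissible partition")
    # Walk back from the last segment to recover the break positions.
    breaks, j = [], n
    for s in range(m + 1, 1, -1):
        j = back[s][j]
        breaks.append(j)
    return breaks[::-1], cost[m + 1][n]
```

For example, a sequence with a run of small weighted distances in the middle yields two breaks surrounding that run, which is the candidate cluster.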

The double maximum test proposed by Bai and Perron (1998) is used to select the best model. This test makes it possible to test the null hypothesis of no break against an unknown number of breaks given a certain upper bound $M$. Once the best model is selected, a p-value is computed for each portion between two breaks by a Monte Carlo procedure. The best model selection and the p-value computation are fully described in Secs. 2.5 and 2.6 of the article of Demattei et al. (2007).
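The exact Monte Carlo procedure is the one described in Demattei et al. (2007) and is not reproduced here. As a generic illustration of the idea only, a rank-based Monte Carlo p-value can be sketched as follows; the function name, the callback `simulate_statistic` and the default number of replicates are hypothetical, and the sketch assumes that large values of the statistic are the extreme ones.

```python
import random

def monte_carlo_p_value(observed, simulate_statistic, n_sim=999, seed=0):
    """Rank-based Monte Carlo p-value; large statistics count as extreme.

    simulate_statistic(rng) must return the statistic computed on one data
    set simulated under the null hypothesis with the generator rng.
    """
    rng = random.Random(seed)
    exceed = sum(1 for _ in range(n_sim)
                 if simulate_statistic(rng) >= observed)
    # The observed data set is counted among the replicates.
    return (1 + exceed) / (1 + n_sim)
```

When small values are the extreme ones, as for a mean of weighted distances inside a candidate cluster, the comparison would have to be reversed.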


2.4. The Spacings Method

This procedure, introduced by Cucala (2009), is also based on the same ordering of the spatial events. However, it does not rely on the distances between events, but on the areas $S_i$ explored between events. The first area is defined by

$$S_1 = \{x \in A : d(x, \partial A) < D_0\}.$$

The successive areas are then given by

$$S_i = \Big\{x \in \overline{\bigcup_{j=1,\ldots,i-1} S_j} : d(x, X_{(i-1)}) < D_{i-1}\Big\}, \quad 2 \le i \le n,$$

where $\overline{B} = \{x \in A : x \notin B\}$.

The area spacings are defined by

$$\tilde{S}_i = \int_{S_i} f(s)\,ds, \quad 1 \le i \le n+1,$$

and can be estimated using the underlying population $Z$, as in the regression method. Under $H_0$, these area spacings follow the same distribution as uniform spacings (i.e., spacings issued from a $[0,1]$-uniform $n$-sample). Thus, the spatial cluster detection can rely on a temporal cluster detection technique applied to the point process $T_1, \ldots, T_n$, where $T_i = \sum_{j=1}^{i} \tilde{S}_j$. The objective of temporal cluster detection is to test whether there exists a time interval in which events are abnormally concentrated. The concentration indicator we apply here is the one introduced by Cucala (2008). The p-value of the most concentrated interval is estimated by a Monte Carlo procedure and a multiple procedure makes it possible to detect secondary clusters. Let $I_i = [T_{a_i}, T_{b_i}]$, $1 \le i \le k$, denote the $k$ significant temporal clusters identified. The corresponding spatial clusters are then the zones

$$C_i = \bigcup_{j=a_i+1}^{b_i} S_j$$

and the total clustering zone is

$$C = \bigcup_{i=1}^{k} C_i.$$
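The estimation of the area spacings from the underlying population can be sketched as a simple assignment of each population individual to the first area that reaches it. This is an illustration only: the function and argument names are hypothetical, and treating the region reached by no area as the last, (n+1)-th, spacing is an assumption of this sketch.

```python
import math
from itertools import accumulate

def area_spacings(ordered, d0, dists, population, boundary_distance):
    """Estimate the n + 1 area spacings by population counts.

    ordered           : ordered event locations, first to n-th
    d0                : distance of the first ordered event to the boundary
    dists             : the n - 1 successive distances D_1, ..., D_{n-1}
    population        : locations of the underlying population
    boundary_distance : function giving the distance of a point to the boundary
    """
    n = len(ordered)
    counts = [0] * (n + 1)
    for z in population:
        if boundary_distance(z) < d0:
            counts[0] += 1                      # first area, along the boundary
            continue
        for i in range(1, n):
            # Area number i + 1 is centred at the i-th ordered event.
            if math.dist(z, ordered[i - 1]) < dists[i - 1]:
                counts[i] += 1
                break
        else:
            counts[n] += 1                      # reached by no area
    return [c / len(population) for c in counts]

def cumulative_positions(spacings):
    """Cumulative sums of the spacings, i.e. the transformed positions."""
    return list(accumulate(spacings))
```

A temporal cluster detection technique can then be applied to the first n cumulative positions, which lie in the unit interval.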

3. Spatio-Temporal Cluster Detection

3.1. The Spatio-Temporal Scan Statistic

The spatial scan statistic naturally extends to the spatio-temporal setting. The family of possible clusters is now a collection of cylindrical windows with a circular (or elliptic) geographic base and with height corresponding to time. The concentration in each of these windows is then assessed as in the spatial setting.

A multiple procedure identical to the spatial one can also be used. Of course, the number of windows to compare is even larger than in the spatial setting, and analysing large case event data sets with this method is computationally very expensive.

3.2. Spatio-Temporal Distance Definition

We propose to extend the regression and spacings spatial methods to space-time cluster detection. Here, the main difficulty results from the different roles played by the temporal and spatial dimensions.

To overcome this difficulty, we introduce a so-called spatio-temporal distance, which is a weighted combination of the spatial and temporal Euclidean distances. A parameter is needed to establish a correspondence between space and time.

We propose the following choice for the value of this parameter. Let $|A|$ denote the observation domain area and $|T|$ the time observational interval length. Let $D = 2\sqrt{|A|/\pi}$ be the diameter of a disc whose area is $|A|$. This is nothing but the maximal spatial distance between two points in the disc. Our objective is to consider that a temporal distance equal to $|T|$ between two events is equivalent to a spatial distance equal to $D$. Hence, we set the spatio-temporal distance, denoted by $d_{ST}$, to be the following function of the Euclidean spatial and temporal distances $d_S$ and $d_T$:

$$d_{ST}\big((x, y, t), (x_0, y_0, t_0)\big)^2 = d_S\big((x, y), (x_0, y_0)\big)^2 + \frac{D^2}{|T|^2}\, d_T(t, t_0)^2.$$

Intuitively, this distance can be seen as a spatial Euclidean distance in $\mathbb{R}^3$ after rescaling the temporal axis. More rigorously,

$$d_{ST}\big((x, y, t), (x_0, y_0, t_0)\big) = d_S\Big(\big(x, y, \tfrac{D}{|T|} t\big), \big(x_0, y_0, \tfrac{D}{|T|} t_0\big)\Big),$$

which ensures that $d_{ST}$ is a distance.
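The definition translates directly into code. The sketch below is a minimal illustration for planar locations; the function names are hypothetical, and the values passed for the domain area and the length of the observation period are inputs chosen by the user.

```python
import math

def st_scale(area, duration):
    """Return D / T, where D is the diameter of a disc with the given area."""
    return 2.0 * math.sqrt(area / math.pi) / duration

def st_distance(p, q, scale):
    """Spatio-temporal distance between p = (x, y, t) and q = (x0, y0, t0)."""
    (x, y, t), (x0, y0, t0) = p, q
    return math.sqrt((x - x0) ** 2 + (y - y0) ** 2 + (scale * (t - t0)) ** 2)
```

Equivalently, one may multiply every time coordinate by the scale once and then use an ordinary three-dimensional Euclidean distance, which is how the ordering-based methods can be reused without modification.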

3.3. Ordering-Based Spatio-Temporal Methods

For $k = 1, \ldots, n$, $Z_k = (X_k, T_k)$ denotes the spatial and temporal coordinates of a spatio-temporal event in a bounded set $A \times T \subset \mathbb{R}^s \times \mathbb{R}$, with $s = 2$ or $3$. The regression and spacings methods previously described are applied to these data, replacing the spatial Euclidean distance by the spatio-temporal distance we just introduced.

4. Simulations

In this section, all samples were simulated with $A = [0, 1] \times [0, 1]$ (space) and $T = \{1, 2, \ldots, 31\}$ (time = one month in days).

The Type I error rate was obtained on 500 samples simulated under the no-clustering hypothesis (H0).

A power study was performed using several aggregation alternatives. For each alternative, 100 samples of 200 points were simulated. The time of occurrence of the events follows a discrete uniform distribution on $T$. In the clustering alternatives, the spatial locations follow a mixture of two uniform point processes, depending on the time interval, as described in Table 1.

The cluster simulation zones for each cluster alternative are illustrated in Fig. 1.


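Table 1
Simulation of spatio-temporal aggregates: spatial density along with time windows

| Cluster alternative | Time window | Spatial mixture |
|---|---|---|
| Parallelepipedic | [12, 18] | 0.7 × [0.4, 0.55] × [0.3, 0.7] + 0.3 × A |
| | Otherwise | A |
| 3D Z-shaped | [12, 14] | 0.7 × [0.4, 0.55] × [0.3, 0.7] + 0.3 × A |
| | [15, 17] | 0.7 × [0.4, 0.55] × [0.3, 0.45] + 0.3 × A |
| | [18, 20] | 0.7 × [0.4, 0.8] × [0.3, 0.45] + 0.3 × A |
| | Otherwise | A |
| Personal shaped | [10, 13] | a × [0.1, 0.2] × [0.2, 0.4] + (1 − a) × A |
| | [14, 16] | a × [0.1, 0.4] × [0.2, 0.4] + (1 − a) × A |
| | [17, 20] | a × [0.3, 0.4] × [0.2, 0.4] + (1 − a) × A |
| | [21, 23] | b × [0.3, 0.7] × [0.2, 0.7] + (1 − b) × A |
| | Otherwise | A |
| Diagonal | [5, 8] | 0.6 × [0.6, 0.9] × [0.6, 0.9] + 0.4 × A |
| | [9, 11] | 0.6 × [0.5, 0.8] × [0.5, 0.8] + 0.4 × A |
| | [12, 14] | 0.6 × [0.4, 0.7] × [0.4, 0.7] + 0.4 × A |
| | [15, 17] | 0.6 × [0.3, 0.6] × [0.3, 0.6] + 0.4 × A |
| | [18, 20] | 0.6 × [0.2, 0.5] × [0.2, 0.5] + 0.4 × A |
| | [21, 23] | 0.6 × [0.1, 0.4] × [0.1, 0.4] + 0.4 × A |
| | Otherwise | A |

For the personal shaped alternative, a = b = 0.7 for HD; a = 0.2 and b = 0.6 for LD.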
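As an illustration of how such a mixture can be simulated, the sketch below draws one sample under the parallelepipedic alternative, read here as: for days 12 to 18, a location falls in the rectangle [0.4, 0.55] × [0.3, 0.7] with probability 0.7 and is uniform on the unit square otherwise. That reading of the table, the function name and the seed handling are assumptions of this sketch.

```python
import random

def simulate_parallelepipedic(n=200, seed=0):
    """Draw n events (x, y, t) under the parallelepipedic alternative."""
    rng = random.Random(seed)
    events = []
    for _ in range(n):
        t = rng.randint(1, 31)  # discrete uniform time of occurrence
        in_cluster = 12 <= t <= 18 and rng.random() < 0.7
        if in_cluster:
            # Uniform location inside the cluster rectangle
            x, y = rng.uniform(0.4, 0.55), rng.uniform(0.3, 0.7)
        else:
            # Uniform location on the whole unit square
            x, y = rng.random(), rng.random()
        events.append((x, y, t))
    return events
```

The other alternatives differ only in the list of time windows, rectangles and mixture weights.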

For each simulated sample, the p-value of the identified clusters and the number of true positives and false positives were computed. The results are summarized in Table 2.

Under H0, the Type I error rate is about 5% for the regression, spacings and scan methods, close to the nominal level.

The most powerful method is the scan statistic, and the least powerful is the regression method. On the other hand, the scan statistic has a very low true positives rate compared to the spacings and regression methods. The scan technique is very powerful but the detected clusters are very small compared with the real ones. The low true positives rate of the scan statistic is due to the lack of flexibility of the potential cluster shape (cylinders). The true positives rate is slightly higher for the spacings method than for the regression method, but the false positives rate is lower for the regression method. When the simulated cluster has a quite low density, the scan statistic fails, with both a low power and a low true positives rate.

5. An “L”-Shaped Example

In order to illustrate the flexibility of the ordering-based methods compared to the scan statistic, an “L”-shaped spatio-temporal cluster was simulated. The spatial cluster simulation zone was $P = P_1 \cup P_2 \cup P_3$, with $P_1 = [0.4, 0.5] \times [0.3, 0.4] \times [6, 14]$, $P_2 = [0.4, 0.5] \times [0.3, 0.4] \times [15, 23]$, and $P_3 = [0.5, 0.9] \times [0.3, 0.4] \times [15, 23]$. The spatial locations follow a mixture of two uniform point processes, depending on the time interval, as follows: $0.7 \times P_1 + 0.3 \times A$ if time in $[6, 14]$, $0.7 \times (P_2 \cup P_3) + 0.3 \times A$ if time in $[15, 23]$, and $A$ otherwise.


Figure 1. Simulation cluster alternatives: from top to bottom, parallelepipedic, 3D Z shaped, personalized and diagonal cluster simulation zones.

The ordering-based and scan methods were applied to this simulated data set. The obtained results are illustrated in Fig. 2. The ordering-based methods detect a significant aggregate made up of 49 events. Forty-five events (92%) are included in $P$. 100% of the cluster events are detected by the ordering-based methods. The scan statistic detects a significant aggregate of 34 events, all included in $P$. The scan detects only 76% of the points included in $P$. Indeed, it detects 100% of the events in $P_1 \cup P_2$ but 0% of the events in $P_3$.
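The two percentages quoted above follow directly from the reported counts (49 events in the ordering-based aggregate, of which 45 lie in the simulation zone, and 34 events in the scan aggregate):

```latex
\frac{45}{49} \approx 0.918 \approx 92\%, \qquad
\frac{34}{45} \approx 0.756 \approx 76\%.
```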

This example illustrates the lack of flexibility of the scan statistic compared to the ordering-based methods. This is due to the parametric shape of the scanning window which, here, only detects the elongated zone $P_1 \cup P_2$ (close to a cylindrical shape), and does not detect the $P_3$ zone. By contrast, this “L”-shape is entirely detected by the ordering-based methods. The price to pay for this flexibility is false positives, but the false positive rate in this example is only 7%.

Let us remark that this example of cluster alternative corresponds to the real-life situation in which the spatial base of the aggregate grows in time, which is quite a common situation.

Figure 2. “L”-shaped cluster: points in black are included both in ordering-based and scan aggregates, points in dark gray are only included in ordering-based aggregate, and points in light gray are not included in an aggregate. The two figures represent the data with two distinct points of view (from side and from above respectively). The cluster simulation zone is represented by two parallelepipeds.

Table 2
Simulation study results

| Cluster alternative | Method | Power | True positives | False positives |
|---|---|---|---|---|
| H0 | Regression | 0.054 | / | / |
| | Spacings | 0.062 | / | / |
| | Scan | 0.042 | / | / |
| Parallelepipedic | Regression | 0.86 | 0.782 | 0.039 |
| | Spacings | 0.9 | 0.801 | 0.083 |
| | Scan | 0.92 | 0.524 | 0.009 |
| 3D Z-shaped aggregate | Regression | 0.88 | 0.500 | 0.041 |
| | Spacings | 0.93 | 0.548 | 0.083 |
| | Scan | 0.97 | 0.293 | 0.006 |
| Personal aggregate HD | Regression | 1 | 0.571 | 0.073 |
| | Spacings | 1 | 0.568 | 0.116 |
| | Scan | 1 | 0.270 | 0.018 |
| Personal aggregate LD | Regression | 1 | 0.581 | 0.139 |
| | Spacings | 1 | 0.751 | 0.364 |
| | Scan | 0.79 | 0.169 | 0.007 |
| Diagonal aggregate | Regression | 1 | 0.321 | 0.134 |
| | Spacings | 1 | 0.325 | 0.168 |
| | Scan | 1 | 0.095 | 0.026 |


6. Spatio-Temporal Analysis of Colfiorito Earthquake Sequence

To illustrate those approaches on real data, we used a seismic data set included in the Statistical Seismology Library (SSLib: a collection of earthquake hypocentral catalogues and R functions to analyse the catalogues). This data set is made up of 43,000 earthquake events that occurred in Italy between 1983 and 2006. The data source is the Istituto Nazionale di Geofisica e Vulcanologia. For each event, latitude, longitude, date, time, and magnitude are given. The earthquake events are represented in Fig. 3 along with the two fault lines.

Due to the very high number of events, we selected the earthquake events with a magnitude higher than 4.5 (on the Richter scale). To avoid a selection bias due to incomplete recording at the beginning of the study period, we also restricted the selection to events that occurred after 1993. Finally, 66 events were selected and analysed with spatial and spatio-temporal cluster detection methods.

The spatial and spatio-temporal regression, spacings and scan methods were applied to this data set. All the clusters reported in what follows are significant at the nominal level of 0.05. Depending on the choice for the underlying population $Z$, we may take into account the spatial and temporal distribution of the earthquakes or not.

Figures presented in this section use the same legend. Events are represented by black points. Events located in the regression aggregates are surrounded by a grey disc. Events located in the spacings aggregates are surrounded by a black circle. The most likely cluster detected by the scan statistic is represented by an ellipse (circle deformed by planar projection). Underlying population individuals are represented by little black points. For spatio-temporal figures, a solid line links temporally successive events of the north fault aggregate and a triangle denotes the direction of the temporal evolution.

Figure 3. Colfiorito earthquake sequence: earthquake events are represented by little black points. Solid lines mark the south and north fault lines.


6.1. Testing Against Uniformity

First, we chose a uniform random process of 10,000 points as the underlying population $Z$.

All the spatial cluster detection methods exhibit the same compact cluster located along the south fault including nine earthquake events which are spatially very close. The spacings method also detects a second cluster, less compact, along the north fault and including 11 events. Those clusters are shown in Fig. 4.

The spatio-temporal regression and spacings methods exhibit two identical temporal clusters. The spacings method reveals a third one. The spatio-temporal scan statistic only detects the first cluster located by the ordering-based methods. The spatial representation of those clusters is shown in Fig. 5.

The two ordering-based methods and the scan statistic highlight the same compact cluster (in the south) including nine earthquake events that occurred in 1997. This spatio-temporal aggregate is identical to the spatial cluster detected. This result gives more reliability to this spatio-temporal cluster, since the spatially aggregated events are also temporally very close. Looking more closely at those nine aggregated events, we remark that seven out of nine occurred between September 26, 1997 and November 9, 1997. September 26, 1997 is exactly the day of the beginning of the Colfiorito earthquake sequence (magnitude of 6.1). This sequence had the peculiarity of being made up of several major earthquakes close (in time and space) to a first earthquake, which is the case for our detected cluster.

The two ordering-based methods also detect a second cluster along the north fault line. Those nine events are more spread out in space (as we can see in Fig. 5) and in time, since they occurred between September 12, 2005 and November 7, 2006.

The spacings method detects a third cluster including seven events located between the two faults. Those events occurred between October 31, 2002 and November 25, 2004.

Figure 4. Colfiorito earthquake sequence: spatial clusters detected with regression, spacings and scan method with uniform underlying population.


Figure 5. Colfiorito earthquake sequence: spatio-temporal clusters detected with regression, spacings and scan methods with uniform underlying population.

6.2. Taking the Spatial and Temporal Distributions Into Account

Second, we chose the 43,000 earthquake events that occurred between 1993 and 2006 as the underlying population. For the scan statistic, which is computer intensive, we selected the subpopulation of events with a magnitude higher than 2.5, which contains 14,591 events. With this choice, we may identify the clusters of high-magnitude events among all the events. The underlying population, high-magnitude events and aggregated events are represented in Fig. 6.

Figure 6. Colfiorito earthquake sequence: spatio-temporal clusters detected with regression, spacings and scan methods with true underlying population.


The southern cluster has disappeared, but the spread-out northern one is included in the exhibited cluster. Those aggregated events correspond to events which occurred along the north fault line, where the probability for an event to occur is low compared to the south fault line. Those results illustrate the importance of the choice of the underlying population. We also remark that aggregated events are temporally ordered from east to west along the north fault line. This result highlights the spatio-temporal evolution of an initial earthquake event. The most likely spatio-temporal cluster of the scan statistic is not reported here since its spatial radius is 715 km. This cluster is huge compared to the maximal distance between two events, which is 1,750 km.

7. Computing Time

In the previous section, we have asserted that the scan statistic is computer intensive. Even if the SaTScan (Kulldorff and Information Management Services, 2006) user guide specifies that “The spatial and space-time scan statistics are computer intensive to calculate”, this fact has to be substantiated. For this purpose, we have evaluated the computing time for each method on 100 samples of 100 points. The underlying population size was fixed at 10,000.

The regression method runs very fast: it takes only 11.7 s on average and ranges between 11 and 13 s. The spacings method is a little more time-consuming, since it runs in 30.8 s on average and ranges between 25 and 39 s. Finally, the scan method is very demanding in computing time. The average is 18.2 min, and it ranges between 16.7 and 36 min. On those data sets, the scan takes about 100 times longer than the regression method and 35 times longer than the spacings method.
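For reference, the ratios quoted above can be recomputed from the reported averages once the scan time is converted to seconds:

```latex
18.2~\text{min} = 1092~\text{s}, \qquad
\frac{1092}{11.7} \approx 93, \qquad
\frac{1092}{30.8} \approx 35.
```

The first ratio is thus slightly below the round figure of 100 given in the text.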

All the programs were run on the same PC. Moreover, while the scan statistic is computed by the SaTScan software, which is optimized for reducing computing time, the ordering-based methods are written in the R language, which is quite time-consuming.

To complete the illustration of the quickness of the ordering-based methods in comparison with the scan statistic, let us remark that the regression method runs in 8 s for analysing the “L”-shaped data set studied in Section 5. On the same data set, the scan statistic runs in 38 min and 20 s, that is to say 280 times slower.

Even if our demonstration holds true only in the particular case of our data sets on our personal computer, let us try to explain such differences in computing time. The computational burden of the classical scan statistic is a consequence of the generation of the possible clusters. Indeed, the SaTScan software takes into account all the events in order to build the possible clusters, whether these are case events or control events. Then, the program counts how many case and control events are located in each possible cluster. So, increasing the number of control events increases both the number of possible clusters and the length of the counting process in each one. On the other hand, when one applies an ordering-based method, the computation of the areas $A_i$ only depends on the case events. The control events are then counted in each of these areas to compute the weighted distances $d_k^w$ or the area spacings. Thus, increasing the number of control events increases only the length of the counting process in the ordering-based methods: relative to the scan statistic, they become less time-consuming as the amount of control data increases.


8. Discussion

The ordering-based spatio-temporal methods that we have developed and presented in this article are very flexible. Indeed, the shape of the estimated cluster is not prespecified. This is a great advantage since users usually do not know the kind of cluster they are looking for. The lack of power compared to the scan statistic, illustrated by the simulation study, is balanced by the flexibility and the ability to locate and detect several arbitrarily-shaped clusters, as shown with the “L”-shaped example. This flexibility is also illustrated by the application to the seismic data set: we manage to highlight the evolution of an earthquake event sequence along the north fault while the scan statistic fails.

Even if the flexibility of our method is an advantage, some people may prefer using the scan statistic, which is a better alarm as it more easily detects the presence of a cluster. In fact, any user should answer the question: do I prefer detecting the presence of a cluster without estimating its precise location, or getting a more precise location of the cluster without being sure of detecting it? An example of a situation in which the researcher or user will prefer the second approach is the study of the spatio-temporal distribution of malaria cases along a river. Indeed, the vectors of the parasite responsible for human malaria are Anopheles mosquitoes, which live mainly near water and in damp surroundings. In this case, the use of ordering-based spatio-temporal methods is clearly indicated to bring out the increased risk of contamination along the river. In other situations, the circular or cylindrical scan remains a preferred approach, since in some cases the objective is to identify the presence and approximate locations of aggregated events. In this latter case, the scan remains both a powerful and elegant approach.

The computing time is another important aspect that has to be taken into consideration before selecting a method. Indeed, as the accuracy of the recorded information keeps increasing, the amount of data to analyse becomes huge and computationally efficient methods should be preferred. In this article, we show, in the particular case of our data sets on our personal computer, that the ordering-based methods run between 30 and 100 times faster than the scan statistic. If this conclusion can be generalized, it will be a great advantage when analysing big data sets. Indeed, the ratio between the computation times of the methods is very large when using a standard personal computer, but we can only suppose that it remains almost the same when using more powerful processors. This should be checked later on through the analysis of a bigger data set.

Both methods developed here can be used only for case event data. This type of data is more difficult to obtain than grouped data. However, its availability is increasing greatly with the development of Geographical Information Systems, and taking into account the whole spatial and temporal information seems essential.

We explained how the ordering-based methods can adapt to population inhomogeneity. The same would hold for adjustment on any continuous covariate (Klassen et al., 2005). This could be done by modeling a risk function depending on both the underlying population and the adjusted covariates.

Finally, the ordering-based methods can also be applied to multivariate data, whatever the dimension, provided that a distance between individuals is defined. In that case, using a scan technique appears infeasible, as the number of potential clusters would be gigantic.
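To illustrate why only a distance is needed, here is a minimal Python sketch of an ordering step in arbitrary dimension. It assumes a greedy nearest-neighbour chain, which is an assumption of this sketch and not necessarily the ordering defined earlier in the article; the names `greedy_nn_order` and `euclidean` are hypothetical. Any user-supplied distance function can replace the Euclidean one.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def greedy_nn_order(
    points: Sequence[T],
    dist: Callable[[T, T], float],
    start: int = 0,
) -> Tuple[List[int], List[float]]:
    """Order individuals by repeatedly moving to the nearest unvisited one.

    Hypothetical helper (assumed greedy nearest-neighbour chain).
    Returns the visiting order (indices into `points`) and the distance
    covered at each step. Uses O(n^2) distance evaluations.
    """
    n = len(points)
    if n == 0:
        return [], []
    order = [start]
    steps: List[float] = []
    remaining = set(range(n)) - {start}
    current = start
    while remaining:
        # Nearest unvisited individual; ties are broken by lowest index
        # so that the result is deterministic.
        nxt = min(remaining, key=lambda j: (dist(points[current], points[j]), j))
        steps.append(dist(points[current], points[nxt]))
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return order, steps


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance in any dimension (math.dist needs Python 3.8+)."""
    return math.dist(a, b)


# Made-up 4-dimensional data: two tight groups far from each other.
points = [
    (0.0, 0.0, 0.0, 0.0),
    (10.0, 10.0, 10.0, 10.0),
    (0.1, 0.0, 0.0, 0.0),
    (10.0, 10.1, 10.0, 10.0),
    (0.0, 0.2, 0.0, 0.0),
]
order, steps = greedy_nn_order(points, euclidean)
print(order)
print([round(s, 3) for s in steps])
```

In this toy example the sequence of step distances is short within each group and long when the chain jumps between groups, and only that one-dimensional sequence depends on the data dimension. The cost of the sketch is driven by the number of individuals, not by the dimension of the space.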


References

Bai, J., Perron, P. (1998). Estimating and testing linear models with multiple structural changes. Econometrica 66(1):47–78.

Cucala, L. (2008). A hypothesis-free multiple scan statistic with variable window. Biometr. J. 50(2):299–310.

Cucala, L. (2009). A flexible spatial scan test for case event data. Computat. Statist. Data Anal. 53(8):2843–2850.

Demattei, C., Zawar, V., Lee, A., Chuh, A., Molinari, N. (2006a). Spatial-temporal case clustering in children with Gianotti-Crosti syndrome. Systematic analysis led to the identification of a mini-epidemic. Eur. J. Pediatr. Dermatol. 16:159–164.

Demattei, C., Molinari, N., Daures, J. P. (2006b). SPATCLUS: an R package for arbitrarily shaped multiple spatial cluster detection for case event data. Comput. Meth. Progr. Biomed. 84:42–49.

Demattei, C., Molinari, N., Daures, J. P. (2007). Arbitrarily shaped multiple spatial cluster detection for case event data. Computat. Statist. Data Anal. 51(8):3931–3945.

Iyengar, V. S. (2004). On detecting space-time clusters. Proceedings of the Tenth ACM SIGKDD Int. Conf. Knowledge Discov. Data Mining, New York: ACM Press, pp. 587–592.

Klassen, A., Kulldorff, M., Curriero, F. (2005). Geographical clustering of prostate cancer grade and stage at diagnosis, before and after adjustment for risk factors. Int. J. Health Geograph. 4:1.

Knox, G. (1964). The detection of space-time interactions. Appl. Statist. 13:25–29.

Kulldorff, M. (1997). A spatial scan statistic. Commun. Statist. Theor. Meth. 26(6):1481–1496.

Kulldorff, M. (1998). Evaluating cluster alarms: a space-time scan statistic and brain cancer in Los Alamos, New Mexico. Amer. J. Public Health 88(9):1377–1380.

Kulldorff, M., Huang, L., Pickle, L., Duczmal, L. (2006). An elliptic spatial scan statistic. Statist. Med. 25(22):3929–3943.

Kulldorff, M., Information Management Services, Inc. (2006). SaTScan™ v7.0: Software for the spatial and space-time scan statistics. Available at: http://www.satscan.org/.

McNally, R. J., Colver, A. F. (2008). Space-time clustering analysis of occurrence of cerebral palsy in Northern England for births 1991 to 2003. Ann. Epidemiol. 18(2):108–112.

Molinari, N., Bonaldi, C., Daures, J. P. (2001). Multiple temporal cluster detection. Biometrics 57:577–583.

Zhang, Z., Kulldorff, M., Assuncao, R. (2010). Spatial scan statistics adjusted for multiple clusters. J. Probability Statistics (in press).
