2.3 Methods
2.3.2 The Knox test
The Knox (1964) test has been used widely in studies which address the simple question of whether there is interaction between the spatial and temporal distributions of a set of incidents; i.e. whether space-time clustering is present. For simplicity, and in keeping with the work which follows, a basic formulation will be described here, although more complex variations have been used elsewhere (and, indeed, will be considered in Chapter 4).
The origin of the test lies in epidemiology; more specifically, in the study of childhood leukaemia. Its motivation is to test whether a set of event data (in this context, observed cases of disease) is consistent with having been generated by a process involving an element of contagion, as is a common hypothesis in many such scenarios. The underlying rationale is that, if contagion is present, events will tend to be followed by other events in some spatial vicinity more than would be expected on the basis of chance. Such a relationship corresponds to a dependence between the spatial and temporal distributions: events are more likely to be close in space when they are close in time, and vice versa.
The basis for the test is the concept of a ‘close pair’ of events: one for which the spatial and temporal separations of the events both lie within certain thresholds. For concreteness, D will be taken to represent the spatial threshold and T the critical temporal separation. These thresholds can be taken to have any value, but are typically selected on the basis of the anticipated radius of any contagion effect (which, in turn, may be informed by other analysis). More sophisticated versions of the test consider several bands in each dimension, which represent different levels of separation (see Chapter 4); for simplicity, though, only the single-threshold case is considered here.
The first step of the test is to compare every possible pair of events in the dataset (for $N$ events, there will be $N(N-1)/2$ comparisons) and to record the number of those which are close pairs (i.e. the number of pairs $\{i, j\}$ for which $d_{ij} \le D$ and $-T \le t_{ij} \le T$, where $d_{ij}$ and $t_{ij}$ are the spatial and temporal separations of events $i$ and $j$). This statistic, denoted $S_K$, represents the observed proximity of events, as defined by the binary close pair relationship.
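The counting step can be illustrated with a minimal sketch. It assumes planar coordinates with Euclidean distance and numeric event times; the function name `knox_statistic` and the toy data are hypothetical and are introduced only for illustration.

```python
from itertools import combinations
from math import hypot


def knox_statistic(points, times, D, T):
    """Count pairs {i, j} with d_ij <= D and -T <= t_ij <= T."""
    close_pairs = 0
    # combinations() visits each unordered pair once: N(N-1)/2 comparisons.
    for i, j in combinations(range(len(points)), 2):
        # Assumption: Euclidean distance between planar coordinates.
        d_ij = hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
        t_ij = times[i] - times[j]
        if d_ij <= D and -T <= t_ij <= T:
            close_pairs += 1
    return close_pairs


# Hypothetical toy data: (x, y) locations and event times in days.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (0.5, 0.5)]
times = [0, 2, 3, 30]

S_K = knox_statistic(points, times, D=2.0, T=7)
print(S_K)
```

The double loop is quadratic in the number of events, which is acceptable for a sketch but may call for a spatial index on large datasets.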
Once this count of close pairs has been found, it must be compared against what would be expected under the null hypothesis (that the events’ locations in time and space are independent). In Knox’s original work, it was assumed that the number of close pairs followed a Poisson distribution, and that the expected frequency could be computed using the marginal frequencies of spatial close pairs and temporal close pairs.
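As an illustration of this calculation, the sketch below takes the expected count to be the product of the two marginal counts divided by the total number of pairs, which is the usual reading of 'marginal frequencies' in this setting, and evaluates an upper-tail Poisson probability. The function name `knox_poisson`, the Euclidean distance and the toy data are assumptions made for the example, not details taken from Knox's paper.

```python
from itertools import combinations
from math import exp, hypot


def knox_poisson(points, times, D, T):
    """Return (observed, expected, upper-tail Poisson probability)."""
    n_pairs = n_space = n_time = n_both = 0
    for i, j in combinations(range(len(points)), 2):
        near_space = hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1]) <= D
        near_time = abs(times[i] - times[j]) <= T
        n_pairs += 1
        n_space += int(near_space)
        n_time += int(near_time)
        n_both += int(near_space and near_time)

    # Expected close pairs if spatial and temporal closeness are independent.
    expected = n_space * n_time / n_pairs

    # P(X >= n_both) for X ~ Poisson(expected), via 1 - P(X <= n_both - 1).
    term = exp(-expected)  # P(X = 0)
    cdf_below = 0.0
    for k in range(n_both):
        cdf_below += term
        term *= expected / (k + 1)  # P(X = k + 1) from P(X = k)
    return n_both, expected, 1.0 - cdf_below


# Hypothetical toy data: (x, y) locations and event times in days.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (0.5, 0.5)]
times = [0, 2, 3, 30]

observed, expected, p_poisson = knox_poisson(points, times, D=2.0, T=7)
print(observed, expected, p_poisson)
```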
An alternative method, however, which has been used in recent work concerned with crime (e.g. Johnson et al., 2007), is to employ a Monte Carlo approach. Rather than computing a theoretical value, this involves examining the values of the statistic in question (in this case, the number of close pairs) for a number of explicitly-constructed alternative datasets, generated under the assumption of the null hypothesis.
There are a number of ways in which the construction of these datasets can be performed, but a popular one is to use a permutation approach. Starting with the observed event data, sets of randomised events are generated by repeatedly permuting the timings of events, while maintaining the spatial information. Denoting the permutation at a given iteration by $\sigma$, the pair-wise comparison between two events $i$ and $j$ therefore involves comparison of their true spatial locations $x_i$ and $x_j$ (as before) but their permuted time-points $t_{\sigma(i)}$ and $t_{\sigma(j)}$. In other words, the temporal components of the events are shuffled, so that any alignment with the spatial components will be broken down. These sets of events therefore correspond to what would be expected under the null hypothesis: if there is no association between spatial and temporal distributions, the shuffling ought to make no significant difference to the number of close pairs observed.
The permutation approach is also appealing in another respect, which is that it is based entirely on observed data. Rather than synthesising events, the observed information is simply restructured; for each event set constructed, therefore, the spatial and temporal distributions are identical to the observed case. Anomalous results cannot, therefore, be ascribed to a change in either of the marginal distributions.
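A single permutation step might be sketched as follows. The helper `permute_times` and the toy data are hypothetical; the point of the example is only that the locations are left untouched and that the multiset of times is unchanged, so both marginal distributions are preserved.

```python
import random


def permute_times(times, rng):
    """Reassign the observed times to events via a random permutation sigma."""
    sigma = list(range(len(times)))
    rng.shuffle(sigma)
    # Event i keeps its location x_i but receives the time t_sigma(i).
    return [times[sigma[i]] for i in range(len(times))]


# Hypothetical toy data: (x, y) locations and event times in days.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (0.5, 0.5)]
times = [0, 2, 3, 30]

rng = random.Random(0)  # seeded only so that the sketch is reproducible
shuffled_times = permute_times(times, rng)

# The same time values are present; only their assignment to events changes.
print(sorted(shuffled_times) == sorted(times))
```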
Given this method of generating randomised sets of events (the permutation of one dimension), the remaining analysis is simple. A number, $n_K$, of sets of events are constructed in this way, and the statistic of interest (the number of close pairs) is computed in each case; for the shuffled data, this is denoted $\tilde{S}_K$. The resulting values can then be used as a reference distribution against which the true observed value can be compared.
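The construction of the reference distribution can be sketched as below, again assuming Euclidean distance and numeric times. The names `count_close_pairs` and `reference_distribution`, the `seed` parameter and the toy data are hypothetical choices made for this illustration.

```python
import random
from itertools import combinations
from math import hypot


def count_close_pairs(points, times, D, T):
    """Number of pairs within distance D and within time T of each other."""
    return sum(
        1
        for i, j in combinations(range(len(points)), 2)
        if hypot(points[i][0] - points[j][0], points[i][1] - points[j][1]) <= D
        and abs(times[i] - times[j]) <= T
    )


def reference_distribution(points, times, D, T, n_K, seed=None):
    """Close-pair counts for n_K datasets with permuted event times."""
    rng = random.Random(seed)
    shuffled = list(times)  # copy, so the observed times are not modified
    values = []
    for _ in range(n_K):
        rng.shuffle(shuffled)  # permute times; locations stay fixed
        values.append(count_close_pairs(points, shuffled, D, T))
    return values


# Hypothetical toy data: (x, y) locations and event times in days.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (0.5, 0.5)]
times = [0, 2, 3, 30]

shuffled_counts = reference_distribution(points, times, D=2.0, T=7, n_K=99, seed=1)
print(len(shuffled_counts), min(shuffled_counts), max(shuffled_counts))
```

Each call to `shuffle` starts from the previous ordering rather than the observed one; since every ordering remains equally likely, this does not affect the distribution of the shuffled statistic.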
The deviation from this distribution can be quantified: if $r_K$ is the rank at which $S_K$ would appear in an ordered list of the $\tilde{S}_K$ values generated from the shuffled data, then, as proposed by North et al. (2002), a pseudo-significance is given by:

$$p = \frac{r_K}{n_K + 1}. \tag{2.2}$$
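Equation (2.2) can be computed directly from the shuffled values, as in the sketch below. It assumes that the list is ordered from largest to smallest and that the observed value is placed after any ties, which is the conservative choice; the text does not specify the tie-breaking rule, and the function name `pseudo_significance` is hypothetical.

```python
def pseudo_significance(observed, shuffled_values):
    """Pseudo-significance p = r_K / (n_K + 1)."""
    n_K = len(shuffled_values)
    # Assumption: descending order, with the observed value ranked after ties,
    # so r_K is one more than the number of shuffled values >= observed.
    r_K = 1 + sum(1 for value in shuffled_values if value >= observed)
    return r_K / (n_K + 1)


# Hypothetical close-pair counts from nine shuffled datasets.
shuffled_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

p_extreme = pseudo_significance(10, shuffled_values)  # exceeds every shuffle
p_typical = pseudo_significance(5, shuffled_values)   # mid-range value
print(p_extreme, p_typical)
```

Under these assumptions the smallest attainable value is $1/(n_K + 1)$, so the number of permutations bounds the resolution of the pseudo-significance.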
Furthermore, the magnitude of the effect can be estimated by computing the z-score of $S_K$ relative to the null distribution. Significant deviation from the reference distribution indicates that it is improbable that the observed data could have been generated if there were no association between the timings and locations of events.
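The z-score can be sketched as the observed count minus the mean of the shuffled counts, divided by their standard deviation. The use of the sample standard deviation is an assumption (the text does not say which estimator is intended), and the function name `knox_z_score` and the example values are hypothetical.

```python
from statistics import mean, stdev


def knox_z_score(observed, shuffled_values):
    """z-score of the observed statistic relative to the shuffled values."""
    mu = mean(shuffled_values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sigma = stdev(shuffled_values)
    if sigma == 0:
        raise ValueError("shuffled values are constant; z-score is undefined")
    return (observed - mu) / sigma


# Hypothetical close-pair counts from eight shuffled datasets.
shuffled_values = [2, 4, 4, 4, 5, 5, 7, 9]

z = knox_z_score(8, shuffled_values)
print(z)
```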