2.3 Methods
2.3.8 Statistical analysis
Having introduced the concept of network motifs and chains, and the interpretation of such features in a spatio-temporal context, it remains to discuss the methods by which the prevalence of such subgraphs can be evaluated. To emphasise again, the motivation for this analysis is to gain additional insight into sets of events which are known to be clustered (this having been demonstrated via an established statistical test, such as the Knox test) but where the exact nature of the clustering is unknown. Several factors specific to the case of event networks mean that established techniques cannot be applied without modification.
The process of counting subgraphs of small order in a given network, on which this analysis rests, is well-documented. Brute-force enumeration can be used for sufficiently small networks (and is used in the work in this chapter) but, where this is computationally prohibitive, efficient methods based on sampling are also available (Kashtan et al., 2004; Wernicke & Rasche, 2006). The primary technical question shifts, therefore, to that of which ‘random’ networks the observed data should be compared against. These networks should correspond to situations for which the level of clustering is the same, in a purely pair-wise sense, as for the original data, in order to gain insight beyond that which can be found by existing means. Since the method is intended to be applied to data for which clustering is already known to be present, this is a crucial point.
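The brute-force enumeration mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation used in this chapter: it assumes a directed network stored as a set of (source, target) pairs with no self-loops, considers only induced subgraphs of three vertices, and groups them by link count alone. The function name `count_connected_triples` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_connected_triples(vertices, links):
    """Count induced, weakly connected 3-vertex subgraphs by link count.

    `links` is a set of directed (source, target) pairs. Grouping by link
    count is a simplification: a full motif census would group by
    isomorphism class instead.
    """
    counts = Counter()
    for triple in combinations(sorted(vertices), 3):
        members = set(triple)
        induced = [(u, v) for (u, v) in links if u in members and v in members]
        # Three vertices are weakly connected if and only if at least two of
        # the three possible vertex pairs are joined by a link.
        joined_pairs = {frozenset(link) for link in induced}
        if len(joined_pairs) >= 2:
            counts[len(induced)] += 1
    return counts


# A chain 0 -> 1 -> 2 -> 3 with one additional link 0 -> 2.
example = count_connected_triples(range(4), {(0, 1), (1, 2), (2, 3), (0, 2)})
print(dict(example))  # one 3-link triple and two 2-link triples
```

Every triple of vertices is visited once, so the cost grows cubically with the number of vertices, which is why this approach is only practical for small networks.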
One natural option would be to emulate the approach of the Knox test, described in Section 2.3.2, by employing a permutation approach. Such an approach, however, is immediately seen to be problematic upon consideration of the relationship between link count and subgraph frequency. It is a simple observation that, for networks with a constant number of vertices, there will exist an association between the number of links and the frequency of subgraphs; to choose two pathological cases, an empty network contains no connected subgraphs, whereas every induced subgraph of a complete network is itself complete. Away from these extremes, an increasing density of links will generally imply a higher number of connected subgraphs overall (ultimately favouring more dense subgraphs and longer chains).
The issue is, therefore, that the event network of any set of clustered events (if the definition of ‘clustered’ is taken to be ‘that which gives a significant Knox test’) will necessarily have a lower link count when the temporal data are permuted, since that is precisely what is indicated by the Knox test. Any difference between observed subgraph counts and those under permutation might, therefore, simply be an artefact of the change in density, rather than any effect over and above the known clustering. Since observed subgraph counts are expected to be anomalous under such a comparison, they are therefore of little inferential value.
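The drop in link count under permutation can be illustrated with a small sketch. It rests on assumptions not fixed by the text above: events are stored as planar locations with separate times, two events are linked when they lie within distance D in space and within a temporal threshold (here called T, an assumed name) in time, and the toy data are invented purely for illustration. The function names are hypothetical.

```python
import math
import random
from itertools import combinations


def link_count(points, times, D, T):
    """Number of event pairs within distance D in space and T in time."""
    n = len(points)
    return sum(
        1
        for i, j in combinations(range(n), 2)
        if math.dist(points[i], points[j]) <= D and abs(times[i] - times[j]) <= T
    )


def permuted_link_counts(points, times, D, T, n_perm, seed=0):
    """Link counts after randomly re-assigning the event times to locations."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_perm):
        shuffled = list(times)
        rng.shuffle(shuffled)
        counts.append(link_count(points, shuffled, D, T))
    return counts


# Toy clustered data (invented): five well-separated locations, each hosting
# two events that occur one time unit apart.
points = [(10.0 * k, 0.0) for k in range(5) for _ in range(2)]
times = [100.0 * k + offset for k in range(5) for offset in (0.0, 1.0)]

observed = link_count(points, times, D=1.0, T=1.0)
permuted = permuted_link_counts(points, times, D=1.0, T=1.0, n_perm=200)
print(observed, max(permuted), sum(permuted) / len(permuted))
```

In this toy case every spatially close pair is also temporally close, so the observed network has five links and no permutation can produce more. Any subgraph count that depends on density is therefore biased downwards in the permuted networks before higher-order structure is even considered.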
The implication of this argument, from an analytical perspective, is that it is necessary to maintain a constant link-count when generating random networks against which observed subgraph frequencies can be compared. This is certainly not a novel observation for motif analysis, and methods by which the issue can be addressed are outlined in previous work on the topic (e.g. Milo et al., 2002). The typical approach taken is to generate the required randomised networks by simply re-wiring the links of the original observed network; that is, by randomly re-assigning one or both of the end-points of individual links. In this way, no links are created or destroyed, and other structural features of the network (e.g. vertex degrees) can also be preserved, as in the configuration model (Newman et al., 2001). It is assumed that networks derived by such a method are a representative random sample of all networks which possess the prescribed properties.
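One common form of re-wiring is the double-edge swap, sketched below for a directed network. This is a generic illustration of the technique, not code from the works cited: it assumes links are stored as (source, target) pairs, and the function name `rewire` and its parameters are hypothetical.

```python
import random


def rewire(links, n_swaps, seed=0):
    """Randomise a directed network by repeated double-edge swaps.

    Each accepted swap replaces links (a, b) and (c, d) with (a, d) and
    (c, b), so the link count and every in- and out-degree are unchanged.
    """
    rng = random.Random(seed)
    current = set(links)
    for _ in range(n_swaps):
        (a, b), (c, d) = rng.sample(sorted(current), 2)
        # Reject swaps that would create a self-loop or a duplicate link.
        if a == d or c == b or (a, d) in current or (c, b) in current:
            continue
        current -= {(a, b), (c, d)}
        current |= {(a, d), (c, b)}
    return current


original = {(0, 1), (1, 2), (2, 3), (3, 4), (0, 2)}
randomised = rewire(original, n_swaps=100)
print(sorted(randomised))
```

The sketch constrains only the link count and the vertex degrees; it imposes no other condition on the structure of the resulting network.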
This approach, however, is not applicable in the context of event networks. As noted previously, the fact that event networks are derived from spatial and temporal data constrains the space of possible configurations: they must, for example, be cycle-free (since a cycle would imply that one event occurred both before and after another).
In addition, since links represent spatial proximity, their existence is not mutually independent, and there exist certain combinations which cannot arise from any possible set of events. To explain this, it is useful to return to the concept of the unit disk graph (UDG), of which the spatial proximity network GDd is an example. A realisation of a given UDG, G0, is a set of points in space for which G0 is the induced UDG; that is, a set of points whose proximity is as indicated by G0. Any UDG must have at least one realisation, by definition. A key problem in the field therefore concerns the question of whether a given arbitrary graph has a realisation, i.e. whether it is a UDG.
Not all graphs do have a realisation: as shown by Marathe et al. (1995), for example, the 7-vertex ‘star’ graph shown in Figure 2.6 does not. This is because, in 2-dimensional space, it is impossible for six circles of equal radius to intersect with another similar circle (the red circle in Figure 2.6b) without at least two of the six intersecting with each other. Aside from stylised examples such as this, however, establishing the (non-)existence of a realisation for arbitrary graphs is non-trivial; indeed, it has been shown by Breu & Kirkpatrick (1998) to be NP-hard.
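The geometric argument for the star graph can be made explicit with a short derivation. The notation is assumed for illustration: c is the central location, p_1, ..., p_6 are the outer locations, r_i = |p_i - c| ≤ D, and θ_ij is the angle at c between p_i and p_j; proximity is taken to mean a distance of at most D. If some r_i = 0 then p_i coincides with c and lies within D of every other outer location immediately. Otherwise, the six angular gaps around c sum to 2π, so some pair of outer locations has θ_ij ≤ π/3 and hence cos θ_ij ≥ 1/2. Taking r_i ≥ r_j without loss of generality, the law of cosines gives:

```latex
\begin{align*}
  |p_i - p_j|^2 &= r_i^2 + r_j^2 - 2 r_i r_j \cos\theta_{ij} \\
                &\le r_i^2 + r_j^2 - r_i r_j \\
                &= r_i^2 - r_j (r_i - r_j) \\
                &\le r_i^2 \;\le\; D^2 .
\end{align*}
```

At least two of the outer locations therefore lie within D of each other, so a seventh link is forced.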
The implication of these results for the present study is that some networks do not arise as the event network of any possible set of events. Motivated by this, it is therefore helpful to define a valid network as one which is the event network for at least one set of spatio-temporal event data. Only networks which are valid in this sense need be considered in analysis, but it is clear that: a) not all networks are valid, and b) determination of the validity of a given network is, in general, computationally prohibitive. It is for this reason that a simple re-wiring approach is inappropriate: it is liable to generate invalid networks, and there is no natural way to adapt the process to ensure validity.
Figure 2.6: A ‘star’ graph of 7 vertices, which cannot arise as an event network due to geometric constraints. Were the graph shown in (a) to be an event network, it would imply that the events represented by the outer vertices all occurred within distance D of that represented by the central vertex. However, by plotting circles of diameter D around each location in (b), it is clear that, if the outer circles intersect with the red central circle as required, at least two of the outer circles must also intersect. The corresponding locations must therefore lie within D of each other, meaning that an additional link must exist.
An even more significant, though more subtle, problem concerns the fact that, even amongst valid networks, not all are equally likely to arise from random event data. The mapping from event datasets to their event networks is many-to-one, and the number of datasets which give rise to each event network is not equal; some arrangements will naturally arise more readily than others. Given that the ultimate interest of the analysis is to establish the randomness (or departure from it) of the event data, it is evident that, even if the set of all event networks could be characterised succinctly, randomly sampling from it would still be inadequate.
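The unequal likelihood of different event networks can be seen in a small simulation. The sketch below is illustrative only and makes several assumptions: ‘random’ event data are taken to be uniform on the unit square and unit time interval, a directed link runs from the earlier to the later of two events that lie within distance D and within an assumed temporal threshold T, vertices are labelled by temporal order, and the threshold values are arbitrary. The function names are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import combinations


def event_network(events, D, T):
    """Links from earlier to later events that are close in space and time.

    Events are (x, y, t) triples; vertices are labelled by temporal order.
    """
    ordered = sorted(events, key=lambda e: e[2])
    return frozenset(
        (i, j)
        for i, j in combinations(range(len(ordered)), 2)
        if math.dist(ordered[i][:2], ordered[j][:2]) <= D
        and ordered[j][2] - ordered[i][2] <= T
    )


def tally_networks(n_events, n_samples, D, T, seed=0):
    """Frequency of each event network over uniformly random event sets."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_samples):
        events = [
            (rng.random(), rng.random(), rng.random()) for _ in range(n_events)
        ]
        tally[event_network(events, D, T)] += 1
    return tally


tally = tally_networks(n_events=3, n_samples=2000, D=0.3, T=0.3)
for network, count in tally.most_common():
    print(sorted(network), count)
```

Even with only three events, the tally is far from uniform across the networks that appear, with the empty network dominating at these thresholds. Drawing uniformly from the set of valid networks would therefore not reproduce the distribution induced by random event data.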