
1.2 Statistical methodology for point patterns

1.2.1 Summary statistics

A time-honoured approach to spatial point pattern data is to calculate a summary statistic that is intended to capture an important feature of the pattern.

For the water striders data, the feature of interest is the spacing between the water strider larvae.

An appropriate summary statistic is the average distance from a larva to its nearest neighbour, the nearest other larva in the same pattern. This is appropriate because if the larvae are territorial we can expect them to try to increase the nearest-neighbour distance. The average nearest-neighbour distance is a numerical measure of the typical spacing between larvae.

For the three patterns in Figure 1.2 the average nearest-neighbour distances are 55, 49 and 54 mm, respectively. To interpret these values, we need a benchmark or reference value. Since our goal is to decide whether the water strider larvae are territorial or not, a suitable benchmark is the average nearest-neighbour distance that would be expected if the water striders are not territorial, that is, if the points were placed completely at random. This benchmark is slightly different for the three patterns in Figure 1.2, because they are slightly different in the number of points and the size of frame. A simple solution is to normalise the values, dividing the observed distance by the benchmark distance for each pattern. This ratio, called the Clark-Evans index [155], should be about 1 if the larvae are completely random, and greater than 1 if the larvae are territorial. The

suggesting that the larvae are territorial. See Chapter 8 for further discussion.
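The normalisation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the benchmark is taken to be the standard value 1/(2√λ) for a completely random pattern with λ points per unit area, no edge correction is applied, and the point pattern is a hypothetical regular grid rather than the water striders data.

```python
import math

def mean_nn_distance(points):
    """Average distance from each point to its nearest other point."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        total += nearest
    return total / len(points)

def clark_evans(points, area):
    """Observed mean nearest-neighbour distance divided by its benchmark.

    The benchmark is the expected value under complete randomness,
    1 / (2 * sqrt(intensity)). No edge correction is applied here
    (an assumption of this sketch).
    """
    intensity = len(points) / area  # points per unit area
    benchmark = 1.0 / (2.0 * math.sqrt(intensity))
    return mean_nn_distance(points) / benchmark

# Hypothetical, perfectly regular pattern: a 5 x 5 grid in a 50 x 50 frame.
grid = [(5 + 10 * i, 5 + 10 * j) for i in range(5) for j in range(5)]
index = clark_evans(grid, area=50 * 50)
print(index)  # well above 1, as expected for evenly spaced points
```

Because every grid point has a neighbour exactly 10 units away while the benchmark distance is 5 units, the index for this artificial pattern is about 2.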

A summary statistic can be useful if it is appropriate to the application, and is defined in a simple way, so that its values can easily be interpreted. However, by reducing a spatial point pattern to a single number, we discard a lot of information. This may weaken the evidence, to the point where it is impossible to exclude other explanations. For example, the Clark-Evans index is very sensitive to spatial inhomogeneity: index values greater than 1 can also be obtained if the points are scattered independently but unevenly over the study region. The analysis above does not necessarily support the conclusion that the water strider larvae are territorial, until we eliminate the possibility that the water striders have a preference for one side of the pond over another.

Summary functions are often used instead of numerical summaries. Figure 1.15 shows the estimated pair correlation functions for the three water strider patterns in Figure 1.2. For each value of distance r, the pair correlation g(r) is the observed number of pairs of points in the pattern that are about r units apart, divided by the expected number that would be obtained if the points were completely random. Pairs of water striders separated by a distance of 3 centimetres (say) are much less common than would be expected if their spatial arrangement was completely random. See Chapter 7.
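The ratio "observed pairs divided by expected pairs" can be illustrated with a deliberately naive estimator. The sketch below bins the inter-point distances and, as an assumption, ignores edge effects, so the expected number of pairs in each distance band is simply the total number of pairs times the area of the ring divided by the area of the frame. The function name and the grid pattern are hypothetical.

```python
import math

def pair_correlation(points, area, r_max, n_bins):
    """Naive binned estimate of g(r), with no edge correction."""
    n = len(points)
    width = r_max / n_bins
    counts = [0] * n_bins
    # Count unordered pairs of points by the distance band they fall in.
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d < r_max:
                counts[min(int(d / width), n_bins - 1)] += 1
    n_pairs = n * (n - 1) / 2
    g = []
    for k in range(n_bins):
        # Area of the ring between radii k*width and (k+1)*width.
        ring = math.pi * (((k + 1) * width) ** 2 - (k * width) ** 2)
        expected = n_pairs * ring / area  # expected count under randomness
        g.append(counts[k] / expected)
    return g

# Hypothetical regular pattern: 5 x 5 grid with spacing 10 in a 50 x 50 frame.
grid = [(5 + 10 * i, 5 + 10 * j) for i in range(5) for j in range(5)]
g = pair_correlation(grid, area=2500, r_max=12, n_bins=6)
print(g)
```

For this grid no two points are closer than 10 units, so the estimate is zero in every band below 10 and rises above 1 in the band that contains the grid spacing.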

[Figure 1.15: three panels, each showing g(r) against r (cm), with r from 0 to 12 and g(r) from 0.0 to 1.2]

Figure 1.15. Measuring correlation between points. Estimates of the pair correlation function (solid lines) for the three water strider patterns in Figure 1.2. Dashed horizontal line is the expected value if the patterns are completely random.

To decide whether the deviations in Figure 1.15 are statistically significant, a standard technique is to generate synthetic point patterns which are completely random, compute the pair correlation function estimates for these synthetic patterns, and plot the envelopes of these functions (minimum and maximum values of pair correlation for each distance r). See Figure 1.16. This can be interpreted as a statistical test of significance, with care (see Chapter 10).
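The envelope procedure can be sketched as follows, assuming a rectangular frame, a naive pair correlation estimator without edge correction, and a hypothetical observed pattern. With 39 simulations, the minimum and maximum at each distance give a pointwise two-sided level of 2/(39 + 1) = 5%. All function names here are illustrative.

```python
import math
import random

def pair_correlation(points, area, r_max, n_bins):
    """Naive binned estimate of g(r), with no edge correction."""
    n = len(points)
    width = r_max / n_bins
    counts = [0] * n_bins
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d < r_max:
                counts[min(int(d / width), n_bins - 1)] += 1
    n_pairs = n * (n - 1) / 2
    g = []
    for k in range(n_bins):
        ring = math.pi * (((k + 1) * width) ** 2 - (k * width) ** 2)
        g.append(counts[k] / (n_pairs * ring / area))
    return g

def csr_envelope(n, width, height, r_max, n_bins, n_sim=39, seed=0):
    """Pointwise min/max of g(r) over completely random patterns."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sim):
        # n independent, uniformly distributed points in the frame.
        pattern = [(rng.uniform(0, width), rng.uniform(0, height))
                   for _ in range(n)]
        sims.append(pair_correlation(pattern, width * height, r_max, n_bins))
    lower = [min(s[k] for s in sims) for k in range(n_bins)]
    upper = [max(s[k] for s in sims) for k in range(n_bins)]
    return lower, upper

# Hypothetical observed pattern: a regular 5 x 5 grid in a 50 x 50 frame.
grid = [(5 + 10 * i, 5 + 10 * j) for i in range(5) for j in range(5)]
observed = pair_correlation(grid, 2500, 12, 6)
lower, upper = csr_envelope(25, 50, 50, 12, 6)
# Distance bands where the observed value leaves the envelope.
outside = [k for k in range(6) if not lower[k] <= observed[k] <= upper[k]]
print(outside)
```

The same uncorrected estimator is applied to the observed and simulated patterns, so any edge-effect bias affects both sides of the comparison equally.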

Often the main goal is to detect and quantify trends in the density of points. The enterochromaffin-like cells in gastric mucosa (left panel of Figure 1.1) clearly become less dense as we move towards the interior of the stomach (towards the top of the picture). Figure 1.17 shows two ways of quantifying this trend. In the left panel, the spatial region has been divided into equal squares, and the number of points falling in each square has been counted and plotted. The numbers trend downwards as we move upwards. The right panel shows an estimate of the spatially varying density of points using kernel smoothing.
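Both ways of quantifying a trend are easy to sketch. The example below is a rough illustration on synthetic data, not the gastric mucosa data: the points are generated so that they thin out towards the top of a unit square, the smoothing kernel is assumed to be Gaussian, no edge correction is applied, and the bandwidth of 0.1 is an arbitrary choice.

```python
import math
import random

def quadrat_counts(points, width, height, side):
    """Count points in each square of the given side length.

    Row 0 is the bottom row of squares; column 0 is the leftmost column.
    """
    n_cols = math.ceil(width / side)
    n_rows = math.ceil(height / side)
    counts = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = min(int(x / side), n_cols - 1)
        row = min(int(y / side), n_rows - 1)
        counts[row][col] += 1
    return counts

def kernel_intensity(points, x, y, bandwidth):
    """Gaussian kernel estimate of points per unit area at (x, y).

    No edge correction is applied (an assumption of this sketch).
    """
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2 * bandwidth ** 2))
    return total / (2 * math.pi * bandwidth ** 2)

# Synthetic pattern in the unit square, denser near the bottom edge.
rng = random.Random(1)
pts = [(rng.random(), rng.random() ** 2) for _ in range(200)]

counts = quadrat_counts(pts, 1.0, 1.0, 0.25)
for row in reversed(counts):  # print the top row first, like a picture
    print(row)

low = kernel_intensity(pts, 0.5, 0.1, bandwidth=0.1)
high = kernel_intensity(pts, 0.5, 0.9, bandwidth=0.1)
print(low, high)
```

Quadrat counts depend on the choice of square size, and the kernel estimate depends on the bandwidth; in both cases larger values give a smoother but less detailed picture of the trend.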

1.2.2 Statistical modelling and inference

Summary statistics work best in simple situations. In more complex situations it becomes difficult to adjust the summary statistic to ‘control’ for the effects of other variables. For example, the enterochromaffin-like cells (Figure 1.1) have spatially varying density. The Clark-Evans index cannot be used, unless we can think of a way of taking account of this spatial trend.

1 Using Donnelly’s edge correction [244].

Figure 1.16. Assessing statistical significance of correlation between points. Estimates of the pair correlation function (solid lines) for the three water strider patterns. Dashed horizontal line is the expected value if the patterns are completely random. Grey shading shows pointwise 5% significance bands.

Figure 1.17. Measuring spatial trend. Enterochromaffin-like cells in gastric mucosa (Figure 1.1) showing (Left) counts of points in each square of side length 0.2 units, (Right) kernel-smoothed density of points per unit area.

Analysing data using a summary statistic or summary function is ultimately unsatisfactory. If, for example, the conclusion from analysis of the water striders data is that there is insufficient evidence of territorial behaviour, then we have the lingering doubt that we may have discarded precious evidence by reducing the data to a single number. On the other hand if the conclusion is that there is evidence of territorial behaviour, then we have the lingering doubt that the summary statistic may have been ‘fooled’ by some other aspect of the data, such as the non-uniform density of points.

A more defensible approach to data analysis is to build a statistical model. This is a comprehensive description of the dataset, describing not only the averages, trends, and systematic relationships in the data, but also the variability of the data. It contains all the information necessary to simulate the data, i.e., to create computer-generated random outcomes of the model that should be similar to the observed data. For example, when we draw a straight line through a cloud of data points, a regression model tells us not only the position of the straight line, but also the scatter of the data points around this line. Given a regression model we can generate a new cloud of points scattered about the same line in the same way.

Statistical modelling is the best way to investigate relationships while taking account of other

all the variables that influence the data, we are able to account for (rather than ‘adjust for’) the effects of extraneous variables, and draw sound conclusions about the questions of interest.

A statistical model usually involves some parameters which control the strength of the relationships, the scale of variability, and so on. For example, a simple linear regression model says that the response variable y is related to the explanatory variable x by y = α + βx + e, where β is the slope of the line, α is the intercept, and e is a random error with standard deviation σ. The numbers α, β, σ are the parameters of the regression model.

Fitting a model to data means selecting appropriate values of the model parameters so that the model is a good description of the data. For example, fitting a linear regression model means finding the ‘line of best fit’ (choosing the best values for the intercept α and slope β of the line) and also finding the ‘standard deviation of best fit’ (choosing the best value of σ to describe the scatter of data points around the line). The best-fit estimate of the model parameters usually turns out to be a sensible summary statistic in its own right.
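The two roles of a model, simulating data and being fitted to data, can be illustrated with the regression example. The sketch below assumes normally distributed errors and uses ordinary least squares, with σ estimated from the residual sum of squares divided by n − 2; the parameter values and function names are hypothetical.

```python
import math
import random

def simulate_regression(alpha, beta, sigma, xs, rng):
    """Generate responses y = alpha + beta * x + e with normal errors e."""
    return [alpha + beta * x + rng.gauss(0, sigma) for x in xs]

def fit_regression(xs, ys):
    """Least-squares estimates of intercept, slope and error std deviation."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx                 # slope of the line of best fit
    alpha = y_bar - beta * x_bar     # intercept of the line of best fit
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    sigma = math.sqrt(rss / (n - 2)) # scatter of the points about the line
    return alpha, beta, sigma

# Simulate from a model with known (hypothetical) parameters, then refit.
rng = random.Random(42)
xs = [i / 10 for i in range(200)]
ys = simulate_regression(alpha=2.0, beta=0.5, sigma=1.0, xs=xs, rng=rng)
alpha_hat, beta_hat, sigma_hat = fit_regression(xs, ys)
print(alpha_hat, beta_hat, sigma_hat)
```

The fitted values should lie close to the parameters used in the simulation, and calling `simulate_regression` again with the fitted values produces a new cloud of points scattered about the fitted line in the same way.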

Statistical modelling may seem like a very complex enterprise. Our correspondents often say “I’m not interested in modelling my data; I only want to analyse it.” However, any kind of data analysis or data manipulation is equivalent to imposing assumptions. In taking the average of some numbers, we implicitly assume that the numbers all come from the same population. If we conclude that something is ‘statistically significant’, we have implicitly assumed a model, because the p-value is a probability according to a model.

The purpose of statistical modelling is to make these implicit assumptions explicit. By doing so, we are able to determine the best and most powerful way to analyse data, we can subject the assumptions to criticism, and we are more aware of the potential pitfalls of analysis. If we “only want to do data analysis” without statistical models, our results will be less informative and more vulnerable to critique.

Using a model is not a scientific weakness: it is a strength. In statistical usage, a model is always tentative; it is assumed for the sake of argument. In the famous words of George Box: “All models are wrong, but some are useful.” We might even want a model to be wrong, that is, we might propose the model in order to refute it, by demonstrating that the data are not consistent with that model.

[Figure 1.18: two panels, each showing Intensity (0 to 10000) against elev (120 to 160)]

Figure 1.18. Modelling dependence of spatial trend on a covariate. Beilschmiedia trees data (Figure 1.10). Estimated mean density of points assuming it is a function of terrain elevation. Left: nonparametric estimate assuming smooth function. Right: parametric estimate assuming log-quadratic function. Grey shading indicates pointwise 95% confidence intervals.

It is often instructive to compare results obtained with different models. Figure 1.18 shows two different estimates of the density (in trees per square kilometre) of Beilschmiedia trees as a function of terrain elevation, based on the tropical rainforest data of Figure 1.10. The estimates are derived from two different statistical models, which assume that the Beilschmiedia tree locations are random, but may have a preference for higher or lower altitudes. The left panel is a nonparametric estimate, based on the assumption that this habitat preference is a smoothly varying function of terrain elevation. The right panel is a parametric estimate, based on a very specific model where the logarithm of forest density is a quadratic function of terrain elevation.
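To make the parametric model concrete, a log-quadratic relationship can be written as density = exp(b0 + b1·z + b2·z²), where z is terrain elevation. The sketch below simply evaluates such a curve; the coefficient values are invented for illustration and are not the fitted values behind Figure 1.18.

```python
import math

def log_quadratic_intensity(elev, b0, b1, b2):
    """Density whose logarithm is a quadratic function of elevation."""
    return math.exp(b0 + b1 * elev + b2 * elev ** 2)

def peak_elevation(b1, b2):
    """Elevation at which the density is greatest (requires b2 < 0)."""
    if b2 >= 0:
        raise ValueError("the curve has a maximum only when b2 is negative")
    return -b1 / (2.0 * b2)

# Hypothetical coefficients, chosen only to give a curve peaking mid-range.
b0, b1, b2 = -130.0, 1.9, -0.0065
peak = peak_elevation(b1, b2)
for elev in (120, 130, 140, 150, 160):
    print(elev, log_quadratic_intensity(elev, b0, b1, b2))
print(peak)
```

When b2 is negative the curve rises to a single maximum and then falls, which is how a model of this kind can express a preference for a particular range of elevations.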

[Figure 1.19: three panels, each with vertical axis labelled ‘potential’ (0.0 to 1.0) against r (cm), with r from 0 to 6]

Figure 1.19. Modelling dependence between points. Fitted ‘soft core’ interaction potentials for water striders.

Figure 1.19 shows one of the results of fitting a Gibbs model to the water striders data. Such models were first developed in physics to explain the behaviour of gases. The points represent gas molecules; between each pair of molecules there is a force of repulsion, depending on the distance between them. In our analysis the parameters controlling the strength and scale of the repulsion force have been estimated from data. Each panel of Figure 1.19 shows the interaction probability factor c(r) = exp(−U(r)) for a pair of points separated by a distance r, where U(r) is the potential (total work required to push two points together from infinite distance to distance r). A value of c(r) ≈ 0 means it is effectively forbidden for two points to be as close as r, while a value of c(r) ≈ 1 indicates that points separated by the distance r are ‘indifferent’ to each other. These graphs suggest (qualitatively) that the water striders do exhibit territorial behaviour (and the model is quite suitable for this application). Figure 1.20 shows simulated realisations of the Gibbs model for the water striders data. Gibbs models are explained in Chapter 13.
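The relationship between the potential U(r) and the factor c(r) can be sketched numerically. The text does not state the formula of the ‘soft core’ potential, so the example assumes one commonly used form, U(r) = (σ/r)^(2/κ) with scale σ > 0 and 0 < κ < 1; the parameter values are hypothetical and are not the fitted values for the water striders.

```python
import math

def softcore_potential(r, sigma, kappa):
    """Assumed soft core potential U(r) = (sigma / r) ** (2 / kappa), r > 0."""
    return (sigma / r) ** (2.0 / kappa)

def interaction_factor(r, sigma, kappa):
    """Interaction probability factor c(r) = exp(-U(r))."""
    return math.exp(-softcore_potential(r, sigma, kappa))

# Hypothetical parameter values: sigma in centimetres, kappa dimensionless.
sigma, kappa = 1.5, 0.5
for r in (0.3, 1.0, 1.5, 3.0, 6.0):
    print(r, interaction_factor(r, sigma, kappa))
```

Under this assumed form c(r) is close to 0 for small r, equals exp(−1) at r = σ, and approaches 1 as r grows, which matches the qualitative interpretation of near-forbidden short distances and indifference at long distances.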

Figure 1.20. Simulated realisations of the fitted soft core model for the water striders.

The main concern with modelling is, of course, that the model could be wrong. Techniques for validating a statistical model make it possible to decide whether the fitted model is a good fit overall, to criticise each assumption of the model in turn, to understand the weaknesses of the analysis, and to detect anomalous data. Statistical modelling is a cyclic process in which tentative models are fitted to the data, subjected to criticism, and used to suggest improvements to the model.


Figure 1.21. Validation of trend model for Beilschmiedia trees. Left: influence of each point (×1000). Right: smoothed Pearson residual field.

Figure 1.21 shows two tools for model validation, applied to the model in which Beilschmiedia density is a log-quadratic function of terrain elevation. The left panel shows the influence of each data point, a value measuring how much the fitted model would change if the point were deleted from the dataset. Circle diameter is proportional to the influence of the point. There is a cluster of relatively large circles at the bottom left of the plot, indicating that these data points have a disproportionate effect on the fitted model. Either these data points are anomalies, or the assumed model is not appropriate for the data.

The right panel of Figure 1.21 is a contour plot of the smoothed residuals. Analogous to the residuals from a regression model, the residuals from a point process model are the difference between observed and expected outcomes. The residuals are zero if the model fits perfectly. Substantial deviations from zero suggest that this model does not fit well, and indicate the locations where it does not fit.
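The idea of "observed minus expected" can be illustrated with counts in subregions. The sketch below is a simplified, count-based analogue and not the smoothed residual field of Figure 1.21: it assumes the study region has been divided into squares, that the fitted model supplies an expected count for each square, and it uses the classical Pearson scaling by the square root of the expected count. The numbers are invented.

```python
import math

def raw_residuals(observed, expected):
    """Observed count minus expected count, for each subregion."""
    return [o - e for o, e in zip(observed, expected)]

def pearson_residuals(observed, expected):
    """Raw residuals scaled by the square root of the expected count."""
    return [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]

# Hypothetical counts in three squares and the counts a fitted model expects.
observed = [12, 3, 9]
expected = [9.0, 4.0, 9.0]
print(raw_residuals(observed, expected))
print(pearson_residuals(observed, expected))
```

A positive residual marks a square containing more points than the model predicts, a negative residual marks one containing fewer, and the scaling makes squares with different expected counts roughly comparable.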

1.3 About this book