Chapter 3 — Spatial data disaggregation

3.4 Analysis of error

In the previous sections, we reviewed the data-driven and theory-driven methodologies for spatially disaggregating socio-economic data. Because spatial disaggregation is an uncertain process, its results will contain errors whatever the technique. The accuracy of the disaggregation is therefore a significant issue for the urban studies in which these values are used. Together with the development of new methods, there has arisen a need to validate those methods by evaluating their accuracy. However, although spatial disaggregation has been researched for a couple of decades, only recently has there been a specific attempt to quantify these errors. In this section, we review the methods used in the previous literature to validate spatial disaggregation techniques.

3.4.1 Statistical measure of errors and visualisation

Most studies evaluate the error of spatial disaggregation techniques against actual values for the target zones that are known from independent data sources (Flowerdew, 1988; Flowerdew and Green, 1991, 1992; Goodchild et al., 1993; Reibel, 2005; Reibel and Aditya, 2006). These studies use a range of descriptive statistics to measure the degree and pattern of the disaggregation error, including the mean error, maximum error, percentiles and standard deviation. Some error measurements can be visualised as choropleth maps that indicate the spatial characteristics of the error distribution. For example, Langford et al. (1991) investigated the accuracy of three regression-based methods by comparing the error distributions of the different techniques. Another good example is Eicher and Brewer (2001), who mapped the percentage error and the count error to compare the accuracy of

Overall error statistics are widely used to describe the accuracy of a technique. These include the mean absolute error (MAE) (Langford, 2006), the mean percentage error (MPE) (Goodchild et al., 1993), and the root mean square error (RMSE) (Fisher and Langford, 1995; Eicher and Brewer, 2001; Nordhaus, 2002; Gregory, 2002; Gregory and Paul, 2005; Langford, 2006). The RMSE is the most frequently used measure of the differences between the values predicted by a model and the values actually observed. These individual differences are also called residuals, and the RMSE aggregates them into a single measure of estimation error. It is calculated using the formula:
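In the standard definition, sketched here with symbols that are assumptions rather than the source's own notation ($n$ target zones, $\hat{y}_i$ the estimated value for target zone $i$, and $y_i$ the value known from independent data), the RMSE is:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}
```

Each term $\hat{y}_i - y_i$ is the residual for one target zone; squaring before averaging means that a few large residuals weigh more heavily on the RMSE than on the MAE.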

Gregory (2002) points out that if the value (for instance, employment) of the target zones varies widely across the study area, an error that is almost insignificant in target zones with a large population base would be serious in zones with a small one. In this case, the RMSE can be formulated differently by calculating the proportional error, using:

Fisher and Langford's (1995) concern is that the RMSE depends heavily on the mean population of the target zones, which is itself a reflection of the number of target zones. To deal with this, a standardised coefficient of variation (CV) is introduced (see Fisher and Langford, 1995; Langford, 2006). The standardised CV is obtained by dividing the RMSE by the mean target zone population. The analysis of the CV is very useful in comparative studies. For example, Eicher and Brewer (2001) employed statistical measures of the CV (95% confidence interval and overall mean) to evaluate the comparative accuracy of different dasymetric methods. Other statistical measures of the accuracy of spatial disaggregation include simple regression, t-test statistics and outlier distribution (see Cockings et al., 1997). Overall, the choice among these measures depends on the requirements of the study.
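The overall statistics discussed above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code of any cited study: the function name `error_measures` is hypothetical, the MPE is taken over absolute percentage errors, and the proportional RMSE is one plausible reading (residuals divided by the observed value before squaring) rather than necessarily Gregory's (2002) exact formulation.

```python
import math


def error_measures(estimated, observed):
    """Overall error statistics for paired target-zone values.

    estimated: values produced by a disaggregation technique
    observed:  values known from an independent data source
    """
    if len(estimated) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(observed)
    residuals = [e - o for e, o in zip(estimated, observed)]

    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)

    # Relative measures are undefined where the observed value is zero;
    # skipping those zones is an assumption made for this sketch.
    ratios = [r / o for r, o in zip(residuals, observed) if o != 0]
    if ratios:
        mpe = 100 * sum(abs(p) for p in ratios) / len(ratios)
        prop_rmse = math.sqrt(sum(p * p for p in ratios) / len(ratios))
    else:
        mpe = prop_rmse = None

    # Standardised CV: RMSE divided by the mean target-zone value.
    mean_observed = sum(observed) / n
    cv = rmse / mean_observed if mean_observed != 0 else None

    return {"mae": mae, "mpe": mpe, "rmse": rmse,
            "prop_rmse": prop_rmse, "cv": cv}
```

For example, `error_measures([110, 90, 45, 55], [100, 100, 50, 50])` gives an MAE of 7.5 and an RMSE of about 7.91. The proportional RMSE treats the four zones as equally wrong (each is 10% off), which is the behaviour the proportional formulation is meant to capture when zone sizes differ widely.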

3.4.2 Monte Carlo simulation

From a methodological perspective, some of the literature has questioned whether the results obtained from a single examination of any particular method can say much about the reliability and general applicability of that method (Fisher and Langford, 1995; Sadahiro, 1999). This is because the distribution of spatial data is very complex, and models of its pattern often depend on the sampled data. There is therefore a reasonably high probability that the detected variation occurred by chance, and a technique that is examined solely against the observed data may generalise poorly. In this regard, Fisher and Langford (1995) employed a Monte Carlo simulation to evaluate the accuracy of different spatial disaggregation methods across diverse combinations of source and target zone systems. Monte Carlo simulation allows spatial disaggregation methods to be tested in a variety of geographical circumstances, so that the results have a wider applicability. Cockings et al. (1997) further produced predictive models of the errors in spatial disaggregation using a Monte Carlo simulation. The models revealed the relationships between the parameters of the target zones (perimeter, shape and population density) and the mean error produced by the simulations. In addition, the mean errors and the standard deviations of errors of the different techniques were visualised for each target zone, which allows detailed comparisons to be made.
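The idea of testing a method over many randomly generated zone systems can be sketched as follows. This is a toy illustration, not Fisher and Langford's (1995) experimental design: the study area is reduced to a one-dimensional strip of cells, cell populations are drawn from an exponential distribution, simple areal weighting stands in for the method under test, and all function names and parameter values are hypothetical.

```python
import math
import random
import statistics


def random_zones(n_cells, n_zones, rng):
    """Split cells 0..n_cells-1 into contiguous zones with random boundaries."""
    cuts = sorted(rng.sample(range(1, n_cells), n_zones - 1))
    bounds = [0] + cuts + [n_cells]
    return [(bounds[i], bounds[i + 1]) for i in range(n_zones)]


def areal_weighting(cell_pop, source_zones, target_zones):
    """Estimate target totals assuming uniform density in each source zone."""
    density = [0.0] * len(cell_pop)
    for start, end in source_zones:
        mean_density = sum(cell_pop[start:end]) / (end - start)
        for cell in range(start, end):
            density[cell] = mean_density
    return [sum(density[start:end]) for start, end in target_zones]


def rmse(estimated, observed):
    """Root mean square error over paired target-zone values."""
    n = len(observed)
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(estimated, observed)) / n)


def simulate(n_trials=500, n_cells=200, n_source=5, n_target=20, seed=0):
    """Return the mean and standard deviation of RMSE over random trials."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_trials):
        # Each trial draws a new population surface and new zone systems
        # (assumed distributions, chosen only for illustration).
        cell_pop = [rng.expovariate(1 / 50) for _ in range(n_cells)]
        source = random_zones(n_cells, n_source, rng)
        target = random_zones(n_cells, n_target, rng)
        observed = [sum(cell_pop[s:e]) for s, e in target]
        estimated = areal_weighting(cell_pop, source, target)
        errors.append(rmse(estimated, observed))
    return statistics.mean(errors), statistics.stdev(errors)
```

Calling `simulate()` returns the mean RMSE and its standard deviation across the trials, so a method is judged on a distribution of errors rather than on one source and target configuration. Comparing two methods would mean applying both to the same simulated surfaces and zone systems in each trial.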

Nevertheless, Monte Carlo simulation has not been widely used as a standard validation method for spatial disaggregation techniques, because it is rather complex and computationally intensive. The method requires the observed value of a test statistic to be compared with a large number of simulated ones. This involves complex procedures such as random data generation, and computational tools to test the techniques iteratively against the simulated data. Therefore, in many previous studies, when the available data are good enough to justify the significance of the result and the theoretical assumption of the technique, most researchers would choose simpler

3.4.3 Theoretical examination

In addition to numerical error evaluation, some researchers have used a theoretical examination to compare the accuracy of spatial disaggregation methods. For example, Sadahiro's (1999) approach is based on a stochastic model that represents diverse geographical situations. In contrast to Monte Carlo simulation, the model derives the relationship between the estimation accuracy and the spatial distribution of estimation error from a theoretical point of view, without a high computational cost. Sadahiro's findings can usefully inform the choice of spatial disaggregation method. Nevertheless, the theoretical examination has limitations, and some important spatial issues are not covered, such as the influence of the shape, size and composition of the source zones and the target zones (Sadahiro, 1999).

3.5 Conclusion

This chapter provides an extensive review of the current spatial disaggregation techniques that are primarily applied to socio-economic data. The review covers their principles, theoretical assumptions, implementation issues and validation methods. It also discusses the major findings of the previous literature on the accuracy of spatial disaggregation methods.

The review of the current methodologies uncovers the major limitations of the existing methods. Among the data-driven methods, dasymetric mapping is well established in the spatial disaggregation literature. The method implies a more relaxed density assumption than the other techniques and is better able to resolve spatial heterogeneity in population disaggregation. However, previous dasymetric methods were developed on the basis of a coarse density classification and tested against relatively simple geographical areas. One area that has received little attention is the extent to which the accuracy of the dasymetric method can be further improved by a refined density classification. This is a particular problem for the SEQ population disaggregation, because the high degree of spatial heterogeneity in the dataset might not be easily resolved with only two or three classes. Therefore, a further methodological investigation is needed to test the performance of dasymetric methods that incorporate an extended number of density classes. The purpose is to identify the best density solution for a dasymetric method that is robust to the highly heterogeneous SEQ population (this is demonstrated in Chapter 4).

This chapter also introduced the major theory-driven techniques for spatially disaggregating employment data. Given the availability of the data, this research attempts to apply a theory-driven technique to disaggregate the employment data. The chapter specified that the use of theoretical approaches for spatial disaggregation depends primarily on their theoretical soundness, including their assumptions about the patterns of employment at certain spatial scales. I emphasise the importance of spatial dependence and spatial heterogeneity in employment data at the metropolitan level, which necessitates a novel method, beyond the existing techniques, that appropriately disaggregates the SEQ employment data whilst accounting for these spatial effects (this is demonstrated in Chapter 5).

Chapter 4 – Spatially disaggregate population

In document Predicting Future Spatial Distributions of Population and Employment for South East Queensland: A spatial disaggregation approach (Page 86-91)