Chapter 4 – Spatial microsimulation modelling
4.8. Results of the spatial microsimulation model
4.8.1. Internal validation of the model
The overall model constrained extremely accurately. The Census constraint data were plotted against the aggregated survey data and compared using Pearson’s r correlation coefficient, giving a perfect score of 1. The correlation between the aggregated survey data and the Census data can be seen in Figure 15. The correlations for the individual constraint variables were equally strong, with each constraint also attaining a correlation coefficient of 1.
Figure 15 – Census data plotted against the aggregated survey data
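As an illustration of the comparison described above, the following minimal sketch computes Pearson’s r between Census constraint counts and aggregated simulated counts. The function name and the counts are hypothetical and chosen only for demonstration; they are not the data or code used in this research.

```python
import math


def pearson_r(x, y):
    """Return Pearson's r for two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either sequence is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical counts for four constraint categories (illustrative only)
census_counts = [120.0, 340.0, 95.0, 410.0]
simulated_counts = [120.0, 340.0, 95.0, 410.0]

r = pearson_r(census_counts, simulated_counts)
print(f"Pearson's r = {r:.4f}")
```

When the simulated counts reproduce the Census counts exactly, as in this toy input, r equals 1 up to floating-point rounding.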
Beyond the Pearson’s r statistic already presented, the model was internally validated in a number of other ways. As mentioned in Section 4.7, the total absolute error (TAE) is a measure of the absolute difference between the Census data and simulated data for each zone. After 15 iterations of the model, the TAE score for the overall model was again extremely encouraging at just 0.2985, out of a total adult population of 452,014. Across the 345 LSOAs, TAE scores ranged from 0.000000005 to a maximum of 0.066118601. The population total created from the simulation was also very accurate, as the simulated population figure of 452,014 exactly matched the adult population supplied by NOMIS (NOMIS, 2011) at the time of the 2011 Census (n=452,014). The standardised absolute error (SAE), where the TAE is divided by the population of the LSOA, also performed extremely well, with a score of just
Figure 16 – Total absolute error (TAE) scores per LSOA
Figures 16 and 18 show the TAE and SAE scores mapped onto the Sheffield LSOA boundaries. Figure 16 shows that a cluster of zones to the west of the city centre featured the highest TAE scores. The zone with the highest error score was LSOA E01033261, an area near Broad Lane and the University of Sheffield’s city campus (Figure 17). Further analysis revealed that this zone accounted for 27.47% of the TAE in the model.
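The error measures discussed above can be sketched as follows, assuming that the TAE for a zone is the sum of absolute differences between Census and simulated counts across constraint categories, that the SAE is the TAE divided by the zone population, and that a zone’s share of the error is its TAE as a percentage of the model total. The zone labels, counts and function names are hypothetical; the exact aggregation used in this research may differ.

```python
def total_absolute_error(census_counts, simulated_counts):
    """Sum of absolute differences between Census and simulated counts."""
    return sum(abs(c - s) for c, s in zip(census_counts, simulated_counts))


def standardised_absolute_error(tae, population):
    """TAE divided by the population of the zone."""
    return tae / population


# Hypothetical zones: label -> (Census counts, simulated counts, population)
zones = {
    "zone_a": ([50, 30, 20], [50.0, 30.5, 19.5], 100),
    "zone_b": ([40, 40, 20], [40.0, 40.0, 20.0], 100),
    "zone_c": ([10, 60, 30], [10.25, 59.75, 30.0], 100),
}

tae_by_zone = {
    zone: total_absolute_error(census, simulated)
    for zone, (census, simulated, _population) in zones.items()
}
sae_by_zone = {
    zone: standardised_absolute_error(tae_by_zone[zone], zones[zone][2])
    for zone in zones
}

# Share of the overall TAE attributable to each zone, as a percentage
model_tae = sum(tae_by_zone.values())
tae_share = {zone: 100 * tae / model_tae for zone, tae in tae_by_zone.items()}

# Zone contributing the largest absolute error
worst_zone = max(tae_by_zone, key=tae_by_zone.get)
print(worst_zone, tae_by_zone[worst_zone], round(tae_share[worst_zone], 2))
```

Ranking zones by their share of the model TAE in this way is one simple means of identifying which areas contribute most to the error.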
Figure 17 – Location of LSOA E01033261 near Broad Lane and the university
Figure 18 demonstrates a similar pattern for the standardised absolute error, with a cluster of LSOAs to the west of the city centre again providing the highest error scores.
Figure 18 – Standardised absolute error (SAE) scores per LSOA
This time LSOA E01033267 had the highest error score. This area was situated around Devonshire Green, and was also located near a number of buildings affiliated with the University of Sheffield, as well as student accommodation at West One (Figure 19). This LSOA accounted for 33.25% of the standardised error in the model.
Figure 19 – Location of LSOA E01033267 near Devonshire Green and West One
It would seem that one zone did not account for the majority of the absolute or standardised error within the model. If one zone had been principally responsible for the error, then the rest of the model could be assumed to be accurate. However, this is not necessarily a problem when the error scores are as small as they were in this research. Interestingly, a similar cluster of LSOAs to the west of the city centre appeared to account for the majority of the error (both absolute and standardised) in the model, with these areas being located near the main campus of the University of Sheffield. As these LSOAs cover areas occupied by university buildings as well as student accommodation, it may be that the characteristics of the student population in the area, which will differ from those of the usual resident population near the university, caused the higher error counts in these areas. Ballas et al. (2005a) have stated that spatial microsimulation is not suitable for estimating populations when ‘affected considerably by external and localised factors, such as transport networks and public transport services, or the presence of a disproportionally large university or a single major employer in the region’ (p.14). It is possible that the presence of the university and its students may have affected the output of the model, although it is curious that higher error counts were not seen in areas occupied by buildings associated with Sheffield Hallam University and its students.
While the TAE and SAE are helpful in assessing error, they give no indication of whether the differences are statistically significant, and significance tests provide an alternative way to judge the internal fit of the data. Edwards and Clarke (2007) used two-tailed equal variance t-tests to check whether there were statistically significant differences between the Census data and aggregated survey data, with the results as applied to this research presented in Table 15. The data in this table demonstrate the excellent internal fit of the model, with no statistically significant differences between the Census and survey data for the overall model, or for any of the six constraint variables.
Table 15 – Results of the two-tailed equal variance t-tests
Constraint        p-value
Overall model     0.995
Age               0.9996
Gender            0.9993
Qualification     0.9998
NS-SEC            0.9995
General health    0.9998
Car ownership     1
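The test statistic behind a two-tailed equal variance t-test can be sketched as below, under the assumption that the Census counts and the aggregated survey counts are treated as two independent samples with a pooled variance. The function name and the counts are hypothetical. The sketch stops at the t statistic and its degrees of freedom; the two-tailed p-value would then be read from Student’s t distribution using a statistics package.

```python
import math


def pooled_t_statistic(a, b):
    """Return (t, degrees of freedom) for an equal variance two-sample t-test."""
    n1, n2 = len(a), len(b)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")
    mean1 = sum(a) / n1
    mean2 = sum(b) / n2
    # Sums of squared deviations about each sample mean
    ss1 = sum((x - mean1) ** 2 for x in a)
    ss2 = sum((x - mean2) ** 2 for x in b)
    df = n1 + n2 - 2
    pooled_variance = (ss1 + ss2) / df
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    if standard_error == 0:
        raise ValueError("t is undefined when both samples are constant")
    return (mean1 - mean2) / standard_error, df


# Hypothetical counts for four constraint categories (illustrative only)
census_counts = [120.0, 340.0, 95.0, 410.0]
survey_counts = [120.1, 339.9, 95.0, 410.0]

t_stat, dof = pooled_t_statistic(census_counts, survey_counts)
print(f"t = {t_stat:.6f}, df = {dof}")
```

A t statistic close to zero corresponds to a p-value close to 1, which is the pattern reported in Table 15.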
Edwards and Clarke (2007) also suggest linear regression as a further internal validation method, as the p-values and R² statistics give a good indication of internal fit. Again the simulation performed well in this regard, as the overall model and all of the constraints attained statistically significant p-values (p < 0.01), and R² values of 1. Finally, Smith et al. (2007) contend that for a rare disease such as diabetes, there should be less than 10% error in 90% of the areas (output areas in their research). In this research the SAE scores for all LSOAs were below 10%, with the highest standardised error score being 0.00000776133, indicating excellent internal model fit. Based on the findings of the internal validation, it was assumed that the model was internally valid, and had constrained accurately to the Census data.
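The two remaining checks can be illustrated with a minimal sketch, assuming that the simulated counts are regressed on the Census counts by ordinary least squares, and that the Smith et al. (2007) criterion is read as requiring at least 90% of zones to have an SAE below 0.10. The function names and all values are hypothetical. The regression p-value is omitted, as it would require Student’s t distribution from a statistics package.

```python
def simple_linear_regression(x, y):
    """Return (slope, intercept, R squared) for a least squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - residual sum of squares / total sum of squares
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot


def share_of_zones_below(sae_scores, threshold=0.10):
    """Proportion of zones whose SAE is strictly below the threshold."""
    return sum(1 for s in sae_scores if s < threshold) / len(sae_scores)


# Hypothetical counts for four constraint categories (illustrative only)
census_counts = [120.0, 340.0, 95.0, 410.0]
simulated_counts = [120.0, 340.0, 95.0, 410.0]
slope, intercept, r_squared = simple_linear_regression(census_counts, simulated_counts)

# Hypothetical SAE scores for four zones, expressed as proportions
sae_scores = [0.001, 0.02, 0.15, 0.004]
share = share_of_zones_below(sae_scores)
meets_criterion = share >= 0.90

print(f"R squared = {r_squared:.4f}, share below 10% = {share:.2f}")
```

In this toy input three of the four zones fall below the threshold, so the 90% criterion is not met; in this research every LSOA fell below it.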