Chapter 4 – Spatial microsimulation modelling
4.7. Validation
One of the most important parts of the IPF process is ensuring the validity of the output, particularly given that the models are estimating variables which previously did not exist in a geographical format. Models can be validated both internally and externally. Internal validation involves checking the fit between the Census constraints and the aggregated survey data, and Lovelace and Dumont (2016) list several ways to test this. Firstly, a simple check of the data can be performed using the correlation function in R (‘cor’), which calculates a Pearson’s r coefficient (Pearson, 1896). This value should be as close to 1 as possible, and preferably over 0.99 to show the two datasets have
converged. This can also be assessed visually by plotting the two datasets against each other. Two further tests of the fit of the data include the total absolute error (TAE) and the standardised absolute error (SAE). The TAE calculates the differences between the observed and simulated values for each constraint in each zone, and sums these. The SAE is perhaps a more useful measure of error, as this takes the TAE for each zone and divides it by the population of each area, thus making comparisons between different
areas more relevant. Having calculated the TAE and SAE it is then possible to
investigate which areas have the highest error levels, how much these areas contribute to the overall level of error, and which of the variables have not converged and are causing the error to be higher in these areas. Statistical tests such as regression and t-tests can also be used to test the convergence of the two datasets (Edwards and Clarke, 2007).
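The internal checks described above can be sketched in a few lines. The example below is a minimal illustration in Python rather than R, using made-up counts; the names `observed`, `simulated`, `pearson_r` and `zone_tae` are hypothetical, and the zone population is assumed to equal the sum of the observed counts for a single constraint variable. The SAE follows the definition given in the text (the TAE of a zone divided by that zone's population).

```python
from math import sqrt

# Hypothetical example data: rows are zones, columns are the categories of
# one constraint variable. `observed` stands in for the Census counts and
# `simulated` for the aggregated output of the model.
observed = [
    [120, 80, 50],
    [200, 150, 100],
    [60, 90, 30],
]
simulated = [
    [118, 83, 49],
    [200, 150, 100],
    [66, 82, 32],
]


def pearson_r(xs, ys):
    """Pearson's r between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def zone_tae(obs_row, sim_row):
    """Total absolute error for one zone: sum of |observed - simulated|."""
    return sum(abs(o - s) for o, s in zip(obs_row, sim_row))


# Flatten both tables so that every zone/category cell is one observation.
flat_obs = [v for row in observed for v in row]
flat_sim = [v for row in simulated for v in row]
r = pearson_r(flat_obs, flat_sim)

tae_by_zone = [zone_tae(o, s) for o, s in zip(observed, simulated)]

# Assumption: each zone's population is the sum of its observed counts,
# which only holds when the columns come from a single constraint variable.
population_by_zone = [sum(row) for row in observed]
sae_by_zone = [t / p for t, p in zip(tae_by_zone, population_by_zone)]

total_tae = sum(tae_by_zone)
# Share of the overall error contributed by each zone.
share_of_error = [t / total_tae for t in tae_by_zone]
# Zone indices ordered from highest to lowest SAE.
worst_first = sorted(
    range(len(sae_by_zone)), key=lambda i: sae_by_zone[i], reverse=True
)
```

With these illustrative figures the third zone has the highest SAE and accounts for most of the total error, so it would be the first area to examine for constraints that have not converged.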
External validation of spatial microsimulation models, while the ‘gold standard’, can be more difficult to conduct. Some models can be caught in a ‘catch 22’ situation, as the outputs are often new data with nothing else to compare them against, hence the need to use the method in the first place (Lovelace and Dumont, 2016). Where external
validation is possible, there are a number of ways of achieving this, with Tanton and Edwards (2013) proposing four potential approaches. The first of these involves aggregating the results of the model to a spatial scale where relevant data already exist that can be compared to the output. The authors warn of the risk of the
ecological fallacy in this approach, and stress the importance of not making assumptions about populations from aggregate data. The ecological fallacy concerns the
‘aggregational variability inherent in areal data’ (Openshaw, 1984, p. 18), whereby aggregate, group-level statistics from Census data can become unrepresentative of the individuals that comprise that group. The danger then comes from making assumptions about the nature of individuals based on these aggregated statistics. This also brings into question the areal unit used for the analysis, which in turn will influence the
resultant statistics. This is another reason why the neighbourhood-based variables in this study were, where possible, not based on aggregated individual data, so as not to create difficulties in interpreting group-level data.
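The first approach, aggregating the model output to a scale at which comparison data already exist, might look like the following minimal Python sketch. All zone names, region names and counts are hypothetical, as is the helper `aggregate_to_regions`; it is meant only to illustrate the comparison, not to reproduce the procedure of Tanton and Edwards (2013).

```python
# Hypothetical simulated counts of the target variable for each small area,
# and a hypothetical lookup assigning each small area to a larger region.
simulated_by_zone = {"zone_a": 130, "zone_b": 95, "zone_c": 210, "zone_d": 160}
zone_to_region = {
    "zone_a": "north",
    "zone_b": "north",
    "zone_c": "south",
    "zone_d": "south",
}

# Hypothetical external estimates that already exist at the regional scale.
external_by_region = {"north": 240, "south": 350}


def aggregate_to_regions(zone_values, lookup):
    """Sum zone-level values within each region named in the lookup."""
    totals = {}
    for zone, value in zone_values.items():
        region = lookup[zone]
        totals[region] = totals.get(region, 0) + value
    return totals


simulated_by_region = aggregate_to_regions(simulated_by_zone, zone_to_region)

# Signed percentage difference between the aggregated model output and the
# external figure for each region.
pct_diff_by_region = {
    region: 100 * (simulated_by_region[region] - external) / external
    for region, external in external_by_region.items()
}
```

Close agreement at the regional scale does not by itself show that the small-area estimates are accurate, since errors in individual zones can cancel out when summed.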
The second approach, which is far more time- and resource-consuming, is to collect primary data on the outcome variable of choice in either one small area or a sample of small areas, assuming no other relevant data exist. The third approach suggested by the authors is to compare the simulated data to different variables that are correlated with the simulated output, using data that already exist at the micro level. Using small area geographical data from the Census is perhaps the most obvious option in this regard, while Anderson (2013) used the Welsh Index of Multiple Deprivation (StatsWales, 2011) as an alternative source of data. The fourth and final suggestion is to run the model at a larger geographical scale, before measuring the results against (reliable)
estimates for larger scales from another dataset. This approach has similarities with the first method and, in the case of this study, data from the Yorkshire and Humber region could be used. A fifth technique for external validation is to use unconstrained
variables, i.e. variables that are present in both datasets but were not used as constraints (Campbell, 2011). By including these variables as additional target variables, it is possible to compare the model estimates against the Census version of the variable. This gives an idea as to whether the model is accurately producing data for the target
variables. This approach was used in this research, through the inclusion of ethnicity and marital status as additional target variables.
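A comparison against an unconstrained variable can be sketched as below. This is a minimal Python illustration with made-up counts for a single category (for example, one marital status group); the zone names, the helper `percentage_errors` and the 5% `tolerance` are all hypothetical choices for the example and are not taken from this study.

```python
# Hypothetical counts for one category of an unconstrained variable in each
# zone: the Census figures versus the estimates produced by the model.
census_counts = {"zone_a": 400, "zone_b": 250, "zone_c": 320}
model_counts = {"zone_a": 380, "zone_b": 265, "zone_c": 320}


def percentage_errors(census, model):
    """Signed percentage error of the model estimate in each zone."""
    return {
        zone: 100 * (model[zone] - census[zone]) / census[zone]
        for zone in census
    }


errors = percentage_errors(census_counts, model_counts)

# Mean absolute percentage error across all zones.
mean_abs_pct_error = sum(abs(e) for e in errors.values()) / len(errors)

# Assumed tolerance (in percent) above which a zone is flagged for review.
tolerance = 5.0
flagged = [zone for zone, e in errors.items() if abs(e) > tolerance]
```

Because the variable played no part in the fitting, large errors here suggest that the constraints chosen do not capture the processes driving that variable in the flagged zones.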
Internal validation should be conducted as the bare minimum in the validation process. External validation may not always be possible, however; this will very much depend on the target variable chosen for the research, as well as the spatial scale at which the data are produced. Once the data have been validated, it is possible to export data frames from R as .csv files using the ‘write.csv’ command (or as Excel spreadsheets using the ‘xlsx’ package). Once in this format it is also possible to read such data into agent-based modelling software, and this will be explained in more detail in Chapter 5. The next section will focus on the results of the models and their validation, before moving on to geographical and statistical analysis of the output.