Chapter 2 Background Literature Review
2.4 Geospatial Data Quality
2.4.1 The Notion of Geospatial Data Quality
The quality of geospatial data can be defined as “a measure of the difference between the data and the real world that they represent” (Goodchild, 2006, p. 13). The greater this difference, the poorer the quality of the data and the lower its value. Because it is produced through processes of generalisation, abstraction and aggregation, geospatial data can only approximate the real world and therefore almost always suffers from imperfect quality (Goodchild, 1995; Li et al., 2012). A perfect representation of the real world, with all its unlimited complexity and level of detail, cannot be achieved (Devillers and Jeansoulin, 2006; Goodchild, 2006). Consequently, almost all geospatial data has limited accuracy (Goodchild, 1995) and is inevitably uncertain (Couclelis, 2003).
Goodchild (1995) identifies six major sources of error in spatial data: measurement; definition; lack of documentation; distortion in physical media; processing; and interpretation. Measurement error is introduced when raw geographic data is produced and stays with the “product through the entire process of dataset creation and use” (Goodchild, 1995, p. 414). The data acquisition process, for example the technique or instruments used, generally determines the quality of the data produced. Definition errors arise when different observers define the variables being measured differently. Such variations cause misclassification and inaccuracy. For instance, what one observer records in the dataset as a ‘hill’ may be defined as a ‘mound’ by another observer. Lack of adequate documentation supporting spatial data can also contribute to error propagation: with insufficient metadata, discrepancies and other errors become harder to identify. Distortion in physical media occurs as a result of the digitisation of paper maps and is probably less relevant now that most spatial data is collected with satellite and digital technologies. Processing geospatial data to create different data products introduces further errors and thereby increases data uncertainty. When data lacks adequate documentation, more of the responsibility for interpreting it falls on data users, which can lead to interpretation errors. Having analysed the major sources of spatial data errors, Goodchild concludes that the imperfection of spatial data makes measuring and documenting its quality essential.
Worboys (1998) identifies five factors that can lead to spatial data quality deficiencies: inaccuracy and error; vagueness; incompleteness; inconsistency; and imprecision. He describes the inaccuracy and error factor as “deviation from true values” (Worboys, 1998, p. 258) – i.e., a difference between the data produced and the real world. This factor can arise from imprecise measurement, distortion in physical media or data processing, as identified by Goodchild (1995). The vagueness factor – “imprecision in concepts used to describe the information” (Worboys, 1998, p. 258) – occurs when different producers and data analysts use different terms to describe the same concepts, objects and object properties. This factor relates directly to Goodchild’s definition error discussed above. The inconsistency factor arises when the data produced contains conflicting information. For example, inconsistency can emerge in a dataset that combines data from multiple sources or where data producers do not comply with common standards and best practices (Mohammadi et al., 2009). Inconsistency affects many spatial datasets because they are aggregated from multiple sets of data, which may each have a different level of quality or even come from different sources. Lack of documentation and interpretation errors, as described above, can also contribute to the inconsistency of a dataset. The imprecision factor implies low resolution and coarse granularity of spatial data; data that does not provide a sufficient level of detail and/or precision leads to uncertainty. This factor typically arises from low-resolution measurement instruments and distortion in physical media.
Collins and Smith (1994) classify errors according to the phases of data collection and use: data collection (e.g., errors of measurement, errors produced by data collection equipment); data input (e.g., digitisation errors); data storage (e.g., errors caused by numerical imprecision or rounding); data manipulation (e.g., error propagation, map mash-ups); data output (e.g., scaling errors produced by the output device); and data usage (e.g., data misuse or misinterpretation). Their classification scheme closely mirrors the classifications proposed by Goodchild (1995) and Worboys (1998) but introduces additional categories, namely data input, storage, manipulation and output errors.
Beard (1989) proposes classifying map errors into three main categories: errors produced through data acquisition (source errors); errors introduced by data processing (process errors); and errors caused by data misuse (use errors). It can be argued that this classification scheme is a high-level categorisation of the geospatial data errors and factors proposed by Goodchild (1995), Worboys (1998) and Collins and Smith (1994). The source errors category encompasses data collection errors; errors of measurement; lack of documentation; interpretation (or vagueness) errors; incompleteness; inconsistency; and imprecision. The process errors category can include errors of processing; input; storage; manipulation; output; inaccuracy; inconsistency; and imprecision. Finally, use errors comprise errors of interpretation and misuse.
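The grouping described in the preceding paragraph can be expressed as a simple lookup structure. The following is only a minimal sketch of that grouping, not part of any of the cited classifications: the names BEARD_CATEGORIES and categories_for are hypothetical, and the factor labels are shortened forms of the terms used in the text. Where the text gives ‘interpretation (or vagueness)’, the sketch assumes the two labels can be listed separately.

```python
# Illustrative sketch only. The grouping follows the paragraph above;
# BEARD_CATEGORIES and categories_for are hypothetical names.
BEARD_CATEGORIES = {
    "source": [
        "data collection", "measurement", "lack of documentation",
        # The text gives "interpretation (or vagueness)"; both labels
        # are listed here as an assumption of this sketch.
        "interpretation", "vagueness",
        "incompleteness", "inconsistency", "imprecision",
    ],
    "process": [
        "processing", "input", "storage", "manipulation", "output",
        "inaccuracy", "inconsistency", "imprecision",
    ],
    "use": ["interpretation", "misuse"],
}


def categories_for(factor):
    """Return the high-level categories whose factor lists include `factor`."""
    wanted = factor.strip().lower()
    return [
        category
        for category, factors in BEARD_CATEGORIES.items()
        if wanted in factors
    ]


if __name__ == "__main__":
    for name in ("inconsistency", "storage", "misuse"):
        print(name, "->", categories_for(name))
```

A lookup of this kind makes it visible that some factors, such as inconsistency and imprecision, fall under more than one of the three categories.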
While a number of classifications have been proposed to categorise the factors that affect geospatial data quality, their categories are generally consistent and interrelated. Spatial data quality deficiencies can arise from any combination of the factors described above; hence, as already mentioned, almost all spatial data is uncertain and imprecise to some degree, and the aim must be to control and document errors so that they do not adversely affect the uses to which the data is put.
Devillers et al. (2005) and Devillers and Jeansoulin (2006) discuss two categories of data quality: internal quality and external quality. Internal quality refers to the level of similarity between the data produced and the “perfect” data that should have been produced. In the GIS domain, internal quality is often described in terms of the ‘famous five’ elements of geospatial data quality (see more on these elements in section 2.4.2). As argued by Devillers and Jeansoulin (2006), the internal quality of data can be improved during the course of data creation. External quality refers to how well a product meets users’ needs or expectations in a given context. External quality is subjective rather than absolute; it depends largely on user requirements, and therefore the same product can be of different quality to different users. Due to its subjective nature, external data quality is often defined as ‘fitness for use’ or ‘fitness for purpose’ (see more on ‘fitness for use’ in section 2.4.3). Despite the differences between the notions of internal (objective) and external (subjective) data quality, the two categories are closely linked because, in order to evaluate external data quality, users will often require objective data quality descriptions. While methods exist for evaluating the internal quality of geospatial data, the evaluation of external quality remains an open issue in the GIS domain (Ivánová et al., 2013).