3. State of the Art
3.2. Data Quality
3.2.3. Data Quality Dimensions
Since data quality touches many different aspects, it can be decomposed into different data quality dimensions. Depending on the use case, only a subset of all known dimensions needs consideration, which breaks the problem of measuring quality down into smaller pieces. Even though regarding data quality as a multi-dimensional concept is a common view in the literature and an enormous number of dimensions has been proposed (see Appendix C), “there is no agreement on the set of dimensions characterizing data quality” [128]. There is also no consensus on what particular dimensions mean, which leads to multiple definitions of a single dimension. To some extent, this situation reflects the fact that quality is often considered in connection with a certain use case or application domain. Besides these differing views of quality dimensions, there are also different approaches to deriving them from a given use case. These are categorized into theoretical, empirical and intuitive approaches [128, 14].
In a theoretical approach, the modeled system is considered in a more abstract way, and a formal model is derived to detect and describe quality issues. An example of such an approach is the quality model of Wand and Wang. This model views the development of an information system as a problem of mapping the real world to the system, and several artifacts of interest can be derived from that abstraction. These are design deficiencies, referring to the errors shown in Figure 11 (incomplete representation, ambiguous representation and meaningless state), and operation deficiencies, standing for inappropriate behaviour of the system. With these deficiencies at hand, one can define quality dimensions as shown in Table 9. The process-oriented models mentioned in the previous section are theoretical approaches as well.
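The mapping view can be made concrete with a small sketch. It assumes the usual reading of the three design deficiencies (a real-world state without a system state is an incomplete representation, a system state standing for several real-world states is ambiguous, and a system state without a real-world counterpart is meaningless) and adds the one-to-many case that Table 9 describes as inconsistency. The function name, the state labels and the example mapping are hypothetical and serve illustration only.

```python
from collections import defaultdict


def mapping_deficiencies(real_world_states, system_states, mapping):
    """Classify deficiencies of a representation mapping.

    `mapping` is an iterable of (real_world_state, system_state) pairs.
    All names are illustrative and not taken from Wand and Wang.
    """
    represented_by = defaultdict(set)  # real-world state -> system states
    represents = defaultdict(set)      # system state -> real-world states
    for rw_state, is_state in mapping:
        represented_by[rw_state].add(is_state)
        represents[is_state].add(rw_state)

    return {
        # real-world states that no system state represents
        "incomplete": sorted(
            s for s in real_world_states if s not in represented_by
        ),
        # system states that stand for more than one real-world state
        "ambiguous": sorted(
            s for s in system_states if len(represents.get(s, ())) > 1
        ),
        # system states without any real-world counterpart
        "meaningless": sorted(
            s for s in system_states if s not in represents
        ),
        # real-world states mapped to several system states (one-to-many)
        "inconsistent": sorted(
            s for s in real_world_states if len(represented_by.get(s, ())) > 1
        ),
    }


# Hypothetical example: a door with four real-world states, three system codes.
real_world = {"open", "closed", "locked", "ajar"}
system = {"O", "C", "X"}
pairs = [("open", "O"), ("closed", "C"), ("locked", "C")]
result = mapping_deficiencies(real_world, system, pairs)
```

In this example, "ajar" has no representation, "C" cannot distinguish "closed" from "locked", and "X" corresponds to nothing in the real world.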
| Dimension | Description |
|---|---|
| Accuracy and Precision | “inaccuracy implies that the information system represents a real world state different from the one that should have been represented.” |
| Reliability | indicates “whether the data can be counted on to convey the right information; it can be viewed as correctness of data.” |
| Timeliness and Currency | refers to “the delay between a change of the real-world state and the resulting modification of the information system state.” |
| Completeness | is “the ability of an information system to represent every meaningful state of the represented real world system.” |
| Consistency | inconsistency of data values occurs if there is more than one state of the information system matching a state of the real-world system; therefore “inconsistency would mean that the representation mapping is one-to-many.” |

Table 9: Quality dimensions derived from the quality model of Wand and Wang [143]

The empirical approach does not consider formal models but takes stakeholder opinions into account. In most cases, such approaches are based on a user survey, as in the method of Wang and Strong [144]. There, a survey performed in multiple steps led to a shortlisted catalogue of 15 quality dimensions grouped into the four categories shown in Figure 12.
When following an intuitive approach, data quality dimensions are defined “according to common sense and practical experience” [14]. A concrete example of this approach is given by Redman [123]. The corresponding data quality dimensions are listed in Table 10.
These three approaches and their prerequisites are summarized in Figure 13. Apart from the dimensions presented for the three approaches, an overview of all dimensions introduced in the considered literature can be found in Appendix C. To ease understanding, the dimension definitions or descriptions were normalized using a shared vocabulary for formulae and consolidated where multiple dimensions share the same meaning.
Data Quality
- Intrinsic Data Quality: Believability, Accuracy, Objectivity, Reputation
- Contextual Data Quality: Value-added, Relevancy, Timeliness, Completeness, Appropriate amount of data
- Representational Data Quality: Interpretability, Ease of understanding, Representational consistency, Concise representation
- Accessibility Data Quality: Accessibility, Access security

Figure 12: Quality dimensions according to Wang and Strong [144]
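Since only a subset of dimensions is usually relevant for a given use case, the hierarchy of Figure 12 can be held in a simple lookup structure. The following is a minimal sketch: the category and dimension names are those of the figure, while the names `WANG_STRONG`, `category_of` and `select` are hypothetical.

```python
# Categories and dimensions as listed in Figure 12.
WANG_STRONG = {
    "Intrinsic": ["Believability", "Accuracy", "Objectivity", "Reputation"],
    "Contextual": [
        "Value-added",
        "Relevancy",
        "Timeliness",
        "Completeness",
        "Appropriate amount of data",
    ],
    "Representational": [
        "Interpretability",
        "Ease of understanding",
        "Representational consistency",
        "Concise representation",
    ],
    "Accessibility": ["Accessibility", "Access security"],
}


def category_of(dimension):
    """Return the category a dimension belongs to."""
    for category, dimensions in WANG_STRONG.items():
        if dimension in dimensions:
            return category
    raise KeyError(dimension)


def select(dimensions):
    """Group a use-case-specific subset of dimensions by category."""
    selected = {}
    for dimension in dimensions:
        selected.setdefault(category_of(dimension), []).append(dimension)
    return selected


# Hypothetical use case that only cares about three dimensions.
subset = select(["Accuracy", "Timeliness", "Completeness"])
```

Here `subset` groups the three chosen dimensions under the "Intrinsic" and "Contextual" categories; a dimension that does not appear in the figure raises a `KeyError`.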
| Type | Dimension | Description |
|---|---|---|
| Data value | Accuracy | “Distance between v and v′, considered as correct” |
| Data value | Completeness | “Degree to which values are present in a data collection” |
| Data value | Currency | “Degree to which a datum is up to date” |
| Data value | Consistency | “Coherence of the same datum, represented in multiple copies, or different data to respect integrity constraints and rules” |
| Data format | Appropriateness | “One format is more appropriate than another if it is more suited to the user needs” |
| Data format | Interpretability | “Ability of the user to interpret correctly values from their format” |
| Data format | Portability | “The format can be applied to as a wide set of situations as possible” |
| Data format | Format precision | “Ability to distinguish between elements in the domain that must be distinguished by users” |
| Data format | Format flexibility | “Changes in user needs and recording medium can be easily accommodated” |
| Data format | Ability to represent null values | “Ability to distinguish neatly (without ambiguities) null and default values from applicable values of the domain” |
| Data format | Efficient use of memory | “Efficiency in the physical representation. An icon is less efficient than a code” |
| Data format | Representation consistency | “Coherence of physical instances of data with their formats” |

Table 10: Quality dimensions proposed by Redman [123] (cited from [14])
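Two of the data value dimensions in Table 10 lend themselves to a direct numeric reading. The sketch below assumes that missing values are encoded as `None`, that a reference value v′ is known for every record, and that the distance is the absolute difference of numbers; none of these choices is prescribed by Redman, and the function names are hypothetical.

```python
def completeness(values):
    """Fraction of entries that are present (not None)."""
    values = list(values)
    if not values:
        raise ValueError("completeness is undefined for an empty collection")
    return sum(v is not None for v in values) / len(values)


def mean_distance(recorded, reference):
    """Mean |v - v'| over the entries for which a recorded value v exists."""
    pairs = [(v, ref) for v, ref in zip(recorded, reference) if v is not None]
    if not pairs:
        raise ValueError("no recorded values to compare")
    return sum(abs(v - ref) for v, ref in pairs) / len(pairs)


# Hypothetical temperature readings and the values considered correct.
recorded = [20.0, None, 22.5, 19.0]
reference = [20.0, 21.0, 22.0, 19.0]

share_present = completeness(recorded)               # 3 of 4 values present
avg_distance = mean_distance(recorded, reference)    # (0 + 0.5 + 0) / 3
```

A smaller mean distance indicates higher accuracy; other distance functions (for example an edit distance for strings) could be substituted without changing the structure.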
[Figure 13 content: each approach derives dimensions from its prerequisite. theoretical: model/formalization; empirical: survey; intuitive: experience/intuition.]
Figure 13: Approaches to derive quality dimensions to consider for a given domain and their prerequisites