The foundation of a building plays a major role in the successful construction and longevity of the building. The stronger the foundation, the stronger the building. In the same way, data are the foundation upon which high-performance organizations rest in this competitive age. Data are no longer a by-product of an organization’s information technology (IT) systems and applications, but are an organization’s most valuable asset and resource, with a real, measurable value. Besides the importance of data as a resource, it is also appropriate to view data as a commodity. These issues are compounded because the value of the data rests not only with the data themselves, but also with the actions that arise from the data and their usage (Mahanti 2019). This article summarizes the case for data quality, discusses succinctly the different data quality dimensions and their role in ensuring quality data, and examines the central role data quality plays in ensuring product quality, process quality, and compliance in the digital age.
KEYWORDS
data, data quality, data quality dimensions, digital age, Industry 4.0
Data Quality and Data Quality Dimensions
By Rupa Mahanti
INDUSTRIAL REVOLUTION, DIGITAL AGE, AND DATA
The first industrial revolution (late 1700s and early 1800s) was characterized by steam-powered machines, and the second industrial revolution (late 1800s) was characterized by electricity and assembly lines. The introduction of computers, innovations in computing, and industrial automation defined the third industrial revolution (Radziwill 2018). The current and fourth industrial revolution, commonly known as Industry 4.0 (Kagermann 2011), is characterized by machine intelligence, pervasive computing, affordable storage, robust connectivity (Radziwill 2018), the Internet of Systems, and the Industrial Internet of Things (IIoT), and it fosters the vision of a smart factory. Data are the enabler and the differentiator in the Industry 4.0 era. In this connected, digital age, high-quality data form the basis for every solid operational and strategic process. Good quality data are essential to providing excellent customer service, making operations efficient, ensuring compliance with regulatory requirements, engaging in effective decision making, and conducting effective strategic business planning. The data used for all these decisions need to be managed efficiently in order to generate a return.
Additionally, the same data are often used several times for multiple purposes. For example, address data are used for deliveries, billing, invoices, and marketing. Product data are used for sales, inventory, forecasting, marketing, financial forecasts, and supply chain management. Although one often thinks in terms of software systems driving organizational processes, data are the true foundation for the various applications and systems for business functions in an organization (Mahanti 2019).
DATA-CENTRIC ORGANIZATIONS
In some organizations, data are the primary product or service. Insurance companies, banks, online retailers, credit card companies, financial services companies, and the Internal Revenue Service (IRS) are all organizations in which business is data centric. These organizations rely heavily on data and processing data as their primary activities. These organizations primarily process and trade information products (Mahanti 2019).
Other organizations, such as manufacturing, utilities, and healthcare organizations, may appear to be less involved with information systems because their products or activities are not information-specific. However, if one looks beyond the products into operations, it becomes clear that most of their activities and decisions are driven by data. For instance, manufacturing organizations process raw materials to produce and ship products. However, data drive the processes of material acquisition, inventory management, supply chain management, determining final product quality, order processing, shipping, and billing (Mahanti 2019).
For hospitals and healthcare organizations, the primary activity is medical and patient care. While these activities on their own are not information-centric, hospitals need to store and process patient data, physician data, encounter data, information about care protocols, information about resource utilization and scheduling, and patient billing to provide good quality service (Mahanti 2019). Furthermore, the Internet of Things (IoT) is making it possible to establish health monitoring networks around patients, and to connect patients and healthcare professionals via wearable sensors that collect vital data about the human body for further use (Wu et al. 2017; Cicirelli et al. 2016; Gope and Hwang 2016).
DATA AND ORGANIZATIONAL IMPACT
New trends in data warehousing, business intelligence, data mining, data analytics, decision support, enterprise resource planning, and customer relationship management systems draw attention to the fact that data play an ever-growing and important role in organizations (Mahanti 2019).
More information is available because people and devices (such as sensors and actuators) are producing data at greater rates than ever before (Radziwill 2018). Large volumes of data across the various applications and systems in organizations present a number of challenges to the organization. From executive-level decisions about mergers and acquisition activity to call-center representatives making split-second decisions about customer service, the data an enterprise collects on virtually every aspect of the organization (customers, prospects, products, inventory, finances, assets, or employees) can have a significant effect on the organization’s ability to meet quality and performance objectives. These may include satisfying customers, reducing costs, improving productivity, mitigating risks (Dorr and Murnane 2011), or increasing operational efficiency.
Accurate, complete, current, consistent, and timely data are critical to accurate, timely, and unbiased decisions. Since data and information are the basis of decision making, they must be carefully managed to ensure they can be located easily; relied upon for their currency, completeness, and accuracy; and obtained when and where the data are needed (Mahanti 2019).
DATA QUALITY
While good data are a source of information, knowledge, and myriad opportunities, bad data are a tremendous burden and only present problems. There are many ways of defining data quality. Data quality is the capability of data to satisfy the stated business, system, and technical requirements of an enterprise. Data quality can be defined as the data’s fitness for use or purpose for a given context or specific task. Data quality reflects insights into, or direct evaluation of, data’s fitness to serve their purpose in a given context. In this sense, data reflect J. M. Juran’s “fitness for use” criterion applied to any other product or entity, and data on performance are essential for all quality planning (Bisgaard 2007).
Data quality is accomplished when a business uses data that are, at a minimum, complete, relevant, and timely. Determination of quality is dynamic, as a certain level of excellence is not universal, not an absolute, and not a constant but is assessed to a relative degree. The same applies in the case of data quality (Mahanti 2019).
When people talk about data quality, they usually relate to data accuracy only and do not consider or assess other important data quality dimensions in their quest to achieve better quality data. Undeniably, data are normally considered of poor quality if erroneous values are associated with the real-world entity or event, such as an incorrect zip code used in an address, a wrong date of birth, an incorrect title or gender for employees or customers, incorrect phone numbers or email IDs in contact information, or incorrect product specifications in the case of products in a retail store (Mahanti 2019).
However, data quality is not one-dimensional, but rather multidimensional and hierarchical, and hence complex. While accuracy is definitely an important characteristic of data quality and therefore should not be overlooked, accuracy alone does not completely characterize the quality of data. Data quality has many more attributes than the evident characteristic of data accuracy. There are other substantial dimensions, such as completeness, consistency, currency, and timeliness, that are needed to holistically illustrate the quality of data across multiple dimensions (Mahanti 2019).
Despite the fact that fitness for use or purpose does capture the principle of quality, it is still abstract. Thus, it is a challenge to measure data quality using only this holistic construct or definition. It needs to be broken into measurable facets or characteristics, known as data quality dimensions, for data quality assessment to be actionable. Hence, to measure data quality, one needs to measure one or more of the dimensions of data quality, depending on the context, situation, and task for which the data are to be used. In short, these data quality dimensions enable one to operationalize data quality (Mahanti 2019).
DATA QUALITY DIMENSIONS
and assessing data quality and includes the foundational information necessary to comprehend common-sense assumptions about data, thus providing a starting point for defining expectations related to data quality (Sebastian-Coleman 2013).
In the absence of metadata or given inadequate metadata, subject matter experts need to be consulted to get an understanding of the data. When measuring data quality dimensions, it is also imperative to consider the data granularity level at which they are applicable, so the measurements are practically useful. In studying data quality dimensions, the author observes that some dimensions (for example, data coverage and timeliness) are applicable at higher granularity levels, such as the data set level. Alternatively, dimensions such as completeness can be applicable at lower levels of data granularity, namely the data element level. Granularity may depend on the types of the dimensions that are selected for measurement (Mahanti 2019).
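As a rough illustration of measurement at the data set level, the sketch below scores data coverage and timeliness for a single daily file. The store identifiers, the 06:00 due time, and the function names are hypothetical and chosen only for illustration; the coverage ratio is one plausible way to turn the definition into a number, not a prescribed formula.

```python
from datetime import datetime

# Hypothetical population of interest and the stores actually present in the data set.
expected_store_ids = {"S01", "S02", "S03", "S04", "S05"}
received_store_ids = {"S01", "S02", "S04", "S05"}


def data_coverage(received, expected):
    """Share of the population of interest that is represented in the data set."""
    return len(received & expected) / len(expected)


def is_timely(available_at, due_at):
    """True when the data set was available for consumption by the agreed time."""
    return available_at <= due_at


coverage_score = data_coverage(received_store_ids, expected_store_ids)  # 4 of 5 stores

# Assumed service expectation: the daily file is due by 06:00.
timely = is_timely(
    available_at=datetime(2019, 6, 1, 5, 30),
    due_at=datetime(2019, 6, 1, 6, 0),
)
```

Both measures describe the data set as a whole; neither says anything about whether an individual data element within a record is populated or valid.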
Data quality dimensions that are related to characteristics of the data themselves, for example, completeness, accuracy, consistency, uniqueness, integrity, and validity, are primarily defined based on data element and/or data record level. Measurements in this case generally involve objectively examining data values stored in the data set against business rules to measure the data quality dimensions. On the other hand, the data quality dimensions that deal with the usage of data that contribute to users’ judgment about the data’s fitness for use, such as interpretability, accessibility, and credibility, may be defined based on any arbitrary abstraction of data elements, records, or data sets (Mahanti 2019).
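To make the idea of examining stored values against business rules concrete, the following sketch measures completeness and validity for individual data elements. The sample records, the five-digit zip code rule, and the function names are assumptions made for this example; real business rules would come from the organization's own standards and metadata.

```python
import re

# Hypothetical sample records; None or "" marks a missing value.
records = [
    {"customer_id": "C001", "zip_code": "53203", "email": "a@example.com"},
    {"customer_id": "C002", "zip_code": "5320", "email": ""},
    {"customer_id": "C003", "zip_code": None, "email": "c@example.com"},
    {"customer_id": "C004", "zip_code": "60601", "email": "d@example.com"},
]

# Assumed business rule: a valid zip code is exactly five digits.
ZIP_RULE = re.compile(r"\d{5}")


def is_populated(value):
    """A value counts as present when it is neither None nor an empty string."""
    return value not in (None, "")


def completeness(rows, element):
    """Share of rows in which the data element is populated."""
    return sum(1 for row in rows if is_populated(row.get(element))) / len(rows)


def validity(rows, element, rule):
    """Share of populated values that conform to the format rule."""
    values = [row[element] for row in rows if is_populated(row.get(element))]
    if not values:
        return 0.0
    return sum(1 for value in values if rule.fullmatch(value)) / len(values)


zip_completeness = completeness(records, "zip_code")    # 3 of 4 rows populated
email_completeness = completeness(records, "email")     # 3 of 4 rows populated
zip_validity = validity(records, "zip_code", ZIP_RULE)  # 2 of 3 populated values valid
```

Validity is computed over populated values only, so a missing zip code lowers completeness without also being counted as a format violation. This is a design choice for the sketch; some organizations may prefer to count missing values against both dimensions.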
CONCLUDING THOUGHTS
Data without quality can neither contribute any value nor serve any purpose. Hence, high-quality data are not a “nice-to-have” requirement but a “must-have” requirement. A data quality improvement program can be driven with the Six Sigma approach (Mahanti 2019). However, Six Sigma improvement projects will not yield reliable outcomes without measurements that take into account all the required data quality dimensions.
While measurement is an integral part of the data quality journey, data quality management involves much more than measurement. It also involves the management of people, processes, policies, technology, standards, and data within an enterprise. Data quality management is data-, people-, process-, and technology-intensive, with data at the core, and it succeeds only when all these elements work in an integrated manner (Mahanti 2019). Good data quality and effective management of data and processes to improve and sustain data quality can reduce costs and risks, and (eventually) resolved or minimized. Enterprise data must conform to the various dimensions of data quality an organization has determined are important to be fit for operational and analytical use (Mahanti 2019).
When defining data quality measures, one should try to focus at a minimum on the dimensions that are meaningful and pertinent for the business with maximum return on investment. On the other hand, measuring all the different dimensions of data quality gives the complete picture. Each organization needs to identify the appropriate balance based on its unique competitive position and risk appetite. Also, data quality dimensions are interconnected, and dependencies and correlations exist between them, which must be taken into account when measurement is being planned (Mahanti 2019). The different data quality dimensions are summarized in Table 1.
MEASURING DATA QUALITY DIMENSIONS
The management axiom “what gets measured gets managed” (Willcocks and Lester 1996) applies to data quality and, in this light, data quality dimensions signify a fundamental management element in the data quality arena. Measurement exposes the hidden truths and thus is essentially the first step toward diagnosing and fixing data quality.
With data quality being such a broad topic, and with the huge amounts of data and number of data elements that organizations have and continue to capture, store, and accumulate (thanks to the capabilities of digitization), measurement can feel overwhelming. The myth that data need to be 100 percent error-free makes things even more difficult. Not all data quality dimensions need to be measured for data, nor do all data elements need to be subject to measurement. Only those data elements that drive significant benefits should be measured for quality purposes. However, data do not need to be 100 percent error-free, and, though data quality is broad, the various data quality dimensions make measurement an achievable exercise (Mahanti 2019).
The degree of data quality excellence that should be attained and sustained is driven by the criticality of the data, the business need, and the cost and time to achieve the defined degree of data quality. The costs in time, resources, and dollars to achieve and sustain the desired quality level must be balanced against the return on investment and benefits derived from that degree of quality (Mahanti 2019).
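One way to picture a target that is driven by criticality rather than by perfection is to give each critical data element its own target score and report only the shortfalls. The element names, target values, and measured scores below are invented for illustration and do not come from the article.

```python
# Hypothetical targets: more critical elements get a stricter target, none is 1.0.
targets = {"zip_code": 0.98, "email": 0.90}

# Hypothetical measured scores for the same elements (for example, validity).
measured = {"zip_code": 0.95, "email": 0.93}


def shortfalls(measured_scores, target_scores):
    """Return the gap to target for every element that falls below its target."""
    return {
        name: round(target_scores[name] - score, 4)
        for name, score in measured_scores.items()
        if score < target_scores[name]
    }


gaps = shortfalls(measured, targets)
```

In this sketch only `zip_code` is reported, because `email` already meets its less demanding target. The cost of closing each reported gap can then be weighed against the benefit expected from doing so.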
Data Quality Dimension | Definition
Accessibility | The ease with which the existence of data can be determined, the suitability of the form or medium through which the data can be quickly and easily retrieved.
Accuracy | The extent to which data are the true representation of reality, be it features of the real-world entity, situation, object, phenomenon or event, which they intend to model.
Believability | The extent to which the data are regarded as being trustworthy and credible by the user.
Credibility | The extent to which the good faith of a provider of data or source of data can be relied upon to ensure what the data actually represent is what the data are supposed to represent, and there is no intent to misrepresent what the data are supposed to represent (Chisholm 2014).
Trustworthiness | The extent to which the data originate from trustworthy sources.
Reputation | The extent to which the data are highly regarded in terms of their source or content (Pipino et al. 2002).
Timeliness | The time expectation of availability of data for consumption.
Currency | The extent to which the stored data values are sufficiently up to date for the intent of use despite lapse of time.
Volatility | The frequency with which the data elements change over time.
Correctness | Refers to freedom from errors.
Precision | The extent to which the data elements contain a sufficient level of detail.
Reliability | Whether the data can be counted on to convey the right information (Wand and Wang 1996).
Consistency | The extent to which the same data are equivalent across different data tables, sources, or systems.
Integrity | The extent to which data are not missing important relationship linkages (Faltin et al. 2012) and the relationship linkages are valid.
Completeness | The extent to which the applicable data (data element, records or data set) are not absent.
Conformance/Validity | The extent to which data elements comply to a set of internal or external standards or guidelines or standard data definitions, including data type, size, format, and other features.
Interpretability | The extent to which the user can easily understand and properly use and analyze the data.
Security | The extent to which access to data is restricted and regulated appropriately to prevent unauthorized access.
Conciseness | The extent to which the data are represented in a compact manner but at the same time are complete.
Uniqueness | The extent to which an entity is recorded only once and there are no repetitions. Duplication is the inverse of uniqueness.
Duplication | The extent of unwanted duplication of an entity. Uniqueness is the inverse of duplication.
Cardinality | Refers to the uniqueness of the data values that are contained in a particular column, known as attribute, of a database table.
Data coverage | The extent of the availability and comprehensiveness of the data when compared to total data universe or population of interest (McGilvray 2008).
Relevance | The extent to which the data content and coverage is relevant for the purpose for which it is used and the extent to which it meets the current and potential future needs.
Ease of manipulation | The extent to which the data can be easily manipulated or transformed for different tasks.
Objectivity | The extent to which the data are free from bias and judgement.
Traceability/Lineage | The extent to which data can be verified with respect to the origin, history, first inserted date and time, updated date and time, and audit trail by means of documented recorded identification.
Data Specification | A measure of the existence, completeness, quality, and documentation of data standards, data models, business rules, metadata, and reference data (McGilvray 2008).
Granularity | The extent to which data elements can be subdivided.
Redundancy | The extent to which data are replicated and captured in two different systems in different storage locations.
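The relationship among uniqueness, duplication, and cardinality in the definitions above can be sketched in a few lines. The customer rows are hypothetical, and expressing uniqueness as the ratio of distinct entities to total records is an assumption made for this example rather than a formula given in the article.

```python
from collections import Counter

# Hypothetical customer table; customer_id identifies the real-world entity.
rows = [
    {"customer_id": "C001", "state": "WI"},
    {"customer_id": "C002", "state": "IL"},
    {"customer_id": "C002", "state": "IL"},  # unwanted repeat of C002
    {"customer_id": "C003", "state": "WI"},
    {"customer_id": "C004", "state": "MN"},
]


def uniqueness(table, key):
    """Distinct entities divided by total records; 1.0 means no repetitions."""
    return len({row[key] for row in table}) / len(table)


def duplicated_keys(table, key):
    """Entities that are recorded more than once, with their record counts."""
    counts = Counter(row[key] for row in table)
    return {value: count for value, count in counts.items() if count > 1}


def cardinality(table, column):
    """Number of distinct values held in a column (attribute)."""
    return len({row[column] for row in table})


uniqueness_score = uniqueness(rows, "customer_id")  # 4 distinct entities in 5 records
duplication_score = 1 - uniqueness_score            # inverse of uniqueness
repeats = duplicated_keys(rows, "customer_id")
state_cardinality = cardinality(rows, "state")
```

Uniqueness and duplication are judged per entity, whereas cardinality is a property of a single column: the `state` column legitimately repeats values without any record being a duplicate.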
Mahanti, R. 2019. Data quality: Dimensions, measurement, strategy, management and governance. Milwaukee: ASQ Quality Press.
McGilvray, D. 2008. Executing data quality projects. Burlington, MA: Morgan Kaufmann.
Pipino, L. L., Y. W. Lee, and R. Y. Wang. 2002. Data quality assessment. Communications of the ACM 45, no. 4:211-218.
Radziwill, N. M. 2018. Let’s get digital. Quality Progress (October).
Sebastian-Coleman, L. 2013. Measuring data quality for ongoing improvement: A data quality assessment framework. Burlington, MA: Morgan Kaufmann.
Wand, Y., and R. Y. Wang. 1996. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, no. 11:86-95.
Willcocks, L., and S. Lester. 1996. Beyond the IT productivity paradox. European Management Journal 14, no. 3:279-290.
Wu, T., F. Wu, J. M. Redouté, and M. R. Yuce. 2017. An autonomous wireless body area network implementation towards IoT connected healthcare applications. IEEE Access 5:11413-11422.
eliminate waste, empower decision making, improve customer satisfaction, enhance brand image, and help organizations be compliant and satisfy privacy, security, and regulatory requirements.
Data quality is just as important whether an organization is dealing with Big Data or “small data.” However, data management technologies that are suitable for smaller data will not work for Big Data, owing to the different Vs of Big Data (namely, volume, velocity, variety, and veracity). For example, fancy charts, graphs, and analysis provided for Big Data may not contain the comparable accuracy that is experienced with the “tried-and-true” methods for smaller data (Duarte and Dame 2019). Future research will be focused on Big Data.
ACKNOWLEDGEMENTS
To learn more about data quality, such as the myths, challenges, critical success factors, strategy, DQ dimensions, data profiling, and more, including how to measure data quality dimensions, implement methodologies for data quality management, and data quality aspects to consider when undertaking data intensive projects, please read Data Quality: Dimensions, Measurement, Strategy, Management and Governance published by ASQ Quality Press in 2019. This article draws significantly from the research presented in that book. Special thanks to Nicole Radziwill, SQP editor, for reviewing and editing this article.
REFERENCES
Bisgaard, S. 2007. Quality management and Juran’s legacy. Quality and Reliability Engineering International 23, no. 6:665-677.
Chisholm, M. 2014. Data credibility: A new dimension of data quality? Available at: https://www.information-management.com/news/data-credibility-a-new-dimension-of-data-quality.
Cicirelli, F., G. Fortino, A. Giordano, A. Guerrieri, G. Spezzano, and A. Vinci. 2016. On the design of smart homes: A framework for activity recognition in home environment. Journal of Medical Systems 40, no. 9:1-17.
Dorr, B., and R. Murnane. 2011. Using data profiling, data quality, and data monitoring to improve enterprise information. Software Quality Professional 13, no. 4:9.
Duarte, J., and J. Dame. 2019. Data science and the quality professional. Software Quality Professional 21, no. 3:13-19.
Faltin, F., R. S. Kenett, and F. Ruggeri. 2012. Statistical methods in healthcare. New York: Wiley.
Kagermann, H., W.-D. Lukas, and W. Wahlster. 2011. Industrie 4.0: Mit dem Internet der Dinge auf dem Weg zur 4. Industriellen Revolution. VDI Nachrichten 13, no. 11. Available at: https://tinyurl.com/ly6vkgf.
BIOGRAPHY
Rupa Mahanti is a business and information management consultant and has extensive and diversified consulting experience in different