“Good decisions require good data”
Process for
Data Quality Assurance
at Manitoba Centre for Health Policy (MCHP)
Mahmoud Azimaee Data Analyst at ICES
Literature and Resources
• CIHI Data Quality Framework, (2009
edition)
• UK’s NHS Data Quality Reports • Handbook on Data Quality
Assessment Methods and Tools,
(European Commission)
• Handbook on Improving Quality by
Analysis of Process Variables
,
(European Commission) • Data fitness
Data Quality at MCHP
1. Data Quality Indicators
2. Rating System
– CIHI Data Quality Framework,
(2009 edition)
3. Data Quality Report
– UK’s NHS Data Quality Reports
4. Practical Approach
5. Automation
– Cody’s Data Cleaning
Data Quality Indicators and Rating
System
• Example:
– Completeness: Rate of missing values for all data
elements.
Level of Agreement With Other Databases Completeness Measurement Error Level of Bias Degree of Problems with Consistency Accuracy Reliability Availability and Quality of: Documents , Policies and Procedures, Formats Libraries, Metadata, Data Model Diagrams Time to Acquisition Time to Release Currency of Data Identifying Units of Analysis (Persons, Places, Things, ...) Level of Agreement With the Literature and available reports Internal Consistency Stability Across Time Linkability Completeness (Missing Values) Correctness (Invalid codes, Invalid Dates, Out of Range, Outliers and Extreme Observations)
Accuracy Internal Validity External
Validity Timeliness Interpretability
Database Level (In Data Management)
Research Level
(In a Specific Research Projects)
Data Quality Assurance
Data
Management
Process at
MCHP
6. Release Data to Analyst(s) and Researcher(s)
Meet with programmer(s) and researcher(s) to present data structure and content
5. Document Data
Including original documents, data model diagram, SPDS data dictionary, history, file variations and structural changes, revisions and common problems and data quality report, where available
4. Evaluate Data Quality
Test the installed data using standardized protocol
Identify solutions to address deficiencies in data quality
Prepare data quality report for addition to standard documentation
3. Apply SAS Programs
Apply Normalization or De-normalization as required
Normalization can be defined as the practice of optimizing table structures by eliminating redundancy and inconsistent dependency
Apply data field and SAS format standards
Install on SPD server
(This includes indexing, sorting and clustering)
Create Metadata
If there is a problem, liaise with the source
agency
2. Become Familiar with Data Structure and Content
Review provided documentation If required, create a data model for the original data If receiving test data, test it and send feedback to the source agency
1. Formulate the Request and Receive the Data
Check the data sharing agreements
Liaise with the source agency to acquire available data, data model diagram, data dictionary, documentation about historical changes in data content,
format, and structure, data quality reports
Prepare the data request letter
Receive the data and associated documentation
How to Present Data Quality
Results?
• CIHI Data Quality Report
• UK’s NHS Data Quality Report
–
VODIM
Test Analysis Methodology
• Valid • Other • Default • Invalid • Missing • Valid • Invalid • Missing • Outlier VIMO!
(1) I just discovered that the data system we have been
working on for the last five years has major data quality
problems.
(2) That is why I treat data systems the same way I do sausage – I do not want to know
what is inside either one. (3) Ouch!! That is why I am a
vegetarian!
•
Example 1: Identifying Outliers/Extreme
Observations:
1. Standard Deviation (Mean +/- 2*SD) 2. Trimmed Standard Deviation
(MeanTrimmed10% +/- 2*1.49*SDTrimmed10%) 3. Interquartile Range
(Q1 – k*IQR , Q3 + k*IQR), k=2.5
– Ordered statistics for calculating quartiles is very memory intensive
=>> P² method to approximate the quartiles (Using QMETHOD=P2 in PROC MEANS)
[piecewise-parabolic (P²) algorithm invented by Jain and Chlamtac (1985)]
Operational Approaches
Example 2: Stability Across Time
Based on CIHI guideline:
– Trend analysis is used to examine changes in core data elements over time
– No change across years may also be an
indication of a problem if the data is expected to naturally trend upward or downward
– Changes in methodology or inclusion/exclusion criteria should be taken into account to determine whether the observed changes were real or not.
Example 2: Stability Across Time
(Continued)• Identify unusual changes
– Outlier analysis
• Outlier analysis requires a model
– How to choose an appropriate model in an automated fashion?
• Fit a series of common models: – Simple Linear: Y=β0 + β 1X – Quadratic: Y= β 0 + β 1X2 – Exponential: Y= β 0 + β 1exp(X) – Logarithmic: Y= β 0 + β 1log(X) – SQRT: Y= β 0 + β 1 𝑥 – Inverse: Y= β 0 + β 1 1 𝑥
Example 2: Stability Across Time
(Continued)• Choose the best model with the minimum MSE • Re-fit the chosen model on the data
• Do an outlier analysis
– Estimate Studentized residuals for each observation (with the current observation deleted)
• Flag significant observations as potential outliers • Flag observation with no changes over time
• How about Small Cell Size Policy? (0<Frequency<6)
– Use the actual values in modeling but flag and then force them to 3 in the report
• MCHP’s data repository includes over 65
health and other administrative databases,
(linkable using a common encrypted
individual identifier).
• Annual updates for most of the databases
in its repository.
• Designing an automated process became a
must!
• A SAS Macro based application package
was developed (16 Macros)
– Pre Data Quality Macro (1)
– Main Macros (6)
– Intermediate Macros (9)
Automation
(Continued) Documentation System VIMO Macro GETNOBS Macro INVALID Macro POSTMUN Macro OUTLIER Macro GETFORMAT Macro GETVARLIST Macro METADATA MACRO Special Features:• Can handle standalone and Clustered tables • Can Validate Postal and Municipal codes
LINK Macro
GETNOBS Macro
TREND Macro GETNOBS Macro FISCAL Macro MONTHLY Macro
Automation
(Continued)CONTENT MACRO AGREEMENT MACRO PHINCHECK MACRO • Checks 3
rd and 5th positions of PHINs which must be 0 and 9
• Compares the distribution of the first position with the corresponding PHINs from registry files
Non-Automated Indicators
• Internal Consistency
Level of Agreement With Other Databases Completeness Measurement Error Level of Bias Degree of Problems with Consistency Accuracy Reliability TREND Macro LINK Macro PHINCHECK Macro AGREEMENT Macro VOMO Macro Availability and Quality of: Documents , Policies and Procedures, Formats Libraries, Metadata, Data Model Diagrams Time to Acquisition Time to Release Identifying Units of Analysis (Persons, Places, Things, ...) Level of Agreement With the Literature and available reports Internal Consistency Stability Across Time Linkability Completeness (Missing Values) Correctness (Invalid codes, Invalid Dates, Out of Range, Outliers and Extreme Observations)
Accuracy Internal Validity External
Validity Timeliness Interpretability
Database Level (In Data Management)
Research Level
(In a Specific Research Projects)
• Central Format Library • Metadata Database • Standardization
– Bad standards are better than no standards at all!
• Data Quality Algebra • Data Quality Axioms
Acknowledgment
• Mr. Mark Smith
(MCHP Associate Director, Repository)• Dr. Lisa Lix
(Associate Professor at University of Saskatchewan)CONTACT INFORMATION Mahmoud Azimaee
Institute for Clinical Evaluative Sciences Work Phone: (647) 480-4055 (Ex. 3618)
E-mail: [email protected]