• No results found

Data Quality Assurance

N/A
N/A
Protected

Academic year: 2021

Share "Data Quality Assurance"

Copied!
26
0
0

Loading.... (view fulltext now)

Full text

(1)

“Good decisions require good data”

Process for

Data Quality Assurance

at Manitoba Centre for Health Policy (MCHP)

Mahmoud Azimaee Data Analyst at ICES

(2)

Literature and Resources

• CIHI Data Quality Framework, (2009

edition)

• UK’s NHS Data Quality Reports • Handbook on Data Quality

Assessment Methods and Tools,

(European Commission)

• Handbook on Improving Quality by

Analysis of Process Variables

,

(European Commission) • Data fitness

(3)

Data Quality at MCHP

1. Data Quality Indicators

2. Rating System

– CIHI Data Quality Framework,

(2009 edition)

3. Data Quality Report

– UK’s NHS Data Quality Reports

4. Practical Approach

5. Automation

– Cody’s Data Cleaning

(4)

Data Quality Indicators and Rating

System

• Example:

Completeness: Rate of missing values for all data

elements.

(5)

Level of Agreement With Other Databases Completeness Measurement Error Level of Bias Degree of Problems with Consistency Accuracy Reliability Availability and Quality of: Documents , Policies and Procedures, Formats Libraries, Metadata, Data Model Diagrams Time to Acquisition Time to Release Currency of Data Identifying Units of Analysis (Persons, Places, Things, ...) Level of Agreement With the Literature and available reports Internal Consistency Stability Across Time Linkability Completeness (Missing Values) Correctness (Invalid codes, Invalid Dates, Out of Range, Outliers and Extreme Observations)

Accuracy Internal Validity External

Validity Timeliness Interpretability

Database Level (In Data Management)

Research Level

(In a Specific Research Projects)

Data Quality Assurance

(6)

Data

Management

Process at

MCHP

6. Release Data to Analyst(s) and Researcher(s)

Meet with programmer(s) and researcher(s) to present data structure and content

5. Document Data

Including original documents, data model diagram, SPDS data dictionary, history, file variations and structural changes, revisions and common problems and data quality report, where available

4. Evaluate Data Quality

Test the installed data using standardized protocol

Identify solutions to address deficiencies in data quality

Prepare data quality report for addition to standard documentation

3. Apply SAS Programs

Apply Normalization or De-normalization as required

Normalization can be defined as the practice of optimizing table structures by eliminating redundancy and inconsistent dependency

Apply data field and SAS format standards

Install on SPD server

(This includes indexing, sorting and clustering)

Create Metadata

If there is a problem, liaise with the source

agency

2. Become Familiar with Data Structure and Content

Review provided documentation If required, create a data model for the original data If receiving test data, test it and send feedback to the source agency

1. Formulate the Request and Receive the Data

Check the data sharing agreements

Liaise with the source agency to acquire available data, data model diagram, data dictionary, documentation about historical changes in data content,

format, and structure, data quality reports

Prepare the data request letter

Receive the data and associated documentation

(7)

How to Present Data Quality

Results?

• CIHI Data Quality Report

• UK’s NHS Data Quality Report

VODIM

Test Analysis Methodology

Valid • Other • Default • Invalid • Missing • Valid • Invalid • Missing • Outlier VIMO!

(8)
(9)
(10)

(1) I just discovered that the data system we have been

working on for the last five years has major data quality

problems.

(2) That is why I treat data systems the same way I do sausage – I do not want to know

what is inside either one. (3) Ouch!! That is why I am a

vegetarian!

(11)

Example 1: Identifying Outliers/Extreme

Observations:

1. Standard Deviation (Mean +/- 2*SD) 2. Trimmed Standard Deviation

(MeanTrimmed10% +/- 2*1.49*SDTrimmed10%) 3. Interquartile Range

(Q1 – k*IQR , Q3 + k*IQR), k=2.5

– Ordered statistics for calculating quartiles is very memory intensive

=>> P² method to approximate the quartiles (Using QMETHOD=P2 in PROC MEANS)

[piecewise-parabolic (P²) algorithm invented by Jain and Chlamtac (1985)]

(12)

Operational Approaches

Example 2: Stability Across Time

Based on CIHI guideline:

– Trend analysis is used to examine changes in core data elements over time

– No change across years may also be an

indication of a problem if the data is expected to naturally trend upward or downward

– Changes in methodology or inclusion/exclusion criteria should be taken into account to determine whether the observed changes were real or not.

(13)

Example 2: Stability Across Time

(Continued)

• Identify unusual changes

– Outlier analysis

• Outlier analysis requires a model

– How to choose an appropriate model in an automated fashion?

• Fit a series of common models: – Simple Linear: Y=β0 + β 1X – Quadratic: Y= β 0 + β 1X2 – Exponential: Y= β 0 + β 1exp(X) – Logarithmic: Y= β 0 + β 1log(X) – SQRT: Y= β 0 + β 1 𝑥 – Inverse: Y= β 0 + β 1 1 𝑥

(14)

Example 2: Stability Across Time

(Continued)

• Choose the best model with the minimum MSE • Re-fit the chosen model on the data

• Do an outlier analysis

– Estimate Studentized residuals for each observation (with the current observation deleted)

• Flag significant observations as potential outliers • Flag observation with no changes over time

• How about Small Cell Size Policy? (0<Frequency<6)

– Use the actual values in modeling but flag and then force them to 3 in the report

(15)

• MCHP’s data repository includes over 65

health and other administrative databases,

(linkable using a common encrypted

individual identifier).

• Annual updates for most of the databases

in its repository.

• Designing an automated process became a

must!

(16)

• A SAS Macro based application package

was developed (16 Macros)

– Pre Data Quality Macro (1)

– Main Macros (6)

– Intermediate Macros (9)

(17)

Automation

(Continued) Documentation System VIMO Macro GETNOBS Macro INVALID Macro POSTMUN Macro OUTLIER Macro GETFORMAT Macro GETVARLIST Macro METADATA MACRO Special Features:

• Can handle standalone and Clustered tables • Can Validate Postal and Municipal codes

(18)

LINK Macro

GETNOBS Macro

(19)

TREND Macro GETNOBS Macro FISCAL Macro MONTHLY Macro

Automation

(Continued)
(20)

CONTENT MACRO AGREEMENT MACRO PHINCHECK MACRO • Checks 3

rd and 5th positions of PHINs which must be 0 and 9

• Compares the distribution of the first position with the corresponding PHINs from registry files

(21)

Non-Automated Indicators

• Internal Consistency

(22)

Level of Agreement With Other Databases Completeness Measurement Error Level of Bias Degree of Problems with Consistency Accuracy Reliability TREND Macro LINK Macro PHINCHECK Macro AGREEMENT Macro VOMO Macro Availability and Quality of: Documents , Policies and Procedures, Formats Libraries, Metadata, Data Model Diagrams Time to Acquisition Time to Release Identifying Units of Analysis (Persons, Places, Things, ...) Level of Agreement With the Literature and available reports Internal Consistency Stability Across Time Linkability Completeness (Missing Values) Correctness (Invalid codes, Invalid Dates, Out of Range, Outliers and Extreme Observations)

Accuracy Internal Validity External

Validity Timeliness Interpretability

Database Level (In Data Management)

Research Level

(In a Specific Research Projects)

(23)
(24)

• Central Format Library • Metadata Database • Standardization

– Bad standards are better than no standards at all!

(25)

• Data Quality Algebra • Data Quality Axioms

(26)

Acknowledgment

• Mr. Mark Smith

(MCHP Associate Director, Repository)

• Dr. Lisa Lix

(Associate Professor at University of Saskatchewan)

CONTACT INFORMATION Mahmoud Azimaee

Institute for Clinical Evaluative Sciences Work Phone: (647) 480-4055 (Ex. 3618)

E-mail: [email protected]

References

Related documents

The Chinese authorities announced on Saturday 19 June 2010 that the exchange rate regime of the renminbi (RMB) was reformed to enhance flexibility and that the reference for

So, it’s a three man race now…almost at the end and, although Corinth and Sparta are catching up, it looks like Athens will be able to hold them off…and, yes, Athens are

If the Customer requests that both the existing AT&amp;T Switched Ethernet Service and the new AT&amp;T Switched Ethernet Service be in service at the same time, such

This study adopted a mixed method. It includes three surveys at three time points, an opened-ended questionnaire, and student reflection paper. Due to the space limitation, the

MSS (microsatellite stable) colorectal cancer and to ascertain whether sex of the patient influences prognosis in MSI patients, irrespective of stage or

For the stand-alone derivatives entered into in the fiscal year ended March 31, 2012 and the currency swaps associated with Tranche F1 issued under the Senior Credit Facility, that

The results of the high proficient students in both the control and the experimental groups brought good news to teachers that the combination of peer and self-directed