• No results found

How to extract transform and load observational data?

N/A
N/A
Protected

Academic year: 2021

Share "How to extract transform and load observational data?"

Copied!
48
0
0

Loading.... (view fulltext now)

Full text

(1)

How to extract transform and load observational data?

Martijn Schuemie

Janssen Research & Development

Department of Pharmacology & Pharmacy, The University of Hong Kong

(2)

Outline

• Observational data & research networks

• OMOP Common Data Model (CDM)

• Transforming data to the CDM

(3)
(4)

Observational health data

Subjective:

• Complaint and symptoms

• Medical history

Objective:

• Observations

• Measurements – vital signs, laboratory tests, radiology/ pathology findings

Assessment and Plan:

• Diagnosis

(5)

• All captured in the medical record

• Medical records are increasingly being captured in EHR systems

• Data within the EHR are becoming increasingly structured

(6)

The average primary care physician sees 20 different patients a day

care

(7)

…that’s 100 visits in a week…

care

(8)

…which can reflect a panel of over 2,000 patients in a year

care

(9)

Databases for research

• Electronic Health Records

• Insurance claims

• Registries

• Diseases

• Health care provided

• Effects of treatments

• Differences between patients

• Personalized medicine

Evidence generation

(10)

Key point

Existing observational health care data such as Electronic Health Records and insurance claims databases have great potential for research

现有的卫生保健观测数据,例如电子健康记 录、保险索赔数据库等具有巨大的潜力为研 究所用

(11)

Research networks

• Data may be at different sites

• Sites often cannot share data at the patient level

• Data can be in very different formats

Hospital Claims Registry

Practice

Patient level, identifiable information

(12)

Hospital Claims Registry Practice

Aggregate summary statistics

Patient level, identifiable information

ETL ETL ETL ETL

Extract, Transform, Load (ETL) Common Data Model

Firewall Standardized analytics

(13)

Hospital Claims Registry Practice

Patient level, identifiable information

ETL ETL ETL ETL

Extract, Transform, Load (ETL) Common Data Model

Firewall Count people on drug A

and B, and the number of outcomes X

(14)

Hospital Claims Registry Practice

Patient level, identifiable information

ETL ETL ETL ETL

Extract, Transform, Load (ETL) Common Data Model

Firewall

Practice:

100 patients on A, 1 has X 200 patients on B, 4 have X Hospital:

1000 patients on A, 10 have X 2000 patients on B, 40 have X Etc.

Count people on drug A and B, and the number of

outcomes X

Aggregated data No privacy concerns

(15)

Distributed research networks

• Europe

• United States

• Asia

• Global

(16)

Key point

Observational data can often not be shared. Using a common data model allows analysis programs to ‘visit’ the data instead.

观测数据通常不是共享的。而使用公共数据 模型可以实现分析程序对不同来源数据的统 一“访问”

(17)
(18)

Common Data Model

• A common structure

• A common vocabulary

Person - person_id - year_of_birth - month_of_birth - day_of_birth - gender_concept_id

How do we store gender? - M, F, U

- 0, 1, 2

(19)

OMOP Common Data Model

• Designed for various types of data (EHR, insurance claims) in various countries

• Developed by the OMOP / OHDSI community

• Currently on 5th version

(20)

Concept Concept_relationship Concept_ancestor Vocabulary Source_to_concept_map Relationship Concept_synonym Drug_strength Cohort_definition St and ar d iz ed vo cabu la ries Attribute_definition Domain Concept_class Cohort Dose_era Condition_era Drug_era Cohort_attribute St and ar d iz ed d eriv ed elemen ts St and ar d iz ed cli n ical d at a Drug_exposure Condition_occurrence Procedure_occurrence Visit_occurrence Measurement Procedure_cost Drug_cost Observation_period Payer_plan_period Provider Care_site Location Death Visit_cost Device_exposure Device_cost Observation Note

Standardized health system data

Fact_relationship Specimen CDM_source Standardized meta-data St and ar d iz ed h ea lth ec o n o mic s

Drug safety surveillance Device safety surveillance Vaccine safety surveillance

Comparative effectiveness

Health economics

Quality of care Clinical research

One model, multiple use cases

(21)

CDM principles

• Person centric

• Data is split into domains (e.g. conditions, drugs, procedures)

• Preserve source values and map to standard values

• Store the source of the data

(22)

OMOP Vocabulary

All codes are mapped to standard coding systems

• Drugs: RxNorm

• Conditions: SNOMED

• Measurements: LOINC

(23)

Vocabulary

Source codes SNOMED

MEDDRA Liver disorder (10024670)

Injury of liver

(39400004) Biliary cirrhosis (1761006)

Biliary cirrhosis of children (J616200) Primary biliary cirrhosis (K74.3) Primary biliary cirrhosis (31712002)

(24)

Key point

The OMOP Common Data Model provides a

common structure and a common vocabulary. Different levels of precision in different coding systems are resolved using a concept hierarchy.

OMOP公共数据模型对数据采用通用结构和通 用词汇;利用统一的概念等级解决不同编码 系统的各个级别的精确度问题。

(25)
(26)

Process to create an ETL

1. Data experts and CDM experts together design the ETL

2. People with medical knowledge create the code mappings

3. A technical person implements the ETL

(27)

Tools to support the process

1. Data experts and CDM experts together design the ETL

2. People with medical knowledge create the code mappings

3. A technical person implements the ETL

4. All are involved in quality control

W

HITE

R

ABBIT

R

ABBIT

I

N

A H

AT

U

SAGI

A

CHILLES

(28)

Step 1. Designing the ETL

• Run

W

HITE

R

ABBIT

 Scans the source database

Creates a report describing

o Tables

o Fields in the tables

o Values in the fields

(29)

Step 1. Designing the ETL

• Run

• In an interactive session with both data experts and CDM experts, run

W

HITE

R

ABBIT

(30)
(31)

Step 1. Designing the ETL

• Run

• In an interactive session with both data experts and CDM experts, run

• Output document becomes basis of ETL specification

W

HITE

R

ABBIT

(32)

Step 2. Create code mappings

• Need to map codes to one of the OMOP CDM standards:

– Drugs: RxNorm

– Conditions: SNOMED

– Lab values: LOINC

– Specialties: CMS

(33)

Step 2. Create code mappings

• Prepare data

– Codes

– Descriptions (translated to English if needed)

– Frequencies

• Use existing mapping information if available

– If ATC codes are available, use to select subsets

– Use (partial) mappings in UMLS

(34)
(35)

Step 3. Implement the ETL

No standard. Depends on the expertise of the person doing the work:

• SQL

• SAS

• C++ (CDMBuilder)

• Java (JCDMBuilder)

(36)

Step 4. Quality control

Not yet fully defined, but we currently use:

• Manually compare source and CDM information on a sample of persons

• Use Achilles software

• Replicate a study already performed on the source data

(37)

Key point

Transforming data to a common data model requires an interdisciplinary team, including clinicians, informatics specialists, and people that know the data.

实现数据到公共数据模型的转换需要一个跨 学科领域团队的合作,包括临床医生、信息 学专家以及了解数据内容的人员

(38)
(39)

Introducting OHDSI

• International collaborative

(40)

Introducting OHDSI

• International collaborative

• Coordinating center in Columbia University

• Academia and industry

• Data network

• Research (clinical + methodology)

• Open source software

(41)

ACHILLES

Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems

Aggregate statistics: - Persons

- Conditions - Drugs

- Lab results - …

Data in CDM Explore statistics

(42)

ACHILLES

• >12 databases from 5 countries across 3 different platforms:

• Janssen (Truven, Optum, Premier, CPRD, NHANES, HCUP)

• Columbia University

• Regenstrief Institute

• Ajou University

• IMEDS Lab (Truven, GE)

• UPMC Nursing Home

• Erasmus MC

(43)

Achilles Heel

(44)

Achilles

Statistics can be explored for patterns indicating quality

(45)

Open source tools

• Support for ETL

White Rabbit + Rabbit in a Hat to help design an ETL

Usagi to help create code mappings

• Data exploration / study feasibility

HERMES to explore the vocabulary

ACHILLES to explore a database

CIRCE to create cohort definitions

HERACLES to explore cohort characteristics

• Methods for performing studies

Cohort method for new user cohort studies using large-scale propensity scores

IC Temporal pattern discovery

Self-Controlled Case Series

Korean signal detection algorithms

• Knowledge base

MedlineXmlToDatabase for loading Medline into a local database

LAERTES for combining evidence from literature, labels, ontologies, etc.

Orange: Under development Black: Complete

(46)

Key point

OHDSI is a research community that develops

open source tools that run on the Common Data Model.

OHDSI是一个研究团体,主要致力于开发在公 共数据模型上运行分析的开源工具

(47)

Concluding thoughts

• Wealth of observational data to generate wealth of

evidence for health care

• ETL into Common Data Model can be large initial

investment

• Common Data Model allows sharing of results,

(48)

Thank you

References

Related documents

For example, Table 3 shows that for unemployment both the test with WCE-DM and test the with WCE-B and standard asymptotics reject at 10% significance level the null of equal

In our construction, the predicate is not complex at all (and our hope that the function is hard to invert can not arise from the complexity of the predicate).. That is, we hope

For females, only the association between temporary worklessness and NEET remained significant (Model 2). There were independent risk effects from low parental

As the LCA indicated that 40 per cent of the NEET group (LCA groups 4 and 6) consistently performed poorly across their educational life, the review analyses the potential of

If hearing-impaired listeners are relying on these particular spectro-temporal cues for phonemic and subsequently word identification, then their perceptual performance is

Large urban development projects play a central role in shaping new social practices and work as testing sites of new governing models (Swyngedouw et al. The

USA UK Germany Australia Caribbean Belgium The Netherlands [email protected] www.pallas-athena.com Redistribution is not permitted without written notice from

- rtlo, containing the layoutcode you specified for this transaction - status, which contains one of the result codes in § 5.2.. If cinfo_in_callback=1 cname, ccity and cbank will