Using Big Data in Healthcare

Speaker

First Plenary Session

THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD?

David R. Holmes III, PhD

Mayo Clinic College of Medicine

Rochester, MN, USA

ISPOR 19th Annual Meeting, June 2nd, 2014

Graph Databases and Graph Analytic Approaches

Teamwork

• Special Purpose Processor Development Group

• Barry Gilbert, Ph.D.

• Robert Techentin

• Center for Science of Healthcare Delivery

• Jeanne Huddleston, M.D.

• Nilay Shah, Ph.D.

• Rochester Epidemiology Project

• Jennifer St. Sauver, Ph.D.

• YarcData

• Steve Reinhardt

• Biomedical Imaging Resource

(Photo: Will and Charlie Mayo, the Mayo Brothers)

What is a graph?

• Undirected edge between 1 and 2: “Node 1 and Node 2 are related”

• Directed edge from 1 to 2: “Node 1 is forward related to Node 2”

• Directed edges from 1 to 2 and from 1 to 3: “Node 1 is forward related to Node 2 and Node 3”

• Labeled edges A (1 to 2) and B (1 to 3): “Node 1 is forward related to Node 2 via Edge A. Node 1 is forward related to Node 3 via Edge B”

• Example with nodes Smoking, Coffee Drinking, and Heart Attack and edges “Correlates” and “Causes”: “Smoking is correlated with coffee drinking. Smoking may cause heart attacks. Smoking is a confounding variable.”

Semantic Graphs / Databases

• Node-typed, edge-typed, directed graph

• Using the Resource Description Framework (RDF), we can describe each piece of information in the graph as a triple:

• <Subject> <Predicate> <Object>

<Smoking> <corr. with> <Coffee Drinking>

<Coffee Drinking> <corr. with> <Smoking>

<Smoking> <causes> <Heart Attacks>

• A semantic database is referred to as a triple-store (i.e., a collection of triples)

• Semantic Databases are queried using SPARQL (the semantic equivalent of SQL)

• Inferential rules and ontologies can be applied dynamically to the data to further enrich the dataset
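The triple model and rule-based enrichment described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real triple-store or SPARQL engine: the `match` and `apply_symmetry` helpers are hypothetical names introduced here, and the symmetry rule for "corr. with" is an assumed example of an inferential rule.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Two of the triples from the slide; the reverse correlation is left out
# on purpose so that the rule below has something to infer.
store: Set[Triple] = {
    ("Smoking", "corr. with", "Coffee Drinking"),
    ("Smoking", "causes", "Heart Attacks"),
}


def match(triples: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def apply_symmetry(triples: Set[Triple], predicate: str) -> Set[Triple]:
    """Assumed inference rule: <a> <predicate> <b> implies <b> <predicate> <a>."""
    inferred = {(o, p, s) for (s, p, o) in triples if p == predicate}
    return triples | inferred


enriched = apply_symmetry(store, "corr. with")

# Loosely analogous to a SPARQL pattern such as: ?x <causes> ?y with ?x = Smoking
print(match(enriched, s="Smoking", p="causes"))
# All correlation triples, including the inferred reverse direction
print(match(enriched, p="corr. with"))
```

Because the rule is applied to the data rather than baked into a schema, it can be added or dropped at query time, which is the sense in which the enrichment is dynamic.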

Origins of Semantic Databases in Healthcare

• Mishelevich, David J.

• "MEANINGEX: a computer-based semantic parse approach to the analysis of meaning." (1971)

• "Semantic analysis of medical records." (1972)

• Initial notion of an ontology and semantic (i.e. noun phrase) representation of medical data

• Schmid, Hans Albrecht, and J. Richard Swenson.

• "On the semantics of the relational data model." (1975)

• Formalizing the graph-like nature of semantic data models

• 1970s…

• 1980s…

• 1990s…

• 2000s...

• Lenz, Richard, Mario Beyer, and Klaus A. Kuhn.

• "Semantic integration in healthcare networks." (2007)

Benefits of Semantic Databases

• Semantic databases center around the user's need to collect and interrogate heterogeneous data

• Flexible Schema

• New variables can be added to the data model easily

• Data type agnostic

• New variables are added with indifference to variables already in the data model

• Expressibility

• Ability to query the database in a flexible manner without regard for the specific data model

• Can dynamically apply inferential rules and ontologies

• Whole graph algorithms can be applied in order to find unique relationships between variables

Healthcare Semantification at Mayo

• Rochester Epidemiology Project (Population-based)

• Goal: Leverage the stable population to track health over time

• 500K individuals, 40-year duration

• 2M healthcare records

• Bedside Patient Rescue (In-hospital)

• Goal: Early Warning Systems (EWS) for patient events

• 115K patient encounters, 2-year duration

• 38M records (labs, nursing evals, etc.)

• A diffusion algorithm can find hidden relationships by exploiting connections in the semantic graph

• Initial values are attached to specific “seed” nodes

• Values propagate over graph edges, and accumulate in different parts of the graph

• Sometimes results are unexpected

• With a functioning graph diffusion algorithm, many possible searches can be performed

• For the REP, we can identify a representative example of cohort features and label the graph
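The seed-and-propagate idea above can be sketched as a simple iterative spread over directed edges. This is one assumed formulation, not the algorithm used on the REP data: the `diffuse` function, the even split across outgoing edges, the `damping` factor, and the node names in the example are all hypothetical.

```python
from collections import defaultdict


def diffuse(edges, seeds, steps=3, damping=0.5):
    """Spread seed values along directed edges for a fixed number of steps.

    edges: list of (source, target) pairs
    seeds: dict mapping each seed node to its initial value
    At every step a node passes damping * (its current value) onward,
    split evenly across its outgoing edges. Whatever a node receives is
    added to its accumulated total.
    """
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)

    current = dict(seeds)
    accumulated = defaultdict(float, seeds)
    for _ in range(steps):
        received = defaultdict(float)
        for node, value in current.items():
            targets = out.get(node, [])
            for dst in targets:
                received[dst] += damping * value / len(targets)
        for node, value in received.items():
            accumulated[node] += value
        current = received
    return dict(accumulated)


# Hypothetical chain: a seeded patient, a shared diagnosis, a second patient,
# and a diagnosis only the second patient has.
edges = [
    ("patient_A", "diagnosis_X"),
    ("diagnosis_X", "patient_B"),
    ("patient_B", "diagnosis_Y"),
]
scores = diffuse(edges, {"patient_A": 1.0})
print(scores)
```

Nodes that end up with a non-zero score without being seeded are the candidates for "hidden" relationships; here `diagnosis_Y` receives a small value through the patient who shares `diagnosis_X` with the seed.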

Just one algorithm? No

• There are many whole graph algorithms which could be applied to healthcare data:

• PageRank – Google-developed algorithm that scores each node by the number and importance of the nodes linking to it, emphasizing important nodes in a graph

• Peer-pressure clustering – Graph-based clustering algorithm to find groups based on both node and edge data

• Betweenness centrality – Algorithm to determine key nodes in a graph, i.e. those that lie on the most shortest paths between other nodes

• Clique detection – Methods to find fully connected sub-graphs in a graph
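As one concrete instance of a whole graph algorithm, PageRank can be written as a short power iteration. This is a textbook-style sketch under assumed defaults (damping of 0.85, a fixed iteration count, rank from dead-end nodes shared evenly); the `pagerank` function and the node names are illustrative and are not taken from the Mayo analyses.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out = {node: [] for node in nodes}
    for src, dst in edges:
        out[src].append(dst)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank sitting on nodes without outgoing edges is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not out[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Every node hands its rank, evenly split, to the nodes it points at.
        for src in nodes:
            for dst in out[src]:
                new_rank[dst] += damping * rank[src] / len(out[src])
        rank = new_rank
    return rank


# Hypothetical graph: A and B both point at C, and C points back at A.
ranks = pagerank([("A", "C"), ("B", "C"), ("C", "A")])
print(max(ranks, key=ranks.get))
```

In this small example `C` ranks highest because it is pointed to by both other nodes, and `B` ranks lowest because nothing points to it.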

Why doesn’t everyone use Semantic Databases?

• Migrating relational databases to semantic databases can be tricky

• Graph databases suffer from missing data and noisy data – just like relational databases

• Graph databases are large, and graph algorithms are complex

Migrating Relational Databases

• Relational DBs, by definition, are an efficient tabular storage of information.

• Care must be taken in developing a semantic model to ensure “semantic richness”

• Data must be promoted correctly to subjects/objects

• Predicates must be semantically meaningful

• Standard nomenclature must be used to be
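The promotion step can be illustrated with a deliberately naive mapping from one relational row to triples. Everything here is an assumption made for illustration: the `row_to_triples` helper, the `patient` table, and its column names are hypothetical, and using raw column names as predicates is exactly the shortcut the points above warn against.

```python
def row_to_triples(table, key_column, row):
    """Promote one relational row to (subject, predicate, object) triples.

    The key column becomes the subject, every other column name becomes a
    predicate, and the cell value becomes the object. Cells that are None
    (missing) produce no triple.
    """
    subject = f"{table}/{row[key_column]}"
    return [
        (subject, column, value)
        for column, value in row.items()
        if column != key_column and value is not None
    ]


# Hypothetical row from a hypothetical "patient" table.
row = {"patient_id": 17, "has_diagnosis": "hypertension", "smoker": None}
triples = row_to_triples("patient", "patient_id", row)
print(triples)  # [('patient/17', 'has_diagnosis', 'hypertension')]
```

A careful migration would replace the column names with predicates from an agreed vocabulary and decide, column by column, which values deserve to become nodes of their own rather than plain literals.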

Missing and Noisy Data

• Missing data is just that … missing.

• Graph algorithms need to be smarter about missing data. For example,

• Building latent variables into the data

• Using a priori models to address missing data

• Healthcare data is notoriously noisy

• Moreover, there is a lot of it

• Algorithms must be robust to noise and oversampling

• While pre-processing can address this, some useful information can be lost.

• Algorithms need to “intelligently” weight the data to draw meaningful conclusions.

Connecting Two BPR Encounters

Graph Data is Large and Complex

• For decades, the community didn’t have the computational resources to deal with semantic data efficiently.

• Technology developers were unable to pack enough memory into a computer to hold the data

• Networks were too slow

• As a result, CPUs were “data starved”

• New technologies address this issue specifically

• Hadoop clusters

• Graph computers

(Figure: Progressively complex queries using graph computer vs standard SQL database)

Final Thoughts

• Graph databases for healthcare were proposed in the 1970s.

• Over time, the conceptual model of graph databases / algorithms matured.

• Technology has finally caught up.

• The technical community is now prepared to accept massive amounts of healthcare data and store it semantically.

• Semantic graph databases change the way that we look at data.

• Graph analytics will yield new insights into existing and soon-to-be collected datasets.

• There are still challenges in data migration and data quality to be addressed.

• Harass your favorite computer scientist / informaticist to make progress in these areas.