Speaker
First Plenary Session
THE USE OF "BIG DATA" - WHERE ARE WE
AND WHAT DOES THE FUTURE HOLD?
David R. Holmes III, PhD
Mayo Clinic College of Medicine
Rochester, MN, USA
©2014 MFMER | slide-2
Using Big Data in Healthcare
David R. Holmes III
ISPOR 19th Annual Meeting
June 2nd, 2014
Graph Databases and
Graph Analytic Approaches
Teamwork
• Special Purpose Processor Development Group
  • Barry Gilbert, Ph.D.
  • Robert Techentin
• Center for Science of Healthcare Delivery
  • Jeanne Huddleston, M.D.
  • Nilay Shah, Ph.D.
• Rochester Epidemiology Project
  • Jennifer St. Sauver, Ph.D.
• YarcData
  • Steve Reinhardt
• Biomedical Imaging Resource
[Photo: Will and Charlie Mayo, The Mayo Brothers]
Graph Analytics
What is a graph?
[Figure: node-and-edge diagrams built up step by step, with nodes 1, 2, 3 and edges A, B]
• “Node 1 and Node 2 are related”
• “Node 1 is forward related to Node 2”
• “Node 1 is forward related to Node 2 and Node 3”
• “Node 1 is forward related to Node 2 via Edge A. Node 1 is forward related to Node 3 via Edge B”
[Figure: example graph with nodes Smoking, Coffee Drinking, and Heart Attack, joined by edges labeled Correlates and Causes]
• “Smoking is correlated with coffee drinking. Smoking may cause heart attacks. Smoking is a confounding variable.”
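The node-and-edge idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the talk: the class name `LabeledDigraph` and its methods are hypothetical, and the edge labels are taken from the example figure.

```python
from collections import defaultdict


class LabeledDigraph:
    """Directed graph whose edges carry a label (the relationship type)."""

    def __init__(self):
        # Maps each source node to a list of (label, target) pairs.
        self._out = defaultdict(list)

    def add_edge(self, source, label, target):
        self._out[source].append((label, target))

    def neighbors(self, source, label=None):
        """Targets reachable from `source`, optionally filtered by edge label."""
        return [
            target
            for (edge_label, target) in self._out.get(source, [])
            if label is None or edge_label == label
        ]


g = LabeledDigraph()
g.add_edge("Smoking", "correlates", "Coffee Drinking")
g.add_edge("Coffee Drinking", "correlates", "Smoking")
g.add_edge("Smoking", "causes", "Heart Attack")

print(g.neighbors("Smoking"))
print(g.neighbors("Smoking", label="causes"))
```

Storing the label on the edge is what lets the same pair of nodes be related in more than one way.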
Semantic Graphs / Databases
• Node-typed, edge-typed, directed graph
• Using the Resource Description Framework (RDF), we can describe each piece of information in the graph as a triple:
  • <Subject> <Predicate> <Object>
  • <Smoking> <corr. with> <Coffee Drinking>
  • <Coffee Drinking> <corr. with> <Smoking>
  • <Smoking> <causes> <Heart Attacks>
• A semantic database is referred to as a triple-store (i.e., a collection of triples)
• Semantic databases are queried using SPARQL (the semantic equivalent of SQL)
• Inferential rules and ontologies can be applied dynamically to the data to further enrich the dataset
[Figure: the Smoking / Coffee Drinking / Heart Attack graph, with Correlates and Causes edges]
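As a rough illustration of a triple-store, a pattern query, and an inferential rule, here is a minimal pure-Python sketch. It is not SPARQL and not a real RDF library: the functions `match` and `apply_symmetry` are hypothetical names, and the symmetry rule for <corr. with> is an assumption chosen because the slide lists the correlation in both directions.

```python
# A triple-store modeled as a plain set of (subject, predicate, object) tuples.
triples = {
    ("Smoking", "corr. with", "Coffee Drinking"),
    ("Smoking", "causes", "Heart Attacks"),
}


def match(store, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard,
    loosely analogous to a variable in a query pattern."""
    return sorted(
        t
        for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def apply_symmetry(store, predicate):
    """Inference rule: if <A> <predicate> <B> holds, add <B> <predicate> <A>."""
    inferred = {(o, p, s) for (s, p, o) in store if p == predicate}
    return store | inferred


# Enrich the data with the rule, then query it.
enriched = apply_symmetry(triples, "corr. with")
print(match(enriched, p="corr. with"))
print(match(enriched, s="Smoking"))
```

The reverse correlation triple is never stored by hand here; it appears only after the rule is applied, which is the sense in which rules "enrich" the dataset.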
Origins of Semantic Databases in Healthcare
• Mishelevich, David J.
  • "MEANINGEX: a computer-based semantic parse approach to the analysis of meaning." (1971)
  • "Semantic analysis of medical records." (1972)
  • Initial notion of an ontology and semantic (i.e. noun phrase) representation of medical data
• Schmid, Hans Albrecht, and J. Richard Swenson.
  • "On the semantics of the relational data model." (1975)
  • Formalizing the graph-like nature of semantic data models
• 1970s…
• 1980s…
• 1990s…
• 2000s…
• Lenz, Richard, Mario Beyer, and Klaus A. Kuhn.
  • "Semantic integration in healthcare networks." (2007)
Benefits of Semantic Databases
• Semantic databases center on the user's need to collect and interrogate heterogeneous data
• Flexible Schema
  • New variables can be added to the data model easily
  • Data type agnostic
  • New variables are added without regard to variables already in the data model
• Expressibility
  • Ability to query the database flexibly, without regard for the specific data model
  • Can dynamically apply inferential rules and ontologies
  • Whole-graph algorithms can be applied to find unique relationships between variables
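The flexible-schema point can be illustrated with a small sketch. The patient identifiers, predicate names, and values below are all hypothetical, made up purely for illustration; the point is only that adding a new variable means adding triples, not altering a table definition.

```python
# Hypothetical triple-store: a set of (subject, predicate, object) tuples.
store = {
    ("patient:1", "hasDiagnosis", "Hypertension"),
    ("patient:2", "hasDiagnosis", "Diabetes"),
}


def objects(store, subject, predicate):
    """All object values recorded for a subject under one predicate."""
    return sorted(o for (s, p, o) in store if s == subject and p == predicate)


before = objects(store, "patient:1", "hasDiagnosis")

# Add two new variables of different data types. No schema change is needed,
# and patient:2 simply has no triple for them (no NULL columns).
store.add(("patient:1", "hasSystolicBP", 142))
store.add(("patient:1", "hasNote", "Reports chest pain"))

after = objects(store, "patient:1", "hasDiagnosis")
print(before == after)  # the existing query is unaffected by the new variables
print(objects(store, "patient:1", "hasSystolicBP"))
```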
Healthcare Semantification at Mayo
• Rochester Epidemiology Project (Population-based)
  • Goal: Leverage the stable population to track health over time
  • 500K individuals, 40-year duration
  • 2M healthcare records
• Bedside Patient Rescue (In-hospital)
  • Goal: Early Warning Systems (EWS) for patient events
  • 115K patient encounters, 2-year duration
  • 38M records (labs, nursing evals, etc.)
• Diffusion algorithm can find hidden relationships by exploiting connections in the semantic graph
• Initial values are attached to specific “seed” nodes
• Values propagate over graph edges, and accumulate in different parts of the graph
  • Sometimes results are unexpected
• With a functioning graph diffusion algorithm, many possible searches can be performed
• For the REP, we can identify a representative example of cohort features and label the graph
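The seed-and-propagate idea above can be sketched as follows. This is a generic diffusion sketch under stated assumptions, not the algorithm used on the REP data: the `retain` fraction, the equal split among neighbors, the fixed number of steps, and the toy patient/diagnosis/prescription graph are all hypothetical.

```python
def diffuse(adjacency, seeds, steps=3, retain=0.5):
    """Spread seed values over an undirected graph.

    At each step a node keeps `retain` of its value and splits the rest
    equally among its neighbors, so the total value is conserved.
    `seeds` must only name nodes that appear in `adjacency`.
    """
    values = {node: 0.0 for node in adjacency}
    values.update(seeds)
    for _ in range(steps):
        nxt = {node: 0.0 for node in adjacency}
        for node, value in values.items():
            neighbors = adjacency[node]
            if not neighbors:
                nxt[node] += value  # an isolated node keeps everything
                continue
            nxt[node] += retain * value
            share = (1.0 - retain) * value / len(neighbors)
            for nb in neighbors:
                nxt[nb] += share
        values = nxt
    return values


# Toy graph (hypothetical): patients linked to a shared diagnosis and drug.
adjacency = {
    "patient:A": ["dx:diabetes", "rx:metformin"],
    "patient:B": ["dx:diabetes"],
    "patient:C": ["rx:metformin"],
    "dx:diabetes": ["patient:A", "patient:B"],
    "rx:metformin": ["patient:A", "patient:C"],
}

scores = diffuse(adjacency, seeds={"patient:A": 1.0}, steps=2)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.4f}")
```

After two steps, patients B and C hold a nonzero value even though neither is directly connected to the seed; this is the kind of indirect relationship the diffusion is meant to surface.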
Just one algorithm? No
• There are many whole-graph algorithms which could be applied to healthcare data:
  • PageRank – Google-developed algorithm that scores nodes by the links pointing to them, to emphasize important nodes in a graph
  • Peer-pressure clustering – Graph-based clustering algorithm to find groups based on both node and edge data
  • Betweenness centrality – Algorithm to determine key nodes in a graph that lie on many of the shortest paths between other nodes
  • Clique detection – Methods to find fully connected sub-graphs in a graph
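To make one of the listed algorithms concrete, here is a minimal power-iteration sketch of PageRank. It is a textbook-style illustration, not code from the talk; the damping value of 0.85, the iteration count, and the toy graph are assumptions, and every link target must also appear as a key in `out_links`.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict of node -> list of linked nodes."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        nxt = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    nxt[target] += share
        rank = nxt
    return rank


# Toy directed graph (hypothetical): three nodes point at C, and C points at A.
out_links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(out_links)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

Node C ends up with the highest score because it receives the most incoming links, and A ranks second only because C points to it.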
Why doesn’t everyone use Semantic Databases?
• Migrating relational databases to semantic databases can be tricky
• Graph databases suffer from missing data and noisy data – just like relational databases
• Graph databases are large, and graph algorithms are complex
Migrating Relational Databases
• Relational DBs, by definition, are an efficient tabular storage of information.
• Care must be taken in developing a semantic model to ensure “semantic richness”
  • Data must be promoted correctly to subjects/objects
  • Predicates must be semantically meaningful
  • Standard nomenclature must be used to be
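The promotion of tabular data to subjects, predicates, and objects can be sketched as below. The table, column names, and the predicate vocabulary in `PREDICATES` are all hypothetical; a real migration would presumably map columns to terms from an agreed standard vocabulary rather than to ad hoc names.

```python
# Hypothetical relational rows, e.g. as returned by a SQL query.
rows = [
    {"patient_id": 1, "age": 63, "smoker": True, "diagnosis": "Heart Attack"},
    {"patient_id": 2, "age": 41, "smoker": False, "diagnosis": None},
]

# Hypothetical mapping from column names to meaningful predicates.
PREDICATES = {
    "age": "hasAgeInYears",
    "smoker": "isSmoker",
    "diagnosis": "hasDiagnosis",
}


def row_to_triples(row, key="patient_id"):
    """Promote the key column to the subject and each other column to a triple."""
    subject = f"patient:{row[key]}"
    return [
        (subject, PREDICATES[column], value)
        for column, value in row.items()
        if column != key and value is not None  # a NULL yields no triple
    ]


triples = [t for row in rows for t in row_to_triples(row)]
for triple in triples:
    print(triple)
```

Choosing predicate names like `hasDiagnosis` instead of reusing raw column names is the step where "semantic richness" is either gained or lost.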
Missing and Noisy Data
• Missing data is just that … missing.
• Graph algorithms need to be smarter about missing data. For example,
  • Building latent variables into the data
  • Using a priori models to address missing data
• Healthcare data is notoriously noisy
  • Moreover, there is a lot of it
  • Algorithms must be robust to noise and oversampling
• While pre-processing can address this, some useful information can be lost.
• Algorithms need to “intelligently” weight the data to draw meaningful conclusions.
[Figure: Connecting Two BPR Encounters]
Graph Data is Large and Complex
• For decades, the community didn’t have the computational resources to deal with semantic data efficiently.
  • Technology developers were unable to pack enough memory into a computer to hold the data
  • Networks were too slow
  • As a result, CPUs were “data starved”
• New technologies address this issue specifically
  • Hadoop clusters
  • Graph computers
[Figure: Progressively complex queries using graph computer vs standard SQL database]
Final Thoughts
• Graph databases for healthcare were proposed in the 1970s.
• Over time, the conceptual model of graph databases / algorithms matured.
• Technology has finally caught up.
• The technical community is now prepared to accept massive amounts of healthcare data and store it semantically.
• Semantic graph databases change the way that we look at data.
• Graph analytics will yield new insights into existing and soon-to-be collected datasets.
• There are still challenges in data migration and data quality to be addressed.
• Harass your favorite computer scientist / informaticist to make progress in these areas.