Building the knowledge base for using Big Data in health
Jeremy Wyatt DM FRCP ACMI Fellow
Leadership chair in eHealth Research
Leeds Institute of Health Sciences, Univ. of Leeds
Clinical adviser on Technologies, Royal College of Physicians
Agenda
• The potential of Big Data in health
• Some challenges and tools to address them:
1. Confidentiality
2. Data quality
3. Reliable inference
4. Lack of data analysts
5. Making sense of the results
• What is Leeds up to ?
• Conclusions
How do Google traffic maps work ?
Cambridge traffic [at 6am today]
Since 2012, Google captures GPS data from Android phones, then processes it to give average speeds
http://googleblog.blogspot.co.uk/2009/08/bright-side-of-sitting-in-traffic.html
Google flu trends, yesterday
Crowd sourcing: potholes & cyclists
http://thepotholegardener.com/page/2/
www.potholes.co.uk – funded by Warranty Direct
Asthmopolis
Where is Big Data & Health heading ?
Health data: CPRD, EPR, images, HSCIC, research datasets, bibliographic
Census data: household, employment, education, religious beliefs…
Consumer data: supermarket, energy use, travel, telecomms…
New data: social media, Apps, Quantified Self, Google Glass…
New health related datasets
So, what could go wrong ?
Challenge 1: Confidentiality
• There are many threats: hacking, journalists, blackmail, disgruntled employees, organised crime
• Catch-22: no data provider will collaborate until you demonstrate a trustworthy approach - but you need data to do that
• One small oversight can easily ruin your reputation
• [Wilful ?] misunderstanding by some of the benefits vs. risks equation [care.data, Summary Care Record, NHSNet 1990s…]
• Professionals and Trusts also worry about disclosure
Some solutions
General:
• Maintain close working with the public & data providers
• Only accredited users access the data (eLearning package)
• Project log & agreed SOPs; annual external audit
• Help line for reporting near misses
Technical:
• Data substitution – eg. distance for postcode
• Natural language understanding to delete identifiers in free text
• Precautions against internal sabotage, eg. MILA (Mark
McGilchrist): federated database (ie. no data warehouse !)
• Never distribute any data (Caldicott 2): all users access & analyse data on a monitored Citrix virtual research platform
• Scrutinise all results for potential disclosure (ISD Scotland guide)
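Data substitution, as in the "distance for postcode" bullet, can be sketched in a few lines. This is a minimal illustration only: the postcode-to-centroid lookup, the clinic coordinates, the field names and the 5 km banding are all hypothetical assumptions, not part of any system named on these slides.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def substitute_postcode(record, centroids, clinic, band_km=5):
    """Return a copy of the record with the postcode replaced by a distance band."""
    lat, lon = centroids[record["postcode"]]
    dist = haversine_km(lat, lon, clinic[0], clinic[1])
    # Drop the identifying field and keep only a coarse, banded distance.
    out = {k: v for k, v in record.items() if k != "postcode"}
    lower = int(dist // band_km) * band_km
    out["distance_band_km"] = f"{lower}-{lower + band_km}"
    return out

# Hypothetical lookup table and clinic location, for illustration only.
CENTROIDS = {"LS2 9JT": (53.8067, -1.5550)}
CLINIC = (53.8008, -1.5491)

patient = {"id": 17, "postcode": "LS2 9JT", "hba1c": 48}
safe_record = substitute_postcode(patient, CENTROIDS, CLINIC)
print(safe_record)
```

Banding the distance, rather than releasing it exactly, is a design choice made here to reduce the chance of re-identifying a household from a precise figure.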
And Global Alliance for Genomics and Health ?
Aim to develop harmonized approaches to enable responsible, secure, effective sharing of genomic and clinical information in the cloud “with highest ethics & privacy standards”
Members include leading global technology, healthcare, research & disease advocacy organizations
Model: any app (graphical, command-line, or batch processing) can work with information in any repository
Benefits: as ecosystem grows, all developers and researchers benefit from each developer’s work
Google joined, Feb 27th:
http://googleresearch.blogspot.co.uk/
Challenge 2: Data quality
• First law of HI: data collected for one purpose can only rarely be used for another (Johan van der Lei, 1989)
• Numerator issues:
– Local data definitions & conventions on code usage per GP practice
– Completeness varies with time of day
– Undocumented changes in data definitions, collection process, normal lab ranges, thresholds for payment…
• Denominator issues: changes in cohort type, coverage, follow up process; missing records = litigation…
• Lack of metadata
Some solutions
• Obtain or reverse engineer the metadata:
– Observe data pathway from origin to delivery
– Exploratory data analysis (Tukey)
– Tools to automate database documentation
• Use emerging Farr Inst. models and methods
• Use multiple imputation with care – www.missingdata.org
• Consider if data artefacts are explanation for findings
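One exploratory step in Tukey's style is to flag values outside the "fences" at 1.5 interquartile ranges beyond the quartiles, then ask whether each flagged value is real or a data artefact (a unit change, a decimal slip, a recoded field). The sketch below assumes a plain list of numbers; the lab values are invented for illustration.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical serum potassium results (mmol/L); 42.0 looks like a decimal slip.
potassium = [4.1, 4.3, 4.2, 4.4, 4.0, 4.3, 4.2, 42.0]
suspect = tukey_outliers(potassium)
print(suspect)
```

A flagged value is a prompt to inspect the data pathway, not a reason to delete the record automatically.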
Challenge 3: Reliable inference
• If you carry out 20 analyses on data with no true effect, on average one will show p < 0.05 by chance alone
• Simpson’s paradox, confounding by indication, regression to the mean…
• Association is not causation !
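The arithmetic behind the multiple-analyses warning can be checked directly. The sketch assumes the 20 tests are independent and that no true effect exists; the Bonferroni threshold is shown only as one common, conservative correction and is not a method named on these slides.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests gives p < alpha."""
    return 1 - (1 - alpha) ** n_tests

def expected_false_positives(n_tests, alpha=0.05):
    """Average number of null tests that give p < alpha."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

p_any = prob_at_least_one_false_positive(20)
print(f"P(at least one false positive in 20 tests) = {p_any:.2f}")
print(f"Expected false positives = {expected_false_positives(20):.1f}")
print(f"Bonferroni per-test threshold = {bonferroni_threshold(20):.4f}")
```

So a "significant" result among 20 unplanned analyses is more likely than not to appear even when nothing is going on.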
Do lemons cause highway fatalities ?
Source:
www.cqeacademy.com/cqe-body-of-knowledge/continuous-improvement/quality-control-tools/
Some classic medical mistakes
Intervention | Disease | Original study | Truth
Post menopausal HRT | CAD & stroke prevention | Non randomised | Ineffective
Vitamin E | 1° CAD prevention | RCT | Ineffective
Vitamin E | 2° CAD prevention | Non randomised | Ineffective
Inhaled nitric oxide | ARDS | Non randomised | Ineffective
Endotoxin antibodies | Gram neg sepsis | Non randomised | Ineffective
Flavonoids | CAD prevention | Non randomised | Effect smaller
Carotid endarterectomy | High grade stenosis | Non randomised | Effect smaller
Coronary stent vs. PTCA | CAD | Non randomised | Effect smaller
Zidovudine | HIV infection | Non randomised | Effect smaller
Source: Ioannidis, Science 2008
The impact of biases in estimating mortality effect size for ezetimibe in 2233 post MI deaths from all causes
[Figure: hazard ratio for death compared to simvastatin group (axis 0 to 1.2), for the ezetimibe and intensified statin groups, estimated by Cox model, propensity scoring and further modelling]
Eg. First incident MI; missing cholesterol levels; medication covariates
Source: Pauriah et al. Ezetimibe Use and Mortality in Survivors of an Acute Myocardial Infarction: A Population-based Study. Heart 2014
Some solutions
• Expertise in study designs: case crossover, instrumental variable analysis…
• Understand & quantify the biases
• Expertise in analytical methods: life course
epidemiology, multi level modelling, functional
data analysis for episodic frequent data…
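To make "understand & quantify the biases" concrete, the sketch below shows Simpson's paradox with a handful of counts: a treatment that looks better within every severity stratum can look worse once the strata are pooled, because the sicker patients were steered towards it. All counts, group names and stratum labels are hypothetical and chosen only to illustrate the effect.

```python
# Hypothetical (recovered, treated) counts per treatment and severity stratum.
COUNTS = {
    "A": {"mild": (81, 87), "severe": (192, 263)},
    "B": {"mild": (234, 270), "severe": (55, 80)},
}

def stratum_rates(counts):
    """Recovery rate for each treatment within each stratum."""
    return {
        treatment: {s: ok / n for s, (ok, n) in strata.items()}
        for treatment, strata in counts.items()
    }

def pooled_rates(counts):
    """Recovery rate for each treatment after ignoring the stratum."""
    pooled = {}
    for treatment, strata in counts.items():
        recovered = sum(ok for ok, _ in strata.values())
        treated = sum(n for _, n in strata.values())
        pooled[treatment] = recovered / treated
    return pooled

by_stratum = stratum_rates(COUNTS)
overall = pooled_rates(COUNTS)

for stratum in ("mild", "severe"):
    print(stratum, {t: round(by_stratum[t][stratum], 3) for t in COUNTS})
print("pooled", {t: round(r, 3) for t, r in overall.items()})
```

In this illustration A beats B in both strata, yet B beats A in the pooled comparison; the reversal is driven entirely by how patients were allocated, which is the pattern confounding by indication produces in routine data.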
So, what kinds of question can Big Data safely answer ?
Descriptive questions:
• Rates of symptoms, diseases, investigations, treatments [G Flu]
• Severity of illness, results of tests, doses of therapy etc.
• Distribution of services, diseases, risks etc. [Asthma map]
• Adherence to evidence, to EB guidelines
Questions about association:
• Prognostic markers to aid targeting of services & drugs
Causative questions ?
Problems facing healthcare systems
• Cost – need to target people for services, reduce waste
• Large variations in practice - Wennberg
– Overuse of some services, procedures
– Under-utilisation of other services
• Gap between evidence, “best” practice and actual practice
• Avoidable errors, eg. drug reactions
Challenge 4: Data analyst capacity
• Big data is a new discipline for health, requiring fluency in data warehouses & software, analysis methods and tools, biases, IG / ethics …
• And in the domain: otherwise it is hard to understand the questions asked, or the importance of a serendipitous finding
“We’re creating these great datasets, but we don’t have enough scientists to analyse them”
NASA Asteroid data hunter contest, 11-3-14
What skills are needed ?
• Human skills:
– Working with those who ask questions, use results
– Communication of the results: infographics
• Technical skills:
– Data management
– Data access: information governance; PPI
– Data exploration / visualisation
– Data analysis: biostatistics; machine learning; study design / methodology
Challenge 5: Making sense of the results
• Reading scientific English is a learned skill
• Scientists often don’t know what is important to the users of their results
• Often, getting people’s attention is more
important than transferring information
Lancet series Oct / November 1998
Information Design
As font legibility declines, reading slows and people give more attention to the words and less to their meaning
Italic text and bold text are less legible than plain text
White text on a shaded or dark background makes text less legible, unless a bolder typeface is used
EXTENSIVE USE OF CAPITAL LETTERS SLOWS DOWN READING SO ROAD SIGNS LOOK LIKE THIS: Warwick University
Underline covers up descenders for the letters g, j, p, q and y as well as commas, & colons; so reduces legibility
Double justification to achieve an even right margin reduces legibility and reading speed compared to unjustified text of the same font size
For a given font size, short lines or long lines with many words are more difficult to read than lines of about 10 words
Cramming lines of text so close together that there is no space between them confuses the eye, reduces legibility and slows reading, making errors more likely.
Infographics
CDC obesity animation
So, how is Leeds University engaging
with all this ?
School of Medicine applied health research
Excellent research on - and management of - Big data and records
• Bioinformatics, Computer science, Epidemiology, Biostatistics, Clinical trials, Health services research, Health economics, Health informatics, Psychology & social sciences, Ethics and law, Patient & public engagement
• Dentistry, Education, Psychology, Maths, Statistics, Geography
• Health & Social Care Information Centre; Primary Care & Public health; Leeds & Yorkshire Care Record; other national datasets
University of Leeds big data strategy
Recent UK Research Council awards around Big Data totalling £12M:
• MRC Medical Bioinformatics Centre £6M
• ESRC Consumer Data Research Centre £5M
• NERC etc. £0.7M
Our new VC Sir Alan Langlands (ex. NHS / Dundee / HEFCE) has invited David Willetts to launch our Leeds Institute for Data Analytics in May
Research to understand and improve individual {health related / educational / energy usage/ financial…} behaviour
Individual behaviour
Data analytics, infographics, Insights, toolkits, questions
Goals of society, individuals
Behaviour change programme
Psychology, social marketing, behavioural economics…
Leeds Institute for Data Analytics “Leeds Institute for Behaviour Change” ?
BIG Data
data capture “thought capture”
Can inform Training & capacity
development
Conclusions
• To safely build knowledge using Big Data, we need to address:
1. The concerns of data owners & the public
2. Data quality
3. Rigorous data analysis
4. Building capacity
5. Communicating our results