Building the knowledge base for using Big Data in health

(1)

Building the knowledge base for using Big Data in health

Jeremy Wyatt DM FRCP ACMI Fellow

Leadership chair in eHealth Research

Leeds Institute of Health Sciences, Univ. of Leeds

Clinical adviser on Technologies, Royal College of Physicians


(2)

Agenda

• The potential of Big Data in health

• Some challenges and tools to address them:

1. Confidentiality

2. Data quality

3. Reliable inference

4. Lack of data analysts

5. Making sense of the results

• What is Leeds up to ?

• Conclusions

(3)

How do Google traffic maps work ?

Cambridge traffic [at 6am today]

Since 2012, Google has captured GPS data from Android phones, then processed it to give average speeds

http://googleblog.blogspot.co.uk/2009/08/bright-side-of-sitting-in-traffic.html
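
A minimal sketch of the "process it to give average speeds" step. It assumes each phone report has already been matched to a road segment and reduced to distance and time travelled. The record layout, segment names and function name are hypothetical, and this is not a description of Google's actual pipeline.

```python
from collections import defaultdict

def average_speeds(pings):
    """Mean speed (km/h) per road segment.

    Each ping is (segment, km_travelled, hours_elapsed); this layout is an
    assumption made for illustration.
    """
    distance = defaultdict(float)
    elapsed = defaultdict(float)
    for segment, km, hours in pings:
        distance[segment] += km
        elapsed[segment] += hours
    # Total distance / total time weights each phone by how long it was observed.
    return {s: distance[s] / elapsed[s] for s in distance if elapsed[s] > 0}

# Hypothetical reports from three phones
pings = [
    ("Ring road east", 1.0, 0.02),
    ("Ring road east", 1.0, 0.03),
    ("High street", 0.5, 0.05),
]
speeds = average_speeds(pings)
```

Dividing total distance by total time, rather than averaging per-phone speeds, stops a phone seen only briefly from dominating the estimate.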

(4)

Google flu trends, yesterday

(5)

Crowd sourcing: potholes & cyclists

http://thepotholegardener.com/page/2/

www.potholes.co.uk – funded by Warranty Direct

(6)

Asthmopolis

(7)

Where is Big Data & Health heading ?

Health data: CPRD, EPR, images, HSCIC, research datasets, bibliographic

Census data: household, employment, education, religious beliefs…

Consumer data: supermarket, energy use, travel, telecomms…

New data: social media, Apps, Quantified Self, Google Glass…

(8)

New health related datasets

(9)

So, what could go wrong ?

(10)

Challenge 1: Confidentiality

• There are many threats: hacking, journalists, blackmail, disgruntled employee, organised crime

• Catch-22: no data provider will collaborate until you demonstrate a trustworthy approach, but you need data to do that

• One small oversight can easily ruin your reputation

• [Wilful ?] Misunderstanding by some of the benefits vs. risks equation [care.data, Summary Care Record, NHSNet 1990s…]

• Professionals and Trusts also worry about disclosure

(11)

Some solutions

General:

• Maintain close working with the public & data providers

• Only accredited users access the data (eLearning package)

• Project log & agreed SOPs; annual external audit

• Help line for reporting near misses

Technical:

• Data substitution – e.g. distance for postcode

• Natural language understanding to delete identifiers in free text

• Precautions against internal sabotage, e.g. MILA (Mark McGilchrist): federated database (i.e. no data warehouse !)

• Never distribute any data (Caldicott 2): all users access & analyse data on a monitored Citrix virtual research platform

• Scrutinise all results for potential disclosure (ISD Scotland guide)
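
Two of the technical measures above, data substitution and scrutiny of results before release, can be sketched in a few lines. This is only an illustration. The postcodes, coordinates and site location are made up, the function names are hypothetical, and the threshold of 5 is an assumption, not a figure taken from the ISD Scotland guide.

```python
import math

# Hypothetical postcode centroids (easting, northing in metres); made-up values.
CENTROIDS = {"ZZ1 1AA": (429500, 434500), "ZZ2 2BB": (428000, 436000)}
SITE = (430000, 434000)  # hypothetical hospital location

def distance_km(postcode):
    """Straight-line distance from the postcode centroid to the site, to the nearest km."""
    x, y = CENTROIDS[postcode]
    return round(math.hypot(x - SITE[0], y - SITE[1]) / 1000)

def deidentify(record):
    """Return a copy of the record with the postcode replaced by a distance."""
    out = {k: v for k, v in record.items() if k != "postcode"}
    out["distance_km"] = distance_km(record["postcode"])
    return out

def suppress_small_counts(table, threshold=5):
    """Mask any count below the threshold before a table leaves the platform."""
    return {k: (v if v >= threshold else f"<{threshold}") for k, v in table.items()}

clean = deidentify({"id": 1, "postcode": "ZZ1 1AA"})
released = suppress_small_counts({"asthma": 120, "rare condition": 3})
```

The analysis can still use distance as a covariate, while the postcode itself never reaches the analyst.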

(12)

And Global Alliance for Genomics and Health ?

Aim to develop harmonized approaches to enable responsible, secure, effective sharing of genomic and clinical information in the cloud “with highest ethics & privacy standards”

Members include leading global technology, healthcare, research & disease advocacy organizations

Model: any app (graphical, command-line, or batch processing) can work with information in any repository

Benefits: as ecosystem grows, all developers and researchers benefit from each developer’s work

Google joined, Feb 27th:

http://googleresearch.blogspot.co.uk/

(13)

Challenge 2: Data quality

• First law of HI: data collected for one purpose can only rarely be used for another (Johan van der Lei, 1989)

• Numerator issues:

– Local data definitions & conventions on code usage per GP practice

– Completeness varies with time of day

– Undocumented changes in data definitions, collection process, normal lab ranges, thresholds for payment…

• Denominator issues: changes in cohort type, coverage, follow up process; missing records = litigation…

• Lack of metadata

(14)

Some solutions

• Obtain or reverse engineer the metadata:

– Observe data pathway from origin to delivery

– Exploratory data analysis – Tukey

– Tools to automate database documentation

• Use emerging Farr Inst. models and methods

• Use multiple imputation with care – www.missingdata.org

• Consider if data artefacts are explanation for findings
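
One simple exploratory check, in the spirit of the data quality issues listed earlier (completeness varying with time of day), is to profile how often a field is filled in by hour of entry. A minimal sketch, assuming each record carries an `hour` key and uses `None` for a missing value. The field names and sample records are hypothetical.

```python
from collections import defaultdict

def completeness_by_hour(records, field):
    """Fraction of records with a non-missing value for `field`, per hour of entry."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        hour = rec["hour"]
        total[hour] += 1
        if rec.get(field) is not None:
            filled[hour] += 1
    return {h: filled[h] / total[h] for h in sorted(total)}

# Hypothetical records: blood pressure is recorded less reliably at 3am
records = [
    {"hour": 10, "bp": 120},
    {"hour": 10, "bp": 135},
    {"hour": 3, "bp": None},
    {"hour": 3, "bp": 110},
]
profile = completeness_by_hour(records, "bp")
```

A sharp dip at particular hours suggests that missingness depends on the collection process, which matters before any imputation is attempted.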

(15)

Challenge 3: Reliable inference

• If you carry out 20 independent analyses where no real effect exists, on average one will show p < 0.05 by chance

• Simpson’s paradox, confounding by indication, regression to the mean…

• Association is not causation !
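
Two of these pitfalls can be made concrete with a short sketch. The first function assumes independent tests with no real effects. The subgroup counts follow the widely quoted kidney stone illustration of Simpson's paradox and are used here purely as an example, not as data from this talk. Function and variable names are hypothetical.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests gives p < alpha."""
    return 1 - (1 - alpha) ** n_tests

def rate(successes, patients):
    """Success rate within one subgroup."""
    return successes / patients

def overall(groups):
    """Success rate after pooling all subgroups."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(p for _, p in groups.values())
    return successes / patients

# (successes, patients) per subgroup
treatment_a = {"small stones": (81, 87), "large stones": (192, 263)}
treatment_b = {"small stones": (234, 270), "large stones": (55, 80)}

print(f"P(at least one false positive in 20 tests) = {chance_of_false_positive(20):.2f}")
for group in treatment_a:
    print(group, f"A {rate(*treatment_a[group]):.2f}", f"B {rate(*treatment_b[group]):.2f}")
print("overall", f"A {overall(treatment_a):.2f}", f"B {overall(treatment_b):.2f}")
```

Treatment A does better within each subgroup yet worse overall, because it was given more often to the harder cases. This is the same mechanism as confounding by indication.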

(16)

Do lemons cause highway fatalities ?

Source:

www.cqeacademy.com/cqe-body-of-knowledge/continuous-improvement/quality-control-tools/

(17)

Some classic medical mistakes

Intervention | Disease | Original study | Truth

Post menopausal HRT | CAD & stroke prevention | Non randomised | Ineffective
Vitamin E | 1° CAD prevention | RCT | Ineffective
Vitamin E | 2° CAD prevention | Non randomised | Ineffective
Inhaled nitric oxide | ARDS | Non randomised | Ineffective
Endotoxin antibodies | Gram neg sepsis | Non randomised | Ineffective
Flavonoids | CAD prevention | Non randomised | Effect smaller
Carotid endarterectomy | High grade stenosis | Non randomised | Effect smaller
Coronary stent vs. PTCA | CAD | Non randomised | Effect smaller
Zidovudine | HIV infection | Non randomised | Effect smaller

Source: Ioannidis, Science 2008

(18)

The impact of biases in estimating mortality effect size for ezetimibe in 2233 post MI deaths from all causes

[Figure: hazard ratio for death compared to simvastatin group (axis 0 to 1.2), shown for ezetimibe and intensified statin under a Cox model, propensity scoring and further modelling]

E.g. first incident MI; missing cholesterol levels; medication covariates

Source: Pauriah et al. Ezetimibe Use and Mortality in Survivors of an Acute Myocardial Infarction: A Population-based Study. Heart 2014

(19)

Some solutions

• Expertise in study designs: case crossover, instrumental variable analysis…

• Understand & quantify the biases

• Expertise in analytical methods: life course epidemiology, multi level modelling, functional data analysis for episodic frequent data…

(20)

So, what kinds of question can Big Data safely answer ?

Descriptive questions:

• Rates of symptoms, diseases, investigations, treatments [G Flu]

• Severity of illness, results of tests, doses of therapy etc.

• Distribution of services, diseases, risks etc. [Asthma map]

• Adherence to evidence, to EB guidelines

Questions about association:

• Prognostic markers to aid targeting of services & drugs

Causative questions ?

(21)

Problems facing healthcare systems

• Cost – need to target people for services, reduce waste

• Large variations in practice - Wennberg

– Overuse of some services, procedures

– Under-utilisation of other services

• Gap between evidence, “best” practice and actual practice

• Avoidable errors, e.g. drug reactions

(22)

Challenge 4: Data analyst capacity

• Big data is a new discipline for health, requiring fluency in data warehouses & software, analysis methods and tools, biases, IG / ethics …

• And the domain: otherwise it is hard to understand the questions asked and the importance of serendipitous findings

“We’re creating these great datasets, but we don’t have enough scientists to analyse them”

NASA Asteroid data hunter contest, 11-3-14

(23)

What skills are needed ?

• Human skills:

– Working with those who ask questions, use results

– Communication of the results: infographics

• Technical skills:

– Data management

– Data access: information governance; PPI

– Data exploration / visualisation

– Data analysis: biostatistics; machine learning; study design / methodology

(24)

Challenge 5: Making sense of the results

• Reading scientific English is a learned skill

• Scientists often don’t know what is important to the users of their results

• Often, getting people’s attention is more important than transferring information

(25)

Lancet series Oct / November 1998

Information Design

As font legibility declines, reading slows and people give more attention to the words and less to their meaning

Italic text and bold text are less legible than plain text

White text on a shaded or dark background makes text less legible, unless a bolder typeface is used

EXTENSIVE USE OF CAPITAL LETTERS SLOWS DOWN READING SO ROAD SIGNS LOOK LIKE THIS: Warwick University

Underline covers up descenders for the letters g, j, p, q and y as well as commas, & colons; so reduces legibility

Double justification to achieve an even right margin reduces legibility and reading speed compared to unjustified text of the same font size

For a given font size, short lines or long lines with many words are more difficult to read than lines of about 10 words

Cramming lines of text so close together that there is no space between them confuses the eye, reduces legibility and slows reading, making errors more likely.

(26)

Infographics

(27)

CDC obesity animation

(28)

So, how is Leeds University engaging with all this ?

(29)

School of Medicine applied health research

Excellent research on, and management of, Big Data and records

[Diagram labels]

Bioinformatics; Computer science; Epidemiology; Biostatistics; Clinical trials; Health services research; Health economics; Health informatics; Psychology & social sciences; Ethics and law; Patient & public engagement

Health & Social Care Information Centre

Dentistry, Education, Psychology; Maths, Statistics, Geography

Primary Care & Public health

Leeds & Yorkshire Care Record; other national datasets

(30)

University of Leeds big data strategy

Recent UK Research Council awards around Big Data totalling £12M:

• MRC Medical Bioinformatics Centre £6M

• ESRC Consumer Data Research Centre £5M

• NERC etc. £0.7M

Our new VC Sir Alan Langlands (ex. NHS / Dundee / HEFCE) has invited David Willetts to launch our Leeds Institute for Data Analytics in May

(31)

Research to understand and improve individual {health related / educational / energy usage/ financial…} behaviour

Individual behaviour

Data analytics, infographics, Insights, toolkits, questions

Goals of society, individuals

Behaviour change programme

Psychology, social marketing, behavioural economics…

Leeds Institute for Data Analytics “Leeds Institute for Behaviour Change” ?

BIG Data

data capture “thought capture”

Can inform

Training & capacity development

(32)

Conclusions

• To safely build knowledge using Big Data, we need to address:

1. The concerns of data owners & the public

2. Data quality

3. Rigorous data analysis

4. Building capacity

5. Communicating our results
