Information Extraction
from unstructured data
Angus Roberts
University of Sheffield NLP
Acknowledgements
● Technologies
  – University of Sheffield
    ● Natural Language Processing Group
    ● GATE Team
● Applications
  – NIHR Mental Health Biomedical Research Centre at South London and Maudsley / King's College London Institute of Psychiatry
  – Health Protection Agency (EU FastVac project)
  – Commercial partners
Outline
● Why tackle text?
● Information extraction examples
  – NIHR Mental Health Biomedical Research Centre
  – Obstetrics decision support
● Adding semantics from coding schemes
● Semantics example
  – Semantic indexing and visualisation of the
● There is a lot of free text in secondary care records
● For example, in mental health records, much information of value is not in the structured data
  – Few laboratory tests
  – Emphasis on relatively subtle symptomatology and overlapping diagnoses
● A clear need to extract and structure information from free text
  – Outcomes (e.g. cognitive function)
  – Context (e.g. education)
  – Presentations (e.g. symptoms)
  – Risk profiles (e.g. smoking)
Free text vs structured data: MMSE coverage
                                                                      Cases  Instances
MMSE in structured data                                                4000       5792
Text retrieved containing the string "MMSE" from unstructured text    16585      48805
MMSEs with dates, mined and validated from unstructured text          15364      34871
Free text vs structured data: MMSE coverage
                                       Structured source   GATE from free text    Structured OR GATE
                                     Patients  Instances   Patients  Instances   Patients  Instances
MMSE                                     5282       7944      18425      42752
Meds: Olanzapine                         9573      74921      17231     263805      18076     338726
Meds: Clozapine                          1829     170160       2901      74045       3065     244205
ICD-10 F00: Alzheimer's                  5447       8305       4612      16967       6428      25570
ICD-10 F60: Personality Disorder         3329       6489       5444      28587       6659      35076
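To make the coverage figures easier to read, the following minimal sketch computes how many patients are only found once free text is included. The patient counts are copied from the table above; the function names `added_by_text` and `coverage_gain` are illustrative assumptions, not part of GATE or CRIS. The MMSE row is left out because its combined figures are not given.

```python
# Patient counts copied from the coverage table:
# (structured source, GATE from free text, structured OR GATE)
PATIENTS = {
    "Meds: Olanzapine": (9573, 17231, 18076),
    "Meds: Clozapine": (1829, 2901, 3065),
    "ICD-10 F00: Alzheimer's": (5447, 4612, 6428),
    "ICD-10 F60: Personality Disorder": (3329, 5444, 6659),
}


def added_by_text(structured: int, combined: int) -> int:
    """Patients who are found only once free text is included."""
    return combined - structured


def coverage_gain(structured: int, combined: int) -> float:
    """Combined coverage as a multiple of structured-only coverage."""
    return combined / structured


if __name__ == "__main__":
    for label, (structured, _gate, combined) in PATIENTS.items():
        extra = added_by_text(structured, combined)
        gain = coverage_gain(structured, combined)
        print(f"{label}: +{extra} patients ({gain:.2f}x structured coverage)")
```

For personality disorder, for example, combining the two sources roughly doubles the number of patients identified compared with structured data alone.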
GATE: a framework for Human Language Technology
● A framework for language processing
● Open Source – a large community of users and developers
● Mature: over ten years old, currently at version 7.1
● Funded by a mix of EU, UK RC and commercial funding
● The most widely used toolkit of its kind, with 1000s of users at 100s of sites
  – BBC World Cup and Olympics sites; The Press Association; The National Archives; Elsevier; IBM and Oracle integration; various pharma; many other multi-nationals and SMEs
● Biggest single installation supports 10 000 concurrent users
● An architecture: simplifying the construction of natural language processing software.
Information extraction at the mental health BRC
● NIHR funded Biomedical Research Centre at South London and Maudsley NHS Trust / KCL Institute of Psychiatry
● Part of a long-term project to provide data for mental health epidemiology
● Centred around a Case Register of previous cases – CRIS
● Dealing mainly with correspondence and well-formed notes
● Targets are not known in advance – the BRC
BRC applications
● Smoking
● Mini Mental State Examination
● Diagnosis
● Medications
● Education level / left school
● Social care
● Negative symptoms of schizophrenia
● To come soon: general symptomatology, suicide, classifying adolescent MH, pregnancy etc.
Obstetrics decision support
● Commercial proof of concept
● Extracting multiple targets from noisy, terse labour suite notes
● Question: can information extraction deliver the quality required for decision support?
Obstetrics screen shot
Semantic search
● Task:
  ● From the research literature, find proteins that bind hyaluronan and that are expressed in brain tissue
● Approach:
  ● Annotate texts against a large linked data knowledge base (Linked Life Data), to provide semantics
  ● Multimodal search across full text, annotations and the knowledge base
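The approach above can be illustrated with a toy sketch that combines the three search modes: a keyword filter on the full text, annotations that link mentions to knowledge base identifiers, and constraints evaluated against knowledge base triples. Everything here is hypothetical: the identifiers (`protein:P1` and so on), the triples, the documents and the `semantic_search` function are placeholders for illustration, not Linked Life Data content or a GATE API.

```python
from dataclasses import dataclass, field

# Toy knowledge base of (subject, predicate, object) triples.
# All identifiers and facts are placeholders, not real Linked Life Data content.
KB = {
    ("protein:P1", "binds", "chem:hyaluronan"),
    ("protein:P1", "expressedIn", "tissue:brain"),
    ("protein:P2", "binds", "chem:hyaluronan"),
    ("protein:P2", "expressedIn", "tissue:liver"),
    ("protein:P3", "expressedIn", "tissue:brain"),
}


@dataclass
class Document:
    doc_id: str
    text: str
    # Annotations map a mention in the text to a knowledge base identifier.
    annotations: dict = field(default_factory=dict)


DOCS = [
    Document("d1", "P1 was detected in cortical samples.", {"P1": "protein:P1"}),
    Document("d2", "P2 levels rose after treatment.", {"P2": "protein:P2"}),
    Document("d3", "P3 and P1 were both measured.",
             {"P3": "protein:P3", "P1": "protein:P1"}),
]


def subjects(predicate, obj):
    """All knowledge base subjects that have the given predicate and object."""
    return {s for (s, p, o) in KB if p == predicate and o == obj}


def semantic_search(docs, constraints, keyword=None):
    """Ids of documents annotated with an entity that meets every constraint.

    `constraints` is a non-empty list of (predicate, object) pairs.
    `keyword`, if given, must also occur in the document's full text.
    """
    wanted = set.intersection(*[subjects(p, o) for p, o in constraints])
    hits = []
    for doc in docs:
        if keyword is not None and keyword.lower() not in doc.text.lower():
            continue
        if wanted & set(doc.annotations.values()):
            hits.append(doc.doc_id)
    return hits


if __name__ == "__main__":
    query = [("binds", "chem:hyaluronan"), ("expressedIn", "tissue:brain")]
    print(semantic_search(DOCS, query))
    print(semantic_search(DOCS, query, keyword="cortical"))
```

Note that no document in this toy set contains the words "hyaluronan" or "brain": the matches come from the annotations and the knowledge base, which is what a plain full-text search would miss.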
Example: FastVac
● This demo uses data from FastVac, a project for the rapid design, development, testing and licensing of vaccines
● GATE provides technologies to assist with systematic literature reviews for FastVac
● FastVac is funded by the EU Directorate General for Health & Consumers. Partners are:
  – Coordinator: Netherlands Vaccine Institute (NVI) – Netherlands
  – Health Protection Agency (HPA) – United Kingdom
  – Statens Serum Institut (SSI) – Denmark
  – Cantacuzino Institute (CI) – Romania
  – National Centre for Epidemiology (NCE) – Hungary
  – Norwegian Institute of Public Health (NIPH) – Norway
  – University of Plovdiv (MUP) – Bulgaria
GATE and medical records
● GATE systems often highly ranked in I2B2 challenges
● Some commercial use by pharma and EPR vendors
● US academic systems:
  – CaTIES
  – HiTEX
● University of Sheffield
  – CLEF – a Clinical e-Science Framework
  – German radiology reports
  – Obstetrics system
  – BRC at South London and Maudsley
The problem with free text search
● Smoking
  – Smokes 20 a day
  – Stopped smoking 2 years ago
  – Regularly smokes hash
  – Burnt the toast and set off the smoke alarm
● Diagnosis
  – Diagnosed with Alzheimer's 2011
  – Mother has Alzheimer's
● Cognitive ability
  – We will do an MMSE next week
  – Two weeks ago MMSE was 19/30
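The examples above can be turned into a small demonstration of why keyword search is not enough. In this sketch, a plain search for "smok" matches all four smoking phrases, while a handful of ordered context rules separates them. The rules, labels and function names are assumptions made up for this illustration; they are far simpler than a real extraction pipeline and are not the rules used in the BRC applications.

```python
import re

NOTES = [
    "Smokes 20 a day",
    "Stopped smoking 2 years ago",
    "Regularly smokes hash",
    "Burnt the toast and set off the smoke alarm",
]

# Plain free text search: any occurrence of the stem "smok".
KEYWORD = re.compile(r"smok", re.IGNORECASE)

# Ordered (pattern, label) rules; the first match wins.
# Hypothetical rules written only for the four example phrases.
RULES = [
    (re.compile(r"\bsmoke (alarm|detector)\b", re.I), "not about the patient"),
    (re.compile(r"\b(stopped|quit|gave up) smoking\b", re.I), "past smoker"),
    (re.compile(r"\bsmok\w* (hash|cannabis)\b", re.I), "smokes, not tobacco"),
    (re.compile(r"\bsmok(es|ing)\b", re.I), "current smoker"),
]

# An MMSE mention only carries a result when a score out of 30 follows it.
MMSE_SCORE = re.compile(r"\bMMSE\b.*?\b(\d{1,2})\s*/\s*30\b")


def classify(note):
    """Label a note with the first context rule that matches it."""
    for pattern, label in RULES:
        if pattern.search(note):
            return label
    return "no mention"


def mmse_score(note):
    """Return the MMSE score as an int, or None if no score is stated."""
    match = MMSE_SCORE.search(note)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for note in NOTES:
        print(bool(KEYWORD.search(note)), "|", classify(note), "|", note)
    print(mmse_score("We will do an MMSE next week"))
    print(mmse_score("Two weeks ago MMSE was 19/30"))
```

The keyword search returns every note, including the smoke alarm; the rules distinguish current, past and non-tobacco smoking, and only the second MMSE sentence yields a score.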
Results
Accuracy for correctly identifying target text and features, measured against unseen data
Application               Iterations  Recall  Precision
MMSE                               6    0.89       0.94
Diagnosis text only                6    0.46       0.50
Smoking and status                 6    0.58       0.93
Medication                         7    0.62       0.9
Dose, route, start, stop           7    0.59       0.87
Education level                    7    0.25       1.00
Left school age                    7    1.00       1.00
Lives alone                        2    1.00       0.93
Results
Accuracy for correctly identifying social care interventions and their currency (past, current, planned etc), measured against development data
Application               Iterations  Recall  Precision
Care home                          5    0.73       0.82
Generic care package               5    0.79       0.78
Day care                           5    0.89       0.79
Home care                          5    0.96       0.88
Meals on wheels                    5    0.89       1.00
Respite care                       5    0.84       0.81
Overall accuracy                        0.82       0.82
Results
Initial results against unseen data (k-fold cross-validated)
Application           Recall  Precision    F1
Abstract thinking       0.58       0.74  0.65
Affect                  0.28       1.00  0.78
Apathy                  0.89       0.60  0.72
Emotional withdrawal    0.06       0.50  0.10
Eye contact             0.54       0.82  0.65
Poverty of speech       0.27       0.83  0.40
Rapport                 0.56       0.82  0.67
Social withdrawal       0.58       0.85  0.69
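For readers unfamiliar with the F1 column, here is a minimal sketch of how it relates to the other two, assuming F1 is the usual harmonic mean of precision and recall (the slides do not state the formula). The function name `f1_score` is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall for two rows of the table.
    print(f"Abstract thinking: {f1_score(0.74, 0.58):.2f}")
    print(f"Apathy: {f1_score(0.60, 0.89):.2f}")
```

Because it is a harmonic mean, F1 is pulled towards the lower of the two values, so a row with high precision but very low recall still scores poorly.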