Information Extraction from unstructured data

(1)

Angus Roberts

(2)

University of Sheffield NLP

Acknowledgements

● Technologies
  – University of Sheffield
    ● Natural Language Processing Group
    ● GATE Team
● Applications
  – NIHR Mental Health Biomedical Research Centre at South London and Maudsley / King's College London Institute of Psychiatry
  – Health Protection Agency (EU FastVac project)
  – Commercial partners

(3)

Outline

● Why tackle text?
● Information extraction examples
  – NIHR Mental Health Biomedical Research Centre
  – Obstetrics decision support
● Adding semantics from coding schemes
● Semantics example
  – Semantic indexing and visualisation of the

(4)

● There is a lot of free text in secondary care records
● For example, in mental health records, much information of value is not in the structured data
  – Few laboratory tests
  – Emphasis on relatively subtle symptomatology and overlapping diagnoses
● A clear need to extract and structure information from free text
  – Outcomes (e.g. cognitive function)
  – Context (e.g. education)
  – Presentations (e.g. symptoms)
  – Risk profiles (e.g. smoking)

(5)

(6)

(7)

Free text vs structured data: MMSE coverage

Source                                                              Cases  Instances
MMSE in structured data                                              4000       5792
Text retrieved containing the string "MMSE" from unstructured text  16585      48805
MMSEs with dates, mined and validated from unstructured text        15364      34871

(8)

                                  Structured source     GATE from free text   Structured OR GATE
                                   Patients  Instances   Patients  Instances   Patients  Instances
MMSE                                   5282       7944      18425      42752
Meds: Olanzapine                       9573      74921      17231     263805      18076     338726
Meds: Clozapine                        1829     170160       2901      74045       3065     244205
ICD-10 F00: Alzheimer's                5447       8305       4612      16967       6428      25570
ICD-10 F60: Personality Disorder       3329       6489       5444      28587       6659      35076
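In the Olanzapine, Clozapine and F60 rows above, the "Structured OR GATE" instance count equals the sum of the two source counts, while the patient count is smaller than the sum, as expected when a patient found by both sources is counted once. The following is a minimal Python sketch of that kind of combination, assuming each source is a mapping from patient identifier to instance count. The function name, identifiers and toy counts are hypothetical, and this is not the BRC's implementation.

```python
def combine_sources(structured, extracted):
    """Combine two sources keyed by patient ID.

    Patients are combined as a set union (a patient present in both
    sources is counted once); instances are simply added together.
    """
    patients = set(structured) | set(extracted)
    instances = sum(structured.values()) + sum(extracted.values())
    return len(patients), instances


# Hypothetical toy data: patient ID -> number of recorded instances.
structured = {"p1": 2, "p2": 1}
extracted = {"p2": 3, "p3": 4}
print(combine_sources(structured, extracted))  # (3, 10)

# Instance counts copied from the table: (structured, GATE, combined).
reported = {
    "Olanzapine": (74921, 263805, 338726),
    "Clozapine": (170160, 74045, 244205),
    "F60": (6489, 28587, 35076),
}
for name, (s, g, combined) in reported.items():
    # Prints True when the combined count is the sum of the two sources.
    print(name, s + g == combined)
```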

(9)

GATE: a framework for Human Language Technology

● A framework for language processing
● Open Source – a large community of users and developers
● Mature: over ten years old, currently at version 7.1
● Funded by a mix of EU, UK RC and commercial funding
● The most widely used toolkit of its kind, with 1000s of users at 100s of sites
  – BBC World Cup and Olympics sites; The Press Association; The National Archives; Elsevier; IBM and Oracle integration; various pharma; many other multi-nationals and SMEs
● Biggest single installation supports 10 000 concurrent users
● An architecture: simplifying the construction of natural language processing software.

(10)

Information extraction at the mental health BRC

● NIHR-funded Biomedical Research Centre at South London and Maudsley NHS Trust / KCL Institute of Psychiatry
● Part of a long-term project to provide data for mental health epidemiology
● Centred around a Case Register of previous cases – CRIS
● Dealing mainly with correspondence and well-formed notes
● Targets are not known in advance – the BRC

(11)

BRC applications

● Smoking
● Mini Mental State Examination
● Diagnosis
● Medications
● Education level / left school
● Social care
● Negative symptoms of schizophrenia
● To come soon: general symptomatology, suicide, classifying adolescent MH, pregnancy etc.

(12)

(13)

(14)

Obstetrics decision support

● Commercial proof of concept
● Extracting multiple targets from noisy, terse labour suite notes
● Question: can information extraction deliver the quality required for decision support?

(15)

Obstetrics screen shot

(16)

Obstetrics screen shot

(17)

Semantic search

● Task:
  ● From the research literature, find proteins that bind hyaluronan and that are expressed in brain tissue
● Approach:
  ● Annotate texts against a large linked data knowledge base (Linked Life Data), to provide semantics
  ● Multimodal search across full text, annotations and the knowledge base
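The approach above can be illustrated with a toy example. The Python sketch below assumes a tiny in-memory knowledge base of subject/predicate/object triples and documents whose annotations are sets of knowledge base identifiers. Every identifier, document and function name here is hypothetical, and the code does not use Linked Life Data or any GATE API. It only shows how one query can combine a knowledge base constraint, an annotation constraint and a full text constraint.

```python
# Toy knowledge base of (subject, predicate, object) triples (hypothetical).
KB = {
    ("protein:P1", "binds", "hyaluronan"),
    ("protein:P2", "binds", "hyaluronan"),
    ("protein:P3", "binds", "collagen"),
}

# Toy documents; each annotation links the text to a KB identifier (hypothetical).
DOCS = [
    {"id": "doc1", "text": "P1 is expressed in brain tissue.",
     "annotations": {"protein:P1", "tissue:brain"}},
    {"id": "doc2", "text": "P2 is expressed in liver.",
     "annotations": {"protein:P2", "tissue:liver"}},
    {"id": "doc3", "text": "P3 is expressed in brain tissue.",
     "annotations": {"protein:P3", "tissue:brain"}},
]


def subjects(kb, predicate, obj):
    """Knowledge base lookup: all subjects with the given predicate and object."""
    return {s for (s, p, o) in kb if p == predicate and o == obj}


def search(kb, docs, ligand, tissue, phrase):
    """Combine the three modes: KB facts, annotations and full text."""
    candidates = subjects(kb, "binds", ligand)        # knowledge base
    hits = []
    for doc in docs:
        proteins = candidates & doc["annotations"]    # annotations
        if (proteins and tissue in doc["annotations"]
                and phrase in doc["text"].lower()):   # full text
            hits.append((doc["id"], sorted(proteins)))
    return hits


print(search(KB, DOCS, "hyaluronan", "tissue:brain", "expressed in"))
# [('doc1', ['protein:P1'])]
```

Co-occurrence within a document is a deliberately crude stand-in for the relation "expressed in"; a real system would need to check that the annotations are actually related in the text.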

(18)

(19)

Example: FastVac

● This demo uses data from FastVac, a project for the rapid design, development, testing and licensing of vaccines
● GATE provides technologies to assist with systematic literature reviews for FastVac
● FastVac is funded by the EU Directorate General for Health & Consumers. Partners are:
  – Coordinator: Netherlands Vaccine Institute (NVI), Netherlands
  – Health Protection Agency (HPA), United Kingdom
  – Statens Serum Institut (SSI), Denmark
  – Cantacuzino Institute (CI), Romania
  – National Centre for Epidemiology (NCE), Hungary
  – Norwegian Institute of Public Health (NIPH), Norway
  – University of Plovdiv (MUP), Bulgaria

(20)

(21)

(22)

(23)

(24)

(25)

(26)

(27)

GATE and medical records

● GATE systems often highly ranked in I2B2 challenges
● Some commercial use by pharma and EPR vendors
● US academic systems:
  – CaTIES
  – HiTEX
● University of Sheffield
  – CLEF: a Clinical e-Science Framework
  – German radiology reports
  – Obstetrics system
  – BRC at South London and Maudsley

(28)

The problem with free text search

● Smoking
  – Smokes 20 a day
  – Stopped smoking 2 years ago
  – Regularly smokes hash
  – Burnt the toast and set off the smoke alarm
● Diagnosis
  – Diagnosed with alzheimer's 2011
  – Mother has alzheimer's
● Cognitive ability
  – We will do an MMSE next week
  – Two weeks ago MMSE was 19/30
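The two MMSE examples above show why a plain keyword search is not enough: both sentences contain the string "MMSE", but only one reports a result. The Python sketch below contrasts a keyword match with a simple pattern that requires a score out of 30 after the keyword. The regular expression and function names are assumptions made for illustration, not the rules used in the BRC applications, and a real system would also need to handle dates, negation and other score formats.

```python
import re

# Hypothetical pattern: "MMSE", then up to 30 non-digit characters,
# then a one- or two-digit score written as "NN/30".
MMSE_SCORE = re.compile(r"\bMMSE\b\D{0,30}?(\d{1,2})\s*/\s*30\b", re.IGNORECASE)


def keyword_hit(text):
    """Naive free text search: does the string 'MMSE' occur at all?"""
    return "mmse" in text.lower()


def extract_mmse_score(text):
    """Return the score as an int if an 'NN/30' result follows the keyword, else None."""
    match = MMSE_SCORE.search(text)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 30 else None


notes = [
    "We will do an MMSE next week",
    "Two weeks ago MMSE was 19/30",
]

for note in notes:
    print(keyword_hit(note), extract_mmse_score(note), "|", note)
# True None | We will do an MMSE next week
# True 19 | Two weeks ago MMSE was 19/30
```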

(29)

Accuracy for correctly identifying target text and features, measured against unseen data

Application          Iterations  Recall  Precision
MMSE                          6    0.89       0.94
Diagnosis text only           6    0.46       0.50
Smoking and status            6    0.58       0.93
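For readers unfamiliar with the two measures, the sketch below shows how recall and precision are conventionally computed from counts of correct, spurious and missed extractions. It assumes the standard definitions; the slides do not give the underlying counts, so the numbers used here are hypothetical and chosen only to illustrate the arithmetic.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted = true_pos + false_pos   # everything the system extracted
    actual = true_pos + false_neg      # everything it should have extracted
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 8 correct extractions, 2 spurious, 4 missed.
p, r = precision_recall(true_pos=8, false_pos=2, false_neg=4)
print(f"precision={p:.2f} recall={r:.2f}")
# precision=0.80 recall=0.67
```

High precision with low recall, as in the "Smoking and status" row, means that what the system extracts is usually right but that it misses many mentions.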

(30)

Results

Application               Iterations  Recall  Precision
Medication                         7    0.62        0.9
Dose, route, start, stop           7    0.59       0.87
Education level                    7    0.25       1.00
Left school age                    7    1.00
Lives alone                        2    1.00       0.93

Accuracy of correctly identifying target text and features, measured against unseen data

(31)

Results

Application           Iterations  Recall  Precision
Care home                      5    0.73       0.82
Generic care package           5    0.79       0.78
Day care                       5    0.89       0.79
Home care                      5    0.96       0.88
Meals on wheels                5    0.89       1.00
Respite care                   5    0.84       0.81

Overall accuracy: 0.82

Accuracy for correctly identifying social care interventions and their currency (past, current, planned etc.), measured against development data

(32)

Results

Application           Recall  Precision    F1
Abstract thinking       0.58       0.74  0.65
Affect                  0.28       1.00  0.78
Apathy                  0.89       0.60  0.72
Emotional withdrawal    0.06       0.50  0.10
Eye contact             0.54       0.82  0.65
Poverty of speech       0.27       0.83  0.40
Rapport                 0.56       0.82  0.67
Social withdrawal       0.58       0.85  0.69

Initial results against unseen data (k-fold cross-validated)
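The F1 column is conventionally the harmonic mean of precision and recall. The short Python sketch below assumes that standard definition and recomputes F1 for three rows of the table from their reported recall and precision; the function name is illustrative.

```python
def f1_score(recall, precision):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs copied from the table.
rows = {
    "Abstract thinking": (0.58, 0.74),
    "Apathy": (0.89, 0.60),
    "Eye contact": (0.54, 0.82),
}

for name, (recall, precision) in rows.items():
    print(f"{name}: F1 = {f1_score(recall, precision):.2f}")
# Abstract thinking: F1 = 0.65
# Apathy: F1 = 0.72
# Eye contact: F1 = 0.65
```

Because it is a harmonic mean, F1 stays low whenever either measure is low, which is why "Emotional withdrawal" scores poorly despite a precision of 0.50.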