
(1)

Information Extraction from unstructured data

Angus Roberts

(2)

University of Sheffield NLP

Acknowledgements

Technologies

University of Sheffield

Natural Language Processing Group

GATE Team

Applications

NIHR Mental Health Biomedical Research Centre at South London and Maudsley / King's College London Institute of Psychiatry

Health Protection Agency (EU FastVac project)

Commercial partners

(3)

Outline

Why tackle text?

Information extraction examples

NIHR Mental Health Biomedical Research Centre

Obstetrics decision support

Adding semantics from coding schemes

Semantics example

Semantic indexing and visualisation of the

(4)

There is a lot of free text in secondary care records

For example, in mental health records, much information of value is not in the structured data

Few laboratory tests

Emphasis on relatively subtle symptomatology and overlapping diagnoses

A clear need to extract and structure information from free text

Outcomes (e.g. cognitive function)

Context (e.g. education)

Presentations (e.g. symptoms)

Risk profiles (e.g. smoking)

(5)


Free text vs structured data: MMSE coverage

                                                                     Cases   Instances
MMSE in structured data                                               4000        5792
Text retrieved containing the string "MMSE" from unstructured text   16585       48805
MMSEs with dates, mined and validated from unstructured text         15364       34871
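
The size of the gap between the sources is easier to see as ratios. The following is a small arithmetic sketch using only the figures on the slide; the variable names are illustrative and not part of the original talk.

```python
# Figures copied from the slide: (cases, instances) per source.
structured = (4000, 5792)       # MMSE in structured data
string_match = (16585, 48805)   # free text containing the string "MMSE"
validated = (15364, 34871)      # MMSEs with dates, mined and validated


def ratio(a, b):
    """Return a / b rounded to two decimal places."""
    return round(a / b, 2)


# How many times more cases and instances the validated free-text route yields.
case_gain = ratio(validated[0], structured[0])
instance_gain = ratio(validated[1], structured[1])

# Share of the raw string matches that survive mining and validation.
kept_cases = ratio(validated[0], string_match[0])
kept_instances = ratio(validated[1], string_match[1])

print(f"gain over structured data: {case_gain}x cases, {instance_gain}x instances")
print(f"share of string matches kept: {kept_cases} of cases, {kept_instances} of instances")
```

On these figures, validated free-text extraction yields roughly 3.8 times the cases and 6 times the instances of the structured field, while about 29% of the instances that merely contain the string "MMSE" are not kept as dated, validated scores.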

(8)

                                   Structured source      GATE from free text    Structured OR GATE
                                   Patients  Instances    Patients  Instances    Patients  Instances
MMSE                                   5282       7944       18425      42752
Meds: Olanzapine                       9573      74921       17231     263805       18076     338726
Meds: Clozapine                        1829     170160        2901      74045        3065     244205
ICD-10 F00: Alzheimer's                5447       8305        4612      16967        6428      25570
ICD-10 F60: Personality Disorder       3329       6489        5444      28587        6659      35076
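
In the "Structured OR GATE" columns, the instance counts for Olanzapine, Clozapine and F60 equal the sum of the two sources, while the patient counts are smaller than the sum because many patients appear in both. A minimal sketch of that pooling, assuming each record is a (patient, date) pair; the records and names below are hypothetical and not taken from CRIS.

```python
# Hypothetical records: (patient_id, event_date) pairs from each source.
structured = [("p1", "2010-01-05"), ("p2", "2010-02-11")]
gate = [("p2", "2010-03-02"), ("p2", "2010-04-20"), ("p3", "2010-05-09")]


def summarise(records):
    """Return (distinct patients, total instances) for a list of records."""
    patients = {patient_id for patient_id, _ in records}
    return len(patients), len(records)


# "Structured OR GATE": pool the records from both sources.
combined = structured + gate

print("structured:", summarise(structured))
print("gate:      ", summarise(gate))
print("combined:  ", summarise(combined))
```

Here the instances add up (2 + 3 = 5), but patient p2 occurs in both sources and is counted once, so the combined patient count is 3 rather than 4.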

(9)

GATE: a framework for Human Language Technology

A framework for language processing

Open Source – a large community of users and developers

Mature: over ten years old, currently at version 7.1

Funded by a mix of EU, UK RC and commercial funding

The most widely used toolkit of its kind, with 1000s of users at 100s of sites

BBC World Cup and Olympics sites; The Press Association; The National Archives; Elsevier; IBM and Oracle integration; various pharma; many other multi-nationals and SMEs

Biggest single installation supports 10 000 concurrent users

An architecture: simplifying the construction of natural language processing software.

(10)

Information extraction at the mental health BRC

NIHR funded Biomedical Research Centre at South London and Maudsley NHS Trust / KCL Institute of Psychiatry

Part of a long term project to provide data for mental health epidemiology

Centred around a Case Register of previous cases – CRIS

Dealing mainly with correspondence and well formed notes

Targets are not known in advance – the BRC

(11)

BRC applications

Smoking

Mini Mental State Examination

Diagnosis

Medications

Education level / left school

Social care

Negative symptoms of schizophrenia

To come soon: general symptomatology, suicide, classifying adolescent MH, pregnancy etc.

(12)
(13)
(14)

Obstetrics decision support

Commercial proof of concept

Extracting multiple targets from noisy, terse labour suite notes

Question: can information extraction deliver the quality required for decision support?

(15)

Obstetrics screen shot

(16)

Obstetrics screen shot

(17)

Semantic search

Task:

From the research literature, find proteins that bind hyaluronan and that are expressed in brain tissue

Approach:

Annotate texts against a large linked data knowledge base (Linked Life Data), to provide semantics

Multimodal search across full text, annotations and the knowledge base
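
The idea of searching across text, annotations and a knowledge base at once can be sketched in a few lines. This is a toy illustration only: the triples, protein identifiers, documents and function names are all hypothetical, and it does not reproduce Linked Life Data or the search system shown in the talk.

```python
# Hypothetical knowledge base: (subject, predicate, object) triples.
KB = {
    ("ProteinA", "binds", "hyaluronan"),
    ("ProteinA", "expressed_in", "brain"),
    ("ProteinB", "binds", "hyaluronan"),
    ("ProteinB", "expressed_in", "liver"),
    ("ProteinC", "expressed_in", "brain"),
}

# Hypothetical documents: full text plus the KB entities annotated in them.
DOCS = {
    "doc1": {"text": "ProteinA was detected in cortical samples.",
             "entities": {"ProteinA"}},
    "doc2": {"text": "ProteinB levels were measured in hepatocytes.",
             "entities": {"ProteinB"}},
    "doc3": {"text": "ProteinA and ProteinC levels rose after injury.",
             "entities": {"ProteinA", "ProteinC"}},
}


def subjects(predicate, obj):
    """Return every KB subject that has the given predicate and object."""
    return {s for s, p, o in KB if p == predicate and o == obj}


def semantic_search(keyword=None):
    """Find documents annotated with a protein that binds hyaluronan and is
    expressed in brain, optionally also requiring a keyword in the full text."""
    wanted = subjects("binds", "hyaluronan") & subjects("expressed_in", "brain")
    hits = {}
    for doc_id, doc in DOCS.items():
        matched = doc["entities"] & wanted
        if matched and (keyword is None or keyword.lower() in doc["text"].lower()):
            hits[doc_id] = sorted(matched)
    return hits


print(semantic_search())
print(semantic_search("injury"))
```

The knowledge base supplies facts that never appear in the documents themselves, the annotations link mentions in the text to those facts, and the optional keyword adds an ordinary full-text constraint on top.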

(18)

(19)

Example: FastVac

This demo uses data from FastVac, a project for the rapid design, development, testing and licensing of vaccines

GATE provides technologies to assist with systematic literature reviews for FastVac

FastVac is funded by the EU Directorate General for Health & Consumers. Partners are:

Coordinator: Netherlands Vaccine Institute (NVI) – Netherlands

Health Protection Agency (HPA) – United Kingdom

Statens Serum Institut (SSI) – Denmark

Cantacuzino Institute (CI) – Romania

National Centre for Epidemiology (NCE) – Hungary

Norwegian Institute of Public Health (NIPH) – Norway

University of Plovdiv (MUP) – Bulgaria

(20)
(21)
(22)
(23)
(24)
(25)
(26)

(27)

GATE and medical records

GATE systems often highly ranked in I2B2 challenges

Some commercial use by pharma and EPR vendors

US academic systems:

CaTIES

HiTEX

University of Sheffield

CLEF – a Clinical e-Science Framework

German radiology reports

Obstetrics system

BRC at South London and Maudsley

(28)

The problem with free text search

Smoking

Smokes 20 a day

Stopped smoking 2 years ago

Regularly smokes hash

Burnt the toast and set off the smoke alarm

Diagnosis

Diagnosed with alzheimer's 2011

Mother has alzheimer's

Cognitive ability

We will do an MMSE next week

Two weeks ago MMSE was 19/30
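
The MMSE examples above show why matching a keyword is not enough: both sentences contain "MMSE", but only one records a score. The sketch below contrasts a plain string search with a pattern that also requires a score out of 30. The regular expression is a hypothetical illustration, not the rules used in the GATE applications described in this talk.

```python
import re

# The two cognitive ability examples from the slide.
NOTES = [
    "We will do an MMSE next week",
    "Two weeks ago MMSE was 19/30",
]

# Naive free text search: any note containing the string "MMSE".
keyword_hits = [note for note in NOTES if "MMSE" in note]

# Stricter pattern (an assumption for illustration): the mention must be
# followed, within 20 non-digit characters, by a score written as N/30.
SCORE = re.compile(r"\bMMSE\b\D{0,20}(\d{1,2})\s*/\s*30\b")


def extract_scores(notes):
    """Return the MMSE scores found in the notes, as integers."""
    scores = []
    for note in notes:
        match = SCORE.search(note)
        if match:
            scores.append(int(match.group(1)))
    return scores


print(len(keyword_hits))      # both notes mention MMSE
print(extract_scores(NOTES))  # only one note actually reports a score
```

Even this pattern ignores who the score belongs to and when the test took place, which is the kind of context the smoking and diagnosis examples also depend on.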

(29)

Accuracy for correctly identifying target text and features, measured against unseen data

Application            Iterations   Recall   Precision
MMSE                            6     0.89        0.94
Diagnosis text only             6     0.46        0.50
Smoking and status              6     0.58        0.93
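
Recall is the fraction of the gold standard items that the system found, and precision is the fraction of the system's output that is correct. A minimal sketch of how the two could be computed from sets of annotations, assuming exact matching on (document, start, end) spans; the spans are invented for illustration, and the talk does not say which matching criterion was used.

```python
def precision_recall(extracted, gold):
    """Compare two sets of annotations and return (precision, recall)."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations, each identified by (document, start, end).
gold = {("doc1", 14, 19), ("doc1", 40, 45), ("doc2", 3, 8), ("doc3", 7, 12)}
extracted = {("doc1", 14, 19), ("doc2", 3, 8), ("doc2", 30, 35)}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three extracted spans are correct (precision about 0.67) and two of the four gold spans were found (recall 0.50), the same pattern of high precision with lower recall seen in the smoking row above.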

(30)

Results

Accuracy of correctly identifying target text and features, measured against unseen data

Application                Iterations   Recall   Precision
Medication                          7     0.62        0.9
Dose, route, start, stop            7     0.59        0.87
Education level                     7     0.25        1.00
Left school age                     7     1.00        1.00
Lives alone                         2     1.00        0.93

(31)

Results

Accuracy for correctly identifying social care interventions and their currency (past, current, planned etc), measured against development data

Application             Iterations   Recall   Precision
Care home                        5     0.73        0.82
Generic care package             5     0.79        0.78
Day care                         5     0.89        0.79
Home care                        5     0.96        0.88
Meals on wheels                  5     0.89        1.00
Respite care                     5     0.84        0.81
Overall accuracy                       0.82        0.82

(32)

Results

Initial results against unseen data (k-fold cross-validated)

Application            Recall   Precision     F1
Abstract thinking        0.58        0.74   0.65
Affect                   0.28        1.00   0.78
Apathy                   0.89        0.60   0.72
Emotional withdrawal     0.06        0.50   0.10
Eye contact              0.54        0.82   0.65
Poverty of speech        0.27        0.83   0.40
Rapport                  0.56        0.82   0.67
Social withdrawal        0.58        0.85   0.69
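
F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the two-decimal recall and precision figures on the slide; the function and variable names are illustrative. Most rows are reproduced to within 0.01, which is what rounding of the inputs would allow. The Affect row is the exception: its recall and precision give about 0.44 against the 0.78 shown, so one of the three figures in that row appears to be mis-transcribed, and the slide does not indicate which.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, F1 as shown on the slide)
ROWS = {
    "Abstract thinking": (0.58, 0.74, 0.65),
    "Affect": (0.28, 1.00, 0.78),
    "Apathy": (0.89, 0.60, 0.72),
    "Emotional withdrawal": (0.06, 0.50, 0.10),
    "Eye contact": (0.54, 0.82, 0.65),
    "Poverty of speech": (0.27, 0.83, 0.40),
    "Rapport": (0.56, 0.82, 0.67),
    "Social withdrawal": (0.58, 0.85, 0.69),
}

# Rows whose recomputed F1 differs from the slide by more than rounding allows.
mismatches = []
for name, (recall, precision, shown) in ROWS.items():
    recomputed = f1(precision, recall)
    print(f"{name:22s} shown={shown:.2f} recomputed={recomputed:.2f}")
    if abs(recomputed - shown) > 0.011:
        mismatches.append(name)

print("rows that do not match:", mismatches)
```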
