Big Data and Text Mining

(1)

Dr. Ian Lewin

Senior NLP Resource Specialist

(2)

About Linguamatics

Boston, USA; Cambridge, UK

• Agile, scalable, real-time NLP-based text mining

• Fact extraction and knowledge synthesis

Software, Consulting, Hosted content

Pharma/Biotech, Healthcare, Government

Including 17 of

(3)

Solutions & Applications in Life Sciences

Advanced text analytics delivers value along the pipeline

• Toxicity analysis and prediction

• Biomarker discovery

• Drug repurposing

• Patent analysis

• KOL identification

• Opportunity scouting

• Trial site selection and study design

• Safety

• Competitive intelligence

• Pharmacovigilance

• Social media analysis

• Comparative Effectiveness

• Regulatory Submission QC

• HEOR

• SAR

(4)

Solutions & Applications in Healthcare

© Linguamatics 2015 - Confidential

[Diagram labels: Care gap models; Pathology, radiology, initial assessment, discharge, check up; Structured data; Patient characteristics; Potential adverse drug reactions; Clinical trials gov; Matching; Clinical trials; Clinical case histories and/or genomic interpretation; Electronic Health Record; Enterprise Data Warehouse; Patient lists; FDA drug labels; Scientific literature]

(5)

Structured Data & its Evidential Basis

... I2E can mine and extract with precision at scale

• Scientific literature

• Social media

• Patents

• News feeds

• EHRs

• Internal reports

• Drug labels

• Clinical trials

• ...

(6)

Text Mining – a precursor to Big Data?

• Unstructured data is just huge

• We can’t wait for those human database curators...

• Besides, those curators ignore my parameter...

• And all that text is just out there!

• (see Google for details)

(7)

Multisource data  Big data

• Lots of different types of data

− Scientific literature

− Medical records

− Patents

− Regulatory publications (clinical trials, drug labels, adverse event reporting…)

− Internal reports

• Lots of different types of text

• In lots of different silos

(8)

Connected Data Technology

Single query across multiple data sources and network locations

(9)

Connected Data Technology

(10)

Connected Data Technology

Unified results for fast review and discovery of relationships across multiple data sources

(11)

Huge (Textual) Data  Big Data

We (i.e. text-miners…) are often joining data

− Unstructured

− And structured

− Across silos

Before the tabular results go to “analysis”
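
As a rough illustration of joining extracted and structured data before analysis, here is a minimal sketch in Python. The rows, the lookup table, and the `join_extracted` helper are all hypothetical; they are not part of I2E or any Linguamatics API.

```python
# Hypothetical rows produced by text mining over unstructured documents.
extracted = [
    {"doc_id": "report-1", "drug": "aspirin", "event": "nausea"},
    {"doc_id": "note-7", "drug": "ibuprofen", "event": "rash"},
    {"doc_id": "note-9", "drug": "unknowndrug", "event": "headache"},
]

# Hypothetical structured table held in a different silo, keyed by drug name.
drug_table = {
    "aspirin": {"drug_class": "NSAID"},
    "ibuprofen": {"drug_class": "NSAID"},
}


def join_extracted(rows, table, key="drug"):
    """Left-join extracted rows with a structured lookup table on `key`."""
    joined = []
    for row in rows:
        extra = table.get(row[key], {})
        # Rows with no structured counterpart are kept, with drug_class=None.
        joined.append({**row, "drug_class": extra.get("drug_class")})
    return joined


table_rows = join_extracted(extracted, drug_table)
```

The left join keeps extracted rows that have no structured match, so the later analysis step can still see them.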

(12)

The How of Text Mining

Text Mining isn’t completely shrink wrapped

There is, usually, some customization

− To find the parameter value that you’re interested in

− To find the value that everyone’s interested in, but only in circumstances c

− To find it in datasource X

− To find it in X but only in circumstances c

− To map to ontology A rather than B

It often makes sense to express these constraints at time of text-mining (not analysis)
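
To make the idea of extraction-time constraints concrete, the sketch below passes the choice of data source and ontology as parameters of the extraction call itself. The ontologies, documents, and the `extract` function are invented for illustration and use plain substring matching, which is far simpler than a real text-mining engine.

```python
# Hypothetical mini-ontologies: surface term -> concept identifier.
ONTOLOGY_A = {
    "malignant neoplasm": "A:001",
    "malignant tumor": "A:001",
    "cancer": "A:001",
}
ONTOLOGY_B = {"cancer": "B:42"}

# Hypothetical documents, each tagged with the data source it came from.
DOCS = [
    {"source": "X", "text": "Patient history of malignant tumor."},
    {"source": "Y", "text": "Cancer screening was negative."},
    {"source": "X", "text": "No relevant findings."},
]


def extract(docs, ontology, source=None):
    """Find ontology terms in docs, optionally restricted to one data source."""
    hits = []
    for doc in docs:
        if source is not None and doc["source"] != source:
            continue  # constraint applied at extraction time, not afterwards
        lowered = doc["text"].lower()
        for term, concept_id in ontology.items():
            if term in lowered:
                hits.append((doc["source"], term, concept_id))
    return hits


hits_a_in_x = extract(DOCS, ONTOLOGY_A, source="X")
hits_b_all = extract(DOCS, ONTOLOGY_B)
```

Switching from ontology A to B, or from datasource X to all sources, changes only the arguments, not the downstream analysis.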

(13)

Toolbox of Methods for Powerful Querying

NLP

• Precise linguistic relationships, sentence co-occurrence

• Precise negation, e.g. “pressure” but not “blood pressure”

Terminologies

• Search for concepts and classes, not just keywords

• e.g. search for cancer and get synonyms and children: Malignant neoplasms, Malignant tumor …

Regular Expressions

• Rule-based pattern matching, e.g. for measurements, lab codes, mutations

• e.g. microRNA: let-?\d+.* mirn?a?-?\d+.*

Chemistry

Fielded Search

• Restrict within particular regions of a document, including nested regions, e.g. table cell in table in Description

High Throughput

• Simultaneous processing of large numbers of items, e.g. 500 compounds, 500 genes from a microarray experiment, etc.
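
Two of the toolbox items can be approximated with Python's standard `re` module. The microRNA patterns are the ones quoted on the slide; applying them per token and case-insensitively is an assumption. The negation example uses a negative lookbehind as a stand-in for the engine's own negation operator, and the function names are hypothetical.

```python
import re

# Patterns from the slide; per-token, case-insensitive use is an assumption.
MICRORNA = re.compile(r"(?:let-?\d+.*|mirn?a?-?\d+.*)", re.IGNORECASE)

# "pressure" but not "blood pressure", via a fixed-width negative lookbehind.
UNQUALIFIED_PRESSURE = re.compile(r"(?<!\bblood\s)\bpressure\b", re.IGNORECASE)


def find_micrornas(text):
    """Return whitespace-separated tokens that fully match a microRNA pattern."""
    found = []
    for token in text.split():
        cleaned = token.strip(".,;()")  # drop surrounding punctuation
        if MICRORNA.fullmatch(cleaned):
            found.append(cleaned)
    return found


def count_unqualified_pressure(text):
    """Count mentions of 'pressure' not directly preceded by 'blood'."""
    return len(UNQUALIFIED_PRESSURE.findall(text))


mirnas = find_micrornas("Expression of let-7a and miR-21 was reduced, unlike MIRN155.")
pressure_hits = count_unqualified_pressure(
    "Blood pressure was normal but intraocular pressure rose."
)
```

Matching per token keeps the trailing `.*` in the slide's patterns from swallowing the rest of the sentence.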

(14)

Linguistic Processing Using NLP

• Interprets meaning of the text

• Groups words into meaningful units

• Search for different forms of words

Example sentences:

We find that p42mapk phosphorylates c-Myb on serine and threonine.

Purified recombinant p42 MAPK was found to phosphorylate Wee1.

Annotations:

− sentences

− morphology: different forms

− noun groups match entities

− verb groups match actions
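
The two example sentences can be used to sketch what matching different word forms and entity variants buys you. The code below is only a regular-expression stand-in, not real linguistic processing: the pattern is hand-written for this one agent and verb, and `extract_relations` is a hypothetical helper.

```python
import re

# The two example sentences from the slide.
SENTENCES = [
    "We find that p42mapk phosphorylates c-Myb on serine and threonine.",
    "Purified recombinant p42 MAPK was found to phosphorylate Wee1.",
]

# Assumption: a hand-written pattern standing in for noun/verb group analysis.
# It accepts "p42mapk" or "p42 MAPK" and several forms of "phosphorylate".
PATTERN = re.compile(
    r"(?P<agent>p42\s?mapk)\b.*?\bphosphorylat(?:es|ed|ing|e)\s+(?P<target>[\w-]+)",
    re.IGNORECASE,
)


def extract_relations(sentences):
    """Return (agent, verb lemma, target) triples found in the sentences."""
    relations = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            # Normalise the agent so both spellings map to one entity.
            agent = re.sub(r"\s+", "", match.group("agent")).lower()
            relations.append((agent, "phosphorylate", match.group("target")))
    return relations


relations = extract_relations(SENTENCES)
```

Both sentences yield the same agent and verb despite different surface forms, which is the effect that noun-group and verb-group matching aims for in general.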

(15)

Discovering extraction patterns...

We often need to look at the data first (the “huge data”) to find the extraction patterns

Linguistic patterns of expression vary

− Over data sets

− Over time

This pre-extraction exploration itself needs informing

− By the ontologies and KBs that are already out there

− By the re-use of generally successful strategies

(16)

Innovative tools to enable exploration of complex and specialised data sets

Grant funded by InnovateUK (Dept of BIS and EPSRC)

Project End-date: mid 2016

“…easier discovery and extraction of key facts”

− by sharing search strategies rather than just search results

− by using novel algorithms for semantic information extraction

− by linking information from multiple resources to help users find similar and relevant information

(17)

Summary

Text Mining – the extraction of structured information from unstructured text

It’s a natural precursor to large scale analytics

It’s also a big data task itself

− Voluminous source data

− Distributed over many silos

− Expressed in different ways

It’s not just a precursor

− We’re (already) joining data at extraction time

− We’re researching how to exploit and join more data at the earliest phases of data exploration, prior to extraction

(18)

Thank You

For more information…

Visit: www.linguamatics.com
