
Big Data and Text Mining

Dr. Ian Lewin

Senior NLP Resource Specialist


About Linguamatics

© Linguamatics 2015


Boston, USA; Cambridge, UK

• Agile, scalable, real-time NLP-based text mining
• Fact extraction and knowledge synthesis

Software, consulting, hosted content

Pharma/Biotech, healthcare, government

Including 17 of


Solutions & Applications in Life Sciences

Advanced text analytics delivers value along the pipeline

• Gene-disease mapping
• Target ID/selection
• Mutation/expression analysis
• Toxicity analysis and prediction
• Biomarker discovery
• Drug repurposing
• Patent analysis
• KOL identification
• Opportunity scouting
• Trial site selection and study design
• Safety
• Competitive intelligence
• Pharmacovigilance
• Social media analysis
• Comparative Effectiveness
• Regulatory Submission QC
• HEOR
• SAR

Solutions & Applications in Healthcare

[Diagram. Labels: Care gap models; Pathology, radiology, initial assessment, discharge, check up; Structured data; Patient characteristics; Potential adverse drug reactions; ClinicalTrials.gov; Matching clinical trials; Clinical case histories and/or genomic interpretation; Electronic Health Record; Enterprise Data Warehouse; Patient lists; FDA drug labels; Scientific literature]

Structured Data & its Evidential Basis

... I2E can mine and extract with precision at scale

• Scientific literature
• Social media
• Patents
• News feeds
• EHRs
• Internal reports
• Drug labels
• Clinical trials
• ...

Text Mining – a precursor to Big Data?


Unstructured data is just huge

We can’t wait for those human database curators...

Besides, those curators ignore my parameter...

And all that text is just out there!

• (see Google for details)

Multisource data ⇒ Big data

Lots of different types of data

• Scientific literature
• Medical records
• Patents
• Regulatory publications (clinical trials, drug labels, adverse event reporting…)
• Internal reports

Lots of different types of text

In lots of different silos


Connected Data Technology

Single query across multiple data sources and network locations


Unified results for fast review and discovery of relationships across multiple data sources

Huge (Textual) Data ⇒ Big Data

We (i.e. text-miners…) are often joining data

− Unstructured

− And structured

− Across silos

Before the tabular results go to “analysis”
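The join described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two "silos", the patient identifiers, the drug vocabulary and the helper names (`extract_drugs`, `join_rows`) are all hypothetical, and a keyword regex stands in for real text mining.

```python
import re

# Hypothetical structured silo: patient_id -> fields (illustrative values only).
structured = {"p1": {"age": 64}, "p2": {"age": 51}}

# Hypothetical unstructured silo: free-text notes keyed by patient_id.
notes = {
    "p1": "Patient started on aspirin; later switched to warfarin.",
    "p2": "No anticoagulants prescribed.",
}

# Toy vocabulary standing in for a real terminology (assumption).
DRUGS = ("aspirin", "warfarin")
DRUG_RE = re.compile(r"\b(" + "|".join(DRUGS) + r")\b", re.IGNORECASE)


def extract_drugs(text):
    """Return the lower-cased drug mentions found in text, in order."""
    return [m.group(1).lower() for m in DRUG_RE.finditer(text)]


def join_rows(structured, notes):
    """Join mentions extracted from text with structured fields into flat rows."""
    rows = []
    for pid, text in notes.items():
        for drug in extract_drugs(text):
            rows.append({"patient_id": pid, "drug": drug, **structured.get(pid, {})})
    return rows


rows = join_rows(structured, notes)
for row in rows:
    print(row)
```

The rows that come out are already tabular, so the downstream "analysis" step never has to touch the raw text.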


The How of Text Mining

Text Mining isn’t completely shrink-wrapped

There is usually some customization

− To find the parameter value that you’re interested in
− To find the value that everyone’s interested in, but only in circumstances c
− To find it in datasource X
− To find it in X but only in circumstances c
− To map to ontology A rather than B

It often makes sense to express these constraints at time of text-mining (not analysis)
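One way to picture constraints being expressed at text-mining time is a query object that carries them. The sketch below is an assumption-laden illustration, not I2E's query model: the `Query` class, the `run` function, the document layout and the concept identifier `A:0001` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Query:
    source: str                          # which datasource to search ("X")
    circumstance: Callable[[str], bool]  # "circumstances c" as a predicate on a sentence
    ontology: Dict[str, str]             # surface term -> concept id in the chosen ontology


def run(query: Query, documents: List[dict]) -> List[Tuple[str, str]]:
    """Return (concept id, sentence) pairs that satisfy every constraint."""
    hits = []
    for doc in documents:
        if doc["source"] != query.source:
            continue  # datasource constraint
        for sentence in doc["sentences"]:
            if not query.circumstance(sentence):
                continue  # circumstance constraint
            for term, concept in query.ontology.items():
                if term in sentence.lower():
                    hits.append((concept, sentence))  # mapped to the chosen ontology
    return hits


# Hypothetical documents from two sources.
docs = [
    {"source": "labels", "sentences": ["Dose was 5 mg in adults.", "Dose was 2 mg in children."]},
    {"source": "patents", "sentences": ["Dose was 9 mg in adults."]},
]
ONTOLOGY_A = {"dose": "A:0001"}  # made-up concept id

query = Query(source="labels", circumstance=lambda s: "adults" in s, ontology=ONTOLOGY_A)
hits = run(query, docs)
print(hits)
```

Swapping `ONTOLOGY_A` for a different mapping, or changing the predicate, alters what is extracted without touching the analysis code that consumes the results.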


Toolbox of Methods for Powerful Querying

NLP
• Precise linguistic relationships, sentence co-occurrence
• Precise negation, e.g. “pressure” but not “blood pressure”

Terminologies
• Search for concepts and classes, not just keywords
• e.g. search for cancer and get synonyms and children: Malignant neoplasms, Malignant tumor …

Regular Expressions
• Rule-based pattern matching for e.g. measurements, lab codes, mutations
• e.g. microRNA: let-?\d+.* mirn?a?-?\d+.*

Chemistry

Fielded Search
• Restrict within particular regions of a document, including nested, e.g. table cell in table in Description

High Throughput
• Simultaneous processing of large numbers of items, e.g. 500 compounds, 500 genes from microarray experiment, etc.
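The microRNA patterns quoted in the toolbox can be tried out with Python's standard `re` module. This is a minimal sketch: the two patterns are taken from the slide, while the case-insensitive flag, the whitespace tokenization and the function names are assumptions made for the example.

```python
import re

# The two patterns from the slide; case-insensitive matching is an assumption.
MICRORNA_PATTERNS = [
    re.compile(r"let-?\d+.*", re.IGNORECASE),
    re.compile(r"mirn?a?-?\d+.*", re.IGNORECASE),
]


def is_microrna(token):
    """True if the whole token matches one of the microRNA patterns."""
    return any(p.fullmatch(token) for p in MICRORNA_PATTERNS)


def find_micrornas(text):
    """Split on whitespace, strip surrounding punctuation, keep matching tokens."""
    tokens = (t.strip(".,;:()") for t in text.split())
    return [t for t in tokens if is_microrna(t)]


found = find_micrornas(
    "Expression of let-7a and miR-21 was reduced, but letter counts were not."
)
print(found)
```

Matching whole tokens with `fullmatch` keeps ordinary words such as "letter" out of the results, since the patterns require a digit after the prefix.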


Linguistic Processing Using NLP

Interprets meaning of the text

Groups words into meaningful units
Search for different forms of words

We find that p42mapk phosphorylates c-Myb on serine and threonine.
Purified recombinant p42 MAPK was found to phosphorylate Wee1.

• sentences
• morphology - different forms
• noun groups match entities
• verb groups match actions
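The two example sentences express the same relation with different word forms ("p42mapk" vs "p42 MAPK", "phosphorylates" vs "phosphorylate"). The sketch below shows how both can be reduced to one normalized fact. It is a surface-pattern stand-in for real linguistic processing, and the regex, the normalization rule and the function name are assumptions made for illustration.

```python
import re

SENTENCES = [
    "We find that p42mapk phosphorylates c-Myb on serine and threonine.",
    "Purified recombinant p42 MAPK was found to phosphorylate Wee1.",
]

# Agent: "p42mapk" with optional internal space; verb: any form of
# "phosphorylate(s)"; target: the next word, allowing hyphens.
RELATION_RE = re.compile(
    r"(p42\s?mapk)\b.*?\bphosphorylates?\s+([\w-]+)", re.IGNORECASE
)


def extract_relations(sentences):
    """Return (agent, action, target) triples with agent and verb normalized."""
    triples = []
    for sentence in sentences:
        match = RELATION_RE.search(sentence)
        if match:
            agent = re.sub(r"\s+", "", match.group(1)).upper()  # one canonical name
            triples.append((agent, "phosphorylate", match.group(2)))
    return triples


triples = extract_relations(SENTENCES)
for triple in triples:
    print(triple)
```

A real NLP pipeline would get the same effect from morphology and noun/verb grouping rather than from a hand-written pattern per relation.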

Discovering extraction patterns...

We often need to look at the data first (the “huge data”) to find the extraction patterns

Linguistic patterns of expression vary
− Over data sets
− Over time

This pre-extraction exploration itself needs informing
− By the ontologies and KBs that are already out there
− By the re-use of generally successful strategies


Innovative tools to enable exploration of complex and specialised data sets

Grant funded by InnovateUK (Dept of BIS and EPSRC)

Sponsored Partners: Univ. of Essex & Linguamatics

Project End-date: mid 2016

“…easier discovery and extraction of key facts”
− by sharing search strategies rather than sharing just search results
− by using novel algorithms for semantic information extraction
− by linking information from multiple resources to help users find similar and relevant information


Summary

Text Mining – the extraction of structured information from unstructured text

It’s a natural precursor to large-scale analytics

It’s also a big data task itself
− Voluminous source data
− Distributed over many silos
− Expressed in different ways

It’s not just a precursor

We’re (already) joining data at extraction time
− We’re researching how to exploit and join more data at the earliest phases of data exploration, prior to extraction


Thank You

For more information…

Visit: www.linguamatics.com
