About Linguamatics
© Linguamatics 2015
Boston, USA Cambridge, UK
• Agile, scalable, real-time NLP-based text mining
• Fact extraction and knowledge synthesis
Software Consulting Hosted content
Pharma/Biotech Healthcare Government
Including 17 of
Solutions & Applications in Life Sciences
Advanced text analytics delivers value along the pipeline
• Gene-disease mapping
• Target ID/selection
• Mutation/expression analysis
• Toxicity analysis and prediction
• Biomarker discovery
• Drug repurposing
• Patent analysis
• KOL identification
• Opportunity scouting
• Trial site selection and study design
• Safety
• Competitive intelligence
• Pharmacovigilance
• Social media analysis
• Comparative Effectiveness
• Regulatory Submission QC
• HEOR
• SAR
Solutions & Applications in Healthcare
[Figure: healthcare data-flow diagram. Labels: Pathology, radiology, initial assessment, discharge, check up; Electronic Health Record; Enterprise Data Warehouse; Structured data; Patient characteristics; Patient lists; Care gap models; Potential adverse drug reactions; FDA drug labels; ClinicalTrials.gov; Matching clinical trials; Clinical case histories and/or genomic interpretation; Scientific literature]
Structured Data & its Evidential Basis
Scientific literature, social media, patents, news feeds, EHRs, internal reports, drug labels, clinical trials ...
... I2E can mine and extract with precision at scale
© Linguamatics 2015 - Confidential
Text Mining – a precursor to Big Data?
Copyright © Linguamatics 2014 - Confidential
• Unstructured data is just huge
• We can’t wait for those human db curators...
• Besides, those curators ignore my parameter...
• And all that text is just out there!
  − (see Google for details)
Multisource data Big data
• Lots of different types of data
  − Scientific literature
  − Medical records
  − Patents
  − Regulatory publications (clinical trials, drug labels, adverse event reporting…)
  − Internal reports
• Lots of different types of text
• In lots of different silos
Connected Data Technology
Single query across multiple data sources and network locations
Connected Data Technology
Copyright © Linguamatics 2014-2015 - Confidential
Connected Data Technology
Unified results for fast review and discovery of relationships across multiple data sources
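As a rough illustration of a single query returning unified results, here is a minimal Python sketch. It is not how I2E works internally: the `Hit` record, the `search_all` function, the in-memory `SOURCES` dictionary and its toy documents are all invented for this example, and a plain substring test stands in for a real query engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One result row, tagged with the source it came from."""
    source: str
    doc_id: str
    text: str


# Hypothetical stand-ins for separate silos (invented toy documents).
SOURCES = {
    "literature": {"lit-1": "p42 MAPK was found to phosphorylate Wee1."},
    "patents": {
        "pat-1": "A compound that inhibits Wee1 kinase.",
        "pat-2": "A tablet coating process.",
    },
    "internal_reports": {"rep-1": "No kinase activity was observed."},
}


def search_all(term, sources):
    """Run one case-insensitive substring query over every source
    and return a single list of hits sorted by source and document id."""
    needle = term.lower()
    hits = [
        Hit(name, doc_id, text)
        for name, docs in sources.items()
        for doc_id, text in docs.items()
        if needle in text.lower()
    ]
    return sorted(hits, key=lambda h: (h.source, h.doc_id))


if __name__ == "__main__":
    for hit in search_all("wee1", SOURCES):
        print(hit.source, hit.doc_id, hit.text)
```

Keeping the source name on every row is the point of the sketch: the reviewer sees one table, but each row still records which silo it came from.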
Huge (Textual) Data Big Data
We (i.e. text-miners…) are often joining data
− Unstructured
− And structured
− Across silos
Before the tabular results go to “analysis”
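The join described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `extracted` rows stand for the output of a text-mining step, `structured` stands for an existing table keyed by the same identifier, and the field names and values are invented.

```python
# Rows as a text-mining step might emit them (invented values).
extracted = [
    {"patient_id": "P1", "finding": "smoker"},
    {"patient_id": "P2", "finding": "non-smoker"},
    {"patient_id": "P4", "finding": "smoker"},
]

# An existing structured table, keyed by the same identifier (invented values).
structured = {
    "P1": {"age": 54},
    "P2": {"age": 61},
    "P3": {"age": 47},
}


def join_rows(extracted_rows, structured_table, key="patient_id"):
    """Inner join: keep only extracted rows whose key is in the structured table,
    and merge the structured fields into each kept row."""
    joined = []
    for row in extracted_rows:
        record = structured_table.get(row[key])
        if record is not None:
            joined.append({**row, **record})
    return joined


if __name__ == "__main__":
    for row in join_rows(extracted, structured):
        print(row)
```

Only P1 and P2 appear in both inputs, so only those two rows reach the tabular result that goes on to analysis.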
The How of Text Mining
Text Mining isn’t completely shrink-wrapped
There is usually some customization
− To find the parameter value that you’re interested in
− To find the value that everyone’s interested in, but only in circumstances c
− To find it in datasource X
− To find it in X but only in circumstances c
− To map to ontology A rather than B
It often makes sense to express these constraints at text-mining time (not at analysis time)
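One way to picture constraints expressed at extraction time is a query object that carries them. The sketch below is hypothetical: the `Query` class, the `extract` function, the toy `DOCS` and the "is/was N unit" pattern are assumptions for illustration, and ontology mapping is left out.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    parameter: str                 # the value of interest, e.g. "dose"
    context: Optional[str] = None  # circumstance c: a word the sentence must contain
    source: Optional[str] = None   # datasource X: restrict to one source


# Invented toy documents.
DOCS = [
    {"source": "labels",
     "text": "In adults the dose is 50 mg. In children the dose is 10 mg."},
    {"source": "reports", "text": "In adults the dose was 75 mg."},
]


def extract(docs, query):
    """Return (source, value, unit) rows that satisfy every constraint in the query."""
    value = re.compile(
        rf"{re.escape(query.parameter)}\s+(?:is|was)\s+(\d+)\s*(\w+)",
        re.IGNORECASE,
    )
    rows = []
    for doc in docs:
        if query.source and doc["source"] != query.source:
            continue
        # Naive sentence split on whitespace that follows a full stop.
        for sentence in re.split(r"(?<=\.)\s+", doc["text"]):
            if query.context and query.context.lower() not in sentence.lower():
                continue
            for number, unit in value.findall(sentence):
                rows.append((doc["source"], int(number), unit))
    return rows


if __name__ == "__main__":
    print(extract(DOCS, Query("dose")))
    print(extract(DOCS, Query("dose", context="children")))
    print(extract(DOCS, Query("dose", context="adults", source="reports")))
```

Because the context and source filters run before the value is extracted, the unwanted rows never reach the results table and do not need to be filtered out during analysis.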
Toolbox of Methods for Powerful Querying
NLP
• Precise linguistic relationships, sentence co-occurrence
• Precise negation e.g. “pressure” but not “blood pressure”
Terminologies
• Search for concepts and classes, not just keywords
• e.g. cancer and get synonyms and children: Malignant neoplasms, Malignant tumor …
Regular Expressions
• Rule-based pattern matching for e.g. measurements, lab codes, mutations
• e.g. microRNA: let-?\d+.* mirn?a?-?\d+.*
Chemistry
Fielded Search
• Restrict within particular regions of a document, including nested e.g. table cell in table in Description
High Throughput
• Simultaneous processing of large numbers of items e.g. 500 compounds, 500 genes from microarray experiment, etc.
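The two microRNA patterns from the toolbox can be tried out with Python's `re` module. This is a sketch under assumptions, not the product's behaviour: matching whole whitespace-separated tokens, ignoring case and stripping punctuation are choices made here, and the lookbehind used for “pressure” but not “blood pressure” is a plain regex stand-in for linguistic negation. The sample sentences are invented.

```python
import re

# The two patterns from the slide, combined; case-insensitivity is an assumption.
MICRORNA = re.compile(r"let-?\d+.*|mirn?a?-?\d+.*", re.IGNORECASE)

# Regex approximation of: "pressure" but not "blood pressure".
PRESSURE_NOT_BLOOD = re.compile(r"(?<!blood )pressure", re.IGNORECASE)


def find_micrornas(text):
    """Return tokens that match one of the microRNA patterns in full."""
    found = []
    for token in text.split():
        token = token.strip(".,;()")
        if MICRORNA.fullmatch(token):
            found.append(token)
    return found


def count_pressure_mentions(text):
    """Count 'pressure' mentions that are not preceded by 'blood '."""
    return len(PRESSURE_NOT_BLOOD.findall(text))


if __name__ == "__main__":
    print(find_micrornas("Expression of let-7a and miR-21 rose, but the letter mentions MIRN21 only."))
    print(count_pressure_mentions("Blood pressure was normal; cabin pressure dropped."))
```

Matching whole tokens matters because both patterns end in `.*`, which would otherwise swallow the rest of the line. The word "letter" is rejected because the pattern requires a digit after "let".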
Linguistic Processing Using NLP
• Interprets meaning of the text
• Groups words into meaningful units
• Search for different forms of words
Example sentences:
We find that p42mapk phosphorylates c-Myb on serine and threonine.
Purified recombinant p42 MAPK was found to phosphorylate Wee1.
Annotations on the example:
• Sentences
• Morphology: different forms
• Noun groups match entities
• Verb groups match actions
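To show what matching different word forms and entity variants buys, here is a deliberately simple Python sketch over the two example sentences. It uses one regular expression instead of real linguistic processing, so it is only an approximation: `PATTERN`, `extract_relations` and the normalisation of "p42 MAPK" to "p42mapk" are assumptions made for this example.

```python
import re

SENTENCES = [
    "We find that p42mapk phosphorylates c-Myb on serine and threonine.",
    "Purified recombinant p42 MAPK was found to phosphorylate Wee1.",
]

# Agent: "p42mapk" or "p42 MAPK"; verb: any listed inflection of
# "phosphorylate"; target: the token that follows the verb.
PATTERN = re.compile(
    r"(p42\s?mapk)\b.*?\b(phosphorylat(?:e|es|ed|ing))\s+([\w-]+)",
    re.IGNORECASE,
)


def extract_relations(sentences):
    """Return (agent, action, target) rows with agent and verb normalised."""
    rows = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            agent = re.sub(r"\s+", "", match.group(1)).lower()
            rows.append((agent, "phosphorylate", match.group(3)))
    return rows


if __name__ == "__main__":
    for row in extract_relations(SENTENCES):
        print(row)
```

Both sentences yield a row with the same agent and action even though the surface forms differ ("p42mapk phosphorylates" versus "p42 MAPK was found to phosphorylate"). A hand-written pattern like this does not scale across entities and verbs, which is the gap that noun-group and verb-group analysis is meant to fill.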
Discovering extraction patterns...
We often need to look at the data first (the “huge data”) to find the extraction patterns
Linguistic patterns of expression vary
− Over data sets
− Over time
This pre-extraction exploration itself needs informing
− By the ontologies and KBs that are already out there
− By the re-use of generally successful strategies
Innovative tools to enable exploration of complex and specialised data sets
Grant funded by InnovateUK (Dept of BIS and EPSRC)
Sponsored Partners: Univ. of Essex & Linguamatics
Project End-date: mid 2016
“…easier discovery and extraction of key facts”
− by sharing search strategies rather than sharing just search results
− by using novel algorithms for semantic information extraction
− by linking information from multiple resources to help users find similar and relevant information
Summary
Text Mining – the extraction of structured information from unstructured text
It’s a natural precursor to large scale analytics
It’s also a big data task itself
− Voluminous source data
− Distributed over many silos
− Expressed in different ways
It’s not just a precursor
− We’re (already) joining data at extraction time
− We’re researching how to exploit and join more data at the earliest phases of data exploration, prior to extraction
Thank You
For more information…
Visit: www.linguamatics.com