Clinical Text Analytics
Mei Liu, PhD
Division of Medical Informatics
University of Kansas Medical Center
What is text analytics and why does it matter in healthcare?
Major learning tasks in Natural Language Processing
Medical Terminology
Why and what
Clinical NLP tools
Applications of clinical NLP tools
Review on clinical information extraction
Cardiovascular medicine
Drug safety surveillance
Medical device safety surveillance
What is text analytics?
“The process of deriving high quality information
from text, by applying natural language processing
Structured clinical data
Information stored and displayed in a consistent, organized manner
Demographics, vitals, labs
A piece of structured data should consist of two parts: a variable name and a value
Ex. Weight: 130lb
Unstructured clinical data
Information documented that does not follow a particular format
Must be manually analyzed and interpreted
Narrative clinical notes
Procedure and op notes, progress notes, chief complaint, history of present illness, physical exam, assessment and plan, cardiology reports (echo, stress test, EKG), radiology reports, pathology reports, discharge summaries, consults
Imagine you are a cardiologist and a patient with chronic congestive heart failure walks into your office for an appointment. What would you ideally find in a quick review of the chart as you are walking into the room?
Current med list, current weight, current BP
Key events from most recent hospitalization, e.g. new cardiac events, discharge weight, new echo report with EF and wall motion, reason for decompensation
Latest ejection fraction and perhaps a graph of trend in EF over time
Current symptoms or complaints: weight gain, shortness of breath, peripheral edema
Now imagine you are the clinical director of a heart failure program and you are looking to assess performance in managing heart failure across your practice or health system's population and identify opportunities to target improvement efforts to a subpopulation
How are subgroups of patients with heart failure doing?
Ejection fraction
ACC heart failure stage
Hospital days
Mortality
Compared to practice patterns?
Medications, procedures, rehab
Practices, individual providers
Clinical text is important for research
Captures many complexities of patient encounters and outcomes that are underreported or absent in billing/diagnosis codes
May increase accuracy and lead time of signal detection
Clinical text analysis is challenging
Content varies across institutions
Requires the use of natural language processing (NLP) to mine the data
Clinical notes
Heterogeneous report structures
Telegraphic text formats
Abbreviations
Admit 10/23
71 yo woman h/o DM, HTN, Dilated CM/CHF, Afib s/p embolic event, chronic diarrhea, admitted with SOB. CXR pulm edema. Rx’d Lasix. All: none
Meds Lasix 40mg IVP bid, ASA, Coumadin 5, Prinivil 10, glucophage 850 bid, glipizide 10 bid, immodium prn
Hospitalist=Smith PMD=Name Full Code, Cx>101 Sign-out Notes
DM = Diabetes mellitus; HTN = Hypertension; CHF = Congestive heart failure; Afib = Atrial fibrillation; SOB = Shortness of breath; CXR = Chest X-ray
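The abbreviation legend above can be turned into a simple lookup table. The following is a minimal sketch, assuming a plain dictionary and regular-expression word boundaries are sufficient; the function name expand_abbreviations is illustrative, and real clinical text would also need disambiguation of abbreviations with several meanings.

```python
import re

# Glossary taken from the sign-out note legend; anything beyond it is out of scope here.
ABBREVIATIONS = {
    "DM": "diabetes mellitus",
    "HTN": "hypertension",
    "CHF": "congestive heart failure",
    "Afib": "atrial fibrillation",
    "SOB": "shortness of breath",
    "CXR": "chest X-ray",
}

# Word boundaries keep "DM" from matching inside longer tokens.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its long form (case-sensitive)."""
    return _PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1)], text)


note = "71 yo woman h/o DM, HTN, Dilated CM/CHF, Afib s/p embolic event, admitted with SOB."
expanded = expand_abbreviations(note)
print(expanded)
```

Abbreviations that are not in the table (h/o, s/p, CM) are left untouched, which shows why a fixed glossary alone is not enough for telegraphic notes.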
Named entity recognition – identify named text features
People, organizations, places, certain abbreviations, etc.
Word sense disambiguation – use contextual clues to determine the true meaning of the entity
Does “Ford” refer to a former US president, vehicle manufacturer, movie star, or other entity?
Co-reference resolution – identify noun phrases and other terms that refer to the same object
“Mary said she would help me …” – “she” and “Mary” refer to the same person
“I saw Scott yesterday. He was fishing by the lake.” – “Scott” and “he” refer to the same person
Part-of-speech tagging – given a sentence, determine the part of speech for each word
Ex. ‘book’ can be a noun or a verb
Parsing – determine the parse tree (grammatical analysis) of a given sentence
Sentence boundary disambiguation – given a chunk of text, find the sentence boundaries
Sentence boundaries are often marked by periods or other punctuation marks, but the same marks can also serve other purposes (e.g., marking abbreviations)
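To illustrate why sentence boundary disambiguation is harder than splitting on periods, here is a minimal rule-based sketch. It assumes whitespace tokenization and a small hand-picked abbreviation list (both are assumptions for illustration); production systems typically use much larger lists or trained models.

```python
# Illustrative abbreviation list; real systems need far more entries.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "vs.", "pt."}


def split_sentences(text: str) -> list[str]:
    """Split on ., ? or ! unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".?!" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Smith saw the patient. She reports chest pain, e.g. on exertion. Follow up in 2 weeks."
for sentence in split_sentences(text):
    print(sentence)
```

A naive split on every period would break this text after "Dr." and "e.g.", producing five fragments instead of three sentences.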
Relationship extraction – identify associations among entities and other information in text
“Patients who not only survive a warfarin-associated gastrointestinal tract bleeding (GIB) event but also have an ongoing risk for thromboembolism present 2 clinical dilemmas: whether and when to resume anticoagulation”
Sentiment analysis – discerning subjective material and extracting various forms of attitudinal information
Sentiment, opinion, mood, emotion
Can be analyzed at the entity, concept, or topic level
Automatic summarization – produce a readable summary of a chunk of text
Ex. Summary of the financial section of a newspaper
What should an NLP solution for healthcare look like?
Why Medical Terminology?
Standardized “Language of Medicine”
Allows all medical professionals to understand each other and communicate effectively
Medical dictionary of all diseases, drugs, procedures, findings, etc., and their relationships
Every year new terms are added to the vocabulary of medicine
Over 2.5 million medical terms in the English language
Many problems exist in medical terms:
Homonym problem
Synonym problem
Homonym Problem
Homonyms = same “name” describing different concepts
Cold – temperature or body temperature
cold – the common cold (disease)
COLD – Chronic Obstructive Lung Disease (disease)
Why is this a problem?
Medical professionals can infer the meaning from context
Computers CANNOT
Computers are bad at context
Solution = assign a unique “code number” to each term
By using different code numbers when sending data to a computer, misunderstandings can be avoided
Synonym Problem
Synonyms = different “names” describing the same disease
Diabetes mellitus
NIDDM – non-insulin-dependent diabetes mellitus
T2DM – type II diabetes
Why is this a problem?
A computer (or another doctor) might know only one of the terms, and not the one typed into it (or said to him/her)
Humans can clarify
Computers CANNOT
Solution
Assign a unique “code number” to “Diabetes Mellitus” and treat all other terms as synonyms of it
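The "unique code number" idea for both homonyms and synonyms can be sketched with a small lookup table. The codes X100 to X400 below are invented for illustration and do not come from any real terminology; the grouping of terms follows the slides.

```python
# Hypothetical concept table: codes are made up for this example.
CONCEPTS = {
    "X100": {"preferred": "Diabetes mellitus", "synonyms": {"NIDDM", "T2DM", "type II diabetes"}},
    "X200": {"preferred": "Common cold", "synonyms": {"cold"}},
    "X300": {"preferred": "Chronic obstructive lung disease", "synonyms": {"COLD"}},
    "X400": {"preferred": "Cold temperature", "synonyms": {"cold"}},
}


def build_index(concepts: dict) -> dict:
    """Map every lower-cased name to the set of concept codes it may denote."""
    index = {}
    for code, concept in concepts.items():
        for name in {concept["preferred"], *concept["synonyms"]}:
            index.setdefault(name.lower(), set()).add(code)
    return index


INDEX = build_index(CONCEPTS)


def lookup(term: str) -> list[str]:
    """Return all candidate codes for a term, sorted for stable output."""
    return sorted(INDEX.get(term.lower(), ()))


print(lookup("T2DM"))   # synonym: resolves to a single code
print(lookup("NIDDM"))  # another synonym: same code
print(lookup("cold"))   # homonym: several candidate codes
```

Synonyms collapse onto one code, while a homonym returns several candidates; choosing among them still requires context, which is the word sense disambiguation task described earlier.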
“Pneumonia” is a general term
More specified forms of Pneumonia include:
Bacterial pneumonia
Mycoplasma pneumonia
Aspiration pneumonia
Pneumocystis carinii pneumonia
Legionnaire’s disease
Streptococcus pneumonia is a kind of bacterial pneumonia, i.e. it is even more specific than bacterial pneumonia
A human understands that if somebody has mycoplasma pneumonia, he has pneumonia
A computer DOES NOT
A computer does not even know that pneumonia = Pneumonia
Note, one cannot rely on string matching
Legionnaire’s disease does not contain “Pneumonia” in the term!
Solution
Create a direct link between a specific term and a general term (in Computer Science terms, a pointer)
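The specific-to-general link can be sketched as a parent pointer per term. This minimal example assumes each term has exactly one parent, which keeps the code short; real terminologies usually allow several parents per concept.

```python
# Parent pointers for the pneumonia example; single-parent hierarchy is an assumption.
PARENT = {
    "Bacterial pneumonia": "Pneumonia",
    "Mycoplasma pneumonia": "Pneumonia",
    "Aspiration pneumonia": "Pneumonia",
    "Pneumocystis carinii pneumonia": "Pneumonia",
    "Legionnaire's disease": "Pneumonia",
    "Streptococcus pneumonia": "Bacterial pneumonia",
}


def is_a(term: str, ancestor: str) -> bool:
    """Follow parent pointers upward until the ancestor or the root is reached."""
    current = term
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


print(is_a("Legionnaire's disease", "Pneumonia"))
print(is_a("Streptococcus pneumonia", "Pneumonia"))
print(is_a("Pneumonia", "Bacterial pneumonia"))
```

The lookup works for Legionnaire's disease even though the string "pneumonia" never appears in the term, which is exactly what string matching cannot do.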
Provide formal and machine-computable representations of medical knowledge and data
Such representation can facilitate interoperability, dissemination, decision support, research
Terminologies are formal representations of entities and their interrelationships
Embodied as terms, concepts, linkages
Terms are evocative words or phrases
Concepts are the cognitive representation of entities or meanings (idea)
Linkages are explicitly defined relationships
Medical Terminology
ICD (International Classification of Diseases)
Used to define and report diseases and health conditions (ICD-9, ICD-10)
CPT (Current Procedural Terminology)
Used to report medical, surgical, and diagnostic procedures and services
SNOMED-CT
LOINC
Standard for identifying medical laboratory findings
RxNorm
Provides normalized names for clinical drugs
UMLS (Unified Medical Language System)
Collection of terminologies in which each concept is unique
SNOMED CT
Multilingual clinical healthcare terminology created in 1999 and maintained by an international non-profit standards development organization in London, UK
Primary purpose is to encode meanings that are used in health information and to support effective clinical recording of data with the aim of improving patient care
Coverage includes
Clinical findings, symptoms, diagnoses, procedures, body structures, organisms and other etiologies, substances, pharmaceuticals, devices, and specimens
Can cross-map to other international standards and classifications
Language-specific editions are available that augment the international edition with translations and additional terms unique to a country
LOINC (Logical Observation Identifiers Names and Codes)
Universal standard for identifying medical laboratory observations, developed in 1994 and maintained by the Regenstrief Institute
Purpose is to assist in the electronic exchange and gathering of clinical results such as laboratory tests, clinical observations, outcomes management, and research
Two main parts: laboratory LOINC and clinical LOINC
Several standards such as IHE and HL7 use LOINC to electronically transfer results from different reporting systems to the appropriate healthcare networks
Format: unique code “nnnnn-n” for each entry
Component – what is measured, evaluated, or observed (e.g. urea, blood, …)
Property – characteristics of how it is being measured (e.g. length, mass, volume, …)
Timing – interval of time over which the observation or measurement was made
System – context or specimen type within which the observation was made
Scale – how the test result is expressed
Method (optional) – what method was used to make the measurement
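The code format and the six name parts can be modeled as a small record. This is a sketch, not an official LOINC client: the class and field names are illustrative, the regular expression accepts one to five digits before the check digit (an assumption, since many published codes are shorter than "nnnnn-n"), and the glucose example values are given to the best of my knowledge and should be verified against the LOINC database.

```python
import re
from dataclasses import dataclass
from typing import Optional

# One to five digits, a hyphen, and a single check digit (check digit is not verified here).
LOINC_CODE = re.compile(r"^\d{1,5}-\d$")


def is_valid_code(code: str) -> bool:
    """Check only the surface format of a LOINC-style code."""
    return bool(LOINC_CODE.match(code))


@dataclass(frozen=True)
class LoincEntry:
    code: str
    component: str          # what is measured
    measured_property: str  # characteristic being measured
    timing: str             # interval of the observation
    system: str             # specimen or context
    scale: str              # how the result is expressed
    method: Optional[str] = None

    def fully_specified_name(self) -> str:
        """Join the parts with colons, omitting the optional method if absent."""
        parts = [self.component, self.measured_property, self.timing, self.system, self.scale]
        if self.method:
            parts.append(self.method)
        return ":".join(parts)


# Example values believed to describe serum/plasma glucose; verify before relying on them.
glucose = LoincEntry("2345-7", "Glucose", "MCnc", "Pt", "Ser/Plas", "Qn")
print(glucose.code, glucose.fully_specified_name())
```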
RxNorm
US-specific terminology, maintained by the National Library of Medicine (NLM), that contains all medications available on the US market
Provides normalized names for clinical drugs
Distinguishes different types of drug concepts and has concepts for drug ingredients and dose forms
NLM provides six APIs related to RxNorm
RxMix – web application that allows users to access the RxNorm APIs without writing their own programs
UMLS (Unified Medical Language System)
Compendium of many controlled vocabularies in the biomedical sciences created in 1986 and maintained by the National Library of Medicine (NLM)
Provides mapping structure among different vocabularies and thus allows one to translate among the various terminology systems
May also be viewed as a comprehensive thesaurus and ontology of biomedical concepts
Intended to be used mainly by developers of systems in medical informatics
Three UMLS Knowledge Sources
Metathesaurus = terms and codes from many vocabularies including CPT, ICD, LOINC, RxNorm, SNOMED CT, etc.
Semantic Network = broad categories (semantic types) and their relationships (semantic relations)
SPECIALIST Lexicon and Tools = large syntactic lexicon of biomedical and general English and tools for normalizing strings, generating lexical variants, and creating indexes
Can use UMLS to:
Link terms and codes between doctor, pharmacy, and insurance company
Coordinate patient care among several departments within a hospital
Process texts to extract concepts, relationships, or knowledge
Facilitate mapping between terminologies
Develop an information retrieval system
Extract specific terminologies from the Metathesaurus
Create and maintain a local terminology
Develop a terminology service
Research terminologies or ontologies
UMLS
UMLS can be used to support
Information retrieval
Natural language processing
Automated indexing
Text mining
Public health statistics reporting
Terminology research
Electronic medical record analysis
Traditional Approach – manual chart review
Reliable
Slow
Costly
Emerging approach – informatics methods such as Natural Language Processing (NLP), Machine Learning, and Data Mining
Fast and scalable
Performance may be questionable
The medical informatics (MI) community has invested much effort in developing methods to abstract relevant information from clinical narratives
Types of NLP tools
Rule-based vs Machine learning based
General Development Frameworks
Apache Unstructured Information Management Architecture (UIMA) – Java framework for developing NLP pipelines
General Architecture for Text Engineering (GATE) – Java framework for developing NLP pipelines
Natural Language Toolkit (NLTK) – Python library for developing NLP applications
cTAKES – built on top of Apache UIMA
HITEX – rule-based NLP pipeline based on the GATE framework
ClearTK – framework for developing statistical NLP components on top of Apache UIMA
NegEx – detects negated terms in clinical text
ConText – extension to NegEx that also finds temporality (recent, historical, or hypothetical scenarios) and who the experiencer is (patient or other)
MetaMap – comprehensive concept tagging system built on top of UMLS
MedEx – recognizes medication names, dose, frequency, route, duration
SecTag – recognizes note section headers using NLP, Bayesian, spelling correction, and scoring techniques
Stanford CoreNLP – integrated suite of NLP tools including tokenization,
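As a rough illustration of the negation detection idea behind tools such as NegEx, the sketch below looks for a trigger phrase within a few words before the concept. It is a simplified stand-in, not the NegEx algorithm itself: the trigger list, the five-word window, and the function name is_negated are all assumptions made for this example.

```python
import re

# Tiny illustrative trigger list; the real NegEx lexicon is much larger.
NEGATION_TRIGGERS = ["no", "denies", "without", "negative for"]
WINDOW = 5  # number of words inspected before the concept (assumed value)


def is_negated(sentence: str, concept: str) -> bool:
    """Return True if a negation trigger occurs shortly before the concept."""
    text = sentence.lower()
    position = text.find(concept.lower())
    if position == -1:
        raise ValueError(f"concept {concept!r} not found in sentence")
    preceding_words = re.findall(r"[a-z]+", text[:position])[-WINDOW:]
    window = " ".join(preceding_words)
    return any(
        re.search(r"\b" + re.escape(trigger) + r"\b", window)
        for trigger in NEGATION_TRIGGERS
    )


print(is_negated("Patient denies chest pain or shortness of breath.", "shortness of breath"))
print(is_negated("No evidence of pneumonia.", "pneumonia"))
print(is_negated("Patient reports chest pain.", "chest pain"))
```

This sketch ignores scope terminators (e.g. "but") and triggers that follow the concept, both of which a full rule set would need to handle.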
cTAKES: clinical Text Analysis and Knowledge Extraction System
Open source NLP system developed at Mayo Clinic in 2006 by Dr. Guergana Savova and Dr. Christopher Chute
Reads through plain text notes, extracts concepts, and transforms them into structured and normalized information
Processes clinical notes to identify clinical named entities
Drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures
Each named entity has attributes for
Text span
Ontology mapping code
Context, e.g. family history of, current, unrelated to patient
Negated/not negated
Components are specifically trained for the clinical domain
Named section identifier
Sentence boundary detector
Rule-based tokenizer
Formatted list identifier
Normalizer
Context dependent tokenizer
Part-of-speech tagger
Phrasal chunker
Dictionary lookup annotator
Context annotator
Negation detector
Uncertainty detector
Subject detector
Dependency parser
Patient smoking status identifier
Produces the most commonly desired output from cTAKES, including
Annotations for anatomical sites, signs/symptoms, procedures, diseases and medications
For each annotation, there are normalized UMLS CUIs, plus values for negation, uncertainty and subject
Harness unstructured information by allowing i2b2 users to query and join that information with existing i2b2 concepts
Currently, an entire note is commonly stored as a single row in the observation_blob field of the observation_fact table in i2b2
One of cTAKES's features is extracting concepts from text and transforming them into structured information
Format the output of cTAKES into the i2b2 observation_fact table format, e.g. facts, concepts, modifiers, and values
Add an ‘NLP’ ontology in i2b2 that contains all concepts extracted from text
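The reshaping step can be sketched as a function that turns extracted annotations into observation_fact-style rows. This is only an illustration under several assumptions: the input dictionaries are a hypothetical simplification of cTAKES output, the "NLP|CUI:" and "NLP|NEGATED" code prefixes are invented, and only a handful of observation_fact columns are populated. The CUIs shown are believed to denote pneumonia and chest pain but should be checked against UMLS.

```python
from datetime import date


def annotations_to_facts(patient_num, encounter_num, note_date, annotations):
    """Turn NLP annotations from one note into observation_fact-style rows."""
    facts = []
    for instance_num, ann in enumerate(annotations, start=1):
        base = {
            "patient_num": patient_num,
            "encounter_num": encounter_num,
            "start_date": note_date,
            "instance_num": instance_num,
            "concept_cd": "NLP|CUI:" + ann["cui"],  # hypothetical code scheme
        }
        # Main fact: the concept itself, with the matched text as its value.
        facts.append({**base, "modifier_cd": "@", "valtype_cd": "T", "tval_char": ann["text"]})
        # Modifier fact: recorded only when the mention is negated.
        if ann["negated"]:
            facts.append({**base, "modifier_cd": "NLP|NEGATED", "valtype_cd": "T", "tval_char": "true"})
    return facts


# Hypothetical annotations for one note.
annotations = [
    {"cui": "C0032285", "text": "pneumonia", "negated": True},
    {"cui": "C0008031", "text": "chest pain", "negated": False},
]
facts = annotations_to_facts(1001, 5001, date(2020, 1, 15), annotations)
for row in facts:
    print(row)
```

Storing negation as a modifier row rather than as a separate concept keeps one concept code per CUI, so queries can include or exclude negated mentions as needed.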
CLAMP: Clinical Language Annotation, Modeling, and Processing Toolkit
Comprehensive clinical NLP software that enables recognition and automatic encoding of clinical information in narrative clinical notes
High performance – components are built on proven methods in many clinical NLP challenges
Customizable – one can choose among various NLP and machine learning components
Enterprise features
Users can import clinical text corpora into CLAMP and annotate files using the built-in annotation tool
Demo at https://clamp.uth.edu/clampdemo.php
Tokenization – convert sentences to words
Removing unnecessary punctuation, tags
Removing stop words – frequent words such as “the”, “is”, …
Stemming – words are reduced to a root by removing inflection, i.e. dropping unnecessary characters such as suffixes
Lemmatization – remove inflection by determining the part of speech, e.g. studying → study
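The preprocessing steps above can be chained in a few lines. This is a minimal sketch with a hand-picked stop word list and a deliberately crude suffix stripper (both assumptions); it is not the Porter stemmer or a real lemmatizer, and it shows how stemming can produce roots that are not dictionary words.

```python
import re

# Small illustrative stop word list; real lists contain hundreds of words.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "was", "with"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep runs of letters or digits, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    return [token for token in tokens if token not in STOP_WORDS]


def crude_stem(word: str) -> str:
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    return [crude_stem(token) for token in remove_stop_words(tokenize(text))]


print(preprocess("The patient was admitted with worsening shortness of breath."))
print(crude_stem("studying"))
```

The stemmer turns "admitted" into "admitt" and "shortness" into "shortnes", which is why lemmatization, which uses part of speech and a vocabulary, is often preferred when readable roots matter.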
Text Feature Extraction
Mapping textual data to real-valued vectors for machine learning algorithms
One of the simplest techniques is Bag of Words (BOW)
Each word in BOW can be represented either as 1 for present or 0 for absent, or as the number of times the word appears in a document
Term Frequency-Inverse Document Frequency (TF-IDF)
TF = number of times term t appears in a document / number of terms in the document
IDF = log (N/n) where N is the number of documents and n is the number of documents a term t has appeared in
TF-IDF = TF * IDF
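The count-based BOW vector and the TF-IDF formulas above can be computed directly. This sketch assumes pre-tokenized, non-empty documents and uses the natural logarithm for IDF (the slides do not specify a base); the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def tf_idf(documents: list[list[str]]) -> list[dict]:
    """Compute TF * IDF per term and document, with IDF = log(N / n)."""
    n_docs = len(documents)
    # n: number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus of three tokenized "notes".
docs = [
    ["chest", "pain", "chest"],
    ["no", "chest", "pain"],
    ["shortness", "of", "breath"],
]
vocabulary = sorted({term for doc in docs for term in doc})
print(vocabulary)
print(bag_of_words(docs[0], vocabulary))
print(tf_idf(docs)[0])
```

A term that appears in every document gets IDF = log(1) = 0, so it contributes nothing to the score regardless of how often it occurs.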
Limitation of BOW
Word Embedding – words with similar meanings receive similar representations
Word2Vec – takes text input and produces a vector space with each unique word being assigned a corresponding vector in the space
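"Similar representation" is usually measured with cosine similarity between word vectors. The sketch below uses hand-made three-dimensional vectors purely for illustration; they are not the output of Word2Vec or any trained model, where vectors typically have hundreds of dimensions learned from a corpus.

```python
import math

# Toy vectors invented for this example, not learned embeddings.
TOY_EMBEDDINGS = {
    "dyspnea": [0.90, 0.10, 0.20],
    "breathlessness": [0.85, 0.15, 0.25],
    "warfarin": [0.10, 0.90, 0.30],
}


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


similar = cosine_similarity(TOY_EMBEDDINGS["dyspnea"], TOY_EMBEDDINGS["breathlessness"])
different = cosine_similarity(TOY_EMBEDDINGS["dyspnea"], TOY_EMBEDDINGS["warfarin"])
print(round(similar, 3), round(different, 3))
```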
NLP task – Named Entity Recognition
Identify clinical syndromes and common biomedical concepts from various types of notes
Clinical Information Extraction
Most frequently used clinical information extraction tools
cTAKES (n = 26)
MetaMap (n = 12)
MedLEE (n = 10)
Most frequently used machine learning methods
Application areas of clinical information extraction and corresponding number of publications
The 21st Century Cures Act of 2016 required the FDA to create a pathway to allow real-world evidence (RWE) to support new drug indications and post-marketing surveillance starting in 2018
Objective: determine whether traditional RWE in cardiovascular medicine achieves accuracy sufficient for credible clinical assertion, aka “regulatory-grade” RWE
Method: extracted a predefined set of clinical concepts from EHR structured (EHR-S) and unstructured (EHR-U) data and evaluated against manually annotated cohorts
Dataset: 10,840 clinical notes drawn randomly
Outcome: regulatory-grade or not for clinical phenotyping in cardiovascular medicine
Recall > 85% and precision > 90%
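The regulatory-grade thresholds can be checked with a short precision and recall calculation. This sketch assumes the extracted and manually annotated cohorts are available as sets of patient identifiers; the example sets are made up and do not reproduce the study's data.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Precision = TP / predicted positives; recall = TP / gold positives."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def is_regulatory_grade(precision: float, recall: float) -> bool:
    """Thresholds as stated on the slide: recall > 85% and precision > 90%."""
    return recall > 0.85 and precision > 0.90


# Hypothetical cohorts: 20 annotated patients, 19 extracted (18 correct, 1 spurious).
gold = set(range(1, 21))
predicted = set(range(1, 19)) | {99}
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"regulatory_grade={is_regulatory_grade(precision, recall)}")
```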
Cardiovascular Medicine Phenotyping
High-level NLP pipeline for information extraction from clinical notes
Text Extraction – extract natural language text
Section Detection – used SecTag to identify the correct section to add context in concept interpretation, e.g. medical history section
Information Extraction and Tagging – ANNIE from GATE NLP pipeline
Removal of special characters, tokenization, sentence splitter, POS tagger, named entity recognition and negation and subject tagging
Concept Tagging – Normalize identified information to known concepts in medical terminologies including SNOMED-CT, RxNorm, LOINC
Hernandez-Boussard T. et al. Real world evidence in cardiovascular medicine: ensuring data validity in electronic health record-based studies. JAMIA, 26(11):1189-1194, 2019.
Conclusion:
Recall varied greatly between EHR-S and EHR-U
EHR-S did not meet regulatory-grade criteria (recall > 85% and precision > 90%)
Objective: Will adding EHR data to the FDA's Adverse Event Reporting System (AERS) improve signal detection accuracy?
Dataset: 4 million AERS reports + 1.2 million EHR narratives
Drug Safety Surveillance
Performance comparison based on the precision at K statistic for different values of K (number of signals selected)
Error bars reflect 95% CIs
Clinical text improved accuracy of signal detection significantly
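Precision at K is the fraction of the K highest-ranked signals that are true signals. The sketch below assumes the signals arrive already ranked from strongest to weakest; the signal labels and the reference set are placeholders, not data from the study.

```python
def precision_at_k(ranked_signals: list, true_signals: set, k: int) -> float:
    """Fraction of the top-k ranked signals that appear in the reference set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_signals[:k]
    hits = sum(1 for signal in top_k if signal in true_signals)
    return hits / k


# Placeholder drug-event signals, ranked by a hypothetical detection score.
ranked = ["A", "B", "C", "D", "E"]
reference = {"A", "C", "E"}
for k in (1, 2, 3, 5):
    print(k, round(precision_at_k(ranked, reference, k), 3))
```

Dividing by k rather than by the number of items returned penalizes a method that cannot supply k candidates, which is the usual convention for this statistic.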
Processes clinical text and produces a patient-feature matrix encoded using
Medical devices require post-market surveillance to assess the implants’ safety and efficacy
Pacemakers, joint replacements, breast implants, insulin pumps, spinal cord stimulators, etc.
Device surveillance in the US relies primarily on spontaneous reporting systems as a means to document adverse events reported by physicians and providers
Device-related adverse events are significantly underreported
Estimated as little as 0.5% of adverse event reports received by the FDA concern medical devices
Evidence extracted from clinical notes can enable device surveillance
Applied deep learning methods to identify reports of hip joint implant related complications and pain from clinical notes
Combined structured and unstructured data to characterize hip implant performance in the real world
Dataset: 6583 patients with hip replacement
55.6% female
Average age at surgery of 63
Average follow-up time after replacement of 5.3 years
386 (5.8%) had a coded record of at least one revision surgery
Average age at primary replacement surgery was 57.9 years
Average follow-up time was 10.5 years
3 entity/event types
Implant system entities identified by a manufacturer and/or model name, e.g. “Zimmer VerSys”
Implant-related complications, e.g. “infected left hip prosthetic”
Patient-reported pain at a specific anatomical location, e.g. “left hip tenderness”
Performance of the machine learning methods on entity and relation extraction
Medical Device Surveillance
AUPRC for Implant-Complication classifier performance at different training set sizes
Negative binomial model-derived incidence rate ratios (IRRs) for hip pain mentions
6 systems, all manufactured by Depuy, had IRRs <1, indicating that they are associated with lower rates of hip pain mentions relative to the Zimmer Biomet Trilogy + VerSys reference system when controlling for patient demographics and pain mentions in the prior year
4 systems (3 Zimmer Biomet, 1 Depuy) have IRRs >1, indicating that they are