Clinical Text Analytics

(1)


Mei Liu, PhD

Division of Medical Informatics

University of Kansas Medical Center

(2)



What is text analytics and why does it matter in healthcare?



Major learning tasks in Natural Language Processing



Medical Terminology

 Why and what



Clinical NLP tools



Applications of clinical NLP tools

 Review on clinical information extraction

 Cardiovascular medicine

 Drug safety surveillance

 Medical device safety surveillance

(3)

What is text analytics?

“The process of deriving high quality information

from text, by applying natural language processing

(4)



Structured clinical data

 Information stored and displayed in a consistent, organized manner

 Demographics, vitals, labs

 A piece of structured data should consist of two parts: a variable name and a value

 Ex. Weight: 130lb



Unstructured clinical data

 Information documented that does not follow a particular format

 Must be manually analyzed and interpreted

 Narrative clinical notes

 Procedure and op notes, progress notes, chief complaint, history of present illness, physical exam, assessment and plan, cardiology reports (echo, stress test, EKG), radiology reports, pathology reports, discharge summaries, consults

(5)



Imagine you are a cardiologist and a patient with chronic congestive heart failure walks into your office for an appointment. What would you ideally find in a quick review of the chart as you are walking into the room?

 Current med list, current weight, current BP

 Key events from most recent hospitalization, e.g. new cardiac events, discharge weight, new echo report with EF and wall motion, reason for decompensation

 Latest ejection fraction and perhaps a graph of trend in EF over time

 Current symptoms or complaints: weight gain, shortness of breath, peripheral edema


(7)



Now imagine you are the clinical director of a heart failure program and you are looking to assess performance in managing heart failure across your practice or health system's population and identify opportunities to target improvement efforts to a subpopulation

 How are subgroups of patients with heart failure doing?

 Ejection fraction

 ACC heart failure stage

 Hospital days

 Mortality

 Compared to practice patterns?

 Medications, procedures, rehab

 Practices, individual providers


(9)



Clinical text is important for research

 Captures many complexities of patient encounters and outcomes that are underreported or absent in billing/diagnosis codes

 May increase accuracy and lead time of signal detection



Clinical text analysis is challenging

 Content varies across institutions

 Requires the use of natural language processing (NLP) to mine the data

 Clinical notes

 Heterogeneous report structures

 Telegraphic text formats

 Abbreviations

(10)

Admit 10/23

71 yo woman h/o DM, HTN, Dilated CM/CHF, Afib s/p embolic event, chronic diarrhea, admitted with SOB. CXR pulm edema. Rx’d Lasix. All: none

Meds Lasix 40mg IVP bid, ASA, Coumadin 5, Prinivil 10, glucophage 850 bid, glipizide 10 bid, immodium prn

Hospitalist=Smith PMD=Name Full Code, Cx>101 Sign-out Notes

DM = Diabetes mellitus; HTN = Hypertension; CHF = Congestive heart failure; Afib = Atrial fibrillation; SOB = Shortness of breath; CXR = Chest X-ray

(11)



Named entity recognition – identify named text features

 People, organizations, places, certain abbreviations, etc.



Word sense disambiguation – use contextual clues to determine the true meaning of the entity

 Does “Ford” refer to a former US president, vehicle manufacturer, movie star, or other entity?



Co-reference resolution – identify noun phrases and other terms that refer to the same object

 “Mary said she would help me …” – “she” and “Mary” refer to the same person

 “I saw Scott yesterday. He was fishing by the lake.” – “Scott” and “he” refer to the same person

(12)

 Part-of-speech tagging – given a sentence, determine the part of speech for each word

 Ex. ‘book’ can be a noun or a verb

 Parsing – determine the parse tree (grammatical analysis) of a given sentence

 Sentence boundary disambiguation – given a chunk of text, find the sentence boundaries

 Sentence boundaries are often marked by periods or other punctuation marks, but these marks can also serve other purposes (e.g., marking abbreviations)
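The ambiguity of the period can be illustrated with a minimal Python sketch. It assumes a tiny, hypothetical abbreviation list (`ABBREVIATIONS`) and a whitespace tokenizer. It is not how any particular clinical tool works; real systems rely on larger lexicons or trained models.

```python
import re

# Hypothetical, deliberately small abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "vs."}


def split_sentences(text: str) -> list[str]:
    """Split text after '.', '!' or '?' unless the token is a known abbreviation."""
    sentences: list[str] = []
    current: list[str] = []
    for token in text.split():
        current.append(token)
        ends_with_punct = re.search(r"[.!?]$", token) is not None
        if ends_with_punct and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith saw the patient. She was admitted with SOB."))
```

Without the abbreviation check, the period in "Dr." would be treated as a sentence boundary.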

(13)



Relationship extraction – identify associations among entities and other information in text

“Patients who not only survive a warfarin-associated gastrointestinal tract bleeding (GIB) event but also have an ongoing risk for thromboembolism present 2 clinical dilemmas: whether and when to resume anticoagulation”



Sentiment analysis – discerning subjective material and extracting various forms of attitudinal information

 Sentiment, opinion, mood, emotion

 Can be analyzed at the entity, concept, or topic level



Automatic summarization – produce a readable summary of a chunk of text.

 Ex. Summary of the financial section of a newspaper

(14)

What should an NLP solution for healthcare look like?

(15)

(16)

Why Medical Terminology?



Standardized “Language of Medicine”

 Allows all medical professionals to understand each other and communicate effectively



Medical dictionary of all diseases, drugs, procedures, findings,

etc., and their relationships

 Every year new terms are added to the vocabulary of medicine

 Over 2.5 million medical terms in the English language



Many problems exist in medical terms:

 Homonym problem

 Synonym problem

(17)

Homonym Problem



Homonyms = the same “name” describing different concepts

 Cold – temperature or body temperature

 cold – the common cold (disease)

 COLD – Chronic Obstructive Lung Disease (disease)



Why is this a problem?

 Medical professionals can infer the meaning from context

 Computers CANNOT

 Computers are bad at context



Solution = assign a unique “code number” to each term

 By using different code numbers when sending data to a computer, misunderstandings can be avoided

(18)

Synonym Problem



Synonyms = different “names” describing the same disease

 Diabetes mellitus

 NIDDM – non-insulin-dependent diabetes mellitus

 T2DM – type II diabetes



Why is this a problem?

 A computer (or another doctor) might only know one of the terms, not the term typed to it (or said to him/her)

 Humans can clarify

 Computers CANNOT



Solution

 Assign a unique “code number” to “Diabetes Mellitus” and treat all other terms as synonyms of it
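The "code number" idea for both homonyms and synonyms can be sketched in Python as a lookup table from terms to concept codes. The codes below (`D001`, `T001`, and so on) are invented for illustration and are not taken from any real terminology. The term groupings follow the slides.

```python
# Hypothetical concept codes, made up for this sketch only.
TERM_TO_CODES: dict[str, set[str]] = {
    # Synonyms: several names map to the same single code.
    "diabetes mellitus": {"D001"},
    "niddm": {"D001"},
    "t2dm": {"D001"},
    # Homonym: one name maps to several codes; the sender must pick one.
    "cold": {"T001", "D002", "D003"},  # temperature, common cold, COLD
}


def lookup(term: str) -> set[str]:
    """Return every concept code a term may refer to (empty set if unknown)."""
    return TERM_TO_CODES.get(term.lower(), set())


def is_ambiguous(term: str) -> bool:
    """A term is ambiguous when it maps to more than one concept code."""
    return len(lookup(term)) > 1


print(lookup("NIDDM") == lookup("T2DM"))  # synonyms resolve to the same code
print(is_ambiguous("cold"))               # homonym needs an explicit code
```

Once data is exchanged as codes rather than free-text names, two systems agree on the meaning even if they display different names.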

(19)



“Pneumonia” is a general term



More specified forms of Pneumonia include:

 Bacterial pneumonia

 Mycoplasma pneumonia

 Aspiration pneumonia

 Pneumocystis carinii pneumonia

 Legionnaire’s disease



Streptococcus pneumonia is a kind of bacterial pneumonia, i.e. it is even more specific than bacterial pneumonia

(20)



A human understands that if somebody has mycoplasma pneumonia, he or she has pneumonia

A computer DOES NOT

A computer does not even know that pneumonia = Pneumonia



Note, one cannot rely on string matching

 Legionnaire’s disease does not contain “Pneumonia” in the term!



Solution

 Create a direct link between a specific term and a general term (in Computer Science terms, a pointer)
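A minimal Python sketch of such links, using the pneumonia terms from the previous slide. It assumes each term has at most one parent, which is a simplification; real terminologies often allow several parents per concept.

```python
# Each specific term points to its more general term (an "is-a" link).
# Single-parent structure is an assumption made to keep the sketch short.
PARENT: dict[str, str] = {
    "streptococcus pneumonia": "bacterial pneumonia",
    "bacterial pneumonia": "pneumonia",
    "mycoplasma pneumonia": "pneumonia",
    "aspiration pneumonia": "pneumonia",
    "pneumocystis carinii pneumonia": "pneumonia",
    "legionnaire's disease": "pneumonia",
}


def is_a(term: str, ancestor: str) -> bool:
    """Follow parent links upward and report whether `ancestor` is reached."""
    current = term.lower()
    target = ancestor.lower()
    while current is not None:
        if current == target:
            return True
        current = PARENT.get(current)  # None once the top of the hierarchy is passed
    return False


print(is_a("Legionnaire's disease", "Pneumonia"))
print(is_a("Pneumonia", "Bacterial pneumonia"))
```

The lookup succeeds for "Legionnaire's disease" even though the string "pneumonia" never appears in it, which is exactly what string matching cannot do.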

(21)



Provide formal and machine-computable representations of medical knowledge and data

Such representations can facilitate interoperability, dissemination, decision support, and research

Terminologies are formal representations of entities and their interrelationships

 Embodied as terms, concepts, linkages

 Terms are evocative words or phrases

 Concepts are the cognitive representation of entities or meanings (ideas)

 Linkages are explicitly defined relationships

Medical Terminology

(22)

 ICD (International Classification of Diseases)

 Used to define and report diseases and health conditions (ICD-9, ICD-10)

 CPT (Current Procedural Terminology)

 Used to report medical, surgical, and diagnostic procedures and services

 SNOMED-CT

 LOINC

 Standard for identifying medical laboratory findings

 RxNorm

 Provides normalized names for clinical drugs

 UMLS (Unified Medical Language System)

 Terminology collection and concepts are unique

(23)



SNOMED CT

 Multilingual clinical healthcare terminology created in 1999 and maintained by an international non-profit standards development organization in London, UK

 Primary purpose is to encode meanings that are used in health information and to support effective clinical recording of data with the aim of improving patient care

 Coverage includes

 Clinical findings, symptoms, diagnoses, procedures, body structures, organisms and other etiologies, substances, pharmaceuticals, devices, and specimens

 Can cross-map to other international standards and classifications

 Language-specific editions are available that augment the international edition with translations and additional terms unique to a country

(24)

(25)

 LOINC (Logical Observation Identifiers Names and Codes)

 Universal standard for identifying medical laboratory observations, developed in 1994 and maintained by the Regenstrief Institute

 Purpose is to assist in the electronic exchange and gathering of clinical results such as laboratory tests, clinical observations, outcomes management, and research

 Two main parts: laboratory LOINC and clinical LOINC

 Several standards such as IHE and HL7 use LOINC to electronically transfer results from different reporting systems to the appropriate healthcare networks

 Format: unique code “nnnnn-n” for each entry

 Component – what is measured, evaluated, or observed (e.g. urea, blood, …)

 Property – characteristics of how it is being measured (e.g. length, mass, volume, …)

 Timing – interval of time over which the observation or measurement was made

 System – context or specimen type within which the observation was made

 Scale – how the test result is expressed

 Method (optional) – what method was used to make the measurement
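The six parts above can be sketched as a small Python data structure. The sketch assumes the parts arrive as a colon-separated string in the order listed. The class name `LoincName`, the function name, and the example string (modeled on a glucose serum/plasma mass-concentration test) are illustrative assumptions, not an official API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoincName:
    component: str              # what is measured
    measured_property: str      # characteristic being measured
    timing: str                 # time interval of the observation
    system: str                 # specimen type or context
    scale: str                  # how the result is expressed
    method: Optional[str] = None  # optional sixth part


def parse_loinc_name(name: str) -> LoincName:
    """Split a colon-separated name into five required parts plus an optional method."""
    parts = name.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts, got {len(parts)}")
    method = parts[5] if len(parts) == 6 and parts[5] else None
    return LoincName(*parts[:5], method)


# Assumed example string for illustration.
print(parse_loinc_name("Glucose:MCnc:Pt:Ser/Plas:Qn"))
```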

(26)



RxNorm

 US-specific terminology in medicine that contains all medications available on the US market maintained by the National Library of Medicine (NLM)

 Provides normalized names for clinical drugs

 Distinguishes different types of drug concepts and has concepts for drug ingredients or dose forms

 NLM provides six APIs related to RxNorm

 RxMix – web application that lets users access the RxNorm APIs without writing their own programs

(27)

(28)

 UMLS (Unified Medical Language System)

 Compendium of many controlled vocabularies in the biomedical sciences created in 1986 and maintained by the National Library of Medicine (NLM)

 Provides mapping structure among different vocabularies and thus allows one to translate among the various terminology systems

 May also be viewed as a comprehensive thesaurus and ontology of biomedical concepts

 Intended to be used mainly by developers of systems in medical informatics

 Three UMLS Knowledge Sources

 Metathesaurus = terms and codes from many vocabularies including CPT, ICD, LOINC, RxNorm, SNOMED CT, etc.

 Semantic Network = broad categories (semantic types) and their relationships (semantic relations)

 SPECIALIST Lexicon and Tools = large syntactic lexicon of biomedical and general English and tools for normalizing strings, generating lexical variants, and creating indexes

(29)

(30)



Can use UMLS to:

 Link terms and codes between doctor, pharmacy, and insurance company

 Coordinate patient care among several departments within a hospital

 Process texts to extract concepts, relationships, or knowledge

 Facilitate mapping between terminologies

 Develop an information retrieval system

 Extract specific terminologies from the Metathesaurus

 Create and maintain a local terminology

 Develop a terminology service

 Research terminologies or ontologies

UMLS

(31)



UMLS can be used to support



Information retrieval



Natural language processing



Automated indexing



Text mining



Public health statistics reporting



Terminology research



Electronic medical record analysis

(32)

(33)



Traditional Approach – manual chart review



Reliable



Slow



Costly



Emerging approach – Informatics methods such as

Natural Language Processing (NLP), Machine

Learning, and Data Mining



Fast and scalable



Performance may be questionable


(34)



The medical informatics (MI) community has invested much effort in developing methods to abstract relevant information from clinical narratives



Types of NLP tools

 Rule-based vs Machine learning based



General Development Frameworks

 Apache Unstructured Information Management Architecture (UIMA) – Java framework for developing NLP pipelines

 General Architecture for Text Engineering (GATE) – Java framework for developing NLP pipelines

 Natural Language Toolkit (NLTK) – Python library for developing NLP applications

(35)

 cTAKES – built on top of Apache UIMA

 HITEX – rule-based NLP pipeline based on the GATE framework

 Cleartk – framework for developing statistical NLP components on top of Apache UIMA

 NegEx – detects negated terms in clinical text

 ConText – extension to NegEx that also finds temporality (recent, historical, or hypothetical scenarios) and who the experiencer is (patient or other)

 MetaMap – comprehensive concept tagging system built on top of UMLS

 MedEx – recognizes medication names, dose, frequency, route, duration

 SecTag – recognizes note section headers using NLP, Bayesian, spelling correction, and scoring techniques

 Stanford CoreNLP – integrated suite of NLP tools including tokenization,
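To make the rule-based style of a tool like NegEx concrete, here is a much-simplified Python sketch of trigger-based negation detection. The trigger list and the five-token window are assumptions for illustration. The real NegEx algorithm uses a curated trigger lexicon and additional rules that are not reproduced here.

```python
import re

# Small illustrative trigger subset and window size (both are assumptions).
NEGATION_TRIGGERS = ["no", "denies", "without", "negative for"]
WINDOW = 5  # number of tokens before the concept that are searched for a trigger


def is_negated(sentence: str, concept: str) -> bool:
    """Return True if a negation trigger occurs shortly before the concept."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    concept_tokens = re.findall(r"[a-z0-9]+", concept.lower())
    n = len(concept_tokens)
    if n == 0:
        return False
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] != concept_tokens:
            continue
        # Look only at the tokens immediately preceding this concept mention.
        window_text = " ".join(tokens[max(0, start - WINDOW):start])
        for trigger in NEGATION_TRIGGERS:
            if re.search(rf"\b{re.escape(trigger)}\b", window_text):
                return True
    return False


print(is_negated("Patient denies chest pain or shortness of breath.", "shortness of breath"))
print(is_negated("Patient reports chest pain.", "chest pain"))
```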

(36)



clinical Text Analysis and Knowledge Extraction System (cTAKES)

 Open source NLP system developed at Mayo Clinic in 2006 by Dr. Guergana Savova and Dr. Christopher Chute

 Reads through and extracts concepts from plain text notes and transforms them into structured and normalized information

 Processes clinical notes to identify clinical named entities

 Drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures

 Each named entity has attributes for

 Text span

 Ontology mapping code

 Context, e.g. family history of, current, unrelated to patient

 Negated/not negated

(37)

 Components are specifically trained for the clinical domain

 Named section identifier

 Sentence boundary detector

 Rule-based tokenizer

 Formatted list identifier

 Normalizer

 Context dependent tokenizer

 Part-of-speech tagger

 Phrasal chunker

 Dictionary lookup annotator

 Context annotator

 Negation detector

 Uncertainty detector

 Subject detector

 Dependency parser

 Patient smoking status identifier

(38)

 Produces the most commonly desired output from cTAKES, including

 Annotations for anatomical sites, signs/symptoms, procedures, diseases and medications

 For each annotation, there are normalized UMLS CUIs, plus values for negation, uncertainty and subject

(39)



Harness unstructured information by allowing i2b2 users to query and join that information with existing i2b2 concepts

Currently, the entire note is commonly stored as a single row in the observation_blob field of the observation_fact table in i2b2

One of the features of cTAKES is to extract concepts from the text and transform them into structured information

Format the output of cTAKES into the i2b2 observation_fact table format, e.g. facts, concepts, modifiers, and values

Add an ‘NLP’ ontology in i2b2 that contains all concepts extracted from text
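The reformatting step might look roughly like the Python sketch below. It assumes the cTAKES output has already been reduced to a CUI plus negation, uncertainty, and subject values. The `Annotation` class, the `NLP|...` code prefixes, and the placeholder CUI are hypothetical. The dictionary keys follow common observation_fact column names, but this is not the actual i2b2 NLP cell implementation.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    cui: str          # normalized UMLS concept identifier
    negated: bool
    uncertain: bool
    subject: str      # e.g. "patient" or "family_member"


def to_observation_facts(patient_num: int, encounter_num: int,
                         start_date: str, annotations: list[Annotation]) -> list[dict]:
    """Emit one fact row per (annotation, modifier) pair."""
    rows = []
    for instance_num, ann in enumerate(annotations, start=1):
        base = {
            "patient_num": patient_num,
            "encounter_num": encounter_num,
            "start_date": start_date,
            "concept_cd": f"NLP|CUI:{ann.cui}",   # hypothetical code prefix
            "instance_num": instance_num,
        }
        modifiers = {
            "NLP|NEGATED": str(ann.negated),      # hypothetical modifier codes
            "NLP|UNCERTAIN": str(ann.uncertain),
            "NLP|SUBJECT": ann.subject,
        }
        for modifier_cd, value in modifiers.items():
            rows.append({**base, "modifier_cd": modifier_cd,
                         "valtype_cd": "T", "tval_char": value})
    return rows


# "C0000001" is a placeholder CUI, not a real concept.
facts = to_observation_facts(1, 10, "2020-01-01",
                             [Annotation("C0000001", True, False, "patient")])
for row in facts:
    print(row)
```

Storing negation, uncertainty, and subject as modifiers keeps each extracted concept queryable alongside existing structured concepts.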

(40)

(41)



Clinical Language Annotation, Modeling, and Processing Toolkit (CLAMP)

 Comprehensive clinical NLP software that enables recognition and automatic encoding of clinical information in narrative clinical notes

 High performance – components are built on proven methods in many clinical NLP challenges

 Customizable – one can choose from various choices of NLP and machine learning components

 Enterprise features

 Users can import clinical text corpora into CLAMP and annotate files using the built-in annotation tool

 Demo at https://clamp.uth.edu/clampdemo.php

(42)

(43)



Tokenization – convert sentences to words



Removing unnecessary punctuation, tags



Removing stop words – frequent words such as “the”, “is”, …



Stemming – words are reduced to a root by removing inflection, i.e. dropping unnecessary characters such as suffixes

Lemmatization – remove inflection by determining the part of speech, e.g. studying → study
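The preprocessing steps above can be sketched in plain Python. The stop-word list and the suffix-stripping stemmer are toy assumptions for illustration. They are not the Porter stemmer or any library's implementation, and lemmatization is omitted because it needs part-of-speech information.

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "of", "and", "with", "was"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # naive suffix list, checked in this order


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric runs, which also drops punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The patient was admitted with shortness of breath."))
```

Note that a stem such as "admitt" need not be a real word, which is one reason lemmatization is sometimes preferred.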

(44)



Mapping textual data to real-valued vectors for machine learning algorithms

One of the simplest techniques is Bag of Words (BOW)

(45)

Text Feature Extraction



Each word in BOW can be represented as either 1 for present or 0 for absent OR as the number of times each word appears in a document
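Both BOW variants can be sketched in a few lines of Python. The two example documents and the whitespace tokenization are assumptions chosen to keep the sketch short.

```python
from collections import Counter


def build_vocabulary(documents: list[str]) -> list[str]:
    """Collect the sorted set of distinct lowercase words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def bow_vector(document: str, vocabulary: list[str], binary: bool = False) -> list[int]:
    """Return presence/absence (binary=True) or raw counts for each vocabulary word."""
    counts = Counter(document.lower().split())
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


docs = ["chest pain chest tightness", "no chest pain"]
vocab = build_vocabulary(docs)
print(vocab)
print(bow_vector(docs[0], vocab))               # count representation
print(bow_vector(docs[0], vocab, binary=True))  # presence/absence representation
```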



Term Frequency-Inverse Document Frequency (TF-IDF)

 TF = number of times term t appears in a document / number of terms in the document

 IDF = log (N/n) where N is the number of documents and n is the number of documents a term t has appeared in

 TF-IDF = TF * IDF
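The three formulas translate directly into Python. The sketch below follows the definitions on this slide with no smoothing. Returning 0.0 for a term that appears in no document (or for an empty document) is an added assumption to avoid division by zero, and the toy corpus is invented for illustration.

```python
import math


def tf(term: str, document: list[str]) -> float:
    """Term count divided by the number of terms in the document."""
    if not document:
        return 0.0  # assumption: an empty document contributes nothing
    return document.count(term) / len(document)


def idf(term: str, corpus: list[list[str]]) -> float:
    """log(N / n), where n is the number of documents containing the term."""
    n = sum(1 for document in corpus if term in document)
    if n == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(len(corpus) / n)


def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    return tf(term, document) * idf(term, corpus)


corpus = [["chest", "pain"], ["chest", "tightness"], ["no", "pain", "no", "fever"]]
print(tf_idf("no", corpus[2], corpus))     # frequent here, rare elsewhere: higher weight
print(tf_idf("chest", corpus[0], corpus))  # appears in two of three documents: lower weight
```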



Limitation of BOW

(46)

Text Feature Extraction



Word Embedding – words with similar meanings receive similar representations

 Word2Vec – takes text input and produces a vector space with each unique word being assigned a corresponding vector in the space
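"Similar representation" is usually measured with cosine similarity between word vectors. The Python sketch below uses hand-made three-dimensional toy vectors, which are assumptions for illustration and not the output of Word2Vec or any trained model.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors invented for this sketch; real embeddings have many more dimensions.
TOY_VECTORS = {
    "dyspnea": [0.9, 0.1, 0.0],
    "breathlessness": [0.8, 0.2, 0.1],
    "fracture": [0.0, 0.1, 0.9],
}

print(cosine_similarity(TOY_VECTORS["dyspnea"], TOY_VECTORS["breathlessness"]))
print(cosine_similarity(TOY_VECTORS["dyspnea"], TOY_VECTORS["fracture"]))
```

With a trained embedding, related terms would be expected to score closer to 1 than unrelated terms, as the toy vectors mimic here.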

(47)



NLP task – Named Entity Recognition



Identify clinical syndromes and common biomedical concepts

from various types of notes

Clinical Information Extraction

(48)



Most frequently used clinical information extraction tools

 cTAKES (n = 26)

 MetaMap (n = 12)

 MedLEE (n = 10)



Most frequently used machine learning methods

(49)



Application areas of clinical information extraction and corresponding number of publications

(50)

 The 21st Century Cures Act of 2016 required the FDA to create a pathway to allow real-world evidence (RWE) to support new drug indications and post-marketing surveillance starting in 2018

 Objective: determine whether traditional RWE in cardiovascular medicine achieves accuracy sufficient for credible clinical assertion, aka “regulatory-grade” RWE

 Method: extracted a predefined set of clinical concepts from EHR structured (EHR-S) and unstructured (EHR-U) data and evaluated against manually annotated cohorts

 Dataset: 10,840 clinical notes drawn randomly

 Outcome: regulatory-grade or not for clinical phenotyping in cardiovascular medicine

 Recall > 85% and precision > 90%
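The threshold check can be sketched in Python as follows. The patient identifiers are made up, and the function names are illustrative. Only the two thresholds come from the slide.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Compare extracted cases against manually annotated (gold) cases."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def is_regulatory_grade(precision: float, recall: float,
                        min_recall: float = 0.85, min_precision: float = 0.90) -> bool:
    """Apply the criteria from the slide: recall > 85% and precision > 90%."""
    return recall > min_recall and precision > min_precision


# Hypothetical patient IDs for illustration only.
gold = {"p1", "p2", "p3", "p5"}
predicted = {"p1", "p2", "p3", "p4"}
precision, recall = precision_recall(predicted, gold)
print(precision, recall, is_regulatory_grade(precision, recall))
```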

Cardiovascular Medicine Phenotyping

(51)

 High-level NLP pipeline for information extraction from clinical notes

 Text Extraction – extract natural language text

 Section Detection – used SecTag to identify the correct section to add context in concept interpretation, e.g. medical history section

 Information Extraction and Tagging – ANNIE from GATE NLP pipeline

 Removal of special characters, tokenization, sentence splitter, POS tagger, named entity recognition and negation and subject tagging

 Concept Tagging – Normalize identified information to known concepts in medical terminologies including SNOMED-CT, RxNorm, LOINC

(52)

Cardiovascular Medicine Phenotyping

Hernandez-Boussard T. et al. Real world evidence in cardiovascular medicine: ensuring data validity in electronic health record-based studies. JAMIA, 26(11):1189-1194, 2019.

(53)

Cardiovascular Medicine Phenotyping

 Conclusion:

 Recall varied greatly between EHR-S and EHR-U

 EHR-S did not meet regulatory-grade criteria (recall > 85% and precision >

(54)

 Objective: Will adding EHR data to the FDA Adverse Event Reporting System (AERS) improve signal detection accuracy?

 Dataset: 4 million AERS reports + 1.2 million EHR narratives

Drug Safety Surveillance

(55)

 Performance comparison based on the precision at K statistic for different values of K (number of signals selected)

 Error bars reflect 95% CIs

 Clinical text improved accuracy of signal detection significantly
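Precision at K is the fraction of the top K ranked signals that are true signals. A minimal Python sketch follows. The drug-event pairs are hypothetical and do not come from the study.

```python
def precision_at_k(ranked_signals: list[tuple[str, str]],
                   true_signals: set[tuple[str, str]], k: int) -> float:
    """Fraction of the K highest-ranked signals that are known true signals."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_signals[:k]
    hits = sum(1 for signal in top_k if signal in true_signals)
    return hits / k


# Hypothetical (drug, adverse event) pairs, ordered from strongest to weakest signal.
ranked = [("drugA", "rash"), ("drugB", "nausea"), ("drugC", "bleeding"), ("drugD", "headache")]
known_true = {("drugA", "rash"), ("drugC", "bleeding")}

for k in (1, 2, 4):
    print(k, precision_at_k(ranked, known_true, k))
```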

(56)

Drug Safety Surveillance

 Processes clinical text and produces a patient-feature matrix encoded using

(57)

(58)

(59)



Medical devices require post-market surveillance to assess the implants’ safety and efficacy

 Pacemakers, joint replacements, breast implants, insulin pumps, spinal cord stimulators, etc.

Device surveillance in the US relies primarily on spontaneous reporting systems as a means to document adverse events reported by physicians and providers

Device-related adverse events are significantly underreported

 An estimated 0.5% (or less) of adverse event reports received by the FDA concern medical devices

Evidence extracted from clinical notes can enable device surveillance

(60)



Applied deep learning methods to identify reports of hip joint implant related complications and pain from clinical notes

Combined structured and unstructured data to characterize hip implant performance in the real world



Dataset: 6583 patients with hip replacement

 55.6% female

 Average age at surgery of 63

 Average follow-up time after replacement of 5.3 years

 386 (5.8%) had a coded record of at least one revision surgery

 Average age at primary replacement surgery was 57.9 years

 Average follow-up time was 10.5 years

(61)

 3 entity/event types

 Implant system entities identified by a manufacturer and/or model name, e.g. “Zimmer VerSys”

 Implant-related complications, e.g. “infected left hip prosthetic”

 Patient-reported pain at a specific anatomical location, e.g. “left hip tenderness”

 Performance of the machine learning methods on entity and relation extraction

(62)

Medical Device Surveillance

 AUPRC for Implant-Complication classifier performance at different training set sizes

(63)

Medical Device Surveillance

 Negative binomial model-derived incidence rate ratios (IRRs) for hip pain mentions

 6 systems, all manufactured by Depuy, had IRRs <1, indicating that they are associated with lower rates of hip pain mentions relative to the Zimmer Biomet Trilogy + VerSys reference system when controlling for patient demographics and pain mentions in the prior year

 4 systems (3 Zimmer Biomet, 1 Depuy) have IRRs >1, indicating that they are

https://community.i2b2.org/wiki/display/NLPCTAKES/NLP+cTakes+Home

https://clamp.uth.edu/clampdemo.php