Automated assignment of
ICD-9-CM codes to
radiology reports
Richárd Farkas
University of Szeged
Filip Ginter
University of Turku
Overview
Why clinical coding?
Importance, use of automated coding
Challenge description
Data used
Evaluation methodology
Our solutions
Szeged system
Turku system
Results and comparison
NLP in the clinical domain
Narrative texts
A huge amount of information is hidden
Manual processing requires expertise
Time
Costs
Special features of medical texts
Unique characteristics of the language used
„Smokes 2-3 cig / day , occ etoh , and no drugs except marijuana Exam”
Clinical coding
Automatic assignment of disease/symptom codes to clinical records
International Classification of Diseases (ICD-X-CM)
X – revision (current: 10, used: 9)
Used for
statistics – on diseases, or effects of treatment
billing – the task has commercial relevance
Overcoding is penalised by 3x the sum
Undercoding means loss of revenue
Codes are added to the text after the treatment (US)
International Challenge on Classifying Clinical Free Text
Using Natural Language Processing
Shared task challenge to evaluate NLP systems on clinical data
http://www.computationalmedicine.org/challenge/
ICD-9-CM coding
Radiology reports
Organization
Computational Medicine Center, Cincinnati, Ohio, USA
February/March, 2007
Motivation
Practical importance for hospital administration and health insurance
120+ registered participants
44 systems submitted
Data Used
Radiology records annotated with ICD codes
978 documents used for training ICD-9 systems
976 unseen documents used for evaluation
Annotation provided by 3 health institutes
majority labeling used as gold standard
45 different ICD codes used
codes appear in various combinations (94 different sets of codes)
frequency of labels varies
The data is made available free of charge for research purposes by the challenge organizers
Example
<doc id="97664713" type="RADIOLOGY_REPORT">
<codes>
<code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
<code origin="COMPANY3" type="ICD-9-CM">518.0</code>
<code origin="COMPANY1" type="ICD-9-CM">786.2</code>
<code origin="COMPANY2" type="ICD-9-CM">786.2</code>
</codes>
<texts>
<text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">
Cough. History of pneumonia on 1/2/01. Increased work of breathing.
</text>
<text origin="CCHMC_RADIOLOGY" type="IMPRESSION">
No significant change to overall appearance of perihilar lung opacities and peribronchial thickening most consistent with viral illness vs reactive airways disease. Increased densities superimposed over the right middle lobe and lingular region on the lateral view may represent superimposition of shadows. However atelectasis or a small amount of parenchymal consolidation cannot be fully excluded. This patient's lung markings have appeared prominent on the four existing chest x-rays in our file. It is recommended that the child receive a well - child chest x-ray in order to evaluate lung markings when the child is not sick.
</text>
</texts>
</doc>
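A minimal sketch of how a record in this format could be read, assuming one `<doc>` element per report as in the example. The function name `read_record` and the shortened sample text are illustrative and not part of the challenge distribution.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the record shown above (report text shortened).
SAMPLE = """<doc id="97664713" type="RADIOLOGY_REPORT">
<codes>
<code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
<code origin="COMPANY3" type="ICD-9-CM">518.0</code>
<code origin="COMPANY1" type="ICD-9-CM">786.2</code>
<code origin="COMPANY2" type="ICD-9-CM">786.2</code>
</codes>
<texts>
<text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">Cough. History of pneumonia on 1/2/01.</text>
<text origin="CCHMC_RADIOLOGY" type="IMPRESSION">No significant change.</text>
</texts>
</doc>"""


def read_record(xml_string):
    """Return (gold codes, {section type: text}) for one <doc> element."""
    doc = ET.fromstring(xml_string)
    # Only the majority annotation is treated as the gold standard.
    gold = sorted(code.text.strip() for code in doc.iter("code")
                  if code.get("origin") == "CMC_MAJORITY")
    sections = {text.get("type"): (text.text or "").strip()
                for text in doc.iter("text")}
    return gold, sections


gold_codes, sections = read_record(SAMPLE)
```

Here the three annotators disagree (COMPANY3 assigned 518.0), and only the majority code 786.2 is kept as the label.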
Szeged, Hungary
Richárd Farkas
Research Group on Artificial Intelligence of the Hungarian Academy of Sciences,
György Szarvas
University of Szeged,
Department of Informatics,
Human Language Technology Group
Szeged ICD coding solutions
Language processing: negation/speculation
Exploiting ICD and utilising labeled data
Inter-label dependencies
Synonyms and abbreviations
Challenge system: hand-crafted
Language processing
Coding guides prescribe that
uncertain diagnoses should not be coded
speculations
Peribronchial thickening most consistent with viral illness vs reactive airways disease
negation
Normal slightly hypoventilatory chest x-ray, no pneumonia.
issues in the past without direct effect on current treatment should not be coded
temporal resolution is neglected due to noisy annotation of historical findings
Detection of speculation/negation
Simple approach, motivated by the fairly simple grammar of the text
physicians
aim to briefly enumerate findings and their opinion
rarely use very complex noun phrases or syntax
Dictionaries of keywords collected from training data
Scope identified by naive heuristic
right scope – end of sentence
left scope – previous punctuation
(or nothing, depending on the keyword)
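The scope heuristic above can be sketched as follows. This is a minimal illustration, not the Szeged implementation: the keyword dictionary `KEYWORDS` and the function `scoped_spans` are hypothetical, and the real dictionaries were collected from the training data.

```python
import re

# Hypothetical keyword dictionary: keyword -> does it also have a left scope?
KEYWORDS = {
    "no": False,                       # right scope only
    "consistent with": False,          # right scope only
    "cannot be fully excluded": True,  # also covers the text before it
}


def scoped_spans(sentence):
    """Return the parts of one sentence that fall inside a negation/speculation scope."""
    spans = []
    lowered = sentence.lower()
    for keyword, has_left_scope in KEYWORDS.items():
        for match in re.finditer(r"\b" + re.escape(keyword) + r"\b", lowered):
            # Right scope: from the keyword to the end of the sentence.
            right = sentence[match.end():].strip(" .")
            if right:
                spans.append(right)
            if has_left_scope:
                # Left scope: back to the previous punctuation mark (or sentence start).
                start = max(lowered.rfind(p, 0, match.start()) for p in ",;:.") + 1
                left = sentence[start:match.start()].strip()
                if left:
                    spans.append(left)
    return spans


negated = scoped_spans("Normal slightly hypoventilatory chest x-ray, no pneumonia.")
speculated = scoped_spans("However atelectasis or a small amount of parenchymal "
                          "consolidation cannot be fully excluded.")
```

Findings that occur only inside such a span would then be withheld from coding.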
Exploration of inter-label dependencies
Overcoding, e.g. symptoms and diseases
C4.5 classifiers trained to detect false positive labels
Features: base-system labels
Extracted 5 dependencies
each expresses „Delete symptom if disease has textual evidence”
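A sketch of how such a dependency could be applied to the output of the base system. The two (symptom, disease) pairs below are illustrative assumptions; the slides do not list the five dependencies that were actually extracted.

```python
# Hypothetical rules of the form "delete symptom if disease has textual evidence".
DELETE_SYMPTOM_IF_DISEASE = [
    ("786.2", "486"),    # cough vs. pneumonia
    ("780.6", "599.0"),  # fever vs. urinary tract infection
]


def apply_dependencies(codes):
    """Remove symptom codes whose explaining disease code is also present."""
    kept = set(codes)
    for symptom, disease in DELETE_SYMPTOM_IF_DISEASE:
        if disease in codes:
            kept.discard(symptom)
    return kept
```

For example, `apply_dependencies({"786.2", "486"})` keeps only the pneumonia code, while a report coded with cough alone is left unchanged.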
Data-driven model
Vector Space Model
token 1-2-3 grams as features
C4.5 classifiers on 45 binary classification tasks
Expanding the dictionaries:
Gathering missing synonyms, abbreviations
Example of terms found
Urinary Tract Infection
uti
Asthma
reactive airways disease
Laurence-Moon-Biedl syndrome
Williams syndrome
Beckwith-Wiedemann syndrome hemihypertrophy
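The vector space representation of the data-driven model can be sketched as below, assuming whitespace tokenization and lower-casing (neither is stated in the slides). The helper names are hypothetical, and the C4.5 learner itself is not reproduced here.

```python
def token_ngrams(text, max_n=3):
    """Return the set of token 1-, 2- and 3-grams of a report."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def binary_targets(label_sets, codes):
    """Build one 0/1 target vector per code, i.e. one binary task per ICD code."""
    return {code: [int(code in labels) for labels in label_sets]
            for code in codes}


features = token_ngrams("no pneumonia seen")
targets = binary_targets([{"786.2"}, {"486", "786.2"}], ["486", "786.2"])
```

Each of the 45 codes gets its own target vector, so a decision tree can be trained per code on the same n-gram features.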
External knowledge (ICD) vs.
Data-driven models
ICD
data independent
robust (information source is reliable)
can cover rare codes
Data-driven
can explore individual coding style (synonyms, abbreviations)
requires labeled documents
cannot handle rare codes
Added values of the subphases
System                                   Eval      Train
Hand-crafted system                      89.41%    90.02%
ICD                                      83.21%    84.07%
- language processing                    70.48%    71.46%
Union of statistical and coding guide    89.33%    90.53%
+ statistical enriching (synonyms)       88.93%    90.26%
+ inter-label dependencies               84.85%    85.57%
45-class statistical system              86.69%    88.20%
The Turku Group in the Challenge
Language processing group at the Department of IT, University of Turku and Turku Centre for Computer Science (TUCS)
Antti Airola
Filip Ginter
Tapio Pahikkala
Sampo Pyysalo
Tapio Salakoski
Hanna Suominen
Department of nursing science, University of Turku
Sanna Salanterä
The Turku ICD coding system
Feature engineering
Mapping text to UMLS concepts (MetaMap)
Recognition of negation and speculation
Generalization via hypernymy
Machine learning
Primary classifier (RLS)
Secondary classifier (Ripper) – corrections of known errors made by the primary classifier
MetaMap
MetaMap identifies instances of UMLS
concepts in running text
NLM’s MetaMap program
Divides running text into phrases
Each phrase is mapped into a set of UMLS concepts from specified vocabularies
MetaMap output example
Eleven year old
Eleven, Quantitative Concept, C0205457
Year, Temporal Concept, C0439234
Old, Temporal Concept, C0580836
with acute leukemia
Acute leukaemia, Neoplastic Process, C0085669
bone marrow transplant
Bone marrow transplant, Therapeutic or Preventive Procedure, C0005961
on Jan. 2 now
with three day history
Three, Quantitative concept, C0205449
day, Temporal concept, C0439228
History, Occupation or Discipline, C0019664
of cough
Hypernym expansion
Hypernyms as additional features
Generalize the identified concepts along the hierarchy
Cough → Respiratory symptoms → Signs and Symptoms → …
Fever → Body temperature altered → Signs and Symptoms → …
Atelectasis → Diseases of the lung → Diseases of the respiratory system → …
Pneumonia → Diseases of the lung → Diseases of …
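Hypernym expansion along the chains above can be sketched with a toy parent map. The map `PARENT` is only a stand-in for the UMLS hierarchy, built from the example chains on this slide, and it assumes a single parent per concept, which the real hierarchy does not guarantee.

```python
# Toy parent map standing in for the UMLS hierarchy (illustrative only).
PARENT = {
    "Cough": "Respiratory symptoms",
    "Respiratory symptoms": "Signs and Symptoms",
    "Fever": "Body temperature altered",
    "Body temperature altered": "Signs and Symptoms",
    "Atelectasis": "Diseases of the lung",
    "Pneumonia": "Diseases of the lung",
    "Diseases of the lung": "Diseases of the respiratory system",
}


def hypernyms(concept):
    """Walk up the hierarchy and collect every ancestor of the concept."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


shared = set(hypernyms("Cough")) & set(hypernyms("Fever"))
```

The two concepts share no tokens, yet after expansion they share the feature "Signs and Symptoms".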
Hypernym expansion – motivation
More accurate similarity information
Lexically, cough and fever are different
Hypernym expansion adds the information that both are symptoms
The connection can also be learned given
large quantities of data
Negation and speculation
Negation, speculation, temporal information
Recognize trigger words: could, history of, likely, may, mild, minimal, no, past, possible, possibly, probable, probably, questionable, suggestive, unsure, without
Scope: Everything from a trigger word up to the end of the current sentence
All features extracted from a negated text span are marked
ICD coding guide: speculated / unsure code is not assigned
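A minimal sketch of the trigger-word marking described above, using the trigger list from the slide. Whether the trigger word itself is marked, and the whitespace tokenization, are assumptions made here; `mark_negated` is a hypothetical name.

```python
import re

TRIGGERS = ["could", "history of", "likely", "may", "mild", "minimal", "no",
            "past", "possible", "possibly", "probable", "probably",
            "questionable", "suggestive", "unsure", "without"]
# Longest alternatives first, so that "possibly" is tried before "possible".
TRIGGER_RE = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, TRIGGERS), key=len, reverse=True)) + r")\b")


def mark_negated(sentence):
    """Prefix every token from the first trigger word to the sentence end with neg-."""
    tokens = sentence.lower().rstrip(".").split()
    text = " ".join(tokens)
    match = TRIGGER_RE.search(text)
    if match is None:
        return tokens
    # Number of tokens that precede the trigger word.
    n_before = len(text[:match.start()].split())
    return tokens[:n_before] + ["neg-" + token for token in tokens[n_before:]]


marked = mark_negated("Cough and no pneumonia.")
```

A sentence without any trigger word is returned unchanged.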
Hypernym expansion & negation
Hypernym expansion and negation
VALID: pneumonia → lung disease
INVALID: not pneumonia → not lung disease
Negated concepts are not expanded with
hypernyms
Room for improvement
VALID: possible pneumonia → possible lung disease
Feature engineering
Final set of features entering the classifier
Text tokens
No particular order: Bag-of-Words (BoW) model
Marked with neg- whenever negated
Set of UMLS concepts (their c-codes) extracted with MetaMap
Marked with neg- whenever negated
Set of hypernyms of the extracted UMLS concepts
Included only for non-negated concepts
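The three feature groups listed above could be assembled as in the sketch below. The identifiers such as `C_COUGH` are placeholders rather than real UMLS c-codes, and `build_features` is a hypothetical helper; in the described system the concepts come from MetaMap.

```python
def build_features(tokens, concepts, hypernyms_of):
    """Combine tokens, concept ids and hypernyms into one unordered feature set.

    tokens and concepts are lists of (name, negated) pairs;
    hypernyms_of maps a concept id to the ids of its hypernyms.
    """
    features = set()
    for token, negated in tokens:
        features.add(("neg-" if negated else "") + token)
    for concept, negated in concepts:
        features.add(("neg-" if negated else "") + concept)
        if not negated:
            # Hypernyms are added only for non-negated concepts.
            features.update(hypernyms_of.get(concept, []))
    return features


example = build_features(
    tokens=[("cough", False), ("pneumonia", True)],
    concepts=[("C_COUGH", False), ("C_PNEUMONIA", True)],
    hypernyms_of={"C_COUGH": ["C_RESP_SYMPTOM"], "C_PNEUMONIA": ["C_LUNG_DISEASE"]},
)
```

The negated pneumonia concept contributes its marked id but no lung-disease hypernym.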
Classification
RLS (regularized least-squares) classifier
Maximal-margin, kernel-based classifier
Close relative of Support Vector Machines (SVMs)
Linear kernel (fast & worked well)
One classifier for each code
“1 versus all” classification
May lead to no codes assigned or an impossible combination of codes
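As an illustration of one-versus-all RLS with a linear kernel, the sketch below solves the standard regularized least-squares system w = (XᵀX + λI)⁻¹Xᵀy per code in plain Python. It is a toy under stated assumptions (dense features, labels in {-1, +1}, threshold at 0, λ = 1), not the Turku implementation.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def rls_fit(X, y, lam=1.0):
    """Linear RLS weights: (X^T X + lam I) w = X^T y."""
    d = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(row[i] * target for row, target in zip(X, y)) for i in range(d)]
    return solve(gram, rhs)


def one_vs_all(X, label_sets, codes, lam=1.0):
    """Train one binary RLS model per code."""
    return {code: rls_fit(X, [1.0 if code in labels else -1.0 for labels in label_sets], lam)
            for code in codes}


def predict(models, x):
    """Assign every code whose model scores the document above zero."""
    return {code for code, w in models.items()
            if sum(wi * xi for wi, xi in zip(w, x)) > 0}


# Toy data: two features (cough, fever), two codes.
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
labels = [{"786.2"}, {"780.6"}, {"786.2"}, {"780.6"}]
models = one_vs_all(X, labels, ["786.2", "780.6"])
```

Because each code is decided independently, a document with no informative features, such as `[0, 0]` here, receives no code at all, which is the failure case named on the slide.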
Correcting known errors
Cascaded classifier attempts to correct known errors
Empty or impossible combinations
RIPPER
Decision rules
A very different paradigm from RLS
Trained and applied exactly as the first classifier
1 vs. All
Known errors made by the second classifier left uncorrected
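The cascade can be sketched as a simple fallback. Treating "impossible" as "a combination never seen in the training data" is an assumption made for this illustration; `cascade` and `SEEN` are hypothetical names.

```python
def cascade(primary_codes, secondary_codes, seen_combinations):
    """Keep the primary prediction unless it is empty or an unseen combination."""
    primary = frozenset(primary_codes)
    if primary and primary in seen_combinations:
        return set(primary)
    # Fall back to the secondary classifier; its own errors are left uncorrected.
    return set(secondary_codes)


# Hypothetical set of code combinations observed in training.
SEEN = {frozenset({"786.2"}), frozenset({"486"}), frozenset({"780.6", "786.2"})}
```

With these combinations, an empty primary prediction or the unseen pair {486, 786.2} is replaced by the secondary classifier's output.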
Using ICD-9 in training
ICD-9 definitions as training instances
Concatenate the textual definitions of each of
the 45 codes and its parents in the ICD
hierarchy
Same “generalization” idea as previously
Extract features in the standard way
Pool the resulting 45 training instances with
the challenge training data
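A sketch of how a definition-based training instance could be built. The dictionary below is a small illustrative fragment of the ICD-9-CM hierarchy, and the helper names are hypothetical; feature extraction would then proceed as for any report.

```python
# Illustrative fragment: code -> (textual definition, parent code or None).
ICD_FRAGMENT = {
    "786.2": ("Cough", "786"),
    "786": ("Symptoms involving respiratory system and other chest symptoms", "780-789"),
    "780-789": ("Symptoms", None),
}


def definition_text(code, icd=ICD_FRAGMENT):
    """Concatenate the definition of a code with those of all its ancestors."""
    parts = []
    while code is not None:
        definition, code = icd[code]
        parts.append(definition)
    return " ".join(parts)


def definition_instances(codes, icd=ICD_FRAGMENT):
    """One pseudo-document per code, labeled with that code only."""
    return [(definition_text(code, icd), {code}) for code in codes]


extra = definition_instances(["786.2"])
```

The resulting (text, label set) pairs can be appended to the challenge training documents before training.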
Turku system: Summary
FEATURE EXTRACTION
Source text → Tokenization → Source text tokens
Source text → MetaMap → Set of UMLS concepts
Set of UMLS concepts → UMLS hypernym expansion (UMLS hierarchy) → Extended set of UMLS concepts
Negation and speculation detection
CLASSIFICATION
RLS classifier (1 vs. All) → Set of ICD codes
possible combination → Final set of ICD codes
impossible combination → RIPPER classifier (1 vs. All) → Final set of ICD codes
Turku system: Component contribution
Cross-validated performance on training data
Component                 Fmicro   Error   Relative Gain
RLS (initial)             79.3     20.7
Tokenization              80.7     19.3    7%
UMLS mapping              82.5     17.5    9%
UMLS hypernyms            83.4     16.6    5%
Negation/speculation      84.7     15.3    8%
Cascaded Ripper           86.5     13.5    12%
ICD-9 training data       86.6     13.4    1%
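The Relative Gain figures are consistent with a relative reduction of the error (100 minus Fmicro) from one step to the next. The short check below reproduces them from the reported Fmicro values; the reading of the column as relative error reduction is an inference from the numbers, not something the slide states.

```python
# (component, cross-validated Fmicro) in the order the components were added.
STEPS = [
    ("RLS (initial)", 79.3),
    ("Tokenization", 80.7),
    ("UMLS mapping", 82.5),
    ("UMLS hypernyms", 83.4),
    ("Negation/speculation", 84.7),
    ("Cascaded Ripper", 86.5),
    ("ICD-9 training data", 86.6),
]


def relative_gains(steps):
    """Relative error reduction (in whole percent) of each step over the previous one."""
    gains = []
    previous_error = None
    for name, f_micro in steps:
        error = 100.0 - f_micro
        if previous_error is not None:
            gains.append((name, round(100.0 * (previous_error - error) / previous_error)))
        previous_error = error
    return gains


gains = relative_gains(STEPS)
```

For instance, tokenization lowers the error from 20.7 to 19.3, a reduction of 1.4 / 20.7, or about 7%.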
Turku vs. Szeged: Crucial differences
Turku system
Pure machine learning
ICD-9 definitions used as training examples with 0.1 percentage point improvement
No explicit use of ICD-9 coding guidelines
Heavy reliance on UMLS: MetaMap, hypernyms
Szeged system
Challenge system: rule-based
Replicated via machine learning
ICD-9 definitions and coding guidelines are the core of the system
No external resources beyond
Turku vs. Szeged: Crucial differences
Szeged system allows individual ICD code deletion
“if code X is given, delete code Y”
Turku system rejects the whole code combination and applies a different classifier
Paradoxically, no gain from using the Szeged finer ICD code handling on top of Turku results (0.3
percentage point F-score decrease)
E.g. false positive disease code causes a true positive symptom code to be removed
Use of hypernym expansion
More detailed negation/speculation/temporal detection in Szeged system
Language specifics
CMC challenge was on English text
How about other languages?
Szeged system
Needs translated ICD
Language-adapted negation/speculation detection
Turku system
Needs translated UMLS resources and MetaMap
Many of the features are language-independent UMLS c-codes
Language-adapted negation/speculation detection
Both systems rely on string search in one way or another
Crucial differences (cont.)
Different approach to design
Turku system
Classifier-centric
“Extract all thinkable features and feed them into a state-of-the-art classifier”
Szeged system
Data-centric
Build from the available resources (ICD and training data) and use classifiers with interpretable models
Study the mistakes and the model, correct errors
CMC challenge results: The big picture
Best F-score 89.1 (Szeged system)
Mean F-score 76.7 (σ = 13.4)
Turku and Szeged baselines
Szeged: 83.2% F – bare system with just NLP and ICD but no other direct use of the training data
Turku: 80.7% F – bare machine learning system with no data preprocessing of any kind (only
whitespace tokenization)
About half of the challenge submissions
stayed below these baseline systems!
CMC challenge: Lessons learned
General observations across all submissions
Presented by Pestian et al., ACL’07 BioNLP workshop, 2007
Based on short system descriptions (not publicly available)
1. Best systems explicitly took into account negation
and speculation
2. Better systems frequently worked with hypernym
and synonym detection
3. Significant amount of symbolic processing
CMC challenge: Lessons learned
5. Careful, medically-informed feature engineering common
6. SVM and related state-of-the-art classification algorithms were strongly represented, but not reliably predictive of high ranking
Turku development observation: a number of “traditional” classifiers matched RLS
Beyond the ICD coding
Similar NLP tasks
The same architecture can be used
Find the relevant parts of the documents
Find relevant phrases (synonyms, abbreviations)
simple string-matching with a particular dictionary
Prototype tasks:
The i2b2 „obesity” challenge
Smoking status detection
The i2b2 „obesity” challenge
Who's obese and what co-morbidities do they
(definitely/likely) have?
Informatics for Integrating Biology and the
Bedside (i2b2)
February – June 2008
730 training and 507 evaluation documents
multi-label problem, 16 morbidities
Comparison
Focusing on several morbidities (matchable
with set of ICD)
Longer documents (average length: 130 lines)
More noise
„The patient has a positive family history of coronary disease”
Negation/speculation detection is highlighted
(Y/N/Q/U F-macro)
Smoking status detection
i2b2 challenge – 2006
The patient in question is
SMOKER, NON-SMOKER, PAST-SMOKER or smoker status UNKNOWN
inter-annotator agreement ~85%
398 train and 104 eval documents
Small dictionaries:
smoke, tobacco
etc.
best systems 88%
Final thoughts on ICD coding
Some clear advantages
lower costs
less error-prone processing of simpler cases
Fully automatic system is impossible (nowadays)
Far away from human intelligence
will not solve rare, harder cases
Right middle and probable right lower lobe pneumonia.
The place of an automatic system
Pre-labeling/highlighting to speed up manual coding
prediction along with confidence measure
Validation
suggesting erroneous / missed codes
monitoring for health insurance companies
Automated coding of large datasets
Tasks to be solved…
Extending systems to thousands of codes
If a corpus of appropriate size is available…
Incorporating more expert knowledge into the statistical methods
user-friendly interfaces
„interactive” systems
Better language processing
Corpus for developing sophisticated scope detectors: BioScope (released June 2008)
www.inf.u-szeged.hu/rgai/bioscope
Open questions
Is there „the” coder, or does every institute have its own individual coding style?
How to transfer among languages?
Is there any drop in accuracy
on other languages (free word order in Hungarian)
on other domains (nursing notes)?
What is the real speed-up of an automatic
pre-coding/suggestion system?
Open questions (cont.)
More training data needed to scale the
systems up
Hospitals have the data but privacy concerns
prevent its dissemination to companies / NLP
researchers who build the system
Training data generally cannot be reconstructed from trained machine-learning systems
Distribute an “empty” system?
Legal issues?
Multilingual ICD tagging: summary
Basic NLP tools
Tokenizer
Lemmatizer
Tagger, phrase parser (in some approaches)
Need domain adaptation
Controlled domain vocabulary resources
Term variants (e.g. synonyms and abbreviations)
Generally scarce
Ideally within a large framework such as UMLS
Allowing tool re-use
Basic NLP resources
Tokenizer
Preferably domain-adapted
Very poor language standards in some clinical documents
Lemmatizer
Case in point: FinTWOL and nursing narratives
Basic FinTWOL extended by Lingsoft with ~3500 domain words
Recognition rate grew from 83.1% to 90.7%
That corresponds to a 42% decrease in unrecognized running words
Hungarian: lemmatizers exist but are not domain adapted due to data privacy concerns
Researchers who are able to adapt the lemmatizers do not have appropriate data access permissions
References
1st place: Farkas, R., & Szarvas, G. (2008). Automatic
construction of rule-based ICD-9-CM coding systems. BMC
Bioinformatics, 9(Suppl 3), S10.
2nd place: Crammer, K., Dredze, M., Ganchev, K., & Talukdar, P.
P. (2007). Automatic code assignment to medical text.
Proceedings of ACL’07 BioNLP workshop.
3rd place: Suominen, H., Ginter, F., Pyysalo, S., Airola, A.,
Pahikkala, T., Salanterä, S., & Salakoski, T. (2008). Machine
Learning to Automate the Assignment of Diagnosis Codes to
Free-text Radiology Reports: a Method Description. Proceedings
of the ICML/UAI/COLT Workshop on Machine Learning for Health-Care Applications.
Challenge description: Pestian, J. P., Brew, C., Matykiewicz, P., Hovermale, D., Johnson, N., Cohen, K. B., & Duch, W.
(2007). A shared task involving multi-label classification of clinical