Automated assignment of ICD-9-CM codes to radiology reports

Academic year: 2021


(1)

Automated assignment of ICD-9-CM codes to radiology reports

Richárd Farkas

University of Szeged

Filip Ginter

University of Turku
(2)

Overview

Why clinical coding?

Importance, use of automated coding

Challenge description

Data used

Evaluation methodology

Our solutions

Szeged system

Turku system

Results and comparison

(3)

NLP in the clinical domain

Narrative texts

A huge amount of information is hidden

Manual processing requires expertise

Time

Costs

Special features of medical texts

Unique characteristics of the language used

„Smokes 2-3 cig / day , occ etoh , and no drugs except marijuana Exam”

(4)

Clinical coding

Automatic assignment of disease/symptom codes to clinical records

International Classification of Diseases (ICD-X-CM)

X – revision (current: 10, used: 9)

Used for

statistics – on diseases, or effects of treatment

billing – the task has commercial relevance

Overcoding is penalised by 3x the sum

Undercoding means loss of revenue

Codes are added to the text after the treatment (US)

(5)

International Challenge on Classifying Clinical Free Text

Using Natural Language Processing

Shared task challenge to evaluate NLP systems on clinical data

http://www.computationalmedicine.org/challenge/

ICD-9-CM coding

Radiology reports

Organization

Computational Medicine Center, Cincinnati, Ohio, USA

February/March 2007

Motivation

Practical importance for hospital administration and health insurance

(6)

120+ registered participants

44 systems submitted

(7)

Data Used

Radiology records annotated with ICD codes

978 documents used for training ICD-9 systems

976 unseen documents used for evaluation

Annotation provided by 3 health institutes

majority labeling used as gold standard

45 different ICD codes used

codes appear in various combinations (94 different sets of codes)

label frequencies vary

The data is made available free of charge for research purposes by the challenge organizers

(8)

Example

<doc id="97664713" type="RADIOLOGY_REPORT">
  <codes>
    <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
    <code origin="COMPANY3" type="ICD-9-CM">518.0</code>
    <code origin="COMPANY1" type="ICD-9-CM">786.2</code>
    <code origin="COMPANY2" type="ICD-9-CM">786.2</code>
  </codes>
  <texts>
    <text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY">
      Cough. History of pneumonia on 1/2/01. Increased work of breathing.
    </text>
    <text origin="CCHMC_RADIOLOGY" type="IMPRESSION">
      No significant change to overall appearance of perihilar lung opacities and peribronchial thickening most consistent with viral illness vs reactive airways disease. Increased densities superimposed over the right middle lobe and lingular region on the lateral view may represent superimposition of shadows. However atelectasis or a small amount of parenchymal consolidation cannot be fully excluded. This patient's lung markings have appeared prominent on the four existing chest x-rays in our file. It is recommended that the child receive a well - child chest x-ray in order to evaluate lung markings when the child is not sick.
    </text>
  </texts>
</doc>
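
The example report lists one code set per annotating company plus the majority label. As a minimal sketch of how such a majority label could be derived, the Python below parses the `<codes>` element and keeps every code assigned by more than half of the annotators. The function name `majority_codes` and the strict "more than half" rule are assumptions for illustration; the slides only state that majority labeling was used as the gold standard.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Code section of the example report shown on the slide.
DOC = """<doc id="97664713" type="RADIOLOGY_REPORT">
  <codes>
    <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
    <code origin="COMPANY3" type="ICD-9-CM">518.0</code>
    <code origin="COMPANY1" type="ICD-9-CM">786.2</code>
    <code origin="COMPANY2" type="ICD-9-CM">786.2</code>
  </codes>
</doc>"""


def majority_codes(doc_xml, gold_origin="CMC_MAJORITY"):
    """Return the codes assigned by more than half of the annotators."""
    root = ET.fromstring(doc_xml)
    # Collect unique (annotator, code) pairs, skipping the provided gold label.
    pairs = {
        (code.get("origin"), code.text.strip())
        for code in root.iter("code")
        if code.get("origin") != gold_origin
    }
    annotators = {origin for origin, _ in pairs}
    votes = Counter(code for _, code in pairs)
    needed = len(annotators) // 2 + 1  # assumed rule: strict majority
    return {code for code, n in votes.items() if n >= needed}


print(majority_codes(DOC))
```

On this report two of the three companies chose 786.2 and one chose 518.0, so the sketch reproduces the `CMC_MAJORITY` entry.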

(9)
(10)
(11)

Szeged, Hungary

Richárd Farkas

Research Group on Artificial Intelligence of the Hungarian Academy of Sciences,

György Szarvas

University of Szeged,

Department of Informatics,

Human Language Technology Group

(12)

Szeged ICD coding solutions

Exploiting ICD

Language processing (negation/speculation)

and utilise labeled data

Inter-label dependencies

Synonyms and abbreviations

Challenge system: hand-crafted

(13)

Language processing

Coding guidelines require that

uncertain diagnoses should not be coded

speculation

Peribronchial thickening most consistent with viral illness vs reactive airways disease

negation

Normal slightly hypoventilatory chest x-ray, no pneumonia.

issues in the past without direct effect on current treatment should not be coded

temporal resolution is neglected due to noisy annotation of historical findings

(14)

Detection of speculation/negation

Simple approach, motivated by the relatively simple grammar of the texts

Physicians

aim to briefly enumerate findings and their opinion

rarely use very complex noun phrases or syntax

Dictionaries of keywords collected from training data

Scope identified by naive heuristic

right scope – end of sentence

left scope – previous punctuation

(or nothing, depending on the keyword)
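
The scope heuristic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword dictionary `KEYWORDS` and the per-keyword left-scope flag are hypothetical (the slides do not list the actual dictionaries), and `find_scopes` is an invented name.

```python
import re

# Hypothetical keyword dictionary. The value says whether the keyword
# also takes a left scope ("or nothing, depending on the keyword").
KEYWORDS = {"no": False, "without": False, "unlikely": True}

SENTENCE_END = re.compile(r"[.!?]")
PUNCTUATION = re.compile(r"[,;:.!?]")


def find_scopes(text):
    """Return (keyword, left_scope, right_scope) for every keyword hit."""
    hits = []
    for keyword, has_left in KEYWORDS.items():
        pattern = r"\b" + re.escape(keyword) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            # Right scope: from the keyword to the end of the sentence.
            end = SENTENCE_END.search(text, m.end())
            stop = end.start() if end else len(text)
            right = text[m.end():stop].strip()
            # Left scope: back to the previous punctuation mark, if used.
            left = ""
            if has_left:
                marks = [p.end() for p in PUNCTUATION.finditer(text, 0, m.start())]
                start = marks[-1] if marks else 0
                left = text[start:m.start()].strip()
            hits.append((keyword, left, right))
    return hits


print(find_scopes("Normal slightly hypoventilatory chest x-ray, no pneumonia."))
```

For the slide's negation example the only hit is `no`, with `pneumonia` as its right scope, so a pneumonia code would be suppressed for that sentence.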

(15)
(16)

Exploration of inter-label dependencies

Overcoding, e.g. symptoms and diseases

C4.5 classifiers trained for false positive labels

Features: base-system labels

Extracted 5 dependencies

each expresses „Delete symptom if disease has textual evidence”
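
A dependency of the form „Delete symptom if disease has textual evidence” amounts to a simple post-processing rule on the predicted code set. The sketch below is illustrative only: the slides do not list the five extracted dependencies, so the (disease, symptom) pairs in `DEPENDENCIES` are hypothetical examples (486 pneumonia, 786.2 cough, 780.6 fever).

```python
# Hypothetical (disease, symptom) pairs; not the five rules from the slides.
DEPENDENCIES = [
    ("486", "786.2"),  # pneumonia present -> drop cough
    ("486", "780.6"),  # pneumonia present -> drop fever
]


def apply_dependencies(codes, dependencies=DEPENDENCIES):
    """Remove symptom codes whose related disease code is also predicted."""
    result = set(codes)
    for disease, symptom in dependencies:
        if disease in result:
            result.discard(symptom)
    return result


print(apply_dependencies({"486", "786.2"}))  # symptom removed
print(apply_dependencies({"786.2"}))         # nothing to remove
```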

(17)

Data-driven model

Vector Space Model

token 1-2-3 grams as features

C4.5 classifier on 45 binary classification tasks

Expanding the dictionaries:

Gathering missing synonyms, abbreviations
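
The token 1-2-3 gram features of the vector space model could be extracted roughly as follows. This is a minimal sketch assuming lowercasing and whitespace tokenization, neither of which is specified in the slides; `token_ngrams` is an invented helper name.

```python
def token_ngrams(text, max_n=3):
    """Return the set of token n-grams (n = 1..max_n) found in the text."""
    tokens = text.lower().split()  # assumed: lowercase, whitespace tokens
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features


print(sorted(token_ngrams("reactive airways disease")))
```

A multi-word term such as "reactive airways disease" then appears as a single trigram feature, which is what lets a decision-tree learner surface it as a candidate synonym for a code.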

(18)

Example of terms found

Urinary Tract Infection

uti

Asthma

reactive airways disease

Laurence-Moon-Biedl syndrome

Williams syndrome

Beckwith-Wiedemann syndrome hemihypertrophy

(19)

External knowledge (ICD) vs. Data-driven models

ICD

data independent

robust (information source is reliable)

can cover rare codes

Data-driven

can explore individual coding style (synonyms, abbreviations)

requires labeled documents

cannot handle rare codes

(20)

Added values of the subphases

System                                     Train     Eval
45-class statistical system                88.20%    86.69%
ICD                                        84.07%    83.21%
- language processing                      71.46%    70.48%
+ inter-label dependencies                 85.57%    84.85%
+ statistical enriching (synonyms)         90.26%    88.93%
Union of statistical and coding guide      90.53%    89.33%
Hand-crafted system                        90.02%    89.41%

(21)

The Turku Group in the Challenge

Language processing group at the Department of IT, University of Turku and Turku Centre for Computer Science (TUCS)

Antti Airola, Filip Ginter, Tapio Pahikkala, Sampo Pyysalo, Tapio Salakoski, Hanna Suominen

Department of Nursing Science, University of Turku

Sanna Salanterä

(22)

The Turku ICD coding system

Feature engineering

Mapping text to UMLS concepts (MetaMap)

Recognition of negation and speculation

Generalization via hypernymy

Machine learning

Primary classifier (RLS)

Secondary classifier (Ripper) – corrections of known errors made by the primary classifier

(23)

MetaMap

MetaMap identifies instances of UMLS concepts in running text

NLM’s MetaMap program

Divides running text to phrases

Each phrase is mapped into a set of UMLS concepts from specified vocabularies

(24)

MetaMap output example

Eleven year old
  Eleven, Quantitative Concept, C0205457
  Year, Temporal Concept, C0439234
  Old, Temporal Concept, C0580836

with acute leukemia
  Acute leukaemia, Neoplastic Process, C0085669

bone marrow transplant
  Bone marrow transplant, Therapeutic or Preventive Procedure, C0005961

on Jan. 2 now

with three day history
  Three, Quantitative Concept, C0205449
  Day, Temporal Concept, C0439228
  History, Occupation or Discipline, C0019664

of cough

(25)

Hypernym expansion

Hypernyms as additional features

Generalize the identified concepts along the hierarchy

Cough → Respiratory symptoms → Signs and Symptoms → …

Fever → Body temperature altered → Signs and Symptoms → …

Atelectasis → Diseases of the lung → Diseases of the respiratory system → …

Pneumonia → Diseases of the lung → Diseases of

(26)

Hypernym expansion – motivation

More accurate similarity information

Lexically, cough and fever are different

Hypernym expansion adds the information that both are symptoms

The connection can also be learned given large quantities of data

(27)

Negation and speculation

Negation, speculation, temporal information

Recognize trigger words

could, history of, likely, may, mild, minimal, no, past, possible, possibly, probable, probably, questionable, suggestive, unsure, without

Scope: Everything from a trigger word up to the end of the current sentence

All features extracted from a negated text span are marked

ICD coding guide: speculated / unsure code is not assigned
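
As a rough sketch of the trigger-word rule above, the Python below marks every token from the first trigger in a sentence to the end of that sentence with a `neg-` prefix. The trigger list is the one on the slide; the sentence splitter, the token pattern, and the choice to mark the trigger word itself are assumptions made for illustration.

```python
import re

TRIGGERS = [
    "could", "history of", "likely", "may", "mild", "minimal", "no", "past",
    "possible", "possibly", "probable", "probably", "questionable",
    "suggestive", "unsure", "without",
]
# Longer triggers first so multi-word triggers are matched as a whole.
TRIGGER_RE = re.compile(
    r"\b(?:" + "|".join(
        re.escape(t) for t in sorted(TRIGGERS, key=len, reverse=True)
    ) + r")\b"
)
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")  # assumed tokenization


def bow_with_negation(text):
    """Bag of words in which tokens inside a trigger scope get a neg- prefix."""
    features = set()
    # Assumed sentence splitter: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.lower().strip()):
        match = TRIGGER_RE.search(sentence)
        cut = match.start() if match else len(sentence)
        features.update(TOKEN_RE.findall(sentence[:cut]))
        features.update("neg-" + tok for tok in TOKEN_RE.findall(sentence[cut:]))
    return features


print(sorted(bow_with_negation("Cough. Possible pneumonia.")))
```

The classifier then sees `pneumonia` and `neg-pneumonia` as unrelated features, which is how the marking lets it learn that a speculated finding should not trigger a code.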

(28)

Hypernym expansion & negation

Hypernym expansion and negation

VALID: pneumonia → lung disease

INVALID: not pneumonia → not lung disease

Negated concepts are not expanded with hypernyms

Room for improvement

VALID: possible pneumonia → possible lung disease
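
The interaction of hypernym expansion and negation can be illustrated with a toy hierarchy. The `HYPERNYM` table below is built from the example chains on the earlier slide and stands in for the UMLS hierarchy used by the real system; the function names are invented for this sketch.

```python
# Toy child -> parent table taken from the slide examples (not real UMLS data).
HYPERNYM = {
    "cough": "respiratory symptoms",
    "respiratory symptoms": "signs and symptoms",
    "fever": "body temperature altered",
    "body temperature altered": "signs and symptoms",
    "atelectasis": "diseases of the lung",
    "pneumonia": "diseases of the lung",
    "diseases of the lung": "diseases of the respiratory system",
}


def ancestors(concept):
    """Follow the hierarchy upwards and return all hypernyms of a concept."""
    chain = []
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        chain.append(concept)
    return chain


def expand(concepts):
    """concepts: iterable of (name, negated). Negated concepts are not expanded."""
    features = set()
    for name, negated in concepts:
        if negated:
            features.add("neg-" + name)
        else:
            features.add(name)
            features.update(ancestors(name))
    return features


print(sorted(expand([("cough", False), ("pneumonia", True)])))
```

With this table, cough and fever share the feature "signs and symptoms", while a negated pneumonia contributes only `neg-pneumonia` and no lung-disease hypernym.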

(29)

Feature engineering

Final set of features entering the classifier

Text tokens

No particular order: Bag-of-Words (BoW) model

Marked with neg- whenever negated

Set of UMLS concepts (their c-codes) extracted with MetaMap

Marked with neg- whenever negated

Set of hypernyms of the extracted UMLS concepts

Included only for non-negated concepts

(30)

Classification

RLS (regularized least-squares) classifier

Maximal-margin, kernel-based classifier

Close relative of Support Vector Machines (SVMs)

Linear kernel (fast & worked well)

One classifier for each code

“1 versus all” classification

May lead to no codes assigned or an impossible combination of codes
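
For reference, a standard textbook formulation of linear regularized least-squares in a one-versus-all setting is given below. The notation is an assumption, not taken from the slides: $x_i$ is the feature vector of report $i$, $X$ stacks the $x_i$ as rows, $y^{(c)} \in \{+1,-1\}^n$ marks whether code $c$ is assigned, and $\lambda > 0$ is the regularization parameter. The zero decision threshold is likewise assumed.

```latex
% One linear RLS model per ICD code c (assumed notation, see lead-in).
\[
  w_c \;=\; \arg\min_{w} \sum_{i=1}^{n} \bigl( y_i^{(c)} - w^{\top} x_i \bigr)^2
            + \lambda \lVert w \rVert^2
      \;=\; \bigl( X^{\top} X + \lambda I \bigr)^{-1} X^{\top} y^{(c)}
\]
% Predicted code set for a new report x: all codes whose model fires.
\[
  \hat{C}(x) \;=\; \bigl\{\, c \;:\; w_c^{\top} x > 0 \,\bigr\}
\]
```

Because each $w_c$ decides independently, $\hat{C}(x)$ can be empty or can contain a combination of codes that never occurs together, which is the failure case named on the slide.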

(31)

Correcting known errors

Cascaded classifier attempts to correct known errors

Empty or impossible combinations

RIPPER

Decision rules

A very different paradigm from RLS

Trained and applied exactly as the first classifier

1 vs. All

Known errors made by the second classifier left uncorrected
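
The cascade can be sketched as a simple fallback rule. In this illustration an "impossible" combination is assumed to mean a code set that does not occur among the known combinations (for example those seen in the training data); the slides do not define the term precisely. The two classifiers are toy stand-ins, not RLS or RIPPER.

```python
def cascade(report, primary, secondary, known_combinations):
    """Use the primary prediction unless it is empty or an unknown combination."""
    codes = frozenset(primary(report))
    if codes and codes in known_combinations:
        return codes
    # Known error of the primary classifier: fall back to the secondary one.
    return frozenset(secondary(report))


# Hypothetical set of allowed code combinations.
known = {frozenset({"786.2"}), frozenset({"486"}), frozenset({"780.6", "786.2"})}


# Toy stand-ins for the two one-vs-all classifiers (hypothetical behaviour).
def primary(report):
    return {"486", "786.2"} if "pneumonia" in report else set()


def secondary(report):
    return {"486"} if "pneumonia" in report else {"786.2"}


print(cascade("cough, pneumonia", primary, secondary, known))
print(cascade("cough", primary, secondary, known))
```

As on the slide, the output of the second classifier is accepted as is, even if it is itself empty or an unknown combination.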

(32)

Using ICD-9 in training

ICD-9 definitions as training instances

Concatenate the textual definitions of each of the 45 codes and its parents in the ICD hierarchy

Same “generalization” idea as previously

Extract features in the standard way

Pool the resulting 45 training instances with the challenge training data
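
A minimal sketch of building such definition-based training instances follows. The `ICD_FRAGMENT` table is a small, abbreviated excerpt written out by hand for illustration, and the `(text, code set)` instance format is an assumption; the slides do not show the actual data structures.

```python
# Illustrative excerpt: code -> (definition text, parent code or None).
ICD_FRAGMENT = {
    "786.2": ("Cough", "786"),
    "786": ("Symptoms involving respiratory system and other chest symptoms", "780-799"),
    "780-799": ("Symptoms, signs, and ill-defined conditions", None),
}


def definition_instance(code, hierarchy=ICD_FRAGMENT):
    """Concatenate the definitions of a code and all of its parents."""
    parts = []
    node = code
    while node is not None:
        text, node = hierarchy[node]
        parts.append(text)
    return " ".join(parts), {code}


def pool(challenge_data, codes, hierarchy=ICD_FRAGMENT):
    """Append one definition-based instance per code to the training data."""
    return list(challenge_data) + [definition_instance(c, hierarchy) for c in codes]


training = pool([("Cough. History of pneumonia.", {"786.2"})], ["786.2"])
for text, labels in training:
    print(labels, "<-", text)
```

The pooled instances would then go through the same feature extraction as ordinary reports.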

(33)

Turku system: Summary

[Pipeline diagram]

FEATURE EXTRACTION: Source text → Tokenization → Source text tokens; Source text → MetaMap → Set of UMLS concepts → UMLS hypernym expansion (UMLS hierarchy) → Extended set of UMLS concepts; Negation and speculation detection

CLASSIFICATION: RLS classifier (1 vs. All) → Set of ICD codes; possible combination → Final set of ICD codes; impossible combination → RIPPER classifier (1 vs. All) → Final set of ICD codes

(34)

Turku system: Component contribution

Cross-validated performance on training data

Component                 Fmicro    Error    Relative gain
RLS (initial)             79.3      20.7
Tokenization              80.7      19.3     7%
UMLS mapping              82.5      17.5     9%
UMLS hypernyms            83.4      16.6     5%
Negation/speculation      84.7      15.3     8%
Cascaded Ripper           86.5      13.5     12%
ICD-9 training data       86.6      13.4     1%
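
The relative gain figures are consistent with reading "Error" as 100 minus Fmicro and "Relative gain" as the fraction of the previous step's error that the new component removes. That interpretation is inferred from the numbers rather than stated on the slide; the short check below reproduces the reported percentages under it.

```python
# (component, Fmicro) pairs as reported on the slide, in cumulative order.
STEPS = [
    ("RLS (initial)", 79.3),
    ("Tokenization", 80.7),
    ("UMLS mapping", 82.5),
    ("UMLS hypernyms", 83.4),
    ("Negation/speculation", 84.7),
    ("Cascaded Ripper", 86.5),
    ("ICD-9 training data", 86.6),
]


def relative_gains(steps):
    """Return (name, error, relative error reduction vs. previous step)."""
    rows = []
    previous_error = None
    for name, f_micro in steps:
        error = 100.0 - f_micro
        gain = None
        if previous_error is not None:
            gain = (previous_error - error) / previous_error
        rows.append((name, round(error, 1), gain))
        previous_error = error
    return rows


for name, error, gain in relative_gains(STEPS):
    shown = "" if gain is None else f"{gain:.0%}"
    print(f"{name:22s} {error:5.1f} {shown}")
```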

(35)

Turku vs. Szeged: Crucial differences

Turku system

Pure machine learning

ICD-9 definitions used as training examples, with 0.1 percentage point improvement

No explicit use of ICD-9 coding guidelines

Heavy reliance on UMLS: MetaMap, hypernyms

Szeged system

Challenge system: rule-based

Replicated via machine learning

ICD-9 definitions and coding guidelines are the core of the system

No external resources beyond

(36)

Turku vs. Szeged: Crucial differences

Szeged system allows individual ICD code deletion

“if code X is given, delete code Y”

Turku system rejects the whole code combination and applies a different classifier

Paradoxically, no gain from using the Szeged finer ICD code handling on top of Turku results (0.3 percentage point F-score decrease)

E.g. false positive disease code causes a true positive symptom code to be removed

Use of hypernym expansion

More detailed negation/speculation/temporal detection in Szeged system

(37)

Language specifics

CMC challenge was on English text

How about other languages?

Szeged system

Needs translated ICD

Language-adapted negation/speculation detection

Turku system

Needs translated UMLS resources and MetaMap

Many of the features are language-independent UMLS c-codes

Language-adapted negation/speculation detection

Both systems rely on string search in one way or another

(38)

Crucial differences (cont.)

Different approach to design

Turku system

Classifier-centric

“Extract all thinkable features and feed them into a state-of-the-art classifier”

Szeged system

Data-centric

Build from the available resources (ICD and training data) and use classifiers with interpretable models

Study the mistakes and the model, correct errors

(39)

CMC challenge results: The big picture

Best F-score 89.1 (Szeged system)

Mean F-score 76.7 (σ = 13.4)

Turku and Szeged baselines

Szeged: 83.2% F – bare system with just NLP and ICD but no other direct use of the training data

Turku: 80.7% F – bare machine learning system with no data preprocessing of any kind (only whitespace tokenization)

About half of the challenge submissions stayed below these baseline systems!
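
The Turku component table labels its metric Fmicro. Assuming the F-scores quoted here are micro-averaged over all code assignments in the same way, the measure can be sketched as below; the toy gold and predicted code sets are invented for illustration.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-document code sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # codes correctly assigned
        fp += len(p - g)   # codes assigned but not in the gold standard
        fn += len(g - p)   # gold codes that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy data: two documents, one spurious code and one missed code.
gold = [{"786.2"}, {"486", "780.6"}]
predicted = [{"786.2", "518.0"}, {"486"}]
print(round(micro_f1(gold, predicted), 3))
```

Micro-averaging pools the counts over all documents and codes, so frequent codes dominate the score and rare codes contribute little.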

(40)

CMC challenge: Lessons learned

General observations across all submissions

Presented by Pestian et al., ACL’07 BioNLP workshop, 2007

Based on short system descriptions (not publicly available)

1. Best systems explicitly took into account negation and speculation

2. Better systems frequently worked with hypernym and synonym detection

3. Significant amount of symbolic processing

(41)

CMC challenge: Lessons learned

5. Careful, medically-informed feature engineering common

6. SVM and related state-of-the-art classification algorithms were strongly represented, but not reliably predictive of high ranking

Turku development observation: a number of “traditional” classifiers matched RLS

(42)

Beyond the ICD coding

Similar NLP tasks

The same architecture can be used

Find the relevant parts of the documents

Find relevant phrases (synonyms, abbreviations)

simple string-matching with a particular dictionary

Prototype tasks:

The i2b2 „obesity” challenge

Smoking status detection

(43)

The i2b2 „obesity” challenge

Who's obese and what co-morbidities do they (definitely/likely) have?

Informatics for Integrating Biology and the Bedside (i2b2)

February – June 2008

730 training and 507 evaluation documents

multi-label problem, 16 morbidities

(44)

Comparison

Focusing on several morbidities (can be matched to a set of ICD codes)

Longer documents (average length: 130 rows)

More noise

„The patient has a positive family history of coronary disease”

Negation/speculation detection is highlighted

(Y/N/Q/U F-macro)

(45)

Smoking status detection

i2b2 challenge – 2006

The patient in question is

SMOKER, NON-SMOKER, PAST-SMOKER or smoker status UNKNOWN

inter-annotator agreement ~85%

398 train and 104 eval documents

Small dictionaries: smoke, tobacco, etc.

best systems 88%

(46)

Final thoughts on ICD coding

Some clear advantages

lower costs

less error-prone processing of simpler cases

Fully automatic system is impossible (nowadays)

Far away from human intelligence

will not solve rare, harder cases

Right middle and probable right lower lobe pneumonia.

(47)

The place of an automatic system

Pre-labeling/highlighting to speed up manual coding

prediction along with confidence measure

Validation

suggesting erroneous / missed codes

monitoring for health insurance companies

Automated coding of large datasets

(48)

Tasks to be solved…

Extending systems to thousands of codes

If a corpus of appropriate size is available…

Incorporating more expert knowledge into the statistical methods

user-friendly interfaces

„interactive” systems

Better language processing

Corpus for developing sophisticated scope detectors: BioScope (released June 2008)

www.inf.u-szeged.hu/rgai/bioscope

(49)

Open questions

Is there „the” coder, or does every institute have its own individual coding style?

How to transfer among languages?

Is there any drop in accuracy

on other languages (free word order in Hungarian)

on other domains (nursing notes)?

What is the real speed-up of an automatic pre-coding/suggestion system?

(50)

Open questions (cont.)

More training data needed to scale the systems up

Hospitals have the data but privacy concerns prevent its dissemination to companies / NLP researchers who build the system

Training data generally cannot be reconstructed from trained machine-learning systems

Distribute an “empty” system?

Legal issues?
(51)

Multilingual ICD tagging: summary

Basic NLP tools

Tokenizer

Lemmatizer

Tagger, phrase parser (in some approaches)

Need domain adaptation

Controlled domain vocabulary resources

Term variants (e.g. synonyms and abbreviations)

Generally scarce

Ideally within a large framework such as UMLS

Allowing tool re-use

(52)

Basic NLP resources

Tokenizer

Preferably domain-adapted

Very poor language standards in some clinical documents

Lemmatizer

Case in point: FinTWOL and nursing narratives

Basic FinTWOL extended by Lingsoft with ~3500 domain words

Recognition rate grew from 83.1% to 90.7%

That corresponds to a 42% decrease in unrecognized running words

Hungarian: lemmatizers exist but are not domain adapted due to data privacy concerns

Researchers who are able to adapt the lemmatizers do not have appropriate data access permissions

(53)

References

1st place: Farkas, R., & Szarvas, G. (2008). Automatic construction of rule-based ICD-9-CM coding systems. BMC Bioinformatics, 9(Suppl 3), S10.

2nd place: Crammer, K., Dredze, M., Ganchev, K., & Talukdar, P. P. (2007). Automatic code assignment to medical text. Proceedings of ACL’07 BioNLP workshop.

3rd place: Suominen, H., Ginter, F., Pyysalo, S., Airola, A., Pahikkala, T., Salanterä, S., & Salakoski, T. (2008). Machine Learning to Automate the Assignment of Diagnosis Codes to Free-text Radiology Reports: a Method Description. Proceedings of the ICML/UAI/COLT Workshop on Machine Learning for Health-Care Applications.

Challenge description: Pestian, J. P., Brew, C., Matykiewicz, P., Hovermale, D., Johnson, N., Cohen, K. B., & Duch, W. (2007). A shared task involving multi-label classification of clinical

http://www.computationalmedicine.org/challenge/

www.inf.u-szeged.hu/rgai/bioscope
