• No results found

Evaluation of Diagnostic Tests

N/A
N/A
Protected

Academic year: 2021

Share "Evaluation of Diagnostic Tests"

Copied!
39
0
0

Loading.... (view fulltext now)

Full text

(1)

Biostatistics for Health Care Researchers: A Short Course

E

l

ti

f Di

ti T

t

Evaluation of Diagnostic Tests

Presented by:ese ed by Siu L. Hui, Ph.D.

Department of Medicine, Division of Biostatistics Indiana University School of Medicine

(2)

Objectives

• Calculate and interpret sensitivity, specificity, predictive value positive, and predictive value negative.

• Understand the principle behind ROC curves.

• Understand the use of kappa and intraclass correlation coefficients to measure agreement.

(3)

Examples

p

• Screening

L b (BMP li id ) – Lab (BMP, lipids)

– Imaging (bone density, mammogram)

– Questionnaires (depression dementia QOL)Questionnaires (depression, dementia, QOL)

• Diagnostic

– Blood tests for infectionBlood tests for infection – Imaging (X-ray, CT, MRI) – Histology

(4)

Outline

Outline

• Accuracy of a diagnostic test:

- Binary data: sensitivity and specificity

predictive value positive and negative

Ordinal or continuous data: ROC curve

- Ordinal or continuous data: ROC curve

• Measure of the agreement of two tests: • Measure of the agreement of two tests:

- nominal or ordinal data: Kappa

(5)

Example 1: Staging of Prostate Cancer with MRI

a p e

Stag g o

ostate Ca ce

t

Tempany, et al. (1994) studied the accuracy of p y, ( ) y

conventional MRI in detecting advanced stage prostate cancer.

• Disease: advanced stage prostate cancer.g p • Test: conventional MRI.

• The true disease status was established by surgery.

Question: How accurate is conventional MRI in detecting advanced stage prostate cancer?

Tempany, Zhou, Zerhouni, Rifkin, Quint, Piccoli, Ellis, and McNeil (1994). “Staging of Prostate Cancer: Results of Radiology Diagnostic Oncology Group Project Comparison of Three MR Imaging Techniques” Radiology, 192:47-54.

(6)

Example 1 (continued)

p

(

)

Disease\MRI T+ T- total

D+ 70 45 115

D- 32 53 85

(7)

Accuracy of a Test

Accuracy of a Test

• True Disease Status:

Disease (D+) and non-disease (D-) by gold standard

{Advanced stage vs. early stage by surgery}

• Test Result:Test Result:

Positive (T+) and negative (T-) results from a test of

interest.

{Advanced stage vs early stage as assessed by MRI} {Advanced stage vs. early stage as assessed by MRI}

(8)

Example 1 (continued)

p

(

)

Disease\MRI T+ T- total

D+ 70 45 115

D- 32 53 85

(9)

Overall Accuracy

Overall Accuracy

• N = total # of cases • N = total # of cases

• A = # of correctly diagnosed cases • Overall accuracy = A/N

(10)

Example 1 (continued)

p

(

)

Disease\MRI T+ T- total D+ 70 45 115 D- 32 53 85 total 102 98 200 Overall Accuracy = (70 + 53)/200 = 0.615 ~ 62%

(11)

Sensitivity and Specificity

Sensitivity and Specificity

• Sensitivity (Sens) - the ability of a test to give a positive y ( ) y g p

finding when the person tested truly has the disease under study.

• Specificity (Spec) - the ability of a test to give a

negative finding when the person tested truly is free of the disease under study.

(12)

Sensitivity and Specificity

Sensitivity and Specificity

# of T and D

Sens

P(T |D )

# f D

    

# of D

# f T

d D

  

# of T and D

Spec

P(T |D )

# of D

  

(13)

Sensitivity and Specificity

Sensitivity and Specificity

• Sensitivity: True positive rate (TPR)

S ifi it T ti t (1 FPR)

• Specificity: True negative rate (1-FPR) where FPR=false positive rate where FPR=false positive rate

(14)

Example 1 (continued)

p

(

)

Disease\MRI T+ T- total D+ 70 45 115 D- 32 53 85 total 102 98 200 Sens = 70/115 = 61%

(15)

Example 1 (continued)

p

(

)

Disease\MRI T+ T- total D+ 70 45 115 D- 32 53 85 total 102 98 200 Spec = 53/85 = 62%

(16)

I S iti it d S ifi it t

• In summary, Sensitivity and Specificity are two intrinsic properties of a test.

BUT, clinical providers have to infer a patient’s disease status from test results.

• How well can a given test result of a patient predict the

disease status of the patient?

{H lik l ti t ith iti MRI lt t ll

{How likely a patient with positive MRI result actually

(17)

Predictive Values

• Predictive value positive (PV+) is the probability that

a patient with a positive test result actually has the disease:

# of diseased patients with a positive test

• Predictive value negative (PV-) is the probability that

p p

# of patients with a positive test

PV

g ( ) p y

a patient with a negative test does not have the disease:

# of non-diseased patients with a negative test # of patients with a negative test

(18)

PV

+

and PV

-PV and -PV

• Both PV+ and PV- depend on the sensitivity, the p y,

specificity and the disease prevalence (Prev).

Sens Prev

 

P |

PV D T

Sens Prev + 1 Spec 1 Prev        

Spec Prev

 

P | PV D T

Spec Prev + 1 Sens 1 Prev 

 

(19)

Example 2: A Screening Test for a Rare

Di

Disease

Disease\Test T+ T- total Disease\Test T T total D+ 19 1 20 D 99 1881 1980 D- 99 1881 1980 total 118 1882 2000 • Disease prevalence=20/2000=1%.

(20)

Example 2: A Screening Test for a Rare Disease

Example 2: A Screening Test for a Rare Disease

Disease\Test T+ T- total D+ 19 1 20 D- 99 1881 1980 total 118 1882 2000 • Disease prevalence=20/2000=1%. • Sens=19/20=95% • Spec=1881/1980=95%.

(21)

Example 2: A Screening Test for a Rare

Di

Disease

Disease\Test T+ T- total Disease\Test T D+ 19 1 20 D 99 1881 1980 D- 99 1881 1980 total 118 1882 2000 • Disease prevalence=20/2000=1%. • Sens=19/20=95%, Spec=1881/1980=95%. • PV+=19/118=16.1% .

(22)

Example 2: A Screening Test for a Rare Disease

p

g

Disease\Test T+ T- total Disease\Test T D+ 19 1 20 D 99 1881 1980 D- 99 1881 1980 total 118 1882 2000 • Disease prevalence=20/2000=1%. • Sens=19/20=95%, Spec=1881/1980=95%. • PV+=19/118=16.1%, PV-=1881/1882=99.9%.

(23)

Example 2 (continued)

Prev(%) PV+ (%) PV- (%) 1 16 1 99 9 1 16.1 99.9 5 50.0 99.7 20 82.6 98.7 50 95.0 95.0 75 98.3 83.7 • Sens = Spec = 95%.

• Note how the prevalence affects PV+ and PV-.

• When does it not make sense to screen? • When does it not make sense to screen?

(24)

• What if your test gives a continuous

What if your test gives a continuous

reading rather than +/- ?

(25)

Example 3: Blood Test for Disease

ID D T ≤1 ≤2 ≤3 ≤4 ID D T ≤1 ≤2 ≤3 ≤4 1 - 0.2 - - - -2 - 0.7 - - - -3 1 8 3 - 1.8 + - - -4 - 2.0 + - - -5 - 3.1 + + - -6 - 3.3 + + + -7 + 1.5 + - - -8 + 2.4 + + - -9 + 3.0 + + + -10 + 3.1 + + + -11 + 3.8 + + + -12 + 4.0 + + + -Sens 1.00 0.83 0.67 0.00 Spec 0.33 0.67 0.83 1.00 FPR 0.67 0.33 0.17 0.00

(26)

ROC Curve

(Receiver Operating Characteristic Curve) (Receiver Operating Characteristic Curve)

• When is it applied?

Test results are continuous and we may have more – Test results are continuous and we may have more

than one possible cutoff points, or

– We have multiple degrees of suspicion for a given a

t t ( di l d fi it l di

test (ordinal response, e.g. definitely no disease,

probably no disease, probably disease, and definitely disease).

• How is it plotted? FPR on the horizontal axis and TPR on the vertical axis.

Recall: True positive rate (TPR) = Sens

(27)

ROC Curve

(Receiver Operating Characteristic Curve) (Receiver Operating Characteristic Curve)

• When is it applied?

Test results are continuous and we may have more – Test results are continuous and we may have more

than one possible cutoff points, or

– We have multiple degrees of suspicion for a given a

t t ( di l d fi it l di

test (ordinal response, e.g. definitely no disease,

probably no disease, probably disease, and definitely disease).

• How is it plotted? FPR on the horizontal axis and TPR on the vertical axis.

Recall: True positive rate (TPR) = Sens

(28)

Example 3 (continued)

1.0 0.8 SENS 4 0.6 0.2 0. 4 0.0 0 FPR=1-SPEC 0.0 0.2 0.4 0.6 0.8 1.0

(29)

ROC Curves

• Why do we need the ROC curve?

– Displays Sens (benefit) and FPR (cost) under

different thresholds so decision makers can choose the appropriate threshold for their situation.

the appropriate threshold for their situation.

– How to choose a threshold in practice? Based on how the test is used e.g. screening vs. diagnostic – Provides the ability to compare two or more

(30)

ROC Curve Comparison

1.0 0.8 SENS 0.6 2 0.4 0.0 0. 2 FPR=1-SPEC 0.0 0.2 0.4 0.6 0.8 1.0

(31)

Comments

Comments

• We can compare the accuracies of two tests if we know p the gold standard.

(32)

Reliability

Reliability

• When the gold standard is not available, a test is g ,

considered reliable if it agrees with another reference test.

A t t i id d li bl if it id hi h i t • A test is considered reliable if it provides higher

inter-rater and intra-inter-rater (or inter-assay and intra-assay) agreement.

• Kappa: used for nominal or ordinal data agreement • ICC: used for continuous data agreement

(33)

Example 4: Biphasic Radiography vs. Fiberoptic

E d i G t i Ul

Endoscopy in Gastric Ulcers

Endo\Radio No Ulcer Ulcer total

No Ulcer 351 4 355

Ulcer 7 12 19

total 358 16 374

Shaw, van Romunde, Griffioen, Janssens, Kreuning, Eilers (1987).

total 358 16 374

“Peptic Ulcers and Gastric Carcinoma: Diagnosis with Biphasic Radiography Compared with Fiberoptic Endoscopy” Radiology, 163:39-42.

(34)

Kappa (

pp ( )

)

• Po=observed agreement.

• Pe=agreement expected by chance. • P P =agreement beyond chance • Po-Pe=agreement beyond chance.

• 1-Pe=the maximum agreement possible beyond chance chance.

P

P

e e o

P

1

P

P

(35)

An Interpretation of Kappa

Kappa Strength of Agreement

<0.00 poor 0 00-0 20 Slightly poor 0.00 0.20 Slightly poor 0.21-0.40 Fair 0.41-0.60 Moderate 0.61-0.80 Substantial 0.81-1.00 Almost perfect 0.81 1.00 Almost perfect

(36)

Example 4 (continued)

Example 4 (continued)

Endo\Radio No ulcer Ulcer total

No Ulcer 351 4 355

Ulcer 7 12 19

Ulcer 7 12 19

total 358 16 374

• Total observed agreement = (351+12)/374 = 97%.Total observed agreement (351 12)/374 97%. • =.67 (Substantial).

(37)

Intraclass Correlation

• ICC compares the variability of a trait between subjects p y j to the total variation across all ratings and all subjects. • As the variability of a trait between subjects increases • As the variability of a trait between subjects increases

(38)

Example 5: Faculty Ratings of Residents’

Performances

Resident Faculty1 Faculty2

1 77 81 2 80 79 3 55 60 4 91 88 4 91 88 5 60 62 6 80 84 ICC= 96 ICC .96

(39)

Summary

Summary

• Measure of accuracy with gold standard:y g

– Sens, Spec, PV+, PV-– ROC curve

• Measure the agreement of two tests in the absence of gold standard:

gold standard: – Kappa

References

Related documents

 Detail each phase of care in upper limb

4 Preparation Unit for Chemical Reagents 1 V=4 cu.m Mud Agitatir, Chemical

Using ab initio modeling we demonstrate that H atoms can break strained Si ─ O bonds in continuous amorphous silicon dioxide ( a -SiO 2 ) networks, resulting in a new defect

I Independent causes for effect I Cyclic causal relations I Absence of a cause. I Different view on

Italy: United Nations World Food Programme. World Food Summit Plan of Action.. Smallholder characteristics in each study area.. Features of Peruvian smallholder clusters

Summary of evidence 24 We found that preterm delivery is associated with an increased maternal risk for future incident cardiovascular events, cardiovascular

In other parts of the globe, warming of forests, wetlands, grasslands and agricultural soils, increases the metabolism of plants and soil microbes, so that they store less carbon

the recent legislative initiatives are of concern not because of the powers of the Polish legislator to structure the judicial system in Poland but because the Polish