Visualizing and analyzing time to event data: Lifting the veil of censoring

(1)

Visualising and analysing

time-to-event data: lifting the veil of

censoring

Patrick Royston

Cancer Group, MRC Clinical

Trials Unit, London

(2)

A poet writes about censored observations:

Last night I saw upon the stair

A little man who wasn't there

He wasn't there again today

Oh, how I wish he'd go away!

From

Antigonish

(1899)

(3)

Outline

• Why is censoring of time-to-event data an

issue?

• Example in breast cancer

• Visualisation of censored data using

model-based imputation

• Multiple imputation and analysis of

survival data with missing covariate

observations

(4)

Why is censoring an issue?

• You can’t picture the raw data easily

• Reliance on Kaplan-Meier plots

Exaggerates differences between groups

Attracts attention to unreliable survival

estimates at extreme times

• Data will be analysed using Cox model

Still the almost-automatic choice – although

decent alternatives exist

• Time is “forgotten about” in the Cox model

(5)

• Results of Cox regression models are

usually expressed as (log) hazard ratios

Indirect – not dealing directly with time

Can be hard to interpret – different effect on

survival curves at high and low survival probs

Particularly difficult for interactions – ‘ratio of

hazard ratios’

• Non-proportional hazards

Data with long-term follow-up typically have it

Modelling and interpretation may be complex

(6)

Example: Primary

node-positive breast cancer

• GBSG trial BMFT-2

• 686 patients, 299 events for

recurrence-free survival (RFS)

• Patients assigned to hormonal therapy

(TAM) or not

• Visualise the effect of TAM on RFS

• Visualise interaction between TAM and ER

(estrogen receptor status)

(7)

Traditional visualisation:

Kaplan-Meier by TAM group

0.00

0.25

0.50

0.75

1.00 S(t)

0

2

4

6

8 Recurrence-free survival time, yr

hormone = No TAM

hormone = TAM

(8)

Dot plot by TAM therapy –

unhelpful with censored data

000000 1011000 01111110 11111111110 11100011110111111111 1111111111111111111 11111111010011111111111 1011110011111111111110000 00110010101111111011011 000110110010110 1010011001011010010111011100 0100101011111110101 010001000111 1011000000111011 0000101010001 101011010110100 1000111001001100000000 11010101 0001010001110000 10000000000101 00100010 01001000001000 0001100000000100 0100010000000 11000000010000 00000000 0000000001 0000001001 00000001 00000000 010 0000 100 0 0 000 11100011 1010 011011 10111111 1000111 1111111111111111 11110 0111101 10011010110000 10100111000 00101001 000101 0101100 1001001100 0101 10000 100001100 11010 0000000111 0000 000000110 0001000000000 00000000000010000 0000001 110000100 010000011 000 000000 0 00010 000 00 000 0

0

2

4

6

8 R

e

cu

rre

n

ce

-f

re

e

su

rvi

va

l t

ime

,

yr

No TAM

TAM

Hormone therapy, 1=no, 2=yes

(9)

How better to visualise

survival times?

• To make progress with visualisation, aim to

impute the “missing” part of censored times

• Assume a parametric distribution of survival time

• Survival times are sometimes approximately

lognormally distributed (Royston 2001a)

Can check by using modified Normal Q-Q plot

• If lognormal approximation is not good, can

consider Box-Cox transformation of time

(10)

Assessing lognormality:

modified Normal Q-Q plot

• Simple transformation of Kaplan-Meier survival curve

.2

.5

1

2

5

10 Recurrence-free survival time (log scale)

-3

-2

-1

0

1

(11)

Normal Q-Q plot by TAM group

.2

.5

1

2

5

10 Recurrence-free survival time (log scale)

-3

-2

-1

0

1 Normal equivalent deviates

(12)

Visualisation of censored data

using imputation

• Create m (

≥

1) copies of the data with

censored survival times imputed

• Need an imputation model to reflect

Distribution of times (e.g. lognormal)

Effects of covariates (prognostic factors)

• Creating an imputation model:

Use

mfp

with cnreg (censored normal regrn.)

to model poss. non-linear effects of covariates

E.g.

mfp cnreg lnt x1 x2 x3 x4a x4b x5 x6

x7 hormone, censored(c) select(1)

(13)

Creating the imputed

dataset(s)

• Can use the

ice

multiple imputation

command to create the imputations

Royston (2004, 2005a, 2005b)

Stata J

• ice

varlist

using

filename

[.dta]

[if

exp

] [in

range

] [

weight

],

[m(

#

) cmd(

cmdlist

) cycles(

#

)

boot[(

varlist

)] seed(

#

) dryrun

eq(

eqlist

) passive(

passivelist

)

substitute(

sublist

) dropmissing

(14)

Interval censoring with

ice

• gen ll = lnt

• gen ul = cond(_d==1, lnt, ln(50))

// chose upper limit of 50 years for

RFS: can use . for

+

∞

• (generate FP transformations of

x1

,

x5

,

x6

)

• ice x1_1 x2 x3 x4a x4b x5_1 x6_1 x7

hormone ll ul lnt using imputed.dta,

interval(lnt:ll ul) m(10)

(15)

How

interval()

works

Censored obs Upper limit Complete obs -4 -3 -2 -1 0 1 2 3 4

(16)

Code fragment from uvis.ado

`cmd' `yvarlist' `xvars' `wgt',

`options'

...

if "`cmd'"=="intreg" {

tempvar PhiA PhiB

gen `PhiA‘ = cond(missing(`ll'), 0,

norm((`ll'-`xb')/`rmsestar'))

gen `PhiB‘ = cond(missing(`ul'), 1,

norm((`ul'-`xb')/`rmsestar'))

replace `yimp‘ = `xb‘

+`rmsestar'invnorm(`u'

(`PhiB'-`PhiA')+`PhiA')

}

(17)

Uses of the

interval()

option

• Impute right-, left- or interval-censored

outcomes

Response variable in time-to-event studies

• Impute when a covariate is sometimes

partly observed, sometimes complete

Some observations recorded exactly

Others known to be below or above a cutoff

E.g. D-dimer in DVT, PgR/ER in breast cancer

• Interval censored covariates

(18)

Breast cancer data: visualisation

of time to recurrence

0 0 0000 0 0 0 0000 10 1 101 1010 11110111111010 11011011111 111010111111111000110 11111011111111111111111111111 111110101111111011111011001111111111111 1101111101100111001110111101011111111011111010110110011111111 0100011111010100000111101000110101110011001110110010010010100 100100110111110111111001111110100000010001001000101100110110101000000 0010100000100101011001001011100011100000110000100111001101110110101 10010000101000001000001000100010100000000110100000001011011100100010001011001100010 000100101010000001100000010000000010100000000010100000010000000000000001000000000001000000000000001000100000010 00100000000110010100000000000000011000010001000000000000010001001000000000000 00001000000000010000 1 1 11 11 11111111111 1111111011 1111111110111111 1111111111111111111111111111 11111111111111111111111111110111101 111111111101111111111111111111111111101111111111 100111111111111111111111111111101 01101011011111111010111111111101101011111011 10101011001111111010111111111111101111 001011110101100110001010011110001100111011011010 00101000000010001010100101110110100110000010001000 0000000010001100001000100001000000000000000000110001010000010000100 0000000000000000000000000000000000010000001000000000 00000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000 000000000000000000000000000000000 0000000000000000000000000 000000000000000000000000 000000000000000 0000000 00000000 0

.0

5 .1

.2

.5

1

2

5

10

20

50 Recu

rr

en

ce-fr

e

e sur

v

ival tim

e

, yr

0

1 imputation number

fig_response

(19)

Visualisation: some plots using

the first imputed sample

.2 .5 1 2 5 10 20 50

Recurrence-free survival time (log scale)

No TAM TAM

Hormone therapy

.2 .5 1 2 5 10 20 50

Recurrence-free survival time (log scale)

0 10 20 30 40 50

(20)

Visualisation: treatment by

covariate interaction

.2

.5

1

2

5

10

20

50 Recurrence-free survival time (log scale)

ER neg, no TAM

ER neg, TAM

ER pos, no TAM

ER pos, TAM

(21)

Limitations

• Imputed times to event are helpful for

visualisation, but less so for analysis

Effectively, such imputations are

extrapolations into the future

We don’t

know

the future distribution

Estimates of means, SD’s, regression coeffs

etc. are heavily dependent on the distributional

assumptions

Potential for bias if assumed distr’n is wrong

• Imputed times may be unrealistic

(22)

Other approaches

• A reasonably large literature exists

• Buckley-James estimation (Buckley & James

1979)

Estimates the mean of the censored part

Not so good for visualisation

• Wei & Tanner (1991)

Two algorithms which give multiple imputations of the

censored part

Relaxes the normality assumption – samples taken from

the distribution of the residuals

• stpm

(Royston 2001b, Royston & Parmar 2002)

(23)

Imputation of survival data with

missing covariate observations

• So far, have assumed covariates have complete data

• If covariates have

missing data

, need a suitable algorithm

for multiple imputation of all missing values

e.g. MICE (

ice

)

• To reduce bias, must include the response (time-to-event)

in the imputation model

How?

• “Standard” approach is to include (censored)

log time

and

the

censoring indicator

in the imputation model

No theoretical justification

• May be better to

Include covariates as usual

Impute right-censored times using

ice

with

interval()

option

(24)

Analysis of survival data with

missing covariate observations

• Disregard the imputed times in the MI

dataset

Except for visualisation purposes

• Use original time and censoring indicator

• Can analyse the MI dataset using

stcox

(Cox regression)

streg

(several models available)

stpm

(flexible parametric survival models)

(25)

Conclusions

• Use of familiar graphical tools with imputed times

to event can give greater insight into censored

survival data

Scatter plots, smoothers, etc

• Treatment or prognostic effects may be

depressingly small when displayed as scatter

plots of times

Much overlap between groups

Weak regression relationships

• Imputation of times may be helpful in multiple

imputation with missing covariate values

(26)

Some references

• Buckley J, James I (1979) Linear regression with censored data. Biometrika 66: 429-436

• Faucett CL, Schenker N, Taylor JMG (2002) Survival analysis using auxiliary variables via multiple imputation, with application to AIDS clinical trial data. Biometrics 58: 37-47.

• Ma SG (2006) Multiple augmentation with partial missing regressors. Biometrical Journal 48: 83-92 • Pan W (2000) A multiple imputation approach to Cox regression with interval-censored data.

Biometrics 56: 199-203

• Royston P (2001a) The lognormal distribution as a model for survival time in cancer, with an emphasis on prognostic factors. Statistica Neerlandica 55: 89-104

• Royston P (2001b) Flexible alternatives to the Cox model, and more. The Stata Journal 1:1-28. • Royston P, Parmar MKB (2002) Flexible proportional-hazards and proportional-odds models for

censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine 21: 2175-2197

• Royston P (2004) Multiple imputation of missing values. Stata Journal 4: 227-241

• Royston P (2005a) Multiple imputation of missing values: update. Stata Journal 5: 188-201

• Royston P. (2005b) Multiple imputation of missing values: update of ice. Stata Journal 5: 527-536 • Tanner MA, Wing HW (1987) The calculation of posterior distributions by data augmentation. JASA

82: 528-540. [Cited 642 times, WoS 10.9.2006]

• Wei GCG, Tanner MA (1990) Posterior computations for censored regression data. JASA 85: 829-39 • Wei GCG, Tanner MA (1991) Application of multiple imputation to the analysis of censored

Visualizing and analyzing time to event data: Lifting the veil of censoring

Visualising and analysing

time-to-event data: lifting the veil of

censoring

Patrick Royston

Cancer Group, MRC Clinical

Trials Unit, London

A poet writes about censored observations:

Last night I saw upon the stair

A little man who wasn't there

He wasn't there again today

Oh, how I wish he'd go away!

From

Antigonish

(1899)

Outline

•

Why is censoring of time-to-event data an

issue?

•

Example in breast cancer

•

Visualisation of censored data using

model-based imputation

•

Multiple imputation and analysis of

survival data with missing covariate

observations

Why is censoring an issue?

•

You can’t picture the raw data easily

•

Reliance on Kaplan-Meier plots



Exaggerates differences between groups



Attracts attention to unreliable survival

estimates at extreme times

•

Data will be analysed using Cox model



Still the almost-automatic choice – although

decent alternatives exist

•

Time is “forgotten about” in the Cox model

•

Results of Cox regression models are

usually expressed as (log) hazard ratios



Indirect – not dealing directly with time



Can be hard to interpret – different effect on

survival curves at high and low survival probs



Particularly difficult for interactions – ‘ratio of

hazard ratios’

•

Non-proportional hazards



Data with long-term follow-up typically have it



Modelling and interpretation may be complex

Example: Primary

node-positive breast cancer

•

GBSG trial BMFT-2

•

686 patients, 299 events for

recurrence-free survival (RFS)

•

Patients assigned to hormonal therapy

(TAM) or not

•

Visualise the effect of TAM on RFS

•

Visualise interaction between TAM and ER

(estrogen receptor status)

Traditional visualisation:

Kaplan-Meier by TAM group

0.00