---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                      exp(coef) exp(-coef) lower .95 upper .95
as.factor(treatment)2    0.3287     3.0426   0.03109     3.474
size                     1.0861     0.9207   0.98961     1.192
index                    2.0345     0.4915   1.04913     3.945

Concordance= 0.873  (se = 0.132 )
Rsquare= 0.304   (max possible= 0.616 )
Likelihood ratio test= 13.78  on 3 df,   p=0.003226
Wald test            = 10.29  on 3 df,   p=0.01627
Score (logrank) test = 14.9  on 3 df,   p=0.001903
These data suggest that the Gleason Index is a significant risk factor for mortality (p-value = 0.0356). Size of tumor is marginally significant (p-value = 0.0819). Given that $\hat{\Delta} = -1.11272 < 0$ (equivalently, $\exp\{\hat{\Delta}\} = 0.3287 < 1$), it appears that DES lowers the risk of mortality; however, the p-value = 0.3550 is nonsignificant.
6.4 Accelerated Failure Time Models
In this section we consider analysis of survival data based on an accelerated failure time model. We assume that all survival times are observed. Rank-based analysis with censored survival times is considered in Jin et al. (2003).
Consider a study on experimental units (subjects) in which data are collected on the time until failure of the subjects. Hence, the setup for this section is the same as in the previous two sections of this chapter, with time until event replaced by time until failure. For such an experiment or study, let $T$ be the time until failure of a subject and let $x$ be the vector of associated covariates. The components of $x$ could be indicators of an underlying experimental design and/or concomitant variables collected to help explain random variability. Note that $T > 0$ with probability one. Generally, in practice, $T$ has a skewed distribution. As in the last section, let the random variable $T_0$ denote the baseline time until failure. This is the response in the absence of all covariates.
In this section, let $g(t; x)$ and $G(t; x)$ denote the pdf and cdf of $T$, respectively. In the last section, we introduced the hazard function $h(t)$. A more formal definition of the hazard function is the limit of the rate of instantaneous failure at time $t$; i.e.,

$$h(t; x) = \lim_{\Delta t \downarrow 0} \frac{P[t < T \le t + \Delta t \mid T > t]}{\Delta t} = \frac{g(t; x)}{1 - G(t; x)}.$$

Models frequently used with failure time data are the log-linear models
$$Y = \alpha + x^T\beta + \epsilon, \tag{6.6}$$

where $Y = \log T$ and $\epsilon$ is random error with respective pdf and cdf $f(s)$ and $F(s)$. We assume that the random error $\epsilon$ is free of $x$. Hence, the baseline response is given by $T_0 = \exp\{\epsilon\}$. Let $h_0(t)$ denote the hazard function of $T_0$. Because

$$T = \exp\{Y\} = \exp\{\alpha + x^T\beta + \epsilon\} = \exp\{\alpha + x^T\beta\}\exp\{\epsilon\} = \exp\{\alpha + x^T\beta\}\,T_0,$$

it follows that the hazard function of $T$ is

$$h_T(t; x) = \exp\{-(\alpha + x^T\beta)\}\, h_0(\exp\{-(\alpha + x^T\beta)\}\,t). \tag{6.7}$$

Notice that the effect of the covariate $x$ either accelerates or decelerates the instantaneous failure time of $T$; hence, log-linear models of the form (6.6) are generally called accelerated failure time models.
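The step from the factorization of $T$ to (6.7) can be spelled out as follows. This is a sketch that writes $c = \exp\{\alpha + x^T\beta\}$ and uses $g_0$ and $G_0$ for the pdf and cdf of $T_0$ (notation introduced here for illustration only), together with the standard identity that a hazard function equals the pdf divided by one minus the cdf.

```latex
\begin{align*}
G(t; x) &= P(c\,T_0 \le t) = G_0(t/c), \\
g(t; x) &= \frac{\partial}{\partial t}\, G_0(t/c) = c^{-1} g_0(t/c), \\
h_T(t; x) &= \frac{g(t; x)}{1 - G(t; x)}
           = c^{-1}\,\frac{g_0(t/c)}{1 - G_0(t/c)}
           = c^{-1}\, h_0(c^{-1} t).
\end{align*}
```

Substituting $c^{-1} = \exp\{-(\alpha + x^T\beta)\}$ gives (6.7). In particular, a constant baseline hazard yields a hazard for $T$ that does not depend on $t$.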
If $T_0$ has an exponential distribution with mean $1/\lambda_0$, then the hazard function of $T$ simplifies to

$$h_T(t; x) = \lambda_0 \exp\{-(\alpha + x^T\beta)\}; \tag{6.8}$$

i.e., Cox’s proportional hazard function given by expression (6.3) of the last section. In this case, it follows that the density function of $\epsilon$ is the extreme-valued pdf given by

$$f(s) = \lambda_0 e^{s} \exp\{-\lambda_0 e^{s}\}, \quad -\infty < s < \infty. \tag{6.9}$$

Accelerated failure time models are discussed in Kalbfleisch and Prentice (2002). As a family of possible error distributions for $\epsilon$, they suggest the generalized log F family; that is, $\epsilon = \log T_0$, where, up to a scale parameter, $T_0$ has an F-distribution with $2m_1$ and $2m_2$ degrees of freedom. In this case, we say that $\epsilon = \log T_0$ has a $GF(2m_1, 2m_2)$ distribution. Kalbfleisch and Prentice discuss this family for $m_1, m_2 \ge 1$, while McKean and Sievers (1989) extended it to $m_1, m_2 > 0$. This provides a rich family of distributions. The distributions are symmetric for $m_1 = m_2$; positively skewed for $m_1 > m_2$; negatively skewed for $m_1 < m_2$; moderate to light-tailed for $m_1, m_2 > 1$; and heavy-tailed for $m_1, m_2 \le 1$. For $m_1 = m_2 = 1$, $\epsilon$ has a logistic distribution, while as $m_1 = m_2 \to \infty$ the limiting distribution of $\epsilon$ is normal. Also, if one of the $m_i$ is one while the other approaches infinity, then the GF distribution approaches an extreme-valued distribution, with pdf of the form (6.9). So at least in the limit, the accelerated GF models encompass the proportional hazards models. See Kalbfleisch and Prentice (2002) and Section 3.10 of Hettmansperger and McKean (2011) for discussion.
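Two of the distributional facts above can be checked numerically in base R. The sketch below is illustrative only: the rate `lambda0 = 2`, the grid `s_grid`, and all object names are assumptions chosen for the check, not values from the text.

```r
# Check 1: if T0 ~ Exponential with mean 1/lambda0, the cdf of eps = log(T0)
# is P(T0 <= exp(s)); it should agree with the integral of the pdf in (6.9).
lambda0 <- 2                                   # arbitrary illustrative rate
f_ev <- function(s) lambda0 * exp(s) * exp(-lambda0 * exp(s))
s_grid <- c(-3, -1, 0, 0.5, 1)
cdf_from_pdf <- sapply(s_grid, function(s) integrate(f_ev, -Inf, s)$value)
cdf_direct   <- pexp(exp(s_grid), rate = lambda0)
max_diff_ev  <- max(abs(cdf_from_pdf - cdf_direct))

# Check 2: for m1 = m2 = 1, eps = log(T0) with T0 ~ F(2, 2) is logistic,
# so the F(2, 2) cdf evaluated at exp(s) should equal the logistic cdf at s.
max_diff_logis <- max(abs(pf(exp(s_grid), df1 = 2, df2 = 2) - plogis(s_grid)))
```

Both differences should be negligible, up to numerical integration error in the first check.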
The accelerated failure time models are linear models, so the rank-based fit and associated inference using Wilcoxon scores can be used for analyses. By a prudent choice of a score function, though, this analysis can be optimized. We next discuss optimal score functions for these models and show how to compute analyses based on them using Rfit. We begin with the proportional hazards model and then discuss the scores for the generalized log F-family.
Suppose a proportional hazards model is appropriate, where the baseline random variable $T_0$ has an exponential distribution with mean $1/\lambda_0$. Then $\epsilon$ has the extreme-valued pdf given by (6.9) and, as shown in Exercise 6.5.5, the optimal rank-based score function is $\varphi(u) = -1 - \log(1 - u)$, for $0 < u < 1$. A rank-based analysis using this score function is asymptotically fully efficient.
These scores are in the package npsm under the name logrankscores. The left panel of Figure 6.3 contains a plot of these scores, while the right panel shows a graph of the corresponding extreme valued pdf, (6.9). Note that the density has very light right-tails and much heavier left-tails. To guard against the influence of large (absolute) observations from the left-tails, the scores are bounded on the left, while their behavior on the right accommodates light-tailed error structure. The scores, though, are unbounded on the right and, hence, the resulting R analysis is not bias robust. In the sensitivity analysis discussed in McKean and Sievers (1989), the R estimates based on these scores were much less sensitive to outliers than the maximum likelihood estimates.
Similar to the normal scores, then, these log-rank scores are not technically bias robust but appear to be robust in practice.
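The properties of the log-rank score function described above can be examined directly in base R. This is a minimal sketch; the function name `phi_logrank` and the sample size `n = 10` are assumptions for illustration, and the code does not use the npsm implementation.

```r
# Log-rank score function from the text: phi(u) = -1 - log(1 - u), 0 < u < 1
phi_logrank <- function(u) -1 - log(1 - u)

# Usual standardization: the integral of phi is 0 and the integral of phi^2 is 1
int_phi  <- integrate(phi_logrank, 0, 1)$value
int_phi2 <- integrate(function(u) phi_logrank(u)^2, 0, 1)$value

# Bounded on the left (phi -> -1 as u -> 0) but unbounded on the right
left_val  <- phi_logrank(1e-12)
right_val <- phi_logrank(1 - 1e-12)

# Scores a(i) = phi(i / (n + 1)) for a sample of size n = 10
n <- 10
a <- phi_logrank((1:n) / (n + 1))
```

The vector `a` is increasing and its largest entries grow without bound as `n` increases, which is the unboundedness on the right noted above.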
We illustrate the use of these scores in the next example.
Example 6.4.1 (Simulated Exponential Data). The data for this model are generated from a proportional hazards model with $\lambda = 1$ based on the code eps <- log(rexp(10)); x <- 1:10; y <- round(4*x + eps, digits = 2). The actual data used are given in Exercise 6.5.11. Using Rfit with the log-rank score function, we obtain the fit of this dataset:
> fit <- rfit(y~x,scores=mylogrank)
> summary(fit)
Call:
rfit.default(formula = y ~ x, scores = mylogrank)

Coefficients:
            Estimate Std. Error t.value   p.value
(Intercept) -1.60687    1.49251 -1.0766     0.313
x            4.19125    0.22496 18.6310 7.107e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Multiple R-squared (Robust): 0.9287307
Reduction in Dispersion Test: 104.2504 p-value: 1e-05

FIGURE 6.3
Log-rank score function. (Left panel: log-rank score versus u. Right panel: the extreme-valued pdf g(t) versus t.)
Note that the true slope of 4 is included in the approximate 95% confidence interval 4.19 ± 2.31 · 0.22.
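The interval arithmetic can be reproduced in base R from the reported estimate and standard error. The sketch assumes the critical value 2.31 is the 0.975 quantile of a t-distribution with n − p − 1 = 8 degrees of freedom, which matches the rounded value in the text; the object names are illustrative.

```r
est <- 4.19125                         # slope estimate from the summary above
se  <- 0.22496                         # its standard error
n <- 10; p <- 1                        # sample size and number of slope parameters
tcrit <- qt(0.975, df = n - p - 1)     # about 2.31 with 8 degrees of freedom
ci <- est + c(-1, 1) * tcrit * se      # approximate 95% confidence interval
covers_true_slope <- ci[1] < 4 && 4 < ci[2]
```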
Next, suppose that the random errors in the accelerated failure time model, (6.6), have, up to a scale parameter, a $GF(2m_1, 2m_2)$ distribution. Then, as shown on page 234 of Hettmansperger and McKean (2011), the optimal score function is

$$\varphi_{m_1,m_2}(u) = \frac{m_1 m_2 [\exp\{F^{-1}(u)\} - 1]}{m_2 + m_1 \exp\{F^{-1}(u)\}}, \quad m_1 > 0,\ m_2 > 0, \tag{6.10}$$

where $F$ is the cdf of $\epsilon$. Note that, for all values of $m_1$ and $m_2$, these score functions are bounded over the interval $(0, 1)$; hence, the corresponding R analysis is bias robust. These scores are called the generalized log-F scores (GLF). The software npsm contains the necessary R code logfscores to add these scores to the class of scores. For this code, we have used the fact that the $p$th quantile of the $F_{2m_1,2m_2}$ cdf satisfies

$$q = \exp\{F_\epsilon^{-1}(p)\} \quad \text{where} \quad q = F_{2m_1,2m_2}^{-1}(p).$$

The default values are set at $m_1 = m_2 = 1$, which gives the Wilcoxon scores.

FIGURE 6.4
GLF scores for various settings of m1 and m2. (Four panels of $\varphi$ versus $u$: m1 = 1 and m2 = 20; m1 = 1 and m2 = 0.10; m1 = 0.10 and m2 = 0.10; m1 = 5 and m2 = 0.8.)
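The score function (6.10) and the quantile identity can be sketched in a few lines of base R. This is not the npsm `logfscores` code; `glf_score` is a hypothetical name, and the scores are left unstandardized.

```r
# Unstandardized GLF score function (6.10), using the identity
# exp{F_eps^{-1}(u)} = qf(u, 2*m1, 2*m2)
glf_score <- function(u, m1 = 1, m2 = 1) {
  q <- qf(u, df1 = 2 * m1, df2 = 2 * m2)
  m1 * m2 * (q - 1) / (m2 + m1 * q)
}

u <- seq(0.01, 0.99, by = 0.01)

# m1 = m2 = 1 reduces to 2u - 1, a multiple of the Wilcoxon score sqrt(12)*(u - 1/2)
wilcoxon_diff <- max(abs(glf_score(u, 1, 1) - (2 * u - 1)))

# The unstandardized scores lie strictly between -m1 and m2
s <- glf_score(u, m1 = 1, m2 = 5)
bounded <- all(s > -1 & s < 5)
```

Because the function is increasing in `u` with limits −m1 and m2, the smaller parameter marks the side on which extreme observations are downweighted more strongly.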
Figure 6.4 shows the diversity of these scores for different values of $m_1$ and $m_2$. It contains plots of four of the scores. The upper left graph displays the scores for $m_1 = 1$ and $m_2 = 20$. These are suitable for error distributions which have moderately heavy left-tails (the heaviness of a logistic distribution) and very light right-tails. In contrast, the scores for $m_1 = 1$ and $m_2 = 0.10$ are appropriate for moderately heavy left-tails and very heavy right-tails. The lower left panel of the figure is a score function designed for heavy-tailed and symmetric distributions. The final plot, for $m_1 = 5$ and $m_2 = 0.8$, is appropriate for moderate left-tails and heavy right-tails. But note from the degree of downweighting that the right-tails for this last case are clearly not as heavy as for the two cases with $m_2 = 0.10$.
The next example serves as an application of the log F-scores.

FIGURE 6.5
Log failure times of the insulation fluid versus the voltage stress. (Scatterplot titled "Log Failure Time versus Voltage Stress"; horizontal axis: voltage stress; vertical axis: log failure time.)
Example 6.4.2 (Insulating Fluid Data). Hettmansperger and McKean (2011) present an example involving failure time ($T$) of an electrical insulating fluid subject to seven different levels of voltage stress ($x$). The data are in the dataset insulation. Figure 6.5 shows a scatterplot of the log of failure time ($Y = \log T$) versus the voltage stress. As voltage stress increases, time until failure of the insulating fluid decreases. It appears that a simple linear model suffices. In their discussion, Hettmansperger and McKean recommend a rank-based fit based on generalized log F-scores with $m_1 = 1$ and $m_2 = 5$. This corresponds to a distribution with left-tails as heavy as a logistic distribution and right-tails lighter than a logistic distribution; i.e., moderately skewed left. The following code segment illustrates computation of the rank-based fit of these data based on this log F-score.
> myscores <- logfscores
> myscores@param <- c(1,5)
> fit <- rfit(logfail~voltstress,scores=myscores)
> summary(fit)
Call:
rfit.default(formula = logfail ~ voltstress, scores = myscores)

Coefficients:
            Estimate Std. Error t.value   p.value
(Intercept)  63.9596     6.5298   9.795 5.324e-15 ***
voltstress  -17.6624     1.8669  -9.461 2.252e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Multiple R-squared (Robust): 0.5092232
Reduction in Dispersion Test: 76.78138 p-value: 0

> fit$tauhat
[1] 1.572306
Not surprisingly, the estimate of the slope is highly significant. As a check on goodness-of-fit, Figure 6.6 presents the Studentized residual plot and the q−q plot of the Studentized residuals versus the quantiles of a log F-distribution with the appropriate degrees of freedom, 2 and 10. The q−q plot is fairly linear, indicating¹ that an appropriate choice of scores was made. The residual plot indicates a good fit. The outliers on the left are mild and, based on the q−q plot, follow the pattern of the log F-distribution with 2 and 10 degrees of freedom.
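The population quantiles for such a q−q plot can be obtained in base R from the F quantile function. The sketch below is illustrative: the sample size `n`, the simulated errors, and the object names are assumptions, and the Studentized residuals of the actual fit are not computed here.

```r
n <- 76                                # hypothetical sample size, for illustration only
p <- ppoints(n)                        # plotting positions in (0, 1)
logF_quantiles <- log(qf(p, df1 = 2, df2 = 10))   # quantiles of log(T0), T0 ~ F(2, 10)

# With Studentized residuals stored in a vector `rstud` (not computed here),
# the q-q plot would be: plot(logF_quantiles, sort(rstud))
# For a self-contained demonstration, simulated log F(2, 10) errors are used instead.
set.seed(1)
sim_errs <- log(rf(n, df1 = 2, df2 = 10))
qq_cor <- cor(logF_quantiles, sort(sim_errs))     # close to 1 when the distribution fits
```

A correlation near one corresponds to the near-linear q−q pattern described above.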
6.5 Exercises
6.5.1. Using the data discussed in Example 6.2.4:
(a) Obtain a plot of the Kaplan–Meier estimates for the two treatment groups.
(b) Obtain the p-value based on the log-rank statistic.
(c) Obtain the p-value based on the Peto–Peto modification of the Gehan statistic.
6.5.2. Obtain the full model fit of the prostate cancer data discussed in Example 6.3.2. Include age, serum haemoglobin level, size, and Gleason index. Comment on the similarity or dissimilarity of the estimated regression coefficients to those obtained in Example 6.3.2.
¹ See the discussion in Section 3.10 of Hettmansperger and McKean (2011).

FIGURE 6.6
The top panel contains the Studentized residual plot (Studentized residuals versus fitted values) of the rank-based fit using generalized log F-scores with 2 and 10 degrees of freedom. The bottom panel shows the q−q plot of Studentized residuals versus log F-scores with 2 and 10 degrees of freedom (horizontal axis: log F quantiles, 2 and 10 df).
6.5.3. For the dataset hodgkins, plot Kaplan–Meier estimated survival curves for both treatments. Note treatment code 1 denotes radiation of affected node and treatment code 2 denotes total nodal radiation.
6.5.4. To simulate survival data, it is often useful to simulate multiple time points, for example, the time to event and the time to end of study. Events occurring after the time to end of study are then censored. Suppose the time to event of interest follows an exponential distribution with mean 5 years and the time to end of study follows an exponential distribution with a mean of 1.8 years. For a sample size n = 100, simulate survival times from this model. Plot the Kaplan–Meier estimate.
6.5.5. Show that the optimal rank-based score function is $\varphi(u) = -1 - \log(1 - u)$, for $0 < u < 1$, for random variables which have an extreme-valued distribution (6.9). In this case, the generated scores are called the log-rank scores.
6.5.6. Consider the dataset rs. This is simulated data from a simple regression model with the true slope parameter at 0.5. The first column is the independent variable x while the second column is the dependent variable y. Obtain the following three fits of the model: least squares, Wilcoxon rank-based, and rank-based using logfscores with $m_1 = 1$ and $m_2 = 0.10$.
(a) Scatterplot the data and overlay the three fits.
(b) Obtain Studentized residual plots of all three fits.
(c) Based on Parts (a) and (b) which fit is worst?
(d) Compare the two rank-based fits in terms of precision (estimates of τϕ). Which fit is better?
6.5.7. Generate data from a linear model with log-F errors with degrees of freedom 4 and 8 using the following code:
n <- 75; m1 <- 2; m2 <- 4; x <- rnorm(n,50,10)
errs1 <- log(rf(n,2*m1,2*m2)); y1 <- x + 30*errs1
(a) Using logfscores, obtain the optimal scores for this dataset.
(b) Obtain side-by-side plots of the pdf of the random errors and the scores. Comment on the plot.
(c) Fit the simple linear model for this data using the optimal scores. Obtain a residual analysis including a Studentized residual plot and a normal q−q plot. Comment on the plots and the quality of the fit.
(d) Obtain a histogram of the residuals for the fit in part (c). Overlay the histogram with an estimate of the density and compare it to the plot of the pdf in part (b).
(e) Obtain a summary of the fit of the simple linear model for this data using the optimal scores. Obtain a 95% confidence interval for the slope parameter β. Did the interval trap the true parameter?
(f) Use the fit to obtain a confidence interval for the expected value of y when x = 60.
6.5.8. For the situation described in Exercise 6.5.7, obtain a simulation study comparing the mean squared errors of the estimates of slope using fits based on Wilcoxon scores and the optimal scores. Use 10,000 simulations.
6.5.9. Consider the failure time data discussed in Example 6.4.2. Recall that the generalized log F-scores with $2m_1 = 2$ and $2m_2 = 10$ degrees of freedom were used to compute the rank-based fit. The Studentized residuals from this fit were then used in a q−q plot to check goodness-of-fit based on the strength of linearity in the plot, where the population quantiles were obtained from a log F-distribution with 2 and 10 degrees of freedom. Obtain the rank-based fits based on the Wilcoxon scores, normal scores, and log F-scores with $2m_1 = 10$ and $2m_2 = 2$. For each, obtain the q−q plot of Studentized residuals using as population quantiles the normal distribution, the logistic distribution, and the log F-distribution with 10 and 2 degrees of freedom, respectively. Compare the plots. Which, if any, is most linear?
6.5.10. Suppose we are investigating the relationship between a response $Y$ and an independent variable $x$. In a planned experiment, we record responses at $r$ values of $x$, $x_1 < x_2 < \cdots < x_r$. Suppose $n_i$ independent replicates are obtained at $x_i$. Let $Y_{ij}$ denote the response for the $j$th replicate at $x_i$. Then the model for a linear relationship is

$$Y_{ij} = \alpha + x_{ij}\beta + e_{ij}, \quad i = 1, \ldots, r;\ j = 1, \ldots, n_i, \tag{6.11}$$

where $x_{ij} = x_i$ for all $j$. In this setting, we can obtain a lack-of-fit test. For this test, the null hypothesis is Model (6.11). For the alternative, we take the most general model, which is a one-way design with $r$ groups; i.e., the model

$$Y_{ij} = \mu_i + e_{ij}, \quad i = 1, \ldots, r;\ j = 1, \ldots, n_i, \tag{6.12}$$

where $\mu_i$ is the median (or mean) of the $i$th group (responses at $x_i$). The rank-based drop in dispersion is easily formulated to test these hypotheses. Select a score function $\varphi$. Let D(RED) denote the minimum value of the dispersion function when Model (6.11) is fit and let D(FULL) denote the minimum value of the dispersion function when Model (6.12) is fit. The $F_\varphi$ test statistic is

$$F_\varphi = \frac{[D(\mathrm{RED}) - D(\mathrm{FULL})]/(r - 2)}{\hat{\tau}_\varphi/2}.$$

This test statistic should be compared with F-critical values having $r - 2$ and $n - r$ degrees of freedom, where $n = \sum_i n_i$ is the total sample size. In general, the drop in dispersion test is computed by the function drop.test. Carry out this test for the data in Example 6.4.2 using the log F-scores with $2m_1 = 2$ and $2m_2 = 10$ degrees of freedom.
6.5.11. The data for Example 6.4.1 are:
x     1     2     3      4      5      6      7      8      9     10
y  2.84  6.52  6.87  16.43  18.17  25.24  28.15  31.65  36.37  38.84
(a) Using Rfit, verify the analysis presented in Example 6.4.1.
(b) Obtain Studentized residuals from the fit. Comment on the residual plot.
(c) Obtain the q−q plot of the sorted residuals of Part (b) versus the quantiles of the random variable $\epsilon$ which is distributed as the log of an exponential. Comment on linearity in the q−q plot.
7
Regression II
7.1 Introduction
In Chapter 4 we introduced rank-based fitting of linear models using Rfit. In this chapter, we discuss further topics for rank-based regression. These include high breakdown fits, diagnostic procedures, weighted regression, nonlinear models, and autoregressive time series models. We also discuss optimal scores for a family of skew normal distributions and present an adaptive procedure for regression estimation based on a family of Winsorized Wilcoxon scores.
Let $Y = [y_1, \ldots, y_n]^T$ denote an $n \times 1$ vector of responses. Then the matrix version of the linear model, (4.2), is

$$Y = \alpha \mathbf{1} + X\beta + e, \tag{7.1}$$

where $X = [x_1, \ldots, x_n]^T$ is an $n \times p$ design matrix, and $e = [e_1, \ldots, e_n]^T$ is an $n \times 1$ vector of error terms. Assume for discussion that $f(t)$ and $F(t)$ are the pdf and cdf of $e_i$, respectively. Assumptions differ for the various sectional topics.
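Recall from expression (4.10) that the rank-based estimator $\hat{\beta}_\varphi$ is the vector that minimizes the rank-based distance between $Y$ and $X\beta$; i.e., $\hat{\beta}_\varphi$ is defined as

$$\hat{\beta}_\varphi = \mathrm{Argmin}\, \|Y - X\beta\|_\varphi, \tag{7.2}$$

where the norm is defined by $\|v\|_\varphi = \sum_{i=1}^{n} a[R(v_i)]\, v_i$ for $v \in R^n$, with $R(v_i)$ denoting the rank of $v_i$ among $v_1, \ldots, v_n$, so that

$$\|Y - X\beta\|_\varphi = \sum_{i=1}^{n} a[R(y_i - x_i^T\beta)]\,(y_i - x_i^T\beta), \tag{7.3}$$

and the scores are $a(i) = \varphi[i/(n+1)]$ for a specified score function $\varphi(u)$ defined on the interval $(0, 1)$ and satisfying the standardizing conditions given in (3.12). Note that the norm is invariant to the intercept parameter; but, once $\beta$ is estimated, the intercept $\alpha$ is estimated by the median of the residuals. That is,

$$\hat{\alpha} = \mathrm{med}_i\{y_i - x_i^T\hat{\beta}_\varphi\}. \tag{7.4}$$

The rank-based residuals are defined by

$$\hat{e}_i = y_i - \hat{\alpha} - x_i^T\hat{\beta}_\varphi, \quad i = 1, 2, \ldots, n. \tag{7.5}$$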
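The definitions (7.2) through (7.5) can be traced in base R for a single predictor. The sketch below is a direct minimization for illustration, not the algorithm used by Rfit; it assumes Wilcoxon scores, uses the data listed in Exercise 6.5.11, and all object names and the search interval are hypothetical.

```r
# Data listed in Exercise 6.5.11 (simple regression with one predictor)
x <- 1:10
y <- c(2.84, 6.52, 6.87, 16.43, 18.17, 25.24, 28.15, 31.65, 36.37, 38.84)
n <- length(y)

# Wilcoxon scores a(i) = phi(i / (n + 1)) with phi(u) = sqrt(12) * (u - 1/2)
a <- sqrt(12) * ((1:n) / (n + 1) - 0.5)

# Dispersion (7.3) evaluated at the residuals y - x * beta
dispersion <- function(beta) {
  e <- y - x * beta
  sum(a[rank(e, ties.method = "first")] * e)
}

# The dispersion is convex in beta, so a one-dimensional search suffices here
beta_hat  <- optimize(dispersion, interval = c(0, 10))$minimum   # (7.2)
alpha_hat <- median(y - x * beta_hat)                             # (7.4)
resid     <- y - alpha_hat - x * beta_hat                         # (7.5)
```

By construction the residuals have median zero, and the slope estimate is unaffected by adding a constant to every response, which is the intercept invariance of the norm noted above.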
Recall that the joint asymptotic distribution of the rank-based estimates is multivariate normal with the covariance structure as given in (4.12).
As discussed in Chapter 3, the rank-based estimates are generally highly efficient estimates. Further, as long as the score function is bounded, the influence function of $\hat{\beta}_\varphi$ is bounded in the $Y$-space (response space). As with LS estimates, though, the influence function is unbounded in the $x$-space (factor space). In the next section, we present a rank-based estimate which has bounded influence in both spaces and which can attain the maximal 50% breakdown point.