
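Thus far we have estimated $\mu$ and $\sigma$ by regressing $\hat F_2^{-1}(u)$ on $\hat F_1^{-1}(u)$, by analogy with the relation

$$Y \overset{d}{=} \mu + \sigma X.$$

However, if we reparameterize the model as

$$X \overset{d}{=} \nu + \tau Y,$$

then we would estimate $\nu$ and $\tau$ by regressing $\hat F_1^{-1}(u)$ on $\hat F_2^{-1}(u)$. Since $\tau = 1/\sigma$ and $\nu = -\mu/\sigma$, logical consistency would require that

$$\hat\tau = \frac{1}{\hat\sigma} \quad\text{and}\quad \hat\nu = -\frac{\hat\mu}{\hat\sigma}. \tag{3.22}$$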

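However, from regression theory we know that (3.22) does not hold in general, because the fitted line of one variable on another is not the algebraic inverse of the fitted line in the reverse direction. To assess the practical effect, we consider the following two cases:

$$\text{Case 1:}\quad F_2^{-1} = \mu + \sigma F_1^{-1}$$

$$\text{Case 2:}\quad F_1^{-1} = \nu + \tau F_2^{-1}. \tag{3.23}$$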

In case 1, µ and σ were estimated by GLS. In case 2, ν and τ were estimated by GLS and then converted to estimates of µ and σ using (3.22). The sample sizes were 50, 250 and 500, with 1000 repetitions. F1 was a standard normal distribution, F2 was normal(µ, σ) and k was set to 8. The same setup as in (3.20) was used to estimate the density function. The continuous EDF (2.3) was used to estimate F1(x) and F2(x) and hence F1−1(u) and F2−1(u). The mean of the estimates over the 1000 repetitions is given in Table 3.14. As can be seen, the differences between the two sets of estimates are very minor and thus of little or no practical consequence.

Table 3.14: Estimates of µ and σ in cases 1 and 2 from (3.23)

               Case 1          Case 2
µ  σ  n        µ̂      σ̂       µ̂      σ̂

0  1  50       0.00   1.00    0.00   1.00
0  5  50      -0.05   4.97   -0.03   4.99
5  1  50       5.01   1.00    4.99   1.00
5  5  50       5.05   4.99    5.06   4.99
0  1  250      0.00   1.00    0.00   1.00
0  5  250     -0.01   5.00   -0.01   5.00
5  1  250      5.00   1.00    5.01   1.00
5  5  250      4.99   4.99    5.00   4.99
0  1  500      0.00   1.00    0.00   1.00
0  5  500      0.01   5.01    0.01   5.01
5  1  500      5.00   1.00    5.00   1.00
5  5  500      5.00   4.99    5.01   4.99
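The failure of (3.22) can be illustrated with a small numerical sketch. It is not the procedure used for Table 3.14: it uses ordinary least squares on plain sample quantiles instead of GLS on the continuous EDF, and the helper `ols_line`, the regression points `u` and the seed are illustrative assumptions. For OLS the product of the two fitted slopes equals the squared sample correlation, so the slope from the reverse regression is the reciprocal of the forward slope only when the points lie exactly on a line.

```python
import numpy as np

def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

rng = np.random.default_rng(0)          # seed chosen arbitrarily
n, mu, sigma = 50, 5.0, 5.0
u = np.arange(1, 9) / 9                 # 8 regression points (assumed spacing)

x1 = rng.standard_normal(n)             # sample from F1
x2 = mu + sigma * rng.standard_normal(n)  # sample from F2
q1 = np.quantile(x1, u)                 # plain sample quantiles, not the EDF (2.3)
q2 = np.quantile(x2, u)

# Case 1: regress the F2 quantiles on the F1 quantiles.
mu_1, sigma_1 = ols_line(q1, q2)

# Case 2: regress the F1 quantiles on the F2 quantiles, then convert via (3.22).
nu, tau = ols_line(q2, q1)
mu_2, sigma_2 = -nu / tau, 1.0 / tau

r = np.corrcoef(q1, q2)[0, 1]
print(f"case 1: mu={mu_1:.3f}, sigma={sigma_1:.3f}")
print(f"case 2: mu={mu_2:.3f}, sigma={sigma_2:.3f}")
print(f"slope product={sigma_1 * tau:.6f}, r^2={r ** 2:.6f}")
```

Because quantile pairs are very nearly collinear, the squared correlation is close to one, which may explain why the two sets of estimates differ so little in practice.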

Chapter 4

A semi-parametric regression method for censored data

The scenario investigated in Chapter 3 will now be extended to the case where censoring of the observations can occur. This is discussed in Hsieh [10] and we will follow the methodology set out there. The following notation will be used when the data are censored. The data from the two groups will still be denoted by $X_1 = \{X_{1,1}, \ldots, X_{1,n_1}\}$ and $X_2 = \{X_{2,1}, \ldots, X_{2,n_2}\}$, with CDFs $F_1$ and $F_2$ respectively.

The censoring observations will be $C_1 = \{C_{1,1}, \ldots, C_{1,n_1}\}$ and $C_2 = \{C_{2,1}, \ldots, C_{2,n_2}\}$, with survival functions $K_1$ and $K_2$ respectively. The observed data will then be denoted by $\tilde X_1 = \{\tilde X_{1,1}, \ldots, \tilde X_{1,n_1}\}$ and $\tilde X_2 = \{\tilde X_{2,1}, \ldots, \tilde X_{2,n_2}\}$, where $\tilde X_{1,i} = X_{1,i} \wedge C_{1,i}$ and $\tilde X_{2,i} = X_{2,i} \wedge C_{2,i}$. The observed censoring indicators will be $\delta_{1,i} = I(X_{1,i} \le C_{1,i})$ and $\delta_{2,i} = I(X_{2,i} \le C_{2,i})$. Denote the observed data, ordered in their first component, by $(\tilde X_{1,(i)}, \delta_{1,(i)})$, $i = 1, \ldots, n_1$, and $(\tilde X_{2,(j)}, \delta_{2,(j)})$, $j = 1, \ldots, n_2$, with $\tilde X_{1,(1)} < \ldots < \tilde X_{1,(n_1)}$ and $\tilde X_{2,(1)} < \ldots < \tilde X_{2,(n_2)}$.

The extension to the censored case is based on the following model

$$T_2 = \lambda\, T_1^{1/\gamma}, \tag{4.1}$$

where $T_1$ and $T_2$ are random variables representing the lifetimes in the two treatment groups and $\gamma, \lambda > 0$. Taking logarithms in (4.1) gives

$$\log T_2 = \frac{1}{\gamma} \log T_1 + \log \lambda. \tag{4.2}$$

Set $\mu = \log \lambda$, $\sigma = 1/\gamma$, $X_1 = \log T_1$ and $X_2 = \log T_2$. Then (4.2) becomes

$$X_2 = \mu + \sigma X_1,$$

which is the model considered in Chapter 3. Following the methodology set out there, we postulate a heteroscedastic regression model

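$$\hat F_2^{-1}(u) = \mu + \sigma \hat F_1^{-1}(u) + \epsilon(u), \tag{4.3}$$

where now $\hat F_1$ and $\hat F_2$ are Kaplan-Meier estimators of $F_1$ and $F_2$, i.e. for $j = 1, 2$

$$\hat F_j(t) = 1 - \prod_{\{i:\, \tilde X_{j,(i)} \le t\}} \left( \frac{n_j - i}{n_j - i + 1} \right)^{\delta_{j,(i)}}, \tag{4.4}$$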

where t ≥ 0. Monte Carlo simulations will be undertaken to investigate the relative effectiveness of the GLS and OLS estimators of µ and σ.
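The product in (4.4) can be evaluated with a cumulative product over the ordered observations. The following is a minimal sketch of the step-function estimator only, assuming no ties among the observed values; the function name `kaplan_meier_cdf` is hypothetical, and the continuous version (2.16) used later in the simulations is not reproduced here.

```python
import numpy as np

def kaplan_meier_cdf(x_obs, delta):
    """Step-function Kaplan-Meier estimate of the CDF, following (4.4).

    x_obs : observed values min(X, C), assumed to contain no ties
    delta : 1 if the observation is uncensored (X <= C), 0 otherwise
    Returns the ordered observations and the estimate evaluated at each.
    """
    order = np.argsort(x_obs, kind="stable")
    x_sorted = np.asarray(x_obs, dtype=float)[order]
    d_sorted = np.asarray(delta, dtype=int)[order]
    n = len(x_sorted)
    i = np.arange(1, n + 1)
    # A censored observation contributes a factor of 1 (exponent 0).
    factors = ((n - i) / (n - i + 1)) ** d_sorted
    return x_sorted, 1.0 - np.cumprod(factors)

# With no censoring the estimate reduces to the empirical CDF i/n.
t, F = kaplan_meier_cdf([3.0, 1.0, 2.0], [1, 1, 1])
print(t, F)

# With censoring the estimate jumps only at the uncensored values.
t_c, F_c = kaplan_meier_cdf([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0])
print(t_c, F_c)
```

In the second call the largest observation is censored, so the estimate never reaches 1; this is the situation discussed at the end of the chapter.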

4.1 The regression setup

Expanding (4.3) with $0 \le u_1 \le \ldots \le u_k \le 1$, we have

$$\begin{aligned}
\hat F_2^{-1}(u_1) &= \mu + \sigma \hat F_1^{-1}(u_1) + \epsilon(u_1) \\
&\;\;\vdots \\
\hat F_2^{-1}(u_k) &= \mu + \sigma \hat F_1^{-1}(u_k) + \epsilon(u_k),
\end{aligned} \tag{4.5}$$

an ordinary simple regression setup with response variable $\hat F_2^{-1}(u)$ and predictor variable $\hat F_1^{-1}(u)$. Hsieh [10] looks at both the OLS and GLS cases and shows that the GLS method is asymptotically efficient. The expressions for the asymptotic variances are complicated and will not be dealt with here; we will focus on the simulation results.
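The two estimators compared in this chapter differ only in the weighting matrix applied to (4.5). The sketch below uses the standard GLS formula and takes the error covariance matrix as an input; it does not reproduce the estimate of that matrix from Hsieh [10], and the names `gls_fit`, `ols_fit` and `cov` are hypothetical.

```python
import numpy as np

def gls_fit(q1, q2, cov):
    """Generalized least squares for q2 = mu + sigma * q1 + error.

    q1, q2 : estimated quantiles of groups 1 and 2 at the regression points
    cov    : (k, k) symmetric positive definite error covariance (assumed given)
    Returns the array [mu_hat, sigma_hat].
    """
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    X = np.column_stack([np.ones_like(q1), q1])   # design matrix [1, q1]
    w = np.linalg.solve(cov, X)                   # cov^{-1} X
    # Solve (X' cov^{-1} X) beta = X' cov^{-1} q2; uses the symmetry of cov.
    return np.linalg.solve(X.T @ w, w.T @ q2)

def ols_fit(q1, q2):
    """Ordinary least squares is GLS with the identity as covariance."""
    return gls_fit(q1, q2, np.eye(len(q1)))

# Points lying exactly on a line are recovered by both estimators.
q1 = np.array([0.0, 1.0, 2.0, 3.0])
q2 = 2.0 + 3.0 * q1
cov = np.diag([1.0, 2.0, 3.0, 4.0]) + 0.5
print(gls_fit(q1, q2, cov), ols_fit(q1, q2))
```

Solving the linear systems directly avoids forming an explicit matrix inverse, which tends to be more stable when the covariance matrix is poorly conditioned.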

From (9) and (10) in Hsieh [10], $\hat F_1^{-1}$ and $\hat F_2^{-1}$ can be represented in terms of two generalized Kiefer processes with covariance functions

$$\Lambda_i = D_{(1-u)}\, C^{-1} D_{(C\gamma^{(i)})} (C^{T})^{-1} D_{(1-u)}. \tag{4.6}$$

In (4.6), $D_g$ represents a diagonal matrix with main diagonal vector $g$ and where the matrix $C$ is the linear operator such that $Cu = (u_1, u_2 - u_1, \ldots, u_k - u_{k-1})^T$ and can be estimated by

(Recall the definition of $D^{-1}_{f_1(\hat F_1^{-1}(u))}$ from (3.12).) The covariance matrix is $\operatorname{cov}(\hat\beta) = \sigma^2 (X^T \Sigma^{-1} X)^{-1}$.

The asymptotic covariance matrix of β can be found in [10] page 2713. The OLS

A number of cases were investigated by Monte Carlo simulation. F1 and K1 denote the CDF and survival function of X1 and its associated censoring variable C1 respectively. The X2 data are distributed as µ + σX1 with censoring variable C2, which has survival function K2. Table 4.1 lists the three cases that will be considered. The exponential distribution, denoted by exp(λ), has the density function

$$f(x; \lambda) = \frac{e^{-x/\lambda}}{\lambda}.$$

The lognormal distribution, denoted by lognormal(a, b), has the density function f (x : a, b) = 1 In addition we define

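$$\phi_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} \delta_{1,i} \quad\text{and}\quad \phi_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} \delta_{2,i},$$

which are the observed proportions of uncensored observations.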

Case 1 uses the same simulation setup as Hsieh [10]. Our censoring proportions matched his for the first group but not for the second. We could not resolve this discrepancy, so a direct comparison with his results was not possible.

Table 4.1: Distributions of variables X1, C1 and C2

Case X1 C1 C2

1 exp(1) exp(4) exp(8)

2 log(exp(1)) exp(2) exp(4)

3 lognormal(1, 1) lognormal(1.5, 1) lognormal(2, 1)
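The censoring proportions implied by Case 1 can be checked by simulation. The sketch below assumes that exp(λ) has mean λ, as in the density given above, and that each lifetime is independent of its censoring time; the function name and seed are illustrative. Under these assumptions the expected proportions have closed forms, $1/(1 + 1/4)$ for the first group and $e^{-\mu/8}/(1 + \sigma/8)$ for the second.

```python
import numpy as np

def uncensored_proportions(mu, sigma, n, rng):
    """Simulate Case 1 of Table 4.1 and return (phi1, phi2).

    Assumes exp(lam) has mean lam and that lifetimes and censoring
    times are independent.
    """
    x1 = rng.exponential(scale=1.0, size=n)               # X1 ~ exp(1)
    c1 = rng.exponential(scale=4.0, size=n)               # C1 ~ exp(4)
    x2 = mu + sigma * rng.exponential(scale=1.0, size=n)  # X2 distributed as mu + sigma * X1
    c2 = rng.exponential(scale=8.0, size=n)               # C2 ~ exp(8)
    return np.mean(x1 <= c1), np.mean(x2 <= c2)

rng = np.random.default_rng(1)   # seed chosen arbitrarily
mu, sigma = 0.5, 0.5
phi1, phi2 = uncensored_proportions(mu, sigma, n=200_000, rng=rng)

# Closed-form expected proportions under the same assumptions.
phi1_exact = 1.0 / (1.0 + 1.0 / 4.0)
phi2_exact = np.exp(-mu / 8.0) / (1.0 + sigma / 8.0)
print(phi1, phi1_exact)
print(phi2, phi2_exact)
```

A large sample is used here only to make the agreement with the closed forms visible; the simulations in the text use much smaller samples.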

4.2.1 Bias of the estimators

Table 4.2 gives the bias results for the cases outlined in Table 4.1 for various values of µ and σ. There were 8 regression points, evenly spaced from 0.1 to 0.8. The simulations were run 1000 times with the sample size set to 50 for both groups. The continuous version of the Kaplan-Meier estimator (2.16) was used to estimate F1(t) and F2(t) and hence F1−1(u) and F2−1(u). The bias is calculated as the arithmetic mean of the estimates over the simulations ($\bar{\hat\mu}$ and $\bar{\hat\sigma}$ for GLS, along with $\bar{\tilde\mu}$ and $\bar{\tilde\sigma}$ for OLS) minus the true value, that is,

$$\hat\mu_{\text{bias}} = \bar{\hat\mu} - \mu \quad\text{and}\quad \hat\sigma_{\text{bias}} = \bar{\hat\sigma} - \sigma$$

for GLS and

$$\tilde\mu_{\text{bias}} = \bar{\tilde\mu} - \mu \quad\text{and}\quad \tilde\sigma_{\text{bias}} = \bar{\tilde\sigma} - \sigma$$

for OLS. There is hardly any bias for any of the cases. The OLS method has less of a bias for µ than the GLS method, though $\hat\mu_{\text{bias}}$ is still negligible.

4.2.2 Variance of the estimators

Table 4.3 gives the variance of the estimators for case 1 for the same simulation setup as used in the simulations to test the bias of the estimators. The variance was used rather than the mean squared error due to the negligible bias. The sample sizes were equal (n1 = n2 = n). The finite sample efficiencies are given by

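$$e(\tilde\mu : \hat\mu) = \frac{\operatorname{var}(\hat\mu)}{\operatorname{var}(\tilde\mu)} \quad\text{and}\quad e(\tilde\sigma : \hat\sigma) = \frac{\operatorname{var}(\hat\sigma)}{\operatorname{var}(\tilde\sigma)}.$$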
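The summary statistics reported in the tables can be computed from the simulated estimates as follows. This is a sketch only: the function name `summarise` is hypothetical, and the use of the unbiased sample variance (`ddof=1`) is an assumption, since the text does not say which variance estimator was used.

```python
import numpy as np

def summarise(gls, ols, truth):
    """Bias, variance and finite sample efficiency from simulated estimates.

    gls, ols : estimates of one parameter over the Monte Carlo repetitions
    truth    : true parameter value
    """
    gls = np.asarray(gls, dtype=float)
    ols = np.asarray(ols, dtype=float)
    var_gls = gls.var(ddof=1)   # ddof=1 is an assumption
    var_ols = ols.var(ddof=1)
    return {
        "bias_gls": gls.mean() - truth,
        "bias_ols": ols.mean() - truth,
        "var_gls": var_gls,
        "var_ols": var_ols,
        # Values below 1 favour GLS, values above 1 favour OLS.
        "efficiency": var_gls / var_ols,
    }

# Toy input: both estimators are unbiased, GLS has a quarter of the variance.
summary = summarise(gls=[0.9, 1.1], ols=[0.8, 1.2], truth=1.0)
print(summary)
```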

Table 4.2: Bias of the estimators, for the cases outlined in Table 4.1

Case  µ    σ    φ1    φ2    µ̂bias  µ̃bias  σ̂bias  σ̃bias

1     0.5  0.5  0.80  0.88   0.01   0.00   0.02   0.02
      1    1    0.80  0.78   0.01   0.01   0.08   0.09
      2    1.5  0.80  0.66   0.02   0.00   0.20   0.23
2     0.5  0.5  0.91  0.91  -0.03   0.01   0.00   0.00
      1    1    0.91  0.84  -0.06  -0.02   0.00   0.00
      2    1.5  0.91  0.72  -0.09   0.03   0.00  -0.01
3     0.5  0.5  0.85  0.95   0.00   0.00   0.02   0.02
      1    1    0.85  0.86   0.02   0.00   0.02   0.03
      2    1.5  0.85  0.72   0.02   0.00   0.03   0.04

The results in Table 4.3 show that the GLS method outperforms the OLS method by a large margin when estimating µ, with efficiencies of about 0.4; the gain for σ is smaller, with efficiencies between 0.86 and 0.95. The variance is low for both µ and σ under both the OLS and GLS methods. The only notably higher variance occurs in the estimation of σ when σ is greater than 1 and the sample size is small.

Table 4.3: Variance of estimators for Case 1 from Table 4.1

µ σ n φ1 φ2 var(µ̂) var(µ̃) var(σ̂) var(σ̃) e(µ̃ : µ̂) e(σ̃ : σ̂)

0.5 0.5 50 0.80 0.89 0.00 0.00 0.02 0.02 0.38 0.86

1 1 50 0.80 0.78 0.01 0.02 0.08 0.09 0.38 0.91

2 1.5 50 0.80 0.66 0.02 0.04 0.21 0.23 0.39 0.90

0.5 0.5 100 0.80 0.88 0.00 0.00 0.01 0.01 0.40 0.95

1 1 100 0.80 0.78 0.00 0.01 0.04 0.05 0.39 0.87

2 1.5 100 0.80 0.66 0.01 0.02 0.11 0.12 0.37 0.87

Table 4.4 provides the variance of the estimators for case 2. The variances are larger than in case 1 but are still fairly low. There is little difference between OLS and GLS, with OLS giving a better estimation of µ and GLS giving a better estimation of σ.

Table 4.4: Variance of estimators for Case 2 from Table 4.1

µ σ n φ1 φ2 var(µ̂) var(µ̃) var(σ̂) var(σ̃) e(µ̃ : µ̂) e(σ̃ : σ̂)

0.5 0.5 50 0.91 0.91 0.02 0.02 0.01 0.01 1.05 0.96

1 1 50 0.91 0.84 0.06 0.06 0.05 0.06 1.06 0.95

2 1.5 50 0.91 0.72 0.15 0.13 0.13 0.13 1.10 1.03

0.5 0.5 100 0.91 0.92 0.01 0.01 0.01 0.01 1.00 0.94

1 1 100 0.91 0.84 0.03 0.03 0.03 0.03 1.03 0.93

2 1.5 100 0.91 0.72 0.07 0.06 0.06 0.06 1.05 0.94

The variance results for case 3 are given in Table 4.5. They are similar to case 1 in that the GLS gives a much more accurate estimation than OLS. The variances were all low except for values of σ greater than 1 with small sample sizes.

Table 4.5: Variance of estimators for Case 3 from Table 4.1

µ σ n φ1 φ2 var(µ̂) var(µ̃) var(σ̂) var(σ̃) e(µ̃ : µ̂) e(σ̃ : σ̂)

0.5 0.5 50 0.85 0.95 0.01 0.01 0.03 0.03 0.45 0.83

1 1 50 0.85 0.86 0.02 0.04 0.09 0.11 0.42 0.77

2 1.5 50 0.85 0.72 0.04 0.10 0.22 0.28 0.39 0.79

0.5 0.5 100 0.85 0.95 0.00 0.01 0.01 0.01 0.40 0.75

1 1 100 0.85 0.86 0.01 0.02 0.05 0.06 0.40 0.77

2 1.5 100 0.85 0.72 0.02 0.05 0.11 0.13 0.37 0.79

There was a problem with the estimation due to the method of choosing the regression points, u. The Kaplan-Meier estimator is only informative on a limited interval, between 0 and the last uncensored observation. Beyond the last uncensored observation there is no further information available to the product-limit estimator, since it has its jumps only at the uncensored values. Thus, when there is heavy censoring, the value of the estimated CDF at the last uncensored observation may be lower than the highest regression point. Figure 4.1 illustrates this point. In this case F1(x) is a standard exponential distribution and C1 is uniformly distributed on [0, 2.2], chosen so that approximately 40 percent of the data are censored, with n = 100. Here $\hat F_1(x)$ only reaches 0.89. If there were a regression point at 0.9, for instance, then the inverse of the product-limit estimator would be undefined at that point. To counteract this, it is suggested that the largest regression point be less than or equal to the value of the estimated CDF at the last uncensored observation.

Figure 4.1: The estimated CDF for a standard exponential distribution with 40% censoring (horizontal axis: x, from 0 to 2; vertical axis: F1(x), from 0 to 1)
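The suggested restriction on the regression points can be sketched as follows. The step-function Kaplan-Meier estimate is used here in place of the continuous version (2.16), no ties are assumed, and the names `km_upper_limit` and `admissible_points` are hypothetical. The simulated data mimic the setting of Figure 4.1 with an arbitrary seed, so the upper limit obtained will generally differ from the value 0.89 quoted above.

```python
import numpy as np

def km_upper_limit(x_obs, delta):
    """Value of the Kaplan-Meier CDF estimate at the last uncensored point."""
    order = np.argsort(x_obs, kind="stable")
    d = np.asarray(delta, dtype=int)[order]
    n = len(d)
    i = np.arange(1, n + 1)
    # Censored observations contribute a factor of 1, so the final value of
    # the cumulative product is the survival estimate after the last jump.
    surv = np.cumprod(((n - i) / (n - i + 1)) ** d)
    return 1.0 - surv[-1]

def admissible_points(u, x_obs, delta):
    """Keep only the regression points the estimate can actually reach."""
    u = np.asarray(u, dtype=float)
    return u[u <= km_upper_limit(x_obs, delta)]

# Setting of Figure 4.1: standard exponential lifetimes, uniform censoring.
rng = np.random.default_rng(2)   # seed chosen arbitrarily
x = rng.exponential(1.0, size=100)
c = rng.uniform(0.0, 2.2, size=100)
x_obs, delta = np.minimum(x, c), (x <= c).astype(int)

u = np.arange(1, 10) / 10        # candidate points 0.1, ..., 0.9
print(km_upper_limit(x_obs, delta), admissible_points(u, x_obs, delta))
```

Dropping the unreachable points changes the number of regression points from sample to sample; fixing the largest point well below the expected upper limit, as in the simulations above, avoids this.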

Chapter 5

Nonparametric confidence bands for the quantile comparison function with censored data

This chapter is concerned with a fully nonparametric model in which two independent data sets, X1 = {X1,1, . . . , X1,n1} (> 0) and X2 = {X2,1, . . . , X2,n2} (> 0), come from two unspecified survival functions, S1 and S2 respectively. Our objective is to construct a confidence band for the quantile comparison function, q = S2−1(S1), and our only assumptions are that both S1 and S2 are continuous and strictly decreasing.

Doksum and Sievers [4] considered this type of problem for complete data. The additional factor that we wish to incorporate into the analysis is to allow incomplete data. Lu, Wells and Tiwari [15] used a bootstrap method that allowed for censored data. This section will look at a method that does not require use of a bootstrap.

The data are censored and we do not observe X1 and X2 but rather ˜X1 = {X1,i ∧ C1,i, i = 1, . . . , n1} and ˜X2 = {X2,i ∧ C2,i, i = 1, . . . , n2}, where the C1,i and C2,i are independent observations from continuous distributions with survival functions K1 and K2 respectively. The observed data therefore come from distributions with survival functions, for t ≥ 0,

H(t) = P [ ˜X1,i > t] = S1(t)K1(t)

and

J (t) = P [ ˜X2,i > t] = S2(t)K2(t).
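The product form follows in one line if, as is standard under random censoring (an assumption not stated explicitly above), each lifetime $X_{1,i}$ is independent of its censoring time $C_{1,i}$. The minimum of two variables exceeds $t$ exactly when both do, so

$$\begin{aligned}
H(t) &= P[X_{1,i} \wedge C_{1,i} > t] \\
     &= P[X_{1,i} > t,\; C_{1,i} > t] \\
     &= P[X_{1,i} > t]\, P[C_{1,i} > t] = S_1(t) K_1(t).
\end{aligned}$$

The same argument applied to the second group gives $J(t) = S_2(t) K_2(t)$.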

5.1 Asymptotic representation of the quantile comparison function

First, suppose there is no censoring present. Set

n =

From Potgieter [16], Section 2.3, we see that

$$\sqrt{n}\,(\hat q(t) - q(t)) = \frac{\sqrt{n}\,(\hat S_1(t) - \hat S_2(q(t)))}{f_2(q(t))} + o_p(1). \tag{5.2}$$

(Since Potgieter [16] is possibly not freely available, we reproduce his derivation in Appendix B.) In principle, therefore, confidence bands for q could be obtained using the asymptotic distribution of the first term on the right hand side of (5.2). However, this would require the estimation of f2(q(t)), which is extremely variable where f2(q(t)) is close to zero. The situation here is analogous to the estimation of a single quantile discussed in Section 21.8, page 309, of van der Vaart [18]. As pointed out there, it is simpler to base the estimation on the numerator in (5.2) alone.
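For complete data, a plug-in estimate of q(t) can be sketched as the sample quantile of the second group at the level reached by the empirical CDF of the first group at t, since $S_2^{-1}(S_1(t)) = F_2^{-1}(F_1(t))$. The function below is an illustration under that assumption, not the estimator of Potgieter [16]; the name `quantile_comparison` is hypothetical, and integer arithmetic is used for the quantile index to avoid floating point rounding.

```python
import numpy as np

def quantile_comparison(t, x1, x2):
    """Plug-in estimate of q(t) = F2^{-1}(F1(t)) for uncensored samples.

    Uses the left-continuous inverse of the empirical CDF of x2.
    Returns nan when no observation of x1 is <= t.
    """
    x1s, x2s = np.sort(x1), np.sort(x2)
    n1, n2 = len(x1s), len(x2s)
    k = int(np.searchsorted(x1s, t, side="right"))  # number of x1 values <= t
    if k == 0:
        return float("nan")
    idx = -(-k * n2 // n1) - 1                      # ceil(k * n2 / n1) - 1
    return float(x2s[idx])

# If the second sample is an exact location-scale copy of the first,
# the estimate recovers the line at the observed points.
x1 = np.array([0.5, 1.5, 2.5, 3.5])
x2 = 1.0 + 2.0 * x1
q_hat = quantile_comparison(1.5, x1, x2)
print(q_hat)
```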

Our confidence band for q will be I := where C is chosen so that

P (q(t)∈ I|S1, S2) = 1− α.

Notice that

Sˆ2(q(t)) =1− 1

where the U1,i and U2,i are uniformly distributed between zero and one. Therefore the distributions of ˆS2(q(t)) and ˆS1(t) are independent of F2. Thus, we may assume

Thus far we have not taken account of any censoring. If censoring is indeed present then it seems natural to take ˆS1 and ˆS2 in (5.3) and (5.4) to be the respective Kaplan-Meier estimators.

We showed in Section 2.3.1 (see (2.20) and (2.21)) that, for i = 1, 2, $\bar Z_i(u)$, $0 \le u \le 1$, is a path-continuous Gaussian process with zero mean and covariance function

$$\operatorname{cov}(\bar Z_i(u_1), \bar Z_i(u_2)) = u_1 u_2 \int_0^{u_1 \wedge u_2} \frac{dw}{w^2 \times K_i(S_i^{-1}(w))}.$$

Notice that from (3.7)

√n =

converges in distribution to a zero mean Gaussian process B which has covariance function

To approximate the distribution of the latter random variable we use the following lemma from Lombard [14].

Lemma 5.1. Let ˆB(u), 0 < u < 1, be a path-continuous Gaussian process with covariance function c(u1, u2). Suppose there exist continuous functions ν(u) > 0 and θ(u) such that

c(u, u + ϵ) = ν(u)− θ(u)ϵ + o(ϵ)

as ϵ→ 0. Set

Notice that in (5.8) |θ(u)| is incorrectly given as θ(u) in Lombard [14].

Now set

so that Lemma 5.1 is applicable. Since the survival functions Ki and Si appearing in the expression for θ(u) are unknown, we replace them by consistent estimates made from the data. Then, from the convergence in distribution of Bn to ˆB together with (5.8), we have the approximation

P (√ for ”large” b and for ”large” n1 and n2. The righthand side of (5.9) provides an approximate p-value for a test of the hypothesis S1 ≡ S2. Thus, in order to find an approximate 100(1− α)% simultaneous confidence band for q, we must solve the equation computation of the integral by numerical integration. These issues are dealt with in the next section.
