SES algorithm for AFT frailty models - Topics In Linear Models: Methods For Clustered, Censored


3.1.2 SES algorithm for AFT frailty models

EM and ES algorithms

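To motivate our approach, we first consider how the EM algorithm might be implemented for the joint estimation of the unknown parameters $\psi = (\theta, \beta, \Lambda_0)$ of the model specified by (3.1) and (3.2), for a general frailty distribution $g(\cdot \mid \theta)$. Given $W_i$ $(i = 1, \ldots, n)$, the relevant complete data log-likelihood is

$$L_c(\theta, \beta, \Lambda_0) = L_1(\theta) + L_2(\beta, \Lambda_0),$$

where

$$L_1(\theta) = \sum_{i=1}^{n} \left\{ D_i \log W_i + K_i \log g(W_i \mid \theta) \right\},$$

$$L_2(\beta, \Lambda_0) = \sum_{i=1}^{n} \sum_{k=1}^{K_i} \left\{ \Delta_{ik} \log \lambda_0(e_{ik}(\beta)) - W_i \Lambda_0(e_{ik}(\beta)) \right\},$$

$e_{ik}(\beta) = \log(T_{ik}) - X_{ik}'\beta$ and $D_i = \sum_{k=1}^{K_i} \Delta_{ik}$.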

Let $O = \{O_{ik};\ k = 1, \ldots, K_i,\ i = 1, \ldots, n\}$ denote the observed data for all $n$ [...] parameters. For the E-step of the EM algorithm, we calculate the conditional expectation of the complete data log-likelihood with respect to the frailty terms given the observed data and under the assumption that the true parameter is $\hat\psi^{(s)}$. In particular, $E\{L_c(\theta, \beta, \Lambda_0) \mid O, \hat\psi^{(s)}\}$ is given by the sum of

$$E\{L_1(\theta) \mid O, \hat\psi^{(s)}\} = \sum_{i=1}^{n} \left[ D_i\, E(\log W_i \mid O, \hat\psi^{(s)}) + K_i\, E\{\log g(W_i \mid \theta) \mid O, \hat\psi^{(s)}\} \right] \tag{3.3}$$

and

$$E\{L_2(\beta, \Lambda_0) \mid O, \hat\psi^{(s)}\} = \sum_{i=1}^{n} \sum_{k=1}^{K_i} \left\{ \Delta_{ik} \log \lambda_0(e_{ik}(\beta)) - E(W_i \mid O, \hat\psi^{(s)})\, \Lambda_0(e_{ik}(\beta)) \right\}. \tag{3.4}$$

For the M-step of the EM algorithm, we would like to maximize (3.3) and (3.4) with respect to $\theta$, $\beta$, and $\Lambda_0(\cdot)$. In principle, the univariate function (3.3) can be maximized with respect to $\theta$ using standard numerical methods. For example, when all required expectations exist in closed form (e.g., $W_i$ follows a gamma distribution), implementation is quite straightforward. However, maximization of (3.4) with respect to $\beta$ and $\Lambda_0(\cdot)$ is impossible without further parametric assumptions on $\lambda_0(\cdot)$.

Using arguments similar to those given in Pan (2001) and Zhang and Peng (2007), or a modification of the argument given in Strawderman (2006), $\Lambda_0(\cdot)$ can be estimated nonparametrically by

$$\hat\Lambda_0(t) = \sum_{i=1}^{n} \sum_{k=1}^{K_i} \frac{\Delta_{ik}\, I\{e_{ik}(\hat\beta) \le t\}}{\sum_{j=1}^{n} \sum_{l=1}^{K_j} \hat W_j\, I\{e_{jl}(\hat\beta) \ge e_{ik}(\hat\beta)\}}, \tag{3.5}$$

where $\hat\beta$ and $\hat W_j$ respectively estimate $\beta$ and $E(W_j \mid O)$ $(j = 1, \ldots, n)$. Similarly, an estimate of $\beta$ can be obtained via the estimating equation

$$S_n^{(w)}(\beta) = \sum_{i=1}^{n} \sum_{k=1}^{K_i} w_{ik}(\beta)\, \Delta_{ik} \left[ X_{ik} - \frac{\sum_{j=1}^{n} \sum_{l=1}^{K_j} \hat W_j X_{jl}\, I\{e_{ik}(\beta) \le e_{jl}(\beta)\}}{\sum_{j=1}^{n} \sum_{l=1}^{K_j} \hat W_j\, I\{e_{ik}(\beta) \le e_{jl}(\beta)\}} \right], \tag{3.6}$$

where $w_{ik}(\cdot)$ are nonnegative weight functions $(k = 1, \ldots, K_i,\ i = 1, \ldots, n)$. If one replaces the maximization of (3.4) at a given stage of the EM algorithm with estimates for $\Lambda_0(\cdot)$ and $\beta$ derived from (3.5) and (3.6) using $\hat W_j = E(W_j \mid O, \hat\psi^{(s)})$, the resulting procedure is no longer a true EM algorithm but rather an example of an Expectation and Substitution (ES) algorithm (Elashoff and Ryan, 2004).
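The weighted Breslow-type estimator in (3.5) is a step function that is simple to evaluate once residuals and frailty estimates are available. The following is a minimal sketch, assuming the data are stored "long" (one entry per cluster member) and that the frailty estimate for cluster j has been repeated for each of its members; the function name `baseline_cumhaz` and its argument names are hypothetical and not part of the source.

```python
import numpy as np


def baseline_cumhaz(resid, delta, w_obs, t):
    """Evaluate the step function in (3.5) at the points t.

    resid : residuals e_ik(beta_hat), flattened over clusters and members
    delta : event indicators Delta_ik (1 = observed failure), same order
    w_obs : frailty estimate W_hat_j, repeated for every member of cluster j
    t     : scalar or array of evaluation points on the residual scale
    """
    resid = np.asarray(resid, dtype=float)
    delta = np.asarray(delta, dtype=float)
    w_obs = np.asarray(w_obs, dtype=float)
    # at_risk[a] = sum over b of W_hat_b * I{resid_b >= resid_a}
    at_risk = (w_obs[None, :] * (resid[None, :] >= resid[:, None])).sum(axis=1)
    jumps = delta / at_risk  # jump of the estimator at each residual
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Sum the jumps at residuals not exceeding each evaluation point.
    return (jumps[None, :] * (resid[None, :] <= t[:, None])).sum(axis=1)
```

The pairwise comparison uses O(N^2) memory for N observations, which is adequate for illustration; sorting the residuals once would be the natural refinement for larger data sets.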

Setting $\hat W_j = 1$ $(j = 1, \ldots, n)$, it can be seen that (3.6) reduces to the estimating equation for $\beta$ under the marginal independence approach (Lee et al., 1993) and, with $K_j = 1$ $(j = 1, \ldots, n)$, to the estimating function of Tsiatis (1990) given in (1.1). The fact that $\beta$ only appears in (3.6) as an argument to indicator functions means that $S_n^{(w)}(\beta)$ is not a continuous function of $\beta$; hence, a solution to $S_n^{(w)}(\beta) = 0$ typically does not exist. Parameter estimates may instead be obtained by minimizing $\|S_n^{(w)}(\beta)\|$, where $\|v\|$ denotes $(v'v)^{1/2}$ for a vector $v$. However, this minimization problem may admit several solutions. In addition, because $S_n^{(w)}(\beta)$ is not necessarily monotone in $\beta$, the resulting set of solutions may not be a convex set.

For the setting in which $\hat W_j = K_j = 1$ $(j = 1, \ldots, n)$, Fygenson and Ritov (1994) note that use of the Gehan weight function $w_i(\beta) = \sum_{j=1}^{n} I\{e_i(\beta) \le e_j(\beta)\}$ leads to a discontinuous but monotone estimating equation. Following Strawderman (2006), substitution of the modified weights $w_{ik}(\beta) = \sum_{j=1}^{n} \sum_{l=1}^{K_j} \hat W_j\, I\{e_{ik}(\beta) \le e_{jl}(\beta)\}$ into (3.6) leads to

$$S_n(\beta) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{k=1}^{K_i} \sum_{j=1}^{n} \sum_{l=1}^{K_j} \Delta_{ik} \hat W_j (X_{ik} - X_{jl})\, I\{e_{ik}(\beta) - e_{jl}(\beta) \le 0\}; \tag{3.7}$$

see Zhang and Peng (2007) and Xu and Zhang (2010) for related developments. The estimating equation (3.7) is monotone in each component of $\beta$ and, importantly, equals the gradient of the convex objective function

$$L_n(\beta) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{k=1}^{K_i} \sum_{j=1}^{n} \sum_{l=1}^{K_j} \Delta_{ik} \hat W_j \{e_{jl}(\beta) - e_{ik}(\beta)\}\, I\{e_{ik}(\beta) - e_{jl}(\beta) \le 0\}. \tag{3.8}$$

An estimate of $\beta$ can therefore be obtained by minimizing $L_n(\beta)$ with respect to $\beta$ (e.g., Jin et al., 2006a). The resulting set of solutions forms a convex set; however, a unique minimizer may not exist. As above, upon setting $\hat W_j = 1$ $(j = 1, \ldots, n)$, (3.7) and (3.8) reduce to the Gehan estimating equation (2.1) and objective function (2.2) for clustered data under the marginal independence approach. This relationship between the sets of equations for the two approaches will be evident throughout the remainder of this chapter and will be reflected by the use of parallel notation.
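The relationship between the objective function (3.8) and the estimating function (3.7) can be made concrete with a direct pairwise computation. The sketch below assumes long-format arrays (one row per cluster member) with the frailty estimate repeated within each cluster; the name `gehan_objective_and_score` and its arguments are hypothetical, chosen here for illustration only.

```python
import numpy as np


def gehan_objective_and_score(beta, log_t, X, delta, w_obs, n):
    """Return (L_n(beta), S_n(beta)) as written in (3.8) and (3.7).

    log_t : log event/censoring times, shape (N,)
    X     : covariate matrix, shape (N, p)
    delta : event indicators Delta_ik, shape (N,)
    w_obs : W_hat_j repeated for each member of cluster j, shape (N,)
    n     : number of clusters (sets the 1 / {n (n - 1)} scaling)
    """
    beta = np.asarray(beta, dtype=float)
    e = log_t - X @ beta                    # residuals e_ik(beta)
    diff = e[None, :] - e[:, None]          # diff[a, b] = e_b - e_a
    ind = diff >= 0                         # I{e_a - e_b <= 0}
    coef = delta[:, None] * w_obs[None, :] * ind / (n * (n - 1))
    L = (coef * diff).sum()
    # sum_a sum_b coef[a, b] * (X_a - X_b), computed via row and column sums
    S = coef.sum(axis=1) @ X - coef.sum(axis=0) @ X
    return L, S
```

Because the indicator makes (3.8) piecewise linear in the regression parameter, the returned score is constant between breakpoints, which is the discontinuity that motivates smoothing.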

Induced smoothing for estimation of β

In general, the ES algorithm based on (3.6) presents one solution to the inability to maximize (3.4). However, the well-known computational challenges summarized in the previous section continue to present barriers for implementation, even in the case where $L_n(\beta)$ is convex. Relevant examples of algorithms that attempt to cope with these challenges include those described in Pan (2001), Jin et al. (2003, 2006a), Strawderman (2006), Zhang and Peng (2007), and Xu and Zhang (2010). None make use of smoothing to ease the computational burden. In light of the connection of equations (3.7) and (3.8) to the problem of estimating $\beta$ under a marginally specified semiparametric AFT regression model, we propose to incorporate a simple adaptation of the smoothing procedure introduced in §2.1.3 into the ES algorithm.

Define $Z$ to be a $N(0, I_p)$ random vector independent of the data, where $I_p$ denotes the $p \times p$ identity matrix. Let $\Gamma$ be a $p \times p$ matrix such that $\|\Gamma\| = O(1)$ and $\Gamma^2 = \Sigma$, where $\Sigma$ is some symmetric, positive definite matrix. Then, a smoothed estimating equation may be constructed by adding the random perturbation $n^{-1/2}\Gamma Z$ to the argument of $S_n(\beta)$ in (3.7) and taking its expectation with respect to $Z$. Specifically, setting $\tilde S_n(\beta) = E_Z\{S_n(\beta + n^{-1/2}\Gamma Z)\}$, we obtain

$$\tilde S_n(\beta) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{k=1}^{K_i} \sum_{j=1}^{n} \sum_{l=1}^{K_j} \Delta_{ik} \hat W_j (X_{ik} - X_{jl})\, \Phi\!\left( n^{1/2}\, \frac{e_{jl}(\beta) - e_{ik}(\beta)}{r_{ikjl}} \right), \tag{3.9}$$

where $r_{ikjl}^2 = (X_{ik} - X_{jl})'\Sigma(X_{ik} - X_{jl})$ and $\Phi(\cdot)$ denotes the standard normal cumulative distribution function. Brown and Wang (2005) proposed this smoothing technique as a way to compute standard errors for estimators defined by general rank-based estimating equations and briefly discussed its "pseudo-Bayesian" motivation. More obviously, the estimating equation (3.9) may be viewed as a kernel-smoothed version of (3.7) in which the indicator function is replaced by a monotone kernel function (i.e., the standard normal CDF) that uses a pair-dependent bandwidth.
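The replacement of the indicator in (3.7) by a normal CDF follows from a short normal probability calculation. The derivation below is a sketch under two assumptions stated here rather than in the text: that $\Gamma$ is symmetric, so that $\Gamma'\Gamma = \Gamma^2 = \Sigma$, and that $r_{ikjl} > 0$. The standard normal variable $U$ is introduced only for this illustration.

```latex
\begin{align*}
e_{ik}(\beta + n^{-1/2}\Gamma Z) - e_{jl}(\beta + n^{-1/2}\Gamma Z)
  &= e_{ik}(\beta) - e_{jl}(\beta) - n^{-1/2}(X_{ik} - X_{jl})'\Gamma Z, \\
(X_{ik} - X_{jl})'\Gamma Z
  &\sim N\!\left(0,\; (X_{ik} - X_{jl})'\Sigma(X_{ik} - X_{jl})\right)
   = N(0, r_{ikjl}^2), \\
E_Z\, I\{ e_{ik}(\beta) - e_{jl}(\beta) - n^{-1/2} r_{ikjl} U \le 0 \}
  &= P\!\left\{ U \ge n^{1/2}\, \frac{e_{ik}(\beta) - e_{jl}(\beta)}{r_{ikjl}} \right\}
   = \Phi\!\left( n^{1/2}\, \frac{e_{jl}(\beta) - e_{ik}(\beta)}{r_{ikjl}} \right).
\end{align*}
```

The effective bandwidth for the pair is therefore $r_{ikjl}/n^{1/2}$, which shrinks as the number of clusters grows.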

Rather than an estimating equation, one can work directly with the smoothed objective function $\tilde L_n(\beta) = E_Z\{L_n(\beta + n^{-1/2}\Gamma Z)\}$. Let $\phi(\cdot)$ denote the standard normal density function. Using well-known results for normal random variables and integration by parts, it can be shown that

$$\tilde L_n(\beta) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{k=1}^{K_i} \sum_{j=1}^{n} \sum_{l=1}^{K_j} \Delta_{ik} \hat W_j \left[ \{e_{jl}(\beta) - e_{ik}(\beta)\}\, H^{(n)}_{ikjl}(\beta) + \frac{r_{ikjl}}{n^{1/2}}\, h^{(n)}_{ikjl}(\beta) \right], \tag{3.10}$$

where $r_{ikjl}$ is defined above,

$$H^{(n)}_{ikjl}(\beta) = \Phi\!\left( n^{1/2}\, \frac{e_{jl}(\beta) - e_{ik}(\beta)}{r_{ikjl}} \right) \quad \text{and} \quad h^{(n)}_{ikjl}(\beta) = \phi\!\left( n^{1/2}\, \frac{e_{jl}(\beta) - e_{ik}(\beta)}{r_{ikjl}} \right).$$

A straightforward calculation shows that $\nabla \tilde L_n(\beta) = \tilde S_n(\beta)$. Under mild conditions on the covariates, the smoothed objective function $\tilde L_n(\beta)$ is strictly convex and infinitely differentiable; hence, standard numerical methods can be used to efficiently compute its unique minimizer $\hat\beta = \arg\min_\beta \tilde L_n(\beta)$.
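To illustrate how the smoothed objective (3.10) and its gradient (3.9) might be computed and minimized, the following sketch takes $\Sigma = I_p$ and uses long-format arrays with the frailty estimate repeated within clusters. All function and argument names are hypothetical. The plain gradient descent with backtracking is a stand-in chosen to keep the example dependency-light; it is not the optimizer used in the source, and any quasi-Newton routine would serve. Pairs with identical covariates (where $r_{ikjl} = 0$) are handled by their unsmoothed limit, which is an implementation choice made here.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise erf without extra dependencies


def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)


def smoothed_gehan(beta, log_t, X, delta, w_obs, n):
    """Return (L_tilde, S_tilde) as in (3.10) and (3.9), taking Sigma = I_p."""
    beta = np.asarray(beta, dtype=float)
    e = log_t - X @ beta
    diff = e[None, :] - e[:, None]              # e_b - e_a
    dX = X[:, None, :] - X[None, :, :]          # X_a - X_b
    r = np.sqrt((dX ** 2).sum(axis=2))          # r_ab when Sigma = I_p
    pos = r > 0
    z = math.sqrt(n) * diff / np.where(pos, r, 1.0)
    # Pairs with r = 0 do not depend on beta; use the unsmoothed limit.
    Phi = np.where(pos, norm_cdf(z), (diff >= 0).astype(float))
    phi = np.where(pos, norm_pdf(z), 0.0)
    coef = delta[:, None] * w_obs[None, :] / (n * (n - 1))
    L = (coef * (diff * Phi + r / math.sqrt(n) * phi)).sum()
    S = np.einsum("ab,abp->p", coef * Phi, dX)
    return L, S


def minimize_smoothed(beta0, *data, max_iter=500, tol=1e-8):
    """Gradient descent with Armijo backtracking on the smoothed objective."""
    beta = np.asarray(beta0, dtype=float)
    L, S = smoothed_gehan(beta, *data)
    for _ in range(max_iter):
        if np.linalg.norm(S) < tol:
            break
        step = 1.0
        while step > 1e-12:
            cand = beta - step * S
            L_new, S_new = smoothed_gehan(cand, *data)
            if L_new <= L - 1e-4 * step * (S @ S):
                beta, L, S = cand, L_new, S_new
                break
            step *= 0.5
        else:
            break  # no acceptable step found; stop at the current iterate
    return beta, L
```

A call such as `minimize_smoothed(np.zeros(p), log_t, X, delta, w_obs, n)` returns the final iterate and objective value; convexity of the smoothed objective means the starting value affects only the number of iterations.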

To obtain marginal regression parameter estimates for the AFT model with clustered data, we considered several choices for the smoothing matrix $\Sigma$ in §2.3, including both $\Sigma = I_p$ and a data-dependent smoothing matrix that is computed iteratively and which reflects the relative scaling of the regression parameters. It is not clear how a similar data-dependent $\Sigma$ might be constructed in the AFT frailty model. In addition, we found that the choice of $\Sigma$ generally had minimal impact on the bias or standard error of the regression estimates in the marginal approach. We therefore propose to use $\Sigma = I_p$ in (3.10) in order to estimate $\beta$, exploring this choice and others in the simulation studies of §3.3.

SES algorithm

The joint estimation procedure for $\psi = (\theta, \beta, \Lambda_0)$ that results from incorporating the smoothed regression parameter estimator into the ES algorithm, hereafter referred to as the Smoothing Expectation and Substitution (SES) algorithm, can now be implemented as follows:

1. Select initial values:

   Initialize $\hat W_i^{(0)} = 1$ $(i = 1, \ldots, n)$.

   Initialize $\hat\beta^{(0)}$ by minimizing $\tilde L_n(\beta)$ in (3.10) with $\Sigma = I_p$.

   Initialize $\hat\Lambda_0^{(0)}$ using (3.5).

   Initialize $\hat\theta^{(0)}$.

   Set $s = 1$.

2. E-step:

   Compute the update $\hat W_i^{(s)} = E(W_i \mid O, \hat\psi^{(s-1)})$ $(i = 1, \ldots, n)$.

3. S-step:

   Update $\hat\beta^{(s)}$ by minimizing $\tilde L_n(\beta)$ in (3.10) with $\Sigma = I_p$.

   Update $\hat\Lambda_0^{(s)}$ using (3.5).

   Update $\hat\theta^{(s)}$ by maximizing $\ell(\theta) = E\{L_1(\theta) \mid O, \hat\psi^{(s)}\}$ in (3.3).

4. Iterate between Steps 2 and 3, incrementing $s$ after each pass, until a specified convergence criterion is met and report $\hat\psi = (\hat\beta, \hat\theta, \hat\Lambda_0)$.
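The control flow of Steps 1 through 4 can be sketched as a small driver that delegates each update to a user-supplied function. Everything named here is hypothetical: the driver `ses`, its four callables, the scalar treatment of the frailty parameter, and the stopping rule based on the largest change in the regression and frailty parameters are assumptions made for illustration, since the source leaves the convergence criterion unspecified.

```python
import numpy as np


def ses(n, e_step, update_beta, update_cumhaz, update_theta,
        theta_init, max_iter=500, tol=1e-8):
    """Skeleton of the SES iteration.

    e_step(theta, beta, cumhaz)        -> array of W_hat_i (Step 2)
    update_beta(w)                     -> beta minimizing the smoothed objective
    update_cumhaz(beta, w)             -> estimate of Lambda_0 from (3.5)
    update_theta(theta, beta, cumhaz)  -> maximizer of l(theta) in Step 3
    """
    # Step 1: initial values, with all frailties set to one.
    w = np.ones(n)
    beta = np.asarray(update_beta(w), dtype=float)
    cumhaz = update_cumhaz(beta, w)
    theta = float(theta_init)
    for s in range(1, max_iter + 1):
        # Step 2 (E-step): frailties evaluated at the previous iterate.
        w = np.asarray(e_step(theta, beta, cumhaz), dtype=float)
        # Step 3 (S-step): substitute the new frailties into each update.
        beta_new = np.asarray(update_beta(w), dtype=float)
        cumhaz = update_cumhaz(beta_new, w)
        theta_new = float(update_theta(theta, beta_new, cumhaz))
        # Step 4: assumed stopping rule on the parameter change.
        change = max(np.max(np.abs(beta_new - beta)), abs(theta_new - theta))
        beta, theta = beta_new, theta_new
        if change < tol:
            return theta, beta, cumhaz, w, s
    raise RuntimeError("SES iteration did not converge")
```

Passing the updates as functions keeps the driver independent of the frailty distribution, which enters only through `e_step` and `update_theta`.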

A remark on notation is needed. When calculating $\hat W_i^{(s)} = E(W_i \mid O, \hat\psi^{(s-1)})$ in Step 2, the expectation is considered to be a function of $\psi$ and is evaluated at $\psi = \hat\psi^{(s-1)}$, where $\hat\psi^{(s-1)} = (\hat\theta^{(s-1)}, \hat\beta^{(s-1)}, \hat\Lambda_0^{(s-1)})$ is the most recently computed iterate. However, when calculating $E\{L_1(\theta) \mid O, \hat\psi^{(s)}\}$ in Step 3, all instances of $\theta$ in $L_1(\theta)$ are left to vary freely and all instances of $\beta$ and $\Lambda_0(\cdot)$ are replaced by the most recently computed values $\hat\beta^{(s)}$ and $\hat\Lambda_0^{(s)}(\cdot)$. As can be seen from the sequence of steps, $\hat\beta^{(s)}$ and $\hat\Lambda_0^{(s)}(\cdot)$ also indirectly depend on $\hat\theta^{(s-1)}$ through $\hat W_1^{(s)}, \ldots, \hat W_n^{(s)}$ calculated in Step 2.

In the SES algorithm, the choice of frailty distribution directly impacts the calculation of $\hat W_i^{(s)} = E(W_i \mid O, \hat\psi^{(s-1)})$ in Step 2 and the calculation, hence maximization, of $\ell(\theta) = E\{L_1(\theta) \mid O, \hat\psi^{(s)}\}$ in Step 3. It follows that EM algorithms previously proposed for the proportional hazards regression model with a specific frailty distribution can be adapted to the current setting; examples include Nielsen et al. (1992) and Klein (1992), who consider the gamma frailty distribution, and Wang et al. (1995), who develop an EM algorithm for the positive stable proportional hazards frailty model. Indeed, one may choose any frailty distribution, the primary limitation being a computationally feasible characterization of the conditional expectations needed in Steps 2 and 3 of the SES algorithm.

If $\mathcal{L}(s \mid \tilde\theta)$ denotes the Laplace transform of $W$ when the true parameter is $\tilde\psi = (\tilde\theta, \tilde\beta, \tilde\Lambda_0)$ and $\mathcal{L}^{(r)}(s \mid \tilde\theta)$ denotes its $r$th derivative with respect to $s$, then it can be shown that

$$E(W_i \mid O, \tilde\psi) = -\frac{\mathcal{L}^{(D_i+1)}(\tilde H_i \mid \tilde\theta)}{\mathcal{L}^{(D_i)}(\tilde H_i \mid \tilde\theta)}, \tag{3.11}$$

where $\tilde H_i = \sum_{k=1}^{K_i} \tilde\Lambda_0(e_{ik}(\tilde\beta))$ (Aalen et al., 2008, §7.2.3); the leading minus sign reflects the identity $\mathcal{L}^{(r)}(s) = (-1)^r E(W^r e^{-sW})$. For this and other reasons, Aalen et al. (2008, §6.2.3) state that useful choices of frailty distributions should minimally have Laplace transforms that exist in closed form. One example is the class of power variance function distributions, which includes the gamma, inverse Gaussian, and positive stable distributions as special cases (Hougaard, 2000). Another is the class of generalized inverse Gaussian distributions, which also includes both the gamma and inverse Gaussian distributions (Jørgensen, 1982; Aalen et al., 2008). The lognormal distribution, though a perfectly valid choice, does not have a Laplace transform that exists in closed form. As a result, $\mathcal{L}(\cdot \mid \tilde\theta)$ and various related quantities (e.g., (3.11)) must be approximated numerically.
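As a worked instance of the Laplace transform calculation, consider a gamma frailty with mean 1 and variance $\theta$, a common parameterization that is assumed here. Its Laplace transform is $(1 + \theta s)^{-1/\theta}$, and because the conditional density of $W_i$ is proportional to $w^{D_i} e^{-H_i w} g(w \mid \theta)$, the conditional mean is $(D_i + 1/\theta)/(H_i + 1/\theta)$. The sketch below computes this closed form and, as a sanity check, the same quantity as a ratio of integrals on a uniform grid; both function names are hypothetical.

```python
import numpy as np


def gamma_frailty_e_step(D, H, theta):
    """Closed-form E(W_i | O) for a gamma frailty with mean 1, variance theta.

    D : number of observed events per cluster (D_i)
    H : summed baseline cumulative hazards per cluster (H_i)
    """
    D = np.asarray(D, dtype=float)
    H = np.asarray(H, dtype=float)
    return (D + 1.0 / theta) / (H + 1.0 / theta)


def e_step_by_quadrature(D, H, theta, upper=60.0, m=600001):
    """Numerical check for one cluster: ratio of sums on a uniform grid."""
    w = np.linspace(0.0, upper, m)[1:]          # drop w = 0 to avoid log(0)
    shape = 1.0 / theta
    # log of w^D * exp(-H w) * gamma kernel; constants cancel in the ratio
    log_kernel = (D + shape - 1.0) * np.log(w) - (H + shape) * w
    kernel = np.exp(log_kernel - log_kernel.max())
    return float((w * kernel).sum() / kernel.sum())
```

For example, with two events, a summed cumulative hazard of 1.5, and $\theta = 0.5$, the closed form gives $4/3.5$; a cluster with no events and zero accumulated hazard returns the prior mean of 1.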

Unfortunately, the existence of a Laplace transform in closed form is insufficient to ensure the availability of a useful algorithm, for this does not guarantee that $\ell(\theta)$ in Step 3 is easily computed. Aalen et al. (2008, §7.2.5) discuss the special nature of the gamma distribution in this regard and, abstracting from that discussion, describe two other classes of distributions considered suitable for shared frailty models: the generalized inverse Gaussian and Kummer distribution families. We provide the necessary implementation details for the SES algorithm in the case of the gamma frailty distribution in §3.2.1; our algorithm may be compared with that of Pan (2001), Zhang and Peng (2007), and Xu and Zhang (2010). In §3.2.2, we discuss implementation in the case of the inverse Gaussian frailty distribution.

We close this subsection by noting that there exist several possibilities for initializing $\hat\theta^{(0)}$ in Step 1. For example, $\hat\theta^{(0)}$ may be set arbitrarily, or one may attempt to employ (3.3). In §3.1.4, we introduce several other possibilities for estimating $\theta$ derived from novel moment identities. The resulting estimators can be used to set $\hat\theta^{(0)}$. In addition, one could also employ these estimators in lieu of the profile likelihood estimator in Step 3 of the SES algorithm.

Remark 3.1.1 An approach advocated in both Nielsen et al. (1992) and Wang et al. (1995) in the case of the proportional hazards model is to fix $\hat\theta^{(0)}$ and then run a version of the above algorithm in which Step 3 is modified in a way that sets $\hat\theta^{(s)} = \hat\theta^{(0)}$ at every iteration and computes the MLE of both $\beta$ and $\Lambda_0(\cdot)$. This procedure is then repeated for a grid of $\hat\theta^{(0)}$ values. The parameter set leading to the largest value of the profiled observed data likelihood function (i.e., in $\theta$) is then selected as the approximate MLE. In principle, variance estimates can be obtained by calculating, or otherwise approximating, the Hessian matrix for the marginal log-likelihood function. Pan (2001) considers a related idea in the context of the AFT gamma frailty model. However, it is important to note that the proposed procedure does not yield an approximate MLE since $\beta$ is not estimated using the efficient score function. The failure to use the efficient score also complicates variance estimation since one also cannot numerically differentiate the log-likelihood function. The SES algorithm as presented above, combined with the bootstrap methodology described in §3.1.3 below, provides a simple (if computationally intensive) method of variance estimation for all model parameters.

In document Topics In Linear Models: Methods For Clustered, Censored Data And Two-Stage Sampling Designs (Page 35-43)