Florida State University Libraries

Electronic Theses, Treatises and Dissertations

The Graduate School

2019

A Bayesian Semiparametric Joint Model for Longitudinal and Survival Data

FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES

A BAYESIAN SEMIPARAMETRIC JOINT MODEL FOR LONGITUDINAL AND SURVIVAL DATA

By

PENGPENG WANG

A Dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of

Doctor of Philosophy


Pengpeng Wang defended this dissertation on April 16, 2019. The members of the supervisory committee were:

Elizabeth H. Slate

Professor Co-Directing Dissertation

Jonathan R. Bradley

Professor Co-Directing Dissertation

Amy M. Wetherby

University Representative

Lifeng Lin

Committee Member


ACKNOWLEDGMENTS

First of all, I would like to express my sincere gratitude to my major advisor Dr. Elizabeth H. Slate for the continuous support of my PhD study and related research, for her motivation, patience, and immense knowledge. Her guidance and encouragement helped me throughout my research. Her elegant and well-organized personality has a great impact on my life. I am very grateful to have her as my advisor and she will always be a role model in my life.

I would like to thank my co-advisor Dr. Jonathan R. Bradley for sharing his research idea and expertise so willingly, for his insightful comments and encouragement. I appreciate him for spending time on my research and having discussions with me. Without his guidance and help this dissertation would not have been possible.

I am also grateful to the rest of my dissertation committee, Dr. Amy M. Wetherby and Dr. Lifeng Lin, for their time, support, guidance and good will throughout the preparation for my defense. I also thank them for their review of this document. I am especially appreciative of the opportunity to be involved with Dr. Amy M. Wetherby’s research team on multiple projects including “Autism Adaptive Community-based Treatment to Improve Outcomes Using Navigators (ACTION) Network” and “Mobilizing Community Systems to Engage Families in Early ASD Detection & Services,” which provided support for my research.


LIST OF TABLES

5.1 Parameter estimates for simulation study of Case 1-3 (Gaussian distributed data) using model GaussianMH (the Gaussian joint model with Metropolis-Hastings)

5.2 Parameter estimates for simulation study of Case 1-3 (Gaussian distributed data) using model GaussianSS (the Gaussian joint model with slice sampler)

5.3 MSE for the Gaussian joint model and the log-gamma joint model

5.4 Parameter estimates for simulation study of Case 4-6 (log-gamma distributed data) using the log-gamma joint model

5.5 Clustering evaluation for the Gaussian joint model and the log-gamma joint model

5.6 Contingency table for the Rand index

5.7 Effective sample size for the Gaussian joint model and the log-gamma joint model

6.1 Parameter estimates for the HIV data using the Gaussian joint model with slice sampler and the log-gamma joint model

6.2 Parameter estimates for the PSA data using the Gaussian joint model with slice


LIST OF FIGURES

2.1 Kernel density plots of the log-gamma distributions and the standard Gaussian distribution.

5.1 Histograms of simulated survival time. The number of clusters is indicated in the title heading of each panel.

5.2 Trace plots and density curves for the last 3,000 iterations using model GaussianSS in Case 3. (a) MCMC trace plot of β01 (intercept for the first subject). (b) Posterior density estimate of β01 from model GaussianSS (solid line) and the true value of β01 (dashed line). (c) MCMC trace plot of β11 (slope for the first subject). (d) Posterior density estimates of β11 from model GaussianSS (solid line) and the true value of β11 (dashed line).

5.3 Trace plots and density curves for the last 3,000 iterations using model GaussianSS in Case 3. (a) MCMC trace plot of γ (link parameter). (b) Posterior density estimate of γ from model GaussianSS (solid line) and the true value of γ (dashed line). (c) MCMC trace plot of α (covariate parameter). (d) Posterior density estimates of α from model GaussianSS (solid line) and the true value of α (dashed line).

5.4 True longitudinal observations vs. estimated longitudinal trajectories using model GaussianSS in Case 3 (the simulation study with three-cluster Gaussian distributed data for the Gaussian joint model).

5.5 True hazard rate vs. estimated hazard rate from model GaussianSS in Case 3 (the simulation study with three-cluster Gaussian distributed data for the Gaussian joint model).

5.6 Histograms of simulated survival time. The number of clusters is indicated in the title heading of each panel.

5.7 Trace plots and density curves for the last 3,000 iterations using the log-gamma joint model in Case 6. (a) MCMC trace plot of β0. (b) Posterior density estimates of β0 from the semiparametric log-gamma joint model (solid line) and the true value of β0 (dashed line). (c) MCMC trace plot of β1. (d) Posterior density estimates of β1 from the semiparametric log-gamma joint model (solid line) and the true value of β1 (dashed line).

5.8 Trace plots and density curves for the last 3,000 iterations using the log-gamma joint model in Case 6. (a) MCMC trace plot of γ. (b) Posterior density estimates of γ from the semiparametric log-gamma joint model (solid line) and the true value of γ (dashed line). (c) MCMC trace plot of β1. (d) Posterior density estimates of α from

5.9 True longitudinal observations vs. estimated longitudinal trajectories using the log-gamma joint model in Case 6 (three-cluster simulation study of log-gamma distributed data).

5.10 True hazard rate vs. estimated hazard rate using the log-gamma joint model in Case 6 (three-cluster simulation study of log-gamma distributed data).

5.11 True vs. estimated longitudinal trajectories and hazard rates at cluster level in Case 3. The red solid lines are the true trajectories and hazard rates in each cluster based on true cluster assignments. The blue dashed lines are the estimated trajectories and hazard rates. (a) and (b) are the results from model GaussianMH; (c) and (d) are the results from model GaussianSS; (e) and (f) are the results from model LG.

5.12 True vs. estimated longitudinal trajectories and hazard rates at cluster level in Case 6. The red solid lines are the true trajectories and hazard rates in each cluster based on true cluster assignments. The blue dashed lines are the estimated trajectories and hazard rates. (a) and (b) are the results from model GaussianMH; (c) and (d) are the results from model GaussianSS; (e) and (f) are the results from model LG.

5.13 Comparison of the semiparametric joint model and the parametric Cox models in Case 6 (three-cluster simulation with log-gamma distributed data). The black solid line is the true cumulative hazard. The dashed lines are the estimated cumulative hazard of the log-gamma semiparametric joint model (red dashed line) and the estimated cumulative hazard of the parametric Cox model with longitudinal effect (blue dashed line) and without longitudinal effect (green dashed line).

6.1 Observed trajectories of CD4 cell counts for all 467 patients in the original scale and the square root scale.

6.2 Trace plots and density curves using the Gaussian joint model with slice sampler for the HIV data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1.

6.3 Trace plots and density curves using the Gaussian joint model with slice sampler for the HIV data. (a) MCMC trace plot of the link parameter γ. (b) Posterior density estimates of the link parameter γ. (c) MCMC trace plot of covariate coefficients α. (d) Posterior density estimates of covariate coefficients α.

6.4 Trace plots and density curves using the log-gamma joint model for the HIV data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1.

parameter γ. (c) MCMC trace plot of covariate coefficients α. (d) Posterior density estimates of covariate coefficients α.

6.6 Observed square root of CD4 cell counts (black circles) vs. estimated longitudinal trajectories of 4 randomly selected patients using the Gaussian joint model with slice sampler (blue dashed line) and the log-gamma joint model (red dashed line).

6.7 Predicted CD4 trajectories, predicted hazard rate and predicted survival probability for each cluster using the Gaussian joint model with slice sampler and the log-gamma joint model for the AIDS data.

6.8 QQ-plot of residuals using Gaussian joint model and log-gamma joint model for the AIDS data.

6.9 Observed trajectories of PSA readings for 647 subjects who have more than three PSA readings.

6.10 Trace plots and density curves using the Gaussian joint model with slice sampler for the PSA data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1. (e) MCMC trace plot of the link parameter γ. (f) Posterior density estimates of the link parameter γ.

6.11 Trace plots and density curves using the log-gamma joint model for the PSA data. (a) MCMC trace plot of trajectory intercept β0. (b) Posterior density estimates of trajectory intercept β0. (c) MCMC trace plot of trajectory slope β1. (d) Posterior density estimates of trajectory slope β1. (e) MCMC trace plot of the link parameter γ. (f) Posterior density estimates of the link parameter γ.

6.12 Observed PSA readings (black circles) vs. estimated PSA longitudinal trajectories of 4 randomly selected subjects using the Gaussian joint model with slice sampler (blue dashed line) and the log-gamma joint model (red dashed line).

6.13 QQ-plot of residuals using Gaussian joint model and log-gamma joint model for the PSA data.

6.14 Predicted longitudinal trajectories, predicted hazard rate and predicted survival


ABSTRACT


CHAPTER 1 INTRODUCTION

Longitudinal data, sometimes referred to as panel data, are repeated observations of the same subjects over short or long periods of time. The subjects can be individuals, households, establishments, etc. Because longitudinal data follow the same subjects over time, they are useful for measuring within-subject change, whereas cross-sectional data measure different subjects at each time point. In clinical trials, patients are often measured at scheduled follow-up time points, and these repeated observations of the same patients over a period of time are also longitudinal data. We are interested in longitudinal measurements because repeated observations allow us to assess within-patient change and thereby quantify a treatment effect. For example, human immunodeficiency virus (HIV) patients may be followed over time, and their CD4 cell counts may be measured monthly. We can make inferences about the treatment effect based on the change in their observed CD4 cell counts.

In many cases, we are also concerned with time-to-event observations, which record the duration of time until one or more events of interest occur. This endpoint is accompanied by information on “failure” or “survival.” The event can be death, occurrence of a disease, or burnout of a light bulb, among other possibilities. The time-to-event, or survival time, can be measured in days, weeks, years, or any other unit of time. For example, if the event of interest is a heart attack, then the survival time can be the time in years until a person has a heart attack.


between them as well. Joint modeling of these two outcomes has the potential to improve the inference over separate independent analyses.

To jointly analyze the longitudinal outcome and survival outcome, we develop a joint statistical model. A routine framework is to use a regression model for the longitudinal outcome and a proportional hazards model for the survival outcome. To link these two endpoints, a trajectory function is incorporated as a predictor for the survival endpoint. Tsiatis and Davidian [37] review the rationale for and development of this kind of joint model. Song, Davidian and Tsiatis [32] proposed a similar joint model with a random effects longitudinal model and a Cox proportional hazards model. They use the expectation-maximization (EM) algorithm for estimation. Guo and Carlin [17] proposed a joint model with a mixed random effects longitudinal model and two hazards models, a Weibull model and a Cox proportional hazards model. They use a Bayesian approach via Markov chain Monte Carlo (MCMC) for inference. All of the above approaches assume a parametric trajectory function. However, the longitudinal measurements may be highly diverse, and the parametric model may impose too many parametric assumptions. Wang and Taylor [39] use the same joint model structure but include an integrated Ornstein-Uhlenbeck (IOU) stochastic process in the longitudinal model. This formulation allows the trajectory to deviate from a parametric form, but it may not be flexible enough to model between-patient variability, which reflects heterogeneity in the population.


CHAPTER 2 BACKGROUND

2.1 Gaussian Assumption

In linear regression, it is common to assume the error terms are Gaussian distributed, which implicitly assumes that the underlying observation is also Gaussian distributed. The Gaussian distribution has many useful properties. For example, central limit theorems often imply that sample averages are asymptotically Gaussian distributed. Thus, the Gaussian distribution is often a reasonable assumption in practice, since outcomes obtained on individuals can often be interpreted as averages (Lehmann [23]). In this dissertation, we are particularly interested in another property, namely, conjugacy. In a Bayesian analysis, if the posterior distribution is in the same family as the prior distribution, then the prior is called a conjugate prior for the likelihood function. If the likelihood function is a Gaussian distribution with unknown mean and known variance, then using a Gaussian prior for the mean ensures that the corresponding posterior distribution is also Gaussian. In this case, it is straightforward to sample directly from the posterior or full conditional distribution. Previous work on joint modeling most often assumes a Gaussian-based regression model for the longitudinal outcome with Gaussian distributed errors. However, the joint longitudinal-survival likelihood does not yield a recognizable form because the likelihood of the survival model is not Gaussian. It is then difficult to find conjugate priors, and sampling directly from the posteriors may not be possible. In this case, we may have to use a Metropolis-Hastings algorithm, which is computationally less efficient in terms of effective sample size, or sample approximately from the posterior distributions.
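The Gaussian conjugacy described above can be made concrete with a minimal sketch. It assumes observations that are Gaussian with unknown mean and known variance and a Gaussian prior on the mean; the function and argument names are illustrative and do not come from the dissertation.

```python
import random

def gaussian_posterior(y, sigma2, prior_mean, prior_var):
    """Posterior (mean, variance) of a Gaussian mean when sigma2 is known.

    Assumes y_k ~ N(mu, sigma2) independently and mu ~ N(prior_mean, prior_var).
    """
    n = len(y)
    post_precision = 1.0 / prior_var + n / sigma2      # precisions add
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + sum(y) / sigma2)
    return post_mean, post_var

# Because the posterior is again Gaussian, it can be sampled directly.
rng = random.Random(1)
post_mean, post_var = gaussian_posterior([1.2, 0.8, 1.0], sigma2=1.0,
                                         prior_mean=0.0, prior_var=1.0)
draws = [rng.gauss(post_mean, post_var ** 0.5) for _ in range(5)]
```

No accept/reject step is needed here, which is the practical benefit of conjugacy that the non-Gaussian survival likelihood removes.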


2.2 Log-Gamma Distribution

We define and introduce some properties of the log-gamma distribution. Let z be a gamma distributed random variable with shape parameter α > 0 and rate parameter κ > 0. Thus, the probability density function (pdf) of z is

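\[ f(z \mid \alpha, \kappa) = \frac{\kappa^{\alpha}}{\Gamma(\alpha)}\, z^{\alpha-1} \exp(-\kappa z). \tag{2.1} \]

The mean and variance of the gamma random variable z are well known and given by

\[ E(z) = \frac{\alpha}{\kappa}, \qquad \mathrm{Var}(z) = \frac{\alpha}{\kappa^{2}}. \]

Let

\[ q = \log(z). \tag{2.2} \]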

Then the random variable q has a log-gamma distribution with shape parameter α > 0 and rate parameter κ > 0 denoted by LG(α, κ). From (2.1) and (2.2), the pdf of the log-gamma random variable q is given by,

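\[ f(q \mid \alpha, \kappa) = \frac{\kappa^{\alpha}}{\Gamma(\alpha)} \exp\{\alpha q - \kappa \exp(q)\}; \quad q \in \mathbb{R}. \tag{2.3} \]

The mean and variance of the log-gamma random variable are given by (see Prentice [30])

\[ E(q) = \omega_0(\alpha) - \log(\kappa), \tag{2.4} \]

\[ \mathrm{Var}(q) = \omega_1(\alpha). \]

Here $\omega_k(\cdot)$ is the polygamma function for non-negative integer k. For a real value h we have that

\[ \omega_k(h) \equiv \frac{d^{k+1}}{dh^{k+1}} \log(\Gamma(h)). \]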
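As a rough numerical check of the moment formulas in (2.4), the sketch below draws log-gamma variates as logarithms of gamma variates and compares Monte Carlo moments with the polygamma expressions. The polygamma values are approximated by finite differences of `math.lgamma`, which is an assumption of this illustration (a numerical library would normally be used); all function names are hypothetical.

```python
import math
import random

def polygamma_numeric(k, h, eps=1e-3):
    """Approximate omega_k(h) for k = 0 or 1 by central differences of log Gamma."""
    if k == 0:
        return (math.lgamma(h + eps) - math.lgamma(h - eps)) / (2 * eps)
    if k == 1:
        return (math.lgamma(h + eps) - 2 * math.lgamma(h)
                + math.lgamma(h - eps)) / eps ** 2
    raise ValueError("only k = 0 or 1 is implemented in this sketch")

def sample_log_gamma(alpha, kappa, rng):
    """Draw q = log(z) with z ~ Gamma(shape=alpha, rate=kappa)."""
    # random.gammavariate takes a scale parameter, i.e. 1 / rate.
    return math.log(rng.gammavariate(alpha, 1.0 / kappa))

alpha, kappa = 2.0, 3.0
rng = random.Random(0)
draws = [sample_log_gamma(alpha, kappa, rng) for _ in range(20000)]

mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / (len(draws) - 1)
theory_mean = polygamma_numeric(0, alpha) - math.log(kappa)   # omega_0(alpha) - log(kappa)
theory_var = polygamma_numeric(1, alpha)                      # omega_1(alpha)
```

The Monte Carlo and theoretical values should agree up to simulation error; the agreement tightens as the number of draws grows.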

Figure 2.1: Kernel density plots of the log-gamma distributions and the standard Gaussian distribution.

Proposition 1. (Bradley et al. [5]) Let q ∼ LG(α, α), and let $q_{+} = \alpha^{1/2} q$. Then $q_{+}$ converges in distribution to the standard Gaussian distribution as α goes to infinity.

Proof. See Appendix A.3.
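As an informal plausibility check on Proposition 1 (not a substitute for the proof in Appendix A.3), the first two moments of the rescaled variable follow from (2.4) together with the standard large-α expansions of the digamma and trigamma functions, $\omega_0(\alpha) = \log\alpha - \tfrac{1}{2\alpha} + O(\alpha^{-2})$ and $\omega_1(\alpha) = \tfrac{1}{\alpha} + \tfrac{1}{2\alpha^{2}} + O(\alpha^{-3})$:

\[ E(\alpha^{1/2} q) = \alpha^{1/2}\{\omega_0(\alpha) - \log\alpha\} = -\frac{1}{2\alpha^{1/2}} + O(\alpha^{-3/2}) \longrightarrow 0, \]

\[ \mathrm{Var}(\alpha^{1/2} q) = \alpha\,\omega_1(\alpha) = 1 + \frac{1}{2\alpha} + O(\alpha^{-2}) \longrightarrow 1. \]

Matching the limiting mean and variance is consistent with, but does not by itself establish, convergence in distribution.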


2.3 Multivariate Log-Gamma Distribution

Similar to the multivariate Gaussian distribution, the multivariate log-gamma (MLG) distribution is a generalization of the one-dimensional (univariate) log-gamma distribution to higher dimensions. An MLG random vector is obtained as a linear combination of independent log-gamma random variables. Specifically, let the m-dimensional random vector $w = (w_1, \ldots, w_m)'$ consist of m mutually independent log-gamma random variables such that $w_i \sim \mathrm{LG}(\alpha_i, \kappa_i)$ for i = 1, . . . , m. Then, we define a multivariate log-gamma random vector $q \in \mathbb{R}^m$ as

\[ q = c + Vw, \tag{2.5} \]

where $V \in \mathbb{R}^{m \times m}$ is invertible and $c \in \mathbb{R}^m$. Then the pdf of q satisfies

\[ f(q \mid c, V, \alpha, \kappa) = \frac{1}{\det(VV')^{1/2}} \left( \prod_{i=1}^{m} \frac{\kappa_i^{\alpha_i}}{\Gamma(\alpha_i)} \right) \exp\left[ \alpha' V^{-1}(q - c) - \kappa' \exp\{V^{-1}(q - c)\} \right], \tag{2.6} \]

where “det” represents the determinant function, the m-dimensional vector $\alpha = (\alpha_1, \ldots, \alpha_m)'$ consists of shape parameters, and the m-dimensional vector $\kappa = (\kappa_1, \ldots, \kappa_m)'$ consists of rate parameters. The mean and covariance of the MLG random vector q are given by

\[ E(q \mid \alpha, \kappa) = c + V(\omega_0(\alpha) - \log(\kappa)), \]

\[ \mathrm{Cov}(q \mid \alpha, \kappa) = V \operatorname{diag}\{\omega_1(\alpha)\} V', \]

where, for a generic m-dimensional real-valued vector $k = (k_1, \ldots, k_m)'$, diag(k) is the m × m diagonal matrix with main diagonal equal to k. The function $\omega_j(h)$, for non-negative integer j, is a vector-valued polygamma function, where the i-th element of $\omega_j(h)$ is defined to be $\frac{d^{j+1}}{dh_i^{j+1}} \log(\Gamma(h_i))$ for i = 1, . . . , m.

Let MLG(c, V, α, κ) be shorthand for the pdf in (2.6). We see that the MLG pdf also has a double exponential term. The pattern is the main reason why conjugacy exists between our hazard model and the log-gamma and MLG distributions, which we take advantage of in subsequent chapters.
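The construction in (2.5) translates directly into a sampler: draw independent log-gamma variates and apply the affine map. The sketch below is a minimal pure-Python illustration; the function name, the particular c and V, and the use of nested lists for matrices are assumptions made for this example.

```python
import math
import random

def sample_mlg(c, V, alpha, kappa, rng):
    """Draw q = c + V w with independent w_i ~ LG(alpha_i, kappa_i), as in (2.5)."""
    m = len(c)
    # Each w_i is the log of a gamma variate; gammavariate uses scale = 1 / rate.
    w = [math.log(rng.gammavariate(alpha[i], 1.0 / kappa[i])) for i in range(m)]
    q = [c[i] + sum(V[i][j] * w[j] for j in range(m)) for i in range(m)]
    return q, w

rng = random.Random(0)
c = [0.0, 1.0]
V = [[1.0, 0.0], [0.5, 1.0]]       # invertible (lower triangular, nonzero diagonal)
alpha = [1.0, 1.0]
kappa = [1.0, 1.0]

draws = [sample_mlg(c, V, alpha, kappa, rng)[0] for _ in range(20000)]
mean_q = [sum(d[i] for d in draws) / len(draws) for i in range(2)]

# With alpha_i = kappa_i = 1, omega_0(1) equals minus the Euler-Mascheroni constant,
# so the mean formula gives c + V * (-euler, -euler).
euler = 0.5772156649015329
expected = [c[0] - euler * (V[0][0] + V[0][1]), c[1] - euler * (V[1][0] + V[1][1])]
```

Comparing `mean_q` with `expected` gives a quick check of the mean formula stated after (2.6).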

with mean c and covariance VV′. However, in our multivariate log-gamma distribution, each univariate component $w_i$ has its own shape and rate parameters $\alpha_i$ and $\kappa_i$. So each component of the MLG random vector can have a different log-gamma distribution, which is an important difference from the multivariate Gaussian distribution and potentially gives more flexibility. Similar to the univariate case, the m-dimensional MLG distribution converges to a multivariate Gaussian distribution as the shape and rate parameters go to infinity.

Proposition 2. (Bradley et al. [5]) Let $q \sim \mathrm{MLG}(c, \alpha^{1/2} V, \alpha 1_m, \alpha 1_m)$. Then q converges in distribution to a multivariate Gaussian distribution with mean c and covariance matrix VV′ as α goes to infinity.

Proof. See Appendix A.4.

2.4 Conditional Multivariate Log-Gamma Distribution

In this dissertation, we use a Gibbs sampler for Bayesian inference. For the model proposed in Chapter 3, the Gibbs sampler will require simulating from conditional distributions of multivariate log-gamma random vectors. Thus, we introduce the definition of the conditional multivariate log-gamma distribution.

We use the same definition of the conditional multivariate log-gamma distribution as Bradley et al. [5]. Let q ∼ MLG(c, V, α, κ), and partition this m-dimensional random vector so that $q = (q_1', q_2')'$, where $q_1$ is a g-dimensional random vector and $q_2$ is an (m − g)-dimensional random vector. Partition $V^{-1} = [H \;\; B]$ into an m × g matrix H and an m × (m − g) matrix B. Then $(q_1 \mid q_2 = d, c, V, \alpha, \kappa)$ has a conditional multivariate log-gamma distribution, where d is an (m − g)-dimensional real-valued vector. The pdf of the conditional multivariate log-gamma distribution is

\[ f(q_1 \mid q_2 = d, c, V, \alpha, \kappa) = C \exp\left\{ \alpha' H q_1 - \kappa_c' \exp(H q_1) \right\}, \tag{2.7} \]

where

\[ \kappa_c \equiv \exp\{Bd - V^{-1}c + \log(\kappa)\}, \tag{2.8} \]

and the normalizing constant C is


Proof of (2.7) - (2.9): See Appendix A.5.

Let cMLG(H, α, $\kappa_c$) be shorthand for the pdf in (2.7), where “cMLG” stands for “conditional multivariate log-gamma.” From the pdf in (2.7) we see that the cMLG distributions do not fall within the class of pdfs defined in (2.6). This is because the m × g real-valued matrix H in (2.7) is not square, while the real-valued matrix V in (2.6) is square. This property differs from the multivariate Gaussian distribution, where both marginal and conditional distributions obtained from a multivariate Gaussian random vector are multivariate Gaussian. Since cMLG is not MLG, it is difficult to sample directly from cMLG. To avoid sampling from a cMLG distribution in a Gibbs sampler, a data augmentation technique can be used to simulate instead from a marginal MLG random vector; see Appendix A.1-A.2 for details. Another way to sample from the cMLG distribution is the slice sampler (Neal [29]).

2.5 Slice Sampler

The slice sampler is a Markov chain Monte Carlo algorithm for pseudo-random number sampling, i.e., for drawing random samples from a statistical distribution. The method is based on the observation that, to sample a random variable, one can sample uniformly from the region under the graph of its density function.

Suppose we want to sample from a distribution for a random variable, x, taking values in some subset of $\mathbb{R}^n$, whose density is proportional to some function f(x). To do this we can sample uniformly from the (n + 1)-dimensional region that lies under the plot of f(x). This idea can be formalized by introducing an auxiliary real variable, y, and defining a joint distribution over x and y that is uniform over the region U = {(x, y) : 0 < y < f(x)} below the curve or surface defined by f(x). That is, the joint density for (x, y) is

\[ p(x, y) = \begin{cases} 1/Z & \text{if } 0 < y < f(x), \\ 0 & \text{otherwise,} \end{cases} \]

where $Z = \int f(x)\,dx$. The marginal density for x is then

\[ p(x) = \int_0^{f(x)} (1/Z)\,dy = f(x)/Z \]


chain that will converge to this uniform distribution. One way to do this is using the Gibbs sampler: we sample alternately from the conditional distribution of y|x ∼ U(0, f(x)), and from the conditional distribution of x|y over the region S = {x : y < f(x)}, which is called the “slice” defined by y. Here we describe the implementation of the univariate and multivariate slice samplers.

The univariate slice sampler discussed here replaces the current value, $x_0$, with a new value, $x_1$, found by a three-step procedure. To sample a random variable x with density proportional to a function f(x), we introduce an auxiliary variable y and iterate as follows:

1. Draw a real value, y, uniformly from (0, f($x_0$)), thereby defining a horizontal “slice”: S = {x : y < f(x)};

2. Find an interval, I = (L, R), around $x_0$ that contains all, or much, of the slice;

3. Sample the new point, $x_1$, from the part of the slice within this interval.

Finding the bounds of the horizontal slice in Step 2 may not be easy, because it involves inverting the function describing the distribution being sampled from. If both the pdf and its inverse are available, and the distribution is unimodal, then finding the slice and sampling from it are simple. If not, a stepping-out procedure can be used to find a region whose endpoints fall outside the slice. Then, a sample can be drawn from the slice using rejection sampling. Various procedures for this are described in detail by Neal [29]. Note that this algorithm can be used to sample from the area under any curve, regardless of whether the function integrates to one. In fact, scaling a function by a constant has no effect on the sampled x. This means that the algorithm can be used to sample from a distribution whose pdf is only known up to a constant (i.e., whose normalizing constant is unknown). To sample from a multivariate distribution, the univariate slice sampler can be applied to each variable in turn repeatedly. Alternatively, we can use a multivariate slice sampler with hyperrectangles. This method adapts the univariate algorithm to the multivariate case by substituting a hyperrectangle for the one-dimensional interval used in Step 2.
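The three-step procedure can be sketched as follows. This is a minimal illustration of a univariate slice sampler using the stepping-out and shrinkage procedures described by Neal [29], not the implementation used in this dissertation; the function name, the default width w, and the step limit m are assumptions. The target is supplied on the log scale and only up to an additive constant, and the example target is the unnormalized log-gamma kernel from (2.3).

```python
import math
import random

def slice_sample(log_f, x0, n_samples, w=1.0, m=50, rng=random):
    """Univariate slice sampler with stepping-out and shrinkage."""
    samples = []
    x = x0
    for _ in range(n_samples):
        # Step 1: draw the slice height y ~ U(0, f(x)), kept on the log scale.
        log_y = log_f(x) + math.log(1.0 - rng.random())
        # Step 2: stepping out. Place an interval of width w at random around x,
        # then expand each end until it leaves the slice or the step budget is spent.
        left = x - w * rng.random()
        right = left + w
        j = int(m * rng.random())
        k = (m - 1) - j
        while j > 0 and log_f(left) > log_y:
            left -= w
            j -= 1
        while k > 0 and log_f(right) > log_y:
            right += w
            k -= 1
        # Step 3: shrinkage. Propose uniformly on (left, right); on rejection,
        # shrink the interval toward the current point and try again.
        while True:
            x_new = left + (right - left) * rng.random()
            if log_f(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return samples

# Example target: log-gamma kernel exp{alpha*q - kappa*exp(q)}, constant omitted.
alpha, kappa = 2.0, 3.0
log_kernel = lambda q: alpha * q - kappa * math.exp(q)
draws = slice_sample(log_kernel, x0=0.0, n_samples=20000, rng=random.Random(0))
kept = draws[1000:]                      # discard a short burn-in
mc_mean = sum(kept) / len(kept)
```

Working with log f rather than f avoids numerical underflow in the tails, and omitting the normalizing constant illustrates the remark above that scaling f by a constant does not change the sampled values.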

In this dissertation, our parameters are low dimensional, and most of them are one-dimensional. We use the slice sampler for implementation.

2.6 Dirichlet Process


K-dimensional parameter vector ρ of positive real numbers. The Dirichlet distribution of order K ≥ 1 has parameters $\rho = (\rho_1, \ldots, \rho_K)$, $\rho_i > 0$ (i = 1, . . . , K), and has a probability density function (pdf) given by

\[ f(x_1, \ldots, x_K; \rho_1, \ldots, \rho_K) = \frac{1}{B(\rho)} \prod_{i=1}^{K} x_i^{\rho_i - 1}, \]

where $\sum_{i=1}^{K} x_i = 1$ and $x_i \ge 0$ for i = 1, . . . , K. The normalizing constant is the multivariate beta function, which can be expressed as

\[ B(\rho) = \frac{\prod_{i=1}^{K} \Gamma(\rho_i)}{\Gamma\left(\sum_{i=1}^{K} \rho_i\right)}. \]

Then, given a measurable set S, a base probability distribution P, and a positive real number ρ, a Dirichlet process is a stochastic process whose sample path is a probability distribution over S such that the following holds: for any measurable finite partition of S, denoted $\{B_i\}_{i=1}^{n}$, if X ∼ DP(ρ, P), then

\[ (X(B_1), \ldots, X(B_n)) \sim \mathrm{Dir}(\rho P(B_1), \ldots, \rho P(B_n)). \]
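To make the Dirichlet density and the finite-partition property concrete, the sketch below evaluates the log-density using the multivariate beta function and draws Dirichlet vectors by normalizing independent gamma variates (a standard construction). The function names, the concentration value, and the two-cell partition with base probabilities 0.25 and 0.75 are hypothetical choices for illustration.

```python
import math
import random

def dirichlet_log_pdf(x, rho):
    """Log of the Dirichlet pdf; requires x_i > 0 and sum(x) = 1."""
    log_beta = sum(math.lgamma(r) for r in rho) - math.lgamma(sum(rho))
    return sum((r - 1.0) * math.log(xi) for xi, r in zip(x, rho)) - log_beta

def sample_dirichlet(rho, rng):
    """Draw from Dir(rho) by normalizing independent Gamma(rho_i, 1) variates."""
    g = [rng.gammavariate(r, 1.0) for r in rho]
    total = sum(g)
    return [gi / total for gi in g]

# With rho = (1, 1, 1) the density is constant and equals 1 / B(rho) = Gamma(3) = 2.
flat_density = math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))

# Finite-partition marginal of a DP with concentration 2 and base
# probabilities P(B1) = 0.25, P(B2) = 0.75: (X(B1), X(B2)) ~ Dir(0.5, 1.5).
rng = random.Random(0)
concentration = 2.0
base_probs = [0.25, 0.75]
params = [concentration * p for p in base_probs]
draws = [sample_dirichlet(params, rng) for _ in range(20000)]
mean_b1 = sum(d[0] for d in draws) / len(draws)   # should be close to P(B1)
```

The sample mean of X(B1) recovers the base probability P(B1), reflecting that the base distribution is the mean of the process.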


2.7 Kaplan-Meier Estimator

In this section we review a technique for inference on the distribution of time-to-event using right-censored survival data. We will use the Kaplan-Meier estimator as a benchmark for comparison in the applications in Chapter 6.

Let T be a continuous random variable, which denotes the time until an event of interest occurs. Recall that the survival function is defined as

S(t) = P (T > t), t ≥ 0.

Suppose that the events occur at D distinct times $t_1 < t_2 < \cdots < t_D$, and that at time $t_i$ (i = 1, . . . , D) there are $d_i$ ($d_i > 0$) events. The data available for estimating S(t) include not only the event times ($t_1, t_2, \ldots, t_D$) but also the censoring times of the subjects. Let $N_i$ (i = 1, . . . , D) be the number of subjects who are at risk at time $t_i$. That is, $N_i$ counts the subjects who are alive at $t_i$ or experience the event of interest at $t_i$. The quantity $d_i/N_i$ provides an estimate of the conditional probability that a subject who survives to just prior to time $t_i$ experiences the event at time $t_i$. We will use this quantity to construct an estimator of the survival function and the cumulative hazard function: the Kaplan-Meier estimator (Klein and Moeschberger [21]). The Kaplan-Meier estimator, also known as the product-limit estimator, is a nonparametric estimator of the survival function proposed by Kaplan and Meier [20]. In medical research, this estimator is often used to estimate the survival probability of patients after treatment. In other fields, it may be used to measure the length of time people remain unemployed after a job loss (Meyer [26]), or the time-to-failure of a machine. The Kaplan-Meier estimator is defined as

\[ \hat{S}(t) = \begin{cases} 1, & \text{if } t < t_1, \\ \displaystyle\prod_{t_i \le t} \left( 1 - \frac{d_i}{N_i} \right), & \text{if } t_1 \le t. \end{cases} \tag{2.10} \]

This estimator is a step function with jumps at the observed event times. The size of these jumps depends not only on the number of events observed at each time point $t_i$, but also on the censoring pattern prior to $t_i$. One of the common estimators of the variance of the Kaplan-Meier estimator


The Kaplan-Meier estimator is not well defined when t is larger than the maximum observed time. For example, if the time-to-event corresponds to the time-to-death and the largest observed time is a death time, then the estimated value of the survival function is zero beyond this time point. However, if the largest observed time is a censoring time, then the value of the survival function beyond this time point is undetermined because we do not know when this last survivor would have died. Some nonparametric estimators have been proposed to address this ambiguity. For example, Efron [11] suggests that

\[ \hat{S}(t) = 0, \quad \text{if } t > t_{\max}, \]

where $t_{\max}$ represents the largest observed time. This estimator implies that the last survivor would die immediately after $t_{\max}$, which is the survivor’s censoring time. Thus it underestimates the survival probability when $t > t_{\max}$. Gill [14] suggests that

\[ \hat{S}(t) = \hat{S}(t_{\max}), \quad \text{if } t > t_{\max}. \]

This estimator implies that the last survivor would die at t = +∞. Thus, it overestimates the survival probability when $t > t_{\max}$.
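The product-limit formula (2.10) and the two tail conventions can be sketched in a few lines. This is an illustrative implementation under the assumption that each subject contributes one observed time and an event indicator (1 for an observed event, 0 for censoring); the function names and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate at each distinct event time.

    times  : observed times (event or censoring), one per subject
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (t_i, S_hat(t_i)) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)              # N_i: subjects still at risk just before t
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0                        # d_i: events observed at time t
        leaving = 0                  # events plus censorings at time t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            leaving += 1
            i += 1
        if d > 0:
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

def km_at(curve, t, t_max=None, tail="gill"):
    """Evaluate the step function at t; beyond t_max use Efron's or Gill's rule."""
    if t_max is not None and t > t_max and tail == "efron":
        return 0.0                   # Efron: last survivor fails right after t_max
    s = 1.0
    for ti, si in curve:
        if ti <= t:
            s = si                   # Gill (default): carry the last value forward
    return s

# Toy data: five subjects, two of them censored (at times 2 and 4).
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Here the largest observed time is a censoring time, so `km_at(curve, 5, t_max=4, tail="efron")` and `km_at(curve, 5, t_max=4)` differ, which illustrates the under- and overestimation discussed above.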

2.8 Posterior Predictive p-Value

Let $y_{ij}$ be the response of the i-th subject at time point j. Denote the rep-th MCMC replicate by $y_{ij}^{rep}$. Here $y_{ij}^{rep}$ is interpreted as a new replicate of $y_{ij}$, which is assumed to be independent of and identically distributed to $y_{ij}$. We stress that $y_{ij}$ may differ in value from $y_{ij}^{rep}$. To compute the posterior predictive p-value (Meng [25]), we perform the following steps for rep = 1, . . . , B.

Step 1. Generate new values $y_{ij}^{rep}$ from $f(y_{ij}^{rep} \mid \Theta^{rep})$, where $\Theta^{rep}$ is the rep-th MCMC iteration of a generic finite-dimensional real-valued parameter vector Θ.

Step 2. Compute the average of the new values $y_{ij}^{rep}$ over all iterations 1, . . . , B, i.e.,

\[ \hat{E}[y_{ij}^{rep} \mid y] = \frac{1}{B} \sum_{rep=1}^{B} y_{ij}^{rep}, \]

where y denotes the observed longitudinal measures and observed survival times.

Step 3. Compute the $\chi^2$ statistics, i.e.,

\[ \chi^2_{rep} = \sum_i \sum_j \frac{\left[ y_{ij}^{rep} - \hat{E}[y_{ij}^{rep} \mid y] \right]^2}{\hat{E}[y_{ij}^{rep} \mid y]} \]

and

\[ \chi^2_0 = \sum_i \sum_j \frac{\left[ y_{ij} - \hat{E}[y_{ij}^{rep} \mid y] \right]^2}{\hat{E}[y_{ij}^{rep} \mid y]}. \]

Step 4. The posterior predictive p-value is computed as

\[ p = \frac{1}{B} \sum_{rep=1}^{B} I(\chi^2_{rep} \ge \chi^2_0). \]

Suppose $y_{ij}$ is exactly equal to $\hat{E}[y_{ij}^{rep} \mid y]$ for all i and j, which we interpret as “overfitting.” In this case $\chi^2_0 = 0$, so every replicate satisfies $\chi^2_{rep} \ge \chi^2_0$ and p = 1. Similarly, p = 0 suggests “oversmoothing.” In general, an intermediate value of p between zero and one (e.g., 0.5) is preferable; however, it need not be exactly 0.5 (e.g., see Gelman [13] for an example).
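Steps 2 through 4 can be sketched as a short function, assuming the replicates from Step 1 have already been generated and that observations are stored as flat lists (one entry per subject-time pair). The function name and the toy numbers are hypothetical, and the sketch assumes the replicate averages are positive so that the chi-square denominators are well defined.

```python
def posterior_predictive_p(y_obs, y_reps):
    """Posterior predictive p-value from B replicated data sets.

    y_obs  : observed responses, flattened over subjects and time points
    y_reps : list of B replicates, each with the same layout as y_obs
    """
    B = len(y_reps)
    n = len(y_obs)
    # Step 2: average of the replicates at each (i, j) position.
    mean_rep = [sum(rep[k] for rep in y_reps) / B for k in range(n)]

    def chi2(y):
        # Step 3: chi-square discrepancy against the replicate average
        # (requires mean_rep[k] > 0 for every k).
        return sum((y[k] - mean_rep[k]) ** 2 / mean_rep[k] for k in range(n))

    chi2_obs = chi2(y_obs)
    # Step 4: fraction of replicates at least as discrepant as the observed data.
    return sum(1 for rep in y_reps if chi2(rep) >= chi2_obs) / B

reps = [[1.0, 2.0], [3.0, 2.0]]                      # toy replicates, average (2, 2)
p_overfit = posterior_predictive_p([2.0, 2.0], reps)  # data equal the average
p_far = posterior_predictive_p([5.0, 2.0], reps)      # data far from the average
```

In this toy case `p_overfit` equals 1 because the observed discrepancy is zero, and `p_far` equals 0 because no replicate is as discrepant as the observed data, matching the two extremes described above.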

2.9 Review of DIC

The deviance information criterion (DIC; Spiegelhalter [33]) is a statistic that approximates the following quantity,

\[ E_{y_{rep} \mid \theta}\left[ D_{rep}(E[\theta \mid y]) \right], \]

where the expectation operator “$E_{y_{rep} \mid \theta}(\cdot)$” is defined to be the expected value with respect to the “true model,” and the $D_{rep}(\cdot)$ operator is the “true deviance.” Specifically,

\[ D_{rep}(\theta) = -2 \log(p(y_{rep} \mid \theta)), \]


CHAPTER 3 A GAUSSIAN JOINT MODEL

Consider data sets that consist of both longitudinal measurements and survival outcomes such that the two types of responses are correlated. To capitalize on these correlations, we define two models in our joint modeling framework: a linear regression model for the longitudinal outcome, and a Cox proportional hazards model for the survival outcome. We include the mean of the longitudinal trajectory function as a predictor in the survival model so that these two models imply cross-dependence between the two outcomes. This strategy is similar to the one proposed in Tsiatis and Davidian [37]. To gain more flexibility in this joint model, we specify a Dirichlet process prior for the coefficients in the longitudinal trajectory. Sections 3.1 and 3.2 provide the details of our joint model formulation and the implied likelihood, respectively. Section 3.3 provides a review of the Dirichlet process prior and introduces its use for jointly modeling longitudinal and survival outcomes. Details on the remaining prior distributions and Gibbs sampling are provided in Sections 3.4 and 3.5, respectively. In Chapter 4 we will discuss the Bayesian joint model in the same framework but drop the Gaussian assumption and assume the longitudinal data follow a log-gamma distribution.

3.1 Model Formulation

Let $Y_{ij}$ be the longitudinal measurement for subject i at time point j, where i = 1, . . . , n and j = 1, . . . , $m_i$. Here n is the total number of subjects, and $m_i$ is the number of measurements for subject i. Let $N = \sum_{i=1}^{n} m_i$ be the total number of observations. The longitudinal model is given by

\[ Y_{ij} = \psi_{\beta_i}(t_{ij}) + \epsilon_{ij}, \tag{3.1} \]

\[ \epsilon_{ij} \sim F(\epsilon_{ij}), \]

where $\psi_{\beta_i}(t_{ij})$ is referred to as the trajectory function, and $\epsilon_{ij}$ is the error term with distribution $F(\epsilon_{ij})$. In this dissertation, we consider the trajectory function to have a linear form given by,

(27)

This is a common response for immunologic measures to therapy. However, others have used a quadratic trajectory when modeling CD4 counts (e.g., Tsiatis et al. [37]). The Gaussian distribution is commonly assumed for the longitudinal measurements. In this chapter, we review the semiparametric joint model under the Gaussian assumption proposed by Brown and Ibrahim [6], and derive the full conditional distributions in the Gibbs sampler. A simulation study is given in Chapter 5 to illustrate their method. If we assume the longitudinal observations are Gaussian distributed, we can define the longitudinal model as

$$Y_{ij} = \psi_{\beta_i}(t_{ij}) + \epsilon_{ij}, \tag{3.3}$$

$$\epsilon_{ij} \sim N(0, \sigma^2), \tag{3.4}$$

where $\psi_{\beta_i}(t_{ij})$ is the trajectory function given in (3.2), for $i = 1, \ldots, n$ and $j = 1, \ldots, m_i$.

Each subject has an observation on a possibly censored time-to-event (“failure” or “survival”), and additional covariate information. The association between the longitudinal and survival outcomes is captured by including the longitudinal trajectory function among the predictors for the survival outcome. Specifically, the hazard model is given by,

$$h(t|Y_i) = \lambda(t)\exp\{\gamma\psi_{\beta_i}(t) + X_i'\alpha\}, \tag{3.5}$$

where $\gamma$ is a scalar parameter linking the trajectory to the hazard function, $\lambda(t)$ is the baseline hazard, $X = (X_1, X_2, \ldots, X_n)'$ is an $n \times p$ matrix of the baseline covariates, and the $p$-dimensional parameter $\alpha$ is a vector of coefficients of the $p$-dimensional baseline covariates $X_i$.

3.2 Likelihood

Under the Gaussian assumption, the likelihood for the $i$-th subject in the longitudinal model is given by,

$$f(Y_i|\beta_i, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{m_i} \exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^{m_i}\left[Y_{ij} - \psi_{\beta_i}(t_{ij})\right]^2\right\},$$

where $Y_i = (Y_{i1}, \ldots, Y_{im_i})$ is the vector of the longitudinal observations of subject $i$, and $\beta_i = (\beta_{0i}, \beta_{1i})'$ is the vector of longitudinal parameters of subject $i$. Let $s_i$ be the survival time and $\nu_i$ be the censoring indicator for the $i$-th subject.

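The specification of the hazard function in (3.5) leads to the following distribution for the survival component given the trajectory function:

$$f(s_i, \nu_i|Y_i, \lambda, \gamma, \alpha) = \lambda(s_i)^{\nu_i}\exp\{\nu_i[\gamma\psi_{\beta_i}(s_i) + X_i'\alpha]\}\exp\left\{-\int_0^{s_i}\lambda(u)e^{\gamma\psi_{\beta_i}(u) + X_i'\alpha}\,du\right\}.$$

Now the joint likelihood of the $i$-th subject for the full set of parameters of interest under the Gaussian assumption can be written as

$$\begin{aligned} f(Y_i, s_i, \nu_i|\beta_i, \sigma^2, \gamma, \lambda, \alpha) &= f(s_i, \nu_i|Y_i, \gamma, \lambda, \alpha) \times f(Y_i|\beta_i, \sigma^2) \\ &= \lambda(s_i)^{\nu_i}\exp\left\{\nu_i[\gamma\psi_{\beta_i}(s_i) + X_i'\alpha] - \int_0^{s_i}\lambda(u)e^{\gamma\psi_{\beta_i}(u) + X_i'\alpha}\,du\right\} \\ &\quad \times \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{m_i}\exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^{m_i}\left[Y_{ij} - \psi_{\beta_i}(t_{ij})\right]^2\right\}. \end{aligned} \tag{3.6}$$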

We assume that the baseline hazard function is piecewise constant such that

$$\lambda(u) = \lambda_l, \quad u_l \le u < u_{l+1}, \quad l = 1, \ldots, L,$$

where $u_1, \ldots, u_{L+1}$ define the intervals for $\lambda(u)$. Then the cumulative hazard

$$\int_0^{s_i}\lambda(u)e^{\gamma\psi_{\beta_i}(u) + X_i'\alpha}\,du, \tag{3.7}$$

can be written as $e^{X_i'\alpha}\sum_{l=1}^{L}H_{il}(\beta_i, \gamma, \lambda_l)$, where

$$H_{il}(\beta_i, \gamma, \lambda_l) = I\{s_i \ge u_l\}\,\lambda_l\int_{u_l}^{\min(u_{l+1}, s_i)} e^{\gamma\psi_{\beta_i}(u)}\,du, \tag{3.8}$$

and $I\{s_i \ge u_l\}$ is an indicator function which is equal to one if the event time occurs in or later than the $l$-th interval, and zero otherwise. Brown and Ibrahim [6] mention that the integral in equation (3.8) does not have an analytical solution when the trajectory is quadratic. Instead, they use the GNU Scientific Library (GSL) (Galassi, Gough, and Jungman, 2001) to perform nonadaptive Gauss-Kronrod numerical integration. In this chapter and in Chapter 4, we consider an approximation for this integral. In a general case, we may consider

$$\int_0^{s_i}\lambda(u)e^{\gamma\psi_{\beta_i}(u) + X_i'\alpha}\,du \approx s_i\,\lambda(c_i)\,e^{\gamma\psi_{\beta_i}(c_i) + X_i'\alpha}, \tag{3.9}$$

where $c_i \in [0, s_i]$. We will use the approximation in (3.9) in our implementation of both the

3.3 Dirichlet Process Prior

It can be difficult to specify a parametric distribution for the parameter $\beta_i$ that defines the trajectory function. In particular, we cannot be sure that the $\beta_i$’s all come from the same distribution, nor can we confirm that a distributional assumption is correct. Also, in many settings, there is evidence of non-Gaussianity in the data. For example, Zhang and Davidian [41] relax the Gaussian assumption by approximating the random effects density with the seminonparametric (SNP) density. In this dissertation, we also relax the typical distributional assumptions made for the $\beta_i$’s. Specifically, we use a Dirichlet process (DP) prior (Antoniak [3]), which is a common approach to building a semiparametric model in nonparametric Bayesian statistics (reviewed in Section 2.6). This approach allows one to easily obtain posterior estimates using MCMC methods such as the Gibbs sampler. Thus, we incorporate a Dirichlet process model into our joint model. This new model allows for a more flexible and robust method to examine the relationship between longitudinal measurements and survival time, as it accounts for uncertainty in specifying a distribution for $\{\beta_i\}$.

3.3.1 Dirichlet Process Prior in the Gaussian Joint Model

We relax the distributional assumption on the $\beta_i$’s in the trajectory function (3.2) by applying a Dirichlet process prior on them, which is given by,

$$\beta_i|G \sim G,$$

$$G \sim DP(M, G_0), \tag{3.10}$$

$$G_0 = N_2(b_0, V_0),$$

where “DP” stands for Dirichlet process, $M$ is a positive scalar, and $N_2(b_0, V_0)$ is a 2-dimensional multivariate normal distribution with a 2-dimensional mean vector $b_0$ and a $2 \times 2$ variance-covariance matrix $V_0$. Both $b_0$ and $V_0$ are unknown hyperparameters, so we complete our Bayesian hierarchical model by specifying hyperprior distributions for them. Here we use a multivariate normal distribution and a Wishart distribution as their priors,

$$b_0 \sim N_2(\mu_b, \Sigma_b),$$

$$V_0^{-1} \sim W(S_v, n_v),$$

where “W” stands for the Wishart distribution. Here $\mu_b$ is a given 2-dimensional vector, $\Sigma_b$ and $S_v$ are given $2 \times 2$ positive definite matrices, and $n_v$ is a given positive real number.

3.3.2 Concentration Parameter

The concentration parameter M is a smoothing parameter in the Dirichlet process prior. It specifies how strong the discretization is among the unique values of βi’s. In other words, the

concentration parameter M describes how different the trajectories are among different clusters, i.e., the heterogeneity of the subjects. Escobar and West [12] discussed some advanced techniques related to the concentration parameter. One of the techniques is to use a single gamma prior on M and update M in the Gibbs sampler. Here we briefly introduce this method developed by Escobar and West [12], which we will use in this dissertation.
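The effect of M on the number of clusters can be illustrated with a small simulation. The sketch below is not part of the dissertation's implementation; it relies on the standard Chinese restaurant process representation of a Dirichlet process, in which the (i+1)-th subject opens a new cluster with probability M / (M + i). The function names `crp_num_clusters` and `expected_num_clusters` are hypothetical.

```python
import random


def crp_num_clusters(n, M, rng):
    """Simulate the number of clusters k among n subjects under DP(M, G0).

    Uses the Chinese restaurant process: when i subjects are already
    seated, the next one starts a new cluster with probability M / (M + i).
    """
    k = 0
    for i in range(n):
        if rng.random() < M / (M + i):
            k += 1
    return k


def expected_num_clusters(n, M):
    """Exact prior mean of k: the sum of the new-cluster probabilities."""
    return sum(M / (M + i) for i in range(n))


if __name__ == "__main__":
    rng = random.Random(0)
    for M in (0.1, 1.0, 10.0):
        draws = [crp_num_clusters(100, M, rng) for _ in range(2000)]
        print(M, sum(draws) / len(draws), expected_num_clusters(100, M))
```

Larger values of M give more clusters a priori, which is why the choice of M (or of its prior) matters for how heterogeneous the fitted trajectories can be.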

Suppose that we have a prior distribution $p(M)$ for the parameter $M$. Let $k$ be the number of unique values of $\beta_i$ ($i = 1, 2, \ldots, n$), that is, $k$ denotes the number of clusters, and let $D$ be the configuration of the $\beta_i$’s. From our model, the $\beta_i$’s are conditionally independent of $M$ when $k$, $b_0$, $V_0$ and the clustering configuration are known, and the parameters $(b_0, V_0)$ are also conditionally independent of $M$ when $k$ and the configuration are known. From Escobar and West [12], the full conditional distribution of $M$ is

$$p(M|k, n, b_0, V_0, D) = p(M|k, n) \propto p(M)P(k|M, n). \tag{3.11}$$

Using the result of Antoniak [3], the likelihood in (3.11) is given by,

$$P(k|M, n) = P(k|M = 1, n)\, n!\, M^k\, \frac{\Gamma(M)}{\Gamma(M + n)}, \quad (k = 1, 2, \ldots, n), \tag{3.12}$$

where the first term $P(k|M = 1, n)$ does not involve $M$. Suppose $M$ has a single gamma prior $\text{Gamma}(a_M, b_M)$ with a shape parameter $a_M > 0$ and a rate parameter $b_M > 0$. For $M > 0$, the gamma functions in (3.12) can be written as

$$\frac{\Gamma(M)}{\Gamma(M + n)} = \frac{(M + n)B(M + 1, n)}{M\,\Gamma(n)},$$

where $B(M + 1, n)$ is the usual beta function with parameters $(M + 1)$ and $n$. Then using the definition of the beta function, the full conditional distribution in (3.11) can be written as

$$p(M|k, n) \propto p(M)M^{k-1}(M + n)B(M + 1, n) \propto p(M)M^{k-1}(M + n)\int_0^1 \eta^{M}(1 - \eta)^{n-1}\,d\eta,$$

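for any $k = 1, 2, \ldots, n$. This implies that $p(M|k, n)$ is the marginal distribution from a joint distribution of $(M, \eta)$, where $(\eta|M, k, n) \sim \text{Beta}(M + 1, n)$ and

$$p(\eta|M, k, n) \propto \eta^{M}(1 - \eta)^{n-1} \quad (0 < \eta < 1). \tag{3.13}$$

Under the $\text{Gamma}(a_M, b_M)$ prior for $M > 0$,

$$\begin{aligned} p(M|\eta, k, n) &\propto \left[M^{a_M - 1}e^{-b_M M}\right] M^{k-1}(M + n)\,\eta^{M}(1 - \eta)^{n-1} \\ &\propto M^{a_M + k - 2}(M + n)\,e^{-M(b_M - \log(\eta))} \\ &\propto M^{a_M + k - 1}e^{-M(b_M - \log(\eta))} + n\,M^{a_M + k - 2}e^{-M(b_M - \log(\eta))}, \end{aligned}$$

which reduces easily to a mixture of two gamma densities, i.e.

$$(M|\eta, k, n) \sim \pi_\eta\,\text{Gamma}(a_M + k,\ b_M - \log(\eta)) + (1 - \pi_\eta)\,\text{Gamma}(a_M + k - 1,\ b_M - \log(\eta)), \tag{3.14}$$

where the weight $\pi_\eta$ is defined by

$$\frac{\pi_\eta}{1 - \pi_\eta} = \frac{a_M + k - 1}{n(b_M - \log(\eta))}.$$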

This full conditional distribution is used to update the concentration parameter M in the Gibbs sampler. In each iteration, we first sample a value for η from the beta distribution in (3.13) conditional on the current values of M, k and n; we then sample a new value for M from the mixture of gamma distributions in (3.14) conditional on the same k, n, and the value of η just generated. Here we use a single gamma prior for M. West [40] generalized this technique with a mixture of gamma distributions as the prior for M. Alternatively, we may consider the Metropolis-Hastings method or adaptive rejection sampling to sample from (3.11).
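The two-stage update of M described above can be sketched in a few lines. This is a minimal illustration under the stated Gamma(a_M, b_M) prior, not the code used for the results in this dissertation; the function name `update_concentration` and the small floor on η (to guard against log(0)) are assumptions made for the example.

```python
import math
import random


def update_concentration(M, k, n, a_M, b_M, rng):
    """One Gibbs update of M given k clusters among n subjects.

    Follows (3.13) and (3.14): draw eta ~ Beta(M + 1, n), then draw M
    from a two-component gamma mixture with common rate b_M - log(eta).
    """
    # Auxiliary variable from (3.13); the floor only guards against log(0).
    eta = max(rng.betavariate(M + 1.0, n), 1e-300)
    rate = b_M - math.log(eta)  # log(eta) <= 0, so rate >= b_M > 0

    # Mixture weight from pi / (1 - pi) = (a_M + k - 1) / (n * rate).
    odds = (a_M + k - 1.0) / (n * rate)
    pi_eta = odds / (1.0 + odds)

    shape = a_M + k if rng.random() < pi_eta else a_M + k - 1.0
    # random.gammavariate is parameterised by (shape, scale), scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)


if __name__ == "__main__":
    rng = random.Random(42)
    M = 1.0
    for _ in range(5):
        M = update_concentration(M, k=4, n=100, a_M=2.0, b_M=1.0, rng=rng)
        print(M)
```

In a full sampler, k would be recomputed from the current cluster labels before each call.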

3.4 Prior Distributions

Besides the Dirichlet process prior on the longitudinal coefficients, we need to specify proper priors on the other parameters in the joint likelihood (3.6). In the longitudinal model, we use an inverse gamma distribution as the prior distribution of $\sigma^2$, i.e.

$$\sigma^2 \sim IG(a_\sigma, b_\sigma),$$

where “IG” stands for the inverse gamma distribution, and $a_\sigma$ and $b_\sigma$ are given positive real numbers.

The prior distributions of the parameters in the survival model are shown below.

$$\gamma \sim N(\mu_\gamma, \sigma_\gamma^2),$$

$$\lambda_l \sim \text{Gamma}(a_l, b_l), \quad l = 1, \ldots, L,$$

$$\alpha \sim N_p(\mu_\alpha, \Sigma_\alpha),$$

where $\mu_\gamma$ and $\sigma_\gamma$ are given real numbers, and the $a_l$’s and $b_l$’s are given positive real numbers. $\mu_\alpha$ is a given $p$-dimensional real vector, $\Sigma_\alpha$ is a given $p \times p$ positive definite matrix, and $p$ is the number of baseline covariates.

3.5 Gibbs Sampler

Gibbs sampling is a common method used for mixture of Dirichlet process models. When the prior is conjugate, it is convenient to sample directly from the posterior distribution since the posterior has the same form as the prior. However, when we have a non-conjugate prior, sampling is difficult since the posterior may be intractable. In the Gaussian joint model, we use the Gibbs sampling method with Metropolis-Hastings (Hastings [18]) to estimate the parameters. The algorithm is described in Steps 1–8, below.

In the Gaussian joint model, suppose we have $k$ distinct values of $\beta_i$, which means there are $k$ clusters. Let $z_i$ ($i = 1, \ldots, n$) denote the cluster indicator of $Y_i$, that is, $z_i = j$ implies individual $i$ is in the $j$-th cluster. For each cluster $z$, $\phi_z$ ($1 \le z \le k$) is the value of the $\beta_i$’s in that cluster.

Step 1. Simulate $z_i$ ($i = 1, \ldots, n$) using Neal’s algorithm 8 (Neal [28]).

For $i = 1, \ldots, n$, let $k^-$ be the number of distinct $z_j$ for $j \ne i$, and let $h = k^- + m$, where $m$ is the number of auxiliary parameters. Label these $z_j$ ($j \ne i$) with values in $\{1, \ldots, k^-\}$. If $z_i = z_j$ for some $j \ne i$, draw values independently from $G_0$ for those $\phi_z$ for which $k^- + 1 \le z \le h$. If $z_i \ne z_j$ for all $j \ne i$, let $\phi_{k^-+1} = \phi_{z_i}$, and draw values independently from $G_0$ for those $\phi_z$ for which $k^- + 2 \le z \le h$. Now the distinct values of the $\beta_i$’s are $\{\phi_1, \ldots, \phi_{k^-}, \phi_{k^-+1}, \ldots, \phi_h\}$. Draw a new value for $z_i$ from $\{1, \ldots, h\}$ using the following probabilities:

$$P(z_i = z|z_{-i}, Y_i, \phi_1, \ldots, \phi_h) = \begin{cases} b\,\dfrac{n_{-i,z}}{n - 1 + M}\,f(Y_i, s_i, \nu_i|\phi_z, \Omega), & 1 \le z \le k^-, \\[2mm] b\,\dfrac{M/m}{n - 1 + M}\,f(Y_i, s_i, \nu_i|\phi_z, \Omega), & k^- < z \le h, \end{cases} \tag{3.15}$$

where $z_{-i} = (z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n)$, $n_{-i,z}$ is the number of $z_j$ for $j \ne i$ that are equal to $z$, and $b$ is the appropriate normalizing constant. $f(Y_i, s_i, \nu_i|\phi_z, \Omega)$ is the Gaussian joint likelihood of our joint model.
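The label update in Step 1 can be sketched as follows. This is a simplified illustration of Neal's algorithm 8, not the dissertation's implementation: the cluster parameter is a scalar, and `log_lik` and `draw_G0` are hypothetical callables standing in for the joint log-likelihood log f(Y_i, s_i, ν_i | φ, Ω) and a draw from G_0. The common factor b / (n − 1 + M) cancels when the weights are normalized, so it is omitted.

```python
import math
import random


def update_label(i, z, phi, M, m, draw_G0, log_lik, rng):
    """Resample the cluster label z[i] with m auxiliary components.

    z   : list of integer cluster labels, modified in place
    phi : dict mapping label -> cluster parameter, modified in place
    """
    # Cluster sizes n_{-i,z} with subject i removed.
    counts = {}
    for j, zj in enumerate(z):
        if j != i:
            counts[zj] = counts.get(zj, 0) + 1
    labels = list(counts)
    candidates = [phi[c] for c in labels]
    weights = [float(counts[c]) for c in labels]

    # Auxiliary parameters: reuse phi[z[i]] when subject i is a singleton,
    # and fill the remaining slots with fresh draws from G0.
    aux = [phi[z[i]]] if z[i] not in counts else []
    while len(aux) < m:
        aux.append(draw_G0(rng))
    candidates += aux
    weights += [M / m] * m

    # Unnormalised log-probabilities, stabilised before exponentiating.
    logp = [math.log(w) + log_lik(i, c) for w, c in zip(weights, candidates)]
    top = max(logp)
    probs = [math.exp(lp - top) for lp in logp]
    pick = rng.choices(range(len(candidates)), weights=probs)[0]

    if pick < len(labels):
        z[i] = labels[pick]
    else:
        new_label = max(phi) + 1
        phi[new_label] = candidates[pick]
        z[i] = new_label

    # Discard parameters of clusters that are no longer occupied.
    in_use = set(z)
    for c in list(phi):
        if c not in in_use:
            del phi[c]
    return z, phi


if __name__ == "__main__":
    rng = random.Random(1)
    y = [-5.0, -5.1, -4.9, 5.0, 5.1, 4.9]  # toy data, Gaussian likelihood
    z, phi = [0] * len(y), {0: 0.0}
    for _ in range(20):
        for i in range(len(y)):
            update_label(i, z, phi, 1.0, 3,
                         lambda r: r.gauss(0.0, 5.0),
                         lambda i, p: -0.5 * (y[i] - p) ** 2, rng)
    print(z, phi)
```

The toy likelihood in the usage example is a unit-variance Gaussian chosen only to keep the sketch short; in the joint model it would be replaced by the logarithm of (3.6).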

Step 2. Simulate $\phi_z$ ($z = 1, \ldots, k$) using the Metropolis-Hastings method or the slice sampler.

Recall that the Metropolis-Hastings algorithm for sampling from a distribution for $x$ with probabilities $\pi(x)$, using a proposal distribution $g(x^*|x)$, updates the state $x$ as follows:

Draw a candidate state, $x^*$, according to the probabilities $g(x^*|x)$. Compute the acceptance probability

$$a(x^*, x) = \min\left[1,\ \frac{g(x|x^*)}{g(x^*|x)}\,\frac{\pi(x^*)}{\pi(x)}\right]. \tag{3.16}$$

With probability $a(x^*, x)$, set the new state, $x'$, to $x^*$. Otherwise, let $x'$ be the same as $x$. This update from $x$ to $x'$ leaves $\pi$ invariant.
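A generic version of the update in (3.16) is sketched below, working on the log scale for numerical stability. The names `mh_step`, `log_target`, `propose`, and `log_proposal` are hypothetical, and the standard normal target in the usage example is only a stand-in for the full conditionals that appear in the joint model.

```python
import math
import random


def mh_step(x, log_target, propose, log_proposal, rng):
    """One Metropolis-Hastings update from state x.

    log_target(x)      : log pi(x), up to an additive constant
    propose(x, rng)    : draws a candidate x* from g(. | x)
    log_proposal(a, b) : log g(a | b)
    """
    x_star = propose(x, rng)
    log_ratio = (log_target(x_star) - log_target(x)
                 + log_proposal(x, x_star) - log_proposal(x_star, x))
    accept_prob = math.exp(min(0.0, log_ratio))  # a(x*, x) in (3.16)
    return x_star if rng.random() < accept_prob else x


def run_chain(n_iter, x0, rng):
    """Random-walk chain targeting a standard normal (illustration only)."""
    log_target = lambda x: -0.5 * x * x
    propose = lambda x, r: x + r.gauss(0.0, 1.0)
    log_proposal = lambda a, b: -0.5 * (a - b) ** 2  # symmetric, cancels
    chain, x = [], x0
    for _ in range(n_iter):
        x = mh_step(x, log_target, propose, log_proposal, rng)
        chain.append(x)
    return chain


if __name__ == "__main__":
    chain = run_chain(20000, 0.0, random.Random(0))
    print(sum(chain) / len(chain))
```

When the proposal is the prior, as in several steps of this chapter, `log_proposal(a, b)` depends only on `a` and the prior terms cancel from the ratio, leaving a likelihood ratio.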

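This approach can be applied in the Gaussian joint modeling problem to update $\phi_z$ ($z = 1, 2, \ldots, k$), which are the distinct values of the longitudinal coefficients $\beta_i$. We will also use the Metropolis-Hastings method to update some of the other parameters in our joint model. Let $Y = (Y_1, \ldots, Y_n)$, and $z = (z_1, \ldots, z_n)$. For $z = 1, 2, \ldots, k$, the full conditional distribution of $\phi_z$ is proportional to the base density $G_0(\phi_z|b_0, V_0)$ multiplied by the joint likelihood contributions of the subjects in cluster $z$,

$$p(\phi_z|Y, z, b_0, V_0, \sigma^2, \gamma, \lambda, \alpha) \propto \left[\prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\phi_z, b_0, V_0, \sigma^2, \gamma, \lambda, \alpha)^{I(z_i = z)}\right] G_0(\phi_z|b_0, V_0).$$

If we choose $G_0(\phi_z|b_0, V_0)$ to be the proposal distribution, we find that this factor cancels when computing the acceptance probability in (3.16), leaving

$$a(\phi_z^*, \phi_z) = \min\left[1,\ \frac{\prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\phi_z^*, b_0, V_0, \sigma^2, \gamma, \lambda, \alpha)^{I(z_i = z)}}{\prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\phi_z, b_0, V_0, \sigma^2, \gamma, \lambda, \alpha)^{I(z_i = z)}}\right].$$

With probability $a(\phi_z^*, \phi_z)$, set the new state of $\phi_z$ to be $\phi_z^*$. Otherwise, let the new state be the same as $\phi_z$. Alternatively, we can use the slice sampler to simulate from the full conditional distribution of $\phi_z$.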

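Step 3. Simulate $b_0$ and $V_0$ using conjugate priors.

The Dirichlet process prior in the Gaussian joint model has a base distribution with mean $b_0$ and covariance matrix $V_0$. We specify a conjugate prior $N_2(\mu_b, \Sigma_b)$ on $b_0$, and a conjugate prior $W(S_v, n_v)$ on $V_0^{-1}$, where “W” stands for the Wishart distribution. Thus we can sample directly from their full conditional distributions to update these two parameters. The full conditional distribution of $b_0$ is given by,

$$\begin{aligned} p(b_0|Y, z, \phi, V_0, \sigma^2, \lambda, \gamma, \alpha) &\propto \left[\prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, b_0, V_0, \sigma^2, \gamma, \lambda, \alpha)\right] p(\phi)\,p(b_0) \\ &\propto p(\phi)\,p(b_0) \\ &\propto \left[\prod_{z=1}^{k} p(\phi_z|b_0, V_0)\right] \times p(b_0) \\ &\propto \left[\prod_{z=1}^{k} G_0(\phi_z|b_0, V_0)\right] \times N_2(\mu_b, \Sigma_b) \\ &\sim N_2\left(\left(\Sigma_b^{-1} + kV_0^{-1}\right)^{-1}\left[\Sigma_b^{-1}\mu_b + V_0^{-1}\sum_{z=1}^{k}\phi_z\right],\ \left(\Sigma_b^{-1} + kV_0^{-1}\right)^{-1}\right). \end{aligned}$$

Similarly, the full conditional distribution of $V_0$ is given by,

$$\begin{aligned} p\left(V_0^{-1}|Y, z, \phi, b_0, \sigma^2, \lambda, \gamma, \alpha\right) &\propto \left[\prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, b_0, V_0, \sigma^2, \gamma, \lambda, \alpha)\right] p(\phi)\,p\left(V_0^{-1}\right) \\ &\propto p(\phi)\,p\left(V_0^{-1}\right) \\ &\propto \left[\prod_{z=1}^{k} p(\phi_z|b_0, V_0)\right] \times p\left(V_0^{-1}\right) \\ &\propto \left[\prod_{z=1}^{k} G_0(\phi_z|b_0, V_0)\right] \times W\left(V_0^{-1}|S_v, n_v\right) \\ &\sim W\left(\left[S_v + \sum_{z=1}^{k}(\phi_z - b_0)(\phi_z - b_0)'\right]^{-1},\ n_v + k\right). \end{aligned}$$

Step 4. Simulate $\sigma^2$ using conjugate prior.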

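The Gaussian joint model assumes that the longitudinal observations have a constant variance $\sigma^2$. With an $IG(a_\sigma, b_\sigma)$ prior, the full conditional distribution of $\sigma^2$ is given by,

$$\begin{aligned} p(\sigma^2|Y, z, \phi, b_0, V_0, \lambda, \gamma, \alpha) &\propto \left[\prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, b_0, V_0, \sigma^2, \gamma, \lambda, \alpha)\right] \times IG(a_\sigma, b_\sigma) \\ &\propto \left(\sigma^2\right)^{-\sum_{i=1}^{n} m_i/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\sum_{j=1}^{m_i}\left[Y_{ij} - \psi_{\beta_i}(t_{ij})\right]^2\right\} \times \left(\sigma^2\right)^{-a_\sigma - 1}\exp\left\{-\frac{b_\sigma}{\sigma^2}\right\} \\ &\sim IG\left(a_\sigma + \frac{1}{2}\sum_{i=1}^{n} m_i,\ b_\sigma + \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{m_i}\left[Y_{ij} - \psi_{\beta_i}(t_{ij})\right]^2\right). \end{aligned}$$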
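The conjugate draw for σ² in Step 4 can be sketched with the standard library alone, using the fact that the reciprocal of a Gamma(shape, rate) variable is inverse gamma with the same parameters. The function name `draw_sigma2` and the flat list of residuals are assumptions for illustration; the residuals are taken to be the values Y_ij − ψ_{β_i}(t_ij) over all subjects and time points.

```python
import random


def draw_sigma2(residuals, a_sigma, b_sigma, rng):
    """Draw sigma^2 from its inverse gamma full conditional.

    residuals : all N values of Y_ij - psi_{beta_i}(t_ij)
    Posterior shape is a_sigma + N / 2 and posterior rate is
    b_sigma + (sum of squared residuals) / 2.
    """
    shape = a_sigma + 0.5 * len(residuals)
    rate = b_sigma + 0.5 * sum(r * r for r in residuals)
    # If G ~ Gamma(shape, rate) then 1 / G ~ IG(shape, rate).
    # random.gammavariate takes (shape, scale) with scale = 1 / rate.
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)


if __name__ == "__main__":
    rng = random.Random(0)
    resid = [rng.gauss(0.0, 2.0) for _ in range(5000)]  # true variance 4
    draws = [draw_sigma2(resid, 0.01, 0.01, rng) for _ in range(500)]
    print(sum(draws) / len(draws))
```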

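Step 5. Update $\gamma$ using the Metropolis-Hastings method.

The link parameter $\gamma$ has a normal prior $N(\mu_\gamma, \sigma_\gamma^2)$, so the full conditional distribution of $\gamma$ is given by,

$$p(\gamma|Y, z, \phi, b_0, V_0, \sigma^2, \lambda, \alpha) \propto \left[\prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, b_0, V_0, \sigma^2, \gamma, \lambda, \alpha)\right] N(\mu_\gamma, \sigma_\gamma^2).$$

If we choose the prior $N(\mu_\gamma, \sigma_\gamma^2)$ to be the proposal distribution, this factor cancels when computing the acceptance probability in (3.16), leaving

$$a(\gamma^*, \gamma) = \min\left[1,\ \frac{F_\gamma(\gamma^*)}{F_\gamma(\gamma)}\right],$$

where

$$F_\gamma(\gamma) = \prod_{i=1}^{n}\exp\left\{\nu_i\gamma\psi_{\beta_i}(s_i) - e^{X_i'\alpha}\sum_{l=1}^{L}H_{il}(\beta, \gamma, \lambda_l)\right\}.$$

With probability $a(\gamma^*, \gamma)$, set the new state of $\gamma$ to be $\gamma^*$. Otherwise, let the new state be the same as the current value of $\gamma$.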

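Step 6. Simulate $\lambda_l$ ($l = 1, \ldots, L$) using conjugate prior.

The baseline hazard function is assumed to be piecewise constant. For each $l = 1, \ldots, L$, we specify a conjugate prior for $\lambda_l$, which is $\text{Gamma}(a_l, b_l)$. Then for each $l = 1, \ldots, L$, the full conditional distribution of $\lambda_l$ is given by,

$$\begin{aligned} p(\lambda_l|Y, z, \phi, b_0, V_0, \sigma^2, \gamma, \alpha) &\propto \lambda_l^{n_l}\exp\left\{-\sum_{i=1}^{n} e^{X_i'\alpha}H_{il}(\beta, \gamma, \lambda_l)\right\} \times \text{Gamma}(\lambda_l|a_l, b_l) \\ &\sim \text{Gamma}\left(a_l + n_l,\ b_l + \sum_{i=1}^{n} e^{X_i'\alpha}I\{s_i \ge u_l\}\int_{u_l}^{\min(u_{l+1}, s_i)} e^{\gamma\psi_{\beta_i}(u)}\,du\right), \end{aligned}$$

where $H_{il}(\beta, \gamma, \lambda_l)$ is defined in (3.8); $n_l$ is the number of subjects whose survival time is within the interval $[u_l, u_{l+1})$, for $l = 1, \ldots, L$.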

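Step 7. Simulate $\alpha$ using the Metropolis-Hastings method or the slice sampler.

The baseline coefficients $\alpha$ have a non-conjugate prior $N_p(\mu_\alpha, \Sigma_\alpha)$. The full conditional distribution of $\alpha$ is given by,

$$\begin{aligned} p(\alpha|Y, z, \phi, b_0, V_0, \sigma^2, \gamma, \lambda) &\propto \left[\prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, b_0, V_0, \sigma^2, \gamma, \lambda, \alpha)\right] N_p(\mu_\alpha, \Sigma_\alpha) \\ &\propto \left[\prod_{i=1}^{n}\exp\{\nu_i X_i'\alpha\}\exp\left\{-e^{X_i'\alpha}\sum_{l=1}^{L}H_{il}(\beta, \gamma, \lambda_l)\right\}\right] N_p(\mu_\alpha, \Sigma_\alpha). \end{aligned}$$

We choose the prior $N_p(\mu_\alpha, \Sigma_\alpha)$ to be the proposal distribution, so the acceptance probability is

$$a(\alpha^*, \alpha) = \min\left[1,\ \frac{F_\alpha(\alpha^*)}{F_\alpha(\alpha)}\right],$$

where

$$F_\alpha(\alpha) = \prod_{i=1}^{n}\exp\left\{\nu_i X_i'\alpha - e^{X_i'\alpha}\sum_{l=1}^{L}H_{il}(\beta, \gamma, \lambda_l)\right\}.$$

With probability $a(\alpha^*, \alpha)$, set the new state of $\alpha$ to be $\alpha^*$. Otherwise, let the new state be the same as the current value of $\alpha$. Alternatively, we can use the slice sampler to simulate from the full conditional distribution of $\alpha$.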

Step 8. Simulate M using the technique discussed in Section 3.3.2.

We use a single gamma prior Γ(aM, bM) on the concentration parameter M in the Dirichlet


CHAPTER 4 LOG-GAMMA JOINT MODEL

We discussed the Gaussian joint model in Chapter 3. In this chapter, we are still interested in jointly modeling both longitudinal measurements and survival outcomes. The framework of the joint model in this chapter is the same as in Chapter 3: a linear regression model is used to fit the longitudinal measurements, and a Cox proportional hazard function is used for the survival outcome. To join these two models, we still use the mean of the longitudinal trajectory function as a predictor in the survival model. However, we will challenge the Gaussian assumption in Chapter 3, and assume the longitudinal measurements to have a log-gamma distribution instead. We also use a Dirichlet process prior for the coefficients in the longitudinal trajectory to allow more flexibility in our models. Then we will complete the specification of the Bayesian model and implement a Gibbs sampler for inference.

4.1 Model Formulation

Let $Y_{ij}$ be the longitudinal measurement for subject $i$ at time point $j$, where $i = 1, \ldots, n$ and $j = 1, \ldots, m_i$. Here $n$ is the total number of subjects, and $m_i$ is the number of measurements for subject $i$. Let $N = \sum_{i=1}^{n} m_i$ be the total number of observations. The longitudinal model is given by,

$$Y_{ij} = \psi_{\beta_i}(t_{ij}) + \epsilon_{ij},$$

$$\epsilon_{ij} \sim F(\epsilon_{ij}),$$

where $\psi_{\beta_i}(t_{ij})$ is referred to as the trajectory function, and $\epsilon_{ij}$ is the error term with distribution $F(\epsilon_{ij})$. In this chapter, we still consider the trajectory function to have a linear form given by,

$$\psi_{\beta_i}(t_{ij}) = \beta_{0i} + \beta_{1i}t_{ij}. \tag{4.1}$$

commonly used for the longitudinal data (e.g., Brown and Ibrahim [6]). We reviewed the Gaussian joint model and derived the full conditional distributions in a Gibbs sampler for inference in Chapter 3. In this chapter, we instead assume the longitudinal data to have a log-gamma distribution. Then the longitudinal model is defined as

$$Y_{ij} = \psi_{\beta_i}(t_{ij}) + \epsilon_{ij},$$

$$\epsilon_{ij} \sim LG(\alpha, \kappa),$$

where $\psi_{\beta_i}(t_{ij})$ is the trajectory function given in (4.1), for $i = 1, \ldots, n$ and $j = 1, \ldots, m_i$. Here we assume that the mean of the longitudinal error is equal to zero. From (2.4) we have $\kappa = \exp\{\omega_0(\alpha)\}$, where $\omega_0(\cdot)$ is the digamma function.
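The zero-mean constraint can be checked numerically. The sketch below assumes the log-gamma density used in this chapter, κ^α / Γ(α) · exp{αx − κe^x}, under which an LG(α, κ) variable is the logarithm of a gamma variable with shape α and rate κ. The helper names `digamma` and `draw_log_gamma` are hypothetical, and the digamma function is approximated by the usual recurrence plus asymptotic series because the Python standard library does not provide it.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def draw_log_gamma(alpha, kappa, rng):
    """Draw from LG(alpha, kappa) as the log of a Gamma(alpha, rate=kappa)."""
    return math.log(rng.gammavariate(alpha, 1.0 / kappa))


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = 2.0
    kappa = math.exp(digamma(alpha))  # constraint giving a zero-mean error
    eps = [draw_log_gamma(alpha, kappa, rng) for _ in range(20000)]
    print(sum(eps) / len(eps))  # should be close to zero
```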

Each subject has an observation on a possibly censored time-to-event (“failure” or “survival”) and additional covariate information. The association between the longitudinal and survival outcomes is captured by including the longitudinal trajectory function among the predictors for the survival outcome. Specifically, the hazard model for the $i$-th subject is given by,

$$h(t|Y_i) = \lambda(t)\exp\{\gamma\psi_{\beta_i}(t) + X_i'\alpha\},$$

where $\gamma$ is a scalar parameter linking the trajectory to the hazard function, $\lambda(t)$ is the baseline hazard, and the $p$-dimensional parameter $\alpha$ is a vector of coefficients of the $i$-th subject’s $p$-dimensional baseline covariate $X_i$.

4.2 Likelihood

Under the log-gamma assumption, the likelihood for the $i$-th subject in the longitudinal model is given by,

$$f(Y_i|\beta_i, \alpha, \kappa) = \prod_{j=1}^{m_i}\frac{\kappa^{\alpha}}{\Gamma(\alpha)}\exp\left\{\alpha[Y_{ij} - \psi_{\beta_i}(t_{ij})] - \kappa\exp\{Y_{ij} - \psi_{\beta_i}(t_{ij})\}\right\},$$

where $Y_i = (Y_{i1}, \ldots, Y_{im_i})$ is the vector of the longitudinal observations of subject $i$, and $\beta_i = (\beta_{0i}, \beta_{1i})'$ is the vector of longitudinal parameters of subject $i$. As in the Gaussian joint model in Chapter 3, let $s_i$ be the survival time and $\nu_i$ be the censoring indicator for the $i$-th subject.

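The specification of the hazard function in (4.1) leads to the following distribution for the survival component given the trajectory function:

$$f(s_i, \nu_i|Y_i, \beta_i, \lambda, \gamma, \alpha) = \lambda(s_i)^{\nu_i}\exp\{\nu_i[\gamma\psi_{\beta_i}(s_i) + X_i'\alpha]\}\exp\left\{-\int_0^{s_i}\lambda(u)e^{\gamma\psi_{\beta_i}(u) + X_i'\alpha}\,du\right\},$$

where we introduce a subscript on the $p$-dimensional vector of covariates, $X_i$. Now the joint likelihood of the $i$-th subject for the full set of parameters of interest can be written as

$$\begin{aligned} f(Y_i, s_i, \nu_i|\beta_i, \lambda, \gamma, \alpha, \alpha, \kappa) &= f(s_i, \nu_i|Y_i, \lambda, \gamma, \alpha) \times f(Y_i|\beta_i, \alpha, \kappa) \\ &= \lambda(s_i)^{\nu_i}\exp\left\{\nu_i[\gamma\psi_{\beta_i}(s_i) + X_i'\alpha] - \int_0^{s_i}\lambda(u)e^{\gamma\psi_{\beta_i}(u) + X_i'\alpha}\,du\right\} \\ &\quad \times \prod_{j=1}^{m_i}\frac{\kappa^{\alpha}}{\Gamma(\alpha)}\exp\left\{\alpha[Y_{ij} - \psi_{\beta_i}(t_{ij})] - \kappa\exp\{Y_{ij} - \psi_{\beta_i}(t_{ij})\}\right\}. \end{aligned} \tag{4.2}$$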

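We assume that the baseline hazard function is piecewise constant such that

$$\lambda(u) = \lambda_l, \quad u_l \le u < u_{l+1}, \quad l = 1, \ldots, L,$$

where $u_1, \ldots, u_{L+1}$ define piecewise constant intervals for $\lambda(u)$. Then the cumulative hazard

$$\int_0^{s_i}\lambda(u)e^{\gamma\psi_{\beta_i}(u) + X_i'\alpha}\,du, \tag{4.3}$$

can be written as $e^{X_i'\alpha}\sum_{l=1}^{L}H_{il}(\beta_i, \gamma, \lambda)$, where

$$H_{il}(\beta_i, \gamma, \lambda) = I\{s_i \ge u_l\}\,\lambda_l\int_{u_l}^{\min(u_{l+1}, s_i)} e^{\gamma\psi_{\beta_i}(u)}\,du, \tag{4.4}$$

and $I\{s_i \ge u_l\}$ is an indicator function which is equal to one if the event time occurs in or later than the $l$-th interval, and zero otherwise.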

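may consider the Taylor series to obtain an approximation for this integral,

$$\begin{aligned} H_{il}(\beta_i, \gamma, \lambda_l) &= I\{s_i \ge u_l\}\,\lambda_l\int_{u_l}^{\min(u_{l+1}, s_i)} e^{\gamma(\beta_{0i} + \beta_{1i}u)}\,du \\ &= e^{\gamma\beta_{0i}}I\{s_i \ge u_l\}\,\lambda_l\left[\frac{e^{\gamma\beta_{1i}\min(u_{l+1}, s_i)} - e^{\gamma\beta_{1i}u_l}}{\gamma\beta_{1i}}\right] \\ &\approx e^{\gamma\beta_{0i}}I\{s_i \ge u_l\}\,\lambda_l\,\frac{(1 + \gamma\beta_{1i}\min(u_{l+1}, s_i)) - (1 + \gamma\beta_{1i}u_l)}{\gamma\beta_{1i}} \\ &= e^{\gamma\beta_{0i}}I\{s_i \ge u_l\}\,\lambda_l\left[\min(u_{l+1}, s_i) - u_l\right]. \end{aligned}$$

Then the integral in (4.3) can be written as

$$\int_0^{s_i}\lambda(u)e^{\gamma\psi_{\beta_i}(u) + X_i'\alpha}\,du \approx J_i\,e^{X_i'\alpha + \gamma\beta_{0i}}, \tag{4.5}$$

where

$$J_i = \sum_{l=1}^{L} I\{s_i \ge u_l\}\,\lambda_l\left[\min(u_{l+1}, s_i) - u_l\right].$$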

We will use the approximation in (4.5) to update the link parameter γ. This is helpful to obtain conjugacy in the Gibbs sampler.
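To see how the first-order Taylor approximation compares with the closed form for a linear trajectory, the two versions of the cumulative hazard can be coded side by side. This is an illustrative sketch, not the implementation used for the results; the function names are hypothetical, intervals are indexed from zero, and `x_alpha` stands for the linear predictor X_i'α.

```python
import math


def H_exact(l, s_i, beta0, beta1, gamma, lam, u):
    """H_il in (4.4) for psi(u) = beta0 + beta1 * u, using the closed form."""
    if s_i < u[l]:
        return 0.0
    upper = min(u[l + 1], s_i)
    a = gamma * beta1
    if abs(a) < 1e-12:  # flat trajectory: the integrand is constant
        integral = upper - u[l]
    else:
        integral = (math.exp(a * upper) - math.exp(a * u[l])) / a
    return lam[l] * math.exp(gamma * beta0) * integral


def H_taylor(l, s_i, beta0, gamma, lam, u):
    """First-order Taylor approximation of H_il, as used in (4.5)."""
    if s_i < u[l]:
        return 0.0
    return lam[l] * math.exp(gamma * beta0) * (min(u[l + 1], s_i) - u[l])


def cumulative_hazard(s_i, x_alpha, beta0, beta1, gamma, lam, u, exact=True):
    """Cumulative hazard (4.3) as exp(X_i' alpha) times the sum of H_il."""
    total = 0.0
    for l in range(len(lam)):
        if exact:
            total += H_exact(l, s_i, beta0, beta1, gamma, lam, u)
        else:
            total += H_taylor(l, s_i, beta0, gamma, lam, u)
    return math.exp(x_alpha) * total


if __name__ == "__main__":
    u = [0.0, 1.0, 2.0, 3.0]   # knots u_1, ..., u_{L+1}
    lam = [0.1, 0.2, 0.3]      # piecewise-constant baseline hazard
    for beta1 in (0.0, 0.05, 0.5):
        ex = cumulative_hazard(2.5, 0.2, 1.0, beta1, 0.5, lam, u, exact=True)
        ap = cumulative_hazard(2.5, 0.2, 1.0, beta1, 0.5, lam, u, exact=False)
        print(beta1, ex, ap)
```

The approximation is exact when γβ_{1i} = 0 and degrades as |γβ_{1i}| times the interval endpoints grows, which is worth keeping in mind when the slopes or the link parameter are large.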

4.3 Dirichlet Process Prior in the Log-Gamma Joint Model

In the log-gamma joint model, we relax the typical distributional assumption on the $\beta_i$’s in (4.1) by applying a Dirichlet process prior on them, which is given by,

$$\beta_i|G \sim G,$$

$$G \sim DP(M, G_0), \tag{4.6}$$

$$G_0 = cMLG(H_G, \alpha_G, \kappa_G),$$

where

$$H_G = \begin{pmatrix} -\begin{pmatrix} 1 & t_{11} \\ \vdots & \vdots \\ 1 & t_{nm_n} \end{pmatrix} \\ \begin{pmatrix} \gamma\nu_1(1, s_1) \\ \vdots \\ \gamma\nu_n(1, s_n) \end{pmatrix} \\ \begin{pmatrix} \gamma(1, c_1) \\ \vdots \\ \gamma(1, c_n) \end{pmatrix} \\ V_0 \end{pmatrix}, \quad \alpha_G = \begin{pmatrix} \delta_1 1_N \\ \delta_2 1_n \\ \delta_3 1_n \\ \alpha_0 1_2 \end{pmatrix}, \quad \text{and} \quad \kappa_G = \begin{pmatrix} \sigma_1 1_N \\ \sigma_2 1_n \\ \sigma_3 1_n \\ \kappa_0 1_2 \end{pmatrix}.$$

κ0 are positive hyperparameters. The $2 \times 2$ matrix $V_0$ is unknown, so we complete our Bayesian hierarchical model by specifying a prior distribution for it. Let

$$V_0 = \begin{pmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{pmatrix}. \tag{4.7}$$

Here we put a restriction on $V_0$ to make inference easier. Let $v_{11} = v_{22} = 1$ and $v_{12} = 0$; $v_{21}$ can be any real value. This implies that $V_0$ is a lower triangular matrix, and we consequently have $\det(V_0) = 1$. This form will aid in simplifying Gibbs sampling. The prior of $v_{21}$ is given by,

$$v_{21} \sim LG(\alpha_1, \kappa_1),$$

where $\alpha_1$ is a positive shape hyperparameter and $\kappa_1$ is a positive rate hyperparameter.

4.4 Priors and Hyperpriors

We specify proper priors on the parameters in the log-gamma joint likelihood (4.2). Specifically, we set up the priors on the baseline hazard and the covariate coefficients in the survival model as

$$\gamma \sim LG(\alpha_2, \kappa_2),$$

$$\lambda_l \sim \text{Gamma}(a_l, b_l), \quad l = 1, \ldots, L,$$

$$\alpha \sim cMLG\left(\begin{pmatrix} X \\ \delta_5 1_p' \end{pmatrix}, \begin{pmatrix} \delta_4 1_n \\ \alpha_3 \end{pmatrix}, \begin{pmatrix} \sigma_4 1_n \\ \kappa_3 \end{pmatrix}\right),$$

where $X$ is the $n \times p$ covariate matrix, and $p$ is the number of covariates. $\alpha_2$, $\alpha_3$, $\kappa_2$, $\kappa_3$ are positive hyperparameters. $a_l$, $b_l$ ($l = 1, \ldots, L$), $\delta_4$, $\delta_5$ and $\sigma_4$ are given positive real numbers. In the longitudinal model, we set the priors as

$$\alpha \sim \text{Gamma}(\theta_1, \tau_1), \tag{4.8}$$

where $\theta_1$ and $\tau_1$ are given positive real numbers.

uniform distribution as the hyperprior for the shape parameters $\alpha_k$ ($k = 1, 2, 3, 4$), and use the Gamma distribution as the hyperprior for the rate parameter $\kappa_k$ ($k = 1, 2, 3, 4$), i.e.,

$$\alpha_k \sim U(0, U_0),$$

$$\kappa_k \sim \text{Gamma}(\alpha_k, \beta_k),$$

for $k = 1, 2, 3, 4$, where $U_0$, the $\alpha_k$’s and the $\beta_k$’s are given positive real numbers.

4.5 Gibbs Sampler

Gibbs sampling is a common method used for mixture of Dirichlet process models. When the prior is conjugate, it is convenient to sample from the posterior distribution since the posterior has the same form as the prior. But, when we have a non-conjugate prior, it is hard to do the sampling since the posterior may be intractable. In our log-gamma joint model, we use a collapsed Gibbs sampling method with Metropolis-Hastings (Hastings [18]) updates where needed. The algorithm is described in Steps 1–10, below.

Suppose we have $k$ distinct values of $\beta_i$, which means there are $k$ clusters. Let $z_i$ ($i = 1, \ldots, n$) denote the cluster indicator of $Y_i$, that is, $z_i = j$ implies individual $i$ is in the $j$-th cluster. For each cluster $z$, $\phi_z$ ($1 \le z \le k$) is the value of the $\beta_i$’s in that cluster.

Step 1. Simulate $z_i$ ($i = 1, \ldots, n$) using Neal’s algorithm 8 (Neal [28]).

For $i = 1, \ldots, n$, let $k^-$ be the number of distinct $z_j$ for $j \ne i$, and let $h = k^- + m$, where $m$ is the number of auxiliary parameters. Label these $z_j$ ($j \ne i$) with values in $\{1, \ldots, k^-\}$. If $z_i = z_j$ for some $j \ne i$, draw values independently from $G_0$ for those $\phi_z$ for which $k^- + 1 \le z \le h$. If $z_i \ne z_j$ for all $j \ne i$, let $\phi_{k^-+1} = \phi_{z_i}$, and draw values independently from $G_0$ for those $\phi_z$ for which $k^- + 2 \le z \le h$. Now the distinct values of the $\beta_i$’s are $\{\phi_1, \ldots, \phi_{k^-}, \phi_{k^-+1}, \ldots, \phi_h\}$. Draw a new value for $z_i$ from $\{1, \ldots, h\}$ using the following probabilities:

$$P(z_i = z|z_{-i}, Y_i, \phi_1, \ldots, \phi_h) = \begin{cases} b\,\dfrac{n_{-i,z}}{n - 1 + M}\,f(Y_i, s_i, \nu_i|\phi_z, \lambda, \gamma, \alpha, \alpha, \kappa), & 1 \le z \le k^-, \\[2mm] b\,\dfrac{M/m}{n - 1 + M}\,f(Y_i, s_i, \nu_i|\phi_z, \lambda, \gamma, \alpha, \alpha, \kappa), & k^- < z \le h, \end{cases}$$

where $z_{-i} = (z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n)$, $n_{-i,z}$ is the number of $z_j$ for $j \ne i$ that are equal to $z$, and $b$ is the appropriate normalizing constant. $f(Y_i, s_i, \nu_i|\phi_z, \Omega)$ is the log-gamma joint likelihood of our joint model.

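Step 2. Simulate intermediate parameters

In the collapsed Gibbs sampler, $\kappa_0$, $\kappa_1$, $\kappa_2$ and $\kappa_3$ are intermediate parameters. Their prior distributions are denoted as $p(\kappa_0)$, $p(\kappa_1)$, $p(\kappa_2)$ and $p(\kappa_3)$. Here we use $\text{Gamma}(\alpha_\kappa, \beta_\kappa)$ with shape parameter $\alpha_\kappa$ and rate parameter $\beta_\kappa$ as the same prior distribution for all of them, although one could choose different values of $\alpha_\kappa$ and $\beta_\kappa$ for each intermediate parameter. Then their full conditional distributions are given below.

$$\begin{aligned} p(\kappa_0|Y, z, \phi, V_0, \alpha, \kappa, \gamma, \lambda, \alpha) &\propto \prod_{z=1}^{k}\kappa_0^{2\alpha_0}\exp\left\{-\kappa_0 1_2'\exp(V_0\phi_z)\right\} \times \kappa_0^{\alpha_\kappa - 1}\exp(-\beta_\kappa\kappa_0) \\ &\propto \kappa_0^{(2k\alpha_0 + \alpha_\kappa) - 1}\exp\left\{-\kappa_0\left[\sum_{z=1}^{k} 1_2'\exp(V_0\phi_z) + \beta_\kappa\right]\right\} \\ &\sim \text{Gamma}\left(2k\alpha_0 + \alpha_\kappa,\ \sum_{z=1}^{k} 1_2'\exp(V_0\phi_z) + \beta_\kappa\right), \end{aligned}$$

$$\begin{aligned} p(\kappa_1|Y, z, \phi, V_0, \alpha, \kappa, \gamma, \lambda, \alpha) &\propto \prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, \Omega) \times p(v_{21}) \times p(\kappa_1) \\ &\propto LG(v_{21}|\alpha_1, \kappa_1) \times \text{Gamma}(\kappa_1|\alpha_\kappa, \beta_\kappa) \\ &\propto \kappa_1^{\alpha_1}\exp\{-\kappa_1\exp(v_{21})\} \times \kappa_1^{\alpha_\kappa - 1}\exp(-\beta_\kappa\kappa_1) \\ &\propto \kappa_1^{(\alpha_1 + \alpha_\kappa) - 1}\exp\{-\kappa_1(e^{v_{21}} + \beta_\kappa)\} \\ &\sim \text{Gamma}\left(\alpha_1 + \alpha_\kappa,\ e^{v_{21}} + \beta_\kappa\right), \end{aligned}$$

$$\begin{aligned} p(\kappa_2|Y, z, \phi, V_0, \alpha, \kappa, \gamma, \lambda, \alpha) &\propto \prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, \Omega) \times p(\gamma) \times p(\kappa_2) \\ &\propto LG(\gamma|\alpha_2, \kappa_2) \times \text{Gamma}(\kappa_2|\alpha_\kappa, \beta_\kappa) \\ &\propto \kappa_2^{\alpha_2}\exp\{-\kappa_2\exp(\gamma)\} \times \kappa_2^{\alpha_\kappa - 1}\exp(-\beta_\kappa\kappa_2) \\ &\propto \kappa_2^{(\alpha_2 + \alpha_\kappa) - 1}\exp\{-\kappa_2(e^{\gamma} + \beta_\kappa)\} \\ &\sim \text{Gamma}\left(\alpha_2 + \alpha_\kappa,\ e^{\gamma} + \beta_\kappa\right), \end{aligned}$$

and

$$\begin{aligned} p(\kappa_3|Y, z, \phi, V_0, \alpha, \kappa, \gamma, \lambda, \alpha) &\propto \prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, \Omega) \times p(\alpha) \times p(\kappa_3) \\ &\propto cMLG(\alpha|\alpha_3, \kappa_3) \times \text{Gamma}(\kappa_3|\alpha_\kappa, \beta_\kappa) \\ &\propto \kappa_3^{\alpha_3}\exp\left\{-\kappa_3\exp(\delta_5 1_p'\alpha)\right\} \times \kappa_3^{\alpha_\kappa - 1}\exp(-\beta_\kappa\kappa_3) \\ &\propto \kappa_3^{(\alpha_3 + \alpha_\kappa) - 1}\exp\left\{-\kappa_3\left[\exp(\delta_5 1_p'\alpha) + \beta_\kappa\right]\right\} \\ &\sim \text{Gamma}\left(\alpha_3 + \alpha_\kappa,\ \exp(\delta_5 1_p'\alpha) + \beta_\kappa\right). \end{aligned}$$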

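Step 3. Simulate $\phi_z$ ($z = 1, \ldots, k$) using conjugate prior.

In this dissertation we use a linear trajectory function, so the unique values $\phi_z$ are two-dimensional vectors. Let $\phi_z = (\phi_{0z}, \phi_{1z})'$. We have $\phi_z \overset{iid}{\sim} G_0$ ($z = 1, 2, \ldots$). Let $S_z = \{i : z_i = z\}$. By using the approximation in (3.9), the full conditional distribution of $\phi_z$ ($z = 1, 2, \ldots$) is given by,

$$\begin{aligned} &\times \exp\left\{-\begin{pmatrix} s_1\lambda(c_1)e^{X_1'\alpha}I(z_1 = z) \\ \vdots \\ s_n\lambda(c_n)e^{X_n'\alpha}I(z_n = z) \end{pmatrix}'\exp\left\{\begin{pmatrix} \gamma(1, c_1) \\ \vdots \\ \gamma(1, c_n) \end{pmatrix}\phi_z\right\}\right\} \\ &\times \exp\left\{(\delta_1 1_N', \delta_2 1_n', \delta_3 1_n', \alpha_0 1_2')H_G\phi_z - (\sigma_1 1_N', \sigma_2 1_n', \sigma_3 1_n', \kappa_0 1_2')\exp(H_G\phi_z)\right\} \\ &\sim cMLG(H_G, \alpha_\phi, \kappa_\phi), \end{aligned}$$

where

$$\alpha_\phi = \left(\alpha\begin{pmatrix} I(z_1 = z) \\ \vdots \\ I(z_n = z) \end{pmatrix}' + \delta_1 1_N',\ \begin{pmatrix} I(z_1 = z) \\ \vdots \\ I(z_n = z) \end{pmatrix}' + \delta_2 1_n',\ \delta_3 1_n',\ \alpha_0 1_2'\right)'$$

and

$$\kappa_\phi = \left(\kappa\begin{pmatrix} e^{Y_{11}}I(z_1 = z) \\ \vdots \\ e^{Y_{nm_n}}I(z_n = z) \end{pmatrix}' + \sigma_1 1_N',\ \sigma_2 1_n',\ \begin{pmatrix} s_1\lambda(c_1)e^{X_1'\alpha}I(z_1 = z) \\ \vdots \\ s_n\lambda(c_n)e^{X_n'\alpha}I(z_n = z) \end{pmatrix}' + \sigma_3 1_n',\ \kappa_0 1_2'\right)'.$$

Step 4. Simulate $V_0$.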

In the parameter matrix $V_0$, we assume that $v_{21}$ has a LG prior $LG(\alpha_1, \kappa_1)$. The full conditional

where $H_v = (\phi_{01}, \ldots, \phi_{0k}, 1)'$, $\alpha_v = (\alpha_0 1_k', \alpha_1)'$, and $\kappa_v = (\kappa_0 e^{\phi_{01}}, \ldots, \kappa_0 e^{\phi_{0k}}, \kappa_1)'$. Here $k$ is the number of unique values of the $\beta_i$’s.

Step 5. Simulate $\alpha$ using the Metropolis-Hastings method.

The full conditional distribution of $\alpha$ is,

$$p(\alpha|Y, z, \phi, V_0, \kappa, \gamma, \lambda, \alpha) \propto \prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, \Omega) \times \text{Gamma}(\alpha|\theta_1, \tau_1).$$

If we choose the prior $\text{Gamma}(\theta_1, \tau_1)$ to be the proposal distribution, we find that this factor cancels when computing the acceptance probability in (3.16), leaving

$$a(\alpha^*, \alpha) = \min\left[1,\ \frac{\prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\phi, z, V_0, \alpha^*, \kappa, \gamma, \lambda, \alpha)}{\prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\phi, z, V_0, \alpha, \kappa, \gamma, \lambda, \alpha)}\right].$$

Then we can use the Metropolis-Hastings method to update $\alpha$. With probability $a(\alpha^*, \alpha)$, set the new state of $\alpha$ to be $\alpha^*$. Otherwise, let the new state be the same as $\alpha$.

Step 6. Simulate $\gamma$ using conjugate prior.

The link parameter $\gamma$ has a log-gamma prior $LG(\alpha_2, \kappa_2)$. $\sigma_4$ is a given positive number. Here we

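Step 7. Simulate $\lambda_l$ ($l = 1, \ldots, L$) using conjugate priors.

The baseline hazard function has piecewise constant values $\lambda_l$ ($l = 1, \ldots, L$). The conjugate priors of the $\lambda_l$’s are $\text{Gamma}(a_l, b_l)$. So the full conditional distribution of $\lambda_l$ is given by,

$$\begin{aligned} p(\lambda_l|Y, z, \phi, V_0, \alpha, \kappa, \gamma, \alpha) &\propto \prod_{i=1}^{n}\lambda(s_i)^{\nu_i}\exp\left\{-e^{X_i'\alpha}\sum_{l=1}^{L}H_{il}(\beta, \gamma, \lambda)\right\} \times \text{Gamma}(\lambda_l|a_l, b_l) \\ &\propto \lambda_l^{n_l}\exp\left\{-\sum_{i=1}^{n} e^{X_i'\alpha}H_{il}(\beta, \gamma, \lambda)\right\} \times \text{Gamma}(\lambda_l|a_l, b_l) \\ &\propto \lambda_l^{n_l}\exp\left\{-\lambda_l\sum_{i=1}^{n} e^{X_i'\alpha}I(s_i \ge u_l)\int_{u_l}^{\min(u_{l+1}, s_i)} e^{\gamma\psi_{\beta_i}(u)}\,du\right\} \times \text{Gamma}(\lambda_l|a_l, b_l) \\ &\sim \text{Gamma}\left(a_l + n_l,\ b_l + \sum_{i=1}^{n} e^{X_i'\alpha}I(s_i \ge u_l)\int_{u_l}^{\min(u_{l+1}, s_i)} e^{\gamma\psi_{\beta_i}(u)}\,du\right), \end{aligned}$$

where $n_l$ is the number of subjects whose survival time is within the interval $[u_l, u_{l+1})$, $l = 1, \ldots, L$.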
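The conjugate update for a single λ_l can be sketched as follows for the linear trajectory (4.1), where the integral has a closed form. This is an illustration rather than the dissertation's code. The function names are hypothetical, intervals are indexed from zero, `x_alpha[i]` stands for X_i'α, and, following the factor λ(s_i)^{ν_i} in the derivation, the sketch assumes that n_l counts only subjects with ν_i = 1 whose time falls in the interval.

```python
import math
import random


def interval_integral(lo, hi, beta0, beta1, gamma):
    """Integral of exp{gamma * (beta0 + beta1 * u)} over [lo, hi]."""
    a = gamma * beta1
    if abs(a) < 1e-12:
        return math.exp(gamma * beta0) * (hi - lo)
    return math.exp(gamma * beta0) * (math.exp(a * hi) - math.exp(a * lo)) / a


def lambda_posterior_params(l, s, nu, x_alpha, beta, gamma, a_l, b_l, u):
    """Shape and rate of the gamma full conditional of lambda_l."""
    n_l = 0
    exposure = 0.0
    for s_i, nu_i, xa_i, (b0, b1) in zip(s, nu, x_alpha, beta):
        # Assumption: only observed events (nu_i == 1) contribute to n_l.
        if nu_i == 1 and u[l] <= s_i < u[l + 1]:
            n_l += 1
        if s_i >= u[l]:
            upper = min(u[l + 1], s_i)
            exposure += math.exp(xa_i) * interval_integral(u[l], upper, b0, b1, gamma)
    return a_l + n_l, b_l + exposure


def draw_lambda(l, s, nu, x_alpha, beta, gamma, a_l, b_l, u, rng):
    """Draw lambda_l; random.gammavariate takes (shape, scale = 1 / rate)."""
    shape, rate = lambda_posterior_params(l, s, nu, x_alpha, beta, gamma, a_l, b_l, u)
    return rng.gammavariate(shape, 1.0 / rate)


if __name__ == "__main__":
    u = [0.0, 1.0, 2.0, 3.0]
    s, nu = [0.5, 1.5, 2.5], [1, 0, 1]
    beta = [(0.0, 0.0)] * 3
    print(lambda_posterior_params(1, s, nu, [0.0] * 3, beta, 0.0, 2.0, 1.0, u))
    print(draw_lambda(1, s, nu, [0.0] * 3, beta, 0.0, 2.0, 1.0, u, random.Random(0)))
```

With γ = 0 and no covariates, the rate reduces to b_l plus the total time at risk in the interval, which gives a quick sanity check of the exposure term.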

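Step 8. Simulate $\alpha$ using conjugate prior.

where $\alpha_\alpha = (\nu_1 + \delta_4, \ldots, \nu_n + \delta_4, \alpha_3)'$, and $\kappa_\alpha = \left(\sum_{l=1}^{L}H_{1l}(\beta, \gamma, \lambda) + \sigma_4, \ldots, \sum_{l=1}^{L}H_{nl}(\beta, \gamma, \lambda) + \sigma_4, \kappa_3\right)'$.

Step 9. Simulate the shape parameters $\alpha_0$, $\alpha_1$, $\alpha_2$ and $\alpha_3$ in the LG or cMLG distributions directly from their full conditional distributions.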

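We use the uniform distribution $U(0, 10^4)$ as the prior distribution for $\alpha_0$, $\alpha_1$, $\alpha_2$ and $\alpha_3$. Then the full conditional distributions are given by,

$$\begin{aligned} p(\alpha_0|Y, z, \phi, V_0, \alpha, \kappa, \gamma, \lambda, \alpha) &\propto \prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, \Omega) \times \prod_{z=1}^{k} G_0(\phi_z|H_G, \alpha_G, \kappa_G) \times U(0, 10^4) \\ &\propto \prod_{z=1}^{k}\left[\frac{\kappa_0^{\alpha_0}}{\Gamma(\alpha_0)}\right]^2\exp\{\alpha_0 1_2'V_0\phi_z\} \\ &\propto \left[\frac{\kappa_0^{\alpha_0}}{\Gamma(\alpha_0)}\right]^{2k}\exp\left\{\alpha_0\sum_{z=1}^{k} 1_2'V_0\phi_z\right\}, \end{aligned}$$

$$p(\alpha_1|Y, z, \phi, V_0, \alpha, \kappa, \gamma, \lambda, \alpha) \propto \prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, \Omega) \times LG(v_{21}|\alpha_1, \kappa_1) \times U(0, 10^4) \propto \frac{\kappa_1^{\alpha_1}}{\Gamma(\alpha_1)}\exp\{\alpha_1 v_{21}\},$$

$$p(\alpha_2|Y, z, \phi, V_0, \alpha, \kappa, \gamma, \lambda, \alpha) \propto \prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, \Omega) \times LG(\gamma|\alpha_2, \kappa_2) \times U(0, 10^4) \propto \frac{\kappa_2^{\alpha_2}}{\Gamma(\alpha_2)}\exp\{\alpha_2\gamma\},$$

and

$$p(\alpha_3|Y, z, \phi, V_0, \alpha, \kappa, \gamma, \lambda, \alpha) \propto \prod_{i=1}^{n} f(Y_i, s_i, \nu_i|\beta_i, \Omega) \times cMLG\left(\begin{pmatrix} X \\ \delta_5 1_p' \end{pmatrix}, \alpha_\alpha, \kappa_\alpha\right) \times U(0, 10^4) \propto \frac{\kappa_3^{\alpha_3}}{\Gamma(\alpha_3)}\exp\{\alpha_3\delta_5 1_p'\alpha\}.$$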

Step 10. Simulate M using the technique discussed in Section 3.3.2.

We use a single gamma prior Γ(aM, bM) on the concentration parameter M in the Dirichlet
