ST 732, Longitudinal Data Analysis

(1)

6.2 Model specification

6.3 Inference and considerations for missing data
6.4 Best linear unbiased prediction and empirical Bayes
6.5 Implementation via the EM algorithm

(2)

Population-averaged linear models for continuous response:
• Appropriate when questions of scientific interest are questions about features of population mean response profiles
• Population mean response represented directly by a linear model
• Overall aggregate covariance matrix of a response vector is modeled directly
• The aggregate pattern of variance and correlation may be sufficiently complex that standard models cannot represent it
• With different $n_i$ and/or different time points for different individuals, exploration and implementation of covariance structure is problematic

(3)

General class of models and methods for continuous response:
• Subject-specific perspective ⇒ Linear mixed effects model
• Does not require balance – responses need not be at the same $n$ time points
• Individual inherent response trajectories represented by a linear model incorporating within-individual covariates
• Individual-specific features of trajectories represented by a linear model incorporating among-individual covariates
• Within- and among-individual sources of variance and correlation explicitly acknowledged and modeled separately
• Appropriate when questions of scientific interest are questions about “typical” or average features of individual trajectories; e.g., the “typical slope”

(4)

Linear models: For inherent trajectories and features
• The subject-specific model implies a linear model for overall population mean response
• And induces a model for the overall aggregate covariance matrix
• ⇒ Induces a linear population-averaged model
• The model is a relevant framework for addressing either subject-specific or population-averaged questions of interest
• The induced covariance structure ameliorates the problems with direct specification of overall pattern and implementation with unbalanced data suffered by a population-averaged model ⇒ great flexibility

(5)

Implementation: Because a population-averaged model is induced
• Fitting using maximum likelihood under the assumption of normality and REML
• Same large sample theory applies for approximate inference
• Same issues/approaches in the presence of missing data

(6)

(7)

Recall: The observed data are
$(Y_i, z_i, a_i) = (Y_i, x_i), \quad i = 1, \ldots, m,$
independent across $i$
• $Y_i = (Y_{i1}, \ldots, Y_{in_i})^T$, $Y_{ij}$ recorded at time $t_{ij}$, $j = 1, \ldots, n_i$ (possibly different times for different individuals)
• $z_i = (z_{i1}^T, \ldots, z_{in_i}^T)^T$ comprising within-individual covariate information $u_i$ and the $t_{ij}$
• $a_i$ = collection of among-individual covariates
• All covariates $x_i = (z_i^T, a_i^T)^T$
• $\tilde{x}$ = collection of all $x_i$, $i = 1, \ldots, m$

(8)

Linear mixed effects model: Basic form
$Y_i = X_i\beta + Z_i b_i + e_i, \quad i = 1, \ldots, m$
• $X_i$ $(n_i \times p)$ and $Z_i$ $(n_i \times q)$ are design matrices incorporating individual $i$'s covariates $x_i$ and time
• $\beta$ $(p \times 1)$ is referred to as the fixed effects parameter
• $b_i$ $(q \times 1)$ is a vector of random effects characterizing among-individual behavior, independent across $i$
• Standard assumption: $b_i \perp\!\!\!\perp x_i$
$E(b_i \mid x_i) = E(b_i) = 0, \quad \mathrm{var}(b_i \mid x_i) = \mathrm{var}(b_i) = D, \quad b_i \sim N(0, D)$
where $D$ characterizes among-individual variance and correlation
• Can be relaxed to dependence on among-individual covariates $a_i$
$E(b_i \mid x_i) = 0, \quad \mathrm{var}(b_i \mid x_i) = \mathrm{var}(b_i \mid a_i) = D(a_i), \quad b_i \mid x_i \sim N\{0, D(a_i)\}$

(9)

Linear mixed effects model: Basic form
$Y_i = X_i\beta + Z_i b_i + e_i, \quad i = 1, \ldots, m$
• Within-individual deviation $e_i = (e_{i1}, \ldots, e_{in_i})^T$ represents aggregate effects of within-individual realization and measurement error, independent across $i$
• Standard assumption: $e_i \perp\!\!\!\perp b_i, x_i$
$E(e_i \mid x_i, b_i) = E(e_i) = 0, \quad \mathrm{var}(e_i \mid x_i, b_i) = \mathrm{var}(e_i) = R_i(\gamma), \quad e_i \sim N\{0, R_i(\gamma)\}$
can be relaxed to allow dependence of $e_i$ on $x_i$ and $b_i$
• $(n_i \times n_i)$ covariance matrix $R_i(\gamma)$, parameters $\gamma$ (often written $R_i$ for brevity)
• Default assumption (often made without adequate thought): $R_i(\gamma) = \sigma^2 I_{n_i}$, $\gamma = \sigma^2$, for all $i = 1, \ldots, m$

(10)

Conceptual framework: Chapter 2
$Y_i = \mu_i + B_i + e_i = \mu_i + B_i + e_{Pi} + e_{Mi}$
• $\mu_i = X_i\beta$ $(n_i \times 1)$ overall population mean response
• $B_i = Z_i b_i$ $(n_i \times 1)$ vector of deviations from the population mean ⇒ among-individual variation
• $e_i$ $(n_i \times 1)$ vector of within-individual deviations
• Individual-specific trajectory $X_i\beta + Z_i b_i$
This structure offers great latitude for representing individual profiles (examples coming up)

(11)

Conditional on $b_i$ and $x_i$: The model implies $Y_i$ is $n_i$-variate normal with mean $X_i\beta + Z_i b_i$ and covariance matrix $R_i(\gamma)$
$Y_i \mid x_i, b_i \sim N\{X_i\beta + Z_i b_i, R_i(\gamma)\}$
• Characterizes how observations for individual $i$ vary/covary about $i$'s inherent trajectory $X_i\beta + Z_i b_i$ due to the realization process and measurement error
• Let $p(y_i \mid x_i, b_i; \beta, \gamma)$ denote this normal density
Population-averaged density: Of $Y_i$ given $x_i$
$p(y_i \mid x_i; \beta, \gamma, D) = \int p(y_i \mid x_i, b_i; \beta, \gamma)\, p(b_i; D)\, db_i$

(12)

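Population-averaged density: Of $Y_i$ given $x_i$
$p(y_i \mid x_i; \beta, \gamma, D) = \int p(y_i \mid x_i, b_i; \beta, \gamma)\, p(b_i; D)\, db_i$
• Straightforward: This is an $n_i$-variate normal density with mean vector $X_i\beta$ and covariance matrix
$V_i(\gamma, D, x_i) = V_i(\xi, x_i) = Z_i D Z_i^T + R_i(\gamma), \quad \xi = \{\gamma^T, \mathrm{vech}(D)^T\}^T$ (6.1)
• $\mathrm{vech}(D)$ = vector of distinct elements of $D$ (Appendix A)
Result: Implied population-averaged model
$E(Y_i \mid x_i) = X_i\beta, \quad \mathrm{var}(Y_i \mid x_i) = V_i = V_i(\xi, x_i), \quad Y_i \mid x_i \sim N\{X_i\beta, V_i(\xi, x_i)\}, \quad i = 1, \ldots, m$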

(13)

Induced aggregate covariance matrix: Has particular form (6.1)
$V_i(\xi, x_i) = V_i(\gamma, D, x_i) = Z_i D Z_i^T + R_i(\gamma) = Z_i D Z_i^T + R_i$
• Specific form depends on choices of $R_i(\gamma)$, reflecting beliefs about the within-individual realization and measurement error processes, and of $D$, characterizing among-individual variability in individual trajectories $X_i\beta + Z_i b_i$
• Generalizes in obvious way when $\mathrm{var}(b_i \mid x_i) = \mathrm{var}(b_i \mid a_i) = D(a_i)$
• Conceptual framework: From this point of view
$V_i(\xi, x_i) = \mathrm{var}(Y_i \mid x_i) = \mathrm{var}(B_i \mid x_i) + \mathrm{var}(e_i) = Z_i D Z_i^T + R_i(\gamma)$
⇒ first term represents contribution of among-individual sources, second term represents contribution of within-individual sources
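The induced form (6.1) can be illustrated numerically. The following is a minimal sketch, assuming a random intercept-and-slope model with the default $R_i(\gamma) = \sigma^2 I_{n_i}$; the function name `induced_covariance` and all numerical values of $D$ and $\sigma^2$ are hypothetical and chosen only for illustration.

```python
import numpy as np

def induced_covariance(times, D, sigma2):
    """Return V_i = Z_i D Z_i^T + sigma2 * I for a random intercept-and-slope model."""
    times = np.asarray(times, dtype=float)
    Z = np.column_stack([np.ones_like(times), times])  # Z_i: n_i x 2
    R = sigma2 * np.eye(len(times))                    # default R_i = sigma^2 I
    return Z @ D @ Z.T + R

# Hypothetical among-individual covariance matrix and within-individual variance
D = np.array([[4.0, -0.2],
              [-0.2, 0.05]])
sigma2 = 1.5

V_a = induced_covariance([8, 10, 12, 14], D, sigma2)  # individual with n_i = 4
V_b = induced_covariance([8, 12, 14], D, sigma2)      # individual with n_i = 3
```

The same $D$ and $\sigma^2$ yield a covariance matrix of the right dimension for each individual, whatever that individual's observation times, which is one way to see why the induced structure accommodates unbalanced data.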

(14)

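“Stacked” notation:
$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_m \end{pmatrix}$ $(N \times 1)$, $\quad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}$ $(mq \times 1)$, $\quad e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_m \end{pmatrix}$ $(N \times 1)$
$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}$ $(N \times p)$, $\quad Z = \begin{pmatrix} Z_1 & 0 & \cdots & 0 \\ 0 & Z_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z_m \end{pmatrix}$ $(N \times mq)$
$R = \begin{pmatrix} R_1 & 0 & \cdots & 0 \\ 0 & R_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_m \end{pmatrix}$ $(N \times N)$, $\quad \tilde{D} = \begin{pmatrix} D & 0 & \cdots & 0 \\ 0 & D & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & D \end{pmatrix}$ $(mq \times mq)$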

(15)

“Stacked” notation: The model is often written succinctly as
$Y = X\beta + Zb + e$ (6.2)
$E(Y \mid \tilde{x}) = X\beta, \quad \mathrm{var}(Y \mid \tilde{x}) = V(\xi, \tilde{x}) = V = Z\tilde{D}Z^T + R$
• Commonly expressed as in (6.2) in the literature and most software documentation
Two-stage hierarchy: A model of form (6.2) usually results from specification of
• The form of individual inherent trajectories in terms of individual-specific parameters and an assumption on the within-individual covariance matrix $R_i$
• A representation of how individual-specific parameters vary among individuals in the population, including in association with among-individual covariates

(16)

In general: $R_i(\gamma) = \mathrm{var}(e_i \mid b_i, x_i)$, where $e_i = e_{Pi} + e_{Mi}$
• Usual assumption: For the linear mixed effects model $e_i \perp\!\!\!\perp b_i, x_i$ or at least $e_i \perp\!\!\!\perp b_i$
Assume the former for this discussion; dependence on $b_i$ is allowed in Chapter 9
• Usual assumption: $e_{Pi} \perp\!\!\!\perp e_{Mi}$, discussed further in Chapter 7
• $\mathrm{var}(e_{Mi} \mid b_i, x_i) = \mathrm{var}(e_{Mi})$, the contribution to $R_i$ due to measurement error, a diagonal matrix
• $\mathrm{var}(e_{Pi} \mid b_i, x_i) = \mathrm{var}(e_{Pi})$, the contribution due to the realization process, may exhibit correlation due to time-ordered data collection
• ⇒ Decompose $R_i(\gamma)$ as
$R_i(\gamma) = R_{Pi}(\gamma_P) + R_{Mi}(\gamma_M), \quad \gamma = (\gamma_P^T, \gamma_M^T)^T$ (6.3)

(17)

Default specification: $R_i(\gamma) = \sigma^2 I_{n_i}$
• From perspective of (6.3)
$R_i(\gamma) = \sigma_P^2 I_{n_i} + \sigma_M^2 I_{n_i}, \quad \sigma^2 = \sigma_P^2 + \sigma_M^2$
• Assumes negligible serial correlation due to realization process; could be reasonable if time points are sufficiently intermittent
• Assumes measurement errors are haphazard with variance the same regardless of the magnitude of the true realization
• $\sigma^2 = \sigma_P^2 + \sigma_M^2$ is variance due to combined effects of realization process and measurement error
• If measurement error is assumed negligible or to not exist, reduces to
$R_i(\gamma) = \sigma_P^2 I_{n_i}, \quad \sigma^2 = \sigma_P^2$

(18)

More generally: Common to assume measurement error is haphazard with constant variance as above and simplify (6.3) to
$R_i(\gamma) = R_{Pi}(\gamma_P) + \sigma_M^2 I_{n_i}, \quad \gamma = (\gamma_P^T, \sigma_M^2)^T.$ (6.4)
• (6.4) is a reasonable starting point when outcome is ascertained by a device or analytical procedure (e.g., dental distance, hæmatocrit, CD4 count)
• If no measurement error is plausible (e.g., visual acuity), (6.4) simplifies to

(19)

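Specifying $R_{Pi}(\gamma_P)$: Represent as
$R_{Pi}(\gamma_P) = T_i^{1/2}(\theta)\, \Gamma_i(\alpha)\, T_i^{1/2}(\theta), \quad \gamma_P = (\theta^T, \alpha^T)^T$
• $\Gamma_i(\alpha)$ $(n_i \times n_i)$ correlation matrix
• $T_i(\theta)$ is diagonal; diagonal elements reflect realization process variance, e.g., constant over time
$T_i(\theta) = \sigma_P^2 I_{n_i}, \quad \theta = \sigma_P^2$
• Or for $n$ intended times and $i$ with $n_i = n$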

(20)

In practice: Unfortunately, it is common to fail to acknowledge that there are separate contributions due to realization and measurement error and to represent the entire covariance structure as
$R_i(\gamma) = T_i^{1/2}(\theta)\, \Gamma_i(\alpha)\, T_i^{1/2}(\theta)$ or $R_i(\gamma) = \sigma^2 \Gamma_i(\alpha)$
• Some software restricts to models for $R_i(\gamma)$ of this form
• Implicitly takes measurement error to be negligible, which it may not be
• ⇒ Confusion and lack of understanding among users
More generally: In the literature, when correctly distinguishing the separate contributions of realization and measurement error processes, it is common to assume constant variances and take as the default specification
$R_i(\gamma) = \sigma_P^2 \Gamma_i(\alpha) + \sigma_M^2 I_{n_i}$, γ = (σ 2

(21)

[Figure: dental distance (mm) versus age (years), ages 8 to 14, shown in two panels; one panel is labeled Girls]

(22)

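Recall: $g_i = 0\ (1)$ if $i$ is a girl (boy), $(t_1, \ldots, t_4) = (8, 10, 12, 14)$, subject-specific perspective
• Question of interest: Is the typical or average rate of change of dental distance for boys different from that for girls?
• Straight-line model for individual trajectory
$Y_{ij} = \beta_{0i} + \beta_{1i} t_{ij} + e_{ij}, \quad j = 1, \ldots, n_i = n = 4, \quad \beta_i = \begin{pmatrix} \beta_{0i} \\ \beta_{1i} \end{pmatrix}$
• Can summarize as
$Y_i = C_i\beta_i + e_i, \quad C_i = \begin{pmatrix} 1 & t_{i1} \\ 1 & t_{i2} \\ \vdots & \vdots \\ 1 & t_{in_i} \end{pmatrix} = \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ 1 & t_3 \\ 1 & t_4 \end{pmatrix}$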

(23)

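Individual-specific intercepts and slopes:
$\beta_{0i} = \beta_{0,B}\, g_i + \beta_{0,G}(1 - g_i) + b_{0i},$
$\beta_{1i} = \beta_{1,B}\, g_i + \beta_{1,G}(1 - g_i) + b_{1i}, \quad b_i = \begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix}$
• Can summarize as
$\beta_i = A_i\beta + B_i b_i,$ (6.7)
$\beta = \begin{pmatrix} \beta_{0,G} \\ \beta_{1,G} \\ \beta_{0,B} \\ \beta_{1,B} \end{pmatrix}, \quad A_i = \begin{pmatrix} (1 - g_i) & 0 & g_i & 0 \\ 0 & (1 - g_i) & 0 & g_i \end{pmatrix}, \quad B_i = I_2$
Remark: In early literature, (6.6) together with $\beta_i$ as in (6.7) is referred to as a random coefficient model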

(24)

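Combining: Substituting (6.7) in (6.6)
$Y_i = C_i A_i \beta + C_i B_i b_i + e_i = X_i\beta + Z_i b_i + e_i, \quad X_i = C_i A_i, \quad Z_i = C_i B_i$
$X_i = \begin{pmatrix} (1 - g_i) & (1 - g_i) t_1 & g_i & g_i t_1 \\ \vdots & \vdots & \vdots & \vdots \\ (1 - g_i) & (1 - g_i) t_4 & g_i & g_i t_4 \end{pmatrix}, \quad Z_i = \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ 1 & t_3 \\ 1 & t_4 \end{pmatrix}$
Complete the model: Specify models for among-individual covariance matrix $\mathrm{var}(b_i \mid a_i)$ and within-individual covariance matrix $R_i(\gamma)$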
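The construction $X_i = C_i A_i$, $Z_i = C_i B_i$ for the dental study can be checked directly. This is a small sketch using the ages $(8, 10, 12, 14)$ and the ordering of $\beta$ given above; the function name `design_matrices` is hypothetical.

```python
import numpy as np

t = np.array([8.0, 10.0, 12.0, 14.0])
C = np.column_stack([np.ones(4), t])   # C_i, the same for every child

def design_matrices(g):
    """g = 0 for a girl, 1 for a boy; returns (X_i, Z_i) = (C_i A_i, C_i B_i)."""
    A = np.array([[1 - g, 0, g, 0],
                  [0, 1 - g, 0, g]], dtype=float)  # A_i (2 x 4)
    B = np.eye(2)                                   # B_i = I_2
    return C @ A, C @ B

X_girl, Z_girl = design_matrices(0)
X_boy, Z_boy = design_matrices(1)
```

For a girl the first two columns of $X_i$ equal $C_i$ and the last two are zero; for a boy the pattern is reversed, so each gender picks out its own intercept and slope in $\beta$.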

(25)

Recall: Exploratory analyses suggest
• Overall correlation is different for boys and girls; overall variance constant across time but possibly larger for boys
• Within-individual residuals from individual-specific fits do not show strong evidence of within-individual correlation; within-child variance due to the combined effects of realization and measurement error is constant but possibly different
• Larger variance for boys could be artifact of one “unusual” boy
Within-individual covariance matrix: Assume (6.5) $R_i(\gamma) = \sigma_P^2 \Gamma_i(\alpha) + \sigma_M^2 I_{n_i}$

(26)

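Within-individual covariance matrix: Assume (6.5) $R_i(\gamma) = \sigma_P^2 \Gamma_i(\alpha) + \sigma_M^2 I_{n_i}$
• Within-child aggregate variance:
$\mathrm{var}(e_i \mid a_i) = R_i(\gamma) = \sigma_{PG}^2 I_4 + \sigma_{MG}^2 I_4$ if $i$ is a girl
$\phantom{\mathrm{var}(e_i \mid a_i) = R_i(\gamma)} = \sigma_{PB}^2 I_4 + \sigma_{MB}^2 I_4$ if $i$ is a boy
• Final specification:
$\mathrm{var}(e_i \mid a_i) = R_i(\gamma, a_i) = \{\sigma_G^2 I(g_i = 0) + \sigma_B^2 I(g_i = 1)\} I_4$
$\sigma_G^2 = \sigma_{PG}^2 + \sigma_{MG}^2, \quad \sigma_B^2 = \sigma_{PB}^2 + \sigma_{MB}^2$
• If different variances are artifact of “unusual” boy, could reduce to the default $R_i(\gamma) = \sigma^2 I_4$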

(27)

Among-child sources: Lack of within-individual correlation suggests overall correlation is due to among-child sources
• If we assume for each $i$
$\mathrm{var}(b_i \mid a_i) = D = \begin{pmatrix} D_{11} & D_{12} \\ D_{12} & D_{22} \end{pmatrix}$
⇒ Contribution to overall covariance structure $Z_i D Z_i^T$ has diagonal elements
$D_{11} + D_{22} t_j^2 + 2 D_{12} t_j, \quad j = 1, \ldots, 4$
and $(j, j')$ off-diagonal element
$D_{11} + D_{22} t_j t_{j'} + D_{12}(t_j + t_{j'}), \quad j, j' = 1, \ldots, 4$
• Induced pattern of among-individual covariance/correlation is clearly nonstationary ⇒ can represent complex covariance

(28)

Among-child sources: Evidence suggests overall pattern different by gender, suggesting the model
$\mathrm{var}(b_i \mid a_i) = D(a_i) = D_G\, I(g_i = 0) + D_B\, I(g_i = 1)$

(29)

(30)

Recall: $m = 30$ subjects, $g_i = 0\ (1)$ if $i$ is female (male), $a_i$ = age, hæmatocrit measured at week 0, prior to surgery, and ideally at weeks 1, 2, and 3 thereafter, where some subjects are missing the week 2 and possibly baseline measure
• Subject-specific perspective: Differences between genders in individual-specific features of the pattern of change of hæmatocrit following hip replacement?
• Quadratic model for individual trajectory
$Y_{ij} = \beta_{0i} + \beta_{1i} t_{ij} + \beta_{2i} t_{ij}^2 + e_{ij}$
• Written succinctly
$Y_i = C_i\beta_i + e_i, \quad \beta_i = \begin{pmatrix} \beta_{0i} \\ \beta_{1i} \\ \beta_{2i} \end{pmatrix}, \quad C_i = \begin{pmatrix} 1 & t_{i1} & t_{i1}^2 \\ \vdots & \vdots & \vdots \\ 1 & t_{in_i} & t_{in_i}^2 \end{pmatrix}$

(31)

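Individual-specific parameters:
$\beta_{0i} = \{\beta_{0,F}(1 - g_i) + \beta_{0,M}\, g_i\} + \{\beta_{3,F}(1 - g_i) + \beta_{3,M}\, g_i\} a_i + b_{0i}$
$\beta_{1i} = \beta_{1,F}(1 - g_i) + \beta_{1,M}\, g_i + b_{1i}$
$\beta_{2i} = \beta_{2,F}(1 - g_i) + \beta_{2,M}\, g_i + b_{2i}$
Can be represented as
$\beta_i = A_i\beta + B_i b_i$
$\beta = (\beta_{0,F}, \beta_{0,M}, \beta_{1,F}, \beta_{1,M}, \beta_{2,F}, \beta_{2,M}, \beta_{3,F}, \beta_{3,M})^T,$
$A_i = \begin{pmatrix} (1 - g_i) & g_i & 0 & 0 & 0 & 0 & (1 - g_i) a_i & g_i a_i \\ 0 & 0 & (1 - g_i) & g_i & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & (1 - g_i) & g_i & 0 & 0 \end{pmatrix}$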

(32)

Combining: $Z_i = C_i$
$X_i = \begin{pmatrix} (1 - g_i) & g_i & (1 - g_i) t_{i1} & g_i t_{i1} & (1 - g_i) t_{i1}^2 & g_i t_{i1}^2 & (1 - g_i) a_i & g_i a_i \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ (1 - g_i) & g_i & (1 - g_i) t_{in_i} & g_i t_{in_i} & (1 - g_i) t_{in_i}^2 & g_i t_{in_i}^2 & (1 - g_i) a_i & g_i a_i \end{pmatrix}$
• Complete the model with specification of within-individual covariance matrix $R_i(\gamma)$ and the covariance matrix $\mathrm{var}(b_i \mid a_i)$ for the random effects as above
• Note: If $\mathrm{var}(b_i \mid a_i) = D$ $(3 \times 3)$, $D$ has 6 distinct parameters, and $Z_i D Z_i^T$ corresponding to among-individual sources is an $(n_i \times n_i)$ matrix whose elements have a complicated form (try it)

(33)

More generally: The induced overall covariance model $V_i = Z_i D Z_i^T + R_i(\gamma)$ can depend on high-dimensional $\xi$
• Capable of representing complex true patterns of overall variance and correlation
• But also can be overkill
Among-individual variation: In an individual-specific model like the quadratic for the hip replacement data
• In principle: All of individual-specific intercepts, linear terms, and quadratic terms vary in the population
• However: Although quadratic terms $\beta_{2i}$ do vary in the population, this variation is practically negligible relative to the extent of variation in intercepts and linear terms $\beta_{0i}$ and $\beta_{1i}$

(34)

Eliminating random effects: A common tactic under quadratic and higher-order polynomial individual-specific models to simplify the model for $\beta_i$
• Remove random effects associated with quadratic and higher-order terms, redefine $Z_i$ and $D$ accordingly
• ⇒ $Z_i D Z_i^T$ and induced $V_i$ still sufficiently rich to approximate the true among-individual and overall covariance structures

(35)

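Demonstration: Hip replacement study – eliminate $b_{2i}$
$\beta_{0i} = \{\beta_{0,F}(1 - g_i) + \beta_{0,M}\, g_i\} + \{\beta_{3,F}(1 - g_i) + \beta_{3,M}\, g_i\} a_i + b_{0i}$
$\beta_{1i} = \beta_{1,F}(1 - g_i) + \beta_{1,M}\, g_i + b_{1i}$
$\beta_{2i} = \beta_{2,F}(1 - g_i) + \beta_{2,M}\, g_i$
$\beta_i = A_i\beta + B_i b_i, \quad b_i = \begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix}, \quad B_i = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}$, so that $B_i b_i = \begin{pmatrix} b_{0i} \\ b_{1i} \\ 0 \end{pmatrix}$
• From a subject-specific perspective, strictly an approximation of convenience
• Do not really believe that individuals of each gender have individual-specific trajectories characterized by exactly the same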

(36)

General consideration: Important issue in specifying models
• Although from SS point of view all individual-specific parameters are expected to exhibit variation in the population, what matters is their relative magnitudes of variation

(37)

Example:
• Straight line inherent trend is reasonable
• Individual-specific intercepts clearly vary substantially
• Underlying straight lines appear to have very similar slopes
• Although scientifically slopes should vary, relative to variation in intercepts, variation in slopes is orders of magnitude smaller
• With no covariates: $\beta_{0i}$, $\beta_{1i}$ intercept, slope for individual $i$
$\beta_{0i} = \beta_0 + b_{0i}, \quad \beta_{1i} = \beta_1 + b_{1i}, \quad b_i = (b_{0i}, b_{1i})^T, \quad \mathrm{var}(b_i) = D$
⇒ If $D_{11}$ is nonnegligible relative to $\beta_0$, then intercepts vary perceptibly; if $D_{22}$ is virtually negligible relative to $\beta_1$, then variation in slopes is almost undetectable

(38)

Example: Approximation to achieve numerical stability
$\beta_{0i} = \beta_0 + b_{0i}, \quad \beta_{1i} = \beta_1$
• Do not really “believe” slopes do not vary at all in the population
• But invoke this approximation recognizing that their magnitude of variation is inconsequential relative to that of other features
• Design matrix $B_i$ in the general model specification accommodates this possibility
Terminology: Popular to distinguish between individual-specific features being “fixed” or “random;” here, $\beta_{0i}$ would be said to be “random” while $\beta_{1i}$ would be referred to as “fixed”

(39)

[Figure: four panels, one per treatment regimen (ZDV+ alt ddI, ZDV+ZAL, ZDV+ddI, ZDV+ddI+NVP); horizontal axis: Week, 0 to 40; vertical axis ticks 0 to 6]

(40)

Recall: Subjects randomized to four treatment regimens, $a_i = (g_i, a_i, \delta_{i1}, \ldots, \delta_{i4})^T$, $g_i = 0\ (1)$ if $i$ is female (male); $a_i$ age; $\delta_{i\ell} = 1$ if subject $i$ randomized to regimen $\ell$ and 0 otherwise, $\ell = 1, \ldots, 4$
• Straight-line inherent log(CD4+1) trajectory
$Y_{ij} = \beta_{0i} + \beta_{1i} t_{ij} + e_{ij}$
⇒ $\beta_{0i}$ is $i$'s inherent mean log(CD4+1) immediately prior to therapy
• Subject-specific perspective: Are typical or average slopes different among the four regimens?
$\beta_{0i} = \beta_{00} + \beta_{01} a_i + \beta_{02} g_i + b_{0i}$
$\beta_{1i} = \beta_{10} + \beta_{11}\delta_{i1} + \beta_{12}\delta_{i2} + \beta_{13}\delta_{i3} + b_{1i}$

(41)

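Standard formulation: The linear mixed effects model is often presented formally as a two-stage hierarchy
• Stage 1 - Individual model:
$Y_i = C_i\beta_i + e_i$ $(n_i \times 1)$, $\quad e_i \mid x_i \sim N\{0, R_i(\gamma)\}$ (6.8)
$C_i$ $(n_i \times k)$ design matrix ordinarily depends on times $t_{i1}, \ldots, t_{in_i}$; $\beta_i$ $(k \times 1)$; often $e_i \perp\!\!\!\perp b_i, x_i$ or $e_i \perp\!\!\!\perp b_i$
• Stage 2 - Population model:
$\beta_i = A_i\beta + B_i b_i$ $(k \times 1)$, $\quad b_i \mid x_i \sim N(0, D)$ $(q \times 1)$ (6.9)
$\beta$ $(p \times 1)$ fixed effects; $A_i$ $(k \times p)$, $B_i$ $(k \times q)$ design matrices; $A_i$ incorporates among-individual covariates, $B_i$ indicates elements of $\beta_i$ treated as “fixed” or “random;” often $b_i \perp\!\!\!\perp x_i$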

(42)

Linear mixed effects model: Substituting (6.9) in (6.8)
$Y_i = X_i\beta + Z_i b_i + e_i, \quad b_i \mid x_i \sim N(0, D), \quad e_i \mid x_i \sim N\{0, R_i(\gamma)\}$
• $X_i = C_i A_i$ $(n_i \times p)$ fixed effects design matrix
• $Z_i = C_i B_i$ $(n_i \times q)$ random effects design matrix
• $e_i \mid x_i \sim N\{0, R_i(\gamma)\}$, $b_i \mid x_i \sim N(0, D)$

(43)

6.3 Inference and considerations for missing data
6.4 Best linear unbiased prediction and empirical Bayes
6.5 Implementation via the EM algorithm

(44)

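Given a linear mixed effects model: Induced population-averaged model
$E(Y_i \mid x_i) = X_i\beta, \quad \mathrm{var}(Y_i \mid x_i) = V_i = V_i(\xi, x_i)$
$V_i(\xi, x_i) = Z_i D Z_i^T + R_i(\gamma), \quad \xi = \{\gamma^T, \mathrm{vech}(D)^T\}^T$
$(Y_i, x_i)$, $i = 1, \ldots, m$, independent across $i$
• Succinctly:
$E(Y \mid \tilde{x}) = X\beta, \quad \mathrm{var}(Y \mid \tilde{x}) = V(\xi, \tilde{x}) = V = Z\tilde{D}Z^T + R$
• Can be generalized to $\mathrm{var}(b_i \mid x_i) = D(a_i)$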

(45)

Estimation of $\beta$ and $\xi$: Normality assumptions at each stage of the hierarchy yield the assumption
$Y_i \mid x_i \sim N\{X_i\beta, V_i(\xi, x_i)\}$
• ⇒ $\beta$ and $\xi$ can be estimated by solving the estimating equations corresponding to maximum likelihood or REML

(46)

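Large sample inference: Approximate sampling distributions for $\hat\beta$ using ML or REML
• Model-based:
$\hat\beta \overset{\cdot}{\sim} N(\beta_0, \hat\Sigma_M), \quad \hat\Sigma_M = \left\{ \sum_{i=1}^m X_i^T V_i^{-1}(\hat\xi, x_i) X_i \right\}^{-1} = \{X^T V^{-1}(\hat\xi, \tilde{x}) X\}^{-1}$
• Robust or empirical:
$\hat\beta \overset{\cdot}{\sim} N(\beta_0, \hat\Sigma_R)$
$\hat\Sigma_R = \left\{ \sum_{i=1}^m X_i^T V_i^{-1}(\hat\xi, x_i) X_i \right\}^{-1} \left\{ \sum_{i=1}^m X_i^T V_i^{-1}(\hat\xi, x_i)(Y_i - X_i\hat\beta)(Y_i - X_i\hat\beta)^T V_i^{-1}(\hat\xi, x_i) X_i \right\} \left\{ \sum_{i=1}^m X_i^T V_i^{-1}(\hat\xi, x_i) X_i \right\}^{-1}$
• Can be used for inference on linear functions $L\beta$ with either a SS or PA interpretation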
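The model-based and robust covariance formulas can be sketched in a few lines. The code below is an illustration only, not the course's software: it assumes the $V_i$ are supplied (here evaluated at known, hypothetical values of $D$ and $\sigma^2$ rather than at $\hat\xi$), and the function name `gls_and_covariances` and the simulated data are assumptions made for the example.

```python
import numpy as np

def gls_and_covariances(X_list, Y_list, V_list):
    """GLS estimate of beta with model-based and robust (sandwich) covariances."""
    p = X_list[0].shape[1]
    bread = np.zeros((p, p))
    rhs = np.zeros(p)
    for X, Y, V in zip(X_list, Y_list, V_list):
        Vinv_X = np.linalg.solve(V, X)       # V_i^{-1} X_i
        bread += X.T @ Vinv_X                # X_i^T V_i^{-1} X_i
        rhs += Vinv_X.T @ Y                  # X_i^T V_i^{-1} Y_i (V_i symmetric)
    Sigma_M = np.linalg.inv(bread)           # model-based covariance
    beta_hat = Sigma_M @ rhs
    meat = np.zeros((p, p))
    for X, Y, V in zip(X_list, Y_list, V_list):
        u = np.linalg.solve(V, X).T @ (Y - X @ beta_hat)  # X_i^T V_i^{-1} residual
        meat += np.outer(u, u)
    Sigma_R = Sigma_M @ meat @ Sigma_M       # robust (sandwich) covariance
    return beta_hat, Sigma_M, Sigma_R

# Simulated illustration; every numerical value below is hypothetical.
rng = np.random.default_rng(0)
t = np.array([8.0, 10.0, 12.0, 14.0])
X = np.column_stack([np.ones(4), t])
Z = X.copy()
beta_true = np.array([17.0, 0.6])
D = np.array([[4.0, -0.2], [-0.2, 0.05]])
sigma2 = 1.5
V = Z @ D @ Z.T + sigma2 * np.eye(4)

X_list, Y_list, V_list = [], [], []
for _ in range(50):
    b = rng.multivariate_normal(np.zeros(2), D)
    e = rng.normal(scale=np.sqrt(sigma2), size=4)
    X_list.append(X)
    Y_list.append(X @ beta_true + Z @ b + e)
    V_list.append(V)

beta_hat, Sigma_M, Sigma_R = gls_and_covariances(X_list, Y_list, V_list)
```

When the covariance model is correctly specified, as in this simulation, the two covariance estimates should be broadly similar; marked disagreement in practice is one informal signal that the assumed structure may be inadequate.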

(47)

Information criteria: Can be used to compare non-nested models as in Chapter 5
• In particular: Compare different models for overall covariance structure induced by combinations of choices of models for $\mathrm{var}(e_i \mid x_i)$ and $\mathrm{var}(b_i \mid a_i)$
• Dental study: Compare taking $\mathrm{var}(b_i \mid a_i) = \mathrm{var}(b_i) = D$ to
$\mathrm{var}(b_i \mid a_i) = D(a_i) = D_G\, I(g_i = 0) + D_B\, I(g_i = 1)$
• Dental study: Compare taking $\mathrm{var}(e_i \mid x_i) = \sigma^2 I_4$ for all $i$ to

(48)

Implications of missing data: Identical to those discussed in Chapter 5
• Under assumptions of a MAR mechanism and normality, estimators for $\beta$ and $\xi$ are consistent
• Model-based approximation to sampling distribution of $\hat\beta$ can be used, but with $\hat\Sigma_M$ replaced by the appropriate element of the inverse of the observed information matrix

(49)

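Curiosity: Balanced data, $Y_i$ $(n \times 1)$, $i = 1, \ldots, m$, same $n$ times
• $Z_i = Z^*$, same for all $i$ (verify)
• If $\mathrm{var}(b_i) = D$, $R_i(\gamma) = \sigma^2 I_n$,
$V_i(\xi, x_i) = V^* = Z^* D Z^{*T} + \sigma^2 I_n$
• With $\hat{D}$ and $\hat\sigma^2$ obtained by ML or REML ⇒ $\hat{V}^* = Z^* \hat{D} Z^{*T} + \hat\sigma^2 I_n$, can be shown that
$\hat\beta = \left( \sum_{i=1}^m X_i^T \hat{V}^{*-1} X_i \right)^{-1} \sum_{i=1}^m X_i^T \hat{V}^{*-1} Y_i$
and the OLS estimator
$\hat\beta_{OLS} = \left( \sum_{i=1}^m X_i^T X_i \right)^{-1} \sum_{i=1}^m X_i^T Y_i$
are identical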
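A quick numerical check of this curiosity is sketched below. It assumes the dental-study structure $X_i = Z^* A_i$ with group-indicator $A_i$, so it verifies the identity only for that case; the values of $D$ and $\sigma^2$ and the simulated responses are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.array([8.0, 10.0, 12.0, 14.0])
Z_star = np.column_stack([np.ones(4), t])          # Z* common to all individuals
D = np.array([[4.0, -0.2], [-0.2, 0.05]])          # hypothetical D
sigma2 = 2.0                                       # hypothetical sigma^2
V_inv = np.linalg.inv(Z_star @ D @ Z_star.T + sigma2 * np.eye(4))

p = 4
lhs_gls, rhs_gls = np.zeros((p, p)), np.zeros(p)
lhs_ols, rhs_ols = np.zeros((p, p)), np.zeros(p)
for i in range(20):
    g = i % 2                                      # alternate the two groups
    A = np.array([[1 - g, 0, g, 0],
                  [0, 1 - g, 0, g]], dtype=float)
    X = Z_star @ A                                 # X_i = Z* A_i
    Y = 22.0 + 3.0 * rng.normal(size=4)            # arbitrary responses
    lhs_gls += X.T @ V_inv @ X
    rhs_gls += X.T @ V_inv @ Y
    lhs_ols += X.T @ X
    rhs_ols += X.T @ Y

beta_gls = np.linalg.solve(lhs_gls, rhs_gls)       # weighted (GLS) estimator
beta_ols = np.linalg.solve(lhs_ols, rhs_ols)       # OLS estimator
print(np.allclose(beta_gls, beta_ols))
```

Because the identity is algebraic in this setting, the responses need not come from the model for the two estimators to agree; only the estimators coincide, not their sampling covariance matrices.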

(50)

Equivalence: Shown using matrix inversion results given in Appendix A
• Continues to hold if $\sigma^2$ and $D$ take on different values corresponding to different levels of an among-individual covariate
• The diligent student will of course verify this
• Important: This equivalence does not mean that one can disregard the need to characterize covariance structure and take all $N$ observations to be mutually independent
• Correct characterization of the approximate sampling distribution of the estimator requires that the overall covariance be

(51)

Motivation for linear mixed effects model: From a subject-specific perspective
• Aligns with the conceptual framework in Chapter 2
• Is natural when questions of interest involve individual-specific phenomena and features
Implied model: From a population-averaged perspective
• Form of population mean response and overall covariance structure incorporating components due to among- and within-individual sources are induced
• ⇒ Model for overall covariance structure is “automatic” rather than chosen explicitly by the data analyst
• Sidesteps the challenge of specifying a suitable overall structure
• The induced model is richly parameterized and can represent

(52)

However: SS vs. PA perspective is critical for inference on and interpretation of the induced overall covariance matrix
• Population-averaged perspective: The induced overall covariance structure is a convenient and flexible way of representing a complex true structure
• $\xi = \{\gamma^T, \mathrm{vech}(D)^T\}^T$ are parameters that characterize this structure, with no restrictions on possible values of $\xi$
• ⇒ $D$ need not be a legitimate covariance matrix (non-negative diagonal elements), and $\gamma$ need not be restricted so that $R_i(\gamma)$ is a legitimate covariance matrix
• Subject-specific perspective: $D$ and $R_i(\gamma)$ must be legitimate covariance matrices corresponding to among- and within-individual sources of variation and correlation
• ⇒ Restrictions on the parameter space of $\xi = \{\gamma^T, \mathrm{vech}(D)^T\}^T$ that ensure this

(53)

6.3 Inference and considerations for missing data
6.4 Best linear unbiased prediction and empirical Bayes
6.5 Implementation via the EM algorithm
6.6 Testing variance components

(54)

Subject-specific perspective: Each individual $i$ has specific parameters $\beta_i$ characterizing his/her inherent trajectory, depending on random effects $b_i$ reflecting how $i$'s parameters deviate from the “typical” values and how $i$'s inherent trajectory deviates from the overall population mean profile
• Of interest: “Estimate” $b_i$ (and $\beta_i$) for each individual $i$
• ⇒ Characterize individual-specific features and trajectories, identify outlying individuals
• $b_i$ is a random vector corresponding to a randomly chosen individual from the population ⇒ prediction of the value taken on by a random vector
• Inference on $b_i$ is a prediction problem
• Because $Y_i$ contains information about $b_i$, it is natural to view this prediction problem as characterizing $b_i$ given that we have observed $Y_i = y_i$

(55)

Natural predictor: The value of $b_i$ “most likely” given we have observed $Y_i = y_i$
Posterior mode: “Estimate” $b_i$ by the value that maximizes the posterior distribution of $b_i$ given $Y_i$ evaluated at $y_i$
• Bayesian view: $b_i$ are parameters, $b_i \sim N(0, D)$ is referred to as the prior distribution, density $p(b_i; D)$
• Density of $Y_i \mid x_i, b_i \sim N\{X_i\beta + Z_i b_i, R_i(\gamma)\}$, $p(y_i \mid x_i, b_i; \beta, \gamma)$
• Here: Do not consider $\beta$ and $\xi$ from the classical Bayesian perspective as random quantities with prior distributions, but treat them as fixed and known; more on this momentarily

(56)

Posterior mode: “Estimate” $b_i$ by the value that maximizes the posterior distribution of $b_i$ given $Y_i$ evaluated at $y_i$
• Bayes theorem: The posterior density is
$p(b_i \mid y_i, x_i; \beta, \gamma, D) = \dfrac{p(y_i \mid x_i, b_i; \beta, \gamma)\, p(b_i; D)}{p(y_i \mid x_i; \beta, \gamma, D)}$ (6.10)
$p(y_i \mid x_i; \beta, \gamma, D) = \int p(y_i \mid x_i, b_i; \beta, \gamma)\, p(b_i; D)\, db_i$
• Can show: (6.10) is a normal density with mean
$D Z_i^T V_i^{-1}(\xi, x_i)\{y_i - X_i\beta\}$ (6.11)
• The mean of a normal distribution is also the mode of the density ⇒ (6.11) maximizes the posterior density

(57)

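Result: Natural to substitute estimators $\hat\beta$ and $\hat\xi$ in (6.11) and obtain the empirical Bayes “estimator” for $b_i$
$\hat{b}_i = \hat{D} Z_i^T V_i^{-1}(\hat\xi, x_i)\{Y_i - X_i\hat\beta\}$, (6.12)
• Ideally: If $\xi$ were known, (6.12) is
$\hat{b}_i = D Z_i^T V_i^{-1}(\xi, x_i)\{Y_i - X_i\hat\beta\}$ (6.13)
• Can show: Conditional on $\tilde{x}$, (6.13) has mean zero and
$\mathrm{var}(\hat{b}_i \mid \tilde{x}) = D Z_i^T \left\{ V_i^{-1} - V_i^{-1} X_i \left( \sum_{k=1}^m X_k^T V_k^{-1} X_k \right)^{-1} X_i^T V_i^{-1} \right\} Z_i D$
• Can understate the variability in $\hat{b}_i$ because $b_i$ is a “moving target” (random rather than fixed); use instead
$\mathrm{var}(\hat{b}_i - b_i \mid \tilde{x}) = D - D Z_i^T \left\{ V_i^{-1} - V_i^{-1} X_i \left( \sum_{k=1}^m X_k^T V_k^{-1} X_k \right)^{-1} X_i^T V_i^{-1} \right\} Z_i D$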

(58)

Alternatively: Can obtain (6.13) directly by using Bayes theorem with $\xi$ treated as known but $\beta$ viewed instead as a random vector independent of $b_i$
• Take the prior distribution of $\beta$ to be a $N(\beta^*, H)$ distribution with prior density $p(\beta \mid \beta^*, H)$ depending on hyperparameters $\beta^*$ and $H$
• ⇒ Assume vague prior information on $\beta$ by setting $H^{-1} = 0$
• Can show the mean of the posterior density for $\beta$ is $\hat\beta$
• And the mean of the posterior density for $b_i$ is (6.13)
• Left as an exercise for the diligent student

(59)

Standard principle: A “best” predictor $c(Y_i)$ minimizes mean squared error
$E[\{c(Y_i) - b_i\}^T A \{c(Y_i) - b_i\}]$
where the expectation is with respect to the joint distribution of $Y_i$ and $b_i$, and $A$ is any positive definite symmetric matrix
• It is straightforward to show (try it) that the best predictor in this sense is
$E(b_i \mid Y_i)$
which does not depend on $A$
• Thus: Under the normality assumptions for the linear mixed model and $\xi$ known, $\hat{b}_i$ in (6.13) is “best”
• Because $\hat{b}_i$ is also linear in $Y_i$, it is the best linear function of $Y_i$ to use as a predictor under normality

(60)

Restricting to linear predictors: Even without normality, can show with $\xi$ known that (6.13) minimizes mean squared error, is a linear function of $Y_i$, and $E(\hat{b}_i) = E(b_i) = 0$
• Best linear unbiased predictor (BLUP)
• Derivation: Searle, Casella, and McCulloch (2006, Chapter 7), Robinson (1991)
• In practice: $\xi$ is replaced by ML or REML estimator $\hat\xi$ ⇒ estimated best linear unbiased predictor or EBLUP
• The terms BLUP, empirical Bayes estimator, and EBLUP are often used interchangeably

(61)

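Another approach to a predictor: For known $\xi$ (Henderson, 1984)
• “Estimate” $b_i$, $i = 1, \ldots, m$, in the “stacked” vector $b$ jointly with $\beta$, by minimizing in $\beta$ and $b$ the objective function
$\log|\tilde{D}| + b^T \tilde{D}^{-1} b + \log|R| + (Y - X\beta - Zb)^T R^{-1} (Y - X\beta - Zb)$
• Under normality: Twice the negative log of the posterior density of $b$ (fixed $\beta$) and twice the negative loglikelihood for $\beta$ (fixed $b$)
• Using Appendix A, solve
$X^T R^{-1}(Y - X\beta - Zb) = 0$
$\tilde{D}^{-1} b - Z^T R^{-1}(Y - X\beta - Zb) = 0$
• Rearrange to obtain the mixed model equations
$\begin{pmatrix} X^T R^{-1} X & X^T R^{-1} Z \\ Z^T R^{-1} X & Z^T R^{-1} Z + \tilde{D}^{-1} \end{pmatrix} \begin{pmatrix} \hat\beta \\ \hat{b} \end{pmatrix} = \begin{pmatrix} X^T R^{-1} Y \\ Z^T R^{-1} Y \end{pmatrix}$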
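The mixed model equations can be solved numerically and compared with the generalized least squares estimator of $\beta$ and the predictor $\tilde{D} Z^T V^{-1}(Y - X\hat\beta)$, which is (6.13) in stacked form. The sketch below assumes a small unbalanced random intercept-and-slope example with $R = \sigma^2 I_N$; the times, $D$, $\sigma^2$ and responses are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
times = [np.array([0.0, 1.0, 2.0, 3.0]),
         np.array([0.0, 1.0, 3.0]),
         np.array([0.0, 2.0, 3.0, 4.0, 5.0])]      # unbalanced design
D = np.array([[1.0, 0.1], [0.1, 0.2]])             # hypothetical D
sigma2 = 0.5                                       # hypothetical sigma^2
m, q = len(times), 2
N = sum(len(t) for t in times)

# Stacked X (N x p) and block-diagonal Z (N x mq)
X = np.vstack([np.column_stack([np.ones_like(t), t]) for t in times])
Z = np.zeros((N, m * q))
row = 0
for i, t in enumerate(times):
    n = len(t)
    Z[row:row + n, i * q:(i + 1) * q] = np.column_stack([np.ones(n), t])
    row += n
D_tilde = np.kron(np.eye(m), D)                    # block diag(D, ..., D)
R = sigma2 * np.eye(N)
Y = rng.normal(size=N)                             # arbitrary response vector

# Mixed model equations
R_inv = np.linalg.inv(R)
p = X.shape[1]
lhs = np.block([[X.T @ R_inv @ X, X.T @ R_inv @ Z],
                [Z.T @ R_inv @ X, Z.T @ R_inv @ Z + np.linalg.inv(D_tilde)]])
rhs = np.concatenate([X.T @ R_inv @ Y, Z.T @ R_inv @ Y])
solution = np.linalg.solve(lhs, rhs)
beta_mme, b_mme = solution[:p], solution[p:]

# Direct route through V = Z D~ Z^T + R
V_inv = np.linalg.inv(Z @ D_tilde @ Z.T + R)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)
b_blup = D_tilde @ Z.T @ V_inv @ (Y - X @ beta_gls)
```

The two routes agree to numerical precision. A practical appeal of the mixed model equations is that they involve $R^{-1}$ and $\tilde{D}^{-1}$, which are block diagonal, rather than the inverse of the full $V$.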

(62)

Another approach to a predictor: For known $\xi$ (Henderson, 1984)
• Can show using Appendix A
$R^{-1} - R^{-1} Z (Z^T R^{-1} Z + \tilde{D}^{-1})^{-1} Z^T R^{-1} = (R + Z\tilde{D}Z^T)^{-1} = V^{-1}$
that the solutions to the mixed model equations are

(63)

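Predictions are “shrunk” toward the mean: For demonstration, consider (6.13), $\xi$ known
$\hat{b}_i = D Z_i^T V_i^{-1}(\xi, x_i)\{Y_i - X_i\hat\beta\}$
in the case of the simplest linear mixed model (no covariates)
$Y_i = 1_{n_i}\beta + 1_{n_i} b_i + e_i, \quad \mathrm{var}(e_i) = \sigma^2 I_{n_i}, \quad \mathrm{var}(b_i) = D$
• $X_i = 1_{n_i}$ $(p = 1)$, $Z_i = 1_{n_i}$ $(q = 1)$, scalar random effect $b_i$ ⇒
• $V_i = D J_{n_i} + \sigma^2 I_{n_i}$ (compound symmetric), where $J_{n_i}$ is the $(n_i \times n_i)$ matrix of ones
$V_i^{-1} = \sigma^{-2} \left( I_{n_i} - \dfrac{D}{\sigma^2 + n_i D} J_{n_i} \right)$
• BLUP: Can show (verify)
$\hat{b}_i = \dfrac{n_i D}{\sigma^2 + n_i D}(\bar{Y}_i - \hat\beta), \quad \bar{Y}_i = n_i^{-1} \sum_{j=1}^{n_i} Y_{ij}$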

(64)

$\hat{b}_i = \dfrac{n_i D}{\sigma^2 + n_i D}(\bar{Y}_i - \hat\beta), \quad \bar{Y}_i = n_i^{-1} \sum_{j=1}^{n_i} Y_{ij}$ (6.15)
• Write (6.15) as the weighted average
$\hat{b}_i = w_i(\bar{Y}_i - \hat\beta) + (1 - w_i)\,0, \quad w_i = \dfrac{n_i D}{\sigma^2 + n_i D} < 1$
of best guess for where $i$ “sits” in the population relative to overall mean $\beta$, based solely on the data, and 0, the mean of $b_i$
• “Weight” $w_i < 1$ moves $\hat{b}_i$ away from being solely based on the

(65)

$\hat{b}_i = w_i(\bar{Y}_i - \hat\beta) + (1 - w_i)\,0, \quad w_i = \dfrac{n_i D}{\sigma^2 + n_i D} < 1$
• The larger $n_i$ (more data on $i$), the closer $w_i$ is to 1 ⇒ more weight on $(\bar{Y}_i - \hat\beta)$
• Similarly, if among-individual variation is large relative to within-individual variation, so that $D/\sigma^2$ is large, the closer $w_i$ is to 1
• If instead $n_i$ is small and/or $D/\sigma^2$ is small ⇒ information on $i$ is

(66)

Individual-specific mean: $\beta + b_i$, obvious predictor
$\hat\beta + \hat{b}_i = w_i \bar{Y}_i + (1 - w_i)\hat\beta = \dfrac{n_i D}{\sigma^2 + n_i D}\,\bar{Y}_i + \dfrac{\sigma^2}{\sigma^2 + n_i D}\,\hat\beta$
• If $n_i$ and/or $D/\sigma^2$ large, $w_i$ close to 1, and prediction is based mainly on data from $i$, $\bar{Y}_i$
• ⇒ quality of information from $i$ is high and/or little to be learned about a specific individual from the population
• If $n_i$ and/or $D/\sigma^2$ small, $w_i$ close to 0, and prediction is based mainly on $\hat\beta$ (population)
• ⇒ quality of information from $i$ is low and/or little to be learned about a specific individual from his/her data
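The shrinkage form can be checked against the general expression (6.13) in the random intercept model. In this sketch the function names `blup_general` and `blup_shrinkage`, the data vector, and the values of $\hat\beta$, $D$ and $\sigma^2$ are hypothetical, with $\xi$ treated as known.

```python
import numpy as np

def blup_general(y, beta_hat, D, sigma2):
    """b_hat_i = D Z_i^T V_i^{-1} (Y_i - X_i beta_hat) with X_i = Z_i = vector of ones."""
    n = len(y)
    ones = np.ones(n)
    V = D * np.outer(ones, ones) + sigma2 * np.eye(n)   # V_i = D J + sigma^2 I
    return D * (ones @ np.linalg.solve(V, y - beta_hat * ones))

def blup_shrinkage(y, beta_hat, D, sigma2):
    """Same predictor written as w_i * (Ybar_i - beta_hat); also returns w_i."""
    n = len(y)
    w = n * D / (sigma2 + n * D)
    return w * (np.mean(y) - beta_hat), w

y = np.array([12.0, 14.0, 13.0])    # hypothetical data on one individual
beta_hat, D, sigma2 = 10.0, 2.0, 4.0

b_general = blup_general(y, beta_hat, D, sigma2)
b_shrunk, w = blup_shrinkage(y, beta_hat, D, sigma2)
```

With these values $w_i = 6/10 = 0.6$ and $\bar{Y}_i - \hat\beta = 3$, so both routes give $\hat{b}_i = 1.8$: the raw deviation of 3 is pulled 40% of the way toward 0.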

(67)

Terminology: This is referred to as shrinkage
• In predicting where an individual “sits” in the population and thus his/her individual-specific trajectory, information from his/her data is “shrunk” toward the overall population mean
General form: Predictor of individual-specific trajectory $X_i\beta + Z_i b_i$
$X_i\hat\beta + Z_i\hat{b}_i = X_i\hat\beta + Z_i D Z_i^T V_i^{-1}(\xi, x_i)(Y_i - X_i\hat\beta)$
$= (I_{n_i} - Z_i D Z_i^T V_i^{-1}) X_i\hat\beta + Z_i D Z_i^T V_i^{-1} Y_i = R_i V_i^{-1} X_i\hat\beta + (I_{n_i} - R_i V_i^{-1}) Y_i$
• A weighted average of the estimated overall population mean profile $X_i\hat\beta$ and the data $Y_i$ on $i$
• Predictor of individual-specific parameters $\beta_i = A_i\beta + B_i b_i$

(68)

Common: Use $\hat{b}_i$ for diagnostic purposes
• Histograms and scatterplots of components of $\hat{b}_i$ to identify unusual individuals
• Histograms and normal quantile plots of the components of the $\hat{b}_i$ to evaluate relevance of the normality assumption on $b_i$
Caveats:
• $\hat{b}_i$ have different distributions for each $i$ unless $X_i$ and $Z_i$ are the same for all $i$ ⇒ graphics based on the raw $\hat{b}_i$ may be uninterpretable [can standardize the $\hat{b}_i$ using (6.14)]
• Even with the same distribution, $\hat{b}_i$ are subject to shrinkage, so graphics and summaries will reflect less variability than actually present in the distribution of the true $b_i$

(69)

6.3 Inference and considerations for missing data
6.4 Best linear unbiased prediction and empirical Bayes
6.5 Implementation via the EM algorithm

(70)

History: Prior to modern computing, optimization of the ML/REML objective functions could be computationally challenging
• In a world-famous paper, Laird and Ware (1982) showed how this could be accomplished using the expectation-maximization algorithm (EM)
EM algorithm: A computational technique to maximize an objective function
• Can be motivated generically from a MAR missing data perspective starting with the observed data likelihood
• Under reasonable conditions, the algorithm converges to the values of the model parameters maximizing the objective function and is guaranteed to increase toward the maximum at each iteration

(71)

Here: Do not derive the algorithm, but sketch heuristically the rationale for and form of the algorithm for maximizing the ML objective function in the particular linear mixed model
$Y_i = X_i\beta + Z_i b_i + e_i, \quad b_i \sim N(0, D), \quad e_i \sim N(0, \sigma^2 I_{n_i}), \quad i = 1, \ldots, m$
• Missing data analogy: $(Y_i, x_i, b_i)$, $i = 1, \ldots, m$, are the full data
• $b_i$, $i = 1, \ldots, m$, are “missing” for all $i$ ⇒ $(Y_i, x_i)$, $i = 1, \ldots, m$, are the observed data

(72)

“Full data” loglikelihood: The joint density of $(Y_i, b_i)$, $i = 1, \ldots, m$, given $\tilde{x}$ is proportional to
$\prod_{i=1}^m \sigma^{-n_i} \exp\{-(Y_i - X_i\beta - Z_i b_i)^T (Y_i - X_i\beta - Z_i b_i)/(2\sigma^2)\}\, |D|^{-1/2} \exp(-b_i^T D^{-1} b_i / 2)$
• If $\beta$ were known, so only $\xi$ is unknown, sufficient statistics for $\sigma^2$ and $D$ are
$T_1 = \sum_{i=1}^m e_i^T e_i, \quad e_i = Y_i - X_i\beta - Z_i b_i, \quad T_2 = \sum_{i=1}^m b_i b_i^T$
which could be calculated if the “full data” $(Y_i, x_i, b_i)$, $i = 1, \ldots, m$, were available
• Yielding estimators for $\sigma^2$ and $D$

(73)

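EM algorithm: Based on repeated evaluation of the conditional expectations of the “full data” sufficient statistics given the “observed data” $(Y_i, x_i)$, $i = 1, \ldots, m$

• ⇒ Must derive $E(T_1|Y_i, x_i)$, $E(T_2|Y_i, x_i)$

• Can be found by noting that

$$\left.\begin{pmatrix} Y_i \\ b_i \\ e_i \end{pmatrix}\right| x_i \sim N\left\{ \begin{pmatrix} X_i\beta \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} Z_i D Z_i^T + \sigma^2 I_{n_i} & Z_i D & \sigma^2 I_{n_i} \\ D Z_i^T & D & 0 \\ \sigma^2 I_{n_i} & 0 & \sigma^2 I_{n_i} \end{pmatrix} \right\}$$

from which it follows that

$$E(b_i|Y_i, x_i) = D Z_i^T V_i^{-1}(Y_i - X_i\beta), \qquad \mathrm{var}(b_i|Y_i, x_i) = D - D Z_i^T V_i^{-1} Z_i D$$

$$E(e_i|Y_i, x_i) = \sigma^2 V_i^{-1}(Y_i - X_i\beta) = Y_i - X_i\beta - Z_i D Z_i^T V_i^{-1}(Y_i - X_i\beta)$$

$$\mathrm{var}(e_i|Y_i, x_i) = \sigma^2 (I_{n_i} - \sigma^2 V_i^{-1})$$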
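These conditional moments are instances of the standard result for a partitioned multivariate normal vector. As a reminder (the generic symbols $U$, $W$, $\mu$, $\Sigma$ are introduced here only for illustration), if

$$\begin{pmatrix} U \\ W \end{pmatrix} \sim N\left\{ \begin{pmatrix} \mu_U \\ \mu_W \end{pmatrix}, \begin{pmatrix} \Sigma_{UU} & \Sigma_{UW} \\ \Sigma_{WU} & \Sigma_{WW} \end{pmatrix} \right\}$$

then

$$E(U|W) = \mu_U + \Sigma_{UW}\Sigma_{WW}^{-1}(W - \mu_W), \qquad \mathrm{var}(U|W) = \Sigma_{UU} - \Sigma_{UW}\Sigma_{WW}^{-1}\Sigma_{WU}$$

Taking $W = Y_i$, so that $\Sigma_{WW} = V_i = Z_i D Z_i^T + \sigma^2 I_{n_i}$, with $U = b_i$ ($\Sigma_{UW} = D Z_i^T$) or $U = e_i$ ($\Sigma_{UW} = \sigma^2 I_{n_i}$) gives the four expressions above.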

(74)

Sufficient statistics: Using the above results

$$T_1 = \sum_{i=1}^{m} e_i^T e_i, \quad e_i = Y_i - X_i\beta - Z_i b_i, \qquad T_2 = \sum_{i=1}^{m} b_i b_i^T$$

• $E(T_2|Y_i, x_i)$ can be obtained from

$$E(b_i b_i^T|Y_i, x_i) = E(b_i|Y_i, x_i)E(b_i|Y_i, x_i)^T + \mathrm{var}(b_i|Y_i, x_i)$$

• $E(T_1|Y_i, x_i)$ can be obtained from

$$E(e_i^T e_i|Y_i, x_i) = \mathrm{tr}\{E(e_i e_i^T|Y_i, x_i)\}$$
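Carrying these two identities through with the conditional moments of $b_i$ and $e_i$ gives the per-individual terms explicitly. As a worked sketch, write $r_i = Y_i - X_i\beta$ and $\hat b_i = E(b_i|Y_i, x_i) = D Z_i^T V_i^{-1} r_i$ (the hat notation is introduced here for illustration):

$$E(b_i b_i^T|Y_i, x_i) = \hat b_i \hat b_i^T + D - D Z_i^T V_i^{-1} Z_i D$$

$$E(e_i^T e_i|Y_i, x_i) = (r_i - Z_i \hat b_i)^T (r_i - Z_i \hat b_i) + \sigma^2\, \mathrm{tr}(I_{n_i} - \sigma^2 V_i^{-1})$$

The second line uses $E(e_i e_i^T|Y_i, x_i) = E(e_i|Y_i, x_i)E(e_i|Y_i, x_i)^T + \mathrm{var}(e_i|Y_i, x_i)$ together with $E(e_i|Y_i, x_i) = r_i - Z_i \hat b_i$.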

(75)

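EM algorithm: With starting values $\sigma^{2(0)}$, $D^{(0)}$, at the $\ell$th iteration, with current iterates $\sigma^{2(\ell)}$, $D^{(\ell)}$ and $V_i^{(\ell)} = \sigma^{2(\ell)} I_{n_i} + Z_i D^{(\ell)} Z_i^T$, carry out the following two steps

1. Calculate

$$\beta^{(\ell)} = \left( \sum_{i=1}^{m} X_i^T \{V_i^{(\ell)}\}^{-1} X_i \right)^{-1} \sum_{i=1}^{m} X_i^T \{V_i^{(\ell)}\}^{-1} Y_i$$

2. Define

$$r_i^{(\ell)} = Y_i - X_i\beta^{(\ell)}, \qquad b_i^{(\ell)} = D^{(\ell)} Z_i^T \{V_i^{(\ell)}\}^{-1} r_i^{(\ell)}, \quad i = 1, \ldots, m$$

Then update $\sigma^{2(\ell)}$ and $D^{(\ell)}$ as

$$\sigma^{2(\ell+1)} = N^{-1} \sum_{i=1}^{m} \left[ (r_i^{(\ell)} - Z_i b_i^{(\ell)})^T (r_i^{(\ell)} - Z_i b_i^{(\ell)}) + \sigma^{2(\ell)}\, \mathrm{tr}\left( I_{n_i} - \sigma^{2(\ell)} \{V_i^{(\ell)}\}^{-1} \right) \right]$$

$$D^{(\ell+1)} = m^{-1} \sum_{i=1}^{m} \left[ b_i^{(\ell)} b_i^{(\ell)T} + D^{(\ell)} - D^{(\ell)} Z_i^T \{V_i^{(\ell)}\}^{-1} Z_i D^{(\ell)} \right]$$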
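The two steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of Laird and Ware or of any package: the function name `em_lmm`, its arguments, and the default starting values are hypothetical, $N$ is taken to be the total number of observations, and the $D$ update uses $E(b_i b_i^T|Y_i, x_i) = E(b_i|Y_i, x_i)E(b_i|Y_i, x_i)^T + \mathrm{var}(b_i|Y_i, x_i)$ from the sufficient-statistics slide.

```python
import numpy as np

def em_lmm(Y, X, Z, sigma2=1.0, D=None, n_iter=100):
    """EM iterations for Y_i = X_i beta + Z_i b_i + e_i (ML objective).

    Y, X, Z are lists over individuals of arrays with shapes
    (n_i,), (n_i, p) and (n_i, q). Returns the last beta, the updated
    sigma2 and D, and the log-likelihood evaluated at each iteration.
    """
    m = len(Y)
    N = sum(len(y) for y in Y)           # assumed: N = total number of observations
    q = Z[0].shape[1]
    D = np.eye(q) if D is None else np.asarray(D, dtype=float)
    loglik_path = []
    for _ in range(n_iter):
        # Current V_i^{-1} for every individual
        Vinv = [np.linalg.inv(sigma2 * np.eye(len(y)) + Zi @ D @ Zi.T)
                for y, Zi in zip(Y, Z)]
        # Step 1: generalized least squares for beta at the current (sigma2, D)
        A = sum(Xi.T @ Vi @ Xi for Xi, Vi in zip(X, Vinv))
        c = sum(Xi.T @ Vi @ y for Xi, Vi, y in zip(X, Vinv, Y))
        beta = np.linalg.solve(A, c)
        # Step 2: conditional expectations of T1 and T2 given the observed data
        T1, T2, ll = 0.0, np.zeros((q, q)), 0.0
        for y, Xi, Zi, Vi in zip(Y, X, Z, Vinv):
            r = y - Xi @ beta
            b = D @ Zi.T @ Vi @ r        # E(b_i | Y_i, x_i)
            resid = r - Zi @ b           # E(e_i | Y_i, x_i)
            T1 += resid @ resid + sigma2 * np.trace(np.eye(len(y)) - sigma2 * Vi)
            T2 += np.outer(b, b) + D - D @ Zi.T @ Vi @ Zi @ D
            # Observed-data log-likelihood term; log|V_i^{-1}| = -log|V_i|
            ll += 0.5 * np.linalg.slogdet(Vi)[1] - 0.5 * r @ Vi @ r
        loglik_path.append(ll - 0.5 * N * np.log(2 * np.pi))
        sigma2, D = T1 / N, T2 / m       # updates for the next iteration
    return beta, sigma2, D, loglik_path
```

Called with lists of per-individual arrays, the function returns the final iterates together with the log-likelihood at each iteration, which should be nondecreasing. Inverting each $V_i$ directly is chosen here for readability; it is not how production software would handle large $n_i$.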

(76)

Remarks:

• Details of implementation in Laird, Lange, and Stram (1987); these authors also present an algorithm for maximizing the REML objective function

• The algorithm can be very slow to reach convergence, but the value of the objective function is guaranteed to increase at every iteration, which is reassuring

• With modern computing, implementations of direct optimization in SAS and R have been optimized to the point that it is unusual to encounter computational difficulties with linear mixed effects models

• However, EM algorithms can be very valuable in much more complicated statistical models

(77)

(78)

Recap: Induced overall covariance structure

$$\mathrm{var}(Y|\tilde{x}) = V(\xi, \tilde{x}) = V = Z\tilde{D}Z^T + R$$

• Population-averaged perspective: This induced structure is a flexible way of representing a complex true structure

• $\xi = \{\gamma^T, \mathrm{vech}(D)^T\}^T$ are parameters that characterize this structure, with no restrictions on possible values of $\xi$

• ⇒ $D$ and $R_i(\gamma)$ need not be legitimate covariance matrices

• Subject-specific perspective: $D$ and $R_i(\gamma)$ must be legitimate covariance matrices corresponding to among- and within-individual sources of variation and correlation

• ⇒ Restrictions on the parameter space of $\xi = \{\gamma^T, \mathrm{vech}(D)^T\}^T$

Imperative: The analyst must acknowledge the modeling perspective in making inferences on $\xi = \{\gamma^T, \mathrm{vech}(D)^T\}^T$

(79)

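Hip replacement study data: Recall the SS model

$$Y_{ij} = \beta_{0i} + \beta_{1i} t_{ij} + \beta_{2i} t_{ij}^2 + e_{ij}, \qquad \beta_i = (\beta_{0i}, \beta_{1i}, \beta_{2i})^T$$

$$\beta_i = A_i\beta + B_i b_i, \qquad b_i = \begin{pmatrix} b_{0i} \\ b_{1i} \\ b_{2i} \end{pmatrix}, \qquad B_i = I_3$$

• With $b_i \perp\!\!\!\perp a_i$, $\mathrm{var}(b_i) = D$ $(3 \times 3)$, $\mathrm{var}(e_i) = \sigma^2 I_{n_i}$

$$V_i = Z_i D Z_i^T + \sigma^2 I_{n_i}, \qquad \xi = \{\sigma^2, \mathrm{vech}(D)^T\}^T \ (7 \times 1)$$

• PA perspective: This model induces population mean and overall covariance models; no restrictions on $\xi$

• SS perspective: Required that $D$ is nonnegative definite (legitimate covariance matrix) and $\sigma^2 \geq 0$ (variance)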

(80)

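“Reduced” model: Take $\beta_{2i}$ to be “fixed” (eliminate $b_{2i}$)

$$b_i = \begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix}, \qquad B_i = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}$$

• ⇒ $\mathrm{var}(b_i) = D_2$, $\xi = \{\sigma^2, \mathrm{vech}(D_2)^T\}^T$ $(4 \times 1)$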

• PA perspective: A way to induce a more parsimonious overall covariance structure with fewer parameters

• SS perspective: Relative to variation in the population in intercepts and linear components, individual-specific quadratic components either do not vary at all or exhibit negligible variation

• ⇒ Are individual-specific quadratic components “fixed” or “random”?

(81)

Either case: Is the “full” model required, or is the “reduced” model adequate?

• PA perspective: Is a simpler representation of the overall covariance structure based on fewer parameters adequate?

• SS perspective: Belief about the relative magnitude of variation in individual-specific quadratic components

• The “reduced” model can be represented as taking $D$ $(3 \times 3)$ to be

$$D = \begin{pmatrix} D_2 & 0 \\ 0 & 0 \end{pmatrix}$$
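The equivalence between dropping $b_{2i}$ and zeroing the last row and column of $D$ can be checked numerically. The sketch below assumes that $Z_i$ in the full model has columns $(1, t_{ij}, t_{ij}^2)$ and that the reduced $Z_i$ keeps the first two; the time points and parameter values are made up purely for illustration.

```python
import numpy as np

# Hypothetical measurement times for one individual (illustrative values only)
t = np.array([0.0, 1.0, 2.0, 3.0])
Z_full = np.column_stack([np.ones_like(t), t, t**2])  # Z_i for the "full" model
Z_red = Z_full[:, :2]                                  # Z_i for the "reduced" model

sigma2 = 0.5                                           # illustrative within-individual variance
D2 = np.array([[1.0, 0.3],
               [0.3, 0.4]])                            # illustrative 2x2 covariance matrix

# Embed D2 in a 3x3 matrix whose last row and column are zero
D_full = np.zeros((3, 3))
D_full[:2, :2] = D2

# Induced covariance matrices V_i = Z_i D Z_i^T + sigma2 * I under each representation
V_from_full = Z_full @ D_full @ Z_full.T + sigma2 * np.eye(len(t))
V_from_red = Z_red @ D2 @ Z_red.T + sigma2 * np.eye(len(t))

same = np.allclose(V_from_full, V_from_red)
print(same)
```

Both representations induce the same $V_i$, which is why the “reduced” model is nested within the “full” one.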

(82)

Formally: Can be addressed by testing the null hypothesis

$$H_0: D = \begin{pmatrix} D_{11} & D_{12} & D_{13} \\ D_{12} & D_{22} & D_{23} \\ D_{13} & D_{23} & D_{33} \end{pmatrix} = \begin{pmatrix} D_{11} & D_{12} & 0 \\ D_{12} & D_{22} & 0 \\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} D_2 & 0 \\ 0 & 0 \end{pmatrix}$$

against an appropriate alternative

• Nested models ⇒ suggests likelihood ratio test (LRT)

Validity of the LRT for $H_0$: A required regularity condition for usual large sample theory approximations is that the true value of a parameter is not on the boundary of its parameter space but rather lies in its interior

• Hypothesis testing: The value of the parameter under the null hypothesis cannot be on the boundary of the parameter space for usual asymptotic arguments leading to tests to be valid

(83)

Implication:

• PA perspective: $D$ $(3 \times 3)$ is a symmetric matrix with parameters characterizing overall covariance structure ⇒ no restriction on $D_{33}$ (or any element of $D$), so that the value of $D_{33}$ under $H_0$ is in the interior of the parameter space

• SS perspective: $D$ $(3 \times 3)$ is a legitimate covariance matrix ⇒ $D_{33}$ is a variance; for $D$ to be nonnegative definite, it must be that $D_{33} \geq 0$, so that the value of $D_{33}$ under $H_0$ is on the boundary of the parameter space

Result: Under a PA perspective, comparing the usual likelihood ratio test statistic described above to the appropriate chi-square critical value will yield a valid test of $H_0$ (interpreted from a PA perspective as above)

(84)

Subject-specific perspective: Carrying out the likelihood ratio test in the usual way will not lead to a valid test of $H_0$

• Must appeal to specialized theoretical results for nonstandard testing situations in a classic paper by Self and Liang (1987)

• Stram and Lee (1994) used this theory to demonstrate that, when $R_i = \sigma^2 I_{n_i}$, the large sample distribution of the LRT statistic is a mixture of chi-squared distributions

• See also Section 6.3 of Verbeke and Molenberghs (2000); see also Verbeke and Molenberghs (2003)

(85)

General result: Under an SS perspective, with $D$ $\{(q+1) \times (q+1)\}$

$$H_0: D = \begin{pmatrix} D_q & 0 \\ 0 & 0 \end{pmatrix}$$

for $D_q$ $(q \times q)$ positive definite versus the alternative that $D$ is a general $\{(q+1) \times (q+1)\}$ nonnegative definite matrix

• Large sample distribution of the LRT statistic under $H_0$ is a mixture of a $\chi^2_{q+1}$ and a $\chi^2_q$ distribution with equal weights of 0.5

• Effect: Reduce the p-value that results relative to the p-value obtained by (incorrectly) using the LRT procedure in the usual way ⇒ ignoring the “boundary problem” leads to rejection of $H_0$ less often and possibly adopting models that are too simple
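The effect on the p-value can be illustrated with a short computation. The sketch below uses only the Python standard library; the function names and the value 6.5 for the LRT statistic are hypothetical, and the “usual” reference distribution is assumed to be $\chi^2_{q+1}$, one degree of freedom for each distinct element of $D$ set to zero under $H_0$.

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with nonnegative integer df."""
    if df < 0:
        raise ValueError("df must be a nonnegative integer")
    if df == 0:                      # chi-square with 0 df is a point mass at zero
        return 0.0 if x > 0 else 1.0
    if x <= 0:
        return 1.0
    # Base cases: df = 1 via the complementary error function, df = 2 exponential
    nu = 1 if df % 2 else 2
    sf = math.erfc(math.sqrt(x / 2.0)) if nu == 1 else math.exp(-x / 2.0)
    # Recurrence: Q(x; nu + 2) = Q(x; nu) + (x/2)^(nu/2) e^(-x/2) / Gamma(nu/2 + 1)
    while nu < df:
        sf += (x / 2.0) ** (nu / 2.0) * math.exp(-x / 2.0) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return sf

def lrt_pvalues(lrt_stat, q):
    """Return (naive, mixture) p-values for testing q versus q + 1 random effects."""
    naive = chi2_sf(lrt_stat, q + 1)                    # usual chi-square reference
    mixture = 0.5 * chi2_sf(lrt_stat, q) + 0.5 * naive  # equal-weight mixture
    return naive, mixture

# Hypothetical LRT statistic for the hip replacement comparison (q = 2)
naive_p, mixture_p = lrt_pvalues(6.5, q=2)
print(naive_p, mixture_p)
```

Because the $\chi^2_q$ tail probability never exceeds the $\chi^2_{q+1}$ tail probability at the same point, the mixture p-value is never larger than the naive one.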

(86)

Additional questions of interest: From an SS perspective, covariance parameters $\xi$ may be of scientific interest

• E.g., diagonal elements of $D$ represent magnitudes of variation in features of inherent trajectories in $\beta_i$, such as individual-specific intercepts and slopes, in the population of individuals

• Thus, estimates of these parameters may be desired

(87)

Approximate standard errors: In principle can be based on a large sample approximation to the sampling distribution of the estimator $\hat\xi$

• Can be derived by an estimating equation argument ($n_i$ fixed) assuming the induced models for population mean response and overall covariance structure are correctly specified

• Issue: The covariance matrix of the approximate sampling distribution depends on third and fourth moments of the true distribution of $Y_i|x_i$

• If $Y_i|x_i$ is taken to be normally distributed, this covariance matrix can be derived from the information matrix and depends on the fourth moment of a normal distribution

• However, if the true distribution departs from the normal even slightly, these approximate standard errors for components of $\hat\xi$ can be very unreliable