Comparing Functional Data Analysis Approach and Nonparametric Mixed-Effects Modeling Approach for Longitudinal Data Analysis

(1)


Hulin Wu, PhD, Professor (with Dr. Shuang Wu)

Department of Biostatistics & Computational Biology, University of Rochester Medical Center


(2)

Table of contents

1 Introduction

2 Comparisons: NPME vs. fPCA-PACE

3 Comparisons: Individual Smoothing vs. fPCA-Integration Method

(3)

Question to Address

Nonparametric longitudinal data analysis methods: Nonparametric mixed-effects models

(4)

Analysis of longitudinal studies

Parametric mixed-effects models: linear (LME) and nonlinear (NLME) mixed-effects models, e.g.

$y_i = X_i\beta + Z_i b_i + \epsilon_i,$

$b_i \sim N(0, D), \quad \epsilon_i \sim N(0, R_i), \quad i = 1, 2, \ldots, n$

Parametric: restrictive

(5)

Nonparametric mixed-effects (NPME) model

$y_i(t) = \mu(t) + \nu_i(t) + \epsilon_i(t) = \sum_{j=1}^{p} \beta_j B_j(t) + \sum_{k=1}^{q} b_{ik} B_k^*(t) + \epsilon_i(t)$

Regression splines: various choices of basis functions, which are known

Mixed-effects modeling: borrow information across subjects (curves), shrink to the mean
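The "shrink to the mean" effect can be seen in the simplest special case of a mixed-effects model, a random intercept $y_{ij} = \mu + b_i + \epsilon_{ij}$ with $b_i \sim N(0, \sigma_b^2)$ and $\epsilon_{ij} \sim N(0, \sigma^2)$. The sketch below assumes this simplified model with known variances; it is not the NPME estimator itself, and the function name and numbers are illustrative only.

```python
def blup_random_intercept(subject_values, mu, var_b, var_e):
    """Predict mu + b_i for one subject in a random-intercept model.

    Assumes (for illustration only) that mu, var_b and var_e are known.
    """
    n_i = len(subject_values)
    subject_mean = sum(subject_values) / n_i
    # Shrinkage weight: near 1 with many observations, near 0 with few.
    weight = n_i * var_b / (n_i * var_b + var_e)
    return mu + weight * (subject_mean - mu)


# Two hypothetical subjects with the same raw mean (3.0) but different sample sizes.
sparse = [3.0]
dense = [3.0] * 20
fit_sparse = blup_random_intercept(sparse, mu=1.0, var_b=1.0, var_e=4.0)
fit_dense = blup_random_intercept(dense, mu=1.0, var_b=1.0, var_e=4.0)
print(fit_sparse, fit_dense)
```

The subject with a single observation is pulled most of the way toward the population mean, while the subject with 20 observations stays close to its own average. This is the sense in which sparse curves borrow information from the rest of the sample.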

(6)

Functional approach based on principal component analysis

$Y_{ij} = X_i(t_{ij}) + \epsilon_{ij} = \mu(t_{ij}) + \sum_{k=1}^{K} \xi_{ik}\phi_k(t_{ij}) + \epsilon_{ij}$

Mean function $\mu(t)$: any nonparametric smoothing method

Between-subject (curve) variation $\sum_{k=1}^{K} \xi_{ik}\phi_k(t_{ij})$: Karhunen-Loève approximation

Both PC scores ($\xi_{ik}$) and basis functions (eigenfunctions $\phi_k(t)$) need to be estimated from data

PC scores (coefficients): estimated by PACE, which uses the mixed-effects modeling idea to borrow information across subjects (curves)
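For reference, the two ways of estimating PC scores that this talk contrasts can be written as below. The notation beyond the model above is an assumption that follows the usual PACE formulation: $\lambda_k$ is the $k$-th eigenvalue, $\phi_{ik}$ and $\mu_i$ are $\phi_k$ and $\mu$ evaluated at subject $i$'s observation times, $\Sigma_{Y_i}$ is the covariance matrix of the observation vector $Y_i$, and $\mathcal{T}$ is the time domain.

```latex
% Integration estimate: uses subject i's own observations only,
% approximated in practice by a quadrature sum over the t_{ij}
\hat{\xi}_{ik}^{\mathrm{int}}
  = \int_{\mathcal{T}} \bigl(Y_i(t) - \hat{\mu}(t)\bigr)\,\hat{\phi}_k(t)\,dt

% PACE estimate: conditional expectation under joint Gaussianity
% of the scores and the measurement errors
\hat{\xi}_{ik}^{\mathrm{PACE}}
  = \hat{E}[\xi_{ik} \mid Y_i]
  = \hat{\lambda}_k\,\hat{\phi}_{ik}^{T}\,\hat{\Sigma}_{Y_i}^{-1}\,(Y_i - \hat{\mu}_i)
```

The PACE form has the same structure as a best linear unbiased predictor in a mixed-effects model, which is why it shrinks the scores of sparsely observed subjects toward zero.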

(7)

Simulation Comparisons: NPME and fPCA-PACE

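$y_i(t) = a_{i0} + a_{i1}\cos(2\pi t) + a_{i2}\sin(2\pi t) + \epsilon_i(t),$

$a_i = [a_{i0}, a_{i1}, a_{i2}]^T \sim N[(1, 2, 1), \mathrm{diag}(\sigma_0^2, \sigma_1^2, \sigma_2^2)],$

$\epsilon_i(t) \sim N[0, \sigma^2(1 + t)], \quad i = 1, 2, \ldots, n$

$t_j = j/(m + 1), \quad j = 1, 2, \ldots, m; \quad n = 50, \; m = 20$

Unbalanced data: $r_{\mathrm{miss}} = 0.2, 0.5, 0.8$

$\mathrm{ISE} = \int (\hat{\mu}(t) - \mu(t))^2\,dt, \qquad \mathrm{MISE} = \frac{1}{n}\sum_{i=1}^{n}\int (\hat{y}_i(t) - y_i(t))^2\,dt$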
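The simulation design and the ISE criterion above can be sketched as follows. Several details are assumptions, since the slide does not state them: the noise scale `sigma` is set to 1.0 as a placeholder, each observation is deleted independently with probability `rmiss`, the integral is approximated by the trapezoidal rule on the design grid, and a plain pointwise average stands in for the mean estimators compared in the talk.

```python
import math
import random


def simulate_curves(n=50, m=20, sds=(2.0, 1.0, 1.0), sigma=1.0, rmiss=0.2, seed=0):
    """Draw n noisy curves on t_j = j/(m+1) and delete points at random."""
    rng = random.Random(seed)
    grid = [j / (m + 1) for j in range(1, m + 1)]
    means = (1.0, 2.0, 1.0)
    data = []
    for _ in range(n):
        a = [rng.gauss(mu, sd) for mu, sd in zip(means, sds)]
        curve = []
        for t in grid:
            if rng.random() < rmiss:
                continue  # this observation is missing
            signal = a[0] + a[1] * math.cos(2 * math.pi * t) + a[2] * math.sin(2 * math.pi * t)
            # Noise variance is sigma^2 * (1 + t), so it grows over time.
            noise = rng.gauss(0.0, sigma * math.sqrt(1.0 + t))
            curve.append((t, signal + noise))
        data.append(curve)
    return grid, data


def true_mean(t):
    return 1.0 + 2.0 * math.cos(2 * math.pi * t) + math.sin(2 * math.pi * t)


def ise(estimate, truth, grid):
    """Trapezoidal approximation of the integrated squared error over the grid range."""
    sq = [(estimate(t) - truth(t)) ** 2 for t in grid]
    return sum(0.5 * (sq[k] + sq[k + 1]) * (grid[k + 1] - grid[k]) for k in range(len(grid) - 1))


grid, data = simulate_curves()

# Naive stand-in estimator: average the available observations at each grid point.
sums = {t: [0.0, 0] for t in grid}
for curve in data:
    for t, y in curve:
        sums[t][0] += y
        sums[t][1] += 1
pointwise_mean = {t: s / c for t, (s, c) in sums.items() if c > 0}

ise_value = ise(lambda t: pointwise_mean[t], true_mean, grid)
print(ise_value)
```

Replacing the pointwise average with a smoothed or mixed-effects estimate and repeating over many seeds would give summaries comparable in kind to those reported in the tables.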

(8)

Simulation I: small variation, $(\sigma_0, \sigma_1, \sigma_2) = (2, 1, 1)$

[Figure: simulated curves $y_i$ plotted against $t$ on $[0, 1]$.]

(9)

Simulation I: small variation, $(\sigma_0, \sigma_1, \sigma_2) = (2, 1, 1)$

rmiss  Model  Mean function (ISE)  Individual fits (MISE)
20%    LPME   0.1044 (0.1259)      0.3733 (0.0488)
20%    RSME   0.1053 (0.1218)      0.3733 (0.0488)
20%    PACE   0.1477 (0.1323)      0.3852 (0.1182)
50%    LPME   0.1069 (0.1016)      0.6158 (0.0813)
50%    RSME   0.1092 (0.0980)      0.6158 (0.0813)
50%    PACE   0.1577 (0.1280)      0.6932 (0.1285)
80%    LPME   0.2025 (0.1509)      1.5302 (0.2764)
80%    RSME   0.2131 (0.1414)      1.5302 (0.2764)
80%    PACE   0.251 (0.1894)       1.9874 (0.6914)

(10)

Simulation II: large variation, $(\sigma_0, \sigma_1, \sigma_2) = (3, 3, 3)$

[Figure: simulated curves $y_i$ plotted against $t$ on $[0, 1]$.]

(11)

Simulation II: large variation, $(\sigma_0, \sigma_1, \sigma_2) = (3, 3, 3)$

rmiss  Model  Mean function (ISE)  Individual fits (MISE)
20%    LPME   0.3124 (0.2775)      1.9526 (0.3421)
20%    RSME   0.3206 (0.2797)      1.9526 (0.3421)
20%    PACE   0.3639 (0.3041)      0.4511 (0.0647)
50%    LPME   0.3212 (0.2565)      3.6714 (0.6056)
50%    RSME   0.3297 (0.2407)      3.6714 (0.6056)
50%    PACE   0.3828 (0.2972)      1.1166 (0.2640)
80%    LPME   0.5516 (0.4300)      8.4689 (1.3465)
80%    RSME   0.6529 (0.4426)      8.4689 (1.3465)
80%    PACE   0.6416 (0.5362)      6.7611 (2.3501)

Mean function estimate winner: NPME model

(12)

Example 1: Viral load in AIDS clinical trials

[Figure: viral load versus time (day), days 0 to 90. Mean function estimates: RSME (blue), FPCA (red).]

$n = 46$ patients; the number of observations per patient, $n_i$, ranges from 4 to 10, with a median of 8.

(13)

Viral load: individual fits

[Figure: individual fits for Patients 3, 9, 13, 18, 22, 30, 32, 42, and 46.]

(14)

Example 2: Yeast cell cycle gene expressions

[Figure: gene expression versus time (min), 0 to 120 minutes.]

6075 genes, $t_j = 7(j - 1)$ minutes, $j = 1, 2, \ldots, 18$.

Gene expressions are centered by the mean of each gene; the data contain missing values.

(15)

Yeast gene expressions: individual fits

[Figure: individual fits for Genes 226, 1937, 1941, 3112, 3505, 4025, 4650, 5751, and 5990.]

(16)

Time-course microarray gene expressions

Independent sampling: one measurement from each subject, e.g. mice

Longitudinal sampling: repeated measurements from the same subject, e.g. humans

Features of data:

number of genes $n$ very large, usually several thousand

number of time points $m$ small ($m \approx 10$)

very few replications at each time point, usually 2 or 3

noisy, possibly with missing data

(17)

Time-course microarray gene expressions

Problem of interest: identify differentially expressed genes

One group: difference from baseline; variation over time

Two or more groups: difference between groups

Methods:

ANOVA approach: treat the time variable as an experimental factor (a direct extension of static microarray experiments)

Continuous approach: treat gene expressions as noisy measurements from an underlying function; nonparametric estimation of the underlying function (possibly with random effects)

(18)

Time-course microarray gene expressions

$y_{ijk} = x_i(t_j) + \epsilon_{ijk}, \quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, m; \; k = 1, \ldots, K$

$x_i(t) = \sum_{l=0}^{L} \beta_{il}\phi_l(t), \quad \epsilon_{ijk} \sim (0, \sigma^2)$

$H_0: x_i(t) = 0, \quad i = 1, \ldots, n$

$\phi_l(t)$: spline basis or PC basis

In real data, no clear cut

Statistics that provide a good ranking

Multiple testing adjustment to control the error rate, e.g. False Discovery Rate (FDR)
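As an illustration of an FDR-controlling adjustment, the sketch below implements the Benjamini-Hochberg step-up procedure. This choice is an assumption: the slides only name FDR control, and the methods compared here may use a different procedure (for example a q-value approach). The p-values are made up for the example.

```python
def benjamini_hochberg(p_values, level=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank r (1-based) with p_(r) <= r * level / n.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * level / n:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * n
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(p, level=0.05)
print(sum(decisions), "genes rejected")
```

Because the procedure is step-up, a gene can be rejected even when its own p-value exceeds its rank threshold, provided a larger p-value further down the ranking meets its threshold.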

(19)

Methods

Individual nonparametric smoothing (EDGE)

φl(t) as fixed basis

statistics: goodness-of-fit (F statistics); area under curve (AUC)

fPCA-integration method (individual estimate of PC scores)

$\phi_l(t)$ as eigenfunctions, estimated from the entire sample

statistics: area under curve (AUC)

Both use the bootstrap to calculate the null distribution of the statistics

Significance cut-off by controlling FDR
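A bootstrap null distribution for one gene can be sketched as follows. The details are assumptions, not the EDGE or fPCA-integration implementations: the AUC statistic is taken to be the trapezoidal integral of the squared time-point means, and null data are generated by resampling residuals around those means.

```python
import math
import random


def auc_statistic(gene, times):
    """Trapezoidal integral of the squared time-point means (assumed AUC form)."""
    means = [sum(reps) / len(reps) for reps in gene]
    sq = [mu * mu for mu in means]
    return sum(0.5 * (sq[j] + sq[j + 1]) * (times[j + 1] - times[j]) for j in range(len(times) - 1))


def bootstrap_p_value(gene, times, n_boot=200, seed=0):
    """Compare the observed statistic with statistics from resampled noise."""
    rng = random.Random(seed)
    observed = auc_statistic(gene, times)
    # Residuals around the time-point means approximate the noise distribution.
    residuals = [y - sum(reps) / len(reps) for reps in gene for y in reps]
    exceed = 0
    for _ in range(n_boot):
        # Under H0 the curve is zero, so null data are pure resampled noise.
        null_gene = [[rng.choice(residuals) for _ in reps] for reps in gene]
        if auc_statistic(null_gene, times) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_boot)


# Hypothetical gene with a clear sinusoidal signal, 10 time points, 3 replicates.
rng = random.Random(42)
times = [j / 9 for j in range(10)]
signal_gene = [[2.0 * math.sin(2 * math.pi * t) + rng.gauss(0.0, 0.5) for _ in range(3)] for t in times]
p_signal = bootstrap_p_value(signal_gene, times)
print(p_signal)
```

The add-one correction in the returned ratio keeps the p-value strictly positive, which matters when p-values are later passed to an FDR procedure.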

(20)

Simulation study

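$n = 1000$, $m = 10$, $K = 3$

observations equidistant in $[0, 1]$

proportion of significant genes $p = 0.1$

Under $H_0$: $y_{ijk} = \epsilon_{ijk}$, $\epsilon_{ijk} \sim N(0, 0.5^2)$

Under $H_1$: $y_{ijk} = a_i \sin(2\omega_i\pi(t_j - b_i)) + \epsilon_{ijk}$, where $a_i, \omega_i \sim U(0.5, 2)$, $b_i \sim U(0, 1)$.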
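The data-generating design above can be sketched as follows. Two choices are assumptions: the $m$ time points include both endpoints of $[0, 1]$, and the significant genes are simply placed first in the list.

```python
import math
import random


def simulate_time_course(n=1000, m=10, K=3, p=0.1, sd=0.5, seed=1):
    """Return time points, truth labels, and data[i][j][k] for gene i, time j, replicate k."""
    rng = random.Random(seed)
    times = [j / (m - 1) for j in range(m)]  # equidistant, endpoints included (assumption)
    n_sig = round(n * p)
    is_sig = [i < n_sig for i in range(n)]  # first n_sig genes follow H1
    data = []
    for i in range(n):
        if is_sig[i]:
            a = rng.uniform(0.5, 2.0)
            omega = rng.uniform(0.5, 2.0)
            b = rng.uniform(0.0, 1.0)
            mean = [a * math.sin(2 * omega * math.pi * (t - b)) for t in times]
        else:
            mean = [0.0] * m  # H0: flat curve at zero
        data.append([[mu + rng.gauss(0.0, sd) for _ in range(K)] for mu in mean])
    return times, is_sig, data


times, is_sig, data = simulate_time_course()
print(sum(is_sig), len(data), len(data[0]), len(data[0][0]))
```

Knowing `is_sig` for every gene is what makes it possible to count correct rejections and compute realized error rates for each method.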

(21)

Simulation I

Error under $H_1$: $\epsilon_{ijk} \sim N(0, 0.5^2)$

Method  Nominal FDR  Num. rejected  Correctly rejected  FDR     FNR
EDGE    0.05         91.44          87.35               0.0441  0.0139
EDGE    0.1          100.21         90.94               0.0909  0.0100
EDGE    0.2          116.52         93.98               0.1912  0.0068
PCA     0.05         96.45          94.25               0.0225  0.0064
PCA     0.1          102.54         96.12               0.0619  0.0043
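The realized FDR and FNR can be approximately recovered from the rejection counts. The definitions below are the usual ones and are an assumption, since the slides do not spell them out: FDR is the share of rejections that are false, and FNR is the share of non-rejected genes that are truly significant. With $n = 1000$ and $p = 0.1$ there are 100 truly significant genes.

```python
def error_rates(num_rejected, num_correct, n_genes=1000, n_true=100):
    """Realized FDR and FNR from rejection counts (assumed standard definitions)."""
    false_discoveries = num_rejected - num_correct
    missed = n_true - num_correct
    fdr = false_discoveries / num_rejected if num_rejected > 0 else 0.0
    fnr = missed / (n_genes - num_rejected) if n_genes > num_rejected else 0.0
    return fdr, fnr


# Average counts reported for Simulation I at nominal FDR 0.05.
edge_fdr, edge_fnr = error_rates(91.44, 87.35)
pca_fdr, pca_fnr = error_rates(96.45, 94.25)
print(round(edge_fdr, 4), round(edge_fnr, 4))
print(round(pca_fdr, 4), round(pca_fnr, 4))
```

The results are close to, but not identical with, the reported values, which is expected if the reported rates are averages of per-run ratios rather than ratios of average counts.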

(22)

Simulation II

Error under $H_1$: $\epsilon_{ijk} \sim N(0, (0.5 v_i)^2)$, where $v_i$ is a dispersion factor

Method  Nominal FDR  Num. rejected  Correctly rejected  FDR     FNR
EDGE    0.05         66.22          63.23               0.0442  0.0393
EDGE    0.1          81.99          74.41               0.0905  0.0278
EDGE    0.2          104.41         84.51               0.1875  0.0172
PCA     0.05         80.08          80.08               0       0.0216
PCA     0.1          86.55          86.44               0.0013  0.0148

(23)

Gene data from lungs of mice

number of probes: n= 35557

days post infection (DPI): 0,1, ...,10 (m= 11)

replicates: 3 for each of DPI = 1, ..., 10; 6 for DPI = 0 (3 with no flu virus, 3 killed immediately after receiving flu virus)

normalized by the Welle lab using the PLIER normalization method; log-transformed

$H_0: x_i(t) = \text{baseline}, \quad t \ge 0$

Baseline 1: gene expression for DPI = 0, no flu virus

Baseline 2: gene expression for DPI = 0, immediately after receiving flu virus

(24)

Gene data from lungs of mice: Baseline 1

EDGE (F) 35497 (FDR=0.01)

EDGE (AUC) 2452 (FDR=0.05)

PCA (AUC) 7133 (FDR=0.05)

EDGE fails: oversmoothed

observe an increase in gene expression between DPI=0, no flu virus

(25)

Baseline 1: top 9 genes selected by PCA, not by EDGE (AUC)

[Figure: expression profiles, x-axis from 0 to 10, for Genes 15356, 35514, 116, 18635, 6126, 29919, 14636, 10336, and 5265.]

(26)

Baseline 1: top 9 genes selected by EDGE (AUC), not by PCA

[Figure: expression profiles, x-axis from 0 to 10, for Genes 3899, 35276, 3053, 33025, 17877, 13133, 8231, 1148, and 15335.]

(27)

Gene data from lungs of mice: Baseline 2

EDGE (F) 11592 (FDR=0.01)

EDGE (AUC) 142 (FDR=0.05)

PCA (AUC) 302 (FDR=0.05)

[Figure: two histograms of p-values on [0, 1]; one panel is labeled "p-values by PCA".]

(28)

Baseline 2: top 9 genes selected by PCA, not by EDGE (AUC)

[Figure: expression profiles, x-axis from 0 to 10, for Genes 14136, 20657, 711, 12360, 18639, 1755, 8438, 25610, and 2053.]

(29)

Baseline 2: top 9 genes selected by EDGE (AUC), not by PCA

[Figure: expression profiles, x-axis from 0 to 10, for Genes 32948, 13790, 35155, 25608, 14045, 20815, 16947, 26778, and 25488.]

(30)

Summary

Nonparametric longitudinal data analysis methods:

Individual nonparametric smoothing

Does not borrow information across subjects (curves) at all

Handles completely different curves for different subjects

FPCA with individual estimates of PC scores

Weakly borrows information across subjects via the PC basis estimate

PC basis: adaptive for some between-subject (curve) variation

FPCA-PACE

Borrows information across subjects via the mixed-effects PC score estimate

PC basis: adaptive for large between-subject (curve) variation

Nonparametric mixed-effects (NPME) modeling

Strongly borrows information across subjects (curves)

Suited to longitudinal data with similar patterns

(31)

References

Storey et al. (2005) Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences, 102, 12837-12842.

Wu, H. and Zhang, J.-T. (2006) Nonparametric regression methods for longitudinal data analysis: mixed-effects modeling approaches. John Wiley & Sons, New York.

Yao, F., Müller, H.-G., and Wang, J.-L. (2005) Functional linear regression analysis for longitudinal data. The Annals of Statistics, 33, 2873-2903.