Comparing Functional Data Analysis Approach and
Nonparametric Mixed-Effects Modeling Approach for
Longitudinal Data Analysis
Hulin Wu, PhD, Professor (with Dr. Shuang Wu)
Department of Biostatistics & Computational Biology, University of Rochester Medical Center
Email: [email protected]
Table of contents
1 Introduction
2 Comparisons: NPME vs. fPCA-PACE
3 Comparisons: Individual Smoothing vs. fPCA-Integration Method
Question to Address
Nonparametric longitudinal data analysis methods: Nonparametric mixed-effects models
Analysis of longitudinal studies
Parametric mixed-effects models (LME and NLME models), e.g.
$y_i = X_i\beta + Z_i b_i + \epsilon_i$,
$b_i \sim N(0, D)$, $\epsilon_i \sim N(0, R_i)$, $i = 1, 2, \ldots, n$
Parametric: restrictive
Nonparametric mixed-effects (NPME) model
$y_i(t) = \mu(t) + \nu_i(t) + \epsilon_i(t) = \sum_{j=1}^{p} \beta_j B_j(t) + \sum_{k=1}^{q} b_{ik} B_k^*(t) + \epsilon_i(t)$
Regression splines: various choices of basis functions, known
Mixed-effects modeling: borrow information across subjects (curves), shrink toward the mean
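The shrinkage idea can be illustrated with a minimal sketch. It assumes the fixed-effect coefficients, the random-effect covariance D, and the noise variance are already known (in practice they are estimated), and it uses a hypothetical linear truncated power basis with a made-up knot for both $B_j$ and $B_k^*$. None of these choices come from the slides.

```python
import numpy as np

def truncated_power_basis(t, knots):
    """Linear truncated power basis: 1, t, (t - kappa)_+ per knot (illustrative choice)."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t] + [np.clip(t - k, 0.0, None) for k in knots]
    return np.column_stack(cols)

def blup_random_effects(y_i, X_i, Z_i, beta, D, sigma2):
    """BLUP of b_i in y_i = X_i beta + Z_i b_i + eps_i, with eps_i ~ N(0, sigma2 * I)."""
    resid = y_i - X_i @ beta
    V_i = Z_i @ D @ Z_i.T + sigma2 * np.eye(len(y_i))
    return D @ Z_i.T @ np.linalg.solve(V_i, resid)

# Toy subject with 6 noisy observations (all numbers below are assumptions).
rng = np.random.default_rng(0)
t_i = np.sort(rng.uniform(0, 1, 6))
B_i = truncated_power_basis(t_i, knots=[0.5])   # same basis for fixed and random parts
beta = np.array([1.0, 2.0, -3.0])               # assumed population coefficients
y_i = B_i @ (beta + np.array([0.5, -0.4, 0.3])) + rng.normal(0, 0.5, 6)

# Large D: little shrinkage. Small D: the subject curve is pulled toward the mean.
b_weak = blup_random_effects(y_i, B_i, B_i, beta, D=100.0 * np.eye(3), sigma2=0.25)
b_strong = blup_random_effects(y_i, B_i, B_i, beta, D=0.01 * np.eye(3), sigma2=0.25)
print(np.linalg.norm(b_weak), np.linalg.norm(b_strong))
```

With a small between-subject covariance the estimated random effects have a smaller norm, which is the "shrink toward the mean" behaviour.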
Functional approach based on principal component analysis
$Y_{ij} = X_i(t_{ij}) + \epsilon_{ij} = \mu(t_{ij}) + \sum_{k=1}^{K} \xi_{ik}\phi_k(t_{ij}) + \epsilon_{ij}$
Mean function $\mu(t)$: any nonparametric smoothing method
Between-subject (curve) variation $\sum_{k=1}^{K} \xi_{ik}\phi_k(t_{ij})$: Karhunen-Loève approximation
Both PC scores ($\xi_{ik}$) and basis functions (eigenfunctions $\phi_k(t)$) need to be estimated from data
PC scores (coefficients): estimated by
PACE: mixed-effects modeling idea to borrow information across subjects (curves)
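As a rough illustration of how a mixed-effects style score estimate borrows information, the sketch below computes the conditional expectation of the PC scores given one subject's observations under a joint Gaussian assumption. The mean, eigenfunctions, eigenvalues, and error variance are treated as known here, and the cosine/sine eigenfunctions and all numeric values are hypothetical.

```python
import numpy as np

def pace_scores(y_i, mu_i, phi_i, lam, sigma2):
    """Conditional expectation E[xi_i | Y_i] under joint Gaussianity.

    y_i, mu_i: (n_i,) observations and mean at subject i's time points
    phi_i:     (n_i, K) eigenfunctions evaluated at those time points
    lam:       (K,) eigenvalues; sigma2: measurement-error variance
    """
    Lam = np.diag(lam)
    Sigma_Y = phi_i @ Lam @ phi_i.T + sigma2 * np.eye(len(y_i))
    return Lam @ phi_i.T @ np.linalg.solve(Sigma_Y, y_i - mu_i)

# Hypothetical setup: two orthonormal eigenfunctions on [0, 1].
t_i = np.linspace(0.02, 0.98, 25)
phi_i = np.sqrt(2) * np.column_stack([np.cos(2 * np.pi * t_i),
                                      np.sin(2 * np.pi * t_i)])
lam = np.array([4.0, 1.0])
xi_true = np.array([1.5, -0.7])
mu_i = np.ones_like(t_i)
y_i = mu_i + phi_i @ xi_true          # noise-free observations for the demo

xi_dense = pace_scores(y_i, mu_i, phi_i, lam, sigma2=1e-6)   # nearly no shrinkage
xi_shrunk = pace_scores(y_i, mu_i, phi_i, lam, sigma2=50.0)  # heavy shrinkage
print(xi_dense, xi_shrunk)
```

When the assumed error variance is large relative to the eigenvalues, the scores are shrunk toward zero, so the fitted curve moves toward the mean function.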
Simulation Comparisons: NPME and fPCA-PACE
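$y_i(t) = a_{i0} + a_{i1}\cos(2\pi t) + a_{i2}\sin(2\pi t) + \epsilon_i(t)$,
$a_i = [a_{i0}, a_{i1}, a_{i2}]^T \sim N[(1, 2, 1), \mathrm{diag}(\sigma_0^2, \sigma_1^2, \sigma_2^2)]$,
$\epsilon_i(t) \sim N[0, \sigma^2(1 + t)]$, $i = 1, 2, \ldots, n$
$t_j = j/(m + 1)$, $j = 1, 2, \ldots, m$
$n = 50$, $m = 20$
Unbalanced data: rmiss = 0.2, 0.5, 0.8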
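A minimal sketch of this simulation design follows. Two points are assumptions, not stated on the slide: the noise scale `sigma` (the value of $\sigma$ is not given) and the missingness mechanism, taken here as dropping each observation independently with probability rmiss.

```python
import numpy as np

def simulate_curves(n=50, m=20, sigmas=(2.0, 1.0, 1.0), sigma=0.5, rmiss=0.2, seed=0):
    """Simulate n noisy curves on the grid t_j = j / (m + 1), with random missingness."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, m + 1) / (m + 1)
    # Random coefficients a_i ~ N((1, 2, 1), diag(sigma_0^2, sigma_1^2, sigma_2^2)).
    a = rng.normal(loc=[1.0, 2.0, 1.0], scale=sigmas, size=(n, 3))
    basis = np.column_stack([np.ones(m), np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)])
    curves = a @ basis.T                                  # true curves, shape (n, m)
    # Heteroscedastic noise with variance sigma^2 * (1 + t).
    noise = rng.normal(size=(n, m)) * sigma * np.sqrt(1.0 + t)
    # Assumed mechanism: each point is missing independently with probability rmiss.
    observed = rng.random((n, m)) >= rmiss
    y = np.where(observed, curves + noise, np.nan)
    return t, y, curves, observed

t, y, truth, observed = simulate_curves(rmiss=0.5)
print(y.shape, observed.mean())
```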
$\mathrm{ISE} = \int (\hat{\mu}(t) - \mu(t))^2 \, dt$
$\mathrm{MISE} = \frac{1}{n} \sum_{i=1}^{n} \int (\hat{y}_i(t) - y_i(t))^2 \, dt$
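The two criteria can be approximated numerically on a grid. The sketch below uses the trapezoidal rule, which is an assumption about how the integrals might be evaluated; the grid and the offset "estimates" are hypothetical.

```python
import numpy as np

def _trapezoid(values, t):
    """Trapezoidal rule along the last axis."""
    return np.sum(0.5 * (values[..., 1:] + values[..., :-1]) * np.diff(t), axis=-1)

def ise(t, mu_hat, mu):
    """Integrated squared error of a mean-function estimate."""
    return float(_trapezoid((mu_hat - mu) ** 2, t))

def mise(t, y_hat, y):
    """Average over subjects (rows) of the integrated squared error of individual fits."""
    return float(np.mean(_trapezoid((y_hat - y) ** 2, t)))

t = np.linspace(0, 1, 101)
mu = 1 + 2 * np.cos(2 * np.pi * t) + np.sin(2 * np.pi * t)   # mean implied by E[a_i] = (1, 2, 1)
ise_value = ise(t, mu + 0.5, mu)                             # estimate off by a constant 0.5
y_true = np.vstack([mu, mu])
y_fit = np.vstack([mu + 0.5, mu + 1.0])
mise_value = mise(t, y_fit, y_true)
print(ise_value, mise_value)
```

A constant offset of 0.5 over the unit interval gives an ISE of about 0.25, which is a quick sanity check on the integration.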
Simulation I: small variation, $(\sigma_0, \sigma_1, \sigma_2) = (2, 1, 1)$
[Figure: simulated curves, $y_i$ against $t$]
rmiss   Model   Mean function      Individual fits
20%     LPME    0.1044 (0.1259)    0.3733 (0.0488)
        RSME    0.1053 (0.1218)    0.3733 (0.0488)
        PACE    0.1477 (0.1323)    0.3852 (0.1182)
50%     LPME    0.1069 (0.1016)    0.6158 (0.0813)
        RSME    0.1092 (0.0980)    0.6158 (0.0813)
        PACE    0.1577 (0.1280)    0.6932 (0.1285)
80%     LPME    0.2025 (0.1509)    1.5302 (0.2764)
        RSME    0.2131 (0.1414)    1.5302 (0.2764)
        PACE    0.251 (0.1894)     1.9874 (0.6914)
Simulation II: large variation, $(\sigma_0, \sigma_1, \sigma_2) = (3, 3, 3)$
[Figure: simulated curves, $y_i$ against $t$]
rmiss   Model   Mean function      Individual fits
20%     LPME    0.3124 (0.2775)    1.9526 (0.3421)
        RSME    0.3206 (0.2797)    1.9526 (0.3421)
        PACE    0.3639 (0.3041)    0.4511 (0.0647)
50%     LPME    0.3212 (0.2565)    3.6714 (0.6056)
        RSME    0.3297 (0.2407)    3.6714 (0.6056)
        PACE    0.3828 (0.2972)    1.1166 (0.2640)
80%     LPME    0.5516 (0.4300)    8.4689 (1.3465)
        RSME    0.6529 (0.4426)    8.4689 (1.3465)
        PACE    0.6416 (0.5362)    6.7611 (2.3501)
Mean function estimate winner: NPME model
Example 1: Viral load in AIDS clinical trials
[Figure: viral load against time (day)]
n = 46 patients; $n_i$ ranges from 4 to 10, with a median of 8.
Mean function estimates: RSME (blue), FPCA (red).
Viral load: individual fits
[Figure: nine panels of individual fits for Patients 3, 9, 13, 18, 22, 30, 32, 42, 46]
Example 2: Yeast cell cycle gene expressions
[Figure: gene expression against time (min)]
6075 genes, $t_j = 7(j - 1)$ (minutes), $j = 1, 2, \ldots, 18$.
Gene expressions are centered by the mean of each gene; the data contain missing values.
Yeast gene expressions: individual fits
[Figure: nine panels of individual fits for Genes 226, 1937, 1941, 3112, 3505, 4025, 4650, 5751, 5990]
Time-course microarray gene expressions
Independent sampling: one measurement from each subject, e.g. mice
Longitudinal sampling: repeated measurements from the same subject, e.g. human
Features of data:
number of genes n very large, usually several thousands
number of time points m small (m ≈ 10)
very few replications at each time point, usually 2 or 3
noisy, possibly with missing data
Time-course microarray gene expressions
Problem of interest: identify differentially expressed genes
One group: difference from baseline; variation over time
Two or more groups: difference between groups
Methods:
ANOVA approach: treat the time variable as a particular experimental factor (a direct extension of static microarray experiments)
Continuous approach: treat gene expressions as noisy measurements from an underlying function; nonparametric estimation of the underlying function (possibly with random effects)
Time-course microarray gene expressions
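$y_{ijk} = x_i(t_j) + \epsilon_{ijk}$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$; $k = 1, \ldots, K$
$x_i(t) = \sum_{l=0}^{L} \beta_{il}\phi_l(t)$, $\epsilon_{ijk} \sim (0, \sigma^2)$
$H_0: x_i(t) = 0$, $i = 1, \ldots, n$
$\phi_l(t)$: spline basis or PC basis
In real data, no clear cut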
Statistics that provide a good ranking
Multiple testing adjustment to control the error rate, e.g. False Discovery Rate (FDR)
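For illustration, one standard way to turn per-gene p-values into an FDR-controlled rejection set is the Benjamini-Hochberg step-up procedure. It is shown here as a generic example and is not necessarily the adjustment used for the results in these slides; the p-values are made up.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection mask from the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank whose p-value passes
        reject[order[:k + 1]] = True       # reject that gene and all with smaller p-values
    return reject

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
reject = benjamini_hochberg(pvals, alpha=0.05)
print(int(reject.sum()))
```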
Methods
Individual nonparametric smoothing (EDGE)
φl(t) as fixed basis
statistics: goodness-of-fit (F statistics); area under curve (AUC)
fPCA-integration method (individual estimate of PC scores)
$\phi_l(t)$ as eigenfunctions, estimated from the entire sample
statistics: area under curve (AUC)
Both use the bootstrap to calculate the null distribution of the statistics
Significance cut-off by controlling FDR
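A rough sketch of a bootstrap null distribution for an AUC-type statistic is given below. Several details are assumptions made only for illustration: the statistic is taken as the integral of the absolute fitted curve, the fixed basis is a small cosine/sine set, and the null data are generated by resampling centered residuals around zero. The procedures actually used for the slides may differ.

```python
import numpy as np

def auc_stat(t, fitted):
    """Area under |fitted curve| by the trapezoidal rule (assumed form of the AUC statistic)."""
    a = np.abs(fitted)
    return float(np.sum(0.5 * (a[1:] + a[:-1]) * np.diff(t)))

def bootstrap_pvalue(t, y, basis, n_boot=500, seed=0):
    """Residual-bootstrap p-value for H0: x(t) = 0, for one gene with y of shape (m, K)."""
    rng = np.random.default_rng(seed)
    m, K = y.shape
    X = np.repeat(basis, K, axis=0)            # design rows ordered like y.ravel()
    yy = y.ravel()
    coef, *_ = np.linalg.lstsq(X, yy, rcond=None)
    observed = auc_stat(t, basis @ coef)
    resid = yy - X @ coef
    resid = resid - resid.mean()
    null = np.empty(n_boot)
    for b in range(n_boot):
        y_star = rng.choice(resid, size=resid.size, replace=True)   # data under H0
        c_star, *_ = np.linalg.lstsq(X, y_star, rcond=None)
        null[b] = auc_stat(t, basis @ c_star)
    return (1 + np.sum(null >= observed)) / (n_boot + 1)

# Hypothetical gene with a clear signal: m = 10 time points, K = 3 replicates.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 10)
basis = np.column_stack([np.ones_like(t), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
y = 2 * np.sin(2 * np.pi * t)[:, None] + rng.normal(0, 0.5, size=(10, 3))
p_signal = bootstrap_pvalue(t, y, basis)
print(p_signal)
```

The resulting p-values, one per gene, would then be passed to an FDR procedure to choose the significance cut-off.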
Simulation study
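$n = 1000$, $m = 10$, $K = 3$
observations equidistant in [0, 1]
proportion of significant genes $p = 0.1$
Under $H_0$: $y_{ijk} = \epsilon_{ijk}$, $\epsilon_{ijk} \sim N(0, 0.5^2)$
Under $H_1$: $y_{ijk} = a_i \sin(2\omega_i\pi(t_j - b_i)) + \epsilon_{ijk}$, where $a_i, \omega_i \sim U(0.5, 2)$, $b_i \sim U(0, 1)$.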
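A minimal sketch of this data-generating design follows. Whether the equidistant grid includes the endpoints 0 and 1 is not stated, so `np.linspace(0, 1, m)` is an assumption, as is placing the significant genes first in the array.

```python
import numpy as np

def simulate_genes(n=1000, m=10, K=3, p=0.1, sd=0.5, seed=0):
    """Simulate replicated time-course expressions; a fraction p of genes follow H1."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, m)                     # assumed grid, endpoints included
    is_sig = np.zeros(n, dtype=bool)
    is_sig[: int(round(n * p))] = True           # significant genes listed first (assumption)
    a = rng.uniform(0.5, 2.0, n)
    omega = rng.uniform(0.5, 2.0, n)
    b = rng.uniform(0.0, 1.0, n)
    signal = a[:, None] * np.sin(2 * omega[:, None] * np.pi * (t[None, :] - b[:, None]))
    signal[~is_sig] = 0.0                        # H0 genes: pure noise
    y = signal[:, :, None] + rng.normal(0.0, sd, size=(n, m, K))
    return t, y, is_sig

t, y, is_sig = simulate_genes()
print(y.shape, int(is_sig.sum()))
```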
Simulation I
Error under $H_1$: $\epsilon_{ijk} \sim N(0, 0.5^2)$
EDGE        num rejected   corr rejected   FDR      FNR
FDR=0.05    91.44          87.35           0.0441   0.0139
FDR=0.1     100.21         90.94           0.0909   0.0100
FDR=0.2     116.52         93.98           0.1912   0.0068
PCA         num rejected   corr rejected   FDR      FNR
FDR=0.05    96.45          94.25           0.0225   0.0064
FDR=0.1     102.54         96.12           0.0619   0.0043
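The realized FDR and FNR columns can be related to the rejection counts with a short calculation. The sketch below assumes 100 truly significant genes out of 1000 (from n = 1000 and p = 0.1) and uses the usual definitions: false rejections over all rejections, and missed genes over all non-rejections. The non-integer counts suggest the table reports averages over simulation runs, so a ratio of averaged counts need not match the tabulated rates exactly.

```python
def error_rates(num_rejected, num_correct, n_genes=1000, n_true=100):
    """Realized FDR and FNR from rejection counts."""
    false_rejections = num_rejected - num_correct
    missed = n_true - num_correct
    not_rejected = n_genes - num_rejected
    fdr = false_rejections / num_rejected if num_rejected > 0 else 0.0
    fnr = missed / not_rejected if not_rejected > 0 else 0.0
    return fdr, fnr

# EDGE row at nominal FDR = 0.05 in Simulation I.
fdr, fnr = error_rates(91.44, 87.35)
print(round(fdr, 4), round(fnr, 4))
```

For this row the calculation gives roughly 0.045 and 0.014, close to the tabulated 0.0441 and 0.0139.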
Simulation II
Error under $H_1$: $\epsilon_{ijk} \sim N(0, (0.5 v_i)^2)$, where $v_i$ is a dispersion factor
EDGE        num rejected   corr rejected   FDR      FNR
FDR=0.05    66.22          63.23           0.0442   0.0393
FDR=0.1     81.99          74.41           0.0905   0.0278
FDR=0.2     104.41         84.51           0.1875   0.0172
PCA         num rejected   corr rejected   FDR      FNR
FDR=0.05    80.08          80.08           0        0.0216
FDR=0.1     86.55          86.44           0.0013   0.0148
Gene data from lungs of mice
number of probes: n= 35557
days post infection (DPI): 0,1, ...,10 (m= 11)
repetition: 3 for DPI= 1, ...,10, 6 for DPI=0 (3 no flu virus, 3 killed immediately after receiving flu virus)
normalized by Welle lab using the PLIER normalization method; log-transformation
$H_0: x_i(t) = \text{baseline}$, $t \ge 0$
Baseline 1: gene expression for DPI=0, no flu virus
Baseline 2: gene expression for DPI=0, immediately after receiving flu virus
Gene data from lungs of mice: Baseline 1
EDGE (F) 35497 (FDR=0.01)
EDGE (AUC) 2452 (FDR=0.05)
PCA (AUC) 7133 (FDR=0.05)
EDGE fails: oversmoothed
observe an increase in gene expression between DPI=0, no flu virus
Baseline 1: top 9 genes selected by PCA, not by EDGE (AUC)
[Figure: nine panels for Genes 15356, 35514, 116, 18635, 6126, 29919, 14636, 10336, 5265]
Baseline 1: top 9 genes selected by EDGE (AUC), not by PCA
[Figure: nine panels for Genes 3899, 35276, 3053, 33025, 17877, 13133, 8231, 1148, 15335]
Gene data from lungs of mice: Baseline 2
EDGE (F) 11592 (FDR=0.01)
EDGE (AUC) 142 (FDR=0.05)
PCA (AUC) 302 (FDR=0.05)
[Figure: two histograms of p-values; one axis labeled "p-values by PCA"]
Baseline 2: top 9 genes selected by PCA, not by EDGE (AUC)
[Figure: nine panels for Genes 14136, 20657, 711, 12360, 18639, 1755, 8438, 25610, 2053]
Baseline 2: top 9 genes selected by EDGE (AUC), not by PCA
[Figure: nine panels for Genes 32948, 13790, 35155, 25608, 14045, 20815, 16947, 26778, 25488]
Summary
Nonparametric longitudinal data analysis methods:
Individual nonparametric smoothing
  Does not borrow information across subjects (curves) at all
  Deals with completely different curves for different subjects
FPCA with individual estimates of PC scores
  Weakly borrows information across subjects via the PC basis estimate
  PC basis: adaptive to some between-subject (curve) variation
FPCA-PACE
  Borrows information across subjects via the mixed-effects PC score estimate
  PC basis: adaptive to large between-subject (curve) variation
Nonparametric mixed-effects (NPME) modeling
  Strongly borrows information across subjects (curves)
  Deals with longitudinal data with similar patterns
References
Storey et al. (2005) Significance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences, 102, 12837-12842.
Wu, H. and Zhang, J.-T. (2006) Nonparametric regression methods for longitudinal data analysis: mixed-effects modeling approaches. John Wiley & Sons, New York.
Yao, F., Müller, H.-G., and Wang, J.-L. (2005) Functional linear regression analysis for longitudinal data. The Annals of Statistics, 33, 2873-2903.