First International Workshop on Functional and Operatorial Statistics. Toulouse, June 19-21, 2008
Functional Principal Components Analysis with Survey Data
Hervé CARDOT, Mohamed CHAOUCH(∗), Camelia GOGA & Catherine LABRUÈRE
Institut de Mathématiques de Bourgogne, Université de Bourgogne, 9 Avenue Alain Savary, BP 47870, 21078 DIJON Cedex, FRANCE.
email :{herve.cardot, mohamed.chaouch, camelia.goga, catherine.labruere}@u-bourgogne.fr
Abstract
This work aims at performing Functional Principal Components Analysis (FPCA) with Horvitz-Thompson estimators when the curves are collected with survey sampling techniques. Linearization approaches based on the influence function allow us to derive estimators of the asymptotic variance of the eigenelements of the FPCA. The method is illustrated with simulations which confirm the good properties of the linearization technique.
1. Introduction
Functional Data Analysis, whose main purpose is to provide tools for describing and modeling sets of curves, is a topic of growing interest in the statistical community. The books by Ramsay and Silverman (2002, 2005) give an interesting description of the available procedures dealing with functional observations. These functional approaches have proved useful in various domains such as chemometrics, economics, climatology, biology or remote sensing.
The statistician generally wants, in a first step, to represent a set of random curves as well as possible in a low-dimensional space in order to get a description of the functional data that allows interpretation. Functional principal components analysis (FPCA) yields a low-dimensional space which captures the main modes of variability of the data (see Ramsay and Silverman, 2002 for more details).
The way the data are collected is seldom taken into account in the literature, and one generally supposes the data are independent realizations of a common functional distribution. However, there are some cases for which this assumption is not fulfilled, for example when the realizations result from a sampling scheme. For instance, Dessertaine (2006) considers the estimation, with time series procedures, of a global demand for electricity at fine time scales from the observation of individual electricity consumption curves. More generally, data streams are now produced automatically by large numbers of distributed sensors, which generate huge amounts of data that can be seen as functional. The use of sampling techniques to collect them, proposed for instance in Chiky and Hébrail (2007), seems to be a relevant approach in such a framework, allowing a trade-off between storage capacities and accuracy of the data.
We propose in this work to give estimators of the functional principal components analysis when the curves are collected with survey sampling strategies. Let us note that Skinner et al. (1986) have studied some properties of multivariate PCA in a survey framework. The functional framework is different since the eigenfunctions which exhibit the main modes of variability of the data are also functions and can be naturally interpreted as modes of variability varying along time. In this new functional framework, we estimate the mean function and the covariance operator using the Horvitz-Thompson estimator. The eigenelements are estimated by diagonalization of the estimated covariance operator. In order to calculate and estimate the variance of the so-constructed estimators, we use the influence function linearization method introduced by Deville (1999).
This paper is organized as follows: Section 2 presents the functional principal components analysis in the setting of finite populations and then defines the Horvitz-Thompson estimator in the new functional framework. The generality of the influence function allows us to extend, in Section 3, the estimators proposed by Deville to our functional objects and to get asymptotic variances with the help of perturbation theory (Kato, 1966). Section 4 proposes a simulation study which shows the good behavior of our estimators for various sampling schemes as well as good approximations to their theoretical variances.
2. FPCA and sampling
2.1 FPCA in a finite population setting
Let us consider a finite population $U = \{1, \ldots, k, \ldots, N\}$ with size $N$, not necessarily known, and a functional variable $Y$ defined for each element $k$ of the population $U$: $Y_k = (Y_k(t))_{t \in [0,1]}$ belongs to the separable Hilbert space $L^2[0,1]$ of square integrable functions defined on the closed interval $[0,1]$, equipped with the usual inner product $\langle \cdot, \cdot \rangle$ and the norm $\|\cdot\|$. The mean function $\mu \in L^2[0,1]$ is defined by
$$\mu(t) = \frac{1}{N} \sum_{k \in U} Y_k(t), \quad t \in [0,1], \tag{1}$$
and the covariance operator $\Gamma$ by
$$\Gamma = \frac{1}{N} \sum_{k \in U} (Y_k - \mu) \otimes (Y_k - \mu), \tag{2}$$
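where the tensor product of two elements $a$ and $b$ of $L^2[0,1]$ is the rank one operator such that $a \otimes b(u) = \langle a, u \rangle b$ for all $u$ in $L^2[0,1]$. The operator $\Gamma$ is symmetric and non negative ($\langle \Gamma u, u \rangle \geq 0$). Its eigenvalues, sorted in decreasing order, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_N \geq 0$, satisfy
$$\Gamma v_j(t) = \lambda_j v_j(t), \quad t \in [0,1], \tag{3}$$
where the eigenfunctions $v_j$ form an orthonormal system in $L^2[0,1]$, i.e. $\langle v_j, v_{j'} \rangle = 1$ if $j = j'$ and zero otherwise.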
We can now get an expansion similar to the Karhunen-Loève expansion, or FPCA, which gives the best approximation of the curves of the population in a finite dimensional space of dimension $q$:
$$Y_k(t) \approx \mu(t) + \sum_{j=1}^{q} \langle Y_k - \mu, v_j \rangle v_j(t), \quad t \in [0,1].$$
The eigenfunctions $v_j$ indicate the main modes of variation along time $t$ of the data around the mean $\mu$, and the explained variance of the projection onto each $v_j$ is given by the eigenvalue $\lambda_j = \frac{1}{N} \sum_{k \in U} \langle Y_k - \mu, v_j \rangle^2$.
We aim at estimating the mean function $\mu$ and the covariance operator $\Gamma$ in order to deduce estimators of the eigenelements $(\lambda_j, v_j)$ when the data are obtained with survey sampling procedures.
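The finite-population FPCA described above can be sketched numerically once the curves are discretized. The following Python snippet is only an illustration under stated assumptions: the curves are observed on a common equispaced grid of $p$ points, the $L^2[0,1]$ inner product is approximated by a Riemann sum with weight $1/p$, and the function name `population_fpca` and the simulated random-walk curves are hypothetical, not taken from the paper.

```python
import numpy as np

def population_fpca(Y, q):
    """FPCA of N curves observed on a common equispaced grid of [0, 1].

    Y has shape (N, p); row k holds Y_k(t_1), ..., Y_k(t_p).
    Returns the mean curve, the q largest eigenvalues, the matching
    eigenfunctions (rows, unit L2 norm) and the scores <Y_k - mu, v_j>.
    """
    N, p = Y.shape
    dt = 1.0 / p                              # quadrature weight (assumption)
    mu = Y.mean(axis=0)                       # mean function, eq. (1)
    Yc = Y - mu
    cov = Yc.T @ Yc / N                       # kernel of Gamma on the grid, eq. (2)
    evals, evecs = np.linalg.eigh(cov * dt)   # discretized operator, eq. (3)
    order = np.argsort(evals)[::-1][:q]       # decreasing eigenvalues
    lam = evals[order]
    v = evecs[:, order].T / np.sqrt(dt)       # rescale to unit L2[0,1] norm
    scores = Yc @ v.T * dt                    # <Y_k - mu, v_j>
    return mu, lam, v, scores

# Hypothetical population: 500 random-walk curves on 100 grid points.
rng = np.random.default_rng(0)
p = 100
Y = np.cumsum(rng.normal(scale=np.sqrt(1.0 / p), size=(500, p)), axis=1)
mu, lam, v, scores = population_fpca(Y, q=3)
approx = mu + scores @ v                      # rank-q approximation of each curve
```

With this discretization, the mean of the squared scores along each component reproduces the corresponding eigenvalue, which mirrors the explained-variance identity stated above.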
2.2 The Horvitz-Thompson estimator
We consider a sample $s$ of $n$ individuals, i.e. a subset $s \subset U$, selected according to a probabilistic procedure $p(s)$, where $p$ is a probability distribution on the set of $2^N$ subsets of $U$. We denote by $\pi_k = \Pr(k \in s)$ for all $k \in U$ the first order inclusion probabilities and by $\pi_{kl} = \Pr(k \,\&\, l \in s)$ for all $k, l \in U$, with $\pi_{kk} = \pi_k$, the second order inclusion probabilities. We suppose that $\pi_k > 0$ and $\pi_{kl} > 0$. We also suppose that $\pi_k$ and $\pi_{kl}$ do not depend on $t \in [0,1]$.
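We propose to estimate the mean function $\mu$ and the covariance operator $\Gamma$ by replacing each total with the corresponding Horvitz-Thompson (HT) estimator (Horvitz and Thompson, 1952). We obtain
$$\hat{\mu} = \frac{1}{\widehat{N}} \sum_{k \in s} \frac{Y_k}{\pi_k}, \tag{4}$$
$$\widehat{\Gamma} = \frac{1}{\widehat{N}} \sum_{k \in s} \frac{Y_k \otimes Y_k}{\pi_k} - \hat{\mu} \otimes \hat{\mu}, \tag{5}$$
where the size $N$ of the population is estimated by $\widehat{N} = \sum_{k \in s} \frac{1}{\pi_k}$ when it is not known.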
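Estimators of the eigenfunctions and eigenvalues, $\hat{v}_j$ and $\hat{\lambda}_j$, are obtained readily by diagonalisation (or spectral analysis) of the estimated covariance operator $\widehat{\Gamma}$. Let us note that the eigenelements of the covariance operator are not linear functions.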
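As an illustration of the estimators of this section, the sketch below computes the HT mean, the HT covariance and their eigenelements from a sample of discretized curves. It is a minimal sketch, not the authors' implementation: the function name `ht_fpca`, the grid-based quadrature (weight $1/p$) and the simulated population drawn by simple random sampling without replacement are all assumptions made for the example.

```python
import numpy as np

def ht_fpca(Ys, pik, q, N=None):
    """Horvitz-Thompson FPCA from sampled curves.

    Ys  : (n, p) array of sampled curves on an equispaced grid of [0, 1]
    pik : (n,) first-order inclusion probabilities of the sampled units
    N   : population size; estimated by sum(1 / pik) when None
    """
    n, p = Ys.shape
    dt = 1.0 / p                                   # quadrature weight (assumption)
    w = 1.0 / pik                                  # HT weights
    N_hat = w.sum() if N is None else float(N)
    mu_hat = w @ Ys / N_hat                        # eq. (4)
    cov_hat = (Ys * w[:, None]).T @ Ys / N_hat - np.outer(mu_hat, mu_hat)  # eq. (5)
    evals, evecs = np.linalg.eigh(cov_hat * dt)    # diagonalisation
    order = np.argsort(evals)[::-1][:q]
    lam_hat = evals[order]
    v_hat = evecs[:, order].T / np.sqrt(dt)        # unit L2[0,1] norm
    return mu_hat, lam_hat, v_hat, N_hat

# Hypothetical example: equal-probability sample of n curves out of N.
rng = np.random.default_rng(0)
N, p, n = 2000, 100, 200
Y = np.cumsum(rng.normal(scale=np.sqrt(1.0 / p), size=(N, p)), axis=1)
s = rng.choice(N, size=n, replace=False)
pik = np.full(n, n / N)                            # pi_k = n / N for every unit
mu_hat, lam_hat, v_hat, N_hat = ht_fpca(Y[s], pik, q=2)
```

When every unit is observed with $\pi_k = 1$, the function returns the population mean and the population eigenvalues, which is a convenient sanity check.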
3. Linearization by influence function
We would like to calculate and estimate the variance of $\hat{\mu}$, $\hat{v}_j$ and $\hat{\lambda}_j$. The nonlinearity of these estimators and the functional nature of $Y$ make the variance estimation issue difficult. For this reason, we adapt the influence function linearization technique introduced by Deville (1999) to the functional framework.
Let us consider the discrete measure $M$ defined on $L^2[0,1]$ as follows: $M = \sum_{U} \delta_{Y_k}$, where $\delta_{Y_k}$ is the Dirac function taking value 1 if $Y = Y_k$ and zero otherwise. Let us suppose that each parameter of interest can be written as a functional $T$ of $M$. For example, $N(M) = \int dM$, $\mu(M) = \int Y \, dM / \int dM$ and $\Gamma(M) = \int (Y - \mu(M)) \otimes (Y - \mu(M)) \, dM / \int dM$. The eigenelements given by (??) are implicit functionals $T$ of $M$. The measure $M$ is estimated by the random measure $\widehat{M}$ defined as follows: $\widehat{M} = \sum_{U} \delta_{Y_k} \frac{I_k}{\pi_k}$ with $I_k = \mathbf{1}_{\{k \in s\}}$. Then the estimators given by (??) and (??) are obtained by substitution of $M$ by $\widehat{M}$, namely they are written as functionals $T$ of $\widehat{M}$.
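As a short worked check of this substitution principle (a sketch assuming that $N$ is unknown and estimated by $\widehat{N}$), plugging $\widehat{M}$ into the functionals $N(\cdot)$, $\mu(\cdot)$ and $\Gamma(\cdot)$ gives back the HT estimators (4) and (5):

```latex
\begin{align*}
N(\widehat{M}) &= \int d\widehat{M}
  = \sum_{k \in U} \frac{I_k}{\pi_k}
  = \sum_{k \in s} \frac{1}{\pi_k} = \widehat{N}, \\
\mu(\widehat{M}) &= \frac{\int Y \, d\widehat{M}}{\int d\widehat{M}}
  = \frac{1}{\widehat{N}} \sum_{k \in s} \frac{Y_k}{\pi_k} = \hat{\mu}, \\
\Gamma(\widehat{M}) &= \frac{1}{\widehat{N}} \sum_{k \in s}
  \frac{(Y_k - \hat{\mu}) \otimes (Y_k - \hat{\mu})}{\pi_k}
  = \frac{1}{\widehat{N}} \sum_{k \in s} \frac{Y_k \otimes Y_k}{\pi_k}
  - \hat{\mu} \otimes \hat{\mu} = \widehat{\Gamma}.
\end{align*}
```

The last equality expands the tensor product and uses $\sum_{k \in s} Y_k / \pi_k = \widehat{N} \hat{\mu}$ together with $\sum_{k \in s} 1/\pi_k = \widehat{N}$.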
3.1 Asymptotic Properties
We give in this section the asymptotic properties of our estimators. To do so, one needs the population and sample sizes to tend to infinity. We use the asymptotic framework introduced by Isaki & Fuller (1982). Let us make the following assumptions:
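(A1) $\sup_{k \in U} \|Y_k\| \leq C < \infty$,
(A2) $\lim_{N \to \infty} \frac{n}{N} = \pi \in (0,1)$,
(A3) $\min_{k \in U_N} \pi_k \geq \lambda > 0$, $\min_{k \neq l} \pi_{kl} \geq \lambda^* > 0$ and $\lim_{N \to \infty} n \max_{k \neq l} |\pi_{kl} - \pi_k \pi_l| < \infty$,
where $\lambda$ and $\lambda^*$ are two positive constants. We also suppose that the functional $T$ giving the parameter of interest is a homogeneous functional of degree $\alpha$, namely $T(rM) = r^{\alpha} T(M)$, and that $\lim_{N \to \infty} N^{-\alpha} T(M) < \infty$. For example, $\mu$ and $\Gamma$ are functionals of degree zero with respect to $M$. Let us note that the eigenelements of $\Gamma$ are also functionals of degree zero with respect to $M$.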
Let us also introduce the Hilbert-Schmidt norm, denoted by $\|\cdot\|_2$, for operators mapping $L^2[0,1]$ to $L^2[0,1]$.
We show in the next proposition that our estimators are asymptotically design unbiased, $\lim_{N \to \infty} \left( E_p(T(\widehat{M})) - T(M) \right) = 0$, and consistent, namely for any fixed $\varepsilon > 0$ we have $\lim_{N \to \infty} P(|T(\widehat{M}) - T(M)| > \varepsilon) = 0$. Here, $E_p(\cdot)$ is the expectation with respect to $p(s)$.
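Proposition 1 Under hypotheses (A1), (A2) and (A3),
$$E_p \|\mu - \hat{\mu}\|^2 = O(n^{-1}), \qquad E_p \|\Gamma - \widehat{\Gamma}\|_2^2 = O(n^{-1}).$$
If we suppose that the non null eigenvalues are distinct, we also have
$$E_p \left( \sup_j |\lambda_j - \hat{\lambda}_j|^2 \right) = O(n^{-1}), \qquad E_p \|v_j - \hat{v}_j\|^2 = O(n^{-1}) \text{ for each fixed } j.$$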
3.2 Variance approximation and estimation
Let us define, when it exists, the influence function of a functional $T$ at point $Y \in L^2[0,1]$, say $IT(M, Y)$, as follows:
$$IT(M, Y) = \lim_{h \to 0} \frac{T(M + h\delta_Y) - T(M)}{h},$$
where $\delta_Y$ is the Dirac function at $Y$.
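Proposition 2 Under assumption (A1), the influence functions of $\mu$ and $\Gamma$ exist and $I\mu(M, Y_k) = (Y_k - \mu)/N$ and $I\Gamma(M, Y_k) = \frac{1}{N} \left( (Y_k - \mu) \otimes (Y_k - \mu) - \Gamma \right)$. If the non null eigenvalues of $\Gamma$ are distinct then
$$I\lambda_j(M, Y_k) = \frac{1}{N} \left( \langle Y_k - \mu, v_j \rangle^2 - \lambda_j \right),$$
$$Iv_j(M, Y_k) = \frac{1}{N} \sum_{\ell \neq j} \frac{\langle Y_k - \mu, v_j \rangle \langle Y_k - \mu, v_\ell \rangle}{\lambda_j - \lambda_\ell} \, v_\ell.$$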
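The closed form for the influence function of an eigenvalue can be checked numerically on discretized curves. The sketch below compares a finite-difference approximation of $(T(M + h\delta_{Y_k}) - T(M))/h$, obtained by adding a small mass $h$ to the atom $Y_k$, with the expression of Proposition 2. The helper names `eigen_functional` and `influence_lambda`, the grid quadrature and the simulated curves are assumptions made for this illustration only.

```python
import numpy as np

def eigen_functional(Y, w, j=0):
    """j-th largest eigenvalue of Gamma(M) for M = sum_k w_k * delta_{Y_k}."""
    p = Y.shape[1]
    dt = 1.0 / p                                   # quadrature weight (assumption)
    mass = w.sum()                                 # N(M)
    mu = w @ Y / mass                              # mu(M)
    Yc = Y - mu
    cov = (Yc * w[:, None]).T @ Yc / mass          # Gamma(M) on the grid
    return np.linalg.eigvalsh(cov * dt)[::-1][j]   # eigvalsh sorts ascending

def influence_lambda(Y, k, j=0):
    """Closed form of Proposition 2: (<Y_k - mu, v_j>^2 - lambda_j) / N."""
    N, p = Y.shape
    dt = 1.0 / p
    Yc = Y - Y.mean(axis=0)
    evals, evecs = np.linalg.eigh(Yc.T @ Yc / N * dt)
    lam = evals[::-1][j]
    v = evecs[:, ::-1][:, j] / np.sqrt(dt)         # unit L2[0,1] norm
    score = Yc[k] @ v * dt                         # <Y_k - mu, v_j>
    return (score ** 2 - lam) / N

# Hypothetical population of random-walk curves.
rng = np.random.default_rng(1)
N, p = 200, 50
Y = np.cumsum(rng.normal(scale=np.sqrt(1.0 / p), size=(N, p)), axis=1)

k, h = 7, 1e-5
w = np.ones(N)                                     # M puts mass 1 on each Y_k
w_h = w.copy()
w_h[k] += h                                        # M + h * delta_{Y_k}
finite_diff = (eigen_functional(Y, w_h) - eigen_functional(Y, w)) / h
closed_form = influence_lambda(Y, k)
print(finite_diff, closed_form)
```

The two printed values should agree up to the discretization error of the finite difference, provided the chosen eigenvalue is simple.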
In order to obtain the asymptotic variance of $T(\widehat{M})$ for $T$ given by (??), (??) and (??), we write the first-order von Mises expansion of our functional in $\widehat{M}/N$ "near" $M/N$ and use the fact that $T$ is of degree 0 and $IT(M/N, Y_k) = N \cdot IT(M, Y_k)$:
$$T(\widehat{M}) = T(M) + \sum_{k \in U} IT(M, Y_k) \left( \frac{I_k}{\pi_k} - 1 \right) + R_T\left( \frac{\widehat{M}}{N}, \frac{M}{N} \right).$$
Proposition 3 Suppose the hypotheses (A1), (A2) and (A3) are fulfilled. Consider the functional $T$ giving the parameters of interest defined in (??), (??), (??). We suppose that the non null eigenvalues are distinct. Then $R_T\left( \frac{\widehat{M}}{N}, \frac{M}{N} \right) = o_p(n^{-1/2})$ and the asymptotic variance of $T(\widehat{M})$ is equal to
$$V_p\left[ \sum_{k \in s} IT(M, Y_k) \frac{I_k}{\pi_k} \right] = \sum_{U} \sum_{U} (\pi_{kl} - \pi_k \pi_l) \frac{IT(M, Y_k)}{\pi_k} \frac{IT(M, Y_l)}{\pi_l}.$$
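One can remark that the asymptotic variance given by the above result is not known. We propose to estimate it by the HT variance estimator with $IT(M, Y_k)$ replaced by its HT estimator. We obtain
$$\widehat{V}_p(\hat{\mu}) = \frac{1}{\widehat{N}^2} \sum_{k \in s} \sum_{\ell \in s} \frac{1}{\pi_{k\ell}} \frac{\Delta_{k\ell}}{\pi_k \pi_\ell} (Y_k - \hat{\mu}) \otimes (Y_\ell - \hat{\mu}),$$
$$\widehat{V}_p(\hat{\lambda}_j) = \frac{1}{\widehat{N}^2} \sum_{k \in s} \sum_{\ell \in s} \frac{1}{\pi_{k\ell}} \frac{\Delta_{k\ell}}{\pi_k \pi_\ell} \left( \langle Y_k - \hat{\mu}, \hat{v}_j \rangle^2 - \hat{\lambda}_j \right) \left( \langle Y_\ell - \hat{\mu}, \hat{v}_j \rangle^2 - \hat{\lambda}_j \right),$$
$$\widehat{V}_p(\hat{v}_j) = \sum_{k \in s} \sum_{\ell \in s} \frac{1}{\pi_{k\ell}} \frac{\Delta_{k\ell}}{\pi_k \pi_\ell} \, \widehat{Iv_j}(M, Y_k) \otimes \widehat{Iv_j}(M, Y_\ell),$$
where $\Delta_{k\ell} = \pi_{k\ell} - \pi_k \pi_\ell$ and $\widehat{Iv_j}(M, Y_k) = \frac{1}{\widehat{N}} \sum_{\ell \neq j} \frac{\langle Y_k - \hat{\mu}, \hat{v}_j \rangle \langle Y_k - \hat{\mu}, \hat{v}_\ell \rangle}{\hat{\lambda}_j - \hat{\lambda}_\ell} \, \hat{v}_\ell$. Cardot et al. (2007) show that under the assumptions (A1)-(A3), these estimators are asymptotically design unbiased and consistent.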
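To make the variance estimator of an eigenvalue concrete, the following sketch evaluates the double sum for $\hat{\lambda}_1$ and compares it, under simple random sampling without replacement, with the classical closed form $(1 - n/N)\, S_u^2 / n$, where $S_u^2$ is the sample variance of the linearized variable $u_k = \langle Y_k - \hat{\mu}, \hat{v}_1 \rangle^2 - \hat{\lambda}_1$. The function names, the discretization and the simulated data are assumptions of this example; the equal-probability design is chosen only because its second-order inclusion probabilities are known explicitly.

```python
import numpy as np

def linearized_lambda(Ys, pik, j=0):
    """Return lambda_hat_j, u_k = <Y_k - mu_hat, v_hat_j>^2 - lambda_hat_j, N_hat."""
    n, p = Ys.shape
    dt = 1.0 / p                                   # quadrature weight (assumption)
    w = 1.0 / pik
    N_hat = w.sum()
    mu_hat = w @ Ys / N_hat
    cov_hat = (Ys * w[:, None]).T @ Ys / N_hat - np.outer(mu_hat, mu_hat)
    evals, evecs = np.linalg.eigh(cov_hat * dt)
    lam_hat = evals[::-1][j]
    v_hat = evecs[:, ::-1][:, j] / np.sqrt(dt)
    scores = (Ys - mu_hat) @ v_hat * dt
    return lam_hat, scores ** 2 - lam_hat, N_hat

def ht_variance(u, pik, pikl, N_hat):
    """HT variance estimator: double sum over the sample with Delta_kl weights."""
    delta = pikl - np.outer(pik, pik)              # Delta_kl
    coef = delta / (pikl * np.outer(pik, pik))     # Delta_kl / (pi_kl pi_k pi_l)
    return u @ coef @ u / N_hat ** 2

# Hypothetical example: sample of n curves drawn without replacement from N.
rng = np.random.default_rng(3)
N, p, n = 5000, 100, 100
Y = np.cumsum(rng.normal(scale=np.sqrt(1.0 / p), size=(N, p)), axis=1)
s = rng.choice(N, size=n, replace=False)
pik = np.full(n, n / N)
pikl = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pikl, n / N)                      # pi_kk = pi_k

lam_hat, u, N_hat = linearized_lambda(Y[s], pik)
var_hat = ht_variance(u, pik, pikl, N_hat)
srs_formula = (1 - n / N) / n * u.var(ddof=1)      # closed form for this design
print(var_hat, srs_formula)
```

For other designs only `pik` and `pikl` change; the double sum itself is design-agnostic as long as all $\pi_{k\ell}$ are positive.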
4. A Simulation study
In our simulations all functional variables are discretized at $p = 100$ equispaced points in the interval $[0,1]$. We consider a random variable $Y$ distributed as Brownian motion on $[0,1]$. We make $N = 10000$ replications of $Y$ and then construct two strata $U_1$ and $U_2$ with different variances and with sizes $N_1 = 7000$ and $N_2 = 3000$. Our population $U$ is the union of the two strata. Then we estimate the eigenelements of the covariance operator for two different sampling designs (Simple Random Sampling Without Replacement (SRSWR) and stratified) and two different sample sizes, $n = 100$ and $n = 1000$. To evaluate our estimation procedures we make 500 replications of the previous experiment. Estimation errors for the first eigenvalue and the first eigenvector are then evaluated with the loss criteria $\frac{|\lambda_1 - \hat{\lambda}_1|}{\lambda_1}$ and $\frac{\|v_1 - \hat{v}_1\|}{\|v_1\|}$, where $\|\cdot\|$ is the Euclidean norm. Linear approximation by influence function gives a reasonable estimate of the variance for small samples and accurate estimates as soon as $n$ is large enough ($n = 1000$). We also note that the variance of the estimators obtained with stratified sampling turns out to be smaller than that obtained with SRSWR.
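A reduced version of this experiment can be sketched as follows. Several ingredients are assumptions, since the text does not specify them: the standard deviations of the two strata (1 and 2), proportional allocation for the stratified design, the sign alignment of the estimated eigenvector, and a smaller number of replications (200 instead of 500) to keep the run short. The code only illustrates the mechanics and does not reproduce the reported results.

```python
import numpy as np

rng = np.random.default_rng(2008)
p, N1, N2 = 100, 7000, 3000
N, dt = N1 + N2, 1.0 / p

def brownian(size, sd):
    """Discretized Brownian-type paths on [0, 1] with scale sd (assumption)."""
    return np.cumsum(rng.normal(scale=sd * np.sqrt(dt), size=(size, p)), axis=1)

# Population: union of two strata with different variances.
Y = np.vstack([brownian(N1, 1.0), brownian(N2, 2.0)])

def first_eigen(Ys, w):
    """First eigenvalue and eigenvector of the weighted (HT) covariance."""
    W = w.sum()
    mu = w @ Ys / W
    cov = (Ys * w[:, None]).T @ Ys / W - np.outer(mu, mu)
    evals, evecs = np.linalg.eigh(cov * dt)
    return evals[-1], evecs[:, -1] / np.sqrt(dt)

lam1, v1 = first_eigen(Y, np.ones(N))              # population values

def draw_srs(n):
    s = rng.choice(N, size=n, replace=False)
    return s, np.full(n, N / n)                    # weights 1 / pi_k

def draw_stratified(n):
    n1 = round(n * N1 / N)                         # proportional allocation (assumption)
    n2 = n - n1
    s1 = rng.choice(N1, size=n1, replace=False)
    s2 = N1 + rng.choice(N2, size=n2, replace=False)
    w = np.concatenate([np.full(n1, N1 / n1), np.full(n2, N2 / n2)])
    return np.concatenate([s1, s2]), w

def losses(draw, n, reps):
    """Relative errors on lambda_1 and v_1 over repeated samples."""
    out = np.empty((reps, 2))
    for r in range(reps):
        s, w = draw(n)
        lam_hat, v_hat = first_eigen(Y[s], w)
        v_hat = v_hat * np.sign(v_hat @ v1)        # eigenvectors are defined up to sign
        out[r] = (abs(lam1 - lam_hat) / lam1,
                  np.linalg.norm(v1 - v_hat) / np.linalg.norm(v1))
    return out

reps = 200
err_srs = losses(draw_srs, 100, reps)
err_str = losses(draw_stratified, 100, reps)
print("SRS mean losses       :", err_srs.mean(axis=0))
print("stratified mean losses:", err_str.mean(axis=0))
```

Changing the sample size passed to `losses` from 100 to 1000 gives the second setting considered in the study.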
References
Cardot, H., Chaouch, M., Goga, C. and Labruère, C. (2007). Functional Principal Components Analysis with Survey Data. Preprint.
Chiky, R. and Hébrail, G. (2007). Generic tool for summarizing distributed data streams. Preprint.
Dauxois, J., Pousse, A. and Romain, Y. (1982). Asymptotic theory for the principal component analysis of a random vector function: some applications to statistical inference. J. Multivariate Anal., 12, 136-154.
Dessertaine, A. (2006). Sondage et séries temporelles : une application pour la prévision de la consommation électrique. 38èmes Journées de Statistique, Clamart, Juin 2006.
Deville, J.C. (1999). Variance estimation for complex statistics and estimators: linearization and residual techniques. Survey Methodology, 25, 193-203.
Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. J. Am. Statist. Ass., 47, 663-685.
Isaki, C.T. and Fuller, W.A. (1982). Survey design under the regression superpopulation model. J. Am. Statist. Ass., 77, 89-96.
Kato, T. (1966). Perturbation theory for linear operators. Springer Verlag, Berlin.
Ramsay, J.O. and Silverman, B.W. (2005). Functional Data Analysis. Springer-Verlag, 2nd ed.
Skinner, C.J., Holmes, D.J. and Smith, T.M.F. (1986). The Effect of Sample Design on Principal Components Analysis. J. Am. Statist. Ass., 81, 789-798.