Functional Principal Components Analysis with Survey Data


First International Workshop on Functional and Operatorial Statistics. Toulouse, June 19-21, 2008


Hervé CARDOT, Mohamed CHAOUCH(∗), Camelia GOGA & Catherine LABRUÈRE

Institut de Mathématiques de Bourgogne, Université de Bourgogne, 9 Avenue Alain Savary, BP 47870, 21078 DIJON Cedex, FRANCE.

email: {herve.cardot, mohamed.chaouch, camelia.goga, catherine.labruere}@u-bourgogne.fr

Abstract

This work aims at performing Functional Principal Components Analysis (FPCA) with Horvitz-Thompson estimators when the curves are collected with survey sampling techniques. Linearization approaches based on the influence function allow us to derive estimators of the asymptotic variance of the eigenelements of the FPCA. The method is illustrated with simulations which confirm the good properties of the linearization technique.

1. Introduction

Functional Data Analysis, whose main purpose is to provide tools for describing and modeling sets of curves, is a topic of growing interest in the statistical community. The books by Ramsay and Silverman (2002, 2005) give an interesting description of the available procedures dealing with functional observations. These functional approaches have proved useful in various domains such as chemometrics, economics, climatology, biology and remote sensing.

The statistician generally wants, as a first step, to represent a set of random curves as well as possible in a low-dimensional space in order to get a description of the functional data that allows interpretation. Functional principal components analysis (FPCA) gives a low-dimensional space which captures the main modes of variability of the data (see Ramsay and Silverman, 2002, for more details).


The way the data are collected is seldom taken into account in the literature, and one generally supposes the data are independent realizations of a common functional distribution. However, there are cases in which this assumption is not fulfilled, for example when the realizations result from a sampling scheme. For instance, Dessertaine (2006) considers the estimation, with time series procedures, of a global demand for electricity at fine time scales from the observation of individual electricity consumption curves. More generally, data streams are now produced automatically by large numbers of distributed sensors, which generate huge amounts of data that can be seen as functional. The use of sampling techniques to collect them, proposed for instance in Chiky and Hébrail (2007), seems to be a relevant approach in such a framework, allowing a trade-off between storage capacities and accuracy of the data.

We propose in this work estimators of the functional principal components analysis when the curves are collected with survey sampling strategies. Let us note that Skinner et al. (1986) have studied some properties of multivariate PCA in a survey framework. The functional framework is different since the eigenfunctions, which exhibit the main modes of variability of the data, are also functions and can be naturally interpreted as modes of variability varying along time. In this new functional framework, we estimate the mean function and the covariance operator using the Horvitz-Thompson estimator. The eigenelements are estimated by diagonalization of the estimated covariance operator. In order to calculate and estimate the variance of the so-constructed estimators, we use the influence function linearization method introduced by Deville (1999).

This paper is organized as follows: Section 2 presents the functional principal components analysis in the setting of finite populations and then defines the Horvitz-Thompson estimator in the new functional framework. The generality of the influence function allows us to extend, in Section 3, the estimators proposed by Deville to our functional objects and to get asymptotic variances with the help of perturbation theory (Kato, 1966). Section 4 proposes a simulation study which shows the good behavior of our estimators for various sampling schemes as well as good approximations to their theoretical variances.

2. FPCA and sampling

2.1 FPCA in a finite population setting

Let us consider a finite population $U = \{1, \ldots, k, \ldots, N\}$ with size $N$, not necessarily known, and a functional variable $Y$ defined for each element $k$ of the population $U$: $Y_k = (Y_k(t))_{t\in[0,1]}$ belongs to the separable Hilbert space $L^2[0,1]$ of square integrable functions defined on the closed interval $[0,1]$, equipped with the usual inner product $\langle\cdot,\cdot\rangle$ and the norm $\|\cdot\|$. The mean function $\mu \in L^2[0,1]$ is defined by

$$\mu(t) = \frac{1}{N}\sum_{k\in U} Y_k(t), \quad t\in[0,1], \tag{1}$$

and the covariance operator $\Gamma$ by

$$\Gamma = \frac{1}{N}\sum_{k\in U} (Y_k-\mu)\otimes(Y_k-\mu), \tag{2}$$

where the tensor product of two elements $a$ and $b$ of $L^2[0,1]$ is the rank one operator such that $a\otimes b(u) = \langle a,u\rangle b$ for all $u$ in $L^2[0,1]$. The operator $\Gamma$ is symmetric and non negative ($\langle\Gamma u,u\rangle \ge 0$). Its eigenvalues, sorted in decreasing order, $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_N\ge 0$, satisfy

$$\Gamma v_j(t) = \lambda_j v_j(t), \quad t\in[0,1], \tag{3}$$

where the eigenfunctions $v_j$ form an orthonormal system in $L^2[0,1]$, i.e. $\langle v_j, v_{j'}\rangle = 1$ if $j=j'$ and zero otherwise.

We now get an expansion similar to the Karhunen-Loève expansion, or FPCA, which gives the best approximation of the curves of the population in a finite-dimensional space of dimension $q$:

$$Y_k(t) \approx \mu(t) + \sum_{j=1}^{q} \langle Y_k-\mu, v_j\rangle\, v_j(t), \quad t\in[0,1].$$

The eigenfunctions $v_j$ indicate the main modes of variation along time $t$ of the data around the mean $\mu$, and the explained variance of the projection onto each $v_j$ is given by the eigenvalue

$$\lambda_j = \frac{1}{N}\sum_{k\in U}\langle Y_k-\mu, v_j\rangle^2.$$

We aim at estimating the mean function $\mu$ and the covariance operator $\Gamma$ in order to deduce estimators of the eigenelements $(\lambda_j, v_j)$ when the data are obtained with survey sampling procedures.
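As a rough numerical illustration of the population quantities of Section 2.1, the computation can be carried out on discretized curves. The sketch below is not taken from the paper: it assumes curves observed on an equispaced grid of $p$ points, approximates the $L^2[0,1]$ inner product by a Riemann sum with weight $1/p$, and uses the hypothetical function name `population_fpca` and simulated random-walk curves as toy data.

```python
import numpy as np

def population_fpca(Y, q):
    """FPCA of a finite population of discretized curves.

    Y : (N, p) array, row k holding Y_k on an equispaced grid of [0, 1].
    q : number of eigenelements to keep.
    Assumption: <a, b> is approximated by (1/p) * sum(a * b).
    """
    N, p = Y.shape
    w = 1.0 / p                                # quadrature weight
    mu = Y.mean(axis=0)                        # mean function on the grid
    Yc = Y - mu
    kernel = Yc.T @ Yc / N                     # covariance kernel on the grid
    # The operator acts as u -> w * kernel @ u, so diagonalise w * kernel.
    evals, evecs = np.linalg.eigh(w * kernel)
    order = np.argsort(evals)[::-1][:q]        # decreasing eigenvalues
    lam = evals[order]
    v = evecs[:, order] / np.sqrt(w)           # rescale so that <v_j, v_j> = 1
    scores = w * Yc @ v                        # <Y_k - mu, v_j>
    return mu, lam, v, scores

# Toy population (assumption): random-walk curves.
rng = np.random.default_rng(0)
Y = np.cumsum(rng.normal(scale=0.1, size=(200, 50)), axis=1)
mu, lam, v, scores = population_fpca(Y, q=3)
Y_approx = mu + scores @ v.T                   # rank-q expansion of each curve
```

With this discretization, each eigenvalue equals the mean of the squared scores on the corresponding eigenfunction, which mirrors the explained-variance formula above.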

2.2 The Horvitz-Thompson estimator

We consider a sample $s$ of $n$ individuals, i.e. a subset $s\subset U$, selected according to a probabilistic procedure $p(s)$, where $p$ is a probability distribution on the set of $2^N$ subsets of $U$. We denote by $\pi_k = \Pr(k\in s)$ for all $k\in U$ the first order inclusion probabilities and by $\pi_{kl} = \Pr(k \,\&\, l\in s)$ for all $k,l\in U$, with $\pi_{kk}=\pi_k$, the second order inclusion probabilities. We suppose that $\pi_k>0$ and $\pi_{kl}>0$. We also suppose that $\pi_k$ and $\pi_{kl}$ do not depend on $t\in[0,1]$.

We propose to estimate the mean function $\mu$ and the covariance operator $\Gamma$ by replacing each total with the corresponding Horvitz-Thompson (HT) estimator (Horvitz and Thompson, 1952). We obtain

$$\hat\mu = \frac{1}{\hat N}\sum_{k\in s}\frac{Y_k}{\pi_k}, \tag{4}$$

$$\hat\Gamma = \frac{1}{\hat N}\sum_{k\in s}\frac{Y_k\otimes Y_k}{\pi_k} - \hat\mu\otimes\hat\mu, \tag{5}$$

where the size $N$ of the population is estimated by $\hat N = \sum_{k\in s}\frac{1}{\pi_k}$ when it is not known.

Estimators $(\hat\lambda_j, \hat v_j)$ of the eigenelements are obtained readily by diagonalisation (or spectral analysis) of the estimated covariance operator $\hat\Gamma$. Let us note that the eigenelements of the covariance operator are not linear functions.
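The Horvitz-Thompson estimators and the diagonalisation step can be sketched numerically as follows. This is an illustration under assumptions, not the authors' code: curves are discretized on an equispaced grid of $p$ points, inner products are approximated with weight $1/p$, and the function name `ht_fpca` and the toy simple-random-sampling example are hypothetical.

```python
import numpy as np

def ht_fpca(Y_s, pi_s, q, N=None):
    """Horvitz-Thompson estimates of the mean, covariance kernel and eigenelements.

    Y_s  : (n, p) array of sampled curves on an equispaced grid of [0, 1].
    pi_s : (n,) first order inclusion probabilities of the sampled units.
    N    : population size; estimated by sum(1 / pi_s) when None.
    """
    n, p = Y_s.shape
    w = 1.0 / p                                  # quadrature weight (assumption)
    d = 1.0 / pi_s                               # design weights 1 / pi_k
    N_hat = d.sum() if N is None else float(N)
    mu_hat = d @ Y_s / N_hat                     # HT estimator of the mean
    # HT estimator of the covariance kernel: weighted second moment minus mu (x) mu.
    kernel_hat = (Y_s * d[:, None]).T @ Y_s / N_hat - np.outer(mu_hat, mu_hat)
    evals, evecs = np.linalg.eigh(w * kernel_hat)
    order = np.argsort(evals)[::-1][:q]          # decreasing eigenvalues
    return mu_hat, kernel_hat, evals[order], evecs[:, order] / np.sqrt(w), N_hat

# Toy example (assumption): simple random sampling without replacement.
rng = np.random.default_rng(1)
N, n, p = 500, 50, 30
Y = np.cumsum(rng.normal(scale=p ** -0.5, size=(N, p)), axis=1)
s = rng.choice(N, size=n, replace=False)
mu_hat, kernel_hat, lam_hat, v_hat, N_hat = ht_fpca(Y[s], np.full(n, n / N), q=2)
```

When every unit is sampled with probability one, the estimates coincide with the population mean and covariance, which gives a quick sanity check of the implementation.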

3. Linearization by influence function

We would like to calculate and estimate the variance of $\hat\mu$, $\hat v_j$ and $\hat\lambda_j$. The nonlinearity of these estimators and the functional nature of $Y$ make the variance estimation issue difficult. For this reason, we adapt the influence function linearization technique introduced by Deville (1999) to the functional framework.

Let us consider the discrete measure $M$ defined on $L^2[0,1]$ as follows: $M = \sum_{U}\delta_{Y_k}$, where $\delta_{Y_k}$ is the Dirac function taking value 1 if $Y = Y_k$ and zero otherwise. Let us suppose that each parameter of interest can be written as a functional $T$ of $M$. For example, $N(M) = \int dM$, $\mu(M) = \int Y\,dM / \int dM$ and $\Gamma(M) = \int (Y-\mu(M))\otimes(Y-\mu(M))\,dM / \int dM$. The eigenelements given by (??) are implicit functionals $T$ of $M$. The measure $M$ is estimated by the random measure $\hat M$ defined as follows: $\hat M = \sum_{U}\delta_{Y_k}\frac{I_k}{\pi_k}$, with $I_k = 1_{\{k\in s\}}$. Then the estimators given by (??) and (??) are obtained by substitution of $M$ by $\hat M$, namely they are written as functionals $T$ of $\hat M$.

3.1 Asymptotic Properties

We give in this section the asymptotic properties of our estimators. To do so, we need the population and sample sizes to tend to infinity. We use the asymptotic framework introduced by Isaki & Fuller (1982). Let us make the following assumptions:

(A1) $\sup_{k\in U}\|Y_k\| \le C < \infty$,

(A2) $\lim_{N\to\infty} \frac{n}{N} = \pi \in (0,1)$,

(A3) $\min_{k\in U_N}\pi_k \ge \lambda > 0$, $\min_{k\ne l}\pi_{kl} \ge \lambda^* > 0$ and $\lim_{N\to\infty} n\max_{k\ne l}|\pi_{kl}-\pi_k\pi_l| < \infty$,

where $\lambda$ and $\lambda^*$ are two positive constants. We also suppose that the functional $T$ giving the parameter of interest is a homogeneous functional of degree $\alpha$, namely $T(rM) = r^{\alpha}T(M)$ and $\lim_{N\to\infty}N^{-\alpha}T(M) < \infty$. For example, $\mu$ and $\Gamma$ are functionals of degree zero with respect to $M$. Let us note that the eigenelements of $\Gamma$ are also functionals of degree zero with respect to $M$.

Let us also introduce the Hilbert-Schmidt norm, denoted by $\|\cdot\|_2$, for operators mapping $L^2[0,1]$ to $L^2[0,1]$.

We show in the next proposition that our estimators are asymptotically design unbiased, $\lim_{N\to\infty}\left(E_p(T(\hat M)) - T(M)\right) = 0$, and consistent, namely for any fixed $\varepsilon > 0$ we have $\lim_{N\to\infty}P(|T(\hat M)-T(M)| > \varepsilon) = 0$. Here, $E_p(\cdot)$ is the expectation with respect to $p(s)$.

Proposition 1 Under hypotheses (A1), (A2) and (A3),

$$E_p\|\mu-\hat\mu\|^2 = O(n^{-1}), \qquad E_p\|\Gamma-\hat\Gamma\|_2^2 = O(n^{-1}).$$

If we suppose that the non null eigenvalues are distinct, we also have

$$E_p\left(\sup_j|\lambda_j-\hat\lambda_j|^2\right) = O(n^{-1}), \qquad E_p\|v_j-\hat v_j\|^2 = O(n^{-1}) \text{ for each fixed } j.$$

3.2 Variance approximation and estimation

Let us define, when it exists, the influence function of a functional $T$ at point $Y\in L^2[0,1]$, say $IT(M,Y)$, as follows:

$$IT(M,Y) = \lim_{h\to 0}\frac{T(M+h\delta_Y)-T(M)}{h},$$

where $\delta_Y$ is the Dirac function at $Y$.

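Proposition 2 Under assumption (A1), the influence functions of $\mu$ and $\Gamma$ exist and $I\mu(M,Y_k) = (Y_k-\mu)/N$ and $I\Gamma(M,Y_k) = \frac{1}{N}\left((Y_k-\mu)\otimes(Y_k-\mu)-\Gamma\right)$. If the non null eigenvalues of $\Gamma$ are distinct then

$$I\lambda_j(M,Y_k) = \frac{1}{N}\left(\langle Y_k-\mu, v_j\rangle^2-\lambda_j\right),$$

$$Iv_j(M,Y_k) = \frac{1}{N}\left(\sum_{\ell\ne j}\frac{\langle Y_k-\mu, v_j\rangle\langle Y_k-\mu, v_\ell\rangle}{\lambda_j-\lambda_\ell}\,v_\ell\right).$$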
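The closed form for the influence function of an eigenvalue can be checked numerically against its definition as a derivative. The sketch below is only an illustration under assumptions: curves are discretized on an equispaced grid, the measure $M + h\delta_Y$ is represented by appending $Y$ as an extra atom of mass $h$, and the names `lambda_of_measure` and `influence_lambda`, the step $h = 10^{-6}$ and the toy data are all hypothetical choices.

```python
import numpy as np

def lambda_of_measure(Y, mass, j):
    """j-th largest eigenvalue (0-based) of Gamma(M), M = sum_k mass_k * delta_{Y_k}."""
    w = 1.0 / Y.shape[1]                     # quadrature weight (assumption)
    total = mass.sum()                       # N(M)
    mu = mass @ Y / total                    # mu(M)
    Yc = Y - mu
    kernel = (Yc * mass[:, None]).T @ Yc / total
    return np.linalg.eigvalsh(w * kernel)[::-1][j]

def influence_lambda(Y, y, j):
    """Closed form (<y - mu, v_j>^2 - lambda_j) / N for the eigenvalue lambda_j."""
    N, p = Y.shape
    w = 1.0 / p
    mu = Y.mean(axis=0)
    Yc = Y - mu
    evals, evecs = np.linalg.eigh(w * Yc.T @ Yc / N)
    lam_j = evals[::-1][j]
    v_j = evecs[:, ::-1][:, j] / np.sqrt(w)  # normalised so that <v_j, v_j> = 1
    score = w * (y - mu) @ v_j               # <y - mu, v_j>
    return (score ** 2 - lam_j) / N

# Toy population (assumption): discretized Brownian-like paths.
rng = np.random.default_rng(2)
N, p, h = 60, 20, 1e-6
Y = np.cumsum(rng.normal(scale=p ** -0.5, size=(N, p)), axis=1)
y = Y[0]
# M + h * delta_y: append y as an extra atom with mass h.
lam_h = lambda_of_measure(np.vstack([Y, y]), np.append(np.ones(N), h), j=0)
lam_0 = lambda_of_measure(Y, np.ones(N), j=0)
finite_diff = (lam_h - lam_0) / h
closed_form = influence_lambda(Y, y, j=0)
```

The two quantities should agree up to the finite-difference error, provided the first eigenvalue is well separated from the second.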

In order to obtain the asymptotic variance of $T(\hat M)$ for $T$ given by (??), (??) and (??), we write the first-order von Mises expansion of our functional in $\hat M/N$ "near" $M/N$ and use the fact that $T$ is of degree 0 and $IT(M/N, Y_k) = N\cdot IT(M, Y_k)$:

$$T(\hat M) = T(M) + \sum_{k\in U} IT(M,Y_k)\left(\frac{I_k}{\pi_k}-1\right) + R_T\left(\frac{\hat M}{N}, \frac{M}{N}\right).$$

Proposition 3 Suppose the hypotheses (A1), (A2) and (A3) are fulfilled. Consider the functional $T$ giving the parameters of interest defined in (??), (??), (??). We suppose that the non null eigenvalues are distinct. Then $R_T\left(\frac{\hat M}{N}, \frac{M}{N}\right) = o_p(n^{-1/2})$ and the asymptotic variance of $T(\hat M)$ is equal to

$$V_p\left[\sum_{k\in s}\frac{IT(M,Y_k)}{\pi_k}\right] = \sum_{U}\sum_{U}(\pi_{kl}-\pi_k\pi_l)\,\frac{IT(M,Y_k)}{\pi_k}\,\frac{IT(M,Y_l)}{\pi_l}.$$

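Note that the asymptotic variance given by the above result is unknown, since it depends on the whole population. We propose to estimate it by the HT variance estimator with $IT(M,Y_k)$ replaced by its HT estimator. We obtain

$$\hat V_p(\hat\mu) = \frac{1}{\hat N^2}\sum_{k\in s}\sum_{\ell\in s}\frac{1}{\pi_{k\ell}}\frac{\Delta_{k\ell}}{\pi_k\pi_\ell}\,(Y_k-\hat\mu)\otimes(Y_\ell-\hat\mu),$$

$$\hat V_p(\hat\lambda_j) = \frac{1}{\hat N^2}\sum_{k\in s}\sum_{\ell\in s}\frac{1}{\pi_{k\ell}}\frac{\Delta_{k\ell}}{\pi_k\pi_\ell}\left(\langle Y_k-\hat\mu, \hat v_j\rangle^2-\hat\lambda_j\right)\left(\langle Y_\ell-\hat\mu, \hat v_j\rangle^2-\hat\lambda_j\right),$$

$$\hat V_p(\hat v_j) = \sum_{k\in s}\sum_{\ell\in s}\frac{1}{\pi_{k\ell}}\frac{\Delta_{k\ell}}{\pi_k\pi_\ell}\,\widehat{Iv_j}(M,Y_k)\otimes\widehat{Iv_j}(M,Y_\ell),$$

where $\Delta_{k\ell} = \pi_{k\ell}-\pi_k\pi_\ell$ and

$$\widehat{Iv_j}(M,Y_k) = \frac{1}{\hat N}\left(\sum_{\ell\ne j}\frac{\langle Y_k-\hat\mu, \hat v_j\rangle\langle Y_k-\hat\mu, \hat v_\ell\rangle}{\hat\lambda_j-\hat\lambda_\ell}\,\hat v_\ell\right).$$

Cardot et al. (2007) show that under the assumptions (A1)-(A3), these estimators are asymptotically design unbiased and consistent.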
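The variance estimator for an estimated eigenvalue can be sketched as a quadratic form in the estimated linearized variable $\langle Y_k-\hat\mu, \hat v_j\rangle^2-\hat\lambda_j$. The code below is a hedged illustration, not the authors' implementation: it assumes curves on an equispaced grid, takes the full matrix of second order inclusion probabilities of the sampled units as input, and uses the hypothetical name `ht_variance_lambda`. The toy design is simple random sampling without replacement, for which the inclusion probabilities are known in closed form.

```python
import numpy as np

def ht_variance_lambda(Y_s, pi, Pi, j=0):
    """HT variance estimator of the j-th estimated eigenvalue (0-based).

    Y_s : (n, p) sampled curves on an equispaced grid of [0, 1].
    pi  : (n,) first order inclusion probabilities.
    Pi  : (n, n) second order inclusion probabilities, with Pi[k, k] = pi[k].
    """
    n, p = Y_s.shape
    w = 1.0 / p                                   # quadrature weight (assumption)
    d = 1.0 / pi
    N_hat = d.sum()
    mu_hat = d @ Y_s / N_hat
    kernel_hat = (Y_s * d[:, None]).T @ Y_s / N_hat - np.outer(mu_hat, mu_hat)
    evals, evecs = np.linalg.eigh(w * kernel_hat)
    lam_hat = evals[::-1][j]
    v_hat = evecs[:, ::-1][:, j] / np.sqrt(w)
    # Estimated linearized variable: <Y_k - mu_hat, v_hat>^2 - lam_hat.
    u = (w * (Y_s - mu_hat) @ v_hat) ** 2 - lam_hat
    delta = Pi - np.outer(pi, pi)                 # Delta_kl = pi_kl - pi_k * pi_l
    z = u / pi
    var_hat = z @ (delta / Pi) @ z / N_hat ** 2
    return var_hat, lam_hat, u

# Toy example (assumption): simple random sampling without replacement.
rng = np.random.default_rng(3)
N, n, p = 1000, 50, 30
Y = np.cumsum(rng.normal(scale=p ** -0.5, size=(N, p)), axis=1)
s = rng.choice(N, size=n, replace=False)
pi = np.full(n, n / N)
Pi = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(Pi, n / N)
var_hat, lam_hat, u = ht_variance_lambda(Y[s], pi, Pi, j=0)
```

For this particular design the double sum reduces to the classical expression $(1-n/N)\,s_u^2/n$, where $s_u^2$ is the sample variance of the linearized variable, which gives a convenient check.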

4. A Simulation study

In our simulations all functional variables are discretized at $p = 100$ equispaced points in the interval $[0,1]$. We consider a random variable $Y$ distributed as Brownian motion on $[0,1]$. We make $N = 10000$ replications of $Y$ and then construct two strata $U_1$ and $U_2$ with different variances and with sizes $N_1 = 7000$ and $N_2 = 3000$. Our population $U$ is the union of the two strata. Then we estimate the eigenelements of the covariance operator for two different sampling designs (Simple Random Sampling Without Replacement (SRSWOR) and stratified sampling) and two different sample sizes, $n = 100$ and $n = 1000$. To evaluate our estimation procedures we make 500 replications of the previous experiment. Estimation errors for the first eigenvalue and the first eigenvector are then evaluated with the following loss criteria: $\frac{|\lambda_1-\hat\lambda_1|}{\lambda_1}$ and $\frac{\|v_1-\hat v_1\|}{\|v_1\|}$, where $\|\cdot\|$ is the Euclidean norm. Linear approximation by influence function gives a reasonable estimate of the variance for small samples and accurate estimates once $n$ is large enough ($n = 1000$). We also note that the variance of the estimators obtained with stratified sampling turns out to be smaller than that obtained with SRSWOR.
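A scaled-down version of this kind of experiment can be sketched as follows. Several choices here are assumptions made only for illustration, since the text does not specify them: the second stratum is obtained by multiplying its curves by a factor `scale2`, the stratified design uses proportional allocation with simple random sampling without replacement in each stratum, the population and replication sizes are reduced, and the names `first_eigen` and `simulate` are hypothetical. Only the relative error on the first eigenvalue is reported.

```python
import numpy as np

def first_eigen(Y, weights):
    """First eigenvalue and eigenfunction of the weighted covariance operator."""
    w = 1.0 / Y.shape[1]                       # quadrature weight (assumption)
    total = weights.sum()
    mu = weights @ Y / total
    Yc = Y - mu
    kernel = (Yc * weights[:, None]).T @ Yc / total
    evals, evecs = np.linalg.eigh(w * kernel)
    return evals[-1], evecs[:, -1] / np.sqrt(w)

def simulate(N1=700, N2=300, p=50, n=100, n_rep=200, scale2=2.0, seed=0):
    """Mean relative error on lambda_1 for two sampling designs."""
    rng = np.random.default_rng(seed)
    N = N1 + N2
    # Discretized Brownian paths: cumulative sums of N(0, 1/p) increments.
    Y = np.cumsum(rng.normal(scale=np.sqrt(1.0 / p), size=(N, p)), axis=1)
    Y[N1:] *= scale2                           # second stratum: larger variance (assumed)
    lam1, _ = first_eigen(Y, np.ones(N))       # population value
    n1 = round(n * N1 / N)                     # proportional allocation (assumed)
    n2 = n - n1
    err = {"srswor": [], "stratified": []}
    for _ in range(n_rep):
        # Simple random sampling without replacement: weights N / n.
        s = rng.choice(N, size=n, replace=False)
        lam_hat, _ = first_eigen(Y[s], np.full(n, N / n))
        err["srswor"].append(abs(lam1 - lam_hat) / lam1)
        # Stratified sampling: weights N_h / n_h in stratum h.
        s1 = rng.choice(N1, size=n1, replace=False)
        s2 = N1 + rng.choice(N2, size=n2, replace=False)
        d = np.concatenate([np.full(n1, N1 / n1), np.full(n2, N2 / n2)])
        lam_hat, _ = first_eigen(Y[np.concatenate([s1, s2])], d)
        err["stratified"].append(abs(lam1 - lam_hat) / lam1)
    return {design: float(np.mean(e)) for design, e in err.items()}

result = simulate(n_rep=50)
```

The returned dictionary holds one Monte Carlo average per design; the figures it produces depend on the assumed settings and are not the results reported in this section.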

References

Cardot, H., Chaouch, M., Goga, C. and Labruère, C. (2007). Functional Principal Components Analysis with Survey Data. Preprint.

Chiky, R. and Hébrail, G. (2007). Generic tool for summarizing distributed data streams. Preprint.

Dauxois, J., Pousse, A. and Romain, Y. (1982). Asymptotic theory for the principal component analysis of a random vector function: some applications to statistical inference. J. Multivariate Anal., 12, 136-154.

Dessertaine, A. (2006). Sondage et séries temporelles : une application pour la prévision de la consommation électrique. 38èmes Journées de Statistique, Clamart, Juin 2006.

Deville, J.C. (1999). Variance estimation for complex statistics and estimators: linearization and residual techniques. Survey Methodology, 25, 193-203.

Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. J. Am. Statist. Ass., 47, 663-685.

Isaki, C.T. and Fuller, W.A. (1982). Survey design under the regression superpopulation model. J. Am. Statist. Ass., 77, 89-96.

Kato, T. (1966). Perturbation theory for linear operators. Springer Verlag, Berlin.

Ramsay, J.O. and Silverman, B.W. (2005). Functional Data Analysis. Springer-Verlag, 2nd ed.

Skinner, C.J., Holmes, D.J. and Smith, T.M.F. (1986). The Effect of Sample Design on Principal Components Analysis. J. Am. Statist. Ass., 81, 789-798.