Factor Modelling for Clustering High-dimensional Time Series


arXiv:2101.01908v1 [math.ST] 6 Jan 2021


Bo Zhang

Department of Statistics & Finance, International Institute of Finance School of Management, University of Science and Technology of China

[email protected]

Guangming Pan

School of Physical & Mathematical Sciences, Nanyang Technological University

[email protected]

Qiwei Yao

Department of Statistics, London School of Economics and Political Science

[email protected]

Wang Zhou

Department of Statistics & Applied Probability, National University of Singapore

[email protected]

January 7, 2021

Abstract

We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors, in addition to some common factors which impact all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any cluster. Consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors and the latent clusters. Numerical illustration with both simulated data and a real data example is also reported. As a spin-off, the proposed new approach also significantly advances the statistical inference for the factor model of Lam and Yao (2012).


components; k-means clustering algorithm; Ratio-based estimation; Strong and weak factors.

1 Introduction

One of the primary tasks of data mining is clustering. While most clustering methods were originally designed for independent observations, clustering a large number of time series is gaining momentum (Esling and Agon 2012), driven by the mining of large and complex data recorded over time in business, finance, biology, medicine, climate, energy, environment, psychology, multimedia and other areas (Table 1 of Aghabozorgi et al. 2015). Consequently the literature on time series clustering is large; see Liao (2005), Aghabozorgi et al. (2015), Maharaj et al. (2019) and the references therein. The basic idea is to develop some relevant similarity or distance measures among time series first, and then to apply standard clustering algorithms such as hierarchical clustering or the k-means method. Most existing similarity/distance measures for time series may be loosely divided into two categories: data-based and feature-based. The data-based approaches define the measures directly based on observed time series using, for example, the $L_2$- or, more generally, Minkowski's distance, or various correlation measures. Alonso and Peña (2019) proposed a generalized cross correlation as a similarity measure, which takes into account cross correlation over different time lags. Dynamic time warping can be applied beforehand to cope with time deformation due to, for example, shifting holidays over different years (Keogh and Ratanamahatana, 2005). The feature-based approaches extract relevant features from observed time series data first, and then define similarity/distance measures based on the extracted features. The feature extraction can be carried out by various transformations such as Fourier, wavelet or principal component analysis (Section 2.3 of Roelofsen, 2018). The features from fitted time series models can also be used to define similarity/distance measures (Yao et al. 2000, Frühwirth-Schnatter and Kaufmann 2008). Attempts have also been made to define the similarity between two time series by measuring the discrepancy between the two underlying stochastic processes (Kakizawa et al. 1998, Khaleghi et al. 2016). Other approaches include Zhang (2013), which clusters time series based on the parallelism of their trend functions, and Ando and Bai (2017), which represents the latent clusters in terms of a factor model. So-called 'subsequence clustering' occurs frequently in the literature on time series clustering; see Keogh and Lin (2005), and Zolhavarieh et al. (2014). It refers to clustering the segments of a single long time series, which is not considered in this paper.

The goal of this study is to propose a new factor model based approach to cluster a large number of time series into different and unknown clusters such that the members within each cluster share a similar dynamic structure, while the number of clusters and their sizes are all unknown. We represent the dynamic structures by latent common and cluster-specific factors, which are both unknown and are identified by the difference in factor strength. The common factors are strong factors (Remark 1 of Lam and Yao 2012), as each of them carries information on most (if not all) of the time series concerned. The cluster-specific factors are weak factors, as they only affect the time series in a specific cluster. The clustering is based on the factor loadings on all the weak factors, by applying a k-means algorithm with a correlation-type similarity measure defined in terms of the loadings.

Though our factor model is similar to that of Ando and Bai (2017), our approach is radically different. First, we estimate the strong factors and all the weaker factors in a single pass, and then the latent clusters are recovered based on the estimated weak factor loadings. In contrast, Ando and Bai (2017) adopted an iterative least squares algorithm to estimate factors/factor loadings and the latent cluster structure recursively. Secondly, our setting allows the flexibility that some time series do not belong to any cluster, which is often the case in practice. Thirdly, our setting allows dependence between the common factors and the cluster-specific factors, while Ando and Bai (2017) imposed an orthogonality condition between the two; see Remark 1(iv) in Section 2 below.

The methods used for estimating factors and factor loadings are adapted from Lam and Yao (2012). Nevertheless substantial advances have been made even within the context of Lam and Yao (2012): (i) we remove the artificial condition that the factor loading spaces for strong and weak factors are perpendicular to each other, (ii) we allow weak serial correlations in the idiosyncratic components of the model, which were assumed to be vector white noise by Lam and Yao (2012), and, more significantly, (iii) we propose a new and consistent ratio-based estimator for the number of factors (see Step 1 and also Remark 3(iii) in Section 3 below).

The rest of the paper is organized as follows. Our factor model and the relevant conditions are presented in Section 2. The inference methods are presented in Section 3. The asymptotic properties of the estimation methods are collected in Section 4. Numerical illustration with both simulated data and a real data example is reported in Section 5. All technical proofs are presented in Section 6. A supplementary file contains more simulation results.

We always assume that vectors are column vectors. Let $\|a\|$ denote the Euclidean norm of a vector $a$. For any matrix $G$, let $\mathcal{M}(G)$ denote the linear space spanned by the columns of $G$, $\|G\|$ the square root of the largest eigenvalue of $G^\top G$, $\|G\|_{\min}$ the square root of the smallest eigenvalue of $G^\top G$, and $|G|$ the determinant of $G$ when $G$ is square. We write $a \asymp b$ if $a = O(b)$ and $b = O(a)$. We use $C > 0$ to denote a generic constant independent of $p$ and $n$, which may be different at different places.

2 Models and assumptions

Let $y_t$ be a weakly stationary $p \times 1$ vector time series, i.e. $E y_t$ is a constant independent of $t$, and all elements of $\mathrm{Cov}(y_{t+k}, y_t)$ are finite and dependent on $k$ only. Suppose that $y_t$ consists of $d+1$ latent segments, i.e.

$$y_t^\top = (y_{t,1}^\top, \cdots, y_{t,d}^\top, y_{t,d+1}^\top), \qquad (2.1)$$

where $y_{t,1}, \cdots, y_{t,d+1}$ are, respectively, $p_1, \cdots, p_{d+1}$ vector time series with $p_1, \cdots, p_d \ge 1$, $p_{d+1} \ge 0$, and

$$p_1 + \cdots + p_d = p_0, \qquad p_0 + p_{d+1} = p.$$

Furthermore, we assume the following latent factor model with $d$ clusters:

$$y_t = A x_t + \binom{B}{0} z_t + \varepsilon_t, \qquad B = \mathrm{diag}(B_1, \cdots, B_d), \quad z_t^\top = (z_{t,1}^\top, \cdots, z_{t,d}^\top), \qquad (2.2)$$

where $A$ is a $p \times r_0$ matrix with rank $r_0$, $x_t$ is an $r_0$ vector time series representing $r_0$ common factors and $|\mathrm{Var}(x_t)| \ne 0$, $B_j$ is a $p_j \times r_j$ matrix with rank $r_j$, $z_{t,j}$ is an $r_j$ vector time series representing $r_j$ factors for $y_{t,j}$ only and $|\mathrm{Var}(z_{t,j})| \ne 0$, $0$ stands for a $p_{d+1} \times r$ matrix with all elements equal to 0, $r = r_1 + \cdots + r_d$, and $\varepsilon_t$ is an idiosyncratic component in the sense of Chamberlain (1983) and Chamberlain and Rothschild (1983) (see below). Note that in the model above, we only observe a permuted $y_t$ (i.e. the order of the components of $y_t$ is unknown), while all the terms on the RHS of (2.2) are unknown.

By (2.2), the $p_0$ components of $y_t$ are grouped into $d$ clusters $y_{t,1}, \cdots, y_{t,d}$, while the $p_{d+1}$ components of $y_{t,d+1}$ do not belong to any cluster. The $j$-th cluster $y_{t,j}$ is characterised by the cluster-specific factor $z_{t,j}$, in addition to depending on the common factor $x_t$. The goal is to identify those $d$ latent clusters from observations $y_1, \cdots, y_n$. Note that all $p_j$, $r_j$ and $d$ are also unknown.

Assumption 1. $\max\{d, r_0, r\} < C < \infty$, where $C$ is a constant independent of $n$ and $p$, and $p_i \asymp p$ for $i = 1, \cdots, d$.

Assumption 2. $A^\top A = I_{r_0}$, $B^\top B = I_r$, and it holds for a constant $q_0 \in (0,1)$ that

$$\Big\| A A^\top \binom{B}{0} \Big\| \le q_0. \qquad (2.3)$$

Assumption 1 requires that the number of factors remains finite when the number of component time series tends to $\infty$. This substantially simplifies the technical proofs for the asymptotic results. In practice, there is only one (fixed) $p$, and the factor model is only effective when the number of factors is much smaller than $p$. The assumption that $A$ and $B$ have orthonormal columns can always be fulfilled, as we can replace the original $(A, x_t)$ by $(H, V x_t)$, where $A = HV$ is a QR decomposition of $A$. The orthogonality for $B$ can be obtained by such a replacement for each $(B_j, z_{t,j})$. While $A$ and $B$ are not uniquely defined by (2.2), the factor loading spaces $\mathcal{M}(A)$ and $\mathcal{M}(B_j)$ are, where $\mathcal{M}(A)$ denotes the linear space spanned by the columns of $A$. Hence $A A^\top = A (A^\top A)^{-1} A^\top$, i.e. the projection matrix onto $\mathcal{M}(A)$, is also unique. Condition (2.3) implies that the columns of $\binom{B}{0}$ do not fall entirely into the space $\mathcal{M}(A)$, as otherwise one cannot distinguish $z_t$ from $x_t$.
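The QR-based normalisation described above can be checked numerically. The following is a minimal sketch assuming NumPy is available; the function name `orthonormalise_loadings` and the toy dimensions are illustrative choices, not part of the paper.

```python
import numpy as np

def orthonormalise_loadings(A, X):
    """Replace (A, x_t) by (H, V x_t), where A = H V is a reduced QR decomposition.

    A: p x r0 loading matrix with full column rank.
    X: r0 x n matrix whose columns are the factors x_t.
    """
    H, V = np.linalg.qr(A)   # H: p x r0 with H^T H = I, V: r0 x r0 upper triangular
    return H, V @ X

# Toy illustration (hypothetical sizes): the common component A x_t is unchanged.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2))
X = rng.normal(size=(2, 5))
H, X_new = orthonormalise_loadings(A, X)
```

After the replacement, `H @ X_new` equals `A @ X`, and `H @ H.T` is the projection matrix onto the column space of `A`, which is the quantity that remains unique.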

Intuitively (almost) all components of $y_t$ carry information on the common factor $x_t$, only $p_j$ components of $y_t$ carry information on the $j$-th cluster-specific factor $z_{t,j}$ ($j = 1, \cdots, d$), and merely a few components of $y_t$ carry information on each of the idiosyncratic components of $\varepsilon_t$. It is reasonable to assume that $x_t$, $z_t$ and $\varepsilon_t$ are of different factor strengths. Assumption 3 below quantifies explicitly the differences in factor strength between $x_t$ and $z_{t,j}$. We introduce some notation first:

$$\Sigma_x(k) = \mathrm{Cov}(x_{t+k}, x_t), \quad \Sigma_z(k) = \mathrm{Cov}(z_{t+k}, z_t), \quad \Sigma_{x,z}(k) = \mathrm{Cov}(x_{t+k}, z_t), \quad \Sigma_{z,x}(k) = \mathrm{Cov}(z_{t+k}, x_t).$$

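Assumption 3. Let $k_0 \ge 1$ be an integer and $\delta \in (0,1)$ be a fixed constant. It holds that for $k = 0, 1, \cdots, k_0$,

$$\|\Sigma_x(k)\| \asymp p \asymp \|\Sigma_x(k)\|_{\min}, \qquad (2.4)$$

$$\|\Sigma_z(k)\| \asymp p^{1-\delta} \asymp \|\Sigma_z(k)\|_{\min}, \qquad (2.5)$$

$$\|\Sigma_x(k)^{-1/2} \Sigma_{x,z}(k) \Sigma_z(k)^{-1/2}\| \le q_0 < 1, \qquad \|\Sigma_z(k)^{-1/2} \Sigma_{z,x}(k) \Sigma_x(k)^{-1/2}\| \le q_0 < 1, \qquad (2.6)$$

$$\|\Sigma_{x,z}(k)\| = O(p^{1-\delta/2}), \qquad \|\Sigma_{z,x}(k)\| = O(p^{1-\delta/2}), \qquad (2.7)$$

$$\mathrm{Cov}(x_t, \varepsilon_s) = 0, \qquad \mathrm{Cov}(z_t, \varepsilon_s) = 0 \quad \text{for all } t \text{ and } s. \qquad (2.8)$$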

Remark 1. (i) Following Lam and Yao (2012), we measure the strength of factors by a constant


meaning and the implication of $\delta$. Condition (2.4) implies that all the components of $x_t$ are strong factors corresponding to $\delta = 0$. Since almost all the components of $y_t$ carry information on each component of $x_t$, those strong factors can be relatively easily recovered from $y_t$. In contrast, the components of $z_t$ are weak factors with $\delta \in (0,1)$, as only about $p^{1-\delta}$ components of $y_t$ carry information on $z_t$; see (2.5). Hence it is more difficult to recover those weak factors. Note that our primary interest is to recover the cluster-specific factor $z_t$ in order to cluster the components of $y_t$, for which we also need to estimate $x_t$.

(ii) For simplicity of presentation, we assume that all the cluster-specific factors $z_{t,1}, \cdots, z_{t,d}$ are of the same strength, reflected by the uniform constant $\delta$ in (2.5). In practice, those weak factors may have different strengths; see, for example, the real data example in Section 5.2 below. While our approach can be readily extended to the cases with weak factors of different strengths, doing so would make the theoretical investigation more cumbersome.

(iii) In (2.2) $\varepsilon_t$ represents the idiosyncratic component of $y_t$ in the sense that each component of $\varepsilon_t$ only affects the corresponding component and a few other components of $y_t$ (i.e. $\delta = 1$), which is implied by Assumption 4 below. The differences in factor strength make the three time series $x_t$, $z_t$ and $\varepsilon_t$ on the RHS of (2.2) (asymptotically) identifiable.

(iv) Model (2.2) is similar to that of Ando and Bai (2017). However we do not require that the common factor $x_t$ and the cluster-specific factor $z_t$ are orthogonal to each other in the sense that $\frac{1}{n} \sum_{1 \le t \le n} x_t z_t^\top = 0$, which is imposed by Ando and Bai (2017). Furthermore we allow the idiosyncratic term $\varepsilon_t$ to exhibit weak autocorrelations (Assumption 4 below), instead of complete independence as in Ando and Bai (2017).

From now on we always assume in (2.2) that $\varepsilon_t = W e_t$ with $E(e_t) = 0$, where $W$ is a $p \times p$ constant matrix, and $e_t = (e_{t,1}, \cdots, e_{t,p})^\top$ consists of $p$ independent weakly stationary univariate time series. We specify $e_t$ in Assumption 4.

Assumption 4. Let $p, n \to \infty$ in the order of $n = O(p)$ and $p^{\delta} \log p = o(n)$. Let $\Sigma_{e,k}$ be an $n \times n$ matrix with $E(e_{t+i,k}\, e_{t+j,k})$ as its $(i,j)$-th element. Suppose that $\|W\| < C$,

$$\lim_{n,p \to \infty} \Big\| \frac{1}{p} \sum_{k=1}^{p} \Sigma_{e,k} \Big\| < C, \qquad (2.9)$$

$$\max_{1 \le i \le p} E e_{t,i}^2 < C, \qquad \sum_{i=1}^{p} E e_{t,i}^2 \asymp p, \qquad (2.10)$$

and

$$E\Big\{ \Big( \sum_{t=1}^{n} e_{t,i}^2 - n E e_{t,i}^2 \Big)^2 \, 1\Big( \Big| \sum_{t=1}^{n} e_{t,i}^2 - n E e_{t,i}^2 \Big| > n \Big) \Big\} < \frac{C n^2}{\min\{p,\, n \log n\}} \qquad (2.11)$$

for any $1 \le i \le p$, where $1(\cdot)$ denotes the indicator function.

Remark 2. When

$$\sum_{j=0}^{\infty} \Big| \frac{1}{p} \sum_{i=1}^{p} \mathrm{Cov}(e_{t+j,i}, e_{t,i}) \Big| < C, \qquad (2.12)$$

Gershgorin's circle theorem ensures (2.9). Condition (2.11) holds when $p \asymp n$ and $\{e_{t,i},\ t = 1, \cdots, n\}$ are mixing random variables with the mixing coefficients decaying at appropriate rates. When $p > n$ it is also true for appropriate mixing random variables, because one may evaluate a higher ($>2$) moment on the left-hand side of (2.11) due to the involvement of the indicator function.

To state the assumptions on $A$ and $B$ required for the clustering analysis, we partition $A$ according to the latent cluster structure:

$$A^\top = [A_1^\top, \cdots, A_{d+1}^\top], \qquad (2.13)$$

where $A_i$ is a $p_i \times r_0$ matrix.

Assumption 5. Let $q_p = \max_{1 \le i \le d+1,\, j \ne i} \|A_i A_j^\top B_j\|_F$. Then $q_p = O(p^{\delta/2} n^{-1/2})$.

Assumption 6. For any $1 \le i \le d$, the $L_2$-norm of every row in $(I_{p_i} - A_i A_i^\top) B_i$ is larger than $c_1 p^{-1/2}$.

Assumption 2 implies $q_p < 1$, which is weaker than Assumption 5. Assumption 6 ensures that the proposed inference procedure can separate the components in the $d$ clusters from those not belonging to any cluster. See also Remark 3(iv) in Section 3 below. Those conditions are automatically fulfilled if $A^\top \binom{B}{0} = 0$, which is a condition imposed in Lam and Yao (2012).

3 A clustering algorithm

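With available observations $y_1, \cdots, y_n$, we propose below an algorithm (in five steps) to identify the latent $d$ clusters. To this end, we introduce some notation first. Let $\bar y = \frac{1}{n} \sum_{t=1}^{n} y_t$,

$$\widehat\Sigma_y(k) = \frac{1}{n-k} \sum_{t=1}^{n-k} (y_{t+k} - \bar y)(y_t - \bar y)^\top, \qquad \widehat M = \sum_{k=0}^{k_0} \widehat\Sigma_y(k) \widehat\Sigma_y(k)^\top, \qquad (3.1)$$

where $k_0 \ge 0$ is a prespecified integer.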
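The two quantities in (3.1) translate directly into a few lines of array code. The sketch below assumes NumPy and a data matrix `Y` of shape $n \times p$ with one row per time point; the function names `lagged_cov` and `m_hat` are illustrative, not from the paper.

```python
import numpy as np

def lagged_cov(Y, k):
    """Sample autocovariance at lag k as in (3.1); Y is n x p, one row per time point."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)                   # subtract the sample mean y-bar
    # (1 / (n - k)) * sum_t (y_{t+k} - ybar)(y_t - ybar)^T, a p x p matrix
    return Yc[k:].T @ Yc[:n - k] / (n - k)

def m_hat(Y, k0):
    """M-hat = sum_{k=0}^{k0} Sigma_hat(k) Sigma_hat(k)^T, symmetric and non-negative definite."""
    p = Y.shape[1]
    M = np.zeros((p, p))
    for k in range(k0 + 1):
        S = lagged_cov(Y, k)
        M += S @ S.T
    return M
```

Because each summand has the form $S S^\top$, the resulting matrix is symmetric with non-negative eigenvalues, so a symmetric eigen-solver can be used in the steps that follow.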

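Step 1 (Estimate the number of factors.) For $0 \le k \le k_0$, let $\hat\lambda_{k,1} \ge \cdots \ge \hat\lambda_{k,p} \ge 0$ be the eigenvalues of the matrix $\widehat\Sigma_y(k) \widehat\Sigma_y(k)^\top$. For a prespecified positive integer $J_0 \le p$, put $\widehat R_0 = 1$ and

$$\widehat R_j = \sum_{k=0}^{k_0} (1 - k/n)\, \hat\lambda_{k,j} \Big/ \sum_{k=0}^{k_0} (1 - k/n)\, \hat\lambda_{k,j+1}, \qquad 1 \le j \le J_0. \qquad (3.2)$$

We say that $\widehat R_s$ attains a local maximum if $\widehat R_s > \max\{\widehat R_{s-1}, \widehat R_{s+1}\}$. Let $\widehat R_{\hat\tau_1}$ and $\widehat R_{\hat\tau_2}$ be the two largest local maxima among $\widehat R_1, \cdots, \widehat R_{J_0 - 1}$. The estimators for the numbers of factors are then defined as

$$\hat r_0 = \min\{\hat\tau_1, \hat\tau_2\}, \qquad \hat r_0 + \hat r = \max\{\hat\tau_1, \hat\tau_2\}. \qquad (3.3)$$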

Step 2 (Estimate the loadings for common factors.) Let $\hat\gamma_1, \cdots, \hat\gamma_p$ be the orthonormal eigenvectors of the matrix $\widehat M$, arranged according to the descending order of the corresponding eigenvalues. The estimated loading matrix for the common factors is

$$\widehat A = (\hat\gamma_1, \cdots, \hat\gamma_{\hat r_0}). \qquad (3.4)$$

Step 3 (Estimate the loadings for cluster-specific factors.) Replace $y_t$ by $(I_p - \widehat A \widehat A^\top) y_t$ in (3.1), and repeat the eigenanalysis as in Step 2 above, but now denote the corresponding orthonormal eigenvectors by $\hat\zeta_1, \cdots, \hat\zeta_p$. The estimated loading matrix for the cluster-specific factors is

$$\widehat B = (\hat\zeta_1, \cdots, \hat\zeta_{\hat r}). \qquad (3.5)$$

Step 4 (Identify the components not belonging to any clusters.) Let $\hat b_1, \cdots, \hat b_p$ denote the row vectors of $\widehat B$. Then the identified index set for the components of $y_t$ not belonging to any cluster is

$$\widehat J_{d+1} = \{ j : 1 \le j \le p,\ \|\hat b_j\| \le \omega_p \}, \qquad (3.6)$$

where $\omega_p > 0$ is a constant satisfying the conditions $\omega_p = o(p^{-1/2})$ and $\dfrac{p^{\delta} n^{-1} + p^{-\delta}}{p\, \omega_p^2} = o(1)$.

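Step 5 (Clustering with k-means.) Let $\hat p_0 = p - |\widehat J_{d+1}|$, and let $\widehat F$ be the $\hat p_0 \times \hat r$ matrix obtained from $\widehat B$ by removing the rows with their indices in $\widehat J_{d+1}$. Let $\hat f_1, \cdots, \hat f_{\hat p_0}$ denote the $\hat p_0$ rows of $\widehat F$. Let $\widehat W$ be the $\hat p_0 \times \hat p_0$ matrix with the $(\ell, m)$-th element

$$\hat\rho_{\ell,m} = \frac{\hat f_\ell^\top \hat f_m}{(\hat f_\ell^\top \hat f_\ell \cdot \hat f_m^\top \hat f_m)^{1/2}}, \qquad 1 \le \ell, m \le \hat p_0.$$

Perform the k-means clustering (with $L_2$-distance) for the $\hat p_0$ rows of $\widehat W$, leading to the partition of $\{1, \cdots, \hat p_0\}$ into the $k$ clusters $\widehat J_{k,1}, \cdots, \widehat J_{k,k}$. Put

$$\mathrm{MGF}(k) = \frac{1}{\sum_{1 \le i < j \le k} |\widehat J_{k,i}|\, |\widehat J_{k,j}|} \sum_{1 \le i < j \le k} \sum_{\ell \in \widehat J_{k,i}} \sum_{m \in \widehat J_{k,j}} \hat\rho_{\ell,m}^2. \qquad (3.7)$$

The estimated number of clusters $\hat d$ is the value such that $\mathrm{MGF}(\hat d + 1)$ exhibits a sharp increase over $\mathrm{MGF}(k)$ for $k \le \hat d$, and $\mathrm{MGF}(k)$ keeps increasing for $k > \hat d + 1$. The $\hat d$ estimated clusters are $\widehat J_{\hat d,1}, \cdots, \widehat J_{\hat d,\hat d}$.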
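Steps 2 to 4 of the algorithm can be sketched as follows. This is an illustrative outline only, assuming NumPy, a data matrix `Y` of shape $n \times p$ (rows are time points), and that the numbers of factors `r0` and `r` have already been obtained in Step 1; all function names are hypothetical.

```python
import numpy as np

def _m_hat(Y, k0):
    """Sum over lags 0..k0 of Sigma_hat(k) Sigma_hat(k)^T for an n x p data matrix Y."""
    n, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    M = np.zeros((p, p))
    for k in range(k0 + 1):
        S = Yc[k:].T @ Yc[:n - k] / (n - k)
        M += S @ S.T
    return M

def _top_eigvecs(M, m):
    """Orthonormal eigenvectors of symmetric M for its m largest eigenvalues."""
    _, V = np.linalg.eigh(M)        # eigh returns eigenvalues in ascending order
    return V[:, ::-1][:, :m]

def estimate_loadings(Y, r0, r, k0=5):
    p = Y.shape[1]
    A_hat = _top_eigvecs(_m_hat(Y, k0), r0)             # Step 2
    # Step 3: rows of Y_res are ((I - A_hat A_hat^T) y_t)^T
    Y_res = Y @ (np.eye(p) - A_hat @ A_hat.T)
    B_hat = _top_eigvecs(_m_hat(Y_res, k0), r)
    return A_hat, B_hat

def unclustered_indices(B_hat, omega_p):
    """Step 4: (zero-based) indices j with row norm ||b_j|| <= omega_p."""
    row_norms = np.linalg.norm(B_hat, axis=1)
    return np.flatnonzero(row_norms <= omega_p)
```

A call such as `estimate_loadings(Y, r0=2, r=10, k0=5)` returns matrices with orthonormal columns; the threshold `omega_p` passed to `unclustered_indices` is a tuning parameter that has to be chosen by the user.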


Remark 3. (i) The estimators for $r_0$ and $r$ in Step 1 are based on Theorem 3 in Section 4 below. The intuition behind them is that the eigenvalues $\lambda_{k,1} \ge \cdots \ge \lambda_{k,p}\ (\ge 0)$ of the matrix $\Sigma_y(k) \Sigma_y(k)^\top$, where $\Sigma_y(k) = \mathrm{Cov}(y_{t+k}, y_t)$, satisfy the conditions

$$\lambda_{k,i}^{-1} = o(\lambda_{k,j}^{-1}) \quad \text{and} \quad \lambda_{k,j}^{-1} = o(\lambda_{k,\ell}^{-1}) \qquad \text{for } 1 \le i \le r_0,\ r_0 < j \le r_0 + r \text{ and } \ell > r_0 + r.$$

This is implied by the differences in strength among the common factor $x_t$, the cluster-specific factors $z_{t,i}$, and the idiosyncratic components $\varepsilon_t$; see Assumptions 3 and 4. Note that we use the ratios of the cumulative eigenvalues in (3.2) in order to add together the information from different lags $k$. In practice we set $k_0$ to be a small integer such as $k_0 \le 5$, as significant autocorrelation typically occurs at small lags. The results do not vary much with respect to the value of $k_0$ (see the simulation results in Section 5.1 below). We truncate the sequence $\{\widehat R_j\}$ at $J_0$ to alleviate the impact of '0/0'. In practice we may set $J_0 = p/4$ or $p/3$.

(ii) The ratio-based estimation in Step 1 is new. By Theorem 3, it holds that $\hat r_0 \to r_0$ and $\hat r \to r$ in probability. The existing approaches use the ratios of the ordered eigenvalues of the matrix $\widehat M$ instead (Lam and Yao 2012, Chang et al. 2015, Li et al. 2017), leading to an estimator which may not be consistent. See Example 1 below. Note that Lam and Yao (2012) show that their estimator $\tilde r_0$ fulfills the relation $P(\tilde r_0 \ge r_0) \to 1$ only.

(iii) Step 3 removes the common factors first before estimating B, as Lam and Yao (2012) showed that weak factors can be more accurately estimated by removing strong factors from the data first.

(iv) Once the numbers of factors are correctly specified, the factor loading spaces are relatively easy to identify. In fact $\mathcal{M}(\widehat A)$ is a consistent estimator for $\mathcal{M}(A)$. However $\mathcal{M}(\widehat B)$ is a consistent estimator for $\mathcal{M}\{(I_p - AA^\top)\binom{B}{0}\}$ instead of $\mathcal{M}\{\binom{B}{0}\}$. See Theorem 2 in Section 4 below. Furthermore the last $p_{d+1}$ rows of $(I_p - AA^\top)\binom{B}{0}$ are no longer 0. Nevertheless when both $r_0$ and $r$ are small in relation to $p$, those $p_{d+1}$ zero-rows can be recovered from $\widehat B$ in Step 4. See Theorems 4-5 in Section 4 below.

(v) Given the block diagonal structure of $B$ in (2.2), the $d$ clusters can be identified easily if we take the $(i,j)$-th element of $BB^\top$ as the similarity measure between the $i$-th and the $j$-th components, or simply apply the k-means method to the rows of $BB^\top$. But applying the k-means method directly to the rows of $B$ will not do. Theorem 4 indicates that the block diagonal structure, though masked by asymptotically diminishing 'noise', is still present in $\widehat B$ via a latent permutation. Accordingly the cluster analysis in Step 5 is based on the correlation-type measures among the rows of $\widehat F$, which is an estimator of $B$.
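The correlation-type similarity, the k-means step and the MGF criterion of Step 5 can be sketched as below. This is a minimal illustration assuming NumPy; the k-means routine is a plain Lloyd iteration with a deterministic farthest-point start, which is an assumption made here for reproducibility (the initialisation is not specified in the text), and the function names are hypothetical.

```python
import numpy as np

def similarity_matrix(F):
    """W-hat with (l, m) entry f_l^T f_m / (|f_l| |f_m|), for the rows f_l of F."""
    G = F @ F.T
    norms = np.sqrt(np.diag(G))
    return G / np.outer(norms, norms)

def kmeans_rows(X, k, n_iter=100):
    """Lloyd's k-means on the rows of X with Euclidean distance (illustrative only)."""
    centres = [X[0]]
    for _ in range(1, k):            # farthest-point initialisation (an assumption)
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d2)])
    centres = np.array(centres)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                        for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels

def mgf(W, labels):
    """Average of rho^2 over pairs (l, m) in different clusters, cf. (3.7); needs >= 2 clusters."""
    different = labels[:, None] != labels[None, :]
    return (W[different] ** 2).mean()

# Toy loadings with an exact two-block structure (hypothetical numbers).
F = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])
W = similarity_matrix(F)
labels = kmeans_rows(W, 2)
```

In this toy case the between-cluster correlations vanish, so `mgf(W, labels)` is zero for the correct partition and strictly positive for a partition that splits a true cluster. In practice one would evaluate `mgf` over a range of `k` and look for the sharp increase described in Step 5.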


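Example 1. Consider a simple model of the form (2.2) in which $\varepsilon_t \equiv 0$, $r_0 = 1$, $r = 2$, and

$$x_t = p^{1/2}(u_{1,t} + a_1 u_{1,t-1} + u_{2,t} + a_2 u_{2,t-1}),$$

$$z_{1,t} = p^{1/2 - \delta/2}(u_{2,t} + a_2 u_{2,t-1}), \qquad z_{2,t} = p^{1/2 - \delta/2}(u_{3,t} + a_3 u_{3,t-1}),$$

where $a_1, a_2, a_3$ are constants, and $u_{i,t}$, for different $i, t$, are independent and $N(0,1)$. Let $M = \sum_{0 \le k \le 1} \Sigma_y(k) \Sigma_y(k)^\top$, and let $\lambda_1 \ge \lambda_2 \ge \lambda_3$ be the three largest eigenvalues of $M$. It can be shown that $\lambda_1 \asymp p^2$, $\lambda_3 = p^{2 - 2\delta}\{(1 + a_3^2)^2 + a_3^2\}$ and $\lambda_2 \asymp p^{2 - \delta}$ provided $(a_1 - a_2)^2 (1 - a_1 a_2) \ne 0$. Hence $\lambda_1/\lambda_2 \asymp \lambda_2/\lambda_3 \asymp p^{\delta}$. This shows that $r_0\ (= 1)$ cannot be estimated stably based on the ratios of the eigenvalues of $\widehat M$ for this example.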
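For comparison with the eigenvalue ratios of $\widehat M$ discussed in Example 1, the ratio statistic (3.2) and the selection rule (3.3) of Step 1 can be sketched as follows. The sketch assumes NumPy and takes the eigenvalues of $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ as given; the function names and the toy eigenvalues are illustrative, and $J_0$ is taken strictly smaller than the number of available eigenvalues so that $\hat\lambda_{k,J_0+1}$ exists.

```python
import numpy as np

def ratio_statistics(eigvals, n, J0):
    """Return [R_0, R_1, ..., R_J0] following (3.2), with R_0 = 1.

    eigvals[k][j-1] is the j-th largest eigenvalue of Sigma_hat(k) Sigma_hat(k)^T,
    for k = 0, ..., k0; each row must contain at least J0 + 1 eigenvalues.
    """
    lam = np.asarray(eigvals, dtype=float)
    w = 1.0 - np.arange(lam.shape[0]) / n        # weights (1 - k/n)
    s = w @ lam                                  # weighted sums over lags, one per j
    return np.concatenate(([1.0], s[:J0] / s[1:J0 + 1]))

def estimate_numbers_of_factors(R):
    """Pick the two largest local maxima among R_1, ..., R_{J0-1}, as in (3.3)."""
    J0 = len(R) - 1
    peaks = [j for j in range(1, J0) if R[j] > max(R[j - 1], R[j + 1])]
    top2 = sorted(peaks, key=lambda j: R[j], reverse=True)[:2]
    if len(top2) < 2:
        raise ValueError("fewer than two local maxima found")
    r0_hat = min(top2)
    r_hat = max(top2) - r0_hat
    return r0_hat, r_hat

# Toy eigenvalues (hypothetical): 2 strong, 3 weak, then idiosyncratic levels.
toy = [[100, 90, 10, 9, 8, 0.1, 0.09, 0.08, 0.07, 0.06]]
R = ratio_statistics(toy, n=100, J0=8)
r0_hat, r_hat = estimate_numbers_of_factors(R)
```

With these toy eigenvalues the two dominant peaks of the ratio sequence sit at $j = 2$ and $j = 5$, giving `r0_hat = 2` and `r_hat = 3`.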

4 Asymptotic properties

4.1 On estimation for factors

Since only the factor loading space $\mathcal{M}(A)$ is uniquely defined by (2.2) (see the discussion below Assumption 2), we measure the estimation error in terms of its (unique) projection matrix $AA^\top$.

Theorem 1. Let Assumptions 1-4 hold. For $\widehat A$ defined in (3.4), it holds that

$$\|\widehat A \widehat A^\top - AA^\top\| = O_p(n^{-1/2} + p^{-\delta/2}). \qquad (4.1)$$

Theorem 1 shows that in the absence of the weak factor $z_t$, the estimation for the strong factor loading space $\mathcal{M}(A)$ achieves the root-$n$ convergence rate in spite of diverging $p$. Assumption 2 ensures that the rank of the matrix $B_* \equiv (I_p - AA^\top)\binom{B}{0}$ is $r$. Denote by $P_{A^\perp B} = B_* (B_*^\top B_*)^{-1} B_*^\top$ the projection matrix onto $\mathcal{M}\{(I_p - AA^\top)\binom{B}{0}\}$, of which $\mathcal{M}(\widehat B)$ is a consistent estimator; see Theorem 2 below, and also Remark 3(iv).

Theorem 2. Let Assumptions 1-4 hold. For $\widehat B$ defined in (3.5), it holds that

$$\|\widehat B \widehat B^\top - P_{A^\perp B}\| = O_p(p^{\delta/2} n^{-1/2} + p^{-\delta/2}). \qquad (4.2)$$

Theorem 3 below specifies the asymptotic behaviour of the ratios of the cumulative eigenvalues used in estimating the numbers of factors in Step 1 in Section 3 above. Since $(\log p)^2 = o(p^{2\delta})$ and $(\log p)^2 = o\{p^{2-2\delta}/(p^2 n^{-2} + \log^2 p)\}$, it implies that $\hat r_0 \to r_0$ and $\hat r \to r$ in probability provided that $J_0 > r_0 + r$ is fixed.

Theorem 3. Let Assumptions 1-4 hold. For $\widehat R_j$ defined in (3.2), it holds for some constant $C > 0$ that

$$\lim_{n,p \to \infty} P(\widehat R_j < C) = 1 \quad \text{for } j = 1, \cdots, r_0 - 1, \qquad (4.3)$$

$$\widehat R_{r_0}^{-1} = O_p(p^{-2\delta}), \qquad \widehat R_{r_0 + r}^{-1} = O_p\Big( \frac{p^2/n^2 + \log^2 p}{p^{2 - 2\delta}} \Big), \qquad (4.4)$$

$$\lim_{n,p \to \infty} P(\widehat R_j < C) = 1 \quad \text{for } j = r_0 + 1, \cdots, r_0 + r - 1, \text{ and} \qquad (4.5)$$

$$\widehat R_j = O_p(\log^2 p) \quad \text{for } j = r_0 + r + 1, \cdots, r_0 + r + s, \qquad (4.6)$$

where $s$ is a positive fixed integer.

4.2 On clustering

Our goal is to recover the $d$ latent clusters in model (2.2). Unfortunately $\widehat B$ provides merely a consistent estimator for the space $\mathcal{M}\{(I_p - AA^\top)\binom{B}{0}\}$; see Theorem 2 above. Nevertheless it contains sufficient information for identifying the block diagonal structure of $B$ as well as the components of $y_t$ not belonging to any cluster. See Theorem 4 below.

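Theorem 4. Let Assumptions 1-5 hold. There exists a $p \times \hat r$ matrix $\check B$ which is a latent row-permutation of $\widehat B$ defined in (3.5). Write $\check B \check B^\top$ into the two parts:

$$\check B \check B^\top = H_{\mathrm{diag}} + H_{\mathrm{err}},$$

where $H_{\mathrm{diag}}$ is a block diagonal matrix of the same structure as $\binom{B}{0}\binom{B}{0}^\top$, i.e.

$$H_{\mathrm{diag}} = \mathrm{diag}(H_1, \cdots, H_d, 0), \qquad (4.7)$$

while $H_{\mathrm{err}}$ has all the elements in the first $d$ diagonal blocks equal to 0. Then it holds for some constant $C > 0$ that $\|H_i\|_F \ge C$ for $i = 1, \cdots, d$, and

$$\|H_{\mathrm{err}}\|_F = O_p(p^{\delta/2} n^{-1/2} + p^{-\delta/2}). \qquad (4.8)$$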

Theorem 4 shows that the components of $y_t$ not belonging to any cluster correspond to the rows of $\widehat B$ with norms converging to 0, which motivates Step 4 of the algorithm in Section 3. Theorem 5 below indicates that the misclassification rate converges to 0 when $p_{d+1} \asymp p$. Let $J_{d+1}$ be the collection of the indices of the components of $y_t$ not belonging to any one of the $d$ clusters, and let $J^c_{d+1} = \{1, \cdots, p\} - J_{d+1}$ be the complement of $J_{d+1}$. Theorem 6 provides the underpinning for Step 5 of the algorithm in Section 3.


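Theorem 6. Let Assumptions 1-6 hold. Let $\tilde J_1, \cdots, \tilde J_d$ be a partition of $\{1, \cdots, \hat p_0\}$ such that $\tilde J_j$ contains the indices of the components of $y_t$ belonging to the $j$-th cluster, $j = 1, \cdots, d$. Then

$$\sum_{1 \le i < j \le d} \sum_{\ell \in \tilde J_i} \sum_{m \in \tilde J_j} \hat\rho_{\ell,m}^2 = O_p(p^{\delta} n^{-1} \omega_p^{-4} + p^{-\delta} \omega_p^{-4}) = o_p(p^2),$$

where $\hat\rho_{\ell,m}$ is defined in Step 5 in Section 3. Furthermore it holds for some constant $C > 0$ that

$$P\Big( \sum_{\ell, m \in \tilde J_j} \hat\rho_{\ell,m}^2 > C p^2 \Big) \to 1, \qquad j = 1, \cdots, d.$$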

5 Numerical properties

5.1 Simulation

We illustrate the proposed methodology through a simulation with model (2.2). We draw the elements of $A$ and $B_j$ independently from, respectively, $U(-p^{-1/2}, p^{-1/2})$ and $U(-p_j^{-1/2}, p_j^{-1/2})$. All component series of $x_t$ and $z_t$ are independent, and are AR(1) and MA(1), respectively, with Gaussian innovations. All components of $\varepsilon_t$ are independent MA(1) with $N(0, 0.25)$ innovations. All the AR and MA coefficients are drawn randomly from $U\{(-0.95, -0.4) \cup (0.4, 0.95)\}$. The standard deviations of the components of $p^{-1/2} x_t$ and $p^{\delta/2 - 1/2} z_t$ are drawn randomly from $U(1, 2)$. In this way, all the components of $x_t$ are strong factors with $\delta = 0$, and all the components of $z_t$ are weak factors at strength $\delta \in (0,1)$; see Assumption 3 and Remark 1(i).

We set $n = 800$, $p = 450$, $d = 5$, $r_0 = r_1 = \cdots = r_5 = 2$ (and, hence, $r = 10$), $p_1 = \cdots = p_5 = 50$, and $k_0 = 5$. Therefore among the 450 component series of $y_t$, the first $(p_0 =)\,250$ components form 5 clusters of equal size 50, and the last $(p_{d+1} =)\,200$ components do not belong to any cluster. The factor strength of $z_t$ is set at four different levels $\delta = 0.2, 0.3, 0.4, 0.5$. For each setting, we replicate the experiment 1000 times.
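One possible way to generate a single replication of this design is sketched below, assuming NumPy. Several details are assumptions made for illustration: $N(0, 0.25)$ is read as variance 0.25, the AR(1) series are started at zero with a burn-in of 200 steps, the series are rescaled to the target standard deviations using their theoretical variances, and the function name `simulate` is hypothetical.

```python
import numpy as np

def simulate(n=800, p=450, d=5, r0=2, rj=2, pj=50, delta=0.3, seed=0):
    """Draw one n x p data matrix Y from model (2.2) under the simulation design."""
    rng = np.random.default_rng(seed)
    r = d * rj

    def coef(size):
        # coefficients from U{(-0.95, -0.4) u (0.4, 0.95)}
        return rng.choice([-1.0, 1.0], size) * rng.uniform(0.4, 0.95, size)

    def ar1(phi, sd, burn=200):
        # rows are AR(1) series with unit-variance innovations, rescaled to sd
        e = rng.normal(size=(len(phi), n + burn))
        x = np.zeros_like(e)
        for t in range(1, n + burn):
            x[:, t] = phi * x[:, t - 1] + e[:, t]
        return x[:, burn:] * (sd * np.sqrt(1.0 - phi ** 2))[:, None]

    def ma1(theta, sd=None, innov_sd=1.0):
        # rows are MA(1) series; rescaled to sd when sd is given
        e = rng.normal(scale=innov_sd, size=(len(theta), n + 1))
        z = e[:, 1:] + theta[:, None] * e[:, :-1]
        if sd is None:
            return z
        return z * (sd / (innov_sd * np.sqrt(1.0 + theta ** 2)))[:, None]

    A = rng.uniform(-p ** -0.5, p ** -0.5, size=(p, r0))
    B = np.zeros((p, r))                       # last p - d * pj rows stay zero
    for j in range(d):
        B[j * pj:(j + 1) * pj, j * rj:(j + 1) * rj] = rng.uniform(
            -pj ** -0.5, pj ** -0.5, size=(pj, rj))
    X = ar1(coef(r0), p ** 0.5 * rng.uniform(1, 2, r0))               # strong factors
    Z = ma1(coef(r), p ** (0.5 - delta / 2) * rng.uniform(1, 2, r))   # weak factors
    E = ma1(coef(p), innov_sd=0.5)                                    # idiosyncratic part
    Y = (A @ X + B @ Z + E).T
    return Y, A, B
```

Note that, as in the description above, the drawn `A` and `B` are not orthonormalised; the comparison with the estimated loading spaces is made through projection matrices.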

The numbers of factors $r_0$ and $r$ are estimated based on the ratios $\widehat R_j$, as in (3.3), with $k_0 = 1, \cdots, 5$ and $J_0 = [p/4]$. For comparison, we also report the estimates based on the ratios of eigenvalues of $\widehat M$ with $k_0 = 0, \cdots, 5$, which is the standard method used in the literature (see, e.g. Lam and Yao 2012). The relative frequencies of $\hat r_0 = r_0$ and $\hat r_0 + \hat r = r_0 + r$ are reported in Table 1.

Table 1: The relative frequencies of $\hat r_0 = r_0$ and $\hat r_0 + \hat r = r_0 + r$ in a simulation with 1000 replications, where $\hat r_0$ and $\hat r$ are estimated by the method (3.3) based on the ratios $\widehat R_j$ with $k_0 = 1, \cdots, 5$, or by the ratios of the eigenvalues of $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ with $k = 0, 1, \cdots, 5$. Columns 2-5 report $\hat r_0 = r_0$; columns 6-9 report $\hat r_0 + \hat r = r_0 + r$.

| Estimation method | δ=.2 | δ=.3 | δ=.4 | δ=.5 | δ=.2 | δ=.3 | δ=.4 | δ=.5 |
|---|---|---|---|---|---|---|---|---|
| $\widehat R_j$ ($k_0 = 1$) | .446 | .803 | .973 | .999 | 1 | 1 | 1 | 1 |
| $\widehat R_j$ ($k_0 = 2$) | .476 | .813 | .970 | .999 | .997 | 1 | 1 | 1 |
| $\widehat R_j$ ($k_0 = 3$) | .477 | .811 | .970 | .997 | .998 | 1 | 1 | 1 |
| $\widehat R_j$ ($k_0 = 4$) | .470 | .804 | .965 | .995 | .995 | 1 | 1 | .999 |
| $\widehat R_j$ ($k_0 = 5$) | .465 | .805 | .963 | .995 | .997 | 1 | 1 | .998 |
| $\widehat M$ ($k_0 = 0$) | .410 | .762 | .967 | 1 | 1 | 1 | 1 | 1 |
| $\widehat M$ ($k_0 = 1$) | .451 | .808 | .974 | 1 | 1 | .999 | .866 | .339 |
| $\widehat M$ ($k_0 = 2$) | .499 | .824 | .972 | .999 | 1 | .998 | .918 | .367 |
| $\widehat M$ ($k_0 = 3$) | .520 | .822 | .970 | .997 | 1 | .996 | .843 | .296 |
| $\widehat M$ ($k_0 = 4$) | .529 | .815 | .966 | .995 | 1 | .992 | .783 | .250 |
| $\widehat M$ ($k_0 = 5$) | .531 | .816 | .964 | .995 | 1 | .990 | .730 | .193 |

Overall the method based on the ratios of the cumulative eigenvalues $\widehat R_j$ provides accurate and robust performance and is not sensitive to the choice of $k_0$. The estimation based on the eigenvalues of $\widehat M$ with $k_0 \ge 1$ is competitive for $r_0$, but is considerably poorer for $r_0 + r$ with $\delta = 0.4$ and $0.5$. Using $\widehat M$ with $k_0 = 0$ leads to weaker estimates for $r_0$ when $\delta = 0.2$ and $0.3$. It is noticeable that the performance of the estimation for the number of common factors $r_0$ improves as $\delta$ increases. This is due to the fact that the larger $\delta$ is, the larger the difference in factor strength between the common factor $x_t$ and the cluster-specific factor $z_t$ is. Therefore it is easier to tell $x_t$ apart from $z_t$ for larger $\delta$. The performance for estimating $r_0 + r$ based on $\widehat R_j$ is better than that for $r_0$, as in terms of factor strength the difference between $(x_t, z_t)$ and $\varepsilon_t$ is significantly greater than that between $x_t$ and $(z_t, \varepsilon_t)$.

Recall that $P_{A^\perp B}$ is the projection matrix onto the space $\mathcal{M}\{(I_p - AA^\top)\binom{B}{0}\}$; see Theorem 2 and also Remark 3(iv). Table 2 contains the means and standard deviations of the estimation errors for the factor loading spaces, $\|\widehat A \widehat A^\top - AA^\top\|_F$ and $\|\widehat B \widehat B^\top - P_{A^\perp B}\|_F$, where $\widehat A$ is estimated by the eigenvectors of the matrix $\widehat M$ in (3.1) with $k_0 = 1, \cdots, 5$; see Step 2 of the algorithm stated in Section 3. See also Step 3 there for the similar procedure in estimating $B$. For comparison, we also include the estimates obtained with $\widehat M$ replaced by $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ with $k =$


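Table 2: The means and standard deviations (in parentheses) of $\|\widehat A \widehat A^\top - AA^\top\|_F$ and $\|\widehat B \widehat B^\top - P_{A^\perp B}\|_F$ in a simulation with 1000 replications, where $\widehat A$ is estimated by the eigenvectors of $\widehat M$ in (3.1) (with $k_0 = 1, \cdots, 5$), or by those of $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ (for $k = 0, 1, \cdots, 5$), and $\widehat B$ is estimated in a similar manner. Both $r_0$ and $r$ are assumed to be known. Columns 2-5 report $\|\widehat A \widehat A^\top - AA^\top\|_F$; columns 6-9 report $\|\widehat B \widehat B^\top - P_{A^\perp B}\|_F$.

| Estimation method | δ=.2 | δ=.3 | δ=.4 | δ=.5 | δ=.2 | δ=.3 | δ=.4 | δ=.5 |
|---|---|---|---|---|---|---|---|---|
| $\widehat M$ ($k_0 = 1$) | .375(.320) | .157(.060) | .103(.028) | .081(.017) | .459(.290) | .354(.033) | .440(.029) | .592(.040) |
| $\widehat M$ ($k_0 = 2$) | .329(.275) | .153(.056) | .105(.027) | .083(.018) | .419(.247) | .354(.031) | .444(.030) | .597(.041) |
| $\widehat M$ ($k_0 = 3$) | .318(.267) | .154(.056) | .106(.027) | .085(.018) | .410(.239) | .357(.031) | .448(.030) | .602(.042) |
| $\widehat M$ ($k_0 = 4$) | .315(.264) | .154(.055) | .108(.028) | .086(.018) | .409(.236) | .359(.031) | .452(.031) | .606(.042) |
| $\widehat M$ ($k_0 = 5$) | .313(.263) | .155(.055) | .109(.028) | .087(.019) | .409(.235) | .362(.031) | .455(.031) | .610(.043) |
| $\widehat\Sigma_y(0)\widehat\Sigma_y(0)^\top$ | .474(.390) | .169(.069) | .105(.028) | .079(.017) | .541(.361) | .345(.040) | .418(.027) | .562(.038) |
| $\widehat\Sigma_y(1)\widehat\Sigma_y(1)^\top$ | .351(.265) | .201(.077) | .147(.046) | .121(.033) | .635(.200) | .702(.053) | .907(.066) | 1.19(.083) |
| $\widehat\Sigma_y(2)\widehat\Sigma_y(2)^\top$ | .372(.176) | .295(.133) | .241(.113) | .210(.094) | 2.04(.156) | 2.25(.134) | 2.46(.120) | 2.66(.107) |
| $\widehat\Sigma_y(3)\widehat\Sigma_y(3)^\top$ | .605(.336) | .489(.278) | .407(.242) | .368(.220) | 2.10(.171) | 2.29(.146) | 2.49(.124) | 2.70(.114) |
| $\widehat\Sigma_y(4)\widehat\Sigma_y(4)^\top$ | .810(.406) | .666(.349) | .565(.314) | .547(.323) | 2.16(.185) | 2.33(.150) | 2.52(.138) | 2.74(.125) |
| $\widehat\Sigma_y(5)\widehat\Sigma_y(5)^\top$ | .946(.411) | .786(.371) | .690(.346) | .661(.342) | 2.20(.189) | 2.36(.157) | 2.55(.143) | 2.77(.131) |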

respect to the different values of $k_0$. Furthermore using a single-lagged covariance matrix for estimating factor loading spaces is not recommended. When $\delta$ increases, the error $\|\widehat A \widehat A^\top - AA^\top\|_F$ decreases, as indicated by Theorem 1. However the pattern in the error $\|\widehat B \widehat B^\top - P_{A^\perp B}\|_F$ is more complex, as it decreases initially and then increases as $\delta$ increases, which is in line with the asymptotic result in Theorem 2.

In the sequel, we only report the results with $\hat r_0$ and $\hat r$ estimated by (3.3), and the factor loading spaces estimated by the eigenvectors of $\widehat M$ with $k_0 = 5$.

To examine the effectiveness of Step 4 of the algorithm, we plot in Figure 1 the sample percentiles at the 5%, 50% and 95% levels of each $\|\hat b_j\|$ over the 1000 replications, for $j = 1, \cdots, 450$. It is clear that the norms of the last $200\ (= p_{d+1})$ components (not belonging to any cluster) indeed drop flat and are close to 0. This indicates clearly that it is possible to distinguish the components of $y_t$ not belonging to any cluster from those belonging to one of the $d$ clusters. Note that the indices of the components not belonging to any cluster are identified as those in $\widehat J_{d+1}$ in (3.6), which is defined in terms of a threshold $\omega_p = o(p^{-1/2})$. We experiment with three choices of this tuning parameter, namely $\omega_{p1} = (\hat r/p)^{1/2}/\ln p$, $\omega_{p2} = \{\hat r/(p \ln p)\}^{1/2}$ and $\omega_{p3} = \{\hat r/(p \ln\ln p)\}^{1/2}$. Recall that $J^c_{d+1}$ contains all the indices of the components of $y_t$ belonging to one of the $d$ clusters. The means and standard deviations of the two types of misclassification errors $E_1 = |J^c_{d+1} \cap \widehat J_{d+1}| / |J^c_{d+1}|$ and $E_2 = |J_{d+1} \cap \widehat J^c_{d+1}| / |J_{d+1}|$ over the 1000 replications are reported in Table 3. Among the three choices, $\omega_{p2}$ appears to work best, as the two types of errors are both small. The increase in the errors due to the estimation of $r_0$ and $r$ is not significant when $\delta = 0.4, 0.5$. But the increase in $E_2$ due to unknown $r$ and $r_0$ is noticeable when $\delta = 0.2$.
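The three candidate thresholds and the two error rates $E_1$ and $E_2$ are simple to compute. The sketch below uses only the Python standard library; the function names are illustrative and indices are taken to be zero-based.

```python
import math

def thresholds(r_hat, p):
    """The three candidate values of omega_p considered in the simulation."""
    return {
        "omega_p1": math.sqrt(r_hat / p) / math.log(p),
        "omega_p2": math.sqrt(r_hat / (p * math.log(p))),
        "omega_p3": math.sqrt(r_hat / (p * math.log(math.log(p)))),
    }

def misclassification_rates(true_unclustered, est_unclustered, p):
    """Return (E1, E2) for index sets of components not belonging to any cluster.

    E1: share of truly clustered components that are flagged as unclustered.
    E2: share of truly unclustered components that are not flagged.
    """
    all_idx = set(range(p))
    J, J_hat = set(true_unclustered), set(est_unclustered)
    Jc, Jc_hat = all_idx - J, all_idx - J_hat
    e1 = len(Jc & J_hat) / len(Jc)
    e2 = len(J & Jc_hat) / len(J)
    return e1, e2

# Toy check with p = 10 (hypothetical index sets).
e1, e2 = misclassification_rates({6, 7, 8, 9}, {5, 6, 7, 8}, p=10)
t = thresholds(r_hat=10, p=450)
```

For any $p > e^e$ the three thresholds are ordered as $\omega_{p1} < \omega_{p2} < \omega_{p3}$, which is consistent with $E_1$ growing and $E_2$ shrinking from $\omega_{p1}$ to $\omega_{p3}$ in Table 3.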


Table 3: The means and standard deviations (in parentheses) of the error rates $E_1 = |J^c_{d+1} \cap \widehat J_{d+1}| / |J^c_{d+1}|$ and $E_2 = |J_{d+1} \cap \widehat J^c_{d+1}| / |J_{d+1}|$ in a simulation with 1000 replications with the 3 possible choices of threshold $\omega_p$ in (3.6), and the numbers of factors $r_0$ and $r$ either known or to be estimated. Columns 3-6: $r_0$ and $r$ are known; columns 7-10: $r_0$ and $r$ are estimated.

| Error | Threshold | δ=.2 | δ=.3 | δ=.4 | δ=.5 | δ=.2 | δ=.3 | δ=.4 | δ=.5 |
|---|---|---|---|---|---|---|---|---|---|
| $E_1$ | $\omega_{p1}$ | .004(.004) | .005(.004) | .004(.004) | .003(.003) | .009(.036) | .004(.004) | .004(.004) | .003(.003) |
| $E_1$ | $\omega_{p2}$ | .043(.013) | .044(.012) | .043(.012) | .041(.012) | .047(.073) | .041(.015) | .042(.013) | .041(.012) |
| $E_1$ | $\omega_{p3}$ | .156(.021) | .157(.020) | .156(.021) | .156(.020) | .162(.072) | .155(.020) | .156(.021) | .155(.020) |
| $E_2$ | $\omega_{p1}$ | .147(.196) | .060(.062) | .115(.061) | .327(.089) | .369(.339) | .171(.264) | .141(.145) | .329(.093) |
| $E_2$ | $\omega_{p2}$ | .011(.050) | .000(.000) | .000(.000) | .000(.000) | .112(.128) | .042(.094) | .011(.052) | .001(.014) |
| $E_2$ | $\omega_{p3}$ | .000(.000) | .000(.000) | .000(.000) | .000(.000) | .001(.031) | .000(.000) | .000(.000) | .000(.000) |

Table 4: The means and standard deviations (STD) of the error rates $S/|\hat J^c_{d+1}\cap J^c_{d+1}|$ in a simulation with 1000 replications with the numbers of factors $r_0$ and $r$ either known or to be estimated.

            r0 and r are known            r0 and r are estimated
        δ=.2    δ=.3   δ=.4   δ=.5    δ=.2    δ=.3    δ=.4    δ=.5
mean    .0025   0      0      0       .0266   .0076   .0015   .0002
STD     .0123   0      0      0       .0530   .0168   .0079   .0028

Figure 1 also shows that when $\delta = 0.2, 0.3$, the 95% percentiles of the last 200 minimum norms are clearly greater than 0, though the 50% percentiles are still much smaller than the 5% percentiles of the first 250 ($= p_0$) norms.

In the sequel, we only report the results with $\omega_{p2} = \{\hat r/(p\ln p)\}^{1/2}$.

The number of clusters is estimated based on MGF in (3.7). Figure 2 presents the boxplots of MGF(k) for $k = 2, \cdots, 10$. We calculated MGF(·) with $(r_0, r)$ being either known or estimated by $(\hat r_0, \hat r)$. In either case, the values of MGF(k) increase sharply from $k = 5$ to $k = 6$, and keep increasing for $k > 6$. Hence we set $\hat d = 5$. Then the $\hat d$ clusters are obtained by performing the k-means clustering (with $k = \hat d$) for the $\hat p_0$ rows of $\hat W$, where $\hat p_0 = p - |\hat J_{d+1}|$. See Step 5 of the algorithm in Section 3. As the error rates in estimating $J^c_{d+1}$ have already been reported in Table 3, we now concentrate on the components of $y_t$ with indices in $\hat J^c_{d+1}\cap J^c_{d+1}$, and count the number of them which were misplaced by the k-means clustering. Let $S$ denote the number of misplaced components. Both the means and the standard deviations of the error rates $S/|\hat J^c_{d+1}\cap J^c_{d+1}|$ over 1000 replications are reported in Table 4. It shows clearly that the k-means clustering identifies the latent clusters very accurately, and the difference in performance due to estimating $(r_0, r)$ is also small.
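Counting the misplaced components $S$ requires matching the arbitrary labels returned by k-means to the true cluster labels. The text does not say how this matching was done; the Python sketch below assumes that the labels are matched by the permutation that maximises agreement, which is feasible by brute force for a small number of clusters such as $d = 5$. The function name `misplaced_count` is hypothetical.

```python
from itertools import permutations
import numpy as np

def misplaced_count(true_labels, est_labels, d):
    """Number of components misplaced under the best relabelling of clusters.

    Labels are integers in {0, ..., d-1}. The matching by exhaustive search
    over permutations is an assumption, not a step stated in the paper.
    """
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    # confusion[i, j] = number of series in true cluster i assigned to estimated cluster j
    confusion = np.zeros((d, d), dtype=int)
    for t, e in zip(true_labels, est_labels):
        confusion[t, e] += 1
    best_correct = max(
        sum(confusion[i, perm[i]] for i in range(d))
        for perm in permutations(range(d))
    )
    return len(true_labels) - best_correct

if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    estimate = [2, 2, 0, 0, 1, 0]      # same partition up to relabelling, one error
    S = misplaced_count(truth, estimate, d=3)
    print(S, S / len(truth))           # misplaced count and error rate
```

For larger $d$ the exhaustive search becomes expensive, and an assignment solver would be the usual replacement.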

More simulation results are collected in an online supplementary file, exhibiting patterns similar to those reported above under two different settings (i.e. $n = 400$, $p = 300$, $d = 4$, and $n = 200$, $p = 240$, $d = 6$).

Figure 1: Sample percentiles of $\|\hat b_j\|$ at the levels of 5% (blue), 50% (black) and 95% (red) are plotted against $j$. The four columns from left to right correspond to, respectively, $\delta = 0.2, 0.3, 0.4, 0.5$. [Two rows of panels, titled "r, r0 known" and "r, r0 unknown"; horizontal axis $j$, vertical axis $\|\hat b_j\|$.]

Figure 2: Box-plots for MGF. The four columns from left to right correspond to, respectively, $\delta = 0.2, 0.3, 0.4, 0.5$. [Two rows of panels, titled "r, r0 known" and "r, r0 unknown"; horizontal axis: No. of clusters, vertical axis: MGF.]

5.2 Real data illustration

We consider the daily returns of the stocks listed in S&P500 from 31 December 2014 to 31 December 2019. After removing those which were not traded on every trading day during the period, there are $p = 477$ stocks which were traded on $n = 1259$ trading days. Those stocks are from 11 industry sectors:

1. Communication Services   2. Consumer Discretionary   3. Consumer Staples

4. Energy   5. Financials   6. Health Care

7. Industrials   8. Information Technology   9. Materials

10. Real Estate   11. Utilities

Figure 3: Plot of $\hat R_j$ against $j$ for $2 \le j \le 20$.

The conventional wisdom suggests that the companies in the same industry sector share some common features. We apply the proposed 5-step algorithm in Section 3 to the return series to cluster those 477 stocks into different groups.

Step 1 is to estimate the numbers of strong factors and cluster-specific weak factors. To this end, we calculate $\hat R_j$ as in (3.2) with $k_0 = 5$. It turns out $\hat R_1 = 32.53$ is much larger than all the others, while $\hat R_j$ for $j \ge 2$ are plotted in Figure 3. By (3.3), $\hat r_0 = 1$ and $\hat r_0 + \hat r = 4$. Note that the estimators $\hat r_0$ and $\hat r_0 + \hat r$ are unchanged with $k_0 = 1, \cdots, 4$. While the existence of $\hat r_0 = 1$ strong and common factor is reasonable, it is most unlikely that there are merely $\hat r = 3$ cluster-specific weak factors. Note that the estimators in (3.3) are derived under the assumption that all the $r$ cluster-specific (i.e. weak) factors are of the same factor strength; see Remark 1(ii) in Section 2 above. In practice weak factors may have different degrees of strength, implying that we should also take into account the 3rd and the 4th largest local maxima of $\hat R_j$. Hence we take $\hat r_0 + \hat r = 10$ (or perhaps also 13), as Figure 3 suggests that there are 3 factors with factor strength $\delta_1 > 0$, and a further 6 factors with strength $\delta_2 \in (\delta_1, 1)$.
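The ratio statistics discussed above can be sketched in Python as follows. Formula (3.2) is not reproduced in this section, so the sketch assumes that $\hat R_j$ is the ratio $\hat\lambda_j/\hat\lambda_{j+1}$ of consecutive eigenvalues of $\hat M = \sum_{k=0}^{k_0}\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$, with $\hat\Sigma_y(k)$ a mean-centred lag-$k$ sample autocovariance matrix. The function names `sample_autocov` and `eigen_ratios` are hypothetical.

```python
import numpy as np

def sample_autocov(y, k):
    """Lag-k sample autocovariance of an (n, p) array y (rows are time points)."""
    n = y.shape[0]
    yc = y - y.mean(axis=0)
    # (1/(n-k)) * sum_t (y_{t+k} - ybar)(y_t - ybar)^T
    return yc[k:].T @ yc[: n - k] / (n - k)

def eigen_ratios(y, k0, j_max):
    """Ratios lambda_j / lambda_{j+1}, j = 1..j_max, of the eigenvalues of M_hat.

    The exact definition of R_j in (3.2) is assumed here, not quoted.
    """
    p = y.shape[1]
    M = np.zeros((p, p))
    for k in range(k0 + 1):
        S = sample_autocov(y, k)
        M += S @ S.T
    lam = np.linalg.eigvalsh(M)[::-1]          # eigenvalues in descending order
    return lam[:j_max] / lam[1 : j_max + 1]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 400, 20
    f = np.zeros(n)
    for t in range(1, n):                       # one strong AR(1) common factor
        f[t] = 0.8 * f[t - 1] + rng.standard_normal()
    y = np.outer(f, np.ones(p)) + rng.standard_normal((n, p))
    print(np.round(eigen_ratios(y, k0=5, j_max=10), 2))
```

In this toy example with a single strong factor, the first ratio dominates the others, which mirrors the role of $\hat R_1$ in the real data analysis; locating further local maxima, as done in the text, is a judgment made on the plotted ratios.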

Figure 4: MGF with different numbers of clusters when $\hat r_0 = 1$ and $\hat r = 9$. [Horizontal axis $k$, vertical axis MGF(k).]

With $\hat r_0 = 1$ and $\hat r = 9$, we proceed to Steps 2 & 3 of Section 3 and obtain the estimator $\hat B$ as in (3.5). Setting $\omega_p = \{\hat r/(p\ln p)\}^{1/2}$, we obtain $|\hat J_{d+1}| = 12$, i.e. 12 stocks do not appear to belong to any clusters, where $\hat J_{d+1}$ is defined as in (3.6) in Step 4. Leaving those 12 stocks out, we perform Step 5, i.e. the k-means clustering for the $\hat p_0 = 477 - 12 = 465$ rows of matrix $\hat W$. The resulting MGF(·) is plotted in Figure 4. As MGF(9) is substantially greater than MGF(k) for

$k < 9$, and MGF(k) keeps increasing for $k > 9$, we take $\hat d = 9$ as the number of latent clusters. To present the identified $d$ clusters, we define an $11\times d$ matrix with $n_{ij}/n_i$ as its $(i, j)$-th element, where $n_i$ is the number of the stocks in the $i$-th industry sector, and $n_{ij}$ is the number of the stocks in the $i$-th industry sector which are allocated to the $j$-th cluster. Thus $n_{ij}/n_i \in [0, 1]$ and $\sum_j n_{ij}/n_i = 1$. The heatmap of this $11\times d$ matrix for $d = \hat d = 9$ is presented in Figure 5. The first cluster mainly consists of the companies in Consumer Staples, Real Estate and Utilities; Clusters 2 and 3 contain the companies in, respectively, Financials and Health Care; Cluster 4 contains mainly some companies in Communication Services and Information Technology; Cluster 5 consists of the companies in Industrials and Materials; Cluster 6 contains mainly the companies in Consumer Discretionary; Cluster 7 is a mixture of a small number of companies from each of 5 or 6 different sectors; Cluster 8 contains mainly the companies from Information Technology; and Cluster 9 contains almost all companies in Energy. To examine how stable the clustering is, we also include the results for $d = 11$ and $d = 3$ in Figure 5. When $d$ is increased from 9 to 11, the original Cluster 1 is divided into new Clusters 1 and 11, with the former consisting of the Consumer Staples and Utilities sectors, and the latter being the Real Estate sector. Furthermore the original Cluster 7 splits into new Clusters 7 and 10, while the other 7 original clusters are hardly changed. With $d = 3$, most companies in each of the 11 sectors stay in one cluster.

Figure 5: Heatmaps of the distributions of the stocks in each of the 11 industry sectors (corresponding to 11 rows) over $d$ clusters (corresponding to $d$ columns), with $d = 9$, 11 and 3. The estimated numbers of the common and cluster-specific factors are, respectively, $\hat r_0 = 1$ and $\hat r = 9$.

Figure 6: Heatmaps of the distributions of the stocks in each of the 11 industry sectors (corresponding to 11 rows) over $d$ clusters (corresponding to $d$ columns), with $d = 9$, 11 and 3. The estimated numbers of the common and cluster-specific factors are, respectively, $\hat r_0 = 1$ and $\hat r = 12$.
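The sector-by-cluster matrix with entries $n_{ij}/n_i$ described above can be computed as in the following minimal Python sketch. The function name and the toy labels are hypothetical; the sketch assumes integer-coded sector and cluster labels for the stocks that were actually clustered, and that every sector contains at least one such stock.

```python
import numpy as np

def sector_cluster_proportions(sector, cluster, n_sectors, d):
    """Return the n_sectors x d matrix whose (i, j) entry is n_ij / n_i.

    sector[s] and cluster[s] are the integer labels of stock s,
    in {0..n_sectors-1} and {0..d-1} respectively.
    """
    counts = np.zeros((n_sectors, d))
    # Accumulate n_ij: number of stocks of sector i allocated to cluster j.
    np.add.at(counts, (np.asarray(sector), np.asarray(cluster)), 1)
    # Divide each row by n_i so that every row sums to one.
    return counts / counts.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    sector = [0, 0, 0, 1, 1]      # toy data: 2 sectors, 5 stocks
    cluster = [0, 0, 1, 2, 2]     # toy data: 3 clusters
    print(sector_cluster_proportions(sector, cluster, n_sectors=2, d=3))
```

Each row of the result is the distribution of one sector over the clusters, which is what the heatmaps in Figures 5 and 6 display.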

If we take $\hat r_0 = 1$ and $\hat r = 12$, the estimated $\hat J_{d+1}$ is unchanged. The clustering results with $d = 9$, 11 and 3 are presented in Figure 6. Compared with Figure 5, there are some striking similarities. First, the clusterings with $d = 3$ are almost identical. For $d = 9$, the profiles of Clusters 2, $\cdots$, 6, 8 and 9 are not significantly changed, while Clusters 1 and 7 in Figure 5 are somehow mixed together in Figure 6. With $d = 11$, the profiles of Clusters 2 – 6 and 8 – 10 in the two figures are about the same, while Clusters 7 and 11 are mixed up across the two figures.

The analysis above indicates that the companies in the same industry sector tend to share a similar dynamic structure in the sense that they are driven by the same cluster-specific factors. Our analysis is reasonably stable, as most of the clusters do not change substantially when the number of the weaker factors increases from $\hat r = 9$ to $\hat r = 12$.

6 Technical proofs

6.1 Proof of Theorem 1

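Write the singular value decomposition of $B^*$ as

$$B^* = (I_p - AA^\top)\begin{pmatrix} B \\ 0 \end{pmatrix} = \tilde B\Lambda_{\tilde B}V_{\tilde B}^\top, \tag{6.1}$$

where $\tilde B$ is a $p\times r$ matrix consisting of the left-singular vectors and $V_{\tilde B}$ is an $r\times r$ matrix consisting of the right-singular vectors such that $\tilde B^\top\tilde B = V_{\tilde B}^\top V_{\tilde B} = I_r$. Note that $\tilde B\tilde B^\top = P_{A^\perp B}$.

At first we give a lemma about $B^*$. It ensures that there is a positive constant $q_0 < 1$ such that $\|\Lambda_{\tilde B}\|_{\min} \ge 1 - q_0$.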

Lemma 1. Under Assumption 2,

$$\operatorname{rank} B^* = r$$

and

$$\|B^*\|_{\min} \ge 1 - q_0.$$

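Proof of Lemma 1. $B^\top B = I_r$ implies

$$\operatorname{rank} B^* \le r.$$

Moreover, from Weyl's inequality for singular values, the $r$th largest singular value, i.e. the smallest non-zero singular value of $B^*$, satisfies

$$\|B^*\|_{\min} \ge \|B\|_{\min} - \Big\|AA^\top\begin{pmatrix} B \\ 0 \end{pmatrix}\Big\| = 1 - \Big\|AA^\top\begin{pmatrix} B \\ 0 \end{pmatrix}\Big\| \ge 1 - q_0.$$

Definition 6.1. Let

$$\dot y_t = Ax_t + \begin{pmatrix} B \\ 0 \end{pmatrix} z_t, \tag{6.2}$$

$$\dot\Sigma_y(k) = \operatorname{Cov}(\dot y_{t+k}, \dot y_t) \tag{6.3}$$

and

$$\dot M = \sum_{k=0}^{k_0}\dot\Sigma_y(k)\dot\Sigma_y(k)^\top = \dot A\dot\Lambda_A\dot A^\top + \dot B\dot\Lambda_B\dot B^\top, \tag{6.4}$$

where $\dot A$ is a $p\times r_0$ matrix which consists of the eigenvectors corresponding to the $r_0$ largest eigenvalues of $\dot M$ and $\dot B$ is a $p\times r$ matrix which consists of the eigenvectors corresponding to the other eigenvalues of $\dot M$. Let

$$\tilde z_t = \Lambda_{\tilde B}V_{\tilde B}^\top z_t, \qquad \tilde x_t = x_t + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix} z_t.$$

Therefore model (2.2) can be equivalently rewritten as

$$y_t = A\tilde x_t + \tilde B\tilde z_t + We_t. \tag{6.5}$$

Note that (6.1) ensures that

$$A^\top\tilde B = 0. \tag{6.6}$$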
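The algebra in (6.1), Lemma 1 and (6.6) can be checked numerically. The Python sketch below draws random $A$ and $B$ with orthonormal columns (the dimensions are illustrative choices, not values from the paper), forms $B^*$, and verifies the Weyl-type lower bound on its smallest singular value together with the orthogonality $A^\top\tilde B = 0$. It is a sanity check under these assumptions, not a part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
p, p0, r0, r = 30, 20, 2, 3   # illustrative sizes, not taken from the paper

# A: p x r0 with orthonormal columns; B: p0 x r with orthonormal columns.
A, _ = np.linalg.qr(rng.standard_normal((p, r0)))
B, _ = np.linalg.qr(rng.standard_normal((p0, r)))
B_pad = np.vstack([B, np.zeros((p - p0, r))])      # the stacked matrix (B; 0)

# B* = (I_p - A A^T)(B; 0) and its thin SVD, as in (6.1).
B_star = (np.eye(p) - A @ A.T) @ B_pad
B_tilde, sing_vals, Vt = np.linalg.svd(B_star, full_matrices=False)

# Weyl-type lower bound used in the proof of Lemma 1.
weyl_bound = 1.0 - np.linalg.norm(A @ A.T @ B_pad, 2)

print("smallest singular value of B*:", sing_vals.min())
print("lower bound 1 - ||A A^T (B; 0)||:", weyl_bound)
print("largest entry of |A^T B_tilde|:", np.abs(A.T @ B_tilde).max())
```

The last printed quantity is zero up to rounding because the columns of $\tilde B$ lie in the range of $I_p - AA^\top$.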

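Now we prove that $\tilde x_t$ and $\tilde z_t$ have the same properties as $x_t$ and $z_t$.

Definition 6.2.

$$\tilde\Sigma_x(k) = \operatorname{Cov}(\tilde x_{t+k}, \tilde x_t), \qquad \tilde\Sigma_z(k) = \operatorname{Cov}(\tilde z_{t+k}, \tilde z_t),$$

$$\tilde\Sigma_{x,z}(k) = \operatorname{Cov}(\tilde x_{t+k}, \tilde z_t), \qquad \tilde\Sigma_{z,x}(k) = \operatorname{Cov}(\tilde z_{t+k}, \tilde x_t).$$

Lemma 2. Under Assumptions 2 and 3,

$$\|\tilde\Sigma_x(k)\| \asymp p \asymp \|\tilde\Sigma_x(k)\|_{\min}, \tag{6.7}$$

$$\|\tilde\Sigma_z(k)\| \asymp p^{1-\delta} \asymp \|\tilde\Sigma_z(k)\|_{\min}, \tag{6.8}$$

$$\|\tilde\Sigma_x(0)^{-1/2}\tilde\Sigma_{x,z}(0)\tilde\Sigma_z(0)^{-1/2}\| \le q_1 < 1, \qquad \|\tilde\Sigma_z(0)^{-1/2}\tilde\Sigma_{z,x}(0)\tilde\Sigma_x(0)^{-1/2}\| \le q_1 < 1, \tag{6.9}$$

$$\|\tilde\Sigma_{x,z}(k)\| = O(p^{1-\delta/2}), \qquad \|\tilde\Sigma_{z,x}(k)\| = O(p^{1-\delta/2}), \tag{6.10}$$

and

$$\operatorname{Cov}(\tilde x_t, e_s) = 0, \qquad \operatorname{Cov}(\tilde z_t, e_s) = 0. \tag{6.11}$$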

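Proof of Lemma 2. From (2.5) and (2.7),

$$\begin{aligned}
\tilde\Sigma_x(k) &= \operatorname{Cov}\Big(x_{t+k} + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix} z_{t+k},\; x_t + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix} z_t\Big) \\
&= \Sigma_x(k) + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix}\Sigma_z(k)(B^\top, 0)A + \Sigma_{x,z}(k)(B^\top, 0)A + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix}\Sigma_{z,x}(k) \\
&= \Sigma_x(k) + o(p).
\end{aligned}$$

This, together with (2.4), concludes (6.7). Similarly,

$$\tilde\Sigma_z(k) = \operatorname{Cov}(\Lambda_{\tilde B}V_{\tilde B}^\top z_{t+k},\; \Lambda_{\tilde B}V_{\tilde B}^\top z_t) = \Lambda_{\tilde B}V_{\tilde B}^\top\Sigma_z(k)V_{\tilde B}\Lambda_{\tilde B}.$$

This, together with (1) and (2.5), concludes (6.8). For $\tilde\Sigma_{x,z}(k)$, one has

$$\begin{aligned}
\tilde\Sigma_{x,z}(k) &= \operatorname{Cov}\Big(x_{t+k} + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix} z_{t+k},\; \Lambda_{\tilde B}V_{\tilde B}^\top z_t\Big) \\
&= A^\top\begin{pmatrix} B \\ 0 \end{pmatrix}\Sigma_z(k)V_{\tilde B}\Lambda_{\tilde B} + \Sigma_{x,z}(k)V_{\tilde B}\Lambda_{\tilde B} \\
&= \Sigma_{x,z}(k)V_{\tilde B}\Lambda_{\tilde B} + O(p^{1-\delta}).
\end{aligned}$$

This implies (6.9) and (6.10). (6.11) is obvious.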

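Now we give the relation between $\dot A$ and $A$.

Lemma 3. Under Assumptions 1-3,

$$\|\dot A\dot A^\top - AA^\top\| = O(p^{-\delta/2}) \tag{6.12}$$

and

$$\|A^\top\dot B\| = O(p^{-\delta/2}). \tag{6.13}$$

Moreover, the orders of magnitude of $\|\dot A\dot A^\top - AA^\top\|$ and $\|A^\top\dot B\|$ are totally determined by

$$\frac{1}{p^2}\Big\|\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_{x,z}(k) + \sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_z(k)\Big\|.$$

Furthermore, $\|\dot A\dot A^\top - AA^\top\| = \|A^\top\dot B\| = 0$ if and only if

$$\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_{z,x}(k)^\top + \sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_z(k)^\top = 0. \tag{6.14}$$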

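Proof of Lemma 3. From (6.4) we see that $\dot A$ and $\dot B$ are the eigenvector matrices corresponding to different eigenvalues so that

$$\dot A^\top\dot M\dot B = 0.$$

Recalling (6.2) and (6.5) we have

$$\dot y_t = A\tilde x_t + \tilde B\tilde z_t = Ax_t + \begin{pmatrix} B \\ 0 \end{pmatrix} z_t. \tag{6.15}$$

Hence we can further expand $\dot A^\top\dot M\dot B$ as

$$\begin{aligned}
0 &= \dot A^\top\Big(\sum_{k=0}^{k_0}\dot\Sigma_y(k)\dot\Sigma_y(k)^\top\Big)\dot B \\
&= \dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_x(k)^\top\Big)A^\top\dot B + \dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_{z,x}(k)^\top\Big)\tilde B^\top\dot B \\
&\quad + \dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_{x,z}(k)^\top\Big)A^\top\dot B + \dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_z(k)^\top\Big)\tilde B^\top\dot B \\
&\quad + \dot A^\top\tilde B\Big(\sum_{k=0}^{k_0}\tilde\Sigma_{z,x}(k)\tilde\Sigma_x(k)^\top\Big)A^\top\dot B + \dot A^\top\tilde B\Big(\sum_{k=0}^{k_0}\tilde\Sigma_{z,x}(k)\tilde\Sigma_{z,x}(k)^\top\Big)\tilde B^\top\dot B \\
&\quad + \dot A^\top\tilde B\Big(\sum_{k=0}^{k_0}\tilde\Sigma_z(k)\tilde\Sigma_{x,z}(k)^\top\Big)A^\top\dot B + \dot A^\top\tilde B\Big(\sum_{k=0}^{k_0}\tilde\Sigma_z(k)\tilde\Sigma_z(k)^\top\Big)\tilde B^\top\dot B.
\end{aligned}$$

This, together with (6.7)-(6.10), implies

$$\Big\|\dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_x(k)^\top\Big)A^\top\dot B\Big\| = O(p^{2-\delta/2}).$$

Moreover, (6.7) implies that

$$\|\tilde\Sigma_x(k)\tilde\Sigma_x(k)^\top\|_{\min} \asymp p^2 \asymp \|\tilde\Sigma_x(k)\tilde\Sigma_x(k)^\top\|.$$

This further yields that

$$\Big\|\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_x(k)^\top\Big\|_{\min} \asymp p^2.$$

So we conclude that (6.12)-(6.13) are true. Moreover, if $\|\dot A\dot A^\top - AA^\top\| = \|A^\top\dot B\| = 0$, then

$$\dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_{z,x}(k)^\top + \sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_z(k)^\top\Big)\tilde B^\top\dot B = 0.$$

Then

$$\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_{z,x}(k)^\top + \sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_z(k)^\top = 0.$$

If (6.14) holds, then

$$A^\top\dot M\tilde B = 0.$$

The smallest eigenvalue of $A^\top\dot MA$ has a larger order than the largest eigenvalue of $\tilde B^\top\dot M\tilde B$. So

$$\|\dot A\dot A^\top - AA^\top\| = \|A^\top\dot B\| = 0.$$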

Now we give a lemma about $e_t$. Let

$$\Sigma_e(k) = \frac{1}{n-k}\sum_{t=1}^{n-k} e_{t+k}e_t^\top.$$

Lemma 4. Under Assumption 4,

$$\|\Sigma_e(0)\| = O_p\Big(\frac{p}{n} + \log p\Big) = o_p(p^{1-\delta}).$$

Lemma 4 implies that the order of $\|\Sigma_e(0)\|$ is smaller than $p^{1-\delta}$.

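Proof of Lemma 4. Let $\Sigma_{e,k}$ be an $n\times n$ matrix whose $(i, j)$ element is $E(e_{t+i,k}e_{t+j,k})$. Define

$$m = E\max_{1\le k\le p}\sum_{t=1}^{n} e_{t,k}^2.$$

From Theorem 5.48 and Remark 5.49 in Vershynin (2010) we conclude that

$$(E\|\Sigma_e(0)\|)^{1/2} \le \Big\|\frac{1}{p}\sum_{k=1}^{p}\Sigma_{e,k}\Big\|^{1/2}\sqrt{\frac{p}{n}} + C_1\sqrt{\frac{m\log n}{n}},$$

where $C_1$ is an absolute constant. Recalling (2.9) we have

$$\lim_{n,p\to\infty}\Big\|\frac{1}{p}\sum_{k=1}^{p}\Sigma_{e,k}\Big\| < C.$$

So we only need to prove $m = O\big(n + \frac{p}{\log n}\big)$. From (2.10),

$$\begin{aligned}
E\max_{1\le k\le p}\sum_{t=1}^{n} e_{t,k}^2 &\le n\max_{1\le k\le p}Ee_{t,k}^2 + E\max_{1\le k\le p}\Big(\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big) \\
&< Cn + E\max_{1\le k\le p}\Big(\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big).
\end{aligned}$$

Moreover,

$$\begin{aligned}
E\max_{1\le k\le p}\Big(\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big) &\le n + \sum_{k=1}^{p}E\Big|\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big|\,1\Big\{\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2 > n\Big\} \\
&\le n + \frac{1}{n}\sum_{k=1}^{p}E\Big(\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big)^2 1\Big\{\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2 > n\Big\}.
\end{aligned}$$

This, together with (2.11), implies that

$$E\max_{1\le k\le p}\Big(\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big) = O\Big(n + \frac{p}{\log n}\Big).$$

The proof is complete.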

Now we prove Theorem 1.

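Definition 6.3. Let

$$\check y_t = \dot y_t - \frac{1}{n}\sum_{s=1}^{n}\dot y_s, \qquad \hat e_t = We_t - \frac{1}{n}\sum_{s=1}^{n}We_s.$$

We further define

$$\check\Sigma_y(k) = \frac{1}{n-k}\sum_{t=1}^{n-k}\check y_{t+k}\check y_t^\top, \qquad \hat\Sigma_e(k) = \frac{1}{n-k}\sum_{t=1}^{n-k}\hat e_{t+k}\hat e_t^\top,$$

$$\check\Sigma_{y,e}(k) = \frac{1}{n-k}\sum_{t=1}^{n-k}\check y_{t+k}\hat e_t^\top, \qquad \check\Sigma_{e,y}(k) = \frac{1}{n-k}\sum_{t=1}^{n-k}\hat e_{t+k}\check y_t^\top,$$

$$\check M = \sum_{k=0}^{k_0}\check\Sigma_y(k)\check\Sigma_y(k)^\top. \tag{6.16}$$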

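Proof of Theorem 1. If we can prove

$$\|\hat A\hat A^\top - \dot A\dot A^\top\| = O_p(n^{-1/2}), \tag{6.17}$$

(4.1) can be derived by (6.17) and Lemma 3. To prove (6.17) it suffices to show that the difference between the $r_0$th largest eigenvalue of $\dot M$ and the $(r_0+1)$th largest eigenvalue of $\hat M$ is larger than $cp^2$ ($c$ is a positive constant), and $\|\hat M - \dot M\| = O_p(p^2n^{-1/2})$.

Note that $\check M$ is the sample version of $\dot M$ and $\dot y_t$ is stationary. So $\|\check M - \dot M\| = O_p(p^2n^{-1/2}) = o_p(p^2)$ and

$$\hat\Sigma_y(k) = \check\Sigma_y(k) + \hat\Sigma_e(k) + \check\Sigma_{y,e}(k) + \check\Sigma_{e,y}(k).$$

From Lemma 2 and Lemma 4 we conclude that $\|\hat\Sigma_e(k) + \check\Sigma_{y,e}(k) + \check\Sigma_{e,y}(k)\| = o_p(p)$. This implies that $\|\check M - \hat M\| = o_p(p^2)$ and $\|\dot M - \hat M\| = o_p(p^2)$. (6.7)-(6.10) imply that the order of the $r_0$th largest eigenvalue of $\dot M$ is $p^2$ and the $(r_0+1)$th largest eigenvalue of $\dot M$ is $o(p^2)$. This implies that the difference between the $r_0$th largest eigenvalue of $\dot M$ and the $(r_0+1)$th largest eigenvalue of $\hat M$ is larger than $cp^2$.

Now we consider $\|\hat M - \dot M\|$. Since $\|\check M - \dot M\| = O_p(p^2n^{-1/2})$, we only need to prove $\|\check M - \hat M\| = O_p(p^2n^{-1/2})$. Write

$$\begin{aligned}
\hat M - \check M &= \sum_{k=0}^{k_0}\check\Sigma_y(k)\big[\hat\Sigma_e(k)^\top + \check\Sigma_{y,e}(k)^\top + \check\Sigma_{e,y}(k)^\top\big] \\
&\quad + \sum_{k=0}^{k_0}\big[\hat\Sigma_e(k) + \check\Sigma_{y,e}(k) + \check\Sigma_{e,y}(k)\big]\check\Sigma_y(k)^\top \\
&\quad + O\Big(\sum_{k=1}^{k_0}\big[\|\check\Sigma_{e,y}(k)\|^2 + \|\check\Sigma_{y,e}(k)\|^2 + \|\hat\Sigma_e(k)\|^2\big]\Big).
\end{aligned} \tag{6.18}$$

Lemma 4 implies $\|\hat\Sigma_e(k)\| = O_p(pn^{-1/2})$. (6.11) ensures $\|\check\Sigma_{y,e}(k)\| = O_p(pn^{-1/2})$ and $\|\check\Sigma_{e,y}(k)\| = O_p(pn^{-1/2})$. Hence the proof is completed.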

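Now we consider the estimator of $P_{A^\perp B}$ in Step 3 and Theorem 2. Let

$$P^c_{\hat A} = I - \hat A\hat A^\top. \tag{6.19}$$

Then we get $\hat B$ from the eigen-analysis of $\sum_{k=0}^{k_0}P^c_{\hat A}\hat\Sigma_y(k)P^c_{\hat A}\hat\Sigma_y(k)^\top P^c_{\hat A}$.

Definition 6.4. Set

$$\hat M_2 = \sum_{k=0}^{k_0}P^c_{\hat A}\hat\Sigma_y(k)P^c_{\hat A}\hat\Sigma_y(k)^\top P^c_{\hat A}, \qquad \dot M_2 = \sum_{k=0}^{k_0}P^c_{\hat A}\dot\Sigma_y(k)P^c_{\hat A}\dot\Sigma_y(k)^\top P^c_{\hat A}. \tag{6.20}$$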
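The two eigen-analyses behind $\hat A$ and $\hat B$ can be sketched numerically as follows. This is a minimal Python sketch under stated assumptions: $\hat\Sigma_y(k)$ is taken to be the mean-centred lag-$k$ sample autocovariance (the definition (3.1) is not reproduced in this section), $\hat A$ collects the $r_0$ leading eigenvectors of $\hat M = \sum_{k=0}^{k_0}\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$, and $\hat B$ collects the $r$ leading eigenvectors of $\hat M_2$ in (6.20). All function names are hypothetical.

```python
import numpy as np

def sample_autocov(y, k):
    """Lag-k sample autocovariance of an (n, p) array y (rows are time points)."""
    n = y.shape[0]
    yc = y - y.mean(axis=0)
    return yc[k:].T @ yc[: n - k] / (n - k)

def top_eigvecs(M, m):
    """Eigenvectors of the symmetric matrix M for its m largest eigenvalues."""
    _, vecs = np.linalg.eigh((M + M.T) / 2)    # eigh returns ascending eigenvalues
    return vecs[:, ::-1][:, :m]

def two_step_estimate(y, r0, r, k0):
    """Return (A_hat, B_hat) from the two successive eigen-analyses."""
    p = y.shape[1]
    covs = [sample_autocov(y, k) for k in range(k0 + 1)]
    # Step one: leading r0 eigenvectors of M_hat.
    M_hat = sum(S @ S.T for S in covs)
    A_hat = top_eigvecs(M_hat, r0)
    # Step two: project out A_hat, then take the leading r eigenvectors of M2_hat.
    P = np.eye(p) - A_hat @ A_hat.T
    M2_hat = sum(P @ S @ P @ S.T @ P for S in covs)
    B_hat = top_eigvecs(M2_hat, r)
    return A_hat, B_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.standard_normal((200, 12))          # placeholder data, no factor structure
    A_hat, B_hat = two_step_estimate(y, r0=2, r=3, k0=5)
    print(np.abs(A_hat.T @ B_hat).max())        # close to zero by construction
```

Because $\hat M_2$ is sandwiched between copies of $P^c_{\hat A}$, its eigenvectors with non-zero eigenvalues are automatically orthogonal to the columns of $\hat A$.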

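Lemma 5. Under Assumptions 1-4, there exist two constants $c_0$ and $c_1$ such that

$$\lim_{n,p\to\infty}P\Big(c_0 \le \frac{\|P^c_{\hat A}\dot\Sigma_y(k)P^c_{\hat A}\|_{\min}}{p^{1-\delta}} \le \frac{\|P^c_{\hat A}\dot\Sigma_y(k)P^c_{\hat A}\|}{p^{1-\delta}} \le c_1\Big) = 1 \tag{6.21}$$

and

$$\|\dot B_2\dot B_2^\top - \dot B\dot B^\top\| = O_p(p^{\delta/2}n^{-1/2}), \tag{6.22}$$

where $\dot B_2$ is a $p\times r$ matrix consisting of the eigenvectors corresponding to the $r$ largest eigenvalues of $\dot M_2$.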

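Proof of Lemma 5. Recalling the definitions of $\dot A$ and $\dot B$ we have

$$(AA^\top - \dot A\dot A^\top)A = (I - \dot A\dot A^\top)A = \dot B\dot B^\top A$$

and

We can find the order of $P^c_{\hat A}\dot y_t$ as follows. Via (6.15) and (6.19) write

$$\begin{aligned}
P^c_{\hat A}A\tilde x_t + P^c_{\hat A}\tilde B\tilde z_t &= (\dot A\dot A^\top - \hat A\hat A^\top)A\tilde x_t + (I - \dot A\dot A^\top)A\tilde x_t \\
&\quad + (I - \dot A\dot A^\top)\tilde B\tilde z_t + (\dot A\dot A^\top - \hat A\hat A^\top)\tilde B\tilde z_t \\
&= \dot B\dot B^\top(A\tilde x_t + \tilde B\tilde z_t) + (\dot A\dot A^\top - \hat A\hat A^\top)(A\tilde x_t + \tilde B\tilde z_t) \\
&=: \Pi_1 + \Pi_2,
\end{aligned}$$

where $\Pi_1 = \dot B\dot B^\top(A\tilde x_t + \tilde B\tilde z_t)$ and $\Pi_2 = (\dot A\dot A^\top - \hat A\hat A^\top)(A\tilde x_t + \tilde B\tilde z_t)$. (4.1) implies $\|\Pi_2\| = O_p(p^{1/2}n^{-1/2}) = o_p(p^{1/2-\delta/2})$. This, together with (6.7)-(6.10), implies (6.21). Furthermore,

$$P^c_{\hat A}\dot y_t = \dot B\dot B^\top(A\tilde x_t + \tilde B\tilde z_t) + O_p(p^{1/2}n^{-1/2}) = \dot B\dot B^\top(A\tilde x_t + \tilde B\tilde z_t) + o_p(p^{1/2-\delta/2}).$$

This, together with (6.21), implies (6.22).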

Proof of Theorem 2. If we can prove

$$\|\hat B\hat B^\top - \dot B\dot B^\top\| = O_p(p^{\delta/2}n^{-1/2}), \tag{6.23}$$

(4.2) can be obtained by (6.23) and Lemma 3. Moreover, (6.22) shows that $\dot B_2$, the eigenvector matrix corresponding to the $r$ largest eigenvalues of $\dot M_2$, is close enough to $\dot B$. It then suffices to prove that

$$\|\hat B\hat B^\top - \dot B_2\dot B_2^\top\| = O_p(p^{\delta/2}n^{-1/2}).$$

To this end, the aim is to show that the difference between the $r$th largest eigenvalue of $\dot M_2$ and the $(r+1)$th largest eigenvalue of $\hat M_2$ is larger than $cp^{2-2\delta}$ and $\|\hat M_2 - \dot M_2\| = O_p(p^{2-3\delta/2}n^{-1/2})$.

From Lemma 4, we have

$$\|\hat\Sigma_e(k)\| = o_p(p^{1-\delta/2}n^{-1/2}) = o_p(p^{1-\delta}).$$

This, together with Lemma 5, implies that the difference between the $r$th largest eigenvalue of $\dot M_2$ and the $(r+1)$th largest eigenvalue of $\hat M_2$ is larger than $cp^{2-2\delta}$ with probability tending to 1 as $n\to\infty$.

Now we consider $\hat M_2 - \dot M_2$. We still use (6.18). However, we replace $\hat M$ and $\check M$ by $\hat M_2$ and $\dot M_2 + O_p(p^{2-2\delta}n^{-1/2})$ respectively. Similarly, we replace $\check\Sigma_y(k)$, $\check\Sigma_{y,e}(k)$, $\check\Sigma_{e,y}(k)$ and $\hat\Sigma_e(k)$ by $P^c_{\hat A}\check\Sigma_y(k)P^c_{\hat A}$, $P^c_{\hat A}\check\Sigma_{y,e}(k)P^c_{\hat A}$, $P^c_{\hat A}\check\Sigma_{e,y}(k)P^c_{\hat A}$ and $P^c_{\hat A}\hat\Sigma_e(k)P^c_{\hat A}$ respectively. This, together with Lemmas 4-5, ensures that

$$\|\hat M_2 - \dot M_2\| = O_p(p^{2-3\delta/2}n^{-1/2}).$$

This implies (6.23).

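It suffices to prove the following version for Theorem 3.

Lemma 6. Under Assumptions 1-4, let $\hat\lambda_{k,i}$ be the $i$th largest eigenvalue of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$. Then there exists a positive constant $C$ such that

$$\lim_{n,p\to\infty}P\Big(\frac{\hat\lambda_{k,i-1}}{\hat\lambda_{k,i}} \le C\Big) = 1, \quad\text{when } 2\le i\le r_0, \tag{6.24}$$

$$\frac{\hat\lambda_{k,r_0+1}}{\hat\lambda_{k,r_0}} = O_p(p^{-2\delta}), \qquad \frac{\hat\lambda_{k,r_0+1+r}}{\hat\lambda_{k,r_0+r}} = O_p\Big(\frac{p^2/n^2 + \log^2 p}{p^{2-2\delta}}\Big), \tag{6.25}$$

$$\lim_{n,p\to\infty}P\Big(\frac{\hat\lambda_{k,i-1}}{\hat\lambda_{k,i}} \le C\Big) = 1, \quad\text{when } r_0+2\le i\le r_0+r, \tag{6.26}$$

and

$$\frac{\hat\lambda_{k,i-1}}{\hat\lambda_{k,i}} = O_p(\log^2 p), \quad\text{when } r_0+r+2\le i\le r_0+r+s, \tag{6.27}$$

where $s$ is a positive integer.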

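We begin with two estimators of $P_{A^\perp B}$.

Definition 6.5. Let $\hat\lambda_{k,i}$ be the $i$th largest eigenvalue of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$. We write $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$ by its eigenvalue and eigenvector decomposition as

$$\hat\Sigma_y(k)\hat\Sigma_y(k)^\top = \hat A(k)\check\Lambda_x(k)\hat A(k)^\top + \hat B_1(k)\check\Lambda_z(k)\hat B_1(k)^\top + \hat C_1(k)\check\Lambda_e(k)\hat C_1(k)^\top, \tag{6.28}$$

where $\check\Lambda_x(k) = \operatorname{diag}\{\hat\lambda_{k,1},\cdots,\hat\lambda_{k,r_0}\}$, $\check\Lambda_z(k) = \operatorname{diag}\{\hat\lambda_{k,r_0+1},\cdots,\hat\lambda_{k,r_0+r}\}$, $\check\Lambda_e(k) = \operatorname{diag}\{\hat\lambda_{k,r_0+r+1},\cdots,\hat\lambda_{k,p}\}$, and

$$\hat A(k)^\top\hat A(k) = I_{r_0}, \qquad \hat B_1(k)^\top\hat B_1(k) = I_r, \qquad \hat C_1(k)^\top\hat C_1(k) = I_{p-r-r_0}.$$

Then $\hat A(k)$ is the estimator of $A$ and $\hat B_1(k)$ is the estimator of $\tilde B$ in the one-step method. We call this method "one-step" as we get $\hat A(k)$ and $\hat B_1(k)$ in the same eigen-decomposition.

Definition 6.6. Let $P^c_{\hat A_k} = I_p - \hat A(k)\hat A(k)^\top$ and write $P^c_{\hat A_k}\hat\Sigma_y(k)P^c_{\hat A_k}\hat\Sigma_y(k)^\top P^c_{\hat A_k}$ by its eigenvalue and eigenvector decomposition as

$$P^c_{\hat A_k}\hat\Sigma_y(k)P^c_{\hat A_k}\hat\Sigma_y(k)^\top P^c_{\hat A_k} = \hat B(k)\hat\Lambda_z(k)\hat B(k)^\top + \hat C(k)\hat\Lambda_e(k)\hat C(k)^\top. \tag{6.29}$$

Then $\hat B(k)$ is the estimator of $\tilde B$ in the two-step method. We call this method "two-step" as we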

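The following lemma is to prove that the one-step method is asymptotically equivalent to the two-step method based on $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$.

Lemma 7. Under Assumptions 1-4, one has

$$\|\hat B(k)\hat B(k)^\top - P_{A^\perp B}\| = O_p(p^{\delta/2}n^{-1/2} + p^{-\delta/2}), \tag{6.30}$$

$$\|\hat B_1(k)\hat B_1(k)^\top - \hat B(k)\hat B(k)^\top\| = O_p(p^{-\delta/2}n^{-1/2}), \tag{6.31}$$

$$\|\check\Lambda_z(k) - \hat\Lambda_z(k)\| = o_p(p^{2-2\delta}) \tag{6.32}$$

and

$$\|\check\Lambda_e(k) - \hat\Lambda_e(k)\| = o_p(p^{2-2\delta}). \tag{6.33}$$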

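Proof of Lemma 7. By (6.16), (3.1) and (6.15) we have

$$\begin{aligned}
\hat\Sigma_y(k)\hat\Sigma_y(k)^\top - \check\Sigma_y(k)\check\Sigma_y(k)^\top &= \check\Sigma_y(k)\big(\hat\Sigma_e(k)^\top + \check\Sigma_{y,e}(k)^\top + \check\Sigma_{e,y}(k)^\top\big) \\
&\quad + \big(\hat\Sigma_e(k) + \check\Sigma_{y,e}(k) + \check\Sigma_{e,y}(k)\big)\check\Sigma_y(k)^\top \\
&\quad + O\big(\|\check\Sigma_{e,y}(k)\|^2 + \|\check\Sigma_{y,e}(k)\|^2 + \|\hat\Sigma_e(k)\|^2\big).
\end{aligned}$$

From the proofs of Theorems 1-2, it is not hard to obtain the following properties of $\hat A(k)$ and $\hat B(k)$:

$$\|\hat A(k)\hat A(k)^\top - AA^\top\| = O_p(n^{-1/2} + p^{-\delta/2}),$$

$$\|\hat B(k)\hat B(k)^\top - P_{A^\perp B}\| = O_p(p^{\delta/2}n^{-1/2} + p^{-\delta/2}),$$

$$\|\hat\Lambda_z(k)\| \asymp p^{2-2\delta} \asymp \|\hat\Lambda_z(k)\|_{\min}, \tag{6.34}$$

$$\|\hat\Lambda_e(k)\| = O_p(p^2n^{-2} + \log^2 p). \tag{6.35}$$

So we only need to prove (6.31)-(6.33). From (6.28) and (6.29) one can see that

$$I_p = \hat A(k)\hat A(k)^\top + \hat B(k)\hat B(k)^\top + \hat C(k)\hat C(k)^\top$$

and

$$I_p = \hat A(k)\hat A(k)^\top + \hat B_1(k)\hat B_1(k)^\top + \hat C_1(k)\hat C_1(k)^\top.$$

It follows that

$$\hat B(k)\hat B(k)^\top + \hat C(k)\hat C(k)^\top = \hat B_1(k)\hat B_1(k)^\top + \hat C_1(k)\hat C_1(k)^\top.$$

This can help us study the relation between $\hat B(k)$ and $\hat B_1(k)$. From (6.28) and (6.29) we conclude that

$$\hat B_1(k)\check\Lambda_z(k)\hat B_1(k)^\top + \hat C_1(k)\check\Lambda_e(k)\hat C_1(k)^\top = P^c_{\hat A_k}\hat\Sigma_y(k)\hat\Sigma_y(k)^\top P^c_{\hat A_k}.$$

Moreover,

$$P^c_{\hat A_k} = I_p - \hat A(k)\hat A(k)^\top = \hat B(k)\hat B(k)^\top + \hat C(k)\hat C(k)^\top.$$

So

$$\begin{aligned}
P^c_{\hat A_k}\hat\Sigma_y(k)\hat\Sigma_y(k)^\top P^c_{\hat A_k} &= P^c_{\hat A_k}\hat\Sigma_y(k)P^c_{\hat A_k}\hat\Sigma_y(k)^\top P^c_{\hat A_k} + P^c_{\hat A_k}\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top P^c_{\hat A_k} \\
&= \hat B(k)\hat\Lambda_z(k)\hat B(k)^\top + \hat C(k)\hat\Lambda_e(k)\hat C(k)^\top \\
&\quad + \hat B(k)\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat B(k)\hat B(k)^\top \\
&\quad + \hat B(k)\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat C(k)\hat C(k)^\top \\
&\quad + \hat C(k)\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat B(k)\hat B(k)^\top \\
&\quad + \hat C(k)\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat C(k)\hat C(k)^\top.
\end{aligned}$$

It then suffices to get the order of $\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)$ and $\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)$. If we can show

$$\|\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\| = o_p(p^{1-\delta}) \quad\text{and}\quad \|\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)\| = o_p(p^{1-\delta}),$$

then (6.32)-(6.33) follow. Note that

$$\|\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\|_{\min} \asymp p.$$

We study the order of $\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)$ as follows, based on the definition of eigenvectors. Write

$$\begin{aligned}
0 = \hat B(k)^\top\hat\Sigma_y(k)\hat\Sigma_y(k)^\top\hat A(k) &= \hat B(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k) \\
&\quad + \hat B(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k) \\
&\quad + \hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k).
\end{aligned}$$

Then

$$\begin{aligned}
\hat B(k)^\top\hat\Sigma_y(k)\hat A(k) &= -\hat B(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1} \\
&\quad - \hat B(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1}.
\end{aligned} \tag{6.36}$$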

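We identify the order of $\hat C(k)\hat C(k)^\top A\tilde x_t$ as follows.

We replace $\dot M$ by $\dot\Sigma_y(k)\dot\Sigma_y(k)^\top$ and define $\dot A(k)$ and $\dot B(k)$ in the same way as $\dot A$ and $\dot B$. Then

$$\|AA^\top - \dot A(k)\dot A(k)^\top\| = O_p(p^{-\delta/2}), \qquad \|\dot A(k)\dot A(k)^\top - \hat A(k)\hat A(k)^\top\| = O_p(n^{-1/2})$$

and

$$\|\dot B(k)\dot B(k)^\top - \hat B(k)\hat B(k)^\top\| = O_p(p^{\delta/2}n^{-1/2}).$$

Note that

$$\|\hat C(k)\hat C(k)^\top A\tilde x_t\| \le \|\hat C(k)\hat C(k)^\top\dot A(k)\dot A(k)^\top A\tilde x_t\| + \|\hat C(k)\hat C(k)^\top(AA^\top - \dot A(k)\dot A(k)^\top)A\tilde x_t\|.$$

The two summands on the right hand side of the above inequality satisfy

$$\begin{aligned}
\|\hat C(k)\hat C(k)^\top\dot A(k)\dot A(k)^\top A\tilde x_t\| &\le \|(I_p - \hat A(k)\hat A(k)^\top)\dot A(k)\dot A(k)^\top A\tilde x_t\| \\
&= \|(\dot A(k)\dot A(k)^\top - \hat A(k)\hat A(k)^\top)\dot A(k)\dot A(k)^\top A\tilde x_t\| \\
&\le \|\dot A(k)\dot A(k)^\top - \hat A(k)\hat A(k)^\top\|\,\|\tilde x_t\| = O_p(p^{1/2}n^{-1/2})
\end{aligned}$$

and

$$\begin{aligned}
\|\hat C(k)\hat C(k)^\top(AA^\top - \dot A(k)\dot A(k)^\top)A\tilde x_t\| &\le \|(\dot A(k)\dot A(k)^\top - \hat A(k)\hat A(k)^\top)(AA^\top - \dot A(k)\dot A(k)^\top)A\tilde x_t\| \\
&\quad + \|(\dot B(k)\dot B(k)^\top - \hat B(k)\hat B(k)^\top)(AA^\top - \dot A(k)\dot A(k)^\top)A\tilde x_t\| \\
&= O_p(p^{1/2}n^{-1/2}).
\end{aligned}$$

It follows that

$$\|\hat C(k)\hat C(k)^\top A\tilde x_t\| = O_p(p^{1/2}n^{-1/2}).$$

Similarly we can conclude that

$$\|\hat C(k)\hat C(k)^\top\tilde B\tilde z_t\| = O_p(p^{1/2}n^{-1/2}).$$

Then

$$\|\hat C(k)^\top\check y_t\| = O_p(p^{1/2}n^{-1/2}).$$

Likewise one may verify that

$$\|\hat B(k)^\top\check y_t\| = O_p(p^{1/2-\delta/2}).$$

These imply that

$$\|\hat B(k)^\top\check\Sigma_y(k)\hat C(k)\| = O_p(p^{1-\delta/2}n^{-1/2}),$$

$$\|\hat B(k)^\top\check\Sigma_{y,e}(k)\hat C(k)\| = O_p(p^{1/2-\delta/2}) = O_p(p^{1-\delta/2}n^{-1/2}),$$

$$\|\hat B(k)^\top\check\Sigma_{e,y}(k)\hat C(k)\| = O_p(p^{1/2}n^{-1/2}) = o_p(p^{1-\delta/2}n^{-1/2})$$

and

$$\|\hat B(k)^\top\hat\Sigma_e(k)\hat C(k)\| = O_p(\|\hat\Sigma_e(k)\|) = O_p\Big(\frac{p}{n} + \log n\Big) = o_p(p^{1-\delta/2}n^{-1/2}).$$

It follows that

$$\|\hat B(k)^\top\hat\Sigma_y(k)\hat C(k)\| = \|\hat B(k)^\top\check\Sigma_y(k)\hat C(k)\| + O_p(p^{1-\delta/2}n^{-1/2}) = O_p(p^{1-\delta/2}n^{-1/2}).$$

Similarly, we have

$$\begin{aligned}
\|\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\| &\le \|\hat C(k)^\top\check\Sigma_y(k)^\top\hat A(k)\| + \|\hat C(k)^\top\check\Sigma_{y,e}(k)^\top\hat A(k)\| \\
&\quad + \|\hat C(k)^\top\check\Sigma_{e,y}(k)^\top\hat A(k)\| + \|\hat C(k)^\top\hat\Sigma_e(k)^\top\hat A(k)\| \\
&= O_p(p^{1-\delta/2}),
\end{aligned}$$

$$\|\hat B(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\| = O_p(p^{2-\frac{3}{2}\delta}),$$

and

$$\|\hat B(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\| = O_p(p^{2-\delta}n^{-1/2}) = O_p(p^{2-\frac{3}{2}\delta}).$$

Recalling (6.36),

$$\begin{aligned}
\|\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\| &\le \|\hat B(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1}\| \\
&\quad + \|\hat B(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1}\| \\
&= O_p(p^{1-\frac{3}{2}\delta}).
\end{aligned} \tag{6.37}$$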

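Similarly, we can study the order of $\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)$ as follows:

$$\begin{aligned}
0 = \hat C(k)^\top\hat\Sigma_y(k)\hat\Sigma_y(k)^\top\hat A(k) &= \hat C(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k) \\
&\quad + \hat C(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k) \\
&\quad + \hat C(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k).
\end{aligned}$$

We can find that

$$\|\hat C(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\| = O_p(p^{2-\delta}n^{-1/2})$$

and

$$\|\hat C(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\| = O_p(p^{2-\delta}n^{-1/2}).$$

It follows that

$$\begin{aligned}
\|\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)\| &\le \|\hat C(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1}\| \\
&\quad + \|\hat C(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1}\| \\
&= O_p(p^{1-\delta}n^{-1/2}).
\end{aligned} \tag{6.38}$$

This implies (6.32)-(6.33). (6.32)-(6.35) show that

$$\|\hat\Lambda_z(k)\|_{\min} \asymp p^{2-2\delta}$$

and

$$\hat\lambda_{k,r+r_0+1} = \|\hat\Lambda_e(k)\| + o_p(p^{2-2\delta}) = o_p(p^{2-2\delta}).$$

Then the bound on $\|\hat B_1(k)\hat B_1(k)^\top - \hat B(k)\hat B(k)^\top\|$ follows from the fact that

$$\frac{1}{p^{2-2\delta}}\|\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat C(k)\| = O_p(p^{-\delta/2}n^{-1/2}),$$

which ensures (6.31).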

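Lemma 8. Under Assumptions 1-4, let $\hat\lambda_{k,i}$ be the $i$th largest eigenvalue of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$. Then there exist two positive constants $c_1$ and $C_1$ such that

$$\lim_{n,p\to\infty}P\Big(c_1 \le \frac{\hat\lambda_{k,i}}{p^2} \le C_1\Big) = 1, \quad\text{when } 1\le i\le r_0, \tag{6.39}$$

$$\lim_{n,p\to\infty}P\Big(c_1 \le \frac{\hat\lambda_{k,i}}{p^{2-2\delta}} \le C_1\Big) = 1, \quad\text{when } r_0+1\le i\le r_0+r, \tag{6.40}$$

$$\hat\lambda_{k,r_0+r+1} = O_p\Big(\frac{p^2}{n^2} + \log^2 p\Big) \tag{6.41}$$

and

$$\frac{n^2}{p^2\hat\lambda_{k,r_0+r+s}} = O_p(1) \tag{6.42}$$

for any fixed $s$.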

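Proof of Lemma 8. Following the proof of Theorem 1, one can similarly prove that

$$\|\hat A(k)\hat A(k)^\top - AA^\top\| = O_p(n^{-1/2} + p^{-\delta/2}).$$

This implies (6.39).

(6.32) and (6.34) lead to (6.40).

So we prove (6.41) now. We start with the relation:

$$\begin{aligned}
\check\Lambda_e(k) &= \hat C_1(k)^\top\hat\Sigma_y(k)\hat\Sigma_y(k)^\top\hat C_1(k) \\
&= \hat C_1(k)^\top\hat\Sigma_y(k)\hat B_1(k)\hat B_1(k)^\top\hat\Sigma_y(k)^\top\hat C_1(k) + \hat C_1(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat C_1(k) \\
&\quad + \hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\hat C_1(k)^\top\hat\Sigma_y(k)^\top\hat C_1(k).
\end{aligned}$$

Note that

$$\Big\|\begin{pmatrix}\hat A(k)^\top \\ \hat B_1(k)^\top\end{pmatrix}\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big)\Big\|_{\min} \asymp p^{1-\delta},$$

and

$$\begin{aligned}
0 = \hat C_1(k)^\top\hat\Sigma_y(k)\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big) &= \hat C_1(k)^\top\hat\Sigma_y(k)\big(\hat A(k), \hat B_1(k)\big)\begin{pmatrix}\hat A(k)^\top \\ \hat B_1(k)^\top\end{pmatrix}\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big) \\
&\quad + \hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\hat C_1(k)^\top\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big).
\end{aligned}$$

It follows that

$$\hat C_1(k)^\top\hat\Sigma_y(k)\big(\hat A(k), \hat B_1(k)\big) = -\hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\hat C_1(k)^\top\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big)\Big[\begin{pmatrix}\hat A(k)^\top \\ \hat B_1(k)^\top\end{pmatrix}\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big)\Big]^{-1}.$$

This implies that

$$\big\|\hat C_1(k)^\top\hat\Sigma_y(k)\big(\hat A(k)\hat A(k)^\top + \hat B_1(k)\hat B_1(k)^\top\big)\hat\Sigma_y(k)^\top\hat C_1(k)\big\| = O_p\big(\|\hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\hat C_1(k)^\top\hat\Sigma_y(k)^\top\hat C_1(k)\|\big).$$

So we only need to get the order of $\|\hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\|$. (6.31) implies

$$\|\hat C_1(k)\hat C_1(k)^\top - \hat C(k)\hat C(k)^\top\| = O_p(p^{-\delta/2}n^{-1/2}).$$

This, together with (6.35), implies

$$\|\hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\| = O_p(\|\hat C(k)^\top\hat\Sigma_y(k)\hat C(k)\|) = O_p(pn^{-1} + \log p).$$

This proves (6.41).

From (2.10), Lemma 4 and (6.41), for any fixed $s$,

$$\sum_{i=r_0+r+1}^{\min\{n,p\}}\hat\lambda_{k,i}^{1/2} = O_p(p).$$

This, together with (6.41), implies (6.42).

Lemma 6 can be concluded by Lemma 8.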

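6.4 Proof of Theorems 4-6

Proof of Theorem 4. Write

$$(I_p - AA^\top)\begin{pmatrix} B \\ 0 \end{pmatrix} = \begin{pmatrix} C_{11} & \cdots & C_{1,d} \\ \vdots & \ddots & \vdots \\ C_{d+1,1} & \cdots & C_{d+1,d} \end{pmatrix}, \tag{6.43}$$

where $C_{ij}$ is a $p_i\times r_j$ matrix. Hence

$$C_{ii} = B_i - A_iA_i^\top B_i.$$

When $i \ne j$,

$$C_{ij} = -A_iA_j^\top B_j.$$

Note that $B^\top B = I_r$, and $B_i^\top B_i = I_{r_i}$.

(6.43), (2.3) and Assumption 5 ensure that $P_{A^\perp B}$ can be rewritten as

$$P_{A^\perp B} = \tilde H_{\mathrm{diag}} + \tilde H_{\mathrm{err}},$$

where $\tilde H_{\mathrm{diag}}$ satisfies (4.7) and $\|\tilde H_{\mathrm{err}}\|_F = O(p^{\delta/2}n^{-1/2})$. This, together with (4.2), completes the proof.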

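Proof of Theorem 5. Note that $|J_i| \le c_2p$ for any $1\le i\le d+1$. The fact that $\hat B^\top\hat B = I_r$ implies that

$$\|\hat b_i\|^2 = \hat b_i^\top\hat B^\top\hat B\hat b_i = \|\hat b_i^\top\hat B^\top\|^2.$$

Recalling the definition of $\hat b_i$ in Step 4, $\|\hat b_i\|$ is the norm of the $i$th row of $\hat B\hat B^\top$.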
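The identity just used, namely that the rows of $\hat B$ and the rows of $\hat B\hat B^\top$ have the same Euclidean norms whenever $\hat B^\top\hat B = I_r$, is easy to confirm numerically. The Python sketch below uses a random matrix with orthonormal columns as a stand-in for $\hat B$; the dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 50, 4                                   # illustrative sizes only

# Stand-in for B_hat: any p x r matrix with orthonormal columns.
B_hat, _ = np.linalg.qr(rng.standard_normal((p, r)))

row_norms_B = np.linalg.norm(B_hat, axis=1)                 # ||b_hat_i||
row_norms_proj = np.linalg.norm(B_hat @ B_hat.T, axis=1)    # norm of i-th row of B_hat B_hat^T

print(np.max(np.abs(row_norms_B - row_norms_proj)))         # zero up to rounding
```

This is why thresholding $\|\hat b_i\|$ in Step 4 is equivalent to thresholding the row norms of the projection matrix $\hat B\hat B^\top$.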

We begin with $i \le d$. Theorem 4 and Assumption 6 imply that if $j \in J_i\cap\hat J_{d+1}$, the norm of the $j$th row vector of $\hat B\hat B^\top - P_{A^\perp B}$ should be larger than $\frac{c_1p^{-1/2}}{2}$. This, together with (4.2), implies (4.9).

Now we consider $i = d+1$. If $j \in J_{d+1}\cap\hat J^c_{d+1}$, the norm of the $j$th row vector of $\hat B\hat B^\top - P_{A^\perp B}$ should be not smaller than $\frac{\omega_p}{2}$. This, together with (4.2), implies (4.10).

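Proof of Theorem 6. We define a diagonal matrix $F_{\mathrm{diag}}$ whose $i$th diagonal element is $\hat f_i^\top\hat f_i$. Then $\hat\rho_{\ell,m}$ is the $(\ell, m)$th entry of $F_{\mathrm{diag}}^{-1/2}|\hat F\hat F^\top|F_{\mathrm{diag}}^{-1/2}$. Recalling Step 4, one can see $\|F_{\mathrm{diag}}^{-1}\| \le \omega_p^{-2}$. It follows that

$$2\sum_{1\le i<j\le d}\;\sum_{\ell\in\tilde J_i}\;\sum_{m\in\tilde J_j}\hat\rho_{\ell,m}^2 \le \omega_p^{-4}\|H_{\mathrm{err}}\|_F^2.$$

This, together with the definition of $\omega_p$ in Step 4, concludes the first part. For the second part, we recall Theorem 5 and $p_i \asymp p$. Then there exists $C_1 > 0$ such that

$$\lim_{n,p\to\infty}P(|\tilde J_j| > C_1p) = 1.$$

Note that $\hat F$ only has $\hat r$ (fixed) columns and $p$ goes to infinity. There exist $C_2 > 0$ and $C_3 > 0$ such that

$$\lim_{n,p\to\infty}P\big(|\{(\ell, m) : \ell, m\in\tilde J_j,\ |\hat\rho_{\ell,m}| > C_2\}| > C_3p^2\big) = 1.$$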
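The matrix of $\hat\rho_{\ell,m}$ appearing in this proof can be computed directly from the rows of $\hat F$. The Python sketch below reads $|\hat F\hat F^\top|$ as the entrywise absolute value, which is an assumption about the notation, so that $\hat\rho_{\ell,m}$ is the absolute cosine of the angle between rows $\ell$ and $m$. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def abs_row_correlations(F):
    """Entries of F_diag^{-1/2} |F F^T| F_diag^{-1/2} for a matrix F with non-zero rows."""
    G = F @ F.T                              # Gram matrix of the rows
    scale = 1.0 / np.sqrt(np.diag(G))        # diagonal of F_diag^{-1/2}
    return np.abs(G) * np.outer(scale, scale)

if __name__ == "__main__":
    # Toy rows: the first two are parallel, the third is orthogonal to both.
    F = np.array([[1.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 3.0]])
    print(abs_row_correlations(F))
```

Rows loading on the same cluster-specific factors give entries near 1, while rows from different clusters give entries near 0, which is the contrast quantified in the two parts of the proof.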


References

Aghabozorgi, S., Shirkhorshidi, A.S. and Wah, T.Y. (2015). Time-series clustering – A decade review. Information Systems, 53, 16-38.

Alonso, A.S. and Peña, D. (2019). Clustering time series by linear dependency. Statistics and Computing, 29, 655-676.

Ando, T. and Bai, J. (2017). Clustering huge number of financial time series: a panel data approach with high-dimensional predictors and factor structures. Journal of the American Statistical Association, 112, 1182-1198.

Maharaj, E.A., D'Urso, P. and Caiado, J. (2019). Time Series Clustering and Classification. Chapman and Hall/CRC.

Chamberlain, G. (1983). Funds, factors, and diversification in arbitrage pricing models. Econometrica, 51, 1305-1323.

Chamberlain, G. and Rothschild, M. (1983). Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica,51, 1281-1304.

Chang, J., Gao, B. and Yao, Q. (2015). High dimensional stochastic regression with latent factors, endogeneity and nonlinearity. Journal of Econometrics,189, 297-312.

Esling, P. and Agon, C. (2012). Time-series data mining. ACM Computing Surveys, 45, Article 12.

Forni, M., Hallin, M., Lippi, M. and Reichlin, L. (2005). The generalized dynamic-factor model: one-sided estimation and forecasting. Journal of the American Statistical Association,100, 830-840.

Frühwirth-Schnatter, S. and Kaufmann, S. (2008). Model-based clustering of multiple time series. Journal of Business & Economic Statistics, 26, 78-89.

Hallin, M. and Lippi, M. (2013). Factor models in high-dimensional time series – a time-domain approach. Stochastic Processes and Their Applications,123, 2678-2695.

Kakizawa, Y., Shumway, R.H. and Taniguchi, M. (1998). Discrimination and clustering for multivariate time series. Journal of the American Statistical Association, 93, 328-340.

Keogh, E. and Lin, J. (2005). Clustering of time-series subsequences is meaningless: implications for previous and future research. Knowledge and Information Systems, 8, 154-177.

Keogh, E. and Ratanamahatana, C.A. (2005). Exact indexing of dynamic time warping. Knowledge and Information Systems, 7, 358-386.

Khaleghi, A., Ryabko, D., Mary, J. and Preux, P. (2016). Consistent algorithms for clustering time series. Journal of Machine Learning Research,17, 1-32.

Lam, C. and Yao, Q. (2012). Factor modelling for high-dimensional time series: inference for the number of factors. The Annals of Statistics,40, 694-726.

Li, Z., Wang, Q. and Yao, J. (2017). Identifying the number of factors from singular values of a large sample auto-covariance matrix. The Annals of Statistics, 45, 257-288.

Liao, T.W. (2005). Clustering of time series data – a survey. Pattern Recognition, 38, 1857-1874.

Peña, D. and Box, G.E.P. (1987). Identifying a simplifying structure in time series. Journal of the American Statistical Association, 82, 836-843.

Peña, D. and Poncela, P. (2006). Nonstationary dynamic factor analysis. Journal of Statistical Planning and Inference, 136, 1237-1257.

Roelofsen, P. (2018). Time series clustering. Vrije Universiteit Amsterdam. https://www.math.vu.nl/~sbhulai/papers/thesis-roelofsen.pdf.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027.

Yao, Q., Tong, H., Finkenstädt, B. and Stenseth, N.C. (2000). Common structure in panels of short ecological time series. Proceedings of the Royal Society (London), B, 267, 2457-2467.