arXiv:2101.01908v1 [math.ST] 6 Jan 2021
Factor Modelling for Clustering
High-dimensional Time Series
Bo Zhang
Department of Statistics & Finance, International Institute of Finance School of Management, University of Science and Technology of China
Guangming Pan
School of Physical & Mathematical Sciences, Nanyang Technological University
Qiwei Yao
Department of Statistics, London School of Economics and Political Science
Wang Zhou
Department of Statistics & Applied Probability, National University of Singapore
January 7, 2021
Abstract
We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors, in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any cluster. Consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors and the latent clusters. Numerical illustration with both simulated data and a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao (2012).
components; k-means clustering algorithm; Ratio-based estimation; Strong and weak factors.
1 Introduction
One of the primary tasks of data mining is clustering. While most clustering methods were originally designed for independent observations, clustering a large number of time series is gaining increasing momentum (Esling and Agon 2012), due to the mining of large and complex data recorded over time in business, finance, biology, medicine, climate, energy, environment, psychology, multimedia and other areas (Table 1 of Aghabozorgi et al. 2015). Consequently the literature on time series clustering is large; see Liao (2005), Aghabozorgi et al. (2015), Maharaj et al. (2019) and the references therein. The basic idea is to develop some relevant similarity or distance measures among time series first, and then to apply standard clustering algorithms such as hierarchical clustering or the k-means method. Most existing similarity/distance measures for time series may be loosely divided into two categories: data-based and feature-based. The data-based approaches define the measures directly based on the observed time series using, for example, the L2- or, more generally, Minkowski's distance, or various correlation measures. Alonso and Peña (2019) proposed a generalized cross correlation as a similarity measure, which takes into account cross correlation over different time lags. Dynamic time warping can be applied beforehand to cope with time deformation due to, for example, shifting holidays over different years (Keogh and Ratanamahatana, 2005). The feature-based approaches extract relevant features from the observed time series data first, and then define similarity/distance measures based on the extracted features. The feature extraction can be carried out by various transformations such as Fourier, wavelet or principal component analysis (Section 2.3 of Roelofsen, 2018). The features from fitted time series models can also be used to define similarity/distance measures (Yao et al. 2000, Frühwirth-Schnatter and Kaufmann 2008). Attempts have also been made to define the similarity between two time series by measuring the discrepancy between the two underlying stochastic processes (Kakizawa et al. 1998, Khaleghi et al. 2016). Other approaches include Zhang (2013), which clusters time series based on the parallelism of their trend functions, and Ando and Bai (2017), which represents the latent clusters in terms of a factor model. So-called 'subsequence clustering' occurs frequently in the literature on time series clustering; see Keogh and Lin (2005) and Zolhavarieh et al. (2014). It refers to clustering the segments from a single long time series, which is not considered in this paper.
The goal of this study is to propose a new factor-model-based approach to cluster a large number of time series into different and unknown clusters such that the members within each cluster share a similar dynamic structure, while the number of clusters and their sizes are all unknown. We represent the dynamic structures by latent common and cluster-specific factors, which are both unknown and are identified by the difference in factor strength. The common factors are strong factors (Remark 1 of Lam and Yao 2012), as each of them carries the information on most (if not all) time series concerned. The cluster-specific factors are weak factors, as they only affect the time series in a specific cluster. The clustering is based on the factor loadings on all the weak factors: we apply a k-means algorithm using a correlation-type similarity measure defined in terms of the loadings.
Though our factor model is similar to that of Ando and Bai (2017), our approach is radically different. First, we estimate the strong factors and all the weaker factors in a one-pass manner, and then the latent clusters are recovered based on the estimated weak factor loadings. Ando and Bai (2017) adopted an iterative least squares algorithm to estimate factors/factor loadings and the latent cluster structure recursively. Secondly, our setting allows the flexibility that some time series do not belong to any cluster, which is often the case in practice. Thirdly, our setting allows dependence between the common factors and the cluster-specific factors, while Ando and Bai (2017) imposed an orthogonality condition between the two; see Remark 1(iv) in Section 2 below.
The methods used for estimating factors and factor loadings are adapted from Lam and Yao (2012). Nevertheless substantial advances have been made even within the context of Lam and Yao (2012): (i) we remove the artificial condition that the factor loading spaces for strong and weak factors are perpendicular to each other, (ii) we allow weak serial correlations in the idiosyncratic components of the model, which were assumed to be vector white noise by Lam and Yao (2012), and, more significantly, (iii) we propose a new and consistent ratio-based estimator for the number of factors (see Step 1 and also Remark 3(iii) in Section 3 below).
The rest of the paper is organized as follows. Our factor model and the relevant conditions are presented in Section 2. The inference methods are presented in Section 3. The asymptotic properties of the estimation methods are collected in Section 4. Numerical illustration with both simulated data and a real data example is reported in Section 5. All technical proofs are presented in Section 6. A supplementary file contains more simulation results.
We always assume that vectors are column vectors. Let $\|a\|$ denote the Euclidean norm of a vector $a$. For any matrix $G$, let $\mathcal{M}(G)$ denote the linear space spanned by the columns of $G$, $\|G\|$ the square root of the largest eigenvalue of $G^\top G$, $\|G\|_{\min}$ the square root of the smallest eigenvalue of $G^\top G$, and $|G|$ the determinant of $G$ when $G$ is square. We write $a \asymp b$ if $a = O(b)$ and $b = O(a)$. We use $C > 0$ to denote a generic constant independent of $p$ and $n$, which may be different at different places.
2 Models and assumptions
Let $y_t$ be a weakly stationary $p \times 1$ vector time series, i.e. $E y_t$ is a constant independent of $t$, and all elements of $\mathrm{Cov}(y_{t+k}, y_t)$ are finite and depend on $k$ only. Suppose that $y_t$ consists of $d+1$ latent segments, i.e.
$$y_t^\top = (y_{t,1}^\top, \dots, y_{t,d}^\top, y_{t,d+1}^\top), \tag{2.1}$$
where $y_{t,1}, \dots, y_{t,d+1}$ are, respectively, $p_1, \dots, p_{d+1}$ vector time series with $p_1, \dots, p_d \ge 1$, $p_{d+1} \ge 0$, and
$$p_1 + \dots + p_d = p_0, \qquad p_0 + p_{d+1} = p.$$
Furthermore, we assume the following latent factor model with $d$ clusters:
$$y_t = A x_t + \begin{pmatrix} B \\ 0 \end{pmatrix} z_t + \varepsilon_t, \tag{2.2}$$
$$B = \mathrm{diag}(B_1, \dots, B_d), \qquad z_t^\top = (z_{t,1}^\top, \dots, z_{t,d}^\top),$$
where $A$ is a $p \times r_0$ matrix with rank $r_0$, $x_t$ is an $r_0$-vector time series representing $r_0$ common factors with $|\mathrm{Var}(x_t)| \ne 0$, $B_j$ is a $p_j \times r_j$ matrix with rank $r_j$, $z_{t,j}$ is an $r_j$-vector time series representing $r_j$ factors for $y_{t,j}$ only with $|\mathrm{Var}(z_{t,j})| \ne 0$, $0$ stands for a $p_{d+1} \times r$ matrix with all elements equal to 0, $r = r_1 + \dots + r_d$, and $\varepsilon_t$ is an idiosyncratic component in the sense of Chamberlain (1983) and Chamberlain and Rothschild (1983) (see below). Note that in the model above we only observe a permuted $y_t$ (i.e. the order of the components of $y_t$ is unknown), while all the terms on the RHS of (2.2) are unknown.
By (2.2), the $p_0$ components of $y_t$ are grouped into $d$ clusters $y_{t,1}, \dots, y_{t,d}$, while the $p_{d+1}$ components of $y_{t,d+1}$ do not belong to any clusters. The $j$-th cluster $y_{t,j}$ is characterised by the cluster-specific factor $z_{t,j}$, in addition to depending on the common factor $x_t$. The goal is to identify those $d$ latent clusters from the observations $y_1, \dots, y_n$. Note that all $p_j$, $r_j$ and $d$ are also unknown.
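Assumption 1. $\max\{d, r_0, r\} < C < \infty$, where $C$ is a constant independent of $n$ and $p$, and $p_i \asymp p$ for $i = 1, \dots, d$.
Assumption 2. $A^\top A = I_{r_0}$, $B^\top B = I_r$, and it holds for a constant $q_0 \in (0,1)$ that
$$\Big\| A A^\top \begin{pmatrix} B \\ 0 \end{pmatrix} \Big\| \le q_0. \tag{2.3}$$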
Assumption 1 requires that the number of factors remains finite when the number of component time series diverges to $\infty$. This substantially simplifies the technical proofs for the asymptotic results. In practice, there is only one (fixed) $p$, and the factor model is only effective when the number of factors is much smaller than $p$. The assumption that $A$ and $B$ have orthonormal columns can always be fulfilled, as we can replace the original $(A, x_t)$ by $(H, V x_t)$, where $A = HV$ is a QR decomposition of $A$. The orthogonality for $B$ can be obtained by such a replacement for each $(B_j, z_{t,j})$. While $A$ and $B$ are not uniquely defined by (2.3), the factor loading spaces $\mathcal{M}(A)$ and $\mathcal{M}(B_j)$ are, where $\mathcal{M}(A)$ denotes the linear space spanned by the columns of $A$. Hence $AA^\top = A(A^\top A)^{-1}A^\top$, i.e. the projection matrix onto $\mathcal{M}(A)$, is also unique. Condition (2.3) implies that the columns of $\binom{B}{0}$ do not fall entirely into the space $\mathcal{M}(A)$, as otherwise one cannot distinguish $z_t$ from $x_t$.
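The QR-based normalisation described above can be checked numerically. The following is a minimal sketch, assuming NumPy and an arbitrary random loading matrix (the dimensions and the seed are illustrative choices, not taken from the paper); it verifies that $H$ has orthonormal columns and that $HH^\top$ coincides with the projection matrix $A(A^\top A)^{-1}A^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r0 = 8, 2                        # illustrative sizes only
A = rng.normal(size=(p, r0))        # an arbitrary full-rank loading matrix

# Reduced QR decomposition: A = H V with H^T H = I_{r0}
H, V = np.linalg.qr(A)

# Projection onto M(A), computed from A and from the normalised loadings H
proj_from_A = A @ np.linalg.inv(A.T @ A) @ A.T
proj_from_H = H @ H.T
```

Replacing $(A, x_t)$ by $(H, V x_t)$ leaves the term $A x_t$ in (2.2) unchanged, since $H (V x_t) = A x_t$.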
Intuitively (almost) all components of $y_t$ carry the information on the common factor $x_t$, only $p_j$ components of $y_t$ carry the information on the $j$-th cluster-specific factor $z_{t,j}$ ($j = 1, \dots, d$), and merely a few components of $y_t$ carry the information on each of the idiosyncratic components of $\varepsilon_t$. It is reasonable to assume that $x_t$, $z_t$ and $\varepsilon_t$ are of different factor strengths. Assumption 3 below quantifies explicitly the differences in the factor strength between $x_t$ and $z_{t,j}$. We introduce some notation first.
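$$\Sigma_x(k) = \mathrm{Cov}(x_{t+k}, x_t), \quad \Sigma_z(k) = \mathrm{Cov}(z_{t+k}, z_t), \quad \Sigma_{x,z}(k) = \mathrm{Cov}(x_{t+k}, z_t), \quad \Sigma_{z,x}(k) = \mathrm{Cov}(z_{t+k}, x_t).$$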
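Assumption 3. Let $k_0 \ge 1$ be an integer and $\delta \in (0,1)$ be a fixed constant. It holds that for $k = 0, 1, \dots, k_0$,
$$\|\Sigma_x(k)\| \asymp p \asymp \|\Sigma_x(k)\|_{\min}, \tag{2.4}$$
$$\|\Sigma_z(k)\| \asymp p^{1-\delta} \asymp \|\Sigma_z(k)\|_{\min}, \tag{2.5}$$
$$\|\Sigma_x(k)^{-1/2}\Sigma_{x,z}(k)\Sigma_z(k)^{-1/2}\| \le q_0 < 1, \quad \|\Sigma_z(k)^{-1/2}\Sigma_{z,x}(k)\Sigma_x(k)^{-1/2}\| \le q_0 < 1, \tag{2.6}$$
$$\|\Sigma_{x,z}(k)\| = O(p^{1-\delta/2}), \quad \|\Sigma_{z,x}(k)\| = O(p^{1-\delta/2}), \tag{2.7}$$
$$\mathrm{Cov}(x_t, \varepsilon_s) = 0, \quad \mathrm{Cov}(z_t, \varepsilon_s) = 0 \quad \text{for all } t \text{ and } s. \tag{2.8}$$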
Remark 1. (i) Following Lam and Yao (2012), we measure the strength of factors by a constant
meaning and the implication of $\delta$. Condition (2.4) implies that all the components of $x_t$ are strong factors corresponding to $\delta = 0$. Since almost all the components of $y_t$ carry the information on each component of $x_t$, those strong factors can be relatively easily recovered from $y_t$. In contrast, the components of $z_t$ are weak factors with $\delta \in (0,1)$, as only about $p^{1-\delta}$ components of $y_t$ carry the information on $z_t$; see (2.5). Hence it is more difficult to recover those weak factors. Note that our primary interest is to recover the cluster-specific factor $z_t$ in order to cluster the components of $y_t$, for which we also need to estimate $x_t$.
(ii) For simplicity of presentation, we assume that all the cluster-specific factors $z_{t,1}, \dots, z_{t,d}$ are of the same strength, reflected by the uniform constant $\delta$ in (2.5). In practice, those weak factors may have different strengths; see, for example, the real data example in Section 5.2 below. While our approach can be readily extended to the cases with weak factors of different strengths, this would make the theoretical investigation more cumbersome.
(iii) In (2.2) $\varepsilon_t$ represents the idiosyncratic component of $y_t$ in the sense that each component of $\varepsilon_t$ only affects the corresponding component and a few other components of $y_t$ (i.e. $\delta = 1$), which is implied by Assumption 4 below. The differences in the factor strength make the three time series $x_t$, $z_t$ and $\varepsilon_t$ on the RHS of (2.2) (asymptotically) identifiable.
(iv) Model (2.2) is similar to that of Ando and Bai (2017). However we do not require that the common factor $x_t$ and the cluster-specific factor $z_t$ are orthogonal to each other in the sense that $\frac{1}{n}\sum_{1 \le t \le n} x_t z_t^\top = 0$, which is imposed by Ando and Bai (2017). Furthermore we allow the idiosyncratic term $\varepsilon_t$ to exhibit weak autocorrelations (Assumption 4 below), instead of complete independence as in Ando and Bai (2017).
From now on we always assume in (2.2) that $\varepsilon_t = W e_t$ with $E(e_t) = 0$, where $W$ is a $p \times p$ constant matrix, and $e_t = (e_{t,1}, \dots, e_{t,p})^\top$ consists of $p$ independent weakly stationary univariate time series. We specify $e_t$ in Assumption 4.
Assumption 4. Let $p, n \to \infty$ in the order of $n = O(p)$ and $p^\delta \log p = o(n)$. Let $\Sigma_{e,k}$ be an $n \times n$ matrix with $E(e_{t+i,k} e_{t+j,k})$ as its $(i,j)$-element. Suppose that $\|W\| < C$,
$$\lim_{n,p\to\infty} \Big\| \frac{1}{p} \sum_{k=1}^{p} \Sigma_{e,k} \Big\| < C, \tag{2.9}$$
$$\max_{1 \le i \le p} E e_{t,i}^2 < C, \qquad \sum_{i=1}^{p} E e_{t,i}^2 \asymp p, \tag{2.10}$$
and
$$E\Big\{ \Big( \sum_{t=1}^{n} e_{t,i}^2 - n E e_{t,i}^2 \Big)^2 \, 1\Big( \Big| \sum_{t=1}^{n} e_{t,i}^2 - n E e_{t,i}^2 \Big| > n \Big) \Big\} < \frac{C n^2}{\min\{p, n \log n\}}, \tag{2.11}$$
for any $1 \le i \le p$, where $1(\cdot)$ denotes the indicator function.
Remark 2. When
$$\sum_{j=0}^{\infty} \frac{1}{p} \sum_{i=1}^{p} \big| \mathrm{Cov}(e_{t+j,i}, e_{t,i}) \big| < C, \tag{2.12}$$
Gershgorin's circle theorem ensures (2.9). Condition (2.11) holds when $p \asymp n$ and $\{e_{t,i}, t = 1, \dots, n\}$ are mixing random variables with the mixing coefficients decaying at appropriate rates. When $p > n$ it is also true for appropriate mixing random variables, because one may evaluate a higher ($> 2$) moment on the left-hand side of (2.11) due to the involvement of the indicator function.
To state the assumptions on $A$ and $B$ required for the clustering analysis, we partition $A$ according to the latent cluster structure:
$$A^\top = [A_1^\top, \dots, A_{d+1}^\top], \tag{2.13}$$
where $A_i$ is a $p_i \times r_0$ matrix.
Assumption 5. Let $q_p = \max_{1 \le i \le d+1,\, j \ne i} \|A_i A_j^\top B_j\|_F$. Then $q_p = O(p^{\delta/2} n^{-1/2})$.
Assumption 6. For any $1 \le i \le d$, the $L_2$-norm of every row in $(I_{p_i} - A_i A_i^\top) B_i$ is larger than $c_1 p^{-1/2}$.
Assumption 2 implies $q_p < 1$, which is weaker than Assumption 5. Assumption 6 ensures that the proposed inference procedure can separate the components in the $d$ clusters from those not belonging to any clusters. See also Remark 3(iv) in Section 3 below. Those conditions are automatically fulfilled if $A^\top \binom{B}{0} = 0$, which is a condition imposed in Lam and Yao (2012).
3 A clustering algorithm
With available observations $y_1, \dots, y_n$, we propose below an algorithm (in five steps) to identify the latent $d$ clusters. To this end, we introduce some notation first. Let $\bar{y} = \frac{1}{n} \sum_{t=1}^{n} y_t$,
$$\hat\Sigma_y(k) = \frac{1}{n-k} \sum_{t=1}^{n-k} (y_{t+k} - \bar{y})(y_t - \bar{y})^\top, \qquad \hat{M} = \sum_{k=0}^{k_0} \hat\Sigma_y(k) \hat\Sigma_y(k)^\top, \tag{3.1}$$
where $k_0 \ge 0$ is a prespecified integer.
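The two quantities in (3.1) translate directly into array operations. Below is a minimal sketch, assuming NumPy and a data matrix `Y` whose rows are the observations $y_1^\top, \dots, y_n^\top$; the function names `sample_autocov` and `m_hat` are hypothetical and introduced only for illustration.

```python
import numpy as np

def sample_autocov(Y, k):
    """Sample autocovariance at lag k as in (3.1); Y is an n x p array."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)                 # subtract the sample mean y-bar
    # sum over t = 1..n-k of (y_{t+k} - ybar)(y_t - ybar)^T, divided by n - k
    return Yc[k:].T @ Yc[:n - k] / (n - k)

def m_hat(Y, k0):
    """Accumulate Sigma_hat(k) Sigma_hat(k)^T over lags k = 0..k0."""
    p = Y.shape[1]
    M = np.zeros((p, p))
    for k in range(k0 + 1):
        S = sample_autocov(Y, k)
        M += S @ S.T
    return M
```

Each summand $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$ is symmetric and non-negative definite, so the information from different lags accumulates without cancellation.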
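Step 1 (Estimate the number of factors.) For $0 \le k \le k_0$, let $\hat\lambda_{k,1} \ge \dots \ge \hat\lambda_{k,p} \ge 0$ be the eigenvalues of the matrix $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$. For a prespecified positive integer $J_0 \le p$, put $\hat{R}_0 = 1$ and
$$\hat{R}_j = \sum_{k=0}^{k_0} (1 - k/n)\hat\lambda_{k,j} \Big/ \sum_{k=0}^{k_0} (1 - k/n)\hat\lambda_{k,j+1}, \qquad 1 \le j \le J_0. \tag{3.2}$$
We say that $\hat{R}_s$ attains a local maximum if $\hat{R}_s > \max\{\hat{R}_{s-1}, \hat{R}_{s+1}\}$. Let $\hat{R}_{\hat\tau_1}$ and $\hat{R}_{\hat\tau_2}$ be the two largest local maxima among $\hat{R}_1, \dots, \hat{R}_{J_0-1}$. The estimators for the numbers of factors are then defined as
$$\hat{r}_0 = \min\{\hat\tau_1, \hat\tau_2\}, \qquad \hat{r}_0 + \hat{r} = \max\{\hat\tau_1, \hat\tau_2\}. \tag{3.3}$$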
Step 2 (Estimate the loadings for common factors.) Let $\hat\gamma_1, \dots, \hat\gamma_p$ be the orthonormal eigenvectors of the matrix $\hat{M}$, arranged according to the descending order of the corresponding eigenvalues. The estimated loading matrix for the common factors is
$$\hat{A} = (\hat\gamma_1, \dots, \hat\gamma_{\hat{r}_0}). \tag{3.4}$$
Step 3 (Estimate the loadings for cluster-specific factors.) Replace $y_t$ by $(I_p - \hat{A}\hat{A}^\top) y_t$ in (3.1), and repeat the eigenanalysis as in Step 2 above, but now denote the corresponding orthonormal eigenvectors by $\hat\zeta_1, \dots, \hat\zeta_p$. The estimated loading matrix for the cluster-specific factors is
$$\hat{B} = (\hat\zeta_1, \dots, \hat\zeta_{\hat{r}}). \tag{3.5}$$
Step 4 (Identify the components not belonging to any clusters.) Let $\hat{b}_1, \dots, \hat{b}_p$ denote the row vectors of $\hat{B}$. Then the identified index set for the components of $y_t$ not belonging to any clusters is
$$\hat{J}_{d+1} = \{ j : 1 \le j \le p, \ \|\hat{b}_j\| \le \omega_p \}, \tag{3.6}$$
where $\omega_p > 0$ is a constant satisfying the conditions $\omega_p = o(p^{-1/2})$ and $(p^\delta n^{-1} + p^{-\delta}) / (p\,\omega_p^2) = o(1)$.
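Step 5 (Clustering with k-means.) Let $\hat{p}_0 = p - |\hat{J}_{d+1}|$, and let $\hat{F}$ be the $\hat{p}_0 \times \hat{r}$ matrix obtained from $\hat{B}$ by removing the rows with their indices in $\hat{J}_{d+1}$. Let $\hat{f}_1, \dots, \hat{f}_{\hat{p}_0}$ denote the $\hat{p}_0$ rows of $\hat{F}$. Let $\hat{W}$ be the $\hat{p}_0 \times \hat{p}_0$ matrix with the $(\ell, m)$-th element
$$\hat\rho_{\ell,m} = \frac{\hat{f}_\ell^\top \hat{f}_m}{(\hat{f}_\ell^\top \hat{f}_\ell \cdot \hat{f}_m^\top \hat{f}_m)^{1/2}}, \qquad 1 \le \ell, m \le \hat{p}_0.$$
Perform the k-means clustering (with $L_2$-distance) for the $\hat{p}_0$ rows of $\hat{W}$, leading to the partition of $\{1, \dots, \hat{p}_0\}$ into the $k$ clusters $\hat{J}_{k,1}, \dots, \hat{J}_{k,k}$. Put
$$\mathrm{MGF}(k) = \frac{1}{\sum_{1 \le i < j \le k} |\hat{J}_{k,i}|\,|\hat{J}_{k,j}|} \sum_{1 \le i < j \le k} \ \sum_{\ell \in \hat{J}_{k,i}} \ \sum_{m \in \hat{J}_{k,j}} \hat\rho_{\ell,m}^2. \tag{3.7}$$
The estimated number of clusters $\hat{d}$ is the value such that $\mathrm{MGF}(\hat{d}+1)$ exhibits a sharp increase over $\mathrm{MGF}(k)$ for $k \le \hat{d}$, and $\mathrm{MGF}(k)$ keeps increasing for $k > \hat{d}+1$. The $\hat{d}$ estimated clusters are $\hat{J}_{\hat{d},1}, \dots, \hat{J}_{\hat{d},\hat{d}}$.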
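Steps 2 to 4 above amount to two eigenanalyses followed by a thresholding of row norms. The following is a minimal sketch under stated assumptions: NumPy is used, the rows of `Y` are the observations, the numbers of factors `r0` and `r` are supplied by the user (for instance from Step 1), and all function names are hypothetical.

```python
import numpy as np

def lagged_m(Y, k0):
    """M-hat of (3.1) for an n x p data matrix Y."""
    n, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    M = np.zeros((p, p))
    for k in range(k0 + 1):
        S = Yc[k:].T @ Yc[:n - k] / (n - k)
        M += S @ S.T
    return M

def top_eigvecs(M, r):
    """Orthonormal eigenvectors of symmetric M for its r largest eigenvalues."""
    _, vecs = np.linalg.eigh(M)          # eigh returns ascending eigenvalues
    return vecs[:, ::-1][:, :r]

def estimate_loadings(Y, r0, r, k0=5):
    A_hat = top_eigvecs(lagged_m(Y, k0), r0)        # Step 2
    Y_res = Y - Y @ A_hat @ A_hat.T                 # rows: (I - A_hat A_hat^T) y_t
    B_hat = top_eigvecs(lagged_m(Y_res, k0), r)     # Step 3
    return A_hat, B_hat

def unclustered_indices(B_hat, omega_p):
    """Step 4: indices (0-based) of rows of B_hat with norm at most omega_p."""
    row_norms = np.linalg.norm(B_hat, axis=1)
    return np.flatnonzero(row_norms <= omega_p)
```

Because the residual series lie in the orthogonal complement of $\mathcal{M}(\hat{A})$, the columns of `B_hat` returned by this sketch are orthogonal to those of `A_hat` whenever the corresponding eigenvalues are non-zero.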
Remark 3. (i) The estimators for $r_0$ and $r$ in Step 1 are based on Theorem 3 in Section 4 below. The intuition is that the eigenvalues $\lambda_{k,1} \ge \dots \ge \lambda_{k,p} (\ge 0)$ of the matrix $\Sigma_y(k)\Sigma_y(k)^\top$, where $\Sigma_y(k) = \mathrm{Cov}(y_{t+k}, y_t)$, satisfy the conditions
$$\lambda_{k,i}^{-1} = o(\lambda_{k,j}^{-1}) \quad \text{and} \quad \lambda_{k,j}^{-1} = o(\lambda_{k,\ell}^{-1}) \quad \text{for } 1 \le i \le r_0, \ r_0 < j \le r_0 + r \text{ and } \ell > r_0 + r.$$
This is implied by the differences in strength among the common factor $x_t$, the cluster-specific factors $z_{t,i}$, and the idiosyncratic components $\varepsilon_t$; see Assumptions 3 and 4. Note that we use the ratios of the cumulative eigenvalues in (3.2) in order to add together the information from different lags $k$. In practice we set $k_0$ to be a small integer such as $k_0 \le 5$, as significant autocorrelation occurs typically at small lags. The results do not vary much with respect to the value of $k_0$ (see the simulation results in Section 5.1 below). We truncate the sequence $\{\hat{R}_j\}$ at $J_0$ to alleviate the impact of '0/0'. In practice we may set $J_0 = p/4$ or $p/3$.
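The ratio-based estimator of Step 1 can be sketched as follows. This is an illustration only, assuming NumPy, a data matrix `Y` with observations in rows, and $J_0 < p$ so that $\hat\lambda_{k,J_0+1}$ exists; the function names are hypothetical, and the sketch raises an error when fewer than two local maxima are found, a case the text does not discuss.

```python
import numpy as np

def ratio_statistics(Y, k0, J0):
    """Return the array (R_0, R_1, ..., R_J0) of (3.2), with R_0 = 1."""
    n, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    weighted = np.zeros(p)
    for k in range(k0 + 1):
        S = Yc[k:].T @ Yc[:n - k] / (n - k)
        lam = np.linalg.eigvalsh(S @ S.T)[::-1]     # eigenvalues, descending
        weighted += (1 - k / n) * lam               # cumulate over lags
    R = np.empty(J0 + 1)
    R[0] = 1.0
    R[1:] = weighted[:J0] / weighted[1:J0 + 1]      # R_j for j = 1..J0
    return R

def estimate_factor_numbers(R):
    """Apply (3.3): the two largest local maxima among R_1..R_{J0-1}."""
    J0 = len(R) - 1
    peaks = [s for s in range(1, J0) if R[s] > max(R[s - 1], R[s + 1])]
    if len(peaks) < 2:
        raise ValueError("fewer than two local maxima found")
    tau1, tau2 = sorted(peaks, key=lambda s: R[s], reverse=True)[:2]
    r0_hat = min(tau1, tau2)
    r_hat = max(tau1, tau2) - r0_hat
    return r0_hat, r_hat
```

For example, for the hand-made sequence `R = [1, 1.2, 9.0, 1.1, 1.3, 6.0, 1.0, 1.05]` the local maxima are at positions 2 and 5, so the sketch returns $\hat{r}_0 = 2$ and $\hat{r} = 3$.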
(ii) The ratio-based estimation in Step 1 is new. By Theorem 3, it holds that $\hat{r}_0 \to r_0$ and $\hat{r} \to r$ in probability. The existing approaches use the ratios of the ordered eigenvalues of the matrix $\hat{M}$ instead (Lam and Yao 2012, Chang et al. 2015, Li et al. 2017), leading to an estimator which may not be consistent. See Example 1 below. Note that Lam and Yao (2012) show that their estimator $\tilde{r}_0$ fulfills the relation $P(\tilde{r}_0 \ge r_0) \to 1$ only.
(iii) Step 3 removes the common factors first before estimating B, as Lam and Yao (2012) showed that weak factors can be more accurately estimated by removing strong factors from the data first.
(iv) Once the numbers of factors are correctly specified, the factor loading spaces are relatively easier to identify. In fact $\mathcal{M}(\hat{A})$ is a consistent estimator for $\mathcal{M}(A)$. However $\mathcal{M}(\hat{B})$ is a consistent estimator for $\mathcal{M}\{(I_p - AA^\top)\binom{B}{0}\}$ instead of $\mathcal{M}\{\binom{B}{0}\}$. See Theorem 2 in Section 4 below. Furthermore the last $p_{d+1}$ rows of $(I_p - AA^\top)\binom{B}{0}$ are no longer 0. Nevertheless when both $r_0$ and $r$ are small in relation to $p$, those $p_{d+1}$ zero-rows can be recovered from $\hat{B}$ in Step 4. See Theorems 4-5 in Section 4 below.
(v) Given the block diagonal structure of $B$ in (2.2), the $d$ clusters can be identified easily if we take the $(i,j)$-th element of $BB^\top$ as the similarity measure between the $i$-th and the $j$-th components, or simply apply the k-means method to the rows of $BB^\top$. But applying the k-means method directly to the rows of $B$ will not do. Theorem 4 indicates that the block diagonal structure, though masked by asymptotically diminishing 'noise', is still present in $\hat{B}$ via a latent permutation. Accordingly the cluster analysis in Step 5 is based on the correlation-type measures among the rows of $\hat{F}$, which is an estimator of $B$.
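Step 5 can be sketched as below. The similarity matrix and MGF follow the formulas of Step 5 and (3.7); the k-means routine is a plain Lloyd iteration with a deterministic farthest-point initialisation, which is a simple stand-in chosen for this illustration and not necessarily the implementation used by the authors. All function names are hypothetical.

```python
import numpy as np

def similarity_matrix(F):
    """W-hat of Step 5: correlation-type similarities among the rows of F."""
    G = F @ F.T
    norms = np.sqrt(np.diag(G))
    return G / np.outer(norms, norms)

def kmeans_labels(X, k, n_iter=100):
    """Lloyd's k-means on the rows of X (stand-in with farthest-point init)."""
    centres = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[d2.argmax()])
    centres = np.array(centres)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centres[c] for c in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels

def mgf(W, labels):
    """MGF of (3.7): average of rho^2 over pairs in different clusters."""
    between = labels[:, None] != labels[None, :]
    return (W ** 2)[between].mean()
```

In use, one would compute `W = similarity_matrix(F_hat)`, evaluate `mgf(W, kmeans_labels(W, k))` over a range of `k`, and look for the sharp increase described in Step 5. Since $\hat{W}$ is symmetric, averaging over all ordered pairs in different clusters gives the same value as the sum over $i < j$ in (3.7).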
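Example 1. Consider a simple model of the form (2.2) in which $\varepsilon_t \equiv 0$, $r_0 = 1$, $r = 2$, and
$$x_t = p^{1/2}(u_{1,t} + a_1 u_{1,t-1} + u_{2,t} + a_2 u_{2,t-1}),$$
$$z_{1,t} = p^{1/2-\delta/2}(u_{2,t} + a_2 u_{2,t-1}), \qquad z_{2,t} = p^{1/2-\delta/2}(u_{3,t} + a_3 u_{3,t-1}),$$
where $a_1, a_2, a_3$ are constants, and $u_{i,t}$, for different $i, t$, are independent and $N(0,1)$. Let $M = \sum_{0 \le k \le 1} \Sigma_y(k)\Sigma_y(k)^\top$, and let $\lambda_1 \ge \lambda_2 \ge \lambda_3$ be the three largest eigenvalues of $M$. It can be shown that $\lambda_1 \asymp p^2$, $\lambda_3 = p^{2-2\delta}\{(1 + a_3^2)^2 + a_3^2\}$ and $\lambda_2 \asymp p^{2-\delta}$ provided $(a_1 - a_2)^2(1 - a_1 a_2) \ne 0$. Hence $\lambda_1/\lambda_2 \asymp \lambda_2/\lambda_3 \asymp p^\delta$. This shows that $r_0 (= 1)$ cannot be estimated stably based on the ratios of the eigenvalues of $\hat{M}$ for this example.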
4 Asymptotic properties
4.1 On estimation for factors
Since only the factor loading space $\mathcal{M}(A)$ is uniquely defined by (2.2) (see the discussion below Assumption 2), we measure the estimation error in terms of its (unique) projection matrix $AA^\top$.
Theorem 1. Let Assumptions 1-4 hold. For $\hat{A}$ defined in (3.4), it holds that
$$\|\hat{A}\hat{A}^\top - AA^\top\| = O_p(n^{-1/2} + p^{-\delta/2}). \tag{4.1}$$
Theorem 1 shows that in the absence of the weak factor $z_t$, the estimation for the strong factor loading space $\mathcal{M}(A)$ achieves the root-$n$ convergence rate in spite of diverging $p$. Assumption 2 ensures that the rank of the matrix $B_* \equiv (I_p - AA^\top)\binom{B}{0}$ is $r$. Denote by $P_{A^\perp B} = B_*(B_*^\top B_*)^{-1}B_*^\top$ the projection matrix onto $\mathcal{M}\{(I_p - AA^\top)\binom{B}{0}\}$, of which $\mathcal{M}(\hat{B})$ is a consistent estimator; see Theorem 2 below, and also Remark 3(iv).
Theorem 2. Let Assumptions 1-4 hold. For $\hat{B}$ defined in (3.5), it holds that
$$\|\hat{B}\hat{B}^\top - P_{A^\perp B}\| = O_p(p^{\delta/2} n^{-1/2} + p^{-\delta/2}). \tag{4.2}$$
Theorem 3 below specifies the asymptotic behaviour of the ratios of the cumulative eigenvalues used in estimating the numbers of factors in Step 1 in Section 3 above. Since $(\log p)^2 = o(p^{2\delta})$ and $(\log p)^2 = o\{p^{2-2\delta}/(p^2/n^2 + \log^2 p)\}$, it implies that $\hat{r}_0 \to r_0$ and $\hat{r} \to r$ in probability, provided that $J_0 > r_0 + r$ is fixed.
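Theorem 3. Let Assumptions 1-4 hold. For $\hat{R}_j$ defined in (3.2), it holds for some constant $C > 0$ that
$$\lim_{n,p\to\infty} P(\hat{R}_j < C) = 1 \quad \text{for } j = 1, \dots, r_0 - 1, \tag{4.3}$$
$$\hat{R}_{r_0}^{-1} = O_p(p^{-2\delta}), \qquad \hat{R}_{r_0+r}^{-1} = O_p\Big( \frac{p^2/n^2 + \log^2 p}{p^{2-2\delta}} \Big), \tag{4.4}$$
$$\lim_{n,p\to\infty} P(\hat{R}_j < C) = 1 \quad \text{for } j = r_0 + 1, \dots, r_0 + r - 1, \text{ and} \tag{4.5}$$
$$\hat{R}_j = O_p(\log^2 p) \quad \text{for } j = r_0 + r + 1, \dots, r_0 + r + s, \tag{4.6}$$
where $s$ is a positive fixed integer.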
4.2 On clustering
Our goal is to recover the $d$ latent clusters in model (2.2). Unfortunately $\hat{B}$ provides merely a consistent estimator for the space $\mathcal{M}\{(I_p - AA^\top)\binom{B}{0}\}$; see Theorem 2 above. Nevertheless it contains sufficient information for identifying the block diagonal structure of $B$ as well as the components of $y_t$ not belonging to any clusters. See Theorem 4 below.
Theorem 4. Let Assumptions 1-5 hold. There exists a $p \times \hat{r}$ matrix $\check{B}$ which is a latent row-permutation of $\hat{B}$ defined in (3.5). Write $\check{B}\check{B}^\top$ into two parts:
$$\check{B}\check{B}^\top = H_{\mathrm{diag}} + H_{\mathrm{err}},$$
where $H_{\mathrm{diag}}$ is a block diagonal matrix of the same structure as $\binom{B}{0}\binom{B}{0}^\top$, i.e.
$$H_{\mathrm{diag}} = \mathrm{diag}(H_1, \dots, H_d, 0), \tag{4.7}$$
while $H_{\mathrm{err}}$ has all the elements in the first $d$ diagonal blocks equal to 0. Then it holds for some constant $C > 0$ that $\|H_i\|_F \ge C$ for $i = 1, \dots, d$, and
$$\|H_{\mathrm{err}}\|_F = O_p(p^{\delta/2} n^{-1/2} + p^{-\delta/2}). \tag{4.8}$$
Theorem 4 shows that the components of $y_t$ not belonging to any clusters correspond to the rows of $\hat{B}$ with norms converging to 0, which underpins Step 4 of the algorithm in Section 3. Theorem 5 below indicates that the misclassification rate converges to 0 when $p_{d+1} \asymp p$. Let $J_{d+1}$ be the collection of the indices of the components of $y_t$ not belonging to any one of the $d$ clusters, and let $J_{d+1}^c = \{1, \dots, p\} - J_{d+1}$ be the complement of $J_{d+1}$. Theorem 6 provides the underpinning for Step 5 of the algorithm in Section 3.
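Theorem 5. Let Assumptions 1-6 hold. For $\hat{J}_{d+1}$ defined in (3.6),
$$\frac{|J_{d+1}^c \cap \hat{J}_{d+1}|}{p} = O_p(p^\delta n^{-1} + p^{-\delta}) = o_p(1), \text{ and} \tag{4.9}$$
$$\frac{|J_{d+1} \cap \hat{J}_{d+1}|}{|J_{d+1}|} = 1 + O_p\Big( \frac{p^\delta n^{-1} + p^{-\delta}}{p\,\omega_p^2} \Big) = 1 + o_p(1) \tag{4.10}$$
provided $p_{d+1} \asymp p$.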
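Theorem 6. Let Assumptions 1-6 hold. Let $\tilde{J}_1, \dots, \tilde{J}_d$ be a partition of $\{1, \dots, \hat{p}_0\}$ such that $\tilde{J}_j$ contains the indices of the components of $y_t$ belonging to the $j$-th cluster, $j = 1, \dots, d$. Then
$$\sum_{1 \le i < j \le d} \ \sum_{\ell \in \tilde{J}_i} \ \sum_{m \in \tilde{J}_j} \hat\rho_{\ell,m}^2 = O_p(p^\delta n^{-1} \omega_p^{-4} + p^{-\delta} \omega_p^{-4}) = o_p(p^2),$$
where $\hat\rho_{\ell,m}$ is defined in Step 5 in Section 3. Furthermore it holds for some constant $C > 0$ that
$$P\Big( \sum_{\ell, m \in \tilde{J}_j} \hat\rho_{\ell,m}^2 > C p^2 \Big) \to 1, \qquad j = 1, \dots, d.$$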
5 Numerical properties
5.1 Simulation
We illustrate the proposed methodology through a simulation with model (2.2). We draw the elements of $A$ and $B_j$ independently from, respectively, $U(-p^{-1/2}, p^{-1/2})$ and $U(-p_j^{-1/2}, p_j^{-1/2})$.
All component series of $x_t$ and $z_t$ are independent and AR(1) and MA(1), respectively, with Gaussian innovations. All components of $\varepsilon_t$ are independent MA(1) with $N(0, 0.25)$ innovations. All the AR and MA coefficients are drawn randomly from $U\{(-0.95, -0.4) \cup (0.4, 0.95)\}$. The standard deviations of the components of $p^{-1/2}x_t$ and $p^{\delta/2-1/2}z_t$ are drawn randomly from $U(1,2)$. In this way, all the components of $x_t$ are strong factors with $\delta = 0$, and all the components of $z_t$ are weak factors at strength $\delta \in (0,1)$; see Assumption 3 and Remark 1(i).
We set $n = 800$, $p = 450$, $d = 5$, $r_0 = r_1 = \dots = r_5 = 2$ (and, hence, $r = 10$), $p_1 = \dots = p_5 = 50$, and $k_0 = 5$. Therefore among the 450 component series of $y_t$, the first ($p_0 =$) 250 components form 5 clusters of equal size 50, and the last ($p_{d+1} =$) 200 components do not belong to any clusters. The factor strength of $z_t$ is set at four different levels $\delta = 0.2, 0.3, 0.4, 0.5$. For each setting, we replicate the experiment 1000 times.
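One way to generate a single replication of this design is sketched below. Several details are assumptions made for this sketch and are not stated in the text: the factor series are rescaled by their theoretical stationary standard deviations before the $U(1,2)$ scales are applied, the AR(1) recursion uses a burn-in of 200 steps, and the function names and seed are hypothetical.

```python
import numpy as np

def ma1(innov, theta):
    """MA(1) series u_t + theta * u_{t-1}, column-wise; innov has n+1 rows."""
    return innov[1:] + theta * innov[:-1]

def ar1(innov, phi, burn):
    """AR(1) series with coefficient(s) phi, dropping the first `burn` rows."""
    out = np.zeros_like(innov)
    for t in range(1, innov.shape[0]):
        out[t] = phi * out[t - 1] + innov[t]
    return out[burn:]

def simulate(n=800, p=450, d=5, r0=2, rj=2, pj=50, delta=0.3, burn=200, seed=0):
    rng = np.random.default_rng(seed)
    r, p0 = d * rj, d * pj
    # coefficients drawn from U{(-0.95, -0.4) U (0.4, 0.95)}
    coef = lambda m: rng.uniform(0.4, 0.95, m) * rng.choice([-1.0, 1.0], m)
    A = rng.uniform(-p ** -0.5, p ** -0.5, (p, r0))
    B = np.zeros((p, r))                     # block diagonal B padded with zero rows
    for j in range(d):
        B[j * pj:(j + 1) * pj, j * rj:(j + 1) * rj] = rng.uniform(
            -pj ** -0.5, pj ** -0.5, (pj, rj))
    phi = coef(r0)                           # common factors: AR(1)
    X = ar1(rng.normal(size=(n + burn, r0)), phi, burn) * np.sqrt(1 - phi ** 2)
    X *= np.sqrt(p) * rng.uniform(1, 2, r0)
    theta = coef(r)                          # cluster-specific factors: MA(1)
    Z = ma1(rng.normal(size=(n + 1, r)), theta) / np.sqrt(1 + theta ** 2)
    Z *= p ** (0.5 - delta / 2) * rng.uniform(1, 2, r)
    # idiosyncratic MA(1) with N(0, 0.25) innovations, i.e. standard deviation 0.5
    E = ma1(rng.normal(scale=0.5, size=(n + 1, p)), coef(p))
    Y = X @ A.T + Z @ B.T + E                # rows are y_t^T as in (2.2)
    labels = np.repeat(np.arange(d + 1), [pj] * d + [p - p0])  # label d: no cluster
    return Y, labels
```

The returned `labels` array records the true cluster of each component, with the value `d` marking the components that belong to no cluster.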
The numbers of factors $r_0$ and $r$ are estimated based on the ratios $\hat{R}_j$, as in (3.3), with $k_0 = 1, \dots, 5$ and $J_0 = [p/4]$. For comparison, we also report the estimates based on the ratios of the eigenvalues of $\hat{M}$ with $k_0 = 0, \dots, 5$, which is the standard method used in the literature (see, e.g., Lam and Yao 2012). The relative frequencies of $\hat{r}_0 = r_0$ and $\hat{r}_0 + \hat{r} = r_0 + r$ are reported in Table 1. Overall the method based on the ratios of the cumulative eigenvalues $\hat{R}_j$ provides accurate and robust performance and is not sensitive to the choice of $k_0$. The estimation based on the eigenvalues of $\hat{M}$ with $k_0 \ge 1$ is competitive for $r_0$, but is considerably poorer for $r_0 + r$ with $\delta = 0.4$ and $0.5$. Using $\hat{M}$ with $k_0 = 0$ leads to weaker estimates for $r_0$ when $\delta = 0.2$ and $0.3$. It is noticeable that the performance of the estimation for the number of common factors $r_0$ improves as $\delta$ increases. This is due to the fact that the larger $\delta$ is, the larger the difference in the factor strength between the common factor $x_t$ and the cluster-specific factor $z_t$ is. Therefore it is easier to tell $x_t$ apart from $z_t$ for larger $\delta$. The performance for estimating $r_0 + r$ based on $\hat{R}_j$ is better than that for $r_0$, as in terms of the factor strength, the difference between $(x_t, z_t)$ and $\varepsilon_t$ is significantly greater than that between $x_t$ and $(z_t, \varepsilon_t)$.

Table 1: The relative frequencies of $\hat{r}_0 = r_0$ and $\hat{r}_0 + \hat{r} = r_0 + r$ in a simulation with 1000 replications, where $\hat{r}_0$ and $\hat{r}$ are estimated by the $\hat{R}_j$-based method (3.3) with $k_0 = 1, \dots, 5$, or by the ratios of the eigenvalues of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$ with $k = 0, 1, \dots, 5$.

Estimation              r0-hat = r0                  r0-hat + r-hat = r0 + r
method              δ=.2   δ=.3   δ=.4   δ=.5      δ=.2   δ=.3   δ=.4   δ=.5
R-hat_j (k0 = 1)    .446   .803   .973   .999      1      1      1      1
R-hat_j (k0 = 2)    .476   .813   .970   .999      .997   1      1      1
R-hat_j (k0 = 3)    .477   .811   .970   .997      .998   1      1      1
R-hat_j (k0 = 4)    .470   .804   .965   .995      .995   1      1      .999
R-hat_j (k0 = 5)    .465   .805   .963   .995      .997   1      1      .998
M-hat (k0 = 0)      .410   .762   .967   1         1      1      1      1
M-hat (k0 = 1)      .451   .808   .974   1         1      .999   .866   .339
M-hat (k0 = 2)      .499   .824   .972   .999      1      .998   .918   .367
M-hat (k0 = 3)      .520   .822   .970   .997      1      .996   .843   .296
M-hat (k0 = 4)      .529   .815   .966   .995      1      .992   .783   .250
M-hat (k0 = 5)      .531   .816   .964   .995      1      .990   .730   .193
Recall that $P_{A^\perp B}$ is the projection matrix onto the space $\mathcal{M}\{(I_p - AA^\top)\binom{B}{0}\}$; see Theorem 2 and also Remark 3(iv). Table 2 contains the means and standard deviations of the estimation errors for the factor loading spaces $\|\hat{A}\hat{A}^\top - AA^\top\|_F$ and $\|\hat{B}\hat{B}^\top - P_{A^\perp B}\|_F$, where $\hat{A}$ is estimated by the eigenvectors of the matrix $\hat{M}$ in (3.1) with $k_0 = 1, \dots, 5$; see Step 2 of the algorithm stated in Section 3. See also Step 3 there for the similar procedure for estimating $B$. For comparison, we also include the estimates obtained with $\hat{M}$ replaced by $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$ with $k = 0, 1, \dots, 5$.
Table 2: The means and standard deviations (in parentheses) of $\|\hat{A}\hat{A}^\top - AA^\top\|_F$ and $\|\hat{B}\hat{B}^\top - P_{A^\perp B}\|_F$ in a simulation with 1000 replications, where $\hat{A}$ is estimated by the eigenvectors of $\hat{M}$ in (3.1) (with $k_0 = 1, \dots, 5$), or by those of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$ (for $k = 0, 1, \dots, 5$), and $\hat{B}$ is estimated in a similar manner. Both $r_0$ and $r$ are assumed to be known.

Estimation             Error for A (Frobenius norm)                     Error for B (Frobenius norm)
method             δ=.2        δ=.3        δ=.4        δ=.5         δ=.2        δ=.3        δ=.4        δ=.5
M-hat (k0 = 1)     .375(.320)  .157(.060)  .103(.028)  .081(.017)   .459(.290)  .354(.033)  .440(.029)  .592(.040)
M-hat (k0 = 2)     .329(.275)  .153(.056)  .105(.027)  .083(.018)   .419(.247)  .354(.031)  .444(.030)  .597(.041)
M-hat (k0 = 3)     .318(.267)  .154(.056)  .106(.027)  .085(.018)   .410(.239)  .357(.031)  .448(.030)  .602(.042)
M-hat (k0 = 4)     .315(.264)  .154(.055)  .108(.028)  .086(.018)   .409(.236)  .359(.031)  .452(.031)  .606(.042)
M-hat (k0 = 5)     .313(.263)  .155(.055)  .109(.028)  .087(.019)   .409(.235)  .362(.031)  .455(.031)  .610(.043)
Σ-hat_y(0) based   .474(.390)  .169(.069)  .105(.028)  .079(.017)   .541(.361)  .345(.040)  .418(.027)  .562(.038)
Σ-hat_y(1) based   .351(.265)  .201(.077)  .147(.046)  .121(.033)   .635(.200)  .702(.053)  .907(.066)  1.19(.083)
Σ-hat_y(2) based   .372(.176)  .295(.133)  .241(.113)  .210(.094)   2.04(.156)  2.25(.134)  2.46(.120)  2.66(.107)
Σ-hat_y(3) based   .605(.336)  .489(.278)  .407(.242)  .368(.220)   2.10(.171)  2.29(.146)  2.49(.124)  2.70(.114)
Σ-hat_y(4) based   .810(.406)  .666(.349)  .565(.314)  .547(.323)   2.16(.185)  2.33(.150)  2.52(.138)  2.74(.125)
Σ-hat_y(5) based   .946(.411)  .786(.371)  .690(.346)  .661(.342)   2.20(.189)  2.36(.157)  2.55(.143)  2.77(.131)

Here "Σ-hat_y(k) based" refers to the eigenvectors of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$.
respect to the different values of $k_0$. Furthermore, using a single-lagged covariance matrix for estimating the factor loading spaces is not recommended. When $\delta$ increases, the error $\|\hat{A}\hat{A}^\top - AA^\top\|_F$ decreases, as indicated by Theorem 1. However the pattern in the error $\|\hat{B}\hat{B}^\top - P_{A^\perp B}\|_F$ is more complex, as it decreases initially and then increases as $\delta$ increases, which is in line with the asymptotic result in Theorem 2.
In the sequel, we only report the results with $\hat{r}_0$ and $\hat{r}$ estimated by (3.3), and the factor loading spaces estimated by the eigenvectors of $\hat{M}$ with $k_0 = 5$.
To examine the effectiveness of Step 4 of the algorithm, we plot in Figure 1 the sample percentiles at the 5%, 50% and 95% levels of each $\|\hat{b}_j\|$ over the 1000 replications, for $j = 1, \dots, 450$. It is clear that the norms of the last 200 ($= p_{d+1}$) components (not belonging to any clusters) indeed drop flat and are close to 0. This indicates clearly that it is possible to distinguish the components of $y_t$ not belonging to any clusters from those belonging to one of the $d$ clusters. Note that the indices of the components not belonging to any clusters are identified as those in $\hat{J}_{d+1}$ in (3.6), which is defined in terms of a threshold $\omega_p = o(p^{-1/2})$. We experiment with three choices of this tuning parameter, namely $\omega_{p1} = (\hat{r}/p)^{1/2}/\ln p$, $\omega_{p2} = \{\hat{r}/(p \ln p)\}^{1/2}$ and $\omega_{p3} = \{\hat{r}/(p \ln\ln p)\}^{1/2}$. Recall that $J_{d+1}^c$ contains all the indices of the components of $y_t$ belonging to one of the $d$ clusters. The means and standard deviations of the two types of misclassification errors $E_1 = |J_{d+1}^c \cap \hat{J}_{d+1}|/|J_{d+1}^c|$ and $E_2 = |J_{d+1} \cap \hat{J}_{d+1}^c|/|J_{d+1}|$ over the 1000 replications are reported in Table 3. Among the three choices, $\omega_{p2}$ appears to work best, as the two types of errors are both small. The increase in the errors due to the estimation of $r_0$ and $r$ is not significant when $\delta = 0.4, 0.5$. But the increase in $E_2$ due to unknown $r$ and $r_0$ is noticeable when $\delta = 0.2$.
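The three candidate thresholds and the two error rates are simple to compute. The sketch below uses only the Python standard library; the function names are hypothetical, and index sets are represented as collections of 0-based integers.

```python
import math

def thresholds(r_hat, p):
    """The three tuning-parameter choices for omega_p considered above."""
    lp = math.log(p)
    return {
        "omega_p1": math.sqrt(r_hat / p) / lp,
        "omega_p2": math.sqrt(r_hat / (p * lp)),
        "omega_p3": math.sqrt(r_hat / (p * math.log(lp))),
    }

def error_rates(true_unclustered, est_unclustered, p):
    """E1 and E2 as defined above, for index sets within {0, ..., p-1}."""
    J, J_hat = set(true_unclustered), set(est_unclustered)
    Jc = set(range(p)) - J
    e1 = len(Jc & J_hat) / len(Jc)     # clustered components flagged as unclustered
    e2 = len(J - J_hat) / len(J)       # unclustered components that were missed
    return e1, e2
```

For any $p$ with $\ln p > 1$ the three thresholds are ordered as $\omega_{p1} < \omega_{p2} < \omega_{p3}$, so a larger threshold places more components in $\hat{J}_{d+1}$, which raises $E_1$ and lowers $E_2$, in line with the pattern in Table 3.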
Table 3: The means and standard deviations (in parentheses) of the error rates $E_1 = |J_{d+1}^c \cap \hat{J}_{d+1}|/|J_{d+1}^c|$ and $E_2 = |J_{d+1} \cap \hat{J}_{d+1}^c|/|J_{d+1}|$ in a simulation with 1000 replications with the 3 possible choices of the threshold $\omega_p$ in (3.6), and the numbers of factors $r_0$ and $r$ either known or to be estimated.

                       r0 and r are known                               r0 and r are estimated
            δ=.2        δ=.3        δ=.4        δ=.5         δ=.2        δ=.3        δ=.4        δ=.5
E1  ω_p1    .004(.004)  .005(.004)  .004(.004)  .003(.003)   .009(.036)  .004(.004)  .004(.004)  .003(.003)
    ω_p2    .043(.013)  .044(.012)  .043(.012)  .041(.012)   .047(.073)  .041(.015)  .042(.013)  .041(.012)
    ω_p3    .156(.021)  .157(.020)  .156(.021)  .156(.020)   .162(.072)  .155(.020)  .156(.021)  .155(.020)
E2  ω_p1    .147(.196)  .060(.062)  .115(.061)  .327(.089)   .369(.339)  .171(.264)  .141(.145)  .329(.093)
    ω_p2    .011(.050)  .000(.000)  .000(.000)  .000(.000)   .112(.128)  .042(.094)  .011(.052)  .001(.014)
    ω_p3    .000(.000)  .000(.000)  .000(.000)  .000(.000)   .001(.031)  .000(.000)  .000(.000)  .000(.000)
Table 4: The means and standard deviations (STD) of the error rates $S/|\hat J^c_{d+1}\cap J^c_{d+1}|$ in a simulation with 1000 replications with the numbers of factors $r_0$ and $r$ either known or to be estimated.

              r0 and r are known              r0 and r are estimated
         δ=.2    δ=.3    δ=.4    δ=.5    δ=.2    δ=.3    δ=.4    δ=.5
mean     .0025   0       0       0       .0266   .0076   .0015   .0002
STD      .0123   0       0       0       .0530   .0168   .0079   .0028
Figure 1 also shows that when δ = 0.2, 0.3, the 95% percentiles of the last 200 minimum norms are clearly greater than 0, though the 50% percentiles are still much smaller than the 5% percentiles of the first 250 (= $p_0$) norms.
In the sequel, we only report the results with $\omega_{p2} = \{\hat r/(p\ln p)\}^{1/2}$.
The number of clusters is estimated based on MGF in (3.7). Figure 2 presents the boxplots of MGF(k) for k = 2, · · · , 10. We calculated MGF(·) with $(r_0, r)$ being either known or estimated by $(\hat r_0, \hat r)$. In either case, the values of MGF(k) increase sharply from k = 5 to k = 6, and keep increasing for k > 6. Hence we set $\hat d = 5$. Then the $\hat d$ clusters are obtained by performing the k-means clustering (with $k = \hat d$) for the $\hat p_0$ rows of $\hat W$, where $\hat p_0 = p - |\hat J_{d+1}|$. See Step 5 of the algorithm in Section 3. As the error rates in estimating $J^c_{d+1}$ have already been reported in Table 3, we now concentrate on the components of $y_t$ with indices in $\hat J^c_{d+1}\cap J^c_{d+1}$, and count the number of them which were misplaced by the k-means clustering. Let $S$ denote the number of misplaced components. Both the means and the standard deviations of the error rates $S/|\hat J^c_{d+1}\cap J^c_{d+1}|$ over 1000 replications are reported in Table 4. It shows clearly that the k-means clustering identifies the latent clusters very accurately, and the difference in performance due to estimating $(r_0, r)$ is also small.
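Counting misplaced components requires matching the arbitrary labels returned by k-means to the true cluster labels. The text does not spell out how this matching is done; the sketch below assumes that S is the smallest number of mismatches over all one-to-one relabellings, which is feasible by brute force for a small number of clusters such as d = 5. The function name is hypothetical.

```python
from itertools import permutations


def misplaced_count(true_labels, est_labels):
    """Smallest number of mismatches over all relabellings of the estimated clusters."""
    true_ids = sorted(set(true_labels))
    est_ids = sorted(set(est_labels))
    if len(est_ids) > len(true_ids):
        raise ValueError("more estimated clusters than true clusters")
    best = len(true_labels)
    # Try every injective map from estimated labels to true labels.
    for perm in permutations(true_ids, len(est_ids)):
        mapping = dict(zip(est_ids, perm))
        wrong = sum(mapping[e] != t for t, e in zip(true_labels, est_labels))
        best = min(best, wrong)
    return best
```

Dividing the returned count by the number of components considered gives an error rate of the kind reported in Table 4.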
More simulation results are collected in an online supplementary file, exhibiting patterns similar to those reported above under two different settings (i.e. n = 400, p = 300, d = 4, and
[Figure 1 plot: eight panels of $\|\hat b_j\|$ against j, four titled "r, r0 known" and four titled "r, r0 unknown".]
Figure 1: Sample percentiles of $\|\hat b_j\|$ at the levels of 5% (blue), 50% (black) and 95% (red) are plotted against j. The four columns from left to right correspond to, respectively, δ = 0.2, 0.3, 0.4, 0.5.
[Figure 2 plot: eight panels of MGF against the number of clusters, four titled "r, r0 known" and four titled "r, r0 unknown".]
Figure 2: Boxplots of MGF. The four columns from left to right correspond to, respectively, δ = 0.2, 0.3, 0.4, 0.5.
n= 200, p= 240, d= 6).
5.2 Real data illustration
We consider the daily returns of the stocks listed in the S&P 500 from 31 December 2014 to 31 December 2019. After removing those which were not traded on every trading day during the period, there remain p = 477 stocks which were traded on n = 1259 trading days. Those stocks are from 11 industry sectors:
Figure 3: Plot of $\hat R_j$ against j for 2 ≤ j ≤ 20.
1. Communication Services 2. Consumer Discretionary 3. Consumer Staples
4. Energy 5. Financials 6. Health Care
7. Industrials 8. Information Technology 9. Materials
10. Real Estate 11. Utilities
The conventional wisdom suggests that the companies in the same industry sector share some common features. We apply the proposed 5-step algorithm in Section 3 to the return series to cluster those 477 stocks into different groups.
Step 1 is to estimate the numbers of strong factors and cluster-specific weak factors. To this end, we calculate $\hat R_j$ as in (3.2) with $k_0 = 5$. It turns out that $\hat R_1 = 32.53$ is much larger than all the others, while $\hat R_j$ for j ≥ 2 are plotted in Figure 3. By (3.3), $\hat r_0 = 1$ and $\hat r_0 + \hat r = 4$. Note that the estimates $\hat r_0$ and $\hat r_0 + \hat r$ are unchanged with $k_0 = 1, \cdots, 4$. While the existence of $\hat r_0 = 1$ strong and common factor is reasonable, it is most unlikely that there are merely $\hat r = 3$ cluster-specific weak factors. Note that the estimators in (3.3) are derived under the assumption that all the $r$ cluster-specific (i.e. weak) factors are of the same factor strength; see Remark 1(ii) in Section 2 above. In practice weak factors may have different degrees of strength, implying that we should also take into account the 3rd and the 4th largest local maxima of $\hat R_j$. Hence we take $\hat r_0 + \hat r = 10$ (or perhaps also 13), as Figure 3 suggests that there are 3 factors with factor strength δ1 > 0, and a further 6 factors with strength δ2 ∈ (δ1, 1).
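The reading of Figure 3 in terms of ranked local maxima can be sketched as follows. Equations (3.2) and (3.3) are not reproduced in this section, so the code assumes that $\hat R_j$ is the ratio of consecutive eigenvalues, $\hat\lambda_j/\hat\lambda_{j+1}$, with the eigenvalues sorted in decreasing order. The function names and the toy eigenvalues are hypothetical and are not the values obtained from the S&P 500 data.

```python
def eigen_ratios(eigs):
    """Ratios R_j = eigs[j-1] / eigs[j] for j = 1, ..., len(eigs) - 1 (1-based j)."""
    return [eigs[j] / eigs[j + 1] for j in range(len(eigs) - 1)]


def ranked_local_maxima(ratios):
    """1-based positions of the local maxima of the ratios, largest ratio first."""
    peaks = []
    for i, r in enumerate(ratios):
        left = ratios[i - 1] if i > 0 else float("-inf")
        right = ratios[i + 1] if i + 1 < len(ratios) else float("-inf")
        if r > left and r > right:
            peaks.append((r, i + 1))
    return [j for _, j in sorted(peaks, reverse=True)]


# Made-up eigenvalues with one dominant factor and two further drops.
toy_eigs = [1000, 30, 28, 26, 10, 9.5, 9, 8.6, 8.3, 8, 4, 3.9, 3.8]
top_three = ranked_local_maxima(eigen_ratios(toy_eigs))[:3]
```

In this toy example the three largest local maxima sit at j = 1, 4 and 10, mimicking the pattern described above in which the leading peaks alone would understate the number of weak factors.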
With $\hat r_0 = 1$ and $\hat r = 9$, we proceed to Steps 2 & 3 of Section 3 and obtain the estimator $\hat B$ as in (3.5). Setting $\omega_p = \{\hat r/(p\ln p)\}^{1/2}$, we obtain $|\hat J_{d+1}| = 12$, i.e. 12 stocks do not appear to belong to any cluster, where $\hat J_{d+1}$ is defined as in (3.6) in Step 4. Leaving those 12 stocks out, we perform Step 5, i.e. the k-means clustering for the $\hat p_0 = 477 - 12 = 465$ rows of the matrix $\hat W$. The resulting MGF(·) is plotted in Figure 4. As MGF(9) is substantially greater than MGF(k) for k < 9, and MGF(k) keeps increasing for k > 9, we take $\hat d = 9$ as the number of latent clusters.

Figure 4: MGF with different numbers of clusters when $\hat r_0 = 1$ and $\hat r = 9$.

To present the identified $d$ clusters, we define an 11 × d matrix with $n_{ij}/n_i$ as its (i, j)-th element, where $n_i$ is the number of the stocks in the i-th industry sector, and $n_{ij}$ is the number of the stocks in the i-th industry sector which are allocated to the j-th cluster. Thus $n_{ij}/n_i \in [0, 1]$ and $\sum_j n_{ij}/n_i = 1$. The heatmap of this 11 × d matrix for $d = \hat d = 9$ is presented in Figure 5. The first cluster mainly consists of the companies in Consumer Staples, Real Estate and Utilities; Clusters 2 and 3 contain the companies in, respectively, Financials and Health Care; Cluster 4 contains mainly some companies in Communication Services and Information Technology; Cluster 5 consists of the companies in Industrials and Materials; Cluster 6 consists mainly of the companies in Consumer Discretionary; Cluster 7 is a mixture of a small number of companies from each of 5 or 6 different sectors; Cluster 8 consists mainly of the companies from Information Technology; and Cluster 9 contains almost all companies in Energy. To examine how stable the clustering is, we also include the results for d = 11 and d = 3 in Figure 5. When d is increased from 9 to 11, the original Cluster 1 is divided into new Clusters 1 and 11, with the former consisting of the Consumer Staples and Utilities sectors, and the latter being the Real Estate sector. Furthermore the original Cluster 7
[Figure 5 plot: three heatmap panels for d = 9, d = 11 and d = 3, with clusters on the horizontal axis, industry sectors on the vertical axis, and a colour scale from 0 to 1.]
Figure 5: Heatmaps of the distributions of the stocks in each of the 11 industry sectors (corresponding to 11 rows) over d clusters (corresponding to d columns), with d = 9, 11 and 3. The estimated numbers of the common and cluster-specific factors are, respectively, $\hat r_0 = 1$ and $\hat r = 9$.
[Figure 6 plot: three heatmap panels for d = 9, d = 11 and d = 3, with clusters on the horizontal axis, industry sectors on the vertical axis, and a colour scale from 0 to 1.]
Figure 6: Heatmaps of the distributions of the stocks in each of the 11 industry sectors (corresponding to 11 rows) over d clusters (corresponding to d columns), with d = 9, 11 and 3. The estimated numbers of the common and cluster-specific factors are, respectively, $\hat r_0 = 1$ and $\hat r = 12$.
splits into new Clusters 7 and 10, while the other 7 original clusters are hardly changed. With d = 3, most companies in each of the 11 sectors stay in one cluster.
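The sector-by-cluster matrix behind the heatmaps, with entries $n_{ij}/n_i$, can be computed as in the following minimal sketch. The function name, the integer coding of sectors and clusters, and the toy input are assumptions made for illustration.

```python
def sector_cluster_shares(sectors, clusters, n_sectors, n_clusters):
    """Return the n_sectors x n_clusters matrix with entries n_ij / n_i.

    sectors[s] and clusters[s] are the 0-based sector and cluster of stock s.
    """
    counts = [[0] * n_clusters for _ in range(n_sectors)]
    for sec, clu in zip(sectors, clusters):
        counts[sec][clu] += 1
    shares = []
    for row in counts:
        n_i = sum(row)
        # A sector with no stocks is left as a row of zeros.
        shares.append([n_ij / n_i if n_i else 0.0 for n_ij in row])
    return shares


# Toy example: 5 stocks, 2 sectors, 2 clusters.
toy = sector_cluster_shares([0, 0, 0, 1, 1], [0, 0, 1, 1, 1], 2, 2)
```

Each non-empty row sums to one, so a row concentrated on a single column corresponds to a sector that falls almost entirely into one cluster.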
If we take $\hat r_0 = 1$ and $\hat r = 12$, the estimated $\hat J_{d+1}$ is unchanged. The clustering results with d = 9, 11 and 3 are presented in Figure 6. Comparing with Figure 5, there are some striking similarities. First, the clusterings with d = 3 are almost identical. For d = 9, the profiles of Clusters 2, · · · , 6, 8 and 9 are not significantly changed, while Clusters 1 and 7 in Figure 5 are somewhat mixed together in Figure 6. With d = 11, the profiles of Clusters 2 – 6 and 8 – 10 in the two figures are about the same, while Clusters 7 and 11 are mixed up across the two figures.
The analysis above indicates that the companies in the same industry sector tend to share a similar dynamic structure, in the sense that they are driven by the same cluster-specific factors. Our analysis is reasonably stable, as most of the clusters do not change substantially when the number of the weaker factors increases from $\hat r = 9$ to $\hat r = 12$.
6 Technical proofs
6.1 Proof of Theorem 1
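Write the singular value decomposition of $B^*$ as
$$B^* = (I_p - AA^\top)\begin{pmatrix} B \\ 0 \end{pmatrix} = \tilde B\Lambda_{\tilde B}V_{\tilde B}^\top, \tag{6.1}$$
where $\tilde B$ is a $p\times r$ matrix consisting of the left singular vectors and $V_{\tilde B}$ is an $r\times r$ matrix consisting of the right singular vectors such that $\tilde B^\top\tilde B = V_{\tilde B}^\top V_{\tilde B} = I_r$. Note that $\tilde B\tilde B^\top = P_{A^\perp B}$.
At first we give a lemma about $B^*$. It ensures that there is a positive constant $q_0 < 1$ such that $\|\Lambda_{\tilde B}\|_{\min} \ge 1 - q_0$.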
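Lemma 1. Under Assumption 2,
$$\operatorname{rank} B^* = r$$
and
$$\|B^*\|_{\min} \ge 1 - q_0.$$
Proof of Lemma 1. $B^\top B = I_r$ implies
$$\operatorname{rank} B^* \le r.$$
Moreover, from Weyl's inequality for singular values, the $r$th largest singular value, i.e. the smallest non-zero singular value, of $B^*$ satisfies
$$\|B^*\|_{\min} \ge \|B\|_{\min} - \Big\|AA^\top\begin{pmatrix} B \\ 0 \end{pmatrix}\Big\| = 1 - \Big\|AA^\top\begin{pmatrix} B \\ 0 \end{pmatrix}\Big\| \ge 1 - q_0.$$
Definition 6.1. Let
$$\dot y_t = Ax_t + \begin{pmatrix} B \\ 0 \end{pmatrix} z_t, \tag{6.2}$$
$$\dot\Sigma_y(k) = \operatorname{Cov}(\dot y_{t+k}, \dot y_t) \tag{6.3}$$
and
$$\dot M = \sum_{k=0}^{k_0}\dot\Sigma_y(k)\dot\Sigma_y(k)^\top = \dot A\dot\Lambda_A\dot A^\top + \dot B\dot\Lambda_B\dot B^\top, \tag{6.4}$$
where $\dot A$ is a $p\times r_0$ matrix which consists of the eigenvectors corresponding to the $r_0$ largest eigenvalues of $\dot M$ and $\dot B$ is a $p\times r$ matrix which consists of the eigenvectors corresponding to the other eigenvalues of $\dot M$. Let
$$\tilde z_t = \Lambda_{\tilde B}V_{\tilde B}^\top z_t, \qquad \tilde x_t = x_t + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix} z_t.$$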
Therefore model (2.2) can be equivalently rewritten as
$$y_t = A\tilde x_t + \tilde B\tilde z_t + We_t. \tag{6.5}$$
Note that (6.1) ensures that
$$A^\top\tilde B = 0. \tag{6.6}$$
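Now we prove that $\tilde x_t$ and $\tilde z_t$ have the same properties as $x_t$ and $z_t$.
Definition 6.2.
$$\tilde\Sigma_x(k) = \operatorname{Cov}(\tilde x_{t+k}, \tilde x_t), \qquad \tilde\Sigma_z(k) = \operatorname{Cov}(\tilde z_{t+k}, \tilde z_t),$$
$$\tilde\Sigma_{x,z}(k) = \operatorname{Cov}(\tilde x_{t+k}, \tilde z_t), \qquad \tilde\Sigma_{z,x}(k) = \operatorname{Cov}(\tilde z_{t+k}, \tilde x_t).$$
Lemma 2. Under Assumptions 2 and 3,
$$\|\tilde\Sigma_x(k)\| \asymp p \asymp \|\tilde\Sigma_x(k)\|_{\min}, \tag{6.7}$$
$$\|\tilde\Sigma_z(k)\| \asymp p^{1-\delta} \asymp \|\tilde\Sigma_z(k)\|_{\min}, \tag{6.8}$$
$$\|\tilde\Sigma_x(0)^{-1/2}\tilde\Sigma_{x,z}(0)\tilde\Sigma_z(0)^{-1/2}\| \le q_1 < 1, \qquad \|\tilde\Sigma_z(0)^{-1/2}\tilde\Sigma_{z,x}(0)\tilde\Sigma_x(0)^{-1/2}\| \le q_1 < 1, \tag{6.9}$$
$$\|\tilde\Sigma_{x,z}(k)\| = O(p^{1-\delta/2}), \qquad \|\tilde\Sigma_{z,x}(k)\| = O(p^{1-\delta/2}), \tag{6.10}$$
and
$$\operatorname{Cov}(\tilde x_t, e_s) = 0, \qquad \operatorname{Cov}(\tilde z_t, e_s) = 0. \tag{6.11}$$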
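Proof of Lemma 2. From (2.5) and (2.7),
$$\begin{aligned}
\tilde\Sigma_x(k) &= \operatorname{Cov}\Big(x_{t+k} + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix} z_{t+k},\; x_t + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix} z_t\Big) \\
&= \Sigma_x(k) + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix}\Sigma_z(k)(B^\top, 0)A + \Sigma_{x,z}(k)(B^\top, 0)A + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix}\Sigma_{z,x}(k) \\
&= \Sigma_x(k) + o(p).
\end{aligned}$$
This, together with (2.4), concludes (6.7). Similarly,
$$\tilde\Sigma_z(k) = \operatorname{Cov}(\Lambda_{\tilde B}V_{\tilde B}^\top z_{t+k}, \Lambda_{\tilde B}V_{\tilde B}^\top z_t) = \Lambda_{\tilde B}V_{\tilde B}^\top\Sigma_z(k)V_{\tilde B}\Lambda_{\tilde B}.$$
This, together with Lemma 1 and (2.5), concludes (6.8). For $\tilde\Sigma_{x,z}(k)$, one has
$$\begin{aligned}
\tilde\Sigma_{x,z}(k) &= \operatorname{Cov}\Big(x_{t+k} + A^\top\begin{pmatrix} B \\ 0 \end{pmatrix} z_{t+k},\; \Lambda_{\tilde B}V_{\tilde B}^\top z_t\Big) \\
&= A^\top\begin{pmatrix} B \\ 0 \end{pmatrix}\Sigma_z(k)V_{\tilde B}\Lambda_{\tilde B} + \Sigma_{x,z}(k)V_{\tilde B}\Lambda_{\tilde B} \\
&= \Sigma_{x,z}(k)V_{\tilde B}\Lambda_{\tilde B} + O(p^{1-\delta}).
\end{aligned}$$
This implies (6.9) and (6.10). (6.11) is obvious.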
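Now we give the relation between $\dot A$ and $A$.
Lemma 3. Under Assumptions 1-3,
$$\|\dot A\dot A^\top - AA^\top\| = O(p^{-\delta/2}) \tag{6.12}$$
and
$$\|A^\top\dot B\| = O(p^{-\delta/2}). \tag{6.13}$$
Moreover, the orders of magnitude of $\|\dot A\dot A^\top - AA^\top\|$ and $\|A^\top\dot B\|$ are totally determined by
$$\frac{1}{p^2}\Big\|\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_{x,z}(k) + \sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_z(k)\Big\|.$$
$\|\dot A\dot A^\top - AA^\top\| = \|A^\top\dot B\| = 0$ if and only if
$$\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_{z,x}(k)^\top + \sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_z(k)^\top = 0. \tag{6.14}$$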
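Proof of Lemma 3. From (6.4) we see that $\dot A$ and $\dot B$ are the eigenvector matrices corresponding to different eigenvalues, so that
$$\dot A^\top\dot M\dot B = 0.$$
Recalling (6.2) and (6.5) we have
$$\dot y_t = A\tilde x_t + \tilde B\tilde z_t = Ax_t + \begin{pmatrix} B \\ 0 \end{pmatrix} z_t. \tag{6.15}$$
Hence we can further expand $\dot A^\top\dot M\dot B$ as
$$\begin{aligned}
0 &= \dot A^\top\Big(\sum_{k=0}^{k_0}\dot\Sigma_y(k)\dot\Sigma_y(k)^\top\Big)\dot B \\
&= \dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_x(k)^\top\Big)A^\top\dot B + \dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_{z,x}(k)^\top\Big)\tilde B^\top\dot B \\
&\quad + \dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_{x,z}(k)^\top\Big)A^\top\dot B + \dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_z(k)^\top\Big)\tilde B^\top\dot B \\
&\quad + \dot A^\top\tilde B\Big(\sum_{k=0}^{k_0}\tilde\Sigma_{z,x}(k)\tilde\Sigma_x(k)^\top\Big)A^\top\dot B + \dot A^\top\tilde B\Big(\sum_{k=0}^{k_0}\tilde\Sigma_{z,x}(k)\tilde\Sigma_{z,x}(k)^\top\Big)\tilde B^\top\dot B \\
&\quad + \dot A^\top\tilde B\Big(\sum_{k=0}^{k_0}\tilde\Sigma_z(k)\tilde\Sigma_{x,z}(k)^\top\Big)A^\top\dot B + \dot A^\top\tilde B\Big(\sum_{k=0}^{k_0}\tilde\Sigma_z(k)\tilde\Sigma_z(k)^\top\Big)\tilde B^\top\dot B.
\end{aligned}$$
This, together with (6.7)-(6.10), implies
$$\Big\|\dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_x(k)^\top\Big)A^\top\dot B\Big\| = O(p^{2-\delta/2}).$$
Moreover, (6.7) implies that
$$\|\tilde\Sigma_x(k)\tilde\Sigma_x(k)^\top\|_{\min} \asymp p^2 \asymp \|\tilde\Sigma_x(k)\tilde\Sigma_x(k)^\top\|.$$
This further yields that
$$\Big\|\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_x(k)^\top\Big\|_{\min} \asymp p^2.$$
So we conclude that (6.12)-(6.13) are true. Moreover, if $\|\dot A\dot A^\top - AA^\top\| = \|A^\top\dot B\| = 0$, then
$$\dot A^\top A\Big(\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_{z,x}(k)^\top + \sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_z(k)^\top\Big)\tilde B^\top\dot B = 0.$$
Then
$$\sum_{k=0}^{k_0}\tilde\Sigma_x(k)\tilde\Sigma_{z,x}(k)^\top + \sum_{k=0}^{k_0}\tilde\Sigma_{x,z}(k)\tilde\Sigma_z(k)^\top = 0.$$
If (6.14) holds, then
$$A^\top\dot M\tilde B = 0.$$
The smallest eigenvalue of $A^\top\dot MA$ has a larger order than the largest eigenvalue of $\tilde B^\top\dot M\tilde B$. So
$$\|\dot A\dot A^\top - AA^\top\| = \|A^\top\dot B\| = 0.$$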
Now we give a lemma about $e_t$. Let
$$\Sigma_e(k) = \frac{1}{n-k}\sum_{t=1}^{n-k} e_{t+k}e_t^\top.$$
Lemma 4. Under Assumption 4,
$$\|\Sigma_e(0)\| = O_p\Big(\frac{p}{n} + \log p\Big) = o_p(p^{1-\delta}).$$
Lemma 4 implies that the order of $\|\Sigma_e(0)\|$ is smaller than $p^{1-\delta}$.
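Proof of Lemma 4. Let $\Sigma_{e,k}$ be an $n\times n$ matrix whose $(i, j)$ element is $E(e_{t+i,k}e_{t+j,k})$. Define
$$m = E\max_{1\le k\le p}\sum_{t=1}^{n} e_{t,k}^2.$$
From Theorem 5.48 and Remark 5.49 in Vershynin (2010) we conclude that
$$(E\|\Sigma_e(0)\|)^{1/2} \le \Big\|\frac{1}{p}\sum_{k=1}^{p}\Sigma_{e,k}\Big\|^{1/2}\sqrt{\frac{p}{n}} + C_1\sqrt{\frac{m\log n}{n}},$$
where $C_1$ is an absolute constant. Recalling (2.9) we have
$$\lim_{n,p\to\infty}\Big\|\frac{1}{p}\sum_{k=1}^{p}\Sigma_{e,k}\Big\| < C.$$
So we only need to prove $m = O(n + p/\log n)$. From (2.10),
$$\begin{aligned}
E\max_{1\le k\le p}\sum_{t=1}^{n} e_{t,k}^2 &\le n\max_{1\le k\le p}Ee_{t,k}^2 + E\max_{1\le k\le p}\Big(\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big) \\
&< Cn + E\max_{1\le k\le p}\Big(\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big).
\end{aligned}$$
Moreover,
$$\begin{aligned}
E\max_{1\le k\le p}\Big(\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big) &\le n + \sum_{k=1}^{p}E\Big|\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big|\,1\Big\{\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2 > n\Big\} \\
&\le n + \frac{1}{n}\sum_{k=1}^{p}E\Big(\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big)^2 1\Big\{\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2 > n\Big\}.
\end{aligned}$$
This, together with (2.11), implies that
$$E\max_{1\le k\le p}\Big(\sum_{t=1}^{n} e_{t,k}^2 - nEe_{t,k}^2\Big) = O\Big(n + \frac{p}{\log n}\Big).$$
The proof is complete.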
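Now we prove Theorem 1.
Definition 6.3. Let
$$\check y_t = \dot y_t - \frac{1}{n}\sum_{s=1}^{n}\dot y_s, \qquad \hat e_t = We_t - \frac{1}{n}\sum_{s=1}^{n}We_s.$$
We further define
$$\check\Sigma_y(k) = \frac{1}{n-k}\sum_{t=1}^{n-k}\check y_{t+k}\check y_t^\top, \qquad \hat\Sigma_e(k) = \frac{1}{n-k}\sum_{t=1}^{n-k}\hat e_{t+k}\hat e_t^\top,$$
$$\check\Sigma_{y,e}(k) = \frac{1}{n-k}\sum_{t=1}^{n-k}\check y_{t+k}\hat e_t^\top, \qquad \check\Sigma_{e,y}(k) = \frac{1}{n-k}\sum_{t=1}^{n-k}\hat e_{t+k}\check y_t^\top,$$
$$\check M = \sum_{k=0}^{k_0}\check\Sigma_y(k)\check\Sigma_y(k)^\top. \tag{6.16}$$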
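Proof of Theorem 1. If we can prove
$$\|\hat A\hat A^\top - \dot A\dot A^\top\| = O_p(n^{-1/2}), \tag{6.17}$$
then (4.1) can be derived from (6.17) and Lemma 3. To prove (6.17) it suffices to show that the difference between the $r_0$th largest eigenvalue of $\dot M$ and the $(r_0+1)$th largest eigenvalue of $\hat M$ is larger than $cp^2$ ($c$ is a positive constant), and that $\|\hat M - \dot M\| = O_p(p^2n^{-1/2})$.
Note that $\check M$ is the sample version of $\dot M$ and $\dot y_t$ is stationary. So $\|\check M - \dot M\| = O_p(p^2n^{-1/2}) = o_p(p^2)$ and
$$\hat\Sigma_y(k) = \check\Sigma_y(k) + \hat\Sigma_e(k) + \check\Sigma_{y,e}(k) + \check\Sigma_{e,y}(k).$$
From Lemma 2 and Lemma 4 we conclude that $\|\hat\Sigma_e(k) + \check\Sigma_{y,e}(k) + \check\Sigma_{e,y}(k)\| = o_p(p)$. This implies that $\|\check M - \hat M\| = o_p(p^2)$ and $\|\dot M - \hat M\| = o_p(p^2)$. (6.7)-(6.10) imply that the order of the $r_0$th largest eigenvalue of $\dot M$ is $p^2$ and the $(r_0+1)$th largest eigenvalue of $\dot M$ is $o(p^2)$. This implies that the difference between the $r_0$th largest eigenvalue of $\dot M$ and the $(r_0+1)$th largest eigenvalue of $\hat M$ is larger than $cp^2$.
Now we consider $\|\hat M - \dot M\|$. Since $\|\check M - \dot M\| = O_p(p^2n^{-1/2})$, we only need to prove $\|\check M - \hat M\| = O_p(p^2n^{-1/2})$. Write
$$\begin{aligned}
\hat M - \check M &= \sum_{k=0}^{k_0}\check\Sigma_y(k)\big[\hat\Sigma_e(k)^\top + \check\Sigma_{y,e}(k)^\top + \check\Sigma_{e,y}(k)^\top\big] + \sum_{k=0}^{k_0}\big[\hat\Sigma_e(k) + \check\Sigma_{y,e}(k) + \check\Sigma_{e,y}(k)\big]\check\Sigma_y(k)^\top \\
&\quad + O\Big(\sum_{k=1}^{k_0}\big[\|\check\Sigma_{e,y}(k)\|^2 + \|\check\Sigma_{y,e}(k)\|^2 + \|\hat\Sigma_e(k)\|^2\big]\Big).
\end{aligned} \tag{6.18}$$
Lemma 4 implies $\|\hat\Sigma_e(k)\| = O_p(pn^{-1/2})$. (6.11) ensures $\|\check\Sigma_{y,e}(k)\| = O_p(pn^{-1/2})$ and $\|\check\Sigma_{e,y}(k)\| = O_p(pn^{-1/2})$. Hence the proof is completed.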
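The reduction at the start of the proof (an eigen-gap of order $p^2$ combined with a perturbation of order $p^2n^{-1/2}$) can be read as a Davis-Kahan-type bound. The display below is a sketch of that step under stated assumptions: the generic constant $C$, the notation $\lambda_j(\cdot)$ for the $j$th largest eigenvalue, and the exact form of the inequality are not taken from the paper.

```latex
\|\hat A\hat A^\top - \dot A\dot A^\top\|
  \le \frac{C\,\|\hat M - \dot M\|}{\lambda_{r_0}(\dot M) - \lambda_{r_0+1}(\hat M)}
  \le \frac{O_p(p^2 n^{-1/2})}{c\,p^2}
  = O_p(n^{-1/2}).
```

Read this way, the two facts established in the proof are exactly the numerator and the denominator of the bound.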
6.2 Proof of Theorem 2
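Now we consider the estimator of $P_{A^\perp B}$ in Step 3 and Theorem 2. Let
$$P^c_{\hat A} = I - \hat A\hat A^\top. \tag{6.19}$$
Then we get $\hat B$ from the eigen-analysis of $\sum_{k=0}^{k_0}P^c_{\hat A}\hat\Sigma_y(k)P^c_{\hat A}\hat\Sigma_y(k)^\top P^c_{\hat A}$.
Definition 6.4. Set
$$\hat M_2 = \sum_{k=0}^{k_0}P^c_{\hat A}\hat\Sigma_y(k)P^c_{\hat A}\hat\Sigma_y(k)^\top P^c_{\hat A}, \qquad \dot M_2 = \sum_{k=0}^{k_0}P^c_{\hat A}\dot\Sigma_y(k)P^c_{\hat A}\dot\Sigma_y(k)^\top P^c_{\hat A}. \tag{6.20}$$
Lemma 5. Under Assumptions 1-4, there exist two constants $c_0$ and $c_1$ such that
$$\lim_{n,p\to\infty}P\Big(c_0 \le \frac{\|P^c_{\hat A}\dot\Sigma_y(k)P^c_{\hat A}\|_{\min}}{p^{1-\delta}} \le \frac{\|P^c_{\hat A}\dot\Sigma_y(k)P^c_{\hat A}\|}{p^{1-\delta}} \le c_1\Big) = 1 \tag{6.21}$$
and
$$\|\dot B_2\dot B_2^\top - \dot B\dot B^\top\| = O_p(p^{\delta/2}n^{-1/2}), \tag{6.22}$$
where $\dot B_2$ is a $p\times r$ matrix consisting of the eigenvectors corresponding to the $r$ largest eigenvalues of $\dot M_2$.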
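Proof of Lemma 5. Recalling the definitions of $\dot A$ and $\dot B$ we have
$$(AA^\top - \dot A\dot A^\top)A = (I - \dot A\dot A^\top)A = \dot B\dot B^\top A$$
and
We can find the order of $P^c_{\hat A}\dot y_t$ as follows. Via (6.15) and (6.19) write
$$\begin{aligned}
P^c_{\hat A}A\tilde x_t + P^c_{\hat A}\tilde B\tilde z_t &= (\dot A\dot A^\top - \hat A\hat A^\top)A\tilde x_t + (I - \dot A\dot A^\top)A\tilde x_t + (I - \dot A\dot A^\top)\tilde B\tilde z_t + (\dot A\dot A^\top - \hat A\hat A^\top)\tilde B\tilde z_t \\
&= \dot B\dot B^\top(A\tilde x_t + \tilde B\tilde z_t) + (\dot A\dot A^\top - \hat A\hat A^\top)(A\tilde x_t + \tilde B\tilde z_t) \\
&\triangleq \Pi_1 + \Pi_2,
\end{aligned}$$
where $\Pi_1 = \dot B\dot B^\top(A\tilde x_t + \tilde B\tilde z_t)$ and $\Pi_2 = (\dot A\dot A^\top - \hat A\hat A^\top)(A\tilde x_t + \tilde B\tilde z_t)$. (4.1) implies $\|\Pi_2\| = O_p(p^{1/2}n^{-1/2}) = o_p(p^{1/2-\delta/2})$. This, together with (6.7)-(6.10), implies (6.21).
$$P^c_{\hat A}\dot y_t = \dot B\dot B^\top(A\tilde x_t + \tilde B\tilde z_t) + O_p(p^{1/2}n^{-1/2}) = \dot B\dot B^\top(A\tilde x_t + \tilde B\tilde z_t) + o_p(p^{1/2-\delta/2}).$$
This, together with (6.21), implies (6.22).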
Proof of Theorem 2. If we can prove
$$\|\hat B\hat B^\top - \dot B\dot B^\top\| = O_p(p^{\delta/2}n^{-1/2}), \tag{6.23}$$
then (4.2) can be obtained from (6.23) and Lemma 3. Moreover, (6.22) shows that $\dot B_2$, the eigenvector matrix corresponding to the $r$ largest eigenvalues of $\dot M_2$, is close enough to $\dot B$. It then suffices to prove that
$$\|\hat B\hat B^\top - \dot B_2\dot B_2^\top\| = O_p(p^{\delta/2}n^{-1/2}).$$
To this end, the aim is to show that the difference between the $r$th largest eigenvalue of $\dot M_2$ and the $(r+1)$th largest eigenvalue of $\hat M_2$ is larger than $cp^{2-2\delta}$, and that $\|\hat M_2 - \dot M_2\| = O_p(p^{2-3\delta/2}n^{-1/2})$.
From Lemma 4, we have
$$\|\hat\Sigma_e(k)\| = o_p(p^{1-\delta/2}n^{-1/2}) = o_p(p^{1-\delta}).$$
This, together with Lemma 5, implies that the difference between the $r$th largest eigenvalue of $\dot M_2$ and the $(r+1)$th largest eigenvalue of $\hat M_2$ is larger than $cp^{2-2\delta}$ with probability tending to 1 as $n\to\infty$.
Now we consider $\hat M_2 - \dot M_2$. We still use (6.18). However, we replace $\hat M$ and $\check M$ by $\hat M_2$ and $\dot M_2 + O_p(p^{2-2\delta}n^{-1/2})$ respectively. Similarly, we replace $\check\Sigma_y(k)$, $\check\Sigma_{y,e}(k)$, $\check\Sigma_{e,y}(k)$ and $\hat\Sigma_e(k)$ by $P^c_{\hat A}\check\Sigma_y(k)P^c_{\hat A}$, $P^c_{\hat A}\check\Sigma_{y,e}(k)P^c_{\hat A}$, $P^c_{\hat A}\check\Sigma_{e,y}(k)P^c_{\hat A}$ and $P^c_{\hat A}\hat\Sigma_e(k)P^c_{\hat A}$ respectively. This, together with Lemmas 4-5, ensures that
$$\|\hat M_2 - \dot M_2\| = O_p(p^{2-3\delta/2}n^{-1/2}).$$
This implies (6.23).
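The rate $p^{\delta/2}n^{-1/2}$ follows from dividing the perturbation order by the eigen-gap order. The display below spells out the exponent arithmetic as a sketch; the generic constant $C$, the notation $\lambda_j(\cdot)$ for the $j$th largest eigenvalue, and the Davis-Kahan-type form of the inequality are assumptions, not statements from the paper.

```latex
\|\hat B\hat B^\top - \dot B_2\dot B_2^\top\|
  \le \frac{C\,\|\hat M_2 - \dot M_2\|}{\lambda_r(\dot M_2) - \lambda_{r+1}(\hat M_2)}
  \le \frac{O_p(p^{2-3\delta/2} n^{-1/2})}{c\,p^{2-2\delta}}
  = O_p\big(p^{(2-3\delta/2)-(2-2\delta)} n^{-1/2}\big)
  = O_p(p^{\delta/2} n^{-1/2}).
```

Compared with Theorem 1, the extra factor $p^{\delta/2}$ reflects the smaller eigen-gap, of order $p^{2-2\delta}$ rather than $p^2$, for the weak factors.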
6.3 Proof of Theorem 3
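It suffices to prove the following version of Theorem 3.
Lemma 6. Under Assumptions 1-4, let $\hat\lambda_{k,i}$ be the $i$th largest eigenvalue of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$. Then there exists a positive constant $C$ such that
$$\lim_{n,p\to\infty}P\Big(\frac{\hat\lambda_{k,i-1}}{\hat\lambda_{k,i}} \le C\Big) = 1, \quad\text{when } 2\le i\le r_0, \tag{6.24}$$
$$\frac{\hat\lambda_{k,r_0+1}}{\hat\lambda_{k,r_0}} = O_p(p^{-2\delta}), \qquad \frac{\hat\lambda_{k,r_0+1+r}}{\hat\lambda_{k,r_0+r}} = O_p\Big(\frac{p^2/n^2 + \log^2 p}{p^{2-2\delta}}\Big), \tag{6.25}$$
$$\lim_{n,p\to\infty}P\Big(\frac{\hat\lambda_{k,i-1}}{\hat\lambda_{k,i}} \le C\Big) = 1, \quad\text{when } r_0+2\le i\le r_0+r, \tag{6.26}$$
and
$$\frac{\hat\lambda_{k,i-1}}{\hat\lambda_{k,i}} = O_p(\log^2 p), \quad\text{when } r_0+r+2\le i\le r_0+r+s, \tag{6.27}$$
where $s$ is a positive integer.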
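We begin with two estimators of $P_{A^\perp B}$.
Definition 6.5. Let $\hat\lambda_{k,i}$ be the $i$th largest eigenvalue of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$. We write $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$ by its eigenvalue and eigenvector decomposition as
$$\hat\Sigma_y(k)\hat\Sigma_y(k)^\top = \hat A(k)\hat\Lambda_x(k)\hat A(k)^\top + \hat B_1(k)\check\Lambda_z(k)\hat B_1(k)^\top + \hat C_1(k)\check\Lambda_e(k)\hat C_1(k)^\top, \tag{6.28}$$
where $\check\Lambda_z(k) = \operatorname{diag}\{\hat\lambda_{k,r_0+1}, \cdots, \hat\lambda_{k,r_0+r}\}$, $\check\Lambda_e(k) = \operatorname{diag}\{\hat\lambda_{k,r_0+r+1}, \cdots, \hat\lambda_{k,p}\}$, and
$$\hat A(k)^\top\hat A(k) = I_{r_0}, \quad \hat B_1(k)^\top\hat B_1(k) = I_r, \quad \hat C_1(k)^\top\hat C_1(k) = I_{p-r-r_0}, \quad \check\Lambda_x(k) = \operatorname{diag}\{\hat\lambda_{k,1}, \cdots, \hat\lambda_{k,r_0}\}.$$
Then $\hat A(k)$ is the estimator of $A$ and $\hat B_1(k)$ is the estimator of $\tilde B$ in the one-step method. We call this method "one-step" as we get $\hat A(k)$ and $\hat B_1(k)$ in the same eigen-decomposition.
Definition 6.6. Let $P^c_{\hat A_k} = I_p - \hat A(k)\hat A(k)^\top$ and write $P^c_{\hat A_k}\hat\Sigma_y(k)P^c_{\hat A_k}\hat\Sigma_y(k)^\top P^c_{\hat A_k}$ by its eigenvalue and eigenvector decomposition as
$$P^c_{\hat A_k}\hat\Sigma_y(k)P^c_{\hat A_k}\hat\Sigma_y(k)^\top P^c_{\hat A_k} = \hat B(k)\hat\Lambda_z(k)\hat B(k)^\top + \hat C(k)\hat\Lambda_e(k)\hat C(k)^\top. \tag{6.29}$$
Then $\hat B(k)$ is the estimator of $\tilde B$ in the two-step method. We call this method "two-step" as we
The following lemma is to prove that the one-step method is asymptotically equivalent to the two-step method based on $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$.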
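Lemma 7. Under Assumptions 1-4, one has
$$\|\hat B(k)\hat B(k)^\top - P_{A^\perp B}\| = O_p(p^{\delta/2}n^{-1/2} + p^{-\delta/2}), \tag{6.30}$$
$$\|\hat B_1(k)\hat B_1(k)^\top - \hat B(k)\hat B(k)^\top\| = O_p(p^{-\delta/2}n^{-1/2}), \tag{6.31}$$
$$\|\check\Lambda_z(k) - \hat\Lambda_z(k)\| = o_p(p^{2-2\delta}) \tag{6.32}$$
and
$$\|\check\Lambda_e(k) - \hat\Lambda_e(k)\| = o_p(p^{2-2\delta}). \tag{6.33}$$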
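Proof of Lemma 7. By (6.16), (3.1) and (6.15) we have
$$\begin{aligned}
\hat\Sigma_y(k)\hat\Sigma_y(k)^\top - \check\Sigma_y(k)\check\Sigma_y(k)^\top &= \check\Sigma_y(k)\big(\hat\Sigma_e(k)^\top + \check\Sigma_{y,e}(k)^\top + \check\Sigma_{e,y}(k)^\top\big) \\
&\quad + \big(\hat\Sigma_e(k) + \check\Sigma_{y,e}(k) + \check\Sigma_{e,y}(k)\big)\check\Sigma_y(k)^\top \\
&\quad + O\big(\|\check\Sigma_{e,y}(k)\|^2 + \|\check\Sigma_{y,e}(k)\|^2 + \|\hat\Sigma_e(k)\|^2\big).
\end{aligned}$$
From the proofs of Theorems 1-2, it is not hard to obtain the properties of $\hat A(k)$ and $\hat B(k)$ as follows:
$$\|\hat A(k)\hat A(k)^\top - AA^\top\| = O_p(n^{-1/2} + p^{-\delta/2}),$$
$$\|\hat B(k)\hat B(k)^\top - P_{A^\perp B}\| = O_p(p^{\delta/2}n^{-1/2} + p^{-\delta/2}),$$
$$\|\hat\Lambda_z(k)\| \asymp p^{2-2\delta} \asymp \|\hat\Lambda_z(k)\|_{\min}, \tag{6.34}$$
$$\|\hat\Lambda_e(k)\| = O_p(p^2n^{-2} + \log^2 p). \tag{6.35}$$
So we only need to prove (6.31)-(6.33). From (6.28) and (6.29) one can see that
$$I_p = \hat A(k)\hat A(k)^\top + \hat B(k)\hat B(k)^\top + \hat C(k)\hat C(k)^\top$$
and
$$I_p = \hat A(k)\hat A(k)^\top + \hat B_1(k)\hat B_1(k)^\top + \hat C_1(k)\hat C_1(k)^\top.$$
It follows that
$$\hat B(k)\hat B(k)^\top + \hat C(k)\hat C(k)^\top = \hat B_1(k)\hat B_1(k)^\top + \hat C_1(k)\hat C_1(k)^\top.$$
This can help us study the relation between $\hat B(k)$ and $\hat B_1(k)$. From (6.28) and (6.29) we conclude that
$$\hat B_1(k)\check\Lambda_z(k)\hat B_1(k)^\top + \hat C_1(k)\check\Lambda_e(k)\hat C_1(k)^\top = P^c_{\hat A_k}\hat\Sigma_y(k)\hat\Sigma_y(k)^\top P^c_{\hat A_k}.$$
Moreover,
$$P^c_{\hat A_k} = I_p - \hat A(k)\hat A(k)^\top = \hat B(k)\hat B(k)^\top + \hat C(k)\hat C(k)^\top.$$
So
$$\begin{aligned}
P^c_{\hat A_k}\hat\Sigma_y(k)\hat\Sigma_y(k)^\top P^c_{\hat A_k} &= P^c_{\hat A_k}\hat\Sigma_y(k)P^c_{\hat A_k}\hat\Sigma_y(k)^\top P^c_{\hat A_k} + P^c_{\hat A_k}\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top P^c_{\hat A_k} \\
&= \hat B(k)\hat\Lambda_z(k)\hat B(k)^\top + \hat C(k)\hat\Lambda_e(k)\hat C(k)^\top \\
&\quad + \hat B(k)\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat B(k)\hat B(k)^\top \\
&\quad + \hat B(k)\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat C(k)\hat C(k)^\top \\
&\quad + \hat C(k)\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat B(k)\hat B(k)^\top \\
&\quad + \hat C(k)\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat C(k)\hat C(k)^\top.
\end{aligned}$$
It then suffices to get the order of $\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)$ and $\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)$. If we can show
$$\|\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\| = o_p(p^{1-\delta}) \quad\text{and}\quad \|\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)\| = o_p(p^{1-\delta}),$$
then (6.32)-(6.33) follow. Note that
$$\|\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\|_{\min} \asymp p.$$
We study the order of $\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)$ as follows based on the definition of eigenvectors. Write
$$\begin{aligned}
0 = \hat B(k)^\top\hat\Sigma_y(k)\hat\Sigma_y(k)^\top\hat A(k) &= \hat B(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k) \\
&\quad + \hat B(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k) \\
&\quad + \hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k).
\end{aligned}$$
Then
$$\begin{aligned}
\hat B(k)^\top\hat\Sigma_y(k)\hat A(k) &= -\hat B(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1} \\
&\quad - \hat B(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1}.
\end{aligned} \tag{6.36}$$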
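We identify the order of $\hat C(k)\hat C(k)^\top A\tilde x_t$ as follows.
We replace $\dot M$ by $\dot\Sigma_y(k)\dot\Sigma_y(k)^\top$ and define $\dot A(k)$ and $\dot B(k)$ in the same way as $\dot A$ and $\dot B$. Then
$$\|AA^\top - \dot A(k)\dot A(k)^\top\| = O_p(p^{-\delta/2}), \qquad \|\dot A(k)\dot A(k)^\top - \hat A(k)\hat A(k)^\top\| = O_p(n^{-1/2})$$
and
$$\|\dot B(k)\dot B(k)^\top - \hat B(k)\hat B(k)^\top\| = O_p(p^{\delta/2}n^{-1/2}).$$
Note that
$$\|\hat C(k)\hat C(k)^\top A\tilde x_t\| \le \|\hat C(k)\hat C(k)^\top\dot A(k)\dot A(k)^\top A\tilde x_t\| + \|\hat C(k)\hat C(k)^\top(AA^\top - \dot A(k)\dot A(k)^\top)A\tilde x_t\|.$$
The two summands on the right hand side of the above inequality satisfy
$$\begin{aligned}
\|\hat C(k)\hat C(k)^\top\dot A(k)\dot A(k)^\top A\tilde x_t\| &\le \|(I_p - \hat A(k)\hat A(k)^\top)\dot A(k)\dot A(k)^\top A\tilde x_t\| \\
&= \|(\dot A(k)\dot A(k)^\top - \hat A(k)\hat A(k)^\top)\dot A(k)\dot A(k)^\top A\tilde x_t\| \\
&\le \|\dot A(k)\dot A(k)^\top - \hat A(k)\hat A(k)^\top\|\,\|\tilde x_t\| = O_p(p^{1/2}n^{-1/2})
\end{aligned}$$
and
$$\begin{aligned}
\|\hat C(k)\hat C(k)^\top(AA^\top - \dot A(k)\dot A(k)^\top)A\tilde x_t\| &\le \|(\dot A(k)\dot A(k)^\top - \hat A(k)\hat A(k)^\top)(AA^\top - \dot A(k)\dot A(k)^\top)A\tilde x_t\| \\
&\quad + \|(\dot B(k)\dot B(k)^\top - \hat B(k)\hat B(k)^\top)(AA^\top - \dot A(k)\dot A(k)^\top)A\tilde x_t\| \\
&= O_p(p^{1/2}n^{-1/2}).
\end{aligned}$$
It follows that
$$\|\hat C(k)\hat C(k)^\top A\tilde x_t\| = O_p(p^{1/2}n^{-1/2}).$$
Similarly we can conclude that
$$\|\hat C(k)\hat C(k)^\top\tilde B\tilde z_t\| = O_p(p^{1/2}n^{-1/2}).$$
Then
$$\|\hat C(k)^\top\check y_t\| = O_p(p^{1/2}n^{-1/2}).$$
Likewise one may verify that
$$\|\hat B(k)^\top\check y_t\| = O_p(p^{1/2-\delta/2}).$$
These imply that
$$\|\hat B(k)^\top\check\Sigma_y(k)\hat C(k)\| = O_p(p^{1-\delta/2}n^{-1/2}),$$
$$\|\hat B(k)^\top\check\Sigma_{y,e}(k)\hat C(k)\| = O_p(p^{1/2-\delta/2}) = O_p(p^{1-\delta/2}n^{-1/2}),$$
$$\|\hat B(k)^\top\check\Sigma_{e,y}(k)\hat C(k)\| = O_p(p^{1/2}n^{-1/2}) = o_p(p^{1-\delta/2}n^{-1/2})$$
and
$$\|\hat B(k)^\top\hat\Sigma_e(k)\hat C(k)\| = O_p(\|\hat\Sigma_e(k)\|) = O_p\Big(\frac{p}{n} + \log n\Big) = o_p(p^{1-\delta/2}n^{-1/2}).$$
It follows that
$$\|\hat B(k)^\top\hat\Sigma_y(k)\hat C(k)\| = \|\hat B(k)^\top\check\Sigma_y(k)\hat C(k)\| + O_p(p^{1-\delta/2}n^{-1/2}) = O_p(p^{1-\delta/2}n^{-1/2}).$$
Similarly, we have
$$\begin{aligned}
\|\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\| &\le \|\hat C(k)^\top\check\Sigma_y(k)^\top\hat A(k)\| + \|\hat C(k)^\top\check\Sigma_{y,e}(k)^\top\hat A(k)\| \\
&\quad + \|\hat C(k)^\top\check\Sigma_{e,y}(k)^\top\hat A(k)\| + \|\hat C(k)^\top\hat\Sigma_e(k)^\top\hat A(k)\| \\
&= O_p(p^{1-\delta/2}),
\end{aligned}$$
$$\|\hat B(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\| = O_p(p^{2-\frac{3}{2}\delta}),$$
and
$$\|\hat B(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\| = O_p(p^{2-\delta}n^{-1/2}) = O_p(p^{2-\frac{3}{2}\delta}).$$
Recalling (6.36),
$$\begin{aligned}
\|\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\| &\le \|\hat B(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1}\| \\
&\quad + \|\hat B(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1}\| \\
&= O_p(p^{1-\frac{3}{2}\delta}).
\end{aligned} \tag{6.37}$$
Similarly, we can study the order of $\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)$ as follows:
$$\begin{aligned}
0 = \hat C(k)^\top\hat\Sigma_y(k)\hat\Sigma_y(k)^\top\hat A(k) &= \hat C(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k) \\
&\quad + \hat C(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k) \\
&\quad + \hat C(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k).
\end{aligned}$$
We can find that
$$\|\hat C(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\| = O_p(p^{2-\delta}n^{-1/2})$$
and
$$\|\hat C(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\| = O_p(p^{2-\delta}n^{-1/2}).$$
It follows that
$$\begin{aligned}
\|\hat C(k)^\top\hat\Sigma_y(k)\hat A(k)\| &\le \|\hat C(k)^\top\hat\Sigma_y(k)\hat B(k)\hat B(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1}\| \\
&\quad + \|\hat C(k)^\top\hat\Sigma_y(k)\hat C(k)\hat C(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big(\hat A(k)^\top\hat\Sigma_y(k)^\top\hat A(k)\big)^{-1}\| \\
&= O_p(p^{1-\delta}n^{-1/2}).
\end{aligned} \tag{6.38}$$
This implies (6.32)-(6.33). (6.32)-(6.35) show that
$$\|\hat\Lambda_z(k)\|_{\min} \asymp p^{2-2\delta}$$
and
$$\hat\lambda_{k,r+r_0+1} = \|\hat\Lambda_e(k)\| + o_p(p^{2-2\delta}) = o_p(p^{2-2\delta}).$$
Then we can find that the bound on $\|\hat B_1(k)\hat B_1(k)^\top - \hat B(k)\hat B(k)^\top\|$ is based on the fact that
$$\frac{1}{p^{2-2\delta}}\|\hat B(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat C(k)\| = O_p(p^{-\delta/2}n^{-1/2}),$$
which ensures (6.31).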
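Lemma 8. Under Assumptions 1-4, let $\hat\lambda_{k,i}$ be the $i$th largest eigenvalue of $\hat\Sigma_y(k)\hat\Sigma_y(k)^\top$. Then there exist two positive constants $c_1$ and $C_1$ such that
$$\lim_{n,p\to\infty}P\Big(c_1 \le \frac{\hat\lambda_{k,i}}{p^2} \le C_1\Big) = 1, \quad\text{when } 1\le i\le r_0, \tag{6.39}$$
$$\lim_{n,p\to\infty}P\Big(c_1 \le \frac{\hat\lambda_{k,i}}{p^{2-2\delta}} \le C_1\Big) = 1, \quad\text{when } r_0+1\le i\le r_0+r, \tag{6.40}$$
$$\hat\lambda_{k,r_0+r+1} = O_p\Big(\frac{p^2}{n^2} + \log^2 p\Big) \tag{6.41}$$
and
$$\frac{n^2}{p^2}\hat\lambda_{k,r_0+r+s} = O_p(1) \tag{6.42}$$
for any fixed $s$.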
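Proof of Lemma 8. Following the proof of Theorem 1, one can similarly prove that
$$\|\hat A(k)\hat A(k)^\top - AA^\top\| = O_p(n^{-1/2} + p^{-\delta/2}).$$
This implies (6.39).
(6.32) and (6.34) lead to (6.40).
So we prove (6.41) now. We start with the relation:
$$\begin{aligned}
\check\Lambda_e(k) = \hat C_1(k)^\top\hat\Sigma_y(k)\hat\Sigma_y(k)^\top\hat C_1(k) &= \hat C_1(k)^\top\hat\Sigma_y(k)\hat B_1(k)\hat B_1(k)^\top\hat\Sigma_y(k)^\top\hat C_1(k) \\
&\quad + \hat C_1(k)^\top\hat\Sigma_y(k)\hat A(k)\hat A(k)^\top\hat\Sigma_y(k)^\top\hat C_1(k) \\
&\quad + \hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\hat C_1(k)^\top\hat\Sigma_y(k)^\top\hat C_1(k).
\end{aligned}$$
Note that
$$\Big\|\begin{pmatrix}\hat A(k)^\top \\ \hat B_1(k)^\top\end{pmatrix}\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big)\Big\|_{\min} \asymp p^{1-\delta},$$
and
$$\begin{aligned}
0 = \hat C_1(k)^\top\hat\Sigma_y(k)\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big) &= \hat C_1(k)^\top\hat\Sigma_y(k)\big(\hat A(k), \hat B_1(k)\big)\begin{pmatrix}\hat A(k)^\top \\ \hat B_1(k)^\top\end{pmatrix}\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big) \\
&\quad + \hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\hat C_1(k)^\top\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big).
\end{aligned}$$
It follows that
$$\hat C_1(k)^\top\hat\Sigma_y(k)\big(\hat A(k), \hat B_1(k)\big) = -\hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\hat C_1(k)^\top\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big)\Big[\begin{pmatrix}\hat A(k)^\top \\ \hat B_1(k)^\top\end{pmatrix}\hat\Sigma_y(k)^\top\big(\hat A(k), \hat B_1(k)\big)\Big]^{-1}.$$
This implies that
$$\|\hat C_1(k)^\top\hat\Sigma_y(k)\big(\hat A(k)\hat A(k)^\top + \hat B_1(k)\hat B_1(k)^\top\big)\hat\Sigma_y(k)^\top\hat C_1(k)\| = O_p\big(\|\hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\hat C_1(k)^\top\hat\Sigma_y(k)^\top\hat C_1(k)\|\big).$$
So we only need to get the order of $\|\hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\|$. (6.31) implies
$$\|\hat C_1(k)\hat C_1(k)^\top - \hat C(k)\hat C(k)^\top\| = O_p(p^{-\delta/2}n^{-1/2}).$$
This, together with (6.35), implies
$$\|\hat C_1(k)^\top\hat\Sigma_y(k)\hat C_1(k)\| = O_p\big(\|\hat C(k)^\top\hat\Sigma_y(k)\hat C(k)\|\big) = O_p(pn^{-1} + \log p).$$
This proves (6.41).
From (2.10), Lemma 4 and (6.41), for any fixed $s$,
$$\sum_{i=r_0+r+1}^{\min\{n,p\}}\hat\lambda_{k,i}^{1/2} = O_p(p).$$
This, together with (6.41), implies (6.42).
Lemma 6 can be concluded from Lemma 8.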
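6.4 Proof of Theorems 4-6
Proof of Theorem 4.
$$(I_p - AA^\top)\begin{pmatrix} B \\ 0 \end{pmatrix} = \begin{pmatrix} C_{11} & \cdots & C_{1,d} \\ \vdots & \ddots & \vdots \\ C_{d+1,1} & \cdots & C_{d+1,d} \end{pmatrix}, \tag{6.43}$$
where $C_{ij}$ is a $p_i\times r_j$ matrix. Hence
$$C_{ii} = B_i - A_iA_i^\top B_i.$$
When $i\ne j$,
$$C_{ij} = -A_iA_j^\top B_j.$$
Note that $B^\top B = I_r$, and $B_i^\top B_i = I_{r_i}$.
(6.43), (2.3) and Assumption 5 ensure that $P_{A^\perp B}$ can be rewritten as
$$P_{A^\perp B} = \tilde H_{\mathrm{diag}} + \tilde H_{\mathrm{err}},$$
where $\tilde H_{\mathrm{diag}}$ satisfies (4.7) and $\|\tilde H_{\mathrm{err}}\|_F = O(p^{\delta/2}n^{-1/2})$. This, together with (4.2), completes the proof.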
Proof of Theorem 5. Note that $|J_i| \le c_2p$ for any $1\le i\le d+1$. The fact that $\hat B^\top\hat B = I_r$ implies that
$$\|\hat b_i\|^2 = \hat b_i^\top\hat B^\top\hat B\hat b_i = \|\hat b_i^\top\hat B^\top\|^2.$$
Recalling the definition of $\hat b_i$ in Step 4, $\|\hat b_i\|$ is the norm of the $i$th row of $\hat B\hat B^\top$.
We begin with $i\le d$. Theorem 4 and Assumption 6 imply that if $j\in J_i\cap\hat J_{d+1}$, the norm of the $j$th row vector of $\hat B\hat B^\top - P_{A^\perp B}$ should be larger than $c_1p^{-1/2}/2$. This, together with (4.2), implies (4.9).
Now we consider $i = d+1$. If $j\in J_{d+1}\cap\hat J^c_{d+1}$, the norm of the $j$th row vector of $\hat B\hat B^\top - P_{A^\perp B}$ should be not smaller than $\omega_p/2$. This, together with (4.2), implies (4.10).
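Proof of Theorem 6. We define a diagonal matrix $F_{\mathrm{diag}}$ whose $i$th diagonal element is $\hat f_i^\top\hat f_i$. Then $\hat\rho_{l,m}$ is the $(l, m)$th entry of $F_{\mathrm{diag}}^{-1/2}|\hat F\hat F^\top|F_{\mathrm{diag}}^{-1/2}$. Recalling Step 4, one can see that $\|F_{\mathrm{diag}}^{-1}\| \le \omega_p^{-2}$. It follows that
$$2\sum_{1\le i<j\le d}\sum_{\ell\in\tilde J_i}\sum_{m\in\tilde J_j}\hat\rho_{\ell,m}^2 \le \omega_p^{-4}\|H_{\mathrm{err}}\|_F^2.$$
This, together with the definition of $\omega_p$ in Step 4, concludes the first part. For the second part, we recall Theorem 5 and $p_i\asymp p$. Then there exists $C_1 > 0$ such that
$$\lim_{n,p\to\infty}P(|\tilde J_j| > C_1p) = 1.$$
Note that $\hat F$ only has $\hat r$ (fixed) columns and $p$ goes to infinity. There exist $C_2 > 0$ and $C_3 > 0$ such that
$$\lim_{n,p\to\infty}P\big(|\{(\ell, m) : \ell, m\in\tilde J_j, |\hat\rho_{\ell,m}| > C_2\}| > C_3p^2\big) = 1.$$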
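The quantity $\hat\rho_{l,m}$ used in the proof is the absolute inner product of two rows of $\hat F$ after each row is scaled to unit length. A minimal sketch of this normalisation is given below; the function name and the toy matrix are hypothetical, and every row is assumed to have a non-zero norm.

```python
import math


def abs_row_correlations(F):
    """Entries of F_diag^{-1/2} |F F^T| F_diag^{-1/2} for a matrix F given as a list of rows."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in F]
    p = len(F)
    rho = [[0.0] * p for _ in range(p)]
    for l in range(p):
        for m in range(p):
            inner = sum(a * b for a, b in zip(F[l], F[m]))
            # Absolute inner product, scaled by the two row norms.
            rho[l][m] = abs(inner) / (norms[l] * norms[m])
    return rho


# Toy matrix: rows 0-1 load on the first column, rows 2-3 on the second.
toy_rho = abs_row_correlations([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, -1.0]])
```

In the toy matrix, rows in the same group have $\hat\rho$ equal to one and rows in different groups have $\hat\rho$ equal to zero, which is the idealised version of the separation that the two parts of the proof quantify.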
References
Aghabozorgi, S., Shirkhorshid, A.S. and Wah, T.Y. (2015). Time-series clustering – A decade review. Information Systems, 53, 16-38.
Alonso, A.S. and Peña, D. (2019). Clustering time series by linear dependency. Statistics and Computing, 29, 655-676.
Ando, T. and Bai, J. (2017). Clustering huge number of financial time series: a panel data approach with high-dimensional predictors and factor structures. Journal of the American Statistical Association, 112, 1182-1198.
Maharaj, E.A., D'Urso, P. and Caiado, J. (2019). Time Series Clustering and Classification. Chapman and Hall/CRC.
Chamberlain, G. (1983). Funds, factors, and diversification in arbitrage pricing models. Econometrica, 51, 1305-1323.
Chamberlain, G. and Rothschild, M. (1983). Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica, 51, 1281-1304.
Chang, J., Gao, B. and Yao, Q. (2015). High dimensional stochastic regression with latent factors, endogeneity and nonlinearity. Journal of Econometrics, 189, 297-312.
Esling, P. and Agon, C. (2012). Time-series data mining. ACM Computing Surveys, 45, Article 12.
Forni, M., Hallin, M., Lippi, M. and Reichlin, L. (2005). The generalized dynamic-factor model: one-sided estimation and forecasting. Journal of the American Statistical Association, 100, 830-840.
Frühwirth-Schnatter, S. and Kaufmann, S. (2008). Model-based clustering of multiple time series. Journal of Business & Economic Statistics, 26, 78-89.
Hallin, M. and Lippi, M. (2013). Factor models in high-dimensional time series – a time-domain approach. Stochastic Processes and Their Applications, 123, 2678-2695.
Kakizawa, Y., Shumway, R.H. and Taniguchi, M. (1998). Discrimination and clustering for multivariate time series. Journal of the American Statistical Association, 93, 328-340.
Keogh, E. and Lin, J. (2005). Clustering of time-series subsequences is meaningless: implications for previous and future research. Knowledge and Information Systems, 8, 154-177.
Keogh, E. and Ratanamahatana, C.A. (2005). Exact indexing of dynamic time warping. Knowledge and Information Systems, 7, 358-386.
Khaleghi, A., Ryabko, D., Mary, J. and Preux, P. (2016). Consistent algorithms for clustering time series. Journal of Machine Learning Research, 17, 1-32.
Lam, C. and Yao, Q. (2012). Factor modelling for high-dimensional time series: inference for the number of factors. The Annals of Statistics, 40, 694-726.
Li, Z., Wang, Q. and Yao, J. (2017). Identifying the number of factors from singular values of a large sample auto-covariance matrix. The Annals of Statistics, 45, 257-288.
Liao, T.W. (2005). Clustering of time series data – a survey. Pattern Recognition, 38, 1857-1874.
Peña, D. and Box, E.P. (1987). Identifying a simplifying structure in time series. Journal of the American Statistical Association, 82, 836-843.
Peña, D. and Poncela, P. (2006). Nonstationary dynamic factor analysis. Journal of Statistical Planning and Inference, 136, 1237-1257.
Roelofsen, P. (2018). Time series clustering. Vrije Universiteit Amsterdam. https://www.math.vu.nl/~sbhulai/papers/thesis-roelofsen.pdf.
Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027.
Yao, Q., Tong, H., Finkenstädt, B. and Stenseth, N.C. (2000). Common structure in panels of short ecological time series. Proceeding of the Royal Society (London), B, 267, 2457-2467.