Suppose some PCs $z_s$ are retained by comparing their Sobol's total sensitivity indices, as explained in the last section. In order to identify $k$ stocks based on $z_s$, a natural idea is to establish a relationship between $\rho((q^*)'z, w'r_s)$ and $\rho((q^*)'z, (b_s)'z_s)$, where $b_s$ is a vector of weights for the retained PCs. However, such a relationship is rather challenging to establish. Hence, based on the retained PCs, stocks in the tracking portfolio are selected by comparing the dependence between $z_s$ and $r_s$.
Research on choosing variables according to some PCs dates back to [76, 77]. The motivation is the conjecture that a subset of the PCs can be explained very well by a subset of the variables that form all PCs. In [76, 77], many ad hoc methods are compared using both artificial and real data. However, it is pointed out in [21] that these ad hoc methods are potentially misleading in selecting variable subsets to approximate the retained PCs. The authors of [21] suggest selecting the variable subset by optimizing some criterion, such as Yanai's generalized coefficient of determination (GCD). Inspired by [21], three criteria are considered in this chapter for choosing stocks based on the retained PCs: Yanai's GCD, the distance correlation, and the HHG test statistic.
Yanai's generalized coefficient of determination (GCD) is introduced in [132]. It is a type of matrix correlation, which is introduced in [100]. Suppose $X$ is an $n \times d$ data matrix of $r$, and the $j$th column of $X$ contains samples of $r_j$ for $j = 1, \ldots, d$. Let $G$ be the collection of subscripts of the elements in $z_s$, and denote the cardinality of $G$ by $m$. Define $A_G$ as the $d \times m$ sub-matrix of the PC loading matrix $A$ obtained by retaining column $j$ of $A$ for each $j \in G$. We further denote the subspace spanned by $z_s$ by $G$. For the space $G$, there is a corresponding orthogonal projection matrix
\[ P_G(X) = X A_G (A_G' X' X A_G)^{-1} A_G' X'. \]
Similarly, we denote by $K$ the collection of subscripts of the elements in $r_s$. The data matrix of $r_s$ is $X I_K$, where $I_K$ is the $d \times k$ sub-matrix of the $d \times d$ identity matrix obtained by keeping its $j$th column for each $j \in K$. These $k$ variables span a subspace $K$ with orthogonal projection matrix
\[ P_K(X) = X I_K (I_K' X' X I_K)^{-1} I_K' X'. \]
Yanai's GCD of $P_G$ and $P_K$, denoted by $\mathrm{GCD}(P_G, P_K)$, is used in [21] to measure the “correlation” or similarity between the subspaces $G$ and $K$. It is shown in [21] that
\[ \mathrm{GCD}(P_G, P_K) = \frac{1}{\sqrt{mk}} \sum_{j \in G} (\tilde{r}_m)_j^2, \qquad (\tilde{r}_m)_j = \sqrt{\Lambda_j}\, \sqrt{(a_j^K)' \Sigma_K^{-1} a_j^K}, \quad j \in G, \]
where $\Lambda_j$ is the $j$th diagonal element of $\Lambda$, the eigenvalue matrix of the covariance matrix of $r$. Further, $a_j^K$ is the sub-vector of the $j$th column of the PC loading matrix $A$ that has $k$ elements, each corresponding to one variable in $r_s$. The matrix $\Sigma_K$ is the sub-matrix of the covariance matrix $\Sigma$ involving only the rows and columns corresponding to these $k$ variables in $r_s$. For simplicity, we rewrite $\mathrm{GCD}(P_G, P_K)$ as $\mathrm{GCD}(G, K)$.
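The closed form above can be evaluated directly from an eigendecomposition. The following is a minimal NumPy sketch, not the implementation used in [21]: the function name `yanai_gcd`, the use of the sample covariance as the estimate of $\Sigma$, and the convention that PCs are indexed from 0 in order of decreasing eigenvalue are all assumptions made for illustration.

```python
import numpy as np

def yanai_gcd(X, pc_idx, var_idx):
    """GCD between the subspace of the PCs in pc_idx and the variables in var_idx.

    X is an n x d data matrix; pc_idx and var_idx are lists of 0-based indices.
    Assumption: Sigma is estimated by the sample covariance of X.
    """
    Sigma = np.cov(X, rowvar=False)            # d x d sample covariance
    eigvals, A = np.linalg.eigh(Sigma)         # columns of A are PC loadings
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, A = eigvals[order], A[:, order]
    Sigma_K = Sigma[np.ix_(var_idx, var_idx)]  # covariance of the chosen variables
    total = 0.0
    for j in pc_idx:
        a_Kj = A[var_idx, j]                   # loadings of PC j on the chosen variables
        # squared multiple correlation: Lambda_j * a' Sigma_K^{-1} a
        total += eigvals[j] * (a_Kj @ np.linalg.solve(Sigma_K, a_Kj))
    return total / np.sqrt(len(pc_idx) * len(var_idx))
```

When all $d$ PCs and all $d$ variables are selected, each term in the sum equals 1 and the function returns 1, which matches the case in which the two subspaces coincide.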
Yanai's GCD is able to measure the similarity between two subspaces of different dimensions. The value of Yanai's GCD is between 0 and 1. If $\mathrm{GCD}(G, K) = 1$, the subspaces $G$ and $K$ coincide; that is, any linear combination of the data of $z_s$ can be rewritten as a linear combination of the data of $r_s$. If $\mathrm{GCD}(G, K) = 0$, the subspaces $G$ and $K$ are mutually orthogonal, which suggests that $r_s$ cannot explain any linear combination of $z_s$. Hence, the $k$ stocks that maximize $\mathrm{GCD}(G, K)$ should be selected to explain the retained PCs.

Distance correlation (dCor), which is introduced in [123], is able to detect the dependence between random vectors of different dimensions. Distance correlation is closely linked to distance covariance. Suppose $x$ is a $p$-dimensional random vector and $y$ is a $q$-dimensional random vector. The distance covariance of $x$ and $y$, $\mathcal{V}(x, y)$, is defined based on characteristic functions, and it is the positive square root of
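\[ \mathcal{V}^2(x, y) = \| f_{xy} - f_x f_y \|^2 = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{|f_{xy}(t, s) - f_x(t) f_y(s)|^2}{|t|_p^{1+p}\, |s|_q^{1+q}} \, dt\, ds, \]
where $f_x$, $f_y$, and $f_{xy}$ denote the characteristic functions of the random vectors $x$, $y$, and $(x, y)$, respectively, $|\cdot|_p$ denotes the Euclidean norm in $\mathbb{R}^p$, and $c_p$ and $c_q$ are constants that depend only on the dimensions $p$ and $q$. Suppose $(X, Y) = \{(X_i, Y_i),\ i = 1, \ldots, n\}$ is a collection of observed samples from the joint distribution of $(x, y)$. The empirical distance covariance $\mathcal{V}_n(X, Y)$ is the positive square root of
\[ \mathcal{V}_n^2(X, Y) = \frac{1}{n^2} \sum_{i,l=1}^{n} A_{il} B_{il}, \]
where
\[ a_{il} = |X_i - X_l|_p, \quad \bar{a}_{i\cdot} = \frac{1}{n} \sum_{l=1}^{n} a_{il}, \quad \bar{a}_{\cdot l} = \frac{1}{n} \sum_{i=1}^{n} a_{il}, \quad \bar{a}_{\cdot\cdot} = \frac{1}{n^2} \sum_{i,l=1}^{n} a_{il}, \quad A_{il} = a_{il} - \bar{a}_{i\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}. \]
Replacing $\{X_i\}$ by $\{Y_i\}$ (and $|\cdot|_p$ by $|\cdot|_q$) in the calculation of $A_{il}$ leads to $B_{il}$. According to [123], $\lim_{n \to +\infty} \mathcal{V}_n(X, Y) = \mathcal{V}(x, y)$ almost surely, provided that the Euclidean norms of $x$ and $y$ have finite expectations.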
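Distance correlation $\mathrm{dCor}(x, y)$ and its empirical version $\mathrm{dCor}_n(X, Y)$ are defined by
\[ \mathrm{dCor}^2(x, y) = \begin{cases} \dfrac{\mathcal{V}^2(x, y)}{\sqrt{\mathcal{V}^2(x)\, \mathcal{V}^2(y)}}, & \mathcal{V}^2(x)\, \mathcal{V}^2(y) > 0, \\ 0, & \mathcal{V}^2(x)\, \mathcal{V}^2(y) = 0, \end{cases} \]
\[ \mathrm{dCor}_n^2(X, Y) = \begin{cases} \dfrac{\mathcal{V}_n^2(X, Y)}{\sqrt{\mathcal{V}_n^2(X)\, \mathcal{V}_n^2(Y)}}, & \mathcal{V}_n^2(X)\, \mathcal{V}_n^2(Y) > 0, \\ 0, & \mathcal{V}_n^2(X)\, \mathcal{V}_n^2(Y) = 0, \end{cases} \]
where $\mathcal{V}^2(x) = \mathcal{V}^2(x, x)$ and $\mathcal{V}_n^2(X) = \mathcal{V}_n^2(X, X)$. Both $\mathrm{dCor}$ and $\mathrm{dCor}_n$ are between 0 and 1. The distance correlation $\mathrm{dCor}(x, y)$ equals 0 if and only if $x$ and $y$ are independent. If $\mathrm{dCor}_n(X, Y) = 1$, then there exist a vector $a$, a nonzero real number $b$, and an orthogonal matrix $C$ such that $Y = a + bXC$.

Returning to our variable selection problem, in order to explain the given PCs $z_s$, we prefer the $k$-dimensional $r_s$ with the largest $\mathrm{dCor}_n(z_s, r_s)$.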
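The empirical distance correlation follows directly from the double-centered distance matrices $A_{il}$ and $B_{il}$ defined above. The sketch below is one possible NumPy implementation; the names `_centered_dist` and `dcor_n` are hypothetical, and the $O(n^2)$ memory use makes it suitable only for moderate sample sizes.

```python
import numpy as np

def _centered_dist(Z):
    """Double-centered Euclidean distance matrix A_il for the rows of Z."""
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]                          # treat a 1-D input as n x 1
    diff = Z[:, None, :] - Z[None, :, :]
    a = np.sqrt((diff ** 2).sum(axis=-1))       # a_il = |Z_i - Z_l|
    row = a.mean(axis=1, keepdims=True)         # row means  a_i.
    col = a.mean(axis=0, keepdims=True)         # column means a_.l
    return a - row - col + a.mean()             # A_il

def dcor_n(X, Y):
    """Empirical distance correlation between samples X (n x p) and Y (n x q)."""
    A, B = _centered_dist(X), _centered_dist(Y)
    v_xy = (A * B).mean()                       # V_n^2(X, Y)
    v_x = (A * A).mean()                        # V_n^2(X)
    v_y = (B * B).mean()                        # V_n^2(Y)
    denom = np.sqrt(v_x * v_y)
    if denom <= 0:
        return 0.0                              # degenerate case in the definition
    return float(np.sqrt(max(v_xy, 0.0) / denom))
```

For a candidate subset, one would pass the sample matrix of the retained PCs as `X` and the sample matrix of the $k$ candidate stocks as `Y`, and compare the returned values across subsets.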
The HHG test, a test of independence, is introduced in [70]. It can be used to describe the dependence between two random vectors of different dimensions. The idea is inspired by Pearson's independence test. Suppose $(X_i, Y_i)$ for $i = 1, \ldots, N$ are observations of the random vectors $x$ and $y$. For a specified distance $d(\cdot, \cdot)$ and $i \neq j$, $i, j = 1, \ldots, N$, define
\begin{align*}
A_{11}(i, j) &= \sum_{k=1,\, k \neq i, j}^{N} I\{d(X_i, X_k) \leq d(X_i, X_j)\}\, I\{d(Y_i, Y_k) \leq d(Y_i, Y_j)\}, \\
A_{12}(i, j) &= \sum_{k=1,\, k \neq i, j}^{N} I\{d(X_i, X_k) \leq d(X_i, X_j)\}\, I\{d(Y_i, Y_k) > d(Y_i, Y_j)\}, \\
A_{21}(i, j) &= \sum_{k=1,\, k \neq i, j}^{N} I\{d(X_i, X_k) > d(X_i, X_j)\}\, I\{d(Y_i, Y_k) \leq d(Y_i, Y_j)\}, \\
A_{22}(i, j) &= \sum_{k=1,\, k \neq i, j}^{N} I\{d(X_i, X_k) > d(X_i, X_j)\}\, I\{d(Y_i, Y_k) > d(Y_i, Y_j)\},
\end{align*}
and
\[ A_{m\cdot}(i, j) = A_{m1}(i, j) + A_{m2}(i, j), \qquad A_{\cdot m}(i, j) = A_{1m}(i, j) + A_{2m}(i, j), \qquad m = 1, 2. \]
The HHG test statistic is defined as
\[ T(x, y) = \sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} S(i, j), \]
where
\[ S(i, j) = \frac{(N - 2)\, [A_{12}(i, j) A_{21}(i, j) - A_{11}(i, j) A_{22}(i, j)]^2}{A_{1\cdot}(i, j)\, A_{2\cdot}(i, j)\, A_{\cdot 1}(i, j)\, A_{\cdot 2}(i, j)}. \]
It is claimed in [70] that the larger the value of $S(i, j)$, the stronger the dependence between $I\{d(x_i, X) \leq d(x_i, x_j)\}$ and $I\{d(y_i, Y) \leq d(y_i, y_j)\}$. Hence, a larger $T(x, y)$ suggests stronger dependence between $x$ and $y$. Again, given the PCs $z_s$, it is better to select the $r_s$ that maximizes $T(r_s, z_s)$ to explain $z_s$.
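The statistic $T$ can be computed by brute force from the two pairwise distance matrices. The sketch below is a naive $O(N^3)$ illustration rather than the algorithm of [70]; the function name `hhg_statistic`, the choice of the Euclidean distance for $d(\cdot, \cdot)$, and the convention of setting $S(i, j) = 0$ when the denominator vanishes are assumptions.

```python
import numpy as np

def hhg_statistic(X, Y):
    """Naive HHG statistic T for paired samples X (N x p) and Y (N x q)."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)
    N = X.shape[0]
    # Pairwise Euclidean distances (assumed choice of d).
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dy = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    T = 0.0
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            keep = np.ones(N, dtype=bool)
            keep[[i, j]] = False                 # sum over k != i, j
            x_in = dx[i, keep] <= dx[i, j]
            y_in = dy[i, keep] <= dy[i, j]
            a11 = int(np.sum(x_in & y_in))
            a12 = int(np.sum(x_in & ~y_in))
            a21 = int(np.sum(~x_in & y_in))
            a22 = int(np.sum(~x_in & ~y_in))
            denom = (a11 + a12) * (a21 + a22) * (a11 + a21) * (a12 + a22)
            if denom > 0:                        # assumed: S(i, j) = 0 otherwise
                T += (N - 2) * (a12 * a21 - a11 * a22) ** 2 / denom
    return T
```

Each $S(i, j)$ is a Pearson chi-squared statistic for a $2 \times 2$ table with $N - 2$ observations, so it is bounded above by $N - 2$, and the bound is attained when the two indicator variables agree perfectly.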
A comparison of the above three criteria to select stocks is given as follows:
• Yanai's GCD: Yanai's GCD measures the similarity between the subspaces generated by the data of two random vectors. If the GCD of two subspaces is 1, the two subspaces coincide. The dimensions of the two vectors can be different.
• Distance correlation and HHG test statistic: Both can be applied to detect linear or non-linear relationships between two random vectors of different dimensions. Distance correlation has a simpler form.
Maximizing these criteria between retained PCs and k stocks can be formulated as a binary programming problem. In very low dimensions, it is possible to search for the global optimal solution. However, when the dimension is large, heuristic methods should be applied to obtain a suboptimal solution.
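As an illustration of a heuristic for this binary programming problem, the sketch below uses greedy forward selection, which is only one of many possible heuristics and is not necessarily the one adopted in this chapter. The function name `greedy_select` and the generic `score` callable (which would wrap GCD, $\mathrm{dCor}_n$, or the HHG statistic for a candidate index set) are hypothetical.

```python
from typing import Callable, List, Sequence

def greedy_select(d: int, k: int,
                  score: Callable[[Sequence[int]], float]) -> List[int]:
    """Pick k of d variables by forward selection.

    At each step, add the variable whose inclusion gives the largest score.
    The result is generally suboptimal compared with an exhaustive search.
    """
    chosen: List[int] = []
    for _ in range(k):
        candidates = [j for j in range(d) if j not in chosen]
        best = max(candidates, key=lambda j: score(chosen + [j]))
        chosen.append(best)
    return chosen

# Toy usage with an additive score, for which the greedy choice is optimal.
weights = [0.1, 0.9, 0.5, 0.7]
selected = greedy_select(4, 2, lambda idx: sum(weights[j] for j in idx))
```

Forward selection evaluates the criterion on the order of $dk$ times, compared with $\binom{d}{k}$ evaluations for an exhaustive search.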
Finally, the algorithm of our proposed variable selection for index tracking is outlined in Table 3.2.
1: Input an $n \times (d + 1)$ sample matrix of the random vector $(R, r)$.
2: Obtain an estimator $\hat{\Sigma}$ of the covariance matrix of $r$.
3: Determine the PCs of $r$ from the eigenvalue decomposition of $\hat{\Sigma}$.
4: Decompose $R$ into the PCs $z$ using Sobol's decomposition, which is given in part (a) of Proposition 3.1.
5: Calculate Sobol's total sensitivity index for each PC.
6: Retain the $m$-dimensional PC subset $z_s$ whose Sobol's total sensitivity indices are larger than a certain threshold.
7: Select the $k$-dimensional $r_s$ that maximizes GCD, $\mathrm{dCor}_n$, or the HHG test statistic.
Table 3.2: The algorithm of variable selections for index tracking.
Given the stocks to hold in the tracking portfolio, the corresponding weights can be obtained by existing methods, such as those in [105] or [5]. In this chapter, we follow [105] and determine stock weights by minimizing specific tracking errors.