time series
Table 4.11: Consumption expenditures in billions of US dollars (source: Datastream - Bureau of Economic Analysis).
Total Consumption
Durable consumption: Motor vehicles and parts
Durable consumption: Furnishings and durable household equipment
Durable consumption: Household appliances
Durable consumption: Recreational goods and vehicles
Durable consumption: Video and Audio equipment
Durable consumption: Photographic equipment
Durable consumption: Information Processing equipment
Durable consumption: Sporting equipment, supplies, guns and ammunition
Durable consumption: Sports and recreational vehicles
Durable consumption: Recreational books
Durable consumption: Musical instruments
Durable consumption: Jewelry
Durable consumption: Watches
Durable consumption: Therapeutic medical equipment
Durable consumption: Corrective eyeglasses and contact lenses
Durable consumption: Educational books
Durable consumption: Luggage
Durable consumption: Telephone equipment
Nondurable Consumption: Food and Beverages
Nondurable Consumption: Food produced and consumed on farms
Nondurable Consumption: Clothing and Footwear
Nondurable Consumption: Gasoline and other energy goods
Nondurable Consumption: Pharmaceutical and Other medical products
Nondurable Consumption: Recreational Items
Nondurable Consumption: Games, Toys and Hobbies
Nondurable Consumption: Flowers, seeds and potted plants
Nondurable Consumption: Film and photographic supplies
Nondurable Consumption: Personal care products
Nondurable Consumption: Magazines and Newspapers
Figure 4.2: Time plot (January 1999 - April 2015) of the total consumption time series, the 18 durable consumption time series, and the 12 nondurable consumption time series, all in logs.
Chapter 5
Sparse canonical correlation analysis from a predictive point of view
Abstract
Canonical correlation analysis (CCA) describes the associations between two sets of variables by maximizing the correlation between linear combinations of the variables in each data set. However, in high-dimensional settings where the number of variables exceeds the sample size, or when the variables are highly correlated, traditional CCA is no longer appropriate. This paper proposes a method for sparse CCA. Sparse estimation produces linear combinations of only a subset of variables from each data set, thereby increasing the interpretability of the canonical variates. We consider the CCA problem from a predictive point of view and recast it into a regression framework. By combining an alternating regression approach with a lasso penalty, we induce sparsity in the canonical vectors. We compare the performance of our method with that of other sparse CCA techniques in different simulation settings and illustrate its usefulness on a genomic data set.
5.1 Introduction
The aim of canonical correlation analysis (CCA), introduced by Hotelling [1936], is to identify and quantify linear relations between two sets of variables. CCA is used in various research fields to study associations in, for example, biomedical
data [Foucart, 1999, Alonso et al., 2003], environmental data [Iaci et al., 2010] or genomic data [Graffelman and van Eeuwijk, 2005]. One searches for the linear combinations of each of the two sets of variables having maximal correlation. These linear combinations are called the canonical variates and the correlations between the canonical variates are called the canonical correlations. We refer to e.g. Johnson and Wichern (1998, Chapter 10) for more information on canonical correlation analysis.
At the same time, we want to induce sparsity in the canonical vectors such that the linear combinations only include a subset of the variables. Sparsity is especially helpful in analyzing associations between high-dimensional data sets, which are commonplace today in, for example, genetics [Schwender et al., 2008] and machine learning [Sun et al., 2011, Shin and Wu, 2014]. Therefore, we propose a sparse version of CCA where some elements of the canonical vectors are estimated as exactly zero, which facilitates interpretation. For this aim, we use the formulation of CCA as a prediction problem.
Consider two random vectors $x \in \mathbb{R}^p$ and $y \in \mathbb{R}^q$. We assume, without loss of generality, that all variables are mean centered and that $p \le q$. Denote the joint covariance matrix of $(x, y)$ by
\[
\Sigma = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix},
\]
with $r = \operatorname{rank}(\Sigma_{xy}) \le p$. Let $A \in \mathbb{R}^{p \times r}$ and $B \in \mathbb{R}^{q \times r}$ be the matrices whose columns are the canonical vectors. The new variables $u = A^T x$ and $v = B^T y$ are the canonical variates, and the correlations between each pair of canonical variates give the canonical correlations. The canonical vectors contained in the matrices $A$ and $B$ are respectively given by the eigenvectors of the matrices
\[
\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} \quad \text{and} \quad \Sigma_{yy}^{-1} \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy}. \tag{5.1}
\]
Both matrices have the same positive eigenvalues, and the canonical correlations are given by the positive square roots of those eigenvalues.
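As an illustration of the eigenvalue characterization in (5.1), the following minimal sketch computes sample canonical correlations and vectors with NumPy. The function name `classical_cca`, the simulated data, and the choice to scale the variates to unit variance are assumptions made for this example; the sketch also assumes that both sample covariance matrices are invertible and that the cross-covariance has full rank $r = p$.

```python
import numpy as np

def classical_cca(X, Y):
    """Sample CCA via the eigen-decomposition in (5.1).

    Hypothetical helper for illustration; assumes n > p + q so that
    both sample covariance matrices can be inverted.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)

    # M = Sxx^{-1} Sxy Syy^{-1} Syx, the first matrix in (5.1)
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    eigvals, A = np.linalg.eig(M)

    # Sort by decreasing eigenvalue; eigenvalues are real in theory
    order = np.argsort(eigvals.real)[::-1]
    eigvals = eigvals.real[order]
    A = A.real[:, order]
    rho = np.sqrt(np.clip(eigvals, 0.0, None))

    # Matching vectors for the second set: b proportional to Syy^{-1} Syx a
    B = np.linalg.solve(Syy, Sxy.T @ A)

    # Scale so that each canonical variate has unit sample variance
    A = A / np.sqrt(np.sum(A * (Sxx @ A), axis=0))
    B = B / np.sqrt(np.sum(B * (Syy @ B), axis=0))
    return rho, A, B

# Simulated data with one shared latent factor z (illustrative only)
rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)
X = np.outer(z, [1.0, 0.5, 0.0]) + 0.5 * rng.standard_normal((n, 3))
Y = np.outer(z, [1.0, -1.0, 0.0, 0.0]) + 0.5 * rng.standard_normal((n, 4))

rho, A, B = classical_cca(X, Y)
print(rho)
```

The sample correlation between the first pair of variates, `X @ A[:, 0]` and `Y @ B[:, 0]`, should coincide with `rho[0]` up to numerical error.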
The canonical vectors and correlations are typically estimated by taking the sample versions of the covariances in (5.1) and computing the corresponding eigenvectors and eigenvalues. However, to implement this procedure, we need to invert the matrices $\hat{\Sigma}_{xx}$ and $\hat{\Sigma}_{yy}$. When the original variables are highly correlated or when the number of variables becomes large compared to the sample size, the estimation imprecision will be large. Moreover, when the larger of the two numbers of variables is at least as large as the sample size $n$ (i.e. $q \ge n$), traditional CCA cannot
be performed since the sample covariance matrix $\hat{\Sigma}_{yy}$ is singular, i.e. its inverse does not exist. Vinod [1976] proposed the canonical ridge, which is an adaptation of the ridge regression concept of Hoerl and Kennard [1970] to the framework of CCA, to solve this problem. The canonical ridge replaces the matrices $\hat{\Sigma}_{xx}^{-1}$ and $\hat{\Sigma}_{yy}^{-1}$ by $(\hat{\Sigma}_{xx} + k_1 I)^{-1}$ and $(\hat{\Sigma}_{yy} + k_2 I)^{-1}$, respectively. By adding the penalty terms $k_1$ and $k_2$ to the diagonal elements of the sample covariance matrices, one obtains more reliable and stable estimates when the data are nearly or exactly collinear.
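The effect of the ridge terms can be sketched in a few lines of NumPy. The function name `canonical_ridge`, the simulated data, and the values $k_1 = k_2 = 0.1$ are assumptions chosen for illustration only; they are not tuning recommendations, and the sketch is not a reproduction of Vinod's original implementation.

```python
import numpy as np

def canonical_ridge(X, Y, k1, k2):
    """Ridge-regularized canonical correlations (illustrative sketch).

    Adds k1 and k2 to the diagonals of the sample covariance matrices
    before inverting, so the computation stays defined even when q >= n.
    """
    n, p = X.shape
    q = Y.shape[1]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)

    Sxx_r = Sxx + k1 * np.eye(p)
    Syy_r = Syy + k2 * np.eye(q)

    # Regularized counterpart of the first matrix in (5.1)
    M = np.linalg.solve(Sxx_r, Sxy) @ np.linalg.solve(Syy_r, Sxy.T)
    eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]
    return np.sqrt(np.clip(eigvals, 0.0, None))

# Example with more variables than observations in the second set
rng = np.random.default_rng(1)
n, p, q = 20, 3, 30
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))

Yc = Y - Y.mean(axis=0)
Syy = Yc.T @ Yc / (n - 1)
rank_Syy = np.linalg.matrix_rank(Syy)   # at most n - 1, hence below q

rho_ridge = canonical_ridge(X, Y, k1=0.1, k2=0.1)
print(rank_Syy, rho_ridge)
```

In this setting the unregularized sample covariance matrix of the second data set is rank deficient, whereas the regularized matrices are positive definite and can be inverted.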
Another approach is to use sparse CCA techniques. Parkhomenko et al. [2009] consider a sparse singular value decomposition to derive sparse singular vectors. A limitation of their approach is that sparsity in the canonical vectors is only guaranteed if the variables within the first data set and the variables within the second data set are uncorrelated. A similar approach was taken by Witten and Tibshirani [2009], who apply a penalized matrix decomposition to the cross-product matrix $\hat{\Sigma}_{xy}$, but they also require uncorrelatedness of the variables within each of the two data sets. Waaijenborg et al. [2008] consider Wold's (1968) alternating least squares approach to CCA and obtain sparse canonical vectors using penalized regression with the elastic net. The ridge parameter of the elastic net is set to be large, thereby, according to the authors, ignoring the dependency structure within each set of variables.
Waaijenborg et al. [2008], Witten and Tibshirani [2009], and Parkhomenko et al. [2009] all require the variables within each of the two data sets to be uncorrelated. This requirement is restrictive, since data sets containing correlated variables are commonplace in multivariate analysis (e.g. genome-wide association studies). Therefore, we propose in this paper to estimate the canonical variates without imposing any prior covariance restrictions. As soon as the data sets contain correlated variables, the gains in estimation accuracy achieved by our sparse CCA method compared to these three other sparse CCA methods are pronounced.
We consider CCA as a prediction problem, where the canonical variates obtained from the first data set serve as optimal predictors for the canonical variates of the second data set, and vice versa. Our proposed method obtains the canonical vectors using an alternating penalized regression framework. By performing variable selection in a penalized regression framework using the lasso penalty [Tibshirani, 1996], we obtain sparse canonical vectors. We demonstrate in a simulation study that our Sparse Alternating Regression (SAR) algorithm produces good results in terms of estimation accuracy of the canonical vectors, and detection of the sparseness structure of the canonical vectors. We also apply the SAR algorithm to a high-dimensional genomic data set. Sparse estimation is appealing since it highlights the most important variables for the association study.
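To make the predictive view concrete, the sketch below alternates two ordinary least squares regressions to recover the first pair of canonical vectors. It is deliberately unpenalized and is not the SAR algorithm itself: the idea described above would replace each least squares fit by a lasso regression, the details of which are given in Section 5.3. The function name `alternating_regression_cca`, the random starting value, and the stopping rule are assumptions made for this example.

```python
import numpy as np

def alternating_regression_cca(X, Y, n_iter=200, tol=1e-10, seed=0):
    """First canonical pair via alternating least squares (unpenalized sketch)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Random starting direction for the second set, scaled to unit variance
    b = rng.standard_normal(Yc.shape[1])
    v = Yc @ b
    b, v = b / v.std(), v / v.std()

    rho_old = 0.0
    for _ in range(n_iter):
        # Predict the current variate v from X, then rescale to unit variance
        a, *_ = np.linalg.lstsq(Xc, v, rcond=None)
        u = Xc @ a
        a, u = a / u.std(), u / u.std()

        # Predict the updated variate u from Y, then rescale to unit variance
        b, *_ = np.linalg.lstsq(Yc, u, rcond=None)
        v = Yc @ b
        b, v = b / v.std(), v / v.std()

        # u and v are centered with unit variance, so this is their correlation
        rho = float(np.mean(u * v))
        if abs(rho - rho_old) < tol:
            break
        rho_old = rho
    return rho, a, b

# Simulated data with one shared latent factor z (illustrative only)
rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)
X = np.outer(z, [1.0, 0.5, 0.0]) + 0.5 * rng.standard_normal((n, 3))
Y = np.outer(z, [1.0, -1.0, 0.0, 0.0]) + 0.5 * rng.standard_normal((n, 4))

rho_alt, a, b = alternating_regression_cca(X, Y)
print(rho_alt)
```

Each pass through the loop acts like one step of a power iteration on the first matrix in (5.1), so in this unpenalized form the procedure should converge to the leading canonical correlation whenever that correlation is well separated from the second one.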
The remainder of this article is organized as follows. In Section 5.2 we formulate the CCA problem from a predictive point of view. Section 5.3 describes the Sparse Alternating Regression (SAR) approach and provides details on the implementation of the algorithm. Section 5.4 compares our methodology to other sparse CCA techniques by means of a simulation study. Section 5.5 discusses the genomic data example, and Section 5.6 concludes.