4.2 A survey on unsupervised multi-view learning
4.2.3 Subspace learning algorithms
Subspace learning algorithms assume that data collected on the same subjects from different views are primarily generated by a low-dimensional latent subspace, sometimes referred to as the “shared (latent) subspace” [Xu et al., 2013]. They seek the projections of the data points in this shared subspace, together with the transformations which link the latent subspace to the feature space of each view (the variables, whose number may differ between views). Subspace learning is closely related to manifold alignment in that both enforce prior combination of multiple views. The main differences lie in two aspects. First, manifold alignment algorithms preserve the local geometry of corresponding data points in the intrinsic manifold, whereas subspace learning algorithms have a variety of objectives, for instance maximising the correlation coefficient between the projected data of a pair of datasets, as in canonical correlation analysis [Hotelling, 1936]. Second, manifold alignment algorithms focus on the resemblance amongst samples across the views and completely ignore the view-dependent patterns, whereas a sub-class of subspace learning algorithms attempts to jointly learn both view-dependent and shared patterns, such as Salzmann et al. [2010] and Lock et al. [2013], which we review later in this subsection.
Shared latent subspace learning
Partial least squares (PLS) [Jong, 1993] and canonical correlation analysis (CCA) [Hotelling, 1936], both reviewed in Section 3.8.1, are two of the most widely applied subspace learning approaches for paired datasets (i.e. two views of the same subjects). PLS seeks linear transformations of the data into a latent subspace which maximise the covariance between the projected data of the two views, while CCA maximises their correlation. Where linear transformations are implausible, for instance if the data points in the multiple views are generated from a manifold, kernel CCA (KCCA) [Akaho, 2007] may be applied. KCCA first maps the data points to a high-dimensional feature space (a reproducing kernel Hilbert space), in which the shared subspace can be obtained by linear transformations, and then applies CCA in that space. An extension of CCA to more than two views has also been proposed, where the projections into the shared subspace maximise the sum of all pairwise
correlations among the views [Rupnik and Shawe-Taylor, 2010].
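The contrast between the PLS and CCA objectives can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `first_pls_and_cca_directions` and the small `ridge` term (added only so that the within-view covariance matrices can be factorised) are assumptions introduced here. The first PLS pair is taken as the leading singular vectors of the cross-covariance matrix, and the first CCA pair is obtained by whitening each view before the same SVD.

```python
import numpy as np

def first_pls_and_cca_directions(X, Y, ridge=1e-8):
    """First PLS and CCA weight vectors for two views measured on the same n subjects.

    X is n x p, Y is n x q. `ridge` is an illustrative stabiliser, not part of
    either method's definition.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / (n - 1) + ridge * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + ridge * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    # PLS: unit-norm a, b maximising the covariance a' Cxy b are the
    # leading left/right singular vectors of the cross-covariance matrix.
    U, _, Vt = np.linalg.svd(Cxy)
    a_pls, b_pls = U[:, 0], Vt[0]

    # CCA: maximise a' Cxy b subject to a' Cxx a = b' Cyy b = 1.
    # With Cxx = Lx Lx' and Cyy = Ly Ly', this is an SVD of Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    K = np.linalg.solve(Ly, np.linalg.solve(Lx, Cxy).T).T
    U, s, Vt = np.linalg.svd(K)
    a_cca = np.linalg.solve(Lx.T, U[:, 0])
    b_cca = np.linalg.solve(Ly.T, Vt[0])

    # s[0] is the first canonical correlation (up to the ridge term).
    return (a_pls, b_pls), (a_cca, b_cca), s[0]
```

By construction the correlation between `X @ a_cca` and `Y @ b_cca` is at least as large as that between the PLS projections, whereas the PLS pair attains the larger covariance among unit-norm weight vectors.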
PCA, CCA, and KCCA have also been incorporated in hybrid algorithms, where the key idea is to reduce the level of random noise within each dataset before performing subspace learning. To find low-dimensional consensus features across multiple views, Han et al. [2012] proposed an algorithm for sparse dimensionality reduction for multi-view unsupervised learning. The algorithm first applies PCA to each dataset and concatenates the principal components into a new low-dimensional data matrix. Next, the new data matrix is factorised into the product of an orthogonal basis matrix of lower dimensionality and a transformation matrix containing the coefficients of the linear transformation from the new bases to the bases of the concatenated low-dimensional data matrix. The transformation matrix is further constrained to be sparse such that only the most important features in the concatenated data matrix are retained in the new bases. To address the same problem, Zhu et al. [2012] proposed a variant of KCCA called mixed kernel CCA (MKCCA). The MKCCA algorithm consists of two steps. The first step uses a mixture of chosen kernels to map each dataset to a higher-dimensional space which is smaller than the Hilbert space produced by KCCA while large enough to capture interesting phenomena. In the second step, multi-view subspace learning is performed by applying PCA followed by CCA (or multi-view CCA [Rupnik and Shawe-Taylor, 2010]) on the principal components.
Another family of subspace learning algorithms is based on matrix factorisations, where each data matrix is factorised into the product of two lower-dimensional matrices: one encodes the low-rank projection of the original data and the other contains the coefficients of linear transformations from the lower-dimensional space to the original space.
Akata et al. [2011] enforced identical low-rank representations of each subject for measurements taken from disparate views and estimated the low-rank projection matrix and transformation matrices such that they jointly minimise the total reconstruction error over all data matrices. They further employed “non-negative matrix factorisation” constraints, which required all estimates to be non-negative, in an application involving image and label data which were naturally encoded by non-negative values. Jia et al. [2010] employed the same matrix factorisation without non-negativity constraints and proposed to use either the group lasso penalty or the ℓ1/ℓ∞ penalty [Zhao et al., 2009] on the coefficient matrices such that the linear transformations involve a small number of the most important original variables. An additional regulariser was imposed on the rank of the shared projection matrix, penalising redundant dimensions. The same matrix factorisation principle has also been applied to tensor data, resulting in multi-view tensor factorisations [Takeuchi et al., 2013, Acar et al., 2014].
A slightly different approach was taken by Liu et al. [2013], who applied non-negative matrix factorisation to each data matrix and estimated these factorisations jointly with a consensus projection matrix. Regularisation was employed such that the low-dimensional projection of the data from each view was encouraged to be similar to the consensus projection. Standard clustering algorithms can then be applied to this consensus projection to obtain a consistent clustering across the views.
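For the unconstrained case (no non-negativity, sparsity or rank penalties), the shared-representation factorisation described above has a closed-form solution, which the following sketch illustrates. The function name `shared_factorisation` and the symbols Z and W_v are assumptions made for this example. Because the total squared reconstruction error over the views equals the squared error of the column-wise concatenated matrix, a truncated SVD of the concatenation gives a minimiser with an identical low-rank representation Z for every view.

```python
import numpy as np

def shared_factorisation(views, rank):
    """Factorise each view as X_v ~ Z @ W_v with one shared Z.

    views: list of n x p_v arrays measured on the same n subjects.
    Returns Z (n x rank, orthonormal columns) and a list of W_v (rank x p_v)
    minimising the sum over views of ||X_v - Z W_v||_F^2.
    """
    X = np.hstack(views)                       # n x (p_1 + ... + p_V)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :rank]                            # shared low-rank representation
    W = s[:rank, None] * Vt[:rank]             # coefficients for all views
    # Split the coefficient matrix back into one block per view.
    splits = np.cumsum([v.shape[1] for v in views])[:-1]
    return Z, np.split(W, splits, axis=1)
```

The constrained variants reviewed above (non-negativity, group lasso or ℓ1/ℓ∞ penalties) no longer admit this closed form and are typically fitted by iterative optimisation.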
Shared and view-specific latent subspace learning
The models and algorithms discussed in this subsection aim directly at simultaneously learning shared and view-specific patterns in multiple views. The general framework consists of decomposing each data matrix into the sum of a low-rank matrix containing information shared across the views and a low-rank matrix containing the view-specific information. Typically, algebraic constraints are imposed on the low-rank matrices such that the matrices containing shared information are orthogonal (or almost orthogonal) to the matrices containing view-specific information, thus penalising redundant representations of data patterns in both components.
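In symbols, the general framework can be written as follows. The notation is assumed here purely for illustration and is not taken from the cited papers: X^{(v)} denotes the n × p_v data matrix of view v, J^{(v)} and A^{(v)} its shared and view-specific low-rank components, E^{(v)} the residual, and r, r_v the respective ranks.

```latex
\begin{align*}
X^{(v)} &= J^{(v)} + A^{(v)} + E^{(v)}, \qquad v = 1, \dots, m, \\
\operatorname{rank}\bigl(J^{(v)}\bigr) &\le r, \qquad
\operatorname{rank}\bigl(A^{(v)}\bigr) \le r_v, \\
\bigl(J^{(v)}\bigr)^{\top} A^{(v)} &= 0
\quad \text{(or } \bigl\| \bigl(J^{(v)}\bigr)^{\top} A^{(v)} \bigr\|_F^2 \text{ penalised, for near-orthogonality).}
\end{align*}
```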
Salzmann et al. [2010] proposed the factorised orthogonal latent spaces (FOLS) model, in which the low-rank matrices encoding shared information were identical in all views. The full decomposition was obtained by minimising a loss function which quantifies the information not captured by the shared and view-specific latent subspaces, plus three penalty terms: one penalising the correlation between the shared and view-specific latent spaces and between the view-specific latent spaces of each pair of views, one penalising the total dimensionality of the shared and view-specific latent subspaces, and one preventing trivial solutions.
Lock et al. [2013] applied matrix factorisation to both low-rank matrices containing shared and view-specific information, each being written as the product of a matrix containing projected data and a coefficient matrix which mapped the projection space back to the original space using linear transformations. The projection matrices corresponding to the shared information were assumed to be identical, and within each view the shared information matrix was constrained to be orthogonal to the matrix containing view-specific information. The proposed model, called “joint and individual variation explained (JIVE)”, represented a generalisation of PCA and provided low-dimensional projections of the data into the shared and view-specific subspaces. Zhou et al. [2013] discussed efficient algorithms for computing JIVE with the addition of non-negativity constraints on the matrix factorisations, and showed two applications of the low-rank projections obtained from JIVE in classification and clustering tasks. JIVE differs from FOLS [Salzmann et al., 2010] in three ways. Firstly, the ranks of the shared and view-specific information matrices in JIVE have to be pre-specified, whereas in FOLS they are regularised in the objective function. Secondly, the shared and view-specific information matrices are orthogonal in JIVE, whereas they are only encouraged to be orthogonal in FOLS. Thirdly, JIVE does not require or encourage the view-specific information matrices to be pairwise orthogonal; as a consequence, patterns shared by only a subset of the views are categorised as view-specific information in JIVE.
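A decomposition of this kind can be sketched with a simple alternating scheme. The code below is a simplified JIVE-style illustration under stated assumptions, not the exact algorithm of Lock et al. [2013]: the function names, the fixed iteration count and the samples-by-variables orientation are choices made for this example, and the views are assumed to be column-centred with pre-specified ranks. The joint step fits a low-rank matrix to the concatenated data after removing the current view-specific parts; the individual step fits a low-rank matrix to each residual after projecting out the shared score space, which enforces the orthogonality constraint.

```python
import numpy as np

def truncated_svd(M, rank):
    """Leading `rank` singular triplets of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank]

def jive_like(views, joint_rank, indiv_ranks, n_iter=50):
    """JIVE-style split of each n x p_v view into joint + individual parts.

    Returns two lists (joint, indiv) of arrays with the same shapes as the
    views. The columns of each individual part are orthogonal to the shared
    score space, so joint[v].T @ indiv[v] is zero up to rounding.
    """
    indiv = [np.zeros_like(X, dtype=float) for X in views]
    splits = np.cumsum([X.shape[1] for X in views])[:-1]
    for _ in range(n_iter):
        # Joint step: rank-r fit to the concatenation of (view - individual part).
        R = np.hstack([X - A for X, A in zip(views, indiv)])
        U, s, Vt = truncated_svd(R, joint_rank)
        joint = np.split((U * s) @ Vt, splits, axis=1)
        # Individual step: project the residual off the shared scores U,
        # then take a rank-r_v fit within each view.
        P = np.eye(U.shape[0]) - U @ U.T
        indiv = []
        for X, J, r in zip(views, joint, indiv_ranks):
            Ui, si, Vti = truncated_svd(P @ (X - J), r)
            indiv.append((Ui * si) @ Vti)
    return joint, indiv
```

Note that nothing in this scheme makes the individual parts of different views orthogonal to one another, which mirrors the third difference from FOLS listed above.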
In Bayesian statistics, Archambeau and Bach [2009] proposed a probabilistic model with the same underlying matrix factorisation as JIVE, in which the low-rank projections of the shared and view-specific components were assumed to be pairwise uncorrelated, and sparsity-inducing priors were placed on the coefficient matrices to impose sparse linear transformations from the shared and view-specific subspaces to the original space. This model was extended by Qu and Chen [2011] and Ray et al. [2013], who used different priors, parameterisations, and inference algorithms, although all of these models conceptually fall within the same framework as JIVE.
Another framework in shared and view-specific subspace learning consists of simultaneously extracting latent factors that explain a large proportion of the data variation (not to be confused with variance or variability, which are used interchangeably to refer to the statistical definition of variance) across the views and evaluating the association between these latent factors and the views. This framework can identify not only latent factors which regulate the variation in all views or in a specific view, but also the variation shared by a subgroup of views. For instance, the higher-order generalised SVD (HO-GSVD) approach of Ponnapalli et al. [2011] extends the generalised SVD algorithm to more than two matrices and can be used to compare measurements from multiple data modalities. HO-GSVD jointly applies an SVD-like decomposition to the transpose of each data matrix while enforcing identical right singular vectors, which can be interpreted as enforcing the same low-rank projection of the data from different views. Each singular value of the decomposition in a specific view indicates the importance of the corresponding projected dimension in that view. By comparing the jth singular value across all views, one can conclude whether the jth dimension of the projection space explains variation common to all views, or variation specific to a subgroup of views or to a particular view. In Bayesian statistics, Klami et al. [2014] recently proposed a group factor analysis approach which adopted a matrix factorisation similar to that of HO-GSVD, except that the singular values were combined with the left singular vectors, which encoded the linear transformations mapping the projected data to the original space. The association between each dimension of the projected space (a latent factor) and each view was modelled by a generalised linear regression, which was used to determine the prior of the coefficients of the linear transformations in the matrix factorisations. As such, this method could extract a small number of latent factors which explained the non-random variation among the datasets while identifying the dataset(s) regulated by each latent factor.
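The HO-GSVD construction can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the function name `ho_gsvd` is hypothetical, each input D_i stands for the transpose of a data matrix (variables by subjects) and is assumed to have full column rank, and the shared basis V is taken from the eigenvectors of the average of the pairwise quotients A_i A_j^{-1}, with A_i = D_i' D_i. The sketch uses the identity that the sum of these quotients over all i ≠ j equals (Σ A_i)(Σ A_j^{-1}) − N·I, so the eigenvectors can be computed through a symmetric eigenproblem.

```python
import numpy as np

def ho_gsvd(mats):
    """HO-GSVD-style factorisation D_i = U_i diag(sigma_i) V^T with a shared V.

    mats: list of N arrays D_i of shape (m_i, n), each of full column rank.
    Returns (Us, sigmas, V, lam), where lam holds the eigenvalues of the
    averaged quotient matrix S.
    """
    N = len(mats)
    grams = [D.T @ D for D in mats]                  # A_i = D_i' D_i
    P = sum(grams)                                   # sum of A_i
    Q = sum(np.linalg.inv(A) for A in grams)         # sum of A_i^{-1}
    # Eigenvectors of P @ Q (hence of S) via the symmetric matrix L' Q L, P = L L'.
    L = np.linalg.cholesky(P)
    mu, W = np.linalg.eigh(L.T @ Q @ L)
    V = L @ W
    V /= np.linalg.norm(V, axis=0)                   # unit-norm shared basis vectors
    lam = (mu - N) / (N * (N - 1))                   # eigenvalues of S
    Us, sigmas = [], []
    for D in mats:
        B = np.linalg.solve(V, D.T).T                # solves V B' = D'
        sigma = np.linalg.norm(B, axis=0)            # view-specific "singular values"
        Us.append(B / sigma)
        sigmas.append(sigma)
    return Us, sigmas, V, lam
```

Comparing `sigmas[i][j]` across the views i for a fixed j then indicates whether the jth shared basis vector carries similar weight in all views or is dominated by a subset of them.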
4.3 Sparse multi-view matrix factorisation
In this section we present a novel method, sparse multi-view matrix factorisation (sMVMF), to facilitate comparison of gene expression variance in multiple tissues. We reiterate the need to distinguish the variance that is shared across all tissues (views) from that which is characteristic of a specific tissue. sMVMF belongs to the family of shared and view-specific latent subspace learning algorithms introduced at the end of Section 4.2. However, sMVMF differs from existing methods in that it does not require the subjects recruited from multiple views to be matched, and in that it decomposes the total variance into the sum of shared and tissue-specific variances. Further technical discussion of sMVMF and related methods is given in Section 4.6.