2.2 Feature descriptors
2.3.5 Manifold learning
Manifold learning is normally applied for dimensionality reduction. It yields a lower-dimensional embedding of data that lie in a high-dimensional space while preserving the underlying local structure. For a dataset $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^M$ in the original high-dimensional space, we assume there exists a lower-dimensional manifold that represents $X$ well. Manifold learning seeks a low-dimensional representation $Y = \{y_1, y_2, \ldots, y_n\} \subset \mathbb{R}^m$ where $m \ll M$ [van der Maaten et al., 2009].
Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) compute a global projection using a linear transformation of the data to a low-dimensional space. We include PCA and MDS in this section as linear manifold algorithms in the context of dimensionality reduction. The non-linear manifold learning techniques can be divided into local and global methods. Local methods such as Locally Linear Embedding (LLE) [Roweis and Saul, 2000] and Laplacian Eigenmaps [Belkin and Niyogi, 2003] aim to preserve local distances between data points. Global methods such as Isomap [Tenenbaum et al., 2000] consider the distances between all pairs of data points in their objective function, and hence preserve global distances.
2.3. Machine learning 63
PCA
Principal Component Analysis (PCA) is a linear dimensionality reduction algorithm, which converts a dataset of possibly correlated variables into a set of linearly uncorrelated variables, i.e. principal components [Pearson, 1901] [Jolliffe, 2002].
For a given mean-centred dataset $X$, PCA finds the directions along which the dataset has maximum variance. This is achieved by solving the following eigen-problem for the eigenvalue $\lambda$ and corresponding eigenvector $v$:
$$\lambda v = C v \tag{2.32}$$
where $X$ is of size $N \times M$ ($N$ data points of dimension $M$) and its $M \times M$ covariance matrix $C$ is defined as follows:
$$C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T \tag{2.33}$$
The $m$ eigenvectors associated with the $m$ largest eigenvalues ($m \ll M$) form the columns of a transform matrix $U$, which projects the dataset $X$ into a low-dimensional space as $Y = U^T X$, i.e. $y_i = U^T x_i$ for each data point.
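The PCA steps described here can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `pca`, the toy data and the row-wise data layout (one sample per row) are assumptions made for the example.

```python
import numpy as np

def pca(X, m):
    """Project the rows of X (N samples x M features) onto the top-m principal components."""
    Xc = X - X.mean(axis=0)                 # mean-centre the data
    C = Xc.T @ Xc / Xc.shape[0]             # M x M covariance matrix (Eq. 2.33)
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]   # indices of the m largest eigenvalues
    U = eigvecs[:, order]                   # M x m transform matrix
    Y = Xc @ U                              # N x m; row i equals U^T x_i
    return Y, U, eigvals[order]

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 points in 3D, stretched mostly along the first axis
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
Y, U, variances = pca(X, m=2)
```

Under these assumptions the columns of `Y` are uncorrelated, and the variance of each column equals the corresponding eigenvalue in `variances`.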
PCA is the most popular linear dimensionality reduction algorithm. However, its main drawback is the assumption that the data lie in a linear subspace. Kernel PCA was therefore introduced; it computes the covariance matrix after performing a kernel-based transformation of the data [Mika et al., 1998].
MDS
Multidimensional Scaling (MDS) [Kruskal, 1964] [Cox and Cox, 2000] projects data points from a high-dimensional space to a low-dimensional embedding, while preserving their pairwise similarity or dissimilarity. Such similarity or dissimilarity can be measured by the Euclidean distance. Mathematically, MDS seeks to find a linear transformation by minimizing the following objective function:
$$\sum_{i=1}^{N} \sum_{j=1}^{N} \left( \|x_i - x_j\|_2 - \|y_i - y_j\|_2 \right)^2 \tag{2.34}$$
Here $x_i$ is a data point in the original high-dimensional space and $y_i$ is the corresponding projected point in the low-dimensional space. Solving this optimisation problem means that if two data points $x_i$ and $x_j$ are close to each other in the high-dimensional space, then the two transformed data points $y_i$ and $y_j$ are also close to each other in the low-dimensional space.
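As an illustration, the sketch below evaluates the objective of Equation 2.34 for an embedding computed with classical (Torgerson) MDS. Note the substitution: classical MDS uses an eigendecomposition of the double-centred squared-distance matrix, which optimises a closely related criterion rather than Equation 2.34 directly. The function names `classical_mds` and `stress` and the toy data are assumptions made for the example.

```python
import numpy as np

def classical_mds(D, m):
    """Embed N points in m dimensions from an N x N matrix of pairwise distances D."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N       # centring matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]     # m largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)   # N x m embedding

def stress(D_high, Y):
    """Objective of Eq. 2.34: squared mismatch between pairwise distances."""
    D_low = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return np.sum((D_high - D_low) ** 2)

rng = np.random.default_rng(0)
# Hypothetical toy data: 50 points on a 2D plane embedded isometrically in 5D
Z = rng.normal(size=(50, 2))
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # 5 x 2 matrix with orthonormal columns
X = Z @ Q.T
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, m=2)
raw_stress = stress(D, Y)
```

Because the toy data are exactly two-dimensional, the pairwise distances can be preserved without loss and `raw_stress` is zero up to floating-point error. For data with genuinely higher intrinsic dimension the stress would be strictly positive.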
Isomap
Isomap [Tenenbaum et al., 2000] is a nonlinear generalization of classical MDS. It calculates pairwise geodesic distances instead of Euclidean distances. The geodesic distances are defined as the shortest paths along the curved surface of the manifold. The shortest paths can be calculated by Floyd’s algorithm [Floyd, 1962] or Dijkstra’s algorithm [Dijkstra, 1959]. Moreover, under the assumption of a sufficiently smooth manifold, the geodesic distance between nearby points is well approximated by the Euclidean distance. The Isomap embedding is computed in three steps. First, the neighbours of each data point (the $k$ nearest neighbours or all points within an $\epsilon$-radius) are identified in the high-dimensional space. Then, the geodesic distances between all pairs of points are computed as shortest paths through the resulting neighbourhood graph. Finally, the embedding is obtained by applying MDS so as to preserve the pairwise geodesic distances in the low-dimensional space.
Isomap considers the distances between all pairs of data points; hence it preserves global distances and is regarded as a global manifold learning algorithm.
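The three Isomap steps can be sketched as follows. This is a small dense illustration under stated assumptions: the function name `isomap` is hypothetical, the neighbourhood is defined by $k$ nearest neighbours, shortest paths use a vectorised Floyd-Warshall recursion, and the final step uses classical MDS. It is suitable only for small $N$.

```python
import numpy as np

def isomap(X, k, m):
    """Embed the rows of X in m dimensions using k-nearest-neighbour geodesic distances."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Step 1: k-nearest-neighbour graph; np.inf marks "no edge"
    G = np.full((N, N), np.inf)
    np.fill_diagonal(G, 0.0)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]     # column 0 is the point itself
    rows = np.repeat(np.arange(N), k)
    G[rows, nn.ravel()] = D[rows, nn.ravel()]
    G = np.minimum(G, G.T)                     # make the graph undirected

    # Step 2: geodesic distances as shortest paths (Floyd-Warshall)
    for p in range(N):
        G = np.minimum(G, G[:, p:p + 1] + G[p:p + 1, :])
    if np.isinf(G).any():
        raise ValueError("neighbourhood graph is disconnected; increase k")

    # Step 3: classical MDS on the geodesic distance matrix
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:m]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))

# Hypothetical toy data: 60 points along three quarters of a unit circle
t = np.linspace(0.0, 1.5 * np.pi, 60)
X = np.column_stack([np.cos(t), np.sin(t)])
Y = isomap(X, k=2, m=1)
```

On this toy arc, the one-dimensional embedding should follow the arc-length parameter `t` (up to sign and offset), whereas the straight-line Euclidean distance between the two end points would understate how far apart they are along the curve.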
LLE
Locally Linear Embedding (LLE), first introduced by Roweis and Saul [Roweis and Saul, 2000], is a nonlinear dimensionality reduction approach which preserves the local neighbourhood in the high dimensional space. Hence it is regarded as a local manifold learning algorithm.
Similar to Isomap, there are three steps in the computation of LLE. First, the neighbours of each data point (the $k$ nearest neighbours or all points within an $\epsilon$-radius) are identified in the high-dimensional space. Then, each data point is represented as a weighted combination of its neighbours.
The weights ωij are obtained by minimizing the following cost function:
$$\sum_{i=1}^{N} \Big\| x_i - \sum_{j=1}^{k} \omega_{ij}\, x_{N_i(j)} \Big\|^2 \tag{2.35}$$
where Ni(j) indicates the jth neighbour of the ith point.
Finally the projected points in the low dimensional space can be obtained by minimizing the following cost function with the ωij obtained in Equation 2.35:
$$\sum_{i=1}^{N} \Big\| y_i - \sum_{j=1}^{k} \omega_{ij}\, y_{N_i(j)} \Big\|^2 \tag{2.36}$$
Minimizing Equation 2.36 amounts to solving a sparse $N \times N$ eigenvalue problem, whose $m$ bottom eigenvectors with nonzero eigenvalues provide an orthogonal set of coordinates to construct $Y$.
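A dense NumPy sketch of the three LLE steps is given below. Several details are assumptions not stated in the text above: the weights of each point are constrained to sum to one (as in the formulation of Roweis and Saul), the local Gram matrix is regularised through a hypothetical `reg` parameter so that it stays invertible when $k$ exceeds the input dimension, and the function name `lle` and the toy data are illustrative only.

```python
import numpy as np

def lle(X, k, m, reg=1e-3):
    """Embed the rows of X in m dimensions with Locally Linear Embedding."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]       # step 1: k nearest neighbours

    # Step 2: reconstruction weights minimising Eq. 2.35
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[nn[i]] - X[i]                      # neighbours relative to x_i
        G = Z @ Z.T                              # k x k local Gram matrix
        G += reg * np.trace(G) * np.eye(k)       # regularisation (assumption)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nn[i]] = w / w.sum()                # sum-to-one constraint (assumption)

    # Step 3: bottom eigenvectors of (I - W)^T (I - W) minimise Eq. 2.36
    M_mat = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M_mat)     # ascending eigenvalues
    Y = eigvecs[:, 1:m + 1]                      # skip the constant eigenvector
    return Y, W

rng = np.random.default_rng(0)
# Hypothetical toy data: a curved 2D sheet embedded in 3D
u, v = rng.uniform(size=150), rng.uniform(size=150)
X = np.column_stack([u, v, np.sin(np.pi * u)])
Y, W = lle(X, k=8, m=2)
```

The first eigenvector is discarded because, with weights that sum to one, the constant vector always has eigenvalue zero and carries no information about the data.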
Laplacian Eigenmaps
Laplacian Eigenmaps [Belkin and Niyogi, 2003, Von Luxburg, 2007] finds a low-dimensional representation which preserves the local properties of the data by ensuring that the local neighbourhood in the high-dimensional space is reflected in the low-dimensional space. In manifold learning, it is common to use a similarity matrix to represent the relations between pairs of data points. The similarity matrix may also be viewed as a graph in which each vertex denotes a data point and the weight of each edge corresponds to the similarity or dissimilarity between the pair of data points it connects.
Laplacian eigenmaps build a sparsely connected graph from a pairwise similarity matrix W computed from the data set. The strength of the connection between each pair of data points in the graph is defined by the similarity matrix W:
$$w_{ij} = \begin{cases} s_{ij}, & \text{if } x_i \in \psi(x_j) \\ 0, & \text{otherwise} \end{cases} \tag{2.37}$$
Here $\psi(x_j)$ denotes a local neighbourhood of $x_j$, which can be defined by its $k$ nearest neighbours.
The manifold is obtained by minimizing the objective function:
$$\begin{aligned} \Phi(Y) &= \sum_{i,j=1}^{n} w_{ij} (y_i - y_j)^2 \\ &= \sum_{i=1}^{n} d_i y_i^2 - 2 \sum_{i,j=1}^{n} y_i y_j w_{ij} + \sum_{j=1}^{n} d_j y_j^2 \\ &= 2 \Big( \sum_{i=1}^{n} d_i y_i^2 - \sum_{i,j=1}^{n} y_i y_j w_{ij} \Big) \\ &= 2 (Y^T D Y - Y^T W Y) = 2 Y^T (D - W) Y = 2 Y^T L Y \end{aligned} \tag{2.38}$$
where $y_i, y_j \in Y$ are the low-dimensional embeddings of the input data $X$; $d_i = \sum_{j=1}^{n} w_{ij}$ and $d_j = \sum_{i=1}^{n} w_{ij}$; $D$ is the diagonal matrix with entries $d_i$; and $L = D - W$ is the graph Laplacian. The objective function is optimised under the constraint $Y^T D Y = 1$, which removes arbitrary scaling factors in the embedding and prevents the trivial solution where all $y_i$ are zero. The $y_i$ that optimise the objective function are derived by solving the generalised eigenvalue problem for the eigenvalue $\lambda$ and its corresponding eigenvector $v$:
$$L v = \lambda D v$$
The embedding coordinates Y are formed by the eigenvectors corresponding to the smallest nonzero eigenvalues. Table 2.3 summarises the most important properties for the manifold learning techniques discussed in this section [Wittman, 2005].
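The Laplacian Eigenmaps procedure can be sketched as follows. The text does not fix the similarity $s_{ij}$, so this example assumes a heat-kernel similarity with a hypothetical bandwidth parameter `sigma`. The function name `laplacian_eigenmaps` and the toy data are also illustrative. To stay within NumPy, the generalised problem $Lv = \lambda D v$ is solved through the equivalent symmetric problem $D^{-1/2} L D^{-1/2} u = \lambda u$ with $v = D^{-1/2} u$.

```python
import numpy as np

def laplacian_eigenmaps(X, k, m, sigma=1.0):
    """Embed the rows of X in m dimensions; also return eigenvalues, W and L."""
    N = X.shape[0]
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    nn = np.argsort(D2, axis=1)[:, 1:k + 1]       # k nearest neighbours
    rows = np.repeat(np.arange(N), k)

    # Sparse similarity graph (Eq. 2.37) with heat-kernel weights (assumption)
    W = np.zeros((N, N))
    W[rows, nn.ravel()] = np.exp(-D2[rows, nn.ravel()] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                        # symmetrise the graph

    d = W.sum(axis=1)                             # degrees d_i
    L = np.diag(d) - W                            # graph Laplacian L = D - W

    # Solve L v = lambda D v via the symmetric normalised Laplacian
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, U = np.linalg.eigh(L_sym)            # ascending eigenvalues
    V = d_inv_sqrt[:, None] * U                   # each column satisfies v^T D v = 1
    # Skip the first eigenvector (eigenvalue 0 for a connected graph)
    return V[:, 1:m + 1], eigvals[1:m + 1], W, L

# Hypothetical toy data: 60 points along three quarters of a unit circle
t = np.linspace(0.0, 1.5 * np.pi, 60)
X = np.column_stack([np.cos(t), np.sin(t)])
Y, lam, W, L = laplacian_eigenmaps(X, k=4, m=2)

# Numerical check of the identity in Eq. 2.38 for the first embedding coordinate
y = Y[:, 0]
lhs = np.sum(W * (y[:, None] - y[None, :]) ** 2)
rhs = 2 * y @ L @ y
```

Here `lhs` and `rhs` agree up to floating-point error, which illustrates Equation 2.38 numerically. Skipping only the first eigenvector presumes that the neighbourhood graph is connected, so that the zero eigenvalue occurs once.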