Multiview tSNE - Multiview pattern recognition methods for data visualization, embedding and cl

3.3 Multiview tSNE

There exist several possible approaches to design an extension to the t-SNE algorithm that can process multiview data. Two approaches are presented in this thesis. The first solution proposed is to use a multi-objective optimization gradient descent method to find a projection of the original data points that minimizes the divergence between the projection and the different input data views. The second solution uses expert opinion pooling to aggregate the conditional probability distributions of all the input views and therefore transform the multiview problem into a standard tSNE problem applied to the pooled probability matrix. Both approaches are described next.

3.3.1 MV-tSNE as a multiobjective optimization problem

The tSNE dimensionality reduction method finds a reasonably good projection of a set of points in a high-dimensional space to a low-dimensional space by minimizing the Kullback-Leibler divergence (KL) between two conditional probability distributions, as explained in Section 3.2.2. It uses gradient descent optimization in order to find the most convenient arrangement of points in the low-dimensional space. On each iteration of the gradient descent, it computes the gradient of the KL between the conditional probability distribution matrices with respect to the position of the points in the low-dimensional space.

The multiview tSNE extension presented in this section (MV-tSNE1) differs from the standard tSNE algorithm in the following aspects.

Conditional probability distributions. SNE and tSNE (see Sections 3.2.1 and 3.2.2) convert the Euclidean distances between the points of the input space $X$ into a matrix of conditional probabilities according to a Gaussian distribution of the distances. In a multiview setting, there exist $v$ input spaces $\{X^1, X^2, \ldots, X^v\}$. MV-tSNE1 computes the conditional probability matrix of each $X^k$ using Equation 3.1, thus obtaining $v$ probability matrices $P^k$, $k = 1, \ldots, v$.

However, there is a single conditional probability distribution matrix $Q$ for the low-dimensional space $Y$, as the goal of MV-tSNE1 is to produce a unique data projection common to all the input views $X^k$. $Q$ is computed as in the tSNE algorithm, using Equation 3.7.
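As a rough illustration of the two kinds of affinity matrices just described, the following Python/NumPy sketch builds one conditional probability matrix per view and a single low-dimensional matrix. It is not the thesis implementation: Equations 3.1 and 3.7 are not reproduced in this section, so the sketch assumes the usual SNE Gaussian kernel and the usual tSNE Student-t kernel, and it replaces the perplexity-based bandwidth search with one fixed bandwidth `sigma`. All function names and the toy data are hypothetical.

```python
import numpy as np

def squared_distances(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off

def conditional_p(X, sigma=1.0):
    """Row-stochastic matrix whose row i holds p_{j|i} (Gaussian kernel).

    Assumption: a single fixed bandwidth `sigma` for every point, instead of
    the per-point bandwidth that a perplexity search would produce.
    """
    logits = -squared_distances(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # forces p_{i|i} = 0
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)

def joint_q(Y):
    """Low-dimensional affinities q_ij from a Student-t kernel (assumed form)."""
    W = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

rng = np.random.default_rng(0)
views = [rng.normal(size=(6, 5)), rng.normal(size=(6, 3))]  # v = 2 toy views
P_views = [conditional_p(X) for X in views]                 # one P^k per view
Q = joint_q(rng.normal(scale=1e-2, size=(6, 2)))            # a single Q
```

Each view may have a different number of features, but all views must describe the same $n$ entities in the same order, so every $P^k$ has the same shape as $Q$.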

CHAPTER 3. MULTIVIEW T-DISTRIBUTED STOCHASTIC NEIGHBOUR EMBEDDING

Cost function. The cost function of MV-tSNE1 is the sum of the KL divergences of all the input conditional probability matrices $P^k$ with respect to the low-dimensional conditional probability matrix $Q$:

$$C = \sum_{k=1}^{v} \sum_{i=1}^{n} \mathrm{KL}(P^k_i \,\|\, Q_i) = \sum_{k=1}^{v} \sum_{i=1}^{n} \sum_{j=1}^{n} p^k_{j|i} \log \frac{p^k_{j|i}}{q_{j|i}} \tag{3.17}$$

However, the gradient of the combined objectives is not required, as the multi-objective optimization algorithm used works with the gradients of each objective (each input view). As a consequence, Equation 3.8 applied to each matrix $P^k$ still holds in this algorithm.
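A minimal sketch of Equation 3.17 in Python/NumPy follows. It returns the per-view partial costs alongside their sum, because the multi-objective optimizer works with each objective separately. The helper names and the small hand-written matrices are hypothetical, and the `eps` guard against division by zero is an implementation assumption rather than part of the equation.

```python
import numpy as np

def kl_rows(P, Q, eps=1e-12):
    """Sum over rows i of KL(P_i || Q_i), treating 0 * log(0) as 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

def mv_cost(P_views, Q):
    """Partial costs C^k (one per view) and their sum C."""
    partial = [kl_rows(P, Q) for P in P_views]
    return partial, sum(partial)

# Toy example with n = 3 entities and v = 2 views.
Q = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
P1 = Q.copy()                        # a view that Q matches exactly
P2 = np.array([[0.0, 0.9, 0.1],
               [0.5, 0.0, 0.5],
               [0.2, 0.8, 0.0]])     # a view that Q matches only partly
partial, total = mv_cost([P1, P2], Q)
```

In this toy case the first partial cost is zero and the whole cost comes from the second view, which shows how one view can dominate the summed objective.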

Multi-objective gradient descent optimization. In order to minimize the KL divergence of the low-dimensional points with respect to the high-dimensional input views, this multiview dimensionality reduction problem requires a multi-objective gradient descent method, more specifically the Multiobjective Gradient Descent Algorithm (MGDA) described in Section 3.2.3. On each iteration of the optimization algorithm, the gradients of the different input views are computed and combined in such a way that the Pareto efficiency criterion holds. In other words, the change on each iteration never worsens the partial cost value of any input view (problem objective).
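To make the combination step more concrete, the sketch below computes the minimum-norm vector in the convex hull of two gradients, a special case that has a closed form. It is only an illustration: Section 3.2.3 is not reproduced here, the function name is hypothetical, and with more than two views the same vector would have to be found by a small constrained optimization instead.

```python
import numpy as np

def min_norm_two(g1, g2):
    """Minimum-norm point of the segment {a*g1 + (1-a)*g2 : 0 <= a <= 1}.

    Minimising ||g2 + a*(g1 - g2)||^2 over a gives
    a = (g2 - g1).g2 / ||g1 - g2||^2, clipped to [0, 1].
    """
    diff = g1 - g2
    denom = float(np.sum(diff * diff))
    if denom == 0.0:                  # identical gradients: any a works
        return g1.copy(), 0.5
    a = float(np.sum((g2 - g1) * g2)) / denom
    a = min(1.0, max(0.0, a))
    return a * g1 + (1.0 - a) * g2, a

# Orthogonal gradients: the common vector has a positive inner product
# with both of them.
w_orth, a_orth = min_norm_two(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Opposed gradients: the minimum-norm vector is zero, so no common
# direction improves both objectives.
w_opp, a_opp = min_norm_two(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
```

The second call shows the situation in which the two views disagree completely: the returned vector vanishes and an optimizer driven by it would stop moving.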

The use of a momentum vector to improve the performance of the gradient descent algorithm is not defined in MGDA. In fact, applying a momentum term would often conflict with the Pareto-compliant direction of change $\omega$. Therefore, that improvement from tSNE is removed in MV-tSNE1. The specification of MV-tSNE1 is presented in Algorithm 3.

Algorithm 3. Multiview t-Distributed Stochastic Neighbour Embedding 1

Input: $v$ data views of the same $n$ entities $X^k = \{x^k_1, x^k_2, \ldots, x^k_n\}$, where $k = 1, 2, \ldots, v$,
    cost function parameters: perplexity $Perp$,
    optimization parameters: number of iterations $T$, learning rate $\eta$
Output: low-dimensional data representation $Y^{(T)} = \{y_1, y_2, \ldots, y_n\}$

function tSNE($X^1, X^2, \ldots, X^v, Perp, T, \eta$)
    compute pairwise affinities $p^k_{j|i}$ with perplexity $Perp$ (using Equation 3.1) for each $X^k$, $k = 1, \ldots, v$
    $p^k_{ij} \leftarrow p^k_{j|i} + p^k_{i|j}$
    sample initial solution $Y^{(0)} = \{y_1, y_2, \ldots, y_n\}$ from $\mathcal{N}(0, 10^{-4})$
    for $t \leftarrow 1$ to $T$ do
        compute low-dimensional affinities $q_{ij}$ (using Equation 3.7)
        compute gradients $\frac{\delta C^k}{\delta Y}$ (using Equation 3.8 on the partial cost $C^k$ associated with each $p^k_{ij}$)
        compute vector of change $\omega$ as the minimum-norm vector in the convex hull defined by the gradients $\frac{\delta C^k}{\delta Y}$ (using algorithm MGDA)
        $Y^{(t)} \leftarrow Y^{(t-1)} + \eta\omega$
    end for
end function
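The loop of Algorithm 3 can be sketched for the two-view case in Python/NumPy as shown below. This is an illustrative approximation under several stated assumptions, not the thesis code: the bandwidth is fixed instead of being tuned from the perplexity, the symmetrised affinities use the $1/(2n)$ scaling of standard tSNE, Equation 3.8 is taken to be the usual tSNE gradient, and $\omega$ is built from the raw cost gradients, so the update steps against it. All names are hypothetical.

```python
import numpy as np

def sqdist(X):
    sq = np.sum(X ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

def joint_p(X, sigma=1.0):
    """Symmetrised Gaussian affinities p_ij (fixed sigma is an assumption)."""
    logits = -sqdist(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)
    W = np.exp(logits - logits.max(axis=1, keepdims=True))
    P = W / W.sum(axis=1, keepdims=True)        # conditional p_{j|i}
    return (P + P.T) / (2.0 * X.shape[0])       # assumed 1/(2n) scaling

def kl_grad(P, Y):
    """4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + |y_i - y_j|^2)^-1 for every i."""
    K = 1.0 / (1.0 + sqdist(Y))                 # Student-t kernel
    np.fill_diagonal(K, 0.0)
    A = (P - K / K.sum()) * K
    return 4.0 * (A.sum(axis=1)[:, None] * Y - A @ Y)

def min_norm_two(g1, g2):
    """Closed-form minimum-norm point between two gradients."""
    d = g1 - g2
    denom = float(np.sum(d * d))
    a = 0.5 if denom == 0.0 else float(np.clip(np.sum((g2 - g1) * g2) / denom, 0.0, 1.0))
    return a * g1 + (1.0 - a) * g2

def mv_tsne1_two_views(X1, X2, T=50, eta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    P1, P2 = joint_p(X1), joint_p(X2)
    Y = rng.normal(scale=1e-2, size=(X1.shape[0], 2))   # variance 1e-4
    for _ in range(T):
        omega = min_norm_two(kl_grad(P1, Y), kl_grad(P2, Y))
        Y = Y - eta * omega        # no momentum term, as in MV-tSNE1
    return Y

rng = np.random.default_rng(42)
X1 = rng.normal(size=(12, 4))      # toy view 1
X2 = rng.normal(size=(12, 6))      # toy view 2, same 12 entities
Y = mv_tsne1_two_views(X1, X2)
```

Note that both gradients are recomputed on every iteration, which is where the per-view cost of the method comes from.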


Limitations of MV-tSNE1. MV-tSNE1 presents several practical limitations that make it unusable on real datasets. The most relevant ones concern its computational cost, which stems from the following causes:

• On each iteration, the KL divergence between each input view and the projection has to be computed. This multiplies the computational cost of each iteration by $v$, since the KL computation is the most expensive operation of each iteration, as discussed in Section 3.2.2.

• The execution of the MGDA algorithm on each iteration also adds a significant computational cost. This is especially the case in problems with more than two objectives (i.e. data views), where finding the common descent vector characterized by Equation 3.13 requires running an optimization algorithm on each iteration.

• The fact that MGDA does not support the use of momentum in the optimization loop makes it necessary to run the main loop of Algorithm 3 for more iterations (an order of magnitude more, as seen in experimental trials).

• The strong Pareto condition makes MGDA halt very often at clearly suboptimal points: when it cannot find a vector $\omega$ that satisfies the Pareto efficiency criterion, the optimization process stalls. In other words, the algorithm stops in a local minimum of the multi-objective problem. As a consequence, the whole MV-tSNE1 algorithm has to be executed several times in the hope of finding better solutions.

All these factors combined make MV-tSNE1 extremely expensive in computational terms, which only allows it to be run on "toy" datasets to test its behaviour. Therefore, it has been excluded from the main experiments presented in this work.

3.3.2 MV-tSNE as an expert opinion pooling problem

As seen in Sections 3.2.1 and 3.2.2, SNE and tSNE model the input, high-dimensional space as a matrix of conditional distance probabilities according to a Gaussian distribution. The expression for this matrix $P$ is given in Equation 3.1.

In a multiview scenario there are $v$ input high-dimensional spaces instead of only one. A viable strategy for extending tSNE to multiview datasets is as follows. The first step is to compute a separate matrix of conditional distance probabilities $P^k$ for each input space $X^k$, $k = 1, \ldots, v$, using Equation 3.1. These $v$ probability matrices can be seen as the probability opinions of $v$ different experts on the distribution of distances of the input samples. Therefore, the second step of the algorithm is to compute a pooled opinion probability matrix $\hat{P}$ using the log-linear method described in Section 3.2.4. From that point on, the problem is reduced to finding a low-dimensional space that minimizes the KL divergence between $\hat{P}$ and the conditional probability matrix of the low-dimensional space $Q$, using the approach of the tSNE algorithm.
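The pooling step can be sketched in Python/NumPy as follows. Since Section 3.2.4 is not reproduced here, the sketch assumes the common form of log-linear pooling, a weighted geometric mean of the experts' probabilities renormalised over each row, and it assumes uniform weights $1/v$ when none are given. The function name and the example matrices are hypothetical.

```python
import numpy as np

def log_linear_pool(P_views, weights=None, eps=1e-12):
    """Row-wise log-linear pool: p_hat_{j|i} proportional to prod_k (p^k_{j|i})^{w_k}."""
    P = np.stack(P_views)                          # shape (v, n, n)
    v = P.shape[0]
    w = np.full(v, 1.0 / v) if weights is None else np.asarray(weights, float)
    # Weighted sum of log-probabilities over the view axis.
    log_pool = np.tensordot(w, np.log(np.maximum(P, eps)), axes=1)
    np.fill_diagonal(log_pool, -np.inf)            # keep p_hat_{i|i} = 0
    log_pool -= log_pool.max(axis=1, keepdims=True)
    pooled = np.exp(log_pool)
    return pooled / pooled.sum(axis=1, keepdims=True)

P1 = np.array([[0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5],
               [0.5, 0.5, 0.0]])
P2 = np.array([[0.0, 0.9, 0.1],
               [0.5, 0.0, 0.5],
               [0.2, 0.8, 0.0]])
P_hat = log_linear_pool([P1, P2])
# First row: sqrt(0.5 * 0.9) : sqrt(0.5 * 0.1) = 3 : 1, i.e. [0, 0.75, 0.25].
```

Because the pool is a geometric mean, a neighbour that any single view considers very unlikely stays unlikely in $\hat{P}$, which differs from what an arithmetic average of the matrices would give.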

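Therefore the cost function of the newly defined optimization problem is:

$$C = \sum_{i} \mathrm{KL}(\hat{P}_i \,\|\, Q_i) = \sum_{i} \sum_{j} \hat{p}_{j|i} \log \frac{\hat{p}_{j|i}}{q_{j|i}} \tag{3.18}$$

And its gradient with respect to the low-dimensional projection space $Y$ is:

$$\frac{\delta C}{\delta y_i} = 4 \sum_{j} (\hat{p}_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1} \tag{3.19}$$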
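The gradient expression can be checked numerically. The sketch below, in Python/NumPy, implements the cost and the gradient and compares the latter with central finite differences. It assumes that $\hat{p}_{ij}$ in the gradient is a symmetric matrix whose entries sum to one and that $q_{ij}$ comes from the usual Student-t kernel; the random test matrix and all names are hypothetical.

```python
import numpy as np

def joint_q(Y):
    """Student-t affinities q_ij and the unnormalised kernel (1 + d_ij^2)^-1."""
    sq = np.sum(Y ** 2, axis=1)
    D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    K = 1.0 / (1.0 + D)
    np.fill_diagonal(K, 0.0)
    return K / K.sum(), K

def cost(P, Y):
    """Sum over i != j of p_ij * log(p_ij / q_ij)."""
    Q, _ = joint_q(Y)
    off = ~np.eye(P.shape[0], dtype=bool)
    return float(np.sum(P[off] * np.log(P[off] / Q[off])))

def gradient(P, Y):
    """Row i holds 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + |y_i - y_j|^2)^-1."""
    Q, K = joint_q(Y)
    A = (P - Q) * K
    return 4.0 * (A.sum(axis=1)[:, None] * Y - A @ Y)

rng = np.random.default_rng(1)
n = 5
W = rng.random((n, n))
W = W + W.T                      # symmetric, strictly positive off the diagonal
np.fill_diagonal(W, 0.0)
P_hat = W / W.sum()              # entries sum to one (assumption of the check)
Y = rng.normal(size=(n, 2))

# Central finite differences, one coordinate of Y at a time.
h = 1e-6
numeric = np.zeros_like(Y)
for idx in np.ndindex(*Y.shape):
    E = np.zeros_like(Y)
    E[idx] = h
    numeric[idx] = (cost(P_hat, Y + E) - cost(P_hat, Y - E)) / (2.0 * h)

max_err = float(np.max(np.abs(numeric - gradient(P_hat, Y))))
```

A small `max_err` indicates that the analytic expression agrees with the numerical derivative of the cost under the assumptions listed above.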

This multiview dimensionality reduction algorithm will be referred to as MV-tSNE2, and it is specified in Algorithm 4.

Analysis of the algorithm. An important feature of MV-tSNE2 is that it only needs to compute the pooled opinion matrix once; from that step on, its complexity equals that of the single-view tSNE algorithm, as all the information from the different input spaces is condensed into a single probability matrix, $\hat{P}$. This makes MV-tSNE2 computationally efficient. Also, MV-tSNE2 is compatible with the use of momentum in the gradient descent optimization stage, leading to better performance relative to the number of iterations.
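The following Python/NumPy sketch shows what a momentum-based descent on a single pooled matrix might look like. It is a simplified illustration rather than Algorithm 4: the momentum coefficient `alpha` is held constant, the learning rate is arbitrary, the pooled matrix is a hand-made two-cluster example assumed to be symmetric and normalised, and all names are hypothetical.

```python
import numpy as np

def gradient(P, Y):
    """tSNE-style gradient of the KL cost for a joint matrix P (assumed form)."""
    sq = np.sum(Y ** 2, axis=1)
    D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    K = 1.0 / (1.0 + D)                      # Student-t kernel
    np.fill_diagonal(K, 0.0)
    A = (P - K / K.sum()) * K
    return 4.0 * (A.sum(axis=1)[:, None] * Y - A @ Y)

def descend_with_momentum(P_hat, T=300, eta=1.0, alpha=0.5, seed=0):
    """Gradient descent on Y with a constant momentum coefficient alpha."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(P_hat.shape[0], 2))
    velocity = np.zeros_like(Y)
    for _ in range(T):
        # The velocity equals Y(t-1) - Y(t-2), so this is the update
        # Y(t) = Y(t-1) - eta * grad + alpha * (Y(t-1) - Y(t-2)).
        velocity = alpha * velocity - eta * gradient(P_hat, Y)
        Y = Y + velocity
    return Y

# Hand-made pooled matrix with two groups of three entities each.
n = 6
W = np.full((n, n), 0.1)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
P_hat = W / W.sum()
Y = descend_with_momentum(P_hat)
```

Only one gradient is evaluated per iteration regardless of the number of views, because the views enter the loop solely through `P_hat`.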

For these reasons, the method used in the experiments is MV-tSNE2, and for simplicity it will be referred to as MV-tSNE.

In document Multiview pattern recognition methods for data visualization, embedding and clustering (Page 72-76)