Large-Scale Data Visualization via MI-MDS

4.4 Analysis of Experimental Results

4.4.4 Large-Scale Data Visualization via MI-MDS

Figure 4.15 shows the proposed MI-MDS results of a 100k PubChem dataset with respect to the different sample sizes, such as (a) 12.5k and (b) 50k. Sampled data and interpolated points are colored in red and blue, correspondingly. With our parallel interpolation algorithms for MDS, we have also processed a large volume of PubChem data by using our Cluster-II, and the results are shown in Figure 4.16. We performed parallel MI-MDS to process datasets of hundreds of thousand and up to 4 million by using the 100k PubChem data set as a training set. In Figure 4.16, we show the MI-MDS result of 2 million dataset based on 100k training set, compared to the mapping of 100k training set data. The interpolated points are colored in blue, while the training points are in red. As one can see, our interpolation algorithms have produced a map closed to the training dataset.

4. Interpolation Approach for Multidimensional Scaling 78

(a) MDS 100k (trained set) (b) MDS 2M + 100k

Figure 4.16: Interpolated MDS results. Based on 100k samples (a), additional 2M PubChem dataset is interpolated (b). Sampled data are colored in red and interpolated points are in blue.

4.5 Summary

In this chapter, we have proposed interpolation algorithms for extending the MDS dimension reduction approaches to very large datasets, i.e up to the millions. The proposed interpolation approach consists of two-phases: (1) the full MDS running with sampled data (n); and (2) the interpolation of out-of-sample data (N₋n) based on mapped position of sampled data. The proposed interpolation algorithm reduces the

computational complexity of the MDS algorithm fromO₍_N2₎_to_O₍_n_×₍_N₋_n₎₎_{. The iterative majorization}

method is used as an optimization method for finding mapping positions of the interpolated point. We have also proposed in this chapter the usage of parallelized interpolation algorithms for MDS which can utilize multicore/multiprocess technologies. In particular, we utilized a simple but highly efficient mechanism for computing the symmetric all-pairwise distances to provide improved performance.

Before starting a comparative experimental analysis between MI-MDS and the full MDS algorithm, we explored the optimal number of k-NN. 2-NN is the best case for the Pubchem data which we used in this

4. Interpolation Approach for Multidimensional Scaling 79

chapter. We have shown that our interpolation approach gives results of good quality with high parallel performance. In a quality comparison, the experimental results shows that the interpolation approach output is comparable to the normal MDS output, which takes much more running time than interpolation. The proposed interpolation algorithm is easy to parallelize since each interpolated points is independent of other out-of-sample points, so many points can be interpolated concurrently without communication. The parallel performance is analyzed in Section 4.4.3, and it shows very high efficiency as we expected.

Consequently, the interpolation approach enables us to configure 4 millions Pubchem data points in this chapter with an acceptable normalized STRESS value, compared to the normalized STRESS value of 100k sampled data in less than ONE hour, and the size can be extended further with a moderate running time. Note that if we use parallel MDS only, we cannot even run with only 200,000 points on the Cluster-II system in Table 4.1 due to the out of memory exception, and if it were possible to run parallel MDS with 4 million data points on the Cluster-II system, it would take around 15,000 times longer than the interpolation approach as mentioned in Section 4.4.2. Future research includes application of these ideas to different areas including metagenomics and other DNA sequence visualization.

5 Deterministic Annealing SMACOF

5.1 Overview

Multidimensional scaling (MDS) [13, 45] is a well-known dimension reduction algorithm which is a non- linear optimization problem constructing a lower dimensional configuration of high dimensional data with respect to the given pairwise proximity information based on an objective function, namely STRESS [44] or SSTRESS [67]. Below equations are the definition of STRESS (5.1) and SSTRESS (5.2):

σ(X) =

∑

i<j≤N wi j(di j(X)−δi j)2 (5.1) σ2₍_X₎ ₌

∑

i<j_≤N wi j[(di j(X))2−(δi j)2]2 (5.2)

where 1_≤i<j_≤N, wi jis a weight value (wi j≥0), di j(X)is a Euclidean distance between mapping results of

xiand xj, andδi jis the given original pairwise dissimilarity value between xiand xj. SSTRESS is adopted by

ALSCAL algorithm (Alternating Least Squares Scaling) [67], and uses squared Euclidean distances results in simple computations. A more natural choice could be STRESS which is used by SMACOF [20] and Sammon’s mapping [64].

5. Deterministic Annealing SMACOF 81

Due to the non-linear optimization properties of the MDS problem, many heuristic optimization methods have been applied to solve the MDS problem. An optimization method called iterative majorization is used to solve the MDS problem by a well-known MDS algorithm called SMACOF [20,21]. The iterative majorization method is a type of Expectation-Maximization (EM) approach [22], and it is well understood that EM method suffers from local minima problem although EM method is widely applied to many optimization problems.

From the motivation of local-optima avoidance, I have proposed an MDS algoirthm which applies a robust optimization method called Deterministic Annealing (DA) [62, 63] to the MDS problem, in this chapter. A key feature of the DA algorithm is to endeavour to find global optimum in a deterministic way [63] without trapping local optima, instead of using a stochastic random approach, which results in a long running time, as in Simulated Annealing (SA) [40]. DA uses mean field approximation, which is calculated in a deterministic way, by using the statistical physics integrals.

In Section 5.2, I briefly discuss various optimization methods which have been applied to the MDS problem to avoid local optima. Then, the proposed DA MDS algorithm based on iterative majorization method is explained in Section 5.3. Section 5.4 and Section 5.5 illustrate performance of the proposed DA MDS algorithm compared to other MDS algorithms with variable data sets followed by the conclusion in Section 5.6.

5.2 Related Work

In document SCALABLE HIGH PERFORMANCE MULTIDIMENSIONAL SCALING (Page 96-100)