Conclusions - Conclusion and Future Works

Chapter 6. Conclusion and Future Works

6.2 Conclusions

Next generation sequencing (NGS) technique has high throughput of sequence generation. Massive amount of sequences can be generated within a short time from the gene samples with its help. Therefore, challenges have emerged about the method to do analysis on these sequences. To visualize the clustering result with large scale of sequences, it is essential to do customized optimizations for bioinformatics data on the multidimensional scaling (MDS) techniques. DACIDR is a pipeline uses pairwise clustering and MDS to do large scale sequence clustering and visualization. The optimization techniques proposed in this dissertation greatly improves the performances of DACIDR and extends its capabilities of processing millions of sequences along with phylogenetic analysis.

6.2.1 WDA-SMACOF

The WDA-SMACOF proposed in this dissertation has proposed a general solution for MDS problem with weighting data, as well as with the situation that part of data points needs to be fixed. The problem emerges because of the sequence alignment with bioinformatics data is not as

reliable as Euclidean distance calculation. And for special cases, various weights needs to be associated with each pair of sequence dissimilarities generated in sequence alignment. Traditional method on this type of problem requires cubic time complexity, but WDA-SMACOF proposed here avoided the high time cost of weighting function by using Conjugated Gradient (CG) instead of matrix inversion in traditional methods. In section 3, the detailed description of WDA- SMACOF is given, and how to apply CG on it is given in detail. The parallel version of WDA- SMACOF is written using iterative MapReduce framework, so the flowchart of it givens the detail about the work on main driver as well as the work for each mapper and reducer. Later in this section, the special version of WDA-SMACOF, called Fixed-WDA-SMACOF is given. The steps of how the final equation is listed as well as the algorithm description. Note that the parallelization of Fixed-WDA-SMACOF also uses the iterative MapReduce framework, and its parallelization involves a nested loop of 6 MapReduce jobs. The space and communication has been minimized on both of these parallel applications in order to run very large scale data visualization. In the experiments part, WDA-SMACOF gives a higher accuracy compared to all other algorithms with four different dataset, where the size of data varies from 4k to 400k. Not only the single thread version is tested, but also the parallel version has been tested on over 4000 cores. The parallel efficiency has been improved by adapting an iterative MapReduce framework called Harp over Twister. Furthermore, Fixed-WDA-SMACOF has been compared against an existing MDS interpolation method called MI-MDS. The result shows that it gives a more precise answer with the specific need of fixing part of the points. In short, WDA-SMACOF proposed in here gives a robust and scalable solution for dimension reduction on large scale of sequences by utilizing the power of iterative MapReduce and computer clusters.

6.2.2 W-MI-MDS and HE-MI

W-MI-MDS and HE-MI are two optimizations to the existing MDS interpolation algorithm proposed in section 4. W-MI-MDS is an algorithm that added weighting support for the out-of- sample problem, so that different weights can be associate with the distances from in-sample points to the out-of-sample points. HE-MI is a hierarchical method that proposed to reduce the time cost of MDS interpolation. This is needed because of the excessive time cost of sequence

alignment between the in-sample dataset and out-of-sample dataset. The detailed description of two tree structures, referred to as SSP-Tree and CN-Tree, used by this method is given in this section. The parallel versions of these two algorithms were similar. The out-of-sample problem is a task-independent problem, which means the applications solving it can be pleasingly paralleled. The flowchart of parallelization of these two algorithm using MapReduce framework is given. As iterative MapReduce framework also supports normal MapReduce application, so the parallel version of W-MI-MDS and HE-MI are implemented using Twister. In the experiments part of this section, both applications were tested and compared to other existing methods. W-MI-MDS gives a higher accuracy as well as a lower time cost with all dataset tested compared to the non- weighted MI-MDS. DA technique has also been analyzed in these experiments but did not show a significant different in terms of accuracy. HE-MI by using both SSP-Tree and CN-Tree has shown a much lower time cost compare to the MI-MDS data without missing distances. Overall, these optimization techniques has reduced the time cost for general MDS interpolation problem without missing distances or various weights requirement, and W-MI-MDS has provided a better solution than the MI-MDS by adding weighting function support to the data that has missing distances. 6.2.3 Cuboid Cladogram and Spherical Phylogram

Traditional phylogenetic tree analysis were solely based on the tree diagram generated using multiple sequence alignment and maximum likelihood method or other equivalent methods. Other the other hand, sequence clustering is usually done by using pairwise sequence alignment and hierarchical clustering or other equivalent methods. These two separate analysis on a same dataset could yield separate result with some variation. Thus sequence clustering could be complementary to phylogenetic analysis. Cuboid Cladogram and Spherical Phylogram are proposed in Section 5 to display sequence clustering result along with a phylogenetic tree, which enables the direct analysis on the same dataset with both tree diagram as well as the visualized clusters. In this section, Cuboid Cladogram is described in detail with the usage of principle component analysis (PCA). Since the sequence clusters are visualized in 3D by using MDS technique, a plane which all points are projected to needs to be selected. In order to keep the maximum variance of coordinates after the projection, PCA is used to select the plane. Then the points from the

clustering result are projected into the plane to be used as leaf nodes in the phylogenetic tree, which is determined by using independent phylogenetic analysis technique. Finally, the 3D constructed can be used for the final analysis and called Cuboid Cladogram. Spherical Phylogram is introduced later in Section 5. It uses interpolation to generate the tree. The points in the clusters are used directly for the phylogenetic tree instead of doing projection first, so the internal nodes from the phylogenetic tree needs to be interpolated into the 3D space. An algorithm called Interpolative Joining is proposed here. This algorithm can determine the coordinates for the internal nodes from given phylogenetic tree in the target dimension space (3D space), so the tree is generated as a spherical plot and the branch lengths are from the pairwise distances from the leaf nodes, thus the name Spherical Phylogram. The experiments in Section 5 illustrate the affection on the choice of sequence alignment to sequence clustering and phylogenetic tree construction. It shows a very high correlation between PWA and MSA in terms of generating the dimension reduction plot. Then the comparison among the choices of dimension reduction method is shown. It shows that WDA-SMACOF proposed in Section 3 has a best result of constructing spherical phylogram because it can avoid local optima during the optimization process. Generally speaking, Cuboid Cladogram and Spherical Phylogram can determine the phylogenetic tree in a 3D space with MDS clustering result, and enables the analysis on both results simultaneously.

In document SCALABLE AND ROBUST CLUSTERING AND VISUALIZATION FOR LARGE SCALE BIOINFORMATICS DATA (Page 142-145)