Topological data structure identification: Compactly supported kernel approach

II. Materials and Methods

6. A New Approach to Enhance Data Structure Modelling for Unsupervised Learn-

6.2. Topological data structure identification: Compactly supported kernel approach

6.2. Topological data structure identification: Compactly

supported kernel approach

Here is introduced a different use for the computed neighbors to tune matrix Ω in an appropriate manner, both in band-width kernel parameter as in its sparsity, giving a better use of the estimated local and global information.

The simplest way is to use a threshold over Gram matrix, where the smallest entries (less than a threshold ε ∈ R+_{) are set to zero, which can be seen as a sparsification method,}

by removing the entries that does not belong to the neighbor of a sample. These kind of sparsification causes the Gram matrix to lose the definite positive property, therefore this way to sparsify the matrix is said to be naive and wrong. Hence, it is needed a way to sparse the Gram matrix while keeping it positive definite. Therefore, to achieve this purpose, in [69] are proposed two quantitative measurements, namely a sparsity measurement and a similarity measurement, the aim is to get the trade-off between them. In that work, the importance of a sparse matrix is related to the storage capability, whereas in this work is more important the appropriate use of each sample neighborhoods.

In [69] is stated that for any RBF kernel κ, a class of compactly supported kernels can be constructed by multiplying κ(_{·, ·) with some compactly supported radial basis function. In} this work, we introduced a different use for the computed neighbors in order to tune matrix Ω in an appropriate manner, both in band-width kernel parameter as in its sparsity, giving a better use of the estimated local and global information. Compactly supported kernel can be constructed by multiplying κ(_{·, ·) with any compactly supported radial basis function.} Let ΩB ∈ RN ×N be the modified similarity matrix which preserves positive definiteness by

using the Hadamard product of the following matrices:

ΩB= ΦB◦ Ω (6-2)

where B > 0 is called the thresholding or truncation parameter, which controls the support size of the kernel ΩB, or the degree of the sparsity in ΩB. Matrix ΦB ∈ RN ×N performs

a sparsifying operation by thresholding its entries satisfying kxi− xjk ≥ B to zeros and is

expressed as: [ΦB]ij = 1₋kxi− xjk B + ν (6-3) with ν _≥ d+1

2 , (·)+ representing the positive part of its argument and the power term ν

decides the degree of smoothness or differentiability of φB.

We need to compute the parameter B properly to achieve a gain in neighbors representation. Such parameter is tuned with seeking a trade-off between a similarity and a sparsity measurement. Empirical alignment of two Gram matrices is used to assess the similarity between Ω and ΩB, it can be viewed as the Pearson coefficient of two bidimensional vectors

52 6 A New Approach to Enhance Data Structure Modelling for Unsupervised Learning

A(B) = hΩ, ΩBiF phΩ, ΩiFhΩB, ΩBiF

(6-4)

where _hΩ1, Ω2i_F = Pn_i,j=1Ω1(xi, xj)Ω2(xi, xj) is the Frobenius inner product between two

matrices. The sparsity of ΩB relative to Ω is defined as,

S(B) = N0

n2 (6-5)

where N0 stands for the number of zero entries in ΩB.

It is worth to note that both empirical alignment A(B) and sparsity S(B) values are in the range [0, 1]. A large empirical alignment A(B) indicates little information loss due to the sparsifying process and a large sparsity S(B) implies a high degree of sparsity. As B increases, A(B) increases and S(B) decreases, hence it is impossible to maximize both scores simultaneously. In [69] three procedures are proposed to tune B adaptively aiming to build a sparse matrix ΩB from the dense matrix Ω while maintaining the positive definiteness

property of the matrix.

We propose a single mixture from the procedures presented in [69], aiming to maximize the linear combination between alignment and sparsity, and a trade-off parameter µ_{∈ [0, 1]).} Such parameter is used given that it is more important to retain relevant information from neighborhoods than make Ω sparse due to storage limitation. Objective function to maximize is:

B∗ = argmax

A2(B) + µS2(B)

where the terms A2_{(B) and S}2_{(B) can be seen as the functions F (}_{·) and G(·) of the Equa-}

tion (5-1). Finally the optimal value of B is used to modify the original kernel matrix Ω with Equation (6-2). Furthermore, to optimize B we use the well-known particle swarm optimization – PSO technique [71], which is a population based technique that by means of particles who fly through the search space of the function to be optimized with given velocities. Each particle stores the best position of itself and the best position of its neighbor in terms of the function to optimize and according those positions the technique iterate to adjust velocities and hence the movement of the particles, leading to an optimal or near-optimal solution [72]. Finally by finding B we are able to compute ΩB from Equation (6-2) and apply in further

learning stages, we named the proposed apprach Compactly Supported Kernel - CSK.

6.3. Experimental set-up

To visualize how the tuning of the kernel free parameter σ affects the graph G building over X some experiments are carried out over synthetic data, due to its capability of visual inspection.

6.3 Experimental set-up 53

As stated previously we use the Gaussian kernel in Equation (4-2) to identify the relationships among samples in a similarity matrix Ω, in order to build the graph G. Given that the Gaussian kernel function depends on the free parameter σ, a correct tuning enhance the performance of further tasks. In the Figure6-2 are depicted some examples of the connections that are created when using different values for σ over the synthetic data set introduced in Figure4.6(g). In Figure6.2(a)is depicted how a low σ value originates a highly disconnected graph, instead using a moderately high value of σ increases the connections of the graph. For the local scaling approach (see Equation (4-24)) the resulting graph is denser, meaning that the similarity among points is greater, which could not always be the most ideal scenario since its high connectivity could lead to poorly clustered data.

(a) Happy face σ = 0.02 (b) Happy face σ = 0.1 (c) Happy face local scaling

Figure 6-2.: Graphs of synthetic data sets.

Also real-world data sets are used to highlight the importance of a correct construction of the graph. More precisely at image segmentation tasks, using an image selected from the Berkeley Segmentation data set and characterized as explained in Section § 4.5. Moreover, the graph is built ith a fixed value of σ and with the local scaling approach and finally the image is segmented using Algorithm2. Image segmentation results are depicted in Figure6- 3.

(a) Original image (b) Segmentation result for σ_{= 0.4}

Figure 6-3.: Original image and segmentation results.

Results shows how the local scaling is not always the best strategy to construct the similarity matrix, due to it is not a generalizing method which depends on free parameters

54 6 A New Approach to Enhance Data Structure Modelling for Unsupervised Learning

often estimated empirically, in Figure 6.3(c) the plane is not correctly discriminated with the background, instead for the graph constructed by a σ fixed value in Figure 6.3(b), the resulting segmentation shows how the plane and the background are separated segments of the image, which is the logical result.

Furthermore, to test the capability of the proposed approaches AGB and CSK for finding a suitable graph representation in spectral clustering based methods, some synthetic and real-world data set are used. AGB and CSK are employed to compute the affinity matrix Ω building a graph G over the input data. Then, a spectral clustering method is employed to estimate the label vector y. Firstly, the synthetic data sets Four Gaussians and Happy face presented in section Section _§ 4.5 and a new data set named Elongated groups are studied. All data sets encode complex structures and are commonly used to test the capability of clustering algorithms. For concrete testing, the number of groups C is manually fixed as 4, 4, and 3, respectively, as detailed in [27]. Synthetic data clustering results are depicted in Figure6-4, which can be visually evaluated.

(a) Four Gaussians (b) Elongated groups (c) Happy face

Figure 6-4.: AGB clustering results over synthetic data.

Regarding to real-world data sets experiments, images presented in Section _§ 4.5 are employed. Particularly, randomly selected images identified as 100075-bears, 113044-horses,

12003-starfish, 388016-woman, 56028-wall, and 124084-flowers are studied. Again, AGB and

CSK are employed to represent properly relationships among samples, taking into account the RGB color space and the 2D position of each pixel as an input sample. However, due to limitations in memory usage, images are resized at 15%. Furthermore, the closed approach introduced in Section _§ 4.5, and termed 7-Nearest Neighbor Spectral Clustering - 7-NNSC, is tested. 7-NNSC is based on a local scaling analysis to build G, (for details see [27, 64]). NPR index explained in Section _§ 4.5 is computed to quantify the image segmentation performance. Figure6-5 shows images segmentation results.

Finally, some classification experiments are developed to verify the advantages of our AGB approach for highlighting the main data structures. Thus, the UCR time-series data set is used [73]. This repository contains contributed labeled time-series data sets from different fields, such as: shape identification on images, time-series extracted from physical process,

6.3 Experimental set-up 55

Original Hand-labeled 7-NNSC AGB

(a) Bears (b) (c) N P R = 0.63 (d) N P R = 0.69 (e) Horses (f) (g) N P R = 0.63 (h) N P R = 0.71 (i) Starfish (j) (k) N P R = 0.78 (l) N P R = 0.75 (m) Woman (n) (o) N P R = 0.76 (p) N P R = 0.58 (q) Wall (r) (s) N P R = 0.65 (t) N P R = 0.68 (u) Flowers (v) (w) N P R = 0.72 (x) N P R = 0.77

56 6 A New Approach to Enhance Data Structure Modelling for Unsupervised Learning

(a) N P R = 0.71 (b) N P R = 0.71 (c) N P R = 0.72

(d) N P R = 0.50 (e) N P R = 0.67 (f) N P R = 0.73

Figure 6-6.: Images segmentation results for CSK.

Table 6-1.: Time-series classification results - testing set accuracy percentage

Dataset Benchmark 7-NNSC AGB Dataset Benchmark 7-NNSC AGB synthetic control 88.00 99.33 98.33 OSULeaf 51.65 47.52 54.55

Gun Point 91.33 66.67 86.00 50words 63.08 51.87 63.52 ECG200 88.00 79.00 88.00 Trace 76.00 53.00 77.00 FaceAll 71.36 35.50 67.28 wafer 99.55 32.17 99.43 SwedishLeaf 78.88 71.04 81.44 Lighting2 75.41 67.21 75.41 CBF 85.22 57.00 91.67 Lighting7 57.53 42.47 63.01 Coffee 75.00 50.00 71.43 Adiac 61.13 37.85 56.27 OliveOil 86.67 73.33 80.00 FISH 78.29 58.86 72.00 Two Patterns 90.67 48.25 90.47 Beef 53.33 36.67 46.67 yoga 83.03 52.37 79.47 FaceFour 78.41 37.50 80.68

or even synthetic data. All data sets contain different number of classes, observations, and lengths. Moreover, it is assumed to be used on both classification and clustering tasks. As recommended in UCR, we test the 1-Nearest Neighbor - 1-NN classifier using the Euclidean distance as benchmark. UCR databases are divided into training and testing sets. In this case, AGB is employed to compute the affinity matrix Ω over the training set, which is employed as features in the 1-NN classifier. So, given a new sample xnew (testing set), the

similarity among xnew and the training set is calculated using the AGB kernel band-width.

Then, the 1-NN estimated testing set labels are used to compute the system performance. Also, 7-NNSC approach is used to compare the performance of the proposed algorithm. The attained time-series classification results are presented in Table 6-1.

6.4 Results and Discussion 57

6.4. Results and Discussion

Taking into account the synthetic clustering results, from Figure 6-4 it can be seen how the proposed AGB methodology is able to find a suitable kernel function, i.e., Gaussian kernel band-width, which allows to build the graph G over the input data, identifying the complex synthetic data set structures. Moreover, it can be seen how the proposed CSK methodology is able to sparsify a given kernel while keeping its information below a threshold. Note that, even when some data set are composed by disjoints data structures, with different properties, our algorithms allows to find a complete graph that encodes the main relationships among samples, as can be visually corroborated in Figures 6.4(a) to 6.4(c). Namely, Figure 6.4(b) and Figure 6.4(c) describes how AGB performance is in agreement with a benchmark approach presented in [27].

Regarding to the images segmentation results described in Figure 6-5, overall, our algorithms obtains a better performance in comparison with the benchmark method 7-NNSC. Particularly, for Bears, Horses, Wall, and Flowers AGB and CSK are able to find the local and global relationships among samples, highlighting the main details of each cluster. Due to each pixel is modeled with the largest linear neighborhood around it, the whole image structure is properly revealed by the estimated graph representation. However, for Starfish and Woman AGB and CSK obtains a lower performance than 7-NNSC, which can be explained by the fact that such images contain many details, that could be hand-labeled subjectively. For example, the Woman image AGB segmentation is smoother than the 7- NNSC, which is biased by abrupt changes. Even though the NPR values are higher for the Woman and Starfish 7-NNSC segmentations, the obtained AGB and CSK results are visually acceptable. In addition, because of 7-NNSC employs a fixed neighborhood size for all the samples, it is sensitive to outliers, thus is, noisy data structures. Moreover, 7-NNSC can no longer be framed as a suitable kernel based representation from a theoretical view. In these experiments, we also demonstrated that such drawback is also revealed in practice.

Finally, from the time-series classification results (Table6-1), even though AGB based approach does not overcome the baseline results for all the provided data sets, it achieves com- petitive results. For example, for synthetic control, ECG200, SwedishLeaf, CBG, OSULeaf, 50words, Trace, Lighting2, Lighting7, and FaceFour data set our approach attained the best performance. Again, the AGB local and global analysis encoded into the neighborhood size estimation allows to deal with the complex relationships among time-series. Now, 7-NNSC based classification is not able to unfold the complex data structures, because such technique assumes an unique neighborhood size. It is important to note that some of the time-series data sets are composed for many classes, which can not be suitable modeled by one kernel function, being necessary to extend the data structure analysis considering different affinity matrices.

58 6 A New Approach to Enhance Data Structure Modelling for Unsupervised Learning

6.5. Conclusions

A methodology to build automatically a graph representation over the input data for spectral clustering based approaches was proposed. For such purpose, a data structure analysis is performed using the Euclidean and geodesic distances to identify the linear and density properties of each sample neighborhood. Thus, the local and global properties of the data are revealed to estimate a suitable kernel function, which is used to construct a data graph representation. Our approach, AGB, was tested over synthetic and real-world data. Attained results showed how our approach achieved good results for clustering, image segmentation, and even classification tasks. A benchmark approach 7-NNSC, which aims to make a local scaling analysis to build the graph, was also tested. However, 7-NNSC is not able to unfold complex data structures in many cases. Such issues were demonstrated from both theoretic and experiments.

7. Conclusions

In this work, some kernel representation frameworks have been proposed, which reveals relevant data structures of data thus supporting unsupervised learning tasks. As a first approach a clustering methodology based on an well known spectral clustering criteria is proposed. It is proposed a technique which takes advantage of prior knowledge about the input data. Moreover, proposed approach performs a pre-clustering and a heuristic search stage in order to minimize the criterion optimization in a properly manner. Proposed approach was tested over real and synthetic data and compared with state of the art techniques. Attained results shown were measured in terms of supervised measures, namely Adjusted Rand Index - ARI for the experiments carried out over synthetic data case and Normalized Probabilistic Rand -NPR for the image segmentation task evaluation. Results shown how when data are highly overlapped as the synthetic data sets used, our method tends to fail. Moreover, experiments remarks the importance of estimate correct relationships among data points. For the image segmentation task attained results outperformed comparison methods in term of NPR measure.

Our second proposed approach is presented as an principle of relevance scheme that con- siders the trade-off of complementary mapping functions which attempts to uncover hidden structure of data. It is shown how some classical learning techniques can be written in terms of such scheme, enhancing user understanding about unsupervised learning techniques. Also, is shown how an state of art approach which employs information theory measures combined with non-parametric estimators for probability density functions can be fitted as the presented principle of relevance. Synthetic and real world data were used to show how each of the particular methods behaves in clustering and image segmentation tasks. Results showed how the free parameters affects achieved results.

In this sense, our third proposed approach was an automatic parameter selection strategy by using complementary functions. As a first solution, we proposed to measure local and global properties on data by means of two complementary functions namely, Euclidean distance and geodesic distance. Such functions gave as the answer a set of patches where data can be thought as linear. A second solution was presented where the goal is to sparsify a dense similarity matrix by means of two complementary functions, namely, empirical alignment and sparsification. In this sense, the result is a matrix with the most relevant pair-wise similarities. Experiments were carried out over synthetic and real world showing how the correct tuning of parameters should lead to better performances of unsupervised learning techniques. Proposed frameworks allows to reveal relevant data structures and

60 7 Conclusions

7.1 Academic Discussion 61

7.1. Academic Discussion

National and International Conferences:

– C. Castro, A.E. Castro, D.H. Peluffo, G. Castellanos. On the groups number estimation for unsupervised clustering. In 16th Simposio de Tratamiento de señales, Imágenes y Visión Artificial - STSIVA, 2011.

– A.E. Castro-Ospina, L. Duque-Mu˜noz, and G. Castellanos-Dom´ınguez. P300 analysis based on time frequency decomposition methods for adhd discrimination in child population. In XVIII Symposium of Image, Signal Processing, and Artificial Vision -

STSIVA, 2012.

– D. Peluffo-Ordóñez, A.E. Castro-Ospina, D. Chávez-Chamorro, C. D. Acosta-Medina, and G. Castellanos-Dom´ınguez. Normalized Cuts Clustering with Prior Knowledge and a Pre-clustering Stage. In European Symposium on Artificial Neural Networks,

Computational Intelligence and Machine Learning - ESANN, 2013.

– A.E. Castro-Ospina, C. Castro-Hoyos, D. Peluffo-Ord´o˜nez and G. Castellanos-Dom´ınguez. Novel heuristic search for ventricular arrhythmia detection using normalized cut clus- tering. In 35th Annual International Conference of the IEEE Engineering in Medicine

and Biology Society - EMBC, 2013.

– A.E. Castro-Ospina, A. M. ´Alvarez-Meza and G. Castellanos-Dom´ınguez. Automatic Graph Building Approach for Spectral Clustering. In 18th Iberoamerican Congress on

Pattern Recognition - CIARP, 2013.

– D. Cárdenas-Peña, M. Orbes-Arteaga, A.E. Castro-Ospina, A. M. Álvarez-Meza and G. Castellanos-Dom´ınguez. A Kernel-based Representation to Support 3D MRI Unsu- pervised Clustering. In 22nd International Conference on Pattern Recognition - ICPR

2014, (submitted ).

– A. Álvarez-Meza, D. Cárdenas-Peña, A.E. Castro-Ospina, and G. Castellanos-Dominguez. Tensor-Product Kernel-based Representation encoding Joint MRI View Similarity. In 36th Annual International Conference of the IEEE Engineering in Medicine and Biol- ogy Society - EMBC 2014. submitted.

– A.E. Castro-Ospina, A.M. ´Alvarez-Meza, and G. Castellanos-Dominguez. Compactly Supported Graph Building for Spectral Clustering. In International Conference and Workshop on Bioinspired Intelligence 2014 - IWOBI. submitted.

8. Future work

Based on the unsupervised learning frameworks where proposed in this thesis and the attained results, there are some open issues.

• Test AGB and CSK providing user prior knowledge as in NCChs. Besides, it can be improved by making it able to detect and/or remove outliers of input data.

• Test different mapping functions using PR-based framework. • Extend AGB into a multi-kernel convex combination.

• To investigate the properties of the functions used in ITL-based principle of relevance and in cases of unsupervised learning beyond clustering.

Bibliography

[1] Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learn-

ing, volume 1. springer New York, 2006.

[2] JL Rodr´ıguez-Sotelo, E Delgado-Trejos, D Peluffo-Ord´o˜nez, D Cuesta-Frau, and G Castellanos-Dom´ınguez. Weighted-pca for unsupervised classification of cardiac ar- rhythmias. In Engineering in Medicine and Biology Society (EMBC), 2010 Annual

International Conference of the IEEE, pages 1906–1909. IEEE, 2010.

[3] S Molina-Giraldo, Andrés Marino Álvarez-Meza, Diego Hernán Peluffo-Ordóñez, and

In document Unsupervised Learning Frameworks Using Kernel Based Representations (Page 67-84)