unsupervised learning
Guénaël Cabanes and Younès Bennani
LIPN-CNRS, UMR 7030, Université de Paris 13, 99 Avenue J-B. Clément, 93430 Villetaneuse, France
Abstract. In data mining, measuring similarities between different subsets of data is an important problem that has received little attention so far. In this paper, we propose a novel method based on unsupervised learning. Different subsets of a dataset are characterized by a model that implicitly corresponds to a set of prototypes, each one capturing a different modality of the data. Structural differences between two subsets are then reflected in the corresponding models. Differences between models are detected using a similarity measure based on data density. Experiments on synthetic and real datasets illustrate the effectiveness, efficiency, and insights provided by our approach.
1 Introduction
In recent years, the size of datasets has grown exponentially. Studies indicate that the amount of data doubles every year. However, the ability to analyze these data remains inadequate. Mining these data to measure similarities between different datasets is therefore an important problem, which has received little attention so far. A major application is the analysis of time-evolving datasets: a model of the data structure is computed over different periods of time, and the models are compared to detect changes when they occur. There are many other possible applications, such as the comparison of large datasets, the merging of clusterings, stability measures and so on.
As the study of data streams and large databases is difficult because of the computing costs and large storage volumes involved, two issues appear to play a key role in such an analysis: (i) a good condensed description of the data properties [1,2] and (ii) a measure capable of detecting changes in the data structure [3,4]. In this paper we propose a new algorithm that performs both tasks. It first constructs an abstract representation of the datasets to compare and then evaluates the dissimilarity between them based on this representation. The abstract representation is based on the learning of a variant of the Self-Organizing Map (SOM) [5], which is enriched with structural information extracted from the data. We then propose a method to estimate, from the abstract representation, the underlying data density function. The dissimilarity is a measure of the divergence between two estimated densities. A great advantage of this method is that each enriched SOM is at the same time a very informative and a highly condensed description of the data structure that can be stored easily for future use. Also, as the algorithm is efficient in terms of both computational complexity and memory requirements, it can be used for comparing large datasets or for detecting structural changes in data streams.
This work was supported in part by the CADI project (No. ANR-07 TLOG 003), financed by the ANR (Agence Nationale de la Recherche).
The remainder of this paper is organized as follows. Section 2 presents the new algorithm. Section 3 describes the validation protocol and some results. Conclusion and future work perspectives are given in Section 4.
2 A new two-level algorithm to compare data structures
The basic assumption in this work is that data are described as vectors of numeric attributes and that the datasets to compare have the same type. First, each dataset is modeled using an enriched Self-Organizing Map (SOM) model (adapted from [6]), constructing an abstract representation which is supposed to capture the essential data structure. Then, each dataset density function is estimated from the abstract representation. Finally, different datasets are compared using a dissimilarity measure based upon the density functions.
The idea is to combine the dimension reduction and fast learning capabilities of the SOM in the first level to construct a new reduced vector space, and then to apply other analyses in this new space. Such approaches are called two-level methods. Two-level methods are known to greatly reduce the computational time, the effects of noise and the "curse of dimensionality" [6]. Furthermore, they allow some visual interpretation of the result using the two-dimensional map generated by the SOM.
2.1 Abstract algorithm schema
The algorithm proceeds in three steps:
1. The first step is the learning of the enriched SOM. During the learning, each SOM prototype is extended with novel information extracted from the data. This structural information will be used in the second step to infer the density function. More specifically, the attributes added to each prototype are:
– Density modes. This is a measure of the data density surrounding the prototype (local density). The local density is a measure of the amount of data present in an area of the input space. We use a Gaussian kernel estimator [7] for this task.
– Local variability. This is a measure of the variability of the data represented by the prototype. It can be defined as the average distance between the prototype and the data it represents.
– The neighborhood. This is a prototype's neighborhood measure. The neighborhood value of two prototypes is the number of data that are well represented by both of them.
2. The second step is the construction, from each enriched SOM, of a density function which will be used to estimate the density of the input space. This function is constructed by induction from the information associated with the prototypes of the SOM, and is represented as a mixture model of spherical normal functions.
3. The last step compares two different datasets, using a dissimilarity measure able to compare the two density functions constructed in the previous steps.
2.2 Prototypes enrichment
In this step, some global information is extracted from the data and stored in the prototypes during the learning of the SOM. The Kohonen SOM can be classified as a competitive unsupervised learning neural network [5]. A SOM consists of a two-dimensional map of M neurons (units), which are connected to n inputs through n weighted connections $w_j = (w_{1j}, \ldots, w_{nj})$ (also called prototypes) and to their neighbors through topological links. The training set is used to organize these maps under topological constraints of the input space. Thus, an optimal spatial organization is determined by the SOM from the input data.
In our algorithm, the SOM’s prototypes will be “enriched” by adding new numerical values extracted from the dataset. The enrichment algorithm proceeds in three phases:
Input:
– The data $X = \{x_k\}_{k=1}^{N}$.
Output:
– The density $D_i$ and the local variability $s_i$ associated with each prototype $w_i$.
– The neighborhood values $v_{i,j}$ associated with each pair of prototypes $w_i$ and $w_j$.
Algorithm:
1. Initialization:
– Initialize the SOM parameters.
– $\forall i, j$, initialize to zero the local densities ($D_i$), the neighborhood values ($v_{i,j}$), the local variability ($s_i$) and the number of data represented by $w_i$ ($N_i$).
2. Choose randomly a data point $x_k \in X$:
– Compute $d(w_i, x_k)$, the Euclidean distance between the data point $x_k$ and each prototype $w_i$.
– Find the two closest prototypes (BMUs: Best Match Units) $w_{u^*}$ and $w_{u^{**}}$: $u^* = \arg\min_i d(w_i, x_k)$ and $u^{**} = \arg\min_{i \neq u^*} d(w_i, x_k)$.
3. Update structural values:
– Number of data: $N_{u^*} = N_{u^*} + 1$.
– Variability: $s_{u^*} = s_{u^*} + d(w_{u^*}, x_k)$.
– Density: $\forall i$, $D_i = D_i + \frac{1}{\sqrt{2\pi}\, h}\, e^{-\frac{d(w_i, x_k)^2}{2h^2}}$.
– Neighborhood: $v_{u^*,u^{**}} = v_{u^*,u^{**}} + 1$.
4. Update the SOM prototypes $w_i$ as defined in [5].
5. Repeat steps 2 to 4 T times.
6. Final structural values: $\forall i$, $s_i = s_i / N_i$ and $D_i = D_i / N$.
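The accumulation of the structural values (steps 2, 3 and 6) can be sketched in Python with NumPy. This is only an illustrative sketch under simplifying assumptions that are not part of the algorithm above: the prototypes `W` are held fixed (the SOM update of step 4 is omitted), each data point is visited exactly once (T = N), and the neighborhood matrix is stored symmetrically. The function name `enrich_prototypes` and its arguments are hypothetical.

```python
import numpy as np

def enrich_prototypes(X, W, h):
    """Accumulate structural values for FIXED prototypes W (assumption:
    the SOM update of step 4 is omitted) over one pass of the data X.

    X: (N, n) data, W: (M, n) prototypes, h: kernel bandwidth.
    Returns (D, s, v, counts) = (densities, variabilities,
    neighborhood values, number of data per prototype).
    """
    N, M = X.shape[0], W.shape[0]
    D = np.zeros(M)        # local densities D_i
    s = np.zeros(M)        # local variabilities s_i
    v = np.zeros((M, M))   # neighborhood values v_{i,j}
    counts = np.zeros(M)   # N_i, data represented by each prototype

    for x in X:
        d = np.linalg.norm(W - x, axis=1)     # distances d(w_i, x_k)
        first, second = np.argsort(d)[:2]     # the two BMUs u* and u**
        counts[first] += 1
        s[first] += d[first]
        D += np.exp(-d**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)
        v[first, second] += 1
        v[second, first] += 1                 # assumption: symmetric storage

    # Step 6: normalize; prototypes that represent no data keep s_i = 0.
    represented = counts > 0
    s[represented] /= counts[represented]
    D /= N
    return D, s, v, counts
```

With fixed prototypes the visiting order of the data does not affect the result, which is why the random selection of step 2 is replaced here by a plain loop.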
In this study we used the default parameters of the SOM Toolbox [8] for the learning of the SOM, and we use T = max(N, 50 × M) as in [8]. The number M of prototypes must be neither too small (the SOM does not fit the data well) nor too large (learning becomes time consuming). Choosing M close to $\sqrt{N}$ seems to be a good trade-off [8]. The last parameter to choose is the bandwidth h. The choice of h is important for good results, but its optimal value is difficult and time consuming to compute (see [9]). A heuristic that seems relevant and gives good results is to define h as the average distance between a prototype and its closest neighbor [6].
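These parameter choices can be written down in a few lines. The sketch below is illustrative only: the function names are hypothetical, M is simply rounded from $\sqrt{N}$, and "closest neighbor" is assumed to mean the nearest other prototype in the input space (it could also be read as a neighbor on the map).

```python
import numpy as np

def suggested_parameters(N):
    """Return (M, T): M close to sqrt(N), T = max(N, 50 * M)."""
    M = max(2, int(round(np.sqrt(N))))   # at least two prototypes
    T = max(N, 50 * M)
    return M, T

def bandwidth(W):
    """Average distance between each prototype and its nearest other
    prototype in the input space (assumed reading of the heuristic)."""
    dist = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)       # ignore self-distances
    return float(dist.min(axis=1).mean())
```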
At the end of this process, each prototype is associated with a density and a variability value, and each pair of prototypes is associated with a neighborhood value. The substantial information about the structure of the data is captured by these values. It is then no longer necessary to keep the data in memory.
2.3 Estimation of the density function
The objective of this step is to estimate the density function, which associates a density value with each point of the input space. We already have an estimate of the value of this function at the position of each prototype (i.e. $D_i$). We must infer from this an approximation of the function.
Our hypothesis here is that this function may be properly approximated in the form of a mixture of Gaussian kernels. Each kernel K is a Gaussian function centered on a prototype. The density function can therefore be written as:
$$f(x) = \sum_{i=1}^{M} \alpha_i K_i(x) \quad \text{with} \quad K_i(x) = \frac{1}{N \sqrt{2\pi}\, h_i}\, e^{-\frac{d(w_i, x)^2}{2 h_i^2}}$$
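As an illustration, the mixture can be evaluated at any point x once the prototypes, the weights $\alpha_i$ and the bandwidths $h_i$ are available. The following is a minimal sketch; the function name `mixture_density` and the array layout are assumptions made for the example.

```python
import numpy as np

def mixture_density(x, W, alpha, h, N):
    """Evaluate f(x) = sum_i alpha_i * K_i(x).

    x: (n,) point, W: (M, n) prototypes, alpha: (M,) weights,
    h: (M,) per-kernel bandwidths h_i, N: number of data.
    """
    d = np.linalg.norm(W - x, axis=1)    # d(w_i, x) for every prototype
    K = np.exp(-d**2 / (2 * h**2)) / (N * np.sqrt(2 * np.pi) * h)
    return float(np.dot(alpha, K))
```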
The most popular method to fit mixture models (i.e. to find $h_i$ and $\alpha_i$) is the expectation-maximization (EM) algorithm [10]. However, this algorithm needs to work in the data input space. Since we work here on the enriched SOM instead of the dataset, we cannot use the EM algorithm.
Thus, we propose the following heuristic to choose $h_i$:
$$h_i = \frac{\sum_j \frac{v_{i,j}}{N_i + N_j}\,\left(s_i N_i + d_{i,j} N_j\right)}{\sum_j v_{i,j}}$$
where $d_{i,j}$ is the Euclidean distance between $w_i$ and $w_j$. The idea is that $h_i$ is the standard deviation of the data represented by $K_i$. These data are also represented by $w_i$ and its neighbors. Thus $h_i$ depends on the variability $s_i$ computed for $w_i$ and on the distance $d_{i,j}$ between $w_i$ and its neighbors, weighted by the number of data represented by each prototype and by the connectivity value between $w_i$ and its neighborhood.
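The heuristic translates directly into code. The sketch below assumes that the structural values are stored as NumPy arrays (`s`, `v`, `counts` for $s_i$, $v_{i,j}$, $N_i$); the function name is hypothetical, and returning NaN for a prototype with no connected neighbor is a choice made for this example, since the text does not cover that case.

```python
import numpy as np

def local_bandwidths(W, s, v, counts):
    """Compute h_i for every prototype from the structural values.

    W: (M, n) prototypes, s: (M,) variabilities s_i,
    v: (M, M) neighborhood values v_{i,j}, counts: (M,) sizes N_i.
    """
    M = W.shape[0]
    dist = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)  # d_{i,j}
    h = np.full(M, np.nan)   # assumption: NaN when sum_j v_{i,j} = 0
    for i in range(M):
        num, den = 0.0, 0.0
        for j in range(M):
            total = counts[i] + counts[j]
            if v[i, j] > 0 and total > 0:
                num += v[i, j] / total * (s[i] * counts[i] + dist[i, j] * counts[j])
                den += v[i, j]
        if den > 0:
            h[i] = num / den
    return h
```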
Now, since the density D at each prototype w is known ($f(w_i) = D_i$), we can use a gradient descent method to determine the weights $\alpha_i$. The $\alpha_i$ are initialized with the values of $D_i$, then these values are reduced gradually to better fit $D = \sum_{i=1}^{M} \alpha_i K_i(w)$. To do this, we optimize the following criterion:
$$\alpha = \arg\min_{\alpha} \frac{1}{M} \sum_{i=1}^{M} \left[ \sum_{j=1}^{M} \left( \alpha_j K_j(w_i) \right) - D_i \right]^2$$
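One possible way to minimize this criterion is plain batch gradient descent, sketched below. The text does not specify the step size or the stopping rule, so the choices here are assumptions: a fixed number of iterations and a step equal to the inverse of the Lipschitz constant of the gradient. All function names are hypothetical.

```python
import numpy as np

def kernel_matrix(W, h, N):
    """K[i, j] = K_j(w_i): kernel centered on w_j, evaluated at w_i."""
    dist = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
    return np.exp(-dist**2 / (2 * h[None, :]**2)) / (N * np.sqrt(2 * np.pi) * h[None, :])

def criterion(alpha, K, D):
    """Mean squared gap between the mixture at the prototypes and D."""
    return float(np.mean((K @ alpha - D) ** 2))

def fit_weights(K, D, n_iter=500):
    """Gradient descent on the criterion, starting from alpha = D."""
    M = K.shape[0]
    # Assumed step size: 1 / L, with L = (2 / M) * sigma_max(K)^2.
    step = M / (2 * np.linalg.norm(K, 2) ** 2)
    alpha = np.asarray(D, dtype=float).copy()
    for _ in range(n_iter):
        residual = K @ alpha - D
        alpha -= step * (2 / M) * (K.T @ residual)   # gradient of the criterion
    return alpha
```

A typical call would be `alpha = fit_weights(kernel_matrix(W, h, N), D)`. Since the criterion is quadratic in $\alpha$, a least-squares solver would be an alternative; gradient descent is kept here to stay close to the description above.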
Thus, we now have a density function that is a model of the dataset represented by the enriched SOM.
2.4 Algorithm complexity
The complexity of the algorithm scales as O(T × M), with T the number of steps and M the number of prototypes in the SOM. It is recommended to set at least T > 10 × M for good convergence of the SOM [5]. In this study we use T = max(N, 50 × M) as in [8]. This means that if N > 50 × M (large database), the complexity of the algorithm is O(N × M), i.e. linear in N for a fixed SOM size. The whole process is therefore very fast and well suited to large databases. Very large databases can also be handled by fixing T < N (which is similar to working on a random subsample of the database).
This is much faster than traditional density estimation algorithms such as the kernel estimator [7] (which also needs to keep all data in memory) or the Gaussian Mixture Model [11] estimated with the EM algorithm (whose convergence can become extraordinarily slow [12,13]).
2.5 The dissimilarity measure
We can now define a measure of dissimilarity between two datasets A and B, represented by two SOMs: $SOM_A = \left[ \{w_i^A\}_{i=1}^{M^A}, f^A \right]$ and $SOM_B = \left[ \{w_i^B\}_{i=1}^{M^B}, f^B \right]$, with $M^A$ and $M^B$ the number of prototypes in models A and B, and $f^A$ and $f^B$ the density functions of A and B computed in §2.3.
The dissimilarity between A and B is given by:
CBd(A, B) = P
MAi=1
f
Aw
iAlog
fA(
wAi)
fB
(
wAi)
M
A+
P
MBj=1
f
Bw
Bjlog
fB(
wBj)
fA